Recombinant Proteins & Nucleic Acids – Subtopic Landscape
A synthetic biology perspective
The subset of SynBio patents related to recombinant proteins & nucleic acids was further investigated to identify subtopics and assess trending areas.
The topic model leverages a hybrid approach based on the optimised extractive summary for each publication, combining topic discovery via fine-tuned transformer-based deep learning with ground truth cross-referencing via keywords and classification codes. The process enables a patent to belong to more than one topic, which supports accurate multi-classification trends and accounts for multiple invention embodiments. To avoid duplication here, please see the topic model page for further details of the methodology.
Subtopic landscape
The synthetic biology – recombinant proteins & nucleic acids topic model is visualised in figure 13.9. The visual is based on the dimensionality reduction of vector embeddings, which maps each patent to a contextually relevant x & y coordinate, and the categorical clusters are colour coded to support review. For simplicity, each patent in the visual is assigned to one key subtopic. The trend analysis, however, allows a patent to belong to more than one subtopic, consistent with the topic model methodology throughout this project.
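The difference between the single key subtopic used for the visual and the multiple subtopics used for trend analysis can be sketched as follows. This is only an illustration: the report does not state how the key subtopic is chosen, so the per-document scores, the 0.5 threshold and both function names are hypothetical.

```python
# Hypothetical per-document subtopic scores; these values are illustrative only
# and are not taken from the report.
scores = {
    "doc_a": {"cancer related": 0.81, "antibody uses/therapeutics": 0.64, "vaccines": 0.10},
    "doc_b": {"CRISPR": 0.92, "engineered cells": 0.55},
}

THRESHOLD = 0.5  # assumed cut-off for multi-label membership


def key_subtopic(topic_scores):
    """Single label for the 2D visual: the highest-scoring subtopic."""
    return max(topic_scores, key=topic_scores.get)


def all_subtopics(topic_scores, threshold=THRESHOLD):
    """Multi-label view for trend analysis: every subtopic at or above the threshold."""
    return sorted(t for t, s in topic_scores.items() if s >= threshold)


for doc_id, topic_scores in scores.items():
    print(doc_id, key_subtopic(topic_scores), all_subtopics(topic_scores))
```

Under these assumptions, a document contributes one point to the map but may contribute to several subtopic trend lines.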
Subtopic model – technology cluster totals
The hybrid topic model methodology identified 20 diverse topics which are ranked based on the total number of published applications in figure 13.10. A patent application can be counted more than once as it can belong to multiple topics.
In figure 13.10, the analysis enables multilabel classification for each patent application, to account for multiple invention embodiments. During the 20 year publication period 2004-2023, cancer related is the largest subtopic identified with 38.6% of documents classified in this area. The next major distribution is the antibody uses/therapeutics subtopic with 35.4% of documents classified here. Other notable distributions include:
- Fusion proteins (22%) which may be expressed in a different organism than their origin or created artificially by recombinant DNA technology.
- Genetically modified microorganisms, such as bacteria or yeast, engineered to express a gene of interest and produce recombinant proteins. Escherichia coli and Saccharomyces cerevisiae are leading examples.
- Research indicates that the protein-based therapeutics market, including antibodies and protein hormones, is reported to reach nearly $400 billion by 2025. Beyond GMO-based production, 67% of cell-based products approved by the FDA are produced in mammalian cells. The engineered cells topic accounted for 7.5% of the documents classified.
- Transgenic plants accounted for 11.5% of the documents classified; recombinant proteins and related products can be produced in transgenic plants using recombinant DNA technology.
- There exist diverse applications for recombinant proteins, including gene editing technology such as CRISPR (6.7%), vaccines (9.8%), gene therapy and silencing related topics such as interfering nucleic acids (19.4%), enzyme related (17.8%) and niche topics such as biofuels (2.9%) and alternative proteins (1.2%).
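Because a document can carry several subtopic labels, the percentages above are shares of documents and need not sum to 100%. A minimal sketch of that counting, using made-up labels rather than the report's data (the function name `subtopic_shares` is hypothetical):

```python
from collections import Counter


def subtopic_shares(doc_labels):
    """Return {subtopic: percentage of documents carrying that label}.

    A document is counted once in every subtopic it belongs to,
    so the shares can add up to more than 100%.
    """
    n_docs = len(doc_labels)
    if n_docs == 0:
        return {}
    counts = Counter(
        label for labels in doc_labels.values() for label in set(labels)
    )
    return {label: 100.0 * c / n_docs for label, c in counts.most_common()}


# Illustrative toy labels only.
toy = {
    "doc_a": ["cancer related", "antibody uses/therapeutics"],
    "doc_b": ["cancer related", "fusion proteins"],
    "doc_c": ["CRISPR"],
    "doc_d": ["antibody uses/therapeutics"],
}
shares = subtopic_shares(toy)
print(shares)
print("sum of shares:", sum(shares.values()))
```

In the toy data, four documents produce shares that total 150%, which mirrors why the subtopic percentages in figure 13.10 overlap.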
Subtopic publication trends
The recombinant proteins & nucleic acids subtopic publication year trends are shown in figure 13.11. Publication trends discussed below are based on EP A1/A2 applications; identified patents can belong to more than one subtopic due to multiple invention embodiments.
In figure 13.11, the CRISPR subtopic is the fastest growing area, with a 64.2% compound annual growth rate (CAGR). Above the 20% threshold, immunotherapy peptides & receptors (T-cell) (21.8%) and adeno-associated virus & adenovirus (21.5%) are both growing rapidly. Recombinant AAV (rAAV) vectors are used in gene therapy to deliver therapeutic DNA without integrating into the host's genome, and can be engineered for enhanced specificity. Ongoing clinical trials are exploring the use of rAAVs for ocular, neurological, metabolic, haematological, neuromuscular and cardiovascular diseases and cancers. Recombinant proteins provide precise products for stimulating the immune system and are particularly important for targeting cancer and other diseases.
The topics related to the production of recombinant proteins and nucleic acids are also growing rapidly, including engineered cells, e.g. stem cells (16.4%), and genetically modified microorganisms (13.6%). Above the 10% threshold there is a therapeutic focus, with drug delivery & targeting, e.g. nanoparticles (11.2%), vaccine related (10.7%) and fusion proteins (10.4%).
The larger subtopics, antibody uses/therapeutics (8.8%) and cancer related (8.6%), continue to show solid growth despite their overall size. Both topics had more than 800 publications during 2023. The next largest topic in 2023 is genetically modified microorganisms (657 publications).
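For reference, the standard CAGR formula is (last / first)^(1 / periods) - 1. The sketch below applies it to illustrative counts; the report does not state which start and end years or counts were used for each subtopic, so the inputs here are assumptions and not the report's data.

```python
def cagr(first_value, last_value, periods):
    """Compound annual growth rate as a fraction (0.10 means 10% per year).

    `periods` is the number of yearly intervals between the two values,
    e.g. 19 for counts in 2004 and 2023.
    """
    if first_value <= 0 or periods <= 0:
        raise ValueError("first_value and periods must be positive")
    return (last_value / first_value) ** (1.0 / periods) - 1.0


# Illustrative numbers only: a count that doubles over five yearly intervals.
rate = cagr(100, 200, 5)
print(f"{rate:.1%}")
```

Note that CAGR depends only on the two endpoint values, so a small subtopic starting from a very low count can show a high rate while still having few publications in absolute terms.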
Subtopic top 20 assignees distributions (2014-23)
The patent portfolios of the top 20 assignees within the SynBio – recombinant proteins & nucleic acids dataset are analysed in figure 13.12. The portfolios are restricted to publications during 2014-23 and mapped to the 20 subtopics identified; the counts represent total EPO publications.
The heatmap in figure 13.12 reveals the distribution of the top 20 recombinant proteins & nucleic acids assignees during 2014-23. Publications can be assigned to more than one subtopic, reflecting multiple invention embodiments. Within transgenic plants, MONSANTO & PIONEER HI BRED are standout applicants. Genetically modified microorganisms are a popular topic amongst the top 20 assignees, led by UNIVERSITY OF CALIFORNIA, MIT, UNIVERSITY OF PENNSYLVANIA and HARVARD. From a UK perspective, GLAXOSMITHKLINE is also active in this area. Recombinant proteins and nucleic acids for immunotherapies are led by IMMATICS BIOTECHNOLOGIES, which is also focused on cancer therapeutics. REGENERON has a notable distribution within the antibody uses/therapeutics and engineered cells subtopics. The American universities and research institutes have diverse interests across the subtopics identified, and there is prolific R&D occurring within the CRISPR and cancer related subtopics, amongst others. The niche topic distributions are led by NOVOZYMES within alternative proteins and biofuels and MONSANTO within alternative proteins.
The analysis does not account for publications prior to 2014, which may have contributed to companies developing market share, nor for potential licensing and acquisitions (subsidiaries). Data cleaning was carried out to clean and consolidate assignee names. The analysis is an informative guide, as some subtopics have strict content boundaries to enable differentiation, whilst others are broader to capture more generic areas.
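The name consolidation and the assignee-by-subtopic counts behind a heatmap of this kind could look roughly like the sketch below. The cleaning rules, the suffix list and the company names are hypothetical; the report does not describe the actual cleaning procedure.

```python
import re
from collections import defaultdict

# Assumed list of legal-form suffixes to strip; a real project would need a fuller list.
SUFFIXES = re.compile(r"\b(INC|LTD|LLC|GMBH|CORP|CO|PLC|AG|SA)\b\.?", re.IGNORECASE)


def normalise_assignee(name):
    """Upper-case the name, drop legal suffixes and punctuation, collapse spaces."""
    name = SUFFIXES.sub(" ", name.upper())
    name = re.sub(r"[^A-Z0-9& ]", " ", name)
    return " ".join(name.split())


def heatmap_counts(records):
    """records: iterable of (assignee_name, [subtopics]) per publication.

    Returns {assignee: {subtopic: publication count}}; a publication is
    counted once in each subtopic it belongs to.
    """
    table = defaultdict(lambda: defaultdict(int))
    for assignee, subtopics in records:
        key = normalise_assignee(assignee)
        for subtopic in set(subtopics):
            table[key][subtopic] += 1
    return {a: dict(row) for a, row in table.items()}


# Made-up records for illustration only.
records = [
    ("Acme Bio, Inc.", ["antibody uses/therapeutics", "engineered cells"]),
    ("ACME BIO INC", ["antibody uses/therapeutics"]),
    ("Example Enzymes Ltd", ["biofuels"]),
]
counts = heatmap_counts(records)
print(counts)
```

Here the two spellings of the first made-up assignee collapse into one row, which is the effect the consolidation step is meant to achieve.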
Patent family territory analysis
The INPADOC patent families comprising the identified recombinant proteins & nucleic acids related EPO patents were analysed to identify the top 30 territories where patents are filed. Analysing the publication countries alone is insufficient, as major countries such as France, the UK and Germany may not publish patents going through the European (EPO) route, especially when pending. To supplement the available data, a bespoke analysis was conducted, standardising the publication countries and adding ‘protected countries’ to capture patent rights which are pending or granted based on legal status. There are caveats which include:
- The study methodology is focused on EPO patents and may not capture assignees/applicants that file only in home territories or don’t file in Europe via EPO filings.
- The protected country data may not be fully up to date, due to INPADOC data availability and where EPO patents are recent filings.
The standardisation procedure ensures a territory is only counted once per family. The territory analysis is visualised in figure 13.13; EPO and WO (PCT) patents have been included for reference purposes. Despite the caveats, the analysis provides useful indicators of the territories where applicants are filing patents within the recombinant proteins & nucleic acids field, based on 2014-23 publications for a relatively recent perspective.
In figure 13.13, 90% of the patent families identified had at least one US national filing. Other key territories include China (69.4%), Japan (68.3%), Canada (64.7%) and Australia (55%). Below the 50% threshold, key territories include Republic of Korea (42.4%), India (36.5%) and Brazil (36.3%).
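Counting each territory once per family, after merging publication countries with protected countries, can be sketched as below. The data layout (two lists of country codes per family), the function names and the family identifiers are assumptions for illustration, not the study's actual pipeline.

```python
from collections import Counter


def family_territories(published, protected):
    """Union of standardised country codes, so each territory appears once per family."""
    return {code.strip().upper() for code in list(published) + list(protected)}


def territory_shares(families):
    """families: {family_id: (published_codes, protected_codes)}.

    Returns {territory: percentage of families with at least one filing there}.
    """
    n_families = len(families)
    if n_families == 0:
        return {}
    counts = Counter()
    for published, protected in families.values():
        counts.update(family_territories(published, protected))
    return {code: 100.0 * k / n_families for code, k in counts.most_common()}


# Made-up families for illustration only.
toy_families = {
    "fam_1": (["US", "EP", "us"], ["GB", "DE"]),
    "fam_2": (["US", "CN"], []),
    "fam_3": (["EP"], ["GB"]),
}
territory_pct = territory_shares(toy_families)
print(territory_pct)
```

In the toy data, the duplicate "us" entry in the first family does not inflate the US count, and GB appears even though it is only present as a protected country.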
Subtopic keyword trends
Investigating keyword trends provides a different perspective beyond the recombinant proteins & nucleic acids subtopic model. The smart summaries used during the topic model stage were data mined for the most contextually important keywords by leveraging transformer-based embeddings. Keywords and phrases most similar to each document were identified and then manually audited for relevance to the SynBio project; the results are visualised in figure 13.14. The visualisation indicates how the cumulative publication counts have changed between the publication periods 2014-18 and 2019-23. The methodology aims to identify contextually relevant and reliable keywords as a source of ground truth, to signify important keywords within the corpus, and to audit the topic model subtrend analysis already carried out.
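The idea of ranking candidate phrases by their similarity to the document embedding can be sketched with cosine similarity. The three-dimensional vectors below are toy stand-ins for transformer embeddings, and the function names are hypothetical; the report does not name the embedding model or the exact similarity measure used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_keywords(doc_vec, candidates, top_n=2):
    """Return the top_n candidate phrases most similar to the document vector."""
    ranked = sorted(
        candidates.items(), key=lambda kv: cosine(doc_vec, kv[1]), reverse=True
    )
    return [phrase for phrase, _ in ranked[:top_n]]


# Toy vectors standing in for real embeddings (assumption, illustrative only).
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "gene therapy": [0.8, 0.2, 0.3],
    "capsid protein": [0.7, 0.0, 0.5],
    "method": [0.1, 0.9, 0.1],
}
top_keywords = rank_keywords(doc_vec, candidates)
print(top_keywords)
```

A generic term such as "method" points in a different direction from the document vector and falls to the bottom of the ranking, which is the behaviour the manual audit would then confirm or correct.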
In figure 13.14, the following key findings are observed and also support the trending areas identified by the subtopic modelling:
- Cancer is an important therapeutic area, increasing from 764 publications during 2014-18 to 1396 publications during 2019-23. Antibody, an important therapeutic class, grew from 667 to 884 publications over the same periods.
- Pharmaceutical grew from 718 publications during 2014-18 to 1243 publications during 2019-23, and therapeutic increased to 1156 publications during 2019-23, indicating the medicinal focus of the patenting within this topic.
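The period-over-period change implied by the counts quoted above can be worked out directly. The sketch below uses only the figures given in the bullets, reading 718 as the 2014-18 count for "pharmaceutical"; the function name is hypothetical.

```python
# (2014-18 count, 2019-23 count) as quoted in the findings above.
period_counts = {
    "cancer": (764, 1396),
    "antibody": (667, 884),
    "pharmaceutical": (718, 1243),
}


def percent_change(earlier, later):
    """Percentage change from the earlier period to the later one."""
    return 100.0 * (later - earlier) / earlier


changes = {k: percent_change(a, b) for k, (a, b) in period_counts.items()}
for keyword, pct in changes.items():
    print(f"{keyword}: {pct:+.1f}%")
```

This gives roughly +83% for cancer, +33% for antibody and +73% for pharmaceutical between the two five-year periods.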
Subtopic keyword analysis
For a further perspective on contextually important keywords, a statistical procedure was applied to six subtopics selected from the corpus. The analysis contrasts how the frequency of keywords and phrases differs across the subtopics using a weighted log odds ratio. To identify which differences are meaningful, the log odds ratio is weighted by a prior, as outlined in Monroe, Colaresi, and Quinn (2008). The prior is estimated from the data itself, rather than using an uninformative prior (such as a uniform Dirichlet prior), which makes the procedure an empirical Bayes approach. The results are shown in figure 13.15. A further motivation is to audit the subtopics for relevance and transparency, and to provide insight into their content. As a side note, the transformer-based keyword analysis provides a powerful way to review subtopics and extends the analysis beyond standard corpus evaluation procedures such as TF-IDF (term frequency-inverse document frequency).
In figure 13.15, the keywords shown are those most characteristic of each subtopic, and each is labelled with its weighted log odds score. A higher score indicates that the keyword is more likely to be used within that specific subtopic than in the others. Some of the scores are not very high, which is unsurprising given the overlap between the multiple subtopics within this topic landscape.
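A minimal sketch of a weighted log odds calculation in the style of Monroe, Colaresi, and Quinn (2008) is given below. It assumes the prior for each term is simply that term's total count across all subtopics, which is one common empirical Bayes choice; the report does not give its exact prior, so treat this as an illustration rather than the method used. The term counts are made up.

```python
import math
from collections import Counter


def weighted_log_odds(counts_by_group):
    """counts_by_group: {group: Counter(term -> count)}.

    Returns {group: {term: z}} where z is the log odds ratio of the term in
    the group versus all other groups, smoothed by a Dirichlet prior and
    divided by its approximate standard deviation.
    Assumption: prior alpha_w = total count of term w across all groups.
    """
    prior = Counter()
    for counts in counts_by_group.values():
        prior.update(counts)
    alpha0 = sum(prior.values())  # total prior mass

    result = {}
    for group, counts in counts_by_group.items():
        n_i = sum(counts.values())  # terms in this group
        n_j = alpha0 - n_i          # terms in all other groups
        scores = {}
        for term, alpha_w in prior.items():
            y_i = counts.get(term, 0)
            y_j = alpha_w - y_i
            odds_i = (y_i + alpha_w) / (n_i + alpha0 - y_i - alpha_w)
            odds_j = (y_j + alpha_w) / (n_j + alpha0 - y_j - alpha_w)
            delta = math.log(odds_i) - math.log(odds_j)
            variance = 1.0 / (y_i + alpha_w) + 1.0 / (y_j + alpha_w)
            scores[term] = delta / math.sqrt(variance)
        result[group] = scores
    return result


# Made-up term counts for two subtopics (illustrative only).
toy_counts = {
    "biofuel related": Counter({"yeast": 8, "ethanol": 6, "sequence": 4}),
    "AAV related": Counter({"capsid": 7, "vector": 5, "sequence": 4}),
}
z = weighted_log_odds(toy_counts)
print(sorted(z["biofuel related"].items(), key=lambda kv: -kv[1]))
```

In the toy data, "ethanol" scores well above zero for the biofuel group, while the shared term "sequence" scores close to zero in both groups, which illustrates why overlapping subtopics tend to produce modest scores.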
Some key findings observed are:
- Genetically modified microorganisms – adeno-associated virus (AAV) is an important organism, alongside the engineering of capsid proteins and gene therapy applications.
- Alternative proteins – produced recombinantly with engineered sequences and also extracted from plants.
- Biofuel related – highlighting the use of recombinant yeast engineered to encode specific sequences to facilitate ethanol production.
It is difficult to distil and characterise the coverage of the subtopics via restricted keywords and phrases. This is further complicated by the weighting, which is not always frequency led but instead reflects the terminology and context that are more characteristic of one subtopic relative to the others. It is fair to conclude that the subtopic model has successfully captured an extensive set of distinct subtrends; overlap exists, but the trends are accurate once audited. The keywords are relevant to real world applications and suggest the insights identified are a useful tool for examining the specific topic landscape.