Genetically Modified Microorganisms – Subtopic Landscape

A synthetic biology perspective

The subset of SynBio patents relating to genetically modified microorganisms (GMOs) was further investigated to identify subtopics and assess trending areas. The topic model uses a hybrid approach based on the optimised extractive summary for each publication, combining topic discovery via fine-tuned transformer-based deep learning with ground truth cross-referencing via keywords and classification codes. The process allows a patent to belong to more than one topic, which accounts for multiple invention embodiments and gives more accurate multi-label classification trends. To avoid duplication here, please see the topic model page for further details of the methodology.

Subtopic landscape

The synthetic biology – GMOs topic model is visualised in figure 7.9. The visual is based on dimensionality reduction of vector embeddings, which maps each patent to a contextually relevant x and y coordinate; the categorical clusters are colour coded to support review. For simplicity, each patent in the visual is assigned to one key subtopic. The trend analysis, however, allows a patent to belong to more than one subtopic, consistent with the topic model methodology throughout this project.

Subtopic model – technology cluster totals

The hybrid topic model methodology identified 20 diverse topics which are ranked based on the total number of published applications in figure 7.10. A patent application can be counted more than once as it can belong to multiple topics.
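
The counting rule can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual pipeline: the publication numbers, topic labels and the `topic_shares` helper are all hypothetical. It shows why the topic percentages can sum to more than 100% when a document carries several labels.

```python
from collections import Counter


def topic_shares(assignments):
    """Return {topic: (document count, % of all documents)} for multi-label data.

    `assignments` maps a publication number to the list of topics assigned to it.
    """
    counts = Counter()
    for topics in assignments.values():
        # A document is counted once for every distinct topic it belongs to.
        counts.update(set(topics))
    n_docs = len(assignments)
    return {
        topic: (count, round(100 * count / n_docs, 1))
        for topic, count in counts.most_common()
    }


# Hypothetical toy data: four applications, some with more than one topic.
assignments = {
    "EP0000001": ["bacteria", "E. coli"],
    "EP0000002": ["virus", "AAV"],
    "EP0000003": ["bacteria", "yeast"],
    "EP0000004": ["bacteria"],
}

shares = topic_shares(assignments)
for topic, (count, percent) in shares.items():
    print(f"{topic}: {count} documents ({percent}%)")
```

Because the denominator is the number of documents rather than the number of labels, each percentage reads as "share of the dataset classified in this topic", which is how the figures in this section are expressed.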

In figure 7.10, the analysis uses multi-label classification for each patent application to account for multiple invention embodiments. During the 20-year publication period 2004-2023, nearly 50% of the GMOs dataset was classified within the bacteria related topic (48.2%), and approx. 40.6% of documents were classified in the virus related subtopic. Yeast are also important microorganisms within the field, with 21.1% of documents classified within the yeast subtopic and approx. 17.1% in the E. coli subtopic. Food & probiotics (11%), supported by lactic acid type bacteria (9%), and biofuels (6.8%) are important areas for GMOs research and development. E. coli is identified as a key microorganism in synthetic biology because of its efficient growth and well-established genetic engineering, whilst adeno-associated virus (AAV, 12.4%) is a leading platform for gene therapy due to its effectiveness in therapeutic gene delivery.

Bacillus type bacteria (6.8%) have diverse applications: they are recognised biocontrol agents in agriculture and also act as versatile cell factories. Bacillus thuringiensis naturally produces insecticidal toxins and is a prominent biocide used in pest control. Bacillus subtilis has a highly efficient protein secretion system and an adaptable metabolism, so it is unsurprising that it has diverse applications as a cell factory producing chemicals, enzymes and antimicrobials. The Corynebacterium species, and in particular Corynebacterium glutamicum, have great potential for producing high value chemicals and are an emerging host for expressing heterologous proteins. Algal biofuels could become one of the most energy and carbon efficient bioproduction methods.

The GMOs subtopic publication year trends are shown in figure 7.11. The publication trends discussed below are based on EP A1/A2 applications. Identified patents can belong to more than one subtopic due to multiple invention embodiments.

In figure 7.11, the fastest growing subtopics based on compound annual growth rate (CAGR) during 2014-23 are Adeno-Associated Virus & Adenovirus (24.8%), virus related (22.6%), Bacteriophage (17.7%), Escherichia coli (14.7%), Bacillus Subtilis (10.7%), GMOs for genetic engineering in plants (10.5%) and Saccharomyces Cerevisiae (10%). Approaching the 10% threshold are Corynebacterium (9.9%), Food & probiotics (9.8%) and Bacillus (9.1%). The global market for AAV therapies reached $1.9 billion in 2022 and is projected to grow to $11.1 billion by 2035, and AAV is also the fastest growing subtopic during 2014-23. As one example, the UK based Purespring Therapeutics has developed technology that delivers gene therapies specifically to podocytes via AAV, creating an opportunity for novel treatments for a range of genetic and non-genetic glomerular kidney diseases.
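
For readers less familiar with CAGR, the standard formula is (end value / start value) ^ (1 / number of periods) - 1. The sketch below applies it to the AAV market figures quoted above and to a pair of made-up publication counts. It assumes the number of periods is the difference between the end and start years; the report does not state which convention was used for the 2014-23 subtopic figures, so the `cagr` helper is illustrative only.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1


# Market figures quoted in the text: $1.9 billion (2022) to $11.1 billion (2035).
# Assumption: periods = end year - start year.
market_growth = cagr(1.9, 11.1, 2035 - 2022)
print(f"Implied AAV market CAGR: {market_growth:.1%}")

# Hypothetical publication counts, for illustration only.
example_growth = cagr(100, 200, 2023 - 2014)
print(f"Example subtopic CAGR: {example_growth:.1%}")
```

Under that assumption, the quoted market projection corresponds to growth of roughly 14.5% per year.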

The fast growth of the bacteriophage subtopic is notable, as phage-based technologies can be used in a variety of ways. UK based NexaBiome is developing phage-based next generation antibiotics and Aparon is developing phage-based antimicrobials for humans and animals. E. coli and B. subtilis continue to represent key GMOs within the synthetic biology field, and patenting activity continues to build on the large portfolio of R&D that exists. The accession no. proxy subtopic reveals steady growth in the number of patents protecting genetically modified microorganisms (4.1% CAGR).

Subtopic top 20 assignees distributions (2014-23)

The patent portfolios of the top 20 assignees within the SynBio – GMOs dataset are analysed in figure 7.12. The portfolios are restricted to publications during 2014-23 and mapped to the 20 subtopics identified; the counts represent total EPO publications.

The heatmap in figure 7.12 shows how the portfolios of the top 20 GMOs related assignees during 2014-23 are distributed across subtopics. Publications can be assigned to more than one subtopic, reflecting multiple invention embodiments. CJ CHEILJEDANG is the leading assignee for Corynebacterium species, and its identified portfolio is heavily weighted toward bacterial species. For yeast, DSM IP ASSETS, NOVOZYMES and DANISCO are standout assignees. CHR HANSEN has expertise in lactic acid bacteria and is focused on food and probiotics via bacterial isolates. CJ CHEILJEDANG and GENOMATICA have the largest E. coli distributions. GENOMATICA also has a reasonable distribution in the biofuels subtopic, along with DSM IP ASSETS, NOVOZYMES and DANISCO.

The analysis does not account for publications prior to 2014, which may have contributed to companies developing market share, nor for potential licensing and acquisitions (subsidiaries). Data cleaning was carried out to clean and consolidate assignee names. The analysis should be read as an informative guide, as some subtopics have strict content boundaries to enable differentiation, whilst others are broader to capture more generic areas.

Patent family territory analysis

The INPADOC patent families comprising the identified GMOs related EPO patents were analysed to identify the top 30 territories where patents are filed. Analysing the publication countries alone is insufficient, as major countries such as France, the UK and Germany may not publish patents going through the European (EPO) route, especially when pending. To supplement the available data, a bespoke analysis was conducted that standardises the publication countries and adds ‘protected countries’, covering patent rights which are pending or granted based on legal status. There are caveats which include:

  • The study methodology is focused on EPO patents and may not capture assignees/applicants that file only in their home territories or do not file in Europe via the EPO.
  • The protected country data may not be fully up to date because of INPADOC data availability, particularly where EPO patents are recent filings.

The standardisation procedure ensures a territory is only counted once per family. The territory analysis is visualised in figure 7.13; EPO and WO (PCT) patents have been included for reference purposes. Despite the caveats, the analysis provides useful indicators of the territories where applicants are filing patents within the GMOs field, based on 2014-23 publications for a relatively recent perspective.
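
The "once per family" rule can be illustrated with a short Python sketch. The family identifiers, the record layout and the `territory_coverage` helper are hypothetical and stand in for the bespoke standardisation described above; the only standardisation shown here is trimming and upper-casing country codes.

```python
from collections import defaultdict


def territory_coverage(records):
    """Return {territory: % of families with at least one filing there}.

    `records` is an iterable of (family_id, territory_code) pairs. A family
    may list the same territory many times, but it is only counted once.
    """
    families_by_territory = defaultdict(set)
    all_families = set()
    for family_id, territory in records:
        code = territory.strip().upper()  # minimal standardisation of the code
        families_by_territory[code].add(family_id)  # a set ignores repeats
        all_families.add(family_id)
    total = len(all_families)
    ranked = sorted(families_by_territory.items(), key=lambda kv: -len(kv[1]))
    return {code: round(100 * len(fams) / total, 1) for code, fams in ranked}


# Hypothetical toy data: family F1 lists the US twice in different spellings.
records = [
    ("F1", "US"), ("F1", "us"), ("F1", "CN"),
    ("F2", "US"),
    ("F3", "JP"), ("F3", "US "),
    ("F4", "CN"),
]

coverage = territory_coverage(records)
print(coverage)
```

Storing family identifiers in a set per territory is what prevents a family with several national publications in the same country from inflating that country's share.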

In figure 7.13, 89% of the patent families identified had at least one US national filing. Other key territories with at least one national filing include China (67.4%), Japan (62%) and Canada (53.7%). Below the 50% threshold, key territories include Australia (44.7%), Republic of Korea (40.2%), Brazil (34.1%) and India (31%).

Investigating keyword trends provides a different perspective beyond the genetically modified microorganisms subtopic model. The smart summaries used during the topic model stage were data mined for the most contextually important keywords using transformer-based embeddings. Keywords and phrases most similar to each document were identified and then manually audited for relevance to the SynBio project; the results are visualised in figure 7.14. The visualisation indicates how the cumulative publication counts have changed between the publication periods 2014-18 and 2019-23. The methodology aims to identify contextually relevant and reliable keywords as a source of ground truth, to highlight important keywords within the corpus and to audit the topic model subtrend analysis already carried out.

In figure 7.14, the following key findings are observed and also support the trending areas identified by the subtopic modelling:

  • Keywords related to genetic engineering and to encoding nucleic acids and proteins have undergone rapid growth in 2019-23 when compared with 2014-18. Examples include encoding (1102 to 2760 publications), vector (932 to 2465 publications) and expression (855 to 2033 publications).
  • AAV (adeno-associated virus) is an important microorganism (179 to 765 publications) and is related to the growth of the gene therapy keyword (134 to 489 publications), where AAVs are important vectors to deliver therapeutic genes into cells.
  • The virus keyword (408 to 1322 publications) is larger than bacteria (354 to 527 publications), and yeast grew from 278 publications in 2014-18 to 335 in 2019-23. Corynebacterium, a promising microorganism for industrial production of amino acids, fuels and various value-added chemicals, grew from 47 to 128 publications over the same periods.
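
The period comparison in the bullets above can be reproduced with simple arithmetic. The sketch below uses only the publication counts quoted in this section; the `period_growth` helper and the choice to rank by growth ratio are illustrative assumptions, not part of the original analysis.

```python
# Publication counts quoted above: (2014-18, 2019-23).
keyword_counts = {
    "encoding": (1102, 2760),
    "vector": (932, 2465),
    "expression": (855, 2033),
    "AAV": (179, 765),
    "gene therapy": (134, 489),
    "virus": (408, 1322),
    "bacteria": (354, 527),
    "yeast": (278, 335),
    "Corynebacterium": (47, 128),
}


def period_growth(counts):
    """Return (keyword, growth ratio, % change) rows, fastest growing first."""
    rows = []
    for keyword, (earlier, later) in counts.items():
        ratio = later / earlier
        rows.append((keyword, ratio, 100 * (ratio - 1)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = period_growth(keyword_counts)
for keyword, ratio, pct_change in ranked:
    print(f"{keyword:16s} x{ratio:.2f} ({pct_change:+.0f}%)")
```

Ranking by ratio rather than by absolute count brings out smaller but fast-moving keywords such as AAV and gene therapy, which large-volume keywords such as encoding would otherwise overshadow.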

Subtopic keyword analysis

For a further perspective on contextually important keywords, a statistical procedure was applied to six subtopics selected from the corpus. The analysis contrasts how the usage or frequency of keywords and phrases differs across the subtopics using a weighted log odds ratio. The aim is to identify which differences are meaningful by weighting the log odds ratio by a prior, as outlined in Monroe, Colaresi, and Quinn (2008). The prior is estimated from the data itself rather than being set as an uninformative (uniform) Dirichlet prior, which makes the procedure an empirical Bayes approach. The results are shown in figure 7.15. A further motivation is to audit the subtopics for relevance and transparency and to provide insights into their content. As a side note, the transformer-based keyword analysis provides powerful methods to review subtopics and extends the analysis beyond established procedures for evaluating a corpus, such as TF-IDF (term frequency-inverse document frequency).
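
A minimal sketch of a weighted log odds calculation in the spirit of Monroe, Colaresi, and Quinn (2008) is given below. It is not the implementation used for figure 7.15. It assumes each subtopic is compared with the rest of the corpus, that the prior for each keyword is its total count across all subtopics (the empirical Bayes step), and that the vocabulary contains more than one keyword. The toy counts and the `weighted_log_odds` helper are hypothetical.

```python
import math
from collections import Counter


def weighted_log_odds(counts_by_group):
    """Return {group: {term: z-score}} using a prior estimated from the data.

    `counts_by_group` maps a group (subtopic) to a Counter of term counts.
    """
    prior = Counter()  # alpha_w: total count of each term in the whole corpus
    for counts in counts_by_group.values():
        prior.update(counts)
    alpha_0 = sum(prior.values())  # total prior mass

    scores = {}
    for group, counts in counts_by_group.items():
        n_i = sum(counts.values())  # tokens in this group
        n_j = alpha_0 - n_i         # tokens in the rest of the corpus
        scores[group] = {}
        for term, alpha_w in prior.items():
            y_i = counts.get(term, 0)  # term count in this group
            y_j = alpha_w - y_i        # term count in the rest
            log_odds_i = math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
            log_odds_j = math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w))
            delta = log_odds_i - log_odds_j
            # Approximate variance of the log odds difference.
            variance = 1 / (y_i + alpha_w) + 1 / (y_j + alpha_w)
            scores[group][term] = delta / math.sqrt(variance)
    return scores


# Hypothetical keyword counts for two subtopics.
toy_counts = {
    "AAV": Counter({"capsid": 30, "vector": 40, "ethanol": 1}),
    "yeast": Counter({"ethanol": 35, "vector": 20, "fermentation": 25}),
}

scores = weighted_log_odds(toy_counts)
for group, term_scores in scores.items():
    top_term = max(term_scores, key=term_scores.get)
    print(group, top_term, round(term_scores[top_term], 2))
```

Dividing the log odds difference by its estimated standard deviation is what separates this score from a raw frequency ratio: rare keywords with noisy counts are pulled toward zero, while a shared keyword such as "vector" in the toy data scores lower than a keyword that is specific to one subtopic.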

In figure 7.15, the keywords shown are those most characteristic of each subtopic based on the weighted log odds score, which is labelled on the figure. A higher log odds score also indicates that the keyword is more likely to be used within that specific subtopic than in the others. Some of the log odds scores are not very high, which is not surprising given the overlap between the multiple subtopics identified within the topic landscape.

Some key findings observed are:

  • Adeno-Associated Virus & Adenovirus – characterised by AAV capsids, vectors and gene therapy, which may be used for central nervous system, brain and eye related conditions amongst others.
  • Saccharomyces Cerevisiae – engineered for ethanol production.
  • Bacteriophage – can be engineered with CRISPR and has medical applications.
  • Bacillus Subtilis – used in animal feed and with plants.
  • E. coli – engineered to produce fatty acids and chemical products via fermentative production and precision mutations. Products can be isolated via a secretion system.

It is difficult to distil and characterise the coverage of the subtopics via a restricted set of keywords and phrases. This is further complicated by the weighting, which is not always frequency led but reflects the terminology and context that is more characteristic of one subtopic in relation to the others. It is fair to conclude that the subtopic model has captured an extensive set of distinct subtrends; overlap exists, but the trends are accurate once audited. The keywords are relevant to real world applications and suggest the insights identified are a useful tool to examine the specific topic landscape.