Alternative Proteins – Subtopic Landscape
A synthetic biology perspective
The subset of SynBio – alternative protein (alt protein) related patents was further investigated to identify subtopics and assess trending areas.
The topic model leverages a hybrid approach based on the optimised extractive summary for each publication, using a combination of topic discovery via fine-tuned transformer based deep learning and ground truth cross referencing via keywords and classification codes. The process enables a patent to belong to more than one topic for accurate multi-classification trends, accounting for multiple invention embodiments. To avoid duplication here, please see the topic model page for further details of the topic model methodology.
Subtopic landscape
The synthetic biology – alternative proteins topic model is visualised in figure 9.9. The visual is based on the dimensionality reduction of vector embeddings, which maps each patent to a contextually relevant x and y coordinate; the categorical clusters are colour coded to support review. For simplicity, each patent in the visual is assigned to one key subtopic. The trend analysis, however, allows a patent to belong to more than one subtopic, consistent with the topic model methodology used throughout this project.
Subtopic model – technology cluster totals
The hybrid topic model methodology identified 20 diverse topics which are ranked based on the total number of published applications in figure 9.10. A patent application can be counted more than once as it can belong to multiple topics.
In figure 9.10, the analysis applies multilabel classification to each patent application to account for multiple invention embodiments. During the 20 year publication period 2004-2023, almost 68% of patent applications were classified within the top ranked subtopic (plant origin - 1,629 publications). Roughly 55% were classified in the proteins and extracts from microorganisms subtopic (1,314 publications). Microorganisms can be used to produce sustainable and nutritious alternatives to traditional animal based and plant based proteins, e.g. single cell proteins.
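Because each application can carry several subtopic labels, the subtopic shares are measured against the number of distinct applications and can therefore sum to more than 100%. The following is a minimal sketch of that counting logic; the application identifiers, labels and the function name `subtopic_shares` are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical toy data: each application maps to one or more subtopic labels.
applications = {
    "EP-A": ["plant origin", "meat analogues"],
    "EP-B": ["plant origin", "microorganisms"],
    "EP-C": ["microorganisms"],
    "EP-D": ["plant origin"],
}


def subtopic_shares(apps):
    """Return {subtopic: (count, share of distinct applications)}, ranked by count."""
    # set(labels) guards against the same label appearing twice on one application.
    counts = Counter(label for labels in apps.values() for label in set(labels))
    total = len(apps)
    return {label: (n, n / total) for label, n in counts.most_common()}


shares = subtopic_shares(applications)
for label, (n, share) in shares.items():
    print(f"{label}: {n} applications ({share:.0%})")
```

In this toy example the shares add up to more than 100% because two of the four applications are counted under two subtopics each.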
The broader subtopics reveal prominent sources of alternative proteins such as plants and microorganisms, including microalgae, fungi, yeasts and bacteria, or single cell proteins. Animal feed / food and fodder also has a noticeable distribution, as alternative proteins can be used to replace or supplement conventional protein sources to address sustainability and environmental issues. During 2014-2023, there is an increase in the ranking of the broad subtopic labelled ‘meat analogues / plant protein based’. This is very likely driven by increased market demand, high protein supplementation of food products such as cereals, and the potential sustainability and environmental benefits of these products.
Subtopic publication trends
The alternative protein subtopic publication year trends are shown in figure 9.11. The publication trends discussed below are based on EP A1/A2 applications; identified patents can belong to more than one subtopic due to multiple invention embodiments.
In figure 9.11, there are a number of rapidly growing subtopic areas within the alternative protein landscape. The meat analogues / plant protein based subtopic has grown rapidly since 2020, with a compound annual growth rate (CAGR) of 22.2% during 2014-23. There are similar recent increases for linked subtopics such as proteins and extracts from microorganisms, proteins and extracts from algae, soy based, and the use of fats or oils, which are often used within alternative protein based product compositions. As animal feed / fodder and food has increased (CAGR 7.6%, 2014-23), the subtopics for proteins and extracts isolated from waste processing and conversion, and for biomass conversion and processing, have also increased. Animal feed is also a large area for alternative proteins from microorganisms, etc. There is also a nutritional focus, with increases in the nutritional (e.g. dietetic) products subtopic and the probiotics subtopic, the latter also linked with the growth of the bacterial related subtopic.
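For reference, the CAGR values quoted above follow the standard compound growth formula. The sketch below assumes the rate is calculated from the first and last annual publication counts of the period, with 2014-23 treated as nine year-on-year intervals; the counts used are hypothetical illustration values, not figures from the dataset.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate over `periods` year-on-year intervals."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1


# Hypothetical counts: 100 publications in 2014 and 200 in 2023.
# Assumption: 2014 to 2023 spans 9 year-on-year intervals.
rate = cagr(100, 200, 9)
print(f"CAGR 2014-23: {rate:.1%}")
```

If the study instead counted the ten publication years as ten periods, the same counts would give a slightly lower rate, so the interval convention should be confirmed against the methodology.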
Subtopic top 20 assignees distributions (2014-23)
The patent portfolios of the top 20 alternative protein assignees within the SynBio – alternative protein dataset are analysed in figure 9.12. The portfolios are restricted to publications during 2014-23 and mapped to the alternative protein subtopics identified; the counts represent total EPO publications.
The heatmap in figure 9.12 reveals the subtopic distribution of the top 20 alternative protein assignees during 2014-23; publications can be assigned to more than one subtopic, reflecting multiple invention embodiments. NESTLE is the top ranked assignee, adapting plant material for alternative proteins, as reflected in its distributions for the meat analogue / plant protein and soy based subtopics. The NESTLE portfolio is arguably more focussed on products, including nutritional and plant protein based products. In comparison, CJ CHEILJEDANG and CHR HANSEN have leading distributions within the strictly defined genetically modified microorganisms subtopic, whereas MONSANTO and SUNTORY are leading assignees for transgenic plants within the SynBio - alternative protein field.
The analysis does not account for publications prior to 2014, which may have contributed to companies developing market share, nor for potential licensing and acquisitions (subsidiaries). For example, subsidiaries may be linked to NESTLE and further enhance the portfolio distribution beyond figure 9.12. Data cleaning was carried out to clean and consolidate assignee names. The analysis is an informative guide, as some specific subtopics have strict content boundaries to enable differentiation, whilst others are broader to capture more generic areas.
Patent family territory analysis
The INPADOC patent families comprising the identified alternative protein related EPO patents were analysed to identify the top 30 territories where patents are filed. Analysing the publication countries alone is insufficient, as major countries such as France, the UK and Germany may not publish patents going through the European (EPO) route, especially when pending. To supplement the available data, a bespoke analysis was conducted, standardising the publication countries and adding ‘protected countries’ to include patent rights which are pending or granted based on legal status. There are caveats, which include:
- The study methodology is focused on EPO patents and may not capture assignees/applicants that file only in home territories or do not file in Europe via the EPO.
- The protected country data may not be fully up to date, due to INPADOC data availability and where EPO patents are recent filings.
The standardisation procedure ensures a territory is only counted once per family. The territory analysis is visualised in figure 9.13; EPO and WO (PCT) patents have been included for reference purposes. Despite the caveats, the analysis provides useful indicators of the territories where applicants are filing patents within the alternative proteins field, based on 2014-23 publications for a relatively recent perspective.
In figure 9.13, 84.3% of the patent families identified had at least one US national filing. Other key territories with at least one national filing include China (65.4%), Canada (53.1%) and Japan (52.3%). Below the 50% threshold, key territories include Australia (43.4%), Brazil (43%), Republic of Korea (33.6%) and India (33.2%).
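The once-per-family counting described above can be sketched as follows. This is an illustrative outline only: the family identifiers, territory codes and the function name `territory_coverage` are hypothetical, and the real analysis also draws on legal status data that is not modelled here.

```python
from collections import defaultdict

# Hypothetical records: (family_id, territory) pairs, with repeats where a family
# has several publications or legal-status entries in the same territory.
records = [
    ("F1", "US"), ("F1", "US"), ("F1", "CN"),
    ("F2", "US"), ("F2", "JP"),
    ("F3", "CN"), ("F3", "CN"),
]


def territory_coverage(rows):
    """Share of families with at least one filing in each territory."""
    families_by_territory = defaultdict(set)
    all_families = set()
    for family_id, territory in rows:
        # Using a set means a territory is counted only once per family.
        families_by_territory[territory].add(family_id)
        all_families.add(family_id)
    total = len(all_families)
    return {t: len(fams) / total for t, fams in families_by_territory.items()}


coverage = territory_coverage(records)
for territory, share in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{territory}: {share:.1%}")
```

As with the figures reported for figure 9.13, the shares are independent per territory and are not expected to sum to 100%.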
Subtopic keyword trends
Investigating keyword trends provides a different perspective beyond the alternative protein subtopic model. The smart summaries used during the topic model stage were data mined for the most contextually important keywords, leveraging transformer based embeddings to identify the keywords and phrases most similar to the document, followed by manual auditing for relevance to the SynBio project. The results are visualised in figure 9.14, which indicates how the cumulative publication counts have changed between the publication periods 2014-18 and 2019-23. The methodology aims to identify contextually relevant and reliable keywords as a source of ground truth, to signify important keywords within the corpus and to audit the topic model subtrend analysis already carried out.
In figure 9.14, the following key findings are observed and also support the trending areas identified by the subtopic modelling:
- Food and plant based patent applications are standout keyword areas, alongside an increase in microbial keywords such as microbial (40 to 94), bacteria (43 to 82), strain (42 to 82), fermentation (47 to 78), lactobacillus (33 to 55) and fungal (9 to 43), when contrasting 2014-2018 totals with 2019-23.
- Specific alternative protein applications can be identified, such as nutritional (54 to 71), analogue (6 to 65), plant protein (12 to 57), supplement (31 to 56), meat analogue (4 to 55), probiotic (37 to 50) and pharmaceutical composition (17 to 38), again contrasting 2014-2018 totals with 2019-23.
Subtopic keyword analysis
For a further perspective on contextually important keywords, a statistical procedure was applied to six subtopics selected from the corpus. The analysis contrasts how the usage or frequency of the keywords / phrases differs across the subtopics using a weighted log odds ratio. This aims to identify which differences are meaningful by weighting the log odds ratio by a prior, as outlined in Monroe, Colaresi, and Quinn (2008). The procedure estimates the prior from the data itself (an informative Dirichlet prior) rather than using an uninformative prior, which makes it an empirical Bayes approach; the results are shown in figure 9.15. A further motivation is to audit the subtopics for result relevance and transparency and to provide insights into content. As a side note, the transformer based keyword analysis provides powerful methods to review subtopics and extends the analytical power beyond established procedures for evaluating a corpus, such as TF-IDF (term frequency-inverse document frequency).
In figure 9.15, the keywords shown are those most characteristic of each subtopic based on the weighted log odds score, which is labelled. A higher weighted log odds score implies that the keyword is more likely to be used within that specific subtopic than in the others. Some of the scores are not very high, which is not surprising given the overlap encountered between the multiple subtopics identified within the alternative protein landscape.
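To illustrate the scoring, the sketch below implements one common formulation of the weighted log odds ratio from Monroe, Colaresi, and Quinn (2008), comparing each subtopic against all other subtopics combined. It rests on stated assumptions: the prior for each term is taken to be its corpus-wide count, the keyword counts are hypothetical toy values, and the function name `weighted_log_odds` is illustrative rather than the implementation used in this study.

```python
import math
from collections import Counter


def weighted_log_odds(counts_by_group):
    """z-scored log odds of each term in a group versus all other groups.

    Assumption: the Dirichlet prior alpha_w is the corpus-wide count of term w,
    i.e. a prior estimated from the data itself (empirical Bayes).
    """
    prior = Counter()
    for counts in counts_by_group.values():
        prior.update(counts)
    alpha_0 = sum(prior.values())

    scores = {}
    for group, counts in counts_by_group.items():
        n_i = sum(counts.values())  # total term count in this group
        n_j = alpha_0 - n_i         # total term count in all other groups
        scores[group] = {}
        for term, alpha_w in prior.items():
            y_i = counts.get(term, 0)
            y_j = alpha_w - y_i
            log_odds_i = math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
            log_odds_j = math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w))
            variance = 1 / (y_i + alpha_w) + 1 / (y_j + alpha_w)
            scores[group][term] = (log_odds_i - log_odds_j) / math.sqrt(variance)
    return scores


# Hypothetical keyword counts for two subtopics.
toy_counts = {
    "probiotics": Counter({"gut": 8, "strain": 5, "protein": 2}),
    "meat analogues": Counter({"protein": 9, "analogue": 7, "strain": 1}),
}
scores = weighted_log_odds(toy_counts)
for group, term_scores in scores.items():
    ranked = sorted(term_scores.items(), key=lambda kv: -kv[1])
    print(group, [(term, round(score, 2)) for term, score in ranked])
```

Positive scores mark terms that are more characteristic of a subtopic than of the rest of the corpus, and negative scores the opposite. Terms shared evenly across subtopics score close to zero, which is consistent with the modest scores seen where subtopics overlap.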
Some key findings observed are:
- The keywords identified for transgenic plants highlight producing and food, engineering of soy and soybean applications, amongst others.
- The stevia plant keyword is related to plant engineering; for example, the steviol glycosides extracted can be used to provide sweetness or mask flavours when used with plant proteins. Fatty acids are produced and used with alternative protein compositions.
- The meat analogues subtopic is characterised by plant proteins and analogue / substitute related keywords.
- The genetically modified microorganisms topic has a probiotic focus, with lactic acid bacteria such as lactobacillus identified. Other keywords include recombinant organisms, ‘accession’ (indicating patents disclosing accession numbers), the use of yeast and Bacillus subtilis for probiotics (recent research suggests the probiotic bacteria can decrease gas-related gastrointestinal symptoms in healthy adults) and, finally, the use of GMOs for feed.
- Probiotics are characterised by nutritional and dietary compositions, the gut microbiome, microbiota and cognitive function. Several species are identified, as is the treatment of type II diabetes.
- The bacteria related subtopic has a prebiotic, probiotic and nutritional focus, along with recombinant bacterial strains and keywords related to fermentative production.
- Finally, the proteins and extracts from microorganisms subtopic is characterised by nutritive polypeptides, edible species, further nutritional aspects and meat substitutes.
It is difficult to distil and characterise the coverage of the subtopics in 15 keywords and phrases. This is further complicated by the weighting, which is not always frequency led but reflects the terminology and context that is more characteristic of one subtopic in relation to others. It is fair to conclude that the subtopic model has successfully captured an extensive set of distinct subtrends; overlap exists, but the trends are accurate once audited. The keywords are relevant to real world alternative protein applications and suggest the insights identified are a useful tool for examining the alternative protein landscape.