Topic Model – UK Applicants
A synthetic biology perspective
Artificial intelligence powered topic modelling
The topic modelling conducted during this study required fine-tuning a methodology that leverages transformer-based text embeddings from a patent-specific model. A custom extractive text summarisation process powered by deep learning derived a contextually accurate summary of each patent from the title, abstract and claims within a specific context window. Further details of the different processes utilised include:
- A transformer-based methodology to encode the patent text as dense vector embeddings, capturing the semantic meaning of words based on their surrounding context.
- Data engineering to ensure all EPO patents had an English-language title, abstract and claims available. Where these were unavailable for the EPO document, data was sourced from the corresponding family member, e.g. WO.
- Dimensionality reduction of the transformer-based document embeddings for each patent, mapping documents to x and y coordinates for plotting.
- Fine-tuning of the topic model parameters to optimise the topic model.
- Human input into topic model auditing and topic categorisation fine tuning via bespoke methods.
- A hybrid approach using NLP for initial topic discovery together with manual input for multi-topic assignment and data cleaning, ensuring interpretable and accurate topic trends.
SynBio topic model – UK applicants
The topic model discussed in the topic model page of this report was refined to focus on UK-based applicants and is visualised in figure 17.1. The visualisation is based on the dimensionality reduction of vector embeddings, which maps each patent to a contextually relevant x and y coordinate. The dense categorical clusters are colour coded to support review.
As previously discussed in the topic model sections of this report, the topic model allows each published EP patent to belong to more than one topic area, to account for multiple embodiments of inventions, patents classified in multiple areas, etc. However, for simplicity, each patent is assigned to one key topic within the datamap in figure 17.1. To avoid duplication here, please see the overall topic model page for further details regarding the methodology.
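The distinction between multi-topic membership and a single key topic can be illustrated with a minimal sketch. The report does not state how membership is computed, so everything below is an assumption made for illustration: the topic centroids, the three-dimensional toy embeddings, the similarity threshold and the function names are all hypothetical, and cosine similarity stands in for whatever scoring the actual model uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_topics(embedding, centroids, threshold=0.5):
    """Return (key_topic, all_topics) for one patent embedding.

    all_topics: every topic whose centroid similarity meets the threshold
                (multi-topic membership, used for counting).
    key_topic:  the single most similar topic (used when a patent must
                appear only once, e.g. as one point on a 2D datamap).
    """
    scores = {name: cosine(embedding, c) for name, c in centroids.items()}
    key_topic = max(scores, key=scores.get)
    members = {name for name, s in scores.items() if s >= threshold}
    members.add(key_topic)  # the key topic is always a member
    return key_topic, sorted(members)

# Hypothetical toy data: three topic centroids and one patent embedding.
toy_centroids = {
    "antibodies": [1.0, 0.0, 0.0],
    "fusion proteins": [0.0, 1.0, 0.0],
    "biofuels": [0.0, 0.0, 1.0],
}
toy_embedding = [0.8, 0.6, 0.1]

key_topic, all_topics = assign_topics(toy_embedding, toy_centroids)
print(key_topic, all_topics)
```

In this toy case the patent is close to two centroids, so it is counted under both topics but plotted under only the closer one.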
UK topic model - SynBio technology cluster totals (2004-23)
The topic modelling identified 40 diverse clusters, which are ranked in figure 17.2 by the total number of published applications during 2004-23. A patent application can be counted more than once, as it can belong to multiple topics to account for multiple invention embodiments, etc. The topic model counts are specific to UK-based assignees/applicants. Manual data cleaning of the EPO SynBio address data was carried out.
During the 20-year publication period 2004-2023, therapeutic topics are widely represented across the UK applicant portfolio. There were 1569 publications related to antibody uses & therapeutics, an area that has influenced a variety of interconnected topics such as fusion proteins (7th, 717 publications), drug delivery & targeting (8th, 667 publications) and immunotherapy peptides (12th, 415 publications), amongst others. Genetically modified microorganisms (GMOs), which can have diverse SynBio applications, are ranked 5th with 765 publications. The largest strictly non-therapeutic topic identified is biofuels, ranked 17th with 336 publications. Waste processing and conversion is ranked 23rd with 182 publications and is closely linked with biofuel production through the processing of waste biomass sources, such as cellulose for ethanol production. The enzyme topic, covering compositions and biochemical treatment, is ranked 15th with 384 publications.
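The counting rule behind these rankings, where a publication contributes once to every topic it belongs to, can be sketched as follows. This is an illustration only: the publication identifiers and topic labels are invented toy values, and `rank_topics` is a hypothetical helper, not part of the study's tooling.

```python
from collections import Counter

def rank_topics(patent_topics):
    """Count publications per topic and rank topics by total, descending.

    patent_topics maps a publication id to the list of topics it belongs
    to. A publication with several topics adds one count to each of them,
    so topic totals can sum to more than the number of publications.
    """
    counts = Counter()
    for topics in patent_topics.values():
        counts.update(set(topics))  # set() guards against repeated labels
    # Sort by count (largest first), then by name for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, topic, n) for rank, (topic, n) in enumerate(ordered, start=1)]

# Hypothetical toy data: four publications, some with two topics.
toy_publications = {
    "EP-A": ["antibodies", "fusion proteins"],
    "EP-B": ["antibodies"],
    "EP-C": ["biofuels", "waste processing"],
    "EP-D": ["antibodies", "biofuels"],
}

ranking = rank_topics(toy_publications)
for rank, topic, n in ranking:
    print(rank, topic, n)
```

Here four publications produce seven topic counts, which is why the topic totals in a multi-topic model should not be summed to estimate the size of the dataset.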
UK topic model - SynBio technology cluster totals (2014-23)
For a more recent perspective, the topics with published applications during 2014-23 are ranked in figure 17.3. A patent application can be counted more than once, as it can belong to multiple topics to account for multiple invention embodiments, etc. The topic model counts are specific to UK-based assignees/applicants. Manual data cleaning of the EPO SynBio address data was carried out.
During the more recent 10-year publication period 2014-2023, therapeutic topics remain widely represented across the UK applicant portfolio in figure 17.3. There were 1008 publications related to antibody uses & therapeutics (711 publications), an area that has also influenced a variety of interconnected topics such as fusion proteins (465 publications), drug delivery & targeting (448 publications) and immunotherapy peptides (346 publications), amongst others. Genetically modified microorganisms (GMOs), which can have diverse SynBio applications, are ranked 5th with 530 publications. The enzyme topic, covering compositions and biochemical treatment, is ranked 17th with 192 publications. Chimeric antigen receptors have risen 5 ranks to 19th, breaking into the top 20 with 171 publications during the publication period.
SynBio topic modelling trends – UK applicants
The individual publications identified within the SynBio-related EPO patent dataset are mapped to at least one topic during the topic modelling stage. This enables publication year trends to be identified for each topic cluster, so that SynBio innovation can be investigated with enhanced data granularity. For ease of review, the categories have been ranked according to their overall publication counts during 2004-2023. The top ranked clusters (1-20) are shown in figure 17.4 for the UK applicant dataset.
In figure 17.4, the publication trendlines reveal multiple increasing publication trends, including: antibody uses/therapeutics, genetically modified microorganisms, fusion proteins, engineered cells (e.g. stem cells), drug delivery & targeting, immunotherapy peptides & receptors (T-cell), regulation of gene expression and vaccine-related patents. Some of these areas level off or decline slightly in 2023, but the growth since 2014 is apparent. The recent recovery of recombinant proteins & nucleic acids is unsurprising given the trending subject matter.
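Per-topic publication year trends of this kind can be derived from the topic assignments with a simple tally. The sketch below assumes each record is a (publication year, topic list) pair; the records, topic labels and the `yearly_trends` helper are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict

def yearly_trends(records, first_year, last_year):
    """Return {topic: [count for each year]} over an inclusive year range.

    records is an iterable of (publication_year, topics) pairs. Records
    outside the range are ignored; a record with several topics adds one
    count to each topic for its publication year.
    """
    n_years = last_year - first_year + 1
    table = defaultdict(lambda: [0] * n_years)
    for year, topics in records:
        if first_year <= year <= last_year:
            for topic in set(topics):
                table[topic][year - first_year] += 1
    return dict(table)

# Hypothetical toy records, not taken from the report's dataset.
toy_records = [
    (2019, ["CRISPR"]),                      # outside the range used below
    (2021, ["CRISPR"]),
    (2022, ["CRISPR", "chimeric antigen receptors"]),
    (2023, ["CRISPR"]),
    (2023, ["chimeric antigen receptors"]),
]

trends = yearly_trends(toy_records, 2021, 2023)
for topic, series in sorted(trends.items()):
    print(topic, series)
```

Each list is aligned to the year range, so the series can be passed directly to a plotting library to draw one trendline per topic.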
SynBio clusters - ranks 21-40
The publication year trends of SynBio clusters ranked 21-40 are shown in figure 17.5 for the UK applicant dataset.
In figure 17.5, multiple topic areas show peak patenting activity in 2023. The chimeric antigen receptors and CRISPR topics have accelerated since 2015, and biomass conversion and processing has almost returned to peak levels. Alternative proteins increased in 2023 and, unsurprisingly, the coronavirus vaccines and antibodies topic has grown rapidly since 2021 in response to the COVID-19 pandemic. Overall, patenting activity is quite low in certain topics, but the diversity of innovation within the UK applicant dataset provides a strong platform for further SynBio development during the next decade and beyond.
Recent trends – published during 2014 – 2023 (UK applicants)
The average numbers of publications per year during 2014-2018 and 2019-2023 are compared in figure 17.6 for each SynBio technology cluster within the identified UK applicant dataset.
The vast majority of clusters show increasing trends when 2019-2023 is contrasted with the prior 5 years. The notable exception is the biofuel topic, which decreased from an average of 19 publications per year in 2014-2018 to 16 per year in 2019-2023. A large number of therapeutic fields exhibit solid growth during 2019-23 compared with 2014-18. There are also noticeable increasing trends for antibodies, genetically modified microorganisms, fusion proteins, recombinant proteins and nucleic acids, drug delivery & targeting (e.g. nanoparticles), immunotherapy and engineered cells such as stem cells.
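The comparison of the two 5-year periods is a difference of per-year means, which can be sketched as below. The yearly counts are invented toy values for a single hypothetical topic, not figures from the report, and the function names are illustrative.

```python
def period_average(yearly_counts, start, end):
    """Mean publications per year over the inclusive range start..end.

    Years missing from yearly_counts are treated as zero publications.
    """
    n_years = end - start + 1
    total = sum(yearly_counts.get(year, 0) for year in range(start, end + 1))
    return total / n_years

def compare_periods(yearly_counts):
    """Return (2014-2018 mean, 2019-2023 mean, change) for one topic."""
    earlier = period_average(yearly_counts, 2014, 2018)
    later = period_average(yearly_counts, 2019, 2023)
    return earlier, later, later - earlier

# Hypothetical yearly publication counts for one topic.
toy_counts = {
    2014: 10, 2015: 12, 2016: 8, 2017: 10, 2018: 10,
    2019: 8, 2020: 8, 2021: 7, 2022: 9, 2023: 8,
}

earlier, later, change = compare_periods(toy_counts)
print(earlier, later, change)
```

A negative change marks a topic whose average yearly output fell in the later period, as described above for biofuels; a positive change marks growth.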