Topic Model
Artificial intelligence powered topic modelling
The topic modelling conducted during this study required fine-tuning a methodology that leverages transformer-based text embeddings from a patent-specific model. A custom extractive text summarisation process, powered by deep learning, derived a contextually accurate summary of each patent from the title, abstract and claims within a specific context window. The processes used include:
- A transformer-based methodology to encode the patent text as dense vector embeddings, capturing the semantic meaning of words based on their surrounding context.
- Data engineering to ensure all EPO patents had an English-language title, abstract and claims available. Where these were unavailable for the EPO document, data was sourced from the corresponding family member, e.g. WO.
- Dimensionality reduction of the transformer-based document embeddings for each patent, mapping documents to x and y coordinates for plotting.
- Fine-tuning of topic model parameters to optimise the topic model.
- Human input into topic model auditing and fine-tuning of topic categorisation via bespoke methods.
- A hybrid approach using NLP for initial topic discovery together with manual input for multi-topic assignment and data cleaning, ensuring interpretable and accurate topic trends.
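The family-member fallback described in the data engineering step above could be sketched as follows. This is a minimal illustration rather than the study's actual pipeline: the `PatentText` record, the `english_text` function and the publication identifiers are hypothetical names introduced here, and the sketch assumes the text fields have already been filtered to English.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

FIELDS = ("title", "abstract", "claims")


@dataclass
class PatentText:
    """English text available for one publication (hypothetical record)."""
    publication: str
    title: Optional[str] = None
    abstract: Optional[str] = None
    claims: Optional[str] = None


def english_text(
    ep_doc: PatentText, family_members: List[PatentText]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {field: (text, source publication)} for title, abstract and claims.

    The EPO document is preferred; a missing field is filled from the first
    family member that has it. A field nobody has is returned as (None, None).
    """
    result = {}
    for field in FIELDS:
        value = getattr(ep_doc, field)
        source = ep_doc.publication
        if not value:
            for member in family_members:
                if getattr(member, field):
                    value = getattr(member, field)
                    source = member.publication
                    break
        result[field] = (value, source) if value else (None, None)
    return result


# Illustrative records only; identifiers and text are made up.
ep = PatentText("EP-EXAMPLE", title="Engineered enzyme")
wo = PatentText("WO-EXAMPLE", title="Engineered enzyme (WO)",
                abstract="An enzyme variant.")
filled = english_text(ep, [wo])
```

Recording the source publication alongside each field keeps the fallback auditable, which is useful when checking how many documents relied on a family member.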
SynBio topic model
The synthetic biology topic model is visualised in figure
The datamap above requires each patent to be mapped to its primary cluster or category for manageable data visualisation. However, the topic model is also multi-category, enabling a patent to belong to more than one cluster for enhanced coverage when investigating topic trends. The topic model is set up for broad categorisation: the clustering task attempts to find topics and represent them. Initially, each patent is mapped to the single topic with the highest likelihood under a specific statistical procedure. This procedure is advantageous because, through fine-tuning, the number of relevant topics can be determined and the precise nature of each topic can be characterised using NLP-powered methodologies.
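As a rough illustration of the primary-topic step, the sketch below picks the topic with the highest probability for each patent. It assumes the topic model exposes a per-document probability for each topic; the function names, document identifiers and probability values are hypothetical, and the statistical procedure used in the study is not specified here.

```python
from typing import Dict, Tuple


def primary_topic(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Return the (topic, probability) pair with the highest probability.

    Ties are resolved in favour of the topic listed first.
    """
    if not probabilities:
        raise ValueError("no topic probabilities supplied")
    return max(probabilities.items(), key=lambda item: item[1])


def assign_primary(doc_probabilities: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each document identifier to its single primary topic."""
    return {
        doc_id: primary_topic(probs)[0]
        for doc_id, probs in doc_probabilities.items()
    }


# Toy probabilities for two made-up documents (not study data).
doc_probabilities = {
    "doc-1": {"antibody technology": 0.62, "immunotherapy": 0.30,
              "engineered cells": 0.08},
    "doc-2": {"antibody technology": 0.10, "immunotherapy": 0.15,
              "engineered cells": 0.75},
}
primary = assign_primary(doc_probabilities)
```

Only the primary label is needed to place a patent on the datamap; any further topic memberships are handled separately.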
During the study, human input was incorporated with domain knowledge, enabling additional topics to be created and extensive data cleaning to verify accurate topic assignment. With the 40 topics identified, patent documents were also assigned to multiple areas using specific keywords and classification codes. The initial topic discovery procedure is optimised by fine-tuning the model hyperparameters.
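A rule-based multi-topic assignment of this kind might look like the sketch below. The rules shown are illustrative assumptions, not the keywords or classification codes used in the study: the `RULES` table, the `assign_topics` function and the choice of CPC prefixes are all hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical rules: a topic is added when any keyword or CPC prefix matches.
RULES: Dict[str, Dict[str, tuple]] = {
    "antibody technology": {
        "keywords": ("antibody", "immunoglobulin"),
        "cpc_prefixes": ("C07K16/",),
    },
    "genetically modified microorganisms": {
        "keywords": ("recombinant bacterium", "engineered yeast"),
        "cpc_prefixes": ("C12N1/",),
    },
}


def assign_topics(text: str, cpc_codes: Iterable[str], primary: str,
                  rules: Dict[str, Dict[str, tuple]] = RULES) -> List[str]:
    """Return the primary topic plus every topic whose rule matches."""
    topics = {primary}
    lowered = text.lower()
    # Normalise codes such as "C07K 16/28" to "C07K16/28" before prefix matching.
    codes = [re.sub(r"\s+", "", code).upper() for code in cpc_codes]
    for topic, rule in rules.items():
        # Whole-word match only, so "antibody" does not match "antibodies".
        keyword_hit = any(
            re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
            for keyword in rule["keywords"]
        )
        code_hit = any(
            code.startswith(prefix)
            for code in codes
            for prefix in rule["cpc_prefixes"]
        )
        if keyword_hit or code_hit:
            topics.add(topic)
    return sorted(topics)


example = assign_topics(
    "A recombinant bacterium expressing an antibody fragment",
    ["C12N 15/70"],
    primary="recombinant proteins and nucleic acids",
)
```

Keeping the primary topic in the returned set means a document never loses its datamap category when additional topics are attached.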
The hybrid methodology exploits state-of-the-art artificial intelligence to explore topics present within the SynBio dataset for data discovery. It is further enhanced by human domain knowledge to overcome the initial noise within the model, generating highly accurate patent analytics insights. In this study, specific topics are analysed to drill down within prominent SynBio areas. The analysis methodology builds upon the topic model overview of SynBio at the European Patent Office (EPO), establishing a data narrative from a patent perspective.
Topic model - SynBio technology cluster totals
The topic modelling carried out identified 40 diverse clusters which have been ranked based on the total number of published applications in figure
During the 20-year publication period 2004-2023, recombinant proteins and nucleic acids accounted for 30.7% of the total dataset, reflecting the prominent levels of protein engineering occurring within the SynBio field. Genetically modified organisms accounted for 11% of the total dataset, revealing the large-scale engineering carried out to change the genetic material in viruses, bacteria, yeasts, etc. Topic modelling identified clusters with a variety of applications, including healthcare, diagnostics, therapeutics, vaccines, and cell, protein and material engineering. For a more recent perspective, the topics with published applications during 2014-2023 are ranked in figure
Antibody technology is the top-ranked topic during 2014-2023, accounting for 30.1% of the publications during this period. The genetically modified microorganisms topic has an increased ranking (5th to 4th), with 18.4% of publications during 2014-2023 categorised in this area. Engineered cells (14th to 10th) and the immunotherapy topic (18th to 12th) have also increased in ranking, reflecting the growing importance of therapeutic-related topics.
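The ranking and share calculations behind comparisons like these can be sketched as follows. This is an assumed reconstruction, not the study's code: the function names are hypothetical, the records are toy data, and shares are computed as the percentage of publications in the period that carry a topic, so shares can sum to more than 100% when documents belong to several topics.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set, Tuple


def rank_topics(docs: Iterable[Tuple[int, Set[str]]],
                start_year: int, end_year: int) -> List[dict]:
    """Rank topics by publication count within an inclusive year range."""
    in_period = [topics for year, topics in docs if start_year <= year <= end_year]
    total = len(in_period)
    counts = Counter(topic for topics in in_period for topic in set(topics))
    # Sort by descending count, then alphabetically to make ties deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"rank": rank, "topic": topic, "count": count,
         "share_pct": round(100 * count / total, 1)}
        for rank, (topic, count) in enumerate(ordered, start=1)
    ]


def rank_change(earlier: List[dict],
                later: List[dict]) -> Dict[str, Tuple[Optional[int], int]]:
    """Return {topic: (rank in earlier ranking, rank in later ranking)}."""
    before = {row["topic"]: row["rank"] for row in earlier}
    return {row["topic"]: (before.get(row["topic"]), row["rank"]) for row in later}


# Toy records of (publication year, topics); not study data.
docs = [
    (2005, {"recombinant proteins and nucleic acids"}),
    (2008, {"recombinant proteins and nucleic acids", "antibody technology"}),
    (2010, {"recombinant proteins and nucleic acids"}),
    (2012, {"recombinant proteins and nucleic acids"}),
    (2016, {"antibody technology"}),
    (2019, {"antibody technology", "engineered cells"}),
    (2022, {"recombinant proteins and nucleic acids", "antibody technology"}),
]
full_period = rank_topics(docs, 2004, 2023)
recent = rank_topics(docs, 2014, 2023)
changes = rank_change(full_period, recent)
```

In the toy data, antibody technology moves from second place over the full period to first place in the recent window, mirroring the kind of shift described above.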