Topic Model

Artificial intelligence powered topic modelling

The topic modelling conducted during this study required fine-tuning a methodology that leverages transformer-based text embeddings from a patent-specific model. A custom extractive text summarisation process, powered by deep learning, produced a contextually accurate summary of each patent from the title, abstract and claims, within a specific context window. The processes used include:

  • A transformer-based methodology to encode the patent text as dense vector embeddings, capturing the semantic meaning of words from their surrounding context.
  • Data engineering to ensure all EPO patents had an English-language title, abstract and claims available. Where these were unavailable for the EPO document, data was sourced from a corresponding family member (e.g. the WO publication).
  • Dimensionality reduction of the transformer-based document embeddings for each patent, mapping documents to x and y coordinates for plotting.
  • Fine-tuning of topic model parameters and related settings to optimise the topic model.
  • Human input into topic model auditing, and fine-tuning of topic categorisation via bespoke methods.
  • A hybrid approach using NLP for initial topic discovery together with manual input for multi-topic assignment and data cleaning, to support interpretable and accurate topic trends.
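
The family-member fallback in the data engineering step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `PatentDoc` record, the `resolve_english_text` helper and the example publication numbers are all hypothetical, and the rule of taking the first family member that has the missing field is an assumed tie-break.

```python
from dataclasses import dataclass
from typing import Optional

FIELDS = ("title", "abstract", "claims")


@dataclass
class PatentDoc:
    """Hypothetical record holding the English text available for one publication."""
    publication_number: str
    title: Optional[str] = None      # None when no English text is available
    abstract: Optional[str] = None
    claims: Optional[str] = None


def resolve_english_text(epo_doc, family_members):
    """Fill each missing English field of an EPO document from its family members.

    Returns {field: (text, source_publication_number)}; both values are None
    when no family member has the field either.
    """
    resolved = {}
    for field in FIELDS:
        text, source = getattr(epo_doc, field), epo_doc.publication_number
        if not text:
            text, source = None, None
            # Assumption: the first family member with the field wins.
            for member in family_members:
                if getattr(member, field):
                    text, source = getattr(member, field), member.publication_number
                    break
        resolved[field] = (text, source)
    return resolved


# Made-up example records, for illustration only.
ep = PatentDoc("EP-EXAMPLE-1", title="Engineered yeast strain")
wo = PatentDoc("WO-EXAMPLE-1", title="Engineered yeast strain",
               abstract="A yeast strain modified to produce a target compound.",
               claims="1. A yeast strain comprising a heterologous gene.")
resolved = resolve_english_text(ep, [wo])
print(resolved["title"][1], resolved["abstract"][1])  # EP-EXAMPLE-1 WO-EXAMPLE-1
```

Recording the source publication alongside each field keeps the substitution auditable, which matters when the summarised text later feeds the embedding step.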

SynBio topic model

The synthetic biology (SynBio) topic model is visualised in figure 1. The datamap uses dimensionality reduction of the vector embeddings to map each patent to contextually relevant x and y coordinates. The dense categorical clusters are colour-coded to support review.

The datamap in figure 1 requires each patent to be mapped to its primary cluster or category so that the visualisation remains manageable. However, the topic model is also multi-category, enabling a patent to belong to more than one cluster for enhanced coverage when investigating topic trends. The topic model is set up for broad categorisation: the clustering task attempts to find topics and represent them. Initially, each patent is mapped to the single topic with the highest likelihood under a specific statistical procedure. This procedure is advantageous because, through fine-tuning, the number of relevant topics can be established and the precise nature of each topic can be determined using NLP-powered methodologies.
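
The initial single-topic assignment can be illustrated with a short sketch. It assumes the statistical procedure yields a per-topic probability for each patent, which the text does not state explicitly. The `primary_topic` helper, the `review_below` cut-off used to flag uncertain assignments for human audit, and the probability values are all hypothetical.

```python
def primary_topic(topic_probs, review_below=0.5):
    """Pick the highest-likelihood topic for one patent.

    topic_probs: dict mapping topic name -> probability (assumed model output).
    review_below: hypothetical cut-off; assignments below it are flagged
    so that a human reviewer can audit them.
    Returns (topic, probability, needs_review).
    """
    topic, prob = max(topic_probs.items(), key=lambda item: item[1])
    return topic, prob, prob < review_below


# Made-up probabilities, for illustration only.
clear_case = {"antibodies": 0.62, "fusion proteins": 0.30, "gene therapy": 0.08}
ambiguous_case = {"enzymes": 0.40, "bacteria": 0.35, "biofuel related": 0.25}

print(primary_topic(clear_case))      # ('antibodies', 0.62, False)
print(primary_topic(ambiguous_case))  # ('enzymes', 0.4, True)
```

The primary topic is what a datamap would colour by, while flagged cases are natural candidates for manual review.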

During the study, human input incorporating domain knowledge enabled additional topics to be created and extensive data cleaning to be carried out to verify accurate topic assignment. With the 60 topics identified, patent documents were also assigned to multiple areas using specific keywords and classification codes. The initial topic discovery procedure was optimised by fine-tuning model hyperparameters.
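
A rule-based multi-topic assignment of this kind might look like the sketch below. The rules shown are illustrative assumptions, not the rules used in the study: the keywords are invented, the CPC-style main groups (A61K 48/ for gene therapy, C12N 9/ for enzymes) are chosen only as plausible examples, and `extra_topics` and `all_topics` are hypothetical helpers.

```python
# Hypothetical rules: a topic is added when any keyword appears in the text
# or any classification code starts with one of the listed prefixes.
TOPIC_RULES = {
    "gene therapy": {"keywords": ["gene therapy"], "code_prefixes": ["A61K48/"]},
    "enzymes": {"keywords": ["enzyme"], "code_prefixes": ["C12N9/"]},
}


def extra_topics(text, codes, rules=TOPIC_RULES):
    """Return the set of topics whose keyword or classification rule matches."""
    text_lower = text.lower()
    # Normalise codes such as "C12N 9/22" to "C12N9/22" before prefix matching.
    normalised = [code.replace(" ", "").upper() for code in codes]
    matched = set()
    for topic, rule in rules.items():
        keyword_hit = any(keyword in text_lower for keyword in rule["keywords"])
        code_hit = any(code.startswith(prefix)
                       for code in normalised
                       for prefix in rule["code_prefixes"])
        if keyword_hit or code_hit:
            matched.add(topic)
    return matched


def all_topics(primary, text, codes, rules=TOPIC_RULES):
    """Combine the model's primary topic with the rule-based topics."""
    return sorted({primary} | extra_topics(text, codes, rules))


# Made-up documents, for illustration only.
doc_a = all_topics("enzymes", "An engineered enzyme for gene therapy vectors",
                   ["C12N 9/22"])
doc_b = all_topics("antibodies", "A pharmaceutical composition", ["A61K 48/00"])
print(doc_a)  # ['enzymes', 'gene therapy']
print(doc_b)  # ['antibodies', 'gene therapy']
```

Plain substring matching is deliberately simple here; it would over-match in practice, which is one reason such rules are paired with manual data cleaning.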

The hybrid methodology exploits state-of-the-art artificial intelligence to explore the topics present within the SynBio dataset for data discovery. It is further enhanced by human domain knowledge to overcome the initial noise within the model, generating highly accurate patent analytics insights. In this study, specific topics are analysed to drill down within prominent SynBio areas. The analysis methodology builds upon the topic model overview of SynBio at the European Patent Office (EPO), establishing a data narrative from a patent perspective.

Topic model - SynBio technology cluster totals

The topic modelling identified 60 diverse clusters, which are ranked in figure 2 by total number of published applications. A patent application can be counted more than once because it can belong to multiple topics.
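
Because an application can be counted under several topics, topic shares expressed as a percentage of documents can sum to more than 100%. The sketch below shows the counting logic on a made-up four-document dataset. The `topic_totals` helper and the document identifiers are hypothetical, and the numbers are unrelated to the figures reported in this study.

```python
from collections import Counter


def topic_totals(assignments):
    """Rank topics by number of documents and give each topic's share of documents.

    assignments: dict mapping document id -> iterable of topic names.
    Returns a list of (topic, count, percent_of_documents), largest first.
    """
    counts = Counter()
    for topics in assignments.values():
        counts.update(set(topics))  # a document counts once per topic
    n_docs = len(assignments)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(topic, n, round(100 * n / n_docs, 1)) for topic, n in ranked]


# Made-up assignments, for illustration only.
assignments = {
    "DOC-A": ["antibodies", "fusion proteins"],
    "DOC-B": ["antibodies"],
    "DOC-C": ["enzymes", "bacteria"],
    "DOC-D": ["antibodies", "enzymes"],
}
ranked = topic_totals(assignments)
for topic, count, share in ranked:
    print(f"{topic}: {count} ({share}%)")
# antibodies: 3 (75.0%)
# enzymes: 2 (50.0%)
# bacteria: 1 (25.0%)
# fusion proteins: 1 (25.0%)
```

Here the shares add up to 175%, which is expected when the denominator is the number of documents rather than the number of topic assignments.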

Figure 2 shows notable distributions since 2015 among patent applications classified into at least one of the 60 SynBio topics identified. Key therapeutic areas include antibodies (31.2%) and fusion proteins (14.5%). Around 12% of documents are classified in the gene therapy topic, an important therapeutic approach that treats or cures disease by correcting the underlying genetic problem. Enzymes are engineered for specific tasks and represent a key topic, with 23.7% of documents classified here. The engineering of cells and microorganisms, by designing and inserting new genetic programs to give them novel functions, had large topic distributions, including engineered cells (stem cells etc.) (18.9%), viruses and bacteriophages (14.5%), bacteria (11.9%) and fungi related (5.7%). The new topic "Functional genomics or proteomics" (5.2%) also reflects the amount of ongoing research in this area. Modern developments such as drug discovery are reflected in the new topic "AI, machine learning, etc." (3.6%).

Modern SynBio areas such as sustainable materials are reflected in notable distributions that include textiles and coatings (10.1%); packaging, films & bioplastics (6.2%); biofuel related (5.6%); and waste processing and conversion (5.2%). Genetically modified microorganisms accounted for 15.7% of the total dataset, revealing the large-scale genetic engineering under way to change the genetic material of viruses, bacteria, yeasts, etc. These microorganisms are also important for biofuel production, vaccines, etc.