Topic Model – UK applicants (Company & University) – Published 2014-23

A synthetic biology perspective

Artificial intelligence powered topic modelling

The topic modelling conducted during this study required fine-tuning a methodology that leverages transformer-based text embeddings from a patent-specific model. A custom extractive text summarisation process, powered by deep learning, derived a contextually accurate summary of each patent from the title, abstract and claims within a specific context window. Further details of the processes used include:

  • A transformer-based methodology to encode the patent text as dense vector embeddings, capturing the semantic meaning of words based on their surrounding context.
  • Data engineering to ensure all EPO patents had an English language title, abstract & claims available. Where these were unavailable for the EPO document, data was sourced from a corresponding family member, e.g. the WO publication.
  • Dimensionality reduction of the transformer-based document embedding for each patent, mapping documents to x and y coordinates for plotting.
  • Fine-tuning of the topic model parameters to optimise the topic model.
  • Human input into topic model auditing and topic categorisation fine tuning via bespoke methods.
  • A hybrid approach using NLP for initial topic discovery together with manual input for multi-topic assignment and data cleaning, ensuring interpretable and accurate topic trends.
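
As a minimal sketch of the data engineering step listed above, the following Python shows one way to fall back to a family member (e.g. the WO publication) when the EPO document lacks English text. The record layout, the field names, the placeholder publication numbers and the function `fill_english_text` are all hypothetical and are not taken from the pipeline used in this study.

```python
REQUIRED_FIELDS = ("title", "abstract", "claims")


def fill_english_text(ep_doc, family_members):
    """Return English text per field, noting which publication supplied it.

    Hypothetical record layout: each document is a dict with a
    "publication" key plus optional "title", "abstract" and "claims" keys
    holding English text (missing or empty when unavailable).
    """
    filled = {}
    for field in REQUIRED_FIELDS:
        text, source = ep_doc.get(field), ep_doc["publication"]
        if not text:
            # Fall back to the first family member that has this field.
            text, source = None, None
            for member in family_members:
                if member.get(field):
                    text, source = member[field], member["publication"]
                    break
        filled[field] = {"text": text, "source": source}
    return filled


# Placeholder records: the EP document has no English claims.
ep_doc = {
    "publication": "EP-0000001",
    "title": "Engineered enzyme",
    "abstract": "An enzyme variant.",
    "claims": "",
}
wo_doc = {
    "publication": "WO-0000001",
    "title": "Engineered enzyme",
    "abstract": "An enzyme variant.",
    "claims": "1. An enzyme variant comprising ...",
}

result = fill_english_text(ep_doc, [wo_doc])
print(result["claims"]["source"])  # WO-0000001
```

Recording the source publication alongside each field keeps the substitution auditable, which may help during the manual review steps described above.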

SynBio topic model – UK applicants

The topic model discussed on the topic model page of this report was further refined to focus on UK-based applicants, specifically applicants from the company and university sectors with applications published during 2014-23, to give a recent perspective (figure 18.1). The visualisation is based on the dimensionality reduction of vector embeddings, which maps each patent to a contextually relevant x & y coordinate. The dense categorical clusters are colour coded to support review.
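
The report does not name the dimensionality reduction algorithm used. As an illustrative stand-in only, the sketch below applies principal component analysis (computed by power iteration with the Python standard library) to toy four-dimensional vectors, showing how embeddings can be mapped to x & y coordinates so that semantically similar documents land close together. The toy vectors and the function names are assumptions; real patent embeddings have far more dimensions, and a non-linear method may well have been used in practice.

```python
import math
import random


def top_components(rows, k=2, iters=200, seed=0):
    """Return the mean and the k leading principal directions of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # Remove any part of v lying along components already found.
            for c in components:
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            # Multiply by the (unnormalised) covariance matrix: X^T (X v).
            scores = [sum(a * b for a, b in zip(row, v)) for row in centred]
            v = [sum(s * row[j] for s, row in zip(scores, centred)) for j in range(d)]
            norm = math.sqrt(sum(a * a for a in v)) or 1.0
            v = [a / norm for a in v]
        components.append(v)
    return mean, components


def project_2d(rows):
    """Map each vector to (x, y) using the two leading principal directions."""
    mean, comps = top_components(rows, k=2)
    return [
        tuple(sum((r[j] - mean[j]) * c[j] for j in range(len(r))) for c in comps)
        for r in rows
    ]


# Toy "embeddings": the first two and the last two are similar to each other.
embeddings = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
coords = project_2d(embeddings)
for x, y in coords:
    print(f"{x:+.2f}, {y:+.2f}")
```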

This section of the report provides further insights into the UK applicant landscape by assessing applicants within the company and university sectors separately, using SynBio EPO patents published 2014-23. Assignees were manually analysed to determine their sector: company or university. For this analysis, individuals and non-profits were not considered, but future reports can be expanded to these areas. Figure 18.1 shows a diverse range of topics. Owing to the semantic similarity of the ‘smart summary’ produced for each patent via bespoke extractive summarisation, some of the clusters comprise patents belonging to different topics. The latent space is visualised via dimensionality reduction, which represents the complex embedding vectors in two dimensions.

UK topic model (company sector) - SynBio technology cluster totals (2014-23)

The topic modelling identified 40 diverse clusters, which are ranked in figure 18.2 by the total number of published applications during 2014-23. A patent application can be counted more than once, as it can belong to multiple topics to account for multiple invention embodiments, etc. The topic model counts are specific to UK-based assignees/applicants from the company sector. Manual data cleaning of the EPO SynBio address data and checks of the applicant sector were carried out.
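
To illustrate how multi-topic assignment affects the totals, the short sketch below counts each application once for every topic it belongs to and then ranks the topics. The application identifiers and topic assignments are invented for illustration, and `rank_topics` is a hypothetical helper rather than the method used to produce the figures in this report.

```python
from collections import Counter


def rank_topics(assignments):
    """Rank topics by the number of applications assigned to each.

    `assignments` maps an application identifier to the topics it belongs
    to. An application adds one to every topic it is assigned to, so the
    topic totals can exceed the number of applications.
    """
    counts = Counter()
    for topics in assignments.values():
        counts.update(set(topics))  # set() guards against repeated labels
    return counts.most_common()


# Invented example data.
assignments = {
    "APP-A": ["Antibodies", "Fusion proteins"],
    "APP-B": ["Antibodies"],
    "APP-C": ["Antibodies", "Fusion proteins", "Biofuels"],
}

ranking = rank_topics(assignments)
for rank, (topic, total) in enumerate(ranking, start=1):
    print(rank, topic, total)
```

Here three applications yield six topic counts in total, which is why the cluster totals should not be summed to estimate the number of distinct applications.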

The UK company applicant topic distribution is skewed towards therapeutic applications during 2014-23. As previously identified, antibodies remain the top ranked topic. Other antibody connected topics such as drug delivery & targeting e.g. nanoparticles (5th – 340 publications), fusion proteins (7th – 325 publications) and immunotherapy peptides & receptors (9th – 246 publications) are all ranked inside the top 10 topics. Within the top 20 topics, vaccines (20th – 108 publications) and chimeric antigen receptors (17th – 130 publications) are very likely to have crossover with the antibody topic, which has influenced innovation and growth in multiple areas. Within the company portfolio, biofuels are ranked higher than previously identified (18th – 122 publications). Waste processing & conversion is ranked 21st with 88 publications. It is a diverse topic with the potential to contribute to multiple SynBio areas, such as biofuels and packaging, where lignocellulosic biomass can be processed and converted. The enzyme topic, covering compositions and biochemical treatment, is ranked 16th with 134 publications.

UK SynBio company applicants – statistical binning

In figure 18.3, the UK-based company sector applicants were analysed via statistical binning to identify the number of companies holding specific quantities of patent families, and to contrast the change between 2014-18 and 2019-23. Patent families are INPADOC based.

In figure 18.3, during 2014-23 there were 339 companies with 1 patent family in total, potentially representing startups, etc. There were 137 companies, or almost one quarter of the companies analysed, with 2 to 3 patent families. Beyond this, there were 48 companies (8.3%) with 4 to 6 patent families in total.
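
The binning itself can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (one record per patent family with its list of applicants), the company names and the open-ended "7+" bin are hypothetical, since the report only states the 1, 2 to 3 and 4 to 6 ranges.

```python
from collections import defaultdict

# (label, lowest count, highest count); None means no upper limit.
# The "7+" bin is an assumption for illustration.
BINS = (("1", 1, 1), ("2-3", 2, 3), ("4-6", 4, 6), ("7+", 7, None))


def families_per_applicant(families):
    """Count distinct patent families per applicant.

    A family with co-applicants is counted once for each applicant.
    """
    owned = defaultdict(set)
    for family_id, applicants in families:
        for name in applicants:
            owned[name].add(family_id)
    return {name: len(ids) for name, ids in owned.items()}


def bin_label(count):
    """Return the label of the bin that `count` falls into."""
    for label, low, high in BINS:
        if count >= low and (high is None or count <= high):
            return label
    raise ValueError(f"count must be at least 1, got {count}")


def bin_applicants(family_counts):
    """Return the number of applicants in each bin."""
    histogram = {label: 0 for label, _, _ in BINS}
    for count in family_counts.values():
        histogram[bin_label(count)] += 1
    return histogram


# Invented example data: family F2 has two co-applicants.
families = [
    ("F1", ["Alpha Ltd"]),
    ("F2", ["Alpha Ltd", "Beta Ltd"]),
    ("F3", ["Alpha Ltd"]),
    ("F4", ["Alpha Ltd"]),
    ("F5", ["Gamma Ltd"]),
    ("F6", ["Gamma Ltd"]),
]

histogram = bin_applicants(families_per_applicant(families))
print(histogram)  # {'1': 1, '2-3': 1, '4-6': 1, '7+': 0}
```

Running the same functions on the families of each period separately would give the kind of 2014-18 versus 2019-23 contrast described in this section.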

The number of companies with one patent family increased from 206 to 237 companies when contrasting 2014-18 & 2019-23. There was also a noticeable increase in the number of companies which grew from one patent family to the 2-3 family range, increasing from 52 to 101 companies. The surge in the number of companies with 2 to 3 patent families in 2019-23 may also be explained by startups, etc. filing multiple patents on entry into the SynBio space during this period.
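
As a worked check on the scale of these changes, the relative increase between the two periods can be computed directly from the counts quoted above. The helper name `percent_change` is illustrative only.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100


# Company counts quoted in the text for 2014-18 and 2019-23.
one_family = percent_change(206, 237)
two_to_three = percent_change(52, 101)

print(f"1 family:       {one_family:.1f}%")    # 15.0%
print(f"2 to 3 families: {two_to_three:.1f}%")  # 94.2%
```

On these figures the 2 to 3 family range almost doubled, while the single family group grew by roughly 15%.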

There is potential for the increased figures to be impacted by co-applicants where collaboration results in more than one company assigned to a patent family. However, some of the increases appear significant and may indicate an increased number of entrants within the SynBio UK landscape in recent years, contributing to innovation within the field.

UK topic model (university sector) - SynBio technology cluster totals (2014-23)

The topic modelling identified 40 diverse clusters, which are ranked in figure 18.4 by the total number of published applications during 2014-23. A patent application can be counted more than once, as it can belong to multiple topics to account for multiple invention embodiments, etc. The topic model counts are specific to UK-based assignees/applicants from the university sector. Manual data cleaning of the EPO SynBio address data and checks of the applicant sector were carried out.

In figure 18.4, the UK university applicant topic distribution is also skewed towards therapeutic applications during 2014-23. Antibodies remain the top ranked topic. Other antibody connected topics such as fusion proteins (6th – 98 publications) and drug delivery & targeting (7th – 90 publications) are ranked in the top 10 topics. The genetically modified microorganisms (GMOs) topic, which has diverse applications in the SynBio landscape, is ranked 3rd with 136 publications. The enzyme topic, covering compositions and biochemical treatment, is ranked 17th with 39 publications. The topic counts also reflect the collaboration occurring between universities and companies focused on therapeutics.

Statistical binning was also carried out for the university sector UK applicants (published 2014-23), shown in figure 18.5, where patent families are INPADOC based. The number of universities with 4 to 6 families increased from 8 in 2014-18 to 15 in 2019-23, as the portfolios of specific institutions further matured during this timeframe.