From Famine To Feast: The Patent Chemistry ‘Big Bang’ In PubChem

poster · 8 years ago
by Nicholas Goncharoff, Andrew Hinton, Christopher Southan (SureChem, TW2Informatics)
Standardizer JChem Base Structure Checker Naming Document to Structure
Within only a couple of years, the number of patent-extracted structures in PubChem has grown from 2 to 14.5 million. The major sources, in order of addition, are Thomson Pharma, IBM, SCRIPDB and SureChemOpen. The last of these deposited 8.3 million structures in December 2012, of which 4.6 million had unique compound identifiers (CIDs). This new public patent chemistry ‘feast’ has a range of implications and utilities, some of which will be touched upon in this poster. The potential total of useful structures extractable from the entire patent corpus remains a matter of conjecture. However, the patents related to medicinal chemistry (i.e. International Patent Classification C07D) number only on the order of 200K patent document families. Thus, at least the major proportion of example structures specified in drug discovery patents, including those with activity results, can now be found in PubChem. Statistical updates will be presented, but currently ~30% of the 47 million CIDs have at least one patent extraction (although some of these will be common chemistry) and, significantly, ~15% of these are from patent-only sources. A valuable consequence is that compounds with structure-activity relationship (SAR) data extracted from medicinal chemistry papers in ChEMBL now have a high probability of intersecting with structures from patents, either by identity or by similarity. Two sources, IBM and SCRIPDB, have patent numbers indexed in the PubChem CID records, but for SureChem the substance entries link out to the free SureChemOpen application. This provides not only the location of each structure, typically as an IUPAC name, in the full text and/or images but also a view of the entire chemistry extracted from that document. Examples will be shown where SAR tables in patents are an order of magnitude larger than those in the eventual paper.
This information expansion that patents bring to PubChem (and, extrinsically, via the use of SureChemOpen) is transformative, considering not only that ~70% more data is thereby ‘unlocked’ but also that the increased connectivity between papers, abstracts, patents and protein targets (e.g. from PubChem BioAssay) via their chemical content becomes easier to navigate and exploit. A comparative analysis of molecular weight (Mw) distributions between the major patent sources and ChEMBL, as the standard for bioactive content, will also be presented. The differences give an insight into the complementarity, for PubChem content, between manual expert extraction and automated name-to-structure conversion pipelines.
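
As a rough sanity check on the percentages quoted in the abstract, the short sketch below converts them into approximate compound counts. It assumes that "~15% of these" refers to the patent-linked CIDs rather than to all 47 million CIDs. The function and variable names are illustrative only, and the inputs are the rounded figures from the text, so the results are order-of-magnitude estimates rather than reported statistics.

```python
# Rounded figures taken from the abstract; all results are approximate.
TOTAL_CIDS = 47_000_000          # total PubChem CIDs quoted in the abstract
PATENT_LINKED_PERCENT = 30       # ~30% of CIDs have at least one patent extraction
PATENT_ONLY_PERCENT = 15         # ~15% of those (assumed reading of "these")


def estimate_counts(total_cids, patent_linked_percent, patent_only_percent):
    """Return (patent_linked, patent_only) counts using integer arithmetic."""
    patent_linked = total_cids * patent_linked_percent // 100
    # Assumption: the patent-only share is a fraction of the patent-linked CIDs.
    patent_only = patent_linked * patent_only_percent // 100
    return patent_linked, patent_only


patent_linked, patent_only = estimate_counts(
    TOTAL_CIDS, PATENT_LINKED_PERCENT, PATENT_ONLY_PERCENT
)
print(f"CIDs with a patent extraction: ~{patent_linked / 1e6:.1f} million")
print(f"CIDs from patent-only sources: ~{patent_only / 1e6:.1f} million")
```

Under these assumptions the patent-linked estimate comes out close to the 14.5 million patent-extracted structures cited at the start of the abstract, which suggests the two figures are broadly consistent.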