Extract Chemical Information from Patents Using Chemicalize and D2S (Document to Structure)

Posted by David Deng on 13 September 2013
ChemAxon hosts a free web service, Chemicalize.org, that helps users extract chemical information from webpages and documents. It is powered by ChemAxon’s Naming technology, which converts IUPAC names, common names, SMILES/InChI strings, and CAS Registry Numbers to structures. All chemical information in the text is extracted, and the uploaded document can be viewed in Document Viewer with the structures displayed interactively. All structures are also available for download and can be looked up through structure search on the site. As an increasing number of full patent texts become available online, Chemicalize can be a powerful tool for patent mining. In this short session, we will demonstrate how to extract exemplified structures from a patent via Chemicalize and then expand the chemical space using ChemAxon’s Markush technology. For patents that contain sensitive information and cannot be uploaded to a public website, D2S (Document to Structure) can be a very useful tool. Also based on the Naming technology, D2S applies text OCR (optical character recognition) and image OSR (optical structure recognition) to extract chemical information from non-searchable PDF documents. Because D2S also returns the locations of the extracted structures within the document, it can significantly speed up patent analysis.
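To make the idea of extracting identifiers together with their locations more concrete, here is a minimal Python sketch. It is not ChemAxon's Naming or D2S implementation and does not call any ChemAxon API; the function names `is_valid_cas` and `find_identifiers` are hypothetical. It only covers two of the simpler cases mentioned above, CAS Registry Numbers (validated with their standard check digit) and InChI strings, and it reports character offsets so that each hit can be traced back to its position in the text. Recognizing IUPAC or common names requires a real name-to-structure engine and is out of scope here.

```python
import re

# A CAS Registry Number has 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")
# InChI strings start with "InChI=1/" or "InChI=1S/" and contain no whitespace.
INCHI_PATTERN = re.compile(r"InChI=1S?/\S+")


def is_valid_cas(candidate: str) -> bool:
    """Check the CAS check digit: weight digits 1, 2, 3, ... from the right."""
    digits = candidate.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(i * int(d) for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def find_identifiers(text: str):
    """Return (kind, value, start, end) tuples sorted by position in text."""
    hits = []
    for m in CAS_PATTERN.finditer(text):
        if is_valid_cas(m.group(0)):
            hits.append(("CAS", m.group(0), m.start(), m.end()))
    for m in INCHI_PATTERN.finditer(text):
        # Drop sentence punctuation that the greedy \S+ may have swallowed.
        value = m.group(0).rstrip(".,;")
        hits.append(("InChI", value, m.start(), m.start() + len(value)))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    sample = "Water (CAS 7732-18-5) has the identifier InChI=1S/H2O/h1H2."
    for kind, value, start, end in find_identifiers(sample):
        print(kind, value, start, end)
```

The check-digit test filters out most look-alike strings such as dates or part numbers. A production tool would additionally have to resolve each identifier to a structure and handle OCR noise.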