Chemical intelligence that makes hidden knowledge effortlessly reachable
The knowledge produced and stored in reports, patents and scientific journal articles is expanding exponentially. However, the unstructured nature of this content constrains seamless information access and scientific decision support. Chemistry is a unique field in this regard, for two reasons. First, the nomenclature is verbose, in the sense that a single chemical structure can be represented by various synonyms, for example a traditional name, an IUPAC name, a wide range of brand names, or a chemical format such as SMILES. Second, navigating the knowledge base with queries related to the encapsulated chemical space calls for specialized search methods, such as similarity-based or substructure searches.
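The synonym problem can be illustrated with a minimal sketch. It assumes a tiny hand-built record for aspirin; the record layout, the identifier `CMPD-0001` and the helper names `build_index` and `resolve` are hypothetical and are not part of the workflow described here. The point is only that several textual forms must resolve to one chemical object before any search over the chemical space is possible.

```python
# Hypothetical record: one compound, several textual representations.
RECORDS = [
    {
        "id": "CMPD-0001",  # illustrative identifier, not a real registry number
        "names": [
            "aspirin",                   # traditional name
            "acetylsalicylic acid",      # common chemical name
            "2-acetyloxybenzoic acid",   # IUPAC name
            "CC(=O)OC1=CC=CC=C1C(=O)O",  # SMILES string
        ],
    },
]


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def build_index(records):
    """Map every normalized synonym to the identifier of its compound."""
    index = {}
    for record in records:
        for name in record["names"]:
            index[normalize(name)] = record["id"]
    return index


def resolve(index, text: str):
    """Return the compound identifier for a name, or None if unknown."""
    return index.get(normalize(text))


INDEX = build_index(RECORDS)
```

In this sketch, `resolve(INDEX, "Acetylsalicylic  Acid")` and `resolve(INDEX, "aspirin")` return the same identifier. A dictionary lookup of this kind only covers names that are already known; recognizing names in free text and converting them to structures requires dedicated name-to-structure tooling.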
Our study highlights computational approaches that make the chemistry-related knowledge stored in open access articles easily accessible. We present the results obtained on this large corpus through the following workflow: i) large-scale conversion of text content to chemical objects, ii) automated preparation of databases to store and organize the relevant data, and iii) analysis of the collected chemical space.
Chemical objects were extracted with the ChemLocator application from nearly 1.9M articles, which together span the chemical space of the open access scientific literature. The chemical space was analysed by calculating a fingerprint-based chemical similarity matrix and clustering with MadFast Similarity Search. To explore the scaffold diversity of this chemical space, the obtained set was fragmented to yield rings and ring systems. Hidden relationships were explored by combining text and chemical information in a graph data model with related visualization.
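The idea behind a fingerprint-based similarity matrix and clustering can be sketched in a few lines. This is a toy illustration, not the algorithm used by MadFast Similarity Search: it assumes that each fingerprint is given as a set of "on" bit positions, uses the Tanimoto coefficient as the similarity measure, and swaps in a simple threshold-based single-linkage grouping as the clustering step. All function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_matrix(fingerprints):
    """Symmetric all-pairs similarity matrix; the diagonal is set to 1.0."""
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return matrix


def threshold_clusters(matrix, threshold: float):
    """Group items linked by similarity >= threshold (single linkage)."""
    n = len(matrix)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if matrix[i][j] >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy fingerprints, made up for illustration only.
FINGERPRINTS = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]
CLUSTERS = threshold_clusters(similarity_matrix(FINGERPRINTS), threshold=0.5)
```

With these toy fingerprints the first two items share three of five distinct bits and the last two share two of four, so a threshold of 0.5 yields two clusters. The all-pairs matrix grows quadratically with the number of structures, which is why corpus-scale analysis relies on specialized similarity search tools rather than a naive loop like this one.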
In summary, our use case highlights the potential of novel technologies to pre-process, search and explore the information network embedded in large document sets in the field of chemistry.