Enhancing the computer curation of patents and scientific documents and enabling an InChIKey insertion technique for specific and non-specific compound proximity searching
The combined technologies of text analytics and name-to-structure conversion for reading and processing molecular structures give researchers the ability to build large structure databases and derive important relationships that were previously inaccessible. Our previous work employed ChemAxon name-to-structure tools to produce SMILES strings for use with JChem Base, rendering the scientific and patent literature searchable by structure and substructure. We now employ additional ChemAxon tools to detect, normalize, and replace chemical names in documents with InChIKeys, and then index the combined text and embedded InChIKeys using Solr, a Lucene-based full-text indexing engine. The resulting index supports Boolean combinations of chemical compounds with regular text words and phrases. It also supports proximity searching that is both compound-specific and “any compound” capable. The net result is that we can now search for exact chemical structures, or even unspecified chemical structures, within a specified context. We will demonstrate that this enables new ways to search and exploit technical databases.
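The InChIKey insertion idea can be illustrated with a small, self-contained Python sketch. It is not the authors' pipeline and uses neither ChemAxon nor Solr: the name lookup table, the function names, and the sample sentence are all hypothetical, and a plain token list stands in for the index. It shows only how replacing names with InChIKeys lets one proximity test serve both a specific compound and "any compound", since every embedded key has the same recognizable shape.

```python
import re

# Hypothetical name -> InChIKey table for illustration only; a real pipeline
# would obtain keys from a name-to-structure tool, not a hand-written dict.
NAME_TO_KEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "caffeine": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
}

# InChIKey shape: blocks of 14, 10, and 1 uppercase letters joined by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

NAME_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, NAME_TO_KEY)) + r")\b", re.IGNORECASE
)


def embed_keys(text):
    """Replace each known chemical name in the text with its InChIKey."""
    return NAME_RE.sub(lambda m: NAME_TO_KEY[m.group(1).lower()], text)


def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase everything but keys."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,;:()\"'")
        if tok:
            tokens.append(tok if INCHIKEY_RE.match(tok) else tok.lower())
    return tokens


def near(tokens, is_a, is_b, max_distance):
    """True if some token matching is_a is within max_distance positions
    of a different token matching is_b."""
    pos_a = [i for i, t in enumerate(tokens) if is_a(t)]
    pos_b = [i for i, t in enumerate(tokens) if is_b(t)]
    return any(i != j and abs(i - j) <= max_distance
               for i in pos_a for j in pos_b)


def compound(name):
    """Predicate matching one specific compound by its InChIKey."""
    key = NAME_TO_KEY[name]
    return lambda t: t == key


def any_compound(token):
    """Predicate matching any embedded InChIKey, whatever the compound."""
    return bool(INCHIKEY_RE.match(token))


def word(w):
    """Predicate matching an ordinary text word."""
    return lambda t: t == w


doc = "Aspirin inhibits platelet aggregation, whereas caffeine was inactive."
tokens = tokenize(embed_keys(doc))

# Compound-specific proximity: aspirin within 3 tokens of "inhibits".
print(near(tokens, compound("aspirin"), word("inhibits"), 3))
# Any-compound proximity: some compound within 2 tokens of "inactive".
print(near(tokens, any_compound, word("inactive"), 2))
```

In a full-text engine the same two query types would be expressed in its own proximity syntax rather than with Python predicates. How the keys are tokenized there (for example, whether hyphens split them) depends on the analyzer configuration, which the abstract does not describe.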