Automated spelling correction to improve recall rates of name-to-structure tools for chemical text mining

presentation · 9 years ago
by Sorel Muresan (AstraZeneca)

A major use of name-to-structure software, such as Lexichem (OpenEye), n2s (ChemAxon), and OPSIN (Unilever Centre for Molecular Science Informatics), is text mining and chemical named entity recognition (CNER) in patents and online web documents. In this use case, performance is limited not by the chemistries that name-to-structure conversion supports, but by the high rate of typos and lexicographic errors caused by human error, OCR failures, hyphenation, and names broken across multiple lines. In this work, we present an analysis of the quality of structures extracted by automatic CNER methods and a large-scale analysis of a comprehensive database of 12 million patents. Automated spelling correction with CaffeineFix as a pre-processing step, together with the combined output of several name-to-structure tools, is shown to greatly improve patent chemical text mining.
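
The two-stage idea described above (correct the spelling first, then try several name-to-structure tools) can be illustrated with a minimal Python sketch. It is not CaffeineFix's algorithm, which the abstract does not describe; a simple single-edit dictionary lookup is swapped in for illustration. The vocabulary `VOCAB`, the functions `edits1`, `correct_token`, `correct_name`, and `name_to_structure`, and the stand-in converters `tool_a` and `tool_b` are all hypothetical names, and the converters are plain dictionary lookups rather than the real Lexichem, n2s, or OPSIN interfaces.

```python
import re
from typing import Callable, Iterable, Optional

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Toy vocabulary of valid chemical names (an assumption for illustration only).
VOCAB = frozenset({"benzene", "toluene", "ethanol", "methanol", "phenol"})


def edits1(word: str) -> set[str]:
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def correct_token(token: str, vocab: frozenset = VOCAB) -> str:
    """Return the unique vocabulary word one edit away, else the token unchanged."""
    if token in vocab:
        return token
    candidates = edits1(token) & vocab
    # Only correct when the fix is unambiguous; a wrong "fix" yields a wrong structure.
    return next(iter(candidates)) if len(candidates) == 1 else token


def correct_name(name: str, vocab: frozenset = VOCAB) -> str:
    """Correct each lowercase alphabetic run; locants and punctuation are untouched."""
    return re.sub(r"[a-z]+", lambda m: correct_token(m.group(), vocab), name)


Converter = Callable[[str], Optional[str]]


def name_to_structure(name: str, converters: Iterable[Converter]) -> Optional[str]:
    """Spell-correct the name, then return the first structure any converter yields."""
    fixed = correct_name(name)
    for convert in converters:
        smiles = convert(fixed)
        if smiles is not None:
            return smiles
    return None


# Hypothetical stand-ins for separate name-to-structure tools, each covering
# different names; dict.get returns None when a tool cannot parse the name.
tool_a: Converter = {"benzene": "c1ccccc1", "ethanol": "CCO"}.get
tool_b: Converter = {"toluene": "Cc1ccccc1", "phenol": "Oc1ccccc1"}.get

if __name__ == "__main__":
    for raw in ["benzne", "tolunee", "nethanol"]:
        print(raw, "->", correct_name(raw), "->", name_to_structure(raw, [tool_a, tool_b]))
```

In this sketch, `benzne` and `tolunee` are each one edit from a single vocabulary entry and are corrected, while `nethanol` is one edit from both `ethanol` and `methanol` and is deliberately left alone. Refusing ambiguous corrections trades some recall for precision, which matters when a wrong correction would silently produce a valid but incorrect structure.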