Why 1+1=3 in the world of unstructured text analysis

news · 7 years ago
by Judy Bandy

ChemAxon works with a large variety of software partners who offer complementary capabilities. I recently chatted with David Milward of ChemAxon partner Linguamatics about what happens when you combine ChemAxon's search and chemical recognition technology with their agile text mining approach to obtain insights from unstructured text.

Getting structures out of text

You probably already know that ChemAxon’s Name to Structure, Document to Structure and/or Structure to Name tools can recognize and extract chemical names and structures embedded in the text of documents, producing a list of names and/or structures and their locations in a document. These tools search text for compounds expressed in a wide variety of ways, from standardized IUPAC names and systematic formulae to common or proprietary drug names. Names and structures can also be extracted into desktop tools like Instant JChem and JChem for Excel, and you can automatically display structure-based property predictions to help filter or sort your results. (If you haven’t already tried out this technology, take a look at chemicalize.org, ChemAxon’s free, public browsing tool.)

Example output of structures and properties from a web search using chemicalize.org

But what is text mining?

According to David, text mining is a way to automatically process large volumes of text, such as patents, scientific literature, safety reports and electronic health records, so that you can understand the actual meaning of sentences and phrases in the text and, from that, extract structured results, discover trends and answer questions. The understanding comes from natural language processing combined with regular expressions and terminologies (including domain-specific terms), which map the different ways words are used in the text to their semantics (the meaning). The example below shows different ways people might express a single concept found by text mining, in this case risk factors for diabetes.

Text mining finds different ways the concept ‘risk factor for diabetes’ is expressed
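To make the idea of mapping varied wording to one meaning concrete, here is a minimal Python sketch. It is not how Linguamatics or ChemAxon implement text mining; the tiny terminology, the three phrasing patterns and the function name find_risk_factors are all assumptions made up for illustration, and a real system would rely on full natural language processing rather than a handful of regular expressions.

```python
import re

# Hypothetical mini-terminology: surface forms mapped to one preferred term.
DIABETES_TERMS = {
    "diabetes": "diabetes",
    "diabetes mellitus": "diabetes",
    "type 2 diabetes": "diabetes",
    "t2dm": "diabetes",
}

# Try the longest surface forms first so "type 2 diabetes" wins over "diabetes".
_disease = "|".join(
    re.escape(term) for term in sorted(DIABETES_TERMS, key=len, reverse=True)
)

# Illustrative phrasings of the same relation (assumed, far from exhaustive).
PATTERNS = [
    re.compile(
        rf"(?P<factor>[\w\s-]+?) is a (?:known |major )?risk factor for "
        rf"(?P<disease>{_disease})",
        re.I,
    ),
    re.compile(
        rf"(?P<factor>[\w\s-]+?) increases the risk of (?P<disease>{_disease})",
        re.I,
    ),
    re.compile(
        rf"(?P<factor>[\w\s-]+?) predisposes (?:patients |people )?to "
        rf"(?P<disease>{_disease})",
        re.I,
    ),
]


def find_risk_factors(sentence):
    """Return (factor, relation, normalized disease) triples found in a sentence."""
    results = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            factor = match.group("factor").strip().lower()
            disease = DIABETES_TERMS[match.group("disease").lower()]
            results.append((factor, "risk_factor_for", disease))
    return results
```

Differently worded sentences such as "Obesity is a major risk factor for type 2 diabetes." and "Smoking increases the risk of T2DM." both come back as the same kind of structured triple. The sketch treats everything before the trigger phrase as the factor, which is exactly the sort of shortcut that proper linguistic analysis avoids.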

And when would I use it?

Chemists might typically use text mining to find properties of a chemical or set of chemicals, such as adverse events caused by particular compounds or their relationships to genes or proteins. This analysis could also uncover indirect relationships that are not immediately obvious, such as compound to gene followed by gene to disease, which could be valuable for drug repurposing studies. You can also use text mining to identify and extract numerical information from publications, such as dose-related effects on particular biological targets.
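The indirect relationships mentioned above amount to chaining two sets of extracted facts on a shared entity. The Python sketch below shows that join under stated assumptions: the relation lists are hypothetical placeholders standing in for text-mining output, not real findings, and indirect_links is an illustrative name rather than part of any product.

```python
from collections import defaultdict

# Hypothetical placeholder relations, standing in for text-mining output.
compound_gene = [("compound_A", "GENE1"), ("compound_B", "GENE2")]
gene_disease = [
    ("GENE1", "disease_X"),
    ("GENE2", "disease_Y"),
    ("GENE1", "disease_Z"),
]


def indirect_links(compound_gene, gene_disease):
    """Chain compound->gene and gene->disease facts into compound->disease links."""
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    links = set()
    for compound, gene in compound_gene:
        # Every disease tied to the gene becomes a candidate indirect link.
        for disease in diseases_by_gene.get(gene, ()):
            links.add((compound, gene, disease))
    return sorted(links)
```

Each returned triple keeps the connecting gene, so a reader can trace why a compound and a disease were linked before treating the link as a repurposing lead.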

So, what’s this 1+1?

By combining text mining with chemical search and recognition, you can answer questions that connect chemical and biological knowledge. You can now search through a document collection by substructure and find descriptive and numeric information about novel compounds, however they are represented: systematic names, IUPAC names, colloquial names, or just a mol file. This is particularly important for patent analysis. Substructure search finds chemically relevant hits that text mining alone would not have identified. In addition, structure similarity and other chemical features, including property predictions if these are not present in the text, can be used to sort or filter text-based queries.

Text mining allows filtering based on the role or properties of the chemicals, for example checking that there is a relationship to a gene. It can also summarize information both within and across a set of documents. For example, instead of returning a set of documents to read through, a search can return a frequency-ordered list of diseases associated with a substructure.
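A frequency-ordered summary like this is, at its simplest, a count over extracted mentions. The Python sketch below assumes the upstream steps (substructure search and disease recognition) have already produced document and disease pairs; the hits list is hypothetical placeholder data, and counting each disease once per document is a design choice made here, not something the article specifies.

```python
from collections import Counter

# Hypothetical hits: (document id, disease mention) pairs for compounds
# that matched the query substructure.
hits = [
    ("doc1", "disease_X"),
    ("doc1", "disease_X"),  # repeated mention within the same document
    ("doc2", "disease_X"),
    ("doc2", "disease_Y"),
    ("doc3", "disease_Z"),
    ("doc3", "disease_Y"),
]


def diseases_by_document_frequency(hits):
    """Rank diseases by the number of distinct documents mentioning them."""
    # set() collapses repeated mentions so each document counts once per disease.
    counts = Counter(disease for _doc, disease in set(hits))
    # Sort by descending count, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Counting documents rather than raw mentions keeps one long report that repeats a disease name from dominating the ranking.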

You could, for example, get a better understanding of the chemical space for a particular therapeutic area by looking for a common substructure between new compounds of interest and those in published reports, perhaps filtering by the role of the chemical (e.g. inhibitor) and extracting any quoted measurements (such as NMR data, melting points or doses).

Combining substructure search with text mining to find chemicals with this substructure that act as inhibitors
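As a rough illustration of this combined query, the Python sketch below filters records on a substructure flag, checks the surrounding sentence for an inhibitor role, and pulls out any quoted doses. Everything in it is an assumption for illustration: the records are hypothetical, the substructure_hit flag stands in for the result of a real substructure search performed by a chemistry toolkit, and the two regular expressions are a crude substitute for genuine text mining.

```python
import re

# Hypothetical records. "substructure_hit" stands in for the outcome of a
# real substructure search; it is not computed here.
records = [
    {"name": "compound_A", "substructure_hit": True,
     "sentence": "compound_A inhibited GENE1 at 10 mg/kg."},
    {"name": "compound_B", "substructure_hit": True,
     "sentence": "compound_B was well tolerated at 5 mg."},
    {"name": "compound_C", "substructure_hit": False,
     "sentence": "compound_C is an inhibitor of GENE2 at 2.5 mg/kg."},
]

# Assumed, simplified patterns for the role and for dose measurements.
ROLE = re.compile(r"\binhibit(?:or|ors|s|ed)?\b", re.I)
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg/kg|mg|g)\b")


def inhibitors_with_doses(records):
    """Keep substructure hits described as inhibitors and extract quoted doses."""
    results = []
    for record in records:
        if not record["substructure_hit"]:
            continue  # chemical filter: wrong substructure
        if not ROLE.search(record["sentence"]):
            continue  # text filter: not described in an inhibitor role
        doses = [
            (float(value), unit)
            for value, unit in DOSE.findall(record["sentence"])
        ]
        results.append((record["name"], doses))
    return results
```

Only compound_A survives both filters here: compound_B matches the substructure but is not described as an inhibitor, and compound_C is an inhibitor without the substructure. Neither filter alone would have produced that answer.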

To find out more:
Explore chemicalize.org
David Milward’s presentation from the US UGM (September 2012)