Chemical text mining for current awareness of pharmaceutical patents

presentation · 7 years ago
by Sorel Muresan, Roger Sayle, Daniel M. Lowe, Paul Hongxing Xie (AstraZeneca, NextMove Software)
The increasing rate of pharmaceutical patent publication makes keeping current in medicinal chemistry ever more difficult for the practising research chemist. The USPTO alone publishes over 6000 applications each week. Identifying in a timely manner which of these are relevant to an ongoing project is a challenging but critical task. This talk will describe a system for automatically downloading US grants and applications as they are published, extracting and mining relevant information, and storing the results in databases that are searched through web-based interfaces. Difficulties include handling the variety of file formats used, selecting the pharmaceutically relevant subset for analysis, indexing structured data, and text mining unstructured data for entities such as chemicals, targets and diseases. Entities (chemical and Markush structures, R-groups and reactions) are extracted from text and images. A set of key “index” compounds is prioritized and used to cluster documents based on their chemical content.
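
To make the final clustering step more concrete, here is a minimal sketch of one way documents could be grouped by shared chemical content. It is not the authors' method: the abstract does not say how the clustering is done, so the use of Jaccard similarity, single-linkage grouping and the 0.5 threshold are all assumptions, and the document and compound identifiers are hypothetical placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(doc_compounds, threshold=0.5):
    """Group documents whose compound sets overlap strongly.

    Assumption: single-linkage clustering, i.e. two documents share a
    cluster if a chain of pairwise similarities >= threshold links them.
    """
    # Union-find structure: every document starts as its own cluster.
    parent = {doc: doc for doc in doc_compounds}

    def find(doc):
        # Follow parent links to the cluster root, compressing the path.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    # Merge the clusters of every sufficiently similar pair.
    for d1, d2 in combinations(doc_compounds, 2):
        if jaccard(doc_compounds[d1], doc_compounds[d2]) >= threshold:
            parent[find(d1)] = find(d2)

    clusters = {}
    for doc in doc_compounds:
        clusters.setdefault(find(doc), []).append(doc)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical input: document id -> identifiers of its "index" compounds.
docs = {
    "US-A": {"cmpd1", "cmpd2", "cmpd3"},
    "US-B": {"cmpd2", "cmpd3", "cmpd4"},
    "US-C": {"cmpd9"},
}

print(cluster_documents(docs))
```

In this toy input, US-A and US-B share two of four distinct compounds, so they meet the threshold and are grouped, while US-C stays on its own. A real system would more likely compare compounds by structural similarity than by exact identifier match.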