SureChEMBL - Open Patent Data

Posted by Mark Davies on 13 09 2014

Historically, the cost of access to structured chemical data extracted from patents has been prohibitively high for many researchers working in the field of Drug Discovery. The benefit of delivering this dataset to the scientific community in a free and open manner cannot be overstated. Aware of the demand for such a service, the European Bioinformatics Institute (EMBL-EBI) acquired the SureChem patent system from Digital Science Ltd. in December 2013. The service has been re-branded SureChEMBL and is run by the ChEMBL group alongside existing Open Drug Discovery and research resources such as the ChEMBL database and UniChem. This talk will provide an overview of the existing system architecture, including ChemAxon software, and describe how we go from patent literature to structured chemical data, accessible via a Web Interface and an API. The challenges of migrating such a complex system will be discussed, as well as the opportunities to enhance the data processing pipeline based on prior knowledge from running large chemical resources. In addition to providing an overview of the system, we will describe our future plans for SureChEMBL. To date, these plans include extending the functionality of the entity extractor to identify additional entities important in the Drug Discovery process, such as protein targets, diseases and cell lines. Other plans focus on integration with existing EMBL-EBI resources, such as the ChEMBL database and Europe PubMed Central. Finally, we look towards new and exciting ways to share the data, such as integration with Semantic Web technologies and distribution via private Virtual Machine instances.

Open slides in pdf