Presentation

Digitalizing research data: explore your hidden knowledge base from legacy MS Office (& pdf) documents

Posted on 2021-09-13

The Client

Working alongside big pharma, biotechs, public research institutions and investment groups, our client guides the research and development of new therapeutic and diagnostic tools. Many pharma and biotech companies still keep a great deal of information in legacy sources: valuable data is scattered across file repositories at multiple sites, or even across individual computers. With data becoming ubiquitous, it needs to be stored in a way that is better managed and easier to access. One client facing this issue approached ChemAxon and asked whether we offered a solution. In particular, the client had a substantial amount of data on chemical reactions scattered across hundreds of Microsoft Word .docx files.

The Solution

At the outset there was no clear consensus on how to extract data from old documents, so ChemAxon agreed with the client on a short, exploratory proof-of-concept project. The client only asked whether we could manage the task; ChemAxon Consultancy took a proactive approach and asked for a data sample to try it out.

Unlike .csv files or even Microsoft Excel workbooks, Microsoft Word documents are notoriously hard to read with external data science tools. To tackle this problem, we opted to treat Word documents as websites. Using the mammoth library, we first converted each .docx file into a website-like (HTML) object. This turned a hard-to-read proprietary format into one that is ready for web scraping. What remained was to understand the structure of the files, extract all the useful chemical information and deliver the results to the customer.

Of the more than 2100 Word documents analyzed, reactions were successfully extracted from 1660. In the remaining cases, extraction failed because of problems in the source documents themselves: images, arrows or editing mistakes. These can be rectified by editing the source documents.

In the end, ChemAxon Consultancy was able to respond to a specific client need with an approach tailor-made for that customer. By taking legacy data via websites into a new integrated database, we allowed the customer to harness all their valuable chemical data at once.
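The "Word document as website" step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes the Python port of mammoth (the text does not say which port was used), and it scrapes the resulting HTML with the standard library's html.parser rather than any particular scraping framework. The names docx_to_html, BlockCollector and scrape_html are hypothetical, and what gets collected here (headings, paragraphs, list items and an image count) only stands in for the chemical information, since the structure of the client's files is not described.

```python
from html.parser import HTMLParser


def docx_to_html(path):
    """Convert a .docx file to an HTML string (assumes the mammoth package)."""
    import mammoth  # third-party, imported lazily: pip install mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    # result.messages lists conversion warnings; worth logging in practice.
    return result.value


class BlockCollector(HTMLParser):
    """Collect the text of block-level elements and count embedded images."""

    BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # (tag, text) pairs in document order
        self.image_count = 0   # images are a hint that manual review is needed
        self._open_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.image_count += 1
        elif tag in self.BLOCK_TAGS:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            text = "".join(self._buffer).strip()
            if text:  # skip empty blocks, e.g. a paragraph holding only an image
                self.blocks.append((tag, text))
            self._open_tag = None


def scrape_html(html):
    """Return (blocks, image_count) for an HTML string."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks, collector.image_count


if __name__ == "__main__":
    # Inline HTML stands in for docx_to_html("some_report.docx") (hypothetical file).
    sample = "<h1>Reaction 12</h1><p>Yield: <strong>85%</strong></p>"
    for tag, text in scrape_html(sample)[0]:
        print(tag, text)
```

Keeping the conversion and the scraping in separate functions means the scraping logic can be developed and checked against small HTML snippets before it is run over the full set of documents.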

Customer feedback

“So we can conclude that, if the reaction in the Word document is correctly drawn, then the extraction is correctly done. This project can be considered as complete. Thanks to ChemAxon’s consultants for their help and reactivity in our exchanges.”
