Digitalizing research data: explore your hidden knowledge base from legacy MS Office (& PDF) documents

Posted on 13 September 2021

The Client

Working alongside big pharma, biotechs, public research institutions and investment groups, our client guides the research and development of new therapeutic and diagnostic tools. Many pharma and biotech companies still keep a lot of information in legacy sources: valuable data sits in file repositories scattered across multiple sites, or even on individual computers. As data becomes ubiquitous, it needs to be stored in a way that is better managed and easier to access. One client with exactly this issue approached ChemAxon and asked whether we offered a solution. In particular, the client had a substantial amount of data on chemical reactions scattered across hundreds of Microsoft Word .docx files.

The Solution

In the beginning, there was no clear consensus on how to extract data from old documents, so ChemAxon agreed with the client on a short, exploratory, proof-of-concept project. The client had only asked whether we could manage the task; ChemAxon Consultancy took a proactive approach and asked for a data sample to try it out.

Unlike .csv files or even Microsoft Excel documents, Microsoft Word documents are notoriously hard to read with external data science tools. To tackle this problem, we opted to treat Word documents as websites. Using the mammoth library, we first converted each .docx file into a website-like (HTML) object. This turned a hard-to-read proprietary format into one that is ready for web scraping. What remained was to understand the structure of the files, extract all the useful chemical information and provide the customer with the results.

Of the more than 2100 Word documents analyzed, reactions were successfully extracted from 1660. In the remaining cases, the extraction was interrupted by problems in the source documents themselves: images, arrows or editing mistakes. These can be rectified by editing the source documents.

In the end, ChemAxon Consultancy was able to respond to a specific client need with an approach tailor-made for that customer. By taking legacy data via websites into a new integrated database, we allowed the customer to harness all their valuable chemical data at once.
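The "Word document as website" idea can be illustrated with a short Python sketch. It is not the code used in the project: the post does not describe the layout of the client's files or how the reactions were encoded, so the sketch only collects generic pieces (paragraph text, table rows and a count of embedded images). It assumes the Python `mammoth` package and its `convert_to_html` function; the names `docx_to_html`, `DocumentScraper` and `scrape_html` are hypothetical, and the standard-library `html.parser` module stands in for whatever scraping tool was actually used.

```python
from html.parser import HTMLParser


def docx_to_html(path):
    """Convert one .docx file to an HTML string with mammoth.

    Assumption: the Python `mammoth` package (pip install mammoth).
    It is imported lazily so the scraping part below runs without it.
    """
    import mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    # result.messages holds any warnings mammoth raised during conversion.
    return result.value, result.messages


class DocumentScraper(HTMLParser):
    """Collect paragraph text, table rows and an image count from HTML."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # text of each non-empty <p> outside tables
        self.table_rows = []   # one list of cell strings per <tr>
        self.image_count = 0   # number of <img> tags seen
        self._para = None      # text pieces of the currently open <p>
        self._cell = None      # text pieces of the currently open cell
        self._row = None       # cells of the currently open <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.image_count += 1
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []
        elif tag == "p" and self._cell is None:
            self._para = []

    def handle_data(self, data):
        # Text inside a table cell belongs to the cell, not to a paragraph.
        if self._cell is not None:
            self._cell.append(data)
        elif self._para is not None:
            self._para.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._cell is None and self._para is not None:
            text = "".join(self._para).strip()
            if text:
                self.paragraphs.append(text)
            self._para = None
        elif tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.table_rows.append(self._row)
            self._row = None


def scrape_html(html):
    """Parse an HTML string and return the filled-in scraper."""
    scraper = DocumentScraper()
    scraper.feed(html)
    scraper.close()
    return scraper
```

In use, the HTML string returned by `docx_to_html` would be passed to `scrape_html`, and the resulting `paragraphs` and `table_rows` handed to whatever domain-specific step recognizes reactions. A non-zero `image_count` could be one way to flag a document for manual review, since images were among the reasons the extraction was interrupted.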

Customer feedback

“So we can conclude that, if the reaction in the Word document is correctly drawn, then the extraction is correctly done. This project can be considered as complete. Thanks to ChemAxon’s consultants for their help and reactivity in our exchanges.”