PubChem data extraction and integration using Instant JChem
PubChem is the largest known public repository of biological and chemical data. One of its goals is to archive structure-bioactivity data and make it available. Data mining and machine learning tools such as SVMs, decision trees and Bayesian models, together with advanced search tools such as Markush structure searching, custom fingerprints, and 2D and 3D similarity, are essential during post-HTS analysis. Intellectual property monitoring is another key aspect of PubChem data mining. Extracting chemical and associated bioassay data from PubChem, then processing, archiving and storing it in an in-house-compatible system, is therefore an important task that can increase the speed and processing capacity available for the above analyses. Once PubChem data is available off-line, integrating it with existing in-house HTS or other bioactivity databases becomes equally important; the resulting integrated system enables researchers to query, investigate and profile compounds in a much broader context. We used Instant JChem to create and populate a database containing the MLSMR library and its associated bioactivity data from PubChem; further integration with WOMBAT was also performed. The resulting database can be accessed through the standard JDBC interface with the JChem API, and is amenable to custom data mining tools and search algorithms. Lessons learned from PubChem data mining will be discussed.