Biological, chemical and physical properties of molecules are encoded in their molecular structure. The challenge lies in discovering the relationships between the molecular graphs and the measured activity. Where data is measured, collected and curated for a series of compounds there is an opportunity to find the hidden relationships.
Chemical structures come in various shapes and sizes, depending on the scientists or even algorithms that create them. Though variability may sometimes seem subtle to a trained chemist’s eyes, these can introduce inconsistencies that impair chemical search algorithms or model building. Structure normalization is a key component of any cheminformatics workflow with an often underestimated significance. Finding relationships between chemical structures and their measured properties primarily relies on the representation of the chemical matter. Variability of the calculated features and descriptors for these representations can influence data analysis and accuracy of the predictions. During the first part of the presentation we will present the effect of chemical normalization on investigating correlations and building predictive models.
The second part of the talk will incorporate the results of model building on 163 ChEMBL targets extracted from the bioactivity benchmark set1. Results with different descriptor generation methods including ECFP fingerprints, MACCS key, structural properties, geometry properties and phy-chem properties will be discussed in detail. This part focuses on summarizing the results of more than 3000 Random Forest models. Finally model development for ADMET targets will be highlighted including hERG cardiotoxicity prediction, permeability and blood brain barrier penetration. We will describe how these models can be built, analyzed, optimized and deployed using our new machine learning platform.