Automation of building reliable models

Posted by Ákos Tarcsay on 13 September 2021

The volume and velocity of bioactivity data available in public or in-house sources represent an immense opportunity for novel compound design. An ever-wider array of targets with labelled data calls for efficient ways to build a large number of individual models. The pace of data growth makes it possible to reach higher accuracy by continuously re-training existing models. Automatic re-training maximizes the applicability domain and minimizes the risk of an accuracy drop as a project expands into novel chemical series. Recognizing these requirements, we launched a project to develop an automated model-building solution that relies on ChemAxon chemical toolkits and the Smile Java library.

Validating predictive power and reliability is a key factor in machine learning. To estimate the prediction error, we implemented and tested the conformal prediction framework. Applicability domain calculations based on chemical and descriptor space similarity were introduced to support the assessment of the predicted values. We will present a summary of descriptor selection, machine learning algorithms (random forest, RF, and support vector regression, SVR) and hyperparameter optimization for a bioactivity data set covering more than 150 ChEMBL targets. The data sets in this pool vary in size (from hundreds to thousands) and cover a large spectrum of pharmaceutically relevant targets. Our results showed a median Pearson correlation above 0.8 for these targets, measured on the test sets. hERG ion channel inhibition is one of the most important safety-related off-target effects. Related liabilities should be recognized and filtered out early in drug design. As a case study, we present detailed results on hERG model development.
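
To make the conformal prediction idea mentioned above more concrete, here is a minimal Python sketch of the inductive (split) variant for regression, using absolute residuals on a held-out calibration set as nonconformity scores. It is only an illustration under stated assumptions, not the implementation used in the project (which relies on ChemAxon toolkits and the Smile Java library). The function names and all numbers are hypothetical.

```python
import math


def conformal_quantile(calibration_residuals, confidence):
    """Half-width of the prediction interval at the given confidence level.

    calibration_residuals: observed minus predicted values on a calibration
    set that was NOT used to train the model.
    confidence: e.g. 0.8 for 80% intervals.
    """
    # Nonconformity score: absolute residual (an assumption of this sketch).
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration points to support this confidence level.
        return math.inf
    return scores[rank - 1]


def prediction_interval(point_prediction, half_width):
    """Symmetric interval around a point prediction."""
    return (point_prediction - half_width, point_prediction + half_width)


# Hypothetical calibration data (e.g. pIC50 values), for illustration only.
observed = [5.1, 6.3, 7.0, 4.8, 6.9, 5.5, 7.4, 6.1, 5.9]
predicted = [5.4, 6.0, 6.8, 5.3, 7.0, 5.1, 7.9, 6.2, 5.2]
residuals = [o - p for o, p in zip(observed, predicted)]

half_width = conformal_quantile(residuals, confidence=0.8)
low, high = prediction_interval(6.5, half_width)
print(f"80% prediction interval for a new compound: [{low:.2f}, {high:.2f}]")
```

With this choice of nonconformity score every compound gets an interval of the same width. A normalized score, for example one that divides the residual by an estimate of prediction difficulty, would give compound-specific widths, at the cost of an extra model or heuristic.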