In a previous paper on the use of AI/ML tools in drug discovery (with a focus on ML-based models and predictions), we cited an observation that "AI won’t replace medicinal chemists, but medicinal chemists who use AI will replace those who don’t". We also noted that medicinal chemists share three main areas of concern about these ML-based models and techniques, which make them wary of their output and predictions, and less likely to use them in their drug discovery projects.
This paper recaps these three worrisome areas and discusses how they can be addressed and mitigated so that medicinal chemists can have confidence in the ML-based models and can take full advantage of their novel outputs to make rapid, better-informed structure-based decisions.
Ensuring the Quality and Availability of Suitable Data
If an ML algorithm is to successfully generate, train, and validate accurate and believable models and property predictions, it will require large amounts of high-quality chemical structure and property data as input. Often such data is only available in separate file locations, in disparate formats (e.g. molfiles, SMILES, InChI), and with inconsistent chemical representation conventions (e.g. salts/solvates, tautomers/mesomers, aromaticity).
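As a small illustration of the format problem, the sketch below guesses which of the three formats mentioned above a single structure record uses, based only on its text. The function name and the heuristics are assumptions made for this example; a production pipeline would rely on a cheminformatics toolkit's own readers rather than string matching.

```python
def guess_structure_format(text: str) -> str:
    """Guess the format of one structure record from its content (heuristic only)."""
    stripped = text.strip()
    # Standard InChI identifiers always start with the "InChI=" prefix.
    if stripped.startswith("InChI="):
        return "inchi"
    # Molfile connection tables carry a V2000/V3000 tag on the counts line
    # and finish with an "M  END" line.
    if "M  END" in text or "V2000" in text or "V3000" in text:
        return "molfile"
    # Treat any other single whitespace-free token as a SMILES string.
    if stripped and len(stripped.split()) == 1:
        return "smiles"
    return "unknown"


print(guess_structure_format("InChI=1S/CH4/h1H4"))  # methane as InChI
print(guess_structure_format("CCO"))                # ethanol as SMILES
```

Sorting records by format in this way is only a first step; the representation conventions within each format still need to be harmonized.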
These issues can be addressed by chemically intelligent pre-processing tools such as Chemaxon Standardizer and Structure Checker. Standardizer can read input structure files in all the commonly used formats and then apply structure-format business rules to create consistent, canonical structures across a whole dataset. Up to 40 different customizable rules can be applied, dealing with areas of potential variation such as:
- Explicit hydrogens
- Aliases and labels
- Salts and solvates
- Abbreviations and repeated groups
- Mesomers, tautomers, and aromaticity
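The idea of applying an ordered set of rules to every record can be sketched in a few lines. This is a toy illustration, not Chemaxon Standardizer or its API: the rule names are hypothetical, and the rules work on SMILES text, whereas a real standardizer operates on the molecular graph.

```python
import re

# One atom token in a SMILES string: a bracket atom, a two-letter halogen,
# or a one-letter organic-subset atom (aliphatic or aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def count_atoms(fragment: str) -> int:
    """Count the atom tokens in a SMILES fragment (implicit hydrogens are ignored)."""
    return len(ATOM_TOKEN.findall(fragment))


def drop_title(record: str) -> str:
    """Hypothetical rule: keep the SMILES and drop any trailing compound name."""
    parts = record.split()
    return parts[0] if parts else ""


def keep_largest_fragment(smiles: str) -> str:
    """Hypothetical rule: crude salt/solvate stripping by keeping the biggest fragment."""
    return max(smiles.split("."), key=count_atoms)


# Rules run in order, so the list itself documents the chosen convention.
RULES = [drop_title, keep_largest_fragment]


def standardize(record: str, rules=RULES) -> str:
    for rule in rules:
        record = rule(record)
    return record


print(standardize("CCN.Cl ethylamine_hydrochloride"))
print(standardize("[Na+].CC(=O)[O-]"))
```

Keeping the rules in a single ordered list means the same convention is applied to every dataset, and changing the convention is a matter of editing that list.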
Once the structures have been standardized, they can be checked by Chemaxon Structure Checker, which detects and flags a wide range of commonly encountered structure errors and can optionally correct them automatically. These errors include:
- Invalid bond lengths
- Overlapping bonds or atoms
- Molecule charges
- Incorrect chiral flags
- Invalid valences
- OCR errors
This two-pronged approach generates correct, reliable, consistent structure files which can then be fed with confidence to the ML training engine.
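To make the checking step concrete, here is a minimal sketch of two of the checks listed above (invalid valences and charged atoms) on a tiny in-memory molecule. The `Molecule` class, the valence table, and the issue labels are all assumptions made for this example and do not reflect how Structure Checker is implemented.

```python
from dataclasses import dataclass

# Typical maximum valences for a few neutral elements (illustrative subset).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


@dataclass
class Molecule:
    atoms: list  # list of (element, formal_charge) tuples
    bonds: list  # list of (atom_index_a, atom_index_b, bond_order) tuples


def check_structure(mol: Molecule) -> list:
    """Return a list of (atom_index, issue) pairs; an empty list means no issue found."""
    bond_order_sum = [0] * len(mol.atoms)
    for a, b, order in mol.bonds:
        bond_order_sum[a] += order
        bond_order_sum[b] += order

    issues = []
    for idx, (element, charge) in enumerate(mol.atoms):
        if charge != 0:
            # Charged atoms are only flagged for review, not valence-checked.
            issues.append((idx, "charged atom"))
            continue
        limit = MAX_VALENCE.get(element)
        if limit is not None and bond_order_sum[idx] > limit:
            issues.append((idx, "invalid valence"))
    return issues


# A carbon drawn with five single bonds to hydrogen: atom 0 should be flagged.
bad_methane = Molecule(
    atoms=[("C", 0)] + [("H", 0)] * 5,
    bonds=[(0, i, 1) for i in range(1, 6)],
)
print(check_structure(bad_methane))
```

Returning a list of flagged atoms, rather than raising on the first problem, mirrors the detect-and-flag workflow: a curator or an automatic fixer can then decide what to do with each issue.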
Ensuring Interpretability and Transparency of the Underlying Decision-making
Medicinal chemists have a deep-seated disdain for black-box processes that offer neither insight into their underlying decision-making nor options to customize their inner workings or evaluate the predicted descriptors.
This lack of transparency into the ML techniques used in model building and the difficulty in understanding which factors are regarded as important mean that chemists have no easy way to compare models and predictions, and this can lead to distrust.
An ideal solution to overcome these objections needs to build, train, and validate models at scale and to be demonstrably accurate, reliable, and believable. Transparency and interpretability can be provided via powerful interactive visualization and feedback into the model generation process. This includes the ability to:
- Configure analysis views with classification and regression layout presets and optimized tables, charts, and molecule visualizations
- Understand model details, including the number and statistical importance of features and descriptors
- Review prediction accuracy metrics
- Explore relationships between chemical structures and prediction accuracy
- Evaluate model performance
- Optimize model performance via retraining and modified feature selection
- Benchmark and compare the predictive power of different models
- Process single compounds or sets of structures via SDfile input
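One common, model-agnostic way to estimate the importance of individual features is permutation importance: shuffle one descriptor column at a time and measure how much the prediction error grows. The sketch below uses only the Python standard library; the toy model and descriptor values are invented for illustration, and nothing here is claimed to be the method used by any particular product.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Average increase in MAE when each descriptor column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y_true, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with the shuffled value in this one column.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            error = mean_absolute_error(y_true, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Invented model: depends on descriptor 0 only and ignores descriptor 1."""
    return 0.5 * row[0] + 1.0


# Invented descriptor table: two descriptors for six compounds.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 9.0], [6.0, 2.0]]
observed = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, observed)
print(importances)
```

Because the toy model ignores the second descriptor, shuffling that column leaves the error unchanged, while shuffling the first column degrades it. Presenting this kind of ranking next to accuracy metrics gives a chemist a tangible view of what a model relies on.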
Other features of an ideal system are:
- Central repository of generated models that will simplify sharing; allow model evaluation, comparison, and optimization; and avoid duplication and rework
- Integrable with other applications in discovery workflows via REST API interfaces
A system meeting all these requirements should provide sufficient interpretability and transparency to overcome medicinal chemists’ reluctance to use the generated models, facilitate collaboration and broader uptake, and lead to better-informed scientific decision making.
Integrating AI/ML Tools into Existing Drug Discovery Workflows
If validated AI/ML model- and prediction-building tools aren’t broadly deployed and integrated into existing discovery workflows, with familiar and easy-to-use GUIs, they will tend to be underused or ignored by medicinal chemists.
Similarly, if the tools require these users to know where the needed structure and data files are located, how to reformat and combine them, and then how to select and run the most appropriate AI/ML routines to predict and compare the required parameters, they will likely remain the preserve of computational chemists.
A better solution should let medicinal chemists access, select, and use the best models and predictions seamlessly and as needed – e.g. during the lead optimization or compound series triage stages in their own organization’s Design-Make-Test-Analyse cycle.
Medicinal chemists will be most productive and creative when they are working with familiar tools and applications. This optimal environment is possible if reliable, trusted AI/ML-predicted values are seamlessly and immediately available, either in a well-integrated commercial out-of-the-box solution or, via REST APIs, in third-party or in-house-developed custom drug design applications.
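As a rough sketch of what REST-based integration could look like from a custom design application, the code below builds a prediction request using only the Python standard library. The base URL, the `/models/{model_id}/predict` path, the JSON payload shape, and the bearer-token header are all hypothetical; they would need to be replaced with whatever the actual prediction service documents.

```python
import json
import urllib.request


def build_prediction_request(base_url, model_id, smiles_list, token=None):
    """Build (but do not send) a POST request for a hypothetical prediction endpoint."""
    # Hypothetical URL layout; a real service defines its own paths.
    url = f"{base_url.rstrip('/')}/models/{model_id}/predict"
    # Hypothetical payload shape: a JSON object holding a list of SMILES strings.
    payload = json.dumps({"structures": smiles_list}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def fetch_predictions(request, timeout=10.0):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder host and model name; nothing is sent until fetch_predictions is called.
request = build_prediction_request(
    "https://predictions.example.org/api/", "logp-v1", ["CCO", "c1ccccc1"]
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the integration easy to test without network access, and lets the same request builder be reused inside whichever design tool the chemists already work in.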
Optimal Use of AI/ML Tools by Medicinal Chemists
This paper has discussed three types of doubt and concern which may be inhibiting medicinal chemists from making the best use of AI/ML tools (and particularly ML-based property predictors) in their drug discovery and design efforts. These three concerns align closely with the main parts of the machine learning lifecycle:
Machine Learning Lifecycle
- Lack of quality and availability of suitable data
- Lack of interpretability and transparency of the underlying decision-making
- Difficulty integrating AI/ML tools into existing drug discovery workflows
We have also outlined how each of these three areas of concern can be addressed and mitigated using currently existing and deployed tools and applications. These let computational chemists create, test, validate, and deploy reliable and trusted ML-based predicted values; and they supply medicinal chemists with novel and powerful property values – including physical, ADMET, and biological activity – for immediate use in compound design, SAR analyses, filtering virtual chemistry libraries, etc.
For more information on how we use machine learning at Chemaxon, get in touch with one of our expert colleagues.