Building machine learning models using relevant features
Selecting all relevant descriptors in the context of the labeled data is a fundamental step towards building accurate machine learning models. It is common to initially generate all available descriptors, since their importance is not yet known, yet typically only a subset of these variables is relevant to the target and the given task (classification or regression). Reducing the number of variables has multiple advantageous effects: it speeds up model training and may increase the accuracy of the algorithms by reducing overfitting. In fact, identifying the relevant descriptors can be considered an additional result of the training, since relevant features can be uncovered by interpreting and explaining the underlying mechanism of the built model.
We present here the results obtained by implementing the Boruta algorithm in Chemaxon’s Trainer Engine to select all relevant features. The Boruta algorithm [1, 2] selects variables based on their feature importance values. Selection is driven by a statistical significance test against a baseline importance derived from deliberately introduced noise. The baseline is estimated by pairing each descriptor with an additional “shadow” variable, produced by randomly permuting its values so that the original distribution is preserved. Models are then trained on the extended descriptor set containing both the original descriptors and their shadow copies. During the iterative process, feature importances are extracted from ensemble tree models, and original descriptors are successively dropped if their importance is not significantly higher than that of the shadow descriptor pool.
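The shadow-feature idea at the core of the algorithm can be sketched in a few lines. The following is a simplified, single-round illustration on synthetic data using scikit-learn (not Chemaxon’s Trainer Engine); the full Boruta procedure repeats this comparison over many iterations and applies a binomial significance test before accepting or rejecting a descriptor:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: the first 5 descriptors are informative, the last 5 are pure noise.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# "Shadow" copies: each column permuted independently,
# which destroys the association with y but preserves the distribution.
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
X_ext = np.hstack([X, X_shadow])

# Train an ensemble tree model on originals + shadows and extract importances.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_ext, y)
imp = forest.feature_importances_
real_imp, shadow_imp = imp[:10], imp[10:]

# A descriptor scores a "hit" if it beats the best shadow importance;
# Boruta tallies hits across iterations and tests them for significance.
hits = real_imp > shadow_imp.max()
print("informative hits:", hits[:5])
print("noise hits:      ", hits[5:])
```

The ready-made iterative implementation is available as `BorutaPy` in the `boruta_py` package referenced in [2].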
In our study we applied the Boruta algorithm to a large number of targets to compare the accuracy of models built with the full descriptor sets versus those built with the reduced, relevant set of features. Since feature importance is influenced by the hyperparameters, we also investigated their effect on feature selection, focusing on regularization-related hyperparameters, especially the mTry parameter of ensemble trees.
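The influence of mTry on feature importance can be illustrated directly. mTry is the random-forest terminology for the number of descriptors sampled at each split; in scikit-learn the corresponding parameter is `max_features`. The sketch below (an illustration on synthetic data, not the study’s actual setup) shows how a small mTry spreads importance across many features while a large mTry concentrates it on the strongest ones, which in turn shifts which descriptors clear Boruta’s shadow baseline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative descriptors among 20; the rest are noise.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

share = {}
for m in (1, None):  # mTry = 1 vs. mTry = all 20 features per split
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                random_state=0).fit(X, y)
    # Fraction of total importance captured by the 3 top-ranked descriptors.
    share[m] = np.sort(rf.feature_importances_)[-3:].sum()
    print(f"max_features={m}: top-3 importance share = {share[m]:.2f}")
```

With `max_features=None` every split greedily picks among all descriptors, so importance piles onto the informative ones; with `max_features=1` each split is forced onto a random descriptor and importance is diluted across the noise features.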
This presentation will discuss results on a large set of ChEMBL targets and on individual ADMET-related targets from the Therapeutics Data Commons [3]. Additionally, we present associations between the identified chemical features and the biological targets.
[1] Kursa MB, Jankowski A, Rudnicki WR, Boruta—a system for feature selection. Fundam. Inform. 2010; 101(4):271–285. doi: 10.3233/FI-2010-288
[2] https://github.com/scikit-learn-contrib/boruta_py
[3] Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M, Artificial intelligence foundation for therapeutic science. Nat Chem Biol. 2022; 18(10):1033-1036. doi: 10.1038/s41589-022-01131-2.