Building machine learning models using relevant features

Posted by
Ákos Tarcsay
on 27 June 2023


Selecting all relevant descriptors in the context of the labeled data is a fundamental step towards building accurate machine learning models. It is common to initially generate all available descriptors, since their importance is not yet known, yet typically only a subset of these variables is relevant for the target and the given task (classification or regression). Reducing the number of variables has multiple advantageous effects: it speeds up model training and may increase model accuracy by reducing overfitting. In fact, the identification of relevant descriptors can be considered an additional result of the training, since relevant features can be uncovered by interpreting and explaining the underlying mechanism of the built model.

We present here the results obtained by implementing the Boruta algorithm in Chemaxon’s Trainer Engine to select all relevant features. The Boruta algorithm [1, 2] selects variables based on their feature importance values. The selection is driven by statistical significance tested against a baseline importance derived from deliberately introduced noise. This baseline is estimated by augmenting each descriptor with an additional “shadow” variable, created by randomly permuting its values so that the original distribution is preserved. Models are trained on the extended descriptor set containing both the original descriptors and their shadow counterparts. During the iterative process, feature importances are extracted from ensemble tree models, and original descriptors are consecutively dropped if their importance is not significantly higher than that of the shadow descriptor pool.
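
To make the procedure concrete, here is a minimal sketch of this shadow-feature scheme using the open-source boruta_py package [2] together with a scikit-learn random forest. The descriptor matrix X and labels y below are synthetic placeholders, not data from our study, and the hyperparameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy  # the boruta_py package [2]

# Synthetic placeholder data: 500 samples, 50 descriptors,
# with the label depending only on descriptors 0 and 3.
rng = np.random.RandomState(42)
X = rng.rand(500, 50)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)

# Ensemble tree model that supplies the feature importance values.
forest = RandomForestClassifier(max_depth=5, n_jobs=-1)

# BorutaPy shuffles each descriptor into a "shadow" copy, trains on the
# extended descriptor set, and iteratively drops descriptors whose
# importance is not significantly higher than the shadow importances.
selector = BorutaPy(forest, n_estimators='auto', alpha=0.05,
                    max_iter=100, random_state=42)
selector.fit(X, y)

print("confirmed descriptors:", np.flatnonzero(selector.support_))
X_relevant = selector.transform(X)  # reduced, all-relevant descriptor set
```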

In our study we applied the Boruta algorithm to a large number of targets to compare the accuracy of models built with the full descriptor sets against models built with the reduced, all-relevant set of features. Since the feature importance values are influenced by the hyperparameters of the underlying model, we also investigated the effect of hyperparameters on the feature selection. We focused on hyperparameters related to regularization, especially the mTry parameter of ensemble trees, i.e. the number of descriptors considered as split candidates at each node.
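
As an illustration of this sensitivity, the hedged sketch below varies the scikit-learn counterpart of mTry (the max_features argument of the random forest) and reruns the selection on the same synthetic placeholder data as above; it shows the kind of experiment involved, not the exact protocol of our study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Same style of synthetic placeholder data as in the previous sketch.
rng = np.random.RandomState(0)
X = rng.rand(500, 50)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)

# max_features is scikit-learn's analogue of mTry: the number of
# descriptors sampled as split candidates at each tree node.
for max_features in ('sqrt', 'log2', 0.5):
    forest = RandomForestClassifier(max_features=max_features,
                                    max_depth=5, n_jobs=-1)
    selector = BorutaPy(forest, n_estimators='auto', random_state=42)
    selector.fit(X, y)
    print(f"max_features={max_features!r}: "
          f"{selector.support_.sum()} descriptors confirmed")
```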

This presentation will discuss results on a large set of ChEMBL targets and on individual ADMET-related targets from the Therapeutics Data Commons [3]. Additionally, we present associations between the identified chemical features and the biological targets.


[1] Kursa MB, Jankowski A, Rudnicki WR. Boruta – a system for feature selection. Fundam Inform. 2010; 101(4):271–285. doi: 10.3233/FI-2010-288.

[2] https://github.com/scikit-learn-contrib/boruta_py

[3] Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. Artificial intelligence foundation for therapeutic science. Nat Chem Biol. 2022; 18(10):1033–1036. doi: 10.1038/s41589-022-01131-2.

 
