Building machine learning models using relevant features

Posted by
Ákos Tarcsay
on 27 June 2023


Selecting all relevant descriptors in the context of the labeled data is a fundamental step towards building accurate machine learning models. It is common to initially generate all available descriptors, since their importance is not yet known, yet typically only a subset of these variables is relevant for the target and the given task (classification or regression). Reducing the number of variables has multiple advantageous effects: it speeds up model training and may increase model accuracy by reducing overfitting. In fact, the identification of relevant descriptors can be considered an additional result of the training, since relevant features can be uncovered by interpreting and explaining the underlying mechanism of the built model.

We present here the results obtained by implementing the Boruta algorithm in Chemaxon’s Trainer Engine to select all relevant features. The Boruta algorithm [1, 2] selects variables based on their feature importance values. The selection is driven by statistical significance tested against a baseline importance derived from deliberately introduced noise. This baseline is estimated by augmenting each descriptor with an additional “shadow” variable, created by randomly permuting its values so that the original distribution is preserved. Models are trained on the extended descriptor set containing both the original descriptors and their shadow counterparts. During the iterative process, feature importances are extracted from ensemble tree models, and original descriptors are consecutively dropped if their importance is not significantly higher than that of the shadow descriptor pool.
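
To make the procedure concrete, here is a minimal sketch of this shadow-feature scheme using the open-source boruta_py package [2] together with a scikit-learn random forest. The descriptor matrix X and labels y below are synthetic placeholders, not data from our study, and the hyperparameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy  # the boruta_py package [2]

# Synthetic placeholder data: 500 samples, 50 descriptors,
# with the label depending only on descriptors 0 and 3.
rng = np.random.RandomState(42)
X = rng.rand(500, 50)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)

# Ensemble tree model that supplies the feature importance values.
forest = RandomForestClassifier(max_depth=5, n_jobs=-1)

# BorutaPy shuffles each descriptor into a "shadow" copy, trains on the
# extended descriptor set, and iteratively drops descriptors whose
# importance is not significantly higher than the shadow importances.
selector = BorutaPy(forest, n_estimators='auto', alpha=0.05,
                    max_iter=100, random_state=42)
selector.fit(X, y)

print("confirmed descriptors:", np.flatnonzero(selector.support_))
X_relevant = selector.transform(X)  # reduced, all-relevant descriptor set
```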

In our study we applied the Boruta algorithm to a large number of targets to compare the accuracy of models built with the full descriptor sets against models built with the reduced, all-relevant set of features. Since the feature importance values are influenced by the hyperparameters of the underlying model, we also investigated the effect of hyperparameters on the feature selection. We focused on hyperparameters related to regularization, especially the mTry parameter of ensemble trees, i.e. the number of descriptors considered as split candidates at each node.
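
As an illustration of this sensitivity, the hedged sketch below varies the scikit-learn counterpart of mTry (the max_features argument of the random forest) and reruns the selection on the same synthetic placeholder data as above; it shows the kind of experiment involved, not the exact protocol of our study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Same style of synthetic placeholder data as in the previous sketch.
rng = np.random.RandomState(0)
X = rng.rand(500, 50)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)

# max_features is scikit-learn's analogue of mTry: the number of
# descriptors sampled as split candidates at each tree node.
for max_features in ('sqrt', 'log2', 0.5):
    forest = RandomForestClassifier(max_features=max_features,
                                    max_depth=5, n_jobs=-1)
    selector = BorutaPy(forest, n_estimators='auto', random_state=42)
    selector.fit(X, y)
    print(f"max_features={max_features!r}: "
          f"{selector.support_.sum()} descriptors confirmed")
```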

This presentation will discuss results on a large set of ChEMBL targets and on individual ADMET-related targets from the Therapeutics Data Commons [3]. Additionally, we present associations between the identified chemical features and the biological targets.


[1] Kursa MB, Jankowski A, Rudnicki WR. Boruta – a system for feature selection. Fundam Inform. 2010; 101(4):271–285. doi: 10.3233/FI-2010-288.

[2] https://github.com/scikit-learn-contrib/boruta_py

[3] Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. Artificial intelligence foundation for therapeutic science. Nat Chem Biol. 2022; 18(10):1033–1036. doi: 10.1038/s41589-022-01131-2.

 
