Use of Extensive Cross-Validation and Bootstrap Application (ExCVBA) for Molecular Modeling of Some Pharmacokinetics Properties

publication · 1 year ago
by Fabio Mendes dos Santos, Hans de Winter, Koen Augustyns, Julio Cesar Dias Lopes (University of Antwerp, Federal University of Minas Gerais)
Calculator Plugins (logP logD pKa etc...)

Abstract

Molecular modeling work can be divided into three equally important steps. The first is the choice of descriptors, which must describe the properties under study accurately. The second is the modeling method, which must be planned carefully to produce the desired response. The third is the validation process, which needs to be properly planned in order to assess the validity of the findings. The most popular validation methods are jackknife, cross-validation and bootstrap. In this work, we present a new method for the validation of molecular modeling studies that combines cross-validation with recursive jackknife modeling. Initially, the instances under study, belonging to two different classes (active/inactive, for instance), are divided into several groups of the same size (typically five to ten). One of these groups is used as an internal validation set, and the remaining groups are recursively divided into two sets, one for training and the other for evaluation. Each of the original groups is used once for internal validation and one or more times for training and evaluation. The number of models generated varies from 20, for five groups, to 840, for 10 groups. Additionally, we apply Y-randomization to each model in order to assess its validity relative to a random model. The full set of generated models must be subjected to external validation, because the extensive cross-validation can assess model validity only within the dataset. If no such external group exists, it can be generated from the original dataset using bootstrap. We applied the approach described above to build models that predict crossing of the blood-brain barrier (BBB), the Ames mutagenicity test and inhibition of five isoforms (3A4, 1A2, 2D6, 2C9 and 2C19) of cytochrome P450.
The descriptors were 3D pharmacophore fingerprints generated with in-house software (3DPharma) together with multiple-species and conformational fuzzification. All structures were subjected to manual pre-treatment (desalting and structure correction) and treatment for multiple tautomers and protomers using ChemAxon software (Structure Checker and Calculator Plugins). Multiple conformations and charges were calculated with the OMEGA and Molcharge software from OpenEye. LibSVM was used to produce the models. For BBB crossing, the models achieved a mean accuracy above 95%, versus a randomized-model accuracy of 80%. For the Ames test, the mean accuracy was 75%, against 50% for the randomized model. For the cytochrome isoforms, the accuracy ranged from 70% to 85%, with randomized-model accuracies between 50% and 70%. It is worth noting that the accuracy produced by Y-randomization reflects the composition of the dataset. The computational cost of the approach presented here is high, but it allows one to assess the validity of the modeling approach, as well as how well the descriptors represent the modeled instances. Although the number of instances used to generate each model is smaller than in a direct (jackknife) approach, the results are of the same order.

Acknowledgement: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) for fellowships (FMS and JCDL), and ChemAxon and OpenEye for academic software licenses (FMS and JCDL).

(PDF) Use of Extensive Cross-Validation and Bootstrap Application (ExCVBA) for Molecular Modeling of Some Pharmacokinetics Properties. Available from: https://www.researchgate.net/publication/282644862_Use_of_Extensive_Cross-Validation_and_Bootstrap_Application_ExCVBA_for_Molecular_Modeling_of_Some_Pharmacokinetics_Properties [accessed Jul 05 2018].

Extensive Cross-validation with ExCVBA

In this work, we present a method for the validation of molecular modeling studies that combines cross-validation with recursive jackknife partitioning (Figure 1). Initially, the instances under study are divided into several groups of the same size (typically five to ten). One of these groups is used as an internal validation set, and the remaining groups are recursively divided into two sets, one for training and the other for evaluation. The number of models generated varies from 20, for five groups, to 840, for 10 groups. Additionally, we apply Y-randomization to each model. The full set of generated models must be subjected to external validation, because the extensive cross-validation can assess model validity only within the dataset. If no such external group exists, it can be generated from the original dataset using bootstrap.
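The partitioning described above can be sketched in a few lines of Python. This is only an illustration under an assumption the text does not confirm: that each model corresponds to one choice of internal validation group plus one choice of evaluation groups among the remaining ones, with the rest used for training. The function names and the `n_eval` parameter are hypothetical. Under that assumption, one evaluation group out of four reproduces the 20 models quoted for five groups, and three evaluation groups out of nine reproduce the 840 models quoted for 10 groups.

```python
from itertools import combinations
from math import comb


def excvba_splits(n_groups, n_eval):
    """Yield (validation, evaluation, training) splits over group indices.

    Assumed scheme (not confirmed by the source): hold out one group for
    internal validation, then pick n_eval of the remaining groups for
    evaluation and train on whatever is left.
    """
    groups = range(n_groups)
    for val in groups:
        rest = [g for g in groups if g != val]
        for eval_groups in combinations(rest, n_eval):
            train = tuple(g for g in rest if g not in eval_groups)
            yield val, eval_groups, train


def count_splits(n_groups, n_eval):
    """Closed-form count of the splits above: k * C(k - 1, n_eval)."""
    return n_groups * comb(n_groups - 1, n_eval)


five = list(excvba_splits(5, 1))   # 5 * C(4, 1) splits
ten = list(excvba_splits(10, 3))   # 10 * C(9, 3) splits
print(len(five), len(ten))         # 20 840
```

Other schemes may give the same totals (for example, three evaluation groups out of four also yields 20), so the counts alone do not determine the exact recursion used by the authors.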

Fuzzy 3D Descriptors with 3DPharma

The molecular structures were described by 3D pharmacophore fingerprints generated with in-house software (3DPharma) with multiple-species and conformational fuzzification (Figure 2). All structures were subjected to manual pre-treatment (desalting and structure correction) and treatment for multiple tautomers and protomers at pH 7 using ChemAxon software (Calculator Plugins).[1] Multiple conformations and charges were calculated with the OMEGA and QUACPAC packages from OpenEye.[2] LibSVM was used to produce the models. The scripts that perform all simulations were written in Perl.

Datasets

We applied the approach described above to build SVM classification models that predict inhibition of five isoforms (3A4, 1A2, 2D6, 2C9 and 2C19) of cytochrome P450 and Ames mutagenicity of small organic molecules. In this study, the models were built with the Bursi Ames dataset, containing 4284 compounds (2383 mutagens and 1901 nonmutagens),[3] and the admetSAR (LMMD) P450 dataset, with 17036 compounds (out of 24732 in the original LMMD dataset).[5] For the latter, the treatment was critical because of the poor quality of the original dataset (lack of desalting, redundancies and inconsistencies).
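The abstract notes that the accuracy obtained under Y-randomization reflects the composition of the dataset. One way to read this, offered here as an interpretation rather than the authors' procedure, is that shuffling the labels removes any link between descriptors and class, so class frequencies are the only information left. The sketch below shuffles the Ames labels and computes the accuracy of always predicting the larger class, using the compound counts quoted above. The function names are hypothetical.

```python
import random


def y_randomize(labels, seed=0):
    """Return a shuffled copy of the labels (class counts are preserved)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def majority_baseline(n_pos, n_neg):
    """Accuracy of always predicting the larger of two classes."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


# Bursi Ames dataset as quoted in the text: 2383 mutagens, 1901 nonmutagens.
ames_labels = [1] * 2383 + [0] * 1901
shuffled = y_randomize(ames_labels)

baseline = majority_baseline(2383, 1901)
print(round(baseline, 3))  # 0.556
```

This value is a reference point derived from class frequencies only; it is not a reproduction of the randomized-model accuracies reported in the abstract.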

Results and Conclusion

Overall, using the multi-species (MS) approach (P450) and the multi-conformation (MC) approach (Ames) with 3DPharma is slightly better than using a single-representation (SSSC) approach. Compared with literature data, 3DPharma performs better than 2D methods on the P450 datasets and worse on the Ames dataset (Figure 3 and Table 1). The computational cost of the approach presented here is high, but it allows one to assess the validity of the modeling approach, as well as the quality of the descriptors used to represent the modeled instances. Although the number of instances used to generate each model is smaller than in a direct approach (a single model with a training set and an independent external dataset), the results are of the same order.
