Enhancing the Accuracy of Chemogenomic Models with a Three-Dimensional Binding Site Kernel
Computational chemogenomic (or proteochemometric) methods predict target–ligand interactions by training machine learning algorithms on known experimental data in order to distinguish the attributes of true target–ligand pairs from those of false ones. Many ligand and target descriptors can be used for training and for predicting binary associations or even binding affinities. Several chemogenomic studies have reported no real benefit from using 3-D structural target descriptors over simpler sequence-based or property-based information. To assess whether this observation results from inaccurate target description or from the fact that 3-D information is simply not required in chemogenomic modeling, we used a target kernel measuring the distance between target–ligand binding sites of known X-ray structures. When used in combination with a standard ligand kernel in a support vector machine (SVM) classifier, the 3-D target kernel significantly outperforms a sequence-based target kernel in discriminating 2882 target–ligand PDB complexes from 9128 false pairs, regardless of the modeling procedure (local or global). The best SVM models could be successfully applied to predict, with very high recall (70%), precision (99%), and specificity (99%), target–ligand associations for an external set of 14 117 ligands and 531 targets. In most cases, pooling all data in a global model gave better statistics than just discretizing specific target–ligand subspaces in local models. The current study clearly demonstrates that chemogenomic models taking both ligand and target information outperform simpler ligand-based models.
The study also suggests good modeling practices for predicting target–ligand pairing across a large array of targets: (i) ligand-based models are precise enough if sufficient ligand information (>40–50 diverse ligands) is known; (ii) if not, structure-based chemogenomic models (combining a ligand kernel with a structure-based target kernel) are recommended for proteins with known holo structures; (iii) sequence-based chemogenomic models (combining a ligand kernel with a sequence-based target kernel) can still be used with very good accuracy for the remaining targets.
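As an illustration only, the two ideas summarized above (pairing a ligand kernel with a target kernel, and choosing a model family according to the data available) might be sketched as follows. Several choices here are assumptions not stated in the abstract: the product form of the pair kernel, the Tanimoto ligand kernel on fingerprint bit sets, the Gaussian transform of a binding-site distance, and every function and parameter name (`tanimoto`, `site_kernel`, `pair_kernel`, `recommend_model`, `sigma`, `ligand_threshold`). The default threshold of 40 is simply the lower bound of the ">40–50 diverse ligands" range.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def site_kernel(dist, sigma=1.0):
    """Turn a binding-site distance into a similarity (assumed Gaussian form).

    Note: this is only guaranteed to be a valid kernel when the distance
    behaves like a Euclidean distance; that is an assumption of this sketch.
    """
    return math.exp(-dist ** 2 / (2.0 * sigma ** 2))


def pair_kernel(pair_a, pair_b, site_dist, sigma=1.0):
    """Assumed product kernel over (target, ligand_fingerprint) pairs.

    site_dist is a hypothetical callable returning the distance between
    the binding sites of two targets.
    """
    (target_a, fp_a), (target_b, fp_b) = pair_a, pair_b
    k_target = site_kernel(site_dist(target_a, target_b), sigma)
    k_ligand = tanimoto(fp_a, fp_b)
    return k_target * k_ligand


def recommend_model(n_diverse_ligands, has_holo_structure, ligand_threshold=40):
    """Encode guidelines (i)-(iii) as a simple decision rule."""
    if n_diverse_ligands > ligand_threshold:
        return "ligand-based"
    if has_holo_structure:
        return "structure-based chemogenomic"
    return "sequence-based chemogenomic"


if __name__ == "__main__":
    # Toy data: targets are represented by a single number, and the
    # "binding-site distance" is just the absolute difference.
    toy_dist = lambda a, b: abs(a - b)
    pair_1 = (0.0, {1, 2, 3})
    pair_2 = (0.5, {2, 3, 4})
    print(pair_kernel(pair_1, pair_2, toy_dist))
    print(recommend_model(n_diverse_ligands=12, has_holo_structure=True))
```

A Gram matrix built from `pair_kernel` over all training pairs could then be passed to any SVM implementation that accepts a precomputed kernel; the abstract does not specify how the two kernels were actually combined, so the product form should be read as one common option rather than the authors' method.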