Comparison of Combinatorial Clustering Methods on Pharmacological Data Sets Represented by Machine Learning-Selected Real Molecular Descriptors
Cluster algorithms play an important role in diversity-related tasks of modern Chemoinformatics, with the widest applications in the drug discovery programs of the pharmaceutical industry. The performance of these grouping strategies depends on various factors such as the molecular representation, the mathematical method, the algorithmic technique, and the statistical distribution of the data. For this reason, the introduction and comparison of new methods are necessary in order to find the model that best fits the problem at hand. Earlier comparative studies report Ward's algorithm, using fingerprints for molecular description, as generally superior in this field. However, problems remain: other types of numerical description have been little exploited, the current descriptor selection strategy is driven by trial and error, and no previous comparative study has considered a broader domain of the combinatorial methods for grouping chemoinformatic datasets. In this work, a comparison between combinatorial methods is performed, five of which are novel in Chemoinformatics. The experiments are carried out using eight datasets that are well established and validated in the Medicinal Chemistry literature. Each drug dataset was represented by real molecular descriptors selected by Machine Learning techniques, consistent with the neighborhood principle. Statistical analysis of the results demonstrates that the pharmacological activities of the eight datasets can be modeled with a few families of 2D and 3D molecular descriptors, avoiding the classification problems associated with the presence of non-relevant features. Three of the five proposed cluster algorithms perform better than most classical algorithms and similarly to (or, in the most optimistic reading, slightly better than) Ward's algorithm.
The usefulness of these algorithms is also assessed in a comparative experiment against powerful QSAR and Machine Learning classifiers, where they perform similarly in some cases.
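As an illustration of what a combinatorial method looks like in practice, the following minimal sketch implements Ward's algorithm through the Lance-Williams recurrence, which updates cluster dissimilarities from the previous ones without revisiting the raw data. It assumes that "combinatorial" is meant in the Lance-Williams sense; the function name `ward_lance_williams` and the toy descriptor values are hypothetical and are not taken from the study.

```python
from itertools import combinations


def ward_lance_williams(points, n_clusters):
    """Agglomerative clustering using the Lance-Williams update for Ward's method.

    points: list of equal-length tuples of real-valued descriptors (hypothetical data).
    Returns the clusters as sorted lists of point indices.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    # Each cluster id maps to the indices of its members.
    members = {i: [i] for i in range(len(points))}
    # Initial dissimilarities: squared Euclidean distances between points.
    dist = {}
    for i, j in combinations(range(len(points)), 2):
        dist[frozenset((i, j))] = sum(
            (a - b) ** 2 for a, b in zip(points[i], points[j])
        )
    next_id = len(points)
    while len(members) > n_clusters:
        # Merge the pair of clusters with the smallest dissimilarity.
        pair = min(dist, key=dist.get)
        i, j = tuple(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = len(members[i]), len(members[j])
        merged = members.pop(i) + members.pop(j)
        # Lance-Williams recurrence with Ward's coefficients.
        for k in members:
            n_k = len(members[k])
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            dist[frozenset((k, next_id))] = (
                (n_i + n_k) * d_ki + (n_j + n_k) * d_kj - n_k * d_ij
            ) / (n_i + n_j + n_k)
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(m) for m in members.values())


# Toy example: five "molecules" described by two real-valued descriptors.
descriptors = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
clusters = ward_lance_williams(descriptors, 2)
print(clusters)
```

Other combinatorial methods (single, complete, or average linkage, for example) fit the same loop and differ only in the coefficients of the update step, which is why a comparison across the whole family is natural.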