Novel similarity measures for the effective and efficient retrieval of chemoinformatic datasets
Similarity searching is an important facility in modern chemical information management systems for accessing the rich information contained in today's enormous chemical repositories. Given a molecular representation, a similarity measure, and a matching algorithm, the technique returns a list of dataset molecules ranked in decreasing order of similarity to a query (reference) molecule specified by the user. Consequently, researchers have taken an interest in the performance of molecular representations and similarity measures. However, their studies have focused predominantly on binary representations and the corresponding resemblance measures, and little work has been done on other types of numerical descriptors. Moreover, no machine learning techniques for descriptor selection that are consistent with the similarity principle, and that thereby ensure high retrieval quality, have been used. These precedents, together with the need for new methods suited to each chemical context, constitute the motivation for this work. It comprises the computational implementation, in the free language Java, of two novel similarity measures and their comparison with a couple of measures already established in the specialized literature, in terms of how effectively they retrieve eight chemoinformatic datasets from medicinal chemistry, represented by real-valued descriptors selected by machine learning and searched with an efficient matching algorithm.
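The ranking step described above (a representation, a similarity measure, and a matching algorithm producing an ordered list) can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration: the class name `SimilaritySearchSketch`, the helper `Hit`, the toy descriptor vectors, and the similarity function 1 / (1 + Euclidean distance), which is a generic stand-in and not one of the measures studied in this work. The matching algorithm is a plain exhaustive scan followed by a sort, not the efficient algorithm referred to in the abstract.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names, data, and the similarity function are assumptions.
class SimilaritySearchSketch {

    /** A dataset molecule identifier paired with its similarity to the query. */
    static final class Hit {
        final String id;
        final double similarity;

        Hit(String id, double similarity) {
            this.id = id;
            this.similarity = similarity;
        }
    }

    /** Stand-in measure on real-valued descriptors: 1 / (1 + Euclidean distance). */
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("descriptor vectors differ in length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    /** Exhaustive scan: score every dataset molecule, then sort by decreasing similarity. */
    static List<Hit> rank(double[] query, List<String> ids, List<double[]> descriptors) {
        if (ids.size() != descriptors.size()) {
            throw new IllegalArgumentException("ids and descriptors differ in size");
        }
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            hits.add(new Hit(ids.get(i), similarity(query, descriptors.get(i))));
        }
        hits.sort((x, y) -> Double.compare(y.similarity, x.similarity));
        return hits;
    }

    public static void main(String[] args) {
        // Toy two-descriptor molecules; the values are made up for the example.
        double[] query = {1.0, 2.0};
        List<String> ids = Arrays.asList("mol-A", "mol-B", "mol-C");
        List<double[]> descriptors = Arrays.asList(
                new double[] {5.0, 5.0},
                new double[] {1.0, 2.0},
                new double[] {1.0, 3.0});
        for (Hit hit : rank(query, ids, descriptors)) {
            System.out.println(hit.id + "\t" + hit.similarity);
        }
    }
}
```

Keeping the measure in its own method means that swapping in a different similarity measure leaves the ranking logic untouched, which is the kind of side-by-side comparison the abstract describes.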