Challenges in implementing a reliable detection algorithm for highly similar molecules
Finding molecules that are “substantially similar” to controlled compounds, as the US Federal Analogue Act puts it, is a complex problem, so it is crucial to assess the strategies available when designing a system that can efficiently identify highly similar pairs of molecular structures. Which methodologies and computational tools are best suited for this task? How can such systems be optimized for speed and accuracy in a real-world application?
The notion of chemical similarity is a key concept in cheminformatics, yet it remains subjective and not easy to quantify.
In any molecular similarity analysis, two fundamental elements are needed:
- a molecular representation that captures features relevant to similarity
- a similarity function (often referred to as a similarity coefficient) for the quantitative comparison of the selected representations.
Additionally, a weighting scheme can be incorporated to assign different weights or scales to individual features within a molecular representation, tailoring them specifically for similarity calculations.
Computationally, the most common tools are fingerprint representations of molecular structure and properties. No single chemical fingerprint can capture all the pivotal structures or properties of compounds, which underscores the importance of choosing appropriate molecular descriptors (ref). The efficacy of different fingerprint types varies depending on the specific requirements and the nature of the background data sets.
For molecule structure representation in similarity calculations, the Extended-Connectivity Fingerprint (ECFP) has proved to be a solid starting point in several SAR models. ECFP is a circular fingerprint designed to capture the chemical environment around each atom by considering atom types, bonding, and connectivity.
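To make the idea of a circular fingerprint concrete, the sketch below hashes atom environments of growing radius on a toy molecular graph. It is a deliberately simplified illustration and not the ECFP algorithm itself: the atom invariants (element and degree only), the use of `crc32` as the hash, and the function name `circular_features` are assumptions made for this example, and the duplicate-substructure removal of real ECFP is omitted.

```python
import zlib
from collections import Counter


def circular_features(atoms, bonds, radius=2):
    """Return a Counter of hashed atom environments up to `radius` bonds.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (atom_index, atom_index, bond_order) tuples
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Layer 0: each atom is described by its element and its degree.
    ids = [
        zlib.crc32(f"{symbol}|{len(neighbors[i])}".encode())
        for i, symbol in enumerate(atoms)
    ]
    features = Counter(ids)

    # Each further layer combines an atom's identifier with those of its
    # neighbors, so the identifier describes a larger environment.
    for layer in range(1, radius + 1):
        new_ids = []
        for i in range(len(atoms)):
            env = sorted((order, ids[j]) for order, j in neighbors[i])
            new_ids.append(zlib.crc32(repr((layer, ids[i], env)).encode()))
        ids = new_ids
        features.update(ids)
    return features


# Toy example: propane, C-C-C. The two terminal carbons are equivalent,
# so they produce the same identifier in every layer.
propane = circular_features(["C", "C", "C"], [(0, 1, 1), (1, 2, 1)], radius=1)
print(len(propane), sum(propane.values()))
```

The keys of the returned counter correspond to the bit version of the fingerprint, while the values keep how often each environment occurs. In ECFP naming, the number in the fingerprint name is the diameter, which is twice the radius used here.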
As the similarity metric, the most commonly used option, the Tanimoto (Jaccard) coefficient (Tc), is appropriate, although this coefficient suffers from an intrinsic bias toward selecting smaller compounds.
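For bit fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. A minimal sketch, assuming each fingerprint is given as a set of on-bit positions (the function name and the example bit positions are illustrative only):

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient of two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Two empty fingerprints have nothing to compare; 0.0 is the
    # convention chosen for this sketch.
    if union == 0:
        return 0.0
    return len(a & b) / union


fp_query = {3, 17, 42, 101}
fp_controlled = {17, 42, 256}
print(tanimoto(fp_query, fp_controlled))  # 2 shared bits / 5 bits in the union = 0.4
```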
Prior to generating the fingerprint, an initial filtering step needs to be performed: very large molecules have to be excluded, because controlled substance legislation covers only small molecule drugs. In addition, stereo information has to be eliminated because, apart from the substances listed under Schedule II (b) (morphine and related compounds), the stereoisomers of all other substances are also controlled.
As part of the input process, both the controlled molecules and the input structures have to be standardized:
- salts are stripped and the main fragment is neutralized
- functional groups are normalized
- dominant tautomer forms are selected
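Part of this preparation can be sketched directly on SMILES strings. The example below covers only the size filter, the removal of stereo information, and salt stripping by keeping the largest fragment; neutralization, functional group normalization, and tautomer selection need a cheminformatics toolkit and are not shown. The function names and the `max_heavy_atoms=60` cut-off are assumptions for illustration and not values used by the described system.

```python
import re

# Chirality tags (@, @@, @TH1, ...) and directional bond symbols (/ and \).
_STEREO = re.compile(r"@(?:@|TH[12]|AL[12]|SP[123]|TB\d+|OH\d+)?|[/\\]")
# Bracket atoms, two-letter halogens, then one-letter organic-subset atoms.
_ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Explicit hydrogen atoms such as [H], [2H] or [H+] are not heavy atoms.
_BRACKET_H = re.compile(r"\[\d*H[+-]?\d*\]")


def strip_stereo(smiles):
    """Remove chirality and double-bond stereo markers from a SMILES string."""
    return _STEREO.sub("", smiles)


def heavy_atom_count(smiles):
    """Count non-hydrogen atoms by tokenizing the SMILES string."""
    tokens = _ATOM.findall(smiles)
    return sum(1 for tok in tokens if not _BRACKET_H.fullmatch(tok))


def largest_fragment(smiles):
    """Keep the dot-separated component with the most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)


def prepare(smiles, max_heavy_atoms=60):
    """Return the stereo-free main fragment, or None if it is too large."""
    core = largest_fragment(strip_stereo(smiles))
    if heavy_atom_count(core) > max_heavy_atoms:
        return None
    return core


# A hydrochloride salt written with a stereocenter.
print(prepare("C[C@H](N)Cc1ccccc1.Cl"))  # C[CH](N)Cc1ccccc1
```

String-level handling like this is only a rough stand-in for a proper structure standardizer, but it shows the order of operations: remove stereo markers, select the main fragment, then apply the size limit.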
To get the best performance from the ECFP fingerprint, both the vector length and the diameter should be optimized. Since, as mentioned above, the Tanimoto coefficient is biased towards small molecules, it is also practical to apply a weighting function to balance the similarity values of small molecules.
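The effect of the vector length can be illustrated with folding, a common way of mapping hashed feature identifiers onto a fixed-length bit vector by taking each identifier modulo the vector length. The sketch below uses made-up identifiers chosen so that collisions appear as the vector gets shorter; the function name `fold` and the numbers are assumptions for this example.

```python
def fold(feature_ids, n_bits):
    """Fold hashed feature identifiers into on-bit positions of a vector."""
    return {fid % n_bits for fid in feature_ids}


# Five distinct (made-up) feature identifiers.
feature_ids = [5, 1029, 2053, 77, 1101]

for n_bits in (4096, 2048, 1024):
    on_bits = fold(feature_ids, n_bits)
    print(n_bits, len(on_bits), sorted(on_bits))
```

With 4096 bits all five features keep their own bit, with 2048 bits two of them collide, and with 1024 bits only two bits remain set. Shorter vectors are cheaper to store and compare, but colliding features can no longer be told apart, which is why the length is worth tuning together with the diameter.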
These steps are needed for the proper use of the ECFP representation, but it is still just a single fingerprint. To check that it works effectively, we ran a similarity calculation on the ChEMBL v34 database. The analysis of the results uncovered an interesting pattern: symmetric molecules consistently led to elevated similarity values.
A detailed examination of the step-by-step fingerprint generation process revealed that the elevated similarity values are inherently due to the deduplication step, which follows the encoding of local neighborhoods around each atom and the bonding connectivity within the molecules. This flaw can be mitigated by applying the count version of the Extended-Connectivity Fingerprint (ECFP). Unlike bit vectors, which only track the presence or absence of features in a molecule, count vectors monitor how many times each feature appears, providing a more nuanced and accurate representation.
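The difference between the two versions can be shown with a toy example. Assume a small molecule whose features each occur once, and a symmetric molecule built from two copies of it, so that every feature occurs twice (features created at the junction are ignored here for simplicity). The feature labels and function names below are illustrative assumptions, not output of a real fingerprint generator.

```python
from collections import Counter


def tanimoto_binary(counts_a, counts_b):
    """Tanimoto on presence/absence: occurrence counts are discarded."""
    a, b = set(counts_a), set(counts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tanimoto_count(counts_a, counts_b):
    """Tanimoto on count vectors: sum of minima over sum of maxima."""
    keys = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    total = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


half = Counter({"ring": 1, "amine": 1, "linker": 1})
symmetric = Counter({"ring": 2, "amine": 2, "linker": 2})

print(tanimoto_binary(half, symmetric))  # 1.0: deduplication hides the repetition
print(tanimoto_count(half, symmetric))   # 0.5: 3 shared occurrences out of 6
```

After deduplication the two toy molecules look identical, whereas the count version reports that only half of the feature occurrences are shared.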
At this point we have a comprehensive representation of the molecule's structure, but is this truly sufficient to determine whether two molecules are similar?
When we examine the changes outlined below, we get the impression that something important remains unaccounted for.
Even minor structural modifications can significantly change the functional attributes of a molecule. To address this problem, an additional molecular descriptor should be used, one that accounts for the similarity between the functional groups in the molecules.
In Compliance Checker, the detection of highly similar molecules applies a fragment-based pharmacophore fingerprint for this purpose. Pharmacophore fingerprints focus on functional similarities rather than strict structural equivalence, identifying key functional groups such as hydrogen bond donors and acceptors, aromatic rings, and hydrophobic regions. Based on the analysis of the hits in the ChEMBL dataset, the functional group definitions have been optimized, and the three fingerprint representations form a consensus model that can reliably identify molecule structures highly similar to directly controlled substances in the US and Canada.
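The exact consensus rule is not spelled out here, so the following is only one possible reading: a pair is flagged as highly similar when every fingerprint's similarity reaches its own threshold. The fingerprint names, the threshold values, and the all-must-agree rule are hypothetical and chosen purely for illustration.

```python
# Hypothetical per-fingerprint cut-offs; not the values used in Compliance Checker.
THRESHOLDS = {
    "ecfp_bits": 0.70,
    "ecfp_counts": 0.70,
    "pharmacophore": 0.80,
}


def is_highly_similar(scores, thresholds=THRESHOLDS):
    """Flag a pair only if every fingerprint similarity reaches its cut-off."""
    return all(scores[name] >= cutoff for name, cutoff in thresholds.items())


# Structurally close, but the functional (pharmacophore) similarity is low.
pair_a = {"ecfp_bits": 0.91, "ecfp_counts": 0.88, "pharmacophore": 0.55}
# Close on all three representations.
pair_b = {"ecfp_bits": 0.82, "ecfp_counts": 0.79, "pharmacophore": 0.90}

print(is_highly_similar(pair_a))  # False
print(is_highly_similar(pair_b))  # True
```

Requiring agreement between structural and functional representations is what lets a consensus reject pairs that look alike on paper but differ in their key functional groups.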
Validation results
To measure the reliability of our similarity detection algorithm, we randomly selected 200 molecules found to be highly similar to a compound listed in Schedule 1 of the US Controlled Substances Act. Based on a diversity analysis, we dropped one third of this set because it over-represented a few structural frameworks, and replaced it with a random set of structures found to be non-similar by our detection method. As a result, the molecule pair set contained 134 highly similar and 66 non-similar pairs.
We gave these 200 molecule pairs to 6 medicinal chemists and asked them to judge the following question: “According to the ‘substantially similar’ term, should the input structure be considered controlled, knowing that the other structure is listed as controlled?”
Figure: Expert judgement of a random set of molecules from the ChEMBL_34 dataset that Compliance Checker found to be highly similar to controlled substances in US Schedule 1 (mixed with non-similar molecule pairs).
The presented approach, which combines fingerprints with a heuristic weighting function and is specifically tailored for "substantially similar" searches in compound compliance, significantly lowers the incidence of false positives. This minimizes the need for time-consuming individual evaluations in which expert opinion is required to determine whether a substance is an analogue.
In our ongoing efforts to refine and improve this method, we are actively considering two key strategies. Firstly, we plan to conduct a new series of experiments utilizing multi-fingerprint approaches. Alongside this, we are committed to the application and integration of machine learning technologies to build a consensus similarity model, which benefits from the interpretability and established nature of fingerprint methods, while leveraging the adaptive and predictive strengths of machine learning.
