Blog Compliance Checker

Challenges in implementing a reliable highly-similar detection algorithm

Posted by Ákos Papp on 2025-07-30

2025-09-26

Finding molecules that are “substantially similar” to controlled compounds, as the US Federal Analogue Act puts it, is a complex problem. It is therefore crucial to assess the strategies available when designing a system that can efficiently identify highly similar molecule structure pairs. Which methodologies and computational tools are best suited for this task? How can these systems be optimized for speed and accuracy in a real-world application?

The notion of chemical similarity is a key concept in cheminformatics, yet it remains subjective and difficult to quantify.

For an introduction to controlled substance analogues and the concept of similarity in controlled substance regulations, read this article.

The starting point

In any molecular similarity analysis, two fundamental elements are necessary:

a molecular representation that captures features relevant to similarity,
a similarity function (often referred to as a similarity coefficient) for the quantitative comparison of the selected representations.

Additionally, a weighting scheme can be incorporated to assign different weights or scales to individual features within a molecular representation, tailoring them specifically for similarity calculations.

As the similarity metric, the most commonly used option, the Tanimoto (Jaccard) coefficient (Tc), is appropriate, although it suffers from an intrinsic bias toward selecting smaller compounds.
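As a minimal sketch, the Tanimoto coefficient of two feature sets A and B is |A ∩ B| / |A ∪ B|. The Python function below assumes that each fingerprint is stored as a set of "on" bit positions; the function name and the example sets are illustrative and not taken from Compliance Checker.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Illustrative bit positions only, not real fingerprints.
fp_query = {1, 4, 7, 9}
fp_controlled = {1, 4, 8, 9, 12}
print(tanimoto(fp_query, fp_controlled))  # 3 shared / 6 in the union = 0.5
```

Because the denominator is the size of the union, a molecule with only a few features has its score changed considerably by a single differing feature, which is one way to see why molecule size influences the values.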

Computationally, the most common tools are fingerprint representations of molecular structure and properties. No single chemical fingerprint can capture all the pivotal structures or properties of compounds, which underscores the importance of choosing appropriate molecular descriptors (ref). The efficacy of different fingerprint types varies depending on the specific requirements and the nature of the background data sets.

Selecting the proper molecular descriptors

For molecule structure representation in similarity calculations, the Extended-Connectivity Fingerprint (ECFP) has proved to be a solid starting point in several SAR models. ECFP is a circular fingerprint designed to capture the chemical environment around each atom by considering atom types, bonding and connectivity.
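The idea of a circular fingerprint can be illustrated with a toy example. The sketch below is not the ECFP algorithm itself: it uses a simplified atom description (element and degree only), a CRC32 hash, and skips the structural deduplication step of the real method. The function name and the graph input format are assumptions made for this illustration. A radius of 2 corresponds to the diameter of 4 in the common ECFP4 naming.

```python
import zlib


def _hash(obj) -> int:
    # Deterministic 32-bit hash (Python's built-in hash() is salted per run).
    return zlib.crc32(repr(obj).encode("utf-8"))


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of 'on' bit positions for a small molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples with atom indices i and j
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))

    # Radius 0: each atom is described by its element and its degree.
    ids = [_hash((atoms[i], len(neighbours[i]))) for i in range(len(atoms))]
    features = set(ids)

    # Each further iteration folds in the identifiers of the direct neighbours,
    # so an identifier at iteration r describes an environment of radius r.
    for _ in range(radius):
        ids = [
            _hash((ids[i], sorted((order, ids[j]) for order, j in neighbours[i])))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the 32-bit identifiers into a fixed-length bit vector.
    return {f % n_bits for f in features}


ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
propane = toy_circular_fingerprint(["C", "C", "C"], [(0, 1, 1), (1, 2, 1)])
print(len(ethanol), len(propane))
```

In propane the two terminal carbons have identical environments at every radius, so they contribute the same identifiers and the set ends up with fewer distinct features than the three atoms would suggest.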

Preparing the learning set

Prior to generating the fingerprint, an initial filtering step needs to be performed: very large molecules have to be excluded, because controlled substance legislation covers only small-molecule drugs. In addition, stereo information has to be eliminated, since the stereoisomers of all substances are also controlled, apart from those listed under Schedule II (b) (morphine and related compounds).

As part of the input process, both the controlled molecules and the input structures have to be standardized:

salts are stripped and the main fragment is neutralized
functional groups are normalized
dominant tautomer forms are selected
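In practice these steps are handled by a cheminformatics toolkit. Purely as an illustration, the crude text-based sketch below covers only three of them on SMILES input: picking the main fragment, dropping stereo markers, and rejecting large molecules. Neutralization, functional group normalization and tautomer selection are not attempted. The function names and the heavy-atom cutoff of 60 are hypothetical and are not values from the article.

```python
import re

# One match per atom: a bracket atom, a two-letter halogen, or a one-letter
# organic-subset atom (aromatic atoms are lowercase in SMILES).
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Bracket hydrogens such as [H], [H+] or [2H] are not heavy atoms.
_HYDROGEN = re.compile(r"\[\d*H[\d+\-\]]")


def heavy_atom_count(smiles: str) -> int:
    """Rough heavy-atom count obtained by tokenising the SMILES string."""
    return sum(1 for tok in _ATOM.findall(smiles) if not _HYDROGEN.match(tok))


def main_fragment(smiles: str) -> str:
    """Keep the dot-separated component with the most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)


def strip_stereo(smiles: str) -> str:
    """Drop tetrahedral (@, @@) and double-bond (/, \\) stereo markers."""
    return re.sub(r"[@/\\]", "", smiles)


def prepare(smiles: str, max_heavy_atoms: int = 60):
    """Return a cleaned SMILES, or None if the molecule is too large.

    max_heavy_atoms is a hypothetical cutoff used only for this sketch.
    """
    cleaned = strip_stereo(main_fragment(smiles))
    if heavy_atom_count(cleaned) > max_heavy_atoms:
        return None
    return cleaned


# Hydrochloride salt of a chiral amine, written as two fragments.
print(prepare("C[C@H](N)Cc1ccccc1.Cl"))  # C[CH](N)Cc1ccccc1
```

Applying the same preparation to both the controlled substances and the query structures keeps the later fingerprint comparison consistent.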

To get the best performance from the ECFP fingerprint, both the vector length and the diameter should be optimized. Since, as mentioned above, the Tanimoto coefficient is biased towards small molecules, it is also practical to apply a weighting function to balance the similarity values of small molecules.

Results with a single fingerprint

These are the steps required for the proper use of the ECFP representation, but it is still just a single fingerprint. To check that it works effectively, we ran a similarity calculation on the ChEMBL v34 database. The analysis of the results uncovered an interesting pattern: symmetric molecules consistently led to elevated similarity values.

A detailed examination of the step-by-step fingerprint generation process revealed that the elevated similarity values are inherently due to the deduplication step, which follows the encoding of local neighborhoods around each atom and the bonding connectivity within the molecules. This flaw can be mitigated by applying the count version of the Extended-Connectivity Fingerprint (ECFP). Unlike bit vectors, which only track the presence or absence of features in a molecule, count vectors monitor how many times each feature appears, providing a more nuanced and accurate representation.
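The difference between bit and count vectors can be sketched with Python's `Counter`. The feature names below are hypothetical, and the "dimer" is an idealised symmetric molecule in which every atom environment of the "monomer" simply occurs twice. The count-based Tanimoto shown here is the usual generalisation: the sum of element-wise minima divided by the sum of element-wise maxima.

```python
from collections import Counter


def tanimoto_bits(a: Counter, b: Counter) -> float:
    """Presence/absence Tanimoto: multiplicities are ignored."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def tanimoto_counts(a: Counter, b: Counter) -> float:
    """Count-based Tanimoto: sum of minima over sum of maxima."""
    shared = sum((a & b).values())  # element-wise minimum
    total = sum((a | b).values())   # element-wise maximum
    return shared / total if total else 0.0


# Hypothetical feature identifiers; the dimer repeats every environment twice.
monomer = Counter({"env_a": 1, "env_b": 2, "env_c": 1})
dimer = Counter({"env_a": 2, "env_b": 4, "env_c": 2})

print(tanimoto_bits(monomer, dimer))    # 1.0
print(tanimoto_counts(monomer, dimer))  # 0.5
```

With bit vectors the two structures look identical because deduplication hides the repeated features, while the count version reflects that one structure contains twice as many of them.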

At this point we have a comprehensive representation of the molecule's structure, but is this truly sufficient to determine whether two molecules are similar?

When we examine the changes outlined below, we get the impression that something important remains unaccounted for.

[Figure: examples of minor structural modifications]

Even minor structural modifications can significantly impact the functional attributes of a molecule. To address this problem, an additional molecular descriptor should be used, one that accounts for the similarity between the functional groups in the molecules.

Extending the molecular descriptors

In Compliance Checker, the detection of highly similar molecules applies a fragment-based pharmacophore fingerprint for this purpose. Pharmacophore fingerprints focus on functional similarities rather than strict structural equivalence, identifying key functional groups such as hydrogen bond donors and acceptors, aromatic rings, and hydrophobic regions. Based on the analysis of the hits in the ChEMBL dataset, the functional group definitions have been optimized, and the three fingerprint representations form a consensus model that can reliably identify molecule structures highly similar to directly controlled substances in the US and Canada.
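The article does not describe how the individual fingerprints are combined, so the following is only one possible consensus rule, shown as a hypothetical sketch: a pair is flagged when every fingerprint similarity reaches its own threshold. The fingerprint names and cut-off values are invented for illustration and are not those used in Compliance Checker.

```python
def consensus_hit(scores: dict, thresholds: dict) -> bool:
    """Flag a pair only if every fingerprint similarity reaches its threshold."""
    return all(scores[name] >= thresholds[name] for name in thresholds)


# Hypothetical fingerprint names and cut-offs, for illustration only.
thresholds = {"ecfp_bits": 0.7, "ecfp_counts": 0.7, "pharmacophore": 0.8}

pair_1 = {"ecfp_bits": 0.82, "ecfp_counts": 0.75, "pharmacophore": 0.91}
pair_2 = {"ecfp_bits": 0.95, "ecfp_counts": 0.52, "pharmacophore": 0.88}

print(consensus_hit(pair_1, thresholds))  # True
print(consensus_hit(pair_2, thresholds))  # False
```

A strict "all must agree" rule like this trades some sensitivity for fewer false positives; a weighted average or a voting scheme would be alternative designs.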

Final validation results

To measure the reliability of our similarity detection algorithm, we randomly selected 200 molecules found to be highly similar to any compound listed in Schedule I of the US Controlled Substances Act. Based on a diversity analysis, we dropped one third of this set because they over-represented a few structural frameworks, and replaced them with a random set of structures found to be non-similar by our detection method. As a result, the molecule pair set contained 134 highly similar and 66 non-similar pairs.

We gave these 200 molecule pairs to 6 medicinal chemists and asked them to judge, according to the “substantially similar” term, whether the input structure should be considered controlled, knowing that the other structure is listed as controlled.

Correlation between the judgment of Compliance Checker and 6 medicinal chemists.
(Medchem judgement: at least 4 of the 6 medchems marked the molecule pair as highly-similar).
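The comparison described in the caption can be sketched as follows. The "at least 4 of 6" rule comes from the text above; the function names are hypothetical, and the votes are made-up examples, not the study data.

```python
def expert_label(votes, min_yes=4):
    """True if at least min_yes chemists marked the pair as highly similar."""
    return sum(votes) >= min_yes


def confusion_counts(algorithm_labels, expert_votes):
    """Count agreements and disagreements between algorithm and experts."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for predicted, votes in zip(algorithm_labels, expert_votes):
        actual = expert_label(votes)
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Made-up votes for four pairs (1 = "highly similar"); not the study data.
algorithm = [True, True, False, False]
votes = [
    [1, 1, 1, 1, 0, 1],  # 5 of 6 -> experts say similar
    [1, 0, 0, 1, 0, 0],  # 2 of 6 -> experts say not similar
    [1, 1, 1, 1, 1, 1],  # 6 of 6 -> experts say similar
    [0, 0, 1, 0, 0, 0],  # 1 of 6 -> experts say not similar
]
print(confusion_counts(algorithm, votes))
# {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 1}
```

Here "FP" counts pairs that the algorithm flagged but the expert majority did not, which is the quantity the approach aims to keep low.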

The presented approach, which combines fingerprints with a heuristic weighting function and is specifically tailored for "substantially similar" searches in compound compliance, significantly lowers the incidence of false positives. This minimizes the need for time-consuming individual evaluations, requiring expert opinion, to determine whether a substance is an analogue.

Want to see this in practice?

Try Compliance Checker

Future development plans

In our ongoing efforts to refine and improve this method, we are actively considering two key strategies.

Firstly, we plan to conduct a new series of experiments utilizing multi-fingerprint approaches. Alongside this, we are committed to the application and integration of machine learning technologies to build a consensus similarity model, which benefits from the interpretability and established nature of fingerprint methods, while leveraging the adaptive and predictive strengths of machine learning.

Ákos Papp

Senior Product Manager

Ákos Papp is a chemical engineer by training, having graduated from the Department of Chemical Engineering at the Technical University of Budapest. Throughout his career he has worked in the chemoinformatics area, both in software development and from a user standpoint. Since joining Chemaxon in 2008, he has been involved in several projects, including Marvin, Compound Registration and Biologics registration, and he is now the Product Manager of Compliance Checker, cHemTS and JChem for Office.
