Structural Similarity Methodologies for Small Molecules

Posted by
Jonathan Buttrick
on 13 06 2024

Introduction

Small molecule structural similarity metrics are important in a number of drug discovery workflows, such as Structure-Activity Relationships and Virtual Screening. This blog post serves to highlight some key points from recent publications related to structural similarity in cheminformatics: 

  1. Molecular Similarity: Theory, Applications, and Perspectives by Lopez-Perez et al.; a recent review in ChemRxiv.
  2. Navigating Chemical Space by Tarcsay et al.; a book chapter written by Chemaxon in Computational Drug Discovery by Poongavanam.

(full references below)

It begins by discussing methods for generating molecular fingerprints, followed by common expressions for quantifying similarity, and finishes with a few relevant applications. If any topic is of particular interest, readers are encouraged to dive deeper into the source material and the references therein.

Molecular Fingerprints

Molecular fingerprints are one of the most systematic and broadly used molecular representation methodologies for computational chemistry workflows. Molecular fingerprints are descriptors of structural features and/or properties within molecules, determined either by predefined features or mathematical descriptors of molecular features. Conceptually, structures are represented with fixed-dimension vectors, which can then be compared to one another using a distance metric between query and target vectors. In practice, the presence of subgraph features within a molecular graph yields integer vectors, most often binary (0,1) vectors, that come together to make up the molecular fingerprint. Choice of fingerprint has a significant influence on quantitative similarity, which will be touched on in more detail later. In general, molecular fingerprints fit into two categories: substructure-preserving fingerprints and feature fingerprints.

 

Figure 1: The chemical hashed fingerprint generation process.

Substructure-Preserving Fingerprints

Dictionary-based structural fingerprints use a predefined library of structural patterns and assign a binary bit to represent the presence (on bit) or absence (off bit) of these patterns. These patterns can include features such as common chemical pairs/moieties, functional groups, atom counts of a particular atom (e.g. O or N), ring counts of a particular size ring, etc. The most commonly used examples of dictionary-based fingerprints include PubChem (PC), Molecular ACCess System (MACCS), Barnard Chemistry Information (BCI) fingerprints, and SMILES FingerPrint (SMIFP).
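The dictionary idea can be sketched in a few lines. This is a toy illustration only: the key list below is hypothetical and far smaller than real dictionaries such as MACCS or PubChem keys, molecules are hand-written heavy-atom graphs rather than parsed structures, and the ring test assumes a connected graph.

```python
from collections import Counter

def ring_count(n_atoms, bonds):
    # For a connected graph: independent rings = bonds - atoms + 1.
    return len(bonds) - n_atoms + 1

def has_bond(atoms, bonds, x, y):
    # True if any bond joins elements x and y (in either order).
    return any({atoms[i], atoms[j]} == {x, y} for i, j in bonds)

# Hypothetical key dictionary: (description, predicate) pairs, one bit per key.
KEYS = [
    ("contains N", lambda atoms, bonds: "N" in atoms),
    ("contains O", lambda atoms, bonds: "O" in atoms),
    ("more than one O", lambda atoms, bonds: Counter(atoms)["O"] > 1),
    ("contains S", lambda atoms, bonds: "S" in atoms),
    ("at least one ring", lambda atoms, bonds: ring_count(len(atoms), bonds) >= 1),
    ("C-O bond", lambda atoms, bonds: has_bond(atoms, bonds, "C", "O")),
]

def dictionary_fingerprint(atoms, bonds):
    # Bit i is on (1) if pattern i is present, off (0) otherwise.
    return [1 if predicate(atoms, bonds) else 0 for _, predicate in KEYS]

# Heavy-atom graphs written by hand: element list plus bonded index pairs.
ethanolamine = (["N", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
tetrahydrofuran = (["O", "C", "C", "C", "C"],
                   [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])

print(dictionary_fingerprint(*ethanolamine))
print(dictionary_fingerprint(*tetrahydrofuran))
```

Because every bit has a fixed meaning, two fingerprints built from the same dictionary can be compared position by position, and an on bit can always be traced back to a named pattern.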

Linear path-based hashed fingerprints, such as the chemical hashed fingerprint (CFP), exhaustively identify all linear paths in a molecule up to a predefined length (typically 5-7 bonds). Figure 2 shows an example using a maximum length of 2. Additionally, ring systems are represented with ring type and size attributes. The example in Figure 2 yields 14 patterns, plus 2 additional bits for the ring. The number of bits set per feature is configurable at generation time, as is the total length of the fingerprint. Shorter fingerprints make it more likely that two features map to the same position (a bit collision), which can be balanced to some extent by adjusting the number of bits per feature. Longer fingerprints reduce collisions but cost more in storage and computation. Darkness, the percentage of on bits in a fingerprint, should be balanced against your objectives.
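The path enumeration, hashing, and darkness ideas can be sketched as follows. This is a simplified illustration, not the CFP implementation: bond orders and ring attributes are ignored, SHA-256 stands in for whatever hash a real toolkit uses, and the function names and default parameters are assumptions made for the example.

```python
import hashlib

def linear_paths(atoms, bonds, max_bonds):
    """Collect all simple linear paths with 0..max_bonds bonds as symbol strings."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    found = set()

    def walk(path):
        forward = "-".join(atoms[k] for k in path)
        backward = "-".join(atoms[k] for k in reversed(path))
        # A path and its reverse are the same feature; keep one canonical form.
        found.add(min(forward, backward))
        if len(path) - 1 < max_bonds:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found

def hashed_fingerprint(patterns, length=64, bits_per_feature=2):
    """Map each pattern to bits_per_feature positions in a bit vector of given length."""
    bits = [0] * length
    for pattern in patterns:
        for k in range(bits_per_feature):
            digest = hashlib.sha256(f"{pattern}|{k}".encode()).digest()
            bits[int.from_bytes(digest[:4], "big") % length] = 1
    return bits

def darkness(bits):
    # Fraction of on bits in the fingerprint.
    return sum(bits) / len(bits)

# Ethanolamine heavy atoms (N-C-C-O), paths of up to 2 bonds.
atoms, bonds = ["N", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]
patterns = linear_paths(atoms, bonds, max_bonds=2)
fingerprint = hashed_fingerprint(patterns, length=64, bits_per_feature=2)
print(sorted(patterns))
print(f"darkness = {darkness(fingerprint):.2f}")
```

Rerunning `hashed_fingerprint` with a smaller `length` or a larger `bits_per_feature` is a quick way to see how the two settings trade collisions against darkness.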

 

Figure 2: Structural moieties in a CFP. Orange bonds highlight the paths in scope. 

Feature Fingerprints

Feature fingerprints represent the characteristics within a molecule identified to correspond to key structure-activity properties in known compounds. They are not substructure-preserving (and thus are not adequate for substructure search pre-filtering), but they provide better vectors for machine learning model building and yield better similarity values for activity-based virtual screening. Radial, topological, and a few more niche fingerprints discussed below are all types of feature fingerprints.

Radial (or circular) fingerprints iterate over each heavy atom and capture information about its neighboring features. The extended connectivity fingerprint (ECFP), the most common radial fingerprint, starts from each atom and expands out to a given diameter, generating patterns that are hashed using a modified Morgan algorithm and mapped to a bit string of predefined length. This basic radial technique can be combined with other methods, such as graph reduction or ring encoding, to form other fingerprints. Additional radial fingerprints include Functional-Class FingerPrints (FCFPs), the MinHash FingerPrint (MHFP), Molprint2D, and Molprint3D.
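The iterative neighborhood expansion can be sketched as below. This is an ECFP-like toy, not the published algorithm: the initial atom invariant is reduced to element and heavy-atom degree, bond orders are ignored, duplicate environments are not pruned, and SHA-256 is an assumed stand-in for the hash function. A radius of 2 corresponds to a diameter of 4.

```python
import hashlib

def stable_hash(obj):
    # Deterministic 32-bit hash of a tuple (Python's built-in hash() is salted for str).
    return int.from_bytes(hashlib.sha256(repr(obj).encode()).digest()[:4], "big")

def radial_fingerprint(atoms, bonds, radius=2, length=1024):
    """Return the set of on-bit positions of an ECFP-like fingerprint."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: each atom's identifier depends only on the atom itself.
    ids = [stable_hash((atoms[i], len(neighbors[i]))) for i in range(len(atoms))]
    features = set(ids)

    # Each further iteration folds in the neighbors' previous identifiers,
    # so the environment described grows by one bond per iteration.
    for _ in range(radius):
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[n] for n in neighbors[i]))))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the identifiers into a fixed-length bit string.
    return {feature % length for feature in features}

# The same molecule (ethanolamine heavy atoms) with two different atom orderings.
fp_forward = radial_fingerprint(["N", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
fp_reversed = radial_fingerprint(["O", "C", "C", "N"], [(0, 1), (1, 2), (2, 3)])
print(fp_forward == fp_reversed)
```

Sorting the neighbor identifiers before hashing is what makes the result independent of how the atoms happen to be numbered.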

Topological fingerprints represent graph distance within a molecule between an atom and another feature in the molecule. Atom pair fingerprints, for example, encode the shortest topological distance between two atoms in the molecule. These can be combined with radial fingerprinting techniques to encode finer details, as is done with MAP4 fingerprints. Topological fingerprints are especially useful for larger systems such as biomolecules. Other examples of topological fingerprints include topological torsion (TT) and Daylight fingerprints.
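A minimal sketch of the atom pair idea follows, assuming each atom is described by its element symbol only (real atom pair fingerprints use richer atom types and then hash the features). Shortest topological distances are found by breadth-first search over the bond graph.

```python
from collections import Counter, deque

def topological_distances(n_atoms, bonds):
    """Shortest path length, in bonds, between every pair of atoms."""
    neighbors = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    distances = {}
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            current = queue.popleft()
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen[nxt] = seen[current] + 1
                    queue.append(nxt)
        for end, dist in seen.items():
            if start < end:
                distances[(start, end)] = dist
    return distances

def atom_pair_features(atoms, bonds):
    """Count (element, distance, element) features over all atom pairs."""
    counts = Counter()
    for (i, j), dist in topological_distances(len(atoms), bonds).items():
        first, second = sorted((atoms[i], atoms[j]))
        counts[(first, dist, second)] += 1
    return counts

# Ethanolamine heavy atoms: N-C-C-O
pairs = atom_pair_features(["N", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
for feature, count in sorted(pairs.items()):
    print(feature, count)
```

The N and O atoms, three bonds apart, produce the feature `("N", 3, "O")`, which is the kind of longer-range information a purely local radial environment would miss.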

More niche fingerprints can include features beyond atom-bond connectivity and are particularly useful for virtual library screening. Pharmacophore fingerprints use additional physicochemical properties to predict how a feature on the molecule will interact with other molecules; examples include protein-ligand interaction fingerprints (PLIFs), structural protein-ligand interaction fingerprints (SPLIFs), and protein-ligand extended connectivity (PLEC) fingerprints. Shape-based fingerprints describe the 3D surface of a molecule and its possible interactions; examples include rapid overlay of chemical structures (ROCS®) and ultrafast shape recognition (USR).

Similarity Expressions

To quantitatively determine the similarity between two structures, either distance (D) or similarity (S) functions can be used. Distance metrics should obey four rules:

  1. positive for nonidentical objects
  2. distance from the object itself equals zero
  3. symmetric
  4. triangular inequality

Similarity metrics should obey three rules:

  1. for nonidentical objects, the similarity is less than 1
  2. identical objects have similarity of 1
  3. the similarity between A and B equals the similarity between B and A (if the similarity function is symmetric)

Sometimes dissimilarity is also used, which can be determined by the equation: Dissimilarity = 1 - Similarity. 
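The four distance rules can be checked numerically for a dissimilarity of the form 1 - Similarity. The sketch below does this for the Tanimoto case on a handful of hypothetical fingerprints, each represented as a set of on-bit positions. Passing on a sample is a sanity check, not a proof.

```python
from itertools import product

def soergel(x, y):
    """Tanimoto dissimilarity (1 - Tanimoto) between two sets of on-bit positions."""
    union = len(x | y)
    return 0.0 if union == 0 else 1.0 - len(x & y) / union

def obeys_distance_rules(dist, objects, tol=1e-12):
    """Check the four distance rules on every pair and triple in `objects`."""
    for x, y in product(objects, repeat=2):
        if x == y and dist(x, y) != 0:       # rule 2: zero distance to itself
            return False
        if x != y and dist(x, y) <= 0:       # rule 1: positive if nonidentical
            return False
        if dist(x, y) != dist(y, x):         # rule 3: symmetric
            return False
    for x, y, z in product(objects, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:  # rule 4: triangle inequality
            return False
    return True

# Hypothetical fingerprints, written as sets of on-bit positions.
sample = [frozenset(s) for s in ({1, 2, 3}, {2, 3, 4}, {1, 5}, {3}, {1, 2, 3, 4, 5})]
print(obeys_distance_rules(soergel, sample))
```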

For the similarity expression, the following symbols are used:

  • a is the number of on bits in molecule A
  • b is the number of on bits in molecule B
  • c is the number of bits that are on in both molecules
  • d is the number of common off bits
  • n is the bit length (total number of bits) of the fingerprint: n = a + b − c + d.
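As a small worked example of these symbols, the sketch below counts a, b, c, and d for two made-up 8-bit fingerprints and confirms that n = a + b - c + d. The function name and the bit vectors are illustrative assumptions.

```python
def bit_counts(fp_a, fp_b):
    """Return (a, b, c, d) for two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(fp_a)                                                # on bits in A
    b = sum(fp_b)                                                # on bits in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)            # on in both
    d = sum(1 for x, y in zip(fp_a, fp_b) if not x and not y)    # off in both
    return a, b, c, d

fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 0, 0]
a, b, c, d = bit_counts(fp_a, fp_b)
print(a, b, c, d)                # → 4 3 2 3
print(a + b - c + d, len(fp_a))  # → 8 8
```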

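The most commonly used similarity expressions, written for binary fingerprints in the symbols above, are:

  • Tanimoto coefficient: $S_{\mathrm{Tanimoto}} = \frac{c}{a + b - c}$
  • Soergel distance or Tanimoto dissimilarity: $D_{\mathrm{Soergel}} = 1 - \frac{c}{a + b - c} = \frac{a + b - 2c}{a + b - c}$
  • Euclidean distance: $D_{\mathrm{Euclidean}} = \sqrt{a + b - 2c}$
  • Manhattan distance: $D_{\mathrm{Manhattan}} = a + b - 2c$
  • Dice coefficient: $S_{\mathrm{Dice}} = \frac{2c}{a + b}$
  • Tversky: $S_{\mathrm{Tversky}} = \frac{c}{\alpha (a - c) + \beta (b - c) + c}$, where $\alpha$ and $\beta$ are user-chosen weights
  • Cosine: $S_{\mathrm{Cosine}} = \frac{c}{\sqrt{a b}}$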
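The expressions listed above can be sketched in terms of the counts a, b, and c. The functions below use the commonly cited binary-fingerprint forms, which is an assumption about the exact variants intended here (some sources normalize the Euclidean and Manhattan distances by the bit length n). The Tversky weights `alpha` and `beta` are user-chosen parameters, and none of the functions guard against fingerprints with no on bits.

```python
import math

def tanimoto(a, b, c):
    return c / (a + b - c)

def soergel(a, b, c):
    return 1.0 - tanimoto(a, b, c)

def euclidean(a, b, c):
    # a + b - 2c is the number of positions where the two fingerprints differ.
    return math.sqrt(a + b - 2 * c)

def manhattan(a, b, c):
    return a + b - 2 * c

def dice(a, b, c):
    return 2 * c / (a + b)

def tversky(a, b, c, alpha, beta):
    # Asymmetric unless alpha == beta; alpha weights bits unique to A, beta those unique to B.
    return c / (alpha * (a - c) + beta * (b - c) + c)

def cosine(a, b, c):
    return c / math.sqrt(a * b)

# Example counts: 4 on bits in A, 3 in B, 2 in common.
a, b, c = 4, 3, 2
print(f"Tanimoto  {tanimoto(a, b, c):.3f}")
print(f"Soergel   {soergel(a, b, c):.3f}")
print(f"Euclidean {euclidean(a, b, c):.3f}")
print(f"Manhattan {manhattan(a, b, c)}")
print(f"Dice      {dice(a, b, c):.3f}")
print(f"Tversky   {tversky(a, b, c, alpha=0.9, beta=0.1):.3f}")
print(f"Cosine    {cosine(a, b, c):.3f}")
```

With alpha = beta = 1 the Tversky expression reduces to Tanimoto, and with alpha = beta = 0.5 it reduces to Dice, which makes it a convenient way to interpolate between the two or to weight the query and target unequally.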

 

To showcase the effect of fingerprint selection on structural similarity, the similarity spaces of 1,000 randomly selected structures with activity data for the hERG target (target ID CHEMBL240) in ChEMBL (v31) are shown in Figure 3 using the Tanimoto metric with three fingerprints: ECFP with diameter 4, chemical hashed linear fingerprint with path length 4, and MACCS keys. The MACCS key-based similarity space identifies the structures as more similar than the CFP space does, while ECFP4 identifies them as the least similar. Care should be taken when selecting a fingerprint method to ensure it matches the type of similarity you wish to investigate; for example, a substructure-preserving fingerprint if substructure features are of importance and a feature fingerprint if similar activity is of importance.

Figure 3: Comparison of Tanimoto dissimilarity spaces using different fingerprinting techniques. (a) ECFP diameter 4 (ECFP D4) vs. linear hashed chemical fingerprint with length 4 (CFP L:4), (b) MACCS key vs. CFP L:4, (c) ECFP D4 vs. MACCS key. Data: 1,000 randomly selected unique compounds from hERG target data (CHEMBL240, ChEMBL version 31); fingerprints were generated with JChem v22.13.

Uses of Structural Similarity 

Practically speaking, the Similarity Principle states that compounds with similar structures will have similar properties. This is applied extensively in drug discovery, where similar compounds are assumed to have similar bioactivity, and small changes are made to target compounds to improve some properties slightly while keeping the already favorable ones. Applying the informatics techniques above, chemists can quantify structural similarity in their data sets when executing these workflows.

Structure-Activity Relationship 

When assessing Structure-Activity Relationships (SAR) within a data set, ordering compounds by similarity can help reveal trends in the data. Deviations from the Similarity Principle, referred to as activity cliffs, can provide key insights. For example, when two similar compounds differ by only one structural moiety but show a drastic difference in a measured property, that moiety has a strong effect on the property studied.
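One simple way to flag candidate activity cliffs is to look for compound pairs with high similarity and a large activity gap. The sketch below assumes fingerprints are given as sets of on-bit positions and activities as pIC50-like values. The compound names, data, and both thresholds are hypothetical and would need tuning for a real data set.

```python
from itertools import combinations

def tanimoto(x, y):
    """Tanimoto similarity between two sets of on-bit positions."""
    return len(x & y) / len(x | y)

def activity_cliffs(compounds, min_similarity=0.75, min_activity_gap=2.0):
    """Return (name_1, name_2, similarity, gap) for similar pairs with a large activity gap."""
    cliffs = []
    for (name_1, (fp_1, act_1)), (name_2, (fp_2, act_2)) in combinations(compounds.items(), 2):
        similarity = tanimoto(fp_1, fp_2)
        gap = abs(act_1 - act_2)
        if similarity >= min_similarity and gap >= min_activity_gap:
            cliffs.append((name_1, name_2, similarity, gap))
    # Largest activity gap first.
    return sorted(cliffs, key=lambda row: row[3], reverse=True)

# Hypothetical data set: name -> (on-bit set, activity value).
compounds = {
    "cpd_1": ({1, 2, 3, 4, 5, 6, 7, 8}, 8.1),
    "cpd_2": ({1, 2, 3, 4, 5, 6, 7, 9}, 5.6),      # close analog, much less active
    "cpd_3": ({20, 21, 22, 23}, 5.0),              # different scaffold
    "cpd_4": ({1, 2, 3, 4, 5, 6, 7, 8, 10}, 7.9),  # close analog, similar activity
}

for name_1, name_2, similarity, gap in activity_cliffs(compounds):
    print(f"{name_1} / {name_2}: similarity {similarity:.2f}, activity gap {gap:.1f}")
```

Only the cpd_1 / cpd_2 pair is reported: cpd_3 differs strongly in activity but is not similar, and cpd_4 is similar but not different in activity.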

Design and Virtual Screening

When designing virtual compounds it is important to know the similarity of your ideas to real compounds. With property predictions, the more similar your virtual compound is to the training data the more reliable your predictions will be. This becomes especially important with highly specific predictions, such as target protein activity, where the training data may be narrow or sparse. 
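A common, simple proxy for this is the similarity of a virtual compound to its nearest neighbor in the training set. The sketch below assumes fingerprints are sets of on-bit positions. The 0.5 cutoff and the example data are hypothetical and are not a recommended threshold.

```python
def tanimoto(x, y):
    """Tanimoto similarity between two sets of on-bit positions."""
    return len(x & y) / len(x | y)

def nearest_training_similarity(query, training_set):
    """Highest Tanimoto similarity between the query and any training compound."""
    return max(tanimoto(query, fingerprint) for fingerprint in training_set)

def in_domain(query, training_set, threshold=0.5):
    # Treat a prediction as more trustworthy if the query has a close training neighbor.
    return nearest_training_similarity(query, training_set) >= threshold

# Hypothetical training fingerprints and two virtual compound ideas.
training_set = [{1, 2, 3, 4}, {3, 4, 5, 6}, {10, 11, 12}]
close_idea = {1, 2, 3, 5}
distant_idea = {20, 21, 22}

print(nearest_training_similarity(close_idea, training_set), in_domain(close_idea, training_set))
print(nearest_training_similarity(distant_idea, training_set), in_domain(distant_idea, training_set))
```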

Similarity metrics can also be used to screen existing libraries for new uses. Starting with a reference compound, usually one with established desired activity, an existing library (corporate database, public literature database, or commercial catalog) can be filtered down with similarity metric(s) to a more reasonable size. These most similar compounds, known as hits, can then be selected for further activity screening without the need to synthesize novel compounds. 
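The filtering step can be sketched as a ranked similarity search. This minimal example assumes fingerprints are already available as sets of on-bit positions. The library entries, the threshold, and the `top_k` parameter are illustrative assumptions.

```python
def tanimoto(x, y):
    """Tanimoto similarity between two sets of on-bit positions."""
    return len(x & y) / len(x | y)

def similarity_search(reference, library, threshold=0.5, top_k=None):
    """Return (name, similarity) hits at or above `threshold`, most similar first."""
    hits = [
        (name, tanimoto(reference, fingerprint))
        for name, fingerprint in library.items()
    ]
    hits = [hit for hit in hits if hit[1] >= threshold]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:top_k]  # top_k=None keeps every hit

# Hypothetical reference compound and library.
reference = {1, 2, 3, 4, 5, 6}
library = {
    "lib_A": {1, 2, 3, 4, 5, 6, 7},
    "lib_B": {1, 2, 3, 9, 10},
    "lib_C": {1, 2, 3, 4, 5},
    "lib_D": {30, 31},
}

for name, similarity in similarity_search(reference, library):
    print(f"{name}: {similarity:.3f}")
```

For large libraries the same logic is usually run inside a cheminformatics database or toolkit rather than a Python loop, but the ranking and cutoff steps are the same.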

Conclusion

Structural similarity plays a vital role in small molecule drug discovery. The topic is vast and ever growing, especially as advances are made in its application to topics like AI/ML and large data searching. The references highlighted in this article provide a more comprehensive and deeper overview on the subject while also providing a plethora of additional sources to dig even further. 

References

  1. Lopez-Perez, K.; Avellaneda Tamayo, J.; Chen, L.; Lopez Lopez, E.; Juarez Mercado, K. E.; Medina Franco, J. L.; Miranda-Quintana, R. Molecular Similarity: Theory, Applications, and Perspectives. 2023. https://doi.org/10.26434/chemrxiv-2023-cs3wb
  2. Tarcsay, A.; Volford, A.; Buttrick, J.; Christopherson, J.-C.; Erdős, M.; Szabó, Z. B. Navigating Chemical Space. In Computational Drug Discovery; Wiley K&L, 2024; pp 337-364. ISBN: 978-3-527-84073-1
