Tautomers… every chemist knows them, but tautomerism is usually forgotten until it turns into an issue.
As a synthetic or medicinal chemist, you might look at an NMR spectrum and notice that an alkylation occurred at an unexpected position because an unanticipated tautomeric form underwent the transformation. Or you might have searched your compound database for a ketone and ended up with two hits, a ketone and an enol. These clearly represent a tautomeric pair, yet each compound has slightly different assay data.
A typical scenario: you might have performed a search in a vendor’s web shop and ended up with two hits. These look like a tautomeric pair, but one of them is five times more expensive than the other. Now you are wondering: “These structures should actually represent the same compound! Can I place an order for the cheaper one?”
The above are smaller, practical issues that a skilled organic or medicinal chemist can handle. It is more concerning when a tautomeric form is not retrieved during a database search, as this can mean missing important information and can ultimately lead to duplication of data, work or procurement. This may be how the chemist ended up with two sets of assay data in the first place.
If we look outside the chemist’s laboratory, we can see that drug action, disposition, toxicity and delivery are all affected by tautomerism. Famous drug examples include warfarin and thalidomide [1,2]. Moving to materials science, azo dyes can serve as molecular photoswitches through a tautomerization mechanism.
It is sometimes next to impossible to tell which tautomer is actually in the bottle. Tautomerism is pH and temperature dependent, so a different tautomer might dominate in the freezer than at room temperature.
When creating, curating and using chemical data, one must ensure that chemical structures are represented in a standardized way, capturing stereochemical information accurately and matching tautomers. As such, when entering new data, we need to know whether a compound is already in the database under a different representation.
Herein, we will look at two cheminformatics aspects of tautomer handling: how tautomers are represented in a database, and how they are handled during a database search.
In databases, it is convenient to store only one structure which covers all possible tautomer forms. The exact selection might depend on individual preferences or business rules.
Chemaxon’s tautomer generator can generate different tautomer forms of a compound. The generator first identifies all possible proton donors and acceptors in a molecule and finds tautomerization paths between them. Then, depending on the desired tautomer form, the tautomerization algorithm can provide a set of all the possible tautomer forms or a single representation.
Three types of single structure representations can be obtained: generic, canonical and normal canonical tautomers.
Both canonical and normal canonical tautomers might appear to be excellent choices for representing structures in a database.
When performing a database search, false positives are usually more desirable than false negatives. False positive hits still allow the expert to select the desired outcome based on human judgement.
Chemaxon’s search engines are highly configurable, and tautomer search most often relies on generic tautomers, as these allow the retrieval of all possible hits. For example, in duplicate and full fragment searches, the generic tautomer of the query (representing all theoretically possible tautomers) is compared with the generic tautomer of the target. In a substructure search, on the other hand, the query itself is matched against the generic tautomer of the target.
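The duplicate search logic described above can be illustrated with a minimal Python sketch. It is not Chemaxon's API: the `tautomer_key` function is a hypothetical stand-in for a generic tautomer generator, implemented here as a hand-written lookup table over a few toy SMILES strings, and the record IDs are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for a generic tautomer generator: a hand-written
# table mapping each toy SMILES string to a key shared by its tautomers.
TOY_TAUTOMER_KEY = {
    "CC(C)=O": "acetone",              # keto form
    "CC(O)=C": "acetone",              # enol form
    "Oc1ccccn1": "2-pyridone",         # 2-hydroxypyridine
    "O=c1cccc[nH]1": "2-pyridone",     # 2-pyridone
}


def tautomer_key(smiles: str) -> str:
    """Return the shared key; unknown structures fall back to themselves."""
    return TOY_TAUTOMER_KEY.get(smiles, smiles)


def build_index(records):
    """Group record IDs by the tautomer key of their stored structure."""
    index = defaultdict(list)
    for record_id, smiles in records:
        index[tautomer_key(smiles)].append(record_id)
    return index


def duplicate_search(index, query_smiles: str):
    """Compare the key of the query with the keys of the targets."""
    return index.get(tautomer_key(query_smiles), [])


records = [("ID-1", "CC(C)=O"), ("ID-2", "CC(O)=C"), ("ID-3", "CCO")]
index = build_index(records)

# Querying with the enol retrieves both the ketone and the enol record.
print(duplicate_search(index, "CC(O)=C"))
```

Because both the query and the targets are reduced to the same key before comparison, the result does not depend on which tautomer was drawn or registered.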
In all cases of common tautomer representations (i.e. generic, canonical and normal canonical), we need to ensure that the same output structure is generated whichever of the possible tautomers we subject to the algorithm. In the case study below, we used the Tautobase database to evaluate the efficiency of Chemaxon’s tautomer generation algorithms.
Tautobase is an open-source tautomer database containing 1680 tautomer pairs [iv1]. We decided to use this database to benchmark our tautomer generator. Data curation was performed using KNIME.
The original SMIRKS data was transformed into two SMILES datasets representing the first and second tautomer of the pairs. All structures were subjected to standardization and structure checking. Finally, we filtered for those transformations that were studied in water, ending up with 922 tautomer pairs.
Next, we investigated the outcome of the three different tautomer generation methods starting from tautomer 1 and tautomer 2, respectively. When generic tautomers were generated, we observed >98% overlap (908 structures out of 922) between the two sets. Canonical tautomers provided 92% overlap (844 structures out of 922), while for normal canonical forms this value was 78% (718 structures out of 922).
| Tautomer form | Number of tautomers matched | Percentage of tautomers matched |
|---|---|---|
| Generic tautomers | 908/922 | 98% |
| Canonical tautomers | 844/922 | 92% |
| Normal canonical tautomers | 718/922 | 78% |
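The overlap values in the table can be recomputed from the reported counts. The short Python sketch below does this, and also shows how a per-pair comparison could be counted. The function names and the toy representation strings are assumptions made for illustration; they are not part of the KNIME workflow or of Chemaxon's tools.

```python
TOTAL_PAIRS = 922

# Matching pair counts as reported in the table above.
REPORTED_MATCHES = {
    "generic": 908,
    "canonical": 844,
    "normal canonical": 718,
}


def overlap_percent(matched_pairs: int, total_pairs: int) -> float:
    """Share of pairs whose two tautomers gave the same representation."""
    return 100.0 * matched_pairs / total_pairs


def count_matching_pairs(representations) -> int:
    """Count pairs where the structure generated from tautomer 1 equals
    the structure generated from tautomer 2."""
    return sum(1 for from_first, from_second in representations
               if from_first == from_second)


for form, matched in REPORTED_MATCHES.items():
    percent = overlap_percent(matched, TOTAL_PAIRS)
    print(f"{form}: {matched}/{TOTAL_PAIRS} = {percent:.1f}%")

# Toy illustration with placeholder strings instead of real structures:
# two pairs agree and one does not.
toy_representations = [("A", "A"), ("B", "B"), ("C", "D")]
print(count_matching_pairs(toy_representations))
```

Rounded to whole numbers, the three ratios give the 98%, 92% and 78% shown in the table.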
Figure showing examples for matching and non-matching generic tautomers.
The excellent match between the generic tautomer sets is the most significant result in terms of search. It shows that when generic tautomer search is applied in Chemaxon’s JChem 2nd Gen search engine, we can have high confidence in retrieving the appropriate hits, regardless of which tautomer form is used as the query or target structure.
Technical details:
References:
1. P. V. Bharatam, O. R. Valanju, A. A. Wani, D. K. Dhaked, Drug Discov. Today 2023, 28, 103494
2. A. R. Katritzky, C. D. Hall, B. E.-D. M. El-Gendy, B. Draghici, J. Comput. Aided Mol. Des. 2010, 24, 475
iv1. O. Wahl, T. Sander, J. Chem. Inf. Model. 2020, 60, 1085