Making the Most of Approximate Maximum Common Substructure Search

poster · 6 years ago
by Péter Kovács, Péter Englert (ChemAxon, Eötvös Loránd University)
JKlustor JChem Base

The maximum common substructure (MCS) problem is of great importance in multiple aspects of chemoinformatics. It has diverse applications ranging from lead prediction to automated reaction mapping and visual alignment of similar compounds. Many different solutions have been developed over the years [1], both exact and approximate. Most of the applications have time constraints, either because they are used in interactive systems, or because they rely on finding the MCS of a large number of molecule pairs. Since the problem, either formulated as finding the maximum common induced subgraph (MCIS), or the maximum common edge subgraph (MCES) of molecular graphs, is NP-complete, this time constraint can only be realistically satisfied by approximate algorithms. This has resulted in an increased interest in efficient heuristics for solving the MCS problem.

We developed two efficient approximate algorithms. One is based on the popular approach of reducing the MCS problem to finding the maximum clique in the modular product of the two molecule graphs. The other is based on a new algorithm by Takeshi Kawabata, called the build-up method [2]. We also incorporated other ideas, for example the topological fingerprinting primarily used in substructure and similarity searching.

We optimized our algorithms for use in multiple applications developed at ChemAxon. In some applications, for example hierarchical MCS-based clustering or similarity search in databases, the algorithms are required to give close to optimal result in limited time. To meet these conflicting demands, strong heuristics were used. The algorithms were also augmented with upper bound calculation methods for screening purposes, and to make early termination possible if an optimal result is found.

In other applications, for example reaction mapping or visual alignment, the challenge is that topological features must also be taken into account. Apart from the size of the found common substructure, the one-to-one correspondence between the atoms of the molecules that it defines is also very important. The relative position of the fragments of the common substructure when embedded in the input molecules should be as similar as possible. For this purpose, heuristics were used to guide the algorithm to prefer these solutions. The possibility to enumerate multiple different mappings of the same common substructure was also added, so that the preferred solution can then be selected by the application.

Multiple implementations of these algorithms have been developed and thoroughly tested and benchmarked. They have also been compared to publicly available solutions, and integrated into different products at ChemAxon. This has shown that great care must be taken if a general-purpose, efficient and robust MCS algorithm is to be developed, however, it is possible for a good implementation to adequately cover all of the use cases. We present our algorithms and the most important heuristics along with computational experiments about their effect on performance as well as the size and features of the results.

[1] John W. Raymond and Peter Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design, 16:521–33, 2002. [2] Takeshi Kawabata. Build-up algorithm for atomic correspondence between chemical structures. Journal of Chemical Information and Modeling, 51(8):1775–87, 2011.

Open poster in pdf