Reactor in Large Library Workflows

Posted by Jan C. Christopherson on 2024-08-08

2024-08-14 Reading time:

Billion-scale libraries are here to stay

Figure 1. The chemical space of available small molecules is continuously increasing.

In recent years, the exploration of chemical space has advanced to a new scale, with libraries often spanning billions of molecules. This growth is spurred by technological advances on several fronts, including readily available cloud computing, novel screening methodologies such as DNA-encoded libraries (DEL), and automated laboratories.

The pervasiveness of “synthesis on demand” libraries makes it feasible to expand these investigations beyond in silico methods.

As AI/ML approaches gain traction, it seems inevitable that interest in libraries of this scale will continue to grow, since they can act as a source of input data.

Unique challenges, and intriguing new representation possibilities

Libraries of this scale present a new set of challenges to the informatics infrastructure used to interrogate them. Most RDBMS cartridges and similar tools would be hard-pressed to store and search at this scale in a way that meets users’ speed expectations within their hardware limitations.

A number of non-enumerated formats have been developed that compress the volume of data needed to represent the library. A common method to do this is to represent the library as a set of starting materials, combined with a chemical transformation such as a reaction. This allows for a small set of starting materials to represent a much larger library.
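To make the compression concrete, here is a minimal sketch of the arithmetic, assuming a single reaction in which every combination of reactants is valid. The function names and the reactant counts are hypothetical and chosen only for illustration.

```python
from math import prod


def library_size(reactant_counts):
    """Number of products implied by one reaction, assuming every combination is valid."""
    return prod(reactant_counts)


def stored_records(reactant_counts):
    """Number of structures that actually have to be stored in the non-enumerated form."""
    return sum(reactant_counts)


# Hypothetical three-component reaction with 1,000 reactants per position.
counts = [1000, 1000, 1000]
print(library_size(counts))    # 1000000000 products represented
print(stored_records(counts))  # 3000 reactants stored
```

In this idealized case, three thousand stored reactants stand in for a billion products. Real libraries will be smaller than the product of the counts wherever reactivity rules exclude some combinations.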

Storing libraries this way requires new search methods that can search both the starting materials and their combination into products. These methods include LEAP2 and FTrees, which we’ve previously discussed in greater detail.

In order to get to this stage we have to first create such a library and process it; there are two common methods of doing so.

The first is to replace reacting atoms with labels that indicate where the transformation will occur. The second is to store the vanilla reactants and use a chemical transformation engine to perform structure-based matching and transformation as needed.
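The difference between the two storage forms can be sketched with plain data records. This is only an illustration and not Reactor's API or file format: the class names, identifiers, and the `%91` label are assumptions. The join relies on the SMILES convention that matching ring-closure labels on either side of a `.` form a bond.

```python
from dataclasses import dataclass

LABEL = "%91"  # arbitrary ring-closure label used here as the attachment marker


@dataclass(frozen=True)
class LabeledSynthon:
    """Method 1: the reacting atoms are already replaced by an attachment label."""
    source_id: str
    fragment: str  # SMILES with an open ring-closure label at the attachment point


@dataclass(frozen=True)
class PlainReactant:
    """Method 2: unmodified structure; a transformation engine matches it later."""
    source_id: str
    smiles: str


def join_synthons(a, b):
    """Join two labeled fragments by pairing their labels across a '.' separator."""
    return f"{a.fragment}.{b.fragment}"


# Method 1: acetic acid with the leaving OH removed, and ethylamine labeled at N.
acyl = LabeledSynthon("acid-1", f"CC{LABEL}=O")
amine = LabeledSynthon("amine-1", f"N{LABEL}CC")
print(join_synthons(acyl, amine))  # CC%91=O.N%91CC (N-ethylacetamide)

# Method 2: the same compounds stored untouched; no reaction is applied here.
plain_acid = PlainReactant("acid-1", "CC(=O)O")
plain_amine = PlainReactant("amine-1", "NCC")
```

The labeled form makes the product trivial to assemble but fixes the reaction at storage time, while the plain form keeps the full structure and defers matching to the engine.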

Interrogate the product space efficiently with Chemical Space Docking

Docking is a common tool in most drug discovery teams’ arsenals and should need little introduction. Determining potential poses of a small molecule in the context of the target protein of interest, and ranking a number of candidates by their docking scores, is a common step in prioritizing ideated compounds.

While docking requires relatively little computational effort compared to physics-based modeling such as FEP or quantum mechanical methods, it remains challenging to perform at the scale of these new libraries.

Recently, an approach named “Chemical Space Docking” has been gaining interest as an efficient means to interrogate such large libraries using docking methods. A full introduction to the approach is beyond the scope of this piece, but I would encourage readers who need a primer to take a look at some of the references.

In brief, the method uses an iterative sequence of fragment-based docking procedures. At each step, the best results from the previous iteration are selected and then enumerated with other fragments known to be compatible with their reactive terminal end. The previous iteration’s best poses are used as constraints, making the next docking iteration more computationally efficient by reducing the number of poses to be sampled.
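The select-and-extend loop described above can be sketched as a small beam-style search. This is a toy model under stated assumptions, not a docking implementation: `toy_score`, the fragment names, the per-fragment scores, and the compatibility table are all hypothetical, lower scores are treated as better, and pose constraints are not modeled.

```python
def dock_iteratively(seeds, compatible, score, keep=2, rounds=2):
    """Keep the best candidates each round and extend them with compatible fragments."""
    ranked = sorted(seeds, key=score)[:keep]
    for _ in range(rounds):
        # Extend each surviving candidate with fragments compatible with its last fragment.
        grown = [cand + (frag,) for cand in ranked
                 for frag in compatible.get(cand[-1], ())]
        if not grown:
            break  # nothing left to attach
        ranked = sorted(grown, key=score)[:keep]
    return ranked


# Hypothetical per-fragment scores standing in for real docking scores.
fragment_score = {"A1": -3.0, "A2": -1.0, "A3": -2.5,
                  "B1": -2.0, "B2": -0.5, "C1": -1.5}
# Hypothetical compatibility of each fragment's reactive end with further fragments.
compatible = {"A1": ["B1", "B2"], "A2": ["B2"], "A3": ["B1"], "B1": ["C1"]}


def toy_score(candidate):
    """Placeholder for a docking run: sum of the fragment scores."""
    return sum(fragment_score[f] for f in candidate)


seeds = [("A1",), ("A2",), ("A3",)]
result = dock_iteratively(seeds, compatible, toy_score, keep=2, rounds=2)
print(result)  # [('A1', 'B1', 'C1'), ('A3', 'B1', 'C1')]
```

Only the candidates that survive each round are extended, so the number of evaluations grows with the `keep` setting rather than with the size of the full product space.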

Diagram the process

Diagram 1. Chemical space docking

Creating an appropriate representation is a cumbersome prerequisite

Implementing such a workflow is no doubt a difficult, yet worthwhile task. Given a set of possible reactants, how can we easily identify and keep track of those that can undergo the desired transformation?

A number of commercial libraries are available in such representations; however, a chemist may also want to include their organization’s proprietary compounds in such a search. Additionally, relying only on public information does not give an organization an edge over competitors in its therapeutic target space.

The ability to perform an appropriate set of find-and-replace operations in a chemical context is therefore necessary. Chemaxon’s Reactor is a tool specifically designed for this purpose. It lets the user specify the desired matching conditions and the modified output representation with best-in-class chemical accuracy, making it possible to reliably create input data for docking workflows.

Case study: Amide synthesis from carboxylic acid

A small case study was prepared, performing amide synthesis from a carboxylic acid.

We filtered the first 1000 small molecules from ChEMBL, having first removed all empty structures.

Diagram 2 provides a step-by-step walkthrough of the procedure followed. In addition to the extra effort of configuring multiple steps in the process, a few measurable quantities stood out.

First, the final step in the R-group capping method was several orders of magnitude slower than the pre-filtered, structure-recognition-based method.

Diagram 2. Amide synthesis from carboxylic acid

Second, enumeration using R-groups or similar end caps generates far more products. While this might seem appealing at first, we believe it simply leads to far more unreasonable compounds that either a user or a separate filtering process will later have to remove.

Inherent limitations of Synthon representations

One method of representing reactive sites is to remove the atoms that the transformation would eliminate and replace them with an “R” group label. The resulting structures may be referred to as “synthons”.

While this provides a straightforward way to keep track of reactivity for that particular reaction, there are a number of drawbacks:

- It raises questions about how representations should be handled when there are multiple matching moieties on the reactant
- Out of context, it removes what could be important information about what the reacting site is
  - If updates are made to the reactivity rules, it could be difficult to backtrack and regenerate the matching set of reactants
- If multiple reactions are in scope, it raises questions about how to treat a compound that can undergo multiple different reactions
  - If multiple R groups are attached to the same compound, this requires the management of an R group library
  - Alternatively, storing the same reactant multiple times with different R groups reduces storage efficiency and could lead to redundant docking computation if not carefully managed

Simple approaches will often apply an R-group to any matching moiety; however, as chemists we know that a match in itself does not guarantee reactivity. Applications that can calculate properties of either the entire structure or a substructure, for use as filtering or prioritization conditions, will improve the accuracy of the reactivity prediction.
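As a rough sketch of property-based filtering, the snippet below keeps only reactants that pass a simple rule before any R-group is applied. The rule, the weight threshold, and the field names are hypothetical placeholders rather than validated reactivity criteria, and the property values are assumed to have been precomputed by a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reactant:
    name: str
    smiles: str
    mol_weight: float  # g/mol, assumed precomputed elsewhere
    acid_groups: int   # number of carboxylic acid matches, assumed precomputed


def is_candidate(reactant, max_weight=300.0):
    """Hypothetical rule: exactly one reacting group and a modest molecular weight."""
    return reactant.acid_groups == 1 and reactant.mol_weight <= max_weight


reactants = [
    Reactant("acetic acid", "CC(=O)O", 60.05, 1),
    Reactant("benzoic acid", "OC(=O)c1ccccc1", 122.12, 1),
    Reactant("succinic acid", "OC(=O)CCC(=O)O", 118.09, 2),  # two matching sites
]

selected = [r.name for r in reactants if is_candidate(r)]
print(selected)  # ['acetic acid', 'benzoic acid']
```

Rejecting the diacid here sidesteps the question of which site should carry the label, at the cost of excluding a compound that a more careful rule might keep.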

While many commercially available libraries have undergone validation through many years of knowledge gathering (often reported along with library-wide percentage synthetic success rates), most organizations trying to create their own libraries do not have the resources to perform such exhaustive validation.

It is therefore important that the software tools that perform the matching assist the user in predicting the likelihood of success, from simple approaches that use molecule properties as candidate criteria to more specific, often machine-learning-supplemented retrosynthetic tools.

Reaction-based combinatorial representations are useful beyond Chemical Space Docking

As new approaches continue to be developed, we see more use cases for the representations that we have been discussing. One of the most prominent current use cases is DNA-encoded libraries. These libraries are intrinsically combinatorial, so all of the tools we have discussed for generating their representations and searching them are also relevant.

Tackling the challenge of large enumerations

At Chemaxon, we are interested in continuing our investigation of the benefits and challenges of large-scale enumerations.
While the computational resources needed to perform combinatorial enumeration may at first appear challenging, the availability of cost-effective platforms such as AWS Spot Instances means this is not insurmountable. This leads naturally to questions of appropriate batching, and management of the workflows being executed. In some cases, task management and distribution services native to cloud services, such as load balancers, may be preferred. In other cases, tools more tightly coupled to the enumeration service, such as the Task Manager capabilities of JChem Microservices Reactor, can be beneficial to the user.
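One simple way to approach the batching question is to walk the reactant combinations lazily and hand them out in fixed-size chunks, each of which could be submitted to a separate worker. This is a generic sketch and not the JChem Microservices Reactor API; the function name, the reactant identifiers, and the batch size are assumptions.

```python
from itertools import islice, product


def batched_combinations(reactant_sets, batch_size):
    """Yield lists of reactant combinations, at most batch_size per list."""
    combos = product(*reactant_sets)  # combinations are generated lazily
    while True:
        batch = list(islice(combos, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical reactant identifiers for a two-component reaction.
acids = ["acid-1", "acid-2", "acid-3"]
amines = ["amine-1", "amine-2"]

batches = list(batched_combinations([acids, amines], batch_size=4))
for index, batch in enumerate(batches):
    print(index, len(batch), batch)
```

Because the combinations are produced on demand, only one batch needs to be held in memory at a time, although `itertools.product` does keep a copy of each input reactant list.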

How often do you and your organization distribute workloads amongst multiple computational resources? How do you prefer to store combinatorial libraries? How do you explore them?

Jan-C. Christopherson

Senior Application Scientist

Jan (Yan) Christopherson is a Senior Application Scientist at Chemaxon, where he gives demonstrations and advises users on optimal cheminformatics processes. He also actively works to discover gaps in current workflows in order to provide innovative solutions. He has experience in the creation of testing methodologies in the production space, and in the integration of instrumentation with data-integrity-compliant management software, which he gained as a Laboratory Technical Specialist with Mettler Toledo. Jan’s research background lies in solid-state organic chemistry and the design of crystalline materials with novel optical and photo-mechanical properties, work performed during his undergraduate and Master’s studies at McGill University in Montreal.
