Challenges in Searching Ultra-large Chemical Libraries

Posted by Phil McHale on 03 09 2021

As all good Trekkies know, “space [is] the final frontier”, and it keeps throwing out challenges, such as landing a satellite on an asteroid or piloting a helicopter on Mars, that human ingenuity has met and overcome. Chemical space is a frontier of the same kind. As real and virtual chemical libraries continue to grow, with sizes now best expressed as powers of ten, traditional cheminformatics tools and hardware infrastructures begin to falter. Apparently simple, yet fundamentally important, drug design questions like “is this compound in the library?” or “what are the ten most similar compounds in this collection?” become difficult, if not impossible, to answer with acceptable response times in these ultra-large spaces. This article looks at the current challenges in searching large chemical libraries and considers approaches that might deliver “helicopter on Mars” level breakthroughs to drug designers, medicinal chemists, and cheminformaticians striving to explore the far reaches of chemical space and track down the next optimal, novel, synthetically-accessible bioactive structure.


As pharma companies, public content depositories, and commercial chemical suppliers and aggregators consolidated and organized their various libraries of known, registered, published, or commercially available compounds, file sizes began to approach 10⁷ – 10⁸ structures (e.g. eMolecules Plus 10⁷; PubChem, Sigma Aldrich 10⁸). But there was still a desire to extend the reach and diversity of explorable chemical space to augment the real compounds with synthetically feasible virtual compounds.

Pfizer pioneered a combinatorial technique incorporating validated reaction transformation information from its in-house ELN records combined with available building blocks to generate a massive virtual library (PGVL: 10¹⁶ virtual, synthetically-feasible compounds) for exploration. Other companies and groups built on this reaction-based combinatorial approach, bringing in their own proprietary knowledge and literature-based reaction information from sources like Reaxys, and expanding the library sizes.

DNA-encoded libraries (DELs) offer a complementary type of ultra-large library. Real (as opposed to virtual) compounds are created in tiny amounts, typically using split-and-pool combinatorial chemistry, to generate massive mixtures containing billions or even trillions of compounds, with each compound’s synthetic reaction provenance encoded in an attached DNA tag. DELs can be tested in single-pot affinity assays where molecules that bind to the target are enriched, and non-binders are washed away. The binding compounds are deconvoluted and decoded by PCR amplification and purification, followed by DNA sequencing of the tags, which then identifies the discrete structures. Although the structures are of real compounds, representing them in a virtual library remains a challenge.

At an NIH-sponsored workshop on ultra-large chemistry databases held in December 2020, several participants highlighted this continuous growth in commercial (e.g. Enamine REAL Space, 10¹⁰ structures), public (e.g. BioSolveIT KnowledgeSpace, 10¹⁵ structures), and proprietary (e.g. GSK GSKchemspace, 10²⁶ structures) ultra-large chemical collections and libraries of real and/or virtual synthetically-accessible compounds for exploration. The demand for efficient methods of representing, storing, and searching libraries of this size is growing in parallel.


Finding the optimal way to represent compounds in ultra-large libraries is a challenge. Current cheminformatics tools and computer hardware cannot yet efficiently search fully-enumerated, explicitly-described ultra-large compound sets, and they become unacceptably slow at 10⁸ structures or more. As an example, with appropriate memory and special hardware, an enumerated library of 10⁶ structures occupies 3.8 MB and can be exhaustively searched in a very acceptable 1 second, but an enumerated library of 10¹² structures occupies 3.8 TB and would take 12 days to search.
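The jump from 1 second to 12 days is what straight linear scaling predicts. The short Python sketch below reproduces the arithmetic from the two quoted reference figures (3.8 MB and 1 second for a million structures). It assumes that both storage and exhaustive-search time grow in direct proportion to library size; the function name `extrapolate` is purely illustrative.

```python
from typing import Tuple

# Reference figures quoted in the article for a 10^6-structure library.
BASE_COUNT = 10**6
BASE_SIZE_MB = 3.8
BASE_SEARCH_SECONDS = 1.0


def extrapolate(count: int) -> Tuple[float, float]:
    """Return (size in MB, search time in seconds) for `count` structures.

    Assumption: storage and exhaustive-search time both scale linearly.
    """
    factor = count / BASE_COUNT
    return BASE_SIZE_MB * factor, BASE_SEARCH_SECONDS * factor


if __name__ == "__main__":
    for exponent in (6, 8, 10, 12):
        size_mb, seconds = extrapolate(10**exponent)
        # 1 TB is taken as 10^6 MB (decimal units); 86,400 seconds per day.
        print(f"10^{exponent} structures: {size_mb / 1e6:.6f} TB, "
              f"{seconds / 86400:.2f} days")
```

For 10¹² structures this gives 3.8 TB and about 11.6 days, which is the "12 days" quoted above.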

The currently available techniques to overcome this hardware/performance barrier are to represent and store unenumerated libraries using reduced compound descriptors such as Feature Trees, and then to run initial “fuzzy pharmacophore” similarity searches on the unenumerated library. This approach can generate smaller, more-focused, tractable hit sets in an acceptable time, and these can then be enumerated and subjected to more detailed structure and physicochemical property searching.
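A toy Python sketch can show the shape of this two-stage idea. It is not Feature Trees: the "descriptor" here is a made-up triple of counts, and the sketch assumes a product's descriptor is simply the sum of its reagents' descriptors. Stage 1 also uses exact matching on that coarse descriptor, where a real tool would use a similarity score. All reagent names and values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy coarse descriptor: (H-bond donors, H-bond acceptors, rings).
Descriptor = Tuple[int, int, int]

# Hypothetical reagent sets for a two-component library (3 x 4 = 12 products).
REAGENTS_A: Dict[str, Descriptor] = {
    "A1": (1, 0, 1),
    "A2": (0, 2, 1),
    "A3": (2, 1, 0),
}
REAGENTS_B: Dict[str, Descriptor] = {
    "B1": (0, 1, 1),
    "B2": (1, 1, 0),
    "B3": (1, 0, 1),
    "B4": (0, 2, 1),
}


def coarse_search(query: Descriptor) -> List[Tuple[str, str]]:
    """Stage 1: find reagent pairs whose summed descriptor equals the query.

    Only the reagent lists are visited; no product is ever built.
    """
    index: Dict[Descriptor, List[str]] = defaultdict(list)
    for b_name, b_desc in REAGENTS_B.items():
        index[b_desc].append(b_name)

    hits = []
    for a_name, a_desc in REAGENTS_A.items():
        # Descriptor the B reagent must have for the pair to match the query.
        needed = tuple(q - a for q, a in zip(query, a_desc))
        for b_name in index.get(needed, []):
            hits.append((a_name, b_name))
    return hits


def enumerate_hits(pairs: List[Tuple[str, str]]) -> List[str]:
    """Stage 2: enumerate only the focused hit set for detailed searching."""
    return [f"{a}-{b}" for a, b in pairs]


if __name__ == "__main__":
    pairs = coarse_search((1, 2, 2))
    print(enumerate_hits(pairs))  # ['A1-B4', 'A2-B3']
```

The point is the cost profile: stage 1 touches 3 + 4 reagents rather than 12 products, and only the two matching products are enumerated for the more detailed structure and property searches.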

This approach works well for pharmacophore similarity/dissimilarity searches that are often used for lead optimization and scaffold hopping, but it does not support full substructure searches across the whole library, or searches combining important physicochemical or topological properties such as LogP or number of H-bond donors. Feature Trees lack the ability to handle ring substitution patterns and stereochemistry, so this technique cannot answer important questions such as “is this exact structure in my non-enumerated set?” or “what is the most similar set of compounds to this potential lead in my database?” There are some newer approaches that offer substructure search, but there are questions about how well they scale to libraries of more than 10⁹ structures.

One other challenge with ultra-large libraries is how to share them as seamlessly as possible. If an organization wants to receive and load a complete virtual library on their own servers or in a private cloud for further in-house processing and analysis, being sent a large set of files with multiple SMILES in each is not a workable solution.

Addressing the challenges

Current methods of representing discrete enumerated structures tend to take up too much space for use in an ultra-large library, so a more compact non-enumerated representation is needed that retains full fidelity with each compound’s structure, including all atom types, connectivity, and stereochemistry. The Compact Virtual Library (Compact VL) format has been developed to address these needs.

Based on the familiar MDL v2000 SDfile format, Compact VL adds restriction rules and descriptors to the data field sections of the SDfile so that reaction transformation information can be stored in a more compact manner. This gives the ability to store a complete virtual library in a single SDfile. As an example, for a simple two-component reaction A + B → C, with 5,000 each of reactants A and B, Compact VL can store the resultant combinatorial library in an SDfile with 10,001 entries, rather than the 25 million entries a fully enumerated file would need. Additional fields enable searching of sub-libraries within larger collections.
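The saving in that example is the difference between a sum and a product. The Python sketch below works through the numbers, on the assumption that the 10,001 entries are the 5,000 + 5,000 reactants plus one record for the reaction (the article does not spell out the breakdown). The `product_to_reactants` helper is an illustrative addition, not part of the Compact VL format: it shows how any one of the 25 million products could be addressed on demand without storing it.

```python
from typing import Tuple


def stored_entries(n_a: int, n_b: int, n_reactions: int = 1) -> int:
    """Entries in a non-enumerated file: reactants plus reaction records.

    Assumption: one entry per reactant and one per reaction.
    """
    return n_a + n_b + n_reactions


def enumerated_products(n_a: int, n_b: int) -> int:
    """Entries in a fully enumerated file: one per A x B combination."""
    return n_a * n_b


def product_to_reactants(index: int, n_b: int) -> Tuple[int, int]:
    """Map a product number to its (A position, B position) pair.

    Hypothetical row-major numbering: product = a * n_b + b.
    """
    return divmod(index, n_b)


if __name__ == "__main__":
    n_a = n_b = 5_000
    print(stored_entries(n_a, n_b))        # 10001
    print(enumerated_products(n_a, n_b))   # 25000000
    print(product_to_reactants(12_345_678, n_b))  # (2469, 678)
```

The gap widens quickly with more components: a three-component reaction with 5,000 reactants each would need 15,001 stored entries under the same assumption, against 1.25 × 10¹¹ enumerated products.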

Virtual libraries in this format can be generated by combining reaction transformation files with reagents in SDfiles using currently available cheminformatic toolkits, and as an example, KNIME workflows have been written to produce Compact VL. This results in a virtual library that can be shared as a single SDfile, and loaded for further analysis and searching.

Research into efficiently searching Compact VL-formatted libraries is ongoing, with the intent to enhance similarity searching and to add full substructure search. Acceptable searching performance might be achieved by distributing large libraries across scalable search systems.
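As a rough illustration of what distributing a search can mean, the Python sketch below splits a library into shards, asks each shard for its own best hits, and merges the partial results into a global top-k list. It is a generic scatter-gather pattern under simplifying assumptions, not a description of any particular product's design: fingerprints are toy sets of bit positions, similarity is the Tanimoto coefficient, and threads stand in for separate search nodes. All names and data are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # toy fingerprint: the set of "on" bit positions
Hit = Tuple[float, str]       # (similarity, compound name)


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by total distinct bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def search_shard(shard: Dict[str, Fingerprint], query: Fingerprint, k: int) -> List[Hit]:
    """Return the k most similar compounds within a single shard."""
    scored = ((tanimoto(query, fp), name) for name, fp in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards: List[Dict[str, Fingerprint]], query: Fingerprint, k: int) -> List[Hit]:
    """Search every shard in parallel, then merge the per-shard top-k lists."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: search_shard(shard, query, k), shards)
        return heapq.nlargest(k, (hit for part in partials for hit in part))


if __name__ == "__main__":
    shards = [
        {"cpd1": frozenset({1, 2, 3}), "cpd2": frozenset({1, 2})},
        {"cpd3": frozenset({1, 2, 3, 4}), "cpd4": frozenset({7, 8})},
    ]
    print(search_all(shards, frozenset({1, 2, 3}), k=2))
    # [(1.0, 'cpd1'), (0.75, 'cpd3')]
```

Because each shard returns at most k candidates, the merge step stays small however large the shards are, so the work per node, rather than the merge, sets the response time.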

Novel approaches to searching unenumerated libraries include algorithms similar to LEAP2 developed by Pfizer, and substructure and Markush-based techniques, which may finally allow researchers to get answers to their important, previously unanswerable structure-based questions as they explore these ultra-large virtual libraries.

If you would like to understand more about ChemAxon's solutions to efficiently searching large virtual libraries, check out this presentation.