Introduction

US ChemAxon users gathered at the Sheraton Fisherman’s Wharf, San Francisco, for the annual meeting. A hands-on workshop and biomolecule forum took place on the morning of the first day, followed, after lunch, by roundtable discussions on big data management, cheminformatics in the cloud, and user needs in cheminformatics. The day concluded with pre-dinner drinks and a dinner in the hotel. The bulk of this report is concerned with the second day, which also concluded with a jolly social event, this time in a more informal location in the hotel.

Reaxys presentation on the San Francisco 2017 UGM stage

Chemical Data Extraction

Structure and reaction querying in Reaxys

Elsevier’s CrossFire system was launched in 1993 to enable searching of the Beilstein and Gmelin databases. Elsevier also launched a Patent Chemistry Database in 2005. Reaxys, a Web-based system based on data from Beilstein, Gmelin, and the Patent Chemistry Database, was launched in 2009 as a workflow system for synthetic chemists. Major upgrades were introduced in 2013, 2014 and 2016. Today, Reaxys plus Reaxys Medicinal Chemistry offers access to 54 million document/bibliographic records (from patents and 16,000 journal titles); 105 million substances with 500 million experimental properties in over 400 fields; more than 43 million reactions; 33 million experimental bioactivity data points; and 12,700 targets including species information.

Derrick Umali of Elsevier outlined the excerption and database production processes, and the quality control measures used to ensure the high quality of the data in Reaxys. The system aims to deliver immediate access to information: it focuses on use of the information, rather than on searching. To this end, Reaxys combines a first-class user interface, state-of-the-art indexing and meticulous data excerption, world-class taxonomies, and a unique database. The 2016 version interprets either natural language, or truncation and operators; recognizes the search intention (e.g., reactions or documents); and delivers a ranked list of alternative suggestions as results. Users thus get intended answers directly, plus relevant results they may not have considered.

Since 2009, finding structures and reactions graphically in Reaxys has required sophisticated cheminformatics software, and Marvin tools are used in Reaxys in many ways. They offer a structure input editor that includes support for 3D structures; a feature-rich and easy-to-use query generation powerhouse; and ways of rendering chemical structures and reactions. The advantages of Marvin tools are high-quality structure rendering; flexible APIs and integration capabilities; feature-rich and customizable search options; a non-Java application; and comprehensive import and copy-and-paste features.

Predicting off-target toxicities

Numerate seeks to overcome major challenges in drug discovery by applying novel machine-learning algorithms, at cloud scale, to the problems of small-molecule drug design. Brandon Allgood explained that more than 9 million medicinal chemistry and biological activity data points from the literature since the 1950s are integrated in the system, and about 6,600 models for 2,000 targets are built from activity, ADME and toxicity data. All the properties in a compound’s pharmacological profile are included in every design decision. The property spaces of compounds can be evaluated against all models in minutes. The Numerate system, ToxTool, was developed from the ground up on the cloud. It was built and validated with a substantial investment which included support from the US Department of Defense.

The ToxTool architecture has a RESTful API server (Java, Tomcat, Jersey, Marvin, and custom libraries and services); a Web server (Node and Express); and React-Bootstrap and Marvin JS as clients. The ChemAxon components are Marvin JS, Marvin, the Protonation plugin, and Markush enumeration.

Single molecule or space (Markush) jobs can be submitted. Draft jobs are stored in a database as an MRV string. The submitted job generates molecules from the MRV string. Molecules are validated, pKa adjusted with the Protonation plugin, and stored as an SDfile. Space job execution involves molecule enumeration and validation, job scheduling, molecule scheduling, distributed evaluation of molecules, and collecting and reducing the space results. Brandon showed some screenshots of opioid receptor antagonists with enumerated structures on the left-hand side, and a spreadsheet of activity, ADME and toxicity results on the right, with cells color-coded for high, moderate, and low (in the case of activity and mechanism of action). The system is currently deployed and helping to reduce off-target toxicities in preclinical drug programs.
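The stages of a space job described above (enumeration, validation, distributed evaluation, and reduction of the results) can be illustrated with a minimal Python sketch. It is not Numerate's implementation: the scaffold-with-placeholders format, the function names, and the toy "models" are all assumptions made for illustration, and a thread pool merely stands in for the cloud scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def enumerate_space(scaffold, r_groups):
    """Yield one string per R-group combination (a toy stand-in for Markush enumeration)."""
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, combo)))


def is_valid(mol):
    """Hypothetical validation rule: non-empty, with balanced parentheses."""
    depth = 0
    for ch in mol:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(mol) and depth == 0


def evaluate(mol, models):
    """Score one molecule against every model."""
    return mol, {name: model(mol) for name, model in models.items()}


def run_space_job(scaffold, r_groups, models, workers=4):
    """Enumerate, validate, evaluate in parallel, then reduce to a table and a summary."""
    molecules = [m for m in enumerate_space(scaffold, r_groups) if is_valid(m)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        table = dict(pool.map(lambda m: evaluate(m, models), molecules))
    # Reduction step: here, simply the highest score seen for each model.
    summary = {name: max(row[name] for row in table.values()) for name in models}
    return table, summary


if __name__ == "__main__":
    toy_models = {"length": len}  # placeholder for real activity/ADME/toxicity models
    table, summary = run_space_job("CC{R1}N{R2}", {"R1": ["O", "S"], "R2": ["C", "CC"]}, toy_models)
    for mol, scores in table.items():
        print(mol, scores)
    print("summary:", summary)
```

In a real deployment each model call would be a remote service, and the reduction would produce the color-coded spreadsheet rather than a simple maximum.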

Data Management

ChemAxon solutions

ChemLocator helps users to discover the hidden chemical knowledge in documents, regardless of whether they are located on a local computer, on a network share, or in the cloud. József Dávid explained that it is a Web-based, fast hybrid search tool with fine-grained security. It is extensible, and can be integrated with third-party applications. No IT deployment is needed. Zsolt Mohácsi presented five things that we might not have known about Compound Registration: the staging area, used when a chemist registers a compound that does not comply with company rules and must be fixed by a registrar; the salt dictionary; registering different stereoisomers of the “same” compound; adding new data fields to the registration form; and connecting Compound Registration to data mining, querying and reporting tools. Ákos Papp ran us through Compliance Checker.

Karla Jarkovská looked at the future for Plexus and Instant JChem (IJC). Plans for IJC include customizable structure search options, integration of ChemAxon’s bioinformatics tools, the Biomolecule Toolkit and BioEddie, and improved usability. Soon, a Spotfire plugin will be securely integrated with IJC and Plexus Connect. The ultimate goal is for Plexus Connect to become Web-based. Scripting support and usability will be improved. Plexus Assay is coming soon, with a processed assay data uploader, standardization of data and non-numerical values, procedure mapping, field mapping, and data aggregation. Finally, Karla introduced Plexus as a platform: ChemAxon’s ambitious cloud-based plans centered on the ChemAxon Synergy concept, integrating ChemAxon’s supporting tools and multiple third-party tools, allowing users to connect, discover and navigate through many different applications.

Consultancy use cases

The consultancy team is a group of experts doing product customization and integration; workflow design and management; project management; and creative solutions. Norbert Sas outlined some project examples, from small ones, such as training, to large, multi-year exercises. Mini-Reg, a small scale, chemical and biological registration system based on IJC, arose from a consultancy project. The team has integrated Mini-Reg with an ELN, and linked the system to third party applications, for a small company. Norbert mentioned a project involving IJC, with calculators added, in which a data warehouse was made. A customization example was addition of visualization features using a third party application. Another project was a patent mining workflow. A way to search an ELN database and other documents was built for DuPont using the Document-to-Database program. Database migration is exemplified by the Novartis system for searching reaction data in a legacy (ISIS/Base) database and a CambridgeSoft (PerkinElmer) ELN. A project for another company to replace the existing search, data visualization and reporting application, and provide a Web sketcher for the in-house registration system may be presented at a future ChemAxon user meeting.

The ChemAxon consultancy team also has large-scale project management contracts with GlaxoSmithKline (GSK) and Bristol-Myers Squibb (BMS). Over more than three years the team is rolling out IJC and Plexus Connect as a global reporting tool for GSK, providing assistance and training, customizing IJC and creating additional admin software. The BMS project, lasting more than two years, involves IJC customization and development for global roll-out needs, training, migration of data, consulting on data-mart integration, and thin client development. Last but not least, a large, non-pharma company has set up a major project with ChemAxon, having found that there was no tool on the market to handle their spectra adequately.

Biomolecules

Biomolecules in R&D informatics

Roland Knispel presented a detailed case study concerning the ChEMBL database, the Biomolecule Toolkit, Hierarchical Editing Language for Macromolecules (HELM), and content curation. He also outlined an example of using IJC, Mini-Reg and the Biomolecule Toolkit in antibody registration. In the first and second quarters of 2017, sequence domain support was added to BioEddie; and a library manager tool, Oracle support, and experimental KNIME nodes and Pipeline Pilot components were added to the Biomolecule Toolkit. Later this year antibody drug conjugates will be supported in BioEddie, and the Biomolecule Toolkit will have genealogy tracking, entity batches, and sequence similarity search.

A chemical workbench for biopolymers

At Merck (known as MSD outside the United States and Canada), the in silico tools and resources for a design-make-test paradigm work well for small molecules, but the workflow (property calculation, automated synthesis, registration, purification, and SAR) for biopolymers has been less well supported. The company wanted to fill this gap and to meet unmet technology needs: it needed a configurable, off-the-shelf solution, with JavaScript, and a perception engine.

Joshua Bishop said that the approach was to adopt a services-oriented architecture, to utilize reusable Web components where possible, and to build on a data platform for ingestion and aggregation. At the base of the technology stack are databases of biopolymers, HELM monomers, small molecules, and assay data. A scientific information management platform aggregates the data. On top of that are services to calculate the properties of biopolymers, and above that, applications for enumeration, synthesis request and registration, assay data collection and SAR. Sustaining partners in biopolymer design are ChemAxon (for the Biomolecule Toolkit, including BioEddie), Discngine (for Spotfire) and EPAM.

ChemAxon’s Biomolecule Toolkit is used to enumerate and then register a virtual library, and to calculate and predict properties for the biopolymers. After selection of the desired biopolymers, synthesis is carried out by “MerMade” robots. MS-directed synthesis verification follows, and, finally, purification and formal biopolymer registration (using the Biomolecule Toolkit) take place. During registration the user interacts in the sketcher with a large chemical structure; the computer perceives this in terms of monomers and registers the molecule as three-letter templates. If the computer detects a new template in the sketched structure, the user is prompted first to register the novel HELM monomer. An assay request system accesses the biopolymer database. The SAR analysis system is a work in progress; it uses Spotfire, and the Biomolecule Toolkit including BioEddie. Matched molecular pair analysis will be implemented.
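The registration check described above, where an unknown template must be registered as a monomer before the biopolymer itself, can be sketched as follows. This is only an illustration under stated assumptions: the monomer library, the three-letter codes, and the function names are hypothetical, the perception step is assumed to have already produced a list of residue codes, and the output is a HELM-style string for a simple linear peptide rather than anything generated by the Biomolecule Toolkit.

```python
class UnknownMonomerError(Exception):
    """Raised when a perceived template is not yet in the monomer library."""

    def __init__(self, unknown):
        super().__init__("register these monomers first: " + ", ".join(unknown))
        self.unknown = unknown


def to_helm(residues, library):
    """Turn perceived three-letter templates into a HELM-style linear peptide string.

    `library` maps a three-letter template to its monomer ID (hypothetical data).
    """
    unknown = sorted({r for r in residues if r not in library})
    if unknown:
        # Mirrors the prompt to register a novel monomer before the biopolymer.
        raise UnknownMonomerError(unknown)
    ids = [library[r] for r in residues]
    # Multi-character monomer IDs are written in square brackets.
    body = ".".join(i if len(i) == 1 else "[" + i + "]" for i in ids)
    return "PEPTIDE1{" + body + "}$$$$"


if __name__ == "__main__":
    library = {"Ala": "A", "Gly": "G", "Cys": "C"}
    try:
        to_helm(["Ala", "Orn", "Cys"], library)
    except UnknownMonomerError as err:
        print(err)
        library["Orn"] = "Orn"  # the user registers the new monomer
    print(to_helm(["Ala", "Orn", "Cys"], library))
```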

Managing the process of biologics registration

As reported in Budapest in May 2016, IDBS had developed their own cheminformatics capabilities for their chemistry ELN in the E-WorkBook desktop client, but those capabilities did not meet future requirements in moving ELN functionality from the desktop to the Web. As part of IDBS’ strategy, new technology was required for the representation of chemical structures on the Web, and IDBS considered a build, buy or partnership approach. The E-WorkBook desktop chemistry notebook already incorporated ChemAxon’s Reactor and Calculators, and IDBS decided to build on its existing partnership with ChemAxon.

In San Francisco, Jarrod Medeiros reported that the IDBS E-WorkBook Cloud is an end-to-end, cloud-based, research and development platform that supports internal, external and hybrid data management and research needs. It helps users plan and coordinate work, manage samples and inventory, and capture content and results. It supports external collaboration and provides decision support and process insight. It has integrated modules and an intuitive interface, and supports different devices and operating systems. It can be integrated with an existing infrastructure, and used on-premises or as a SaaS deployment.

For small molecules, informatics tools such as computer-based representations and registration systems are very mature, but for biologics there is not the same depth of mature tools. The biologics area needs informatics tools that cope with ambiguity, unknowns, large molecules, complex biology, and data types that are new. Reporting and analysis need to be flexible and able to expand and adapt as the science develops. Most large molecule therapeutics are synthesized by cell lines, so the synthetic process is far more complex than that used in small molecule development, and biologics production is often continuous, not batch-centric.

The biologics landscape includes materials such as nucleotides, expression systems, and peptides, and the expression systems that produce samples of them. Jarrod compared the physical and virtual aspects of bio-registration. A physical entity such as a vial of primer may or may not be linked to registered entities. Physical materials and samples have parent-child relationships. A primer sequence is an example of a virtual entity. It is typically linked to batches in inventory. There are relationships between entities. Uniqueness checking and business rules are needed for registration. Jarrod ran through a typical scenario involving E-WorkBook and peptide synthesis, CHO-S host cells in the inventory, and transfection parameters in E-WorkBook. HELM and ChemAxon’s Biomolecule Toolkit are essential to implementing the process of biologics registration in one integrated system.
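The distinction between a virtual entity (such as a primer sequence) and its physical batches, together with uniqueness checking at registration, can be sketched in a few lines. Everything here is a hypothetical illustration, not E-WorkBook's data model: the class name, the identifier format, and the normalization rule (ignore whitespace and case) are assumptions.

```python
class BioRegistry:
    """Toy registry: virtual entities keyed by normalized sequence, each with child batches."""

    def __init__(self):
        self._by_key = {}   # normalized sequence -> entity ID
        self.batches = {}   # entity ID -> list of physical batch labels

    @staticmethod
    def _key(sequence):
        # Assumed business rule: whitespace and letter case do not make a sequence new.
        return "".join(sequence.split()).upper()

    def register(self, sequence):
        """Return (entity_id, is_new); an existing entity is reused rather than duplicated."""
        key = self._key(sequence)
        if not key:
            raise ValueError("empty sequence")
        if key in self._by_key:
            return self._by_key[key], False
        entity_id = "ENT-{:04d}".format(len(self._by_key) + 1)
        self._by_key[key] = entity_id
        self.batches[entity_id] = []
        return entity_id, True

    def add_batch(self, entity_id, label):
        """Link a physical sample (e.g., a vial) to its parent virtual entity."""
        self.batches[entity_id].append(label)


if __name__ == "__main__":
    registry = BioRegistry()
    primer, _ = registry.register("acgt acgt")
    registry.add_batch(primer, "vial-001")
    print(registry.register("ACGTACGT"))  # same entity, not a new registration
    print(registry.batches)
```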

Back End: Store and Search

JChem engines

András Volford gave a brief introduction to JChem Base, and the JChem Oracle and PostgreSQL cartridges, concentrating mainly on the PostgreSQL cartridge, and the very significant speed-up of duplicate and similarity search.

The PostgreSQL cartridge at Dart Neuroscience

Brock Luty of Dart Neuroscience was unable to attend the meeting but has sent me the following, very preliminary notes. Installation of the PostgreSQL cartridge was easy and there have been no problems with stability in recent versions. Searching seems to be quite a lot faster, but at the cost of flexibility. In Oracle more options can be specified at runtime. With the PostgreSQL cartridge, the molecule type is fixed at table creation time. Maybe this allows the significant speed improvements. Structure searching with the PostgreSQL cartridge is faster than searching in Oracle, but it should be noted that Dart is comparing a recent Postgres cartridge with a fairly old Oracle cartridge.

Searching using SQL is quite different in PostgreSQL; there was a learning curve but Dart expected this. The fact that the chemical terms output can only be a string or a molecule is very limiting. For example, when getting atom counts, the string output needs to be cast to a number. This may be a Postgres limitation. Searching with chemical terms is slower in the PostgreSQL cartridge (2.7): a query (by id) for 10 compounds and their carbon counts consistently (across multiple instances) took longer in Postgres compared with Oracle (again, across multiple instances). (The database configurations are different so a true, exact comparison is not possible.) This may be something specific to that particular query. On the plus side, the PostgreSQL cartridge has a simplified list of search functions: the “chemterms” function suffices in places where multiple (at least four) different “jc_” functions would be needed in Oracle.

There are also substructure searching differences, in large part due to the differences in aromatization and vague bond settings. (Dart is using “basic” aromatization in Oracle. At the time they thought that this was not an option in Postgres.) It is still taking Dart some time to adjust to these different outcomes, and if chemists are ever allowed directly to do structure searches against Postgres, training will be required. Dart has confidence in the correctness of the search results and thanks ChemAxon for extensive and patient discussions.

Migrating the eMolecules chemical search engine to ChemAxon

Craig James said that the eMolecules system allows users to search for and order chemical compounds from a high quality, accurate, and up-to-date database of 10 million products. Users can search by substructure, similarity, or exact structure; browse, filter, sort and save results; and import and export lists. eMolecules has also built 100 private websites for pharmaceutical, biotechnology and biology customers. JAGGAER Enterprise Reagent Manager (ERM) federates inventory search with eMolecules search. Structure drawing and rendering in JAGGAER ERM is now powered by Marvin JS.

The eMolecules search engine was fast and stateless: it answered the query and then forgot the user, so that each subsequent query (e.g., “page 2”) was a new search. Pagination was efficient: for example, “get rows 11-20 and stop” was carried out without searching again for rows 1-10. There were powerful hitlist features. So, why replace eMolecules’ home-grown cheminformatics? The cost of maintaining compatibility with OpenBabel was one issue, but above all, a superior commercial alternative from ChemAxon became available.

Web search systems are stateless, fast, and lightweight, allowing fast queries, pagination, and load spikes, but there are problems with abusive robots and industrial theft. A traditional RDBMS is stateful, and unpredictable, with an overhead. Queries can take minutes (or hours), pagination is poor, search results cannot be refined, search progress cannot be reported, and it is hard to interrupt a search. Chemical queries are particularly a problem: optimizer predictions are wrong for them and query times are extremely variable. eMolecules got around this by separating out the chemical queries into an eMolecules search engine before passing them to the Postgres database. This engine did exactly one query, “Find the next N rows”, and carried out dynamic optimization. The execution state (“Where did the last search stop?”) was encoded in a few bytes that can be embedded in a URL.
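The idea of a stateless “find the next N rows” query whose execution state travels in the URL can be illustrated with a short Python sketch. It is an assumption-laden toy, not the eMolecules engine: the state is taken to be a single scan position, packed into eight bytes and encoded as URL-safe base64, and the function names are invented for this example.

```python
import base64
import struct


def encode_cursor(position):
    """Pack the scan position into 8 bytes and make it safe to embed in a URL."""
    raw = struct.pack(">Q", position)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_cursor(token):
    """Recover the scan position; an empty token means 'start from the beginning'."""
    if not token:
        return 0
    padded = token + "=" * (-len(token) % 4)
    (position,) = struct.unpack(">Q", base64.urlsafe_b64decode(padded))
    return position


def find_next(rows, matches, n, token=""):
    """Return up to n matching rows after the cursor, plus a cursor for the next call.

    The server keeps nothing between calls: all state is in the returned token.
    """
    position = decode_cursor(token)
    hits = []
    while position < len(rows) and len(hits) < n:
        if matches(rows[position]):
            hits.append(rows[position])
        position += 1
    return hits, encode_cursor(position)


if __name__ == "__main__":
    rows = list(range(100))
    page1, cursor = find_next(rows, lambda r: r % 2 == 0, 10)
    page2, cursor = find_next(rows, lambda r: r % 2 == 0, 10, cursor)
    print(page1, page2, cursor)
```

Fetching “page 2” resumes from where page 1 stopped, so rows already scanned are never searched again.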

The current solution is to partition the Postgres molecule tables and to query 500,000-compound subsets. Initially, the Web query is applied to just one partition; the full dataset is only queried if the user is interested in a full result set (an uncommon occurrence). “Stupid” queries (e.g., benzene) are thus 95% faster and very specific queries are only slightly slower. Search results are stored in a “hit list”, and a progress report is given during the search (via partitioning); persistent results are saved in a database. Fast perusal of results may lead to no further searching; results can be sorted and refined; and a search can be canceled. In the RESTful JSON API the database layer is separated from the application. It is thus now feasible to use ChemAxon’s JChem PostgreSQL cartridge embedded in a Postgres relational database management system (RDBMS) for the eMolecules website.
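The partition-first strategy can be sketched as follows. This is a minimal illustration, not the eMolecules code: in-memory lists stand in for the partitioned Postgres tables, and the function and parameter names (including the progress and cancel callbacks) are assumptions.

```python
def make_partitions(rows, size):
    """Split the data set into fixed-size subsets (500,000 compounds in the talk)."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partitioned_search(partitions, matches, full=False, on_progress=None, should_cancel=None):
    """Search the first partition only, or every partition when a full result set is wanted.

    Progress is reported after each partition, and the search can be canceled
    between partitions; hits found so far are kept in the hit list.
    """
    to_scan = partitions if full else partitions[:1]
    hit_list = []
    for done, partition in enumerate(to_scan, start=1):
        if should_cancel is not None and should_cancel():
            break
        hit_list.extend(row for row in partition if matches(row))
        if on_progress is not None:
            on_progress(done, len(to_scan), len(hit_list))
    return hit_list


if __name__ == "__main__":
    partitions = make_partitions(list(range(20)), 5)
    is_even = lambda row: row % 2 == 0
    print(partitioned_search(partitions, is_even))  # quick answer from one partition
    print(partitioned_search(partitions, is_even, full=True,
                             on_progress=lambda d, t, h: print(d, "of", t, "partitions,", h, "hits")))
```

A very common query fills the first page from a single partition, which is why such searches finish so much sooner; a rare query simply continues through the remaining partitions when the user asks for everything.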

MadFast similarity search

MadFast is a new ChemAxon product released in December 2016. It is an engine for fast similarity searching with efficient in-memory storage. It also provides fast calculation of descriptors such as CFP, ECFP and MACCS-166 fingerprints. It is a Java application available via command line, REST API and Web interfaces. Using 1024-bit binary fingerprints on an Amazon r3.8xlarge machine, MadFast delivers the 40 most similar structures in about 80 milliseconds per 16 million structures, or about 5 seconds per 1 billion structures. Memory usage per million molecules is 250-350 MB. Gábor Imre presented some examples, and a use case comparison with the PostgreSQL cartridge. Future plans include overlap analysis visualization; real-time clustering; similarity-based hierarchical clustering; query of a remote database using JDBC; a single desktop UI release; and public Java API components for developers.
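To make the task concrete, here is a minimal sketch of a “k most similar” search over binary fingerprints held in memory. It is not MadFast's algorithm or API: the Tanimoto coefficient is assumed as the similarity measure (the talk summary does not name one), Python integers stand in for the bit vectors, and the function names are invented.

```python
import heapq


def popcount(bits):
    """Number of set bits in a fingerprint stored as a Python integer."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def most_similar(query, fingerprints, k=40):
    """Return the k best (similarity, index) pairs, most similar first."""
    scored = ((tanimoto(query, fp), index) for index, fp in enumerate(fingerprints))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    library = [0b1111, 0b1100, 0b0011, 0b0000]
    print(most_similar(0b1100, library, k=2))
```

The quoted timings are mutually consistent with a linear scan: 1 billion structures is 62.5 times 16 million, and 62.5 × 80 ms = 5 s.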

Partner Session

Gloria Patterson presented the Agilent OpenLAB ELN, a central hub for organizing and sharing information between lab members. Ideas, protocols, and data files are collated all in one place with easy-to-use, drag-and-drop capabilities. OpenLAB ELN forms enable users to import data automatically from a wide variety of sources, directly into a notebook page. Users can delegate and track the progress of experiments, share the latest results with a team in real time, and standardize routine workflows with the ELN’s templates and forms, e-signature and validation workflows, and collaboration tools. There is also an intuitive mobile interface. Structure search in Agilent OpenLAB ELN is powered by JChem.

Jonathan Lee made a presentation on behalf of DeltaSoft. The company was formed in 1996 and has been a ChemAxon partner since 2006. The DeltaSoft ChemCart cartridge works on JChem (and other) cartridges. Sketchers, renderers, Excel, and workflow tools are supported. ChemCart applications include ChemCart Registration, ChemCart BioAssay, ChemCart ELN, ChemCart Reagent Inventory, ChemCart Sample Inventory, and a Structure Activity Browser. The solutions can be used in the cloud or on-premises.

Titian Software and ChemAxon are demonstrating integration between Mosaic inventory management software and ChemAxon Compound Registration. A Mosaic inventory contains physical container-level information, whereas a registration system holds details about the substance in the container. Anne Vergnon presented a video of registering a new molecule in ChemAxon Compound Registration, creating some tubes of it in Mosaic, placing the tubes into a store, and viewing information about the molecule in Mosaic. Hyperlinks provide easy navigation between the two systems. The demonstration video shows capabilities in the cloud. Anne showed a small molecule example, but the integration can be extended to other registration systems and substance types.

Conclusion

I personally found this meeting useful, but it has to be admitted that this was not the biggest and best of the many ChemAxon meetings that I have attended. A number of potential speakers had to withdraw at the last minute, and the audience was smaller than one would have expected for a typically popular venue. It seems as if the ChemAxon change from Boston in September to San Francisco in April was not a good idea. Perhaps cheminformaticians now accept the supremacy of ChemAxon and are no longer curious about what might be going on. Who knows? The good news is that the Budapest meeting a month later was as successful as ever, as I hope you will see from my upcoming report on that event.