ChemAxon European User Meeting, Budapest, May 20-22, 2019
Table of contents
- Hands-on Workshops
- ChemAxon and Its Portfolio
- Empowering Chemical and Biological Design
- Proposed Enhancements to the Reaction InChI (RInChI)
- Helping Medicinal Chemists Discover New Opportunities in Lead Identification and Optimization
- Marvin Live
- Managing Hypotheses in Boehringer Ingelheim's eDesign Landscape
- JChem for Office Lite
- Best-in-Class Search Technology
- Cartridges and Graph Databases
- Novel Similarity Graphs
- A Complex Database
- Central Registry and Cloud Integration
- Agnostic Registration
- Compound Registration
- The Haystack Project
- Transforming AnalytiCon's Database Infrastructure
- Compound and Assay Data Management in the Cloud
- Partner Session
- BSSN Software
- Mestrelab Research
In 2019 ChemAxon users yet again convened at the Akvárium Club, Budapest, but the program was rather different this year: workshops were included in the main meeting, rather than run on a separate day. Last year saw an unusually large number of user talks. This year there were fewer user presentations, but there was more opportunity for networking, and for in-depth discussions with ChemAxon staff. Attendees, and almost all of ChemAxon’s 160 employees, flocked in from 16 countries across Europe, Asia, and North America.
On the evening before the meeting proper, there were informal discussions in ChemAxon’s offices in the Graphisoft Park, followed by the usual garden party. The next night there was a buffet dinner at the Akvárium Club followed by a concert from a Hungarian pop band called Irie Maffia. Fortunately for your correspondent, who is anxious to retain her auditory faculties, this event took place on the lower floor of the venue, allowing her to indulge in serious discussions in the main foyer with Efi Hoffmann of ChemAxon.
A fringe event during the main meeting was the Zosimos challenge. Zosimos is a new web application to help chemistry and biochemistry classes to improve their knowledge by interactive homework quiz creation, evaluation, and publishing. Tools such as Marvin JS and JChem Base are used to evaluate and grade students’ assignments automatically. The winners of the two challenges were Charles Goehry of Life & Soft, Anna Szekely of Chempass, Chimed Jansen of Mercachem, Kornélia Tacsi of mcule, Eric Ginoux of Life & Soft, and an unknown person who did not claim the prize.
In parallel sessions, the hands-on workshops were:
- Chemical drawing tips and tricks (for scientists) and Coding MadFast similarity search (for developers).
- Using Marvin Live, a drug design ELN (for scientists) and Getting familiar with JChem codes, Choral and Graph DB (for developers).
- Glimpse of new design concepts in Marvin (for scientists) and Add chemical intelligence to your web environment with Chemicalize (for developers).
- Upload and analysis of compound and assay data in the cloud and Predictive search and draw.
A third option for all four sessions was Chat with ChemAxon.
ChemAxon and Its Portfolio
János Fejérvári of ChemAxon opened the meeting with some comments on ChemAxon achievements over the last year. The company experienced double-digit growth in both income and the number of customers. More than 500 companies and more than 700 universities use ChemAxon products; Finland sets its final secondary school examinations using ChemAxon software. The number of ChemAxon employees has grown 12% to 160.
Dávid Malatinszky of ChemAxon then gave an overview of the company’s portfolio. First he described how innovation is done at ChemAxon. When the company’s experts work with a client on a project, they always ask: “Why?” Why do you want to do this? What is your purpose? They ask these questions over and over again, following the “golden circle” philosophy of Simon Sinek (author of Start with Why: How Great Leaders Inspire Everyone to Take Action, Penguin, 2011). Dávid gave an exceptionally good outline of ChemAxon’s vision and product line (it was a joy to see such simple, clear slides), and for each product he gave the name and photograph of the key contact in ChemAxon, so that we knew who to approach with questions. Many of these people described their products in the sessions that followed.
Empowering Chemical and Biological Design
Marvin JS is a very popular lightweight chemistry sketcher web component, and it led ChemAxon to understand users’ motivations better in composing content-rich chemistry figures. The company is therefore rethinking chemistry drawing. Dávid Malatinszky said that there are plans for adding:
- publication-quality drawing (allowing image insert, graphical elements, cliparts, structure formatting, custom fonts and highlights)
- complex, multi-step reaction schemes (with custom and curved arrows, reagents and reaction conditions)
- a built-in Structure Checker (as in MarvinSketch, with highlights, suggestions and fixes).
Efi Hoffmann gave more detail. New features to be added to Marvin JS in 2019 include accessibility improvements, Structure Checker integration, and oligopeptide support with more S-groups. The long term vision is one of a web-based, publication quality chemical editor component which is capable of drawing big synthesis schemes, can be easily extended with other modules, and can be seamlessly integrated in ChemAxon’s products and those of other vendors. Hence the MARVIN_NG project in which ChemAxon will carry out user interviews, usability tests and workshops, in collaboration with users, in order to build a great product that fulfills user needs and workflows. The plans were to produce the architectural concept in Q1 of 2019, the chemical model in Q2, and the graphical concept in Q3-Q4. By year end there will not be a new product, but, with the help of users, there will be a concept validation.
Proposed Enhancements to the Reaction InChI (RInChI)
Gerd Blanke of StructurePendium Technologies talked last year about the IUPAC International Chemical Identifier for Reactions (RInChI). RInChI is based on the IUPAC International Chemical Identifier (InChI), a unique representation of the compound it describes. InChIs and InChIKeys can be read by, for example, Marvin. The RInChI format is a hierarchical, layered description of a reaction, with different levels based on the Standard InChI representation of each structural component participating in the reaction. The formats and algorithms of InChI and RInChI are non-proprietary, and the software is open source.
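The idea of building a reaction identifier from the InChIs of its components can be illustrated with a toy sketch. This is not the real RInChI syntax or algorithm: the function name, the separators, and the direction flag below are assumptions made purely for illustration. It shows only that sorting the component InChIs within each group gives a string that does not depend on the order in which the reaction was drawn.

```python
# Toy illustration only: NOT the official RInChI format or software.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ETHYL_ACETATE = "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3"
WATER = "InChI=1S/H2O/h1H2"


def toy_reaction_id(reactants, products):
    """Build an order-independent toy identifier from component InChIs."""
    left = sorted(reactants)   # order of drawing must not matter
    right = sorted(products)
    direction = "+"
    if right < left:
        # Keep the groups in a fixed order and record the direction separately
        left, right = right, left
        direction = "-"
    body = "<>".join(["!".join(left), "!".join(right)])
    return body + "/d" + direction


forward = toy_reaction_id([ETHANOL, ACETIC_ACID], [ETHYL_ACETATE, WATER])
shuffled = toy_reaction_id([ACETIC_ACID, ETHANOL], [WATER, ETHYL_ACETATE])
reverse = toy_reaction_id([ETHYL_ACETATE, WATER], [ETHANOL, ACETIC_ACID])
```

In this sketch the forward and reverse reactions share the same body and differ only in the final direction flag, which is the kind of layering the talk described.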
RInChI version 1.0 was released in March 2017. The ChemAxon application of RInChI is new since Gerd’s talk in 2018. This year Gerd and István Őri of ChemAxon were able to give a demonstration of drawing a reaction in Marvin and generating a RInChI string and key. ChemAxon has supported the RInChI file format since Marvin version 19.11 and the longer term goal is for all ChemAxon products to support RInChI.
The RInChI team (Gerd Blanke, David Nicolaides of Dassault Systèmes, Günter Grethe, Hans Kraut of InfoChem, István Őri of ChemAxon, Jan Holst Jensen of Biochemfusion, and Jonathan Goodman of the University of Cambridge) is now planning version 2.0. Reaction SMILES and the Unified Data Model (UDM) will be additional import and export formats. Atom mapping, to mark the reaction centers, will be handled as an auxiliary layer, MapAuxInfo. The team is not implementing a mapping algorithm in RInChI but will use the information provided by the RXN file as delivered by the author. Mapping will identify only those atoms that are kept during the reactions.
Other enhancements to RInChI are under discussion. Failed reactions could be marked by a direction layer, or by an additional identifier. Additional auxiliary information layers could be added for statistical tools provided by vendors and publishers. InfoChem’s narrow, medium, and broad Class Codes will be handled in the AuxClassCode layer. Addition of the transforms used in Reaxys is under discussion. A proposal for handling processing information in a ProcAuxInfo layer has been made by authors from Cambridge University. This rich format supports (among other things) stoichiometry, reaction conditions such as temperature and pressure, and yields over time.
Helping Medicinal Chemists Discover New Opportunities in Lead Identification and Optimization
One thing that is critical to the design, make, and test cycle during drug development is the ability to make informed decisions at each step, and decisions are often driven by in-house and published knowledge. Elsevier assists with the decision-making task by extracting and normalizing data from large volumes of literature and patent documents, and delivers a user-friendly and actionable solution in the Reaxys product suite. Last year Ralph Hössel of Elsevier talked about the use of ChemAxon solutions in the Reaxys excerption and search systems. This year Rosalind Sankey of Elsevier talked about using data from Reaxys to strengthen support for the drug design cycle. Use of ChemAxon tools is critical also in these applications.
The first project discussed concerned matched molecular pair (MMP) analysis. A typical problem in lead optimization is optimizing pharmacokinetic (PK) and absorption, distribution, metabolism, and excretion (ADME) properties. For example, if scientists know that a certain part of a molecule is causing poor cell permeability, they would like to see options for replacing that part of the structure with something else, while maintaining activity. MMP analysis is a decision support tool used to understand the impact of substructural replacements on a given parameter of interest: pairs of compounds with a small substructural exchange are studied.
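The bookkeeping behind MMP analysis can be sketched in a few lines. The example below is a minimal illustration using invented data: the fragment names, property values, and the function `summarize_replacements` are hypothetical and are not taken from Reaxys or from the SIB workflow. Each record is one matched pair, and the summary reports how often a replacement was seen and its mean effect on the property.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical matched pairs:
# (fragment_from, fragment_to, property_before, property_after)
pairs = [
    ("aniline", "1,3-benzodioxole", 6.1, 6.5),
    ("aniline", "1,3-benzodioxole", 5.8, 6.0),
    ("aniline", "pyridine", 6.3, 5.9),
    ("phenyl", "pyridine", 7.0, 7.2),
]


def summarize_replacements(pairs):
    """Group matched pairs by replacement and average the property change."""
    deltas = defaultdict(list)
    for frag_from, frag_to, before, after in pairs:
        deltas[(frag_from, frag_to)].append(after - before)
    return {
        transform: {"count": len(values), "mean_delta": round(mean(values), 3)}
        for transform, values in deltas.items()
    }


summary = summarize_replacements(pairs)
```

A real analysis would also have to find the pairs and extract the exchanged fragments from structures, which is the part the report attributes to the SIB and JChem.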
Rosalind’s team partnered with the Swiss Institute of Bioinformatics (SIB) to run MMP analysis on Reaxys Medicinal Chemistry (RMC) data; the SIB ran pair identification, replacement extraction, and computation of the MMPs. In the back end was JChem for handling the molecules and fragments, and for calculating molecular and physicochemical descriptors. The outcome was over 9.5 million possible molecular replacements: almost twice as many as those offered when the same exercise was run on a well-known, publicly available dataset. In the front end was Marvin JS, for entering even complex structure and reaction queries, and JChem Web Services for generating the fragments and the molecule images.
The prototype was delivered to customers via a web application to get feedback on the potential impact. Users reported that the application helped drive creativity (in terms of new ideas and chemical space), helped collaboration with colleagues, and helped drive decisions more quickly. For example, when looking for potential replacements for an aniline fragment (often associated with toxicity issues), the application suggested using a 1,3-benzodioxole which had several advantages: it had been tested several times, the effect on activity was mostly positive, and the calculated properties were not negatively affected. This replacement would not necessarily have sprung to mind. This demonstrates the value of MMP analysis to drive creativity during compound design. The users could then dig deeper to explore where the information came from, and on which targets.
A second exploratory project was target prediction. Often scientists understand their primary target, but do not have a broader view on potential secondary targets, some of which could lead to safety issues later down the line. Target prediction is based on the assumption that if two molecules are very similar, they are more likely to be active on the same target. Thus a reverse screening approach can be applied, starting with a molecule of unknown target, and using the data from RMC in combination with machine learning/artificial intelligence (AI) technology developed by the SIB. This looks for similar compounds with known activity and produces a list of possible targets.
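The similarity assumption behind reverse screening can be sketched as follows. This is a simplified illustration, not the SIB method: it uses only a Tanimoto coefficient on invented 2D fingerprint bit sets, with no machine learning and no 3D shape term, and the names `tanimoto`, `predict_targets`, and the reference data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_targets(query_fp, reference, threshold=0.5):
    """Rank targets by the best similarity of any known active to the query."""
    best = {}
    for fp, target in reference:
        score = tanimoto(query_fp, fp)
        if score >= threshold and score > best.get(target, 0.0):
            best[target] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference set: (fingerprint bits, target with known activity)
reference = [
    (frozenset({1, 2, 3, 4}), "kinase A"),
    (frozenset({1, 2, 3, 9}), "kinase A"),
    (frozenset({1, 2, 7, 8}), "hERG"),
    (frozenset({10, 11}), "GPCR B"),
]
query = frozenset({1, 2, 3, 5})
ranked = predict_targets(query, reference, threshold=0.3)
```

Here the query shares three of five bits with the kinase A actives, so that target is ranked first, with hERG as a weaker secondary suggestion.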
The SIB’s approach has the advantage of looking at molecular similarity taking into account both 2D fingerprints and 3D shape; this gives a significant boost to the performance of the predictive capability of the model. In the back end were Structure Checker, and JChem Web Services for handling the molecules, for standardization, and for generation of 3D geometries. In the front end were Marvin JS for sketching, and JChem Web Services for generating the molecule images. Again, the Reaxys team created a web application to get feedback from customers. The application was based on more than 500,000 small molecules with activity lower than 10 mM on more than 2,500 targets.
Target prediction was used to understand potential secondary pharmacology better. In one case the prediction was used to optimize a kinase panel, and experimental testing showed activity at a previously unidentified target. This suggests that target prediction can flag off-targets earlier in the discovery process, reducing the risk and cost of developing potentially unsustainable leads.
Another important aspect of drug design is the ability to buy or synthesize a target molecule. It is a challenge to find faster, better, and cheaper ways to synthesize compounds, and not be constrained by our own knowledge or experience. One emerging technology is predictive retrosynthesis; several approaches are currently being worked on. One that has gained a lot of attention was developed by the team of Prof. Mark Waller. It uses a deep neural network approach, and is based on data from Reaxys. It has many advantages over more traditional approaches. It is scalable: you can keep feeding data into it as you get updates. It is 30 times faster than traditional, computer-aided methods, and it is efficient and more robust than other approaches. It performed well in double-blind studies with graduate level chemists. Reaxys will offer a new approach to retrosynthesis in 2019, using deep neural networks and symbolic AI, in collaboration with Prof. Mark Waller.
Curated, clean, linked and normalized Reaxys data underpin all three applications that Rosalind described. AI technologies will not replace researchers but will enable them to focus on innovation. The use of ChemAxon technologies and tools underpins many of the developments reported here.
Marvin Live
Ákos Tarcsay outlined a development path toward overarching hypothesis management, and an application study integrating a novel hERG assistant relying on matched molecular pair analysis, highlighting the agility of the Marvin Live design hub. The decision making sequence is to identify the goal, gather information, identify alternatives, weigh the evidence, choose among the alternatives, take action, and review the decision, before gathering yet more information and reiterating. Marvin Live gathers its information in real time, in a plugin that is service-agnostic and dynamic. In the case study, ChEMBL was filtered for hERG data, which was used in mmpdb (an open source matched molecular pair platform), and then JChem Web Services for the plugin. The statistical results from mmpdb were turned into a hERG Assistant by showing the most relevant transformations and corresponding example structures. The evidence was weighed in a spreadsheet of structures and properties. Marvin Live gave a card-like representation of the options, and the chemist was able to choose among the alternatives. Ákos closed with a screen shot of the planned hypothesis management system.
Managing Hypotheses in Boehringer Ingelheim's eDesign Landscape
Edith Scheringer of Boehringer Ingelheim (BI) talked about Marvin Live and Hypoman (managing hypotheses) in BI’s web-based eDesign solution. The aim for eDesign was to build a global, state-of-the-art, modular, flexible and sustainable design environment that provides a seamless user experience. The approach was to use “best of breed” components and implement a system with a focus on a seamless user experience. Other aims were to focus on modular, web-based technologies to stay agile; to invest in experimental and computational tools that provide contextual data for each design idea (tools which can be integrated in different front ends); and to follow the field, and adjust the system as new opportunities emerge. eDesign provides convenient, real-time access to design-relevant data and predictions. Currently, it covers ideation, 2D- and 3D-based design, and prioritization of synthesis ideas in the context of available compounds and data. eDesign version 1.0 is available; eDesign version 2.0 is being planned.
Edith presented a screen shot from the design mode of Marvin Live 1.0 as implemented at BI. On the left is a list of “snapshots” of structures, with the ability to pin a structure for comparisons. Alongside is a flexible and adjustable 2D web-based editor (Marvin JS), with a button below for fast access to the compound library. To the right, live information with plugins (“SmartAssistants”) provides contextual information (experimental and computationally predicted values). The user interface is flexible: it can be adjusted, and is extensible with plugins. It is easy to use, and easy to integrate in a workflow.
Alex Schmalz of BI described the implementation of eDesign version 1.0 at last year’s meeting. At the end, he listed some wishes for enhancements to Marvin Live version 1.0, namely, a dynamic interface between Marvin Live and Maestro; running Marvin Live on a Microsoft Surface Hub (a digital whiteboard); “Marvin JS Plus”, that is, Marvin JS with plugins such as web services, and Structure Checker; and push notifications. Edith said that some of these wishes are partly implemented, and some are in planning mode. BI is working on improving the user experience in the interface. The company saw opportunities with the “best of breed” approach in terms of delivering a seamless user experience. The ChemAxon tools were fine, but BI is trying to improve the seamlessness internally.
BI collaborated with ChemAxon not only as a software vendor: it also engaged ChemAxon to help evaluate the problem space. BI wanted to try an agile approach for version 2.0 of eDesign, so an agile workshop was set up with ChemAxon involved. Some opportunities came out of that workshop. As noted earlier in the user meeting, ChemAxon always asks “Why?” Asking the question led to discussions around capturing hypotheses. Users could get ideas from eDesign, but not track the hypothesis. So, more workshops were set up to give BI feedback on capturing the hypothesis. Needs for ease of use and prioritization of options are also being revealed. Reaxys has been integrated; the collaboration was threefold. From facilitation of the agile method, new ideas are now appearing, and some are exciting. ChemAxon continues to come back with prototypes, and feedback is being seamlessly captured.
JChem for Office Lite
Ákos Papp demonstrated a prototype for JChem for Office Lite. The proposed product features invisible, vendor-independent chemistry in Word, PowerPoint and Outlook, with no Excel component. JChem for Office without ribbon loads faster. No license is needed for copy and paste from any major chemical editor. The structure image is rendered by the source editor, and the chemistry is stored in the native format of the editor (MRV, CDX or SKC). The object can be edited by double clicking: the most appropriate editor is detected automatically. The preferred editor can be configured.
Best-in-Class Search Technology
Cartridges and Graph Databases
András Volford introduced two new products: JChem Choral and JChem Microservices. JChem Choral is the next generation JChem engine available as an Oracle Cartridge. JChem Choral features “hit as you draw”: fast searches with a hit limit. Hits are ordered according to similarity to the query. Also featured is fast combined chemical and nonchemical search. JChem Choral can handle very large databases. JChem Microservices extends the capabilities of ChemAxon’s chemically intelligent platforms to the web via specialized modules using the second generation JChem engine, made even faster by using relevance-sorted, in-memory search. The product is modular, scalable, easily manageable, highly available, and cloud-ready. Currently there are modules for search in a chemical dataset, conversion between different chemical file formats, structure manipulation operations, chemical property calculations, and Markush enumeration.
Ákos Tarcsay presented a comparison of the speed of the JChem Oracle Cartridge (JOC, introduced in 2004), with the speed of the next generation cartridges: the JChem PostgreSQL Cartridge (JPC, 2015), and the JChem Choral Oracle Cartridge (CHR, 2019). On “simple data” (42 million structures), JPC and CHR showed a significant performance improvement on limited substructure and similarity searches. For real-life data (1.8 million compounds and 15 million activities in ChEMBL) speed improved on joined queries over complex data. JPC outperformed CHR in all cases. The next generation cartridges (the JChem PostgreSQL Cartridge and the JChem Choral Oracle Cartridge) feature sorted substructure search hits, early hits, agile search, and low memory setup for large datasets.
András Volford talked about graph databases. In the property graph model, data are organized as nodes, relationships, and properties (data stored on the nodes or relationships). Nodes are entities, structures, etc., and relationships are connections between the nodes. There are no tables, so extra nodes and relationships can be added without corrupting the database. Joining tables is slow in relational databases; querying relationships within a graph database is fast because the relationships are stored persistently in the database itself. Neo4J is an open-source, NoSQL, native graph database. Cypher is Neo4J’s graph query language. The JChem Neo4J Cartridge features substructure search, and similarity search on a set of nodes. It is a sort of index mode search; there is no functional mode search yet.
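The property graph model can be pictured with plain Python data structures. The sketch below is not Neo4J or Cypher, and the node identifiers, relationship types, and the helper `related` are invented for illustration. It shows only the three ingredients named in the talk (nodes, relationships, and properties on either) and that a new kind of node or relationship can be added without altering any schema.

```python
# Nodes: id -> properties (including a label saying what kind of entity it is)
nodes = {
    "c1": {"label": "Compound", "smiles": "CCO"},
    "c2": {"label": "Compound", "smiles": "CCN"},
    "t1": {"label": "Target", "name": "hERG"},
}

# Relationships: (source id, type, destination id, properties)
relationships = [
    ("c1", "SIMILAR_TO", "c2", {"tanimoto": 0.8}),
    ("c1", "TESTED_ON", "t1", {"pIC50": 5.2}),
]


def related(node_id, rel_type):
    """Return (destination id, relationship properties) for one node and type."""
    return [
        (dst, props)
        for src, rtype, dst, props in relationships
        if src == node_id and rtype == rel_type
    ]


# A new kind of node and relationship is simply added; nothing is restructured
nodes["p1"] = {"label": "Project", "name": "hERG"}
relationships.append(("c1", "BELONGS_TO", "p1", {}))

similar = related("c1", "SIMILAR_TO")
projects = related("c1", "BELONGS_TO")
```

A native graph database stores the relationships themselves, so a lookup like `related` does not need the join that a relational design would require.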
Novel Similarity Graphs
Dan Dragos Stefanescu of Sanofi could not attend, so András Volford presented his slides and a video. Dan Dragos’ work involved building a graph database with chemical intelligence, using the Neo4J graph database, ChemAxon’s Marvin JS and ChemAxon for Neo4J plugin, and Tom Sawyer Perspectives, a platform for building graph, and data visualization and analysis applications.
Dan Dragos started by listing Sanofi’s business needs. These included efficient exploration of chemical space around biologically active chemical matter, with integration of diverse information linked to compounds; and efficient navigation and visualization, exploiting neighborhood relationships. Highly interactive and visual data traversing of the chemical space required excellent performance in retrieving data from large datasets, and high-end visualization capabilities to depict complex relationships. Benefits would include new insights that might have otherwise been overlooked, and increased creativity. Researchers need highly interactive and user-friendly tools to answer questions such as:
- What are the nearest neighbors to a given compound X that contain scaffold Y and show a high permeability?
- Which compounds show activities on targets A and B and have a reasonable ADME profile?
- Is there a commercially available compound similar to compound X that comes with pharmacological data that might be used as a tool compound?
Previous solutions had technology gaps. Data were stored only in relational databases. A single nearest neighbor search could take minutes, and a compound collection walk-through required a series of successive searches that could have taken hours.
So Sanofi took steps to build a similarity graph tool. Chemical similarities using FCFP4 fingerprints were calculated with 10 nearest neighbors, canonical SMILES, InChIKeys, and structure pictures for all Sanofi screening collection compounds, using the new ChemAxon for Neo4J plugin for substructure and similarity searches. Redundant storage of structures in Oracle is avoided. Compound annotations include physicochemical and ADME data, calculated properties, and related Sanofi project names. The data were loaded into the Neo4J graph database, and a web application was built using Tom Sawyer Perspectives by Tom Sawyer Software. This software was selected for its advanced data integration and graph visualization capabilities. Marvin JS was integrated for drawing structures for substructure search.
The similarity graph tool can retrieve the nearest neighbors of a molecule, and highlight and rank chemical similarity edges of a molecule node for interactive graph traversal. The tool allows a scientist to track the path and order of visited compounds, and export selected compound IDs for further analysis in other tools such as Certara’s D360. It allows filtering on edge and node properties, and applies color coding rules to molecule nodes. Color coding may be based on single or multiple node properties, allowing for compound profiling. The tool finds the shortest path(s) between two molecules with respect to biological context, considering visible nodes of the currently displayed graph or all database nodes. Nodes can be enriched with data from CSV files (for example, to provide linking by compound ID). Scaffolds can be displayed. Compounds with the same biological function but low chemical similarity can be shown. ChEMBL data for 1.8 million compounds have been integrated.
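Finding a shortest path between two molecules in a similarity graph can be sketched with a breadth-first search. The example below is a generic illustration on an invented, unweighted graph; it is not the algorithm used in the Sanofi tool, it ignores the biological context mentioned above, and the compound labels and the function `shortest_path` are hypothetical.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over undirected similarity edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorted only to make the traversal order reproducible
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two molecules are not connected


# Hypothetical nearest-neighbor edges between compounds A to E
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]
path = shortest_path(edges, "A", "D")
```

In this toy graph there are two routes from A to D, and the search returns the shorter one through E.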
The talk concluded with a 15-minute video, made by Dan Dragos, illustrating not just these features but also clustering capabilities. Many interesting screen shots were shown. One showed the searched compound “CHEMBL1834852”, colored in blue, along with its nearest neighbors connected by similarity edges, and also its scaffolds colored in dotted pink. The orange HERG project node was linked to all related compound nodes. The green-colored compound nodes fulfilled the criteria cAlogP less than 3.971 and HERG activity less than 8.06. The properties inspector on the left showed all node properties of the compound CHEMBL1834853.
User feedback was very positive, because users found that traversing their data in a graph is more intuitive than using Excel sheets or other retrieval tools. Possible future extensions to the web application will be the integration of other data sources such as ZINC and eMolecules.
A Complex Database
Egis makes pharmaceuticals and active pharmaceutical ingredients at two production facilities in Budapest. The company has worked with ChemAxon for five years. András Dancsó of Egis talked about Egis’ complex database which uses ChemAxon database technology (JChem, Marvin, Marvin JS, Standardizer, the Partitioning plugin, and Structure to Name). Integration work was done by Cominnex. There are 200 users of the Egis database.
Automatic name generation following registration or modification is 99.8 % successful; exceptions are some rather complicated complexes, where even Reaxys did not give a correct systematic name. The database holds structures and reactions, interlinked. Productions, the realized reactions, are linked to reactions, and linked to projects too. Containers are linked to structures. The end products of a reaction are also in containers so these containers are linked to productions as well. Analytical results are linked to containers.
Reagent containers have unique barcodes with precise locations. Quantities are continuously tracked. Procurement software starts and tracks orders. Safety data are stored, including Globally Harmonized System of Classification and Labeling of Chemicals (GHS) symbols, GHS H and P phrases, Material Safety Data Sheets (MSDSs) and data on controlled substances. Labels can be printed. The ELN tracks starting materials, and handles physical constants and analytics. All features are available in both English and Hungarian. Hard copies can be generated.
Precise stereochemistry is handled. Basic types of synthesis schemes are linear, combinatorial, and convergent, but, in practice, schemes can be much more complex. There is an assignment database for 1H and 13C NMR data.
Egis uses ChemAxon’s support with satisfaction, and continuous development is planned. For example, ChemAxon’s Compliance Checker could be implemented, and the new features in Marvin and the recently announced Office Lite seem to be highly interesting.
Central Registry and Cloud Integration
ChemAxon’s small molecule registration system is moving toward becoming an agnostic registration platform, where larger entities can be handled and added as unique rows to the central registry. Roland Knispel outlined the architecture. MRV files, SMILES, molfiles, Hierarchical Editing Language for Macromolecules (HELM) and FASTA, and in future other formats, will act as input to a perception engine that will automatically recognize the format. An ID generator is also linked to the perception engine. The recognized format will act as input to either Compound Registration or biomolecule registration. In future, types of databases other than those for small and large molecules could be accommodated.
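The role of a perception engine can be illustrated with a deliberately naive format sniffer. This is not ChemAxon's implementation: the function `perceive_format`, its heuristics, and their order are assumptions for illustration, and a real engine would parse and validate the input rather than look for a few marker characters.

```python
def perceive_format(text):
    """Guess the input format from simple textual markers (heuristic only)."""
    stripped = text.strip()
    if not stripped:
        return "unknown"
    if stripped.startswith("<"):
        return "MRV"        # assumption: XML-like input is treated as MRV
    if stripped.startswith(">"):
        return "FASTA"      # FASTA records start with a '>' header line
    if "M  END" in text:
        return "molfile"    # molfiles are terminated by an 'M  END' line
    if "{" in stripped and "$" in stripped:
        return "HELM"       # e.g. PEPTIDE1{A.G.C}$$$$
    if len(stripped.split()) == 1:
        return "SMILES"     # assumption: a single token is taken to be SMILES
    return "unknown"


def route(text):
    """Send small-molecule formats and biomolecule formats to different registries."""
    fmt = perceive_format(text)
    if fmt in ("MRV", "SMILES", "molfile"):
        return "compound registration"
    if fmt in ("HELM", "FASTA"):
        return "biomolecule registration"
    return "rejected"
```

The `route` function mirrors the architecture described above, in which the recognized format decides whether the record goes to Compound Registration or to biomolecule registration.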
Sairam Kalapatapu of Sun Pharma Advanced Research Company (SPARC) presented two slides about ChemAxon tools used in drawing structures and reactions, in TLC monitoring, in structure and reaction search, and in compound registration, in SPARC’s ELN. The bulk of the talk, however, was about computational simulations of pharmaceutical formulations. Molecular dynamics is beyond the scope of this report, so I am not summarizing the case study on simulations.
The Haystack Project
Earlier, Ákos Tarcsay had described the gathering of information for a hypothesis management system. Many very large public and private databases need to be accessed in this process. It would be very desirable for them to be accessible through a single interface. Iván Solt described the Haystack project which aims to combine more than a billion publicly available unique compounds into a single, searchable database. Seven databases were chosen for the prototype, with a total of 140 million compounds and 126 million unique structures:
- Corporate Compound Repository (1.5–3 million compounds)
- MolPort All Stock (7 million compounds)
- eMolecules (22 million compounds)
- BindingDB (651,000 compounds)
- ChEMBL (1.8 million compounds)
- SureChEMBL (18 million compounds)
- PubChem Compounds (95 million compounds).
(Adding 720 million compounds from Enamine REAL would have raised the number of compounds to about a billion.) The 140 million compounds will be stored in the PostgreSQL+Cartridge and submitted to MadFast similarity search. JChem Microservices will be used for Enamine REAL. The Marvin Live front end will then be imposed on the whole system. Consolidating the back end, implementing the API layer, and GUI design and early development will start in Q2 2019.
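The difference between the number of compound records and the number of unique structures comes from overlap between sources. A minimal sketch of that merge step is given below. The structure keys, the tiny source lists, and the function `merge_sources` are invented for illustration; the report does not say which identifier Haystack uses to decide that two records are the same structure.

```python
def merge_sources(sources):
    """Index records by structure key, remembering every source that has it."""
    index = {}
    for source_name, keys in sources.items():
        for key in keys:
            index.setdefault(key, []).append(source_name)
    return index


# Hypothetical structure keys (for example, canonical identifiers) per source
sources = {
    "ChEMBL": ["K1", "K2"],
    "PubChem": ["K1", "K2", "K3"],
    "eMolecules": ["K3", "K4"],
}

index = merge_sources(sources)
total_records = sum(len(keys) for keys in sources.values())
unique_structures = len(index)
```

Keeping the list of sources per key means that a single search hit can report every database in which the structure appears.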
Transforming AnalytiCon's Database Infrastructure
Lars Ole Haustedt of AnalytiCon Discovery described the transformation of his company’s data management and storage technologies. The company has complex interactions of structural, analytical, and biological data, large LCMS and NMR datasets (more than 5 TB) and medium-sized structural datasets (less than 10 GB, for 200,000 structure records). Legacy systems involved ChemFinder, ISIS Base, Access and Excel. Maintenance was tedious and the whole system was unsustainable.
Now AnalytiCon has a relational database with MySQL and Instant JChem. Viewing and editing are browser-based. Registration, inventory, and the relationship to analytical data have been enhanced, and medicinal chemistry and computational chemistry applications are integrated. Plexus Connect relies on the search technology of JChem Engines, and on the data handling of Instant JChem. It accesses chemical databases managed within Instant JChem, thus making research data available in an online environment. It allows scientists to view, search and analyze their data, and share them with their peers easily, using an intuitive interface.
The Natural Product Database stores chemical structures and a relational table with all batches. There is direct linking to NMR data (using the JCAMP-DX standard) and to LCMS PDFs. Polarity indicators are handled. The Synthetic Compound Database stores chemical structures, with direct links to LCMS PDFs. Synthetic synthons, and compound management and storage are handled. The Reagent Database stores chemical structures, and provider, price, and inventory information. Laboratory staff have browser access for user-friendly selection of reagent sets. The new system has integration into Microsoft Office using JChem for Office and Marvin.
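JCAMP-DX files are plain text, and their header consists of labeled data records of the form `##LABEL=value`, which makes it straightforward to index a spectrum file against a database record. The following is a minimal sketch of reading those header records. It assumes, purely for illustration, that the batch identifier is carried in the `TITLE` record; the report does not say how AnalytiCon links spectra to batches, and the sample content is hypothetical.

```python
def read_jcamp_header(text):
    """Collect ##LABEL=value records from JCAMP-DX text, up to ##END=.

    Labels are upper-cased for lookup. Only the first line of each record
    is kept, so multi-line data tables are not parsed by this sketch.
    """
    records = {}
    for line in text.splitlines():
        # "$$" starts a comment in JCAMP-DX; drop it before parsing.
        line = line.split("$$", 1)[0].strip()
        if not line.startswith("##") or "=" not in line:
            continue
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label == "END":
            break
        records.setdefault(label, value.strip())
    return records


# Hypothetical file content, for illustration only.
sample = """##TITLE=Batch NP-000123 1H NMR
##JCAMP-DX=5.01
##DATA TYPE=NMR SPECTRUM
##ORIGIN=example lab  $$ hypothetical origin
##END=
"""

header = read_jcamp_header(sample)
print(header["TITLE"])
print(header["DATA TYPE"])
```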
AnalytiCon also uses ChemAxon’s Reactor. ChemAxon KNIME Nodes are used in library enumeration: nodes for generation of unique compound identifiers, tractable synthon information, property calculation, generation of conformers (using Inte:Ligand software) and export of SDfiles and tables of SMILES to a MySQL database. Subselections are directly imported from the MySQL reagent database; there is no redundant storage of information. Without ChemAxon KNIME Nodes the processing of synthon information had been more tedious, and careful filtering of results was necessary. A workflow based on ChemAxon KNIME Nodes gives user-friendly handling of reagent properties, and better control over reaction products. The definition of reactions for enumeration, supported by Marvin SMARTS, is very flexible and intuitive (with no need for nonchemical “hacks”). Atom mapping is good.
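The bookkeeping side of library enumeration (pairing reagent subselections and issuing product identifiers that keep the synthons traceable) can be sketched as follows. This is only an illustration under stated assumptions: it performs no chemistry, since the structure transformation itself is the job of Reactor, and the function, reaction name, and reagent IDs are all hypothetical.

```python
from itertools import product


def enumerate_library(reaction_name, reagents_a, reagents_b):
    """Yield one record per reagent pair for a two-component reaction.

    The product ID encodes the reaction and both synthons, so every
    virtual product can be traced back to its reagents.
    """
    for a, b in product(reagents_a, reagents_b):
        yield {
            "product_id": f"{reaction_name}:{a}+{b}",
            "reaction": reaction_name,
            "synthons": (a, b),
        }


# Hypothetical reagent subselections, e.g. as pulled from a reagent database.
amines = ["AM-001", "AM-002", "AM-003"]
acids = ["AC-010", "AC-011"]

library = list(enumerate_library("amide_coupling", amines, acids))
print(len(library))               # 3 amines x 2 acids
print(library[0]["product_id"])
```

Because the identifier is derived from the reagent IDs rather than stored separately, the synthon information does not have to be duplicated alongside the reagent database.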
Finally, Lars Ole summarized some integrated computational chemistry applications. Virtual libraries can be designed based on pharmacophores or established scaffolds, and can be enumerated. Properties, conformations, and fingerprints can be calculated. In virtual screening, pharmacophore search is carried out with LigandScout, and docking with SeeSAR and LeadIT. Similar or diverse compounds can be selected from the results of virtual screening using properties or fingerprints. The improved system is clearly more sustainable, stable, and reliable, with many new features, but Lars Ole would like to see even more intuitive format menus (for color, superscript, and subscript) in Marvin in future. He also listed some requests for more functionality and customizable dynamic content in Plexus Connect:
- Implementation of scripts and buttons in form view
- Pull down menus in form view
- SDfile import
- Adding records in form view
- Definition of mandatory fields in form view.
Compound and Assay Data Management in the Cloud
ChemAxon’s capability to upload and manage assay data is rapidly increasing; this is the newest development coming to the cloud. ChemAxon Assay is a new, web-based assay data management tool which matches intelligent uploading functionalities with a modern scalable database. It integrates easily into ChemAxon Synergy and can be used in conjunction with Compound Registration to form a complete solution for compound and assay data management. Ronan Downey and Lenka Cardova described ChemAxon Assay. Interactive upload of assay data gives full control of the process (including cross checking IDs in compound registration, mapping, curve fitting etc.). Alternatively, there is a template for a semi-auto upload, or multiple files can be uploaded automatically. Egnyte offers secure file sharing. Plexus Connect is used to search and report data.
Partner Session
Nóra Lapusnyik of ChemAxon introduced this session. To describe ChemAxon’s partnering philosophy she used a metaphor: the secret of baking a really good cake is to get the best quality ingredients from the best suppliers, to stir in some chemistry, and to bake everything to perfection. This year there were partner presentations from Arxspan, BSSN Software, Certara, Chemantics, Enamine, IDBS, KNIME, Mestrelab, MolPort, ONTOFORCE, SciBite and SciNote.
Arxspan is now part of Bruker. The Arxspan ELN and cloud suite of applications offer an integrated data management platform for collaborative research. Bill Rousseau listed the benefits of the workflow:
- It provides a single solution for the creation of work requests and work execution
- It allows users and administrators to configure scientific workflows
- It reduces the number of interfaces researchers are required to navigate daily
- It simplifies working with external collaborators
- It provides customers a means to customize the ELN.
The workflow spans ELN, registration, inventory and assay. Bill showed how to configure the workflow, and he presented a use case.
Patrick Boba said that BSSN Software focuses on the data that are produced in the lab and the processes around them. To do that, the company uses the Analytical Information Markup Language (AnIML) XML file format. The core schema offers a generic container to store the data, and technique definitions are used to define the setup for specific experiment types. BSSN’s Seahorse Scientific Workbench is offered for desktop, web and mobile platforms. Another product is the Sea Shell BioProcess Manager which handles and tracks analytical requests throughout the laboratory workflow, and offers features such as real-time dashboards, and integration with other tools from BSSN’s portfolio. The bioprocess manager is a good example of how BSSN Software uses Marvin JS. First of all, it gets the information on the chemical structure from a data source, for instance, directly from a LIMS or an AnIML file, into the analytical request. Then the user is able to work on that request by using the embedded Marvin JS plugin. So instead of having multiple sources of data in different places, the process management can be combined with chemical information in a single place.
Certara has a number of software solutions, and 90% of all novel drugs approved by the US FDA in the past three years were supported by Certara software or services. Nina Hofle of Certara concentrated on D360 for scientific informatics. D360 provides scientists with self-service data access, to facilitate faster time to insight, for a wide range of research organizations. It is used in five out of the top 10 pharmaceutical companies. Initially it was most used to explore SAR in small molecules, but it is now used in biologics discovery (sequences and HELM notation are supported), and in preclinical safety. Shortening the time of the design-make-test-analyze cycle means more cycles can be carried out in a year, and improving the quality within a cycle means fewer cycles are needed, so drugs get to market faster. D360 Express is aimed at smaller companies, and D360 Partner is for external collaborators. A JChem Cartridge provides chemical structure searching within D360 data queries. Compound Registration works as a data source for D360. MarvinSketch can be used for structure input, and a custom compound sketcher has been created for one Certara customer using Marvin JS. The use of Marvin Live with D360 is under investigation.
Chemantics, an OSTHUS company, has developed an infrastructure to enable the integration of chemical structures into big data approaches, opening up access to any chemical data in a semantic data framework. Its chemistry tools were developed in partnership with ChemAxon. The Chemantics infrastructure integrates public and proprietary data across R&D domains based on chemical structures and reactions. Structures are first standardized and checked using Standardizer and Structure Checker. They are then analyzed, and a “chemantic” graph is built and prepared for structure search. Structure search is exposed via an API. Eight structure species (e.g., Bemis-Murcko scaffolds) are calculated. The classification ontology has about 600 features, functional groups, etc. Uniform Resource Identifiers (URIs) are generated. The chemantic annotations enable a linkable “internet of chemistry”. New sources are constantly being added to the integration engine. Chemantics provides one interface to access any chemical structure, public or proprietary, while maintaining data provenance, and keeping data proprietary as necessary. All structures are stored in searchable databases, currently from RDKit and InfoChem. The structure search service is a plugin environment and other cartridges or search technologies can be added based on customer requirements. Discussions with ChemAxon are underway concerning implementation of the latest JChem PostgreSQL Cartridge.
The Readily AccessibLe (REAL) database (described by Andrii Buvailo of Enamine) is a chemical space of hundreds of millions of synthetically accessible, lead-like compounds, which have an 80% success rate of synthesis. They can be ordered at fixed prices at EnamineStore.com (after searching using ChemAxon software), for delivery within 3-4 weeks. Sixty-five percent of REAL molecules have no close analogues among the 17.2 million molecules purchasable from eMolecules. The EnamineStore API enables the database to be linked to a customer’s in-house software. Marvin Live, for example, could be linked. Enamine is working with Olexandr Isayev’s team at the University of North Carolina on machine learning to grow the REAL space. The database itself continues to grow. More than 2 billion molecules can be made from 68,000 in-stock building blocks and 167 reliable reaction procedures, but tens of billions of compounds could be made from 220 reaction procedures and 100,000 building blocks for which more than 10 grams are in stock.
Ian Pierson of IDBS began by illustrating some of the features of E-WorkBook Chemistry, including stoichiometry features, with integrations to reagent sources, analytical services and registration; Marvin JS sketching (integrated with MarvinSketch, ChemDraw and BIOVIA Draw); and integration with ChemAxon’s Compliance Checker for risk analysis. Single sign-on to ChemAxon Compound Registration gives direct access to the ChemAxon registration record. Search combining experimental and chemical information uses JChem in the back end. Ian also emphasized biology. In E-WorkBook it is possible to load GenBank, GenPept, FASTA or HELM formats, to draw using HELM, and to query bioregistration records. Linear and cyclic sequences can be rendered. The BioEddie sequence editor, and bioregistration search by sequence and type, are in the Biomolecule Toolkit back end.
Mihály Medzihradszky (“Medzi”) represented KNIME. JChem Extensions for KNIME (including a few free ones), developed by Infocom, provide ChemAxon tools in the KNIME node format. Marvin JS is available in the KNIME Server Web Portal. ChemAxon applications can be extended via workflows: users can choose best of breed components, build workflows in KNIME, and deploy them via the KNIME Server REST API. Distributed executors have been built for more efficiency in the KNIME Server.
Mestrelab Research has announced a new strategic collaboration with Bruker as part of which Bruker has become a majority shareholder of Mestrelab. Guy Desmarquets of Mestrelab Research talked about his company’s software products Mbook and Mnova. Mbook is a cloud-based ELN with built-in raw analytical data support, and automatic structure confirmation capabilities. Users can improve efficiency by connecting their ELN to the instrument, automatically processing the data, and incorporating the results in the ELN to close the loop. Mnova is a multivendor software suite designed to process data from NMR, LC- and GC-MS, and electronic and vibrational spectroscopic techniques. Information stored can be mined and displayed in various layouts. Mgears is a brick-based system for building analytical workflow pipes. Verification can be driven by Mbook Chemistry or a third party ELN. Mbook can consume services and capabilities in Mnova and its plugins. Guy showed screen shots of analysis request and management with Mbook, and concluded with an architecture diagram for the whole Mbook analytical chemistry based web ecosystem.
The MolPort database contains data and prices for over 7 million purchasable chemical compounds, from 65 suppliers, available from stock. Andrii Lozoniuk spoke for MolPort whose database stores structures, warehouse inventory, prices, quality control methods, and compound state. The data are updated daily, and fulfillment is tracked. The database is searchable on the Web (by Marvin JS and JChem Base), and downloadable by using FTP or HTTP protocols. With web services (an API), users can get up-to-date information for compounds using KNIME, Pipeline Pilot, Excel, or the users’ own applications.
ONTOFORCE DISQOVER is a semantic search platform that integrates over 140 public biomedical data sources, such as PubMed and ChEMBL, and internal “data lakes”. Peter Verrykt showed how an organization can use DISQOVER to link any number of data sources and search all the data via one easy-to-use interface. DISQOVER uses semantic web technologies, including ontologies, URIs and RDF, to build linked data. Following a partnership with ChemAxon, users can now draw a structure or substructure in Marvin JS and select a search action which is sent to a third-party service such as the Chemantics web service (see above) or JChem. The results are sent back to DISQOVER, where the user can continue to navigate the linked data. DISQOVER lets users gain competitive, new insights in minutes and bridges the gap between big data storage and analytic possibilities. It helps users handle their growing data problems, and makes the data actionable by enabling searching, data linking, harmonization and cataloging.
Neal Dunkinson of SciBite emphasized the importance of having clean data before applying analytics to get insights. Ontologies are a de facto standard in creating clean data. They are the key to unlocking findable, accessible, interoperable, and reusable (FAIR) data. SciBite combines ontologies and machine learning to revolutionize access to and use of scientific information, transforming unstructured text into contextualized, machine readable data suitable for new discovery. SciBite VOCabs go beyond ontologies. They are maintained by a dedicated team, using expert curation and machine learning. They are far more detailed than anything publicly available, with a huge number of extra synonyms, and are aligned to industry standards. Users can customize VOCabs, or deploy their own vocabularies. The ChemAxon lookup service is integrated as a VOCab into TERMite, SciBite’s named entity recognition and extraction engine. Extracting and searching a combination of chemistry and biology knowledge via a single interface allows users to answer questions such as “What research has been conducted in a given disease or target related to a specific chemical substructure?” Network views of disease-phenotype mappings can be built (“phenotype triangulation”). By layering in ChemAxon’s Chemical Name and Structure Conversion, the use of phenotype triangulation is expanded to drug repurposing at scale.
Can electronic notebooks replace paper? Matjaz Hren of SciNote believes that an ELN has a central role in reproducibility and making science more efficient. It stores data safely and securely; connects users with processes, results, and inventory; aids in project management; allows communication across teams; and assures interoperability of data. The return on investment is proven. On average SciNote users saved 9 hours per week due to increased efficiency. On average, they spent 40 hours spread over 3 months to become proficient. Taking the average U.S. researcher salary into account, the breakeven point is only 1.5 months after becoming proficient, saving $14,000 per user per year. SciNote has taken the first step in integration with ChemAxon by incorporating Marvin JS in part of its ELN. In future it hopes to expand integration with Marvin JS throughout SciNote, including in inventories and annotations. Other ChemAxon tools (structure search, text-to-structure, and cartridges) will be integrated for advanced SciNote customers.
The five takeaway messages from the meeting concerned the development of a new version of Marvin; hypothesis management in Marvin Live; speed and innovations in JChem search technologies; agnostic registration; and ChemAxon Assay. Of course, these are only a few of the products we were told about. As usual, things are moving forward quickly at ChemAxon. The theme of this year’s meeting was “superheroes”. When someone asked Dávid Malatinszky (the opening speaker from ChemAxon) what the company’s superpowers are, he first thought that one power was strong chemistry that makes ChemAxon different. Or it could be the open APIs that make partners happy. After a little more thought, however, he decided that, actually, ChemAxon’s clients are the superpowers. This reflects many comments made about collaboration between ChemAxon and its customers, and agile development, and it seems a good note on which to end this year’s report.