ChemAxon users from around the world gathered at the Novotel, Budapest for the annual European meeting. A hands-on workshop took place on the morning of the first day, followed, after lunch, by round table discussions on big pharma R&D IT infrastructure, cheminformatics in the cloud (integration and migration), and “saying good bye to OLE”. The day concluded with the traditional garden party at the Graphisoft Park. On the next night, in between two days of presentations by users and ChemAxon staff, there was a most enjoyable boat trip on the Danube, with dinner. This report summarizes the presentations.

Keynote Address

Data for drugs

In a departure from the usual framework of ChemAxon user meetings, the opening session began with a keynote address from a well-known name in the industry, to set the scene about the issues we face. John Overington of the Medicines Discovery Catapult was the keynote speaker. The Catapult centers are a network of centers designed to transform the United Kingdom’s capability for innovation in specific areas. The Medicines Discovery Catapult has applied R&D expertise and infrastructure to support industry in developing new approaches for the discovery and early development of new medicines, helping to transform ideas into commercial products and services. It brings together independent stakeholders in patient-centered collaborations to develop and validate new ways of discovering new medicines. As Chief Informatics Officer, John will lead his expert team to help the UK community make better use of data in the complex process of medicines discovery. These initiatives will include improving the flow of data into and between predictive analytics and modeling systems, and improving collaborative platforms that can share and interpret data to improve the speed and predictability of medicines discovery.

John first talked about ChEMBL (for which he was formerly team leader), SureChEMBL, and UniChem. ChEMBL is the world’s largest primary public database of medicinal chemistry data; all the data are open. ChemAxon’s tools and others were used in the production of SureChEMBL. Marvin JS is the sketcher for ChEMBL and SureChEMBL. UniChem produces cross-references between chemical structure identifiers from different databases, with over 144 million structures from about 30 sources. Unfortunately errors in both data and chemical structures abound in the scientific literature and on the Web.

John presented examples of data mining of ChEMBL and other public domain data. The first concerned drug targeting, monotherapy and monopharmacology versus monotherapy and polypharmacology, and combination therapy versus drug blending. The second concerned mechanisms of drug resistance, mutations of the target, and blending inhibitors. The third concerned competitive intelligence and kinase inhibitor productivity. Next John discussed differences observed in the physicochemical properties of antibacterials compared with other drugs: differences which are likely to be due to target class, not organism. Finally he presented recent work on assay networks being done by a PhD student of his. He showed an assay graph of FDA approved drugs linked by shared activity in a phenotypic assay, assay clustering using word embedding, and an assay graph from Word2Vec software.

Chemical Drawing, Calculations and Design

Marvin JS in Discngine Decision

Discngine provides services and solutions for life sciences research informatics. Discngine Decision is a business intelligence tool that allows users to gather, analyze and visualize multivariate data within TIBCO Spotfire to interpret and combine experimental results, from multiple sources, in order to design the next steps of R&D projects rationally and intelligently. Sandra Plumas said that a sketcher was needed in Discngine Decision to sketch a molecule to add it in a report, and to carry out substructure search in a dataset or corporate database.

Discngine chose Marvin JS for many reasons. From the developer’s point of view, it does not use Java, and it is easy to integrate, with a complete, clean, and well documented API, which also has many import and export formats. The Marvin JS API allows the developer to attach a custom function to different events in the sketcher: each time the user makes a change to a molecule, this function is run. From the user’s point of view, Marvin JS is intuitive and user-friendly, it has a flexible and extensible interface, and there are frequent releases. It has a responsive and compact design. (Responsive Web design is the approach that suggests that design and development should respond to the user’s behavior and environment based on screen size, platform and orientation.) From the point of view of Discngine’s customers, the business relationship is flexible, and the solution is affordable. The Marvin JS Oracle Application Express (APEX) plugin was easy to implement, with few parameters. The sketcher is used with the molecule mode in Discngine Decision; users can draw or import molecules and the Marvin JS API retrieves the structure in .mol format.

Marvin JS at the University of Chemistry and Technology, Prague

There are 300 courses at UCT implemented in the learning management system Moodle, an open source platform for educators to develop and manage courses online. It is a modular system based on plugins. Martin Mastny has integrated Marvin JS for online usage in chemistry courses at UCT. Marvin JS is integrated into TinyMCE so that teachers are able to draw structures and edit them at any place in Moodle, but Moodle recently started to use a different text editor.

In the organic chemistry Web application, teachers build questions from predefined steps, and draw answers in Marvin JS. Students’ answers are checked and hints are given. Teachers get basic statistics about the students’ answers. A question consists of several steps, two of which (drawn image and expected answer) involve Marvin JS, which opens in a pop-up window. There are editable Marvin JS presets for each question. Structures are exported and saved either as .png images or SMILES and InChI by calling JChem Web Services. In the inorganic chemistry application, a Lewis structure creation wizard uses Marvin JS. Different setups of Marvin JS use different features, and three or four Marvin JS editors cooperate in each wizard. This necessitates a lot of built-in presets of the sketcher.

The question type in Moodle is based on the EasyOChem work of Carl LeBlond of Indiana University of Pennsylvania. Martin has rewritten that plugin and added Marvin JS presets per question, JChem Web Services support, and better user interaction. He is trying to get the new plugin accepted as an official Moodle one. It can also be based on Open Babel with no need for JChem Web Services.

Martin used Marvin JS for many reasons. There is no need for additional technologies beyond a modern Web browser. The JavaScript API is rich when used with the editor alone, and JChem Web Services can be connected to the editor to add further functionality. ChemAxon user support is excellent, and the product has regular releases and bug fixes.

An integrated testing strategy for skin sensitization assessment

Barry Hardy of Douglas Connect reported on work done in conjunction with Ahmed Abdelaziz on a Web application implementing an integrated testing strategy (ITS) developed by Jaworska et al. (Jaworska, J. S.; Natsch, A.; Ryan, C.; Strickland, J.; Ashikaga, T.; Miyazawa, M. Bayesian integrated testing strategy (ITS) for skin sensitization potency assessment: a decision support system for quantitative weight of evidence and adaptive testing strategy. Arch. Toxicol. 2015, 89 (12), 2355-2383). The ITS uses three validated in vitro assays, DPRA, KeratinoSens and h-CLAT, which represent the first three key events of the adverse outcome pathway for skin sensitization.

In collaboration with Joanna Jaworska, Douglas Connect have now created an integrated software application, with a Bayesian network structure, reproducing the work by Jaworska et al. Network training and prediction were implemented in R, and molecular pre-processing in RDKit. LogP, logD and logS calculations were performed by ChemAxon Calculators. For plasma protein binding, Douglas Connect built a customized model on OCHEM, which was used for the online version and was later replaced by an offline version running in R. (The published work involved calculations using ACD/Labs software.)

The testing strategy provides a quantitative potency estimate for skin sensitization together with a 4-class prediction for the local lymph node assay (LLNA) outcome, the four classes being non-sensitizing, and weakly, moderately and strongly sensitizing. The Bayesian network provides robustness in combining information from different sources, and a mechanistically interpretable single decision with confidence estimation, incorporating bioavailability into the decision. Bayesian networks can run predictions despite missing information. The ITS corrects for the anti-inflammatory properties of Michael acceptors and accounts for the in vitro assays’ applicability domain. It provides a testing strategy by suggesting which assay to run next: the one which achieves the maximum value of information (VoI).
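The idea of choosing the next assay by value of information can be illustrated with a deliberately simplified sketch. It is not the published Bayesian network: it assumes a naive Bayes model in which each assay gives a binary result that is independent of the others given the potency class, and every probability in it is invented for illustration. VoI is taken here as the expected reduction in Shannon entropy of the class posterior, which is one common choice; the presentation did not specify how the ITS computes it. Assays without a result are simply left out of the evidence, which mirrors the point that predictions can run despite missing information.

```python
import math

CLASSES = ["non", "weak", "moderate", "strong"]

# Hypothetical prior over the four LLNA potency classes (illustration only).
PRIOR = {"non": 0.4, "weak": 0.3, "moderate": 0.2, "strong": 0.1}

# Hypothetical P(assay positive | class); NOT taken from the published ITS.
P_POS = {
    "DPRA":         {"non": 0.10, "weak": 0.50, "moderate": 0.80, "strong": 0.95},
    "KeratinoSens": {"non": 0.15, "weak": 0.60, "moderate": 0.85, "strong": 0.90},
    "h-CLAT":       {"non": 0.20, "weak": 0.55, "moderate": 0.75, "strong": 0.90},
}


def posterior(evidence):
    """Class posterior given {assay: True/False}; assays not listed are ignored."""
    weights = {}
    for c in CLASSES:
        w = PRIOR[c]
        for assay, positive in evidence.items():
            p = P_POS[assay][c]
            w *= p if positive else 1.0 - p
        weights[c] = w
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(assay, evidence):
    """Expected drop in posterior entropy if `assay` were run next."""
    current = posterior(evidence)
    p_positive = sum(current[c] * P_POS[assay][c] for c in CLASSES)
    expected_entropy = 0.0
    for outcome, p_outcome in ((True, p_positive), (False, 1.0 - p_positive)):
        updated = posterior({**evidence, assay: outcome})
        expected_entropy += p_outcome * entropy(updated)
    return entropy(current) - expected_entropy


def next_assay(evidence):
    """Pick the not-yet-run assay with the largest expected information gain."""
    candidates = [a for a in P_POS if a not in evidence]
    return max(candidates, key=lambda a: expected_information_gain(a, evidence))
```

Calling next_assay({"DPRA": True}) returns whichever of the two remaining assays is expected to sharpen the class posterior most under these made-up numbers.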

The ChemAxon LogP calculator was able to reproduce the Bayesian network accuracy reported in the literature, including an estimation of confidence in the predictions. Replacing the plasma protein binding method also slightly improved the performance reported in the literature. LogD and logS calculator replacement had a slight negative effect on overall network performance, but the performance was still on a par with the reported accuracy.

The Douglas Connect solution can be used as a standalone application, or integrated into existing applications, and workflows, for example, using OpenTox APIs. Versioned summary results and a detailed report are available for users to cite, and share with a team, and eventually with regulators. Partners can receive an on-site version (e.g., a containerized deployment). The Web interface is available for evaluation at

Does a 3D shape based method differ from a molecular descriptor based similarity method in its ability to predict biological activity?

More than 100,000 compounds have been screened against the NCI60 collection of 60 human cancerous cell lines maintained by the National Cancer Institute (NCI), providing a vast repository of molecules for which both biological data and structural information are available. In work done by Anna Lovrics and colleagues at the Institute of Enzymology at the Hungarian Academy of Sciences, pairwise structural and biological similarities were calculated for a set of NCI agents to quantify how well structure similarity predicts related patterns in activity data.

Activity was measured by GI50 (50% growth inhibition activity) values. To measure molecular descriptor similarity, Anna’s team used chemical fingerprints (CFP) and extended chemical fingerprints (ECFP), and the Tanimoto similarity metric, using ChemAxon’s Screen2D. In 3D shape based similarity, the active conformation is unknown. The maximal pharmacophore volume intersection of the van der Waals volumes is calculated. In OpenEye’s Rapid Overlay of Chemical Structures (ROCS) each compound is expanded into a set of 3D conformers and these are compared to each other. ChemAxon’s Screen3D uses pairwise comparison of molecules by maximizing their overlaps while tweaking their rotatable bonds, flexible rings, and ring systems.
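For readers unfamiliar with the Tanimoto metric, a minimal sketch follows. It treats a fingerprint as the set of its "on" bit positions and computes shared bits divided by distinct bits. The fingerprints are made-up sets rather than real CFP or ECFP output, and the function is a generic illustration, not Screen2D's implementation.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up fingerprints: 3 bits in common, 6 distinct bits overall.
fp_1 = {1, 4, 9, 16, 25}
fp_2 = {1, 4, 9, 36}
print(tanimoto(fp_1, fp_2))  # 3 / 6 = 0.5
```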

Anna and co-workers took 49,569 NCI compounds with both GI50 and structural information, and reduced this to 5,531 compounds by selecting valid structures for compounds which had not just a known GI50 for at least 30 cell lines, but also a high enough variation of the GI50 values. They calculated positive predictive value (the fraction of molecule pairs with at least a certain structural similarity that also show a minimum biological activity similarity) and sensitivity (the fraction of molecule pairs with at least a certain biological activity similarity that also share a minimum structural similarity). Using the NCI60 panel, Wallqvist et al. have found that the connection between structure and biological response is not symmetric, with biological response better at predicting chemical structure than vice versa (Wallqvist, A.; Huang, R.; Thanki, N.; Covell, D. G. Evaluating Chemical Structure Similarity as an Indicator of Cellular Growth Inhibition. J. Chem. Inf. Model. 2006, 46 (1), 430-437).
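The two pairwise measures defined above can be written down in a few lines. The sketch below assumes each molecule pair has already been reduced to a structural similarity and a biological activity similarity on comparable scales; the cutoffs and the example values are hypothetical and are not the ones used in the study.

```python
def ppv_and_sensitivity(pairs, structure_cutoff, activity_cutoff):
    """pairs: iterable of (structural_similarity, activity_similarity), one per molecule pair.

    PPV         = pairs similar in both / pairs structurally similar
    sensitivity = pairs similar in both / pairs biologically similar
    """
    struct_similar = 0
    activity_similar = 0
    both = 0
    for s_sim, a_sim in pairs:
        is_s = s_sim >= structure_cutoff
        is_a = a_sim >= activity_cutoff
        struct_similar += is_s
        activity_similar += is_a
        both += is_s and is_a
    ppv = both / struct_similar if struct_similar else float("nan")
    sensitivity = both / activity_similar if activity_similar else float("nan")
    return ppv, sensitivity


# Hypothetical similarity values for six molecule pairs.
pairs = [(0.9, 0.8), (0.85, 0.2), (0.3, 0.9), (0.2, 0.1), (0.75, 0.7), (0.1, 0.95)]
ppv, sensitivity = ppv_and_sensitivity(pairs, structure_cutoff=0.7, activity_cutoff=0.6)
```

In this toy set three pairs are structurally similar and four are biologically similar, with two pairs in both groups, so the PPV is 2/3 and the sensitivity is 1/2. The two measures need not agree, which is the kind of asymmetry the Wallqvist paper describes.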

The Hungarian workers found that the different similarity metrics they used are comparable in their ability to predict biological similarity, but there are differences in the set of predicted pairs. Of particular interest are those pairs that exhibit 3D shape and biological activity similarity, but not molecular descriptor based similarity, since scaffold hopping molecule pairs may be found among these. The team collected putative scaffold hopping analogues around the FDA approved drugs within their set of NCI molecules. In silico docking and in vitro decatenation assays validated the mechanism of action of the scaffold hopping analogues around mitoxantrone.

Marvin Live at Sprint Bioscience

Fredrik Rahm said that how to make better molecules faster is a key challenge in preclinical drug discovery. Sprint Bioscience use fragment-based techniques to find high-quality chemical starting points quickly, and then they use structure-based drug design to optimize the compounds. One way to make better design ideas, faster, is to give everyone in the medicinal chemistry team access to the 3D world; to check ideas and generate more of them; and to have improved workflow when sharing, documenting and evaluating ideas. Sprint Bioscience chose to use Marvin Live to streamline collaborative workflow in the medicinal chemistry team, from a design idea to a target molecule ready to be synthesized.

Marvin Live can be used, for example, to calculate physicochemical properties; to align a virtual molecule (design idea) with a known molecule (design template), taken from a ligand-protein crystal structure generated in-house or downloaded from the PDB; and to see how the proposed molecule can fit in the binding pocket of the protein structure. Design ideas can easily be captured as snapshots containing the structure and a short comment (“Task”) explaining the hypothesis behind the design idea. A collection of snapshots can be saved as SDfiles or PowerPoint slides, for example, ready for reporting and analysis. New ideas are then generated and the chemist decides which compounds to make next. Sprint Bioscience has also built a building block “one stop shop” where the internal chemical inventory can be searched in parallel with commercial sources such as MolPort. There are over 12,000 reagents in the company inventory, and more than 23,000 bottles. Marvin JS is used as structure editor in the system. Plexus Design is used for enumeration. A shopping list can be constructed.

Marvin Live is customizable and versatile but Fredrik made three suggestions for enhancements: quicker alignments; surface rendering of the protein; and sorting and reordering snapshots, and deleting snapshots that were captured by mistake (e.g., mistakes when drawing the idea molecule, or duplicates).

Platform and Back End Tools

[email protected] Migration to ChemAxon technology. Our way into a new world

Elke Hofmann of Merck KGaA outlined the objectives behind the company’s migration to new technology. Merck had too many systems with partly overlapping functionality, and wanted to reduce the number of systems and of system providers. They wanted to look to the future by implementing technology with high innovation potential and good support, including large molecule representation via the Hierarchical Editing Language for Macromolecules (HELM) standard, enrichment of SharePoint with chemical intelligence, chemical text mining, macrocycle naming, Markush structures, and so on. The initial scope was the database cartridge, the drawing tool for chemical structures, and JChem for Office add-ins for structure visualization and manipulation.

Several vendors were contacted and asked for detailed information about their systems. The decision to proceed with ChemAxon was based on better coverage of functionality, and the support option during migration. Before the final decision, ChemAxon technology was installed in-house to evaluate potential risks, and an extended pilot study with Marvin and JChem for Office was carried out.

A risk assessment and mitigation plan was made covering different chemical representation (e.g., stereochemistry) in the cartridge, the change of drawing tool, and chemical structures in legacy documents. Duplicate and isomer recognition proved to be very good, but the thousands of legacy documents with OLE objects of the old drawing tool were more of a problem. Conversion to Marvin OLE on-the-fly was possible, but stability and speed needed improvement. OLE is supported only with Java 1.6. It was thus decided to convert only active documents.

The main issues with V14 of the drawing tool were configurable settings such as standard bond lengths; journal settings; publication-ready quality of drawn structures; handling of no-structures; and complex reaction schemata. ChemAxon collaborated closely with Merck to mitigate most of the issues, but there is still the need for a drawing tool with support for “ambitious” publication-quality sketches.

A stepwise approach was taken to implementing the new research database. Introduction of Marvin and JChem for Office add-ins started at the beginning of 2015. Migration of the research database should be complete by mid-2017, and migration of other structure databases by the third quarter of 2018. Migration of the reaction database for the electronic laboratory notebook will continue through 2018.

Michael Hofmann took over the presentation at this point. Merck contacted three vendors about controlled substance compliance, and then chose ChemAxon’s Compliance Checker because it was the best fit to requirements, and because of ChemAxon’s record on collaboration and support. Compliance Checker is now used at Merck to check if the research substance pool is in compliance with applicable regulations; in future it will also be used to ensure that time is not wasted on designing compounds which are not compliant.

Michael finally summarized the current situation. The chemical functions of the cartridge are good, but performance has not yet been tested under heavy load. There are issues with middleware, access, and security. JChem Web Services for both Web applications and as a data source for JChem for Office now function very well, after ChemAxon enhanced JChem for Office to support this, and enabled JChem Web Services to support the JChem Oracle Cartridge.

Marvin has been integrated in major applications, and Marvin JS and Web Services have been integrated into some small Web applications. JChem for Office is still in beta test at Merck. Unfortunately JChem for Office influences the standard behavior of Microsoft Office so that the latter is not working as expected. Merck suggests that JChem for Office should not take over any Office functions, and there should be on-demand usage of ChemAxon functions if structures are involved. JChem for Office should not make automatic decisions; for example, the chemist should decide if he or she wants to insert a molfile or a chemical name in the clipboard as a structure or as text.

Non-standard behavior in Marvin is unsatisfactory, for example, “arrow up” moves the selected structure down, and selecting and deleting a bond deletes the atoms as well. [Postscript: ChemAxon fixed the arrow issue in Spring 2017, and they point out that the preferences menu can be used to fix the second issue.] Integration of Marvin in the .NET world has caused some problems. A new version of Marvin must not trigger rollout of all custom applications using Marvin; a workaround has been implemented. An unsolved challenge is a custom application crash after Ctrl+C in Marvin is called from the custom application. ChemAxon has not been able to reproduce this bug in any other environment.

A big complaint is ChemAxon’s release cycle: ChemAxon releases new versions weekly and each version contains both bug fixes and new features which may introduce new bugs. Merck would prefer just one release a year with bug fixes based on this release, although a maximum of two releases per year might be acceptable.

Registration and Management of Chemical and Biological Data

Biology and chemistry in E-WorkBook

Paul Denny-Gouldson of IDBS gave an update on progress in E-WorkBook since 2016. Compliance Checker, Marvin JS, Compound Registration, and a Plexus Design parallel synthesis solution have been integrated. IDBS have worked with their chemistry forum members to make the data capture and presentation for the stoichiometry table as simple and streamlined as possible. This means more automatic calculations, tab support for moving between entry fields, splitting of the screen, and so on.

Registration of biologics has been added. Most large molecule therapeutics are synthesized by cell lines, so the synthetic process is far more complex than that used in small molecule development, and biologics production is often continuous, not batch centric. The biologics landscape includes materials such as nucleotides, expression systems, and peptides, and the expression systems producing samples of them. Paul compared the physical and virtual aspects of bio-registration. A physical entity such as a vial of primer may or may not be linked to registered entities. Physical materials and samples have parent:child relationships. A primer sequence is an example of a virtual entity. It is typically linked to batches in inventory. There are relationships between entities. Uniqueness checking and business rules are needed for registration. Paul ran through a typical scenario involving E-WorkBook and peptide synthesis, CHO-S host cells in the inventory, and transfection parameters in E-WorkBook.
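The distinction between virtual and physical entities, and the role of uniqueness checking, can be sketched as a small data model. Everything here is hypothetical and is not E-WorkBook's schema: the class names, the ID format, and the business rule (primers contain only A, C, G and T) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PhysicalSample:
    """A physical item such as a vial; it may or may not link to a registered entity."""
    label: str
    entity_id: Optional[str] = None            # link to a registered virtual entity, if any
    parent: Optional["PhysicalSample"] = None  # parent:child relationship between samples


@dataclass
class Registry:
    """Registers virtual primer sequences and enforces uniqueness."""
    _by_key: Dict[str, str] = field(default_factory=dict)

    @staticmethod
    def _normalize(sequence: str) -> str:
        # Remove whitespace and ignore case so trivially different inputs match.
        return "".join(sequence.split()).upper()

    def register_primer(self, sequence: str) -> str:
        key = self._normalize(sequence)
        # Hypothetical business rule: only unambiguous DNA bases are allowed.
        if not key or set(key) - set("ACGT"):
            raise ValueError("primer sequence must contain only A, C, G, T")
        # Uniqueness check: the same sequence always maps to the same ID.
        if key not in self._by_key:
            self._by_key[key] = f"PRM-{len(self._by_key) + 1:04d}"
        return self._by_key[key]


registry = Registry()
primer_id = registry.register_primer("acgt ACGT")
stock = PhysicalSample("vial 1", entity_id=primer_id)
aliquot = PhysicalSample("vial 1a", entity_id=primer_id, parent=stock)
```

Registering "ACGTACGT" again returns the same ID rather than creating a second entity, while both vials remain separate physical records linked to it.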

HELM and ChemAxon’s Biomolecule Toolkit are essential to the implementation of the process of biologics registration into one integrated system. E-WorkBook can handle large molecules (proteins, DNA etc.), small molecules (pharmaceuticals, fine chemicals etc.), and everything in between (oligonucleotides, small peptides, unnatural monomers, and antibody-drug conjugates).

European Lead Factory Web portal

Tim Dudgeon of Informatics Matters reported on three years of European Lead Factory (ELF) progress. Part of the ELF initiative is to generate a Joint European Compound Collection (JECC) that can be used for screening. Initially 300,000 compounds were supplied by the pharmaceutical company members of the Innovative Medicines Initiative, but this is being supplemented by a further 200,000 compounds from libraries that will be synthesized as part of the project. ChemAxon’s role has been to create the ELF Web portal (using Marvin JS, JChem Base, MadFast Similarity Search, and other technologies) for submission and assessment of design proposals for these libraries. This portal has now been in operation for over three years.

Using the Web portal, library proposals are submitted to a library selection committee from consortium partners or crowdsourcing. There are different workflows for different types of users, and different levels of access. Submitters register themselves before entering a library. They sketch a scaffold using Marvin JS, and then define the library by entering an SDfile, or entering an .mrv file with a Markush library, or by picking R-groups within pre-defined lists, after which Markush Enumeration is used to generate the library. Accompanying information, including rationale and synthesis validation information, is also added.
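Defining a library by picking R-groups from lists amounts to taking every combination of the choices. The sketch below shows only that combinatorial step, using plain string substitution into a scaffold template. It is an assumption-laden toy: real Markush enumeration operates on structures with proper attachment points, and the scaffold and R-group fragments here are invented.

```python
from itertools import product


def enumerate_library(scaffold, r_groups):
    """Yield one product string per combination of R-group choices.

    scaffold: template with named placeholders such as "{R1}".
    r_groups: dict mapping placeholder name -> list of substituent fragments.
    """
    names = sorted(r_groups)
    for combo in product(*(r_groups[n] for n in names)):
        yield scaffold.format(**dict(zip(names, combo)))


# Hypothetical scaffold and substituent lists, written as SMILES fragments.
scaffold = "c1ccc({R1})cc1C(=O)N{R2}"
r_groups = {"R1": ["F", "Cl", "OC"], "R2": ["C", "CC"]}
library = list(enumerate_library(scaffold, r_groups))
print(len(library))  # 3 choices x 2 choices = 6 products
```

The library size is the product of the list lengths, which is why property filtering and similarity comparison of the enumerated structures need to be automated.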

Property calculations are then carried out with Calculator Plugins, and unsuitable compounds are identified using SMARTS, and by applying Compliance Checker with reference to the UK Misuse of Drugs Act. Property distributions of enumerated structures are produced. The proposal is submitted, and a further round of property calculations is carried out so that ECFP4 and pharmacophore fingerprints can be compared with 12 million structures in a reference set. The comparisons would have taken days had MadFast Similarity Search not been developed. Finally the library selection committee assesses the library. Once a library design is approved, the chemistry is validated, the library is synthesized, compounds are added to JECC, and the compounds can be used for screening. Professor Adam Nelson, Chair of the Library Selection Committee, has said that the Web portal has been instrumental in the success of the overall project.

The number of registered users was 92, and 75 of them submitted 851 library proposals, of which 61% were approved. Of these, 192 libraries have been synthesized and 77 library syntheses are ongoing. The number of compounds added to JECC to date is 145,000. This number is expected to rise to 200,000, at which point the portal is likely to be closed.

Public compounds (“PCC”) are slightly larger, but with a slightly lower ALogP value than compounds contributed by the European Federation of Pharmaceutical Industries and Associations (EFPIA). PCC compounds are less “flat” or “planar” in nature. The PCC set explores chemical space complementary to that of EFPIA sets. PCC and EFPIA compounds are equally likely to be chosen from the hit lists. The biological activity profile of the PCC set is highly distinct from that of the EFPIA sets.

Lessons were learned from the project. Automation of submission had a big payback: it enabled efficient and transparent assessment of proposed libraries by the selection committee. Usability was reasonably good given the resource allocated, but more should have been invested in management and error handling functions. Crowdsourcing of libraries was relatively unsuccessful.

From library design to compound delivery

ComInnex is a CRO with 25 years’ experience of providing drug discovery services. Csaba Peltz presented a use case of compound library services. In the library design strategy, traditional chemistry steps ensure a wide diversity of chemotypes among key intermediates and final libraries, after which a unique technology-enabled step, with proprietary synthetic know-how, enhances IP security and novelty, and increases synthetic scope and success rate. Data from the ComInnex ELN and LIMS are used with ChemAxon software, and RDKit with Python, in a KNIME workflow starting from collecting building blocks and reagents, and ending with library and reaction generation for the selected products. JChem and Standardizer are used in reaction tree handling (written in-house) and structure search.

Csaba outlined the ELN and LIMS systems used in producing the libraries. Instrument data upload is automated as much as possible and there is strict project handling. The status of a container from its creation, to delivery and storage is visualized. Two-way integration of production and design workflows ensures that new libraries automatically feed target reactions into the ELN, and selection of reagents is based on previous usage, price, and so on. Powerful tools guide the design and production processes, and integration of the tools means that minimal user intervention is required.

Development of compounds targeting resistant cancer

Members of the ATP binding cassette transporter family are major contributors to the acquisition of anticancer drug resistance. Although several ABC transporters have been shown to transport anticancer drugs in vitro, P-glycoprotein (Pgp) stands out by conferring the highest level of resistance to a vast array of drugs, and its association with clinical multidrug resistance (MDR). The team led by Dr. Szakács at the Hungarian Academy of Sciences has sought to identify “MDR-selective” compounds that show paradoxical toxicity against MDR cells (Turk, D.; Hall, M. D.; Chu, B. F.; Ludwig, J. A.; Fales, H. M.; Gottesman, M. M.; Szakacs, G. Identification of Compounds Selectively Killing Multidrug-Resistant Cancer Cells. Cancer Res. 2009, 69 (21), 8293-8301). One of the goals is to expand the scope of currently available bioactive agents and to gain insight into the Pgp-specific mechanism of action.

Fifteen compound classes possessing variable MDR selectivity have been identified in the National Cancer Institute (NCI) drug repository. Veronika Pape and Judit Sessler presented results on the structure activity relationship (SAR) of compound class 1, consisting of 8-hydroxyquinolines. Based on structural similarity, desired property space, and visual inspection, 121 compounds and 282 compounds were purchased from the NCI drug database and a 6 million compound vendor database, respectively. Based on in vitro screening results, focused libraries, consisting of a further 300 derivatives, were synthesized. Initially, Instant JChem (IJC) was used to build a chemical database and the Citotox program was used for storing and analyzing measurement data from the many different assay plate layouts. Later, IJC was complemented by Plexus.

In the past Veronika used to analyze SAR by printing rows of SAR data, cutting cells out of the hard copy, and scattering the bits of paper around the room! Plexus is so much better. MarvinSketch, Calculator Plugins and JChem for Excel were used in semi-automatic prediction of chemical properties. Quality of prediction is discussed in Domotor, O.; Pape, V. F. S.; May, N. V.; Szakacs, G.; Enyedy, E. A. Comparative solution equilibrium studies of antitumor ruthenium(η6-p-cymene) and rhodium(η5-C5Me5) complexes of 8-hydroxyquinolines. Dalton Trans. 2017, 46 (13), 4382-4396.

Veronika and Judit used Plexus Analysis to analyze their chemical and biological database. Datasets can be visualized dynamically in histograms and scatter plots, so the effect of multiple variables such as chemical properties on biological activity can be visualized. JChem for Excel has been used in a published SAR analysis (Pape, V. F. S.; Toth, S.; Furedi, A.; Szebenyi, K.; Lovrics, A.; Szabo, P.; Wiese, M.; Szakacs, G. Design, synthesis and biological evaluation of thiosemicarbazones, hydrazinobenzothiazoles and arylhydrazones as anticancer agents with a potential to overcome multidrug resistance. Eur. J. Med. Chem. 2016, 117, 335-354).

The increased proliferation of cancer cells results in an elevated demand for metal ions, which creates a vulnerability that can be exploited therapeutically. The Hungarian team built a focused chelator library of thiosemicarbazones, hydrazinobenzothiazoles and arylhydrazones and studied toxicity trends related to the donor atom sets. They identified compounds in the NCI database with available biological data using IJC and could confirm their hypothesis of increased toxicity of NNS and NNN donor atom chelators over ONS chelators. The trend was confirmed for matched molecular pairs, and in an overall comparison.

Data Extraction and Curation

Fast access to chemical data with ChemCurator

Jasna Klicic Badoux of the Nestlé Institute of Health Sciences said that chemical structures are the language of chemistry. They convey information in terms of molecular and bulk properties, chemical reactivity, and biological activity. Enabling exploitation of chemical data requires that they are given a technologically suitable format such as SDfiles, for storage in a chemical cartridge database, for example. Chemical structure data are structured, but other data are unstructured in publications, patents, and documents. The unstructured data must be made computer-readable. Once you have the machine readable data you can compute fingerprints, descriptors, properties, and pharmacophores, do similarity searching, carry out virtual screening, and so on. The bottleneck is getting the data.

ChemCurator reads a document in .pdf, .html, .htm or .xml format; annotates chemical terms in the text; recognizes images containing chemical structures (using CLiDE, OSRA or Imago for image recognition); and compiles a file with chemical structures, with no duplicates. Each structure is uniquely linked with an object (text or image) in the document. Manual corrections are possible. Nestlé uses ChemCurator in a KNIME workflow for automatic retrieval and conversion of chemical data into their system.

For a website catalog example, intervention was needed in the case of incomplete names; salt or other modifier information; stereochemistry; and additional text such as purity information. In an example using CLiDE on a large patent document, only the relevant parts were annotated. Out of 20 structures present in the selected part of the document, 17 were extracted automatically, and only one structure was correct. Errors in the image interpretation required correction. Interpretation is dependent on the image quality and the font. Three out of 20 structures were not imported due to failure in atom recognition (S was appearing as a question mark or an asterisk). Dynamic linking between a document and a compound entry allows chemical structures to be copied and pasted for fast corrections. You would make even more errors if you did everything manually, and it would be slow work.

Jasna presented another example, extracting chemistry from a medicinal chemistry literature article; 26 structures, including glutathione and amino acids, were automatically extracted after filtering for a molecular weight greater than 100; 11 structures were automatically extracted after filtering for molecular weight over 100 and less than 500, including benzene as a substructure, and only two amino acids. In extracting structures from images in this document, some images were not recognized, due to arrows, or Markush structures, or issues with CLiDE.
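The molecular weight filtering described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: it is not ChemCurator's API, the function name and record layout are hypothetical, and each extracted structure is assumed to carry a molecular weight already computed by a cheminformatics toolkit. The weights shown are approximate, and the substructure filter mentioned in the talk is not covered.

```python
def filter_by_weight(structures, min_weight=None, max_weight=None):
    """Keep (name, mol_weight) records with min_weight < mol_weight < max_weight.

    Either bound may be None, meaning "no limit on that side".
    """
    kept = []
    for name, mol_weight in structures:
        if min_weight is not None and mol_weight <= min_weight:
            continue
        if max_weight is not None and mol_weight >= max_weight:
            continue
        kept.append((name, mol_weight))
    return kept


# Hypothetical extraction result: (name, approximate molecular weight).
extracted = [
    ("glycine", 75.07),
    ("glutathione", 307.32),
    ("vancomycin", 1449.3),
]

# First filter from the talk: molecular weight greater than 100.
over_100 = filter_by_weight(extracted, min_weight=100)

# Second filter: molecular weight over 100 and less than 500.
window = filter_by_weight(extracted, min_weight=100, max_weight=500)
```

With this toy data the first filter drops glycine only, while the second also drops vancomycin, which shows how tightening the window shrinks the set that needs manual review.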

Jasna concluded that access to chemical data in the right format is critical; tools such as ChemCurator that automate data conversion into chemical structures are needed; and full automation is still not accurate enough to be used in an unsupervised way, but automation plus manual supervision is still significantly faster than manual conversion.

ChemAxon integration within Orbit Intellectual Property Business Intelligence (IPBI)

Aurélie Brunet of Questel made this presentation. Orbit IPBI is a Web-based solution for IP and R&D professionals, covering over 100 patent authorities, 23 of them in full text, 40 with legal status, and 40 with citations. It has litigation for four authorities; reassignments; a 4-million-record corporate tree; licensing data; standards, designs and trademarks for many authorities; and business information. Powered by ChemAxon Naming software, trade names, common names, IUPAC names, and multiple other chemistry names are linked to the relevant chemical structures in Orbit IPBI. Users can search for a name in a patent database and get a chemical structure. Structure search can be combined with search of Orbit’s high quality IP data.

Since 1997, Intellixir has been a leading solution to analyze patent and non-patent, scientific data for innovation and competitive intelligence. Chemical names have been extracted from all the data, using ChemAxon tools. Users can study the top extracted molecules for a competitor, for a condition, or in a specific time period. They can monitor the evolution over time of molecules or core structures; compare assignees through their molecules; find assignees publishing in similar fields based on their molecules; identify molecule applications through technology domain analysis; identify molecule types through automatic clustering; and look at their own categorization based on structure search versus assignees. There is a dedicated interface for R&D and legal experts to discuss, rate, and comment on documents. R&D experts reading a document benefit from structure display by clicking on chemical names, and they can navigate through the molecule panel. Intellixir users can share results with management and R&D through online, interactive, dynamic reports and a dynamic dashboard.

Partner Session

The ChemAxon partner program is based on cooperation as opposed to competition, and “best of breed” rather than a monopolistic market. The business model is one of flexibility, integration, and impartiality. Partners and users will benefit from Plexus Synergy (outlined below). There were 12 partner presentations from: Arxspan, BSSN Software, ChemPass, Enamine, Yurii Moroz of the National Taras Shevchenko University of Kyiv (on the Enamine REAL database of readily accessible compounds), Informatics Matters, KNIME, Matti Lattu of the Matriculation Examination Board of Finland (on MarvinSketch in an electronic examination system), Mestrelab Research, quattro research, Schrödinger and Titian Software.

ChemAxon Update

The most significant presentation was that by Roland Knispel on Plexus Suite. He said that the market demands integrated workflows, less IT overhead, mix-and-match of tools, operational flexibility, and affordable cost, so ChemAxon’s vision is of integrated discovery workflows within a Web-based environment. Hence the vision of Plexus as a platform: ChemAxon’s ambitious cloud-based plans centered on the ChemAxon Synergy concept, integrating ChemAxon’s supporting tools and multiple third-party tools, allowing users to connect, discover and navigate through many different applications. It will have application, user, and project management; authentication; and security. ChemAxon will offer flexibility in concepts and workflows, extensibility through defined interfaces, and best-of-breed solutions with little effort. Eventually the company will provide a full cheminformatics solution as Software as a Service (SaaS), with all the benefits of operating in the cloud.

Currently three Plexus tools have been released: Plexus Connect, Plexus Design, and Plexus Analysis. They can be accompanied by other ChemAxon products such as Compound Registration. Federated access to data sources is unique to Plexus Connect, and curve fitting is unique to Plexus Analysis (i.e., not available in IJC). In accord with ChemAxon’s vision of integrated discovery workflows within a Web-based environment, a structure activity form for a given registered compound, currently available in IJC, will be developed for Plexus Suite. Plexus Assay is coming soon. The first milestone in Plexus Synergy, integrated registration of compounds, biological data, and reporting, will be available in the first half of 2018. Plexus Synergy in the cloud will be integratable, extensible, modular, and centralized.

MadFast is a new ChemAxon product released in December 2016. It is an engine for fast similarity searching with efficient in-memory storage. It also provides fast calculation of descriptors such as CFP, ECFP and MACCS-166 fingerprints. It is a Java application available via command line, REST API, and Web interfaces. Using 1024-bit binary fingerprints on an Amazon r3.8xlarge machine, MadFast delivers the 40 most similar structures in about 80 milliseconds per 16 million structures, or about 5 seconds per 1 billion structures. Memory usage per million molecules is 250-350 MB. Gábor Imre presented some examples, and a use case comparison with the PostgreSQL cartridge. Future plans include overlap analysis visualization; real time clustering; similarity-based hierarchical clustering; query of a remote database using JDBC; a single desktop UI release; and public Java API components for developers.
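As a rough illustration of the kind of search described above, the following Python sketch ranks in-memory binary fingerprints by Tanimoto similarity and returns the top k hits. It is not MadFast's implementation or API: the function names and the toy 8-bit fingerprints are assumptions made for illustration, and Tanimoto is used here simply as a common similarity measure for binary fingerprints. The last function checks the quoted timings by linear scaling.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def top_k_similar(query: int, fingerprints: dict, k: int) -> list:
    """Return the k most similar entries as (id, similarity), best first."""
    scored = ((tanimoto(query, fp), mol_id) for mol_id, fp in fingerprints.items())
    best = heapq.nlargest(k, scored)
    return [(mol_id, sim) for sim, mol_id in best]


def scale_search_time(ms_per_batch: float, batch_size: int, n_structures: int) -> float:
    """Linearly extrapolate a search time to a larger collection, in seconds."""
    return ms_per_batch * n_structures / batch_size / 1000.0


# Toy 8-bit fingerprints (hypothetical data, for illustration only).
library = {
    "mol_a": 0b11110000,
    "mol_b": 0b11100000,
    "mol_c": 0b00001111,
}
hits = top_k_similar(0b11110000, library, k=2)
```

With the toy data, the entry identical to the query ranks first with similarity 1.0 and mol_b follows at 0.75. Scaling 80 milliseconds per 16 million structures linearly gives 5 seconds per 1 billion structures, which matches the figures quoted above.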

András Volford gave a brief introduction to JChem Base, and the JChem Oracle and PostgreSQL cartridges. His focus was on substructure search, and within that on the “Hit as you draw” feature. This feature returns hits as soon as the query structure is drawn on the canvas, and updates the results whenever the query structure is changed. The remarkably fast substructure search behind it orders the hits based on similarity (relevance) to the query structure. In other talks we learned that soon a Spotfire plugin will be securely integrated with IJC and Plexus Connect. The ultimate goal is for Plexus Connect to become Web-based. Scripting support and usability will be improved. The Marvin applet has been discontinued.


It is 10 years since I attended my first ChemAxon user meeting, in Budapest in 2007. What a lot has happened since then! The company now has 143 employees, 7 distributors, 3 offices and 30 products. Revenues continue to increase. Csizi tells me he does not talk about profit; he just wants customers and employees to be happy. This seems an odd sort of business model to me, as the scribe of 10 years’ standing, but I guess that I am happy too, since I keep going to the meetings. When it comes to the product line, more and more functionality is being made available in Web-based tools, and there is an increased focus on large molecules (as I reported last year). The big news is Plexus Synergy. I was somewhat skeptical when the enormous potential of Plexus Suite was first announced, but major problems do not seem to have surfaced yet. It is too early to assess Plexus Synergy, so I have chosen to call it “ambitious”. One thing that has not kept users happy is the weekly release cycle. I sense that this is likely to change, since ChemAxon always responds fast to customers’ demands. Just as customers and employees appear to be happy, I will be more than happy to be invited to next year’s meetings, even if the company needs to experiment with new formats or venues.