ChemAxon European Annual Meeting (UGM), Budapest, May 24-26, 2016
About 180 people (many of them from ChemAxon) gathered at the Novotel Centrum in Pest for the annual user meeting. One-on-one sessions took place on the preceding day at ChemAxon’s offices in the Graphisoft Park, followed by the traditional garden party: an excellent networking opportunity on a hot, sunny summer evening. The weather was, alas, wetter for the conference dinner the next night, at the Fisherman’s Bastion, but the familiar view of the Parliament was still superb.
The meeting proper started with a ChemAxon overview, in two days of technical sessions on the themes of chemical data extraction; data capture, store and search in the back-end; create and design in a collaborative environment; and data management, characterisation, analysis and report. There was also a partner session. In this report, I have chosen to concentrate mainly on the user presentations.
Questel partners with ChemAxon
Questel is one of the world’s main online service providers dedicated to intellectual property. Renaud Garat said that his company has the patent databases (and related software), while ChemAxon has so many chemistry solutions, some of them of great interest to Questel. So, ChemAxon is the best partner for Questel. The partnership has already led to a new product released on the Intellixir platform. Intellixir applies statistical analysis to scientific publications and patents and specialises in business intelligence solutions for intellectual property. The new product already offers structure display using ChemAxon software. Chemical indexing of patents and structure search will be added in Q3 of 2016. Users will be able to process all sources, including internal documents, and produce landscapes based on chemical structures and substructures. Questel’s Orbit solution will integrate structure display in Q3 of 2016, and structure searching in Q1 of 2017.
Development of ChemAnalyser, a novel, large-scale patent search engine
Lutz Weber of OntoChem has a problem with CAS doing only human annotations. There are 450 million terms and synonyms in the PubChem database, compared with only 1 million words in the “English language”. The solution is to use both human and automated annotations, as Elsevier does for Reaxys. Together with infoapps, OntoChem has developed a chemistry search engine, ChemAnalyser, for patent full text documents. About 90 million patent documents, from 110 patent offices, have been annotated using OntoChem’s UIMA-based annotators, covering terms for (amongst other subjects) chemistry, biology, materials science, diseases, cosmetics, and nutraceuticals. Lutz compared the recall and precision of the ChemAnalyser search engine with those of SciFinder, describing the advantages of using ontology-based “cognitive search”, and searching knowledge N-tuples in non-relational databases. In one example, a search for pinocarvone either by name or by structure gave the same results: 303 patent hits in ChemAnalyser, compared to 53 patent hits (perhaps an erroneous figure) by searching in SciFinder. It might be that ChemAnalyser’s results are more comprehensive than SciFinder’s, but the figures in Lutz’s slides should not be compared, since his system counts individual patent documents, whereas SciFinder counts patent families. Stereochemistry of pinocarvone was also not considered, and the way of counting “patents in 2009” is misleading, since only the publication date was considered, whereas application years and priority years need to be taken into account. It seems unfair to report in detail Lutz’s other examples, including a Markush search, until he has had discussions with CAS. In the back-end, ChemAnalyser uses JChem Base and in the front-end Marvin JS. The display of patents from ChemAnalyser can be switched to patent family view in infoapps’ sem-ip.com software (semantic intellectual property search), and exported into Excel.
The work being done by infoapps and OntoChem is clearly of importance, so I hope to hear an updated version of this talk in future.
“Leveraging” the value of the contents of documents: Plexus for agile dissemination of text- and data-mining results
The increasing volume of both structured and free text data used by the pharmaceutical industry, and the heterogeneity of those data, are such that it is very challenging for users to find what they need, but it is now feasible to extract and structure the information using a pipelining tool to combine several optical character recognition (OCR), optical structure recognition (OSR), text-mining, and semantics applications. Matthias Negri reported that the tools used for this at Boehringer Ingelheim are KNIME for pipelining, ChemAxon KNIME nodes and command line tools for chemical recognition, Linguamatics’ I2E for text mining, CLiDE for optical structure recognition, and ChemCurator and Plexus for visualisation. The patent curation workflow is complex and is still not fully automated but more workflows are possible with KNIME. An eight-CPU notebook plus a server are needed; the OCR correction routines (which have been improved) are particularly compute-intensive. The extracted chemical entities are finally checked for their “novelty” relying on the “type” classification (the output column for molconvert, with flags such as “common” or “systematic”), but also by comparing the preferred IUPAC and traditional names (whenever those are identical, the compound is classified as “novel”). Having successfully extracted terms from patents, Matthias extended the method to the extraction of chemical reactions from PDFs. It is very useful to tap into the wealth of in-house synthetic knowledge in detail. To allow this, linguistic reaction pattern recognition, annotation with brat, and splitting of reactions into components are carried out. The .mrv files are then input to Plexus, which provides multi-user Web access to data managed in Instant JChem (IJC). Another use case is combining structured and unstructured search in external databases, for example, combining chemical structure and text searches of DrugBank. 
The database is mapped by IJC and relationships are displayed via Plexus. Custom views have been built for various customers within Boehringer Ingelheim. Matthias’ final example concerned the adding of meaning to internal Word files. KNIME is used for chemistry recognition, and I2E for indexing and annotating text. The output from the two procedures is combined for input to IJC, and Plexus can then be used for search and visualisation. Detailed reaction procedures for a required substance can be displayed. Matthias noted a few limitations. Plexus produces annoying error messages if there are empty fields; there is only one chemical field, so it is not possible to search for product and reagent in a “chemical” way; and not all the capabilities of IJC are reflected in Plexus.
Handling large volumes of chemical data in a scalable way and on a tight budget with the PostgreSQL database and the JChem PostgreSQL Cartridge
Ellert van Koperen of MedChemData has tackled the challenge of pushing a very large number of chemicals through Reactor. The challenge was not just in performance, but also in keeping the results structured, clean and correct. Using a database was an obvious approach (for data consistency, safety and security, central storage and backup, speed, and availability of centralised tools with centralised licensing), but, more specifically, Ellert chose the PostgreSQL database in combination with the new JChem PostgreSQL Cartridge. He summarised the reasons for choosing Postgres in the following table.
The test case was to run Reactor on a subset of all commercially available reactants for a Grignard reaction. Ellert listed four reasons why it was advantageous to pump the data through a database. Pre-filtering was necessary (or Reactor would need to try 1444 million million combinations); pre-segmentation kept the data clustered; it was possible to ensure the correctness of classes without the need to fix the reaction filters in Reactor; and high speed meant that test-and-fix iterations were easy. Step 1 was to gather the data from nearly 30 chemical suppliers, load it (using JCMAN), and create the chemindex. Step 2 was pre-filtering the Grignard reagent and the reactant. Here some “tricks” were needed to optimise the substructure and feature searches. Step 3 was to query and segment the data, and feed them to Reactor. Ellert now knows that Postgres and the JChem PostgreSQL Cartridge can be excellent performers, if handled with care. Mixing chemical and non-chemical clauses and using the chemindex may require tweaks; mixing several chemical clauses is not possible without tricks; and the NOT operator cannot be used with a chemindex. Nevertheless, though the cartridge is not fully mature yet, the combination of an open source database with a chemistry-aware back-end does have the potential to be a game-changer.
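The arithmetic behind the pre-filtering argument can be sketched in a few lines of Python. The compound counts below are hypothetical, since the talk as reported gave only the total number of combinations; 38 million times 38 million happens to reproduce the quoted 1444 million million, but that pairing is an assumption for illustration. The `segments` helper is likewise illustrative and is not part of Reactor or the cartridge.

```python
from itertools import islice


def pair_count(n_reagents, n_reactants):
    """Number of reagent/reactant pairs an enumerator would have to try."""
    return n_reagents * n_reactants


def segments(items, size):
    """Yield successive batches of at most `size` items (pre-segmentation)."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical: every compound tried against every other compound.
unfiltered = pair_count(38_000_000, 38_000_000)

# Hypothetical counts left after pre-filtering both sides in the database.
filtered = pair_count(20_000, 500_000)

print(f"unfiltered: {unfiltered:.3e} pairs")
print(f"filtered:   {filtered:.3e} pairs")
print(f"reduction:  {unfiltered // filtered:,}x")
```

Whatever the real counts were, the point is that the pair count is a product, so trimming either list in the database before enumeration reduces the work multiplicatively.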
Fast similarity searching: making the virtual real
Similarity searching is a key component of many cheminformatics processes including compound collection design, compound clustering and lead hopping. GSK has command-line tools for bespoke analysis and cartridge-based systems for similarity searching standard compound sets, but would like to use these applications in a more interactive manner whilst compounds are being designed. Several years ago ChemAxon approached GSK about the MadFast prototype and GSK has been collaborating to develop the application. Stephen Pickett explained why speed is so crucial in similarity searching. GSK clusters the whole screening collection of more than two million compounds every weekend. Typically, clustering can take many hours or days, but the process is highly parallelisable. Similarity and clustering is the rate-limiting step in the compound acquisition process. There is also a need to search very large libraries of available compounds (e.g., the 100 million compounds in ZINC) in real time. There are various ways of making similarity search go faster; for comparison in a benchmarking exercise with MadFast, Stephen used the FFSS method of Sunny Hung (GSK). MadFast features efficient pre-computation of target fingerprint storage; very fast querying via command line; and a Web server to provide a REST API. It fits with GSK’s current infrastructure. In a benchmark to find the most similar compound to a set of 790,000 (typically used for closest cluster centroid), MadFast showed a threefold improvement on FFSS in single thread mode. For “All versus All with a Tanimoto coefficient > 0.85” (typically used for sphere exclusion clustering), MadFast requires more optimisation, and has to be multithreaded, if it is to improve on FFSS. MadFast is being deployed at GSK to provide interactive searching of more than 30 million compounds as part of Schrödinger’s LiveDesign. 
Stephen concluded that MadFast provides a great solution for interactive searching: it offers multiple data sources, fingerprint options, and metric options, and is integrated with LiveDesign, but command-line use for clustering applications requires more optimisation. Perhaps speeding up substructure searching might be possible in future.
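For readers unfamiliar with the two benchmark tasks, the following Python sketch shows what “most similar compound” and “sphere exclusion with a Tanimoto threshold” mean in principle. It is a deliberately naive illustration, not the MadFast or FFSS implementation: fingerprints are assumed to be plain integers used as bitmasks, and the clustering is a simplified greedy variant that takes compounds in input order.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as integer bitmasks."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints: treat as dissimilar
    return bin(a & b).count("1") / union


def most_similar(query, targets):
    """Return (index, score) of the target fingerprint closest to the query."""
    scores = [tanimoto(query, t) for t in targets]
    best = max(range(len(targets)), key=scores.__getitem__)
    return best, scores[best]


def sphere_exclusion(fps, threshold=0.85):
    """Simplified greedy sphere exclusion.

    The first remaining compound becomes a centroid; every remaining
    compound with similarity above the threshold joins its cluster and
    is removed from further consideration.
    """
    remaining = list(range(len(fps)))
    clusters = []
    while remaining:
        centre = remaining[0]
        members = [centre] + [
            i for i in remaining[1:] if tanimoto(fps[centre], fps[i]) > threshold
        ]
        clusters.append((centre, members))
        taken = set(members)
        remaining = [i for i in remaining if i not in taken]
    return clusters
```

Both functions compare every query against every target, which is why the all-versus-all case is so expensive on millions of compounds, and why the independent comparisons parallelise well.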
BioScity: interfacing Biomolecule Toolkit with lab reality
Daily laboratory activities such as data analysis, material management, sample registration and biodata reporting demand access to information, and access in a way that follows proper procedures. In reality, this is a function of time, money, diversity and flexibility. It is therefore important to integrate the best possible tools to facilitate everyone’s work. Dominique Besson of Texelia said that integrating ChemAxon’s Biomolecule Toolkit into a tool such as BioScity would serve many of the needs of scientists handling macromolecules. BioScity is a Web-based translational information component allowing entity information management, biodata registration and visualisation, querying and reporting, and centralised project information. It is an intuitive, out-of-the-box and customisable solution. It has a three-layer architecture. The surface layer includes ELN integration, and GUIs for registration and use of partner software. The middle layer comprises SprinGO (Texelia’s proprietary core management system and data structure) and the ChemAxon Biomolecule Toolkit API and functionalities. In the bottom layer is a MySQL or Oracle database. Biomolecule Toolkit integration in BioScity adds scientific analysis of macromolecules; populates a database of laboratory knowledge; allows retrieval of valuable information; and opens doors to novel development.
Create and Design in a Collaborative Environment
IDBS and ChemAxon: a 1:1 stoichiometry
Ian Peirson of IDBS spoke about the partnership between IDBS and ChemAxon. Although IDBS had developed their own cheminformatics capabilities for their chemistry ELN in the E-WorkBook desktop client, those capabilities did not meet future requirements. IDBS have been working on moving ELN functionality from the desktop to the Web, because of the increase in external research collaborations, and because the cloud is seen as critical to support new start-ups, who may not have IT services to support a local installation. As part of IDBS’ strategy, new technology was required for the representation of chemical structures on the Web, and IDBS considered a build, buy or partnership approach. The E-WorkBook desktop chemistry notebook incorporated ChemAxon’s Reactor and Calculators, and hence IDBS decided to build on its existing partnership with ChemAxon, because ChemAxon are the world leaders in chemical and biological molecule representation and their technology provided the best platform for developing the next-generation chemistry ELN.
Roaming through natural products space with JChem and KNIME
The Nestlé Institute of Health Sciences (NIHS) is exploring the link between nutrition and health at the molecular level and is looking for chemical substances occurring in nature that are pharmacologically active. So, NIHS is establishing a screening library based on natural products. Jasna Klicic Badoux showed how the chemical space of natural products differs from that of typical druglike molecules and discussed different strategies to select natural products subsets for screening in biochemical assays. Compared with druglike molecules, natural products have greater property variability, more hydrogen bond donors and acceptors, fewer aromatic atoms, more nitrogen and oxygen atoms per molecule, and more chiral atoms per molecule. About 26% of them fail the Rule of Five. Druglike molecules have broader Bemis-Murcko scaffold diversity than natural products. The molecular and physicochemical properties of compounds in the NIHS sample library are very similar to the averages in the Dictionary of Natural Products (DNP) database but there are relatively more scaffold singletons and there is higher scaffold diversity for NIHS. Jasna demonstrated the powerful combination of JChem and KNIME when it comes to manipulations and analysis of molecular libraries. The KNIME nodes used most for this application are Standardizer, chemical terms (used for the scaffolds and properties) and N2S. Jasna described the selection of 15% of the NIHS collection, a subset of representative and “best chance” compounds. These molecules are predicted to be bioavailable and are likelier to have a good binding energy, and they represent diverse natural product chemical classes. Rule of Five filters are used for bioavailability. “Best chance” filters for good binding energy ensure that the number of aromatic rings is less than five, that there are no more than 90% of aromatic (heavy) atoms, and that there are at least two hydrogen bond donors or acceptors. 
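The two filters can be written down as simple predicates. The Python sketch below assumes that the descriptors have already been calculated (for example with chemical terms) and makes no ChemAxon or KNIME calls. The Rule of Five thresholds are the standard Lipinski ones; tolerating one violation, and reading “at least two hydrogen bond donors or acceptors” as donors plus acceptors, are assumptions, because the talk as reported did not give those details.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed properties of one molecule (field names are illustrative)."""
    mol_weight: float
    logp: float
    hbd: int                   # hydrogen bond donors
    hba: int                   # hydrogen bond acceptors
    aromatic_rings: int
    aromatic_heavy_atoms: int
    heavy_atoms: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count violated Lipinski criteria."""
    return sum([d.mol_weight > 500, d.logp > 5, d.hbd > 5, d.hba > 10])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Bioavailability filter; the tolerated violation count is an assumption."""
    return rule_of_five_violations(d) <= max_violations


def passes_best_chance(d: Descriptors) -> bool:
    """'Best chance' filter as described in the talk."""
    fraction = d.aromatic_heavy_atoms / d.heavy_atoms if d.heavy_atoms else 0.0
    return (
        d.aromatic_rings < 5
        and fraction <= 0.90
        and (d.hbd + d.hba) >= 2  # assumed reading: donors plus acceptors
    )


# Hypothetical descriptor values, for illustration only.
example = Descriptors(302.2, 1.5, 5, 7, 2, 12, 22)
print(passes_rule_of_five(example), passes_best_chance(example))
```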
After both filters had been applied, natural product chemical classes were assigned and a final selection of 25% was made based on chemical diversity. A central nervous system multi-parameter optimisation (CNS-MPO) score was also implemented using JChem KNIME nodes to make a subset selection for a CNS project. In this case, Jasna made sure to include metabolites from selected plants and their analogues that were available in the NIHS library. This was yet another application of JChem KNIME nodes. JChem and KNIME are the perfect marriage.
Marvin Live to facilitate structure-based drug design
How to make better molecules faster is a key challenge in preclinical drug discovery. Sprint Bioscience use fragment-based techniques to find high-quality chemical starting points quickly, and then they use structure-based drug design to optimise the compounds. The company has computational chemists (including the speaker, Jenny Viklund) and medicinal chemists (including Fredrik Rahm, the other speaker) but no IT group. Sprint Bioscience chose to use Marvin Live to streamline collaborative workflow in the medicinal chemistry team, from a design idea to a target molecule ready to be synthesised. Medicinal chemists can check out their ideas without asking Jenny to help. The idea might be a potential scaffold hop. Marvin Live is then used to calculate physicochemical properties, to align two structures, and to see how the proposed molecule will fit in the binding pocket of the protein structure from PDB. Ideas can be saved as SDfiles or PowerPoint slides, for example, ready for reporting and analysis. New ideas are then generated and the chemist decides which compounds to make next. Jenny made three suggestions for enhancements to Marvin Live: better alignments; use of several alignments simultaneously; and surface and all-atom rendering of the protein.
The way to make better ideas, faster, is to give everyone access to the 3D world; to check ideas and generate more of them; and to have improved workflow when sharing, documenting and evaluating ideas.
ChemAxon tools at Gedeon Richter
Using Plexus Design to create chemical space for structure-based drug design
To make better molecules faster, Sprint Bioscience does fragment-based drug design. Jenny Viklund decides where to expand the fragment; creates virtual libraries; evaluates the libraries in computational models (using Schrödinger software, physicochemical filters etc.); and selects the best compounds to make, in agreement with the synthetic chemist. It helps if you have lots of ideas to choose from and that means that the software must be fast. Sprint Bioscience also wanted to improve workflow. Jenny said that in library design the Schrödinger enumerator is not as fast as ChemAxon’s or as easy to use. Fredrik Rahm ran through two use cases for Plexus Design. In the first the chemist has a somewhat loose idea about the compound to be made but the synthetic route and reagents are not known. Here scaffold enumeration is used and it has the “wow factor”: it is very easy to use, gives increased diversity, and produces results in SDfile format. In the second use case, the chemist wants to expand a structure in the selectivity pocket and knows which reaction and reagents to use. Here reaction-based enumeration is employed; in Plexus Design it is easy to use large R-sets from databases, pre-defined reactions are available, and results are produced in SDfile format. Jenny said that Plexus Design is easy to use for both computational and synthetic chemists; “both pairs of eyes” can create the ideas. Putting more structures into the computer models will give better molecules (if the models are good). Plexus Design is fast in enumeration, filtering, and creating the SDfiles that the computational chemist wants. Sprint Bioscience does these exercises often and even small savings in time make a big difference. Boredom is also reduced. After design with Plexus Design, Schrödinger tools such as docking are great for weeding out the best compounds. Suggested enhancements for Plexus Design are filtering on the fly, improved spread-sheeting, stripping salts, and naming products. 
With Plexus Design, Sprint Bioscience can make the optimum combination of a human’s brain and a computer’s power.
Data Management, Characterisation, Analysis and Report
Instant JChem as a basis to construct a customised database to store biological activity data
Stefano Crosignani of iTeos Therapeutics said that his company is a small biotech with no research informatics team and limited IT support. They needed a cost-effective but high-quality solution for data handling and reporting, including an ELN, and a database to store all chemical and biological data. The system also needed to be compatible with Mac computers. They decided to use IJC for the database, and to work with ChemAxon to customise the tools. The database has both a single compound view (with easy access to all the data for one molecule) and a list view (to see selected types of data across a list of compounds), and it is searchable by both data and chemical structure. Each view can be easily adapted to show the necessary data. Data aggregation (for example, averaging of data from the same experiment) is automatically performed according to predetermined rules for each type of biological data (for example, IC50 values are averaged, but PK results are not). iTeos used Excel files to load the biological data and SDfiles for loading chemical compounds. The iTeos ELN number and a date are used as the identifier for an experiment. To keep costs down, the ELN was not incorporated in compound view but each experiment displayed in the table links to an ELN page. A SAR table can be exported as an SDfile (with independent selection of the fields to be exported). Further IJC functions, such as conditional formatting and list management, are available. Because of its simplicity, versatility and easy adaptability to different needs, IJC is very well suited to handling biological data in smaller biotech companies. It can accommodate different types of data with different rules. It is easy to maintain, and extensible by addition of projects and SOPs, without resort to ChemAxon help, although creating a new “mask” (i.e., a new type of data which requires different data fields and different averaging rules) can only be done by the ChemAxon team.
iTeos will next extend the system to biologicals, using ChemAxon’s Biomolecule Toolkit.
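The idea of per-type aggregation rules, as described for the iTeos database, can be illustrated with a minimal Python sketch. This is not how IJC implements it; the record layout, the rule table and the identifiers are hypothetical, and the use of an arithmetic mean for IC50 values is an assumption (the talk as reported said only that they are “averaged”).

```python
from collections import defaultdict
from statistics import mean

# One rule per data type; None means "keep the individual results".
AGGREGATION_RULES = {
    "IC50": mean,  # assumption: arithmetic mean
    "PK": None,
}


def aggregate(records):
    """Group results by (compound, data type, experiment) and apply the rule.

    `records` is an iterable of (compound, data_type, experiment, value).
    Data types without a rule are left unaggregated.
    """
    grouped = defaultdict(list)
    for compound, data_type, experiment, value in records:
        grouped[(compound, data_type, experiment)].append(value)

    result = {}
    for key, values in grouped.items():
        rule = AGGREGATION_RULES.get(key[1])
        result[key] = rule(values) if rule else list(values)
    return result


# Hypothetical records, for illustration only.
records = [
    ("CPD-1", "IC50", "ELN-001", 10.0),
    ("CPD-1", "IC50", "ELN-001", 20.0),
    ("CPD-1", "PK", "ELN-002", 3.2),
    ("CPD-1", "PK", "ELN-002", 4.1),
]
for key, value in aggregate(records).items():
    print(key, value)
```

Keeping the rules in a single table mirrors the point made in the talk: a new kind of data needs its own fields and its own averaging rule, which is what a new “mask” provides.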
Instant JChem in virtual high throughput screening and commercial compound selection
An in-house database of commercial compounds can be used for streamlining virtual high throughput screening (vHTS) campaigns including filtering for druglike compounds, ordering compounds from different vendors, and storing the history of small molecules purchased during project development. Elżbieta (“Ela”) Plesnar reported that Selvita has developed, and updates quarterly, an in-house database of 7 million compounds commercially available from 10 vendors. Selvita has many reasons for wanting to maintain the database in-house. Vendors’ own systems often have inadequate substructure search features. Queries and result lists need to be stored and shared in-house at Selvita, as does information about order status and procurement. Database update has to be possible. Not all vendors are equally reliable. Additional comments and annotation may be needed. Above all, Selvita do not want to share their ideas, and information about what they are looking for. The in-house database is in MySQL with an IJC front-end. JChem and ChemAxon plugins are extensively used at Selvita. Groovy scripts are used to export data, update tables, and get information from tables. Swing GUIs are used. The Excel files required by the procurement team can be prepared, or an SDfile can be output for use as input to another database. For updates, vendor SDfiles are read in and standardised. The main limitation is in searches on an IJC client side. Large amounts of data are sent over the network, consuming so much RAM that users often cannot get access to the database.
Rollout of Plexus Connect at GSK: a brief history and update
IJC is a thick client that has been delivered via a desktop application at GSK since 2011. The client uses direct connections to the GSK database infrastructure to provide access to registration, biological data and structure searches. Complex data joins because of the GSK database infrastructure, and access from sites remote to the data, made connection and query slow for many users; Citrix was necessary, but not popular. GSK have worked closely with ChemAxon to develop Plexus Connect as a new Web interface client to replace IJC. Richard Bolton said that GSK hopes that Plexus Connect will deliver not just performance enhancements but also a more unified chemistry desktop experience, fewer application conflicts and dependencies, and easier global updates. “IJC Web” wave 1 was initiated in August 2015 to deliver the five most used IJC projects via Plexus Connect. Opening a project over Citrix used to take 2-5 minutes; with Plexus Connect it takes less than 30 seconds. Feedback from users has been generally positive, although there have been complaints about some missing features. ChemAxon has developed a tool for migration of forms for use in wave 2, and feedback from a usability session has been fed into wave 2 for layout, workflow, and functionality optimisation. Migration of 63 projects and 124 forms will take place in “mini-waves”. Once ChemAxon has added similarity search, keyboard commands, grid view, export list selection, list logic, and export of child tables, 15 more projects will be uploaded in June 2016. Migration of a reaction database is planned for the end of July. Form editing on the Web and JChem for Excel should be delivered later in 2016. Scalability will be a challenge; now that all forms are loaded, load time has increased. The thick client cannot be decommissioned until the end of 2017 as it will be needed to create forms and maintain complex databases.
Apache Hadoop HDFS (a distributed Java-based file system for storing large volumes of data) will be rolled out early in 2017 and it is not clear how substructure search will run over HDFS. Other issues are external partners and security, and connection of Web Services to forms. There are many future challenges as the infrastructure and business model changes, but despite a very complicated architecture at GSK, progress is being made.
There were about 20 presentations from ChemAxon, most of them dealing with enhancements to existing products. Here I mention only the newly released tools. ChemLocator is a new ChemAxon product: a Web-based search tool that allows users to discover the hidden chemical knowledge in documents, regardless of whether they are located on a local computer, on a network share or in the Cloud. József Dávid advised us that unstructured data is growing three times faster than structured data. JChem for SharePoint allows chemical search of unstructured text, but small companies are less likely to use SharePoint. Data often sit on drives on local machines or in Dropbox or Google folders; ChemLocator finds chemistry lost in such places. György Pirok described Plexus Analysis, a new tool for analysis and visualisation that was released to coincide with the annual user meeting. Plexus Analysis supports histograms and multidimensional scatter plots, which can uncover new patterns and correlations in data. The software can handle millions of records efficiently. It is Web-based, and easy to deploy and use. It currently has a limited number of features; there are no export features and no filters yet but these are in the pipeline. Version 1.0 of the revamped Compliance Checker has been released. It has RESTful Web Services and a powerful API. Ákos Papp showed its redesigned Web GUI. It is also 60 times faster: it can process 100 compounds per second on a single medium-type Amazon server (m4.xlarge), and scales almost linearly with increasing performance of the server or with an increasing number of servers. There are plans to integrate it into Plexus in future, and to introduce pay-per-view services and hosted solutions.
In the partner session there were presentations from Biochemfusion, BSSN Software, Certara, IDBS, KNIME, Linguamatics, MolPort, quattro research and Schrödinger. IDBS also gave a talk in the main program.
I was invited to ChemAxon user meetings from their inception but ignored two invitations before I was persuaded to write a report on the 2007 meeting in Budapest. From then on I have written every year about a ChemAxon meeting either in Hungary or the United States. In the beginning ChemAxon needed me (in the form of publicity); the wheel has now turned full circle and I need ChemAxon! It is impossible to keep up to date with the cheminformatics market unless you follow progress at ChemAxon; I need to be at the user meetings. I do miss the “state of the nation” address that used to conclude the meeting, but Csizi did present a corporate viewpoint to me personally. ChemAxon is moving more towards meeting the requirements of the smaller user companies, but still wants to keep the big customers happy. More and more functionality is being made available in Web-based tools, and there is an increased focus on large molecules. All in all, I perceive that ChemAxon is yet further along the S-curve of a market leader, with many of its expected successes and challenges.