ChemAxon US User Group Meeting (UGM), San Diego, September 24-25, 2013

by Wendy Warr (Wendy Warr & Associates)

Conclusion
Introduction
Keynote
End-user applications
Sharing data with chemically enhanced Microsoft tools
Registration systems and Web Services
The ChemAxon platform for building systems and integration
Plexus
Plexus and the Plexus Suite
Partner session
Cloud-hosted systems
ChemAxon’s atom mapping algorithm
Metabolizer
Chemical mixture applications
A chemical library design tool
DeltaSoft and ChemAxon tools for screening, lead discovery, and SAR
NIH BioAssay Research Database (BARD)
Ontologies for bioassay information
Machine learning applications with JChem
Naming and text mining
Markush technology
Closing remarks

Conclusion

Between 1981 and 2007 I attended more than 50 MDL user group meetings in the US and continental Europe, and not a few UK-based ones as well. So it is hard to attend ChemAxon user meetings without making comparisons, fair or odious. Interestingly, it was 2007 when my sequence of ChemAxon meetings began. By now ChemAxon is probably the market leader in mainstream chemical structure handling, and still moving steadily up the S-curve. It is now in the realm of enterprise applications, and consultancy services for systems building and integration, but it still retains its appeal to the smaller biotechs.

In previous reports I have mentioned possible threats to the company’s future. None has materialized, but here is a new one. I was not the only one to listen to Tim Aitken’s talk on Plexus Suite and mutter under my breath “Isentris”. This cradle-to-grave vision was frightening in its implications, but Alex Drijver assures me that it will be thoroughly planned and resourced to prevent it from spiraling out of control.

ChemAxon has fingers in a lot of pies other than “mainstream chemical structure handling”: virtual screening, Markush technology, text mining, and metabolite prediction, to name but a few. Here be dragons. Your combined printer-scanner-photocopier will probably be second best at printing, copying, and scanning although the combination may have its advantages. Could ChemAxon similarly be too thinly spread? I personally think that things are under control and the business model is being carefully crafted. Not a lot is wasted on overheads and administration, and the company is still small enough to be very flexible. I remain optimistic.

ChemAxon has been profitable since inception with 30-50% growth year on year. And what a long way it has come since Alex took over five years ago! My congratulations to the company on 15 years of continuing success. I agree with Alex: I too am looking forward to the next 15 years.


Introduction

The meeting was held at the Catamaran Resort Hotel and Spa, San Diego, CA, as in 2011. About 85 users and partners gathered, plus more than 30 ChemAxon attendees. The sun sparkled on the bay and ocean, and a good time was had by all, both socially and intellectually. The day before the main UGM there were three parallel satellite meetings: an integration workshop for senior developers and IT managers; a KNIME workshop for scientists interested in KNIME workflows; and a Markush open day for patent professionals. These were followed by a one-to-one session for users to have personal meetings with ChemAxon staff. In the evening we headed to Pacific Beach for dinner, drinks and networking at Nick’s at the Beach. The next night a gala dinner was held on the Berkeley Ferry at the Maritime Museum of San Diego, with desserts and coffee on the Star of India (the world’s oldest active sailing ship and also the oldest iron-hulled merchant ship still floating). ChemAxon staff got into the spirit of things by donning pirate paraphernalia. On a related note, this year’s memory stick was parrot-shaped, in keeping with the hotel’s “pets”. Photographs from these events are available on the ChemAxon web site event gallery.


Keynote

(Presentation in our Library)

Continuing a recent tradition, the UGM proper began not with a keynote address but with Alex Drijver, CEO of ChemAxon, interviewing Mark Murcko, Senior Vice President, Strategy at Schrödinger, as follows.

Alex: Schrödinger has a long relationship with ChemAxon which goes back to the days when Schrödinger acquired Seurat.

Mark: Schrödinger has always focused on physics-based method development, but two years ago it started to add more tools to help with decision making. Rather than reinvent the wheel, the company decided to extend its relationship with ChemAxon.

Alex: ChemAxon has an “Android-like” business model: many providers can use the same platform, as you will see in the partner session later.

Mark: Collaboration is important. The Schrödinger advisory board gives the company serious advice; according to Joy’s Law: “most of the smartest people work for someone else”.

Question from the floor: Why doesn’t Schrödinger build toolkits?

Mark: LiveDesign, Schrödinger’s next generation drug design platform, is plug and play. All OpenEye tools began at Vertex where they wanted to create an open environment. You want your scientists to be able to concentrate on what your own company is uniquely good at. Heads of medchem have a pent-up dissatisfaction with their IT groups because the IT solutions do not address the real needs.

Alex: How was your relationship with your IT people?

Mark: I hired all the IT people at Vertex. The IT groups that are the most successful have people who have a strong science background; people who are not just experts on Ruby on Rails but want to solve problems for scientists.

Alex: What will hardware and software look like in three years’ time?

Mark: Allosterism, designing non-Lipinski compounds, macrocycles, polymorph prediction, undruggable targets, ADME prediction, X-ray data and P450: these are the science challenges that we face. I will need a system that anticipates my needs. It will tell me “Did you know that Lilly made something like this years ago?” Or “Your series looks as if it cuts across that patent: do you want to see it?” Pharma is using less than 1% of its information; we need to get all the information into a software package that is a joy to use. LiveDesign needs to be like Google and Facebook. Not all IT people are excited about medicine; Schrödinger recruits IT people who are keen on medicine.

End-user Applications

Much has been done at ChemAxon over the last year, and there were (justifiably) rather a lot of ChemAxon talks at the UGM. Petr Hamernik kicked off with news about accessing Instant JChem (IJC) forms from the Web. The IJC desktop client was previously not easy enough to use; the interface has now had a complete face-lift and the welcome screen features more relevant data in default views through a new dashboard. The form design mode now features a widget palette panel: available widgets are more readily accessible and widgets can be dragged and dropped into a form. In the Export Wizard, previous settings are now applied for future operations. Scripting support has been improved. Oracle Text is an extension to the core functionality of the Oracle database. Its purpose is to enhance the performance and provide some additional search features on the indexed text fields. Another focus has been integration and extensibility. Relational data support for Spotfire has been added, and the Form application programming interface (API) allows forms to be created (by experts) using scripts. JChem for Office can now load data directly from IJC using a bridge similar to the Spotfire plugin. The new WebClient interface presents IJC forms as Web pages and allows users to browse, query and export data directly from any Web browser. It preserves the functionality available in the IJC desktop client. (Presentation in our Library)

Petr was followed by Efi Hoffmann, who updated us about Marvin. MarvinSketch is great for sketching structures. It is smart (with cleaning and mapping), has built-in calculations and naming, and can be customized and integrated. Focus in the upcoming releases will include publication quality drawing, text box and image handling. On the downside, MarvinSketch uses Java, and applet loading is slow. Enter Marvin for JavaScript, a lightweight chemical editor on modern Web browser pages that does not require users to install Java. It is easy to integrate and fast to start up, has new import/export, query, display, API, and image generation features, and will soon be ready for tablets. Compound Registration, Plexus, IJC WebClient, JChem for SharePoint, and the Document to Database Web application incorporate Marvin for JavaScript.

MarvinSketch has also been enhanced. There is a new icon set and menu items have been rearranged. The drawing quality has been improved, the fitting of single bonds refined, and the gap at single and double bonds eliminated. Bond length and size of objects can be separately defined for multiple structures on a single canvas. There are new menu options for inserting and saving fragments on a canvas, and new display options for peptide cycles and bridges, and IUPAC numbers. An unlimited number of attachment points on superatom S-groups can be handled. (Presentation in our Library)

Sharing data with chemically enhanced Microsoft tools

Aurora Costache of ChemAxon reported on progress with JChem for Office. JChem for Excel 6.0 has many new features. A basic search options category has been added (in addition to chemistry search options). Two ribbon menus present functionalities according to user: chemists (standard) and modelers (advanced). User interfaces for the R-group decomposition workflow, and for filter and database search have been improved. Descriptions of functions and online help have been added. Corporate IDs can be copied and pasted and there is a one-click structure-activity relationship (SAR) table. JChem for Excel has now been rolled into the new product, JChem for Office, which enables live chemical structures across most common Microsoft Office files. Users can add and edit structures, open structures from files, convert text to structures, convert structures to SMILES strings, calculate properties and fragments, filter data, perform chemical searches, and copy and paste structures and data among Microsoft Office applications. (Presentation in our Library)

Microsoft SharePoint is a widely used Web-based collaboration platform; ChemAxon has added chemical intelligence to it by integrating JChem and Marvin. In his presentation Attila Szabó used the so-called feature circle that Microsoft usually uses to explain SharePoint’s capabilities: sites, communities, content, search, insights and composites. SharePoint sites can be enriched with chemical structures using MarvinSketch or certain third-party structure editors. SharePoint has communities such as calendars, blogs, discussion boards or internal Wiki libraries. ChemAxon has enabled structure drawing on discussion boards, providing a useful forum for synthetic chemists discussing reaction pathways, for example. Chemical intelligence has also been added to SharePoint content: chemists can now use the solution as a document repository, where they can share and edit Office documents with their colleagues. Structure files and JChem for Excel workbooks can be imported and exported and viewed as SharePoint lists. Calculator plugins are also integrated. Since SharePoint has a fine grained security administration, this could be a way of sharing structures to be synthesized by a contract research organization (CRO). The JChem for SharePoint Search component is an extension to SharePoint search allowing you to find structures in SharePoint content. JChem for SharePoint is not of interest to pharma alone: DuPont is about to roll it out into production. (Presentation in our Library)

Registration systems and Web Services

A group of ChemAxon talks dealt with chemical registration systems. István Rábel and Zsolt Mohácsi explained how to keep the garbage out of your compound database and store everything in a standard way. ChemAxon’s interactive Structure Checker detects and fixes issues and Standardizer yields a customized, canonical representation. JChem 6.1 contains 58 checkers and 38 standardizer actions. In addition, you can write your own checkers and standardizer actions and share them across your organization. (Presentation in our Library)
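
The check-then-fix pattern described above can be pictured with a toy pipeline. The sketch below is not ChemAxon's Structure Checker or Standardizer API: the function names, the registry, and the two string-level checks on SMILES are all assumptions made for illustration, and a real checker would work on parsed structures rather than on text.

```python
from typing import Callable, List, Optional, Tuple

# A checker inspects a SMILES string and returns an issue label or None.
Checker = Callable[[str], Optional[str]]
# A fixer takes a SMILES string and returns a repaired one.
Fixer = Callable[[str], str]


def check_empty(smiles: str) -> Optional[str]:
    return "empty structure" if not smiles.strip() else None


def check_multi_fragment(smiles: str) -> Optional[str]:
    # In SMILES, '.' separates disconnected fragments (e.g. a salt).
    return "multiple fragments" if "." in smiles else None


def keep_largest_fragment(smiles: str) -> str:
    # Crude stand-in for salt stripping: keep the longest fragment string.
    return max(smiles.split("."), key=len)


# Hypothetical registry pairing each checker with an optional fixer.
PIPELINE: List[Tuple[Checker, Optional[Fixer]]] = [
    (check_empty, None),
    (check_multi_fragment, keep_largest_fragment),
]


def check_and_fix(smiles: str) -> Tuple[str, List[str]]:
    """Run every checker in order; apply its fixer when one is registered."""
    issues: List[str] = []
    for checker, fixer in PIPELINE:
        issue = checker(smiles)
        if issue is None:
            continue
        issues.append(issue)
        if fixer is not None:
            smiles = fixer(smiles)
    return smiles, issues
```

Keeping checkers and fixers as plain functions in a list is one simple way to let a site add its own rules, which is the kind of extensibility the talk described; how ChemAxon implements it is not covered here.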

Compound Registration is no longer a custom solution; it is a standard product, built on a set of Web Services. Ákos Papp gave a demonstration. Mandatory and optional input fields are now easily configurable, and input data type and external content validation rules can be attached. The structure editor can be configured within the Web client; ChemDraw is now supported. New features in version 6.1 include salt and solvate bulk import; a customizable search page; configurable lot ID; project-based access control; and Marvin for JavaScript integration. (Presentation in our Library)

It is not only “chemicals” that need to be registered; ChemAxon is also working on registration of biological products. The Hierarchical Editing Language for Macromolecules (HELM) is an emerging notation standard created by Pfizer scientists for the representation of large biomolecular entities (Zhang, T.; Li, H.; Xi, H.; Stanton, R. V.; Rotstein, S. H. HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation. J. Chem. Inf. Model. 2012, 52, 2796–2806). It has recently been made freely available by the Pistoia Alliance. It expresses biologics in a concise format on various abstraction levels, from exposing the modularity of multi-component entities, to sequence representation of natural and un-natural residues, down to exact atomic description of post-translational modifications and other chemical features. Roland Knispel demonstrated an early version of a Web Service based toolkit, implementing simple structure-driven registration of large biomolecules by integrating the Open Source HELM tools.
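
To give a feel for the sequence level of the notation, here is a minimal sketch that reads only the first section of a HELM string, assuming the common layout in which the simple polymers come first, are separated by `|`, and are followed by `$`-delimited sections for connections and annotations. The function name and the example strings are illustrative, the sketch ignores RNA-style monomers and inline chemistry, and it is not part of the open source HELM toolkit.

```python
import re
from typing import Dict, List

# Matches one simple polymer such as "PEPTIDE1{A.G.[dF]}".
_POLYMER = re.compile(r"^([A-Z]+\d+)\{(.*)\}$")


def parse_simple_polymers(helm: str) -> Dict[str, List[str]]:
    """Return {polymer_id: [monomer, ...]} from the first HELM section."""
    first_section = helm.split("$", 1)[0]
    polymers: Dict[str, List[str]] = {}
    for chunk in first_section.split("|"):
        match = _POLYMER.match(chunk.strip())
        if match is None:
            raise ValueError(f"unrecognized simple polymer: {chunk!r}")
        polymer_id, body = match.groups()
        # Monomers are dot-separated; multi-character ones sit in brackets.
        polymers[polymer_id] = [m.strip("[]") for m in body.split(".")]
    return polymers
```

For example, `parse_simple_polymers("PEPTIDE1{A.G.[dF]}|PEPTIDE2{C.K}$$$$")` yields two peptide chains keyed by their polymer IDs; the connection and annotation sections after the first `$` are deliberately left unparsed.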

Roland envisioned a complete registration system, with business logic and data relationship management for small molecules, macromolecules, biomaterials, and processes, fed by electronic laboratory notebook (ELN), LIMS, browser client, and data providers. In the long term, structures will be handled by HELM and BioSketcher; there will be an agnostic data model and synonym tables and ontologies for annotation; and business logic for all four sorts of registration will be integrated. In the short term (2014) ChemAxon will offer a biomolecule toolkit with HELM, a HELM editor, and the data model for annotation. (Presentation in our Library)

The Web Services on which Registration and other systems are built, have also been enhanced. In JChem Web Services 6.0, to complement existing Simple Object Access Protocol (SOAP) Web Services, there are new representational state transfer (REST) Web Services with basic and administrative JChem Base functions, image and file format conversion, naming, and Markush enumeration. András Strácz discussed the driving factors of ease-of-use, efficient integration, and high productivity from any programming language, including C/C++, Python, and JavaScript for any platform, including PC, Mac, Android and iOS. (Presentation in our Library)

The ChemAxon platform for building systems and integration

In the above section I included work on building registration systems; now for two other talks related to building and integrating different sorts of systems. CRAIS Checker was developed, and is currently maintained, by the Japanese company Patcore in close collaboration with ChemAxon. (I have been unable to discover whether “CRAIS” is an acronym.) Aurora Costache of ChemAxon explained how the software uses superstructure search to check whether the compounds you want to synthesize, purchase, import or export fall under controlled substances regulations. Laws and regulations are regularly updated and stored in a database as part of the complete solution. The system ensures compliance with legislation in multiple countries. The CRAIS Checker Server is accessed via a Web client, a Windows client, or a SOAP or command line interface. The Pistoia Alliance has chosen two vendors, Patcore with ChemAxon, and Scitegrity, to develop two separate instances of its Controlled Substance Compliance Services (CSCS). The ChemAxon solution uses CRAIS Checker. (Presentation in our Library)

Tim Dudgeon and Erin Bolstad of ChemAxon provide consultancy to support the adoption of ChemAxon products. Typically they build new applications, extend the functionality of existing systems, provide migration and integration support, and perform customized end-user and developer training. Examples of custom application development are DuPont’s document structure search (see Mark Andrews’ talk below); the Novartis reactions database (see Greg Landrum’s talk at the 2012 European user group meeting); the Document to Database demonstration; and the European Lead Factory portal that will go live in January 2014. Examples of customized development of products are compound registration (for GSK), Plexus (for Eli Lilly, see below), IJC (for GSK and BMS), structure checkers (for Boehringer Ingelheim) and Marvin (for Elsevier and Pearson). Existing software has been extended for compound and assay registration and reporting in IJC, customized structure databases in IJC, Markush and structure search, and KNIME workflows. The consultants have assisted a number of pharmaceutical companies in replacing an old cheminformatics platform (ISIS/Base, ISIS/Host, and Accelrys Direct) with ChemAxon tools. Migration and integration support have included database redesign, data cleaning, data migration, system design, chemical business rules, workflow review, integration with other software, and training. (Presentation in our Library)


Plexus

Plexus is a new Web application for chemists, featuring dynamic data visualization, with a user-friendly, simple interface, offering a number of ChemAxon functionalities. András Strácz of ChemAxon gave us an introduction. The current focus is on compound library design and the tools supporting this: database management, chemical searching, library enumeration, physicochemical property calculation, virtual synthesis and similarity searching. Plexus currently covers Document-to-Structure, JChem Base, Standardizer, Markush enumeration, Calculator plugins, Screen, the Marvin applet, and Marvin for JavaScript. (Presentation in our Library)

ChemAxon collaborated with Eli Lilly in developing Plexus. Daniel Robertson of Eli Lilly put this in perspective. Lilly’s number one priority is increasing the flow of innovative therapies that make a real difference for patients across multiple disease indications. Partnering and collaborating are part of the company’s heritage and have been a key element of its strategy for more than a decade. Successful partnerships match Lilly’s ongoing pursuit of innovation. Lilly Research Laboratories is exploring innovative approaches to improve the success rates of future drug discovery and development.

There is a diversity of strategies for small molecule drug discovery and there are many approaches to molecule identification; Lilly does not have all the talent and intelligence in-house to use them all so it has built an ecosystem incorporating external scientific talent in the form of partnerships, discovery collaborations and in-licensing. In the Open Innovation Drug Discovery Program (OIDD) Lilly makes an in-kind contribution of screening data to external investigators, who also have an option to contribute to the Lilly TB Drug Discovery Initiative. Both phenotypic and target-based screening assays are offered, through the Phenotypic Drug Discovery (PD2) and Target Drug Discovery (TargetD2) initiatives.

An academic or biotech-based investigator submits chemical structures confidentially through a Web-based interface to be selected by an automated cheminformatics filter. Subsequently, Lilly carries out in vitro assays using physical samples submitted by the users, evaluates the results, and then returns all generated in vitro data via the users’ individual web-based accounts, along with a business decision regarding each compound’s potential for follow up. Lilly has first right of negotiated access or collaboration for promising molecules, but otherwise the investigator is free to publish the results. The Web system also allows investigators to keep track of the samples they have submitted.

Security is essential. At the Lilly secured data center, all structural information (SMILES, molfiles, images, fingerprints) is stored in an encrypted form. Under the control of the investigator, only an encoded compound fingerprint eventually crosses the firewall for processing on the cheminformatics server on the Lilly internal network. After analysis, all traces of fingerprints are removed from Lilly internal servers.

The OIDD program has recently been expanded. Peptide molecules are now accepted; the number of screens has been increased; and computational tools to aid external investigators in compound design and selection have been added via the Plexus software. In 2012 investigators were sent a report with calculated physicochemical properties for compounds they submitted. Now they have access to interactive physical property calculation tools that Lilly has developed in partnership with ChemAxon: the Plexus software offers chemical visualization functions to aid in compound design. Next year, structural hypotheses will be generated using Lilly tools; Lilly will provide users with in silico scoring data from predictive models associated with target-based projects. Consistent with the foundation of the OIDD program, confidentiality and security are essential elements of the incorporation of the Plexus software. Thus, OIDD participants using Plexus are guaranteed the confidentiality of all structural information generated during modeling sessions. Temporary modeling and design results are accessible only to each individual user during the session: they are not passed through the firewall to the Lilly internal network and they are not stored.

From the business point of view, OIDD collaborations are flexible, and span a wide range of legal structures to match the partner’s specific objectives. Collaborations may be molecule-based or biology-based. By design, collaborations are of short duration, with relatively higher risk and lower cost, to minimize overheads and facilitate negotiations. Progress is evaluated periodically, usually with one or two years of exclusivity for Lilly, with a proviso to negotiate in good faith at the conclusion of that period. It is necessary to allow sufficient time to conduct research to determine whether the opportunity merits further investment.

Dan listed four examples. The National Institutes of Health (NIH) have signed a joint research agreement for profiling more than 3,000 drugs in phenotypic models, to support drug repositioning efforts. All generated data will be made public. The University of Valencia, Spain signed a one-year research agreement in 2011 to develop synthetic methodology and produce fluorinated molecules with potential use in oncology research. UC Irvine (UCI) had a two-year agreement to support a postdoctoral fellow at UCI working to optimize molecules and assess potential use in diabetes. In 2010 the University of Notre Dame signed a one-year agreement for a joint study evaluating molecules with therapeutic potential in oncology.

There has been a steady increase in new affiliations and accounts created each month over the life of the program; over 340 affiliations in 31 countries have been created. Universities make up 62% of partners, biotechs 26%. Some 43% of partners are in the United States and 37% in Europe. Approximately 40% of virtual compounds submitted are accepted for biological screening and about 600 new samples are registered each month. Two thirds of active users are repeat submitters, and one third of active users have submitted three or more times. OIDD has also received favorable reports in the media. (Presentation in our Library)

The Plexus Suite

(Presentation in our Library)

Plexus as used at Lilly is only the first step in a significant new product development. Tim Aitken of ChemAxon explained the vision. ChemAxon has carried out a major survey recently of customer problems and “pain points”. Tim Aitken said that eight major common themes had been identified:

  • Current applications are too complex and non-intuitive
  • It is difficult to access data from multiple disparate sources
  • Project-relevant information is lost, locked away in e-mails, PowerPoint slides etc.
  • Collaboration between companies and users is generally difficult
  • Data sharing is difficult: reporting tools are not standardized, and export to PowerPoint is often problematic
  • ELNs are difficult to search and maintain
  • Access to 3D tools is piecemeal, overwhelming, and underutilized
  • SAR tools generally have a complex and non-intuitive nature

ChemAxon’s goal is therefore to develop a single, easy to use, and very intuitive user interface, with easy access to data and analysis techniques, enabling scientists to access data from a variety of disparate sources, and generate and share reports easily. The first offering will focus on medicinal chemistry and library design: data visualization, enumeration, property calculations, searching, calculation and reporting. Following versions will focus on other user roles; integration with external data sources (ELN, assay databases, Reaxys, etc.); collaboration; and added functionality (reactions, analytical, and ontology support).

Plexus Suite will be a Web-based common user interface for drug discovery data visualization, analysis and project-level data management, merging IJC, WebClient and Plexus into one framework. It will reuse common components (charting, calculation, reporting, and security) and expose ChemAxon (and third-party) tools via role-specific modules such as synthetic chemistry (stoichiometry tables, retrosynthetic analysis, access to content in SciFinder, Reaxys, and Thomson Reuters and Wiley databases, etc.) and library design. It will not be a replacement for the IJC client: IJC provides tools for power users, whereas Plexus Suite will be a more basic tool exposing powerful, role-based functionality in a simpler user interface. Tim gave us a glimpse of the prototype: the first release is planned for the first half of 2014. Tim’s architectural diagram is worth reproducing here.

Partner Session

Nóra Lapusnyik introduced the partner session, emphasizing that ChemAxon views competition as an opportunity not as a threat. The company has more than 50 partners. At this meeting there were presentations from:

Cloud-hosted systems

In addition to his presentation in the partner session, Steve Muskal, CEO of Eidogen-Sertanty, gave a talk describing how ChemAxon’s technology inspired and enabled his company’s liberation into the cloud. He asked those present to raise a hand if they made use of “the cloud”. The number of hands raised increased once he pointed out that users of Gmail, Dropbox and iPhones or iPads are benefiting from the cloud.

In January 2008, Amazon announced that Amazon Web Services (AWS) consumed more bandwidth than its entire global network of retail sites. The pharmaceutical industry is increasingly pursuing outsourcing strategies and a 2010 report showed that cloud computing is rapidly growing in importance as life science R&D organizations are deluged with data from multiple sources. Spending on public and private IT cloud services will generate nearly 14 million jobs worldwide from 2011 to 2015, according to a 2012 study by IDC, funded by Microsoft.

Eidogen-Sertanty has gradually migrated to the cloud. In 2003-2005 the company had more than 120 servers in three different server rooms. In 2005-2010 it had one colocation center and five caged racks. In 2010-2012 it had one colocation center with three racks plus Amazon Elastic Compute Cloud (EC2) Relational Database Service (RDS) cloud instances for mobile apps and migration. In 2012-2013, EC2/RDS cloud instances were the primary infrastructure with two legacy racks for “fallback”.

Many of us have legacy systems that cannot readily migrate into the cloud. A registry system that is more than a few years old has, by definition, legacy dependencies, and falls into the category “if it ain’t broke, don’t fix it”. Eidogen-Sertanty’s internal, distributed content curation system is one such system, previously relying on older Web-based application environments such as WebLogic, older Oracle RDBMS systems, and now outdated relational database cartridges supporting structure search and registration.

The cloud has simplified Eidogen-Sertanty’s content creation. Nine Ph.D. and M.S. chemists and biologists work in a highly standardized and controlled workflow of data entry, guided by standard operating procedures, followed by quality control and final sign-off, to produce quarterly updates to the company’s kinase and oncology knowledge bases. Relevant articles and patents to be captured are rigorously prioritized. Numerous hardware and software legacies did not make things any easier. The “fully clouded” curation workflow is significantly simpler than its predecessor. ChemAxon’s Web Services, JChem and IJC have helped pave a migration pathway into the cloud for Eidogen-Sertanty.

The cloud also opens up new mobile possibilities. Pipeline Pilot Mobile (apparently undergoing a name change to “Tasks”) is a new Pipeline Pilot component collection: an iOS application for iPhones and iPads. It allows any Pipeline Pilot PDF and HTML report to be deployed to mobile devices, and allows authoring of new Pipeline Pilot protocols to deploy dashboards, and mobile-centered tasks and actions, on mobile devices. Cloud computing offers the promise of lower cost, scalable, “utility” computing. It is everywhere, but full migration to the cloud takes time, so if you have not yet started you should start now. Mobile and cloud computing pair well: there are many new opportunities so you should start retooling. (Presentation in our Library)

Another cloud-hosted solution has been built at Chromocell, using IJC in a LIMS. Adam Idone made a presentation about it. Chromocell is a small business (now employing about 100 people) founded in 2002, based on Chromovert technology, which enables the company to use naturally occurring cells and receptors in high performance discovery platforms. In addition to the flavors and nutrition business there is research in therapeutics concentrated on analgesics and the treatment of respiratory disorders.

The company originally outsourced all chemistry support but after it expanded into natural and synthetic chemistry, its own informatics efforts started to grow in 2011. Formerly, all data were tracked in Microsoft Excel, and collaboration occurred through SharePoint and e-mail. Chemistry viewing support was limited to three licenses of non-enterprise IJC. At the beginning of 2011, Chromocell purchased ChemDraw for reaction and chemical drawing support. In the third quarter of the same year some off-the-shelf software was purchased with a limited number of modules and users to keep costs down. That solution failed on the grounds both of cost and functionality, and a year later Chromocell licensed IJC Enterprise with a set of modules available to the informatics group. There is now overwhelming support for IJC: it is customizable, configurable, supported in-house, extensible, powerful, and fast.

A system was needed in assay development, in the sensory lab, and in natural products, medicinal chemistry, and the legal and intellectual property (IP) groups. It had to serve as a chemical registration service; to track analytical, solubility, toxicity, and descriptive data on chemical structures; to keep all inventory information up to date; to link all cell and sensory data relationally at the experiment level with ability to drill down; to serve as a collaboration platform for idea sharing within medicinal chemistry; to provide access to chemistry informatics support to legal and IP teams; to support natural products data and workflow management; to automate numerous workflows; to provide a searchable and secure data store; and to keep costs low. The system is hosted on AWS, using RDS to host IJC schemas and others, and using the Oracle 11g database. Amazon EC2 Web Servers are used to host the Oracle Application Express (APEX) application server, to host the IJC project configuration and shared URL server, as an application server to host toxicity calculators and other custom applications, and soon as an IJC application server for the Web client. All this is wrapped behind an Amazon Virtual Private Cloud with a virtual private network (VPN) gateway.

All the information is stored in four data schemas. The security policy for all schemas is “Username/password using IJC database” with required encryption. Roles are implemented to filter entities and columns from certain users; row-level security is implemented on a project basis for a handful of tables; and “delete rows” is not enabled on any table in IJC. Adam showed examples of multiple schemas, entities, and views. Chromocell had a contract with ChemAxon to generate a Groovy script to perform compound registration. The company is very satisfied with the work ChemAxon did, but the system is perhaps too simple, and Chromocell is considering buying ChemAxon’s registration system.

Use of the R-group decomposition function has been a big hit. The overlap analysis feature has also proven to be an extremely useful function within IJC. It allows Chromocell to compare large datasets and analyze the results using macros in Microsoft Excel. Values extracted from the macro are reformatted with appropriate information and their corporate IDs are used to pull structures from JChem for Excel. Overall, IJC’s role in informatics lies mainly in handling large datasets as they are transported and used by different software packages (e.g., MATLAB, R, network graphing software, GraphPad Prism, and custom software using IDBS’ XLfit) with results being transferred back into IJC. IJC also hosts a database schema of external vendor compounds, and a legal and IP schema for checking IP space, pursuing new scaffold ideas, and filing. Other uses of IJC include checking for drugs and hazardous chemicals, a flavors and olfaction database, and a natural products schema that tracks biomass. The Oracle APEX Web interface for all non-chemistry related activities can be used to register, search, and query entities. It has access to IJC data schemas in addition to its own Oracle schema and is hosted through the Apache application server.

Chromocell currently holds licenses for JChem for Excel, JKlustor, a few descriptors, and Document-to-Structure. JKlustor is automated through scripting and command line functionality for all scientists to use. Soon, Reactor will be a replacement for the company’s existing solution. Chromocell will also be acquiring the Screen suite for virtual screening of large datasets and proposed compound purchases, Markush search and enumeration software, Standardizer for all chemical registration, the JChem Cartridge backend for IJC hookup, and KNIME plugins for IJC. Adam concluded with his wish list: drop down lists from static or dynamic lists; auto-push of IJC updates through shared projects and configuration; enabling list boxes with multi-select for querying; sliders in forms for filtering; AND/OR functionality within queries; and the ability to change the order of entities in a data tree by dragging them or setting a sequence number. (Presentation in our Library)

ChemAxon’s atom mapping algorithm

(Presentation in our Library)

In the first of a number of presentations under the loose heading of “ChemAxon science”, Daniel Lowe of NextMove Software discussed reaction atom mapping applications. Atom mapping is useful in assigning roles to reagents, in normalization of reactions for registration, and in aligning structures. It also allows more precise database searches: solvents and catalysts can be distinguished from reactants, and the relationship between the reactant atoms and product atoms can be made explicit. Suppose, for example, that you want to find reactions converting an alkene to a cyclopropane; you will get the following false drop unless you use atom mapping.

Atom mapping is also useful for identifying suspect reactions, e.g.,

ChemAxon mapping is available through the MarvinSketch GUI. Batch mapping can be achieved using the AutoMapper class in MarvinBeans. The “complete” option assigns mapping numbers to all atoms; “changing” assigns numbers to only those atoms involved in reaction centers; and “matching” shows the correspondence between atoms in the reactants and products:

To evaluate any improvement in version 6 of Marvin Beans and ChemAxon’s atom mapping algorithm, Daniel used the following test sets.

(The work on reactions from patents was reported in a paper entitled “Automated Extraction of Reactions from the Patent Literature” given at the 243rd ACS National Meeting in San Diego in March 2012.)

Daniel measured precision by the number of carbon-carbon bonds broken (since breaking these bonds is energetically difficult and rare), and recall by whether or not all product atoms were mapped. Version 6 of the software has improved recall.
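As an illustration of how these two measures could be computed, here is a minimal Python sketch. It is not the code Daniel used: the tuple-based atom and bond representation, the function names, and the esterification example (acetic acid plus methanol giving methyl acetate, heavy atoms only) are all assumptions made for illustration.

```python
# Each atom is (element, map_number); map_number is None if left unmapped.
# Each bond is a pair of indices into the atom list. Illustrative layout only.

def carbon_carbon_bonds(atoms, bonds):
    """Return mapped C-C bonds as frozensets of the two map numbers."""
    found = set()
    for i, j in bonds:
        (element_i, map_i), (element_j, map_j) = atoms[i], atoms[j]
        if element_i == element_j == "C" and None not in (map_i, map_j):
            found.add(frozenset((map_i, map_j)))
    return found

def broken_cc_bonds(reactant_atoms, reactant_bonds, product_atoms, product_bonds):
    """Count reactant C-C bonds whose two atoms both reach the product unbonded."""
    mapped_in_product = {m for _, m in product_atoms if m is not None}
    before = carbon_carbon_bonds(reactant_atoms, reactant_bonds)
    after = carbon_carbon_bonds(product_atoms, product_bonds)
    return sum(1 for bond in before - after if bond <= mapped_in_product)

def fully_mapped(product_atoms):
    """Recall criterion: every product atom carries a map number."""
    return all(m is not None for _, m in product_atoms)

# Reactants: acetic acid (C1-C2(=O3)-O4) and methanol (C5-O6)
reactant_atoms = [("C", 1), ("C", 2), ("O", 3), ("O", 4), ("C", 5), ("O", 6)]
reactant_bonds = [(0, 1), (1, 2), (1, 3), (4, 5)]

# Product: methyl acetate, same connectivity for every candidate mapping
product_bonds = [(0, 1), (1, 2), (1, 3), (3, 4)]
good_mapping = [("C", 1), ("C", 2), ("O", 3), ("O", 6), ("C", 5)]
# Implausible mapping: the two methyl carbons are swapped
bad_mapping = [("C", 5), ("C", 2), ("O", 3), ("O", 6), ("C", 1)]
# Incomplete mapping: one product atom has no map number
partial_mapping = [("C", 1), ("C", 2), ("O", 3), ("O", None), ("C", 5)]

good_broken = broken_cc_bonds(reactant_atoms, reactant_bonds, good_mapping, product_bonds)
bad_broken = broken_cc_bonds(reactant_atoms, reactant_bonds, bad_mapping, product_bonds)
```

The swapped mapping implies that the acetyl carbon-carbon bond was broken and re-formed, which is the kind of chemically unlikely result that the precision measure is designed to penalize.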

Precision is also improved.

The program is also faster according to a comparison performed on the PharmaELN dataset on an i7-2600:

The Newman–Kwart rearrangement is a difficult case that the ChemAxon software now handles correctly:

Implicit stoichiometry is one area for improvement. The following reaction is mapped correctly if the stoichiometry of the Grignard reagent is made explicit.

Another case where improvement is needed is where there are many choices for reactant atom mapping. The ChemAxon software chooses an incorrect reactant in the following example. This might be partly a bond order issue.

The use of a consensus of methods gives improved results:

Daniel finally presented some cases where atom mapping alone is insufficient. One case is missing reactants, often for routine reactions, for example, Boc protection:

Another example is change of stereoisomer or chiral resolution:

This will be found by atom mapping but it will not really tell you what the reaction is. The atoms are the same on both sides of the equation but the stereochemistry has changed.

Unfortunately there are some reactions which genuinely make no sense (e.g., multiple reaction steps described in one reaction) so 100% mapping may not be desirable. The atom mapping algorithm in Marvin version 6 provides large improvements in recall, precision and speed over version 5 but atom mapping in some cases is not as simple as finding a maximum common subgraph mapping. Algorithms such as NextMove Software’s NameRXN can be useful for the validation of some such reactions.

Since Daniel gave this talk a test set for mapping algorithms has been published (Kraut, H.; Eiblmaier, J.; Grethe, G.; Löw, P.; Matuszczyk, H.; Saller, H. Algorithm for Reaction Classification. J. Chem. Inf. Model. 2013, 53, 2884–2895) and contributions have been invited. ChemAxon should study this.


Metabolizer

(Presentation in our Library)

The second “science” talk concerned ChemAxon’s Metabolizer software (a new product this year) which predicts and ranks the metabolites of drugs and other xenobiotics. The knowledge base is stored in the form of biotransformation libraries which can be replaced or customized. Metabolizer comes with a built-in human biotransformation library containing biotransformation schemes and rules constructed from experimental data collected from the literature. György Pirok described a recent test he had carried out using 310 substrates and 826 experimental metabolites; 366,795 metabolites were produced in four generations.

The system must find all experimental metabolites, with no structural errors. It should identify at least one major metabolite correctly for 95% of the substrates, and it should identify most of the experimental metabolites correctly. Metabolizer performed reasonably well. Experimental metabolites are rarely missed; the site of metabolism is correct in most cases; and there is good coverage of the biotransformation space. On the other hand, experimental metabolites are sometimes marked “unlikely”; there is no site highlighting yet; and there is a risk of getting many unlikely metabolites. Comparing the competitors in this market is not easy; György’s results are not discouraging but there is still plenty of work to be done.

Chemical mixture applications

(Presentation in our Library)

The next “science” paper was by Oleg Ursu of the University of New Mexico who has developed a system for studying similarity of mixtures. Mixtures of chemicals are currently widely used in pharmaceutical formulations, as well as in flavors and agrochemicals. Recent data show that approximately 25% of active ingredients used in pharmaceutical formulations are used in combination with at least one more active ingredient. For example, acetaminophen (paracetamol) is present in almost 3000 pharmaceutical formulations, more than half of which are mixtures of two or more active ingredients.

Perhaps the simplest way of representing a chemical mixture is as a dot-disconnected SMILES string, but this does not allow you to store the composition or stoichiometry of a mixture. The MDL connection table (CTfile) format can store mixtures, and data can be attached to fragments using Sgroup blocks. ChemAxon’s MRV format also supports data attached to a molecular fragment. If the composition information is stored along with the chemical structure, it can be used to compute a mixture fingerprint to encode not only chemical structures but also the ratio of each component.

Oleg’s mixture fingerprint generation process is based on generation of Accelrys’ extended connectivity fingerprints (ECFPs). After the bit vector is generated, the ON bits are replaced with the actual concentration or ratio of components from which the corresponding fragments or circular environments are derived. Where a fragment is present in more than one component then the concentration or ratio is aggregated, using SUM in the following example of a mixture of 100 mg of levodopa and 10 mg of carbidopa.

The resulting floating-point vector is a compact encoding of both chemical structure (the positions of the non-zero values) and composition (the floating-point values themselves). The computation of mixture fingerprints and similarity coefficients was implemented on top of the ChemAxon JChem library, which was chosen for its comprehensive and robust API; this allowed quick prototyping, deployment, and integration of the algorithm in existing workflows.
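The aggregation step can be sketched in a few lines of Python. This is only an illustration of the idea as described in the talk, not Oleg's JChem-based implementation: the feature names are hypothetical stand-ins for ECFP circular environments, and the fingerprint is held as a sparse dictionary rather than a fixed-length vector.

```python
from collections import defaultdict

def mixture_fingerprint(components, aggregate=sum):
    """Build a sparse mixture fingerprint.

    components: list of (amount, set of feature identifiers), one per ingredient.
    aggregate:  function applied to the amounts of all components sharing a
                feature (SUM by default, as in the talk).
    """
    amounts_by_feature = defaultdict(list)
    for amount, features in components:
        for feature in set(features):
            amounts_by_feature[feature].append(amount)
    return {f: aggregate(a) for f, a in amounts_by_feature.items()}

# Hypothetical features; real ones would be generated by a fingerprinting toolkit.
levodopa = (100.0, {"catechol", "carboxylic_acid", "primary_amine"})   # 100 mg
carbidopa = (10.0, {"catechol", "carboxylic_acid", "hydrazine"})       # 10 mg

fingerprint = mixture_fingerprint([levodopa, carbidopa])
# Features shared by both components receive the summed amount (110.0);
# features unique to one component keep that component's amount.
```

Passing a different function as `aggregate` (for example `max`) changes how shared features are scored without altering the rest of the procedure.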

Alternative fingerprints (key-based or path-based) can be used. The default aggregation function is SUM, but other functions can be used, and at this stage weights can also be applied where a particular component is more important than the others. A final transformation (log or exponent, for example) can be applied: this can be useful if a mixture of molar concentrations is used in biological assays where sometimes the response is better correlated with log scale concentration.
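To show why a final transformation can matter, the following Python sketch compares two hypothetical mixtures before and after log scaling. The talk did not specify which similarity coefficient was used; the continuous (generalized) Tanimoto coefficient here is an assumption, as are the `log10(1 + x)` form of the transformation, the feature names, and the amounts.

```python
import math

def log_scale(fingerprint):
    """Apply log10(1 + x) to every value; the +1 offset keeps values non-negative."""
    return {f: math.log10(1.0 + v) for f, v in fingerprint.items()}

def tanimoto(a, b):
    """Continuous Tanimoto coefficient: a.b / (a.a + b.b - a.b) on sparse vectors."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    denominator = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return dot / denominator if denominator else 0.0

# Two mixtures with the same two features in opposite proportions
mixture_a = {"feature_x": 100.0, "feature_y": 10.0}
mixture_b = {"feature_x": 10.0, "feature_y": 100.0}

raw_similarity = tanimoto(mixture_a, mixture_b)
log_similarity = tanimoto(log_scale(mixture_a), log_scale(mixture_b))
```

On the raw amounts the two mixtures look quite dissimilar because the tenfold differences dominate; after log scaling the same pair scores as considerably more similar, which is the behavior one would want when response tracks log concentration.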

Oleg’s team has applied the fingerprint in applications related to the chemical similarity of mixtures. In Oleg’s first example he used Atripla as a query, searching the FDA drug labels database, which contains 12,597 formulations with two or more active pharmaceutical ingredients. Atripla is an HIV treatment consisting of three active ingredients: efavirenz, emtricitabine, and tenofovir disoproxil fumarate. The most similar formulations that Oleg found, Trizivir, Combivir, and Epzicom, contained similar active ingredients and similar strengths.

In an example of document clustering around doxorubicin, Oleg searched the ChEMBL database of 20,000 documents and 122,000 compounds. There were 40 SAR series and 79 references for doxorubicin analogs. There was one common lead compound in all the documents. Oleg showed structures with mixture fingerprint similarity to the lead compound. The documents (40 documents containing about 21 SAR series) from a methotrexate cluster were also all focused on tumor cell biological activity. This is not the expected outcome of the clustering, but it can potentially be used to explore chemical space around the biological activity of the lead compound. The expected result was to identify SAR series with similar compounds in the whole series, not just the lead. Note that because only methotrexate is common to all documents, the SAR series are not that similar to each other, but all papers have methotrexate tested as a reference compound. In many of the examples examined, document clustering is highly sensitive to a single common chemical structure: the algorithm needs sensitivity adjustment for these special cases.

Another example is similarity ranking of literature references. The hits from a high-throughput screen (HTS) were prioritized and hits with the best biological activity and selectivity profile were selected. The literature was searched for other known biological activities. Oleg’s team looked at a PubChem confirmatory assay and ran a similarity search on ChEMBL database documents to determine which papers contain similar series of compounds with a similar range of concentrations. The top ranking results for assays where they found hits had very similar or identical compound series, and very similar concentrations. The top ranking paper usually contains most of the active molecules from the assay and usually is a publication reporting the results from the PubChem assays. The most interesting papers are probably the ones containing results unrelated to the PubChem assay biological activities: they are important since they fill the biological activity space of the identified actives and can potentially drive a decision on which actives to prioritize for follow up.

A chemical library design tool

(Presentation in our Library)

Dart NeuroScience’s approach to the design and virtual screening of potential chemical libraries is rather different from Plexus. Tim Parrott described it. At Dart NeuroScience about 20 chemists are involved in the design and creation of chemical libraries. They needed a chemical library design tool to select reactants, enumerate products, and calculate properties prior to analyzing and filtering the products. Since limited IT support was available, no one had time to spare, and the chemists were already overloaded with software, the IT approach was to standardize calculations and reactions, to simplify the system by wrapping processes and minimizing import/export operations, and to enhance capabilities and speed by doing calculations remotely. Dart NeuroScience had already licensed JChem, Spotfire, and OpenEye’s Rapid Overlay of Chemical Structures (ROCS); the “glue” for the system was a customized set of KNIME nodes built using both Infocom’s JChem Extensions and the JChem API.

The “diversity elements” node allows chemists to select reactants by class (e.g., amines), filter them by substructure, and import a list. The “deduplicate” node removes functionally equivalent reactants (e.g., salts). There is a node for selecting a reaction type from a set of 40 curated types: a reaction browser displays the reaction title (e.g., reductive amination), the classes of reactants expected as input (e.g., aldehydes and amines in the case of reductive amination) and an example reaction. Reaction images are clickable and linked to wiki pages with more detail on, for example, selectivity. Chemists enumerate potential libraries with ChemAxon’s Reactor, using customized nodes that can contain multi-step workflows.

Standardized KNIME nodes call back-end services on a high-performance computing grid to enable computationally intensive calculations (e.g., logP and ROCS scores), with result sets pushed back to the user on reconnection. Poses can be viewed in OpenEye’s Visual Interface to Drug design Applications (VIDA) or exported to Spotfire. ChemAxon chemical fingerprints are used in a “cluster” node that allows different clustering methods.

Exporting products to Spotfire is a two-step process: the “export for Spotfire” node, and the “publish for Spotfire” node that launches the application. Selections are made in Spotfire, and new nodes with the selected products and reactants are returned to KNIME. In another node, the stereochemical codes needed for registration are assigned based on structure. The final node publishes a library design plan containing separate SDfiles for the products and reactants, along with a .csv file listing how many times each reactant is used. The zipped file is parsed on import into a chemist’s ELN.
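The reactant usage file in the design plan amounts to a simple tally over the enumerated products. A minimal Python sketch follows; the column names, identifiers, and function name are hypothetical and are not taken from Dart NeuroScience's KNIME nodes.

```python
import csv
import io
from collections import Counter

def reactant_usage_csv(products):
    """Return CSV text listing how many products each reactant appears in.

    products: iterable of (product_id, [reactant_id, ...]) pairs.
    """
    counts = Counter(r for _, reactants in products for r in reactants)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["reactant_id", "times_used"])
    for reactant_id, times_used in sorted(counts.items()):
        writer.writerow([reactant_id, times_used])
    return buffer.getvalue()

# Hypothetical library: three products from two amines (A*) and two aldehydes (B*)
library = [
    ("P1", ["A1", "B1"]),
    ("P2", ["A1", "B2"]),
    ("P3", ["A2", "B1"]),
]
usage_csv = reactant_usage_csv(library)
```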

Each user has KNIME installed locally and a Dart NeuroScience node update site pushes out updates. KNIME and Infocom updates are turned off, to keep consistent versions across all users, but Spotfire updates are pushed out to users. Service updates are largely invisible to users. The overall goal for the library design system was that increased ease of use should result in decreased support, leaving both IT staff and chemists happy and productive.

DeltaSoft and ChemAxon tools for screening, lead discovery, and SAR

(Presentation in our Library)

Danni Harris of RTI International (RTI) described another drug discovery system. RTI uses MarvinSketch, the Oracle JChem Cartridge, Canonicalization, Structure Checker, and the Properties package from ChemAxon, and, from DeltaSoft, the API, Java and SQL code base underlying ChemCart Registration, Bioassay, ELN, and Sample Inventory. DeltaSoft Compound Registration Admin and Bioassay Admin were customized to provide collaborative, project-linked pick lists for compound registration and bioassay module pull-down menus. ChemCart Admin is a user-friendly graphical user interface and API used to develop and control the function of ChemCart applications.

To provide fine-grained security, the institutional database is partitioned into customized data vaults among which project-specific collaborations can also be defined. Role-based permissions ensure that postdocs have read-only access on projects within their lab, principal investigators (PIs) can administer their own data and submit data to the institution, and management can view and search all data across all projects. A few minor changes have also been made to ChemCart Registration.

The ChemCart Bioassay system has project-linked pick lists to facilitate data entry for in vitro and in vivo screens. There is a Java stream interface to the printer control languages Eltron Programming Language (EPL) and Zebra Programming Language (ZPL) for printing barcodes for the sample and batch inventory. The barcode is inserted automatically when an assay request is submitted electronically. Pharmacologists are prompted to review new requests by e-mail.
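For readers who have not met ZPL, a barcode label is just a short text command string sent to the printer. RTI's interface is written in Java; the sketch below is in Python and is not RTI's code. The function name, the sample identifier, and the layout values are hypothetical; the ZPL commands themselves (`^XA`, `^FO`, `^BC`, `^FD`, `^FS`, `^XZ`) are standard.

```python
def zpl_barcode_label(sample_id, x=50, y=50, height=100):
    """Return a ZPL string that prints sample_id as a Code 128 barcode."""
    if not sample_id or any(ch in sample_id for ch in "^~"):
        # ^ and ~ introduce ZPL commands, so they cannot appear in field data
        raise ValueError("sample_id must be non-empty and contain no ^ or ~")
    return (
        "^XA"                     # start of label format
        f"^FO{x},{y}"             # field origin, in printer dots
        f"^BCN,{height},Y,N,N"    # Code 128, normal orientation, text line below
        f"^FD{sample_id}^FS"      # field data followed by field separator
        "^XZ"                     # end of label format
    )

label = zpl_barcode_label("RTI-000123")  # hypothetical sample identifier
```

The resulting string would then be written to the printer's port or network socket; that transport step is omitted here.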

The electronic assay fulfillment system handles the following data and functions: barcode; assay protocol and version; project for charging and PI; detailed modifications to the assay (e.g., concentration specification); status of order (dates of receipt and completion); instructions for stock solution preparation; and free base or acid form. Automatic confirmatory e-mails are sent when an assay is received or completed, data are entered, and a PI receives the results.

The database used for registration of commercial and in-house library compounds provides an audit trail of high and medium throughput screening, and robotics. It also provides a depletion inventory for the daughter plate series used for computationally driven cherry picking, and for confirmatory screening following primary screens. Each screening plate well is registered as a distinct sample with bioassay results linked to that sample. Plate-based applications and views combine data from the chemical registration and bioassay schemas. Audit trail software has also been developed for robotics screening: the screening system allows for import and export of NX files (from the Beckman Coulter Biomek NX Laboratory Automation Workstation for liquid handling) as well as plate data. Dose response data can be displayed: agonist or inhibitor control curves and background controls. Curves for control samples are stored independently of those for registered screening compounds. Primary control data are stored in tables; Prism files, Prism image analysis curves, EC50s, IC50s, and textual comments are also stored. Inter-plate normalization allows users to spot patterns, and automatic scaling of plate data within the database allows them to spot anomalous wells. A spreadsheet of chemical structures and plate data is used to generate primitive structure-activity relationships.
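Danni did not say which normalization or outlier method RTI uses. As a hedged illustration of the general idea, the Python sketch below uses two common choices: percent-of-control scaling against each plate's own controls, and a modified z-score based on the median absolute deviation to flag unusual wells. All names, readings, and the 3.5 threshold are assumptions.

```python
from statistics import mean, median

def percent_of_control(raw, low_controls, high_controls):
    """Rescale raw well readings so the plate's controls sit at 0 and 100."""
    low, high = mean(low_controls), mean(high_controls)
    if high == low:
        raise ValueError("controls are indistinguishable on this plate")
    return [100.0 * (value - low) / (high - low) for value in raw]

def anomalous_wells(values, threshold=3.5):
    """Return indices of wells whose modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread to judge against
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - centre) / mad) > threshold]

# Two plates read at different signal levels become comparable after scaling
plate_a = percent_of_control([100, 550, 1000], low_controls=[100, 100], high_controls=[1000, 1000])
plate_b = percent_of_control([200, 1100, 2000], low_controls=[200], high_controls=[2000])

# One well stands out from an otherwise tight set of readings
flagged = anomalous_wells([50, 52, 49, 51, 50, 48, 120])
```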

Within the last few months RTI has screened three newly de-orphanized and orphan G protein-coupled receptors (GPCRs) using small diversity libraries optimized for GPCR targets of interest. Generally, 100-150 primary hits were distilled down to 20-30 robust antagonist or agonist leads for medicinal and computational chemistry diversification, and further scaffold hopping. This initial success is in part due to examination of GPCR fingerprints, 3D-pharmacophoric criteria, ADME parameters, and diversity metrics, so, inspired by a publication on the Structure-Activity Landscape Index (SALI), Danni wanted to give medicinal chemists some simple modeling tools. A “Computational Task Panel” has thus been integrated into ChemCart, interfacing the DeltaSoft and ChemAxon databases with graphics processing unit (GPU) powered, parallelized computational chemistry, facilitating ligand- and structure-based design.
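For readers unfamiliar with SALI, it is usually defined for a pair of compounds as the absolute difference in activity divided by one minus their structural similarity, so that structurally similar compounds with very different activities (activity cliffs) score highly. The following Python sketch shows the calculation only; the activity values and similarities are hypothetical, and nothing here is taken from RTI's Computational Task Panel.

```python
import math

def sali(activity_i, activity_j, similarity):
    """Structure-Activity Landscape Index for one pair of compounds.

    similarity is expected on a 0..1 scale (for example a Tanimoto coefficient).
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    difference = abs(activity_i - activity_j)
    if similarity == 1.0:
        # Identical fingerprints: the index is undefined unless activities match
        return math.inf if difference else 0.0
    return difference / (1.0 - similarity)

# Same activity gap (two log units), different structural similarity
cliff = sali(7.0, 5.0, similarity=0.9)    # near-identical structures
smooth = sali(7.0, 5.0, similarity=0.5)   # moderately similar structures
```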

SMILES files are transferred batchwise from client to server, where 3D conformations are generated, and computational procedures are carried out. These include conformational library generation; overlaying conformations and informatics; docking against preferred targets; binding energy prediction by molecular mechanics rescoring in AMBER; and prediction of metabolism and toxicity using in-house code. Multiple programs are available for docking against pre-computed grids; varied parameters are passed through a stored procedure; and then graphical and quantitative results are emailed to the PI. RTI is also working on peptidomimetic design using solvated peptide dynamics to develop SAR. Work on the computational infrastructure is ongoing.

NIH BioAssay Research Database (BARD)

(Presentation in our Library)

Ajit Jadhav of NIH presented an introduction to BARD. The mission of the NIH National Center for Advancing Translational Sciences (NCATS) is “to catalyze the generation of innovative methods and technologies that will enhance the development, testing and implementation of diagnostics and therapeutics across a wide range of human diseases and conditions”. It catalyzes collaborations both within NIH and outside. The NIH Chemical Genomics Center (NCGC) currently has more than 230 collaborations with investigators worldwide, in assay development, HTS, chemical informatics, and medicinal chemistry. The focus is on unprecedented targets, and rare or neglected diseases. The NCGC Pharmaceutical Collection (NPC) is a publicly accessible collection of approved and investigational drugs for high-throughput screening (Huang, R.; Southall, N.; Wang, Y.; Yasgar, A.; Shinn, P.; Jadhav, A.; Nguyen, D.-T.; Austin, C. P. The NCGC Pharmaceutical Collection: a Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Sci. Transl. Med. 2011, 3, 80ps16).

The mission of the BARD is to enable novice and expert scientists to use the NIH Molecular Libraries Program (MLP) data to generate new hypotheses. It was developed as an open-source, industrial-strength platform to support public translational research. It is a unique collaboration among NIH and academic centers with expertise in screening and software development (including the Broad Institute, Sanford-Burnham, Scripps Research Institute, the University of Miami, the University of New Mexico, and Vanderbilt University), with advisers also from industry.

The BARD data warehouse has been loaded with Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) terms, and DrugBank annotations, and with all PubChem bioassay identifiers and results. The catalog of assay protocols (CAP), the CAP user interface with view and basic editing, and the CAP data dictionary (defined as Web Ontology Language (OWL) using Protégé) have been implemented, and the results deposition data model has been created and populated. Full text search is through Solr, which enables filtering, faceting and auto-suggest. This is a key entry point for users. The search code runs queries in parallel and is customized for fast structure searches.

The REST API is versioned and fully documented; you hit a URL and get a JavaScript Object Notation (JSON) response. The architecture is Java, read-only, and deployed on a GlassFish cluster. Different functionality (for maintenance and security; stability; and performance) is hosted in different containers. The software is open source as far as possible; ChemAxon’s JChem is the only exception.

BARD can be extended with plugins. Plugin resources can accept almost anything: text, JSON, files, links, etc. Plugin responses can be almost anything: plain text, JSON, HTML, SVG, etc. An example is SMARTCyp which predicts the site of metabolism by cytochrome P450 (CYP450) isoforms using 2D structures. BARD is not just a data store, it is a platform. It will interact with users’ preferred tools, allow the community to tailor it to their needs, serve as a meeting ground for experimental and computational methods, and enhance collaboration opportunities.

Ontologies for bioassay information

(Presentation in our Library)

Christopher Mader of the University of Miami Center for Computational Science continued the theme of bioassay data. NIH is funding a Library of Integrated Network-based Cellular Signatures (LINCS). The LINCS program aims to facilitate a mechanistic understanding of disease in support of drug and biomarker development; generate a reference set of cellular response signatures to a variety of small molecule and genetic perturbations; establish common standards and best practices; and develop computational tools and approaches to analyze cellular signatures.

The LINCS Information FramEwork (LIFE) aims to build standards and an integrated ontology to describe LINCS data; to develop semantically enabled software to access and explore LINCS data; and to enable knowledge-based contextual reporting. A near term goal is to enable exploration of information in the LINCS dataset by non-experts, tackling the challenges of representation and modeling, and tools for search.

The University of Miami Center for Computational Science has been developing the BioAssay Ontology (BAO) Web Ontology Language and applying it to create the LIFE ontology. More data do not necessarily mean more knowledge. Several public resources of screening results exist (PubChem, PDSP, ChEMBL, BindingDB, etc.) but there are many challenges, including syntactic, structural, and semantic heterogeneity problems; incomplete or missing annotations; lack of standardized metadata; and badly defined project context.

BAO describes assays and screening results; defines assay and result annotations; provides controlled terminology; formalizes knowledge of assays and screening results; and describes and formalizes screening campaigns. In BAO version 2, Basic Formal Ontology (BFO) is used as an upper level ontology. The various types of bioassays (e.g., binding assay, reporter gene assay) are described using concepts such as “assay format”, “assay design method”, “perturbagen”, “biology”, “physical detection method”, and “endpoint”. “Biology” encompasses many concepts pertaining to the biological domain, including cell line, tissue, organism, molecular target, biological process, molecular function and disease. The LINCS Data Working Group is developing standards for the metadata describing LINCS reagents, assays and experiments.

LIFE.wrx addresses the second aim of LIFE: it is Web-based, semantically enabled, responsively designed software for exploring LINCS data. It is a RESTful, multi-tier application, with an HTML and JavaScript user interface. The middle-tier uses Java running on Tomcat. The back end implements Solr for indexing, PostgreSQL for storage, and SDB for the triple store. JChem is used for the chemical structure functionality.

The third aim of LIFE is knowledge-based, contextual reporting to turn the data into knowledge. LINCS content has been enriched through connection with other ontologies, additional classes have been added through computational techniques, and the data can be explored using description logic reasoning (DL-reasoning). The information is semantically searchable by combining both biological concepts and chemical feature information. Christopher showed examples of compound classification, cell line classification by disease, based on the disease ontology (DOID), and exploration of promiscuity. LIFE also offers list-based filtering, downloading of data, keyword filtering, and structure-based searching powered by ChemAxon software.

Machine learning applications with JChem

(Presentation in our Library)

Steven Wilkens of Takeda gave another paper in the “ChemAxon science” session. His work concerned predictive models in drug discovery. Most of the machine learning at Takeda California is done in Java, since there is an extensive, “out of the box”, open source API, a simple thread model, and dynamic class loading, and Java is platform independent. All model construction is done with the Weka Java-based, open source machine learning framework. The Prognosticus is an automated system to identify the optimal model for property prediction, built on top of Autocorrelator (Lardy, M. A.; LeBrun, L.; Bullard, D.; Kissinger, C.; Gobbi, A. Building a Three-Dimensional Model of CYP2C9 Inhibition Using the Autocorrelator. An Autonomous Model Generator. J. Chem. Inf. Model. 2012, 52, 1328−1336). Autocorrelator is a general computational model optimizer using a genetic algorithm to evaluate results and alter variables intelligently until the model converges. With the Sun Grid Engine, tens of thousands of parametric variations are attempted in parallel during the optimization. Steven co-developed the Prognosticus with Matt Lardy.

Training data are fed into the fingerprinter. Feature pruning and cross-validation follow. The tasks are run in parallel. Each job is dependent upon the previous job, and will not execute until it has finished. ECFP features of varying length (with no hashing) were generated (Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742-754) by the JChem MolecularDescriptor package. (ECFPs with Pipeline Pilot would have been too expensive.) Since these were large datasets, and memory was limited, ECFP feature pruning cutoffs were necessary. Models were trained from datasets varying in size from about 1,000 to 100,000 experimental observations. Nearly all possible combinations were tried and the optimal model was selected from the results.
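Steven did not describe how the pruning cutoffs were applied. One plausible reading, sketched here in Python as an assumption, is a frequency cutoff: unhashed features that occur in too few molecules (or in nearly all of them) are dropped before the descriptor matrix is built. The integer feature identifiers and parameter names are hypothetical.

```python
from collections import Counter

def prune_features(molecule_features, min_count=2, max_fraction=1.0):
    """Return the sorted feature ids that survive the frequency cutoffs.

    molecule_features: one set of feature ids per molecule.
    min_count:    keep features present in at least this many molecules.
    max_fraction: drop features present in more than this fraction of molecules.
    """
    total = len(molecule_features)
    counts = Counter(f for features in molecule_features for f in set(features))
    return sorted(f for f, c in counts.items()
                  if c >= min_count and c / total <= max_fraction)

def descriptor_matrix(molecule_features, kept):
    """Build a dense 0/1 matrix with one row per molecule, one column per kept feature."""
    return [[1 if f in features else 0 for f in kept] for features in molecule_features]

training_set = [{101, 202, 303}, {101, 202}, {101, 404}, {101, 202, 505}]

kept_default = prune_features(training_set)                    # drops singletons
kept_strict = prune_features(training_set, max_fraction=0.9)   # also drops 101
matrix = descriptor_matrix(training_set, kept_default)
```

Dropping rare features is what keeps memory use manageable: without hashing, the number of distinct features grows with the size of the training set.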

An example is the Prognosticus logD model. An accurate clogD model can have a significant impact on project success. There were more than 85,000 compounds in the training set, measured at Takeda laboratories in Japan. After pruning, 801 ECFP_4 features survived. The weka.classifiers.functions.PLSClassifier PLS1 algorithm was used with 25 components. Ten-fold cross-validation model evaluation took about 5 hours’ execution time on Intel Xeon 5160 at 3 GHz with 3 GB of RAM. Model conditions converged after a week-long Prognosticus run. Test set r2 was 0.66, and training set q2 0.76. This compared with r2 of 0.55 for JChem clogD. Takeda’s error histograms were also better. The strategy has been generalized and applied to several other properties: CYP inhibition, plasma protein binding, multidrug resistance protein 1 (MDR1) efflux, cytotoxicity, phototoxicity, human Ether-à-go-go-Related Gene (hERG) inhibition, pregnane X receptor (PXR) activation, kinetic solubility and kinase selectivity.
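The cross-validated q2 statistic quoted above can be illustrated with a short Python sketch. This is not the Prognosticus or Weka code: a one-descriptor least-squares line stands in for the 25-component PLS model, the interleaved fold assignment is an assumption, and the data are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def cross_validated_q2(xs, ys, folds=10):
    """q2 = 1 - PRESS / SS_total, with every point predicted while held out."""
    n = len(xs)
    press = 0.0
    for k in range(folds):
        held_out = set(range(k, n, folds))  # every folds-th point, starting at k
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held_out]
        slope, intercept = fit_line(*zip(*train))
        press += sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Synthetic data: a perfect line, and the same line with alternating +/-1 noise
descriptor = list(range(20))
exact = [2 * x + 1 for x in descriptor]
noisy = [2 * x + 1 + (1 if x % 2 else -1) for x in descriptor]

q2_exact = cross_validated_q2(descriptor, exact)
q2_noisy = cross_validated_q2(descriptor, noisy)
```

Because each prediction is made by a model that never saw the point in question, q2 falls below 1 as soon as the data contain noise that the model cannot reproduce.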

There is a Web portal for chemists to score compounds before they synthesize them. Models can be calculated in two seconds or less. Compounds, classifiers, and results are stored in a PostgreSQL database and can be viewed by other project members. Takeda is in the process of implementing a single global prediction service where models can be accessed by published SOAP and RESTful interfaces and users can access models using their favorite tools.

Steven suggested that ChemAxon could implement a wrapper for Weka, integrate Weka clustering with existing cluster algorithms, and allow ChemAxon customers to access advanced machine learning algorithms without having to learn the Weka API. Unfortunately ChemAxon’s Weka license will not allow the company to do this. There was a lengthy discussion after this talk. It was claimed that Oleg Ursu’s models were better than Takeda’s and are global, but someone suggested that Takeda’s are also global if they are derived from 85,000 measurements. It was also pointed out that ECFPs will not work in all cases.

Naming and text mining

Another group of talks concerned extraction of chemical information from documents, patents and an ELN. By providing reliable name to structure conversion, Naming has become the backbone of ChemAxon’s chemical text mining tools, such as Document to Structure and JChem for SharePoint. Naming is mature but is still improving. David Deng gave some examples and demonstrated corporate ID to structure conversion via a Web Service. KeyModule’s Chemical Literature Data Extraction (CLiDE) and GGA’s Imago can now be used for image recognition (in addition to NIH’s Optical Structure Recognition Application, OSRA). Chinese chemical name recognition has been added. David also demonstrated the new Document-to-Database solution that can continuously index chemical information from documents in a repository system (currently, Documentum). It also provides a Web interface through which users can run chemical searches within the documents, or view the augmented documents with chemical information annotated. (Presentation in our Library)

Mark Andrews of DuPont has used the new Document-to-Database solution. There was no way to structure search the DuPont ELN repository. It contains working ELN documents in Microsoft Word and Excel etc., with embedded “live” structures (in Accelrys Draw, ChemDraw etc.), and names and structure images. Witnessed ELN documents are in PDF format and contain names and structure images. Since the ELN repository was searchable only by text, and not by structure, DuPont could not easily take advantage of prior work. The company’s options were to find a way to extract structural information from the current ELN, and invest in a more chemically intelligent ELN in the future.

A system was needed for extracting embedded “live” structures, chemical names, and structure images from documents; converting names and images to structures, and structures to names; and storing structures, chemical names, document metadata, and links from structures to documents. The system had to have granular security and an interface for extraction, conversion, storage, search, and hit set export. For names and structures DuPont considered ChemAxon’s Document-to-Database and Document-to-Structure, PerkinElmer’s Structure Genius and Document Manager, Accelrys’ Pipeline Pilot ChemMining, Scilligence’s Chrawler, and InfoChem’s Mining for Chemistry. Options for images and names were SimBioSys’ CLiDE, NextMove’s Caffeine Fix, ACD/Name to Structure, the National Cancer Institute’s OSRA, and the University of Cambridge’s Open Source Chemistry Analysis Routines (OSCAR) and Open Parser for Systematic IUPAC Nomenclature (OPSIN).

Evaluation criteria were: handling embedded structures and names (images were ruled out of scope because there were too many conversion errors); performing all the required conversions; support of both substructure and text search (including document metadata); provision of an entire workflow and search interface; provision of hot links to original documents (not to another copy, because of security concerns); support for secure individual user login accounts; and suitable cost and licensing terms.

DuPont decided to use ChemAxon’s Document-to-Structure (although it was a new product in development it was overall the best match); to crawl only ELN working documents (archived PDFs do not have live structures); and to store extracted structures and data in a separate database, rather than “contaminate” the DuPont chemical registration database. The company opted to have two interfaces: the ChemAxon custom Web application for casual users, and the DeltaSoft ChemCart Web form, customized at DuPont, for “power users”. The former is easier to use; the latter has a full choice of operators, pick lists, and many more menu-driven options.

Mark gave one example of a problematic structure, where, instead of entering

the user had entered:

with an “NTs” (for “N-tosyl”) that the drawing tool did not understand. The software then converted each “NTs” to a carbon atom. Thus this structure came over as cyclohexane, and since it was crawled before any real cyclohexane, real cyclohexane hits displayed with the N-tosyl skeleton above. The solution to this is to implement ChemAxon Structure Standardizer on crawl and search, to intercept and fix abbreviations.
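The idea of intercepting abbreviations at crawl time can be sketched in a few lines of Python. This is not ChemAxon Structure Standardizer and does not use its API; the label tables and the function name are hypothetical. The point is only that an unrecognized label should be expanded from a dictionary or flagged for review, never silently replaced by a carbon atom.

```python
# Hypothetical lookup tables; a real system would use a maintained dictionary.
KNOWN_ELEMENTS = {"C", "H", "N", "O", "S", "F", "Cl", "Br"}
ABBREVIATIONS = {
    "NTs": "N(S(=O)(=O)c1ccc(C)cc1)",  # N-tosyl as a SMILES fragment
    "OMe": "OC",
    "Ph": "c1ccccc1",
}

def resolve_labels(labels):
    """Expand known abbreviations and report labels that cannot be resolved."""
    resolved, unresolved = [], []
    for label in labels:
        if label in KNOWN_ELEMENTS:
            resolved.append(label)
        elif label in ABBREVIATIONS:
            resolved.append(ABBREVIATIONS[label])
        else:
            unresolved.append(label)  # flag for review instead of guessing
    return resolved, unresolved

resolved, unresolved = resolve_labels(["C", "NTs", "Xyz"])
```

A document with a non-empty unresolved list could be held back from indexing until the label is added to the dictionary.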

Other issues identified were improper name conversions, and handling of reactions, multi-components and polymers. Some ELN documents refer to “Endeavor”, a piece of laboratory equipment, but Name-to-Structure generates many false hits because it thinks that Endeavor is the following structure.

Fortunately, Structure-to-Name offers options on the type of name generated and the dictionary for Name-to-Structure can be customized. Reactions are stored as such, allowing searches on reactions, but structure search must use “full fragment” (not “full”) to find a structure as a reactant or a product. The choice of molecular formula to store for reactions and polymers is a dilemma.

Despite the challenges, the system will provide much improved access to prior work at DuPont. The tool is developing nicely, it integrates well with the existing DuPont cheminformatics infrastructure, and ChemAxon is very responsive. DuPont hopes to extend the tool to other document sources in future. (Presentation in our Library)

Steven Wilkens presented Takeda’s automated system for extracting compounds and gene names from patents. The project was initiated because of a request from the Journal of Expert Opinion on Therapeutic Patents for an analysis of all vascular endothelial growth factor receptor 2 (VEGFR2) patents published from 2005 to 2012. The company wanted to identify patterns in compounds being patented for VEGFR2.

Steven used JChem’s Document to Structure API to implement a high performance patent analysis pipeline. He extracted only exemplar structures from the patents and used heuristics to separate exemplars from reagents, standards, etc. Patents with no detectable chemical matter were discarded, and then compounds were clustered by structure, organization, year, and therapeutic area, etc. To begin, Takeda searched USPTO records for all kinase insert domain receptor (KDR)/VEGFR2 patents published between 2005 and 2012, using a Perl script to page through results and scrape patent names. They downloaded all patents published between 2005 and 2012 from Google, and from them selected patents identified in the USPTO search that are associated with VEGFR2.

Analyzing 8,700 patents in a reasonable time frame required extreme measures. Fortunately, because patents are independent of one another, multiple patents can be processed in parallel. Takeda made use of the Java concurrent API for thread-safe operations. The method was inspired by Demo7 in the Document to Structure documentation.
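Because each patent can be handled independently, the work maps naturally onto a pool of workers. Takeda used the Java concurrent API; the sketch below shows the same pattern in Python with concurrent.futures instead, and the extraction function is a hypothetical stand-in, not the Document to Structure API.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_structures(patent_text):
    """Hypothetical stand-in for a per-document extraction call."""
    return [tok for tok in patent_text.split() if tok.startswith("CMPD")]

def process_patents(patents, max_workers=16):
    """Run the extraction over independent patents; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(extract_structures, patents.values())
        return dict(zip(patents.keys(), results))

patents = {"US1": "text CMPD1 and CMPD2", "US2": "no chemistry here"}
extracted = process_patents(patents)

# Patents with no detectable chemical matter are dropped before clustering.
with_chemistry = {pid: hits for pid, hits in extracted.items() if hits}
```

In CPython, threads suit work that waits on I/O or on an external service; for CPU-bound extraction a ProcessPoolExecutor would be the closer analogue of the multi-threaded Java run.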

The assumption in selecting exemplar compounds was that those compounds should share a common scaffold and an overall general similarity. It is better to miss exemplars rather than include non-exemplars, so Steven chose to reduce the false positive rate at the expense of increased false negative rate. First pass filtering ensured that compounds must contain at least one ring, have a molecular weight greater than 300, and have correct valence, aromaticity, etc. Second pass filtering involved finding maximum common substructures, and keeping only those that have a membership that exceeds a threshold. In third pass filtering, a Tanimoto similarity matrix was built and compounds with a minimum number of similar neighbors were kept. In the end, 170,443 compounds were extracted from 8,799 patents, exemplar compounds were flagged, and the centroid was determined for each patent in 3 hours and 15 minutes run time, on a 4 x 4 core Intel Xeon E5540 at 2.53 GHz, with 16 GB of RAM, 16 threads, and maxmemory of “2048M”.
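The first and third filtering passes can be sketched as follows. This is an illustration under stated assumptions, not Steven's code: compounds are plain dictionaries with precomputed ring counts, molecular weights and fingerprint sets, and the similarity cutoff and neighbor threshold are invented values. The second pass (maximum common substructure membership) needs a cheminformatics toolkit and is omitted.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def first_pass(compounds, min_rings=1, min_mw=300.0):
    """Keep compounds with at least one ring and molecular weight above 300."""
    return [c for c in compounds if c["rings"] >= min_rings and c["mw"] > min_mw]

def neighbor_filter(compounds, sim_cutoff, min_neighbors):
    """Keep compounds with enough similar neighbors (cutoffs are assumptions)."""
    kept = []
    for i, c in enumerate(compounds):
        neighbors = sum(1 for j, d in enumerate(compounds)
                        if i != j and tanimoto(c["fp"], d["fp"]) >= sim_cutoff)
        if neighbors >= min_neighbors:
            kept.append(c)
    return kept

compounds = [
    {"id": "A", "rings": 2, "mw": 350.0, "fp": {1, 2, 3, 4}},
    {"id": "B", "rings": 2, "mw": 364.0, "fp": {1, 2, 3, 5}},
    {"id": "C", "rings": 3, "mw": 410.0, "fp": {1, 2, 3, 4, 5}},
    {"id": "D", "rings": 1, "mw": 320.0, "fp": {7, 8, 9}},  # dissimilar outlier
    {"id": "E", "rings": 0, "mw": 120.0, "fp": {1, 2}},     # fails first pass
]
survivors = neighbor_filter(first_pass(compounds), sim_cutoff=0.5, min_neighbors=2)
survivor_ids = [c["id"] for c in survivors]
```

Raising the similarity cutoff or the neighbor threshold trades more missed exemplars for fewer false positives, which is the direction Steven preferred.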

All VEGFR2 patents were analyzed, but most are not small molecule patents. The system is heavily dependent on the quality of Google’s optical character recognition (OCR). Steven believes that Google’s OCR is better than ChemAxon’s but Acrobat’s OCR would be even better. Steve Boyer’s solution is the best one; Takeda is the only pharma that does not use it. In the end, the article for the Journal of Expert Opinion on Therapeutic Patents was never produced, but the speed of Takeda’s approach to analyzing patents opens up opportunities for a diverse set of applications. (Presentation in our Library)

Markush technology

At the Markush forum before the main user meeting, Árpád Figyelmesi of ChemAxon described the Thomson Reuters Markush evaluation service. (This is not yet a product.) ChemAxon has invested a lot of effort in redesigning its Markush software. In the last six months, search time has been reduced by 90% and memory requirement has been reduced by 70%. The distributed configuration for the cartridge offers almost linear scaling. Nested R-groups can now be handled on the query side.

The patent search form in IJC is designed mainly for IP experts, for searching bibliographic data, structures, claims, classification, and details in one place. It allows structure and advanced text queries, and has an easy report generation function. The example search form is designed mainly for quick structure searching in the Derwent Chemistry Resource (DCR) database, to look for specific compounds. It presents structure hits and bibliographic information for related inventions. An easy structure export function and a simple grid view are also available. The Markush structures search form is designed mainly for quick structure searching in the Markush database. Display and export functionalities are similar to those for example search. Automatic batch searching of multiple structures in the Markush database can now be carried out, but this is only for structure searching. Results are stored in a database and can be post-filtered. Again, there is a structure export function. In all cases, after a search, a patent document in the relevant patent authority online database can be opened in the browser by double clicking on the publication number.

The new IJC Web Client is an alternative user interface. It has patent, example and Markush search forms, and is simple to use, but it has some limitations: no batch search form, no scripted buttons, no simple access to external databases, and limited post processing and hit visualization. The data are held on a local server or in the cloud. The complete patent database from Thomson Reuters (VMN files of Markush structures, exemplified structures from DCR, and textual information from Derwent World Patents Index (DWPI)) is stored in Amazon Cloud with a powerful virtual machine, a secure connection and confidential search.

David Deng demonstrated ChemAxon’s Markush technology in IJC. Useful new features are exporting exemplified structures, retrieving a patent document, an improved Markush enumeration interface, addition of notes (e.g., “relevant to project Y”) and viewing by note, and batch search of multiple queries. David showed the new search interface, including customized buttons for exporting exemplified structures and retrieving patent documents. He showed the redesigned Markush enumeration interface, with “Markush reduction according to the hit” (expanding the core to contain the substructure query), and the query aligned and colored in the enumerated structures. Structures can be filtered, for example, by druglike criteria, and exported for customized study. Enumeration speed has been improved. David also did a structure search and demonstrated the improved R-group hit visualization in the Markush viewer. Additionally he did R-group decomposition on exemplified structures based on the Markush core, generated a Markush structure, added additional R-group definitions, enumerated the resultant library and saved it as a local file. You do have to specify the core, but in future you will be able to use ChemAxon’s maximum common substructure (MCS) algorithm to generate the core.

In the main user meeting Steve Hajkowski of Thomson Reuters gave another viewpoint on the technology. His company has indexed 2.6 million patent families, including 1.6 million Markush structures, re-drawn and stored in Thomson Reuters’ .vmn file format. Some 2.15 million specific compounds are indexed with the corresponding DWPI records. Pharmaceutical, agrochemical and general chemistry patents are covered from 28 patent issuing authorities. The indexing is part of the editorial process that creates DWPI, with informative English language titles and abstracts from worldwide patents, and patent family listings, and patent assignees. The Markush structure data are available alongside corresponding specific compounds and DWPI records. They are loaded into ChemAxon’s JChem database tools, in-house or in the new cloud-hosted system. The data can be searched, enumerated and integrated into the customer’s own workflows.

The 2013 database is much faster to search: comparable to Questel’s, or faster. Other features are a revised user interface, improved search algorithms and new export options (the user can create reports with structures, DWPI, and abstracts together). The backfile from 1978 has been added for US, EP and WO patents. A customer working group has been active since mid-2012, advising on product needs; the group has concluded that the product provides direct access to Markush structure data, offering high volume search capability, coupled with unique visualization and structure mining features.

To demonstrate the utility of the ChemAxon system, the Markush Working Group requested benchmark searches for 50 drug molecules, comparing results for ChemAxon against the Merged Markush Service (MMS) on Questel. Eighty-four percent overlap with MMS was achieved for the full structures and 78% for the generic searches: a good result as there is inherent “fuzziness” in Markush structure searching. In addition the ChemAxon search retrieved relevant results not found by Questel, e.g., in a search for the antiulcer drug rabeprazole, ChemAxon found 12 patents not retrieved by the Questel MMS search engine.

Having Markush structures, specific compounds and English language DWPI summaries integrated together in a single platform is convenient and time-saving. Query structures are displayed in context within the hit Markush structures, saving the user time in identifying key hits from a results set. ChemAxon’s enumeration features unlock a Markush structure to allow the user to see the real compounds described within it. Enumerated structures can be exported in standard chemical data formats, allowing potential integration into other analysis tools.
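In its simplest form, enumeration is a Cartesian product over the R-group definitions attached to a core. The Python sketch below illustrates only that idea, using text placeholders in a SMILES-like template; the template syntax and function name are assumptions, and real Markush structures (variable attachment points, repeat units, homology groups, nested R-groups) need far more machinery than string substitution.

```python
from itertools import product

def enumerate_markush(core, r_groups):
    """Yield one structure string per combination of R-group choices."""
    names = sorted(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        smiles = core
        for name, substituent in zip(names, combo):
            smiles = smiles.replace("{" + name + "}", substituent)
        yield smiles

# Toy core with two substitution sites (illustrative only).
core = "c1ccc({R1})cc1{R2}"
r_groups = {"R1": ["C", "CC", "OC"], "R2": ["F", "Cl"]}
library = list(enumerate_markush(core, r_groups))
```

The library size is the product of the number of choices at each site, here 3 x 2 = 6, which is why filtering (for example by druglike criteria) matters before export.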

Use cases are novelty and freedom to operate searching by IP departments (easy assessment of the structural proximity of the query structure to the Markush structure hit); screening of new structures against existing IP; potential use in white-space identification, patent busting etc.; and creation of libraries of specifics for use in other systems. Steve did a short demonstration of substructure search against the Markush dataset, display of a patent hit and some hit structures, the Markush viewer, and the enumeration tool.

In 2012, there were more than 8000 patents claiming new pharmaceutical Markush structures, and more than 3700 patents claiming new non-pharma Markush structures. Merck and Co. topped the table with 129 Markush patents. Roche was close behind with 119. Bayer, Abbott and GlaxoSmithKline followed. Steve presented a pie-chart of Markush pharma patenting territories in 2012.

In a case study Steve analyzed all Markush patents claiming new molecules containing the structure fragment below, and looked at companies, timeframes, activities, and inventors.

A structure search of the Thomson Reuters data generated hits from 122 patents, with hits found back to 1988. The patenting timeline showed maxima in 1992 (26 patents) and 2008 (19 patents). There were fewer than 5 patents in each of the years 1996 to 2002. A variety of activities was claimed, ranging from hypotensive in 88% of patents, down to enzyme inhibitors in 25%.

Steve concluded that the new cloud-hosted Markush solution offers comprehensive data via a powerful and convenient platform. Markush analysis allows us to view the chemistry patent landscape, with a broad view of trends for companies, countries etc., and a narrower view for particular structure types, showing, for example, which companies are active, trends over time, top inventors, and drug activities. (Presentation in our Library)

Closing remarks

Alex Drijver concluded the meeting. Like me, he observed what a large amount of new functionality is coming out of ChemAxon, and how much is being done by users too. ChemAxon learns a lot from these events and sends a big team from Budapest because it “gets them closer to the fire”. ChemAxon has been in business for 15 years; Alex has served 5 years. There are 130 people at the company now, compared with about 25 five years ago. There has been exponential growth in revenues too. There are 630 paying customers now, and 18,000 unique users. ChemAxon is “growing and going”. Within the customer base, a position has been built with large pharma. More SMEs than large pharma attend the San Diego meeting; ChemAxon has not forgotten its roots and is very much at home in biotechs and SMEs too. Press releases tend to mention big pharma but there has been much growth in biotechs too. Partnering has also grown and is an important part of the ChemAxon business: 30% in revenue terms. ChemAxon is competing with its own partners but there is a complementary aspect too. A balance has to be maintained. The product portfolio grows in breadth and depth but the company is still the same ChemAxon and the core of the software is the same across the whole platform. The common core of the company grows. Many people are still with the company. There are no “managers” in ChemAxon. There is no hierarchy. The culture is informal but not unserious. This makes ChemAxon accessible. ChemAxon is still spelled ChemAxon and looks forward to the next 15 years.


ChemAxon has been profitable since inception with 30-50% growth year on year. And what a long way it has come since Alex took over five years ago! My congratulations to the company on 15 years of continuing success. I agree with Alex: I too am looking forward to the next 15 years.
