Augmenting chemical structures with calculated properties can strengthen workflows across many research domains. Dedicated, secure, and scalable services are well suited to this task, regardless of the size of the data to process. We have created a proof-of-concept workflow demonstrating how the Chemaxon Calculators and Predictors service can be used to add calculated properties to molecule records. The code example, published on GitHub, presents a simple workflow with two operational steps (see Figure 1).
Figure 1. Process and architecture overview.
Step 1) Divide: The first part of the workflow parses the input file and splits it into manageably sized tasks; this is performed by a Lambda function (cxn-csv-parser). Molecules are read from the input SMILES file stored in an Amazon Simple Storage Service (S3) bucket and grouped into messages, each containing a chunk of 25 molecules, which are sent to the Simple Queue Service (SQS). SQS queues the messages reliably so that all entries can be processed at scale before they are sent to our Calculators endpoint.
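The chunking in this step can be sketched in a few lines of Python. This is a minimal illustration, not the code of cxn-csv-parser: the function names, the assumption that the SMILES string is the first whitespace-separated token on each line, and the `structures` field in the message body are all hypothetical. Sending each body to SQS (for example with an AWS SDK client) is left out so that the sketch stays self-contained.

```python
import json
from typing import Iterable, Iterator, List

BATCH_SIZE = 25  # molecules per message, as described in Step 1


def read_smiles(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first whitespace-separated token of each non-empty line.

    Assumption: one molecule per line, SMILES first, optional name after it.
    """
    for line in lines:
        parts = line.split()
        if parts:
            yield parts[0]


def chunk(items: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements, preserving order."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def to_messages(lines: Iterable[str]) -> List[str]:
    """Build one JSON message body per batch (hypothetical message layout)."""
    return [json.dumps({"structures": batch}) for batch in chunk(read_smiles(lines))]
```

With 60 input molecules, this yields three message bodies holding 25, 25, and 10 structures; each body would then be sent to the queue as one message.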
Step 2) Conquer: The second Lambda function (cxn-cns-mpo) receives the 25-molecule batches from the queue and composes the payload for the Calculator endpoint. In this code example, the CNS-MPO score is calculated. This multi-parameter score combines protonation (pKa), lipophilicity (logP, logD), hydrogen bond donor count, mass, and polar surface area (PSA) calculations to yield the final score. The Lambda function sends the request to the Calculator service, collects the response, and persists the calculated data in a DynamoDB table. The function is executed in parallel so that the workflow scales to a throughput of 50 requests/sec.
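The control flow of this step might look roughly like the sketch below. It is not the code of cxn-cns-mpo. The `Records` list with a `body` string per record is the shape in which Lambda delivers SQS messages; everything else is an assumption made for illustration: the `structures` field in the message body, the `cns_mpo` attribute name, and the `calculate` and `store` callables, which stand in for the HTTP request to the Calculator service and the DynamoDB write respectively.

```python
import json
from typing import Any, Callable, Dict, List


def handle_batch(
    event: Dict[str, Any],
    calculate: Callable[[List[str]], List[float]],
    store: Callable[[Dict[str, Any]], None],
) -> int:
    """Process every SQS record in a Lambda event and return the item count.

    `calculate` (hypothetical) takes a list of SMILES and returns one score
    per structure, in the same order. `store` (hypothetical) persists one
    item, for example by writing it to a DynamoDB table.
    """
    stored = 0
    for record in event["Records"]:
        # Each SQS message body is a JSON string; the layout is an assumption.
        structures = json.loads(record["body"])["structures"]
        scores = calculate(structures)  # one request per 25-molecule batch
        for smiles, score in zip(structures, scores):
            store({"smiles": smiles, "cns_mpo": score})
            stored += 1
    return stored
```

Passing the service call and the persistence step in as functions keeps the handler easy to test locally with stand-ins, without network access or AWS credentials.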
This workflow pattern shows that executing calculations and storing the generated data can be as manageable as uploading a file and invoking a single function, even for large collections. Calculator resource scaling has been optimized under the hood, relying on Fargate technology to handle high peak loads. According to our proof-of-concept calculation, a sustained performance of 53 ms/cpd per thread (~1 ms/cpd global) can be achieved on an 8.8M-compound collection.
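As a back-of-the-envelope check, the quoted figures can be related to each other with simple arithmetic. The sketch below uses only the rounded numbers from this post, so the results are estimates rather than measurements; the variable names are illustrative.

```python
PER_THREAD_MS_PER_CPD = 53.0  # per-thread time per compound (from the post)
GLOBAL_MS_PER_CPD = 1.0       # approximate overall time per compound
COMPOUNDS = 8_800_000         # size of the collection
REQUESTS_PER_SEC = 50         # target request rate from Step 2
MOLECULES_PER_REQUEST = 25    # batch size from Step 1

# Degree of parallelism implied by the per-thread and overall figures.
implied_parallelism = PER_THREAD_MS_PER_CPD / GLOBAL_MS_PER_CPD

# Wall-clock time for the whole collection at ~1 ms per compound.
total_hours = COMPOUNDS * GLOBAL_MS_PER_CPD / 1000 / 3600

# Throughput implied by the request rate and the batch size.
molecules_per_sec = REQUESTS_PER_SEC * MOLECULES_PER_REQUEST
ms_per_cpd_from_rate = 1000 / molecules_per_sec

print(f"implied parallelism: ~{implied_parallelism:.0f} threads")
print(f"total time: ~{total_hours:.1f} hours")
print(f"{molecules_per_sec} molecules/sec, {ms_per_cpd_from_rate:.1f} ms/cpd")
```

Under these rounded inputs, the figures correspond to roughly 53 concurrent threads and about 2.4 hours for the full collection, and 50 requests/sec at 25 molecules each gives 1,250 molecules/sec, or 0.8 ms/cpd, which is consistent with the ~1 ms/cpd overall figure.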
Delve into the code on GitHub, subscribe below and take advantage of the Calculator service in your project.