
Processing large research molecule libraries to identify compounds that fall under generic regulations requires a computationally processable representation of the complicated regulatory text. In commercial compliance software, content specialists translate these generic definitions into chemical query structures, sometimes highly complex Markush structures, which a search engine then matches against the input compounds.
Markush structures representing generic definitions
It's crucial to be able to monitor the status of compounds throughout every stage of the drug discovery and development process—such as design, compound registration, inventory, and shipping. This ongoing visibility ensures that, beyond the initial structural review for compliance with current regulations, any newly introduced legislation can also be applied to the compound as it progresses through each phase.
Markush structures are generic descriptions created for specifying collections of chemically related compounds. They were first used by Eugene A. Markush, who claimed generic chemical structures in a 1924 application to the U.S. Patent Office for a patent on pyrazolone dyes. Since then, they have gained widespread use in chemical patents and other fields. The invariable part of the structure – the scaffold – includes the common structural features of the collection.
The variable parts can be described by:
- Substituent variation – listing a set of different substituents at a position
- Position variation – different attachment points/positions for a substituent
- Frequency variation – allowing substituents to occur multiple times in a chain or a part of a ring
- Homology groups – general nomenclatural expressions covering a large or theoretically infinite number of substructures with a common structural feature, like “aryl”.
As a result of the structural/chemical diversity, a single Markush structure might cover a huge chemical space possibly representing an unlimited number of molecules.
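To make substituent variation concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how any search engine handles Markush structures: the scaffold template, the substituent lists, and the function name `enumerate_markush` are all hypothetical, and structures are handled as SMILES strings without any chemistry toolkit.

```python
from itertools import product

# Hypothetical scaffold: a para-disubstituted benzene ring as a SMILES template.
# {r1} and {r2} mark the two variable positions.
SCAFFOLD = "{r1}c1ccc({r2})cc1"

# Hypothetical substituent lists for the two positions (assumptions for illustration).
R1_OPTIONS = ["C", "CC", "F"]  # methyl, ethyl, fluoro
R2_OPTIONS = ["O", "N"]        # hydroxyl, amino


def enumerate_markush(scaffold, r1_options, r2_options):
    """Return every concrete SMILES string covered by the template."""
    return [
        scaffold.format(r1=r1, r2=r2)
        for r1, r2 in product(r1_options, r2_options)
    ]


members = enumerate_markush(SCAFFOLD, R1_OPTIONS, R2_OPTIONS)
for smiles in members:
    print(smiles)
print(len(members), "structures from one scaffold")
```

Even this toy template with three and two options yields six structures. Each extra variable position multiplies the count again, which is why explicit enumeration quickly becomes impractical for real definitions.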
Generic definitions controlling fentanyl-related substances in real life
What does this transformation into a Markush structure look like in real life? Let’s take a closer look by transforming point (B) of the generic description of fentanyl derivatives in the United States:
“Fentanyl-related substance means any substance not otherwise listed under another Administration Controlled Substance Code Number, and for which no exemption or approval is in effect under section 505 of the Federal Food, Drug, and Cosmetic Act [21 U.S.C. 355], that is structurally related to fentanyl by one or more of the following modifications:
(A) Replacement of the phenyl portion of the phenethyl group by any monocycle, whether or not further substituted in or on the monocycle;
(B) Substitution in or on the phenethyl group with alkyl, alkenyl, alkoxyl, hydroxyl, halo, haloalkyl, amino or nitro groups;
(C) Substitution in or on the piperidine ring with alkyl, alkenyl, alkoxyl, ester, ether, hydroxyl, halo, haloalkyl, amino or nitro groups;
(D) Replacement of the aniline ring with any aromatic monocycle whether or not further substituted in or on the aromatic monocycle; and/or
(E) Replacement of the N-propionyl group by another acyl group.”
On the phenethyl fragment, there are nine distinct positions available for substitution. If we limit the scope to methyl derivatives and mono-type haloalkyl substituents (–CH₃X, –CH₂X₂, –CHX₃, and –CX₄, where X = F, Cl, Br, or I), the total number of unique substituents, including hydrogen, amounts to 26. Consequently, the total number of possible substitution patterns is 26⁹ ≈ 5 × 10¹². Thus, under these constraints, there are approximately 5 trillion theoretical substitution patterns for the selected phenethyl scaffold.
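The figure above can be checked with a few lines of Python. This sketch simply reproduces the arithmetic in the text: it treats the nine positions as independent and distinct, and takes the count of 26 substituents as given.

```python
# Values taken from the text above.
positions = 9       # substitutable positions on the phenethyl fragment
substituents = 26   # allowed substituents per position, including hydrogen

# Each position is filled independently, so the counts multiply.
patterns = substituents ** positions

print(f"{patterns:,} substitution patterns")
print(f"approximately {patterns:.1e}")
```

The exact value is 5,429,503,678,976, which is the "approximately 5 trillion" quoted above.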
Considering the full scope of this generic definition, it becomes clear that efficiently searching the covered chemical space requires a sophisticated algorithm capable of handling this combinatorial complexity.
In the context of generic definition processing, the development of specialized algorithms for the efficient search in Markush structures is essential—but not sufficient. As certain generic definitions currently in use cannot be adequately represented using traditional Markush structure notation, a cheminformatics toolbox with scripting capability is also required to achieve accurate structural representation.
For commercial controlled substance identification software, such as Compliance Checker from Chemaxon, teams of experts translate generic definitions into the appropriate Markush structures, which can then be used to identify hits against a single entry or a list of structures in a timely and automated manner.
How does this work in practice? Compliance Checker from Chemaxon is a software system with a continuously updated regulation package. It is designed to allow anything, from a screen or single chemical structures to large collections, to be checked against controlled substance legislation.
Case study
We wanted to know whether libraries and databases containing molecules with drug-like properties include fentanyl-related compounds. To investigate, we interrogated the manually curated ChEMBL 35, a library of around 2.5 million compounds, to identify fentanyl-related molecules. Using this large database also lets us verify the performance of the software: it is comparable in size to many pharma libraries, so it shows how the software would perform in a ‘real world’ situation. The experiment was performed using a Compliance Checker SaaS Premium subscription, against the full range of covered regulations. Prior to compliance checking, the molecular database was preprocessed: salts and solvents were removed, duplicate structures were eliminated, and only molecules with a molecular weight between 120 Da and 1200 Da were retained. In total, approximately 2.318 million structures were assessed. The screening process was completed within three hours, with the following counts of fentanyl derivatives detected:
Country | Fentanyl derivatives | Country | Fentanyl derivatives |
Singapore | 1108 | Denmark | 30 |
Switzerland | 458 | France | 27 |
United States | 214 | Canada | 26 |
United Kingdom | 183 | Germany | 25 |
Italy | 169 | Yellow List | 25 |
The results suggest that legislation is relatively consistent among countries that employ lists of individual derivatives. However, countries using generic definitions show significant variability, covering different chemical spaces. The reasons for this are speculative, but in the author's opinion, it may be due to the ease of fentanyl production, which reduces the immediate need for new "legal" derivatives to meet demand.
This stands in stark contrast to the situation with synthetic cannabinoids, where a prolonged tug-of-war has unfolded between producers of so-called “legal highs” and regulatory authorities—particularly in the UK. The surge in synthetic cannabinoid analogues during the 2000s, especially in the latter half of the decade, prompted significant changes in how these substances were defined legally. As these new compounds began appearing on UK streets with increasing frequency, a broad, generic definition was introduced to encompass this third generation of synthetic cannabinoids. This culminated in the implementation of The Misuse of Drugs (Amendment) (England, Wales, and Scotland) Regulations 2016, an update to the original Misuse of Drugs Act 1971.
[Insert ChEMBL 35 vs synthetic cannabinoids comparison here.]
This variation highlights the importance of understanding local legislation in each country. As life science R&D becomes increasingly global, driven by outsourcing, collaboration, and externalisation, this awareness is more critical than ever in the pharmaceutical industry. Synthesis, compound management, and testing operations are now spread across international sites, relying heavily on the rapid and efficient cross-border shipment of compounds to the appropriate testing facilities. This makes it important to be able to easily identify controlled compounds, backed by knowledge of de minimis and research exemptions in different territories.
As mentioned earlier in relation to performance, it's worth highlighting that Compliance Checker processed a library of 2.3 million unique structures across 18 countries in approximately three hours. In a pharmaceutical context, this means that even large libraries, such as those containing around five million compounds, can be efficiently screened against global legislation over a weekend. Based on the results, compounds can then be flagged or removed as needed according to company policy.
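As a rough back-of-the-envelope check of the weekend estimate, the sketch below extrapolates from the case study figures. It assumes that run time scales linearly with library size, which is an assumption made here for illustration and not a measured property of the software.

```python
# Figures from the case study above.
structures_screened = 2_318_000  # structures assessed
hours_taken = 3.0                # approximate run time in hours

# Assumption: throughput stays constant as the library grows.
rate_per_hour = structures_screened / hours_taken

target_library = 5_000_000       # example large pharma library
estimated_hours = target_library / rate_per_hour

print(f"Throughput: {rate_per_hour:,.0f} structures per hour")
print(f"Estimated time for {target_library:,} structures: {estimated_hours:.1f} hours")
```

Under that assumption, a five-million-compound library would take somewhat under seven hours, comfortably within a weekend.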
