R-group Decomposition

Version 5.1.0

Introduction

R-group decomposition is a special kind of substructure search that aims to find a central structure (the scaffold) and to identify its ligands at certain attachment positions. The query molecule consists of the scaffold and ligand attachment points represented by R-groups. In most cases these R-groups are simple R-group atoms without R-group definitions. An example query structure is shown below:

Query

Note that there are two R1 atoms referring to symmetrical ligand positions. By default, this means that the matching ligands must be identical. You can change this behavior by setting the --skip-same-structure-check parameter, in which case the same R-group index is only used to denote symmetrical positions on the scaffold, and different ligands at these positions are also accepted.

Ligand attachments are allowed at implicit H atoms, but these attachment points are not stored and are not shown in the output. To allow attachments only at R-group positions, you should add explicit H atoms in place of all implicit hydrogens. This can also be done automatically with the H query modification option (-m H). The resulting query structure is shown below:

Hydrogenized query

To achieve just the opposite, there is another query transformation option that adds R-groups in place of implicit hydrogens, allowing and storing all attachments at implicit hydrogens. This is useful if you are interested in all scaffold-ligand attachments. In this case you do not need to add R-group connections manually to your original query if you use the Ra or Rs query modification option (-m Ra or -m Rs). The resulting query structure is shown below:

R-grouped query

As an example, take the following targets:

Targets

Decompositions using the different query options (no modification, hydrogenize, add R-groups) are shown below. By default, decomposition is generated for the first hit only. To process all hits, set the --allHits option.

Standardization may be necessary, and the aromatization task is usually needed: substructure search requires aromatized query and target structures and also assumes that the same functional group representation is used in the query and the target molecules (e.g. nitro groups; also consider tautomeric and mesomeric forms). If your input file format contains the non-aromatized form of the molecules (e.g. SDF), then aromatization should be specified. Standardization can be specified with the --standardize option.

The following examples show some decomposition tables that can be obtained by running the rgdecomp command line tool or by using the R-group Decomposition API directly. Ligand attachment points are represented by a connection to an any-atom in the scaffold. Atom color codes are defined in Colors.ini, and coloring data is stored in the molecule property "DMAP". In the examples below, we choose the MRV output format so that this color data can be stored in an MRV tag, and we specify this tag name together with the color definition file when running mview. To get a nice table output, we also specify the number of columns in the -c parameter. Alternative decomposition output styles for the above query and targets are shown later. To run these examples, refer to the preparation instructions.

 

Usage

Usage:
  rgdecomp [options] -q <query file/string> [target file(s)/string(s)]

Prepare the usage of the rgdecomp script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Search options are identical to those of jcsearch. This section describes the command line options specific to R-group decomposition.

Options:
  -h, --help                           this help message
 
Input options:
  -q, --query <query>                  query SMARTS string or file
  -m, --query-modification <H|Ra|Rs>   query modification options:
                                       H:      add explicit hydrogens
                                       Ra, Rs: attach unique rgroup nodes
                                               in place of missing bonds
                                               Ra: with any-bonds
                                               Rs: with single-bonds
  -S, --standardize <file/string>      standardize query and target
                                       according to configuration file/string
  -g, --ignore-error                  continue with next molecule on error
 
Output options:
  -a, --attachment-symbol <N|P|A|M|L>  attachment symbol on ligands:
                                       N: none
                                       P: attachment point
                                       A: any-atom (default)
                                       M: atom map
                                       L: atom label
  -s, --style <HTS>                    output style (multiple choice):
                                       H: include header (default)
                                       T: include target (default)
                                       S: include scaffold
  -p, --skip-same-structure-check      allow different structures
                                       matching identical rgroup nodes
  -i, --id <ID field>                  ID field in target, 
                                       to be displayed in SMILES table output
                                       set '=' to display target index as ID
  -f, --format <format>                output file format:
                                       SMILES table if omitted,
                                       molecule series output otherwise
  -o, --output <filepath>              output file (default: standard output)

Search options:
  -A, --allHits                        process all hits
  ...

  1. Query and target standardization can be specified with the --standardize option: the standardization configuration is given either directly as a simple action string or as the path of a configuration XML file. Note that substructure search requires aromatized molecules; therefore, if your input file format does not support the aromatized form (e.g. SDF), you must specify at least -S "aromatize".

  2. We can require query modification by setting the --query-modification option:

    • H for hydrogenize: restricts ligand attachments to R-group positions
    • Ra for adding R-groups: allows and stores all scaffold-ligand any-bond attachments
    • Rs for adding R-groups: allows and stores all scaffold-ligand single-bond attachments

    If the query has no R-group nodes then the Rs modification is applied automatically.

  3. We can set the attachment symbols by the --attachment-symbol option:

    • N: none
    • P: attachment point - a small mark beside the attachment atom
    • A: any-atom (default) - an any-atom is attached to the attachment atom representing the connection to the scaffold
    • M: atom map representing the corresponding R-group index
    • L: an atom label representing the corresponding R-group index

    Note that the default any-atom representation and atom maps can be exported in all molecule file formats, while attachment points are not available in SMILES and atom labels are only supported in MRV.

  4. We can set the output format with the --format option. The output is

    • a tab-separated SMILES table if the output format is omitted, including a target ID column if the ID field is specified in the --id parameter
    • a molecule file with a series of query, R-group, target and ligand molecules otherwise, which can be seen as a colored molecule table if read in mview with appropriate options defining the color palette, the color symbol molecule property name and the number of table columns

    In both cases, data included in the output can be specified in the --style option (set any combination of the following letters):

    • H: include query header
    • T: include targets
    • S: include scaffold
    The default is HT.

  5. When the query contains R-group nodes with the same R-group index, these nodes represent identical ligand structures by default. If we set the --skip-same-structure-check option, different structures are allowed to match these nodes. In this case the identical R-group indexes represent symmetrical attachment positions on the scaffold and have no implication for the matching target structures.
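
The effect of this option can be illustrated with a small Python sketch. It is not the rgdecomp implementation: it assumes that one decomposition row has already been split into (R-group label, ligand SMILES) pairs, and it compares ligands by plain string equality, which is only meaningful if the ligand SMILES are written consistently. The function name same_label_ligands_identical is hypothetical. The two sample rows are taken from the example outputs later in this document.

```python
from collections import defaultdict


def same_label_ligands_identical(pairs):
    """Return True if all ligands sharing an R-group label are identical.

    pairs: iterable of (label, ligand_smiles), e.g. ("[*:1]", "CCC(N)*").
    Plain string comparison is an assumption made for this sketch.
    """
    by_label = defaultdict(set)
    for label, ligand in pairs:
        by_label[label].add(ligand)
    return all(len(ligands) == 1 for ligands in by_label.values())


# First target of the examples: both R1 ligands are the same.
strict_ok = [("[*:1]", "CCC(N)*"), ("[*:1]", "CCC(N)*"), ("[*:2]", "Br*")]

# A row for the fourth target: the two R1 ligands differ, so under the
# default check this decomposition would not be accepted.
needs_skip = [("[*:1]", "CC(*)c1cc(Cl)cc(C)c1Br"), ("[*:1]", "NC*"),
              ("[*:2]", "*[H]")]

print(same_label_ligands_identical(strict_ok))
print(same_label_ligands_identical(needs_skip))
```

Rows for which this function returns False correspond to decompositions that appear in the examples only when --skip-same-structure-check is set.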

By default, only one decomposition for each target corresponding to the first search hit is presented in the output. If the rgdecomp command line option --allHits is specified, then all possible decompositions are listed.

If the command line parameter --ignore-error is specified, then import/export errors do not stop the processing: the error is written to the console and the molecule is skipped. By default, the program exits on molecule import/export errors.
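
When rgdecomp is driven from a script, it can help to assemble the argument list in one place and reject unsupported option values before the tool is started. The following Python sketch does this using only the flags and values listed in the help text above. The helper build_rgdecomp_args and its parameter names are hypothetical, and the sketch only builds the command line; it does not run rgdecomp.

```python
import shlex

# Allowed values copied from the option list above.
QUERY_MODIFICATIONS = {"H", "Ra", "Rs"}
ATTACHMENT_SYMBOLS = {"N", "P", "A", "M", "L"}
STYLE_LETTERS = set("HTS")


def build_rgdecomp_args(query, targets, standardize=None, modification=None,
                        attachment=None, style=None, skip_same_check=False,
                        id_field=None, fmt=None, output=None,
                        all_hits=False, ignore_error=False):
    """Assemble an rgdecomp argument list (hypothetical helper)."""
    if modification is not None and modification not in QUERY_MODIFICATIONS:
        raise ValueError(f"unknown query modification: {modification!r}")
    if attachment is not None and attachment not in ATTACHMENT_SYMBOLS:
        raise ValueError(f"unknown attachment symbol: {attachment!r}")
    if style is not None and not set(style) <= STYLE_LETTERS:
        raise ValueError(f"style may only contain H, T and S: {style!r}")

    args = ["rgdecomp"]
    if standardize:
        args += ["-S", standardize]
    if skip_same_check:
        args.append("-p")
    if modification:
        args += ["-m", modification]
    if attachment:
        args += ["-a", attachment]
    if style:
        args += ["-s", style]
    if id_field:
        args += ["-i", id_field]
    if fmt:
        args += ["-f", fmt]
    if output:
        args += ["-o", output]
    if all_hits:
        args.append("--allHits")
    if ignore_error:
        args.append("-g")
    # Query and targets come last, following the usage line.
    args += ["-q", query]
    args += list(targets)
    return args


args = build_rgdecomp_args("query.mol", ["targets.sdf"],
                           standardize="aromatize", skip_same_check=True,
                           id_field="=", all_hits=True)
print(shlex.join(args))
```

The returned list can be passed to subprocess.run without going through a shell, which avoids quoting problems with values such as the standardizer action string.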

 

Examples

To run these examples:

  1. The Java Virtual Machine version 1.4 or higher and JChem have to be installed on your system.
  2. The PATH (all systems) and the JCHEMHOME (under Windows) environment variables have to be set as described in the Preparing and Running JChem's Batch Files and Shell Scripts manual.
  3. A command shell (under UNIX / Linux: your favorite shell, under Windows: a Cygwin shell or a Command Prompt) has to be run in the RGroupDecomposition_files subdirectory.
    In UNIX / Linux:
    cd jchem/doc/user/RGroupDecomposition_files
    
    In Windows:
    cd jchem\doc\user\RGroupDecomposition_files
    

In the following examples we use the query and targets from the introduction. You can type these examples and see the results yourself in the subdirectory RGroupDecomposition_files where you can find the input files query.mol and targets.sdf.

  1. SMILES table output (no -f parameter is specified):
    rgdecomp -S "aromatize" -q query.mol targets.sdf
    
    Clc1cc(c(c(c1)[*:1])[*:2])[*:1] [*:1]   [*:1]   [*:2]
    CCC(N)c1cc(Cl)cc(C(N)CC)c1Br    CCC(N)* CCC(N)* Br*
    Oc1c(Cl)cc2CCCC3CCCc1c23        *CCCC(*)CCC*    *CCCC(*)CCC*    *CCCC(*)CCC*
    CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(Cl)cc(C)c1Br  CC(*)c1cc(Cl)cc(C)c1Br  *[H]
    
  2. SMILES table output with all decompositions listed, allowing the two R1 query nodes to match different ligands and displaying the target index in the ID column:
    rgdecomp -S "aromatize" -p -q query.mol targets.sdf -i = --allHits
    
    ID      Clc1cc(c(c(c1)[*:1])[*:2])[*:1] [*:1]   [*:1]   [*:2]
    1       CCC(N)c1cc(Cl)cc(C(N)CC)c1Br    CCC(N)* CCC(N)* Br*
    2       Oc1c(Cl)cc2CCCC3CCCc1c23        *CCCC(*)CCC*    *CCCC(*)CCC*    *CCCC(*)CCC*
    3       CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(Cl)cc(C)c1Br  CC(*)c1cc(Cl)cc(C)c1Br  *[H]
    3       CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br     C*      Br*
    3       CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(cc(Cl)c1O)C(C)c2cc(Cl)cc(C)c2Br       C*      Br*
    4       CC(c1cc(CN)cc(Cl)c1O)c2cc(Cl)cc(C)c2Br  CC(*)c1cc(Cl)cc(C)c1Br  NC*    *[H]
    4       CC(c1cc(CN)cc(Cl)c1O)c2cc(Cl)cc(C)c2Br  CC(*)c1cc(CN)cc(Cl)c1O  C*     Br*
    
  3. The same, taking the IDs from the ID molecule field:
    rgdecomp -S "aromatize" -p -q query.mol targets.sdf -i ID --allHits
    
    ID      Clc1cc(c(c(c1)[*:1])[*:2])[*:1] [*:1]   [*:1]   [*:2]
    id1     CCC(N)c1cc(Cl)cc(C(N)CC)c1Br    CCC(N)* CCC(N)* Br*
    id2     Oc1c(Cl)cc2CCCC3CCCc1c23        *CCCC(*)CCC*    *CCCC(*)CCC*    *CCCC(*)CCC*
    id3     CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(Cl)cc(C)c1Br  CC(*)c1cc(Cl)cc(C)c1Br  *[H]
    id3     CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br     C*      Br*
    id3     CC(c1cc(Cl)c(O)c(c1)C(C)c2cc(Cl)cc(C)c2Br)c3cc(Cl)cc(C)c3Br     CC(*)c1cc(cc(Cl)c1O)C(C)c2cc(Cl)cc(C)c2Br       C*      Br*
    id4     CC(c1cc(CN)cc(Cl)c1O)c2cc(Cl)cc(C)c2Br  CC(*)c1cc(Cl)cc(C)c1Br  NC*    *[H]
    id4     CC(c1cc(CN)cc(Cl)c1O)c2cc(Cl)cc(C)c2Br  CC(*)c1cc(CN)cc(Cl)c1O  C*     Br*
    
  4. SMILES table output with a hydrogenized query, with attachments represented by atom maps and with target, scaffold and ligands included in the output:
    rgdecomp -S "aromatize" -m H -a M -s HTS -q query.mol targets.sdf
    
    [H]c1c(Cl)c([H])c(c(c1[*:1])[*:2])[*:1] [H]c1cccc([H])c1Cl      [*:1]   [*:1]  [*:2]
    CCC(N)c1cc(Cl)cc(C(N)CC)c1Br    Clc1ccccc1      CC[CH2:1]N      CC[CH2:1]N     [BrH:2]
    
  5. Molecule series output in MRV format with all decompositions, allowing the two R1 query nodes to match different ligands and showing the results in MView:
    rgdecomp -S "aromatize" -p -q query.mol targets.sdf -a P -f mrv:-a --allHits -o result3.mrv
    mview -t DMAP -p Colors.ini -c 4 -r 5 result3.mrv
    
    You can also pipe the output of rgdecomp directly to mview under Linux/Unix systems:
    rgdecomp -S "aromatize" -p -q query.mol targets.sdf -a P -f mrv:-a --allHits | mview -t DMAP -p Colors.ini -c 4 -r 5 - 
    
    Note that by specifying the MRV output format in the -f parameter we automatically switch to molecule series output as the output style and also enable the storage of atom color data if the output format is capable of storing molecule fields (e.g. SDF and MRV). Atom color data is stored in the DMAP MRV tag and the color palette is defined in Colors.ini. We also specify the number of table columns in the mview option -c. The decompositions of the third and fourth target molecules are shown below:
    All decompositions of the third and fourth targets
  6. The same using a hydrogenized query, with SDF output and without the query header:
    rgdecomp -S "aromatize" -p -m H -s T -q query.mol targets.sdf -a P -f sdf:-a --allHits -o result4.sdf
    mview -t DMAP -p Colors.ini -c 4 -r 4 result4.sdf
    
    With piping:
    rgdecomp -S "aromatize" -p -m H -s T -q query.mol targets.sdf -a P -f sdf:-a --allHits | mview -t DMAP -p Colors.ini -c 4 -r 4 -
    

    The result is shown below:

    All decompositions with hydrogenized query

  7. R-grouped query (single-bond attachments) with first hit only:
    rgdecomp -S "aromatize" -p -m Rs -a P -q query.mol targets.sdf -f sdf:-a -o result5.sdf
    mview -t DMAP -p Colors.ini -c 6 -r 5 result5.sdf 
    
    With piping:
    rgdecomp -S "aromatize" -p -m Rs -a P -q query.mol targets.sdf -f sdf:-a | mview -t DMAP -p Colors.ini -c 6 -r 5 - 
    

    The result is shown below:

    Decompositions with R-grouped query
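
The SMILES table output of examples 1 to 3 can be post-processed with a few lines of code. The Python sketch below is one possible way to do this, under stated assumptions: the table is tab-separated as described above, it uses the default HT style (header line first, target in the first data column, no scaffold column), and the ID column is present only if requested. The function names parse_smiles_table and group_by_id are hypothetical. Ligands are kept as a list of (label, ligand) pairs rather than a dictionary, because the same R-group label can occur in several columns. The sample rows are copied from example 2.

```python
from collections import defaultdict


def parse_smiles_table(text, has_id=False):
    """Parse a tab-separated SMILES table in the default HT style.

    Returns (header, rows). Each row is a dict with 'id' (None if the
    table has no ID column), 'target' and 'ligands', a list of
    (r_group_label, ligand_smiles) pairs in column order.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    first = 1 if has_id else 0        # index of the query/target column
    labels = header[first + 1:]       # R-group labels after the query
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        rows.append({
            "id": cells[0] if has_id else None,
            "target": cells[first],
            "ligands": list(zip(labels, cells[first + 1:])),
        })
    return header, rows


def group_by_id(rows):
    """Collect all decompositions (rows) that belong to the same target ID."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append(row)
    return dict(grouped)


# Header and three data rows from example 2, joined with explicit tabs.
sample = "\n".join("\t".join(cells) for cells in [
    ["ID", "Clc1cc(c(c(c1)[*:1])[*:2])[*:1]", "[*:1]", "[*:1]", "[*:2]"],
    ["1", "CCC(N)c1cc(Cl)cc(C(N)CC)c1Br", "CCC(N)*", "CCC(N)*", "Br*"],
    ["4", "CC(c1cc(CN)cc(Cl)c1O)c2cc(Cl)cc(C)c2Br",
     "CC(*)c1cc(Cl)cc(C)c1Br", "NC*", "*[H]"],
    ["4", "CC(c1cc(CN)cc(Cl)c1O)c2cc(Cl)cc(C)c2Br",
     "CC(*)c1cc(CN)cc(Cl)c1O", "C*", "Br*"],
])

header, rows = parse_smiles_table(sample, has_id=True)
hits = {tid: len(group) for tid, group in group_by_id(rows).items()}
print(hits)  # {'1': 1, '4': 2}
```

Grouping by ID is mainly useful together with --allHits, where one target can contribute several rows.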
 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.