Standardizer User's Guide

Version 5.1.0


Introduction

Standardizer brings molecules to a standardized form. The same molecule can have different tautomer and mesomer representations, and chemists can draw an aromatic ring in aromatized form or with alternating single and double bonds, and can draw hydrogen atoms explicitly or rely on implicit hydrogens. However, when identifying a molecule or performing a substructure search, we need a common form to work with. For example, if we need the molecules in a database that have a pyrrole ring, we want to find those containing either aromatized or non-aromatized pyrrole rings. We cannot predict whether an unknown input database or SDF file stores molecules in aromatized or non-aromatized form.

There is a brute-force approach for getting the expected search result: perform two searches, one with an aromatized and another with a non-aromatized pyrrole ring as the query, and then merge the results. This doubles the search time for the simple pyrrole example, and the time grows exponentially with the number of groups that can be drawn in more than one form. For example, the query structure below has two alterable functional groups (the pyrrole part and the enol part), which amounts to four cases to be considered in the search:

[Figures: Standardizer intro image 1; Standardizer intro image 2]
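
The growth in the number of query variants can be illustrated with a short Python sketch. The group names and form labels are illustrative placeholders, not Standardizer identifiers; the point is only that n independently alterable groups with two forms each yield 2^n query variants.

```python
from itertools import product

# Hypothetical labels for the alterable groups of the example query;
# each group can be drawn in two forms.
alterable_groups = {
    "pyrrole": ("aromatized", "non-aromatized"),
    "enol": ("enol form", "oxo form"),
}


def query_variants(groups):
    """Return every combination of forms, one dict per query variant."""
    names = list(groups)
    combos = product(*(groups[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


variants = query_variants(alterable_groups)
for variant in variants:
    print(variant)
print(len(variants), "searches needed by the brute-force approach")
```

Adding a third alterable group with two forms would raise the count from four to eight, which is why the brute-force approach does not scale.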

We have chosen a much more efficient way to deal with this problem: first bring the molecules to standardized forms and then perform the search. We define the standard form for each variable group (e.g. the oxo form for oxo-enol tautomers) and drawing mode (e.g. the aromatic ring for aromatic versus alternating bonds). This definition is specified in a configuration file. Standardizer performs the necessary transformations on the molecule in the order they are listed in the configuration file. The attentive reader may have noticed that these transformations also require search operations: we have to find the functional groups or aromatizable rings. However, with thoughtful planning these operations need to be performed only once: the molecules can then be stored in standardized form, and any operation that requires standardization (e.g. substructure search, reaction processing) can be performed on the stored molecules.
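
The "standardize once, search many times" idea can be sketched in Python as follows. This is a toy model, not how Standardizer works internally: structures are plain SMILES strings, the standard forms come from a hand-written lookup table (STANDARD_FORM, a hypothetical name), and the search is an exact string match rather than a substructure search.

```python
# Toy lookup table (an assumption for illustration): each known drawing of a
# structure maps to one chosen standard form. Both keys are non-aromatized
# SMILES drawings of pyrrole; the value is the aromatized drawing.
STANDARD_FORM = {
    "C1=CNC=C1": "c1cc[nH]c1",
    "N1C=CC=C1": "c1cc[nH]c1",
}


def standardize(smiles):
    """Return the standard form of a structure, or the input if unknown."""
    return STANDARD_FORM.get(smiles, smiles)


def build_index(database):
    """Standardize every record once and keep the standardized forms."""
    return [standardize(smiles) for smiles in database]


def search(index, query):
    """Standardize the query, then make a single pass over the stored forms."""
    target = standardize(query)
    return [i for i, stored in enumerate(index) if stored == target]


database = ["C1=CNC=C1", "CCO", "c1cc[nH]c1", "N1C=CC=C1"]
index = build_index(database)   # done once, when the data is loaded
print(search(index, "C1=CNC=C1"))
```

Every later query costs one search over the stored forms, regardless of how many ways the query structure could have been drawn.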

Standardizer provides some additional actions to refine the molecules, such as setting the stereo flag or performing coordinate cleaning. The example below shows the template-based cleaning of a bridged skeleton.

compound to clean: COC(=O)C1C(CC2CCC1N2C)CC(=O)C3=CC=CC=C3
template: [image: Standardizer bicycle template]
cleaned result: [image: Standardizer bicycle cleaned]

A set of simple examples and working examples are also available.

Application (GUI)

Standardizer GUI is an easy-to-use, high-end graphical user interface for the Standardizer tool of ChemAxon. The GUI gives access to all the functionality of Standardizer without the need to use command-line parameters or to edit configuration files by hand.

The Standardizer GUI provides a friendly way to bring your molecules to a standardized form, with guidance for each task you may encounter.

 

Usage

	standardize

Alternatively, on Win32, Unix or Mac with Java 2 (assuming that JChem was installed with shortcuts created on the Desktop or in the Start Menu), start the application by double-clicking the appropriate icon.
 

Input and Output

Most molecular file formats are accepted (Marvin Documents (MRV), MDL molfile, SDfile, RXNfile, RDfile, SMILES, etc.).

Input files can be added by browsing the file system. Selecting several files results in a single concatenated output file. There are no restrictions on file format: input files can differ in format from each other as well as from the output. This makes it possible to concatenate molecule files and/or export them to another format by leaving the standardization rules empty.

[Figure: Standardizer GUI input sample image 1]

 

Configuration

Standardizer GUI also provides an interface for manipulating configuration files. The embedded editor simplifies the creation of new configurations and the modification of existing ones. Building a configuration from the list of available commands, specifying the order of execution, and setting custom parameters are fast and easy.

[Figures: Standardizer GUI configuration sample images 1 and 2; Standardizer GUI progress sample image; Standardizer GUI result sample image]

The command-line tool

			standardize [<input files>] -c <config file> [<options>] 
		

Prepare the usage of the standardize script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Alternatively, the Standardizer class can be directly invoked:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

			java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
			chemaxon.reaction.Standardizer [<input files>] ^
			-c <config file> [<options>]
		

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

			java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
			chemaxon.reaction.Standardizer [<input files>] \
			-c <config file> [<options>] 
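
The two invocations above differ only in the directory separator and the classpath separator. The following Python sketch builds the corresponding argument list for either platform. The function name java_command and its parameters are hypothetical helpers introduced for illustration; only the jar location and the class name are taken from the commands above.

```python
import os


def java_command(jchem_home, standardizer_args, windows=False,
                 existing_classpath=None):
    """Build the argument list for invoking the Standardizer class directly.

    jchem_home is the JChem installation directory; existing_classpath
    defaults to the CLASSPATH environment variable of the current process.
    """
    if existing_classpath is None:
        existing_classpath = os.environ.get("CLASSPATH", "")
    # Windows uses "\" in paths and ";" between classpath entries,
    # Unix uses "/" and ":".
    dir_sep, cp_sep = ("\\", ";") if windows else ("/", ":")
    jar = dir_sep.join([jchem_home.rstrip("/\\"), "lib", "jchem.jar"])
    classpath = jar
    if existing_classpath:
        classpath += cp_sep + existing_classpath
    return ["java", "-cp", classpath,
            "chemaxon.reaction.Standardizer", *standardizer_args]


cmd = java_command("/usr/local/jchem",
                   ["mols.sdf", "-c", "Standardizer.xml"],
                   existing_classpath="")
print(cmd)
```

The resulting list can be passed to subprocess.run on a machine where Java and JChem are installed.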
		

Options

		General options:
		-h, --help                          this help message
		-u, --active-groups <g1,g2,...>     active groups:
							process only tasks belonging to
							at least one of these groups or
							no groups at all

		Input options:
		-c, --config <filepath|string>      configuration XML file
							or action string,
							actions separated by "..",
							valid actions are:
							- reaction SMARTS
							- "aromatize" (Daylight, general)
							- "aromatize:b" (ChemAxon, basic)
							- "dearomatize"
							- "addexplicitH" ("hydrogenize")
							(converts implicit H-s to explicit)
							- "removeexplicitH[:lonely:isotope:
							charged:radical:mapped:wedged]"
							("dehydrogenize")
							(converts explicit H-s to implicit,
							except for lonely, isotope, charged,
							radical, mapped and wedged H-s;
							if some of these are specified in a
							':'-separated list, then the given
							H types are also converted)
							- "clearisotopes"
							(converts isotopes to non-isotopic form)
							- "neutralize" (neutralize molecule)
							- "clean" (partial clean in 2D)
							- "clean:full" (full clean in 2D)
							- "clean:<template file>"
							(template based clean in 2D)
							- "clean:3" (clean in 3D)
							- "keepone" (largest atom count)
							- "keepone:mass" (largest mass)
							- "removergroupdefinitions"
							(remove R-group definitions)
							- "sgroups:contract" (contract Sgroups)
							- "sgroups:expand" (expand Sgroups)
							- "sgroups:ungroup" (ungroup Sgroups)
							- "clearstereo" (chirality, double bond)
							- "clearstereo:chirality" (chirality)
							- "clearstereo:doublebond" (double bond)
							- "absolutestereo:clear"
							(clear absolute stereo flag)
							- "absolutestereo:set"
							(set absolute stereo flag)
							- "wedgeclean"
							(rearranges stereo wedges according to
							the IUPAC recommendations)
							- "convertwedgeinterpretation"
							(converts each wedge between two stereo
							centers into two wedges)
							- "convertdoublebonds:wiggly"
							(converts double bonds with unspecified
							CIS/TRANS stereo information to wiggly
							representation)
							- "convertdoublebonds:crossed"
							(converts double bonds with unspecified
							CIS/TRANS stereo information to crossed
							representation)
							- "tautomerize"
							(take canonical tautomer form)
							- "mesomerize"
							(take canonical mesomer form)
							- "mapreaction"
							(add atom maps to reaction)
							- "unmap"
							(remove atom maps)
		-g, --ignore-error			continue with next molecule on error

		For each action, a prefix "{<groups1>,<groups2>,...}" adds the task
		to the specified groups.

		Output options:
		-f, --format <format>		output file format (default: smiles)
		-o, --output <filepath>		output file (default: standard output)
		-e, --export-fields-to-smiles	export property fields to SMILES
		-v, --verbose			verbose output with time results

		Examples:
			standardize in.sdf -c "keepone..aromatize..[O-][N+]=O>>O=N=O"
			standardize in.sdf -c "aromatize..clean:templates.sdf"
			standardize in.sdf -c Standardizer.xml -f sdf -o o.sdf
			standardize in.sdf -c Standardizer.xml -u query
		

The command line parameter --config is mandatory; the program cannot operate without it. It specifies either the path and filename of a configuration file or a simple action string. A detailed description of the format of the configuration file is given below.

If the command line parameter --ignore-error is specified, then import/export errors do not stop the processing: the error is written to the console and the molecule is skipped. By default, the program exits on molecule import/export errors.

If the command line parameter --active-groups is specified, then only tasks belonging to at least one of the active groups and tasks belonging to no groups are executed. For each task, the containing groups are specified in the Groups attribute in the configuration XML, or else as a comma-separated list of group names between curly braces, used as an action prefix in the action string.
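
The action string syntax and the group filtering rule can be sketched in Python as follows. This is not Standardizer's own parser: the function names are hypothetical, the split on ".." is naive, and the behaviour when --active-groups is absent (all tasks run) is an assumption inferred from the description above.

```python
import re

# Optional "{group1,group2,...}" prefix at the start of an action.
_GROUP_PREFIX = re.compile(r"^\{([^}]*)\}(.*)$", re.DOTALL)


def parse_action_string(action_string):
    """Split an action string on ".." into (groups, action) tuples."""
    tasks = []
    for part in action_string.split(".."):
        match = _GROUP_PREFIX.match(part)
        if match:
            names = match.group(1).split(",")
            groups = {name.strip() for name in names if name.strip()}
            action = match.group(2)
        else:
            groups, action = set(), part
        tasks.append((groups, action))
    return tasks


def select_tasks(tasks, active_groups=None):
    """Return the actions to execute for the given active groups.

    A task is kept if it belongs to no group or shares at least one group
    with active_groups. With active_groups=None every task is kept
    (assumed behaviour when --active-groups is not given).
    """
    if active_groups is None:
        return [action for _, action in tasks]
    active = set(active_groups)
    return [action for groups, action in tasks
            if not groups or groups & active]


tasks = parse_action_string(
    "keepone..{query}aromatize..{target}[O-][N+]=O>>O=N=O")
print(select_tasks(tasks, ["query"]))
```

With the active group "query", the ungrouped keepone task and the aromatize task are selected, while the reaction assigned only to "target" is skipped.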

 

Input

Most molecular file formats are accepted (Marvin Documents (MRV), MDL molfile, SDfile, RXNfile, RDfile, SMILES, etc.).

The input is either specified in input file(s), or else in input string(s), usually in SMILES format.

If neither input file name(s) nor input string(s) are specified on the command line, then the standard input is read.

 

Output

Standardizer writes output molecules in the format specified by the --format option (the default format is "smiles"). If the --output option is omitted, results are written to the standard output.

If the command line parameter --export-fields-to-smiles is specified, then the property fields (SDF fields) of the molecules are exported even if the output format is SMILES, SMARTS, ChemAxon Extended SMILES or ChemAxon Extended SMARTS. For other formats the property fields are always exported, so this option has no effect.
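
The rule above reduces to a small decision function, sketched here in Python. The format identifiers in SMILES_LIKE_FORMATS are assumptions standing in for the four formats named in the text; check the format names accepted by your installation before relying on them.

```python
# Assumed identifiers for SMILES, SMARTS, ChemAxon Extended SMILES and
# ChemAxon Extended SMARTS; not taken from this guide.
SMILES_LIKE_FORMATS = {"smiles", "smarts", "cxsmiles", "cxsmarts"}


def fields_exported(output_format="smiles", export_fields_to_smiles=False):
    """Return True if property fields end up in the output.

    For the SMILES-like formats the fields are written only when
    --export-fields-to-smiles is given; for every other format they are
    always written and the option makes no difference.
    """
    if output_format.lower() in SMILES_LIKE_FORMATS:
        return export_fields_to_smiles
    return True


print(fields_exported("smiles"))                                # default run
print(fields_exported("smiles", export_fields_to_smiles=True))  # with -e
print(fields_exported("sdf"))                                   # -e irrelevant
```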

 

Usage examples

  1. A UNIX command that reads molecular structures from the mols.sdf file and writes the standardized molecules to the standard output in SMILES format:
    			standardize -c Standardizer.xml mols.sdf
    			
  2. A UNIX command, run from the ./test/pharmacophore directory, that reads molecules given as SMILES strings from the file nci10000.smiles located there and writes the results to the file nci10000.sdf, created in the same directory:
    			standardize -c Standardizer.xml nci10000.smiles -f sdf -o nci10000.sdf
    			
  3. A similar command for nci100.smiles with verbose output, then displaying the result in MarvinView (-e is also passed, but it has no effect because the output format is SDF):
    			standardize -c Standardizer.xml -e -v nci100.smiles -f sdf -o nci100.sdf
    			mview nci100.sdf
    			
  4. Processing an SD file and displaying the standardized molecules using MarvinView:
    			standardize -c Standardizer.xml med100.sdf | mview -
    			

    Note that such piping does not work in Windows.

  5. Standardization with action string:
    			standardize -c "aromatize..[O-:2][N+:1]=O>>[O:2]=[N:1]=O" med100.sdf -o med100.smiles
    			
  6. Standardization with action string, taking input molecules as SMILES strings:
    			standardize -c "aromatize..[O-:2][N+:1]=O>>[O:2]=[N:1]=O" \
    			"[O-][N+](=O)C1=CC=CC=C1" "[H]C1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O"
    			
  7. Processing tasks belonging to no groups or to task group "target":
    			standardize -c Standardizer.xml -u target targets.sdf -f sdf -o output.sdf
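
When standardize is called from a script, the command lines shown in these examples can be assembled programmatically. The Python sketch below uses only the options documented above; the helper name standardize_command and its parameter names are hypothetical, and running the result requires the standardize script to be available on the PATH.

```python
import shlex


def standardize_command(config, inputs=(), output_format=None, output=None,
                        active_groups=(), ignore_error=False, verbose=False):
    """Build the argument list for the standardize command-line tool.

    config is a configuration file path or an action string; inputs are
    input file names or input structure strings.
    """
    cmd = ["standardize", *inputs, "-c", config]
    if active_groups:
        cmd += ["-u", ",".join(active_groups)]
    if ignore_error:
        cmd.append("-g")
    if output_format:
        cmd += ["-f", output_format]
    if output:
        cmd += ["-o", output]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = standardize_command("Standardizer.xml", ["targets.sdf"],
                          output_format="sdf", output="output.sdf",
                          active_groups=["target"])
# shlex.join quotes arguments where a POSIX shell would need it.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with action strings that contain characters such as ">" or "[".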
    			
 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.