Standardizer brings molecules to a standardized form. The same molecule can have different tautomer and mesomer representations, and chemists can draw an aromatic ring in aromatized form or with alternating single and double bonds, and can draw hydrogen atoms explicitly or leave them implicit. However, when identifying a molecule or performing a substructure search, we need a common form to work with. For example, if we want the molecules in a database that contain a pyrrole ring, we want to find those containing either aromatized or non-aromatized pyrrole rings. We cannot predict whether an unknown input database or SDF file stores molecules in aromatized or non-aromatized form.
There is a brute-force approach for getting the expected search result: perform two searches, one with the aromatized and one with the non-aromatized pyrrole ring as the query, and then merge the results. This doubles the search time for the simple pyrrole example, and the cost grows exponentially with the number of alterable groups in the query. For example, a query structure with two alterable functional groups (a pyrrole part and an enol part) gives four cases to consider in the search.
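The growth of the brute-force approach can be illustrated with a small Python sketch. The group names and form labels are hypothetical placeholders, not Standardizer identifiers; the code only counts the query variants that would have to be searched.

```python
from itertools import product

# Hypothetical labels for the two alterable groups of the example query.
ALTERNATIVE_FORMS = {
    "pyrrole ring": ["aromatized", "alternating bonds"],
    "oxo-enol group": ["enol", "oxo"],
}


def enumerate_query_variants(forms):
    """Return one dict per query variant, covering every combination of forms."""
    names = list(forms)
    combos = product(*(forms[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


variants = enumerate_query_variants(ALTERNATIVE_FORMS)
for variant in variants:
    print(variant)
print(len(variants), "searches needed")
```

Each additional alterable group with two forms doubles the number of variants, which is why standardizing first scales better than searching every combination.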
We have chosen a much more efficient way to deal with this problem: first bring the molecules to standardized forms and then perform the search. We define the standard form for each variable group (e.g. the oxo form for oxo-enol tautomers) and for each drawing mode (e.g. the aromatic ring for aromatic versus alternating bonds). This definition is specified in a configuration file. Standardizer performs the necessary transformations on the molecule in the order they are listed in the configuration file. The attentive reader may have noticed that these transformations also require search operations: we have to find the functional groups or aromatizable rings. However, with thoughtful planning these operations need to be performed only once. The molecules can then be stored in standardized form, and any operation that requires standardization (e.g. substructure search, reaction processing) can be performed on the stored standardized molecules.
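The "standardize once, search many times" idea can be sketched in Python as follows. This is a toy model under loud assumptions: molecules are plain SMILES strings, the transformations are simple text substitutions, and the search is a substring test. None of this is how Standardizer or a real substructure search works; the sketch only shows ordered transformations applied once and reused.

```python
from typing import Callable, Dict, List

Transform = Callable[[str], str]


def strip_whitespace(smiles: str) -> str:
    # Toy clean-up step.
    return smiles.strip()


def nitro_to_pentavalent(smiles: str) -> str:
    # Toy text substitution standing in for a real reaction-based transformation.
    return smiles.replace("[N+]([O-])=O", "N(=O)=O")


# The order matters: transformations run in the order they are listed.
TRANSFORMS: List[Transform] = [strip_whitespace, nitro_to_pentavalent]


def standardize(smiles: str, transforms: List[Transform]) -> str:
    for transform in transforms:
        smiles = transform(smiles)
    return smiles


def build_index(raw: Dict[str, str], transforms: List[Transform]) -> Dict[str, str]:
    """Standardize every stored molecule once and keep the result."""
    return {name: standardize(smiles, transforms) for name, smiles in raw.items()}


def search(index: Dict[str, str], query: str, transforms: List[Transform]) -> List[str]:
    """Standardize the query the same way, then run a single (toy) search."""
    standard_query = standardize(query, transforms)
    return [name for name, smiles in index.items() if standard_query in smiles]


raw_molecules = {"a": "CC[N+]([O-])=O", "b": "CCN(=O)=O ", "c": "CCO"}
index = build_index(raw_molecules, TRANSFORMS)
print(search(index, "[N+]([O-])=O", TRANSFORMS))
```

Both drawings of the nitro group are found with one search, because the stored molecules and the query pass through the same list of transformations.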
Standardizer provides some additional actions to refine the molecules, such as setting the stereo flag or performing coordinate cleaning. The example below shows the template-based cleaning of a bridged skeleton.
| compound to clean | template | cleaned result |
| COC(=O)C1C(CC2CCC1N2C)CC(=O)C3=CC=CC=C3 | ![]() | ![]() |
A set of simple examples and working examples are also available.
Standardizer GUI is an easy-to-use graphical user interface for ChemAxon's Standardizer tool. The GUI gives access to all the functionality of Standardizer without the need to use command-line parameters or to edit configuration files by hand.
It provides a friendly way to bring your molecules to a standardized form, with guidance for each task you may encounter.
standardize
Alternatively, on Win32, Unix or Mac / Java 2 (assuming that JChem was installed with shortcuts created on the Desktop or in the Start Menu), double-click the appropriate icon.
Most molecular file formats are accepted (Marvin Documents (MRV), MDL molfile, SDfile, RXNfile, RDfile, SMILES, etc.).
Input files can be added by browsing the file system. Selecting multiple files results in a single concatenated output file. There are no restrictions on file format: input files can differ from each other as well as from the output. This makes it possible to concatenate molecule files and/or export them to another format by leaving the standardization rules empty.
Standardizer GUI also provides an interface for manipulating configuration files. The embedded editor simplifies the creation of new configurations and the modification of existing ones. Building a configuration from the list of available commands, specifying the order of execution and setting custom parameters are fast and easy.
standardize [<input files>] -c <config file> [<options>]
Prepare the usage of the standardize script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.
Alternatively, the Standardizer class can be directly invoked:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.reaction.Standardizer [<input files>] -c <config file> [<options>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
    chemaxon.reaction.Standardizer [<input files>] \
    -c <config file> [<options>]
General options:
  -h, --help                        this help message
  -u, --active-groups <g1,g2,...>   active groups:
                                    process only tasks belonging to
                                    at least one of these groups or
                                    no groups at all
Input options:
  -c, --config <filepath|string>    configuration XML file
                                    or action string,
                                    actions separated by "..",
                                    valid actions are:
                                    - reaction SMARTS
                                    - "aromatize" (Daylight, general)
                                    - "aromatize:b" (ChemAxon, basic)
                                    - "dearomatize"
                                    - "addexplicitH" ("hydrogenize")
                                      (converts implicit H-s to explicit)
                                    - "removeexplicitH[:lonely:isotope:
                                      charged:radical:mapped:wedged]"
                                      ("dehydrogenize")
                                      (converts explicit H-s to implicit,
                                      except for lonely, isotope, charged,
                                      radical, mapped and wedged H-s;
                                      if some of these are specified in a
                                      ':'-separated list, then the given
                                      H types are also converted)
                                    - "clearisotopes"
                                      (converts isotopes to non-isotopic form)
                                    - "neutralize" (neutralize molecule)
                                    - "clean" (partial clean in 2D)
                                    - "clean:full" (full clean in 2D)
                                    - "clean:<template file>"
                                      (template based clean in 2D)
                                    - "clean:3" (clean in 3D)
                                    - "keepone" (largest atom count)
                                    - "keepone:mass" (largest mass)
                                    - "removergroupdefinitions"
                                      (remove R-group definitions)
                                    - "sgroups:contract" (contract Sgroups)
                                    - "sgroups:expand" (expand Sgroups)
                                    - "sgroups:ungroup" (ungroup Sgroups)
                                    - "clearstereo" (chirality, double bond)
                                    - "clearstereo:chirality" (chirality)
                                    - "clearstereo:doublebond" (double bond)
                                    - "absolutestereo:clear"
                                      (clear absolute stereo flag)
                                    - "absolutestereo:set"
                                      (set absolute stereo flag)
                                    - "wedgeclean"
                                      (rearranges stereo wedges according to
                                      the IUPAC recommendations)
                                    - "convertwedgeinterpretation"
                                      (converts each wedge between two stereo
                                      centers into two wedges)
                                    - "convertdoublebonds:wiggly"
                                      (converts double bonds with unspecified
                                      CIS/TRANS stereo information to wiggly
                                      representation)
                                    - "convertdoublebonds:crossed"
                                      (converts double bonds with unspecified
                                      CIS/TRANS stereo information to crossed
                                      representation)
                                    - "tautomerize"
                                      (take canonical tautomer form)
                                    - "mesomerize"
                                      (take canonical mesomer form)
                                    - "mapreaction"
                                      (add atom maps to reaction)
                                    - "unmap"
                                      (remove atom maps)
  -g, --ignore-error                continue with next molecule on error
For each action, a prefix "{<groups1>,<groups2>,...}" adds the task
to the specified groups.
Output options:
  -f, --format <format>             output file format (default: smiles)
  -o, --output <filepath>           output file (default: standard output)
  -e, --export-fields-to-smiles     export property fields to SMILES
  -v, --verbose                     verbose output with time results
Examples:
  standardize in.sdf -c "keepone..aromatize..[O-][N+]=O>>O=N=O"
  standardize in.sdf -c "aromatize..clean:templates.sdf"
  standardize in.sdf -c Standardizer.xml -f sdf -o o.sdf
  standardize in.sdf -c Standardizer.xml -u query
The command line parameter --config is mandatory: without it the program cannot operate. It specifies either the path and filename of a configuration file or a simple action string. A detailed description of the format of the configuration file is given below.
If the command line parameter --ignore-error is specified, import/export errors do not stop the processing: the error is written to the console and the molecule is skipped.
By default, the program exits on a molecule import/export error.
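The difference between the default behavior and --ignore-error can be modeled with a short Python sketch. The function names and the toy error condition (an empty record) are assumptions made for illustration; they do not reflect Standardizer's internals.

```python
def toy_standardize(record: str) -> str:
    # Stand-in for import, standardization and export of one molecule.
    cleaned = record.strip()
    if not cleaned:
        raise ValueError("empty record")
    return cleaned


def process_all(records, standardize_one, ignore_error=False):
    """Process records in order; on error either stop or skip the record."""
    results, errors = [], []
    for position, record in enumerate(records, start=1):
        try:
            results.append(standardize_one(record))
        except ValueError as exc:
            if not ignore_error:
                raise  # default: stop at the first failing molecule
            errors.append(f"record {position}: {exc}")  # report and continue
    return results, errors


results, errors = process_all(["CCO", "", "c1ccccc1"], toy_standardize, ignore_error=True)
print(results)
print(errors)
```

With ignore_error left at False, the same call raises at the second record, which mirrors the program exiting on the first import/export error.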
If the command line parameter --active-groups is specified, then only tasks belonging to at least one of the active groups and tasks belonging to no groups are executed. For each task, the containing groups are specified in the Groups attribute in the configuration XML, or as a comma-separated list of group names between curly braces, given as an action prefix in the action string.
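The selection rule for active groups can be sketched in Python. The parser below is an assumption-laden illustration: it assumes that the curly-brace prefix is immediately followed by the action, that ".." never occurs inside an action, and that all tasks run when no active groups are given. The group names in the example are hypothetical.

```python
import re

# A "{g1,g2,...}" prefix at the very start of an action (assumed syntax).
_GROUP_PREFIX = re.compile(r"^\{([^}]*)\}")


def parse_action_string(action_string):
    """Split an action string on '..' and return (action, groups) pairs."""
    tasks = []
    for part in action_string.split(".."):
        groups = set()
        match = _GROUP_PREFIX.match(part)
        if match:
            groups = {g.strip() for g in match.group(1).split(",") if g.strip()}
            part = part[match.end():]
        tasks.append((part, groups))
    return tasks


def select_active(tasks, active_groups=None):
    """Keep tasks with no groups or with at least one active group."""
    if active_groups is None:  # assumption: no filtering without --active-groups
        return [action for action, _ in tasks]
    active = set(active_groups)
    return [action for action, groups in tasks if not groups or groups & active]


tasks = parse_action_string("{query}aromatize..{target,query}removeexplicitH..keepone")
print(select_active(tasks, ["target"]))
```

In this example the "aromatize" task is skipped because it belongs only to the inactive "query" group, while "keepone" runs because it belongs to no group at all.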
Most molecular file formats are accepted (Marvin Documents (MRV), MDL molfile, SDfile, RXNfile, RDfile, SMILES, etc.).
The input is either specified in input file(s), or else in input string(s), usually in SMILES format.
If neither the input file name(s) nor the input string(s) are specified in the command line then the standard input is read.
Standardizer writes output molecules in the format specified by the --format option (the default format is "smiles"). If the --output option is omitted, results are written to the standard output.
If the command line parameter --export-fields-to-smiles is specified, the property fields (SDF fields) of the molecules are exported even if the output format is SMILES, SMARTS, ChemAxon Extended SMILES or ChemAxon Extended SMARTS. For other formats the property fields are always exported, so this option has no effect.
The following command reads molecules from the mols.sdf file and writes the standardized molecules to the standard output in SMILES format:
standardize -c Standardizer.xml mols.sdf
The following command reads molecules from nci10000.smiles, located in the ./test/pharmacophore directory, and writes the results to a file named nci10000.sdf, created in the same directory:
standardize -c Standardizer.xml nci10000.smiles -f sdf -o nci10000.sdf
standardize -c Standardizer.xml -e -v nci100.smiles -f sdf -o nci100.sdf
mview nci100.sdf
standardize -c Standardizer.xml med100.sdf | mview -
Note that such piping does not work on Windows.
standardize -c "aromatize..[O-:2][N+:1]=O>>[O:2]=[N:1]=O" med100.sdf -o med100.smiles
standardize -c "aromatize..[O-:2][N+:1]=O>>[O:2]=[N:1]=O" \
    "[O-][N+](=O)C1=CC=CC=C1" "[H]C1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O"
standardize -c Standardizer.xml -u target targets.sdf -f sdf -o output.sdf
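When the command lines above are generated from another program, it can help to assemble the argument list in code. The following Python sketch builds such a list from the options documented in the help text. The function name and its parameters are hypothetical, and it assumes the standardize script is available on the PATH; actually running the command is left as a commented-out line.

```python
import shlex


def build_standardize_command(config, inputs=(), output_format=None, output=None,
                              active_groups=(), ignore_error=False,
                              export_fields=False, verbose=False,
                              executable="standardize"):
    """Return the argument list for a standardize call (illustrative helper)."""
    command = [executable, *inputs, "-c", config]
    if active_groups:
        command += ["-u", ",".join(active_groups)]
    if ignore_error:
        command.append("-g")
    if output_format:
        command += ["-f", output_format]
    if output:
        command += ["-o", output]
    if export_fields:
        command.append("-e")
    if verbose:
        command.append("-v")
    return command


command = build_standardize_command(
    "keepone..aromatize..[O-][N+]=O>>O=N=O", inputs=["in.sdf"], output="out.smiles"
)
print(shlex.join(command))  # shell-quoted form, for logging only

# To execute (requires JChem's standardize script on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with action strings, which often contain characters such as brackets and ">" that a shell would otherwise interpret.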