Extreme Search Speed-Ups in JChem PostgreSQL Cartridge 2.7

Posted by Róbert Wagner on 03 09 2017

  • New index type: sortedchemindex
  • Massive speed-up in duplicate and similarity searches using sortedchemindex
  • Extra speed-ups are possible in substructure search when only the top hits are retrieved
  • Up to 60-fold speed-up for typical joined queries

The graph below shows the results of duplicate (DUP) and similarity (SIM) search benchmarks in JChem Oracle Cartridge (JOC) 17.3.6, JChem PostgreSQL Cartridge (JPC) 2.4 and JPC 2.7 (see footnote 1 for technical details). All tables are indexed, and JPC 2.7 uses the new sortedchemindex type; with this index type, the most similar hits are returned first.

Note that for these search types the speed is more or less independent of the query.

Graph of DUP and SIM
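
As a rough illustration of how the new index type might be put on a table, the sketch below uses the standard PostgreSQL `CREATE INDEX ... USING` form with `sortedchemindex` as the index method. The table and column names (`pbch_8m`, `mol`) are taken from the benchmark queries in footnote 4. The index name is made up, and the exact statement, including any extra options the cartridge may require, is an assumption that should be checked against the JPC documentation.

```sql
-- Assumption: sortedchemindex is used like any other PostgreSQL index method.
-- pbch_8m_mol_sorted_idx is a hypothetical index name.
CREATE INDEX pbch_8m_mol_sorted_idx
    ON pbch_8m
    USING sortedchemindex (mol);
```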

Substructure search benchmarks were run on “rare” and “frequent” query sets (see footnote 2), with the hits ordered by relevance (see footnote 3).

When a query has many hits, it may be worth retrieving only the first ones (the top 500 in the benchmark).

Graph of sub rare

Graph of sub frequent
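
Retrieving only the top hits, as in the substructure benchmark, can be sketched with a plain `LIMIT` clause. The `|<|` substructure operator, the query structure and the `pbch_8m`/`mol` names are borrowed from footnote 4; the `ORDER BY` line is left commented out because its exact form is only an assumption based on the wording of footnote 3.

```sql
-- Substructure search for chlorobenzene, keeping only the first 500 hits.
SELECT mol
FROM pbch_8m
WHERE 'Clc1ccccc1' |<| mol
-- ORDER BY mol   -- assumed form of the relevance ordering (footnote 3); verify before use
LIMIT 500;
```

With `LIMIT`, PostgreSQL can stop as soon as enough rows have been produced, which is why top-hit retrieval tends to pay off most on the “frequent” query set.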

Joined queries (see footnote 4) can also become faster, depending on the plan chosen by the PostgreSQL query planner.

Graph of joined query
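
Since the gain for joined queries depends on the planner, it can help to look at the plan that is actually chosen. A minimal sketch, using the standard PostgreSQL `EXPLAIN ANALYZE` command on the JPC benchmark query from footnote 4 (with the row count written as `count(*)`):

```sql
-- Runs the query and reports the chosen plan together with actual timings.
EXPLAIN ANALYZE
SELECT count(*)
FROM pbch_8m
WHERE 'Clc1ccccc1' |<| mol
  AND molweight < 120;
```

The reported plan shows whether the chemical index or the `molweight` condition is evaluated first. The output depends on the installation and the table statistics, so none is reproduced here.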

Visit the demo site and try this out now.


Footnotes

1. Target set: 8M structures from PubChem. Query set: small fragments and drug-like molecules. Similarity search: only the 100 most similar structures were retrieved.
2. rare: few hits; frequent: many possible hits after the screening phase, many returned hits.
3. Starting from JPC 2.7, the result set can be ordered directly by the chemical structures, so the most relevant hits come first.
4. Benchmark queries:
   JOC: select count(*) from pbch_8m where jc_compare(mol, 'Clc1ccccc1', 't:s') = 1 and molweight < 120;
   JPC: select count(*) from pbch_8m where 'Clc1ccccc1' |<| mol and molweight < 120;
