Code Class: Frequency occurrence calculations in Instant JChem

news · 6 years ago
by Petr Hamernik

It is often useful to cluster data based on the values in a specific column. Sometimes it is also necessary to count how many times each value occurs in the table.

In Instant JChem 5.11 there is a new experimental widget called Tree Table. It combines a tree (for grouped columns) and a table (for the rest of the columns). In the current version the widget is not yet suitable for large data sets; this will be improved in a future version.

Tree Table widget in Instant JChem

One way to cluster the data by the value of a specific column is to use the following script.

Please follow these instructions:

  1. Create a new script under your data tree using the right-click popup menu.

  2. Select the entire content of the newly created sample script and delete it. Then copy the following script and paste it into the editor.

[java]
def ety = dataTree.rootVertex.entity
def edp = ety.schema.dataProvider.getEntityDataProvider(ety)

// ==== Customize the name of field which is used for frequency calculations =====
def fieldName = 'Donors'
def columnNamePrefix = 'DONORS'
// =====================

// Create the two new integer fields
def countField
def clusterIndexField
def lock = ety.schema.schema.lockable.obtainLock('Create a new field')
def statistics = new HashMap()
try {
    def envRW = EnvUtils.createRWFromRO(env, lock)
    countField = DFFields.createIntegerField(ety, fieldName + ' Count', columnNamePrefix + '_COUNT', envRW)
    clusterIndexField = DFFields.createIntegerField(ety, fieldName + ' Cluster Index', columnNamePrefix + '_CLUSTER_INDEX', envRW)
} finally {
    lock?.release()
}

// Sort table, fill counts
def rs = dataTree.getDefaultResultSet(true, env)
lock = rs.lockable.obtainLock("Sorting")
try {
    envRW = EnvUtils.createRWFromRO(env, lock)
    freqField = ety.fields.items.find { it.name == fieldName }
    freqFId = freqField.id
    rs.rootVS.setSort(SortDirective.create(freqField, true), envRW)
    lastValue = null
    sameIds = []   // row ids that share the current value
    index = 1      // cluster index of the current run
    counter = 0
    envRW.feedback.switchToDeterminate(rs.rootVS.size)
    rs.rootVS.ids.each { rowId ->
        value = rs.rootVS.getData([ rowId ], env).get(rowId).get(freqFId)
        if ((lastValue != null) && (!value.equals(lastValue))) {
            // The value changed: write out the finished run and start a new one
            flush(edp, sameIds, index, countField.id, clusterIndexField.id, statistics)
            index++
            sameIds = [ rowId ]
        } else {
            sameIds += rowId
        }
        lastValue = value
        envRW.feedback.progress(counter)
        counter++
    }
    // Write out the last run
    flush(edp, sameIds, index, countField.id, clusterIndexField.id, statistics)
    print "Statistics: " + statistics
} finally {
    lock?.release()
}

// Stores the count and cluster index for one run of rows and updates the statistics
void flush(DFEntityDataProvider edp, List ids, int index, String countFieldId, String clusterIndexFieldId, Map statistics) {
    if (ids.isEmpty()) {
        return
    }
    // statistics maps cluster size -> number of clusters of that size
    current = statistics.get(ids.size())
    newCount = (current == null) ? 1 : (current + 1)
    statistics.put(ids.size(), newCount)
    lock2 = edp.lockable.obtainLock("updating count values")
    try {
        envRW2 = EnvUtils.createRWFromRO(env, lock2)
        Map updates = new HashMap()
        updates.put(countFieldId, ids.size())
        updates.put(clusterIndexFieldId, index)
        def updDesc = DFUpdateDescription.create(edp.entity, ids, updates)
        edp.update([ updDesc ], DFUndoConfig.OFF, envRW2)
    } finally {
        lock2?.release()
    }
}
[/java]

  3. Modify the header of the script: set fieldName to the name of the field used for the frequency calculation, and set columnNamePrefix to the prefix for the names of the newly created columns.
  4. Run the script.
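The core of the script is a sort-then-group pass: rows are sorted by the chosen field, and each run of equal values receives its size and a running cluster index. The sketch below mirrors that logic in plain Java on an in-memory array, with no Instant JChem API involved. The class name `FrequencySketch` and the method `cluster` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FrequencySketch {

    /**
     * Returns one int[] {value, count, clusterIndex} per input value,
     * in ascending order of value. Cluster indexes start at 1.
     */
    static List<int[]> cluster(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);

        List<int[]> rows = new ArrayList<>();
        int clusterIndex = 0;
        int start = 0;
        while (start < sorted.length) {
            // Find the end of the run of equal values starting at 'start'.
            int end = start;
            while (end < sorted.length && sorted[end] == sorted[start]) {
                end++;
            }
            clusterIndex++;
            int count = end - start;
            // Every row in the run gets the same count and cluster index.
            for (int i = start; i < end; i++) {
                rows.add(new int[] {sorted[i], count, clusterIndex});
            }
            start = end;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative donor counts, not taken from the PubChem sample data.
        for (int[] row : cluster(new int[] {2, 0, 2, 1, 2, 0})) {
            System.out.println(row[0] + "\t" + row[1] + "\t" + row[2]);
        }
    }
}
```

In the real script the same two numbers are written back to the database for all rows of a run in a single update, rather than collected in a list.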

The script then produces two new columns, e.g. “Donors Count” and “Donors Cluster Index”. The first one contains the number of rows that have the same value in the Donors column. The second one contains a unique index for each “cluster”. These are real columns in the database, so they can be used for sorting or searching. If you run the script on the PubChem sample data project in IJC, you will get something like this:

In the table above, three molecules are highlighted; they have 16 donors and belong to cluster index 15. This cluster consists of three molecules, as indicated by Donors Count = 3.

Note also that the script prints some simple statistics to the output window. The first number represents the size of a cluster and the second number indicates how many clusters of this size exist. In our example there are three clusters of one item, two clusters containing three values, one cluster with 246 values, and so on.
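The printed statistics are a histogram of cluster sizes. As a minimal sketch of how such a map can be built, assuming the values are available in memory, the Java code below first counts the occurrences of each value and then counts how many values share each occurrence count. The class name `ClusterSizeStatistics` and the method `sizeHistogram` are hypothetical, and the input data is made up to match the first two entries described above.

```java
import java.util.Map;
import java.util.TreeMap;

final class ClusterSizeStatistics {

    /** Maps cluster size to the number of clusters that have that size. */
    static Map<Integer, Integer> sizeHistogram(int[] values) {
        // Step 1: value -> number of rows with that value (the cluster size).
        Map<Integer, Integer> countPerValue = new TreeMap<>();
        for (int value : values) {
            countPerValue.merge(value, 1, Integer::sum);
        }
        // Step 2: cluster size -> number of clusters of that size.
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int size : countPerValue.values()) {
            histogram.merge(size, 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        // Values 0, 2 and 4 occur once; values 1 and 3 occur three times.
        int[] donors = {0, 1, 1, 1, 2, 3, 3, 3, 4};
        System.out.println(sizeHistogram(donors)); // prints {1=3, 3=2}
    }
}
```

The script itself arrives at the same map incrementally: each time a run of equal values is flushed, the entry for that run's size is increased by one.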

There may be many useful applications of this script, for example with Bemis and Murcko frameworks. You can add the bmf() function as a calculated field and then use that field for clustering with this script.

Although the script gives you lots of flexibility (you can modify it in many other ways), we also plan to add this type of functionality to IJC without scripting.
