Discovery of Power-Laws in Chemical Space

publication · 8 years ago
by Ryan W. Benz, S. Joshua Swamidass, Pierre Baldi (University of California Irvine)
Power-law distributions have been observed in a wide variety of areas. To our knowledge, however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small, organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and circular substructures, and the sizes of molecular similarity clusters all show linear trends on log-log rank/frequency plots, suggesting underlying power-law distributions. The number of unique features also follows Heaps’-like laws. The characteristic exponents of the power-laws lie in the 1.5-3 range, consistent with the exponents observed in other power-law phenomena. The power-law nature of these distributions leads to several applications, including the prediction of the growth of available data through Heaps’ law and the optimal allocation of experimental or computational resources via the 80/20 rule. More importantly, we also show how the power-laws can be leveraged to efficiently compress chemical fingerprints in a lossless manner, which is useful for the improved storage and retrieval of molecules in large chemical databases.
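
To illustrate the "linear trend on a log-log rank/frequency plot" mentioned in the abstract, here is a minimal Python sketch that estimates the slope of such a plot by ordinary least squares. It is not the authors' method: the function name `fit_rank_frequency_exponent` and the synthetic counts are assumptions made for illustration only, and a least-squares fit on log-log axes is only a rough diagnostic for a power-law.

```python
import math

def fit_rank_frequency_exponent(counts):
    """Return the negated least-squares slope of log(frequency) vs. log(rank).

    Hypothetical helper for illustration; `counts` is any iterable of
    feature occurrence counts (non-positive entries are ignored).
    """
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic data (an assumption, not from the paper): frequency = 1000 / rank^1.2
synthetic = [1000.0 / rank ** 1.2 for rank in range(1, 201)]
estimate = fit_rank_frequency_exponent(synthetic)
print(round(estimate, 3))
```

On real feature counts the points will scatter around the line, so the fitted slope should be read as an approximate summary rather than a precise exponent.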