Chemistry - Are there any datasets containing molecules with more than 38 heavy atoms?

Solution 1:

This sounds like you are exploring work related to that of the Lilienfeld group, which hosts a dedicated site about the data sets used in their earlier and ongoing exploration of chemical space, the programs used to work with the data, and their publications.

To go considerably higher in molecule count than QM9, you could either go for

  • GDB-11 about small organic molecules up to 11 atoms of C, N, O and F which «contains 26.4 million molecules (110.9 million stereoisomers), including three- and four-membered rings and triple bonds», described in J. Chem. Inf. Model. 2007, 47, 342-353 (doi.org/10.1021/ci600423u), or

  • GDB-13, about «small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules. With 977 468 314 structures, GDB-13 is the largest publicly available small organic molecule database to date». This one was described in J. Am. Chem. Soc. 2009, 131, 8732-8733 (doi.org/10.1021/ja902302h)

Conveniently, you can download both from the Reymond group, including subsets such as «containing only carbon and nitrogen», «chlorine and sulfur», or «fragrance like» in case you don't want to fetch 2 GB of already compressed data. To quote: «All the molecules are stored in dearomatized, canonized SMILES format.»
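Since these sets are distributed as SMILES, checking the heavy-atom count of each entry does not need a full cheminformatics toolkit. The following is a minimal sketch, assuming one SMILES string per line, optionally followed by whitespace and an identifier (that layout is an assumption, not something the download page states). The names `count_heavy_atoms` and `larger_than` are hypothetical helpers introduced only for illustration, and the regular expressions cover ordinary SMILES atom tokens rather than the full grammar.

```python
import re

# One SMILES atom token: a bracket atom, a two-letter halogen, or a
# one-letter organic-subset atom (aromatic lowercase forms included).
ATOM_TOKEN = re.compile(r"\[([^\]]+)\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope number, then the element symbol.
BRACKET_SYMBOL = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def count_heavy_atoms(smiles: str) -> int:
    """Count the non-hydrogen atoms written in a SMILES string."""
    count = 0
    for match in ATOM_TOKEN.finditer(smiles):
        inner = match.group(1)
        if inner is None:
            count += 1  # organic-subset atom; never a hydrogen
            continue
        symbol = BRACKET_SYMBOL.match(inner)
        # Explicit hydrogens such as [H] or [2H] are not heavy atoms.
        if symbol and symbol.group(1) != "H":
            count += 1
    return count


def larger_than(lines, min_heavy_atoms=38):
    """Yield SMILES with more than `min_heavy_atoms` heavy atoms.

    Assumes the SMILES is the first whitespace-separated field of a line.
    """
    for line in lines:
        fields = line.split()
        if fields and count_heavy_atoms(fields[0]) > min_heavy_atoms:
            yield fields[0]
```

Because `larger_than` accepts any iterable of lines, it can be fed an open file handle (for example from `gzip.open(path, "rt")`) so that a large archive is streamed rather than loaded into memory.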

The even larger GDB-17 («of up to 17 atoms of C, N, O, S, and halogens» with a universe of 166 billion entries, described in J. Chem. Inf. Model. 2012, 52, 2864-2875, [doi.org/10.1021/ci300415d, open access]) is publicly accessible from the same site only as a random subset of 50 million molecules, partly because the gzipped archive is about 400 GB. Publications citing this work include, for example, a machine-learning study again by the Lilienfeld group (J. Chem. Phys. 143, 084111 (2015), doi.org/10.1063/1.4928757).


Initially, I misinterpreted the question, but I think the answer becomes more rounded with the following complementary publication: «Chemical diversity in molecular orbital energy predictions with kernel ridge regression» (J. Chem. Phys. 150, 204121 (2019), doi.org/10.1063/1.5086105; a preprint is available). Aiming for a machine-learning analysis, the authors first compared three data sets by the number of atoms per molecule: QM9, 44k conformers of proteinogenic amino acids (AA), and a 64k set of organic molecules extracted from the CCDC that are potentially suitable for organic electronics (OE). They found the following distribution:

[Figure: distribution of the number of atoms per molecule in the QM9, AA, and OE data sets]

To shed some light on them:

  • QM9 represents 133,814 small organic molecules with up to 9 heavy atoms (C, N, O and F)
  • AA is about «44,004 isolated and cation-coordinated conformers of 20 proteinogenic amino acids and their amino-methylated and acetylated (capped) dipeptides. The molecular structures are made of up to 39 atoms including H, C, N, O, S, Ca, Sr, Cd, Ba, Hg and Pb.»
  • OE is about «64,710 large organic molecules with up to 174 atoms extracted from organic crystals in the Cambridge Structural Database (CSD). [...] The OE dataset is not yet publicly available. OE offers the largest chemical diversity among the sets in this work both in terms of size as well as number of different elements (Fig. 2). It contains the 16 different element types H, Li, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, Te and I.»

(The restriction on sharing the original data stems from the user agreement with the CCDC.)

Further DFT-based property computations on the molecular geometries extracted for OE led to an ensemble of equilibrium molecular structures, and these derived geometries are accessible within a public Jupyter notebook. The public deposit comes with a guiding tutorial.ipynb, which includes an example of how to retrieve these optimized geometries and display them with Jmol.

Solution 2:

The ISOL24 database (http://www.thch.uni-bonn.de/tc.old/downloads/GMTKN/GMTKN55/ISOL24.html) contains molecules with up to 81 atoms!

The other answer says that there's a database called "OE" with molecules that have up to 174 atoms, but it is "not yet publicly available".


Solution 3:

Beyond the other answers, I'd suggest the original PubChemQC project, which offers ~3 million molecules from PubChem optimized using DFT (B3LYP/6-31G*). Molecules include a wide variety of elements as long as the molecular mass is less than 500 Da. (Roughly speaking, that should still accommodate ~38 carbon atoms.)
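The rough estimate in parentheses can be checked with a few lines of arithmetic. This is only a sketch: the atomic masses are standard values rounded to three decimals, and `max_atoms` is a hypothetical helper, not part of any PubChemQC tooling.

```python
import math

# Standard atomic masses in Da, rounded to three decimals.
MASS_C = 12.011
MASS_H = 1.008
MASS_CAP = 500.0  # molecular-mass limit quoted above


def max_atoms(budget: float, atomic_mass: float) -> int:
    """Largest whole number of atoms of one element that fits the budget."""
    return math.floor(budget / atomic_mass)


carbon_only = max_atoms(MASS_CAP, MASS_C)   # bare carbon skeleton
mass_38_carbons = 38 * MASS_C               # mass of 38 carbon atoms
hydrogens_left = max_atoms(MASS_CAP - mass_38_carbons, MASS_H)

print(f"max carbons under cap: {carbon_only}")
print(f"38 carbons weigh {mass_38_carbons:.1f} Da")
print(f"room for hydrogens: {hydrogens_left}")
```

With these masses, a bare carbon skeleton tops out at 41 atoms, and 38 carbons weigh about 456.4 Da, which leaves room for at most 43 hydrogens. The ~38 figure therefore applies to hydrogen-poor molecules; a saturated hydrocarbon such as C38H78 (about 535 Da) would already exceed the cap.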

"PubChemQC Project: A Large-Scale First-Principles Electronic Structure Database for Data-Driven Chemistry" J. Chem. Inf. Model. 2017 57(6) pp. 1300-1308

You mention the number of heavy atoms, but keep in mind that QM9 only contains a small subset of elements and ZINC has many more.