Chemistry - What are the GDB-13 criteria for "synthetically accessible organic molecules"?

Solution 1:

What you have provided is a definition of all the terms involved. This is fine, but there remains the (rather interesting) question of exactly how these terms are defined for a molecule in silico, especially because your question asks specifically about the GDB-13 database. Of course, one could evaluate "synthetic accessibility" by trying to make the molecule in the lab: if you succeed, then it is synthetically accessible, and if not, then it isn't. Likewise, one could measure "stability" by (potentially) making it and studying its thermodynamic properties.

However, for a database of nearly 1 billion compounds, this is obviously infeasible. So the designers of the database have to resort to heuristics, or rules, to determine whether a given molecular structure is "synthetically accessible" and "stable". The rules are outlined in the original paper in which the GDB-13 database was published,[1] as well as in the earlier GDB-11 paper.[2] Both are very interesting papers and well worth a read if you are interested in cheminformatics in general.

Because all organic molecules have a carbon chain or ring as their backbone, the first step is to construct all possible frameworks of carbons and hydrogens using graph theory (up to a maximum number of carbons, of course). At this stage there is not yet any consideration of which other atoms may appear in a typical organic molecule. However, some computer-generated frameworks are almost certainly impossible to make because of strain or other considerations (such as Bredt's rule). On top of that, it appears that they intentionally removed graphs containing three- and four-membered rings, which can be stable but would have dominated the database, thus making it highly unrepresentative:[2]

The vast majority of these graphs (99.8%) contained three‐ and four‐membered rings and was excluded to avoid generating a database consisting almost exclusively of such small rings.
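As a rough illustration of this kind of graph-level filter, here is a minimal sketch in Python that flags any skeleton containing a ring of four atoms or fewer. The function names and the edge-list representation are my own assumptions for illustration; this is not the code used by the GDB authors, whose actual exclusion criteria are described in the paper.

```python
from collections import deque


def shortest_cycle_through_edge(adj, u, v):
    """Return the length of the shortest cycle containing edge u-v,
    or None if the edge is not part of any cycle."""
    # Breadth-first search from u to v that is not allowed to use
    # the direct edge u-v; the cycle length is that path plus one.
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if node == u and nbr == v:
                continue  # skip the edge under test
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                if nbr == v:
                    return dist[nbr] + 1
                queue.append(nbr)
    return None


def has_small_ring(edges, max_size=4):
    """True if the (simple, undirected) graph has a ring of at most max_size atoms."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for a, b in edges:
        length = shortest_cycle_through_edge(adj, a, b)
        if length is not None and length <= max_size:
            return True
    return False


# Hypothetical example skeletons, given as lists of bonded atom indices.
cyclopropane = [(0, 1), (1, 2), (2, 0)]
cyclohexane = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
norbornane = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 3)]
```

With these definitions, `has_small_ring(cyclopropane)` is true, while the cyclohexane and norbornane skeletons (five- and six-membered rings only) pass the filter.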

The next step is to introduce heteroatoms, which can be done fairly straightforwardly because in organic chemistry atoms have very specific bonding patterns. Thus, for example, because carbon forms four bonds and nitrogen three, you could replace a $\ce{CH2}$ group in a molecule with $\ce{NH}$. This is not trivial to do in the lab, but is very easy for a computer:[2]

[...] all possible atom‐type combinations by introducing carbon, nitrogen, oxygen, and fluorine (as a model halogen) at each node

although for GDB-13, they have ignored fluorines:[1]

We also eliminated fluorine because it was rarely found and never considered in our group for synthesis in virtual-screening guided drug discovery applications of GDB-11.

and also added chlorines (replacing $\ce{OH}$ groups in molecules which have them) and sulfurs (replacing $\ce{O}$ atoms).
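To make the valence argument concrete, here is a minimal sketch of how element types could be assigned to the nodes of a skeleton graph. It assumes the GDB-11 element set (C, N, O, F) with the usual valences of 4, 3, 2 and 1, fills the remaining valence with hydrogens, and makes no attempt to remove symmetry duplicates or unstable combinations. The function name and data layout are hypothetical, not taken from the papers.

```python
from itertools import product

# Usual valences in neutral organic molecules (an assumption of this sketch).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


def enumerate_atom_types(bonds, n_atoms):
    """Yield (elements, hydrogens) for every valence-allowed assignment.

    bonds is a list of (atom_a, atom_b, bond_order) tuples.
    """
    # Valence already consumed at each node by the skeleton bonds.
    used = [0] * n_atoms
    for a, b, order in bonds:
        used[a] += order
        used[b] += order

    # An element may sit at a node only if its valence covers the bonds there.
    choices = [
        [el for el, val in MAX_VALENCE.items() if val >= used[i]]
        for i in range(n_atoms)
    ]

    for combo in product(*choices):
        # Hydrogens fill whatever valence is left over.
        hydrogens = tuple(MAX_VALENCE[el] - used[i] for i, el in enumerate(combo))
        yield combo, hydrogens


# Three-atom chain (the propane skeleton), all single bonds.
propane_bonds = [(0, 1, 1), (1, 2, 1)]
assignments = list(enumerate_atom_types(propane_bonds, 3))
```

For the three-atom chain, the terminal nodes can be C, N, O or F and the central node C, N or O, giving 4 × 3 × 4 = 48 raw assignments. One of them is `(("C", "N", "C"), (3, 1, 3))`, i.e. dimethylamine, which is exactly the $\ce{CH2}$ to $\ce{NH}$ replacement mentioned above. Many of the others (such as F–O–F) are the kind of heteroatom-rich structure that the later filters are meant to remove.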

The problem with doing that exhaustively is that too many heteroatoms tend to make a molecule very unstable. For example, directly joining heteroatoms via single bonds is (generally) a good recipe for making explosive compounds. So, essentially all such molecules were removed. The authors found that most of this work could be very quickly automated by using an even simpler heuristic: simply removing any compound with a high heteroatom:carbon ratio.[1]

Because most of the rejected molecules contained multiple heteroatoms, we reasoned that it might be possible to accelerate the database computation using a very fast “element-ratio” filter. Analysis of databases of known compounds suggested cutoff values of (N + O)/C < 1.0, N/C < 0.571, and O/C < 0.666

and specifically disallowed cases, which have to be filtered out in another step, included:[1]

The following functional groups are discarded as too unstable to be considered: hemiacetals, hemi-aminals, aminals, acyclic imines, non-aromatic enols, orthoesters and analogs, carbamic acids, non-aromatic enamines (except acylated enamines and vinylogous enamines), beta-keto-carboxylic acids and beta imino-carboxylic acids, and all compounds containing both a primary or secondary amines and an aldehyde or ketone.
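The element-ratio cutoffs quoted above are simple enough to sketch directly. The following Python snippet applies the three published thresholds to a molecular formula; the formula parser, the function names, and the choice to reject carbon-free formulas are my own assumptions, and the functional-group filter (which needs substructure matching) is not covered here.

```python
import re


def count_elements(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def passes_element_ratio_filter(formula):
    """Apply the cutoffs (N + O)/C < 1.0, N/C < 0.571 and O/C < 0.666."""
    counts = count_elements(formula)
    c = counts.get("C", 0)
    n = counts.get("N", 0)
    o = counts.get("O", 0)
    if c == 0:
        # Assumption of this sketch: a formula without carbon is rejected.
        return False
    return (n + o) / c < 1.0 and n / c < 0.571 and o / c < 0.666


ethanol_ok = passes_element_ratio_filter("C2H6O")    # O/C = 0.5
pyridine_ok = passes_element_ratio_filter("C5H5N")   # N/C = 0.2
urea_ok = passes_element_ratio_filter("CH4N2O")      # (N + O)/C = 3
glycerol_ok = passes_element_ratio_filter("C3H8O3")  # O/C = 1
```

Here ethanol and pyridine pass, while urea and glycerol are rejected by the ratio test alone, without any need to inspect their structures. That is what makes this filter so cheap compared with the functional-group checks.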

This is also helpful for preparing a database of "druglike" molecules, because molecules with too many heteroatoms are very polar, which makes it almost impossible for them to diffuse across cell membranes.

What you get at the end is unlikely to be 100% "synthetically accessible". However, the filters applied mean that if you were to pick a random molecule from the database, there is an exceptionally good chance that you could make it in the lab if you so desired. Quoting from the authors one final time:[2]

The database construction strategy chosen also ensures that the majority of GDB, although presently unknown, should be synthetically accessible.


  1. Blum, L. C.; Reymond, J.-L. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. J. Am. Chem. Soc. 2009, 131 (25), 8732–8733. DOI: 10.1021/ja902302h.
  2. Fink, T.; Bruggesser, H.; Reymond, J.-L. Virtual Exploration of the Small-Molecule Chemical Universe below 160 Daltons. Angew. Chem. Int. Ed. 2005, 44 (10), 1504–1508. DOI: 10.1002/anie.200462457.

Solution 2:

After spending the afternoon googling around, I've written a simple answer to my own question (essentially a basic definition of terms). For a more complete answer, see the accepted answer by orthocresol:

Synthetic Accessibility - Refers to ease of synthesis, i.e. how difficult a compound is to make (synthesize) in a lab.

Organic Molecule - An organic molecule is a molecule that contains carbon atoms (generally bonded to other carbon atoms as well as hydrogen atoms). Although carbon is present in all organic compounds, other elements such as hydrogen, oxygen, nitrogen, sulfur, and phosphorus are also common in these molecules.

Stable Molecule - In general, this is more difficult to ascertain as an absolute. The idea is that a molecule that does not decompose (i.e. is persistent) in most environments may be considered stable. For more details the following link contains quite a bit of information.


So, in short:

The phrase “stable and synthetically accessible organic molecules” refers to molecules that contain carbon (generally bonded to other atoms and hydrogen), can be created with relative ease in a laboratory, and are relatively persistent in most environments (i.e. are not prone to spontaneous decomposition to lower-energy states).