Chemistry - SMILES vs. Graph representation in deep learning

Solution 1:

I agree that there seems to be a shift towards using graph representations over SMILES strings. I personally think this is a good thing, and I'll try to explain why, but even if there's nothing inherently better about graph representations of molecules, there is a very practical reason people are moving towards graph representations.

So, first of all, as you point out, both representations share a limitation: each is necessarily a reduction of the full information needed to specify a molecular structure. For instance, neither a bare graph nor a SMILES string can distinguish between two conformers which differ only in the direction in which a free $\ce{O-H}$ group points. Likewise, a bare graph cannot tell the boat, chair, and planar forms of cyclohexane apart, as they all have identical connectivity. Nonetheless, a graph is a much more flexible representation than a SMILES string because it is quite common to add weights to edges, which could be bond distances, or to add parameters to nodes, which could describe angles to other nodes. So, you can include some of the information contained in the Cartesian coordinates without the problems associated with deriving your molecular representation directly from those coordinates.
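To make the idea of edge weights and node parameters concrete, here is a minimal sketch in plain Python. The `MolGraph` class and the feature names (`length_angstrom`, `angle_deg`) are hypothetical, made up for illustration, and the water geometry is only the approximate textbook one. The point is just that two structures with identical connectivity become distinguishable once geometric features are attached.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Hypothetical minimal molecular graph: nodes and edges both carry features."""
    nodes: dict = field(default_factory=dict)  # atom index -> feature dict
    edges: dict = field(default_factory=dict)  # frozenset({i, j}) -> feature dict

    def add_atom(self, idx, element, **features):
        self.nodes[idx] = {"element": element, **features}

    def add_bond(self, i, j, **features):
        self.edges[frozenset((i, j))] = dict(features)

    def connectivity(self):
        """Bare graph: element labels plus which atoms are bonded, nothing else."""
        labels = {i: f["element"] for i, f in self.nodes.items()}
        return labels, set(self.edges)


def make_water(oh_length, hoh_angle):
    """Build H-O-H with a bond length on each edge and the angle on the O node."""
    g = MolGraph()
    g.add_atom(0, "O", angle_deg=hoh_angle)
    g.add_atom(1, "H")
    g.add_atom(2, "H")
    g.add_bond(0, 1, length_angstrom=oh_length)
    g.add_bond(0, 2, length_angstrom=oh_length)
    return g


# Approximate equilibrium geometry versus an arbitrary distorted one.
water = make_water(0.96, 104.5)
distorted = make_water(1.10, 120.0)

# Same bare graph, different decorated graph.
print(water.connectivity() == distorted.connectivity())
print(water.edges == distorted.edges)
```

The first comparison only sees connectivity, so it cannot separate the two geometries; the second one can, because the edge features differ.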

This hints at the very practical reason why people are using graph representations. Basically, graphs are used in machine learning models in many fields other than chemistry. So, it's very nice to not have to reinvent the wheel, especially when people have so much success with things like transfer learning, where you just take a pre-trained model and re-train it for your own purposes.

Also, very often the first step in training a neural network is making some transformation to your data. For instance, graph convolutional neural networks have been successful in many tasks, so why not just use a convolutional filter on your graph representation of a molecule? You could do this with a SMILES string, but you would probably first just transform the string into something resembling a graph.
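As a rough illustration of what a convolution on a molecular graph does, here is a sketch of one round of neighbourhood averaging in plain Python. This is a deliberately stripped-down stand-in: real graph convolutional layers also apply learned weight matrices and a nonlinearity, and usually a different normalization. The function name `message_pass` and the one-hot features are assumptions for the example.

```python
def message_pass(features, adjacency):
    """One round of averaging each atom's features with those of its neighbours."""
    updated = {}
    for atom, vec in features.items():
        group = [vec] + [features[n] for n in adjacency[atom]]
        # Column-wise mean over the atom itself and its bonded neighbours.
        updated[atom] = [sum(col) / len(group) for col in zip(*group)]
    return updated


# Heavy-atom graph of ethanol, C-C-O. Features are one-hot: [is_carbon, is_oxygen].
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}

h1 = message_pass(features, adjacency)
for atom in sorted(h1):
    print(atom, h1[atom])
```

The two carbons start with identical features, but after one round the carbon bonded to oxygen has picked up some "oxygen character" and the terminal carbon has not. Stacking more rounds spreads information further along the bonds.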


As to your specific points about chirality, aromaticity, etc.: all of this information can be attached to a graph via parameters belonging to each node, although I would personally avoid giving information that isn't strictly necessary. That is, there is nothing special about a bond which is in an aromatic ring; you need to provide enough data that the model can learn about this on its own. If you tell it that bonds in aromatic rings are special enough to get another parameter, this is probably going to bias the model in some unforeseen way. Chirality, on the other hand, is easily handled by a simple parameter attached to each node.
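A small sketch of the chirality point, again in plain Python. The `chirality` field and its `"R"`/`"S"` values are a hypothetical encoding chosen for this example; actual toolkits use their own conventions. Two enantiomers of CHFClBr have exactly the same connectivity, and a single per-node tag is enough to tell them apart.

```python
def halomethane(center_tag):
    """Graph of CHFClBr; `center_tag` is a hypothetical chirality label on the carbon."""
    elements = ["C", "H", "F", "Cl", "Br"]
    nodes = {i: {"element": el, "chirality": None} for i, el in enumerate(elements)}
    nodes[0]["chirality"] = center_tag
    edges = {frozenset((0, j)) for j in range(1, 5)}  # carbon bonded to all four others
    return nodes, edges


def without_chirality(nodes):
    """Drop the chirality tag, keeping only the element label of each node."""
    return {i: {"element": f["element"]} for i, f in nodes.items()}


nodes_r, edges_r = halomethane("R")
nodes_s, edges_s = halomethane("S")

same_connectivity = (
    edges_r == edges_s and without_chirality(nodes_r) == without_chirality(nodes_s)
)
same_with_tags = nodes_r == nodes_s

print(same_connectivity, same_with_tags)
```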

Ultimately, though, the representation depends a lot on the problem you are trying to solve. For instance, if you are trying to learn a representation of the potential energy surface, then graphs can work quite well. Perhaps more common, though, is to use so-called atom-centered symmetry functions. In this case, the actual features are abstract vectors which are guaranteed to have the relevant symmetries and smoothness needed in the potential energy surface.
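For a feel of what an atom-centered symmetry function looks like, here is a sketch of a radial function of the Behler-Parrinello type, $G_i = \sum_j e^{-\eta (R_{ij} - R_s)^2} f_c(R_{ij})$ with a cosine cutoff $f_c$. The parameter values (`eta`, `r_s`, `r_c`) and the water coordinates are arbitrary choices for the example, not recommended settings. Because the function only depends on interatomic distances and sums over neighbours, it is unchanged by rotating the molecule or reordering the atoms.

```python
import math


def cutoff(r, r_c):
    """Cosine cutoff: goes smoothly to zero at r_c and is zero beyond it."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def radial_symmetry_function(center, neighbours, eta, r_s, r_c):
    """Sum of Gaussians in the distance from `center` to each neighbour."""
    total = 0.0
    for pos in neighbours:
        r = math.dist(center, pos)
        total += math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)
    return total


# Approximate water geometry in angstrom, oxygen at the origin.
oxygen = (0.0, 0.0, 0.0)
hydrogens = [(0.757, 0.586, 0.0), (-0.757, 0.586, 0.0)]

# The same molecule rotated by 90 degrees about z, with the hydrogens swapped.
rotated = [(-0.586, -0.757, 0.0), (-0.586, 0.757, 0.0)]

params = dict(eta=4.0, r_s=1.0, r_c=6.0)  # illustrative values only
g_original = radial_symmetry_function(oxygen, hydrogens, **params)
g_rotated = radial_symmetry_function(oxygen, rotated, **params)

print(math.isclose(g_original, g_rotated, rel_tol=1e-12))
```

In practice one evaluates many such functions with different parameters (plus angular ones) to build the feature vector for each atom.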

If you're doing something more like a classification problem, then using a representation like a SMILES string might be perfectly suitable.


TL;DR

Graphs are a more flexible representation, and one that is commonly used in fields outside of chemistry. Hence, being able to draw on the knowledge of other disciplines is a huge plus, especially when you're collaborating with computer scientists who know a lot about machine learning and nothing about chemistry.

Solution 2:

It depends on how you code your molecular graphs

The idea of a 'connection table' or valence model for molecules, and thus of molecular graphs, is embedded in chemical thinking.

Let's take your four points:

  1. It's possible to design connection tables that support a variety of interactions. For example, zero order bonds can encode coordination bonds, delocalized metal-ligand interactions, etc.
  2. You can write a graph that stores a variety of stereochemistry, although admittedly axial chirality and similar cases take some work, because they are a property of the molecule as a whole rather than of any particular atom or bond. Some formats even support concepts like '55% R, 45% S' stereo centers.
  3. You would need to define an aromaticity model, although many exist and can be adopted (e.g., 'we use the SMILES aromaticity definition').
  4. For both graphs and SMILES, there are many published canonicalization algorithms (e.g., 'we use the InChI canonical atom order').
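To illustrate what canonicalization buys you (point 4), here is a toy sketch in Python. It is emphatically not the InChI algorithm or any published one: it simply tries every atom renumbering and keeps the lexicographically smallest result, which is only feasible for a handful of atoms. Bond orders and hydrogens are ignored, and the function name `canonical_form` is made up for the example.

```python
from itertools import permutations


def canonical_form(elements, bonds):
    """Brute-force canonical form: smallest (labels, bonds) over all renumberings."""
    n = len(elements)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = elements[old]
        edges = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol heavy atoms in two atom orders, as one would read them from "CCO" and "OCC".
ethanol_a = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])

# Dimethyl ether ("COC"): same atoms, different connectivity.
ether = canonical_form(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a == ethanol_b)
print(ethanol_a == ether)
```

Both ethanol inputs collapse to the same canonical form, while the ether does not, which is the property you want from any canonical atom ordering.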

In short, people have worried about cheminformatics issues for a long time:

  • Dietz 1995
  • Gasteiger 1997

Both papers indicate expansions of the standard molecular graph concepts, e.g. [Gasteiger]:

This representation overcomes the limitations of connection tables designed to only represent chemical structures with bonds localized between two atoms. The representation introduced is based on the separation of the σ- and π-electrons of bonds and the delocalization of electrons also across more than two atoms. It also allows the description of chemical compounds containing multicenter or coordinative bonds.

Alex Clark's zero order bond model linked above gets at many of these issues but in a way that's backwards compatible with the standard SD file format.
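As a rough sketch of the zero order bond idea (my own simplified reading, not Clark's actual format or its SD file encoding), consider a connection table where each bond carries an integer order and an order of 0 means "connected, but contributing nothing to the valence count". The ammine fragment and the helper names below are chosen purely for illustration.

```python
# Atoms of a single ammine ligand bound to a copper centre.
atoms = ["N", "H", "H", "H", "Cu"]

# Bonds as (atom_i, atom_j, order). The N-Cu coordination bond has order 0.
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 0)]


def neighbours(atom, bonds):
    """All atoms connected to `atom`, regardless of bond order."""
    out = []
    for i, j, _order in bonds:
        if i == atom:
            out.append(j)
        elif j == atom:
            out.append(i)
    return sorted(out)


def valence(atom, bonds):
    """Sum of bond orders at `atom`; zero order bonds add nothing."""
    return sum(order for i, j, order in bonds if atom in (i, j))


print(neighbours(0, bonds))  # nitrogen is connected to three H atoms and to Cu
print(valence(0, bonds))     # but its bond-order sum is still that of plain NH3
```

The graph still records that nitrogen and copper are connected, yet ordinary valence bookkeeping for nitrogen is left undisturbed.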

It's a long answer, but if you code a good graph representation, you can probably encode a lot of chemistry.