Chemistry - Is converting SMARTS to SMILES a "lossless" operation?

Solution 1:

SMARTS is deliberately designed to be a superset of SMILES. That is, any valid SMILES depiction should also be a valid SMARTS query, one that will retrieve the very structure that the SMILES string depicts.

However, as a query language, SMARTS can be more general than SMILES is. For example, CC as a SMILES string depicts a single compound: ethane. As a SMARTS query, though, CC will match ethane, but will also match propane, acetic acid, cyclohexane, vancomycin, etc.
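A minimal sketch of that difference, assuming RDKit (the toolkit used in the second answer) and an arbitrary list of test molecules: parsed as SMARTS, CC is a substructure pattern, so it should match anything containing two aliphatic carbons joined by a single bond, and nothing that lacks them.

```python
from rdkit import Chem

# "CC" read as a SMARTS query: two aliphatic carbons joined by a
# single (or aromatic) bond, with no constraint on anything else.
pattern = Chem.MolFromSmarts("CC")

# Illustrative targets, written as SMILES (chosen for this example).
targets = {
    "ethane": "CC",
    "propane": "CCC",
    "acetic acid": "CC(=O)O",
    "cyclohexane": "C1CCCCC1",
    "ethene": "C=C",
    "benzene": "c1ccccc1",
}

results = {
    name: Chem.MolFromSmiles(smiles).HasSubstructMatch(pattern)
    for name, smiles in targets.items()
}

for name, matched in results.items():
    print(f"{name}: {matched}")
```

Ethene and benzene are included as counterexamples: the double bond and the aromatic carbons do not satisfy the pattern, so the query is broad but not unconstrained.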

There are also SMARTS strings which are not valid SMILES strings. You list several in your question: [#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8] is not a valid SMILES representation, unless your SMILES parser is being particularly generous. Even if it is, [#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8,#7] is also a valid SMARTS query which should match your molecule, but not even a generous SMILES parser is likely to accept it.

Despite being similar, SMARTS and SMILES are intended for fundamentally different things - the purpose of SMILES is to represent particular compounds, whereas SMARTS represents a query against a range of possible molecules, or an abstract description of a set of possible molecules. As such, they're not inter-convertible, even if the strings are literally identical.

For your particular case, yes, CC1CC=CC(=O)C1 is both a valid SMILES and a valid SMARTS, but as a SMARTS query, it represents not just 5-methyl-2-cyclohexen-1-one, but also 5-propyl-2-cyclohexen-1-one and 3-hydroxy-5-butyl-6-amino-2-cyclohexen-1-one, as well as many others, all of which contain that substructure. The SMARTS viewer you link doesn't depict this explicitly, because it's implicit in the use of SMARTS that it's a substructure pattern for a broader class of compounds.

Solution 2:

Thilo asked a similar question on the rdkit-discuss mailing list, where Andrew Dalke chimed in with this response, which he gave me permission to post here. The answer uses the Python-based RDKit library to give examples of converting between SMILES and SMARTS, and of related tasks.

From Andrew Dalke:

On Apr 19, 2017, at 12:03, Thilo Bauer wrote:

Is converting SMARTS to SMILES a "lossless" operation, or does one lose information on doing so?

It is obviously not lossless if you include terms that cannot be represented in SMILES.

>>> from rdkit import Chem
>>> Chem.MolToSmiles(Chem.MolFromSmarts("[C,N]"))
'C'
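One way to see what is lost is to check what the original query matches. This is a small sketch; the three test molecules are arbitrary choices made for this example.

```python
from rdkit import Chem

# Aliphatic carbon OR aliphatic nitrogen.
query = Chem.MolFromSmarts("[C,N]")

methane = Chem.MolFromSmiles("C")
ammonia = Chem.MolFromSmiles("N")
water = Chem.MolFromSmiles("O")

matches = {
    "methane": methane.HasSubstructMatch(query),
    "ammonia": ammonia.HasSubstructMatch(query),
    "water": water.HasSubstructMatch(query),
}
print(matches)

# A SMILES atom is a single element, so the "or nitrogen" alternative
# cannot survive being written out as SMILES.
print(Chem.MolToSmiles(query))
```

The query accepts both methane and ammonia, whereas a SMILES string can name only one element per atom, so the alternative is dropped on conversion.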

or which don't make sense as a molecule:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("c"))
'c'
>>> Chem.MolFromSmiles("c")
[23:02:24] non-ring atom 0 marked aromatic
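In a script, that failure is easy to miss: when sanitization fails, RDKit's MolFromSmiles logs the error and returns None rather than raising. A short sketch of checking for it:

```python
from rdkit import Chem

pattern = Chem.MolFromSmarts("c")   # fine as a query: any aromatic carbon
molecule = Chem.MolFromSmiles("c")  # rejected: a lone aromatic atom is not a molecule

print(pattern is not None)
print(molecule is None)
```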

It also loses some information which could be represented in SMILES:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("[NH4+]"))
'N'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("C[N+]1(C)CCCCC1"))
'CN1(C)CCCCC1'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("[12C]"), isomericSmiles=True)
'C'

Do be careful if you want to handle aromatic atoms and bonds:

>>> Chem.MolToSmiles(Chem.MolFromSmarts("[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1"))
'C1:C:C:C:C:C:1'
>>> Chem.MolToSmiles(Chem.MolFromSmarts("c=1-c=c-c=c-c=1"))
'c1=c-c=c-c=c-1'

Background: I've got three different SMARTS strings representing the same structure - at least when depicting it. Also all three strings result in the exact same SMILES (see code and output below).

It looks like you want SMARTS canonicalization.

In general this is hard, because SMARTS can include boolean expressions and recursive SMARTS.

If you limit yourself to patterns like '[#6]-1=[#6]-[#6]...', with only atomic numbers and single/double/triple bonds, then I think RDKit will do what you want.
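A sketch of that idea, under the assumption that RDKit's canonical SMILES writer behaves on these simple query molecules the way it does on ordinary ones. The first string is the one from the question; the other two are reorderings of the same atoms and bonds written for this example.

```python
from rdkit import Chem

# Three spellings of the same simple pattern, differing only in the
# order in which the atoms are written.
variants = [
    "[#6]-1=[#6]-[#6](-[#6]-[#6](-[#6]-1)-[#6])=[#8]",
    "[#8]=[#6]-1-[#6]=[#6]-[#6]-[#6](-[#6])-[#6]-1",
    "[#6]-[#6]-1-[#6]-[#6]=[#6]-[#6](=[#8])-[#6]-1",
]

# Parse each as SMARTS, then write it back out as canonical SMILES.
canonical = [Chem.MolToSmiles(Chem.MolFromSmarts(s)) for s in variants]

for smarts, smiles in zip(variants, canonical):
    print(smiles, "<-", smarts)

# If the strings agree, they can act as a canonical key for the pattern.
print(len(set(canonical)) == 1)
```

This only makes sense for patterns restricted to atomic numbers and explicit single/double/triple bonds. Anything with boolean expressions or recursive SMARTS would lose meaning in the conversion, as shown earlier.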

[[CF: Andrew Dalke also had important answer-level commentary on R.M.'s answer, which I copy below.]]

From chemistry stack exchange, an answer contributed by user R.M.:

SMARTS is deliberately designed to be a superset of SMILES. That is, any valid SMILES depiction should also be a valid SMARTS query, one that will retrieve the very structure that the SMILES string depicts.

Except, that last clause isn't true. Try matching tritium against itself.

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("[3H]")
>>> pat = Chem.MolFromSmarts("[3H]")
>>> mol.HasSubstructMatch(pat)
False

For hydrogens you must use '#1', because H in SMARTS means something different.

>>> pat2 = Chem.MolFromSmarts("[3#1]")
>>> mol.HasSubstructMatch(pat2)
True
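A short sketch of the two meanings, using ethanol as an arbitrary example: inside a bracket atom, H after an element symbol constrains the hydrogen count, while #1 refers to hydrogen as an atom in the graph.

```python
from rdkit import Chem

ethanol = Chem.MolFromSmiles("CCO")

# H after an element symbol is a hydrogen-count constraint.
methyl = Chem.MolFromSmarts("[CH3]")     # aliphatic carbon with 3 hydrogens
methylene = Chem.MolFromSmarts("[CH2]")  # aliphatic carbon with 2 hydrogens
print(ethanol.GetSubstructMatches(methyl))
print(ethanol.GetSubstructMatches(methylene))

# [#1] matches hydrogen atoms themselves, which are only present in the
# molecular graph once they have been made explicit.
hydrogen = Chem.MolFromSmarts("[#1]")
implicit_count = len(ethanol.GetSubstructMatches(hydrogen))
explicit_count = len(Chem.AddHs(ethanol).GetSubstructMatches(hydrogen))
print(implicit_count, explicit_count)
```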

SMILES input under Daylight and most other toolkits gets normalized to the chemistry model, including aromaticity perception:

>>> mol = Chem.MolFromSmiles("C1=CC=CC=C1")
>>> pat = Chem.MolFromSmarts("C1=CC=CC=C1")
>>> mol.HasSubstructMatch(pat)
False
>>> pat2 = Chem.MolFromSmarts("c1ccccc1")
>>> mol.HasSubstructMatch(pat2)
True

RDKit also does a small amount of additional normalization, or 'sanitization' to use the RDKit term. For example, it will convert "neutral 5 coordinate Ns with double bonds to Os to the zwitterionic form" (see GraphMol/MolOps.cpp):

>>> s = "CN(=O)=O"
>>> mol = Chem.MolFromSmiles(s)
>>> pat = Chem.MolFromSmarts(s)
>>> mol.HasSubstructMatch(pat)
False
>>> Chem.MolToSmiles(mol)
'C[N+](=O)[O-]'

I believe that the output SMILES from a toolkit, assuming that the SMILES doesn't have an explicit hydrogen, can be used as a SMARTS which will match the molecule made from that same SMILES, by that same toolkit.
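A sketch of that round trip for the inputs discussed above plus acetic acid (an arbitrary extra case). It illustrates the claim on a few molecules and is not a proof that it holds in general.

```python
from rdkit import Chem

inputs = ["C1=CC=CC=C1", "CN(=O)=O", "CC(=O)O"]

round_trip = {}
for smiles in inputs:
    mol = Chem.MolFromSmiles(smiles)
    written = Chem.MolToSmiles(mol)        # SMILES after the toolkit's normalization
    pattern = Chem.MolFromSmarts(written)  # reuse that output as a SMARTS query
    round_trip[smiles] = (written, mol.HasSubstructMatch(pattern))

for smiles, (written, matched) in round_trip.items():
    print(smiles, "->", written, matched)
```

The input strings themselves fail as patterns for the first two molecules, as shown earlier. It is the toolkit's own output, which already reflects aromaticity perception and sanitization, that is expected to match.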
