Chemistry - Substructure search with RDKit

Solution 1:

I'm not sure why your pattern isn't matching, but when I do substructure matching in RDKit I use SMARTS rather than SMILES: https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

Here is what I would have done:

from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6', 'c1ccccc1']

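# SMARTS query: [#6] matches any carbon (aromatic or aliphatic), ~ matches any bond
pattern = Chem.MolFromSmarts('[#6]12~[#6][#6]~[#6][#6]~[#6]1[#6]3~[#6][#6]~[#6][#6]4~[#6]3[#6]2~[#6][#6]~[#6]4')
for idx, smiles in enumerate(smiles_list):
    m = Chem.MolFromSmiles(smiles)
    # HasSubstructMatch returns True if the pattern occurs anywhere in the molecule
    print("Structure {}: pattern found {}".format(idx + 1, m.HasSubstructMatch(pattern)))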

And the output:

Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found True
Structure 4: pattern found True
Structure 5: pattern found True
Structure 6: pattern found False

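I've added a benzene SMILES to show that the pattern isn't just matching everything. The pattern uses [#6] so that a carbon matches whether it is aromatic or aliphatic, ~ for "any bond", and leaves the remaining bonds unspecified, which in SMARTS means "single or aromatic". It could probably be improved, but I hope this helps anyway.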

Solution 2:

The following looks like a solution, unless somebody disproves it. Thank you, @Unskilled, for pointing me in the right direction.

If you use structure 1 as the pattern, with smiles_1a: C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4, you will not find structures 3 and 5. If you convert smiles_1a to SMILES with OpenBabel, you get: c12ccccc1c1cccc3c1ccc3. If you take this SMILES string and convert it to a mol via Chem.MolFromSmarts(), you will find all structures. However, I don't want to use additional external tools.

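So, in RDKit, if you convert smiles_1a to a mol and convert this mol back to SMILES, you get c1ccc2c(c1)-c1cccc3cccc-2c13. If you search with this, you still will not find structures 3 and 5, probably because of the two explicitly defined single bonds (-): a pattern bond that is fixed as single presumably cannot match where RDKit perceives the corresponding bond in the target as aromatic. However, if you replace - with ~ (any bond) and parse the result with Chem.MolFromSmarts(), you get smiles_1b: c1ccc2c(c1)~c1cccc3cccc~2c13. With this, you also find structures 3 and 5. Happy ending, hopefully.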
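To check the single-bond explanation rather than take it on faith, one option is to list the bonds that RDKit does not flag as aromatic after parsing. This is only a diagnostic sketch; the helper name non_aromatic_bonds is made up for illustration and is not part of RDKit. For structure 1 it should report the two bonds that show up as - in the canonical SMILES; if the explanation holds, the corresponding bonds in structure 3 would be missing from its list.

```python
from rdkit import Chem

def non_aromatic_bonds(smiles):
    """Hypothetical helper: return (begin_idx, end_idx, bond_type) for every
    bond that RDKit does not perceive as aromatic after parsing the SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: {}".format(smiles))
    return [
        (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondType())
        for b in mol.GetBonds()
        if not b.GetIsAromatic()
    ]

# Structure 1 (the pattern) and structure 3 (one of the missed targets)
structure_1 = 'C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4'
structure_3 = 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4'

for name, smi in [("Structure 1", structure_1), ("Structure 3", structure_3)]:
    print(name, non_aromatic_bonds(smi))
```

The atom indices follow the order of the atoms in the input SMILES, so the reported bonds can be traced back to the string by counting atoms.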

Code (I also added benzene to have a non-match):

from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6', 'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6','c1ccccc1']

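def search_structure(pattern):
    # Report, for every molecule in smiles_list, whether it contains the pattern
    for idx, smiles in enumerate(smiles_list):
        m = Chem.MolFromSmiles(smiles)
        print("Structure {}: pattern found {}".format(idx + 1, m.HasSubstructMatch(pattern)))

# 1a: structure 1 parsed as SMILES and used directly as the query
smiles_1a  = smiles_list[0]
pattern_1a = Chem.MolFromSmiles(smiles_1a)
# 1b: RDKit's canonical SMILES with explicit single bonds (-) relaxed to "any bond" (~), parsed as SMARTS
smiles_1b  = Chem.MolToSmiles(pattern_1a).replace('-', '~')
pattern_1b = Chem.MolFromSmarts(smiles_1b)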

print("\nSMILES 1a: {}".format(smiles_1a))
search_structure(pattern_1a)
print("\nSMILES 1b: {}".format(smiles_1b))
search_structure(pattern_1b)

Result:

SMILES 1a: C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False
Structure 6: pattern found False

SMILES 1b: c1ccc2c(c1)~c1cccc3cccc~2c13
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found True
Structure 4: pattern found True
Structure 5: pattern found True
Structure 6: pattern found False
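
One caveat about the plain .replace('-', '~'): in SMILES the - character is also used for negative charges inside square brackets, for example [O-]. For the hydrocarbons here that does not matter, but for charged molecules a blanket replace would corrupt the atom specification. A minimal sketch of a safer replacement, which only touches - outside brackets, could look like this (the function name loosen_single_bonds is made up for illustration):

```python
def loosen_single_bonds(smiles):
    """Replace explicit single bonds (-) with the SMARTS any-bond symbol (~),
    leaving charges inside bracket atoms such as [O-] untouched."""
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == '[':
            in_bracket = True
        elif ch == ']':
            in_bracket = False
        # Outside brackets a dash is a bond symbol; inside it is a charge
        if ch == '-' and not in_bracket:
            out.append('~')
        else:
            out.append(ch)
    return ''.join(out)

print(loosen_single_bonds('c1ccc2c(c1)-c1cccc3cccc-2c13'))
print(loosen_single_bonds('[O-]C(=O)c1ccccc1-c1ccccc1'))
```

The result is meant to be passed to Chem.MolFromSmarts(), just like smiles_1b above.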
