Python: Check if string and its substring are existing in the same list

If you have a large list of words, it might be a good idea to use a suffix tree.

Here's a package on PyPI.

Once you created the tree, you can call find_all(word) to get the index of every occurence of word. You simply need to keep the strings which only appear once:

from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)

st.find_all('blood')
# [0, 20, 26, 46]

st.find_all('high blood pressure')
# [41]

[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']

words needs to be a unique list of strings, so you might need to call list(set(words)) before generating the suffix-tree.

As far as I can tell, the whole script should run in O(n), with n being the total length of the strings.


You could use this one liner:

b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [ i for i in b if not any( [ i in a for a in b if a != i]   )]

I admit this is O(n2) and maybe will be slow in performance for large inputs.

This is basically a list comprehension of the following:

word_list =  ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

result = []
for this_word in word_list:
    words_without_this_word = [ other_word  for other_word in word_list if other_word != this_word]  
    found = False
    for other_word in words_without_this_word:
        if this_word in other_word:
            found = True

    if not found:
        result.append(this_word)

result

Tags:

Python

Nlp