Python: how to determine if a list of words exist in a string

This function was found by Peter Gibson (below) to be the most performant of the answers here. It is good for datasets one may hold in memory (because it creates a list of words from the string to be searched and then a set of those words):

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

Usage:

my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
    print('One or more words found!')

Which prints One or words found! to stdout.

It does return the actual words found:

for word in words_in_string(my_word_list, a_string):
    print(word)

Prints out:

three
two
one

For data so large you can't hold it in memory, the solution given in this answer would be very performant.


To satisfy my own curiosity, I've timed the posted solutions. Here are the results:

TESTING: words_in_str_peter_gibson          0.207071995735
TESTING: words_in_str_devnull               0.55300579071
TESTING: words_in_str_perreal               0.159866499901
TESTING: words_in_str_mie                   Test #1 invalid result: None
TESTING: words_in_str_adsmith               0.11831510067
TESTING: words_in_str_gnibbler              0.175446796417
TESTING: words_in_string_aaron_hall         0.0834425926208
TESTING: words_in_string_aaron_hall2        0.0266295194626
TESTING: words_in_str_john_pirie            <does not complete>

Interestingly @AaronHall's solution

def words_in_string(word_list, a_string):
    return set(a_list).intersection(a_string.split())

which is the fastest, is also one of the shortest! Note it doesn't handle punctuation next to words, but it's not clear from the question whether that is a requirement. This solution was also suggested by @MIE and @user3.

I didn't look very long at why two of the solutions did not work. Apologies if this is my mistake. Here is the code for the tests, comments & corrections are welcome

from __future__ import print_function
import re
import string
import random
words = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']

def random_words(length):
    letters = ''.join(set(string.ascii_lowercase) - set(''.join(words))) + ' '
    return ''.join(random.choice(letters) for i in range(int(length)))

LENGTH = 400000
RANDOM_STR = random_words(LENGTH/100) * 100
TESTS = (
    (RANDOM_STR + ' one two three', (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR + ' one two three four five six seven eight nine ten', (
        ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
        set(['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']),
        True,
        [True] * 10,
        {'one': True, 'two': True, 'three': True, 'four': True, 'five': True, 'six': True,
            'seven': True, 'eight': True, 'nine': True, 'ten':True}
        )),

    ('one two three ' + RANDOM_STR, (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR, (
        [],
        set(),
        False,
        [False] * 10,
        {'one': False, 'two': False, 'three': False, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR + ' one two three ' + RANDOM_STR, (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    ('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' three', (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    ('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' threesome', (
        ['one', 'two'],
        set(['one', 'two']),
        False,
        [True] * 2 + [False] * 8,
        {'one': True, 'two': True, 'three': False, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    )

def words_in_str_peter_gibson(words, s):
    words = words[:]
    found = []
    for match in re.finditer('\w+', s):
        word = match.group()
        if word in words:
            found.append(word)
            words.remove(word)
            if len(words) == 0: break
    return found

def words_in_str_devnull(word_list, inp_str1):
    return dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)


def words_in_str_perreal(wl, s):
    i, swl, strwords = 0, sorted(wl), sorted(s.split())
    for w in swl:
        while strwords[i] < w:  
            i += 1
            if i >= len(strwords): return False
        if w != strwords[i]: return False
    return True

def words_in_str_mie(search_list, string):
    lower_string=string.lower()
    if ' ' in lower_string:
        result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
        substr=lower_string[:lower_string.find(' ')]
        if substr in search_list and substr not in result:
            result+=substr
        substr=lower_string[lower_string.rfind(' ')+1:]
        if substr in search_list and substr not in result:
            result+=substr
    else:
        if lower_string in search_list:
            result=[lower_string]

def words_in_str_john_pirie(word_list, to_be_searched):
    for word in word_list:
        found = False
        while not found:
            offset = 0
            # Regex is expensive; use find
            index = to_be_searched.find(word, offset)
            if index < 0:
                # Not found
                break
            if index > 0 and to_be_searched[index - 1] != " ":
                # Found, but substring of a larger word; search rest of string beyond
                offset = index + len(word)
                continue
            if index + len(word) < len(to_be_searched) \
                    and to_be_searched[index + len(word)] != " ":
                # Found, but substring of larger word; search rest of string beyond
                offset = index + len(word)
                continue
            # Found exact word match
            found = True    
    return found

def words_in_str_gnibbler(words, string_to_be_searched):
    word_set = set(words)
    found = []
    for match in re.finditer(r"\w+", string_to_be_searched):
        w = match.group()
        if w in word_set:
             word_set.remove(w)
             found.append(w)
    return found

def words_in_str_adsmith(search_list, big_long_string):
    counter = 0
    for word in big_long_string.split(" "):
        if word in search_list: counter += 1
        if counter == len(search_list): return True
    return False

def words_in_string_aaron_hall(word_list, a_string):
    def words_in_string(word_list, a_string):
        '''return iterator of words in string as they are found'''
        word_set = set(word_list)
        pattern = r'\b({0})\b'.format('|'.join(word_list))
        for found_word in re.finditer(pattern, a_string):
            word = found_word.group(0)
            if word in word_set:
                word_set.discard(word)
                yield word
                if not word_set:
                    raise StopIteration
    return list(words_in_string(word_list, a_string))

def words_in_string_aaron_hall2(word_list, a_string):
    return set(word_list).intersection(a_string.split())

ALGORITHMS = (
        words_in_str_peter_gibson,
        words_in_str_devnull,
        words_in_str_perreal,
        words_in_str_mie,
        words_in_str_adsmith,
        words_in_str_gnibbler,
        words_in_string_aaron_hall,
        words_in_string_aaron_hall2,
        words_in_str_john_pirie,
        )

def test(alg):
    for i, (s, possible_results) in enumerate(TESTS):
        result = alg(words, s)
        assert result in possible_results, \
            'Test #%d invalid result: %s ' % (i+1, repr(result))

COUNT = 10
if __name__ == '__main__':
    import timeit
    for alg in ALGORITHMS:
        print('TESTING:', alg.__name__, end='\t\t')
        try:
            print(timeit.timeit(lambda: test(alg), number=COUNT)/COUNT)
        except Exception as e:
            print(e)

Easy way:

filter(lambda x:x in string,search_list)

if you want the search to ignore character's case you can do this:

lower_string=string.lower()
filter(lambda x:x.lower() in lower_string,search_list)

if you want to ignore words that are part of bigger word such as three in threesome:

lower_string=string.lower()
result=[]
if ' ' in lower_string:
    result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
    substr=lower_string[:lower_string.find(' ')]
    if substr in search_list and substr not in result:
        result+=[substr]
    substr=lower_string[lower_string.rfind(' ')+1:]
    if substr in search_list and substr not in result:
        result+=[substr]
else:
    if lower_string in search_list:
        result=[lower_string]


If performance is needed:
arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))

EDIT: this method was the fastest in an example that searches for 1,000 words in a string containing 400,000 words but if we increased the string to be 4,000,000 the previous method is faster.


if string is too long you should do low level search and avoid converting it to list:
def safe_remove(arr,elem):
    try:
        arr.remove(elem)
    except:
        pass

not_found=search_list[:]
i=string.find(' ')
j=string.find(' ',i+1)
safe_remove(not_found,string[:i])
while j!=-1:
    safe_remove(not_found,string[i+1:j])
    i,j=j,string.find(' ',j+1)
safe_remove(not_found,string[i+1:])

not_found list contains words that are not found, you can get the found list easily, one way is list(set(search_list)-set(not_found))

EDIT: the last method appears to be the slowest.

Tags:

Python

Regex