How to slice a string input at a certain unknown index

One way to find the index of the question's first word is to look for the first token that is a real word (assuming you are interested in English words). The pyenchant library can do the dictionary check:

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    # Dict.check() already returns a bool, so no conditional is needed.
    return GLOSSARY.check(word)

sentences = [
"eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

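for sentence in sentences:
    # split() breaks on any whitespace, so i is a word index, not a character offset.
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break  # stop at the first dictionary word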

The code above prints:

index: 3 => What
index: 0 => What
index: 0 => Given
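
The loop above reports a word index, while slicing the original string needs a character offset. Here is a minimal sketch of that extra step. The helper name slice_from_first_word and the tiny KNOWN_WORDS vocabulary are assumptions made up for illustration, so that the snippet runs without pyenchant; passing enchant.Dict("en_US").check as the predicate should give the dictionary-based behaviour instead.

```python
import re

# Stand-in vocabulary (an assumption for illustration only); with pyenchant
# installed, pass enchant.Dict("en_US").check as is_word instead.
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills"}

def simple_is_word(token):
    # Strip surrounding punctuation before the lookup, e.g. "name?" -> "name".
    return token.strip("?.,!").lower() in KNOWN_WORDS

def slice_from_first_word(text, is_word=simple_is_word):
    """Return text starting at the first token accepted by is_word, or None."""
    # finditer keeps each token's character offset, which split() discards.
    for match in re.finditer(r"\S+", text):
        if is_word(match.group()):
            return text[match.start():]
    return None

print(repr(slice_from_first_word("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn")))
```

The predicate is a parameter so that the dictionary check can be swapped without touching the slicing logic.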

You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:

  • The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
  • then a sequence of non-question-mark characters [^?]+,
  • followed by a literal question mark \?.

This can still produce false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
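
To make those failure modes concrete, here is a small sketch. The two input strings are made up for illustration and do not come from the question.

```python
import re

p = r"\b[A-Z][a-z][^?]+\?"

# Acronyms only: there is no capital-then-lowercase pair, so nothing matches.
miss = re.search(p, "xx\nNASA GPS FAQ?")
print(miss)

# Acronym at the start plus a name later: the match begins at the name.
partial = re.search(p, "xx\nNASA hired Bob when?")
print(partial.group())
```

On the examples from the question, the pattern behaves as follows: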

>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
             "What is your\nlastname and email?\ndasf?lkjas",
             "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']
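
Note that re.search returns None when nothing matches, so calling .group() directly raises AttributeError on input without a recognisable question. A hedged sketch of a guard that also exposes the match position, which is what is needed to slice the string; the helper name find_question is hypothetical:

```python
import re

QUESTION = re.compile(r"\b[A-Z][a-z][^?]+\?")

def find_question(text):
    """Return (start, end, question) for the first match, or None."""
    match = QUESTION.search(text)
    if match is None:
        return None
    return match.start(), match.end(), match.group()

text = "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
result = find_question(text)
if result is not None:
    start, end, question = result
    print(repr(text[:start]))   # everything before the question
    print(repr(question))       # the question itself
    print(repr(text[end:]))     # everything after the question
```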

If that's one blob of text, you can use findall instead of search:

>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']
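
If the positions matter as well as the text, re.finditer yields match objects rather than strings. A small sketch with the same pattern; the sample text is a made-up stand-in:

```python
import re

pattern = re.compile(r"\b[A-Z][a-z][^?]+\?")
text = "junk\nWhat is your name?\nxx yy\nHow old are you?\nzz"

# Each match object carries the start and end offsets of one question.
spans = [(m.start(), m.end()) for m in pattern.finditer(text)]
questions = [text[start:end] for start, end in spans]
print(spans)
print(questions)
```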

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?" 
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
