Retrieve definition for parenthesized abbreviation, based on letter count

Use the regex match to find where the match starts, then use Python string slicing to get the substring leading up to that position. Split the substring into words and take the last n words, where n is the length of the abbreviation.

import re
s = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'


for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()[-size:]
    definition = " ".join(words)

    print(abbr, definition)

This prints:

FHH family health history
NP nurse practitioner
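
For reuse, the same steps can be wrapped in a function. This is only a minimal sketch and not part of the original answer: the name `extract_definitions` is hypothetical, and the pattern is narrowed to `\(([A-Z]+)\)` on the assumption that abbreviations are all uppercase, so a parenthesized aside such as "(see below)" is skipped.

```python
import re


def extract_definitions(text):
    """Map each parenthesized uppercase abbreviation to the words before it.

    Hypothetical helper: assumes one definition word per abbreviation letter.
    """
    definitions = {}
    for match in re.finditer(r"\(([A-Z]+)\)", text):
        abbr = match.group(1)
        # Take as many preceding words as the abbreviation has letters.
        words = text[:match.start()].split()[-len(abbr):]
        definitions[abbr] = " ".join(words)
    return definitions


s = ('Although family health history (FHH) is commonly accepted as an '
     'important risk factor for common, chronic diseases, it is rarely '
     'considered by a nurse practitioner (NP).')
print(extract_definitions(s))
```

Returning a dict keeps the lookup separate from the printing, which may be more convenient if the definitions are needed later in a pipeline.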

Another idea is to use a recursive pattern with the PyPI regex module.

\b[A-Za-z]+\s+(?R)?\(?[A-Z](?=[A-Z]*\))\)?

See this PCRE demo at regex101.

\b[A-Za-z]+\s+ matches a word boundary, one or more letters, then one or more whitespace characters
(?R)? is the recursive part: it optionally pastes in the whole pattern from the start
\(? and \)? make the parentheses optional so that the recursion fits in between them
[A-Z](?=[A-Z]*\)) matches one uppercase letter if it is followed by a closing ) with only A-Z in between

This does not check whether the first letter of each word actually matches the letter at the corresponding position in the abbreviation.
It also does not check for an opening parenthesis in front of the abbreviation. To check for one, add a variable-length lookbehind: change [A-Z](?=[A-Z]*\)) to (?<=\([A-Z]*)[A-Z](?=[A-Z]*\)).
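
The first limitation can also be handled outside the pattern, as a post-processing step. The following is a rough sketch using only the standard `re` module rather than the recursive pattern; `initials_match` is a hypothetical helper name, the sample sentence is made up for illustration, and the comparison assumes exactly one word per abbreviation letter.

```python
import re


def initials_match(abbr, words):
    """Return True if each word starts with the corresponding abbreviation letter."""
    return len(words) == len(abbr) and all(
        word[0].upper() == letter.upper() for word, letter in zip(words, abbr)
    )


# Illustrative input: "(XYZ)" has no matching definition in front of it.
text = "It is considered by a nurse practitioner (NP) and the whole team (XYZ)."
for match in re.finditer(r"\(([A-Z]+)\)", text):
    abbr = match.group(1)
    # Candidate definition: as many preceding words as the abbreviation has letters.
    words = text[:match.start()].split()[-len(abbr):]
    if initials_match(abbr, words):
        print(abbr, " ".join(words))
    else:
        print(abbr, "(no matching definition)")
```

Definitions that contain small unabbreviated words (for example "of" or "the") would be rejected by this check, so it trades recall for fewer false positives.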

Does this solve your problem?

a = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'
# Drop periods, then split on spaces so each "(ABBR)" becomes its own token.
splitstr = a.replace('.', '').split(' ')
for i, word in enumerate(splitstr):
    if '(' in word:
        # Strip the parentheses to get the bare abbreviation.
        w = word.replace('(', '').replace(')', '')
        # The len(w) tokens just before the abbreviation form its definition.
        start = max(0, i - len(w))
        definition = ' '.join(splitstr[start:i])
        print(w, definition)

Actually, Keatinge beat me to it.
