Improving the extraction of human names with nltk

For anyone else looking, I found this article to be useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             if hasattr(chunk, 'node'):
...                 print chunk.node, ' '.join(c[0] for c in chunk.leaves())
...

I actually wanted to extract only the person name, so, thought to check all the names that come as an output against wordnet( A large lexical database of English). More Information on Wordnet can be found here: http://www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet


person_list = []
person_names=person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers all the names are correct and that is because of the last name "Summers".


The answer of @trojane didn't quite work for me, but helped a lot for this one.

Prerequesites

Create a folder stanford-ner and download the following two files to it:

  • english.all.3class.distsim.crf.ser.gz
  • stanford-ner.jar (Look for download and extract the archive)

Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices.
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

Results

('Bitcoin', 'LOCATION')       # wrong
('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Federal', 'ORGANIZATION')
('Reserve', 'ORGANIZATION')
('Chicago', 'LOCATION')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')         # Wrong
('Galactic', 'PERSON')       # Wrong
('Bitcoin', 'PERSON')        # Wrong
('Bitcoin', 'LOCATION')      # Wrong
('Bitcoin', 'LOCATION')      # Wrong
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')        # Wrong
('Nick', 'PERSON')
('Colas', 'PERSON')
('ConvergEx', 'ORGANIZATION')
('Group', 'ORGANIZATION')     
('Bitcoin', 'LOCATION')       # Wrong
('BTC', 'ORGANIZATION')       # Wrong

Must agree with the suggestion that "make my code better" isn't well suited for this site, but I can give you some way where you can try to dig in.

Disclaimer: This answer is ~7 years old. Definitely, it needs to be updated to newer Python and NLTK versions. Please, try to do it yourself, and if it works, share your know-how with us.

Take a look at Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v 2.0, but you must download some core files. Here is script which can do all of that for you.

I wrote this script:

import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1]=='PERSON': print tag

and got not so bad output:

('Francois', 'PERSON') ('R.', 'PERSON') ('Velde', 'PERSON') ('Richard', 'PERSON') ('Branson', 'PERSON') ('Virgin', 'PERSON') ('Galactic', 'PERSON') ('Bitcoin', 'PERSON') ('Bitcoin', 'PERSON') ('Paul', 'PERSON') ('Krugman', 'PERSON') ('Larry', 'PERSON') ('Summers', 'PERSON') ('Bitcoin', 'PERSON') ('Nick', 'PERSON') ('Colas', 'PERSON')

Hope this is helpful.

Tags:

Python

Nlp

Nltk