German Stemming for Sentiment Analysis in Python NLTK

As a computer scientist, you are definitely looking in the right direction to tackle this linguistic issue ;). Stemming is usually much simpler and is used in Information Retrieval to shrink the lexicon, but it is rarely sufficient for more sophisticated linguistic analysis. Lemmatisation partly overlaps with the use case for stemming, but it rewrites, for example, all verb inflections to the same root form (the lemma), and it can also distinguish "work" as a noun from "work" as a verb (although this depends a bit on the implementation and quality of the lemmatiser). To do this, it usually needs more information (such as POS tags or syntax trees), so it takes considerably longer, which makes it less suitable for IR tasks, which typically deal with larger amounts of data.
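If a plain stemmer turns out to be good enough for your sentiment-analysis pipeline, note that NLTK already ships a Snowball stemmer with German support; a minimal sketch (the example words are just illustrative):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('german')
for word in ['suchen', 'sucht', 'gesucht', 'Beispiele']:
    # the stemmer only strips suffixes, so you get stems, not real lemmas
    print(word, '->', stemmer.stem(word))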

In addition to GermaNet (I didn't know it had been discontinued, but I never really tried it: it is free, but you have to sign an agreement to get access to it), there is SpaCy, which you could have a look at: https://spacy.io/docs/usage/

It is very easy to install and use. See the install instructions on the website, then download the German model with:

python -m spacy download de

then:

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp('Wir suchen ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Wir 521 wir
suchen 1162 suchen
ein 486 ein
Beispiel 809 Beispiel
>>> doc = nlp('Er sucht ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
... 
Er 513 er
sucht 1901 sucht
ein 486 ein
Beispiel 809 Beispiel

As you can see, it unfortunately doesn't do a very good job on your specific example ("sucht" is not reduced to "suchen"). I'm not sure what the number represents (it must be the lemma id, but I'm not sure what other information can be obtained from it), but maybe you can give it a go and see if it helps you.
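If I remember correctly, that number is just spaCy's internal id (a vocabulary hash) of the lemma string, so you can map it back to the text yourself; a quick sketch, assuming the same 'de' model as above:

import spacy

nlp = spacy.load('de')
doc = nlp('Er sucht ein Beispiel')
for token in doc:
    # token.lemma is the id of the lemma string in the vocab's StringStore
    print(token.text, token.lemma, nlp.vocab.strings[token.lemma], token.lemma_)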


A good and easy solution is to use the TreeTagger. First you have to install the TreeTagger manually (which basically means unzipping the right zip file somewhere on your computer). You will find the binary distribution here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Then you need to install a wrapper to call it from Python.

Install the wrapper with pip (the package is called treetaggerwrapper); the following code then lemmatizes a tokenized sentence:

import pprint
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')

# pass your own tokens and set tagonly=True; don't use the TreeTagger's tokenization!
tokenized_sent = ['Er', 'sucht', 'ein', 'Beispiel']
tags = tagger.tag_text(tokenized_sent, tagonly=True)

pprint.pprint(tags)

You can also use a method from the treetaggerwrapper module to make nice objects out of the TreeTagger's output:

tags2 = treetaggerwrapper.make_tags(tags)
pprint.pprint(tags2)
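If I recall the wrapper's API correctly, make_tags returns Tag namedtuples with word, pos and lemma fields, so pulling out just the lemmas for your sentiment analysis would then be a one-liner:

# extract the lemma field from each Tag namedtuple
lemmas = [tag.lemma for tag in tags2]
print(lemmas)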

That is all.