How to get domain of words using WordNet in Python?

As @alvas suggests, you can use WordNetDomains. You have to download both WordNet2.0 (in its current status WordNetDomains does not support the sense inventory of WordNet3.0, which is the default version of WordNet used by NLTK) and WordNetDomains.

WordNet2.0 can be downloaded from here
WordNetDomains can be downloaded from here (after having being granted permission).

I have created a very simple Python API that loads both resources in Python3.x and provides some common routines you might need (such as getting a set of domains linked to a given term, or to a given synset, etc.). The data load of WordNetDomains is from @alvas.

This is how it looks like (with most comments omitted):

from collections import defaultdict
from nltk.corpus import WordNetCorpusReader
from os.path import exists


class WordNetDomains:
    def __init__(self, wordnet_home):
        #This class assumes you have downloaded WordNet2.0 and WordNetDomains and that they are on the same data home.
        assert exists(f'{wordnet_home}/WordNet-2.0'), f'error: missing WordNet-2.0 in {wordnet_home}'
        assert exists(f'{wordnet_home}/wn-domains-3.2'), f'error: missing WordNetDomains in {wordnet_home}'

        # load WordNet2.0
        self.wn = WordNetCorpusReader(f'{wordnet_home}/WordNet-2.0/dict', 'WordNet-2.0/dict')

        # load WordNetDomains (based on https://stackoverflow.com/a/21904027/8759307)
        self.domain2synsets = defaultdict(list)
        self.synset2domains = defaultdict(list)
        for i in open(f'{wordnet_home}/wn-domains-3.2/wn-domains-3.2-20070223', 'r'):
            ssid, doms = i.strip().split('\t')
            doms = doms.split()
            self.synset2domains[ssid] = doms
            for d in doms:
                self.domain2synsets[d].append(ssid)

    def get_domains(self, word, pos=None):
        word_synsets = self.wn.synsets(word, pos=pos)
        domains = []
        for synset in word_synsets:
            domains.extend(self.get_domains_from_synset(synset))
        return set(domains)

    def get_domains_from_synset(self, synset):
        return self.synset2domains.get(self._askey_from_synset(synset), set())

    def get_synsets(self, domain):
        return [self._synset_from_key(key) for key in self.domain2synsets.get(domain, [])]

    def get_all_domains(self):
        return set(self.domain2synsets.keys())

    def _synset_from_key(self, key):
        offset, pos = key.split('-')
        return self.wn.synset_from_pos_and_offset(pos, int(offset))

    def _askey_from_synset(self, synset):
        return self._askey_from_offset_pos(synset.offset(), synset.pos())

    def _askey_from_offset_pos(self, offset, pos):
        return str(offset).zfill(8) + "-" + pos

There is no explicit domain information in the Princeton WordNet nor the NLTK's WN API.

I would recommend you get a copy of the WordNet Domain resource and then link your synsets using the domains, see http://wndomains.fbk.eu/

After you've registered and completed the download you will see a wn-domains-3.2-20070223 textfile, which is a tab-delimited file with first column the offset-PartofSpeech identifier and the 2nd column contains the domain tags separated by spaces, e.g.

00584282-v  military pedagogy
00584395-v  military school university
00584526-v  animals pedagogy
00584634-v  pedagogy
00584743-v  school university
00585097-v  school university
00585271-v  pedagogy
00585495-v  pedagogy
00585683-v  psychological_features

Then you use the following script to access synsets' domain(s):

from collections import defaultdict
from nltk.corpus import wordnet as wn

# Loading the Wordnet domains.
domain2synsets = defaultdict(list)
synset2domains = defaultdict(list)
for i in open('wn-domains-3.2-20070223', 'r'):
    ssid, doms = i.strip().split('\t')
    doms = doms.split()
    synset2domains[ssid] = doms
    for d in doms:
        domain2synsets[d].append(ssid)

# Gets domains given synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]: # not all synsets are in WordNet Domain.
        print ss, ssid, synset2domains[ssid]

# Gets synsets given domain.        
for dom in sorted(domain2synsets):
    print dom, domain2synsets[dom][:3]

Also look for the wn-affect that is very useful to disambiguate words for sentiment within the WordNet Domain resource.

With updated NLTK v3.0, it comes with the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), and since the French synsets share the same offset IDs, you can simply use the WND as a crosslingual resource. The french lemma names can be accessed as such:

# Gets domains given synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]: # not all synsets are in WordNet Domain.
        print ss, ss.lemma_names('fre'), ssid, synset2domains[ssid]

Note that the most recent version of NLTK changes synset properties to "get" functions: Synset.offset -> Synset.offset()

How to get domain of words using WordNet in Python?

Tags:

Python

Nlp

Nltk

Wordnet

Related

Recent Posts