Data structure for multi-language dictionary?

I won't win any points here, but a few remarks.

A multi-language dictionary is a large and time-consuming undertaking. You did not say in detail what exact uses your dictionary is intended for: statistical ones probably, not translation, not grammar, .... Different usages require different data to be collected, for instance classifying "went" as past tense.

First formulate your initial requirements in a document and with a programmed interface prototype. Asking about data structures before the algorithmic conception is something I often see with complex business logic. One then risks starting out wrong, with feature creep, or with premature optimisation, like that romanisation, which might have no advantage and could bar bidirectionality.

Maybe you can start by looking at active projects like Reta Vortaro; its XML might not be efficient, but it will give you some ideas for organisation. There are also several academic linguistic projects. The most relevant aspect might be stemming: recognising greet/greets/greeted/greeter/greeting/greetings (@smci) as belonging to the same (major) entry. You will want to use already programmed stemmers; they are often well tested and already applied in electronic dictionaries. My advice would be to research those projects without losing too much energy or impetus to them; just enough to collect ideas and see where they might be used.
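To make the stemming point concrete, here is a minimal sketch of grouping surface forms under one entry; Stemmer and EntryIndex are hypothetical names, and in practice the Stemmer implementation would come from one of those existing, well-tested libraries:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stemmer interface; in practice plug in an existing, well-tested library stemmer. */
interface Stemmer {
    String stem(String surfaceForm);
}

/** Groups surface forms (greet, greets, greeted, ...) under the same major entry by stem. */
class EntryIndex {
    private final Map<String, Set<String>> formsByStem = new HashMap<>();
    private final Stemmer stemmer;

    EntryIndex(Stemmer stemmer) {
        this.stemmer = stemmer;
    }

    void addSurfaceForm(String word) {
        formsByStem.computeIfAbsent(stemmer.stem(word), k -> new TreeSet<>()).add(word);
    }

    Set<String> formsOfSameEntry(String word) {
        return formsByStem.getOrDefault(stemmer.stem(word), Collections.emptySet());
    }
}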

The data structures one can think up are IMHO of secondary importance. I would first collect everything in a well-defined database, and then generate the data structures used by the software. You can then compare and measure alternatives. And for a developer this might be the most interesting part, creating a beautiful data structure & algorithm.


An answer

Requirement:

Map of word to list of [language, definition reference]. List of definitions.

Several words can have the same definition, hence the need for a definition reference. The definition could consist of a language-bound definition (grammatical properties, declensions), and/or a language-independent definition (description of the notion).

One word can have several definitions (book = (noun) reading material, = (verb) reserve use of location).

Remarks

As single words are handled, this does not take into account that a given text is in general mono-lingual. But since a text can be of mixed languages, and I see no special overhead in the O-complexity, that seems irrelevant.

So an over-general abstract data structure would be:

Map<String /*Word*/, List<DefinitionEntry>> wordDefinitions;
Map<String /*Language/Locale/""*/, List<Definition>> definitions;

class Definition {
    String content;
}

class DefinitionEntry {
    String language;
    Ref<Definition> definition;
}

The concrete data structure:

The wordDefinitions are best served with an optimised hash map.
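For a first in-memory cut, a pre-sized java.util.HashMap is usually enough; a sketch, where the expected word count is just an assumption:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumption: roughly half a million distinct words; pre-sizing avoids rehashing while loading.
static final int EXPECTED_WORDS = 500_000;

// Sized so the default 0.75 load factor is not exceeded during the initial load.
Map<String, List<DefinitionEntry>> wordDefinitions =
        new HashMap<>((int) (EXPECTED_WORDS / 0.75f) + 1);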


Please let me add:

I did come up with a concrete data structure at last. I started with the following.

Guava's Multimap is what we have here, but Trove's collections with primitive types are what one needs if using a compact binary representation in core.
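For reference, the plain Guava form of the abstract structure would look roughly like this (a sketch; addEntry and entriesFor are just illustrative helper names):

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

import java.util.Collection;

// The abstract Map<String, List<DefinitionEntry>> from above, expressed as a Guava Multimap.
Multimap<String, DefinitionEntry> wordDefinitions = ArrayListMultimap.create();

void addEntry(String word, DefinitionEntry entry) {
    wordDefinitions.put(word, entry);
}

Collection<DefinitionEntry> entriesFor(String word) {
    return wordDefinitions.get(word); // empty collection for unknown words, never null
}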

For the compact binary representation, one would do something like this:

import gnu.trove.map.TObjectIntMap;
import gnu.trove.map.hash.TObjectIntHashMap;

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/**
 * Map of word to DefinitionEntry.
 * Key: word.
 * Value: offset in byte array wordDefinitionEntries;
 * 0 serves as null, which implies a dummy byte (data version?)
 * in the byte array at [0].
 */
TObjectIntMap<String> wordDefinitions = new TObjectIntHashMap<>();
byte[] wordDefinitionEntries = new byte[...]; // Actually read from file.

void walkEntries(String word) throws IOException {
    int value = wordDefinitions.get(word);
    if (value == 0)
        return;
    DataInputStream in = new DataInputStream(
        new ByteArrayInputStream(wordDefinitionEntries));
    in.skipBytes(value);
    int entriesCount = in.readShort();
    for (int entryno = 0; entryno < entriesCount; ++entryno) {
        int language = in.readByte();
        walkDefinition(in, language); // Index to readUTF8 or gzipped bytes.
    }
}
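Usage is then a single hash lookup followed by a sequential scan of the packed bytes: calling walkEntries("book") hands each [language, definition] pair stored for "book" to walkDefinition.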

I'm not sure whether or not this will work for your particular problem, but here's one idea to think about.

A data structure that's often used for fast, compact representations of a language is a minimum-state DFA for the language. You could construct this by building a trie for the language (which is itself an automaton for recognizing strings in the language), then using one of the canonical algorithms for constructing a minimum-state DFA from it. This may require an enormous amount of preprocessing time, but once you've constructed the automaton you'll have extremely fast lookup of words: you just start at the start state and follow the labeled transitions for each of the letters. Each state could (for example) encode a 40-bit value with one bit per language indicating whether the state corresponds to a word in that language.
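A minimal sketch of the trie/lookup half of this idea (the minimization step is omitted; LanguageTrie and the 64-bit language mask are illustrative choices, not a prescription):

import java.util.HashMap;
import java.util.Map;

/** Trie node; after minimization the same shape would serve as a DFA state. */
class TrieNode {
    final Map<Character, TrieNode> transitions = new HashMap<>();
    long languageBits; // bit i set => this node ends a word in language i (up to 64 languages)
}

class LanguageTrie {
    private final TrieNode root = new TrieNode();

    void add(String word, int languageIndex) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.transitions.computeIfAbsent(c, k -> new TrieNode());
        }
        node.languageBits |= 1L << languageIndex;
    }

    /** Returns the bit mask of languages containing this word, or 0 if none. */
    long lookup(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.transitions.get(c);
            if (node == null) return 0L;
        }
        return node.languageBits;
    }
}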

Because different languages use different alphabets, it might be a good idea to separate out each language by alphabet so that you can minimize the size of the transition table for the automaton. After all, if you have words in the Latin, Cyrillic, and Greek alphabets, the transitions out of states representing Greek words would almost all go to the dead state on Latin letters, while the transitions on Greek characters out of states representing Latin words would likewise go to the dead state. Having a separate automaton for each of these alphabets could thus eliminate a huge amount of wasted space.
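A rough sketch of that split, dispatching on the Unicode script of the word's first character and reusing the LanguageTrie sketch above (using only the first character is an assumption; mixed-script words would need extra handling):

import java.util.EnumMap;
import java.util.Map;

/** Routes a lookup to the automaton built for the word's script (LATIN, CYRILLIC, GREEK, ...). */
class ScriptDispatcher {
    private final Map<Character.UnicodeScript, LanguageTrie> triesByScript =
            new EnumMap<>(Character.UnicodeScript.class);

    void register(Character.UnicodeScript script, LanguageTrie trie) {
        triesByScript.put(script, trie);
    }

    long lookup(String word) {
        if (word.isEmpty()) return 0L;
        Character.UnicodeScript script = Character.UnicodeScript.of(word.codePointAt(0));
        LanguageTrie trie = triesByScript.get(script);
        return trie == null ? 0L : trie.lookup(word);
    }
}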


A common solution to this in the field of NLP is finite automata. See http://www.stanford.edu/~laurik/fsmbook/home.html.