Gensim train word2vec on Wikipedia - preprocessing and parameters

I've been working on a project to massage the Wikipedia corpus and get vectors out of it. I might generate the Italian vectors soon, but in case you want to do it on your own, take a look at: https://github.com/idio/wiki2vec


Your approach is fine.

model.build_vocab(generate_lines())  # strangely, this builds a vocab of "only" 747,904 words, far fewer than the ~10M reported in the literature

This is likely caused by the pruning of infrequent words: by default, gensim's Word2Vec discards every word that occurs fewer than min_count=5 times in the corpus.
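To check, you can lower min_count when building the vocabulary and compare the sizes. A minimal sketch, assuming your generate_lines() iterator from above and the gensim 4.x API for reading the vocabulary size:

    from gensim.models import Word2Vec

    # min_count=1 keeps every word, however rare, at the cost of memory
    # and noisier vectors; the default min_count=5 prunes the long tail.
    model = Word2Vec(min_count=1)
    model.build_vocab(generate_lines())  # same iterator as in your code
    print(len(model.wv))                 # vocabulary size after pruning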

To speed up computation, consider caching the preprocessed articles in a plain .txt.gz file, one sentence (document) per line, and then using a word2vec.LineSentence corpus. This saves you from re-parsing the bzipped wiki XML on every iteration.
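For example, a sketch along these lines (the file name wiki.it.txt.gz is just a placeholder, and generate_lines() is again assumed to yield tokenized sentences as in your code):

    import gzip
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # One expensive pass: parse the bz2 XML dump and cache the result,
    # one space-separated, preprocessed sentence per line.
    with gzip.open('wiki.it.txt.gz', 'wt', encoding='utf-8') as out:
        for tokens in generate_lines():
            out.write(' '.join(tokens) + '\n')

    # All later passes stream the cheap plain-text cache instead;
    # LineSentence reads .gz files transparently.
    sentences = LineSentence('wiki.it.txt.gz')
    model = Word2Vec(sentences, min_count=5)

Because LineSentence is restartable, the same object can serve both the vocabulary scan and all subsequent training epochs, which a one-shot generator cannot.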

Why word2vec doesn't produce "meaningful similarities" for the Italian wiki, I don't know; the English wiki seems to work fine. See also here.