How does PySpark calculate Doc2Vec from Word2Vec word embeddings?

One simple way to go from word-vectors to a single vector for a range of text is to average the vectors together, and that often works well enough for some tasks.
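
For illustration, here's a minimal sketch of that averaging approach using gensim's `Word2Vec`; the tiny corpus and parameter values are placeholders, not recommendations:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: real use requires far more data for meaningful vectors.
sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog", "sleeps"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=20)

def average_vector(tokens, model):
    """Average the word-vectors of all in-vocabulary tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vector = average_vector(["the", "quick", "dog"], model)
print(doc_vector.shape)  # (50,)
```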

However, that's not how the Doc2Vec class in gensim does it. That class implements the 'Paragraph Vectors' technique, where separate document-vectors are trained in a manner analogous to word-vectors.
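
Training such per-document vectors in gensim looks roughly like the sketch below; the two-document corpus and parameter values are illustrative placeholders only:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training text gets a tag; the model learns one vector per tag.
corpus = [
    TaggedDocument(words=["the", "quick", "brown", "fox"], tags=[0]),
    TaggedDocument(words=["the", "lazy", "dog", "sleeps"], tags=[1]),
]
# The default dm=1 is the PV-DM mode, which also trains word-vectors.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

print(model.dv[0])                           # trained vector for document 0
print(model.infer_vector(["quick", "dog"]))  # vector inferred for a new text
```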

The doc-vectors participate in training a bit like a floating synthetic word, involved in every sliding-window/target-word prediction. They're not composed or concatenated from preexisting word-vectors, though in some modes they may be trained simultaneously alongside word-vectors. (However, the fast and often top-performing PV-DBOW mode, enabled in gensim with the parameter dm=0, doesn't train or use input word-vectors at all. It just trains doc-vectors that are good for predicting the words in each text example.)
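
A hedged sketch of that PV-DBOW mode, reusing the toy-corpus pattern from above; the `dm=0` setting is the only substantive difference:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "quick", "brown", "fox"], tags=[0]),
    TaggedDocument(words=["the", "lazy", "dog", "sleeps"], tags=[1]),
]
# dm=0 selects PV-DBOW: doc-vectors are trained to predict each text's
# words directly, and no input word-vectors are trained (unless you also
# pass dbow_words=1 to interleave skip-gram word-vector training).
model = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=40)
print(model.dv[0])
```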

Since you've mentioned multiple libraries (both Spark MLlib and gensim) but haven't shown your code, it's not certain exactly what your existing process is doing.