Mapping word vector to the most similar/closest word using spaCy

A word of caution on this answer. Traditionally, word similarity (in gensim, spaCy, and nltk) uses cosine similarity, while by default scipy's cdist computes Euclidean distance. You can get the cosine distance, which is not the same as similarity, but the two are related. To duplicate gensim's calculation, change your cdist call to the following:

distance.cdist(p, vectors, metric='cosine').argmin()

However, you should also note that scipy measures cosine distance, which is "backwards" from cosine similarity: cosine distance = 1 - cos x (where x is the angle between the vectors). So to match/duplicate the gensim numbers, you must subtract your answer from one (and, of course, take the MAX argument, since similar vectors are closer to 1). It is a very subtle difference but can cause a great deal of confusion.

Similar vectors should have a large (near 1) similarity, while the distance is small (close to zero).

Cosine similarity can be negative (meaning the vectors have opposite directions) but their DISTANCE will be positive (as distance should be).
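To see the relationship concretely, here is a small self-contained sketch (using toy 2-D vectors rather than real word embeddings) of how scipy's cosine distance relates to cosine similarity:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # orthogonal to a: cosine similarity 0
c = np.array([-1.0, 0.0])  # opposite to a: cosine similarity -1

# scipy's cosine *distance* is 1 - cosine similarity
print(distance.cosine(a, b))  # 1.0, i.e. similarity = 1 - 1.0 = 0
print(distance.cosine(a, c))  # 2.0, i.e. similarity = 1 - 2.0 = -1
```

Note that the opposite vectors have the maximum cosine distance (2), even though their similarity is negative, which is exactly the "distance is always positive" point above.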

source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.n_similarity.html#gensim.models.Word2Vec.n_similarity

Computing similarity in spaCy looks like this:

import spacy
nlp = spacy.load("en_core_web_md")
x = nlp("man")
y = nlp("king")
print(x.similarity(y))
print(x.similarity(x))

Yes, spacy has an API method to do that, just like KeyedVectors.similar_by_vector:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

your_word = "king"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # actually cosine similarity scores, not distances
print(words)
['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings', 'PRINCE', 'Prince', 'prince']

(the words are not properly normalized in en_core_web_lg, but you could play with other models and observe a more representative output).


After a bit of experimentation, I found a scipy function (cdist in scipy.spatial.distance) that finds a "close" vector in a vector space to the input vector.

# Imports
import numpy as np
from scipy.spatial import distance
import spacy

# Load the spacy vocabulary
nlp = spacy.load("en_core_web_lg")

# Format the input vector for use in the distance function
# In this case we will artificially create a word vector from a real word ("frog")
# but any derived word vector could be used
input_word = "frog"
p = np.array([nlp.vocab[input_word].vector])

# Format the vocabulary for use in the distance function
ids = [x for x in nlp.vocab.vectors.keys()]
vectors = [nlp.vocab.vectors[x] for x in ids]
vectors = np.array(vectors)

# *** Find the closest word below ***
closest_index = distance.cdist(p, vectors).argmin()  # Euclidean by default; pass metric='cosine' to match gensim
word_id = ids[closest_index]
output_word = nlp.vocab[word_id].text
# output_word is identical, or very close, to the input word
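The same nearest-neighbour lookup can be checked without loading a spaCy model. Here is a toy sketch (with made-up 3-D vectors standing in for word embeddings) using the cosine metric recommended in the first answer:

```python
import numpy as np
from scipy.spatial import distance

# Toy "vocabulary": hypothetical 3-D vectors in place of real embeddings
words = ["frog", "toad", "car"]
vectors = np.array([
    [1.0, 0.2, 0.0],   # frog
    [0.9, 0.3, 0.1],   # toad (points in a similar direction to frog)
    [0.0, 0.1, 1.0],   # car  (nearly orthogonal to the others)
])

# Query vector, deliberately close in direction to "frog"
p = np.array([[1.0, 0.25, 0.05]])

# Index of the row with the smallest cosine distance to the query
closest_index = distance.cdist(p, vectors, metric='cosine').argmin()
print(words[closest_index])  # frog
```

With real spaCy vectors the structure is identical; only the `words`/`vectors` arrays are replaced by the vocabulary extraction shown above.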