Is it possible to query Elastic Search with a feature vector?

Version 7.4 of Elasticsearch has built-in vector comparison functions, including cosine similarity. See: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/query-dsl-script-score-query.html#vector-functions.
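A minimal sketch of how this can be used, assuming the Python `elasticsearch` client (7.x) and a local node; the index and field names (`my_vectors`, `my_dense_vector`) are just illustrative. Documents store the vector in a `dense_vector` field, and a `script_score` query ranks them by `cosineSimilarity` against the query vector:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local Elasticsearch 7.4+ node

# Create an index with a dense_vector field (names are illustrative).
es.indices.create(index="my_vectors", body={
    "mappings": {
        "properties": {
            "my_dense_vector": {"type": "dense_vector", "dims": 4}
        }
    }
})

es.index(index="my_vectors",
         body={"my_dense_vector": [0.9, 0.3, 0.2, 0.0]},
         refresh=True)

# Rank documents by cosine similarity to the query vector.
# The "+ 1.0" keeps scores non-negative, as Elasticsearch requires.
query = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, doc['my_dense_vector']) + 1.0",
                "params": {"query_vector": [0.8, 0.4, 0.1, 0.0]}
            }
        }
    }
}
print(es.search(index="my_vectors", body=query))
```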


I don't have an answer specific to Elasticsearch because I've never used it (I use Lucene, on which Elasticsearch is built), but I'll give a generic answer to your question. There are two standard ways to obtain the nearest vectors given a query vector, described as follows.

K-d tree

The first approach is to store the vectors in memory in a data structure that supports nearest-neighbour queries, e.g. a k-d tree. A k-d tree is a generalization of the binary search tree in the sense that every level of the tree partitions the points along one of the k dimensions. If you have enough space to load all the points into memory, you can run a nearest-neighbour search on the k-d tree to obtain a list of retrieved vectors sorted by similarity to the query (if the vectors are normalised to unit length, the Euclidean nearest neighbours are also the ones with the highest cosine similarity). The obvious disadvantage of this method is that it does not scale to the huge sets of points often encountered in information retrieval.
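As a rough sketch (assuming SciPy is available and the vectors fit in memory), cosine-similarity search can be reduced to Euclidean nearest-neighbour search over unit-normalised vectors, since for unit vectors cos(a, b) = 1 - ||a - b||^2 / 2:

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy collection of feature vectors (rows); real data would be loaded from storage.
vectors = np.random.rand(10000, 64)

# Normalise to unit length so Euclidean NN order matches cosine-similarity order.
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
tree = cKDTree(unit_vectors)

def most_similar(query, k=10):
    """Return (cosine_similarity, index) pairs for the k nearest vectors."""
    q = query / np.linalg.norm(query)
    distances, indices = tree.query(q, k=k)
    # For unit vectors: cosine = 1 - d^2 / 2
    return [(1.0 - d * d / 2.0, int(i)) for d, i in zip(distances, indices)]

print(most_similar(np.random.rand(64)))
```

Bear in mind that k-d trees also lose much of their efficiency advantage as dimensionality grows, which is another reason this approach suits small to moderately sized collections.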

Inverted Quantized Vectors

The second approach is to use inverted quantized vectors. A simple range-based quantization assigns pseudo-terms (labels) to the real numbers of a vector so that they can later be indexed by Lucene (or, for that matter, Elasticsearch).

For example, we may assign the label A to the range [0, 0.1), B to the range [0.1, 0.2), and so on. The sample vector in your question would then be encoded as (J, D, C, ..., A), because [0.9, 1] maps to J, [0.3, 0.4) maps to D, and so on.
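A minimal sketch of such a quantizer, assuming the component values lie in [0, 1] and using the labels and bin width from the example above:

```python
import string

LABELS = string.ascii_uppercase[:10]  # 'A'..'J', one label per 0.1-wide bin

def quantize(value):
    """Map a value in [0, 1] to a pseudo-term: [0, 0.1) -> A, ..., [0.9, 1] -> J."""
    return LABELS[min(int(value * 10), 9)]

def encode_vector(vector):
    """Encode a real-valued vector as a whitespace-separated pseudo-term string."""
    return " ".join(quantize(v) for v in vector)

print(encode_vector([0.95, 0.31, 0.22, 0.05]))  # -> "J D C A"
```

In practice one would typically prefix each label with its dimension index (e.g. `0_J 1_D 2_C ...`) so that the same label in different dimensions is indexed as a distinct term; otherwise the bag-of-terms representation loses positional information.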

A vector of real numbers is thus transformed into a string, which can be treated as a document and indexed with a standard information retrieval (IR) tool. A query vector is transformed into a bag of pseudo-terms in the same way, so one can retrieve the vectors in the collection that are most similar (in terms of cosine similarity or some other measure) to the query.
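Continuing the sketch with the Python `elasticsearch` client (the index name `quantized_vectors` and field name `pseudo_terms` are just illustrative): the pseudo-term strings are indexed as ordinary text, and a standard match query over the query vector's pseudo-terms returns the documents with the most overlapping terms. Note that the resulting scores come from the engine's default text ranking (e.g. BM25) over shared pseudo-terms, so they only approximate cosine similarity:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def encode_vector(vector):
    """Same range-based quantizer as in the previous sketch."""
    return " ".join("ABCDEFGHIJ"[min(int(v * 10), 9)] for v in vector)

# Index each vector as a plain text document of pseudo-terms.
for i, vec in enumerate([[0.95, 0.31, 0.22, 0.05], [0.10, 0.85, 0.40, 0.70]]):
    es.index(index="quantized_vectors", id=i,
             body={"pseudo_terms": encode_vector(vec)}, refresh=True)

# Encode the query vector the same way and run a standard match query.
query_vec = [0.90, 0.35, 0.20, 0.02]
hits = es.search(index="quantized_vectors", body={
    "query": {"match": {"pseudo_terms": encode_vector(query_vec)}}
})
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```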

The main advantage of this method is that it scales well to massive collections of real-valued vectors. The key disadvantage is that the computed similarity values are only approximations of the true cosine similarities (because of the loss incurred by quantization). Narrower quantization ranges (i.e. more labels) give closer approximations, at the cost of a larger index.