Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.


Introduction
The vector space model (Salton et al., 1975) of representing documents in high-dimensional vector spaces has been validated by decades of research and development. Extensive deployment of inverted-index-based information retrieval (IR) systems has led to the availability of robust open source IR systems such as Sphinx, Lucene or its popular, horizontally scalable extensions of Elasticsearch and Solr.
Representations of document semantics based solely on first-order document-term statistics, such as TF-IDF or Okapi BM25, are limited in their expressiveness and search recall. Today, approaches based on distributional semantics and deep learning allow the construction of semantic vector space models representing words, sentences, paragraphs or even whole documents as vectors in high-dimensional spaces (Deerwester et al., 1990; Blei et al., 2003; Mikolov et al., 2013).
The ubiquity of semantic vector space modeling raises the challenge of efficient searching in these dense, high-dimensional vector spaces. We would naturally want to take advantage of the design and optimizations behind modern fulltext engines like Elasticsearch so as to meet the scalability and robustness demands of modern IR applications. This is the research challenge addressed in this paper.
The rest of the paper describes novel ways of encoding dense vectors into text documents, allowing the use of traditional inverted index engines, and explores the trade-offs between IR accuracy and speed. Being motivated by pragmatic needs, we describe the results of experiments carried out on real datasets measured on concrete, practical software implementations.

Semantic Vector Encoding
for Inverted-Index Search Engines

Related Work
The standard representation of documents in the Vector Space Model (VSM) (Salton and Buckley, 1988) uses term feature vectors of very high dimensionality. To map the feature space onto a smaller and denser latent semantic subspace, we may use a body of techniques, including Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Latent Dirichlet Allocation (LDA) (Blei et al., 2003) or the many variants of Locality-Sensitive Hashing (LSH) (Gionis et al., 1999). Throughout the long history of VSM developments, many other methods for improving search efficiency have been explored. Weber et al. ran one of the first rigorous studies of the inefficiency of the VSM under the so-called curse of dimensionality, evaluating several data partitioning and vector approximation schemes and achieving significant nearest-neighbour search speedups (Weber and Böhm, 2000). The scalability of similarity searching through new index data structure designs is described in (Zezula et al., 2006). Dimensionality reduction techniques proposed in (Digout et al., 2004) allow a robust speedup, showing that not all features are equally discriminative: depending on their density distribution, they have a different impact on efficiency. Boytsov shows that k-NN search can be a replacement for term-based retrieval when the term-document similarity matrix is dense.
Recently, deep learning approaches and tools like doc2vec (Le and Mikolov, 2014) construct semantic representations of documents D as d = |D| (dense) vectors in an n-dimensional vector space.
To take advantage of ready-to-use and optimized systems for term indexing and searching, we have developed a method for representing points in a semantic vector space encoded as plain text strings. In our experiments, we will be using Elasticsearch (Gormley and Tong, 2015) in its 'vanilla' setup. We do not utilize any advanced features of Elasticsearch, such as custom scoring, tokenization or n-gram analyzers. Thus, our method does not depend on any functionality specific to Elasticsearch, and it is possible (and sometimes even desirable) to substitute Elasticsearch with another fulltext engine implementation.

Our Vector to String Encoding Method
Let our query be a document, represented by its vector q, for which we aim to find the top k most similar documents in D. We want to search efficiently, indexing and deleting documents from the index in near-real-time, and in a manner that could scale by eventual parallelization, re-using the significant research and engineering effort that went into designing and implementing systems like Elasticsearch.
Conceptually, our method consists of encoding vector features into string tokens (feature tokens), creating a text document from each dense vector. These encoded documents are then indexed in a traditional inverted-index-based search engine. At query time, we encode the query vector and retrieve a subset of similar vectors E, |E| ≪ |D|, using the engine's fulltext search functionality. Finally, the small set of candidate vectors E is re-ranked by calculating the exact similarity metric (such as cosine similarity) to the query vector. This makes the search effectively a two-phase process, with our encoded fulltext search as the first phase and candidate re-ranking as the second.
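The second phase of this process is a plain exact nearest-neighbour computation over the small candidate set E. A minimal NumPy sketch, assuming the first-phase fulltext search has already returned a list of candidate document ids (the function and variable names are ours, for illustration only):

```python
import numpy as np

def rerank(candidate_ids, vectors, q, k=10):
    """Phase 2: exact cosine similarity between the query vector q and
    the small candidate set E returned by the fulltext engine in phase 1."""
    cand = vectors[candidate_ids]                                   # |E| x n
    sims = (cand @ q) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]                                   # best first
    return [candidate_ids[i] for i in order]
```

Because |E| ≪ |D|, this exact re-ranking step is cheap regardless of how approximate the first phase is.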

Encoding Vectors
The core of our method of encoding vectors into strings lies in encoding the vector feature values at a selected precision as character strings (feature tokens). This is best demonstrated on a small example: let us take a semantic vector of three dimensions, w = [0.12, −0.13, 0.065]. Each feature token starts with its feature identification (e.g. a feature number such as 0, 1 etc.), followed by a precision encoding schema identifier (such as P2, I10 etc.) and the encoded feature value (such as i0d12, ineg0d2 etc.), depending on the particular encoding method. We propose and evaluate two encoding methods:
rounding: Feature values are rounded to a fixed number of decimal positions and stored as a string that encodes both the feature identification and its value. Rounding to two decimal places produces the representation of w as ['0P2i0d12', '1P2ineg0d13', '2P2i0d07'].
interval: Feature values are quantized into intervals of fixed length. For example, with an interval width of 0.1, the feature values fall into intervals starting at 0.1, −0.2 and 0.0, which we encode as d1, d2 and d0, respectively. Combined with the interval length denotation I10, the full vector is encoded into the tokens w = ['0I10i0d1', '1I10ineg0d2', '2I10i0d0'].
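Both encodings can be sketched in a few lines of Python. The token syntax ('.' written as 'd', a 'neg' prefix for negative values) is inferred from the examples above, so details such as the exact zero-padding are assumptions:

```python
import math

def encode_rounding(vec, precision=2):
    """Round each feature and emit '<idx>P<precision>i<value>' tokens,
    with '.' written as 'd' and a leading 'neg' for negative values."""
    tokens = []
    for i, x in enumerate(vec):
        r = round(x, precision)
        body = f"{abs(r):.{precision}f}".replace(".", "d")
        sign = "neg" if r < 0 else ""
        tokens.append(f"{i}P{precision}i{sign}{body}")
    return tokens

def encode_interval(vec, width=0.1):
    """Quantize each feature into fixed-width intervals and emit
    '<idx>I<1/width>i<interval start>' tokens (sketch for width=0.1)."""
    tokens = []
    for i, x in enumerate(vec):
        start = math.floor(x / width) * width
        body = f"{abs(start):.1f}".replace(".", "d")
        sign = "neg" if start < 0 else ""
        tokens.append(f"{i}I{int(round(1 / width))}i{sign}{body}")
    return tokens
```

For w = [0.12, −0.13, 0.065], these reproduce the token lists shown above.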
The intuition behind all these encoding schemes is a trade-off between increasing feature sparsity and retaining search quality: we show that some types of sparsification actually lead to little loss of quality, allowing an efficient use of inverted-index IR engines.

High-Pass Filtering Techniques
The rationale behind the next two techniques is to filter out semantic vector features of low importance, further increasing feature sparsity. This improves performance at the expense of quality.
Trim: In the trimming phase, a fixed threshold, such as 0.1, is used. Feature tokens whose feature's absolute value falls below the threshold are simply discarded from the query. In the case of our example vector w = [0.12, −0.13, 0.065], the token representing the third feature value 0.065 is removed, since |0.065| < 0.1.
Best: Features of each vector are ordered according to their absolute value and only a fixed number of the highest-valued features are added to the index, discarding the rest. As an example, with best = 1, only the second feature of −0.13 (the highest absolute value) would be considered from w.
Note that in both cases, this type of filtering is only meaningful when the feature ranges are comparable. In our experiments, all vectors are normalized to unit length, so the absolute values of the features range from zero (no feature importance) to one (maximal feature importance).
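Both high-pass filters are straightforward to express over (index, value) pairs; the helper names below are ours, not part of the method's terminology:

```python
def trim(vec, threshold=0.1):
    """Drop features whose absolute value falls below the threshold."""
    return [(i, x) for i, x in enumerate(vec) if abs(x) >= threshold]

def best(vec, m=1):
    """Keep only the m features with the largest absolute values."""
    top = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:m]
    return [(i, vec[i]) for i in sorted(top)]
```

On the example vector w = [0.12, −0.13, 0.065], trimming at 0.1 drops the third feature, and best with m = 1 keeps only the second.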

Space and Time Requirements
In this section we summarize and compare the theoretical space and time requirements of our proposed vector-to-string encoding and filtering methods against a baseline of a naive, linear brute-force search. The inverted-index-based analysis is based on the documentation of the Lucene search engine implementation.
While the running time of the naive search is stable and predictable, the efficiency of the other methods depends on the dataset, such as its vocabulary size and the expected postings list sparsity after feature token encoding and filtering. Performance is also influenced by how the method is configured: for example, the expected number of distinct feature values depends on the precision used when rounding the feature values.
The efficiency trade-offs are summarized in Table 1. Each document is represented as a vector d of n features computed by LSA over TF-IDF.
Baseline -naive brute force search: The naive baseline method avoids using a fulltext search altogether, and instead stores all d 'indexed' vectors in their original dense vector representation in RAM, as an n × d matrix. At query time, it computes a similarity between the query vector q and each of the indexed document vectors d, in a single linear scan through the matrix, and returns the top k vectors with the best score.
Efficiency of a naive brute force search: The index is a matrix of floats of size nd, resulting in O(nd) space complexity. We have chosen cosine similarity as our similarity metric. To calculate cossim between the query and all d documents, the vector q of length n has to be multiplied with a length-normalized matrix of dimensionality n × d, i.e. in O(nd) time. Using the resulting vector of d scores, we then pick the k nearest documents as the final query result, in O(d).
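The naive baseline amounts to a single matrix-vector product. A minimal NumPy sketch, assuming the index rows are already length-normalized (the function name is ours):

```python
import numpy as np

def naive_search(index, q, k=10):
    """index: d x n matrix of unit-length document vectors held in RAM.
    Computes all d cosine similarities in O(nd) and picks the top k."""
    scores = index @ (q / np.linalg.norm(q))     # O(nd) linear scan
    top = np.argpartition(-scores, k - 1)[:k]    # O(d) selection of the top k
    return top[np.argsort(-scores[top])]         # order the k winners
```

Using argpartition keeps the top-k selection linear in d, matching the O(d) bound above.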
Efficiency of encoding: We investigate the efficiency of our vector-to-string encoding when a general inverted-index-based search engine is used to index and search through them.
At worst, we store one token per dimension for each vector, so we need O(nd) space to store all indexed documents, as in the naive search. In practice, the constant factors differ: the naive search stores one float (four bytes) per feature, while our feature tokens are compressed string-encoded feature values, stored together with the indices of the sparse feature positions in the inverted index.
Each dimension of the query vector q contains a string-encoded feature q_j. For each q_j we fetch a list of documents c_i together with the term frequency of token t_n in that document: tf(t_n, c_i).
For each of these (c_i, tf(t_n, c_i)) pairs, we add the corresponding score value to the score of document c_i in a list of result-candidate documents. The score computation contains a vector dot-product operation in the dimension of the size v of the feature vocabulary V, which can be computed in O(v). Document c_i is added to the set C of all result-candidate documents, which we sort in O(c log c) time before returning the top k results. The whole search is performed in O(n · p · v + c log c) steps, where p is the expected postings list size and c = |C|.
Efficiency of high-pass filtering: We approximate the full feature vector by storing only the most significant features, i.e. only the top m dimensions. Compared with the naive search, we save on space: only O(md) values are needed, with m ≪ n.
For each feature value q_j we have to find all documents with the same feature value. We are able to find the feature set in O(log j) steps, where j is the number of distinct indexed values of the feature. We then retrieve the matched documents in O(l) time, where l is the number of documents in the index with the appropriate feature value.
Each of the found documents is added to the set C and its score is incremented for this hit. With an appropriate data structure, such as a hash table, we are able to add items and increment their scores in O(1) time. Having collected all c = |C| candidate documents over the separate per-feature searches, we pick and return the top k items in O(c) time.
Combined, O(n · (log j + l) + c) steps are needed for the search, where j is the expected number of distinct indexed values per feature and l is the number of documents in the index per feature value.

Experimental Setup
To evaluate our method, we used ScaleText (Rygl et al., 2016) based on Elasticsearch (Gormley and Tong, 2015) as our fulltext IR system. The evaluation dataset was the whole of the English Wikipedia consisting of 4,181,352 articles.

Quality Evaluation
The aim of the quality evaluation was to investigate how well the approximate 'encoded vector' search performs in comparison with the exact naive brute-force search, using cosine similarity as the similarity metric. Cosine similarity is by no means the only possible metric; we selected it because we needed a fully automatic evaluation, without any need for human judgement, and because cosine similarity suits our upstream application logic perfectly.
We converted all Wikipedia documents into vectors using LSA with 400 features. We then randomly selected 1,000 documents from our Wikipedia dataset to act as our query vectors. By doing a naive brute force scan over all the vectors (the whole dataset), we identified the 10 most similar ones for each query vector. This became our 'gold standard'.
We encoded the dataset vectors into various string representations, as described in Section 2.2.1 and stored them in Elasticsearch.
For evaluating the search, we pruned the values in the query vectors (see Section 2.2.2) and encoded them into a string representation. Using these strings, we performed 1,000 Elasticsearch searches. For each query, we measured the overlap between the retrieved documents and the gold standard using Precision@k and the Normalized Discounted Cumulative Gain (nDCG_k). The mean cumulative loss between the ideal and the actual cosine similarities of the top k results (avg. diff.) is also reported.
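The three quality metrics can be sketched as follows, with the graded relevance of each retrieved document taken to be its cosine similarity to the query, as in our evaluation (function names are ours):

```python
import math

def precision_at_k(retrieved, gold, k=10):
    """Fraction of the top-k retrieved ids that appear in the gold standard."""
    return len(set(retrieved[:k]) & set(gold)) / k

def ndcg_at_k(relevance, k=10):
    """relevance: graded relevances of the retrieved docs in rank order;
    the ideal ranking is the same grades sorted in decreasing order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))
    idcg = sum(r / math.log2(i + 2)
               for i, r in enumerate(sorted(relevance, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def avg_diff(ideal_sims, actual_sims):
    """Mean difference between ideal and actual cosine similarities."""
    return sum(i - a for i, a in zip(ideal_sims, actual_sims)) / len(ideal_sims)
```

A perfectly ranked result list yields nDCG = 1.0, and avg. diff. = 0 means the approximate search recovered exactly the gold-standard similarities.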
Note that since we re-rank |E| results obtained from Elasticsearch (see Section 2.2), the positions on which the gold standard vectors were originally returned by the fulltext engine are irrelevant.
Apart from the vector dimensionality n (the number of LSA features), we monitored the trim threshold and the number of best features as described in Section 2.2.2. We also experimented with the number of vectors E retrieved for each Elasticsearch query as the page parameter.
To provide a comparison with an established search approach, and to serve as a baseline, we also evaluated indexing and searching using the native fulltext capabilities of Elasticsearch. In this case, the plain fulltext of every article was sent directly to the fulltext search engine as a string, without any vector conversions or preprocessing. For querying, we used the More Like This (MLT) Query API of Elasticsearch. Any data processing during indexing and querying was done by Elasticsearch in its default settings, with a single exception: we evaluated multiple values of the max_query_terms parameter of the MLT API. We tested the default value (25) plus values corresponding to the values of the best parameter used for the evaluation of our method.
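For reference, an MLT baseline request of this kind might look as follows. The more_like_this query, its like text and the max_query_terms parameter are standard parts of the Elasticsearch MLT Query API; the field name and the request structure around them are illustrative assumptions:

```python
# Illustrative More Like This query body; the field name "text" is assumed.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["text"],       # hypothetical fulltext field
            "like": "full article text used as the query document ...",
            "max_query_terms": 25,    # the default; we also tried values
        }                             # matching the `best` parameter
    },
    "size": 10,                       # top-10 results, as evaluated
}
```

Such a body would be posted to the search endpoint of the index holding the plain article texts.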
We report mainly on the avg. diff., i.e. the mean difference between the ideal and the actual cosine similarities of the first ten retrieved documents to a query, averaged over all 1,000 queries. We also report Precision@10, i.e. the ratio of the gold standard documents in the first ten results averaged over all 1,000 queries, and Normalized Discounted Cumulative Gain (nDCG_10) (Manning et al., 2008, Section 8.4), where the relevance value of a retrieved document is taken to be its cosine similarity to the query.

Table 1: Comparison of encoding methods in terms of space and time. n is the number of semantic vector features, d is the number of semantic vectors, m is the number of semantic vector features after high-pass filtering, p is the expected postings list size (inverted index sparsity), v is the expected vocabulary size per feature, c is the number of result-candidate semantic vectors, j the expected number of distinct indexed values of a feature, and l the expected number of documents in the index per feature value. For details see Section 2.3.

              naive search   token encoding       high-pass filtering
  space       O(nd)          O(nd)                O(md)
  search      O(nd + d)      O(n·p·v + c log c)   O(n·(log j + l) + c)

Speed Evaluation
In this section, we evaluate the performance of feature token string searches in Elasticsearch, using various Elasticsearch configurations as well as various vector filtering parameters. Our Elasticsearch cluster consisted of 6 nodes, with 32 GiB RAM and 8 CPUs each, for a total of 192 GiB and 48 cores.
We report ES avg./std. [s], the average number of seconds per request inside Elasticsearch (the 'took' time from the Elasticsearch response) and its standard deviation; Request avg./std. [s], the average number of seconds per request including the communication overhead with Elasticsearch (i.e. the time for the client to get the answer) and its standard deviation; and Total time [s], the total number of seconds for processing all requests in the batch, which can differ from the sum of the average request times when queries are executed in parallel. The number of features in the query vectors that passed threshold trimming is reported as Vec. size avg./std.

Quality Evaluation
Results of the quality evaluation are summarized in Table 2. The results of our method in different settings are put side by side with the results of the brute-force naive search and with Elasticsearch's native More Like This (MLT) search. For the MLT results, the max_query_terms Elasticsearch parameter is reported in the best column since both the semantics and the impact on the search speed are similar. Figure 1 illustrates the impact of feature value filtering and the number of retrieved search candidates from Elasticsearch (page size) on its accuracy. It can be seen that avg. diff. decreases logarithmically with the page size. The results improve all the way up to 640 search results (the maximum value we have tried), which is expected as this increases the size of the subset E that is consequently ordered (re-ranked) in phase 2 with the precise but more costly exact algorithm. Increasing the size of E increases the chance of the inclusion of relevant results.
The shape of the curve suggests that further gains in accuracy would be slight, and would come at the cost of a substantial drop in performance. The impact of including only a limited number of features with the highest absolute values (see Section 2.2.2) is rather low. This is an excellent result with regard to performance, as it means we may effectively sparsify the query vector with very little impact on search quality. We observe little difference between no filtering (searching by all 400 encoded features) and trimming to only the 90 best query vector values. However, trimming to as few as 6 values results in a significant increase in avg. diff.; see Figure 1a. Our method scores above the MLT baseline on all of the followed metrics, even with aggressive high-pass filtering and a small page size.
We expect similar setup parameters to work comparably well, at least for general multi-topic text datasets. The behaviour on a dramatically different dataset, such as images instead of texts, or on data without normalized feature ranges, cannot be directly inferred and remains to be investigated in our future work. However, we expect our speed optimization methods to be applicable in some form, with the concrete parameters to be validated on the particular dataset and algorithmic setup.

Speed Evaluation
Selected results of our speed evaluation are summarized in Table 3. For clarity, we selected only the configurations using 400 LSA features and 48 Elasticsearch shards, as this setup turned out to provide the optimal performance on the Elasticsearch cluster and dataset we used according to the quality evaluation (see Section 4.1), where the same parameters were used.
The speed of the native Elasticsearch MLT search is summarized in Table 4; it is comparable to our method when high-pass filtering is involved. Comparing sequential queries (Figure 2a) with four parallel queries on the same Elasticsearch cluster configuration (Figure 2b) shows that, at the expense of doubling the response time, we are able to answer four requests in parallel.
The best results in Figure 2 are located in the bottom right corner where the precision is high and the response time is low. For our dataset and algorithm (LSA with 400 features), the best overall results are represented by the largest gray dots, i.e. retrieving 320 vectors from Elasticsearch while filtering the query vector to roughly 90 values via trimming with a threshold of 0.05.
To achieve the optimal results, we suggest retrieving as large a set of candidates (Elasticsearch page size, E) as the response-time constraints allow, as the page size seems to have significantly lower influence on the response time compared to trimming, while having a significant positive effect on accuracy.
Our experiments were done on the Wikipedia dataset. Wikipedia is a multi-topic dataset: articles cover a wide variety of topics, using different keywords (names of people, places and things, etc.) and notation (text-only articles, articles on mathematics using formulae, articles on chemistry using different formulae, etc.). This gives the machine learning algorithms enough room to build features that reflect the unique markers of particular topics, which also makes many features irrelevant for particular documents in the dataset. For general multi-topic text datasets, we recommend trimming feature values whose absolute value falls below 5% of the maximum (i.e. between −0.05 and 0.05 in our experiments). Trimming more feature tokens decreases the precision with almost no influence on the response times, while keeping more feature tokens in the index has almost no positive effect on the precision but slows down the search significantly.

Conclusions
In this paper we have demonstrated a novel method for the conversion of semantic vectors into a set of string 'feature tokens' that can be subsequently indexed in a standard inverted-index-based fulltext search engine, such as Elasticsearch.
Two techniques of feature tokens filtering were demonstrated to further significantly speed up the search process, with an acceptably low impact on the quality of the results.
Using Elasticsearch MLT on document texts as a baseline, our method performs better on all the followed metrics. With sufficient query vector feature reduction, our method is also faster than MLT; with moderate reduction, we achieve an excellent approximation of the gold standard while being only marginally slower than MLT. An important conclusion from our experiments is that the search speed can be improved by filtering the query vectors alone, without the need to trim the index vectors. A pleasant practical consequence of this finding is that a vector search engine based on our proposed scheme could allow users to define the filtering parameters dynamically, at search time rather than at indexing time. In this way, we let users choose the trade-off between search speed and accuracy, i.e. use weaker filtering parameters for searches where accuracy is critical, and more aggressive filtering where speed is critical.
In our future work we will focus on the validation of our techniques on different types of data (such as images or audio data) and different text representations (such as doc2vec) in specific domains (such as question answering).