Efficient, Compositional, Order-sensitive n-gram Embeddings

We propose ECO: a new way to generate embeddings for phrases that is Efficient, Compositional, and Order-sensitive. Our method creates decompositional embeddings for words offline and combines them to create new embeddings for phrases in real time. Unlike other approaches, ECO can create embeddings for phrases not seen during training. We evaluate ECO on supervised and unsupervised tasks and demonstrate that creating phrase embeddings that are sensitive to word order can help downstream tasks.


Introduction
Semantic embeddings of words represent word meaning via a vector of real values (Deerwester et al., 1990). The Word2Vec models introduced by Mikolov et al. (2013a) greatly popularized this representation method, and since then many improvements to the basic Word2Vec model have been proposed (Levy and Goldberg, 2014; Ling et al., 2015).
Although existing techniques can adequately induce representations of single tokens (Mikolov et al., 2013a; Pennington et al., 2014), current methods for creating n-gram embeddings are far from satisfactory: they cannot embed n-grams that do not appear during training. For example, Hill et al. (2016) use the heuristic of converting phrases to single tokens before learning embeddings, and Yin and Schütze (2014) query external sources to determine which phrases to embed.
We propose a new method for creating phrase embeddings on-the-fly.
Offline, we compute decomposed word embeddings (Figure 1a) that can be used online to Efficiently generate Compositional n-gram embeddings that are sensitive to word Order (Figure 1b). We refer to our method as ECO. ECO is a novel way to incorporate knowledge about phrases into machine learning tasks. We evaluate our method on a range of supervised and unsupervised tasks.

Background
Before presenting our approach for creating decomposed word embeddings, which we ultimately use to create n-gram embeddings online, we introduce our notation and provide a brief overview of the Word2Vec model.

Notation
We define $s$ to be a sequence of words and $s_j$ to be the $j$-th word of sequence $s$. Let $|s|$ be the length of the sequence and let $S$ be the set of all sequences. Additionally, let $W$ denote an indexed set of words, $w$ denote a generic word, and $w_i$ denote the $i$-th word of $W$. $V$ and $V^{out}$ denote indexed sets of vectors of length $d$ corresponding to $W$, i.e. $v \in V$, $v^{out} \in V^{out}$, and $v_w$ corresponds to the vector representing word $w \in W$. These two sets of vectors correspond to the input and output representations of a word as described by Mikolov et al. (2013b). The notation $[a, b)$ denotes the set of successive integers starting from and including $a$ and excluding $b$.
Word2Vec Model The popular Word2Vec model consists of four possible models: Continuous Bag-of-Words (CBOW) with hierarchical softmax or negative sampling, and Skip-Gram (SG) with the same two optimization choices. CBOW aims to predict a single word $w$ surrounded by the given context, while SG tries to predict the context words around $w$ (Rong, 2014). The SG model maximizes the following average log-probability of the sentence, averaged over the entire corpus:

$$\frac{1}{|s|} \sum_{j=1}^{|s|} \sum_{\substack{-c \le i \le c \\ i \ne 0}} \log p(s_{j+i} \mid s_j), \quad (1)$$

where $c$ refers to the window size, i.e. half the size of the context. The probability of token $s_k$ given token $s_j$ is computed as the softmax over the inner products of the embeddings of the two tokens:

$$p(s_k \mid s_j) = \frac{\exp\left(v^{out}_{s_k} \cdot v_{s_j}\right)}{\sum_{w \in W} \exp\left(v^{out}_{w} \cdot v_{s_j}\right)}. \quad (2)$$
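To make (2) concrete, here is a minimal numpy sketch of the softmax over inner products; the toy vocabulary and randomly initialized vectors are illustrative stand-ins, not trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "killer", "shark", "swam"]
d = 4
V = {w: rng.normal(size=d) for w in vocab}       # input vectors v_w
V_out = {w: rng.normal(size=d) for w in vocab}   # output vectors v_w^out

def p(s_k, s_j):
    """p(s_k | s_j) as in (2): softmax over inner products."""
    scores = np.array([V_out[w] @ V[s_j] for w in vocab])
    scores -= scores.max()                       # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[vocab.index(s_k)]

print(round(p("shark", "killer"), 3))            # one term inside the sum in (1)
```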

Possible approaches to embed n-grams
Before introducing ECO, we discuss other possible ways to combine unigram embeddings to generate n-gram embeddings. This discussion motivates the need for ECO and highlights the issues that our approach solves.
Treat n-grams as words The simplest way to create embeddings for phrases would be to treat phrases as single words and run out-of-the-box software to embed those n-grams just as one would embed single words. Implementing such an approach would only require changing how one pre-processes text before running Word2Vec. Yin and Schütze (2014) use external sources to determine common bigrams to embed offline. This approach cannot embed unknown n-grams, regardless of whether each of the n words in the sequence appeared in a training corpus. Since this situation occurs often, especially as the minimum occurrence count required to embed a word increases, this approach is insufficient and cannot embed n-grams on-the-fly.
Combining individual word embeddings The next plausible approach to creating n-gram embeddings would be to combine the individual word embeddings into one new embedding with heuristics such as averaging, adding, or multiplying the word embeddings (Mitchell and Lapata, 2010). Averaging the embeddings, which we use as a baseline in our experiments, can be viewed as

$$v_{[w_1:w_n]} = \frac{1}{n} \sum_{i=1}^{n} v_{w_i}, \quad (3)$$

where $v_{[w_1:w_n]}$ is the embedding for a phrase of size $n$. However, regardless of how one combines the individual word embeddings, the ordering of words in a phrase is not captured in the new n-gram embedding. For example, with this method the embeddings for the bigrams shark killer and killer shark would be identical. Therefore, an order-sensitive approach is needed.
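The following minimal sketch demonstrates why the averaging baseline in (3) is order-insensitive; the random vectors stand in for hypothetical pretrained unigram embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
emb = {w: rng.normal(size=8) for w in ["killer", "shark"]}  # stand-in unigram vectors

def average_phrase(words):
    """Baseline in (3): the mean of the unigram embeddings."""
    return np.mean([emb[w] for w in words], axis=0)

a = average_phrase(["killer", "shark"])
b = average_phrase(["shark", "killer"])
print(np.allclose(a, b))  # True: the average ignores word order entirely
```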

The ECO Way
We now present our strategy to eliminate the shortcomings of the previously discussed approaches and propose an intuitive method for creating n-gram embeddings.
Skip-Embeddings The Word2Vec model encodes a word $w$ using a single embedding $v_w$ that must maximize the log-probability of the tokens that occur around it. This encourages the embedding of a word to be representative of the context surrounding it. However, a careful look reveals that the context around a word can be split into multiple categories; specifically, each word has at least $2c$ contexts, one for each position in the window being considered.
Thus, we can parameterize each word $w$ with $2c$ embeddings. For all $i \in [-c, c]$ such that $i \ne 0$, $v^i_w$ encodes the context of word $w$ at a specific position, to the left ($-$) or right ($+$) of $w$. With this strategy, instead of having one model with the objective function from (1), we now have $2c$ independent models, each with its own objective function

$$\frac{1}{|s|} \sum_{j=1}^{|s|} \log p_i(s_k \mid s_j), \quad (4)$$

where $s_k$ is the word $i$ positions away from $s_j$ in $s$ (i.e., $k = j + i$).
The new probability distribution is now

$$p_i(s_k \mid s_j) = \frac{\exp\left(v^{i,out}_{s_k} \cdot v^i_{s_j}\right)}{\sum_{w \in W} \exp\left(v^{i,out}_{w} \cdot v^i_{s_j}\right)}. \quad (5)$$

We refer to each of the $2c$ decompositional embeddings created per word as skip-embeddings. Since a skip-embedding only considers a single token separated by $i$ tokens from $w$, the dimensionality of $v^i_w$ should be kept to $\frac{d}{2c}$ to allow for direct comparison with Word2Vec, which uses $d$-dimensional embeddings. Consequently, each skip-embedding is trained with only $\frac{d}{2c}$ parameters. Another major benefit of this architecture is that training can run in parallel, since the $2c$ skip-embeddings are generated independently. As evidenced in section 5.2, our approach does not sacrifice quality in single-word embeddings.
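As a concrete sketch of how the $2c$ independent training streams could be prepared, the function below groups (center, context) pairs by signed offset; the toy corpus is illustrative and the helper is our own construction, not part of any released ECO implementation.

```python
from collections import defaultdict

def offset_pairs(sentences, c):
    """Group (center, context) training pairs by signed offset i in [-c, c], i != 0.
    Each offset's list can train its own small skip-gram model with d/2c
    dimensions, independently of the others and hence in parallel."""
    pairs = defaultdict(list)
    for s in sentences:
        for j, w in enumerate(s):
            for i in range(-c, c + 1):
                if i != 0 and 0 <= j + i < len(s):
                    pairs[i].append((w, s[j + i]))
    return pairs

corpus = [["the", "killer", "shark", "swam", "away"]]
for i, ps in sorted(offset_pairs(corpus, c=2).items()):
    print(i, ps[:2])  # e.g. -1 -> [('killer', 'the'), ('shark', 'killer')]
```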
Combining Skip-Embeddings After creating skip-embeddings offline, we are ready to embed n-grams on-the-fly, regardless of whether an n-gram appeared in the original training corpus. Although we could concatenate the $2c$ skip-embeddings of a word to create a unigram embedding, to create n-gram embeddings we instead average the position-specific skip-embeddings of the words into two vectors:

$$v^L_{[w_1:w_n]} = \frac{1}{n} \sum_{i=1}^{n} v^{-i}_{w_i} \quad (6)$$

$$v^R_{[w_1:w_n]} = \frac{1}{n} \sum_{i=1}^{n} v^{+(n+1-i)}_{w_i} \quad (7)$$

We then concatenate $v^L_{[w_1:w_n]}$ and $v^R_{[w_1:w_n]}$ to create a single embedding of the entire n-gram. After concatenation, the dimensionality of an ECO n-gram embedding is $\frac{d}{c}$.
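A minimal sketch of the composition step under the reconstruction in (6)–(7); the `skip` table of position-specific vectors is a random stand-in for skip-embeddings trained offline, and the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
c, dim = 2, 4  # window size c; each skip-embedding has d/2c dimensions
offsets = [i for i in range(-c, c + 1) if i != 0]
skip = {(w, i): rng.normal(size=dim)
        for w in ["killer", "shark"] for i in offsets}

def eco_embed(ngram):
    """Concatenate v^L and v^R as in (6)-(7); assumes len(ngram) <= c."""
    n = len(ngram)
    vL = np.mean([skip[w, -(i + 1)] for i, w in enumerate(ngram)], axis=0)
    vR = np.mean([skip[w, n - i] for i, w in enumerate(ngram)], axis=0)
    return np.concatenate([vL, vR])

a = eco_embed(["killer", "shark"])
b = eco_embed(["shark", "killer"])
print(a.shape, np.allclose(a, b))  # (8,) False: order now matters
```

Note how the two orderings of the bigram now receive different embeddings, unlike the averaging baseline in (3).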

Experiments
Our proposed method decomposes previous word-embedding work into $2c$ models, as explained in (4), and uses an order-sensitive heuristic, (6)–(7), to combine skip-embeddings into n-gram embeddings. Our experiments demonstrate that this method retains more semantic information than order-insensitive alternatives. We evaluate our n-gram embeddings on both supervised and unsupervised tasks to test how well our technique embeds phrases as well as single words.
Data We extracted over 111 million sentences, consisting of over 2 billion words of raw text, from English Wikipedia (Ferraro et al., 2014) and ran our ECO framework to create skip-embeddings for each word that appeared at least five times in the text. We also ran out-of-the-box Word2Vec on the same English Wikipedia text to serve as a baseline.

Phrase Similarity
We compare similarities between source and target phrases extracted from the Paraphrase Database (PPDB). To create our evaluation set, we randomly sampled source phrases from PPDB that had at least two corresponding target phrases in the database, and then randomly sampled two target phrases for each source phrase. For each tuple consisting of a source phrase and two target phrases, we manually chose which target phrase best captured the meaning of the source phrase, or whether both target phrases have the same meaning. This became our gold data. Our evaluation set consists of 279 source phrases: 137 from PPDB's extra-extra-large phrasal subset and 142 from PPDB's extra-extra-large lexical subset. Figure 2 illustrates an example from our evaluation dataset, with the sampled target phrases bolded. We use our proposed model to embed the source and target phrases. If the absolute difference between the two cosine similarities is less than .01, we count the two target phrases as having the same meaning; otherwise, we choose the target phrase whose embedding has the higher cosine similarity with the embedding of the source phrase. We compare our results with the PPDB 1.0 (Ganitkevitch et al., 2013) and PPDB 2.0 (Pavlick et al., 2015) similarity scores and with the cosine similarities computed by the naive averaging approach discussed in Section 3. The accuracies reported in Table 1 demonstrate that ECO captures n-gram semantics better than the baseline approach: in all configurations, ECO outperforms Word2Vec for phrases longer than one word.
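The decision rule above can be sketched as follows; `embed` stands for any phrase embedder (ECO or the averaging baseline), and the phrases in the usage example are made-up stand-ins for PPDB entries.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pick_target(embed, source, t1, t2, tie=0.01):
    """Declare a tie if the cosine gap is under the threshold;
    otherwise pick the target closer to the source."""
    s1 = cosine(embed(source), embed(t1))
    s2 = cosine(embed(source), embed(t2))
    if abs(s1 - s2) < tie:
        return "same meaning"
    return t1 if s1 > s2 else t2

# Usage with dummy embeddings in place of real phrase embeddings.
rng = np.random.default_rng(3)
dummy = {p: rng.normal(size=8)
         for p in ["in order to", "so as to", "with a view to"]}
print(pick_target(dummy.get, "in order to", "so as to", "with a view to"))
```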

Word Embedding Similarity
Although ECO's primary goal is to create n-gram embeddings, our approach should not sacrifice quality in single-word embeddings. Thus, we compare our word embeddings against seven word-similarity benchmarks via the online evaluation system of Faruqui and Dyer (2014). To evaluate how well ECO embeds unigrams, we concatenate $v^{-1}_w$ and $v^{+1}_w$ for the 5629 words provided by the system and upload the resulting ECO word embeddings, along with the embeddings we generated by running Word2Vec as our baseline. The scores reported in Table 2 suggest that, as the number of parameters increases, ECO retains information in word embeddings better than Word2Vec.
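The unigram construction used for these benchmarks amounts to a single concatenation; a tiny sketch with stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
skip = {("shark", i): rng.normal(size=4) for i in (-1, 1)}  # stand-in skip-embeddings

def eco_unigram(w):
    """Unigram embedding for the benchmarks: concatenate v_w^{-1} and v_w^{+1}."""
    return np.concatenate([skip[w, -1], skip[w, 1]])

print(eco_unigram("shark").shape)  # (8,)
```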

Supervised Scoring Model
Unlike the original paraphrase ranking heuristic, Pavlick et al. (2015)

Previous work
Due to the popularity of word embeddings and the boost they have provided in supervised (Le and Mikolov, 2014) and unsupervised (Lin et al., 2015) NLP tasks, recent work has focused on how to properly embed sentences and phrases. Yin and Schütze (2014)'s method is similar to the method discussed in Section 3: they use Wiktionary and WordNet to determine the most common bigrams and create embeddings for those. Hill et al. (2016) use reverse dictionaries to determine which phrases define single words and use neural language models to learn a mapping between the phrases and word vectors. Neither of these approaches can generate embeddings for phrases on the fly, and both require an external corpus. Recent work has also focused on capturing word order in embeddings. While Yuan et al. (2016) are not concerned with embedding phrases, they point out issues with concatenating or averaging standard word embeddings. They train an LSTM to appropriately incorporate word vectors in the Word Sense Disambiguation task; their model is sensitive to word order when determining the sense of a specific word, but their approach is more computationally intensive than ECO. Le and Mikolov (2014)'s Paragraph Vector framework also focuses on capturing word order in its embeddings. However, our method is more efficient, since ECO does not require training the n-gram embeddings. Ling et al. (2015)'s work on structured Word2Vec is most similar to ours. However, instead of decomposing Word2Vec into $2c$ models with the same total number of parameters, Ling et al. (2015) combine the contexts into one large model, creating a single model with $2c$ times the parameters. Even though Ling et al. (2015) incorporate positional information into the Word2Vec models, their approach cannot be used to create efficient, compositional, and order-sensitive n-gram embeddings.

Conclusion
We investigated a general view of Word2Vec based upon creating multiple separate skip-embeddings per word, where each skip-embedding is individually much smaller than the single Word2Vec word embedding. Our method allows us to efficiently compose embeddings for n-grams that were not seen during training of the skip-embeddings, while maintaining order sensitivity. Our experiments also demonstrated that averaging skip-embeddings to create n-gram embeddings that preserve order-sensitive information is useful for NLP tasks, while using the same number of parameters as the Word2Vec method. In comparison to previous approaches (Le and Mikolov, 2014; Yuan et al., 2016), our method is computationally efficient. This tradeoff between efficiency, both in the number of parameters stored and learned and in the computations performed, and order sensitivity is unique to our proposed model.
In future work, we will investigate other heuristics for combining skip-embeddings into n-gram embeddings. Additionally, we hope to use techniques similar to ECO to embed full sentences and documents in real time. Finally, we plan to explore tensor factorization methods (Cotterell et al., 2017) to incorporate morphology, syntactic relations, and other linguistic structures into ECO n-gram embeddings.