Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics

Existing word representation methods mostly limit their information source to word co-occurrence statistics. In this paper, we introduce ngrams into four representation methods: SGNS, GloVe, PPMI matrix, and its SVD factorization. Comprehensive experiments are conducted on word analogy and similarity tasks. The results show that improved word representations are learned from ngram co-occurrence statistics. We also demonstrate that the trained ngram representations are useful in several ways, such as finding antonyms and collocations. In addition, a novel approach to building the co-occurrence matrix is proposed to alleviate the hardware burden brought by ngrams.


Introduction
Recently, deep learning approaches have achieved state-of-the-art results on a range of NLP tasks. One of the most fundamental lines of work in this field is word embedding, where low-dimensional word representations are learned from unlabeled corpora through neural models. The trained word embeddings reflect semantic and syntactic information of words. They are not only useful in revealing lexical semantics, but are also used as inputs to various downstream tasks for better performance (Kim, 2014; Collobert et al., 2011; Pennington et al., 2014).
Most word embedding models are trained on <word, context> pairs in the local window. Among them, word2vec has gained popularity through its effectiveness and efficiency (Mikolov et al., 2013b,a). It achieves state-of-the-art results on a range of linguistic tasks in only a fraction of the time required by previous techniques. A challenger of word2vec is GloVe (Pennington et al., 2014). Instead of training on <word, context> pairs, GloVe directly utilizes the word co-occurrence matrix. Its authors claim that this change brings improvements over word2vec in both accuracy and speed. Levy and Goldberg (2014b) further reveal that the attractive properties observed in word embeddings are not restricted to neural models such as word2vec and GloVe. They use a traditional count-based method (PPMI matrix with hyper-parameter tuning) to represent words, and achieve results comparable with the above neural embedding models.
The above models limit their information source to word co-occurrence statistics (Levy et al., 2015). To learn improved word representations, we extend the information source from co-occurrence of 'word-word' type to co-occurrence of 'ngram-ngram' type. The idea of using ngrams is well supported by language modeling, one of the oldest problems studied in statistical NLP. In language models, co-occurrence of words and ngrams is used to predict the next word (Kneser and Ney, 1995; Katz, 1987). In fact, word embedding models are rooted in language models. The two are closely related but are used for different purposes: word embedding models aim at learning useful word representations rather than at word prediction. Since ngrams are a vital part of language modeling, we are inspired to integrate ngram statistical information into recent word representation methods for better performance.
The idea of using ngrams is intuitive. However, little work has used ngrams in recent representation methods. In this paper, we introduce ngrams into SGNS, GloVe, PPMI, and its SVD factorization. To evaluate the ngram-based models, comprehensive experiments are conducted on word analogy and similarity tasks. Experimental results demonstrate that improved word representations are learned from ngram co-occurrence statistics. In addition, we qualitatively evaluate the trained ngram representations. We show that they are able to reflect ngrams' meanings and syntactic patterns (e.g. the 'be + past participle' pattern). The high-quality ngram representations are useful in many ways. For example, ngrams in negative form (e.g. 'not interesting') can be used for finding antonyms (e.g. 'boring').
Finally, a novel method is proposed to build the ngram co-occurrence matrix. Our method reduces disk I/O as much as possible, largely alleviating the costs brought by ngrams. We unify the different representation methods in a pipeline. The source code is organized as the ngram2vec toolkit and released at https://github.com/zhezhaoa/ngram2vec.

Related Work
SGNS, GloVe, PPMI, and its SVD factorization are used as baselines. The information used by them does not go beyond word co-occurrence statistics. However, they use this information in different ways. We review these methods in the following three sections. In Section 2.4, we revisit the use of ngrams in the deep learning context.

SGNS
Skip-gram with negative sampling (SGNS) is a model in the word2vec toolkit (Mikolov et al., 2013b,a). Its training procedure follows that of the majority of neural embedding models (Bengio et al., 2003): (1) Scan the corpus and use <word, context> pairs in the local window as training samples. (2) Train the model to make words useful for predicting contexts (or the reverse). The details of SGNS are discussed in Section 3.1. Compared to previous neural embedding models, SGNS speeds up the training process, reducing the training time from days or weeks to hours. Also, the trained embeddings possess attractive properties. They are able to reflect relations between two words accurately, which is evaluated by a task called word analogy.
Due to the above advantages, many models have been proposed on the basis of SGNS. For example, Faruqui et al. (2015) introduce knowledge from lexical resources into the models in word2vec. Other work extends the contexts from the local window to entire documents. Li et al. (2015) use supervised information to guide the training. Dependency parse trees are used for defining context in (Levy and Goldberg, 2014a). LSTM is used for modeling context in (Melamud et al., 2016). Sub-word information is considered in (Sun et al., 2016; Soricut and Och, 2015).

GloVe
Different from typical neural embedding models, which are trained on <word, context> pairs, GloVe learns word representations on the basis of the co-occurrence matrix (Pennington et al., 2014). GloVe breaks the traditional 'words predict contexts' paradigm. Its objective is to reconstruct the non-zero values in the matrix. The direct use of the matrix is reported to bring improved results and higher speed. However, the advantages of GloVe over word2vec are still disputed (Levy et al., 2015; Schnabel et al., 2015). GloVe and other embedding models are essentially based on word co-occurrence statistics of the corpus. The <word, context> pairs and the co-occurrence matrix can be converted to each other. Suzuki and Nagata (2015) try to unify GloVe and SGNS in one framework.
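To make 'reconstructing the non-zero values in the matrix' concrete, the following is a minimal sketch of the weighted least-squares objective from Pennington et al. (2014). It is an illustration only, not the toolkit's code: the function and argument names are hypothetical, and x_max=100 and alpha=0.75 are the defaults of the GloVe paper rather than settings stated in this section.

```python
import numpy as np

def glove_loss(W, C, bw, bc, counts, x_max=100.0, alpha=0.75):
    """Weighted squared error over the non-zero cells of a count matrix.

    W, C   : word and context embedding matrices (hypothetical names)
    bw, bc : word and context bias vectors
    counts : dense co-occurrence counts, shape (len(W), len(C))
    """
    rows, cols = np.nonzero(counts)            # zero cells are skipped entirely
    x = counts[rows, cols].astype(float)
    weight = np.minimum((x / x_max) ** alpha, 1.0)   # caps the weight of frequent pairs
    pred = np.einsum("ij,ij->i", W[rows], C[cols]) + bw[rows] + bc[cols]
    return float(np.sum(weight * (pred - np.log(x)) ** 2))
```

Because only non-zero cells contribute, the cost of one pass grows with the number of observed pairs rather than with the full size of the matrix.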

PPMI & SVD
Given the large improvements achieved by embedding models on linguistic tasks, a natural question arises: where do these advantages come from? One conjecture is that they are due to the neural networks. However, Levy and Goldberg (2014c) reveal that SGNS is implicitly factorizing a PMI matrix. Also, Levy and Goldberg (2014b) show that the positive PMI (PPMI) matrix still rivals the newly proposed embedding models on a range of linguistic tasks. Properties like word analogy are not restricted to neural models. To obtain dense word representations from the PPMI matrix, we factorize it with SVD, a classic dimensionality reduction method for learning low-dimensional vectors from a sparse matrix (Deerwester et al., 1990).
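The two count-based steps described above can be sketched in a few lines. This is a minimal dense-matrix illustration under our own assumptions, not the baseline implementation: the function names are hypothetical, a real vocabulary needs sparse matrices, and scaling the left singular vectors by the singular values is only one common choice.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI of a dense count matrix (rows: words, columns: contexts)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0      # unseen pairs give -inf (or nan); set them to 0
    return np.maximum(pmi, 0.0)       # keep only positive associations

def svd_embeddings(matrix, dim):
    """Truncated SVD: keep the top `dim` singular directions as word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * s[:dim]

# Toy counts, made up for illustration only.
toy_counts = np.array([[10, 0, 2],
                       [0, 8, 1],
                       [2, 1, 6]])
toy_vectors = svd_embeddings(ppmi(toy_counts), dim=2)
```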

Ngram in Deep Learning
In the deep learning literature, ngrams have been shown to be useful in generating text representations. Recently, convolutional neural networks (CNNs) have been reported to perform well on a range of NLP tasks (Blunsom et al., 2014; Hu et al., 2014; Severyn and Moschitti, 2015). CNNs essentially use ngram information to represent texts. They use 1-D convolutional layers to extract ngram features, and the distinctive features are selected by max-pooling layers. In other work, ngram embedding is introduced into the Paragraph Vector model, where the text embedding is trained to be useful for predicting ngrams in the text. In the word embedding literature, a related work is done by Melamud et al. (2014), where word embedding models are used as baselines. They propose to use ngram language models to model the context, showing the effectiveness of ngrams on similarity tasks. Another work related to ngrams is that of Mikolov et al. (2013b), where phrases are embedded into vectors. It should be noted that phrases are different from ngrams. Phrases have clear semantics, and the number of phrases is much smaller than the number of ngrams. Using phrase embeddings has little impact on the quality of word embeddings.

Model
In this section, we introduce ngrams into SGNS, GloVe, PPMI, and SVD. Section 3.1 reviews SGNS. Sections 3.2 and 3.3 show the details of introducing ngrams into SGNS. In Section 3.4, we show how ngrams are used in GloVe, PPMI, and SVD, and propose a novel way of building the ngram co-occurrence matrix.

Word Predicts Word: the Revisit of SGNS
First we establish some notation. The raw input is a corpus $T = \{w_1, w_2, \dots, w_{|T|}\}$. Let W and C denote the word and context vocabularies, and let θ denote the parameters to be optimized. SGNS's parameters consist of two parts: the word embedding matrix and the context embedding matrix. With embeddings $\vec{w} \in \mathbb{R}^d$, the total number of parameters is (|W|+|C|)*d.
The objective of SGNS is to maximize the conditional probabilities of contexts given center words:
$$\sum_{t=1}^{|T|} \sum_{c \in C(w_t)} \log p(c \mid w_t; \theta)$$
where $C(w_t) = \{w_i \mid t - win \le i \le t + win \text{ and } i \ne t\}$ and win denotes the window size. As illustrated in Figure 1, the center word 'written' predicts its surrounding words 'Potter', 'is', 'by', and 'J.K.'. In this paper, negative sampling (Mikolov et al., 2013b) is used to approximate the conditional probability:
$$p(c \mid w) \approx \sigma(\vec{w}^{\top} \vec{c}) \prod_{j=1}^{k} \sigma(-\vec{w}^{\top} \vec{c}_j)$$
where σ is the sigmoid function. The k negative samples ($c_1$ to $c_k$) are drawn from the context distribution raised to the power of n.
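As a rough illustration of negative sampling, the sketch below computes the loss contributed by a single <word, context> pair together with its k negative samples, and builds the smoothed context distribution from which negatives are drawn. The function names are hypothetical, and the default power of 0.75 is the value used in word2vec, assumed here because this section does not state the value of n.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(w, c_pos, c_negs):
    """Negative log of sigma(w.c) * prod_j sigma(-w.c_j) for one training pair.

    w      : embedding of the center word, shape (d,)
    c_pos  : embedding of the observed context, shape (d,)
    c_negs : embeddings of the k sampled contexts, shape (k, d)
    """
    pos = np.log(sigmoid(w @ c_pos))
    neg = np.log(sigmoid(-(c_negs @ w))).sum()
    return -(pos + neg)

def noise_distribution(context_counts, power=0.75):
    """Context distribution raised to `power` and renormalised."""
    p = np.asarray(context_counts, dtype=float) ** power
    return p / p.sum()

# Drawing k = 5 negative context ids from made-up counts.
rng = np.random.default_rng(0)
negative_ids = rng.choice(3, size=5, p=noise_distribution([50, 10, 1]))
```

Raising the counts to a power below 1 flattens the distribution, so rare contexts are sampled as negatives more often than their raw frequency would suggest.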

Word Predicts Ngram
In this section, we introduce ngrams into the context vocabulary. We treat each ngram as a normal word and give it a unique embedding. During training, the center word predicts not only its surrounding words but also its surrounding ngrams. As shown in Figure 2, the center word 'written' predicts the bigrams in the local window, such as 'by J.K.'. The objective of 'word predicts ngram' is similar to that of the original SGNS. The only difference is the definition of C(w). In the ngram case, C(w) is formally defined as follows:
$$C(w_t) = \bigcup_{n=1}^{N} \{ w_{i:i+n} \mid w_{i:i+n} \ne w_t \text{ and } t - win \le i \le t + win - n + 1 \}$$
where $w_{i:i+n}$ denotes the ngram $w_i w_{i+1} \dots w_{i+n-1}$ and N is the order of context ngram. Two points should be noted about this definition. The first is how to determine the distance between the center word and a context ngram. In this paper, we use the distance between the word and the ngram's far-end word. As shown in Figure 2, the distance between 'written' and 'Harry Potter' is 3. As a result, 'Harry Potter' is not included in the center word's context. This distance definition ensures that the ngram models do not use information beyond the pre-specified window, which guarantees fair comparisons with the baselines. The second point is whether overlap between the word and the ngram is allowed. In the overlap setting, ngrams are used as contexts even if they contain the center word. As the example in Figure 2 shows, the ngrams 'is written' and 'written by' are predicted by the center word 'written'. In the non-overlap setting, these ngrams are excluded. The properties of the word embeddings differ depending on whether overlap is allowed, which will be discussed in the experiments section.
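The far-end distance rule and the overlap switch can be illustrated with a short context extractor. This is a sketch under our own assumptions rather than the toolkit's code: the function name and signature are hypothetical, and the example sentence is inferred from the words quoted in the figure descriptions.

```python
def ngram_contexts(tokens, t, win, max_order, overlap=True):
    """Context ngrams (orders 1..max_order) of the center word tokens[t].

    An ngram of order n starting at position i covers i..i+n-1. It is kept
    only if both of its ends lie within `win` of t, so the distance to the
    center word is measured from the ngram's far-end word.
    """
    contexts = []
    for n in range(1, max_order + 1):
        first = max(0, t - win)
        last = min(len(tokens) - n, t + win - n + 1)
        for i in range(first, last + 1):
            covers_center = i <= t <= i + n - 1
            if n == 1 and covers_center:
                continue          # the center word is never its own context
            if covers_center and not overlap:
                continue          # non-overlap: drop ngrams containing the center
            contexts.append(" ".join(tokens[i:i + n]))
    return contexts

sentence = "Harry Potter is written by J.K. Rowling".split()
print(ngram_contexts(sentence, t=3, win=2, max_order=2))
# ['Potter', 'is', 'by', 'J.K.', 'Potter is', 'is written', 'written by', 'by J.K.']
print(ngram_contexts(sentence, t=3, win=2, max_order=2, overlap=False))
# ['Potter', 'is', 'by', 'J.K.', 'Potter is', 'by J.K.']
```

In both calls 'Harry Potter' is absent, because its far-end word 'Harry' is three positions away from 'written' and the window size is 2.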

Ngram Predicts Ngram
We further extend the model by introducing ngrams into the center word vocabulary. During training, center ngrams (including words) predict their surrounding ngrams. As shown in Figure 3, the center bigram 'is written' predicts its surrounding words and bigrams. The objective of 'ngram predicts ngram' is as follows:
$$\sum_{t=1}^{|T|} \sum_{n_w=1}^{N_w} \sum_{c \in C(w_{t:t+n_w})} \log p(c \mid w_{t:t+n_w})$$
where $N_w$ is the order of center ngram. The definition of $C(w_{t:t+n_w})$ is as follows:
$$C(w_{t:t+n_w}) = \bigcup_{n_c=1}^{N_c} \{ w_{i:i+n_c} \mid w_{i:i+n_c} \ne w_{t:t+n_w} \text{ and } t - win + n_w - 1 \le i \le t + win - n_c + 1 \}$$
where $N_c$ is the order of context ngram. In this way, the word embeddings are affected not only by the ngrams in the context, but also indirectly by co-occurrence statistics of 'ngram-ngram' type in the corpus. SGNS has been shown to be equivalent to factorizing the pointwise mutual information (PMI) matrix (Levy and Goldberg, 2014c). Following their work, it can easily be shown that the models in Sections 3.2 and 3.3 implicitly factorize PMI matrices of 'word-ngram' and 'ngram-ngram' type. In the next section, we discuss introducing ngrams into the positive PMI (PPMI) matrix.

Co-occurrence Matrix Construction
Introducing ngrams into GloVe, PPMI, and SVD is straightforward: the only change is to replace word co-occurrence matrices with ngram ones. In the above three sections, we have discussed how to extract <word(ngram), word(ngram)> pairs from a corpus. Afterwards, we build the co-occurrence matrix upon these pairs. The remaining steps are identical to those of the original baseline models.

However, building the co-occurrence matrix is not as easy as it looks. The introduction of ngrams places a heavy burden on the hardware. The matrix construction cost is closely related to the number of pairs (#Pairs). Table 1 shows the statistics of pairs extracted from the corpus wiki2010. We can observe that #Pairs is huge when ngrams are considered.
To speed up the process of building the ngram co-occurrence matrix, we take advantage of the 'mixture' strategy (Pennington et al., 2014) and the 'stripes' strategy (Dyer et al., 2008; Lin, 2008). The two strategies optimize different aspects of the process. The computational cost is reduced significantly when they are used together.
When words (or ngrams) are sorted in descending order of frequency, the top-left corner of the co-occurrence matrix is dense while the rest is sparse. Based on this observation, a 'mixture' of two data structures is used for storing the matrix. Elements in the top-left corner are stored in a 2D array, which stays in memory. The remaining elements are stored in the form of <ngram, H>, where H<context, count> is an associative array recording the number of times the ngram and the context co-occur (the 'stripes' strategy). Compared with storing <ngram, context> pairs explicitly, the 'stripes' strategy provides more opportunities to aggregate pairs outside of the top-left corner.
Algorithm 1 shows how the 'mixture' and 'stripes' strategies are used together. In the first stage, pairs are stored in different data structures according to the topLeft function. Intermediate results are written to temporary files when memory is full. In the second stage, we merge these sorted temporary files to generate the co-occurrence matrix. The getSmallest function takes the pair <ngram, H> with the smallest key from the temporary files. In practice, Algorithm 1 is efficient. Instead of using computer clusters (Lin, 2008), we can build the matrix of 'bi bi' type even on a laptop. It requires only 12GB to store temporary files (win=2, sub-sampling=0, memory size=4GB), which is much less than the implementations in (Pennington et al., 2014; Levy et al., 2015). A more detailed analysis of these strategies can be found in the ngram2vec toolkit.
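The two stages described above can be sketched as follows. This is an illustrative approximation and not the toolkit's Algorithm 1: the function name, the JSON-lines temporary files, the entry-count stand-in for 'memory is full', and the simple square topLeft test are all assumptions made for this example. Ids are assumed to be frequency ranks, with 0 the most frequent.

```python
import heapq
import json
import os
import tempfile
from collections import Counter, defaultdict
from itertools import groupby

def build_cooccurrence(pairs, dense_size, max_entries, tmp_dir):
    """Count (ngram_id, context_id) pairs with a dense block plus stripes."""
    dense = [[0] * dense_size for _ in range(dense_size)]
    stripes = defaultdict(Counter)        # ngram_id -> {context_id: count}
    entries, tmp_files = 0, []

    def spill():
        nonlocal entries
        path = os.path.join(tmp_dir, f"stripes_{len(tmp_files)}.jsonl")
        with open(path, "w") as f:
            for key in sorted(stripes):   # every temporary file is sorted by key
                f.write(json.dumps([key, stripes[key]]) + "\n")
        tmp_files.append(path)
        stripes.clear()
        entries = 0

    # Stage 1: route each pair to the dense block or to an in-memory stripe.
    for w, c in pairs:
        if w < dense_size and c < dense_size:   # hypothetical topLeft test
            dense[w][c] += 1
            continue
        if c not in stripes[w]:
            entries += 1
        stripes[w][c] += 1
        if entries >= max_entries:              # stand-in for "memory is full"
            spill()
    if stripes:
        spill()

    # Stage 2: k-way merge of the sorted files, summing stripes with equal keys.
    def read(path):
        with open(path) as f:
            for line in f:
                key, stripe = json.loads(line)
                yield key, stripe

    merged = {}
    stream = heapq.merge(*(read(p) for p in tmp_files), key=lambda kv: kv[0])
    for key, group in groupby(stream, key=lambda kv: kv[0]):
        total = Counter()
        for _, stripe in group:
            total.update({int(c): n for c, n in stripe.items()})  # JSON keys are strings
        merged[key] = dict(total)
    return dense, merged

toy_pairs = [(0, 1), (0, 1), (5, 0), (5, 7), (1, 0), (5, 7), (9, 9), (5, 0)]
with tempfile.TemporaryDirectory() as tmp:
    dense, rest = build_cooccurrence(toy_pairs, dense_size=2, max_entries=2, tmp_dir=tmp)
```

Because each stripe already aggregates every context of one ngram seen since the last spill, the temporary files hold far fewer records than a list of raw pairs would.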

Datasets
The tasks used in this paper are the same as those in the work of Levy et al. (2015), and include six similarity datasets and two analogy datasets. In the similarity task, a scalar (e.g. a score from 0 to 10) is used to measure the relation between two words. For example, in one similarity dataset, the 'train, car' pair is given a score of 6.31. A problem with the similarity task is that the scalar only reflects the strength of the relation, while the type of relation is ignored entirely (Schnabel et al., 2015).
Due to this deficiency of the similarity task, the analogy task has recently been widely used as a benchmark for evaluating word embedding models. To answer analogy questions, the relation between two words is represented by a vector, which is usually obtained as the difference between their word embeddings. Unlike a scalar, the vector provides a more accurate description of the relation. For example, the capital-country relation is encoded in vec(Athens)-vec(Greece), vec(Tokyo)-vec(Japan), and so on. More concretely, the questions in the analogy task are in the form of 'a is to b as c is to d', where 'd' is an unknown word in the test phase. To answer the questions correctly, the models should embed the two relations, vec(a)-vec(b) and vec(c)-vec(d), into similar positions in the space. Following the work of Levy and Goldberg (2014b), both additive (add) and multiplicative (mul) functions are used for finding the word 'd'. The latter is more suitable for sparse representations in practice.
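The additive and multiplicative functions can be sketched as below, following the formulation of Levy and Goldberg (2014b) adapted to the 'a is to b as c is to d' notation used here. The function name, the toy vectors, and eps=1e-3 are illustrative assumptions; the toy embeddings are made up and are not trained vectors.

```python
import numpy as np

def analogy(emb, vocab, a, b, c, method="add", eps=1e-3):
    """Answer 'a is to b as c is to d' by searching d over the vocabulary."""
    e = np.asarray(emb, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit rows: dot = cosine
    idx = {w: i for i, w in enumerate(vocab)}
    cos_a, cos_b, cos_c = (e @ e[idx[w]] for w in (a, b, c))
    if method == "add":
        score = cos_b - cos_a + cos_c
    else:
        # Shift cosines from [-1, 1] to [0, 1] so the ratio is well defined.
        sa, sb, sc = ((x + 1) / 2 for x in (cos_a, cos_b, cos_c))
        score = sb * sc / (sa + eps)
    for w in (a, b, c):
        score[idx[w]] = -np.inf          # question words are excluded as answers
    return vocab[int(np.argmax(score))]

toy_vocab = ["Athens", "Greece", "Tokyo", "Japan", "distractor"]
toy_emb = np.array([[1, 0, 1, 0],        # city, Greek
                    [0, 1, 1, 0],        # country, Greek
                    [1, 0, 0, 1],        # city, Japanese
                    [0, 1, 0, 1],        # country, Japanese
                    [0, 0, 1, 1]])
print(analogy(toy_emb, toy_vocab, "Athens", "Greece", "Tokyo", method="add"))  # Japan
print(analogy(toy_emb, toy_vocab, "Athens", "Greece", "Tokyo", method="mul"))  # Japan
```

The multiplicative form amplifies small differences between candidates, which is one reason it tends to suit sparse representations such as PPMI.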

Pipeline and Hyper-parameter Setting
We implement SGNS, GloVe, PPMI, and SVD in a pipeline, allowing the reuse of code and intermediate results. Figure 4 gives an overview of the pipeline. First, <word(ngram), word(ngram)> pairs are extracted from the corpus as the input of SGNS. Afterwards, we build the co-occurrence matrix upon the pairs. GloVe and PPMI learn word representations on the basis of the co-occurrence matrix. SVD factorizes the PPMI matrix to obtain low-dimensional representations. Most hyper-parameters come from the 'corpus to pairs' part and the four representation models. The 'corpus to pairs' part determines the source of information for the subsequent models, and its hyper-parameter setting is as follows: low-frequency words (ngrams) are removed with a threshold of 10. High-frequency words (ngrams) are removed by sub-sampling with a threshold of 1e-5. The window size is set to 2 and 5. The clean strategy (Levy et al., 2015) is used to ensure that no information beyond the pre-specified window is included. The overlap setting is used by default. For the hyper-parameters of the four representation models, we use embeddings of 300 dimensions in dense representations. SGNS is trained for 3 iterations. The remaining hyper-parameters strictly follow the baseline models. We consider unigrams (words), bigrams, and trigrams in this work. The implementation of higher-order models and their results will be released with the ngram2vec toolkit.
Table 3: Performance of (ngram) SGNS on similarity datasets.

Ngrams on SGNS
SGNS is a popular word embedding model. Even compared with challengers such as GloVe, SGNS is reported to have more robust performance with faster training speed (Levy et al., 2015). Table 2 lists the results on analogy datasets. We can observe that the introduction of bigrams provides significant improvements at different hyper-parameter settings. The SGNS of 'bi bi' type provides the highest results. It is very effective at capturing semantic information (Google semantic). Improvements of around 10 percent are observed on semantic questions compared with the 'uni uni' baseline. For syntactic questions (Google syntactic and MSR datasets), improvements of around 5 percent are obtained on average. The effect of overlap is large on analogy datasets. Semantic questions prefer the overlap setting: improvements of around 10 and 3 percent over the non-overlap setting are observed at window sizes of 2 and 5, respectively. In the syntactic case, however, the non-overlap setting performs better by a margin of around 5 percent.
Table 6: Performance of (ngram) SVD on analogy and similarity datasets.
The introduction of trigrams degrades the models' performance on analogy datasets (especially at the window size of 2). This is probably because trigrams are sparse on wiki2010, a relatively small corpus with 1 billion tokens. We conjecture that high-order ngrams are more suitable for large corpora and will report the results in our future work. It should be noted that trigrams are not included in the vocabulary in the non-overlap case at the window size of 2, since the shortest distance between a word and a trigram is 3, which exceeds the window size. Table 3 shows SGNS's performance on the similarity task. The conclusion is similar to that for the analogy datasets. The use of bigrams is effective, while the introduction of trigrams degrades the performance in most cases. In general, bigrams bring significant improvements over SGNS on a range of linguistic tasks. It is well known that ngrams are a vital part of traditional language modeling. The results in Tables 2 and 3 confirm the effectiveness of ngrams again on SGNS, a more advanced word embedding model.

Ngrams on PPMI, GloVe, SVD
In this section, we only report the results of models of 'uni uni' and 'uni bi' types. Using higher-order co-occurrence statistics brings immense costs (especially at the window size of 5). Levy and Goldberg (2014b) demonstrate that traditional count-based models can still achieve competitive results on many linguistic tasks, challenging the dominance of neural embedding models. Table 4 lists the results of the PPMI matrix on analogy and similarity datasets. PPMI prefers multiplicative (Mul) evaluation. We therefore focus on analyzing the results in the Mul columns. When bigrams are used, significant improvements are observed on the analogy task. On the Google dataset, bigrams bring an increase of over 10 percent in total accuracy. At the window size of 2, the accuracy on semantic questions even reaches 0.854, which is the state-of-the-art result to the best of our knowledge. On the MSR dataset, improvements of around 20 percent are achieved. The use of bigrams does not always bring improvements on similarity datasets. The PPMI matrix of 'uni bi' type improves the results on 5 datasets at the window size of 2. At the window size of 5, using bigrams improves the results on only 2 datasets.
Tables 5 and 6 list the results of GloVe and SVD. For GloVe, consistent (but minor) improvements are achieved on the analogy task with the introduction of bigrams. On similarity datasets, improvements are observed in most cases. For SVD, bigrams sometimes lead to worse results in both analogy and similarity tasks. In general, significant improvements are not observed on GloVe and SVD. Our preliminary conjecture is that the default hyper-parameter setting is responsible. We strictly follow the hyper-parameters used in the baseline models, making no adjustments to accommodate the introduction of ngrams. In addition, some common techniques such as dynamic window, decreasing weighting function, and dirty sub-sampling are discarded. The relationships between ngrams and various hyper-parameters require further exploration. Though trivial, such exploration may lead to much better results and give researchers a better understanding of the different representation methods. That will be the focus of our future work.

Qualitative Evaluations of Ngram Embedding
In this section, we analyze the properties of ngram embeddings trained by SGNS of 'bi bi' type. Ideally, the trained ngram embeddings should reflect ngrams' semantic meanings. For example, vec(wasn't able) should be close to vec(unable), and vec(is written) should be close to vec(write) and vec(book). Also, the trained ngram embeddings should preserve ngrams' syntactic patterns. For example, 'was written' is in the form of 'be + past participle', and its nearest neighbors should possess similar patterns, such as 'is written' and 'was transcribed'. Table 7 lists the target ngrams and their top nearest neighbours. We divide the target ngrams into six groups according to their patterns. We can observe that the returned words and ngrams are very intuitive. As might be expected, synonyms of the target ngrams are returned in top positions (e.g. 'give off' and 'emit'; 'heavy rain' and 'downpours'). From the results of the first group, it can be observed that a bigram in the negative form 'not X' is useful for finding the antonym of the word 'X'. In addition, the trained ngram embeddings also preserve some common sense. For example, the returned result for 'highest mountain' is a list of mountain names (with a few exceptions such as 'unclimbed'). In terms of syntactic patterns, we can observe that in most cases the returned ngrams are in a form similar to that of the target ngrams. In general, the trained embeddings reflect the semantic meanings and syntactic patterns of ngrams.
With high-quality ngram embeddings, we have the opportunity to do more interesting things in our future work. For example, we will construct an antonym dataset to evaluate ngram embeddings systematically. We will also look for more scenarios in which ngram embeddings can be used. In our view, ngram embeddings have the potential to be used in many NLP tasks. For example, Johnson and Zhang (2015) use one-hot ngram representations as the input of a CNN. Other work uses ngram embeddings to represent texts. Intuitively, initializing these models with pre-trained ngram embeddings may further improve the accuracies.
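Nearest-neighbour lists such as those in Table 7 are typically produced by ranking the vocabulary by cosine similarity to the target. The sketch below assumes that procedure; the function name is hypothetical, and the two-dimensional vectors are invented for illustration and are not trained embeddings.

```python
import numpy as np

def nearest_neighbours(emb, vocab, target, k=3):
    """Return the k vocabulary entries closest to `target` by cosine similarity."""
    e = np.asarray(emb, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sims = e @ e[vocab.index(target)]
    order = [i for i in np.argsort(-sims) if vocab[i] != target]
    return [vocab[i] for i in order[:k]]

# Made-up vectors in which 'not interesting' lies near 'boring'.
toy_vocab = ["not interesting", "boring", "interesting", "dull"]
toy_emb = np.array([[1.0, 0.1],
                    [0.9, 0.2],
                    [-1.0, 0.1],
                    [0.5, 0.8]])
print(nearest_neighbours(toy_emb, toy_vocab, "not interesting", k=2))
# ['boring', 'dull']
```

Since words and ngrams share one embedding space, a bigram query such as 'not X' can return single words, which is what makes the antonym lookup possible.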

Conclusion
We introduce ngrams into four representation methods. The experimental results demonstrate the effectiveness of ngrams for learning improved word representations. In addition, we find that the trained ngram embeddings are able to reflect the semantic meanings and syntactic patterns of ngrams. To alleviate the costs brought by ngrams, we propose a novel way of building the co-occurrence matrix, enabling the ngram-based models to run on cheap hardware.