Quantifying Context Overlap for Training Word Embeddings

Most models for learning word embeddings are trained on the context information of words, more precisely, first order co-occurrence relations. In this paper, a metric is designed to estimate second order co-occurrence relations based on context overlap. The estimated values are then used as augmented data to enhance the learning of word embeddings by joint training with existing neural word embedding models. Experimental results show that the enhanced approach yields better word vectors on word similarity tasks and several downstream NLP tasks.


Introduction
In the last decade, the distributed word representation (a.k.a. word embedding) has attracted tremendous attention in the field of natural language processing (NLP). Unlike large sparse vectors such as the one-hot representation, the distributed word representation embeds semantic and syntactic characteristics of words into a low-dimensional space, which makes it popular in NLP applications.
The main idea of most word embedding models follows the distributional hypothesis (Harris, 1954), i.e., the embedding of each word may be inferred from its context. One important model family for distributional word representation learning is built on global matrix factorization (Deerwester et al., 1990; Lee and Seung, 2001; Srebro et al., 2005; Mnih and Hinton, 2007; Li et al., 2015; Wang and Cohen, 2016), in which a dimensionality reduction over a sparse matrix is performed to capture the statistical information of a corpus in low-dimensional vectors. Another family is that of neural word embeddings (Levy and Goldberg, 2014b); notable examples include the famous Neural Probabilistic Language Model (Bengio et al., 2003), SGNS and CBOW (Mikolov et al., 2013a,b), GloVe (Pennington et al., 2014), and their variants (Shazeer et al., 2016; Kenter et al., 2016; Ling et al., 2017; Patel et al., 2017).
Most of these models capture the context information of each word using the co-occurrence matrix. However, the co-occurrence matrix represents only relatively local information, i.e., it describes context associations through word pairs' co-occurrence counts without considering a global context perspective. Besides, the co-occurrence matrix is only an estimate derived from a corpus, which is itself only a sample of a language. Many related word pairs may never be observed in the corpus, and the latent relations between unobserved word pairs may not be modeled well due to this missing knowledge.
Few attempts have been made to deal, even indirectly, with unobserved co-occurrences in dense neural word embeddings. SGNS (Mikolov et al., 2013a,b) indirectly addresses this problem through negative sampling. Swivel (Shazeer et al., 2016) improves on GloVe by using a "soft hinge" loss to prevent over-estimation of zero co-occurrences. However, the latent relations between unobserved word pairs are still not explicitly represented. There is also work on semantic composition and distributional inference (Mitchell and Lapata, 2008; Padó, 2008, 2010; Reisinger and Mooney, 2010; Thater et al., 2011; Kartsaklis et al., 2013; Kober et al., 2016) that addresses the sparseness problem, but these methods are not designed for training neural word embeddings.
In this paper, we explore an approach that utilizes context overlap information to uncover more effective co-occurrence relations, and we propose extensions of GloVe and Swivel to validate the positive impact of introducing context overlap.

Quantifying Context Overlap
In this work, we explore quantifying context overlap based on the observation that, to a certain extent, the overlap of Pointwise Mutual Information (PMI) (Church and Hanks, 1990) values reflects context overlap.
As shown in Figure 1, two distinct words may share a particular aspect of interest or be semantically related when the overlap between their PMI distributions over context words is relatively large.
The calculation of complete PMI-weighted context overlap may be time-consuming when the number of words is large. To make the time complexity affordable, only the context words that have a strong lexical association with a target word i are considered:

S_i = {k ∈ V | PMI(i, k) > h_PMI},    (1)

in which V is the vocabulary, h_PMI is a threshold that acts as a magnitude to shift PMI, and S_i denotes the set of context words that have sufficiently large PMI values with the target word i. It is expected that most context information associated with the word i can be captured by its PMI values over S_i. Then, we measure the degree of context overlap (CO) between two target words i, j as a function of their PMI values over the intersection of S_i and S_j, i.e.,

CO(i, j) = Σ_{k ∈ S_i ∩ S_j} f(PMI(i, k)) · f(PMI(j, k)),    (2)

where f is a monotonic mapping function that rectifies the data characteristics for a certain objective function in word embedding training.
Compared to the identity function f(x) = x, we find the exponential function f(x) = exp(x) works much better in our experiments. For the quantified context overlap, the exponential mapping function results in a data distribution similar to that of the co-occurrence counts, i.e., few word pairs have extremely large values while most word pairs' values fall within a relatively small range.
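For illustration, the quantification above can be sketched in a few lines of Python. This is a minimal sketch under one reading of the two equations: we take the overlap as the product of the non-logarithmic PMI values f(PMI) over the shared strong contexts; the function names and toy input are ours.

```python
import math
from collections import defaultdict

def strong_contexts(X, h_pmi=0.0):
    """Keep, for each target word i, the set S_i = {k : PMI(i, k) > h_pmi},
    storing the PMI values themselves for later reuse."""
    total = sum(X.values())
    row, col = defaultdict(float), defaultdict(float)
    for (i, k), c in X.items():
        row[i] += c
        col[k] += c
    S = defaultdict(dict)
    for (i, k), c in X.items():
        pmi = math.log(c * total / (row[i] * col[k]))
        if pmi > h_pmi:
            S[i][k] = pmi
    return S

def context_overlap(S, i, j, f=math.exp):
    """CO(i, j) over the intersection of S_i and S_j; with f = exp this is a
    'non-logarithmic PMI-weighted' overlap."""
    shared = set(S[i]) & set(S[j])
    return sum(f(S[i][k]) * f(S[j][k]) for k in shared)
```

With f = exp, CO(i, j) reduces to a dot product of exponentiated PMI values restricted to the shared strong contexts, which matches the skewed, count-like distribution described above.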

Extend to Existing Models
We consider the original co-occurrence matrix as a description of first order co-occurrence relations, and the quantified context overlap as a description of second order co-occurrence relations (Schütze, 1998), i.e., co-co-occurrences, represented in this work by the non-logarithmic PMI-weighted context overlap. The context overlap between two words can be inferred even when they never co-occur in the corpus. According to our statistics, more than 84% of the word pairs in the second order co-occurrence matrix are not included in the first order co-occurrence matrix. We expect that introducing second order co-occurrence relations may enhance the quality of word embeddings originally trained on first order co-occurrence relations. In this paper, GloVe (Pennington et al., 2014) and Swivel (Shazeer et al., 2016) are extended by joint training with context overlap information.
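To make the first-order/second-order distinction concrete, the following sketch (toy data and names are ours) builds a small second order matrix from precomputed strong-context PMI rows and checks which of its pairs never co-occur directly:

```python
import math

# Toy strong-context PMI rows S_i (assumed precomputed as in Eq. 1).
S = {
    "cat": {"pet": 1.2, "fur": 0.8},
    "dog": {"pet": 1.0, "fur": 0.5},
    "car": {"road": 1.1},
}
# First-order word pairs actually observed in the toy corpus.
observed = {("cat", "pet"), ("dog", "pet"), ("cat", "fur"),
            ("dog", "fur"), ("car", "road")}

words = sorted(S)
X2 = {}
for a in words:
    for b in words:
        if a >= b:
            continue
        shared = set(S[a]) & set(S[b])
        co = sum(math.exp(S[a][k]) * math.exp(S[b][k]) for k in shared)
        if co > 0.0:
            X2[(a, b)] = co

# "cat"/"dog" never co-occur directly, yet they receive a second-order link.
novel = [p for p in X2 if p not in observed and p[::-1] not in observed]
```

In this toy example every nonzero entry of X2 links a pair unseen in the first order matrix, illustrating why the second order matrix can supply many new training signals.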
GloVe The logarithmic co-occurrence matrix is factorized in GloVe with bias terms, and a weighted least squares loss function is optimized:

J_GloVe = Σ_{i,j} λ_ij (w_i^T w̃_j + b_i + b̃_j − log X_ij)²,    (3)

where X_ij denotes the word-context co-occurrence count between a target word i and a context word j. The model parameters to be learned include w_i ∈ R^d, w̃_j ∈ R^d, b_i and b̃_j, which correspond to the target word vector, the context word vector, and the bias terms associated with the target word and the context word, respectively. λ_ij is a weight whose value equals (min(X_ij, x_max)/x_max)^α. To extend GloVe, two tasks are trained in parallel: one is the main task that follows the original GloVe training process as above; the other is an auxiliary task that tunes the word embeddings using context overlap. The word embedding parameters are shared between the two tasks.
Following a GloVe-style loss function, in the auxiliary task the dot products of word vectors are pushed to estimate the logarithmic second order co-occurrence:

J^(2)_GloVe = Σ_{i,j} λ^(2)_ij (A w_i^T w̃_j + b^(2)_i + b̃^(2)_j − log X^(2)_ij)²,    (4)

where the superscripts (2) differentiate these terms from those in the original GloVe. X^(2)_ij = CO(i, j) represents the context overlap, and a word-independent learnable scale A is adopted to relieve the potential inconformity between first order and second order co-occurrences. The weight λ^(2)_ij is analogous to the original λ_ij but uses a different hyperparameter x^(2)_max. The multi-task (Ruder, 2017) loss function is the weighted sum of the two tasks, i.e., J = J_GloVe + β · J^(2)_GloVe, where the weight β is a hyperparameter.
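A per-pair sketch of the joint objective may look as follows (NumPy; a sketch under our assumptions, with the learnable scale A placed on the dot product and x2_ij playing the role of X^(2)_ij = CO(i, j)):

```python
import numpy as np

def glove_pair_loss(w_i, w_j_tilde, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """One term of the original GloVe weighted least-squares objective."""
    lam = (min(x_ij, x_max) / x_max) ** alpha
    err = w_i @ w_j_tilde + b_i + b_j - np.log(x_ij)
    return lam * err ** 2

def joint_pair_loss(w_i, w_j_tilde, b, b2, x_ij, x2_ij, A=1.0, beta=0.2,
                    x_max=100.0, x2_max=10000.0, alpha=0.75):
    """J = J_GloVe + beta * J^(2) for a single pair: the auxiliary task shares
    the word vectors but has its own biases b2 and weight cap x2_max."""
    main = glove_pair_loss(w_i, w_j_tilde, b[0], b[1], x_ij, x_max, alpha)
    lam2 = (min(x2_ij, x2_max) / x2_max) ** alpha
    err2 = A * (w_i @ w_j_tilde) + b2[0] + b2[1] - np.log(x2_ij)
    return main + beta * lam2 * err2 ** 2
```

Because the word vectors appear in both terms, gradients from the auxiliary task flow into the same embeddings that the main task updates, which is the point of the multi-task setup.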
Swivel As pointed out by Levy et al. (2015), if the bias terms in GloVe are fixed to the logarithmic counts of the corresponding words, the dot products of target word vectors and context word vectors become almost equivalent to an approximation of the logarithmic PMI matrix with a shift of log Σ_{i,j} X_ij. The Submatrix-wise Vector Embedding Learner (Swivel) directly reconstructs the PMI matrix by the dot product between target vectors and context vectors, and it deals with unobserved co-occurrences using a "soft hinge" loss function; Shazeer et al. (2016) detail its loss functions and training process. In our extended version, we add a supplementary loss function to handle second order co-occurrences. When the second order co-occurrence X^(2)_ij is greater than zero, the PMI of the context overlap is approximated:

J^(2)_ij = ½ (A w_i^T w̃_j + B − PMI^(2)(i, j))²,    (5)

in which A, B are word-independent learnable scale parameters, and PMI^(2)(i, j) is the Pointwise Mutual Information computed on the second order co-occurrence matrix [X^(2)_ij].
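For observed second order pairs, the supplementary Swivel loss can be sketched as below; the exact placement of the learnable parameters A and B (a scale on the dot product and a shift) is our reading of the text:

```python
import numpy as np

def swivel_second_order_loss(w_i, w_j_tilde, pmi2_ij, A=1.0, B=0.0):
    """Supplementary squared-error term applied when X^(2)_ij > 0:
    push A * (w_i . w~_j) + B toward PMI^(2)(i, j)."""
    err = A * (w_i @ w_j_tilde) + B - pmi2_ij
    return 0.5 * err ** 2
```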

Setup
Corpus The training dataset contains 6 billion tokens collected from diversified corpora, including the News Crawl corpus (Chelba et al., 2013), the April 2010 Wikipedia dump (Shaoul, 2010; Lee and Chen, 2017), and a year-2012 subset of the Reddit comment datasets 1 .
Preprocessing Following Lee and Chen (2017), the Stanford tokenizer is used to process the training corpus, which is split into sentences with all characters converted to lowercase. Punctuation is removed.
Parameter Configuration The vocabularies are limited to the 200K most frequent words. Following Pennington et al. (2014), a decreasing weighting function is adopted to construct the co-occurrence matrix, so that a word pair at distance d contributes 1/d to the count. We use a symmetric context window of five words to the left and five words to the right.
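The co-occurrence construction can be sketched as follows, a minimal version of the GloVe-style scheme in which a pair at distance d contributes 1/d (helper name is ours):

```python
from collections import defaultdict

def weighted_cooccurrences(tokens, window=5):
    """Accumulate symmetric-window co-occurrence counts where a pair of
    words at distance d contributes 1/d, GloVe-style."""
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break
            X[(w, tokens[j])] += 1.0 / d
            X[(tokens[j], w)] += 1.0 / d
    return X
```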
For GloVe, the recommended parameters from Pennington et al. (2014) are used: α = 3/4, x_max = 100, an initial learning rate of 0.05, and 100 iterations. For Swivel, the recommended parameters from Shazeer et al. (2016) are used: the weighting function is 0.1 + 0.25 · x_ij^0.5, and each shard is sampled about 100 times. However, we set the block size to 4000 so that the vocabulary size can be divided exactly.
For the auxiliary tasks, we tune the hyperparameters on the small News Crawl corpus, and we find that, within an appropriate range, the performance is not sensitive to the threshold h_PMI. In this paper, h_PMI, x^(2)_max and β are set to log 100, 10000 and 0.2, respectively. Since there is no difference between target vectors and context vectors (except for random initialization), in order to keep symmetry we approximate context overlap not only between target vectors but also, simultaneously, between context vectors. The final vectors are the sum of w and w̃ in both GloVe and Swivel.

Intrinsic Evaluation
Table 1 shows the evaluation results of the word similarity and word analogy tasks. Word similarity is measured as the Spearman rank correlation ρ between human-judged similarity and the cosine similarity of word vectors. In the word analogy task, the questions are answered over the whole vocabulary through 3CosMul (Levy and Goldberg, 2014a). In addition to GloVe and Swivel, the evaluations of SGNS are also reported for reference. We train SGNS with the word2vec tool, using a symmetric context window of five words to the left and five words to the right, and 5 negative samples.

Table 1: Word similarity and analogy results (ρ × 100 and analogy accuracy). We denote the context overlap enhanced method with "+ CO". 300-dimensional embeddings are used. The datasets include WS353 (Finkelstein et al., 2001), SL999 (Hill et al., 2016), SCWS (Huang et al., 2012), RW (Luong et al., 2013), MEN (Bruni et al., 2014), MT771 (Halawi et al., 2012), and Mikolov's analogy dataset (Mikolov et al., 2013a).
As can be seen from the table, the word embeddings enhanced with context overlap information perform better on most word similarity tasks and achieve higher analogy accuracy on the semantic portion at the cost of the syntactic score. The improved semantic performance, to a certain extent, reflects that second order co-occurrence relations are more semantic in nature.
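The 3CosMul rule used for answering the analogy questions can be sketched as follows, with cosines shifted to [0, 1] and the three query words excluded from the argmax, following Levy and Goldberg (2014a):

```python
import numpy as np

def shifted_cosines(M, v):
    """Cosine similarity of every row of M with v, shifted to [0, 1]."""
    sims = (M @ v) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v))
    return (sims + 1.0) / 2.0

def three_cos_mul(M, a, a_star, b, eps=1e-3):
    """Answer 'a : a* :: b : ?' over all rows of the embedding matrix M."""
    score = (shifted_cosines(M, M[a_star]) * shifted_cosines(M, M[b])
             / (shifted_cosines(M, M[a]) + eps))
    score[[a, a_star, b]] = -np.inf  # exclude the query words themselves
    return int(np.argmax(score))
```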

Text Classification
Text classification tasks are conducted on five shared benchmark datasets from Kim (2014), including the binary classification tasks CR (Hu and Liu, 2004), MR (Pang and Lee, 2005), and Subj (Pang and Lee, 2004), and the multi-class tasks TREC (Li and Roth, 2002) and SST1. Texts are preprocessed following the description in Section 4.1. We train Convolutional Neural Networks (CNNs) on top of our static pretrained word vectors following Kim (2014). To avoid the high risk of a single-run estimate being misleading (Melis et al., 2017; Reimers and Gurevych, 2017), the average classification accuracies of 20 runs are reported as the final scores. The results are shown in Table 2. As can be seen from the results, the enhanced word embeddings outperform the baselines.

Model Analysis
As is well known, word frequency plays an important role in the computation of word embeddings (Gittens et al., 2017). Following the graph in Shazeer et al. (2016), the relation between word analogy accuracy and the log mean frequency of the words in the analogy questions and answers is plotted in Figure 2. The word embeddings trained by GloVe with and without context overlap information are used here. An obvious semantic performance improvement is observed in the low-frequency range. Our observation of second order co-occurrences may explain this fact. We randomly sample 1 million word pairs and rank them in descending order by their quantified context overlap. Over all the word pairs, the average word frequency is 13934.4. However, it is only 1676.1 in the top 0.1% of word pairs, 3984.8 in the top 1%, and 7904.9 in the top 10%. This may be caused by PMI's bias towards infrequent words, but it illustrates that infrequent words carry more information in second order co-occurrence relations.
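The frequency statistic above can be reproduced with a small helper (toy data and name are ours): rank the pairs by overlap value and average the word frequencies within the top fraction.

```python
def mean_freq_of_top(pairs, freqs, top_fraction):
    """Average word frequency among the top `top_fraction` of (i, j, CO)
    triples ranked by their context-overlap value CO."""
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum((freqs[a] + freqs[b]) / 2.0 for a, b, _ in ranked[:k]) / k
```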

Conclusion
In this paper, we propose an empirical metric to enhance word embeddings by estimating second order co-occurrence relations using context overlap. Instead of relying only on local statistical information, context overlap leverages the global association distribution to measure the correlation of word pairs.
The proposed method is easy to apply to existing models, such as GloVe and Swivel, through an auxiliary objective function. The improvements in the experimental results help to validate the positive impact of introducing quantified context overlap.
We have considered the feasibility of enriching SGNS and CBOW with information from context overlap. However, because of their training mode, we cannot remake them in a straightforward way that follows their "original spirit". When training SGNS and CBOW, the program scans the training text; the target and context words are chosen using a sliding window, and negative sampling is used. In this process, no co-occurrence matrix is explicitly computed, so we cannot extend them in a unified form as we extend GloVe and Swivel. Nevertheless, the extensions for GloVe and Swivel can serve as a reference for extending other word embedding approaches that are trained on a co-occurrence matrix. The exploration of second order co-occurrence can be traced back to the 1990s, and we think it is helpful to revive this classical idea in a modern, embedding-driven way. How to integrate second order co-occurrence information into approaches like SGNS and CBOW should be an interesting direction for future work.
As future work, we suggest further investigating the characteristics of context overlap in diversified ways.