Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings

The number of word embedding models is growing every year. Most of them are based on the co-occurrence information of words and their contexts. However, it is still an open question what the best definition of context is. We provide a systematic investigation of 4 different syntactic context types and context representations for learning word embeddings. Comprehensive experiments are conducted to evaluate their effectiveness on 6 extrinsic and intrinsic tasks. We hope that this paper, along with the published code, will be helpful for choosing the best context type and representation for a given task.


Introduction
Recently, there has been growing interest in word embedding models, where words are embedded into low-dimensional (dense) real-valued vectors. The trained word embeddings can be directly used for solving intrinsic tasks like word similarity and word analogy. They are also helpful for solving extrinsic tasks, such as part-of-speech tagging, chunking, named entity recognition (Collobert and Weston, 2008; Collobert et al., 2011) and text classification (Kim, 2014).
The training objectives of word embedding models are based on the Distributional Hypothesis (Harris, 1954), which can be stated as follows: "words that occur in similar contexts tend to have similar meanings". In most word embedding models, the "context" is defined as the words which precede and follow the target word within some fixed distance (Bengio et al., 2003; Mnih and Hinton, 2007; Mikolov et al., 2013a; Pennington et al., 2014). Among them, Global Vectors (GloVe) proposed by Pennington et al. (2014), and Continuous Skip-Gram (CSG) and Continuous Bag-Of-Words (CBOW) proposed by Mikolov et al. (2013b), achieve state-of-the-art results on a range of linguistic tasks and scale to corpora with billions of words.
The traditional sparse vector-space models have explored many different types of context. Curran (2004); Padó and Lapata (2007); Clark (2012) have discussed a set of context definitions beyond simple linear context. For example, a sentence or document could be used as the boundary instead of window size. Contextual words could be associated with their relative sides (left/right) or positions (+1/-2) to the target word. They could also be associated with part-of-speech or grammatical relation labels. The weight of each contextual word can be explicitly defined. Moreover, words that are connected to the target word in the dependency parse tree can be considered as context. Recent word embedding models have also explored some of the above context types. Levy and Goldberg (2014b); Ling et al. (2015) improve CSG and CBOW by introducing position-aware context representation. Levy and Goldberg (2014a) propose dependency-based context (DEPS) for CSG.

Table 1: Summary of prior research on word embedding models with different syntactic context types and context representations. For linear context, bound indicates words associated with positional information. For DEPS context, bound indicates words associated with dependency relation.
Model | Representation | Linear context | DEPS context
generalized Skip-Gram | unbound | (Mikolov et al., 2013a) | this work
generalized Skip-Gram | bound | Structured SG (Ling et al., 2015); POSIT (Levy and Goldberg, 2014b) | DEPS (Levy and Goldberg, 2014a)
generalized Bag-Of-Words | unbound | CBOW (Mikolov et al., 2013a) | this work
generalized Bag-Of-Words | bound | CWINDOW (Ling et al., 2015) | this work
original GloVe | unbound | GloVe (Pennington et al., 2014) | this work
original GloVe | bound | this work | this work
However, different types of syntactic context have not been systematically compared for different word embeddings. This paper explores two context types (linear or DEPS) and two context representations (bound or unbound), as shown in Table 1. Three popular word embedding models (CBOW, GloVe, and CSG) are compared on word similarity, word analogy, part-of-speech tagging, chunking, named entity recognition, and text classification tasks.

Related Work
Several studies directly compare different word embedding models. Lai et al. (2016) compare 6 word embedding models using different corpora and hyper-parameters. Nayak and Manning (2016) provide a set of evaluations, along with an online tool, for word embedding models. Levy and Goldberg (2014c) show the theoretical equivalence of CSG and PPMI matrix factorization. Levy et al. (2015) further discuss the connections between 4 word embedding models (PPMI, SVD, CSG, GloVe) and re-evaluate them with the same hyper-parameters. Suzuki and Nagata (2015) investigate different configurations of CSG and GloVe and merge them together. Yin and Schutze (2016) propose 4 ensemble methods and show their effectiveness over individual ones.
There is also research evaluating different context types in learning word embeddings. Heylen et al. (2008) compare dependency-based and linear vector space models for finding semantically related nouns in Dutch. Vulic and Korhonen (2016) compare CSG and dependency-based models on various languages. Their results suggest that dependency-based models are better at detecting functional similarity in English, although that does not necessarily hold for other languages. Bansal et al. (2014) show that DEPS context is preferable to linear context on a parsing task. Melamud et al. (2016) investigate the performance of CSG, DEPS and a substitute-based word embedding model (Yatbaz et al., 2012), and show that different types of intrinsic tasks have a clear preference for particular types of contexts. On the other hand, for extrinsic tasks, the optimal context types need to be carefully tuned on the specific dataset.
The contribution of this study is that, in addition to linear and dependency-based context, we also consider bound and unbound context representations, as will be described below. Furthermore, we systematically evaluate three word embedding models: CSG, CBOW and GloVe.
Figure 1: Example of dependency-based context extraction for the sentence "Australian scientist discovers star with telescope". Each context is a word paired with its dependency label lbl (e.g. nsubj, dobj, prep_with, amod), with lbl^-1 marking the inverse relation. Relations that include a preposition are collapsed: the head is connected directly to the object of the preposition, and the preposition is subsumed into the label.

Word Embeddings Models
In this section, we first introduce different contexts in detail, and discuss their strengths and weaknesses. We then show how CSG, CBOW and GloVe can be generalized to use these contexts.

Context Types
There are many different types of context, both on document and sentence level. For syntactic contexts, the current literature discusses mainly the linear (used in most word embedding models) and dependency-based contexts (DEPS (Levy and Goldberg, 2014a)). Linear context is defined as the positional neighbors of the target word in texts. DEPS context is defined as the syntactic neighbors of the target word based on the dependency parse tree, as shown in Figure 1.
Compared to the linear context, DEPS context can capture more relevant words that are further away from the target word in the text. For example, in Figure 1, linear context does not include the word-context pair "discovers telescope", while DEPS context contains this information. DEPS context can also exclude some uninformative word-context pairs like "with star" and "telescope with".
Note that dependency parsing is time-consuming. Despite its parallelizability, our implementation still takes nearly a month to finish dependency parsing for the Wikipedia corpus on a 32-core machine. It is only fair to compare linear and DEPS context if we ignore this time cost. It is also worth noting that part-of-speech labels are required when performing dependency parsing.

Context Representations
In the original CSG, CBOW and GloVe models, contexts are represented by words without any additional information. Ling et al. (2015) improve CSG and CBOW by introducing position-bound words, where each contextual word is associated with its relative position to the target word. This allows CSG and CBOW to distinguish different sequential positions and capture the structural information from the context. We refer to methods that bind positional information with the contextual word as bound (context) representation, as opposed to unbound (context) representation, where contextual words are treated the same irrespective of their positions with regard to the target word. The original DEPS uses "bound" representation by default: each word is associated with its dependency relation to the target word. In this paper, we also investigate the simpler context representation where no dependency relation is associated with a word. This enables a fair comparison with conventional models like CSG, CBOW and GloVe, since they do not use bound representation either. An example of different syntactic context types and context representations is shown in Table 2.
Table 2: Example of different syntactic context types and context representations. This example is based on Figure 1, and the target word is "discovers".
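The four combinations of context type and representation can be sketched for the example sentence as follows. This is a minimal illustration, not the word2vecPM implementation: the function names, the `word/label` string format, the `-1` suffix for inverse relations, and the hand-written collapsed dependency edges are all assumptions made for this sketch.

```python
# Sentence from Figure 1, lowercased.
SENTENCE = ["australian", "scientist", "discovers", "star", "with", "telescope"]

# Hand-written (head, modifier, label) edges; an assumed collapsed parse,
# not the output of an actual parser.
DEP_EDGES = [
    ("scientist", "australian", "amod"),
    ("discovers", "scientist", "nsubj"),
    ("discovers", "star", "dobj"),
    ("discovers", "telescope", "prep_with"),
]


def linear_contexts(tokens, target_index, window=2, bound=False):
    """Positional neighbors of the target; bound adds the relative position."""
    contexts = []
    for offset in range(-window, window + 1):
        j = target_index + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        contexts.append(f"{tokens[j]}/{offset:+d}" if bound else tokens[j])
    return contexts


def deps_contexts(edges, target, bound=False):
    """Syntactic neighbors of the target; bound adds the dependency label."""
    contexts = []
    for head, modifier, label in edges:
        if head == target:
            contexts.append(f"{modifier}/{label}" if bound else modifier)
        elif modifier == target:
            # "-1" marks the inverse relation (hypothetical notation).
            contexts.append(f"{head}/{label}-1" if bound else head)
    return contexts
```

With a window of 2, the linear context of "discovers" contains "with" but not "telescope", whereas the DEPS context contains "telescope" but not "with", which matches the contrast drawn in the Context Types subsection.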
Intuitively, bound representation should work better than unbound representation, since it uses information about relative word positions. However, this is not always the case in practice. An obvious drawback is that bound representation is sparser than unbound representation, especially for the DEPS context type. In our data, there were 47 dependency relations in the dependency parse trees. Although not every combination of dependency relation and word appears in the word-context pair collection, in practice it still enlarges the contextual words' vocabulary about 5 times.
Both syntactic context types (linear and DEPS) and the choice of context representations (bound and unbound) have a dramatic effect on the word embeddings. Bound linear representation transfers each contextual word into a new one, and the word-context pairs are changed completely. DEPS, as compared to the linear contexts, increases the likelihood that the contextual words are in a meaningful relation with the target word, although some words captured by DEPS would also be found in the linear contexts if the window is wide enough. For example, in Table 2, "scientist" and "star" are considered as the contextual words of "discovers" in both linear and DEPS context types.

Table 3: Illustration of collections P, M and M̄ (the collection of (w, c, #(w, c)) triples) for the sentence "australian scientist discovers star with telescope". Unbound representation is used in this example. Words in the collections are in bold.
Entries of M̄: (australian, scientist, 1), (scientist, australian, 1), (scientist, discovers, 1), (discovers, scientist, 1), (discovers, star, 1), (discovers, telescope, 1), ...

Generalization
Let P be a collection of word-context pairs. P can be merged based on the words to form a collection M of size |V|, where V is the vocabulary. Each element (w, c_1, c_2, ..., c_{n_w}) ∈ M is a word w and its contexts, where n_w is the number of word w's contexts. P can also be merged based on both words and contexts to form a collection M̄. Each element (w, c, #(w, c)) ∈ M̄ is the word w, the context c, and the number of times they appear together in collection P. An example of these collections is shown in Table 3.
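The two merge operations can be sketched in a few lines. This is an illustrative sketch only: `build_collections` and the variable name `M_bar` (for the collection of (w, c, #(w, c)) triples) are names chosen here, and the pair list is the unbound DEPS example from Table 3 written out by hand.

```python
from collections import Counter, defaultdict


def build_collections(pairs):
    """Build P, M and M_bar from an iterable of (word, context) pairs."""
    P = list(pairs)

    # M: merge P by word, giving one entry (w, c_1, ..., c_{n_w}) per word.
    M = defaultdict(list)
    for word, context in P:
        M[word].append(context)

    # M_bar: merge P by (word, context), giving the count #(w, c).
    M_bar = Counter(P)
    return P, dict(M), M_bar


# Unbound DEPS pairs for "australian scientist discovers star with telescope".
EXAMPLE_PAIRS = [
    ("australian", "scientist"),
    ("scientist", "australian"),
    ("scientist", "discovers"),
    ("discovers", "scientist"),
    ("discovers", "star"),
    ("discovers", "telescope"),
    ("star", "discovers"),
    ("telescope", "discovers"),
]

P, M, M_bar = build_collections(EXAMPLE_PAIRS)
```

Here `M["discovers"]` holds the three contexts of "discovers", and every count in `M_bar` is 1 because no pair repeats within a single sentence.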

Generalized Bag-Of-Words
The objective function of Generalized Bag-Of-Words (GBOW) is defined over the collection M, as the sum of the log probabilities of each word given its contexts. With the negative sampling technique, the log probability is calculated using the following quantities: σ is the sigmoid function, K is the negative sampling size, and w and c are the vectors for word w and context c respectively. The negatively sampled word w_{N_k} is randomly selected on the basis of its smoothed unigram distribution (#(w) / Σ_{w′} #(w′))^ds, where #(w) is the number of times that word w appears in the corpus, and ds is the distribution smoothing hyper-parameter, which is usually set to 0.75.
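The smoothed unigram distribution used for drawing negative samples can be sketched as below. The helper names are hypothetical, and normalizing the smoothed weights so that they sum to one is an assumption of this sketch; the text only states that sampling is based on the smoothed value.

```python
import random


def smoothed_unigram(counts, ds=0.75):
    """Return sampling probabilities proportional to (#(w) / total) ** ds."""
    total = sum(counts.values())
    weights = {w: (n / total) ** ds for w, n in counts.items()}
    norm = sum(weights.values())  # normalization is an assumption
    return {w: weight / norm for w, weight in weights.items()}


def sample_negatives(probabilities, k, rng=random):
    """Draw k negative words (with replacement) from the distribution."""
    words = list(probabilities)
    return rng.choices(words, weights=[probabilities[w] for w in words], k=k)
```

Smoothing with ds < 1 flattens the distribution: a word that is 16 times more frequent than another is sampled only 16^0.75 = 8 times as often.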
Note that with the negative sampling technique, both GBOW and the original CBOW (Mikolov et al., 2013a) learn two sets of embeddings (word embeddings and context embeddings). In the original CBOW, the context embeddings can also be considered as word embeddings, since the vocabularies of words and contexts are the same. However, for bound context, the words (e.g. scientist) and contexts (e.g. scientist/nsubj) are quite different. It is therefore necessary to distinguish the conditioned and conditioning variables. For example, in Figure 1, the context "scientist/nsubj" can only be predicted by the word "discovers". However, most words are connected to several contextual words. Due to this, the sum of the contextual word embeddings should be used for predicting the target word.
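For orientation, one form of the GBOW objective that is consistent with the definitions in this subsection (the collection M, the sigmoid σ, the negative sampling size K, the negatively sampled words w_{N_k}, and the sum of contextual embeddings) is sketched below. It is a reconstruction under those assumptions and may differ in detail from the formulation used in the experiments.

```latex
\begin{align}
\mathcal{L}_{\mathrm{GBOW}} &= \sum_{(w,\, c_1, \ldots, c_{n_w}) \in M}
  \log p\bigl(w \mid c_1, \ldots, c_{n_w}\bigr), \\
\log p\bigl(w \mid c_1, \ldots, c_{n_w}\bigr) &=
  \log \sigma\Bigl(\vec{w} \cdot \sum_{i=1}^{n_w} \vec{c}_i\Bigr)
  + \sum_{k=1}^{K} \log \sigma\Bigl(-\vec{w}_{N_k} \cdot \sum_{i=1}^{n_w} \vec{c}_i\Bigr).
\end{align}
```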

Generalized Skip-Gram
For generalized Skip-Gram (GSG), the definition is more straightforward and the objective function actually needs no specification (Levy and Goldberg, 2014b). Nonetheless, in order to make it consistent with our GBOW, we also specify the conditioned and conditioning variables in the objective function, which is defined over the collection P. Note that this generalization does not change the nature of the models for linear context. In our pilot experiments on word analogy and word similarity, the performance of both GSG and GBOW is almost identical to that of their original versions.

GloVe
Unlike GSG and GBOW, GloVe explicitly optimizes a log-bilinear regression model based on the word co-occurrence matrix. Since GloVe is already a very general model, its final objective function can be written directly over the previously defined collection M̄ of (w, c, #(w, c)) triples. In this objective, b_w and b_c are biases for the word and the context, and f is a non-decreasing weighting function that ensures that a large #(w, c) is not over-weighted. Note that the inputs of GSG, GBOW and GloVe are the collections P, M and M̄ respectively. Once the corpus and hyper-parameters are fixed, these collections (and thus the learned word embeddings) are determined only by the choice of context types and representations.
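Written over the collection of (w, c, #(w, c)) triples, denoted \overline{M} here, the standard GloVe weighted least-squares objective of Pennington et al. (2014) takes the following form. The notation is adapted to the symbols of this section. The specific weighting function shown is the one from the original GloVe paper (with x_max = 100 and α = 0.75 as its usual defaults); that the experiments here use exactly these values is an assumption.

```latex
\begin{align}
\mathcal{L}_{\mathrm{GloVe}} &= \sum_{(w,\, c,\, \#(w,c)) \in \overline{M}}
  f\bigl(\#(w,c)\bigr)\,
  \bigl(\vec{w} \cdot \vec{c} + b_w + b_c - \log \#(w,c)\bigr)^2, \\
f(x) &= \begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
\end{align}
```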

Experiments
We evaluate the effectiveness of different syntactic context types and context representations on word similarity, word analogy, part-of-speech tagging, chunking, named entity recognition, and text classification tasks. In this section we describe our models, and then report and discuss the experimental results on each task.

Word Embeddings
Previously, the word2vecf toolkit (https://bitbucket.org/yoavgo/word2vecf) (Levy et al., 2015) extended the word2vec toolkit (http://code.google.com/p/word2vec/) (Mikolov et al., 2013b) to accept the collection P as input rather than a raw corpus. This makes the CSG model accept arbitrary contexts (e.g. DEPS context). However, CBOW and GloVe are not considered in that toolkit. We implement the word2vecPM toolkit, a further extension of word2vecf, which supports generalized CSG, CBOW and GloVe with the collections P, M and M̄ as input respectively. For fair comparison, as suggested by Levy et al. (2015), we use the same hyper-parameters (see footnote 7) for all embedding models. English Wikipedia (August 2013 dump) is used as the training corpus. The Stanford CoreNLP is used for dependency parsing. After parsing, tokens are converted to lowercase. Words and contexts that appear fewer than 100 times in the collection P are ignored.
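The frequency cut-off on the collection P can be sketched as below. This is only an illustration of the filtering rule stated above: `filter_pairs` is a hypothetical helper, and counting words and contexts separately over all pairs in P is an assumption about how the threshold is applied.

```python
from collections import Counter


def filter_pairs(pairs, min_count=100):
    """Drop pairs whose word or context occurs fewer than min_count times in P."""
    pairs = list(pairs)
    word_counts = Counter(word for word, _ in pairs)
    context_counts = Counter(context for _, context in pairs)
    return [
        (word, context)
        for word, context in pairs
        if word_counts[word] >= min_count and context_counts[context] >= min_count
    ]
```

For bound representations the context vocabulary is several times larger, so the same threshold removes a larger share of rare contexts than it does for unbound ones.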

Word Similarity Task
The word similarity task aims at producing semantic similarity scores of word pairs, which are compared with the human scores using Spearman's correlation. The cosine similarity between two word vectors is used as the similarity score. We use the WordSim353 (Finkelstein et al., 2001) dataset, divided into similarity and relatedness categories (Zesch et al., 2008; Agirre et al., 2009). Previous research (Levy and Goldberg, 2014a; Melamud et al., 2016) concluded that, compared to linear context, DEPS context can capture more functional similarity (e.g. tiger/cat) rather than topical similarity (relatedness) (e.g. tiger/jungle). However, their experiments do not distinguish the effect of different context representations: unbound representation is used for linear context (Mikolov et al., 2013b), while bound representation is used for dependency-based context (Levy and Goldberg, 2014a). Moreover, only the CSG model is considered.
Footnote 7: Negative sampling size is set to 5 for SG and 2 for CBOW. Distribution smoothing is set to 0.75. No dynamic context or "dirty" sub-sampling is used. The window size is fixed to 2. The number of iterations is set to 2, 5 and 30 for SG, CBOW and GloVe respectively.
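The evaluation procedure for this task (cosine scores compared with human scores via Spearman's correlation) can be sketched with the standard library alone. The function names are chosen for this sketch, skipping out-of-vocabulary pairs is an assumption, and any vectors or scores passed in are expected to come from the trained model and the dataset respectively.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(vectors, scored_pairs):
    """Correlate model cosine scores with human scores; OOV pairs are skipped."""
    model_scores, human_scores = [], []
    for word1, word2, human in scored_pairs:
        if word1 in vectors and word2 in vectors:
            model_scores.append(cosine(vectors[word1], vectors[word2]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)
```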
We revisit those claims with more systematic experiments. As shown in the top-left sub-figure of Figure 2, DEPS does outperform the linear context in GSG and GloVe in the similarity section of WordSim353, confirming its ability to capture functional similarity. However, the advantage of DEPS does not fully transfer to GBOW. Although bound DEPS context for GBOW is still the best performer, unbound DEPS context performs the worst, which shows the importance of bound vs unbound representation.
Note that the results are also reversed on the WordSim353 relatedness section (the right sub-figure of Figure 2), which shows that linear context is more suitable for capturing topical similarity.
Overall, DEPS context type does not get all the credit for capturing functional similarity. Context representations play an important role in the word similarity task. It is only safe to say that DEPS context captures functional similarity with the "help" of bound representation. In contrast, linear context type captures topical similarity with the "help" of unbound representation. However, the above findings come with a major caveat: a lot seems to depend on the particular dataset, in addition to the model and context type. We experimented with the MEN dataset (Bruni et al., 2012), the Mechanical Turk dataset (Radinsky et al., 2011), the Rare Words dataset (Luong et al., 2013), and the SimLex-999 dataset (Hill et al., 2016) (Table 4), and we were not able to observe uniform trends even for datasets that are supposed to capture the same relation, such as the similarity part of WordSim353, Rare Words and SimLex.
Still, some models do favor a certain context type for both similarity and relatedness: e.g. GBOW favors linear unbound contexts, while GloVe in most cases prefers DEPS over the linear context. In the case of GSG, however, context type needs to be optimized for the particular dataset.

Word Analogy Task
The word analogy task aims at answering questions like "a is to a' as b is to ?", such as "London is to Britain as Tokyo is to Japan". We follow the evaluation protocol in Levy and Goldberg (2014b), which answers the questions using the LRCos method. LRCos shows significant improvement over the traditional vector offset method. We use the BATS analogy dataset in our experiments.
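For reference, the traditional vector offset method mentioned above can be sketched as follows. Note that this is not LRCos, which is the method actually used in our evaluation; it is the simpler baseline that picks the word whose vector is closest to a' − a + b. The function names and the toy two-dimensional vectors are invented for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vector_offset_answer(vectors, a, a_prime, b):
    """Answer "a is to a_prime as b is to ?" with the vector offset method."""
    target = [
        ap - x + y
        for x, ap, y in zip(vectors[a], vectors[a_prime], vectors[b])
    ]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, a_prime, b)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, made up so that capital -> country is a constant offset.
TOY_VECTORS = {
    "london": [1.0, 0.0], "britain": [1.0, 1.0],
    "tokyo": [3.0, 0.0], "japan": [3.0, 1.0],
    "paris": [2.0, 0.0], "france": [2.0, 1.0],
}
```

With these toy vectors, `vector_offset_answer(TOY_VECTORS, "london", "britain", "tokyo")` selects "japan", since the target vector coincides with it exactly.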
As shown in Figure 3, context representation plays an important role in the word analogy task. The choice of context representation (bound or unbound) actually has a much larger impact than the choice of context type (linear or DEPS). The results on the Encyclopedia category are perhaps the most evident. The performance of unbound linear context and unbound DEPS context is similar. However, for most models and categories, unbound representation seems to outperform bound representation. When bound representation is used, the accuracy drops by around 5-15 percent for DEPS context. This is consistent with the findings of Levy and Goldberg (2014a), who report that DEPS context did not work well for the analogy task.
As shown in Table 5, we have also experimented on two much smaller datasets: MSR analogy dataset (Mikolov et al., 2013c), and Google analogy dataset (Mikolov et al., 2013a) (with semantic and syntactic questions). They also show that the choice of context representation has more impact than the choice of context type.

POS, Chunking and NER Tasks
Although intrinsic evaluations like word similarity and word analogy tasks could provide direct insights about different context types and representations, they have certain methodological problems, and the experimental results above cannot be directly translated to the typical uses of word embeddings in downstream tasks (Schnabel et al., 2015; Linzen, 2016; Chiu et al., 2016). Thus extrinsic tasks should also be considered.
In this subsection, we evaluate the effectiveness of different word embedding models with different contexts on Part-of-Speech Tagging (POS), Chunking and Named Entity Recognition (NER) tasks. For these tasks, an NLP system assigns labels to elements of texts. Note that in practice, one should NOT use DEPS context for POS-tagging and chunking tasks, since their labels are used in parsing the source corpus.
Following the evaluation protocol used in Kiros et al. (2015), we restrict the predicting model to a Logistic Regression Classifier (the scikit-learn implementation, http://scikit-learn.org/, is used). The classifier's input for predicting the label of word w_i is simply the concatenation of the word vectors w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}. This ensures that the quality of embedding models is directly evaluated, and their strengths and weaknesses are easily observed.
As shown in Figure 4 and Table 6, GSG, GBOW and GloVe exhibit overall similar trends. When the same context type is used, bound representation outperforms unbound representation on all tasks. Sequence labeling tasks are sensitive to syntax, so the ignorance of syntax in unbound representation becomes harmful.
Moreover, DEPS context type works slightly better than linear context type in most cases. These results suggest that unbound linear context (as in traditional CSG and CBOW) may not be the best choice of input word vectors for sequence labeling. Bound representations should always be used and DEPS context type is also worth considering. Again, similar to the word analogy task, GloVe is more sensitive to different context representations than Skip-Gram and CBOW.
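The classifier input described in this subsection, a concatenation of the five word vectors around position i, can be sketched as follows. The function name is hypothetical, and using zero vectors for positions outside the sentence and for out-of-vocabulary words is an assumption of this sketch, since the padding scheme is not specified above.

```python
def window_features(tokens, index, vectors, dim, window=2):
    """Concatenate the vectors of tokens[index-window .. index+window]."""
    features = []
    for offset in range(-window, window + 1):
        j = index + offset
        if 0 <= j < len(tokens) and tokens[j] in vectors:
            features.extend(vectors[tokens[j]])
        else:
            # Assumed padding: zeros for sentence boundaries and unknown words.
            features.extend([0.0] * dim)
    return features
```

Each token thus yields a feature vector of length 5 × dim, which can be passed to any off-the-shelf linear classifier.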

Text Classification Task
Finally, we evaluate the effectiveness of different word embedding models with different syntactic contexts on the text classification task. Text classification is one of the most popular and well-studied tasks in natural language processing. Recently, deep neural networks have achieved state-of-the-art results on this task (Kim, 2014; Dai and Le, 2015). They often need pre-trained word embeddings as inputs to improve their performance. Similarly to the previous evaluation of sequence labeling tasks, instead of building complex deep neural networks, we use a simpler classification method called Neural Bag-of-Words (Li et al., 2017) to directly evaluate the word embeddings: texts are first represented by the sum of their word vectors, then a Logistic Regression Classifier (the same as that in the previous subsection) is trained on these representations.
Different word embedding models are evaluated on 5 text classification datasets. The first 3 datasets are sentence-level: short movie review sentiment (MR) (Pang and Lee, 2005), customer product reviews (CR) (Nakagawa et al., 2010), and subjectivity/objectivity classification (SUBJ) (Pang and Lee, 2004). The other 2 datasets are document-level with multiple sentences: full-length movie review (RT-2k) (Pang and Lee, 2004), and IMDB movie review (IMDB) (Maas et al., 2011) (see footnote 11).
As shown in Table 7, pre-trained word embeddings outperform random word embeddings by a large margin. This strengthens the previous claim that pre-trained word embeddings are highly useful for text classification (Iyyer et al., 2015; Li et al., 2017). Unlike in the other tasks, in text classification all models exhibit similar performance. Text classification has less focus on syntax and functional similarity. Because of that, models with bound representation perform worse than those with unbound representation on almost all datasets except CR. Models with DEPS context type and linear context type are comparable. These observations suggest that the simple unbound linear context type (as in traditional CSG and CBOW) is still the best choice for pre-training word embeddings for text classification, and it is already used in most studies.
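The Neural Bag-of-Words text representation used in this subsection, the sum of a text's word vectors, can be sketched as follows. The function name is hypothetical, and skipping out-of-vocabulary tokens is an assumption of this sketch.

```python
def text_vector(tokens, vectors, dim):
    """Represent a text as the element-wise sum of its word vectors."""
    total = [0.0] * dim
    for token in tokens:
        vector = vectors.get(token)
        if vector is None:
            continue  # assumed handling: unknown words contribute nothing
        for i, value in enumerate(vector):
            total[i] += value
    return total
```

Because the sum discards word order, this representation cannot exploit positional information, which is in line with the weaker results of bound representation on this task.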

Conclusion
This paper provides the first systematic investigation of different syntactic context types (linear vs dependency-based) and different context representations (bound vs unbound) for learning word embeddings. We evaluate GSG, GBOW and GloVe models on intrinsic property analysis tasks (word similarity and word analogy), sequence labeling tasks (POS, Chunking and NER) and the text classification task.
We find that most tasks have a clear preference for different context types and representations. Context representation plays a more important role than context type for learning word embeddings. Only with the "help" of bound representation does DEPS context capture functional similarity. Word analogies seem to prefer unbound representation, although performance varies by question type. No matter which syntactic context type is used, bound representation is essential for sequence labeling tasks, which benefit from its ability to capture functional similarity. GSG with unbound linear context is still the best choice for the text classification task. Linear context is sufficient for capturing topical similarity compared to the more labor-intensive DEPS context. Words' position information is generally useless for text classification, which makes bound representation contribute less to this task.
In the spirit of transparent and reproducible experiments, the word2vecPM toolkit is published along with this paper. We hope researchers will take advantage of the code for further improvements and applications to other tasks.
Footnote 11: Please see Wang and Manning (2012) for a more detailed introduction to these datasets and their pre-processing.