Modeling Context Words as Regions: An Ordinal Regression Approach to Word Embedding

Vector representations of word meaning have found many applications in the field of natural language processing. Word vectors intuitively represent the average context in which a given word tends to occur, but they cannot explicitly model the diversity of these contexts. Although region representations of word meaning offer a natural alternative to word vectors, only a few methods have been proposed that can effectively learn word regions. In this paper, we propose a new word embedding model which is based on a max-margin formulation of ordinal regression. We show that the underlying ranking interpretation of word contexts is sufficient to match, and sometimes outperform, the performance of popular methods such as Skip-gram. Furthermore, we show that by using a quadratic kernel, we can effectively learn word regions, which outperform existing unsupervised models for the task of hypernym detection.


Introduction
Word embedding models such as Skip-gram (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014) represent words as vectors of typically around 300 dimensions. The relatively low-dimensional nature of these word vectors makes them ideally suited for representing textual input to neural network models (Goldberg, 2016; Nayak, 2015). Moreover, word embeddings have been found to capture many interesting regularities (Mikolov et al., 2013b; Kim and de Marneffe, 2013; Gupta et al., 2015; Rothe and Schütze, 2016), which makes it possible to use them as a source of semantic and linguistic knowledge, and to align word embeddings with visual features (Frome et al., 2013) or across different languages (Zou et al., 2013; Faruqui and Dyer, 2014).
Notwithstanding the practical advantages of representing words as vectors, a few authors have advocated the idea that words may be better represented as regions (Erk, 2009), possibly with gradual boundaries (Vilnis and McCallum, 2015). One important advantage of region representations is that they can distinguish words with a broad meaning from those with a more narrow meaning, and should thus in principle be better suited for tasks such as hypernym detection and taxonomy learning. However, it is currently not well understood how such region based representations can best be learned. One possible approach, suggested in (Vilnis and McCallum, 2015), is to learn a multivariate Gaussian for each word, essentially by requiring that words which frequently occur together are represented by similar Gaussians. However, for large vocabularies, this is computationally only feasible with diagonal covariance matrices.
In this paper, we propose a different approach to learning region representations for words, which is inspired by a geometric view of the Skip-gram model. Essentially, Skip-gram learns two vectors p_w and p̃_w for each word w, such that the probability that a word c appears in the context of a target word t can be expressed as a function of p_t · p̃_c (see Section 2). This means that for each threshold λ ∈ [−1, 1] and context word c, there is a hyperplane H_c^λ which (approximately) separates the words t for which p_t · p̃_c ≥ λ from the others. Note that this hyperplane is completely determined by the vector p̃_c and the choice of λ. An illustration of this geometric view is shown in Figure 1(a), where e.g. the word c is strongly related to a (i.e. a has a high probability of occurring in the context of c) but not closely related to b. Note in particular that there is a half-space containing those words which are strongly related to a (w.r.t. a given threshold λ).
Our contribution is twofold. First, we empirically show that effective word embeddings can be learned from purely ordinal information, which stands in contrast to the probabilistic view taken by e.g. Skip-gram and GloVe. Specifically, we propose a new word embedding model which uses (a ranking equivalent of) max-margin constraints to impose the requirement that p_t · p̃_c should be a monotonic function of the probability P(c|t) of seeing c in the context of t. Geometrically, this means that, like Skip-gram, our model associates with each context word a number of parallel hyperplanes. However, unlike in the Skip-gram model, only the relative position of these hyperplanes is imposed (i.e. if λ_1 < λ_2 < λ_3 then H_c^{λ_2} should lie between H_c^{λ_1} and H_c^{λ_3}). Second, by using a quadratic kernel for the max-margin constraints, we obtain a model that can represent context words as a set of nested ellipsoids, as illustrated in Figure 1(b). From these nested ellipsoids we can then estimate a Gaussian which acts as a convenient region based word representation. Note that our model thus jointly learns a vector representation for each word (i.e. the target word representations) as well as a region based representation (i.e. the nested ellipsoids representing the context words). We present experimental results which show that the region based representations are effective for measuring synonymy and hypernymy. Moreover, perhaps surprisingly, the region based modeling of context words also benefits the target word vectors, which match, and in some cases outperform, the vectors obtained by standard word embedding models on various benchmark evaluation tasks.

Word Embedding
Various methods have already been proposed for learning vector space representations of words, e.g. based on matrix factorization (Turney and Pantel, 2010) or neural networks. Here we briefly review Skip-gram and GloVe, two popular models which share some similarities with our model.
The basic assumption of Skip-gram (Mikolov et al., 2013b) is that the probability P(c|t) of seeing word c in the context of word t is given as:

P(c|t) = exp(p_t · p̃_c) / Σ_{w ∈ W} exp(p_t · p̃_w)

In principle, based on this view, the target vectors p_w and context vectors p̃_w could be learned by maximizing the likelihood of a given corpus. Since this is computationally not feasible, however, it was proposed in (Mikolov et al., 2013b) to instead optimize the following negative-sampling objective:

Σ_{i=1}^{N} ( Σ_{c ∈ C_i} log σ(p_{w_i} · p̃_c) + Σ_{c ∈ N_i} log σ(−p_{w_i} · p̃_c) )

where σ is the logistic function, the left-most summation is over all N word occurrences in the corpus, w_i is the i-th word in the corpus, C_i are the words appearing in the context of w_i and N_i consists of k · |C_i| randomly chosen words, called the negative samples for w_i. The context C_i contains the t_i words immediately preceding and succeeding w_i, where t_i is randomly sampled from {1, ..., t_max} for each i. The probability of choosing word w as a negative sample is proportional to occ(w)^{3/4}, with occ(w) the number of occurrences of word w in the corpus. Finally, to reduce the impact of frequent words, some word occurrences are removed from the corpus before applying the model, with the probability of removing an occurrence of word w being 1 − √(θ / f(w)), where f(w) = occ(w)/N is the relative frequency of w. Default parameter values are t_max = 5 and θ = 10^{−5}.
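The two sampling heuristics above can be sketched as follows. The 1 − √(θ/f(w)) subsampling form and the occ(w)^0.75 negative-sampling exponent are the standard word2vec choices; the corpus counts are made up:

```python
import math

# Hypothetical corpus counts; occ(w) = number of occurrences of w.
occ = {"the": 50000, "cat": 120, "sat": 80}
N = sum(occ.values())
theta = 1e-5  # subsampling threshold, as in the text

def p_discard(w):
    """Probability of removing an occurrence of w before training
    (word2vec subsampling; f(w) is the relative frequency of w)."""
    f = occ[w] / N
    return max(0.0, 1.0 - math.sqrt(theta / f))

def neg_sampling_weights():
    """Unnormalised negative-sampling weights, proportional to occ(w)^0.75."""
    return {w: c ** 0.75 for w, c in occ.items()}
```

Frequent words such as "the" are discarded far more often than rare ones, while the 3/4 exponent flattens the negative-sampling distribution relative to raw frequencies.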
GloVe is another popular model for word embedding (Pennington et al., 2014). Rather than explicitly considering all word occurrences, it directly uses a global co-occurrence matrix X = (x_ij), where x_ij is the number of times the word w_j appears in the context of w_i. Like Skip-gram, it learns both a target vector p_w and a context vector p̃_w for each word w, but it instead learns these vectors by optimizing the following objective:

Σ_{i,j : x_ij > 0} f(x_ij) (p_{w_i} · p̃_{w_j} + b_{w_i} + b̃_{w_j} − log x_ij)²

where b_{w_i} and b̃_{w_j} are bias terms, and f is a weighting function to reduce the impact of very rare terms, defined as:

f(x) = (x / x_max)^α if x < x_max, and f(x) = 1 otherwise

The default values are x_max = 100 and α = 0.75.
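The weighting function f can be transcribed directly, using the defaults stated above:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): down-weights rare co-occurrences
    and caps the influence of very frequent ones at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0
```

The function rises monotonically from 0 and saturates at 1 once a co-occurrence count reaches x_max.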

Region Representations
The idea of representing words as regions was advocated in (Erk, 2009), as a way of modeling the diversity of the contexts in which a word appears. It was argued that such regions could be used to more accurately model the meaning of polysemous words and to model lexical entailment. Rather than learning region representations directly, it was proposed to use a vector space representation of word occurrences. Two alternatives were investigated for estimating a region from these occurrence vectors, respectively inspired by prototype and exemplar based models of categorization. The first approach defines the region as the set of points whose weighted distance to a prototype vector for the word is within a given radius, while the second approach relies on the k-nearest neighbor principle. In contrast, (Vilnis and McCallum, 2015) proposed a method that directly learns a representation in which each word corresponds to a Gaussian. The model uses an objective function which requires the Gaussians of words that co-occur to be more similar than the Gaussians of words of negative samples (which are obtained as in the Skip-gram model). Two similarity measures are considered: the inner product of the Gaussians and the KL-divergence. It is furthermore argued that the asymmetric nature of KL-divergence makes it a natural choice for modeling hypernymy. In particular, it is proposed that the word embeddings could be improved by imposing that words that are in a hypernym relation have a low KL-divergence, allowing for a natural way to combine corpus statistics with available taxonomies. Finally, another model that represents words using probability distributions was proposed in (Jameel and Schockaert, 2016). However, their model is aimed at capturing the uncertainty about vector representations, rather than at modeling the diversity of words. They show that capturing this uncertainty leads to vectors that outperform those of the GloVe model, on which their model is based.
However, the resulting distributions are not suitable for modeling hypernymy. For example, since more information is available for general terms than for narrow terms, the distributions associated with general terms have a smaller variance, whereas approaches that are aimed at modeling the diversity of words have the opposite behavior.

Ranking Embedding
The model we propose only relies on the rankings induced by each context word, and tries to embed these rankings in a vector space. This problem of "ranking embedding" has already been studied by a few authors. An elegant approach for embedding a given set of rankings, based on the product order, is proposed in (Vendrov et al., 2016). However, this method is specifically aimed at completing partially ordered relations (such as taxonomies), based on observed statistical correlations, and would not be directly suitable as a basis for a word embedding method. The computational complexity of the ranking embedding problem was characterized in (Schockaert and Lee, 2015), where the associated decision problem was shown to be complete for the class ∃R (which sits between NP and PSPACE).
Note that the problem of ranking embedding is different from the learning-to-rank task (Liu, 2009). In the former case we are interested in learning a vector space representation that is somehow in accordance with a given completely specified set of rankings, whereas in the latter case the focus is on representing incompletely specified rankings in a given vector space representation.

Learning the Embedding
In this section we explain how a form of ordinal regression can be used to learn both word vectors and word regions at the same time. First we introduce some notation.
Recall that the Positive Pointwise Mutual Information (PPMI) between two words w_i and w_j is defined as:

PPMI(w_i, w_j) = max(0, log( (n(w_i, w_j) · N) / (n(w_i, ·) · n(·, w_j)) ))

where we write n(w_i, w_j) for the number of times word w_j occurs in the context of w_i, n(w_i, ·) = Σ_w n(w_i, w), n(·, w_j) = Σ_w n(w, w_j), N = Σ_{v,w} n(v, w), and W represents the vocabulary. For each word w_j, we write W^j_0, ..., W^j_{n_j} for the stratification of the words in the vocabulary according to their PPMI value with w_j, i.e. we have that W^j_0 ∪ ... ∪ W^j_{n_j} = W, that PPMI(v, w_j) = PPMI(w, w_j) whenever v and w belong to the same stratum W^j_k, and that PPMI(v, w_j) < PPMI(w, w_j) whenever v ∈ W^j_k and w ∈ W^j_l with k < l. As a toy example, suppose W = {w_1, ..., w_5} and suppose the PPMI values with w_1 are, for instance, PPMI(w_1, w_1) = PPMI(w_2, w_1) = PPMI(w_5, w_1) = 0, PPMI(w_3, w_1) = 0.7 and PPMI(w_4, w_1) = 1.3; then W^1_0 = {w_1, w_2, w_5}, W^1_1 = {w_3} and W^1_2 = {w_4}. To learn the word embedding, we use the following objective function, which requires that for each context word w_j there is a sequence of parallel hyperplanes that separate the representations of the words in W^j_{i−1} from the representations of the words in W^j_i (i ∈ {1, ..., n_j}):

Σ_j Σ_{i=1}^{n_j} ( Σ_{w ∈ W^j_{i−1}} [1 − (φ(p_w) · p̃_{w_j} + b^i_j)]_+² + Σ_{w ∈ W^j_i} [1 + (φ(p_w) · p̃_{w_j} + b^i_j)]_+² ) + λ Σ_j ‖p̃_{w_j}‖²

where the constraints implicitly order the offsets, b^1_j < ... < b^{n_j}_j, for each j. Note that we write [x]_+ for max(0, x) and φ denotes the feature map of the considered kernel function. In this paper, we will in particular consider linear and quadratic kernels. If a linear kernel is used, then φ is simply the identity function. Using a quadratic kernel leads to a quadratic increase in the dimensionality of φ(p_w) and p̃_{w_j}. In practice, we found our model to be about 3 times slower when a quadratic kernel is used, when the word vectors p_w are chosen to be 300-dimensional. Note that p̃_{w_j} and b^i_j define a hyperplane, separating the kernel space into a positive and a negative half-space. The constraints of the form pos(j, i − 1), i.e. the first summand, essentially encode that the elements from W^j_{i−1} should be represented in the positive half-space, whereas the constraints of the form neg(j, i), i.e. the second summand, encode that the elements from W^j_i should be represented in the negative half-space.
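The PPMI stratification can be sketched as follows. The co-occurrence counts used in the usage example are hypothetical, and equal PPMI values are grouped by exact float equality, which is fine for a sketch (a real implementation would bucket the values):

```python
import math
from collections import defaultdict

def ppmi_strata(counts, vocab):
    """Compute PPMI values from co-occurrence counts n(w_i, w_j) and,
    for each context word w_j, return the strata W^j_0, ..., W^j_{n_j}:
    groups of words ordered by increasing PPMI with w_j."""
    total = sum(counts.values())
    row = defaultdict(int)  # n(w_i, .)
    col = defaultdict(int)  # n(., w_j)
    for (wi, wj), n in counts.items():
        row[wi] += n
        col[wj] += n
    strata = {}
    for wj in vocab:
        by_val = defaultdict(list)
        for wi in vocab:
            n = counts.get((wi, wj), 0)
            ppmi = max(0.0, math.log(n * total / (row[wi] * col[wj]))) if n > 0 else 0.0
            by_val[ppmi].append(wi)
        strata[wj] = [by_val[v] for v in sorted(by_val)]
    return strata
```

For example, with counts {("cat","purr"): 5, ("dog","bark"): 5, ("cat","pet"): 2, ("dog","pet"): 2}, the top stratum for context word "purr" contains only "cat", while "dog" falls in the zero-PPMI stratum W^j_0.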
When using a linear kernel, the model is similar in spirit to Skip-gram, in the sense that it associates with each context word a sequence of parallel hyperplanes. In our case, however, only the ordering of these hyperplanes is specified: the specific offsets b^i_j are learned rather than fixed in advance. In other words, we assume that the higher PPMI(w, w_j), the more strongly w is related to w_j, but we do not otherwise assume that the numerical value of PPMI(w, w_j) is relevant. When using a quadratic kernel, each context word is essentially modeled as a sequence of nested ellipsoids. This gives the model a lot more freedom to satisfy the constraints, which may potentially lead to more informative vectors.
The model is similar in spirit to the fixed-margin variant for ranking with large-margin constraints proposed in (Shashua and Levin, 2002), but with the crucial difference that we are learning word vectors and hyperplanes at the same time, rather than finding hyperplanes for a given vector space representation. We use stochastic gradient descent to optimize the proposed objective. Note that we use a squared hinge loss, which makes optimizing the objective more straightforward. As usual, the parameter λ controls the trade-off between maintaining a wide margin and minimizing classification errors. Throughout the experiments we have kept λ at a default value of 0.5. We have also added L2 regularization for the word vectors p_w with a weight of 0.01, which was found to increase the stability of the model. In practice, W^j_0 is typically very large (containing most of the vocabulary), which would make the model too inefficient. To address this issue, we replace it by a small subsample, which is similar in spirit to the idea of negative sampling in the Skip-gram model. In our experiments we use 2k randomly sampled words from W, where k = Σ_{i=1}^{n_j} |W^j_i| is the total number of positive samples. We simply use a uniform distribution to obtain the negative samples, as initial experiments showed that using other sampling strategies had almost no effect on the results.
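A minimal sketch of a single SGD step on one max-margin constraint with the squared hinge loss. The learning rate and the use of plain Python lists are illustrative; the real model updates many constraints per context word and adds regularization:

```python
def squared_hinge_step(p_w, p_c, b, side, lr=0.05):
    """One SGD step on a single constraint with loss [1 - side*(p_w . p_c + b)]_+^2.
    side = +1: w should lie in the positive half-space (score >= 1);
    side = -1: w should lie in the negative half-space (score <= -1).
    Returns updated (p_w, p_c, b); both the word vector and the
    hyperplane parameters are learned at the same time."""
    score = sum(a * c for a, c in zip(p_w, p_c)) + b
    margin = 1.0 - side * score
    if margin <= 0.0:  # constraint already satisfied with margin
        return p_w, p_c, b
    g = -2.0 * margin * side  # d loss / d score
    new_pw = [a - lr * g * c for a, c in zip(p_w, p_c)]
    new_pc = [c - lr * g * a for a, c in zip(p_w, p_c)]
    new_b = b - lr * g
    return new_pw, new_pc, new_b
```

Repeatedly applying the step drives the score past the margin, after which the (squared) hinge gradient vanishes and the parameters stop moving.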

Using Region Representations
When using a quadratic kernel, the hyperplanes defined by the vector p̃_{w_j} and offsets b^i_j define a sequence of nested ellipsoids. To represent the word w_j, we estimate a Gaussian from these nested ellipsoids. The use of Gaussian representations is computationally convenient and intuitively acts as a form of smoothing. In Section 3.2.1 we first explain how these Gaussians are estimated, after which we explain how they are used for measuring word similarity in Section 3.2.2.

Estimating Gaussians
Rather than estimating the Gaussian representation of a given word w_j from the vector p̃_{w_j} and offsets b^i_j directly, we will estimate it from the locations of the words that are inside the corresponding ellipsoids. In this way, we can also take into account the distribution of words within each ellipsoid. In particular, for each word w_j, we first determine a set of words w whose vector p_w is inside these ellipsoids. Specifically, for each word w that occurs at least once in the context of w_j, or is among the 10 closest neighbors in the vector space of such a word, we test whether φ(p_w) · p̃_{w_j} < −b^1_j, i.e. whether w is in the outer ellipsoid for w_j. Let M_{w_j} be the set of all words w for which this is the case. We then represent w_j as the Gaussian G(·; μ_{w_j}, C_{w_j}), where μ_{w_j} and C_{w_j} are estimated as the sample mean and covariance of the set {p_w | w ∈ M_{w_j}}.
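The estimation procedure can be sketched as follows, assuming a linear kernel so that φ is the identity (with a quadratic kernel one would first map the vectors through the feature map); the pre-filtering of candidates by context occurrence and nearest neighbours is omitted:

```python
import numpy as np

def estimate_gaussian(candidates, ctx_vec, b1):
    """Estimate a Gaussian region for a context word from the target
    vectors that fall in its outer region, i.e. the words w with
    phi(p_w) . ctx_vec < -b1 (phi = identity here).
    candidates: dict word -> vector; assumes at least one word qualifies.
    Returns (members, sample mean, sample covariance)."""
    members = [w for w, v in candidates.items() if float(np.dot(v, ctx_vec)) < -b1]
    X = np.stack([candidates[w] for w in members])
    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    return members, mu, C
```

The returned mean and covariance define the Gaussian G(·; μ, C) used as the region representation of the context word.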
We also consider a variant in which each word w from M_{w_j} is weighted as follows. First, we determine the largest k in {1, ..., n_j} for which φ(p_w) · p̃_{w_j} < −b^k_j; note that since w ∈ M_{w_j} such a k exists. The weight λ_w of w is defined as the PPMI value that is associated with the stratum W^j_k. When using this weighted setting, the mean μ_{w_j} and covariance matrix C_{w_j} are estimated as:

μ_{w_j} = ( Σ_{w ∈ M_{w_j}} λ_w p_w ) / ( Σ_{w ∈ M_{w_j}} λ_w )

C_{w_j} = ( Σ_{w ∈ M_{w_j}} λ_w (p_w − μ_{w_j})(p_w − μ_{w_j})^T ) / ( Σ_{w ∈ M_{w_j}} λ_w )

Note that the two proposed methods to estimate the Gaussian G(·; μ_{w_j}, C_{w_j}) do not depend on the choice of kernel, hence they could also be applied in combination with a linear kernel. However, given the close relationships between Gaussians and ellipsoids, we can expect quadratic kernels to lead to higher-quality representations. This will be confirmed experimentally in Section 4.
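The weighted variant amounts to a weighted sample mean and covariance; a sketch:

```python
import numpy as np

def weighted_gaussian(vectors, weights):
    """Weighted sample mean and covariance: each word vector contributes
    in proportion to its weight (here, the PPMI value of its stratum)."""
    X = np.stack(vectors)
    lam = np.asarray(weights, dtype=float)
    lam = lam / lam.sum()                 # normalise the weights
    mu = (lam[:, None] * X).sum(axis=0)   # weighted mean
    D = X - mu
    C = D.T @ (lam[:, None] * D)          # weighted covariance
    return mu, C
```

With equal weights this reduces to the ordinary (population) mean and covariance, so the unweighted estimator of the previous paragraph is the special case λ_w = const.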

Measuring similarity
To compute the similarity between w and w′, based on the associated Gaussians, we consider two alternatives. First, following (Vilnis and McCallum, 2015), we consider the inner product, defined as follows:

E(w, w′) = ∫ f_w(x) f_{w′}(x) dx = G(0; μ_w − μ_{w′}, C_w + C_{w′})

The second alternative is the Jensen-Shannon divergence, given by:

JS(w, w′) = ½ KL(f_w ‖ f_{w′}) + ½ KL(f_{w′} ‖ f_w)

with f_w = G(·; μ_w, C_w), f_{w′} = G(·; μ_{w′}, C_{w′}), and KL the Kullback-Leibler divergence. When computing the KL-divergence we add a small value δ to the diagonal elements of the covariance matrices, following (Vilnis and McCallum, 2015); we used 0.01. This is needed, as for rare words, the covariance matrix may otherwise be singular.
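The KL-divergence between two Gaussians has a closed form, which the sketch below implements, including the δ = 0.01 diagonal smoothing mentioned above. Note that the exact Jensen-Shannon divergence involves the mixture of the two densities, which has no closed form for Gaussians; the sketch uses the symmetrised KL as a stand-in:

```python
import numpy as np

def gaussian_kl(mu0, C0, mu1, C1, delta=0.01):
    """Closed-form KL(N0 || N1) for multivariate Gaussians, with delta
    added to the diagonals so that rare words with (near-)singular
    covariance matrices do not break the computation."""
    d = len(mu0)
    C0 = C0 + delta * np.eye(d)
    C1 = C1 + delta * np.eye(d)
    C1_inv = np.linalg.inv(C1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(C1_inv @ C0)
                  + diff @ C1_inv @ diff
                  - d
                  + np.log(np.linalg.det(C1) / np.linalg.det(C0)))

def js_divergence(mu0, C0, mu1, C1):
    """Symmetrised KL between the two Gaussian regions."""
    return 0.5 * (gaussian_kl(mu0, C0, mu1, C1) + gaussian_kl(mu1, C1, mu0, C0))
```

The divergence is zero for identical Gaussians, positive otherwise, and the symmetrised version is invariant to swapping the two words.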
Finally, to measure the degree to which w entails w′, we use the KL-divergence, again in accordance with (Vilnis and McCallum, 2015).

Experiments
In this section we evaluate both the vector and region representations produced by our model. In our experiments, we have used the Wikipedia dump from November 2nd, 2015, consisting of 1,335,766,618 tokens. We used a basic text preprocessing strategy, which involved removing punctuation, removing HTML/XML tags and lowercasing all tokens. We have removed words with fewer than 10 occurrences in the entire corpus. We used the Apache sentence segmentation tool to detect sentence boundaries. In all our experiments, we have set the number of dimensions to 300, which was found to be a good choice in previous work, e.g. (Pennington et al., 2014). We use a context window of 10 words before and after the target word, but without crossing sentence boundaries. The number of iterations for SGD was set to 20. The results of all baseline models have been obtained using their publicly available implementations. We have used 10 negative samples in the word2vec code, which gave better results than the default value of 5. For the baseline models, we have used the default settings, apart from the D-GloVe model, for which no default values were provided by the authors. For D-GloVe, we have therefore tuned the parameters using the ranges discussed in (Jameel and Schockaert, 2016). Specifically, we have used the parameters that gave the best results on the Google Analogy Test Set (see below).
As baselines we have used the following standard word embedding models: the Skip-gram (SG) and Continuous Bag-of-Words (CBOW) models, proposed in (Mikolov et al., 2013a), the GloVe model, proposed in (Pennington et al., 2014), and the D-GloVe model, proposed in (Jameel and Schockaert, 2016). We have also compared against the Gaussian word embedding model from (Vilnis and McCallum, 2015), using the means of the Gaussians as vector representations, and the Gaussians themselves as region representations. As in (Vilnis and McCallum, 2015), we consider two variants: one with diagonal covariance matrices (Gauss-D) and one with spherical covariance matrices (Gauss-S). For our model, we will consider the following configurations: Reg-li-cos: word vectors obtained using a linear kernel, compared using cosine similarity; Reg-li-eucl: word vectors obtained using a linear kernel, compared using the Euclidean distance; Reg-qu-cos: word vectors obtained using a quadratic kernel, compared using cosine similarity; Reg-qu-eucl: word vectors obtained using a quadratic kernel, compared using the Euclidean distance; Reg-li-prod: Gaussian word regions obtained using a linear kernel, compared using the inner product E; Reg-li-wprod: Gaussian word regions estimated using the weighted variant, obtained using a linear kernel, compared using the inner product E; Reg-li-JS: Gaussian word regions obtained using a linear kernel, compared using the Jensen-Shannon divergence; Reg-li-wJS: Gaussian word regions estimated using the weighted variant, obtained using a linear kernel, compared using the Jensen-Shannon divergence; and analogously Reg-qu-prod, Reg-qu-wprod, Reg-qu-JS and Reg-qu-wJS for the quadratic kernel.

Analogy Completion
Analogy completion is a standard evaluation task for word embeddings. Given a pair (w_1, w_2) and a word w_3, the goal is to find the word w_4 such that w_3 and w_4 are related in the same way as w_1 and w_2. To solve this task, we predict the word w_4 which is most similar to w_2 − w_1 + w_3, either in terms of cosine similarity or Euclidean distance. The evaluation metric is accuracy. We use two popular benchmark data sets: the Google Analogy Test Set and the Microsoft Research Syntactic Analogies Dataset. The former contains both semantic and syntactic relations, for which we show the results separately, referred to as Gsem and Gsyn respectively; the latter only contains syntactic relations and will be referred to as MSR. The results are shown in Table 1. Recall that the parameters of D-GloVe were tuned on the Google Analogy Test Set, hence the results reported for this model for Gsem and Gsyn might be slightly higher than what would normally be obtained. Note that for our model, we can only use word vectors for this task. We outperform SG and CBOW for Gsem and Gsyn but not for MSR, and we outperform GloVe and D-GloVe for Gsyn and MSR but not for Gsem. The vectors from the Gaussian embedding model are not competitive for this task. For our model, using the Euclidean distance slightly outperforms using cosine similarity. For GloVe, SG and CBOW, we only show results for cosine similarity, as this led to the best results. For D-GloVe, we used the likelihood-based similarity measure proposed in the original paper, which was found to outperform both cosine similarity and Euclidean distance for that model.
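The prediction rule w_4 ≈ w_2 − w_1 + w_3 can be sketched as follows, excluding the three query words from the candidates, as is standard; the toy vectors are made up:

```python
import numpy as np

def analogy(vectors, w1, w2, w3, metric="cosine"):
    """Answer 'w1 : w2 :: w3 : ?' by finding the vocabulary word whose
    vector is closest to w2 - w1 + w3 (cosine or Euclidean)."""
    target = vectors[w2] - vectors[w1] + vectors[w3]
    best, best_score = None, float("-inf")
    for w, v in vectors.items():
        if w in (w1, w2, w3):  # never return a query word
            continue
        if metric == "cosine":
            score = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        else:  # negative Euclidean distance, so larger is better
            score = -np.linalg.norm(v - target)
        if score > best_score:
            best, best_score = w, score
    return best
```

With a toy vocabulary where king − man + woman lands exactly on queen, both metrics return "queen".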
For our model, the quadratic kernel leads to better results than the linear kernel, which is somewhat surprising since this task evaluates a kind of linear regularity. This suggests that the additional flexibility that results from the quadratic kernel leads to more faithful context word representations, which in turn improves the quality of the target word vectors.

Similarity Estimation
To evaluate our model's ability to measure similarity, we use 12 standard evaluation sets, for which we will use the following abbreviations: S1: MTurk-287, S2: RG-65, S3: MC-30, S4: WS-353-REL, S5: WS-353-ALL, S6: RW-STANFORD, S7: YP-130, S8: SIMLEX-999, S9: VERB-143, S10: WS-353-SIM, S11: MTurk-771, S12: MEN-TR-3K. Each of these data sets contains similarity judgements for a number of word pairs. The task evaluates to what extent the similarity scores produced by a given word embedding model lead to the same ordering of the word pairs as the provided ground truth judgements. The evaluation metric is the Spearman ρ rank correlation coefficient. For this task, we can either use word vectors or word regions. The results are shown in Table 2.
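Spearman ρ, the evaluation metric used here, can be computed as the Pearson correlation of (average) ranks; a self-contained sketch:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank
    vectors, assigning average ranks to ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2.0 + 1.0  # average rank for the tied group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because only the ranks matter, any monotone transformation of the similarity scores leaves ρ unchanged, which is why the task evaluates orderings rather than absolute scores.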
We use prod to refer to the configurations where similarity is estimated using the inner product, whereas we write JS for the configurations that use the Jensen-Shannon divergence; wprod and wJS refer to the weighted variant for estimating the Gaussians. For our model, the best results are obtained when using word vectors and the Euclidean distance (Reg-qu-eucl), although the differences with the word regions (Reg-qu-wprod) are small. We can again observe that using a quadratic kernel leads to better results than using a linear kernel. As the weighted versions for estimating the Gaussians do not lead to a clear improvement, for the remainder of this paper we will only consider the unweighted variant.
With the exception of S9, our model substantially outperforms the Gaussian word embedding model. Of the standard models, SG and D-GloVe obtain the strongest performance. Compared to our model, these baseline models achieve similar results for S2, S10, S11 and S12, worse results for S1, S3, S4, S5 and S6, and better results for S7, S8 and S9. Two general trends can be observed. First, the data sets where our model performs better tend to be those which measure semantic relatedness rather than pure synonymy. Second, the standard models appear to perform better on data sets that contain verbs and adjectives, as opposed to nouns.

Modeling properties
In (Rubinstein et al., 2015), it was analysed to what extent word embeddings can be used to identify concepts that satisfy a given attribute. While good results were obtained for taxonomic properties, attributive properties such as 'dangerous', 'round', or 'blue' proved to be considerably more problematic. We may expect region-based models to perform well on this task, since each of these attributes then explicitly corresponds to a region in space. To test this hypothesis, Table 3 shows the results for the same 7 taxonomic properties and 13 attributive properties as in (Rubinstein et al., 2015), where the positive and negative examples for all 20 properties are obtained from the McRae feature norms data (McRae et al., 2005). Following (Rubinstein et al., 2015), we use 5-fold cross-validation to train a binary SVM for each property, and compute the average F-score due to the unbalanced class label distribution. We separately present results for SVMs with a linear and a quadratic kernel. The results indeed support the hypothesis that region-based models are well-suited for this task, as both the Gaussian embedding model and our model outperform the standard word embedding models.

(Table 2: Results for similarity estimation (Spearman ρ), over the data sets S1-S12. Reg-li-* and Reg-qu-* are our models with a linear and quadratic kernel.)

Hypernym Detection
For hypernym detection, we have used the following 5 benchmark data sets (available at https://github.com/stephenroller/emnlp2016): H1 (Baroni et al., 2012), H2 (Baroni and Lenci, 2011), H3 (Kotlerman et al., 2010), H4 and H5 (Turney and Mohammad, 2015). Each of the data sets contains positive and negative examples, i.e. word pairs that are in a hypernym relation and word pairs that are not. Rather than treating this problem as a classification task, which would require selecting a threshold in addition to producing a score, we treat it as a ranking problem. In other words, we evaluate to what extent the word pairs that are in a valid hypernym relation are the ones that receive the highest scores. We use average precision as our evaluation metric. Apart from our model, the Gaussian embedding model is the only word embedding model that can by design support unsupervised hypernym detection. As an additional baseline, however, we also show how Skip-gram performs when using cosine similarity. While such a symmetric measure cannot faithfully model hypernymy, it was nonetheless found to be a strong baseline for hypernymy models (Vulić et al., 2016), due to the inherent difficulty of the task. We also compare with a number of standard bag-of-words based models for detecting hypernyms: WeedsPrec (Kotlerman et al., 2010), ClarkeDE (Clarke, 2009) and invCL (Lenci and Benotto, 2012). These latter models take as input the PPMI-weighted co-occurrence counts.
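As an illustration of the bag-of-words baselines, a sketch of a WeedsPrec-style directional measure on PPMI vectors represented as sparse dicts (the exact formulation in the cited work may differ in details):

```python
def weeds_prec(u, v):
    """Directional inclusion measure: the proportion of u's PPMI weight
    that falls on features which are also active for v. High values
    suggest that u's contexts are included in v's (u is the narrower word)."""
    shared = sum(w for f, w in u.items() if v.get(f, 0.0) > 0.0)
    total = sum(u.values())
    return shared / total if total else 0.0
```

The measure is asymmetric by construction: a narrow term scores high against a broad term but not vice versa.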
The results are shown in Table 4, where Reg-li-KL and Reg-qu-KL refer to variants of our model in which Kullback-Leibler divergence is used to compare word regions. Surprisingly, both for our model and for the Gaussian embedding model, we find that using cosine similarity between the word vectors outperforms using the word regions with KL-divergence. In general, our model outperforms the Gaussian embedding model and the other baselines. Given the effectiveness of the cosine similarity, we have also experimented with the following metric:

hyp(w_1, w_2) = (1 − cos(w_1, w_2)) · KL(f_{w_1} ‖ f_{w_2})

The results are referred to as Reg-li-KLC and Reg-qu-KLC in Table 4. These results suggest that the word regions can indeed be useful for detecting hypernymy, when used in combination with cosine similarity. Intuitively, for w_2 to be a hypernym of w_1, both words need to be similar and w_2 needs to be more general than w_1. While word regions are not needed for measuring similarity, they seem essential for modeling generality (in an unsupervised setting).

(Table 4: Results for hypernym detection (AP). Reg-li-* and Reg-qu-* are our models with a linear and quadratic kernel.)
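The combined metric can be sketched directly; the KL term between the two word regions is assumed to be computed separately (e.g. from the closed-form Gaussian KL):

```python
import numpy as np

def hyp_score(v1, v2, kl12):
    """Combined hypernymy score (1 - cos(w1, w2)) * KL(f_w1 || f_w2),
    used to rank candidate pairs. v1, v2: target word vectors;
    kl12: KL divergence between the corresponding word regions."""
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return (1.0 - cos) * kl12
```

Note that the score multiplies a vector-based term with a region-based term, mirroring the intuition that both similarity and relative generality matter.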
The data sets considered so far all treat hypernymy as a binary notion. In (Vulić et al., 2016) an evaluation set was introduced which contains graded hypernym pairs. The underlying intuition is that e.g. cat and dog are more typical/natural hyponyms of animal than dinosaur or amoeba. The results for this data set are shown in Table 5. In this case, we use Spearman ρ as an evaluation metric, measuring how well the rankings induced by different models correlate with the ground truth. Following (Vulić et al., 2016), we report results separately for nouns and verbs. In the case of nouns, the results follow a similar pattern to those in Table 4. Interestingly, for verbs we find that Skip-gram substantially outperforms the region based models, which is in accordance with our findings in the word similarity experiments.

Conclusions
We have proposed a new word embedding model, which is based on ordinal regression. The input to our model consists of a number of rankings, capturing how strongly each word is related to each context word in a purely ordinal way. Word vectors are then obtained by embedding these rankings in a low-dimensional vector space. Despite the fact that all quantitative information is disregarded by our model (except for constructing the rankings), it is competitive with standard methods such as Skip-gram, and in fact outperforms them in several tasks. An important advantage of our model is that it can be used to learn region representations for words, by using a quadratic kernel.
Our experimental results suggest that these regions can be useful for modeling hypernymy.