Investigating Word-Class Distributions in Word Vector Spaces

This paper presents an investigation of the distribution of word vectors belonging to a certain word class in a pre-trained word vector space. To this end, we made several assumptions about the distribution, modeled the distribution accordingly, and validated each assumption by comparing the goodness of each model. Specifically, we considered two types of word classes – the semantic class of direct objects of a verb and the semantic class in a thesaurus – and tried to build models that properly estimate how likely it is that a word in the vector space is a member of a given word class. Our results on selectional preference and WordNet datasets show that the centroid-based model fails to achieve good enough performance, that the geometry of the distribution and the existence of subgroups have only a limited impact, and that negative instances need to be considered for adequate modeling of the distribution. We further investigated the relationship between the scores calculated by each model and the degree of membership, and found that discriminative learning-based models are best at finding the boundaries of a class, while models based on the offset between positive and negative instances perform best in determining the degree of membership.


Introduction
Several studies have been successful in representing the meaning of a word with a vector in a continuous vector space (e.g., Mikolov et al. 2013a; Pennington et al. 2014). These representations are useful for a range of natural language processing (NLP) tasks. The interpretation and geometry of word embeddings have also attracted attention (e.g., Kim and de Marneffe 2013; Mimno and Thompson 2017). However, little attention has been paid to the distribution of words belonging to a certain word class in a word vector space, even though empirical analysis of such a distribution provides a better understanding of word vector spaces and insight into algorithmic choices for several NLP tasks, including selectional preference acquisition and entity set expansion.

Figure 1 shows a 2D projection of word embeddings. We extracted 200 words that can be a direct object of the verb play (positive instances) and 1,000 other words (negative instances) and projected their GloVe vectors (Pennington et al., 2014) into two dimensions using t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008). The plus symbols (+) represent the positive instances, and the squares represent the negative instances. The figure shows that the positive instances tend to be densely distributed around their centroid, but they are not evenly distributed near the centroid in the 2D space. In this study, we aimed to understand how these positive instances are distributed in the pre-trained word vector spaces built by three representative general-purpose models: CBOW, skip-gram (Mikolov et al., 2013a), and GloVe.
More specifically, we attempted to determine the following: whether or not a simple centroid-based approach can provide a reasonably good model, whether or not considering the geometry of the distribution and the existence of subgroups is useful for modeling the distribution, and whether or not considering the negative instances is essential to achieve adequate modeling. To this end, we first tackled properly modeling the vector distribution to distinguish a possible member of a word class from others when a subset of the class members is given. Note that although various approaches have been proposed to improve word vectors by taking knowledge related to word classes into account (Faruqui et al., 2015; Rothe and Schütze, 2015; Mrkšić et al., 2017), we explored ways to model the distribution of word vectors rather than attempting to improve the word vectors themselves.
We started with a centroid-based model, which is a simple but widely used way of representing a set of word vectors (e.g., Baroni et al. 2014; Woodsend and Lapata 2015) and assumes that the likelihood that a word in the vector space is a member of a word class is proportional to its proximity to the centroid vector of the class members. We then explored models that take the geometry of the distribution and the existence of subgroups into account. Here, we made two assumptions: vectors of words belonging to a certain word class are distributed with different variances depending on the direction, and most word sets consist of several subgroups. We then explored models that also consider negative instances. We assumed that the vectors of words that do not belong to the target word class can be essential clues for distinguishing a possible member of a word class from others. Specifically, we explored a model based on the offset between positive and negative instances and discriminative learning-based models to investigate the impact of negative instances. Furthermore, we investigated the relationship between the scores calculated by each model and the degree of membership using the Rosch (1975) dataset, which contains typicality ratings for instances of a category. Through experiments, we found that discriminative learning-based models perform better at distinguishing a possible member of a word class from others, while the offset-based model achieves higher correlations with the degree of membership.

Related Work
The interpretation and geometry of word embeddings have attracted attention. Mimno and Thompson (2017) reported that vector positions trained with skip-gram negative sampling (SGNS) do not span the possible space uniformly but instead occupy a narrow cone. Mikolov et al. (2013b) showed that constant vector offsets of word pairs can represent linguistic regularities. Kim and de Marneffe (2013) demonstrated that vector offsets can be used to derive a scalar relationship among adjectives. Yaghoobzadeh and Schütze (2016) performed an analysis of subspaces in word embeddings. These analyses suggest that a certain direction or subspace in the word vector space can represent an aspect of the words, and thus that a word class may be distributed with different variances depending on the direction in the vector space.
While we investigated ways to model the distribution of a set of words in pre-trained word vector spaces to validate several assumptions about the distribution, various approaches have been proposed to improve word embeddings by taking knowledge related to word classes into account. For example, Faruqui et al. (2015) proposed a method of refining vector representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations. Mrkšić et al. (2017) proposed an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. Glavaš and Vulić (2018) used linguistic constraints as training examples to learn an explicit specialization function with a deep neural network architecture.
There are also several studies that extend word vector acquisition methods to account for the uncertainty of a word's meaning via Gaussian models (Vilnis and McCallum, 2015; Athiwaratkun and Wilson, 2017) and for word polysemy by introducing several vectors for each word (Neelakantan et al., 2014; Tian et al., 2014; Athiwaratkun et al., 2018). In this study, we considered only a single vector for representing each word, but inspired by these studies, we explored models that can take the geometry of the distribution and the existence of subgroups into account.
The problem we tackled is similar to the selectional preference acquisition task, which has been studied extensively. Resnik (1996) presented an information-theoretic approach that inferred selectional preferences based on the WordNet hypernym hierarchy. Erk et al. (2010) described a method that uses corpus-driven distributional similarity metrics for selectional preference induction. Van de Cruys (2014) investigated the use of neural networks for selectional preference acquisition. The entity set expansion task (Pantel et al., 2009) is also similar to our problem and has been well studied. For example, Sadamitsu et al. (2011) disambiguated entity word senses and alleviated semantic drift by extracting topic information with LDA for entity set expansion. Zhang et al. (2016) proposed a joint model for entity set expansion and attribute extraction. In this study, we seek to understand how word vectors are distributed in the pre-trained word vector space without using contextual or lexical information; a comparison with the state-of-the-art models for selectional preference induction and entity set expansion is beyond the scope of this work.

Problem Formulation
First, let us introduce the notation. $W_c$ is a subset of words that belong to the target word class $c$. $W_o$ is a subset of words that do not belong to the word class. $w_t$ is a target word that can be a member of the word class $c$ but is not included in $W_c$. $v_w \in V_w$ is a pre-trained vector for word $w$. We normalize all word vectors to unit length. Note that we select the words in $W_o$ so that they share the same grammatical category as the words in $W_c$.
Our objective is to distinguish the word $w_t$ from the words in $W_o$, given $W_c$ and $V_w$. More specifically, we aim to find a scoring function $f(w, W_c)$ that assigns a higher score to $w_t$ and lower scores to the words in $W_o$. For example, suppose $c$ is the class of words that can be a direct object of the verb play; $W_c$, $W_o$, and $w_t$ may then be: $W_c$ = {role, part, game, golf, tennis}, $W_o$ = {school, apple, milk, arch, idea}, and $w_t$ = basketball. Our objective is to find a scoring function that assigns a higher score to basketball than to school, apple, milk, arch, and idea.
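To make the formulation concrete, the following is a minimal Python sketch of the setup with toy data; the random stand-in vectors and the `evaluate` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical stand-ins for pre-trained vectors (the paper uses CBOW, SGNS, and GloVe).
rng = np.random.default_rng(0)
vocab = ["role", "part", "game", "golf", "tennis",
         "school", "apple", "milk", "arch", "idea", "basketball"]
V = {w: normalize(rng.normal(size=300)) for w in vocab}

W_c = ["role", "part", "game", "golf", "tennis"]    # known members of the class
W_o = ["school", "apple", "milk", "arch", "idea"]   # non-members to be out-scored
w_t = "basketball"                                  # held-out member that should rank first

def evaluate(score_fn):
    """A model succeeds if the held-out member out-scores every word in W_o."""
    X_c = np.stack([V[w] for w in W_c])
    return all(score_fn(V[w_t], X_c) > score_fn(V[w], X_c) for w in W_o)
```

Any concrete model from the next section plugs in as `score_fn`.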

Models
We start with a centroid-based model (CENT) that scores a word $w$ against a word set $W_c$ by the cosine similarity between the word vector and the centroid of the word vectors in the set (Figure 2-(a)). The scoring function can be written as:
$$f_{\mathrm{CENT}}(w, W_c) = \cos\Bigl(v_w,\ \frac{1}{|W_c|}\sum_{w_c \in W_c} v_{w_c}\Bigr).$$

CENT provides a reasonable baseline, but it does not take the geometry of the distribution of the word vectors into account. We therefore introduce a simple Gaussian model (GM) to represent the distribution of word vectors belonging to a word class $c$ (Figure 2-(b)). The scoring function is:
$$f_{\mathrm{GM}}(w, W_c) = \mathcal{N}(v_w;\ \mu, \Sigma),$$
where the mean $\mu$ and covariance matrix $\Sigma$ are estimated from $\{v_{w_c} \mid w_c \in W_c\}$. We select the constraint on the covariance matrix from {spherical, diagonal, full} by performing cross-validation on $W_c$. GM is identical to CENT when the covariance matrix is the identity matrix.

Next, we introduce a Gaussian mixture model (GMM) to take the existence of subgroups in a word class $c$ into account (Figure 2-(c)). The scoring function can be written as:
$$f_{\mathrm{GMM}}(w, W_c) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(v_w;\ \mu_k, \Sigma_k),$$
where the weights $\pi_k$, means $\mu_k$, and covariance matrices $\Sigma_k$ are estimated from $\{v_{w_c} \mid w_c \in W_c\}$. We select the number of mixture components $K$ from {1, 2, . . . , 10} and the constraint on the covariance matrices from {spherical, diagonal, full} by performing cross-validation on $W_c$. GMM can be considered an extension of CENT because it is identical to CENT when $K$ is 1 and the covariance matrix is the identity matrix.

Furthermore, we consider another extension of CENT that takes only the existence of subgroups into account. Since all word vectors are normalized to unit length, $f_{\mathrm{CENT}}(w, W_c)$ can also be written as:
$$f_{\mathrm{CENT}}(w, W_c) = \alpha_{W_c} \sum_{w_c \in W_c} \cos(v_w, v_{w_c}),$$
where $\alpha_{W_c}$ is a normalization term that depends only on $W_c$ and thus does not affect the ranking. That is, CENT can be viewed as taking the average of the cosine similarities between a word vector $v_w$ and all word vectors in the given word set $W_c$. If the words in the word set consist of several subgroups, it would be more plausible to consider only the top-$k$ most similar words for scoring. Accordingly, we introduce the $k$-nearest neighbor model (kNN), which averages only the top-$k$ similarities:
$$f_{k\mathrm{NN}}(w, W_c) = \frac{1}{k} \sum_{w_c \in k\mathrm{NN}_w(W_c)} \cos(v_w, v_{w_c}),$$
where $k\mathrm{NN}_w(W_c)$ is a function that returns the set of $k$ words in $W_c$ with the highest cosine similarity to the word $w$. The value of $k$ is selected from $\{1, 2, 2^2, \ldots, |W_c|\}$ by performing cross-validation on $W_c$. kNN is identical to CENT when $|W_c|$ is selected as $k$.
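As a rough illustration only, the four positive-only scoring functions could be implemented as follows with NumPy, SciPy, and scikit-learn; the covariance handling, the fixed values of $K$ and $k$, and the use of log densities are simplifying assumptions (the paper selects these settings by cross-validation on $W_c$).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# X_c: (|W_c| x d) array of unit-length member vectors; v_w: unit-length query vector.

def f_cent(v_w, X_c):
    c = X_c.mean(axis=0)
    return float(v_w @ c / np.linalg.norm(c))             # cosine to the centroid

def f_gm(v_w, X_c):
    mu = X_c.mean(axis=0)
    sigma = np.cov(X_c, rowvar=False)                     # 'full' covariance variant
    return multivariate_normal.logpdf(v_w, mean=mu, cov=sigma, allow_singular=True)

def f_gmm(v_w, X_c, K=3):                                 # K fixed here for illustration
    gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0).fit(X_c)
    return float(gmm.score_samples(v_w[None, :])[0])      # log density under the mixture

def f_knn(v_w, X_c, k=8):                                 # k fixed here for illustration
    sims = X_c @ v_w                                      # cosines, since vectors are unit length
    return float(np.sort(sims)[-k:].mean())
```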
As the last model without negative instances, we adopt a one-class support vector machine (SVM) (Schölkopf et al., 2001)-based model (1-SVM) to clarify the importance of negative instances. We select the kernel from {linear, cubic polynomial, RBF} and tune the parameter $\nu \in \{0.05, 0.10, \ldots, 0.50\}$ by performing cross-validation. Note that the models without negative instances learn a decision function for outlier detection: classifying new data as similar or different to the given positive instances.
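A minimal sketch of 1-SVM with scikit-learn, assuming an RBF kernel and $\nu = 0.1$ (both are tuned by cross-validation in the paper):

```python
from sklearn.svm import OneClassSVM

def f_1svm(v_w, X_c):
    model = OneClassSVM(kernel="rbf", nu=0.1).fit(X_c)       # fit on positive instances only
    return float(model.decision_function(v_w[None, :])[0])   # higher = more like the positives
```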
Next, we explore models that also leverage negative instances. Here, we introduce a word set $W_n$ of negative instances, where $W_n$ consists of words that are included in neither $W_c$ nor $W_o$. We select the words in $W_n$ so that they share the same grammatical category as the words in $W_c$ and $W_o$. Both $W_o$ and $W_n$ consist of words that are not included in $W_c$, but their roles are different: the words in $W_o$ serve as negative instances in the evaluation, whereas the words in $W_n$ serve as negative instances for modeling the word-class distribution.
As the first model with negative instances, we introduce a model based on the offset between positive and negative instances (OffSet). This model is inspired by Kim and de Marneffe (2013)'s work, which demonstrates that vector offsets can be used to derive adjectival scales. We assume that the vector offset between the centroid of the positive instances and that of the negative instances represents the degree of membership in the vector space (Figure 2-(d)). The scoring function of OffSet is:
$$f_{\mathrm{OffSet}}(w, W_c, W_n) = v_w \cdot v_{\mathrm{offset}}, \quad \text{where} \quad v_{\mathrm{offset}} = \frac{1}{|W_c|}\sum_{w_c \in W_c} v_{w_c} - \frac{1}{|W_n|}\sum_{w_n \in W_n} v_{w_n}.$$

Now let us move on to discriminative learning-based models. In this study, we chose a support vector machine with a linear kernel (SVM$_L$) or a radial basis function (RBF) kernel (SVM$_R$). We used only word vectors as the input of these models and regard the decision function as the scoring function. We tuned the parameter $C \in \{0.1, 0.2, 0.5, 1, 2, 5, 10\}$ and the class weight for positive instances $P \in \{1, 2, 4, 8\}$ for SVM$_L$, and the parameters $C \in \{0.2, 0.5, 1, 2, 5\}$, $\gamma \in \{0.2, 0.5, 1, 2\}$, and the class weight for positive instances $P \in \{1, 2, 4, 8\}$ for SVM$_R$, by performing cross-validation on $W_c$ and $W_n$. Note that we only wanted to determine the usefulness of negative instances in modeling the distribution of word vectors; thus, we make no assertions that these are optimal models.
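The two families of models with negative instances could be sketched as follows; here `X_n` is an (|W_n| x d) array of negative-instance vectors, and the SVM hyperparameters are placeholder values standing in for the cross-validated ones.

```python
import numpy as np
from sklearn.svm import SVC

def f_offset(v_w, X_c, X_n):
    v_off = X_c.mean(axis=0) - X_n.mean(axis=0)           # positive centroid minus negative centroid
    return float(v_w @ v_off)                             # projection onto the offset direction

def f_svm(v_w, X_c, X_n, kernel="linear"):                # kernel="rbf" corresponds to SVM_R
    X = np.vstack([X_c, X_n])
    y = np.array([1] * len(X_c) + [0] * len(X_n))
    model = SVC(kernel=kernel, C=1.0, gamma="scale", class_weight={1: 4}).fit(X, y)
    return float(model.decision_function(v_w[None, :])[0])  # signed margin as the score
```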

Word embeddings
We used three publicly available pre-trained word vectors for English: 300-dimensional embeddings trained on the Google News corpus with the CBOW model (CBOW), 300-dimensional embeddings trained on Wikipedia with the skip-gram model (SGNS), and 300-dimensional embeddings trained on Wikipedia and Gigaword with the GloVe model (GloVe). For Japanese, we trained 300-dimensional embeddings on an approximately 1.5 billion word corpus collected from the Web with the CBOW model (CBOW), the skip-gram model (SGNS), and the GloVe model (GloVe). We also trained 50-, 100-, and 200-dimensional embeddings on the same corpus for each model in order to investigate the effect of the vector size.
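For reference, a hedged sketch of how such vectors might be loaded and unit-normalized with gensim; the file name is a placeholder, and GloVe vectors would first need to be converted to word2vec text format.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; substitute the actual CBOW / SGNS / GloVe file.
kv = KeyedVectors.load_word2vec_format("vectors-300d.bin", binary=True)

def vector(word):
    v = np.asarray(kv[word], dtype=np.float64)
    return v / np.linalg.norm(v)   # all vectors are normalized to unit length, as in the problem formulation
```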

Datasets
For the evaluation, we used two types of datasets for each of English and Japanese.

SP dataset
As the first type, we used word sets consisting of words that can be a direct object of a certain verb. For example, a word set may consist of {role, part, game, golf, tennis, etc.}, where each word can be a direct object of the verb play. We did not use the verb itself in the evaluation, but the setting can be regarded as a selectional preference (SP) task.
For the English SP dataset, we extracted pairs of verbs and their direct objects from the Google Books Syntactic N-grams dataset (Goldberg and Orwant, 2013). We first extracted verbs with the POS tag VBD, VBP, or VBZ that have direct objects at a rate of more than 40%; we decided on the 40% threshold empirically so as to extract only transitive verbs. Then, we listed the extracted verbs in descending order of the number of distinct direct objects and chose the top 1,000 of them.

Figure 3: A synset paired with the set of its hyponyms within a distance of at most five. The hyponyms are surrounded by a broken line.
For the Japanese SP dataset, we extracted pairs of verbs and their accusative arguments from the predicate-argument data used by Sasano and Okumura (2016). First, we extracted verbs that have accusative arguments at a rate of more than 70%; again, we decided on the 70% threshold empirically so as to extract only transitive verbs. Then, we listed the extracted verbs in descending order of the number of distinct accusative arguments and chose the top 1,000 of them.
Both datasets consisted of 1,000 verbs with at least 250 unique direct objects. For each verb, we selected 200 direct objects from the 250 most frequent direct objects as $W_c$ and used the other 50 direct objects as $w_t$. Thus, the number of tasks $N$ was 50,000, i.e., 50 tasks for each of the 1,000 verbs. We used 2,000 negative instances against the 200 positive instances to build the models with negative instances.

WordNet datasets
We used word sets extracted from the English and Japanese WordNets (Fellbaum, 1998; Isahara et al., 2008) as the second type. For example, a word set may consist of {dog, llama, hedgehog, wolf, etc.}, which are all hyponyms of the same synonym set (synset n01886756, placental). We extracted pairs of a synset ID and the set of words in the synset and its hyponyms within a distance of at most five from the target synset in the WordNet hyponym tree, as shown in Figure 3. We did not use multiword expressions or words whose word vectors are not included in any of the three pre-trained word embeddings.
We extracted synsets that have at least 250 words, which yielded 109 word sets for the English dataset and 120 word sets for the Japanese dataset. For each synset, we selected 200 words as $W_c$ and used the other 50 words as $w_t$. The number of tasks $N$ was thus 5,450 for English (50 tasks for each of the 109 synsets) and 6,000 for Japanese (50 tasks for each of the 120 synsets). As with the SP datasets, we used 2,000 negative instances against the 200 positive instances to build the models with negative instances.
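A sketch of how such a word set could be extracted with NLTK's WordNet interface, following the description above (hyponyms within a distance of at most five, multiword expressions excluded); the traversal details are our assumptions.

```python
from nltk.corpus import wordnet as wn

def class_words(synset, max_depth=5):
    """Collect single-word lemmas of a synset and of its hyponyms within max_depth steps."""
    words, frontier = set(), [(synset, 0)]
    while frontier:
        s, depth = frontier.pop()
        words.update(l for l in s.lemma_names() if "_" not in l)   # drop multiword expressions
        if depth < max_depth:
            frontier.extend((h, depth + 1) for h in s.hyponyms())
    return words

placental = wn.synset("placental.n.01")   # corresponds to synset n01886756 in the example above
members = class_words(placental)          # e.g., dog, llama, hedgehog, wolf, ...
```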

Experimental settings
We compared eight models: CENT, GM, GMM, kNN, 1-SVM, OffSet, SVM$_L$, and SVM$_R$. For each dataset, we built $W_o$ by extracting 999 words from the other word sets; that is, the number of words to be scored was 1,000, including the target word $w_t$. For OffSet, SVM$_L$, and SVM$_R$, we built $W_n$ by extracting words from the other word sets subject to the constraint $W_o \cap W_n = \emptyset$.
We regarded the problem as a ranking task and adopted the mean reciprocal rank (MRR) as the evaluation metric. The MRR is calculated as:
$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}(w_{t_i})},$$
where $\mathrm{rank}(w_{t_i})$ is the rank of the target word $w_{t_i}$ in the $i$-th task. We tuned the parameters to maximize the MRR. We measured statistical significance with an approximate randomization test (Chinchor, 1992) with 99,999 iterations and a significance level of $\alpha = 0.05$ after Bonferroni correction. To satisfy the independence assumption, we treated each verb (for the SP datasets) or synset (for the WordNet datasets) as the unit of the randomization test.
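A minimal sketch of the MRR computation, assuming each task provides the target word, the 1,000 candidate words (the target plus $W_o$), and a fitted scoring function:

```python
import numpy as np

def mean_reciprocal_rank(tasks):
    """tasks: iterable of (target_word, candidate_words, score_fn) triples."""
    reciprocal_ranks = []
    for target, candidates, score_fn in tasks:
        ranking = sorted(candidates, key=score_fn, reverse=True)   # best-scored word first
        reciprocal_ranks.append(1.0 / (ranking.index(target) + 1))
    return float(np.mean(reciprocal_ranks))
```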

Results on the SP datasets
Tables 1 and 2 show the experimental results on the SP datasets for English and Japanese, respectively. In these tables, the best score for each word embedding model and the scores with no significant difference from the best score are indicated in bold. In addition, the CENT score and the scores with no significant difference from the CENT score are italicized.
The results in these tables indicate that the models that consider the geometry of the distribution or the existence of subgroups in the word class outperform the centroid-based model (CENT) on both the English and Japanese SP datasets. In particular, the simple Gaussian model (GM) performed the best among the models that depend only on positive instances. This indicates that these word sets are distributed with different variances depending on the direction in the vector space and that it is useful to consider the geometry of the distribution. The two discriminative learning-based models with negative instances, SVM$_L$ and SVM$_R$, achieved much higher performance, whereas 1-SVM yielded only a limited improvement over CENT. This demonstrates that modeling the distribution with only positive instances has an obvious limitation and that it is essential to leverage the negative instances as well. OffSet with CBOW or SGNS achieved relatively good performance, but OffSet with GloVe did not, which suggests that the usefulness of the offset depends on the word embedding model.

MRR scores recovered from the text (the last two GloVe values are missing in the source):
Model   CENT    GM      GMM     kNN     1-SVM   OffSet  SVM_L   SVM_R
CBOW    .1642   .2539   .2360   .2097   .1726   .2782   .3397   .3905
SGNS    .1887   .2461   .2308   .1918   .2252   .2189   .3365   .3608
GloVe   .1925   .2596   .2462   .2245   .2295   .1150
Ave.    .1677   .2624   .2454   .2185   .1996   .2399   .3936   .4355

Results on the WordNet datasets
Tables 3 and 4 show the experimental results on the WordNet datasets for English and Japanese, respectively. The meaning of the bold and italic fonts is the same as for the SP datasets. The two discriminative learning-based models with negative instances, as well as OffSet with CBOW or SGNS, achieved relatively high performance. This again demonstrates that negative instances must be taken into account to model the distribution properly. On the other hand, in contrast with the SP datasets, there were no significant improvements when the geometry of the distribution and the existence of subgroups were considered.
Average MRR scores recovered from the text: CENT .1665, GM .1564, GMM .1532, kNN .1614, 1-SVM .1643, OffSet .1857, SVM_L .2310, SVM_R .2433.

Overall, the scores were lower than on the SP datasets, possibly because the WordNet datasets are constructed on the basis of human intuition, whereas the SP datasets are automatically built from the corpus and are highly compatible with the pre-trained word vectors. In addition, we examined which types of words tend to rank low and found that words extracted from a synset corresponding to one of their infrequent senses, such as stock in the sense of livestock, tend to rank low. We leave further exploration for future work.

Discussion
It is interesting that although SVM$_L$ is effectively just a linear classifier, it achieves relatively high performance. This is likely due to the relatively large vector size compared to the number of positive instances, and it indicates that the positive instances occupy a certain span in the vector space, even though such a span cannot be determined by using positive instances alone. We confirmed two desirable properties of the discriminative learning-based models with negative instances for practical applications. One is that, since we used simple models, they do not require much training time. The other is that their performance is relatively stable across the different word embeddings and datasets compared with the other models.

We also investigated the effect of the vector size and the number of positive instances on the Japanese SP dataset. Table 5 shows the averaged CBOW, SGNS, and GloVe scores for different vector dimensions: 50, 100, 200, and 300. We found that while CENT and 1-SVM were not affected much by the vector size, the other models, particularly OffSet, SVM$_L$, and SVM$_R$, were significantly affected by it. Table 6 shows the averaged CBOW, SGNS, and GloVe scores for different numbers of positive instances: 25, 50, 100, and 200.

Table 5: Averaged scores for different vector sizes (Japanese SP dataset).
Size   CENT    GM      GMM     kNN     1-SVM   OffSet  SVM_L   SVM_R
50     .1686   .2360   .2055   .1909   .1825   .1769   .2842   .3568
100    .1738   .2557   .2177   .2075   .1954   .2189   .3366   .4044
200    .1724   .2697   .2233   .2178   .2005   .2363   .3813   .4340
300    .1677   .2624   .2454   .2185   .1996   .2399   .3936   .4355

Table 6: Averaged scores for different numbers of positive instances (Japanese SP dataset).
|W_c|  CENT    GM      GMM     kNN     1-SVM   OffSet  SVM_L   SVM_R
25     .1563   .1728   .1522   .1635   .1562   .1880   .2326   .2600
50     .1612   .2008   .1779   .1795   .1722   .2144   .2898   .3157
100    .1652   .2388   .2098   .1988   .1880   .2307   .3475   .3790
200    .1677   .2624   .2454   .2185   .1996   .2399   .3936   .4355

We can conclude that all the models perform better with a larger number of positive instances, especially GM, GMM, SVM$_L$, and SVM$_R$. This is not surprising, since these models have a large number of parameters and can extract a rich variety of information from a large number of positive instances. Similar tendencies were also observed with the other dataset. These results demonstrate that we can obtain relatively high performance by using discriminative learning-based models with a sufficiently large vector size and enough training data.

Rosch (1975) developed the prototype concept and showed that not all members of a category are equally representative of the category. Here, we are interested in the relationship between the scores calculated by each model and the degree of membership. We thus investigated how consistent the score calculated by each model is with human intuition about the degree of membership.

Degree of membership
For this experiment, we used the typicality data of Rosch (1975). Rosch asked 209 college students to rate, on a 7-point scale, the extent to which each instance represents their idea or image of the meaning of the category term, and reported the rank orders with the mean ratings for ten categories. For example, for the Furniture category, 60 examples are ranked by their mean ratings: chair and sofa are top-ranked with a score of 1.04, and stove is ranked 50th with a score of 5.4. In this study, we used the eight categories that have a corresponding synset in WordNet. Table 7 shows the statistics of the dataset. In the table, $|W_R|$ denotes the number of examples in Rosch's dataset, $|W_c|$ denotes the number of words in the synset and its hyponyms in WordNet, and $|W_R \cap W_c|$ is the number of words included in both $W_R$ and $W_c$, which are the words we try to rank here.

Table 7: Statistics of the typicality dataset.
Category    Synset      |W_R|  |W_c|  |W_R ∩ W_c|
Furniture   n03405725    60      89    26
Fruit       n13134947    51     165    41
Vehicle     n04524313    50     346    34
Weapon      n04565375    60     119    19
Vegetable   n07707451    56     102    27
Bird        n01503061    54     330    51
Sport       n00523513    59     106    33
Clothing    n03051540    55     409    31
In this experiment, the objective was not to distinguish a possible member from others but to rank the positive members $w_c$ in $W_c$ according to their degree of membership. That is, we first built the scoring function using $W_c$ and $W_n$ and then applied it to each member of $W_R \cap W_c$ to predict the typicality ranking. We evaluated the predicted ranking by calculating Spearman's rank correlation coefficient ($\rho$) and Kendall's rank correlation coefficient ($\tau$) against the goodness-of-example ranking in Rosch's dataset, and computed the average rank correlation coefficient over the eight categories for $\rho$ and $\tau$. Table 8 shows the experimental results.
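A sketch of this evaluation step with SciPy, assuming `score_fn` maps a word to a model score and `rosch_rank` maps a word to its rank in Rosch's data (1 = most typical); the sign handling is an assumption made to align higher scores with better ranks.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def typicality_correlation(words, rosch_rank, score_fn):
    scores = np.array([score_fn(w) for w in words])
    human_ranks = np.array([rosch_rank[w] for w in words])
    rho, _ = spearmanr(-scores, human_ranks)    # negate so that higher scores match rank 1
    tau, _ = kendalltau(-scores, human_ranks)
    return rho, tau
```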
In contrast with the previous experiments, the highest scores were achieved by OffSet. These results suggest that vector offsets can be used to derive the degree of membership. We can say that, while discriminative learning-based models, especially SVM$_R$, can find the boundary of a category in a vector space with high accuracy, the vector offset between the centroid of the positive instances and that of the negative instances can properly represent the degree of membership in a category.
When we focused on each combination of the embedding and distribution models, we found that the highest and second highest scores were achieved by OffSet with GloVe and GMM with SGNS, respectively. In contrast, both achieved relatively low performance in distinguishing a possible member of a word class from others, as shown in Table 3. These results demonstrate that the proper models for finding the boundaries of a class and those for determining the degree of membership are different and that choosing a proper model depending on the task is essential.

Conclusion and Future Work
We investigated the distribution of words that belong to a certain word class in pre-trained general-purpose word vector spaces. The experimental results show that a centroid-based approach cannot provide a reasonably good model and that considering the geometry of the distribution and the existence of subgroups is useful for modeling the distribution in some cases. However, the impact is limited, and negative instances must be taken into account for adequate modeling. The results indicate that observing only the distribution of positive instances is not enough to understand the geometry of word embedding spaces. Furthermore, we investigated the relationship between the score calculated by each model and the degree of membership and demonstrated that, while discriminative learning-based models can better distinguish a possible member of a word class from others, the offset-based model achieves higher correlations with the degree of membership.
The investigation in this study leveraged only general-purpose word vectors to represent the meaning of a word. However, several studies have expanded the method for acquiring a word vector to account for the uncertainty of word meanings and word polysemy (e.g., Athiwaratkun et al. 2018). In addition, contextualized word embeddings have been shown to be very effective on a range of NLP tasks (Peters et al., 2018;Devlin et al., 2019). Furthermore, Gong et al. (2018) reported that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space. Thus, in the future, we will take the uncertainty, polysemy, and context sensitivity of the word meanings and the frequency of words into account and explore better ways of modeling the word-class distributions in semantic vector spaces.