Understanding the Source of Semantic Regularities in Word Embeddings

Semantic relations are core to how humans understand and express concepts in the real world using language. Recently, there has been a thread of research aimed at modeling these relations by learning vector representations from text corpora. Most of these approaches focus strictly on leveraging the co-occurrences of relationship word pairs within sentences. In this paper, we investigate the hypothesis that examples of a lexical relation in a corpus are fundamental to a neural word embedding’s ability to complete analogies involving the relation. Our experiments, in which we remove all known examples of a relation from training corpora, show only marginal degradation in analogy completion performance involving the removed relation. This finding enhances our understanding of neural word embeddings, showing that co-occurrence information of a particular semantic relation is not the main source of their structural regularity.


Introduction
The representation of words has been a longstanding task in natural language processing (NLP). The main underlying principle has been known for decades, as explained by Firth (1957): the meaning of a word can be understood by its surrounding company (i.e., the words in its context). Most modern representation learning theory in NLP is based on this assumption, with vector representations being the most successful area to date (Turney and Pantel, 2010). More recently, low-dimensional word representations learned from text corpora using neural networks (i.e., word embeddings) have emerged (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017), stemming from cognitive frameworks based on distributed representation (Feldman and Ballard, 1982). Neural word embeddings have been shown to contain useful information about concepts and entities, and to provide a generalization boost to many NLP applications (Goldberg, 2017). Surprisingly, these representations have also been shown to exhibit linear relationships between words in the vector space, demonstrated by analogy. For example, Mikolov et al. (2013b) showed that a simple operation such as king - man + woman will result in a point near queen in the vector space (see Section 2.1 for more details on how word analogies work). These word analogies have been extensively investigated in the literature, aiming to shed light on this surprising property. However, while there has been a body of research seeking to understand how these analogies work (Arora et al., 2016; Gittens et al., 2017; Ethayarajh et al., 2019; Allen and Hospedales, 2019), and noting issues with their methodology (Linzen, 2016; Nissim et al., 2020), there has not been a specific analysis of the source of statistical cues that leads to their high performance on this task.
Concurrently, a thread of research has focused on explicitly modeling lexical relationships of word pairs in text corpora (Espinosa-Anke and Schockaert, 2018; Washio and Kato, 2018; Joshi et al., 2019; Camacho-Collados et al., 2019). While these methods employ different means for learning relation vectors, they share a common initial premise: only co-occurring words in the corpus are considered. Some of these methods also provide tools to learn representations for out-of-vocabulary pairs (Joshi et al., 2019; Camacho-Collados et al., 2019), but their initial vector spaces are based on co-occurring word pairs only. While this simplifying assumption works well enough in practice, providing a useful signal even in downstream NLP applications, in this paper we find that valuable information is likely lost in the process. In fact, we find that a text corpus provides enough information to infer pairwise relations without training on any specific examples of a given relation.
In our experiments, we focus on semantic relations in particular, which are the most suited for both word and relation embedding models. Neural word embeddings are used to learn representations from text corpora, with word analogies as the evaluation mechanism to test our hypothesis. We run an extensive set of control experiments in which the co-occurrence information is completely removed from the reference corpora. The results show that, with relationship instance removal, analogy performance degrades only to a limited extent overall. This finding suggests that neural embeddings do not learn lexical relation regularities from examples of the relation, but that these regularities can still be inferred through the semantic featurization of individual words.

Understanding analogies in word embeddings
The surprising result of Mikolov et al. (2013b), showing that word embeddings can solve linear analogy problems, led to careful investigation by researchers from different fields. One line of research proposed mathematical formalisms to try to understand the intrinsic properties of word embeddings. Arora et al. (2016) were among the first to provide a rigorous theoretical explanation of the linear algebraic structure of word embeddings. Their formalism is based on a latent variable model that makes assumptions about the nature of the vector space. Later works rely on the notion of paraphrasing (Gittens et al., 2017; Allen and Hospedales, 2019), based on the observation that different words can be used interchangeably in similar contexts, dropping some of the assumptions made by Arora et al. (2016). Concurrently, other works have attempted to explain the compositional properties of distributional models through vector addition (Levy and Goldberg, 2014a; Paperno and Baroni, 2016; Ethayarajh et al., 2019), which lies at the core of word analogy completion. While these works formalize word analogies and attempt to explain how they work mathematically, our empirical analysis is focused on understanding the source of signal in corpora that affects the performance of word analogy completion, without asserting any predefined assumption. In particular, we are mostly interested in determining whether relationship pair co-occurrence in sentences is necessary for a word embedding to succeed at analogy completion.

Issues in word analogies
A number of publications have identified methodological issues in solving the word analogy task with word embeddings. Levy and Goldberg (2014a) found that the addition operation may not be optimal, as it reduces to three separate similarity problems that can be solved through more appropriate operations. Linzen (2016) showed that simple baselines based on nearest-neighbour searches are competitive in the analogy categories proposed by Mikolov et al. (2013b). Because of this, a new dataset was later proposed, partially addressing some of the previous shortcomings. Other works have shown that linear relationships, while implicit, are not directly apparent in the word embedding space, and therefore word analogies may not be the best method to retrieve this information (Schluter, 2018). Finally, Gonen and Goldberg (2019) and Nissim et al. (2020) cautioned against over-reliance on analogies as a means to uncover and correct for biases in word embeddings.
These methodological observations challenge the supremacy of analogy evaluations as the optimal proxy for the downstream task performance of a word embedding. Nevertheless, analogies remain a valuable mechanism with which to compare the semantic regularities of two different neural embeddings. In particular, word analogies represent an ideal benchmark for our research questions, as the impact of co-occurrence statistics within word relations can be evaluated directly through analogy validation. This would not be the case for more complicated tasks such as relation classification or extraction, which may introduce additional confounds.

Methodology
In this section we explain the experimental methodology we follow to answer our main research question. First, we briefly describe how to solve word analogies using word embeddings (Section 3.1). We then explain our methodology to compile corpora to train word embeddings (Section 3.2).

Solving word analogies with neural word embeddings
The first step in solving word analogies with neural word embeddings is to learn word vectors from an unlabeled text corpus. To do so, standard word embedding models such as Word2Vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014) or FastText (Bojanowski et al., 2017) are often used. The output of these models is a vector space in which each word is represented as a single point. With these vectors, mathematical operations can then be performed to solve a given analogy. Formally, given three words (a, b and c), the task of word analogy completion consists of predicting the most appropriate word d that satisfies a is to b as c is to d. In this case, both a-b and c-d are part of the same relationship. For instance, given Paris, France and Berlin, the word to retrieve would be Germany, as both Paris-France and Berlin-Germany belong to the capital-of relation. With word embeddings this can be solved with the simple vector operation b − a + c, retrieving the word whose vector is closest to that point in the space.
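The b − a + c procedure can be sketched in a few lines of code. The toy vectors below are illustrative assumptions, not trained embeddings; as in the standard protocol, the three query words are excluded from the nearest-neighbour search.

```python
import numpy as np

# Toy 4-dimensional embeddings for illustration only; real models
# (e.g. Word2Vec) produce ~300-dimensional vectors learned from a corpus.
emb = {
    "paris":   np.array([0.9, 0.1, 0.8, 0.0]),
    "france":  np.array([0.1, 0.9, 0.8, 0.0]),
    "berlin":  np.array([0.9, 0.1, 0.0, 0.8]),
    "germany": np.array([0.1, 0.9, 0.0, 0.8]),
}

def solve_analogy(a, b, c, emb):
    """Return the word d maximizing cosine similarity to b - a + c,
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # exclude query words from the search
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(solve_analogy("paris", "france", "berlin", emb))  # germany
```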
Word analogy completion is used as the main evaluation for our experiments.

Corpus preparation
Our main research question is whether an explicit observation of a relationship in a corpus is necessary to complete an example of that relationship via analogy. To this end, given a reference unlabeled corpus, we devise the following methodology per lexical relation type.

Sentence removal
First, for each relation type (e.g. capital-of ) in a dataset, we remove all sentences from the corpus that contain word pairs belonging to the relation. This results in a modified corpus for each relation type and a respective word embedding for each relation type trained on the modified corpus. For example, for the pair Lisbon-Portugal, we would remove all sentences from our reference corpus where Lisbon and Portugal co-occur, and this process would be repeated for all pairs of the capital-of relation.
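The removal step amounts to a co-occurrence filter over the corpus. A minimal sketch, where the pair list and tokenized sentences are hypothetical:

```python
def remove_cooccurrences(sentences, pairs):
    """Drop every sentence in which both words of some relation pair co-occur."""
    norm_pairs = [(a.lower(), b.lower()) for a, b in pairs]
    kept = []
    for sent in sentences:
        tokens = {t.lower() for t in sent}
        # A sentence is removed only if it contains a *full* pair;
        # sentences mentioning just one of the two words are kept.
        if any(a in tokens and b in tokens for a, b in norm_pairs):
            continue
        kept.append(sent)
    return kept

corpus = [
    ["Lisbon", "is", "the", "capital", "of", "Portugal"],  # removed
    ["Lisbon", "is", "a", "coastal", "city"],              # kept
    ["Portugal", "borders", "Spain"],                      # kept
]
print(remove_cooccurrences(corpus, [("Lisbon", "Portugal")]))
```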

Sentence replacement
This setting is similar to the sentence removal strategy, with the added inclusion of new sentences to replace the removed ones. In particular, for each removed sentence containing a word pair of a relation type, two sentences from a similar corpus replace it, where each sentence contains one of the words in the pair. This is arguably the setting most fairly comparable to no removal, as the number of occurrences of each word in the relation remains approximately the same with respect to the default setting. The overall number of sentences is slightly higher, though this is negligible in comparison to the full corpus (see Section 4.1 for the specific details on the corpora used for the evaluation).
For both the sentence replacement and sentence removal strategies, we also experiment with a more aggressive setting. In this setting (referred to as removal+ or replacement+ in our experiments), all sentences containing any two words from the vocabulary of the relation are removed, in the case of removal+, or replaced, in the case of replacement+. This is the most aggressive setting, in which no co-occurrence information of a given relation is preserved.
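The aggressive variant generalizes the filter from annotated pairs to any two words of the relation's vocabulary. A sketch under the same assumptions as before (hypothetical tokenized sentences):

```python
def remove_plus(sentences, pairs):
    """removal+: drop any sentence containing two or more distinct words
    from the relation's vocabulary, even if they do not form an annotated pair."""
    vocab = {w.lower() for pair in pairs for w in pair}
    return [s for s in sentences
            if len({t.lower() for t in s} & vocab) < 2]

capital_of = [("Lisbon", "Portugal"), ("Madrid", "Spain")]
corpus = [
    ["Portugal", "borders", "Spain"],  # removed: two vocabulary words, not a pair
    ["Madrid", "is", "a", "city"],     # kept: only one vocabulary word
]
print(remove_plus(corpus, capital_of))  # [['Madrid', 'is', 'a', 'city']]
```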

Evaluation
In this section we provide the details of our experimental setup (Section 4.1) and then, the main results of our evaluation (Section 4.2).

Experimental Setting
In the following we describe the experimental setting for all our experiments. More details and code to reproduce our experiments can be found online.
Text corpora. As our reference corpus we selected UMBC (Han et al., 2013), a diverse 3-billion-token corpus of paragraphs extracted from the web, amounting to a total of 132M sentences. In particular, we randomly select 80% of all sentences as the base corpus for the experiments. We then use the remaining 20% to add replacement sentences when necessary (see Section 3.2.2 for more details on the replacement setting). To complement the main results and test the generalization of our findings, we also use Wikipedia (dump of November 2016; 2 billion tokens and 104M sentences) for the base removal experiments.
Word embedding models. For word embedding models, we use both the CBOW and Skip-Gram variants of Word2Vec (Mikolov et al., 2013a). These neural representation learning approaches have been shown to be amenable to analogy completion since their introduction (Mikolov et al., 2013b). Unlike FastText (Bojanowski et al., 2017), they do not include character information and are therefore more suitable for our experiments as pure word-based models. We use standard hyperparameters for both CBOW and Skip-Gram, with 300 dimensions and a window size of 10 in both cases. Given the difference in speed (CBOW being around five times faster to train) and the small performance difference, we use CBOW for all our main experiments, but include Skip-Gram results in the appendix.
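A configuration sketch of the training setup with the hyperparameters stated above (300 dimensions, window of 10). The corpus path, worker count, and the use of the gensim implementation of Word2Vec (rather than the original C tool) are assumptions:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical path to one of the (possibly filtered) corpora,
# one tokenized sentence per line.
corpus = LineSentence("umbc_capital-of_removed.txt")

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # dimensionality used in the paper
    window=10,        # context window used in the paper
    sg=0,             # 0 = CBOW (main experiments); 1 = Skip-Gram
    workers=48,       # assumed to match the 48-core training node
)
model.wv.save("cbow_capital-of_removed.kv")
```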
Validation datasets. The Google (Mikolov et al., 2013b) and BATS relation example datasets are used for analogy completion. BATS was introduced after the Google dataset to address some of its shortcomings in the number and type of relations, so the inclusion of both datasets in our experiments helps give a more general overview. In particular, we focus on the semantic relations of each dataset. Table 1 shows the statistics of the relations considered in each of the datasets.
Evaluation protocol. For solving word analogies, we follow the original methodology of Mikolov et al. (2013b), as explained in Section 3.1; as in the original protocol, words that appear in the analogy instance are excluded from the nearest-neighbour search. For simplicity, we unify the evaluation setting for both datasets and only consider a single solution. In the case of BATS, we consider the first answer as provided in the dataset, which is generally the most specific. With the expectation that performance after co-occurrence removal may degrade substantially, we report analogy completion results using recall at 1 (accuracy), 10, and 50. Recall@50, for example, reports the percentage of analogy completions in which the expected word was among the 50 nearest neighbours to the result of the three-word subtraction and addition. The inclusion of recall at different thresholds allows for a more complete overview of performance, as the standard accuracy measure alone may not reflect the full picture (Schluter, 2018).
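Recall@k over a set of analogy instances reduces to checking the rank of the expected word. A small illustration with hypothetical ranks (a rank of 51 encodes "not in the top fifty"):

```python
def recall_at_k(ranks, k):
    """Fraction of instances whose correct completion is among the
    k nearest neighbours (ranks are 1-based)."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical ranks of the expected word for six analogy instances:
ranks = [1, 3, 1, 27, 51, 2]
print(recall_at_k(ranks, 1))   # 2/6: R@1, i.e., standard accuracy
print(recall_at_k(ranks, 10))  # 4/6
print(recall_at_k(ranks, 50))  # 5/6
```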
Training. For each relation type, we trained word embeddings on three different variants of each corpus (i.e., the original corpus and the two resulting from our removal and replacement strategies, as explained in Section 3.2). To reduce the amount of training, we only considered the default and removal strategies for Wikipedia, while all experiments are performed on the main UMBC reference corpus. In total, we compiled 156 different corpora, occupying 2.2TB of disk space, and learned 184 different word embedding models, totalling around 1,980 hours (around 83 full days) of model training on a high-performance single-node (48-core) system.

Results
As explained in the previous section, our experiments are aimed at understanding the role of co-occurring words in a given relation type (e.g., capital-of) as it pertains to analogy completion. Table 2 shows the main set of results of the CBOW model on the UMBC corpus. As can be observed, and as expected, the default model (i.e., no co-occurrence removal) trained on the original corpus provides the highest analogy completion results overall. Less expected, however, is the low magnitude of the performance decrease in the experiments involving co-occurrence removal. The default experiments completed analogies with 51.9% accuracy (R@1) on average, compared to 42.7% with the most aggressive replace plus (Rp+) strategy on the Google dataset, and 18.1% vs. 16.0%, respectively, on the BATS dataset. Figure 1 shows the average decrease in performance of the replace strategy (Rp) per relation type as a percentage of the default performance. For R@1, the decrease in performance is lower than 10% for the majority of relations. When considering R@10 and R@50, this decrease is even less pronounced, which suggests that the main geometrical features of the space were largely preserved (a visualization of the space is presented in Section 5.2). The significant decrease for the animal-shelter relation is a special case, as the performance of the default model was very low to begin with (3.3%), which highlights the difficulty of modeling that particular relationship via word analogies. The same can be said of city-in-state (default accuracy of 14.1%), the relation with the second highest decrease in performance for R@1.
Finally, Table 3 shows experimental results for the default, remove (Rm) and remove plus (Rm+) strategies for models trained on the Wikipedia corpus. The results are slightly higher overall, given the clean, consistent, and topically comprehensive nature of the corpus. The overall difference between the default and removal strategies is similar to that of UMBC (recall that the remove plus strategy consists of removing all sentences where any two words from a given relation co-occur).

Analysis
In this section we aim to better understand the results presented in the previous section. In particular, we analyze to what extent the observed performance drops are correlated with the frequencies of words in the relationship (Section 5.1). We also compare, through principal component analysis (PCA) visualization, the structure of the relation with the highest point decrease, before and after co-occurrence removal (Section 5.2).

Correlation with word frequency
A natural question that may arise when looking at the results is whether word frequency has any influence on the performance drop of co-occurring word pairs. For instance, one may wonder whether a high-quality word embedding, which is generally achieved when word frequency is sufficiently high in the corpus, is enough to compensate for the lack of sentences containing the words that form a certain relation. Or, alternatively, whether the relative frequency of co-occurring words in a relation has an effect on the final embedding, as this would mean that co-occurrence frequency is a necessary condition for learning robust semantic regularities. To answer these questions, we computed the correlation between word and pair frequencies in a word analogy instance and the performance drop. As frequency indicators, we computed two numbers, H_ind and H_pair (we use the harmonic mean throughout because it is generally more robust to outliers, e.g., a highly frequent word, than the usual arithmetic mean):

1. The harmonic mean of the frequencies of all individual words in the analogy instance (H_ind), computed as follows:

H_ind = n / (Σ_{i=1}^{n} 1/x_i)

where x_i is the corpus frequency of a word in the given word analogy instance and n is the number of words, i.e., four in the case of individual words in word analogies. For example, the harmonic mean of the four words king (220,958 occurrences in UMBC), queen (52,262), man (751,262), and woman (296,915) is 141,048.

2. The relative pairwise frequency (H_pair), computed as the harmonic mean of the number of sentences where the two words of each pair co-occur (H_co) divided by the previous number:

H_co = m / (Σ_{i=1}^{m} 1/p_i),   H_pair = H_co / H_ind

where p_i corresponds to the co-occurrence frequency of a relation pair in the word analogy instance and m is the number of pairs, i.e., two. This number gives an indication of how relevant the co-occurrence information of a given word pair is. Following the previous example, the relative pairwise frequency H_pair of the instance composed of king-queen (5,498 joint co-occurrences in UMBC) and man-woman (36,189) is 0.068.

With respect to performance drops, for each analogy completion instance we considered the ranks of the correct completion words in both the default and replace settings and computed the difference (we only considered the position of the first fifty nearest neighbours; if the correct word was not among them, 51 was used as the position, which is equivalent to a wrong answer). Table 4 shows the correlation results on the Google analogy dataset. Not surprisingly, the correlation between the individual frequency of the words in an instance and the rank difference is negative in all relation types except one (nationality-adjective). However, the correlation is rather weak, as the added sentences compensate for the initial removal, even if the sentences are of a different kind. As for the relative frequency of the pairs in the instance, the correlation is positive, as expected. In this case, the signal is stronger than in the individual frequency case, especially for the family relationship. Overall, this experiment shows some support for our initial premises on the effect of relative pair frequencies, but further research would be necessary to understand other reasons behind the performance drop.

Table 4: Pearson correlations between frequency (averaged over all instances in the dataset) and performance drop between the default and replace (Rp) corpora.
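The two indicators can be checked directly against the worked king/queen example; note that reproducing the reported value of 0.068 requires dividing the pairwise harmonic mean by H_ind. The frequency counts below are taken from the example in the text:

```python
from statistics import harmonic_mean

# UMBC frequencies from the worked example in the text:
word_freq = {"king": 220958, "queen": 52262, "man": 751262, "woman": 296915}
pair_freq = {("king", "queen"): 5498, ("man", "woman"): 36189}

h_ind = harmonic_mean(word_freq.values())  # harmonic mean of word frequencies
h_co = harmonic_mean(pair_freq.values())   # harmonic mean of pair co-occurrences
h_pair = h_co / h_ind                      # relative pairwise frequency

print(round(h_ind))      # 141048
print(round(h_pair, 3))  # 0.068
```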

Visualization
In this section, we present visualizations of the word pairs from one lexical relation, before and after co-occurrence removal, in order to gain insight into the effect of removal on the learned structure of the space. In particular, we selected the relation from the Google dataset with the largest raw performance point drop (in the appendix we also include the same visualizations for the relations with the second largest performance drop, city-in-state, and the smallest performance drop, nationality-adjective, which largely support the same conclusions). Figure 2 shows the two principal components of word pairs in the capital-world relation for both the default (Def) and remove (Rm) settings. Recall from Section 3.2.1 that the remove setting involves removing all sentences where both words from a relation co-occur, without any replacement. As can be seen, even in the case of a relation with an R@1 performance gap as high as 36.6%, the linear relations are largely preserved in the word embedding space. This is also supported by the fact that the performance drop is much smaller for R@10 and R@50 (see Figure 1), suggesting that the correct completion word is still nearby in the space. For example, for the instance Australia, Canberra, Spain, the correct word Madrid is found as the fifth nearest neighbour in the Rm vector space, with the top two words being Barcelona and Valencia (other large Spanish cities). Intuitively, this error does not affect the overall representation of the relation in the vector space, as those words were already in a similar linear relationship in the default model.
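The two-dimensional projections used in such figures come from standard PCA. A minimal numpy sketch of the projection step; the embedding matrix below is random placeholder data, not a trained space:

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components via SVD."""
    X = vectors - vectors.mean(axis=0)  # centre the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                 # coordinates in the top-2 PC basis

# Placeholder stand-ins for the embeddings of two capital-country pairs:
rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 300))     # 4 words, 300 dimensions
points = pca_2d(vectors)
print(points.shape)  # (4, 2)
```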

The source of semantic regularities
In this paper, we find that neural word embeddings (i.e., Word2Vec models) do not require observation of instances of a relation (e.g., Madrid is the capital of Spain) in order to maintain nominal accuracy in relation completion tasks. We believe this is the first time such an observation has been made empirically using natural language, though a similar effect has been observed in neural embeddings trained on non-linguistic data (Pardos and Nam, 2020).
In Mikolov et al. (2013b), where Word2Vec models were first introduced, the phrase "Linguistic Regularities" was used. While it was not made explicit what the word regularities referred to, analogy completion was exclusively used for validation, leaving open the possibility that regularities referred to some pattern of structure allowing lexical relationships to be expressed. If the regularities relevant to analogy completion are not formed from examples of the lexical relationship contained in the analogy, then how are they formed, and how was the accuracy of the completion mostly retained in our experiments in the absence of examples? The model may instead leverage the robustness of regularities, or features, learned about individual words to lay the structural foundation for inferences to then be made about a lexical relation. Removing co-occurrences of capitals and countries, for example, would not completely remove the concept of capitals and countries from the corpus. The embedding of "Madrid" would likely still encode features associated with a busy city, government buildings, culture, and European regionality. This is also related to work showing that relations and relevant information about relations can be captured from word embeddings (Jadhav et al., 2020), even if the relation cannot be retrieved explicitly through linear transformations. Interestingly, however, our results indicate that the frequency of an individual word in a corpus is only weakly related to the robustness of the features leveraged for successful analogy completion.
Finally, even though co-occurrences of pairs from a specific relation are not necessary to learn the relevant features, word pairs may still play a critical role in regularity development. Most word embedding models (including the Skip-Gram model of Word2Vec) are, after all, trained in pairwise fashion, making predictions and calculating loss based on each pair of input and context words.

Cognitive perspective
Neural embeddings come from a cognitive perspective on semantic representation. They stem from a hypothesized architecture of the mind called Connectionism (Feldman and Ballard, 1982), in which emergent concepts (Hopfield, 1982) are learned as distributed representations across the embedding space. If neural word embeddings are a candidate model of a component of human cognition, then our results suggest that the faculties of the mind that understand relational concepts (e.g., male and female) may establish these concepts primarily through induction and observations of behavior. For example, this would mean that we learn features of male and female separately, rather than through explicit declaration of representative pairs (i.e., explicit co-occurrences). It is perhaps a separate faculty of the mind that queries this conceptual representation framework for inferences to be made about relationships between new elements. These inferences, conducted by way of analogy, may indeed be key to innovation (Hope et al., 2017) and a possible component of human creativity (Holyoak et al., 1996).

Conclusion and Future Work
In this paper we have presented a large-scale analysis on the role of co-occurring relational word pairs in completing analogies. In the analyses we have measured to what extent the loss of co-occurrence information within relation types affects analogy completion using neural word embeddings. Perhaps surprisingly, this effect is quite small, to the point that word embeddings can complete analogies of a relationship in the vector space even if the co-occurrence information from the reference corpus is totally removed.
In order to complement this analysis, for future work it would be interesting to analyze to what extent the conclusions of this analysis apply to purely distributional models (e.g., PMI-based), as they have been shown to share similar properties with word embeddings (Levy et al., 2015), to the point that Skip-Gram can be viewed as an implicit co-occurrence matrix factorization (Levy and Goldberg, 2014b).
Moreover, the analysis could be extended to other types of relations beyond semantic ones. Further investigation could then focus on how the main sources of concepts and linguistic regularities in word embeddings are learned, and how they can be leveraged to improve unsupervised relation models (e.g., Joshi et al., 2019). Finally, as a follow-up to recent work aiming at understanding how language models and contextualized embeddings capture relations (Petroni et al., 2019; Bouraoui et al., 2020; Jiang et al., 2020), further research could be devoted to analyzing the performance of such models with and without pairwise co-occurrence information.