Lexical Relation Mining in Neural Word Embeddings

Work with neural word embeddings and lexical relations has largely focused on confirmatory experiments which use human-curated examples of semantic and syntactic relations to validate against. In this paper, we explore the degree to which lexical relations, such as those found in popular validation sets, can be derived and extended from a variety of neural embeddings using classical clustering methods. We show that the Word2Vec space of word-pairs (i.e., offset vectors) significantly outperforms other more contemporary methods, even in the presence of a large number of noisy offsets. Moreover, we show that via a simple nearest neighbor approach in the offset space, new examples of known relations can be discovered. Our results speak to the amenability of offset vectors from non-contextual neural embeddings to find semantically coherent clusters. This simple approach has implications for the exploration of emergent regularities and their examples, such as emerging trends on social media and their related posts.


Introduction
Word vector models such as Word2Vec (Mikolov et al., 2013a), derived empirically from large corpora of natural language, provide the opportunity to explore what constitutes a linguistic regularity. Conventionally, lexical relations in word vector space have been defined by collections of relatively consistent relationships, or vector offsets, between word-pairs. The presence of these relationships has been established through confirmatory analysis (Levy and Goldberg, 2014), in which a pair of relation examples constructed from prior knowledge is validated to exist in the space by way of analogy (e.g. geese − goose + mouse ≈ mice) (Finley et al., 2017). The meaning of these collections, or the nature of the relationship between their respective pairs, can be characterized by relation type descriptions such as syntactic (e.g., "plurality") and semantic (e.g., "capital-country") relations.
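The offset arithmetic behind analogy completion can be sketched with toy vectors. The embeddings below are hypothetical stand-ins, purely for illustration; real Word2Vec vectors would be 300-dimensional:

```python
import numpy as np

# Hypothetical 4-dim embeddings for illustration only.
emb = {
    "goose": np.array([1.0, 0.0, 0.0, 0.0]),
    "geese": np.array([1.0, 1.0, 0.0, 0.0]),
    "mouse": np.array([0.0, 0.0, 1.0, 0.0]),
    "mice":  np.array([0.1, 1.0, 1.0, 0.0]),
    "cat":   np.array([0.0, 0.0, 0.0, 1.0]),
}

def analogy(a, b, c, table):
    """Return the word closest (by cosine) to b - a + c, excluding the inputs."""
    query = table[b] - table[a] + table[c]
    best, best_sim = None, -np.inf
    for w, v in table.items():
        if w in (a, b, c):
            continue
        sim = query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

answer = analogy("goose", "geese", "mouse", emb)  # analogue of geese - goose + mouse
```

In a real embedding the same nearest-neighbor search runs over the full vocabulary, which is exactly the confirmatory analogy test described above.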
In this paper, we explore how well lexical relation examples can be clustered using word vectors extracted from state-of-the-art contextual and non-contextual neural word embedding methods. Furthermore, we demonstrate a method for approximating the number of true lexical relations in a noisy offset space. Contextual embeddings such as BERT (Devlin et al., 2018) are the new gold standard on a variety of NLP tasks and have also been shown to possess relational and factual knowledge (Petroni et al., 2019). We perform the unsupervised task of clustering relation examples using contextual word vectors from BERT; non-contextual word vectors from FastText (Bojanowski et al., 2017) and Word2Vec skip-gram (Mikolov et al., 2013a); and explicitly modeled relation vectors (Camacho Collados et al., 2019). Our results show that Word2Vec offsets outperform relation vectors from other embeddings on this task.
We further explore the amenability of the Word2Vec offset space to clustering by evaluating it on the task of discovering new word-pair examples of known lexical relations. The precision of the new examples found in the offset space varied greatly by lexical relation, indicating promise while demonstrating the need for more nuanced noise rejection in future work.

Related Work
Relations, in the context of a neural word embedding (Mikolov et al., 2013a), have been described as vector offsets (or differences) between pairs of word vectors that exemplify a relationship. The types of relations and their exemplar pairs have conventionally been established a priori (Mikolov et al., 2013c). Sets of relations and their respective pairs have served as external validations, tested by way of analogy, of the degree to which a vector space encodes a sampling of prior knowledge on the semantics and syntax of the language. In terms of the accuracy of analogy completion (i.e., recall@1), widely-cited neural embeddings have been found to encode upwards of 61% (Mikolov et al., 2013b), and even 75% (Pennington et al., 2014), of tested word relationship pairs. At this level of accuracy, it is reasonable to begin to ponder what can be learned from an embedding in a completely unsupervised way, as opposed to only confirming what predefined expert knowledge is encoded.
Several approaches have taken first steps in an unsupervised direction by exploring whether the membership of a word-pair in a relation can be recovered through classification or clustering. Vylomova et al. (2016) use supervised classification to learn an association between vector offsets produced by word-pairs and relation labels (or classes). While their method considers only context-free word vectors such as Word2Vec and GloVe (Pennington et al., 2014), our method also evaluates performance on contextual and explicit relation embeddings. Levy and Goldberg (2014) surfaced semantics about relations by characterizing what they refer to as a shared aspect between words, found by inspecting the context words, or features. Evidence of distributed concepts, also learned from skip-gram models, has been observed in neural embeddings produced from random walks of graphs (Ribeiro et al., 2017) and from sequences of course enrollments (Pardos and Nam, 2020).
Recently, a number of methods have been developed that capture relations between two words in the form of a vector using unsupervised approaches. Espinosa-Anke and Schockaert (2018) learn relation vectors of word-pairs by averaging their context word vectors and then performing dimensionality reduction with an autoencoder. Jameel et al. (2018) model relationships between words as weighted bag-of-words representations using generalizations of pointwise mutual information. Camacho Collados et al.
(2019) develop a latent variable model that aims to explicitly determine which words from given sentences best characterize the relationship between two target words. Wu and He (2019) and Papanikolaou et al. (2019) use contextual models to learn relation embeddings, which requires fine-tuning of the base model. Though these methods learn explicit relation embeddings in an unsupervised fashion, they evaluate them on supervised tasks, unlike our approach, which evaluates neural embeddings on unsupervised tasks.

Datasets
We use a common validation set from the analogical reasoning task introduced in Mikolov et al. (2013a), which we call the Google validation dataset, as a source of known lexical relations and their corresponding collections of word-pairs. It contains 550 unique word-pairs belonging to 13 unique relations. We also use another popular lexical relation validation set, DiffVec (Vylomova et al., 2016), which consists of 36 relations and 12,458 word-pairs. The Google dataset is more balanced than the DiffVec dataset in terms of the number of word-pairs per relation. We train all models used in our model comparison experiments on the Wikipedia corpus. We then use the pre-trained Google-News corpus Word2Vec vectors (Mikolov et al., 2013b) for the subsequent task of finding new examples of known relations.
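The Google validation set is distributed as a plain-text file in which a ": <relation>" header precedes four-word analogy lines; assuming that format, a minimal parser that collects the unique word-pairs per relation (shown here on a small inline excerpt rather than the real file) might look like:

```python
from collections import defaultdict

# A small excerpt in the questions-words.txt format (": <relation>" headers
# followed by four-word analogy lines); the actual file has many more lines.
sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: gram7-past-tense
dancing danced decreasing decreased
"""

def relation_pairs(text):
    """Collect the unique word-pairs of each relation from analogy lines."""
    pairs = defaultdict(set)
    relation = None
    for line in text.splitlines():
        if line.startswith(":"):
            relation = line[1:].strip()
        elif line.strip():
            a, b, c, d = line.split()
            # Each analogy line contributes two word-pairs of the relation.
            pairs[relation].update({(a, b), (c, d)})
    return pairs

pairs = relation_pairs(sample)
```

The same loop applied to the full file yields the 550 unique pairs and 13 relations cited above.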

Motivation
Using the Word2Vec model, a lexical relation between a pair of words (word1, word2) can be represented by a vector offset: the embedding of word2 subtracted from the embedding of word1. Past success in the analogy completion task suggests that offsets generated by word-pairs in the same lexical relation are empirically close to parallel (e.g., geese − goose ≈ mice − mouse) (Finley et al., 2017). This naturally presents an opportunity for the clustering of offsets and the exploration of the distributed representation of lexical relations.
A novel visualization of the offsets of the Google dataset can be seen in Figure 1. The figure was produced using t-SNE (Van Der Maaten, 2014) to project the 300-dimensional offset vectors onto a 2-D space. It depicts offsets naturally grouping by relation and demonstrates the plausibility of a clustering approach capturing relation examples in the offset space. Visualization of offsets had previously been applied only to one relation at a time (Mikolov et al., 2013b). With multiple relation examples represented in a single plot, we can observe relative proximities between groups of relations. For example, "nationality" is closest to "currency," both generated by subtracting a word from a country name. Offsets for "gender" and "opposite" are closest to one another, as are those for "comparative" and "superlative" and for "capital-world" and "city-in-state." "Plural" (nouns) and "plural-verbs" are not only close to one another but also somewhat overlapping. Word vector models, when used for relation mining, are of sociological interest because they are incentivized to capture norms, and the exploration of those norms may be revealing. Unsupervised relation mining is also important for many downstream NLP tasks, such as the automatic building of knowledge graphs. Knowledge bases like WordNet are manually annotated and are used in a variety of tasks such as question answering (Hao et al., 2017) and entity and event extraction (Yang and Mitchell, 2019). Recently, Chiang et al. (2020) demonstrated a robustness of the semantic structure of Word2Vec embeddings, showing that relations could be learned even without examples of them having been observed in the training corpus.
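A projection in the spirit of Figure 1 can be produced with scikit-learn's t-SNE implementation; the sketch below uses random stand-in vectors in place of real Word2Vec offsets, so only the mechanics (not the resulting structure) carry over:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for 300-dim offset vectors of labeled word-pairs; in the paper
# these are differences of Word2Vec embeddings, not random draws.
offsets = rng.normal(size=(60, 300))

# Project to 2-D for plotting; perplexity must be below the sample count.
proj = TSNE(n_components=2, perplexity=5, init="random",
            random_state=0).fit_transform(offsets)
```

Coloring each projected point by its relation label then reproduces the grouped-by-relation view described above.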

Relation Embeddings
In this section, we introduce the relation embeddings that will be used for comparison in the unsupervised task of relation clustering.
• Contextual Word Vectors: Contextualized word representations have recently shown significant improvements on various NLP tasks. We use BERT-Large (Devlin et al., 2018) pre-trained on the Wikipedia corpus. It produces contextualized embeddings of size 1,024 for each input token and has a vocabulary of 30,522 words. To represent the relation between a pair of words (w_1, w_2) using contextualized BERT embeddings, we first collect the set S of all sentences from the Wikipedia corpus containing occurrences of both words. For a sentence s ∈ S, let v_1^s and v_2^s be the BERT embeddings of w_1 and w_2 in s, computed by averaging the embeddings from the top four layers of the model and then averaging over multiple occurrences of each word in s. The difference form of the relation vector, analogous to a Word2Vec offset, is r_12 = (1/|S|) Σ_{s∈S} (v_1^s − v_2^s). The other form of a BERT relation vector that we use is the concatenation form, r_12 = (1/|S|) Σ_{s∈S} [v_1^s ; v_2^s], where [;] denotes concatenation and |.| denotes the cardinality of a set.
• Non-Contextual Word Vectors: We use FastText and the skip-gram model of Word2Vec to obtain non-contextual word vectors. Each of these models has a vocabulary of 2,145,353 words and generates word embeddings of size 300. We train Word2Vec on the Wikipedia corpus provided by Camacho Collados et al. (2019) using the skip-gram algorithm with a window size of 5, negative sampling, and a minimum word frequency of 5. We use the FastText model provided by Camacho Collados et al. (2019). Given a word-pair (w_1, w_2), let (v_1, v_2) and (u_1, u_2) be the word vectors generated by the Word2Vec and FastText models, respectively. The relation vector for the Word2Vec model is then the offset r_12 = v_1 − v_2, and for the FastText model it is r_12 = u_1 − u_2. We normalize Word2Vec and FastText offsets to unit vectors. We also experimented with normalizing the BERT relation vectors but found that it did not lead to improvements.
• Explicit Relation Embeddings: One of the methods that we use for comparison, RelPair (Camacho Collados et al., 2019), is a state-of-the-art method that explicitly calculates relation vectors using a latent variable model. Like many other relation embedding models (Joshi et al., 2018; Washio and Kato, 2018; Espinosa-Anke and Schockaert, 2018), their hypothesis is that the relationship between two words can be characterized by the distribution of words between them in the sentences in which they both appear. Though this is an unsupervised method for learning relation vectors, they report performance on supervised tasks. Let the relation vector for two words w_1 and w_2 be denoted by r_12. We generate r_12 by passing the ordered pair (w_1, w_2) through a pre-trained RelPair model (denoted "RelPair-model"). This model has a vocabulary of 1,138,119 pairs and is trained on the Wikipedia corpus. It generates relation embeddings of size 300 given an input word-pair.
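The layer- and occurrence-averaging scheme described for the BERT relation vectors can be sketched as follows, assuming the per-sentence activations have already been extracted. The array shapes are illustrative stand-ins, not the actual BERT-Large dimensions:

```python
import numpy as np

def bert_relation_vectors(layers_w1, layers_w2):
    """layers_w1/layers_w2: one array per sentence, each with shape
    (num_layers, num_occurrences, hidden). Average the top four layers,
    then occurrences, then sentences; return (difference, concatenation)."""
    v1 = np.mean([a[-4:].mean(axis=(0, 1)) for a in layers_w1], axis=0)
    v2 = np.mean([a[-4:].mean(axis=(0, 1)) for a in layers_w2], axis=0)
    return v1 - v2, np.concatenate([v1, v2])

# Hypothetical stand-ins for activations over two co-occurrence sentences:
# 12 layers, 1 occurrence per sentence, hidden size 8.
rng = np.random.default_rng(1)
s1 = [rng.normal(size=(12, 1, 8)) for _ in range(2)]
s2 = [rng.normal(size=(12, 1, 8)) for _ in range(2)]
diff, concat = bert_relation_vectors(s1, s2)
```

The difference output corresponds to the offset-style BERT relation vector and the concatenation output to the concatenated form; the Word2Vec and FastText offsets are the same difference computed on static vectors and then unit-normalized.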

Experiments
We explore the amenability of various neural word embedding spaces to be clustered using classical methods. We evaluate the homogeneity of clusters in the word-pair (i.e., vector offset) space with respect to a common lexical relation and evaluate the validity of new word-pairs found in the clusters of known relations.
In section 6.1 we will compare the results of unsupervised clustering of the different relation embeddings discussed above. Then in section 6.2 we will evaluate the best performing method from section 6.1 on the task of discovering new examples of known relations.

Unsupervised Clustering of Relations
Relation clusters mined in an unsupervised way can give insights into prominent lexical relations in a corpus and the typicality of concepts in a language, of potential cognitive and sociological interest. In this section, we perform experiments to evaluate Word2Vec, FastText, BERT, and RelPair relation vectors on this task. We also study their more practical suitability by analyzing their clustering performance in an extended offset space containing a dense set of noisy word-pairs. In the later section on discovering new examples of known relations, we further extend the space past the validation dataset vocabulary.
We use three types of clustering algorithms: K-Means clustering (Kanungo et al., 2002), spectral clustering (Zelnik-Manor and Perona, 2005), and hierarchical agglomerative clustering (HAC). The K-Means algorithm is simple and easily scalable. For initializing clusters, we use the K-Means++ algorithm of Arthur and Vassilvitskii (2006) and minimize the Euclidean distance between centroids and the relation embeddings. In spectral clustering, we construct the affinity matrix from a graph of nearest neighbors and generate a low-dimensional embedding using eigenvalue decomposition. For hierarchical clustering, we use the agglomerative (bottom-up) approach with complete linkage, which merges clusters based on the maximum Euclidean distance between relation embeddings of any two clusters. We have also repeated the experiments for HAC with cosine distance and complete linkage; details and results are given in Appendix B.
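All three clustering algorithms are available in scikit-learn; a minimal sketch on synthetic stand-in offsets (random blobs here, not real relation vectors) illustrates the configurations described above:

```python
from sklearn.cluster import (KMeans, SpectralClustering,
                             AgglomerativeClustering)
from sklearn.datasets import make_blobs

# Synthetic stand-in for relation offsets: 3 well-separated clusters.
X, _ = make_blobs(n_samples=90, centers=3, n_features=10, random_state=0)

# K-Means with K-Means++ initialization and Euclidean distance.
km = KMeans(n_clusters=3, init="k-means++", n_init=10,
            random_state=0).fit_predict(X)

# Spectral clustering over a nearest-neighbor affinity graph.
sc = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

# Bottom-up agglomerative clustering with complete linkage (Euclidean).
hac = AgglomerativeClustering(n_clusters=3,
                              linkage="complete").fit_predict(X)
```

Swapping `X` for a matrix of unit-normalized offset vectors reproduces the experimental setup on real data.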
For measuring clustering performance we use four metrics: homogeneity (H score), completeness (C score), V-measure (V score), and silhouette (S score). A high homogeneity score indicates that each cluster contains only word-pairs from a single relation. A high completeness score indicates that all word-pairs from a single relation are assigned to the same cluster. V-measure is the harmonic mean of the homogeneity and completeness scores. While ground truth labels are required to calculate homogeneity, completeness, and V-measure, silhouette scores are calculated without relation labels for the word-pairs. Let d_2 be the mean distance between a word-pair and all other word-pairs in the same cluster, and let d_1 be the mean distance between the word-pair and all word-pairs in the next nearest cluster. The silhouette score s for a single sample is then given as s = (d_1 − d_2) / max(d_1, d_2).
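All four metrics are implemented in scikit-learn; the toy labeling below (a hypothetical clustering that merges two relations into one cluster) makes the homogeneity/completeness distinction concrete:

```python
import numpy as np
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, silhouette_score)

# Hypothetical ground truth relations and a predicted clustering that
# merges relations 1 and 2 into a single cluster.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 0, 1, 1, 1, 1, 1, 1]

h = homogeneity_score(truth, pred)   # < 1: cluster 1 mixes two relations
c = completeness_score(truth, pred)  # = 1: each relation stays in one cluster
v = v_measure_score(truth, pred)     # harmonic mean of h and c

# Silhouette needs only the data and predicted labels, no ground truth.
X = np.array([[float(t), 0.0] for t in truth])
s = silhouette_score(X, pred)
```

The supervised scores require the `truth` labels, while `silhouette_score` is the one metric usable on the fully unsupervised extended datasets.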

Results
Since each of the relation embeddings has a different vocabulary size, we first filter the vocabulary to the words common to all of the trained models that are also present in the validation datasets. Table 1 summarizes the clustering scores of the relation vectors generated by Word2Vec, FastText, BERT, and RelPair on the Google and DiffVec datasets. It shows that Word2Vec offsets outperform the other embeddings with respect to the ground truth labels under the K-Means and HAC algorithms. When considering the average of all four scores, Word2Vec's average for HAC is 9.5% higher than for K-Means on the Google dataset, but 20% lower than for K-Means on the DiffVec dataset. This suggests that HAC marginally improves over K-Means when clustering pairs from the Google dataset, while K-Means significantly improves over HAC when clustering pairs from the DiffVec dataset. Thus, K-Means has the overall advantage when using Word2Vec embeddings for clustering relations, and we proceed with K-Means when analyzing relation clustering in a dense embedding space. Another advantage of K-Means for dense-space word-pairs is scalability: standard K-Means runs in time linear in the number of samples per iteration, whereas standard spectral and hierarchical clustering scale as O(n^3), making K-Means far easier to apply to large datasets. For the next set of experiments, where we cluster hundreds of thousands of word-pairs, we therefore use K-Means clustering.
We notice that the RelPair model has the smallest vocabulary overlap with both the DiffVec and Google datasets, which limits the size of the vocabulary used in the above experiment. Hence, we also compare Word2Vec and FastText against BERT and against RelPair separately. Details of these experiments are given in Appendix A in the supplementary material. We find that Word2Vec consistently outperforms both BERT and RelPair for all three clustering algorithms and for both evaluation datasets, with the exception of RelPair having a higher silhouette score on the DiffVec dataset.
Thus far, we have analyzed the performance of the various relation embeddings on word-pairs that are known to be part of a relation per the two validation datasets. In practice, however, clustering will encounter many noisy word-pair offsets that are not part of any relation. We now evaluate the clustering performance of the relation embeddings in the presence of such noisy pairs.
A simple method to determine whether a word-pair is unrelated is to check that the two words never co-appear in any sentence. Methods such as that of Espinosa-Anke and Schockaert (2018) use pointwise mutual information to determine the relatedness of word-pairs; however, this can hurt low-frequency word-pairs, so we proceed by filtering out the word-pairs that do not occur together in any sentence of the training corpus. This also ensures that a BERT relation embedding exists for the remaining pairs. We do not include RelPair vectors in the following analysis since that model has a very limited vocabulary and does not allow for word-pairs outside it. The BERT model allows for word-pairs as long as the two words appear in the same sentence, while the Word2Vec and FastText models allow us to calculate offsets between any two words as long as each word is in the model vocabulary.
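The co-occurrence filter can be sketched as a set-membership test over whitespace-tokenized sentences (the sentences and pairs below are illustrative; real Wikipedia text would need proper tokenization):

```python
sentences = [
    "the goose chased the mouse",
    "geese and mice were seen near the barn",
]
pairs = [("goose", "mouse"), ("geese", "mice"), ("goose", "mice")]

def cooccurring(pairs, sentences):
    """Keep only the pairs whose two words appear together in some sentence."""
    token_sets = [set(s.split()) for s in sentences]
    return [p for p in pairs if any({p[0], p[1]} <= t for t in token_sets)]

kept = cooccurring(pairs, sentences)  # ("goose", "mice") never co-occurs
```

On the full corpus this scan is done once, and the surviving pairs are exactly those for which a BERT relation vector can be computed.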
We expand our word-pairs to include additional noisy pairs by considering all possible pairwise combinations of the words in each of the Google and DiffVec datasets separately. We call these sets of word-pairs the "extended datasets". We then cluster them into 2n+1 clusters, where n is the number of original ground truth relations (13 for the Google dataset and 36 for the DiffVec dataset). One additional cluster represents all the word-pairs expected to be noise, which should not represent any relation. The additional n clusters represent the ground truth relations in the opposite direction, e.g., reversing the word-pairs of the "plural-to-singular" relation yields a "singular-to-plural" relation. Out of 104M sentences in the Wikipedia corpus, we found 18.48M sentences containing at least one pair from the extended Google dataset and 10.45M from the extended DiffVec dataset. In these experiments, we use mini-batch K-Means (Sculley, 2010), which converges faster than standard K-Means with only slightly worse performance; we use K-Means++-style initialization and a batch size of 10,000. Table 2 shows the performance of the four methods on the Google and DiffVec datasets. These results show that Word2Vec is the highest performer among all embeddings and across all metrics, with the exception of the silhouette score on the extended DiffVec dataset, where it places a close second. All four scores for all methods are low compared to the scores from the previous experiments on the non-extended datasets. This is likely due to treating the noisy, unrelated word-pairs as a single label and expecting them to cluster as such. Though this is a difficult setting, it is one we can easily imagine encountering in a real-life application.
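The 2n+1-cluster mini-batch run maps directly onto scikit-learn's `MiniBatchKMeans`; the sketch below uses random stand-in offsets and a smaller batch size than the 10,000 used in the experiments:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for an extended, noisy offset space (random vectors here; in the
# experiments these are unit-normalized Word2Vec offsets).
X = rng.normal(size=(5000, 50))

n_relations = 13  # the Google dataset's n; 2n+1 = 27 clusters
mbk = MiniBatchKMeans(n_clusters=2 * n_relations + 1, init="k-means++",
                      batch_size=1024, n_init=3, random_state=0).fit(X)
labels = mbk.labels_
```

Each mini-batch update touches only `batch_size` samples, which is what makes the method tractable on the tens of millions of extended-dataset offsets.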
To estimate the true number of clusters in these extended datasets without supervision, we perform mini-batch K-Means over a range of cluster counts and calculate the silhouette score, as shown in Figures 2 and 3. This approach assumes that the silhouette score correlates with the ground truth metrics, as it did to a moderate degree in the previous result tables. We find the number of clusters with maximum silhouette score to be 1,000 and 5,000 for the Google and DiffVec datasets, respectively. In Figure 2, a spike in silhouette score can be observed when the assumed number of clusters is 20, which is close to 27, the value used in the experiments summarized in Table 2. In Figure 3, no similar spike can be observed. We also report V-measure in this experiment, assuming the 2n+1 labels explained above; this labeling is an estimate and may not reflect the true set of ground truth labels. Details of these experiments are given in Appendix C in the supplementary material.
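The silhouette-based search for the number of clusters amounts to a sweep over k; a small sketch on synthetic data (plain K-Means here for determinism, where the experiments use the mini-batch variant) recovers the planted cluster count:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters stand in for relation offsets.
X, _ = make_blobs(n_samples=600, centers=4, n_features=20, random_state=0)

scores = {}
for k in range(2, 9):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, pred)

# Pick the k with the maximum silhouette score.
best_k = max(scores, key=scores.get)
```

On real, noisy offsets the curve is far less peaked, which is why the experiments inspect spikes rather than trusting the global maximum.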

Discovering New Examples of Known Relations
In this section, we study whether new examples of known lexical relations can be discovered in a further expanded Word2Vec offset space. This task is ideally performed on the full set of relation vector offsets corresponding to the complete enumeration of all possible ordered word-pairs, which grows quadratically with the size of the vocabulary. For tractability, we use pre-trained Google-News Word2Vec vectors with the Google dataset and limit the vocabulary to the union of the top 10,000 most frequent terms in the Google-News corpus and the 905 unique words which appear in the Google dataset, yielding a set of N = 10,354 unique words. This vocabulary generates 107,194,962, or N(N − 1), vector offsets (one per ordered word-pair): a dense collection which we call an open-world set. To tractably explore the offset space, we define 13 hypercones (Figure 4) around the centroids of the ground truth vector offsets representing each of the 13 Google dataset relations. These centroids were used instead of arbitrary points in the space in order to retain a degree of ground truth in the open-world set to guide the discovery of new examples. We restrict the size of each hypercone by considering only the 2,000 vector offsets closest (by cosine similarity) to each of the 13 centroids (26,000 vectors in total). For each of the 13 hypercones, we perform unsupervised k-search (Zelnik-Manor and Perona, 2005) to yield a clustering of its 2,000 vector offsets. For inspection, we narrow our analysis in the open-world set to one cluster in each of the 13 hypercones: the cluster with the highest number of labeled word-pairs.

Table 3: A summary of the 13 open-world clusters (one per hypercone subspace). For each cluster, the total number of word-pairs, the number of pairs which also appear in the validation set, and the recall (with respect to the validation set) are listed.
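The hypercone construction reduces to a top-k cosine-similarity query around each relation centroid; a sketch with random stand-in offsets (the real space holds the 107M Google-News offsets, and k is 2,000 rather than the 50 used here):

```python
import numpy as np

def hypercone(offsets, centroid, k):
    """Indices of the k offsets with the highest cosine similarity to the
    centroid, i.e., the 'hypercone' subspace around a known relation."""
    o = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)
    c = centroid / np.linalg.norm(centroid)
    sims = o @ c
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
offs = rng.normal(size=(1000, 300))     # stand-in open-world offsets
center = offs[:20].mean(axis=0)         # centroid of known relation examples
idx = hypercone(offs, center, k=50)
```

Clustering within each returned subspace, rather than over the full open-world set, is what keeps the discovery step tractable.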
Table 3 summarizes the 13 open-world clusters by the number of word-pairs in the cluster, the number of those pairs which appear in the corresponding validation set grouping, and the percentage of the validation set pairs contained within the cluster. The "gender" cluster, for example, contains 12 word-pairs, five of which are "gender" pairs from the validation set. These five constitute 21.7% of the 23 "gender" pairs which appear in the validation set. We inspect the unlabeled word-pairs (those not in the validation set) in each cluster to investigate whether the clustering methodology has led to the discovery of legitimate new examples of the relations. We manually label the unlabeled word-pairs in each cluster as relevant or not relevant to the relation. We report the precision among unlabeled word-pairs, along with a random sampling of five pairs from each cluster, in Table 4. Among others, examples of relevant new pairs are found in the "adj-to-adv" cluster, such as (adequate, adequately); the "capital-world" cluster, such as (Pyongyang, North Korea); and the "nationality" cluster, such as (Belgium, Belgian).

Results
Of highest precision is the "gender" cluster, in which 100% of the unlabeled word-pairs are relevant to the relation. This resonates with "gender" being among the higher performing relations in analogy completion (Finley et al., 2017). "Plural-verbs" and "opposite" are also high in precision, with 88.9% and 76.9% new word-pair relevance, respectively. Additionally, the majority (52.5%) of the 368 unlabeled word-pairs from the "pres.-participle" cluster are relevant to the relation. The lowest precision is found in the "nationality" cluster (10.5%) and in the "plural" cluster, of which 0% of the unlabeled word-pairs are relevant. In the "plural" cluster, three of the five examples in the table show melon as word2 of the new pairs. This is a likely case of "default-behavior errors," as introduced by Levy and Goldberg (2014), which refers to the incorrect completion of a set of analogies by one specific, "prototypical" dominant word (e.g., melon as the prototypical plural noun). Aside from the unlabeled word-pairs marked relevant for matching the relation perfectly, some unlabeled pairs contain a synonym or antonym of one of the words of the "perfect" pair. For example, the "opposite" cluster contains the near-perfect pair (adequate, insufficient), which is synonymous with (sufficient, insufficient). Furthermore, the "comparative" cluster contains (small, larger), where small is the antonym of the ideal word1, large. The vector offsets of these word-pairs may cluster tightly because synonyms and antonyms tend to have high cosine similarity in word embeddings due to their use in similar contexts in language (Adel and Schütze, 2014).

Conclusions
We explored the amenability of various neural embeddings to the clustering of lexical relations. On the task of unsupervised clustering of known examples, our experiments showed that baseline non-contextual models (i.e., Word2Vec and FastText) outperformed the relation vectors derived from the contextual model, BERT. Among spectral, K-Means, and hierarchical clustering, K-Means yielded the highest supervised and unsupervised scores on both the DiffVec and Google validation datasets. Word2Vec with K-Means was best able to cluster the offset vectors into homogeneous clusters with respect to the labeled relations, a trend that continued when the task was expanded to an extended, noisy set of word-pairs. In both spaces, the unsupervised metric of silhouette score largely correlated with the supervised metrics. This allowed for the possibility of estimating the true number of lexical relations (i.e., clusters), which we attempted in an experiment with mini-batch K-Means. The results showed that the maximum silhouette score did not reliably correspond to the estimated true number of clusters; however, the Google dataset showed promise in providing a cluster number (20) close to the true number (27) based on the first spike in silhouette score.
On the second task of discovering new relation examples in a space of over 100M word-pairs, results varied greatly by relation. Thirteen subspaces of offsets were created, centered around the known lexical relation examples. After clustering within each of these subspaces, the best performing cluster within each subspace was able to capture existing ground truth examples with recall ranging from 0.063 (adj-to-adv) to 0.515 (pres.-participle). When evaluating new examples collected in these top performing clusters, precision ranged from 0 (plural) to 1 (gender).
Our results suggest that linear, non-contextual embeddings have an advantage over contextual embeddings on these lexical relation mining tasks when using classical clustering techniques, but that modeling improvements are necessary to reduce noise to a level of practical utility in real-world scenarios. These improvements may come from ongoing work exploring linearities in contextual model embeddings (Reif et al., 2019; Jawahar et al., 2019) and would have the potential to identify emergent semantic regularities in large corpora.

A Additional Results of Clustering Experiments
A.1 Comparing Word2Vec, BERT-C, BERT-D and FastText

B Hierarchical Clustering Experiments
We have repeated the experiments for HAC with cosine distance and complete linkage. Results of these experiments are given in Table 7. We can see that Word2Vec offsets outperform BERT-C, BERT-D, and RelPair in all three experiments. Details of these experiments are:
• Experiment 1: Comparison between Word2Vec, FastText, and BERT offsets, using the Google dataset with 13 clusters and 387 word-pairs, and the DiffVec dataset with 36 clusters and 7,782 word-pairs.
• Experiment 2: Comparison between Word2Vec, FastText, and RelPair, using the Google dataset with 11 clusters and 272 word-pairs, and the DiffVec dataset with 34 clusters and 1,889 word-pairs.
• Experiment 3: Comparison between Word2Vec, FastText, BERT, and RelPair, using the Google dataset with 13 clusters and 207 word-pairs, and the DiffVec dataset with 34 clusters and 1,512 word-pairs.