A Rank-Based Similarity Metric for Word Embeddings

Word embeddings have recently established themselves as a standard for representing word meaning in NLP. Semantic similarity between word pairs has become the most common evaluation benchmark for these representations, with vector cosine typically used as the only similarity metric. In this paper, we report experiments with a rank-based metric for word embeddings, which performs comparably to vector cosine in similarity estimation and outperforms it in the recently introduced and challenging task of outlier detection, thus suggesting that rank-based measures can improve clustering quality.


Introduction
"All happy families resemble one another, but each unhappy family is unhappy in its own way." (Leo Tolstoy, Anna Karenina)

Distributional Semantic Models (DSMs) have received increasing attention in the NLP community, as they constitute an efficient data-driven method for creating word representations and measuring their semantic similarity by computing their distance in the vector space (Turney and Pantel, 2010).
The most popular similarity metric in DSMs is the vector cosine. Compared to Euclidean distance, cosine scores are normalized across dimensions and hence robust to scaling effects. On the other hand, one limitation of this metric is that it treats every dimension equally, without taking into account the fact that some dimensions might be more relevant than others for characterizing the semantic content of a word. This limitation led to the introduction of alternative metrics based on feature ranking, which have been reported to outperform vector cosine in several similarity tasks (Santus et al., 2016a,b).
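To make the scaling point concrete, the following is a minimal sketch (NumPy, with toy vectors, not tied to any specific DSM) of how cosine normalizes away vector magnitude while Euclidean distance does not:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product of the two vectors, normalized
    by their magnitudes, so every dimension is weighted equally."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = 2.0 * u                      # same direction, twice the magnitude

print(cosine(u, v))              # 1.0: cosine ignores the uniform scaling
print(np.linalg.norm(u - v))     # ~3.74: Euclidean distance does not
```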
Recently, the focus of research on word representations has been shifting to the so-called word embeddings (WE), dense vectors obtained by neural network training, which have achieved significant improvements in several similarity-related tasks (Mikolov et al., 2013a). Although the dense representation of the embeddings helps to reduce the sparsity of traditional count vectors, their nature does not substantially differ (Levy et al., 2015). Most research involving WE still adopts vector cosine for similarity estimation, and little experimentation has been done on alternative metrics for comparing dense representations (exceptions include Camacho-Collados et al. (2015)).
Some attempts to directly transfer rank-based measures from traditional DSMs to WE have faced difficulties (see, for example, Jebbara et al. (2018)). In this paper, we suggest a possible solution to this problem by adapting APSyn, a rank-based similarity metric originally proposed for sparse vectors (Santus et al., 2016a,b), to low-dimensional word embeddings. This goal is achieved by removing the parameter N (the extent of the feature overlap to be taken into account) and adding a smoothing parameter that proves to be constant under multiple settings, thereby making the measure unsupervised. Our experiments show performance improvements both in similarity estimation and in the more challenging outlier detection task (Camacho-Collados and Navigli, 2016), which consists in cluster and outlier identification.2

2 Code and vectors used for the experiments are available at https://github.com/esantus/Outlier Detection.

Similarity, Relatedness and Dissimilarity: Current Issues in the Evaluation of DSMs
A classical benchmark for DSMs is the estimation of word similarity: evaluation datasets are built by asking human subjects to rate the degree of semantic similarity of word pairs, and performance is assessed in terms of the correlation between the average scores assigned to the pairs by the subjects and the cosines of the corresponding vectors (the similarity estimation task). Similarity as modeled by DSMs has been under debate, as its definition is underspecified. It is in fact often conflated with the more generic notion of semantic relatedness, a conflation also present in many popular datasets (e.g. the concepts of coffee and cup are certainly related, but there is very little similarity between them), as opposed to 'genuine' semantic similarity (e.g. the relation holding between concepts such as coffee and tea) (Agirre et al., 2009; Hill et al., 2015; Gladkova and Drozd, 2016). Therefore, when testing a DSM, it is important to pay attention to what type of semantic relation is actually modeled by the evaluation dataset. Moreover, researchers have pointed out that similarity estimation alone does not constitute a strong benchmark, as inter-annotator agreement is relatively low in all datasets and the performance of several automated systems is already above the human upper bound (Batchkarov et al., 2016). As a consequence, workshops such as RepEval have been organized with the explicit purpose of finding alternative evaluation tasks for DSMs.
A recent proposal is the challenging outlier detection task (Camacho-Collados and Navigli, 2016; Blair et al., 2016), which involves recognizing cluster membership as well as relative degrees of semantic dissimilarity. The task is described as follows: given a group of words, identify the outlier, namely the word that does not belong to the group (i.e. the one that is least similar to the others). On top of its potential applications (e.g. ontology learning), detecting outliers in clusters poses a stricter quality requirement on distributional representations than tests based simply on pairwise comparisons, as it requires similar words to group into semantically meaningful clusters. Clearly, the task involves the identification of discriminative semantic dimensions that can set the cluster members apart from non-members. Outliers are not necessarily unrelated to the other words: rather, they have a lower degree of similarity with respect to some prominent property of the cluster (e.g. the case of Los Angeles Lakers as an outlier in a cluster of football teams). In our view, a similarity metric has to exploit such discriminative dimensions to form cohesive clusters.

A Rank-Based Metric for Embeddings
We use vector cosine as a baseline and test an adaptation of a rank-based measure to the dense features of word embeddings.
Vector cosine computes the correlation between all the vector dimensions, independently of their relevance for a given word pair or for a semantic cluster, and this can be a limitation for discerning different degrees of dissimilarity. The alternative rank-based measure is based on the hypothesis that similarity consists in sharing many relevant features, whereas dissimilarity can be described as either the non-sharing of relevant features or the sharing of non-relevant features (Santus et al., 2014, 2016b).
This hypothesis could prove very helpful for a task like outlier detection, where prominent features might be the key to improving clustering quality: semantic dimensions that are shared by many of the cluster elements should be weighted more, as they are likely to be useful for setting the outliers apart. In fact, a cohesive cluster should be mostly characterized by the same 'salient' dimensions, and basing word comparisons on such dimensions should therefore lead to more reliable estimates of cluster membership.
In our contribution, we propose to adapt APSyn, a metric originally proposed by Santus et al. (2016a,b), to dense word embedding representations.3 APSyn was shown to perform well on both synonymy detection and similarity estimation tasks, and it was recently adapted to achieve state-of-the-art results in thematic fit estimation (Santus et al., 2017). The original APSyn formula is shown in equation 1:

APSyn(w_x, w_y) = Σ_{f_i ∈ N(f_x) ∩ N(f_y)} 1 / ((r_{s_x}(f_i) + r_{s_y}(f_i)) / 2)   (1)

For every feature f_i in the intersection of the top N features of two vectors w_x and w_y, we add the inverse of the average rank of that feature, r_{s_x}(f_i) and r_{s_y}(f_i), in the two decreasingly value-sorted vectors s_x and s_y (in traditional vectors the parameter is often N ≥ 1000, while in WE we set N = |f|). APSyn scores low if the features of the two vectors are inversely ranked and high if they are similarly ranked. APSyn maps the average feature ranks through a non-linear function, emphasizing the contribution of top-ranked features. Its direct application to dense embeddings would shrink the contribution of lower ranks too much (see Figure 1), with the score mostly affected by the top ~25 features. While this is reasonable for traditional vectors derived from co-occurrence counts, where thousands of smaller contributions can still affect the final score, dense vectors need a smoother curve. While preserving the idea of non-linear weight allocation across the average feature ranks during the summation, we modify the original APSyn formula by raising the feature ranks to the power of a constant value p between 0 and 1 (excluded), as shown in equation 2, so that the number of ranks contributing to the final score is widened to all features (see the smoother curve of APSynPower in Figure 1):

APSynP(w_x, w_y) = Σ_{f_i ∈ f_x ∩ f_y} 1 / ((r_{s_x}(f_i)^p + r_{s_y}(f_i)^p) / 2)   (2)

We name this variant APSynPower or, for short, APSynP.
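The two measures can be sketched as follows: a minimal NumPy implementation based on the definitions above, not the authors' released code. Ranks are 1-based, and for embeddings we use N = |f|, i.e. all dimensions participate:

```python
import numpy as np

def ranks(vec):
    """Rank of each dimension when values are sorted decreasingly
    (highest value -> rank 1)."""
    order = np.argsort(-vec)            # indices from highest to lowest value
    r = np.empty_like(order)
    r[order] = np.arange(1, len(vec) + 1)
    return r

def apsyn(wx, wy, n=None):
    """APSyn: sum of inverse average ranks over the shared top-N features.
    For dense embeddings we take N = |f| (all dimensions)."""
    rx, ry = ranks(wx), ranks(wy)
    if n is None:
        n = len(wx)
    shared = set(np.argsort(-wx)[:n]) & set(np.argsort(-wy)[:n])
    return sum(1.0 / ((rx[f] + ry[f]) / 2.0) for f in shared)

def apsynp(wx, wy, p=0.1):
    """APSynP: ranks are raised to the power p in (0, 1), flattening
    the weight curve so that all feature ranks contribute."""
    rx, ry = ranks(wx), ranks(wy)
    return sum(1.0 / ((rx[f] ** p + ry[f] ** p) / 2.0) for f in range(len(wx)))
```

For a vector compared with itself, apsyn reduces to the harmonic series over ranks (1 + 1/2 + 1/3 + ...), while apsynp with p = 0.1 decays far more slowly, which is the smoothing effect shown in Figure 1.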
The power p added to the APSynP formula is a trainable parameter. We trained it on the similarity subset of the WordSim dataset, obtaining an optimal value of p = 0.1, which we then used successfully in all evaluations, under all settings (i.e. embedding types and training corpora). Such regularity allows us to consider p = 0.1 a constant and therefore to drop p. Since in WE we can also drop the parameter N by setting N = |f|, APSynP ends up requiring no parameters at all.

Embeddings
For our experiments, we used two popular word embedding architectures: Skip-Gram with negative sampling (Mikolov et al., 2013a,b) and GloVe (Pennington et al., 2014), with standard hyperparameter settings (300 dimensions, context size of 10, negative sampling).4 For comparison with Camacho-Collados and Navigli (2016) on outlier detection, we used the same training corpora: UMBC (Han et al., 2013) and the English Wikipedia.5

[Figure 1: Comparison of weight per feature rank in APSyn and APSynP (p = 0.1), for feature ranks ranging from 1 to 300.]

Datasets
For the similarity estimation task, we evaluate the Spearman correlation between system-generated scores and human judgments. We used three popular benchmark datasets: WordSim-353 (Finkelstein et al., 2001), MEN (Bruni et al., 2014) and SimLex-999 (Hill et al., 2015). It is important to point out that SimLex-999 is the only one specifically built to target genuine semantic similarity, while the others tend to mix similarity and relatedness scores.
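The evaluation procedure itself reduces to a rank correlation between model scores and human ratings. A sketch with hypothetical toy scores (Spearman's rho implemented directly as the Pearson correlation of the two rank vectors):

```python
def rankdata(xs):
    """Average ranks (1-based); tied values get the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Hypothetical human ratings and model scores for four word pairs
human = [9.2, 7.5, 3.1, 1.0]
model = [0.81, 0.74, 0.40, 0.35]
print(spearman(human, model))  # 1.0: the model orders the pairs as the annotators do
```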
For outlier detection, we evaluate our DSMs on the 8-8-8 dataset (Camacho-Collados and Navigli, 2016). The dataset consists of eight clusters, each with a different topic and consisting in turn of eight lexical items belonging to the cluster and eight outliers (with four degrees of relatedness to the cluster members: C1, C2, C3, C4). In total, the dataset includes 64 sets of 8 words + 1 outlier for the evaluation. For each word w of a cluster W of n words, the authors defined a compactness score c(w) corresponding to the average of all pairwise similarities of the words in W \ {w}. On the basis of the compactness score, they proposed two evaluation metrics: Outlier Position (OP) and Outlier Detection (OD). Given a set W of n + 1 words, OP is the rank of the outlier w_{n+1} according to the compactness score. Ideally, the rank of the outlier should be n, meaning that it has the lowest average similarity to the other cluster elements. The second metric, Outlier Detection (OD), is then defined as 1 if OP(w_{n+1}) = n and 0 otherwise. Finally, the performance on a dataset composed of |D| sets of words is estimated in terms of Outlier Position Percentage (OPP, Eq. 3) and Accuracy (Eq. 4):

OPP = (1/|D|) Σ_{W ∈ D} OP(w_{n+1}) / n × 100   (3)

Accuracy = (1/|D|) Σ_{W ∈ D} OD(w_{n+1}) × 100   (4)
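Concretely, the evaluation can be sketched as follows: a toy example with hypothetical 2-d vectors in which each word is scored by its average similarity to the rest, so that the outlier should get the lowest score and thus rank n:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def outlier_position(vectors):
    """OP: rank of the last word (the outlier) when words are sorted by
    decreasing average similarity to the other words in the set.
    Ideal value: n (the outlier is the least similar word)."""
    n = len(vectors) - 1
    avg = [np.mean([cos(v, u) for j, u in enumerate(vectors) if j != i])
           for i, v in enumerate(vectors)]
    order = sorted(range(len(vectors)), key=lambda i: -avg[i])
    return order.index(n)   # position of the outlier (index n) in the ranking

# Toy cluster of three nearby vectors plus one hypothetical outlier (last)
vecs = [np.array([1.0, 0.1]), np.array([1.0, 0.2]),
        np.array([0.9, 0.1]), np.array([0.1, 1.0])]
op = outlier_position(vecs)
n, D = len(vecs) - 1, 1
opp = 100.0 * op / n              # Outlier Position Percentage for this one set
acc = 100.0 * int(op == n) / D    # Accuracy: 1 iff the outlier is ranked last
print(op, opp, acc)               # 3 100.0 100.0
```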

Pairwise and Prototype Approaches to Outlier Detection
While for the similarity task scores are always calculated pairwise, for spotting the outlier we tested two different methods: pairwise comparisons and a cluster prototype.
In the first case, we reimplemented the method of Camacho-Collados and Navigli (2016): (i) compute the average similarity score of each word with the other words in the cluster; (ii) pick as the outlier the word with the lowest average score. The alternative consists in creating a cluster prototype: (i) for a cluster of N words, we create N prototype vectors by excluding one of the words at a time and averaging the vectors of the others; (ii) we pick as the outlier the word with the lowest similarity score to the prototype built from the vectors of the other words in the cluster.

Results and Discussion

Table 1 summarizes the correlations for the similarity task. APSynP outperforms both vector cosine and APSyn on all the datasets described in 4.2 when GloVe embeddings are used. The advantage is statistically significant over the cosine on the MEN dataset (p < 0.05) and over APSyn on all datasets (p < 0.01).6 With Skip-Gram embeddings, APSynP performs comparably to vector cosine on WordSim and MEN, where relatedness is dominant, while retaining a significant advantage over APSyn on the same datasets (p < 0.05). It also performs slightly better than cosine on SimLex-999, which complies with previous findings of Santus et al. (2016a), who showed that APSyn performs better on genuine similarity datasets. Apparently, the top-ranked vector dimensions (those contributing more to APSyn scores) are more often shared by similar words than by merely related ones.

Table 2 shows the results for the outlier detection task. The line CC-Cos contains the scores by Camacho-Collados and Navigli (2016) as a baseline. The models are divided into pairwise comparison and cluster prototype (see Section 4.3).

As can be easily noticed by looking at the bold line, APSynP outperforms the baselines in all settings for both Skip-Gram and GloVe, obtaining higher accuracies and OPPs. Not only is APSynP better at identifying the outlier, but when it fails to do so, its error is minimal (e.g. the outlier is typically the second-ranked candidate). The best accuracy (73.4 vs. a state of the art of 70.3) and the best OPP (94.9 vs. 93.8) are both obtained by APSynP with the prototype approach, using the Skip-Gram embeddings trained on Wikipedia. We also tested the significance of the accuracy improvements with the χ2 test but, also given the small size of the 8-8-8 dataset, the result was negative. Finally, we observe that the two approaches described in 4.3 do not lead to noticeably different results. The major factors of difference lie instead in the embeddings (with Skip-Gram outperforming GloVe) and in the training corpus (the smaller Wikipedia, 1.7B words, outperforms the bigger UMBC, 3B words).
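As a hedged illustration of the two selection strategies compared above (toy 2-d vectors, cosine as the similarity function; not the authors' released code):

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def outlier_pairwise(vectors):
    """Pairwise method: pick the word with the lowest average
    similarity to all the other words in the set."""
    scores = [np.mean([cos(v, u) for j, u in enumerate(vectors) if j != i])
              for i, v in enumerate(vectors)]
    return int(np.argmin(scores))

def outlier_prototype(vectors):
    """Prototype method: pick the word least similar to the prototype
    (mean vector) built from the remaining words, leave-one-out."""
    scores = [cos(v, np.mean([u for j, u in enumerate(vectors) if j != i], axis=0))
              for i, v in enumerate(vectors)]
    return int(np.argmin(scores))

# Toy set: three hypothetical 'cluster' vectors and one outlier at index 1
vecs = [np.array([1.0, 0.1]), np.array([0.1, 1.0]),
        np.array([0.9, 0.2]), np.array([1.0, 0.15])]
print(outlier_pairwise(vecs), outlier_prototype(vecs))   # 1 1
```

The prototype variant needs only one similarity computation per candidate instead of a full pairwise matrix, which is why it is computationally cheaper.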

Error Analysis
In Table 3, we report the five outliers that were most difficult for APSynP to detect. Most of them are related to the German Car Manufacturers topic, which was ambiguous and populated by rare terms. All outliers in the Months and in the South American countries clusters (except for the two South American cities Rio de Janeiro and Bogotá) are successfully identified under all experimental settings. Finally, the reader can notice that most errors belong to C1 and C2, which are the most challenging classes in the dataset, as these outliers are either very related or very similar to the other cluster members.

Conclusions
We have introduced APSynP, an adaptation of the rank-based similarity measure APSyn (Santus et al., 2016a,b) to word embeddings. This adaptation introduces a power parameter p, which is shown to be constant across multiple tasks (i.e. p = 0.1). The stability of this parameter, together with the possibility of dropping the parameter N of APSyn when using WE by setting N = |f|, makes the measure unsupervised. We have tested it on the tasks of similarity estimation and outlier detection, obtaining performance comparable to or better than vector cosine and the original APSyn. APSynP performs more consistently on SimLex-999, showing a preference for genuine similarity, as already noticed by Santus et al. (2016a). We also introduced a new approach to the outlier detection task, based on a cluster prototype. The prototype method is competitive and computationally less expensive than pairwise comparisons. We leave a systematic comparison of APSynP with other rank-based measures to future work. Pilot tests have shown that other rank-based metrics (e.g. Spearman's rho) also outperform vector cosine in multiple settings and tasks.