Word Rotator’s Distance

One key principle for assessing textual similarity is measuring the degree of semantic overlap between two texts by considering the word alignment. Such alignment-based approaches are both intuitive and interpretable; however, they are empirically inferior to the simple cosine similarity between general-purpose sentence vectors. We focus on the fact that the norm of word vectors is a good proxy for word importance, and that the angle between them is a good proxy for word similarity. Alignment-based approaches do not distinguish the norm and direction, whereas sentence-vector approaches automatically use the norm as the word importance. Accordingly, we propose decoupling word vectors into their norm and direction and then computing the alignment-based similarity with the help of earth mover's distance (optimal transport), which we refer to as word rotator's distance. Furthermore, we demonstrate how to "grow" the norm and direction of word vectors (vector converter); this is a new systematic approach derived from sentence-vector estimation methods, which can significantly improve the performance of the proposed method. On several STS benchmarks, the proposed methods outperform not only alignment-based approaches but also strong baselines.


Introduction
This paper addresses the semantic textual similarity (STS) task, the goal of which is to measure the degree of semantic equivalence between two sentences (Agirre et al., 2012). High-quality STS methods can be used to upgrade the loss functions and automatic evaluation metrics of text generation tasks because a requirement of these metrics is precisely the calculation of STS (Wieting et al., 2019; Zhao et al., 2019). There are two major approaches to tackling STS. One is to measure the degree of semantic overlap between texts by considering the word alignment, which we refer to as alignment-based approaches (Sultan et al., 2014; Zhao et al., 2019). The other involves generating general-purpose sentence vectors for the two texts (typically composed from word vectors) and then calculating their similarity, which we refer to as sentence-vector approaches. Alignment-based approaches are consistent with human intuition about textual similarity, and their predictions are interpretable. However, their performance is lower than that of sentence-vector approaches.
We hypothesize that one reason for the inferiority of alignment-based approaches is that they do not separate the norm and direction of the word vectors. In contrast, sentence-vector approaches automatically exploit the norm of the word vectors as the relative importance of words.
Thus, we propose an STS method that first decouples word vectors into their norms and direction vectors and then aligns the direction vectors using earth mover's distance (EMD). Here, the key idea is to map the norm and angle of the word vectors to the EMD parameters, i.e., probability mass and transportation cost, respectively. The proposed method is natural from both the optimal-transport and word-embedding perspectives, preserves the features of alignment-based methods, and can directly incorporate sentence-vector estimation methods, which results in fairly high performance.
Our primary contributions are as follows.
• We demonstrate that the norm of a word vector implicitly encodes the importance weight of a word and that the angle between word vectors is a good proxy for the dissimilarity of words.
• We propose a new textual similarity measure, word rotator's distance (WRD), that separately utilizes the norm and direction of word vectors.
• To enhance the proposed WRD, we utilize a new word-vector conversion mechanism, which is formally induced from recent sentence-vector estimation methods.
• We demonstrate that the proposed methods achieve high performance compared to strong baseline methods on several STS tasks.

Related Work
We briefly review the methods that are directly related to unsupervised STS.
Alignment-based Approach. One major approach to unsupervised STS is to compute the degree of semantic overlap between two texts (Sultan et al., 2014, 2015). Recently, determining a soft alignment between sets of word vectors has become a mainstream method. Tools used for alignment include attention mechanisms, fuzzy sets, and earth mover's distance (EMD) (Clark et al., 2019; Zhao et al., 2019). Of those, EMD has several unique advantages. First, it has a rich theoretical foundation for measuring the differences between probability distributions in a metric space (Villani, 2009; Peyré and Cuturi, 2019). Second, EMD can incorporate structural information such as syntax trees (Alvarez-Melis et al., 2018; Titouan et al., 2019). Finally, with a simple modification, EMD becomes differentiable and can be incorporated into larger neural networks (Cuturi, 2013). Despite these advantages, EMD-based methods have underperformed sentence-vector-based methods on STS tasks. The goal of this study is to identify and resolve the obstacles faced by EMD-based methods (Section 5).
Sentence-vector Approach. Another popular approach is to employ general-purpose sentence vectors of the given texts and to compute the cosine similarity between such vectors. A variety of methods to compute sentence vectors have been proposed, ranging from utilizing deep sentence encoders (Kiros et al., 2015; Conneau et al., 2017; Cer et al., 2018), to learning and using word vectors optimized for summation (Pagliardini et al., 2018; Wieting and Gimpel, 2018), to estimating latent sentence vectors from pre-trained word vectors (Liu et al., 2019b). This paper demonstrates that some recently proposed sentence vectors can be reformulated as a sum of converted word vectors. By utilizing the converted word vectors, our method can achieve similar or better performance compared to sentence-vector approaches (Section 6).

Earth Mover's Distance

Earth mover's distance (EMD), also known as optimal transport, measures the distance between two discrete probability distributions located in a metric space. EMD is defined using the following components.

1. Two probability distributions, µ (initial arrangement) and µ′ (final arrangement): here,

  µ := {(x_i, m_i)}_{i=1}^n

denotes a discrete probability distribution in which each point x_i ∈ R^d carries a probability mass m_i ∈ [0, 1] with Σ_i m_i = 1 (formally, µ = Σ_i m_i δ[x_i], where the Dirac delta δ[·] describes a discrete probability measure; in this paper, we omit δ for notational simplicity). In Figure 1, each circle represents a pair (x_i, m_i), where the location and size of the circle represent the vector x_i and its probability mass m_i, respectively.
2. The transportation cost function c: here, c(x_i, x′_j) determines the transportation cost per unit mass (distance) between two points x_i and x′_j.
The EMD between µ and µ′ is then defined via the following optimization problem:

  EMD(µ, µ′; c) := min_T Σ_{i,j} T_{ij} c(x_i, x′_j)
  subject to  T ∈ R^{n×n′}_{≥0},  T 1_{n′} = m := (m_1, ..., m_n)^⊤,  T^⊤ 1_n = m′ := (m′_1, ..., m′_{n′})^⊤.

Here, a solution T denotes a transportation plan, where each element T_ij represents the mass transported from x_i to x′_j. In summary, EMD(µ, µ′; c) is the cost of the best transportation plan between the two distributions µ and µ′.
Side Benefit: Alignment. Under the above optimization, if the locations x_i and x′_j are close (i.e., if the transportation cost c(x_i, x′_j) is small), they are likely to be aligned (i.e., T_ij may be assigned a large value). In this way, EMD can be considered to align the points of two discrete distributions. This is one reason we adopt EMD as a key technology in the computation of STS.
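To make the optimization concrete, the following is a minimal sketch (ours, not from the paper) of computing EMD for two toy distributions; it assumes the POT library (pip install pot) and SciPy, and the points and masses are made up.

```python
# Illustrative sketch: EMD between two discrete distributions using the POT library.
import numpy as np
import ot  # Python Optimal Transport
from scipy.spatial.distance import cdist

# Distribution mu: points x_i with masses m_i (masses sum to 1).
X = np.array([[0.0, 0.0], [1.0, 0.0]])
m = np.array([0.7, 0.3])

# Distribution mu': points x'_j with masses m'_j.
X_prime = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
m_prime = np.array([0.2, 0.5, 0.3])

# Transportation cost c(x_i, x'_j): here, the Euclidean distance.
C = cdist(X, X_prime, metric="euclidean")

emd_value = ot.emd2(m, m_prime, C)   # optimal total transport cost
T = ot.emd(m, m_prime, C)            # transport plan: T[i, j] = mass moved from x_i to x'_j
print(emd_value)
print(T)  # a large T[i, j] indicates that x_i and x'_j are (softly) aligned
```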

Word Mover's Distance
Word mover's distance (WMD) is a dissimilarity measure between texts and is a pioneering work that introduced EMD to the natural language processing (NLP) field. This study is strongly inspired by that work. We introduce WMD prior to presenting the proposed method.
WMD is the cost of transporting a set of word vectors in an embedding space (Euclidean space) (Figure 2). Formally, after removing stopwords, each sentence s = (w_1, ..., w_n) is regarded as a uniformly weighted distribution µ_s over its word vectors (a bag-of-word-vectors distribution):

  µ_s := {(w_i, 1/n)}_{i=1}^n.   (7)

In Figure 2, each circle represents a word, where the location and size of the circle represent the word vector w_i and its weight 1/n, respectively. Next, the Euclidean distance is used as the transportation cost between word vectors:

  c_E(w, w′) := ||w − w′||_2.   (8)

Then, WMD is defined as the EMD between two such distributions using the cost function c_E:

  WMD(s, s′) := EMD(µ_s, µ_{s′}; c_E).
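As a minimal sketch (ours, not the original implementation), WMD can be computed by plugging uniform masses and Euclidean costs into the same EMD routine; the embedding lookup emb below is a placeholder for any pre-trained word vectors such as GloVe.

```python
# Illustrative WMD sketch: uniform masses over word vectors, Euclidean transport cost.
import numpy as np
import ot
from scipy.spatial.distance import cdist

def wmd(words_s, words_t, emb):
    """emb: dict mapping a word to its vector (placeholder for pre-trained embeddings)."""
    W_s = np.stack([emb[w] for w in words_s])        # bag of word vectors for sentence s
    W_t = np.stack([emb[w] for w in words_t])
    m_s = np.full(len(words_s), 1.0 / len(words_s))  # uniform weights (Eq. 7)
    m_t = np.full(len(words_t), 1.0 / len(words_t))
    C = cdist(W_s, W_t, metric="euclidean")          # Euclidean cost (Eq. 8)
    return ot.emd2(m_s, m_t, C)                      # EMD between the two distributions
```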

Issues with Word Mover's Distance
Despite its intuitive formulation, WMD often misaligns words, and its STS performance is lower than that of recent methods. For example, under WMD, "noodle" and "snack" may be aligned rather than "noodle" and "pho" (a type of Vietnamese noodle).

Word Rotator's Distance
Here, we first discuss the roles of the norm and direction of word vectors. Then, we describe issues with WMD from the perspective of the roles of the norm and direction. Finally, we present the proposed method, i.e., word rotator's distance, which can resolve the issues with WMD.

Roles of Norm and Direction
We hypothesize that the norm and direction of word vectors play the following distinct roles.
• Norm of a word vector as a weighting factor: the norm of a word vector indicates the extent to which the word contributes to the overall meaning of a sentence.
• Angle between word vectors as dissimilarity: the angle between two word vectors (i.e., the difference between the directions of these vectors) approximates the (dis)similarity of the two words.
We elaborate on the validity of this hypothesis in this section. Henceforth, λ_i and u_i denote the norm and the direction vector of the word vector w_i, respectively:

  w_i = λ_i u_i,   λ_i := ||w_i||,   ||u_i|| = 1.
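For concreteness, the decomposition can be written in a couple of lines (our own illustration; the vector below is a made-up example).

```python
# Decompose a word vector into its norm (lambda) and direction (unit vector u).
import numpy as np

w = np.array([3.0, 4.0])      # toy word vector
lam = np.linalg.norm(w)       # norm, lambda = 5.0: used as the word's weight
u = w / lam                   # direction vector, ||u|| = 1: used for word (dis)similarity
assert np.isclose(np.linalg.norm(u), 1.0)
```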
Additive Compositionality. As a starting point, we review the well-known nature of additive compositionality. The NLP community has confirmed that a simple sentence vector, i.e., the sum (equivalently, up to scale, the average) of the vectors of the words in a sentence,

  s := Σ_{i=1}^n w_i = Σ_{i=1}^n λ_i u_i,   (11)

can achieve remarkable results when used in STS tasks and many downstream tasks (Mitchell and Lapata, 2010; Mikolov et al., 2013; Wieting et al., 2016). Textual similarity is then computed as the cosine similarity between such sentence vectors, cos(s, s′) (12).
Norm as Weighting Factor. Equation 11 may initially appear to treat each word vector equally. However, several previous studies have confirmed that the norms of word vectors have large dispersion (Schakel and Wilson, 2015; Arefyev et al., 2018). In other words, a sentence vector is composed of word vectors of various lengths. In such cases, a word vector with a large norm dominates the resulting sentence vector, and vice versa (Figure 3). Here, the usefulness of additive composition (i.e., of the implicit weighting by the norm) suggests that the norm of each word vector functions as the weighting factor of the word when generating a sentence representation. In our experiments, we provide data-driven evidence to support this claim.
In addition, the following facts are known about the relationship between the norm of a word vector and the importance of the word: (i) content words tend to have larger norms than function words (Schakel and Wilson, 2015); and (ii) fine-tuned word vectors have larger norms for medium-frequency words, which is consistent with Luhn's traditional weighting guideline in information retrieval (Khodak et al., 2018; Pagliardini et al., 2018). Both of these observations suggest that the norm serves as a weighting factor in cases where additive composition is effective.
Angle as Dissimilarity. What does a direction vector (i.e., the rest of the word vector "minus" its norm) represent? Obviously, the most common calculation using the direction vectors of words is to measure their angle, i.e., their cosine similarity:

  cos(w_i, w_j) = ⟨u_i, u_j⟩.

It is widely known that the cosine similarity of word vectors trained on the basis of the distributional hypothesis approximates word similarity well (Pennington et al., 2014; Mikolov et al., 2013; Bojanowski et al., 2017). Naturally, the difference in direction vectors represents the dissimilarity of words. In our experiments, we confirm that cosine similarity is an empirically better proxy for word similarity than other measures.

Why doesn't WMD Work?
According to the above discussion, WMD has the following limitations.
• Weighting of words: while EMD can consider the weight of each point via its probability mass, and the weighting factor of each word is encoded in the norm, WMD ignores the norm and weights each word vector uniformly (7).
• Dissimilarity between words: while EMD can consider the distance between points via a transportation cost, and the dissimilarity between words can be measured by the angle, WMD uses the Euclidean distance, which mixes the weighting factor and the dissimilarity.

The problematic nature of this mixing can be explained as follows. The Euclidean transportation cost (8) misestimates as dissimilar (A) word pairs whose meanings are close (B) but whose concreteness or importance differs greatly (C), e.g., "noodle" and "pho" (Figure 4). This is clear from the relationship between the Euclidean cost (8) and the cosine distance (14),

  c_cos(w, w′) := 1 − cos(w, w′),   (14)

namely

  c_E(w, w′)² = (λ − λ′)² + 2 λ λ′ c_cos(w, w′).   (16)

From Equation 16, c_E(w, w′) is estimated as large (A) even if c_cos(w, w′) is small (B), as long as |λ − λ′| is large (C); a numeric check with toy vectors is sketched below. Note that this undesirable property is also confirmed when using real data. Table 1 and Figure 4 show the cosine and Euclidean distances between the vectors of "noodle," "pho," "snack," and "Pringles" (the name of a snack). Under the Euclidean distance, "noodle" and "snack" are judged to be more similar (i.e., more likely to be aligned) than "noodle" and "pho."
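The following sketch (ours, with made-up vectors rather than the actual "noodle"/"pho" embeddings) numerically checks the relation in Equation 16 and the behavior described above.

```python
# Numeric check of c_E(w, w')^2 = (lambda - lambda')^2 + 2*lambda*lambda'*c_cos(w, w').
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=300); u /= np.linalg.norm(u)                   # one direction
u2 = u + 0.05 * rng.normal(size=300); u2 /= np.linalg.norm(u2)     # nearly the same direction

w = 2.0 * u    # toy stand-in for a general word with a moderate norm
w2 = 6.0 * u2  # toy stand-in for a related but more specific word with a larger norm

lam, lam2 = np.linalg.norm(w), np.linalg.norm(w2)
c_cos = 1.0 - np.dot(w, w2) / (lam * lam2)   # small: the directions almost agree
c_E = np.linalg.norm(w - w2)                 # large: dominated by |lam - lam2|

assert np.isclose(c_E**2, (lam - lam2)**2 + 2 * lam * lam2 * c_cos)
print(c_cos, c_E)  # tiny cosine distance, but Euclidean distance near |lam - lam2| = 4
```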

Word Rotator's Distance
Given the above considerations, we propose a simple yet powerful sentence similarity measure using EMD. The proposed method considers each sentence as a discrete distribution on the unit hypersphere and calculates EMD on this hypersphere (Figure 5). Here, the alignment of the direction vectors corresponds to a rotation on the unit hypersphere; thus, we refer to the proposed method as word rotator's distance (WRD). Formally, we consider each sentence s as a discrete distribution ν_s comprising direction vectors weighted by their norms (a bag-of-direction-vectors distribution):

  ν_s := {(u_i, λ_i/Z)}_{i=1}^n,   ν_{s′} := {(u′_j, λ′_j/Z′)}_{j=1}^{n′},

where Z and Z′ are normalizing constants (Z := Σ_i λ_i, and likewise Z′). In Figure 5, each circle represents a word, where the location and size of the circle represent the direction vector u_i and its weight λ_i/Z, respectively. For the cost function, we use the cosine distance:

  c_cos(u, u′) := 1 − cos(u, u′);   (18)

that is, a rotation cost is required to align words. Then, the WRD between two sentences is given as

  WRD(s, s′) := EMD(ν_s, ν_{s′}; c_cos).

Unlike WMD, the above procedure allows the proposed WRD to follow appropriate correspondences between the EMD and word vectors.
• Probability mass (weight of each point) ↔ norm (weight of each word)
• Transportation cost (distance between points) ↔ angle (dissimilarity between words)

Algorithm. To ensure reproducibility, we show the specific (and fairly simple) algorithm and implementation guidelines for WRD in Appendix C.

Vector Converter-enhanced WRD
To further improve the performance of WRD, we attempt to integrate existing methods for estimating latent sentence vectors, i.e., the most powerful sentence encoders for STS, into WRD. However, combining sentence-vector estimation methods with WRD is not a straightforward task because WRD takes word vectors as input, whereas sentence-vector estimation methods output sentence vectors.

From Sentence Vector to Word Vector
Sentence-vector Estimation. On the basis of Arora's pioneering random-walk language model (LM) (Arora et al., 2016), a number of sentence-vector estimation methods have been proposed (Liu et al., 2019b,a) and have achieved success in many NLP applications, including STS. Given pre-trained vectors of the words comprising a sentence, these methods estimate the latent sentence vector that generated the word vectors. Such methods can be summarized in the following form:

  s ≈ f_3( Σ_{w_i ∈ s} α_2(w_i) f_1(w_i) ),   (20)

where f_1 is a word-level denoising function applied to each word vector, α_2(w_i) is a scalar word weight, and f_3 is a sentence-level denoising function. Here, we focus only on the form of the equation for sentence-vector estimation. For the specific algorithms, refer to the experimental section and Appendix D.
Word Vector Converter. Note that all existing denoising functions f_3 are linear; thus, Equation 20 can be rewritten as

  s ≈ Σ_{w_i ∈ s} f_VC(w_i),   where   f_VC(w) := f_3( α_2(w) f_1(w) ).   (21, 22)

Here, the encoders first apply a transformation f_VC to each word vector independently and then sum the results (i.e., additive composition!). We refer to f_VC as the (word) vector converter (VC).
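A minimal sketch of this reformulation is given below; the concrete f_1, α_2, and f_3 are placeholders (identity maps and a constant weight), not the actual methods, which are listed in Section 6.1 and Appendix D.

```python
# Sketch of the vector converter f_VC(w) = f3(alpha2(w) * f1(w)),
# valid whenever the sentence-level denoising f3 is linear.
import numpy as np

def f1(w):                 # word-level denoising (e.g., all-but-the-top); identity placeholder
    return w

def alpha2(word, w):       # scalar word weighting (e.g., SIF); constant placeholder
    return 1.0

def f3(v):                 # linear sentence-level denoising (e.g., common component removal); identity placeholder
    return v

def vector_converter(word, w):
    return f3(alpha2(word, w) * f1(w))

def sentence_vector(words, emb):
    # Summing converted word vectors reproduces the estimated sentence vector (Eq. 21).
    return np.sum([vector_converter(word, emb[word]) for word in words], axis=0)
```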

Norm and Direction
We believe that the vector converter improves the norm and direction of pre-trained word vectors.
Norm as Weighting Factor. In Section 5, given the success of additive composition, we proposed the use of norms to weight words. In addition, in Section 6.1, we confirmed that the sentence-vector estimation methods, which have achieved greater success in STS than standard additive composition, simply sum the transformed word vectors (i.e., improved additive composition). Therefore, we expect that the importance of a word w is better encoded in the norm of the converted word vector w̃ = f_VC(w) than in that of the original word vector w.
Angle as Dissimilarity. On the basis of the random-walk LM (Arora et al., 2016), the denoising function f_1 makes the word vector space isotropic, i.e., uniform in the sense of angle. As a result, the angle between word vectors becomes a better proxy for word dissimilarity (Mu and Viswanath, 2018). Further, the functions α_2 and f_3 assume a more realistic LM. Thus, VC is expected to further improve the isotropy of the vector space and to make the angle between word vectors a better proxy for word dissimilarity.

Vector Converter-enhanced WRD
As we discussed previously, the converted word vectors {w̃} may have preferable properties in terms of their norm and direction, and they remain word vectors (i.e., they are not sentence vectors); thus, {w̃} can be used as is as the input of WRD. Let λ̃ and ũ denote the norm and direction vector of w̃, respectively; then, a variant of WRD using {w̃} is obtained by replacing each sentence distribution with

  ν̃_s := {(ũ_i, λ̃_i / Z̃)}_{i=1}^n,

where Z̃ := Σ_i λ̃_i and Z̃′ are normalizing constants. We believe that using {w̃} will improve WRD performance because WRD depends on the weights and dissimilarities encoded in the norm and angle.

Experiments
In this section, we experimentally confirm our hypotheses about the norm and direction vectors and the performance of WRD and VC. For word vectors, we used standard GloVe vectors (Pennington et al., 2014). For VC, we used the following algorithms.
• f_1: all-but-the-top (Mu and Viswanath, 2018), sentence-wise feature scaling (Ethayarajh, 2018).
• α_2: SIF weighting (W).
• f_3: common component removal (CCR), piecewise CCR, and conceptor removal (C) (Liu et al., 2019a).

Workings of Norm
Here, we experimentally confirm whether the norm of a word vector is in fact a good proxy of the word's in-sentence importance.
Pre-trained Word Vectors. Let us consider an alternative additive composition to that in Equation 11, one that excludes the effect of weighting by the norm:

  s := Σ_{i=1}^n u_i.   (25)

Table 2a shows the experimental results obtained using the two types of sentence vectors (11) and (25). Ignoring the norm of the word vectors produced consistently poorer performance. This demonstrates that the norm of a word vector indeed plays the role of a weighting factor for the word.
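As our own illustration of the two compositions being compared (word vectors are assumed to be given as NumPy arrays), the difference is a single normalization step.

```python
# Additive composition with and without the norm (Eq. 11 vs. Eq. 25).
import numpy as np

def sent_vec_with_norm(word_vecs):
    # s = sum_i w_i = sum_i lambda_i u_i   (Eq. 11)
    return np.sum(word_vecs, axis=0)

def sent_vec_without_norm(word_vecs):
    # s = sum_i u_i: every word contributes a unit-length vector   (Eq. 25)
    return np.sum([w / np.linalg.norm(w) for w in word_vecs], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```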

Converted Word Vectors.
To verify our hypothesis that VC improves the norm, we performed the same experiments as above using converted word vectors. Table 2b shows that, as the word vectors are gradually converted, the difference in predictive performance between Equations 11 and 25 (i.e., the performance gain from the norm) increases. This supports our hypothesis that VC "grows" the norm.
For correctness, we abbreviate this series of methods as SUP, which was abbreviated as UP in the original paper.
6 "+ AW" is omitted from  The angle of word vectors is a good proxy for word similarity. Spearman's ρ × 100 between the predicted and gold scores is reported. In each row, the best result and results where the difference from the best result was < 0.5 are indicated in bold. 6

Workings of Angle
We assumed that the angle between two word vectors is a good proxy for the dissimilarity of the two words. At present, the cosine similarity between word vectors is a common metric for computing word (dis)similarity; however, several alignment-based STS methods employ the Euclidean distance or the dot product (Zhelezniak et al., 2019). Therefore, a question arises: which is the most suitable measure of word dissimilarity? To answer this question, we compared dissimilarity metrics using nine word similarity datasets.
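Such a comparison can be run as sketched below (our own illustration; the (word1, word2, gold score) triples and the embedding lookup are hypothetical placeholders, not the paper's exact evaluation script).

```python
# Sketch: rank-correlate each candidate word-(dis)similarity metric with human judgments.
import numpy as np
from scipy.stats import spearmanr

def cosine_sim(w1, w2):
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def dot_sim(w1, w2):
    return float(np.dot(w1, w2))

def neg_euclidean(w1, w2):
    return -float(np.linalg.norm(w1 - w2))   # negated so that larger = more similar

def evaluate(pairs, emb, sim_fn):
    """pairs: list of (word1, word2, gold_score); emb: word -> vector (placeholders)."""
    preds, golds = [], []
    for a, b, gold in pairs:
        if a in emb and b in emb:
            preds.append(sim_fn(emb[a], emb[b]))
            golds.append(gold)
    rho, _ = spearmanr(preds, golds)
    return 100 * rho   # Spearman's rho x 100, as reported in Table 3
```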
Pre-trained Word Vectors. Table 3a shows that cosine similarity (i.e., ignoring the norm) yields a relatively higher correlation with human evaluations than the dot product or Euclidean distance (i.e., using the norm). This indicates that the angle of word vectors encodes the dissimilarity of words relatively well; in contrast, the norm is not relevant.

Table 4: The combination of WRD and VC gave the best performance. Pearson's r × 100 between the predicted and gold scores is reported. The STS-B dataset (dev) was used. The best result and results where the difference from the best was < 0.5 in each row are in bold, and the best results are further underlined.

Converted Word Vectors.
In view of the discussion in Section 6, we expected that the dissimilarity of words w and w′ would be better encoded in the angle between the converted word vectors, ⟨ũ, ũ′⟩ = cos(w̃, w̃′), than in the angle between the original word vectors, ⟨u, u′⟩ = cos(w, w′). Table 3b shows that, as the word vectors were gradually converted, the angle between word vectors became a more accurate measure of word dissimilarity.

Ablation Study
We experimentally confirmed the effectiveness of WRD and VC individually through the degree of performance improvement over the baseline WMD. Table 4 shows the results. In nearly all cases, WRD demonstrated higher performance than WMD. We summarize the major findings as follows.
• The performance of WRD improves steadily as word vectors are transformed by VC, because WRD can directly utilize the weight and dissimilarity encoded in the norm and angle, whose quality is enhanced by VC. Conversely, WMD does not benefit from VC.
• One may consider that W (SIF weighting) can be used directly as the probability mass for WMD because it is simply a scaling factor for each word. "+ SIF weights" in Table 4 represents such a computation; however, even when WMD employed SIF weighting directly, it did not reach the performance of WRD with VC.
Following prior work, we further experimented with stopword removal. Stopword removal was a good heuristic that gave both WMD and WRD a large performance gain, similar to SIF weighting; however, the above two findings remained unchanged. See Appendix E for additional details.

Table 5: Pearson's r × 100 between the predicted and gold scores is shown. The best results in each dataset, word vector, and strategy for computing the textual similarity ("Additive composition" or "Considering Word Alignment") are in bold; the best results regardless of the strategy are further underlined. Each row marked (†) was re-implemented by us. Each value marked (‡) was taken from Perone et al. (2018), and each value marked (*) was taken from STS Wiki.

Benchmark Tasks
Finally, we compared the performance of the proposed WRD and VC methods to that of various baselines, including recent alignment-based methods, i.e., WMD, BERTScore, and DynaMax; the results are reported in Table 5. We summarize our major findings as follows.
• Among the methods that consider word alignment, WRD + VC achieved the best performance. This is likely because the other methods employ the Euclidean distance (WMD) or the dot product (DynaMax) as word similarity measures; these metrics cannot distinguish the two types of information (i.e., weight and dissimilarity). BERTScore applies cosine similarity, as WRD does; however, BERTScore was inferior to WRD on average, which can be attributed to the fact that BERTScore completely disregards the norm.
• Compared to strong baselines based on additive composition (+WR, +SUP), WRD using the same word vectors (+VC(WR), +VC(SUP)) performed equally well or better. This result was unexpected given that +WR and +SUP were originally proposed to create sentence vectors, and WRD simply reuses them without tuning. Thus, we believe that considering word alignment is an inherently good principle for STS.
Refer to Appendix E for more comprehensive results obtained using additional datasets and methods, including (semi-)supervised approaches.

Connection to Other Methods
Finally, we discuss the relationships among WRD, WMD, and cosine similarity of additive composition (11, 12), which we refer to as ADD, from a sentence representation perspective.
Connection to Additive Composition. Surprisingly, ADD is a special case (i.e., a simplified version) of WRD. In fact, given a discrete-distribution representation containing only a single sentence vector, i.e., µ_s^point := {(s, 1)}, the EMD between two such single-point distributions under the cosine cost is simply the cosine distance between the sentence vectors, which is equivalent to ADD. This relationship between ADD and WRD becomes clearer when examining their sentence representations using the norms λ_i and direction vectors u_i:

  ADD:  s/Z = Σ_i (λ_i/Z) u_i,   (27a)
  WRD:  ν_s = Σ_i (λ_i/Z) δ[u_i],   (27b)

where Z := Σ_i λ_i and δ[·] is the Dirac delta function. Initially, the two appear quite similar. However, the key difference is that ADD treats a sentence as a single vector (the barycenter of the direction vectors), whereas WRD treats a sentence as a set of direction vectors. Given that STS tasks require word alignment (where words are treated disjointly), it is natural that WRD (where word vectors are treated disjointly) achieves better performance on STS tasks.10
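The single-point case can be verified in one line (our own check of the claim above): with µ_s^point = {(s, 1)} and µ_{s′}^point = {(s′, 1)}, the only feasible transportation plan is T = [1], so

  EMD(µ_s^point, µ_{s′}^point; c_cos) = 1 · c_cos(s, s′) = 1 − cos(s, s′),

which is a monotonically decreasing function of the cosine similarity cos(s, s′) used by ADD; the two therefore induce the same ranking of sentence pairs.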
10 In contrast, we have confirmed that ADD demonstrated higher performance than WRD on the topic similarity task (SICK-R; see Appendix E for details). For a task where it is sufficient to know the trend of the meaning of the entire sentence, it may be preferable to aggregate the meaning of the entire sentence into a single vector.

Connection to WMD. Why do WMD and WRD differ in performance on STS tasks even though both represent sentences as bag-of-word-vectors representations? The sentence representations underlying ADD and WMD are as follows:

  ADD:  s/n = Σ_i (1/n) w_i,   (28a)
  WMD:  µ_s = Σ_i (1/n) δ[w_i].   (28b)

The barycenters (27a) and (28a) for ADD are identical up to scale because λ_i u_i = w_i holds. In contrast, the discrete distributions (27b) for WRD and (28b) for WMD are quite different: WRD treats the norm λ as a weighting factor, as ADD implicitly does (28a), whereas WMD assigns uniform weights to both long and short vectors (28b). This is one reason the seemingly most natural representation (28b) employed by WMD does not work effectively. The difference in performance between WMD and WRD can also be explained by the difference in the transportation cost functions. Many word embeddings use the inner product in their training objective, i.e., the origin of the embedding space is meaningful, and the cosine distance used in WRD depends on the position of the origin. In contrast, the translation-invariant Euclidean cost used in WMD ignores the position of the origin.

Conclusion
In this paper, we first indicated (i) that the norm and angle of word vectors are good proxies for the importance of a word and the dissimilarity between words, respectively, and (ii) that some previous alignment-based STS methods inappropriately "mix up" these two types of information. With these findings, we proposed word rotator's distance (WRD), a new unsupervised, EMD-based STS metric. WRD is designed so that the norm and angle of word vectors correspond to the probability mass and transportation cost in EMD, respectively. In addition, we found that recent powerful sentence-vector estimation methods implicitly improve the norm and angle of word vectors, and that this effect can be exploited as a word vector converter (VC). In experiments on multiple STS tasks, the proposed methods outperformed not only alignment-based methods such as word mover's distance but also powerful addition-based sentence vectors.

A.2 Word Similarity Datasets
We used nine word similarity tasks in our experiments.
• Tokenization. In each experiment, we first tokenized all the STS datasets (except the Twitter dataset) with NLTK (Bird and Loper, 2004), with some post-processing steps following Ethayarajh (2018).17 The Twitter dataset had already been tokenized by the workshop organizers. We then lowercased all tokens to conduct experiments under the same conditions for cased and uncased embeddings.

A.4 Stopword List
The stopword list based on the SMART Information Retrieval System18 was used for WMD and for conceptor removal (C) (Liu et al., 2019a).

B Contextualized Word Embeddings on Unsupervised STS
BERT (Devlin et al., 2019) and its variants have not yet shown good results on unsupervised STS (note that, in supervised or semi-supervised settings where training data or external resources are available, BERT-based models achieve the current best results). One particularly promising usage of BERT-based models for unsupervised STS is BERTScore, which was originally proposed as an automatic evaluation metric. However, our preliminary experiments19 show that BERTScore with pre-trained BERT/RoBERTa performs poorly on unsupervised STS. Nonetheless, BERTScore is definitely promising as a method; we therefore reported the results of BERTScore using non-contextualized word vectors, e.g., GloVe, and confirmed higher performance compared to using pre-trained BERT. Needless to say, the application of BERT-based models to unsupervised STS is an important topic for future research.
17 https://github.com/kawine/usif
18 https://github.com/igorbrigadir/stopwords
19 We used BERT-large and RoBERTa-large. For embeddings, we used either the last layer or the concatenation of all layers. In the original paper, which allows the use of teacher data, the development set was used to select the layer.

C Algorithm of Word Rotator's Distance
The algorithm used in the actual computation of WRD is shown in Algorithm 1.
For the EMD computation, off-the-shelf libraries can be used.20 Note that most EMD (optimal transport) libraries take two probability (mass) vectors m ∈ R^n and m′ ∈ R^{n′} and a cost matrix C ∈ R^{n×n′} with C_ij = c(x_i, x′_j) as inputs. The parameters (m, m′, C) carry the same information as (µ, µ′, c) introduced in Section 4.3. The notation of Algorithm 1 follows this style.
The cosine distance 1 − cos(w_i, w_j) in line 7 of Algorithm 1 is equivalent to 1 − cos(u_i, u_j) in Equation 18. We adopted the former simply to reduce the computation steps.
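As a complement to Algorithm 1, the following is our own minimal sketch of this computation in Python; it assumes the POT library's emd2 interface, and the embedding lookup emb is a placeholder, so it is a sketch rather than the reference implementation.

```python
# Minimal WRD sketch: norms become probability masses, cosine distances become transport costs.
import numpy as np
import ot

def wrd(words_s, words_t, emb):
    """words_s, words_t: token lists; emb: word -> vector (placeholder)."""
    W_s = np.stack([emb[w] for w in words_s])
    W_t = np.stack([emb[w] for w in words_t])

    lam_s = np.linalg.norm(W_s, axis=1)      # norms = word weights
    lam_t = np.linalg.norm(W_t, axis=1)
    m_s = lam_s / lam_s.sum()                # probability masses (normalized norms)
    m_t = lam_t / lam_t.sum()

    U_s = W_s / lam_s[:, None]               # direction vectors
    U_t = W_t / lam_t[:, None]
    C = 1.0 - U_s @ U_t.T                    # cosine-distance cost matrix (Eq. 18)

    return ot.emd2(m_s, m_t, C)              # smaller value = more similar sentences
```

In practice, out-of-vocabulary words and empty sentences need additional handling before the norms are computed.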

D Algorithms of Vector Converter
Algorithm 2 summarizes the overall procedure of the word vector converter f_VC (Equation 22).
The hyperparameters used in Algorithm 2 were fixed in advance. Prior to performing all-but-the-top (A), we restricted the vocabulary of the word vectors to words appearing more than 200 times in the enwiki corpus,21 following Liu et al. (2019b).
We used the unigram probability P of English words estimated using the enwiki dataset, preprocessed by   22 .
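As a complement to Algorithm 2, the sketch below illustrates one plausible instantiation of f_VC (all-but-the-top for f_1, SIF-style weighting for α_2, and common component removal for f_3); the constants, helper names, and corpus statistics are placeholder assumptions rather than the paper's exact settings.

```python
# Sketch of a vector-converter pipeline: f_VC(w) = f3(alpha2(w) * f1(w)).
import numpy as np

def fit_all_but_the_top(E, d1=2):
    """f1: subtract the mean and the projections onto the top-d1 principal directions."""
    mu = E.mean(axis=0)                       # E: vocabulary embedding matrix, shape (|V|, d)
    _, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
    P = Vt[:d1]                               # top principal directions
    return lambda w: (w - mu) - P.T @ (P @ (w - mu))

def sif_weight(word, unigram_prob, a=1e-3):
    """alpha2: SIF-style scalar weight a / (a + p(word)); p(word) is a placeholder statistic."""
    return a / (a + unigram_prob.get(word, 1e-6))

def fit_common_component_removal(S, d3=1):
    """f3: remove the top-d3 common components of preliminary sentence vectors; a linear map."""
    _, _, Vt = np.linalg.svd(S, full_matrices=False)   # S: matrix of sentence vectors
    P = Vt[:d3]
    return lambda v: v - P.T @ (P @ v)

def convert(word, w, f1, unigram_prob, f3):
    return f3(sif_weight(word, unigram_prob) * f1(w))
```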
See Table 6 for an overview of the existing methods. There are many possible combinations of f_1, α_2, and f_3, and exploring them is a promising direction for future work.

See Table 7 for full results on the nine word similarity datasets.

E.2 Ablation Study
See Table 8 for full results. WRD without stopword removal achieves the best results. This is likely because WRD can compare the differences in importance between stopwords more continuously using their norms.

E.3 Benchmark Tasks
See Table 9 for full results in the unsupervised setting, and Table 10 for full results including (semi-)supervised approaches.

Algorithm 2 proceeds in three stages: compute the parameters of f_1 (e.g., for all-but-the-top), compute the parameters of α_2 for each word w (e.g., for SIF weighting), and compute the parameters of f_3 (for CCR or piecewise CCR, the top D_3 singular vectors and values (v_1, σ_1), ..., (v_{D_3}, σ_{D_3}) ← PCA({s_1, ..., s_{|S|}}); alternatively, conceptor removal); finally, each word vector in the vocabulary V is converted.

Table 7: The angle of word vectors is a good proxy for word similarity (full results). Spearman's ρ × 100 between the predicted and gold scores is reported. In each row, the best result and results where the difference from the best result was < 0.5 are indicated in bold. "+ AW" is omitted.

Table 8: The combination of WRD and VC gave the best performance (full results). Pearson's r × 100 between the predicted and gold scores is reported. The STS-B dataset (dev) was used. The best result and results where the difference from the best was < 0.5 in each row are in bold, and the best result for each word vector is further underlined.

Table 9: Pearson's r × 100 between the predicted and gold scores is shown. The best results in each block are in bold, and the best results regardless of the strategy for computing textual similarity are further underlined. The results of our methods are slanted. Each row marked (†) was re-implemented by us. Each value marked (‡) was taken from Perone et al. (2018).

Table 10: Pearson's r × 100 between the predicted and gold scores is shown. The best results in each dataset, word vector, and strategy for computing textual similarity ("Additive composition" or "Considering Word Alignment") are in bold; the best results regardless of the strategy are further underlined.