Word Usage Similarity Estimation with Sentence Representations and Automatic Substitutes

Usage similarity estimation addresses the semantic proximity of word instances in different contexts. We apply contextualized (ELMo and BERT) word and sentence embeddings to this task, and propose supervised models that leverage these representations for prediction. Our models are further assisted by lexical substitute annotations automatically assigned to word instances by context2vec, a neural model that relies on a bidirectional LSTM. We perform an extensive comparison of existing word and sentence representations on benchmark datasets addressing both graded and binary similarity. The best performing models outperform previous methods in both settings.


Introduction
Traditional word embeddings, like Word2Vec and GloVe, merge different meanings of a word into a single vector representation (Mikolov et al., 2013; Pennington et al., 2014). These pre-trained embeddings are fixed, and remain the same regardless of the context of use. Current contextualized sense representations, like ELMo and BERT, go to the other extreme and model meaning as word usage (Peters et al., 2018; Devlin et al., 2018). They provide a dynamic representation of word meaning, adapted to every new context of use.
In this work, we perform an extensive comparison of existing static and dynamic embedding-based meaning representation methods on the usage similarity (Usim) task, which involves estimating the semantic proximity of word instances in different contexts (Erk et al., 2009). Usim differs from a classical Semantic Textual Similarity task (Agirre et al., 2016) in its focus on a particular word in the sentence. We evaluate on this task word and context representations obtained using pre-trained uncontextualized word embeddings (GloVe) (Pennington et al., 2014), with and without dimensionality reduction (SIF) (Arora et al., 2017); context representations obtained from a bidirectional LSTM (context2vec) (Melamud et al., 2016); contextualized word embeddings derived from an LSTM bidirectional language model (ELMo) (Peters et al., 2018) and generated by a Transformer (BERT) (Devlin et al., 2018); and doc2vec (Le and Mikolov, 2014) and Universal Sentence Encoder representations (Cer et al., 2018). All these embedding-based methods provide direct assessments of usage similarity. The best representations are used as features in supervised models for Usim prediction, trained on similarity judgments. We combine direct Usim assessments, made by the embedding-based methods, with a substitute-based Usim approach. Building on previous work that used manually selected in-context substitutes as a proxy for Usim (Erk et al., 2013; McCarthy et al., 2016), we propose to automate the annotation collection step in order to scale up the method and make it operational on unrestricted text. We exploit annotations assigned to words in context by the context2vec lexical substitution model, which relies on word and context representations learned by a bidirectional LSTM from a large corpus (Melamud et al., 2016).
The main contributions of this paper can be summarized as follows:
• we provide a direct comparison of a wide range of word and sentence representation methods on the Usage Similarity (Usim) task, and show that current contextualized representations can successfully predict Usim;
• we propose to automate, and scale up, previous substitute-based Usim prediction methods;
• we propose supervised models for Usim prediction which integrate embedding and lexical substitution features;
• we propose a methodology for collecting new training data for supervised Usim prediction from datasets annotated for related tasks.
We test our models on benchmark datasets containing gold graded and binary word Usim judgments (Erk et al., 2013; Pilehvar and Camacho-Collados, 2019). Among the compared embedding-based approaches, the BERT model gives the best results on both types of data, providing a straightforward way to calculate word usage similarity. Our supervised model performs on par with BERT on the graded and binary Usim tasks when using embedding-based representations and clean lexical substitutes.

Related Work
Usage similarity is a means for representing word meaning which involves assessing in-context semantic similarity, rather than mapping to word senses from external inventories (Erk et al., 2009, 2013). This methodology followed from the gradual shift from word sense disambiguation models that would select the best sense in context from a dictionary, to models that reason about meaning by solely relying on distributional similarity (Erk and Padó, 2008; Mitchell and Lapata, 2008), or allow multiple sense interpretations (Jurgens, 2014). In Erk et al. (2009), the idea is to model meaning in context in a way that captures different degrees of similarity to a word sense, or between word instances.
Due to its high reliance on context, Usim can be viewed as a semantic textual similarity (STS) (Agirre et al., 2016) task with a focus on a specific word instance. This connection motivated us to apply methods initially proposed for sentence similarity to Usim prediction. More precisely, we build sentence representations using different types of word and sentence embeddings, ranging from the classical word-averaging approach with traditional word embeddings (Pennington et al., 2014) to more recent contextualized word representations (Peters et al., 2018; Devlin et al., 2018). We explore the contribution of each separate method to Usim prediction, and use the best performing ones as features in supervised models. These are trained on sentence pairs labelled with Usim judgments (Erk et al., 2009) to predict the similarity of new word instances.
Previous attempts at automatic Usim prediction involved obtaining vectors encoding a distribution of topics for every target word in context (Lui et al., 2012). In that work, Usim was approximated by the cosine similarity of the resulting topic vectors. We show how contextualized representations, and the supervised model that uses them as features, outperform topic-based methods on the graded Usim task.
We combine the embedding-based direct Usim assessment methods with substitute-based representations obtained using an unsupervised lexical substitution model. McCarthy et al. (2016) showed it is possible to model usage similarity using manual substitute annotations for words in context. In this setting, the set of substitutes proposed for a word instance describes its specific meaning, while the similarity of substitute annotations for different instances points to their semantic proximity. 1 We follow up on this work and propose a way to use substitutes for Usim prediction on unrestricted text, bypassing the need for manual annotations. Our method relies on substitute annotations proposed by the context2vec model (Melamud et al., 2016), which uses word and context representations learned by a bidirectional LSTM from a large corpus (UkWac; Baroni et al., 2009).

Table 1: Example sentence pairs from the Usim dataset (Erk et al., 2013) for the nouns paper.n (Usim score = 4.34) and coach.n (Usim score = 1.5), with the substitutes assigned by the annotators (GOLD). For comparison, we give the substitutes selected for these instances by the automatic substitution method (context2vec) used in our experiments, from two different pools of substitutes (AUTO-LSCNC and PPDB). More details on the automatic substitution configurations are given in Section 4.2.

The LexSub and Usim Datasets
We use the training and test datasets of the SemEval-2007 Lexical Substitution (LexSub) task (McCarthy and Navigli, 2007), which contain instances of target words in sentential context, hand-labelled with meaning-preserving substitutes. A subset of the LexSub data (10 instances x 56 lemmas) has additionally been annotated with graded pairwise Usim judgments (Erk et al., 2013). Each sentence pair received a rating (on a scale of 1-5) from multiple annotators, and the average judgment for each pair was retained. McCarthy et al. (2016) derive two additional scores from Usim annotations that denote how easy it is to partition a lemma's usages into sets describing distinct senses: Uiaa, the inter-annotator agreement for a given lemma, taken as the average pairwise Spearman's ρ correlation between the ranked judgments of the annotators; and Umid, the proportion of mid-range judgments over all instances for a lemma and all annotators.
In our experiments, we use 2,466 sentence pairs from the Usim data for training, development and testing of different automatic Usim prediction methods. Our models rely on substitutes automatically assigned to words in context using context2vec (Melamud et al., 2016), and on various word and sentence embedding representations. We also train a model using the gold substitutes, to test how well our models perform when substitute quality is high. The performance of the different models is evaluated by measuring how well they approximate the Usim scores assigned by annotators. Table 1 shows examples of sentence pairs from the Usim dataset (Erk et al., 2013) with the GOLD substitutes and Usim scores assigned by the annotators. The Usim score is high for similar instances, and decreases for instances that describe different meanings. The semantic proximity of two instances is also reflected in the similarity of their substitute sets. For comparison, we also give in the table the substitutes selected for these instances by the automatic context2vec substitution method used in our experiments (more details in Section 4.2).

The Concepts in Context Corpus
Given the small size of the Usim dataset, we extract additional training data for our models from the Concepts in Context (CoInCo) corpus (Kremer et al., 2014), a subset of the MASC corpus (Ide et al., 2008). CoInCo contains manually selected substitutes for all content words in a sentence, but provides no usage similarity scores that could be used for training. We construct our supplementary training data as follows: we gather all instances of a target word in the corpus with at least four substitutes, and keep pairs with (1) no overlap in substitutes, and (2) a minimum of 75% substitute overlap. 2 We view the first set of pairs as examples of completely different usages of a word (DIFF), and the second set as examples of identical usages (SAME). The two sets are unbalanced in terms of the number of instance pairs (19,060 vs. 2,556). We balance them by keeping in DIFF the 2,556 pairs with the highest number of substitutes.
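The pair construction above can be sketched in a few lines. The overlap normaliser (the smaller substitute set) and the function name are our assumptions, since the paper does not spell out the exact formula:

```python
def pair_label(subs_a, subs_b, min_overlap=0.75):
    """Label a pair of word instances by substitute-set overlap.
    Assumes both instances have at least four substitutes (as in the
    CoInCo extraction); the normaliser (smaller set) is our assumption."""
    a, b = set(subs_a), set(subs_b)
    overlap = len(a & b) / min(len(a), len(b))
    if overlap == 0.0:
        return "DIFF"   # no shared substitutes: completely different usages
    if overlap >= min_overlap:
        return "SAME"   # near-identical substitute sets: identical usages
    return None         # intermediate overlap: pair discarded
```

Pairs with intermediate overlap are simply dropped, which is why the resulting training data contains only extreme cases of similarity.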
We also annotate the data with substitutes using context2vec (Melamud et al., 2016), as described in Section 4.2. We apply an additional filtering step to the sentence pairs extracted from CoInCo, discarding instances of words that are not in the context2vec vocabulary and have no embeddings. We are left with 2,513 pairs in each class (5,026 in total). We use 80% of these pairs (4,020), together with the Usim data, to train our supervised Usim models described in Section 4.3. 3

The Word-in-Context dataset
The third dataset we use in our experiments is the recently released Word-in-Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019), version 0.1. WiC provides pairs of contextualized target word instances describing the same or different meaning, framing in-context sense identification as a binary classification task. For example, a sentence pair for the noun stream is: ['Stream of consciousness' - 'Two streams of development run through American history']. A system is expected to be able to identify that stream does not have the same meaning in the two sentences.
WiC sentences were extracted from example usages in WordNet (Fellbaum, 1998), VerbNet (Schuler, 2006), and Wiktionary. Instance pairs were automatically labeled as positive (T) or negative (F) (corresponding to the same/different sense) using information in the lexicographic resources, such as presence in the same or different synsets. Each word is represented by at most three instances in WiC, and repeated sentences are excluded. It is important to note that meanings represented in the WiC dataset are coarser-grained than WordNet senses. This was ensured by excluding WordNet synsets describing highly similar meanings (sister senses, and senses belonging to the same supersense). The human-level performance upper-bound on this binary task, as measured on two 100-sentence samples, is 80.5%. Inter-annotator agreement is also high, at 79%. The dataset comes with an official train/dev/test split containing 7,618, 702 and 1,366 sentence pairs, respectively. 4

Methodology
We experiment with two ways of predicting usage similarity: an unsupervised approach which relies on the cosine similarity of different kinds of word and sentence representations, and provides direct Usim assessments; and supervised models that combine embedding similarity with features based on substitute overlap. We present the direct Usim prediction methods in Section 4.1. In Section 4.2, we describe how substitute-based features were extracted, and in Section 4.3, we introduce the supervised Usim models.

Direct Usage Similarity Prediction
In the unsupervised Usim prediction setting, we apply different types of pre-trained word and sentence embeddings as follows: we compute an embedding for every sentence in the Usim dataset, and calculate the pairwise cosine similarity between the sentences available for a target word. Then, for every embedding type, we measure the correlation between sentence similarities and gold usage similarity judgments in the Usim dataset, using Spearman's ρ correlation coefficient. We experiment with the following embedding types.
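This direct-assessment procedure has two ingredients: cosine similarity between sentence representations, and Spearman's ρ against the gold judgments. A minimal numpy sketch (function names are ours; scipy.stats.spearmanr would do the same job):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    def ranks(a):
        a = np.asarray(a, dtype=float)
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(1, len(a) + 1)
        for v in np.unique(a):          # average ranks over ties
            mask = a == v
            r[mask] = r[mask].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])
```

In the evaluation, `cosine` is applied to every sentence pair of a target word, and `spearman_rho` correlates the resulting similarities with the gold Usim judgments.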
GloVe embeddings are uncontextualized word representations which merge all senses of a word in one vector (Pennington et al., 2014). We use 300-dimensional GloVe embeddings pre-trained on Common Crawl (840B tokens). 5 The representation of a sentence is obtained by averaging the GloVe embeddings of the words in the sentence.
SIF (Smooth Inverse Frequency) embeddings are sentence representations built by applying dimensionality reduction to a weighted average of uncontextualized embeddings of words in a sentence (Arora et al., 2017). We use SIF in combination with GloVe vectors.
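As a rough illustration of the SIF recipe (frequency-based down-weighting of word vectors, followed by removal of the first principal component), here is a toy numpy sketch. The inputs `emb` and `freqs` stand in for GloVe vectors and corpus unigram probabilities, and the weighting constant a = 1e-3 follows Arora et al.'s default:

```python
import numpy as np

def sif_embedding(sentences, emb, freqs, a=1e-3):
    """SIF sketch: weight each word vector by a / (a + p(w)), average per
    sentence, then remove the projection on the first singular vector.
    Returns the corrected sentence matrix and the removed direction."""
    X = np.stack([
        np.mean([(a / (a + freqs[w])) * emb[w] for w in s], axis=0)
        for s in sentences
    ])
    # first right-singular vector = dominant direction in embedding space
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u), u
```

After the correction, every sentence vector is orthogonal to the removed common direction `u`.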
Context2vec embeddings (Melamud et al., 2016). The context2vec model learns embeddings for words and their sentential contexts simultaneously. The resulting representations reflect: a) the similarity between potential fillers of a sentence with a blank slot, and b) the similarity of contexts that can be filled with the same word. We use a context2vec model pre-trained on the UkWac corpus (Baroni et al., 2009) 6 to compute embeddings for sentences with a blank at the target word's position.
ELMo (Embeddings from Language Models) representations are contextualized word embeddings derived from the internal states of an LSTM bidirectional language model (biLM) (Peters et al., 2018). In our experiments, we use a pre-trained 512-dimensional biLM. 7 Typically, the best linear combination of the layer representations for a word is learned for each end task in a supervised manner. Here, we use out-of-the-box embeddings (without tuning) and experiment with the top layer, and with the average of the three hidden layers. We represent a sentence in two ways: by the contextualized ELMo embedding obtained for the target word, and by the average of ELMo embeddings for all words in a sentence.
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018). BERT representations are generated by a 12-layer bidirectional Transformer encoder that jointly conditions on both left and right context in all layers. 8 BERT can be fine-tuned to specific end tasks, or its contextualized word representations can be used directly in applications, similar to ELMo. We try different layer combinations and create sentence representations, in the same way as for ELMo: using either the BERT embedding of the target word, or the average of the BERT embeddings for all words in a sentence.
Universal Sentence Encoder (USE) makes use of a Deep Averaging Network (DAN) encoder trained to create sentence representations by means of multi-task learning (Cer et al., 2018). USE has been shown to improve performance on different NLP tasks using transfer learning. 9

doc2vec is an extension of word2vec to the sentence, paragraph or document level (Le and Mikolov, 2014). One of its forms, dbow (distributed bag of words), is based on the skip-gram model, to which it adds a new feature vector representing a document. We use a dbow model trained on English Wikipedia, released by Lau and Baldwin (2016). 10

We test the above models with representations built from the whole sentence, and using a smaller context window (cw) around the target word. Sentences in the WiC dataset are quite short (7.9 ± 3.9 words), but the length of sentences in the Usim and CoInCo datasets varies a lot (27.4 ± 13.2 and 18.8 ± 10.2, respectively). We want to check whether information surrounding the target word in the sentence is more relevant, and sufficient, for Usim estimation. We focus on the words in a context window of ± 2, 3, 4 or 5 words on each side of the target word. We then average their word embeddings (for GloVe, ELMo and BERT), or derive an embedding from this specific window instead of the whole sentence (for USE).
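The context-window variant amounts to a simple slice around the target token; clipping at sentence boundaries is our assumption, and the function name is ours:

```python
def context_window(tokens, target_idx, cw):
    """Tokens within cw positions on each side of the target (target
    included), clipped at sentence boundaries."""
    lo = max(0, target_idx - cw)
    hi = min(len(tokens), target_idx + cw + 1)
    return tokens[lo:hi]
```

The returned tokens are then averaged (GloVe, ELMo, BERT) or re-encoded as a short sentence (USE).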
We approximate Usim by measuring the cosine similarity of the resulting context representations. We compare the performance of these direct assessment methods on the Usim dataset and report the results in Section 5.

Substitute-based Feature Extraction
Following up on McCarthy et al.'s (2016) sense clusterability work, we also experiment with a substitute-based approach to Usim prediction. McCarthy et al. showed that manually selected substitutes for word instances in context can be used as a proxy for Usim. Here, we propose an approach to obtain these annotations automatically, which can be applied to the whole vocabulary.
Automatic LexSub We generate rankings of candidate substitutes for words in context using the context2vec method (Melamud et al., 2016). The original method selects and ranks substitutes from the whole vocabulary. To facilitate comparison and evaluation, we use the following pools of candidates: (a) all substitutes that were proposed for a word in the LexSub and CoInCo annotations (we call this substitute pool AUTO-LSCNC); (b) the paraphrases of the word in the Paraphrase Database (PPDB) XXL package (Ganitkevitch et al., 2013; Pavlick et al., 2015) (AUTO-PPDB, http://paraphrase.org/). In the WiC experiments, where no substitute annotations are available, we only use PPDB paraphrases (AUTO-PPDB). We obtain a context2vec embedding for a sentence by replacing the target word with a blank. AUTO-LSCNC substitutes are high-quality, since they were extracted from the manual LexSub and CoInCo annotations. They are semantically similar to the target, and context2vec just needs to rank them according to how well they fit the new context. This is done by measuring the cosine similarity between each substitute's context2vec word embedding and the context embedding obtained for the sentence.
The AUTO-PPDB pool contains paraphrases from PPDB XXL, which were automatically extracted from parallel corpora (Ganitkevitch et al., 2013). Hence, this pool contains noisy paraphrases that should be ranked lower. To this end, we use in this setting the original context2vec scoring formula, which also accounts for the similarity between the target word and the substitute:

score(s, t, C) = ((cos(s, t) + 1) / 2) × ((cos(s, C) + 1) / 2)    (1)

In formula (1), s and t are the word embeddings of a substitute and the target word, and C is the context2vec vector of the context. Following this procedure, context2vec produces a ranking of candidate substitutes for each target word instance in the Usim, CoInCo and WiC datasets, according to their fit in context. Every candidate is assigned a score, with substitutes that are a good fit in a specific context ranked higher than others. For every new target word instance, context2vec ranks all candidate substitutes available for the target in each pool. Consequently, the automatic annotations produced for different instances of the target include the same set of substitutes, but in a different order. This does not allow for the use of measures based on substitute overlap, which were shown to be useful for Usim prediction by McCarthy et al. (2016). In order to use this type of measure, we propose ways to filter the automatically generated rankings, and keep for each instance only substitutes that are a good fit in context.
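A toy sketch of this scoring and ranking step, assuming formula (1) multiplies the shifted substitute-target and substitute-context cosines (all function names and vectors below are ours, with numpy arrays standing in for context2vec embeddings):

```python
import numpy as np

def c2v_score(s, t, C):
    """Formula (1): shift each cosine from [-1, 1] to [0, 1] and multiply,
    so a substitute must fit both the target word and the context."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float((cos(s, t) + 1) / 2 * (cos(s, C) + 1) / 2)

def rank_substitutes(candidates, t, C):
    """candidates: {substitute: embedding}; returns (word, score) pairs
    sorted from best to worst fit in the given context."""
    scored = [(w, c2v_score(e, t, C)) for w, e in candidates.items()]
    return sorted(scored, key=lambda p: -p[1])
```

Because the product penalises a mismatch with either the target or the context, noisy PPDB paraphrases end up lower in the ranking.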

Substitute Filtering
We test different filters to discard low-quality substitutes from the annotations proposed by context2vec for each instance.
• PPDB 2.0 score: Given a ranking R = [s_1, s_2, ..., s_n] of n substitutes proposed by context2vec, we form pairs of substitutes in adjacent positions {s_i, s_{i+1}}, and check whether they exist as paraphrase pairs in PPDB. We expect substitutes that are paraphrases of each other to be similarly ranked. If s_i and s_{i+1} are not paraphrases in PPDB, we keep all substitutes up to s_i and use this as a cut-off point, discarding substitutes from position s_{i+1} onwards in the ranking.
• GloVe word embeddings: We measure the cosine similarity (cosSim) between GloVe embeddings of adjacent substitutes {s_i, s_{i+1}} in the ranking R obtained for a new instance. We first compare the similarity of the first pair of substitutes (cosSim(s_1, s_2)) to a lower-bound similarity threshold T. If cosSim(s_1, s_2) exceeds T, we assume that s_1 and s_2 have the same meaning, and use cosSim(s_1, s_2) as a reference similarity value, S, for this instance. The middle point between the two values, M = (T + S)/2, is then used as a threshold to determine whether there is a shift in meaning in subsequent pairs. If cosSim(s_i, s_{i+1}) < M, for i > 1, then only the higher-ranked substitute (s_i) is retained and all subsequent substitutes in the ranking are discarded. The intuition behind this calculation is that if cosSim is much lower than the reference S (even if it exceeds T), the substitutes possibly have different senses.
• Context2vec score: This filter uses the score assigned by context2vec to each substitute, reflecting how good a fit it is in each context. context2vec scores vary a lot across instances, so it is not straightforward to choose a threshold. We instead refer to the scores assigned to adjacent pairs of substitutes in the ranking produced for each instance, R = [s_1, s_2, ..., s_n]. We view the pair with the biggest difference in scores as the cut-off point, considering that it reflects a degradation in substitute fit. We retain only substitutes up to this point.
• Highest-ranked X substitutes: We also test two simple baselines, which consist of keeping the 5 and 10 highest-ranked substitutes for each instance.
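The three adaptive filters above can each be sketched in a few lines; the function names are ours, and the inputs are toy stand-ins for the context2vec rankings, the PPDB pair set, GloVe vectors and context2vec scores:

```python
import numpy as np

def ppdb_filter(ranking, ppdb_pairs):
    """PPDB filter: cut at the first adjacent pair that is not recorded
    as a paraphrase pair (ppdb_pairs: set of frozensets)."""
    for i in range(len(ranking) - 1):
        if frozenset((ranking[i], ranking[i + 1])) not in ppdb_pairs:
            return ranking[: i + 1]
    return ranking

def _cos(emb, a, b):
    u, v = emb[a], emb[b]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def glove_filter(ranking, emb, T=0.2):
    """GloVe filter: cut when an adjacent similarity drops below the
    midpoint M = (T + S) / 2. The behaviour when the top pair is already
    below T is our assumption (keep only s_1)."""
    if len(ranking) < 2:
        return ranking
    S = _cos(emb, ranking[0], ranking[1])
    if S <= T:
        return ranking[:1]
    M = (T + S) / 2
    for i in range(1, len(ranking) - 1):
        if _cos(emb, ranking[i], ranking[i + 1]) < M:
            return ranking[: i + 1]
    return ranking

def score_gap_filter(ranking, scores):
    """context2vec-score filter: cut at the adjacent pair with the
    largest drop in score."""
    if len(ranking) < 2:
        return ranking
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return ranking[: cut + 1]
```

All three return a prefix of the original ranking, so the relative order assigned by context2vec is always preserved.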
We test the efficiency of each filter on the portion of the LexSub dataset (McCarthy and Navigli, 2007) that was not annotated for Usim. We compare the substitutes retained for each instance after filtering to its gold LexSub substitutes using the F1-score, and the proportion of false positives out of all positives. Filtering results are reported in Appendix A. The best filters were GloVe word embeddings (T = 0.2) for AUTO-LSCNC, and the PPDB filter for AUTO-PPDB.

Feature Extraction After annotating the Usim sentences with context2vec and filtering, we extract, for each sentence pair (S_1, S_2), a set of features related to the amount of substitute overlap.
• Common substitutes. The proportion of shared substitutes between two sentences.
• GAP score. The average of the Generalized Average Precision (GAP) score (Kishida, 2005) taken in both directions (GAP(S_1, S_2) and GAP(S_2, S_1)). GAP is a measure that compares two rankings, considering not only the order of the ranked elements but also their weights. It ranges from 0 to 1, where 0 means that the rankings are completely different and 1 indicates perfect agreement. We use the frequency in the manual Usim annotations (i.e. the number of annotators who proposed each substitute) as the weight for gold substitutes, and the context2vec score for automatic substitutes. We use the GAP implementation from Melamud et al. (2015).
• Substitute cosine similarity. We form pairs of substitutes across the two sentences (S_1 ↔ S_2) and calculate the average of their GloVe cosine similarities. This feature shows the semantic similarity of the substitutes, even when overlap is low.
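Two of these features are easy to make concrete; the union normaliser in the overlap proportion is our assumption (the paper does not give the exact formula), and for the GAP score we defer to Melamud et al.'s implementation, as the paper does:

```python
import numpy as np

def common_substitutes(subs1, subs2):
    """Proportion of shared substitutes between two instances; the
    union-based normalisation is our assumption."""
    a, b = set(subs1), set(subs2)
    return len(a & b) / len(a | b) if a or b else 0.0

def substitute_cosine(subs1, subs2, emb):
    """Average pairwise cosine similarity across the two substitute sets
    (emb: toy stand-in for GloVe vectors)."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims = [cos(emb[x], emb[y]) for x in subs1 for y in subs2]
    return sum(sims) / len(sims)
```

The second feature backs off gracefully: two instances with disjoint but synonymous substitute sets still receive a high score.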

Supervised Usim Prediction
We train linear regression models to predict Usim scores for word instances in different contexts, using as features the cosine similarity of the different representations in Section 4.1, and the substitute-based features in Section 4.2. For training, we use the Usim dataset on its own (cf. Section 3.1), and combined with the additional training examples extracted from CoInCo (cf. Section 3.2).
To be able to evaluate the performance of our models separately for each of the 56 target words in the Usim dataset, we train a separate model for each word in a leave-one-out setting. Each time, we use 2,196 pairs for training, 225 for development and 45 for testing. 12 Each model is evaluated on the sentences corresponding to the left out target word. We report results of these experiments in Section 5. The performance of the model with context2vec substitutes from the two substitute pools is compared to that of the model with gold substitute annotations. We replicate the experiments by adding CoInCo data to the Usim training data.
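The leave-one-out protocol can be sketched with a plain least-squares regressor; numpy is a stand-in for the paper's linear regression, and the function names and toy feature matrix are ours:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with a bias column."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def leave_one_word_out(features, scores, words):
    """One model per target word, trained on every other word's pairs
    and evaluated on the held-out word's pairs."""
    preds = {}
    for tgt in set(words):
        tr = [i for i, w in enumerate(words) if w != tgt]
        te = [i for i, w in enumerate(words) if w == tgt]
        preds[tgt] = predict(fit_linear(features[tr], scores[tr]), features[te])
    return preds
```

Holding out all pairs of a lemma, rather than random pairs, ensures the model is never evaluated on a word it has seen during training.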
To test the contribution of each feature, we perform an ablation study on the 225 Usim sentence pairs of the development set, which cover the full spectrum of Usim scores (from 1 to 5). We report results of the feature ablation in Appendix C.
We also build a model for the binary Usim task on the WiC dataset (Pilehvar and Camacho-Collados, 2019), using the official train/dev/test split. We train a logistic regression classifier on the training set, and use the development set to select the best among several feature combinations. We report results of the best performing models on the WiC test set in Section 5. For instances in WiC where no PPDB substitutes are available (133 out of 1,366 in the test set), we back off to a model that only relies on the embedding features.
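A minimal stand-in for this classifier, with logistic regression written out as batch gradient descent (any off-the-shelf implementation would do; the single toy feature mimics an embedding cosine similarity):

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=3000):
    """Batch gradient descent for logistic regression with a bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))     # predicted P(same sense)
        w -= lr * Xb.T @ (p - y) / len(y)     # cross-entropy gradient
    return w

def classify(w, X):
    """1 = same sense (T), 0 = different sense (F)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) >= 0.5).astype(int)
```

In the full setup, pairs without PPDB substitutes would be routed to a second model trained on the embedding features only.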

Evaluation

Direct Usim Prediction
Correlation results between Usim judgments and the cosine similarity of the embedding representations described in Section 4.1 are found in Table 2. Detailed results for all context window combinations are given in Appendix B. We observe that target word BERT embeddings give the best performance on this task. Selecting a context window around (or including) the target word does not always help; on the contrary, it can harm the models. Context2vec sentence representations are the next best performing representation after BERT, but their correlation is much lower. The simple GloVe-based SIF approach to sentence representation, which consists in applying dimensionality reduction to a weighted average of the GloVe vectors of the words in a sentence, is much superior to the simple average of GloVe vectors, and even better than doc2vec sentence representations, obtaining a correlation comparable to that of USE.
Graded Usim To evaluate the performance of our supervised models, we measure the correlation of the predictions with human similarity judgments on the Usim dataset using Spearman's ρ. Results reported in Table 3 are the average of the correlations obtained for each target word with gold and automatic substitutes (from the two substitute pools), and for each type of features, substitute-based and embedding-based (cosine similarities from BERT and context2vec). We also report results with the additional CoInCo training data. Unsurprisingly, the best results are obtained by the methods that use the gold substitutes. This is consistent with previous analyses by Erk et al. (2009), who found overlap in manually proposed substitutes to correlate with Usim judgments. The lower performance of features that rely on automatically selected substitutes (AUTO-LSCNC and AUTO-PPDB) demonstrates the impact of substitute quality on the contribution of this type of features. The addition of CoInCo data does not seem to help the models, as results are slightly lower than in the Usim-only setting. This can be due to the fact that the CoInCo data contains only extreme cases of similarity (SAME/DIFF) and no intermediate ratings. The slight improvement of the combined settings over embedding-based models is not significant with AUTO-LSCNC substitutes, but it is with gold substitutes (p < 0.001). 13

For comparison with the topic-modelling approach of Lui et al. (2012), we evaluate on the 34 lemmas used in their experiments. They report a correlation calculated over all instances. With the exception of the substitute-only setting with PPDB candidates, all of our Usim models obtain higher correlation than their model (ρ = 0.202), with ρ = 0.512 for the combination of AUTO-LSCNC substitutes and embeddings. The average of the per-target-word correlations in Lui et al. (2012) (ρ = 0.388) is still lower than that of our AUTO-LSCNC model in the combined setting (ρ = 0.500).

Binary Usim
We evaluate the predictions of our binary classifiers by measuring accuracy on the test portion of the WiC dataset. Results for the best configurations for each training set are reported in Table 4. Experiments on the development set showed that target word BERT representations and USE sentence embeddings are the best suited for WiC; therefore, 'embedding-based features' here refers to these two representations. Results on the development set can be found in Appendix D. All configurations obtain higher accuracy than the previous best reported result on this dataset (59.4) (Pilehvar and Camacho-Collados, 2019), obtained using DeConf vectors, which are multi-prototype embeddings based on WordNet knowledge (Pilehvar and Collier, 2016). As in the graded Usim experiments, adding substitute-based features to embedding features slightly improves the accuracy of the model. Also, combining the CoInCo and WiC data for training does not have a clear impact on results, even in this binary classification setting.

Discussion
Results reported for Usim are the average correlation per target word, but the strength of the correlation varies greatly across words for all models and settings. For example, in the case of direct Usim prediction with embeddings using the BERT target word representation, Spearman's ρ ranges from 0.805 (for the verb fire) to -0.111 (for the verb suffer). This variation in performance is not surprising, since annotators themselves found some lemmas harder to annotate than others, as reflected in the Usim inter-annotator agreement measure (Uiaa) (McCarthy et al., 2016). We find that BERT target word embedding results correlate with Uiaa per target word (ρ = 0.59, p < 0.05), showing that the performance of this model depends to a certain extent on the ease of annotation for each lemma. Uiaa also correlates with the standard deviation of average Usim scores by target word (ρ = 0.66, p < 0.001). Indeed, average Usim values for the word suffer do not exhibit high variance, as they only range from 3.6 to 4.9. Within a smaller range of scores, a strong correlation is harder to obtain. The negative correlation between Uiaa and Umid (-0.46, p < 0.001) also suggests that words with higher disagreement tend to exhibit a higher proportion of mid-range judgments. We believe this analysis highlights the differences in usage similarity across target words, and encourages a by-lemma approach where the specificities of each lemma are taken into account.

Conclusion
We applied a wide range of existing word and context representations to graded and binary usage similarity prediction. We also proposed novel supervised models which use the best performing embedding representations as features, and make high-quality predictions, especially in the binary setting, outperforming previous approaches. The supervised models include features based on in-context lexical substitutes. We show that automatic substitute annotations constitute an alternative to manual annotation when combined with embedding-based features. Nevertheless, if there is no specific reason for using substitutes to measure Usim, BERT offers a much more straightforward solution to the Usim prediction problem.
In future work, we plan to use automatic Usim predictions for estimating word sense partitionability. We believe such knowledge can be useful to determine the appropriate meaning representation for each lemma.

Acknowledgments
We would like to thank the anonymous reviewers for their helpful feedback on this work. We would also like to thank Jose Camacho-Collados for his help with the WiC experiments.
The work has been supported by the French National Research Agency under project ANR-16-CE33-0013.

A Filtering experiments
Tables 5 and 6 contain results obtained using the different substitute filters described in Section 4.2.
We measure the quality of the substitutes retained in the automatic ranking produced by context2vec after filtering against gold substitute annotations in LexSub data. Here, we only use the portion of LexSub data that does not contain Usim judgments.
We measure filtered substitute quality against the gold standard using the F1-score, and the proportion of false positives (FP) over all positives (TP+FP). Table 5 shows results for annotations assigned by context2vec using the LexSub/CoInCo pool of substitutes (AUTO-LSCNC). Table 6 shows results for context2vec annotations with the PPDB pool of substitutes (AUTO-PPDB).

B Direct Usage Similarity Estimation
Correlations between gold Usim scores for all words and the cosine similarities of the different embedding types can be found in Tables 7 and 8.

C Feature Ablation on Usim
Results of feature ablation experiments on the Usim development sets are given in Table 9.