Swords: A Benchmark for Lexical Substitution with Improved Data Coverage and Quality

We release a new benchmark for lexical substitution, the task of finding appropriate substitutes for a target word in a context. For writing, lexical substitution systems can assist humans by suggesting words that humans cannot easily think of. However, existing benchmarks depend on human recall as the only source of data, and therefore lack coverage of the substitutes that would be most helpful to humans. Furthermore, annotators often provide substitutes of low quality, which are not actually appropriate in the given context. We collect higher-coverage and higher-quality data by framing lexical substitution as a classification problem, guided by the intuition that it is easier for humans to judge the appropriateness of candidate substitutes than conjure them from memory. To this end, we use a context-free thesaurus to produce candidates and rely on human judgement to determine contextual appropriateness. Compared to the previous largest benchmark, our Swords benchmark has 3x as many substitutes per target word for the same level of quality, and its substitutes are 1.4x more appropriate (based on human judgement) for the same number of substitutes.


Introduction
Imagine you are writing the message "I read an amazing paper today" to a colleague, but you want to choose a more descriptive adjective to replace "amazing." At first you might think of substitutes like "awesome" and "great," but feel that these are unsatisfactory. You turn to a thesaurus for inspiration, but among reasonable alternatives like "incredible" and "fascinating" are words like "prodigious" that do not quite fit your context. Ultimately, you choose "fascinating," but reaching this decision required a non-trivial amount of time and effort.
Research on lexical substitution (McCarthy, 2002;McCarthy and Navigli, 2007;Erk and Padó, 2008;Szarvas et al., 2013;Kremer et al., 2014;Melamud et al., 2015;Hintz and Biemann, 2016;Zhou et al., 2019;Arefyev et al., 2020) considers the task of replacing a target word in context with appropriate substitutes. There are two widely-used English benchmarks for this task: SEMEVAL (McCarthy and Navigli, 2007) and COINCO (Kremer et al., 2014). For both benchmarks, data was collected by asking human annotators to think of substitutes from memory. Because lexical substitution was originally proposed as a means for evaluating word sense disambiguation systems (McCarthy, 2002), this data collection strategy was designed to avoid a bias towards any particular word sense inventory.
In this work, we consider a different use case for lexical substitution: writing assistance. For this use case, we are interested in evaluating a system's ability to produce appropriate substitutes that are likely to be difficult for humans to think of. We show that the data collection strategy used in past benchmarks yields low coverage of such uncommon substitutes: for our earlier example, they might contain words like "awesome" and "great," but miss words like "incredible" and "fascinating." Furthermore, we observe that these benchmarks suffer from low quality, containing words like "fun," which are easy to think of but not quite appropriate in context.
We present SWORDS (the Stanford Word Substitution Benchmark), an English lexical substitution benchmark that raises the bar for both coverage and quality (Table 1). We collect SWORDS by asking human annotators to judge whether a given candidate word is an appropriate substitute for a target word in context, following the intuition that judging a given substitute is easier than producing that same substitute from memory. To bootstrap a set of candidates for humans to annotate, we look up target words in an existing context-free thesaurus. Because a thesaurus may miss substitutes that are not typically synonymous with the target word outside of the provided context (e.g. "thought-provoking" for "amazing"), we also include human-proposed candidates from the previous COINCO benchmark.
Determining whether a substitute is appropriate is intrinsically subjective. To address this, we collect binary labels from up to ten annotators for each substitute, inducing a score for each substitute. In COINCO, analogous scores are derived from the number of independent annotators who thought of a substitute; hence, as we show in Section 4, these scores tend to correspond more to ease-of-recollection than to appropriateness. In contrast, scores from SWORDS correspond to appropriateness, and also allow us to explicitly trade off coverage and quality, permitting more nuanced evaluation. Our analysis shows that compared to COINCO, SWORDS has 3x more substitutes per target word for the same level of quality, and its substitutes are 1.4x more appropriate based on scores for the same number of substitutes.
We demonstrate that SWORDS is a challenging benchmark by evaluating state-of-the-art lexical substitution systems and large-scale pre-trained language models, including systems based on BERT (Devlin et al., 2019; Zhou et al., 2019) and GPT-3 (Brown et al., 2020). In our evaluation, we find that humans substantially outperform all existing systems, suggesting that lexical substitution can be used as a downstream language understanding task for pre-trained models. We release SWORDS publicly as a benchmark for lexical substitution, coupled with a Python library that includes previous benchmarks in a common format, standardized evaluation scripts for prescribed metrics, and reproducible re-implementations of several baselines.
Lexical substitution. Lexical substitution is the task of generating a list of substitutes w′ that can replace a given target word w in a given context c (McCarthy, 2002). The context c is one or more sentences in which the target word w is situated. The target word w is one word in the context, either manually chosen by humans (McCarthy and Navigli, 2007) or automatically selected based on its part of speech (Kremer et al., 2014). A substitute w′ can be a word or a phrase. Note that the task of lexical substitution does not consider inflection and does not involve grammar correction; all benchmarks contain lemmas as substitutes (e.g. "run" instead of "ran").
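Concretely, a benchmark entry can be represented as a simple record. The sketch below is our own illustration of one way to store a (c, w, w′) triple, not the released data format:

```python
from dataclasses import dataclass

@dataclass
class SubstitutionInstance:
    """One lexical substitution example: a target word in context
    paired with a candidate substitute (stored as a lemma)."""
    context: str        # one or more sentences containing the target
    target: str         # the target word as it appears in the context
    target_offset: int  # character offset of the target in the context
    substitute: str     # candidate substitute, lemmatized (e.g. "run", not "ran")

ex = SubstitutionInstance(
    context="I read an amazing paper today.",
    target="amazing",
    target_offset=10,
    substitute="fascinating",
)
# The offset locates the target inside the context:
assert ex.context[ex.target_offset:ex.target_offset + len(ex.target)] == ex.target
```

Storing the character offset (rather than just the word) disambiguates repeated occurrences of the same word within a context.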
SEMEVAL. The first lexical substitution benchmark, SEMEVAL-2007 Task 10 (McCarthy and Navigli, 2007), contains 201 manually chosen target words. For each target word, 10 sentences were chosen as contexts (mostly at random, but in part by hand) from the English Internet Corpus (Sharoff, 2006) and presented to five human annotators. The five annotators were instructed to produce up to three substitutes from memory as a replacement for the target word in context that "preserves the meaning of the original word." This resulted in 12,300 labels in total, with four substitutes per target word on average.
COINCO. The previous largest lexical substitution benchmark, COINCO (Kremer et al., 2014), was constructed by first choosing 2474 contexts from the Manually Annotated Sub-Corpus (Ide et al., 2008, 2010). Then, all content words (nouns, verbs, adjectives, and adverbs) in the sentences were selected as target words in order to reflect a realistic frequency distribution of target words and their senses. Each target word was presented to six human annotators, who were asked to provide up to five substitutes or mark it as unsubstitutable. All annotators were instructed to provide (preferably single-word) substitutes for the target that "would not change the meaning." This resulted in 167,446 labels in total and 7.2 substitutes per target word on average. For the rest of the paper, we focus on COINCO (rather than SEMEVAL), as our benchmark builds on COINCO and it is the largest existing benchmark.

Our benchmark
SWORDS is composed of context, target word, and substitute triples (c, w, w′), each of which has a score that indicates the appropriateness of the substitute. We consider a substitute to be acceptable if its score is greater than 50% (e.g. bolded words in Table 1) and unacceptable if its score is less than or equal to 50%. Similarly, a substitute with a score greater than 0% is considered conceivable, and otherwise inconceivable. Note that these terms are operational definitions for convenience, and different thresholds can be chosen for a desired application.
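These operational definitions amount to simple thresholding on scores. A minimal sketch (the function name and return format are ours, not part of the released library):

```python
def categorize(score: float) -> dict:
    """Operational labels used in SWORDS: a substitute is 'acceptable'
    if its score (fraction of positive judgements) exceeds 50%, and
    'conceivable' if it exceeds 0%. Thresholds can be adjusted for
    other applications."""
    return {
        "acceptable": score > 0.5,
        "conceivable": score > 0.0,
    }

assert categorize(0.9) == {"acceptable": True, "conceivable": True}
assert categorize(0.3) == {"acceptable": False, "conceivable": True}
assert categorize(0.0) == {"acceptable": False, "conceivable": False}
assert categorize(0.5) == {"acceptable": False, "conceivable": True}  # threshold is strict
```

Note that every acceptable substitute is also conceivable, so the two thresholds induce nested sets.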

Addressing limitations of past work
Improving quality. In prior work, annotators were prompted to consider whether a substitute "preserves the meaning" (McCarthy and Navigli, 2007) or "would not change the meaning" (Kremer et al., 2014) of the target word. Instead, we ask annotators whether they "would actually consider using this substitute as the author of the original sentence." We believe this wording encourages a higher standard. In Section 4.1, we provide evidence that substitutes from SWORDS have higher quality than those from COINCO on average.
Improving coverage. For prior benchmarks, annotators were asked to generate a list of substitutes from memory. Psycholinguistic studies have shown that when humans are asked to predict the next word of a sentence, they deviate systematically from the true corpus probabilities (Smith and Levy, 2011;Eisape et al., 2020). Thus, we may reasonably expect that asking humans to generate substitutes would similarly lead to systematic omissions of some appropriate substitutes.
We observe that prior benchmarks exclude many appropriate substitutes that are difficult for humans to think of (Section 4.2). To address this limitation, we first obtain a set of candidate substitutes and then ask annotators to judge whether they would consider using a given candidate to replace the target word in the context. That is, given a context c, target word w, and candidate substitute w′, we ask humans to assign a binary label, where a positive label (1) corresponds to "I would actually consider using this substitute as the author of the original sentence," and a negative label (0) to the opposite. As described in Section 3.2, we annotate a large pool of candidate substitutes to ensure high coverage of all possible substitutes. We confirm that this increases coverage compared to COINCO in Section 4.2.
Redefining scores to reflect appropriateness. In past work, each substitute w′ has an associated score defined as the number of annotators who produced w′ given the associated context c and target word w. Instead, we define the score as the fraction of annotators who judged w′ to be an appropriate replacement for w. We argue that the previous definition of score reflects ease-of-recollection, but not necessarily appropriateness. In Section 4.3, we show that our definition of score better represents the appropriateness of each substitute.

Data collection
We collect substitutes and scores for a context and target word pair (c, w) via the following three steps.
Step 1: Select contexts, targets, and substitutes. We use a subset of the contexts and target words from COINCO. Concretely, we start with the (c, w) pairs in COINCO and randomly select one w per c to annotate. Here, the context c consists of three sentences, where the middle sentence contains the target word w. Next, we choose a set of candidate substitutes w′ to annotate for each (c, w) pair, as framing annotation as binary classification requires determining the set of candidate substitutes a priori. We use the human-generated substitutes from COINCO, then add substitutes suggested by a thesaurus (see Appendix A.2 for details). In principle, candidate substitutes could be retrieved from any lexical resource or even sampled from a generative model, which we leave as future work. By combining candidates from COINCO and the thesaurus, we increase the coverage of acceptable substitutes.
Step 2: Reduce the pool of substitutes. Given the list of candidate substitutes from the previous step, we collect three binary labels for each (c, w, w′) triple (see Section 3.3 for details). We then pass any substitute with at least one positive label to Step 3 to collect fine-grained scores. We show in Section 4.4 that the probability of an acceptable substitute being incorrectly filtered out as inconceivable (three negative labels) is very low (0.8%).
Step 3: Collect fine-grained scores. In the final step, we collect seven more binary labels on the substitutes which received at least one positive label from Step 2. This yields a total of 10 binary labels for the substitutes.
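The two-stage labeling scheme above can be sketched as follows. This is a hypothetical helper illustrating the aggregation logic, not the actual collection code:

```python
def score_substitute(step2_labels, step3_labels=None):
    """Sketch of the two-stage labeling scheme: Step 2 collects three
    binary labels, and substitutes with no positive label are filtered
    out. Survivors receive seven more labels in Step 3, and the final
    score is the fraction of positives over all ten labels."""
    assert len(step2_labels) == 3
    if sum(step2_labels) == 0:
        return None  # filtered out as inconceivable after Step 2
    labels = step2_labels + step3_labels
    assert len(labels) == 10
    return sum(labels) / len(labels)

assert score_substitute([0, 0, 0]) is None
assert score_substitute([1, 0, 1], [1, 1, 0, 1, 0, 1, 1]) == 0.7
```

The staged design spends the bulk of the annotation budget (seven of ten labels) only on substitutes that at least one annotator found plausible.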

Crowdsourcing
We used Amazon Mechanical Turk (AMT) to crowdsource labels on substitutes. Each Human Intelligence Task (HIT) contained a target word highlighted in the context and at most 10 candidate substitutes for the target word. Each candidate substitute had three radio buttons for positive, negative, and abstain. Annotators were asked to choose positive if they would actually consider using the substitute to replace the target word as the author of the context, negative if they would not consider using the substitute, and abstain if they do not know the meaning of the substitute. We treated all abstain labels (1.24% of total labels) as negative labels, thereby making it binary. The benchmark includes abstain labels to maintain the option for them to be handled separately (e.g. excluded) in the future. The interface, instructions, qualification conditions, and filtering criteria used for crowdsourcing can be found in Appendix B.

Data analysis

Table 2 shows overall statistics of our benchmark. SWORDS comprises a total of 1250 context and target word pairs (494 nouns, 448 verbs, 189 adjectives, 119 adverbs) and 71,813 total labeled substitutes (including both acceptable and unacceptable substitutes). For brevity, we defer an analysis of annotator agreement to Appendix C.1.

High quality
With our notion of acceptability, we first observe that 75.4% of the substitutes from COINCO are considered unacceptable (receiving scores of at most 50% from our human annotators), and 28.6% are even inconceivable (receiving scores of 0%). Table 3 shows examples of substitutes that received relatively high scores under COINCO yet were considered unacceptable under SWORDS. At the same size as COINCO (taking the subset of our benchmark with the highest-scoring substitutes per target), the average score of the substitutes is 4.9 for SWORDS and 3.4 for COINCO, i.e. 1.4x higher quality. Furthermore, SWORDS reduces potential noise by providing fine-grained scores that account for appropriateness (Section 4.3) as well as explicitly inconceivable substitutes, which are useful for evaluation (Section 5.2).
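The matched-size comparison above can be sketched as follows. This is a toy illustration with made-up scores; the helper name is ours:

```python
def matched_size_avg_score(scores_per_target, n_per_target):
    """Hypothetical sketch of a matched-size quality comparison: for
    each target word, keep only the n highest-scoring substitutes
    (matching the smaller benchmark's size per target), then average
    scores over all kept substitutes."""
    kept = []
    for scores in scores_per_target:
        kept.extend(sorted(scores, reverse=True)[:n_per_target])
    return sum(kept) / len(kept)

# Toy example: made-up scores (e.g. number of positive labels out of 10)
# for two target words, truncated to the top 2 substitutes each.
swords_scores = [[9, 8, 7, 3, 1], [10, 6, 2]]
avg = matched_size_avg_score(swords_scores, n_per_target=2)
assert avg == (9 + 8 + 10 + 6) / 4
```

Truncating both benchmarks to the same number of substitutes per target keeps the comparison of average scores fair despite their different sizes.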

High coverage
We show that SWORDS achieves high coverage. Among the conceivable substitutes in SWORDS, 14.4% appear only in COINCO (COINCO-only), 14.6% are common to both COINCO and the thesaurus (COINCO ∩ Thesaurus), and 71.1% come only from the thesaurus (Thesaurus-only). Among the acceptable substitutes, 24% are COINCO-only, 37.1% are COINCO ∩ Thesaurus, and 38.9% are Thesaurus-only. This suggests that a substantial number of substitutes are not present in COINCO. Overall, SWORDS contains 3.9 acceptable and 20.1 conceivable substitutes per target word on average, increasing these numbers by nearly 2x and 3x over COINCO, respectively. In addition, we find that substitutes from COINCO-only are more likely to be common words, whereas substitutes from Thesaurus-only are more likely to be rare words. We compute the Zipf frequency (Speer et al., 2018) of each substitute based on the Google n-gram corpus (Brants and Franz, 2006) and bucket conceivable substitutes into three groups: uncommon (Zipf frequency ≤ 3.5), common (> 4.5), and neutral (in between). We observe that substitutes from COINCO-only are more likely to be common words (52.7%) than those from Thesaurus-only (38%). On the other hand, substitutes from Thesaurus-only tend more toward uncommon words (29%) than those from COINCO-only (17.5%).
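The frequency bucketing can be sketched as below. The Zipf values here are made up for illustration; the analysis computes them from the Google n-gram corpus:

```python
def frequency_band(zipf: float) -> str:
    """Bucket a substitute by Zipf frequency: 'uncommon' (<= 3.5),
    'common' (> 4.5), and 'neutral' in between."""
    if zipf <= 3.5:
        return "uncommon"
    if zipf > 4.5:
        return "common"
    return "neutral"

# Made-up Zipf frequencies for illustration only:
zipf = {"great": 5.6, "incredible": 4.2, "prodigious": 2.9}
assert frequency_band(zipf["great"]) == "common"
assert frequency_band(zipf["incredible"]) == "neutral"
assert frequency_band(zipf["prodigious"]) == "uncommon"
```

On the Zipf scale, a value of 3.5 corresponds to roughly one occurrence per 3 million words, so the "uncommon" bucket captures genuinely rare vocabulary.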

Reflection of appropriateness in scores
We show that scores in SWORDS better reflect the appropriateness of each substitute compared to COINCO both quantitatively and qualitatively. We find that if a substitute has a high score under COINCO (score > 1), it is likely to be acceptable under SWORDS (score > 50%) almost all the time (99.6%). However, the converse does not hold: the acceptable substitutes under SWORDS have low scores (score ≤ 1) under COINCO half of the time (49.4%). Intuitively, this is because COINCO's scores reflect the ease of producing the substitute from memory, whereas SWORDS's scores reflect the appropriateness of the substitute. Table 3 shows examples of context, target word, and substitute triples which received a low score from COINCO but a high score from SWORDS.

Validation with additional data
We show that the probability of an acceptable substitute being falsely filtered out in Step 2 is very low. To this end, we collected 10 additional labels on 100 context-target word pairs randomly selected from the test set, without reducing the pool of substitutes as in Step 2. Comparing the first three labels to the full set of 10, we find that 35.5% of substitutes without any positive labels in Step 2 would have received one or more positive labels had they been kept for Step 3. However, 99.2% of these substitutes were eventually considered unacceptable (judged by 10 labels), indicating that the probability of an acceptable substitute being incorrectly filtered out in Step 2 is very low (0.8%).

Table 3: Examples of context, target word, and substitute triples with their scores under COINCO (max: 6) and SWORDS (max: 100%).

Context with target word | Substitute | COINCO's score | SWORDS's score
I don't wish to be a spokesman for any campaign. | effort | 3 | 0%
She was heading for a drink and slipped out of the crowd. | look | 2 | 10%
"Name me," she said. | nickname | 2 | 0%
The e-commerce free zone is situated in north Dubai. | district | 1 | 90%
She will have reunions in the next few weeks. | forthcoming | 1 | 60%
It's very reassuring that I'll not only be an outsider but a curiosity. | extraordinarily | 0 | 70%

Score distribution

Figure 1 shows the score distribution of substitutes in SWORDS along with the source of substitutes: COINCO-only, COINCO ∩ Thesaurus, or Thesaurus-only. (Figure 1 caption: Score distribution of SWORDS's substitutes by source; substitutes with score 0% are omitted to keep the bars visually distinguishable.) Across scores, neither COINCO nor the thesaurus completely dominates, and the overlap between the two is quite small, indicating the necessity of both human-recalled substitutes and substitutes from a thesaurus. We also find that SWORDS adds more substitutes at every score, although substitutes from the thesaurus tend to have a lower range of scores than those from COINCO. Lastly, we observe that the scores of substitutes from COINCO roughly form a normal distribution, which suggests that even substitutes provided by human annotators are controvertible, and that it is important to account for the intrinsically gradable nature of appropriateness with fine-grained scores.

Model evaluation
In this section, we evaluate several methods on SWORDS. The goals of this evaluation are threefold: (1) to prescribe our recommended evaluation practice for SWORDS, (2) to measure the performance of existing large-scale pre-trained models and state-of-the-art lexical substitution systems, and (3) to measure human performance for the purpose of comparing current and future systems.

Evaluation settings
There are two primary evaluation settings in lexical substitution research: the generative setting (McCarthy and Navigli, 2007) and the ranking setting (Thater et al., 2010). In the generative setting, systems output a ranked list of candidate substitutes; there are no restrictions on the number of candidates a system may output. In the ranking setting, systems are given all candidate substitutes from the benchmark (including those marked as unacceptable) and tasked with ranking them by appropriateness. Here we primarily focus on the generative setting, as it is more relevant to writing assistance. We defer our experiments on the ranking setting to Appendix D.

Evaluation metrics
In a writing assistance context, we envision that lexical substitution systems would be used to suggest a limited number of substitutes to users (e.g. 10 substitutes as opposed to 100). Hence, we consider evaluation metrics that examine the quality and coverage of the top-ranked substitutes from a system with respect to the substitutes that humans judged as acceptable (score > 50%). Specifically, we compute precision (P_k) and recall (R_k) at k:

P_k = (# acceptable substitutes in system top-k) / (# substitutes in system top-k)

R_k = (# acceptable substitutes in system top-k) / min(k, # acceptable substitutes)

Because we care about both quality (precision) and coverage (recall) when comparing systems, we report F_k, the harmonic mean of P_k and R_k. Likewise, we evaluate against the list of substitutes which humans judged as conceivable (score > 0%): P_k^c and R_k^c denote precision and recall against this larger candidate list, and F_k^c their harmonic mean. Motivated by past work (McCarthy and Navigli, 2007), we primarily examine performance for k = 10 and lemmatize system and reference substitutes during comparison.
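These metrics can be sketched in a few lines. This is an illustrative implementation rather than the released evaluation script, and it assumes substitutes have already been lemmatized:

```python
def precision_recall_f_at_k(ranked, acceptable, k=10):
    """Precision and recall at k against the set of substitutes humans
    judged acceptable; F is their harmonic mean. `ranked` is the
    system's ranked list, `acceptable` the reference set."""
    top_k = ranked[:k]
    hits = sum(1 for w in top_k if w in acceptable)
    p = hits / len(top_k) if top_k else 0.0
    r = hits / min(k, len(acceptable)) if acceptable else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

ranked = ["incredible", "fascinating", "prodigious", "awesome"]
acceptable = {"incredible", "fascinating", "great"}
p, r, f = precision_recall_f_at_k(ranked, acceptable, k=4)
assert (p, r) == (0.5, 2 / 3)
```

The min(k, ·) in the recall denominator prevents penalizing a system for producing fewer than k substitutes when fewer than k acceptable substitutes exist.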
We note that these metrics represent a departure from standard lexical substitution methodology, established by McCarthy and Navigli (2007). Like P k and R k , the previously-used BEST and OOT metrics are also measures of precision and recall, but do not take advantage of the negative labels from our binary data collection protocol as no such labels existed in the earlier benchmarks. Nevertheless, we report performance of all systems on these metrics in Appendix E as reference.

Baselines
We evaluate both state-of-the-art lexical substitution systems and large-scale pre-trained models as baselines on SWORDS. We reimplement the BERT-based lexical substitution system (BERT-LS) from Zhou et al. (2019), which achieves state-of-the-art results on past benchmarks. As another lexical substitution system, we examine WORDTUNE (AI21, 2020), a commercial system which offers lexical substitution capabilities (though it is not specifically optimized for lexical substitution). We also examine two large-scale pre-trained models adapted to the task of lexical substitution: BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020). To generate and rank candidates with BERT, we feed in the context with the target word either masked (BERT-M) or kept intact (BERT-K), and output the top 50 most likely words according to the masked language modeling head. Because the target word is removed, BERT-M is expected to perform poorly; its main purpose is to assess the relative importance of the presence of the target word compared to the context. Note that both of these strategies for generating candidates with BERT differ from that of BERT-LS, which applies dropout to the target word embedding to partially obscure it. To generate candidates with GPT-3, we formulate lexical substitution as natural language generation (see Appendix D.5 for details).

Human and oracle systems
Here we consider human and oracle "systems" to help contextualize the performance of automatic lexical substitution systems evaluated on SWORDS. We evaluate the performance of HUMANS using labels from a separate pool of annotators, as described in Section 4.4. Because this task is inherently subjective, this system represents the agreement of two independent sets of humans, which should be thought of as a realistic upper bound for all metrics. In the generative setting, we take the substitutes that received a score > 0% from the separate pool of annotators as HUMANS's substitutes.
We also consider both of the candidate sources, COINCO and THESAURUS, as oracle systems. Each source contains a list of substitutes for every target word, and therefore can be viewed as a lexical substitution system and evaluated on SWORDS. COINCO provides substitutes for a target word that were provided by (six) human annotators. This can be thought of as a proxy for how humans perform on lexical substitution when recalling words off the top of their heads (as opposed to making binary judgements as in HUMANS). THESAURUS provides context-free substitutes for a target word (regardless of their word senses) with the default ranking retrieved from the thesaurus. This represents the context-insensitive ordering that a user of the same thesaurus would encounter.
Because these oracle systems only produce candidates which are guaranteed to be in SWORDS, they have an inherent advantage on the evaluation metrics over other systems. Hence, to be more equitable to other systems, we additionally compute F_10 and F_10^c in a "lenient" fashion, filtering out model-generated substitutes which are not in SWORDS (we refer to the setup without filtering as "strict"). It is our intention that future systems should not use COINCO or THESAURUS in any way, as they leak information about the SWORDS benchmark.

Evaluation results

Table 4 shows that the performance of all methods falls short of that of humans on all metrics. We interpret this as evidence that SWORDS is a challenging benchmark, since strong (albeit unsupervised) baselines like BERT and GPT-3 do not reach parity with humans. We also observe that two models (WORDTUNE and GPT-3) achieve higher F_10 than COINCO. In other words, while all models perform worse than humans who are judging the appropriateness of substitutes (HUMANS), some models appear to slightly outperform humans who are thinking of substitutes off the top of their head (COINCO). This implies that some lexical substitution models may already be helpful to humans for writing assistance, with room for improvement. Overall, we find that no single system emerges as the best on all metrics. We note that, despite BERT-LS representing the state-of-the-art for past lexical substitution benchmarks, its performance falls short of that of commercial systems like GPT-3 and WORDTUNE on most criteria. Also, the BERT-based methods output around 5x as many candidates as the other models on average, giving them an inherent advantage in recall under the lenient criteria (see Table 7 in Appendix E).
In Table 4, we additionally report the performance of generative models after re-ranking their lists of substitutes using the best ranker from our candidate ranking evaluation, BERT (see Appendix D for details). This procedure improves performance for all systems on all metrics, except for GPT-3. Hence, we speculate that improved performance in the ranking setting will be complementary to improved performance in the generative setting.
From a qualitative perspective, many of the systems we evaluate already produce helpful substitutes (Table 5). In examining errors, we find that BERT-based models and WORDTUNE tend to produce words that differ semantically from the target (e.g. "league" for "zone"). Substitutes generated by GPT-3 are often repetitive (e.g. for "zone," GPT-3 produced 64 substitutes, of which only 13 were unique); we filter out duplicates before evaluating. Finally, we observe that some systems produce appropriate substitutes which are not present in SWORDS (e.g. GPT-3 produces "precinct" for "zone"), indicating that SWORDS still has gaps in coverage. However, the higher coverage and quality of SWORDS compared to past benchmarks still improves the reliability of our proposed evaluation.

Related work
As we already discussed previous lexical substitution benchmarks in Section 2 and models in Section 5, we use this section to draw connections to other related literature.
Word sense disambiguation. The task of word sense disambiguation consists of selecting a word's intended meaning (i.e. sense) from a pre-defined set of senses in a sense inventory. The task of lexical substitution is closely related to word sense disambiguation, as many words are sense synonyms: some of their senses are synonymous, but others are not (Murphy, 2010). In fact, McCarthy (2002) proposed lexical substitution as an application-oriented word sense disambiguation task that avoids some of the drawbacks of standard word sense disambiguation, such as biases created by the choice of sense inventory (Kilgarriff, 1997).
Near-synonym lexical choice. Words are often near-synonyms: they can substitute for each other in some contexts, but not in every context (DiMarco et al., 1993; Murphy, 2010). SWORDS can be viewed as a collection of human judgments on when certain near-synonyms are substitutable in a given context. The task of near-synonym lexical choice consists of selecting the original target word from a set of candidate words given a context where the target word is masked out (Edmonds and Hirst, 2002). The candidate words comprise the target word and its near-synonyms, often retrieved from a lexical resource such as Hayakawa (1994). In this task, systems are tested on whether they can reason about near-synonyms and choose the best substitute that fits the context, without knowing any direct semantic information about the target word and without having to explicitly judge the appropriateness of other candidates.
Lexical and phrasal resources. Lexical resources such as thesauri are often used to identify possible word substitutes. WordNet (Fellbaum, 1998) is a widely used lexical resource for English that includes synonymy, antonymy, hypernymy, and other relations between words. PPDB (Pavlick et al., 2015) includes both word-level and phraselevel paraphrase rules ranked by paraphrase quality. These resources relate words and phrases in the absence of context, whereas lexical substitution requires suggesting appropriate words in context.
Paraphrase generation. Work on sentence-level paraphrase generation considers a wide range of meaning-preserving sentence transformations, including phrase-level substitutions and large syntactic changes (Madnani and Dorr, 2010; Iyyer et al., 2018; Hu et al., 2019). Our work could be extended to phrases given appropriate methods for identifying target phrases and proposing candidate substitute phrases. One benefit of focusing on word substitutions is that we can cover a large fraction of all appropriate substitutes, and thus estimate recall of generative systems. Some word-level substitutions, such as function word variation and substitutions that rely on external knowledge, are also outside the scope of our work but occur in standard paraphrase datasets (Bhagat and Hovy, 2013).
Self-supervised pre-trained models. The task of suggesting words given surrounding context bears strong resemblance to masked language modeling, which is commonly used for pretraining (Devlin et al., 2019). However, for lexical substitution, appropriate substitutes must not only fit in context but also preserve the meaning of the target word; thus, additional work is required to make BERT perform lexical substitution (Zhou et al., 2019;Arefyev et al., 2020).
Modeling human disagreement. In SWORDS, we find considerable subjectivity between annotators on the appropriateness of substitutes. For the task of natural language inference, recent work argues that inherent disagreement between human annotators captures important uncertainty in human language processing that current NLP systems model poorly (Pavlick and Kwiatkowski, 2019;Nie et al., 2020). We hope that the fine-grained scores in SWORDS encourage the development of systems that more accurately capture the graded nature of lexical substitution.

A Data collection
A.1 Deduplicating contexts

SWORDS uses the same contexts as COINCO, but with slight modifications to avoid the duplication issues and incomplete contexts found in COINCO. COINCO uses a subset of contexts from the Manually Annotated Sub-Corpus (MASC) (Ide et al., 2008, 2010), in which some sentences are erroneously repeated multiple times because multiple IDs were assigned to a single sentence. Consequently, COINCO contains duplicate sentences in some contexts, as shown below:

" -was kindly received," "But an artist who would stay first among his fellows can tell when he begins to fail." "But an artist who would stay first among his fellows can tell when he begins to fail."

Furthermore, we found that some parts of the document context are missing in COINCO because no ID was assigned to those parts in MASC (e.g. "he said." is missing from the above passage after the word "received").
To address this issue, we re-extracted full contexts from MASC. Given a sentence containing a target word in COINCO, we located the sentence in MASC and used three non-overlapping adjacent MASC regions as our context. As a result, our context contains additional text that was erroneously omitted in COINCO (including newlines), thereby reducing annotator confusion. The context of the above example in our benchmark is as follows: " -was kindly received," he said. "But an artist who would stay first among his fellows can tell when he begins to fail." "Oh?"

A.2 Retrieving substitutes from THESAURUS
We use thesaurus.com, which is based on Roget's Thesaurus (Kipfer, 2013), as the primary source of context-free substitutes for target words in SWORDS. This resource contains substitutes for 133K words, with an average of four senses per word (median of one; a small fraction of words have dozens of senses) and 25 substitutes per sense.
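As a concrete sketch, retrieval over such a resource (gathering substitutes from POS-matched senses of the lemmatized target) might look as follows. The tiny in-memory thesaurus and the suffix-stripping lemmatizer are illustrative stand-ins, not the actual resource or the lemmatizer we used:

```python
# Hypothetical in-memory thesaurus: lemma -> list of senses,
# each sense a (part_of_speech, substitutes) pair.
THESAURUS = {
    "jump": [
        ("verb", ["leap", "bound", "hop", "spring"]),
        ("noun", ["leap", "hurdle", "rise"]),
    ],
}

def lemmatize(word):
    # Placeholder lemmatizer: strips a few common suffixes for illustration;
    # in practice a standard NLP lemmatizer would be used.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def candidate_substitutes(target, pos):
    """Gather substitutes from all thesaurus senses matching the target's POS."""
    # Lemmas typically have more substitutes than inflected forms,
    # so we query the thesaurus with the lemmatized target.
    lemma = lemmatize(target.lower())
    substitutes = []
    for sense_pos, words in THESAURUS.get(lemma, []):
        if sense_pos == pos:
            substitutes.extend(w for w in words if w != lemma)
    return sorted(set(substitutes))
```

For example, querying the inflected form "jumping" as a verb retrieves the substitutes stored under the lemma "jump."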
To select substitutes for a particular target word, we gather all substitutes from all senses that have the same part of speech as the original target, in order to disentangle lexical substitution from the task of word sense disambiguation as well as to include challenging distractors for evaluating models. Because lemmas typically have more substitutes than their associated word forms (e.g., "jump" has more substitutes than "jumping"), we lemmatize target words before querying the thesaurus.

B.1 Instructions and interface

Figures 2 and 3 show the instructions and interface we used on Amazon Mechanical Turk (AMT) to crowdsource labels on substitutes. Following the practice of COINCO, we showed a highlighted target word in the context, which consisted of three sentences to provide sufficient context. We instructed annotators to provide a negative label if the target word is a proper noun or part of a fixed expression or phrase.

Since our Human Intelligence Task (HIT) concerns acceptability judgement as opposed to substitute generation, we made the following modifications to the COINCO setup. First, we asked annotators whether they "would actually consider using this substitute" rather than whether the substitute "would not change the meaning" of the target word (Section 3.1). Second, we allowed annotators to abstain if they did not know the definition of a substitute, while asking them to return the HIT if they did not know the definition of the target word or of more than three substitutes. Third, we asked annotators to accept substitutes that are "good but not quite grammatically correct." Lastly, we asked annotators to accept the substitute identical to the target word, in an attempt to filter out spammed HITs (Section B.3).

B.2 Setting on Amazon Mechanical Turk
Each HIT contained at most 10 candidate substitutes for a context-target word pair. When there were more than 10 candidate substitutes, we generated multiple HITs by partitioning the candidate substitutes into subsets of potentially different lengths, using numpy.array_split. We randomized the ordering of substitutes so that each HIT was likely to contain substitutes from both COINCO and the thesaurus. We used the following qualification conditions to restrict participation to experienced annotators:

• HIT Approval Rate (%) for all Requesters' HITs is greater than 98.
• Location is the United States.
• Number of HITs Approved is greater than 10,000.
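The partitioning of candidate substitutes into HITs can be sketched with numpy.array_split, which splits a list into roughly equal subsets whose lengths differ by at most one (the helper name and the cap of 10 per HIT follow the description above):

```python
import numpy as np

def partition_into_hits(candidates, max_per_hit=10):
    """Split candidate substitutes into HITs of at most max_per_hit items."""
    if not candidates:
        return []
    # Number of HITs needed so that no subset exceeds max_per_hit.
    n_hits = -(-len(candidates) // max_per_hit)  # ceiling division
    # array_split allows uneven subsets when the list does not divide evenly.
    return [list(chunk) for chunk in np.array_split(candidates, n_hits)]
```

For instance, 23 candidate substitutes yield three HITs of sizes 8, 8, and 7.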
Our target hourly wage for annotators was $15. Based on our in-person pilot study with five native English speakers, we estimated the time per assignment (labeling at most twelve substitutes) to be 25 seconds. We then assumed that crowd workers may take 1-2x longer to complete the assignments and decided on a compensation of $0.10 per assignment, which falls in the range between $7.25 (the US federal minimum wage) and $15 per hour, corresponding to 50 seconds and 24 seconds per assignment, respectively.
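The compensation arithmetic above can be checked directly (the helper function is for illustration only):

```python
def hourly_wage(pay_per_assignment, seconds_per_assignment):
    """Effective hourly wage for a given pay rate and completion time."""
    assignments_per_hour = 3600 / seconds_per_assignment
    return pay_per_assignment * assignments_per_hour

# At $0.10 per assignment:
# 50 seconds per assignment yields about $7.20/hour (near the federal minimum),
# 24 seconds per assignment yields $15.00/hour (the target wage).
```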
It may be surprising that our assignments only take 25 seconds on average, though there are several reasons why this is the case: (1) In general, making binary judgements about substitute words takes very little time for native speakers. (2) Annotators only have to read the target sentence once to provide judgements for all substitutes in an assignment. (3) Annotators usually do not need to read the two additional context sentences to make judgements. (4) Annotators can almost instantly judge the two control substitutes (Section B.3), and are therefore only realistically evaluating at most ten candidates per assignment.

B.3 Filtering spam
In order to filter out work done by spammers, we included two additional control candidate substitutes in every HIT: the original target word and a randomly chosen dictionary word. Annotators were instructed to accept the substitute identical to the target word and were expected to either reject or abstain on the random word. We used these control substitutes to filter out spammed HITs. Concretely, we filtered out all the HITs with any wrong label assigned to the control substitutes as well as HITs completed by annotators whose overall accuracy on control substitutes across HITs was less than 90%. Then, we re-collected labels on these filtered HITs for Step 2 and Step 3.
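The two-stage filter described above can be sketched as follows; the HIT record layout (fields `worker`, `target_label`, `random_label`) and the helper name are hypothetical:

```python
from collections import defaultdict

def filter_spam(hits, min_worker_accuracy=0.90):
    """Drop HITs with any incorrect control label, then drop all HITs from
    workers whose overall accuracy on controls falls below the threshold."""
    # A control judgement is correct if the copy of the target word is
    # accepted and the random dictionary word is rejected or abstained on.
    def controls_ok(hit):
        return (hit["target_label"] == "accept"
                and hit["random_label"] in ("reject", "abstain"))

    # Per-worker accuracy over the two control substitutes in each HIT.
    correct = defaultdict(int)
    total = defaultdict(int)
    for hit in hits:
        total[hit["worker"]] += 2
        correct[hit["worker"]] += (hit["target_label"] == "accept")
        correct[hit["worker"]] += (hit["random_label"] in ("reject", "abstain"))

    return [
        hit for hit in hits
        if controls_ok(hit)
        and correct[hit["worker"]] / total[hit["worker"]] >= min_worker_accuracy
    ]
```

Note that a worker with low overall control accuracy loses all of their HITs, including those whose own controls were labeled correctly.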

C Data analysis C.1 Annotator agreement
McCarthy and Navigli (2007) introduced two inter-annotator agreement measures, which assume that a fixed number of annotators generates a set of substitutes for every context-target word pair. However, these measures are not designed for the case where there is only one collective set of substitutes for each context-target word pair, and every context-target word pair is labeled by varying combinations of annotators.

Figure 4: Annotator agreement between SWORDS and k additional annotators, measured by rank-biased overlap (Webber et al., 2010). Standard deviations over 100 simulations are shown as error bars. RBO is quite low for k < 3, with diminishing returns as k grows, indicating wide variation in opinions and the need for a sufficiently large k to capture the distribution.
Instead, we compute the correlation between two ranked lists using Rank-Biased Overlap (RBO) (Webber et al., 2010), which handles non-conjoint lists and, unlike other common rank similarity measures such as Kendall's τ and Spearman's ρ, weights high ranks more heavily than low ones. Using the 10 additionally collected labels per substitute (Section 4.4), we computed RBO by comparing the ranked list of substitutes derived from this data to that of SWORDS, and simulated the effect of having k annotators by sampling k labels per substitute without replacement a total of 100 times. Figure 4 shows the correlation between SWORDS and k additional human annotators. We observe quite low RBO for k < 3 and diminishing returns as k grows. Based on this observation, we argue that there is wide variation in opinions and that it is necessary to use a sufficiently large k to capture the distribution.
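A minimal sketch of truncated RBO, following Webber et al. (2010): at each depth d, agreement is the fraction of overlap between the two length-d prefixes, and agreements are combined with geometrically decaying weights. The persistence parameter p and the truncation depth are illustrative choices here, not the exact configuration used in our experiments:

```python
def rbo(list_a, list_b, p=0.9, depth=None):
    """Truncated rank-biased overlap between two ranked lists.

    Returns a lower bound on full RBO: the geometric series is cut off
    at `depth`, so identical lists score 1 - p**depth rather than 1.
    """
    if depth is None:
        depth = min(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: prefix overlap normalized by d.
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += (1 - p) * p ** (d - 1) * (overlap / d)
    return score
```

Because high ranks receive larger weights, disagreements near the top of the lists lower the score more than disagreements near the bottom.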

D Model evaluation D.1 Ranking setting
As opposed to the generative setting, where systems must generate and rank substitutes, in the (easier) ranking setting, systems are given all candidate substitutes from the benchmark (including those marked as unacceptable) and are tasked with ranking them by their appropriateness.

D.2 Evaluation metrics
To evaluate ranking models, we adopt standard practice and report generalized average precision (GAP) (Kishida, 2005). GAP is similar to mean average precision, but gives more credit to systems that rank highly those substitutes with higher scores in the reference list. Considering that our data collection procedure results in reference scores which correspond more to substitute appropriateness than ease of recollection, GAP is aligned with our high-level goals.
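A common formulation of GAP (Kishida, 2005) can be sketched as follows, under the simplifying assumptions that gold scores are non-negative and at least one is positive: the cumulative mean of gold scores along the system's ranking is accumulated at each positively scored position, then normalized by the same quantity for the ideal (descending-score) ranking:

```python
def gap(ranked_scores):
    """Generalized average precision for one ranked list.

    ranked_scores: gold appropriateness scores of the candidates, in the
    order the system ranked them (higher gold score = more appropriate).
    """
    def credit(scores):
        total, cumulative = 0.0, 0.0
        for i, s in enumerate(scores, start=1):
            cumulative += s
            if s > 0:
                # Mean gold score of the top-i ranked candidates.
                total += cumulative / i
        return total

    ideal = sorted(ranked_scores, reverse=True)
    return credit(ranked_scores) / credit(ideal)
```

A system that ranks candidates in exact descending order of their gold scores receives a GAP of 1.0.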

D.3 Baselines
We evaluate contextual embeddings from BERT and word embeddings from GLOVE (Pennington et al., 2014), using the cosine similarity of the target and substitute embeddings as the score. To compute the contextual embedding of a target or substitute with BERT, we mean pool the contextual embeddings of its constituent word pieces. Because GLOVE discards contextual information, we expect it to perform worse than BERT; it mainly serves to assist interpretation of GAP scores. In the ranking setting, we are unable to evaluate GPT-3 and WORDTUNE, as we interface with these systems via an API which provides limited access to the underlying models. We report GAP scores in Table 6.
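The embedding-based ranking reduces to cosine similarity between target and substitute vectors. A sketch with placeholder vectors (in our experiments these would come from BERT, with mean-pooled word-piece embeddings, or from GLOVE):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_substitutes(target_vec, substitute_vecs):
    """Rank candidate substitutes by cosine similarity to the target.

    substitute_vecs: dict mapping each candidate word to its embedding.
    Returns (word, score) pairs in descending order of similarity.
    """
    scored = [(word, cosine(target_vec, vec))
              for word, vec in substitute_vecs.items()]
    return sorted(scored, key=lambda pair: -pair[1])
```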

D.4 Results
We posit that contextual word embedding models should be invariant to contextual synonymy: they should embed acceptable substitutes near one another. Hence, the SWORDS ranking setting may offer a useful perspective for evaluating this aspect of such models. In the ranking setting, our best contextual embedding model (BERT) achieves a GAP score of 57.1. While BERT outperforms a simple context-free baseline (GLOVE), it falls short of the 67.7 GAP score achieved by HUMANS. We interpret this as evidence that contextual embedding models have room to improve before attaining the aforementioned invariance.

D.5 Lexical substitution as natural language generation
GPT-3 is a language model that generates text in left-to-right order and is not designed specifically for the task of lexical substitution. To use GPT-3 for lexical substitution, we formulate the task as natural language generation and use in-context learning as described by Brown et al. (2020). Specifically, we draw examples at random from the SWORDS development set to construct triplets of text consisting of (context with the target word indicated using asterisks, natural language query, comma-separated list of all substitutes with score > 0% in descending score order), as follows: Phone calls were monitored. An undercover force of Manhattan Project security agents **infiltrated** the base and bars in the little town of Wendover (population 103) to spy on airmen. Karnes knew the 509th was preparing for a special bombing mission, but he had no idea what kind of bombs were involved.

Q: What are appropriate substitutes for **infiltrated** in the above text?
A: penetrate, swarm, break into, infest, overtake, encompass, raid, breach

We construct as many of these priming triplets as can fit in GPT-3's 2048-token context (roughly 12 examples on average), leaving enough room for a test example formatted the same way except without the list of answers. Then, we query the 175B-parameter davinci configuration of GPT-3 to generate a text continuation of up to 128 tokens. Finally, we parse the generated text from GPT-3, using its natural language ordering as the ordering for evaluation.
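Formatting one priming triplet as described above can be sketched as follows; the helper name is hypothetical, and the marker style and query wording follow the example in the text:

```python
def format_example(context, target, substitutes=None):
    """Format one (context, query, answer) triplet for in-context learning.

    If `substitutes` is None, the answer line is omitted, producing the
    test prompt whose continuation GPT-3 is asked to generate.
    """
    # Indicate the target word with double asterisks, as in the prompt format.
    marked = context.replace(target, f"**{target}**", 1)
    lines = [
        marked,
        f"Q: What are appropriate substitutes for **{target}** in the above text?",
    ]
    if substitutes is not None:
        # Substitutes with score > 0%, in descending score order.
        lines.append("A: " + ", ".join(substitutes))
    return "\n".join(lines)
```

Priming triplets built this way are concatenated until the token budget is exhausted, followed by one answerless test example.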
In an initial pilot study on a random split of our development set, we selected the sampling hyperparameters for GPT-3 as temperature 0, presence_penalty 0.5, and frequency_penalty 0, among candidate values of {0, 1}, {0, 0.5, 1.0}, and {0, 0.5, 1.0}, respectively, using a grid search (18 runs) to select the values with the highest F 10 c.

E Additional evaluation results
We include additional results from our evaluation. In Table 7, we break down F 10 from Table 4 into P 10 and R 10. In Table 8, we report the performance of all generative baselines on traditional metrics for lexical substitution.

Table 8: Evaluation of models on SWORDS in the generative setting using traditional evaluation metrics. We also include numbers for an ORACLE, as (unlike for F 10 and GAP) the oracle does not achieve a score of 100. *Computed on a subset of the test data. † Reranked by our best ranking model (BERT).