BERT-based Lexical Substitution

Previous studies on lexical substitution tend to obtain substitute candidates by finding the target word's synonyms in lexical resources (e.g., WordNet) and then ranking the candidates based on their context. These approaches have two limitations: (1) they are likely to overlook good substitute candidates that are not synonyms of the target word in the lexical resources; (2) they fail to take into account the substitution's influence on the global context of the sentence. To address these issues, we propose an end-to-end BERT-based lexical substitution approach that can propose and validate substitute candidates without using any annotated data or manually curated resources. Our approach first applies dropout to the target word's embedding to partially mask the word, allowing BERT to take balanced consideration of the target word's semantics and its context when proposing substitute candidates, and then validates the candidates based on the substitution's influence on the global contextualized representation of the sentence. Experiments show that our approach performs well in both proposing and ranking substitute candidates, achieving state-of-the-art results on both the LS07 and LS14 benchmarks.


Introduction
Lexical substitution (McCarthy and Navigli, 2007) aims to replace a target word in a sentence with a substitute word without changing the meaning of the sentence, which is useful for many Natural Language Processing (NLP) tasks like text simplification and paraphrase generation.
One main challenge in this task is proposing substitutes that not only are semantically consistent with the original target word and fit in the context but also preserve the sentence's meaning.

* This work was done during the first author's internship at Microsoft Research Asia.

Figure 1: Example sentence: "The wine he sent to me as my birthday gift is too strong to drink." (a) WordNet and the original BERT cannot propose the valid substitute powerful in their top-K results, but applying target word embedding dropout enables BERT to propose it; (b) Undesirable substitutes (e.g., hot, tough) tend to change the contextualized representation of the sentence more than good substitutes (e.g., powerful). The numbers after the words are the cosine similarities of the words' contextualized vectors to the original target word's; the numbers after the sentence are the similarities of the sentence's contextualized representation before and after the substitution, defined in Eq (2).
Most previous approaches to this challenge first obtain substitute candidates by picking synonyms from manually curated lexical resources and then rank them based on their appropriateness in context, or instead rank all words in the vocabulary to avoid relying on lexical resources. For example, knowledge-based lexical substitution systems (Yuret, 2007; Hassan et al., 2007) use pre-defined rules to score substitute candidates; vector space modeling approaches (Erk and Padó, 2008; Dinu and Lapata, 2010; Thater et al., 2010; Apidianaki, 2016) use distributional sparse vector representations based on the syntactic context; substitute vector approaches (Yuret, 2012; Melamud et al., 2015b) model the potential fillers for the target word slot in that context; word/context embedding similarity approaches (Melamud et al., 2015a; Roller and Erk, 2016; Melamud et al., 2016) use the similarity of word embeddings to rank substitute words; and supervised learning approaches (Biemann, 2013; Szarvas et al., 2013a,b; Hintz and Biemann, 2016) use delexicalized features to rank substitute candidates. Although these approaches work well in some cases, they have two key limitations: (1) They rely heavily on lexical resources. While the resources can offer synonyms for substitution, they are not perfect and are likely to overlook some good candidates, as Figure 1(a) shows.
(2) Most previous approaches only measure a substitution candidate's fitness given the context; they do not consider whether the substitution changes the sentence's meaning. Take Figure 1(b) as an example: although tough may fit in the context as well as powerful does, it changes the contextualized representation of the sentence more than powerful, and is therefore not as good a substitute as powerful.

To address the above issues, we propose a novel BERT-based lexical substitution approach, motivated by the fact that BERT (Devlin et al., 2018) can not only predict the distribution of a masked target word conditioned on its bi-directional contexts but also measure the similarity of two sentences' contextualized representations. To propose substitute candidates for a target word in a sentence, we introduce a novel embedding dropout mechanism that partially masks the target word before BERT predicts the word at that position. Compared to fully masking or keeping the target word, partial masking with embedding dropout allows BERT to take balanced consideration of the target word's semantics and its contexts, helping it avoid substitute candidates that are either semantically inconsistent with the target word or unfit for the context, as Figure 1(a) shows. To validate a substitute candidate, we propose to evaluate the candidate's fitness based on the substitution's influence on the contextualized representation of the sentence, which avoids selecting a substitute that substantially changes the sentence's meaning, as Figure 1(b) illustrates. We conduct experiments on the official LS07 and LS14 benchmarks. The results show that our approach substantially outperforms previous approaches in both proposing and validating substitute candidates, achieving new state-of-the-art results on both datasets.
Figure 2: Unmasking, masking and partially masking the target word through target embedding dropout.

The contributions of our paper are as follows:
• We propose a BERT-based end-to-end lexical substitution approach without relying on any annotated data or external linguistic resources.
• Based on BERT, we introduce target word embedding dropout to help propose substitute candidates, as well as a substitute candidate validation method based on the substitution's influence on the global context.
• Our approach largely advances the state-of-the-art results of lexical substitution on both the LS07 and LS14 benchmarks.

BERT-based Lexical Substitution
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is a bidirectional Transformer encoder (Vaswani et al., 2017) trained with a masked language modeling objective and a next-sentence prediction task, and has proven effective in various NLP tasks.
In this section, we present how to effectively leverage BERT for lexical substitution.

Substitute Candidate Proposal
As BERT is a bi-directional language model trained by masking target words, it can be used to propose substitute candidates that reconstruct the sentence. In practice, however, if we mask the target word and let BERT predict the word at that position, BERT is very likely to generate candidates that are semantically different from the original target word even though they fit in the context; on the other hand, if we do not mask the target word, approximately 99.99% of the predicted probability mass falls on the original target word, making it unreliable to choose alternative candidates from the remaining 0.01% of the probability space, as Figure 1 shows. As a trade-off between these two extremes, we propose to apply embedding dropout to partially mask the target word: we force a portion of the dimensions of the target word's input embedding to zero, as illustrated in Figure 2. In this way, BERT receives only vague information about the target word and thus has to consider the context to reconstruct the sentence, which improves substitute candidate proposal, as Figure 1(a) shows.
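The partial masking above can be sketched in a few lines of numpy. This is a minimal illustration under our own assumptions: the function name is hypothetical, and we zero dimensions without the 1/(1-p) rescaling of training-time dropout, since here dropout is used at inference purely to blur the target word's embedding.

```python
import numpy as np

def partially_mask(target_embedding, dropout_ratio=0.3, seed=None):
    """Zero out a random fraction of the target word's input embedding
    dimensions, so the model receives only vague information about the
    target word. Only the target position is masked; in the full model,
    the context tokens' embeddings stay intact."""
    rng = np.random.default_rng(seed)
    keep = rng.random(target_embedding.shape) >= dropout_ratio
    return target_embedding * keep
```

A dropout ratio of 0 recovers the "keep" extreme and a ratio of 1 the "mask" extreme; the paper's development-set choice of 0.3 sits between them.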
Formally, for the target word x_k to be replaced in sentence x = (x_1, ..., x_k, ..., x_L), we define s_p(x'_k | x, k) as the proposal score for choosing x'_k as the substitute for x_k:

    s_p(x'_k | x, k) = P(x'_k | x', k) / (1 - P(x_k | x', k))        (1)

where P(x'_k | x', k) is the probability of the word x'_k at the k-th position predicted by BERT given x', and x' is the same as x except that the word at its k-th position is partially masked with embedding dropout. The denominator 1 - P(x_k | x', k) is the probability mass of predictions other than x_k; it normalizes P(x'_k | x', k) against all the words in the vocabulary excluding x_k.
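Given BERT's softmax distribution at the target position, the proposal score can be computed for the whole vocabulary at once. A sketch (the function name is ours; `probs` stands in for the model's predicted distribution):

```python
import numpy as np

def proposal_scores(probs, target_id):
    """Score every vocabulary entry as a substitute for the target word,
    following Eq (1) as reconstructed above: each candidate's predicted
    probability is divided by the probability mass assigned to words
    other than the original target."""
    scores = probs / (1.0 - probs[target_id])
    scores[target_id] = -np.inf  # the original word itself is not a substitute
    return scores
```

Taking the top-K entries of `scores` then yields the substitute candidates to be validated.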

Substitute Candidate Validation
After proposing substitute candidates, we need to validate them because not all proposed candidates are appropriate. As Figure 1(b) shows, a proposed candidate (e.g., tough) may change the sentence's meaning. To avoid such cases, we evaluate a candidate's fitness by comparing the sentence's contextualized representation before and after the substitution. Specifically, for a word x_i, we use the concatenation of its representations in the top four layers of BERT as its contextualized representation. We denote the sentence after the substitution x_k → x'_k as x' = (x_1, ..., x'_k, ..., x_L). The validation score for the substitution is defined in Eq (2):

    s_v(x'_k | x, k) = SIM(x, x'; k)        (2)

where SIM(x, x'; k) is the similarity of BERT's contextualized representations of x and x', defined as follows:

    SIM(x, x'; k) = Σ_{i=1..L} w_{i,k} · Λ(h(x_i | x), h(x'_i | x'))

where h(x_i | x) is BERT's contextualized representation of the i-th token in the sentence x, and Λ(a, b) is the cosine similarity of vectors a and b. w_{i,k} is the average self-attention score over all heads in all layers from the i-th token to the k-th position in x, which weighs each position by its semantic dependency on x_k. In this way, s_v(x'_k | x, k) measures the influence of the substitution x_k → x'_k on the semantics of the sentence. Undesirable substitute candidates like hot and tough in Figure 1(b) will get a lower s_v and thus rank poorly, while appropriate candidates like powerful will have a high s_v and will be preferred.
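The attention-weighted similarity can be sketched as follows, with the model calls abstracted away: `H` and `H_sub` stand for precomputed per-token representation matrices, and `attn_to_k` for the averaged self-attention weights toward position k (all hypothetical inputs here; normalizing the weights to sum to 1 is our assumption).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors (the Λ of the SIM definition)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def validation_score(H, H_sub, attn_to_k):
    """Weighted similarity of the sentence's contextualized representations
    before (H) and after (H_sub) a substitution, as in Eq (2). H and H_sub
    are (L, d) matrices of per-token vectors (e.g., top-four-layer
    concatenations); attn_to_k holds each token's average self-attention
    weight toward the target position k."""
    w = np.asarray(attn_to_k, dtype=float)
    w = w / w.sum()  # normalize so the score lies in [-1, 1]
    return float(sum(w[i] * cosine(H[i], H_sub[i]) for i in range(len(w))))
```

A substitute that shifts some token's representation away from its original direction lowers the score, which is exactly how tough is penalized relative to powerful in Figure 1(b).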
In practice, we consider both the proposal score s_p in Eq (1) and the validation score s_v in Eq (2) for the overall recommendation of a candidate:

    s(x'_k | x, k) = s_v(x'_k | x, k) + α · s_p(x'_k | x, k)        (3)

where α is the weight for the proposal score.
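End to end, ranking is then a one-line combination of the two scores. A sketch, with made-up score values for illustration:

```python
def rank_candidates(candidates, alpha=0.01):
    """Rank substitute candidates by the overall score of Eq (3),
    s = s_v + alpha * s_p. `candidates` is a list of (word, s_p, s_v)
    tuples; returns the words best-first."""
    return [w for w, sp, sv in
            sorted(candidates, key=lambda c: c[2] + alpha * c[1], reverse=True)]
```

With a small α (0.01 in the experiments), the validation score dominates the ranking and the proposal score mainly breaks ties among candidates that fit the context equally well.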

Experimental Setting
We evaluate our approach on the SemEval 2007 dataset (McCarthy and Navigli, 2007) (denoted LS07) and the CoinCo dataset (Kremer et al., 2014) (denoted LS14), the most widely used benchmarks for lexical substitution evaluation. LS07 consists of 201 target word types, each with 10 instances in different contexts (i.e., sentences); LS14 provides the same kind of data as LS07 but is much larger, with 4,255 target word types in over 15K sentences.
We use the official evaluation metrics of the SemEval 2007 task, best, best-mode, oot, and oot-mode, as well as Precision@1. Among them, best, best-mode, and Precision@1 evaluate the quality of the best predictions, while oot (out-of-ten) and oot-mode evaluate the coverage of the gold substitutes in the 10-best predictions.
We use the uncased BERT-large model of Devlin et al. (2018) in our experiments, and the LS07 trial set as our development set for tuning the hyperparameters of our model. Empirically, we set the dropout ratio of the target word's embedding to 0.3 and the weight α in Eq (3) to 0.01. For each test instance, we propose 50 candidates using the approach in Section 2.1 and validate and rank them by Eq (3). As the embedding dropout introduces randomness into the final results, we repeat our experiments 5 times and report average scores with standard deviations.


Experimental Results

Table 1 shows the results of our approach as well as the state-of-the-art approaches on the LS07 and LS14 benchmarks. [Table 1: comparison with previous approaches, including transfer learning (Hintz and Biemann, 2016), supervised learning (Szarvas et al., 2013b), and the knowledge-based KU (Yuret, 2007) and UNT (Hassan et al., 2007) systems; cell values not reproduced.] Our approach substantially outperforms all previous approaches on both benchmarks, even those trained through supervised learning with external resources (Szarvas et al., 2013b), in terms of all five metrics. Though our approach introduces randomness due to the embedding dropout, no large fluctuation is observed in our results.

To understand the improvement, we conduct an ablation test and show the results in Table 2. [Table 2: Ablation study results of our approach. BERT (Keep/Mask) are baselines that use BERT with the target word unmasked/masked to propose candidates, ranked by the proposal score. Recall that our approach is a linear combination of the proposal score s_p and the validation score s_v, as in Eq (3); in the "w/o s_p" baselines, we instead use BERT (Keep), BERT (Mask), or WordNet to propose candidates.] According to Table 2, the original BERT cannot perform as well as the previous state-of-the-art approaches on its own. Applying embedding dropout to BERT improves the model, allowing it to achieve 13.1% and 14.3% P@1 on LS07 and LS14 respectively. When we further add the candidate validation method of Section 2.2 to validate the candidates, performance is significantly improved. Furthermore, our substitute candidate proposal method is clearly much better than WordNet for candidate proposal, as seen by comparing our approach to the "w/o s_p (WordNet)" baseline, where candidates are obtained from WordNet and validated by our validation approach.

We also evaluate our approach on the substitute ranking subtask of LS07 and LS14. In the ranking subtask, a system does not need to propose candidates by itself; instead, the substitute candidates for each test instance are given in advance, either from lexical resources (e.g., WordNet) or from pooled substitutes. Following prior work, we use the GAP score (Kishida, 2005), a variant of MAP (Mean Average Precision), for evaluation on this subtask. According to Table 3, both our proposal score s_p and our validation score s_v contribute to the improvement, allowing our approach to outperform the previous state-of-the-art approaches even with the same substitute candidates.

Table 3: GAP scores in the substitute ranking subtask.

  Approach                                       LS07   LS14
  (Melamud et al., 2016)                         56.0   47.9
  substitute vector (Melamud et al., 2015b)      55.1   50.2
  addcos (Melamud et al., 2015a)                 52.9   48.3
  PIC (Roller and Erk, 2016)                     52.4   48.3
  vector space modeling (Kremer et al., 2014)    52.5   47.8
  transfer learning (Hintz and Biemann, 2016)    51.9   -
  supervised learning (Kremer et al., 2014)      55.0   -
  BERT (word similarity)                         55.2   52.1

Note that for the "w/o s_p" baseline, we do not need to propose candidates with BERT as in Table 2, since candidates are given in advance in the ranking subtask. BERT (word similarity) ranks candidates by the cosine similarity between the BERT contextualized representations of the original target word and a substitute candidate. We do not compare to Apidianaki (2016), as it only evaluates on a sample of the test data in a different setting.
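For concreteness, the GAP metric used in the ranking subtask can be sketched as below. This follows the standard definition from Kishida (2005) as used in the lexical substitution literature (weighted average precision normalized by the ideal ranking's score); it is our own sketch, not the official scorer.

```python
def gap(ranking, gold_weights):
    """Generalized Average Precision of a candidate ranking.
    `gold_weights` maps each gold substitute to its weight
    (e.g., how many annotators proposed it)."""
    def score(weights):
        total, cum = 0.0, 0.0
        for i, w in enumerate(weights, start=1):
            cum += w
            if w > 0:
                total += cum / i  # weighted precision at each gold hit
        return total
    observed = [gold_weights.get(c, 0.0) for c in ranking]
    ideal = sorted(gold_weights.values(), reverse=True)  # best possible order
    return score(observed) / score(ideal)
```

A ranking that places the gold substitutes first, in descending weight order, scores 1.0; pushing gold substitutes down the list lowers the score.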
By comparing our approach without s_p to the BERT baseline BERT (word similarity) in Table 3, we confirm that comparing sentence-level contextualized representations before and after the substitution is more effective and reliable for lexical substitution than a word-level comparison. This is because some changes in the sentence's meaning after the substitution are better captured by sentence-level analysis, just as the example in Figure 1(b) illustrates.

Conclusion
In this work, we propose an end-to-end lexical substitution approach based on BERT, which can propose and validate substitute candidates without using any annotated data or manually curated resources. Experiments on the LS07 and LS14 benchmark datasets show that our proposed embedding dropout for partially masking the target word helps BERT propose substitute candidates, and that comparing a sentence's contextualized representation before and after the substitution can largely improve the results of lexical substitution.