Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution

Lexical substitution, i.e. generation of plausible words that can replace a particular target word in a given context, is an extremely powerful technology that can be used as a backbone of various NLP applications, including word sense induction and disambiguation, lexical relation extraction, data augmentation, etc. In this paper, we present a large-scale comparative study of lexical substitution methods employing both older and the most recent language and masked language models (LMs and MLMs), such as context2vec, ELMo, BERT, RoBERTa, and XLNet. We show that the already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly. Several existing and new target word injection methods are compared for each LM/MLM using both intrinsic evaluation on lexical substitution datasets and extrinsic evaluation on word sense induction (WSI) datasets. On two WSI datasets we obtain new SOTA results. In addition, we analyze the types of semantic relations between target words and their substitutes generated by different models or given by annotators.


Introduction
Lexical substitution is the task of generating words that can replace a given word in a given textual context. For instance, in the sentence "My daughter purchased a new car" the word car can be substituted with its synonym automobile, but also with the co-hyponym bike, or even the hypernym motor vehicle, while keeping the original sentence grammatical. Lexical substitution has been proven effective in various applications, such as word sense induction (Amrami and Goldberg, 2018), lexical relation extraction (Schick and Schütze, 2020), paraphrase generation, text simplification, textual data augmentation, etc. Note that the preferable type (e.g., synonym, hypernym, co-hyponym, etc.) of generated substitutes depends on the task at hand.
The new generation of language and masked language models (LMs/MLMs) based on deep neural networks enabled a profound breakthrough in almost all NLP tasks. These models are commonly used to pre-train deep neural networks, which are then fine-tuned for some final task different from language modeling. In this paper, however, we study how the progress in unsupervised pre-training over the last five years affected the quality of lexical substitution. We adapt context2vec (Melamud et al., 2016), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019) to solve the lexical substitution task without any fine-tuning, but using additional techniques to ensure similarity of substitutes to the target word, which we call target word injection techniques. We provide the first large-scale comparison of various neural LMs/MLMs with several target word injection methods on lexical substitution and WSI tasks. Our research questions are the following: (i) which pre-trained models are the best for lexical substitution in context, and (ii) besides pre-training ever larger models, can proper injection of information about the target word further improve substitution quality?

Supervised approaches to lexical substitution include (Szarvas et al., 2013a; Szarvas et al., 2013b; Hintz and Biemann, 2016). These approaches rely on manually curated lexical resources like WordNet, so they are not easily transferable to different languages, unlike those described above. Also, the latest unsupervised methods were shown to perform better (Zhou et al., 2019).

Neural Language Models for Lexical Substitution with Target Word Injection
To generate substitutes, we introduce several substitute probability estimators, which are models taking a text fragment and a target word position in it as input and producing a list of substitutes with their probabilities. To build our substitute probability estimators we employ the following LMs/MLMs: context2vec (Melamud et al., 2016), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019). These models were selected to represent the progress in unsupervised pre-training with language modeling and similar tasks over the last five years. Given a target word occurrence, the basic approach for models like context2vec and ELMo is to encode its context and predict the probability distribution over possible center words in this particular context. This way, the model does not see the target word. For MLMs, the same result can be achieved by masking the target word. This basic approach employs the core ability of LMs/MLMs to predict words that fit a particular context. However, these words are often not related to the target. Information about the target word can improve the generated substitutes, but the best method of injecting this information is an open question.

Target Word Injection Methods
We experiment with several methods to introduce information about the original target word into neural lexical substitution models and show that their performance differs significantly. Suppose we have an example L T R, where T is the target word, and C = (L, R) is its context (left and right, respectively). For instance, the occurrence of the target word fly in the sentence "Let me fly away!" will be represented as T = "fly", L = "Let me", R = "away!".
+embs This method combines a distribution provided by a context-based substitute probability estimator P(s|C) with a distribution based on the proximity of possible substitutes to the target P(s|T). The proximity is computed as the inner product between the respective embeddings, and the softmax function is applied to get a probability distribution. However, if we simply multiply these distributions, the second will have almost no effect because the first is very peaky. To align the scales of the two distributions, we use a temperature softmax with a carefully selected temperature hyperparameter T: P(s|T) ∝ exp(⟨emb_s, emb_T⟩ / T).
The final distribution is obtained by the formula P(s|C, T) ∝ P(s|C) · P(s|T) / P(s)^β. For β = 1, this formula can be derived by applying the Bayes rule and assuming conditional independence of C and T given s. Other values of β can be used to penalize frequent words more or less strongly. Our current methods are limited to generating only substitutes from the vocabulary of the underlying LM/MLM. Thus, we take word or subword embeddings of the same model we apply the injection to. Other word embeddings like word2vec may perform better, but we leave these experiments for future work.
Word probabilities P(s) are retrieved from the wordfreq library for all models except ELMo. Following (Arefyev et al., 2019), for ELMo we calculate word probabilities from word ranks in the ELMo vocabulary (which is ordered by word frequency) using the Zipf-Mandelbrot distribution; this performs better, presumably because it corresponds more closely to the corpus ELMo was pre-trained on.
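The +embs combination described above can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas, not the authors' implementation; the function name, the example temperature, and the toy inputs are our own.

```python
import numpy as np

def embs_injection(p_context, emb_subs, emb_target, p_prior, temp=0.1, beta=1.0):
    """Combine a context-based substitute distribution with a
    target-similarity distribution (the +embs method).

    p_context : (V,) context-based probabilities P(s|C)
    emb_subs  : (V, d) embeddings of all vocabulary substitutes
    emb_target: (d,) embedding of the target word
    p_prior   : (V,) unconditional word probabilities P(s)
    temp      : softmax temperature aligning the two distributions
    beta      : exponent penalizing frequent words
    """
    # Temperature softmax over inner products: P(s|T) ∝ exp(<emb_s, emb_T> / T)
    logits = emb_subs @ emb_target / temp
    logits -= logits.max()                      # numerical stability
    p_target = np.exp(logits)
    p_target /= p_target.sum()
    # P(s|C,T) ∝ P(s|C) * P(s|T) / P(s)^beta
    scores = p_context * p_target / p_prior ** beta
    return scores / scores.sum()
```

In practice the temperature and β would be tuned on a development set, as discussed above.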
Dynamic patterns Following the approach proposed in (Amrami and Goldberg, 2018), we replace the target word T with "T and _" (e.g. "Let me fly and _ away!"). Then some LM/MLM is employed to predict possible words at the timestep of "_". Thus, dynamic patterns provide information about the target word to the model via Hearst-like patterns.
Duplicate input This method duplicates the original example while hiding the target word in the copy (e.g., "Let me fly away! Let me _ away!"). Then possible words at the timestep of "_" are predicted. It is based on our observation that Transformer-based MLMs are very good at predicting words from the context when they fit the specified timestep (copying), while still giving high probability to their distributionally similar alternatives.
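As an illustration, the inputs for the two injection methods above can be constructed as follows (a sketch using our own helper name; "[MASK]" stands in for the model-specific blank token, e.g. BERT's mask token):

```python
def build_injection_input(left, target, right, method, mask="[MASK]"):
    """Construct the model input for two target-injection methods.

    `mask` is the blank token at whose timestep the LM/MLM predicts
    substitutes ("[MASK]" for BERT-style MLMs).
    """
    if method == "dynamic_pattern":
        # "T and _": target is shown via a Hearst-like pattern.
        return f"{left} {target} and {mask} {right}"
    if method == "duplicate_input":
        # Original sentence followed by a copy with the target hidden.
        return f"{left} {target} {right} {left} {mask} {right}"
    raise ValueError(f"unknown method: {method}")
```

The returned string would then be tokenized and fed to the model, reading the distribution at the mask position.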
Original input For MLMs such as BERT and RoBERTa, instead of masking the target word, we can leave it intact. Thus, the model predicts possible words at the target position while receiving the original target at its input. We noticed that, unlike with duplicate input, in this case the MLM often puts almost the whole probability mass on the original target and gives very small probabilities to other words, making their ranking less reliable. For XLNet, we can use such attention masks that the context words can see the target word in the content stream. Thus, the content stream becomes a full self-attention layer and sees all words in the original example. We do not apply the original input technique to context2vec and ELMo since, for these models, there is no reasonable representation that can be used to predict possible words at some timestep while depending on the input at that timestep, at least without fine-tuning. For the other models, this option significantly outperforms target word masking and does not require much additional effort. Hence, if not specified otherwise, we use it in our experiments with the pure BERT, RoBERTa, and XLNet estimators, and in the +embs method when estimating P(s|C) with BERT and XLNet.

Neural Language Models for Lexical Substitution
Different LMs/MLMs are employed as described below to obtain context-based substitute probability distribution P (s|C). For each of them, we experiment with different target injection methods.
C2V Context2vec encodes the left and right context separately using its forward and backward LSTM layers respectively, and then combines their outputs with two feed-forward layers producing the final full-context representation. Possible substitutes are ranked by the dot-product similarity between their embeddings and the context representation. We use the original implementation and the weights pre-trained on the ukWaC corpus.
ELMo To encode the left and right context with ELMo, we use its forward and backward stacked LSTMs, which were pre-trained with the LM objective. However, there are no pre-trained layers to combine their outputs. Thus, we obtain two independent distributions over possible substitutes: one given the left context P(s|L), another given the right context P(s|R). To combine these distributions we use the BComb-LMs method proposed in (Arefyev et al., 2019), which can be derived similarly to the +embs method described above: P(s|L, R) ∝ P(s|L) · P(s|R) / P(s). The original version of ELMo described in (Peters et al., 2018) is used, which is the largest version pre-trained on the 1B Word Benchmark.
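A minimal sketch of the BComb-LMs combination of the two directional distributions (our own illustration, assuming the forward/backward distributions and the prior are given as arrays):

```python
import numpy as np

def bcomb_lms(p_left, p_right, p_prior):
    """Bayesian combination of forward and backward LM distributions:
    P(s|L,R) ∝ P(s|L) * P(s|R) / P(s), assuming conditional
    independence of L and R given s."""
    scores = p_left * p_right / p_prior
    return scores / scores.sum()
```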
BERT/RoBERTa By default, we give our full example without any masking as input and calculate the distribution over the model's vocabulary at the position of the first subword of the target word. We employ BERT-large-cased and RoBERTa-large models as the best-performing ones. Unlike BERT and XLNet, we found that RoBERTa with +embs injection method performs better if cosine similarity instead of dot-product similarity is used when estimating P (s|T ) and the target word is masked when estimating P (s|C). Thus, in the following experiments, we use these choices by default for RoBERTa+embs model.
XLNet By default, we use the XLNet-large-cased model with the original input, obtaining substitute probability distribution similarly to BERT. We found that for short contexts, XLNet performance degrades. To mitigate this problem, we prepend the initial context with some text ending with the end-of-document special token.
OOC The Out of Context model ranks possible substitutes by their cosine similarity to the target word and completely ignores the given context. Following (Roller and Erk, 2016), we use the dependency-based embeddings released by (Melamud et al., 2015b).
nPIC The non-parameterized Probability In Context model returns the product of two distributions measuring the fitness of a substitute to the context and to the target: nPIC(s|T, C) = P(s|T) × P_n(s|C), where P(s|T) ∝ exp(⟨emb_s, emb_T⟩) and P_n(s|C) ∝ exp(Σ_{c∈C} ⟨emb_s, emb′_c⟩). Here emb and emb′ are dependency-based word and context embeddings respectively, and C are the words directly connected to the target in the dependency tree.
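The nPIC scoring can be illustrated as follows (our own sketch; the word and context embedding matrices are assumed to be given, with rows indexed by the vocabulary):

```python
import numpy as np

def npic_scores(word_embs, ctx_embs, target_idx, context_ids):
    """nPIC(s|T,C) = P(s|T) * Pn(s|C), with
    P(s|T)  ∝ exp(<emb_s, emb_T>)
    Pn(s|C) ∝ exp(sum over c in C of <emb_s, emb'_c>).

    word_embs  : (V, d) dependency-based word embeddings
    ctx_embs   : (V, d) dependency-based context embeddings
    target_idx : vocabulary index of the target word
    context_ids: indices of words directly connected to the target
                 in the dependency tree
    """
    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    p_target = softmax(word_embs @ word_embs[target_idx])
    p_context = softmax(word_embs @ ctx_embs[context_ids].sum(axis=0))
    scores = p_target * p_context
    return scores / scores.sum()
```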

Intrinsic Evaluation
We perform an intrinsic evaluation of the proposed models on two lexical substitution datasets.

Experimental Setup
The lexical substitution task is concerned with finding appropriate substitutes for a target word in a given context. It was originally introduced in SemEval 2007 Task 10 (McCarthy and Navigli, 2007) to evaluate how distributional models handle polysemous words. In the lexical substitution task, annotators are provided with a target word and its context, and their task is to propose possible substitutes. Since there are several annotators, each possible substitute in each example has a weight equal to the number of annotators who provided this substitute.
We rank substitutes for a target word in a context by acquiring a probability distribution over the vocabulary at the target position. The lexical substitution task comes in two variations: candidate ranking and all-vocab ranking. In the candidate ranking task, models are provided with a list of candidates. Following previous works, we acquire this list by merging all gold substitutes of the target lemma over the corpus. We measure performance on this task with Generalized Average Precision (GAP), introduced in (Thater et al., 2010). GAP is similar to Mean Average Precision; the difference is in the weights of substitutes: the higher the weight of a word, the higher it should be ranked. Following (Melamud et al., 2015a), we discard all multi-word expressions from the gold substitutes and omit all instances that are left without gold substitutes.
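For concreteness, one common formulation of GAP can be sketched as follows (our own illustration following the description above, not the evaluation code used in the paper):

```python
def gap(ranked_weights, gold_weights):
    """Generalized Average Precision (Thater et al., 2010), sketch.

    ranked_weights: gold weight of each predicted substitute, in
                    predicted rank order (0 for non-gold predictions)
    gold_weights:   weights of all gold substitutes
    """
    num, cum = 0.0, 0.0
    for i, w in enumerate(ranked_weights, start=1):
        cum += w
        if w > 0:
            num += cum / i           # average gold weight over top-i
    den, cum = 0.0, 0.0
    for j, w in enumerate(sorted(gold_weights, reverse=True), start=1):
        cum += w
        den += cum / j               # same quantity for the ideal ranking
    return num / den
```

A perfect ranking (gold substitutes first, ordered by weight) yields GAP = 1.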
In the all-vocab ranking task, models are not provided with candidate substitutes, which makes it a much harder task than the previous one. Models must give higher probabilities to gold substitutes than to all other words in their vocabulary, which usually contains hundreds of thousands of words. Following (Roller and Erk, 2016), we calculate the precision of the top 1 and top 3 predictions (P@1, P@3) as evaluation metrics for the all-vocab ranking task. Additionally, we report the recall of the top 10 predictions (R@10).
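These precision and recall metrics can be sketched as (our own minimal implementation of the standard definitions):

```python
def precision_at_k(predicted, gold, k):
    """Fraction of the top-k predicted substitutes that are gold."""
    return sum(1 for s in predicted[:k] if s in gold) / k

def recall_at_k(predicted, gold, k=10):
    """Fraction of gold substitutes found among the top-k predictions."""
    return sum(1 for s in gold if s in predicted[:k]) / len(gold)
```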
The following lexical substitution datasets are used: SemEval 2007 Task 10 (McCarthy and Navigli, 2007) consists of 2010 sentences for 201 polysemous words, 10 sentences for each. Annotators were asked to give up to 3 possible substitutes.
CoInCo or Concepts-In-Context dataset (Kremer et al., 2014) consists of about 2500 sentences that come from fiction and news. In these sentences, each content word is a separate example, resulting in about 15K examples. Annotators provided at least 6 substitutes for each example.

Results and Discussion
Comparison to previously published results Table 1 contains metrics for the candidate and all-vocab ranking tasks. We compare our best model (XLNet+embs) with the best previously published results presented in (Roller and Erk, 2016), the context2vec (c2v) model (Melamud et al., 2016), and BERT for lexical substitution presented in (Zhou et al., 2019). The proposed model outperforms solid models such as PIC, c2v, and substitute vectors by a large margin on both ranking tasks. Nevertheless, (Zhou et al., 2019) reported better results than XLNet+embs on both lexical substitution tasks. In (Zhou et al., 2019), the authors add a substitute validation metric that measures the fitness of a substitute to a context. It is
computed as the weighted sum of cosine similarities of contextualized representations of words in two sentence versions: the original one and one where the target word is replaced with the substitute. This technique substantially improves predictions. However, substitute validation requires additional forward passes, hence increasing computational overhead, whereas our methods need only one forward pass. Our approach is orthogonal to substitute validation, so a combination of the two methods could improve results further. It is worth mentioning that BERT and XLNet work on the subword level. Hence, their vocabularies are much smaller (∼30K subwords) than those of ELMo (800K words) or C2V (180K words) and contain only a fraction of possible substitutes. Thus, these models could be significantly improved by generating multi-token substitutes. Additionally, Table 1 includes an ablation analysis of our best model. Using ordinary softmax (which is equivalent to setting T = 1.0) results in all metrics decreasing by a large margin. Also, post-processing of substitutes has a significant impact on all-vocab ranking metrics. Since LMs/MLMs generate grammatically plausible word forms and often generate the target word itself among the top substitutes, additional lemmatization and target exclusion are required to match gold substitutes. In (Roller and Erk, 2016), the authors used the NLTK English stemmer to exclude all forms of the target word. In (Melamud et al., 2016), the NLTK WordNet lemmatizer is used to lemmatize only the candidates. For a fair comparison, the same post-processing is used for all models in the following experiments.

Table 2: Comparison of our models and re-implemented baselines with the same post-processing.
Re-implementation of the baselines In Table 2, we compare our models based on different LMs/MLMs with and without the +embs injection technique. Remember that BERT, RoBERTa, and XLNet see the target even if +embs is not applied, thus providing already strong baseline results. All compared models, including the re-implemented OOC and nPIC, employ the same post-processing consisting of substitute lemmatization followed by target exclusion. First, we notice that our best substitution models substantially outperform the word2vec-based PIC and OOC methods. For example, XLNet+embs gives 2x better P@1 and P@3 than the baselines. This indicates that the proposed models are better at capturing the meaning of a word in context as such, providing more accurate substitutes. On the candidate ranking task, the bare C2V model outperforms the ELMo- and BERT-based models, but it shows the lowest P@1. We note that the +embs technique substantially improves the performance of all models in the all-vocab ranking task, and also increases GAP for the majority of models.
Injection of information about the target word Next, we compare the target injection methods described in Section 3. Figure 1 presents the Recall@10 metric for all of our neural substitution models with each applicable target injection method on CoInCo. Application of dynamic patterns leads to lower performance even compared to the models that do not see the target word at all. Although the pattern shows the target word to the substitute generator, it can spoil predictions: e.g., applying the "T and _" pattern to a verb can produce words denoting subsequent actions related to that verb rather than its synonyms. When we use the original input without any masking, the models produce substitutes more related to the target, resulting in strong baseline performance. Applying the +embs method leads to a significant increase in Recall@10 for all models and is almost always the best-performing injection method. For C2V and ELMo, it gives 2-3 times better recall than all other injection methods. Duplicating the input works surprisingly well for the Transformer-based models but does not help the LSTM-based C2V and ELMo. This is likely related to previous observations that Transformers are very good at copying words from the surrounding context, thus predicting the original target word, but also words with similar embeddings. The impact is highest for XLNet, which cannot straightforwardly use information from the target position due to its autoregressive nature but can easily find the target in the sentence copy. Overall, our experiments show that proper injection of information about the target word results in much more plausible generated substitutes.

Extrinsic Evaluation
In this section, we perform the extrinsic evaluation of our models applied to the Word Sense Induction (WSI) task. The data for this task commonly consists of a list of ambiguous target words and a corpus of sentences containing these words. Models are required to cluster all occurrences of each target word according to their meaning. Thus, the senses of all target words are discovered in an unsupervised fashion. For example, suppose that we have the following sentences with the target word bank: 1. He settled down on the river bank and contemplated the beauty of nature, 2. They unloaded the tackle from the boat to the bank.

3. Grand River bank now offers a profitable mortgage.
Sentences 1 and 2 shall be put in one cluster, while sentence 3 must be assigned to another. This task was proposed in several SemEval competitions (Agirre and Soroa, 2007; Manandhar et al., 2010; Jurgens and Klapaftis, 2013). The current state-of-the-art approach (Amrami and Goldberg, 2019) relies on substitute vectors, i.e., each word usage is represented as a substitute vector based on the most probable substitutes, and then clustering is performed over these substitute vectors.

Table 3: Comparison of our models with the previous SOTA on the SemEval-2010 (AVG) and SemEval-2013 (AVG) WSI datasets.

We implemented a WSI algorithm using lexical substitutes from our models. The algorithm is a simplified version of the methods described in (Amrami and Goldberg, 2019; Arefyev et al., 2019). In the first step, we generate substitutes for each example, lemmatize them, and take the 200 most probable ones. We treat these 200 substitutes as a document. Then, TF-IDF vectors for these documents are calculated and clustered using agglomerative clustering with average linkage and cosine distance. The number of clusters maximizing the silhouette score is selected for each word individually.
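The clustering step described above can be sketched with scikit-learn and SciPy. This is a simplified illustration: the silhouette-based selection of the number of clusters is omitted and the number of clusters is passed explicitly.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_usages(substitute_docs, n_clusters):
    """Cluster word usages, each represented as a 'document' of its
    generated substitutes: TF-IDF vectors followed by agglomerative
    clustering with average linkage and cosine distance."""
    tfidf = TfidfVectorizer().fit_transform(substitute_docs).toarray()
    merge_tree = linkage(tfidf, method="average", metric="cosine")
    return fcluster(merge_tree, t=n_clusters, criterion="maxclust")
```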
In Table 3 we compare our lexical substitution models on two WSI datasets. For the previous SOTA models, which are stochastic algorithms, the mean and the standard deviation are reported. Our WSI algorithm is deterministic; hence, we report the results of a single run. Our best model achieves higher metrics than the previous SOTA on both datasets; however, the difference is within one standard deviation. Similarly to the intrinsic evaluation results, our +embs injection method substantially improves the performance of all models for WSI, except for the C2V+embs model on SemEval-2010, which probably used suboptimal hyperparameters.

Analysis of Semantic Relation Types
In this section we analyze the types of substitutes produced by different neural substitution models.

Experimental Setup
For this analysis, the CoInCo lexical substitution dataset (Kremer et al., 2014) described above is used. We employ WordNet (Miller, 1995) to find the relationship between a target word and each generated substitute. First, among all possible WordNet synsets, the two synsets containing the target word and its substitute with the shortest path between them are selected. Then the relation between these synsets is identified as follows. If there is a direct relation between the synsets, i.e. synonymy, hyponymy, hypernymy, or co-hyponymy with a common direct hypernym, we return this relation. Otherwise, we search for an indirect relation, i.e. transitive hyponymy or hypernymy, or co-hyponymy with a common hypernym at a distance of at most 3 hops from each synset. We also introduce several auxiliary relations: unknown-word - the target or the substitute is not found among WordNet synsets with the required PoS; unknown-relation - the target and the substitute are in the same WordNet tree, but none of the relations described above can be assigned; no-path - the target and the substitute are in different trees.
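The relation-labeling procedure can be illustrated on a toy hypernym graph standing in for WordNet (our own sketch; synsets are reduced to strings, only hypernym links are modeled, and the label names are simplified):

```python
def classify_relation(target_syn, subst_syn, hypernym, max_hops=3):
    """Label the relation between two synsets given a hypernym map
    (child -> parent). Returns 'synonym', '(transitive-)hypernym',
    '(transitive-)hyponym', 'co-hypo-k', or 'unknown-relation'."""
    if target_syn == subst_syn:
        return "synonym"

    def ancestors(syn):
        """Map each ancestor of `syn` to its distance in hops."""
        path, cur, dist = {}, syn, 0
        while cur in hypernym:
            cur = hypernym[cur]
            dist += 1
            path[cur] = dist
        return path

    up_t, up_s = ancestors(target_syn), ancestors(subst_syn)
    if subst_syn in up_t:   # substitute is above the target
        return "hypernym" if up_t[subst_syn] == 1 else "transitive-hypernym"
    if target_syn in up_s:  # substitute is below the target
        return "hyponym" if up_s[target_syn] == 1 else "transitive-hyponym"
    common = set(up_t) & set(up_s)
    if common:
        dist = min(max(up_t[c], up_s[c]) for c in common)
        if dist <= max_hops:
            return f"co-hypo-{dist}"
    return "unknown-relation"
```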

Discussion of Results
For nouns and verbs, the proportions of non-auxiliary relations are shown in Figure 2; for all words and relations see Appendix B. Our analysis shows that a substantial fraction of substitutes has no direct relation to the target word in terms of WordNet, even in the case of the gold standard substitutes. Besides, even human annotators occasionally provide substitutes of an incorrect PoS, e.g., for bright as an adjective there is the verb glitter among the gold substitutes. For adjectives and adverbs, 18% and 25% of gold substitutes are unknown words (absent among synsets with the correct PoS), while for verbs and nouns less than 7% are unknown. For the baseline models OOC and nPIC, the overwhelming majority of substitutes are unknown words. One reason for this might be that their vocabularies contain words with typos, but we also noticed that these models frequently do not preserve the PoS of the target word. The models based on LMs/MLMs produce much fewer unknown substitutes. Surprisingly, our +embs target injection method further reduces the number of such substitutes, achieving a proportion comparable to the gold standard. We can therefore suggest that our injection method helps to better preserve the correct PoS even for SOTA MLMs. For both nouns and verbs, the +embs target injection method consistently increases the proportions of synonyms, direct hyponyms, and direct hypernyms, while decreasing the proportions of distantly related co-hyponyms (co-hypo-3) or unrelated substitutes. This is more similar to the proportions in human substitutes. Thus, the addition of information from embeddings forces the models to produce words that are more closely related to the target word and more similar to human answers. For C2V and ELMo, which otherwise have no information about the target word, target word injection results in 2x-3x more synonyms generated.
For several sentences from the SemEval 2007 dataset (McCarthy and Navigli, 2007), Figure 3 shows examples of substitutes provided by the human annotators (GOLD) and generated by several models; see Appendix A for more examples and models. The first example shows a case where the +embs injection method improves the result, ranking closely related substitutes, such as telephone and cellphone, higher. The substitutes provided by the bare XLNet model, such as electricity and internet, could be used in this context, but all the annotators preferred the synonym telephone instead. The second case illustrates a failure of the proposed method. The bare XLNet model generated substitutes related to the correct sense of the ambiguous target word can, and has all three gold substitutes among its top 10 predictions. In contrast, XLNet+embs produced words related to the most frequent sense: will, could, cannot, etc. We hypothesize that this problem could potentially be alleviated by choosing an individual temperature for each example based on the characteristics of the combined distributions; this is a possible direction for our further research.

Conclusion
We presented the first comparison of a wide range of LMs/MLMs with different target word injection methods on the tasks of lexical substitution and word sense induction. Our results are the following: (i) large pre-trained language models yield better results than previous unsupervised and supervised methods of lexical substitution; (ii) if properly done, the integration of information about the target word substantially improves the quality of lexical substitution models. The proposed target injection method based on a fusion of a context-based distribution P (s|C) with a target similarity distribution P (s|T ) proved to be the best one. When applied to the XLNet model, it yields new SOTA results on two WSI datasets. Finally, we study the semantics of the produced substitutes. This information can be valuable for practitioners selecting the most appropriate lexical substitution method for a particular NLP application.