Don’t Neglect the Obvious: On the Role of Unambiguous Words in Word Sense Disambiguation

State-of-the-art methods for Word Sense Disambiguation (WSD) combine two different features: the power of pre-trained language models and a propagation method to extend the coverage of such models. This propagation is needed as current sense-annotated corpora lack coverage of many instances in the underlying sense inventory (usually WordNet). At the same time, unambiguous words make up a large portion of all words in WordNet, while being poorly covered in existing sense-annotated corpora. In this paper we propose a simple method to provide annotations for most unambiguous words in a large corpus. We introduce the UWA (Unambiguous Word Annotations) dataset and show how a state-of-the-art propagation-based model can use it to extend the coverage and quality of its word sense embeddings by a significant margin, improving on its original results on WSD.


Introduction
There has been substantial recent progress in word sense disambiguation (WSD), driven by two factors: (1) the introduction of large pre-trained Transformer-based language models and (2) propagation algorithms that extend the coverage of existing training sets. The gains due to pre-trained Neural Language Models (NLMs) such as BERT (Devlin et al., 2019) have been outstanding, helping reach levels close to human performance when training data is available. These models are generally based on a nearest neighbours strategy, where each sense is represented by a vector, exploiting the contextualized embeddings of these NLMs (Melamud et al., 2016; Peters et al., 2018; Loureiro and Jorge, 2019). However, training data for WSD is hard to obtain, and the most widely used training set nowadays, based on WordNet, dates back to the 1990s (Miller et al., 1993, SemCor). This lack of curated data produces the so-called knowledge-acquisition bottleneck (Gale et al., 1992; Navigli, 2009).
However, there is a key source of information that has been neglected so far in existing sense-annotated corpora and propagation methods: the presence of unambiguous words from the underlying knowledge resource. Strikingly, WordNet, which is known to be a comprehensive resource, is mostly composed of unambiguous entries (30k lemmas are ambiguous, compared to 116k unambiguous). While the lack of unambiguous annotations does not have a direct effect on WSD, the fact that these unambiguous words are part of the same semantic network means they can influence ambiguous words via standard propagation algorithms. These propagation algorithms start from a seed of senses occurring in the training data (whose embeddings can therefore be directly computed) and then propagate to the whole sense inventory via the semantic network (Vial et al., 2018; Loureiro and Jorge, 2019). Consequently, computing sense embeddings for unambiguous words can increase the number of seeds and improve the whole process. Covering these unambiguous words, moreover, is not an arduous task, as unlabelled corpora may suffice. We explore this hypothesis by labeling a large number of unambiguous words in corpora extracted from the web, using WordNet as our reference sense inventory. While we can certainly find usages of a word not covered by WordNet, we found that our approach can obtain accurate occurrences with simple heuristics.
The contribution of this paper is twofold. First, we devise a simple methodology to construct UWA (Unambiguous Word Annotations), a large and, most importantly, diverse sense-annotated corpus that focuses on WordNet unambiguous words. Second, we show that by leveraging UWA, we can significantly improve a state-of-the-art WSD model.

Related Work
The knowledge-acquisition bottleneck has been frequently addressed by automatically constructing sense-annotated corpora. Recent works propose methods that exploit knowledge from Wikipedia, such as NASARI vectors (Camacho-Collados et al., 2016), for providing sense annotations for concepts and entities (Scarlini et al., 2019). In the case of Scarlini et al. (2019), and similarly to Raganato et al. (2016), their method requires hyperlinks and category information from Wikipedia, and hence is not extensible to other kinds of corpora. Other approaches have relied on parallel corpora for two or more languages. The OMSTI corpus (Taghipour and Ng, 2015) was constructed by exploiting the alignments of an English-Chinese corpus. Similarly, Delli Bovi et al. (2017) presented EuroSense, a multilingual sense-annotated corpus built from the Europarl parallel corpus, covering 21 languages. In contrast to these approaches, we focus on unambiguous senses and, therefore, are not constrained to only nouns, knowledge from Wikipedia, or a specific type of corpus.
Earlier works exploiting unambiguous words (Leacock et al., 1998; Mihalcea, 2002; Agirre and Martinez, 2004), and especially the subsequent extension by Martinez et al. (2008), are the most directly related to our paper. Martinez et al. (2008) retrieved example sentences with monosemous nouns from web search snippets and used them, together with WordNet relations, to improve WSD performance. However, the WSD methods analyzed were sensitive to frequency bias, leading them to collect a large number of examples for fewer senses (and only nouns). In contrast, our solution is designed for all monosemous words, retrieves examples from web texts instead of snippets, and attains performance gains with even a single example per word.

Methodology
In this section we first explain our method to construct a corpus with unambiguous word annotations (Section 3.1). We then describe current WSD models based on pre-trained language models (Section 3.2) and a propagation method to infer additional out-of-vocabulary (OOV) sense representations (Section 3.3).

Unambiguous Word Annotations (UWA)
In order to properly test our hypothesis, we first require a sizable compilation of unambiguous words in context, particularly words whose lemmas are covered by WordNet. Given the extensiveness of WordNet, most of its lemmas occur very rarely, so achieving high coverage requires processing large volumes of text. In this work we therefore develop the Unambiguous Word Annotations (UWA) corpus based on OpenWebText (Gokaslan and Cohen, 2019) and English Wikipedia (November 2019), processing over 53GB of texts from the web.
Each text is annotated with lemmas and part-of-speech tags using the Stanford CoreNLP toolkit (Manning et al., 2014). The annotations are filtered so that we only consider lemma/part-of-speech pairs that are present in WordNet and correspond to a single sense (hence unambiguous), e.g., 'keypad/noun'. Naturally, some lemma/part-of-speech pairs may have additional meanings not covered in WordNet. For example, in "Inception was a box-office hit.", Inception refers to a movie and not to the unambiguous word inception from WordNet. To mitigate this issue, we applied Named Entity Recognition (NER) tagging, using spaCy (Honnibal and Montani, 2017), to discard lemmas that are recognized as entities but do not correspond to an entity in their WordNet sense. To this end, we leverage the entity annotations of WordNet synsets available in BabelNet (Navigli and Ponzetto, 2012). This contrasts with existing sense-annotated corpora such as OMSTI (Taghipour and Ng, 2015) or Train-o-Matic: not being aimed specifically at unambiguous words, they have limited coverage in this respect, as they are mainly composed of annotations for senses already available in SemCor.
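The filtering heuristic described above can be sketched as follows. This is a minimal illustration, not the actual UWA pipeline: the inventory, sense keys, and token format are toy stand-ins for WordNet and the CoreNLP/spaCy output.

```python
# Sketch of the UWA filtering heuristic over a toy sense inventory.
# Lemma/POS pairs, sense keys, and tokens are illustrative stand-ins
# for WordNet and the annotated web corpora used in the paper.

TOY_INVENTORY = {
    ("keypad", "NOUN"): ["keypad%1:06:00::"],                 # unambiguous
    ("inception", "NOUN"): ["inception%1:11:00::"],           # unambiguous in WordNet
    ("bank", "NOUN"): ["bank%1:14:00::", "bank%1:17:01::"],   # ambiguous
}

def is_unambiguous(lemma: str, pos: str) -> bool:
    """A lemma/POS pair qualifies if it maps to exactly one sense."""
    return len(TOY_INVENTORY.get((lemma, pos), [])) == 1

def annotate(tokens):
    """Attach the single sense to every unambiguous lemma/POS token,
    skipping tokens flagged as named entities (the NER filter)."""
    annotations = []
    for lemma, pos, is_entity in tokens:
        if is_entity:  # e.g. 'Inception' the movie, not the WordNet noun
            continue
        if is_unambiguous(lemma, pos):
            annotations.append((lemma, pos, TOY_INVENTORY[(lemma, pos)][0]))
    return annotations

sentence = [("inception", "NOUN", True),   # NER-tagged: discarded
            ("keypad", "NOUN", False),
            ("bank", "NOUN", False)]
print(annotate(sentence))  # only 'keypad' survives both filters
```

In the real pipeline the entity check additionally consults BabelNet's entity annotations, so that lemmas whose single WordNet sense *is* an entity are kept.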

Neural Language Models for WSD
Recent NLMs, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), have been used with a high degree of success on WSD. They have been applied differently depending on the nature of the disambiguation task: as feature providers for other neural architectures (Vial et al., 2019), as simple classifiers after fine-tuning (Wang et al., 2019), or as generators of contextual embeddings to be matched through nearest neighbours (1NN; Melamud et al., 2016; Peters et al., 2018; Loureiro and Jorge, 2019; Reif et al., 2019). Our experiments in this paper focus on improving the latter type of approach. In particular, we investigate the state-of-the-art LMMS model (Loureiro and Jorge, 2019), which learns sense embeddings based on BERT states. These embeddings are then propagated through WordNet's ontology to infer additional senses, effectively providing full coverage. While Loureiro and Jorge (2019) proposed variants of LMMS that combine propagation with gloss embeddings or static embeddings, this paper is only concerned with the propagation method. We essentially follow LMMS's layer pooling method to generate contextual embeddings for each sense occurrence in context (from a training set), and derive sense embeddings as the average of all corresponding contextual embeddings.
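The averaging and 1NN matching steps can be sketched as below. This is a simplified illustration with tiny random vectors standing in for BERT states; the sense keys are hypothetical and the actual LMMS implementation additionally handles layer pooling and sub-word aggregation.

```python
import numpy as np

# Minimal 1NN sense-matching sketch: a sense embedding is the average of
# the contextual embeddings of its annotated occurrences; disambiguation
# picks the candidate sense with the highest cosine similarity.

rng = np.random.default_rng(0)
DIM = 8  # toy dimensionality; BERT-large states are 1024-dimensional

# contextual embeddings grouped by sense key (from an annotated corpus)
occurrences = {
    "bank%1:14:00::": [rng.normal(size=DIM) for _ in range(3)],
    "bank%1:17:01::": [rng.normal(size=DIM) for _ in range(2)],
}

sense_embeddings = {s: np.mean(vecs, axis=0) for s, vecs in occurrences.items()}

def disambiguate(context_vec, candidates):
    """Return the candidate sense nearest to the contextual embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda s: cos(context_vec, sense_embeddings[s]))

# a test occurrence close to the first sense's centroid
query = sense_embeddings["bank%1:14:00::"] + 0.01 * rng.normal(size=DIM)
print(disambiguate(query, list(sense_embeddings)))
```

In the informed WSD setting, the candidate list is restricted to the senses of the target lemma/PoS; the USM setting discussed later drops this restriction.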

Network Propagation for Full-Coverage
The propagation method used in LMMS exploits the WordNet ontology to obtain full coverage of sense embeddings from an initial set of embeddings based on a manually sense-annotated corpus such as SemCor. This method explores different abstraction levels represented in WordNet: sets of synonyms (synsets), Is-A relations (hypernyms) and categorical groupings (lexnames).
Initial sense embeddings are first used to compute synset embeddings as the average of all corresponding senses (analogously to how sense embeddings are computed from contextual embeddings).
From that point, missing senses are represented by their corresponding synset embeddings. The remaining unrepresented senses are inferred from their hypernym and lexname embeddings, computed by averaging their neighbouring synset embeddings. Note that this propagation process does not follow transitive relations in WordNet: only a synset's direct hypernym is considered, while subsequent hypernyms along the path to the root are ignored.
Since lexname embeddings can always be computed, this process can reach full coverage of WordNet starting with just the initial set of embeddings produced using SemCor. However, the set of SemCor embeddings only covers 16.1% of WordNet, so many of the inferred representations are redundant and therefore not entirely meaningful.
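The three-level backoff can be sketched as follows. The graph, synset names, and vectors are toy stand-ins (not real WordNet data), and the sketch simplifies the synset-averaging step described above.

```python
import numpy as np

# Sketch of the LMMS-style backoff: a synset with annotated occurrences keeps
# its own embedding; otherwise we average the annotated synsets under its
# direct hypernym; lexname embeddings are the last resort (always available).

DIM = 4
seen = {"dog.n.01": np.ones(DIM), "cat.n.01": 2 * np.ones(DIM)}  # from SemCor
hypernym = {"wolf.n.01": "canine.n.02"}                          # direct only
members = {"canine.n.02": ["dog.n.01", "wolf.n.01"]}             # hyponyms
lexname = {"wolf.n.01": "noun.animal"}
lexname_vec = {"noun.animal": np.mean(list(seen.values()), axis=0)}

def embed(synset):
    if synset in seen:                                  # level 1: direct evidence
        return seen[synset]
    hyper = hypernym.get(synset)
    if hyper:                                           # level 2: hypernym average
        sibling_vecs = [seen[s] for s in members[hyper] if s in seen]
        if sibling_vecs:
            return np.mean(sibling_vecs, axis=0)
    return lexname_vec[lexname[synset]]                 # level 3: lexname backoff

print(embed("wolf.n.01"))  # inferred from dog.n.01 via the shared hypernym
```

Because level 2 only looks at the direct hypernym, transitive ancestors never contribute, which mirrors the non-transitive behaviour noted above; it also makes clear why sparse seeds yield many identical level-2 and level-3 embeddings.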

Evaluation
For our experiments we are interested in verifying the impact of using UWA to improve WSD performance. In particular, we test the unambiguous annotations of UWA as a complement to existing sense-annotated training data. To this end, as explained in Section 3, we make use of the state-of-the-art WSD model LMMS (Loureiro and Jorge, 2019). In addition to the original version using BERT, we also provide results with RoBERTa (Liu et al., 2019) for completeness. We use the 24-layer models for both BERT and RoBERTa.

Table 2 shows the WSD results on the standard evaluation framework for LMMS trained on the concatenation of SemCor and automatically-constructed corpora. In the table we include UWA with two different maximum numbers of examples per unambiguous word, i.e., 1 and 10. For comparison, we also include the results of EWISE (Kumar et al., 2019) and GlossBERT (Huang et al., 2019), which attempt to overcome the limited coverage of SemCor by exploiting textual definitions. As can be observed, the concatenation of our UWA corpus and SemCor provides the best overall results, regardless of the number-of-examples cut-off. Perhaps surprisingly, our corpus is the only one that provides improvements over the baseline (SemCor-only). These improvements are statistically significant on the full test set (i.e., ALL) for both BERT and RoBERTa with p < 0.0005, based on a t-test with respect to the accuracy scores (equal to F1 in this setting). This can be explained by the fact that our corpus is the only one that significantly extends the coverage of SemCor, as explained in Section 3.1.

Uninformed Sense Matching (USM)
In standard WSD benchmarks, models have the advantage of knowing the pre-defined set of possible senses beforehand, since gold PoS tags and lemmas are provided in these datasets. However, to better understand how robust a 1NN WSD model is, we can test it in an uninformed setting, i.e., where PoS tags and lemmas are not given and the model does not have access to the list of candidate senses. Instead, the model has to match senses from the whole sense inventory, unconstrained. In this Uninformed Sense Matching (USM) setting we can therefore use information retrieval ranking metrics over the model predictions (e.g., MRR or P@K) in addition to the standard F1.
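These ranking metrics are standard and can be computed as follows; the sense keys in the example are hypothetical.

```python
# Minimal MRR and P@K for the USM setting: each prediction is a ranked list
# of sense keys over the whole inventory; gold is the correct sense.

def mrr(ranked_lists, golds):
    """Mean reciprocal rank of the gold sense over all instances."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, golds):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(golds)

def precision_at_k(ranked_lists, golds, k):
    """Fraction of instances whose gold sense appears in the top k."""
    hits = sum(1 for ranking, gold in zip(ranked_lists, golds)
               if gold in ranking[:k])
    return hits / len(golds)

rankings = [["bank%1", "bank%2", "bank%3"],   # gold ranked 1st
            ["note%2", "note%1"]]             # gold ranked 2nd
golds = ["bank%1", "note%1"]
print(mrr(rankings, golds))                # (1/1 + 1/2) / 2 = 0.75
print(precision_at_k(rankings, golds, 1))  # 1/2 = 0.5
```

P@1 over the full inventory coincides with uninformed accuracy, which is why F1 remains reportable alongside the ranking metrics.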
In line with the WSD results, Table 3 shows that UWA also substantially improves performance in the USM setting when comparing against currently available alternatives.

Analysis
In this section, we provide an analysis based on the number of examples (Section 5.1) and a visualization of the embedding space (Section 5.2).

Number of Examples
When compiling examples for learning sense representations, a natural question arises: how many examples are required to learn effective representations? The answer can not only guide collection efforts, but also help clarify the requirements for learning effective representations in the simplest setting. To that end, we analyse the impact of using different numbers of examples from UWA on LMMS's WSD and USM performance. In Figure 1, we show the WSD performance trend using different numbers of examples per sense. As can be seen, performance improves substantially with only one example, and then stops improving after just two examples. Similarly to our findings for WSD, Table 4 shows that a low number of examples, such as 2, already achieves the best overall results in the USM setting for BERT. Likewise, RoBERTa does not benefit from more than 5 examples. More generally, in USM the differences with respect to SemCor are more marked than in the regular WSD setting. This is expected, as the propagation algorithm has a stronger effect in this setting, where all sense embeddings are considered.

Visualization of the Embedding Space
The propagation method used in LMMS is designed to back off to increasingly abstract representation levels, from synsets, to hypernyms, to supersenses (see Section 3.3). This naturally leads to a clustering effect, where many senses are represented with very similar, or equal, embeddings. In fact, we find that only 22% of sense embeddings learned from SemCor, and propagated following LMMS, are actually unique (the rest are shared by two or more senses). The addition of UWA increases this percentage to 68%. To better understand this clustering effect, we used t-SNE (Maaten and Hinton, 2008) to visualize the WordNet synset embedding space. In Figure 2 we show synset embeddings learned from the SemCor+UWA(10) dataset and from SemCor alone, both based on RoBERTa. While the same number of synset embeddings is learned in both cases, the SemCor+UWA embeddings are better distributed across the vector space. This, in turn, causes a substantial reduction of high-density clusters, which run counter to a rich distributional representation of senses.
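The uniqueness statistic reported above (22% vs. 68%) can be computed by counting distinct rows in the sense-embedding matrix, as in this small sketch with a toy matrix:

```python
import numpy as np

# Fraction of unique sense embeddings: duplicated rows arise when the
# propagation backoff assigns the same hypernym or lexname embedding to
# several senses. The matrix below is a toy stand-in for the real one.

embeddings = np.array([[1.0, 0.0],
                       [1.0, 0.0],   # duplicate of row 0 (shared via backoff)
                       [0.0, 1.0],
                       [0.5, 0.5]])

unique_rows = np.unique(embeddings, axis=0)
fraction_unique = len(unique_rows) / len(embeddings)
print(fraction_unique)  # 3 distinct rows out of 4 -> 0.75
```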

Conclusion
Unambiguous words make up a surprisingly large portion of existing knowledge resources like WordNet. At the same time, their coverage in existing sense-annotated corpora is very limited. In this paper, we proposed a simple method which exploits sense annotations of unambiguous words from unlabeled corpora, thereby effectively extending existing sense-annotated corpora with low effort. By leveraging a state-of-the-art BERT-based WSD system that propagates sense embeddings across WordNet, we have shown that these unambiguous words provide an excellent bridge to reach a wider range of OOV senses. This translates, in turn, into improved results for WSD. For future work it would be interesting to test these sense embeddings in a wider range of applications outside WSD. Since the embedding space is clearly more diversified, as shown in Figure 2, this may lead to improvements in other downstream tasks.
Moreover, one of the most surprising findings from this paper is that a single occurrence of OOV unambiguous words is enough to improve the performance of WSD models. This is relevant because (1) it is not always easy to retrieve a large number of examples for unambiguous words, and (2) it facilitates a cheaper manual verification, if required.
Finally, we openly release UWA, a large corpus annotated with unambiguous words, together with improved BERT- and RoBERTa-based sense embeddings, model predictions and visualizations at http://danlou.github.io/uwa.