Adapting BERT for Word Sense Disambiguation with Gloss Selection Objective and Example Sentences

Domain adaptation or transfer learning using pre-trained language models such as BERT has proven to be an effective approach for many natural language processing tasks. In this work, we propose to formulate word sense disambiguation as a relevance ranking task, and fine-tune BERT on a sequence-pair ranking task to select the most probable sense definition given a context sentence and a list of candidate sense definitions. We also introduce a data augmentation technique for WSD using existing example sentences from WordNet. Using the proposed training objective and data augmentation technique, our models are able to achieve state-of-the-art results on the English all-words benchmark datasets.


Introduction
In natural language processing, Word Sense Disambiguation (WSD) refers to the task of identifying the exact sense of an ambiguous word given the context (Navigli, 2009). More specifically, WSD associates ambiguous words with predefined senses from an external sense inventory, e.g. WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2010).
Recent studies on learning contextualized word representations from language models, e.g. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), attempt to alleviate the issue of insufficient labeled data by first pre-training a language model on a large text corpus through self-supervised learning. The weights of the pre-trained language model can then be fine-tuned on downstream NLP tasks such as question answering and natural language inference. For WSD, pre-trained BERT has been utilized in multiple ways with varying degrees of success. Notably, Huang et al. (2019) proposed GlossBERT, a model based on fine-tuning BERT on a sequence-pair binary classification task, and achieved state-of-the-art results in terms of single-model performance on several English all-words WSD benchmark datasets.
In this paper, we extend the sequence-pair WSD model and propose a new task objective that better exploits the inherent relationships within positive and negative sequence pairs. Briefly, our contribution is two-fold: (1) we formulate WSD as a gloss selection task, in which the model learns to select the best context-gloss pair from a group of related pairs; (2) we demonstrate how to make use of additional lexical resources, namely the example sentences from WordNet, to further improve WSD performance.
We fine-tune BERT with the gloss selection objective on SemCor (Miller et al., 1994) plus additional training instances constructed from the WordNet example sentences, and evaluate its impact on several commonly used benchmark datasets for English all-words WSD. Experimental results show that the gloss selection objective can indeed improve WSD performance, and that using WordNet example sentences as additional training data offers a further performance boost.

Related Work
BERT (Devlin et al., 2019) is a language representation model based on a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017). Previous experimental results have shown that significant improvements can be achieved on many downstream NLP tasks by fine-tuning BERT on those tasks. Several methods have been proposed to apply BERT to WSD. In this section, we briefly describe two commonly used approaches: feature-based and fine-tuning approaches.

Feature-based Approaches
Feature-based WSD systems make use of contextualized word embeddings from BERT as input features for task-specific architectures. Vial et al. (2019) used the contextual embeddings as inputs to a Transformer-based classifier. They proposed two sense vocabulary compression techniques to reduce the number of output classes by exploiting the semantic relationships between different senses. The Transformer-based classifiers were trained from scratch with the reduced output classes on SemCor and the WordNet Gloss Corpus (WNGC). Their ensemble model, which consists of eight independently trained classifiers, achieved state-of-the-art results on the English all-words WSD benchmark datasets.
Besides deep learning-based approaches, Loureiro and Jorge (2019) and Scarlini et al. (2020) construct sense embeddings using the contextual embeddings from BERT. The former generates sense embeddings by averaging the contextual embeddings of sense-annotated tokens taken from SemCor, while the latter constructs sense embeddings by concatenating the contextual embeddings of BabelNet definitions with the contextual embeddings of Wikipedia contexts. For WSD, both approaches make use of the constructed sense embeddings in nearest neighbor classification (kNN), where the simple 1-nearest neighbor approach of Scarlini et al. (2020) showed substantial improvement on the nominal category of the English all-words WSD benchmark datasets.
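As a rough illustration of this feature-based strategy (not the exact systems cited), sense prediction reduces to a nearest-neighbour lookup over pre-computed sense embeddings; the function and variable names below are ours:

```python
import torch
import torch.nn.functional as F

def nearest_sense(target_embedding, sense_embeddings):
    """Return the candidate sense whose pre-computed embedding is most similar
    (cosine) to the contextual embedding of the target word.

    sense_embeddings maps sense keys to vectors built offline, e.g. by
    averaging BERT embeddings of sense-annotated tokens from SemCor.
    """
    best_sense, best_score = None, float("-inf")
    for sense, embedding in sense_embeddings.items():
        score = F.cosine_similarity(target_embedding, embedding, dim=0).item()
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```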

Fine-tuning Approaches
Fine-tuning WSD systems directly adjust the pre-trained weights on annotated corpora rather than learning new weights from scratch. Du et al. (2019) fine-tuned two separate and independent BERT models simultaneously: one to encode sense-annotated sentences and another to encode sense definitions from WordNet. The hidden states from the two encoders are then concatenated and used to train a multilayer perceptron classifier for WSD.
Huang et al. (2019) proposed GlossBERT, which fine-tunes BERT on a sequence-pair binary classification task. The training data consist of context-gloss pairs constructed from annotated sentences in SemCor and sense definitions from WordNet 3.0. Each context-gloss pair contains a sentence from SemCor with a target word to be disambiguated (context) and a candidate sense definition of the target word from WordNet (gloss). During fine-tuning, GlossBERT classifies each context-gloss pair as either positive or negative depending on whether the sense definition corresponds to the correct sense of the target word in the context. Each context-gloss pair is treated as an independent training instance and is shuffled to a random position at the start of each training epoch. At the inference stage, the context-gloss pair with the highest output score from the positive neuron among all candidates is chosen as the best answer.
In this paper, we use similar context-gloss pairs as inputs for our proposed WSD model. However, instead of treating each context-gloss pair as an independent training instance, we group related context-gloss pairs into one training instance, i.e. context-gloss pairs with the same context but different candidate glosses are considered one group. Using groups of context-gloss pairs as training data, we formulate WSD as a ranking/selection problem in which the most probable sense is ranked first. By processing all related candidate senses in one go, the WSD model is able to learn better discriminating features between positive and negative context-gloss pairs.
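A minimal sketch of this grouping step, assuming a hypothetical dict schema for each pair ('context', 'target', 'gloss', 'is_correct'); this is illustrative and not the authors' released code:

```python
import random
from collections import defaultdict

def group_context_gloss_pairs(pairs):
    """Group context-gloss pairs that share the same context and target word
    into a single training instance."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair["context"], pair["target"])].append(pair)

    instances = []
    for candidates in groups.values():
        random.shuffle(candidates)  # randomize pair order within the instance
        label = next(i for i, p in enumerate(candidates) if p["is_correct"])
        instances.append({"candidates": candidates, "label": label})
    return instances
```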

Methodology
We describe the implementation details of our approaches in this section. When customizing BERT for WSD, we use a linear output layer with a single neuron to compute the relevance score of each context-gloss pair, in contrast to the binary classification layer used in GlossBERT.
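A minimal sketch of such a scoring head built with the transformers package; the class and attribute names (BertWsdRanker, scorer) are ours, not from the paper:

```python
import torch.nn as nn
from transformers import BertModel

class BertWsdRanker(nn.Module):
    """BERT encoder topped with a single-neuron linear layer that outputs
    one relevance score per context-gloss pair."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.scorer = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled = outputs.pooler_output           # [CLS]-based representation of the pair
        return self.scorer(pooled).squeeze(-1)   # shape: (num_pairs,)
```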
Additionally, we also extract example sentences from WordNet 3.0 and use them as additional training data on top of the sense-annotated sentences from SemCor.

Gloss Selection Objective
Following Huang et al. (2019), we construct positive and negative context-gloss pairs by combining annotated sentences from SemCor and sense definitions from WordNet 3.0. A positive pair contains the gloss representing the correct sense of the target word, while a negative pair contains the gloss of an incorrect candidate sense. Each target word in the context is surrounded by two special [TGT] tokens. We group context-gloss pairs with the same context and target word into a single training instance so that they are processed sequentially by the neural network.

Figure 1: Visualisation of the gloss selection objective when computing the loss value for a training instance. The context "He turned slowly and began to crawl back up the bank toward the rampart." is annotated with the target word "bank". A training instance consists of n context-gloss pairs (n = 4 in this case), including one positive pair (shown in green) and n-1 negative pairs (shown in red). The order of the context-gloss pairs within each training instance is randomized during the dataset construction step.

As illustrated in Figure 1, the model takes each context-gloss pair as input and calculates the corresponding relevance score. A softmax layer then aggregates the relevance scores from the same group and computes the training loss using cross entropy as the loss function. Formally, the gloss selection objective is given as follows:

\[
\mathcal{L} = -\sum_{i=1}^{m} \sum_{j=1}^{n_i} \mathbb{1}(y_i, j) \log p_{ij}
\]

where $m$ is the batch size, $n_i$ is the number of candidate glosses for the $i$-th training instance, $\mathbb{1}(y_i, j)$ is the binary indicator of whether index $j$ equals the index of the positive context-gloss pair $y_i$, and $p_{ij}$ is the softmax value for the $j$-th candidate sense of the $i$-th training instance, computed using the following equation:

\[
p_{ij} = \frac{\exp\left(\mathrm{Rel}(context_i, gloss_{ij})\right)}{\sum_{k=1}^{n_i} \exp\left(\mathrm{Rel}(context_i, gloss_{ik})\right)}
\]

where $\mathrm{Rel}(context, gloss)$ denotes the relevance score of a context-gloss pair from the output layer. A similar formulation was presented for web document ranking (Huang et al., 2013) and question-answering natural language inference (Liu et al., 2019). In the case of WSD, we are only interested in the top-1 context-gloss pair. Hence, during testing, we select the context-gloss pair with the highest relevance score and its corresponding sense as the most probable sense for the target word.
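The objective above is a standard cross-entropy over grouped relevance scores; a minimal PyTorch sketch (function names are ours, and groups are assumed padded to a common number of candidates):

```python
import torch
import torch.nn.functional as F

def gloss_selection_loss(relevance_scores, gold_indices):
    """Cross-entropy over grouped relevance scores.

    relevance_scores: (m, n) tensor -- m training instances, each with n
                      candidate context-gloss pairs (padding would need masking).
    gold_indices:     (m,) tensor -- index of the positive pair in each group.
    """
    # softmax over the candidates of each instance + negative log-likelihood
    return F.cross_entropy(relevance_scores, gold_indices)

def predict_sense_index(relevance_scores):
    """At test time, pick the candidate with the highest relevance score."""
    return torch.argmax(relevance_scores, dim=-1)
```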

Data Augmentation using Example Sentences
Most synsets in WordNet 3.0 include one or more short sentences illustrating the usage of the synset members (i.e. synonyms). We introduce a relatively straightforward data augmentation technique that combines the example sentences with positive/negative glosses into additional context-gloss pairs. First, example sentences (context) are extracted from each synset and target words are identified via keyword matching and annotated with two [TGT] tokens. Then, context-gloss pairs are constructed by combining the annotated contexts with positive and negative glosses. Using this technique, we were able to obtain 37,596 additional training instances (about 17% more training instances).
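A simplified sketch of this augmentation step using NLTK's WordNet interface (requires nltk.download('wordnet')); the matching and pairing logic here is illustrative and may differ from the paper's exact procedure:

```python
from nltk.corpus import wordnet as wn

def example_sentence_pairs(lemma, pos=None):
    """Build additional context-gloss pairs from WordNet example sentences,
    using simple keyword matching to locate and tag the target word."""
    synsets = wn.synsets(lemma, pos=pos)
    pairs = []
    for correct in synsets:
        for example in correct.examples():
            if lemma not in example.split():
                continue  # target word not found verbatim, skip this example
            context = example.replace(lemma, f"[TGT] {lemma} [TGT]", 1)
            for candidate in synsets:  # one pair per candidate gloss
                pairs.append({
                    "context": context,
                    "gloss": candidate.definition(),
                    "is_correct": candidate == correct,
                })
    return pairs
```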

Experiments
In this section, we introduce the datasets and experiment settings used to fine-tune BERT. We also present the evaluation results of each model and compare them against existing WSD systems.
Following Huang et al. (2019) and others, we choose SemEval-07 as the development set for tuning hyperparameters.

Experiment Settings
We experiment with both uncased BERT-base and BERT-large models. BERT-base consists of 110M parameters with 12 Transformer layers, 768 hidden units and 12 self-attention heads, while BERT-large consists of 340M parameters with 24 Transformer layers, 1024 hidden units and 16 self-attention heads. We use the implementation from the transformers package (Wolf et al., 2019). In total, we train four models under two setups: (1) BERT-base/large (baseline), using only the baseline dataset; (2) BERT-base/large (augmented), using the concatenation of the baseline and augmented datasets. For fine-tuning, we set the initial learning rate to 2e-5 with a batch size of 128 over 4 training epochs. The remaining hyperparameters are kept at the default values specified in the transformers package.
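A schematic training loop under the reported hyperparameters; train_loader is a hypothetical data loader yielding padded groups of context-gloss pairs, and BertWsdRanker and gloss_selection_loss refer to the sketches above:

```python
from torch.optim import AdamW

# Hyperparameters reported above; data loading details are omitted.
LEARNING_RATE = 2e-5
NUM_EPOCHS = 4

model = BertWsdRanker("bert-base-uncased")  # or the whole-word-masking large variant
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

for epoch in range(NUM_EPOCHS):
    for batch in train_loader:  # hypothetical loader of grouped context-gloss pairs
        scores = model(batch["input_ids"],
                       batch["attention_mask"],
                       batch["token_type_ids"])          # (num_groups * num_candidates,)
        scores = scores.view(len(batch["labels"]), -1)   # regroup to (num_groups, num_candidates)
        loss = gloss_selection_loss(scores, batch["labels"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```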

Evaluation Results
We evaluate the performance of each model and report the F1-scores in Table 1, along with the results from other WSD systems.
All four of our models trained with the proposed gloss selection objective show substantial improvement over the non-ensemble systems across all benchmark datasets, which signifies the effectiveness of this task formulation. Adding the augmented training set further improves performance, particularly in the noun category. It is worth noting that Du et al. (2019) and Huang et al. (2019) reported slightly worse or identical results when fine-tuning BERT-large, whereas both of our models fine-tuned on BERT-large obtain considerably better results than their BERT-base counterparts. This may be partially attributed to the fact that we use the recently released whole-word masking variant of BERT-large, which was shown to have better performance on the Multi-Genre Natural Language Inference (MultiNLI) benchmark. Although the BERT-large (augmented) model has a lower F1-score on the development dataset, it outperforms the ensemble system consisting of eight independent BERT-large models on three test datasets and achieves the best F1-score on the concatenation of all datasets.
To show that the improvement in WSD performance comes from the gloss selection objective rather than hyperparameter settings, we fine-tune a BERT-base model on the unaugmented training set using the same hyperparameter settings as GlossBERT (Huang et al., 2019), i.e. a learning rate of 2e-5, a batch size of 64, and 4 context-gloss pairs for each target word. As shown in Table 2, our model fine-tuned with the proposed gloss selection objective consistently outperforms GlossBERT across all benchmark datasets under the same hyperparameter settings.

Conclusion
We proposed the gloss selection objective for supervised WSD, which formulates WSD as a relevance ranking task over context-gloss pairs. Our models fine-tuned with this objective outperform other non-ensemble systems on five English all-words benchmark datasets. Furthermore, we demonstrated how to generate additional training data from existing WordNet example sentences without external annotations, which provides an extra performance boost and enables our single-model system to surpass the state-of-the-art ensemble system by a considerable margin on a number of benchmark datasets.