GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge

Word Sense Disambiguation (WSD) aims to find the exact sense of an ambiguous word in a particular context. Traditional supervised methods rarely take into consideration lexical resources like WordNet, which are widely utilized in knowledge-based methods. Recent studies have shown the effectiveness of incorporating gloss (sense definition) into neural networks for WSD. However, compared with traditional word expert supervised methods, they have not achieved much improvement. In this paper, we focus on how to better leverage gloss knowledge in a supervised neural WSD system. We construct context-gloss pairs and propose three BERT-based models for WSD. We fine-tune the pre-trained BERT model and achieve new state-of-the-art results on the WSD task.


Introduction
Word Sense Disambiguation (WSD) is a fundamental task and long-standing challenge in Natural Language Processing (NLP), which aims to find the exact sense of an ambiguous word in a particular context (Navigli, 2009). Previous WSD approaches can be grouped into two main categories: knowledge-based and supervised methods.
Knowledge-based WSD methods rely on lexical resources like WordNet (Miller, 1995) and usually exploit two kinds of lexical knowledge. The gloss, which defines the meaning of a word sense, was first utilized in the Lesk algorithm (Lesk, 1986) and then widely taken into account in many other approaches (Banerjee and Pedersen, 2002; Basile et al., 2014). In addition, the structural properties of semantic graphs are mainly used in graph-based algorithms (Agirre et al., 2014; Moro et al., 2014).
Traditional supervised WSD methods (Zhong and Ng, 2010; Shen et al., 2013; Iacobacci et al., 2016) focus on extracting manually designed features and then training a dedicated classifier (word expert) for every target lemma.
Although word expert supervised WSD methods perform better, they are less flexible than knowledge-based methods in the all-words WSD task (Raganato et al., 2017a). Recent neural-based methods are devoted to dealing with this problem. Kågebäck and Salomonsson (2016) present a supervised classifier based on Bi-LSTM, which shares parameters among all word types except the last layer. Raganato et al. (2017a) convert the WSD task into a sequence labeling task, thus building a unified model for all polysemous words. However, neither of them fully surpasses the best word expert supervised methods.
More recently, Luo et al. (2018b) propose to leverage the gloss information from WordNet and model the semantic relationship between the context and gloss in an improved memory network. Similarly, Luo et al. (2018a) introduce a (hierarchical) co-attention mechanism to generate co-dependent representations for the context and gloss. Their attempts show that incorporating gloss knowledge into supervised WSD approaches is helpful, but the improvement remains limited, possibly because they do not make full use of gloss knowledge.
In this paper, we focus on how to better leverage gloss information in a supervised neural WSD system. Recently, pre-trained language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have proven effective at alleviating the effort of feature engineering. In particular, BERT has achieved excellent results in question answering (QA) and natural language inference (NLI). We construct context-gloss pairs from the glosses of all possible senses (in WordNet) of the target word, thus treating the WSD task as a sentence-pair classification problem. We fine-tune the pre-trained BERT model and achieve new state-of-the-art results on the WSD task.

[Table 1: a sentence with four targets, "Your research stopped when a convenient assertion could be made.", and the context-gloss pairs of the target word [research] with their labels.]

In particular, our contribution is two-fold:
1. We construct context-gloss pairs and propose three BERT-based models for WSD.
2. We fine-tune the pre-trained BERT model, and the experimental results on several English all-words WSD benchmark datasets show that our approach significantly outperforms the state-of-the-art systems.

Methodology
In this section, we describe our method in detail.

Task Definition
In WSD, a sentence $s$ usually consists of a series of words $\{w_1, \cdots, w_m\}$, and some of the words $\{w_{i_1}, \cdots, w_{i_k}\}$ are targets $\{t_1, \cdots, t_k\}$ that need to be disambiguated. For each target $t$, its candidate senses $\{c_1, \cdots, c_n\}$ come from the entries of its lemma in a pre-defined sense inventory (usually WordNet). The WSD task therefore aims to find the most suitable entry (identified by a unique sense key) for each target in a sentence. See a sentence example in Table 1.
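As a minimal sketch of this definition, the snippet below represents one target together with its candidate sense keys. The `Target` class, the `collect_targets` helper, the toy `inventory` dictionary, and the sense keys in it are hypothetical names introduced for illustration; they are neither WordNet's API nor its real sense keys.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Target:
    position: int          # index of the target word in the sentence
    lemma: str             # lemma used to look up the sense inventory
    candidates: List[str]  # candidate sense keys c_1 .. c_n


def collect_targets(
    target_lemmas: Dict[int, str],
    inventory: Dict[str, List[str]],
) -> List[Target]:
    """Attach the candidate sense keys of each target's lemma, in sentence order."""
    return [
        Target(position, lemma, list(inventory[lemma]))
        for position, lemma in sorted(target_lemmas.items())
    ]


# Toy inventory with placeholder sense keys (an assumption, not real WordNet keys).
inventory = {"research": ["research#1", "research#2", "research#3", "research#4"]}
words = "Your research stopped when a convenient assertion could be made .".split()
targets = collect_targets({1: "research"}, inventory)
print(words[targets[0].position], targets[0].candidates)
```

A WSD system then has to pick exactly one entry of `candidates` for every `Target` in the sentence.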

BERT
BERT (Devlin et al., 2018) is a new language representation model whose architecture is a multi-layer bidirectional Transformer encoder. The BERT model is pre-trained on a large corpus with two novel unsupervised prediction tasks, i.e., masked language modeling and next sentence prediction. When incorporating BERT into downstream tasks, the fine-tuning procedure is recommended. We fine-tune the pre-trained BERT model on the WSD task.
BERT(Token-CLS) Since every target in a sentence needs to be disambiguated to find its exact sense, the WSD task can be regarded as a token-level classification task. To apply BERT to the WSD task, we take the final hidden state of the token corresponding to the target word (if the target is split into more than one token, we average their hidden states) and add a classification layer for every target lemma, which is the same as the last layer of the Bi-LSTM model (Kågebäck and Salomonsson, 2016).
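The token-level readout described above can be sketched as follows. This is only an illustration under stated assumptions: it uses toy 2-dimensional vectors in place of BERT's final hidden states, models the per-lemma classification layer as a linear layer without a bias term, and the names `average_target_state` and `classify_with_lemma_head` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_target_state(
    hidden_states: Sequence[Vector],
    target_positions: Sequence[int],
) -> Vector:
    """Average the final hidden states of the sub-word tokens of one target word."""
    if not target_positions:
        raise ValueError("target_positions must not be empty")
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    for pos in target_positions:
        for d in range(dim):
            summed[d] += hidden_states[pos][d]
    return [value / len(target_positions) for value in summed]


def classify_with_lemma_head(target_vector: Vector, head: Dict[str, Vector]) -> str:
    """Score every sense of one lemma with a dot product and return the best sense."""
    scores = {
        sense: sum(w * x for w, x in zip(weights, target_vector))
        for sense, weights in head.items()
    }
    return max(scores, key=scores.get)


# Toy hidden states for three tokens; the target word spans tokens 1 and 2.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
target_vector = average_target_state(hidden_states, [1, 2])

# One classification head per lemma; the weights here are made-up numbers.
head_for_lemma = {"sense_a": [1.0, 0.0], "sense_b": [0.0, 1.0]}
print(target_vector, classify_with_lemma_head(target_vector, head_for_lemma))
```

Because each lemma has its own head, this baseline never sees the glosses of the candidate senses.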

GlossBERT
BERT can explicitly model the relationship between a pair of texts, which has been shown to be beneficial to many pair-wise natural language understanding tasks. In order to fully leverage gloss information, we propose GlossBERT, which constructs context-gloss pairs from all possible senses of the target word in WordNet, thus treating the WSD task as a sentence-pair classification problem.
We describe our construction method with an example (see Table 1). There are four targets in this sentence, and here we take the target word research as an example.

Context-Gloss Pairs
The sentence containing the target words is denoted as the context sentence. For each target word, we extract the glosses of all N possible senses (here N = 4) of the target word (research) in WordNet to obtain the gloss sentences. [CLS] and [SEP] marks are added to the context-gloss pairs to make them suitable as input to the BERT model. A similar idea is also used in aspect-based sentiment analysis (Sun et al., 2019).
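The construction above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the `build_context_gloss_pairs` helper is a hypothetical name, the sense keys are placeholders, the glosses are short illustrative definitions rather than text copied from WordNet, and the [CLS] and [SEP] marks are written as plain strings, whereas in practice a BERT tokenizer would usually insert them.

```python
from typing import List, Optional, Tuple


def build_context_gloss_pairs(
    context: str,
    glosses: List[Tuple[str, str]],
    gold_sense_key: Optional[str] = None,
) -> List[Tuple[str, str, Optional[str]]]:
    """Return one (sense_key, BERT input text, label) triple per candidate sense."""
    pairs = []
    for sense_key, gloss in glosses:
        text = f"[CLS] {context} [SEP] {gloss} [SEP]"
        if gold_sense_key is None:
            label = None  # test time: no gold label available
        else:
            label = "yes" if sense_key == gold_sense_key else "no"
        pairs.append((sense_key, text, label))
    return pairs


context = "Your research stopped when a convenient assertion could be made."
# Placeholder sense keys and illustrative glosses (assumptions for this sketch).
glosses = [
    ("research#1", "systematic investigation to establish facts"),
    ("research#2", "a search for knowledge"),
    ("research#3", "inquire into"),
    ("research#4", "attempt to find out in a systematic manner"),
]
pairs = build_context_gloss_pairs(context, glosses, gold_sense_key="research#1")
for sense_key, text, label in pairs:
    print(label, text)
```

With N = 4 candidate senses the target word yields four instances, of which exactly one carries the label yes.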

Context-Gloss Pairs with Weak Supervision
Based on the previous construction method, we add weak supervised signals to the context-gloss pairs (see the highlighted part in Table 1). The signal in the gloss sentence aims to point out the target word, and the signal in the context sentence aims to emphasize the target word, since a target word may occur more than once in the same sentence.
Therefore, each target word has N context-gloss pair training instances (label ∈ {yes, no}). When testing, we output the probability of label = yes for each context-gloss pair and choose the sense with the highest probability as the predicted label of the target word.
We experiment with three GlossBERT models:
GlossBERT(Token-CLS) We use context-gloss pairs as input. We highlight the target word by taking the final hidden state of the token corresponding to the target word (if the target is split into more than one token, we average their hidden states) and add a classification layer (label ∈ {yes, no}).
GlossBERT(Sent-CLS) We use context-gloss pairs as input. We take the final hidden state of the first token [CLS] as the representation of the whole sequence and add a classification layer (label ∈ {yes, no}), which does not highlight the target word.
GlossBERT(Sent-CLS-WS) We use context-gloss pairs with weak supervision as input. We take the final hidden state of the first token [CLS] and add a classification layer (label ∈ {yes, no}), which weakly highlights the target word through the weak supervision.
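The weak supervision and the test-time decision rule can be sketched as follows. The exact form of the highlighted signals is given in Table 1, which is not reproduced here, so this sketch assumes that the target word is wrapped in quotation marks in the context sentence and prepended, followed by a colon, to the gloss sentence. The helper names `add_weak_supervision` and `predict_sense`, the sense keys, and the probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def add_weak_supervision(
    context_tokens: List[str],
    target_index: int,
    target_word: str,
    gloss: str,
) -> Tuple[str, str]:
    """Mark the target occurrence in the context and name the target in the gloss."""
    marked = list(context_tokens)  # copy, so the caller's tokens stay unchanged
    # Assumed signal 1: quotation marks around this particular occurrence.
    marked[target_index] = f'"{marked[target_index]}"'
    # Assumed signal 2: the target word as a prefix of the gloss sentence.
    return " ".join(marked), f"{target_word}: {gloss}"


def predict_sense(yes_probabilities: Dict[str, float]) -> str:
    """Choose the sense whose context-gloss pair has the highest P(label = yes)."""
    return max(yes_probabilities, key=yes_probabilities.get)


tokens = "Your research stopped".split()
context, gloss = add_weak_supervision(tokens, 1, "research", "a search for knowledge")
print(context)
print(gloss)

# Made-up model outputs for the four candidate senses of one target word.
print(predict_sense({"research#1": 0.7, "research#2": 0.2, "research#3": 0.05, "research#4": 0.05}))
```

Marking a position rather than a word form is what lets the signal distinguish two occurrences of the same word in one sentence.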

Datasets
The statistics of the WSD datasets are shown in Table 2.
WordNet Since Raganato et al. (2017b) map all the sense annotations in these datasets from their original versions to WordNet 3.0, we extract word sense glosses from WordNet 3.0.

Settings
We use the pre-trained uncased BERT_BASE model for fine-tuning, because we find that the BERT_LARGE model performs slightly worse than BERT_BASE on this task. The number of Transformer blocks is 12, the hidden size is 768, the number of self-attention heads is 12, and the total number of parameters of the pre-trained model is 110M. When fine-tuning, we use the development set (SE07) to find the optimal settings for our experiments. We keep the dropout probability at 0.1 and set the number of epochs to 4. The initial learning rate is 2e-5, and the batch size is 64.

Results
Table 3 shows the performance of our method on the English all-words WSD benchmark datasets. We compare our approach with previous methods.
The first block shows the MFS baseline, which selects the most frequent sense in the training corpus for each target word.
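For reference, an MFS baseline of this kind can be sketched with a simple counter. This is a generic illustration rather than the implementation behind the reported numbers; the `train_mfs` helper, the toy annotations, and the sense keys are hypothetical, and ties are broken by first occurrence in the training data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def train_mfs(annotations: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every lemma to the sense key it is most often annotated with."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for lemma, sense_key in annotations:
        counts[lemma][sense_key] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


# Toy training annotations as (lemma, sense key) pairs; keys are placeholders.
training_annotations = [
    ("bank", "bank#1"),
    ("bank", "bank#2"),
    ("bank", "bank#1"),
    ("research", "research#2"),
]
mfs = train_mfs(training_annotations)
print(mfs["bank"], mfs["research"])
```

The baseline ignores the context entirely, so every occurrence of a lemma receives the same prediction.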
The second block shows two knowledge-based systems. Lesk_ext+emb (Basile et al., 2014) is a variant of the Lesk algorithm (Lesk, 1986) that calculates the gloss-context overlap of the target word. Babelfy (Moro et al., 2014) is a unified graph-based approach which exploits the semantic network structure from BabelNet.
[Table 3 caption fragment: ... Raganato et al. (2017b), and others from the corresponding papers.]
The third block shows two traditional word expert supervised systems. IMS (Zhong and Ng, 2010) is a flexible framework which trains SVM classifiers and uses local features. IMS_+emb (Iacobacci et al., 2016) is the best configuration of the IMS framework, which also integrates word embeddings as features.
The fourth block shows several recent neural-based methods. Bi-LSTM (Kågebäck and Salomonsson, 2016) is a baseline for neural models. Bi-LSTM_+att.+LEX+POS (Raganato et al., 2017a) is a multi-task learning framework for WSD, POS tagging, and LEX with self-attention mechanism, which converts WSD to a sequence learning task. GAS_ext (Luo et al., 2018b) is a variant of GAS, a gloss-augmented memory network, which extends the gloss knowledge. CAN^s and HCAN (Luo et al., 2018a) are sentence-level and hierarchical co-attention neural network models which leverage gloss knowledge.
In the last block, we report the performance of our method. BERT(Token-CLS) is our baseline, which does not incorporate gloss information, and it performs slightly worse than previous traditional supervised methods and recent neural-based methods. This shows that using BERT directly does not by itself improve performance. The other three methods outperform other models by a substantial margin, which indicates that the improvements come from leveraging BERT to better exploit gloss information. It is worth noting that our method achieves significant improvements over previous methods on SE07 and Verb, which have the highest ambiguity level among all datasets and all POS tags respectively, according to Raganato et al. (2017b).
Moreover, GlossBERT(Token-CLS) performs better than GlossBERT(Sent-CLS), which indicates that highlighting the target word in the sentence is important. However, the weakly highlighting method GlossBERT(Sent-CLS-WS) performs best in most circumstances, which may result from its combination of the advantages of the other two methods.

Discussion
There are two main reasons for the large improvements in our experimental results. First, we construct context-gloss pairs and convert the WSD problem into a sentence-pair classification task, which is similar to NLI tasks, and we train only one classifier, which is equivalent to expanding the corpus. Second, we leverage BERT (Devlin et al., 2018) to better exploit the gloss information. The BERT model shows its advantage in dealing with sentence-pair classification tasks through its strong improvements on QA and NLI tasks. This advantage comes from both of its novel unsupervised prediction tasks.
Compared with traditional word expert supervised methods, our GlossBERT alleviates the effort of feature engineering and does not require training a dedicated classifier for every target lemma. With these results, the neural network method can be said to outperform the traditional word expert method. Compared with recent neural-based methods, our solution is more intuitive and can make better use of gloss knowledge. Besides, our approach demonstrates that when we fine-tune BERT on a downstream task, converting it into a sentence-pair classification task may be a good choice.

Conclusion
In this paper, we seek to better leverage gloss knowledge in a supervised neural WSD system. We propose a new solution to WSD by constructing context-gloss pairs and then converting WSD to a sentence-pair classification task. We fine-tune the pre-trained BERT model and achieve new state-of-the-art results on the WSD task.