Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations

Contextualized word representations are able to give different representations for the same word in different contexts, and they have been shown to be effective in downstream natural language processing tasks, such as question answering, named entity recognition, and sentiment analysis. However, evaluation on word sense disambiguation (WSD) in prior work shows that using contextualized word representations does not outperform the state-of-the-art approach that makes use of non-contextualized word embeddings. In this paper, we explore different strategies of integrating pre-trained contextualized word representations and our best strategy achieves accuracies exceeding the best prior published accuracies by significant margins on multiple benchmark WSD datasets.


Introduction
Word sense disambiguation (WSD) automatically assigns a pre-defined sense to a word in a text. Different senses of a word reflect different meanings a word has in different contexts. Identifying the correct word sense given a context is crucial in natural language processing (NLP). Unfortunately, while it is easy for a human to infer the correct sense of a word given a context, it is a challenge for NLP systems. As such, WSD is an important task and it has been shown that WSD helps downstream NLP tasks, such as machine translation (Chan et al., 2007a) and information retrieval (Zhong and Ng, 2012).
A WSD system assigns a sense to a word by taking into account its context, comprising the other words in the sentence. This can be done through discrete word features, which typically involve surrounding words and collocations, used to train a classifier (Lee et al., 2004; Ando, 2006; Chan et al., 2007b; Zhong and Ng, 2010). The classifier can also make use of continuous word representations of the surrounding words (Taghipour and Ng, 2015; Iacobacci et al., 2016). Neural WSD systems (Kågebäck and Salomonsson, 2016; Raganato et al., 2017b) feed the continuous word representations into a neural network that captures the whole sentence and the word representation in the sentence. However, in both approaches, the word representations are independent of the context.
Recently, pre-trained contextualized word representations (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019) have been shown to improve downstream NLP tasks. Pre-trained contextualized word representations are obtained through neural sentence encoders trained on a huge amount of raw text. When the resulting sentence encoder is fine-tuned on a downstream task, such as question answering, named entity recognition, or sentiment analysis, with much smaller annotated training data, the trained model, with the pre-trained sentence encoder component, has been shown to achieve new state-of-the-art results on those tasks.
While demonstrating superior performance in downstream NLP tasks, pre-trained contextualized word representations are still reported to give lower accuracy compared to approaches that use non-contextualized word representations (Melamud et al., 2016; Peters et al., 2018) when evaluated on WSD. This seems counter-intuitive, as a neural sentence encoder better captures the surrounding context that serves as an important cue to disambiguate words. In this paper, we explore different strategies of integrating pre-trained contextualized word representations for WSD. Our best strategy outperforms prior methods of incorporating pre-trained contextualized word representations and achieves new state-of-the-art accuracy on multiple benchmark WSD datasets.
The rest of this paper is organized as follows.
Section 2 presents related work. Section 3 describes our pre-trained contextualized word representation. Section 4 proposes different strategies to incorporate the contextualized word representation for WSD. Section 5 describes our experimental setup. Section 6 presents the experimental results. Section 7 discusses the findings from the experiments. Finally, Section 8 presents the conclusion.

Related Work
Continuous word representations in the form of real-valued vectors, commonly known as word embeddings, have been shown to help improve NLP performance. Initially, exploiting continuous representations was achieved by adding real-valued vectors as classification features (Turian et al., 2010). Taghipour and Ng (2015) fine-tuned non-contextualized word embeddings by a feedforward neural network such that those word embeddings were more suited for WSD. The fine-tuned embeddings were incorporated into an SVM classifier. Iacobacci et al. (2016) explored different strategies of incorporating word embeddings and found that their best strategy involved exponential decay that decreased the contribution of surrounding word features as their distances to the target word increased.
The neural sequence tagging approach has also been explored for WSD. Kågebäck and Salomonsson (2016) proposed bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) for WSD. They concatenated the hidden states of the forward and backward LSTMs and fed the concatenation into an affine transformation followed by softmax normalization, similar to the approach to incorporate a bidirectional LSTM adopted in sequence labeling tasks such as part-of-speech tagging and named entity recognition (Ma and Hovy, 2016). Raganato et al. (2017b) proposed a self-attention layer on top of the concatenated bidirectional LSTM hidden states for WSD and introduced multi-task learning with part-of-speech tagging and semantic labeling as auxiliary tasks. However, on average across the test sets, their approach did not outperform SVM with word embedding features. Subsequently, Luo et al. (2018) proposed the incorporation of glosses from WordNet in a bidirectional LSTM for WSD, and reported better results than both SVM and prior bidirectional LSTM models.
A neural language model (LM) is aimed at predicting a word given its surrounding context. As such, the resulting hidden representation vector captures the context of a word in a sentence. Melamud et al. (2016) designed context2vec, which is a one-layer bidirectional LSTM trained to maximize the similarity between the hidden state representation of the LSTM and the target word embedding. Peters et al. (2018) designed ELMo, which is a two-layer bidirectional LSTM language model trained to predict the next word in the forward LSTM and the previous word in the backward LSTM. In both models, WSD was evaluated by nearest neighbor matching between the test and training instance representations. However, despite training on a huge amount of raw text, the resulting accuracies were still lower than those achieved by WSD approaches with pre-trained non-contextualized word representations.
End-to-end neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) learns to generate an output sequence given an input sequence, using an encoder-decoder model. The encoder captures the contextualized representation of the words in the input sentence for the decoder to generate the output sentence. Following this intuition, McCann et al. (2017) trained an encoder-decoder model on parallel texts and obtained pre-trained contextualized word representations from the encoder.

Pre-Trained Contextualized Word Representation
The contextualized word representation that we use is BERT (Devlin et al., 2019), which is a bidirectional transformer encoder model (Vaswani et al., 2017) pre-trained on billions of words of text. The model is trained on two tasks, namely masked word prediction and next sentence prediction. In both tasks, prediction accuracy is determined by the ability of the model to understand the context. A transformer encoder computes the representation of each word through an attention mechanism with respect to the surrounding words. Given a sentence $x_1^n$ of length $n$, the transformer computes the representation of each word $x_i$ through a multi-head attention mechanism, where the query vector is from $x_i$ and the key-value vector pairs are from the surrounding words $x_{i'}$ ($1 \le i' \le n$). The word representation produced by the transformer captures the contextual information of a word.
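The attention mechanism can be viewed as mapping a query vector $q$ and a set of key-value vector pairs $(k, v)$ to an output vector. The attention function $A(\cdot)$ computes the output vector, which is the weighted sum of the value vectors, and is defined as:
$$A(q, K, V, \rho) = \sum_{(k, v) \in (K, V)} \alpha(q, k, \rho)\, v \tag{1}$$
$$\alpha(q, k, \rho) = \frac{\exp(\rho\, k^\top q)}{\sum_{k' \in K} \exp(\rho\, k'^\top q)} \tag{2}$$
where $K$ and $V$ are matrices containing the key vectors and the value vectors of the words in the sentence respectively, and $\alpha(q, k, \rho)$ is a scalar attention weight between $q$ and $k$, re-scaled by a scalar $\rho$.
Two building blocks for the transformer encoder are the multi-head attention mechanism and the position-wise feed-forward neural network (FFNN). The multi-head attention mechanism with $H$ heads leverages the attention function in Equation 1 as follows:
$$\mathrm{MH}(q, K, V) = W^O \Big[ \bigoplus_{\eta=1}^{H} \mathrm{head}_\eta(q, K, V) \Big] \tag{3}$$
$$\mathrm{head}_\eta(q, K, V) = A\big(W^Q_\eta q,\; W^K_\eta K,\; W^V_\eta V,\; 1/\sqrt{d_k}\big) \tag{4}$$
where $\oplus$ denotes concatenation of vectors, $W^O \in \mathbb{R}^{d_{model} \times H d_v}$, $W^Q_\eta, W^K_\eta \in \mathbb{R}^{d_k \times d_{model}}$, and $W^V_\eta \in \mathbb{R}^{d_v \times d_{model}}$. The input vector $q \in \mathbb{R}^{d_{model}}$ is the hidden vector for the ambiguous word, while the input matrices $K, V \in \mathbb{R}^{d_{model} \times n}$ are the concatenation of the hidden vectors of all words in the sentence. For each attention head, the dimension of both the query and key vectors is $d_k$, while the dimension of the value vector is $d_v$. The encoder model dimension is $d_{model}$.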
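The attention function described above can be illustrated with a small pure-Python sketch for a single head. It assumes that the attention weight is a softmax over re-scaled dot products between the query and each key, as in Vaswani et al. (2017); the function name `attention` and the toy vectors are illustrative and not taken from the paper.

```python
import math

def attention(q, keys, values, rho):
    """Return (output, weights): output is the weighted sum of the value vectors."""
    # Re-scaled dot-product score between the query and every key.
    scores = [rho * sum(k_j * q_j for k_j, q_j in zip(k, q)) for k in keys]
    # Softmax over the scores (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors, one output dimension at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return output, weights

# Toy example: the query is aligned with the first key,
# so the first value vector receives the larger weight.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention(query, keys, values, rho=1.0 / math.sqrt(len(query)))
```

A multi-head version would apply this function once per head to separately projected queries, keys, and values, and concatenate the per-head outputs.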
The position-wise FFNN performs a non-linear transformation on the attention output corresponding to each input word position as follows:
$$\mathrm{FF}(u) = W_2 \max(0,\; W_1 u + b_1) + b_2 \tag{5}$$
in which the input vector $u \in \mathbb{R}^{d_{model}}$ is transformed to the output vector with dimension $d_{model}$ via a series of linear projections with the ReLU activation function.
For the hidden layer $l$ ($1 \le l \le L$), the self-attention sub-layer output $f_i^l$ is computed as follows:
$$f_i^l = \mathrm{LayerNorm}\big(\mathrm{MH}_h^l(h_i^{l-1}, h_{1:n}^{l-1}, h_{1:n}^{l-1}) + h_i^{l-1}\big)$$
where LayerNorm refers to layer normalization (Ba et al., 2016) and the superscript $l$ and subscript $h$ indicate that each encoder layer $l$ has an independent set of multi-head attention weight parameters (see Equations 3 and 4). The input for the first layer is $h_i^0$, the input embedding of word $x_i$. The second sub-layer consists of the position-wise fully connected FFNN, computed as:
$$h_i^l = \mathrm{LayerNorm}\big(\mathrm{FF}^l(f_i^l) + f_i^l\big)$$
where, similar to self-attention, an independent set of weight parameters in Equation 5 is defined in each layer.
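As a rough illustration of the two sub-layers, the sketch below implements a position-wise FFNN with ReLU and a residual connection followed by layer normalization. It is a simplification under stated assumptions: the layer normalization omits the learned gain and bias, the residual form follows the standard transformer encoder, and all names (`layer_norm`, `ffnn`, `residual_sublayer`) and weights are illustrative.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((x_j - mean) ** 2 for x_j in x) / len(x)
    return [(x_j - mean) / math.sqrt(var + eps) for x_j in x]

def ffnn(u, W1, b1, W2, b2):
    """Two linear projections with a ReLU in between."""
    hidden = [max(0.0, a + b) for a, b in zip(matvec(W1, u), b1)]
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]

def residual_sublayer(x, sublayer_out):
    """Add the sub-layer output to its input, then apply layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# Toy weights: identity projections and zero biases.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
f = [1.0, 2.0, 3.0]  # stands in for a self-attention sub-layer output
h = residual_sublayer(f, ffnn(f, identity, zeros, identity, zeros))
```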

Incorporating Pre-Trained Contextualized Word Representation
As BERT is trained on the masked word prediction task, which is to predict a word given the surrounding (left and right) context, the pre-trained model already captures the context. In this section, we describe different techniques of leveraging BERT for WSD, broadly categorized into nearest neighbor matching and linear projection of hidden layers.

Nearest Neighbor Matching
A straightforward way to disambiguate word sense is through 1-nearest neighbor matching. We compute the contextualized representation of each word in the training data and the test data through BERT. Given a hidden representation $h_i^L$ at the $L$-th layer for word $x_i$ in the test data, nearest neighbor matching finds a vector $h^*$ in the $L$-th layer from the training data such that
$$h^* = \arg\max_{h_{i'}^L} \cos(h_i^L, h_{i'}^L) \tag{6}$$
where the sense assigned to $x_i$ is the sense of the word whose contextualized representation is $h^*$. This WSD technique is adopted in earlier work on contextualized word representations (Melamud et al., 2016; Peters et al., 2018).
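A minimal sketch of 1-nearest neighbor matching is given below. It assumes cosine similarity as the matching criterion and assumes that the training vectors for the same word have already been extracted; the function names and the sense labels are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbor_sense(test_vector, train_examples):
    """Return the sense of the training instance most similar to test_vector.

    train_examples: list of (vector, sense) pairs for the same word.
    """
    _, best_sense = max(train_examples, key=lambda ex: cosine(test_vector, ex[0]))
    return best_sense

# Hypothetical 2-dimensional representations for two senses of "bank".
train_examples = [([1.0, 0.0], "bank%finance"), ([0.0, 1.0], "bank%river")]
predicted = nearest_neighbor_sense([0.9, 0.2], train_examples)
```

No parameters are trained in this setting; the labeled data is only used as a lookup table of sense-tagged vectors.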

Linear Projection of Hidden Layers
Apart from nearest neighbor matching, we can perform a linear projection of the hidden vector $h_i$ by an affine transformation into an output sense vector, with its dimension equal to the number of senses for word $x_i$. The output of this affine transformation is normalized by softmax such that all its values sum to 1. Therefore, the predicted sense $s_i$ of word $x_i$ is formulated as
$$s_i = \mathrm{softmax}\big(W^{\mathrm{lexelt}(x_i)} h_i + b^{\mathrm{lexelt}(x_i)}\big) \tag{7}$$
where $s_i$ is a vector of predicted sense distribution for word $x_i$, while $W^{\mathrm{lexelt}(x_i)}$ and $b^{\mathrm{lexelt}(x_i)}$ are respectively the projection matrix and bias vector specific to the lexical element (lexelt) of word $x_i$, which consists of its lemma and optionally its part-of-speech tag. We choose the sense corresponding to the element of $s_i$ with the maximum value.
Training the linear projection model is done by the back-propagation algorithm, which updates the model parameters to minimize a cost function. Our cost function is the negative log-likelihood of the softmax output value that corresponds to the tagged sense in the training data. In addition, we propose two novel ways of incorporating BERT's hidden representation vectors to compute the sense output vector, which are described in the following sub-subsections.
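The projection and its training objective can be sketched as follows. This is an illustration only: it uses one plain gradient step rather than the Adam optimizer used in the experiments, a hypothetical lexelt `bank.n` with three senses, and a 2-dimensional stand-in for the BERT hidden vector.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sense_distribution(h, W, b):
    """softmax(W h + b): one probability per sense of the lexelt."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

def nll(h, W, b, gold):
    """Negative log-likelihood of the tagged (gold) sense."""
    return -math.log(sense_distribution(h, W, b)[gold])

def sgd_step(h, W, b, gold, lr):
    """One plain gradient step on the NLL (an assumption; the paper uses Adam)."""
    probs = sense_distribution(h, W, b)
    for j in range(len(b)):
        grad_logit = probs[j] - (1.0 if j == gold else 0.0)
        b[j] -= lr * grad_logit
        for k in range(len(h)):
            W[j][k] -= lr * grad_logit * h[k]

# Hypothetical lexelt-specific parameters: three senses, two hidden dimensions.
params = {"bank.n": {"W": [[0.0, 0.0] for _ in range(3)], "b": [0.0, 0.0, 0.0]}}
h_i = [1.0, 0.5]
p = params["bank.n"]
loss_before = nll(h_i, p["W"], p["b"], gold=0)
sgd_step(h_i, p["W"], p["b"], gold=0, lr=0.5)
loss_after = nll(h_i, p["W"], p["b"], gold=0)
dist = sense_distribution(h_i, p["W"], p["b"])
predicted = max(range(len(dist)), key=lambda j: dist[j])
```

Keeping a separate weight matrix and bias per lexelt means the output dimension can differ from word to word, matching the number of senses of each word.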

Last Layer Projection
Similar to the nearest neighbor matching model, we can feed the hidden vector of BERT in the last layer, $h_i^L$, into an affine transformation followed by softmax. That is, $h_i$ in Equation 7 is instantiated by $h_i^L$. The last layer projection model is illustrated in Figure 1(a).

Weighted Sum of Hidden Layers
BERT consists of multiple layers stacked one after another. Each layer carries a different representation of word context. Taking into account different hidden layers may help the WSD system to learn from different context information encoded in different layers of BERT.
To take into account all layers, we compute the weighted sum of all hidden layers, $h_i^l$, where $1 \le l \le L$, corresponding to a word position $i$, through an attention mechanism. That is, $h_i$ in Equation 7 is replaced by the following equation:
$$h_i = A(m, h_i^{1:L}, h_i^{1:L}, 1)$$
where $m \in \mathbb{R}^{d_{model}}$ is a projection vector to obtain scalar values with the key vectors. The model with weighted sum of all hidden layers is illustrated in Figure 1(b).
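The layer weighting can be sketched as attention over the layers of a single word position. The sketch assumes that each layer's weight is a softmax over the dot product between the projection vector and that layer's hidden vector, with no extra re-scaling; the function name and the toy vectors are illustrative.

```python
import math

def layer_weighted_sum(layer_vectors, m):
    """Combine per-layer hidden vectors of one word into a single vector.

    layer_vectors: list of L hidden vectors, one per layer.
    m: projection vector that turns each hidden vector into a scalar score.
    """
    scores = [sum(m_j * h_j for m_j, h_j in zip(m, h)) for h in layer_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    combined = [sum(w * h[j] for w, h in zip(weights, layer_vectors))
                for j in range(len(m))]
    return combined, weights

layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# With a zero projection vector every layer gets the same weight (a plain average).
average, uniform_weights = layer_weighted_sum(layers, [0.0, 0.0])
# A non-zero projection vector favors layers with a large first component.
combined, weights = layer_weighted_sum(layers, [5.0, 0.0])
```

In training, the projection vector would be learned jointly with the sense classifier, so the model can decide which layers matter for disambiguation.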

Gated Linear Unit
While the contextualized representations in the BERT hidden layer vectors are features that determine the word sense, some features are more useful than the others. As such, we propose filtering the vector values by a gating vector whose values range from 0 to 1. This mechanism is known as the gated linear unit (GLU) (Dauphin et al., 2017), which is formulated as
$$\mathrm{GLU}(h) = (W^h h + b^h) \odot \sigma(W^g h + b^g)$$
where $W^h$ and $W^g$ are separate projection matrices and $b^h$ and $b^g$ are separate bias vectors. The symbols $\sigma(\cdot)$ and $\odot$ denote the sigmoid function and the element-wise vector multiplication operation respectively. GLU transforms the input vector $h$ by feeding it to two separate affine transformations. The second transformation, passed through the sigmoid, acts as a gate that is multiplied element-wise with the output of the first affine transformation, thereby filtering its values.
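A small sketch of the gating is given below, assuming the standard GLU form of Dauphin et al. (2017), in which the output of one affine transformation is multiplied element-wise by the sigmoid of a second one. The function names and the toy weights are illustrative.

```python
import math

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows), vector x, and bias b."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + bias for row, bias in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glu(h, W_h, b_h, W_g, b_g):
    """Gate the first affine transformation by the sigmoid of the second."""
    linear = affine(W_h, h, b_h)
    gate = [sigmoid(z) for z in affine(W_g, h, b_g)]
    return [a * g for a, g in zip(linear, gate)]

identity = [[1.0, 0.0], [0.0, 1.0]]
no_weights = [[0.0, 0.0], [0.0, 0.0]]
h = [2.0, -4.0]
# A zero gate pre-activation gives a gate of 0.5 in every dimension.
half_open = glu(h, identity, [0.0, 0.0], no_weights, [0.0, 0.0])
# A large positive or negative gate bias keeps or suppresses a dimension.
selective = glu(h, identity, [0.0, 0.0], no_weights, [50.0, -50.0])
```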

Experimental Setup
We conduct experiments on various WSD tasks. The description and the statistics for each task are given in Table 1. For English, a lexical element (lexelt) is defined as a combination of lemma and part-of-speech tag, while for Chinese, it is simply the lemma, following the OntoNotes setup.
We exploit English BERT BASE for the English tasks and Chinese BERT for the Chinese task. We conduct experiments with different strategies of incorporating BERT as described in Section 4, namely 1-nearest neighbor matching (1-nn) and linear projection. In the latter technique, we explore strategies including simple last layer projection, layer weighting (LW), and gated linear unit (GLU).
In the linear projection model, we train the model by the Adam algorithm (Kingma and Ba, 2015) with a learning rate of $10^{-3}$. The model parameters are updated per mini-batch of 16 sentences. As training progresses, we pick the best model parameters from the series of neural network updates based on accuracy on a held-out development set, disjoint from the training set.
The state-of-the-art supervised WSD approach takes into account features from the neighboring sentences, typically one sentence to the left and one to the right apart from the current sentence that contains the ambiguous words. We also exploit this in our model, as BERT supports inputs with multiple sentences separated by a special [SEP] symbol.
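The construction of such a multi-sentence input can be sketched as follows. The sketch works on whitespace tokens rather than BERT's WordPiece units and assumes the usual BERT convention of a leading `[CLS]` and a `[SEP]` after each sentence; the exact arrangement used in the experiments is not specified here, and `build_input` is a hypothetical helper.

```python
def build_input(left, current, right, target_index):
    """Join tokenized sentences with [SEP] and track the target word.

    left, current, right: lists of tokens (left/right may be empty).
    target_index: position of the ambiguous word inside `current`.
    Returns the joined token list and the target position inside it.
    """
    tokens = ["[CLS]"]
    if left:
        tokens += left + ["[SEP]"]
    offset = len(tokens)  # index at which the current sentence starts
    tokens += current + ["[SEP]"]
    if right:
        tokens += right + ["[SEP]"]
    return tokens, offset + target_index

left = "He needed cash .".split()
current = "He went to the bank .".split()
right = "It was closed .".split()
tokens, position = build_input(left, current, right, target_index=4)
```

The returned position identifies which hidden vector to pass to the sense classifier once the whole sequence has been encoded.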
For English all-words WSD, we train our WSD model on SemCor (Miller et al., 1994), and test it on Senseval-2 (SE2), Senseval-3 (SE3), SemEval 2013 task 12 (SE13), and SemEval 2015 task 13 (SE15). This common benchmark, which has been annotated with WordNet-3.0 senses (Raganato et al., 2017a), has recently been adopted in English all-words WSD. Following Raganato et al. (2017b), we choose SemEval 2007 Task 17 (SE07) as our development data to pick the best model parameters after a number of neural network updates, for models that require back-propagation training.
We also evaluate on Senseval-2 and Senseval-3 English lexical sample tasks, which come with pre-defined training and test data. For each word type, we pick 20% of the training instances to be our development set and keep the remaining 80% as the actual training data. Through this development set, we determine the number of epochs to use in training. We then re-train the model with the whole training dataset using the number of epochs identified in the initial training step.
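The lexical sample protocol can be sketched in two helper functions. The sketch assumes a random 80/20 split with a fixed seed (how the 20% is selected is not stated) and assumes that development accuracy is recorded after every epoch; the function names and the accuracy values are illustrative only.

```python
import random

def split_train_dev(instances, dev_fraction=0.2, seed=0):
    """Split the training instances of one word type into (train, dev)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

def best_epoch(dev_accuracy_per_epoch):
    """Return the 1-based epoch with the highest development accuracy."""
    best = max(range(len(dev_accuracy_per_epoch)),
               key=lambda e: dev_accuracy_per_epoch[e])
    return best + 1

# Hypothetical instance identifiers for one word type.
instances = ["bank.n.%d" % i for i in range(10)]
train, dev = split_train_dev(instances)
# Illustrative development accuracies after epochs 1, 2, and 3.
num_epochs = best_epoch([0.61, 0.70, 0.68])
```

The model would then be re-trained on all instances (train plus dev) for `num_epochs` epochs before evaluation on the test data.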
While WSD is predominantly evaluated on English, we also evaluate our approach on Chinese to assess its effectiveness in a different language. We use OntoNotes Release 5.0, which contains a number of annotations including word senses for Chinese. We follow the data setup of Pradhan et al. (2013) and conduct an evaluation on four genres, i.e., broadcast conversation (BC), broadcast news (BN), magazine (MZ), and newswire (NW), as well as the concatenation of all genres. While the training and development datasets are divided into genres, we train on the concatenation of all genres and test on each individual genre.
[Table caption, continued] ... (1sent), we also show BERT representation of one sentence plus one surrounding sentence to the left and one to the right (1sent+1sur). The best result in each dataset is shown in bold. Statistical significance tests by bootstrap resampling (*: p < 0.05) compare 1nn (1sent+1sur) with each of Simple (1sent+1sur), LW (1sent+1sur), GLU (1sent+1sur), and GLU+LW (1sent+1sur).
For Chinese WSD evaluation, we train IMS (Zhong and Ng, 2010) on the Chinese OntoNotes dataset as our baseline. We also incorporate pre-trained non-contextualized Chinese word embeddings as IMS features (Taghipour and Ng, 2015; Iacobacci et al., 2016). The pre-trained word embeddings are obtained by training the word2vec skip-gram model on Chinese Gigaword Fifth Edition, which after automatic word segmentation contains approximately 2 billion words. Following Taghipour and Ng (2015), we incorporate the embedding features of words within a window surrounding the target ambiguous word. In our experiments, we take into account 5 words to the left and right.

Results
We present our experimental results and compare them with prior baselines.

English All-Words Tasks
For English all-words WSD, we compare our approach with three categories of prior approaches. Firstly, we compare our approach with the supervised SVM classifier approach, namely IMS (Zhong and Ng, 2010). We compare our approach with both the original IMS without word embedding features and IMS with non-contextualized word embedding features, that is, word2vec with exponential decay (Iacobacci et al., 2016). We also compare with SupWSD (Papandrea et al., 2017), which is an alternative implementation of IMS. Secondly, we compare our approach with the neural WSD approaches that leverage bidirectional LSTM (bi-LSTM). These include the bi-LSTM model with attention trained jointly with the lexical semantic labeling task (Raganato et al., 2017b) (BiLSTMatt+LEX) and the bi-LSTM model enhanced with gloss representation from WordNet (Luo et al., 2018) (GAS). Thirdly, we show a comparison with prior contextualized word representations for WSD, pre-trained on a large amount of text, namely context2vec (Melamud et al., 2016) and ELMo (Peters et al., 2018). In these two models, WSD is treated as nearest neighbor matching as described in Section 4.1.
Table 2 shows our WSD results in F1 measure. With the nearest neighbor matching model, BERT outperforms context2vec and ELMo. This shows the effectiveness of BERT's pre-trained contextualized word representation. When we include surrounding sentences, one to the left and one to the right, the F1 scores improve consistently.
We also show that linear projection to the sense output vector further improves WSD performance, by which our best results are achieved. While BERT has been shown to outperform other pre-trained contextualized word representations through the nearest neighbor matching experiments, its potential can be maximized through linear projection to the sense output vector. It is worthwhile to note that our more advanced linear projection, by means of layer weighting (§4.2.2) and gated linear unit (§4.2.3), gives the best F1 scores on all test sets.
All our BERT WSD systems outperform gloss-enhanced neural WSD, which has the best overall score among all prior systems.

English Lexical Sample Tasks
For English lexical sample tasks, we compare our approach with the original IMS (Zhong and Ng, 2010) and IMS with non-contextualized word embedding features. The embedding features incorporated into IMS include CW embeddings (Collobert et al., 2011), obtained from a convolutional language model, fine-tuned (adapted) to WSD (Taghipour and Ng, 2015) (+adapted CW), and word2vec skip-gram (Mikolov et al., 2013) with exponential decay (Iacobacci et al., 2016) (+w2v+expdecay). We also compare our approach with the bi-LSTM, on top of which sense classification is defined (Kågebäck and Salomonsson, 2016), and context2vec (Melamud et al., 2016), which is a contextualized pre-trained bi-LSTM model trained on 2B words of text. Finally, we also compare with a prior multi-task and semi-supervised WSD approach learned through alternating structure optimization (ASO) (Ando, 2006), which also utilizes unlabeled data for training.
As shown in Table 3, our BERT-based WSD approach with the linear projection model outperforms all prior approaches. context2vec, which is pre-trained on a large amount of text, performs worse than the prior semi-supervised ASO approach on Senseval-3, while our best result outperforms ASO by a large margin.
Neural bi-LSTM performs worse than IMS with non-contextualized word embedding features. Our neural model with pre-trained contextualized word representations outperforms the best result achieved by IMS on both Senseval-2 and Senseval-3.

Chinese OntoNotes WSD
We compare our approach with IMS without and with word embedding features as the baselines. The results are shown in Table 4.
Across all genres, BERT outperforms the baseline IMS with word embedding (non-contextualized word representation) features (Taghipour and Ng, 2015). The latter also improves over the original IMS without word embedding features (Zhong and Ng, 2010). Linear projection approaches consistently outperform nearest neighbor matching by a significant margin, similar to the results on English WSD tasks.
The best overall result for the Chinese OntoNotes test set is achieved by the models with simple projection and hidden layer weighting.
Table 4: Chinese OntoNotes WSD results in accuracy (%), averaged over three runs, for each genre. All BERT results in this table are obtained from the representation of one sentence plus one surrounding sentence to the left and to the right (1sent+1sur). We show results of various BERT incorporation strategies, namely nearest neighbor matching (1nn), simple projection, projection with layer weighting (LW) and gated linear unit (GLU). Best accuracy in each genre is shown in bold. Statistical significance tests by bootstrap resampling (*: p < 0.05) compare 1nn with each of Simple, LW, GLU, and GLU+LW.
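The table captions mention statistical significance tests by bootstrap resampling. One common variant, paired bootstrap resampling over test instances, can be sketched as below. This is an assumption about the general technique rather than a description of the exact procedure used for the reported results; the function name and the toy correctness vectors are hypothetical.

```python
import random

def paired_bootstrap(correct_a, correct_b, num_samples=1000, seed=0):
    """Estimate a p-value for 'system B is more accurate than system A'.

    correct_a, correct_b: per-instance 0/1 correctness on the same test set.
    Returns the fraction of resampled test sets on which B does not beat A.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same instances")
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(num_samples):
        # Draw a test set of the same size, with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        delta = sum(correct_b[i] - correct_a[i] for i in sample)
        if delta <= 0:
            not_better += 1
    return not_better / num_samples

# Toy correctness vectors for a nearest neighbor system and a projection system.
nn_correct = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 5
projection_correct = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1] * 5
p_value = paired_bootstrap(nn_correct, projection_correct)
```

Under this variant, a difference would be marked as significant at the 0.05 level when the returned value falls below 0.05.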

Discussion
Across all tasks (English all-words, English lexical sample, and Chinese OntoNotes), our experiments demonstrate the effectiveness of BERT over various prior WSD approaches. The best results are consistently obtained by linear projection models, which project the last hidden layer or the weighted sum of all hidden layers to an output sense vector.
We can view the BERT hidden layer outputs as contextual features, which serve as useful cues in determining the word senses. In fact, the attention mechanism in transformer captures the surrounding words. In prior work like IMS (Zhong and Ng, 2010), these contextual cues are captured by the manually-defined surrounding word and collocation features. The features obtained by the hidden vector output are shown to be more effective than the manually-defined features.
We introduced two advanced linear projection techniques, namely layer weighting and gated linear unit (GLU). While Peters et al. (2018) showed that the second biLSTM layer results in better WSD accuracy compared to the first layer (nearer to the individual word representation), we showed that taking into account different layers by means of the attention mechanism is useful for WSD. GLU as an activation function has been shown to be effective for better convergence and to overcome the vanishing gradient problem in the convolutional language model (Dauphin et al., 2017). In addition, the GLU gate vector, with values ranging from 0 to 1, can be seen as a filter for the features from the hidden layer vector.
Compared with two prior contextualized word representations models, context2vec (Melamud et al., 2016) and ELMo (Peters et al., 2018), BERT achieves higher accuracy. This shows the effectiveness of the attention mechanism used in the transformer model to represent the context.
Our BERT WSD models outperform prior neural WSD models by a large margin. These prior neural WSD models perform comparably with IMS with embeddings as classifier features, in addition to the discrete features. While neural WSD approaches (Kågebäck and Salomonsson, 2016; Raganato et al., 2017b; Luo et al., 2018) exploit non-contextualized word embeddings which are trained on large texts, the hidden layers are trained only on a small amount of labeled data.

Conclusion
For the WSD task, we have proposed novel strategies of incorporating BERT, a pre-trained contextualized word representation which effectively captures the context in its hidden vectors. Our experiments show that linear projection of the hidden vectors, coupled with gating to filter the values, gives better results than the prior state of the art. Compared to prior neural and feature-based WSD approaches that make use of non-contextualized word representations, using pre-trained contextualized word representation with our proposed incorporation strategy achieves significantly higher scores.