Leveraging Gloss Knowledge in Neural Word Sense Disambiguation by Hierarchical Co-Attention

The goal of Word Sense Disambiguation (WSD) is to identify the correct meaning of a word in a particular context. Traditional supervised methods only use labeled data (context) and miss rich lexical knowledge such as the gloss, which defines the meaning of a word sense. Recent studies have shown that incorporating glosses into neural networks for WSD yields significant improvements. However, previous models usually build the context representation and the gloss representation separately. In this paper, we find that learning the context and gloss representations can benefit from each other: the gloss can help to highlight the important words in the context, thus building a better context representation, and the context can help to locate the key words in the gloss of the correct word sense. Therefore, we introduce a co-attention mechanism to generate co-dependent representations of the context and gloss. Furthermore, in order to capture both word-level and sentence-level information, we extend the attention mechanism in a hierarchical fashion. Experimental results show that our model achieves state-of-the-art results on several standard English all-words WSD test datasets.


Introduction
Word Sense Disambiguation (WSD) is a crucial and long-standing problem in Natural Language Processing (NLP). Previous research mainly exploits two kinds of resources. Knowledge-based methods (Lesk, 1986) exploit lexical knowledge such as glosses to infer the correct senses of ambiguous words in context. In contrast, supervised feature-based methods and neural-based methods (Kågebäck and Salomonsson, 2016) usually use labeled data to train one or more classifiers.

Context: As they often play football together, they know each other quite well.
Glosses:
g1: participate in games or sports
g2: perform music on an instrument
g3: behave in a certain way

Table 1: An example of the context and three glosses of different senses of the target word "play". The words "games/sports" in gloss g1 can help to highlight the important word "football" in the context and to ignore the words "know each other", which are useless for distinguishing the sense of "play". Meanwhile, the context can help to stress the words "games/sports" in gloss g1, which is in fact the correct sense for the target word.
Although both lexical knowledge (especially glosses) and labeled data are of great help for WSD, previous supervised methods rarely take the integration of knowledge into consideration. To the best of our knowledge,  are the first to directly incorporate the gloss knowledge from WordNet into a unified neural network for WSD. This model builds the context representation and the gloss representation separately as distributed vectors and later calculates their similarity in a memory network. However, we find that learning the representations of the context and gloss can contribute to each other. We use an example to illustrate this idea. Table 1 shows that words such as "football" in the context and "games/sports" in the gloss are more important than words such as "know each other" when distinguishing the sense of the target word. In other words, we should pay more attention to the words that "overlap" between the context and the gloss when generating the representations of the context and gloss. Therefore, we introduce a co-attention mechanism to model the mutual influence between the representations of context and gloss.
Moreover, we find that both word-level and sentence-level information are crucial to WSD. As shown in Table 1, the local word "football" is crucial for distinguishing the sense of the word "play". However, in more complex sentences such as "Investors played it carefully for maximum advantage", where "play" means behave in a certain way, sentence-level information is necessary. Therefore, we extend the co-attention model in a hierarchical fashion to capture both word-level and sentence-level semantic information.
The main contributions are listed as follows.
• We propose a novel way to integrate gloss knowledge into a neural network for WSD via a co-attention mechanism in order to build better representations of context and gloss. In this way, our model can benefit from both labeled data and lexical knowledge.
• We further extend the attention mechanism into a hierarchical architecture, since both word-level and sentence-level information are crucial to disambiguating the word sense.
• We conduct a series of experiments, which show that our models outperform state-of-the-art systems on several standard English all-words WSD test datasets.

Related work
Lexical knowledge is a fundamental component of Word Sense Disambiguation and provides rich resources that are essential for associating senses with words (Navigli, 2009). Unsupervised knowledge-based methods have shown the effectiveness of textual knowledge such as glosses (Lesk, 1986) and the structural knowledge of lexical databases (Agirre et al., 2014). The prime shortcoming of knowledge-based methods is that they perform worse than supervised methods; however, they have wider coverage of polysemous words, thanks to the use of large-scale knowledge resources (Navigli, 2009). Studies on other tasks, such as Chinese Word Segmentation (Zhang et al., 2018), Language Modeling (Ahn et al., 2016), and knowledge-augmented LSTMs (Xu et al., 2016; Yang and Mitchell, 2017), show that integrating knowledge with labeled data in a unified system can achieve better performance than methods that only learn from large-scale labeled data. Therefore, it is a promising yet challenging direction to integrate labeled data and lexical knowledge into a unified system.
A few recent studies of WSD have exploited several ways to incorporate lexical resources into supervised systems. Traditional feature-based methods (Chen et al., 2015; Rothe and Schütze, 2015) usually utilize knowledge (e.g., to train word sense embeddings) as features of a classifier such as a support vector machine (SVM). Among neural-based methods, one line of work regards the lexical resource LEX, extracted from WordNet, as an auxiliary classification task and proposes a multi-task learning framework for WSD and LEX. Another integrates the context and the glosses of the target word into a unified framework via a memory network: it encodes the context and glosses separately, and then models the semantic relationship between the context vector and the gloss vectors in a memory module. Furthermore, much more gloss knowledge can be utilized via semantic relations in WordNet such as hypernymy and hyponymy. All of the studies listed above show that integrating lexical resources, especially glosses, into supervised WSD systems can significantly improve performance. Therefore, we follow this direction and seek a new way of better integrating gloss knowledge.
Instead of building representations for the context and gloss separately, we use the inner connection between them to promote each other's representation. This interaction can be modeled by a co-attention mechanism, which has made great progress in question answering (Xiong et al., 2016; Seo et al., 2016; Hao et al., 2017; Lu et al., 2016). Enlightened by this iterative procedure, we introduce it into WSD and adapt the output of the original co-attention model to score each word sense.

The Co-Attention Model for WSD
In this section, we first give an overview of CAN, the co-attention neural network for WSD (Figure 1), and then extend it into a hierarchical architecture, HCAN (Figure 2).

Overview
The overall architecture of the proposed non-hierarchical co-attention model is shown in Figure 1. It consists of three parts:
• Input Embedding Layer: First of all, we encode the input context and each gloss into distributed representations C and G, which are also called embeddings in this paper. In Figure 1, if C and G are word embeddings, we call the model CAN_w; if C and G are sentence embeddings, we call the model CAN_s.
• Co-Attention Layer: Then, each co-attention mechanism in this layer generates a context vector and a gloss vector from the corresponding gloss and context representations. The outputs of the co-attention layer are N pairs of context and gloss vectors.
• Output Layer: Finally, the output layer takes the N pairs of context and gloss vectors as inputs and calculates the score of each word sense.

Figure 1 shows the non-hierarchical co-attention model, which generates either word-level representations (CAN_w) or sentence-level representations (CAN_s). Since both word-level and sentence-level representations can help to disambiguate the word sense, we extend CAN into a hierarchical model, named HCAN (Figure 2). The extensions to each layer are as follows:
1. The input embedding layer is extended to two sub-layers, which encode both word-level and sentence-level representations.
2. The co-attention layer is also extended to two attention layers, which capture attention at the two different levels.

Input Embedding Layer
We denote each input sentence (context or gloss) as a sequence of words x = [x_1, x_2, ..., x_{T_x}], where T_x is the length of the input sentence.

Word Embedding
After looking up a pre-trained word embedding matrix E_w ∈ R^{d_w×V}, we transform each one-hot vector x_i into a d_w-dimensional vector e_i. We treat [e_1, e_2, ..., e_{T_x}] as the word-level representation of the sentence. Specifically, the context's word-level representations are denoted [e^c_1, e^c_2, ..., e^c_n] and the i-th gloss's word-level representations are denoted [e^{g_i}_1, e^{g_i}_2, ..., e^{g_i}_m], where n and m are the maximum lengths of the context and the gloss, respectively.
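The lookup step can be illustrated as follows; the vocabulary size, dimensions, and the random embedding matrix are toy stand-ins for the pre-trained E_w.

```python
import numpy as np

# Toy sizes: vocabulary V and embedding dimension d_w are made up for the example.
V, d_w = 5, 4
rng = np.random.default_rng(0)
E_w = rng.standard_normal((d_w, V))   # stand-in for the pre-trained matrix E_w ∈ R^{d_w x V}

def embed(word_id: int) -> np.ndarray:
    """Lookup = multiplying E_w by a one-hot vector, i.e. selecting a column."""
    one_hot = np.zeros(V)
    one_hot[word_id] = 1.0
    return E_w @ one_hot              # equals E_w[:, word_id]

# Word-level representation of a 3-word sentence: columns [e_1, e_2, e_3].
sentence_ids = [2, 0, 3]
C_w = np.stack([embed(i) for i in sentence_ids], axis=1)  # shape (d_w, n)
assert np.allclose(C_w[:, 0], E_w[:, 2])
```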

Sentence Embedding
We utilize a bi-directional long short-term memory network (Bi-LSTM) to generate the hidden states [h_1, h_2, ..., h_{T_x}] of the input sentence, which serve as its sentence-level representations.

Co-Attention Mechanism
The right part of Figure 2 illustrates the co-attention mechanism, which is the most crucial part of the model. The inputs are context representations C ∈ R^{d×n} and gloss representations G ∈ R^{d×m}, where d is the dimension of the input representation vectors. The outputs are the gloss-aware context vector c ∈ R^d and the context-aware gloss vector g ∈ R^d. Therefore, we can define the co-attention mechanism as a function

(c, g) = CoAt(C, G)    (1)

Next, we give the detailed definition of the co-attention function CoAt. We first compute a similarity matrix A ∈ R^{n×m}, in which each element A_ij indicates the similarity between the i-th context word and the j-th gloss word:

A = Cᵀ U G    (2)

where U ∈ R^{d×d} is a trainable parameter. Based on the similarity matrix A, we can compute the gloss-to-context attention matrix A^c and the context-to-gloss attention matrix A^g.
Gloss-to-Context Attention. Since each gloss word may focus on different context words, we can generate a context representation that is aware of a particular gloss word. Note that the j-th column of A gives the similarity between the j-th gloss word and each context word. Thus, we obtain the attention weight of each context word through a softmax over each column of A:

A^c_{:j} = softmax(A_{:j})    (3)

where A_{:j} denotes the j-th column of A and A^c_{:j} denotes the j-th column of A^c ∈ R^{n×m}.
Hence we obtain the gloss-aware context representations Ĉ as the product of the initial context representations C and the attention weight matrix A^c:

Ĉ = C A^c    (4)

Note that the j-th column of Ĉ is the context representation according to the j-th gloss word. Therefore, we obtain the final context vector c by summing across the columns of Ĉ:

c = Σ_{j=1}^{m} Ĉ_{:j}    (5)

Context-to-Gloss Attention. Conversely, each context word may focus on different gloss words, so we can generate a gloss representation that is aware of a particular context word. Since the i-th row of A gives the similarity between the i-th context word and each gloss word, we obtain the attention weight of each gloss word through a softmax over each row of A (i.e., over each column of Aᵀ):

A^g_{:j} = softmax(B_{:j}),  B = Aᵀ    (6)

where B_{:j} denotes the j-th column of B (i.e., the j-th row of A) and A^g_{:j} denotes the j-th column of A^g ∈ R^{m×n}. Now we obtain the context-aware gloss representations in the same way as Equation 4:

Ĝ = G A^g    (7)

Note that the j-th column of Ĝ is the gloss representation according to the j-th context word. Therefore, we obtain the final gloss vector g by summing across the columns of Ĝ:

g = Σ_{j=1}^{n} Ĝ_{:j}    (8)
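A minimal NumPy sketch of the CoAt function under the definitions above; the dimensions and the random inputs are illustrative only.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(C, G, U):
    """C: context (d, n); G: gloss (d, m); U: trainable parameter (d, d).
    Returns the gloss-aware context vector c and context-aware gloss vector g."""
    A = C.T @ U @ G               # similarity matrix, shape (n, m)
    A_c = softmax(A, axis=0)      # gloss-to-context weights, one softmax per column
    C_hat = C @ A_c               # gloss-aware context representations, (d, m)
    c = C_hat.sum(axis=1)         # final context vector, (d,)
    A_g = softmax(A.T, axis=0)    # context-to-gloss weights, (m, n)
    G_hat = G @ A_g               # context-aware gloss representations, (d, n)
    g = G_hat.sum(axis=1)         # final gloss vector, (d,)
    return c, g

rng = np.random.default_rng(0)
d, n, m = 8, 6, 4
c, g = co_attention(rng.standard_normal((d, n)),
                    rng.standard_normal((d, m)),
                    rng.standard_normal((d, d)))
assert c.shape == (d,) and g.shape == (d,)
```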

Word-Level Co-Attention Layer
Since there are N glosses corresponding to the N candidate word senses, we use N independent co-attention mechanisms in both the word-level and sentence-level co-attention layers, and each layer shares the same parameter U in Equation 2. For the i-th word-level co-attention mechanism, the inputs are the word embeddings of the context and of the i-th gloss (Section 3.2.1). Let C_w = [e^c_1, e^c_2, ..., e^c_n] and G^w_i = [e^{g_i}_1, e^{g_i}_2, ..., e^{g_i}_m]; the outputs of the i-th word-level co-attention mechanism are

(c^w_i, g^w_i) = CoAt(C_w, G^w_i)    (9)

Inspired by the well-known Lesk algorithm (Lesk, 1986) and its variants, the score of the i-th word sense is computed as the dot product of the context vector c^w_i and the gloss vector g^w_i:

β^w_i = (c^w_i)ᵀ g^w_i    (10)

The word-level context embedding vector ĉ^w is computed as the average of the N gloss-aware context vectors c^w_i:

ĉ^w = (1/N) Σ_{i=1}^{N} c^w_i    (11)
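The scoring step of the word-level layer can be sketched as follows; random vectors stand in for the N co-attention outputs, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 8, 3                         # toy embedding size and number of candidate senses
c_w = rng.standard_normal((N, d))   # stand-ins for the gloss-aware context vectors c^w_i
g_w = rng.standard_normal((N, d))   # stand-ins for the context-aware gloss vectors g^w_i

# Lesk-style score of sense i: dot product of its context and gloss vectors.
beta_w = np.einsum('id,id->i', c_w, g_w)   # shape (N,)

# Word-level context embedding: average of the N gloss-aware context vectors.
c_hat_w = c_w.mean(axis=0)                 # shape (d,)
assert beta_w.shape == (N,) and c_hat_w.shape == (d,)
```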

Sentence-Level Co-Attention Layer
As in the word-level co-attention layer, the inputs of the i-th sentence-level co-attention mechanism are the Bi-LSTM hidden states of the context and of the i-th gloss (Section 3.2.2). Let C_s = [h^c_1, h^c_2, ..., h^c_n] and G^s_i = [h^{g_i}_1, h^{g_i}_2, ..., h^{g_i}_m]; the outputs of the i-th sentence-level co-attention mechanism are

(c^s_i, g^s_i) = CoAt(C_s, G^s_i)    (12)

Like Equation 10, we calculate a sentence-level score for the i-th word sense as the dot product of the context vector c^s_i and the gloss vector g^s_i:

β^s_i = (c^s_i)ᵀ g^s_i    (13)

The sentence-level context embedding vector ĉ^s is likewise computed as the average of the N gloss-aware context vectors c^s_i:

ĉ^s = (1/N) Σ_{i=1}^{N} c^s_i    (14)

Output Layer
The output layer calculates the scores of the N senses of the target word x_t and outputs a sense probability distribution over them. The final score of each sense is a weighted sum of two values, μ and ν: μ is the similarity score between gloss and context, which reflects the influence of knowledge, and ν is generated from the context vector through a linear projection layer, which reflects the influence of the labeled data. Finally, the probability distribution ŷ over all senses of the target word is computed as

ŷ = softmax(λ_{x_t} μ + (1 − λ_{x_t}) ν)    (15)

where λ_{x_t} ∈ [0, 1] is a parameter for word x_t.

For the non-hierarchical model CAN in Figure 1, μ and ν are generated from the outputs of a single co-attention layer. Specifically, for the word-level co-attention model CAN_w:

μ_i = β^w_i    (16)
ν = W_{x_t} ĉ^w + b_{x_t}    (17)

For the sentence-level co-attention model CAN_s:

μ_i = β^s_i    (18)
ν = W_{x_t} ĉ^s + b_{x_t}    (19)

For the hierarchical co-attention model HCAN in Figure 2, the outputs of the word-level and sentence-level layers are merged to generate the final results. The final similarity score between the i-th gloss and the context is computed as a weighted sum of the word-level score β^w_i and the sentence-level score β^s_i:

μ_i = α β^w_i + (1 − α) β^s_i    (20)

where α ∈ [0, 1] balances the two levels. Meanwhile, the final context embedding vector is generated by combining the two levels' context embedding vectors ĉ^w and ĉ^s. In order to transfer from the word-level encoding space to the sentence-level encoding space, we introduce a non-linear projection layer on top of the word-level context vector ĉ^w. Therefore, the final context embedding vector ĉ is generated by

ĉ = ĉ^s + σ(W_m ĉ^w + b_m)    (21)

where σ is a non-linear activation. In total, for the hierarchical co-attention model HCAN:

ν = W_{x_t} ĉ + b_{x_t}    (22)

It is noteworthy that in Equations 17, 19 and 22, each ambiguous word x_t has its own weight matrix W_{x_t} and bias b_{x_t}.
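A NumPy sketch of the output-layer computation for the hierarchical model. The tanh activation, the additive combination of the sentence-level vector with the projected word-level vector, and the fixed mixing weights are illustrative assumptions; all tensors are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, N = 8, 3                                            # toy sizes
beta_w, beta_s = rng.standard_normal(N), rng.standard_normal(N)   # two levels' sense scores
c_hat_w, c_hat_s = rng.standard_normal(d), rng.standard_normal(d) # two levels' context vectors

alpha = 0.5
mu = alpha * beta_w + (1 - alpha) * beta_s             # merged similarity score (knowledge side)

# Non-linear projection from word-level to sentence-level space (tanh assumed),
# then additive combination into the final context embedding.
W_m, b_m = rng.standard_normal((d, d)), np.zeros(d)
c_hat = c_hat_s + np.tanh(W_m @ c_hat_w + b_m)

# Per-word linear layer (W_{x_t}, b_{x_t}) maps the context vector to N scores.
W_t, b_t = rng.standard_normal((N, d)), np.zeros(N)
nu = W_t @ c_hat + b_t                                 # data-driven score

lam = 0.5                                              # stand-in for the learned lambda_{x_t}
y_hat = softmax(lam * mu + (1 - lam) * nu)             # sense probability distribution
assert np.isclose(y_hat.sum(), 1.0)
```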
During training, all model parameters θ are jointly learned by minimizing the cross-entropy loss between ŷ and the true label y:

L(θ) = − Σ_{i=1}^{M} Σ_{j=1}^{N_i} y_{ij} log ŷ_{ij}    (23)

where M is the number of examples in the dataset, N_i is the number of candidate senses of the i-th example, and y_{ij} and ŷ_{ij} are the true and predicted probabilities that the i-th example takes the j-th sense.
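A NumPy sketch of this loss for a toy dataset of two examples with different numbers of candidate senses; the probabilities are made up.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Summed negative log-likelihood over all examples; each example may have
    a different number of candidate senses, so inputs are lists of arrays."""
    return -sum(float(np.sum(t * np.log(p + eps))) for t, p in zip(y_true, y_pred))

# Two examples: the first has 3 candidate senses, the second has 2.
y_true = [np.array([0., 1., 0.]), np.array([1., 0.])]    # one-hot gold senses
y_pred = [np.array([0.2, 0.7, 0.1]), np.array([0.9, 0.1])]
loss = cross_entropy(y_true, y_pred)
assert np.isclose(loss, -(np.log(0.7) + np.log(0.9)))
```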

Datasets
Validation and Evaluation Datasets: We evaluate our model on several English all-words WSD datasets. For a fair comparison, we use the benchmark datasets proposed by Raganato et al. (2017b), which include five standard all-words fine-grained WSD datasets from the Senseval and SemEval competitions:
• Senseval-2 (Edmonds and Cotton, 2001, SE2): It consists of 2,282 sense annotations, including nouns, verbs, adverbs and adjectives.
Knowledge Base: The original WordNet sense-inventory versions for SemCor 3.0, SE2, SE3, SE7, SE13 and SE15 are 1.4, 1.7, 1.7.1, 2.1, 3.0 and 3.0, respectively. Raganato et al. (2017b) map all the sense annotations in the training and test datasets to WordNet 3.0 via a semi-automatic method. Therefore, we choose WordNet 3.0 as the sense inventory for extracting the glosses.

Settings
We use the validation set (SE7) to find the optimal hyperparameters of our models: the word embedding size d_w, the LSTM hidden state size d_s, the optimizer, etc. However, since there are no adverbs or adjectives in SE7, we randomly sample some adverbs and adjectives from the training dataset into SE7 for validation. We use pre-trained word embeddings. The hidden state size d_s is 256. The mini-batch size is set to 32. The optimizer is Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001. To avoid over-fitting, we apply dropout regularization to the outputs of the LSTM with a drop rate of 0.5. Orthogonal initialization is used for the LSTM weights, and random uniform initialization in the range [-0.1, 0.1] for the others. Training runs for up to 50 epochs, with early stopping if the validation loss does not improve within the last 5 epochs.

Table 3: F1-score (%) for fine-grained English all-words WSD on the test sets. Bold font indicates the best systems. The * marks systems that do not use any lexical knowledge. The five blocks list the baseline, 2 knowledge-based systems, 2 supervised feature-based systems, 7 neural-based systems and our models, respectively.

Table 3 shows the results on the four test datasets and on different parts of speech. Note that all systems in Table 3 are trained on SemCor 3.0.
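The early-stopping rule from the settings (stop when the validation loss has not improved for 5 consecutive epochs, for at most 50 epochs) can be sketched as follows; the loss values are toy numbers.

```python
def train_epochs(val_losses, max_epochs=50, patience=5):
    """Return (epoch at which training stopped early or None, best val loss)."""
    best, since_best, stopped_at = float('inf'), 0, None
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0      # improvement: reset the patience counter
        else:
            since_best += 1                 # no improvement this epoch
        if since_best >= patience:
            stopped_at = epoch              # patience exhausted: stop early
            break
    return stopped_at, best

# Validation loss improves until epoch 2, then stalls for 5 epochs -> stop at epoch 7.
assert train_epochs([1.0, 0.8, 0.85, 0.84, 0.83, 0.82, 0.81]) == (7, 0.8)
```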

English all-words results
In the first block, we show the MFS baseline, which simply selects the most frequent sense in the training dataset.
In the second block, we show two recent knowledge-based (unsupervised) systems. Lesk_ext+emb is a variant of the well-known Lesk algorithm (Lesk, 1986) that computes the overlap between the gloss and the context as the score of a word sense. Babelfy is a graph-based system built on BabelNet (Navigli and Ponzetto, 2012). We can see that MFS is a strong baseline for knowledge-based systems.
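The gloss-context overlap idea behind Lesk can be sketched in a few lines; the context below is modified slightly from the Table 1 example so that a literal word overlap exists, and tokenization and stop-word handling are omitted.

```python
# Score each candidate sense by the word overlap between its gloss and the context.
def lesk_score(context: str, gloss: str) -> int:
    return len(set(context.lower().split()) & set(gloss.lower().split()))

context = "they play football and other sports together"
glosses = {
    "g1": "participate in games or sports",
    "g2": "perform music on an instrument",
    "g3": "behave in a certain way",
}
# Pick the sense whose gloss overlaps the context most ("sports" matches g1).
best = max(glosses, key=lambda s: lesk_score(context, glosses[s]))
assert best == "g1"
```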
In the third block, we show two traditional supervised systems which only learn from labeled data based on manual designed features. IMS ) is a flexible framework which trains K SVM classifiers for K polysemous words. Its variant IMS +emb  adds word embedding features into IMS. Both of them train a dedicated classifier for each word individually. In other words, each target word has its own parameters. Therefore, IMS +emb is a hard to beat system for many neural networks which also only uses labeled data but builds a unified system for all the polysemous words.
In the fourth block, we show the latest neural networks. Except for Bi-LSTM (Kågebäck and Salomonsson, 2016), which is a baseline for neural models, all of them utilize not only labeled data but also lexical knowledge. Bi-LSTM+att.+LEX and its variant Bi-LSTM+att.+LEX+POS are multi-task learning frameworks for WSD, POS tagging and LEX with a context self-attention mechanism. GAS is a gloss-augmented neural network in an improved memory network paradigm. The best previous neural network is GAS_ext, which extends GAS and uses more gloss knowledge via the semantic relations in WordNet.

In the last block, we give the performance of our proposed co-attention models for WSD. Our best model HCAN improves the state-of-the-art result by 0.5% on the concatenation of the four datasets. Even though we use less gloss knowledge than the previous best system GAS_ext, our co-attention models still obtain the best results on three test datasets. For the non-hierarchical models, CAN_s performs much better than CAN_w, which reveals that global sentence-level information is much more useful than local word-level information. Integrating the two levels of information (HCAN) further boosts the performance. Moreover, we find that our best model HCAN performs best on all parts of speech except adverbs. However, there are only 346 adverb examples, which account for 5% of the four test datasets, so a 1% drop on adverbs means only about 4 examples are wrongly classified, which has little influence on the overall score.

Ablation Study
In this part, we further discuss the impact of the components of our hierarchical model HCAN. To ablate the co-attention mechanism, we replace the co-attention function CoAt in Equation 1 with a function Avg, which simply sums the input representation vectors: the outputs are c = Σ_j C_{:j} and g = Σ_j G_{:j}.
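The Avg replacement can be sketched directly; the shapes follow the CoAt definition (context of shape (d, n), gloss of shape (d, m)) with toy random inputs.

```python
import numpy as np

def avg(C, G):
    """Ablation baseline: ignore the context-gloss interaction entirely and
    sum the input representation columns instead of attending over them."""
    return C.sum(axis=1), G.sum(axis=1)

rng = np.random.default_rng(3)
C, G = rng.standard_normal((8, 6)), rng.standard_normal((8, 4))
c, g = avg(C, G)
assert c.shape == (8,) and g.shape == (8,)
```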
We re-train HCAN with certain components ablated:
• No Attention: We replace the co-attention function CoAt with Avg in both the word-level and sentence-level co-attention layers. This is the baseline for comparison.
• W/O Word-level Attention: We replace the word-level co-attention function CoAt with Avg. Note that this ablated model differs from CAN_s, because the word-level representation vector ĉ^w is still used to calculate the final score.
• W/O Sentence-level Attention: We replace the sentence-level co-attention function CoAt with Avg. Note that this ablated model is not the same as CAN_w, because the sentence-level representation vector ĉ^s is also used to calculate the final score.
• W/O Context2Gloss Attention: We remove the attention used in generating the gloss vector, which means all elements of A^g are set to 1.
• W/O Gloss2Context Attention: We remove the attention used in generating the context vector, which means all elements of A^c are set to 1.

Table 4 indicates the effectiveness of the different components of the proposed model HCAN. It shows that without any attention mechanism, the overall score declines by 2.3%.
The ablated versions without word-level and sentence-level co-attention decline by 0.6% and 2.0%, respectively. This reveals that the sentence-level co-attention mechanism is much more important to HCAN, which is consistent with the scores of CAN_s and CAN_w. However, we find that both ablated versions score worse than CAN_s and CAN_w themselves. We hypothesize that this is because the context and gloss vectors generated by the level that does not use the attention mechanism may bring noise into the final scores.
Without the context-to-gloss attention, the score declines by 1.2% on the concatenation of the four test datasets; without the gloss-to-context attention, it declines by 0.4%. This is probably because the context-to-gloss attention, which generates the context-aware gloss vector, is more directly helpful for finding the correct word sense.
In conclusion, the results in Table 4 show that all components of the proposed hierarchical co-attention model HCAN contribute to boosting the performance of WSD.

Conclusions
In this paper, we investigate the problem of incorporating gloss knowledge into neural networks for Word Sense Disambiguation. We find that the gloss can highlight the important words in the context and thus contribute to the context representation, while the context can help to focus on the key words in the gloss of the right word sense. Therefore, we propose a co-attention mechanism to model the gloss-to-context and context-to-gloss attention. Furthermore, in order to capture not only local word-level features but also global sentence-level features, we extend the co-attention model into a hierarchical architecture. Experimental results show that our proposed models achieve state-of-the-art results on several standard English all-words WSD datasets.