Incorporating Glosses into Neural Word Sense Disambiguation

Word Sense Disambiguation (WSD) aims to identify the correct meaning of a polysemous word in a particular context. Lexical resources like WordNet have proved to be of great help for WSD in knowledge-based methods. However, previous neural networks for WSD rely on massive labeled data (context) and ignore lexical resources like glosses (sense definitions). In this paper, we integrate the context and glosses of the target word into a unified framework in order to make full use of both labeled data and lexical knowledge. To this end, we propose GAS: a gloss-augmented WSD neural network which jointly encodes the context and glosses of the target word. GAS models the semantic relationship between the context and the gloss in an improved memory network framework, which breaks the barrier between previous supervised methods and knowledge-based methods. We further extend the original gloss of a word sense via its semantic relations in WordNet to enrich the gloss information. The experimental results show that our model outperforms the state-of-the-art systems on several English all-words WSD datasets.


Introduction
Word Sense Disambiguation (WSD) is a fundamental task and long-standing challenge in Natural Language Processing (NLP). There are several lines of research on WSD. Knowledge-based methods focus on exploiting lexical resources to infer the senses of a word in context. Supervised methods usually train multiple classifiers with manually designed features. Although supervised methods can achieve state-of-the-art performance (Raganato et al., 2017b,a), there are still two major challenges.
Firstly, supervised methods usually train a dedicated classifier for each word individually (often called a word expert). They therefore cannot easily scale up to the all-words WSD task, which requires disambiguating all the polysemous words in a text. Recent neural-based methods (Kågebäck and Salomonsson, 2016) solve this problem by building a unified model for all the polysemous words, but they still cannot beat the best word expert system.
Secondly, neural-based methods only consider the local context of the target word, ignoring lexical resources like WordNet (Miller, 1995), which are widely used in knowledge-based methods. The gloss, which extensionally defines a word sense meaning, plays a key role in the well-known Lesk algorithm (Lesk, 1986). Recent studies (Banerjee and Pedersen, 2002) have shown that enriching gloss information through its semantic relations can greatly improve the accuracy of the Lesk algorithm.
To this end, our goal is to incorporate the gloss information into a unified neural network for all of the polysemous words. We further consider extending the original gloss through its semantic relations in our framework. As shown in Figure 1, the glosses of hypernyms and hyponyms can enrich the original gloss information and help to build a better sense representation. Therefore, we integrate not only the original gloss but also the related glosses of hypernyms and hyponyms into the neural network.
Figure 1: The hypernym (green node) and hyponyms (blue nodes) for the 2nd sense bed 2 of bed, which means a plot of ground in which plants are growing, rather than the bed for sleeping in. The figure shows that bed 2 is a kind of plot 2, and bed 2 includes flowerbed 1, seedbed 1, etc.
In this paper, we propose a novel model GAS: a gloss-augmented WSD neural network which is a variant of the memory network (Sukhbaatar et al., 2015b; Kumar et al., 2016; Xiong et al., 2016). GAS jointly encodes the context and glosses of the target word and models the semantic relationship between the context and glosses in the memory module. In order to measure the inner relationship between glosses and context more accurately, we run multiple passes over the memory as a re-reading process and adopt two memory updating mechanisms.
The main contributions of this paper are listed as follows:
• To the best of our knowledge, our model is the first to incorporate glosses into an end-to-end neural WSD model. In this way, our model can benefit from not only massive labeled data but also rich lexical knowledge.
• In order to model the semantic relationship between context and glosses, we propose a gloss-augmented neural network (GAS) in an improved memory network paradigm.
• We further expand the gloss through its semantic relations to enrich the gloss information and better infer the context. We extend the gloss module in GAS to a hierarchical framework in order to mirror the hierarchies of word senses in WordNet.
• The experimental results on several English all-words WSD benchmark datasets show that our model outperforms the state-of-the-art systems.

Related Work
Knowledge-based, supervised and neural-based methods have already been applied to the WSD task (Navigli, 2009). Knowledge-based WSD methods mainly exploit two kinds of knowledge to disambiguate polysemous words:
1) The gloss, which defines a word sense meaning, is mainly used in the Lesk algorithm (Lesk, 1986) and its variants.
2) The structure of the semantic network, whose nodes are synsets and whose edges are semantic relations, is mainly used in graph-based algorithms (Agirre et al., 2014).
Supervised methods usually treat each target word as a separate classification problem (often called a word expert) and train classifiers based on manually designed features.
Although word expert supervised WSD methods perform best in terms of accuracy, they are less flexible than knowledge-based methods in the all-words WSD task. To deal with this problem, recent neural-based methods aim to build a unified classifier which shares parameters among all the polysemous words. Kågebäck and Salomonsson (2016) leverage a bidirectional long short-term memory network which shares model parameters among all the polysemous words. Another line of work casts the WSD problem as a neural sequence labeling task. However, none of the neural-based methods consistently beats the best word expert supervised methods on English all-words WSD datasets.
Moreover, previous supervised and neural-based methods rarely take lexical resources like WordNet (Fellbaum, 1998) into consideration. Recent studies on sense embeddings have shown that lexical resources are helpful. Chen et al. (2015) train word sense embeddings by learning sentence-level embeddings from glosses using a convolutional neural network. Rothe and Schütze (2015) extend word embeddings to sense embeddings by using the constraints and semantic relations in WordNet. They achieve an improvement of more than 1% in WSD performance when using sense embeddings as WSD features for an SVM classifier. This work shows that integrating structural information from lexical resources can help word expert supervised methods. However, sense embeddings can only help WSD indirectly (as SVM classifier features). It has also been shown that the coarse-grained semantic labels in WordNet can help WSD in a multi-task learning framework. As far as we know, no study directly integrates glosses or semantic relations of WordNet into an end-to-end model.
In this paper, we focus on how to integrate glosses into a unified neural WSD system. The memory network (Sukhbaatar et al., 2015b; Kumar et al., 2016; Xiong et al., 2016) was initially proposed to solve question answering problems. Recent research shows that memory networks obtain state-of-the-art results in many NLP tasks such as sentiment classification (Li et al., 2017) and analysis (Gui et al., 2017), poetry generation, spoken language understanding (Chen et al., 2016), etc. Inspired by the success of memory networks in these NLP tasks, we introduce them into WSD. We make some adaptations to the initial memory network in order to incorporate glosses and capture the inner relationship between the context and glosses.

Incorporating Glosses into Neural Word Sense Disambiguation
In this section, we first give an overview of the proposed model GAS: a gloss-augmented WSD neural network which integrates the context and the glosses of the target word into a unified framework. After that, each individual module is described in detail.

Architecture of GAS
The overall architecture of the proposed model is shown in Figure 2. It consists of four modules:
• Context Module: The context module encodes the local context (a sequence of surrounding words) of the target word into a distributed vector representation.
• Gloss Module: Like the context module, the gloss module encodes each gloss of the target word into a separate vector representation of the same size. In other words, we get $|s_t|$ word sense representations for the $|s_t|$ senses of the target word, where $|s_t|$ is the number of senses of the target word $w_t$.
• Memory Module: The memory module is employed to model the semantic relationship between the context embedding and the gloss embeddings produced by the context module and the gloss module respectively.
• Scoring Module: In order to benefit from both labeled contexts and gloss knowledge, the scoring module takes the context embedding from the context module and the last-step result from the memory module as input. Finally, it generates a probability distribution over all the possible senses of the target word.
Detailed architecture of the proposed model is shown in Figure 3. The next four sections will show detailed configurations in each module.

Context Module
The context module encodes the context of the target word into a vector representation, which is also called the context embedding in this paper.
We leverage a bidirectional long short-term memory network (Bi-LSTM) to take both the preceding and the following words of the target word into consideration.
The input of this module is a sequence of words $(x_1, x_2, \dots, x_{T_x})$, where $T_x$ is the length of the context. After applying a lookup operation over the pre-trained word embedding matrix $M \in \mathbb{R}^{D \times V}$, we transform each one-hot vector $x_i$ into a $D$-dimensional vector. Then, the forward LSTM reads the segment $(x_1, \dots, x_{t-1})$ on the left of the target word $x_t$ and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{t-1})$. The backward LSTM reads the segment $(x_{T_x}, \dots, x_{t+1})$ on the right of the target word $x_t$ and calculates a sequence of backward hidden states $(\overleftarrow{h}_{T_x}, \dots, \overleftarrow{h}_{t+1})$. The context vector $c$ is the concatenation of the last forward and backward hidden states:
$$c = [\overrightarrow{h}_{t-1} : \overleftarrow{h}_{t+1}]$$
where $:$ is the concatenation operator.

Figure 3: Detailed architecture of our proposed model, which consists of a context module, a gloss module, a memory module and a scoring module (panel labels: Original Gloss, Extended Glosses). The context module encodes the adjacent words surrounding the target word into a vector $c$. The gloss module encodes the original gloss or extended glosses into a vector $g_i$. In the memory module, we calculate the inner relationship (as attention) between context $c$ and each gloss $g_i$ and then update the memory as $m_i$ at pass $i$. In the scoring module, we make final predictions based on the last-pass attention of the memory module and the context vector $c$. Note that GAS only uses the original gloss, while GAS_ext uses the extended glosses obtained through hypernymy and hyponymy relations. In other words, the relation fusion layer (grey dotted box) only belongs to GAS_ext.

Gloss Module
The gloss module encodes each gloss of the target word into a fixed-size vector like the context vector $c$, which is also called the gloss embedding. We further enrich the gloss information by taking semantic relations and their associated glosses into consideration. This module contains a gloss reader layer and a relation fusion layer. The gloss reader layer generates a vector representation for a gloss. The relation fusion layer models the semantic relations of each gloss in the expanded glosses list, which consists of the glosses related to the original gloss. Our model GAS with extended glosses is denoted as GAS_ext. GAS only encodes the original gloss, while GAS_ext encodes the expanded glosses from hypernymy and hyponymy relations (details in Figure 3).

Gloss Reader Layer
The gloss reader layer contains two parts: gloss expansion and gloss encoder. Gloss expansion enriches the original gloss information through its hypernymy and hyponymy relations in WordNet. The gloss encoder encodes each gloss into a vector representation.
Gloss Expansion: We only expand the glosses of nouns and verbs via their corresponding hypernyms and hyponyms. There are two reasons: first, most polysemous words (about 80%) are nouns and verbs; second, the most frequent relations among word senses for nouns and verbs are the hypernymy and hyponymy relations.
The original gloss is denoted as $g_0$. A breadth-first search with a limited depth $K$ is employed to extract the related glosses. The glosses of hypernyms within depth $K$ are denoted as $[g_{-1}, g_{-2}, \dots, g_{-L_1}]$.
The glosses of hyponyms within depth $K$ are denoted as $[g_{+1}, g_{+2}, \dots, g_{+L_2}]$ (footnote 6). Note that $g_{+1}$ and $g_{-1}$ are the glosses of the nearest word senses.
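The gloss expansion step can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory sense graph (`TOY_SYNSETS`, with invented identifiers and, apart from the gloss of the plot-of-ground sense of bed, invented placeholder glosses) instead of a real WordNet interface, and `expand_glosses` is an illustrative name rather than the authors' implementation. The breadth-first traversal returns glosses nearest first, so the first element plays the role of $g_{-1}$ or $g_{+1}$.

```python
from collections import deque

# Hypothetical toy sense graph; identifiers and most glosses are invented for illustration.
TOY_SYNSETS = {
    "bed#2": {"gloss": "a plot of ground in which plants are growing",
              "hypernyms": ["plot#2"], "hyponyms": ["flowerbed#1", "seedbed#1"]},
    "plot#2": {"gloss": "(toy gloss) a small area of ground",
               "hypernyms": ["area#1"], "hyponyms": ["bed#2"]},
    "area#1": {"gloss": "(toy gloss) a region of land",
               "hypernyms": [], "hyponyms": ["plot#2"]},
    "flowerbed#1": {"gloss": "(toy gloss) a bed in which flowers are growing",
                    "hypernyms": ["bed#2"], "hyponyms": []},
    "seedbed#1": {"gloss": "(toy gloss) a bed where seedlings are grown",
                  "hypernyms": ["bed#2"], "hyponyms": []},
}


def expand_glosses(synset_id, relation, max_depth, synsets=TOY_SYNSETS):
    """Breadth-first search over `relation` ("hypernyms" or "hyponyms").

    Returns the glosses of all senses reachable within `max_depth` hops,
    ordered from the nearest sense to the farthest one.
    """
    glosses = []
    visited = {synset_id}
    queue = deque([(synset_id, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit K
        for neighbor in synsets[current][relation]:
            if neighbor in visited or neighbor not in synsets:
                continue
            visited.add(neighbor)
            glosses.append(synsets[neighbor]["gloss"])
            queue.append((neighbor, depth + 1))
    return glosses


# Expanded glosses list for the toy sense "bed#2" with depth limit K = 4.
hypernym_glosses = expand_glosses("bed#2", "hypernyms", max_depth=4)
hyponym_glosses = expand_glosses("bed#2", "hyponyms", max_depth=4)
```

With a real lexical resource, the same traversal would be driven by that resource's hypernym and hyponym lookups; only the graph access changes.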
Gloss Encoder: We denote the $j$-th gloss (footnote 7) in the expanded glosses list for the $i$-th sense of the target word as a sequence of $G$ words. Like the context encoder, the gloss encoder also leverages Bi-LSTM units to process the word sequence of the gloss. The gloss representation $g^i_j$ is computed as the concatenation of the last hidden states of the forward and backward LSTM.

Relation Fusion Layer
The relation fusion layer models the hypernymy and hyponymy relations of the target word sense.
A forward LSTM is employed to encode the hypernyms' glosses of the $i$-th sense, and a backward LSTM is employed to encode the hyponyms' glosses. In order to highlight the original gloss $g^i_0$, the enhanced $i$-th sense representation is the concatenation of the final states of the forward and backward LSTM.

Memory Module
The memory module has two inputs: the context vector $c$ from the context module and the gloss vectors $\{g_1, g_2, \dots, g_{|s_t|}\}$ from the gloss module, where $|s_t|$ is the number of word senses. We model the inner relationship between the context and glosses by attention calculation. Since one-pass attention calculation may not fully reflect the relationship between the context and glosses (details in Section 4.4.2), the memory module adopts a repeated deliberation process. The process re-reads the gloss vectors in the following passes, in order to highlight the correct word sense for the subsequent scoring module through a more accurate attention calculation. After each pass, we update the memory to refine the states of the current pass. Therefore, the memory module contains two phases: attention calculation and memory update.
Attention Calculation: For each pass $k$, the attention $e^k_i$ of gloss $g_i$ is generally computed as
$$e^k_i = f(g_i, m^{k-1}, c)$$
where $m^{k-1}$ is the memory vector in the $(k-1)$-th pass while $c$ is the context vector. The scoring function $f$ calculates the semantic relationship of the gloss and context, taking the vector set $(g_i, m^{k-1}, c)$ as input. In the first pass, the attention reflects the similarity of the context and each gloss. In later passes, the attention reflects the similarity of the adapted memory and each gloss. A dot product is applied to calculate the similarity of each gloss vector and the context (or memory) vector.
We treat $c$ as $m^0$. So, the attention $\alpha^k_i$ of gloss $g_i$ at pass $k$ is computed as a softmax over the dot products of $g_i$ and $m^{k-1}$:
$$e^k_i = g_i \cdot m^{k-1}, \qquad \alpha^k_i = \frac{\exp(e^k_i)}{\sum_{j=1}^{|s_t|} \exp(e^k_j)}$$
Memory Update: After calculating the attention, we store the memory state in $u^k$, which is a weighted sum of gloss vectors and is computed as
$$u^k = \sum_{i=1}^{|s_t|} \alpha^k_i g_i$$
where $u^k \in \mathbb{R}^{2n}$ and $n$ is the hidden size of the LSTM in the context module and gloss module. We then update the memory vector $m^k$ from the last-pass memory $m^{k-1}$, the context vector $c$, and the memory state $u^k$. We propose two memory update methods:
• Linear: we update the memory vector $m^k$ by a linear transformation from $m^{k-1}$
$$m^k = H m^{k-1} + u^k$$
where $H \in \mathbb{R}^{2n \times 2n}$.
• Concatenation: we get a new memory for the $k$-th pass by taking both the gloss embedding and the context embedding into consideration
$$m^k = \mathrm{ReLU}(W [m^{k-1} : u^k : c] + b)$$
where $:$ is the concatenation operator, $W \in \mathbb{R}^{2n \times 6n}$ and $b \in \mathbb{R}^{2n}$.
Footnote 6: Since one synset has one or more direct hypernyms and hyponyms, $L_1 \ge K$ and $L_2 \ge K$.
Footnote 7: Since GAS does not have gloss expansion, $j$ is always 0 and $g_i = g^i_0$. See more in Figure 3.
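The multi-pass attention and the two update strategies can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the weights `H`, `W` and `b` are supplied by the caller instead of being learned, the linear update is taken to add the memory state to the transformed memory, and a ReLU nonlinearity is assumed for the concatenation update. All vectors share one dimension (2n in the text).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_update(H):
    """Linear strategy (assumed form): new memory = H * old memory + memory state."""
    def update(memory, state, context):
        return [dot(row, memory) + u for row, u in zip(H, state)]
    return update


def concat_update(W, b):
    """Concatenation strategy: ReLU(W [memory : state : context] + b), ReLU assumed."""
    def update(memory, state, context):
        joined = memory + state + context  # list concatenation, length 3 * dim
        return [max(0.0, dot(row, joined) + bias) for row, bias in zip(W, b)]
    return update


def run_memory(context, glosses, num_passes, update):
    """Return (pre-softmax scores, attention) of the last pass; memory starts as the context."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    memory = list(context)
    for _ in range(num_passes):
        scores = [dot(g, memory) for g in glosses]   # similarity of each gloss to the memory
        attention = softmax(scores)                   # normalized attention over the glosses
        state = [sum(a * g[d] for a, g in zip(attention, glosses))
                 for d in range(len(memory))]         # attention-weighted sum of gloss vectors
        memory = update(memory, state, list(context))
    return scores, attention


# Toy usage with 2-dimensional vectors and an identity matrix for H.
context = [1.0, 0.0]
glosses = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
_, one_pass = run_memory(context, glosses, 1, linear_update(identity))
_, two_pass = run_memory(context, glosses, 2, linear_update(identity))
```

In this toy setting the second pass puts more attention on the first gloss than the first pass does, which mirrors the re-reading behaviour the memory module is meant to provide.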

Scoring Module
The scoring module calculates the scores for all the related senses $\{s^1_t, s^2_t, \dots, s^p_t\}$ corresponding to the target word $x_t$ and finally outputs a sense probability distribution over all senses.
The overall score for each word sense is determined by the gloss attention $\alpha^{T_M}_i$ from the last pass in the memory module, where $T_M$ is the number of passes in the memory module. The pre-softmax attention $e^{T_M}$ ($\alpha^{T_M}$ without Softmax) is regarded as the gloss score:
$$score_g = e^{T_M}$$
Meanwhile, a fully-connected layer is employed to calculate the context score:
$$score_c = W_{x_t} c + b_{x_t} \qquad (11)$$
where $W_{x_t} \in \mathbb{R}^{|s_t| \times 2n}$, $b_{x_t} \in \mathbb{R}^{|s_t|}$, $|s_t|$ is the number of senses for the target word $x_t$ and $n$ is the number of hidden units in the LSTM. It is noteworthy that in Equation 11, each ambiguous word $x_t$ has its own weight matrix $W_{x_t}$ and bias $b_{x_t}$ in the scoring module.
In order to balance the importance of background knowledge and labeled data, we introduce a parameter $\lambda \in \mathbb{R}^N$ in the scoring module which is jointly learned during the training process. The probability distribution $\hat{y}$ over all the word senses of the target word is calculated as:
$$\hat{y} = \mathrm{softmax}(\lambda_{x_t} score_c + (1 - \lambda_{x_t}) score_g)$$
where $\lambda_{x_t}$ is the parameter for word $x_t$, and $\lambda_{x_t} \in [0, 1]$.
During training, all model parameters are jointly learned by minimizing a standard cross-entropy loss between $\hat{y}$ and the true label $y$.
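The interpolation of the two scores and the training loss can be sketched as follows. The sketch is hedged: `sense_distribution` and `cross_entropy` are hypothetical names, the scores are passed in as plain lists, and weighting the context score by λ and the gloss score by 1 − λ is an assumption about which term receives which weight.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sense_distribution(gloss_scores, context_scores, lam):
    """Mix per-sense gloss and context scores with a word-specific weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(gloss_scores) != len(context_scores):
        raise ValueError("both score lists need one entry per sense")
    mixed = [lam * c + (1.0 - lam) * g
             for g, c in zip(gloss_scores, context_scores)]
    return softmax(mixed)


def cross_entropy(probabilities, gold_index):
    """Negative log-likelihood of the gold sense under the predicted distribution."""
    return -math.log(probabilities[gold_index])


# Toy usage: three senses, with the gloss score favouring the third sense.
gloss_scores = [0.1, 1.2, 2.0]
context_scores = [1.5, 0.3, 0.9]
prediction = sense_distribution(gloss_scores, context_scores, lam=0.5)
loss = cross_entropy(prediction, gold_index=2)
```

Setting `lam` to 1 reduces the prediction to the context score alone, while 0 relies only on the gloss score, which is the trade-off the learned parameter is meant to balance per word.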
Following previous work, we choose SE7, the smallest test set, as the development (validation) set, which consists of 455 labeled instances. The remaining four test sets consist of 6798 labeled instances with four types of target words, namely nouns, verbs, adverbs and adjectives. We extract word sense glosses from WordNet 3.0 because Raganato et al. (2017b) map all the sense annotations from their original versions to 3.0.
Training Dataset: We choose SemCor 3.0 as the training set, which was also used by Raganato et al. (2017b) and others. It consists of 226,036 sense annotations from 352 documents and is the largest manually annotated corpus for WSD. Note that all the systems listed in Table 1 are trained on SemCor 3.0.

Implementation Details
We use the validation set (SE7) to find the optimal settings of our framework: the hidden state size $n$, the number of passes $|T_M|$, the optimizer, etc. We use pre-trained word embeddings with 300 dimensions and keep them fixed during the training process. We employ 256 hidden units in both the gloss module and the context module, which means $n = 256$. Orthogonal initialization is used for the weights in the LSTM, and random uniform initialization with range [-0.1, 0.1] is used for the others. We set the gloss expansion depth $K$ to 4. We also experiment with the number of passes $|T_M|$ from 1 to 5 in our framework, finding that $|T_M| = 3$ performs best. We use the Adam optimizer (Kingma and Ba, 2014) in the training process with an initial learning rate of 0.001. In order to avoid overfitting, we use dropout regularization and set the drop rate to 0.5. Training runs for up to 100 epochs, with early stopping if the validation loss does not improve within the last 10 epochs.
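The stopping rule described above can be made concrete with a small sketch. It assumes that "does not improve" means no strictly lower validation loss than the best one seen so far; the function name and its arguments are hypothetical and only illustrate the rule, not the authors' training loop.

```python
def last_training_epoch(validation_losses, patience=10, max_epochs=100):
    """Return the 1-based epoch at which training would end.

    Training stops once `patience` consecutive epochs bring no new best
    validation loss, or when `max_epochs` (or the supplied history) runs out.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    history = validation_losses[:max_epochs]
    for epoch, loss in enumerate(history, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(history)


# Toy history: the loss improves for two epochs and then plateaus.
toy_losses = [1.0, 0.9] + [0.95] * 20
stop_epoch = last_training_epoch(toy_losses, patience=10, max_epochs=100)
```

In the toy history the best loss occurs at epoch 2, so ten epochs without improvement end training at epoch 12.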

Systems to be Compared
In this section, we describe several knowledge-based methods, supervised methods and neural-based methods which perform well on the English all-words WSD datasets for comparison.

Table 1: F1-score (%) for fine-grained English all-words WSD on the test sets. Bold font indicates the best systems. The * marks neural network models that use external knowledge. The five blocks list the MFS baseline, two knowledge-based systems, two supervised systems (feature-based), three neural-based systems and our models, respectively.

Knowledge-based Systems
• Lesk_ext+emb: Basile et al. (2014) propose a variant of the Lesk algorithm (Lesk, 1986) that uses a word similarity function defined on a distributional semantic space to calculate the gloss-context overlap. This work shows that glosses are important to WSD and that enriching gloss information via its semantic relations can help WSD.
• Babelfy: This system exploits the semantic network structure from BabelNet and builds a unified graph-based architecture for WSD and Entity Linking.

Supervised Systems
The supervised systems mentioned in this paper refer to traditional feature-based systems which train a dedicated classifier for every word individually (word expert).
• IMS: Zhong and Ng (2010) select a linear Support Vector Machine (SVM) as the classifier and make use of a set of features surrounding the target word within a limited window, such as POS tags, local words and local collocations.
• IMS+emb: This system takes IMS as the underlying framework and adds word embeddings as features, which makes it hard to beat on most WSD datasets.

Neural-based Systems
Neural-based systems aim to build an end-to-end unified neural network for all the polysemous words in texts.
• Bi-LSTM: Kågebäck and Salomonsson (2016) leverage a bidirectional LSTM network which shares model parameters among all words. Note that this model is equivalent to our model if we remove the gloss module and memory module of GAS.
• Bi-LSTM+att.+LEX and its variant Bi-LSTM+att.+LEX+POS: These models cast WSD as a sequence learning task and use a multi-task learning framework for WSD, POS tagging and coarse-grained semantic labels (LEX). Both models use external knowledge, since LEX is based on the lexicographer files in WordNet.
Moreover, we include the MFS baseline, which simply selects the most frequent sense in the training data set.
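The most frequent sense baseline is simple enough to state as code. The sketch below assumes the training data is available as (lemma, sense label) pairs; the function name and the toy sense labels are invented for illustration, and returning `None` for unseen lemmas is a choice made here, not something the evaluation setup prescribes.

```python
from collections import Counter, defaultdict


def train_mfs(annotated_instances):
    """Map each lemma to the sense label it carries most often in the training pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in annotated_instances:
        counts[lemma][sense] += 1
    return {lemma: sense_counts.most_common(1)[0][0]
            for lemma, sense_counts in counts.items()}


# Toy training data with invented sense labels.
toy_training = [
    ("play", "play:act_a_role"),
    ("play", "play:perform_music"),
    ("play", "play:act_a_role"),
    ("bed", "bed:plot_of_ground"),
]
mfs_model = train_mfs(toy_training)
predicted = mfs_model.get("play")    # most frequent sense seen in training
unseen = mfs_model.get("pianist")    # lemma absent from training yields None
```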

English all-words results
In this section, we show the performance of our proposed model on the English all-words task. Table 1 shows the F1-score results on the four test sets mentioned in Section 4.1. The systems in the first four blocks are implemented by Raganato et al. (2017a,b) except for the single Bi-LSTM model. The last block lists the performance of our proposed model GAS and its variant GAS_ext, which extends the gloss module in GAS.
GAS and GAS_ext achieve state-of-the-art performance on the concatenation of all test datasets. Although no single system always performs best on all the test sets (footnote 11), GAS_ext with the concatenation memory updating strategy achieves the best result, 70.6, on the concatenation of the four test datasets. Compared with the other three neural-based methods in the fourth block, our best model outperforms the previous best neural network models on every individual test set. IMS+emb, which trains a dedicated classifier for each word individually (word expert) with many manually designed features including word embeddings, is hard to beat for neural network models. However, our best model also beats IMS+emb on the SE3, SE13 and SE15 test sets.
Incorporating glosses into neural WSD can greatly improve the performance, and extending the original gloss can further boost the results. Compared with the Bi-LSTM baseline, which only uses labeled data, our proposed model improves the WSD task by 2.2% F1-score with the help of gloss knowledge. Furthermore, compared with GAS, which only uses the original gloss as background knowledge, GAS_ext further improves the performance with the help of the glosses extended through semantic relations. This shows that incorporating extended glosses through hypernyms and hyponyms into neural network models can boost the performance for WSD.

Table 2 (columns: Glosses, Pass 1, Pass 2, Pass 3, Pass 4, Pass 5).
Context: He plays a pianist in the film
$g_1$: participate in games or sport
$g_2$: perform music on an instrument
$g_3$: act a role or part

Table 3: F1-score (%) of different passes from 1 to 5 on the test data sets. It shows that an appropriate number of passes can boost the performance as well as avoid over-fitting of the model.

Footnote 11: Because the sources of the four datasets are extremely different and belong to different domains.

Multiple Passes Analysis
To better illustrate the influence of multiple passes, we give an example in Table 2. Suppose that we meet an unknown word $x$, look it up in the dictionary, and find three word senses and their glosses corresponding to $x$.
We try to figure out the correct meaning of $x$ according to its context and the glosses of the different word senses using the proposed memory module. In the first pass, the first sense is excluded, because there is no relevance between the context and $g_1$. But $g_2$ and $g_3$ may need repeated deliberation, because the word pianist is similar to the words music and role in the two glosses. By re-reading the context and gloss information of the target word in the following passes, the correct word sense $g_3$ attracts much more attention than the other two senses. Such a re-reading process can be realized by the multi-pass operation in the memory module. Furthermore, Table 3 shows the effectiveness of the multi-pass operation in the memory module. Multiple passes perform better than one pass, though the improvement is not significant. The reason for this phenomenon is that for most target words, one main word sense accounts for the majority of their appearances. Therefore, in most circumstances, one-pass inference can lead to the correct word sense. The case study in Table 2 shows that the proposed multi-pass inference can help to recognize infrequent senses like the third sense of the word play. In Table 3, the F1-score increases with the number of passes. However, when the number of passes is larger than 3, the F1-score stops increasing or even decreases due to over-fitting. This shows that an appropriate number of passes can boost the performance as well as avoid over-fitting of the model.

Conclusions and Future Work
In this paper, we seek to address the problem of integrating the gloss knowledge of an ambiguous word into a neural network for WSD. We further extend the gloss information through its semantic relations in WordNet to better infer the context. In this way, we not only make use of labeled context data but also exploit background knowledge to disambiguate the word sense. Results on four English all-words WSD data sets show that our best model outperforms the existing methods.
There is still one challenge left for the future. We only extract the glosses and do not use the structural properties or graph information of lexical resources. In the next step, we will consider integrating this rich structural information into the neural network for Word Sense Disambiguation.