Zero-shot Word Sense Disambiguation using Sense Definition Embeddings

Word Sense Disambiguation (WSD) is a long-standing but open problem in Natural Language Processing (NLP). WSD corpora are typically small in size, owing to an expensive annotation process. Current supervised WSD methods treat senses as discrete labels and also resort to predicting the Most-Frequent-Sense (MFS) for words unseen during training. This leads to poor performance on rare and unseen senses. To overcome this challenge, we propose Extended WSD Incorporating Sense Embeddings (EWISE), a supervised model to perform WSD by predicting over a continuous sense embedding space as opposed to a discrete label space. This allows EWISE to generalize over both seen and unseen senses, thus achieving generalized zero-shot learning. To obtain target sense embeddings, EWISE utilizes sense definitions. EWISE learns a novel sentence encoder for sense definitions by using WordNet relations and ConvE, a recently proposed knowledge graph embedding method. We also compare EWISE against other sentence encoders pretrained on large corpora to generate definition embeddings. EWISE achieves new state-of-the-art WSD performance.


Introduction
Word Sense Disambiguation (WSD) is an important task in Natural Language Processing (NLP) (Navigli, 2009). The task is to associate a word in text with its correct sense, where the set of possible senses for the word is assumed to be known a priori. Consider the noun "tie" and the following examples of its usage (Miller, 1995).
• "he wore a vest and tie" • "their record was 3 wins, 6 losses and a tie" It is clear that the implied sense of the word "tie" is very different in the two cases.The word is associated with "neckwear consisting of a long narrow piece of material" in the first example, and with "the finish of a contest in which the winner is undecided" in the second.The goal of WSD is to predict the right sense, given a word and its context.
WSD has been shown to be useful for popular NLP tasks such as machine translation (Neale et al., 2016; Pu et al., 2018), information extraction (Zhong and Ng, 2012; Delli Bovi et al., 2015) and question answering (Ramakrishnan et al., 2003). The task of WSD can also be viewed as an intrinsic evaluation benchmark for the semantics learned by sentence comprehension models. WSD remains an open problem despite a long history of research. In this work, we study the all-words WSD task, where the goal is to disambiguate all ambiguous words in a corpus.
Supervised (Zhong and Ng, 2010; Iacobacci et al., 2016; Melamud et al., 2016) and semi-supervised approaches (Taghipour and Ng, 2015; Yuan et al., 2016) to WSD treat the target senses as discrete labels. Treating senses as discrete labels limits the generalization capability of these models for senses which occur infrequently in the training data. Further, for disambiguation of words not seen during training, these methods fall back on using a Most-Frequent-Sense (MFS) strategy, obtained from an external resource such as WordNet (Miller, 1995). To address these concerns, unsupervised knowledge-based (KB) approaches have been introduced, which rely solely on lexical resources (e.g., WordNet). KB methods include approaches based on context-definition overlap (Lesk, 1986; Basile et al., 2014), or on the structural properties of the lexical resource (Moro et al., 2014; Weissenborn et al., 2015; Chaplot et al., 2015; Chaplot and Salakhutdinov, 2018; Tripodi and Pelillo, 2017). While knowledge-based approaches offer a way to disambiguate rare and unseen words into potentially rare senses, supervised methods consistently outperform these methods in the general setting where inference is to be carried over both frequently occurring and rare words. Recently, Raganato et al. (2017b) posed WSD as a neural sequence labeling task, further improving the state-of-the-art. Yet, owing to an expensive annotation process (Lopez de Lacalle and Agirre, 2015), there is a scarcity of sense-annotated data, thereby limiting the generalization ability of supervised methods. While there has been recent interest in incorporating definitions (glosses) to overcome the supervision bottleneck for WSD (Luo et al., 2018b,a), these methods are still limited due to their treatment of senses as discrete labels.
Our hypothesis is that supervised methods can leverage lexical resources to improve on WSD for both observed and unobserved words and senses. We propose Extended WSD Incorporating Sense Embeddings (EWISE). Instead of learning a model to choose between discrete labels, EWISE learns a continuous space of sense embeddings as target. This enables generalized zero-shot learning, i.e., the ability to recognize instances of seen as well as unseen senses. EWISE utilizes sense definitions and additional information from lexical resources. We believe that natural language information manually encoded into definitions contains a rich source of information for representation learning of senses.
To obtain definition embeddings, we propose a novel learning framework which leverages recently successful Knowledge Graph (KG) embedding methods (Bordes et al., 2013; Dettmers et al., 2018). We also compare against sentence encoders pretrained on large corpora.
In summary, we make the following contributions in this work.
• We propose EWISE, a principled framework to learn from a combination of sense-annotated data, dictionary definitions and lexical knowledge bases.
• We propose the use of sense embeddings instead of discrete labels as the targets for supervised WSD, enabling generalized zero-shot learning.
• Through extensive evaluation, we demonstrate the effectiveness of EWISE over state-of-the-art baselines.

Related Work
Classical approaches to supervised WSD relied on extracting potentially relevant features and learning classifiers independently for each word (Zhong and Ng, 2010). Extensions to use distributional word representations have been proposed (Iacobacci et al., 2016). Semi-supervised approaches learn context representations from unlabeled data, followed by a nearest neighbour classification (Melamud et al., 2016) or label propagation (Yuan et al., 2016). Recently, Raganato et al. (2017b) introduced neural sequence models for joint disambiguation of words in a sentence. All of these methods rely on sense-annotated data and, optionally, additional unlabeled corpora.
Lexical resources provide an important source of knowledge about words and their meanings. Recent work has shown that neural networks can extract semantic information from dictionary definitions (Bahdanau et al., 2017; Bosc and Vincent, 2018). In this work, we use dictionary definitions to get representations of word meanings.
Dictionary definitions have been used for WSD, motivated by the classical method of Lesk (Lesk, 1986). The original as well as subsequent modifications of the algorithm (Banerjee and Pedersen, 2003), including using word embeddings (Basile et al., 2014), operate on the hypothesis that the definition of the correct sense has a high overlap with the context in which a word is used. These methods tend to rely on heuristics based on insights about natural language text and their definitions. More recently, gloss (definition)-augmented neural approaches have been proposed which integrate a module to score definition-context similarity (Luo et al., 2018b,a), and achieve state-of-the-art results. We differ from these works in that we use the embeddings of definitions as the target space of a neural model, while learning in a supervised setup. Also, we don't rely on any overlap heuristics, and use a single definition for a given sense as provided by WordNet.
One approach for obtaining continuous representations for definitions is to use Universal Sentence Representations, which have been explored to allow transfer learning from large unlabeled as well as labeled data (Conneau et al., 2017; Cer et al., 2018). There has also been interest in learning deep contextualized word representations (Peters et al., 2018; Devlin et al., 2019). In this work, we evaluate definition embeddings obtained using these methods.
Structural knowledge available in lexical resources such as WordNet has motivated several unsupervised knowledge-based approaches for WSD. Graph-based techniques have been used to match words to the most relevant sense (Navigli and Lapata, 2010; Sinha and Mihalcea, 2007; Agirre et al., 2014; Moro et al., 2014; Chaplot and Salakhutdinov, 2018).
Our work differs from these methods in that we use structural knowledge to learn better representations of definitions, which are then used as targets for the WSD model. To learn a meaningful encoder for definitions, we rely on knowledge graph embedding methods, where we represent an entity by the encoding of its definition. TransE (Bordes et al., 2013) models relations between entities as translations operating on the embeddings of the corresponding entities. ConvE (Dettmers et al., 2018), a more recent method, utilizes a multi-layer convolutional network, allowing it to learn more expressive features.
Predicting in an embedding space is key to our method, allowing generalized zero-shot learning capability, as well as incorporating definitions and structural knowledge. The idea has been explored in the context of zero-shot learning (Xian et al., 2018). Tying the input and output embeddings of language models (Press and Wolf, 2017) also resembles our approach.

Background
In this work, we propose to use the training signal present in WordNet relations to learn encoders for definitions (Section 4.3.2). To learn from WordNet relations, we employ recently popular Knowledge Graph (KG) embedding learning methods. In Section 3.1, we briefly introduce the framework for KG embedding learning, and present the specific formulations for TransE and ConvE.

Knowledge Graph Embeddings
Knowledge Graphs, a set of relations defined over a set of entities, provide an important field of research for representation learning. Methods for learning representations for both entities and relations have been explored (Wang et al., 2017) with an aim to represent graphical knowledge. Of particular significance is the task of link prediction, i.e., predicting missing links (edges) in the graph.
A Knowledge Graph typically comprises a set K of N triples (h, l, t), where head h and tail t are entities, and l denotes a relation.
TransE defines a scoring function for a triple (h, l, t) as the dissimilarity between the head embedding, translated by the relation embedding, and the tail embedding:

d(h, l, t) = ||e_h + e_l − e_t||    (1)

where e_h, e_t and e_l are parameters to be learnt. A margin-based criterion, with margin γ, can then be formulated as:

L = Σ_{(h,l,t) ∈ K} Σ_{(h',l,t') ∈ K'} [γ + d(h, l, t) − d(h', l, t')]_+    (2)

where K' is a set of corrupted triples (Bordes et al., 2013), and [x]_+ refers to the positive part of x.
ConvE formulates the scoring function ψ_l(e_h, e_t) for a triple (h, l, t) as:

ψ_l(e_h, e_t) = f(vec(f([ē_h; ē_l] ∗ w)) W) e_t    (3)

where e_h and e_t are entity parameters, e_l is a relation parameter, x̄ denotes a 2D reshaping of x, w denotes the filters for 2D convolution, vec(x) denotes the vectorization of x, W represents a linear transformation, and f denotes a rectified linear unit.
For a given head entity h, the score ψ_l(e_h, e_t) is computed with each entity in the graph as a tail. Probability estimates for the validity of a triple are obtained by applying a logistic sigmoid function to the scores:

p = σ(ψ_l(e_h, e_t))    (4)

The model is then trained using a binary cross-entropy loss:

L = −(1/N) Σ_i (t_i log(p_i) + (1 − t_i) log(1 − p_i))    (5)

where t_i is 1 when (h, l, t_i) ∈ K and 0 otherwise.

EWISE
EWISE is a general WSD framework for learning from sense-annotated data, dictionary definitions and lexical knowledge bases (Figure 1).
EWISE addresses a key issue with existing supervised WSD systems. Existing systems use discrete sense labels as targets for WSD. This limits the generalization capability to only the set of annotated words in the corpus, with reliable learning only for the word-senses which occur with high relative frequency. In this work, we propose using continuous-space embeddings of senses as targets for WSD, to overcome the aforementioned supervision bottleneck.
To ensure generalized zero-shot learning capability, it is important that the target sense embeddings be obtained independent of the WSD task learning. We use definitions of senses available in WordNet to obtain sense embeddings. Using dictionary definitions to obtain the representation for a sense enables us to benefit from the semantic overlap between definitions of different senses, while also providing a natural way to handle unseen senses.
In Section 4.1, we state the task of WSD formally. We then describe the components of EWISE in detail. Here, we briefly discuss the components:
• Attentive Context Encoder: EWISE uses a Bi-directional LSTM (BiLSTM) encoder to convert the sequence of tokens in the input sentence into context-aware embeddings. Self-attention is used to enhance the context for disambiguating the current word, followed by a projection layer to produce sense embeddings for each input token. The architecture is detailed in Section 4.2.
• Definition Encoder: In EWISE, definition embeddings are learnt independent of the WSD task. In Section 4.3.1, we detail the usage of pretrained sentence encoders as baseline models for encoding definitions. In Section 4.3.2, we detail our proposed method to learn an encoder for definitions using structural knowledge in WordNet.

The WSD Task
WSD is a classification problem for a word w (e.g., bank) in a context c, with class labels being the word senses (e.g., financial institution).
We consider the all-words WSD task, where all content words (nouns, verbs, adjectives, adverbs) need to be disambiguated (Raganato et al., 2017a). The set of all possible senses for a word is given by a predefined sense inventory, such as WordNet. In this work, we use sense candidates as provided in the evaluation framework of Raganato et al. (2017a), which has been created using WordNet.
More precisely, given a variable-length sequence of words x = ⟨x_1, ..., x_T⟩, we need to predict a sequence of word senses y = ⟨y_1, ..., y_T⟩. Output word sense y_i comes from a predefined sense inventory S. During inference, the set of candidate senses S_w for input word w is assumed to be known a priori.

Attentive Context Encoder
In this section, we detail how EWISE encodes the context of a word to be disambiguated using BiLSTMs (Hochreiter and Schmidhuber, 1997). BiLSTMs have been shown to be successful in generating effective context-dependent representations for words. Following Raganato et al. (2017b), we use a BiLSTM with a self-attention layer to obtain sense-aware, context-specific representations of words. The sense embedding for a word is obtained through a projection of the context embedding. We then train the model with independently trained sense embeddings (Section 4.3) as target embeddings.
Our model architecture is shown in Figure 1. The model processes a sequence of tokens x_i, i ∈ [T], in a given input sentence by first representing each token with a real-valued vector representation e_i, via an embedding matrix W_e ∈ R^{|V|×d}, where |V| is the vocabulary size and d is the size of the embeddings. The vector representations are then input to a 2-layer bidirectional LSTM encoder. Each word is represented by concatenating the forward (h_i^f) and backward (h_i^b) hidden state vectors of the second LSTM layer.
Following Vaswani et al. (2017), we use a scaled dot-product attention mechanism to get context information at each timestep. Attention queries, keys and values are obtained using projection matrices W_q, W_k and W_v respectively, while the size of the projected key (d_k) is used to scale the dot product between queries and keys.
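The attention step can be sketched as follows. This is a single-head, unmasked numpy illustration with toy dimensions and random projection matrices, not the exact configuration used in EWISE:

```python
import numpy as np

def self_attention(H, W_q, W_k, W_v):
    """Scaled dot-product self-attention over BiLSTM states H (T x d).

    Query-key dot products are scaled by sqrt(d_k) before the softmax,
    following Vaswani et al. (2017).
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (T x T)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return w @ V                                          # context per timestep

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                  # 5 tokens, hidden size 8
W_q, W_k, W_v = rng.normal(size=(3, 8, 8))
C = self_attention(H, W_q, W_k, W_v)
```

Each row of C aggregates information from all timesteps, giving the context vector used to disambiguate the word at that position.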
A projection layer (fully connected linear layer) maps this context-aware word representation r i to v i in the space of sense embeddings.
During training, we multiply this with the sense embeddings of all senses in the inventory, to obtain a score for each output sense. A bias term is added to this score, where the bias is obtained as the dot product between the sense embedding and a learned parameter b. A softmax layer then generates probability estimates for each output sense.
The cross-entropy loss for annotated word x_i is given by:

L(x_i) = − Σ_{j ∈ S} z_{ij} log(p_{ij})

where z_i is the one-hot representation of the target sense y_i in the sense inventory S, and p_{ij} is the estimated probability of sense j. The network parameters are learnt by minimizing the average cross-entropy loss over all annotated words in a batch.
During inference, for each word x_i, we select the candidate sense with the highest score:

ŷ_i = argmax_{j ∈ S_{x_i}} (dot(v_i, ρ_j) + dot(b, ρ_j))

where ρ_j denotes the embedding of sense j.
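Inference can be sketched as follows; the inventory size, dimensions and values are toy assumptions, with v_i standing in for the projected context embedding, rho for the matrix of sense embeddings, and b for the learned bias parameter:

```python
import numpy as np

def predict_sense(v_i, rho, b, candidates):
    """argmax over candidate senses of dot(v_i, rho_j) + dot(b, rho_j)."""
    scores = rho @ v_i + rho @ b   # score for every sense in the inventory
    return max(candidates, key=lambda j: scores[j])

rng = np.random.default_rng(0)
rho = rng.normal(size=(10, 6))     # inventory of 10 senses, dim 6
v_i, b = rng.normal(size=(2, 6))
y_hat = predict_sense(v_i, rho, b, candidates=[2, 5, 7])
```

Restricting the argmax to the candidate set S_w is what allows the same scoring rule to rank unseen senses, since their embeddings come from the definition encoder rather than from training labels.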

Definition Encoder
In this section, we detail how target sense embeddings are obtained in EWISE.

Pretrained Sentence Encoders
We use the pretrained sentence representation models InferSent (Conneau et al., 2017) and USE (Cer et al., 2018) to encode definitions, producing sense embeddings of sizes 4096 and 512, respectively. We also experiment with the deep context encoders ELMO (Peters et al., 2018) and BERT (Devlin et al., 2019) to obtain embeddings for definitions. In each case, we encode a definition using the available pretrained models, producing a context embedding for each word in the definition. A fixed-length representation is then obtained by averaging over the context embeddings of the words in the definition, from the final layer. This produces sense embeddings of size 1024 with both ELMO and BERT.
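The pooling step can be sketched as follows, assuming the per-word context embeddings have already been computed by a pretrained encoder (the random 1024-dim values below merely stand in for ELMO/BERT final-layer states):

```python
import numpy as np

def definition_embedding(context_embeddings):
    """Mean of final-layer context embeddings of the definition's words."""
    return np.asarray(context_embeddings).mean(axis=0)

rng = np.random.default_rng(0)
ctx = rng.normal(size=(5, 1024))   # a 5-word definition, 1024-dim states
sense_emb = definition_embedding(ctx)
```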

Knowledge Graph Embedding
WordNet contains a knowledge graph, where the entities of the graph are senses (synsets), and relations are defined over these senses. Example relations include hypernym and part-of. With each entity (sense), there is an associated text definition. We propose to use WordNet relations as the training signal for learning definition encoders. The training set K is comprised of triples (h, l, t), where head h and tail t are senses, and l is a relation. Also, g_x denotes the definition of entity x, as provided by WordNet. The dataset contains 18 WordNet relations (Bordes et al., 2013).
The goal is to learn a sentence encoder for definitions; we select the BiLSTM-Max encoder architecture due to its recent success in sentence representation (Conneau et al., 2017). The words in the definition are encoded by a 2-layer BiLSTM to obtain context-aware embeddings for each word. A fixed-length representation is then obtained by max pooling, i.e., selecting the maximum over each dimension. We denote this definition encoder by q(·).
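The max-pooling step of the encoder can be sketched as follows, assuming the BiLSTM states for the definition's words are already computed (random toy values below):

```python
import numpy as np

def q(H):
    """Pooling step of the BiLSTM-Max encoder: elementwise max over time."""
    return np.max(H, axis=0)

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))   # BiLSTM states for a 7-word definition
g = q(H)                       # fixed-length definition embedding
```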

TransE
We modify the dissimilarity measure in TransE (Equation 1) to represent both head (h) and tail (t) entities by an encoding of their definitions:

d(h, l, t) = ||q(g_h) + e_l − q(g_t)||
The parameters of the BiLSTM model q and the relation embeddings e_l are then learnt by minimizing the loss function in Equation 2.
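A minimal sketch of the modified score follows. The encoder q here is a fixed max-pooling stand-in for the trained BiLSTM-Max network, and all dimensions and values are toy assumptions:

```python
import numpy as np

def q(H):
    """Stand-in definition encoder: max pooling over precomputed BiLSTM
    states. In EWISE, q is a trained BiLSTM-Max network; this fixed
    pooling is purely illustrative."""
    return np.max(H, axis=0)

def transe_def_score(g_h_states, e_l, g_t_states):
    """Modified TransE dissimilarity: ||q(g_h) + e_l - q(g_t)||."""
    return np.linalg.norm(q(g_h_states) + e_l - q(g_t_states))

rng = np.random.default_rng(0)
g_h = rng.normal(size=(6, 8))   # head definition: 6 words, dim 8
g_t = rng.normal(size=(4, 8))   # tail definition: 4 words, dim 8
e_l = rng.normal(size=8)
s = transe_def_score(g_h, e_l, g_t)
```

Because the score depends on the definitions only through q, gradients from the margin loss flow into the encoder, which is how the WordNet relation structure shapes the definition embeddings.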
ConvE
We modify the scoring function of ConvE (Equation 3) to represent a head entity by the encoding of its definition: ψ_l(q(g_h), e_t).
Note that we represent only the head entity with an encoding of its definition, while the tail entity t is still represented by the parameter e_t. This helps restrict the size of the computation graph.
The parameters of the model (q, e_l and e_t) are then learnt by minimizing the binary cross-entropy loss function in Equation 5. Further details are captured in Appendix A.

Evaluation
In this section, we aim to answer the following questions:
• Q1: How does EWISE compare to state-of-the-art methods on standardized test sets? (Section 6.1)
• Q2: What is the effect of ablating key components from EWISE? (Section 6.2)
• Q3: Does EWISE generalize to rare and unseen words (Section 6.3.1) and senses (Section 6.3.2)?

Overall Results
In this section, we report the performance of EWISE on the fine-grained all-words WSD task, using the standardized benchmarks and evaluation methodology introduced in Raganato et al. (2017a). In Table 1, we report the F1 scores for EWISE, and compare against the best reported supervised and knowledge-based methods. WordNet S1 is a strong baseline obtained by using the most frequent sense of a word as listed in WordNet. MFS is a most-frequent-sense baseline obtained through the sense frequencies in the training corpus.
Context2Vec (Melamud et al., 2016), an unsupervised model for learning generic context embeddings, enables a strong baseline for supervised WSD while using a simplistic approach (a nearest-neighbour algorithm).
Back-off: Traditional supervised approaches can't handle unseen words. WordNet S1 is used as a back-off strategy for words unseen during training. EWISE is capable of generalizing to unseen words and senses and doesn't use any back-off.

We provide an ablation study of EWISE on the ALL dataset in Table 2. To investigate the effect of using definition embeddings in EWISE, we trained a BiLSTM model without any externally obtained sense embeddings. This model can make predictions only on words seen during training, and is evaluated with and without a back-off strategy (WordNet S1) for unseen words (rows 2 and 3). The results demonstrate that incorporating sense embeddings is key to EWISE's performance. Further, the generalization capability of EWISE is illustrated by the improvement in F1 in the absence of a back-off strategy (10.0 points).

Next, we investigate the impact of the choice of sense embeddings used as the target for EWISE (Table 3), on the ALL dataset. We compare definition embeddings learnt using structural knowledge (TransE, ConvE; see Section 4.3.2) against definition embeddings obtained from pretrained sentence and context encoders (USE, InferSent, ELMO, BERT; see Section 4.3.1). We also compared with off-the-shelf sense embeddings (DeConf) (Pilehvar and Collier, 2016), where definitions are not used. The results justify the choice of learning definition embeddings to represent senses.

Detailed Results
We provide detailed results for EWISE on the ALL dataset, compared against the BiLSTM-A (BiLSTM + attention) baseline, which is trained to predict in the discrete label space (Raganato et al., 2017b). We also compare against WordNet S1 and the knowledge-based methods Lesk_ext+emb and Babelfy, available in the evaluation framework of Raganato et al. (2017a).

WSD on Rare Words
In this section, we investigate a key claim of EWISE: the ability to disambiguate unseen and rare words. In Figure 2, we evaluate WSD models on words grouped by their annotation frequency in the training set. EWISE outperforms the supervised as well as knowledge-based baselines for rare as well as frequent words.

To investigate the ability to generalize to rare senses, we partition the ALL test set into two parts: the set of instances labeled with the most frequent sense of the corresponding word (MFS), and the set of remaining instances (LFS: Least Frequent Senses). Postma et al. (2016) note that existing methods learn well on the MFS set, while doing poorly (∼20%) on the LFS set.
In Table 4, we evaluate the performance of EWISE and baseline models on the MFS and LFS sets. We note that EWISE provides significant gains over a neural baseline (BiLSTM-A), as well as knowledge-based methods, on the LFS set, while maintaining high accuracy on the MFS set. The gain obtained on the LFS set is consistent with our hypothesis that predicting over sense embeddings enables generalization to rare senses.

In this section, we investigate if EWISE can learn efficiently from less training data, given its increased supervision bandwidth (sense embeddings instead of sense labels). In Table 5, we report the performance of EWISE on the ALL dataset with varying sizes of the training data. We note that with only 50% of the training data, EWISE already competes with several supervised approaches (Table 1), while with just 20% of the training data, EWISE is able to outperform the strong WordNet S1 baseline. For reference, we also present the performance of EWISE when we use back-off (WordNet S1) for words unseen during training.

Conclusion and Future Work
We have introduced EWISE, a general framework for learning WSD from a combination of sense-annotated data, dictionary definitions and lexical knowledge bases. EWISE uses sense embeddings as targets instead of discrete sense labels. This helps the model gain zero-shot learning capabilities, demonstrated through ablation and detailed analysis. EWISE improves state-of-the-art results on standardized benchmarks for WSD. We are releasing the EWISE code to promote reproducible research.
This paper should serve as a starting point to better investigate WSD on out-of-vocabulary words. Our modular architecture opens up various avenues for improvements in few-shot learning for WSD, viz., the context encoder, the definition encoder, and leveraging structural knowledge. Another potential direction for future work is to explore other ways of providing rich supervision from textual descriptions as targets.
Figure 1: Overview of WSD in EWISE: A sequence of input tokens is encoded into context-aware embeddings using a BiLSTM and a self-attention layer (⊕ indicates concatenation). The context-aware embeddings are then projected onto the space of sense embeddings. The score for each sense in the sense inventory is obtained as a dot product of the sense embedding with the projected word embedding. Please see Section 4.2 for details on the context encoding and training of the context encoder. The sense embedding for each sense in the inventory is generated using a BiLSTM-Max definition encoder. The encoder is learnt using the training signal present in the WordNet graph. An example signal with the hypernym relation is depicted. Please see Section 4.3 for details on learning sense embeddings.

Table 1:
Comparison of F1 scores for fine-grained all-words WSD on Senseval and SemEval datasets in the framework of Raganato et al. (2017a). The F1 scores on different POS tags (nouns, verbs, adjectives, and adverbs) are also reported. WordNet S1 and MFS provide most-frequent-sense baselines. * represents models which access definitions, while ˆ indicates models which don't access any external knowledge. EWISE (ConvE) is the proposed approach, where the ConvE method was used to generate the definition embeddings. Both the non-neural and neural supervised baselines presented here rely on a back-off mechanism, using WordNet S1 for words unseen during training. For each dataset, the highest score among existing systems with a statistically significant difference (unpaired t-test, p < 0.05) from EWISE is underlined. EWISE, which is capable of generalizing to unseen words and senses, doesn't use any back-off. EWISE consistently outperforms all supervised and knowledge-based systems, except for adverbs. Please see Section 6.1 for details. While the overall performance of EWISE is comparable to the neural baselines in terms of statistical significance, the value of EWISE lies in its ability to handle unseen and rare words and senses (see Section 6.3). Further, among the models compared, EWISE is the only system which is statistically significant (unpaired t-test, p < 0.01) with respect to the WordNet S1 baseline across all test datasets.

Table 3:
Comparison of F1 scores with different sense embeddings as targets for EWISE. While pretrained embedding methods (USE, InferSent, ELMO, BERT) and DeConf provide impressive results, the KG embedding methods (TransE and ConvE) perform competitively or better by learning to encode definitions using WordNet alone. Please see Section 6.2 for details.

Table 4:
Comparison of F1 scores on different sense frequencies. EWISE outperforms baselines on infrequent senses, without sacrificing performance on the most-frequent-sense examples. Please see Section 6.3.2 for details.

Table 5:
Performance of EWISE with varying sizes of training data. With only 20% of the training data, EWISE is able to outperform the most-frequent-sense baseline of WordNet S1. Please see Section 6.4 for details.