A Synset Relation-enhanced Framework with a Try-again Mechanism for Word Sense Disambiguation

Contextual embeddings have proved overwhelmingly effective for the task of Word Sense Disambiguation (WSD) compared with other sense representation techniques. However, these embeddings fail to capture the sense knowledge encoded in semantic networks. In this paper, we propose a Synset Relation-Enhanced Framework (SREF) that leverages sense relations both to enhance sense embeddings and to drive a try-again mechanism that performs a second round of WSD, after basic sense embeddings are obtained from augmented WordNet glosses. Experiments on all-words and lexical sample datasets show that the proposed system achieves new state-of-the-art results, surpassing previous knowledge-based systems by at least 5.5 in F1. When the system utilizes sense embeddings learned from SemCor, it outperforms all previous supervised systems with only 20% of the SemCor data.


Introduction
Word Sense Disambiguation (WSD) is an ongoing research area in the Natural Language Processing community. It aims to determine the correct meaning (sense) of a word in its context, given a list of potential (competing) senses in a sense inventory. According to Navigli (2009), most WSD solutions can be categorized into supervised and knowledge-based approaches. Supervised systems rely on sense-annotated data to train either word experts (Zhong and Ng, 2010) or a neural language model (Raganato et al., 2017a) for disambiguation, and thus perform better than their knowledge-based counterparts (Banerjee and Pedersen, 2002; Basile et al., 2014; Agirre et al., 2014), which merely utilize sense knowledge from a sense inventory. However, knowledge-based approaches scale better to multilingual scenarios or to specific domains where sense annotation is limited.
Contextual representations learned from neural language models (Peters et al., 2018) have proved beneficial to the task of WSD. Many recent systems (Loureiro and Jorge, 2019; Vial et al., 2019; Scarlini et al., 2020) utilize language models, especially BERT (Devlin et al., 2019), as feature extractors to obtain contextual sense representations and outperform previous approaches by large margins. There are also systems (Luo et al., 2018; Kumar et al., 2019) that incorporate sense definitions into language models and achieve state-of-the-art performance. However, most of these systems are implemented in a supervised manner using a widely exploited sense-annotated corpus, SemCor (Miller et al., 1994), and merge knowledge from the sense inventory only as a supplement. There is much space to explore in how to better exploit the knowledge in a sense inventory, such as the different WordNet relations and the super-senses that categorize WordNet senses into 45 clusters.
In this paper, we present SREF, a knowledge-enhanced WSD approach that effectively exploits the sense definitions and relations in an inventory. First, we design a gloss augmentation method for synsets with short WordNet definitions so that each synset can learn a reliable sense embedding from BERT features. Then, based on these embeddings, we explore the contribution of different synset relations in WordNet (Miller, 1995) to learn relation-enhanced sense embeddings. After a first round of WSD is conducted with a nearest neighbor approach, matching an ambiguous word's context embedding against the relation-enhanced embeddings of its potential senses, we apply a try-again mechanism to the top-2 competing senses using synset relations and the super-sense category. When applying the proposed strategy to WSD, our system achieves state-of-the-art performance among knowledge-based systems. When we concatenate our sense embeddings with those learned from SemCor, new state-of-the-art performance in the supervised category is achieved. We summarize our contributions as follows: (1) We propose a fine-grained utilization of short WordNet sense glosses to retrieve web mentions that supplement sense embedding learning, and a method to create sense embeddings in a bag-of-sense manner by utilizing WordNet sense relations.
(2) We design a try-again mechanism that employs both synset relations and super-sense connections. To the best of our knowledge, this is the first attempt at employing WordNet relations to perform a second round of WSD with sense relation knowledge.
(3) State-of-the-art performance is achieved on both all-words and lexical sample WSD datasets, surpassing previous systems by 5.5 F1 in knowledge-based all-words WSD. The supervised version of our system achieves state-of-the-art performance with only 20% of the SemCor data. The source code is available at: github.com/lwmlyy/SREF.

Related Work
Two streams of approaches to WSD, namely supervised and knowledge-based, have been well developed over the last few decades. Their major difference is whether a sense-annotated corpus is employed.

Supervised Systems
Supervised systems originally regarded WSD as a sense classification problem, building one classifier for each target word. IMS (Zhong and Ng, 2010), among others (Tsatsaronis et al., 2007; Iacobacci et al., 2016; Papandrea et al., 2017), is the most widespread system of this kind, leveraging an SVM to classify senses. In recent years, a more efficient supervised scheme has been proposed. Rather than training many classifiers, it constructs a single neural architecture (Raganato et al., 2017a) trained on an annotated corpus and disambiguates words based on the output of the last layer. These methods did not outperform their traditional counterparts until sense definitions were incorporated (Luo et al., 2018). It has also become a trend that newly proposed systems (Kumar et al., 2019; Huang et al., 2019; Loureiro and Jorge, 2019; Vial et al., 2019; Scarlini et al., 2020) exploit WordNet sense knowledge in one way or another.
Despite the employment of sense knowledge, many systems still require a Most Frequent Sense (MFS) fallback, since SemCor covers only a small proportion of WordNet lemmas. To address this issue, LMMS (Loureiro and Jorge, 2019) takes into account the synset and hypernymy relations in WordNet to extend sense embeddings to full coverage, using BERT to contextualize the annotated senses in SemCor as a starting point. This approach achieves an unprecedented improvement on WSD tasks, although the synset relations are not adequately explored.
The recent development of contextual embeddings has injected much power into supervised WSD systems. Many of them rely on WordNet glosses to embed contextual information about a particular sense. However, a simple fact seems to be overlooked: many synset glosses are too short to deliver sufficient information. We thus propose a gloss augmentation method to alleviate this issue. This differs from previous gloss-expansion methods (Ponzetto and Navigli, 2010; Miller et al., 2012), which expand glosses with either separate words or Wikipedia documents rather than selected short sentences.

Knowledge-based Systems
Knowledge-based systems typically design algorithms that operate on semantic networks for disambiguation. One major branch considers the similarity between the potential senses and the ambiguous word's context, including Lesk (Lesk, 1986) and subsequent work (Banerjee and Pedersen, 2002; Basile et al., 2014). Another branch runs graph algorithms (Agirre et al., 2014; Moro et al., 2014) on the semantic network and disambiguates based on sense connections in the network. There are also studies (McCarthy et al., 2007; Bhingardive et al., 2015) that explore how to learn or manipulate the MFS, given that MFS is a highly competitive strategy.
Seeking language transferability, many knowledge-based methods (Basile et al., 2014; Camacho-Collados et al., 2016) rely on multilingual resources such as Wikipedia and BabelNet. A recent work (Scarlini et al., 2020) follows the same idea, using BERT to learn contextual sense representations from mentions retrieved from both resources. Its supervised version is capable of beating many of the latest systems in noun disambiguation. However, both knowledge resources are constructed from the perspective of noun or entity relations, limiting the system's capability of disambiguating words in other parts of speech (POS). In this paper, we augment short synset glosses (regardless of the synset's POS) with mentions retrieved from the web so that the contextual representations can be more comprehensive for senses in all POS.
Although many previous similarity-based methods have explored the value of synset relations in WordNet, most of them utilize related synsets in a bag-of-words manner. For example, in enhanced Lesk (Basile et al., 2014), the gloss words of related synsets are first merged into the gloss word set of a potential sense; a sense embedding is then learned by summing the word embeddings of the whole set. This approach naturally neglects the word order in a sense gloss and weakens the differences between senses. In our approach, sense embedding learning is implemented from a bag-of-sense perspective, so these weaknesses are alleviated. Also, we propose a novel relation exploitation scheme that disambiguates again with not only the potential sense itself but also its related senses in WordNet. This is distinct from previous methods where relations are exploited to compress or cluster senses into coarse-grained senses (Miller and Iryna, 2015; Vial et al., 2019).

Preliminaries
In this section, we introduce WordNet and BERT, the contextual representation learning model.

WordNet
WordNet is a commonly used sense inventory for English WSD and it covers 117,659 synsets and 206,978 senses in its 3.0 version. A synset contains a set of senses that share the same meaning. For each synset, a definition (gloss) is provided to show what it means, or in some cases, to explain the synset less ambiguously. For example, intend.v.01 (intend as its lemma), mean.v.04 and think.v.07 convey an identical meaning of have in mind as a purpose while think.v.05 is defined as imagine or visualize. Also, many synsets are contextualized with one or more example sentences, e.g. I mean no harm for mean.v.04.
The synsets are organized into four groups according to their POS, namely noun (N), verb (V), adjective (A) and adverb (R). In most cases, synsets within each POS are connected by their own set of relations. There are over 15 relations for synsets, but many of them are defined only for synsets of a particular POS. For instance, the hypernymy and hyponymy relations are available only for nouns and verbs, while the entailment relation is valid for verbs alone. There is also a cross-POS relation in WordNet, 'derivationally related form'. As an example, intend.v.01 and intention.n.03 are derivationally related.
WordNet defines a coarse-grained sense category named super-sense, which arranges senses into 45 clusters including noun.person, noun.artifact and others, 26 of which are for nouns, 15 for verbs, 3 for adjectives and 1 for adverbs. Senses in the same category have a weak connection to each other.
Despite the notable contribution of synset glosses to many WSD systems, synset relations are more valuable, since they make it possible for machines to recognize synset connections. Here, we utilize WordNet relations for sense embedding enhancement (section 4.2) and a try-again mechanism (section 4.3).

BERT Utilization
BERT, a transformer-based language model, has attracted much attention from researchers of many NLP applications. In our research, we utilize BERT as a feature extraction model to learn a sense embedding for each WordNet sense using its gloss.
However, directly using a synset gloss to learn a sense embedding is problematic, since many synset glosses contain insufficient context for representation learning. Among others, the gloss for think.v.05 is imagine or visualize, which is too short to carry adequate information. Table 1 presents the average synset gloss length and lemma ambiguity for the four POS. It shows a relatively short gloss length for verb, adjective, and adverb synsets. To address this issue, we propose a gloss augmentation method (section 4.1) to bring in more context for those poorly contextualized synsets.

Table 1: Average synset gloss length and lemma ambiguity by POS.
               N     V     A     R
gloss length   11.5  6.2   7.2   5.0
ambiguity      1.4   2.6   1.6   1.3
In our final proposal, for each sense, we use BERT to learn its basic sense embedding from the concatenation of its gloss (and lemmas), example sentences and sentences retrieved from the web. In detail, we use BERT_LARGE_CASED as our feature extraction model and sum the output of its last 4 layers (a typical setting in previous research such as LMMS, Loureiro and Jorge, 2019) at all output positions. Figure 1 demonstrates the overall concept of the framework, without the try-again mechanism, on an example. It relies on a K-NN algorithm to predict the correct sense of each word under disambiguation. The algorithm compares a context representation (v_c, lighter grey circle), taken directly from BERT at the position of the word under disambiguation, with knowledge-enhanced representations (v̂_s, smaller blue circles) built from BERT and WordNet knowledge. The big blue circle briefly illustrates how related senses are merged into one specific sense (section 4.2). In this big circle, the grey circles are basic sense embeddings (v_s) learned from the synset's augmented gloss (section 4.1) via BERT.
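The layer-summing step above can be sketched in plain Python. This is a hedged, toy illustration: `hidden_states` stands in for the real BERT_LARGE_CASED output, and the way per-token vectors are pooled into a single sense vector (mean pooling here) is our assumption, since the paper only states that the last 4 layers are summed at all output positions.

```python
def sense_embedding(hidden_states, last_n=4):
    """Collapse per-layer, per-token BERT features into one sense vector.

    hidden_states: list of layers, each a list of per-token vectors
    (lists of floats).  The last `last_n` layers are summed position-wise,
    then the result is averaged over token positions (our assumption).
    """
    layers = hidden_states[-last_n:]
    num_tokens = len(layers[0])
    dim = len(layers[0][0])
    pooled = [0.0] * dim
    for tok in range(num_tokens):
        for d in range(dim):
            summed = sum(layer[tok][d] for layer in layers)
            pooled[d] += summed / num_tokens
    return pooled

# Toy stand-in for BERT output: 6 "layers", 2 tokens, 3 dimensions.
states = [[[float(l + t + d) for d in range(3)] for t in range(2)]
          for l in range(6)]
emb = sense_embedding(states)  # -> [16.0, 20.0, 24.0]
```

In the real system the input to this pooling would be the hidden states produced for the concatenated gloss, lemmas, example sentences and retrieved web sentences.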

WordNet Gloss Augmentation
In order to relieve the under-contextualization issue of many synsets, we propose a gloss augmentation approach to draw in more contextual information. Precisely, we simply use the short-length glosses as queries (words or phrases) to retrieve sequences from the web and combine the sequences with the original gloss and example sentences to learn a contextual representation from BERT. The whole process is built upon two hypotheses as follows.
(1) The words in the linguistic explanation of a synset tend to be less ambiguous and are often skewed towards the MFS/WordNet 1st sense. This is supported by the fact that more than 75% of WordNet gloss words are labeled with the MFS in the Princeton WordNet Gloss Corpus (Mihalcea and Moldovan, 2001).
(2) Phrases in a synset gloss are even less ambiguous. We calculate the proportion of polysemous phrase lemmas among all phrase lemmas in WordNet, which shows that only a small proportion, 13.9% (4,922 out of 46,470), are ambiguous.
Inspired by the above two hypotheses, we design a gloss augmentation method to retrieve sequences that contain gloss mentions. This is only operated on synsets whose gloss has fewer than 6 words, which are easier to apply rules to. We detail the procedure as follows: (1) For synsets whose gloss is shorter than 6 words, cut each gloss or compose the gloss words into one or more phrases under heuristic rules (split the gloss sentence into spans at ';'; segment each span at the location of 'or'); see Table 2. (2) Filter out retrieved sentences where the query's POS differs from the synset's if the query is a single word; extract the sub-sentence that includes the query, dropping the words before the query to reduce noise; filter out sentences that occur in the retrieved sentence sets of more than one competing synset of a lemma (e.g. think.v.01, think.v.02) to avoid overlap.
After the sequences (cf. Figure 2) are obtained, we combine them with each corresponding synset's gloss to learn a basic contextual representation.
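The query-building rules in step (1) above can be sketched as follows. This is a minimal version covering only the two rules stated explicitly (split at ';', segment at 'or', glosses under 6 words); the paper's full rule set, and the POS/overlap filtering of step (2), are richer than this.

```python
import re

def gloss_to_queries(gloss, max_words=6):
    """Turn a short WordNet gloss into search queries (words or phrases).

    Only glosses with fewer than `max_words` words are processed; the
    gloss is split into spans at ';' and each span is segmented at 'or'.
    """
    if len(gloss.split()) >= max_words:
        return []
    queries = []
    for span in gloss.split(';'):
        for part in re.split(r'\bor\b', span):
            part = part.strip()
            if part:
                queries.append(part)
    return queries

# Gloss of think.v.05 from the running example:
print(gloss_to_queries('imagine or visualize'))       # ['imagine', 'visualize']
print(gloss_to_queries('have in mind as a purpose'))  # 6 words -> []
```

Each resulting query would then be used to retrieve web sentences that mention it, which are appended to the synset's gloss before feeding BERT.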

Sense Embedding Enhancement
In this section, we introduce how WordNet relations are exploited to learn relation-enhanced sense embeddings. After each basic sense embedding is learned from its augmented gloss via BERT, it is further enhanced with a weighted sum of the basic sense embeddings of all its directly connected senses. Here, we use all the relations except verb_group, because this relation often connects competing senses, weakening the differences between them. The right portion of Figure 3 illustrates the sense embedding enhancement process for medicine.n.02.
The relations are categorized into two classes, named hyper_hypo (hypernymy and hyponymy) and other_relations, because the former class covers most of the connections in WordNet. We later experiment on how the utilization of these two classes of relations benefits the task of WSD.
Formula (1) details the sense embedding enhancement. Given all basic sense embeddings v_s, we enhance the embedding of sense s with the basic sense embeddings v_{s'} of all its directly connected senses N(s) (including sense s itself), obtained through the different WordNet relations, where d(s, s') is the shortest path distance between sense s and sense s':

    v̂_s = Σ_{s' ∈ N(s)} v_{s'} / (d(s, s') + 1)    (1)

Given the above enhanced sense embeddings, we calculate the similarity (dot product) between an ambiguous word's context embedding and the potential senses' enhanced embeddings after normalization. The disambiguation at the first attempt (1st WSD) is completed by selecting the potential sense with the highest similarity. The lemma and POS are utilized when retrieving the potential senses from WordNet.
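The enhancement and the 1st WSD step can be sketched together. This is a hedged illustration: the vectors, the toy relation graph and the 1/(d+1) weighting (with d(s, s) = 0 for the sense itself) follow our reading of Formula (1), and all sense names and values below are invented.

```python
import math

def enhance(sense, basic, neighbors):
    """Weighted sum of the basic embeddings of `sense` and its directly
    connected senses; `neighbors[sense]` maps each related sense to its
    shortest path distance from `sense`."""
    related = dict(neighbors.get(sense, {}))
    related[sense] = 0                      # the sense itself, distance 0
    out = [0.0] * len(basic[sense])
    for s2, d in related.items():
        w = 1.0 / (d + 1)
        for i, x in enumerate(basic[s2]):
            out[i] += w * x
    return out

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def first_wsd(context, candidates, basic, neighbors):
    """Pick the candidate whose (normalized) enhanced embedding has the
    highest dot product with the normalized context embedding."""
    c = normalize(context)
    scores = {}
    for s in candidates:
        e = normalize(enhance(s, basic, neighbors))
        scores[s] = sum(a * b for a, b in zip(c, e))
    return max(scores, key=scores.get), scores

basic = {'bank.n.01': [1.0, 0.0], 'bank.n.02': [0.0, 1.0],
         'riverside.n.01': [0.9, 0.1]}
neighbors = {'bank.n.01': {'riverside.n.01': 1}}
pred, scores = first_wsd([1.0, 0.2], ['bank.n.01', 'bank.n.02'],
                         basic, neighbors)
# pred -> 'bank.n.01'
```

In the real system the candidate list comes from WordNet, filtered by the ambiguous word's lemma and POS.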

Try-again Mechanism
In this section, we introduce the try-again mechanism, applied to the first and second most similar potential senses of every ambiguous word. It is based on an observation from the experimental results of the 1st WSD: after ranking the potential senses by the calculated similarity, 71.8% of the correct senses are ranked 1st, which is the F1 score of the 1st WSD. Furthermore, 16% of the correct senses are ranked 2nd, which means our system's top-2 performance is 87.8%. This motivated our experiment on whether synsets from different relations or the super-sense connection can benefit a 2nd WSD over merely the top-2 potential senses.
Algorithm 1 illustrates the detailed try-again mechanism, where both the 1st and 2nd WSD similarities are employed to select the final predicted sense. Precisely, for an ambiguous word w (with v_c as its contextual embedding), v̂_s is the enhanced sense embedding of one of its potential senses s, and N(s) is the set of all senses directly connected to s through the different WordNet relations except verb_group. In particular, if the top-2 potential senses belong to different super-sense categories, N(s) also contains all the senses that belong to the same super-sense as the potential sense. For instance, medicine.n.01 belongs to noun.cognition while medicine.n.02 is in the noun.artifact category. In other words, the final WSD approach utilizes both the sense embedding of the potential sense itself and those of its related senses from the WordNet relations and the super-sense category.
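Since Algorithm 1 itself is not reproduced in this text, the following is only one plausible reading of the re-scoring, offered as a hypothetical sketch: each of the top-2 senses is re-scored by adding, to its 1st WSD similarity, the best similarity between the context and any sense in its related set. All sense names, embeddings and scores below are invented for illustration.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def try_again(context, top2, first_sim, related, embeddings):
    """Hypothetical re-scoring of the top-2 candidates: combine the 1st
    WSD similarity with the best context similarity over each candidate's
    related senses (WordNet relations plus, when the two super-senses
    differ, all senses of the candidate's super-sense)."""
    c = normalize(context)
    best, best_score = None, float('-inf')
    for s in top2:
        rel = [dot(c, normalize(embeddings[r])) for r in related.get(s, [])]
        score = first_sim[s] + (max(rel) if rel else 0.0)
        if score > best_score:
            best, best_score = s, score
    return best

# Toy version of the "bell" example: the 1st WSD narrowly prefers the
# wrong sense, but a hyponym of the right sense matches the context well.
embeddings = {'fire_bell.n.01': [1.0, 0.1], 'toll.n.02': [0.0, 1.0]}
related = {'bell.n.01': ['fire_bell.n.01'], 'bell.n.03': ['toll.n.02']}
first_sim = {'bell.n.03': 0.62, 'bell.n.01': 0.60}
pred = try_again([1.0, 0.0], ['bell.n.03', 'bell.n.01'],
                 first_sim, related, embeddings)
# pred -> 'bell.n.01'
```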

SREF
We have implemented both knowledge-based and supervised versions of our system. SREF kb: the augmented gloss is used to learn a basic sense embedding from BERT by summing its last 4 layers at all output positions. Synset relations are then used to enhance each basic sense embedding. Finally, a nearest neighbor method matches every ambiguous word's context embedding against its potential senses' enhanced embeddings before the try-again mechanism.
SREF sup: SemCor is exploited to learn a supervised sense embedding for each labeled sense, using the approach proposed in LMMS (Loureiro and Jorge, 2019); however, the learned sense embeddings are not extended with WordNet relations, because we already have a knowledge-enhanced sense embedding learned from WordNet (section 4.2). We then concatenate each SREF kb sense embedding with the corresponding embedding learned from SemCor if the sense is labeled in SemCor, and otherwise with itself. Each context embedding is likewise concatenated with itself for dimension matching, since the dimension of each sense embedding has doubled.
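The concatenation rule above can be sketched in a few lines; all vectors here are toy values and the dictionary names are ours.

```python
def sup_sense_embedding(sense, kb_emb, semcor_emb):
    """SREF sup embedding: the knowledge-based embedding concatenated
    with its SemCor-learned counterpart when the sense is labeled in
    SemCor, and with itself otherwise."""
    base = kb_emb[sense]
    return base + semcor_emb.get(sense, base)   # list concatenation

def sup_context_embedding(context):
    # The context vector is concatenated with itself so its dimension
    # matches the doubled sense embedding dimension.
    return context + context

kb_emb = {'think.v.01': [0.2, 0.8], 'think.v.05': [0.7, 0.3]}
semcor_emb = {'think.v.01': [0.1, 0.9]}          # think.v.05 is unlabeled

labeled = sup_sense_embedding('think.v.01', kb_emb, semcor_emb)
# -> [0.2, 0.8, 0.1, 0.9]
unlabeled = sup_sense_embedding('think.v.05', kb_emb, semcor_emb)
# -> [0.7, 0.3, 0.7, 0.3]
ctx = sup_context_embedding([0.5, 0.5])          # -> [0.5, 0.5, 0.5, 0.5]
```

This is why no MFS fallback is needed: every WordNet sense gets a full-dimension vector whether or not it appears in SemCor.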

Systems for Comparison
We compare our experimental results with the state of the art in both the knowledge-based and supervised categories, among them the system of Scarlini et al. (2020). We also include two systems that became available after the submission of this paper, including BEM (Blevins and Zettlemoyer, 2020).

Results
In this section, an ablation study is first presented to illustrate how the proposed factors contribute to the final WSD performance, and a test-set example is given for the try-again mechanism. Then, we compare our systems' performance on all-words and lexical sample datasets with state-of-the-art systems. We also demonstrate how the number of labeled sentences in SemCor affects the performance of SREF sup and LMMS. Finally, we experiment on how the knowledge-enhanced sense embeddings can benefit several similarity-calculating and ranking tasks with simple attempts.

Table 3 shows the ablation analysis of SREF kb on the combined dataset and its POS portions, demonstrating the contribution of each proposed factor. In detail, gloss augmentation boosts the system's performance by 1 F1, equal to the contribution of other_relations, which are manually defined in WordNet. This reveals the potential of such fine-grained WordNet gloss utilization and suggests that employing more curated resources, such as Wikipedia rather than web mentions, deserves further investigation. Another noteworthy observation is that the sense embedding enhancement harms adverb disambiguation performance.

Figure 4 provides an example of how the try-again mechanism in SREF kb selects the correct sense of bell. Here, the word is first falsely predicted as bell.n.03, which means the sound of a bell, rather than bell.n.01, a hollow device made of metal. The try-again mechanism detects a sense more similar to the word's context, fire_bell.n.01, which is a hyponym of bell.n.01. In this case, the hyponymy relation helps the system correctly disambiguate bell. There are also other cases where the super-sense relation contributes.

Table 4 illustrates how different systems perform on the standard WSD datasets separately (SE2, SE3, SE07, SE13, and SE15) and on their combined dataset (ALL). It also shows those systems' performance on ALL from a POS perspective.

All-words WSD
For dataset-level performance, our relation-enhanced system, SREF kb, achieves new state-of-the-art performance among systems in the same category, surpassing the previous best system by 5.5 F1.
When our relation-enhanced sense embeddings are combined with the supervised sense embeddings learned from SemCor, our system (SREF sup) also obtains new state-of-the-art performance among supervised systems, beating GlossBERT by 1 F1. GlossBERT utilizes SE07 as a development set and tunes parameters on it; it is the first supervised system to exceed 70 F1 on SE07. SREF sup, in contrast, requires no parameter tuning and reaches 72.1 F1 on SE07. It is also worth noting that SREF sup outperforms LMMS, a similar system, by almost 2.5 F1, revealing the tremendous benefits of explicitly exploiting WordNet sense relations.
Our systems also obtain state-of-the-art results in terms of POS disambiguation in both categories, achieving advantageous performance on the more ambiguous word types (cf. Table 1), including verbs, adjectives and nouns.

Lexical Sample WSD
We also conduct experiments on the English lexical sample tasks. For a fair comparison, we use the associated training dataset instead of SemCor to learn the supervised sense embeddings.
As shown in Table 5, SREF sup obtains new state-of-the-art performance on the lexical sample tasks, although the margin over the previous best performance is relatively small. NN-CWEs and GLU are systems that employ BERT as a feature-extraction tool for their supervised learning framework but neglect WordNet sense knowledge. Therefore, although these systems perform well on senses with sufficient labeled training data, they lack the generalization ability to disambiguate rare or unseen senses. This is typically illustrated by their SE07 performance.

Performance on Rare Senses
In addition to the above experiments, we also examine how our system performs on those senses that are ranked first in WordNet (MFS) versus the others (LFS, least frequent senses) in the ALL dataset. We compare our results with those provided by EWISE, a zero-shot WSD system that makes use of sense glosses and relations in WordNet. EWISE has an overwhelming advantage in disambiguating unseen or rare senses and thus achieves much better results on LFS disambiguation. However, our systems (SREF kb, SREF sup) perform better still on LFS, although the margin over LMMS is not significant. Table 6 shows the performance on MFS and LFS for the different systems. Although EWISE surpasses BiLSTM (Raganato et al., 2017a) on LFS disambiguation by a large margin, our supervised system still beats EWISE's performance by over 20 F1 while maintaining competitive performance on MFS disambiguation. This demonstrates our system's generalization ability in disambiguating rare senses.

Figure 5 demonstrates how the number of utilized SemCor sentences influences the performance of SREF sup, LMMS and the sense embeddings learned from BERT and SemCor. For stable performance, we fix the sentence order in SemCor and incrementally extract a proportion of sentences, performing the experiments with a 10% step size. Even with 10% of the labeled data, SREF sup outperforms LMMS with the full labeled data by 0.5 F1. Furthermore, SREF sup obtains a new state-of-the-art result with only 20% of the labeled data.

Sense Embedding Application
To reveal the potential of SREF sup sense embeddings for other tasks, we experiment with three similarity-based tasks: SemEval-2017 Semantic Textual Similarity (Cer et al., 2017; SE17-STS-en-en) and SemEval-2017 Task 3, Subtask A and Subtask B (Nakov et al., 2017; SE17-Task3-SubtaskA and SubtaskB). The similarity calculation uses either BERT embeddings alone or BERT embeddings concatenated with the sum of the SREF sup sense embeddings obtained after disambiguating the text. The whole process is conducted in an unsupervised manner. Table 7 shows that the utilization of sense embeddings is beneficial to these tasks. Nonetheless, a more plausible approach might be to utilize sense embeddings in a supervised framework, which requires further exploration.

Error Analysis
To conduct the error analysis from a general perspective, we calculate the average ambiguity level (total number of potential senses divided by total number of ambiguous words) of the words our system disambiguates correctly and falsely: 5.1 and 8.4, respectively. At a more detailed level, among the falsely disambiguated words, many competing senses are highly ambiguous and similar, and even their super-senses are hard to distinguish. For example, in 'The medicine can only be obtained with a prescription' from SE15, the correct and predicted senses for prescription are so similar that algorithms that cannot spot the gloss focus (instruction or drug) would fail; the sense embedding would need to carry separate information about what the object is and what features it has.
Correct -written instructions from a physician or dentist to a druggist concerning the form and dosage of a drug to be issued to a given patient.
Predicted -a drug that is available only with written instructions from a dentist to a pharmacist.

Conclusion
We have introduced SREF, a synset relation-enhanced framework with a try-again mechanism, which takes into account WordNet relations and augments WordNet glosses with mentions from the web under simple hypotheses and rules. Empirical experiments have demonstrated the effectiveness of SREF from both knowledge-based and supervised perspectives, obtaining major and minor improvements over previous state-of-the-art performance, respectively.
For future work, we intend to scale SREF kb to a multilingual version and explore the possibilities of using the multilingual WordNet so that the abundant knowledge available for English can be transferred to other languages. It is also worth investigating how to better incorporate sense embeddings into other downstream tasks.