Improving Word Sense Disambiguation with Translations

It has been conjectured that multilingual information can help monolingual word sense disambiguation (WSD). However, existing WSD systems rarely consider multilingual information, and no effective method has been proposed for improving WSD by generating translations. In this paper, we present a novel approach that improves the performance of a base WSD system using machine translation. Since our approach is language independent, we perform WSD experiments on several languages. The results demonstrate that our methods can consistently improve the performance of WSD systems, and obtain state-of-the-art results in both English and multilingual WSD. To facilitate the use of lexical translation information, we also propose BABALIGN, a precise bitext alignment algorithm which is guided by multilingual lexical correspondences from BabelNet.


Introduction
Word sense disambiguation (WSD) is one of the core tasks in natural language processing. Given a predefined sense inventory, a WSD system aims to identify the correct sense of a content word in context. Although WSD is a monolingual task, it has been conjectured that multilingual information could help (Resnik and Yarowsky, 1999; Carpuat, 2009). Attempts have been made to leverage parallel corpora for sense tagging (Diab and Resnik, 2002), but no effective method for improving WSD with translations has been proposed to date.
Much of the history of WSD has been determined by the availability of manually created lexical resources in English, including SemCor (Miller et al., 1994) and WordNet (Miller, 1995). The situation changed with the introduction of BabelNet (Navigli and Ponzetto, 2012a), a massive multilingual semantic network, created by automatically integrating WordNet, Wikipedia, and other resources. In particular, BabelNet synsets contain translations in multiple languages for each individual word sense. Methods have been proposed to use multilingual information in BabelNet for WSD (Navigli and Ponzetto, 2012b; Apidianaki and Gong, 2015), but they do not directly exploit the mapping between senses and translations in multiple languages.
While there have been many attempts to apply WSD to machine translation (MT) (Liu et al., 2018; Pu et al., 2018), our goal instead is to harness advances in MT to improve WSD. Rather than develop a new WSD system, we propose a general method that can make existing and future systems more accurate by leveraging translations. We evaluate our methods with several supervised and knowledge-based WSD systems.
Our approach is based on the assumption of absolute synonymy between the senses of mutual translations in context. The principal method, SOFTCONSTRAINT, refines sense predictions of a given base WSD system using sense-translation mappings from BabelNet. The approach is able to take advantage of translations in multiple languages, whether produced manually or by MT models. It is also able to leverage sense frequency information, which can be obtained in either a supervised or unsupervised manner. Another method that we test is t_emb, which integrates translations as contextual word embeddings into a WSD system to bias its sense predictions. To obtain word-level translations from the translated contexts, we introduce BABALIGN, a precise alignment algorithm guided by BabelNet synsets. Figure 1 shows the entire architecture of our model based on the aforementioned components.
Our experimental results demonstrate that translations can significantly improve existing WSD systems. We perform several experiments on English and multilingual WSD with both manual and machine translations. In the English WSD experiments with manual translations and word-level alignments, we determine the potential of our methods in an ideal situation. In the experiments with machine translations, we validate that the methods are effective and robust by showing improvements over existing WSD systems. Finally, in the multilingual WSD experiments, we demonstrate the language independence of our methods.
The main contributions of this work are the following. (1) We propose the first effective method to improve WSD with automatically generated translations.
(2) Our language-independent knowledge-based method achieves state-of-the-art results in both English all-words and multilingual WSD.
(3) We introduce a bitext alignment algorithm that leverages information from BabelNet.

Related Work
The integration of multilingual information to improve English WSD has been considered in prior work. Through analyzing a multilingual dictionary, Resnik and Yarowsky (1999) observe that highly distinct senses can translate differently. Diab and Resnik (2002) propose a WSD system based on translation information extracted from a bitext, but it fails to outperform systems that rely on monolingual information only.
Word sense induction (WSI) and cross-lingual WSD (CLWSD) are related tasks. WSI aims to automatically induce word senses from corpora by clustering similar instances of words. Several prior works perform WSI based on bitexts to create a bilingual sense inventory on word samples, where translations are treated as sense tags (Specia et al., 2007; Apidianaki, 2009). CLWSD is the task of predicting a set of translations for a given ambiguous word in context. Attempts have been made to integrate translations as bag-of-words feature vectors to enhance CLWSD (Lefever et al., 2011). Since the goals of WSI and CLWSD differ from standard WSD with predefined senses, our approach is not directly comparable.
Navigli and Ponzetto (2012b) incorporate translations in BabelNet synsets as a feature in a graph-based WSD system. However, rather than apply translations of the focus word token as constraints, they simply consider all possible translations of the focus word type to enhance its sense distinctions. Apidianaki and Gong (2015) directly apply sense-translation mappings in BabelNet as a hard constraint on sense predictions using translations from sense-annotated bitexts. Unlike our work, their approach is based on the BabelNet First Sense (BFS) baseline, rather than on an actual WSD system. Their results on English WSD fail to show improvement over the baseline, which may be due to the use of only a single target language, as well as word alignment errors.

Methods
We first formulate our WSD task. The input is a sentence, in which one word, e, is designated as the focus word. The set of possible senses of the focus word S(e) comes from the sense inventory. We assume that a base WSD system assigns a probability or score to each sense, with the output being the sense with the maximum score. The objective is to determine which sense s ∈ S(e) is the sense of e in this sentence.
We propose two methods, HARDCONSTRAINT and SOFTCONSTRAINT, which can be used to augment any base WSD system that meets the above specifications. Both methods leverage translations in order to constrain sense predictions made by a base WSD system. In addition, we introduce a method of leveraging contextual word embeddings to enhance the integration of translations in combination with those constraints. Finally, since our methods crucially depend upon identifying the translation of the focus word in the translated sentence, we also introduce a new knowledge-based word alignment algorithm.

HARDCONSTRAINT
Our first method extends the idea of Apidianaki and Gong (2015) to constrain S(e) based on sense-translation mappings in BabelNet. However, instead of relying on a single translation, we incorporate multiple languages by taking the intersection of the individual sets of senses; that is, we rule out senses if their corresponding BabelNet synsets do not contain translations from all target languages. This baseline method is simple but inflexible: the correct sense can be accidentally ruled out if the provided translation of the focus word is not found in the corresponding BabelNet synset.
Our implementation of HARDCONSTRAINT considers the intersection of the sets of synsets that contain translations from each language. Ideally, the intersection contains exactly one sense, which we take as the final prediction. (Such a case is illustrated in Figure 2.) Otherwise, if the intersection contains multiple senses, we choose the one with the highest score from the base WSD system. If the intersection happens to be empty, we back off to the prediction of the base WSD system.
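The decision procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the released implementation: the `SynsetIndex` structure is a hypothetical stand-in for a BabelNet lookup, and the sense identifiers and toy data are invented.

```python
from typing import Dict, Set, Tuple

# Hypothetical stand-in for a BabelNet lookup: sense id -> set of
# (language, lemma) pairs lexicalized in the corresponding synset.
SynsetIndex = Dict[str, Set[Tuple[str, str]]]


def hard_constraint(
    base_scores: Dict[str, float],
    translations: Dict[str, str],
    synsets: SynsetIndex,
) -> str:
    """Pick a sense by intersecting the senses licensed by each translation."""
    # Prediction of the unmodified base system (ties broken by sense id).
    base_prediction = max(sorted(base_scores), key=base_scores.get)

    # Keep only senses whose synset contains the translation in every language.
    surviving = set(base_scores)
    for lang, lemma in translations.items():
        surviving &= {
            s for s in base_scores if (lang, lemma) in synsets.get(s, set())
        }

    if not surviving:
        # Empty intersection: back off to the base system.
        return base_prediction
    # One or more survivors: take the one the base system scores highest.
    return max(sorted(surviving), key=base_scores.get)


# Toy data, invented for illustration only.
toy_scores = {"bank#river": 0.55, "bank#institution": 0.45}
toy_synsets = {
    "bank#river": {("fr", "rive"), ("it", "riva")},
    "bank#institution": {("fr", "banque"), ("it", "banca")},
}
print(hard_constraint(toy_scores, {"fr": "banque", "it": "banca"}, toy_synsets))
```

Note how a single translation missing from the toy index would empty the intersection and silently restore the base prediction, which is the inflexibility discussed above.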

SOFTCONSTRAINT
HARDCONSTRAINT is effective at ruling out sense candidates, but is also sensitive to MT errors and BabelNet deficiencies. BabelNet contains translations for only 79% of the nominal senses in WordNet, and its multilingual lexicalizations have an average precision of only 72% (Navigli and Ponzetto, 2012a).
Our principal method, SOFTCONSTRAINT, is more robust in handling noisy MT translations and BabelNet gaps. It integrates information from three sources: the base WSD system, translations, and sense frequencies (Figure 2). From each of these sources, we derive a probability distribution over S(e). We employ the product of experts (PoE) approach (Hinton, 2002) to combine the probabilities as follows:
$$\mathrm{score}(s) = p_{wsd}(s)^{\alpha} \cdot p_{trans}(s)^{\beta} \cdot p_{freq}(s)^{\gamma}$$
The resulting score is an unnormalized measure of probability with tunable weights α, β, and γ. We tune those weights through grid search. The sense that maximizes this measure is taken as the prediction. Below, we provide the details on each of the three distributions.
Probability p_wsd is obtained by simply normalizing the numerical scores from the base WSD system.
Probability p_trans is calculated on the basis of the set of translations for each focus word e in BabelNet. Given a source focus word e and a word f in another language, we obtain its sense coverage c(e, f), representing the number of possible senses of e that are mapped to f, i.e., the number of BabelNet synsets containing both e and f. Based on the sense coverage, the word pair e and f is assigned a weight w(e, f) that reflects its discrimination power. Now, we consider f to be a translation t_L(e) for e in a target language L ∈ ℒ, where ℒ stands for the set of target languages. The score of a candidate sense s ∈ S(e) is then the sum of weights of the translations that are found in the corresponding BabelNet synset BN(s):
$$\mathrm{score}_{trans}(s) = \sum_{L \in \mathcal{L}} w(e, t_L(e)) \cdot \mathbb{1}_{BN(s)}(t_L(e))$$
where 1_BN(s)(t_L(e)) is an indicator function that becomes 1 if t_L(e) ∈ BN(s) and 0 otherwise. As with p_wsd, we normalize the scores into a proper probability distribution p_trans over the set of senses.
To avoid zero values in p_trans, we perform smoothing by adding a small positive value (a tunable parameter).
Probability p_freq represents the sense frequency information for a given lemma and part-of-speech (POS). This information is also used by most WSD systems. For English, we obtain sense frequencies from WordNet, which derives such information from SemCor, a sense-annotated corpus. To handle senses with zero frequency in SemCor, we also apply additive smoothing. To obtain p_freq for languages other than English, which lack large, high-quality sense-annotated corpora, we use CluBERT, the state-of-the-art system for unsupervised sense distribution learning, which applies a clustering algorithm to contextual embeddings from BERT (Devlin et al., 2019). Like our methods, CluBERT is language independent, has no additional training data requirements, and has been successfully integrated into WSD systems to improve their performance. Figure 2 illustrates how SOFTCONSTRAINT combines the three probability distributions to correct an incorrect sense prediction produced by a base system.
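To make the combination concrete, the following Python sketch builds a smoothed translation distribution and applies the weighted product of experts. It is a hedged illustration rather than the authors' code. In particular, the weight function is an assumption: the sketch uses inverse sense coverage, 1 / c(e, f), purely as one plausible choice, and the `SynsetIndex` structure, the smoothing constant, and the toy data are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical stand-in for BabelNet: sense id -> set of (language, lemma).
SynsetIndex = Dict[str, Set[Tuple[str, str]]]


def normalize(scores: Dict[str, float], smoothing: float = 0.0) -> Dict[str, float]:
    """Add a constant to every score, then rescale so the values sum to 1."""
    smoothed = {s: v + smoothing for s, v in scores.items()}
    total = sum(smoothed.values())
    if total == 0:
        return {s: 1.0 / len(smoothed) for s in smoothed}
    return {s: v / total for s, v in smoothed.items()}


def translation_distribution(
    senses: Iterable[str],
    translations: Dict[str, str],
    synsets: SynsetIndex,
    smoothing: float = 0.1,
) -> Dict[str, float]:
    """Sum translation weights per sense, then smooth and normalize."""
    senses = list(senses)
    raw = {s: 0.0 for s in senses}
    for lang, lemma in translations.items():
        covered = [s for s in senses if (lang, lemma) in synsets.get(s, set())]
        if not covered:
            continue  # translation not found in any candidate synset
        # ASSUMPTION: inverse sense coverage as the discrimination weight.
        weight = 1.0 / len(covered)
        for s in covered:
            raw[s] += weight
    return normalize(raw, smoothing)


def soft_constraint(
    p_wsd: Dict[str, float],
    p_trans: Dict[str, float],
    p_freq: Dict[str, float],
    alpha: float,
    beta: float,
    gamma: float,
) -> str:
    """Return the sense maximizing the weighted product of the three experts."""
    score = {
        s: p_wsd[s] ** alpha * p_trans[s] ** beta * p_freq[s] ** gamma
        for s in p_wsd
    }
    return max(sorted(score), key=score.get)


# Toy example: the base system prefers sense "A", the translations prefer "B".
toy_synsets = {"A": {("fr", "x")}, "B": {("fr", "x"), ("de", "y")}}
p_wsd = {"A": 0.6, "B": 0.4}
p_trans = translation_distribution(p_wsd, {"fr": "x", "de": "y"}, toy_synsets)
p_freq = {"A": 0.5, "B": 0.5}
print(soft_constraint(p_wsd, p_trans, p_freq, alpha=1.0, beta=1.0, gamma=0.0))
```

Setting an exponent to 0 turns the corresponding expert into a constant factor of 1, which is how a component such as p_freq can be switched off without changing the rest of the procedure.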

Contextual Word Embeddings
Recent work has demonstrated the utility of contextual word embeddings for NLP tasks (Peters et al., 2018; Devlin et al., 2019). Accordingly, WSD systems such as SENSEMBERT take a contextual embedding of the focus word as input, in order to leverage its dense encoding of relevant local information, which may be used to determine the correct sense.
In this section, we propose a method of adding translation information to the input of a WSD system by modifying the contextual embedding of the focus word to reflect its translation. We refer to this method as t_emb. Note that this method can be combined with either the HARDCONSTRAINT or SOFTCONSTRAINT methods. Unlike the constraint-based methods, which use translations of the focus word to post-process the output of a WSD system, t_emb provides the translation information in the form of an embedding directly as input to the WSD system. Thus, translation information is used as an additional feature to improve sense predictions of the base WSD system.
As before, our approach is to translate the context of the focus word, and use word alignment to identify the translation of the focus word. We compute a contextual embedding of this translation, just as we did for the focus word itself, and then concatenate the two embeddings. This produces a new embedding that can be provided to a base WSD system in place of the focus word embedding alone. However, since not all WSD systems use contextual embeddings, this method is less general, and we only apply it to some of our models and evaluation experiments.

Translation Alignment
The effectiveness of our approach for improving WSD depends on the correct identification of the word-level translations in each language. Even when the sentential context of the focus word is correctly rendered in another language, both HARDCONSTRAINT and SOFTCONSTRAINT rely on the proper alignment between the source focus word and its translation, which may be composed of multiple word tokens. Although attention weights in some NMT systems may be used to derive word alignment, such an approach is not necessarily more accurate than off-the-shelf alignment tools (Li et al., 2019). Therefore, our approach is to instead identify the word-level translations by performing a bitext-based alignment between the source focus words and their translations.
During development, we found that the accuracy of alignment tools such as FASTALIGN (Dyer et al., 2013) is limited by the size of the aligned bitext, as well as the lack of access to the translation information which is present in BabelNet. To mitigate these issues, we introduce a knowledge-based word alignment algorithm, BABALIGN, that leverages translation information in BabelNet by post-processing the output of an off-the-shelf word aligner. BABALIGN is shown to be more effective than existing word aligners in downstream tasks such as cross-lingual lexical entailment. We first append our translated WSD data to a large lemmatized bitext. We further augment the bitext with the BabelNet translations for all WSD focus words. We then run the base aligner in both translation directions, and take the intersection of the two sets of alignment links. In its final stage, BABALIGN leverages the BabelNet translation pairs again, to post-process the generated alignment.
Algorithm 1 summarizes BABALIGN. The algorithm takes as input a source-language sentence and a target-language sentence, as well as the set of translations for each content word in the source sentence. As BABALIGN is an alignment post-processing algorithm, its input is the alignment of the two sentences from a base aligner.
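The symmetrization step, in which the aligner is run in both directions and only the links found in both runs are kept, can be sketched as follows. The link representation (pairs of token indices) is an assumption made for illustration; actual aligner output formats differ.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def symmetrize_by_intersection(
    src_to_tgt: Set[Link], tgt_to_src: Set[Tuple[int, int]]
) -> Set[Link]:
    """Keep only links proposed by both alignment directions.

    `tgt_to_src` holds (target index, source index) pairs, as produced by
    running the aligner with the two languages swapped.
    """
    flipped = {(s, t) for (t, s) in tgt_to_src}
    return src_to_tgt & flipped


forward = {(0, 0), (1, 2), (2, 1)}
backward = {(0, 0), (2, 1), (3, 2)}
print(sorted(symmetrize_by_intersection(forward, backward)))
```

Intersection favors precision over recall: links are dropped unless both directions agree, which leaves unaligned tokens for a later post-processing stage to recover.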
If a source word w_s is aligned to a word w_t which is one of its translations, the alignment is considered correct. Since a possible translation may be composed of multiple words (e.g., French translation salle d'audience for courtroom), we attempt to expand a partial alignment by considering the adjacent word tokens. This is achieved by invoking compound_search, which takes the aligned token pair (w_s, w_t) and returns the longest sequence of target tokens c such that bn(w_s) contains c, c contains w_t, and c does not contain any target tokens (except w_t) that are aligned by the base aligner. If no such compound is found, compound_search simply returns w_t, so no change in the alignment will be made.
On the other hand, if the source word w_s is aligned to a target word which is not among its translations, we invoke bnlex_search, which returns the longest sequence of target tokens l such that bn(w_s) contains l, and l does not contain any tokens that are already aligned. Intuitively, this is an attempt to "repair" an incorrect alignment by searching for an unaligned target word which is known to be a translation of w_s. If such an l can be found (i.e., l ≠ None), the alignment is modified so that w_s is aligned to l.
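The two searches described above can be sketched in Python as follows. This is a simplified, hypothetical rendering, not the released BABALIGN code. It assumes a one-to-one base alignment given as a dictionary of token indices, represents BabelNet translations as tuples of lemmatized tokens, and interprets "w_t is one of its translations" as "w_t is a token of some listed translation" so that partial alignments to multi-word translations can be expanded.

```python
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) over target tokens
Lexicon = Dict[str, Set[Tuple[str, ...]]]  # source lemma -> translations


def matching_spans(
    tgt: List[str], translations: Set[Tuple[str, ...]], blocked: Set[int]
) -> List[Span]:
    """Target spans that spell out a known translation and avoid blocked
    indices, ordered longest first (leftmost first among equal lengths)."""
    spans = [
        (start, end)
        for start in range(len(tgt))
        for end in range(start + 1, len(tgt) + 1)
        if tuple(tgt[start:end]) in translations
        and blocked.isdisjoint(range(start, end))
    ]
    return sorted(spans, key=lambda sp: (sp[0] - sp[1], sp[0]))


def compound_search(
    tgt: List[str], translations: Set[Tuple[str, ...]], t_idx: int, aligned: Set[int]
) -> Span:
    """Expand the aligned token to the longest translation span containing it."""
    blocked = aligned - {t_idx}  # other aligned tokens may not be absorbed
    for start, end in matching_spans(tgt, translations, blocked):
        if start <= t_idx < end:
            return (start, end)
    return (t_idx, t_idx + 1)  # no compound found: keep the original link


def bnlex_search(
    tgt: List[str], translations: Set[Tuple[str, ...]], aligned: Set[int]
) -> Optional[Span]:
    """Longest span of unaligned target tokens that is a known translation."""
    spans = matching_spans(tgt, translations, aligned)
    return spans[0] if spans else None


def babalign(
    src: List[str], tgt: List[str], base_alignment: Dict[int, int], bn: Lexicon
) -> Dict[int, Span]:
    """Post-process a base alignment (source index -> target index)."""
    aligned = set(base_alignment.values())
    refined: Dict[int, Span] = {}
    for s_idx, t_idx in base_alignment.items():
        translations = bn.get(src[s_idx], set())
        # ASSUMPTION: a token of any listed translation counts as a match.
        if any(tgt[t_idx] in tr for tr in translations):
            refined[s_idx] = compound_search(tgt, translations, t_idx, aligned)
        else:
            repaired = bnlex_search(tgt, translations, aligned)
            refined[s_idx] = repaired if repaired is not None else (t_idx, t_idx + 1)
    return refined


src = ["the", "courtroom", "be", "silent"]
tgt = ["le", "salle", "d'audience", "être", "silencieux"]
bn = {"courtroom": {("salle", "d'audience")}, "silent": {("silencieux",)}}
print(babalign(src, tgt, {1: 1, 3: 4}, bn))
```

In the toy run, the link from courtroom to salle is expanded to the two-token span covering salle d'audience, while the already correct link for silent is left unchanged. The sketch checks blocked tokens against the base alignment only; whether repaired links should also block later searches is a design choice the description above leaves open.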

Word Alignment Evaluation
Algorithm 1 BABALIGN
Input: list of all source tokens, σ_s = (w_s1, ..., w_sl)
       list of all target tokens, σ_t = (w_t1, ..., w_tm)
       BabelNet translations, bn(w_s) = {l_1, ..., l_n}
A ← BaseAligner(σ_s, σ_t)
for each aligned word pair (w_s, w_t) ∈ A do
    if w_t ∈ bn(w_s) then
        c ← compound_search(w_s, w_t)
        Modify A such that w_s aligns to c.
    else
        l ← bnlex_search(w_s)
        if l ≠ None then
            Modify A such that w_s aligns to l.
return A

To show the effectiveness of BABALIGN, which combines an existing word aligner with translations from BabelNet, we evaluate the alignment performance using parallel datasets with gold alignment. We employ FASTALIGN as the base aligner. As the evaluation datasets, we use SemCor 3.0 and its translations, Multi SemCor (MSC) (Bentivogli and Pianta, 2005) and Japanese SemCor (JSC) (Bond et al., 2012), to evaluate English-Italian and English-Japanese alignment, respectively. Both MSC and JSC contain manually annotated gold alignment for a subset of the sense-annotated content words in SemCor. We extract all English, Italian, and Japanese sentence triples where an English token has gold alignments on both the Italian and Japanese sides. We obtain 639 sentence triples with 2602 aligned tokens. We only evaluate the alignment performance for those 2602 sense-annotated tokens, and do not consider the alignment for other tokens, because our purpose here is to obtain proper translations for test words in the WSD setting.
For SemCor, we continue to use the included tokenization, lemma, and POS information. For MSC and JSC, we do not use the tokenization, lemma, and POS information provided in the data, in order to emulate the setting where we generate translations for monolingual WSD datasets. Instead, for MSC, JSC, and the additional bitexts, we employ morphological taggers to perform pre-processing: TreeTagger (Schmid, 1994) for Italian and MeCab (Kudo, 2005) for Japanese. The additional bitexts that we append to the data are from OpenSubtitles2018 (Lison and Tiedemann, 2016): English-Italian (37.8M sentences) and English-Japanese (2.2M sentences). We evaluate alignment performance in terms of whether the lemma of the aligned translation corresponds to the lemma of the manually aligned translation in MSC or JSC.
Table 1 compares the alignment approaches. As expected, the concatenation of a large bitext to the test data (+OpenSub) dramatically reduces the number of errors. The addition of translation pairs from BabelNet (+pairs) yields further gains. BABALIGN itself improves the quality of the alignment on English-Japanese by nearly 10 points.
The improvement on English-Italian is smaller, as the alignment between similar languages is easier, and the additional bitext is much larger. Japanese is particularly challenging, not only because it is typologically different, but also due to the frequency of multi-character compounds. The back-off strategy used by BABALIGN effectively leverages possible translations in BabelNet to recover tokenized compounds and missing alignment links. This mitigates the effect of alignment errors on our WSD results, which we describe in the next section.

WSD Evaluation
In this section, we first describe the WSD systems that we use in our experiments. We then show how our methods can improve existing WSD systems in the oracle setting for English all-words WSD. Finally, we report the results of the experiments on multilingual WSD and English all-words WSD with automatic translations.

WSD Systems
There are two main approaches to WSD: supervised and knowledge-based. Supervised systems are trained on sense-annotated corpora and generally outperform knowledge-based systems. On the other hand, knowledge-based systems usually apply graph-based algorithms to a semantic network and thus do not require any sense-annotated corpora. Since it is expensive to obtain manually sense-annotated corpora and such corpora exist mainly in English, it is often impractical to apply supervised systems to multilingual settings. Therefore, for multilingual WSD, knowledge-based approaches are typically employed.
Many effective WSD systems have been proposed; we include here only the systems that we use in our experiments. IMS (Zhong and Ng, 2010) is a canonical supervised WSD system, which uses support vector machines with various lexical features. LMMS (Loureiro and Jorge, 2019) leverages contextual word embeddings and surpasses the long-standing 70% F-score ceiling for supervised WSD. It learns supervised sense embeddings by applying BERT to SemCor, with additional semantic knowledge from WordNet. Among the knowledge-based systems, Babelfy (Moro et al., 2014) applies random walks with restarts to BabelNet to perform WSD and entity linking. Even though Babelfy is based on BabelNet, it does not make direct use of the translation information in BabelNet. Similarly, UKB (Agirre et al., 2014, 2018), which is based on personalized PageRank on WordNet, achieves state-of-the-art performance on English all-words WSD. Finally, utilizing contextual embeddings, SENSEMBERT learns knowledge-based multilingual sense embeddings obtained by combining representations learned using BERT with knowledge obtained from BabelNet. This yields state-of-the-art results on English nouns WSD and multilingual WSD. We test these systems both without modification, and with the addition of our knowledge-based methods, to measure how much improvement can be obtained by leveraging translations.

Oracle WSD Experiments
Our first set of experiments aims at estimating the upper limits of our approach in an oracle setting of annotated and aligned bitexts with high-quality human translations.
Experimental Setup Our sense-annotated bitexts are MSC and JSC (Section 4), which contain manual translations of texts from SemCor. As in Section 4, we use 639 sentences with 2602 sense-annotated instances from MSC and JSC. We randomly sample 10% of the instances as the development set. We tune all parameters on the development set, and use the same hyperparameters throughout the experiment.
We employ two knowledge-based WSD systems: Babelfy and UKB. Both systems have variants that take advantage of sense frequency information in WordNet. Babelfy backs off to the WordNet first sense (WN1st) using a fixed confidence threshold, which we set to 0.8 following Moro et al. (2014). UKB uses complete sense frequency distributions, which are referred to as the dictionary weight (dict_weight). We use the same parameter settings as Agirre et al. (2018). For fair comparison, when applying SOFTCONSTRAINT to a system variant without sense frequency information, we set γ to 0 to turn off the p_freq component.
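As a rough illustration of how the SOFTCONSTRAINT weights might be tuned on a development set, the sketch below runs an exhaustive grid search and fixes γ to 0 when sense frequencies are to be ignored. The grid values, the use of accuracy as the tuning objective, and the toy instances are assumptions made for the example and are not taken from the experiments.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Dist = Dict[str, float]
# One development instance: (p_wsd, p_trans, p_freq, gold sense).
Instance = Tuple[Dist, Dist, Dist, str]


def predict(p_wsd: Dist, p_trans: Dist, p_freq: Dist,
            alpha: float, beta: float, gamma: float) -> str:
    """Weighted product of experts; ties are broken by sense id."""
    score = {
        s: p_wsd[s] ** alpha * p_trans[s] ** beta * p_freq[s] ** gamma
        for s in p_wsd
    }
    return max(sorted(score), key=score.get)


def grid_search(dev: List[Instance], grid: Sequence[float],
                use_freq: bool = True) -> Tuple[Tuple[float, float, float], float]:
    """Return the first weight triple with the highest development accuracy."""
    gammas = grid if use_freq else (0.0,)  # gamma = 0 switches p_freq off
    best_weights, best_acc = (0.0, 0.0, 0.0), -1.0
    for alpha, beta, gamma in product(grid, grid, gammas):
        correct = sum(
            predict(w, t, f, alpha, beta, gamma) == gold for w, t, f, gold in dev
        )
        acc = correct / len(dev)
        if acc > best_acc:
            best_weights, best_acc = (alpha, beta, gamma), acc
    return best_weights, best_acc


# Toy development set, invented for illustration only.
uniform = {"A": 0.5, "B": 0.5}
dev = [
    ({"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}, uniform, "B"),
    ({"A": 0.7, "B": 0.3}, {"A": 0.5, "B": 0.5}, uniform, "A"),
]
print(grid_search(dev, grid=(0.0, 1.0), use_freq=False))
```

A finer grid, or a different selection criterion such as F-score, would slot into the same loop.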

Results
The results in Table 2 demonstrate the efficacy of leveraging translations for WSD. The systems without sense frequency information are boosted by 15-18%, while the systems with full features get up to 9% absolute improvement. Also, SOFTCONSTRAINT consistently outperforms HARDCONSTRAINT. The modest improvement of 1% on Babelfy is due to the base system falling back on the WN1st sense in about 80% of test instances, precluding the use of translations. Also, we observe that our approach is effective in combining translations from multiple languages. For instance, the F-score of 73.3% for plain UKB with SOFTCONSTRAINT (shown in Table 2) drops to 72.1% with only Japanese translations, to 64.2% with only Italian translations, and to 58.0% with no translations. These results also indicate that translations from a more distant language, i.e., Japanese, work better at discriminating senses.

Multilingual WSD Experiments
Since our methods are language-independent, we test them on standard multilingual WSD datasets.

Experimental Setup
We perform our multilingual WSD evaluation on benchmark parallel datasets in English, Spanish, Italian, French, and German from SemEval-2013 task 12 (Navigli et al., 2013) and SemEval-2015 task 13 (Moro and Navigli, 2015). The datasets contain manual reference translations, but are not word-aligned. We perform experiments in two settings, with either machine or human translations. To obtain automatic translations, we translate the test sets into English using Google Translate because the pre-trained NMT models for test languages are not always available. For manual translations, we use the provided parallel datasets in all languages. For each individual language, we use BABALIGN to obtain translations of the focus word in other languages. We randomly sample 10% of test instances in each dataset to obtain development sets for parameter tuning.
We use two multilingual base WSD systems: IMS and SENSEMBERT. We train IMS on OneSeC (Scarlini et al., 2019), an automatically sense-annotated set of corpora in multiple languages. For SENSEMBERT embeddings, when we integrate the translation embedding (t_emb), we concatenate the focus word embedding and its corresponding t_emb, as described in Section 3.3. To compute these contextual word embeddings for English translations, we use the 768-dimensional multilingual BERT cased pre-trained model (mBERT). Since both OneSeC and SENSEMBERT are limited to nouns, we follow Scarlini et al. (2019) in performing the evaluation on nominal instances only.
Since languages other than English lack large sense-annotated corpora, we employ two evaluation settings. In the default setting, sense frequency information is not used, with the parameter γ set to 0 in SOFTCONSTRAINT. In the other setting, we approximate sense distributions with CluBERT.

Results
Tables 3 and 4 present the results. HARDCONSTRAINT performs well in this set of experiments, as nouns are very well covered by BabelNet: over 99% of the words in BabelNet are nouns (Navigli and Ponzetto, 2012a), and on average, 92% of the SemEval translations are in the BabelNet synsets of the correct senses. SOFTCONSTRAINT achieves an average improvement of several F1 points on both systems, even without sense frequency information. The best results are obtained using sense frequency estimates from CluBERT, especially when they can be combined with mBERT-based contextual translation embeddings (t_emb), neither of which requires manually sense-annotated corpora. We interpret these results as the new state of the art in multilingual WSD based on the consistent improvement over SENSEMBERT.
To evaluate the potential of using translations from a replicable NMT model, we obtain English translations for test words in the SemEval-2013 German dataset with a pre-trained transformer model available in the fairseq toolkit. In this setting, we use only English translations for both constraints and t_emb. The results on both WSD systems with the pre-trained model are almost the same as with Google Translate, and slightly better than with English-only manual translations. According to our preliminary analysis, machine translations may sometimes work better because they tend to be more literal, and easier to align with the source focus words. This suggests that our methods can effectively leverage translations from different kinds of sources.

English WSD Experiments with NMT
In the final set of experiments, we evaluate our methods on standard monolingual benchmark datasets using NMT translations from multiple languages.
Experimental Setup We evaluate on five English all-words datasets: Senseval2, Senseval3, SemEval-2007, SemEval-2013, and SemEval-2015 from the unified framework made available by Raganato et al. (2017). Since these datasets are not accompanied by translations, we automatically obtain the translations from NMT models. We tune parameters on Senseval2, and apply the same parameter settings in all datasets.
We test our methods with four base WSD systems: Babelfy and UKB (knowledge-based), and IMS and LMMS (supervised), trained on SemCor 3.0 provided in Raganato et al. (2017). Our replication experiments match the reported results for these systems (±0.2% on average). For translations, we employ pre-trained transformer models from the fairseq toolkit: English-French and English-German models from Ott et al. (2018), and an English-Russian model. We choose French, German, and Russian as target languages due to the availability of pre-trained models. Note that unlike the multilingual WSD experiments (Section 5.3), we do not use Google Translate in the following experiments. We compare plain Babelfy and UKB to SOFTCONSTRAINT without p_freq. For other systems, we derive p_freq from sense frequency information from WordNet 3.0.

Results
Table 5 shows the results on the standard English all-words WSD datasets. While HARDCONSTRAINT is not sufficiently robust to improve complex WSD systems with automatically generated translations, SOFTCONSTRAINT shows statistically significant improvements over the original performance for all base systems. For example, UKB with dict_weight correctly predicts the sense of "earth" in "the world's most influential countries." However, English world and its three translations, monde, Welt, and mir, are all found only in the BabelNet synset glossed as "populace", while the Russian translation mir happens to be missing from the BabelNet synset glossed as "earth", perhaps because there is no Russian link to the English Wikipedia page for World. Hence, while HARDCONSTRAINT miscorrects the UKB prediction to the sense of "populace", SOFTCONSTRAINT keeps it unchanged by leveraging sense frequencies and the base system scores.
In Table 6, we show additional results on UKB with dict_weight when using only a single language to derive translations. All three languages show similar improvements, and we can obtain better improvements by combining multiple languages.
In summary, these results again demonstrate that our method can effectively integrate information from the WSD system itself, translations, and sense frequency even with noisy translations generated by NMT models. While translations are shown to help even strong supervised WSD systems, the improvements are particularly impressive on knowledge-based systems. The SOFTCONSTRAINT result on UKB with dict_weight sets a new state of the art for knowledge-based systems.

Conclusion
We proposed a novel approach that improves WSD by leveraging translations from multiple languages, which incorporates a knowledge-based bitext alignment. We tested our methods with several base WSD systems. We demonstrated experimentally that SOFTCONSTRAINT can consistently improve WSD performance even when no manual translations are available, leading to state-of-the-art results on multilingual and English all-words WSD. We make the source code available at https://github.com/YixingLuan/translations4wsd.