The Word Sense Disambiguation Test Suite at WMT18

We present a task to measure an MT system’s capability to translate ambiguous words with their correct sense according to the given context. The task is based on the German–English Word Sense Disambiguation (WSD) test set ContraWSD (Rios Gonzales et al., 2017), but it has been filtered to reduce noise, and the evaluation has been adapted to assess MT output directly rather than scoring existing translations. We evaluate all German–English submissions to the WMT’18 shared translation task, plus a number of submissions from previous years, and find that performance on the task has markedly improved compared to the 2016 WMT submissions (81%→93% accuracy on the WSD task). We also find that the unsupervised submissions to the task have a low WSD capability, and predominantly translate ambiguous source words with the same sense.


Introduction
Ambiguous words are often difficult to translate automatically, since the MT system has to decide which sense is correct in the given context. Errors in lexical choice can result in bad or even incomprehensible translations. However, document-level metrics such as BLEU (Papineni et al., 2002) are not fine-grained enough to assess this type of error.
Early evaluations have shown that neural machine translation (NMT) produces translations that are substantially more fluent, i.e. more grammatical and natural, than the previously dominant phrase-based/syntax-based statistical models, but results are more mixed when comparing adequacy, the semantic faithfulness of the translation to the original (Bojar et al., 2016; Bentivogli et al., 2016; Castilho et al., 2017; Klubička et al., 2017).
For example, in the fine-grained human evaluation by Klubička et al. (2017), mistranslations were the most frequent error category for the NMT system they evaluated, whereas fluency errors dominated in phrase-based machine translation. Our aim is to quantify one aspect of adequacy, word sense disambiguation (WSD), in a reproducible and semi-automatic way, to track progress over time and compare different types of systems in this respect.
We present a German→English test set to semi-automatically assess an MT system's performance on word sense disambiguation. The test set is based on ContraWSD (Rios Gonzales et al., 2017), but has been further filtered to reduce noise, and we use a different evaluation protocol. Instead of scoring a set of translations and measuring whether the reference translation is scored highest, we base the evaluation on the 1-best translation output to make the evaluation applicable to black-box systems. We report results on all German→English submissions to the WMT 2018 shared translation task (Bojar et al., 2018), plus a number of baseline systems from previous years.

Test Suite
Rather than measuring word sense disambiguation against a manually defined sense inventory such as those in Wordnet (Miller, 1995), we perform a task-based evaluation, focusing on homonyms whose different senses have distinct translations. The collection of test cases consists of 3249 German-English sentence pairs where the German source contains one of 20 ambiguous words that have more than one possible translation in English.³ We have associated the 20 ambiguous words with a total of 45 word senses, and extracted up to 100 examples for each sense.
The set of ambiguous words and sentence pairs is based on the test set described by Rios Gonzales et al. (2017).⁴ The original test set was designed to use scoring for the evaluation; in the present task, however, we let the systems translate the source sentences and evaluate the translation output. This change in evaluation protocol required further filtering of the original test set, specifically the removal of German words with an English translation that covers multiple senses. For instance, the original test set contains Stelle with two English senses: job and place. Both meanings can be translated as position, in which case we would not be able to assess the translation as correct or wrong; Stelle was therefore removed from our set of ambiguous words.
Since for most ambiguous words, one or more of their meanings are relatively rare, a large amount of parallel text is necessary to extract a sufficiently balanced number of examples.⁵ The correct translation is automatically determined for each pair through the reference translation. Table 1 lists all the ambiguous German words in the test set with their translations in English. We base our statistics on the number of ambiguous source words, which is slightly higher (3363) than the number of sentences (3249). Sentence pairs where the reference translation contains more than one possible sense as a translation have been removed. For instance, if a given reference contains the word investment as a translation for Anlage, but also attachment as a translation of another source word, this sentence pair cannot be part of the test set, since word alignment would be required to assess it correctly.
³ The test set and evaluation scripts are available from https://github.com/a-rios/ContraWSD/tree/master/testsuite_wmt18
⁴ The identification of ambiguous words and senses was performed with the help of lexical translation probabilities.
⁵ Sentence pairs have been extracted from the following corpora:
• WMT test and development sets 2006-2016 (de-en) and 2006-2013 (de-fr)
• Crédit Suisse News Corpus (https://pub.cl.uzh.ch/projects/b4c/de/)
• Corpora from OPUS (Tiedemann, 2012):
  - Global Voices (http://opus.lingfil.uu.se/GlobalVoices.php)
  - Books (http://opus.lingfil.uu.se/Books.php)
  - EU Bookshop Corpus (http://opus.lingfil.uu.se/EUbookshop.php)
• MultiUN (Ziemski et al., 2016)
The evaluation is semi-automatic: we automatically check for each sentence in the MT output whether one of the correct translations of the ambiguous word is present, and whether the output contains one of the other possible translations of the word, i.e. whether it has been translated with one of its other senses. Note that we check for more variation in the automatic matching than shown in Table 1; e.g. for Absatz (sales), we also consider verbal forms such as sold, sells, selling etc. as correct, using manually created lists of valid translations. There are four possible outcomes of this automatic evaluation:
1. we find only instances of the correct translations → counts as correct
2. we find only instances of the other translations → counts as wrong
3. we find both the correct and one of the other translations → manual inspection
4. we find none of the known translations → manual inspection
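The four outcomes can be illustrated with a minimal Python sketch. This is not the released evaluation script: the function names, the whole-word regular-expression matching, and the short translation lists below are assumptions made for illustration only, whereas the actual test suite relies on manually created lists of valid translations.

```python
import re
from enum import Enum


class Outcome(Enum):
    CORRECT = 1  # only correct translations found
    WRONG = 2    # only translations of other senses found
    BOTH = 3     # both found: needs manual inspection
    NONE = 4     # no known translation found: needs manual inspection


def contains_any(text, patterns):
    """Return True if any pattern occurs as a whole word (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(p) + r"\b", text, re.IGNORECASE)
        for p in patterns
    )


def classify(mt_output, correct, other):
    """Map one MT output sentence to one of the four outcomes."""
    has_correct = contains_any(mt_output, correct)
    has_other = contains_any(mt_output, other)
    if has_correct and has_other:
        return Outcome.BOTH
    if has_correct:
        return Outcome.CORRECT
    if has_other:
        return Outcome.WRONG
    return Outcome.NONE


# Hypothetical translation lists for Anlage in its financial sense.
CORRECT = ["investment", "investments", "asset", "assets"]
OTHER = ["plant", "plants", "attachment", "attachments"]

mt = ("In general, therefore, it is fair to say that, with the right advice "
      "and care, hedge fund assets are not necessarily more risky than "
      "traditional plants.")
print(classify(mt, CORRECT, OTHER))  # Outcome.BOTH
```

The example sentence is the MT output shown in Table 2: it contains both a correct translation (assets) and a translation of another sense (plants), so it falls under case 3 and would be passed on to manual inspection.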

Manual Evaluation Protocol
The large majority of translation outputs could be categorized as correct or wrong automatically. For the remaining approximately 5%, we manually assigned a label. Overall, around 25% of these were labelled as correct.
Case 3 typically indicates that the same ambiguous source word occurs multiple times in the input, and a manual annotator provided the number

Evaluation
We present results for all submissions to the WMT'18 shared translation task for German→English.
In addition, we include several baseline systems in our evaluation to track performance over time. We report results for Edinburgh's WMT'16 and WMT'17 submitted neural systems for German→English (Sennrich et al., 2016, 2017), which were ranked first in 2016, and tied first in 2017. We also include Edinburgh's WMT'16 syntax-based system (Williams et al., 2016), ranked tied second in 2016, to compare the now dominant neural systems to a more traditional SMT system.
We report the WSD accuracy for each system in two variants: automatic and full. For automatic accuracy, only case 1 is considered correct, and cases 2-4 are considered wrong. Full accuracy additionally counts as correct those instances of cases 3 and 4 (where both a correct and an incorrect translation, or none of the listed translations, are found) that were judged correct upon manual inspection. We also report BLEU scores on newstest2018, and on the WSD test suite, for comparison.
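As a minimal sketch of how the two accuracy variants relate, the following Python function computes both from a list of per-instance outcomes. The function name, the encoding of outcomes as the case numbers 1-4, and the toy data are assumptions for illustration and are not taken from the released scripts.

```python
def wsd_accuracy(outcomes, manual_labels=None):
    """Compute (automatic, full) WSD accuracy.

    outcomes: list of case numbers (1-4), one per ambiguous source word.
    manual_labels: dict mapping an index in `outcomes` to True/False for
        cases 3 and 4 that were manually inspected (True = judged correct).
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    manual_labels = manual_labels or {}
    total = len(outcomes)
    # Automatic accuracy: only case 1 counts as correct.
    automatic = sum(1 for o in outcomes if o == 1)
    # Full accuracy: add cases 3 and 4 that were manually judged correct.
    rescued = sum(
        1
        for i, o in enumerate(outcomes)
        if o in (3, 4) and manual_labels.get(i, False)
    )
    return automatic / total, (automatic + rescued) / total


# Toy data (invented): eight instances, three of them needing inspection.
outcomes = [1, 1, 2, 3, 4, 1, 3, 1]
manual = {3: True, 4: False, 6: False}
auto_acc, full_acc = wsd_accuracy(outcomes, manual)
print(auto_acc, full_acc)  # 0.5 0.625
```

By construction, full accuracy can never be lower than automatic accuracy, since manual inspection only ever adds correct instances.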

Results
Results on the WSD test suite are shown in Table 3. Table 4 shows an error analysis with two categories, distinguishing between predicting the wrong sense, and leaving the ambiguous source word untranslated. Globally, we observe a strong correlation between WSD accuracy and BLEU on the WSD test suite (Kendall's τ = 0.91), and a smaller (but still strong) correlation between WSD accuracy and BLEU on newstest2018 (τ = 0.72).
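For readers unfamiliar with the statistic, Kendall's τ compares every pair of systems and checks whether the two measures rank them in the same order. The sketch below implements the simple τ-a variant, (concordant − discordant) / number of pairs, with no correction for ties; which variant was used for the figures above is not stated, so this choice is an assumption, and the input lists are toy values rather than results from the evaluation.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measures order the pair the same way
        elif sign < 0:
            discordant += 1   # the measures disagree on the pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy rankings (invented): one swapped pair out of six.
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))  # 4/6, roughly 0.67
```

A value of 1 means the two measures rank all systems identically, and a value of -1 means the rankings are exactly reversed.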
However, there are some notable differences between BLEU and WSD accuracy. In particular, some unconstrained, anonymous systems (online-A/B/G/Y) perform better on the WSD test suite than on newstest2018 relative to other systems, which is likely due to differences in domain focus and training data: most constrained systems built for the shared task use monolingual news data for domain adaptation, whereas the online systems likely do not. At the same time, the online systems may be using extra training resources, and we cannot rule out that they train on corpora from which the WSD test suite is extracted.
The unsupervised systems RWTH-UNSUPER and LMU-unsup, as well as the anonymous rule-based system online-F, clearly fall behind. In many cases, these systems stick to one translation of a given ambiguous word. This becomes obvious when looking at the number of cases where the translation contains one of the other meanings of the translated words. The less common a given sense, the more likely it is to be translated with one of its other meanings; this is true for all systems, but more pronounced in the unsupervised models. Not only do they translate words with a wrong meaning more often, they also seem to have learned some spurious correlations. For instance, the German word Preis (price/prize) was translated in almost all cases as call by LMU-unsup. Generally, the unsupervised systems tend to translate words in a deterministic fashion, i.e. they use mostly the same translation for an ambiguous source word, regardless of context.
We observe that there is little difference in WSD accuracy between the syntax-based and neural uedin systems from 2016, even though the neural system achieves a substantially higher BLEU score. This is consistent with human comparisons of statistical and neural systems at the time, which found large improvements in fluency, but only small differences in adequacy, or specifically the number of mistranslations (Bojar et al., 2016; Castilho et al., 2017; Klubička et al., 2017). Interestingly, we observe major improvements in lexical choice since the 2016 systems, with a jump of 5 percentage points in 2017, and another 8 percentage points by the best system in 2018.
While these experiments were not under controlled data conditions,¹⁰ we believe that this improvement is only partially explainable by the increase in the amount of training data. We highlight a number of systems to illustrate this point.
Paracrawl is a noisy resource, and most submitted systems report using a filtered version of it. Ubiqus-NMT does not use Paracrawl at all, and is thus comparable to uedin-nmt-2017 in terms of training data, but outperforms it in WSD accuracy. This is even more impressive considering that Ubiqus-NMT is based on a single model, outperforming the reranked ensembles of uedin-nmt-2017.
A second interesting comparison is that between different architectures. LMU-nmt is based on a shallow RNN encoder-decoder, similar to uedin-nmt-2016, and exhibits a similarly low WSD accuracy. Most submissions are based on deep Transformer or RNN architectures, and show a higher WSD accuracy. Neural network depth was also one of the main differences between uedin-nmt-2016 and uedin-nmt-2017, and our results indicate that this is an important factor for lexical choice. Experiments by Tang et al. (2018), conducted in parallel to this work, on WMT17 training data also show that neural architectures play an important role in the performance on WSD, with a substantial lead for the Transformer over the tested RNN and CNN architectures.
¹⁰ [...] through the inclusion of Paracrawl (+700%).
The error analysis in Table 4 exposes other differences between systems. The rule-based system online-F is least prone to leaving the ambiguous source words untranslated (0.7%), while this is a more serious problem in the unsupervised systems (up to 6.9%) and some neural systems (up to 4.7%). It has been argued that SMT, which uses a coverage mechanism during decoding, is less prone to undertranslation than NMT (Tu et al., 2016). On the WSD test set, we find that uedin-nmt-2016 leaves more of the ambiguous words untranslated (2.4%) than the contemporaneous uedin-syntax-2016 (1.3%), but most NMT systems submitted to this year's shared translation task improve upon this number. While this is a very narrow evaluation of the undertranslation problem (only on one data set, and looking at specific source words), we consider it encouraging that we could measure some progress.

Conclusions
We present a targeted evaluation of 16 systems regarding their performance in lexical choice. A comparison against a baseline consisting of the top-ranked systems from WMT 2016 and 2017 for German-English shows that translation models in general have improved substantially. Furthermore, we observe that unsupervised systems are at a clear disadvantage when it comes to word sense disambiguation: they are less flexible and tend to stick to one translation of a given ambiguous word, regardless of context.
The current study is focused on a small set of 20 ambiguous nouns and 45 word senses, and a large-scale test set is created by extracting 3249 sentence pairs containing one of these word senses from various parallel corpora. This focus on ambiguous source words without lexical overlap between word senses in the target language allowed us to define an evaluation protocol that is mostly automatic: manual inspection was only necessary for about 5% of sentences, and had little effect on the ranking. However, this narrow focus also comes with limitations, and it would be interesting to evaluate word sense disambiguation on a larger set of words, including other parts of speech such as verbs and adverbs, which constituted a substantial proportion of lexical choice errors in previous analyses of MT systems (Williams et al., 2015).

Table 1: List of ambiguous German words, and the English translations of their different senses, included in the test suite.

source: Im Allgemeinen lässt sich deshalb mit Recht behaupten, dass - mit der richtigen Beratung und Sorgfalt - Hedge-Fund-Anlagen nicht zwangsläufig risikoreicher sind als traditionelle Anlagen.
reference: It is therefore fair to say that properly advised hedge fund investments are, generally speaking, not necessarily riskier than traditional investments.
MT translation: In general, therefore, it is fair to say that, with the right advice and care, hedge fund assets are not necessarily more risky than traditional plants.

Table 2: Example sentence pair for ambiguous word Anlagen with translation from uedin-nmt-2017. The first translation (assets) is correct, the second (plants) wrong.

of correct translations. See Table 2 for an example from one of the baseline systems, where the ambiguous word Anlage occurs twice, both times in the financial sense. The MT system translates the first form correctly, but the second with one of its other meanings, plant. Case 4 can indicate that the ambiguous source word was translated into a variant not covered by our automatic patterns, or left untranslated. Manual assessment by the main author is used to distinguish between the two.

Table 3: Results on WSD test suite. WSD accuracy before and after manual inspection, and BLEU on newstest2018, and on references from WSD test suite.