XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization

The ability to correctly model distinct meanings of a word is crucial for the effectiveness of semantic representation techniques. However, most existing evaluation benchmarks for assessing this criterion are tied to sense inventories (usually WordNet), restricting their usage to a small subset of knowledge-based representation techniques. The Word-in-Context dataset (WiC) addresses the dependence on sense inventories by reformulating the standard disambiguation task as a binary classification problem; however, it is limited to the English language. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability, opening room for evaluation scenarios such as zero-shot cross-lingual transfer. We perform a series of experiments to determine the reliability of the datasets and to set performance baselines for several recent contextualized multilingual models. Experimental results show that, even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance in the task of distinguishing different meanings of a word, even for distant languages. XL-WiC is available at https://pilehvar.github.io/xlwic/.


Introduction
One of the desirable properties of contextualized models, such as BERT (Devlin et al., 2019) and its derivatives, lies in their ability to associate dynamic representations with words, i.e., embeddings that can change depending on the context. This provides the basis for the model to distinguish different meanings (senses) of words without the need to resort to an explicit sense disambiguation step. The conventional evaluation framework for this property has been Word Sense Disambiguation (Navigli, 2009, WSD). However, evaluation benchmarks for WSD are usually tied to external sense inventories (often WordNet (Fellbaum, 1998)), making it extremely difficult to evaluate systems that do not explicitly model sense distinctions in the inventory, effectively restricting the benchmark to inventory-based sense representation techniques and WSD systems. This prevents a direct evaluation of lexical semantic capacity for a wide range of inventory-free models, such as the dominant language model-based contextualized representations.
Pilehvar and Camacho-Collados (2019) addressed this dependence on sense inventories by reformulating the WSD task as a simple binary classification problem: given a target word w in two different contexts, c1 and c2, the task is to identify whether the same meaning (sense) of w was intended in both c1 and c2 or not. The task was framed as a dataset, called Word-in-Context (WiC), which is also part of the widely-used SuperGLUE benchmark (Wang et al., 2019). Despite allowing a significantly wider range of models to be directly evaluated on WSD, WiC is limited to the English language only, preventing the evaluation of models in other languages and in cross-lingual settings.
In this paper, we present a new evaluation benchmark, called XL-WiC, which extends the WiC dataset to 12 new languages from different families and with different degrees of resource availability: Bulgarian (BG), Chinese (ZH), Croatian (HR), Danish (DA), Dutch (NL), Estonian (ET), Farsi (FA), French (FR), German (DE), Italian (IT), Japanese (JA) and Korean (KO). With over 80K instances, our benchmark can serve as a reliable evaluation framework for contextualized models in a wide range of heterogeneous languages. XL-WiC can also serve as a suitable testbed for cross-lingual experimentation in settings such as zero-shot or few-shot transfer across languages. As an additional contribution, we tested several pretrained multilingual models on XL-WiC, showing that they are generally effective at transferring sense distinction knowledge from English to other languages in the zero-shot setting. However, with more training data at hand for target languages, monolingual approaches gain ground, outperforming their multilingual counterparts by a large margin.

Related Work
XL-WiC is a benchmark for inventory-independent evaluation of WSD models (Section 2.1), while the multilingual nature of the dataset makes it an interesting resource for experimenting with cross-lingual transfer (Section 2.2).

Word Sense Disambiguation
The ability to identify the intended sense of a polysemous word in a given context is one of the fundamental problems in lexical semantics. It is usually addressed with two different kinds of approaches, relying either on sense-annotated corpora (Bevilacqua and Navigli, 2020; Scarlini et al., 2020; Blevins and Zettlemoyer, 2020) or on knowledge bases (Moro et al., 2014; Agirre et al., 2014; Scozzafava et al., 2020). Both are usually evaluated on dedicated benchmarks, including at least five WSD tasks in the Senseval and SemEval series, from 2001 (Edmonds and Cotton, 2001) to 2015 (Moro and Navigli, 2015a), which are included in the test suite of Raganato et al. (2017). All these tasks are framed as classification problems, where disambiguating a word amounts to selecting one of its predefined senses as listed by a sense inventory. This brings about several limitations, such as restricting senses to only those defined by the inventory, or forcing the WSD system to explicitly model sense distinctions at the granularity level defined by the inventory.
Stanford Contextual Word Similarity (Huang et al., 2012) is one of the first datasets that focuses on ambiguity outside the boundaries of sense inventories, framing the task as a similarity measurement between two words in their respective contexts. Pilehvar and Camacho-Collados (2019) highlighted some of the limitations of this dataset that prevent a reliable evaluation, and proposed the Word-in-Context (WiC) dataset. WiC is the dataset closest to ours: it provides around 10K instances (1,400 instances for 1,184 unique target nouns and verbs in the test set), but for the English language only.

Cross-lingual NLP
A prerequisite for research on a language is the availability of relevant evaluation benchmarks. Given its importance, the construction of multilingual datasets has always been considered a key contribution in NLP research, and numerous benchmarks exist for a wide range of tasks, such as semantic parsing (Hershcovich et al., 2019), word similarity (Camacho-Collados et al., 2017; Barzegar et al., 2018), sentence similarity (Cer et al., 2017), and WSD (Navigli et al., 2013; Moro and Navigli, 2015b). A more recent example is XTREME (Hu et al., 2020), a benchmark that covers around 40 languages in nine syntactic and semantic tasks.
On the other hand, pre-trained language models have recently proven very effective in transferring knowledge in cross-lingual NLP tasks (Devlin et al., 2019; Conneau et al., 2020). This has further magnified the need for rigorous multilingual benchmarks that can serve as a basis for this direction of research (Artetxe et al., 2020b).

XL-WiC: The Benchmark
In this section, we describe the procedure we followed to construct the XL-WiC benchmark. Our framework is based on the original WiC dataset, which we extend to multiple languages.

English WiC
Each instance of the original WiC dataset (Pilehvar and Camacho-Collados, 2019) is composed of a target word (e.g., justify) and two sentences where the target word occurs (e.g., "Justify the margins" and "The end justifies the means"). The task is a binary classification: to decide whether the same sense of the target word (justify) was intended in the two contexts or not. The dataset was built using example sentences from resources such as Wiktionary, WordNet (Miller, 1995) and VerbNet (Schuler et al., 2009).

XL-WiC
We followed Pilehvar and Camacho-Collados (2019) and constructed XL-WiC based on example usages of words in sense inventories. Example usages are curated so as to be self-contained and clearly distinguishable across different senses of a word; hence, they provide a reliable basis for the binary classification task. Specifically, for a word w and all of its senses {s_1^w, ..., s_n^w}, we extract from the inventory all the example usages. We then pair examples that correspond to the same sense s_i^w to form a positive instance (True label), while examples from different senses (i.e., s_i^w and s_j^w where i ≠ j) are paired as negative instances (False label).
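For illustration, the pairing step can be summarized with a short sketch. This is a minimal illustration assuming example usages have already been grouped by sense identifier; the function name and data layout are ours, not part of the released code.

```python
from itertools import combinations

def build_instances(senses):
    """Build WiC-style instances for one target word.

    `senses` maps a sense id s_i^w to the list of example
    usages extracted from the inventory for that sense.
    """
    instances = []
    # Two examples of the same sense form a positive (True) pair.
    for examples in senses.values():
        for c1, c2 in combinations(examples, 2):
            instances.append((c1, c2, True))
    # Examples of two different senses form a negative (False) pair.
    for s_i, s_j in combinations(senses.keys(), 2):
        for c1 in senses[s_i]:
            for c2 in senses[s_j]:
                instances.append((c1, c2, False))
    return instances

# e.g., two senses of "justify" with one example each yield one negative pair
pairs = build_instances({
    "justify#1": ["Justify the margins."],
    "justify#2": ["The end justifies the means."],
})
```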
We leveraged two main sense inventories for this extension: Multilingual WordNet (Section 3.2.1) and Wiktionary (Section 3.2.2).

Multilingual WordNet
WordNet (Miller, 1995) is the de facto sense inventory for English WSD. The resource was originally built as an English lexical database in 1995, but since then there have been many efforts to extend it to other languages (Bond and Paik, 2012). We took advantage of these extensions to construct XL-WiC. In particular, we processed the WordNet versions of Bulgarian (Simov and Osenova, 2010), Chinese (Huang et al., 2010), Croatian (Raffaelli et al., 2008), Danish (Pedersen et al., 2009), Dutch (Postma et al., 2016), Estonian (Vider and Orav, 2002), Japanese (Isahara et al., 2008), Korean (Yoon et al., 2009) and Farsi (Shamsfard et al., 2010).

Farsi: Semi-automatic extraction. FarsNet v3.0 (Shamsfard et al., 2010) comprises 30K synsets with over 100K word entries. Many of these synsets are mapped to the English database; however, each synset provides just one example usage for a target word. This prevents us from applying the automatic extraction of positive examples. Therefore, we utilized a semi-automatic procedure for the construction of the Farsi set. To this end, for each word, we extracted all example usages from FarsNet and asked an annotator to group them into positive and negative pairs. The emphasis was on making a challenging dataset with sense distinctions that are easily interpretable by humans. This can also be viewed as a case study to understand the real gap between human and machine performance in settings where manual curation of instances is feasible.
Filtering. WordNet is often considered to be a fine-grained resource, especially for verbs (Duffield et al., 2007). In some cases, the exact meaning of a word can be hard to assess, even for humans. For example, WordNet lists 29 distinct meanings for the noun line, two of which correspond to the horizontally and the vertically organized line formations. To cope with this issue, we followed Pilehvar and Camacho-Collados (2019) and filtered out all pairs whose target senses were connected by an edge (including sister-sense relations) in WordNet's semantic network, or which belonged to the same supersense, i.e., one of the 44 lexicographer files in WordNet which cluster concepts into semantic categories, e.g., Animal, Cognition, Food, etc. For example, the Japanese instance "成長中の企業は大胆な指導者がいなければならない" ("Growing companies must have bold leaders"), "彼は安定した大きな企業に投資するだけだ" ("He just invested in big stable companies") for the target word "企業" ("company") is discarded, as its corresponding synsets, i.e., "An organization created for business ventures" and "An institution created to conduct business", are grouped under the same supersense in WordNet, i.e., Group.
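For the English WordNet, this filtering criterion can be approximated with NLTK. The sketch below is our own re-implementation of the described heuristic, not the authors' released code; it checks only supersenses, hypernym edges, and sister senses rather than every WordNet relation.

```python
# requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def is_too_close(s1, s2):
    """Discard a pair if the two senses share a supersense
    (lexicographer file) or are adjacent in WordNet's network."""
    # Same supersense, e.g., noun.group for both "company" senses.
    if s1.lexname() == s2.lexname():
        return True
    # Directly connected by a hypernym/hyponym edge.
    if s2 in s1.hypernyms() or s1 in s2.hypernyms():
        return True
    # Sister senses: the two synsets share a direct hypernym.
    return bool(set(s1.hypernyms()) & set(s2.hypernyms()))

# e.g., the first two senses of the noun "line"
s1, s2 = wn.synsets("line", pos=wn.NOUN)[:2]
print(is_too_close(s1, s2))
```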
Finally, all datasets are split into development and test sets. At the end of this step, we ensure that both the test and development sets have the same number of positive and negative instances. An excerpt of the examples included in some of our datasets is shown in Table 1.

Wiktionary
Wiktionary is one of the richest free collaborative lexical databases, available for dozens of languages.
In this online resource, each word is provided with definitions for its various potential meanings, some of which are paired with example usages. However, each language has a specific format, and therefore the compilation of these examples requires careful language-specific parsing. We extracted examples for three European languages for which we did not have WordNet-based data, namely French, German, and Italian. Once these examples were compiled, the process to build the final dataset was analogous to that for the WordNet-based datasets (see Section 3.2.1), except for the filtering step, which was not feasible, as Wiktionary entries are not connected through paradigmatic relations as in WordNet.
For Wiktionary, the number of examples was considerably higher; therefore, we also compiled language-specific training sets, which enable a comparison between cross-lingual and monolingual models (see Section 5.2). All Wiktionary datasets are split into balanced training, development, and test splits, each of which contains an equal number of positive and negative instances. Table 2 shows the statistics of all datasets, including the total number of instances, unique words, and average context length. Wiktionary-based datasets are substantially larger than the WordNet-based ones, and also provide training sets. The Chinese dataset features longer contexts on average and contains the largest number of development and test instances among the WordNet-based datasets. Korean, on the other hand, has the shortest contexts, which is expected given its agglutinative nature. As for the training corpora, the German and French datasets contain almost ten times the number of instances in the English training set. This allows us to perform a large-scale comparison between cross-lingual and monolingual settings (see Section 5.2) as well as a few-shot analysis (Section 6.2).

Validation and human performance
To verify the reliability of the datasets, we carried out a manual evaluation for those languages for which we had access to annotators. To this end, we presented a set of 100 randomly sampled instances from each dataset to the corresponding annotator in the target language. Annotators were all native speakers of the target language with a high level of education. They were provided with a minimal guideline: a brief explanation of their task and a few tagged examples. We did not provide any lexical resource (or any other detailed instructions) to the annotators, the emphasis being on building a challenging dataset with sense distinctions that are easily interpretable to the layman. Given an instance, i.e., a pair of sentences containing the same target word, their task consisted of tagging it with a True or False label, depending on the intended meanings of the word in the two contexts. Table 3 reports human performance for eight datasets in XL-WiC. All accuracy figures are around 80%, i.e., in the same ballpark as the original English WiC dataset, which attests to the reliability of the underlying resources and the construction procedure. The only exception is Farsi, for which the checker annotators agree with the gold labels in 97% of the instances (on average). This corroborates our emphasis, in the annotation procedure for this manually-created dataset, on sense distinctions that are easily interpretable by humans. As for Wiktionary, the human agreement figures are lower than those for the WordNet counterparts (although not part of XL-WiC, we also compiled and validated a small Italian WordNet dataset to compare it with its Wiktionary counterpart; Table 8 in the Appendix compares the nature of these two datasets, including zero-shot cross-lingual transfer results). This was partly expected, given that the semantic network-based filtering step (see Section 3.2.1) was not feasible for the Wiktionary datasets due to the nature of the underlying resource.

Experimental Setup
For our experiments, we implemented a simple, yet effective, baseline based on a Transformer-based text encoder (Vaswani et al., 2017) and a logistic regression classifier, following Wang et al. (2019). The model takes the two contexts as input and first tokenizes them, splitting the input words into sub-tokens. The encoded representations of the target words are concatenated and fed to the logistic classifier. For those cases where the target word was split by the tokenizer into multiple sub-tokens, we followed Devlin et al. (2019) and considered the representation of its first sub-token.
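The following sketch illustrates this baseline with the Hugging Face transformers library. It is a minimal illustration under our own assumptions (the char_span argument marking the target word's character offsets, and the untrained linear head shown only for shape), not the authors' exact implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def target_vector(context, char_span):
    """Return the encoder state of the first sub-token of the target word."""
    enc = tok(context, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]
    for i, (start, end) in enumerate(offsets):
        if start <= char_span[0] < end:  # first sub-token covering the target
            return hidden[i]
    raise ValueError("target word lost during tokenization")

v1 = target_vector("Justify the margins.", (0, 7))
v2 = target_vector("The end justifies the means.", (8, 17))
pair = torch.cat([v1, v2])               # input to the logistic classifier
head = torch.nn.Linear(pair.numel(), 1)  # untrained head, for illustration only
print(torch.sigmoid(head(pair)).item())  # P(same sense), before any training
```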
As regards the text encoder, we carried out the experiments with three different multilingual models, i.e., the multilingual version of BERT (Devlin et al., 2019) (mBERT) and the base and large versions of XLM-RoBERTa (Conneau et al., 2020) (XLMR-base and XLMR-large, respectively). In the monolingual setting, we used the following language-specific models: BERT-de, CamemBERT-large (Martin et al., 2020), BERTit, and ParsBERT (Farahani et al., 2020), respectively, for German, French, Italian, and Farsi. As for all the other languages covered by the WordNet datasets, i.e., Bulgarian, Chinese, Croatian, Danish, Dutch, Estonian, Japanese and Korean, we used the pre-trained models made available by TurkuNLP. We refer to each language-specific model as L-BERT.
In all experiments, we trained the baselines to minimize the binary cross-entropy loss between their predictions and the gold labels, using the Adam optimizer (Kingma and Ba, 2015). Training is carried out for 10 epochs, with the learning rate fixed to 1e-5 and weight decay set to 0. As for tuning, results are reported for the best training checkpoint (among the 10 epochs) according to performance on the development set.
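A training loop matching this regime (binary cross-entropy, Adam with a fixed 1e-5 learning rate and zero weight decay, 10 epochs, dev-based checkpoint selection) might look as follows. This is a sketch under stated assumptions: the synthetic batches stand in for encoder outputs, only the classification head is updated here, whereas in practice the encoder parameters would typically be fine-tuned jointly.

```python
import torch

HIDDEN = 768                                   # hidden size of mBERT / XLMR-base
head = torch.nn.Linear(2 * HIDDEN, 1)          # the logistic classifier
optimizer = torch.optim.Adam(head.parameters(), lr=1e-5, weight_decay=0.0)
loss_fn = torch.nn.BCEWithLogitsLoss()

# Synthetic stand-in for batches of (concatenated target vectors, gold labels);
# in the real setup these come from the encoder of the previous sketch.
train_batches = [(torch.randn(8, 2 * HIDDEN), torch.randint(0, 2, (8,)))
                 for _ in range(4)]

for epoch in range(10):
    for pair_vec, label in train_batches:
        loss = loss_fn(head(pair_vec).squeeze(-1), label.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # after each epoch, evaluate on the dev set and keep the best
    # checkpoint; the dev evaluation is omitted here for brevity
```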

Evaluation settings
We evaluated the baselines with different configuration setups, depending on the data used for training and tuning.
Cross-Lingual Zero-shot. This setting assesses the capability of multilingual models to transfer knowledge captured in English to other languages. As the training set, we used the English training set of WiC. As for tuning, depending on the setting, we either used the English development set of WiC or the language-specific development sets of XL-WiC (Section 3.2.1). We report results on all WordNet and Wiktionary datasets of XL-WiC, i.e., Bulgarian, Chinese, Croatian, Danish, Dutch, Estonian, Farsi, French, German, Italian, Japanese and Korean.
Multilingual Fine-Tuning. In this setting, models are first trained on WiC's English training set, and then further fine-tuned on the development sets of the target languages in XL-WiC. Depending on the training set used, we report results for two configurations: (i) EN+Target Language, combining WiC's training data with the language-specific WordNet development set of the target language, and (ii) EN+All, combining WiC's training data with the development sets of all languages.

Monolingual. In this setting, we trained each model on the corresponding training set of the target language only. For the WordNet datasets (where no training sets are available), we used the development sets for training, splitting each development set into two subsets with a 9:1 ratio (for training and development, respectively). As for the Wiktionary datasets, we used the corresponding training and development sets for each language (Section 3.2.2).
Translation. In this last setting, we make use of existing neural machine translation (NMT) models to translate either the training or the test set, essentially reducing the cross-lingual problem to a monolingual one. In particular, we used the general-domain translation models from the Opus-MT project (github.com/Helsinki-NLP/Opus-MT) (Tiedemann and Thottingal, 2020), available for the following language pairs: English-Bulgarian, English-Croatian, English-Danish, English-Dutch, English-Estonian, and English-Japanese. The models are trained on the OPUS parallel corpora collection (Tiedemann, 2012), using the state-of-the-art 6-layer Transformer-based architecture (Vaswani et al., 2017); more details about the NMT models and their translation quality are given in the Appendix (Table 13). In this configuration, as the original target word may be lost during automatic translation, we view the task as context (sentence) similarity, as proposed by Pilehvar and Camacho-Collados (2019); the Appendix (Table 12) includes another translation baseline that uses dictionary alignments to identify the target word, but it performed worse overall. Therefore, for each model, the context vector is given by the start-of-sentence symbol. We note that while training custom optimized NMT models for each target language might result in better overall performance, this is beyond the scope of this work.
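In code, replacing the target-word vector of the earlier sketch with the start-of-sentence representation could look like the snippet below. This is our hedged reading of the setup: the translated contexts are assumed to come from an Opus-MT model (the example strings here are merely illustrative), and the pair of context vectors then feeds the same logistic classifier as before.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def context_vector(sentence):
    """Represent the whole context by its start-of-sentence symbol (<s>/[CLS])."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        return encoder(**enc).last_hidden_state[0, 0]

# assumed outputs of an Opus-MT English-German model, for illustration
translated_1 = "Die Ränder ausrichten."
translated_2 = "Der Zweck heiligt die Mittel."
pair = torch.cat([context_vector(translated_1), context_vector(translated_2)])
```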
Evaluation Metrics. Since all datasets are balanced, we only report accuracy, i.e., the ratio of correctly predicted instances (true positives or true negatives) to the total number of instances.

Results
In this section, we report the results for the configurations discussed in the previous section on the XL-WiC benchmark. We organize the experiments into two parts, based on the test dataset: WordNet (Section 5.1) and Wiktionary (Section 5.2).

WordNet datasets
Using English data only. Table 4 shows results on the XL-WiC WordNet test sets when only WiC's English data was used for training and tuning purposes. Across the board, XLMR-large consistently achieves the best results, while mBERT and XLMR-base attain scores in the same ballpark. Indeed, the massive pretraining and the larger number of parameters of XLMR-large play a key role in this lead in performance. As regards the translation-based settings (lower two blocks), the performance generally falls slightly behind that of the zero-shot cross-lingual counterpart. This shows that the usage of good-quality English data with multilingual models provides a stronger training signal than noisy automatically-translated data. This somewhat contrasts with the observations made on other cross-lingual tasks in XTREME (Hu et al., 2020), especially in question answering datasets (Artetxe et al., 2020a; Lewis et al., 2020; Clark et al., 2020), where translating data was generally better. This difference could perhaps be reduced with larger monolingual models or accurate alignment, but this would further increase the complexity, and extracting such alignments from NMT models is not trivial (Koehn and Knowles, 2017; Ghader and Monz, 2017; Li et al., 2019).

Utilizing language-specific data. Table 5 shows results for settings where target language-specific data was used for training or tuning. Comparing the results in the top block (where target-language data was used for tuning) with those in the middle two blocks (where target-language data was instead used for training) reveals that it is more effective to leverage the target-language data for training rather than for tuning only. Overall, it is clear that adding multilingual data during training drastically improves the results in all languages. In this case, training (fine-tuning) is performed on a larger dataset which, despite containing examples from different languages, provides a stronger signal to the models, enabling them to better generalize across languages. In contrast, when only target-language data is used for training and tuning (last block in the table), results drop for most languages. This highlights the fact that having additional training data is beneficial, reinforcing the utility of multilingual models and cross-lingual transfer.

Wiktionary Datasets
In Table 6 we show results for the Wiktionary datasets. Differently from the results reported for the WordNet datasets (Table 4), models are less effective in the zero-shot setting, performing from 10 to almost 20 points lower than their counterparts trained on data in the target language. This can be attributed to the size of the available training data. Indeed, while for the WordNet datasets we only have a very small amount of data at our disposal for training (see statistics in Table 2), the Wiktionary training sets are much larger, hence providing enough data for the models to better generalize. Once again, XLMR-large proves to be the best model in the zero-shot setting and a competitive alternative to the language-specific models (L-BERT row) in the monolingual setting, performing 1.1 points higher in German, and 2 and 0.3 points lower in French and Italian, respectively.

Table 6: Results on the Wiktionary test sets in different training settings: zero-shot (Z-Shot) and monolingual training (Mono). L-BERT stands for the language-specific models, i.e., BERT-de, CamemBERT-large and BERTit for German, French and Italian, respectively.

Analysis
In this section, we delve into the performance of the models on XL-WiC and analyze relevant aspects of their behaviour.

Seen and Unseen Words
For this analysis, we aim at measuring the difference in performance depending on whether a given target word was seen (as a target word) at training time. To this end, we evaluate our baselines when trained on the German, French, and Italian Wiktionary training sets and tested on two different subsets of the larger language-specific Wiktionary test sets: In-Vocabulary (IV), containing only the examples whose target word was seen at training time, and Out-of-Vocabulary (OOV), containing only the examples whose target word was not seen during training. We report the results in Table 7. In general, multilingual models are less reliable when classifying unseen instances, lagging between 1 and 12 points behind in performance, depending on the language and the model considered. This can be attributed to the fact that their vocabulary is shared among several languages, and they may therefore have less knowledge stored about particular words that do not occur often. The performance drop of language-specific models (L-BERT) is less pronounced, with the French model (CamemBERT-large) even attaining higher performance (0.4 points more) on the OOV set.

Table 7: Results on the in-vocabulary (IV) and out-of-vocabulary (OOV) Wiktionary test sets. L-BERT stands for each language-specific model, i.e., BERT-base-de, CamemBERT and BERT-base-xxl-it for German, French and Italian, respectively.
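The IV/OOV split itself is straightforward to reproduce. Below is a small sketch under the assumption that instances carry an explicit target-word field; the field name and function name are ours.

```python
def split_iv_oov(test_instances, train_instances):
    """Partition test instances by whether their target word was
    seen as a target word at training time."""
    seen = {inst["target"] for inst in train_instances}
    iv = [x for x in test_instances if x["target"] in seen]
    oov = [x for x in test_instances if x["target"] not in seen]
    return iv, oov

# accuracy is then reported separately on the `iv` and `oov` subsets
```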

Few-shot Monolingual
As an additional experiment, we investigate the impact of training data size on performance. To this end, we leveraged the Wiktionary datasets for German, French and Italian, which allow us to use varying-sized training sets, and created 7 training sets with 10, 25, 50, 100, 250, 500, and 1000 instances. The results of this experiment are displayed in Figure 1. When provided with only 10 examples, most of the models perform similarly to or even worse than random, i.e., 50% accuracy. In this setting, language-specific models (L-BERT) attain better results than their multilingual counterparts, showing better generalization capabilities when fewer examples are provided. This is also in line with our findings in the previous experiment on seen and unseen words. With less than 5% of the training data (1000 instances in French and German and 50 instances in Italian), all models attain roughly 85% of their performance with full training data, comparable to the results reported for the zero-shot setting (Table 6).

Conclusions
In this paper we have introduced XL-WiC, a large benchmark for evaluating context-sensitive models. XL-WiC comprises datasets for a heterogeneous set of 13 languages, including the original English data of WiC (Pilehvar and Camacho-Collados, 2019), providing an evaluation framework not only for contextualized models in those languages, but also for experimentation in a cross-lingual transfer setting. Our evaluations show that, even though current language models are effective performers in the zero-shot cross-lingual setting (where no instances in the target language are provided), there is still room for improvement, especially for distant languages such as Japanese or Korean.
As for future work, we plan to investigate using languages other than English for training (e.g., our larger French and German training sets) in our cross-lingual transfer experiments, since English may not always be the optimal source language (Anastasopoulos and Neubig, 2020). Finally, while in our comparative analysis we have focused on a quantitative evaluation across all languages, an additional per-language error analysis would be beneficial in revealing the weaknesses and limitations of cross-lingual models.

C Additional experimental results
WordNet datasets. Table 10 includes details on the variability of the results, in particular the average results over three runs, including the standard deviation, for the zero-shot cross-lingual setting, as this is the setting that produces the highest variability in the results.
Wiktionary Datasets. In Table 11 we show the development and test results in the monolingual setting for the multilingual language models trained, tuned, and tested on the XL-WiC language-specific datasets from Wiktionary.
Translation setting + Dictionary alignment. We include a setting where, after translating the English training set into each target language, we also retrieve the corresponding translation of the English target word through a multilingual dictionary. We use BabelNet (Navigli and Ponzetto, 2012) as the multilingual dictionary for all languages, discarding the sentences where the translated target word could not be found. Table 12 shows the results.

D Translation models
Translation models are trained using the Marian-NMT framework (Junczys-Dowmunt et al., 2018) on a version of the OPUS parallel corpora collection filtered with a language identifier (CLD2). Each model is based on the base version of the Transformer architecture (Vaswani et al., 2017). All models and training details are available at https://github.com/Helsinki-NLP/Opus-MT. To give an idea of the translation quality, Table 13 reports the BLEU scores (Papineni et al., 2002) for each model. We report the performance, as described within the Opus-MT project, on the latest available test sets from the series of WMT news translation shared tasks, or on 5K sentences