Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks

Masked language models have quickly become the de facto standard when processing text. Recently, several approaches have been proposed to further enrich word representations with external knowledge sources such as knowledge graphs. However, these models are devised and evaluated in a monolingual setting only. In this work, we propose a language-independent entity prediction task as an intermediate training procedure to ground word representations on entity semantics and bridge the gap across different languages by means of a shared vocabulary of entities. We show that our approach effectively injects new lexical-semantic knowledge into neural models, improving their performance on different semantic tasks in the zero-shot crosslingual setting. As an additional advantage, our intermediate training does not require any supplementary input, allowing our models to be applied to new datasets right away. In our experiments, we use Wikipedia articles in up to 100 languages and already observe consistent gains compared to strong baselines when predicting entities using only the English Wikipedia. Adding extra languages leads to improvements in most tasks up to a certain point, but overall we find it non-trivial to scale improvements in model transferability by training on ever larger numbers of Wikipedia languages.


Introduction
Pretrained Multilingual Masked Language Models (MMLMs) such as mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) and their variants have achieved state-of-the-art results across diverse natural language understanding tasks. Typically, an MMLM is pretrained on very large amounts of raw text in different languages using the masked language modelling (MLM) objective and is further finetuned on (usually limited amounts of) task data.
In the zero-shot crosslingual setting, which is our focus in this paper, an MMLM is finetuned on the target task using data in a single language (e.g., English) and is evaluated on the same task in different, non-English languages.
We introduce the multilingual Wikipedia hyperlink prediction objective to contextualise words in a text with entities and concepts from an external knowledge source, using Wikipedia articles in up to 100 languages. Hyperlink prediction is a knowledge-rich task designed to (1) inject semantic knowledge from Wikipedia entities and concepts into the MMLM token representations, and (2) inject explicit language-independent knowledge into a model trained via self-supervised learning, with a similar motivation to the translation language modelling loss of Conneau and Lample (2019), but in our case without parallel data. We devise a training procedure where we mask out hyperlinks in Wikipedia articles and train the MMLM to predict the hyperlink identifier similarly to standard MLM, but using a "hyperlink vocabulary" of 250k concepts shared across languages.
We use the state-of-the-art MMLM XLM-R-large (Conneau et al., 2020) and show that adding an intermediate training step using Wikipedia hyperlink prediction consistently improves performance on several zero-shot crosslingual natural language understanding tasks across a diverse array of languages: crosslingual Word Sense Disambiguation in 18 languages including English (XL-WSD; Pasini et al., 2021); the crosslingual Word-in-Context task (XL-WiC; Raganato et al., 2020) in 12 non-English languages; and 7 tasks from the XTREME benchmark (Hu et al., 2020) in up to 40 languages.

Recently, Zhang et al. (2019, ERNIE) and Peters et al. (2019, KnowBERT) devised different methods to incorporate entities from external knowledge graphs into masked language model (LM) training. Several works have since followed (Wang et al., 2021; Sun et al., 2020; Xiong et al., 2020), showing increasingly better performance than masked LMs that rely on information from raw text only. Nevertheless, all these methods were proposed for a single language and cannot be easily applied to transfer learning in a zero-shot crosslingual setting.

Approach
Notation Let $\mathbf{x}_{1:m} = \mathrm{MMLM}(x_{1:m})$ be the contextualised word representations, computed with a pretrained MMLM, for some input text $x_{1:m}$ with $m$ words. Let $\mathbf{x}_{n:k}$ ($n \geq 1$, $k \leq m$) be the subsequence of contextualised word representations of a single hyperlink $x_{n:k}$ consisting of $k - n$ words. In our working example we use a single hyperlink $x_{n:k}$ for simplicity, but in practice there may be multiple hyperlinks in the input $x_{1:m}$.

Wikipedia Hyperlink Prediction Our main goal is to use the rich semantic knowledge contained in the multilingual Wikipedias' structure to improve language model pretraining. Our approach can be seen as intermediate-task training (Phang et al., 2018, 2020), where we use Wikipedia's hyperlinks as labelled data to further finetune a pretrained MMLM before training it one last time on the actual target task of interest. Motivated by recent studies on pretrained language encoders demonstrating that semantic features are highlighted in higher layers (Raganato and Tiedemann, 2018; Jawahar et al., 2019; Cui et al., 2020; Rogers et al., 2021), we further train only the last two layers of the MMLM. Moreover, similarly to the MLM procedure, we replace the hyperlink tokens $x_{n:k}$ by the [MASK] token or by a random token 80% and 10% of the time, respectively (Devlin et al., 2019).
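As an illustration, a minimal sketch of the masking step described above (not the released implementation; we assume here that the 80/10/10 decision is taken once per hyperlink span, and all names are illustrative):

```python
import random

def mask_hyperlink(token_ids, span, mask_id, vocab_size):
    """MLM-style masking of one hyperlink span (n, k), inclusive:
    80% of the time replace its tokens with [MASK], 10% of the time
    with random tokens, and 10% of the time leave them unchanged."""
    n, k = span
    r = random.random()
    for i in range(n, k + 1):
        if r < 0.8:
            token_ids[i] = mask_id                       # [MASK]
        elif r < 0.9:
            token_ids[i] = random.randrange(vocab_size)  # random token
        # else: keep the original token as-is
    return token_ids
```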
Since the number of Wikipedia articles is very large, we only consider the most frequent 250k referenced articles $h_t$ as possible hyperlink targets in our model, and we use the adaptive softmax activation function to speed up training (Grave et al., 2017). Our objective requires text-entity alignments during training only. At prediction time, instead, we simply feed the model raw text, with no need for precomputed alignments. This makes our model easy to use and to adapt to many different scenarios. For more details on the model architectures and objective, see Appendix B.

Experimental Setup
We use XLM-R-large (Conneau et al., 2020) as our MMLM, which is pretrained on large volumes of raw multilingual text using MLM training.

Models
We propose three different model architectures, which differ in how the input to the hyperlink classification head is computed. In Token, we use the vector representation of each token in the hyperlink text, $\mathbf{x}_i$, $i \in [n, k]$, as input to the prediction head. In Concat CLS, we use the concatenation $[\mathbf{x}_i ; \mathbf{x}_{\mathrm{CLS}}]$ of the representation of each word in the hyperlink, $i \in [n, k]$, with the [CLS] token representation as input to the prediction head. Finally, in Replace CLS, the input to the prediction head is the representation of each word in the hyperlink, $\mathbf{x}_i$, $i \in [n, k]$, with probability $p_r$, or the [CLS] token representation $\mathbf{x}_{\mathrm{CLS}}$ with probability $1 - p_r$. More details on the architectures are given in Appendix B.1.
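A minimal PyTorch sketch of the three input-construction strategies (function and variable names are ours; the default $p_r = 0.9$ follows the sampling probabilities given in Appendix B.1):

```python
import torch

def head_inputs(x, n, k, cls_index=0, variant="token", p_r=0.9):
    """Build the per-token inputs to the hyperlink prediction head
    for one hyperlink spanning positions n..k (inclusive).
    x: contextualised representations, shape (seq_len, hidden)."""
    span = x[n:k + 1]                       # representations x_n .. x_k
    cls = x[cls_index]                      # [CLS] representation
    if variant == "token":                  # Token: each token as-is
        return span
    if variant == "concat_cls":             # Concat CLS: [x_i ; x_CLS]
        cls_rep = cls.expand(span.size(0), -1)
        return torch.cat([span, cls_rep], dim=-1)
    if variant == "replace_cls":            # Replace CLS: x_i w.p. p_r, else x_CLS
        keep = torch.rand(span.size(0), 1) < p_r
        return torch.where(keep, span, cls.unsqueeze(0))
    raise ValueError(variant)
```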

Methodology
We follow a sequential, three-step approach to training and evaluating our models. We first finetune the pretrained MMLM on the Wikipedia hyperlink prediction task, then finetune again, this time on the target-task training data in English, and finally evaluate the model on non-English target-task evaluation data in a zero-shot crosslingual setting (see Figure 1). We use Wikipedia articles in different sets of languages (Section 3.3) and experiment with many diverse target tasks (Section 3.4).

Wikipedia Languages
We experiment using only English (Wiki EN), 15 different languages (Wiki 15), or 100 Wikipedia languages (Wiki 100). By doing so, i) we include a monolingual albeit resource-rich baseline (Wiki EN), ii) we investigate the impact of including a varied mixture of languages from different families (Wiki 15), and iii) we test whether going massively multilingual has a noticeable impact on crosslingual transferability (Wiki 100).

Target Tasks
Word Sense Disambiguation We follow the zero-shot crosslingual setting of Pasini et al. (2021, XL-WSD), which includes 17 languages plus English: we train on the English SemCor (Miller et al., 1993) dataset merged with the Princeton WordNet Gloss corpus and test on all available languages (Miller et al., 1993; Raganato et al., 2017; Edmonds and Cotton, 2001; Snyder and Palmer, 2004).

Word-in-Context We use the crosslingual Word-in-Context dataset (XL-WiC; Raganato et al., 2020) with data in 12 diverse languages. The task is to predict whether an ambiguous word that appears in two different sentences has the same meaning in both. We finetune the model on the English WiC (Pilehvar and Camacho-Collados, 2019) dataset and evaluate on the 12 XL-WiC languages.

Sentence Retrieval In the XTREME sentence retrieval tasks (BUCC and Tatoeba), XLM-R is tested considering the output of its 14th layer, which, however, is not tuned during our intermediate task. We therefore do not report results on these tasks (see Appendix B.2).

Task Architectures Across all the tasks, we finetune transformer-based models by adding a classification head for each task.

Results and Discussion
Results on the XL-WSD and XL-WiC tasks (Tables 1 and 2) suggest that our models have a better grasp of word-level semantics than XLM-R, which receives no explicit semantic signal during its pretraining. This is consistent across languages and hyperlink prediction architectures, also when compared to the baseline of XLM-R additionally finetuned using MLM training on in-domain Wikipedia data. Our best models outperform the baselines in both tasks by several points. Interestingly, training on 15 languages tends to slightly outperform training on all 100 languages on XL-WSD, but on XL-WiC our best models trained on 100 languages outperform all other configurations most of the time by a reasonable margin. These results corroborate our hunch that the intermediate task injects semantic knowledge into the neural model.
In Table 3, we confirm that our models preserve the sentence-level comprehension capabilities of the underlying XLM-R architecture and perform either comparably or favourably to the baselines in the XTREME benchmark, across target tasks and languages.
Training on the English Wikipedia only can be surprisingly effective at times (Tables 2 and 3), and training on 100 languages shows more consistent improvements only on XL-WiC but fails to lead to similar improvements on other tasks. We note that performance on XL-WSD is similar when using 15 or 100 languages, while our evaluation using XTREME shows that performance is slightly worse when using 100 languages compared to using 15 languages only. We conjecture this could be due to the fact that we finetune only the last two layers of XLM-R (see Appendix B), so the model retains most of the multilingual knowledge it learned during pretraining. We also hypothesise that the English Wikipedia's size (in number of words) and quality (in coverage of our hyperlink vocabulary) may be a reason why training solely on English already brings large gains in transfer to other tasks. For comparison, the English Wikipedia has the most data, i.e., about 73M hyperlinks, whereas the second-largest resource, the German Wikipedia, has only about 28M hyperlinks (see Table 4 in Appendix B). Regarding the coverage of our hyperlink vocabulary with 250k entries, the English Wikipedia covers over 249k hyperlink types at least 10 times each, whereas the second-highest coverage is for the French Wikipedia, which covers over 142k hyperlink types at least 10 times each. We plan to investigate the effect of the size and coverage of hyperlinks further in future work.
Limitations Finally, we highlight that: (1) we report results from single model runs, and therefore have no estimates of the variance of these models; (2) we lack a more thorough hyperparameter search to further consolidate our results. In both cases, we made these choices because of the high cost of training large models such as XLM-R-large.

Conclusions and Future work
We presented a multilingual Wikipedia hyperlink prediction intermediate task to improve the pretraining of contextualised word embedding models. We trained three model variants on different sets of languages, finding that injecting multilingual semantic knowledge consistently improves performance on several zero-shot crosslingual tasks. As future work, we plan to devise a solution to allow crosslingual transferability to scale more efficiently with the number of languages. Finally, we will investigate the impact on resource-poor vs resource-rich languages, and the effect of the size and coverage of hyperlinks in model transferability.
Language sets used for training We finetune MMLM models on the Wikipedia hyperlink prediction task using articles in different sets of languages to investigate the impact of multilingualism. Wiki EN includes only articles in English (en); Wiki 15 includes articles in bg, da, de, en, es, et, eu, fa, fr, hr, it, ja, ko, nl, zh; finally, Wiki 100 includes articles in all 100 languages listed above.
Rationale Wiki EN is a monolingual albeit resource-rich baseline. In Wiki 15, we explore the impact of including languages with different amounts of data and from a mixture of different language families. In Wiki 100, we wish to see if going massively multilingual has a noticeable impact on our models' crosslingual transferability.
Hyperlink extraction We use BabelNet (Navigli and Ponzetto, 2010), a large multilingual knowledge base comprising WordNet, Wikipedia, and many other resources, to map Wikipedia articles in different languages about the same subject onto unique identifiers. For instance, all "computer science" articles (e.g., Ciencias de la computación in Spanish, Computer science in English, Informatik in German, etc.) are mapped to the same identifier $h_t$, in this case bn:00021494n. After each article is mapped to a single identifier, we create prediction targets for every hyperlink by using the identifier of its referenced article. For example, in Figure 3 the text "algorithmic processes" ($x_{n:k}$) refers to the article "Algorithm", which is mapped to the ID bn:00002705n ($h_t$).
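As a toy illustration of the resulting language-independent mapping (the dictionary and helper below are ours; the real table is derived from BabelNet and covers the full 250k-concept vocabulary; the two IDs are the ones cited in the text):

```python
# Toy mapping from (language, referenced article title) to a shared
# BabelNet identifier used as the hyperlink prediction target h_t.
title_to_id = {
    ("en", "Computer science"): "bn:00021494n",
    ("es", "Ciencias de la computación"): "bn:00021494n",
    ("de", "Informatik"): "bn:00021494n",
    ("en", "Algorithm"): "bn:00002705n",
}

def hyperlink_target(lang, title):
    """Return the shared prediction target h_t for a hyperlink, or
    None if the referenced article is outside the 250k vocabulary."""
    return title_to_id.get((lang, title))
```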
In Table 4 we show detailed per-language statistics for the Wikipedia data used in our experiments, including the size of the datasets and the number of hyperlinks appearing in the articles (this count includes only hyperlinks in our hyperlink vocabulary of 250k types).

B Hyperparameters, Training Procedure and Model Architectures
We use XLM-R-large (Conneau et al., 2020), which has a 24-layer encoder with hidden state size 1024. We finetune XLM-R-large using AdamW (Kingma and Ba, 2015; Loshchilov and Hutter, 2018) with learning rate 0.00005, no weight decay, and batch size 16. We train on minibatches with maximum sequence length 256, gradient norm clipped to 1.0, and for 300k model updates. When finetuning XLM-R on Wikipedia hyperlink prediction, we only update the last two layers of the model.
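A sketch of the corresponding optimisation setup, assuming a HuggingFace Transformers checkpoint; matching parameter names on "layer.22."/"layer.23." is our own heuristic for selecting the last two encoder layers:

```python
import torch
from transformers import XLMRobertaModel

model = XLMRobertaModel.from_pretrained("xlm-roberta-large")

# Freeze everything except the last two of the 24 encoder layers
# (the hyperlink prediction head is trained separately on top).
for name, param in model.named_parameters():
    param.requires_grad = ("layer.22." in name) or ("layer.23." in name)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-5,           # learning rate 0.00005
    weight_decay=0.0,  # no weight decay
)
# During training, gradients are clipped to norm 1.0, e.g. via
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```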
Training data sampling We sample batches of training data from each of the languages available, i.e., depending on the experiment these can be English only, 15 languages, or 100 languages. We sample language $l$ with probability $r_l = \min(e_l, K) / \sum_{l'} \min(e_{l'}, K)$, where $e_l$ is the number of examples for language $l$ and the constant $K = 2^{17}$ leads to sampling more often from resource-poor languages (Raffel et al., 2020).
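A minimal sketch of this sampling scheme (the helper name is ours):

```python
def language_sampling_probs(num_examples, K=2**17):
    """Per-language sampling probabilities r_l proportional to
    min(e_l, K); the cap K up-weights resource-poor languages."""
    capped = {lang: min(e, K) for lang, e in num_examples.items()}
    total = sum(capped.values())
    return {lang: c / total for lang, c in capped.items()}

# Example: a resource-rich language is capped at K = 131072, so a much
# smaller language still receives a sizeable share of the batches:
# language_sampling_probs({"en": 10_000_000, "eu": 100_000})
# -> {"en": ~0.567, "eu": ~0.433}
```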
Adaptive softmax We collect hyperlink targets $h_t$ from across Wikipedia articles in all the 100 languages available, sort these hyperlinks from most to least frequent, and keep only the top 250k hyperlink targets. Since hyperlink frequencies follow a Zipfian distribution, we use the adaptive softmax activation (Grave et al., 2017) to predict hyperlinks. We bin hyperlink mentions from most to least frequent, i.e., the most frequent $h_t$ is ranked 1st and the least frequent is ranked 250k-th. We use five bins, which include hyperlinks with ranks in the following intervals: [1, 10k], (10k, 40k], (40k, 50k], (50k, 70k], (70k, 250k]. The adaptive softmax activation is efficient to compute because: (1) we use one matrix multiplication for each bin, drastically reducing the number of parameters; and (2) the latter bins are only computed in case there is at least one entry in the minibatch with a target in that bin. The five weight matrices that parameterise each bin in our adaptive softmax layer have sizes hdim × 10,000, hdim × 30,000, hdim × 10,000, hdim × 20,000, and hdim × 180,000, respectively. Since bins are constructed so that the least frequent hyperlinks are placed in the latter bins, we rarely need to compute them. This is especially important for the last bin, which is the most costly to compute (and is rarely used).
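For reference, PyTorch ships an adaptive softmax module that accepts the same cutoffs; the sketch below is ours, not necessarily the authors' implementation (note that by default this module also projects tail clusters to smaller dimensions, whereas we describe full hdim × |bin| matrices above):

```python
import torch

hdim = 1024  # XLM-R-large hidden size

# Bins [1,10k], (10k,40k], (40k,50k], (50k,70k], (70k,250k] correspond
# to cutoffs at ranks 10k, 40k, 50k and 70k over the 250k targets.
adaptive_softmax = torch.nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hdim,
    n_classes=250_000,
    cutoffs=[10_000, 40_000, 50_000, 70_000],
)

hidden = torch.randn(8, hdim)              # head inputs for 8 hyperlink tokens
targets = torch.randint(0, 250_000, (8,))  # frequency-ranked target IDs
log_probs, loss = adaptive_softmax(hidden, targets)
```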

B.1 Model Architectures
We refer the reader to the mathematical notation in Section 2 (Approach). The Wikipedia hyperlink prediction head for a single hyperlink using each of our models is shown below. Token is computed in Equation 1:

$$p(h_t = k \mid \mathbf{x}_i) = \mathrm{AdaptiveSoftmax}_k\left(\mathbf{W}_t \mathbf{x}_i + \mathbf{b}_t\right), \quad i \in [n, k] \tag{1}$$

where $\mathrm{AdaptiveSoftmax}_k$ computes the probability of the hyperlink target $h_t = k$, $x_{n:k}$ is a hyperlink consisting of words $\{x_n, \cdots, x_k\}$, and $\mathbf{W}_t$ and $\mathbf{b}_t$ are trained parameters.
Concat CLS is computed in Equation 2:

$$p(h_t = k \mid \mathbf{x}_i) = \mathrm{AdaptiveSoftmax}_k\left(\mathbf{W}_c\, [\mathbf{x}_i ; \mathbf{x}_{\mathrm{CLS}}] + \mathbf{b}_c\right), \quad i \in [n, k] \tag{2}$$

where $\mathrm{AdaptiveSoftmax}_k$ computes the probability of the hyperlink target $h_t = k$, $x_{n:k}$ is a hyperlink consisting of words $\{x_n, \cdots, x_k\}$, and $\mathbf{W}_c$ and $\mathbf{b}_c$ are trained parameters.

Replace CLS is computed in Equation 3:

$$p(h_t = k \mid \mathbf{x}_i) = \mathrm{AdaptiveSoftmax}_k\left(\mathbf{W}_r\, \mathrm{sample}(\mathbf{x}_i, \mathbf{x}_{\mathrm{CLS}}) + \mathbf{b}_r\right), \quad i \in [n, k] \tag{3}$$

where $\mathrm{sample}(a, b)$ samples $a$ or $b$ with probability 0.9 and 0.1, respectively; $\mathrm{AdaptiveSoftmax}_k$ computes the probability of the hyperlink target $h_t = k$, $x_{n:k}$ is a hyperlink consisting of words $\{x_n, \cdots, x_k\}$, and $\mathbf{W}_r$ and $\mathbf{b}_r$ are trained parameters.

B.1.1 XL-WSD
We freeze the pretrained MMLM model weights and simply add a trained classification head on top of the pretrained MMLM. We compute representations for each subword as the sum of the last 4 layers of the model, and for each word as the average of its subword representations (Bevilacqua and Navigli, 2020).
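A sketch of this representation step, assuming per-layer hidden states as returned by a HuggingFace encoder with output_hidden_states=True (the helper name is ours):

```python
import torch

def word_representation(hidden_states, subword_indices):
    """hidden_states: sequence of (seq_len, hidden) tensors, one per
    layer. Each subword vector is the sum of the last 4 layers; the
    word vector is the average over its subword positions."""
    last4 = torch.stack(list(hidden_states[-4:])).sum(dim=0)
    return last4[subword_indices].mean(dim=0)
```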

B.1.2 XL-WiC
We follow Raganato et al. (2020) and add a binary classification head on top of the pretrained MMLM, which takes as input the concatenation of the target word's embeddings in the two contexts. We use the output of the 24th layer as the target word's representation.
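A minimal sketch of this binary head (the class name is ours):

```python
import torch

class WiCHead(torch.nn.Module):
    """Binary classifier over the concatenation of the target word's
    24th-layer representations in the two contexts."""
    def __init__(self, hdim=1024):
        super().__init__()
        self.classifier = torch.nn.Linear(2 * hdim, 2)

    def forward(self, target_ctx1, target_ctx2):
        # target_ctx1/2: (batch, hdim) target-word vectors
        return self.classifier(torch.cat([target_ctx1, target_ctx2], dim=-1))
```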

B.1.3 XTREME
We use the Jiant library (Pruksachatkun et al., 2020) to carry out the evaluation on XTREME. We use the output of the 24th layer as the input token representations so as to better measure the impact of our intermediate training on the XTREME tasks.

B.2 XTREME Sentence Retrieval Tasks
BUCC (Zweigenbaum et al., 2018) and Tatoeba (Artetxe and Schwenk, 2019) are two unsupervised tasks that require, given a sentence in a language L, retrieving its closest sentence in another language L′. The XTREME baselines use the average of the 14th-layer outputs to represent a sentence. Since our intermediate training procedure only tunes the last two layers, the output of the 14th layer would be exactly the same as for the plain XLM-R baseline. For this reason, we do not report results on these two tasks.
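For completeness, a sketch of how such a sentence representation is computed (mean pooling over non-padding tokens; the helper name is ours, and we assume hidden_states[14] indexes the 14th encoder layer, with the embedding output at index 0):

```python
def sentence_representation(hidden_states, attention_mask):
    """Average the 14th-layer token outputs over non-padding positions
    to obtain a fixed-size sentence vector for nearest-neighbour
    retrieval."""
    h = hidden_states[14]                            # (batch, seq, hidden)
    mask = attention_mask.unsqueeze(-1).to(h.dtype)  # (batch, seq, 1)
    return (h * mask).sum(dim=1) / mask.sum(dim=1)
```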