X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as "Punta Cana is located in _." However, while knowledge is both written and queried in many languages, studies on LMs' factual representation ability have almost invariably been performed on English. To assess factual knowledge retrieval in LMs in different languages, we create a multilingual benchmark of cloze-style probes for 23 typologically diverse languages. To properly handle language variations, we expand probing methods from single- to multi-word entities, and develop several decoding algorithms to generate multi-token predictions. Extensive experimental results provide insights into how well (or poorly) current state-of-the-art LMs perform at this task in languages with more or fewer available resources. We further propose a code-switching-based method to improve the ability of multilingual LMs to access knowledge, and verify its effectiveness on several benchmark languages. Benchmark data and code have been released at https://x-factr.github.io.


Introduction
Language models (LMs; Church, 1988; Kneser and Ney, 1995; Bengio et al., 2003) learn to model the probability distribution of text, and in doing so capture information about various aspects of the syntax and semantics of the language at hand. Recent work has presented intriguing results demonstrating that modern large-scale LMs also capture a significant amount of factual knowledge (Petroni et al., 2019; Jiang et al., 2020; Poerner et al., 2019). This knowledge is generally probed by having the LM fill in the blanks of cloze-style prompts such as "Obama is a _ by profession.", where these prompts are invariably written in English. However, there are many languages of the world other than English, and it is quite conceivable that (1) users may want to query this factual knowledge in other languages, and (2) some facts will be written in non-English languages, and thus multilingually trained LMs (hereinafter, M-LMs) may be better equipped to recall these facts in the languages of the original data. In this paper, we study the intersection of multilinguality and the factual knowledge included in LMs.

*: Work done at Carnegie Mellon University. The first two authors contributed equally.

Figure 1: X-FACTR contains 23 languages, for which data availability varies dramatically. Prompts are instantiated to produce grammatical sentences with different numbers of mask tokens and are used to obtain predictions for [Y]. In this Spanish example, the verb "fundar" ("to found") is rendered as "fundada" to agree in gender and number with the subject "Bloomberg L.P.". The final prediction is in bold.
We create a new multilingual benchmark for probing factual knowledge in LMs: the Cross-lingual FACTual Retrieval benchmark (X-FACTR). X-FACTR shares a similar formulation with the LAMA benchmark of Petroni et al. (2019), which assesses whether LMs have memorized a fact (i.e., a subject-relation-object triple) by having LMs predict the blank (i.e., the object) in a cloze-style prompt for each relation after filling in the subject. We manually create such prompts for 23 languages spanning different language families and different levels of data availability (§3.1). Because many of the languages we handle are morphologically rich, we design a morphology-sensitive annotation schema (see the example in Fig. 1) that can properly instantiate prompts using entity metadata (e.g., gender) and a morphological inflection model (§3.3).
In addition, while previous works (Petroni et al., 2019; Jiang et al., 2020; Poerner et al., 2019) have limited their examination to single-token entities (e.g. "France"), we expand our setting to include multi-token entities (e.g. "United States"), which comprise more than 75% of the facts included in our underlying database (Wikidata; §3.2). We propose several decoding algorithms for predicting these multi-token entities using masked LMs (§4). We discuss related work in depth in §7.
We perform experiments on X-FACTR (§5), comparing and contrasting across languages and LMs to answer the following research questions: (1) How and why does performance vary across different languages and models? (2) Can multilingual pre-training increase the amount of factual knowledge in LMs over monolingual pre-training? (3) How much does the knowledge captured in different languages overlap? We find that factual knowledge retrieval with M-LMs is easier in high-resource languages than in low-resource languages, but the overall performance is relatively low, indicating that this is a challenging task. We analyze the types of failure cases, shedding light on future directions for improving factual knowledge in M-LMs. In addition, multilingual pre-training does not necessarily lead to a higher recall of facts compared to language-specific monolingual pre-training. The knowledge memorized by M-LMs is in fact largely distinct across languages, with almost 50% of facts being recalled in only one language. Inspired by the above observations, we propose a code-switching-based objective function to improve the ability of M-LMs to access knowledge using queries from a variety of languages. We replace entities in a sentence in the original language with counterparts in another language, and further fine-tune the LM on these code-switched data (§6). We perform experiments on three languages (French, Russian, and Greek, code-switched with English). Results demonstrate that this code-switching-based learning can successfully improve knowledge retrieval with low-resource language prompts.

Retrieving Facts from LMs
In this paper we follow the protocol of Petroni et al. (2019)'s English-language LAMA benchmark, which targets factual knowledge expressed in the form of subject-relation-object triples from Wikidata, curated in the T-REx dataset (ElSahar et al., 2018). The cloze-style prompts used therein are manually created and consist of a sequence of tokens, where [X] and [Y] are placeholders for subjects and objects (e.g. "[X] is a [Y] by profession."). To assess the existence of a certain fact, [X] is replaced with the actual subject (e.g. "Obama is a [MASK] by profession.") and the model predicts the object in the blank: ŷ_i = argmax_{y_i} p(y_i | s_{i:i}), where s_{i:i} is the sentence with the i-th token masked out. Finally, the predicted fact is compared to the ground truth. In the next section, we extend this setting to more languages and to predicting multiple tokens instead of a single one.
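The probing protocol can be sketched as follows. This is a minimal toy illustration, not the actual evaluation code: the scorer is a hypothetical stand-in (a fixed probability table) for a real masked LM such as M-BERT, and the function names are ours.

```python
def instantiate_prompt(template, subject):
    """Fill the [X] slot with the subject and turn [Y] into a mask token."""
    return template.replace("[X]", subject).replace("[Y]", "[MASK]")

def predict_object(sentence, vocab, score):
    """y-hat = argmax_y p(y | masked sentence); `score` is any function
    returning p(y | context) for a candidate token y."""
    return max(vocab, key=lambda y: score(sentence, y))

# Illustrative fixed probabilities (not real model outputs).
toy_probs = {"politician": 0.6, "lawyer": 0.3, "singer": 0.1}

sent = instantiate_prompt("[X] is a [Y] by profession.", "Obama")
pred = predict_object(sent, list(toy_probs), lambda s, y: toy_probs[y])
```

With a real LM, `score` would be replaced by the model's probability for the candidate token at the masked position.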

Multilingual Multi-token Factual Retrieval Benchmark

Facts
While Petroni et al. (2019) and follow-up works focus on entities that can be represented by a single token, many popular entities consist of multiple tokens (e.g. "United States"); we argue that it is crucial to include multi-token entities in the benchmark to keep the evaluation unbiased. Similar to Petroni et al. (2019), we use the T-REx dataset to collect facts for our benchmark. Since T-REx aligns facts from Wikidata with sentences in abstract sections from DBpedia, we can estimate the commonality of each fact by how frequently it is grounded to a sentence in these abstracts.
For each of the 46 relations in T-REx, we sample 1000 subject-object pairs with probability proportional to their frequency. Frequency-proportional sampling makes the distribution of the facts in our benchmark close to real usage and covers facts of different popularity. To keep the benchmark unbiased, we did not constrain the facts with any language-related criteria (e.g., require the entities to have translations in all languages we considered). As a result, some entities (either subjects or objects) might not have translations in all languages. The number of facts in different languages in our multilingual multi-token X-FACTR benchmark is shown in Tab. 1. Because many modern pre-trained M-LMs almost invariably use some variety of subword tokenization, the number of tokens an entity contains will depend on the tokenization method used in the LM. We report the statistics based on the WordPiece tokenization used in multilingual BERT (Devlin et al., 2019). The tokenization scheme statistics for the other M-LMs are similar.
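The frequency-proportional sampling step can be sketched as below; this is a simplified illustration (function name, interface, and the example facts are ours), drawing distinct pairs with probability proportional to their grounded-sentence counts.

```python
import random

def sample_facts(pairs, freqs, k, seed=0):
    """Sample up to k distinct (subject, object) pairs, each draw made with
    probability proportional to how often the fact is grounded to a sentence."""
    rng = random.Random(seed)
    pairs, freqs = list(pairs), list(freqs)
    chosen = []
    while pairs and len(chosen) < k:
        i = rng.choices(range(len(pairs)), weights=freqs)[0]
        chosen.append(pairs.pop(i))  # remove so each fact is sampled once
        freqs.pop(i)
    return chosen

# Hypothetical facts and grounding counts for illustration.
facts = [("Obama", "politician"), ("Bloomberg L.P.", "New York City"),
         ("Punta Cana", "Dominican Republic")]
counts = [50, 5, 1]
sample = sample_facts(facts, counts, k=2)
```

Frequent facts are much more likely to appear in `sample`, but rare ones are not excluded, matching the coverage goal described above.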

Prompts
Some languages we include in the benchmark require additional handling of the prompts to account for their grammar or morphology. For example, (some) named entities inflect for case in languages like Greek, Russian, Hebrew, or Marathi. In some languages syntactic subjects and objects need to be in particular cases. Similarly, languages often require that the verb or other parts of the sentence agree with the subject or the object on some morphological features like person, gender, or number.
Our prompts provide the necessary information to generate grammatical sentences, given the gender and number of the entities. For example, the Russian prompt for "[X] was born in [Y]" denotes that the subject ([X]) needs to be in the nominative (Nom) case and the object ([Y]) needs to be inflected in the essive (Ess) case. The prompt also accounts for variation in the gender of [X], providing options (separated by |) for the subject being masculine, feminine, or neuter (MASC, FEM, NEUT respectively). Everything within square brackets gets concretely instantiated given the subject and object. Grammatical gender is assigned through a combination of Wikidata information and language-specific heuristics, constructed based on feedback from native speakers of each language. When the entity corresponds to a person, we retrieve their "sex_or_gender" properties from Wikidata. In addition, for languages like Greek or French, the gender of an entity can be inferred with fairly high certainty from the form of the word (e.g. by looking at the ending). Last, some categories of entities (such as cities, countries, organizations, etc., which can be obtained using the "instance_of" Wikidata property) often get assigned a general grammatical gender based on the category.
Once all the morphological features have been specified as detailed above, we use the unimorph_inflect package (Anastasopoulos and Neubig, 2019) to generate the appropriately inflected surface form of the bracketed words. We note that the target entity ([Y]) might also need to be inflected, as in the Russian example above, in which case we require the model's predictions to match the inflected target forms.
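A minimal sketch of how a gender-annotated prompt might be instantiated is shown below. The annotation syntax (`[MASC:…|FEM:…|NEUT:…]`) is a simplified stand-in for the benchmark's actual schema, and the identity `inflect` default stands in for unimorph_inflect; the French example prompt is ours.

```python
import re

def instantiate(template, subject, gender, inflect=lambda w, feats: w):
    """Resolve [MASC:...|FEM:...] alternatives according to the subject's
    gender, then fill [X]; `inflect` stands in for a morphological
    inflection model such as unimorph_inflect."""
    def pick(match):
        options = dict(part.split(":", 1) for part in match.group(1).split("|"))
        return options[gender]
    out = re.sub(r"\[(MASC:[^\]]+)\]", pick, template)
    return out.replace("[X]", inflect(subject, "Nom"))

# Hypothetical annotated French prompt with gendered participle forms.
tmpl = "[X] est [MASC:né|FEM:née] à [Y]."
sent = instantiate(tmpl, "Marie Curie", "FEM")
```

For a feminine subject this yields "Marie Curie est née à [Y]."; the [Y] slot is left for the LM to predict.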
To verify the quality of the prompts we performed user studies with native speakers, finding that 88% on average were judged as natural and grammatically correct. Details are shown in Appendix B, but it is worth noting that the majority of errors are due to prompts being awkward or incorrect for some senses captured by the relation, and not due to our gender heuristics or automatic inflection. This issue is also present in the LAMA English prompts (Jiang et al., 2020).

Evaluation
As noted in Petroni et al. (2019), because some subject-relation pairs might have multiple correct objects (e.g., America maintains diplomatic relations with multiple countries), we collect all valid objects and judge a prediction as correct if it can match any object (e.g., both France and Canada are correct). Since an entity might have multiple aliases (e.g., "America" and "the US"), we collect all aliases for each entity from Wikidata, and the prediction is marked as correct if it can match any one of them after lowercasing.
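The alias-aware matching rule above can be sketched as a small helper (function name and example aliases are ours, for illustration):

```python
def is_correct(prediction, gold_objects, aliases):
    """A prediction counts as correct if it matches any alias of any
    valid object for the subject-relation pair, case-insensitively."""
    pred = prediction.strip().lower()
    for obj in gold_objects:
        # Fall back to the canonical name if no aliases are recorded.
        for alias in aliases.get(obj, [obj]):
            if pred == alias.lower():
                return True
    return False

aliases = {"United States": ["United States", "America", "the US"]}
hit = is_correct("america", ["United States"], aliases)
miss = is_correct("Canada", ["United States"], aliases)
```

This reflects the two relaxations described above: multiple valid objects per subject-relation pair, and multiple aliases per object.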

Multi-token Decoding
As Tab. 1 shows, many facts involve multi-token entities, and thus an LM would need to predict these entities in multiple steps. Generating multiple predictions is straightforward for traditional left-to-right LMs (Sundermeyer et al., 2015; Radford et al., 2019), where we can autoregressively decode the next token conditioned on previous tokens. However, many pre-trained LMs such as BERT (Devlin et al., 2019) are masked LMs that predict individual words given left and right contexts, and decoding from such masked LMs remains an open problem (Lawrence et al., 2019; Salazar et al., 2020; Ghazvininejad et al., 2019; Wang and Cho, 2019; Cho, 2019). We systematically examined different multi-token decoding algorithms from three orthogonal perspectives: (1) how the initial predictions are produced, (2) how to refine the predictions, and (3) other commonly used components in neural text generation systems. We assume that the masked LM defines the following conditional probability distribution for a sentence with n tokens:

p(y_i | x_1, ..., x_{i-1}, mask_i, x_{i+1}, ..., x_n),    (1)

where the subscript of mask indicates its position, and each surrounding token x_· can be either an actual word or mask. We aim to handle sentences containing multiple mask tokens, conditioning on the surrounding actual words:

s_{i:j} = x_1, ..., x_{i-1}, mask_i, ..., mask_j, x_{j+1}, ..., x_n.    (2)

Figure 2: Illustration of the three initial prediction and refinement methods ((a) Independent, (b) Order-based, (c) Confidence-based) on the prompt "Barack Obama is a _ by profession." Green boxes are mask tokens to be filled, and subscripts indicate the prediction order.

Initial Prediction and Refinement
Given a sentence with multiple mask tokens, e.g., Eq. 2, we can either generate outputs in parallel independently or one at a time conditioned on the previously generated tokens. These methods are similar to the prediction problems that BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019b) solve in their respective pre-training stages. We define c ∈ R^n as the probability of each prediction, with details varying by prediction method.
After all mask tokens are replaced with the initial predictions, i.e., ŝ_{i:j} = x_1, ..., ŷ_i, ..., ŷ_j, ..., x_n, we can further refine the predictions by iteratively modifying one token at a time until convergence or until the maximum number of iterations is reached. Here we outline the algorithms at a high level and provide concrete details in Appendix C.

Independent. For independent initial prediction (Fig. 2a), the mask tokens are all predicted in parallel (at once). We also consider two autoregressive methods for initial prediction or refinement.

Order-based. Mask tokens are predicted from left to right, each step also conditioning on the previously generated tokens (Fig. 2b). In the refinement stage, we likewise modify predictions from left to right, and convergence is reached when a left-to-right scan produces no changes.

Confidence-based. In each step, we choose the prediction with the highest probability, so the order of predictions can be arbitrary (Fig. 2c). In the refinement stage, we choose from all predicted tokens the one with the lowest confidence (i.e., the lowest probability) and re-predict it, similarly to Ghazvininejad et al. (2019). Convergence is reached when the re-predicted token is the same as the original token.
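The confidence-based initial prediction can be sketched as a toy loop. This is a hedged illustration only: `predict` here is a hypothetical position-based lookup standing in for a real masked LM (which would re-score every remaining mask given the current partially filled sentence), and the names are ours.

```python
def confidence_decode(tokens, predict):
    """Fill [MASK] slots highest-confidence-first. `predict(tokens, i)`
    stands in for a masked LM and returns (best_token, prob) for slot i."""
    tokens, conf = list(tokens), {}
    while any(t == "[MASK]" for t in tokens):
        # Re-score every remaining mask, then commit the most confident one.
        cands = [(i,) + predict(tokens, i)
                 for i, t in enumerate(tokens) if t == "[MASK]"]
        i, tok, p = max(cands, key=lambda c: c[2])
        tokens[i], conf[i] = tok, p
    return tokens, conf

# Hypothetical per-position scores for illustration (not model outputs).
table = {4: ("United", 0.7), 5: ("States", 0.9)}
out, conf = confidence_decode(
    ["Obama", "was", "born", "in", "[MASK]", "[MASK]"],
    lambda toks, i: table[i])
```

Here the more confident second slot ("States", 0.9) is committed before the first, so the prediction order is arbitrary rather than left-to-right.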

Final Prediction
Because we do not know the number of tokens of the ground truth in advance, we enumerate from 1 to M mask tokens and choose the final prediction based on confidence. Given the prompt in Eq. 2, the simplest way to compute the confidence is the pseudo log-likelihood, the sum of the log-probabilities of each predicted token conditioned on the other tokens (Salazar et al., 2020): v(j − i + 1) = Σ_{k=i}^{j} log c_k, where c_k is the confidence (probability) of the k-th predicted token and v(m) is the overall prediction confidence with m initial mask tokens. Among the M predictions, we choose the one with the highest confidence.
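The length-selection step can be sketched directly from the formula above (function names are ours; the candidate confidences are made up for illustration):

```python
import math

def pseudo_log_likelihood(confidences):
    """v(m) = sum of log-probabilities of the m predicted tokens."""
    return sum(math.log(c) for c in confidences)

def best_length(candidates):
    """candidates maps a mask count m (1..M) to the per-token confidences
    of the filled-in prediction; return the highest-scoring m."""
    return max(candidates, key=lambda m: pseudo_log_likelihood(candidates[m]))

# Hypothetical per-token confidences for 1-, 2-, and 3-mask predictions.
cands = {1: [0.2], 2: [0.7, 0.8], 3: [0.6, 0.5, 0.4]}
m_star = best_length(cands)
```

Note that summing log-probabilities tends to favor shorter predictions, which motivates the length normalization discussed next.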

Additional Components
We also investigate additional components commonly used in neural generation systems. Specifically, we consider length normalization in computing the final confidence (i.e., divide v(m) by the number of mask tokens m) because a simple sum might favor short predictions. In addition, the confidence value c in previous methods contains probabilities when the predictions are first generated, which will become stale once the surrounding tokens change (Ghazvininejad et al., 2019). We consider re-computing confidence c whenever a change happens. Last, we attempted beam search to keep track of the most plausible B predictions at each step. Details of these components can be found in Appendix C, along with a general schema of the overall decoding algorithm in Alg. 1.
We set the maximal number of mask tokens to M = 5 for English, French, Dutch, and Spanish; in these languages more than 90% of entities split into ≤5 tokens. For all other languages we use M = 10. This is expected, because the vocabulary of M-LMs based on WordPiece tokenization is dominated by frequent words, and words in low-resource languages tend to split into more pieces (Ács, 2019). We set the maximal number of iterations to T = 2M, so that we can approximately refine all the predicted tokens once for a sentence with M mask tokens (the initial prediction takes exactly M iterations). In our main results, we report results with two decoding algorithms: the simplest independent generation method and the confidence-based method for both initial and refinement predictions. The latter performs better than order-based methods, as we show in Tab. 3. To save computation time, we only use confidence re-computation for M = 5. We discuss computational complexity in Appendix C.

5: Yoruba is not in the training data of XLM and XLM-R.
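One way to derive such a per-language cutoff M from subword statistics is the 90% coverage rule described above; a minimal sketch (function name and interface are ours):

```python
import math

def choose_max_masks(entity_token_counts, coverage=0.9):
    """Return the smallest M such that at least `coverage` of entities
    tokenize into <= M subword pieces."""
    counts = sorted(entity_token_counts)
    idx = math.ceil(coverage * len(counts)) - 1
    return counts[idx]

# Hypothetical token counts for ten entities under some tokenizer.
m = choose_max_masks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Languages whose entities fragment into many pieces under WordPiece would naturally receive a larger M under this rule.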
Evaluation Metrics. We follow Petroni et al. (2019), computing the accuracy of the predicted objects for each relation and macro-averaging over relations for the final scores. For fine-grained analysis of different decoding methods, pre-trained LMs, and languages, we report results on all facts as well as on subsets consisting only of single-token objects (single) and multi-token objects (multi).
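The macro-averaged metric can be sketched as follows (function name and the example relation IDs are ours, for illustration):

```python
def macro_accuracy(results):
    """results maps a relation to per-fact correctness flags; average the
    accuracy within each relation, then average over relations."""
    per_relation = [sum(flags) / len(flags) for flags in results.values()]
    return sum(per_relation) / len(per_relation)

# Toy example: relation P19 at 1/2 correct, P106 at 3/4 correct.
acc = macro_accuracy({"P19": [True, False],
                      "P106": [True, True, True, False]})
```

Macro-averaging keeps relations with many facts from dominating the score, unlike a plain micro-average over all facts.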

Experimental Results
We run both the independent and confidence-based decoding methods with 3 M-LMs and, where available, 8 monolingual LMs, across 23 languages, with results shown in Fig. 3. Overall, even in the most favorable settings, the performance of state-of-the-art M-LMs at retrieving factual knowledge in the X-FACTR benchmark is relatively low: less than 15% on high-resource languages (e.g., English and Spanish) and less than 5% on some low-resource languages (e.g., Marathi and Yoruba). This may initially come as a surprise, given the favorable performance reported in previous papers (Petroni et al., 2019; Jiang et al., 2020), which achieved accuracies over 30% on English. We explain this discrepancy in the following analysis. We note that, although we provide baseline results in almost all languages, we perform our extensive analysis on a representative subset of 13 languages.
Performance on Different Languages. Performance on high-resource languages is usually better than on middle- or low-resource languages, regardless of the (M-)LM. This is probably because high-resource languages have more data in the pre-training stage. It is also possible that even if a fact is written in the available data for a low-resource language, it is not appropriately memorized due to lack of model capacity or forgetting (Kirkpatrick et al., 2017).

Figure 3: Accuracy on different languages using different LMs (%). Independent prediction (solid bars) outperforms confidence-based prediction (no-fill bars) on high-resource languages but not on low-resource languages. Different models are color-coded, with missing/unsupported models marked with ×. Languages are ranked by the total number of facts in our benchmark. Details in Appendix Tab. 10.

It is worth noting that the best results are in Indo-European languages, which not only have the most data but also share the same (Latin) script, which could further facilitate cross-lingual learning.
Performance of Different LMs. Comparing the performance of different M-LMs, we find that M-BERT outperforms XLM and XLM-R on high-resource languages, while on low-resource languages performance is similar. This contradicts the conclusions drawn on other cross-lingual tasks, such as natural language inference and syntactic prediction, as reported in Hu et al. (2020). Our conjecture is that because factual knowledge probing requires retrieving the identity and relations of individual entities, it is more fine-grained than the coarser understanding of syntactic and semantic classes required to solve those tasks. We posit that pre-training methods that show superior performance on inference and syntactic prediction tasks (i.e., XLM-R) might achieve good syntactic/semantic abstraction at the cost of making less concrete lexical distinctions.
Comparing M-BERT with language-specific LMs, we find that M-BERT outperforms the monolingual BERT on Dutch, Spanish, and Greek, while underperforming on English, Russian, Chinese, and Turkish. Since most of the LMs follow the architecture and pre-training settings of BERT (Devlin et al., 2019) or RoBERTa, we hypothesize that the training corpus is the major contributor to the final performance, and summarize those corpora in Tab. 8 in the Appendix. Another potential explanation is model capacity limitations.

Single-token vs. Multi-token. Since we choose among M candidate predictions with different numbers of mask tokens based on confidence, it is possible that the prediction with the correct number of mask tokens has lower confidence than the other predictions. To investigate the errors introduced by this step, we conduct an ablation experiment that assumes we know the ground-truth number of mask tokens. As shown in Fig. 4, performance improves significantly, by 75% on average across all languages, with the oracle mask number, indicating that pre-trained LMs have difficulty choosing the correct number of mask tokens. The performance on single-token facts (i.e., the setting of previous works that only predict a single token) is even higher, demonstrating the difficulty of multi-token prediction.

Error Analysis. The most prominent error type, about one-fourth of mistakes for all LMs, was repeating subjects, whereby the prediction repeats either the full or partial subject. Predicting wrong entities is also fairly common, especially in Spanish (29%). Interestingly, we find that wrong predictions are often a language-specific "common" entity, such as 'Αθήνα' (Athens, the capital of Greece) in Greek location prompts, while the Spanish model insisted most musicians play 'flauta' (flute).
Another error type, particularly common in Greek (27%), is producing non-informative output, where the predictions are function words that could never be an entity. Type errors, where the semantic type of the prediction differs from the expected one (e.g. predicting dates instead of locations), are fairly common (English: 8%, Spanish: 6%), as are related-concept predictions (English: 7%), where the model predicts relevant, possibly factually correct entities (e.g. predicting a country or a state instead of a city). Worryingly, in a fair number of cases (English: 5%, Spanish: 8%, Greek: 11%) the models output non-existent words (unk). Errors of the last four types could potentially be avoided by limiting the allowed outputs of the model to specific entity classes; we leave this for future work. Last, we identified around 3% false negatives, where the prediction is actually correct but not part of our alias list, and less than 1% inflection errors, where the prediction is the correct entity but improperly inflected.

Performance of Different Decoding Methods.
Overall, the confidence-based decoding method improves accuracy in middle- and low-resource languages, while it hurts performance on high-resource languages. To better understand the effect of different components on the final performance, we conduct a comprehensive comparison on English and Chinese. We compare the three initial prediction methods and the three refinement options (including not performing refinement), for a total of nine decoding methods (§4.1). We further apply the additional improvements (§4.3) on top of the confidence-based decoding method.
Comparing the performance in Tab. 3, we first see that advanced decoding methods improve performance on multi-token objects but hurt performance on single-token ones. The best-performing decoding method on English improves the multi-token accuracy from 5.57% to 11.06%, indicating that advanced decoding methods have a better chance of eliciting multi-token facts from M-BERT. Some examples are shown in Tab. 7 in the Appendix. The lower performance on single-token objects is probably caused by the fact that advanced decoding methods discover multi-token predictions that have higher confidence than single-token ones (§4.2). For example, the single-token prediction for "Enrique Iglesias used to communicate in _." is "Spanish", while the best decoding method outputs "his own words" with higher confidence. Second, initial prediction methods have a greater effect on the final performance than refinement methods. We hypothesize that this is because the greedy decoding process heavily depends on previous predictions, and refinement cannot recover from unsatisfactory initial predictions. Third, length normalization was not found useful in either case. There are also observations that are not consistent across the two languages. First, since Chinese has a larger portion of multi-token objects than English (as shown in Tab. 1), the overall performance on Chinese increases while it decreases on English, which is consistent with the observation in Fig. 3. Second, confidence re-computation and beam search are not as effective on Chinese, which we conjecture is because the distribution over English sentences exhibits more multimodality than that over Chinese sentences, due to more training data.

Improving Multilingual LM Retrieval
As the performance of M-LMs is relatively low, especially on low-resource languages, an obvious endeavor is to refine the model to improve fact retrieval performance in various languages. We analyze how similarly M-BERT performs on queries in different languages. We collect correctly predicted facts across all languages, and count in how many languages each fact was retrieved correctly. As shown in the bottom-left histogram of Fig. 5, half of the correctly predicted facts were correct in a single language, indicating little overlap across languages (Lin et al., 2018). Only 3% of facts were correct in more than 5 languages, and objects in those facts are usually sub-strings of subjects, making them easy to retrieve regardless of the language. This observation is also confirmed by the overlap between pairs of languages in the top-right chart of Fig. 5; even the most similar languages (i.e., English and Dutch) only have 34% of correct predictions in common.
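The overlap analysis above boils down to counting, for each fact, the number of languages in which it was predicted correctly. A minimal sketch (function name and the toy fact IDs are ours):

```python
from collections import Counter

def language_overlap(correct_by_lang):
    """correct_by_lang maps a language to the set of fact IDs it predicted
    correctly; return a histogram {n_languages: n_facts}."""
    per_fact = Counter()
    for facts in correct_by_lang.values():
        for fact_id in facts:
            per_fact[fact_id] += 1
    return Counter(per_fact.values())

# Toy data: fact 3 is recalled in all three languages, facts 1 and 4 in one.
hist = language_overlap({"en": {1, 2, 3}, "nl": {2, 3}, "zh": {3, 4}})
```

A histogram dominated by the n = 1 bucket, as observed in Fig. 5, indicates that the knowledge recalled in different languages is largely distinct.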
We find that facts retrievable only in a single language tend to be knowledge that is mainly mentioned in a certain language. For example, M-BERT mistakenly predicts "QQ" in the English sentence "Tencent QQ is developed by _.", while the prediction "腾讯" (Tencent) in the corresponding Chinese sentence "腾讯QQ是由_开发的。" is correct. This is probably because Tencent, a Chinese company, is more frequently mentioned in the Chinese training corpus.

Methods
Inspired by these observations, we propose to use code-switching to create data to fine-tune pre-trained LMs, replacing entity mentions in one language (e.g., English/Greek) with their counterparts in another language (e.g., Greek/English). Through this bi-directional code-switching, entity mentions serve as pivots, enabling knowledge that was originally learned in one language to be shared with others. Given a pair of languages, we first identify Wikipedia sentences that mention entities from our benchmark using SLING (Ringgaard et al., 2017). The M-LM is then fine-tuned on these sentences. Following Wu et al. (2020), with 30% probability we switch all the entity mentions (one or more) in a sentence from the original language to their counterparts in the other language, ending up with sentences like "Ομπάμα later reflected on his years ...", where we substituted "Obama" with a Greek mention of the entity, and vice versa for Greek-to-English. The remaining 70% of sentences stay unchanged. If there are multiple mention texts for an entity, we sample proportionally to their frequencies, which in our preliminary experiments performed better than using a fixed translation. We fine-tune M-BERT using the masked LM objective on this data, with 15% of non-mention words and 50% of mention words masked out.
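The data-creation step can be sketched as below. This is a simplified illustration (function name, span format, and the single-translation lookup are ours; the real pipeline samples among multiple mention texts by frequency):

```python
import random

def code_switch(tokens, mentions, translations, p=0.3, rng=None):
    """With probability p, replace every entity mention span with its
    counterpart in the other language. `mentions` is a list of
    (start, end, entity_id) token spans; `translations` maps entity IDs
    to a mention string in the target language."""
    rng = rng or random.Random(0)
    if rng.random() >= p:
        return list(tokens)  # most sentences stay unchanged
    out, prev = [], 0
    for start, end, entity_id in sorted(mentions):
        out += tokens[prev:start] + [translations[entity_id]]
        prev = end
    return out + tokens[prev:]

tokens = "Obama later reflected on his years".split()
spans = [(0, 1, "Q76")]  # "Obama" spans token 0
switched = code_switch(tokens, spans, {"Q76": "Ομπάμα"}, p=1.0)  # force switch
```

The switched sentences are then mixed with unchanged ones and fed to the masked-LM fine-tuning objective described above.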

Experimental Results
We choose three languages with different data availability, namely French, Russian, and Greek, and pair them with English, producing 560k, 396k, and 129k code-switched sentences respectively. We compare M-BERT after code-switched fine-tuning (denoted cs) with both the original M-BERT and with fine-tuning only on raw text (raw). We vary the evaluation settings to illustrate the effect of code-switching: on top of matching predictions to ground-truth aliases in the prompt language (single-eval), we evaluate with targets in both languages (double-eval; English and prompt). As shown in Tab. 4, continued fine-tuning on raw text outperforms the original M-BERT, likely because we fine-tune on a subset of sentences with mentions of entities from our benchmark. Results on code-switched text are slightly worse when only matching entities in the original target language, but significantly better if we allow matching in both the original language and English. This indicates that code-switched fine-tuning helps M-BERT retrieve facts, albeit in English rather than in the prompt language. Encouragingly, the increase is larger for the low-resource (Greek) and typologically distant-from-English (Russian) languages. For example, the prediction for the Greek prompt "η Θεωρία κατηγοριών είναι μέρος των _." ("Category theory is part of _.") is "mathematics" (in English!), while the prediction without code-switching is the non-informative "οποίων" ("which"). Considering that we have more raw than code-switched sentences in the dataset, this seems to indicate that English entities are easier to predict than their prompt-language counterparts, which might be because facts expressed in English are better learned in the pre-trained model due to training data abundance.

Related Work
Factual Knowledge Retrieval from LMs. Several works have focused on probing factual knowledge solely from pre-trained LMs, without access to external knowledge. They do so either by using prompts and letting the LM fill in the blanks, which treats the LM as a static knowledge source (Petroni et al., 2019; Jiang et al., 2020; Poerner et al., 2019; Bouraoui et al., 2020), or by fine-tuning the LM on a set of question-answer pairs to directly generate answers, which dynamically adapts the LM to this particular task (Roberts et al., 2020). The impressive results demonstrated by these works indicate that large-scale LMs contain a significant amount of knowledge, in some cases even outperforming competitive question answering systems that rely on external resources (Roberts et al., 2020). Petroni et al. (2020) further show that LMs can generate even more factual knowledge when augmented with retrieved sentences. Our work builds on these works by expanding to multilingual and multi-token evaluation, and also demonstrates the significant challenges posed by this setting.
Multilingual Benchmarks. Many multilingual benchmarks have been created to evaluate the performance of multilingual systems on different natural language processing tasks, including question answering (Artetxe et al., 2020; Clark et al., 2020), natural language understanding (Conneau et al., 2018; Yang et al., 2019a; Zweigenbaum et al., 2018; Artetxe and Schwenk, 2019), syntactic prediction (Nivre et al., 2018; Pan et al., 2017), and comprehensive benchmarks covering multiple tasks (Hu et al., 2020; Liang et al., 2020). We focus on multilingual factual knowledge retrieval from LMs, which to our knowledge has not been covered by any previous work.

Conclusion
We examine the intersection of multilinguality and the factual knowledge included in LMs by creating a multilingual and multi-token benchmark X-FACTR, and performing experiments comparing and contrasting across languages and LMs. The results demonstrate the difficulty of this task, and that knowledge contained in LMs varies across languages. Future directions include other pre-training or fine-tuning methods to improve retrieval performance and methods that encourage the LM to predict entities of the right types.

A Benchmark Details
Tab. 5 shows the detailed number of facts for each language in our X-FACTR benchmark. Fig. 6 shows the ratio of facts with respect to the number of tokens in the object across languages; high-resource languages (e.g., English, French, Dutch, and Spanish) have a larger proportion of single-token facts than low-resource languages.

B Benchmark Prompt Quality
The prompts generated in different languages may not be perfectly natural. This could be due to the awkwardness of expressing relational phrases originally devised for English in languages where the semantic distinctions of the underlying words differ, or due to errors in our automated approach to grammatical attribute inference and subsequent inflection. To assess this, we evaluated our prompts on a sample of languages, providing native speakers with 10 sentences per prompt with the missing slots filled by our inflection models. Our approach produces sentences that are annotated as correct in 97.9% of cases in Spanish, 90.5% in Yoruba, 86.7% in Greek, 82.3% in Marathi, and 81.9% in Russian.
We present an analysis of the annotations of the erroneous prompts in Table 6. The error types differ drastically across languages. Russian and Marathi have comparatively large percentages of inflection-related errors, but for different reasons: in Russian, predicting the grammatical gender of non-human entities is difficult, which leads to inflection mistakes. In Marathi, this issue is exacerbated by the inflection model itself, which is of slightly lower quality due to scarce training data.
Despite these two outliers, we consider the rest of our prompts to be of high quality. Even if small inflection or grammatical gender assignment mistakes occur (e.g., in Greek), this should not render the prompt unintelligible to native speakers; the burden is on the model to be robust to such slight variations, just as humans are. We point out that the prompts can be awkward or incorrect for some senses captured by the relation, an issue unrelated to our gender heuristics or automatic inflection. This issue, though, is also present in the LAMA English prompts (Petroni et al., 2019; Jiang et al., 2020) and is the result of the original Wikidata annotation.

C Multi-Token Decoding
We outline here the concrete formulation of our multi-token decoding algorithms. Given a sentence with multiple mask tokens (e.g., Eq. 2), we can either generate outputs for all positions independently in parallel, or one at a time conditioned on the previously generated tokens. These methods are similar to the prediction problems that BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019b) solve in their respective pre-training stages. We define $c \in \mathbb{R}^n$ as the confidence of each prediction, with details varying by prediction method.

C.1 Initial Prediction and Refinement
Independent For independent initial prediction, all mask tokens are predicted in parallel:
$$\hat{y}_k = \operatorname*{argmax}_{y'} p(y' \mid s_{i:j}), \quad \forall k \in \{i, \ldots, j\},$$
where $s_{i:j}$ denotes the prompt in Eq. 2 with positions $i$ through $j$ masked.
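As a toy illustration of independent prediction, the following sketch fills every mask in a single parallel pass. The `toy_lm` function and its probabilities are invented stand-ins for a real masked LM, not part of our implementation.

```python
# Toy sketch of independent prediction: all [MASK] positions are filled in
# one parallel pass by per-position argmax. `toy_lm` is an invented
# stand-in for a masked LM's per-position output distributions.

def toy_lm(tokens):
    """Return {position: {token: prob}} for every masked position."""
    return {i: {"New": 0.6, "York": 0.4}
            for i, t in enumerate(tokens) if t == "[MASK]"}

def predict_independent(tokens):
    dists = toy_lm(tokens)            # a single forward pass over the prompt
    preds, conf = list(tokens), {}
    for i, dist in dists.items():
        preds[i] = max(dist, key=dist.get)
        conf[i] = dist[preds[i]]      # confidence = probability at prediction time
    return preds, conf

preds, conf = predict_independent(["[MASK]", "City", "[MASK]"])
print(preds)   # ['New', 'City', 'New']
```

Because the positions are predicted without conditioning on each other, both masks here receive the same token, an incoherence that the autoregressive variants avoid.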
We also consider two autoregressive methods for initial prediction or refinement.
Order-based Mask tokens are predicted from left to right, each conditioned on the tokens generated in previous steps:
$$\hat{y}_k = \operatorname*{argmax}_{y'} p(y' \mid x_1, \ldots, \hat{y}_i, \ldots, \hat{y}_{k-1}, [\text{MASK}], \ldots, x_n), \quad k = i, \ldots, j.$$
In the refinement stage, we modify the predicted tokens from left to right by replacing each token with a mask and re-predicting it:
$$\hat{y}'_k = \operatorname*{argmax}_{y'} p(y' \mid \hat{s}_{i:j} \backslash k),$$
where $\hat{s}_{i:j} \backslash k$ means that the $k$-th token in $\hat{s}_{i:j}$ is replaced with [MASK]. Convergence is reached when a left-to-right scan produces no changes.
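A minimal sketch of the order-based initial prediction follows. The `toy_lm` stand-in is invented for illustration: its distribution for a mask depends on the token already placed to its left, so filling masks one at a time produces a coherent pair.

```python
# Left-to-right prediction: fill the leftmost [MASK], re-run the model,
# repeat. `toy_lm` is an invented stand-in whose distribution for a mask
# is conditioned on the token to its left.

def toy_lm(tokens):
    """Return {position: {token: prob}} for every masked position."""
    dists = {}
    for i, t in enumerate(tokens):
        if t != "[MASK]":
            continue
        if i > 0 and tokens[i - 1] == "New":
            dists[i] = {"York": 0.9, "New": 0.1}   # conditioned on "New"
        else:
            dists[i] = {"New": 0.7, "York": 0.3}
    return dists

def predict_left_to_right(tokens):
    tokens = list(tokens)
    while "[MASK]" in tokens:
        i = tokens.index("[MASK]")                 # leftmost remaining mask
        dist = toy_lm(tokens)[i]                   # conditions on filled tokens
        tokens[i] = max(dist, key=dist.get)
    return tokens

print(predict_left_to_right(["[MASK]", "[MASK]"]))  # ['New', 'York']
```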
Confidence-based Among all the remaining masked positions, we predict at each step the one with the highest confidence (i.e., the highest probability), so the actual order of predictions can be arbitrary, as shown in Fig. 2:
$$k^* = \operatorname*{argmax}_{k} \max_{y'} p(y' \mid s), \quad \hat{y}_{k^*} = \operatorname*{argmax}_{y'} p(y' \mid s),$$
where $s$ is the sentence with all predictions made so far filled in. In the refinement stage, we choose among all predicted tokens the one with the lowest confidence (i.e., the lowest probability) and re-predict it (Ghazvininejad et al., 2019):
$$k^* = \operatorname*{argmin}_{k} c_k, \quad \hat{y}'_{k^*} = \operatorname*{argmax}_{y'} p(y' \mid \hat{s}_{i:j} \backslash k^*).$$
Convergence is reached when the re-predicted token is the same as the original token.
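The confidence-based selection of the initial stage can be sketched as follows. Only the initial prediction step is shown (not refinement), and the toy LM with its probabilities is invented: it is simply more confident about the second position than the first, so the second mask gets filled first.

```python
# Confidence-based order: at each step, fill whichever remaining mask the
# model assigns the highest-probability token to. `toy_lm` is an invented
# stand-in that is more confident about position 1 than position 0.

def toy_lm(tokens):
    dists = {}
    for i, t in enumerate(tokens):
        if t == "[MASK]":
            if i == 1:
                dists[i] = {"York": 0.9, "New": 0.1}
            else:
                dists[i] = {"New": 0.6, "York": 0.4}
    return dists

def predict_confidence_based(tokens):
    tokens, conf = list(tokens), {}
    while "[MASK]" in tokens:
        dists = toy_lm(tokens)
        # pick the masked position whose best token has the highest probability
        i = max(dists, key=lambda k: max(dists[k].values()))
        tokens[i] = max(dists[i], key=dists[i].get)
        conf[i] = dists[i][tokens[i]]
    return tokens, conf

tokens, conf = predict_confidence_based(["[MASK]", "[MASK]"])
print(tokens)  # ['New', 'York'] (position 1 was filled first, conf 0.9)
```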

C.2 Additional Decoding Components
Length Normalization Since the sum used in § 4.2 might favor short predictions, we consider normalizing it by the number of mask tokens, i.e., dividing the summed log-probabilities by $j - i + 1$.

Confidence Re-computation Note that the confidence of each predicted token in the previous equations is the probability at the time the token is predicted. However, because the conditional distributions are bidirectional, this probability becomes stale once the surrounding tokens change, as also noted by Ghazvininejad et al. (2019). To keep the confidence up-to-date, given the prompt in Eq. 2, whenever a new token is predicted (in the initial stage) or a token is modified (in the refinement stage), we re-compute $c_i$ through $c_j$. This makes the time complexity quadratic in the number of mask tokens, because every modification requires recomputing the confidence values of all predictions. As a result, the final confidence becomes $c_k = p(\hat{y}_k \mid \hat{s}_{i:j} \backslash k)$, where $\hat{s}_{i:j} = x_1, \ldots, \hat{y}_i, \ldots, \hat{y}_j, \ldots, x_n$ contains the final predictions.
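The effect of length normalization can be seen with a toy numeric comparison (the probabilities below are invented for illustration): an unnormalized sum of log-probabilities shrinks with every extra mask token, so a mediocre one-token candidate can outscore a confident three-token one.

```python
import math

# Score a candidate by the summed log-probabilities of its tokens,
# optionally normalized by the number of mask tokens it spans.
def score(confidences, normalize=True):
    total = sum(math.log(c) for c in confidences)
    return total / len(confidences) if normalize else total

short = [0.5]               # 1-mask candidate, low per-token confidence
long = [0.7, 0.7, 0.7]      # 3-mask candidate, higher per-token confidence

# The unnormalized sum prefers the short candidate...
print(score(short, normalize=False) > score(long, normalize=False))  # True
# ...while per-token normalization prefers the more confident long one.
print(score(short) < score(long))                                    # True
```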
Beam Search All of the previous methods keep only the single most plausible prediction at each masked position. We also consider beam search, which keeps track of the $B$ most plausible predictions. Our beam search algorithm is very similar to conventional left-to-right decoding, except that the decoding order may be arbitrary if we use the confidence-based initial prediction or refinement methods. As a result, extending different hypotheses in the beam might lead to identical results, so we need an additional deduplication step. The time complexity with all the above components is $O(M^2BT)$, where $M$ is the maximal number of mask tokens and $T$ is the maximal number of iterations. Alg. 1 outlines the overall multi-token decoding algorithm. The confidence-based decoding method takes between 20 minutes and 2 hours on an NVIDIA GeForce RTX 2080 Ti GPU, depending on the number of facts in each language.
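A compact sketch of beam search with deduplication follows. The toy LM and the fixed left-to-right extension order are invented for illustration; the key point is that hypotheses reaching identical token sequences are merged, keeping only the best score.

```python
import math

def toy_lm(tokens):
    """Invented stand-in: the same distribution for every masked position."""
    return {i: {"New": 0.6, "York": 0.4}
            for i, t in enumerate(tokens) if t == "[MASK]"}

def beam_search(tokens, beam_size=2):
    beam = [(0.0, list(tokens))]  # (summed log-prob, hypothesis)
    while any("[MASK]" in hyp for _, hyp in beam):
        candidates = {}
        for s, hyp in beam:
            if "[MASK]" not in hyp:      # finished hypotheses carry over
                key = tuple(hyp)
                candidates[key] = max(s, candidates.get(key, float("-inf")))
                continue
            i = hyp.index("[MASK]")      # extend at the leftmost mask
            for tok, p in toy_lm(hyp)[i].items():
                key = tuple(hyp[:i] + [tok] + hyp[i + 1:])
                # deduplication: identical hypotheses keep the best score
                candidates[key] = max(s + math.log(p),
                                      candidates.get(key, float("-inf")))
        beam = sorted(((s, list(h)) for h, s in candidates.items()),
                      key=lambda x: -x[0])[:beam_size]
    return beam

beam = beam_search(["[MASK]", "[MASK]"])
print(beam[0][1])  # ['New', 'New']
```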

D Details of Pre-trained LMs
The LMs examined in this paper share a similar architecture and pre-training setup with BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019), but are trained on different corpora. We provide the shortcut name of each LM in the HuggingFace Transformers library (https://huggingface.co/transformers/pretrained_models.html) and their training corpora in Tab. 8, where more information can be found.

E Detailed Experimental Results
Detailed performance across LMs and languages is shown in Tab. 10, and error cases in Spanish and Greek in Tab. 9.

Table 10: Accuracy on different languages using different LMs (%). We use M = 5 mask tokens for en, fr, nl, es, vi (on the left) and M = 10 mask tokens for the other languages (on the right). Best results for each language-part combination are in bold. "-" denotes missing/unsupported models.