A Call for More Rigor in Unsupervised Cross-lingual Learning

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world’s languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.


Introduction
The study of the connection among human languages has contributed to major discoveries including the evolution of languages, the reconstruction of proto-languages, and an understanding of language universals (Eco and Fentress, 1995). In natural language processing, the main promise of multilingual learning is to bridge the digital language divide, to enable access to information and technology for the world's 6,900 languages. For the purpose of this paper, we define "multilingual learning" as learning a common model for two or more languages from raw text, without any downstream task labels. Common use cases include translation as well as pretraining multilingual representations. We will use the term interchangeably with "cross-lingual learning".
* Equal contribution.
Recent work in this direction has increasingly focused on purely unsupervised cross-lingual learning (UCL), i.e., cross-lingual learning without any parallel signal across the languages. We provide an overview in §2. Such work has been motivated by the apparent dearth of parallel data for most of the world's languages. In particular, previous work has noted that "data encoding cross-lingual equivalence is often expensive to obtain" (Zhang et al., 2017a) whereas "monolingual data is much easier to find". Overall, it has been argued that unsupervised cross-lingual learning "opens up opportunities for the processing of extremely low-resource languages and domains that lack parallel data completely" (Zhang et al., 2017a).
We challenge this narrative and argue that the scenario of no parallel data and sufficient monolingual data is unrealistic and not reflected in the real world (§3.1). Nevertheless, UCL is an important research direction and we advocate for its study based on an inherent scientific interest (to better understand and make progress on general language understanding), usefulness as a lab setting, and simplicity (§3.2).
Unsupervised cross-lingual learning permits no supervisory signal by definition. However, previous work implicitly includes monolingual and cross-lingual signals that constitute a departure from the pure setting. We review existing training signals as well as other signals that may be of interest for future study (§4). We then discuss methodological issues in UCL (e.g., validation, hyperparameter tuning) and propose best evaluation practices (§5). Finally, we provide a unified outlook of established research areas (cross-lingual word embeddings, deep multilingual models and unsupervised machine translation) in UCL (§6), and conclude with a summary of our recommendations (§7).

Cross-lingual word embeddings
Cross-lingual word embedding methods traditionally relied on parallel corpora (Gouws et al., 2015; Luong et al., 2015). Nonetheless, the amount of supervision required was greatly reduced via cross-lingual word embedding mappings, which work by separately learning monolingual word embeddings in each language and mapping them into a shared space through a linear transformation. Early work required a bilingual dictionary to learn such a transformation (Mikolov et al., 2013a; Faruqui and Dyer, 2014). This requirement was later reduced with self-learning (Artetxe et al., 2017), and ultimately removed via unsupervised initialization heuristics (Artetxe et al., 2018a; Hoshen and Wolf, 2018) and adversarial learning (Zhang et al., 2017a). Finally, several recent methods have formulated cross-lingual embedding alignment as an optimal transport problem (Zhang et al., 2017b; Alvarez-Melis and Jaakkola, 2018).
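As a rough sketch of the dictionary-based mapping step described above, assume that the rows of X and Y hold the source and target embeddings of the word pairs in a bilingual dictionary (this notation is introduced here for illustration and is not taken from the cited work). The linear transformation W can then be learned by least squares, and a common variant additionally constrains W to be orthogonal, in which case the solution has a closed form:

```latex
% Least-squares mapping from source to target space
W^{*} = \operatorname*{arg\,min}_{W \in \mathbb{R}^{d \times d}} \; \lVert XW - Y \rVert_F^{2}

% Orthogonal variant (W^{\top} W = I), solved by the orthogonal Procrustes problem
W^{*} = U V^{\top}, \qquad \text{where } U \Sigma V^{\top} \text{ is the SVD of } X^{\top} Y
```

Self-learning and unsupervised initialization can be seen as ways of obtaining the rows of X and Y with a small or empty seed dictionary.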

Deep multilingual pretraining
Following the success in learning shallow word embeddings (Mikolov et al., 2013b; Pennington et al., 2014), there has been an increasing interest in learning contextual word representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018). Recent research has been dominated by BERT (Devlin et al., 2019), which uses a bidirectional transformer encoder trained on masked language modeling and next sentence prediction and has led to impressive gains on various downstream tasks.
While the above approaches are limited to a single language, a multilingual extension of BERT (mBERT) has been shown to also be effective at learning cross-lingual representations in an unsupervised way. The main idea is to combine monolingual corpora in different languages, upsampling those with less data, and to train a regular BERT model on the combined data. Conneau and Lample (2019) follow a similar approach but perform a more thorough evaluation and report substantially stronger results, and this approach was later scaled up further. Several recent studies (Wu and Dredze, 2019; Pires et al., 2019; Artetxe et al., 2020b) analyze mBERT to get a better understanding of its capabilities.
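The upsampling of languages with less data can be sketched as follows. This is a minimal illustration that assumes exponentially smoothed sampling probabilities; the function name `sampling_probs`, the exponent `alpha`, and the corpus sizes are hypothetical and are not taken from the cited work.

```python
def sampling_probs(sizes, alpha=0.5):
    """Return per-language sampling probabilities proportional to (n_i / N) ** alpha.

    alpha = 1.0 reproduces the raw data proportions, while alpha < 1.0
    upsamples languages with less data. Both the scheme and the default
    value are assumptions made for illustration.
    """
    total = sum(sizes.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(smoothed.values())
    return {lang: s / norm for lang, s in smoothed.items()}


# Hypothetical corpus sizes (number of sentences per language).
sizes = {"en": 1_000_000, "sw": 10_000}
raw = sampling_probs(sizes, alpha=1.0)
smooth = sampling_probs(sizes, alpha=0.5)
print(f"raw share of sw:      {raw['sw']:.4f}")
print(f"smoothed share of sw: {smooth['sw']:.4f}")
```

With a smaller `alpha`, the low-resource language is seen more often during training than its raw share of the data would suggest, at the cost of repeating its examples.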

Unsupervised machine translation
Early attempts to build machine translation systems using monolingual data alone go back to statistical decipherment (Ravi and Knight, 2011; Dou and Knight, 2012, 2013). However, this approach was only shown to work in limited settings, and the first convincing results on standard benchmarks were achieved by Artetxe et al. (2018c), among others, on unsupervised Neural Machine Translation (NMT). These approaches rely on cross-lingual word embeddings to initialize a shared encoder, and train it in conjunction with the decoder using a combination of denoising autoencoding, backtranslation, and optionally adversarial learning.
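To make the denoising autoencoding component more concrete, the sketch below corrupts a sentence by randomly swapping adjacent tokens; a model would then be trained to reconstruct the original sentence from the corrupted one. The specific noise (adjacent swaps), the function name `add_noise`, and the default number of swaps are assumptions for illustration, and actual systems may use different or additional noise.

```python
import random


def add_noise(tokens, rng, n_swaps=None):
    """Return a copy of `tokens` with random swaps of adjacent positions."""
    noisy = list(tokens)
    if len(noisy) < 2:
        return noisy  # nothing to swap
    if n_swaps is None:
        n_swaps = len(noisy) // 2  # illustrative default
    for _ in range(n_swaps):
        i = rng.randrange(len(noisy) - 1)
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy


rng = random.Random(0)  # fixed seed for reproducibility
sentence = "the cat sat on the mat".split()
noisy = add_noise(sentence, rng)
# A denoising autoencoder would be trained on (noisy, sentence) pairs.
print(noisy, "->", sentence)
```

Because the noise only reorders tokens, the model cannot solve the task by copying its input and has to learn something about word order in each language.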
Subsequent work adapted these principles to unsupervised phrase-based Statistical Machine Translation (SMT), obtaining large improvements over the original NMT-based systems (Lample et al., 2018b; Artetxe et al., 2018b). This alternative approach uses cross-lingual n-gram embeddings to build an initial phrase table, which is combined with an n-gram language model and a distortion model, and further refined through iterative backtranslation. There have been several follow-up attempts to combine NMT and SMT based approaches (Marie and Fujita, 2018; Ren et al., 2019; Artetxe et al., 2019b). More recently, Conneau and Lample (2019), Song et al. (2019) and Liu et al. (2020) obtain strong results using deep multilingual pretraining rather than cross-lingual word embeddings to initialize unsupervised NMT systems.

Motivating fully unsupervised learning
In this section, we challenge the narrative of motivating UCL based on a lack of parallel resources. We argue that the strict unsupervised scenario cannot be motivated from an immediate practical perspective, and elucidate what we believe should be the true goals of this research direction.

How practical is the strict unsupervised scenario?
Monolingual resources subsume parallel resources. For instance, each side of a parallel corpus effectively serves as a monolingual corpus. From this argument, it follows that monolingual data is cheaper to obtain than parallel data, so unsupervised cross-lingual learning should in principle be more generally applicable than supervised learning. However, we argue that the common claim that the requirement for parallel data "may not be met for many language pairs in the real world" (Xu et al., 2018) is largely inaccurate. For instance, the JW300 parallel corpus covers 343 languages with around 100,000 parallel sentences per language pair on average (Agić and Vulić, 2019), and the multilingual Bible corpus collected by Mayer and Cysouw (2014) covers 837 language varieties (each with a unique ISO 639-3 code). Moreover, the PanLex project aims to collect multilingual lexica for all human languages in the world, and already covers 6,854 language varieties with at least 20 lexemes, 2,364 with at least 200 lexemes, and 369 with at least 2,000 lexemes (Kamholz et al., 2014). While 20 or 200 lexemes might seem insufficient, weakly supervised cross-lingual word embedding methods already proved effective with as few as 25 word pairs (Artetxe et al., 2017). More recent methods have focused on completely removing this weak supervision (Artetxe et al., 2018a), which can hardly be justified from a practical perspective given the existence of such resources and additional training signals stemming from a (partially) shared script (§4.2). Finally, given the availability of sufficient monolingual data, noisy parallel data can often be obtained by mining bitext (Schwenk et al., 2019a,b).
In addition, large monolingual data is difficult to obtain for low-resource languages. For instance, recent work on cross-lingual word embeddings has mostly used Wikipedia as its source for monolingual corpora (Gouws et al., 2015; Vulić and Korhonen, 2016). However, as of November 2019, Wikipedia exists in only 307 languages (https://en.wikipedia.org/wiki/List_of_Wikipedias), of which nearly half have fewer than 10,000 articles. While one could hope to overcome this by taking the entire web as a corpus, as facilitated by Common Crawl (https://commoncrawl.org/) and similar initiatives, this is not always feasible for low-resource languages. First, the presence of less resourced languages on the web is very limited, with only a few hundred languages recognized as being used in websites. This situation is further complicated by the limited coverage of existing tools such as language detectors (Buck et al., 2014; Grave et al., 2018), which only cover a few hundred languages. Alternatively, speech could also serve as a source of monolingual data (e.g., by recording public radio stations). However, this is an unexplored direction within UCL, and collecting, processing and effectively capitalizing on speech data is far from trivial, particularly for low-resource languages.
All in all, we conclude that the alleged scenario involving no parallel data and sufficient monolingual data is not met in the real world in the terms explored by recent UCL research. Needless to say, effectively exploiting unlabeled data is important in any low-resource setting. However, refusing to use an informative training signal (which parallel data is) when it does indeed exist cannot be justified from a practical perspective if one's goal is to build the strongest possible model. For this reason, we believe that semi-supervised learning is a more suitable paradigm for truly low-resource languages, and UCL should not be motivated from an immediate practical perspective.

A scientific motivation
Despite not being an entirely realistic setup, we believe that UCL is an important research direction for the reasons we discuss below.
Inherent scientific interest. The extent to which two languages can be aligned based on independent samples, without any cross-lingual signal, is an open and scientifically relevant problem per se. In fact, it is not entirely obvious that UCL should be possible at all, as humans would certainly struggle to align two unknown languages without any grounding. Exploring the limits of UCL could help to understand the limits of the principles that the corresponding methods are based on, such as the distributional hypothesis. Moreover, this research line could bring new insights into the properties and inner workings of both language acquisition and the underlying computational models that ultimately make UCL possible. Finally, such methods may be useful in areas where supervision is impossible to obtain, such as when dealing with unknown or even non-human languages.
Useful as a lab setting. The strict unsupervised scenario, although not practical, allows us to isolate and better study the use of monolingual corpora for cross-lingual learning. We believe lessons learned in this setting can be useful in the more practical semi-supervised scenario. In a similar vein, monolingual language models, although hardly useful on their own, have contributed to large improvements in other tasks. From a research methodology perspective, unsupervised systems also set a competitive baseline, which any semi-supervised method should improve upon.
Simplicity as a value. As we discussed previously, refusing to use an informative training signal when it does exist can hardly be beneficial, so we should not expect UCL to perform better than semi-supervised learning. However, simplicity is a value in its own right. Unsupervised approaches could be preferable to their semi-supervised counterparts if the performance gap between them is small enough. For instance, unsupervised cross-lingual embedding methods have been reported to be competitive with their semi-supervised counterparts in certain settings, while being easier to use in the sense that they do not require a bilingual dictionary.

What does unsupervised mean?
In its most general sense, unsupervised cross-lingual learning can be seen as referring to any method relying exclusively on monolingual text data in two or more languages. However, there are different training signals (stemming from common assumptions and varying amounts of linguistic knowledge) that one can potentially exploit under such a regime. This has led to an inconsistent use of this term in the literature. In this section, we categorize different training signals available both from a monolingual and a cross-lingual perspective and discuss additional scenarios enabled by multiple languages.

Monolingual training signals
From a computational perspective, text is modeled as a sequence of discrete symbols. In UCL, the training data consists of a set of such sequences in each of the languages. In principle, without any knowledge about the languages, one would have no prior information of the nature of such sequences or the possible relations between them. In practice, however, sets of sequences are assumed to be independent, and existing work differs in whether it assumes document-level sequences (Conneau and Lample, 2019) or sentence-level sequences (Artetxe et al., 2018c).
Nature of atomic symbols. A more important consideration is the nature of the atomic symbols in such sequences. To the best of our knowledge, previous work assumes some form of word segmentation or tokenization (e.g., splitting by whitespace or punctuation marks). Early work on cross-lingual word embeddings considered such tokens as atomic units. However, more recent work (Hoshen and Wolf, 2018) has primarily used fastText embeddings (Bojanowski et al., 2017), which incorporate subword information into the embedding learning, although the vocabulary is still defined at the token level. In addition, there have also been approaches that incorporate character-level information into the alignment learning itself (Heyman et al., 2017; Riley and Gildea, 2018). In contrast, most work on contextual word embeddings and unsupervised machine translation operates with a subword vocabulary (Devlin et al., 2019; Conneau and Lample, 2019).
While the above distinction might seem irrelevant from a practical perspective, we think that it is important from a more fundamental point of view (e.g. in relation to the distributional hypothesis as discussed in §3.2). Moreover, some of the underlying assumptions might not generalize to different writing systems (e.g. logographic instead of alphabetic). For instance, subword tokenization has been shown to perform poorly on reduplicated words (Vania and Lopez, 2017). In relation to that, one could also consider the text in each language as a stream of discrete character-like symbols without any notion of tokenization. Such a tabula rasa approach is potentially applicable to any arbitrary language, even when its writing system is not known, but has so far only been explored for a limited number of languages in a monolingual setting (Hahn and Baroni, 2019).
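The different choices of atomic symbols discussed above can be illustrated with a short sketch: whitespace tokens, character n-grams of a token (a simple form of subword information), and a raw character stream without any tokenization. The function names, the boundary markers `<` and `>`, and the n-gram range are illustrative assumptions rather than a description of any specific system.

```python
def whitespace_tokens(text):
    """Treat whitespace-separated tokens as the atomic symbols."""
    return text.split()


def char_ngrams(token, n_min=3, n_max=6):
    """Character n-grams of a token wrapped in boundary markers."""
    marked = f"<{token}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def char_stream(text):
    """Treat every character, including spaces, as an atomic symbol."""
    return list(text)


text = "where is it"
print(whitespace_tokens(text))
print(char_ngrams("where", n_min=3, n_max=3))
print(char_stream(text))
```

Only the last view makes no assumption about how the writing system marks word boundaries, which is why it is the closest to the tabula rasa setting.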
Linguistic information. Finally, one can exploit additional linguistic knowledge through linguistic analysis such as lemmatization, part-of-speech tagging, or syntactic parsing. For instance, before the advent of unsupervised NMT, statistical decipherment was already shown to benefit from incorporating syntactic dependency relations (Dou and Knight, 2013). For other tasks such as unsupervised POS tagging (Snyder et al., 2008), monolingual tag dictionaries have been used. While such approaches could still be considered unsupervised from a cross-lingual perspective, we argue that the interest of this research direction is greatly limited by two factors: (i) from a theoretical perspective, it assumes some fundamental knowledge that is not directly inferred from the raw monolingual corpora; and (ii) from a more practical perspective, it is not reasonable to assume that such resources are available in the less resourced settings where this research direction has more potential for impact.

Cross-lingual training signals
Pure UCL should not use any cross-lingual signal by definition. When we view text as a sequence of discrete atomic symbols (either characters or tokens), a strict interpretation of this principle would consider the set of atomic symbols in different languages to be disjoint, without prior knowledge of the relationship between them.
Needless to say, any form of learning requires making assumptions, as one needs some criterion to prefer one mapping over another. In the case of UCL, such assumptions stem from the structural similarity across languages (e.g. semantically equivalent words in different languages are assumed to occur in similar contexts). In practice, these assumptions weaken as the distribution of the datasets diverges, and some UCL models have been reported to break under a domain shift (Marchisio et al., 2020). Similarly, approaches that leverage linguistic features such as syntactic dependencies may assume that these are similar across languages.
In addition, one can also assume that the sets of symbols that are used to represent different languages have some commonalities. This departs from the strict definition of UCL above, establishing some prior connections between the sets of symbols in different languages. Such an assumption is reasonable from a practical perspective, as there are a few scripts (e.g. Latin, Arabic or Cyrillic) that cover a large fraction of languages. Moreover, even when two languages use different writing systems or scripts, there are often certain elements that are still shared (e.g. Arabic numerals, named entities written in a foreign script, URLs, certain punctuation marks, etc.). In relation to that, several models have relied on identically spelled words (Artetxe et al., 2017; Smith et al., 2017) or string-level similarity across languages (Riley and Gildea, 2018; Artetxe et al., 2019b) as training signals. Other methods use a joint subword vocabulary for all languages, indirectly exploiting the commonalities in their writing system (Lample et al., 2018b; Conneau and Lample, 2019).
However, past work greatly differs on the nature and relevance that is attributed to such a training signal. The reliance on identically spelled words has been considered as a weak form of supervision in the cross-lingual word embedding literature, and significant effort has been put into developing strictly unsupervised methods that do not rely on such a signal. In contrast, the unsupervised machine translation literature has not paid much attention to this factor, and has often relied on identical words (Artetxe et al., 2018c), string-level similarity (Artetxe et al., 2019b), or a joint subword vocabulary (Lample et al., 2018b; Conneau and Lample, 2019) under the unsupervised umbrella. The same is true for unsupervised deep multilingual pretraining, where a shared subword vocabulary has been a common component (Pires et al., 2019; Conneau and Lample, 2019), although recent work shows that it is not important to share vocabulary across languages (Artetxe et al., 2020b).
Our position is that making assumptions on linguistic universals is acceptable and ultimately necessary for UCL. However, we believe that any connection stemming from a (partly) shared writing system belongs to a different category, and should be considered a separate cross-lingual signal. Our rationale is that a given writing system pertains to a specific form to encode a language, but cannot be considered to be part of the language itself.
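To illustrate why a shared writing system acts as a cross-lingual signal, the following sketch extracts a seed dictionary from identically spelled words in two monolingual vocabularies, optionally restricted to numerals. The function name `identical_word_seed` and the toy vocabularies are hypothetical; the sketch is not the procedure of any particular cited method.

```python
def identical_word_seed(src_vocab, tgt_vocab, numerals_only=False):
    """Pair every string that occurs in both vocabularies with itself."""
    shared = set(src_vocab) & set(tgt_vocab)
    if numerals_only:
        # Keep only strings made of digits, which may be shared even
        # across languages with otherwise different vocabularies.
        shared = {w for w in shared if w.isdigit()}
    return sorted((w, w) for w in shared)


# Toy vocabularies, as they might be extracted from monolingual corpora.
en_vocab = ["the", "taxi", "2019", "house", "hotel"]
es_vocab = ["el", "taxi", "2019", "casa", "hotel"]

print(identical_word_seed(en_vocab, es_vocab))
print(identical_word_seed(en_vocab, es_vocab, numerals_only=True))
```

No parallel data is involved, yet the resulting pairs connect the two symbol sets, which is the reason to report this signal explicitly rather than fold it into the unsupervised label.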

Multilinguality
While most work in unsupervised cross-lingual learning considers two languages at a time, there have recently been some attempts to extend these methods to multiple languages (Duong et al., 2017; Chen and Cardie, 2018; Heyman et al., 2019), and most work on unsupervised cross-lingual pretraining is multilingual (Pires et al., 2019; Conneau and Lample, 2019). When considering parallel data across a subset of the language pairs, multilinguality gives rise to additional scenarios. For instance, the scenario where two languages have no parallel data between each other but are well connected through a third (pivot) language has been explored by several authors in the context of machine translation (Cheng et al., 2016; Chen et al., 2017). However, given that the languages in question are still indirectly connected through parallel data, this scenario does not fall within the unsupervised category, and is instead commonly known as zero-resource machine translation.

Table 1: Different types of monolingual and cross-lingual signals that have been used for unsupervised cross-lingual learning, ordered roughly from least to most linguistic knowledge (top to bottom).
Monolingual signal | Cross-lingual signal
Sequence of symbols | Shared writing system
Sets of sentences/documents | Identical words
Tokens/subwords | String similarity
Linguistic analysis | -

An alternative scenario explored in the contemporaneous work of Liu et al. (2020) is where a set of languages are connected through parallel data, and there is a separate language with monolingual data only. We argue that, when it comes to the isolated language, such a scenario should still be considered as UCL, as it does not rely on any parallel data for that particular language nor does it assume any previous knowledge of it. This scenario is easy to justify from a practical perspective given the abundance of parallel data for high-resource languages, and can also be interesting from a more theoretical point of view. This way, rather than considering two unknown languages, this alternative scenario would assume some knowledge of how one particular language is connected to other languages, and attempt to align it to a separate unknown language.

Discussion
As discussed throughout the section, there are different training signals that we can exploit depending on the available resources of the languages involved and the assumptions made regarding their writing system, which are summarized in Table 1. Many of these signals are not specific to work on UCL but have been observed in the past in allegedly language-independent NLP approaches, as discussed by Bender (2011). Others, such as a reliance on subwords or shared symbols, are more recent phenomena.
While we do not aim to open a terminological debate on what UCL encompasses, we advocate for future work being more aware and explicit about the monolingual and cross-lingual signals they employ, what assumptions they make (e.g. regarding the writing system), and the extent to which these generalize to other languages.
In particular, we argue that it is critical to consider the assumptions made by different methods when comparing their results. Otherwise the blind chase for state-of-the-art performance may benefit models making stronger assumptions and exploiting all available training signals, which could ultimately conflict with the eminently scientific motivation of this research area (see §3.2).

Methodological issues
In this section, we describe methodological issues that are commonly encountered when training and evaluating unsupervised cross-lingual models and propose measures to ameliorate them.

Validation and hyperparameter tuning
In conventional supervised or semi-supervised settings, we use a separate validation set for development and hyperparameter tuning. However, this becomes tricky in unsupervised cross-lingual learning, where we ideally should not use any parallel data other than for testing purposes.
Previous work has not paid much attention to this aspect, and different methods are evaluated with different validation schemes. For instance, Artetxe et al. (2018b,c) use a separate language pair with a parallel validation set to make all development and hyperparameter decisions. They test their final system on other language pairs without any parallel data. This approach has the advantage of being strictly unsupervised with respect to the test language pairs, but the optimal hyperparameter choice might not necessarily transfer well across languages. In contrast, other authors propose an unsupervised validation criterion that is defined over monolingual data and shown to correlate well with test performance. This enables systematic tuning on the language pair of interest, but still requires parallel data to guide the development of the unsupervised validation criterion itself. A parallel validation set has also been used for systematic tuning in the context of unsupervised machine translation (Marie and Fujita, 2018; Stojanovski et al., 2019). While this is motivated as a way to abstract away the issue of unsupervised tuning, which the authors consider to be an open problem, we argue that any systematic use of parallel data should not be considered UCL. Finally, previous work often does not report the validation scheme used. In particular, unsupervised cross-lingual word embedding methods have almost exclusively been evaluated on bilingual lexicons that do not have a validation set, and presumably use the test set to guide development to some extent.
Our position is that a completely blind development model without any parallel data is unrealistic. Some cross-lingual signals to guide development are always needed. However, this factor should be carefully controlled and reported with the necessary rigor as a part of the experimental design. We advocate for using one language pair for development and evaluating on others when possible. If parallel data in the target language pair is used, the test set should be kept blind to avoid overfitting, and a separate validation set should be used. In any case, we argue that the use of parallel data in the target language pair should be minimized if not completely avoided, and it should under no circumstances be used for extensive tuning. Instead, we recommend using unsupervised validation criteria for systematic tuning in the target language.
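As one hypothetical instance of an unsupervised validation criterion defined over monolingual data, the sketch below scores a pair of translation systems by round-trip reconstruction: each source sentence is translated to the target language and back, and the result is compared with the original. The round-trip formulation, the token-level F1 measure, and the toy word-by-word translators are all assumptions made for illustration, and this is not necessarily the criterion used in the work discussed above.

```python
from collections import Counter


def token_f1(reference, hypothesis):
    """Token-level F1 overlap between two whitespace-tokenized strings."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    matched = sum((ref & hyp).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def round_trip_score(sentences, src2tgt, tgt2src):
    """Average reconstruction quality after translating there and back."""
    scores = [token_f1(s, tgt2src(src2tgt(s))) for s in sentences]
    return sum(scores) / len(scores)


def word_by_word(table):
    """Build a toy translator that maps each token through `table`."""
    return lambda sentence: " ".join(table.get(tok, tok) for tok in sentence.split())


# Toy systems standing in for two candidate hyperparameter settings.
en2es = word_by_word({"the": "el", "cat": "gato", "sleeps": "duerme"})
es2en_good = word_by_word({"el": "the", "gato": "cat", "duerme": "sleeps"})
es2en_bad = word_by_word({"el": "the", "gato": "dog", "duerme": "sleeps"})

monolingual = ["the cat sleeps"]
good = round_trip_score(monolingual, en2es, es2en_good)
bad = round_trip_score(monolingual, en2es, es2en_bad)
print(good, bad)
```

Such a score is only a proxy: a system that simply copies its input would also reconstruct it perfectly, which is one reason why the criterion itself has to be checked against parallel data during its development, ideally on a separate language pair.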

Evaluation practices
We argue that there are also several issues with common evaluation practices in UCL.
Evaluation on favorable conditions. Most work on UCL has focused on relatively close languages with large amounts of high-quality parallel corpora from similar domains. Only recently have approaches considered more diverse languages as well as language pairs that do not involve English, and some existing methods have been shown to completely break in less favorable conditions (Marchisio et al., 2020). In addition, most approaches have focused on learning from similar domains, often involving Wikipedia and news corpora, which are unlikely to be available for low-resource languages. We believe that future work should pay more attention to the effect of the typology and linguistic distance of the languages involved, as well as the size, noise and domain similarity of the training data used.
Over-reliance on translation tasks. Most work on UCL focuses on translation tasks, either at the word level (where the problem is known as bilingual lexicon induction) or at the sentence level (where the problem is known as unsupervised machine translation). While translation can be seen as the ultimate application of cross-lingual learning and has a strong practical interest on its own, it only evaluates a particular facet of a model's cross-lingual generalization ability. In relation to that, previous work has shown that bilingual lexicon induction performance does not always correlate well with downstream tasks. In particular, some mapping methods that are specifically designed for bilingual lexicon induction have been observed to perform poorly on other tasks, showing the risk of relying excessively on translation benchmarks for evaluating cross-lingual models.
Moreover, existing translation benchmarks have been shown to have several issues of their own. In particular, bilingual lexicon induction datasets have been reported to misrepresent morphological variations, overly focus on named entities and frequent words, and have pervasive gaps in the gold-standard targets (Czarnowska et al., 2019; Kementchedjhieva et al., 2019). More generally, most of these datasets are limited to relatively close languages and comparable corpora.
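For readers less familiar with bilingual lexicon induction, the following sketch shows how such an evaluation could be computed: each source word is translated by retrieving its nearest target word under cosine similarity in the shared space, and precision at 1 is measured against a gold dictionary. The toy embeddings, the gold dictionary, and the function names are made up for illustration, and real evaluations may use other retrieval methods.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induce_lexicon(src_emb, tgt_emb):
    """Map every source word to its nearest target word."""
    return {
        s: max(tgt_emb, key=lambda t: cosine(u, tgt_emb[t]))
        for s, u in src_emb.items()
    }


def precision_at_1(induced, gold):
    """Fraction of gold source words whose induced translation is accepted."""
    hits = sum(1 for s, refs in gold.items() if induced.get(s) in refs)
    return hits / len(gold)


# Toy embeddings that are assumed to be already mapped to a shared space.
src_emb = {"cat": [1.0, 0.1], "dog": [0.1, 1.0], "house": [0.8, 0.6]}
tgt_emb = {"gato": [0.9, 0.2], "perro": [0.2, 0.9], "casa": [1.0, -0.5]}
gold = {"cat": {"gato"}, "dog": {"perro"}, "house": {"casa"}}

induced = induce_lexicon(src_emb, tgt_emb)
p_at_1 = precision_at_1(induced, gold)
print(induced, p_at_1)
```

Since a prediction only counts if it appears in the gold dictionary, gaps in the gold-standard targets directly lower the measured precision, which is one of the dataset issues noted above.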
Lack of an established cross-lingual benchmark. At the same time, there is no de facto standard benchmark to evaluate cross-lingual models beyond translation. Existing approaches have been evaluated in a wide variety of tasks including dependency parsing (Schuster et al., 2019), named entity recognition (Rahimi et al., 2019), sentiment analysis (Barnes et al., 2018), natural language inference (Conneau et al., 2018b), and document classification (Schwenk and Li, 2018). XNLI (Conneau et al., 2018b) and MLDoc (Schwenk and Li, 2018) are common choices, but they have their own problems: MultiNLI, the dataset from which XNLI was derived, has been shown to contain superficial cues that can be exploited (Gururangan et al., 2018), while MLDoc can be solved by keyword matching (Artetxe et al., 2020b). There are non-English counterparts for more challenging tasks such as question answering (Cui et al., 2019; Hsu et al., 2019), but these only exist for a handful of languages. More recent datasets such as XQuAD (Artetxe et al., 2020b), MLQA (Lewis et al., 2019) and TyDi QA (Clark et al., 2020) cover a wider set of languages, but a comprehensive benchmark that evaluates multilingual representations on a diverse set of tasks and languages, in the style of GLUE (Wang et al., 2018), has been missing until very recently. The contemporaneous XTREME (Hu et al., 2020) and XGLUE (Liang et al., 2020) benchmarks try to close this gap, but they are still restricted to languages where existing labelled data is available. Finally, an additional issue is that a large part of these benchmarks were created through translation, which was recently shown to introduce artifacts (Artetxe et al., 2020a).
We present a summary of the methodological issues discussed in Table 2.

Bridging the gap between unsupervised cross-lingual learning flavors
The three categories of UCL (§2) have so far been treated as separate research topics by the community. In particular, cross-lingual word embeddings have a long history, while deep multilingual pretraining has emerged as a separate line of research with its own best practices and evaluation standards. At the same time, unsupervised machine translation has been considered a separate problem in its own right, where cross-lingual word embeddings and deep multilingual pretraining have just served as initialization techniques. While each of these families has its own defining features, we believe that they share a strong connection that should be considered from a more holistic perspective. In particular, both cross-lingual word embeddings and deep multilingual pretraining share the goal of learning (sub)word representations, and essentially differ on whether such representations are static or context-dependent. Similarly, in addition to being a downstream application of the former, unsupervised machine translation can also be useful to develop other multilingual applications or learn better cross-lingual representations. This has previously been shown for supervised machine translation (McCann et al., 2017; Siddhant et al., 2019) and recently for bilingual lexicon induction (Artetxe et al., 2019a). In light of these connections, we call for a more holistic view of UCL, both from an experimental and theoretical perspective.
Evaluation. Most work on cross-lingual word embeddings focuses on bilingual lexicon induction. In contrast, deep multilingual pretraining has not been tested on this task, and is instead typically evaluated on zero-shot cross-lingual transfer. We think it is important to evaluate both families (cross-lingual word embeddings and deep multilingual representations) under the same conditions to better understand their strengths and weaknesses. In that regard, Artetxe et al. (2020b) recently showed that deep pretrained models are much stronger in some downstream tasks, while cross-lingual word embeddings are more efficient and sufficient for simpler tasks. However, this could partly be attributed to a particular integration strategy, and we advocate for using a common evaluation framework in future work to allow a direct comparison between the different families.
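As an illustration of the bilingual lexicon induction task mentioned above, the following minimal sketch retrieves, for each source word, the nearest target word by cosine similarity and scores the result with precision at 1. The function names, the toy embeddings, the gold dictionary, and the choice of plain nearest-neighbor retrieval are all assumptions made for this example; they are not taken from any of the cited systems. The same scoring function could in principle be applied to static vectors derived from a deep multilingual model, which is one hypothetical way of placing both families in a shared evaluation setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def induce_lexicon(src_emb, tgt_emb):
    """Map each source word to its nearest target word in a shared space."""
    return {
        s: max(tgt_emb, key=lambda t: cosine(s_vec, tgt_emb[t]))
        for s, s_vec in src_emb.items()
    }


def precision_at_1(induced, gold):
    """Fraction of gold source words whose induced translation is acceptable.

    gold maps each source word to a set of acceptable target words.
    """
    evaluated = [s for s in gold if s in induced]
    if not evaluated:
        return 0.0
    hits = sum(induced[s] in gold[s] for s in evaluated)
    return hits / len(evaluated)


# Toy, hand-made vectors (hypothetical): assumed to already live in one space.
src_emb = {"dog": [1.0, 0.1], "cat": [0.1, 1.0], "house": [0.7, 0.7]}
tgt_emb = {"perro": [0.9, 0.2], "gato": [0.2, 0.9], "casa": [0.6, 0.8]}
gold = {"dog": {"perro"}, "cat": {"gato"}, "house": {"casa"}}

induced = induce_lexicon(src_emb, tgt_emb)
score = precision_at_1(induced, gold)
print(induced, score)
```

Keeping retrieval and scoring as separate functions means that only the embedding source needs to change when comparing model families, while the gold dictionary and the metric stay fixed.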
Theory. From a more theoretical perspective, it is still not well understood in what ways cross-lingual word embeddings and deep multilingual pretraining differ. While one could expect the latter to be learning higher-level multilingual abstractions, recent work suggests that deep multilingual models might mostly be learning a lexical-level alignment (Artetxe et al., 2020b). For that reason, we believe that further research is needed to understand the relation between both families of models.

Recommendations
To summarize, we make the following practical recommendations for future cross-lingual research:
• Be rigorous when motivating UCL. Do not present it as a practical scenario unless supported by a real use case.
• Be explicit about the monolingual and cross-lingual signals used by your approach and the assumptions it makes, and take them into consideration when comparing different models.
• Report the validation scheme used. Minimize the use of parallel data by preferring an unsupervised validation criterion and/or using only one language for development. Always keep the test set blind.
• Pay attention to the conditions in which you evaluate your model. Consider the impact of typology, linguistic distance, and the domain similarity, size and noise of the training data. Be aware of known issues with common benchmarks, and favor evaluation on a diverse set of tasks.
• Keep a holistic view of UCL, including cross-lingual word embeddings, deep multilingual pretraining and unsupervised machine translation. To the extent possible, favor a common evaluation framework for these different families.

Conclusions
In this position paper, we review the status quo of unsupervised cross-lingual learning, a relatively recent field. UCL is typically motivated by the lack of cross-lingual signal for many of the world's languages, but available resources indicate that a scenario with no parallel data and sufficient monolingual data is not realistic. Instead, we advocate for the importance of UCL for scientific reasons. We also discuss different monolingual and cross-lingual training signals that have been used in the past, and advocate for carefully reporting them to enable a meaningful comparison across different approaches. In addition, we describe methodological issues related to the unsupervised setting and propose measures to ameliorate them. Finally, we discuss connections between cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation, calling for an evaluation on an equal footing.
We hope that this position paper will serve to strengthen research in UCL, providing a more rigorous look at the motivation, definition, and methodology. In light of the unprecedented growth of our field in recent times, we believe that it is essential to establish a rigorous foundation connecting past and present research, and an evaluation protocol that carefully controls for the use of parallel data and assesses models in diverse, challenging settings.