Unsupervised Cross-Lingual Representation Learning

In this tutorial, we provide a comprehensive survey of the exciting recent work on cutting-edge weakly-supervised and unsupervised cross-lingual word representations. After providing a brief history of supervised cross-lingual word representations, we focus on: 1) how to induce weakly-supervised and unsupervised cross-lingual word representations in truly resource-poor settings where bilingual supervision cannot be guaranteed; 2) critical examinations of different training conditions and requirements under which unsupervised algorithms can and cannot work effectively; 3) more robust methods for distant language pairs that can mitigate instability issues and low performance for distant language pairs; 4) how to comprehensively evaluate such representations; and 5) diverse applications that benefit from cross-lingual word representations (e.g., MT, dialogue, cross-lingual sequence labeling and structured prediction applications, cross-lingual IR).


Motivation and Objectives
Cross-lingual word representations offer an elegant and language-pair-independent way to represent content across different languages. They enable us to reason about word meaning in multilingual contexts and serve as an integral source of knowledge for multilingual applications such as machine translation (Artetxe et al., 2018d; Qi et al., 2018; Lample et al., 2018b) or multilingual search and question answering (Vulić and Moens, 2015). In addition, they are a key facilitator of cross-lingual transfer and joint multilingual training, offering support to NLP applications in a large spectrum of languages (Søgaard et al., 2015; Ammar et al., 2016a). While NLP is increasingly embedded into a variety of products related to, e.g., translation, conversational, or search tasks, resources such as annotated training data are still lacking or insufficient to induce satisfactory models for many resource-poor languages. There are often no trained linguistic annotators for these languages, and markets may be too small or too immature to invest in such training. This is a major challenge, but cross-lingual modelling and transfer can help by exploiting observable correlations between major languages and low-resource languages.
Recent work has already verified the usefulness of cross-lingual word representations in a wide variety of downstream tasks, and has provided extensive model classifications in several survey papers (Upadhyay et al., 2016; Ruder et al., 2018b). These surveys cluster supervised cross-lingual word representation models according to the bilingual supervision required to induce shared cross-lingual semantic spaces, covering models based on word alignments and readily available bilingual dictionaries (Mikolov et al., 2013; Smith et al., 2017), sentence-aligned parallel data (Gouws et al., 2015), document-aligned data (Søgaard et al., 2015; Vulić and Moens, 2016), or even image tags and captions (Rotman et al., 2018). The current trend (or rather 'obsession') in cross-lingual word embedding learning, however, concerns models that require a tiny amount of supervision (i.e., weakly-supervised alignment models that require only dozens of word translation pairs) or no supervision at all (fully unsupervised models). Such resource-light unsupervised methods are based on the assumption that monolingual word vector spaces are approximately isomorphic (Conneau et al., 2018a). Therefore, they require only monolingual data and hold promise to enable cross-lingual NLP modeling in the absence of any bilingual resources. As a consequence, they offer support to a wider array of language pairs than supervised models, and promise to deliver language technology to truly resource-poor languages and dialects. However, due to the strong assumption about the similarity of the two spaces' topology, these models often converge to non-optimal solutions, and their robustness is one of the crucial research questions at present (Søgaard et al., 2018).
We will introduce researchers to state-of-the-art methods for constructing resource-light cross-lingual word representations and discuss their applicability in a broad range of downstream NLP applications, covering bilingual lexicon induction, machine translation (both neural and phrase-based), dialogue, and information retrieval tasks. We will deliver a detailed survey of the current cutting-edge methods, discuss best training and evaluation practices and use cases, and provide links to publicly available implementations, datasets, and pretrained models and word embedding collections.

Tutorial Overview

Part I: Introduction We first present an overview of cross-lingual NLP research, situating the current work on unsupervised cross-lingual representation learning, and motivating the need for multilingual training and cross-lingual transfer for resource-poor languages with weak supervision or no bilingual supervision at all. We also present key downstream applications for cross-lingual word representations, such as bilingual lexicon induction and unsupervised MT (Lample et al., 2018b). These tasks will be used throughout the tutorial to analyze the performance of different methods.
Almost all of the work on unsupervised cross-lingual representation learning falls into the category of mapping-based approaches (Ruder et al., 2018b). Such approaches learn mapping functions between pretrained monolingual word embedding spaces; this is in contrast with approaches based on joint learning, data augmentation, or grounding. We show that these alternative approaches, while so far largely unexplored in this setting, can also be unsupervised. We will focus on a standardized two-step mapping-based framework (Artetxe et al., 2018a) that generalizes all mapping-based approaches, and analyze the importance of each component of the framework. The two-step framework decomposes unsupervised cross-lingual representation learning into initial seed induction and iterative supervised bootstrapping.
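The supervised alignment step used inside this bootstrapping loop is typically solved as an orthogonal Procrustes problem, which has a closed-form solution via SVD. The following minimal numpy sketch (function name ours) illustrates the idea on synthetic data, given a seed dictionary of aligned vector pairs:

```python
import numpy as np

def procrustes_map(X, Y):
    """Solve min_W ||XW - Y||_F s.t. W orthogonal, in closed form via SVD.

    X, Y: (n, d) matrices of source/target embeddings for n seed pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a known orthogonal map from "aligned" embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal map
Y = X @ Q
W = procrustes_map(X, Y)
print(np.allclose(W, Q))  # True: the true mapping is recovered
```

In the unsupervised setting, the seed pairs fed to this step come from the induction methods covered in Parts III and IV rather than from a bilingual dictionary.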
Part II: Unsupervised and Weakly Supervised Alignment as Initial Seed Induction + Iterative Supervised Alignment We will analyze the impact of seed bilingual lexicon size and quality (e.g., cognates, named entities, or shared numerals) on the quality of weakly supervised cross-lingual word representations. Unsupervised and weakly supervised approaches can then be compared directly, by comparing the quality of the learned dictionary seeds (Parts III and IV) to that of seeds based on cognates, named entities, etc.
Part III: Adversarial Seed Induction The underlying modus operandi of all adversarial methods will be demonstrated using the example of the MUSE architecture (Conneau et al., 2018a), by far the most cited adversarial seed induction method. We will then present similar adversarial methods and discuss their modeling choices, implementation tricks, and various trade-offs. We will also present our own direct comparisons of various GAN algorithms (e.g., WGAN, GP-WGAN, and CT-GAN) within the MUSE framework.
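To make the modus operandi concrete, the sketch below plays out the adversarial game on toy 2-D data: a linear "generator" maps source vectors into the target space, while a logistic-regression "discriminator" tries to tell mapped source vectors from real target vectors. This is our own didactic toy, not the MUSE implementation; real systems add orthogonality constraints on the mapping, CSLS-based refinement, and careful learning-rate scheduling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "monolingual" spaces: anisotropic source vectors and a rotated target
# space. In reality X and Y come from independently trained embeddings, and
# only their distributions (not any pairing) are available to the learner.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = rng.standard_normal((500, 2)) * np.array([3.0, 1.0])
Y = X @ Q

W = np.eye(2)              # "generator": linear map source -> target space
a, b = np.zeros(2), 0.0    # "discriminator": logistic regression params

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr_d, lr_g, batch = 0.1, 0.02, 32

for _ in range(2000):
    xb = X[rng.integers(0, len(X), batch)]
    yb = Y[rng.integers(0, len(Y), batch)]
    fx = xb @ W
    # Discriminator ascent: push p(real) -> 1 on target, -> 0 on mapped source.
    p_fake, p_real = sigmoid(fx @ a + b), sigmoid(yb @ a + b)
    a += lr_d * (yb.T @ (1 - p_real) - fx.T @ p_fake) / batch
    b += lr_d * (np.mean(1 - p_real) - np.mean(p_fake))
    # Generator ascent on the non-saturating loss: make mapped source "real".
    p_fake = sigmoid(xb @ W @ a + b)
    W += lr_g * (xb.T @ np.outer(1 - p_fake, a)) / batch
```

As in full-scale systems, nothing guarantees that W converges to the true rotation (here the data's symmetries admit several equally good maps), which is exactly why the seed induced this way is followed by iterative supervised refinement.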
Part IV: Non-Adversarial Seed Induction In the next part, we will present several non-adversarial alternatives for unsupervised seed induction based on convex relaxations, point set registration methods, and evolutionary strategies. We will again dissect all components of the unsupervised methods and point to minor but important implementation tricks and hyper-parameters that often slip under the radar (e.g., vocabulary size, post-mapping refinements, preprocessing steps such as mean centering and unit-length normalisation, selected semantic similarity measures, hubness reduction mechanisms). We will also introduce the newest research that extends these methods from bilingual settings to multilingual settings (with more than two languages represented in the same shared space).
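Two of these easily overlooked components, pre-mapping normalisation and CSLS-based hubness reduction, can be sketched in a few lines of numpy (function names and the k=10 default are our illustrative choices):

```python
import numpy as np

def preprocess(X):
    """Pre-mapping steps that are easy to overlook: mean centering followed
    by unit-length (l2) normalisation of each word vector."""
    X = X - X.mean(axis=0, keepdims=True)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def csls(Xs, Xt, k=10):
    """Cross-domain similarity local scaling (CSLS), a standard hubness
    reduction mechanism: penalise each cosine similarity by the mean
    similarity of both words to their k nearest cross-lingual neighbours."""
    sims = Xs @ Xt.T                                    # rows must be unit-norm
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # source-side density
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # target-side density
    return 2 * sims - r_src[:, None] - r_tgt[None, :]
```

Retrieving translations with `csls(...).argmax(axis=1)` instead of plain cosine nearest neighbours noticeably reduces the number of "hub" words that would otherwise be returned as the translation of many unrelated source words.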
Part V: Stochastic Dictionary Induction improves Iterative Alignment We will then discuss stochastic approaches to improving the iterative refinement of the dictionary. Stochastic dictionary induction was introduced by Artetxe et al. (2018b); we show that this bootstrapping technique improves performance and robustness, and is the main reason why Artetxe et al. (2018b) achieve state-of-the-art performance for many language pairs. This part of the tutorial explores variations of stochastic dictionary induction.
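The core mechanism can be sketched as follows: at each refinement iteration, nearest-neighbour translation pairs are extracted from the current similarity matrix, but each pair is kept only with some probability p. Artetxe et al. (2018b) additionally increase p over time; this fixed-p sketch, with our own function name, only illustrates the mechanism:

```python
import numpy as np

def stochastic_dictionary(scores, p=0.1, rng=None):
    """One refinement step with stochastic dictionary induction.

    scores: (n_src, n_tgt) similarity matrix under the current mapping.
    Returns index arrays (src, tgt) for the randomly thinned dictionary.
    """
    rng = rng or np.random.default_rng()
    src = np.arange(scores.shape[0])
    tgt = scores.argmax(axis=1)          # greedy nearest-neighbour dictionary
    keep = rng.random(len(src)) < p      # drop most pairs this round
    return src[keep], tgt[keep]
```

The random thinning injects noise into the bootstrapping loop, helping it escape the poor local optima that deterministic refinement tends to get stuck in.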
Part VI: Robustness and (In)stability Unsupervised methods rely on the assumption that monolingual word vector spaces are approximately isomorphic and that there exists a linear mapping between the two spaces. This assumption does not hold in many cases, which leads to degenerate or suboptimal solutions. The efficacy and stability of unsupervised methods rely on multiple factors, such as the monolingual representation models, domain (dis)similarity, language pair proximity and other typological properties, chosen hyper-parameters, etc. In this part, we will analyze the current problems with the robustness and stability of weakly-supervised and unsupervised alignment methods in relation to all these factors, and introduce the latest solutions to alleviate these problems. We will provide advice on how to approach weakly-supervised and unsupervised training based on a series of empirical observations available in the recent literature (Søgaard et al., 2018; Hartmann et al., 2018). We will also discuss the (im)possibility of learning non-linear mappings using either non-linear generators or locally linear maps (Nakashole, 2018).
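One way to quantify how far two monolingual spaces are from being isomorphic is the eigenvector similarity metric of Søgaard et al. (2018), which compares the Laplacian spectra of nearest-neighbour graphs built in each space. The sketch below is a simplified variant (no spectral truncation, function names ours):

```python
import numpy as np

def laplacian_eigs(X, k=5):
    """Eigenvalues of the Laplacian of a symmetrised k-NN similarity graph
    over the word vectors in X."""
    norms = np.linalg.norm(X, axis=1)
    sims = (X @ X.T) / np.outer(norms, norms)   # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                 # no self-loops
    A = np.zeros_like(sims)
    for i, js in enumerate(np.argsort(sims, axis=1)[:, -k:]):
        A[i, js] = sims[i, js]                  # keep k nearest neighbours
    A = np.maximum(A, A.T)                      # symmetrise the graph
    L = np.diag(A.sum(axis=1)) - A              # unnormalised graph Laplacian
    return np.sort(np.linalg.eigvalsh(L))

def eigenvector_similarity(X, Y, k=5):
    """Proxy for (non-)isomorphism: smaller means the two nearest-neighbour
    graphs are spectrally more alike."""
    ex, ey = laplacian_eigs(X, k), laplacian_eigs(Y, k)
    return float(np.sum((ex - ey) ** 2))
```

Since cosine similarities are invariant under orthogonal maps, a space and its rotated copy score (near) zero, while genuinely non-isomorphic spaces, e.g., embeddings of typologically distant languages, score higher, correlating with the failure cases discussed in this part.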
We will conclude by providing publicly available software packages and implementations, as well as available training datasets and evaluation protocols and systems. We will also list current state-of-the-art results on standard evaluation datasets, and sketch future research paths.

Outline
Part I: Introduction: Motivating and situating cross-lingual word representation learning; presentation of mapping-based approaches (30 minutes) • Current challenges in cross-lingual NLP. NLP for resource-poor languages.
• Bilingual data and cross-lingual supervision. Why do we need weakly supervised and unsupervised cross-lingual representation learning?
• Bilingual supervision and typology of supervised cross-lingual representation models.
• Learning with word-level supervision: mapping-based approaches.
Part II: Unsupervised and Weakly Supervised Alignment as Initial Seed Induction + Iterative Supervised Alignment (30 minutes) • A general framework for mapping-based approaches.
• Importance of seed bilingual lexicons.
• Learning alignment with weak supervision: small seed lexicons, shared words, numerals.
Part III: Adversarial Seed Induction (30 minutes) • Fully unsupervised models using adversarial training; MUSE and related approaches.
Part IV: Non-Adversarial Seed Induction (25 minutes) • Fully unsupervised models using optimal transport, Wasserstein distance, Sinkhorn distance, and other alternatives.
• Importance of minor technical "tricks": pre-mapping and post-mapping steps (length normalisation, mean centering, whitening and de-whitening); making the methods more robust.
Part V: Stochastic Dictionary Induction improves Iterative Alignment (15 minutes) • An overview of methods to improve iterative refinement of the dictionary.
Part VI: Robustness and (In)stability (35 minutes) • Impact of language similarity and typological properties.
• Impact of chosen monolingual models, domain similarity, and hyper-parameters.
• Convergence criteria, possible and impossible setups for unsupervised methods.
• How to build more robust and more stable unsupervised methods?
• Publicly available software and training data.
• Publicly available evaluation systems.
• Concluding remarks, remaining challenges, future work, a short discussion.

Tutorial Breadth
Based on the representative set of papers listed in the selected bibliography, we anticipate that 75%-80% of the tutorial will cover other people's work, while the rest concerns work in which at least one of the three presenters has been actively involved. Note that the three presenters are the main authors of a recent book on cross-lingual word representations, which aimed to provide a systematic overview of the field.

Prerequisites
• Machine Learning: Basic knowledge of common neural network components like word embeddings, RNNs, CNNs, denoising autoencoders, and encoder-decoder models.
• Computational Linguistics: Familiarity with standard NLP tasks such as machine translation.

Other Important Information
Previous Tutorial Editions The EMNLP 2017 tutorial on cross-lingual word embeddings presented much of the earlier work from 2013-2016, which requires large amounts of parallel data (i.e., supervised cross-lingual representations). In contrast, this tutorial focuses on cutting-edge unsupervised and weakly supervised approaches from the period 2016-2018, which will be highly relevant to the audience, and will provide a complete overview of the current cutting-edge research in the field.

Acknowledgments
The work of IV is supported by the ERC Consolidator Grant LEXICAL (no 648909) awarded to Anna Korhonen at the University of Cambridge.