Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model

A significant roadblock in multilingual neural language modeling is the lack of labeled non-English data. One potential method for overcoming this issue is learning cross-lingual text representations that can be used to transfer the performance from training on English tasks to non-English tasks, despite little to no task-specific non-English data. In this paper, we explore a natural setup for learning cross-lingual sentence representations: the dual-encoder. We provide a comprehensive evaluation of our cross-lingual representations on a number of monolingual, cross-lingual, and zero-shot/few-shot learning tasks, and also give an analysis of different learned cross-lingual embedding spaces.


Introduction
Sentence embeddings are broadly useful for a diverse collection of downstream natural language processing tasks (Conneau et al., 2017; Kiros et al., 2015; Logeswaran and Lee, 2018; Subramanian et al., 2018). Sentence embeddings evaluated on downstream tasks in prior work have been trained on monolingual data, preventing them from being used for cross-lingual transfer learning. However, recent work on learning multilingual sentence embeddings has produced representations that capture semantic similarity even when sentences are written in different languages (Eriguchi et al., 2018; Guo et al., 2018; Schwenk and Douze, 2017; Singla et al., 2018). We explore multi-task extensions of multilingual models for cross-lingual transfer learning.

We present a novel approach for cross-lingual representation learning that combines methods for multi-task learning of monolingual sentence representations (Subramanian et al., 2018) with recent work on dual-encoder methods for obtaining multilingual sentence representations for bi-text retrieval (Guo et al., 2018; Yang et al., 2019). By doing so, we learn representations that maintain strong performance on the original monolingual language tasks, while simultaneously obtaining good performance using zero-shot learning on the same task in another language. For a given language pair, we construct a multi-task training scheme using native source language tasks, native target language tasks, and a bridging translation task to encourage sentences with identical meanings, but written in different languages, to have similar embeddings.
We evaluate the learned representations on several monolingual and cross-lingual tasks, and provide a graph-based analysis of the learned representations. Multi-task training using additional monolingual tasks is found to improve performance over models that only make use of parallel data on both cross-lingual semantic textual similarity (STS) (Cer et al., 2017) and cross-lingual eigen-similarity (Søgaard et al., 2018). For European languages, the results show that the addition of monolingual data improves the embedding alignment of sentences and their translations. Further, we find that cross-lingual training with additional monolingual data leads to far better cross-lingual transfer learning performance.

Multi-Task Dual-Encoder Model
The core of our approach is multi-task training over problems that can be modeled as ranking input-response pairs encoded via dual-encoders (Henderson et al., 2017). Cross-lingual representations are obtained by incorporating a translation bridge task (Gouws et al., 2015; Guo et al., 2018; Yang et al., 2019). For input-response ranking, we take an input sentence s_i^I and an associated response sentence s_i^R, and we seek to rank s_i^R over all other possible response sentences s_j^R ∈ S^R. We model the conditional probability P(s_i^R | s_i^I) as:

P(s_i^R | s_i^I) = exp(φ(s_i^I, s_i^R)) / Σ_{s_j^R ∈ S^R} exp(φ(s_i^I, s_j^R)),    (1)

where φ(s^I, s^R) = g^I(s^I) · g^R(s^R), and g^I and g^R are the input and response sentence encoding functions that compose the dual-encoder. The normalization term in eq. 1 is computationally intractable. We follow Henderson et al. (2017) and instead model an approximate conditional probability P̃(s_i^R | s_i^I):

P̃(s_i^R | s_i^I) = exp(φ(s_i^I, s_i^R)) / Σ_{j=1}^{K} exp(φ(s_i^I, s_j^R)),    (2)

where K denotes the size of a single batch of training examples, and the s_j^R correspond to the response sentences associated with the other input sentences in the same batch as s_i^I. We realize g^I and g^R as deep neural networks that are trained to maximize the approximate log-likelihood, log P̃(s_i^R | s_i^I), for each task.
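The in-batch approximation above, where the K responses in a batch serve as candidates for every input, can be sketched in a few lines of numpy. The dot-product scoring and the function name are our assumptions; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def in_batch_softmax_loss(inputs, responses):
    """Mean negative log-likelihood of the in-batch approximation: each
    row of `inputs` is scored against every response in the batch via dot
    products, and the matching response (the diagonal) should win."""
    scores = inputs @ responses.T                         # (K, K) dot-product scores
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pushes each input embedding toward its true response and away from the other K − 1 responses in the batch.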
To obtain a single sentence encoding function g for use in downstream tasks, we share the first k layers of the input and response encoders and treat the final output of these shared layers as g. The shared encoders are used with the ranking formulation above to support conversational response ranking (Henderson et al., 2017), a modified version of quick-thought (Logeswaran and Lee, 2018), and a supervised NLI task for representation learning similar to InferSent (Conneau et al., 2017). To learn cross-lingual representations, we incorporate translation ranking tasks using parallel corpora for the source-target pairs: English-French (en-fr), English-Spanish (en-es), English-German (en-de), and English-Chinese (en-zh).
The resulting model structure is illustrated in Figure 1. We note that the conversational response ranking task can be seen as a special case of Contrastive Predictive Coding (CPC) (van den Oord et al., 2018) that only makes predictions one step into the future.

Encoder Architecture
Word and Character Embeddings. Our sentence encoder makes use of word and character n-gram embeddings. Word embeddings are learned end-to-end. Character n-gram embeddings are learned in a similar manner and are combined at the word level by summing their representations and then passing the resulting vector to a single feedforward layer with tanh activation. We average the word and character embeddings before providing them as input to g.
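A minimal sketch of the character-level path, assuming hashed n-gram lookup tables (the boundary markers, hashing scheme, and function names are our assumptions; the tanh call stands in for the single feedforward layer):

```python
import numpy as np

def char_ngram_embedding(word, ngram_table, n_sizes=(3, 4)):
    """Sum hashed character n-gram embeddings for one word, then apply a
    tanh nonlinearity standing in for the single feedforward layer.
    `ngram_table` is a (buckets, dim) array of learned embeddings."""
    buckets, dim = ngram_table.shape
    padded = f"#{word}#"                 # boundary markers are our assumption
    total = np.zeros(dim)
    for n in n_sizes:
        for i in range(len(padded) - n + 1):
            total += ngram_table[hash(padded[i:i + n]) % buckets]
    return np.tanh(total)

def input_representation(word_vec, char_vec):
    """Average the word and character-level embeddings before the encoder."""
    return 0.5 * (word_vec + char_vec)
```

In a real model both tables would be trained end-to-end; here they are just arrays to show the combination logic.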
Transformer Encoder. The architecture of the shared encoder g consists of three stacked transformer sub-networks, each containing the feedforward and multi-head attention sub-layers described in Vaswani et al. (2017). The transformer output is a variable-length sequence. We average the encodings of all sequence positions in the final layer to obtain our sentence embedding. This embedding is then fed into different sets of feedforward layers, one per task. For our transformer layers, we use 8 attention heads, a hidden size of 512, and a filter size of 2048.
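The mean pooling over sequence positions can be sketched as follows (a hedged illustration; the mask handling is our assumption about how padding would be excluded):

```python
import numpy as np

def mean_pool(token_encodings, mask):
    """Average the final transformer layer over real (unpadded) positions.

    token_encodings: (batch, seq_len, hidden) final-layer outputs.
    mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    m = mask[:, :, None].astype(float)
    summed = (token_encodings * m).sum(axis=1)
    counts = np.maximum(m.sum(axis=1), 1.0)   # guard all-padding rows
    return summed / counts
```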

Multi-task Training Setup
We employ four unique task types for each language pair in order to learn a function g that is capable of strong cross-lingual semantic matching and transfer learning performance for a source-target language pair, while also maintaining monolingual task transfer performance. Specifically, we employ: (i) conversational response prediction, (ii) quick thought, (iii) natural language inference, and (iv) translation ranking as the bridge task. For models trained on a single language pair (e.g., en-fr), six total tasks are used in training, as the first two tasks are mirrored across languages.

Conversational Response Prediction. We model the conversational response prediction task as the input-response ranking problem described above. We minimize the negative log-likelihood of P̃(s_i^R | s_i^I), where s_i^I is a single comment and s_i^R is its associated response comment. For the response side, we model g^R(s_i^R) as g(s_i^R) followed by two fully-connected feedforward layers of size 320 and 512 with tanh activation. For the input representation, however, we simply let g^I(s_i^I) = g(s_i^I).

Quick Thought. We use a modified version of the Quick Thought task detailed by Logeswaran and Lee (2018). We minimize the sum of the negative log-likelihoods of P̃(s_i^P | s_i^I) and P̃(s_i^R | s_i^I), where s_i^I is a sentence taken from an article and s_i^P and s_i^R are its predecessor and successor sentences, respectively. For this task, we model all three of g^P(s_i^P), g^I(s_i^I), and g^R(s_i^R) as g followed by separate fully-connected feedforward layers of size 320 and 512 with tanh activation.
Natural Language Inference (NLI). We also include an English-only natural language inference task (Bowman et al., 2015). For this task, we first encode an input premise sentence s_i^I and its corresponding hypothesis s_i^R into vectors u_1 and u_2 using g. Following Conneau et al. (2017), the vectors u_1 and u_2 are then used to construct a relation feature vector (u_1, u_2, |u_1 − u_2|, u_1 ∗ u_2), where (·) represents concatenation and ∗ represents element-wise multiplication. The relation vector is fed into a single feedforward layer of size 512, followed by a softmax output layer that performs the 3-way NLI classification.
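The relation feature vector is straightforward to construct; a minimal sketch (the function name is ours):

```python
import numpy as np

def nli_relation_features(u1, u2):
    """(u1, u2, |u1 - u2|, u1 * u2): concatenation of both sentence
    embeddings, their absolute difference, and their element-wise product."""
    return np.concatenate([u1, u2, np.abs(u1 - u2), u1 * u2])
```

For 512-dimensional embeddings this yields a 2048-dimensional feature vector that feeds the classification layers.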
Translation Ranking. Our translation task setup is identical to the one used by Guo et al. (2018) for bi-text retrieval. We minimize the negative log-likelihood of P̃(s_i^T | s_i^S), where (s_i^S, s_i^T) is a source-target translation pair. Since the translation task is intended to align the sentence representations of the source and target languages, we do not use any task-specific feedforward layers and instead use g as both g^I and g^R. Following Guo et al. (2018), we append 5 incorrect translations that are semantically similar to the correct translation for each training example as "hard negatives". Similarity is determined via a version of our model trained only on the translation ranking task. We did not see additional gains from using more than 5 hard negatives.
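Appending hard negatives amounts to widening the candidate set scored against each source sentence. A hedged sketch of how the score matrix could be assembled (shapes and names are our assumptions):

```python
import numpy as np

def translation_scores_with_hard_negatives(src, tgt, hard_negs):
    """Score matrix for the translation ranking task with hard negatives.

    src, tgt: (K, d) encodings via the shared encoder g (no task-specific
    layers). hard_negs: (K, 5, d) encodings of the 5 semantically similar
    but incorrect translations appended per example. Returns (K, 6K)
    scores; for row i, column i holds the correct translation.
    """
    flat_negs = hard_negs.reshape(-1, hard_negs.shape[-1])   # (5K, d)
    candidates = np.concatenate([tgt, flat_negs], axis=0)    # (6K, d)
    return src @ candidates.T
```

The in-batch softmax loss is then applied over this wider matrix, so each source must rank its true translation above both random in-batch negatives and the mined hard negatives.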

Corpora
Training data is drawn from Reddit, Wikipedia, the Stanford Natural Language Inference (SNLI) corpus, and web-mined translation pairs. For each of our datasets, we use 90% of the data for training and the remaining 10% for development/validation.

Model Configuration
In all of our experiments, multi-task training is performed by cycling through the different tasks (translation pairs, Reddit, Wikipedia, NLI) and performing an optimization step for a single task at a time. We train all of our models with a batch size of 100 using stochastic gradient descent with a learning rate of 0.008, for a total of 30 million steps. All input text is treebank-style tokenized prior to training. We build a vocabulary containing 200 thousand unigram tokens, with 10 thousand hash buckets for out-of-vocabulary tokens. The character n-gram vocabulary contains 200 thousand hash buckets used for 3- and 4-grams. Both the word and character n-gram embedding sizes are 320. All hyperparameters are tuned on the development portion (a random 10% slice) of our training sets. As an additional training heuristic, we multiply the gradient updates to the word and character embeddings by a factor of 100. We found that using this embedding gradient multiplier alleviates vanishing gradients and greatly improves training.

We compare the proposed cross-lingual multi-task models, subsequently referred to simply as "multi-task" models, with baseline models trained using only the translation ranking task, referred to as "translation-ranking" models.
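The round-robin task schedule and the embedding gradient multiplier can be sketched as follows (the parameter-naming convention used to identify embedding tables is our assumption, not the authors' code):

```python
from itertools import cycle

def sgd_step(params, grads, lr=0.008, emb_grad_multiplier=100.0):
    """One SGD update in which gradients flowing into the word and
    character n-gram embedding tables are scaled by 100, the heuristic
    described above. `params` and `grads` are dicts of name -> value."""
    for name, grad in grads.items():
        scale = emb_grad_multiplier if name.endswith("_embeddings") else 1.0
        params[name] = params[name] - lr * scale * grad
    return params

# Round-robin task cycling: one optimization step per task at a time.
task_schedule = cycle(["translation", "reddit", "wikipedia", "nli"])
```

Each training iteration would draw the next task from `task_schedule`, compute gradients for that task's loss alone, and apply `sgd_step`.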

Model Performance on English Downstream Tasks
We first evaluate all of our cross-lingual models on several downstream English tasks taken from SentEval (Conneau and Kiela, 2018) to verify the impact of cross-lingual training. Evaluations are performed by training single-hidden-layer feedforward networks on top of the 512-dimensional embeddings taken from the frozen models. (For the embedding gradient multiplier, we tried different orders of magnitude and found 100 to work best.) Results on the tasks are summarized in Table 1. We note that cross-lingual training does not hinder the effectiveness of our encoder on English tasks, as the multi-task models are close to state-of-the-art on each of the downstream tasks. For the Text REtrieval Conference (TREC) evaluation, our multi-task models in fact outperform the previous state of the art by a sizable margin.
We observe that the en-zh translation-ranking models perform significantly better on the downstream tasks than the European language pair translation-ranking models. The en-zh models are possibly less capable of exploiting grammatical and other superficial similarities and are forced to rely on semantic representations. Exploring this further may present a promising direction for future research.

Cross-lingual Retrieval
We evaluate both the multi-task and translation-ranking models' efficacy at cross-lingual retrieval using held-out translation pair data. Following Guo et al. (2018) and Henderson et al. (2017), we use precision at N (P@N) as our evaluation metric. Performance is scored by checking whether a source sentence's target translation ranks in the top N scored candidates when considering K other randomly selected target sentences. We set K to 999. (This is smaller than the 10+ million candidates used by Guo et al. (2018), but it allows for good discrimination between models without requiring a heavier and slower evaluation framework.) Similar to Guo et al. (2018), we observe that using a small value of K, such as the K = 99 of Henderson et al. (2017), results in all models achieving near-perfect scores, providing little discrimination between them.

The translation-ranking model is a strong baseline for identifying correct translations, with 95.4%, 87.5%, 97.5%, and 99.7% P@1 for the en-fr, en-es, en-de, and en-zh retrieval tasks, respectively. The multi-task model performs almost identically, with 95.1%, 88.8%, 97.8%, and 99.7% P@1, providing empirical evidence that it is possible to maintain cross-lingual embedding space alignment despite training on additional monolingual tasks for each individual language. (We also experimented with P@3 and P@10; the results are identical.) Both model types surprisingly achieve particularly strong ranking performance on en-zh. As in the task transfer experiments, this may be due to the en-zh models having an implicit inductive bias to rely more heavily on semantics rather than on more superficial aspects of sentence pair similarity.
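The P@N metric can be sketched as follows, assuming dot-product scoring between pre-computed source and target embeddings (the sampling details and function name are our assumptions):

```python
import numpy as np

def precision_at_n(src, tgt, n=1, k=999, seed=0):
    """P@N: fraction of source sentences whose true translation ranks in
    the top N against k randomly drawn target-side distractors."""
    rng = np.random.default_rng(seed)
    hits = 0
    for i in range(len(src)):
        pool = np.delete(np.arange(len(tgt)), i)              # exclude the true target
        distractors = rng.choice(pool, size=min(k, len(pool)), replace=False)
        scores = np.concatenate([[src[i] @ tgt[i]], src[i] @ tgt[distractors].T])
        rank = int((scores > scores[0]).sum())                # 0 = best rank
        hits += rank < n
    return hits / len(src)
```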

Multilingual STS
Cross-lingual representations are evaluated on semantic textual similarity (STS) in French, Spanish, German, and Chinese. To evaluate Spanish-Spanish (es-es) STS, we use data from track 3 of the SemEval-2017 STS shared task (Cer et al., 2017), containing 250 Spanish sentence pairs. We evaluate English-Spanish (en-es) STS using STS 2017 track 4(a), which contains 250 English-Spanish sentence pairs. (The en-es task is split into tracks 4(a) and 4(b); we use only track 4(a) here. Track 4(b) contains sentence pairs from WMT with only one annotator per pair, and previously reported numbers are particularly low for it, which may suggest distributional or annotation differences between this track and other STS datasets.)
Beyond English and Spanish, however, there are no standard STS datasets available for the other languages explored in this work. As such, we perform an additional evaluation on a translated version of the STS Benchmark (Cer et al., 2017) for French, Spanish, German, and Chinese. We use Google's translation system to translate the STS Benchmark sentences into each of these languages. We believe the results on the translated STS Benchmark evaluation sets are a reasonable indicator of multilingual semantic similarity performance, particularly since the NMT encoder-decoder architecture for translation differs significantly from our dual-encoder approach.
Following prior work, we first compute the sentence embeddings u and v for an STS sentence pair, and then score the pair's similarity based on the angular distance between the two embedding vectors: sim(u, v) = 1 − arccos( u·v / (‖u‖ ‖v‖) ) / π. Table 2 shows Pearson's r on the STS Benchmark for all models. The first column shows the trained model performance on the original English STS Benchmark. Columns 2 to 5 give the performance on the remaining languages. Multi-task models perform better than the translation-ranking models on our multilingual STS Benchmark evaluation sets. Table 3 provides the results from the en-es models on the SemEval-2017 STS *-es tracks. The multi-task models achieve 0.827 Pearson's r on the es-es task and 0.769 on the en-es task. As a point of reference, we also list the two best-performing STS systems, ECNU (Tian et al., 2017) and BIT, as reported in Cer et al. (2017). Our results are very close to these state-of-the-art feature-engineered and mixed systems.
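The angular-distance score above is a few lines of numpy (the clipping guard is our addition for numerical safety):

```python
import numpy as np

def angular_similarity(u, v):
    """sim(u, v) = 1 - arccos(cosine(u, v)) / pi, the angular-distance
    score used to compare STS sentence pairs."""
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    cos = max(-1.0, min(1.0, cos))      # guard arccos against rounding drift
    return 1.0 - np.arccos(cos) / np.pi
```

The score is 1 for identical directions, 0.5 for orthogonal vectors, and 0 for opposite directions; Pearson's r is then computed between these scores and the gold similarity labels.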

Zero-shot Classification
To evaluate the cross-lingual transfer learning capabilities of our models, we examine the performance of the multi-task and translation-ranking encoders on zero-shot and few-shot classification tasks.

Multilingual NLI
We evaluate the zero-shot classification performance of our multi-task models on two multilingual natural language inference (NLI) tasks. Prior to doing so, we first train a modified version of our multi-task models that also includes training on the English Multi-Genre NLI (MultiNLI) dataset (Williams et al., 2018) in addition to SNLI. We train with MultiNLI to be consistent with the baselines from prior work. We make use of the professionally translated French and Spanish SNLI subsets created by Agić and Schluter (2018) for an initial cross-lingual zero-shot evaluation on French and Spanish. We refer to these translated subsets as SNLI-X. There are 1,000 examples in the subset for each language. To evaluate, we feed the French and Spanish examples into the pre-trained English NLI sub-network of our multi-task models.
We additionally make use of the XNLI dataset (Conneau et al., 2018), which provides multilingual NLI evaluations for Spanish, French, German, Chinese, and more. There are 5,000 examples in each XNLI test set, and zero-shot evaluation is once again done by feeding non-English examples into the pre-trained English NLI sub-network. Table 4 lists the accuracy on the English SNLI test set as well as on SNLI-X and XNLI for all of our multi-task models. The original English SNLI accuracies are around 84% for all of our multi-task models, indicating that English SNLI performance remains stable in the multi-task training setting.
The zero-shot accuracy on SNLI-X is around 74% for both the en-fr and en-es models. The zero-shot accuracy on XNLI is around 65% for en-es, en-fr, and en-de, and around 63% for en-zh, thereby significantly outperforming the pre-trained sentence encoding baselines (X-CBOW) of Conneau et al. (2018). The X-CBOW baselines use fixed sentence encoders that are the result of averaging tuned multilingual word embeddings.
Row 4 of Table 4 shows the zero-shot French NLI performance of Eriguchi et al. (2018), a state-of-the-art zero-shot NLI classifier based on multilingual NMT embeddings. Our multi-task model shows comparable performance to the NMT-based model in both English and French.

Amazon Reviews
Zero-shot Learning. We also conduct a zero-shot evaluation based on the Amazon review data extracted by Prettenhofer and Stein (2010). Following Prettenhofer and Stein (2010), we preprocess the Amazon reviews and convert the data into a binary sentiment classification task, treating reviews with strictly more than three stars as positive and those with fewer than three stars as negative. Reviews contain a summary field and a text field, which we concatenate to produce a single input. Since our models are trained with sentence lengths clipped to 64 tokens, we take only the first 64 tokens of the concatenated text as input. There are 6,000 training reviews in English, which we split into 90% for training and 10% for development.
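A minimal sketch of this preprocessing (the handling of exactly-three-star reviews is our assumption, since only the two strict cases are defined above, and whitespace splitting stands in for the treebank tokenizer):

```python
def preprocess_review(summary, text, stars, max_tokens=64):
    """Binarize and clip one Amazon review: >3 stars positive, <3 stars
    negative; exactly 3 stars is treated as discarded (an assumption).
    Whitespace tokenization is a stand-in for the treebank tokenizer."""
    if stars == 3:
        return None
    label = 1 if stars > 3 else 0
    tokens = f"{summary} {text}".split()[:max_tokens]
    return " ".join(tokens), label
```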
We first encode inputs using the pre-trained multi-task and translation-ranking encoders and feed the encoded vectors into a 2-layer feedforward network culminating in a softmax layer. We use hidden layers of size 512 with tanh activation functions. We use Adam for optimization with an initial learning rate of 0.0005 and a learning rate decay of 0.9 at every epoch during training. We use a batch size of 16 and train for 20 total epochs in all experiments. We freeze the cross-lingual encoder during training. The model architecture and parameters are tuned on the development set.
We first train the classifier on English data, and then evaluate it on the 6,000 French and German Amazon review test examples. The results are summarized in Table 5. On the English test set, the accuracy of the en-fr model is 87.4%, with the en-de model achieving 87.1%. Both models achieve zero-shot accuracy above 80% on their respective non-English test sets. The translation-ranking models again perform worse on all metrics. Once again we compare the proposed model with Eriguchi et al. (2018), and find that our zero-shot performance shows a reasonable gain on the French test set. (Eriguchi et al. (2018) also train a shallow classifier, but use only the review text and truncate their inputs to 200 tokens; our setup is slightly different, as our models can take a maximum of only 64 tokens.)

[Baseline rows recovered from Table 4: multilingual NMT zero-shot (Eriguchi et al., 2018): 84.4, 73.9; XNLI-CBOW zero-shot: 64.5, 60.3, 60.7, 61.0, 58.8; non-zero-shot baselines XNLI-BiLSTM-last: 71.0, 65.2, 67.8, 66.6, 63.7; XNLI-BiLSTM-max: 73.7, 67.7, 68.7, 67.7, 65.8.]

Table 5: Zero-shot sentiment classification accuracy (%) on non-English Amazon review test data after training on English-only Amazon reviews.

Few-shot Learning. We further evaluate the proposed multi-task models via few-shot learning, training on English reviews together with only a portion of the French and German reviews. Our few-shot models are compared with baselines trained on French and German reviews only. Table 6 provides the classification accuracy of the few-shot models, where the second row indicates the percent of French and German data used when training each model. With as little as 20% of the French or German training data, the few-shot models perform nearly as well as the baseline models trained on 100% of the French and German data. Adding more French and German training data leads to further improvements in few-shot model performance, with the few-shot models reaching 85.8% accuracy in French and 84.5% accuracy in German when using all of the French and German data. The French model notably performs +0.9% better when trained on a combination of the English and French reviews rather than on the French reviews alone.

Analysis of Cross-lingual Embedding Spaces
Table 6: Sentiment classification accuracy (%) on target-language Amazon review test data after training on English Amazon review data and a portion of French or German data. The second row shows the percent of French (fr) or German (de) data used for training in each model.

Motivated by the recent work of Søgaard et al. (2018) studying the graph structure of multilingual word representations, we perform a similar analysis for our learned cross-lingual sentence representations. We take N samples of size K from the language pair translation data and then encode these samples using the corresponding multi-task and translation-ranking models. We then compute pairwise distance matrices within each sampled set of encodings, and use these distance matrices to construct graph Laplacians. We obtain the similarity Ψ(S, T) between each model's source and target language embeddings by comparing the eigenvalues of the source language graph Laplacians to the eigenvalues of the target language graph Laplacians:

Ψ(S, T) = (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} ( λ_k(L(S_n)) − λ_k(L(T_n)) )²,

where λ_k(·) denotes the k-th eigenvalue of a graph Laplacian, and S_n and T_n denote the n-th sampled sets of K
source-target translation pairs. A smaller value of Ψ(S, T) indicates higher eigen-similarity between the source language and target language embedding subsets. Following Søgaard et al. (2018), we use a sample size of K = 10 translation pairs, but we draw N = 1,000 samples instead of the N = 10 used in Søgaard et al. (2018), as we found Ψ(S, T) to have very high variance at N = 10. The computed values of Ψ(S, T) for our multi-task and translation-ranking models are summarized in Table 7. We find that the source and target embedding subsets constructed from the multi-task models exhibit greater average eigen-similarity than those resulting from the translation-ranking models for the European source-target language pairs, and we observe the opposite for the English-Chinese (en-zh) models. This discrepancy is curious; we believe further experiments examining eigen-similarity across languages could yield interesting results and language groupings.
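A minimal sketch of this analysis (the cosine-similarity edge weights and function names are our choices, since the exact distance used to build the graphs is not specified above):

```python
import numpy as np

def laplacian_eigenvalues(embeddings):
    """Sorted eigenvalues of a graph Laplacian built over one sampled set
    of sentence embeddings, using nonnegative cosine similarities as
    edge weights (our assumption)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adj = np.clip(x @ x.T, 0.0, None)      # nonnegative similarity graph
    np.fill_diagonal(adj, 0.0)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))

def eigen_similarity(src_samples, tgt_samples):
    """Psi(S, T): mean over N sampled source/target sets (lists of (K, d)
    arrays) of the summed squared eigenvalue differences. Smaller values
    mean the two embedding subsets are more similar."""
    diffs = [((laplacian_eigenvalues(s) - laplacian_eigenvalues(t)) ** 2).sum()
             for s, t in zip(src_samples, tgt_samples)]
    return float(np.mean(diffs))
```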
Greater eigen-similarity tracks better performance on the cross-lingual transfer tasks for the European language pair multi-task models. A potential direction for future work could be to introduce regularization penalties based on graph similarity during multi-task training. Interestingly, we also observe that the eigen-similarity gaps between the multi-task and translation-ranking models are not uniform across language pairs. Thus, another direction could be to further study differences in the difficulty of aligning different source-target language embeddings.

Discussion on Input Representations
Our early explorations comparing a combination of character n-gram embeddings and word embeddings against word embeddings alone as the model input representation suggest that using word embeddings only performs slightly worse (one to two absolute percentage points) on the dev sets for the training tasks. The notable exception is that the word-embedding-only English-German models tend to perform much worse on the dev sets for the training tasks involving German. This is likely due to the prevalence of compound words in German and represents an interesting difference for future exploration.
We subsequently explored training versions of our cross-lingual models using a SentencePiece vocabulary (Kudo and Richardson, 2018), a set of largely sub-word tokens (characters and word chunks) that provide good coverage of an input dataset. Multilingual models for a single language pair (e.g., en-de) trained with SentencePiece performed similarly on the training dev sets to the models using character n-grams. However, when more languages are included in a single model (e.g., a single model that covers en, fr, de, es, and zh), SentencePiece tends to perform worse than the combination of word and character n-gram embeddings. Within a larger joint model, SentencePiece is particularly problematic for languages like zh, which end up largely tokenized into individual characters.

Conclusion
Cross-lingual multi-task dual-encoder models are found to learn representations that achieve strong within-language and cross-lingual transfer learning performance. By training English-French, English-Spanish, English-German, and English-Chinese multi-task models, we achieve near-state-of-the-art or state-of-the-art performance on a variety of English tasks, while also producing results of similar caliber on zero-shot cross-lingual transfer learning tasks. Further, cross-lingual multi-task training is shown to improve performance on some downstream English tasks (TREC). We believe there are many possibilities for future exploration of cross-lingual model training, and that such models will be foundational as language processing systems are tasked with increasing amounts of multilingual data.