Jointly Learning to Embed and Predict with Multiple Languages

We propose a joint formulation for learning task-specific cross-lingual word embeddings, along with classifiers for that task. Unlike prior work, which first learns the embeddings from parallel data and then plugs them into a supervised learning problem, our approach is one-shot: a single optimization problem combines a co-regularizer for the multilingual embeddings with a task-specific loss. We present theoretical results showing that Euclidean co-regularizers cannot benefit from increasing the embedding dimension, a limitation which does not exist for other co-regularizers (such as the ℓ1-distance). Despite its simplicity, our method achieves state-of-the-art accuracy on the RCV1/RCV2 dataset when transferring from English to German, with training times below one minute. On the TED Corpus, we obtain the highest reported scores on 10 out of 11 languages.


Introduction
Distributed representations of text (embeddings) have been the target of much research in natural language processing (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015). Word embeddings partially capture semantic and syntactic properties of text in the form of dense real vectors, making them apt for a wide variety of tasks, such as language modeling (Bengio et al., 2003), sentence tagging (Turian et al., 2010; Collobert et al., 2011), sentiment analysis (Socher et al., 2011), parsing (Chen and Manning, 2014), and machine translation (Zou et al., 2013).
At the same time, there has been consistent progress in devising "universal" multilingual models via cross-lingual transfer techniques of various kinds (Hwa et al., 2005; Zeman and Resnik, 2008; McDonald et al., 2011; Ganchev and Das, 2013; Martins, 2015). This line of research seeks ways of using data from resource-rich languages to solve tasks in resource-poor languages. Given the difficulty of handcrafting language-independent features, it is highly appealing to obtain rich, delexicalized, multilingual representations embedded in a shared space.
A string of work started with Klementiev et al. (2012) on learning bilingual embeddings for text classification. Hermann and Blunsom (2014) proposed a noise-contrastive objective to push the embeddings of parallel sentences to be close in space. A bilingual auto-encoder was proposed by Chandar et al. (2014), while Faruqui and Dyer (2014) applied canonical correlation analysis to parallel data to improve monolingual embeddings. Other works optimize a sum of monolingual and cross-lingual terms (Gouws et al., 2015; Soyer et al., 2015), or introduce bilingual variants of skip-gram (Coulmance et al., 2015). Recently, the non-compositional paragraph vectors of Le and Mikolov (2014) were extended to a bilingual setting, achieving a new state of the art at the cost of more expensive (and non-deterministic) prediction.
In this paper, we propose an alternative joint formulation that learns embeddings suited to a particular task, together with the corresponding classifier for that task. We do this by minimizing a combination of a supervised loss function and a multilingual regularization term. Our approach leads to a convex optimization problem and makes a bridge between classical co-regularization approaches for semi-supervised learning (Sindhwani et al., 2005; Altun et al., 2005; Ganchev et al., 2008) and modern representation learning. In addition, we show that Euclidean co-regularizers have serious limitations for learning rich embeddings when the number of task labels is small. We establish this by proving that the resulting embedding matrices have their rank upper bounded by the number of labels. This limitation does not exist for other regularizers (convex or not), such as the ℓ1-distance and noise-contrastive distances.
Our experiments on the RCV1/RCV2 dataset yield state-of-the-art accuracy (92.7%) with this simple convex formulation, when transferring from English to German, without the need for negative sampling, extra monolingual data, or non-additive representations. For the reverse direction, our best number (79.3%), while far behind the recent para_doc approach, is on par with current compositional methods.
On the TED corpus, we obtained general purpose multilingual embeddings for 11 target languages, by considering the (auxiliary) task of reconstructing pre-trained English word vectors. The resulting embeddings led to cross-lingual multi-label classifiers that achieved the highest reported scores on 10 out of these 11 languages.

Cross-Lingual Text Classification
We consider a cross-lingual classification framework, where a classifier is trained on a dataset from a source language (such as English) and applied to a target language (such as German). Later, we generalize this setting to multiple target languages and to other tasks besides classification.
The following data are assumed available:

1. A labeled dataset D_l := {(x^(m), y^(m))}_{m=1}^{M}, consisting of text documents x in the source language categorized with a label y ∈ {1, . . . , L}.
2. An unlabeled parallel corpus D_u := {(s^(n), t^(n))}_{n=1}^{N}, containing sentences s in the source language paired with their translations t in the target language (but no information about their categories).
Let V_S and V_T be the vocabulary sizes of the source and target languages, respectively. Throughout, we represent sentences s ∈ R^{V_S} and t ∈ R^{V_T} as vectors of word counts, and documents x as averages of sentence vectors. We assume that the unlabeled sentences largely outnumber the labeled documents, N ≫ M, and that the number of labels L is relatively small. The goal is to use the data above to learn a classifier h : R^{V_T} → {1, . . . , L} for the target language.
This problem is usually tackled with a two-stage approach: in the first step, bilingual word embeddings P ∈ R^{V_S×K} and Q ∈ R^{V_T×K} are learned from D_u, where each row of these matrices contains a K-dimensional word representation in a shared vector space. In the second step, a standard classifier is trained on D_l, using the source embeddings P. Since the embeddings live in a shared space, the trained model can be applied directly to classify documents in the target language. We next describe these two steps in more detail. We assume throughout an additive representation for sentences and documents (denoted ADD by Hermann and Blunsom (2014)). These representations can be expressed algebraically as P⊤x, P⊤s, Q⊤t ∈ R^K, respectively.
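Under the additive (ADD) representation, a sentence or document vector is just the transposed embedding matrix applied to a bag-of-words count vector. The following numpy sketch illustrates this; all dimensions and values are toy examples, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V_S, V_T, K = 6, 5, 3            # toy vocabulary sizes and embedding dimension
P = rng.normal(size=(V_S, K))    # source embeddings, one row per word
Q = rng.normal(size=(V_T, K))    # target embeddings

def represent(E, counts):
    """Additive (ADD) representation: E^T counts, a vector in R^K."""
    return E.T @ counts

s = np.array([1., 0., 2., 0., 1., 0.])   # source sentence as word counts
t = np.array([0., 1., 1., 0., 1.])       # its translation

r_s, r_t = represent(P, s), represent(Q, t)
assert r_s.shape == (K,) and r_t.shape == (K,)

# A document is an average of its sentence count vectors, so its
# representation is the same average of the sentence representations.
x = (s + s) / 2.0
assert np.allclose(represent(P, x), r_s)
```

Because the representation is linear in the counts, the document-level and sentence-level views are interchangeable, which is what allows a single embedding matrix to serve both.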
Step 1: Learning the Embeddings. The cross-lingual embeddings P and Q are trained so that the representations of paired sentences (s, t) ∈ D_u have a small (squared) Euclidean distance:

d(s, t) := ||P⊤s − Q⊤t||².    (1)

Since a direct minimization of Eq. 1 leads to a degenerate solution (P = 0, Q = 0), Hermann and Blunsom (2014) use instead a noise-contrastive large-margin distance obtained via negative sampling:

d_ns(s, t, n) := [m + ||P⊤s − Q⊤t||² − ||P⊤s − Q⊤n||²]_+,    (2)

where n is a random (unpaired) target sentence, m is a "margin" parameter, and [x]_+ := max{0, x}.
Letting J be the number of negative examples in each sample, they arrive at the following objective function to be minimized:

Σ_{n=1}^{N} Σ_{j=1}^{J} d_ns(s^(n), t^(n), n^(n,j)).    (3)

This minimization can be carried out efficiently with gradient-based methods, such as stochastic gradient descent or AdaGrad (Duchi et al., 2011). Note, however, that the objective function in Eq. 3 is not convex. Therefore, one may land at different local minima, depending on the initialization.
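For concreteness, the noise-contrastive distance can be sketched as below. This is a simplified illustration under the additive representation, not Hermann and Blunsom's implementation; all sizes are toy:

```python
import numpy as np

def d_ns(P, Q, s, t, n, margin=1.0):
    """Noise-contrastive large-margin distance: the paired translation t must
    be closer to s than a random unpaired sentence n, by at least the margin."""
    ps = P.T @ s
    return max(0.0, margin
               + np.sum((ps - Q.T @ t) ** 2)
               - np.sum((ps - Q.T @ n) ** 2))

rng = np.random.default_rng(0)
P = rng.normal(size=(4, 2))
Q = rng.normal(size=(3, 2))
s = np.array([1., 0., 1., 0.])
t = np.array([0., 1., 1.])

# With n == t the two distances cancel and the loss equals the margin.
assert d_ns(P, Q, s, t, t) == 1.0
# The hinge [x]_+ = max{0, x} keeps the loss non-negative.
assert d_ns(P, Q, s, t, np.array([1., 0., 0.])) >= 0.0
```

The hinge makes the objective piecewise quadratic but non-convex in (P, Q), which is exactly the non-convexity discussed above.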
Step 2: Training the Classifier. Once we have the bilingual embeddings P and Q, we can compute the representation P⊤x ∈ R^K of each document x in the labeled dataset D_l. Let V ∈ R^{K×L} be a matrix of parameters (weights), with one column v_y per label. A linear model is used to make predictions, according to

ŷ = argmax_{y ∈ {1,...,L}} v_y⊤(P⊤x) = argmax_{y ∈ {1,...,L}} w_y⊤x,    (4)

where w_y is a column of the matrix W := P V ∈ R^{V_S×L}. In prior work, the perceptron algorithm was used to learn the weights V from the labeled examples in D_l (Klementiev et al., 2012; Hermann and Blunsom, 2014). Note that, at test time, it is not necessary to store the full embeddings: if L ≪ K, we may simply precompute W := P V (one weight per word and label) if the input is in the source language (or Q V, if the input is in the target language) and treat this as a regular bag-of-words linear model.
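The test-time collapse of the embeddings into a single bag-of-words weight matrix can be checked numerically. A sketch with toy dimensions (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
V_S, K, L = 8, 3, 4
P = rng.normal(size=(V_S, K))                 # source embeddings
V = rng.normal(size=(K, L))                   # label weights in embedding space
x = rng.poisson(1.0, size=V_S).astype(float)  # document as word counts

# Scoring through the K-dimensional representation ...
scores_embed = V.T @ (P.T @ x)
# ... is identical to scoring with the precomputed weights W = P V,
# i.e., an ordinary bag-of-words linear model with one weight per word/label.
W = P @ V
scores_bow = W.T @ x
assert np.allclose(scores_embed, scores_bow)

y_hat = int(np.argmax(scores_bow))            # predicted label
assert 0 <= y_hat < L
```

The equivalence is just associativity of matrix products, V⊤(P⊤x) = (P V)⊤x, so nothing is lost by discarding P and Q after training.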

Jointly Learning to Embed and Classify
Instead of a two-stage approach, we propose to learn the bilingual embeddings and the classifier jointly on D_l ∪ D_u, as described next.
Our formulation optimizes a combination of a co-regularization function R, whose goal is to push the embeddings of paired sentences in D_u to stay close, and a loss function L, which fits the model to the labeled data in D_l.
The simplest choice for R is a simple Euclidean co-regularization function:

R_2(P, Q) := (1/2N) Σ_{n=1}^{N} ||P⊤s^(n) − Q⊤t^(n)||².    (5)

An alternative is the ℓ1-distance:

R_1(P, Q) := (1/N) Σ_{n=1}^{N} ||P⊤s^(n) − Q⊤t^(n)||_1.    (6)

One possible advantage of R_1(P, Q) over R_2(P, Q) is that the ℓ1-distance is more robust to outliers, hence less sensitive to differences in the parallel sentences. Note that both functions in Eqs. 5-6 are jointly convex on P and Q, unlike the one in Eq. 3. They are also simpler and do not require negative sampling. While these functions have a degenerate behavior in isolation (since they are both minimized by P = 0 and Q = 0), we will see that they become useful when plugged into a joint optimization framework.

The next step is to define the loss function L to leverage the labeled data in D_l. We consider a log-linear model P(y | x; W) ∝ exp(w_y⊤x), which leads to the following logistic loss function:

L_LL(W) := −(1/M) Σ_{m=1}^{M} log P(y^(m) | x^(m); W).    (7)

We impose that W is of the form W = P V for a fixed V ∈ R^{K×L}, whose choice we discuss below.
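Both co-regularizers can be computed in a vectorized way over the whole parallel corpus. A sketch under the additive representation, with rows of S and T holding sentence count vectors (the 1/2N and 1/N normalizations are one plausible convention; all data below is toy):

```python
import numpy as np

def R2(P, Q, S, T):
    """Euclidean co-regularizer: mean squared distance between the
    representations of paired sentences."""
    D = S @ P - T @ Q                    # N x K representation differences
    return 0.5 * np.mean(np.sum(D ** 2, axis=1))

def R1(P, Q, S, T):
    """l1 co-regularizer: mean l1 distance, more robust to outlier pairs."""
    D = S @ P - T @ Q
    return np.mean(np.sum(np.abs(D), axis=1))

rng = np.random.default_rng(2)
N, V_S, V_T, K = 10, 6, 5, 3
S = rng.poisson(1.0, size=(N, V_S)).astype(float)
T = rng.poisson(1.0, size=(N, V_T)).astype(float)
P = rng.normal(size=(V_S, K))
Q = rng.normal(size=(V_T, K))

assert R2(P, Q, S, T) >= 0.0 and R1(P, Q, S, T) >= 0.0
# Both are degenerate in isolation: P = 0, Q = 0 drives them to zero.
Z_s, Z_t = np.zeros_like(P), np.zeros_like(Q)
assert R2(Z_s, Z_t, S, T) == 0.0 and R1(Z_s, Z_t, S, T) == 0.0
```

The final assertions make the degeneracy concrete: without a task loss, the all-zero embeddings are trivially optimal for either regularizer.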
Putting the pieces together and adding some extra regularization terms, we formulate our joint objective function as follows:

F(P, Q) := L(P V) + µ R(P, Q) + (µ_S/2) ||P||_F² + (µ_T/2) ||Q||_F²,    (8)

where µ, µ_S, µ_T ≥ 0 are regularization constants. By minimizing a combination of L(P V) and R(P, Q), we expect to obtain embeddings Q* that lead to an accurate classifier h for the target language. Note that P = 0 and Q = 0 is no longer a solution, due to the presence of the loss term L(P V) in the objective.
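Putting loss and co-regularizer together, the joint objective can be sketched as follows, using the Euclidean co-regularizer and a softmax log-loss; normalization constants and hyperparameter values here are illustrative, not the paper's tuned settings:

```python
import numpy as np

def objective(P, Q, V, X, y, S, T, mu=1.0, mu_S=1e-3, mu_T=1e-3):
    """Joint objective: logistic loss on labeled docs + co-regularizer on
    parallel sentences + Frobenius-norm penalties on both embedding matrices."""
    W = P @ V                                  # V_S x L bag-of-words weights
    Z = X @ W                                  # M x L label scores
    Z = Z - Z.max(axis=1, keepdims=True)       # stabilize the softmax
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    loss = -np.mean(logp[np.arange(len(y)), y])
    coreg = 0.5 * np.mean(np.sum((S @ P - T @ Q) ** 2, axis=1))
    return (loss + mu * coreg
            + 0.5 * mu_S * np.sum(P ** 2) + 0.5 * mu_T * np.sum(Q ** 2))

rng = np.random.default_rng(3)
M, N, V_S, V_T, K, L = 5, 8, 6, 5, 3, 4
X = rng.poisson(1.0, size=(M, V_S)).astype(float)
y = rng.integers(0, L, size=M)
S = rng.poisson(1.0, size=(N, V_S)).astype(float)
T = rng.poisson(1.0, size=(N, V_T)).astype(float)
P = rng.normal(size=(V_S, K))
Q = rng.normal(size=(V_T, K))
V = rng.normal(size=(K, L))

F = objective(P, Q, V, X, y, S, T)
assert np.isfinite(F) and F > 0.0
```

Because both terms are convex in (P, Q) for fixed V, any gradient-based minimizer of this function reaches a global optimum.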
Choice of V. In Eq. 8, we chose to keep V fixed rather than optimize it. The rationale is that there are many more degrees of freedom in the embedding matrices P and Q than in V (concretely, we are assuming a small number of labels, L ≪ V_S + V_T). Our assumption is that we have enough degrees of freedom to obtain an accurate model, regardless of the choice of V. These claims will be backed in §4 by a more rigorous theoretical result. Keeping V fixed has another important advantage: it allows minimizing F with respect to P and Q only, which makes it a convex optimization problem if we choose R and L to be both convex, e.g., setting R ∈ {R_2, R_1} and L := L_LL.
Relation to Multi-View Learning. An interesting particular case of this formulation arises if K = L and V = I_L (the identity matrix). In that case, we have W = P, and the embedding matrices P and Q are in fact weights for every pair of word and label, as in standard bag-of-words models. We may then interpret the co-regularizer R(P, Q) in Eq. 8 as a term that pushes the label scores of paired sentences, P⊤s^(n) and Q⊤t^(n), to be similar, while the source-based log-linear model is fit via L(W). The same idea underlies various semi-supervised co-regularization methods that seek agreement between multiple views (Sindhwani et al., 2005; Altun et al., 2005; Ganchev et al., 2008). In fact, we may regard the joint optimization in Eq. 8 as a generalization of those methods, making a bridge between them and representation learning.
Multilingual Embeddings. It is straightforward to extend the framework presented here to the case where there are multiple target languages (say R of them), and we want to learn one embedding matrix for each, {Q_1, . . . , Q_R}. The simplest way is to consider a sum of pairwise co-regularizers:

R(P, Q_1, . . . , Q_R) := Σ_{r=1}^{R} R(P, Q_r).    (9)

If R is additive over the parallel sentences (which is the case for R_2, R_1, and R_ns), then this procedure is equivalent to concatenating all the parallel sentences (regardless of the target language) and adding a language suffix to the words to distinguish them. This reduces directly to a problem in the same form as Eq. 8.
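The reduction of the multilingual case to a single bilingual-style problem is just a matter of pooling the parallel pairs and tagging target words with a language suffix. A sketch of the bookkeeping (the suffix scheme and helper name are arbitrary illustrative choices):

```python
def pool_parallel(corpora):
    """Pool per-language parallel corpora into a single list of sentence
    pairs, suffixing target words so each language keeps its own embeddings.

    `corpora` maps a language code to a list of
    (source_words, target_words) pairs."""
    pooled = []
    for lang, pairs in corpora.items():
        for src, tgt in pairs:
            pooled.append((src, [w + ":" + lang for w in tgt]))
    return pooled

pairs = pool_parallel({
    "de": [(["the", "house"], ["das", "haus"])],
    "fr": [(["the", "house"], ["la", "maison"])],
})
assert (["the", "house"], ["das:de", "haus:de"]) in pairs
assert (["the", "house"], ["la:fr", "maison:fr"]) in pairs
```

After pooling, the joint vocabulary simply stacks the per-language target vocabularies, and the optimization proceeds exactly as in the bilingual case.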
Pre-Trained Source Embeddings. In practice, it is often the case that pre-trained embeddings for the source language are already available (let P̄ be the available embedding matrix). It would be foolish not to exploit those resources. In this scenario, the goal is to use P̄ and the dataset D_u to obtain "good" embeddings for the target languages (possibly tweaking the source embeddings too, P ≈ P̄). Our joint formulation in Eq. 8 can also be used to address this problem. It suffices to set K = L and V = I_L (as in the multi-view learning case discussed above) and to define an auxiliary task that pushes P and P̄ to be similar. The simplest way is to use a reconstruction loss:

L_rec(P) := (1/2) ||P − P̄||_F².    (10)

The resulting optimization problem resembles the retrofitting approach of Faruqui et al. (2015), except that the goal here is to extend the embeddings to other languages, instead of pushing monolingual embeddings to agree with a semantic lexicon. We will present some experiments in §5.2 using this framework.

Limitations of the Euclidean Co-Regularizer
One may wonder how much the embedding dimension K influences the learned classifier. The next proposition shows the (surprising) result that, with the formulation in Eq. 8 with R = R_2, it makes absolutely no difference to increase K past the number of labels L. Below, T ∈ R^{V_T×N} denotes the matrix with columns t^(1), . . . , t^(N).
Proposition 1. Let R = R_2 and assume T has full row rank. Then, for any choice of V ∈ R^{K×L}, possibly with K > L, the following holds:
1. There is an alternative, low-dimensional V′ ∈ R^{K′×L} with K′ ≤ L such that the classifier obtained (for both languages) by optimizing Eq. 8 using V′ is the same as if using V.
2. This classifier depends on V only via the L-by-L matrix V⊤V.
3. If P*, Q* are the optimal embeddings obtained with V, then we always have rank(P*) ≤ L and rank(Q*) ≤ L, regardless of K.
Proof. See App. A.1 in the supplemental material.
Let us reflect for a moment on the practical impact of Prop. 1. This result shows the limitation of the Euclidean co-regularizer R_2 in a very concrete manner: when R = R_2, we only need to consider representations of dimension K ≤ L.
Note also that a corollary of Prop. 1 arises when V⊤V = I_L, i.e., when V is chosen to have orthonormal columns (a sensible choice, since it corresponds to seeking embeddings that leave the label weights "uncorrelated"). Then, the second statement of Prop. 1 tells us that the resulting classifier will be the same as if we had simply set V = I_L (the particular case discussed in §3). We will see in §5.1 that, despite this limitation, this classifier is actually a very strong baseline. Of course, if the number of labels L is large enough, this limitation might not be a reason for concern [4]. An instance will be presented in §5.2, where we will see that the Euclidean co-regularizer excels.
Finally, one might wonder whether Prop. 1 applies only to the (Euclidean) ℓ2 norm or if it holds for arbitrary regularizers. In fact, we show in App. A.2 that this limitation applies more generally to Mahalanobis-Frobenius norms, which are essentially Euclidean norms after a linear transformation of the vector space. However, it turns out that for general norms such a limitation does not exist, as shown below.
Proposition 2. If R = R_1 in Eq. 8, then the analogue of Proposition 1 does not hold. It also does not hold for the ℓ∞-norm and the ℓ0-"norm."

Proof. See App. A.3 in the supplemental material.
This result suggests that, for regularizers other than R_2, we may eventually obtain better classifiers by increasing K past L. As such, in the next section, we experiment with R ∈ {R_2, R_1, R_ns}, where R_ns is the (non-convex) noise-contrastive regularizer of Eq. 3.

Experiments
We report results on two experiments: one on cross-lingual classification on the Reuters RCV1/RCV2 dataset, and another on multi-label classification with multilingual embeddings on the TED Corpus [5].

Reuters RCV1/RCV2
We evaluate our framework on the cross-lingual document classification task introduced by Klementiev et al. (2012). Following prior work, our dataset D_u consists of 500,000 parallel sentences from the Europarl v7 English-German corpus (Koehn, 2005), and our labeled dataset D_l consists of English and German documents from the RCV1/RCV2 corpora (Lewis et al., 2004), each categorized with one out of L = 4 labels. We used the same split as Klementiev et al. (2012): 1,000 documents for training, of which 200 are held out as validation data, and 5,000 for testing.

[4] For regression tasks (such as the one presented in the last paragraph of §3), instead of the "number of labels," L should be regarded as the number of output variables to regress.
[5] Our code is available at https://github.com/dcferreira/multilingual-joint-embeddings.
Note that, in this dataset, we are classifying documents based on their bag-of-words representations, while learning word embeddings by pushing the bag-of-words representations of parallel sentences close together. In this sense, we are bringing together multiple levels of representation (document, sentence, and word).
We experimented with the joint formulation in Eq. 8, with L := L_LL and R ∈ {R_2, R_1, R_ns}. We optimized with AdaGrad (Duchi et al., 2011) with a stepsize of 1.0, using mini-batches of 100 Reuters RCV1/RCV2 documents and 50,000 Europarl v7 parallel sentences. We found no need to run more than 100 iterations, with most of our runs converging in under 50. Our vocabulary has 69,714 and 175,650 words for English and German, respectively, when training on the English portion of the Reuters RCV1/RCV2 corpus, and 61,120 and 183,888 words for English and German when training on the German portion of the corpus. This difference is due to the inclusion of the words in the training data into the vocabulary. We do not remove any words from the vocabulary, for simplicity. We used the validation set to tune the hyperparameters {µ, µ_S, µ_T} and to choose the iteration number. When using K = L, we chose V = I_L; otherwise, we chose V randomly, sampling its entries from a Gaussian N(0, 0.1).

Table 1 shows the results. We include for comparison the most competitive systems published to date. The first thing to note is that our joint system with Euclidean co-regularization performs very well for this task, despite the theoretical limitations shown in §4. Although its embedding size is only K = 4 (one dimension per label), it outperformed all the two-stage systems trained on the same data, in both directions.
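The optimizer used above is AdaGrad; the diagonal update it performs can be sketched generically as follows (a textbook version for illustration, not our exact implementation):

```python
import numpy as np

def adagrad_step(param, grad, accum, stepsize=1.0, eps=1e-8):
    """One diagonal AdaGrad update (Duchi et al., 2011): accumulate squared
    gradients and scale each coordinate's step by their running root."""
    accum = accum + grad ** 2
    param = param - stepsize * grad / (np.sqrt(accum) + eps)
    return param, accum

# Smoke test on f(w) = ||w||^2 / 2, whose gradient is w: iterates shrink to 0.
w = np.array([5.0, -3.0])
acc = np.zeros_like(w)
for _ in range(200):
    w, acc = adagrad_step(w, w, acc)
assert np.all(np.abs(w) < 1e-2)
```

The per-coordinate scaling is what lets a single stepsize of 1.0 work across parameters with very different gradient magnitudes, such as frequent versus rare words.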
For the EN→DE direction, our joint system with ℓ1 co-regularization achieved state-of-the-art results (92.7%), matching two-stage systems that use extra monolingual data, negative sampling, or non-additive document representations. It is conceivable that the better results of R_1 over R_2 stem from its greater robustness to differences in the parallel sentences.
The stronger DE→EN numbers of para_doc come at a price: it uses many more parameters, was trained on more parallel sentences, and requires more expensive (and non-deterministic) computation at test time to compute a document's embedding. Our method has the advantage of being simple and very fast to train: it took less than 1 minute to train the joint R_1 system for EN→DE, using a single core on an Intel Xeon @2.5 GHz. This can be compared with Klementiev et al. (2012), who took 10 days on a single core, or Coulmance et al. (2015), who took 10 minutes with 6 cores. Although our theoretical results suggest that increasing K when using the ℓ1 norm may increase the expressiveness of our embeddings, our results do not support this claim (the improvements in DE→EN from K = 4 to K = 40 were tiny). However, increasing K led to a gain of 2.5 points when using negative sampling. For K = 40, this system is much more accurate than that of Hermann and Blunsom (2014), which confirms that learning the embeddings together with the task is highly beneficial.

TED Corpus
To assess the ability of our framework to handle multiple target languages, we ran a second set of experiments on the TED corpus (Cettolo et al., 2012), using the training and test partitions created by Hermann and Blunsom (2014), downloaded from http://www.clg.ox.ac.uk/tedcorpus. The corpus contains English transcriptions and multilingual, sentence-aligned translations of talks from the TED conference in 12 different languages, with 12,078 parallel documents in the training partition (totalling 1,641,985 parallel sentences). Following that work, we used this corpus both as parallel data (D_u) and as the task dataset (D_l). There are L = 15 labels, and documents can have multiple labels.
We experimented with two different strategies:

• A one-stage system (Joint), which jointly trains the multilingual embeddings and the multi-label classifier (as in §5.1). To cope with multiple target languages, we used a sum of pairwise co-regularizers, as described in Eq. 9. For classification, we use multinomial logistic regression, selecting the labels with posterior probability above 0.18 (tuned on validation data).
• A two-stage approach (Joint w/ Aux), where we first obtain multilingual embeddings by applying our framework with an auxiliary task with pre-trained English embeddings (as described in Eq. 10 and in the last paragraph of §3), and then use the resulting multilingual representations to train the multi-label classifier. We address this multi-label classification problem with independent binary logistic regressors (one per label), trained by running 100 iterations of L-BFGS (Liu and Nocedal, 1989). At test time, we select those labels whose posterior probability is above 0.5.
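The per-label decision rule of the binary-regressor strategy is a simple sigmoid threshold. A sketch with toy weights (names and values illustrative):

```python
import numpy as np

def predict_labels(W, x, threshold=0.5):
    """Independent binary logistic regressors, one weight column per label:
    return every label whose posterior sigma(w_l^T x) exceeds the threshold."""
    probs = 1.0 / (1.0 + np.exp(-(W.T @ x)))
    return [l for l, p in enumerate(probs) if p > threshold]

W = np.array([[ 2.0, -2.0,  0.0],
              [ 1.0,  0.0, -1.0]])     # 2 features, 3 labels (toy)
x = np.array([1.0, 1.0])

assert predict_labels(W, x) == [0]     # label scores: 3.0, -2.0, -1.0
```

Unlike the multinomial model of the one-stage system, each label fires independently, so a document can receive any subset of the 15 labels, including none.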
For the Joint w/ Aux strategy, we used the 300-dimensional GloVe-840B vectors (Pennington et al., 2014), downloaded from http://nlp.stanford.edu/projects/glove/.

Table 2 shows the results for cross-lingual classification, where we use English as source and each of the other 11 languages as target. We compare our two strategies above with the strong machine translation (MT) baseline used by Hermann and Blunsom (2014) (which translates the input documents to English with a state-of-the-art MT system) and with their two strongest systems, which build document-level representations from embeddings trained bilingually or multilingually (called DOC/ADD single and DOC/ADD joint, respectively). Overall, our Joint system with ℓ2 regularization outperforms both of Hermann and Blunsom (2014)'s systems (but not the MT baseline) for 8 out of 11 languages, performing generally better than our ℓ1-regularized system. However, the clear winner is our ℓ2-regularized Joint w/ Aux system, which wins over all systems (including the MT baseline) by a substantial margin, for all languages. This shows that pre-trained source embeddings can be extremely helpful in bootstrapping multilingual ones. On the other hand, the performance of the Joint w/ Aux system with ℓ1 regularization is rather disappointing. Note that the limitations of R_2 shown in §4 are not a concern here, since the auxiliary task has L = 300 dimensions (the dimensionality of the pre-trained embeddings). A small sample of the multilingual embeddings produced by the winning system is shown in Table 4.
Finally, we did a last experiment in which we use our multilingual embeddings obtained with Joint w/ Aux to train monolingual systems for each language. This time, we compare with a bag-of-words naïve Bayes system reported by Hermann and Blunsom (2014); results are shown in Table 3. We observe that, with the exception of Turkish, our systems consistently outperform all the competitors. Comparing the bottom two rows of Tables 2 and 3, we also observe that, for the ℓ2-regularized system, there is not much degradation caused by cross-lingual training versus training on the target language directly (in fact, for Spanish, Polish, and Brazilian Portuguese, the former scores are even higher). This suggests that the multilingual embeddings are of high quality.

Conclusions
We proposed a new formulation which jointly minimizes a combination of a supervised loss function with a multilingual co-regularization term using unlabeled parallel data. This allows learning task-specific multilingual embeddings together with a classifier for the task. Our method achieved state-of-the-art accuracy on the Reuters RCV1/RCV2 cross-lingual classification task in the English to German direction, while being extremely simple and computationally efficient. Our results in the Reuters RCV1/RCV2 task, obtained using Europarl v7 as parallel data, show that our method has no trouble handling different levels of representation simultaneously (document, sentence, and word). On the TED Corpus, we obtained the highest reported scores for 10 out of 11 languages, using an auxiliary task with pre-trained English embeddings.

Table 2: Cross-lingual experiments on the TED Corpus using English as a source language. Reported are the micro-averaged F1 scores for a machine translation baseline and the two strongest systems of Hermann and Blunsom (2014), our one-stage joint system (Joint), and our two-stage system that trains the multilingual embeddings jointly with the auxiliary task of fitting pre-trained English embeddings (Joint w/ Aux), with both ℓ1 and ℓ2 regularization. Bold indicates the best result for each target language.