Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification

In many languages, the sparse availability of resources poses numerous challenges for textual analysis tasks. Text classification is one such standard task, hindered by the limited availability of label information in low-resource languages. Transferring knowledge (i.e. label information) from high-resource to low-resource languages might improve text classification compared to other approaches such as machine translation. We introduce BRAVE (Bilingual paRAgraph VEctors), a model that learns bilingual distributed representations (i.e. embeddings) of words without word alignments, either from sentence-aligned parallel or label-aligned non-parallel document corpora, to support cross-language text classification. Empirical analysis shows that classification models trained with our bilingual embeddings outperform other state-of-the-art systems on three different cross-language text classification tasks.


Introduction
The availability of language-specific annotated resources is crucial for the efficiency of natural language processing tasks. Still, many languages lack rich annotated resources that support tasks such as part-of-speech tagging, dependency parsing and text classification. While the growth of multilingual information on the web has provided an opportunity to build these missing annotated resources, considerable manual effort is still required to achieve high-quality resources for every language separately.
Another possibility is to utilize the unlabeled data present in those languages or to transfer knowledge from annotation-rich languages. For the first alternative, recent advances in learning monolingual distributed representations of words (Mikolov et al., 2013a; Pennington et al., 2014; Levy and Goldberg, 2014) (i.e. monolingual word embeddings) that capture syntactic and semantic information in an unsupervised manner have been useful in numerous NLP tasks (Collobert et al., 2011). However, this may not be sufficient for several other tasks such as cross-language information retrieval (Grefenstette, 2012), cross-language word semantic similarity (Vulić and Moens, 2014), cross-language text classification (CLTC, henceforth) (Klementiev et al., 2012; Xiao and Guo, 2013; Prettenhofer and Stein, 2010; Tang and Wan, 2014) and machine translation (Zhao et al., 2015), due to irregularities across languages. In these kinds of scenarios, transfer of knowledge can be useful.
Several approaches (Sarath Chandar et al., 2014; Coulmance et al., 2015) tried to induce monolingual distributed representations into a language-independent space (i.e. bilingual or multilingual word embeddings) by jointly training on a pair of languages. Although the overall goal of these approaches is to capture linguistic regularities in words that share the same semantic and syntactic space across languages, they differ in their implementation. One set of methods either performs offline alignment of trained monolingual embeddings or jointly trains both monolingual and cross-lingual objectives, while the other set uses only a cross-lingual objective. Jointly trained or offline alignment methods can be further divided based on the type of parallel corpus (e.g. word-aligned, sentence-aligned) they use for learning the cross-lingual objective. Table 1 summarizes different setups to learn bilingual or multilingual embeddings for the various tasks. Methods in Table 1 that use a word-aligned parallel corpus for offline alignment (Mikolov et al., 2013b; Faruqui and Dyer, 2014) assume a single correspondence between the words across languages and ignore polysemy.

Table 1: Cross-language setups for learning bilingual or multilingual embeddings.
Objective | Method | Tasks | Parallel Corpus
Monolingual + Cross-lingual | (Klementiev et al., 2012) | CLDC | Word-Aligned
Monolingual + Cross-lingual | (Zou et al., 2013) | MT, NER | Word-Aligned
Monolingual + Cross-lingual | (Mikolov et al., 2013b) | MT | Word-Aligned
Monolingual + Cross-lingual | (Faruqui and Dyer, 2014) | Word Similarity | Word-Aligned
Monolingual + Cross-lingual | (Lu et al., 2015) | Word Similarity | Word-Aligned
Monolingual + Cross-lingual | (Gouws and Søgaard, 2015) | POS, SuS | Word-Aligned
Monolingual + Cross-lingual | [citation missing] | CLDC, MT | Sentence-Aligned
Monolingual + Cross-lingual | (Coulmance et al., 2015) | CLDC, MT | Sentence-Aligned
Cross-lingual | [citation missing] | CLDC | Sentence-Aligned
Cross-lingual | (Sarath Chandar et al., 2014) | CLDC | Sentence-Aligned
Cross-lingual | [citation missing] | Word Similarity, CLDC | Sentence-Aligned
Cross-lingual | [citation missing] | CLDC | Sentence-Aligned
Jointly trained methods (Klementiev et al., 2012) that use a word-aligned parallel corpus and consider polysemy perform the computationally expensive operation of considering all possible interactions between pairs of words in the vocabularies of two different languages. Methods (Sarath Chandar et al., 2014) that overcame the complexity issues of word-aligned models by using sentence-aligned parallel corpora limit themselves to only a cross-lingual objective, which makes them unable to exploit monolingual corpora. Jointly trained models (Coulmance et al., 2015) overcame the issues of both word-aligned and purely cross-lingual objective models by using monolingual and sentence-aligned parallel corpora. Nonetheless, these approaches still have certain drawbacks, such as using only a bag-of-words from the parallel sentences and ignoring word order. They therefore fail to capture the non-compositional meaning of the entire sentence. Also, the learned bilingual embeddings are heavily biased towards the sampled sentence-aligned parallel corpora. It is also sometimes hard to acquire sentence-level parallel corpora for every language pair. To address this concern, a few approaches used pivot languages like English (Rajendran et al., 2015) or comparable document-aligned corpora (Vulić and Moens, 2015) to learn bilingual embeddings specific to only one task.
This major downside can also be observed in the other aforementioned methods, which are inflexible in handling different types of parallel corpora and tightly bind the cross-lingual objective to the parallel corpus. For example, a method using sentence-level parallel corpora cannot be altered to leverage document-level parallel corpora (if available) that might perform better for some tasks. Also, none of the approaches leverages widely available label/class-aligned non-parallel documents (e.g. sentiment labels, multi-class datasets) across languages, which, as opposed to parallel texts, share special semantics such as sentiment or correlation between concepts.
In this paper, we introduce BRAVE, a jointly trained flexible model that learns bilingual embeddings based on the type of corpora available (e.g. sentence-aligned parallel or label/class-aligned non-parallel documents) by just altering the cross-lingual objective. BRAVE leverages paragraph vector embeddings (Le and Mikolov, 2014) of the monolingual corpora to effectively capture the semantics of text sequences across languages and to build a cross-lingual objective. The method most closely related to our approach uses a shared context sentence vector across languages to learn multilingual text sequences.
The main contributions of this paper are:
• We jointly train the monolingual part of parallel corpora with an improved cross-lingual alignment function that extends beyond bag-of-words models.
• We introduce a novel approach to leverage non-parallel data sets, such as label- or class-aligned documents in different languages, for learning bilingual cues.
• We provide an experimental evaluation on three different CLTC tasks, namely cross-language document classification, multi-label classification and cross-language sentiment classification, using the learned bilingual word embeddings.

Related Work
Most of the related work can be associated with approaches that aim to learn latent topics across languages, or distributed representations of words and larger pieces of text, to support various cross-lingual tasks.

Cross-Language Latent Topics
Various approaches have been proposed to identify latent topics in monolingual (Blei, 2012; Rus et al., 2013) and multilingual (Mimno et al., 2009; Fukumasu et al., 2012) scenarios for cross-language semantic word similarity and document comparison. Extraction of cross-language latent topics or concepts uses context-insensitive (Zhang et al., 2010) and context-sensitive methods (Vulić and Moens, 2014) to build word co-occurrence statistics for document representations.

BRAVE Model
In this section, we present the BRAVE model along with its variations whose aim is to learn bilingual embeddings that can generalize across different languages.

Bilingual Paragraph Vectors (BRAVE)
Most NLP tasks require fixed-length vectors. Tasks like CLTC also require fixed-length vectors that incorporate the inherent semantics of sentences or documents. Distributed representations of sentences and documents, i.e. paragraph vectors (Le and Mikolov, 2014), are designed to perform better on certain text classification tasks by overcoming the constraints posed by bag-of-words models.
Here, we leverage the paragraph vector distributed memory model (PV-DM) as the monolingual objective M(·) and jointly optimize it with a bilingual regularization function ϕ(·) for learning bilingual embeddings, similar to earlier approaches (Coulmance et al., 2015). Equation 1 shows the formulation of the overall objective function that is minimized.
Here, $C^l$ represents the corpus of an individual language (i.e. $l_1$ or $l_2$). Given any sequence of words $(w^l_1, w^l_2, \ldots, w^l_T)$ in $C^l$, $w_t$ is the word predicted in a context $h$ constrained on the paragraph $p$ (i.e. sentence or document) and the sequence of words.
Formally, the first term (i.e. M(·)) in Equation 1 maximizes the average log probability based on the word vector matrix $W^l$ and a unique paragraph vector matrix $P^l$. Equation 2 represents the average log probability,
where each $y^l_i$ is the log-probability of predicted word $i$ and is given by Equation 3.
For efficiency, hierarchical softmax (Mnih and Hinton, 2009) is used in training, with U and b as parameters. A binary Huffman tree is used to represent the hierarchical softmax (Mikolov et al., 2013a). As in earlier work, we derive h by concatenating the paragraph vector from $P^l$ with the average of the word vectors in $W^l$. This helps to fine-tune word and paragraph vectors independently. To capture the bilingual cues, the regularization function ϕ(·) is learned in two different ways. The first approach uses a sentence-aligned parallel corpus, while the second uses a label-aligned document corpus.
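The construction of the context vector h and the scores it feeds can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and a full softmax is swapped in for the hierarchical softmax used in the paper so that the example stays short.

```python
import math

def context_vector(paragraph_vec, word_vecs):
    """Concatenate the paragraph vector with the average of the context word vectors."""
    dim = len(word_vecs[0])
    avg = [sum(v[d] for v in word_vecs) / len(word_vecs) for d in range(dim)]
    return list(paragraph_vec) + avg

def log_probabilities(h, U, b):
    """Score every vocabulary word as y = U h + b, then normalise.

    Assumption: a full softmax stands in for hierarchical softmax here.
    """
    y = [sum(u * x for u, x in zip(row, h)) + b_i for row, b_i in zip(U, b)]
    m = max(y)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in y))
    return [v - log_z for v in y]

# Toy example: a 4-dimensional paragraph vector and two 2-dimensional context words.
p = [0.1, -0.2, 0.3, 0.0]
ctx = [[0.5, 0.1], [-0.1, 0.3]]
h = context_vector(p, ctx)            # length 4 + 2 = 6
U = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]  # one row per word of a 3-word vocabulary
b = [0.0, 0.1, -0.1]
log_p = log_probabilities(h, U, b)
```

Because the paragraph part and the word part of h occupy separate coordinates, gradients reach the two kinds of vectors through separate columns of U.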

BRAVE with Sentence-Aligned Parallel corpora (BRAVE-S)
To compute the bilingual regularization function ϕ(·), we deviate slightly from earlier approaches. Instead of simply computing the L2-loss between the means of the word vectors in each sentence pair ($s^{l_1}_j$, $s^{l_2}_j$) of the sentence-aligned parallel corpus (PC) at each training step, we use the concept of elastic net regularization (Zou and Hastie, 2005) and employ a linear combination of the L2-loss between the sentence paragraph vectors $SP^{l_1}_j$ and $SP^{l_2}_j \in R^d$, precomputed from the monolingual term M(·), and the L2-loss between the means of the word vectors observed in the sentences. This constrains M(·) to be learned from the monolingual part of the parallel training data. At the same time, it has the advantage of using a combination of paragraph and word vectors, which combines the compositional and non-compositional meanings of sentences.
It also eliminates the need for word alignment and assumes that each word observed in the sentence of language $l_1$ can potentially find its alignment in the sentence of language $l_2$. In theory, a low value of ϕ(·) ensures that similar words across languages are embedded close to each other. Equation 4 shows the regularization term,
where $W^{l_1}_i$ and $W^{l_2}_k$ represent the word embeddings obtained for the words $w_i$ and $w_k$ in each sentence ($s_j$) of length $m$ and $n$ in languages $l_1$ and $l_2$ respectively.
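The regularization term described above can be sketched for a single sentence pair as follows. This is an illustrative reading of the text, not the paper's exact Equation 4: the function names are hypothetical, and the placement of the weight alpha on the paragraph term (with 1 - alpha on the word term) is an assumption based on the statement that alpha gives more weight to paragraph vectors.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brave_s_regularizer(sp1, sp2, words1, words2, alpha=0.6):
    """Combine the paragraph-vector loss and the mean-word-vector loss.

    sp1, sp2       : sentence paragraph vectors of the aligned pair
    words1, words2 : word vectors of the m and n words in each sentence
    alpha          : assumed mixing weight for the paragraph term
    """
    paragraph_term = squared_l2(sp1, sp2)
    word_term = squared_l2(mean_vector(words1), mean_vector(words2))
    return alpha * paragraph_term + (1.0 - alpha) * word_term

# Toy sentence pair with 2-dimensional vectors.
phi = brave_s_regularizer(
    sp1=[1.0, 0.0], sp2=[0.0, 0.0],
    words1=[[1.0, 1.0], [3.0, 1.0]],   # m = 2 words
    words2=[[2.0, 0.0]],               # n = 1 word
)
```

Since only sentence-level means are compared, the two sentences may have different lengths and no word alignment is needed.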

BRAVE with Non-Parallel Document Corpora (BRAVE-D)
Sentence-aligned parallel corpora are hard to acquire for many languages. Non-parallel corpora such as topic-aligned (e.g. Wikipedia) or label/class-aligned document corpora (e.g. sentiment analysis and multi-class classification data sets) in different languages can instead be leveraged to learn bilingual embeddings for performing CLTC. Earlier approaches like CL-LSI (Dumais et al., 1997) and CL-KCCA (Vinokourov et al., 2003) were used to learn bilingual document spaces for tasks comparable to CLTC. Although these approaches provide decent results, they face serious scalability issues and are mostly limited to Wikipedia. Cross-lingual latent topic extraction models (Vulić and Moens, 2014) showed promising results for tasks like word-level or phrase-level translation, but have certain drawbacks for CLTC tasks.
Here, we propose a two-step approach to build bilingual embeddings with label/class-aligned document corpora.
• In the first step, we perform manifold alignment using Procrustes analysis (Wang and Mahadevan, 2008) between sets of documents belonging to the same class/label in different languages. This helps to identify the closest alignment of a document in language $l_1$ with a document in another language $l_2$.
• In the second step, we use the pairs of partially aligned documents belonging to the same class or label in different languages to extract bilingual cues, similar to the approach described in § 3.2. The only difference is that the paragraph vector is learned for the entire document.
Step-1: Let $S^{l_1}$ and $S^{l_2}$ be the sets containing the training documents of languages $l_1$ and $l_2$ associated with a label or class. The following three-step procedure attains a partial alignment between the documents in these sets.
• Learning low-dimensional embeddings of the sets ($S^{l_1}$, $S^{l_2}$) is key for alignment. We use document paragraph vectors (Le and Mikolov, 2014) to learn low-dimensional embeddings of the documents in each language. Let $X^{l_1}$ and $X^{l_2}$ be the low-dimensional embeddings of $S^{l_1}$ and $S^{l_2}$ respectively.
• To find the optimal transformation, Procrustes superimposition is performed by translating, rotating and scaling the objects (i.e. the rows of $X^{l_2}$ are transformed to make them similar to the rows of $X^{l_1}$). The transformation is achieved by:
  - Translation: subtracting the mean of all members of a set (for $X^{l_2}$, the sum of its rows divided by $|S^{l_2}|$) so that the centroid lies at the origin.
  - Scaling and rotation: the rotation and scaling that maximize the alignment are given by an orthogonal matrix $Q$ and a scaling factor $k$. They are obtained by solving the orthogonal Procrustes problem (Schönemann, 1966) given by Equation 5, where $X^{l_2*}$ is the matrix of transformed $X^{l_2}$ values given by $kX^{l_2}Q$ and $||\cdot||_F$ is the Frobenius norm, constrained by $Q^TQ = I$.
• If $S^{l_2*}$ represents the new document set obtained after identifying the closest alignment among documents in $S^{l_1}$ and $S^{l_2}$ using the cosine similarity between $X^{l_1}$ and $X^{l_2*}$, then the partially aligned corpus {$S^{l_1}$, $S^{l_2*}$} contains a one-to-one correspondence between the documents of the two languages, which is used to learn bilingual cues in the second step.
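The orthogonal Procrustes problem referred to as Equation 5 can be written out from the definitions above. The following is a reconstruction under stated assumptions rather than the paper's typeset equation: it assumes that $X^{l_1}$ and $X^{l_2}$ have already been centred and have the same shape, and the closed-form solution shown is the standard one for this problem.

```latex
\begin{aligned}
(k^{\ast}, Q^{\ast}) &= \operatorname*{arg\,min}_{k,\; Q \,:\, Q^{T}Q = I}
    \left\lVert X^{l_1} - k\, X^{l_2} Q \right\rVert_F ,
    \qquad X^{l_2\ast} = k^{\ast} X^{l_2} Q^{\ast} \\[4pt]
(X^{l_2})^{T} X^{l_1} &= U \Sigma V^{T}
    \quad \text{(singular value decomposition)} \\[4pt]
Q^{\ast} &= U V^{T},
    \qquad k^{\ast} = \frac{\operatorname{tr}(\Sigma)}{\lVert X^{l_2} \rVert_F^{2}}
\end{aligned}
```

Expanding the squared norm shows why: the only term that depends on $Q$ is $-2k\,\operatorname{tr}(Q^{T}(X^{l_2})^{T}X^{l_1})$, which is maximised by $Q = UV^{T}$, and the optimal $k$ then follows by setting the derivative with respect to $k$ to zero.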
From the perturbation theory of spectral spaces (Kostrykin et al., 2003), it can be understood that the difference between the low-dimensional embedding subspaces (i.e. $X^{l_1}$ and $X^{l_2*}$) is always bounded, so the new alignment obtained between the document sets {$S^{l_1}$, $S^{l_2*}$} is insensitive to perturbations. This also means that Procrustes analysis provides the best possible document alignments.
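The final matching step of Step-1 can be sketched as follows. The text specifies cosine similarity and a one-to-one correspondence but not the assignment procedure, so the greedy pairing used here is an assumption, and the function names are hypothetical. The rows of `X2_star` are assumed to be already Procrustes-transformed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def greedy_alignment(X1, X2_star):
    """Pair documents one-to-one, most similar pairs first.

    Returns a dict mapping a row index of X1 to a row index of X2_star.
    """
    scored = sorted(
        ((cosine(a, b), i, j)
         for i, a in enumerate(X1)
         for j, b in enumerate(X2_star)),
        key=lambda t: -t[0],
    )
    used_i, used_j, pairs = set(), set(), {}
    for _sim, i, j in scored:
        if i not in used_i and j not in used_j:
            pairs[i] = j
            used_i.add(i)
            used_j.add(j)
    return pairs

# Toy embeddings for three documents per language within one class.
X1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X2_star = [[0.1, 0.9], [0.9, 1.1], [2.0, 0.1]]
pairs = greedy_alignment(X1, X2_star)
```

If the two sets differ in size, the surplus documents of the larger set simply stay unpaired, which is one way to read the term "partially aligned".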
Step-2: The document pairs ($d^{l_1}_j$, $d^{l_2}_j$) of the partially aligned corpus (PAC) are now used to compute the bilingual regularization function ϕ(·). At each training step, the L2-loss between the precomputed document paragraph vectors $DP^{l_1}_j$ and $DP^{l_2}_j \in R^d$, obtained from the monolingual term M(·), is combined with the L2-loss between the word vectors weighted by the probability of their occurrence in a particular label/class of the entire PAC. Taking word probabilities into account helps to induce label/class-specific information. Equation 6 provides the regularization term,
where $w_i$, $w_k$ are words with embeddings $W^{l_1}_i$, $W^{l_2}_k$ observed in each document ($d_j$) of length $m$ and $n$ in languages $l_1$ and $l_2$ respectively, and $p_{w_i}$ and $q_{w_k}$ represent the probabilities of occurrence of the words $w_i$ and $w_k$ in a specific label/class of the entire PAC. Figure 1 shows the overall goal of both approaches.
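The probability-weighted term of Step-2 can be sketched as follows. Equation 6 is not reproduced in this text, so several details are assumptions and are labelled as such: word probabilities are estimated as relative frequencies within one class, the weighted word vectors are averaged over the document length, and alpha weights the paragraph term. All function names are hypothetical.

```python
from collections import Counter

def class_word_probabilities(documents):
    """Relative frequency of each word over all documents of one label/class."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def weighted_mean_vector(doc, embeddings, probs):
    """Average of the word vectors in a document, each scaled by its class probability."""
    dim = len(next(iter(embeddings.values())))
    acc = [0.0] * dim
    for word in doc:
        for d in range(dim):
            acc[d] += probs[word] * embeddings[word][d]
    return [x / len(doc) for x in acc]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brave_d_regularizer(dp1, dp2, doc1, doc2, emb1, emb2, p, q, alpha=0.6):
    """Paragraph-vector loss plus probability-weighted word-vector loss (assumed form)."""
    word_term = squared_l2(weighted_mean_vector(doc1, emb1, p),
                           weighted_mean_vector(doc2, emb2, q))
    return alpha * squared_l2(dp1, dp2) + (1.0 - alpha) * word_term

# Toy "positive" class: two English documents and one German document.
p = class_word_probabilities([["good", "film"], ["good", "book"]])
q = class_word_probabilities([["gut", "film"]])
emb_en = {"good": [1.0, 0.0], "film": [0.0, 1.0], "book": [1.0, 1.0]}
emb_de = {"gut": [1.0, 0.0], "film": [0.0, 1.0]}
phi = brave_d_regularizer([0.0, 0.0], [0.0, 0.0],
                          ["good", "film"], ["gut", "film"],
                          emb_en, emb_de, p, q)
```

Words that are frequent within a class, such as "good" in the toy positive class, contribute more to the loss, which is how label-specific information enters the embeddings.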

Experiments
In this section, we report results on three different CLTC tasks to assess whether our learned bilingual embeddings are semantically useful across languages. The first is the cross-language document classification (CLDC) task proposed by Klementiev et al. (2012), using a subset of the Reuters RCV1/RCV2 corpora (Lewis et al., 2004). The second is a multi-label CLDC task with more languages, using the TED corpus. The third is the cross-language sentiment classification (CLSC) task proposed by Prettenhofer and Stein (2010) on a multi-domain sentiment dataset.

Parallel and Non-Parallel Corpora
For sentence-aligned parallel corpora, Europarl-v7 (EP; footnote 2) is used as both monolingual and parallel training data. For label-aligned non-parallel document corpora, only the training and testing collections of the cross-language multi-domain Amazon product reviews (CL-APR) corpus (Prettenhofer and Stein, 2010) with sentiment labels are used.

Implementation
Our implementation launches monolingual paragraph vector (Le and Mikolov, 2014) threads for each language along with a bilingual regularization thread. Word and paragraph embedding matrices are initialized from a normal distribution ($\mu = 0$ and $\sigma^2 = 0.1$) for each language, and all threads access them asynchronously. Following the suggested combination (P = 5*W) of paragraph and word embedding sizes, we chose paragraph embeddings with dimensionality of 200 and 640 when the word embeddings have 40 and 128 dimensions respectively. Asynchronous stochastic gradient descent (ASGD) is used to update the parameters (i.e. $P^l$, $W^l$, U and b) and train the model.
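The initialization and dimensionality choices above can be sketched as follows. This is a single-threaded illustration only: the helper names, the vocabulary size and the number of paragraphs are hypothetical, and the asynchronous threads are not modelled.

```python
import math
import random

def paragraph_dim(word_dim, ratio=5):
    """Paragraph embedding size under the P = 5 * W rule."""
    return ratio * word_dim

def init_matrix(rows, cols, variance=0.1, rng=random):
    """Matrix with entries drawn from a normal distribution with mean 0 and the given variance."""
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)              # fixed seed so the sketch is reproducible
word_dim = 40
W = init_matrix(1000, word_dim, rng=rng)                 # hypothetical 1000-word vocabulary
P = init_matrix(200, paragraph_dim(word_dim), rng=rng)   # hypothetical 200 paragraphs
```

Note that the variance of 0.1 corresponds to a standard deviation of about 0.316, which is what `random.gauss` takes as its second argument.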
For each training pair in the parallel or non-parallel corpora, the monolingual threads first sample a context h with a window size of 8 from a random paragraph (i.e. sentence or document) in each language. The bilingual regularization thread and the monolingual threads then update the parameters asynchronously. The learning rate is set to 0.001 and decreases as the number of epochs increases, while α is chosen to be 0.6 (it can be fine-tuned based on empirical analysis) to give more weight to paragraph vectors. All models are trained for 50 epochs.
2 http://www.statmt.org/europarl/

Document Representation
Documents are represented with tf-idf weighted sum of embedding vectors of the words that are present in them.
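This document representation can be sketched as follows. The text does not state which tf-idf variant is used, so the length-normalised term frequency and the logarithmic inverse document frequency below are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Inverse document frequency log(N / df) for every word in a tokenised corpus."""
    n_docs = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}

def document_vector(doc, embeddings, idf):
    """tf-idf weighted sum of the embeddings of the words in a document."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for word, count in Counter(doc).items():
        if word not in embeddings or word not in idf:
            continue  # skip out-of-vocabulary words
        weight = (count / len(doc)) * idf[word]
        for d in range(dim):
            vec[d] += weight * embeddings[word][d]
    return vec

# Toy corpus with 2-dimensional embeddings.
corpus = [["stocks", "rise"], ["stocks", "fall"]]
embeddings = {"stocks": [1.0, 1.0], "rise": [1.0, 0.0], "fall": [0.0, 1.0]}
idf = idf_table(corpus)
vec = document_vector(corpus[0], embeddings, idf)
```

In the toy corpus, "stocks" occurs in every document and therefore receives zero weight, so the document vector is driven by the more discriminative word "rise".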

Results
The experimental results for each of the CLTC tasks are presented separately.

Cross-language Document Classification (CLDC) - RCV1/RCV2
The goal of this task is to classify target-language documents using labeled examples from the source language. To achieve this, we used a subset of the Reuters RCV1/RCV2 corpora as the training and evaluation sets and replicated the experimental setting of Klementiev et al. (2012). From the English, German, French and Spanish collections of the dataset, only documents labeled with a single topic (i.e. CCAT, ECAT, GCAT or MCAT) are selected. For the classification experiments, 1000 labeled documents from the source language are selected to train a multi-class classifier using the averaged perceptron (Freund and Schapire, 1999; Collins, 2002), and 5000 documents are used as the testing data.
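A multi-class averaged perceptron of the kind used here can be sketched as follows. This is a minimal textbook version, not the classifier implementation used in the experiments; the function names, the number of epochs and the toy data are hypothetical.

```python
def predict(weights, x):
    """Return the index of the class whose weight vector scores x highest."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def train_averaged_perceptron(X, y, n_classes, epochs=10):
    """Train a multi-class perceptron and return the average of all weight states."""
    dim = len(X[0])
    w = [[0.0] * dim for _ in range(n_classes)]       # current weights
    total = [[0.0] * dim for _ in range(n_classes)]   # running sum of weights
    steps = 0
    for _ in range(epochs):
        for x, gold in zip(X, y):
            pred = predict(w, x)
            if pred != gold:
                # Promote the gold class and demote the wrongly predicted class.
                for d in range(dim):
                    w[gold][d] += x[d]
                    w[pred][d] -= x[d]
            for c in range(n_classes):
                for d in range(dim):
                    total[c][d] += w[c][d]
            steps += 1
    return [[v / steps for v in row] for row in total]

# Toy cross-language setting: train on "source" document vectors, test on "target" ones.
X_source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y_source = [0, 0, 1, 1]
X_target = [[0.8, 0.2], [0.2, 0.8]]
weights = train_averaged_perceptron(X_source, y_source, n_classes=2)
predictions = [predict(weights, x) for x in X_target]
```

The classifier never sees target-language labels. It transfers only because source and target documents are represented in the same bilingual embedding space.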
The English-German, English-French and English-Spanish portions of the EP corpora (i.e. each with around 1.9M sentence pairs) are used both as monolingual and parallel training data with the BRAVE-S approach to build vocabularies of around 85k English, 144k German, 119k French and 118k Spanish words. The training and testing collections belonging to all domains in the English-German and English-French portions of CL-APR (i.e. around 12,000 document pairs) are used both as monolingual and partially aligned data with the BRAVE-D approach to build vocabularies of around 21k English, 22k German and 18k French words. Further, documents in the training and testing data of the RCV1/RCV2 corpora are represented as described in § 4.3 with the vocabulary built. Table 2 shows the comparison of our approaches with the existing systems.

[Table 2: comparison with existing systems. Only fragments of the table survive in this text: a row with dimensionality 40 and scores 86.5 and 75, a row for UnsupAlign with dimensionality 40 and scores 87.6 and 77.8, and a row for Trans-gram (Coulmance et al., 2015) with dimensionality 40.]

Multi-label CLDC -TED Corpus
To understand the applicability of our approaches to a wider range of languages (footnote 3) and class labels, we perform experiments with a subset of the TED corpus. The aim of this task is the same as in § 4.4.1, but the experiments are conducted with a larger variety of languages and class labels. The TED corpus contains English transcriptions and their sentence-aligned translations for 12 languages from the TED conference. The entire corpus is further classified into 15 topics (i.e. class labels) based on the most frequent keywords appearing in them.
To conduct our experiments, we follow the single mode setting (i.e. embeddings are learned only from a single language pair). The entire language pair (i.e. en→L2) training data of the TED corpus is used both as monolingual and parallel training data to learn bilingual word embeddings with dimensionality of 128 using the BRAVE-S approach. Bilingual word embeddings of 128 dimensions learned with EP and CL-APR are also used for comparison. Documents in the training and testing data of the TED corpus are represented as described in § 4.3 using each of these embeddings. A multi-class classifier using the averaged perceptron is built from the training documents in the source language and applied to the target-language testing data to predict the class labels. Table 3 shows the cumulative F1-scores.
3 Our goal is not to evaluate shared multilingual semantic representation.

Cross-language Sentiment Classification (CLSC)
The objective of the third CLTC task is to identify the sentiment polarity (e.g. positive or negative) of data in the target language by exploiting labeled data in the source language. We chose a subset of the publicly available Amazon product reviews (CL-APR) dataset (Prettenhofer and Stein, 2010), covering the English (E), German (G) and French (F) languages and three different product categories (books (B), dvds (D) and music (M)), to conduct our experiments. For each language-category pair, the corpus consists of training and testing sets comprising 1000 positive and 1000 negative reviews each, with additional unlabeled reviews varying in number from 9,000 to 170,000.
We constructed 12 different CLSC tasks using the different languages (i.e. E, G and F) for the three categories (i.e. B, D and M). For example, EFM refers to the task with English music reviews as the source and French music reviews as the target. Bilingual word embeddings with dimensionality of 128 learned with BRAVE-S and BRAVE-D are used to represent each review as described in § 4.3. For a fair comparison with earlier approaches, the sentiment classification model is then trained with libsvm default parameter settings using the source-language training reviews to classify the target-language test reviews. Table 4 shows the accuracy and standard deviation obtained over repeated experiments on randomly chosen subsets of the target-language testing documents.

[Table 4: CLSC accuracy and standard deviation. The surviving caption fragment lists the baselines CL-SSMC (Xiao and Guo, 2002), CL-SLF (Zhou et al., 2015a), CL-DCI 100 (Esuli and Fernandez, 2015) and BSE (Tang and Wan, 2014).]

Discussion
The results of the first CLTC task (i.e. CLDC), presented in Table 2, show that BRAVE-S was able to outperform most of the existing systems. The success of BRAVE-S can be attributed to its ability to incorporate both the non-compositional meaning observed in the entire sentence and the compositional meaning observed in the individual words. This distinguishes it from other models that use only bag-of-words or bi-grams. Similarly, the results of the second CLTC task (i.e. multi-label CLDC), presented in Table 3, show that BRAVE-S learned with the training data of the TED corpus outperformed the single mode DOC/* embedding models, BRAVE-S learned with EP, and BRAVE-D. BRAVE-S (TED) was able to capture linguistic regularities across languages that are more specific to the corpus than the general-purpose bilingual embeddings learned with EP. In some cases, however, none of our embedding models could outperform the machine translation baseline. This may be due to the asymmetry between languages induced by language-specific words that have no equivalents in English. It can also be seen from Table 2 and Table 3 that the BRAVE-D results are not as expected. Although BRAVE-D, like BRAVE-S, is a general approach that can capture both non-compositional and compositional meaning from larger pieces of text, the minimal overlap between the vocabulary learned with BRAVE-D from the cross-language sentiment label-aligned corpus and the other domains (i.e. Reuters and TED) produces unfavorable results. We therefore conclude that the choice of label/class-aligned corpora is crucial.
The results of the final CLTC task (i.e. CLSC), presented in Table 4, show that BRAVE-D outperforms the other baseline approaches in most cases. As BRAVE-D learns bilingual word embeddings using CL-APR, it was able to encompass sentiment label information inherently, like earlier approaches (Tang and Wan, 2014; Zhou et al., 2015b), and more effectively than the general-purpose embeddings learned using BRAVE-S with EP and similar approaches (Meng et al., 2012). This makes it more suitable for the sentiment classification task. Also, unlike CL-SSMC (Xiao and Guo, 2002) and CL-SLF (Zhou et al., 2015a), whose results show large variance depending on the parameter settings, BRAVE-D is not highly parameter dependent. To visualize the difference between the embeddings learned with BRAVE-S and BRAVE-D, we selected sentiment words and identified their cross-language nearest neighbors in Table 5. It can be observed that BRAVE-D was able to identify better sentiment (either positive or negative) word neighbors than BRAVE-S.

Conclusion and Future Work
In this paper, we presented an approach that leverages paragraph vectors to learn bilingual word embeddings with sentence-aligned parallel and label-aligned non-parallel corpora. Empirical analysis showed that embeddings learned from both of these types of corpora have a positive impact on CLTC tasks. In future work, we aim to extend the approach to learn multilingual semantic spaces with more labels/classes.