Learning Cross-lingual Word Embeddings via Matrix Co-factorization

A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing monolingual matrices. The cross-lingual constraints can be derived from parallel corpora, with or without word alignments. Empirical results on a task of cross-lingual document classification show that our method is effective at encoding cross-lingual knowledge as constraints for cross-lingual word embeddings.


Introduction
Word embeddings allow one to represent words in a continuous vector space, which characterizes the lexico-semantic relations among words. In many NLP tasks, they prove to be high-quality features; successful applications include language modelling (Bengio et al., 2003), sentiment analysis (Socher et al., 2011) and word sense discrimination (Huang et al., 2012).
Just as words have synonyms within a language, there are word pairs across languages that share similar semantic properties. Mikolov et al. (2013a) observed a strong similarity in the geometric arrangements of corresponding concepts between the vector spaces of different languages, and suggested that a cross-lingual mapping between the two vector spaces is technically plausible. Meanwhile, joint-space models for cross-lingual word embeddings are very desirable, as language-invariant semantic features can be generalized to make it easy to transfer models across languages. This is especially important for low-resource languages, where it allows one to develop accurate word representations of one language by exploiting the abundant textual resources in another language, e.g., English, which has a high resource density. Joint-space models are thus not only technically plausible, but also useful for cross-lingual model transfer. Further, studies have shown that using cross-lingual correlation can improve the quality of word representations trained solely with monolingual corpora (Faruqui and Dyer, 2014).
Defining a cross-lingual learning objective is at the core of the joint-space model. Hermann and Blunsom (2014) and Chandar A P et al. (2014) computed representations of parallel sentences (or documents) and minimized the differences between semantically equivalent pairs. These methods are useful in capturing semantic information carried by high-level units (such as phrases and beyond) and usually do not rely on word alignments. However, they suffer from reduced accuracy when representing rare tokens, whose semantic information may not be well generalized. In these cases, finer-grained information at the lexical level, such as aligned word pairs, dictionaries, and word translation probabilities, is considered helpful. Kočiský et al. (2014) integrated the word alignment process and word embedding learning in machine translation models. This method makes full use of parallel corpora and produces high-quality word alignments. However, it is unable to exploit the richer monolingual corpora. On the other hand, Zou et al. (2013) and Faruqui and Dyer (2014) learnt word embeddings of different languages in separate spaces with monolingual corpora and projected the embeddings into a joint space, but these methods can only capture linear transformations.
In this paper, we address the above challenges with a framework of matrix co-factorization. We simultaneously learn word embeddings in multiple languages via matrix factorization, with induced constraints to ensure cross-lingual semantic relations. The framework provides the flexibility of constructing learning objectives from separate monolingual and cross-lingual corpora. Intricate relations across languages, rather than simple linear projections, are automatically captured. Additionally, our method is efficient as it learns from global statistics. The cross-lingual constraints can be derived either with or without word alignments, provided that there is a valid measure of cross-lingual co-occurrences or similarities.
We test the performance in a task of cross-lingual document classification. Empirical results and a visualization of the joint semantic space demonstrate the validity of our model.

Framework
Without loss of generality, here we only consider bilingual embedding learning for two languages $l_1$ and $l_2$. Given monolingual corpora $D^{l_i}$ and sentence-aligned parallel data $D^{bi}$, our task is to find word embedding matrices $W^{l_i}$ of size $|V^{l_i}| \times d$, where each row corresponds to the embedding of a single word. We also define vocabularies of contexts $U^{l_i}$ and learn context embedding matrices $C^{l_i}$ of size $|U^{l_i}| \times d$ at the same time (in this paper, we let $U^{l_i} = V^{l_i}$). These matrices are obtained by simultaneous matrix factorization of the monolingual word-context PMI (point-wise mutual information) matrices $M^{l_i}$. During monolingual factorization, we impose a cross-lingual constraint (cost), ensuring cross-lingual semantic relations. We formalize the global loss function as

$$L_{total} = \sum_{i \in \{1, 2\}} \omega_i \cdot L^{l_i}_{mono} + \omega_c \cdot L_{cross} \tag{1}$$

where $L^{l_i}_{mono}$ and $L_{cross}$ are the monolingual and cross-lingual objectives respectively. $\omega_i$ and $\omega_c$ weigh the contribution of the different parts to the total objective. An overview of our algorithm is illustrated in Figure 1.

Monolingual Objective
Our monolingual objective follows the GloVe model (Pennington et al., 2014), which learns from global word co-occurrence statistics. For a word-context pair $(j, k)$ in language $l_i$, with contexts drawn from the same vocabulary as words ($U^{l_i} = V^{l_i}$), we try to minimize the difference between the dot product of the embeddings $w^{l_i}_j \cdot c^{l_i}_k$ and their PMI value

$$M^{l_i}_{jk} = \log \frac{X^{l_i}_{jk} \cdot \sum_{j', k'} X^{l_i}_{j'k'}}{\sum_{j'} X^{l_i}_{j'k} \cdot \sum_{k'} X^{l_i}_{jk'}},$$

where $X^{l_i}$ is the matrix of word-context co-occurrence counts. Following Pennington et al. (2014), we add separate bias terms $b^{l_i}_{w_j}$ and $b^{l_i}_{c_k}$ for each word and context to absorb the effect of any word-specific biases. We also add a matrix bias $b^{l_i}$ to ease the sharing of embeddings among matrices. The loss function is written as the sum of the weighted squared errors,

$$L^{l_i}_{mono} = \sum_{j, k} f(X^{l_i}_{jk}) \left( w^{l_i}_j \cdot c^{l_i}_k + b^{l_i}_{w_j} + b^{l_i}_{c_k} + b^{l_i} - M^{l_i}_{jk} \right)^2 \tag{2}$$

where we choose the same weighting function as the GloVe model to place less confidence on word-context pairs with rare occurrences,

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise.} \end{cases}$$

Notice that we only have to optimize over the entries with $X^{l_i}_{jk} \neq 0$, which can be done efficiently since the matrix of co-occurrence counts is usually sparse.
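The monolingual objective can be illustrated with a minimal sketch in plain Python. It assumes the standard PMI definition and the GloVe weighting function, stores the sparse counts as a dictionary keyed by (word, context), and uses hypothetical names (`pmi_matrix`, `glove_weight`, `mono_loss`) that do not come from the paper. The default `x_max=30` and `alpha=0.75` mirror the monolingual settings reported in the experiments.

```python
import math


def pmi_matrix(counts):
    """Compute PMI for every observed (word, context) pair.

    counts: dict mapping (word, context) -> positive co-occurrence count.
    Unobserved pairs are simply absent, which keeps the matrix sparse.
    """
    total = sum(counts.values())
    row_sum, col_sum = {}, {}
    for (j, k), x in counts.items():
        row_sum[j] = row_sum.get(j, 0.0) + x
        col_sum[k] = col_sum.get(k, 0.0) + x
    return {
        (j, k): math.log(x * total / (row_sum[j] * col_sum[k]))
        for (j, k), x in counts.items()
    }


def glove_weight(x, x_max=30.0, alpha=0.75):
    """GloVe weighting: down-weights pairs with small counts."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def mono_loss(counts, W, C, b_w, b_c, b_mat, x_max=30.0, alpha=0.75):
    """Weighted squared error between embedding dot products and PMI.

    W, C: dicts mapping word/context -> list of floats (embeddings).
    b_w, b_c: dicts of per-word and per-context biases; b_mat: matrix bias.
    Only observed (non-zero) pairs contribute to the sum.
    """
    pmi = pmi_matrix(counts)
    loss = 0.0
    for (j, k), x in counts.items():
        dot = sum(wj * ck for wj, ck in zip(W[j], C[k]))
        err = dot + b_w[j] + b_c[k] + b_mat - pmi[(j, k)]
        loss += glove_weight(x, x_max, alpha) * err * err
    return loss
```

Because the loop runs only over observed pairs, the cost of one pass is proportional to the number of non-zero counts rather than to the full vocabulary product.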

Cross-lingual Objectives
As the most important part of our model, the cross-lingual objective describes the cross-lingual word relations and sets constraints when we factorize monolingual co-occurrence matrices. It can be derived from either cross-lingual co-occurrences or similarities between cross-lingual word pairs.

Cross-lingual Contexts
The monolingual objective stems from the distributional hypothesis (Harris, 1954) and optimizes words in similar contexts into similar embeddings. It is natural to further extend this idea to define cross-lingual contexts, for which we have multiple choices.
A straightforward option is to count all the word co-occurrences in aligned sentence pairs, which is equivalent to the uniform word alignment model adopted by Gouws et al. (2015). For the sentence-aligned bilingual corpus $D^{bi} = \{(S^{l_1}, S^{l_2})\}$, where each $S^{l_i}$ is a monolingual sentence, we count the co-occurrences as

$$X^{bi}_{jk} = \sum_{(S^{l_1}, S^{l_2}) \in D^{bi}} \#(j, S^{l_1}) \times \#(k, S^{l_2}),$$

where $X^{bi}$ is the matrix of cross-lingual co-occurrence counts, and $\#(j, S)$ is a function counting the number of occurrences of $j$ in the sequence $S$. We then use a loss function similar to Equation 2, with the exception that we optimize for the dot products $w^{l_1}_j \cdot w^{l_2}_k$. This method works without word alignments and we denote it as CLC-WA (Cross-lingual context without word alignments).
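The sentence-level counting described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `count_clc_wa` and the representation of the corpus as a list of tokenized sentence pairs are assumptions.

```python
from collections import Counter


def count_clc_wa(parallel_sentences):
    """Count cross-lingual co-occurrences without word alignments.

    parallel_sentences: iterable of (tokens_l1, tokens_l2) pairs.
    For each aligned sentence pair, every word j in the first sentence
    co-occurs with every word k in the second; the contribution is
    (#occurrences of j) * (#occurrences of k).
    """
    x_bi = Counter()
    for s1, s2 in parallel_sentences:
        c1, c2 = Counter(s1), Counter(s2)
        for j, n_j in c1.items():
            for k, n_k in c2.items():
                x_bi[(j, k)] += n_j * n_k
    return x_bi
```

For a single pair, the counts sum to the product of the two sentence lengths, which is why long sentences dominate the statistics unless a weighting such as $f(x)$ is applied afterwards.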
We can also leverage word alignments and define CLC+WA (Cross-lingual context with word alignments). The idea is to count those words co-occurring with k as the context of j, where k ∈ V l 2 is the translationally equivalent word of j ∈ V l 1 . An example is shown in Figure 2. CLC+WA is expected to contain more precise information than CLC-WA, and we will compare the two definitions in the following experiments.
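One possible reading of CLC+WA is sketched below. The paper does not spell out the window size or the weighting, so the `window` parameter, the unweighted counts, the exclusion of the aligned word itself, and the alignment format (a list of position pairs) are all assumptions made for illustration.

```python
from collections import Counter


def count_clc_plus_wa(s1, s2, alignment, window=2):
    """Count cross-lingual contexts using word alignments.

    s1, s2: token lists of an aligned sentence pair.
    alignment: list of (p, q) pairs meaning s1[p] is aligned to s2[q].
    For each link, the words around s2[q] (within `window` positions,
    excluding s2[q] itself) are counted as contexts of s1[p].
    """
    counts = Counter()
    for p, q in alignment:
        j = s1[p]
        lo = max(0, q - window)
        hi = min(len(s2), q + window + 1)
        for r in range(lo, hi):
            if r != q:
                counts[(j, s2[r])] += 1
    return counts
```

Running the function in the opposite direction (swapping the sentences and flipping each alignment pair) would give the contexts of the second language's words in the first language.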
Once we have counted the co-occurrences, a naïve solution is to concatenate the bilingual vocabularies and perform matrix factorization on the resulting matrix as a whole. To allow additional flexibility, such as separate weighting, we instead divide the matrix into three parts and factorize them jointly. It is also more reasonable to calculate PMI values without mixing the monolingual and bilingual corpora.

Cross-lingual Similarities
An alternative way to set cross-lingual constraints is to minimize the distances between similar word pairs. Here the semantic similarity $sim(j, k)$ can be measured by equivalence in translation; in this paper, we use the translation probabilities produced by a machine translation system. Minimizing the distances of related words in the two languages, weighted by their similarities, gives us the cross-lingual objective

$$L_{cross} = \sum_{j \in V^{l_1}, k \in V^{l_2}} sim(j, k) \cdot distance(w^{l_1}_j, w^{l_2}_k),$$

where $w^{l_1}_j$ and $w^{l_2}_k$ are the embeddings of $j$ and $k$ in $l_1$ and $l_2$ respectively. In this paper, we choose the distance function to be the Euclidean distance, $distance(w^{l_1}_j, w^{l_2}_k) = \lVert w^{l_1}_j - w^{l_2}_k \rVert_2$. Notice that, similar to the monolingual objective, we only need to optimize over the pairs with $sim(j, k) \neq 0$, which is efficient as the matrix of translation probabilities or dictionary entries is sparse. We call this method CLSim.

[Figure 2: example of an aligned English-German sentence pair: "… we must do all we can, not just to …" / "… wir alles daran setzen müssen, nicht nur …"]
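The CLSim objective can be written down directly from its definition. The sketch below assumes the similarities are stored sparsely as a dictionary of translation probabilities and uses the plain Euclidean distance named in the text; the function name `clsim_loss` and the toy values in the usage note are hypothetical.

```python
import math


def clsim_loss(sim, W1, W2):
    """Similarity-weighted sum of Euclidean distances between word pairs.

    sim: dict mapping (word_l1, word_l2) -> similarity, e.g. a
         translation probability; pairs with zero similarity are skipped.
    W1, W2: dicts mapping word -> list of floats (embeddings) for each
            language, assumed to share the same dimensionality.
    """
    loss = 0.0
    for (j, k), s in sim.items():
        if s == 0.0:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(W1[j], W2[k])))
        loss += s * dist
    return loss
```

As a usage example, a single pair with similarity 0.5 whose embeddings are `[0.0, 0.0]` and `[3.0, 4.0]` contributes 0.5 × 5 = 2.5 to the loss.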

Experiments
To evaluate the quality of the relatedness between words in different languages, we use the task of cross-lingual document classification for the English-German language pair, where a classifier is trained in one language and later used to classify documents in another. We exactly replicated the experimental settings of Klementiev et al. (2012).

Data and Training
For optimizing the monolingual objectives, we used exactly the same subset of the RCV1/RCV2 corpora (Lewis et al., 2004) as Klementiev et al. (2012), which were sampled to balance the number of tokens between languages. Our preprocessing strategy followed Chandar A P et al. (2014) [...] word co-occurrences, we use a decreasing weighting function following Pennington et al. (2014), where word pairs that are d words apart contribute 1/d to the total count. We used a symmetric window size of 10 words for all our experiments.
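The distance-weighted counting can be sketched as below. This is a minimal illustration assuming tokenized sentences and a symmetric window; the function name `count_cooccurrences` is hypothetical, and details such as sentence boundaries and vocabulary filtering are not specified in the text.

```python
from collections import defaultdict


def count_cooccurrences(sentences, window=10):
    """Accumulate distance-weighted word-context counts.

    sentences: iterable of token lists.
    A pair of tokens that are d positions apart (1 <= d <= window)
    adds 1/d to the count in both directions, which makes the
    window symmetric.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d >= len(tokens):
                    break
                context = tokens[i + d]
                counts[(word, context)] += 1.0 / d
                counts[(context, word)] += 1.0 / d
    return counts
```

For the sentence `["a", "b", "c"]`, adjacent pairs receive a count of 1 and the pair `("a", "c")`, two words apart, receives 0.5.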
The cross-lingual constraints were derived using the English and German sections of the Europarl v7 parallel corpus (Koehn, 2005), which were similarly preprocessed. For CLC+WA and CLSim, we obtained word alignments and translation probabilities with SyMGIZA++ (Junczys-Dowmunt and Szał, 2012). We did not use Europarl for monolingual training.
The documents for classification were randomly selected by Klementiev et al. (2012) from those in RCV1/RCV2 that are assigned to exactly one of four topics: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). In each language, 1,000 documents were used as the training set and 5,000 as the test set, and we kept another 1,000 documents as a development set for hyperparameter tuning. Each document was represented as an idf-weighted average embedding of all its tokens, and a multi-class document classifier was trained for 10 epochs with an averaged perceptron algorithm, following Klementiev et al. (2012). A classifier trained with English documents is used to classify German documents and vice versa.
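The document representation can be sketched as follows. The exact idf formula and normalization are not given in the text, so this sketch assumes idf = log(N / df) and divides by the sum of the weights; the function names are hypothetical, and tokens without an embedding are skipped.

```python
import math


def idf_weights(documents):
    """Inverse document frequency, assumed here to be log(N / df)."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n_docs / d) for tok, d in df.items()}


def document_vector(doc, embeddings, idf, dim):
    """Idf-weighted average of the embeddings of a document's tokens.

    Tokens missing from `embeddings` or `idf` are ignored. If no token
    carries a positive weight, a zero vector is returned.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in doc:
        if tok in embeddings and tok in idf:
            w = idf[tok]
            for i in range(dim):
                total[i] += w * embeddings[tok][i]
            weight_sum += w
    if weight_sum == 0.0:
        return total
    return [v / weight_sum for v in total]
```

Because the embeddings of both languages live in one space, the same `document_vector` output can be fed to a classifier trained on the other language, which is what the transfer setting relies on.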
We trained our models using stochastic gradient descent. We ran 50 iterations for all of our experiments, and the dimensionality of the embeddings is 40. We set $x_{\max}$ to 100 for cross-lingual co-occurrences and 30 for monolingual ones, while $\alpha$ is fixed to 3/4. Other parameters were chosen according to the performance on the development set.
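To make the training procedure concrete, the sketch below performs one stochastic gradient step on a single weighted squared-error term of the form weight × (w · c + biases − target)². It uses plain SGD with a fixed learning rate, which is an assumption: the text does not state the learning-rate schedule, and the function name `sgd_step` and the bias dictionary layout are hypothetical.

```python
def sgd_step(w, c, b, target, weight, lr=0.05):
    """One SGD update for a single (word, context) pair.

    w, c: embedding vectors (lists of floats) of the word and context.
    b: dict with the three bias terms, keys 'w', 'c' and 'mat'.
    target: the value to fit (e.g. the PMI of the pair).
    weight: the confidence weight f(x) of the pair.
    Returns updated copies of w, c, b and the loss before the update.
    """
    dot = sum(wi * ci for wi, ci in zip(w, c))
    err = dot + b["w"] + b["c"] + b["mat"] - target
    g = 2.0 * weight * err  # derivative of the loss w.r.t. the prediction
    new_w = [wi - lr * g * ci for wi, ci in zip(w, c)]
    new_c = [ci - lr * g * wi for wi, ci in zip(w, c)]
    new_b = {name: value - lr * g for name, value in b.items()}
    return new_w, new_c, new_b, weight * err * err
```

In a full training loop, one iteration would visit every non-zero entry of the monolingual and cross-lingual matrices in random order and apply such a step, scaling each term by its objective weight.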

Results
We present the empirical results on the task of cross-lingual document classification in Table 1, where the performance of our models is compared with some baselines and previous work. The effect of weighting between parts of the total objective and the amount of training data on the quality of the embeddings is demonstrated in Figure 3.
The baseline systems are Majority class, where test documents are simply assigned to the class with the most training samples, and Machine translation, where a phrase-based machine translation system is used to translate test documents into the same language as the training documents.
Our method outperforms the previous work, and we observe improvements when we exploit word translation probabilities (CLSim) over the model without word-level information (CLC-WA). The best result is achieved with CLSim. It is interesting to notice that CLC+WA, which makes use of word alignments in defining cross-lingual contexts, does not provide better performance than CLC-WA. We conjecture that sentence-level co-occurrence is more suitable for capturing sentence-level semantic relations in the task of document classification.
Figure 4 gives a visualization of some selected words using t-SNE (Van der Maaten and Hinton, 2008), where we observe the topical nature of word embeddings. Regardless of their source languages, words sharing a common topic, e.g. economy, are closely aligned with each other, revealing the semantic validity of the joint vector space.

Related Work
Matrix factorization has been successfully applied to learn word representations: several low-rank matrices are used to approximate an original matrix of extracted statistical information, usually word co-occurrence counts or PMI. Singular value decomposition (SVD) (Eckart and Young, 1936), SVD-based latent semantic analysis (LSA) (Landauer et al., 1998), latent semantic indexing (LSI) (Deerwester et al., 1990), and the more recently proposed global vectors for word representation (GloVe) (Pennington et al., 2014) are widely applied in NLP and information retrieval (Berry et al., 1995). Additionally, there is evidence that some neural-network-based models, such as Skip-gram (Mikolov et al., 2013b), which exhibits state-of-the-art performance, are also implicitly factorizing a PMI-based matrix (Levy and Goldberg, 2014). The strategy for matrix factorization in this paper follows Pennington et al. (2014) and is stochastic, which better handles unobserved data and allows one to weigh samples according to their importance and confidence.
Joint matrix factorization allows one to decompose matrices under correlational constraints. Collective matrix factorization has been developed to handle pairwise relations (Singh and Gordon, 2008). Chang et al. (2013) generalized LSA to Multi-Relational LSA, which constructs a 3-way tensor to combine the multiple relations between words. While matrix factorization is widely used in recommender systems, matrix co-factorization helps to handle multiple aspects of the data and improves the prediction of individual decisions (Hong et al., 2013). Multiple sources of information, such as content and linkage, can also be connected with matrix co-factorization to derive high-quality webpage representations (Zhu et al., 2007). The advantage of this approach is that it automatically finds parameters that optimize both the individual matrix factorizations and the relational alignments, which avoids manually defining a projection matrix or transfer function. To the best of our knowledge, we are the first to introduce this technique to learn cross-lingual word embeddings.

Conclusions
In this paper, we introduced a framework of matrix co-factorization to learn cross-lingual word embeddings. It is capable of capturing the lexico-semantic similarities of different languages in a unified vector space, where the embeddings are jointly learnt instead of projected from separate vector spaces. The overall objective is divided into monolingual parts and a cross-lingual one, which enables one to use different weighting and learning strategies, and to develop models either with or without word alignments. By exploiting global rather than local context and similarity information, our proposed models are computationally efficient and effective.
Matrix co-factorization also allows one to integrate external information, such as syntactic contexts and morphology, which is not discussed in this paper. Its application in statistical machine translation and cross-lingual model transfer remains to be explored. Learning multiple embeddings per word and compositional embeddings with matrix factorization are also interesting future directions.