Trans-gram, Fast Cross-lingual Word-embeddings

We introduce Trans-gram, a simple and computationally efficient method to simultaneously learn and align word-embeddings for a variety of languages, using only monolingual data and a smaller set of sentence-aligned data. We use our new method to compute aligned word-embeddings for twenty-one languages using English as a pivot language. We show that some linguistic features are aligned across languages for which we do not have aligned data, even though those properties do not exist in the pivot language. We also achieve state-of-the-art results on standard cross-lingual text classification and word translation tasks.


Introduction
Word-embeddings are a representation of words with fixed-sized vectors. It is a distributed representation (Hinton, 1984) in the sense that there is not necessarily a one-to-one correspondence between vector dimensions and linguistic properties. The linguistic properties are distributed along the dimensions of the space.
A popular method to compute word-embeddings is the Skip-gram model (Mikolov et al., 2013a). This algorithm learns high-quality word vectors at a much lower computational cost than previous methods, which allows very large amounts of data to be processed. For instance, a 1.6-billion-word dataset can be processed in less than one day. Several authors have proposed methods to align word-embeddings across two languages (Klementiev et al., 2012; Mikolov et al., 2013b; Lauly et al., 2014; Gouws et al., 2015).
* These authors contributed equally.
In this article, we introduce a new method called Trans-gram, which learns word-embeddings aligned across many languages in a simple and efficient fashion, using only sentence alignments rather than word alignments. We compare our method with previous approaches on a cross-lingual document classification task and on a word translation task, and obtain state-of-the-art results on both. Additionally, word-embeddings for twenty-one languages are learned simultaneously (to our knowledge, for the first time) in less than two and a half hours. Furthermore, we illustrate some interesting properties that are captured, such as cross-lingual analogies, e.g. $\text{rey}_{es} - \text{Mann}_{de} + \text{femme}_{fr} \approx \text{regina}_{it}$, which can be used for disambiguation.

Review of Previous Work
A number of methods have been explored to train and align bilingual word-embeddings. These methods pursue two objectives. First, similar (i.e. spatially close) representations must be assigned to similar (i.e. semantically close) words within each language: this is the mono-lingual objective. Second, similar representations must be assigned to similar words across languages: this is the cross-lingual objective.
The simplest approach separates the mono-lingual optimization task from the cross-lingual optimization task. This is, for example, the case in (Mikolov et al., 2013b). The idea is to train a separate set of word-embeddings for each language and then to do a parametric estimation of the mapping between word-embeddings across languages. This method was further extended by (Faruqui and Dyer, 2014). Even though those algorithms proved to be viable and fast, it is not clear whether a simple mapping between whole languages exists. Moreover, they require word alignments, which are a rare and expensive resource.
Another approach focuses entirely on the cross-lingual objective. This was explored in (Hermann and Blunsom, 2013; Lauly et al., 2014), where every pair of aligned sentences is transformed into two fixed-size vectors. The model then minimizes the Euclidean distance between the two vectors. This idea allows the processing of corpora aligned at sentence level rather than word level. However, it does not leverage the abundance of existing mono-lingual corpora.
A popular approach is to optimize the mono-lingual and cross-lingual objectives jointly. This is mostly done by minimizing the sum of the mono-lingual loss functions for each language and the cross-lingual loss function. (Klementiev et al., 2012) showed this approach to be useful by obtaining state-of-the-art results on several tasks. (Gouws et al., 2015) extend their work with a more computationally efficient implementation.
From Skip-Gram to Trans-Gram

Skip-gram
We briefly introduce the Skip-gram algorithm, as we will need it for further explanations. Skip-gram trains word embeddings for a language using mono-lingual data. This method uses a dual representation for words. Each word $w$ has two embeddings: a target vector $\vec{w} \in \mathbb{R}^D$ and a context vector $\overleftarrow{w} \in \mathbb{R}^D$. The algorithm tries to estimate the probability that a word $w$ appears in the context of a word $c$. More precisely, we learn the embeddings $\vec{w}$ and $\overleftarrow{c}$ so that $\sigma(\vec{w} \cdot \overleftarrow{c}) = P(w|c)$, where $\sigma$ is the sigmoid function.
A simplified version of the loss function minimized by Skip-gram is the following:
$$J = \sum_{s \in C} \sum_{w \in s} \sum_{c \in s[w-l:w+l]} -\log \sigma(\vec{w} \cdot \overleftarrow{c})$$
where $\vec{w}$ is the target vector of $w$, $\overleftarrow{c}$ is the context vector of $c$, $C$ is the set of sentences constituting the training corpus, and $s[w-l:w+l]$ is a word window on the sentence $s$ centered around $w$. For the sake of simplicity, this equation does not include the "negative-sampling" term; see (Mikolov et al., 2013a) for more details. Skip-gram can be seen as a materialization of the distributional hypothesis (Harris, 1968): "Words used in similar contexts have similar meanings". We will now see how to extend this idea to cross-lingual contexts.
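The simplified Skip-gram loss described above can be sketched in a few lines of NumPy. This is only an illustration, not the implementation used in the experiments: the function and variable names are hypothetical, words are assumed to be integer ids, and skipping the centre word of the window is an assumption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def skipgram_loss(corpus, target, context, half_window):
    """Simplified Skip-gram loss, without the negative-sampling term.

    corpus: list of sentences, each a list of integer word ids.
    target, context: arrays of shape (vocab_size, dim) holding the
    target and context vectors of every word.
    """
    loss = 0.0
    for sentence in corpus:
        for i, w in enumerate(sentence):
            lo = max(0, i - half_window)
            hi = min(len(sentence), i + half_window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue  # assumption: a word is not its own context
                c = sentence[j]
                loss -= np.log(sigmoid(target[w] @ context[c]))
    return loss


# Toy usage: 3-word vocabulary, zero-initialised 4-dimensional embeddings.
toy_corpus = [[0, 1, 2]]
toy_target = np.zeros((3, 4))
toy_context = np.zeros((3, 4))
toy_loss = skipgram_loss(toy_corpus, toy_target, toy_context, half_window=1)
```

With all-zero embeddings every (target, context) pair contributes log 2, which gives a quick sanity check of the window handling.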

Trans-gram
In this section we introduce Trans-gram, a new method to compute aligned word-embeddings for a variety of languages.
Our method minimizes the sum of mono-lingual losses and cross-lingual losses. As in BilBOWA (Gouws et al., 2015), we use Skip-gram as the mono-lingual loss. Assuming we are trying to learn aligned word vectors for languages e (e.g. English) and f (e.g. French), we denote by $J_e$ and $J_f$ the two mono-lingual losses.
In BilBOWA, the cross-lingual loss function is a distance between bag-of-words representations of two aligned sentences. However, since (Levy and Goldberg, 2014) showed that the Skip-gram loss function extracts interesting linguistic features, we wanted a cross-lingual loss function that is closer to Skip-gram than the one used in BilBOWA.
Therefore, we introduce a new task, Trans-gram, similar to Skip-gram. Each English sentence $s_e$ in our aligned corpus $A_{e,f}$ is aligned with a French sentence $s_f$. In Skip-gram, the context picked for a target word $w_e$ in a sentence $s_e$ is the set of words $c_e$ appearing in the window centered around $w_e$: $s_e[w_e-l:w_e+l]$. In Trans-gram, the context picked for a target word $w_e$ in a sentence $s_e$ is the set of all words $c_f$ appearing in $s_f$. The loss can thus be written as:
$$\Omega_{e,f} = \sum_{(s_e, s_f) \in A_{e,f}} \sum_{w_e \in s_e} \sum_{c_f \in s_f} -\log \sigma(\vec{w_e} \cdot \overleftarrow{c_f})$$
where $\vec{w_e}$ is the target vector of $w_e$ and $\overleftarrow{c_f}$ is the context vector of $c_f$. This loss is not symmetric with respect to the languages. We therefore use two cross-lingual objectives: $\Omega_{e,f}$, aligning e's target vectors with f's context vectors, and $\Omega_{f,e}$, aligning f's target vectors with e's context vectors. By comparison, BilBOWA only aligns e's target vectors with f's target vectors. Figure 1 illustrates the four objectives.
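The Trans-gram objective for one direction can be sketched as follows. As with any such sketch, the names are hypothetical, words are assumed to be integer ids, and the negative-sampling term is left out; the only difference from a Skip-gram loss is that the context of a target word is the whole aligned sentence in the other language.

```python
import numpy as np


def transgram_loss(aligned_pairs, target_e, context_f):
    """Simplified Trans-gram loss from language e to language f.

    aligned_pairs: list of (s_e, s_f) tuples, each a list of word ids.
    target_e: (vocab_e, dim) target vectors of language e.
    context_f: (vocab_f, dim) context vectors of language f.
    """
    loss = 0.0
    for s_e, s_f in aligned_pairs:
        ctx = context_f[s_f]  # context vectors of every word of s_f
        for w_e in s_e:
            scores = ctx @ target_e[w_e]
            # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably
            loss += np.sum(np.logaddexp(0.0, -scores))
    return loss


# Toy usage: one aligned pair with 2 words in e and 3 words in f.
toy_pairs = [([0, 1], [0, 1, 2])]
toy_target_e = np.zeros((2, 4))
toy_context_f = np.zeros((3, 4))
toy_omega_ef = transgram_loss(toy_pairs, toy_target_e, toy_context_f)
```

The reverse objective is obtained by calling the same function with the sentence pairs flipped, f's target vectors and e's context vectors.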
Notice that we make the assumption that the meaning of a word is uniformly distributed over the whole sentence. This assumption, although naive, gave us excellent results in practice. In addition, our method uses only sentence-aligned corpora and not word-aligned corpora, which are rarer.
To add a third language i (e.g. Italian), we just have to add three new objectives ($J_i$, $\Omega_{e,i}$ and $\Omega_{i,e}$) to the global loss. If available, we could also add $\Omega_{f,i}$ or $\Omega_{i,f}$, but in our case we only used corpora aligned with English.

Figure 1: The four partial objectives contributing to the alignment of English and French: a Skip-gram objective per language ($J_e$ and $J_f$) over a window surrounding a target word (blue) and two Trans-gram objectives ($\Omega_{e,f}$ and $\Omega_{f,e}$) over the whole sentence aligned with the sentence from which the target word is extracted (red).
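As a compact summary of the objectives listed above, the global loss for English, French and Italian can be sketched as a plain sum. The symbol $\mathcal{L}$ and the equal, unweighted combination of the terms are assumptions made here for illustration; the text only states that the mono-lingual and cross-lingual losses are summed.

$$\mathcal{L} = J_e + J_f + J_i + \Omega_{e,f} + \Omega_{f,e} + \Omega_{e,i} + \Omega_{i,e}$$

With English as the only pivot, each additional language adds three terms, so the number of objectives grows linearly with the number of languages.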

Implementation
In our experiments, we used the Europarl (Koehn, 2005) aligned corpora. Europarl-v7 has two peculiarities: first, the corpora are aligned at sentence level; second, each pair of languages contains English as one of its members: for instance, there is no French/Italian pair. In other words, English is used as a pivot language. No bilingual lexicons or other bilingual datasets aligned at the word level were used. Using only the Europarl-v7 texts as both monolingual and bilingual data, it took 10 minutes to align 2 languages, and two and a half hours to align the 21 languages of the corpus, in a 40-dimensional space on a 6-core computer. We also computed 300-dimensional vectors using the Wikipedia extracts provided by (Al-Rfou et al., 2013) as monolingual data for each language. The training time was 21 hours.

Reuters Cross-lingual Document Classification
We used a subset of the English and German sections of the Reuters RCV1/RCV2 corpora (Lewis and Li, 2004) (10000 documents each), as in (Klementiev et al., 2012), and we replicated the experimental setting. In the English dataset, there are four topics: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). We used these topics as our labels and only selected documents labeled with a single topic. We trained our classifier on the articles of one language, where each document was represented using an IDF-weighted sum of the vectors of its words, and then tested it on the articles of the other language. The classifier used was an averaged perceptron, and we used the implementation from (Klementiev et al., 2012). We compare our results with those of (Gouws et al., 2015) with BilBOWA and (Lauly et al., 2014) with their Bilingual Auto-encoder model. This model learns word embeddings during a translation task that uses an encoder-decoder approach. We also report the scores from Klementiev et al., who introduced the task, and the BiCVM model scores from (Hermann and Blunsom, 2013).
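The document representation used in this task, an IDF-weighted sum of word vectors, can be sketched as below. The exact IDF formula is not given in the text, so the common log(N / df) form is assumed here, and the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(documents):
    """Compute IDF as log(N / df); this particular formula is an assumption."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def document_vector(doc, embeddings, idf):
    """IDF-weighted sum of the vectors of the words of a document.

    Words without an embedding or an IDF weight are ignored.
    """
    dim = len(next(iter(embeddings.values())))
    vec = np.zeros(dim)
    for word in doc:
        if word in embeddings and word in idf:
            vec += idf[word] * embeddings[word]
    return vec


# Toy usage with made-up two-dimensional embeddings.
toy_docs = [["a", "b"], ["a", "c"]]
toy_embeddings = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
toy_idf = idf_weights(toy_docs)
toy_vec = document_vector(toy_docs[0], toy_embeddings, toy_idf)
```

Because the embeddings of both languages live in the same space, vectors built this way from documents in one language can be fed to a classifier trained on documents in the other.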
The results show an overall significant improvement over the other methods, with the added advantage of being computationally efficient.

P@k Word Translation
Next, we evaluated our method on a word translation task introduced in (Mikolov et al., 2013b) and used in (Gouws et al., 2015). The words were extracted from the publicly available WMT11 corpus. The experiments were done for two translation directions: English to Spanish and Spanish to English. (Mikolov et al., 2013b) extracted the 6K most frequent words and translated them with Google Translate. They used the top 5K pairs to train a translation matrix and evaluated their method on the remaining 1K. As our English and
The reported score, the translation precision P@k, is the fraction of test pairs where the target translation (Google Translate) is among the $k$ translations proposed by our model. For a given English word $w$, our model takes its target vector $\vec{w}$ and proposes the $k$ closest Spanish words, using the cosine similarity of their vectors to $\vec{w}$. We compare ourselves to the "translation matrix" method and to the BilBOWA aligned vectors. We also report the scores obtained by a trivial algorithm that uses edit distance to determine the closest translation, and by the Bing Translator service.
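The P@k score described above can be sketched as follows, assuming that closeness is measured with cosine similarity and that the gold translation of each source word is given as a row index into the matrix of candidate vectors. The function name, argument layout and toy vectors are hypothetical.

```python
import numpy as np


def precision_at_k(src_vectors, tgt_matrix, gold_indices, k):
    """Fraction of source words whose gold translation is in the top k.

    src_vectors: (n, dim) vectors of the source words to translate.
    tgt_matrix: (m, dim) vectors of all candidate target words.
    gold_indices: length-n sequence of row indices into tgt_matrix.
    """
    src = src_vectors / np.linalg.norm(src_vectors, axis=1, keepdims=True)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = src @ tgt.T  # cosine similarities, shape (n, m)
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(gold_indices)[:, None]).any(axis=1)
    return hits.mean()


# Toy usage with made-up vectors: two source words, three candidates.
toy_src = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_tgt = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, 0.0]])
toy_gold = [0, 2]
toy_p_at_1 = precision_at_k(toy_src, toy_tgt, toy_gold, k=1)
toy_p_at_3 = precision_at_k(toy_src, toy_tgt, toy_gold, k=3)
```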
Interesting properties

Cross-lingual disambiguation
We now present the task of cross-lingual disambiguation as an example of possible uses of aligned multilingual vectors. The goal of this task is to find a suitable representation of each sense of a given polysemous word. The idea of our method is to look for a language in which the undesired senses are represented by unambiguous words and then to perform some arithmetic operation.
Let's illustrate the process with a concrete example: consider the French word "train", $\text{train}_{fr}$. The three Polish words closest to $\text{train}_{fr}$ translate into English as "now", "a train" and "when". This seems a poor match. In fact, $\text{train}_{fr}$ is polysemous. It can name a line of railroad cars, but it is also used to form progressive tenses. The French "Il est en train de manger" translates into "he is eating", or in Italian "sta mangiando".
As the Italian word "sta" is used to form progressive tenses, it is a good candidate to disambiguate $\text{train}_{fr}$. Let's introduce the vector $v = \text{train}_{fr} - \text{sta}_{it}$. Now the three Polish words closest to $v$ translate into English as "a train", "a train" and "railroad". Therefore $v$ is a better representation of the railroad sense of $\text{train}_{fr}$.
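The arithmetic behind this disambiguation can be sketched with a nearest-neighbour search in the shared space. The two-dimensional vectors and the placeholder labels below are invented purely to make the mechanism visible; they are not vectors learned by Trans-gram, and cosine similarity is assumed as the closeness measure.

```python
import numpy as np


def nearest(query, vocab_matrix, words, k=3):
    """Return the k words whose vectors have the highest cosine similarity."""
    q = query / np.linalg.norm(query)
    m = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    order = np.argsort(-(m @ q))[:k]
    return [words[i] for i in order]


# Invented toy space: axis 0 ~ "railway" sense, axis 1 ~ "progressive" sense.
train_fr = np.array([1.0, 1.0])  # polysemous word: mixes both senses
sta_it = np.array([0.0, 1.0])    # unambiguous marker of the progressive sense

# Placeholder labels standing in for Polish vocabulary entries.
pl_words = ["pl_train", "pl_now", "pl_railroad"]
pl_vectors = np.array([[1.0, 0.0], [0.1, 1.0], [0.9, 0.1]])

# Subtracting the unwanted sense leaves the railway direction.
v = train_fr - sta_it
closest_to_v = nearest(v, pl_vectors, pl_words, k=2)
```

In this toy space the difference vector points along the railway axis, so its nearest neighbours are the two railway-related placeholders.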

Transfer of linguistic features
Another interesting property of the vectors generated by Trans-gram is the transfer of linguistic features through a pivot language that does not possess these features.
Let's illustrate this by focusing on Romance languages, which possess some features that English does not, such as rich conjugations. For example, in French and Italian the infinitives of "eat" are $\text{manger}_{fr}$ and $\text{mangiare}_{it}$, and the first-person plural forms are $\text{mangeons}_{fr}$ and $\text{mangiamo}_{it}$. In our models we indeed observe the following alignments: $\text{manger}_{fr} \approx \text{mangiare}_{it}$ and $\text{mangeons}_{fr} \approx \text{mangiamo}_{it}$. It is thus remarkable that features not present in English match in languages aligned through English as the only pivot language. We also found similar transfers for the genders of adjectives and are currently studying other similar properties captured by Trans-gram.

Conclusion
In this paper we made the following contributions: Trans-gram, a new method to compute cross-lingual word-embeddings in a single word space; state-of-the-art results on cross-lingual NLP tasks; a sketch of a cross-lingual calculus to help disambiguate polysemous words; and a demonstration of the transfer of linguistic features through a pivot language that does not possess those features.
We are still exploring promising properties of the generated vectors and their applications in other NLP tasks (Sentiment Analysis, NER...).