Learning Translations via Matrix Completion

Bilingual Lexicon Induction is the task of learning word translations without bilingual parallel corpora. We model this task as a matrix completion problem, and present an effective and extendable framework for completing the matrix. This method harnesses diverse bilingual and monolingual signals, each of which may be incomplete or noisy. Our model achieves state-of-the-art performance for both high and low resource languages.


Introduction
Machine translation (MT) models typically require large, sentence-aligned bilingual texts to learn good translation models (Wu et al., 2016; Sennrich et al., 2016a; Koehn et al., 2003). However, for many language pairs, such parallel texts may only be available in limited quantities, which is problematic. Alignments at the word or subword levels (Sennrich et al., 2016b) can be inaccurate in the limited parallel texts, which can in turn lead to inaccurate translations. Due to the low quantity and thus coverage of the texts, there may still be "out-of-vocabulary" words encountered at run-time. The Bilingual Lexicon Induction (BLI) task (Rapp, 1995), which learns word translations from monolingual or comparable corpora, is an attempt to alleviate this problem. The goal is to use plentiful, more easily obtainable, monolingual or comparable data to infer word translations and reduce the need for parallel data to learn good translation models. The word translations obtained by BLI can, for example, be used to augment MT systems and improve alignment accuracy, coverage, and translation quality (Gulcehre et al., 2016; Callison-Burch et al., 2006; Daumé and Jagarlamudi, 2011). Previous research has explored different sources for estimating translation equivalence from monolingual corpora (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013, 2017). These monolingual signals, when combined in a supervised model, can enhance end-to-end MT for low resource languages (Klementiev et al., 2012a; Irvine and Callison-Burch, 2016). More recently, similarities between words in different languages have been approximated by constructing a shared bilingual word embedding space with different forms of bilingual supervision (Upadhyay et al., 2016).
We present a framework for learning translations by combining diverse signals of translation that are each potentially sparse or noisy. We use matrix factorization (MF), which has been shown to be effective for harnessing incomplete or noisy distant supervision from multiple sources of information (Fan et al., 2014; Rocktäschel et al., 2015). MF has also been shown to produce good cross-lingual representations for tasks such as alignment (Goutte et al., 2004), QA (Zhou et al., 2013), and cross-lingual word embeddings (Shi et al., 2015).
Specifically, we represent translation as a matrix with source words in the columns and target words in the rows, and model the task of learning translations as a matrix completion problem. Starting from some observed translations (e.g., from existing bilingual dictionaries), we infer missing translations in the matrix using MF with a Bayesian Personalized Ranking (BPR) objective (Rendle et al., 2009). We select BPR for a number of reasons: (1) BPR has been shown to outperform traditional supervised methods in the presence of positive-only data (Riedel et al., 2013), which is true in our case since we only observe positive translations. (2) BPR is easily extendable to incorporate additional signals for inferring missing values in the matrix (He and McAuley, 2016). Since observed translations may be sparse, i.e., the "cold start" problem in the matrix completion task, incorporating additional signals of translation equivalence estimated on monolingual corpora is useful.
(3) BPR is also shown to be effective for multilingual transfer learning (Verga et al., 2016). For low resource source languages, there may be related, higher resource languages from which we can project available translations (e.g., translations of loan words) to the target language (Figure 1).
We conduct large scale experiments to learn translations from both low and high resource languages to English and achieve state-of-the-art performance on these languages. Our main contributions are as follows: • We introduce a MF framework that learns translations by integrating diverse bilingual and monolingual signals of translation, each potentially noisy/incomplete.

Related Work
A rich body of prior work explores learning translations from monolingual corpora. Signals such as contextual, temporal, topical, and orthographic similarities between words are used to measure their translation equivalence (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013, 2017).
With the increasing popularity of word embeddings, many recent works approximate similarities between words in different languages by constructing a shared bilingual embedding space (Klementiev et al., 2012b; Zou et al., 2013; Vulić and Moens, 2013; Mikolov et al., 2013a; Faruqui and Dyer, 2014; AP et al., 2014; Gouws et al., 2014; Luong et al., 2015; Lu et al., 2015; Upadhyay et al., 2016). In the shared space, words from different languages are represented in a language-independent manner such that similar words, regardless of language, have similar representations. Similarities between words can then be measured in the shared space. One approach to induce this shared space is to learn a mapping function between the languages' monolingual semantic spaces (Mikolov et al., 2013a; Dinu et al., 2014). The mapping relies on seed translations, which can come from existing dictionaries or be reliably chosen from pseudo-bilingual corpora of comparable texts, e.g., Wikipedia with interlanguage links. Vulić and Moens (2015) show that by learning a linear function with a reliably chosen seed lexicon, they outperform other models with more expensive bilingual signals for training on benchmark data.
Most prior work on BLI, however, either makes use of only one monolingual signal or uses unsupervised methods (e.g., rank combination) to aggregate the signals. Irvine and Callison-Burch (2016) show that combining monolingual signals in a supervised logistic regression model produces higher accuracy word translations than unsupervised models. More recently, Vulić et al. (2016) show that their multi-modal model, which employs a simple weighted sum of word embeddings and visual similarities, can improve translation accuracy. These works show that there is a need for combining diverse, multi-modal monolingual signals of translation. In this paper, we take this a step further by combining the monolingual signals with bilingual signals of translation from existing bilingual dictionaries of related, "third" languages.
Bayesian Personalized Ranking (BPR)
Our approach is based on extensions to the probabilistic model of MF in collaborative filtering (Koren et al., 2009; Rendle et al., 2009). We represent our translation task as a matrix with source words in the columns and target words in the rows (Figure 1). Based on some observed translations in the matrix found in a seed dictionary, our model learns low-dimensional feature vectors that encode the latent properties of the words in the rows and the words in the columns. The dot product of these vectors, which indicates how "aligned" the source and target word properties are, captures how likely they are to be translations.
Since we do not observe false translations in the seed dictionary, the training data in the matrix consists only of positive translations. The absence of a value in the matrix does not imply that the corresponding words are not translations. In fact, we seek to predict which of these missing values are true. The BPR approach to MF (Rendle et al., 2009) formulates the task of predicting missing values as a ranking task. With the assumption that observed true translations should be given higher values than unobserved translations, BPR learns to optimize the difference between the values assigned to the observed translations and the values assigned to the unobserved translations.
However, due to the sparsity of existing bilingual dictionaries (for some language pairs such dictionaries may not exist), the traditional formulation of MF with BPR suffers from the "cold start" issue (Gantner et al., 2010; He and McAuley, 2016; Verga et al., 2016). In our case, these are situations in which some source words have no translations to any word in the target or related languages. For these words, additional information, e.g., monolingual signals of translation equivalence or language-independent representations such as visual representations, must be used.
We use bilingual translations from the source to the target language, English, obtained from Wikipedia page titles with interlanguage links. Since Wikipedia pages in the source language may be linked to pages in languages other than English, we also use high accuracy, crowdsourced translations (Pavlick et al., 2014) from these third languages to English as additional bilingual translations. To alleviate the cold start issue, when a source word has no existing known translation to English or other third languages, our model backs off to additional signals of translation equivalence estimated from its word embedding and visual representations.

Method
In this section, we describe our framework for integrating bilingual and monolingual signals for learning translations. First, we formulate the task of Bilingual Lexicon Induction and introduce our model for learning translations given observed translations and additional monolingual/language-independent signals. Then we derive our learning procedure using the BPR objective function.
Problem Formulation
Given a set of source words F and a set of target words E, a pair ⟨e, f⟩ where e ∈ E and f ∈ F is a candidate translation with an associated score x e,f ∈ [0, 1] indicating the confidence of the translation. The input to our model is a set of observed translations T := {⟨e, f⟩ | x e,f = 1}. These could come from an incomplete bilingual dictionary. We also add word identities to the matrix, i.e., we define T identity := {⟨e, e⟩}, where T identity ⊂ T. The task of Bilingual Lexicon Induction is then to generate missing translations: for a given source word f and a set of target words {e | ⟨e, f⟩ ∉ T}, predict the score x e,f of how likely it is for e to be a translation of f.

Bilingual Signals for Translation
One way to predict x e,f is matrix factorization. The problem of predicting x e,f can be seen as the task of estimating a matrix X : E × F. X is approximated by the product of two low-rank matrices P : |E| × k and Q : |F| × k, where k is the rank of the approximation. Each row p e of P can be seen as a feature vector describing the latent properties of the target word e, and each row q f of Q describes the latent properties of the source word f. Their dot product encodes how aligned the latent properties are and, since these vectors are trained on observed translations, how likely the two words are to be translations of each other. Thus, we can write this MF formulation of the predicted score as:

xMF e,f = p e · q f    (eq. 1)
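As a minimal sketch of eq. 1, the MF score for a candidate pair is just the dot product of two latent vectors. The vocabulary and latent values below are invented for illustration (the Indonesian word air means water):

```python
import random

random.seed(0)
k = 4  # rank of the approximation

# Toy latent vectors: p_e for the English target word "water" (a row
# of P) and q_f for the Indonesian source word "air" (a row of Q).
P = {"water": [random.gauss(0, 0.1) for _ in range(k)]}
Q = {"air": [random.gauss(0, 0.1) for _ in range(k)]}

def score_mf(e, f):
    """Eq. 1: the MF score x_{e,f} is the dot product p_e . q_f."""
    return sum(p * q for p, q in zip(P[e], Q[f]))

print(score_mf("water", "air"))
```

In the full model these vectors are trained on observed translations, so a high dot product indicates a likely translation pair.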
Figure 2: The word tidur (id) is a cold word with no associated translation in the matrix. Auxiliary features θ f about the words can be used to predict translations for cold words.
Auxiliary Signals for Translation
Because the observed bilingual translations may be sparse, the MF approach can suffer from the existence of cold items: words that have no or too few associated observed translations to estimate their latent dimensions accurately (Figure 2). Additional signals for measuring translation equivalence can alleviate this problem. Hence, in the case of cold words, we use a formulation that involves auxiliary features about the words in the predicted xe,f:

xAUX e,f = θe · θf + β

θf represents auxiliary information about the cold word f, e.g., its word embedding or visual features. θe is a feature vector to be trained, whose dot product with θf models the extent to which the word e matches the auxiliary features of word f. In practice, learning θe amounts to learning a classifier, one for each target word e, that learns weights θe given the feature vectors θf of its translations. β models the target words' overall bias toward a given word f. Since each word can have multiple additional feature vectors, we can formulate xAUX e,f as a weighted sum of the available auxiliary features (bias terms omitted for brevity):

xAUX e,f = Σm αm (θe(m) · θf(m))    (eq. 2)

where αm are parameters that control the contribution of each auxiliary feature.
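The weighted sum over feature types can be sketched as follows; the two feature types and all vector values are invented for illustration (one stands in for a word embedding, one for a visual feature):

```python
# Sketch of the assumed form of eq. 2: a weighted sum over M auxiliary
# feature types, where alpha[m] controls the contribution of type m.
def score_aux(theta_e, theta_f, alpha):
    """theta_e[m], theta_f[m]: the m-th auxiliary vectors of e and f."""
    total = 0.0
    for m in range(len(alpha)):
        dot = sum(a * b for a, b in zip(theta_e[m], theta_f[m]))
        total += alpha[m] * dot
    return total

# One "embedding" feature and one "visual" feature, weighted equally:
theta_e = [[0.2, 0.1], [0.5, -0.3]]
theta_f = [[0.1, 0.4], [0.4, -0.2]]
print(score_aux(theta_e, theta_f, alpha=[0.5, 0.5]))  # 0.5*0.06 + 0.5*0.26
```

Equal weights correspond to the setting used in the visual-signal experiments later in the paper.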
In practice, we can combine the MF and auxiliary formulations by defining the predicted score as xe,f = xMF e,f + xAUX e,f.

Learning with Bayesian Personalized Ranking
Unlike traditional supervised models that try to maximize the scores assigned to positive instances (in our case, observed translations), the objective of BPR is to maximize the difference between the scores assigned to observed translations and those assigned to unobserved translations. Given a training set D consisting of triples of the form ⟨e, f, g⟩, where ⟨e, f⟩ ∈ T and ⟨e, g⟩ ∉ T, BPR wants to maximize:

xe,f,g = xe,f − xe,g

where xe,f and xe,g can be defined either by eq. 1 or eq. 2 (for cold words). Specifically, BPR optimizes (Rendle et al., 2009):

Σ ⟨e,f,g⟩ ∈ D ln σ(xe,f,g) − λΘ ∥Θ∥²

where σ is the logistic sigmoid function, Θ is the parameter vector of xe,f,g to be trained, and λΘ is its regularization hyperparameter vector. BPR can be trained using stochastic gradient ascent: a triple ⟨e, f, g⟩ is sampled from D and the parameters are updated as

Θ ← Θ + η (σ(−xe,f,g) · ∂xe,f,g/∂Θ − λΘ Θ)

where η is the learning rate. Hence, for the MF formulation of xe,f,g, we sample a triple ⟨e, f, g⟩ from D and update pe, qf, and qg using ∂xe,f,g/∂pe = qf − qg, ∂xe,f,g/∂qf = pe, and ∂xe,f,g/∂qg = −pe, while for the auxiliary formulation we update θe using ∂xe,f,g/∂θe = θf − θg.

To implement our approach, we extend the implementation of BPR in LIBREC, a publicly available Java library for recommender systems.

Experiments
We evaluate our model on the task of Bilingual Lexicon Induction (BLI). Given a source word f, the task is to rank all candidate target words e by their predicted translation scores xe,f. We conduct large-scale experiments on 27 low- and high-resource source languages and evaluate their translations to English. We use the 100K most frequent words from English Wikipedia as the candidate English target words (E).
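The BPR training loop for the MF formulation can be sketched on a toy problem. Everything below (vocabulary, dimensionality, learning rate, iteration count) is invented for illustration and is not the paper's actual configuration:

```python
import math
import random

random.seed(0)
k, eta, lam = 8, 0.05, 0.01  # toy rank, learning rate, regularizer

# Toy vocabulary and observed translations T (positive pairs only).
targets = ["water", "sleep", "house"]
sources = ["air", "tidur", "rumah"]
T = {("water", "air"), ("sleep", "tidur"), ("house", "rumah")}

P = {e: [random.gauss(0, 0.1) for _ in range(k)] for e in targets}
Q = {f: [random.gauss(0, 0.1) for _ in range(k)] for f in sources}

def x_mf(e, f):
    return sum(a * b for a, b in zip(P[e], Q[f]))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient ascent on the BPR objective: sample a triple
# <e, f, g> with <e,f> in T and <e,g> not in T, then push the score of
# the observed pair above the score of the unobserved pair.
for _ in range(2000):
    e, f = random.choice(sorted(T))
    g = random.choice([s for s in sources if (e, s) not in T])
    c = sigmoid(-(x_mf(e, f) - x_mf(e, g)))  # gradient weight
    for d in range(k):
        pe, qf, qg = P[e][d], Q[f][d], Q[g][d]
        P[e][d] += eta * (c * (qf - qg) - lam * pe)
        Q[f][d] += eta * (c * pe - lam * qf)
        Q[g][d] += eta * (-c * pe - lam * qg)

# After training, observed translations should outrank the negatives.
print(x_mf("water", "air") > x_mf("water", "tidur"))
```

The update lines mirror the gradients given above: the positive source vector moves toward the target vector, the sampled negative moves away.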
At test time, for each source language, we evaluate the top-10 accuracy (Acc 10 ): the percent of source language words in the test set for which a correct English translation appears in the top-10 ranked English candidates.
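Acc 10 can be computed as in the sketch below; the ranked candidate lists for the two Indonesian test words are hypothetical model outputs:

```python
def acc_at_10(ranked_candidates, gold):
    """Percent of test source words whose gold English translation
    appears among the model's top-10 ranked English candidates."""
    hits = sum(1 for f, gold_e in gold.items()
               if gold_e in ranked_candidates[f][:10])
    return 100.0 * hits / len(gold)

# Hypothetical ranked outputs: "water" is in the top-10 for "air",
# but "sleep" is missing from the top-10 for "tidur".
ranked = {"air": ["water", "rain", "river"] + ["w%d" % i for i in range(7)],
          "tidur": ["bed", "dream", "night"] + ["w%d" % i for i in range(17)]}
gold = {"air": "water", "tidur": "sleep"}
print(acc_at_10(ranked, gold))  # one hit out of two words -> 50.0
```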

Test sets
We use benchmark test sets for the task of bilingual lexicon induction to evaluate the performance of our model.The VULIC1000 dataset (Vulić and Moens, 2016) comprises 1000 nouns in Spanish, Italian, and Dutch, along with their one-to-one ground-truth word translations in English.
We construct a new test set (CROWDTEST) for a larger set of 27 languages from crowdsourced dictionaries (Pavlick et al., 2014). For each language, we randomly pick up to 1000 words that have only one English word translation in the crowdsourced dictionary to be the test set for that language. On average, there are 967 test source words with a variety of POS per language. Since different languages treat grammatical categories such as tense and number differently (for example, unlike English, tenses in Indonesian (id) are not expressed by specific word forms; rather, they are expressed through context), we make our evaluation on all languages in CROWDTEST generic by treating a predicted English translation of a foreign word as correct as long as it has the same lemma as the gold English translation. To facilitate further research, we make CROWDTEST publicly available on our website.

Bilingual Signals for Translation
We use Wikipedia to incorporate information from a third language into the matrix, with observed translations to both the source language and the target language, English. We first collect all interlingual links from English Wikipedia pages to pages in other languages. Using these links, we obtain translations of Wikipedia page titles in many languages to English, e.g., id.wikipedia.org/wiki/Kulkas → fridge (en). The observed translations are projected to fill the missing translations in the matrix (Figure 3). We call these bilingual translations WIKI.
From the links that we have collected, we can also infer links from Wikipedia pages in the source language to other pages in non-target languages, e.g., id.wikipedia.org/wiki/Kulkas → it.wikipedia.org/wiki/Frigorifero. The titles of these pages can be translated to English if they exist as entries in the dictionaries. These non-source, non-target language pages can act as yet another third language whose observed translations can be projected to fill the missing translations in the matrix (Figure 3). We call these bilingual translations WIKI+CROWD.
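The projection through a third language can be sketched as follows, using toy data that mirrors the Kulkas example above (an id→it interlanguage link pivoted through an it→en dictionary entry):

```python
# Toy data: one interlanguage link and one crowdsourced dictionary entry.
interlanguage = {("id", "Kulkas"): ("it", "Frigorifero")}
crowd_it_en = {"Frigorifero": "fridge"}

def project(interlanguage, crowd):
    """Pivot source-language page titles through a third language's
    dictionary to obtain new source->English observed translations."""
    observed = {}
    for (src_lang, src_title), (third_lang, third_title) in interlanguage.items():
        if third_title in crowd:
            observed[src_title] = crowd[third_title]
    return observed

print(project(interlanguage, crowd_it_en))  # {'Kulkas': 'fridge'}
```

Each projected pair becomes another observed cell in the translation matrix.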

Monolingual Signals for Translation
We define cold source words in our experiments as source words that have no associated WIKI translations and fewer than 2 associated WIKI+CROWD translations. For each cold source word f, we predict the score of its translation to each candidate English word e using the auxiliary formulation of xe,f (Equation 2). There are two auxiliary signals about the words that we use in our experiments: (1) bilingually informed word embeddings and (2) visual representations.
Bilingually Informed Word Embeddings For each language, we learn monolingual embeddings for its words by training a standard monolingual word2vec skipgram model (Mikolov et al., 2013b) on tokenized Wikipedia pages of that language using Gensim (Řehůřek and Sojka, 2010). We obtain 100-dimensional word embeddings with 15 epochs, 15 negative samples, a window size of 5, and a frequency cutoff of 5.
Given the two monolingual embedding spaces R^dF and R^dE of the source and target languages F and E, where dF and dE denote the dimensionality of the monolingual embedding spaces, we use the set of crowdsourced translations that are not in the test set as our seed bilingual translations and learn a mapping function W ∈ R^(dE×dF) that maps the target language vectors in the seed translations to their corresponding source language vectors. We learn two types of mapping, linear and non-linear, and compare their performances. The linear mapping (Mikolov et al., 2013a; Dinu et al., 2014) minimizes:

min W ∥X_E W − X_F∥_F

where, following the notation in (Vulić and Korhonen, 2016), X_E and X_F are matrices obtained by the respective concatenation of the target language and source language vectors that are in the seed bilingual translations. We solve this optimization problem using stochastic gradient descent (SGD).
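A minimal sketch of this linear-mapping step, with toy 3-dimensional "embeddings" and an invented ground-truth map (the real setup uses 100-dimensional word2vec vectors and a much larger seed lexicon):

```python
import random

random.seed(0)
d_e, d_f = 3, 3  # toy embedding dimensionalities

# Toy seed pairs: rows of X_E (target vectors) and X_F (source vectors),
# generated from a hidden ground-truth map so SGD has something to find.
X_E = [[random.gauss(0, 1) for _ in range(d_e)] for _ in range(50)]
W_true = [[1, 0, 0], [0, 2, 0], [0, 0, -1]]
X_F = [[sum(x[i] * W_true[i][j] for i in range(d_e)) for j in range(d_f)]
       for x in X_E]

# SGD on the squared Frobenius objective ||X_E W - X_F||_F^2.
W = [[0.0] * d_f for _ in range(d_e)]
eta = 0.01
for _ in range(200):
    for x, y in zip(X_E, X_F):
        pred = [sum(x[i] * W[i][j] for i in range(d_e)) for j in range(d_f)]
        for i in range(d_e):
            for j in range(d_f):
                W[i][j] -= eta * 2 * (pred[j] - y[j]) * x[i]

print(round(W[1][1], 2))  # converges toward W_true[1][1] = 2
```

Once W is learned, a target vector x_e is carried into the source space by the product x_e^T W, as described next.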
Once the map W is learned, all candidate target word vectors xe can be mapped into the source language embedding space R^dF by computing xe⊤W. Instead of the raw monolingual word embeddings xe, we use these bilingually-informed mapped word vectors xe⊤W as the auxiliary word features WORD-AUX to estimate xAUX e,f.

Visual Representations Pilot Study
Recent work (Vulić et al., 2016) has shown that combining word embeddings and visual representations of words can help achieve more accurate bilingual translations. Since the visual representation of a word seems to be language-independent (e.g., the concept of water has similar images whether expressed in English or French; see Figure 4), the visual representations of a word may be useful for inferring its translation and for complementing the information learned from text.

Figure 4: Five images for the French word eau and its top 4 translations, ranked using visual similarities of images associated with English words (Bergsma and Van Durme, 2011).
We performed a pilot study to include visual features as auxiliary features in our framework. We use a large multilingual corpus of labeled images (Anonymous, 2017) to obtain the visual representation of the words in our source and target languages. The corpus contains 100 images for up to 10K words in each of 100 foreign languages, plus images of each of their translations into English. For each of the images, a convolutional neural network (CNN) feature vector is also provided, following the method of Kiela et al. (2015). For each word, we use 10 images provided by this corpus and use their CNN features as auxiliary visual features VISUAL-AUX to estimate xAUX e,f.

Combining Signals
During training, we train the parameters of each formulation of xe,f (the xMF and xAUX variants). During testing, we use the following back-off scheme to predict translation scores given a source word f and a candidate target word e: if f is a cold word, we back off to the auxiliary formulations; otherwise, we use the MF formulations.
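The back-off can be sketched as follows, using the cold-word definition from the experiments (no WIKI translations and fewer than 2 WIKI+CROWD translations); the scoring functions and dictionaries below are stand-ins for illustration:

```python
def is_cold(f, wiki, wiki_crowd):
    """Cold word: no WIKI translations and fewer than 2 WIKI+CROWD ones."""
    return len(wiki.get(f, [])) == 0 and len(wiki_crowd.get(f, [])) < 2

def score(e, f, wiki, wiki_crowd, x_mf, x_aux):
    """Back off to the auxiliary score only for cold source words."""
    return x_aux(e, f) if is_cold(f, wiki, wiki_crowd) else x_mf(e, f)

# Toy observed translations: "rumah" has a WIKI entry, "tidur" does not
# and has only one WIKI+CROWD entry, so it is cold.
wiki = {"rumah": ["house"]}
wiki_crowd = {"rumah": ["house", "home"], "tidur": ["sleep"]}
print(is_cold("tidur", wiki, wiki_crowd))  # True: cold word
print(is_cold("rumah", wiki, wiki_crowd))  # False: use the MF score
```

This mirrors the rationale stated earlier: bilingual signals are precise but sparse, so the noisier auxiliary score is used only when MF has too little evidence.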

Results
We conduct experiments using the following variants of our model, each of which progressively incorporates more signals to rank candidate English target words. When a variant uses more than one formulation of xe,f, it applies them using the back-off scheme that we have described before.
• BPR W uses only xMF−W.

We evaluate the performance of BPR WE against a baseline that is the state-of-the-art model of Vulić and Korhonen (2016) on benchmark VULIC1000 (Table 1). The baseline (MNN) learns a linear mapping between monolingual embedding spaces and finds translations in an unsupervised manner: it ranks candidate target words based on their cosine similarities to the source word in the mapped space. As seed translation pairs, MNN uses mutual nearest neighbor pairs obtained from pseudo-bilingual corpora constructed from unannotated monolingual data of the source and target languages (Vulić and Moens, 2016). We train MNN and our models using the same 100-dimensional word2vec monolingual word embeddings.
Table 1 shows the benefit of learning translations in a supervised manner.
BPR+MNN uses the same MNN seed translations as MNN, obtained from unannotated monolingual data of English and the foreign language, to learn the linear mapping between their embedding spaces. However, unlike MNN, BPR+MNN uses the mapped word vectors to predict rankings in a supervised manner with the BPR objective. This results in higher accuracies than MNN. Using seed translations from crowdsourced dictionaries to learn the linear mapping (BPR LN) improves accuracies even further compared to using MNN seed translations obtained from unannotated data. Finally, BPR WE, which learns translations in a supervised manner and uses third language translations and a non-linear mapping (trained with crowdsourced translations not in the test set), performs consistently and very significantly better than the state-of-the-art on all benchmark test sets. This shows that incorporating more and better signals of translation can improve performance significantly.
Evaluating on CROWDTEST, we observe a similar trend over all 27 languages (Figure 5). In particular, we see that BPR W and BPR W+C suffer from the cold start issue, where there are too few or no observed translations in the matrix to make accurate predictions. Incorporating auxiliary information in the form of bilingually-informed word embeddings improves the accuracy of the predictions dramatically. For many languages, learning these bilingually-informed word embeddings with a non-linear mapping improves accuracy even more. The top accuracy scores achieved by the model vary across languages and seem to be influenced by the amount of data, i.e., Wikipedia tokens and seed lexicon entries, available for training. Somali (so), for example, has only 0.9 million tokens available in its Wikipedia for training the word2vec embeddings and only 3 thousand seed translations for learning the mapping between the word embedding spaces. In comparison, Spanish (es) has over 500 million tokens available in its Wikipedia and 11 thousand seed translations. We also believe that our choice of tokenization may not be suitable for some languages: we use a simple regular-expression based tokenizer for the many languages that do not have a trained NLTK tokenization model. This may hurt performance on languages such as Vietnamese (vi), on which performance is low despite its large Wikipedia corpus. Some example translations of an Indonesian word produced by different variants of our model are shown in Table 2. Adding third language translation signals on top of the bilingually-informed auxiliary signals improves accuracies even further. [7] The accuracies achieved by BPR WE on these languages are significantly better than previously reported accuracies (Irvine and Callison-Burch, 2017) on test sets constructed from the same crowdsourced dictionaries (Pavlick et al., 2014). [8]
The accuracies across languages appear to improve consistently with the amount of signals input to the model. In the following experiments, we investigate how sensitive these improvements are to the size of the training data.
In Figure 6, we show accuracies obtained by BPR WE with varying sizes of the seed translation lexicon used to train its mapping. The results show that a seed lexicon size of 5K is enough across languages to achieve optimum performance. This finding is consistent with the finding of Vulić and Korhonen (2016) that accuracies peak at about 5K seed translations across all their models and languages. For future work, it will be interesting to investigate further why this is the case: e.g., how the optimal seed size is related to the quality of the seed translations and the size of the test set, and how the optimum seed size should be chosen.

Lastly, we experiment with incorporating auxiliary visual signals for learning translations on the multilingual image corpus (Anonymous, 2017). The corpus contains 100 images for up to 10K words in each of 100 foreign languages, plus images of each of their translations into English. We train and test our BPR VIS model to learn translations of 5 low- and high-resource languages in this corpus, using the translations of up to 10K words in each of these languages as the test set. We compare the quality of our translations with the baseline CNN-AVGMAX (Bergsma and Van Durme, 2011), which considers cosine similarities between individual images associated with the source and target words and takes the average of their maximum similarities as the final similarity between a source and a target word. For each source word, the candidate target words are ranked according to these final similarities. This baseline has been shown to be effective for inducing translations from images, both in uni-modal (Bergsma and Van Durme, 2011; Kiela et al., 2015) and multi-modal models (Vulić et al., 2016).

[7] Actual improvement per language depends on the coverage of the Wikipedia interlanguage links for that language.
[8] The comparison, however, cannot be made apples-to-apples since the way Irvine and Callison-Burch (2017) select test sets from the crowdsourced dictionaries may be different, and they do not release their test sets.
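A minimal sketch of the CNN-AVGMAX similarity, with toy 2-dimensional vectors standing in for the real CNN image features:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def avgmax(src_images, tgt_images):
    """CNN-AVGMAX: for each source image, take its maximum cosine
    similarity to any target image, then average these maxima."""
    return sum(max(cosine(s, t) for t in tgt_images)
               for s in src_images) / len(src_images)

# Toy "CNN features", two images per word: eau should look like water.
eau_images = [[1.0, 0.1], [0.9, 0.2]]
water_images = [[1.0, 0.0], [0.8, 0.3]]
fire_images = [[0.0, 1.0], [0.1, 0.9]]
print(avgmax(eau_images, water_images) > avgmax(eau_images, fire_images))
```

Candidate target words are then ranked by this AVGMAX score for each source word.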
As seen in Table 3, incorporating additional bilingual and textual signals alongside the visual signals improves translations. Accuracies on this image corpus's test sets are lower overall because they contain many translations drawn from our crowdsourced dictionaries; thus we have far fewer seeds to train our word embedding mapping. Furthermore, these test sets contain 10 times as many translations as our previous test sets. Using more images, instead of just 10 per word, may also improve performance.

Conclusion
In this paper, we propose a novel framework for combining diverse, sparse and potentially noisy multi-modal signals for translations.We view the problem of learning translations as a matrix completion task and use an effective and extendable matrix factorization approach with BPR to learn translations.
We show the effectiveness of our approach in large scale experiments. Starting from minimally trained monolingual word embeddings, we consistently and very significantly outperform state-of-the-art approaches by combining these features with other features in a supervised manner using BPR. Since our framework is modular, each input to our prediction can be improved separately to improve the whole system, e.g., by learning better word embeddings or a better mapping function to input into the auxiliary component. Our framework is also easily extendable to incorporate more bilingual and auxiliary signals of translation equivalence.

Figure 1 :
Figure 1: Our framework allows us to use a diverse range of signals to learn translations, including incomplete bilingual dictionaries, information from related languages (like Indonesian loan words from Dutch shown here), word embeddings, and even visual similarity cues.
However, since the bilingual signals that are input to xMF e,f are often precise but sparse, while the monolingual signals that are input to xAUX e,f are often noisy but not sparse, in our model we back off to the less precise xAUX e,f only for cold source words that have no or too few associated translations (more details are given in the experiments, Section 4). For other source words, we use xMF e,f to predict.

Figure 3 :
Figure 3: Wikipedia pages with observed translations to the source (id) and the target (en) languages act as a third language in the matrix.
• xMF−W e,f is trained using WIKI translations as the set of observed translations T
• xMF−W+C e,f is trained using WIKI+CROWD translations as the set of observed translations T
• xAUX−WE e,f is trained using the set of word identities T identity and WORD-AUX as θ f
• xAUX−VIS e,f is trained using the set of word identities T identity and VISUAL-AUX as θ f

Figure 5 :
Figure 5: Acc 10 on CROWDTEST across all 27 languages shows that adding more and better signals for translation improves translation accuracies. The top accuracies achieved by our model, BPR WE, vary across languages and appear to be influenced by the amount of data (Wikipedia tokens and seed translations) and the quality of tokenization available for the language.

Figure 6 :
Figure 6: Acc 10 across different seed lexicon sizes.

References
Sarath Chandar AP, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C…
…. Aligning words using matrix factorisation. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 502.
Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2014. BilBOWA: Fast bilingual distributed representations without word alignments. stat 1050:9.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics.
Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian personalized ranking from implicit feedback. In Thirtieth AAAI Conference on Artificial Intelligence.
Ann Irvine and Chris Callison-Burch. 2013. Supervised bilingual lexicon induction with multiple monolingual signals. Citeseer.
Ann Irvine and Chris Callison-Burch. 2016. End-to-end statistical machine translation with zero or small parallel texts. Natural Language Engineering 22(04):517–548.

Table 2 :
Top-5 translations of the Indonesian word kesadaran (awareness) using different model variants

Table 3 :
Table 3: Results on the multilingual image corpus test set (Anonymous, 2017). We use the visual features (CNN features) of the words in this set as auxiliary visual signals to predict their translations. In this experiment, we weigh the auxiliary word embedding and visual features equally. To train the mapping of our word embedding features, we use as seeds the crowdsourced translations not in the test set.