Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation

We propose a simple log-bilinear softmax-based model for vocabulary expansion in machine translation. Our model uses word embeddings trained on large unlabelled monolingual corpora and learns a mapping over a fairly small, word-to-word bilingual dictionary. Given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language using the trained bilingual embeddings. We integrate these translation options into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English–Spanish language pair. On an out-of-domain test set, we obtain a significant improvement of 3.9 BLEU points.


Introduction
Data-driven machine translation systems can translate words that have been seen in the training parallel corpora, but translating unseen words remains a major challenge even for the best performing systems. The amount of parallel data is finite (and sometimes scarce) and, therefore, word types such as named entities, domain-specific content words, or infrequent terms are rare. This lack of information can result in incomplete or erroneous translations. The problem has been actively studied in the field of machine translation (MT) (Habash, 2008; Daumé III and Jagarlamudi, 2011; Marton et al., 2009; Rapp, 1999; Dou and Knight, 2012; Irvine and Callison-Burch, 2013). Lexicon-based resources have been used to resolve unseen content words by exploiting a combination of monolingual and bilingual resources (Rapp, 1999; Callison-Burch et al., 2006; Zhao et al., 2015). In this context, distributed word representations, or word embeddings (WE), have recently been applied to unseen-word problems (Mikolov et al., 2013b; Zou et al., 2013). In general, word representations capture rich linguistic relationships, and several works (Gouws et al., 2015; Wu et al., 2014) try to use them to improve MT systems. However, very few approaches use them directly to resolve the out-of-vocabulary (OOV) problem in MT systems.

(* This work was done while the authors were at the TALP Research Center, Universitat Politècnica de Catalunya, Barcelona.)
Previous research in MT suggests that a significant number of named entities (NEs) can be handled with simple pre- or post-processing methods, e.g., transliteration techniques (Hermjakob et al., 2008; Al-Onaizan and Knight, 2002). However, a change in domain results in a significant increase in the number of unseen content words, for which such simple methods are sub-optimal (Zhang et al., 2012).
Our work is inspired by recent advances (Zou et al., 2013; Zhang et al., 2014) in applying word embeddings to vocabulary expansion in the context of statistical machine translation (SMT). Our focus in this paper is to resolve unseen content words by using continuous word embeddings in both languages and learning a model over a small seed lexicon to map the two embedding spaces. In this respect, our work is similar to Ishiwatari et al. (2016), where the authors map distributional representations using a linear regression method similar to Mikolov et al. (2013b) and insert a new feature based on a cosine similarity metric into the MT system. On the other hand, there is a rich body of recent literature on obtaining bilingual word embeddings from sentence-aligned or document-aligned corpora (Bhattarai, 2012; Gouws et al., 2015; Kočiský et al., 2014). Our approach is significantly different: we obtain embeddings separately on monolingual corpora and then use supervision in the form of a small, sparse bilingual dictionary, in some respects similar to Faruqui and Dyer (2014). We use a simple yet principled method to obtain a probabilistic conditional distribution over target words directly, and these probabilities allow us to expand the translation model for new words.
The rest of the paper is organised as follows. Section 2 presents the log-bilinear softmax model, and its integration into an SMT system. The experimental work is described in Section 3. Finally, we conclude and sketch some avenues for future work.

Mapping Continuous Word Representations using a Bilinear Model
Definitions. Let E and F be the vocabularies of the source and target languages, and let e ∈ E and f ∈ F be words in these languages, respectively. We assume we have a source-word-to-target-word dictionary e → f. We also assume access to distributed word embeddings in both languages, φ_s for the source and φ_t for the target, where φ(·) ∈ R^n denotes the n-dimensional distributed representation of a word. The task we are interested in is to learn a model of the conditional probability distribution Pr(f|e): given a word e in the source language, say English, we want a conditional probability distribution over all words f in the target language.
Log-Bilinear Softmax Model. We formulate the problem as a bilinear prediction task, as proposed by Madhyastha et al. (2014a), and extend it to the bilingual setting. The model uses word embeddings in both languages with no additional features. The basic function is formulated as a log-bilinear softmax model and takes the following form:

Pr(f|e; W) = exp(φ_s(e)^⊤ W φ_t(f)) / Σ_{f′∈F} exp(φ_s(e)^⊤ W φ_t(f′))

Essentially, our problem reduces to: a) obtaining the word embeddings of the vocabularies of both languages from a large monolingual corpus, and b) estimating W given a relatively small dictionary. That is, to learn W we use the source-word-to-target-word dictionary as training supervision. The dictionary can be a true bilingual dictionary or the word alignments generated by the SMT system; therefore, no resources beyond the training parallel corpus are needed.
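The bilinear softmax above can be sketched in a few lines of NumPy. This is an illustrative implementation under the assumption that both embedding spaces have the same dimension n (as in the paper); the function and variable names are ours, not the authors'.

```python
import numpy as np

def translation_distribution(e_vec, W, target_embs):
    """Pr(f | e; W) for every target word f, scored with the bilinear
    form phi_s(e)^T W phi_t(f) and normalised with a softmax.

    e_vec:       (n,) source embedding phi_s(e)
    W:           (n, n) bilingual mapping matrix
    target_embs: (|F|, n) matrix whose rows are target embeddings phi_t(f)
    """
    scores = (e_vec @ W) @ target_embs.T   # (|F|,) bilinear scores
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()
```

Sorting the returned vector then yields the probabilistic list of translation candidates used later in the decoder.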
We learn W by minimizing the negative log-likelihood of the dictionary with a regularized (relaxed low-rank) objective:

L(W) = − Σ_{(e,f)} log Pr(f|e; W) + λ‖W‖_p,

where λ is a constant that controls the capacity of W. To find the optimum, we follow previous work (Madhyastha et al., 2014b) and use an optimization scheme based on Forward-Backward Splitting (FOBOS) (Duchi and Singer, 2009).
We experiment with two regularization schemes: p = 2, the ℓ2 regularizer, and p = ∗, the ℓ∗ (nuclear norm) regularizer. We find that both norms have approximately similar performance; however, the nuclear-norm-regularized W has lower capacity and hence a smaller number of parameters. This was also observed by Bach (2008) and Madhyastha et al. (2014a,b). In general, we can apply the ideas of Mikolov et al. (2013b) to speed up training, since this model is equivalent to a softmax model. We could obtain models with similar properties by changing the loss from a bilinear log softmax to a bilinear margin-based loss; we leave this exploration for future work.
A by-product of regularizing with the ℓ∗ norm is a set of lower-dimensional, language-aligned, compressed embeddings for both languages. This is possible because of the induced low-rank structure of W: assume W has rank k, with k < n, such that W ≈ U_k V_k^⊤. Then the bilinear product factorizes as

φ_s(e)^⊤ W φ_t(f) ≈ (φ_s(e)^⊤ U_k)(V_k^⊤ φ_t(f)),

which gives us φ_s(e)^⊤ U_k and V_k^⊤ φ_t(f) as compressed embeddings with shared properties. These are similar to the CCA-based projections obtained in Faruqui and Dyer (2014).
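The compressed embeddings can be extracted with a truncated SVD of the learned W; a sketch (function names are ours) that folds the singular values into the source-side factor:

```python
import numpy as np

def compressed_embeddings(W, src_embs, tgt_embs, k):
    """Truncate W to rank k via SVD, W ~= U_k V_k^T, and project both
    embedding tables into the shared k-dimensional space.

    src_embs: (|E|, n) source embedding table
    tgt_embs: (|F|, n) target embedding table
    Returns (|E|, k) and (|F|, k) compressed tables.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Uk = U[:, :k] * s[:k]        # fold singular values into the source factor
    Vk = Vt[:k, :]
    return src_embs @ Uk, tgt_embs @ Vk.T
```

After this projection, the bilinear score reduces to a plain dot product between the two compressed vectors, so nearest-neighbour translation lookup becomes a standard similarity search.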
Integrating the Probabilistic List into the SMT System. We integrate the probabilistic list of translation options into the phrase-based decoder using the standard log-linear approach (Och and Ney, 2002). For a word pair (e, f), the decoder searches for the target word f that maximizes a linear combination of feature functions:

f̂ = argmax_f [ Σ_i λ_i h_i(f, e) + λ_oov log Pr(f|e; W) ],

where λ_i is the weight associated with feature h_i(f, e), and λ_oov is the weight associated with the unseen-word feature.

Experiments

Intrinsic Evaluation. We first evaluate the quality of our bilingual embeddings on the bilingual dictionary induction task (Vulic and Moens, 2015). We experiment with the English-German and English-French language pairs, so that we can induce the dictionaries for the five systems. As seen in Table 1, our full 300-dimensional embeddings perform better than the BiCCA-based model, whereas the 100-dimensional compressed embeddings perform slightly worse but are still competitive. Since our model and BiCCA use similar supervision, we obtain similar results, and we differ in a similar way from models that use stronger supervision, such as the BiCVM- and BiSkip-based embeddings.
MT Data and System Settings. To estimate the monolingual WE, we use the CBOW algorithm as implemented in the Word2Vec package (Mikolov et al., 2013a) with a 5-token window. We obtain 300-dimensional vectors for English and Spanish from a 2015 Wikipedia dump and the Quest data². The final corpus contains 2.27 billion tokens for English and 0.84 billion for Spanish. We remove any sentence of the test set that is contained in our corpus. The coverage of our test sets is 97% of the words.

¹ We also used the script provided at https://github.com/shyamupa/biling-survey
² http://statmt.org/~buck/wmt13qe/wmt13qe_t13_t2_MT_corpus.tgz
To train the log-bilinear softmax model, we use the dictionary from the Apertium project (Forcada et al., 2011). The dictionary contains 37,651 words; 70% of them are used for training and 30% as a development set for model selection. The best model obtains an average precision@1 of 86% on the development set.
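Precision@1 here simply measures how often the model's top-ranked candidate matches the dictionary entry; a minimal sketch of the metric (names and data layout are our own illustration):

```python
def precision_at_k(ranked, gold, k=1):
    """Fraction of held-out dictionary entries whose reference
    translation appears among the top-k model candidates.

    ranked: dict mapping each source word to its model-ranked list of
            candidate translations (most probable first)
    gold:   dict mapping each source word to its reference translation
    """
    hits = sum(f in ranked.get(e, [])[:k] for e, f in gold.items())
    return hits / len(gold)
```

The same function with larger k is useful for inspecting how deep in the candidate list the correct translation tends to sit.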
A state-of-the-art phrase-based SMT system is trained on the Europarl corpus (Koehn, 2005) for the English-to-Spanish language pair. We use a 5-gram language model estimated on the target side of the corpus with interpolated Kneser-Ney discounting using SRILM (Stolcke, 2002). Additional monolingual data available within the Quest corpora is used to build a larger language model with the same characteristics. Word alignment is done with GIZA++ (Och and Ney, 2003), and both phrase extraction and decoding are done with the Moses package (Koehn et al., 2007). At decoding time, Moses allows additional translation pairs with their associated probabilities to be supplied for selected words via XML markup. We take advantage of this feature to add our probabilistic estimations for each OOV. Since, by definition, OOV words do not appear in the parallel training corpus, they are not present in the translation model either, and the new translation options only interact with the language model. The weights of the model with the additional translation options are optimized with MERT (Och, 2003) against the BLEU (Papineni et al., 2002) evaluation metric on the NewsCommentaries 2012 (NewsDev) set. We test our systems on the NewsCommentaries 2013 set (NewsTest) for an in-domain evaluation and on a test set from Smith et al. (2010) for an out-of-domain evaluation (WikiTest).
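As an illustration of the XML markup mechanism, the snippet below generates a marked-up source token from the BWE candidate list. The tag and attribute names ('translation' and 'prob', with alternatives separated by '||') follow the Moses xml-input convention, but should be verified against the Moses version in use; the function name is ours.

```python
def moses_xml_markup(oov, options):
    """Wrap an OOV source token in Moses-style xml-input markup carrying
    the BWE translation options.

    oov:     the out-of-vocabulary source token
    options: list of (target_word, probability) pairs, most probable first
    """
    words = "||".join(w for w, _ in options)
    probs = "||".join("%.4f" % p for _, p in options)
    return '<x translation="%s" prob="%s">%s</x>' % (words, probs, oov)
```

The decoder then has to be run with the -xml-input flag so that these options compete with (or override) the phrase table for the marked token.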
The domain distance of a test set is characterized by its number of OOVs. Table 2 shows the statistics of these sets, paying special attention to the OOVs of the basic SMT system. Less than 3% of the tokens are OOVs for the News data (OOV_all), whereas it is more than 7% for Wikipedia's. In our experiments, we distinguish between OOVs that are named entities and the remaining content words (OOV_CW). Only about 0.5% (NewsTest) and 1.8% (WikiTest) of the tokens fall into the latter category, but we show that they are relevant for the final performance.
MT Experiments. We consider two baseline systems: the first does not output any translation for OOVs (noOOV), it simply ignores the token; the second outputs a verbatim copy of the OOV as its translation (verbatimOOV). Table 3 shows the performance of these systems under three widely used evaluation metrics: TER (Snover et al., 2006), BLEU, and METEOR (MTR) (Banerjee and Lavie, 2005). Including the verbatim copy improves all the lexical evaluation metrics. Especially for NEs and acronyms (80% of the OOVs in our sets), this is a hard baseline to beat, as in most cases the same word is the correct translation.
We then enrich the systems with information gathered from the large monolingual corpora in two ways: using a bigger language model (BLM) and using our newly proposed log-bilinear model over word embeddings (BWE). BLMs are important for improving the fluency of the translations; however, they may not help resolve OOVs, as they can only promote translations already available in the translation model. BWEs, on the other hand, make new vocabulary on the topic of the otherwise-OOV words available to the decoder. Given the large percentage of NEs in the test sets (Table 2), our models add the source word as an additional option to the list of target words, mimicking the verbatimOOV system. Table 3 includes seven systems with the additional monolingual information. Three of them add, at decoding time, the top-n translation options given by the BWE for each OOV: the BWE system uses the top 50 for all OOVs, BWE_CW50 also uses the top 50 but only for content words other than named entities, and BWE_CW10 limits the list to 10 elements. BLM is the same as the verbatimOOV baseline but with the large language model. BLM+BWE, BLM+BWE_50, and BLM+BWE_10 combine the three BWE systems with the large language model.
In NewsTest, most of the unseen words are named entities, and using BWEs to translate them barely improves the translation. The reason is that embeddings of related NEs are usually near-equivalent; this affects the overall integration of the scores into the decoder and introduces ambiguity into the system. However, we observe that the decoder benefits from the information on content words, especially for the out-of-domain WikiTest set. In this case, with the constrained list of alternative translations (BWE_CW10), we achieve an improvement of 2.75 BLEU points.
The addition of the large language model improves the results significantly. When combined with the BWEs, we observe that the BWEs clearly help in the translation of WikiTest but do not seem as relevant in the in-domain set. We achieve a statistically significant improvement of 3.9 BLEU points with the combined BLM and BWE system (BLM+BWE_10 with respect to BLM) on WikiTest (p<0.001); the improvement on NewsTest is not statistically significant (p-value=0.29). The number of translation options in the list is also relevant: the constrained top-10 list outperforms the top-50 one. To estimate the contribution of the bilingual embeddings to the final translation, we manually evaluated the translation of WikiTest with the BWE_CW50 model. For the translation of the OOVs, we obtain an accuracy of 68%; that is, the BWE gives the correct translation option at least 68% of the time. We note that, even when the correct translation option is in the list obtained by the BWE, the decoder may choose not to use it.
In general, we observe that when our model fails, in most cases the word to be translated is part of a multiword expression or a named entity. Table 4 presents some of these examples. The first two examples, galaxy and nymphs, are nouns for which the first option is the correct translation. The problem is harder for named entities: as we observe in the table, the name Stuart in English has William as its most probable translation in Spanish, while the correct translation, Estuardo, appears only as the 48th choice. Our model is also unable to generate multiword expressions, as shown in the table for the English word folksong, whose correct translation is canción folk. This would require two Spanish words to be translated correctly; our model does, however, obtain the words canción and folclore among the most probable translation options.

Conclusions
We have presented a method for resolving OOVs in SMT that performs vocabulary expansion with a simple log-bilinear softmax-based model. The model estimates bilingual word embeddings and, as a by-product, generates low-dimensional compressed embeddings for both languages. The addition of new translation options for a mere 1.8% of the words has allowed the system to obtain a relative improvement of 13% in BLEU (3.9 points) on out-of-domain data. For in-domain data, where the number of unseen content words is small, improvements are more moderate.
The analysis of the results shows how performance suffers from not considering multiword expressions. The automatic detection of these elements in the monolingual corpus, together with the addition of one-to-many dictionary entries for learning the W matrix, could alleviate this problem and will be considered in future work.
We also note that this approach can be extended directly to neural machine translation systems, where its effects could be even larger due to their limited vocabulary. While one popular approach to OOVs is to use subword units (Sennrich et al., 2016) to resolve unknown words, dictionary-based approaches, where an unknown word is translated by its corresponding entry in a dictionary or an (SMT) translation table, have also been used (Luong et al., 2015b). Our method can go further in the latter direction by learning correspondences between source and target vocabularies using large monolingual corpora and either a small dictionary or the word alignments.