Handling Homographs in Neural Machine Translation

Homographs, words with different meanings but the same surface form, have long caused difficulty for machine translation systems, as it is difficult to select the correct translation based on the context. However, with the advent of neural machine translation (NMT) systems, which can theoretically take into account global sentential context, one may hypothesize that this problem has been alleviated. In this paper, we first provide empirical evidence that existing NMT systems in fact still have significant problems in properly translating ambiguous words. We then proceed to describe methods, inspired by the word sense disambiguation literature, that model the context of the input word with context-aware word embeddings that help to differentiate the word sense before feeding it into the encoder. Experiments on three language pairs demonstrate that such models improve the performance of NMT systems both in terms of BLEU score and in the accuracy of translating homographs.


Introduction
Neural machine translation (NMT; Sutskever et al. (2014); Bahdanau et al. (2015), §2), a method for MT that performs translation in an end-to-end fashion using neural networks, is quickly becoming the de-facto standard in MT applications due to its impressive empirical results. One of the drivers behind these results is the ability of NMT to capture long-distance context using recurrent neural networks in both the encoder, which takes the input and turns it into a continuous-space representation, and the decoder, which tracks the target-sentence state, deciding which word to output next. As a result of this ability to capture long-distance dependencies, NMT has achieved great improvements in a number of areas that have bedeviled traditional methods such as phrase-based MT (PBMT; Koehn et al. (2003)), including agreement and long-distance syntactic dependencies (Neubig et al., 2015; Bentivogli et al., 2016).
* Equal contribution. † Now at Snap Inc. ‡ Now at Google.
Footnote 1: Code for our translation models is available at https://goo.gl/oaiqoT
Figure 1 (excerpt; source sentence only): Charges against four other men were found not proven.
One other phenomenon that was poorly handled by PBMT was homographs: words that have the same surface form but multiple senses. As a result, PBMT systems required specific separate modules to incorporate long-term context, performing word-sense (Carpuat and Wu, 2007b; Pu et al., 2017) or phrase-sense (Carpuat and Wu, 2007a) disambiguation to improve their handling of these phenomena. Thus, we may wonder: do NMT systems suffer from the same problems when translating homographs? Or are the recurrent nets applied in the encoding step and the strong language model in the decoding step enough to alleviate all problems of word sense ambiguity?
In §3 we first attempt to answer this question quantitatively by examining the word translation accuracy of a baseline NMT system as a function of the number of senses that each word has. Results demonstrate that standard NMT systems make a significant number of errors on homographs, a few of which are shown in Fig. 1.
With this result in hand, we propose a method for more directly capturing contextual information that may help disambiguate difficult-to-translate homographs. Specifically, we learn from neural models for word sense disambiguation (Kalchbrenner et al., 2014; Iyyer et al., 2015; Kågebäck and Salomonsson, 2016; Yuan et al., 2016; Šuster et al., 2016), examining three methods inspired by this literature (§4). In order to incorporate this information into NMT, we examine two methods: gating the word embeddings in the model (similarly to Choi et al. (2017)), and concatenating the context-aware representation to the word embedding (§5).
To evaluate the effectiveness of our method, we compare our context-aware models with a strong baseline (Luong et al., 2015) on the English-German, English-French, and English-Chinese WMT datasets. We show that our proposed model outperforms the baseline in overall BLEU score across all three language pairs. Quantitative analysis demonstrates that our model performs better on translating homographs. Lastly, we show sample translations of the baseline system and our proposed model.

Neural Machine Translation
We follow the global-general-attention NMT architecture with input-feeding proposed by Luong et al. (2015), which we will briefly summarize here. The neural network models the conditional distribution over translations $Y = (y_1, y_2, \ldots, y_m)$ given a sentence in the source language $X = (x_1, x_2, \ldots, x_n)$ as $P(Y|X)$. An NMT system consists of an encoder that summarizes the source sentence $X$ as a vector representation $h$, and a decoder that generates a target word at each time step conditioned on both $h$ and the previous words. The conditional distribution is optimized with cross-entropy loss at each decoder output.
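The encoder is usually a uni-directional or bi-directional RNN that reads the input sentence word by word. In the more standard bi-directional case, before being read by the RNN unit, each word in $X$ is mapped to an embedding in continuous vector space by a function $f_e$:
$$f_e(x_t) = M_e \cdot \mathbf{1}(x_t) \tag{1}$$
$M_e \in \mathbb{R}^{|V_s| \times d}$ is a matrix that maps a one-hot representation of $x_t$, $\mathbf{1}(x_t)$, to a $d$-dimensional vector space (that is, it selects the row of $M_e$ indexed by $x_t$), and $V_s$ is the source vocabulary. We call the word embedding computed this way the Lookup embedding. The word embeddings are then read by a bi-directional RNN:
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{RNN}}_e(\overrightarrow{h}_{t-1}, f_e(x_t)) \tag{2}$$
$$\overleftarrow{h}_t = \overleftarrow{\mathrm{RNN}}_e(\overleftarrow{h}_{t+1}, f_e(x_t)) \tag{3}$$
After being read by both RNNs, we can compute the actual hidden state at step $t$, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, and the encoder summarized representation $h = h_n$. The recurrent units $\overrightarrow{\mathrm{RNN}}_e$ and $\overleftarrow{\mathrm{RNN}}_e$ are usually either LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Chung et al., 2014).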
The decoder is a uni-directional RNN that decodes the $t$th target word conditioned on (1) the previous decoder hidden state $g_{t-1}$, (2) the previous word $y_{t-1}$, and (3) the weighted sum of encoder hidden states $a_t$. The decoder maintains the $t$th hidden state $g_t$ as follows:
$$g_t = \overrightarrow{\mathrm{RNN}}_d(g_{t-1}, f_d(y_{t-1}), a_t)$$
Again, $\overrightarrow{\mathrm{RNN}}_d$ is either an LSTM or a GRU, and $f_d$ is a mapping function in target language space.
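The general attention mechanism for computing the weighted encoder hidden states $a_t$ first computes the similarity between $g_{t-1}$ and each encoder hidden state $h_i$ for $i = 1, 2, \ldots, n$.
The similarities are then normalized through a softmax layer, which results in the weights for the encoder hidden states.
We can then compute $a_t$ as the sum of the encoder hidden states weighted by these normalized similarities. Finally, we compute the distribution over $y_t$ as $p(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_2 \hat{g}_t)$, where $\hat{g}_t$ is the attentional hidden state that combines $g_t$ and $a_t$, as in Luong et al. (2015).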
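The attention steps described in this section can be written out as the following sketch. It follows the "general" scoring function of Luong et al. (2015); the source-position index $i$ and the matrix names $W_a$ and $W_1$ are assumed notation chosen for illustration and are not necessarily the symbols of the original equations.

```latex
\begin{align}
s_{t,i} &= g_{t-1}^{\top} W_a h_i, \qquad i = 1, \dots, n
  && \text{(similarity)} \\
\alpha_{t,i} &= \frac{\exp(s_{t,i})}{\sum_{j=1}^{n} \exp(s_{t,j})}
  && \text{(softmax normalization)} \\
a_t &= \sum_{i=1}^{n} \alpha_{t,i}\, h_i
  && \text{(weighted encoder states)} \\
\hat{g}_t &= \tanh\left(W_1 [g_t; a_t]\right)
  && \text{(attentional hidden state)}
\end{align}
```

Under these assumptions, the output distribution is then obtained by applying $\mathrm{softmax}(W_2 \hat{g}_t)$ as stated in the text.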

NMT's Problems with Homographs
As described in Eqs. (2) and (3), NMT models encode the words using recurrent encoders, theoretically endowing them with the ability to handle homographs through global sentential context. However, despite the fact that they have this ability, our qualitative observation of NMT results revealed a significant number of ambiguous words being translated incorrectly, casting doubt on whether the standard NMT setup is able to appropriately learn parameters that disambiguate these word choices.
To demonstrate this more concretely, in Fig. 2 we show the translation accuracy of an NMT system with respect to words of varying levels of ambiguity. Specifically, we use the best baseline NMT system to translate three different language pairs from the WMT test sets (detailed in §6) and plot the F1 score of word translations by the number of senses that they have. The number of senses for a word is acquired from the Cambridge English dictionary, after excluding stop words. We evaluate the translation performance of words on the source side by aligning them to the target side using fast-align (Dyer et al., 2013). The aligner outputs the set of target words to which each source word aligns, for both the reference translation and the model translation. The F1 score is calculated between the two sets of words.
After acquiring the F1 score for each word, we bucket the F1 scores by the number of senses, and plot the average score of four consecutive buckets as shown in Fig. 2. As we can see from the results, the F1 score for words decreases as the number of senses increases for all three language pairs. This demonstrates that the translation performance of current NMT systems on words with more senses is significantly lower than on words with fewer senses. From this result, it is evident that modern NMT architectures are not enough to resolve the problem of homographs on their own. The result corresponds to the findings in prior work (Rios et al., 2017).
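The per-word scoring and bucketing described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for Fig. 2: the function names and the toy records are hypothetical, and the alignment step (fast-align) is assumed to have already produced the aligned target words for each source word.

```python
from collections import defaultdict


def set_f1(reference_words, hypothesis_words):
    """F1 between the two sets of target words aligned to one source word."""
    ref, hyp = set(reference_words), set(hypothesis_words)
    overlap = len(ref & hyp)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_f1_by_senses(records):
    """Average F1 per sense count.

    records: iterable of (num_senses, reference_words, hypothesis_words).
    """
    buckets = defaultdict(list)
    for num_senses, ref_words, hyp_words in records:
        buckets[num_senses].append(set_f1(ref_words, hyp_words))
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Hypothetical toy data: (number of senses, aligned reference words, aligned system words)
toy_records = [
    (1, ["Haus"], ["Haus"]),
    (1, ["Baum"], ["Baum"]),
    (5, ["Anklage"], ["Gebühren"]),
    (5, ["trifft", "sich"], ["trifft"]),
]
print(average_f1_by_senses(toy_records))
```

Smoothing over four consecutive buckets, as done for the plot, would be a further averaging step on top of the returned dictionary.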

Neural Word Sense Disambiguation
Word sense disambiguation (WSD) is the task of resolving the ambiguity of homographs (Ng and Lee, 1996; Mihalcea and Faruque, 2004; Zhong and Ng, 2010; Di Marco and Navigli, 2013; Chen et al., 2014; Camacho-Collados et al., 2015), and we hypothesize that by learning from these models we can improve the ability of the NMT model to choose the correct translation for these ambiguous words. Recent research tackles this problem with neural models and has shown state-of-the-art results on WSD datasets (Kågebäck and Salomonsson, 2016; Yuan et al., 2016). In this section, we will summarize three methods for WSD which we will further utilize as three different context networks to improve NMT.
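Neural bag-of-words (NBOW) The NBOW model represents the input sentence by a single context vector, the average of the word embeddings of the input sequence: $c_t = \frac{1}{n} \sum_{k=1}^{n} f_c(x_k)$, where $f_c$ is the embedding function defined in the BiLSTM paragraph. This is a simple way to model sentences, but has the potential to capture the global topic of the sentence in a straightforward and coherent way. However, in this case, the context vector would be the same for every word in the input sequence.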
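Bi-directional LSTM (BiLSTM) Kågebäck and Salomonsson (2016) leveraged a bi-directional LSTM that learns a context vector for the target word in the input sequence and predicts the word sense with a multi-layer perceptron. Specifically, we can compute the context vector $c_t$ for the $t$th word similarly to the bi-directional encoder as follows:
$$\overrightarrow{c}_t = \overrightarrow{\mathrm{RNN}}_c(\overrightarrow{c}_{t-1}, f_c(x_t)), \qquad \overleftarrow{c}_t = \overleftarrow{\mathrm{RNN}}_c(\overleftarrow{c}_{t+1}, f_c(x_t)), \qquad c_t = [\overrightarrow{c}_t; \overleftarrow{c}_t]$$
$\overrightarrow{\mathrm{RNN}}_c$ and $\overleftarrow{\mathrm{RNN}}_c$ are forward and backward LSTMs respectively, and $f_c(x_t) = M_c \mathbf{1}(x_t)$ is a function that maps a word to continuous embedding space.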
Held-out LSTM (HoLSTM) Yuan et al. (2016) trained an LSTM language model, which predicts a held-out word given the surrounding context, with a large amount of unlabeled text as training data. Given the context vector from this language model, they predict the word sense with a WSD classifier. Specifically, we can compute the context vector $c_t$ for the $t$th word by first replacing the $t$th word with a special symbol (e.g. <$>). We then feed the replaced sequence $\tilde{x}_1, \ldots, \tilde{x}_n$ to a uni-directional LSTM:
$$\tilde{c}^{\,t}_i = \overrightarrow{\mathrm{RNN}}_c(\tilde{c}^{\,t}_{i-1}, f_c(\tilde{x}_i)), \qquad \tilde{x}_i = \begin{cases} \langle\$\rangle & \text{if } i = t \\ x_i & \text{otherwise} \end{cases}$$
Finally, we can get the context vector for the $t$th word as the last hidden state:
$$c_t = \tilde{c}^{\,t}_n$$
$\overrightarrow{\mathrm{RNN}}_c$ and $f_c$ are defined in the BiLSTM paragraph, and $n$ is the length of the sequence. Despite the fact that the context vector is always the last hidden state of the LSTM no matter which word we are targeting, the input sequence read by the HoLSTM is actually different every time.
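The two non-recurrent ingredients of the context networks above can be illustrated with a short sketch: building the held-out input sequences that the HoLSTM reads, and averaging embeddings for NBOW. The function names and toy vectors are hypothetical, and the LSTM itself is deliberately left out; this only shows why the HoLSTM sees a different input for every position while NBOW yields one shared vector.

```python
HELD_OUT = "<$>"  # special symbol from the text; any reserved token would do


def held_out_inputs(tokens):
    """Return one input sequence per position t, with the t-th token replaced."""
    return [tokens[:t] + [HELD_OUT] + tokens[t + 1:] for t in range(len(tokens))]


def nbow_context(embeddings):
    """Average the word embeddings; the result is shared by every position."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[k] for vec in embeddings) / n for k in range(dim)]


sentence = ["charges", "against", "four", "men"]
for t, seq in enumerate(held_out_inputs(sentence)):
    print(t, seq)

# Hypothetical 2-dimensional embeddings, one per token of the sentence
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(nbow_context(toy_embeddings))
```

Note that the held-out construction requires one LSTM pass per source position, whereas NBOW and BiLSTM need a single pass over the sentence.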

Adding Context to NMT
Now that we have several methods to capture global context regarding a single word, it is necessary to incorporate this context into NMT. Specifically, we propose two methods, Gate and Concatenate, that combine a context vector $c_t$ with the Lookup embedding $M_e \cdot \mathbf{1}(x_t)$ to form a context-aware word embedding before feeding it into the encoder, as shown in Fig. 3. The details of these methods are described below.
Gate Inspired by Choi et al. (2017), as our first method for integration of context-aware word embeddings, we use a gating function as follows:
$$f'_e(x_t) = f_e(x_t) \odot \sigma(c_t)$$
Here $f'_e(x_t)$ denotes the context-aware word embedding, the symbol $\odot$ represents element-wise multiplication, and $\sigma$ is the element-wise sigmoid function. Choi et al. (2017) use this method in concert with averaged embeddings from words in the source language like the NBOW model above, which naturally uses the same context vectors for all time steps. In this paper, we additionally test this function with context vectors calculated using the BiLSTM and HoLSTM.
Figure 3: Illustration of our proposed model. The context network is a differentiable network that computes the context vector $c_t$ for word $x_t$ taking the whole sequence as input. ⊗ represents the operation that combines the original word embedding $x_t$ with the corresponding context vector $c_t$ to form context-aware word embeddings.
Concatenate We also propose another way of incorporating context: concatenating the context vector with the word embedding. This is expressed as below:
$$f'_e(x_t) = W_3 [f_e(x_t); c_t]$$
Here $f'_e(x_t)$ denotes the context-aware word embedding, and $W_3$ is used to project the concatenated vector back to the original $d$-dimensional space. For each method, we can compute the context vector $c_t$ with either the NBOW, BiLSTM, or HoLSTM described in §4. We share the parameters in $f_e$ with $f_c$ (i.e. $M_e = M_c$) since the vocabulary space is the same for the context network and the encoder. As a result, our context network only slightly increases the number of model parameters. Details about the number of parameters of each model we use in the experiments are shown in Table 1.
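The two integration methods can be sketched on plain Python lists as follows. This is an illustration under stated assumptions rather than the model code: the gate is assumed to multiply the Lookup embedding element-wise by the sigmoid of the context vector, the projection matrix `w3` has shape d x 2d, and all numeric values are hypothetical.

```python
import math


def sigmoid(vector):
    """Element-wise logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]


def gate(embedding, context):
    """Element-wise product of the Lookup embedding with sigmoid(context)."""
    return [e * s for e, s in zip(embedding, sigmoid(context))]


def concatenate(embedding, context, w3):
    """Project [embedding; context] (length 2d) back to d dimensions with w3."""
    joined = embedding + context
    return [sum(w * x for w, x in zip(row, joined)) for row in w3]


embedding = [1.0, -2.0]      # hypothetical Lookup embedding of x_t (d = 2)
context = [0.0, 0.0]         # hypothetical context vector c_t
w3 = [[1.0, 0.0, 1.0, 0.0],  # hypothetical projection matrix, d x 2d
      [0.0, 1.0, 0.0, 1.0]]

print(gate(embedding, context))
print(concatenate(embedding, context, w3))
```

The gate adds no parameters beyond the context network, while concatenation adds only the d x 2d projection, which is consistent with the remark that the parameter count grows only slightly.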

Experiments
We evaluate our model on three different language pairs: English-French (WMT'14), English-German (WMT'15), and English-Chinese (WMT'17) [...] (2015); for French, we used the Moses (Koehn et al., 2007) tokenization script with the "-a" flag; for Chinese, we split sequences of Chinese characters, but keep sequences of non-Chinese characters as they are, using the script from IWSLT Evaluation 2015. We compare our context-aware NMT systems with strong baseline models on each dataset.
Footnote 4: We use the development set as testing data because the official test set hasn't been released.
Table 2: Results on three different language pairs. The best proposed models (BiLSTM+Concat+uni) are significantly better (p-value < 0.001) than baseline models using paired bootstrap resampling (Koehn, 2004).

Training Details
We limit our vocabularies to be the top 50K most frequent words for both source and target language. Words not in these shortlisted vocabularies are converted into an unk token.
When training our NMT systems, following Bahdanau et al. (2015), we filter out sentence pairs whose lengths exceed 50 words and shuffle mini-batches as we proceed. We train our model with the following settings, using SGD as our optimization method: (1) we start with a learning rate of 1 and begin to halve the learning rate every epoch once the model overfits (see footnote 6); (2) we train until the model converges, i.e. until the difference between the perplexity for the current epoch and the previous epoch is less than 0.01; (3) we batch instances of the same length, with a maximum mini-batch size of 256; (4) the normalized gradient is rescaled whenever its norm exceeds 5; and (5) dropout is applied between vertical RNN stacks with probability 0.3. Additionally, the context network is trained jointly with the encoder-decoder architecture. Our model is built upon OpenNMT (Klein et al., 2017) with the default settings unless otherwise noted.
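The learning-rate schedule, stopping criterion, and gradient rescaling described above can be sketched as below. This is not OpenNMT code: the function names are hypothetical, the schedule is replayed on a given list of per-epoch dev perplexities, and overfitting is assumed to mean that the dev perplexity is worse than in the previous epoch.

```python
def sgd_schedule(dev_perplexities, initial_lr=1.0, tol=0.01):
    """Replay the schedule on a list of per-epoch dev perplexities.

    Returns the learning rate used in each epoch and the epoch at which
    training stops (None if the convergence test never fires).
    """
    lr = initial_lr
    halving = False
    rates = []
    stop_epoch = None
    for epoch, ppl in enumerate(dev_perplexities):
        rates.append(lr)
        if epoch > 0:
            prev = dev_perplexities[epoch - 1]
            if abs(ppl - prev) < tol:
                stop_epoch = epoch  # converged: perplexity barely changed
                break
            if ppl > prev:
                halving = True      # overfitting: dev perplexity got worse
        if halving:
            lr /= 2.0               # halve every epoch from now on
    return rates, stop_epoch


def rescale_gradient(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


# Hypothetical dev perplexities for five epochs
rates, stop_epoch = sgd_schedule([20.0, 12.0, 12.5, 11.0, 10.995])
print(rates, stop_epoch)
print(rescale_gradient([6.0, 8.0]))
```

Once halving starts it continues every epoch, so a single bad epoch permanently switches the optimizer into its decay phase.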

Experimental Results
In this section, we compare our proposed context-aware NMT models with baseline models on the English-German dataset. Our baseline models are encoder-decoder models using global-general attention and input feeding on the decoder side as described in §2, varying the settings on the encoder side. Our proposed model builds upon the baseline models by concatenating or gating different types of context vectors. We use LSTMs for the encoder, decoder, and context network. The decoder is the same across baseline models and proposed models, having 500 hidden units. During testing, we use beam search with a beam size of 5. The dimension of the input word embedding $d$ is set to 500 across the encoder, decoder, and context network. Settings for the three different baselines are listed below.
Baseline 1: A uni-directional LSTM with 500 hidden units and 2 stacked LSTM layers.
Baseline 2: A bi-directional LSTM with 250 hidden units and 2 stacked LSTM layers. Each state is summarized by concatenating the hidden states of the forward and backward encoders into 500 hidden units.

Baseline 3: A bi-directional LSTM with 250 hidden units and 3 stacked LSTM layers.
This baseline can be compared with the proposed method, whose context network adds a layer of computation before the word embeddings and thus essentially adds an extra layer.
The context network uses the settings below.
Footnote 6: We define overfitting as the perplexity on the dev set for the current epoch being worse than for the previous epoch.

NBOW: Average word embedding of the input sequence.
BiLSTM: A single-layer bi-directional LSTM with 250 hidden units. The context vector is represented by concatenating the hidden states of forward and backward LSTM into a 500 dimensional vector.
The results are shown in Table 1. The first thing we observe is that the best context-aware model (results in bold in the table) achieved improvements of around 0.7 BLEU on both WMT14 and WMT15 over the respective baseline methods with 2 layers. This is in contrast to simply using a 3-layer network, which actually degrades performance, perhaps because the vanishing gradient problem makes learning more difficult.
Next, comparing different methods for incorporating context, we can see that BiLSTM performs best across all settings. HoLSTM performs slightly better than NBOW, and NBOW, which suffers from having the same context vector for every word in the input sequence, fails to outperform the corresponding baselines. Comparing the two integration methods that incorporate context into word embeddings, both improve over the baseline with BiLSTM as the context network, and concatenating the context vector and the word embedding performs better than gating. Finally, in contrast to the baseline, it is not obvious whether a uni-directional or a bi-directional encoder is better for our proposed models, particularly when BiLSTM is used as the context network. This is likely because bi-directional information is already captured by the context network and may not be necessary in the encoder itself.
We further compared the two systems on two different languages, French and Chinese. We achieved 0.5-0.8 BLEU improvement, showing our proposed models are stable and consistent across different language pairs. The results are shown in Table 2.
To show that our 3-layer models are properly trained, we ran a 3-layer bi-directional encoder with residual networks on En-Fr and got 27.45 for WMT13 and 30.60 for WMT14, which is similarly lower than the two-layer result. It should be noted that previous work such as Britz et al. (2017) also noted that the gains for encoders beyond two layers are minimal.
Table 3: Translation results for homographs and all words in our NMT vocabulary. We compare scores for the baseline and our best proposed model on three different language pairs. Improvements are in italic. We performed bootstrap resampling 1000 times: our best model improved more on homographs than on all words in terms of F1, precision, and recall with p < 0.05, indicating statistical significance across all measures.
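The significance tests reported with the result tables rely on bootstrap resampling. A minimal sketch of paired bootstrap resampling in the spirit of Koehn (2004) is given below. It is an illustration under simplifying assumptions: each system is represented by a hypothetical per-sentence score and a resampled test set is scored by summing those scores, whereas corpus-level BLEU would have to be recomputed from n-gram statistics on every resample.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Draw n sentence indices with replacement; use the same draw for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / num_samples


# Hypothetical per-sentence scores for two systems on the same test set
system_a = [0.6, 0.7, 0.8, 0.4, 0.9]
system_b = [0.5, 0.6, 0.7, 0.3, 0.8]
win_rate = paired_bootstrap(system_a, system_b)
print(win_rate)
```

One minus the returned win rate serves as an approximate p-value for the claim that system A is better than system B.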

Targeted Analysis
In order to examine whether our proposed model can better translate words with multiple senses, we evaluate our context-aware model on a list of homographs extracted from Wikipedia (see footnote 7) and compare it to the baseline model on three different language pairs. For the baseline model, we choose the best-performing model, as described in §6.2.
To do so, we first acquire the translation of homographs in the source language using fast-align (Dyer et al., 2013). We run fast-align on all the parallel corpora, including training data and testing data (see footnote 8), because the unsupervised nature of the algorithm requires a large amount of training data to obtain accurate alignments. The settings follow the default command on the fast-align GitHub page, including the heuristics for combining forward and backward alignments. Since there might be multiple aligned words in the target language given a word in the source language, we treat a match between the aligned translation of a targeted word in the reference and the translation of a given model as a true positive, use F1, precision, and recall as our metrics, and take the micro-average across all the sentence pairs (see footnote 9). We calculated the scores for the 50000 words/characters from our source vocabulary using only English words. The results are shown in Table 3. The table shows two interesting results: (1) The score for the homographs is lower than the score obtained from all the words in the vocabulary. This shows that words with more meanings are harder to translate, with Chinese as the only exception. (2) The improvement of our proposed model over the baseline model is larger on the homographs compared to all the words in the vocabulary. This shows that although our context-aware model is better overall, the improvements are particularly focused on words with multiple senses, which matches the intuition behind the design of the model.
Footnote 7: https://en.wikipedia.org/wiki/List_of_English_homographs
Footnote 8: Reference translation, and all the system generated translations.
Footnote 9: The link to the evaluation script: https://goo.gl/oHYR8E
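The micro-averaged metrics used in this analysis can be sketched as follows. This is not the evaluation script linked in the footnotes: the function name and example pairs are hypothetical, and alignments are assumed to be given as lists of target words per targeted source word.

```python
def micro_prf(pairs):
    """Micro-averaged precision, recall, and F1.

    pairs: iterable of (reference_words, system_words), one entry per
    targeted source word occurrence.
    """
    true_pos = n_sys = n_ref = 0
    for ref_words, sys_words in pairs:
        ref, hyp = set(ref_words), set(sys_words)
        true_pos += len(ref & hyp)  # system words that match the reference
        n_sys += len(hyp)
        n_ref += len(ref)
    precision = true_pos / n_sys if n_sys else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical aligned translations for two homograph occurrences
example_pairs = [
    (["会见"], ["会见"]),
    (["据信"], ["相信", "被认为"]),
]
precision, recall, f1 = micro_prf(example_pairs)
print(precision, recall, f1)
```

Micro-averaging pools the counts over all occurrences before dividing, so frequent words influence the final score more than rare ones.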

Qualitative Analysis
We show sample translations on the English-Chinese WMT'17 dataset in Table 4 with three kinds of examples. We highlight the English homograph in bold, correctly translated words in blue, and wrongly translated words in red. (1) Target homographs are translated into the correct sense with the help of the context network. For the first sample translation, "meets" is correctly translated to "会见" by our model, and wrongly translated to "符合" by the baseline model. In fact, "会见" is closer to the definition "come together intentionally" and "符合" is closer to "satisfy" in the English dictionary.
(2) Target homographs are translated into different but similar senses by both models in the fourth example. Both models translate the word "believed" to the common translations "被认为" or "相信", but these meanings are both close to the reference translation "据信". (3) The target homograph is translated into the wrong sense by the baseline model, but is not translated by our model in the fifth example.
Table 4: Sample translations. For each example, we show the sentence in the source language (src), the human-translated reference (ref), the translation generated by our best context-aware model (best), and the translation generated by the baseline model (base). We also highlight the word with multiple senses in the source language in bold, the corresponding correctly translated words in blue, and wrongly translated words in red. The definitions of words in blue or red are in parentheses.

Related Work
Word sense disambiguation (WSD), the task of determining the correct meaning or sense of a word in context, is a long-standing task in NLP (Yarowsky, 1995; Ng and Lee, 1996; Mihalcea and Faruque, 2004; Navigli, 2009; Zhong and Ng, 2010; Di Marco and Navigli, 2013; Chen et al., 2014; Camacho-Collados et al., 2015). Recent research on tackling WSD and capturing multiple senses includes work leveraging LSTMs (Kågebäck and Salomonsson, 2016; Yuan et al., 2016), which we extended as a context network in our paper, and work predicting senses with word embeddings that capture context. Šuster et al. (2016) and Kawakami and Dyer (2016) also showed that bilingual data improves WSD. In contrast to the standard WSD formulation, Vickrey et al. (2005) reformulated the task of WSD for statistical machine translation (SMT) as predicting possible target translations, which directly improves the accuracy of machine translation. Following this reformulation, Chan et al. (2007) and Carpuat and Wu (2007a,b) integrated WSD systems into phrase-based systems. Xiong and Zhang (2014) break the process into two stages: they first predict the sense of the ambiguous source word, and the predicted word senses together with other context features are then used to predict possible target translations. Within the framework of neural MT, there is work that has similar motivation to ours. Choi et al. (2017) leverage the NBOW as context and gate the word embedding on both the encoder and decoder sides. However, their work does not distinguish context vectors for words in the same sequence, in contrast to the method in this paper, and our results demonstrate that this is an important feature of methods that handle homographs in NMT. In addition, our quantitative analysis of the problems that homographs pose to NMT and our evaluation of how context-aware models fix them were not covered in this previous work. Rios et al. (2017) tackled the problem by adding sense embeddings learned with an additional corpus and evaluated the performance on the sentence level with contrastive translation.

Conclusion
Theoretically, NMT systems should be able to handle homographs if the encoder captures the clues to translate them correctly. In this paper, we empirically show that this may not be the case; the performance of word-level translation degrades as the number of senses for each word increases. We hypothesize that this is because each word is mapped to a single word vector even when it appears in different contexts, and propose to integrate methods from neural WSD systems into an NMT system to alleviate this problem. We concatenated the context vector computed from the context network with the word embedding to form a context-aware word embedding, successfully improving the NMT system. We evaluated our model on three different language pairs and outperformed a strong baseline model according to BLEU score in all of them. We further evaluated our results targeting the translation of homographs, and our model performed better in terms of F1 score.
While the architectures proposed in this work do not solve the problem of homographs, our empirical results in Table 3 demonstrate that they do yield improvements (larger than those on other varieties of words). We hope that this paper will spark discussion on the topic, and future work will propose even more focused architectures.