Encoding Gated Translation Memory into Neural Machine Translation

Translation memories (TM) help human translators reuse existing translations of repetitive fragments. In this paper, we propose a novel method to combine the strengths of both TM and neural machine translation (NMT) for high-quality translation. We treat the target translation of a TM match as an additional reference input and encode it into NMT with an extra encoder. A gating mechanism is further used to balance the impact of the TM match on the NMT decoder. Experimental results on the UN corpus demonstrate that when fuzzy match scores are higher than 50%, the quality of NMT translation can be significantly improved by over 10 BLEU points.


Introduction
Neural machine translation, an emerging machine translation (MT) technology, has made remarkable progress in the past few years (Sutskever et al., 2014), which strongly encourages many translation agencies to embrace it for product deployment. A natural question during this deployment is how the strengths of both the traditional TM and new NMT technologies can be combined for professional high-quality translation.
Such combinations of TM and MT have already been explored in the context of statistical machine translation (SMT). A variety of efforts have been made to incorporate matched translation segments from TM into SMT (Koehn and Senellart, 2010). Partially inspired by these efforts, we aim at combining TM and NMT in this paper.
Unlike TM and SMT, both of which construct translations from symbolic fragments, NMT induces translations from a real-valued continuous space. Furthermore, NMT is trained in an end-to-end fashion, which makes it hard to intervene in externally. Therefore, incorporating TM as external knowledge into NMT is challenging.
In this paper, we propose a novel and effective method to address this issue in the combination of TM and NMT. The key idea behind this method is to mimic human translators in translating a source sentence given a similar source sentence with a translation. We treat the matched TM translation as an additional signal and try to encode it with a new encoder to guide the NMT decoder to translate the current sentence. Specifically, we first find the sentence that is most similar to the current source sentence from TM by calculating their semantic similarity based on sentence embeddings. In order to prevent the TM matched translation from dominating the decoding process, we introduce a gate mechanism to balance the TM translation signal and the current source sentence which are encoded separately by two different encoders.
A series of experiments on the Chinese-English UN corpus demonstrate that when fuzzy matches are over 50%, the proposed method can significantly improve NMT with the gated TM signal. We also conduct an in-depth analysis on the TM gate, which shows that the gate can indeed regulate the information flow from TM to the NMT decoder.

Encoding Gated TM into NMT
In this section, we elaborate on our proposed method, which encodes translation memories into neural machine translation with a gating mechanism. We refer to our method as NMT-GTM. It consists of three essential components: i) coupled encoders that encode the source sentence and the matched TM translation separately, ii) a TM gating network that controls the encoded signal from the TM matched translation, and iii) a TM-guided decoder that incorporates the gated TM signal into decoding. The diagram of NMT-GTM is shown in Figure 1.
For each source sentence src, we search the TM for the sentence most similar to it. Unlike previous combinations of TM and SMT, we define the best TM match as the sentence with the highest cosine similarity, computed from sentence embeddings (Le and Mikolov, 2014), rather than selecting it by fuzzy match score. This is consistent with NMT, which operates in an embedding-defined semantic space. For ease of understanding, however, we report our experimental results by fuzzy match score. We use $tm_s$ to denote the sentence in the TM that is most semantically similar to src, and $tm_t$ to denote its translation.
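The retrieval step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence embeddings are taken as given (toy 3-dimensional vectors here), the TM is a plain list scanned linearly, and the names `cosine`, `best_tm_match` and the `(embedding, tm_s, tm_t)` layout are hypothetical rather than taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_tm_match(src_vec, tm_entries):
    """Return (similarity, tm_s, tm_t) for the TM entry closest to src_vec.

    tm_entries is a list of (embedding, tm_s, tm_t) triples (hypothetical layout).
    Returns None when the TM is empty.
    """
    best = None
    for emb, tm_s, tm_t in tm_entries:
        sim = cosine(src_vec, emb)
        if best is None or sim > best[0]:
            best = (sim, tm_s, tm_t)
    return best

# Toy embeddings, made up for illustration only.
tm = [
    ([1.0, 0.0, 0.0], "tm source A", "tm target A"),
    ([0.6, 0.8, 0.0], "tm source B", "tm target B"),
]
sim, tm_s, tm_t = best_tm_match([0.5, 0.9, 0.0], tm)
print(tm_s, "->", tm_t, round(sim, 3))
```

A linear scan is enough to show the idea; a TM of realistic size would more plausibly use an approximate nearest-neighbour index, which is outside the scope of this sketch.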

Coupled Encoders
We use a pair of encoders to separately encode the source sentence src and its matched TM translation $tm_t$. The two encoders run independently of each other, and each is a bidirectional GRU recurrent neural network (Chung et al., 2014). Accordingly, two separate attention networks are employed to obtain context representations for src and $tm_t$, which we denote as $c_{src}$ and $c_{tm_t}$ respectively. The attention network for the TM matched translation helps the decoder detect matched translation segments in $tm_t$.
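To make the role of the two attention networks concrete, the following sketch computes one context vector per encoder. It assumes additive attention in the style commonly used with RNNSearch, which the text does not spell out, and it takes the encoder annotations as given instead of running bidirectional GRUs. All parameter names (`W`, `U`, `v`, `src_params`, `tm_params`) and toy values are hypothetical.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(state, annotations, W, U, v):
    """Score each annotation against the decoder state, normalise the scores,
    and return (context vector, attention weights)."""
    proj_state = matvec(W, state)
    scores = []
    for h in annotations:
        hidden = [math.tanh(a + b) for a, b in zip(proj_state, matvec(U, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return context, weights

eye = [[1.0, 0.0], [0.0, 1.0]]
# Each input gets its own attention parameters (toy values).
src_params = (eye, eye, [1.0, 1.0])
tm_params = (eye, eye, [0.5, 0.5])

state = [0.0, 0.0]                                    # previous decoder state
src_annotations = [[1.0, 0.0], [0.0, 1.0]]            # stand-in for encoder output on src
tm_annotations = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # stand-in for encoder output on tm_t

c_src, a_src = attention_context(state, src_annotations, *src_params)
c_tm, a_tm = attention_context(state, tm_annotations, *tm_params)
print(c_src, a_tm)
```

The two calls share code but not parameters, which mirrors the statement that the attention networks are separate.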

TM Gating Network
When we translate a source sentence, in addition to the sentence itself, we also have as input a TM matched translation ($tm_t$) that is semantically similar to the sentence. We want this additional input to act as a translation example that provides positive guidance for target word prediction. In order to balance the information flow from the two inputs (src and $tm_t$) into the decoder, we further introduce a TM gating network to control the respective proportions of $tm_t$ and src, partially inspired by Tu et al. (2017), who propose a gating mechanism to combine source and target contexts. We formulate the TM gating network as follows:
$$g_{tm} = f(s_{t-1}, y_{t-1}, c_{tm_t})$$
where $s_{t-1}$ is the previous hidden state, $y_{t-1}$ is the previously predicted target word, $c_{tm_t}$ is the context representation of $tm_t$, and $f$ is a logistic sigmoid function.
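A gate of this kind can be sketched as a sigmoid applied to an affine combination of its inputs. The sketch below assumes that the gate also reads the TM context vector and that it produces one value per context dimension; the weight names `W_s`, `W_y`, `W_c`, the bias `b`, and all sizes are hypothetical, since the text only states that a logistic sigmoid is used.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tm_gate(s_prev, y_prev, c_tm, W_s, W_y, W_c, b):
    """Return the gate vector: one value between 0 and 1 per context dimension."""
    pre = [
        ws + wy + wc + bias
        for ws, wy, wc, bias in zip(
            matvec(W_s, s_prev), matvec(W_y, y_prev), matvec(W_c, c_tm), b
        )
    ]
    return [sigmoid(p) for p in pre]

# Toy sizes: hidden state 3, word embedding 2, context 2 (all hypothetical).
s_prev = [0.1, -0.2, 0.3]
y_prev = [0.5, -0.5]
c_tm = [1.0, 2.0]
W_s = [[0.0] * 3 for _ in range(2)]
W_y = [[0.0] * 2 for _ in range(2)]
W_c = [[1.0, 0.0], [0.0, -1.0]]  # in this toy setup the gate reads only the TM context
b = [0.0, 0.0]

g_tm = tm_gate(s_prev, y_prev, c_tm, W_s, W_y, W_c, b)
print([round(g, 3) for g in g_tm])
```

Because the output lies between 0 and 1, it can scale the TM context while its complement scales the source context, which is how the two inputs are kept in proportion.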

TM-Guided Decoder
In the TM-guided decoder, we integrate the gated TM information into the decoding process and use the context representations of src and $tm_t$ to predict the hidden state of the decoder at each time step. The decoder hidden state $s_t$ is computed as follows:
$$s_t = \mathrm{GRU}(s_{t-1}, y_{t-1}, c_{src} * (1 - g_{tm}), c_{tm_t} * g_{tm})$$
where $g_{tm}$ is the output of the TM gating network and $*$ is element-wise multiplication.
The conditional probability of the next word $y_t$ is calculated as follows:
$$p(y_t \mid y_{<t}, src, tm_t) = g(s_t, y_{t-1}, c_{src})$$
where $g$ denotes the output layer of the decoder. Note that we only incorporate the gated TM into the hidden state of the decoder, not into the prediction of the next word. Our goal is to correctly translate the source sentence with reference to the translation of the TM match $tm_t$. In other words, $tm_t$ only plays a supporting role in translation, and we do not want too much information from the TM to affect the translation of the source sentence. Therefore, we incorporate the gated TM in such a way that it can only indirectly influence target generation via the hidden states. In our experiments, we observe that this helps our proposed model to faithfully translate a source sentence instead of copying all information from the TM matched translation, especially for source sentences that differ only slightly (e.g., in dates or numbers) from their TM matches.
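The way the gate enters the decoder state can be illustrated with a small sketch. It assumes that the gate scales the TM context, that its complement scales the source context, and that both scaled contexts are concatenated with the previous word embedding as input to a standard GRU update. The GRU parametrisation, the parameter names in `params`, and the toy sizes are all hypothetical.

```python
import math
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gated_contexts(c_src, c_tm, g_tm):
    """Scale the source context by (1 - g) and the TM context by g, element-wise."""
    src_part = [c * (1.0 - g) for c, g in zip(c_src, g_tm)]
    tm_part = [c * g for c, g in zip(c_tm, g_tm)]
    return src_part, tm_part

def gru_step(x, s_prev, p):
    """One GRU update with update gate z, reset gate r and candidate state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], s_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], s_prev))]
    reset = [ri * si for ri, si in zip(r, s_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset))]
    return [(1.0 - zi) * si + zi * ci for zi, si, ci in zip(z, s_prev, cand)]

def decoder_step(s_prev, y_prev, c_src, c_tm, g_tm, p):
    """Compute the next decoder state from the gated source and TM contexts."""
    src_part, tm_part = gated_contexts(c_src, c_tm, g_tm)
    x = y_prev + src_part + tm_part  # list concatenation
    return gru_step(x, s_prev, p)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hid, emb, ctx = 3, 2, 2
in_dim = emb + 2 * ctx
params = {
    name: random_matrix(hid, in_dim if name.startswith("W") else hid, rng)
    for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")
}
s_prev = [0.1, -0.2, 0.3]
y_prev = [0.5, -0.5]
c_src = [1.0, 2.0]

# With the gate fully closed, the TM context has no effect on the new state.
closed = decoder_step(s_prev, y_prev, c_src, [9.0, 9.0], [0.0, 0.0], params)
also_closed = decoder_step(s_prev, y_prev, c_src, [-4.0, 7.0], [0.0, 0.0], params)
print(all(abs(a - b) < 1e-12 for a, b in zip(closed, also_closed)))
```

In this sketch the TM signal reaches the output only through the hidden state, in line with the design choice of keeping the TM match in a supporting role.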

Experiments
We conducted a series of experiments on a Chinese-English corpus to evaluate the effectiveness of the proposed NMT-GTM and to analyze the TM gate.

Experimental Settings
Our data come from the Chinese-English United Nations Parallel Corpus (Rafalovitch et al., 2009), which consists of official records and other parliamentary documents. Since large-scale public translation memories are not easily available, we built a translation memory from the UN corpus. Specifically, we divided the Chinese-English UN corpus into two parts of equal size, $UN_a$ and $UN_b$. For each source sentence $s_a$ from $UN_a$, we chose the source sentence $s_b$ from $UN_b$ that has the highest semantic similarity to $s_a$, computed in the way described in the last section. In doing so, we built a corpus of matched pairs $(s_a/t_a, s_b/t_b)$, where $t_a$ and $t_b$ are the translations of $s_a$ and $s_b$. Then we computed the fuzzy match score (FMS) for each pair of source sentences as follows:
$$FMS(s_a, s_b) = 1 - \frac{\mathrm{Levenshtein}(s_a, s_b)}{\max(|s_a|, |s_b|)}$$
where $\mathrm{Levenshtein}(s_a, s_b)$ is the word-based Levenshtein distance between $s_a$ and $s_b$, and $|s|$ is the number of words in $s$. The fuzzy match score can also be calculated with other methods, e.g., the method introduced by Bloodgood and Strauss (2015). We leave FMS estimated with different methods to future work.
We selected all pairs $(s_a/t_a, s_b/t_b)$ with a fuzzy match score $FMS \geq 0.5$. From the pairs with $FMS < 0.5$, we randomly selected 20%. The selected pairs were then divided into 9 groups according to their fuzzy match scores (e.g., $FMS \in [0.5, 0.6)$). We randomly chose approximately the same number of sentences from each group to create a development set and a test set. The remaining data were used to create the training data (i.e., $\{(s_a, t_b, t_a)_{selected}\}$) and the translation memory (i.e., $\{(s_b, t_b)_{selected}\}$). Statistics of the training data, development set and test set are shown in Table 1. The numbers of test-set sentences in each fuzzy match score group are presented in Table 2.
We used RNNSearch as our NMT baseline. We set the maximum sentence length of the training corpus to 50 words for both the Chinese and English sides. The vocabulary size of each side was set to 30k. Words not in the vocabulary were replaced with a special token UNK. We set the dropout to 0.5. All the other settings were the same as those described by . We trained the NMT models with stochastic gradient descent using Adam (Kingma and Ba, 2014). The learning rate was set to 0.0004, the mini-batch size to 80 sentences, and the beam size to 10 during decoding.
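The fuzzy match score used to group sentence pairs is simple to compute. The sketch below assumes the common definition of one minus the word-level edit distance divided by the length of the longer sentence, and it assumes whitespace tokenization, which would have to be replaced by proper word segmentation for Chinese. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Word-based edit distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_match_score(s_a, s_b):
    """FMS = 1 - edit distance / length of the longer sentence, on whitespace tokens."""
    a, b = s_a.split(), s_b.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - levenshtein(a, b) / longest

score = fuzzy_match_score(
    "the chairman said that the representative of serbia had asked",
    "the chairman said that the representative of zimbabwe had asked",
)
print(round(score, 2))
```

The two example sentences differ in a single word out of ten, so they fall in the highest fuzzy match group even though the differing word is the one that matters most for a faithful translation.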
For the proposed NMT-GTM model, we used tuples (src, $tm_t$, tgt) as input. The rest of the parameter settings were the same as for the baseline model. To calculate the cosine similarity, we used the fastText tool (available at https://fasttext.cc/) to obtain 100-dimensional sentence embeddings.

Experimental Results
Table 3 shows the results of different NMT systems measured by BLEU (Papineni et al., 2002). The table shows that when fuzzy match scores are over 50%, introducing TM information significantly improves NMT translation quality. Even when fuzzy match scores are lower than 50%, the translation quality does not drop much. On the entire test set, the proposed gated combination of TM and NMT improves translation quality by 10.32 BLEU points over the baseline.
In addition, in order to investigate how similar the matched TM translations $tm_t$ are to the reference translations ref, we also measured the BLEU scores of the matched TM translations against the reference translations. These results are also shown in Table 3, indicated as TM.

Analysis
We further examined how the TM gate varies when we incorporate TM matches with different fuzzy match scores. As a comparison, we used the reference translations as the matched TM translations and incorporated them into NMT-GTM to check the changes of the gate. The BLEU scores measured when we used reference translations as matched TM translations, as well as the average gate values, are shown in Table 4. The results demonstrate that when the matched TM is semantically closer to the current source sentence, the TM gate is larger, indicating that more information from the matched TM translation is used to guide the decoder.
Table 5 shows an example from our test set. The highlighted fragments of the source sentence and the matched TM source sentence are not actually the same in terms of their surface forms. However, they are semantically close and can be translated into the same target translation. Our proposed NMT-GTM is able to successfully incorporate the translation of such a fragment into the decoder.
Table 5 (excerpt):
the chairman said that the representative of serbia had asked to participate in the discussion of the item in accordance with rule 43 of the rules of procedure .
RNNSearch: the chairman said that the representative of zimbabwe, in accordance with rule 43, requested a discussion of the item .
NMT-GTM: the chairman said that the representative of zimbabwe had asked to participate in the discussion of the item in accordance with rule 43 of the rules of procedure .

Related Work
Various strategies have been proposed to combine TM and SMT (Koehn and Senellart, 2010; He et al., 2010). Their key idea is to integrate the TM translations of matching fragments into SMT and let SMT translate only the parts that differ. In order to better model this process, Wang et al. (2013, 2014) use different features to allow relevant TM information to guide SMT decoding.
The related work on combining TM and NMT is quite limited. Gu et al. (2017) propose a TM-NMT model that first finds the most similar segments through search engines according to fuzzy match scores and saves them as key-value pairs in memory. In the subsequent decoding, the saved information is used to help decoding. Our work differs significantly from theirs in two aspects. First, we use semantic similarity based on sentence embeddings, rather than the fuzzy match score, to detect the best TM matches. Second, we encode the entire TM matched translation, rather than segments, into NMT with coupled encoders and a gating network.
Our work is also related to multi-source NMT (Zoph and Knight, 2016). The difference is that in our case the multiple source inputs are only semantically similar, rather than identical. This is why we use a gate to combine these inputs.

Conclusion and Future Work
In this paper, we have presented a novel gated method to encode translation memory into NMT so as to convey the information of the matched TM translation into the NMT decoder. Extensive experiments verify that our method can indeed effectively improve translation quality, especially when fuzzy match scores are higher than 50%. Further analysis reveals that the proposed TM gate is able to vary according to the similarity between the matched TM translation and the current sentence.