Boosting Neural Machine Translation with Similar Translations

This paper explores data augmentation methods that allow Neural Machine Translation to make use of similar translations, in a way comparable to how a human translator employs fuzzy matches. In particular, we show how to simply present the neural model with information from both the source and target sides of fuzzy matches, and we extend the notion of similarity to include semantically related translations retrieved using distributed sentence representations. We show that translations based on fuzzy matching provide the model with "copy" information, while translations based on embedding similarity tend to extend the translation "context". Results indicate that the effects of both types of similar sentences add up to further boost accuracy, combine naturally with model fine-tuning, and provide dynamic adaptation for unseen translation pairs. Tests on multiple data sets and domains show consistent accuracy improvements. To foster research around these techniques, we also release an open-source toolkit with an efficient and flexible fuzzy-match implementation.


Introduction
For decades, the localization industry has offered Fuzzy Matching technology in CAT tools, allowing the human translator to visualize one or several fuzzy matches from a translation memory when translating a sentence, leading to higher productivity and consistency (Yamada, 2011). Hence, even though the concept of a fuzzy match score is not standardized and differs between CAT tools (Bloodgood and Strauss, 2014), translators generally accept discounted translation rates for sentences with "high" fuzzy matches (see https://signsandsymptomsoftranslation.com/2015/03/06/fuzzy-matches/).
With improving machine translation technology and the training of models on translation memories, machine-translated output has progressively been introduced as a substitute for fuzzy matches when no sufficiently "good" fuzzy match is found, and it has also proved to increase translator productivity given an appropriate post-editing environment (Plitt and Masselot, 2010). These two technologies are entirely different in their finality: for a given source sentence, fuzzy matching is just a database retrieval and scoring technique that always returns a pair of source and target segments, while machine translation actually builds an original translation. However, with Statistical Machine Translation, the two technologies share the same simple idea of managing and retrieving an optimal combination of the longest translated n-grams, and this property led to the development of several techniques such as the use of fuzzy matches in SMT decoding (Koehn and Senellart, 2010; Wang et al., 2013), adaptive machine translation (Zaretskaya et al., 2015) and "fuzzy match repairing" (Ortega et al., 2016).
With Neural Machine Translation (NMT), the integration of fuzzy matching is less obvious, since NMT neither keeps nor builds a database of aligned sequences and does not explicitly use n-gram language models for decoding. The only obvious and important use of translation memories is to train an NMT model from scratch or to adapt a generic translation model to a specific domain (fine-tuning) (Chu and Wang, 2018). While some works propose architecture changes (Zhang et al., 2018) or decoding constraints (Gu et al., 2018), recent work (Bulté and Tezcan, 2019; Bulté et al., 2018) has proposed a simple and elegant framework where, as for human translation, the translations of fuzzy matches are presented alongside the source sentence and the network learns to use this additional information. Even though this method has shown large gains in quality, it also opens many questions.
In this work, we push the concept further a) by proposing and evaluating new integration methods, b) by extending the notion of similarity, showing that fuzzy matches can be extended to embedding-based similarities, and c) by analyzing how online fuzzy matching compares and combines with offline fine-tuning. Finally, our results also show that introducing similar sentence translations helps NMT by providing sequences to copy (copy effect), but also by providing additional context for the translation (context effect).

Translation Memories and NMT
A translation memory (TM) is a database that stores translated segments, each composed of a source segment and its corresponding translation. It is mostly used to match previous translations against new content that is similar to content translated in the past.
Assume that we translated the following English sentence into French: [How long does the flight last?] ↝ [Combien de temps dure le vol?]. Both the English sentence and the corresponding French translation are saved to the TM. This way, if the same sentence appears in a future document (an exact match), the TM will suggest reusing the translation that has just been saved. In addition to exact matches, TMs are also useful with fuzzy matches, which apply when a new sentence is similar, but not identical, to a previously translated sentence. For example, when translating the input sentence [How long does a cold last?], the TM may also suggest reusing the previous translation, since only two word replacements (the flight → a cold) are needed to achieve a correct translation. TMs are used to reduce translation effort and to increase consistency over time.

Retrieving Similar Translations
More formally, we consider a TM as a set of K sentence pairs {(s_k, t_k) : k = 1, …, K}, where s_k and t_k are mutual translations. A TM must be conveniently stored so as to allow fast access to the pair (s_k, t_k) whose s_k shows the highest similarity to any given new sentence. Many methods to compute sentence similarity have been explored, mainly falling into two broad categories: lexical matching (i.e. fuzzy matching) and distributional semantics. The former relies on the number of overlaps between the sentences under consideration. The latter relies on the generalisation power of neural networks when building vector representations. Next, we describe the similarity measures employed in this work.
Fuzzy Matching Fuzzy matching is a lexicalised matching method aimed at identifying non-exact matches of a given sentence. We define the fuzzy matching score FM(s_i, s_j) between two sentences s_i and s_j as:

FM(s_i, s_j) = 1 − ED(s_i, s_j) / max(|s_i|, |s_j|)

where ED(s_i, s_j) is the edit distance between s_i and s_j, and |s| is the length of s. Many variants have been proposed to compute the edit distance, generally performed on normalized sentences (ignoring for instance case, number, punctuation, space or inline tag differences, which are typically handled at a later stage). IDF weighting and stemming techniques are also used to give more weight to significant words or less weight to morphological variants (Vanallemeersch and Vandeghinste, 2015; Bloodgood and Strauss, 2014).
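As an illustration, the FM score can be sketched in a few lines of Python using a word-level edit distance (this is only an illustrative sketch; the toolkit's actual C++ implementation is far more efficient and configurable):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over token lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def fuzzy_match_score(s_i, s_j):
    # FM(s_i, s_j) = 1 - ED(s_i, s_j) / max(|s_i|, |s_j|),
    # over whitespace-tokenised sentences.
    a, b = s_i.split(), s_j.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

On the running example, the pair [How long does the flight last ?] / [How long does a cold last ?] differs by two substitutions out of seven tokens, giving FM ≈ 0.71.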
Since we did not find an efficient TM fuzzy matching library, we implemented an efficient and parameterizable algorithm in C++ based on suffix arrays (Manber and Myers, 1993). Even though a retrieved TM entry may be of great help when translating the input sentence, it can receive a low score (here, 1 − 5/12 ≈ 0.583) because of the multiple insertion/deletion operations needed. We thus introduce a second lexicalised similarity measure that focuses on finding the longest n-gram overlap between sentences.
N-gram Matching We define the N-gram matching score NM(s_i, s_j) between s_i and s_j as:

NM(s_i, s_j) = |max(S(s_i) ∩ S(s_j))|

where S(s) denotes the set of n-grams in sentence s, max(q) returns the longest n-gram in the set q, and |r| is the length of the n-gram r. For N-gram matching retrieval we also use our in-house open-sourced toolkit.
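The NM score reduces to the length of the longest n-gram shared by the two sentences, which can be sketched as follows (illustrative only; the toolkit uses suffix arrays rather than this brute-force enumeration):

```python
def ngrams(tokens):
    # S(s): all n-grams (as tuples) of a token list, n = 1..len(tokens).
    return {tuple(tokens[i:i + n])
            for n in range(1, len(tokens) + 1)
            for i in range(len(tokens) - n + 1)}

def ngram_match_score(s_i, s_j):
    # NM(s_i, s_j) = |max(S(s_i) ∩ S(s_j))|: length of the longest
    # n-gram appearing in both sentences (0 if none is shared).
    common = ngrams(s_i.split()) & ngrams(s_j.split())
    return max((len(g) for g in common), default=0)
```

For the running example, [How long does a cold last ?] and [How long does a vaccine work ?] share the 4-gram "How long does a", so NM = 4.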

Distributed Representations
Current research on sentence similarity measures has made tremendous advances thanks to distributed word representations computed by neural networks. In this work, we use sent2vec (Pagliardini et al., 2018) to generate sentence embeddings. The network implements a simple but efficient unsupervised objective to train distributed representations of sentences. The authors report state-of-the-art sentence representations on multiple benchmark tasks, in particular for unsupervised similarity evaluation. We define the similarity score EM(s_i, s_j) between sentences s_i and s_j via the cosine similarity of their distributed representations h_i and h_j:

EM(s_i, s_j) = (h_i · h_j) / (‖h_i‖ ‖h_j‖)

where ‖h‖ denotes the magnitude of vector h.
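The cosine-based EM score is straightforward; a dependency-free sketch over plain Python vectors (in practice the embeddings come from sent2vec):

```python
import math

def cosine_similarity(h_i, h_j):
    # EM(s_i, s_j) = (h_i . h_j) / (||h_i|| ||h_j||)
    # over the distributed representations of the two sentences.
    dot = sum(x * y for x, y in zip(h_i, h_j))
    norm_i = math.sqrt(sum(x * x for x in h_i))
    norm_j = math.sqrt(sum(y * y for y in h_j))
    return dot / (norm_i * norm_j)
```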
To implement fast retrieval between the input vector representation and the corresponding vectors of sentences in the TM, we use the faiss toolkit (Johnson et al., 2019).
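An exact FlatIP index amounts to a brute-force maximum inner product search over the TM vectors; the retrieval logic can be sketched without the faiss dependency as follows (faiss performs the same computation with highly optimised kernels):

```python
def flat_ip_search(index_vectors, query, k=1):
    # Brute-force maximum inner product search, mirroring what an exact
    # flat inner-product index does: score every TM vector against the
    # query and return the k best (score, position) pairs.
    scored = [(sum(x * y for x, y in zip(v, query)), pos)
              for pos, v in enumerate(index_vectors)]
    return sorted(scored, reverse=True)[:k]
```

With normalised embeddings, the inner product equals the cosine similarity, so the top-scored position is the EM match.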

Related Words in TM Matches
Given an input sentence s, retrieving TM matches consists of identifying the TM entry (s_k, t_k) for which s_k shows the highest matching score. However, with the exception of perfect matches, not all words in s_k or s are present in the match. Considering the example in Section 2, the words the flight and a cold are not related to each other; it follows that the TM target words le vol are irrelevant for the task at hand. In this section we discuss an algorithm to identify the set of target words T ⊆ t_k that are related to words of the input sentence s. We define T as the set of words in t_k that are aligned to at least one word in S and to no word outside S:

T = {t ∈ t_k : ∃ s′ ∈ S, (s′, t) ∈ A and ∄ s″ ∉ S, (s″, t) ∈ A}

where A is the set of word alignments between words in s_k and t_k, and S is the LCS (Longest Common Subsequence) set of words shared by s_k and s. The LCS is computed as a by-product of the edit distance (Paterson and Dančík, 1994); S is thus obtained as a by-product of computing fuzzy or n-gram matches. Word alignments are obtained with fast_align (Dyer et al., 2013).
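The selection of T can be sketched directly from this definition, working over word positions (the alignment pairs below are illustrative, standing in for fast_align output):

```python
def related_target_words(alignments, lcs_source_positions, target_len):
    # T: target positions aligned to at least one source position in S
    # and to no source position outside S.
    # alignments: iterable of (source_pos, target_pos) pairs.
    S = set(lcs_source_positions)
    aligned_to = {j: set() for j in range(target_len)}
    for i, j in alignments:
        aligned_to[j].add(i)
    return {j for j, srcs in aligned_to.items() if srcs and srcs <= S}
```

On the N-gram match example below, with S covering "How long does a" and "dure" aligned to "work" (outside S), the function keeps exactly the positions of Combien, de, temps and un.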

The TM source sentence s_k of the fuzzy matching example has an LCS set of 5 words, S = {How, long, does, last, ?}. The set of related target words T is also composed of 5 words, {Combien, de, temps, dure, ?}, all aligned to at least one word in S and to no other word. The N-gram match example has an LCS set of 4 words, S = {How, long, does, a}, while the related target words are T = {Combien, de, temps, un}. The target word dure is not part of T as it is aligned to work, and work ∉ S. Notice that the sets S and T actually consist of collections of indices (word positions in their corresponding sentences); word strings are used in the examples above to facilitate reading.

Integrating TM into NMT
We retrieve fuzzy, n-gram and sentence embedding matches as detailed in the previous section, and explore various ways to integrate the matches into the NMT workflow. We follow the work of Bulté and Tezcan (2019), where the input sentence is augmented with the translation retrieved from the TM showing the highest matching score (FM, NM or EM). One special integration of fuzzy matching, denoted FM#_T, rescores fuzzy matches based on the target-side edit distance. This special integration, which is only performed on training data, is discussed in the Target Fuzzy Matches section. Figure 2 illustrates the main integration techniques considered in this work and detailed below. The input English sentence [How long does the flight last?] is augmented in different ways. For each alternative we show: the TM (English) sentence producing the match, and the augmented input sentence with the corresponding TM (French) translation. Note that LCS words are displayed in boldface.

FM# We implement the same format as detailed in Bulté and Tezcan (2019). The input English sentence is concatenated with the French translation of the (highest-scored) fuzzy match as computed by FM(s_i, s_j). The token ∥ is used to mark the boundary between both sentences.

FM* We modify the previous format by masking the French words that are not related to the input sentence: sequences of unrelated tokens are replaced by the ∥ token. The mechanism used to identify related words is detailed in Section 2.2.
FM+ As a variant of FM*, we now mark the target words that are not related to the input sentence, in an attempt to help the network identify those target words that need to be copied into the hypothesis. To do so, we use an additional input stream (also called factors) that gives the network access to the entire target sentence. The tokens used in this additional stream are: S for source words, R for unrelated target words and T for related target words.
NM+ In addition to fuzzy matches, we also consider arbitrarily large n-gram matches. We use the same format as for FM+, but considering the highest-scored n-gram match as computed by NM(s_i, s_j).
EM+ Finally, we also retrieve the most similar TM sentences as computed by EM(s_i, s_j). In this case, marking the words that are not related to the input sentence is not necessary, since similar sentences retrieved following the EM score do not necessarily present any lexical overlap, as the example in Figure 2 illustrates.
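To make the factored representation concrete, a minimal sketch of FM+-style input construction follows. The separator's factor tag (R) is taken from the examples in this paper; subword segmentation and the exact handling in the released toolkit may differ:

```python
def augment_fm_plus(src_tokens, tm_tgt_tokens, related_positions, sep="∥"):
    # Build the augmented word stream: source tokens, a boundary token,
    # then the full TM target translation.
    words = src_tokens + [sep] + tm_tgt_tokens
    # Parallel factor stream: S for source tokens, T for related target
    # tokens, R for unrelated target tokens (and for the separator).
    factors = (["S"] * len(src_tokens) + ["R"]
               + ["T" if j in related_positions else "R"
                  for j in range(len(tm_tgt_tokens))])
    return words, factors
```

Both streams have the same length, so each word embedding can be concatenated with its factor embedding at the network input.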

[Figure 2 (excerpt): augmented inputs with their factor streams.
NM+ — TM match: How long does a vaccine work ?
Augmented input: How long does a cold last ? ∥ Combien de temps dure un vaccin ?
Factors: S S S S S S S R T T T T R R T
EM+ — TM match: What is the duration of flu symptoms ?
Augmented input: How long does a cold last ? ∥ Quelle est la durée de la grippe ?
Factors: S S S S S S S R T T T R T R R]

Experimental Framework

Corpora and Evaluation
We used the following corpora in this work (Tiedemann, 2012): Proceedings of the European Parliament (EPPS); News Commentaries (NEWS); TED talk subtitles (TED); parallel sentences extracted from Wikipedia (Wiki); documentation from the European Central Bank (ECB); documents from the European Medicines Agency (EMEA); legislative texts of the European Union (JRC); localisation files (GNOME, KDE4 and Ubuntu); and manual texts (PHP). Detailed statistics are provided in Appendix A. We randomly split each corpus, keeping 500 sentences for validation, 1,000 sentences for testing and the rest for training. All data is preprocessed using the OpenNMT tokenizer (conservative mode). We train a 32K joint byte-pair encoding (BPE) (Sennrich et al., 2016b) and use a joint vocabulary for both source and target. Our NMT model follows the state-of-the-art Transformer base architecture (Vaswani et al., 2017) as implemented in the OpenNMT-tf toolkit (Klein et al., 2017). Further configuration details are given in Appendix B.

TM Retrieval
We perform fuzzy matching, ignoring exact matches, and keep the single best match if FM(s_i, s_j) ≥ 0.6, with no approximation. Similarly, the largest N-gram match is used for each test sentence with a threshold NM(s_i, s_j) ≥ 5. A similarity threshold EM(s_i, s_j) ≥ 0.8 is also applied when retrieving similar sentences using distributed representations. The EM model is trained on the source-side training data with default fasttext parameters, 200 dimensions and 20 epochs. The faiss search toolkit is used through its Python API with an exact FlatIP index. Building and retrieval times for each algorithm on a 2M-sentence translation memory (Europarl corpus) are provided in Table 1. Note that all retrieval algorithms are significantly faster than NMT Transformer decoding, implying a very limited decoding overhead.

Results
We compare our baseline model, trained without augmented input sentences, to different augmentation formats and retrieval methods. Our base model is built using the concatenation of all the original corpora. All other models extend the original corpora with sentences retrieved following the various retrieval methods. It is worth noting that the extended bitexts share their target side with the original data. The best scores are obtained by models using augmented inputs, except for corpora not suited for translation memory usage (News and TED), for which we observe no gains, correlating with low matching rates. For the other corpora, large gains are achieved when evaluating test sentences with matches (up to +19 BLEU on the GNOME corpus), while a very limited decrease in performance is observed for sentences without matches. This slight decrease likely comes from the fact that we kept the corpus size and number of iterations identical while giving the model a harder training task. These results are fully in line with the findings of Bulté and Tezcan (2019).
All types of matching prove suitable, showing accuracy gains; fuzzy matching in particular seems to be the best for our task.

Target fuzzy matches To evaluate whether fuzzy match quality is really the primary criterion for the observed improvements, we consider FM#_T, where the fuzzy matches are rescored (on the training set only) with the edit distance between the reference translation and the target side of the fuzzy match. By doing so, we reduce the average source-side FM score by about 2%, but increase the target-side match from 61% to 69%.
The effect can be seen in Table 2.

Unseen matches In the previous experiments, matches were built over domain corpora that were already used to train the model. This is a common use case: the same translation memory used to train the system is used at run time. We now evaluate our model in a different context, where the test set to be translated comes with a new TM that was never seen when training the original model. This use case corresponds to a typical translation task where new entries are continuously added to the TM and shall be used instantly for the translation of subsequent sentences. Hence, we only use the EPPS, News, TED and Wiki data to build two models: the first employs only the original source and target sentences (base); the second learns to use fuzzy matches (FM+). As can be seen, the model using fuzzy matches shows clear accuracy gains. This confirms that the gains obtained by FM+ are not limited to remembering examples previously "seen" during training. The model using fuzzy matches has acquired the ability to actually copy or recycle words from the provided fuzzy matches and is therefore suitable for adaptive translation workflows. Note that all scores are lower than those shown in Table 2, as a result of discarding all in-domain data when training the models; this also shows that the online use of a translation memory is not a substitute for in-domain model fine-tuning, as we further investigate in the Fine Tuning section.
Combining matching algorithms Next, we evaluate the ability of our NMT models to combine different matching algorithms. We use ⊖(M_1, M_2, …) to denote the augmentation of an input sentence that first considers the match specified by M_1; if no match applies to the input sentence, it considers the match specified by M_2, and so on. Note that at most one match is used, and sentences for which no match is found are kept without augmentation. As in Table 2, models are learned using all the available training data. Table 3 (2nd block) shows the results of this experiment. The first 3 lines give BLEU scores of models combining FM+, NM+ and EM+. The last row gives the results of a model that learns to use two different matching algorithms: we take the best combination of matches obtained so far (FM+ and EM+) and augment input sentences with both matches, denoted ⊕(FM+, EM+). Figure 3 illustrates an example of an input sentence augmented with both a fuzzy match and an embedding match. Notice that the model is able to distinguish between both types of augmented sequences by looking at the token used in the additional stream (factor). As can be seen in Table 3 (2nd block), the best combination of matches is achieved by ⊕(FM+, EM+), further boosting the performance of the previous configurations; it is only surpassed by ⊖(FM+, EM+) on two test sets, by a slight margin.
Fine Tuning The results so far evaluate the ability of NMT models to integrate similar sentences. However, we have run our comparisons over a "generic" model built from a heterogeneous training data set, while it is well known that such models do not achieve the best performance on homogeneous test sets. Thus, we now assess the capability of our augmentation methods to enhance fine-tuned models (Luong and Manning, 2015), a well-known technique commonly used in domain adaptation scenarios to obtain state-of-the-art results. Table 3 illustrates the results of the model configurations previously described after fine-tuning the models towards each test set domain, thus building 7 fine-tuned models for each configuration. Note that similar sentences (matches) are retrieved from the same in-domain data sets used for fine-tuning.

[Figure 3: How long does a cold last ? ∥ Combien de temps dure le vol ? ∥ Combien de temps dure un vaccin ?]

As shown in Table 3 (3rd block), models with FM/EM also increase the performance of fine-tuned models, gaining on average +6 BLEU over the fine-tuned baselines, and +2.5 compared to FM/EM on generic translation. This add-up effect is interesting since both approaches make use of the same data.

Copy vs. Context We observe that models with augmented input sentences effectively learn to output the target words provided as augmented translations. Table 5 reports the usage rates: for each word added to the input sentence and tagged T (part of a lexicalised match), R (not in the match) or E (from an embedding match), we compute how often it appears in the translated sentence. Results show that T words increase their usage rate by more than 10% compared to the corresponding base models. For R words, models incorporating fuzzy matches also increase the usage rate compared to base models, albeit at lower rates than for T words. Furthermore, the number of R words output by FM+ is clearly lower than that output by FM#, demonstrating the effect of marking unrelated matching words. We can thus confirm the copy behaviour of the networks with lexicalised matches. Words marked as E (embedding matches) increase their usage rates compared to base models, but remain far from the rates of T words. We hypothesize that these sentences are not copied by the translation model; rather, they are used to further contextualise translations.
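The usage-rate statistic can be made precise with a small sketch (the data structures below are hypothetical; the actual evaluation scripts are not described in this paper):

```python
def usage_rates(augmented_examples):
    # For each factor tag, measure how often a token carrying that tag in
    # the augmented input also appears in the model's output translation.
    # augmented_examples: iterable of (words, factors, hypothesis_tokens).
    counts = {}  # factor -> (appeared, total)
    for words, factors, hypothesis in augmented_examples:
        hyp = set(hypothesis)
        for w, f in zip(words, factors):
            appeared, total = counts.get(f, (0, 0))
            counts[f] = (appeared + (w in hyp), total + 1)
    return {f: appeared / total
            for f, (appeared, total) in counts.items()}
```

A high rate for T words and a low rate for R words is what the copy behaviour predicts; E words falling in between supports the context interpretation.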

Related Work
Our work stems from the technique proposed by Bulté and Tezcan (2019) to train an NMT model to leverage fuzzy matches inserted into the source sentence. We extend the concept by experimenting with more general notions of similar sentences and with new techniques to inject fuzzy matches. The use of similar sentences to improve translation models has been explored at scale by Schwenk et al. (2019), where the authors use multilingual sentence embeddings to retrieve pairs of similar sentences and train models solely on such sentences. In Niehues et al. (2016), input sentences are augmented with pre-translations produced by a phrase-based MT system. In our approach, similar sentence translations are provided dynamically to guide the translation of a given sentence.
Similar to our work, Farajian et al. (2017) and Li et al. (2018) retrieve similar sentences from the training data to dynamically adapt to individual input sentences. To compute similarity, the first work uses n-gram matches, while the second also includes dense vector representations. In Xu et al. (2019) the same approach is followed, but the authors consider a set of semantically related input sentences for adaptation, in order to reduce adaptation time.
Our approach combines source and target words within the same sentence; the same type of approach has also been proposed by Dinu et al. (2019) for the injection of terminology translations.
Lastly, the extra tokens appended to augmented sentences can also be seen as "side constraints" activating different translation paths, in the same spirit as the work of Sennrich et al. (2016a) and Kobus et al. (2017) on controlling translation.

Conclusions and Further Work
This paper explores augmentation methods for boosting Neural Machine Translation performance by using similar translations.
Building on the "neural fuzzy repair" technique, we introduce a tighter integration of fuzzy matches that informs the neural network of both source and target sides, and we propose an extension to similar translations retrieved from their distributed representations. We show that the different types of similar translations and model fine-tuning provide complementary information to the neural model, consistently and significantly outperforming previous work. We perform data augmentation at inference time with negligible speed overhead, and we release an open-source toolkit with an efficient and flexible fuzzy-match implementation.
In future work, we plan to optimise the thresholds used with the retrieval algorithms in order to more intelligently select the translations providing the richest information to the NMT model, and to generalize the use of edit distance on the target side.
We would also like to explore better techniques to inject information from small n-grams, possibly converging with terminology injection techniques towards a unified framework where target clues are mixed with the source sentence during translation. As regards distributed representations, we plan to study alternative networks to more accurately model the identification and incorporation of additional context.

A Corpora Statistics

Table 6: Corpora statistics. K stands for thousands and Lmean is the average length in words.

B NMT Network Configuration
We use the following set of hyper-parameters: word embedding size: 512; hidden layer size: 512; inner feed-forward layer size: 2,048; number of heads: 8; number of layers: 6; batch size: 4,096 tokens. Note that when using factors (FM+, NM+ and EM+), the final word embedding is built by concatenating the word embedding (508 cells) and the additional factor embedding (4 cells).
We use the Lazy Adam optimiser, with 4,000 warmup steps and a learning rate update every 8 iterations. Models are optimised for 300K iterations. Fine-tuning continues Adam with the same learning rate decay schedule until convergence on the validation set. All models are trained on a single NVIDIA P100 GPU. We limit the target sentence length to 100 tokens; the source sentence is limited to 100, 200 or 300 tokens depending on the number of sentences used to augment the input. We use a joint vocabulary of 32K for both source and target sides. At inference we use a beam size of 5. For evaluation, we report BLEU scores computed by multi-bleu.perl.

C Example of Embedding Matching
Examples of sentence pairs retrieved through embedding matching, with their EM scores:

(a) Gas supply to power producers (CCGTs)

(0.87)
The Commission shall provide the chairman and the secretariat for these working parties.
The Commission shall provide secretariat services for the Forum, the Bureau and the working parties.

(0.93)
Admission to a course of training as a pharmacist shall be contingent upon possession of a diploma or certificate giving access, in a Member State, to the studies in question, at universities or higher institutes of a level recognised as equivalent.
Admission to basic dental training presupposes possession of a diploma or certificate giving access, for the studies in question, to universities or higher institutes of a level recognised as equivalent, in a Member State.