Selecting Artificially-Generated Sentences for Fine-Tuning Neural Machine Translation

Neural Machine Translation (NMT) models tend to achieve the best performance when larger sets of parallel sentences are provided for training. For this reason, augmenting the training set with artificially-generated sentence pairs can boost performance. Nonetheless, performance can also be improved with a small number of sentences if they are in the same domain as the test set. Accordingly, we explore the use of artificially-generated sentences along with data-selection algorithms to improve NMT models trained solely with authentic data. In this work, we show how artificially-generated sentences can be more beneficial than authentic pairs, and what their advantages are when used in combination with data-selection algorithms.


Introduction
The data used for training Machine Translation (MT) models consist mainly of a set of parallel sentences (a set of sentence-pairs in which each sentence is paired with its translation). As Neural Machine Translation (NMT) models typically achieve best performance when using large sets of parallel sentences, they can benefit from the sentences created by Natural Language Generation (NLG) systems. Although artificial data is expected to be of lower quality than authentic sentences, it still can help the model to learn how to better generalize over the training instances and produce better translations.
A popular technique for creating artificial data is back-translation (Sennrich et al., 2016a; Poncelas et al., 2018c). This consists of generating sentences in the source language by translating monolingual sentences in the target language. Then, the sentences in both languages are paired and can be used to augment the original parallel training set in order to build better NMT models.
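The back-translation procedure described above can be sketched as follows. This is a minimal illustration, not the setup used in the experiments: `reverse_translate` is a hypothetical stand-in for a target-to-source MT model, and the dictionary-based toy translator exists only to make the example runnable.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each target sentence with a machine-generated source sentence."""
    return [(reverse_translate(tgt), tgt) for tgt in target_monolingual]


def augment(authentic: List[SentencePair],
            synthetic: List[SentencePair]) -> List[SentencePair]:
    """Append the synthetic pairs to the authentic parallel set."""
    return list(authentic) + list(synthetic)


# Hypothetical toy English-to-German "model": a word-by-word dictionary lookup.
TOY_EN_DE = {"the": "das", "house": "Haus", "is": "ist", "small": "klein"}


def toy_reverse_translate(sentence: str) -> str:
    return " ".join(TOY_EN_DE.get(tok, tok) for tok in sentence.split())


authentic = [("das Buch ist alt", "the book is old")]
synthetic = back_translate(["the house is small"], toy_reverse_translate)
training_set = augment(authentic, synthetic)
print(training_set)
```

Note that the target side of every synthetic pair is a human-produced sentence; only the source side is machine-generated.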
Nonetheless, if the synthetic data are not in the same domain as the test set, they can also hurt performance. For this reason, we explore an alternative approach to make better use of the artificially-generated training instances to improve NMT models. In particular, we propose that, instead of blindly adding back-translated sentences to the training set, they be considered as candidate sentences for a data-selection algorithm that decides which sentence pairs should be used to fine-tune the NMT model. By doing that, rather than merely increasing the number of training instances, the generated sentences provide us with more chances of obtaining relevant parallel sentences (while still using smaller sets for fine-tuning).
As we want to build task-specific NMT models, in this work we explore two data-selection algorithms that are classified as Transductive Algorithms (TA): Infrequent N-gram Recovery (INR) and Feature Decay Algorithms (FDA). These methods use the test set S_test (the document to be translated) as the seed to retrieve sentences. In transductive learning (Vapnik, 1998) the goal is to identify the best training instances to learn how to classify a given test set. In order to select these sentences, the TAs search for those n-grams in the test set that are also present in the source side of the candidate sentences.
Although augmenting the candidate pool with more sentences should be beneficial, the TAs select sentences based on overlapping n-grams, so the mistakes produced by the model used for back-translation (those commonly addressed in NLG, such as the generated word order or word choice) can be a disadvantage.
In this work, we explore whether TAs are more inclined to select authentic or artificial sentences. In addition, we propose three different methods of how they can be combined into a single hybrid set. Finally, we investigate whether the hybrid sets retrieved by TAs can be more useful than the authentic set of sentences to fine-tune NMT models.

Related Work
The work presented in this paper is based on two main concepts: the generation of synthetic sentences, and the selection of sentences from a set S of candidates.

Use of Artificially-Generated Data to Improve MT Models
The proposal of Sennrich et al. (2016a) showed that NMT models can be improved by back-translating a set of (monolingual) sentences in the target side into the source side using an MT model. Other uses of monolingual target-side sentences include building the parallel set by using a NULL token in the source side (Sennrich et al., 2016a) or creating language models to improve the decoder (Gülçehre et al., 2015). Hoang et al. (2018) improve the model used for back-translation by training it with increasing amounts of artificial sentences, so that the models are iteratively improved and create artificial sentences of better quality.
Similarly to this paper, the use of artificially-generated sentences to fine-tune models has also been explored by Chinea-Rios et al. (2017), who select monolingual authentic sentences in the source side and translate them into the target language, and by Poncelas et al. (2019a), who use back-translated sentences only to adapt the models.

Adaptation of NMT Models to the Test Set
The improvement of NMT models can be performed by fine-tuning (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016), i.e. training the models for additional epochs using a small set of in-domain data. Alternatively, van der Wees et al. (2017) train models using smaller but more in-domain sentences in each epoch of the training process.
The use of the test set to retrieve relevant sentences for fine-tuning the model has been explored by Li et al. (2016), who adapt a different model for each sentence in the test set, and by Poncelas et al. (2018b, 2019b), who adapt the model for the complete test set using transductive data-selection algorithms.

Transductive Algorithms
In this paper, the sentences used to fine-tune the model are retrieved using INR and FDA. These methods select sentences by scoring each sentence s from the candidate pool S, and adding the one with the highest score to a selected pool L. This process is performed iteratively until the selected pool contains N sentences.
Infrequent n-gram Recovery (INR) (Parcheta et al., 2018; Gascó et al., 2012): This method selects those sentences that contain n-grams from the test set that are infrequent (ignoring frequent words such as stop words or general-domain terms). A candidate sentence s ∈ S is scored according to the number of infrequent n-grams shared with the set of sentences of the test set S_test, computed as in (1):

$$\mathrm{score}(s, S_{test}, L) = \sum_{ngr \in \{S_{test} \cap s\}} \max(0,\, t - C_L(ngr)) \tag{1}$$

where C_L(ngr) is the number of occurrences of the n-gram ngr in the selected pool L, and t is the threshold that indicates the number of occurrences of an n-gram to be considered infrequent. If the number of occurrences of ngr in the selected pool, C_L(ngr), reaches the threshold t, then the component max(0, t − C_L(ngr)) is 0 and so the n-gram does not contribute to scoring the sentence.
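The INR scoring and the iterative selection loop can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: sentences are whitespace-tokenized, n-grams of orders 1 to 3 are used, only the source side of each candidate is scored, and the loop stops early once no remaining candidate contributes an infrequent n-gram. The helper names (`ngrams`, `inr_score`, `inr_select`) are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(sentence: str, max_order: int = 3) -> List[NGram]:
    """All n-grams of orders 1..max_order of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_order + 1)
        for i in range(len(tokens) - n + 1)
    ]


def inr_score(candidate: str, test_ngrams: Set[NGram],
              selected_counts: Counter, t: int) -> int:
    """Sum of max(0, t - C_L(ngr)) over the n-grams shared with the test set."""
    shared = set(ngrams(candidate)) & test_ngrams
    return sum(max(0, t - selected_counts[ngr]) for ngr in shared)


def inr_select(candidates: Iterable[str], test_sentences: Iterable[str],
               n: int, t: int) -> List[str]:
    """Greedily move the best-scoring candidate into the selected pool L."""
    test_ngrams = {ngr for s in test_sentences for ngr in ngrams(s)}
    selected_counts: Counter = Counter()  # C_L: n-gram counts in the selected pool
    pool = list(candidates)
    selected: List[str] = []
    while pool and len(selected) < n:
        best = max(pool, key=lambda s: inr_score(s, test_ngrams, selected_counts, t))
        # Assumption: stop once no candidate contributes an infrequent n-gram.
        if inr_score(best, test_ngrams, selected_counts, t) == 0:
            break
        pool.remove(best)
        selected.append(best)
        selected_counts.update(ngrams(best))
    return selected


test_set = ["the cat sat"]
candidates = ["the cat sat down", "a dog barked", "the cat"]
print(inr_select(candidates, test_set, n=3, t=1))
print(inr_select(candidates, test_set, n=3, t=2))
```

With t = 1 the sentence "the cat" is no longer selected after "the cat sat down" has entered the pool, because all of its shared n-grams have already reached the threshold. This also illustrates why INR may return fewer than the requested N sentences.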
Feature Decay Algorithms (FDA) (Biçici and Yuret, 2011; Biçici, 2013) also retrieve those sentences sharing the highest number of n-grams from the test set. However, in order to increase the variability and avoid selecting the same n-grams repeatedly, those that have already been selected are penalized according to their number of occurrences in L. The score of a sentence is computed as in (2):

$$\mathrm{score}(s, S_{test}, L) = \frac{\sum_{ngr \in \{S_{test} \cap s\}} 0.5^{C_L(ngr)}}{\mathrm{length}(s)} \tag{2}$$

where length(s) indicates the number of words in the sentence s. According to the equation, the more occurrences of ngr in L, the smaller its contribution is to the scoring of the sentence s.
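A minimal sketch of FDA-style scoring follows. The text above only states that the contribution of an n-gram shrinks with its number of occurrences in L and that the score is normalized by sentence length; the exponential decay with base 0.5 used here is an assumption, as are the whitespace tokenization, the n-gram orders 1 to 3, and the hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(sentence: str, max_order: int = 3) -> List[NGram]:
    """All n-grams of orders 1..max_order of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_order + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fda_score(candidate: str, test_ngrams: Set[NGram],
              selected_counts: Counter, decay: float = 0.5) -> float:
    """Decayed count of n-grams shared with the test set, divided by length."""
    length = len(candidate.split())
    if length == 0:
        return 0.0
    shared = set(ngrams(candidate)) & test_ngrams
    # Assumed decay: each prior occurrence in L multiplies the weight by `decay`.
    return sum(decay ** selected_counts[ngr] for ngr in shared) / length


def fda_select(candidates: Iterable[str], test_sentences: Iterable[str],
               n: int) -> List[str]:
    """Greedily select n sentences, updating the n-gram counts of L each step."""
    test_ngrams = {ngr for s in test_sentences for ngr in ngrams(s)}
    selected_counts: Counter = Counter()
    pool = list(candidates)
    selected: List[str] = []
    while pool and len(selected) < n:
        best = max(pool, key=lambda s: fda_score(s, test_ngrams, selected_counts))
        pool.remove(best)
        selected.append(best)
        selected_counts.update(ngrams(best))
    return selected


test_set = ["the cat sat"]
candidates = ["the cat sat", "the cat sat on the mat", "a dog barked"]
print(fda_select(candidates, test_set, n=2))
```

Unlike a hard threshold, the decay never removes an n-gram from consideration entirely, so the selection can always continue until N sentences have been retrieved.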

Models Adapted with Hybrid Data
In order to fine-tune models with hybrid data, we propose three methods of creating these sets: hybr, batch and online. These methods can be classified depending on whether the combination is performed before or after the execution of the TA.
Combine Before Selection. This approach consists of selecting from a hybrid set (hybr). This involves concatenating the authentic (S_auth) and artificial (S_synth) candidate sentences as a first step, and then executing the TAs on the new candidate set S_auth+synth.
Combine After Selection. Another approach is to force the presence of both authentic and synthetic sentences by using different proportions of TA-selected authentic (L_auth) and synthetic (L_synth) sentence pairs. We concatenate the top-(N * γ) sentences from the selected authentic set and the top-(N * (1 − γ)) from the selected synthetic set. The value of γ ∈ [0, 1] indicates the proportion of authentic sentences. For example, γ = 0.75 indicates that 75% of the sentences in the dataset are authentic and the remaining 25% are artificially generated. The selected synthetic set L_synth can be obtained by executing the TAs on the artificial candidate sentences S_synth (batch). This implies that the sentences will be retrieved by finding overlaps of n-grams between the test set and the artificial sentences.
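The γ-based mixing can be illustrated with a short sketch. It assumes that both selected lists are already ranked by the TA (best first); rounding N * γ to the nearest integer is also an assumption, since the text does not say how fractional sizes are handled. The function name is hypothetical.

```python
from typing import List, Sequence


def combine_after_selection(l_auth: Sequence[str], l_synth: Sequence[str],
                            n: int, gamma: float) -> List[str]:
    """Take the top N*gamma authentic and top N*(1-gamma) synthetic items."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    n_auth = round(n * gamma)  # assumption: round to the nearest integer
    n_synth = n - n_auth
    return list(l_auth[:n_auth]) + list(l_synth[:n_synth])


l_auth = [f"auth_{i}" for i in range(8)]    # TA-ranked authentic pairs
l_synth = [f"synth_{i}" for i in range(8)]  # TA-ranked synthetic pairs
mixed = combine_after_selection(l_auth, l_synth, n=8, gamma=0.75)
print(mixed)
```

With N = 8 and γ = 0.75, the result contains six authentic and two synthetic items.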
Alternatively, the retrieval may be carried out by finding overlaps in the target side (online), as these are human-produced sentences. However, as the test set is in the source language, we first need to generate an approximate translation of the test set with a general-domain MT model (Poncelas et al., 2018a,d). Unlike in batch, the advantage of this approach is that it is not necessary to generate the source side of the whole set of monolingual sentences, but only of those selected by the TA.
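The online variant can be sketched as a three-step pipeline. All components here are hypothetical stand-ins: `forward_translate` plays the role of the general-domain MT model, `back_translate` the role of the back-translation model, and `unigram_overlap_select` is a deliberately simple replacement for INR or FDA, used only to keep the example short.

```python
from typing import Callable, List, Tuple


def unigram_overlap_select(candidates: List[str], seed: List[str], n: int) -> List[str]:
    """Toy selector: rank candidates by the number of word types shared with the seed."""
    seed_vocab = {tok for s in seed for tok in s.split()}
    ranked = sorted(candidates,
                    key=lambda c: len(set(c.split()) & seed_vocab),
                    reverse=True)
    return ranked[:n]


def online_selection(
    test_source: List[str],
    target_monolingual: List[str],
    forward_translate: Callable[[str], str],
    back_translate: Callable[[str], str],
    n: int,
) -> List[Tuple[str, str]]:
    # 1. Approximate the test set in the target language to obtain the seed.
    seed = [forward_translate(s) for s in test_source]
    # 2. Select human-produced target sentences that overlap with the seed.
    selected = unigram_overlap_select(target_monolingual, seed, n)
    # 3. Back-translate only the selected sentences.
    return [(back_translate(t), t) for t in selected]


# Hypothetical word-by-word toy translators.
DE_EN = {"ein": "a", "Polizist": "policeman", "wurde": "was", "verletzt": "injured"}
EN_DE = {en: de for de, en in DE_EN.items()}
calls: List[str] = []  # records which sentences were back-translated


def toy_forward(sentence: str) -> str:
    return " ".join(DE_EN.get(tok, tok) for tok in sentence.split())


def toy_backward(sentence: str) -> str:
    calls.append(sentence)
    return " ".join(EN_DE.get(tok, tok) for tok in sentence.split())


monolingual = [
    "the market closed higher",
    "a policeman was injured yesterday",
    "two people were injured",
]
pairs = online_selection(["ein Polizist wurde verletzt"], monolingual,
                         toy_forward, toy_backward, n=1)
print(pairs, len(calls))
```

The `calls` list shows the saving relative to batch: the back-translation model is invoked once, for the selected sentence, rather than once per monolingual sentence.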

Data and Models Settings
We build German-to-English NMT models using the following datasets:
• Training data: German-English parallel sentences provided in WMT 2015 (Bojar et al., 2015) (4.5M sentence pairs).
The NMT models are built using the attentional encoder-decoder framework with OpenNMT-py (Klein et al., 2017). We use the default parameter values: 2-layer LSTM (Hochreiter and Schmidhuber, 1997) with 500 hidden units. The size of the vocabulary is 50,000 words for each language.
In order to retrieve sentences, we use the TAs with their default configuration (using n-grams of order 3 to find overlaps between the seed and the training data) to extract sets of 100K, 200K, and 500K sentences. We use a threshold of t = 40 for INR, although this causes INR to retrieve fewer than 500K sentences. Accordingly, the results shown for INR include only 100K and 200K sentences.

Back-Translation Generation Settings
In order to generate artificial sentences, we use an NMT model (which we refer to as the BT model) to back-translate sentences from the target language into the source language. This model is built by training with 1M sentences sampled from the training data, using the same configuration described above (but in the reverse language direction, English-to-German).
As we want to compare authentic and synthetic sentences, we back-translate the target-side of the training data using the BT model. By doing this we ensure both sets are comparable which allows us to perform a fair analysis of whether artificial sentences are more likely to be selected by a TA and which are more useful to fine-tune the models.
Note also that there are 1M sentences that have been generated by translating the same target-side sentences used to train the BT model. This could cause the generated sentences to be exactly the same as the authentic ones. However, this is not always the case, as we report in Section 5.2.
First of all, we present in Table 1 the performance of the model trained with all data for 13 epochs (BASE13), as this is when the model converges. We also show the performance of the model when fine-tuning the 12th epoch with the subset of (authentic) data selected by INR (INR column) and FDA (FDA column).
In order to evaluate the performance of the models, we present the following evaluation metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006), METEOR (Banerjee and Lavie, 2005), and CHRF (Popovic, 2015). These metrics provide an estimation of the translation quality when the output is compared to a human-translated reference. Note that in general, the higher the score, the better the translation quality is. The only exception is TER which is an error metric and so lower results indicate better quality.
In addition, we indicate in bold those scores that show an improvement over the baseline (in Table 1 we use BASE13 as the baseline) and add an asterisk if the improvements are statistically significant at p=0.01 (using Bootstrap Resampling (Koehn, 2004), computed with multeval (Clark et al., 2011)).
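The idea behind a paired bootstrap test can be sketched as follows. This is a simplified illustration and not a description of what multeval computes: it assumes per-sentence quality scores that can be summed over a resampled test set, whereas corpus-level metrics such as BLEU must be recomputed on each resample. The function name, the number of resamples and the fixed seed are all hypothetical choices.

```python
import random
from typing import Sequence


def paired_bootstrap_p_value(scores_base: Sequence[float],
                             scores_sys: Sequence[float],
                             n_samples: int = 1000,
                             seed: int = 0) -> float:
    """Fraction of resampled test sets on which the system does NOT beat the baseline."""
    if len(scores_base) != len(scores_sys):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_base)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement; use the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_sys[i] for i in idx) > sum(scores_base[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_samples


base = [0.20, 0.30, 0.25, 0.40]
better = [0.30, 0.40, 0.35, 0.50]
print(paired_bootstrap_p_value(base, better))
print(paired_bootstrap_p_value(base, base))
```

Under this scheme, an improvement would be reported as significant at p=0.01 when the returned value is below 0.01.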
In the table, we can see that using a small subset of data for training the 13th epoch can improve the performance of the model. In the following experiments, we want to examine whether augmenting the candidate set with synthetic data can further boost these improvements. For this reason, we use INR and FDA as baselines.
In the first set of experiments we explore the hybr approach, i.e. the TAs are executed on a mixture of authentic and synthetic data (combined before the execution of the TA). We compare the results of the models trained with these sets with those presented in Table 1.
The results in the tables show that increasing the size of the candidate pool is beneficial. We see that most scores are better (marked in bold) than the model fine-tuned with only authentic data. However, the performance is also dependent on the domain. When comparing BIO and NEWS subtables we see that the models adapted for the latter domain tend to achieve better performances as most of the scores are statistically significant improvements.
When analyzing the selected dataset we find that the authentic sentences constitute slightly above half (between 51% and 64% of the sentences). This is an indicator that artificially-generated sentences contain n-grams that can be found by the TAs and are as useful as authentic sentences.
In addition, the amount of duplicated target-side sentences is very low (between 10% and 13%). This indicates that the MT-generated sentences contain n-grams that are different from the authentic counterpart, which increases the variety of the candidates that are useful for the TA to select.
In Table 4 and Table 5 we present the results of the models when fine-tuned with a combination of authentic and synthetic data following the Combine After Selection approaches in Section 3.1. The tables are structured in two subtables showing the results of the batch and online approaches. Each subtable presents the results for three values of γ: 0.75, 0.50 and 0.25.
In these tables, we see that the performance of the models following the batch and online approaches is similar. These results are also in accord with those obtained following the hybr approach, as the improvements depend more on the domain (most evaluation scores in the NEWS test set indicate statistically significant improvements whereas for the BIO test set most of them do not) than on the TA used or the value of γ. Although the best scores tend to be obtained when γ = 0.50, this is not always the case, and moreover we can find experiments in which using high amounts of synthetic sentences (i.e. γ = 0.25) achieves better results than using a higher proportion of authentic sentences. For instance, in the BIO subtable of Table 4, using 100K sentences with the online γ = 0.25 approach, the improvements are statistically significant for two evaluation metrics whereas in the other experiments in that row they are not.
When analyzing the translations produced by these models we find several examples in which the translations of models fine-tuned with hybrid data are superior to those tuned with authentic sentences. An example of this is the sentence in the NEWS test set "nach Krankenhausangaben wurde ein Polizist verletzt." (in the reference, "according to statements released by the hospital, a police officer was injured."). This sentence is translated by the INR and FDA models (those fine-tuned with 100K authentic sentences) as "a policeman was injured after hospital information.". We see that these models translate the word "nach" with its literal meaning ("after") whereas in this context ("nach Krankenhausangaben") it should have been translated as "according to", as stated in the reference.
In the hybrid models, we see that the same sentence has been translated as "according to hospital information, a policeman was injured." (in this case the models fine-tuned with hybrid data have produced the same translations). The models tuned with hybrid data are capable of producing the n-gram "according to" which is the same as the reference.
In the selected data, the only sentence containing the n-gram "nach Krankenhausangaben" is the authentic sentence presented in the first row of Table 6 (selected by every execution of the TAs). As we see, this is a noisy sentence as the target side does not correspond to an accurate translation (observe that in the source sentence we cannot find names such as "La Passione" or "Carlo Mazzacurati" that are present in the English side). Accordingly, using this sentence in the training of the NMT model is harmful.

Analysis of Back-translated Sentences
We find many cases where artificially-generated data is more useful for NMT models than authentic translations. In Table 6 we show some examples.
In rows 1 and 2 we present sentences in which the artificial sentence (German (synth) column) is a better translation than the authentic counterpart. In addition to the example described previously (the example of the first row), we also see in row 2 that the authentic candidate pair is ("die Veranstalter haben viele Konzerte und Recitale geplant. Es wird für uns eine vorzügliche Gelegenheit sein Ihre Freizeit angenehm zu gestalten und Sie für die ernste Musik zu gewinnen.","every participant will play at least one programme.") whereas the synthetic counterpart is the pair ("jeder Teilnehmer wird mindestens ein Programm spielen.","every participant will play at least one programme."). In this case, it is preferable to use the synthetic sentence for training instead of the authentic as it is a more accurate translation (observe that the authentic German side consists of two sentences and it is longer than the English-side).
We also present a case in which neither the authentic nor the artificial sentence is a proper translation of the target-side sentence, so both would hurt the performance of NMT if used for training. In row 3 there is a noisy sentence that should not have been included, as the target side is not English but French. The TAs search for n-grams in the source side, and in this case the artificial sentence consists of a sequence of dots (the BT model has not been able to translate the French sentence). This prevents the TA from selecting it, whereas the authentic sentence pair could be selected as it is a natural German sentence.
Surprisingly, this correction of inaccurate translations can also be seen in the set of sentences that have been used for training the BT model. As this model does not overfit, when it is provided with the same target sentence used for training, it is capable of generating different valid translations. For example, in row 4 of Table 6 we see the pair ("die Preise liegen zwischen 32.000 und 110.000 Won.","the first evening starts with a big parade of all participants through the city towards the beach.") which is one of the sentence pairs used for training the BT model. This is a noisy sentence (see, for instance, that the English side does not include the numbers). However, the sentence generated by the BT model is "der erste Abend beginnt mit einer großen Parade aller Teilnehmer durch die Stadt zum Strand." which is a more accurate translation than the sentence used for training the model that generates it.

Conclusion and Future Work
In this work, we have presented how artificially generated sentences can be used to augment a set of candidate sentences so data-selection algorithms have a wider variety of sentences to select from. The TA-selected sets have been evaluated by using them to fine-tune NMT models. We have presented three methods of creating such hybrid data: (i) by allowing the TA to decide whether to select authentic or synthetic data (hybr); (ii) by performing independent executions of the TA on authentic and synthetic sets (batch); and (iii) using an MT-generated seed to select monolingual sentences so only the extracted subset is back-translated (online).

Table 6, row 1:
German (auth): nach Krankenhausangaben wurden zwei um die 50 Jahre alte Männer durch das Beben schwer verletzt: einer sei von einem herabfallenden Schornstein getroffen worden, der andere habe durch Glas Schnittwunden erlitten. außerdem seien mehrere Menschen durch herabstürzende Gegenstände in ihren Wohnungen leicht verletzt worden.
English: Saturday will feature another comedy, "La Passione" by Carlo Mazzacurati of Italy starring the prolific Silvio Orlando, who plays a washed-up filmmaker who is forced to set his last-chance project in Tuscany after a plumbing disaster at his country home damages a 16th-century fresco in a neighbouring chapel.
The experiments showed that artificially-generated sentences can be as competitive as authentic data, as models built with different proportions of authentic and synthetic data achieve similar or even better performance than those fine-tuned with authentic pairs only. On the one hand, those sentences whose target sides could hurt the performance of NMT (such as sentences in a different language to that expected) cause the back-translated sentences to also contain unnatural n-grams, and so the TAs would not select them. On the other hand, if the source-side sentence is not an accurate translation of the target side (the problem of comparable corpora), the back-translated counterpart can be a better alternative to use as training data.
In the future, we want to explore other language pairs and other transductive algorithms. A further limitation of this work is that we have augmented the candidate pool with synthetic sentences generated by a single model. We propose to explore whether using several models for generating the synthetic sentences (including different approaches such as combining statistical and neural models (Poncelas et al., 2019c)) to augment the candidate pool can cause the selected data to further improve NMT models.