Pronoun-Targeted Fine-tuning for NMT with Hybrid Losses

Neural machine translation (NMT) models are commonly improved with strategies such as backtranslation, which raise BLEU scores but require large amounts of additional data and training. We introduce a class of conditional generative-discriminative hybrid losses that we use to fine-tune a trained machine translation model. Through a combination of targeted fine-tuning objectives and intuitive re-use of the training data that the model has failed to adequately learn from, we improve the performance of both a sentence-level and a contextual model without using any additional data. We target the improvement of pronoun translations through our fine-tuning and evaluate our models on a pronoun benchmark testset. Our sentence-level model shows a 0.5 BLEU improvement on both the WMT14 and the IWSLT13 De-En testsets, while our contextual model achieves the best results, improving from 31.81 to 32 BLEU on the WMT14 De-En testset, and from 32.10 to 33.13 on the IWSLT13 De-En testset, with corresponding improvements in pronoun translation. We further show the generalizability of our method by reproducing the improvements on two additional language pairs, Fr-En and Cs-En. Code available at.


Introduction
The advent of neural machine translation (NMT) (Bahdanau et al., 2015; Vaswani et al., 2017) brought about significant improvements that left the previously successful statistical machine translation models far behind. However, the availability of large corpora has been no small part of that success, with recent NMT models using millions of sentences for training. A lack of availability of such large parallel corpora across languages has given rise to methods utilizing large amounts of monolingual data, such as for backtranslation (Sennrich et al., 2016a), language modeling (Gülçehre et al., 2017; Zheng et al., 2020), or for large-scale pre-training (Lewis et al., 2020).
Backtranslation (Sennrich et al., 2016a; Edunov et al., 2018) is a commonly used strategy to improve MT models in the absence of adequate parallel data for training. A target-to-source model is first trained using the available parallel data; it is then used to translate a large target-side monolingual corpus into the source language, creating pseudo-parallel data for training a source-to-target MT model. This has been shown to result in improvements in the BLEU score, and has become a popular method for improving NMT models, with many recent works proposing strategies to further improve it (Hoang et al., 2018; Yang et al., 2019). However, recent studies have suggested that there is a limit beyond which the addition of synthetic data hurts the performance of the model (Fadaee and Monz, 2018; Poncelas et al., 2018). Also, recent work (Edunov et al., 2020; Nguyen et al., 2020) points out that back-translation suffers from the translationese effect: it improves performance only when the source sentences are translationese, and offers no improvement when the sentences are natural text.
Automatic post-editing (APE) is another common strategy that is used to improve translations. APE models are commonly monolingual, and typically take the output from some MT model as input, which they then modify. In the absence of adequate human post-edited data to train data-hungry neural models, Voita et al. (2019) and Freitag et al. (2019) both use round-trip translation data to train their post-editing models. In round-trip translations, target monolingual data is translated using a target-to-source model to the source text, and then back to the target using another source-to-target model. This round-trip translated text is considered an approximation of poor quality MT output, which can be used in combination with the original target reference text to train the post-editing model. Voita et al. (2019) train a model to make corrections in context, using groups of 4 sentences as input, and show improvements in BLEU as well as translations of discourse phenomena.
NMT models typically fail on rare words that may not be adequately seen during training, such as named entities, or on words whose interpretation depends on the context, such as discourse phenomena (Koehn and Knowles, 2017; Sennrich, 2018). For the latter, NMT models tend to prefer a more typical alternative to a relatively rare but correct one (e.g., French "Il" is often wrongly translated as the more common "it" rather than "he"). These seemingly trivial errors can erode translation quality to the extent that machine output becomes easily distinguishable from human-translated text (Läubli et al., 2018).
There could be several reasons why NMT models make such mistakes. Our hypothesis is that, since almost all NMT models are trained with a conditional language model objective, this objective alone is inadequate to capture all of the information available in the text. We therefore propose a class of conditional generative-discriminative hybrid losses that explicitly teach models what to generate and what not to generate. Using these specialized losses, we aim to improve the learning power of the MT model.
Specifically, in this work, we target the improvement of pronoun translation by focusing our finetuning efforts through our proposed objectives and also through the fine-tuning data. We aim to leverage the training data we already have by extracting a subset of targeted fine-tuning data from the training corpus that the model has failed to learn correctly from. We use the newly proposed training objectives in combination with the targeted data to help the model fully reach its learning potential on the training corpus. We attempt to improve both general translation quality and the pronoun translation without compromising on either, and to do this without any elaborate model architecture.
Our main contributions are as follows:
• A class of Conditional Generative-Discriminative Hybrid losses that improve the learning potential of the model (§2).
• An effective fine-tuning strategy that uses the training data itself to improve MT (§3).
• Demonstration of generalizability through additional fine-tuning experiments on Fr-En and Cs-En (§5.4).

Targeted Finetuning Objectives
Before introducing our proposed Conditional Generative-Discriminative hybrid losses for finetuning NMT models on a targeted dataset, we first describe the Conditional Language Modeling (CLM) objective used to train NMT models.

Conditional Language Modeling
NMT models are generally trained with the CLM generative loss, which relies on an auto-regressive factorization to perform density estimation and generation of target texts. For a source-target sentence pair (x, y), a CLM predicts a conditional probability distribution $P_\theta(y_{1:n} \mid x)$, where n is the number of tokens in the target text. The auto-regressive factorization for a CLM is given by

\[ P_\theta(y_{1:n} \mid x) = \prod_{t=1}^{n} P_\theta(y_t \mid y_{<t}, c) \tag{1} \]

where c is a context vector that summarizes the relevant input (e.g., attended vector over source text and the current decoder state). The CLM training objective for NMT can be written as:

\[ \mathcal{L}_g = -\sum_{t=1}^{n} \log P_\theta(y_t \mid y_{<t}, c) \tag{2} \]

Generating from CLM-trained NMT models requires iteratively sampling from $P_\theta(y_t \mid y_{<t}, c)$, and then feeding $y_t$ back into the model as input.
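As a small illustration of the CLM objective, the following Python sketch computes the loss for one target sentence from per-token probabilities. The function names and the probability values are hypothetical; a real NMT system would obtain these probabilities from the decoder softmax.

```python
import math

def sequence_prob(token_probs):
    """Auto-regressive factorization: product of per-token conditionals.

    token_probs[t] stands for P(y_t | y_<t, c), the probability assigned
    to the reference token at step t (illustrative values, not model output).
    """
    return math.prod(token_probs)

def clm_loss(token_probs):
    """CLM loss: negative sum of log-probabilities of the reference tokens."""
    return -sum(math.log(p) for p in token_probs)

probs = [0.9, 0.5, 0.8]
loss = clm_loss(probs)
# Minimizing the CLM loss is the same as maximizing the sequence probability:
# the loss equals -log of the product of the per-token probabilities.
assert abs(loss + math.log(sequence_prob(probs))) < 1e-12
```

Working in log space, as `clm_loss` does, avoids the numerical underflow that the raw product would run into for long sentences.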

Generative-Discriminative Hybrid Loss
While CLM has been the de-facto loss to train NMT models, models trained with CLM make mistakes that can erode translation quality, making them easily distinguishable from human translation. For example, state-of-the-art NMT models are not very good at handling rare words like named entities. They have also been criticized for not being sensitive to discourse-level aspects such as pronouns, lexical consistency, and discourse connectives (Sennrich, 2018;Jwalapuram et al., 2020).
We introduce a generative-discriminative hybrid method for fine-tuning NMT models, with the motivation of generating tokens that are more strongly in one class vs. another. We consider that the reference tokens come from a positive class, whereas the model-generated tokens come from a negative class. We propose two variants of our hybrid training: (i) log-likelihood and (ii) max-margin.
Log-likelihood training. Let z ∈ {0, 1} represent the class for a training instance (x, y). We can consider a generative classifier as follows.

\[ P(z = k \mid x, y) = \frac{P(z = k)\, P_\theta(y \mid x, z = k)}{\sum_{k'} P(z = k')\, P_\theta(y \mid x, z = k')} \tag{3} \]

Assuming an equal prior class probability, i.e., P(z = 1) = P(z = 0), and replacing $P_\theta(y \mid x, z = k)$ with Equation (1), we can write:

\[ P(z = k \mid x, y) = \frac{\prod_{t=1}^{n} P_\theta(y_t \mid y_{<t}, c, z = k)}{\sum_{k'} \prod_{t=1}^{n} P_\theta(y_t \mid y_{<t}, c, z = k')} \tag{4} \]

Since our objective is to maximize the probability of the reference tokens, we minimize the following negative log-likelihood loss:

\[ \mathcal{L}_{nll} = -\log P(z = 1 \mid x, y) \tag{5} \]

If $y^+$ is the reference (positive) translation and $y^-$ is the model (negative) output, it is easy to show that the above loss is equivalent to

\[ \mathcal{L}_{nll} = -\sum_{t=1}^{n} \log \frac{\exp(\hat{y}^+_t / \tau)}{\exp(\hat{y}^+_t / \tau) + \exp(\hat{y}^-_t / \tau)} \tag{6} \]

where τ is the temperature parameter of the softmax (for the sake of simplicity, we omit τ in Eq. 3-4), and $\hat{y}^+_t$ and $\hat{y}^-_t$ are the final-layer logits (pre-softmax activations) corresponding to the reference token $y^+_t$ and the model-generated token $y^-_t$, respectively. The logit for the model-generated token is computed by just taking the max over all the logits. We use τ = 0.5 for our experiments.
Max-margin training. Following Collobert et al. (2011), we also propose a pairwise ranking loss that maximizes the distance between positive and negative samples. Formally,

\[ \mathcal{L}_{mm} = \sum_{t=1}^{n} \max\bigl(0,\ \mu - (\hat{y}^+_t - \hat{y}^-_t)\bigr) \tag{7} \]

where µ is the margin; we use µ = 0.3. Note that the additional losses can be applied to all the tokens in the sequence, or restricted to some tokens. We demonstrate this in our experiments by applying the loss on all tokens and selectively applying the loss only on pronouns. Both of the discriminative losses essentially promote the probability of the positive (i.e., correct) sample. However, the intuition behind using the additional loss over the standard loss is that the fine-tuning here focuses on improving the positive sample over the negative sample that the model has learnt to produce, rather than over the entire probability distribution over the full vocabulary.
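The two discriminative losses can be illustrated with a minimal Python sketch. It assumes that both losses are computed per token from the reference logit and the maximum logit and then summed over the sentence; the function names and the optional `mask` argument (for restricting the loss to selected tokens such as pronouns) are hypothetical and not taken from the authors' code.

```python
import math

def reference_and_model_logits(logits, ref_index):
    """Return (positive, negative) logits for one decoding step.

    The positive logit belongs to the reference token; the negative logit
    is the max over all logits, i.e. the token the model would generate.
    """
    return logits[ref_index], max(logits)

def pairwise_nll(pos_logits, neg_logits, tau=0.5, mask=None):
    """Two-way softmax loss between reference and model logits, summed over tokens."""
    mask = mask if mask is not None else [1] * len(pos_logits)
    total = 0.0
    for pos, neg, m in zip(pos_logits, neg_logits, mask):
        # -log(e^(pos/tau) / (e^(pos/tau) + e^(neg/tau))) = log(1 + e^((neg - pos)/tau))
        total += m * math.log1p(math.exp((neg - pos) / tau))
    return total

def max_margin(pos_logits, neg_logits, mu=0.3, mask=None):
    """Hinge loss: the reference logit should exceed the model logit by at least mu."""
    mask = mask if mask is not None else [1] * len(pos_logits)
    return sum(m * max(0.0, mu - (pos - neg))
               for pos, neg, m in zip(pos_logits, neg_logits, mask))

# One decoding step where the model prefers token 0 but the reference is token 2.
pos, neg = reference_and_model_logits([2.0, 0.5, 1.0], ref_index=2)
print(pairwise_nll([pos], [neg]), max_margin([pos], [neg]))
```

Because the negative logit is a max over the vocabulary, it can never be smaller than the reference logit; when the model already ranks the reference token first, the two coincide and the per-token losses reduce to log 2 and µ respectively.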
We average these losses at both the sentence and the batch level before adding them to the existing CLM loss. The overall training loss L_gd (Eq. 8) combines the generative loss L_g (Eq. 2) with the discriminative loss L_d, where λ is a weighting hyperparameter and L_d is either L_mm (Eq. 7) or L_nll (Eq. 6). In our training, the discriminative loss L_d is aimed at correcting the mistakes, whereas the generative loss L_g is needed to preserve the translation adequacy and fluency. In our experiments, we simply set λ = 0.5.
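A rough sketch of how the two terms might be combined is given below. It assumes an additive combination in which λ scales the discriminative term; the exact form of the combination is an assumption of this sketch, as are the function names and the example values.

```python
def mean(values):
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def hybrid_loss(gen_loss, disc_token_losses, lam=0.5):
    """Combine a generative loss with an averaged discriminative loss.

    gen_loss: scalar CLM loss for the batch (assumed already computed).
    disc_token_losses: one list per sentence of per-token discriminative losses.
    The discriminative loss is averaged within each sentence, then over the batch.
    ASSUMPTION: additive form gen_loss + lam * disc_loss.
    """
    disc_loss = mean([mean(sentence) for sentence in disc_token_losses])
    return gen_loss + lam * disc_loss

# Two sentences: per-token discriminative losses of [1.0, 3.0] and [4.0].
print(hybrid_loss(2.0, [[1.0, 3.0], [4.0]]))  # 2.0 + 0.5 * mean([2.0, 4.0])
```

Averaging within each sentence first keeps long sentences from dominating the discriminative term.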

Pronoun-Targeted Fine-tuning Data
We create a subset of the training corpus in order to find training data that has not been fully learnt from; particularly, we focus our fine-tuning experiments on pronoun translation. Pronouns are an important discourse phenomenon that provide references to entities that have previously occurred in a text. Mistranslations can lead to loss of grammaticality or inference of the wrong antecedent, resulting in a misunderstanding of the text (Guillou, 2012).
Consider a parallel corpus D = (S, R), where S is the source and R is the target/reference text. Assuming that the baseline NMT models (§3.2) are trained until convergence using this data, for our targeted fine-tuning of pronoun translations, we derive a subset of the training corpus D as follows:
4. For each sentence with a mistranslated pronoun, extract the source sentences from S.
5. The corresponding source and reference sentences form the pronoun-targeted fine-tuning subset, referred to as D prn = (S , T ).
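The selection of training pairs with mistranslated pronouns can be sketched in Python as follows. This is only an approximation under stated assumptions: the pronoun list is illustrative, and a pronoun is treated as mistranslated when the baseline output contains fewer occurrences of it than the reference. The procedure actually used to detect mistranslated pronouns may differ (for instance, it may rely on word alignments).

```python
from collections import Counter

# Illustrative English pronoun list (an assumption of this sketch).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "its", "them", "their"}

def pronoun_counts(sentence):
    """Count pronoun tokens in a whitespace-tokenized sentence."""
    return Counter(tok for tok in sentence.lower().split() if tok in PRONOUNS)

def has_pronoun_error(reference, translation):
    """True if some reference pronoun occurs less often in the translation."""
    ref, hyp = pronoun_counts(reference), pronoun_counts(translation)
    return any(hyp[p] < n for p, n in ref.items())

def select_targeted_subset(sources, references, translations):
    """Keep (source, reference) pairs whose baseline translation has a pronoun error."""
    return [(s, r) for s, r, t in zip(sources, references, translations)
            if has_pronoun_error(r, t)]

sources = ["sie soll die fackel tragen .", "der handel wurde unterbrochen ."]
references = ["she is meant to carry the torch .", "trading was interrupted ."]
baseline_out = ["it is meant to carry the torch .", "trade was cut off ."]
print(select_targeted_subset(sources, references, baseline_out))
```

Only the first pair is kept in this example, because the baseline produced "it" where the reference has "she".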

Baseline MT Models
Typically, MT models are trained at the sentence level, taking one sentence as input and producing one sentence as output. Most MT systems at the sentence level do not have access to adequate context that may be required for the translation of pronouns (Sennrich, 2018). Since it is our aim to improve pronoun translation, we train both a sentence-level model and a simple concatenation-based contextual model as our baselines:
SEN2SEN: A standard 6-layer base Transformer model (Vaswani et al., 2017) trained to translate each sentence independently.
CONCAT: A standard 6-layer base Transformer trained to translate a sentence given one previous sentence as context (Tiedemann and Scherrer, 2017). The input to the model is the previous sentence and the current sentence combined with a special separator character. Jwalapuram et al. (2020) show that this simple context model performs comparably or better than other elaborate contextual models like Voita et al. (2018), Zhang et al. (2018), and Miculicich et al. (2018).
Both the baseline models are trained for 100,000 steps. Other parameter details are in the Appendix.
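The input format of the CONCAT model can be illustrated with a short Python sketch. The separator string `<sep>` and the function name are hypothetical; the description above specifies only that a special separator character joins the previous and the current sentence.

```python
SEP = "<sep>"  # hypothetical separator token

def build_concat_inputs(sentences, sep=SEP):
    """Prefix each sentence with its predecessor and a separator.

    The first sentence has no predecessor, so its context is left empty.
    Document boundaries are ignored here for brevity.
    """
    inputs = []
    previous = ""
    for sentence in sentences:
        inputs.append(f"{previous} {sep} {sentence}".strip())
        previous = sentence
    return inputs

doc = ["lady liberty geht voran .", "sie soll die fackel tragen ."]
for line in build_concat_inputs(doc):
    print(line)
```

The second input line carries "lady liberty geht voran ." as context, which is the kind of information a sentence-level model lacks when choosing between "she" and "it".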

Experiments
We conduct our fine-tuning experiments on the German-English (De-En) translation task. We describe our baseline training and fine-tuning corpus ( §4.1), our experiments and results on fine-tuning using only the targeted subset data ( §4.4), and finetuning using both the targeted subset data and the hybrid training losses ( §4.5).

MT Training Data
Baseline training corpus. We use a De-En training dataset consisting of about 2.5 million sentence pairs, taken from the News Commentary, IWSLT (Cettolo et al., 2012) and Europarl (Tiedemann, 2012) datasets. We apply Byte-Pair Encoding (BPE) (Sennrich et al., 2016b) with 40,000 operations, which results in a shared vocabulary of 40,224 tokens. We will refer to our baseline dataset as D.
Pronoun targeted fine-tuning data. As described in §3.1, we derive the pronoun-targeted fine-tuning subset D prn from the baseline training corpus D based on the translation errors of the baseline models. This results in a pronoun-targeted subset of 294,535 pairs for the SEN2SEN model and 285,783 pairs for the CONCAT model.
Random subset. We randomly extract a subset of 300,000 sentence pairs from D, which approximately matches the size of the pronoun-targeted subset. We will refer to this dataset as D rand .

Pronoun Translation Evaluation
Testset. We run the models on the pronoun challenge testset provided by Jwalapuram et al. (2019), which is extracted from WMT testsets based on submission errors. For De-En, the testset has 2245 sentences, taken from WMT17-WMT19.
Evaluation. We report the macro-averaged F1 scores of the pronoun translation based on a simplified version of AutoPRF (Hardmeier and Federico, 2010). For each sentence in the testset, the counts of the pronouns in the system translation are clipped based on the pronouns in the reference translation; these counts are then used to compute the precision, recall and F1 scores.
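The clipped-count evaluation can be sketched in Python as follows. This is a simplified illustration, not the evaluation script used for the reported results: the pronoun list is illustrative, and macro-averaging is assumed to be over pronoun types.

```python
from collections import Counter

# Illustrative English pronoun list (an assumption of this sketch).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "its", "them", "their"}

def pronoun_prf(references, hypotheses, pronouns=PRONOUNS):
    """Per-pronoun precision/recall/F1 from clipped counts, plus macro-averaged F1."""
    match, hyp_total, ref_total = Counter(), Counter(), Counter()
    for ref, hyp in zip(references, hypotheses):
        ref_c = Counter(t for t in ref.lower().split() if t in pronouns)
        hyp_c = Counter(t for t in hyp.lower().split() if t in pronouns)
        for p in pronouns:
            # Clip the system count by the reference count, sentence by sentence.
            match[p] += min(hyp_c[p], ref_c[p])
            hyp_total[p] += hyp_c[p]
            ref_total[p] += ref_c[p]
    scores = {}
    for p in pronouns:
        if hyp_total[p] == 0 and ref_total[p] == 0:
            continue  # pronoun never occurs; leave it out of the average
        prec = match[p] / hyp_total[p] if hyp_total[p] else 0.0
        rec = match[p] / ref_total[p] if ref_total[p] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[p] = (prec, rec, f1)
    macro_f1 = sum(f for _, _, f in scores.values()) / len(scores) if scores else 0.0
    return scores, macro_f1

refs = ["she is here", "it is there"]
hyps = ["it is here", "it is there"]
scores, macro_f1 = pronoun_prf(refs, hyps)
print(scores, macro_f1)
```

In this toy example "she" is never produced (F1 of 0), while "it" is over-generated (precision 0.5, recall 1.0), so the macro-averaged F1 is one third.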

Baseline Results
We first report the BLEU scores on the WMT14 De-En testset, and the BLEU, precision, recall, and F1 scores on the pronoun testset from Jwalapuram et al. (2019). The baseline BLEU on WMT14 is 31.64 for the SEN2SEN model and 31.81 for the CONCAT model; the baseline pronoun F1 is 69.55 and 72.03, respectively. To confirm that the context provides useful information rather than acting simply as a regularizer, we also run an experiment with the CONCAT model using a random sentence as context instead of the previous sentence. This model achieves a BLEU of 31.65 and a pronoun F1 of 69.65, both lower than the CONCAT baseline, confirming that the extended context from the previous sentence does provide helpful information.

Fine-tuning on Pronoun-Targeted Data
For the first set of fine-tuning experiments, we only fine-tune on the pronoun targeted subset D prn for the SEN2SEN model. This helps us assess the training schedule so that we can achieve a balance between preserving the information from the full data and gaining targeted information from the subset.
Setup. Given a trained baseline model, we train additional epochs on the targeted subset D prn . Apart from training only on the subset data, we also try training on a shuffled dataset consisting of the training + targeted subset data (which essentially doubles the error-prone subset compared to the baseline training data), alternating the training between the subset and the full data (D + D prn ), and the subset and full data upsampled by 2 (i.e., 2D + D prn ).
To ensure that the results we see are from the fine-tuning and not simply from increased training, we train the original baseline model on the full data D for additional epochs, equivalent to the number of fine-tuning epochs.
Results. We see from the results in Table 2 that although the pronoun F1 improves, the BLEU scores drop when the model is fine-tuned only with the subset data D prn . Shuffling a mix of the full training data with the subset data leads to a smaller drop in BLEU and a gain in pronoun F1. However, alternating the training on the full corpus and the subset (D + D prn ) stabilizes the BLEU score, and upsampling the primary dataset (2D + D prn ) results in a smaller drop in BLEU, while gaining more significantly in pronoun F1 over the baseline. A similar trend is also observed for the CONCAT model. Further upsampling does not lead to a significant improvement in results, so all subsequent experiments upsample the primary dataset by 2.
Increased training of the baseline also results in a drop in BLEU scores. However, the pronoun F1 is also lower, which is not the case for the fine-tuning results, indicating that fine-tuning rather than increased training is the source of the improvements.

Effect of Additional Losses
We conduct experiments using both targeted data and proposed hybrid losses.
Setup. In accordance with our settings to alternate training between the upsampled full dataset and the subset data (2D + D prn ), we also alternate the additional loss such that it is only applied to the targeted subset. That is, in every alternate epoch, the model is trained on the upsampled full dataset (2D) with the standard CLM translation loss L_g (Eq. 2), and then trained on the targeted subset D prn with the proposed hybrid loss L_gd (Eq. 8).
Each fine-tuning model is trained for 9 additional epochs, such that the first and the last epoch use the targeted subset data and loss. This is effectively about 4 cycles of fine-tuning on (2D+D prn ); further training does not lead to improved loss.
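The alternating schedule can be written down as a small Python sketch. The dataset and loss labels are plain strings used for illustration only; a real training loop would map them to data loaders and loss functions.

```python
def finetuning_schedule(num_epochs=9):
    """Alternate between the targeted subset and the upsampled full data.

    Odd-numbered epochs use the targeted subset with the hybrid loss;
    even-numbered epochs use the upsampled full data with the CLM loss,
    so the first and last of 9 epochs fall on the targeted subset.
    """
    schedule = []
    for epoch in range(1, num_epochs + 1):
        if epoch % 2 == 1:
            schedule.append((epoch, "D_prn", "L_gd"))
        else:
            schedule.append((epoch, "2D", "L_g"))
    return schedule

for epoch, data, loss in finetuning_schedule():
    print(epoch, data, loss)
```

With 9 epochs this gives five passes over the targeted subset and four over the upsampled full data, matching the roughly four cycles described above.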
Apart from applying the additional loss on all tokens in the subset data, we also experiment with applying the additional loss only on the pronoun tokens, i.e., the loss is only applied to those tokens which have a pronoun as the target translation.
To further assess the contribution of the targeted subset data, we conduct experiments by replacing it with a random subset D rand . We also conduct fine-tuning experiments by applying the additional loss on the full training dataset D while training the baseline model for additional epochs.
Max-margin loss results. Results for fine-tuning with the max-margin loss are shown in Table 3a. We see that there is an improvement in BLEU from 31.64 to 32.14 for the SEN2SEN model. From the difference in improvement of the results from fine-tuning over D rand and D, it is apparent that this increase is a consequence of both the targeted data and the targeted loss. There is also a corresponding increase in pronoun F1 from 69.55 to 69.77. More importantly, we see that the CONCAT model drops slightly in BLEU to 31.75 with respect to the baseline, but the pronoun translation F1 improves from 72.03 to 72.88. When the loss is applied only on pronouns, the SEN2SEN model has a smaller BLEU increase to 31.81, but a larger pronoun F1 increase to 70.37. The CONCAT model benefits the most from having both pronoun-targeted fine-tuning data and loss; both the BLEU score and the pronoun F1 improve.

(b) Fine-tuning results using log-likelihood loss
Table 3: Targeted fine-tuning loss: fine-tuning results on the WMT14 De-En testset with F1 scores on the pronoun testset. Fine-tuning results on 2D + D prn refer to alternated training with pronoun-targeted fine-tuning data and the upsampled full training data. Fine-tuning on 2D + D rand is the same setting with the targeted data replaced with a random subset. Fine-tuning on D refers to additional training with the hybrid losses applied on the full dataset. * indicates statistically significant difference from the baseline (p ≤ 0.05 for F1; >80% confidence for BLEU).

Log-likelihood loss results. Results for fine-tuning with the log-likelihood loss are shown in Table 3b. The overall increase in BLEU with the log-likelihood loss is lower for SEN2SEN compared to the max-margin loss, but the improvements in pronoun F1 are higher. With respect to the results on fine-tuning over D rand and D, improvement in BLEU score here does not result in a corresponding improvement in pronoun translation, further confirming the contribution of the targeted data. Once again, the CONCAT model outperforms the rest by gaining in both BLEU and pronoun F1.
Both losses perform comparably: while the SEN2SEN model achieves a higher increase in BLEU with the max-margin loss, gains in pronoun translation are higher with the log-likelihood loss. For the CONCAT model, both losses provide similar BLEU improvements, but the max-margin loss leads to higher gains in pronoun F1.

Qualitative Analysis of Results
We performed a qualitative analysis to see the effect of our fine-tuning. Some examples of improvements in translation resulting from our fine-tuning are shown in Table 4 (see Appendix for more).
The results of the targeted fine-tuning show that both the targeted data and the additional loss play a role in improving the translations. Another important conclusion that can be drawn is that there is no correlation between the BLEU score and the pronoun translation quality; in this case we have shown that it is possible to target the improvement of pronoun translations.
However, for the SEN2SEN model in particular, we see that there are improvements in BLEU that do not correspondingly improve pronoun translations, which can be surprising given that the finetuning data is targeted towards pronouns. It can be surmised from the improvements in the CONCAT model that the SEN2SEN model fails to improve in pronoun translation because it simply lacks the additional information that the context provides, which can be important for translating discourse phenomena like pronouns (Sennrich, 2018). See Table 4 for examples from the pronoun testset.
Another anomaly is that in some cases, the pronoun translation results are better when the loss is applied to all tokens rather than only to pronouns. A similar phenomenon may be the cause here: improved translation of the rest of the sentence may result in better contextual information, which in turn leads to better pronoun translations. This underscores the importance of using context rather than trying to improve pronoun translations in isolation.
The general improvements in BLEU result from the fact that the targeted data is a subset that the model has failed to learn adequately from. Thus, our method of obtaining targeted data seemingly results in a subset that is generally poorly translated by the original baseline model, so training on this data results in an improved BLEU score. This also explains the disparity in results with the fine-tuning on the random (D rand ) and the full (D) datasets.

Table 4: Examples of improvements in translation resulting from our fine-tuning.

Source: der handel am nasdaq options market wurde am freitagnachmittag deutscher zeit unterbrochen .
Reference: trading at the nasdaq options market was interrupted on friday afternoon , german time .
Baseline: trade at nasdaq options market was cut off on the german friday afternoon .
Our best model: trade in nasdaq options market was suspended on friday afternoon in germany .

Context: ... die die amerikanische flamme in die umnachtete welt bringe : lady liberty geht voran .
Source: sie soll die fackel der freiheit von den vereinigten staaten in den rest der welt tragen .
Context: ... taking the american flame out to the benighted world : lady liberty is stepping forward .
Reference: she is meant to be carrying the torch of liberty from the united states to the rest of the world .
Baseline: it is meant to carry the torch of freedom from the united states to the rest of the world .
Our best model: she is supposed to carry the torch of freedom from the united states to the rest of the world .

Context: versteinerte reste der haut bedecken noch immer die holprigen panzerplatten , die den schädel des tieres tragen .
Source: sein rechter vorderfuß liegt an seiner seite , seine fünf finger sind nach oben gespreizt .
Context: fossilized remnants of skin still cover the bumpy armor plates dotting the animal's skull .
Reference: its right forefoot lies by its side , its five digits splayed upward .
Baseline: his right-hand front foot is on his side , his five fingers are spiked up .
Our best model: its right front foot is on its side , its five fingers are split upwards .

Comparison with Related Work
Backtranslation. We train a target-source En-De model with the same training data (D, consisting of 2.5M pairs of parallel data) and settings as the baseline SEN2SEN model. This achieves a BLEU score of 27.4 on the WMT14 En-De testset. We use this model to translate about 76M sentences of NewsCrawl, a monolingual English corpus, to German. Using this pseudo-parallel corpus in addition to the original training corpus (≈ 78M pairs), we train a SEN2SEN source-target De-En backtranslation model. This model is trained for 500K steps. The results are shown in Table 5. Although backtranslation achieves the highest BLEU score at 32.57, our fine-tuned CONCAT model achieves the highest F1 for pronoun translation at 72.39, without having been trained on any extra monolingual data. This is further evidence that it may be insufficient to simply improve the BLEU scores at a sentence level. Performing fine-tuning on a CONCAT backtranslation model may be interesting to consider; we leave this for future work.
Automatic post-editing. We train the contextual, monolingual automatic post-editing model proposed by Voita et al. (2019) for English. To capture MT errors, the model is trained with round-trip-translated texts as inputs and reference texts as the intended outputs. We use default settings and data sizes similar to those proposed in their paper. We use 2.5M sentences from parallel data D and monolingual English sentences from NewsCrawl to make up ≈ 30M sentences. Using the En-De model described above and our baseline De-En model, we translate this data to German and then back to English to obtain round-trip translations. We use this data to train their model 6 for around 750K steps as recommended by the authors.
We use the outputs of our baseline SEN2SEN De-En model on the WMT14 testset and the pronoun challenge testset as input to the model. 7 The results are shown in Table 5. We see that automatic post-editing does not lead to an improvement in BLEU 8 or pronoun translation in this case.
Our analysis of round-trip-translations suggests that this is possibly because they do not contain enough errors. Experiments conducted on the WMT14 En-De testset show that if it is translated using our En-De model (BLEU:27.40) to German and then translated using our De-En model (BLEU:31.64) back to English, the resulting text has a BLEU of 44.44, which is significantly higher. It is a well-known phenomenon that MT models perform substantially better on translationese (Graham et al., 2019), which refers to text that is unnatural by virtue of being translated. This means that it is not very likely to resemble typical MT output or capture the same errors (Poncelas et al., 2018); twice-translated texts therefore contain considerably fewer errors that can be learnt from.

Results on the IWSLT13 Testset
We evaluate our fine-tuned models on the IWSLT13 De-En testset (Table 6). We also evaluate the pronoun translation for this testset. The backtranslation model fails to generalize, and performs worse than the baseline. It can be seen that our fine-tuned models improve over the baseline performance on this testset as well; the best SEN2SEN model improves from 31.64 to 32.16, while the best CONCAT model improves from 32.10 to 33.13, with corresponding improvements in pronoun F1. CONCAT continues to be the best performing model, showing significant improvements for both fine-tuning losses.

Generalizability to Other Languages
Finally, we test the generalizability of our fine-tuning method by running experiments for French-English and Czech-English. We use the same training dataset sources as for German-English (i.e., News Commentary, IWSLT (Cettolo et al., 2012) and Europarl (Tiedemann, 2012)). This results in 2.53M sentences of training data and 500K sentences of fine-tuning data for Fr-En, and 992K sentences of training data and 100K sentences of fine-tuning data for Cs-En. We report the baseline BLEU results on the WMT14 testsets and the pronoun translation results on the corresponding testsets from Jwalapuram et al. (2019) containing 1478 (Fr-En) and 1686 (Cs-En) sentences. We see from Table 7 that our fine-tuning approach shows similar trends in improving BLEU and pronoun translation results for both Fr-En and Cs-En.

Table 6: BLEU score and pronoun translation F1 results of the baselines and the fine-tuned models on the IWSLT13 De-En testset.

6 Taken from https://github.com/lena-voita/good-translation-wrong-in-context.
7 For the pronoun testset, we were only able to provide groups of 3 sentences as input instead of the 4 that the original model uses, since the testset only provides two previous sentences as context. We add dummy text as the first sentence to make it a 4-sentence group input.
8 Note that we calculate the BLEU scores for each sentence separately as is standard, unlike the groups of 4 used in the original paper. This is to more accurately compare against the results from the rest of our experiments.

Discussion
Our objective is to propose a novel fine-tuning method that leverages "unlearned" data using additional losses. To this end, we proposed two different losses. We do not mean to advocate for any particular loss; in our experiments the two gave comparable results, which does not conclusively point to one loss as being better. A different loss may perform better in other tasks. Although we focused on pronoun translations, our fine-tuning method is generic and can be used to correct other kinds of errors in machine translations, like named entities or other rare words. Our proposed losses can be adapted to other directed generation tasks; e.g., to improve coherence/factual correctness in abstractive summarization, or for controlled text generation. Our fine-tuning approach also opens up new ways to address training issues that originate from datasets; e.g., it could potentially be used to correct biases (such as gender) or used to improve system robustness.

Related Work
Our idea of conditional generative-discriminative training is related to the idea of discriminative training of generative models. Previously, this idea was proposed for Markov models. Collins (2002) trained a Hidden Markov Model (HMM) discriminatively for sequence tagging with the structured perceptron algorithm. Yakhnenko et al. (2005) used a similar idea for sequence classification. In deep learning, the well-known generative adversarial networks (GANs) (Goodfellow et al., 2014) are an example where a generator is trained with the help of a discriminator. To the best of our knowledge, ours is the first work to explore this idea with conditional language models for guiding the model on what to generate and what not to generate.
A few fine-tuning methods are related to our work. Abdulmumin et al. (2019) pre-train an MT model on synthetic backtranslated data and fine-tune it on authentic parallel data, and show an improvement of 0.7 BLEU over backtranslation on English-Vietnamese. Fadaee and Monz (2018) use various sampling strategies to improve the results of backtranslation by targeting difficult-to-predict words based on prediction loss. Our strategy is similar in that we also try to target words that the model has trouble with, but we do not use additional data.
A number of methods have been proposed for adapting a trained MT model to another domain by fine-tuning. A common strategy is to simply perform additional training on the new domain dataset (Luong and Manning, 2015) or use a mix of in-domain and out-domain data for fine-tuning without loss of generalization (Chu et al., 2017) or upweight out-of-domain data (Wang et al., 2017).
There has been some work on targeted improvement of translations, specifically for named entities. Ugawa et al. (2018) adapt the MT network architecture to encode named entity features and tags, while Li et al. (2018) perform domain adaptation in addition to feature encoding. With respect to discourse phenomena, Stojanovski and Fraser (2019) propose a curriculum learning based approach, where a context-aware model is trained on randomly sampled oracle data containing gold-standard pronouns.
In our work, we focus on the baseline model's failings and try to increase its learning capacity by proposing additional losses.
Most recent work on improving pronoun translations has involved building more complex architectures that incorporate contextual information (Voita et al., 2018;Wong et al., 2020). In contrast, we present a more generalized approach.

Conclusions and Future Work
We have proposed a class of conditional generative-discriminative losses to increase the learning potential of NMT models, showing that it is possible to leverage "unlearned" training data to further improve an MT model by strategically filtering the data and applying additional targeted losses.
We demonstrated the effectiveness of our methods on different languages and testsets, also reporting improved pronoun translations. Although we focus on pronoun translations, our fine-tuning method is generic and can be used to correct other kinds of errors in machine translations, like named entities or other rare words. In future work, we will explore other such applications of our proposed methods.

A.1 Training Parameters
The training parameters used for both the SEN2SEN and the CONCAT models are given in Table 8. All models were trained in fairseq and all results reported are based on averaging the last 10 checkpoints.
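For readers unfamiliar with checkpoint averaging, the following is a minimal sketch of the idea: each parameter is averaged element-wise over the last n saved checkpoints. Representing a checkpoint as a dict of flat float lists, and the names `average_checkpoints` and `average_last`, are assumptions for illustration; this is not fairseq's own implementation.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is assumed to be a dict mapping a parameter name
    to a flat list of floats, with identical keys and lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


def average_last(checkpoints, n=10):
    """Average only the last n checkpoints (n=10 mirrors the setup above)."""
    return average_checkpoints(checkpoints[-n:])


# Toy run: 12 checkpoints with a single two-element parameter "w".
toy = [{"w": [float(i), 2.0 * i]} for i in range(12)]
avg = average_last(toy, n=10)
```

Only checkpoints 2 through 11 of the toy run contribute to `avg`; the two earliest ones are ignored.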

A.2 Examples from Fine-tuned Models
Some examples of improved translations from our fine-tuned models are given in Table 9.

WMT14 Testset
Source: 14 stunden kämpften die ärzte um das überleben des opfers , jedoch vergeblich .
Reference: for 14 hours, doctors battled to save the life of the victim , ultimately in vain .
Baseline: 14 hours of doctors fought for the victim's survival , but in vain .
Our best model: the doctors fought 14 hours for the survival of the victim , but in vain .

Source: der handel am nasdaq options market wurde am freitagnachmittag deutscher zeit unterbrochen .
Reference: trading at the nasdaq options market was interrupted on friday afternoon , german time .
Baseline: trade at nasdaq options market was cut off on the german friday afternoon .
Our best model: trade in nasdaq options market was suspended on friday afternoon in germany .

Source: einem autofahrer wurde eine strafe in höhe von 1.000 £ auferlegt , weil er mit bis zu 210 km / h und einem heißgetränk zwischen seinen beinen gefahren war .
Reference: a motorist has been fined £ 1,000 for driving at up to 130mph ( 210km / h ) with a hot drink balanced between his legs .
Baseline: a driver was fined £ 1,000 for driving up to £ 210 per hour and a hot drink between his legs .
Our best model: a driver was fined £ 1,000 for driving up to 210 kilometers an hour and a hot drink between his legs .

Source: des grues sont arrivées sur place peu après 10 heures , et la circulation sur la nationale a été détournée dans la foulée .
Reference: cranes arrived on the site just after 10am , and traffic on the main road was diverted afterwards .
Baseline: cranes arrived soon after 10 hours , and circulation on the national front was hijacked in the process .
Our best model: cranes arrived shortly after 10 hours , and traffic on the national side was diverted along the way .

Source: le diagnostic de rage a été confirmé par l'institut pasteur .
Reference: the diagnosis of rabies was confirmed by the pasteur institute .
Baseline: the rabies diagnosis was confirmed by the institut pasteur.
Our best model: the rabies diagnosis was confirmed by the pasteur institute .

Pronoun Testset

Context: ... die die amerikanische flamme in die umnachtete welt bringe : lady liberty geht voran .
Source: sie soll die fackel der freiheit von den vereinigten staaten in den rest der welt tragen .
Context: ... taking the american flame out to the benighted world : lady liberty is stepping forward .
Reference: she is meant to be carrying the torch of liberty from the united states to the rest of the world .
Baseline: it is meant to carry the torch of freedom from the united states to the rest of the world .
Our best model: she is supposed to carry the torch of freedom from the united states to the rest of the world .

Context: der getestete 1,6 l diesel mit 88 kw / 120 ps beschleunigt den hr -v ...
Source: er dürfte seine arbeit allerdings etwas leiser verrichten .
Context: the 1.6 l diesel engine we tested , with 88 kw / 120 horsepower accelerates the hr -v powerfully ...
Reference: however , it could certainly do its work a bit more quietly .
Baseline: however , he is likely to do his job rather more quietly .
Our best model: but it is likely to do its job a little more quietly .

Context: versteinerte reste der haut bedecken noch immer die holprigen panzerplatten , die den schädel des tieres tragen .
Source: sein rechter vorderfuß liegt an seiner seite , seine fünf finger sind nach oben gespreizt .
Context: fossilized remnants of skin still cover the bumpy armor plates dotting the animal's skull .
Reference: its right forefoot lies by its side , its five digits splayed upward .
Baseline: his right -hand front foot is on his side , his five fingers are spiked up .
Our best model: its right front foot is on its side , its five fingers are split upwards .

Context: Il est mort dimanche matin.
Source: elle avait promis à son mari , la semaine avant son décès , de le faire sortir de l'hôpital
Context: He died on Sunday morning.
Reference: a week before his death , she had promised her husband she would get him out of hospital
Baseline: she promised her husband , the week before she died , to take her out of the hospital .
Our best model: she promised her husband , the week before his death , to take him out of the hospital

Context: Elle a été détenue dans une cellule du commissariat local avant l'audience devant le tribunal.
Source: elle était en vacances dans la région de krabi , au sud de la thaïlande .
Context: She was held in local police cells before the court hearing.
Reference: she was holidaying at the resort area of krabi in southern thailand .
Baseline: it is on holiday in the region of krabi , southern thailand .
Our best model: she was on holiday in the krabi region of southern thailand .

Table 9: Examples showing the improvements in translations from our best models, across the WMT14 and the pronoun testsets. The previous sentence context information for the pronoun testset is also shown.