Leveraging Discourse Rewards for Document-Level Neural Machine Translation

Document-level machine translation focuses on the translation of entire documents from a source to a target language. It is widely regarded as a challenging task since the translation of the individual sentences in the document needs to retain aspects of the discourse at document level. However, document-level translation models are usually not trained to explicitly ensure discourse quality. Therefore, in this paper we propose a training approach that explicitly optimizes two established discourse metrics, lexical cohesion (LC) and coherence (COH), by using a reinforcement learning objective. Experiments over four different language pairs and three translation domains have shown that our training approach has been able to achieve more cohesive and coherent document translations than other competitive approaches, yet without compromising the faithfulness to the reference translation. In the case of the Zh-En language pair, our method has achieved an improvement of 2.46 percentage points (pp) in LC and 1.17 pp in COH over the runner-up, while at the same time improving the BLEU score by 0.63 pp and F_BERT by 0.47 pp.


Introduction
The recent advances in neural machine translation (NMT) (Sutskever et al., 2014;Bahdanau et al., 2015;Luong et al., 2015;Vaswani et al., 2017) have provided the research community and the commercial landscape with effective translation models that can at times achieve near-human performance. However, this usually holds at phrase or sentence level. When these models are used on larger units of text, such as paragraphs or documents, the quality of the translation may drop considerably in terms of discourse attributes such as lexical and stylistic consistency. In fact, document-level translation is still a very open and challenging problem. The sentences that make up a document are not unrelated pieces of text that can be predicted independently; rather, they form a set of sequences linked together by complex underlying linguistic aspects, collectively known as the discourse (Maruf et al., 2019b;Jurafsky and Martin, 2019). The discourse of a document includes several properties such as grammatical cohesion (Halliday and Hasan, 2014), lexical cohesion (Halliday and Hasan, 2014), document coherence (Hobbs, 1979) and the use of discourse connectives (Kalajahi et al., 2012). Ensuring that the translation retains such linguistic properties is expected to significantly improve its overall readability and flow.
However, due to the limitations of current decoder technology, NMT models are still bound to translate at sentence level. In order to capture the discourse properties of the source document in the translation, researchers have attempted to incorporate more contextual information from surrounding sentences. Most document-level NMT approaches augment the model with multiple encoders, extra attention layers and memory caches to encode the surrounding sentences, and leave the model to implicitly learn the discourse attributes by simply minimizing a conventional NLL objective. The hope is that the model will spontaneously identify and retain the discourse patterns within the source document. Conversely, very little work has attempted to model the discourse attributes explicitly. Even the evaluation metrics typically used in translation such as BLEU (Papineni et al., 2002) are not designed to assess the discourse quality of the translated documents.
For these reasons, in this paper we propose training an NMT model by directly targeting two specific discourse metrics: lexical cohesion (LC) and coherence (COH). LC is a measure of the frequency of semantically-similar words co-occurring in a document (or block of sentences) (Halliday and Hasan, 2014). For example, car, vehicle, engine and wheels are all semantically-related terms. There is significant empirical evidence that ensuring lexical cohesion in a text eases its understanding (Halliday and Hasan, 2014). In turn, COH measures how well adjacent sentences in a text are linked to each other. In the following example from Hobbs (1979): "John took a train from Paris to Istanbul. He likes spinach." the two sentences make little 'sense' one after another. An incoherent text, even if grammatically and syntactically perfect, is anecdotally very difficult to understand, and coherence should therefore be actively pursued. With specific relevance to translation, Vasconcellos (1989) has found that a high percentage of the human post-editing changes over machine-generated translations involves the improvement of cohesion and coherence.
Several LC and COH metrics that correlate well with human judgement have been proposed in the literature. However, like BLEU and most other evaluation metrics, they are discrete, non-differentiable functions of the model's parameters. Hereafter, we propose to overcome this limitation by using the well-established policy gradient approach from reinforcement learning (Sutton et al., 1999;Sutton and Barto, 2018), which allows using any evaluation metric as a reward without having to differentiate it. By combining different types of rewards, the model can be trained to simultaneously achieve more lexically-cohesive and more coherent document translations, while at the same time retaining faithfulness to the reference translation.
Related Work

Document-level NMT

Many document-level NMT models have proposed taking the context into account by concatenating surrounding sentences or extra features to the current input sentence, with otherwise no modifications to the model. For example, Rios et al. (2017) have trained an NMT model that learns to disambiguate words given the semantic landscape of the context by simply extracting lexical chains from the source document and using them as additional features. Other researchers have proposed concatenating previous source and target sentences to the current source sentence, so that the decoder can observe a proper amount of context (Agrawal et al., 2018;Tiedemann and Scherrer, 2017;Scherrer et al., 2019). Their work has shown that concatenating even just one or two previous sentences can result in a noticeable improvement. Macé and Servan (2019) have added an embedding of the entire document to the input, and shown promising results in English-French. Conversely, other document-level NMT approaches have proposed modifications to the standard encoder-decoder architecture to more effectively account for the context from surrounding sentences. One line of work has introduced a dedicated attention mechanism for the previous source sentences. Multi-encoder approaches with hierarchical attention networks have been proposed to separately encode each of the context sentences before they are merged back into a single context vector in the decoder (Miculicich et al., 2018;Maruf et al., 2019a;Wang et al., 2017). These models have shown significant improvements over sentence-level NMT baselines on many different language pairs. Kuang et al. (2018) and Tu et al. (2018) have proposed using an external cache to store, respectively, a set of topical words or a set of previous hidden vectors. This information has proved to benefit the decoding step at limited additional computational cost.
In turn, Maruf and Haffari (2018) have presented a model that incorporates two memory networks, one for the source and one for the target, to capture document-level interdependencies. For the inference stage, they have proposed an iterative decoding algorithm that incrementally refines the predicted translation.
However, all the aforementioned approaches assume that the model can implicitly learn the occurring discourse patterns. Moreover, the training objective is the standard negative log-likelihood (NLL) loss, which simply maximizes the probability of the reference target words in the sentence. To the best of our knowledge, only one work (Xiong et al., 2019) has attempted to train the model by explicitly learning discourse attributes. Inspired by recent work in text generation (Bosselut et al., 2018), Xiong et al. (2019) have proposed automatically learning neural rewards that can encourage translation coherence at document level. However, it is not clear whether the learned rewards would be in good correspondence with human judgement. For this reason, in our work we prefer to rely on established discourse metrics as rewards.

Discourse evaluation metrics
Several metrics have been proposed in the literature to measure discourse properties. For LC, Wong and Kit (2012) have proposed a metric that looks for repetitions of words and their related terms (e.g. hyponyms, hypernyms) by using WordNet (Miller, 1998). Gong et al. (2015) have proposed a similar metric that uses lexical chains. For COH, mainly two types of metrics have been proposed: entity-based and topic-based. The former follow the Centering Theory (Grosz et al., 1995), which states that documents with a high frequency of the same salient entities are more coherent. An entity-based coherence metric was proposed by Barzilay and Lapata (2008). In turn, topic-based metrics assume that a document is coherent when adjacent sentences are similar in topic and vocabulary. Accordingly, Hearst (1997) has proposed the TextTiling algorithm, which computes the cosine distance between the bag-of-words (BoW) vectors of adjacent sentences. Foltz et al. (1998) have proposed to replace the BoW vectors with topic vectors. Li et al. (2017) have learned topic embeddings with a self-supervised neural network. There is also a third group of COH metrics, based solely on syntactic regularities (Smith et al., 2016), which have also proved effective at modelling textual coherence. Other metrics have been proposed to measure different discourse properties such as grammatical cohesion (Hardmeier and Federico, 2010;Miculicich and Popescu-Belis, 2017) and discourse connectives (Hajlaoui and Popescu-Belis, 2013).

Reinforcement learning in NMT
Researchers in NMT and other natural language generation tasks have used reinforcement learning (Sutton and Barto, 2018) techniques to train models to maximize discrete sentence-level and document-level metrics, as an alternative or a complement to the NLL. For example, Ranzato et al. (2016) have proposed training NMT systems targeting the BLEU score, showing consistent improvements with respect to strong baselines. In addition to training the model directly with the evaluation function, they claim that this approach mollifies the exposure bias problem. Expected risk minimization has been proposed as an alternative reinforcement learning-style training to maximize the sentence-level (Edunov et al., 2018;Shen et al., 2016) and the document-level (Saunders et al., 2020) BLEU scores. Paulus et al. (2018) have proposed a similar approach for summarization using ROUGE as the training loss (Lin and Hovy, 2000). Tebbifakhr et al. (2019) have used a similar objective function to improve the sentiment classification of translated sentences. Finally, Edunov et al. (2018) have presented a comprehensive comparison of reinforcement learning and structured prediction losses for NMT model training.

Baseline Models
This section describes the baseline NMT models used in the experiments. In detail, subsection 3.1 recaps the standard sentence-level translation model while subsection 3.2 describes the recent, strong hierarchical baseline that we have augmented with discourse rewards.

Sentence-level NMT
Our first baseline is a standard sentence-level NMT model. Given a source document D with k sentences, the model translates each sentence x_i = {x_i^1, . . . , x_i^{n_i}}, i = 1, . . . , k, in the document into a sentence in the target language, y_i = {y_i^1, . . . , y_i^{m_i}}:

p(y_i | x_i) = \prod_{j=1}^{m_i} p(y_i^j | y_i^{<j}, x_i)   (1)

Figure 1: Risk training. Given the source document, the policy (NMT model) predicts l candidate translations. Then, a reward function is computed for each such translation. For supervised rewards (e.g., BLEU) the reference translation is required, but not for LC and COH. Finally, the Risk loss is computed using the rewards and the probabilities of the candidate translations, differentiated, and backpropagated for parameter update.
Thus, the model translates every sentence in the document independently. Our sentence model uses a standard transformer-based encoder-decoder architecture (Vaswani et al., 2017) where the model is trained to maximize the probability of the words in the training reference sentences using an NLL objective. We train this model for 20 epochs and select the best model over the validation set. For more details on training and the hyper-parameters please see Appendix A.

Hierarchical Attention Network
As a document-level translation baseline, we have used the Hierarchical Attention Network (HAN) of Miculicich et al. (2018). A HAN network is added to the sentence-level NMT model both in the encoder and in the decoder (referred to as HAN_join in the following), allowing the model to encode information from the t previous source and target sentences. The prediction can be expressed as:

p(y_i | x_i, x_{i-1}, . . . , x_{i-t}, y_{i-1}, . . . , y_{i-t}) = \prod_{j=1}^{m_i} p(y_i^j | y_i^{<j}, x_i, x_{i-1}, . . . , x_{i-t}, y_{i-1}, . . . , y_{i-t})   (2)

where (x_{i-1}, . . . , x_{i-t}) are the previous source sentences and (y_{i-1}, . . . , y_{i-t}) the previous target sentences that make up the context. At inference time, the target sentences are the model's own predictions. Following the indications given by the authors, we have set t = 3. Additionally, we have used the weights of the sentence-level NMT baseline to initialize the common parameters of the HAN_join model, and have initialized the extra parameters introduced by the HAN networks randomly. The model has been fine-tuned for 10 epochs and the best model over the validation set has been selected. For further information on the hyper-parameters see Appendix A.
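As a minimal illustration of how the t-sentence context is assembled for the HAN model, the previous sentences can simply be sliced from the document, with fewer sentences available near the document start. This is a hypothetical helper, not the authors' code:

```python
def context_window(sentences, i, t=3):
    """Collect up to t previous sentences (x_{i-1}, ..., x_{i-t}) as
    HAN context for sentence i; near the document start fewer (or no)
    previous sentences are available."""
    return sentences[max(0, i - t):i][::-1]
```

With t = 3, as used in our experiments, the fourth sentence of a document receives all three predecessors, while the first receives none.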

Risk training with discourse rewards
In order to improve the baseline models, we propose to use the LC (Wong and Kit, 2012) and COH (Foltz et al., 1998) evaluation metrics as rewards during training, so that the model is explicitly rewarded for generating more cohesive and coherent translations at document level. To this end, we use a reinforcement learning approach, which allows using discrete, non-differentiable functions as rewards in the objective. Following Edunov et al. (2018), we have used the structured loss that achieved the best results in their experiments, namely the expected risk minimization (Risk) objective:

L_Risk = - \sum_{u \in U(x)} r(u, y) p(u|x)   (3)

where x is the source sentence, y is the reference translation, p(u|x) is the conditional probability of a translation under our 'policy' (the NMT model), U(x) is a set of candidate translations generated by the current policy, and r(·) is the reward function. In our work we have obtained the candidate translations using beam search, which achieved higher accuracy than sampling in Edunov et al. (2018). The conditional probability of a translation has been defined as:

p(u|x) = exp(f(u, x, θ)) / \sum_{u' \in U(x)} exp(f(u', x, θ)),  with  f(u, x, θ) = (1/m) \sum_{j=1}^{m} \log p(u^j | u^{<j}, x, θ)   (4)

where m is the number of words in the candidate translation. Note that, in order to avoid underflow and put all the sentences on a similar scale, the (unnormalized) sentence score f(u, x, θ) in Eq. 4 is computed as a sum of logarithms, divided by the number of tokens in the sequence and, finally, brought back to scale with the exponential function. By minimizing this Risk objective, the NMT model is encouraged to give higher probability to candidate translations that obtain a higher reward. This objective has been used at sentence level by Edunov et al. (2018). However, it can also be computed at document level by simply concatenating all the sentences from the same document together (both for the ground truth and the predictions). As a result, m then becomes the number of words in a document, U(x) the candidate document translations, x the source document and y the reference document.
Computing the Risk objective in this way permits having document-level reward functions as r(·).
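The arithmetic of the Risk objective and the length-normalized candidate probability described above can be sketched in plain Python as follows. This only illustrates the loss computation; in actual training the loss is differentiated through the model with a deep learning framework:

```python
import math

def length_normalized_score(token_logprobs):
    # f(u, x, theta): average token log-probability, exponentiated to put
    # candidates of different lengths on a similar scale (avoids underflow).
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def risk_loss(candidate_logprobs, rewards):
    """Expected-risk loss over a candidate set U(x).

    `candidate_logprobs` holds one list of per-token log-probabilities per
    candidate translation; `rewards` holds the matching r(u, y) values.
    Minimizing the returned value pushes probability mass toward the
    candidates with higher reward.
    """
    scores = [length_normalized_score(lp) for lp in candidate_logprobs]
    z = sum(scores)
    probs = [s / z for s in scores]  # p(u | x), renormalized over U(x)
    return -sum(r * p for r, p in zip(rewards, probs))
```

Since the probabilities are renormalized over the candidate set, rewarding one candidate necessarily penalizes the others, which is what encourages the policy to rank high-reward translations first.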

Reward functions
We have explored the use and combination of different reward functions for training:

LC_doc: For LC, we have adopted the metric proposed by Wong and Kit (2012). This metric counts the number of lexical cohesive devices in the document and then divides it by the total number of words in the document (Eq. 5). Cohesive devices include associations such as repetitions of words, synonyms, near-synonyms, hypernyms, meronyms, troponyms, antonyms, coordinating terms, and so on. WordNet (Fellbaum, 2012) has been used to classify the relationships between words. Note that this reward function is unsupervised since it does not require a ground-truth reference translation.

LC = (# of cohesive devices in document) / (# of words in document)   (5)

COH_doc: To calculate COH, we have used the approach proposed by Foltz et al. (1998). This approach first uses a trained LSA model to infer a topic vector t_i for each sentence in the document, and then computes the average cosine similarity between adjacent sentences (Eq. 6):

COH = (1 / (k − 1)) \sum_{i=2}^{k} cos(t_i, t_{i−1})   (6)

For the topic vectors, we have used the pre-trained LSA model (Wiki-6) from Stefanescu et al. (2014), which was trained over Wikipedia. Note that COH, too, does not require a ground-truth reference translation.
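To make the two discourse rewards concrete, here is a simplified, self-contained sketch of both metrics. The RELATED dictionary is a hypothetical stand-in for the WordNet lookups of the real LC metric, and bag-of-words vectors stand in for the LSA topic vectors used in COH; the actual rewards use the full implementations cited above:

```python
from collections import Counter
import math

# Hypothetical stand-in for WordNet: maps a word to the words it forms
# a cohesive tie with (synonyms, hypernyms, ...).
RELATED = {"car": {"vehicle", "engine"}, "vehicle": {"car"}, "engine": {"car"}}

def lexical_cohesion(doc_tokens):
    """LC sketch: cohesive devices / total words. A token counts as a
    cohesive device if it repeats an earlier token or is related (per
    RELATED) to one; the real metric uses a richer device inventory."""
    seen, devices = set(), 0
    for tok in doc_tokens:
        if tok in seen or RELATED.get(tok, set()) & seen:
            devices += 1
        seen.add(tok)
    return devices / len(doc_tokens)

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as Counters.
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(sentences):
    """COH sketch: mean cosine similarity of adjacent sentences, with
    bag-of-words vectors standing in for the LSA topic vectors t_i."""
    vecs = [Counter(s) for s in sentences]
    sims = [cosine(a, b) for a, b in zip(vecs, vecs[1:])]
    return sum(sims) / len(sims)
```

Neither function looks at a reference translation, which is precisely why these rewards are unsupervised and why they need to be balanced by a reference-based reward.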
BLEU_doc: In addition to the LC and COH rewards, we have decided to use a reference-based metric such as BLEU (Papineni et al., 2002). Due to the unsupervised nature of LC and COH, the model could trivially boost them by just repeating words and creating very similar sentences. However, this would come at the expense of producing translations that are increasingly unrelated to the reference translation (low adequacy) and grammatically incorrect (low fluency). As such, we encourage the model to also target a high BLEU score in its predictions.
BLEU_sen: Finally, we have also used BLEU at sentence level as a reward. In this way, we can assess whether it is more beneficial to use this metric at document or sentence level.

These four rewards can be combined in several different ways. To limit the experiments, we have decided to use them in their natural range without reweighting. All the results with the different reward combinations are presented in Section 5.2.
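For completeness, the document-level BLEU reward can be sketched as follows. This is a simplified BLEU (clipped n-gram precisions combined with a brevity penalty, no smoothing), whereas the actual experiments use a standard BLEU implementation; the document-level variant simply concatenates all sentences before scoring, so n-grams may also span sentence boundaries:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions times a
    brevity penalty (a tiny floor replaces proper smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def bleu_doc(hyp_sentences, ref_sentences):
    # BLEU_doc: flatten the document into one token sequence and score once.
    flat = lambda sents: [tok for s in sents for tok in s]
    return bleu(flat(hyp_sentences), flat(ref_sentences))
```

BLEU_sen would instead call bleu() once per sentence and average the scores.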

Mixed objective
Similar to the MIXER training proposed by Ranzato et al. (2016), we have also explored mixing the Risk objective with the NLL. The rationale is similar to that of using BLEU_doc and BLEU_sen as rewards: the NLL loss can help the model not to deviate too much from the reference translation while improving its discourse properties. To mix these losses, we have used an alternate-batch approach: either loss is randomly selected in each training batch with a given probability (e.g., Risk(0.8) means that the Risk loss is selected with 80% probability and the NLL with 20%).
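The alternate-batch mixing can be sketched as follows; risk_step and nll_step are hypothetical callbacks standing in for the parameter update under either loss:

```python
import random

def pick_loss(risk_prob, rng=random):
    # Alternate-batch mixing: with probability `risk_prob` the batch is
    # trained with the Risk objective, otherwise with the NLL.
    return "risk" if rng.random() < risk_prob else "nll"

def train_epoch(batches, risk_prob, risk_step, nll_step, rng=random):
    """One epoch of Risk(risk_prob) training. `risk_step` and `nll_step`
    are placeholders for the model update under the respective loss."""
    for batch in batches:
        if pick_loss(risk_prob, rng) == "risk":
            risk_step(batch)
        else:
            nll_step(batch)
```

Drawing the choice per batch, rather than interleaving the losses on a fixed schedule, keeps the expected mixing ratio while avoiding any systematic ordering effect.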

Datasets and experimental setup
We have performed a broad range of experiments over four different language pairs and three different translation domains (TED talks, movie subtitles and news) which have been used in other popular document-level NMT research (Miculicich et al., 2018;Tu et al., 2018). For translations of TED talks 1 , we have used the datasets released in the IWSLT14 shared task for Spanish-English (Es-En), in the IWSLT15 shared task for Chinese-English (Zh-En) and in IWSLT16 for Czech-English (Cs-En). For all three language pairs, we have used their dev2010 set as the validation set, and sets tst2011-2013 (Zh-En), tst2010-2013 (Cs-En) and tst2010-2012 (Es-En), respectively, as test sets. For translations in the movie subtitles domain, we have used the OpenSubtitles-v2018 dataset (Lison et al., 2018) from OPUS 2 , and the language pairs tested have been Basque-English (Eu-En) and Spanish-English (Es-En). For Eu-En we have used all the available data, but for Es-En we have only used a subset of the corpus to limit time and memory requirements. In both cases, we have divided the data into training, validation and test sets 3 . The last translation domain is news, for which we have used the Es-En News-Commentary11 dataset 4 . As validation and test sets, we have used its newstest2008 and newstest2009-2013 sets, respectively, from WMT 5 . The document boundaries are given by the individual talks for the TED talks dataset, by movie scripts for the subtitles datasets and by single-author news commentaries for the news dataset. All the datasets have been tokenized using the Moses tokenizer 6 , with the exception of Chinese for which we have used Jieba 7 . A truecaser model from Moses has been learned over the training data of each dataset, and applied for consistent word casing as a final pre-processing step.
As models, we have compared multiple models trained with the Risk objective under different combinations of reward functions. This has allowed us to select the best reward functions for translation quality at document level. Then, the model trained with the best reward combination has been compared against the sentence-level NMT and HAN baselines. In our experiments, the Risk training objective has been used to fine-tune a pre-trained HAN_join baseline model, in order not to suffer from a "cold start" due to the large output label space. The main aim of our experiments is to show that the proposed training objectives can lead to performance improvements over HAN_join. Candidate translations have been obtained using beam search with a beam size of only 2, due to memory and computational time limitations. Furthermore, the training batch size has been set to 15 sentences. Since the objective is computed over the batch, this is equivalent to subdividing longer documents into sub-documents of 15 sentences each. Yet, our experimental results show that computing the rewards at such batch level is still effective for improving the translation quality.
Each model has been trained with three different seeds over its training set, and the validation set has been used at all times to select the best model. Then, the average results of the three runs over the test set have been reported. We have measured four different evaluation metrics: BLEU, LC, COH and F_BERT, an alternative metric to BLEU that compares the BERT sentence embeddings of the prediction and the reference and which has been shown to have better correlation with human judgement than BLEU (Zhang et al., 2020). To select the best model over the validation set for the sentence-level NMT baseline, we have used the lowest perplexity. Instead, for the HAN_join baseline and our models, we have chosen the model with the best results in the majority of the four evaluation metrics (BLEU, LC, COH and F_BERT). This has not affected the relative ranking of the sentence-level NMT baseline since its performance has been generally lower than the other approaches. Complete details about the experimental set-up and other hyper-parameters are provided in Appendix A. The code is publicly available 8 .

Results

Table 2 shows the main results from our experiments. Over all datasets, the HAN_join baseline has consistently outperformed the sentence-level NMT model in terms of BLEU score and F_BERT, which shows that including surrounding sentences can help obtain better translation accuracy. However, HAN_join has not performed significantly better than the sentence-level model in terms of LC and COH (it has even performed worse in a few cases), showing that it has not been able to specifically learn these discourse properties in the document. The COH and LC values of both baselines have also been generally lower than those of the human reference translations for all datasets (with the exception of the LC in Zh-En (TED talks) and Es-En (movie subtitles)). Table 2 also shows the results from our best models in comparison to these baselines.

From preliminary experiments, we have seen that the Risk model achieving the best results is the one that combines BLEU_doc, LC_doc and COH_doc as rewards. Yet, choosing the right proportion of Risk and NLL training has proven very important and dataset-dependent. In the TED talks domain (Table 2a), the Risk(1.0) model has outperformed the HAN_join baseline in all evaluated metrics over the Zh-En dataset, improving by 0.63 percentage points (pp) in BLEU, 2.46 pp in LC, 1.17 pp in COH and 0.48 pp in F_BERT, while on the Cs-En dataset the same model has achieved an improvement of 2.68 pp in LC, 0.55 pp in COH and 0.22 pp in F_BERT, with the BLEU score on par. Instead, over the Es-En dataset, even though Risk(1.0) has achieved the highest LC and COH scores, this has come at a larger drop in translation accuracy (i.e., BLEU and F_BERT). Thus, we consider Risk(0.5) to be the best-performing model over this dataset, as it still considerably improves the LC and COH scores (1.28 pp and 0.23 pp, respectively) while keeping similar translation accuracy in terms of BLEU (+0.22 pp) and F_BERT (−0.27 pp). In general, we had not anticipated the improvements in BLEU score and F_BERT, since our main aim had only been to improve the translations in terms of discourse metrics. However, in some cases the improvements in discourse metrics have also translated into higher translation accuracy.

In turn, Table 2b shows the main results over the movie subtitles datasets, which are characterized by documents with, on average, more, yet much shorter, sentences than the TED talks. On these datasets, the Risk(1.0) model has been able to improve the LC and COH metrics to a large extent, but at a marked cost in BLEU score and F_BERT. Qualitatively, the translations generated by this model have often displayed many word and phrase repetitions with little correspondence to the reference translation, showing that COH and LC can reach values that are undesirably high. Conversely, training the model with the mixed objective has forced it to stay closer to the reference translations and helped it achieve higher BLEU and F_BERT scores. On Eu-En, the Risk(0.8) model has improved LC by 3.47 pp with all the other metrics substantially on par. On Es-En, none of the proposed models has clearly outperformed the HAN_join baseline. For instance, the Risk(0.5) model has improved LC and COH by 0.33 pp and 0.18 pp, respectively, but at a cost of 0.35 pp in BLEU score and 0.20 pp in F_BERT.
Finally, Table 2c shows that the proposed models have delivered better results on the news domain dataset, where they have been able to simultaneously improve the BLEU score, LC and COH at a mild cost in F_BERT. In general, we can argue that the discourse rewards have proved more effective on documents such as talks and news commentaries, which come from single authors and are generally controlled in style, than on documents such as subtitles, which are more fragmented in nature.

Ablation study and translation example
To expand the analysis, Table 3 shows the results from an ablation study that explores the impact of the various reward functions over the Zh-En dataset. The best trade-off over the four evaluation metrics seems to be that returned by BLEU_doc + LC_doc + COH_doc, which has achieved the highest BLEU score, a high F_BERT, and high LC and COH. The results also show that using BLEU_sen as a reward has contributed to improving the F_BERT score in all cases, but at a significant expense of the other evaluation metrics. However, when BLEU_doc and BLEU_sen have been compared head-to-head as the sole rewards, the sentence-level BLEU has achieved higher scores in all metrics. In contrast, the BLEU_doc reward has been most effective when used jointly with the cohesion and coherence rewards. In turn, the LC_doc reward without the balance of a BLEU reward has led to LC and COH scores that are likely excessive and undesirable, with a corresponding drop in BLEU score and F_BERT. Conversely, the COH_doc reward has not displayed a comparable degradation. The main overall result from this ablation analysis is that the rewards need to be used in a calibrated combination to deliver the best trade-off across all the evaluation metrics, and that the selection of the best combination can be effectively carried out by validation.

Finally, Table 4 shows an example of the translation of a document excerpt from the Zh-En TED talks dataset made by our best model (Risk(1.0) with BLEU_doc + LC_doc + COH_doc), in comparison to that made by the HAN_join baseline, the reference translation (Ref) and the text in the source language (Src). In this example, we can clearly see the positive influence of the LC and COH rewards, as the model has been able to provide better lexical cohesion and coherence in the translation. The model has also been able to correctly translate words such as bonobos and jungle, where the HAN_join model has produced the more generic chimps.
In addition, the translation generated by our model seems more faithful to the reference overall. Note also that these improvements have come alongside a significant drop in BLEU score, which may suggest that LC and COH can drive improvements that the BLEU score is not able to capture. Examples for the other language pairs are provided in Appendix B.

Conclusion
In this paper, we have presented a novel training method for document-level NMT models that uses discourse rewards to encourage the models to generate more lexically cohesive and coherent translations at document level. As training objective we have used a reinforcement learning-style function, named Risk, that permits using discrete, non-differentiable terms in the objective. Our results on four different language pairs and three translation domains have shown that our models have achieved a consistent improvement in discourse metrics such as LC and COH, while retaining comparable values of accuracy metrics such as BLEU and F_BERT. In fact, on certain datasets, the models have even improved on those metrics. While the approach has proved effective in most cases, the best combination of discourse rewards, accuracy rewards and NLL has had to be selected by validation for each dataset. In the near future we plan to investigate how to automate this selection, and also to explore the applicability of the proposed approach to other natural language generation tasks.

Appendix B: Translation example (Cs-En)

Src: . . . otázka zní : " můžeme ho snížit na nulu ? " pokud budeme spalovat uhlí , tak ne . ani při spalování zemního plynu ne . téměř každý současný způsob výroby elektřiny , s výjimkou rozšiřujících se obnovitelných a jaderných zdrojů , produkuje CO2 . budeme muset v globálním měřítku vytvořit úplně nový systém . a potřebujeme energetické zázraky . . .

Ref: . . . and so the question is : can you actually get that to zero ? if you burn coal , no . if you burn natural gas , no . almost every way we make electricity today , except for the emerging renewables and nuclear , puts out CO2 . and so , what we 're going to have to do at a global scale , is create a new system . and so , we need energy miracles . . .

HAN_join: . . . the question is , can we reduce it to zero ? if we keep burning coal , we don 't . even burning , natural gas don 't . almost every single way of production of electricity , except for the exception of electricity and nuclear resources , produces CO2 . we 're going to have to have a completely new system on a global scale . and we need energy miracles . . .

Risk(1.0): . . . the question is , can we reduce it to zero ? if we keep burning coal , we don 't . even burning , natural gas don 't . almost every single way of producing electricity , except for example , with the exception of renewables and nuclear resources , produces CO2 . we 're going to have to build a completely new system on a global scale . and we need energy miracles . . .

Appendix C: Rewards during training
To show the behaviour of the different rewards during training, Figure 2 plots the BLEU, LC and COH scores over the Cs-En validation set at different training iterations. The plot confirms the intuition that improving LC and COH comes at a cost in BLEU score. In the first 2000 training iterations, LC has improved by more than 2 pp and COH by more than 1 pp, while the BLEU score has dropped by approximately 0.3 pp. Moreover, the highest scores for LC and COH coincide with the lowest score for BLEU (iteration 4000). Overall, validation is needed to obtain the model with the best trade-off between BLEU, LC and COH (in this case, for instance, at iteration 2000 or 6800).