Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English

In this work, we conduct an evaluation study comparing offline and online neural machine translation architectures. Two sequence-to-sequence models are considered: the convolutional Pervasive Attention model (Elbayad et al. 2018) and the attention-based Transformer (Vaswani et al. 2017). We investigate, for both architectures, the impact of online decoding constraints on the translation quality through a carefully designed human evaluation on the English-German and German-English language pairs, the latter being particularly sensitive to latency constraints. The evaluation results allow us to identify the strengths and shortcomings of each model when we shift to the online setup.


Introduction
Sequence-to-sequence models are state-of-the-art in a variety of sequence transduction tasks including machine translation (MT). The most widespread models are composed of an encoder that reads the entire source sequence, while a decoder (often equipped with an attention mechanism) iteratively produces the next target token given the full source and the decoded prefix. Aside from the conventional offline use case, recent works adapt sequence-to-sequence models for online (also referred to as simultaneous) decoding with low-latency constraints (Gu et al., 2017; Dalvi et al., 2018; Ma et al., 2019; Arivazhagan et al., 2019). Online decoding is desirable for applications such as real-time speech-to-speech interpretation. In such scenarios, the decoding process starts before the entire input sequence is available, and online prediction generally comes at the cost of reduced translation quality.
In this work we focus on online neural machine translation (NMT) with deterministic wait-k decoding policies (Dalvi et al., 2018; Ma et al., 2019). With such a policy, we first read k tokens from the source, then alternate between producing a target token and reading another source token (see Figure 1). We consider two sequence-to-sequence models, the 2D-convolution based Pervasive Attention model (Elbayad et al., 2018) and the attention-based Transformer (Vaswani et al., 2017). We investigate, for both architectures, the impact of online decoding constraints on the translation quality through a carefully designed human evaluation on English→German and German→English language pairs. Our contributions are twofold: (1) our work, to the best of our knowledge, is the first human evaluation of online vs. offline NMT systems; (2) we compare the Transformer and Pervasive Attention architectures, highlighting the advantages and shortcomings of each when we shift to the online setup. The rest of this paper is organized as follows: we present in §2 related work pertaining to online MT and error analysis of NMT systems. We describe our experimental setup for human evaluation and error analysis in §3. We follow with the evaluation results in §4 and summarize our findings in §5.
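The wait-k read/write schedule can be made concrete with a small sketch (an illustrative helper, not the authors' implementation; the function name is ours): after an initial wait of k reads, each target-token write is followed by one source-token read until the source is exhausted.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Enumerate ("read", j) / ("write", t) actions of a wait-k policy:
    read min(k, src_len) source tokens, then alternate write/read until
    the source is exhausted, then write the remaining target tokens."""
    actions = []
    read, written = 0, 0
    # initial wait: read up to k source tokens
    while read < min(k, src_len):
        read += 1
        actions.append(("read", read))
    while written < tgt_len:
        written += 1
        actions.append(("write", written))
        # after each write, read one more source token if any remain
        if read < src_len and written < tgt_len:
            read += 1
            actions.append(("read", read))
    return actions
```

With this schedule, target token t is produced after reading min(k + t - 1, src_len) source tokens, matching the wait-k definition of Ma et al. (2019).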

Related work

Online NMT
After pioneering works on online statistical MT (SMT) (Fügen et al., 2007; Yarmohammadi et al., 2013; He et al., 2015; Grissom II et al., 2014; Oda et al., 2015), one of the early works on attention-based online translation is Cho and Esipova (2016), using manually designed criteria that dictate whether the model should perform a read or a write operation. Dalvi et al. (2018) proposed a deterministic decoding policy that starts with k read operations then alternates between blocks of l write/read operations. This simple approach outperforms the information-based criteria of Cho and Esipova (2016) and allows complete control of the translation delay. Ma et al. (2019) trained Transformer models (Vaswani et al., 2017) with a wait-k decoding policy that first reads k source tokens then alternates single read-writes. For dynamic online decoding, Luo et al. (2017) and Gu et al. (2017) rely on Reinforcement Learning to optimize a read/write policy. To combine the end-to-end training of wait-k models with the flexibility of dynamic online decoding, Zheng et al. (2019b) and Zheng et al. (2019a) use Imitation Learning. Recent work on dynamic online translation uses monotonic alignments (Raffel et al., 2017) with either a limited or infinite lookback (Chiu and Raffel, 2018; Arivazhagan et al., 2019; Ma et al., 2020). In this work, we focus on wait-k and greedy decoding strategies, but unlike other wait-k models (Ma et al., 2019; Zheng et al., 2019b; Zheng et al., 2019a) we opt for unidirectional encoders, which are efficient to train in an online setup.

Error analysis for NMT
With the advances in NMT (Bahdanau et al., 2015; Vaswani et al., 2017), the quality of translations has improved substantially, leading to claims of human parity in high-resource settings (Wu et al., 2016; Hassan et al., 2018). With such improvements, it becomes more and more difficult for automatic evaluation metrics such as BLEU (Papineni et al., 2002) to detect subtle differences. Manual error annotation is a more instructive quality assessment to gain insights into the performance of MT systems, especially in direct comparisons. Human evaluation might lead to conclusions at odds with automatic metrics, as was the case in last year's WMT English-German evaluation (Barrault et al., 2019).
Comparison to SMT and rule-based MT. Bentivogli et al. (2016) studied post-editing of English-German TED talks and found that NMT makes considerably fewer word order errors than SMT. They also observed that the performance of NMT degrades faster than that of SMT with increasing sentence length. Toral and Sánchez-Cartagena (2017) reached similar conclusions on news stories in 9 language directions. Isabelle et al. (2017) tested NMT systems with challenging linguistic material and highlighted the efficiency of NMT systems at handling subject-verb agreement and syntactic and lexico-syntactic divergences, as well as the struggle of NMT with idiomatic phrases. Castilho et al. (2017a), Castilho et al. (2017b) and Van Brussel et al. (2018) observed that NMT outperforms SMT in terms of fluency, but at the same time is more prone to accuracy errors. Klubička et al. (2018) made similar observations in an evaluation of English-Croatian, concluding that, compared to SMT and rule-based MT, NMT tends to sacrifice completeness of translations in order to increase fluency.
Error typologies for MT. Various error typologies with different levels of granularity have been proposed to evaluate MT systems (Flanagan, 1994; Vilar et al., 2006; Stymne and Ahrenberg, 2012; Lommel et al., 2014b). In their evaluation of SMT outputs, Vilar et al. (2006) defined five error categories: missing words, word order, incorrect words, unknown words and punctuation errors. Bentivogli et al. (2016) followed a simpler classification with three types of errors: morphological, lexical and word order. Their choice was motivated by the difficulty of disambiguating sub-categories of lexical errors (Popović and Ney, 2011). The evaluation in Klubička et al. (2018) is based on the Multidimensional Quality Metrics (MQM) framework (Lommel et al., 2014b). In their study, they found that mistranslation is the most frequent accuracy error in NMT translations. Van Brussel et al. (2018) observed that mistranslation and omission errors are particularly challenging for NMT users because, contrary to SMT and rule-based MT, these errors are often not signaled by flawed fluency, which makes them more difficult to identify and post-edit.
Error analysis for online MT. In the context of online translation, Hamon et al. (2009) evaluated a spoken language translation system (ASR+MT) in comparison to a human interpreter, where each segment is judged in terms of adequacy and fluency. To our knowledge, our work is the first to propose a fine-grained human evaluation of online NMT systems. It focuses on the English→German and German→English language pairs, the latter being particularly sensitive to latency constraints. This is in part due to German sentence-final structures (e.g. verbs in subordinate clauses) that require long-distance reordering in translation into syntactically divergent languages.

Experimental setup
In this work we train Transformer (Vaswani et al., 2017) and Pervasive Attention (Elbayad et al., 2018) models for the tasks of online and offline translation. Following Elbayad et al. (2020), we use unidirectional encoders and train the online MT models with k_train = 7, shown to yield better translations across the latency spectrum. We train our models on the IWSLT'14 De→En (German→English) and En→De (English→German) datasets (Cettolo et al., 2014). Sentences longer than 175 words and pairs with a length-ratio exceeding 1.5 are removed. The training set consists of 160K pairs, with 7,283 held out for development, and the test set has 6,750 pairs from TED dev2010+tst2010-2013. All data is tokenized using the standard scripts from the Moses toolkit (Koehn et al., 2007). We segment sequences using byte pair encoding (Sennrich et al., 2016), BPE for short, on the bi-texts, resulting in a shared vocabulary of 32K types. We train Pervasive Attention (PA) with 14 layers and 7-wide filters and Transformer (TF) small for offline and online translation. We evaluate our wait-k models with k_eval = 3, achieving a low latency of AL ∈ [2.5, 3.5] (see Table 1). For a fair comparison, both online and offline models are decoded greedily. We will refer to these four models as PA-offline, PA-online, TF-offline and TF-online.
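The corpus filtering described above can be sketched as follows (a hypothetical helper mirroring the stated thresholds; the function name and the choice of longer/shorter ratio are our assumptions):

```python
def filter_pairs(pairs, max_len=175, max_ratio=1.5):
    """Drop sentence pairs where either side exceeds max_len tokens
    or the length ratio (longer side / shorter side) exceeds max_ratio."""
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        if ls == 0 or lt == 0:
            continue                          # skip empty sides
        if max(ls, lt) > max_len:
            continue                          # overly long sentence
        if max(ls, lt) / min(ls, lt) > max_ratio:
            continue                          # implausible length ratio
        kept.append((src, tgt))
    return kept
```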

Analysis factors
In this section, we describe a few factors used in this work to analyze the results of automatic and human evaluations.
Source length. Similar to other evaluation studies of NMT systems (Bentivogli et al., 2016; Toral and Sánchez-Cartagena, 2017; Koehn and Knowles, 2017), we look into the length of the source sequence and its effect on the quality of translation.
Lagging difficulty (LD). In the particular context of online translation, source-target alignments are an indicator of how easy it is to translate an input. To measure the lagging difficulty of a pair (x, y), we first estimate source-target alignments with fast-align (Dyer et al., 2013) and then infer a reference decoding path. The reference decoding path, denoted z_align, is non-decreasing and guarantees that at a given decoding position t, z_t is larger than or equal to all the source positions aligned with t. The lagging difficulty is finally measured as the Average Lagging (AL) (Ma et al., 2019) of the parsed z_align:

LD(x, y) = AL(z_align) = (1/τ) Σ_{t=1}^{τ} (z_t − (t−1)/γ), with γ = |y|/|x| and τ the first target position at which the full source has been read.

AL measures the lag, in tokens, behind the ideal simultaneous policy wait-0, and so LD measures the lag of a realistic simultaneous translation that has the aligned context available when decoding. The higher the LD, the more challenging it is to constrain the latency of the translation.
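As a sketch, assuming 1-based word alignments in the fast-align style, LD can be computed by deriving the minimal non-decreasing path z_align and scoring it with AL; the function and variable names are ours, not from the paper:

```python
def lagging_difficulty(alignments, src_len, tgt_len):
    """Estimate LD(x, y): build the minimal non-decreasing decoding path
    z implied by word alignments (z[t] covers every source token aligned
    to target position t), then score it with Average Lagging (Ma et al.,
    2019). `alignments` is a list of (src_pos, tgt_pos), 1-based."""
    z = [0] * (tgt_len + 1)  # 1-based; z[0] is a sentinel
    for j, t in alignments:
        z[t] = max(z[t], j)  # must have read all source tokens aligned to t
    for t in range(1, tgt_len + 1):
        z[t] = max(z[t], z[t - 1], 1)  # enforce non-decreasing, read >= 1
    gamma = tgt_len / src_len
    # tau: first target step whose context covers the entire source
    tau = next((t for t in range(1, tgt_len + 1) if z[t] == src_len), tgt_len)
    return sum(z[t] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For a perfectly monotonic one-to-one alignment the path lags one token behind wait-0, so LD is 1.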
Relative positions. We look into the correlation between the relative positions, source-side and target-side, and the translation quality. An annotated token ỹ_t of the system's hypothesis ỹ has a target-side relative position t/|y|. Similarly, an annotated source token x_j has a relative position j/|x|. We argue that with wait-k decoding policies, the position of the token might be a contributing factor to the adequacy/fluency of the translation.

Figure 2: MQM-based error typology used for our manual annotation.

Human evaluation
In addition to the use of automatic evaluation metrics, we conduct an in-depth manual analysis to compare the quality of the output produced by the four systems. From the full test sets, we sample 200 segments in each translation direction. We first restrict the pool to segments whose source sentence lengths fall between the first and third length quartiles. We then remove segments that contain the <unk> token (out-of-vocabulary) and bin the segments by lagging difficulty (see §3.1). We sample from the binned segments to cover all ranges of difficulty and manually remove misaligned segments. Subsequently, the sampled segments were manually error-annotated by a total of four human annotators.
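The difficulty-binned sampling step could look roughly like this (a sketch; equal-width bins, the bin count, and the function name are our assumptions, since the paper does not specify the binning scheme):

```python
import random

def sample_by_difficulty(segments, difficulties, n_bins=4, per_bin=50, seed=0):
    """Bin segments by lagging difficulty into equal-width bins and draw
    up to `per_bin` segments from each bin, covering all difficulty ranges."""
    rng = random.Random(seed)
    lo, hi = min(difficulties), max(difficulties)
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    bins = [[] for _ in range(n_bins)]
    for seg, d in zip(segments, difficulties):
        idx = min(int((d - lo) / width), n_bins - 1)  # clamp the max value
        bins[idx].append(seg)
    sample = []
    for b in bins:
        rng.shuffle(b)
        sample.extend(b[:per_bin])
    return sample
```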
Error typology. For error annotation, a subset of the MQM error typology (Lommel et al., 2014b) was used. A pilot annotation based on 50 translation segments not included in the present test data was carried out to select the MQM error types relevant to this study. In this way, the typology was kept to a manageable size to avoid annotators' cognitive overload. The resulting error typology comprises 13 error types, grouped into three major branches: accuracy, fluency and other, as shown in Figure 2. The error type non-existing word form was added to the typology to capture target words invented through BPE tokenization that do not exist in the target language, such as translating Pfadfinder (German for scout) into Badfinder. The hierarchical nature of the error typology enables annotation and quality analyses at various levels of granularity; annotators were requested to give preference to more specific error types (i.e. located deeper in the hierarchy) whenever possible.
Annotation interface. To annotate the segments, we use ACCOLÉ, an online collaborative platform for error annotation (Esperança-Rodier et al., 2019). ACCOLÉ offers a range of services that allow simplified management of corpora and error typologies, with the possibility to specify the error typology and search for a particular error type in the annotations. Annotators are tasked with labeling translation errors by locating the appropriate spans in the target and source segments.
Selection and training of annotators. Per language pair, two annotators with native proficiency in the respective target language and near-native to native proficiency in the source language were recruited for the manual annotation task. Following the MQM guidelines and recommendations from NMT evaluation studies (Läubli et al., 2020), the annotators are professional translators. Two of the annotators (one per language pair) also teach in a translation degree programme at university and thus have considerable experience in linguistic translation analysis. To familiarize the annotators with the annotation scheme and the use of ACCOLÉ, they were provided with training materials consisting of a written annotation manual, a description of the error typology and a decision tree to guide the selection of appropriate error types. In addition, annotators practiced the annotation procedure on a calibration set of 30 segments representative of the full test data but not included in the 200 segments to be annotated. Subsequently, annotators were given individual feedback and corrective guidance on their annotations.

Evaluation results

Automatic evaluation
For each translation direction, En→De and De→En, we assess the quality of our systems by measuring BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), TER (Snover et al., 2006), ROUGE-L (Lin, 2004) and BERTScore (Zhang et al., 2020). We use the default weights and parameters for METEOR and we report the F1 measure combining BERTScore precision and recall. We test for statistical significance with paired bootstrap resampling (Koehn, 2004) using a sample size of 3000 segments. We report the automatic scores evaluated on the IWSLT'14 De↔En test set in Table 1 and Figure 3. For bucketed BLEU scores, we bin the test data based on the lagging difficulty of the pair or the source length and measure corpus-level BLEU in each bin.
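A minimal sketch of paired bootstrap resampling over segment-level scores follows. Note the actual significance test for BLEU resamples segments and recomputes corpus-level BLEU on each sample; here we compare mean per-segment scores for simplicity, and the function name is ours:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, size=None, seed=0):
    """Paired bootstrap resampling (after Koehn, 2004): draw segments
    with replacement and count how often system A beats system B.
    Returns the fraction of bootstrap samples where A scores higher."""
    rng = random.Random(seed)
    n = len(scores_a)
    size = size or n
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(size)]  # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```

A returned value close to 1.0 (e.g. above 0.95) indicates that A's advantage is statistically significant.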
We observe that: (1) In offline translation, TF and PA have comparable performance on De→En, with a slight advantage to TF on all metrics except BLEU. In the En→De direction, TF widens the gap with PA significantly. When binning En→De by lagging difficulty, PA is outperformed by TF in all ranges of difficulty except the first, easiest bin. (2) As expected, online decoding leads to a degradation of translation quality. The degradation is higher for De→En (5 BLEU points) than for En→De (3 BLEU points), arguably because German uses not only verb-initial but also verb-final constructions depending on clause type, thus posing more latency-related challenges for online translation. (3) When switching to online translation, the degradation of PA is smaller on average than that of TF, allowing PA to close the gap with TF in both directions. (4) Although the translation quality of the systems in both directions decreases with the length of the source segment, length is a weaker feature for En→De compared to De→En. Lagging difficulty proves to be a better feature, not only in online translation but also in offline translation, with a steeper decline in BLEU scores as the difficulty increases.

Human evaluation
To analyse the annotation data, we rely on the summed count of errors reported by the two annotators. For token-level analysis, we parse the span of each reported error and consider the union of the two annotations to label each output token. To assess the reliability of the error annotation, we measure inter-annotator agreement with Cohen's κ (Cohen, 1960) at the token level, measuring whether the two annotators agree on the exact error type assigned to each token. We observe an agreement of 0.33 for De→En and 0.40 for En→De, which is comparable to other MQM-based evaluation studies (Lommel et al., 2014a; Specia et al., 2017).
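Token-level Cohen's κ can be computed as in the following sketch (a generic implementation; the label set, including an "ok" label for unannotated tokens, is an assumption):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' token-level error labels
    (e.g. "mt", "om", or "ok" for tokens with no error annotation)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of tokens with identical labels
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under chance, from each annotator's label marginals
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_exp = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    if p_exp == 1.0:
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)
```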
For each error type in our typology, we report in Table 2 the count of its occurrences as labeled by the two annotators. The frequencies are arranged by task (De→En and En→De), by system (PA and TF) and by decoding setup (offline and online). We observe that: (1) In line with previous works analysing NMT outputs, NMT systems are more prone to accuracy errors than fluency errors (Castilho et al., 2017b; Toral and Sánchez-Cartagena, 2017; Klubička et al., 2018; Van Brussel et al., 2018). (2) In accordance with the automatic evaluation, the total increase of errors between offline and online (last row of Table 2) is higher for De→En than for En→De, and PA is slightly less impacted by the shift from offline to online.
(3) Unlike in the automatic evaluation, where TF slightly outperforms PA, in three out of the four setups in Table 2 PA has fewer errors than TF. (4) The relative increase of errors between offline and online is larger for fluency than for accuracy, especially for TF. (5) The relative increase of errors is particularly high for addition (ad), word order (wo) and duplication (du), the latter being even more problematic for TF. (6) In line with other error-annotation studies (Klubička et al., 2018; Van Brussel et al., 2018; Specia et al., 2017), mistranslation (mt) is the largest contributor to accuracy errors. Annotating mistranslations is particularly ambiguous, leading to a lower inter-annotator agreement. (7) The offline errors with the most consistent gap between PA and TF are duplication and omission in favor of PA, and grammar and overly literal in favor of TF. (8) More grammar errors are found for En→De compared to De→En; one reason might be that German morphology is richer and more complex than English morphology, leaving more possibilities for a system to make grammar errors. (9) Typography errors are more prevalent in En→De and are highly impacted by latency constraints: most of these typography errors are incorrect punctuation marks (extraneous or missing commas) as well as wrong casing and missing white spaces between compound nouns (producing correct German compounds seems to be especially difficult for online systems). This increase of typography errors suggests that online systems are more literal, as evidenced by the prevalence of incorrect punctuation marks.

Fine-grained analysis
In this section, we break down the annotated set according to the analysis factors (source length (|x|), lagging difficulty (LD(x, y)) and relative positions). For each of these factors, we bin the annotated segments or tokens and report in Figures 4 and 5 normalized counts of errors (divided by the total number of tokens in each bin). Each error corresponds to a row of the figure, with a heat-map of the normalized counts (read left-to-right with an increasing factor) followed by Pearson's r correlation coefficient.
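The per-bin normalization and the correlation statistic can be sketched as follows (equal-width bins and the helper names are our assumptions):

```python
def binned_error_rates(values, error_counts, token_counts, n_bins=4):
    """Normalized error counts per bin of an analysis factor (e.g. source
    length): sum error and token counts per equal-width bin, then divide."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    errs = [0.0] * n_bins
    toks = [0] * n_bins
    for v, e, t in zip(values, error_counts, token_counts):
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max value
        errs[i] += e
        toks[i] += t
    return [e / t if t else 0.0 for e, t in zip(errs, toks)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```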
Source length. In Figure 4a: (1) in accordance with the automatic evaluation results of Figure 3, the relative count of errors per length bucket is positively correlated with the length. However, for mistranslation (mt) we observe a peak of the relative error count in early bins with a decrease as we move to longer segments. This could be attributed to the higher cognitive load in the annotation of longer segments, as it may be easier for annotators to exhaustively label the errors in short segments than in longer ones. The ease of annotating short segments can also be attributed to their fluency; since these segments have fewer fluency errors, labeling mistranslated words is less ambiguous. (2) Addition (ad) errors in De→En online systems are considerably more correlated with the source length than in the offline systems (PA: 0.01 → 0.26 and TF: 0.02 → 0.19). This shows that the increase in these errors is mostly located in longer segments. (3) Omission (om) and duplication (du) errors, more problematic for TF, have a higher correlation with the source length in En→De (du: 0.12 → 0.27 and om: 0.03 → 0.15) but not in De→En.
Difficulty. Figure 4b shows that: (1) even in offline systems, the relative error count is positively correlated with lagging difficulty. This is particularly noticeable for word order (wo) errors. (2) For online systems, additions (ad) and omissions (om) are particularly correlated with the lagging difficulty. These errors are the system's way of dealing with missing context. (3) Although duplication (du) can also be thought of as a response to missing context, it is more correlated with the source length than with the difficulty.
Relative position. Unlike length and lagging difficulty, the analysis of relative position is based on token-wise labels. In Figure 5, we observe that (1) in online De→En systems, most omission (om) and mistranslation (mt) errors concern final source positions and occur near the end of the translation. The high prevalence of these error types in final source positions confirms the well-known difficulty of German sentence-final structures. Note that sentence-final errors on the source side (De) do not necessarily lead to sentence-final errors in the target (En), since the structural differences between the two languages require reordering operations.

Figure 5: Bucketed token-level count of errors.
(2) Duplication (du) errors in TF, on the other hand, affect initial source positions near the end of the translation. This means that TF circles back to the beginning of the source to pad the length of the hypothesis.

Agreement of the automatic metrics with human annotation
In Table 4 we evaluate the correlation (Pearson's r) between the automatic metrics and human judgment, using the error count as a proxy. Unlike in §4.1, where we evaluate corpus-level BLEU, in this section we use sentence-level BLEU, denoted SentBLEU. We highlight the following observations: (1) On average, the model's confidence (p(ỹ|x)) and BERTScore have the highest correlation with the annotated errors in De→En. In En→De, BERTScore is less correlated with the human evaluation; this is likely due to the use of a smaller German model (base instead of the English large). The correlation is higher for accuracy errors. (2) Given its high correlation with errors, the confidence can be used to efficiently decide when to read and when to write instead of following a deterministic decoding path such as wait-k. (3) TER (the normalized number of edits required to get from the hypothesis to the reference) is a better indicator of quality than BLEU and METEOR in online systems. It is particularly correlated with addition (ad) and duplication (du) errors, which are frequent in the online setup. (4) Although omissions (om) in En→De are negatively correlated with confidence, and can thus be avoided with a well-tuned decoding algorithm, this is not the case for De→En (verb-final to verb-medial). This means that the model omits tokens with high confidence and is unable to predict that context is missing from the source.

Conclusion
We have conducted an evaluation of offline and online NMT systems for spoken language translation. Our aim was to shed light on the strengths and weaknesses of wait-k decoding under two different architectures, Transformer and Pervasive Attention. We found that Transformer models are strongly affected by the shift to online decoding, with a significant increase in fluency errors, most of which are duplications. PA, on the other hand, is well adapted for online decoding. Our error analysis shows that translation quality in online models can potentially be improved by making read/write decisions based on the model's confidence, in order to filter out avoidable additions, mistranslations and duplications. The syntactic asymmetry between German and English remains a challenge for deterministic online decoding; a more detailed analysis of syntactically required long-distance reorderings is left for future work. In this regard, indicators such as lagging difficulty and relative position are more informative for online translation.

A Inter-annotator agreement
Length. To study whether long segments are more challenging to annotate, we measure inter-annotator agreement (IAA) in buckets of source length. Similar to prior studies (Flammia and Zue, 1995; Stymne and Ahrenberg, 2012; Bojar et al., 2011), we found that the length of the sequence has a negative correlation with the agreement, possibly because of the increasing cognitive load (see Figure 6a).
Error type. To assess the ambiguity of the error types in our study, we measure binary agreements for each error in the typology. In this setup, we consider the annotation of each error type as a binary classification task. Without chance correction (Figure 6b), mistranslation (mt) is the error type with the highest disagreement, possibly because of its ambiguity. Agreement on rare errors such as accuracy (ac) and unintelligible (un) is zeroed out after chance correction.
Figure 1: Wait-k decoding as a sequence of reads (horizontal) and writes (vertical) over a source-target grid. After first reading k tokens, the decoder alternates between reads and writes. In Wait-∞, or Wait-until-End (WUE), the entire source is read first.

Figure 3: Bucketed BLEU scores by source length and by lagging difficulty of the full test set.

Figure 6: IAA measured with Cohen's kappa or as agreement proportion without chance correction. The left panel shows the IAA per hypothesis length and the two right panels break down the agreement per error type.

Table 2: Total number of errors (sum of two annotations) per error type for each system. The system (PA or TF) with fewer errors is shown in bold.
Und wenn wir dies für Rohdaten machen können warum nicht auch(om) für Inhalte selbst?(ol)
TF-online: And if we could do this for raw data, why not do it (om) for content itself? (ol)

Table 3: Example annotations from De→En. Accuracy errors are in red and fluency errors in blue.