Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter

Neural machine translation usually adopts autoregressive models and suffers from exposure bias as well as the consequent error propagation problem. Many previous works have discussed the relationship between error propagation and the accuracy drop (i.e., the left part of the translated sentence is often better than its right part in left-to-right decoding models) problem. In this paper, we conduct a series of analyses to deeply understand this problem and get several interesting findings. (1) The role of error propagation on accuracy drop is overstated in the literature, although it indeed contributes to the accuracy drop problem. (2) Characteristics of a language play a more important role in causing the accuracy drop: the left part of the translation result in a right-branching language (e.g., English) is more likely to be more accurate than its right part, while the right part is more accurate for a left-branching language (e.g., Japanese). Our discoveries are confirmed on different model structures including Transformer and RNN, and in other sequence generation tasks such as text summarization.


Introduction
Neural machine translation (NMT) has attracted much research attention in recent years (Bahdanau et al., 2014; Shen et al., 2018; Song et al., 2018; Xia et al., 2018; He et al., 2016; Wu et al., 2017, 2018). The major approach to the task typically leverages an encoder-decoder framework (Cho et al., 2014; Sutskever et al., 2014), and the decoder usually generates the target tokens one by one from left to right autoregressively, so that the generation of a target token is conditioned on previously generated target tokens.
It has been observed that for an NMT model with left-to-right decoding, the right part words in its translation results are usually worse than the left part words in terms of accuracy (Zhang et al., 2018; Bengio et al., 2015; Ranzato et al., 2015; Hassan et al., 2018; Liu et al., 2016b,a). This phenomenon is referred to as accuracy drop in this paper. A straightforward explanation of accuracy drop is error propagation: if a word is mistakenly predicted during inference, the error will be propagated and the future words conditioned on this one will be impacted. Different methods have been proposed to address the problem of accuracy drop (Liu et al., 2016a,b; Hassan et al., 2018).
* Authors contribute equally to this work.
Instead of solving the problem, in this paper we aim to understand its causes in depth. In particular, we want to answer the following two questions:
• Is error propagation the main cause of accuracy drop?
• Are there any other causes leading to accuracy drop?
To answer these two questions, we conduct a series of experiments to analyze the problem.
First, we train NMT models separately using left-to-right and right-to-left decoding (Sennrich et al., 2016a; Liu et al., 2016b; He et al., 2017; Gao et al., 2018) on several language pairs (i.e., German to English, English to German, and English to Chinese). If error propagation is the main cause of accuracy drop, then the right part words in the translation results generated by right-to-left NMT models should be more accurate than the left part words. However, we observe the opposite: the accuracy of the right part words of the translated sentences is lower than that of the left part in both left-to-right and right-to-left models, which contradicts the error-propagation explanation. This shows that error propagation alone cannot explain the accuracy drop well, and even suggests that error propagation may not exist or matter.
Second, to further investigate the influence of error propagation on accuracy drop, we conduct a set of experiments with teacher forcing (Williams and Zipser, 1989) during inference, in which we feed the ground-truth preceding words to predict the next target word. Teacher forcing eliminates exposure bias as well as error propagation in inference. The results verify the existence of error propagation, since the later part of the translation results (the right part in left-to-right decoding and the left part in right-to-left decoding) gets a larger accuracy improvement with teacher forcing, regardless of the decoding direction. Meanwhile, the accuracy of the right part is still lower than that of the left part with teacher forcing, which demonstrates that there must be some other causes apart from error propagation leading to accuracy drop.
Third, inspired by linguistics, we find that the concept of branching (Berg et al., 2011; Payne, 2006) can help to explain the problem. We conduct a third set of experiments to study the correlation between language branching and accuracy drop. We find that if a target language is right-branching, such as English, the accuracy of the left part words is usually higher than that of the right part words, for both left-to-right and right-to-left NMT models, while for a left-branching target language such as Japanese, the accuracy of the left part words is usually lower than that of the right part, again for both models. The intuitive explanation is that a right-branching language has a clearer structure pattern (easier to predict) in the left part of a sentence than in the right part, since the main subject of the sentence is usually put in the left part. We calculate two statistics to verify this assumption: n-gram statistics (including n-gram frequency and conditional probabilities) and dependency parsing statistics. For right-branching languages, we find higher n-gram frequency/conditional probabilities as well as more dependencies in the left part than in the right part. The opposite results are found in left-branching languages.
We summarize our findings as follows.
• Through empirical analyses, we find that the influence of error propagation is overstated in the literature, which may misguide future research. Error propagation alone cannot fully explain the accuracy drop in the left or right part of a sentence.
• We find that branching in linguistics correlates well with accuracy drop in the left or right part of a sentence, and the corresponding analysis of n-gram and dependency parsing statistics explains this phenomenon well.
Our studies show that linguistics can be very helpful for understanding existing machine learning models and building better models for language-related tasks. We hope that our work can bring some insights to the research on neural machine translation. We believe that our findings can help us to design better translation models. For example, the finding on language branching suggests using left-to-right NMT models for right-branching languages such as English and right-to-left NMT models for left-branching languages such as Japanese.
2 Related Work

Exposure Bias and Error Propagation
Exposure bias and error propagation are two different concepts but are often mentioned together in the literature (Bengio et al., 2015; Shen et al., 2016; Ranzato et al., 2015; Liu et al., 2016b,a; Zhang et al., 2018; Hassan et al., 2018). Exposure bias refers to the fact that a sequence generation model is usually trained with teacher forcing but generates the sequence autoregressively during inference. This discrepancy between training and inference can yield errors that accumulate quickly along the generated sequence, which is known as error propagation (Bengio et al., 2015; Shen et al., 2016; Ranzato et al., 2015). Bengio et al. (2015) propose the scheduled sampling method to eliminate exposure bias and the resulting error propagation, which achieves promising performance on sequence generation tasks such as image captioning. Shen et al. (2016) and Ranzato et al. (2015) improve the basic maximum likelihood estimation (MLE) with reinforcement learning or minimum risk training, aiming to address the limitation of MLE training and the exposure bias problem. Liu et al. (2016b,a), Zhang et al. (2018), and Hassan et al. (2018) mainly ascribe accuracy drop (the accuracy of right part words is worse than that of the left part in most cases) to error propagation and propose different methods to solve this problem. Liu et al. (2016b,a) and Hassan et al. (2018) use agreement regularization between the left-to-right and right-to-left models to achieve better performance. Zhang et al. (2018) and Hassan et al. (2018) propose two-pass decoding to refine the generated sequence and yield better quality.

Tackling Accuracy Drop
All these works focus on error propagation and accuracy drop. To our knowledge, there is no deep study about other causes of accuracy drop. In this paper, we aim to conduct such a study. Our study shows that accuracy drop is caused not only by error propagation, but also by the characteristics of the language itself.
3 Error Propagation and Accuracy Drop

Error Propagation is Not the Only Cause
A left-to-right NMT model feeds target tokens one by one from left to right in training and generates target tokens one by one from left to right during inference, while a right-to-left NMT model trains and generates tokens in the reverse direction. Intuitively, if error propagation is the root cause of accuracy drop, then a right-to-left NMT model will generate translations whose right half is more accurate than the left half. In this section, we study the results of both left-to-right and right-to-left NMT models to analyze the relationship between error propagation and accuracy drop.
We conduct experiments on three translation tasks with different language pairs: IWSLT 2014 German-English (De-En), WMT 2014 English-German (En-De) and WMT 2017 English-Chinese (En-Zh). We choose the state-of-the-art NMT model Transformer (Vaswani et al., 2017) as the basic model structure and train two separate models with left-to-right and right-to-left decoding on each language pair. More details about the datasets and model descriptions can be found in the supplementary materials (Sections A.1 and A.2). We evenly split each generated sentence into a left half and a right half with the same number of words. Then, for both the left and the right half, we compute the accuracy with respect to the reference target sentence, in terms of BLEU score. We first report the BLEU scores of the full translation results (without split) in Table 1. As can be seen, the accuracy of the model is comparable to state-of-the-art results (Vaswani et al., 2017; Wang et al., 2017, 2018). Afterwards, we report the BLEU scores of the left half and the right half in Table 2. We have several observations.
• When translating from left to right, the BLEU score of the left half is higher than that of the right half on all three tasks, which is consistent with previous observations and can be explained by error propagation.
• When translating from right to left, the accuracy of the left half (in this case the later part of the generated sentence) is still higher than that of the right half. Such an observation contradicts the previous analyses of error propagation and accuracy drop, which hold that the accumulated error brought by exposure bias deteriorates the quality of the later part of the translation (i.e., the left half).
The inconsistent observation above suggests that error propagation is not the only cause of accuracy drop and that there are other factors beyond error propagation. It even challenges the existence of error propagation: does error propagation really exist? In the next section we try to answer this question through teacher forcing experiments.
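The half-sentence evaluation described above can be sketched as follows. This is a minimal illustration, not the evaluation script used in the experiments: the helper names `split_halves` and `clipped_precision` are hypothetical, a clipped n-gram precision stands in for full BLEU, and two details are assumptions, namely that an odd-length sentence gives its extra word to the right half and that each half of the hypothesis is scored against the matching half of the reference.

```python
from collections import Counter
from typing import List, Tuple


def split_halves(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Split a sentence into left and right halves.

    Assumption: for odd lengths the extra word goes to the right half.
    """
    mid = len(tokens) // 2
    return tokens[:mid], tokens[mid:]


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp: List[str], ref: List[str], n: int) -> float:
    """Clipped n-gram precision, the building block of BLEU (no length penalty)."""
    hyp_counts = ngram_counts(hyp, n)
    ref_counts = ngram_counts(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


hyp = "the cat sat on a mat".split()
ref = "the cat sat on the mat".split()
hyp_left, hyp_right = split_halves(hyp)
ref_left, ref_right = split_halves(ref)

# Assumption: each hypothesis half is compared with the same half of the reference.
left_score = clipped_precision(hyp_left, ref_left, 1)
right_score = clipped_precision(hyp_right, ref_right, 1)
print(left_score, right_score)
```

In this toy pair the single wrong word falls in the right half, so only the right-half score drops; a corpus-level version would accumulate matched and total counts over all sentences before dividing.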

The Influence of Error Propagation
Teacher forcing (Williams and Zipser, 1989) in sequence generation means that when training a sequence generation model, we feed the previous ground-truth tokens as inputs to predict the next target word. Here we apply teacher forcing in the inference phase of NMT: to generate the next word $\hat{y}_i$, we input the preceding ground-truth words $y_{<i}$ rather than the previously generated words $\hat{y}_{<i}$, which largely alleviates the effect of error propagation, since no error is propagated from the previously generated words. As in the last section, we evaluate the quality of the left and right halves of the translation results generated by both the left-to-right and right-to-left models. The results are summarized in Table 3. For comparison, we also include the BLEU scores of normal translation (without teacher forcing). We have several findings from Table 3 as follows:
• Exposure bias exists. The accuracy of both left and right half tokens in normal translation is lower than that in teacher forcing, which feeds the ground-truth tokens as inputs. This demonstrates that feeding the previously generated tokens (which might be incorrect) in inference indeed hurts translation accuracy.
• Error propagation does exist.We find the error is accumulated along the sequential generation of the sentence.Taking En-Zh and the left-to-right NMT model as an example, the BLEU score improvement of the right half (the second half of the generation) of teacher forcing over normal translation is 2.64, which is much larger than the accuracy improvement of the left half (the first half of the generation): 1.70.Similarly, for En-Zh with the right-to-left NMT model, the BLEU score improvement of the left half (the second half of the generation) of teacher forcing over normal translation is 2.82, which is much larger than the accuracy improvement of the right half (the first half of the generation): 1.77.
• Other causes exist.Taking En-De translation with the left-to-right model as an example, the accuracy of the left half (9.43) is higher than that of the right half (8.36) when there is no error propagation with teacher forcing.Similar results can be found in other language pairs and models.This suggests that there must be some other causes leading to accuracy drop, which will be studied in the next section.
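The difference between normal decoding and teacher forcing at inference time can be illustrated with a toy example. This is only a sketch under stated assumptions: `toy_model` is a hypothetical bigram lookup table that stands in for a trained NMT decoder, and per-token accuracy stands in for BLEU. The table contains one wrong entry so that a single early mistake can be observed with and without propagation.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[str]], str]


def free_running_decode(next_token: NextToken, length: int) -> List[str]:
    """Normal inference: each step is conditioned on the model's own outputs."""
    output: List[str] = []
    for _ in range(length):
        output.append(next_token(output))
    return output


def teacher_forced_decode(next_token: NextToken, reference: Sequence[str]) -> List[str]:
    """Teacher forcing at inference: each step is conditioned on the gold prefix."""
    return [next_token(reference[:i]) for i in range(len(reference))]


def token_accuracy(prediction: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the prediction matches the reference."""
    return sum(p == r for p, r in zip(prediction, reference)) / len(reference)


# Hypothetical bigram "model": it wrongly predicts "x" after "a" (gold is "b").
TABLE = {"<s>": "a", "a": "x", "b": "c", "c": "d", "x": "y", "y": "z"}


def toy_model(prefix: Sequence[str]) -> str:
    previous = prefix[-1] if prefix else "<s>"
    return TABLE.get(previous, "<unk>")


reference = ["a", "b", "c", "d"]
free = free_running_decode(toy_model, len(reference))
forced = teacher_forced_decode(toy_model, reference)
print(free, token_accuracy(free, reference))
print(forced, token_accuracy(forced, reference))
```

In free-running mode the mistake at the second position corrupts every later prediction, whereas with teacher forcing the same mistake stays local. The gap between the two accuracies on the later positions is the kind of quantity that the ∆ column of Table 3 measures with BLEU.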

Language Branching Matters
Sections 3.1 and 3.2 together show that error propagation has an influence on accuracy drop but is not its only cause. We hypothesize that the language itself, i.e., its characteristics, may explain the phenomenon of accuracy drop. Watanabe and Sumita (2002) find that left-to-right decoding performs better for Japanese-English translation while right-to-left decoding performs better for English-Japanese translation. We apply the same analysis settings as in Sections 3.1 and 3.2 to an English-Japanese (En-Jp) translation dataset. More details about this dataset and model descriptions can be found in the supplementary materials (Sections A.1 and A.2).
Table 4 shows the BLEU scores on the En-Jp test set. It can be observed that regardless of decoding direction (i.e., left-to-right or right-to-left) and with or without teacher forcing, the accuracy of the right half is always higher than that of the left half. This observation on Japanese is opposite to that on English, German and Chinese in Sections 3.1 and 3.2, and motivates us to investigate the differences between these languages. We find that a linguistic concept, branching, can differentiate Japanese from other languages such as English and German. Branching refers to the shape of the parse trees that represent the structure of sentences (Berg et al., 2011; Payne, 2006). Usually, right-branching sentences are head-initial, which means the main subject of the sentence is described first and is followed by a sequence of modifiers that provide additional information about the subject. In contrast, left-branching sentences are head-final, putting such modifiers in the left part of the sentence (Payne, 2006).
English is a typical right-branching language, while Japanese is almost fully left-branching (Wikipedia, 2018). The two languages demonstrate opposite phenomena of accuracy drop, as shown in the previous experiments. When we say a language is typically left/right-branching, we mean that most of the sentences in this language follow the left/right-branching structure. While being predominantly right-branching, German is less conclusively so than English. Chinese features a mixture of head-final and head-initial structures: noun phrases are head-final, while sentences with strict head/complement ordering are head-initial, i.e., right-branching (Wikipedia, 2018). Chinese is thus right-branching, but less conclusively so than German.
We believe language branching is a main cause of accuracy drop. Intuitively, the main subject of a right-branching sentence is described first (in the left part) and is followed by additional modifiers (in the right part) (Berg et al., 2011). Therefore, the left half of a right-branching sentence is more likely to possess a clearer structure pattern and thus lead to higher generation accuracy than the right part, since the main subject is usually simpler and clearer than the modifiers that provide additional information about the subject. In the next section, we verify this intuition from a statistical perspective.

Correlation between Language Branching and Accuracy Drop
As previous work (Arpit et al., 2017) shows, neural networks easily learn and memorize simple patterns but have difficulty making correct predictions on noisy examples. In this section, we study languages with different branching from two aspects: the n-gram statistics of a target language, which have been used as a characterization of the hardness of learning (Bengio et al., 2009), and the dependency statistics in parse trees. We show that these statistics correlate well with the accuracy drop between the left half and the right half of translation results.

N-gram Statistics
Intuitively speaking, if a pattern occurs frequently and deterministically, it is easy for neural networks to learn. By comparing the general statistics on the n-gram frequency and n-gram conditional probability of the left and right half tokens, we link language branching to accuracy drop. Denote a bilingual dataset $D = \{(x_i, y_i)\}$, $i = 1, \cdots, M$, where each $y_i$ is a sequence of words. Let $F^l_{i,n}$ and $P^l_{i,n}$ denote the average n-gram frequency and n-gram conditional probability of the left half of $y_i$, i.e., the averages of $F(\cdot)$ and $P(\cdot)$ over the n-grams in that half, where $F(\cdot)$ and $P(\cdot)$ are the n-gram frequency and n-gram conditional probability calculated from the training dataset. Similarly, $F^r_{i,n}$ and $P^r_{i,n}$ denote the n-gram frequency and n-gram conditional probability of the right half.
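We calculate the average n-gram frequencies $F^l_n$ and $F^r_n$ of the left half and right half over all the target sentences in the training set. We also calculate the average n-gram conditional probabilities $P^l_n$ and $P^r_n$ over all the training sentences to compare the uncertainty of phrases in the left half and right half:
$$F^l_n = \frac{1}{M}\sum_{i=1}^{M} F^l_{i,n}, \quad F^r_n = \frac{1}{M}\sum_{i=1}^{M} F^r_{i,n}, \quad P^l_n = \frac{1}{M}\sum_{i=1}^{M} P^l_{i,n}, \quad P^r_n = \frac{1}{M}\sum_{i=1}^{M} P^r_{i,n}. \tag{2}$$
We also calculate the ratio of sentences in which the frequency/conditional probability of the left half is bigger/smaller than that of the right half, denoted as $RF^{l>r}_n$/$RF^{l<r}_n$ and $RP^{l>r}_n$/$RP^{l<r}_n$:
$$RF^{l>r}_n = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}\left(F^l_{i,n} > F^r_{i,n}\right), \quad RF^{l<r}_n = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}\left(F^l_{i,n} < F^r_{i,n}\right), \quad RP^{l>r}_n = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}\left(P^l_{i,n} > P^r_{i,n}\right), \quad RP^{l<r}_n = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}\left(P^l_{i,n} < P^r_{i,n}\right). \tag{3}$$
We choose n = 2 and 3 to calculate the metrics in Equations 2 and 3 on different translation datasets. The numbers are listed in Tables 5 and 6.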
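The statistics defined above can be computed along the following lines. This is a rough sketch rather than the exact procedure behind Tables 5 and 6, and several choices are assumptions: the function name `branching_stats` is hypothetical, $F(\cdot)$ is taken to be the raw n-gram count in the corpus, $P(\cdot)$ is the count of an n-gram divided by the count of all n-grams sharing its first n-1 words, sentences whose halves contain no n-gram are skipped when averaging, and the filter on small denominators is omitted.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple

Tokens = List[str]


def ngrams(tokens: Tokens, n: int) -> List[Tuple[str, ...]]:
    """All n-grams of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def branching_stats(corpus: List[Tokens], n: int) -> Dict[str, float]:
    # F(.): raw n-gram counts over the whole corpus (assumption).
    freq = Counter(g for sent in corpus for g in ngrams(sent, n))
    # Counts of the (n-1)-word contexts, used as denominators of P(.).
    context: Counter = Counter()
    for gram, count in freq.items():
        context[gram[:-1]] += count

    def cond_prob(gram: Tuple[str, ...]) -> float:
        return freq[gram] / context[gram[:-1]]

    f_left, f_right, p_left, p_right = [], [], [], []
    for sent in corpus:
        mid = len(sent) // 2
        left, right = ngrams(sent[:mid], n), ngrams(sent[mid:], n)
        if not left or not right:
            continue  # too short to contribute (assumption)
        f_left.append(mean(freq[g] for g in left))
        f_right.append(mean(freq[g] for g in right))
        p_left.append(mean(cond_prob(g) for g in left))
        p_right.append(mean(cond_prob(g) for g in right))

    total = len(corpus)  # ratios are taken over all sentences, so they can sum to < 1
    return {
        "F_l": mean(f_left), "F_r": mean(f_right),
        "P_l": mean(p_left), "P_r": mean(p_right),
        "RF_l>r": sum(l > r for l, r in zip(f_left, f_right)) / total,
        "RF_l<r": sum(l < r for l, r in zip(f_left, f_right)) / total,
        "RP_l>r": sum(l > r for l, r in zip(p_left, p_right)) / total,
        "RP_l<r": sum(l < r for l, r in zip(p_left, p_right)) / total,
    }


toy_corpus = ["a b c d".split(), "a b e f".split(), "a b".split()]
stats = branching_stats(toy_corpus, n=2)
print(stats)
```

In the toy corpus every contributing sentence starts with the frequent bigram "a b", so the left-half frequency exceeds the right-half frequency, while the two-word sentence is skipped and keeps the ratios from summing to 1.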
We can see that the 2/3-gram frequency as well as the conditional probability of the left half is higher than that of the right half for right-branching languages, including English, German and Chinese in the De-En, En-De and En-Zh translation datasets. For the left-branching language Japanese, the result is the opposite. The n-gram frequency and conditional probability statistics are consistent with our observations on accuracy drop in left/right-branching languages and verify our hypothesis: right-branching languages have clearer patterns in the left part (with larger n-gram frequency as well as conditional probability), which consequently lead to higher translation accuracy in the left part than in the right part; left-branching languages are the opposite.
We further visualize how the accuracy drop (between the left half and right half of the translations) correlates with the gap of n-gram statistics in the left and right part. The accuracy drop (e.g., BLEU score) of the left/right half is taken from teacher forcing with left-to-right decoding in Table 3, and the n-gram gap is taken from the ∆ in the last row of Tables 5 and 6. Figure 1 shows a strong correlation between accuracy drop and the gap of n-gram statistics: as the gap of n-gram statistics increases from negative values to positive values, the accuracy drop also increases from negative to positive.
In Table 6, the ratio is less than 1 for two reasons: (1) sentences with fewer than 4 words do not contribute to the statistics, and (2) we remove n-gram conditional probabilities whose denominator is less than 100 to make the probability calculation robust.

Dependency Statistics
In this subsection, we study language branching from the perspective of dependency structure. We hypothesize that if the left/right half of a sentence contains more dependencies among its own words, this half should be easier to predict, leading to higher accuracy. Here we analyze the English sentences in De-En translation and the Japanese sentences in En-Jp translation, since English is fully right-branching and Japanese is fully left-branching, as introduced before.
For English parsing, we utilize the well-acknowledged Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) to parse the sentences. After obtaining the parsing results, we split each sentence into left and right halves and separately count the number of dependencies in each half. For simplicity, we just count the number of dependencies, without considering dependency types; the detailed parsing formats can be found in the supplementary material (Section A.3). For Japanese, we leverage the open-source toolkit J.DepP to parse the sentences, and then count the number of dependencies in each half.
We provide the results in Table 7. As can be observed, for English sentences, the left-half words depend more on each other than the right-half words, while for Japanese sentences, the right-half words have more dependencies. This observation is consistent with our observations on accuracy drop, and can well explain the high accuracy of the left part in English translation and the right part in Japanese translation.
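The per-half dependency counting can be sketched as below. The code does not call the Stanford Parser or J.DepP; it assumes a parse is already available as a list of head indices (one per word, with -1 for the root), which is a hypothetical input format. It also assumes that a dependency belongs to a half only when both the dependent and its head lie in that half, and that an odd-length sentence gives its extra word to the right half.

```python
from typing import List, Tuple


def count_half_dependencies(heads: List[int]) -> Tuple[int, int]:
    """Count dependencies that lie entirely inside the left or the right half.

    heads[i] is the index of the head of word i, or -1 if word i is the root.
    Dependencies that cross the midpoint are not counted (assumption).
    """
    mid = len(heads) // 2
    left = right = 0
    for dependent, head in enumerate(heads):
        if head < 0:
            continue  # the root has no incoming dependency
        if dependent < mid and head < mid:
            left += 1
        elif dependent >= mid and head >= mid:
            right += 1
    return left, right


# Hand-written illustrative parse of "She quickly read the long book":
# She -> read, quickly -> read, read = root, the -> book, long -> book, book -> read
heads = [2, 2, -1, 5, 5, 2]
left_count, right_count = count_half_dependencies(heads)
print(left_count, right_count)
```

Summing these two counts over a corpus gives totals of the kind reported in Table 7; dependency types are ignored, as in the text.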

Extended Analyses and Discussions
We have analyzed the accuracy drop problem from the view of error propagation and of the language itself in previous sections. In this section, we provide extended analyses and several discussions to give a clearer understanding of the accuracy drop problem.

More Languages on Left-Branching
The previous analyses are based on four languages: three right-branching (En, De, Zh) and one left-branching (Jp). To avoid experimental bias and randomness, we provide one more translation task, English-Turkish (En-Tr) translation, as Turkish is a left-branching language. We simply calculate the BLEU score of the left/right half in left-to-right and right-to-left decoding, as in Sections 3.1 and 3.2. The result is provided in Table 8. For left-to-right decoding, the accuracy of the left half is higher than that of the right half in normal translation. However, the accuracy of the right half becomes higher with teacher forcing. This demonstrates that, once error propagation is removed, English-Turkish translation behaves similarly to English-Japanese translation: the accuracy of the right half is higher than that of the left half. Unlike Japanese, however, Turkish shows that in normal translation the influence of language branching can be weaker than that of error propagation.

Other Model Structures
One may wonder whether the results in the paper are biased towards a certain model structure, as we use Transformer in all the above analyses. To address such concerns, we conduct an additional experiment on the De-En translation task with an RNN (GRU)-based model. The results are shown in Table 9, and the observations are consistent with what we observed on Transformer. The accuracy of the left half of the De-En translation is always higher than that of the right half, in both left-to-right and right-to-left decoding.

Other Sequence Generation Tasks
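We conduct experimental analysis on abstractive summarization, which is also a sequence generation task. The goal of the task is to recap a long news sentence into a short summary. We use the Gigaword dataset, which contains 3.8M training pairs, 190k validation pairs and 2k test pairs of English sentences, and train an RNN-based model for sentence summarization. The accuracy is measured by the commonly used ROUGE F1 score and is reported in Table 10. We observe the same phenomenon as in translation tasks. The accuracy of the left half is always better than that of the right half, in both left-to-right and right-to-left decoding, since the target language English is a right-branching language.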

Conclusion
In this work, we studied the problem of accuracy drop between the left half and the right half of the results generated by neural machine translation models. We found that the influence of error propagation is overstated in the literature and that error propagation alone cannot explain accuracy drop. We showed that language branching correlates well with the accuracy drop problem, and the evidence from n-gram statistics as well as dependency statistics supports this correlation. Our discoveries suggest that left-to-right NMT models fit better for right-branching languages (e.g., English) and right-to-left NMT models fit better for left-branching languages (e.g., Japanese).
For future work, we will study more left/right-branching languages as well as other languages that have no obvious branching characteristics. We will also investigate how language branching influences other natural language tasks, especially for neural network based models.

Figure 1 :
Accuracy drop (the gap between the left/right BLEU score) with respect to $\Delta RF_3$ and $\Delta RP_3$ from Tables 5 and 6 in the four translation tasks. The x-axis values $\Delta RF_3$ and $\Delta RP_3$ represent the gap between the left and right ratio of the 3-gram frequency/conditional probability defined in Tables 5 and 6. The y-axis represents the accuracy drop in terms of BLEU score calculated with teacher forcing decoding.

Table 1 :
BLEU scores on the test set of the three translation tasks with both left-to-right and right-to-left decoding.

Table 2 :
BLEU scores of the left and right half of left-to-right and right-to-left NMT models. In (Liu et al., 2016a), the authors report the partial BLEU score without length penalty; our result is consistent with partial BLEU if the length penalty is simply removed when calculating BLEU.

Table 3 :
BLEU scores. "0" represents the translation results without teacher forcing during inference, and "1" represents the translation results with teacher forcing during inference. ∆ represents the BLEU score improvement of teacher forcing over normal translation.

Table 4 :
BLEU scores on the En-Jp test set. "0" represents the normal translation results, and "1" represents the teacher-forcing translation results.

Table 5 :
The n-gram frequency statistics on different translation datasets. $F^l_n$ and $F^r_n$ represent the average n-gram frequency of the left and right half of target sentences. $RF^{l>r}$ is less than 1 since sentences with fewer than 4 words do not contribute to the n-gram statistics.

Table 7 :
Number of dependencies in the left and right half of the English (De-En) and Japanese (En-Jp) training corpora. The numbers differ considerably since the two training corpora contain different training sentences.

Table 8 :
BLEU scores on the En-Tr test set with left-to-right generation. Normal translation is denoted as "0", and teacher-forcing translation is denoted as "1".

Table 9 :
BLEU scores of the left-to-right and right-to-left translations on the De-En test set, with the RNN-based model. "Full" means the BLEU score of the whole translated sentence.

Table 10 :
ROUGE F1 scores for left-to-right and right-to-left generated sentences in the abstractive summarization task. ROUGE-N stands for N-gram based ROUGE F1 score, and ROUGE-L stands for longest common subsequence based ROUGE F1 score. "Full" means the entire generated sentence.