Evaluating Robustness to Input Perturbations for Neural Machine Translation

Neural Machine Translation (NMT) models are sensitive to small perturbations in the input. Robustness to such perturbations is typically measured using translation quality metrics such as BLEU on the noisy input. This paper proposes additional metrics which measure the relative degradation and changes in translation when small perturbations are added to the input. We focus on a class of models employing subword regularization to address robustness and perform extensive evaluations of these models using the robustness measures proposed. Results show that our proposed metrics reveal a clear trend of improved robustness to perturbations when subword regularization methods are used.


Introduction
Recent work has pointed out the challenges in building robust neural network models (Goodfellow et al., 2015; Papernot et al., 2016). For Neural Machine Translation (NMT) in particular, it has been shown that NMT models are brittle to small perturbations in the input, whether these perturbations are synthetically created or generated to mimic real data noise (Belinkov and Bisk, 2018). Consider the example in Table 1, where an NMT model generates a worse translation as a consequence of only one character changing in the input.
Improving robustness in NMT has received a lot of attention lately, with data augmentation (Sperber et al., 2017; Belinkov and Bisk, 2018; Vaibhav et al., 2019; Liu et al., 2019; Karpukhin et al., 2019) and adversarial training methods (Cheng et al., 2018; Ebrahimi et al., 2018; Cheng et al., 2019; Michel et al., 2019) among the more popular approaches used to increase robustness in neural network models.
In this paper, we focus on one class of methods, subword regularization, which addresses NMT robustness without introducing any changes to the architectures or to the training regime, solely through dynamic segmentation of input into subwords (Kudo, 2018; Provilkov et al., 2019). We provide a comprehensive comparison of these methods on several language pairs and under different noise conditions on robustness-focused metrics.
Previous work has used translation quality measures such as BLEU on noisy input as an indicator of robustness. Absolute model performance on noisy input is important, and we believe this is an appropriate measure for noisy domain evaluation (Michel and Neubig, 2018; Berard et al., 2019; Li et al., 2019). However, it does not disentangle model quality from the relative degradation under added noise.
For this reason, we propose two additional measures for robustness which quantify the changes in translation when perturbations are added to the input. The first one measures relative changes in translation quality while the second one focuses on consistency in translation output irrespective of reference translations. Unlike the use of BLEU scores alone, the metrics introduced show clearer trends across all languages tested: NMT models are more robust to perturbations when subword regularization is employed. We also show that for the models used, changes in output strongly correlate with decreased quality and the consistency measure alone can be used as a robustness proxy in the absence of reference data.

arXiv:2005.00580v1 [cs.CL] 1 May 2020
Robustness is usually measured with respect to translation quality. Suppose an NMT model $M$ translates input $x$ to $y'$ and translates its perturbed version $x_\delta$ to $y'_\delta$. The translation quality (TQ) on these datasets is measured against reference translations $y$: $\mathrm{TQ}(y', y)$ and $\mathrm{TQ}(y'_\delta, y)$. TQ can be implemented as any quality measurement metric, such as BLEU (Papineni et al., 2002) or 1 minus TER (Snover et al., 2006).
Previous work has used TQ on perturbed or noisy input as an indicator of robustness. However, we argue that assessing models' performance relative to that of the original dataset is important as well in order to capture models' sensitivity to perturbations. Consider the following hypothetical example:
$M_1$: $\mathrm{BLEU}(y'_1, y) = 40$, $\mathrm{BLEU}(y'_{\delta 1}, y) = 38$;
$M_2$: $\mathrm{BLEU}(y'_2, y) = 37$, $\mathrm{BLEU}(y'_{\delta 2}, y) = 37$.
Selecting $M_1$ to translate noisy data alone is preferable, since $M_1$ outperforms $M_2$ (38 > 37). However, $M_1$'s quality degradation (40 → 38) reflects that it is in fact more sensitive to perturbation $\delta$ than $M_2$.
To this end, we use the ratio between $\mathrm{TQ}(y', y)$ and $\mathrm{TQ}(y'_\delta, y)$ to quantify an NMT model $M$'s invariance to specific data and perturbation, and define it as robustness:
$$\mathrm{ROBUST}(M \mid x, y, \delta) = \frac{\mathrm{TQ}(y'_\delta, y)}{\mathrm{TQ}(y', y)}.$$
When evaluating on the dataset $(x, y)$, $\mathrm{ROBUST}(M \mid x, y, \delta) < 1$ means the translation quality of $M$ is degraded under perturbation $\delta$; $\mathrm{ROBUST}(M \mid x, y, \delta) = 1$ indicates that $M$ is robust to perturbation $\delta$.
We opt for the ratio definition because it is on a [0, 1] scale, and it is easier to interpret than $\Delta\mathrm{TQ}$ since the latter needs to be interpreted in the context of the TQ score. High robustness can only be expected under low levels of noise, as it is not realistic for a model to recover from extreme perturbations.
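As a minimal sketch of the ratio defined above, the hypothetical two-model example (one model scoring 40 → 38 BLEU, the other 37 on both inputs) can be worked through in a few lines. The function and variable names are illustrative assumptions and do not come from the paper.

```python
def robustness(tq_noisy: float, tq_clean: float) -> float:
    """Ratio TQ(y'_delta, y) / TQ(y', y); values below 1 indicate degradation."""
    if tq_clean <= 0:
        raise ValueError("TQ on the original input must be positive")
    return tq_noisy / tq_clean


# Hypothetical BLEU scores from the example in the text: (original, perturbed).
scores = {"M1": (40.0, 38.0), "M2": (37.0, 37.0)}

robust = {name: robustness(noisy, clean) for name, (clean, noisy) in scores.items()}

# The model with the best absolute quality on noisy input ...
best_on_noisy = max(scores, key=lambda name: scores[name][1])
# ... is not necessarily the one that degrades least.
most_robust = max(robust, key=robust.get)

print(robust, best_on_noisy, most_robust)
```

Under these numbers the ratio is 38/40 = 0.95 for the first model and 37/37 = 1 for the second, so the two criteria pick different models, which is the distinction the ratio is meant to expose.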
Evaluation without References. Reference translations are not readily available in some cases, such as when evaluating on a new domain. Inspired by unsupervised consistency training (Xie et al., 2019), we test if translation consistency can be used to estimate robustness against noise perturbations. Specifically, a model is consistent under a perturbation $\delta$ if the two translations, $y'_\delta$ and $y'$, are similar to each other. Note that consistency is sufficient but not necessary for robustness: a good translation can be expressed in diverse ways, which leads to high robustness but low consistency.
We define consistency as $\mathrm{Sim}(y'_\delta, y')$. Sim can be any symmetric measure of similarity, and in this paper we opt for $\mathrm{Sim}(y'_\delta, y')$ to be the harmonic mean of $\mathrm{TQ}(y'_\delta, y')$ and $\mathrm{TQ}(y', y'_\delta)$, where TQ is BLEU between two outputs.
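The harmonic-mean construction can be sketched as follows. To keep the example self-contained, TQ is replaced by a toy clipped unigram precision; this stand-in is an assumption made purely for illustration and is not BLEU. In practice a BLEU implementation would be passed in as the `tq` argument. All names are hypothetical.

```python
from collections import Counter
from typing import Callable


def toy_tq(hypothesis: str, reference: str) -> float:
    """Stand-in for TQ: clipped unigram precision in [0, 1]. This is NOT BLEU."""
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hyp_tokens)
    matched = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return matched / len(hyp_tokens)


def consistency(y_noisy: str, y_clean: str,
                tq: Callable[[str, str], float] = toy_tq) -> float:
    """Harmonic mean of TQ(y'_delta, y') and TQ(y', y'_delta)."""
    forward = tq(y_noisy, y_clean)
    backward = tq(y_clean, y_noisy)
    if forward + backward == 0:
        return 0.0  # no overlap in either direction
    return 2 * forward * backward / (forward + backward)
```

Scoring in both directions and taking the harmonic mean makes the measure symmetric even when the underlying TQ metric is not, which is the property the definition of Sim requires.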
Experimental Set-Up
Recently, two datasets were built from user-generated content, MTNT (Michel and Neubig, 2018) and 4SQ (Berard et al., 2019). Following the convention, we also evaluate models directly on noisy MTNT (mtnt2019) and 4SQ test sets. We fine-tune baseline models with corresponding MTNT/4SQ training data, inheriting all hyper-parameters except the checkpoint interval, which is reset to 100 updates. Table 2 shows itemized training data statistics after pre-processing.
Perturbations. We investigate two frequently used types of perturbations and apply them to WMT and KTJ test data. The first is synthetic misspelling: each word is misspelled with a probability of 0.1, and the strategy is randomly chosen from single-character deletion, insertion, and substitution (Karpukhin et al., 2019). The second perturbation is letter case changing: each sentence is modified with a probability of 0.5, and the strategy is randomly chosen from upper-casing all letters, lower-casing all letters, and title-casing all words (Berard et al., 2019). Since we change the letter case in the test data, we always report case-insensitive BLEU with '13a' tokenization using sacreBLEU (Post, 2018). Japanese output is pre-segmented with Kytea before running sacreBLEU.
Model Variations. We focus on comparing different (stochastic) subword segmentation strategies: BPE (Sennrich et al., 2016), BPE-Dropout (Provilkov et al., 2019), and SentencePiece (Kudo, 2018). Subword regularization methods (i.e., BPE-Dropout and SentencePiece) generate various segmentations for the same word, so the resulting NMT model better learns the meaning of less frequent subwords and should be more robust to noise that yields unusual subword combinations, such as misspelling. We use them only in offline training data pre-processing steps, which requires no modification to the NMT model.
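The two synthetic perturbations described in this section can be sketched as below. This is an illustration under stated assumptions rather than the exact procedure used in the experiments: inserted and substituted characters are drawn from lowercase ASCII letters, one-character words are never deleted entirely, and whitespace tokenization is used. The function names are hypothetical.

```python
import random
import string

# Assumption: inserted/substituted characters come from lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def misspell_word(word: str, rng: random.Random) -> str:
    """Apply one random single-character insertion, substitution, or deletion."""
    ops = ["insert", "substitute"]
    if len(word) > 1:  # assumption: never delete a one-character word entirely
        ops.append("delete")
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    # Substitution may pick the same character, leaving the word unchanged.
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def misspell_sentence(sentence: str, rng: random.Random, p: float = 0.1) -> str:
    """Misspell each whitespace-separated word independently with probability p."""
    words = sentence.split()  # note: this normalizes runs of whitespace
    return " ".join(misspell_word(w, rng) if rng.random() < p else w for w in words)


def change_case(sentence: str, rng: random.Random, p: float = 0.5) -> str:
    """With probability p, upper-case, lower-case, or title-case the sentence."""
    if rng.random() >= p:
        return sentence
    transform = rng.choice([str.upper, str.lower, str.title])
    return transform(sentence)
```

Passing an explicit `random.Random` instance with a fixed seed keeps the perturbed test set reproducible, so that every model is evaluated on the same noisy input.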

Experimental Results
As shown in Table 3, there is no clear winner among the three subword segmentation models based on BLEU scores on the original WMT or KTJ test sets. This observation is different from results reported by Kudo (2018) and Provilkov et al. (2019). One major difference from previous work is the size of the training data, which is much larger in our experiments; subword regularization is presumably preferable in low-resource settings.
However, both our proposed metrics (i.e., robustness and consistency) show clear trends of models' robustness to input perturbations across all languages we tested: BPE-Dropout > SentencePiece > BPE.This suggests that although we did not observe a significant impact of subword regularization on generic translation quality, the robustness of the models is indeed improved drastically.
Unfortunately, it is unclear whether subword regularization helps when translating real-world noisy input, as shown in Table 4. MTNT and 4SQ contain several natural noise types, such as grammar errors and emojis, with misspelling as the dominant noise type for English and French. The training data we use may already cover common natural misspellings, perhaps contributing to the failure of regularization methods to improve over BPE in this case.
Robustness Versus Consistency. Variation in output is not necessarily in itself a marker of reduced translation quality, but empirically, consistency and robustness nearly always provide the same model rankings in Table 3. We conduct a more comprehensive analysis of the correlation between them, collecting additional data points by varying the noise level of both perturbations. Specifically, we use the following word misspelling probabilities: {0.05, 0.1, 0.15, 0.2} and the following sentence case-changing probability values: {0.3, 0.5, 0.7, 0.9}. As illustrated in Figure 1, consistency strongly correlates with robustness (sample Pearson's r = 0.91 to 0.98) within each language pair. This suggests that for this class of models, low consistency signals a drop in translation quality and the consistency score can be used as a robustness proxy when the reference translation is unavailable.
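The correlation analysis pairs each robustness score with the consistency score measured for the same model and noise setting, then computes the sample Pearson coefficient. A minimal sketch is given below; the numeric values are placeholders chosen only so that the example runs, and are not results from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT results from the paper):
# one (robustness, consistency) pair per model and noise level.
robustness_pct = [95.0, 90.0, 85.0, 80.0]
consistency_scores = [80.0, 70.0, 60.0, 50.0]

r = pearson_r(robustness_pct, consistency_scores)
print(f"Pearson's r = {r:.2f}")
```

Because the placeholder series are exactly linear in each other, the coefficient here is at its maximum; real measurements would fall below that.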
Robustness Versus Noise Level. In this paper, robustness is defined with respect to a fixed perturbation function and its noise level. We observe consistent model rankings across language pairs, but does this still hold if we vary the noise level?
To test this, we plot the robustness data points from the last section against the noise level. Focusing on the misspelling perturbation for EN→DE models, Figure 2 shows that varying the word misspelling probability does not change the ranking of the models, and the gap in the robustness measurement only increases with larger amounts of noise. This observation applies to all perturbations and language pairs we investigated.

Conclusion
We proposed two additional measures for NMT robustness which can be applied when both original and noisy inputs are available. These measure robustness as relative degradation in quality, as well as consistency, which quantifies variation in translation output irrespective of reference translations. We also tested two popular subword regularization techniques and their effect on overall performance and robustness. Our robustness metrics reveal a clear trend of subword regularization being much more robust to input perturbations than standard BPE. Furthermore, we identify a strong correlation between robustness and consistency in these models, indicating that consistency can be used to estimate robustness on data sets or domains lacking reference translations.

Figure 1: Robustness (in percentage) and consistency are highly correlated within each language pair. Correlation coefficients are marked in the legend.

Figure 2: Varying the synthetic word misspelling probability for EN→DE models does not change the model ranking w.r.t. robustness (in percentage).

Table 1: An example of NMT English translations for a Finnish input and its one-letter misspelled version.

Table 2: Statistics of various training data sets.
They provide naturally occurring noisy inputs and translations for EN↔FR and EN↔JA, thus enabling automatic evaluations. EN↔JA baseline models are trained and also tested with aggregated data provided by MTNT, i.e., KFTT+TED+JESC (KTJ). EN↔FR

Table 3: BLEU scores, robustness (in percentage), and consistency scores of different subword segmentation methods on original and perturbed test sets. We report mean and standard deviation using bootstrap resampling (Koehn, 2004). Subword regularization makes NMT models more robust to input perturbations.

Table 4: BLEU scores of different subword segmentation methods on two datasets with natural noise. Subword regularization methods do not achieve consistent improvement over BPE, either with or without fine-tuning.