Adversarial Subword Regularization for Robust Neural Machine Translation

Exposing diverse subword segmentations to neural machine translation (NMT) models often improves the robustness of machine translation, as NMT models can experience various subword candidates. However, the diversification of subword segmentations mostly relies on pre-trained subword language models, from which erroneous segmentations of unseen words are less likely to be sampled. In this paper, we present adversarial subword regularization (ADVSR) to study whether gradient signals during training can be a substitute criterion for exposing diverse subword segmentations. We experimentally show that our model-based adversarial samples effectively encourage NMT models to be less sensitive to segmentation errors and improve the performance of NMT models on low-resource and out-domain datasets.


Introduction
Subword segmentation is a method of segmenting an input sentence into a sequence of subword units (Sennrich et al., 2016; Wu et al., 2016; Kudo, 2018). Segmenting a word into a composition of subwords alleviates the out-of-vocabulary problem while keeping the encoded sequence compact. Due to its effectiveness in the open-vocabulary setting, the method has been applied to many NLP tasks including neural machine translation (NMT) (Gehring et al., 2017; Vaswani et al., 2017; Devlin et al., 2019; Yang et al., 2019).
Recently, Byte-Pair-Encoding (BPE) (Sennrich et al., 2016) has become one of the de facto subword segmentation methods. However, as BPE deterministically segments each word into subword units, NMT models with BPE always observe the same segmentation result for each word and often fail to learn diverse morphological features. In this regard, Kudo (2018) proposed subword regularization, a training method that exposes multiple segmentations using a unigram language model. Starting from machine translation, subword regularization has been shown to improve the robustness of NLP models in various tasks (Kim, 2019; Provilkov et al., 2019; Drexler and Glass, 2019; Müller et al., 2019).

Figure 1: NMT models suffer from typos (character drop, character replacement) in the source text due to the unseen subword compositions ('_' denotes segmentation). On the other hand, Ours correctly decodes them. Base: standard training, SR: subword regularization (Kudo, 2018).
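To make the contrast concrete, the following toy sketch enumerates the segmentation candidates of a single word under a hypothetical subword vocabulary and turns their unigram scores into a sampling distribution, mimicking the spirit of subword regularization. All vocabulary entries, log-probabilities, and the temperature value are invented for illustration; a real system would use a trained unigram LM such as SentencePiece.

```python
import math

def segmentations(word, vocab):
    """Enumerate all ways to split `word` into subwords from `vocab`."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                out.append([piece] + rest)
    return out

# Hypothetical unigram log-probabilities for illustration only.
logp = {"love": -1.0, "ly": -2.0, "lovely": -0.5, "l": -4.0, "o": -4.0,
        "v": -4.0, "e": -4.0, "y": -4.0}

cands = segmentations("lovely", logp)
# Score each candidate by the sum of its subwords' log-probabilities.
scores = [sum(logp[s] for s in c) for c in cands]

# A softmax over candidate scores gives a sampling distribution, as in
# subword regularization; a temperature alpha sharpens or flattens it.
alpha = 0.5
probs = [math.exp(alpha * s) for s in scores]
Z = sum(probs)
probs = [p / Z for p in probs]

# Deterministic tokenizers (e.g. plain BPE) always emit the 1-best
# candidate; sampling from `probs` instead exposes the alternatives.
best = cands[scores.index(max(scores))]
```

Sampling from `probs` during training would occasionally expose "love ly" or even character-level splits, whereas deterministic segmentation always yields `best`.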
However, subword regularization relies on unigram language models to sample candidates, and these language models are optimized on corpus-level statistics from the training data with no regard to the translation objective. This causes NMT models to experience a limited set of subword candidates that are frequently observed in the training data. Thus, NMT models trained with subword regularization can fail to infer the meaning of unseen words with unseen segmentations. This issue is particularly problematic for low-resource languages and noisy text, where many morphological variations are not present in the training data. The suboptimality of subword segmentation methods has also been raised in many prior works (Kreutzer and Sokolov, 2018; Wang et al., 2019b; Ataman et al., 2019; Salesky et al., 2020).
To tackle the problem of unigram language models, we search for a different sampling strategy that relies on gradient signals rather than corpus-level statistics and is oriented to the task objective. We adopt the adversarial training framework (Goodfellow et al., 2014; Miyato et al., 2016; Ebrahimi et al., 2017; Cheng et al., 2019) to search for subword segmentations that effectively regularize NMT models. Our proposed method, adversarial subword regularization (ADVSR), greedily searches for diverse yet adversarial subword segmentations that are likely to incur the highest translation loss. Our experiments show that NMT models trained with ADVSR improve over baseline NMT models by up to 3.2 BLEU on IWSLT datasets while outperforming the standard subword regularization method. We also highlight that NMT models trained with the proposed method are highly robust to character-level input noise.

Background
Subword Regularization. Subword regularization (Kudo, 2018) exposes multiple subword candidates during training via on-the-fly data sampling. The training method optimizes the parameter set θ with the marginal log-likelihood:

    L(θ) = Σ_{(X,Y)∈D} E_{x∼P_seg(x|X), y∼P_seg(y|Y)} [ log P(y | x; θ) ]    (1)

where x = (x_1, . . . , x_M) and y = (y_1, . . . , y_N) are segmentations (in subword units) sampled from a source sentence X and a target sentence Y through the subword-level unigram language model P_seg(·), and D denotes the training data. Generally, a single sample per epoch is used during training to approximate the expectation in Eq 1.
The probability of a tokenized output is obtained as the product of each subword's occurrence probability, where the subword occurrence probabilities are attained through the Bayesian EM algorithm (Dempster et al., 1977; Liang et al., 2007; Liang and Klein, 2009). The segmentation with maximum probability is acquired using the Viterbi algorithm (Viterbi, 1967).
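As a concrete sketch, the maximum-probability segmentation under a unigram LM can be computed with a simple Viterbi dynamic program over character positions. The vocabulary and log-probabilities below are invented for illustration:

```python
def viterbi_segment(word, logp):
    """Max-probability segmentation under a unigram LM via Viterbi DP."""
    NEG = float("-inf")
    best = [NEG] * (len(word) + 1)   # best[i]: best log-prob of word[:i]
    back = [0] * (len(word) + 1)     # back[i]: start of the last subword
    best[0] = 0.0
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Follow back-pointers to recover the segmentation.
    segs, i = [], len(word)
    while i > 0:
        segs.append(word[back[i]:i])
        i = back[i]
    return segs[::-1]

# Hypothetical unigram log-probabilities for illustration only.
logp = {"un": -2.0, "lock": -2.5, "unlock": -5.0, "able": -2.0,
        "unlockable": -9.0, "a": -5.0, "ble": -5.0}
result = viterbi_segment("unlockable", logp)  # -> ['un', 'lock', 'able']
```

Here "un" + "lock" + "able" (log-prob -6.5) beats both "unlock" + "able" (-7.0) and the single piece "unlockable" (-9.0), so Viterbi returns the three-piece split.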
Adversarial Regularization in NLP. Adversarial samples are constructed by corrupting the original input with a small perturbation that distorts the model output. Miyato et al. (2016) adopted the adversarial training framework for text classification, where input embeddings are perturbed with adversarial noise:

    ē_i = e_i + r̂_i    (2)
    r̂ = argmax_{r, ||r||_2 ≤ ε} ℓ(X, Y; e + r)    (3)

where e_i is the embedding vector of the i-th input unit obtained from the embedding matrix E, ē_i is its perturbed version, and ℓ(·) is the loss function computed with the input embeddings perturbed by noise r. Note that Miyato et al. (2016) use a word as the unit of x_i, unlike our definition. As it is computationally expensive to exactly estimate r̂ in Eq 3, Miyato et al. (2016) resort to the linear approximation method (Goodfellow et al., 2014), where r̂_i is approximated as:

    r̂_i = ε g_i / ||g_i||_2,  where g_i = ∇_{e_i} ℓ    (4)

Here ε indicates the degree of perturbation and g_i denotes the gradient of the loss function with respect to the i-th embedding vector. Moreover, Ebrahimi et al. (2017) extended the adversarial training framework to directly perturb the discrete input space, i.e., characters, through a first-order approximation using gradient signals.
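The linear approximation in Eq 4 is easy to sketch: the perturbation simply rescales the gradient to a fixed L2 norm ε. A minimal pure-Python version, with a toy gradient vector:

```python
import math

def linear_adv_perturbation(grad, eps=1.0):
    """Fast-gradient-style perturbation r = eps * g / ||g||_2
    (Goodfellow et al., 2014), applied per embedding vector as in
    Miyato et al. (2016)."""
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid div by zero
    return [eps * g / norm for g in grad]

g = [3.0, 4.0]   # toy gradient of the loss w.r.t. one embedding vector
r = linear_adv_perturbation(g, eps=0.5)
# r points in the gradient direction and has L2 norm eps: [0.3, 0.4]
```

Adding `r` to the embedding moves the input in the direction that locally increases the loss the most, subject to the norm budget ε.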

Approach
Relying on subword language models might bias NMT models toward frequent segmentations and hinder them from understanding diverse segmentations. This may harm the translation quality of NMT models when diverse morphological variations occur.
However, simply exposing diverse segmentations uniformly leads to a decrease in performance (Kudo, 2018). In this regard, we utilize gradient signals to expose diverse yet adversarial subword segmentation inputs that effectively regularize NMT models. Kreutzer and Sokolov (2018) proposed to jointly learn to segment and translate using a hierarchical RNN (Graves, 2016), but that method is not model-agnostic and is slow due to the increased sequence length of character-level inputs. In contrast, our method is model-agnostic and operates at the word level. Our method seeks adversarial segmentations on-the-fly, so at each training step the model chooses the subword candidates to which it is currently most vulnerable.

Problem Definition
Our method generates a sequence of subwords by greedily replacing each word's original segmentation with an adversarial one estimated from gradients. Given a source sentence X and a target sentence Y, we want to find the sequences of subwords x̂ and ŷ that incur the highest loss:

    (x̂, ŷ) = argmax_{x ∈ Ω(X), y ∈ Ω(Y)} ℓ(x, y; θ)

where Ω(X) and Ω(Y) denote all subword segmentation candidates of X and Y, and ℓ(·) denotes the loss function.
Our method operates on word units split by whitespace, each of which consists of a variable number of subwords. We first define the sequence of words in X as w = (w_1, . . . , w_M), where M denotes the length of the word-level sequence. We can then segment w_j as s^j = (s^j_1, . . . , s^j_K), the K subword units of the j-th word (note that we can now represent the input X as the sequence s = (s^1, . . . , s^M)). For example, for the j-th word "lovely", its tokenized outputs "love" and "ly" will be s^j_1 and s^j_2, respectively. We then define the embedding and the gradient of a word segmentation as the aggregation of the K subwords composing it:

    e(s^j) = f(e(s^j_1), . . . , e(s^j_K)),    g(s^j) = f(g_{s^j_1}, . . . , g_{s^j_K})

where e denotes the embedding lookup operation, d denotes the hidden dimension of the embeddings, and f aggregates K vectors into one; we simply use the element-wise average for f. Therefore, if the segmentation of a word changes, the corresponding embedding and gradient vectors change accordingly.
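The element-wise average used for f is straightforward; a minimal sketch with hypothetical 2-dimensional subword vectors for "love" and "ly":

```python
def aggregate(vectors):
    """Element-wise average of a word's K subword vectors (the function f
    in the paper); used for both embeddings and gradients."""
    K = len(vectors)
    return [sum(col) / K for col in zip(*vectors)]

# Hypothetical 2-d embeddings of the subwords of "lovely" = love + ly.
e_love, e_ly = [1.0, 2.0], [3.0, 6.0]
e_word = aggregate([e_love, e_ly])   # [2.0, 4.0]
```

Because the aggregate depends on which subwords compose the word, choosing a different segmentation of the same word yields a different word-level embedding and gradient, which is exactly what the adversarial search exploits.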

Adversarial Subword Regularization
As it is intractable to find the most adversarial sequence of subwords in a combinatorially large space, we approximately search for word-wise adversarial segmentation candidates. We seek the adversarial segmentation of the j-th word w_j of the sentence X by the following criterion, originally proposed by Ebrahimi et al. (2017) and applied to many other NLP tasks (Cheng et al., 2019; Wallace et al., 2019). More formally, we seek an adversarial segmentation ŝ^j of the j-th word w_j as

    ŝ^j = argmax_{s^j ∈ Ω(w_j)} (e(s^j) − e(s̄^j))^T g_{s̄^j}    (9)

where s^j is one of the tokenized outputs among the possible candidates Ω(w_j), obtained with the SentencePiece tokenizer (Kudo and Richardson, 2018), and s̄^j denotes the original deterministic segmentation of the j-th word. Note that for computing g_{s̄^j}, we use (x̄, ȳ), the original deterministic segmentation results. We apply L2 normalization to the gradient and embedding vectors. We uniformly select words in the sentence with probability R and replace them with adversarial subword compositions according to Eq 9. We perturb both the source and the target sequences. We summarize our method in Algorithm 1.
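The word-wise selection step can be sketched in a few lines: for each candidate segmentation, score the shift of its aggregated embedding away from the original segmentation against the gradient, and keep the highest-scoring candidate. The segmentations, embeddings, and gradient below are invented toy values, not outputs of a real NMT model:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pick_adversarial(candidates, e_orig, grad):
    """First-order criterion in the style of Ebrahimi et al. (2017):
    choose the candidate segmentation whose embedding shift best aligns
    with the loss gradient, i.e. argmax_s (e(s) - e(s_bar)) . g."""
    def score(e_cand):
        return dot([c - o for c, o in zip(e_cand, e_orig)], grad)
    return max(candidates, key=lambda cand: score(cand[1]))[0]

# Hypothetical aggregated embeddings for two candidate segmentations of
# "lovely", plus the gradient at the original segmentation "_lovely".
e_orig = [0.5, 0.5]                       # e for the original "_lovely"
cands = [("_love _ly", [1.0, 0.0]),
         ("_l _ovely", [0.0, 1.0])]
g = [1.0, -1.0]
adv = pick_adversarial(cands, e_orig, g)  # -> "_love _ly"
```

With these toy numbers, "_love _ly" scores (0.5, -0.5) · (1, -1) = 1.0 versus -1.0 for "_l _ovely", so it is selected as the adversarial segmentation.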

Datasets and Implementation Details
We conduct experiments on a low-resource multilingual dataset, IWSLT 2, where unseen morphological variations absent from the training data occur frequently. We also test NMT models on MTNT (Michel and Neubig, 2018), a testbed for evaluating NMT systems on noisy text, using the English-French language pair. Moreover, to evaluate robustness to typos, we generate synthetic test data with character-level noise from the IWSLT dataset.
2 http://iwslt.org/

For all experiments, we use Transformer-Base (Vaswani et al., 2017) as the backbone model (L=6, H=512) and follow the same regularization and optimization procedures. We train our models with a joint dictionary of size 16k. Our implementation is based on Fairseq (Ott et al., 2019). Further details on the experimental setup are described in Appendix A.2.

Evaluation
For inference, we use beam search with a beam size of 4. For evaluation, we use the checkpoint that performed best on the validation set. We evaluate translation quality with BLEU (Papineni et al., 2002) computed by SacreBLEU (Post, 2018). Our baselines are NMT models trained with deterministic segmentations (BASE) and models trained with subword regularization (SR) (Kudo, 2018). We set the hyperparameters of subword regularization to those of Kudo (2018).

Table 2 shows the main results on the IWSLT datasets. Our method significantly outperforms both BASE and SR. This shows that leveraging the translation loss to expose various segmentations is more effective than constraining NMT models to a limited set of segmentations. Specifically, ADVSR improves by 1.6 BLEU over SR and 3.2 BLEU over BASE on the Czech-to-English dataset. We attribute the large gains to the morphological richness of Czech. The improvement over the baselines can also be explained by robustness to unseen lexical variations, as shown in Appendix B.

Table 3 shows the results on the MTNT dataset, for which we use the NMT models trained in Section 5.1. We also experiment with domain-adaptive fine-tuning on the MTNT dataset (denoted as +FT).

Results on Out-Domain Dataset
Generally, exposing multiple subword candidates to NMT models yields superior performance in domain adaptation, which matches the finding of Müller et al. (2019). Above all, NMT models trained with our proposed method outperform BASE by up to 2.3 and SR by up to 0.9 BLEU scores.

Results on Synthetic Dataset
Additionally, we examine how translation quality changes under different noise ratios. Using IWSLT17 (FR ↔ EN), we synthetically generate three types of noise, (1) character drop, (2) character replacement, and (3) character insertion, and perturb each word with the given noise probability. Table 4 shows that as the noise fraction increases, our method remains robust relative to the baselines, improving over BASE by up to 10.4 and over SR by up to 7.1 BLEU scores.
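The three noise types can be sketched with a small per-character corruption routine; the alphabet, probability, and seed below are arbitrary choices for illustration, not the paper's exact generation script:

```python
import random

def corrupt(word, p, kind, rng):
    """Apply one character-level noise type to each position with prob p."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in word:
        if rng.random() < p:
            if kind == "drop":                    # 1. character drop
                continue
            if kind == "replace":                 # 2. character replacement
                out.append(rng.choice(alphabet))
                continue
            if kind == "insert":                  # 3. character insertion
                out.append(ch)
                out.append(rng.choice(alphabet))
                continue
        out.append(ch)
    return "".join(out)

rng = random.Random(0)
noisy = [corrupt("translation", 0.3, kind, rng)
         for kind in ("drop", "replace", "insert")]
```

Applying such perturbations to the source side of a clean test set produces test suites at controlled noise fractions, which is how robustness can be measured across increasing noise levels.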

Related Work
Subword segmentation has been widely used as a standard in the NMT community since Byte-Pair-Encoding (Sennrich et al., 2016) was proposed. Kudo (2018) introduced the subword regularization training method. Most recently, BPE-dropout (Provilkov et al., 2019) was introduced, which modifies the original BPE encoding process to enable stochastic segmentation. Our work shares with previous works the motivation of exposing diverse subword candidates to NMT models, but differs in that our method uses gradient signals. Other segmentation methods include WordPiece (Schuster and Nakajima, 2012) and a variable-length encoding scheme (Chitnis and DeNero, 2015). There is also another line of research that utilizes character-level segmentation (Luong and Manning, 2016; Lee et al., 2017; Cherry et al., 2018).
Other works explored generating synthetic or natural noise for regularizing NMT models (Belinkov and Bisk, 2018; Sperber et al., 2018; Karpukhin et al., 2019). Michel and Neubig (2018) introduced a dataset scraped from Reddit for testing NMT systems on noisy text. Recently, a shared task on building robust NMT models was held (Bérard et al., 2019).
Our method extends the adversarial training framework, which was initially developed in the vision domain (Goodfellow et al., 2014) and has recently been adopted in the NLP domain (Jia and Liang, 2017; Belinkov and Bisk, 2018; Samanta and Mehta, 2017; Miyato et al., 2016; Motoki Sato, 2019; Wang et al., 2019a; Cheng et al., 2019). Miyato et al. (2016) adopted the adversarial training framework for text classification by perturbing the embedding space with continuous adversarial noise. Cheng et al. (2019) introduced an adversarial training framework based on discrete word replacements, with candidates generated from a language model. In contrast, our method does not replace words themselves but their subword compositions.

Conclusions
In this study, we propose adversarial subword regularization, which samples subword segmentations that maximize the translation loss. Segmentations from a subword language model might bias NMT models toward frequent segmentations in the training set; our method instead regularizes NMT models to be invariant to unseen segmentations. Experimental results on low-resource and out-domain datasets demonstrate the effectiveness of our method.

B Sampled Translation Outputs
Table B.1: Excerpt from the translation results of the NMT models trained with different training methods. The presented samples demonstrate how our method infers the meaning of rarely appearing words' variations. Despite their low frequency, the NMT model trained with our method infers the meaning of morphosyntactic variations of observed words. This can be explained by the fact that our method encourages the NMT model to be segmentation-invariant, making it better at inferring meaning from unseen subword compositions.