Token Drop mechanism for Neural Machine Translation

Neural machine translation (NMT) models with millions of parameters are vulnerable to unfamiliar inputs. We propose Token Drop to improve generalization and avoid overfitting in NMT models. The method is similar to word dropout, except that we replace each dropped token with a special token instead of setting its word embedding to zero. We further introduce two self-supervised objectives: Replaced Token Detection and Dropped Token Prediction. Our method forces the model to generate the target translation from less information, so that it learns better textual representations. Experiments on Chinese-English and English-Romanian benchmarks demonstrate the effectiveness of our approach, and our model achieves significant improvements over a strong Transformer baseline.


Introduction
Neural machine translation (NMT) has achieved enormous success in advancing translation quality (Bahdanau et al., 2015; Vaswani et al., 2017; Gehring et al., 2017). In spite of this impressive performance, NMT models are still vulnerable to perturbations in the input sentences (Belinkov and Bisk, 2018; Cheng et al., 2019), i.e., a tiny perturbation can affect the hidden representation and lead to low-quality translations (Zhao et al., 2018). Moreover, NMT models commonly consist of millions of parameters, which makes them prone to overfitting, especially in low-resource scenarios.
A natural way to improve generalization is to synthesize natural noise (Karpukhin et al., 2019) or to adopt arbitrary noise (Cheng et al., 2018; Ebrahimi et al., 2018). Another way is to explore regularization techniques that avoid overfitting (Miceli et al., 2017), making the model robust to unseen or unfamiliar inputs. However, because text is discrete data, it is hard to retain its semantic information after corruption.
In this paper, we propose Token Drop to prevent overfitting and improve generalization. Unlike standard dropout (Srivastava et al., 2014), which randomly drops neurons in the network, we drop tokens of the input sentences. In order to retain semantic information, we replace the dropped tokens with a special symbol <unk>. This allows the model to learn hidden representations from the context of the remaining tokens and to predict the target translation conditioned on latent variables. On the one hand, our method exposes the model to exponentially many different sentences, which can be interpreted as data augmentation. On the other hand, our method corrupts the input sentences with natural noise, which can be seen as a regularization term for NMT.
We investigate two self-supervised objectives: Replaced Token Detection and Dropped Token Prediction. Because Token Drop regularizes the parameters by weakening the model inputs, NMT becomes well suited to self-supervised objectives. During training, we (1) use a discriminator to detect whether each input token has been dropped, and (2) leverage the hidden states to predict the original tokens at the dropped positions, inspired by the Cloze task (Devlin et al., 2019). Both objectives guide the model to generate semantically similar representations, leading to better generalization capacity.

Token Drop Training
Standard dropout prevents overfitting by setting input neurons or hidden neurons to zero with a certain probability p (Hinton et al., 2012; Srivastava et al., 2014). In contrast, we operate on the input sequences of machine translation models instead of the network's neurons, and we name this method Token Drop. Given the input token sequence of a sentence X = {x_1, x_2, ..., x_n}, we sample a mask m = {m_1, m_2, ..., m_n} in which each m_i is drawn independently with drop rate p. The token x_i in X is dropped if m_i is 1. This process is given in Equation 1:

$$ m_i \sim \mathrm{Bernoulli}(p), \qquad \tilde{x}_i = \begin{cases} \text{<unk>} & \text{if } m_i = 1 \\ x_i & \text{otherwise} \end{cases} \tag{1} $$

Token Drop can be interpreted as both a data augmentation and a regularization technique for NMT. Since NMT models commonly adopt an encoder-decoder architecture, our method drops tokens from both the source and the target inputs. On the source side, the encoder learns intermediate representations from exponentially many different incomplete sentences. On the target side, the decoder generates the target translation conditioned on latent variables, which weakens the constraint imposed by teacher forcing. Both sides receive incomplete information from their inputs, simulating the real situation (e.g., unknown or unfamiliar data) at test time.
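The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than our training code: the function name `token_drop`, the `rng` argument and the literal spelling of the replacement symbol are assumptions made for this example.

```python
import random

UNK = "<unk>"  # assumed spelling of the special symbol


def token_drop(tokens, p, rng=None, symbol=UNK):
    """Replace each token independently with `symbol` with probability p.

    Returns the corrupted sequence and the 0/1 drop mask (1 = dropped).
    """
    if rng is None:
        rng = random.Random()
    # One independent Bernoulli(p) draw per position.
    mask = [1 if rng.random() < p else 0 for _ in tokens]
    dropped = [symbol if m == 1 else tok for tok, m in zip(tokens, mask)]
    return dropped, mask
```

The returned mask is kept alongside the corrupted sequence because it can later serve as the self-supervised label for auxiliary objectives.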

Token Drop methods
We adopt three drop strategies for Token Drop.
Zero-Out was introduced by Sennrich et al. (2016a). Unlike standard dropout, this method drops full words by setting their word embeddings to zero during training. Its deficiency is that a zero vector cannot learn a representation from its context in the self-attention layer.
Drop-Tag (Kågebäck and Salomonsson, 2016) replaces a token with a <dropped> tag. The tag is subsequently treated just like any other word in the vocabulary and has a corresponding word embedding that is trained. We adopt this technique for NMT to learn better feature representations.
Unk-Tag replaces a token with the generic unknown-word token <unk>. Bowman et al. (2016) and Yang et al. (2017) apply it to RNN decoders to force the model to make predictions from the latent variable. We found that it suits NMT systems well, especially in self-attention layers. Unlike Drop-Tag, it requires neither an extra token nor additional parameters.
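The difference between the three strategies is easiest to see at the embedding lookup. The sketch below uses plain Python lists in place of a real embedding layer; the function name `apply_strategy`, the strategy labels and the id arguments are hypothetical and chosen only for this illustration.

```python
def apply_strategy(token_ids, mask, embedding, strategy,
                   unk_id=None, dropped_id=None):
    """Return the input vectors after applying one drop strategy.

    embedding:  list of rows, one vector per vocabulary entry
    mask:       0/1 list, 1 marks a dropped position
    unk_id:     row of the existing <unk> token (Unk-Tag)
    dropped_id: row of an extra <dropped> token (Drop-Tag); this row is an
                additional trained parameter, which Unk-Tag avoids
    """
    dim = len(embedding[0])
    vectors = []
    for tok, m in zip(token_ids, mask):
        if m == 0:
            vectors.append(list(embedding[tok]))         # token is kept
        elif strategy == "zero_out":
            vectors.append([0.0] * dim)                   # embedding set to zero
        elif strategy == "drop_tag":
            vectors.append(list(embedding[dropped_id]))   # extra tag embedding
        elif strategy == "unk_tag":
            vectors.append(list(embedding[unk_id]))       # reuse <unk> embedding
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return vectors
```

Only Drop-Tag needs a vocabulary row that would not otherwise exist, which is the practical reason Unk-Tag adds no parameters.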

Replaced Token Detection
We propose the Replaced Token Detection (RTD) task to promote the generalization ability of the model encoder. We regard the drop information as a self-supervised label and, following Clark et al. (2020), train a discriminator D(G(x)) to detect whether tokens are dropped or not. Because our dropped tokens are easy to distinguish, we use a simple linear classifier to detect replaced tokens. The objective is a token-level binary classification loss, where d(x) denotes the dropped tokens. In our model, the encoder serves as a generator G, which generates the hidden states of the input tokens. The discriminator D tries to distinguish whether a token is dropped or not, while the generator G has to produce similar representations for x and x̃, making the model robust to noisy and unknown inputs.
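As an illustration of how such a detector could be scored, the sketch below applies a linear classifier with a sigmoid to each encoder hidden state and averages the binary cross-entropy against the drop mask. The use of binary cross-entropy, the averaging over positions and the names `rtd_loss`, `weight` and `bias` are assumptions made for this example, in the spirit of Clark et al. (2020), and not necessarily the exact objective used in our experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rtd_loss(hidden_states, mask, weight, bias):
    """Mean binary cross-entropy of a linear detector over encoder states.

    hidden_states: one vector per input position (output of the encoder G)
    mask:          0/1 labels, 1 marks a dropped position
    weight, bias:  parameters of the linear classifier D (assumed shape)
    """
    if not mask:
        return 0.0
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for h, m in zip(hidden_states, mask):
        logit = sum(w * x for w, x in zip(weight, h)) + bias
        prob = sigmoid(logit)  # predicted probability that the token was dropped
        total -= m * math.log(prob + eps) + (1 - m) * math.log(1.0 - prob + eps)
    return total / len(mask)
```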

Dropped Token Prediction
Our Token Drop model randomly replaces tokens of the input sentence, much like the Masked Language Model (Devlin et al., 2019), which masks tokens and then predicts them from the remaining tokens, making use of contextual information. Accordingly, we propose Dropped Token Prediction (DTP), which predicts dropped tokens from their hidden states. The DTP objective maximizes the likelihood of the dropped tokens d(x) given the remaining tokens X \ d(x), where G is the model encoder and E(·) is the prediction layer. In our implementation of DTP, we adopt weight tying (Press and Wolf, 2017), that is, we share the same weight matrix between the embedding layer and the token prediction classifier. Finally, we train our model jointly with the DTP and RTD objectives.
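The effect of weight tying in DTP can be sketched as follows: the logits for a dropped position are the dot products between its hidden state and every row of the embedding matrix, so no separate output matrix is needed. The softmax cross-entropy form, the averaging over dropped positions and the name `dtp_loss` are assumptions made for this illustration.

```python
import math


def dtp_loss(hidden_states, original_ids, mask, embedding):
    """Mean cross-entropy for predicting the original token at dropped positions.

    The prediction layer reuses `embedding` as its weight matrix (weight tying),
    so the hidden size must equal the embedding size.
    """
    losses = []
    for h, tok, m in zip(hidden_states, original_ids, mask):
        if m == 0:
            continue  # only dropped positions contribute to the loss
        logits = [sum(e * x for e, x in zip(row, h)) for row in embedding]
        top = max(logits)  # subtract the maximum for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[tok])  # negative log-probability of `tok`
    return sum(losses) / len(losses) if losses else 0.0
```

In a joint setup this term would be added to the translation loss together with the detection loss; the relative weights of the terms are not specified here.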

Experiment
We evaluate our approach on two machine translation benchmarks: LDC (ZH-EN) and WMT16 (EN-RO).

Dataset and Evaluation
For  (Sennrich et al., 2016b). We use newstest-2016 as the test set and report tokenized BLEU scores.

Models and Settings
We adopt the Transformer model (Vaswani et al., 2017) implemented in PyTorch in the fairseq-py toolkit (Ott et al., 2019). We closely followed the settings of Vaswani et al. (2017).
The results of our experiments on the NIST Chinese-English and WMT16 English-Romanian tasks are shown in Table 1. We first apply Token Drop with the three drop methods (Zero-Out, Drop-Tag, Unk-Tag); the results show that the Token Drop models significantly outperform the baseline on both language pairs. Furthermore, we combine the Unk-Tag method with the DTP and RTD training objectives; the results show that both DTP and RTD provide a further improvement over Token Drop training. Overall, we obtain gains of 2.37, 1.15 and 1.73 BLEU on the three tasks respectively.
To demonstrate the generalization capacity of our model in realistic conditions, we constrain the input information by replacing words with the generic unknown symbol <unk>. For each sentence, we generate 100 noisy sentences and report the average BLEU score. We also compare our method with Cheng et al. (2019), who introduced white-box adversarial noisy inputs to improve robustness. Table 2 reports our results on incomplete inputs, from which we can see that our approach outperforms the previous method. This improvement confirms that our model has a larger generalization capacity than the baseline.
From Figure 1 we can see that training with a moderate drop rate p improves performance significantly. Drop-Tag and Unk-Tag perform similarly, and both outperform the Zero-Out method. We also plot the learning curves of the baseline model and our Token Drop model. Figure 2 shows that, as the number of training iterations increases, our model achieves lower and more stable perplexity than the baseline, demonstrating the effectiveness of our approach in preventing overfitting and improving translation quality.
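The evaluation on incomplete inputs described above can be sketched as a simple loop. This is one possible reading of the protocol: each of the noisy copies of the test set is translated and scored, and the scores are averaged. The callables `translate` and `score` are hypothetical placeholders for a trained model and a corpus-level metric such as BLEU, and the noise rate `p` is left as a parameter because its value is not stated here.

```python
import random


def average_noisy_score(sources, references, translate, score,
                        p, n_copies=100, seed=0, symbol="<unk>"):
    """Average a corpus-level score over `n_copies` noisy versions of the inputs.

    sources:    list of tokenized source sentences
    references: list of reference translations, aligned with `sources`
    translate:  callable mapping one tokenized source to a hypothesis (placeholder)
    score:      callable mapping (hypotheses, references) to a float (placeholder)
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_copies):
        # Corrupt every sentence by replacing each word with `symbol` at rate p.
        noisy = [[symbol if rng.random() < p else tok for tok in src]
                 for src in sources]
        hypotheses = [translate(src) for src in noisy]
        scores.append(score(hypotheses, references))
    return sum(scores) / len(scores)
```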

Related Work
Word Dropout. Iyyer et al. (2015) proposed word dropout as a feature extractor for text classification. Bowman et al. (2016) and Xie et al. (2017) applied word dropout to RNN decoders, where it can be regarded as a smoothing technique. For machine translation, Sennrich et al. (2016a) randomly set words of the input sentence to zero to prevent overfitting, achieving considerable improvements on noisy datasets. In this paper, we explain word dropout as both data augmentation (i.e., it exposes the model to exponentially many different sentences) and a regularization technique (i.e., it weakens the encoder and decoder inputs, yielding better intermediate representations).

Conclusion
In this paper, we have proposed the Token Drop mechanism for neural machine translation. Inspired by self-supervised learning, we introduced the Replaced Token Detection and Dropped Token Prediction training objectives. We found that an NMT model trained with Token Drop gains larger generalization capacity and overfits less. Even without prior knowledge or additional parameters, our approach achieves convincing results on neural machine translation. In future work, we plan to investigate the impact of dropping different kinds of words, e.g., by word importance and word type.