Robust Machine Translation with Domain Sensitive Pseudo-Sources: Baidu-OSU WMT19 MT Robustness Shared Task System Report

This paper describes the machine translation system developed jointly by Baidu Research and Oregon State University for the WMT 2019 Machine Translation Robustness Shared Task. Translation of social media is a very challenging problem, since its style is very different from that of normal parallel corpora (e.g. News) and it also includes various types of noise. To make matters worse, the amount of social media parallel data is extremely limited. In this paper, we use a domain-sensitive training method which leverages a large amount of parallel data from popular domains together with a small amount of parallel data from social media. Furthermore, we generate a parallel dataset with pseudo noisy source sentences, which are back-translated from monolingual data using a model trained in a similar domain-sensitive way. In this way, we achieve more than 10 BLEU improvement in both En-Fr and Fr-En translation compared with the baseline methods.


Introduction
Translation of social media is very challenging. First, there are various types of noise, such as abbreviations, spelling errors, obfuscated profanities, inconsistent capitalization, Internet slang and emojis (Michel and Neubig, 2018). Second, the amount of parallel data is limited. These characteristics of social media make existing neural machine translation systems extremely vulnerable.
The noise issue of social media has been investigated in some previous work (Baldwin et al., 2013; Eisenstein, 2013).
Most recently, Belinkov and Bisk (2017) demonstrated the vulnerability of neural machine translation systems to both synthetic and natural noise. However, the noise tested by Belinkov and Bisk (2017) is not real social media noise. To the best of our knowledge, there seems to be a lack of translation methods that systematically target noise in social media.
* Equal contribution
Existing neural machine translation systems are known for being data-hungry. However, the amount of parallel data in the social media domain is very limited. Only recently, a dataset collected from Reddit was published and attracted a lot of attention (Michel and Neubig, 2018). The amount of data in this dataset is still very small compared to the large amount of data available in the News domain. Naturally, how to utilize the large amount of parallel data from the News domain becomes a central problem in improving the translation of social media.
In this paper, inspired by the success of the back-translation technique (Sennrich et al., 2015a), we propose to learn a model that generates "social-media-style" translations in the source language from clean sentences in the target language. Since the amount of parallel data in the social media domain is limited, we utilize the large amount of parallel data in the News domain to help the training. With this model, a large amount of parallel data for back-translation can be generated from monolingual data in the target language. In the final translation model, a special "domain" symbol is added to indicate which domain the source sentence belongs to.
The contributions of this paper are multifold; the most important ones are highlighted below:
1. We found that "social-media-style" sentences can be generated by training a translation model with different "start-of-sentence" symbols on the decoder side for sentences from different domains. The model is trained with data from all domains, especially the News domain, which has a large amount of parallel data, yet it also adapts to the style of the social media domain, even though the amount of social media parallel data is limited. As demonstrated by our experiments, generating "social-media-style" sentences is crucial to the effectiveness of back-translation for training a model suited to translating social media.
2. We showed that adding a domain symbol to the source sentence improves the robustness of the model. This may be because the encoder learns some domain-specific features from input sentences.

Methods
Noisy text translation suffers from a shortage of in-domain training data. In this section, we present approaches that leverage a large amount of out-of-domain (e.g. News) data, as well as monolingual data paired with pseudo noisy source data obtained by back-translation.

Domain Sensitive Data Mixing
To improve the translation model despite limited parallel data, we want to make use of a larger amount of out-of-domain data. However, simply mixing the clean and noisy data makes the training set unbalanced. To differentiate data from different domains, we use a different start symbol on the source side for each domain. The intuition behind injecting the domain label on the source side comes from the noise occurrence statistics of Michel and Neubig (2018), which show many more spelling and grammar errors on the source side of the noisy text translation dataset. The clean and noisy start symbols thus serve as a meaningful signal of source text style for the encoder. Compared with the source-side sentences, the human translations on the target side are less noisy, with fewer spelling and grammar errors.
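The source-side start symbols described above can be sketched as follows. This is a minimal illustration rather than the system's actual preprocessing: the symbol strings "<clean>" and "<noisy>", the function names tag_source and mix_corpora, and the example sentences are all assumptions made for this sketch.

```python
import random

# Hypothetical start symbols; the exact strings used in the system may differ.
DOMAIN_SYMBOLS = {"clean": "<clean>", "noisy": "<noisy>"}


def tag_source(sentence, domain):
    """Prepend the domain start symbol to a tokenized source sentence."""
    return f"{DOMAIN_SYMBOLS[domain]} {sentence}"


def mix_corpora(clean_pairs, noisy_pairs, seed=0):
    """Tag the source side of each (source, target) pair and shuffle the union."""
    mixed = [(tag_source(src, "clean"), tgt) for src, tgt in clean_pairs]
    mixed += [(tag_source(src, "noisy"), tgt) for src, tgt in noisy_pairs]
    random.Random(seed).shuffle(mixed)
    return mixed


# Invented example pairs, for illustration only.
clean = [("The committee approved the budget .", "Le comité a approuvé le budget .")]
noisy = [("omg thx so much !!", "oh mon dieu merci beaucoup !!")]
for src, tgt in mix_corpora(clean, noisy):
    print(src, "|||", tgt)
```

Only the source side is tagged here, so the target side of the training data is left unchanged. At test time, social media input would presumably be tagged with the noisy symbol.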

Noisy Pseudo-Sources Generation with Back-Translation
To make further use of monolingual data, we regard it as target data and generate its corresponding source data by back-translation (Sennrich et al., 2015a). However, unlike Sennrich et al. (2015a), who apply this method when both source and target sentences are clean, the source-side sentences in our test set are much noisier than the target side (as mentioned in the previous subsection). Therefore, we reverse the source and target sentences, so that the noisy source sentences become the target and the cleaner target sentences become the source. For example, to generate noisy pseudo French source sentences for English monolingual data, we train an En-Fr translation model which takes the noisy French source sentences in the Fr-En noisy dataset as target, and the corresponding parallel English target sentences as source.
In this way, the model learns how to inject noise into the target side. Similar to the previous domain-sensitive method, we include out-of-domain clean data when training this noisy translation model and differentiate the two kinds of data by a different start symbol on the target side.
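The reversal of the data and the target-side start symbols can be sketched as below. This is an illustration under stated assumptions, not the system's code: the symbol strings "<clean_s>" and "<noisy_s>" are one possible rendering of the symbols, and the helper names build_generator_data and strip_start_symbol are hypothetical.

```python
# Hypothetical target-side start symbols; the exact strings are an assumption.
CLEAN_START = "<clean_s>"
NOISY_START = "<noisy_s>"


def build_generator_data(noisy_pairs, clean_pairs):
    """Build training pairs for the noise-generating (reversed) model.

    noisy_pairs: (noisy_source, target) pairs from the noisy dataset of the
                 direction we ultimately want to translate (e.g. Fr-En).
    clean_pairs: (source, target) pairs of out-of-domain clean data in the
                 same direction.
    Returns (input, output) pairs in the reversed direction, where each
    output begins with the start symbol of its domain.
    """
    data = []
    for src, tgt in noisy_pairs:
        data.append((tgt, f"{NOISY_START} {src}"))
    for src, tgt in clean_pairs:
        data.append((tgt, f"{CLEAN_START} {src}"))
    return data


def strip_start_symbol(generated):
    """Remove a leading domain symbol from a generated pseudo-source."""
    first, _, rest = generated.partition(" ")
    return rest if first in (CLEAN_START, NOISY_START) else generated


# Invented example pairs, for illustration only.
pairs = build_generator_data(
    noisy_pairs=[("slt cv ?", "hi how are you ?")],
    clean_pairs=[("Bonjour .", "Hello .")],
)
for model_input, model_output in pairs:
    print(model_input, "->", model_output)
```

To produce pseudo noisy sources from monolingual data, decoding would presumably be forced to begin with the noisy start symbol, and the symbol would be stripped before pairing the output with its monolingual sentence.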

Ensemble
In our experiments with a relatively small training dataset, the translation quality of models with different initializations can vary notably. To stabilize performance and improve translation quality, we ensemble different models during decoding.
To ensemble, we take the average of all model outputs:

$$\hat{y}_t = \frac{1}{N} \sum_{i=1}^{N} \hat{y}^i_t$$

where $\hat{y}^i_t$ denotes the output distribution of the $i$-th model at position $t$ and $N$ is the number of models. Similar to Zhou et al. (2017) and Zheng et al. (2018c), we can ensemble models trained with different architectures and training algorithms.
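The averaging step can be sketched in a few lines. This is a toy illustration that assumes an arithmetic mean over per-position probability distributions sharing one vocabulary; the function name ensemble_average and the numbers are invented for the example.

```python
def ensemble_average(distributions):
    """Average next-token distributions from several models at one position.

    distributions: list of N lists, each a probability distribution over the
    same vocabulary (same length, same token order).
    """
    n_models = len(distributions)
    vocab_size = len(distributions[0])
    if any(len(d) != vocab_size for d in distributions):
        raise ValueError("all models must share one vocabulary")
    return [sum(d[k] for d in distributions) / n_models for k in range(vocab_size)]


# Toy example: three models, vocabulary of four tokens.
model_outputs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
]
averaged = ensemble_average(model_outputs)
# Index of the most probable token under the averaged distribution.
best_token = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, best_token)
```

During beam search, the same averaging would be applied at every decoding step before candidates are scored, so that all models advance on the same partial hypotheses.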

Experiments
To investigate the empirical performance of our proposed methods, we conduct experiments on the MTNT dataset (Michel and Neubig, 2018) using the Transformer (Vaswani et al., 2017).
We first apply BPE (Sennrich et al., 2015b) to both the source and target sides in order to reduce the vocabulary size. We then exclude sentence pairs that are longer than 256 words or subwords. We use length reward to find the optimal target length.
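The length-based exclusion can be sketched as follows. This is a minimal illustration that assumes the limit is checked on both sides of a pair and that each sentence is already a whitespace-separated sequence of words or BPE subwords; the function names are hypothetical.

```python
def within_length_limit(src_tokens, tgt_tokens, max_len=256):
    """Return True when both sides have at most max_len tokens."""
    return len(src_tokens) <= max_len and len(tgt_tokens) <= max_len


def filter_pairs(pairs, max_len=256):
    """Keep only (source, target) pairs whose two sides pass the length limit.

    Each side is a whitespace-separated string of words or BPE subwords.
    """
    kept = []
    for src, tgt in pairs:
        if within_length_limit(src.split(), tgt.split(), max_len):
            kept.append((src, tgt))
    return kept


# Toy corpus: the second pair has a 257-token source and is dropped.
corpus = [
    ("a short sen@@ tence", "une phrase courte"),
    (" ".join(["tok"] * 257), "trop long"),
]
print(len(filter_pairs(corpus)))
```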
Our implementation is adapted from the PyTorch-based OpenNMT (Klein et al., 2017). Our Transformer's parameters are the same as the base model's settings in the original paper (Vaswani et al., 2017). In all experiments, our evaluation uses sacreBLEU¹, a standardized BLEU score evaluation tool by Post (2018). We specify the intl tokenization option during BLEU evaluation. We also use the detokenization and normalization tools in Moses. Tables 1 and 2 show statistics of the En2Fr and Fr2En datasets.
For both the En-Fr and Fr-En datasets, the clean parallel data is from the WMT15 news translation task. The noisy data, collected from a social network, is from Michel and Neubig (2018). In addition to the French and English monolingual data from the WMT15 news translation task, we also make use of the English portion of the parallel data from KFTT, TED and JESC used in Michel and Neubig (2018).

Noisy Data Generation
To make use of monolingual target data, we want to generate the corresponding parallel pseudo noisy source data and add it to the training set. Table 3 shows the performance of our noisy data generation models. In this experiment, we mix the clean and noisy datasets as the training set, but take the noisy dataset of the reverse direction (training, validation, and test sets) and use its target sentences as source and its source sentences as target. The domain-insensitive method simply mixes the clean and noisy datasets in training, while the domain-sensitive method differentiates the clean and noisy datasets on the target side by starting with a different symbol (e.g. < clean s >, < noisy s >). The experiment shows that the domain-sensitive method outperforms the domain-insensitive method by a large margin.

Table 4 shows the final results of different methods on the test set. As in the previous experiments, the domain-insensitive methods simply mix all the clean and noisy training data. Adding the monolingual data paired with the pseudo-source data generated by the model trained in the previous experiment yields a small improvement in En-Fr. To differentiate the clean and noisy datasets, we assign a different label at the start of each, which boosts performance by about 3 to 4 BLEU points. We further generate pseudo noisy source data from the monolingual target data with the domain-sensitive model from the previous experiment. Adding this noisy back-translation data gives more than 2 BLEU improvement.

Our final submission ensembles 5 models trained with the domain-sensitive method and including the noisy back-translation data. Table 5 and Table 6 show the final results of our submission in Fr-En and En-Fr. Our system ranks third in both directions. Table 7 shows the human judgments over all submitted systems, produced by Li et al. (2019), who also analyze and discuss all submitted systems.

¹ https://github.com/mjpost/sacreBLEU

Related Work
The method proposed in this paper is a kind of domain adaptation technique. There is much previous work on domain adaptation for machine translation (Britz et al., 2017; Wang et al., 2017; Chu et al., 2017; Chu and Wang, 2018), which leverages out-of-domain parallel corpora and in-domain monolingual corpora to improve translation. The difference between our method and previous work is that we use back-translation (Sennrich et al., 2015a) for domain adaptation. Unlike some previous work using adversarial training [...] (Zheng et al., 2018a) to differentiate multiple tasks, we simply assign a different starting symbol to each task (Lample et al., 2018).
A similar method was proposed by Xie et al. (2018) in the context of grammar correction, where a model is trained to add noise to original sentences to produce noisy sentences. However, instead of learning how to generate arbitrary "noise", our goal is to learn "social-media-style" translations. Singh et al. (2019) inject artificial noise into the clean data according to the distribution of noisy data. Liu et al. (2019a) propose to leverage phonetic information to reduce the noise in data.
Another group of work related to this paper is data augmentation in machine translation. Although data augmentation is very popular in general learning tasks, such as image processing, it is non-trivial in machine translation because even slight modifications of a sentence can make a huge difference in semantics. To the best of our knowledge, there are two categories of successful data augmentation approaches for machine translation. The first is based on back-translation (Sennrich et al., 2015a), which augments the training set with monolingual data. The second is based on word replacement, such as (Sennrich et al., 2016). Zheng et al. (2018b) make use of multiple references, generate even more pseudo-references, and achieve improvements in both machine translation and image captioning.

Conclusions and Future Work
In this paper, we proposed a method to improve the translation of social media. The style of social media is distinctive and very different from the style of the widely researched News domain. The core of our method is to generate useful parallel data for back-translation, that is, synthetic in-domain parallel data. To achieve this goal, we proposed a method to generate "social-media-style" source sentences from monolingual target sentences. We also distinguish the domain of source sentences by inserting a domain symbol into them. Both techniques proved highly effective for translating social media in our experiments. Finally, we used ensembling to further boost translation performance.
The noise in social media is mostly introduced by human mistakes. In other cases, noise on the source side is introduced by systems, such as ASR in speech-to-text translation (Liu et al., 2019b). We plan to further investigate this domain-sensitive method on these tasks, and even on speech-to-text simultaneous translation (Ma et al., 2018; Zheng et al., 2019).