NTT’s Machine Translation Systems for WMT19 Robustness Task

This paper describes NTT’s submission to the WMT19 robustness task. This task mainly focuses on translating noisy text (e.g., posts on Twitter), which presents different difficulties from typical translation tasks such as news. Our submission combined techniques including utilization of a synthetic corpus, domain adaptation, and a placeholder mechanism, which significantly improved over the previous baseline. Experimental results revealed the placeholder mechanism, which temporarily replaces the non-standard tokens including emojis and emoticons with special placeholder tokens during translation, improves translation accuracy even with noisy texts.


Introduction
This paper describes NTT's submission to the WMT 2019 robustness task (Li et al., 2019). This year, we participated in English-to-Japanese (En-Ja) and Japanese-to-English (Ja-En) translation tasks with a constrained setting, i.e., we used only the parallel and monolingual corpora provided by the organizers.
The task focuses on the robustness of Machine Translation (MT) to noisy text that can be found on social media (e.g., Reddit, Twitter). The task is more challenging than a typical machine translation task like the news translation tasks (Bojar et al., 2018) due to the characteristics of noisy text and the lack of a publicly available parallel corpus (Michel and Neubig, 2018). Table 1 shows example comments from Reddit, a discussion website. Text on social media usually contains various noise such as (1) abbreviations, (2) grammatical errors, (3) misspellings, (4) emojis, and (5) emoticons. In addition, most provided parallel corpora are not related to our target domain, ⇤ Equal contribution.
(1) I'll let you know bro, thx (2) She had a ton of rings.
(3) oh my god it's beatiful (4) Thank you so much for all your advice!! (5) (\ ⇤´8`⇤ ) so cute and the amount of in-domain parallel corpus is still limited as compared with parallel corpora used in the typical MT tasks (Bojar et al., 2018).
To tackle this non-standard text translation with a low-resource setting, we mainly use the following techniques. First, we incorporated a placeholder mechanism (Crego et al., 2016) to correctly copy special tokens such as emojis and emoticons that frequently appears in social media. Second, to cope with the problem of the low-resource corpus and to effectively use the monolingual corpus, we created a synthetic corpus from a target-side monolingual corpus with a target-to-source translation model. Lastly, we fine-tuned our translation model with the synthetic and in-domain parallel corpora for domain adaptation.
The paper is organized as follows. In Section 2, we present a detailed overview of our systems. Section 3 shows experimental settings and main results, and Section 4 provides an analysis of our systems. Finally, Section 5 draws a brief conclusion of our work for the WMT19 robustness task.

System Details
In this section, we describe the overview and features of our systems: • Data preprocessing techniques for the provided parallel corpora (Section 2.2).
• Synthetic corpus, back-translated from the  provided monolingual corpus, and noisy data filtering for its data. (Section 2.3).
• Placeholder mechanism to handle tokens that should be copied from a source-side sentence (Section 2.4).

NMT Model
Neural Machine Translation (NMT) has been making remarkable progress in the field of MT (Bahdanau et al., 2015;Luong et al., 2015). However, most existing MT systems still struggle with noisy text and easily make mistranslations (Belinkov and Bisk, 2018), though the Transformer has achieved the state-of-the-art performance in several MT tasks (Vaswani et al., 2017). In our submission system, we use the Transformer model (Vaswani et al., 2017) without changing the neural network architecture as our base model to explore strategies to tackle the robustness problem. Specifically, we investigate how its noise-robustness against the noisy text can be boosted by introducing preprocessing techniques and a monolingual corpus in the experiments.

Data Preprocessing
For an in-domain corpus, the organizers provided the MTNT (Machine Translation of Noisy Text) parallel corpus (Michel and Neubig, 2018), which is a collection of Reddit discussions and their manual translations. They also provided relatively large out-of-domain parallel corpora, namely KFTT (Kyoto Free Translation Task) (Neubig, 2011), JESC (Japanese-English Subtitle Corpus) (Pryzant et al., 2017), and TED talks (Cettolo et al., 2012). Table 2 shows the number of sentences and words on the English side contained in the provided parallel corpora.   Yamamoto and Takahashi (2016) pointed out that the KFTT corpus contains some inconsistent translations. For example, Japanese era names are only contained in the Japanese side and not translated into English. We fixed these errors by the script provided by Yamamoto and Takahashi (2016) We use different preprocessing steps for each translation direction. This is because we need to submit tokenized output for En-Ja translation, thus it seems to be better to tokenize the Japanese side in the same way as the submission in the preprocessing steps, whereas we use a relatively simple method for Ja-En direction.
For Ja-En, we tokenized the raw text into subwords by simply applying sentencepiece with the vocabulary size of 32,000 for each language side (Kudo, 2018;Kudo and Richardson, 2018). For En-Ja, we tokenized the text by KyTea  and the Moses tokenizer (Koehn et al., 2007) for Japanese and English, respectively. We also truecased the English words by the script provided with Moses toolkits 2 . Then we further tokenized the words into subwords using joint Byte-Pair-Encoding (BPE) with 16,000 merge operations 3 (Sennrich et al., 2016b).

Monolingual Data
In addition to both the in-domain and out-ofdomain parallel corpora, the organizers provided a MTNT monolingual corpus, which consists of comments from the Reddit discussions. Table 3 shows the number of sentences and words contained in the provided monolingual corpus.
As NMT can be trained with only parallel data, utilizing a monolingual corpus for NMT is a key  challenge to improve translation quality for lowresource language pairs and domains. Sennrich et al. (2016a) showed that training with a synthetic corpus, which is generated by translating a monolingual corpus in the target language into the source language, effectively works as a method to use a monolingual corpus. Figure 1 illustrates an overview of the back-translation and fine-tuning processes we performed.
(1) We first constructed both of source-to-target and target-to-source translation models with the provided parallel corpus.
(2) Then, we created a synthetic parallel corpus through back-translation with the target-to-source translation model. (3) Next, we applied filtering techniques to the synthetic corpus to discard noisy synthetic sentences. (4) Finally, we fine-tuned the source-to-target model on both the synthetic corpus and in-domain parallel corpus. Before the back-translation, we performed several data cleaning steps on the monolingual data to remove the sentences including ASCII arts and sentences that are too long or short. To investigate whether each sentence contains ASCII art or not, we use a word frequency-based method to detect ASCII arts. Since ASCII arts normally consist of limited types of symbols, the frequency of specific words in a sentence tends to be locally high if the sentence includes an ASCII art. Therefore, we calculate a standard deviation of word frequencies in each sentence of monolingual data to determine whether a sentence is like ASCII arts. More specifically, we first define a word frequency list x i of the sentence i. For example, the word frequency list is denoted as x i = [1, 1, 1, 1, 1] for the sentence i, "That 's pretty cool ." but as x j = [1, 1, 1, 1, 3] for another sentence j, "THIS IS MY LIFE ! ! !". Note that the length of the list x i is equal to the vocabulary size of the sentence i or j and each element of the list corresponds to the word frequency of a specific word. Second, we calculate the standard deviation i of the word frequency list x i for the sentence i. Finally, if i is higher than a specific threshold, we assume that the sentence i contains an ASCII art and discard it from the monolingual data. We set the threshold to 6.0.
Moreover, since the provided monolingual data includes lines with more than one sentence, we first performed the sentence tokenization using the spaCy 4 toolkit. After that, we discarded the sentences that are either longer than 80 tokens or equal to 1 token.
Since a synthetic corpus might contain noisy sentence pairs, previous work shows that an additional filtering technique helps to improve accuracy (Morishita et al., 2018). We also apply a filtering technique to the synthetic corpus as illustrated in (3) in Figure 1. For this task, we use the qe-clean 5 toolkit, which filtered out the noisy sentences on the basis of a word alignment and language models by estimating how correctly translated and natural the sentences are (Denkowski et al., 2012). We train the word alignment and language models by using KFTT, TED, and MTNT corpora 6 . We use fast_align for word alignment and KenLM for language modeling (Dyer et al., 2013;Heafield, 2011).

Placeholder
Noisy text on social media often contains tokens that do not require translation such as emojis, " , , ", and emoticons, "m(_ _)m, (`· ! ·´), \(ˆoˆ)/". However, to preserve the meaning of the input sentence that contains emojis or emoticons, such tokens need to be output to the target language side. Therefore, we simply copy the emojis and emoticons from a source language to a target language with a placeholder mechanism (Crego et al., 2016), which aims at alleviating the rare-word problem in NMT. Both the sourceand target-side sentences containing either emojis or emoticons need to be processed for the placeholder mechanism. Specifically, we use a special token "<PH>" as a placeholder and replace the emojis and emoticons in the sentences with the special tokens.
To leverage the placeholder mechanism, we need to recognize which tokens are corresponding to emojis or emoticons in advance. Emojis can easily be detected on the basis of Unicode Emoji Charts 7 . We detect emoticons included in both the source-and the target-side sentences with the nagisa 8 toolkit, which is a Japanese morphological analyzer that can also be used as an emoticon detector for Japanese and English text.
Moreover, we also replace ">" tokens at the beginning of the sentence with the placeholders because ">" is commonly used as a quotation mark in social media posts and emails and does not require translation.

Fine-tuning
Since almost all the provided corpora are not related to our target domain, it is natural to adapt the model by fine-tuning with the in-domain corpora. Whereas we use both the MTNT and synthetic corpora for Ja-En, we only use the MTNT corpus for En-Ja because the preliminary experiment shows that synthetic corpus does not help to improve accuracy for the En-Ja direction. We suspect this is due to the synthetic corpus not having sufficient quality to improve the model.

Experimental Settings
We used the Transformer model with six blocks. Our model hyper-parameters are based on trans-former_base settings, where the word embedding dimensions, hidden state dimensions, feedforward dimensions and number of heads are 512, 512, 2048, and 8, respectively. The model shares 7 https://unicode.org/emoji/charts 8 https://github.com/taishi-i/nagisa the parameter of the encoder/decoder word embedding layers and the decoder output layer by three-way-weight-tying (Press and Wolf, 2017). Each layer is connected with a dropout probability of 0.3 (Srivastava et al., 2014). For an optimizer, we used Adam (Kingma and Ba, 2015) with a learning rate of 0.001, 1 = 0.9, 2 = 0.98. We use a root-square decay learning rate schedule with a linear warmup of 4000 steps (Vaswani et al., 2017). We applied mixed precision training that makes use of GPUs more efficiently for faster training (Micikevicius et al., 2018). Each minibatch contains about 8000 tokens (subwords), and we accumulated the gradients of 128 mini-batches for an update (Ott et al., 2018). We trained the model for 20,000 iterations, saved the model parameters each 200 iterations, and took an average of the last eight models 9 . Training took about 1.5 days to converge with four NVIDIA V100 GPUs. We compute case-sensitive BLEU scores (Papineni et al., 2002) for evaluating translation quality 10 . All our implementations are based on the fairseq 11 toolkit (Ott et al., 2019).
After training the model with the whole provided parallel corpora, we fine-tuned it with indomain data. During fine-tuning, we used almost the same settings as the initial training setup except we changed the model save interval to every three iterations and continued the learning rate decay schedule. For fine-tuning, we trained the model for 50 iterations, which took less than 10 minutes with four GPUs.
When decoding, we used a beam search with the size of six and a length normalization technique with ↵ = 2.0 and = 0.0 (Wu et al., 2016). For the submission, we used an ensemble of three (En-Ja) or four (Ja-En) independently trained models 12 . Table 4 shows the case-sensitive BLEU scores of provided blind test sets. Replacing the emoticons
Improved Degraded Unchanged Ja-En 9 (53%) 0 (0%) 8 (47%) En-Ja 14 (82%) 1 (6%) 2 (12%) and emojis with the placeholders achieves a small gain over the baseline model, which was trained with the provided raw corpora. Also, additional fine-tuning with in-domain and synthetic corpora also leads to a substantial gain for both directions. For Ja-En, although we failed to improve the accuracy by fine-tuning the MTNT corpus only, we found that the fine-tuning on both the in-domain and synthetic corpora achieves a substantial gain. We suspect this is due to overfitting, and modifying the number of iterations might alleviate this problem. As described in Section 2.5, we did not use the synthetic corpus for the En-Ja direction.
For the submission, we decoded using an ensemble of independently trained models, which boosts the scores.

Effect of Placeholders
To investigate the effectiveness of using the placeholder mechanism, we compared the translation of the baseline to the model trained with the placeholders. We manually evaluated how correctly the emojis and emoticons were copied to the output. Table 5 shows the numbers of sentences on the MTNT test set that are improved/degraded by applying the placeholder mechanism. These result demonstrate that the placeholder mechanism could improve the translation of the noisy text, which frequently includes emojis and emoticons, almost without degradation. Tables 6 and 7 show examples of translations in the Ja-En and En-Ja tasks, respectively. Both the emoji ( ) and the ">" token, which represents a quotation mark, were properly copied from the source text to the translation of +placeholders, whereas the baseline model did not output such tokens as shown in Tables 6 and 7. Thus, we can consider this to be the reason the placeholders contribute to improving case-sensitive BLEU scores over the baseline.
In our preliminary experiments, although we tried a method to introduce the placeholder technique to our systems at the fine-tuning phase, we found that it does not work properly with only the fine-tuning. This means that an NMT needs to be trained with the corpus pre-processed for the placeholder mechanism before the fine-tuning.

Effect of Fine-tuning
According to the comparison between +finetuning and baseline shown in Table 4, fine-tuning on the in-domain and synthetic corpus achieved a substantial gain in both directions. Accordingly, we can see that the sentence translated by +finetuning has a more informal style than those translated by baseline and +placeholders as presented in Tables 6 and 7.

Difficulties in Translating Social Media
Texts Challenges still remain to improving the model's robustness against social media texts such as Reddit comments. As we pointed out in Section 1, various abbreviations are often used. For example, the term, "qπ›Web" (literally East Spo Web) in   the MTNT dataset should be translated to "Tokyo Sports Website" according to its reference, but our model incorrectly translated it to "East Spoweb". Such abbreviations that cannot be translated correctly without prior knowledge, such as "q π ›Web stands for q¨π›¸ƒWebµ §» (literally Tokyo Sports Website)", are commonly used on social media.

Use of Contextual Information
Some sentences need contextual information for them to be precisely translated. The MTNT corpus provides comment IDs as the contextual information to group sentences from the same original comment. We did not use the contextual information in our systems, but we consider that it would help to improve translation quality as in previous work (Tiedemann and Scherrer, 2017;Bawden et al., 2018). For example, in the following two sentences, "Airborne school isn't a hard school." and "Get in there with some confidence!", which can be found in the MTNT corpus and have the same comment ID, we consider that leveraging their contextual information would help to clarify what "there" means in the latter and to translate it more accurately.

Conclusion
In this paper, we presented NTT's submission to the WMT 2019 robustness task. We participated in the Ja-En and En-Ja translation tasks with constrained settings. Through experiments, we showed that we can improve translation accuracy by introducing the placeholder mechanism, performing fine-tuning on both in-domain and synthetic corpora, and using ensemble models of Transformers. Moreover, our analysis indicated that the placeholder mechanism contributes to improving translation quality.
In future work, we will explore ways to use monolingual data more effectively, introduce contextual information, and deal with a variety of noisy tokens such as abbreviations, ASCII-arts, and grammar errors.