MTNT: A Testbed for Machine Translation of Noisy Text

Noisy or non-standard input text can cause disastrous mistranslations in most modern Machine Translation (MT) systems, and there has been growing research interest in creating noise-robust MT systems. However, as of yet there are no publicly available parallel corpora with naturally occurring noisy inputs and translations, and thus previous work has resorted to evaluating on synthetically created datasets. In this paper, we propose a benchmark dataset for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on Reddit (www.reddit.com) and professionally sourced translations. We commissioned translations of English comments into French and Japanese, as well as French and Japanese comments into English, on the order of 7k-37k sentences per language pair. We qualitatively and quantitatively examine the types of noise included in this dataset, then demonstrate that existing MT models fail badly on a number of noise-related phenomena, even after performing adaptation on a small training set of in-domain data. This indicates that this dataset can provide an attractive testbed for methods tailored to handling noisy text in MT.


Introduction
#nlproc is actualy f*ing hARD tbh
This handcrafted sentence showcases several types of noise that are commonly seen on social media: abbreviations ("#nlproc"), typographical errors ("actualy"), obfuscated profanities ("f*ing"), inconsistent capitalization ("hARD"), Internet slang ("tbh" for "to be honest") and emojis. Although machine translation has achieved significant quality improvements over the past few years due to the advent of Neural Machine Translation (NMT) (Kalchbrenner and Blunsom; Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016), systems are still not robust to noisy input like this (Belinkov and Bisk, 2018; Khayrallah and Koehn). For example, Google Translate 3 translates the above example into French as:
#nlproc est en train de f * ing dur hb
which translates back into English as "#nlproc is in the process of [f * ing] hard hb". This shows that noisy input can lead to erroneous translations that can be misinterpreted or even offensive.
Noise in social media text is a known issue that has been investigated in a variety of previous work (Eisenstein; Baldwin et al.). Most recently, Belinkov and Bisk (2018) have focused on the difficulties that character-based NMT models have translating text with character-level noise within individual words (from scrambling to simulated human errors such as typos or spelling/conjugation errors). This is a good first step towards noise-robust NMT systems, but as we demonstrate in §2, word-by-word replacement or scrambling of characters doesn't cover all the idiosyncrasies of language on the Internet.
Despite the obvious utility of creating noise-robust MT systems, and the scientific challenges contained therein, there is currently a bottleneck: no standard open benchmark exists for researchers and developers of MT systems to test the robustness of their models to these and other phenomena found in noisy text on the Internet. In this work, we introduce MTNT, a new, realistic dataset aimed at testing robustness of MT systems to these phenomena. The dataset contains naturally created noisy source sentences with professionally sourced translations both in a pair of typologically close languages (English and French) and distant languages (English and Japanese). We collect noisy comments from the Reddit online discussion website (§3) in English, French and Japanese, and ask professional translators to translate to and from English, resulting in approximately 1000 test samples and from 6k to 36k training samples in four language pairs (English-French (en-fr), French-English (fr-en), English-Japanese (en-ja) and Japanese-English (ja-en)). In addition, we release small monolingual corpora in those 3 languages to provide data both for semi-supervised adaptation approaches and for noisy Language Modeling (LM) experiments. We test standard translation models (§5) and language models (§6) on our data to understand their failure cases and to provide baselines for future work.

Noise and Input Variations in Language on the Internet

Examples from Social Media Text
The term "noise" can encompass a variety of phenomena in natural language, with variations across languages (e.g. what is a typo in logographic writing systems?) and type of content (Baldwin et al.).

Is Translating Noisy Text just another Adaptation Problem?
To a certain extent, translating noisy text is a type of adaptation, which has been studied extensively in the context of both Statistical Machine Translation (SMT) and NMT (Axelrod et al.; Luong and Manning, 2015; Chu et al.; Miceli Barone et al.; Michel and Neubig, 2018). However, it presents many differences with previous domain adaptation problems, where the main goal is to adapt from a particular topic or style. In the case of noisy text, not only will a particular word be translated differently than in the general domain (e.g. as in the case of "sub"), but there will also be increased lexical variation (e.g. due to spelling or typographical errors) and inconsistency in grammar (e.g. due to omissions of critical words or misuse). The sum of these differences warrants treating noisy MT as a problem separate from domain adaptation, and our experimental analysis in Section 5.4 demonstrates that even after performing adaptation, MT systems still make a large number of noise-related errors.

Collection Procedure
We first collect noisy sentences in our three languages of interest, English, French and Japanese.
Figure 1: Summary of our collection process and the respective sections addressing them. We apply the same procedure for each language.
We refer to Figure 1 for an overview of the data collection and translation process. We choose Reddit as a source of data because (1) its content is likely to exhibit noise, (2) some of its sub-communities are entirely run in different languages, in particular, English, French and Japanese, and (3) Reddit is a popular source of data in curated and publicly distributed NLP datasets (Tan et al.). We collect data using the public Reddit API. 4 Note that the data collection and translation is performed at the comment level. We split the parallel data into sentences as a last step.

Data Sources
For each language, we select a set of communities ("subreddits") that we know contain many comments in that language:
• English: Since an overwhelming majority of the discussions on Reddit are conducted in English, we don't restrict our collection to any community in particular.
• French: /r/france, /r/quebec and /r/rance. The first two are among the biggest French-speaking communities on Reddit. The third is a humor/sarcasm-based offshoot of /r/france.
• Japanese: /r/newsokur, /r/bakanewsjp, /r/newsokuvip, /r/lowlevelaware and /r/steamr. Those are the biggest Japanese-speaking communities, with over 2,000 subscribers.
4 In particular, we use this implementation: praw.readthedocs.io/en/latest, and our complete code is available at http://www.cs.cmu.edu/~pmichel1/mtnt/.
We collect comments made during the 03/27/2018-03/29/2018 time period for English, 09/2018-03/2018 for French and 11/2017-03/2018 for Japanese. The large difference in collection time is due to the variance in comment throughput and relative amount of noise between the languages.

Contrast Corpora
Not all comments found on Reddit exhibit noise as described in Section 2. Because we would like to focus our data collection on noisy comments, we devise criteria that allow us to distinguish potentially noisy comments from clean ones. Specifically, we compile a contrast corpus composed of clean text that we can compare to, and find potentially noisy text that differs greatly from the contrast corpus. Given that our final goal is MT robust to noise, we prefer that these contrast corpora consist of the same type of data that is often used to train NMT models. We select different datasets for each language:
• English: The English side of the preprocessed parallel training data provided for the German-English WMT 2017 News translation task, 5 as provided on the website. This amounts to ≈ 5.85 million sentences.
• French: The entirety of the French side of the parallel training data provided for the English-French WMT 2015 translation task. 6 This amounts to ≈ 40.86 million sentences.

Identifying Noisy Comments
We now describe the procedure used to identify comments containing noise.
Pre-filtering First, we apply three pre-filtering criteria to discard comments that do not represent natural noisy text in the language of interest:
1. Comments containing a URL, as detected by a regular expression.
2. Comments where the author's username contains "bot" or "AutoModerator". This mostly removes automated comments from bots.
3. Comments in another language: we run langid.py 7 (Lui and Baldwin) and discard comments where p(lang | comment) > 0.5 for any language other than the one we are interested in.
This removes cases that are less interesting, i.e. those that could be solved by rule-based pattern matching or are not natural text created by regular users in the target language. Our third criterion in particular discards comments that are blatantly in another language while still allowing comments that exhibit code-switching or that contain proper nouns or typos that might skew the language identification. In preliminary experiments, we noticed that 14.47%, 6.53% and 7.09% of the collected comments satisfied the above criteria respectively.
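The three pre-filtering criteria above can be sketched as follows. This is a minimal illustration, not the released collection code: the URL pattern, the function names and the `toy_lang_probs` classifier are all assumptions introduced here, and in practice a real language identifier such as langid.py would supply the per-language probabilities.

```python
import re
from typing import Callable, Dict

# Simple URL pattern (an assumption; the exact expression used is not given in the text).
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def keep_comment(body: str, author: str, target_lang: str,
                 lang_probs: Callable[[str], Dict[str, float]]) -> bool:
    """Return True if the comment survives the three pre-filtering criteria."""
    # 1. Discard comments containing a URL.
    if URL_RE.search(body):
        return False
    # 2. Discard comments whose author name contains "bot" or "AutoModerator".
    name = author.lower()
    if "bot" in name or "automoderator" in name:
        return False
    # 3. Discard comments confidently identified as another language.
    for lang, prob in lang_probs(body).items():
        if lang != target_lang and prob > 0.5:
            return False
    return True


def toy_lang_probs(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for a language identifier (French vs. English only)."""
    french_words = {"le", "la", "est", "je", "pas"}
    words = text.lower().split()
    frac = sum(w in french_words for w in words) / max(len(words), 1)
    return {"fr": frac, "en": 1.0 - frac}
```

Passing the language identifier as a callable keeps the filter independent of any particular library; the 0.5 threshold is the one stated in the text.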
Normalization After this first pass of filtering, we pre-process the comments before running them through our noise detection procedure. We first strip Markdown 8 syntax from the comments. For English and French, we normalize the punctuation, lowercase and tokenize the comments using the Moses tokenizer. For Japanese, we simply lowercase the alphabetical characters in the comments. Note that this normalization is done for the purpose of noise detection only. The collected comments are released without any kind of preprocessing. We apply the same normalization procedure to the contrast corpora.
Unknown words In the case of French and English, a clear indication of noise is the presence of out-of-vocabulary words (OOVs): we record all lowercased words encountered in our reference corpus described in Section 3.2 and only keep comments that contain at least one OOV. Since we did not use word segmentation for the Japanese reference corpus, we found this method not very effective for selecting Japanese comments and therefore skipped this step.
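As a rough sketch of this OOV criterion, the snippet below builds a lowercased vocabulary from a reference corpus and keeps only comments with at least one unseen word. The function names are hypothetical, and plain whitespace splitting stands in for the Moses tokenization used during normalization.

```python
from typing import Iterable, List, Set


def build_vocabulary(reference_sentences: Iterable[str]) -> Set[str]:
    """Collect every lowercased token seen in the reference corpus."""
    vocab: Set[str] = set()
    for sentence in reference_sentences:
        # Whitespace splitting is an assumption standing in for a real tokenizer.
        vocab.update(sentence.lower().split())
    return vocab


def has_oov(comment: str, vocab: Set[str]) -> bool:
    """True if at least one token of the comment is absent from the vocabulary."""
    return any(token not in vocab for token in comment.lower().split())


def select_oov_comments(comments: Iterable[str], vocab: Set[str]) -> List[str]:
    """Keep only the comments that contain at least one OOV token."""
    return [c for c in comments if has_oov(c, vocab)]
```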
Language model scores The final step of our noise detection procedure consists of selecting those comments with a low probability under a language model trained on the reference monolingual corpus. This approach mirrors the one used in Moore and Lewis and Axelrod et al. to select data similar to a specific domain using language model perplexity as a metric. We search for comments that have a low probability under a sub-word language model for more flexibility in the face of OOV words. We segment the contrast corpora with Byte-Pair Encoding (BPE) using the sentencepiece 9 implementation. We set the vocabulary sizes to 1,000, 1,000 and 4,000 for English, French and Japanese respectively. We then use a 5-gram Kneser-Ney smoothed language model trained using kenLM 10 (Heafield et al.) to calculate the log probability, normalized by the number of tokens, for every sentence in the reference corpus. Given a Reddit comment, we compute the normalized log probability of each of its lines under our subword language model. If for any line this score is below the 1st percentile of scores in the reference corpus, the comment is labeled as noisy and saved.
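The percentile-based decision rule can be illustrated with a small self-contained sketch. To keep it runnable without external tools, an add-one smoothed character unigram model replaces the BPE-segmented 5-gram Kneser-Ney model; that substitution, the nearest-rank percentile and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def train_char_unigram(corpus: Sequence[str]) -> Callable[[str], float]:
    """Toy stand-in for the sub-word n-gram LM: add-one smoothed character unigrams."""
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen characters

    def normalized_logprob(line: str) -> float:
        # Log probability of the line divided by its number of tokens (characters here).
        if not line:
            return 0.0
        logp = sum(math.log((counts[ch] + 1) / (total + vocab_size)) for ch in line)
        return logp / len(line)

    return normalized_logprob


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def is_noisy(comment: str, score: Callable[[str], float], threshold: float) -> bool:
    """A comment is noisy if any of its non-empty lines scores below the threshold."""
    return any(score(line) < threshold
               for line in comment.splitlines() if line.strip())
```

In use, the threshold would be computed once as `percentile(reference_scores, 1)` over the contrast corpus and then applied to every collected comment.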

Creating the Parallel Corpora
Once enough data has been collected, we isolate 15,000 comments in each language by the following procedure:
• Remove all duplicates. In particular, this handles comments that might have been scraped twice or automatic comments from bots.
• To further weed out outliers (comments that are too noisy, e.g. ASCII art, wrong language... or not noisy enough), we discard comments that are on either end of the distribution of normalized LM scores within the set of collected comments. We only keep comments whose normalized score is within the 5-70 percentile range for English (resp. 5-60 for French and 10-70 for Japanese). These numbers are chosen by manually inspecting the data.
9 https://github.com/google/sentencepiece
10 https://kheafield.com/code/kenlm/
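The two selection steps (deduplication, then keeping a percentile band of normalized LM scores) might look like the sketch below. The function name and the index-based way of cutting the band are assumptions; the text only specifies the percentile bounds.

```python
from typing import Dict, List, Tuple


def select_band(scored: List[Tuple[str, float]],
                low_pct: float, high_pct: float) -> List[str]:
    """Deduplicate comments, then keep those whose score rank falls in [low_pct, high_pct)."""
    # Remove exact duplicates, keeping the first occurrence of each comment.
    unique: Dict[str, float] = {}
    for text, score in scored:
        unique.setdefault(text, score)
    # Sort from lowest normalized LM score (noisiest) to highest (cleanest).
    ranked = sorted(unique.items(), key=lambda item: item[1])
    n = len(ranked)
    lo = int(n * low_pct / 100)
    hi = int(n * high_pct / 100)
    return [text for text, _ in ranked[lo:hi]]
```

With the bounds reported above, English would use `select_band(scored, 5, 70)`, French `select_band(scored, 5, 60)` and Japanese `select_band(scored, 10, 70)`.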
We then concatenate the title of the thread where the comment was found to the text and send everything to an external vendor for manual translation. Upon reception of the translations, we noticed a certain amount of variation in the quality of translations, likely because translating social media text, with all its nuances, is difficult even for humans. In order to ensure the highest quality in the translations, we manually filter the data to segment the comments into sentences and weed out poor translations for our test data. We thereby retain around 1,000 sentence pairs in each direction for the final test set.
We gather the samples that weren't selected for the test sets to be used for training or fine-tuning models on noisy data. We automatically split comments into sentences with a regular expression detecting sentence delimiters, and then align the source and target sentences. Should this alignment fail (i.e. the source comment contains a different number of sentences than the target comment after automatic splitting), we revert back to providing the whole comment without splitting. For the training data, we do not verify the correctness of translations as closely as for the test data. Finally, we isolate ≈ 900 samples in each direction to serve as validation data. Information about the size of the data can be found in Table 1, 2 and 3 for the test, training and validation sets respectively. We tokenize the English and French data with the Moses (Koehn et al.) tokenizer and the Japanese data with Kytea before counting the number of tokens in each dataset.

Table 2 (training sets):
        #samples   #src tokens   #trg tokens
en-fr   36,058     841k          965k
fr-en   19,161     661k          634k
en-ja   5,775      281k          506k
ja-en   6,506      172k          128k
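The automatic splitting and its fallback can be sketched as follows. The delimiter pattern is a hypothetical one (the actual regular expression is not given in the text), and alignment is assumed to be a simple one-to-one pairing of sentences in order.

```python
import re
from typing import List, Tuple

# Hypothetical delimiter pattern: split after ., ! or ? followed by whitespace,
# or directly after the Japanese delimiters 。！？ (no whitespace required).
SENT_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text: str) -> List[str]:
    """Split a comment into sentences, dropping empty pieces."""
    return [piece.strip() for piece in SENT_END.split(text) if piece.strip()]


def align_comment(src: str, tgt: str) -> List[Tuple[str, str]]:
    """Pair sentences one-to-one; fall back to the whole comment on a count mismatch."""
    src_sents, tgt_sents = split_sentences(src), split_sentences(tgt)
    if len(src_sents) != len(tgt_sents):
        return [(src.strip(), tgt.strip())]
    return list(zip(src_sents, tgt_sents))
```

The fallback keeps every translated comment usable for training even when the translator merged or split sentences.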

Monolingual Corpora
After the creation of the parallel train and test sets, a large number of unused comments remain in each language, which we provide as monolingual corpora. This additional data has two purposes: first, it serves as a resource for in-domain training using semi-supervised methods relying on monolingual data (e.g. Cheng et al.; Zhang and Zong). Second, it provides a language modeling dataset for noisy text in three languages.
We select 3,000 comments at random in each dataset to form a validation set to be used to tune hyper-parameters, and provide the rest as training data. The data is provided with one comment per line. Newlines within individual comments are replaced with spaces. Table 4 gives information on the size of the datasets. As with the parallel MT data, we provide the number of tokens after tokenization with the Moses tokenizer for English and French and Kytea for Japanese.

Dataset Analysis
In this section, we investigate the proposed data to understand how different categories of noise are represented and to show that our test sets contain more noise overall than established MT benchmarks.

Quantifying Noisy Phenomena
We run a series of tests to count the number of occurrences of some of the types of noise described in Section 2. Specifically, we pass our data through spell checkers to count spelling and grammar errors. Because some of these tests are impractical to run on a large scale, we limit our analysis to the test sets of MTNT. We use slightly different procedures depending on the tools available for each language. We test for spelling and grammar errors in English data using Grammarly 11 , an online resource for English spell-checking. Due to the unavailability of an equivalent of Grammarly in French and Japanese, we test for spelling and grammar errors using the integrated spell-checker in Microsoft Word 2013 12 . Note that Word seems to count proper nouns as spelling errors, giving higher numbers of spelling errors across the board in French as compared to English.
For all languages, we also count the number of profanities and emojis using custom-made lists and regular expressions 13 . In order to compare results across datasets of different sizes, we report all counts per 100 words.
11 https://www.grammarly.com/
12 https://products.office.com/en-us/microsoft-word-2013
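A minimal sketch of the per-100-words normalization is given below. The profanity list and the emoji pattern are small placeholders introduced here (the custom lists and expressions used for the dataset are not reproduced in the text), and whitespace splitting assumes pre-tokenized input, so Japanese would first need word segmentation.

```python
import re
from typing import Dict, Iterable

# Placeholder list for illustration only.
PROFANITIES = {"damn", "crap"}
# Rough emoji pattern covering a few common Unicode blocks (an approximation).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def noise_per_100_words(sentences: Iterable[str]) -> Dict[str, float]:
    """Count profanities and emojis, normalized per 100 whitespace-separated words."""
    n_words = n_profanities = n_emojis = 0
    for sentence in sentences:
        words = sentence.lower().split()
        n_words += len(words)
        # Strip trailing punctuation so "damn!" still matches the list entry.
        n_profanities += sum(w.strip(".,!?") in PROFANITIES for w in words)
        n_emojis += len(EMOJI_RE.findall(sentence))
    scale = 100.0 / max(n_words, 1)
    return {"profanities": n_profanities * scale, "emojis": n_emojis * scale}
```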
The results are recorded in the last row of each section in Table 5. In particular, for the languages with a segmental writing system, English and French, spelling errors are the dominant type of noise, followed by grammar errors. Unsurprisingly, the former are much less present in Japanese. Table 5 also provides a comparison with the relevant side of established MT test sets. For English and French, we compare our data to the newstest2014 14 and newsdiscusstest2015 15 test sets. For Japanese, we compare with the test sets of the datasets described in Section 3.2.

Comparison to Existing MT Test Sets
Overall, MTNT contains more noise in all metrics but one (there are more profanities in JESC, a Japanese subtitle corpus). This confirms that MTNT indeed provides a more appropriate benchmark for translation of noisy or non-standard text.
Compared to synthetically created noisy test sets (Belinkov and Bisk, 2018), MTNT contains fewer systematic spelling errors and more varied types of noise (e.g. emojis and profanities) and is thereby more representative of naturally occurring noise.

Machine Translation Experiments
We evaluate standard NMT models on our proposed dataset to assess its difficulty. Our goal is not to train state-of-the-art models but rather to test standard off-the-shelf NMT systems on our data, and elucidate what features of the data make it difficult.

Model Description
All our models are implemented in DyNet (Neubig et al., 2017) with the XNMT toolkit. We use approximately the same setting for all language pairs: the encoder is a bidirectional LSTM with 2 layers, the attention mechanism is a multi-layer perceptron and the decoder is a 2-layer LSTM. The embedding dimension is 512, all other dimensions are 1024. We tie the target word embeddings and the output projection weights (Press and Wolf). We train with Adam (Kingma and Ba, 2014) with XNMT's default hyper-parameters, as well as dropout (with probability 0.3). We use BPE subwords to handle OOV words. Full configuration details as well as code to reproduce the baselines are available at https://github.com/pmichel31415/mtnt.

Training Data
We train our models on standard MT datasets:
• en ↔ fr: Our training data consists of the europarl-v7 16 and news-commentary-v10 17 corpora, totaling 2,164,140 samples, 54,611,105 French tokens and 51,745,611 English tokens (non-tokenized). We use the newsdiscussdev2015 14 dev set from WMT15 as validation data and evaluate the model on the newsdiscusstest2015 15 and newstest2014 14 test sets.
• en ↔ ja: We concatenate the respective train, validation and test sets of the three corpora mentioned in Section 3.2. In particular we detokenize the Japanese part of each dataset to make sure that any tokenization we perform will be uniform (in practice we remove ASCII spaces). This amounts to 3,900,772 training samples (34,989,346 English tokens without tokenization). We concatenate the dev sets associated with these corpora to serve as validation data and evaluate on each respective test set separately.

Results
We use sacreBLEU 18 , a standardized BLEU score evaluation script proposed by Post (2018), for BLEU evaluation of our benchmark dataset. It takes in detokenized references and hypotheses and performs its own tokenization before computing BLEU score. We specify the intl tokenization option. In the case of Japanese text, we run both hypothesis and reference through KyTea before computing BLEU score. We strongly encourage that evaluation be performed in the same manner in subsequent work, and will provide both scripts and an evaluation web site in order to facilitate reproducibility. Table 6 lists the BLEU scores for our models on the relevant test sets in the two language pairs, including the results on MTNT.

Analysis
To better understand the types of errors made by our model, we count the n-grams that are over- and under-generated with respect to the reference translation. Specifically, we compare the count ratios of all 1- to 3-grams in the output and in the reference and look for the ones with the highest (over-generated) and lowest (under-generated) ratio.
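The count-ratio comparison can be sketched in a few lines. The add-one smoothing, which avoids division by zero for n-grams missing from one side, is an assumption made here, as are the function names; the text does not state how zero counts are handled.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

NGram = Tuple[str, ...]


def ngram_counts(sentences: Iterable[str], max_n: int = 3) -> Counter:
    """Count all 1- to max_n-grams over whitespace-tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def count_ratios(hypotheses: Iterable[str], references: Iterable[str],
                 max_n: int = 3, smoothing: float = 1.0) -> Dict[NGram, float]:
    """Ratio of output count to reference count per n-gram (smoothed, an assumption)."""
    hyp = ngram_counts(hypotheses, max_n)
    ref = ngram_counts(references, max_n)
    return {gram: (hyp[gram] + smoothing) / (ref[gram] + smoothing)
            for gram in set(hyp) | set(ref)}
```

Sorting the resulting dictionary by value then gives the most over-generated n-grams at the top and the most under-generated ones at the bottom.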
We find that in English, the model under-generates the contracted form of the negative ("do not"/"don't") or of auxiliaries ("That is"/"I'm").

Source: Moi faire la gueule dans le métro me manque, c'est grave ?
Similarly, in French, our model over-generates "de votre" (where "votre" is the formal 2nd person plural for "your") and "n'ai pas" which showcases the "ne [...] pas" negation, often dropped in spoken language. Conversely, the informal second person "tu" is under-generated, as is the informal and spoken contraction of "cela", "ça". In Japanese, the model under-generates, among others, the informal personal pronoun 俺 ("ore") or the casual form だ ("da") of the verb です ("desu", to be). In ja-en the results are difficult to interpret as the model seems to produce incoherent outputs (e.g. "no, no, no...") when the NMT system encounters sentences it has not seen before. The full list of n-grams with the top 5 and bottom 5 count ratios in each language pair is displayed in Table 8.

[Table 8: over- and under-generated n-grams for fr-en, en-fr, ja-en and en-ja]

Fine-Tuning
Finally, we test a simple domain adaptation method by fine-tuning our models on the training data described in Section 3.4. We perform one epoch of training with vanilla SGD with a learning rate of 0.1 and a batch size of 32. We do not use the validation data at all. As evidenced by the results in the last row of Table 6, this drives BLEU score up by 3.17 to 7.96 points depending on the language pair. However large this increase might be, our model still breaks on very noisy sentences. Table 7 shows three examples in fr-en. Although our model somewhat improves after fine-tuning, the translations remain inadequate in all cases. In the third case, our model downright fails to produce a coherent output. This shows that despite improving BLEU score, naive domain adaptation by fine-tuning doesn't solve the problem of translating noisy text.

Language Modeling Experiments
In addition to our MT experiments, we report character-level language modeling results on the monolingual part of our dataset. We use the data described in Section 3.5 as training and validation sets. We evaluate the trained model on the source side of our en-fr, fr-en and ja-en test sets for English, French and Japanese respectively. We report results for two models: a Kneser-Ney smoothed 6-gram model (implemented with KenLM) and an implementation of the AWD-LSTM proposed in (Merity et al., 2018) 19 . We report the bits-per-character (bpc) values in Table 9. We intend these results to serve as a baseline for future work in language modeling of noisy text in any of these three languages.
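For reference, bits-per-character is the average negative base-2 log probability that the model assigns to each character. The helper below is a small sketch assuming per-character natural-log probabilities are available; the function name is introduced here for illustration.

```python
import math
from typing import Sequence


def bits_per_character(char_log_probs: Sequence[float]) -> float:
    """Average negative log2 probability per character.

    char_log_probs holds natural-log probabilities, one per character.
    """
    if not char_log_probs:
        raise ValueError("need at least one character")
    # Dividing by ln(2) converts nats to bits.
    return -sum(char_log_probs) / (len(char_log_probs) * math.log(2))
```

As a sanity check, a model that assigns probability 1/4 to every character scores exactly 2 bpc.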

Related work
Handling noisy text has received growing attention among various language processing tasks due to the abundance of user-generated content on popular social media platforms (Crystal, 2001; Herring, 2003; Danet and Herring, 2007). This content is considered noisy when compared to news corpora, which have been the main data source for language tasks (Baldwin et al.; Eisenstein). It poses several unique challenges because it contains a larger variety of linguistic phenomena that are absent in the news domain and that lead to degraded quality when applying a model to out-of-domain data (Ritter et al.; Luong and Manning, 2015). Additionally, it provides live examples of the Cmabrigde Uinervtisy (Cambridge University) effect, where state-of-the-art models become brittle while human language processing capability is more robust (Sakaguchi et al., 2017; Belinkov and Bisk, 2018).
Efforts to address these challenges have been focused on creating in-domain datasets and annotations (Owoputi et al.;Kong et al.;Blodgett et al., 2017), and domain adaptation training (Luong and Manning, 2015). In MT, improvements were obtained for SMT (Formiga and Fonollosa). However, the specific challenges for neural machine translation have not been studied until recently (Belinkov and Bisk, 2018;Sperber et al.;Cheng et al., 2018). The first provides empirical evidence of non-trivial quality degradation when source sentences contain natural noise or synthetic noise within words, and the last two explore data augmentation and adversarial approaches of adding noise efficiently to training data to improve robustness.
Our work also contributes to recent advances in evaluating neural machine translation quality with regard to specific linguistic phenomena, such as manually annotated test sentences for English to French translation, in order to identify errors due to specific linguistic divergences between the two languages (Isabelle et al.), or automatically generated test sets to evaluate typical errors in English to German translation (Sennrich). Our contribution distinguishes itself from this previous work and other similar initiatives (Peterson, 2011) by providing an open test set consisting of naturally occurring text exhibiting a wide range of phenomena related to noisy input text from contemporaneous social media.

Conclusion
We proposed a new dataset to test MT models for robustness to the types of noise encountered in natural language on the Internet. We contribute parallel training and test data in both directions for two language pairs, English ↔ French and English ↔ Japanese, as well as monolingual data in those three languages. We show that this dataset contains more noise than existing MT test sets and poses a challenge to models trained on standard MT corpora. We further demonstrate that these challenges cannot be overcome by a simple domain adaptation approach alone. We intend this contribution to provide a standard benchmark for robustness to noise in MT and foster research on models, datasets and evaluation metrics tailored for this specific problem.