Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness

We share a French-English parallel corpus of Foursquare restaurant reviews, and define a new task to encourage research on Neural Machine Translation robustness and domain adaptation, in a real-world scenario where better-quality MT would be greatly beneficial. We discuss the challenges of such user-generated content, and train good baseline models that build upon the latest techniques for MT robustness. We also perform an extensive evaluation (automatic and human) that shows significant improvements over existing online systems. Finally, we propose task-specific metrics based on sentiment analysis or translation accuracy of domain-specific polysemous words.


Introduction
Very detailed information about social venues such as restaurants is available from user-generated reviews in applications like Google Maps, TripAdvisor or Foursquare (4SQ). Most of these reviews are written in the local language and are not directly exploitable by foreign visitors: an analysis of the 4SQ database shows that, in Paris, only 49% of the restaurants have at least one review in English, and the situation can be much worse for other cities and languages (e.g., only 1% of Seoul restaurants for a French-only speaker).
Machine Translation of such user-generated content can improve the situation and make the data available for direct display or for downstream NLP tasks (e.g., cross-lingual information retrieval, sentiment analysis, spam or fake review detection), provided its quality is sufficient.
We asked professionals to translate 11.5k French 4SQ reviews (18k sentences) to English. We believe that this resource will be valuable to the community for training and evaluating MT systems addressing challenges posed by user-generated content, which we discuss in detail in this paper.
We conduct extensive experiments and combine techniques that address these challenges (e.g., factored case, noise generation, domain adaptation with tags) on top of a strong Transformer baseline. In addition to BLEU evaluation and human evaluation, we use targeted metrics that measure how well polysemous words are translated, or how well sentiments expressed in the original review can still be recovered from its translation.

Related work
Translating restaurant reviews written by casual customers presents several challenges for NMT, in particular robustness to non-standard language and adaptation to a specific style or domain (see Section 3.2 for details).
Concerning robustness to noisy user-generated content, Michel and Neubig (2018) stress differences with traditional domain adaptation problems, and propose a typology of errors, many of which we also detected in the 4SQ data. They also released a dataset (MTNT), whose sources were selected from a social media platform (Reddit) on the basis of being especially noisy (see Appendix for a comparison with 4SQ). These sources were then translated by humans to produce a parallel corpus that can be used to engineer more robust NMT systems and to evaluate them. This corpus was the basis of the WMT 2019 MT Robustness Task (Li et al., 2019), in which Berard et al. (2019) ranked first. We use the same set of robustness and domain adaptation techniques, which we study in more depth and apply to our review translation task. Sperber et al. (2017), Belinkov and Bisk (2018) and Karpukhin et al. (2019) propose to improve robustness by training models on data-augmented corpora, containing noisy sources obtained by random word or character deletions, insertions, substitutions or swaps. Recently, Vaibhav et al. (2019) proposed to use a similar technique along with noise generation through replacement of a clean source by one obtained by back-translation.
Addressing the technical issues of robustness and adaptation of an NMT system is decisive for real-world deployment, but evaluation is also critical. This aspect is stressed by Levin et al. (2017) (NMT of curated hotel descriptions), who point out that automatic metrics like BLEU tend to neglect semantic differences that have a small textual footprint, but may be seriously misleading in practice, for instance by interpreting available parking as if it meant free parking. To mitigate this, we conduct additional evaluations of our models: human evaluation, translation accuracy of polysemous words, and indirect evaluation with sentiment analysis.

Task description
We present a new task of restaurant review translation, which combines domain adaptation and robustness challenges.

Corpus description
We sampled 11.5k French reviews from 4SQ, mostly in the food category, split them into sentences (18k), and grouped them into train, valid and test sets (see Table 1). The French reviews contain on average 1.5 sentences and 17.9 words. Then, we hired eight professional translators to translate them to English. Two of them created the training set by post-editing (PE) the outputs of baseline NMT systems. The other six translated the valid and test sets from scratch. They were asked to translate (or post-edit) the reviews sentence-by-sentence (to avoid any alignment problem), but they could see the full context. We manually filtered the test set to remove translations that were not satisfactory. The full reviews and additional metadata (e.g., location and type of the restaurant) are also available as part of this resource, to encourage research on contextual machine translation. 4SQ-HT was translated from scratch by the same translators who post-edited 4SQ-PE. While we did not use it in this work, it can be used as extra training or development data. We also release a human translation of the French-language test set (668 sentences) of the Aspect-Based Sentiment Analysis task at SemEval 2016 (Pontiki et al., 2016).

Challenges
Translating restaurant reviews presents two main challenges compared to common tasks in MT. First, the reviews are written in a casual style, close to spoken language. Some liberty is taken w.r.t. spelling, grammar, and punctuation. Slang is also very frequent. MT should be robust to these variations. Second, they generally are reactions, by clients of a restaurant, about its food quality, service or atmosphere, with specific words relating to these aspects or sentiments. These require some degree of domain adaptation. The following table illustrates these issues, with outputs from an online MT system. Examples of full reviews from 4SQ-PE along with metadata are shown in Appendix.
Examples 1 and 2 fall into the robustness category: 1 is an extreme form of SMS-like, quasi-phonetic language (et quand j'ai vu ça); 2 is a literal transcription of a long-vowel phonetic stress (trop → trooop). Example 3 falls into the domain category: in a restaurant context, cadre typically refers to the setting. Examples 4 and 5 involve both robustness and domain adaptation: pété un cable is a non-compositional slang expression and garçon is not a boy in this domain; nickel is slang for great, très is missing an accent, and pâtes is misspelled as pattes, which is another French word.

Robustness to noise
We propose solutions for dealing with nonstandard case, emoticons, emojis and other issues.

Rare character placeholder
We segment our training data into subwords with BPE (Sennrich et al., 2016c), implemented in SentencePiece (Kudo and Richardson, 2018). BPE can deal with rare or unseen words by splitting them into more frequent subwords but cannot deal with unseen characters. While this is not a problem in most tasks, 4SQ contains a lot of emojis, and sometimes symbols in other scripts (e.g., Arabic). Unicode now defines around 3k emojis, most of which are likely to be out-of-vocabulary.
We replace rare characters on both sides of the training corpus by a placeholder (<x>); a model trained on this data is typically able to copy the placeholder at the correct position. Then, at inference time, we replace the output tokens <x> by the rare source-side characters, in the same order. This approach is similar to that of Jean et al. (2015), who used the attention mechanism to replace output UNK symbols with the aligned word in the source. Berard et al. (2019) used the same technique to deal with emojis in the WMT robustness task.
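The placeholder mechanism can be illustrated with a minimal Python sketch. It is not the code used for the experiments: the function names, the `min_count` threshold and the character-level masking before subword segmentation are assumptions made for illustration.

```python
from collections import Counter

PLACEHOLDER = "<x>"

def build_charset(lines, min_count=2):
    """Return the characters seen at least `min_count` times (hypothetical threshold)."""
    counts = Counter(ch for line in lines for ch in line)
    return {ch for ch, n in counts.items() if n >= min_count}

def mask_rare(text, known_chars):
    """Replace each rare character by the placeholder; also return the removed characters in order."""
    masked, removed = [], []
    for ch in text:
        if ch in known_chars or ch.isspace():
            masked.append(ch)
        else:
            masked.append(PLACEHOLDER)
            removed.append(ch)
    return "".join(masked), removed

def restore_rare(output, removed):
    """Fill the placeholders of an MT output with the source-side rare characters, in the same order."""
    parts = output.split(PLACEHOLDER)
    restored = [parts[0]]
    for i, part in enumerate(parts[1:]):
        # If the output has more placeholders than the source had rare characters, drop the extras.
        restored.append(removed[i] if i < len(removed) else "")
        restored.append(part)
    return "".join(restored)
```

For instance, masking "super resto 😍" with a Latin-only character set gives "super resto <x>", and a model output "great restaurant <x>" is restored to "great restaurant 😍".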

Capital letters
As shown in Table 2, capital letters are another source of confusion. HONTE and honte are considered as two different words. The former is out-of-vocabulary and is split very aggressively by BPE. This causes the MT model to hallucinate.
Lowercasing. A solution is to lowercase the input, both at training and at test time. However, when doing so, some information may be lost (e.g., named entities, acronyms, emphasis) which may result in lower translation quality.

Factored case. An alternative is to keep the case as a separate factor of each lowercased token. We implement this with two embedding matrices, one for words and one for case, and represent a token as the sum of the embeddings of its factors. For the target side, we follow Garcia-Martinez et al. (2016) and have two softmax operations. We first predict the word in its lowercase form and then predict its case. The embeddings of the case and word are then summed and used as input for the next decoder step.

Berard et al. (2019) propose another approach, inline casing, which does not require any change in the model. We insert the case as a regular token into the sequence right after the word. Special tokens <U>, <L> and <T> (upper, lower and title) are used for this purpose and appended to the vocabulary. Contrary to the previous solution, there is only one embedding matrix and one softmax.

Inline casing
In practice, words are assumed to be lowercase by default and the <L> tokens are dropped to keep the factored sequences as short as possible. "Best fries EVER" becomes "best <T> _f ries _ever <U>". Like Berard et al. (2019), we force SentencePiece to split mixed-case words like MacDonalds into single-case subwords (Mac and Donalds).
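A word-level Python sketch of inline casing is given here for illustration. It ignores subword segmentation and the splitting of mixed-case words, and the function names and the treatment of single uppercase letters as title case are assumptions, not details of the actual system.

```python
CASE_TAGS = {"<U>", "<T>"}

def inline_case(sentence):
    """Lowercase each word and append a case token after it when needed."""
    tokens = []
    for word in sentence.split():
        tokens.append(word.lower())
        if len(word) > 1 and word.isupper():
            tokens.append("<U>")
        elif word[0].isupper() and word[1:] == word[1:].lower():
            tokens.append("<T>")
        # Lowercase words get no tag (<L> is implicit). Mixed-case words
        # (e.g., MacDonalds) are not handled in this sketch and lose their case.
    return tokens

def restore_case(tokens):
    """Invert `inline_case` by applying each case token to the preceding word."""
    words = []
    for tok in tokens:
        if tok == "<U>" and words:
            words[-1] = words[-1].upper()
        elif tok == "<T>" and words:
            words[-1] = words[-1].capitalize()
        elif tok not in CASE_TAGS:
            words.append(tok)
    return " ".join(words)
```

At the word level, "Best fries EVER" is encoded as `best <T> fries ever <U>`, and `restore_case` recovers the original string.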
Synthetic case noise. Another solution that we experiment with (see Section 6) is to inject noise on the source side of the training data by changing random source words to upper (5% chance), title (10%) or lower case (20%).
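As a sketch, this noise can be generated per word as follows. The probabilities are those given in the text; treating the three changes as mutually exclusive events drawn once per word, and the function name, are assumptions.

```python
import random

def add_case_noise(sentence, rng, p_upper=0.05, p_title=0.10, p_lower=0.20):
    """Randomly change the case of source words (one draw per word)."""
    noisy = []
    for word in sentence.split():
        r = rng.random()
        if r < p_upper:
            word = word.upper()
        elif r < p_upper + p_title:
            word = word.capitalize()
        elif r < p_upper + p_title + p_lower:
            word = word.lower()
        noisy.append(word)
    return " ".join(noisy)

# Example: only the source side of a training pair is modified.
rng = random.Random(0)
noisy_source = add_case_noise("Le cadre est super sympa", rng)
```

Since only the case changes, the noisy source is always equal to the original up to lowercasing, so the target side can be kept untouched.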

Natural noise
One way to make an NMT system more robust is to train it with some of the most common errors that can be found in the in-domain data. Like Berard et al. (2019), we detect the errors that occur naturally in the in-domain data and then apply them to our training corpus, while respecting their natural distribution. We call this "natural noise generation", in contrast to what is done in (Sperber et al., 2017; Belinkov and Bisk, 2018; Vaibhav et al., 2019) or in Section 4.2, where the noise is more synthetic.

Detecting errors
We compile a general-purpose French lexicon as a transducer (in Tamgu: https://github.com/naver/tamgu), implemented to be traversed with extended edit distance flags, similar to Mihov and Schulz (2004). Whenever a word is not found in the lexicon (which means that it is a potential spelling mistake), we look for a French word in the lexicon within a maximum edit distance of 2, with the following set of edit operations:
(1) deletion (e.g., apelle instead of appelle)
(2) insertion (e.g., appercevoir instead of apercevoir)
(3) constrained substitution on diacritics (e.g., mangè instead of mangé)
(4) swap counted as one operation (e.g., mnager instead of manger)
(5) substitution (e.g., menger instead of manger)
(6) repetitions (e.g., Merciiiii with a threshold of max 10 repetitions)
We apply the transducer to the French monolingual Foursquare data (close to 1M sentences) to detect and count noisy variants of known French words. This step produces a dictionary mapping the correct spelling to the list of observed errors and their respective frequencies.
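The lookup can be approximated with a brute-force Python sketch. This is a deliberate simplification and not the Tamgu transducer: it scans the whole lexicon, uses a plain edit distance with adjacent swaps (optimal string alignment) in which all substitutions cost the same, and collapses repetitions with a hypothetical threshold of three identical characters.

```python
import re

def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent swap (each cost 1)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent swap
        prev2, prev = prev, cur
    return prev[len(b)]

def collapse_repetitions(word):
    """Reduce runs of 3 or more identical characters to one (Merciiiii -> Merci)."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def find_corrections(word, lexicon, max_dist=2):
    """Return (distance, lexicon word) candidates for an out-of-lexicon word, closest first."""
    if word in lexicon:
        return []
    scored = ((osa_distance(word, entry), entry) for entry in lexicon)
    return sorted((d, entry) for d, entry in scored if d <= max_dist)
```

Counting, over the monolingual data, how often each noisy word maps to a given lexicon entry then yields the dictionary of errors and frequencies described in the text.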
In addition to automatically extracted spelling errors, we extract a set of common abbreviations from Seddah et al. (2012) and we manually identify a list of common errors in French: (7) Wrong verb endings (e.g., il a manger instead of il a mangé)

Generating errors. With this dictionary, describing the real error distribution in 4SQ text, we take our large out-of-domain training corpus, and randomly replace source-side words with one of their variants (rules 1 to 6), while respecting the frequency of this variant in the real data. We also manually define regular expressions to randomly apply rules 7 to 11 (e.g., "er "→"é ").
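The replacement step for rules 1 to 6 can be sketched in Python as follows. The dictionary layout, the function name and the per-word noise probability `p_noise` are assumptions (the text does not state the replacement rate); only the frequency-weighted choice of the variant follows the description.

```python
import random

def inject_natural_noise(sentence, error_dict, rng, p_noise=0.1):
    """Replace source words by observed misspellings, sampled by their frequency.

    error_dict maps a correct word to {noisy variant: count in in-domain data}.
    """
    out = []
    for word in sentence.split():
        variants = error_dict.get(word)
        if variants and rng.random() < p_noise:
            forms, counts = zip(*variants.items())
            word = rng.choices(forms, weights=counts, k=1)[0]
        out.append(word)
    return " ".join(out)

# Toy dictionary: "mnager" was observed three times as often as "menger".
errors = {"manger": {"mnager": 3, "menger": 1}}
noisy = inject_natural_noise("je veux manger", errors, random.Random(0), p_noise=1.0)
```

With this toy dictionary and `p_noise=1.0`, the last word is always replaced, by "mnager" about three times out of four.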

Domain Adaptation
To adapt our models to the restaurant review domain we apply the following types of techniques: back-translation of in-domain English data, finetuning with small amounts of in-domain parallel data, and domain tags.

Back-translation
Back-translation (BT) is a popular technique for domain adaptation when large amounts of in-domain monolingual data are available (Sennrich et al., 2016b). While our in-domain parallel corpus is small (12k pairs), Foursquare contains millions of English-language reviews. Thus, we train an NMT model in the reverse direction (EN→FR) and translate all the 4SQ English reviews to French. This gives a large synthetic parallel corpus. This in-domain data is concatenated to the out-of-domain parallel data and used for training. Previous work shows that doing back-translation with sampling instead of beam search brings large improvements due to increased diversity. Following this work, we test several settings:

Name       Description
BT-B       Back-translation with beam search.
BT-S       Back-translation with sampling.
BT-S × 3   Three different FR samplings for each EN sentence. This brings the size of the back-translated 4SQ closer to the out-of-domain corpus.
BT         No oversampling, but we sample a new version of the corpus for each training epoch.
We use a temperature of T = 1/0.9 to avoid the extremely noisy output obtained with T = 1 and strike a balance between quality and diversity.

Fine-tuning
When small amounts of in-domain parallel data are available, fine-tuning (FT) is often the preferred solution for domain adaptation (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016). It consists in training a model on out-of-domain data, and then continuing its training for a few epochs on the in-domain data only.

Kobus et al. (2017) propose a technique for multi-domain NMT, which consists in inserting a token in each source sequence specifying its domain. The system can learn the particularities of multiple domains (e.g., polysemous words that have a different meaning depending on the domain), which we can control at test time by manually setting the tag. Sennrich et al. (2016a) also use tags to control politeness in the model's output.

Corpus tags
As our corpus (see Section 6.1) is not clearly divided into domains, we apply the same technique as Kobus et al. (2017) but use corpus tags (each sub-corpus has its own tag: TED, Paracrawl, etc.) which we add to each source sequence. As in Berard et al. (2019), the 4SQ post-edited and back-translated data also get their own tags (PE and BT). The back-translated 4SQ data represents ≈15M sentences. This corpus is not available publicly, but the Yelp dataset (https://www.yelp.com/dataset) could be used instead.

Figure 1 gives an example where using the PE corpus tag at test time helps the model pick a more adequate translation.

Figure 1 (corpus tag and corresponding model output):
Corpus tag   Translation
TED          The map is too small.
Multi-UN     The card is too small.
PE           The menu is too small.
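Adding corpus tags amounts to a simple source-side preprocessing step, sketched here in Python. The `<TAG>` surface form, the function names and the example sentence pairs are hypothetical; the text only states that one tag per sub-corpus is added to each source sequence.

```python
def add_corpus_tag(source, corpus):
    """Prepend the corpus tag to a source sentence (tag format is an assumption)."""
    return f"<{corpus}> {source}"

def tag_corpora(corpora):
    """Yield (tagged source, target) pairs from a {corpus name: [(source, target), ...]} mapping."""
    for name, pairs in corpora.items():
        for src, tgt in pairs:
            yield add_corpus_tag(src, name), tgt

# Training: every sub-corpus gets its own tag (toy, made-up sentence pairs).
training_pairs = list(tag_corpora({
    "TED": [("Merci beaucoup.", "Thank you very much.")],
    "PE": [("Le cadre est sympa.", "The setting is nice.")],
}))

# Test time: the tag is set manually to steer the model towards the review domain.
test_input = add_corpus_tag("La carte est trop petite.", "PE")
```

The same source sentence can then be decoded with different tags to compare the translations that each sub-corpus style induces.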

Training data
After some initial work with the WMT 2014 data, we built a new training corpus named UGC (User Generated Content), closer to our domain, by combining: Multi UN, OpenSubtitles, Wikipedia, Books, Tatoeba, TED talks, ParaCrawl and Gourmet (see Table 3). Notably, UGC does not include Common Crawl (which contains a lot of misaligned sentences and caused hallucinations), but it includes OpenSubtitles (Lison and Tiedemann, 2016) (spoken-language, possibly closer to 4SQ). We observed an improvement of more than 1 BLEU on news-test 2014 when switching to UGC, and almost 6 BLEU on 4SQ-valid.

Pre-processing
We use langid.py (Lui and Baldwin, 2012) to filter sentence pairs from UGC. We also remove duplicate sentence pairs, and lines longer than 175 words or with a length ratio greater than 1.5 (see Table 3). Then we apply SentencePiece and our rare character handling strategy (Section 4.1). We use a joint BPE model of size 32k, trained on the concatenation of both sides of the corpus, and set SentencePiece's vocabulary threshold to 100. Finally, unless stated otherwise, we always use the inline casing approach (see Section 4.2).
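The deduplication and length filters can be sketched as follows. This is an illustration under assumptions: lengths are counted on whitespace-separated words, the ratio is taken as longer side over shorter side, and the language identification step is left out.

```python
def filter_pairs(pairs, max_len=175, max_ratio=1.5):
    """Drop duplicate pairs, overly long lines and pairs with a large length ratio."""
    seen = set()
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # empty side
        if max(n_src, n_tgt) > max_len:
            continue  # line longer than max_len words
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # length ratio too large
        if (src, tgt) in seen:
            continue  # exact duplicate pair
        seen.add((src, tgt))
        yield src, tgt
```

Note that short sentence pairs are strongly affected by a ratio of 1.5: a one-word source with a two-word target is already rejected under this reading of the filter.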

Model and settings
For all experiments, we use the Transformer Big (Vaswani et al., 2017) as implemented in Fairseq, with the hyperparameters of . Training is done on 8 GPUs, with accumulated gradients over 10 batches, and a max batch size of 3500 tokens (per GPU). We train for 20 epochs, while saving a checkpoint every 2500 updates (≈ 2/5 of an epoch on UGC) and average the 5 best checkpoints according to their perplexity on a validation set (a held-out subset of UGC).
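Two aspects of this setup can be made concrete with a short Python sketch: the maximum number of tokens per update implied by the settings, and the selection and averaging of the best checkpoints. Parameters are represented as plain lists of floats, and the function name is hypothetical; this is not Fairseq's own averaging script.

```python
# Upper bound on tokens per parameter update: GPUs x accumulated batches x tokens per batch.
MAX_TOKENS_PER_UPDATE = 8 * 10 * 3500  # 280000

def average_best_checkpoints(checkpoints, n_best=5):
    """Average the parameters of the `n_best` checkpoints with the lowest validation perplexity.

    checkpoints: list of (validation perplexity, {parameter name: list of floats}).
    """
    best = sorted(checkpoints, key=lambda c: c[0])[:n_best]
    names = best[0][1].keys()
    return {
        name: [sum(vals) / len(best) for vals in zip(*(params[name] for _, params in best))]
        for name in names
    }
```

Averaging is done element-wise, so all checkpoints must come from the same training run and share the same parameter shapes.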
For fine-tuning, we use a fixed learning rate, and a total batch size of 3500 tokens (training on a single GPU without delayed updates). To avoid overfitting on 4SQ-PE, we do early stopping according to perplexity on 4SQ-valid. For each fine-tuned model we test all 16 combinations of dropout in {0.1, 0.2, 0.3, 0.4} and learning rate in {1, 2, 5, 10} × 10^−5. We keep the model with the best perplexity on 4SQ-valid.
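The hyperparameter search is a plain grid search, which can be sketched as follows. The callable `fine_tune_fn` is a hypothetical stand-in for a fine-tuning run with early stopping that returns the perplexity on 4SQ-valid.

```python
from itertools import product

DROPOUTS = [0.1, 0.2, 0.3, 0.4]
LEARNING_RATES = [1e-5, 2e-5, 5e-5, 1e-4]

def select_best(fine_tune_fn):
    """Run all 16 (dropout, learning rate) combinations and keep the lowest validation perplexity."""
    results = []
    for dropout, lr in product(DROPOUTS, LEARNING_RATES):
        ppl = fine_tune_fn(dropout=dropout, lr=lr)  # hypothetical: returns 4SQ-valid perplexity
        results.append((ppl, dropout, lr))
    return min(results)
```

The function returns a `(perplexity, dropout, learning rate)` tuple for the selected configuration.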

Evaluation methodology
During our work, we used BLEU (Papineni et al., 2002) on news-valid (concatenation of news-test 2012 and 2013) to ensure that our models stayed good on a more general domain, and on 4SQ-valid to measure performance on the 4SQ domain.
For the sake of brevity, we only give the final BLEU scores on news-test 2014 and 4SQ-test. Scores on 4SQ-valid, and MTNT-test (for comparison with Michel and Neubig, 2018; Berard et al., 2019) are given in Appendix. We evaluate "detokenized" MT outputs against raw (non-tokenized) references using SacreBLEU (Post, 2018). In addition to BLEU, we do an indirect evaluation on an Aspect-Based Sentiment Analysis (ABSA) task, a human evaluation, and a task-related evaluation based on polysemous words.

Table 4 compares the case handling techniques presented in Section 4.2. To better evaluate the robustness of our models to changes of case, we built 3 synthetic test sets from 4SQ-test, with the same target, but all source words in upper, lower or title case. Inline and factored case perform equally well, significantly better than the default (cased) model, especially on all-uppercase inputs. Lowercasing the source is a good option, but gives a slightly lower score on regular 4SQ-test. Finally, synthetic case noise added to the source gives surprisingly good results. It could also be combined with factored or inline case.

Table 5 compares the baseline "inline case" model with the same model augmented with natural noise (Section 4.3). Performance is the same on 4SQ-test, but significantly better on news-test artificially augmented with 4SQ-like noise.

Table 6 shows the results of the back-translation (BT) techniques. Surprisingly, BT with beam search (BT-B) deteriorates BLEU scores on 4SQ-test, while BT with sampling gives a consistent improvement. BLEU scores on news-test are not significantly impacted, suggesting that BT can be used for domain adaptation without hurting quality on other domains.

As shown in Table 8, these techniques can be combined to achieve the best results. The natural noise does not have a significant effect on BLEU scores. Back-translation combined with fine-tuning gives the best performance on 4SQ (+4.5 BLEU vs UGC). However, using tags instead of fine-tuning strikes a better balance between general domain and in-domain performance.

Targeted evaluation
In this section we propose two metrics that target specific aspects of translation adequacy: translation accuracy of domain-specific polysemous words and Aspect-Based Sentiment Analysis performance on MT outputs.

Translation of polysemous words
We propose to count polysemous words specific to our domain, similarly to Lala and Specia (2018), to measure the degree of domain adaptation. TER between the translation hypotheses and the post-edited references in 4SQ-PE reveals the most common substitutions (e.g., "card" is often replaced with "menu", suggesting that "card" is a common mistranslation of the polysemous word "carte"). We filter this list manually to only keep words that are polysemous and that have a high frequency in the test set. Table 9 gives the 3 most frequent ones. Table 10 shows the accuracy of our models when translating these words. We see that the domain-adapted model is better at translating domain-specific polysemous words.
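One possible way to compute such an accuracy is sketched here in Python. The exact counting procedure is not spelled out in this section, so the criterion used (sentences whose source contains the French word and whose reference contains the in-domain translation; a hit if the hypothesis also contains it) is an assumption, as are the function names and the toy sentences.

```python
import re

def words(text):
    """Lowercased word tokens, punctuation stripped."""
    return re.findall(r"\w+", text.lower())

def polysemous_accuracy(sources, references, hypotheses, src_word, good_translation):
    """Share of relevant sentences whose hypothesis uses the in-domain translation."""
    total = correct = 0
    for src, ref, hyp in zip(sources, references, hypotheses):
        if src_word in words(src) and good_translation in words(ref):
            total += 1
            correct += good_translation in words(hyp)
    return correct / total if total else None

# Toy data (made up for illustration).
srcs = ["La carte est trop petite.", "La carte est super.", "Le cadre est joli."]
refs = ["The menu is too small.", "The menu is great.", "The setting is nice."]
hyps = ["The menu is too small.", "The card is great.", "The frame is nice."]
carte_accuracy = polysemous_accuracy(srcs, refs, hyps, "carte", "menu")
```

On this toy data, one of the two relevant "carte" sentences is translated with "menu".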

Indirect evaluation with sentiment analysis
We also measure adequacy by how well the translation preserves the polarity of the sentence regarding various aspects. To evaluate this, we perform an indirect evaluation on the SemEval 2016 Aspect-Based Sentiment Analysis (ABSA) task (Pontiki et al., 2016). We use our internal ABSA systems trained on English or French SemEval data. As shown in Table 11, translations obtained with domain-adapted models lead to significantly better scores on the ABSA task than the generic models.

Human Evaluation
We conduct a human evaluation to confirm the observations with BLEU and to overcome some of the limitations of this metric. We select 4 MT models for evaluation (see Table 12) and show their 4 outputs at once, sentence-by-sentence, to human judges, who are asked to rank them given the French source sentence in context (with the full review). For each pair of models, we count the number of wins, ties and losses, and apply the Wilcoxon signed-rank test.
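The pairwise counts can be derived from the collected rankings as in the following Python sketch. The data layout (one dictionary of model ranks per judged sentence, 1 being the best) and the function name are assumptions; the significance test itself is not implemented here.

```python
from itertools import combinations

def pairwise_outcomes(rankings):
    """Count (wins, ties, losses) of model A against model B for every pair of models.

    rankings: list of {model name: rank} dictionaries, where rank 1 is the best.
    """
    outcomes = {}
    for ranks in rankings:
        for a, b in combinations(sorted(ranks), 2):
            wins, ties, losses = outcomes.get((a, b), (0, 0, 0))
            if ranks[a] < ranks[b]:
                wins += 1
            elif ranks[a] == ranks[b]:
                ties += 1
            else:
                losses += 1
            outcomes[(a, b)] = (wins, ties, losses)
    return outcomes

# Toy judgments for two of the models (made up for illustration).
judgments = [
    {"Baseline": 2, "Tags": 1},
    {"Baseline": 1, "Tags": 1},
    {"Baseline": 3, "Tags": 2},
]
counts = pairwise_outcomes(judgments)
```

The paired ranks of two models can then be passed to a signed-rank test implementation, such as `scipy.stats.wilcoxon`, to obtain the p-value.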
We took the first 300 test sentences to create 6 tasks of 50 sentences each. Then we asked bilingual colleagues to rank the output of 4 models by their translation quality. They were asked to do one or more of these tasks. The judges did not know the list of models, nor the model that produced any given translation. We got 12 answers. The inter-judge Kappa coefficient ranged from 0.29 to 0.63, with an average of 0.47, which is a good value given the difficulty of the task. Table 12 gives the results of the evaluation, which confirm our observations with BLEU. We also did a larger-scale monolingual evaluation using Amazon Mechanical Turk (see Appendix), which led to similar conclusions.

Table 12: In-house human evaluation ("≫" means better with p ≤ 0.05). The 4 models Baseline, GT, Tags and Tags + noise correspond respectively to rows 2 (UGC with inline case), 3 (Google Translate), 6 (Combination of BT, PE and tags) and 8 (Same as 6 with natural noise) in Table 8.