Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer

We take the first step towards multilingual style transfer by creating and releasing XFORMAL, a benchmark of multiple formal reformulations of informal text in Brazilian Portuguese, French, and Italian. Results on XFORMAL suggest that state-of-the-art style transfer approaches perform close to simple baselines, indicating that style transfer is even more challenging in a multilingual setting.


Introduction
Style Transfer (ST) is the task of automatically transforming text in one style into another (for example, making an impolite request more polite). Most work in this growing field has focused primarily on style transfer within English, while other languages have received disproportionately little attention. Concretely, out of 35 ST papers we reviewed, all report results for ST within English text, while there is just a single work covering each of the following languages: Chinese, Russian, Latvian, Estonian, and French (Shang et al., 2019; Tikhonov et al., 2019; Korotkova et al., 2019; Niu et al., 2018). Notably, even where efforts have been made towards multilingual ST, researchers are limited to providing system outputs as a means of evaluation, and progress is hampered by the scarcity of resources for most languages. At the same time, ST lies at the core of human communication: when humans produce language, they condition their choice of grammatical and lexical transformations on a target audience and a specific situation. Among the many possible stylistic variations, Heylighen et al. (1999) argue that "a dimension similar to formality appears as the most important and universal feature distinguishing styles, registers or genres in different languages". Consider the informal excerpts and their formal reformulations in French (FR), Italian (IT), and Brazilian Portuguese (BR-PT) in Table 1.

[Table 1 examples: FR "Il ne prêtait pas attention à la situation." ("He was not paying attention to the situation."); IT "in bocca al lupo!" ("good luck!") rewritten as "Ti rivolgo un sincero augurio!" ("I send you a sincere wish!").]

Both informal-formal pairs share the same content. However, the informal language conveys more information than is contained in the literal meaning of the words (Hovy, 1987). These examples relate to the notion of deep formality (Heylighen et al., 1999), where the ultimate goal is that of adding the context needed to disambiguate an expression. On the other hand, variations in formality might simply reflect different situational and personal factors, as shown in the Italian (IT) example. This work takes a first step towards a more language-inclusive direction for the field of ST by building the first corpus of style transfer for non-English languages. In particular, we make the following contributions:
1. Building upon prior work on Formality Style Transfer (FoST) (Rao and Tetreault, 2018), we contribute an evaluation dataset, XFORMAL, that consists of multiple formal rewrites of informal sentences in three Romance languages: Brazilian Portuguese (BR-PT), French (FR), and Italian (IT);
2. Without assuming access to any gold-standard training data for the languages at hand, we benchmark a myriad of leading ST baselines through automatic and human evaluation methods. Our results show that FoST in non-English languages is particularly challenging, as complex neural models perform on par with a simple rule-based system consisting of handcrafted transformations.
We make XFORMAL, our annotation protocols, and analysis code publicly available and hope that this study facilitates and encourages more research towards multilingual ST.

Related Work
Controlling style aspects in generation tasks has been studied in monolingual settings with an English-centric focus (intra-language) and in cross-lingual settings together with Machine Translation (MT) (inter-language). Our work falls within intra-language ST, but with a multilingual focus, in contrast to prior work.
Intra-language ST was first cast as a generation task by Xu et al. (2012) and is addressed through methods that use either parallel data or unpaired corpora of different styles. Parallel corpora designed for the task at hand are used to train traditional encoder-decoder architectures (Rao and Tetreault, 2018), learn mappings between latent representations of different styles (Shang et al., 2019), or fine-tune pre-trained models (Wang et al., 2019). Other approaches use parallel data from similar tasks to facilitate transfer into the target style via domain adaptation (Li et al., 2019), multi-task learning (Niu et al., 2018; Niu and Carpuat, 2020), and zero-shot transfer (Korotkova et al., 2019), or create pseudo-parallel data via data augmentation techniques (Zhang et al., 2020; Krishna et al., 2020). Approaches that rely on non-parallel data include disentanglement methods based on the idea of learning style-agnostic latent representations (e.g., Shen et al. (2017); Hu et al. (2017)). However, such methods have recently been criticized for poor content preservation (Xu et al., 2018; Jin et al., 2019; Luo et al., 2019; Subramanian et al., 2018), and translation-based models using reconstruction and back-translation losses have been proposed as an alternative (e.g., Logeswaran et al. (2018); Prabhumoye et al. (2018)). Another line of work focuses on manipulation methods that remove the style-specific attributes of text (e.g., Li et al. (2018); Xu et al. (2018)), while recent approaches use reinforcement learning (e.g., Gong et al. (2019)), probabilistic formulations (He et al., 2020), and masked language models (Malmi et al., 2020).

XFORMAL Collection
We describe the process of collecting formal rewrites using data statement protocols (Bender and Friedman, 2018; Gebru et al., 2018).
Curation rationale To collect XFORMAL, we first curate informal excerpts in multiple languages. To this end, we follow the procedures described in Rao and Tetreault (2018) (henceforth RT18), who create a corpus of informal-formal sentence pairs in English (EN) entitled Grammarly's Yahoo Answers Formality Corpus (GYAFC).
Concretely, we use the L6 Yahoo! Answers corpus, which consists of questions and answers posted to the Yahoo! Answers platform. 2 The corpus contains a large amount of informal text and allows controlling for different languages and domains. 3 Similar to the collection of GYAFC, we extract all answers from the Family & Relationships (F&R) topic that correspond to the three languages of interest: Família e Relacionamentos (BR-PT), Relazioni e famiglia (IT), and Amour et relations (FR) (Step 1). We follow the same pre-processing steps as described in RT18 for consistency (Step 2). We filter out answers that: a) consist of questions; b) include URLs; c) have fewer than five or more than 25 tokens; or d) constitute duplicates. 4 We automatically extract informal candidate sentences, as described in §5.3 (Step 3). Finally, we randomly sample 1,000 sentences from the pool of informal candidates for each language. Table 2 presents statistics of the curation steps.
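For illustration, the Step 2 filters can be expressed in a few lines of Python. The snippet below is a minimal sketch, not the authors' released preprocessing code; it assumes whitespace tokenization, question detection via a question mark, and exact-match duplicate removal.

```python
import re

def filter_answers(answers, min_tokens=5, max_tokens=25):
    """Hypothetical helper mirroring the Step 2 filters described above."""
    seen = set()
    kept = []
    for text in answers:
        tokens = text.split()  # assumption: simple whitespace tokenization
        if "?" in text:                                      # a) looks like a question
            continue
        if re.search(r"https?://|www\.", text):              # b) contains a URL
            continue
        if not (min_tokens <= len(tokens) <= max_tokens):    # c) 5-25 token length filter
            continue
        if text in seen:                                      # d) exact duplicate
            continue
        seen.add(text)
        kept.append(text)
    return kept
```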
Procedures We use the Amazon Mechanical Turk (MTurk) platform to collect formal rewrites for our informal sentences. For each language, we split the annotation into 20 batches of 50 Human Intelligence Tasks (HITs). In each HIT, Turkers are given an informal excerpt and asked to generate its formal rewrite in the same language without changing its meaning. We collect 4 rewrites per excerpt and release detailed instructions under A.G.

Annotation Workflow & Quality Control
Our annotation protocol consists of multiple Quality Control (QC) steps to ensure the recruitment of high-quality annotators. As a first step, we use location restrictions (QC1) to limit the pool of workers to countries where native speakers are most likely to be found. Next, we run several small pilot studies (of 10 HITs) to recruit potential workers. To participate in the pilot study, Turkers have to pass a qualification test (QC2) consisting of multiple-choice questions that test workers' understanding of formality (see A.L). The pilot study results are reviewed by a native speaker of each language (QC3) to exclude workers who performed consistently poorly. We find that the two main reasons for poor quality are: a) rewrites with only minimal edits, and b) rewrites that change the input's meaning. Table 3 presents the number of workers at each QC step. Only workers passing all quality control steps (last row of Table 3) contribute to the final task. Finally, we post-process the collected rewrites by a) removing instances consisting of normalization-based edits only and b) correcting minor spelling errors using an off-the-shelf tool. 5

Turkers' demographics We recruit Turkers from Brazil, France/Canada, and Italy for BR-PT, FR, and IT, respectively. Beyond their country of residence, no further information is available.
Compensation We compensate at a rate of $0.10 per HIT, with additional one-time bonuses that bump the effective rate to over $10/hour.
After this entire process, we have constructed a high-quality corpus of formality rewrites of 1,000 sentences for three languages. In the next section, we provide statistics and an analysis of XFORMAL.

XFORMAL Statistics & Analysis
Types of formality edits Following Pavlick and Tetreault (2016), we analyze the most frequent edit operations Turkers perform when formalizing the informal sentences. We conduct both an automatic analysis (details in A.I) of the whole set of rewrites and a human analysis (details in A.H) of a random sample of 200 rewrites per language (we recruited a native speaker for each language). Table 4 presents the results of both analyses, where we also include the corresponding statistics for English (GYAFC). In general, we observe similar trends across languages: humans make edits covering both the "noisy-text" sense of formality (e.g., fixing punctuation, spelling errors, capitalization) and the more situational sense (paraphrase-based edits). Although cross-language trends are similar, we also observe differences: deleting fillers and word completion seem to be more prominent in the English rewrites than in the other languages; normalizing abbreviations is a considerably frequent edit type for Brazilian Portuguese; paraphrasing is more frequent in the three non-English languages.

Surface differences of informal-formal pairs
We quantify surface-level differences between the informal sentences and formal rewrites by computing their character-level Levenshtein distance (Figure 1) and their pairwise Lexical Difference (LeD), based on the percentage of tokens that are not found in both sentences (Table 5). Both analyses show that Italian rewrites have the most edits compared to the other languages.

Diversity of rewrites Given that a large number of reformulations consist of paraphrase-based edits (more than 50%), we want to quantify the extent to which the formal rewrites of each sentence are diverse in terms of their lexical choices. To that end, we measure diversity via self-BLEU (Zhu et al., 2018): considering one set of formal sentences as the hypothesis set and the others as references, we compute BLEU for each formal set and define the average BLEU score as a measure of the dataset's diversity. Higher scores imply less diversity. Results (last row of Table 4) show that XFORMAL consists of more diverse rewrites compared to GYAFC.
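Both measures are straightforward to compute. The sketch below is a hedged illustration, not the authors' analysis code: it assumes whitespace tokenization, reads the LeD definition as the share of non-shared token types, and uses sacrebleu for BLEU scoring; the helper names are hypothetical.

```python
import sacrebleu

def lexical_difference(src, tgt):
    """Pairwise Lexical Difference (LeD): percentage of token types not shared
    by the two sentences (one plausible reading of the definition above)."""
    s, t = set(src.split()), set(tgt.split())
    union = s | t
    return 100.0 * len(union - (s & t)) / max(len(union), 1)

def self_bleu(rewrite_sets):
    """Self-BLEU over K parallel sets of formal rewrites: rewrite_sets[k][i] is the
    k-th rewrite of sentence i. Each set is scored against the remaining sets as
    references; the average corpus BLEU is returned (higher = less diverse)."""
    scores = []
    for k, hyps in enumerate(rewrite_sets):
        refs = [r for j, r in enumerate(rewrite_sets) if j != k]
        scores.append(sacrebleu.corpus_bleu(hyps, refs).score)
    return sum(scores) / len(scores)
```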
Formality shift of rewrites We compare the formality distribution of the original informal sentences with that of their formal rewrites in GYAFC and XFORMAL, as predicted by mBERT formality models (§5.3). The distributions of formal rewrites are skewed towards positive values (Figure 2).

Multilingual FoST Experiments
We benchmark eight ST models on XFORMAL to serve as baseline scores for future research. We describe the models (§5.1), the experimental setting (§5.2), the human and automatic evaluation methods (§5.3 and §5.4), and results (§5.5).

Models
Simple baselines We define three baselines:
1. COPY Motivated by Pang and Gimpel (2019), who notice that untransferred sentences with no alterations achieve the highest BLEU score by a large margin for ST tasks, we use this simple baseline as a lower bound;
2. RULE-BASED Based on the quantitative analysis of §4 and similarly to RT18, we develop a rule-based approach that performs a set of predefined edit operations defined by handcrafted rules. Example transformations include fixing casing, removing repeated punctuation, and expanding contractions using a handcrafted list; a detailed description is found in A.C;
3. ROUND-TRIP MT Inspired by Zhang et al. (2020), who identify useful training pairs among the paraphrases generated by round-trip translations of millions of sentences, we devise a simpler baseline that starts from a text in language x, pivots to EN, and then back-translates to x, using the AWS translation service. 6

NMT-based models with synthetic parallel data We follow the TRANSLATE TRAIN (Conneau et al.; Artetxe et al., 2020) approach to collect data in multilingual settings: we obtain pseudo-parallel corpora in each language by machine translating an EN resource of informal-formal pairs (§5.2). 7 Then, starting with TRANSLATE TRAIN, we benchmark the following NMT-based models:
1. TRANSLATE TRAIN TAG extends a leading EN FoST approach (Niu et al., 2018) and trains a unified model that handles either formality direction via attaching a source tag that denotes the desired target formality;
2. MULTI-TASK TAG-STYLE (Niu et al., 2018) augments the previous approach with bilingual data that is automatically identified as formal (§5.3). The models are then trained in a multi-task fashion;
3. BACKTRANSLATE augments the TRANSLATE TRAIN data with back-translated sentences of automatically detected informal text (Sennrich et al., 2016b), using 1. as the base model. We exclude back-translated pairs consisting of copies.
The output of the RULE-BASED system is given as input to each model at inference time. For all three models, we run each system with 4 random seeds and combine them in a linear ensemble for decoding.
Unsupervised approaches We benchmark two unsupervised methods that have been used for EN ST:
1. UNSUPERVISED NEURAL MACHINE TRANSLATION (Subramanian et al., 2018) defines a pseudo-supervised setting and combines denoising auto-encoding and back-translation losses;
2. DEEP LATENT SEQUENCE MODEL (DLSM) (He et al., 2020) defines a probabilistic generative story that treats two unpaired corpora of separate styles as a partially observed parallel corpus and learns a mapping between them, using variational inference.
6 https://aws.amazon.com/translate/
7 Details on the AWS performance are found in A.B.

Experimental setting
Training data For TRANSLATE TRAIN TAG we use GYAFC, a large set of 110K EN informal-formal parallel sentence pairs obtained through crowdsourcing. Additionally, we augment the translated resource with OpenSubtitles (Lison and Tiedemann, 2016) bilingual data used for training MT models. 8 Given that bilingual sentence pairs can be noisy, we perform a filtering step to discard noisy bitexts using the Bicleaner toolkit (Sánchez-Cartagena et al.). 9 Furthermore, we apply the same filtering steps as in §3 (Curation rationale). Finally, each of the remaining sentences is assigned a formality score (§5.3), resulting in two pools of informal and formal text. Training instances are then randomly sampled from those pools: formal parallel pairs are used for MULTI-TASK TAG-STYLE; informal target-side sentences are back-translated for BACKTRANSLATE; both informal and formal target-side texts are independently sampled from the two pools for training unsupervised models. Finally, for unsupervised FoST in FR, we additionally experiment with in-domain data from the L26 French Yahoo! Answers Corpus, which consists of 1.7M FR questions. 10, 11 Table 6 includes statistics on training sizes. 12
9 We use the publicly available pretrained Bicleaner models: https://github.com/bitextor/bicleaner, and discard sentences with a score lower than 0.5.
10 https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=74
11 Split into 6.2M/6.6M formal/informal sentences.
12 Bilingual data statistics are in A.K.
Preprocessing We preprocess data consistently across languages using MOSES (Koehn et al., 2007). Our pipeline consists of three steps: a) normalization; b) tokenization; c) true-casing. For NMT-based approaches, we also learn a joint source-target BPE with 32K operations (Sennrich et al., 2016b).
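The three-step pipeline can be reproduced approximately with off-the-shelf Python ports of the Moses scripts. The sketch below is a hedged illustration using sacremoses and subword-nmt as stand-ins for the original MOSES tooling; the file and model names are hypothetical and the exact scripts/options used by the authors may differ.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesTruecaser

def preprocess(lines, lang, truecase_model):
    """Illustrative pipeline: a) normalization, b) tokenization, c) true-casing."""
    normalizer = MosesPunctNormalizer(lang=lang)
    tokenizer = MosesTokenizer(lang=lang)
    truecaser = MosesTruecaser(load_from=truecase_model)  # assumes a pre-trained truecasing model
    out = []
    for line in lines:
        line = normalizer.normalize(line)                   # a) normalization
        line = tokenizer.tokenize(line, return_str=True)    # b) tokenization
        line = truecaser.truecase(line, return_str=True)    # c) true-casing
        out.append(line)
    return out

# A joint source-target BPE with 32K merges could then be learned with subword-nmt, e.g.:
#   subword-nmt learn-bpe -s 32000 < train.src-tgt.tok > bpe.codes
#   subword-nmt apply-bpe -c bpe.codes < train.tok > train.bpe
```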
Model Implementations For the NMT-based and unsupervised models we use the open-sourced implementations of Niu et al. (2018) and He et al. (2020), respectively.

Automatic Evaluation
Recent work on ST evaluation highlights the lack of standard evaluation practices (Yamshchikov et al., 2020; Pang, 2019; Pang and Gimpel, 2019; Mir et al., 2019). We follow the evaluation metrics most frequently used in EN tasks and measure the quality of system outputs along four dimensions, while leaving an extensive evaluation of automatic metrics for future work.

Fluency We compute the perplexity of system outputs under a language model trained on a 2M random sample of the non-English side of the OpenSubtitles formal data.

Meaning Preservation
Overall We compute multi-BLEU (Post, 2018) by comparing system outputs against the multiple formal rewrites in XFORMAL. Freitag et al. (2020) show that correlation with human judgments improves when multiple references are considered in MT evaluation.
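Multi-reference BLEU is directly supported by sacrebleu. The snippet below is a small usage sketch; the file names are hypothetical placeholders for line-aligned system outputs and the four rewrite sets.

```python
import sacrebleu

# Hypothetical layout: one system output file and four reference files,
# each holding one formal rewrite per line, aligned with the system outputs.
hyps = [line.strip() for line in open("system.out", encoding="utf-8")]
refs = [
    [line.strip() for line in open(f"xformal.ref{k}", encoding="utf-8")]
    for k in range(4)
]

# corpus_bleu accepts several reference streams, so all four rewrites
# are used jointly when scoring each output sentence.
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"multi-BLEU: {bleu.score:.1f}")
```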

Human evaluation
Given that automatic evaluation of ST lacks standard evaluation practices, even when EN is considered, we turn to human evaluation to reliably assess our baselines, following the protocols of RT18. We sample a subset of 100 sentences from XFORMAL per language, evaluate the outputs of 5 systems, and collect 5 judgments per instance. We open the task to all workers passing QC2 in Table 3. We include inter-annotator agreement results in A.E.
Formality We collect formality ratings for the original informal reference, the formal human rewrite, and the formal system outputs on a 7-point discrete scale of −3 to 3, following Lahiri (2015) (Very Informal → Informal → Somewhat Informal → Neutral → Somewhat Formal → Formal → Very Formal).
Fluency We collect fluency ratings for the original informal reference, the formal human rewrite, and the formal system outputs on a discrete scale of 1 to 5, following Heilman et al. (2014).
Meaning Preservation We adopt the annotation scheme of Semantic Textual Similarity (Agirre et al., 2016): given the informal reference and formal human rewrite or the formal system outputs, Turkers rate the two sentences' similarity on a 1 to 6 scale (Completely dissimilar → Not equivalent but on same topic → Not equivalent but share some details → Roughly equivalent → Mostly equivalent → Completely equivalent).
Overall We collect overall judgments of the system outputs using relative ranking: given the informal reference and a formal human rewrite, workers are asked to rank system outputs in order of their overall formality, taking into account both fluency and meaning preservation. An overall score is then computed for each model by averaging results across annotated instances.

Table 7 shows automatic results for all models across the four dimensions as well as human ratings for selected top models.

Results
NMT-based model evaluation The RULE-BASED baselines are significantly (p < 0.05) the best performing models in terms of meaning preservation across languages. This result is intuitive, as the RULE-BASED models act at the surface level and are unlikely to change the informal sentence's meaning. The BACKTRANSLATE ensemble systems are the second-best performing models in terms of meaning preservation, while the ROUND-TRIP MT outputs diverge semantically from the informal sentences the most. Those results are consistent across languages and human/automatic evaluations.

On the other hand, when we compare systems in terms of their formality, we observe the opposite pattern: the RULE-BASED and BACKTRANSLATE outputs are the most informal compared to the other ensemble NMT-based approaches across languages. Interestingly, the ROUND-TRIP MT outputs exhibit the largest formality shift for BR-PT and FR as measured by human evaluation. The trade-off between meaning preservation and formality among models was also observed in EN (RT18).

Moreover, when we move to fluency, we notice similar results across systems. Specifically, human evaluation assigns almost all models an average score of > 4, denoting that system outputs are comprehensible on average, with the small differences between systems not being statistically significant. Notably, perplexity tells a different story: all system outputs are scored significantly better than the RULE-BASED systems across configurations and languages. This result suggests that perplexity might not be a reliable metric for measuring fluency in this setting, as noted in Mir et al. (2019) and Krishna et al. (2020).

When it comes to the overall ranking of systems, we observe that the NMT-based ensembles are better than the RULE-BASED baselines for BR-PT and FR, yet by a small margin, as denoted by both multi-BLEU and human evaluation. For IT, however, there is no clear win, and the NMT-based ensembles still fail to surpass the naive RULE-BASED models, albeit by a small margin. Finally, all ensembles outperform the trivial COPY baseline. Table 8 presents examples of system outputs.

As a side note, we followed the recommendation of Tikhonov and Yamshchikov (2018) to report the performance of ST models over individual runs and to better visualize trade-offs between metrics. Unlike their work, which found that reruns of the same model showed wide performance discrepancies, we found that most of our NMT-based models did not vary in performance on XFORMAL. The results can be visualized in A.M.
Unsupervised model evaluation We also benchmark the unsupervised models but focus solely on automatic metrics, since they lag behind their supervised counterparts. As shown in Table 7, when using out-of-domain data (e.g., OpenSubtitles) for training, the models perform worse than their NMT counterparts across all three languages. The difference is most stark when considering self-BLEU and multi-BLEU scores. However, given access to large in-domain corpora (e.g., the L26 Yahoo! French Answers corpus), the gap between the two model classes closes, with DLSM achieving a multi-BLEU score of 42.1 compared to 48.3 for the best performing NMT model, BACKTRANSLATE. This shows the promise of unsupervised methods on multilingual ST tasks, assuming a large amount of in-domain data.
Lexical differences of system outputs Finally, in Figure 3 we analyze the diversity of outputs using LeD scores resulting from pair-wise comparisons of different NMT systems. A larger LeD score denotes a larger difference between the lexical choices of the two systems under comparison. First, we observe that the ROUND-TRIP MT outputs have the smallest lexical overlap with the informal input sentences. However, when this observation is examined together with the human evaluation results, we conclude that the large number of lexical edits comes at the cost of diverging semantically from the input sentences. Moreover, we observe that the average lexical differences among NMT-based systems are small. This indicates that the different systems perform similar edit operations that do not deviate much from the input sentence in terms of lexical choices. This is unfortunate given that multilingual FoST requires systems to perform more phrase-based operations, as shown in the analysis in §4.
Evaluation Metric While evaluating evaluation metrics is not a goal of this work (though the data can be used for that purpose), we observe that the top models identified by the automatic metrics generally align with the top models identified by humans. While promising, further work is required to confirm if the automatic measures really do correlate with human judgments.

Conclusions & Future Directions
This work extends the task of formality style transfer to a multilingual setting. Specifically, we contribute XFORMAL, an evaluation testbed consisting of informal sentences and multiple formal rewrites spanning three languages: BR-PT, FR, and IT. As in Rao and Tetreault (2018), we find that Turkers can be effective in creating high-quality ST corpora. In contrast to the aforementioned EN corpus, we find that the rewrites in XFORMAL tend to be more diverse, making the task more challenging. Additionally, inspired by work on cross-lingual transfer and EN FoST, we benchmark several methods and perform automatic and human evaluations of their outputs. We find that NMT-based ensembles are the best performing models for FR and BR-PT, a result consistent with EN; however, they perform comparably to a naive RULE-BASED baseline for IT. To further facilitate reproducibility of our evaluations and corpus creation processes, as well as drive future work, we will release our scripts, rule-based baselines, source data, and annotation templates, on top of the release of XFORMAL.

Our results open several avenues for future work in terms of benchmarking and evaluating FoST in a more language-inclusive direction. Notably, current supervised and unsupervised approaches for EN FoST rely on parallel in-domain data (with the latter treating the parallel set as two unpaired corpora) that are not available in most languages. We suggest that benchmarking FoST models in multilingual settings will help understand their ability to generalize and lead to safer conclusions when comparing approaches. At the same time, multilingual FoST calls for more language-inclusive consideration of automatic evaluation metrics. Model-based approaches have recently been proposed for evaluating different aspects of ST. However, most of them rely heavily on English resources or pretrained models. How those methods can be extended to multilingual settings and how we evaluate their performance remain open questions.

Ethical Considerations
Finally, we address ethical considerations for dataset papers given that our work proposes a new corpus XFORMAL. We reply to the relevant questions posed in the NAACL 2021 Ethics FAQ. 16

Dataset Rights
The underlying data for our dataset, as well as for training our FoST models and formality classifiers, come from the Yahoo! Answers L6 dataset. We were granted written permission by Yahoo (now Verizon) to make the resulting dataset public for academic use.

Dataset Collection Process
Turkers are paid over 10 USD an hour. We targeted a rate higher than the US national minimum wage of 7.50 USD given discussions with other researchers who use crowdsourcing. We include more information on collection procedures in §3.

IRB Approval
This question is not applicable for our work.

Dataset Characteristics
We follow Bender and Friedman (2018) and Gebru et al. (2018) and report characteristics in §3 and §4.

A Does translation preserve formality?
We examine the extent to which machine translation (through the AWS service) affects the formality level of an input sentence: starting from a set of English sentences for which we have formality judgments (ORIGINAL-EN), we perform a round-trip translation via pivoting through an auxiliary language (PIVOT-X). We then compare the formality prediction scores of the English formality regression model for the two versions of the English input. In terms of Spearman correlation, the model's performance drops by 7.5 points on average when tested on round-trip translations.
To better understand what causes this drop in performance, we present a per-formality-bin analysis in Table 9. On average, we observe that translation preserves the formality level of formal sentences considerably well, while it tends to shift the formality level of informal sentences towards formal values (by a margin smaller than 1 point) most of the time. To account for this formalization effect of translation, we draw the line between formal and informal sentences at a value of −1 for scores predicted by the multilingual regression models. This decision is based on the following intuition: if the formality shift of machine-translated informal sentences is around +1, the propagation of English formality labels imposes a negative shift on formal sentences in the model's predictions.
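In practice the shifted decision boundary amounts to a one-line rule. The snippet below is a hypothetical helper illustrating how sentences could be split into formal/informal pools under this −1 threshold; it is a sketch, not the authors' released code.

```python
def split_by_formality(sentences, scores, threshold=-1.0):
    """Split sentences into formal/informal pools using the shifted boundary
    described above: a predicted score above -1 counts as formal, compensating
    for the ~+1 formalization effect of (round-trip) translation."""
    formal, informal = [], []
    for sent, score in zip(sentences, scores):
        (formal if score > threshold else informal).append(sent)
    return formal, informal
```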

B Amazon Web Service details
We compute the performance of the AWS system on 2.5K randomly sampled sentences from OpenSubtitles as a sanity check of translation quality. We report BLEU scores of 37.16 (BR-PT), 33.79 (FR), and 32.67 (IT).
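For reference, the ROUND-TRIP MT baseline of §5.1 only needs two calls to the AWS Translate API per sentence. The sketch below uses boto3 and assumes AWS credentials are already configured; it is an illustrative reconstruction, not the authors' implementation.

```python
import boto3

translate = boto3.client("translate")  # assumes AWS credentials/region are configured

def round_trip(text, lang):
    """Pivot a sentence in language `lang` (e.g., "fr", "it", "pt") through English
    and back, mirroring the ROUND-TRIP MT baseline."""
    to_en = translate.translate_text(
        Text=text, SourceLanguageCode=lang, TargetLanguageCode="en"
    )["TranslatedText"]
    back = translate.translate_text(
        Text=to_en, SourceLanguageCode="en", TargetLanguageCode=lang
    )["TranslatedText"]
    return back
```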

C Rule-based baselines
We develop a set of rules to automatically make an informal sentence more formal by performing surface-level edits, similar to the EN rule-based system of Rao and Tetreault (2018). The rules are shared across languages, with the only difference being the list of abbreviations:
Normalize punctuation We remove punctuation symbols that are repeated two or more times.
Normalize casing Several sentences contain words that are written entirely in upper case. We lower-case all characters apart from the first character of the first word.
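The following is a minimal sketch of such a rule-based formalizer, combining the two rules above with abbreviation expansion from a hand-crafted list. The tiny abbreviation dictionary and function name are illustrative only; the released system applies larger, per-language lists and may order the rules differently.

```python
import re

# Hypothetical, tiny abbreviation list for illustration; the actual per-language
# lists are hand-crafted and considerably larger.
ABBREVIATIONS = {"fr": {"bcp": "beaucoup", "pk": "pourquoi"}}

def rule_based_formalize(sentence, lang="fr"):
    """Sketch of the RULE-BASED baseline: normalize repeated punctuation,
    expand abbreviations, and normalize casing."""
    # Remove punctuation symbols repeated two or more times.
    sentence = re.sub(r"([!?.,])\1+", r"\1", sentence)
    # Expand abbreviations using the hand-crafted list (case-insensitive lookup).
    abbr = ABBREVIATIONS.get(lang, {})
    tokens = [abbr.get(tok.lower(), tok) for tok in sentence.split()]
    sentence = " ".join(tokens)
    # Lower-case all characters apart from the first character of the first word.
    sentence = sentence[:1].upper() + sentence[1:].lower()
    return sentence

print(rule_based_formalize("bcp de chance!!!"))  # -> "Beaucoup de chance!"
```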

D NMT architecture details
We used the NMT implementations of Niu et al. (2018) that are publicly available: https://github.com/xingniu/multitask-ft-fsmt. The NMT models are implemented as bi-directional LSTMs in Sockeye (Hieber et al., 2018), using the same configurations across languages to establish fair comparisons. We use single-layer LSTMs with a hidden size of 512, multilayer perceptron attention with a layer size of 512, and word representations of size 512. We apply layer normalization and tie the source and target embeddings as well as the output layer's weight matrix. We add dropout with probability 0.2 (for the embeddings and the LSTM cells in both the encoder and the decoder). For training, we use the Adam optimizer with a batch size of 64 sentences and checkpoint the model every 1,000 updates. Training stops after 8 checkpoints without improvement in validation perplexity. For decoding, we use a beam size of 5.

E Inter-annotator agreement on human evaluation
To quantify inter-annotator agreement for the tasks of formality, meaning preservation, and fluency, we measure the correlation of the ordinal ratings using intra-class correlation (ICC) and their categorical agreement using a variation of Cohen's κ coefficient. For the latter, given that we collect human evaluation judgments through crowd-sourcing, we follow the simulation framework of Pavlick and Tetreault (2016): we simulate two annotators (Annotator 1 and Annotator 2) by randomly choosing one annotator's judgment for a given instance as the rating of Annotator 1 and taking the mean of the remaining judgments as the rating of Annotator 2. We then compute Cohen's κ for these two simulated annotators. We repeat this process 1,000 times, and report the median and standard deviation of the results. For measuring agreement on the overall ranking evaluation task, we use the same simulation framework and report Kendall's τ. Table 10 presents inter-annotator agreement results on human evaluation across evaluation aspects and languages.

Table 9: Average formality scores of human annotations (GOLD-STANDARD) and model predictions on the original and round-trip translated PT16 test set, grouped in 6 bins of varying formality. ∆ gives the average formality shift of the English sentences resulting from round-trip translation. Scores in blue and red indicate that the mean is above and below zero, respectively. * denotes statistically significant formality shifts with p < 0.05. Translation preserves the formality of formal sentences while informal sentences exhibit a shift towards formal values.
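The simulation protocol can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: ratings are integers, the mean of the remaining judgments is rounded back to the ordinal scale before computing κ, and the input is a mapping from each evaluated instance to its list of ratings.

```python
import random
import numpy as np
from sklearn.metrics import cohen_kappa_score

def simulated_kappa(ratings_per_item, n_rounds=1000, seed=0):
    """Simulated two-annotator Cohen's kappa (median and std over n_rounds repeats)."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(n_rounds):
        ann1, ann2 = [], []
        for ratings in ratings_per_item.values():
            pick = rng.randrange(len(ratings))
            ann1.append(ratings[pick])                       # Annotator 1: one random judgment
            rest = ratings[:pick] + ratings[pick + 1:]
            # Annotator 2: mean of the remaining judgments, rounded to the ordinal scale.
            ann2.append(int(round(sum(rest) / len(rest))))
        kappas.append(cohen_kappa_score(ann1, ann2))
    return float(np.median(kappas)), float(np.std(kappas))
```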

H Qualitative analysis of XFORMAL
In the following, we include the instructions given to native speakers for the qualitative analysis of XFORMAL, as described in §4.
Background context In this task you will be asked to judge the quality of formal rewrites of a set of informal sentences. To give more background context, your work will serve as a quality check over annotations obtained from the Amazon Mechanical Turk (AMT) platform. AMT workers were given an informal sentence, e.g., "I'd say it is punk though", and then asked to provide us with its formal rewrite while maintaining the original content and being grammatical/fluent, e.g., "However, I do believe it to be punk". In our work, workers were presented with sentences in French, Italian, and Brazilian Portuguese. For our analysis we want to know a) whether the quality of the collected annotations is good (Task 1), and b) what types of edits workers performed when formalizing the input sentence (Task 2). Both tasks consist of the same 200 informal-formal sentence pairs and can be performed either in parallel (e.g., judging a single informal-formal sentence pair both in terms of quality and types of edits at the same time; Tasks 1 and 2 in the Google sheet) or individually (Task 1, Task 2 in the Google sheet). More information and examples for the two tasks are included below.
Task 1 Given an informal sentence you are asked to assess the quality of its formal rewrite. For each sentence pair type excellent, acceptable, or poor under the rate column. Read the instructions below before performing the task! What constitutes a good formal rewrite?
• The style of the rewrite should be formal
• The overall meaning of the informal sentence should be preserved

• The rewrite should be fluent
How should I interpret the provided options? Below we include detailed instructions on how to interpret the provided options.
• Excellent the rewrite is formal, fluent and the original meaning is maintained. There is very little that could be done to make a better rewrite.
• Acceptable the rewrite is generally an improvement upon the original but there are some minor issues (for example, the rewrite contains a typo or misses transforming some informal parts of the sentence).
• Poor the rewrite offers a marginal or no improvement over the original. There are many aspects that could be improved.
Task 2 In this task the goal is to characterize the types of edits workers made while formalizing the informal input sentence. For each informal-formal sentence pair you should check the boxes for all edit types that apply. Note that: a) Multiple types of edits might hold at the same time (e.g., for the informal-formal pair "Lexus cars are awesome!"-"Lexus cars are very nice." you should check both the paraphrase and the punctuation boxes); b) At the end of the Google sheet there is a column named 'Other' where you are welcome to write down in plain text any additional edit type you observed that is not covered by the existing classes. The provided classes are: capitalization, punctuation, paraphrase, delete fillers, completion, add context, contractions, spelling, normalization, slang/idioms, politeness, split sentences, and relativizers.

I Quantitative analysis of XFORMAL
In the following, we include more details on the automatic analysis procedures.
Capitalization A rewrite performs a capitalization edit if it contains tokens that appear in lowercase letters in the informal text but in capital letters in the formal text.
Punctuation A rewrite contains a punctuation edit if the punctuation of the informal and formal texts differs.
Spelling We identify spelling errors based on the character-based Levenshtein distance between informal-formal tokens.
Normalization We identify normalization edits based on a hand-crafted list of abbreviations for each language.
Split sentences We split sentences using the NLTK toolkit.
Paraphrase A formal rewrite is considered to contain a paraphrase edit if the token level Levenshtein distance between the informal-formal text is greater than 3 tokens.
Lowercase A rewrite performs a lowercasing edit if it contains tokens that appear in capital letters in the informal text but in lowercase letters in the formal text.
Repetition We identify repetition tokens using regular expressions (a token that appears more than 3 times in a row is considered a repetition.)
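A few of these heuristics can be implemented compactly. The sketch below is illustrative only and assumes whitespace tokenization; the thresholds follow the descriptions above, while the helper names and the exact set of covered heuristics (punctuation, paraphrase, repetition) are simplifications of the full analysis code.

```python
import re
import string

def levenshtein(a, b):
    """Plain dynamic-programming edit distance over two sequences (characters or tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def detect_edit_types(informal, formal):
    """Flag a subset of the edit types using the heuristics described above."""
    inf_toks, for_toks = informal.split(), formal.split()
    edits = set()
    # Punctuation: the punctuation characters used in the two texts differ.
    punct = lambda s: [c for c in s if c in string.punctuation]
    if punct(informal) != punct(formal):
        edits.add("punctuation")
    # Paraphrase: token-level Levenshtein distance greater than 3 tokens.
    if levenshtein(inf_toks, for_toks) > 3:
        edits.add("paraphrase")
    # Repetition: a token repeated more than 3 times in a row in the informal text.
    if re.search(r"\b(\w+)( \1){3,}\b", informal):
        edits.add("repetition")
    return edits
```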

J XFORMAL: Data Quality
We ask a native speaker to judge the quality of a sample of 200 rewrites for each language by choosing one of the following three options: "excellent", "acceptable", and "poor". The details of this analysis are included in A.H (Task 1). Results indicate that, on average, fewer than 10% of the rewrites were identified as poor quality across languages, while more than 40% were identified as excellent. The small number of rewrites identified as "poor" consists mostly of edits where humans add context not appearing in the original sentence. We choose not to exclude any of the rewrites from the dataset as we provide multiple reformulations of each informal instance.

K OpenSubtitles data
[Table: OpenSubtitles preprocessing statistics per language.]

L Qualification test

Table 13 presents the questions and answers used in QC2 of our annotation protocol. Turkers have to score 80 or above to participate in the task. To compute an average score for each test, we count an answer as incorrect if it deviates by more than 1 point from the gold-standard score given in the second column of Table 13.
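The scoring rule can be summarized in a few lines. The snippet below is a hypothetical sketch: it assumes the test score is the percentage of answers within 1 point of the gold rating, and the example ratings are invented for illustration.

```python
def qualification_score(answers, gold, tolerance=1):
    """Score a QC2 test: an answer counts as correct if it lies within `tolerance`
    points of the gold rating; the score is assumed to be the percentage of correct
    answers (pass threshold: 80)."""
    correct = sum(abs(a - g) <= tolerance for a, g in zip(answers, gold))
    return 100.0 * correct / len(gold)

# Illustrative (invented) ratings: 4 of 5 answers within 1 point of gold -> score 80, pass.
passed = qualification_score([2, -1, 3, 0, 1], [2, 0, 3, 1, -1]) >= 80
```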

M Trade-off plots
Trade-off plots Figure 4 presents trade-off plots of multi-BLEU vs. formality score and fluency, across different reruns, as proposed by Tikhonov and Yamshchikov (2018). First, we observe that models exhibit small variations in terms of formality and fluency scores across reruns, and larger variations in BLEU in most cases. Notably, the single-seed BACKTRANSLATE systems are the most consistent across reruns for all metrics and languages. Furthermore, in almost all cases, models trained on 2M data perform better than the naive COPY baseline across metrics. However, single-seed models fail to consistently outperform the RULE-BASED baselines in almost all cases, with the exception of BACKTRANSLATE for BR-PT and FR, which report an improvement of about 1 BLEU point.

N Compute time & Infrastructure
All experiments for benchmarking both NMT-based and unsupervised approaches use Amazon EC2 P3 instances: https://aws.amazon.com/