Parallel Data Augmentation for Formality Style Transfer

The main barrier to progress in the task of Formality Style Transfer is the inadequacy of training data. In this paper, we study how to augment parallel data for this task and propose novel yet simple data augmentation methods to obtain useful sentence pairs with easily accessible models and systems. Experiments demonstrate that our augmented parallel data substantially improves formality style transfer when used to pre-train the model, leading to state-of-the-art results on the GYAFC benchmark dataset.


Introduction
Formality style transfer (FST) is defined as the task of automatically transforming a piece of text in one particular formality style into another (Rao and Tetreault, 2018). For example, given an informal sentence, FST aims to preserve the style-independent content and output a formal sentence.
Previous work tends to leverage neural networks (Xu et al., 2019; Niu et al., 2018; Wang et al., 2019) such as seq2seq models to address this challenge due to their powerful capability and large improvement over traditional rule-based approaches (Rao and Tetreault, 2018). However, the performance of the neural approaches is still limited by the inadequacy of training data: the public parallel corpus for FST training, GYAFC (Rao and Tetreault, 2018), contains only approximately 100K sentence pairs, which can hardly satisfy the data needs of neural models with millions of parameters.
To tackle the data sparsity problem for FST, we propose three specific data augmentation methods to help improve the model's generalization ability and reduce the overfitting risk. Besides applying the widely used back translation (BT) method (Sennrich et al., 2016a) from Machine Translation (MT) to FST, our data augmentation methods include formality discrimination (F-Dis) and multi-task transfer (M-Task). Both are novel and effective in generating parallel data that introduces additional formality transfer knowledge that cannot be derived from the original training data. Specifically, F-Dis identifies useful pairs from the paraphrased pairs generated by cross-lingual MT, while M-Task leverages the training data of the Grammatical Error Correction (GEC) task to improve formality, as shown in Figure 1.
Experimental results show that our proposed data augmentation methods can harvest large amounts of augmented parallel data for FST. The augmented parallel data proves helpful: it significantly improves formality style transfer when used to pre-train the model, allowing the model to achieve state-of-the-art results on the GYAFC benchmark dataset. We study three data augmentation methods for formality style transfer: back translation, formality discrimination, and multi-task transfer. We focus on informal→formal style transfer since it is more practical in real application scenarios.

Back translation
The original idea of back translation (BT) (Sennrich et al., 2016a) is to train a target-to-source seq2seq model (Sutskever et al., 2014; Cho et al., 2014) and use it to generate source-language sentences from target-side monolingual sentences, establishing synthetic parallel sentences. We generalize this as our basic data augmentation method: we use the original parallel data to train a seq2seq model in the formal-to-informal direction. We can then feed formal sentences to this model, which should be capable of generating their informal counterparts. The formal input and informal output sentences are paired to establish augmented parallel data.
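As a minimal sketch, the BT augmentation step amounts to pairing each real formal sentence with the model's synthetic informal output. Here `formal_to_informal` is a hypothetical stand-in for the trained formal-to-informal seq2seq model, not the paper's actual code:

```python
def augment_by_back_translation(formal_sentences, formal_to_informal):
    """Pair each formal sentence with a synthetic informal counterpart.

    `formal_to_informal` is any callable standing in for the
    formal->informal seq2seq model described above (hypothetical).
    Returns (informal, formal) pairs, oriented for training an
    informal->formal model: the synthetic sentence is the source side
    and the real formal sentence is the target side.
    """
    pairs = []
    for formal in formal_sentences:
        informal = formal_to_informal(formal)  # synthetic source side
        pairs.append((informal, formal))       # real sentence as target
    return pairs
```

Only the target side of each pair is real text, which is why BT improves generalization without adding new rewrite knowledge.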

Formality discrimination
Based on the observation that an informal sentence tends to become formal after a round-trip translation by MT models, which are mainly trained on formal text such as news, we propose a novel method called formality discrimination to generate formal rewrites of informal source sentences by means of cross-lingual MT models.
A typical example is shown in Figure 2.
To this end, we collect a number of potentially informal English sentences (e.g., from online forums). Formally, we denote the collected sentences as S = {s_i}, i = 1, ..., |S|, where s_i represents the i-th sentence. We first translate them into a pivot language (e.g., French) and then translate them back into English, as Figure 2 shows. In this way, we obtain a rewritten sentence s'_i for each sentence s_i ∈ S.
To verify whether s'_i improves the formality compared with s_i, we introduce a formality discriminator, in our case a Convolutional Neural Network (CNN), to quantify the formality level of a sentence. We train the formality discriminator with the sentences and their formality labels in the FST corpus (e.g., GYAFC). The pairs (s_i, s'_i) where s'_i largely improves the formality of s_i are selected as the augmented data. The resulting data set T_aug is:

T_aug = {(s_i, s'_i) | s_i ∈ S, P+(s'_i) − P+(s_i) > σ}

where P+(x) is the probability of sentence x being formal, predicted by the discriminator, and σ is the threshold for augmented data selection. In this way, we can obtain much helpful parallel data with valuable rewriting knowledge that is not covered by the original parallel data.
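The F-Dis selection step can be sketched as follows. Both `round_trip` (pivot-language translation there and back) and `p_formal` (the CNN discriminator's P+(x)) are hypothetical callables standing in for the models described above:

```python
def formality_discrimination(sentences, round_trip, p_formal, sigma=0.5):
    """Keep round-trip paraphrases that clearly improve formality.

    `round_trip` stands in for cross-lingual round-trip MT
    (e.g., English -> French -> English); `p_formal` stands in for the
    discriminator's probability P+(x) of x being formal. Both are
    hypothetical placeholders. A pair (s, s') is kept when
    P+(s') - P+(s) > sigma, mirroring the selection criterion above.
    """
    augmented = []
    for s in sentences:
        s_prime = round_trip(s)
        if p_formal(s_prime) - p_formal(s) > sigma:
            augmented.append((s, s_prime))
    return augmented
```

The threshold σ trades off data quantity against the confidence that each kept pair really encodes an informal→formal rewrite.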

Multi-task transfer
In addition to back translation and formality discrimination, which use artificially generated sentence pairs for data augmentation, we introduce multi-task transfer, which uses annotated sentence pairs from other seq2seq tasks. We observe that informal texts are usually ungrammatical while formal texts are mostly grammatically correct. Therefore, a desirable FST model should possess the ability to detect and rewrite ungrammatical texts. This has been verified by a previous empirical study (Ge et al., 2019) showing that using a state-of-the-art grammatical error correction (GEC) model to post-process the outputs of an FST model can improve the result. Inspired by this observation, we propose to transfer knowledge from GEC to FST by leveraging the GEC training data as augmented parallel data to help improve formality. An example is illustrated in Figure 1, in which the annotated data for GEC provides knowledge to help the model rewrite the ungrammatical informal sentence.

Pre-training with Augmented Data
In general, massive augmented parallel data can help a seq2seq model better learn contextualized representations, sentence generation, and source-target alignments. When augmented parallel data is available, previous studies (Sennrich et al., 2016a; Edunov et al., 2018; Karakanta et al., 2018; Wang et al., 2018) for seq2seq tasks are inclined to train a seq2seq model with the original training data and the augmented data simultaneously. However, augmented data is usually noisier and less valuable than the original training data. In simultaneous training, the massive augmented data tends to overwhelm the original data and introduce unnecessary and even erroneous editing knowledge, which is undesirable for our task.
To better exploit the augmented data, we propose to first pre-train the model with the augmented parallel data and then fine-tune it with the original training data. In our pre-training & fine-tuning (PT&FT) approach, the augmented data is not treated equally to the original data; instead, it only serves as prior knowledge that can be updated and even overwritten during the fine-tuning phase. In this way, the model can better learn from the original data without being overwhelmed or distracted by the augmented data. Moreover, separating the augmented and original data into different training phases makes the model more tolerant to noise in the augmented data, which reduces the quality requirement for the augmented data and enables the model to use noisier augmented data and even training data from other tasks.
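The two-phase schedule can be sketched as a toy training driver. `model` and `train_epoch` are hypothetical placeholders for a seq2seq model and one epoch of seq2seq training; only the ordering of the phases matters:

```python
def pretrain_and_finetune(model, augmented_data, original_data,
                          train_epoch, pretrain_epochs=1, finetune_epochs=1):
    """PT&FT: train on augmented data first, then on original data.

    `model` and `train_epoch` are hypothetical stand-ins (not the paper's
    code). The augmented data is seen only in the first phase, so it acts
    as prior knowledge that fine-tuning on the original data can update
    or overwrite, rather than competing with the original data in a
    single mixed training run.
    """
    for _ in range(pretrain_epochs):
        model = train_epoch(model, augmented_data)   # pre-training phase
    for _ in range(finetune_epochs):
        model = train_epoch(model, original_data)    # fine-tuning phase
    return model
```

This contrasts with simultaneous training, where both data sources would be mixed inside every epoch.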

Experiments
In this section, we present the experimental settings and the related experimental results.

Experimental Settings
We use the GYAFC benchmark dataset (Rao and Tetreault, 2018) for training and evaluation. GYAFC's training split contains a total of 110K annotated informal-formal parallel sentence pairs, which are annotated via crowd-sourcing in two domains: Entertainment & Music (E&M) and Family & Relationships (F&R). In its test split, there are 1,146 and 1,332 informal sentences in the E&M and F&R domains respectively, and each informal sentence has 4 reference formal rewrites.

We use all three data augmentation methods we introduced and obtain a total of 4.9M augmented pairs. Among them, 1.6M are generated by back-translating (BT) formal sentences identified (as formal) by the formality discriminator in the E&M and F&R domains of the Yahoo Answers L6 corpus, 1.5M are derived by formality discrimination (F-Dis) using French, German and Chinese as pivot languages, and 1.8M are from multi-task transfer (M-Task) from the public GEC data (Lang-8 (Mizumoto et al., 2011; Tajiri et al., 2012) and NUCLE (Dahlmeier et al., 2013)). The informal sentences used in the F-Dis strategy are also from the Yahoo Answers L6 corpus.

We use the Transformer (base) (Vaswani et al., 2017) as the seq2seq model with a shared vocabulary of 20K BPE (Sennrich et al., 2016b) tokens. We adopt the Adam optimizer to pre-train the model with the augmented parallel data and then fine-tune it with the original parallel data. In pre-training, the dropout rate is set to 0.1 and the learning rate is set to 0.0005 with 8,000 warmup steps, scheduled to an inverse square root decay after warmup; during fine-tuning, the learning rate is set to 0.00025. We pre-train the model for 80K steps and fine-tune it for a total of 15K steps. The CNN we use as the formality discriminator has filter sizes of 3, 4 and 5 with 100 feature maps each, and a dropout rate of 0.5. It achieves an accuracy of 93.09% on the GYAFC test set.

As Table 1 shows, PT&FT exploits the augmented data in the pre-training phase and treats it as prior knowledge supplementary to the original training data, reducing the negative effects of the augmented data and improving the results.
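The warmup-then-inverse-square-root learning rate schedule described above can be sketched as follows. The paper states the base rate and warmup steps but not the exact formula, so this assumes the common convention (linear ramp to the base rate, then decay proportional to 1/sqrt(step)):

```python
def inverse_sqrt_lr(step, base_lr=0.0005, warmup_steps=8000):
    """Learning rate with linear warmup and inverse-square-root decay.

    Assumed convention (a sketch, not the paper's exact code):
    - during warmup, lr ramps linearly from 0 to base_lr;
    - after warmup, lr = base_lr * sqrt(warmup_steps / step),
      so the curve is continuous at step == warmup_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps / step) ** 0.5
```

Under this convention the rate peaks at 0.0005 at step 8,000 and decays to half that value by step 32,000.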

Effect of Proposed Approach
Table 2 compares the results of the different data augmentation methods with PT&FT. Pre-training with augmented data generated by BT enhances the generalization ability of the model, so we observe an improvement over the baseline. However, it does not introduce any new informal-to-formal transfer knowledge, leading to the smallest improvement among the three methods. In contrast, both F-Dis and M-Task introduce abundant transfer knowledge for FST. The augmented data of F-Dis includes various informal→formal rewrite knowledge derived from the MT models, allowing the model to better handle test instances whose patterns are never seen in the original training data, while M-Task introduces GEC knowledge that helps improve formality in terms of grammar.
We then combine all of this beneficial augmented data for pre-training. As expected, the combination strategy achieves further improvement, as shown in Table 2, since it enables the model to take advantage of all the data augmentation methods.

Comparison with State-of-the-Art Results
We compare our approach to the following previous approaches on the GYAFC benchmark:
• Rule, PBMT, NMT, PBMT-NMT: rule-based, phrase-based MT, NMT, and PBMT-NMT hybrid models (Rao and Tetreault, 2018).
• GPT-CAT, GPT-Ensemble: fine-tuned encoder-decoder models (Wang et al., 2019) initialized with GPT (Radford et al., 2019). Specifically, GPT-CAT concatenates the original input sentence and the input sentence preprocessed by rules as its input, while GPT-Ensemble is an ensemble of two GPT-based encoder-decoder models: one takes the original input sentence as input, and the other takes the preprocessed sentence as input.
Following Niu et al. (2018), we train 4 independent models with different initializations for ensemble decoding. According to Table 3, our single model performs comparably to the state-of-the-art GPT-based encoder-decoder models (more than 200M parameters) with only 54M parameters. Our ensemble model further advances the state-of-the-art result with a model size merely comparable to a single GPT-based model (i.e., GPT-CAT).
We also conduct a human evaluation. Following Rao and Tetreault (2018), we assess the model output on three criteria: formality, fluency and meaning preservation. We compare our baseline model trained with the original data, our best performing model, and the previous state-of-the-art models (NMT-MTL and GPT-CAT). We randomly sample 300 items; each item includes an input and four outputs that are shuffled to anonymize model identities. Two annotators are asked to rate the outputs on a discrete scale of 0 to 2. More details can be found in the appendix. The results are shown in Table 4, which demonstrates that our model is consistently well rated in human evaluation.

Analysis of Pivot Languages in Formality Discrimination
We also conduct an exploratory study of the pivot languages used in formality discrimination. Among the three pivot languages (i.e., French, German and Chinese) in our experiments, it is interesting to observe a significant difference in the sizes of the obtained parallel data given the same source sentences and filter threshold, as shown in Table 5.
Using Chinese as the pivot language yields the most data, probably because Chinese and English belong to different language families. The informal register of the original English sentences tends to be lost during translation, which makes it easier for the MT system to translate the Chinese back into formal English. In contrast, French and German have much in common with English, especially French in terms of the lexicon (Baugh and Cable, 1993). The translated sentences are thus likely to retain an informal tone, which hinders the MT system from generating formal English translations.
We compare the performance with augmented data generated by the three pivot languages separately in Table 6. Manual inspection reveals that a few pairs in all three sets suffer from meaning inconsistency, which mainly arises from the translation difficulties caused by omissions and poor grammaticality in informal sentences, as well as segmentation ambiguity in some pivot languages like Chinese. Among the three languages, the Chinese-based augmented data introduces more noise due to the additional segmentation ambiguity problem but still brings a fair improvement because of its largest size. In contrast, the German-based augmented data has relatively high quality and a moderate size, leading to the best result in our experiments.
Related Work

For text style transfer, due to the lack of parallel data, many studies focus on unsupervised approaches (Luo et al., 2019; Wu et al., 2019; Zhang et al., 2018a), and there is little related work concerning data augmentation. As a result, most recent work (Jhamtani et al., 2017; Xu et al., 2012) that models text style transfer as MT suffers from a lack of parallel data for training, which seriously limits the performance of powerful models. To address this pain point, we propose novel data augmentation methods and study the best way to utilize the augmented data, which not only achieves success in formality style transfer but may also be inspiring for other text style transfer tasks.

Conclusion
In this paper, we propose novel data augmentation methods for formality style transfer. These methods can effectively generate diverse augmented data with various formality style transfer knowledge. Used to pre-train the model, the augmented data significantly improves performance and leads to state-of-the-art results on the formality style transfer benchmark dataset.

A Details of Human Evaluation
We describe the grading standard for the three criteria we present in the main paper for FST: formality, fluency and meaning preservation. The outputs are rated on a discrete scale of 0 to 2. We hire two annotators who majored in Linguistics and hold Bachelor's degrees.

Formality Given the informal source sentence and an output, the annotators are asked to rate the output according to its level of formality improvement, regardless of fluency and meaning.
If the output shows significant formality improvement over the input, it will be rated 2 points. If the output is just slightly more formal than the input, it will be rated 1 point. If the output shows no improvement in formality, or even decreases it, it will be rated 0 points.
Fluency Given the outputs, the annotators are asked to evaluate the fluency of each sentence in isolation. A sentence is considered fluent if it makes sense and is grammatically correct. Sentences satisfying these requirements will be rated 2 points; sentences with minor errors will be rated 1 point; if the errors lead to confusing meaning, the sentence will be rated 0 points.

Meaning preservation Given the output sentence and the corresponding source sentence, the annotators are asked to estimate how much of the input's information is preserved in the output. If the output and the input convey exactly the same idea, the corresponding system gets 2 points. If they are mostly equivalent but differ in some trivial details, the system gets 1 point. If the output omits important details that affect the sentence's meaning, the system gets no credit.
For inter-annotator agreement, we calculate the Pearson correlation coefficient between the two annotators for each of the three criteria. The Pearson correlation for the formality criterion is 0.62; for fluency and meaning preservation, the correlations are 0.69 and 0.61, respectively.
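The agreement statistic above is the standard Pearson correlation between the two annotators' rating vectors, which can be computed directly (the rating lists in the usage below are illustrative, not the paper's data):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance numerator and the two standard-deviation factors.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)
```

Identical rating vectors give r = 1.0; perfectly reversed ratings give r = -1.0.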

Figure 1: An example of how Formality Style Transfer (FST) benefits from data augmented via formality discrimination (F-Dis) and multi-task transfer (M-Task). The mapping knowledge indicated by the color (blue→pink) in the FST test instance occurs in the pairs augmented by F-Dis and M-Task. F-Dis identifies useful sentence pairs from paraphrased sentence pairs generated by cross-lingual MT, while M-Task utilizes training data from GEC to help improve formality.

Figure 2: Formality discrimination for FST. The numbers following the sentences are formality scores predicted by a formality discriminator. The pair (connected by the red dashed arrow) that obtains significant formality improvement is kept as augmented data.

Table 1 :
The comparison of simultaneous training (ST) and pre-training & fine-tuning (PT&FT). Down-sampling and up-sampling are used to balance the sizes of the augmented data and the original data: down-sampling samples a subset of the augmented data, while up-sampling increases the frequency of the original data.

Table 3 :
The comparison of our approach to the stateof-the-art results.* denotes the ensemble results.

Table 4 :
Results of human evaluation of FST.Scores marked with */ † are significantly different from the scores of Original data / NMT-MTL (p < 0.05 in significance test).

Table 5 :
The sizes of augmented datasets generated by F-Dis based on different pivot languages.