Harnessing Pre-Trained Neural Networks with Rules for Formality Style Transfer

Formality text style transfer plays an important role in various NLP applications, such as non-native speaker assistants and child education. Early studies normalized informal sentences with rules, before statistical and neural models became the prevailing methods in the field. While a rule-based system is still a common preprocessing step for formality style transfer in the neural era, it can introduce noise if the rules are applied naively, e.g., as plain data preprocessing. To mitigate this problem, we study how to harness rules within a state-of-the-art neural network that is typically pretrained on massive corpora. We propose three fine-tuning methods in this paper and achieve a new state of the art on benchmark datasets.


Introduction
Text formality research is essential for a wide range of NLP applications, such as non-native speaker assistants and child education. Owing to the progress of deep learning techniques, researchers have taken a step from formality understanding toward formality-aware text generation. Recently, Rao and Tetreault (2018) published the Grammarly's Yahoo Answers Formality Corpus (GYAFC), a benchmark dataset for formality style transfer, which aims to generate a formal sentence given an informal one while preserving its semantic meaning.
Since the GYAFC dataset is small, existing studies have realized the importance of rules as a preprocessing step for informal text, typically handling capitalization (e.g., "ARE YOU KIDDING ME?"), character repetition (e.g., "noooo"), slang words (e.g., "wanna"), etc. While rule-based preprocessing could largely simplify the formality style transfer task, we observe that it also introduces noise, with an example shown in Table 1. Given a sentence in all capital letters, a common rule-based method would lowercase all characters except the first one. Some entities, such as R & B, are then incorrectly changed to lower case, especially if they are not recognized as proper nouns. Another intuition of ours is that, due to the small size of the parallel corpus, it would be beneficial to leverage a large neural network that is pretrained on a massive corpus and has learned general knowledge of language; we could then fine-tune it on the formality style transfer task. (Our code and output are available at: https://github.com/jimth001/formality_emnlp19.git)
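For illustration, a minimal sketch of such rules is shown below; the regular expressions and the tiny slang lexicon here are simplified stand-ins rather than a full rule set. Note how the all-caps rule reproduces exactly the kind of error discussed above.

```python
import re

# Illustrative slang lexicon; a real rule set would be much larger.
SLANG = {"wanna": "want to", "gonna": "going to", "u": "you"}

def rule_normalize(sentence: str) -> str:
    """Toy sketch of the rule-based preprocessing described above."""
    # Expand slang words via dictionary lookup.
    tokens = [SLANG.get(t.lower(), t) for t in sentence.split()]
    sentence = " ".join(tokens)
    # Collapse character repetitions, e.g., "noooo" -> "no".
    sentence = re.sub(r"(.)\1{2,}", r"\1", sentence)
    # Lowercase all-caps shouting, keeping only the first letter capitalized.
    # Note: this is exactly where entities such as "R & B" get corrupted.
    if sentence.isupper():
        sentence = sentence.capitalize()
    return sentence

print(rule_normalize("ARE YOU KIDDING ME?"))   # Are you kidding me?
print(rule_normalize("noooo, I wanna stay"))   # no, I want to stay
```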
To this end, we study in this paper how to effectively incorporate pretrained networks, particularly the powerful GPT-2 model (Radford et al., 2019), with simple rules for formality style transfer. We analyze three ways of harnessing rules in GPT-2: 1) we feed the concatenation of the original informal sentence and the preprocessed one to the encoder; 2) we ensemble two models at the inference stage, one taking the original informal text as input and the other taking the rule-preprocessed text; and 3) we employ two encoders to encode the original informal text and the rule-preprocessed text separately, and develop a hierarchical attention mechanism at both the word and sentence levels to aggregate information. Our work differs from previous work, which only feeds preprocessed inputs to the encoder. Rather, we are able to preserve more information of the original sentence, and the rule-based system is harnessed in a learnable way. Experimental results show that our method outperforms direct fine-tuning of GPT-2 by 2 BLEU points, and previously published results by 1.8-2.8 points in different domains of the GYAFC dataset.

Related Work
In the past few years, style-transfer generation has attracted increasing attention in NLP research. Early work transfers between modern English and the Shakespearean style with a phrase-based machine translation system (Xu et al., 2012). More recently, style transfer has been recognized as a controllable text generation problem (Hu et al., 2017), where the style may be designated as sentiment (Fu et al., 2018), tense (Hu et al., 2017), or even general syntax (Bao et al., 2019). In the above approaches, the training sentences are labeled with style information, but no parallel data are given. Xu et al. (2019a) take one step further and capture the most salient style by detecting global variance in a purely unsupervised manner (i.e., style labels are unknown).
Formality style transfer is mostly driven by the GYAFC parallel corpus. Since a parallel corpus, albeit small, is available, formality style transfer usually takes a seq2seq-like approach (Rao and Tetreault, 2018; Niu et al., 2018a; Xu et al., 2019b). In particular, this paper focuses on harnessing pre-trained neural networks with rule-based systems.

Approach
We implement our encoder and decoder with GPT blocks, and initialize them with the pretrained GPT-2 parameters (Radford et al., 2019). As illustrated in Figure 1, a decoder GPT block attends to the context words and the previously generated words with the same multi-head attention layer, which is slightly different from the classic Transformer (Vaswani et al., 2017). Formally, the output of the attention layer is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q\,[K_{\mathrm{ctx}}; K]^{\top}}{\sqrt{d_k}}\right)[V_{\mathrm{ctx}}; V] \quad (1)$$

where $Q$, $K$, and $V$ are defined the same as in the scaled dot-product attention of the Transformer, $K_{\mathrm{ctx}}$ and $V_{\mathrm{ctx}}$ denote the keys and values of the context words, and $d_k$ is a scaling factor. $[\cdot\,;\cdot]$ is a concatenation operation; it allows the layer to consider context words and previous decoding results jointly. Such an architecture makes it possible to adapt GPT-2, a Transformer-based pretrained language model, into a sequence-to-sequence model without re-initializing the parameters.
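A single-head NumPy sketch of Eq. (1) is shown below; multi-head projections and causal masking are omitted for brevity, and the names K_ctx and V_ctx follow the notation above rather than the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(Q, K, V, K_ctx, V_ctx):
    """Single-head sketch of Eq. (1): queries from the decoded sequence
    attend over the concatenation [;] of context and decoder keys/values.
    Causal masking over the decoded positions is omitted for brevity."""
    K_all = np.concatenate([K_ctx, K], axis=0)
    V_all = np.concatenate([V_ctx, V], axis=0)
    d_k = Q.shape[-1]
    scores = Q @ K_all.T / np.sqrt(d_k)   # scaled dot-product
    return softmax(scores) @ V_all

# Tiny example: 2 decoded positions, 3 context positions, d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 4)) for _ in range(3))
K_ctx, V_ctx = (rng.normal(size=(3, 4)) for _ in range(2))
print(joint_attention(Q, K, V, K_ctx, V_ctx).shape)  # (2, 4)
```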
In the following, we describe several methods combining the GPT-based encoder-decoder model with preprocessing rules and (limited) parallel data.
Fine-Tuning with Preprocessed Text as Input. Given an informal sentence $x_i$ as input, the most straightforward method, perhaps, is to first convert $x_i$ to $x'_i$ by rules, and then fine-tune the pretrained GPT model with the parallel data $\{(x'_i, y_i)\}_{i=0}^{M}$ ($M$ being the number of samples). In this way, informal sentences are normalized with rules before the neural network is applied. This simplifies the task and is standard in previous studies of formality style transfer (Rao and Tetreault, 2018).

However, the preprocessed sentence serves as a Markov blanket, i.e., the system is unaware of the original sentence once the preprocessed one is given. This is in fact not desired, since the rule-based system could make mistakes and introduce noise (Table 1).
Fine-Tuning with Concatenation. To alleviate the above issue, we feed the encoder with both the original sentence $x_i$ and the preprocessed one $x'_i$. We concatenate the words of $x_i$ and $x'_i$ with a special token EOS in between, forming a long sequence; the concatenated sequence and the corresponding formal reference then serve as a parallel text pair to fine-tune the GPT model. In this way, our model can make use of the rule-based system but also recognize its errors during the fine-tuning stage.
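For illustration, constructing such a training pair may look as follows; the literal "<EOS>" separator string here is a simplified stand-in for the actual special token.

```python
def build_cat_example(original, preprocessed, reference, eos_token="<EOS>"):
    """Form one GPT-CAT training pair: the source side is the original
    sentence and its rule-preprocessed version joined by a separator."""
    source = f"{original} {eos_token} {preprocessed}"
    return source, reference

src, tgt = build_cat_example(
    original="noooo, I wanna stay",
    preprocessed="no, I want to stay",   # output of the rule-based step
    reference="No, I want to stay.",     # formal reference (GYAFC-style)
)
print(src)  # noooo, I wanna stay <EOS> no, I want to stay
```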
Decoder Ensemble. We investigate how the model performs if we train two GPTs with $\{(x_i, y_i)\}_{i=0}^{M}$ and $\{(x'_i, y_i)\}_{i=0}^{M}$ separately, but combine them by model ensemble in the decoding phase. We denote the generative probabilities of the $j$th word by $h(x_i, y_{i,<j})$ and $h'(x'_i, y_{i,<j})$. We apply "average voting," and the resulting predictive probability is

$$p(y_{i,j}) = \tfrac{1}{2}\left[h(x_i, y_{i,<j}) + h'(x'_i, y_{i,<j})\right].$$

Hierarchical Attention. In our final variant, we use two encoders to encode $x_i$ and $x'_i$ separately, but compute a hierarchical attention to aggregate information, given by

$$\tilde{c}_l = \alpha\, c_l + \beta\, c'_l \quad (2)$$

where $c_l$ and $c'_l$ are the word-level attention results over the two encoders, and $\alpha$ and $\beta$ are sentence attention weights for each decoding step, computed by

$$[\alpha, \beta] = \mathrm{softmax}\!\left([h_l^{\top} W z_1,\ h_l^{\top} W z_2]\right).$$

Here, $h_l$ is the hidden state of the $l$th step at the decoder, $W$ is a learnable parameter, and $z_1$ and $z_2$ represent the last hidden states of the two encoders, respectively. We propose this variant in hopes of combining the information of $x_i$ and $x'_i$ during the training stage and in a learnable way, as opposed to the decoder ensemble.
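A NumPy sketch of the ensemble voting and of the sentence-level aggregation in Eq. (2) is given below; this is a simplified illustration consistent with the equations above, not the exact released code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble_step(p, p_prime):
    """Average voting over the two models' next-word distributions."""
    return 0.5 * (p + p_prime)

def hierarchical_aggregate(h_l, c, c_prime, z1, z2, W):
    """Sentence-level attention of Eq. (2): weight the word-level context
    vectors c (original input) and c_prime (rule-preprocessed input) by
    how relevant each encoder is to the current decoder state h_l."""
    scores = np.array([h_l @ W @ z1, h_l @ W @ z2])
    alpha, beta = softmax(scores)
    return alpha * c + beta * c_prime

d = 8
rng = np.random.default_rng(1)
h_l, c, c_prime, z1, z2 = (rng.normal(size=d) for _ in range(5))
W = rng.normal(size=(d, d))
print(hierarchical_aggregate(h_l, c, c_prime, z1, z2, W).shape)  # (8,)
```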

Setup
We evaluate our methods on the benchmark dataset, the Grammarly's Yahoo Answers Formality Corpus (GYAFC; Rao and Tetreault, 2018), which comprises two domains: Entertainment & Music (E&M) and Family & Relationships (F&R). We implement our model with Tensorflow 1.12.0 and take the pretrained GPT-2 model (117M) released by OpenAI to initialize our encoder and decoder. We use the Adam algorithm (Kingma and Ba, 2015) to train our model with a batch size of 128. We set the learning rate to 0.001 and stop training if the validation loss increases in two successive epochs.
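For illustration, the early-stopping criterion can be sketched as follows; this is framework-agnostic, and `train_one_epoch` and `validation_loss` are hypothetical hooks standing in for the actual training code.

```python
def fine_tune(model, train_one_epoch, validation_loss, max_epochs=50):
    """Stop once the validation loss has increased in two successive epochs."""
    prev_loss = float("inf")
    bad_epochs = 0
    for _ in range(max_epochs):
        train_one_epoch(model)              # Adam, lr=0.001, batch size 128
        loss = validation_loss(model)
        bad_epochs = bad_epochs + 1 if loss > prev_loss else 0
        if bad_epochs == 2:                 # two successive increases
            break
        prev_loss = loss
    return model
```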

Competing Methods
We compare our model with the following stateof-the-art methods in previous studies.
Rule-Based: We follow Rao and Tetreault (2018) and create a set of rules to convert informal texts to formal ones. Without industrial-strength engineering, our rule-based system achieves performance slightly lower than (but similar to) that of Rao and Tetreault (2018).
NMT-Baseline: An RNN-based Seq2Seq model with the attention mechanism (Bahdanau et al., 2015) is trained to predict formal texts, given rule-preprocessed informal text.
PBMT-Combined: Similar to NMT, this baseline trains a traditional phrase-based machine translation (PBMT) system, also taking the preprocessed text as input. Then, self-training (Ueffing, 2006) is applied with an unlabeled in-domain dataset for further improvement.
NMT-Combined: This method uses back-translation (Sennrich et al., 2016) with the PBMT-Combined system to synthesize a pseudo-parallel corpus. A Seq2Seq model is then trained on the combination of the pseudo-parallel and parallel corpora.
Note that the above baselines are reported by Rao and Tetreault (2018).
Transformer-Combined: This setting in Xu et al. (2019b) is the same as NMT-Combined, except that it employs Transformer (Vaswani et al., 2017) as the encoder and decoder.
JTHTA: Xu et al. (2019b) propose a bidirectional framework that can transfer formality from formal to informal or from informal to formal with one single encoder-decoder component. They jointly optimize the model against various losses and call it Joint Training with Hybrid Textual Annotation (JTHTA).
Bi-directional-FT: Niu et al. (2018b) merge the training data of the two domains and leverage data borrowed from machine translation to train their models in a multi-task learning scheme, also applying model ensembles. For fairness, we likewise combine the two domains when comparing with Niu et al. (2018b).
Additionally, we also evaluate our model variants. We first apply GPT-2 to the original parallel corpus without using the rule-based system, denoted GPT-Orig. Then, we feed the rule-preprocessed text as input, denoted GPT-Rule. The other variants in Section 3 are denoted GPT-CAT, GPT-Ensemble, and GPT-HA, respectively.

Evaluation Metrics
To evaluate different models, we apply multiple automatic metrics, mostly following Rao and Tetreault (2018).
Formality: Rao and Tetreault (2018) train a feature-based model to evaluate the formality of sentences, which requires an extra labeled corpus for training that is unfortunately not publicly available. As a replacement, we train an LSTM-based classifier on the training data of GYAFC. It achieves 93% accuracy on the development and test sets, and is thus an acceptable approximation.
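For illustration, such a classifier can be sketched with tf.keras as follows; the hyperparameters shown are illustrative placeholders, not the exact values used.

```python
import tensorflow as tf

# Illustrative hyperparameters; the exact values are placeholders.
VOCAB_SIZE, EMBED_DIM, HIDDEN = 30000, 128, 256

classifier = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(HIDDEN),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(formal | sentence)
])
classifier.compile(optimizer="adam",
                   loss="binary_crossentropy",
                   metrics=["accuracy"])
# Training pairs: informal source sentences (label 0) and their formal
# references (label 1) from the GYAFC training split.
```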
Meaning Preservation: We evaluate whether the meaning of the source sentence is preserved with a model trained on the Semantic Textual Similarity (STS) dataset. We adopt the BERT-Base model (Devlin et al., 2019) and fine-tune it on STS.
Overall: We evaluate the overall quality of formality-transferred sentences with BLEU (Papineni et al., 2002) and PINC (Chen and Dolan, 2011). BLEU evaluates the n-gram overlap between an output and the references, while PINC is an auxiliary metric indicating the dissimilarity between an output sentence and its input; a PINC score of 0 indicates that the input and output sentences are identical. According to Rao and Tetreault (2018), BLEU correlates best with human annotation.
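For reference, PINC can be computed as follows, a direct implementation of the definition in Chen and Dolan (2011): one minus the fraction of output n-grams that also appear in the source, averaged over n-gram orders.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """PINC (Chen and Dolan, 2011): average, over n = 1..max_n, of the
    fraction of candidate n-grams NOT found in the source."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0

print(pinc("i wanna go", "i wanna go"))          # 0.0 -> identical
print(pinc("i wanna go", "I would like to go"))  # close to 1.0
```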

Results
We show the results for the E&M and F&R domains in Tables 3 and 4, respectively. We see that, by using the GPT-2 pretrained model alone (GPT-Orig), we achieve results close to previous state-of-the-art models. It outperforms NMT-Combined and JTHTA, even without fine-tuning on pseudo-parallel data. Our method also significantly outperforms the Transformer-Combined model (without pretraining). The results suggest that the small GYAFC corpus does not suffice to fully train the Transformer model. The GPT-2 model, pretrained on massive unlabeled corpora, is able to capture generic knowledge of language and can be adapted to formality style transfer.
We then evaluate our different methods of incorporating the rule-based system into the pretrained GPT-2 model. We see that GPT-CAT yields the best results, probably because the concatenation enables the two input sentences to interact with each other through a single self-attention mechanism, while the other methods encode each input sentence (original and rule-preprocessed) separately.
When combining both domains as in Niu et al. (2018b), we also achieve better performance than the previous work. This further shows the robustness of our model.
Regarding formality, our model achieves reasonably high accuracy, although combining domains is slightly worse (since cross-domain training may bring noise that hurts output formality). The rule-based model itself shows the best performance on content preservation, but it barely changes the input (a low PINC score). In summary, our models significantly outperform previous work in formality style transfer and achieve state-of-the-art performance on the two domains of GYAFC, which is attributable to both the pretrained model and our fine-tuning methods that take the rule-based system into consideration.

Conclusion
In this work, we study how to incorporate a pretrained neural network with a rule-based system for formality style transfer. We find that fine-tuning the pretrained GPT-2 on the concatenation of the original informal text and the rule-preprocessed text achieves the highest performance on benchmark datasets.