On Learning Text Style Transfer with Direct Rewards

In most cases, the lack of parallel corpora makes it impossible to directly train supervised models for the text style transfer task. In this paper, we explore training algorithms that instead optimize reward functions that explicitly consider different aspects of the style-transferred outputs. In particular, we leverage semantic similarity metrics originally used for fine-tuning neural machine translation models to explicitly assess the preservation of content between system outputs and input texts. We also investigate the potential weaknesses of the existing automatic metrics and propose efficient strategies of using these metrics for training. The experimental results show that our model provides significant gains in both automatic and human evaluation over strong baselines, indicating the effectiveness of our proposed methods and training strategies.


Introduction
Text style transfer aims to convert an input text into another generated text with a different style but the same basic semantics as the input. One major challenge in this setting is that many style transfer tasks lack parallel corpora: the absence of human references makes it impossible to train text style transfer models using maximum likelihood estimation (MLE), which aims to maximize the predicted likelihood of the references. As a result, some of the earliest work (Shen et al., 2017; Hu et al., 2017; Fu et al., 2018) on unsupervised text style transfer proposed training algorithms that are still based on MLE by formulating the style transfer models as auto-encoders optimized with a reconstruction loss. Specifically, during training the model is tasked to generate a style-agnostic encoding and reconstruct the input text based on this encoding with style-specific embeddings or decoders. During inference, the model aims to transfer the source text style using the target style information. While these methods have seen empirical success, they face the inherent difficulty of coming up with a style-agnostic but content-preserving encoding; this is a non-trivial task, and failure at this first step will diminish style transfer accuracy and content preservation of the final output.
Another line of work (Xu et al., 2018; Pang and Gimpel, 2019) proposes training algorithms based on rewards related to the automatic evaluation metrics, which can assess the model performance more directly during training. This approach is conceptually similar to training algorithms that optimize models using rewards related to the corresponding evaluation metrics for other NLP tasks, such as machine translation (Shen et al., 2016; Wieting et al., 2019a) or text summarization (Paulus et al., 2018). As for unsupervised style transfer, the widely used automatic metrics mainly attend to three desiderata: (1) style transfer accuracy: the generated sentence must be in the target style, commonly measured by the accuracy of a style classifier applied to the transferred text; (2) fluency: the generated text must be grammatically correct and natural, commonly measured by the perplexity of a language model; and (3) content preservation: the semantics need to be preserved between the source and target, commonly measured by the BLEU score between the system outputs and source texts. Since these automatic metrics only require the system outputs and source texts, they can be used as rewards for training. Moreover, the two lines of approaches can be used together, and previous work (Yang et al., 2018; John et al., 2019; Madaan et al., 2020) proposed methods which use auto-encoders as the backbone, augmented with task-specific rewards. In particular, the style transfer accuracy reward is used by most of the recent work.
However, reward-based training algorithms still have their limitations, and in this paper we aim to identify and address the bottlenecks of these methods. Specifically, we focus on two problems: (1) the difficulty of designing an efficient reward for content preservation, and (2) the lack of robustness of the existing automatic evaluation metrics.
Content preservation is more difficult to measure than style transfer accuracy and fluency because it needs to consider the overlap in semantics between the source text and system outputs. While using the BLEU score between the source text and system output would be a direct solution (Xu et al., 2018), this approach has an inherent limitation in that n-gram based metrics such as BLEU are sensitive to lexical differences and will penalize modifications that are necessary for transferring text style. In fact, previous work has proposed various proxy rewards for content preservation. One of the most popular methods is the cycle-consistency loss (Dai et al., 2019; Pang and Gimpel, 2019), which introduces a round-trip generation process, where the model generates an output in the target style, and the ability of a reconstruction model to re-generate the original text is used as a proxy for content preservation. While this method is more tolerant of lexical differences, the correlation between the reconstruction loss and content preservation can be weak. Therefore, we aim to design a reward for content preservation which can directly assess the semantic similarity between the system outputs and input texts. Specifically, we note that models of semantic similarity are widely studied (Wieting et al., 2016; Sharma et al., 2017; Pagliardini et al., 2018), and we can leverage these methods to directly calculate the similarity between the system outputs and input texts. This renders our method applicable even in unsupervised settings where no human references are available.
Another key challenge for reward-based training algorithms is that the existing automatic evaluation metrics are not well-correlated with human evaluation. This poses general risks to the work in this field with respect to model training and evaluation, since these metrics are widely used. An important observation we made from our experiments is that style transfer models can exploit the weaknesses of the automatic metrics. They do this by making minimal changes to the input texts which are enough to trick the classifier used for style transfer accuracy while achieving high content preservation and fluency scores due to the high lexical similarity with the input texts. Upon identifying this risk, we revisit and propose several strategies that serve as auxiliary regularization on the style transfer models, effectively mitigating the problem discussed above.
We empirically show that our proposed reward functions can provide significant gains in both automatic and human evaluation over strong baselines from the literature. In addition, the problems we identify with existing automatic evaluation metrics suggest that the automatic metrics need to be used with caution, for both model training and evaluation, in order to make them truthfully reflect human evaluation.

Overview
Data for unsupervised text style transfer can be defined as $\mathcal{D} = \{(x^{(i)}, s^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ denotes the text and $s^{(i)}$ denotes the corresponding style label. The objective of the task is to generate (via a generator $g$) an output with the target style $s$ while preserving most of the semantics of the source $x$. In other words, $\hat{x} = g(x, s)$ should have style $s$ and the semantics of $x$. We define the style as a binary attribute such that $s \in \{0, 1\}$; however, it can be easily extended to a multi-class setting.

Generator
For our generator, we fine-tune a large-scale language model, GPT-2 (Radford et al., 2019). GPT-2 is pre-trained on large corpora and can be fine-tuned to generate fluent and coherent outputs for a variety of language generation tasks (Wolf et al., 2019). Since GPT-2 is a unidirectional language model, we reformulate the conditional generation task as a sequence completion task. Namely, as input to the generator, we concatenate the original sentence with a special token which indicates the target style. The sequence following the style token is our output.
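The sequence-completion formulation above can be sketched as follows. This is only an illustration: the style-token strings and the helper names `build_prompt` and `extract_output` are hypothetical (the text does not name the actual special tokens), and the model continuation is faked with a fixed string rather than produced by GPT-2.

```python
# Hypothetical style tokens; the actual special tokens are not given in the text.
STYLE_TOKENS = {0: "<style_0>", 1: "<style_1>"}


def build_prompt(source: str, target_style: int) -> str:
    """Concatenate the source sentence with the token of the target style."""
    return f"{source.strip()} {STYLE_TOKENS[target_style]}"


def extract_output(full_sequence: str, target_style: int) -> str:
    """Return the text following the style token, i.e. the transferred sentence."""
    _, _, completion = full_sequence.partition(STYLE_TOKENS[target_style])
    return completion.strip()


prompt = build_prompt("the food was terrible .", target_style=1)
# A unidirectional language model would continue the prompt; here the
# continuation is hard-coded purely for illustration.
generated = prompt + " the food was great ."
print(extract_output(generated, target_style=1))
```

Everything before the style token is conditioning context, so a left-to-right model can be used for conditional generation without a separate encoder.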

Reward Functions
We use four reward functions to control the quality of the system outputs. The quality of the outputs is assessed in three ways: style transfer accuracy, content preservation, and fluency. We attend to each of these factors with their respective rewards. Here we denote the input text $x$ having style $s$ by $x_s$, and denote the output by $\hat{x}_s$, i.e., $\hat{x}_s = g(x_s, 1 - s)$.

Rewards for Style Transfer Accuracy
We use a style classifier to provide the supervision signal to the generator with respect to the style transfer accuracy. The generator $g$ and the classifier $f_{cls}$ play a min-max game (Eq. 1). The style transfer accuracy reward for the generator is the log-likelihood of the output being labeled as the target style:
$$R_{cls}(x_s) = \log f_{cls}(\hat{x}_s, 1 - s) \quad (2)$$
Following prior work, we use the CNN-based classifier (Kim, 2014) as $f_{cls}$, which takes both the sentence and the style label as input; its objective is to predict the likelihood of the sentence being coherent with the given style.
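As a rough illustration of the two sides of this game, the sketch below assumes the standard adversarial form, in which the classifier maximizes log f(x_s, s) + log(1 - f(x̂_s, 1 - s)) while the generator is rewarded with log f(x̂_s, 1 - s). That exact form is an assumption, not taken from the text; the probabilities are made-up inputs and the function names are hypothetical.

```python
import math


def classifier_objective(p_real: float, p_fake: float) -> float:
    """Value the classifier tries to maximize (assumed standard form).

    p_real: classifier probability that a real sentence matches its own style.
    p_fake: classifier probability that a generated output matches the target style.
    """
    return math.log(p_real) + math.log(1.0 - p_fake)


def style_reward(p_fake: float) -> float:
    """Generator reward: log-likelihood of the output being in the target style."""
    return math.log(p_fake)


# The generator prefers outputs the classifier accepts as the target style,
# whereas the classifier prefers to reject generated outputs.
print(style_reward(0.9), style_reward(0.5))
print(classifier_objective(0.9, 0.1), classifier_objective(0.6, 0.4))
```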

Rewards for Content Preservation
To ensure that the system outputs still preserve the basic semantics of the source sentences, we use the pretrained SIM model introduced in Wieting et al. (2019b,a) to measure the semantic similarity between the source sentences and system outputs.
The SIM score for a sentence pair is the cosine similarity of its sentence representations. These representations are constructed by averaging sub-word embeddings. Compared to the cycle-consistency loss (Dai et al., 2019; Pang and Gimpel, 2019), our method is more direct since it does not require a second-pass generation. It also has advantages over n-gram based metrics like BLEU (Papineni et al., 2002) since it is more robust to lexical changes and can provide smoother rewards.
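In Wieting et al. (2019a), SIM is augmented with a length penalty to help control the length of the generated text. We use their entire model, SIMILE, as the content preservation reward:
$$R_{sim}(x_s) = \mathrm{LP}(x_s, \hat{x}_s)^{\alpha} \, \mathrm{SIM}(x_s, \hat{x}_s) \quad (3)$$
$$\mathrm{LP}(x_s, \hat{x}_s) = e^{1 - \frac{\max(|x_s|, |\hat{x}_s|)}{\min(|x_s|, |\hat{x}_s|)}} \quad (4)$$
where LP is the length penalty and $\alpha$ is an exponential term to control the weight of the length penalty, which is set to 0.25. We also use the cycle-consistency loss $\mathcal{L}_{cyc}$ to bootstrap the training:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x_s}\left[-\log p_g(x_s \mid \hat{x}_s, s)\right] \quad (5)$$
Here, $p_g$ is the likelihood assigned by the generator $g$. This introduces two generation passes, i.e., $\hat{x}_s = g(x_s, 1 - s)$ and $\tilde{x}_s = g(\hat{x}_s, s)$, while the SIM reward only requires one generation pass, as illustrated in Fig. 1.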
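A minimal sketch of a SIMILE-style reward follows, assuming the formulation of Wieting et al. (2019a): cosine similarity of averaged embeddings, multiplied by a length penalty raised to the power α = 0.25. The two-dimensional `TOY_EMBED` table and the whitespace tokenization are made up for illustration; the actual SIM model uses pre-trained sub-word embeddings.

```python
import math

ALPHA = 0.25  # weight of the length penalty, as stated in the text

# Toy word embeddings (hypothetical values, for illustration only).
TOY_EMBED = {
    "the": [0.1, 0.3], "food": [0.9, 0.2], "was": [0.2, 0.1],
    "good": [0.7, 0.8], "great": [0.8, 0.9], "bad": [0.7, -0.8],
}


def sentence_vector(tokens, dim=2):
    """Average the embeddings of the known tokens."""
    vecs = [TOY_EMBED[t] for t in tokens if t in TOY_EMBED]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def length_penalty(src_len, out_len):
    """exp(1 - max/min): equals 1 for equal lengths, below 1 otherwise."""
    return math.exp(1.0 - max(src_len, out_len) / min(src_len, out_len))


def simile(source, output, alpha=ALPHA):
    """Length-penalized semantic similarity between source and output."""
    src, out = source.split(), output.split()
    if not src or not out:
        return 0.0
    sim = cosine(sentence_vector(src), sentence_vector(out))
    return (length_penalty(len(src), len(out)) ** alpha) * sim


print(simile("the food was good", "the food was great"))
print(simile("the food was good", "the food was bad"))
```

Because the score varies smoothly with the embeddings, a near-synonym substitution is penalized far less than it would be by exact n-gram matching.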

Rewards for Fluency
Style transfer accuracy rewards and content preservation rewards do not have a significant effect on the fluency of the outputs. Therefore, we again use the pre-trained GPT-2 model, but as a reward this time. To encourage the outputs to be as fluent as the source sentences, we define the fluency reward as the difference in perplexity between the source sentences and the system outputs:
$$R_{lang}(x_s) = \mathrm{ppl}(x_s) - \mathrm{ppl}(\hat{x}_s) \quad (6)$$
Here, ppl denotes the length-normalized perplexity assigned by the language model fine-tuned on the training set.
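The fluency reward can be sketched as below, assuming that per-token natural-log probabilities from the language model are already available (the numbers used here are made up). The sign convention, source perplexity minus output perplexity, is an assumption chosen so that more fluent outputs receive a higher reward.

```python
import math


def perplexity(token_logprobs):
    """Length-normalized perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def fluency_reward(source_logprobs, output_logprobs):
    """Positive when the output is more fluent (lower perplexity) than the source."""
    return perplexity(source_logprobs) - perplexity(output_logprobs)


# Made-up log-probabilities for a source sentence and a system output.
source_lp = [math.log(0.25)] * 5
output_lp = [math.log(0.5)] * 4
print(perplexity(source_lp), perplexity(output_lp))
print(fluency_reward(source_lp, output_lp))
```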
As will be further discussed in Section 3.3, we found that using the rewards mentioned above can still result in unnatural outputs. Therefore, we additionally use an LSTM-based (Hochreiter and Schmidhuber, 1997) discriminator $f_{adv}$ to provide a naturalness reward. Its job is to discriminate between the system outputs and the real sentences, i.e., it is an adversarial discriminator, and it constructs a min-max game with the generator (Eq. 7). The naturalness reward is the log-likelihood of the outputs being classified as real sentences:
$$R_{adv}(x_s) = \log f_{adv}(\hat{x}_s) \quad (8)$$

Learning
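For each reward $R_*$ defined above, the corresponding loss term is:
$$\mathcal{L}_* = -\frac{1}{N} \sum_{i=1}^{N} R_*(x^{(i)}) \quad (9)$$
Here, $N$ is the number of samples in the dataset. To train the model, we use the weighted average of the losses defined in the previous section:
$$\mathcal{L} = \sum_{k} \lambda_k \mathcal{L}_k \quad (10)$$
where $k$ indexes the loss terms and $\lambda_k$ denotes the weight of the corresponding term. The setting of $\lambda$ is chosen to make the training stable and to give balanced style transfer accuracy and content preservation performance on the development set. $\mathcal{L}_{rec}$ is the reconstruction loss, i.e.,
$$\mathcal{L}_{rec} = -\frac{1}{N} \sum_{i=1}^{N} \log p_g(x^{(i)} \mid x^{(i)}, s^{(i)}) \quad (11)$$
We follow a two-stage training procedure. We first use the cycle-consistency loss $\mathcal{L}_{cyc}$ to bootstrap the training and then fine-tune the model with the rewards we introduced above to improve the output quality.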
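The step from per-sample rewards to a single training loss can be sketched as follows. The term names and the weights in `example_weights` are hypothetical placeholders (the weights actually used are reported in Table 2, which is not reproduced here), and the loss values are made up.

```python
def reward_to_loss(rewards):
    """Turn per-sample rewards into a loss: the negative mean reward."""
    return -sum(rewards) / len(rewards)


def total_loss(losses, weights):
    """Weighted combination of named loss terms."""
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical weights and made-up loss values, for illustration only.
example_weights = {"cls": 1.0, "adv": 1.0, "sim": 2.0, "lang": 0.5, "rec": 1.0}
example_losses = {
    "cls": reward_to_loss([-0.2, -0.4]),  # rewards are log-probabilities
    "adv": reward_to_loss([-0.7, -0.5]),
    "sim": reward_to_loss([0.8, 0.9]),    # rewards are similarity scores
    "lang": reward_to_loss([-3.0, 1.0]),  # rewards are perplexity differences
    "rec": 1.2,                           # reconstruction loss, already a loss
}
print(total_loss(example_losses, example_weights))
```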
In the bootstrap stage, the model is trained with the objective function in Eq. 12. We select the checkpoint with the highest mean of the style transfer accuracy and BLEU on the development set as the starting point for the second training stage.
In the second stage, the generator is optimized with Eq. 10. The classifier $f_{cls}$ for $\mathcal{L}_{cls}$ is pre-trained, and the language model for $\mathcal{L}_{lang}$ is fine-tuned on the training set. During training, the discriminator $f_{adv}$ for $\mathcal{L}_{adv}$ is trained against the generator. $f_{cls}$ is fixed when trained on some datasets, while it is trained against the generator on others. We select the checkpoint that has a style transfer accuracy and BLEU score similar to those from the first stage and the lowest perplexity on the development set.
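The two checkpoint-selection rules can be sketched as below. The text does not say how "similar" is quantified in the second stage, so the tolerance `tol` is purely an assumption, as are the dictionary keys and the development-set numbers.

```python
def select_stage1(checkpoints):
    """Stage 1: highest mean of style transfer accuracy and BLEU."""
    return max(checkpoints, key=lambda c: (c["acc"] + c["bleu"]) / 2)


def select_stage2(checkpoints, reference, tol=2.0):
    """Stage 2: among checkpoints whose accuracy and BLEU stay within `tol`
    of the stage-1 checkpoint, pick the one with the lowest perplexity.
    Returns None if no checkpoint qualifies."""
    similar = [
        c for c in checkpoints
        if abs(c["acc"] - reference["acc"]) <= tol
        and abs(c["bleu"] - reference["bleu"]) <= tol
    ]
    return min(similar, key=lambda c: c["ppl"]) if similar else None


# Made-up development-set results, for illustration only.
stage1 = [
    {"name": "s1-a", "acc": 90.0, "bleu": 50.0, "ppl": 80.0},
    {"name": "s1-b", "acc": 92.0, "bleu": 54.0, "ppl": 85.0},
]
stage2 = [
    {"name": "s2-a", "acc": 93.0, "bleu": 53.0, "ppl": 60.0},
    {"name": "s2-b", "acc": 91.5, "bleu": 55.0, "ppl": 55.0},
    {"name": "s2-c", "acc": 80.0, "bleu": 60.0, "ppl": 40.0},
]
start = select_stage1(stage1)
final = select_stage2(stage2, start)
print(start["name"], final["name"])
```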
Lastly, since gradients cannot be propagated through the discrete samples, we use two approaches to circumvent this problem. For the content preservation reward (Eq. 3) and fluency reward (Eq. 6), we use the REINFORCE (Williams, 1992) algorithm to optimize the model:
$$\nabla_{\theta_g} \mathbb{E}_{\hat{x}_s \sim p_g}\left[R(\hat{x}_s)\right] = \mathbb{E}_{\hat{x}_s \sim p_g}\left[R(\hat{x}_s) \, \nabla_{\theta_g} \log p_g(\hat{x}_s)\right] \quad (13)$$
where $\theta_g$ denotes the parameters of the generator. We approximate the expectation by greedy decoding, and the log-likelihood is normalized by sequence length, i.e., $\frac{1}{L} \sum_{i=1}^{L} \log p_g(\tilde{w}_i)$, where $\tilde{w}_i$ denotes the $i$-th token of $\hat{x}_s$ and $L$ is the sequence length. For the style transfer accuracy reward (Eq. 2) and naturalness reward (Eq. 8), we use a different approach to generate a continuous approximation of the discrete tokens, which allows gradients to be back-propagated to the generator. Namely, taking the style classifier $f_{cls}$ as an example, we use the distribution $p_i$ of each token produced by the generator as the input of the classifier. This distribution is then multiplied by the classifier's word embedding matrix $W_{embed}$ to obtain a weighted average of word embeddings:
$$\hat{w}_i = p_i W_{embed} \quad (14)$$
Then, the classifier takes the sequence of $\hat{w}_i$ as its input. We chose this method because it provides a token-level supervision signal to the generator, while the REINFORCE algorithm provides sentence-level signals.
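Both tricks can be illustrated with plain Python. In the sketch below, `reinforce_loss` is the usual REINFORCE surrogate for a single decoded sample (its gradient with respect to the generator parameters would be the policy-gradient estimate in an autodiff framework, with the reward treated as a constant), and `soft_embeddings` computes the expected embedding at each position. The matrix and distributions are made up, and the function names are hypothetical.

```python
import math


def reinforce_loss(reward, token_logprobs):
    """Surrogate loss: minus the reward times the length-normalized
    log-likelihood of the decoded sequence."""
    normalized_ll = sum(token_logprobs) / len(token_logprobs)
    return -reward * normalized_ll


def soft_embeddings(token_dists, embed_matrix):
    """For each position i, compute p_i times W_embed, i.e. the expected
    embedding under the generator's token distribution at that position."""
    dim = len(embed_matrix[0])
    return [
        [sum(p * row[d] for p, row in zip(dist, embed_matrix)) for d in range(dim)]
        for dist in token_dists
    ]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made up).
W_embed = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
dists = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
print(soft_embeddings(dists, W_embed))
print(reinforce_loss(2.0, [math.log(0.5), math.log(0.5)]))
```

A one-hot distribution recovers the ordinary embedding lookup, so the soft input is a continuous relaxation of the discrete token sequence.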

Datasets
We evaluate our approach on three datasets for sentiment transfer with positive and negative reviews: the Yelp review dataset, the Amazon review dataset, and the IMDb movie review dataset provided by Dai et al. (2019). We also evaluate our methods on a formality style transfer dataset, Grammarly's Yahoo Answers Formality Corpus (GYAFC), introduced in Rao and Tetreault (2018). Although it is a parallel corpus, we treat it as an unaligned corpus in our experiments. In order to compare to previous work,

Experimental Details
Following previous work, we measure the style transfer accuracy using a FastText (Joulin et al., 2017; https://fasttext.cc/) style classifier trained on the respective training set of each dataset. To measure content preservation, we use SIM and BLEU as metrics, where self-SIM and self-BLEU are computed between the source sentences and system outputs, while ref-SIM and ref-BLEU are computed between the system outputs and human references when available. To measure fluency, we use a pre-trained GPT-2 model to compute the perplexity (note that we did not fine-tune it on the training set). Our generator, GPT-2, has 1.5 billion parameters, and we train on a GTX 1080 Ti GPU for about 12 hours. The weights of the loss terms in Eq. 10 and Eq. 12 are detailed in Table 2. During our experiments, we found that other configurations can give higher scores on the automatic evaluation metrics; however, as will be discussed in Section 3.3, we also found that better performance in automatic evaluation does not always entail better performance in human evaluation. Therefore, we also manually checked the quality of the transferred texts on the development set when choosing the values of the hyperparameters. We compare our model with several state-of-the-art methods: DeleteAndRetrieve (D&R)

Adversarial Examples
Yelp and Amazon are arguably the most frequently used datasets for the sentiment transfer task. In our experiments, we found that the automatic evaluation metrics can be tricked on these datasets. Table 3 shows the performance of the models which generate adversarial examples. Upon identifying these risks, we propose several design options that can effectively mitigate these problems.
Yelp Dataset. For the Yelp dataset, when trained without the adversarial discriminator $f_{adv}$ and the fluency reward, our model (DIRR-YELP-ADV) is able to discover a trivial solution which receives high automatic evaluation scores: injecting a word that carries strong sentiment at the beginning of the output and making minimal changes (if any) to the source sentences, as illustrated in Table 8. This obviously does not meet the objective of content-preserving sentiment transfer and is easily detectable by humans. In fact, after we manually removed the first word from each of the output sentences, the transfer accuracy dropped from 95.2 to 58.4. To address this problem, we introduced an auxiliary discriminator $f_{adv}$, as discussed above, to penalize the trivial outputs, since they can be easily captured by the discriminator. On the other hand, the output perplexity is not sensitive enough to this local feature, so using the fluency reward alone is not sufficient. Our final model has much more stable performance when the first word of its output sentences is removed, experiencing only a small drop in style transfer accuracy from 94.2 to 88.2.
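The first-word diagnostic described above is easy to reproduce in outline. The sketch below uses a deliberately naive keyword classifier (`toy_classifier`, hypothetical) and made-up outputs; in practice the trained style classifier and the real system outputs would be used instead.

```python
def drop_first_word(sentence):
    """Remove the first whitespace-separated token of a sentence."""
    return " ".join(sentence.split()[1:])


def accuracy(sentences, target_label, classify):
    """Percentage of sentences that the classifier assigns to the target style."""
    hits = sum(1 for s in sentences if classify(s) == target_label)
    return 100.0 * hits / len(sentences)


def toy_classifier(sentence):
    """Hypothetical keyword-based sentiment classifier (1 = positive)."""
    positive_words = {"great", "delicious", "friendly"}
    return 1 if any(w in positive_words for w in sentence.split()) else 0


# Made-up outputs: two of them only look positive because of an injected word.
outputs = [
    "great the food was cold .",
    "the staff was friendly .",
    "great we waited an hour .",
]
before = accuracy(outputs, 1, toy_classifier)
after = accuracy([drop_first_word(s) for s in outputs], 1, toy_classifier)
print(before, after)
```

A large gap between the two numbers indicates that the classifier's decisions hinge on the injected first word rather than on a genuine rewrite.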
Amazon Dataset. For the Amazon dataset, we found that the style classifier $f_{cls}$ needs to be updated during training to prevent the model from exploiting the data imbalance problem of the dataset. Namely, in the Amazon dataset some categories of products appear mostly in negative or positive reviews. In Table 4, we show the word frequency of game and phone in both negative and positive reviews. In the original dataset, game mostly appears in negative reviews while phone mostly appears in positive reviews. Therefore, without any prior knowledge, it is very likely that these words will be used as informative features by the sentiment classifier, which makes its predictions unreliable (notice that the style classifier only achieves an accuracy of 43 on the human references). When our second-stage model is trained with the fixed style classifier, it (DIRR-AMAZON-ADV) learns to exploit this dataset bias by changing the nouns in the original sentences to game or phone, which achieves better transfer accuracy. We list some examples in Table 5. DIRR-AMAZON-ADV generated 291 occurrences of game in 500 positive reviews, which obviously changes the semantics of the source sentences.

Model   Text
Source  don t waste your time or money on these jeans .
Adv     don t need your time or money on these phones .
Source  i made beef bolognese in the oven and it turned out wonderfully .
Adv     i made beef bolognese in the game and it turned out wonderfully .
Source  this one does the job i need it for !
Adv     this game does the job i need it for !

In order to show that this phenomenon is independent of the classifier architecture, [...] (Sudhakar et al., 2019; Madaan et al., 2020) and other methods (Yang et al., 2018) also use a fixed classifier or use words with unbalanced frequencies in different styles as important features, which means that their methods may face the same risk. While previous work has pointed out this data imbalance problem of the Amazon dataset, we further demonstrate that a strong generator can even use this discrepancy to trick the automatic metrics.
We are able to mitigate this problem by updating the style classifier during training. As shown in Table 4, DIRR is more robust to the data imbalance problem than other methods.
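A word-frequency check of the kind reported in Table 4 can be sketched as follows. The reviews below are made up, and whitespace tokenization is an assumption; the same function could be applied to a system's outputs to see whether it over-generates a class-indicative word.

```python
from collections import Counter


def word_frequency(reviews, words):
    """Count how often each word of interest occurs in a list of reviews."""
    counts = Counter()
    for text in reviews:
        for token in text.split():
            if token in words:
                counts[token] += 1
    return {w: counts[w] for w in words}


# Made-up reviews, for illustration only.
negative_reviews = ["this game crashed twice .", "the game is boring ."]
positive_reviews = ["great phone for the price .", "love this phone , fun game too ."]

watch = ["game", "phone"]
print("negative:", word_frequency(negative_reviews, watch))
print("positive:", word_frequency(positive_reviews, watch))
```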

Automatic Evaluation
The automatic evaluation results are shown in Table 6. We report the performance of the previous methods based on the outputs they provided for fair comparison and omit those whose results are not available.
We make the following observations. First, compared to our base model (DIRR-CYCLE), the model trained with our proposed rewards has higher fluency while maintaining the same level of content preservation. This indicates that the SIM score is as effective as the cycle-consistency loss for content preservation and that our fluency reward can effectively improve the output fluency. Second, there exists a trade-off among style transfer accuracy, content preservation, and language fluency. While our model does not outperform the previous methods on all of the metrics, it is able to find a better balance of the different metrics.

Table 6: Automatic Evaluation. Acc is the accuracy of the sentiment classifier. PPL is the perplexity assigned by the GPT-2 language model. r-BLEU is the BLEU score between the human references and system outputs. s-BLEU is the BLEU score between the source sentences and system outputs. Copy is an oracle which copies the source sentences as outputs. Human denotes the human references.

Human Evaluation
We conducted human evaluation on the Yelp, Amazon, and GYAFC datasets, evaluating the style transfer accuracy, content preservation, and fluency separately. The first two aspects are rated on a scale of 1-3, while fluency is rated on a scale of 0-1. We randomly select 100 candidates and compare the outputs of different systems. We use Amazon Mechanical Turk for human evaluation. Each candidate is rated by three annotators, and we report the average scores here. We did not evaluate the style transfer accuracy for the GYAFC dataset since it is difficult for human annotators to accurately capture the difference between formal and informal sentences. The results of our human evaluations are shown in Table 7. We additionally report the sample-wise mean score of the metrics, where the fluency scores are scaled up to be consistent with the other scores. Our model achieves better overall performance when considering all three evaluation metrics on each dataset. Interestingly, we found that the automatic metrics for both style transfer accuracy and content preservation do not accurately reflect performance as measured by human evaluation. For example, on the Amazon dataset, although Tag&Gen (Madaan et al., 2020) achieves significantly higher style transfer accuracy based on the automatic metric, our model achieves better performance based on the human evaluation. This phenomenon underscores the importance of our findings discussed in Section 3.3, namely that strong neural models can potentially exploit the weaknesses of the automatic metrics.

Table 7: Human Evaluation. Style denotes style transfer accuracy, Flu. denotes fluency, Con. denotes content preservation. Mean denotes the average of the metrics where the fluency scores are scaled up to be consistent with other scores. *: significantly better than other systems (p < 0.01) according to the mean score.
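One way to compute such a sample-wise mean is sketched below. The text does not specify how the fluency scores are scaled, so the linear map from [0, 1] to [1, 3] used here is only an assumption, and the ratings and function names are made up.

```python
def scale_fluency(flu, lo=1.0, hi=3.0):
    """Linearly map a fluency score in [0, 1] to [lo, hi] (assumed scaling)."""
    return lo + (hi - lo) * flu


def sample_mean(ratings):
    """Mean over the available aspects of one sample; fluency is rescaled first.
    Aspects that were not rated (e.g. style on GYAFC) are simply left out."""
    values = [
        scale_fluency(value) if name == "fluency" else value
        for name, value in ratings.items()
    ]
    return sum(values) / len(values)


def annotator_average(scores):
    """Average the scores that several annotators gave to the same item."""
    return sum(scores) / len(scores)


# Made-up ratings from three annotators for one candidate.
candidate = {
    "style": annotator_average([3, 2, 3]),
    "content": annotator_average([2, 2, 3]),
    "fluency": annotator_average([1, 1, 0]),
}
print(sample_mean(candidate))
```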

Analysis
We next show an ablation study, demonstrating the effectiveness of the content preservation and fluency rewards in DIRR, and how SIM can be used to replace the cycle-consistency loss. We also compare using BLEU versus using SIM as a content-preservation reward, finding that using BLEU results in reduced performance, unstable training, and artifacts in the outputs, which makes the results less natural than the results of the model trained with SIM score.
To illustrate that training with SIM can replace the cycle-consistency loss for content preservation, we fine-tuned DIRR-CYCLE on SIM to produce a new model, DIRR w/o FLU. The difference between DIRR and DIRR w/o FLU is that the former is additionally trained with our fluency rewards. The results are shown in Table 9 and reveal two main trends. First, we see that DIRR w/o FLU has better fluency and content preservation performance than DIRR-CYCLE, which shows that the cycle-consistency loss can be replaced by the SIM score for content preservation. Second, DIRR has better fluency than DIRR w/o FLU, showing the effectiveness of our fluency rewards. We next investigate the effectiveness of using SIM as a reward instead of BLEU. To do this, we train a model, DIRR-BLEU, which uses BLEU as the content reward, and report the results in Table 9. The results show that using BLEU yields higher content preservation as measured by BLEU but similar performance when measured by SIM. However, performance on style transfer accuracy and fluency decreases. We hypothesize that this is because using SIM as a reward gives the model more freedom, allowing the model to have more balanced performance since there is less pressure to copy n-grams. We also observe more adversarial examples in the outputs of DIRR-BLEU. As discussed in Section 3.3, these adversarial examples are generated by injecting a word carrying strong sentiment at the beginning of the output. The model trained with BLEU is more likely to generate these outputs as it will try to avoid breaking up the n-grams in the source sentences, allowing for a higher BLEU reward. Examples of this behavior are shown in Table 8. Notice that the DIRR-BLEU samples start with the word great, which is often enough to fool the classifier, but they are unnatural.

Table 9: Ablation and Comparative Study on Yelp Dataset. Acc is the accuracy of the sentiment classifier. PPL is the perplexity assigned by the GPT-2 language model. self-BLEU (s-BLEU) and self-SIM (s-SIM) are computed between the source sentences and outputs.

Related Work
A main line of work (Shen et al., 2017; Hu et al., 2017; Fu et al., 2018; Xu et al., 2018; John et al., 2019) for text style transfer aims to model the conditional distribution of the data with the encoder-decoder architecture. Due to the lack of parallel corpora, inductive biases are designed to make the generation conditioned on both source sentences and specific styles, such that the model can rewrite the source texts with the target style while still preserving the content information of the source texts.
Efforts have also been made to design training objectives that improve performance. For example, back-translation (Prabhumoye et al., 2018), denoising auto-encoding (Lample et al., 2019), and the cycle-consistency loss (Dai et al., 2019; Pang and Gimpel, 2019) have been shown to be effective for improving model performance. Previous work also proposes a retrieve-based pipeline, which contains three stages, namely, delete, retrieve, and generate. Sudhakar et al. (2019) extend this pipeline by using GPT (Radford et al., 2018) as the generator. Compared to these methods, we propose a more direct and effective approach to encourage semantic-preserving transfer by directly measuring the semantic similarity of the source texts and system outputs.
Recently, other works have been proposed for unsupervised text style transfer (Jin et al., 2019; Lai et al., 2019; Li et al., 2020). He et al. (2020) propose a probabilistic view which models the non-parallel data from two domains as a partially observed parallel corpus. Madaan et al. (2020) propose a tag-and-generate pipeline, which first identifies style attribute markers in the source texts, then replaces them with a special token, and generates the outputs based on the tagged sentences. Zhou et al. (2020) focus on exploring the word-level style relevance which is assigned by a pre-trained style classifier. They propose a reward for content preservation which is based on the weighted combination of the word embeddings of the source texts and system outputs. Compared to this reward, our proposed content reward is specifically designed for semantic similarity and pre-trained on large corpora, which makes it more robust across different datasets.

Conclusion
In this paper, we propose a direct approach to improving content preservation for text style transfer by leveraging a semantic similarity metric as the content reward. Using a large pre-trained language model (GPT-2) with our proposed rewards that target the different aspects of the output quality, our approach achieves strong performance on both automatic and human evaluation. Recently, several semantic similarity metrics (Zhao et al., 2019; Sellam et al., 2020) based on pre-trained language models have shown promising results. Introducing these metrics into our proposed method as the content preservation reward may bring further improvements.
Moreover, we identify several problems in the commonly used automatic evaluation metrics and datasets, and propose several practical strategies to mitigate these problems, which makes these metrics more effective rewards for model training. Considering the weaknesses of the automatic metrics presented in this work, we believe that more rigorous discussion and investigation of the criteria for "successful transfer" is essential for this field of work. Since existing works have mostly relied on model-based metrics to determine the success of style transfer models, methods such as adversarial training could be introduced to make the model-based metrics more robust and faithful indicators of the success of style transfer, which would be beneficial for both model training and evaluation.