Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation

Disentangling content and style in the latent space is prevalent in unpaired text style transfer. However, two major issues exist in most of the current neural models. 1) It is difficult to completely strip the style information from the semantics of a sentence. 2) The recurrent neural network (RNN) based encoder and decoder, mediated by the latent representation, cannot deal well with long-term dependencies, resulting in poor preservation of non-stylistic semantic content. In this paper, we propose the Style Transformer, which makes no assumption about the latent representation of the source sentence and leverages the power of the attention mechanism in the Transformer to achieve better style transfer and better content preservation.


Introduction
Text style transfer is the task of changing the stylistic properties (e.g., sentiment) of a text while retaining the style-independent content. Since the definition of text style is vague, it is difficult to construct paired sentences with the same content but differing styles. Therefore, studies of text style transfer focus on the unpaired setting.
Recently, neural networks have become the dominant methods in text style transfer. Most of the previous methods (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; Carlson et al., 2017; Zhang et al., 2018b,a; Prabhumoye et al., 2018; Jin et al., 2019; Melnyk et al., 2017; dos Santos et al., 2018) formulate the style transfer problem in the "encoder-decoder" framework. The encoder maps the text into a style-independent latent representation (a vector representation), and the decoder generates a new text with the same content but a different style from the disentangled latent representation plus a style variable.
These methods focus on how to disentangle content and style in the latent space. The latent representation needs to preserve the meaning of the text while shedding its stylistic properties. Due to the lack of paired sentences, an adversarial loss (Goodfellow et al., 2014) is used in the latent space to discourage encoding style information in the latent representation. Although the disentangled latent representation brings better interpretability, in this paper we raise the following concerns about these models.
1) It is difficult to judge the quality of disentanglement. As reported in (Elazar and Goldberg, 2018; Lample et al., 2019), the style information can still be recovered from the latent representation even when the model has been trained adversarially. Therefore, it is not easy to disentangle the stylistic properties from the semantics of a sentence.
2) Disentanglement is also unnecessary. Lample et al. (2019) report that a good decoder can generate text with the desired style from an entangled latent representation by "overwriting" the original style.
3) Due to the limited capacity of a vector representation, it is hard for the latent representation to capture rich semantic information, especially for long texts. The recent progress in neural machine translation also shows that it is hard to recover the target sentence from the latent representation without referring to the original sentence.
4) To disentangle the content and style information in the latent space, all of the existing approaches have to assume that the input sentence is encoded by a fixed-size latent vector. As a result, these approaches cannot directly apply the attention mechanism to enhance their ability to preserve the information in the input sentence.
5) Most of these models adopt recurrent neural networks (RNNs) as the encoder and decoder, which have a weak ability to capture long-range dependencies between words in a sentence. Besides, without referring to the original text, an RNN-based decoder also struggles to preserve the content, and the generation quality for long texts is hard to control.
In this paper, we address the above concerns about disentangled models for style transfer. Different from them, we propose the Style Transformer, which takes the Transformer (Vaswani et al., 2017) as its basic block. The Transformer is a fully-connected self-attention neural architecture that has achieved many exciting results on natural language processing (NLP) tasks, such as machine translation (Vaswani et al., 2017), language modeling (Dai et al., 2019), and text classification (Devlin et al., 2018). Different from RNNs, the Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. Moreover, the Transformer decoder fetches information from the encoder via the attention mechanism, rather than through the fixed-size vector used by RNNs. With the strong ability of the Transformer, our model can transfer the style of a sentence while better preserving its meaning. The difference between our model and previous models is shown in Figure 1.
Our contributions are summarized as follows:
• We introduce a novel training algorithm that makes no assumption about the disentangled latent representations of the input sentences, so the model can employ attention mechanisms to further improve its performance.
• To the best of our knowledge, this is the first work that applies the Transformer architecture to the style transfer task.
• Experimental results show that our proposed approach generally outperforms the other approaches on two style transfer datasets. In particular, for content preservation, the Style Transformer achieves the best performance by a significant margin.

Related Work
Recently, many text style transfer approaches have been proposed. Among these approaches, there is a line of work that aims to infer a latent representation for the input sentence and manipulate the style of the generated sentence based on this learned latent representation. Shen et al. (2017) propose a cross-aligned auto-encoder with adversarial training to learn a shared latent content distribution and a separate latent style distribution. Hu et al. (2017) propose a new neural generative model that combines variational auto-encoders and holistic attribute discriminators for the effective imposition of semantic structures. Following their work, many methods (Fu et al., 2018; John et al., 2018; Zhang et al., 2018a,b) have been proposed based on the standard encoder-decoder architecture.
Although learning a latent representation makes the model more interpretable and easier to manipulate, a model that assumes a fixed-size latent representation cannot utilize any further information from the source sentence.
On the other hand, some approaches that do not manipulate a latent representation have also been proposed recently. Xu et al. (2018) propose a cycled reinforcement learning method for the unpaired sentiment-to-sentiment translation task. Another work proposes a three-stage method: the model first extracts content words by deleting phrases with a strong attribute value, then retrieves new phrases associated with the target attribute, and finally uses a neural model to combine these into a final output. Lample et al. (2019) reduce text style transfer to an unsupervised machine translation problem (Lample et al., 2018); they employ Denoising Auto-encoders (Vincent et al., 2008) and back-translation (Sennrich et al., 2016) to build a translation system between different styles.
However, both lines of previous models make few attempts to utilize the attention mechanism to refer to the long-term history or the source sentence, with the exception of Lample et al. (2019). In many NLP tasks, especially text generation, the attention mechanism has proved to be an essential technique for enabling the model to capture long-term dependencies (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017).
In this paper, we follow the second line of work and propose a novel method that makes no assumption about the latent representation of the source sentence and takes the proven self-attention network, the Transformer, as a basic module to train a style transfer system.

Style Transformer
To make our discussion clearer, in this section we first give a brief introduction to the style transfer task, and then discuss our proposed model based on this problem definition.

Problem Formalization
In this paper, we define the style transfer problem as follows. Consider K datasets {D_1, ..., D_K}, where each dataset D_i is composed of many natural language sentences. All of the sentences in a single dataset D_i share some specific characteristic (e.g., they are all positive reviews for a specific product), and we refer to this shared characteristic as the style of these sentences. In other words, a style is defined by the distribution of a dataset. Given the K datasets D_i, we can thus define K different styles, and we denote each style by the symbol s^(i). The goal of style transfer is: given an arbitrary natural language sentence x and a desired style ŝ ∈ {s^(i)}_{i=1}^K, rewrite this sentence into a new one x̂ that has the style ŝ and preserves the information in the original sentence x as much as possible.

Model Overview
To tackle the style transfer problem defined above, our goal is to learn a mapping function f_θ(x, ŝ), where x is a natural language sentence and ŝ is a style control variable. The output of this function is the transferred sentence x̂ for the input sentence x.
A big challenge in text style transfer is that we have no access to parallel corpora, so we cannot directly obtain supervision to train our transfer model. In section 3.4, we employ two discriminator-based approaches to create supervision from non-parallel corpora.
Finally, we combine the Style Transformer network and the discriminator network via an overall learning algorithm in section 3.5 to train our style transfer system.

Style Transformer Network
Generally, the Transformer follows the standard encoder-decoder architecture. Explicitly, for an input sentence x = (x_1, x_2, ..., x_n), the Transformer encoder Enc(x; θ_E) maps the input to a sequence of continuous representations z = (z_1, z_2, ..., z_n), and the Transformer decoder Dec(z; θ_D) estimates the conditional probability of the output sentence y = (y_1, y_2, ..., y_n) by factorizing it auto-regressively:

p_θ(y | x) = ∏_{t=1}^{n} p_θ(y_t | z, y_{<t}). (1)

At each time step t, the probability of the next token is computed by a softmax classifier:

p_θ(y_t | z, y_{<t}) = softmax(o_t), (2)

where o_t is the logit vector output by the decoder network.
To enable style control in the standard Transformer framework, we add an extra style embedding as input to the Transformer encoder, Enc(x, s; θ_E). Therefore the network can compute the probability of the output conditioned both on the input sentence x and on the style control variable s. Formally, this can be expressed as:

p_θ(y | x, s) = ∏_{t=1}^{n} p_θ(y_t | Enc(x, s; θ_E), y_{<t}), (3)

and we denote the predicted output sentence of this network by f_θ(x, s).
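As a concrete illustration, the following PyTorch sketch shows one way to realize this style-conditioned encoder: the style embedding is prepended to the token embeddings as one extra input "token". This is our own minimal reconstruction, not the authors' released code; class and variable names are ours, and the sizes merely echo the configuration described later in the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of a style-conditioned encoder Enc(x, s; theta_E).
class StyleEncoder(nn.Module):
    def __init__(self, vocab_size, num_styles, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.style_emb = nn.Embedding(num_styles, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, s):
        # x: (batch, seq_len) token ids; s: (batch,) style ids
        tok = self.tok_emb(x)                 # (B, L, d_model)
        sty = self.style_emb(s).unsqueeze(1)  # (B, 1, d_model)
        inp = torch.cat([sty, tok], dim=1)    # style acts as an extra first token
        return self.encoder(inp)              # (B, L + 1, d_model)
```

A full model would also add positional encodings to the word tokens (the paper notes that the style token receives none) and pair this encoder with a Transformer decoder.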

Discriminator Network
Suppose we use x and s to denote a sentence and its style from the dataset D. Because of the absence of parallel corpora, we cannot directly obtain supervision for the case f_θ(x, ŝ) where ŝ ≠ s. Therefore, we introduce a discriminator network to learn this supervision from the non-parallel corpora.
The intuition behind the training of the discriminator is based on the following observations. As mentioned above, we only have supervision for the case f_θ(x, s). In this case, because the input sentence x and the chosen style s both come from the same dataset D, one of the optimal solutions is to reproduce the input sentence, so we can train our network to reconstruct the input. In the case of f_θ(x, ŝ) where ŝ ≠ s, we construct supervision in two ways. 1) For content preservation, we train the network to reconstruct the original input sentence x when we feed the transferred sentence ŷ = f_θ(x, ŝ) back into the Style Transformer network with the original style label s. 2) For style controlling, we train a discriminator network to assist the Style Transformer network in better controlling the style of the generated sentence.
In short, the discriminator network is another Transformer encoder, which learns to distinguish the styles of different sentences, and the Style Transformer network receives style supervision from this discriminator. To achieve this goal, we experiment with two different discriminator architectures.

Conditional Discriminator
In a setting similar to Conditional GANs (Mirza and Osindero, 2014), the discriminator makes its decision conditioned on an input style. Explicitly, a sentence x and a proposed style s are fed into the discriminator d_φ(x, s), and the discriminator is asked to answer whether the input sentence has the corresponding style. In the discriminator training stage, the real sentences x from the datasets and the reconstructed sentences y = f_θ(x, s) are labeled as positive, and the transferred sentences ŷ = f_θ(x, ŝ), where ŝ ≠ s, are labeled as negative. In the Style Transformer training stage, the network f_θ is trained to maximize the probability of the positive class when f_θ(x, ŝ) and ŝ are fed to the discriminator.
Multi-class Discriminator Different from the previous one, in this case only one sentence is fed into the discriminator d_φ(x), and the discriminator aims to answer the style of this sentence. More concretely, the discriminator is a classifier with K + 1 classes: K of the classes represent the K different styles, and one extra class stands for the generated data from f_θ(x, ŝ), which is often referred to as fake samples. In the discriminator training stage, we assign the real sentences x and the reconstructed sentences y = f_θ(x, s) the label of the corresponding style, while the transferred sentences ŷ = f_θ(x, ŝ), where ŝ ≠ s, are labeled as the fake class 0. In the Style Transformer learning stage, we train the network f_θ(x, ŝ) to maximize the probability of the class that stands for the style ŝ.

Figure 2: The training process for the Style Transformer network. The input sentence x and an input style (s or ŝ) are fed into the Transformer network f_θ. If the input style s is the same as the style of sentence x, the generated sentence y is trained to reconstruct x. Otherwise, the generated sentence ŷ is fed into the Transformer f_θ and the discriminator d_φ to reconstruct the input sentence x and the input style ŝ, respectively.
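A hedged PyTorch sketch of the multi-class discriminator may help make this concrete: a Transformer encoder with a prepended <cls> token whose output vector is classified into K style classes plus one fake class. All names and sizes here are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of the multi-class discriminator d_phi(x).
class MultiClassDiscriminator(nn.Module):
    def __init__(self, vocab_size, num_styles, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.cls_id = vocab_size              # reserve one extra id for <cls>
        self.tok_emb = nn.Embedding(vocab_size + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_styles + 1)  # K styles + "fake"

    def forward(self, x):
        # x: (batch, seq_len) token ids
        cls = torch.full((x.size(0), 1), self.cls_id, dtype=torch.long)
        h = self.encoder(self.tok_emb(torch.cat([cls, x], dim=1)))
        return self.classifier(h[:, 0])       # logits from the <cls> position
```

The conditional variant would additionally embed a proposed style s alongside the sentence and end in a binary (real/fake) classifier instead of the K + 1-way one.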

Learning Algorithm
In this section, we discuss how to train these two networks. The training algorithm of our model can be divided into two parts: discriminator learning and Style Transformer learning. A brief illustration is shown in Figure 2.

Discriminator Learning
Loosely speaking, in the discriminator training stage, we train our discriminator to distinguish the real sentences x and the reconstructed sentences y = f_θ(x, s) from the transferred sentences ŷ = f_θ(x, ŝ). The loss function for the discriminator is simply the cross-entropy loss of the corresponding classification problem. For the conditional discriminator:

L_d(φ) = -log p_φ(c | x, s), (4)

and for the multi-class discriminator:

L_d(φ) = -log p_φ(c | x). (5)

Depending on the discriminator architecture, there is a different protocol for how these sentences are labeled; the details can be found in Algorithm 1.

Style Transformer Learning

Self Reconstruction Consider the case ŝ = s, or equivalently the case f_θ(x, s). As discussed before, the input sentence x and the input style s come from the same dataset D, so we can simply train our Style Transformer to reconstruct the input sentence by minimizing the negative log-likelihood:

L_self(θ) = -log p_θ(y = x | x, s). (6)

For the case ŝ ≠ s, we cannot obtain direct supervision from our training set, so we introduce two different training losses to create supervision indirectly.
Cycle Reconstruction To encourage the generated sentence to preserve the information in the input sentence x, we feed the generated sentence ŷ = f_θ(x, ŝ) into the Style Transformer together with the original style s of x, and train our network to reconstruct the original input sentence by minimizing the negative log-likelihood:

L_cycle(θ) = -log p_θ(y = x | f_θ(x, ŝ), s). (7)
Style Controlling If we only train our Style Transformer to reconstruct the input sentence x from the transferred sentence ŷ = f_θ(x, ŝ), the network can simply learn to copy the input to the output. To handle this degeneration problem, we further add a style controlling loss for the generated sentence. Namely, the generated sentence ŷ is fed into the discriminator to maximize the probability of the style ŝ.
For the conditional discriminator, the Style Transformer aims to minimize the negative log-likelihood of the positive class when the generated sentence is fed to the discriminator together with the style label ŝ:

L_style(θ) = -log p_φ(c = 1 | f_θ(x, ŝ), ŝ). (8)

In the case of the multi-class discriminator, the Style Transformer is trained to minimize the negative log-likelihood of the class corresponding to the style ŝ:

L_style(θ) = -log p_φ(c = ŝ | f_θ(x, ŝ)). (9)

Combining the losses discussed above, the training procedure of the Style Transformer is summarized in Algorithm 2.
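The generator-side objectives are all negative log-likelihoods, which in practice reduce to cross-entropy losses. The following sketch (our own; function names and tensor layouts are assumptions) shows the self-reconstruction, cycle, and multi-class style losses:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the generator-side losses (Eqs. (6), (7) and (9)).
# `logits` stands for Style Transformer decoder outputs, `disc_logits`
# for discriminator outputs; both are assumptions about tensor shapes.
def self_reconstruction_loss(logits, x):
    # logits: (B, L, V) decoder outputs for f(x, s); the target is x itself (Eq. 6)
    return F.cross_entropy(logits.transpose(1, 2), x)

def cycle_reconstruction_loss(logits, x):
    # logits: (B, L, V) decoder outputs when the transferred sentence is fed
    # back with the original style s; the target is again x (Eq. 7)
    return F.cross_entropy(logits.transpose(1, 2), x)

def style_loss_multiclass(disc_logits, target_style):
    # disc_logits: (B, K + 1) discriminator outputs on the transferred
    # sentence; maximize the probability of the target style class (Eq. 9)
    return F.cross_entropy(disc_logits, target_style)
```

The conditional variant of the style loss would instead be a binary cross-entropy toward the positive class (Eq. 8).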

Summarization and Discussion
Finally, we construct our overall training algorithm from the discriminator learning and Style Transformer learning steps. Similar to the training process of GANs (Goodfellow et al., 2014), in each training iteration we first perform n_d steps of discriminator learning to get a better discriminator, and then train our Style Transformer for n_f steps to improve its performance. The training process is summarized in Algorithm 3.
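The alternating schedule of Algorithm 3 can be sketched as follows. The two step functions are stubs standing in for the actual gradient updates; all names here are ours:

```python
# Minimal sketch of the alternating schedule: n_d discriminator steps
# followed by n_f Style Transformer steps in each training iteration.
def train(n_iters, n_d, n_f, d_step, f_step):
    schedule = []
    for _ in range(n_iters):
        for _ in range(n_d):
            d_step()              # update d_phi (discriminator learning)
            schedule.append("d")
        for _ in range(n_f):
            f_step()              # update f_theta (Style Transformer learning)
            schedule.append("f")
    return schedule
```

Suitable values of n_d and n_f are hyperparameters tuned per dataset, as in GAN training generally.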
Before finishing this section, we discuss a problem that arises in the training process. Because of the discrete nature of natural language, for the generated sentence ŷ = f_θ(x, ŝ) we cannot directly propagate gradients from the discriminator through the discrete samples. To handle this problem, one can use REINFORCE (Williams, 1992) or the Gumbel-Softmax trick (Kusner and Hernández-Lobato, 2016) to estimate gradients from the discriminator. However, both approaches suffer from high variance, which makes the model hard to converge. In our experiments, we also observed that the Gumbel-Softmax trick slowed down convergence and did not bring much performance improvement. For these reasons, we instead view the softmax distribution generated by f_θ as a "soft" generated sentence and feed this distribution to the downstream network, keeping the whole training process continuous. When this approximation is used, we also switch our decoder from greedy decoding to continuous decoding. That is to say, at every time step, instead of feeding the token with the maximum probability from the previous prediction step to the network, we feed the whole softmax distribution (Eq. (2)) to the network, and the decoder uses this distribution to compute a weighted-average embedding from the embedding matrix as its input.
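One step of this continuous decoding can be sketched in a few lines: the softmax distribution is used to take a weighted average of the embedding matrix. The function and variable names are ours, not the paper's:

```python
import numpy as np

# Sketch of the "soft" continuous decoding step: instead of the argmax token,
# the full softmax distribution over the vocabulary is fed back, and the next
# input embedding is the distribution-weighted average of the embedding matrix.
def soft_input_embedding(logits, emb_matrix, temperature=1.0):
    # logits: (V,) decoder logits at one step; emb_matrix: (V, d)
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p = p / p.sum()          # the softmax distribution of Eq. (2)
    return p @ emb_matrix    # (d,) weighted-average embedding for the next step
```

When the distribution is sharply peaked, this reduces to (approximately) the embedding of the argmax token, so greedy decoding is a limiting case.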

Datasets
We evaluated and compared our approach with several state-of-the-art systems on two review datasets: the Yelp Review Dataset (Yelp) and the IMDb Movie Review Dataset (IMDb). The statistics of the two datasets are shown in Table 1.
Yelp Review Dataset (Yelp) The Yelp dataset is provided by the Yelp Dataset Challenge and consists of restaurant and business reviews with sentiment labels (negative or positive). Following previous work, we use the processed version of this dataset, which additionally provides human reference sentences for the test set.
IMDb Movie Review Dataset (IMDb) The IMDb dataset consists of movie reviews written by online users. To get a high-quality dataset, we use the highly polar movie reviews provided by Maas et al. (2011). Based on this dataset, we construct a highly polar sentence-level style transfer dataset by the following steps: 1) fine-tune a BERT (Devlin et al., 2018) classifier on the original training set, which achieves 95% accuracy on the test set; 2) split each review in the original dataset into sentences; 3) filter out sentences on which the confidence of our fine-tuned BERT classifier is below 0.9; 4) remove sentences with uncommon words. Finally, this dataset contains 366K, 4K, and 2K sentences for training, validation, and testing, respectively.
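Steps 2) and 3) of this pipeline can be sketched as below. The `classify` callable stands in for the fine-tuned BERT model and is purely an assumption here; a real pipeline would also use a proper sentence splitter rather than splitting on periods:

```python
# Hedged sketch of the split-and-filter preprocessing: keep only sentences
# on which the (hypothetical) classifier is confident above a threshold.
def filter_sentences(reviews, classify, threshold=0.9):
    kept = []
    for review in reviews:
        for sent in review.split(". "):
            label, confidence = classify(sent)
            if confidence >= threshold:
                kept.append((sent, label))
    return kept
```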

Evaluation
A good transferred sentence should be fluent, content-complete, and carry the target style. To evaluate the performance of the different models, following previous work, we compare the generated samples along three different dimensions: 1) style control, 2) content preservation, and 3) fluency.

Automatic Evaluation
Style Control We measure style control automatically by evaluating the target sentiment accuracy of the transferred sentences. For an accurate evaluation of style control, we trained two sentiment classifiers on the training sets of Yelp and IMDb using fastText (Joulin et al., 2017).

Content Preservation
To measure content preservation, we calculate the BLEU score (Papineni et al., 2002) between the transferred sentence and its source input using NLTK. A higher BLEU score indicates better content preservation, as more words from the source sentence are retained. If a human reference is available, we also calculate the BLEU score between the transferred sentence and the corresponding reference. These two BLEU metrics are referred to as self-BLEU and ref-BLEU, respectively.
Fluency Fluency is measured by the perplexity of the transferred sentence; we trained a 5-gram language model on the training set of each dataset using KenLM (Heafield, 2011).
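The self-BLEU computation with NLTK can be sketched as below; ref-BLEU is the same call with the human reference in place of the source. The smoothing function is our addition to avoid zero scores on short sentences, since the paper does not specify these details:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Sketch of self-BLEU: score the transferred sentence against its source input.
def self_bleu(source, transferred):
    smooth = SmoothingFunction().method1
    return sentence_bleu([source.split()], transferred.split(),
                         smoothing_function=smooth)
```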

Human Evaluation
Due to the lack of parallel data in the style transfer area, automatic metrics are insufficient to evaluate the quality of the transferred sentences. Therefore we also conduct human evaluation experiments on the two datasets. We randomly select 100 source sentences (50 for each sentiment) from each test set for human evaluation. For each review, one source input and three anonymized transferred samples are shown to a reviewer, who is asked to choose the best sentence for style control, content preservation, and fluency, respectively:
• Which sentence has the most opposite sentiment toward the source sentence?
• Which sentence retains most content from the source sentence?
• Which sentence is the most fluent one?
To avoid interference from similar or identical generated sentences, "no preference" is also an optional answer to these questions.

Training Details
In all of the experiments, for the encoder, decoder, and discriminator we use a 4-layer Transformer with four attention heads in each layer. The hidden size, embedding size, and positional encoding size in the Transformer are all 256 dimensions. Another embedding matrix with 256 hidden units is used to represent the different styles, and its output is fed into the encoder as an extra token of the input sentence; positional encoding is not used for the style token. For the discriminator, similar to Radford et al. (2018) and Devlin et al. (2018), we further add a <cls> token to the input, and the output vector at the corresponding position is fed into a softmax classifier, which represents the output of the discriminator.
In the experiments, we also found that performing random word dropout on the input sentence when computing the self-reconstruction loss (Eq. (6)) helps the model converge to a reasonable performance more easily. In addition, adding a temperature parameter to the softmax layer (Eq. (2)) and using a sophisticated temperature decay schedule can also help the model get better results in some cases.
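Both heuristics are simple to implement; the sketch below shows one plausible form. The function names, the mask token id, and the dropout rate are our assumptions, not values from the paper:

```python
import numpy as np

# Sketches of the two training heuristics: random word dropout on the input
# of the self-reconstruction loss, and a temperature applied to the softmax.
rng = np.random.default_rng(0)

def word_dropout(token_ids, p=0.1, mask_id=0):
    keep = rng.random(len(token_ids)) >= p
    return [t if k else mask_id for t, k in zip(token_ids, keep)]

def tempered_softmax(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()   # lower temperature -> sharper distribution
```

A temperature decay schedule would gradually lower `temperature` during training, sharpening the "soft" generated sentences toward discrete tokens.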

Experimental Results
Results using automatic metrics are presented in Table 2. Compared to previous approaches, our models achieve competitive performance overall and obtain better content preservation on both datasets. Our conditional model achieves better style control than the multi-class model. Both of our models are able to generate sentences with relatively low perplexity. For those previous models that perform best on a single metric, an obvious drawback can always be found on another metric.
For the human evaluation, we choose the two best-performing models according to the automatic evaluation results as competitors: DeleteAndRetrieve (DAR) and Controlled Generation (CtrlGen) (Hu et al., 2017). The generated outputs from the multi-class discriminator model are used as our final model's outputs. We performed over 400 human evaluation reviews; results are presented in Table 3. The human evaluation results largely conform with our automatic evaluation results, and also show that our models are better at content preservation than the two competitor models. Finally, to better understand the characteristics of the different models, we sampled several output sentences from the Yelp dataset, which are shown in Table 4.

Ablation Study
To study the impact of different components on the overall performance, we further conducted an ablation study of our model on the Yelp dataset; results are reported in Table 5.
To better understand the role of the different loss functions, we disable each loss function in turn and retrain our model with the same setting for the rest of the hyperparameters. After we disable the self-reconstruction loss (Eq. (6)), our model fails to learn meaningful output and only learns to generate a single word for any combination of input sentence and style. When we do not use the cycle reconstruction loss (Eq. (7)), however, it is still possible to train the model successfully, and both models converge to reasonable performance; compared to the full model, there is a small improvement in style accuracy but a significant drop in BLEU score. As we expected, the cycle reconstruction loss encourages the model to preserve the information in the input sentence. Finally, when the discriminator loss (Eq. (8) and (9)) is not used, the model quickly degenerates to one that simply copies the input sentence to the output without any style modification. This behaviour also conforms with our intuition: if the model is only asked to minimize the self-reconstruction and cycle reconstruction losses, directly copying the input is one of the optimal solutions and the easiest to achieve. In summary, each of these losses plays an important role in the Style Transformer training stage: 1) the self-reconstruction loss guides the model to generate readable natural language sentences; 2) the cycle reconstruction loss encourages the model to preserve the information in the source sentence;
3) the discriminator provides style supervision to help the model control the style of generated sentences.
Another group of studies focuses on the different types of samples used in the discriminator training step. In Algorithm 1, we use a mixture of real sentences x and generated sentences y as the positive training samples for the discriminator. By contrast, in the ablation study, we trained our model with only one of them. As the results show, the generated sentences are the key component in discriminator training: when we remove the real sentences from the discriminator's training data, our model still achieves a result competitive with the full model, with only a small performance drop. However, if we only use the real sentences, the model loses a significant part of its ability to control the style of the generated sentences and thus yields poor style accuracy, although it still controls style far better than the input-copy model discussed in the previous part. For these reasons, we use a mixture of real and generated samples in our final version.

Conclusions and Future Work
In this paper, we proposed the Style Transformer with a novel training algorithm for the text style transfer task. Experimental results on two text style transfer datasets show that our model achieves competitive or better performance compared to previous state-of-the-art approaches. In particular, because our proposed approach does not assume a disentangled latent representation for manipulating the sentence style, our model achieves better content preservation on both datasets.
In the future, we plan to adapt our Style Transformer to the multiple-attribute setting of Lample et al. (2019). On the other hand, the back-translation technique of Lample et al. (2019) could also be adapted to the training process of the Style Transformer; how to combine back-translation with our training algorithm is a research direction worth exploring.