Cycle-Consistent Adversarial Autoencoders for Unsupervised Text Style Transfer

Unsupervised text style transfer is full of challenges due to the lack of parallel data and difficulties in content preservation. In this paper, we propose a novel neural approach to unsupervised text style transfer, which we refer to as Cycle-consistent Adversarial autoEncoders (CAE), trained from non-parallel data. CAE consists of three essential components: (1) LSTM autoencoders that encode a text in one style into its latent representation and decode an encoded representation into its original text or a transferred representation into a style-transferred text, (2) adversarial style transfer networks that use an adversarially trained generator to transform a latent representation in one style into a representation in another style, and (3) a cycle-consistent constraint that enhances the capacity of the adversarial style transfer networks in content preservation. The entire CAE with these three components can be trained end-to-end. Extensive experiments and in-depth analyses on two widely-used public datasets consistently validate the effectiveness of the proposed CAE in both style transfer and content preservation against several strong baselines in terms of four automatic evaluation metrics and human evaluation.


Introduction
Unsupervised text style transfer aims to rewrite a text in one style into a text in another style, keeping the content of the text unchanged as much as possible, without using any parallel data. Style transfer can be utilized in many tasks such as personalization in dialogue systems (Oraby et al., 2018; Colombo et al., 2019), sentiment modification and word decipherment (Shen et al., 2017), offensive language translation (Nogueira dos Santos et al., 2018), and data augmentation (Perez and Wang, 2017; Mikołajczyk and Grochowski, 2018; Zhu et al., 2020).
However, there are a variety of challenges to text style transfer in practice. First, we do not have large-scale style-to-style parallel data to train a text style transfer model in a supervised way. Second, even with non-parallel corpora, the inherent discrete structure of text sequences aggravates the difficulty of learning desirable continuous representations for style transfer (Bowman et al., 2016; Hjelm et al., 2018). Third, it is difficult to preserve the content of a text when its style is transferred. To obtain good content preservation for text style transfer, various disentanglement approaches (Shen et al., 2017; Hu et al., 2017; Fu et al., 2018; Sudhakar et al., 2019) have been proposed to separate the content and style of a text in the latent space. However, content-style disentanglement is not easily achievable, as content and style typically interact with each other in texts in subtle ways (Lample et al., 2019).
In order to address the issues above, we propose Cycle-consistent Adversarial autoEncoders (CAE) for unsupervised text style transfer. In CAE, we learn a representation of a text that embeds both content and style in the same space. Such a space is constructed for each style from non-parallel data. We then transfer the learned representation from one style space to the other. To guarantee that the content is preserved during the style transfer procedure, the transferred representation is transferred back to the original space so as to minimize the distance between the original representation and the reversely transferred representation. Without loss of generality, we discuss CAE for text transfer between two styles; multiple styles can be factorized into two styles (Shen et al., 2020a). Specifically, CAE is composed of three essential components: LSTM autoencoders, adversarial style transfer networks and a cycle-consistent constraint. The LSTM autoencoder contains an encoder enc to encode a sentence $x^s_i$ of style $s$ into a hidden representation $z^s_i$ in the corresponding style space, and a decoder dec to generate sentences from vectors $z^s_i$ learned by the LSTM encoder (or $\tilde{z}_j$ transferred from the other style space). The adversarial style transfer networks learn a generator $T$ to generate a representation $\tilde{z}^{s_1 \to s_2}_i$ in style space $s_2$ from $z^{s_1}_i$ in style space $s_1$, or the other way around from style space $s_2$ to $s_1$. They also use a discriminator to ensure that the transferred representations belong to the corresponding style space.
The top of Figure 2 displays the original sentences and style-transferred sentences generated by the LSTM decoder from transferred representations produced by the generator of the adversarial style transfer network. The cycle-consistent constraint transfers back representations to their original space and attempts to minimize their distances, as demonstrated in the bottom of Figure 2.
In summary, our contributions are threefold as follows.
• We propose a novel end-to-end framework with three components to learn text style transfer without using parallel data.
• To the best of our knowledge, our work is the first to use the cycle-consistent constraint in the latent representational space for unsupervised text style transfer.
• The proposed CAE is validated on two widely-used datasets: the Yelp restaurant review sentiment transfer dataset and the Yahoo QA topic transfer dataset. Extensive experiments and analyses demonstrate that CAE obtains better performance than several state-of-the-art baselines in both style transfer and content preservation.

Related work
A number of text style transfer approaches have been proposed in recent years following the pioneering study of style transfer in images (Gatys et al., 2015). These approaches can be roughly categorized into two strands: methods that disentangle representations of style and content, and those that do not.
In the first line of text style transfer, Hu et al. (2017) combine a variational autoencoder (VAE) with style discriminators to enforce that styles can be reliably inferred back from generated sentences. Shen et al. (2017) use discriminators to align the hidden states of transferred samples from one style with the true samples in the other, obtaining a shared latent content distribution. Fu et al. (2018) use an adversarial network to separate content representations from style representations. Prabhumoye et al. (2018) fix the machine translation model and the encoder of the back-translation model to obtain content representations, then generate texts with classifier-guided style-specific generators. Another approach extracts content words by deleting style indicator words, then combines the content words with retrieved style words to construct the final output. Xu et al. (2018) use reinforcement learning to jointly train a neutralization module, which removes style words based on a classifier, and an emotionalization module. ARAE and DAAE (Shen et al., 2020b) train GAN-regularized latent representations to obtain style-independent content representations, then decode the content representations conditioned on style. He et al. (2020) present a new probabilistic graphical model for unsupervised text style transfer.
In the second line of works, which avoid disentangled representations of style and content, Lample et al. (2019) use a back-translation technique on a denoising autoencoder model with latent representation pooling to control content preservation. Their experiments and analyses show that content-style disentanglement is neither necessary nor always achieved in practice, even with domain adversarial training that explicitly aims at learning disentangled representations. Style Transformer (Dai et al., 2019) uses the Transformer as a basic module to train a style transfer system. DualRL employs a dual reinforcement learning framework with two sequence-to-sequence models in two directions, using a style classifier and back-transfer reconstruction probability as rewards.
We follow the second line and propose a novel method that makes no assumption about latent representation disentanglement. Differently, however, we perform style transfer in the latent representational spaces of the source and target styles. Inspired by CycleGAN (Zhu et al., 2017; Zhu et al., 2018), which uses a cycle loss in image style transfer to enforce that the back-translation of a transferred image be equivalent to the original image, we also impose a cycle-consistent constraint on our style transfer network. However, training style transfer networks with such a cycle constraint on discrete texts is quite different from doing so on images, and non-trivial. In order to enable cycle training on texts, we project texts onto an adversarially regularized latent space collectively learned by the LSTM autoencoders and the adversarial transfer networks. Different from latent cross projection with Euclidean distance for semi-supervised style transfer (Shang et al., 2019), we construct a latent CycleGAN to generate high-quality sentences for unsupervised style transfer.

CAE: Cycle-consistent Adversarial Autoencoders
Suppose we have two non-parallel text datasets $X_1 = \{x^1_i\}_{i=1}^{n}$ and $X_2 = \{x^2_j\}_{j=1}^{m}$ with different styles $s_1$ and $s_2$. CAE employs LSTM autoencoder models to encode discrete text sequences into continuous latent representations, adversarial style transfer networks to map representations between the two style spaces, and a cycle-consistent constraint to preserve content during transfer.

LSTM autoencoders
We use an LSTM (Hochreiter and Schmidhuber, 1997) autoencoder to learn the latent representation of a text for each style. The encoder employs an LSTM recurrent neural network to map the input sequence to a latent representation of a fixed size, and the decoder utilizes another LSTM network to generate an output sequence from a hidden representation (Sutskever et al., 2014). Given the $i$-th input text sequence $x^1_i$ in style $s_1$, the LSTM autoencoder for style $s_1$ can be formulated as:

$$z^1_i = \mathrm{enc}_1(x^1_i), \qquad p(\hat{x}^1_i \mid z^1_i) = \prod_{t} p(\hat{x}^1_{i,t} \mid z^1_i, \hat{x}^1_{i,<t}; \mathrm{enc}_1, \mathrm{dec}_1) \quad (1)$$

where $z^1_i$ is the learned latent representation from the encoder $\mathrm{enc}_1$, $\hat{x}^1_{i,<t}$ are the tokens generated before $\hat{x}^1_{i,t}$, and we start the decoder with a start-of-sentence symbol "<bos>" as $\hat{x}^1_{i,<1}$. $p(\hat{x}^1_{i,t} \mid z^1_i, \hat{x}^1_{i,<t}; \mathrm{enc}_1, \mathrm{dec}_1)$ is the softmax output from the decoder $\mathrm{dec}_1$. For style $s_2$, we similarly construct another LSTM autoencoder with encoder $\mathrm{enc}_2$ and decoder $\mathrm{dec}_2$ to learn latent representations $z^2_j$. Each LSTM autoencoder tries to reconstruct the input sequence $x^k_i$ with the output $\hat{x}^k_i$ from the networks $\mathrm{enc}_k, \mathrm{dec}_k$, where $k = 1, 2$ indexes the styles. The training objective for the two LSTM autoencoders can be computed as:

$$\mathcal{L}_{AE} = \sum_{k=1,2} \mathbb{E}_{x^k \sim X_k}\big[-\log p(\hat{x}^k = x^k \mid z^k; \mathrm{enc}_k, \mathrm{dec}_k)\big] \quad (2)$$

The two LSTM autoencoders transform discrete sequences into continuous latent representations, which enables the style transfer models to perform style transfer and cycle training in the continuous space.
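As a concrete illustration, one style-specific autoencoder can be sketched in PyTorch as below. This is a minimal sketch under assumptions: class and variable names are ours, the hidden size of 128 follows the Yelp setting reported later, and details such as teacher forcing and sharing one embedding table between encoder and decoder are illustrative choices rather than confirmed details of the paper.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def encode(self, x):
        # Use the final hidden state as the fixed-size latent representation z.
        _, (h, _) = self.encoder(self.emb(x))
        z = h[-1]                                  # (batch, hidden)
        return z / z.norm(dim=-1, keepdim=True)    # L2-normalize, as in the paper

    def decode(self, z, x_in):
        # Teacher forcing: condition the decoder on z via its initial hidden state.
        h0 = z.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.emb(x_in), (h0, c0))
        return self.out(out)                       # (batch, len, vocab) logits

ae = LSTMAutoencoder(vocab_size=100)
x = torch.randint(0, 100, (4, 7))                  # 4 toy "sentences" of length 7
z = ae.encode(x)
logits = ae.decode(z, x)
# Reconstruction term of the autoencoder objective (cross-entropy against input).
recon_loss = nn.functional.cross_entropy(logits.reshape(-1, 100), x.reshape(-1))
```

In practice the decoder input would be the target sequence shifted right with "<bos>", as the formulation above describes.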

Adversarial style transfer networks
Once we obtain the representations of text sequences in different styles via the LSTM autoencoders, we learn two transformation functions $T_{1\to 2}$ and $T_{2\to 1}$ to map a representation in one style to a representation in the other style within the learned latent spaces. The style transfer is formulated as:

$$\tilde{z}^{1\to 2} = T_{1\to 2}(z^1), \qquad \tilde{z}^{2\to 1} = T_{2\to 1}(z^2) \quad (3)$$

where $\tilde{z}^{1\to 2}$ is the latent representation generated in style $s_2$ from its original representation $z^1$ in style $s_1$ by the transformation $T_{1\to 2}$, and $\tilde{z}^{2\to 1}$ is the latent representation generated in style $s_1$ by $T_{2\to 1}$. We use generative adversarial networks (Goodfellow et al., 2014) to learn the two transformation functions. Consider the learning of the transformation $T_{1\to 2}$. We regard the function $T_{1\to 2}$ as a generator that produces a representation in style $s_2$ from a representation in style $s_1$. We then build a discriminator $D_2$ to distinguish representations in style $s_2$ from others. The generator tries to generate a representation that fools the discriminator. The adversarial learning of the generator $T_{1\to 2}$ and the discriminator $D_2$ is formulated as:

$$\mathcal{L}_G(T_{1\to 2}, D_2) = \mathbb{E}_{z^2}\big[\log D_2(z^2)\big] + \mathbb{E}_{z^1}\big[\log\big(1 - D_2(T_{1\to 2}(z^1))\big)\big] \quad (4)$$

Similarly, we can derive the generative adversarial loss for the style transformation function $T_{2\to 1}$ and discriminator $D_1$:

$$\mathcal{L}_G(T_{2\to 1}, D_1) = \mathbb{E}_{z^1}\big[\log D_1(z^1)\big] + \mathbb{E}_{z^2}\big[\log\big(1 - D_1(T_{2\to 1}(z^2))\big)\big] \quad (5)$$
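The latent transfer step and the adversarial objective described above can be sketched numerically as below. This is a NumPy sketch with random, untrained weights standing in for the learned two-layer MLPs (the tanh/sigmoid architecture follows the hyper-parameter section of this paper; the tiny latent size and the weight values are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
h = 8  # latent size (128 or 300 in the paper; small here for illustration)

def transform(z, W1, W2):
    # Generator T: a two-layer fully-connected map from latent to latent.
    return np.tanh(z @ W1) @ W2

def discriminator(z, W1, w2):
    # D: two layers, tanh then sigmoid, scoring P(z belongs to the style space).
    return 1.0 / (1.0 + np.exp(-(np.tanh(z @ W1) @ w2)))

W1_T, W2_T = rng.normal(size=(h, h)), rng.normal(size=(h, h))
W1_D, w2_D = rng.normal(size=(h, h)), rng.normal(size=(h,))

z1 = rng.normal(size=(4, h))   # latent codes from style s1
z2 = rng.normal(size=(4, h))   # latent codes from style s2

z1_to_2 = transform(z1, W1_T, W2_T)   # the transfer step
# Adversarial objective: D2 is trained to score real s2 codes high and
# transferred codes low, while T is trained to fool D2 on the second term.
loss_D = -np.mean(np.log(discriminator(z2, W1_D, w2_D))) \
         - np.mean(np.log(1 - discriminator(z1_to_2, W1_D, w2_D)))
```

Training would alternate gradient steps on the discriminator and the generator; only the forward computation is sketched here.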

Cycle-consistent constraint
Theoretically, the adversarial style transfer networks described above are capable of learning many different transformation functions that generate outputs distributed identically to the target style space (Zhu et al., 2017). This is because the learning of the transformation functions lacks sufficient constraints, and the two functions are learned relatively separately according to equations (4) and (5).
In order to learn desirable transformation functions, we use a cycle-consistent constraint to tighten the learning of the two transformation functions T 1→2 and T 2→1 , which is inspired by CycleGAN (Zhu et al., 2017). The cycle-consistent constraint expects that a transferred representation generated by a transformation function can be translated back to its original representation by the other transformation function.
Given a latent representation $z^1$ in style $s_1$, the reconstructed latent representation through the two style transformation functions $T_{1\to 2}, T_{2\to 1}$ can be obtained as:

$$\tilde{z}^{(1\to 2)\to 1} = T_{2\to 1}(T_{1\to 2}(z^1)) \quad (6)$$

Similarly, we can obtain the reconstructed latent representation $\tilde{z}^{(2\to 1)\to 2}$ for a latent representation $z^2$ in style $s_2$.
To constrain the transformation functions $T_{2\to 1}$ and $T_{1\to 2}$, the latent representational cycle-consistent reconstruction loss is formulated as:

$$\mathcal{L}_{cyc} = \mathbb{E}_{z^1}\big[\|\tilde{z}^{(1\to 2)\to 1} - z^1\|_1\big] + \mathbb{E}_{z^2}\big[\|\tilde{z}^{(2\to 1)\to 2} - z^2\|_1\big] \quad (7)$$

where $\|\cdot\|_1$ is the $L_1$ norm. This latent representational cycle-consistent reconstruction imposes a constraint on the adversarial style transfer networks that palliates mode-dropping in the latent style transfer and improves content preservation in the generated sentences.
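The cycle-consistent reconstruction loss described above can be sketched with NumPy as follows. The linear transfer maps are illustrative stand-ins for the learned networks, deliberately chosen as exact inverses of each other so that the loss is near zero in the ideal, perfectly cycle-consistent case.

```python
import numpy as np

rng = np.random.default_rng(1)
h = 6

# Stand-in transfer functions: invertible linear maps, so the ideal cycle
# T_2to1(T_1to2(z)) == z holds exactly when A_back == inv(A_fwd).
A_fwd = rng.normal(size=(h, h))
A_back = np.linalg.inv(A_fwd)

T_1to2 = lambda z: z @ A_fwd
T_2to1 = lambda z: z @ A_back

z1 = rng.normal(size=(5, h))   # latent codes in style s1
z2 = rng.normal(size=(5, h))   # latent codes in style s2

def cycle_loss(z1, z2):
    # L1 distance between each code and its round-trip reconstruction,
    # averaged over the batch, summed over both directions.
    back1 = T_2to1(T_1to2(z1))
    back2 = T_1to2(T_2to1(z2))
    return np.abs(back1 - z1).sum(axis=1).mean() + \
           np.abs(back2 - z2).sum(axis=1).mean()

loss = cycle_loss(z1, z2)   # near zero here, since the maps are exact inverses
```

With learned, non-inverse transfer networks this quantity is positive and is minimized jointly with the other objectives.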

Training and inference
As CAE has three components in its network architecture, the end-to-end training objective of CAE is composed of three sub-objectives and is formulated as:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{AE} + \lambda_2 \big(\mathcal{L}_G(T_{1\to 2}, D_2) + \mathcal{L}_G(T_{2\to 1}, D_1)\big) + \lambda_3 \mathcal{L}_{cyc} \quad (8)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ control the relative importance of the three sub-objectives. We aim to solve:

$$\mathrm{enc}_1, \mathrm{dec}_1, \mathrm{enc}_2, \mathrm{dec}_2, T_{2\to 1}, T_{1\to 2}, D_1, D_2 = \arg\min_{\mathrm{enc}_{1,2},\, \mathrm{dec}_{1,2},\, T}\ \max_{D_1, D_2}\ \mathcal{L} \quad (9)$$

For inference, consider the transfer of a text $x^1_i$ in style $s_1$ into a text in style $s_2$. We first obtain the latent representation $z^1_i = \mathrm{enc}_1(x^1_i)$ using the encoder $\mathrm{enc}_1$. We then perform style transfer and obtain the transferred latent representation $\tilde{z}^{1\to 2}_i$ in style $s_2$ based on equation (3). Finally, we employ the decoder $\mathrm{dec}_2$ to generate a transferred sequence $\hat{x}^{1\to 2}_i = (\hat{x}^{1\to 2}_{i,1}, \cdots, \hat{x}^{1\to 2}_{i,L})$ with length $L$, where $p(\cdot \mid z, \cdot; \mathrm{dec}_2)$ is computed as in equation (1) by the softmax of $\mathrm{dec}_2$ conditioned on previous tokens. The inference of the entire sequence $\hat{x}^{1\to 2}_i$ in style $s_2$ from the sequence $x^1_i$ in style $s_1$ is formulated as:

$$p(\hat{x}^{1\to 2}_i \mid \tilde{z}^{1\to 2}_i) = \prod_{t=1}^{L} p(\hat{x}^{1\to 2}_{i,t} \mid \tilde{z}^{1\to 2}_i, \hat{x}^{1\to 2}_{i,<t}; \mathrm{dec}_2) \quad (10)$$

Similarly, we can conduct style transfer from a sequence $x^2_j$ in style $s_2$ to generate a sequence $\hat{x}^{2\to 1}_j$ in style $s_1$.
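The token-by-token inference described above can be sketched as a greedy decoding loop. The "decoder" here is a toy stand-in function producing a softmax from the latent code, not the trained LSTM decoder; the vocabulary and scoring rule are invented for illustration.

```python
import numpy as np

VOCAB = ["<bos>", "<eos>", "the", "food", "was", "horrible"]

def toy_decoder_step(z, prev_tokens):
    # Stand-in for the decoder's softmax over the vocabulary: deterministic
    # scores derived from the latent code and the current position.
    logits = np.array([z[(len(prev_tokens) + i) % len(z)]
                       for i in range(len(VOCAB))])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def greedy_decode(z, max_len=10):
    # Start from "<bos>", pick the argmax token each step, stop at "<eos>"
    # or the length cap.
    tokens = ["<bos>"]
    while len(tokens) < max_len:
        probs = toy_decoder_step(z, tokens)
        nxt = VOCAB[int(np.argmax(probs))]
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens[1:]

z = np.array([0.1, 2.0, -1.0, 0.5])   # a toy transferred latent code
out = greedy_decode(z)
```

In the actual model, sampling or beam search over the decoder's softmax could replace the argmax choice.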

Experiments
To compare our work with previous approaches to text style transfer from non-parallel data, we conducted experiments on two text transfer tasks: sentiment transfer on the Yelp restaurant review corpus and topic transfer on the "Yahoo! Answers Comprehensive Questions and Answers version 1.0" dataset. We also carried out ablation experiments to study the impact of different components of CAE on overall performance of style transfer.

Datasets
For the Yelp dataset, we followed the same experimental setup and used the same dataset as the Cross-aligned autoencoder (Shen et al., 2017) and ARAE for sentiment transfer on Yelp restaurant reviews. The sentiment of a review is labeled as positive if the rating is above three; otherwise, it is labeled as negative. We used 70% of the data for training, 10% for validation and the rest for testing. For the Yahoo QA dataset, we chose two topics for style transfer, "Entertainment & Music" and "Politics & Government", and extracted questions from these two topics to construct the final dataset. The partition ratios of this dataset for training and testing are 80% and 20%, respectively. To reduce the vocabulary size, we pruned the vocabulary to keep the most frequent words and replaced other words with "<unk>". Table 1 shows the statistics of the two datasets.
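The vocabulary pruning described above can be sketched as follows; the toy corpus and the cutoff size are illustrative, not the actual data or vocabulary size used in the experiments.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    # Keep only the max_size most frequent words across the corpus.
    counts = Counter(w for s in sentences for w in s.split())
    return set(w for w, _ in counts.most_common(max_size))

def prune(sentence, vocab):
    # Replace every out-of-vocabulary word with the "<unk>" token.
    return " ".join(w if w in vocab else "<unk>" for w in sentence.split())

corpus = [
    "the food was great",
    "the service was great",
    "the decor was peculiar",
]
vocab = build_vocab(corpus, max_size=5)
print(prune("the food was peculiar", vocab))  # -> "the food was <unk>"
```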

Baselines
We compared CAE with the following baselines: (1) LSTM autoencoder (AE): using only LSTM autoencoders in CAE for the two styles without the style transfer networks and cycle-consistent constraint.
(2) Cross-aligned autoencoder (Cross-aligned AE) (Shen et al., 2017): aligning the hidden states of autoencoders adversarially to learn a shared latent content distribution. (3) ARAE: adversarially training a GAN-regularized prior with a classifier to obtain style-independent content representations, then conducting style transfer through decoders conditioned on style. (4) Template-based method: replacing the style words of a source sentence with style words retrieved from target sentences. (5) Cycled reinforcement learning approach (Cycled RL) (Xu et al., 2018): using reinforcement learning to jointly train a neutralization module, which removes style words based on a classifier, and an emotionalization module.

Hyper-parameter settings
The encoders $\mathrm{enc}_1, \mathrm{enc}_2$ and decoders $\mathrm{dec}_1, \mathrm{dec}_2$ were LSTM networks with one hidden layer of size $h_n = 128$ on the Yelp review dataset and of size $h_n = 300$ on the Yahoo QA dataset. The word embedding size was the same as the number of hidden neurons $h_n$. The latent variables $z^1$, $z^2$, $\tilde{z}^{2\to 1}$ and $\tilde{z}^{1\to 2}$ were $L_2$-normalized to unit length. The transformation functions $T_{1\to 2}$, $T_{2\to 1}$ were parameterized by two-layer fully-connected neural networks ($h_n$-$h_n$-$h_n$ neurons). The discriminators $D_1$, $D_2$ were two-layer fully-connected neural networks ($h_n$-$h_n$-1 neurons) with a hyperbolic tangent activation function in the first layer and a sigmoid activation function in the second layer. The weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ were set to 0.1, 1.0 and 1.0, respectively, based on performance on the validation set.

Evaluation metrics
We used four automatic metrics to quantitatively evaluate the proposed CAE: Transfer, BLEU, PPL and RPPL, which have been widely used in previous literature. Transfer is the style transfer success rate, measured by a classifier trained with the fastText library (Joulin et al., 2017). BLEU evaluates the content preservation between the source sequence and the transferred sequence (Papineni et al., 2002). To evaluate the fluency of the transferred sequence, we use the perplexity of the generated text, denoted PPL. We also used the reverse perplexity (RPPL) to assess how representative the generated texts are of the underlying data distribution and to detect mode collapse in generative models. RPPL scores were calculated by training an RNN language model on generated samples and evaluating its perplexity on real-world held-out data. We used the code from Zhao et al. (word embedding size of 300 with dropout 0.2, and a one-layer LSTM of size 300 with dropout 0.2) to build the language models and calculate PPL and RPPL. These four evaluation metrics together form a comprehensive evaluation and comparison between different approaches. We also conducted human evaluation. We randomly chose 200 instances from each style for the human evaluation. The four human annotators proficiently understand English texts and have sufficient background knowledge about this evaluation task. Sentences were presented to them blindly and in random order. They graded all sentences with scores from one to five for style transfer, content preservation and fluency. Following Wu et al. (2019), we regard a style transfer with scores of at least four on all three measures (style transfer, content preservation and fluency) as a successful transfer. We calculate the percentage of successful transfers and refer to this percentage as Suc.
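The comprehensive success rate (Suc) described above can be computed as follows; a transfer counts as successful only if all three human scores are at least four. The example scores are invented for illustration.

```python
def success_rate(annotations):
    """annotations: list of (style, content, fluency) scores, each in 1..5."""
    ok = sum(1 for s, c, f in annotations if s >= 4 and c >= 4 and f >= 4)
    return ok / len(annotations)

# Four annotated transfers; only the first and last meet the threshold
# on all three measures.
scores = [(5, 4, 4), (4, 4, 3), (2, 5, 5), (4, 4, 5)]
print(success_rate(scores))  # 2 of 4 transfers succeed -> 0.5
```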

Yelp restaurant reviews sentiment transfer
The results are shown in Table 2 (left), from which we clearly observe that CAE outperforms the five baseline approaches. Specifically, CAE yields improvements of 5.1, 6.1 and 7.0 points over the best baselines for sentiment transfer in terms of transfer success rate (Transfer), fluency (PPL) and mode preservation (RPPL), respectively.
The template-based method achieves the highest BLEU score, since all content words are guaranteed to be kept by the templates, with only style words replaced by retrieved words. However, it obtains the worst perplexity, indicating that it is very difficult for the template-based method to generate fluent sentences. By contrast, our CAE achieves the lowest perplexity due to its strong LSTM decoders. The cycle-consistent constraint enables CAE to yield the best RPPL, as it palliates mode-dropping in style transfer. Additionally, the adversarial style transfer network, constrained by the cycle consistency loss, facilitates CAE to perform well on both style transfer and content preservation.

Yahoo questions topic transfer
We further evaluated CAE against the five baselines on the Yahoo QA topic transfer task. Results in Table 2 (right) show that CAE obtains better Transfer (+6.1%), PPL (−9), BLEU and RPPL than the best baseline approaches of Cycled RL, Cross-aligned AE and ARAE, demonstrating the advantages of CAE in style transfer, fluency and content preservation. The template-based method again achieves the highest BLEU score, but fails to generate meaningful and fluent sentences (very high PPL and RPPL). This is because simply replacing the style words with unreasonable words may generate semantically incorrect sentences. For example, the template-based method transfers the source sentence "is harrison ford married ?" into "is a state in the married ?", which is meaningless. The template-based method also achieves the highest Transfer, which differs from the results on the Yelp dataset. The reason is that it is easier to differentiate style words from content words in the Yahoo topic dataset than in the Yelp dataset, which makes the template-based method's substitution of style words more accurate. It can be observed that the template-based method achieves high BLEU scores at the cost of fluency and semantic correctness. Taking all four metrics into consideration, we believe that our approach performs better than all the baselines. Table 3 shows the results of the human evaluation. CAE achieves the highest style transfer, content preservation and fluency scores on both datasets. It also obtains the highest comprehensive success rate of style transfer in terms of Suc.

Ablation study
We further conducted ablation experiments on the Yelp dataset to study the effect of the cycle-consistent constraint and the discriminators in CAE. Table 4 shows the results. When we disable the cycle-consistent constraint, the model can still be trained successfully. However, it suffers a significant drop in BLEU and higher PPL and RPPL, with only a marginal improvement in Transfer compared to the full CAE, which again confirms that the cycle-consistent constraint is helpful for content and mode preservation. When we disable the discriminators, the model cannot preserve content, obtaining extremely low BLEU and high RPPL, indicating complete mode collapse. Without the discriminators, CAE has no constraint or guidance for learning the transformation functions and is prone to mode collapse. Serious mode collapse results in poor content preservation. We notice that most transferred sentences in the "negative" style contain the words "not" or "disappointed", while most generated sentences in the "positive" style contain the word "good". The discriminators are therefore important for preventing the model from collapsing and hence for preserving content.

Style-transferred sentences
We display some examples in Table 5 to examine the differences between CAE and the previous approach ARAE. In the first example in Table 5, we can clearly see that CAE correctly detects the sentiment words "great" and "wonderful" and successfully transfers the positive sentiment to the negative sentiment by changing these two words into the negative words "horrible" and "awful". It is worth noting that CAE preserves the substance of the source sentence during this successful style (sentiment) transfer. In contrast, ARAE fails to keep the background and the major content of the original sentences when it struggles to change their style, not only in the first example but also in the other examples. Examples from the Yahoo dataset again demonstrate the advantages of CAE in both style transfer and content preservation over ARAE. It can clearly be seen that CAE is able to learn the patterns of the original questions and to change the topic from "Entertainment & Music" to "Politics & Government", or vice versa, within the frame of the learned patterns.

Comparison with the nearest neighbour sequences from training data
We compared the transferred sequences with the nearest-neighbor sequences from training data based on Jaccard distance (word-level intersection over union for two sequences). The results are listed in Table 6. The transferred sentences are very different from the retrieved nearest sentences in both syntax and semantics. Additionally, they are also very fluent. This suggests that CAE is capable of learning the style knowledge from training instances and generalizing the learned knowledge to generate style-transferred sentences from given source sentences.
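The retrieval described above can be sketched as follows, using word-level Jaccard distance (one minus intersection over union of the two word sets). The tiny corpus here is illustrative, drawn from the nearest-neighbour examples listed below.

```python
def jaccard_distance(a, b):
    # Word-level Jaccard distance between two whitespace-tokenized sequences.
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def nearest_neighbour(query, corpus):
    # Retrieve the training sequence closest to the query.
    return min(corpus, key=lambda s: jaccard_distance(query, s))

train = [
    "horrible atmosphere , horrible service .",
    "best sushi in las vegas !",
    "the philly was dry with no sauce .",
]
print(nearest_neighbour("it has a horrible atmosphere , with awful service .", train))
# -> "horrible atmosphere , horrible service ."
```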

Transferred sentence | Nearest neighbour in training data
it has a horrible atmosphere , with awful service . | horrible atmosphere , horrible service .
definitely a waste of time for sushi in las vegas ! | best sushi in las vegas !
the steak was really dry with my sauce on the salsa . | the philly was dry with no sauce .

Conclusion
We have presented a novel approach, CAE, to unsupervised text style transfer from non-parallel text. We learn latent representations for sequences in different styles with LSTM autoencoders. The learned representations are transferred from their original style to another style via adversarial transfer networks. The transfer networks are equipped with a cycle-consistent constraint to guarantee content preservation during style transfer. Experiments and analyses on the Yelp and Yahoo datasets sufficiently demonstrate the powerful style transfer ability of CAE with good fluency and content preservation against previous methods.