Multiple Text Style Transfer by using Word-level Conditional Generative Adversarial Network with Two-Phase Training

The objective of non-parallel text style transfer, or controllable text generation, is to alter specific attributes (e.g., sentiment, mood, tense, politeness) of a given text while preserving its remaining attributes and content. Generative adversarial networks (GANs) are a popular way to ensure that the transferred sentences are realistic and carry the desired target styles. However, GAN training often suffers from mode collapse, which leaves the transferred text only weakly related to the original text. In this paper, we propose a new GAN model with a word-level conditional architecture and a two-phase training procedure. By applying a style-related condition before generating each word, our model is able to keep style-unrelated words while changing the others. By separating the training procedure into reconstruction and transfer phases, our model learns a proper text generation process, which further improves content preservation. We test our model on polarity sentiment transfer and multiple-attribute transfer tasks. The empirical results show that, on three real-world datasets, our model achieves comparable scores in transfer accuracy and fluency while significantly outperforming other state-of-the-art models in content compatibility.


Introduction
Text style transfer is a challenging problem in natural language generation whose objective is to alter specific attributes of a given text (e.g., sentiment, mood, tense, voice, politeness) (Hu et al., 2017; Shen et al., 2017; Sennrich et al., 2016; Logeswaran et al., 2018; Prabhumoye et al., 2018) while preserving its remaining attributes and content. This task has potential applications such as paraphrasing, article summarization, author obfuscation (Reddy and Knight, 2016), poem and lyric rewriting, and scenario-adaptive machine translation (Michel and Neubig, 2018).
One major challenge for text style transfer is that parallel data across different text attributes is difficult to collect and label. Without parallel data, supervised deep-learning models are not applicable and the transfer rules among styles are unclear. Another main challenge, which sets text apart from image style transfer, is the difficulty of identifying and disentangling neural feature representations for text.
In previous models (Shen et al., 2017; Hu et al., 2017), the core idea for non-parallel text style transfer is to train an auto-encoder with an additional adversarial loss (Goodfellow et al., 2014), or a VAE with a classifier, so that the discriminator (or classifier) guides the decoder to generate sentences with a specific target style.
Despite previous success in attribute-conditioned text generation, several research questions remain for previous models: 1) they are limited to transformations between only a few attributes; 2) they face a trade-off among text fluency, content preservation, and accurate transfer to the desired attributes; and 3) adversarial training is unstable.
To address these issues, we adopt an encoder-decoder framework and propose a novel conditional adversarial training scheme with the following improvements: (1) a word-level attribute condition architecture in both the decoder and the discriminators to capture relations between styles and words; (2) a seq-to-seq attention mechanism; and (3) a two-phase training procedure (reconstruction and transfer phases) for better content preservation. The model is trained with a standard adversarial learning approach. We test our conditional adversarial training on two tasks: (a) polarity style transfer and (b) multiple-attribute style transfer.

Related Works
Text style transfer without parallel data is an active research topic. Mueller et al. (2017) designed a variational auto-encoder (VAE) framework; Hu et al. (2017) used a VAE with controllable attributes; Shen et al. (2017) proposed to adversarially train a Cross-Aligned Auto-Encoder (CAAE) to align two different styles. Several works were proposed to improve performance (Fu et al., 2017; Yang et al., 2018; dos Santos et al., 2018; Logeswaran et al., 2018). Fu et al. (2017) suggested a multi-head decoder to generate sentences with different styles; Yang et al. (2018) utilized language models as discriminators to stabilize training; dos Santos et al. (2018) used a classifier to aid style transfer; Logeswaran et al. (2018) made use of a conditional discriminator for multiple style transfer.
On the other hand, a few works, including Xu et al. (2018), adopt an erase-and-replace approach: they first erase the style-related words and then fill in words of different style attributes. Non-parallel text style transfer is also related to unsupervised machine translation. Prabhumoye et al. (2018), Subramanian et al. (2018), Logeswaran et al. (2018) and dos Santos et al. (2018) apply the back-translation technique from unsupervised machine translation to the style transfer task. Our work follows the framework of CAAE, and we propose several adjustments to improve its performance.

Model Architecture
As shown in Figure 1, our model contains an encoder-decoder $(E, G)$ and two discriminators, $D_{cnn}$ and $D_{rnn}$. We describe each component in detail.

Encoder-Decoder
Following prior work (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018; Logeswaran et al., 2018), we use a seq-to-seq de-noising encoder-decoder $(E, G)$ with an attention mechanism (Bahdanau et al., 2014). For each input sentence $x$ and attribute $\tilde{y}$, the encoder $E$ encodes $x$ into a latent code $z = E(x)$, and the decoder $G$ decodes the transferred sentence $\tilde{x} = G(z, \tilde{y})$, which can be further back-translated to reconstruct the original sentence. Based on this framework, we design a word-level condition model for better results.
Word-level Condition The component that most distinguishes our encoder-decoder from previous work is the conditioning architecture for attributes. Most style transfer works (Hu et al., 2017; Shen et al., 2017; Logeswaran et al., 2018) treat the attributes $y$ as part of the initial vector fed into the RNN cell in the decoder; we argue that the conditioning structure is important to model performance. In our decoder $G$, we embed $y$ into a vector and concatenate this vector with the output $h_t$ of the GRU cell at each time step $t$. More formally, at each time step $t$, the decoder generates the hidden state $h_t$ with a Gated Recurrent Unit (GRU) (Chung et al., 2015), using the content vector $c_t$ from the attention mechanism, and then generates the output probabilities $o_t$ over the vocabulary by applying the projection matrix $W_p$ and bias $b_p$ followed by the softmax function $\sigma$.
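The word-level condition can be illustrated with a minimal Python sketch. It assumes that the attribute embedding is concatenated with $h_t$ before the projection by $W_p$ and $b_p$; the function names, toy dimensions, and weight values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_level_output(h_t, attr_emb, W_p, b_p):
    """Project the concatenation [h_t ; emb(y)] to a vocabulary distribution o_t."""
    joint = h_t + attr_emb  # list concatenation: [h_t ; emb(y)]
    logits = [sum(w * v for w, v in zip(row, joint)) + b
              for row, b in zip(W_p, b_p)]
    return softmax(logits)

# Toy setup (hypothetical): hidden size 2, attribute embedding size 1, vocabulary size 3.
h_t = [0.5, -0.2]
W_p = [[1.0, 0.0, 2.0],    # word 0: favored by the "positive" attribute
       [0.0, 1.0, -2.0],   # word 1: favored by the "negative" attribute
       [0.5, 0.5, 0.0]]    # word 2: logit does not depend on the attribute
b_p = [0.0, 0.0, 0.0]

o_pos = word_level_output(h_t, [1.0], W_p, b_p)
o_neg = word_level_output(h_t, [-1.0], W_p, b_p)
```

With the same hidden state, changing only the attribute embedding shifts the most probable word from word 0 to word 1, while the logit of word 2 is unaffected. This is the kind of per-word adjustment the conditioning is meant to allow.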

Discriminator
Our discriminators take the output probability distributions $o$ as inputs along with the attribute labels $y$. Each discriminator $D \in \{D_{cnn}, D_{rnn}\}$ is built from three functions. $f_{cond}$ is a multi-layer perceptron. $f_{trans}$ performs a global feature transform and is a CNN in $D_{cnn}$ and a bi-directional GRU in $D_{rnn}$. Finally, $f_{disc}$ is a fully-connected layer with the sigmoid function that outputs decisions. For simplicity, we substitute $x$ for $o$ as the input to the discriminators in the following description.
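Reading the three functions in the order they are described, one plausible way to write each discriminator is as a composition. This is a sketch under the assumption that $f_{cond}$ is applied first, to the inputs together with the attribute labels; the exact form used in the model may differ.

```latex
D(o, y) = f_{disc}\big(f_{trans}\big(f_{cond}(o, y)\big)\big), \qquad D \in \{D_{cnn}, D_{rnn}\}
```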

Loss functions
We train our model with a reconstruction loss $L_{rec}$, a back-translation loss $L_{bt}$, a discrimination loss $L_{disc}$, and an adversarial loss $L_{adv}$. $L_{rec}$ and $L_{bt}$ force the decoder $G$ to reconstruct the input sentence $x$ and are expressed as $L_{rec} = -\log p(x \mid z, y)$ and $L_{bt} = -\log p(x \mid \tilde{z}, y)$, where $\tilde{z} = E(\tilde{x})$ is the latent code of the transferred sentence. $L_{disc}$ and $L_{adv}$ force the decoder $G$ to output the transferred sentence $\tilde{x}$ with the correct attributes $\tilde{y}$; both are defined for each discriminator $D \in \{D_{cnn}, D_{rnn}\}$. The total loss $L_{tot}$ for our encoder-decoder $(E, G)$ is a weighted sum of $L_{rec}$, $L_{bt}$, and $L_{adv}$.
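As a rough illustration of how these losses combine, the following Python sketch computes each term from toy numbers. The negative log-likelihood matches the form of $L_{rec}$ and $L_{bt}$; the discriminator and adversarial terms use the standard GAN formulation, which is an assumption here and not necessarily the exact form used in the model. The weights and all probability values are hypothetical.

```python
import math

def nll(token_probs):
    """Negative log-likelihood of a sentence, given the model probability of each gold word."""
    return -sum(math.log(p) for p in token_probs)

def disc_loss(d_real, d_fake):
    """Standard GAN discriminator loss (assumed form): push real pairs to 1, transferred pairs to 0."""
    return -math.log(d_real) - math.log(1.0 - d_fake)

def adv_loss(d_fake):
    """Non-saturating generator loss (assumed form): make D accept the transferred sentence."""
    return -math.log(d_fake)

def total_loss(l_rec, l_bt, l_adv, w_rec=1.0, w_bt=1.0, w_adv=1.0):
    """Weighted sum of the three encoder-decoder losses; the weights are hypothetical."""
    return w_rec * l_rec + w_bt * l_bt + w_adv * l_adv

# Toy values: per-word probabilities and discriminator outputs are made up.
l_rec = nll([0.9, 0.8, 0.95])
l_bt = nll([0.7, 0.6, 0.9])
l_adv = sum(adv_loss(d) for d in (0.4, 0.5))  # one term per discriminator (assumed)
l_tot = total_loss(l_rec, l_bt, l_adv, w_adv=0.5)
```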

Training
We split our training procedure into two phases: a reconstruction phase and a transfer phase. In the reconstruction phase, we first train our encoder-decoder $(E, G)$ with the loss $L_{rec}$. In the transfer phase, we then train the encoder-decoder $(E, G)$ with the total loss $L_{tot}$ and the discriminators $D_{cnn}$, $D_{rnn}$ with $L_{disc}$. Our experiments show that this approach improves content preservation significantly. A diagram of this two-phase training is provided in Appendix A.1.
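The two-phase procedure can be summarized as a schedule of which losses are active in each epoch. The sketch below is only a bookkeeping illustration: the function name and the epoch counts are hypothetical, and the actual parameter updates are omitted.

```python
def training_schedule(n_rec_epochs, n_transfer_epochs):
    """Return a list of (epoch, phase, encoder_decoder_losses, discriminator_losses)."""
    plan = []
    # Phase 1: only the encoder-decoder is trained, with the reconstruction loss.
    for epoch in range(n_rec_epochs):
        plan.append((epoch, "reconstruction", ("L_rec",), ()))
    # Phase 2: the encoder-decoder uses the total loss (L_rec, L_bt, L_adv),
    # and the discriminators are trained with L_disc.
    for epoch in range(n_rec_epochs, n_rec_epochs + n_transfer_epochs):
        plan.append((epoch, "transfer", ("L_rec", "L_bt", "L_adv"), ("L_disc",)))
    return plan

plan = training_schedule(n_rec_epochs=2, n_transfer_epochs=3)
for epoch, phase, gen_losses, disc_losses in plan:
    print(epoch, phase, gen_losses, disc_losses)
```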

Datasets
YelpSent Dataset The preprocessed Yelp dataset (Shen et al., 2017) consists of sentences with a length limit of 15 words, each labeled with either positive or negative sentiment as its attribute.
AmaProd and AmaSent Dataset The Amazon product dataset (He and McAuley, 2016) consists of product reviews associated with ratings of the products. We select 4 product types (books, movies, electronics, CDs) and follow the approach in (Kim et al., 2017) to select relevant sentences.
YelpTense Dataset We use a similar approach to label sentences from the Yelp Dataset Challenge with sentiment and past/present tense. If a sentence contains at least one verb in the past tense, we label it as 'past tense'; otherwise, we label it as 'present tense'.
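The tense-labeling rule can be sketched in a few lines of Python. The sketch assumes that sentences are already part-of-speech tagged with Penn Treebank tags and that only `VBD` counts as a past-tense verb; both the tagset and the function name are assumptions for illustration, since the tagger used is not specified.

```python
def label_tense(tagged_tokens, past_tags=("VBD",)):
    """Label a sentence 'past' if any token carries a past-tense verb tag, else 'present'.

    tagged_tokens: list of (word, pos_tag) pairs, e.g. from any POS tagger.
    past_tags: tags treated as past-tense verbs (assumed: Penn Treebank VBD).
    """
    has_past_verb = any(tag in past_tags for _, tag in tagged_tokens)
    return "past" if has_past_verb else "present"

past_example = [("the", "DT"), ("food", "NN"), ("was", "VBD"), ("great", "JJ")]
present_example = [("the", "DT"), ("food", "NN"), ("is", "VBZ"), ("great", "JJ")]
print(label_tense(past_example), label_tense(present_example))
```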
All the data statistics are presented in Table 5 in Appendix A.2.

Evaluation Metrics
We follow previous works and apply three automatic evaluation metrics and one human evaluation for three indicative aspects: attribute compatibility, content preservation, and fluency.
Attribute Compatibility To measure how well our model transfers sentence attributes, we pre-train a CNN classifier (Kim, 2014) for each attribute category (i.e. product types, sentiments, tenses) on our training data, and use the classifiers to measure the accuracy of transferred sentences associated with the desired attribute. We report the accuracy of each attribute category separately.
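Reporting accuracy separately for each attribute category can be sketched as follows. The classifier predictions are taken as given; the function name, the dictionary layout, and the example labels are hypothetical.

```python
from collections import defaultdict

def accuracy_by_category(predictions, targets):
    """Return {category: fraction of sentences whose predicted label equals the target label}."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    hits = defaultdict(int)
    counts = defaultdict(int)
    for pred, tgt in zip(predictions, targets):
        for category, label in tgt.items():
            counts[category] += 1
            hits[category] += int(pred.get(category) == label)
    return {c: hits[c] / counts[c] for c in counts}

# Desired attributes for two transferred sentences, and what the classifiers predicted.
targets = [{"sentiment": "positive", "tense": "past"},
           {"sentiment": "negative", "tense": "present"}]
predictions = [{"sentiment": "positive", "tense": "present"},
               {"sentiment": "negative", "tense": "present"}]
scores = accuracy_by_category(predictions, targets)
```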

Content Preservation
Measuring content preservation is still an open research problem. Following previous work, we compute the BLEU score (Papineni et al., 2002) used in machine translation. In our experiments, we use self-BLEU, that is, BLEU computed on the transferred sentences with the source sentences as references, as a measurement of content preservation.
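To make the metric concrete, the following is a simplified sentence-level BLEU in Python, scored between a source sentence and its transferred version. It assumes a single reference, uniform n-gram weights, and no smoothing, so it is a sketch of the idea and not the exact scoring script used in the experiments; the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Single-reference BLEU with uniform weights, clipped counts, and a brevity penalty."""
    if not hypothesis:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis, n)
        ref = ngrams(reference, n)
        total = sum(hyp.values())
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty n-gram match gives a zero score
        log_precision += math.log(overlap / total) / max_n
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision)

source = "the food was great and cheap".split()
transferred = "the food was terrible and cheap".split()
score = sentence_bleu(source, transferred, max_n=2)
```

A transferred sentence that changes a single style word keeps most n-grams of the source, so it scores high; an output unrelated to the source scores near zero.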
Fluency To test fluency of the generated sentences, we train a bi-directional LSTM language model on our training data for each dataset. We regard the perplexity of generated sentences as a measure for fluency.
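Perplexity follows directly from the per-token log-probabilities that a language model assigns to a sentence. The sketch below assumes natural-log probabilities are already available from some language model; the function name and the toy values are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

Lower perplexity means the language model finds the sentence more predictable, which is used as a proxy for fluency.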
Human Evaluation We also evaluate the transferred sentences with human assessments. In the evaluation, 8 people are asked to rate sentences on three aspects: attribute compatibility, content preservation, and fluency. Each aspect is rated on a 5-point Likert scale. 20 sentences from YelpSent, together with their corresponding transferred sentences, are randomly selected as test examples. More details about our human evaluation are provided in Appendix A.3.

Comparison with State-of-the-Art Models
We compare with four state-of-the-art models: CAAE, DAR, MultiAttr, and ContPrev. CAAE (Shen et al., 2017) consists of an auto-encoder with discriminator networks to guide text generation. DAR uses a delete-and-retrieve approach. MultiAttr (Subramanian et al., 2018) performs multiple-attribute style transfer using back-translation (Lample et al., 2018, 2017; Artetxe et al., 2017) and latent representation pooling. ContPrev (Logeswaran et al., 2018) is an auto-encoder model with a conditional discriminator for multiple-attribute transfer. In our experiments, CAAE and DAR are compared only on polarity sentiment style transfer tasks. More details about our model and training settings are provided in Appendix A.4.

Quantitative Result
Automatic Evaluation Results Table 1 and Table 2 show the automatic evaluation results. Our model achieves higher BLEU scores and comparable transfer accuracy. We also notice that our model generates sentences with higher perplexity, while other models produce sentences with perplexity lower than the real data.
Human Evaluation Results Table 3 shows the human evaluation results on YelpSent. Our sentences are rated higher on content preservation and receive scores comparable to those of other models on attribute compatibility.
Ablation test We conduct ablation experiments with our model on the YelpSent dataset. As shown in Table 4, removing the word-level condition architecture decreases transfer accuracy and BLEU scores. The two-phase training procedure also yields much higher BLEU scores. Using both the CNN and RNN discriminators slightly improves the performance on all metrics.
Evaluation Curve We also plot the curves of transfer accuracy, BLEU score and perplexity, respectively, on the validation set, as the number of training epochs increases, across different models in Appendix A.5. According to the curves, our model achieves higher BLEU scores and relatively stable performance on all three metrics.

Qualitative Results
Our model exhibits a tendency to follow the surface structure of the original sentence. With the help of the word-level conditional architecture, the decoder learns to make word adjustments. Sample sentences are shown in Tables 7 and 8 in Appendix A.6. For frequently occurring sentence structures across different attribute domains, our …
In the result tables, Sent., Tense and Prod. represent the transfer accuracy measured by the pretrained sentiment-, tense- and product-attribute classifiers, respectively.

Conclusion
In this paper, we conduct non-parallel style transfer among multiple attributes. We propose a seq-to-seq model with word-level conditioning and two-phase training. The empirical results demonstrate that our model outperforms our competitors in the polarity sentiment transfer task on YelpSent. In multiple-attribute transfer tasks, our model also achieves results comparable to the state-of-the-art MultiAttr on YelpTense and AmaProd. We also analyze our model with ablation tests. Although our model achieves better content preservation, the general quality of our transferred sentences can be further improved. In addition, designing proper evaluation metrics is still an open problem for text style transfer. We leave these two questions as future work.