Disentangled Representation Learning for Non-Parallel Text Style Transfer

This paper tackles the problem of disentangling the latent variables of style and content in language models. We propose a simple yet effective approach, which incorporates auxiliary multi-task and adversarial objectives, for label prediction and bag-of-words prediction, respectively. We show, both qualitatively and quantitatively, that the style and content are indeed disentangled in the latent space. This disentangled latent representation learning method is applied to style transfer on non-parallel corpora. We achieve substantially better results in terms of transfer accuracy, content preservation and language fluency, in comparison to previous state-of-the-art approaches.


Introduction
The neural network has been a successful learning machine during the past decade due to its highly expressive modeling capability, which is a consequence of multiple layers of non-linear transformations of input features. Such transformations, however, make intermediate features "latent," in the sense that they do not have explicit meaning and are not interpretable. Therefore, neural networks are usually treated as black-box machinery.
Disentangling the latent space of neural networks has become an increasingly important research topic. In the image domain, for example, Chen et al. (2016) use adversarial and information maximization objectives to produce interpretable latent representations that can be tweaked to adjust writing style for handwritten digits, as well as lighting and orientation for face models. Mathieu et al. (2016) utilize a convolutional autoencoder to achieve the same objective. However, this problem is not well explored in natural language processing.
In this paper, we address the problem of disentangling the latent space of neural networks for text generation. Our model is built on an autoencoder that encodes a sentence to the latent space (vector representation) by learning to reconstruct the sentence itself. We would like the latent space to be disentangled with respect to different features, namely, style and content in our task.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Our code is publicly available at https://github.com/vineetjohn/linguistic-style-transfer
To accomplish this, we propose a simple approach that combines multi-task and adversarial objectives. We artificially divide the latent representation into two parts: the style space and content space. In this work, we consider the sentiment of a sentence as the style. We design auxiliary losses, enforcing the separation of style and content latent spaces. In particular, the multi-task loss operates on a latent space to ensure that the space does contain the information we wish to encode. The adversarial loss, on the contrary, minimizes the predictability of information that should not be contained in that space. In previous work, researchers typically work with the style, or specifically, sentiment space (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018), but simply ignore the content space, as it is hard to formalize what "content" actually refers to.
In our paper, we propose to approximate the content information by bag-of-words (BoW) features, where we focus on style-neutral non-stopwords. Along with traditional style-oriented auxiliary losses, our BoW multi-task loss and BoW adversarial loss make the style and content spaces much more disentangled from each other.
The learned disentangled latent space can be directly used for text style transfer (Hu et al. 2017; Shen et al. 2017), which aims to transform a given sentence into a new sentence with the same content but a different style. Since it is difficult to obtain training sentence pairs with the same content and differing styles (i.e., parallel corpora), we follow the setting where we train our model on a non-parallel but style-labeled corpus. We call this non-parallel text style transfer. To accomplish this, we train an autoencoder with disentangled latent spaces. For style-transfer inference, we simply use the autoencoder to encode the content vector of a sentence, but ignore its encoded style vector. We then infer, from the training data, an empirical embedding of the style to which we would like to transfer. The encoded content vector and the empirically inferred style vector are concatenated and fed to the decoder. This grafting technique enables us to obtain a new sentence similar in content to the input sentence, but with a different style.
We conducted experiments on two customer review datasets. Qualitative and quantitative results show that both the style and content spaces are indeed disentangled well. In the style-transfer evaluation, we achieve substantially better style-transfer strength, content preservation, and language fluency scores, compared with previous results. Ablation tests also show that the auxiliary losses can be combined well, each playing its own role in disentangling the latent space.

Related Work
Disentangling neural networks' latent space has been explored in the image processing domain in recent years, and researchers have successfully disentangled rotation features, color features, etc. of images (Chen et al. 2016; Luan et al. 2017). Some image characteristics (e.g., artistic style) can be captured well by certain statistics (Gatys, Ecker, and Bethge 2016). In other work, researchers adopt data augmentation techniques to learn a disentangled latent space (Kulkarni et al. 2015; Champandard 2016).
In natural language processing, the definition of "style" itself is vague, and as a convenient starting point, NLP researchers often treat sentiment as a salient style of text. Hu et al. (2017) manage to control the sentiment by using discriminators to reconstruct sentiment and content from generated sentences. However, there is no evidence that the latent space would be disentangled by this reconstruction. Shen et al. (2017) use a pair of adversarial discriminators to align the recurrent hidden decoder states of original and style-transferred sentences, for a given style. Fu et al. (2018) propose two approaches: training style-specific embeddings, and training separate style-specific decoders for style-transfer. They apply an adversarial loss on the encoded space to discourage encoding style in the latent space of an autoencoding model. All the above approaches only deal with the style information and simply ignore the content part. Zhao et al. (2018) extend the multi-decoder approach and use a Wasserstein-distance penalty to align content representations of sentences with different styles. However, the Wasserstein penalty is applied to empirical samples from the data distribution, and is more indirect than our BoW-based auxiliary losses. Recently, Rao and Tetreault (2018) treat the formality of writing as a style, and create a parallel corpus for style transfer with sequence-to-sequence models. This is beyond the scope of our paper, as we focus on non-parallel text style transfer.
Our paper differs from previous work in that both our style space and content space are encoded from the input, and we design several auxiliary losses to ensure that each space encodes and only encodes the desired information. Such disentanglement of latent space has its own research interest in the deep learning community. The disentangled representation can be directly applied to non-parallel text style-transfer tasks, as in the aforementioned studies.

Approach
In this section, we describe our approach in detail, shown in Figure 1. Our model is built upon an autoencoder with a sequence-to-sequence neural network (Sutskever, Vinyals, and Le 2014), and we design multi-task and adversarial losses for both style and content spaces. Finally, we present how style-transferred sentences are generated from the disentangled latent space.

Autoencoder
An autoencoder encodes an input to a latent vector space, from which it reconstructs the input itself. The latent vector space is usually of much smaller dimensionality than the input data, and the autoencoder learns salient and compact representations of data during the reconstruction process. Let $x = (x_1, x_2, \cdots, x_n)$ be an input sequence with $n$ tokens. The encoder recurrent neural network (RNN) with gated recurrent units (GRU) (Cho et al. 2014) encodes $x$ and obtains a hidden vector representation $h$, which is linearly transformed from the encoder RNN's final hidden state.
Then a decoder RNN generates a sentence, which ideally should be $x$ itself. Suppose at time step $t$, the decoder RNN predicts the word $x_t$ with probability $p(x_t \mid h, x_1 \cdots x_{t-1})$. Then the autoencoder is trained with a sequence-aggregated cross-entropy loss, given by
\[ J_{\mathrm{AE}}(\theta_E, \theta_D) = -\sum_{t=1}^{n} \log p(x_t \mid h, x_1 \cdots x_{t-1}) \tag{1} \]
where $\theta_E$ and $\theta_D$ are the parameters of the encoder and decoder, respectively. Both the encoder and decoder are deterministic functions in the original autoencoder model (Rumelhart, Hinton, and Williams 1985), and thus we call it a deterministic autoencoder (DAE).
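As a small illustration of the sequence-aggregated cross-entropy described above, the following sketch sums the negative log-probabilities that a decoder assigns to the reference tokens. The function name and the example probabilities are hypothetical; the decoder itself is not modeled here.

```python
import math

def sequence_cross_entropy(step_probs):
    """Sum of negative log-probabilities of the reference tokens.

    step_probs[t] stands for p(x_t | h, x_1 ... x_{t-1}), i.e. the probability
    the decoder assigns to the correct token at step t (hypothetical values).
    """
    return -sum(math.log(p) for p in step_probs)

# Hypothetical decoder probabilities for a 3-token sentence.
loss = sequence_cross_entropy([0.5, 0.25, 0.8])
```

A perfect reconstruction (probability 1 at every step) gives a loss of zero, and the loss grows as any step becomes less likely.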
Variational Autoencoder. In addition to DAE, we also implement a variational autoencoder (VAE) (Kingma and Welling 2014), which imposes a probabilistic distribution on the latent vector. The Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) penalty is added to the loss function to regularize the latent space. The decoder reconstructs data based on the sampled latent vector from its posterior distribution.
Formally, the autoencoding loss in the VAE is
\[ J_{\mathrm{VAE}}(\theta_E, \theta_D) = -\mathbb{E}_{q_E(h \mid x)}\left[\log p(x \mid h)\right] + \lambda_{\mathrm{kl}} \, \mathrm{KL}\left(q_E(h \mid x) \,\|\, p(h)\right) \tag{2} \]
where $\lambda_{\mathrm{kl}}$ is the hyperparameter balancing the reconstruction loss and the KL term. $p(h)$ is the prior, set to the standard normal distribution $\mathcal{N}(0, I)$. $q_E(h \mid x)$ is the posterior taking the form $\mathcal{N}(\mu, \operatorname{diag} \sigma)$, where $\mu$ and $\sigma$ are predicted by the encoder network. The motivation for using VAE as opposed to DAE is that the reconstruction is based on the samples of the posterior, which populates encoded vectors to the neighborhood and thus smooths the latent space. Bowman et al. (2016) show that VAE enables more fluent sentence generation from a latent space than DAE.
The autoencoding losses in Equations (1,2) serve as our primary training objective. In addition, the autoencoder is used for text generation in the style-transfer application.
We also design several auxiliary losses to disentangle the latent space. In particular, we hope that $h$ can be separated into two spaces $s$ and $c$, representing style and content, respectively, i.e., $h = [s; c]$, where $[\cdot; \cdot]$ denotes concatenation. This is accomplished by the auxiliary losses described in the rest of this section.
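To make the KL-regularized VAE objective of this subsection concrete, here is a minimal sketch of the standard closed-form KL divergence between a diagonal Gaussian posterior and the standard normal prior. It assumes that `sigma` holds per-dimension standard deviations; the function names are hypothetical and the reconstruction term is passed in as a plain number.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumption: sigma contains standard deviations (all > 0).
    """
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def vae_loss(reconstruction_loss, mu, sigma, lambda_kl):
    """Reconstruction loss plus the weighted KL penalty."""
    return reconstruction_loss + lambda_kl * kl_to_standard_normal(mu, sigma)

# Hypothetical 1-dimensional posterior with mean 1 and unit variance.
example = vae_loss(2.0, [1.0], [1.0], lambda_kl=0.03)
```

The penalty vanishes exactly when the posterior equals the prior, which is what pulls encoded sentences toward a shared, smoother region of the latent space.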

Style-Oriented Losses
We first design auxiliary losses that ensure the style information is contained in the style space $s$. This involves a multi-task loss that ensures $s$ is discriminative for the style, as well as an adversarial loss that ensures $c$ is not discriminative for the style.
Multi-Task Loss for Style. Although the corpus we use is non-parallel, we assume that each sentence is labeled with its style. In particular, we treat the sentiment as the style of interest, following previous work (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018; Zhao et al. 2018), and each sentence is labeled with a binary sentiment tag (positive or negative).
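We build a classifier on the style space that predicts the style label. Formally, a two-way softmax layer (equivalent to logistic regression) is applied to the style vector $s$, given by
\[ y_s = \operatorname{softmax}(W_{\mathrm{mul}(s)} s + b_{\mathrm{mul}(s)}) \tag{3} \]
where $\theta_{\mathrm{mul}(s)} = [W_{\mathrm{mul}(s)}; b_{\mathrm{mul}(s)}]$ are parameters for multi-task learning of style, and $y_s$ is the output of the softmax layer.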
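The classifier is trained with a simple cross-entropy loss against the ground truth distribution $t_s(\cdot)$, given by
\[ J_{\mathrm{mul}(s)}(\theta_E; \theta_{\mathrm{mul}(s)}) = -\sum_{l \in \text{labels}} t_s(l) \log y_s(l) \tag{4} \]
where $\theta_E$ are the encoder's parameters.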
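The style classifier and its cross-entropy loss can be sketched in a few lines of plain Python. This is only an illustration under assumed values: the 2-dimensional style vector, the weights `W` and `b`, and the helper names are hypothetical, and no gradient computation is shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def cross_entropy(target, predicted):
    """-sum_l t(l) log y(l); terms with t(l) = 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical 2-dimensional style vector and classifier parameters.
s = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
y_s = softmax(linear(W, b, s))          # predicted distribution over 2 labels
loss = cross_entropy([1.0, 0.0], y_s)   # one-hot ground truth: first label
```

In the multi-task setting this loss would be minimized jointly with the reconstruction loss, so that the encoder is pushed to place label information in the style vector.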
We train the style classifier at the same time as the autoencoding loss. Thus, this could be viewed as multi-task learning, incentivizing the entire model to not only decode the sentence, but also predict its sentiment from the style vector s. We denote it by "mul(s)." The idea of multi-task losses is not new and has been used in previous work for sequence-to-sequence learning (Luong et al. 2015), sentence representation learning (Jernite, Bowman, and Sontag 2017) and sentiment analysis (Balikas, Moura, and Amini 2017), among others.
Adversarial Loss for Style. The above multi-task loss only ensures that the style space contains style information. However, the content space might also contain style information, which is undesirable for disentanglement.
We thus apply an adversarial loss to discourage the content space from containing style information. The idea is to first introduce a classifier, called an adversary, that deliberately discriminates the true style label using the content vector $c$. Then the encoder is trained to learn a content vector space from which its adversary cannot predict style information.
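Concretely, the adversarial discriminator and its training objective have a form similar to Equations 3 and 4, but with different input and parameters, given by
\[ y_s = \operatorname{softmax}(W_{\mathrm{dis}(s)} c + b_{\mathrm{dis}(s)}) \]
\[ J_{\mathrm{dis}(s)}(\theta_{\mathrm{dis}(s)}) = -\sum_{l \in \text{labels}} t_s(l) \log y_s(l) \]
where $\theta_{\mathrm{dis}(s)} = [W_{\mathrm{dis}(s)}; b_{\mathrm{dis}(s)}]$ are the parameters of the adversary.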
It should be emphasized that, for the adversary, the gradients are not propagated back to the autoencoder, i.e., the variables in $c$ are treated as shallow features. Therefore, we view $J_{\mathrm{dis}(s)}$ as a function of $\theta_{\mathrm{dis}(s)}$ only, whereas $J_{\mathrm{mul}(s)}$ is a function of both $\theta_E$ and $\theta_{\mathrm{mul}(s)}$.
Having trained an adversary, we would like the autoencoder to be tuned in such an ad hoc fashion that $c$ is not discriminative for style. In existing literature, there are different approaches, for example, maximizing the adversary's loss (Shen et al. 2017; Zhao et al. 2018) or maximizing the entropy of the adversary's prediction (Fu et al. 2018). In our work, we adopt the latter, as it can be easily extended to multi-category classification, used for the content-oriented losses of our approach. Formally, the adversarial objective for the style is to maximize
\[ J_{\mathrm{adv}(s)}(\theta_E) = H(y_s \mid c; \theta_{\mathrm{dis}(s)}) \]
where $H(p) = -\sum_{i \in \text{labels}} p_i \log p_i$ and $y_s$ is the predicted distribution over the style labels. Here, $J_{\mathrm{adv}(s)}$ is maximized with respect to the encoder, and attains its maximum value when $y_s$ is a uniform distribution. It is viewed as a function of $\theta_E$, and we fix $\theta_{\mathrm{dis}(s)}$.
While adversarial loss has been explored in previous style-transfer papers (Shen et al. 2017; Fu et al. 2018), it has not been combined with the multi-task loss. As we shall show in our experiments, combining these two losses is effective, achieving better style-transfer performance than a variety of previous state-of-the-art methods.
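The entropy-based adversarial objective can be checked numerically: the entropy of the adversary's predicted distribution is largest when the prediction is uniform, i.e., when the adversary cannot tell the styles apart. The sketch below uses hypothetical predictions and a hypothetical helper name.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

# Hypothetical adversary outputs over two style labels.
uniform = entropy([0.5, 0.5])      # adversary is maximally uncertain
confident = entropy([0.99, 0.01])  # adversary recovers the style easily
```

Because the same function applies to a distribution over any number of categories, it carries over unchanged from binary style labels to a vocabulary-sized prediction.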

Content-Oriented Losses
The above style-oriented losses only regularize style information, but they do not impose any constraint on how the content information should be encoded. This also happens in most previous work (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018). Although the style space is usually much smaller than the content space, it is unrealistic to expect that the content would not flow into the style space because of its limited capacity. Therefore, we need to design content-oriented auxiliary losses to regularize the content information. Inspired by the above combination of multi-task and adversarial losses, we apply the same idea to the content space. However, it is usually hard to define what "content" actually refers to.
To this end, we propose to approximate the content information by bag-of-words (BoW) features. The BoW features of an input sentence form a vector, each element indicating the probability of a word's occurrence in the sentence. For a sentence $x$ with $N$ words, the word $w_*$'s BoW probability is given by
\[ t_c(w_*) = \frac{\sum_{i=1}^{N} \mathbb{I}\{w_i = w_*\}}{N} \]
where $t_c(\cdot)$ denotes the target distribution of content, and $\mathbb{I}\{\cdot\}$ is an indicator function. Here, we only consider content words, excluding stopwords and style-specific words, since we focus on "content" information. In particular, we exclude sentiment words from a curated lexicon (Hu and Liu 2004) for sentiment style transfer. The effect of using different vocabularies for BoW is analyzed in Supplemental Material A.
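The BoW target distribution can be sketched as a relative-frequency count over the retained content words. The helper name and the tiny exclusion set below are hypothetical stand-ins for a real stopword list and sentiment lexicon, and the sketch assumes that normalization is over the retained words only, so that the target sums to one.

```python
from collections import Counter

def bow_target(tokens, excluded):
    """Relative frequency of each content word in a sentence.

    Assumption: excluded words (stopwords, sentiment words) are dropped
    before normalizing, so the returned values sum to 1.
    """
    content = [w for w in tokens if w not in excluded]
    if not content:
        return {}
    counts = Counter(content)
    n = len(content)
    return {w: c / n for w, c in counts.items()}

# Hypothetical exclusion set: a few stopwords plus two sentiment words.
excluded = {"the", "is", "and", "excellent", "exceptional"}
tokens = "the food is excellent and the service is exceptional".split()
t_c = bow_target(tokens, excluded)
```

For this sentence only "food" and "service" survive the filter, so the target places half of its mass on each.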
Multi-Task Loss for Content. Similar to the style-oriented loss, the multi-task loss for content, denoted as "mul(c)", ensures that the content space $c$ contains content information, i.e., BoW features.
We introduce a softmax classifier over the BoW vocabulary
\[ y_c = \operatorname{softmax}(W_{\mathrm{mul}(c)} c + b_{\mathrm{mul}(c)}) \]
where $\theta_{\mathrm{mul}(c)} = [W_{\mathrm{mul}(c)}; b_{\mathrm{mul}(c)}]$ are the classifier's parameters, and $y_c$ is the predicted BoW distribution.
The training objective is a cross-entropy loss against the ground truth distribution $t_c(\cdot)$, given by
\[ J_{\mathrm{mul}(c)}(\theta_E; \theta_{\mathrm{mul}(c)}) = -\sum_{w \in \text{vocab}} t_c(w) \log y_c(w) \tag{9} \]
where the optimization is performed with both encoder parameters $\theta_E$ and the multi-task classifier $\theta_{\mathrm{mul}(c)}$. Notice that although the target distribution for BoW prediction is not one-hot, the cross-entropy loss (Equation 9) has the same form.
It is also interesting that, at first glance, the multi-task loss for content appears to be redundant, given the autoencoding loss, when in fact, it is not. The multi-task loss only considers content words, which do not include stopwords and sentiment words, and is only applied to the content space c. This ensures that the content information is captured in the content space. The autoencoding loss only requires that the model reconstructs the sentence based on the content and style space, and does not ensure their separation.
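Adversarial Loss for Content. To ensure that the style space does not contain content information, we design our final auxiliary loss, the adversarial loss for content, denoted as "adv(c)." We build an adversary, a softmax classifier on the style space to predict BoW features, approximating content information, given by
\[ y_c = \operatorname{softmax}(W_{\mathrm{dis}(c)} s + b_{\mathrm{dis}(c)}) \]
\[ J_{\mathrm{dis}(c)}(\theta_{\mathrm{dis}(c)}) = -\sum_{w \in \text{vocab}} t_c(w) \log y_c(w) \]
where $\theta_{\mathrm{dis}(c)} = [W_{\mathrm{dis}(c)}; b_{\mathrm{dis}(c)}]$ are the classifier's parameters for BoW prediction.
The adversarial loss for the model is to maximize the entropy of the discriminator's prediction,
\[ J_{\mathrm{adv}(c)}(\theta_E) = H(y_c \mid s; \theta_{\mathrm{dis}(c)}) \]
Again, $J_{\mathrm{dis}(c)}$ is trained with respect to the discriminator's parameters $\theta_{\mathrm{dis}(c)}$, whereas $J_{\mathrm{adv}(c)}$ is trained with respect to $\theta_E$, similar to the adversarial loss for style.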

Training Process
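The overall loss $J_{\mathrm{ovr}}$ for the autoencoder comprises several terms: the reconstruction objective, the multi-task objectives for style and content, and the adversarial objectives for style and content:
\[ J_{\mathrm{ovr}} = J_{\mathrm{AE}}(\theta_E, \theta_D) + \lambda_{\mathrm{mul}(s)} J_{\mathrm{mul}(s)}(\theta_E, \theta_{\mathrm{mul}(s)}) - \lambda_{\mathrm{adv}(s)} J_{\mathrm{adv}(s)}(\theta_E) + \lambda_{\mathrm{mul}(c)} J_{\mathrm{mul}(c)}(\theta_E, \theta_{\mathrm{mul}(c)}) - \lambda_{\mathrm{adv}(c)} J_{\mathrm{adv}(c)}(\theta_E) \]
where the $\lambda$'s are the hyperparameters that balance the autoencoding loss and these auxiliary losses.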
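A minimal sketch of how the terms of the overall loss might be combined is shown below. It assumes that the two adversarial entropy terms enter with a negative sign, since they are maximized while the total is minimized; the function name is hypothetical, the default weights are the values reported in the experiment settings, and the loss values in the example are made up.

```python
def overall_loss(j_rec, j_mul_s, j_adv_s, j_mul_c, j_adv_c,
                 lam_mul_s=10.0, lam_adv_s=1.0, lam_mul_c=3.0, lam_adv_c=0.03):
    """Weighted sum of the reconstruction and auxiliary objectives.

    Assumption: adversarial entropies are subtracted because the encoder
    maximizes them while minimizing everything else.
    """
    return (j_rec
            + lam_mul_s * j_mul_s - lam_adv_s * j_adv_s
            + lam_mul_c * j_mul_c - lam_adv_c * j_adv_c)

# Hypothetical per-batch loss values.
total = overall_loss(j_rec=1.0, j_mul_s=0.1, j_adv_s=0.5, j_mul_c=0.2, j_adv_c=2.0)
```

With this sign convention, raising the adversary's prediction entropy lowers the total, so a single minimization step serves all five objectives at once.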
To put it all together, the model training involves an alternation of optimizing the discriminator losses $J_{\mathrm{dis}(s)}$ and $J_{\mathrm{dis}(c)}$, and the model's own loss $J_{\mathrm{ovr}}$, shown in Algorithm 1.
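The alternation described above can be outlined as a simple loop. This is only a schematic under stated assumptions, not a reproduction of Algorithm 1: the two update callbacks are hypothetical placeholders for real optimizer steps, and updating the discriminators before the autoencoder within each mini-batch is an assumed ordering.

```python
def train(batches, update_discriminators, update_autoencoder, epochs=1):
    """Alternate discriminator and autoencoder updates on each mini-batch."""
    for _ in range(epochs):
        for batch in batches:
            # Step 1: fit the adversaries (J_dis(s), J_dis(c)); encoder held fixed.
            update_discriminators(batch)
            # Step 2: minimize the overall loss J_ovr; adversaries held fixed.
            update_autoencoder(batch)

# Record the order of calls with stand-in callbacks and two dummy batches.
calls = []
train([0, 1],
      update_discriminators=lambda b: calls.append(("dis", b)),
      update_autoencoder=lambda b: calls.append(("ae", b)))
```

In a real implementation each callback would wrap a forward pass, a loss computation, and an optimizer step restricted to the relevant parameter group.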

Generating Style-Transferred Sentences
A direct application of our disentangled latent space is style transfer for natural language generation. For example, we can generate a sentence with generally the same meaning (content) but a different style (e.g., sentiment). Let $x^*$ be an input sentence with $s^*$ and $c^*$ being the encoded, disentangled style and content vectors, respectively. If we would like to transfer its content to a different style, we compute an empirical estimate of the target style's vector $\hat{s}$ using
\[ \hat{s} = \frac{\sum_{i \in \text{target style}} s_i}{\#\,\text{target style samples}} \]
The inferred target style $\hat{s}$ is concatenated with the encoded content $c^*$ for decoding style-transferred sentences, as shown in Figure 1b.
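The grafting step can be sketched as an average followed by a concatenation. The vectors below are hypothetical toy values (the reported style and content sizes are 8 and 128), the function names are illustrative, and the decoder that would consume the result is omitted.

```python
def mean_style_vector(style_vectors):
    """Element-wise mean of the style vectors of all target-style samples."""
    n = len(style_vectors)
    dim = len(style_vectors[0])
    return [sum(v[d] for v in style_vectors) / n for d in range(dim)]

def graft(style_hat, content):
    """Build the decoder input h = [s_hat; c] by concatenation."""
    return list(style_hat) + list(content)

# Hypothetical 2-d style vectors of two target-style training sentences,
# and a 1-d content vector of the sentence to be transferred.
s_hat = mean_style_vector([[1.0, 2.0], [3.0, 4.0]])
h = graft(s_hat, [0.5])
```

Only the content half of the input sentence's encoding is kept; its own style vector is discarded and replaced by the averaged target style.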

Experiments

Datasets
We conducted experiments on two datasets, Yelp and Amazon reviews. Both of these datasets comprise sentences accompanied by binary sentiment labels (positive, negative). They are used to train latent space disentanglement as well as to evaluate sentiment transfer.
Yelp Service Reviews. We used a Yelp review dataset, following previous work (Shen et al. 2017; Zhao et al. 2018). It contains 444101, 63483 and 126670 labeled reviews for train, validation, and test, respectively. The maximum review length is 15 words, and the vocabulary size is approximately 9200.
Amazon Product Reviews. We further evaluate our model with an Amazon review dataset, following another previous paper (Fu et al. 2018). It contains 559142, 2000 and 2000 labeled reviews for train, validation, and test, respectively. The maximum review length is 20 words, and the vocabulary size is approximately 58000.

Experiment Settings
We used the Adam optimizer (Kingma and Ba 2014) for the autoencoder and the RMSProp optimizer (Tieleman and Hinton 2012) for the discriminators, following adversarial training stability tricks (Arjovsky, Chintala, and Bottou 2017). Each optimizer has an initial learning rate of $10^{-3}$. Our model is trained for 20 epochs, by which time it has mostly converged. The word embedding layer was initialized by word2vec (Mikolov et al. 2013) trained on respective training sets. Both the autoencoder and the discriminators are trained once per mini-batch with $\lambda_{\mathrm{mul}(s)} = 10$, $\lambda_{\mathrm{mul}(c)} = 3$, $\lambda_{\mathrm{adv}(s)} = 1$, and $\lambda_{\mathrm{adv}(c)} = 0.03$. These hyperparameters were tuned by performing a log-scale grid search within two orders of magnitude around the default value 1, and choosing those that yielded the best validation results. The recurrent unit size is 256, the style vector size is 8, and the content vector size is 128. We append the latent vector $h$ to the hidden state at every time step of the decoder.
For the VAE model, we enforce the KL-divergence penalty on both the style and content posterior distributions, using $\lambda_{\mathrm{kl}(s)}$ and $\lambda_{\mathrm{kl}(c)}$, respectively. We set $\lambda_{\mathrm{kl}(s)} = 0.03$ and $\lambda_{\mathrm{kl}(c)} = 0.03$ and use the sigmoid KL-weight annealing schedule following Bahuleyan et al. (2018). They were tuned in the same manner as the other hyperparameters of the model.

Experiment I: Disentangling Latent Space
First, we analyze how the style (sentiment) and content of the latent space are disentangled. We train classifiers on the different latent spaces, and report their inference-time classification accuracies in Table 1.
We see that the 128-dimensional content vector $c$ is not particularly discriminative for style. It achieves accuracies only slightly better than majority guess. However, the 8-dimensional style vector $s$, despite its low dimensionality, achieves substantially higher style classification accuracy. When combining content and style vectors, we observe no further improvement. These results verify the effectiveness of our disentangling approach, as the style space contains style information, whereas the content space does not.
We show t-SNE plots of both the deterministic autoencoder (DAE) and the variational autoencoder (VAE) models in Figure 2. As seen, sentences with different styles are cleanly separated in the style space (LHS), but are indistinguishable in the content space (RHS). It is also evident that the latent space learned by the variational autoencoder is considerably smoother and more continuous than the one learned by the deterministic autoencoder.
We show t-SNE plots for ablation tests with different combinations of auxiliary losses in Supplemental Material B.

Experiment II: Non-Parallel Text Style Transfer
We also conducted sentiment transfer experiments with our disentangled latent space.
Metrics. We evaluate competing models based on (1) style transfer strength, (2) content preservation and (3) quality of generated language. The evaluation of generated sentences is a difficult task in contemporary literature, so we adopt a few automatic metrics and use human judgment as well.
• Style-Transfer Accuracy. We follow most previous work (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018) and train a separate convolutional neural network (CNN) to predict the sentiment of a sentence (Kim 2014), which is then used to approximate the style transfer accuracy. In other words, we report the CNN classifier's accuracy on the style-transferred sentences, considering the target style to be the ground truth.
While the style classifier itself may not be perfect, it achieves a reasonable sentiment accuracy on the validation sets (97% for Yelp; 82% for Amazon). Thus, it provides a quantitative way of evaluating the strength of style transfer.

Table 2: Performance of non-parallel text style transfer. The style-embedding approach achieves poor transfer accuracy, and should not be considered as an effective style-transfer model. Despite this, our model outperforms other previous methods in terms of all aspects (transfer strength, content preservation, and language fluency). Numbers with the † symbol are quoted from respective papers. Others are based on our replication using the published code in previous work. Our replicated experiments achieve 0.809 and 0.835 transfer accuracy on the Yelp dataset, close to the results in Shen et al. (2017) and Zhao et al. (2018), respectively, showing that our replication is fair.

• Cosine Similarity. We followed Fu et al. (2018) and computed a sentence embedding by concatenating the min, max, and mean of its word embeddings (sentiment words removed). Then, we computed the cosine similarity between the source and generated sentence embeddings, which is intended to be an indicator of content preservation.
• Word Overlap. We find that the cosine similarity measure, although correlated with human judgment, is not a sensitive measure, and we propose a simple yet effective measure that counts the unigram word overlap rate of the original sentence $x$ and the style-transferred sentence $y$, computed by $\frac{\operatorname{count}(w_x \cap w_y)}{\operatorname{count}(w_x \cup w_y)}$.
• Language Fluency. We use a trigram Kneser-Ney (KN) smoothed language model (Kneser and Ney 1995) as a quantitative and automated metric to evaluate the fluency of a sentence. It estimates the empirical distribution of trigrams in a corpus, and computes the log-likelihood of a test sentence. We train the language model on the respective dataset, and report the Kneser-Ney language model's log-likelihood. A larger (closer to zero) number indicates a more fluent sentence.
• Manual Evaluation. In addition to the above automatic metrics, we also conduct human evaluations to further confirm the performance of our model. This was done on the Yelp dataset only, due to the amount of manual effort involved. We asked 6 human evaluators to rate each sentence on a 1-5 Likert scale (Stent, Marge, and Singhai 2005) in terms of transfer strength, content similarity, and language quality. This evaluation was conducted in a strictly blind fashion: samples obtained from all evaluated models are randomly shuffled, so that the evaluator would be unaware of which model generated a particular sentence. The inter-rater agreement, as measured by Krippendorff's alpha (Krippendorff 2004) for our Likert scale ratings, is 0.74, 0.68, and 0.72 for transfer strength, content preservation, and language quality, respectively. According to Krippendorff (2004), this is an acceptable inter-rater agreement.
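The two content-preservation metrics in the list above can be sketched as follows. This is an illustration under simplifying assumptions: tokens are obtained by whitespace splitting, no stopword or sentiment-word filtering is applied, and the word vectors are hypothetical toy values rather than trained embeddings.

```python
import math

def word_overlap(source, transferred):
    """Unigram overlap rate: |w_x ∩ w_y| / |w_x ∪ w_y| over word sets."""
    wx, wy = set(source.split()), set(transferred.split())
    union = wx | wy
    return len(wx & wy) / len(union) if union else 0.0

def sentence_embedding(word_vectors):
    """Concatenate the element-wise min, max, and mean of the word vectors."""
    dims = range(len(word_vectors[0]))
    n = len(word_vectors)
    mins = [min(v[d] for v in word_vectors) for d in dims]
    maxs = [max(v[d] for v in word_vectors) for d in dims]
    means = [sum(v[d] for v in word_vectors) / n for d in dims]
    return mins + maxs + means

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

overlap = word_overlap("great deal", "horrible deal")   # shares only "deal"
emb = sentence_embedding([[1.0, 2.0], [3.0, 0.0]])      # toy 2-d word vectors
```

The overlap rate reacts to every changed word, which is one way to see why it can be more sensitive than a cosine similarity computed on pooled embeddings.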
Results and Analysis. We compare our approach with previous state-of-the-art work in Table 2. For competing methods, we quoted results from existing papers whenever possible, and replicated the experiments to report other metrics with publicly available code (Shen et al. 2017; Fu et al. 2018; Zhao et al. 2018). As discussed in the caption of Table 2, our replication involves reasonable efforts and is fair for comparison. We observe that the style embedding model (Fu et al. 2018) performs poorly on the style-transfer objective, resulting in inflated cosine similarity and word overlap scores. We also examined the number of times each model generates exact copies of the source sentences during style transfer. We notice that the style-embedding model simply reconstructs the exact source sentence 24% of the time, whereas all other models do this less than 6% of the time. Therefore, we do not think that the style embedding approach is an effective model for text style transfer.
The other two competing methods (Shen et al. 2017; Zhao et al. 2018) achieve reasonable transfer accuracy and cosine similarity. However, our model outperforms them by 10% in transfer accuracy, as well as in content-preservation scores (measured by cosine similarity and the word overlap rate). This shows our model is able to generate high-quality style-transferred sentences, which in turn indicates that the latent space is well disentangled into style and content subspaces. Regarding language fluency, we see that VAE is better than DAE in both experiments. This is expected as VAE regularizes the latent space by imposing a probabilistic distribution. We also see that our method achieves considerably more fluent sentences than competing methods, showing that our multi-task and adversarial losses are more "natural" than other methods, for example, aligning RNN hidden states (Shen et al. 2017).
Table 3 presents the results of human evaluation. Again, we see that the style embedding model (Fu et al. 2018) is ineffective as it has a very low transfer strength, and that our method outperforms other baselines in all aspects. The results are consistent with the automatic metrics in both experiments (Table 2). This implies that the automatic metrics we used are reasonable; it also shows consistent evidence of the effectiveness of our approach.

Table 5 (style-transfer examples)

Positive to negative:
Original (Positive): the food is excellent and the service is exceptional
DAE Transferred (Negative): the food was a bit bad but the staff was exceptional
VAE Transferred (Negative): the food was bland and i am not thrilled with this

Original (Positive): the waitresses are friendly and helpful
DAE Transferred (Negative): the guys are rude and helpful
VAE Transferred (Negative): the waitresses are rude and are lazy

Original (Positive): the restaurant itself is romantic and quiet
DAE Transferred (Negative): the restaurant itself is awkward and quite crowded
VAE Transferred (Negative): the restaurant itself was dirty

Original (Positive): great deal
DAE Transferred (Negative): horrible deal
VAE Transferred (Negative): no deal

Original (Positive): both times i have eaten the lunch buffet and it was outstanding
DAE Transferred (Negative): their burgers were decent but the eggs were not the consistency
VAE Transferred (Negative): both times i have eaten here the food was mediocre at best

Negative to positive:
Original (Negative): the desserts were very bland
DAE Transferred (Positive): the desserts were very good
VAE Transferred (Positive): the desserts were very good

Original (Negative): it was a bed of lettuce and spinach with some italian meats and cheeses
DAE Transferred (Positive): it was a beautiful setting and just had a large variety of german flavors
VAE Transferred (Positive): it was a huge assortment of flavors and italian food

Original (Negative): the people behind the counter were not friendly whatsoever
DAE Transferred (Positive): the best selection behind the register and service presentation
VAE Transferred (Positive): the people behind the counter is friendly caring

Original (Negative): the interior is old and generally falling apart
DAE Transferred (Positive): the decor is old and now perfectly
VAE Transferred (Positive): the interior is old and noble

Original (Negative): they are clueless
DAE Transferred (Positive): they are stoked
VAE Transferred (Positive): they are genuinely professionals

We conducted ablation tests on the Yelp dataset, and show results in Table 4. With $J_{\mathrm{VAE}}$ only, we cannot achieve reasonable style transfer accuracy by substituting an empirically estimated style vector of the target style. This is because the style and content spaces would not be disentangled spontaneously with the autoencoding loss alone.
With either $J_{\mathrm{mul}(s)}$ or $J_{\mathrm{adv}(s)}$, the model achieves reasonable transfer accuracy and cosine similarity. Combining them improves the transfer accuracy to 90%, outperforming previous methods by a margin of 10% (Table 2). This shows that the multi-task loss and the adversarial loss work in different ways. Our insight of combining the two auxiliary losses is a simple yet effective way of disentangling latent space.
However, $J_{\mathrm{mul}(s)}$ and $J_{\mathrm{adv}(s)}$ only regularize the style information, leading to a gradual drop in content-preserving scores. We therefore introduce content-oriented auxiliary losses, $J_{\mathrm{mul}(c)}$ and $J_{\mathrm{adv}(c)}$, based on BoW features, which regularize the content information in the same way as the style information. By incorporating all these auxiliary losses, we achieve high transfer accuracy, high content preservation, as well as high language fluency.
Table 5 provides several examples of our style-transfer model. Results show that we can successfully transfer the sentiment while preserving the content of a sentence. We see that, with the empirically estimated style vector, we can reliably control the sentiment of generated sentences.

Conclusion
In this paper, we propose a simple yet effective approach for disentangling the latent space of neural networks. We combine multi-task and adversarial objectives to separate content and style information from each other, and propose to approximate content information with bag-of-words features of style-neutral, non-stopword vocabulary.
Both qualitative and quantitative experiments show that the latent space is indeed separated into style and content parts. This disentangled space can be directly applied to text style-transfer tasks. It achieves substantially better style-transfer strength, content-preservation scores, as well as language fluency, compared with previous state-of-the-art work.

A. Bag-of-Words (BoW) Vocabulary Ablation Tests
The tests in Table 6 demonstrate the effect of the choice of vocabulary used for the auxiliary content losses. It is evident that using a BoW vocabulary that excludes sentiment words and stopwords performs better on every single quantitative metric.
Figure 3 shows the t-SNE plots of the style and content embeddings, without any auxiliary losses. Figures 4, 5, 6 and 7 show the effect of adding each of the auxiliary losses independently.