Style Transfer Through Back-Translation

Style transfer is the task of rephrasing the text to contain specific stylistic properties without changing the intent or affect within the context. This paper introduces a new method for automatic style transfer. We first learn a latent representation of the input sentence which is grounded in a language translation model in order to better preserve the meaning of the sentence while reducing stylistic properties. Then adversarial generation techniques are used to make the output match the desired style. We evaluate this technique on three different style transformations: sentiment, gender and political slant. Compared to two state-of-the-art style transfer modeling techniques we show improvements both in automatic evaluation of style transfer and in manual evaluation of meaning preservation and fluency.


Introduction
Intelligent, situation-aware applications must produce naturalistic outputs, lexicalizing the same meaning differently, depending upon the environment. This is particularly relevant for language generation tasks such as machine translation (Sutskever et al., 2014;Bahdanau et al., 2015), caption generation (Karpathy and Fei-Fei, 2015;Xu et al., 2015), and natural language generation (Wen et al., 2017;Kiddon et al., 2016). In conversational agents (Ritter et al., 2011;Sordoni et al., 2015;Vinyals and Le, 2015;Li et al., 2016), for example, modulating the politeness style, to sound natural depending upon a situation: at a party with friends "Shut up! the video is starting!", or in a professional setting "Please be quiet, the video will begin shortly.".
These goals have motivated a considerable amount of recent research efforts focused at "controlled" language generation-aiming at separating the semantic content of what is said from the stylistic dimensions of how it is said. These include approaches relying on heuristic substitutions, deletions, and insertions to modulate demographic properties of a writer (Reddy and Knight, 2016), integrating stylistic and demographic speaker traits in statistical machine translation (Rabinovich et al., 2016;Niu et al., 2017), and deep generative models controlling for a particular stylistic aspect, e.g., politeness (Sennrich et al., 2016), sentiment, or tense (Hu et al., 2017;Shen et al., 2017). The latter approaches to style transfer, while more powerful and flexible than heuristic methods, have yet to show that in addition to transferring style they effectively preserve meaning of input sentences. This paper introduces a novel approach to transferring style of a sentence while better preserving its meaning. We hypothesize-relying on the study of Rabinovich et al. (2016) who showed that author characteristics are significantly obfuscated by both manual and automatic machine translation-that grounding in back-translation is a plausible approach to rephrase a sentence while reducing its stylistic properties. We thus first use back-translation to rephrase the sentence and reduce the effect of the original style; then, we generate from the latent representation, using separate style-specific generators controlling for style ( §2).
We focus on transferring author attributes: (1) gender and (2) political slant, and (3) on sentiment modification. The second task is novel: given a sentence by an author with a particular political leaning, rephrase the sentence to preserve its meaning but to confound classifiers of political slant ( §3). The task of sentiment modification enables us to compare our approach with state-of- Figure 1: Style transfer pipeline: to rephrase a sentence and reduce its stylistic characteristics, the sentence is back-translated. Then, separate style-specific generators are used for style transfer.
Style transfer is evaluated using style classifiers trained on held-out data. Our back-translation style transfer model outperforms the state-of-theart baselines (Shen et al., 2017;Hu et al., 2017) on the tasks of political slant and sentiment modification; 12% absolute improvement was attained for political slant transfer, and up to 7% absolute improvement in modification of sentiment ( §5). Meaning preservation was evaluated manually, using A/B testing ( §4). Our approach performs better than the baseline on the task of transferring gender and political slant. Finally, we evaluate the fluency of the generated sentences using human evaluation and our model outperforms the baseline in all experiments for fluency.
The main contribution of this work is a new approach to style transfer that outperforms stateof-the-art baselines in both the quality of inputoutput correspondence (meaning preservation and fluency), and the accuracy of style transfer. The secondary contribution is a new task that we propose to evaluate style transfer: transferring political slant.

Methodology
Given two datasets 2 } which represent two different styles s 1 and s 2 , respectively, our task is to generate sentences of the desired style while preserving the meaning of the input sentence. Specifically, we generate samples of dataset X 1 such that they belong to style s 2 and samples of X 2 such that they belong to style s 1 . We denote the output of dataset X 1 transfered to style s 2 asX 1 = {x (1) 2 , . . . ,x (n) 2 } and the output of dataset X 2 transferred to style s 1 asX 2 = {x (1) 1 , . . . ,x (n) 1 }. Hu et al. (2017) and Shen et al. (2017) introduced state-of-the-art style transfer models that use variational auto-encoders (Kingma and Welling, 2014, VAEs) and cross-aligned autoencoders, respectively, to model a latent content variable z. The latent content variable z is a code which is not observed. The generative model conditions on this code during the generation process. Our aim is to design a latent code z which (1) represents the meaning of the input sentence grounded in back-translation and (2) weakens the style attributes of author's traits. To model the former, we use neural machine translation. Prior work has shown that the process of translating a sentence from a source language to a target language retains the meaning of the sentence but does not preserve the stylistic features related to the author's traits (Rabinovich et al., 2016). We hypothesize that a latent code z obtained through backtranslation will normalize the sentence and devoid it from style attributes specific to author's traits. Figure 1 shows the overview of the proposed method. In our framework, we first train a machine translation model from source language e to a target language f . We also train a backtranslation model from f to e. Let us assume our styles s 1 and s 2 correspond to DEMOCRATIC and REPUBLICAN style, respectively. In Figure 1, the input sentence i thank you, rep. visclosky. is labeled as DEMOCRATIC. We translate the sentence using the e → f machine translation model and generate the parallel sentence in the target language f : je vous remercie, rep. visclosky. Using the fixed encoder of the f → e machine translation model, we encode this sentence in language f . The hidden representation created by this encoder of the back-translation model is used as z. We condition our generative models on this z. We then train two separate decoders for each style s 1 and s 2 to generate samples in these respective styles in source language e. Hence the sentence could be translated to the REPUBLICAN style using the decoder for s 2 . For example, the sentence i'm praying for you sir. is the REPUBLICAN ver- sion of the input sentence and i thank you, senator visclosky. is the more DEMOCRATIC version of it.
Note that in this setting, the machine translation and the encoder of the back-translation model remain fixed. They are not dependent on the data we use across different tasks. This facilitates reusability and spares the need of learning separate models to generate z for a new style data.

Meaning-Grounded Representation
In this section we describe how we learn the latent content variable z using back-translation. The e → f machine translation and f → e backtranslation models are trained using a sequence-tosequence framework (Sutskever et al., 2014;Bahdanau et al., 2015) with style-agnostic corpus. The style-specific sentence i thank you, rep. visclosky. in source language e is translated to the target language f to get je vous remercie, rep. visclosky. The individual tokens of this sentence are then encoded using the encoder of the f → e backtranslation model. The learned hidden representation is z.
Formally, let θ E represent the parameters of the encoder of f → e translation system. Then z is given by: where, x f is the sentence x in language f . Specifically, x f is the output of e → f translation system when x e is given as input. Since z is derived from a non-style specific process, this Encoder is not style specific. Figure 2 shows the architecture of the generative model for generating different styles. Using the encoder embedding z, we train multiple decoders for each style. The sentence generated by a decoder is passed through the classifier. The loss of the classifier for the generated sentence is used as feedback to guide the decoder for the generation process. The target attribute of the classifier is determined by the decoder from which the output is generated. For example, in the case of DEMOCRATIC decoder, the target attribute is DEMOCRATIC and for the REPUBLICAN decoder the target is REPUBLICAN.

Style Classifiers
We train a convolutional neural network (CNN) classifier to accurately predict the given style. We also use it to evaluate the error in the generated samples for the desired style. We train the classifier in a supervised manner. The classifier accepts either discrete or continuous tokens as inputs. This is done such that the generator output can be used as input to the classifier. We need labeled examples to train the classifier such that each instance in the dataset X should have a label in the set s = {s 1 , s 2 }. Let θ C denote the parameters of the classifier. The objective to train the classifier is given by: To improve the accuracy of the classifier, we augment classifier's inputs with style-specific lexicons. We concatenate binary style indicators to each input word embedding in the classifier. The indicators are set to 1 if the input word is present in a style-specific lexicon; otherwise they are set to 0. Style lexicons are extracted using the log-odds ratio informative Dirichlet prior (Monroe et al., 2008), a method that identifies words that are statistically overrepresented in each of the categories.

Generator Learning
We use a bidirectional LSTM to build our decoders which generate the sequence of tokensx = {x 1 , · · · x T }. The sequencex is conditioned on the latent code z (in our case, on the machine translation model). In this work we use a corpus translated to French by the machine translation system as the input to the encoder of the backtranslation model. The same encoder is used to encode sentences of both styles. The representation created by this encoder is given by Eq 1. Samples are generated as follows: where,x <t are the tokens generated beforex t . Tokens are discrete and non-differentiable. This makes it difficult to use a classifier, as the generation process samples discrete tokens from the multinomial distribution parametrized using softmax function at each time step t. This nondifferentiability, in turn, breaks down gradient propagation from the discriminators to the generator. Instead, following Hu et al. (2017) we use a continuous approximation based on softmax, along with the temperature parameter which anneals the softmax to the discrete case as training proceeds. To create a continuous representation of the output of the generative model which will be given as an input to the classifier, we use: where, o t is the output of the generator and τ is the temperature which decreases as the training proceeds. Let θ G denote the parameters of the generators. Then the reconstruction loss is calculated using the cross entropy function, given by: Here, the back-translation encoder E creates the latent code z by: The generative loss L gen is then given by: min θgen L gen = L recon + λ c L class (7) where L recon is given by Eq. (5), L class is given by Eq (2) and λ c is a balancing parameter.
We also use global attention of (Luong et al., 2015) to aid our generators. At each time step t of the generation process, we infer a variable length alignment vector a t : where h t is the current target state andh s are all source states. While generating sentences, we use the attention vector to replace unknown characters (UNK) using the copy mechanism in (See et al., 2017).

Style Transfer Tasks
Much work in computational social science has shown that people's personal and demographic characteristics-either publicly observable (e.g., age, gender) or private (e.g., religion, political affiliation)-are revealed in their linguistic choices (Nguyen et al., 2016). There are practical scenarios, however, when these attributes need to be modulated or obfuscated. For example, some users may wish to preserve their anonymity online, for personal security concerns (Jardine, 2016), or to reduce stereotype threat (Spencer et al., 1999). Modulating authors' attributes while preserving meaning of sentences can also help generate demographically-balanced training data for a variety of downstream applications. Moreover, prior work has shown that the quality of language identification and POS tagging degrades significantly on African American Vernacular English (Blodgett et al., 2016;Jørgensen et al., 2015); YouTube's automatic captions have higher error rates for women and speakers from Scotland (Rudinger et al., 2017). Synthesizing balanced training data-using style transfer techniques-is a plausible way to alleviate bias present in existing NLP technologies.
We thus focus on two tasks that have practical and social-good applications, and also accurate style classifiers. To position our method with respect to prior work, we employ a third task of sentiment transfer, which was used in two stateof-the-art approaches to style transfer (Hu et al., 2017;Shen et al., 2017). We describe the three tasks and associated dataset statistics below. The methodology that we advocate is general and can be applied to other styles, for transferring various social categories, types of bias, and in multi-class settings.
Gender. In sociolinguistics, gender is known to be one of the most important social categories driving language choice (Eckert and McConnell-Ginet, 2003;Lakoff and Bucholtz, 2004;Coates, 2015). Reddy and Knight (2016) proposed a heuristic-based method to obfuscate gender of a writer. This method uses statistical association measures to identify gender-salient words and substitute them with synonyms typically of the opposite gender. This simple approach produces highly fluent, meaning-preserving sentences, but does not allow for more general rephrasing of sentence beyond single-word substitutions. In our work, we adopt this task of transferring the author's gender and adapt it to our experimental settings.
We used Reddy and Knight's (2016) dataset of reviews from Yelp annotated for two genders corresponding to markers of sex. 1 We split the reviews to sentences, preserving the original gender labels. To keep only sentences that are strongly indicative of a gender, we then filtered out genderneutral sentences (e.g., thank you) and sentences whose likelihood to be written by authors of one gender is lower than 0.7. 2 Political slant. Our second dataset is comprised of top-level comments on Facebook posts from all 412 current members of the United States Senate and House who have public Facebook pages (Voigt et al., 2018). 3 Only top-level comments that directly respond to the post are included. Every comment to a Congressperson is labeled with the Congressperson's party affiliation: democratic or republican. Topic and sentiment in these comments reveal commenter's political slant. For example, defund them all, especially when it comes to the illegal immigrants . and thank u james, praying for all the work u do . are republican, whereas on behalf of the hard-working nh public school teachers-thank you ! and we need more strong voices like yours fighting for gun control . represent examples of democratic sentences. Our task is to preserve intent of the commenter (e.g., to thank their representative), but to modify their observable political affiliation, as in the example in Figure 1. We preprocessed and filtered the comments similarly to the gender-annotated corpus above.
Sentiment. To compare our work with the stateof-the-art approaches of style transfer for nonparallel corpus we perform sentiment transfer, replicating the models and experimental setups of Hu et al. (2017) and Shen et al. (2017). Given a positive Yelp review, a style transfer model will generate a similar review but with an opposite sentiment. We used Shen et al.'s (2017) corpus of reviews from Yelp. They have followed the standard practice of labeling the reviews with rating of higher than three as positive and less than three as negative. They have also split the reviews to sentences and assumed that the sentence has the same sentiment as the review.
Dataset statistics. We summarize below corpora statistics for the three tasks: transferring gender, political slant, and sentiment. The dataset for sentiment modification task was used as described in (Shen et al., 2017). We split Yelp and Facebook corpora into four disjoint parts each: (1) a training corpus for training a style classifier (class); (2) a training corpus (train) used for training the stylespecific generative model described in §2.2; (3) development and (4) test sets. We have removed from training corpora class and train all sentences that overlap with development and test corpora. Corpora sizes are shown in Table 1. Table 2 shows the approximate vocabulary sizes used for each dataset. The vocabulary is the same for both the styles in each experiment.

Experimental Setup
In what follows, we describe our experimental settings, including baselines used, hyperparameter settings, datasets, and evaluation setups.
Baseline. We compare our model against the "cross-aligned" auto-encoder (Shen et al., 2017), which uses style-specific decoders to align the style of generated sentences to the actual distribution of the style. We used the off-the-shelf sentiment model released by Shen et al. (2017) for the sentiment experiments. We also separately train this model for the gender and political slant using hyper-parameters detailed below. 4 Translation data. We trained an English-French neural machine translation system and a French-English back-translation system. We used data from Workshop in Statistical Machine Translation 2015 (WMT15) (Bojar et al., 2015) to train our translation models. We used the French-English data from the Europarl v7 corpus, the news commentary v10 corpus and the common crawl corpus from WMT15. Data were tokenized using the Moses tokenizer (Koehn et al., 2007). Approximately 5.4M English-French parallel sentences were used for training. A vocabulary size of 100K was used to train the translation systems.
Hyperparameter settings. In all the experiments, the generator and the encoders are a twolayer LSTM with an input size of 300 and the hidden dimension of 500. The encoders are bidirec-4 In addition, we compared our model with the current state-of-the-art approach introduced by Hu et al. (2017); Shen et al. (2017) use this method as baseline, obtaining comparable results. We reproduced the results reported in (Hu et al., 2017) using their tasks and data. However, the same model trained on our political slant datasets (described in §3), obtained an almost random accuracy of 50.98% in style transfer. We thus omit these results. tional. The generator samples a sentence of maximum length 50. All the generators use global attention vectors of size 500. The CNN classifier is trained with 100 filters of size 5, with maxpooling. The input to CNN is of size 302: the 300-dimensional word embedding plus two bits for membership of the word in our style lexicons, as described in §2.2.1. Balancing parameter λ c is set to 15. For sentiment task, we have used settings provided in (Shen et al., 2017).

Results
We evaluate our approach along three dimensions.
(1) Style transfer accuracy, measuring the proportion of our models' outputs that generate sentences of the desired style. The style transfer accuracy is performed using classifiers trained on held-out train data that were not used in training the style transfer models.
(2) Preservation of meaning. (3) Fluency, measuring the readability and the naturalness of the generated sentences. We conducted human evaluations for the latter two.
In what follows, we first present the quality of our neural machine translation systems, then we present the evaluation setups, and then present the results of our experiments.
Translation quality. The BLEU scores achieved for English-French MT system is 32.52 and for French-English MT system is 31.11; these are strong translation systems. We deliberately chose a European language close to English for which massive amounts of parallel data are available and translation quality is high, to concentrate on the style generation, rather than improving a translation system. 5

Style Transfer Accuracy
We measure the accuracy of style transfer for the generated sentences using a pre-trained style classifier ( §2.2.1). The classifier is trained on data that is not used for training our style transfer generative models (as described in §3). The classifier has an accuracy of 82% for the gender-annotated corpus, 92% accuracy for the political slant dataset and 93.23% accuracy for the sentiment dataset.
We transfer the style of test sentences and then test the classification accuracy of the generated sentences for the opposite label. For example, if we want to transfer the style of male Yelp reviews to female, then we use the fixed common encoder of the back-translation model to encode the test male sentences and then we use the female generative model to generate the female-styled reviews. We then test these generated sentences for the female label using the gender classifier.  In Table 4, we detail the accuracy of each classifier on generated style-transfered sentences. 6 We denote the Shen et al.'s (2017) Cross-aligned Auto-Encoder model as CAE and our model as Back-translation for Style Transfer (BST).
On two out of three tasks our model substantially outperforms the baseline, by up to 12% in political slant transfer, and by up to 7% in sentiment modification.

Preservation of Meaning
Although we attempted to use automatics measures to evaluate how well meaning is preserved in our transformations; measures such as BLEU (Papineni et al., 2002) and Meteor (Denkowski and Lavie, 2011), or even cosine similarity between distributed representations of sentences do not capture this distance well.
Meaning preservation in style transfer is not trivial to define as literal meaning is likely to change when style transfer occurs. For example "My girlfriend loved the desserts" vs "My partner liked the desserts". Thus we must relax the condition of literal meaning to intent or affect of the utterance within the context of the discourse. Thus if the intent is to criticize a restaurant's service in a review, changing "salad" to "chicken" could still have the same effect but if the intent is to order food that substitution would not be acceptable. Ideally we wish to evaluate transfer within some  downstream task and ensure that the task has the same outcome even after style transfer. This is a hard evaluation and hence we resort to a simpler evaluation of the "meaning" of the sentence. We set up a manual pairwise comparison following Bennett (2005). The test presents the original sentence and then, in random order, its corresponding sentences produced by the baseline and our models. For the gender style transfer we asked "Which transferred sentence maintains the same sentiment of the source sentence in the same semantic context (i.e. you can ignore if food items are changed)". For the task of changing the political slant, we asked "Which transferred sentence maintains the same semantic intent of the source sentence while changing the political position". For the task of sentiment transfer we have followed the annotation instruction in (Shen et al., 2017) and asked "Which transferred sentence is semantically equivalent to the source sentence with an opposite sentiment" We then count the preferences of the eleven participants, measuring the relative acceptance of the generated sentences. 7 A third option "=" was given to participants to mark no preference for either of the generated sentence. The "no preference" option includes choices both are equally bad and both are equally good. We conducted three tests one for each type of experiment -gender, political slant and sentiment. We also divided our annotation set into short (#tokens ≤ 15) and long (15 < #tokens ≤ 30) sentences for the gender and the political slant experiment. In each set we had 20 random samples for each type of style transfer. In total we had 100 sentences to be annotated. Note that we did not ask about appropriateness of the style transfer in this test, or fluency of outputs, only about meaning preservation.
The results of human evaluation are presented in Table 5.
Although a no-preference option was chosen often-showing that state-ofthe-art systems are still not on par with hu-7 None of the human judges are authors of this paper man expectations-the BST models outperform the baselines in the gender and the political slant transfer tasks.
Crucially, the BST models significantly outperform the CAE models when transferring style in longer and harder sentences. Annotators preferred the CAE model only for 12.5% of the long sentences, compared to 47.27% preference for the BST model.

Fluency
Finally, we evaluate the fluency of the generated sentences. Fluency was rated from 1 (unreadable) to 4 (perfect) as is described in (Shen et al., 2017). We randomly selected 60 sentences each generated by the baseline and the BST model.
The results shown in Table 6   BST outperforms the baseline overall. It is interesting to note that BST generates significantly more fluent longer sentences than the baseline model. Since the average length of sentences was higher for the gender experiment, BST notably outperformed the baseline in this task, relatively to the sentiment task where the sentences are shorter. Examples of the original and style-transfered sentences generated by the baseline and our model are shown in the Supplementary Material.

Discussion
The loss function of the generators given in Eq. 5 includes two competing terms, one to improve meaning preservation and the other to improve the style transfer accuracy. In the task of sentiment modification, the BST model preserved meaning worse than the baseline, on the expense of being better at style transfer. We note, however, that the sentiment modification task is not particularly well-suited for evaluating style transfer: it is particularly hard (if not impossible) to disentangle the sentiment of a sentence from its proposi-tional content, and to modify sentiment while preserving meaning or intent. On the other hand, the style-transfer accuracy for gender is lower for BST model but the preservation of meaning is much better for the BST model, compared to CAE model and to "No preference" option. This means that the BST model does better job at closely representing the input sentence while taking a mild hit in the style transfer accuracy.

Related Work
Style transfer with non-parallel text corpus has become an active research area due to the recent advances in text generation tasks. Hu et al. (2017) use variational auto-encoders with a discriminator to generate sentences with controllable attributes. The method learns a disentangled latent representation and generates a sentence from it using a code. This paper mainly focuses on sentiment and tense for style transfer attributes. It evaluates the transfer strength of the generated sentences but does not evaluate the extent of preservation of meaning in the generated sentences. In our work, we show a qualitative evaluation of meaning preservation. Shen et al. (2017) first present a theoretical analysis of style transfer in text using non-parallel corpus. The paper then proposes a novel crossalignment auto-encoders with discriminators architecture to generate sentences. It mainly focuses on sentiment and word decipherment for style transfer experiments. Fu et al. (2018) explore two models for style transfer. The first approach uses multiple decoders for each type of style. In the second approach, style embeddings are used to augment the encoded representations, so that only one decoder needs to be learned to generate outputs in different styles. Style transfer is evaluated on scientific paper titles and newspaper tiles, and sentiment in reviews. This method is different from ours in that we use machine translation to create a strong latent state from which multiple decoders can be trained for each style. We also propose a different human evaluation scheme. Li et al. (2018) first extract words or phrases associated with the original style of the sentence, delete them from the original sentence and then replace them with new phrases associated with the target style. They then use a neural model to fluently combine these into a final output. Junbo et al. (2017) learn a representation which is styleagnostic, using adversarial training of the autoencoder.
Our work is also closely-related to a problem of paraphrase generation (Madnani and Dorr, 2010;Dong et al., 2017), including methods relying on (phrase-based) back-translation (Ganitkevitch et al., 2011;Ganitkevitch and Callison-Burch, 2014). More recently,  and Wieting et al. (2017) showed how neural backtranslation can be used to generate paraphrases. An additional related line of research is machine translation with non-parallel data. Lample et al. (2018) and Artetxe et al. (2018) have proposed sophisticated methods for unsupervised machine translation. These methods could in principle be used for style transfer as well.

Conclusion
We propose a novel approach to the task of style transfer with non-parallel text. 8 We learn a latent content representation using machine translation techniques; this aids grounding the meaning of the sentences, as well as weakening the style attributes. We apply this technique to three different style transfer tasks. In transfer of political slant and sentiment we outperform an off-the-shelf state-of-the-art baseline using a cross-aligned autoencoder. The political slant task is a novel task that we introduce. Our model also outperforms the baseline in all the experiments of fluency, and in the experiments for meaning preservation in generated sentences of gender and political slant. Yet, we acknowledge that the generated sentences do not always adequately preserve meaning.
This technique is suitable not just for style transfer, but for enforcing style, and removing style too. In future work we intend to apply this technique to debiasing sentences and anonymization of author traits such as gender and age.
In the future work, we will also explore whether an enhanced back-translation by pivoting through several languages will learn better grounded latent meaning representations. In particular, it would be interesting to back-translate through multiple target languages with a single source language (Johnson et al., 2016).

A Supplementary Material
In Tables 7, 8, and 9 we present the style transfer accuracy results broken-down to style categories. We denote the Cross-aligned Auto-Encoder model as CAE and our model as Back-translation for Style Transfer (BST).

Model Style transfer
Acc CAE male → female 64.75 BST male → female 54.59 CAE female → male 56.05 BST female → male 59.49   Table 9: Sentiment modification accuracy.
In Table 10, we detail the accuracy of the gender classifier on generated style-transfered sentences by an auto-encoder; Table 11 shows the accuracy of transfer of political slant. The experiments are setup as described in Section 5.1. We denote the Auto-Encoder as (AE) and our model as Backtranslation for Style Transfer (BST).   each of 20 random samples for each type of style transfer. Note that we did not ask about appropriateness of the style transfer in this test, or fluency of outputs, only about meaning preservation. We show the results of human evaluation in Table 12 Style  Examples of the original and style-transfered sentences generated by the baseline and our model are shown in Table 13 Input Sentence CAE BST male → female my wife ordered country fried i got ta get the chicken breast . my husband ordered the chicken steak and eggs. salad and the fries. great place to visit and maybe we could n't go back and i would great place to go back and try a find that one rare item you be able to get me to get me. lot of which you ' ve never had just have never seen or can to try or could not have been not find anywhere else.
able to get some of the best. the place is small but cosy and very clean.