Domain Adaptive Text Style Transfer

Text style transfer without parallel data has achieved some practical success. However, in scenarios where only limited data is available, these methods may yield poor performance. In this paper, we examine domain adaptation for text style transfer to leverage massively available data from other domains. These data may exhibit domain shift, which impedes the benefits of utilizing such data for training. To address this challenge, we propose simple yet effective domain adaptive text style transfer models, enabling domain-adaptive information exchange. The proposed models presumably learn from the source domain to: (i) distinguish stylized information from generic content information; (ii) maximally preserve content information; and (iii) adaptively transfer the styles in a domain-aware manner. We evaluate the proposed models on two style transfer tasks (sentiment and formality) over multiple target domains where only limited non-parallel data is available. Extensive experiments demonstrate the effectiveness of the proposed models compared to the baselines.


Introduction
Text style transfer, which aims to edit an input sentence with the desired style while preserving style-irrelevant content, has received increasing attention in recent years. It has been applied successfully to stylized image captioning , personalized conversational response generation (Zhang et al., 2018a), formalized writing (Rao and Tetreault, 2018), offensive-to-non-offensive language transfer (dos Santos et al., 2018), and other stylized text generation tasks (Akama et al., 2017; Zhang et al., 2019).
Text style transfer has been explored as a sequence-to-sequence learning task using parallel datasets (Jhamtani et al., 2017). However, parallel datasets are often not available, and hand-annotating sentences in different styles is expensive. The recent surge of deep generative models (Kingma and Welling, 2013; Goodfellow et al., 2014) has spurred progress in text style transfer without parallel data by learning disentanglement (Hu et al., 2017; Fu et al., 2018; Prabhumoye et al., 2018). These methods typically require massive amounts of data (Subramanian et al., 2018), and may perform poorly in limited data scenarios.
A natural solution to the data-scarcity issue is to resort to massive data from other domains. However, directly leveraging abundant data from other domains is problematic due to the discrepancies in data distribution across domains. Different domains generally manifest themselves in domain-specific lexica. For example, sentiment adjectives such as "delicious", "tasty", and "disgusting" in restaurant reviews might be out of place in movie reviews, where sentiment words such as "imaginative", "hilarious", and "dramatic" are more typical. Domain shift (Gretton et al., 2009) is thus apt to result in feature misalignment.
In this work, we take up the problem of domain adaptation in scenarios where the target domain data is scarce and misaligned with the distribution in the source domain. Our goal is to achieve successful style transfer into the target domain, with the help of the source domain, while the transferred sentences carry relevant characteristics in the target domain.
We present two first-of-their-kind domain adaptive text style transfer models that facilitate domain-adaptive information exchange between the source and target domains. These models effectively learn generic content information and distinguish domain-specific information. Generic content information, primarily captured by modeling a large corpus from the source domain, facilitates better content preservation on the target domain. Meanwhile, domain-specific information, implicitly imposed by domain vectors and domain-specific style classifiers, guides the transferred sentences toward target-specific lexical terms.
Our contributions in this paper are threefold: (i) We explore a challenging domain adaptation problem for text style transfer by leveraging massively-available data from other domains. (ii) We introduce simple text style transfer models that preserve content and meanwhile translate text adaptively into target-domain-specific terms. (iii) We demonstrate through extensive experiments the robustness of these methods for style transfer tasks (sentiment and formality) on multiple target domains where only limited non-parallel data is available. Our implementation is available at https://github.com/ cookielee77/DAST.

Related Work
Text Style Transfer. Text style transfer using neural networks has been widely studied in the past few years. A common paradigm is to first disentangle the latent space into content and style features, and then generate stylistic sentences by tweaking the style-relevant features and passing them through a decoder. Hu et al. (2017); Fu et al. (2018); Gong et al. (2019); Lin et al. (2017) explored this direction by assuming the disentanglement can be achieved in an auto-encoding procedure with a suitable style regularization, implemented by either adversarial discriminators or style classifiers. Xu et al. (2018); Zhang et al. (2018c) achieved disentanglement by filtering the stylistic words of input sentences. Recently, Prabhumoye et al. (2018) proposed to use back-translation for text style transfer with a de-noising auto-encoding objective (Logeswaran et al., 2018; Subramanian et al., 2018). Our work differs in that we leverage domain adaptation to deal with limited target domain data, whereas previous methods require massive style-labelled samples in the target domain.
Domain Adaptation. Domain adaptation has been studied in various natural language processing tasks, such as sentiment classification (Qu et al., 2019), dialogue systems (Wen et al., 2016), abstractive summarization (Hua and Wang, 2017; Zhang et al., 2018b), and machine translation (Koehn and Schroeder, 2007; Axelrod et al., 2011; Sennrich et al., 2016b; Michel and Neubig, 2018). However, little work explores domain adaptation for text style transfer. To the best of our knowledge, we are the first to explore the adaptation of text style transfer models to a new domain with only limited non-parallel data available. The task requires both style transfer and domain-specific generation on the target domain. To differentiate domains, Sennrich et al. (2016a); Chu et al. (2017) appended domain tokens to the input sentences. Our model instead uses learnable domain vectors combined with domain-specific style classifiers, which force the model to learn distinct stylized information in each domain.

Preliminary
We first describe a standard text style transfer approach, which only considers data in the target domain. We limit our discussion to the scenario where only non-parallel data is available, since collecting large amounts of parallel data is typically not feasible.
Given a set of style-labelled sentences $\{(x_i, l_i)\}$ in the target domain, the goal is to transfer a sentence $x_i$ with style $l_i$ to a sentence $\hat{x}_i$ with another style $\hat{l}_i$, where $\hat{l}_i \neq l_i$. Both labels belong to the set of style labels $\mathcal{L}_T$ of the target domain: $l_i, \hat{l}_i \in \mathcal{L}_T$. Typically, an encoder encodes the input $x_i$ into a semantic representation $c_i$, while a decoder controls or modifies the stylistic property and decodes the sentence $\hat{x}_i$ based on $c_i$ and the pre-specified style $\hat{l}_i$. Specifically, we denote an encoder-decoder model as $(E, D)$. The semantic representation $c_i$ of sentence $x_i$ is extracted by the encoder $E$, i.e., $c_i = E(x_i)$. The decoder $D$ aims to learn a conditional distribution of $\hat{x}_i$ given the semantic representation $c_i$ and the style $\hat{l}_i$:

$$p_D(\hat{x}_i \mid c_i, \hat{l}_i) = \prod_t p_D(\hat{x}_i^t \mid c_i, \hat{l}_i, \hat{x}_i^{<t}), \qquad (1)$$

where $\hat{x}_i^t$ is the $t$-th token of $\hat{x}_i$, and $\hat{x}_i^{<t}$ is the prefix of $\hat{x}_i$ up to the $t$-th token.

Directly estimating Eqn. (1) is impractical during training due to the lack of parallel data $(x_i, \hat{x}_i)$. Alternatively, the original sentence $x_i$ should have high probability under the conditional distribution $p_D(x_i \mid c_i, l_i)$. Thus, an auto-encoding reconstruction loss can be formulated as:

$$\mathcal{L}^T_{ae}(E, D) = -\log p_D(x_i \mid c_i, l_i). \qquad (2)$$

Note that we assume the decoder $D$ recovers $x_i$'s original stylistic property as accurately as possible when given the style label $l_i$. To achieve text style transfer, the decoder manipulates the style of generated sentences by replacing $l_i$ with a desired style $\hat{l}_i$; specifically, the generated sentence is $\hat{x}_i \sim p_D(\hat{x}_i \mid c_i, \hat{l}_i)$.

However, by directly optimizing Eqn. (2), the encoder-decoder model tends to ignore the style labels and collapses to a reconstruction model, which may simply copy the input sentence and hence fail to transfer the style. To force the model to learn meaningful style properties, Hu et al. (2017) apply a style classifier for style regularization. The style classifier encourages the encoder-decoder model to transfer $x_i$ toward its pre-specified style label $\hat{l}_i$:

$$\mathcal{L}^T_{style} = -\log p_{C_T}(\hat{l}_i \mid \hat{x}_i), \qquad (3)$$

where $C_T$ is the style classifier pretrained on the target domain. The overall training objective for text style transfer within the target domain $T$ is:

$$\mathcal{L}^T = \mathcal{L}^T_{ae} + \mathcal{L}^T_{style}. \qquad (4)$$
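To make the style-conditioned likelihood of Eqn. (1) and the reconstruction loss of Eqn. (2) concrete, here is a minimal numpy sketch. The toy dimensions, the one-hot style embedding, and the single linear projection standing in for the GRU decoder are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, C, S = 5, 4, 2                      # toy vocab / content / style sizes
W = rng.normal(size=(C + S + V, V))    # linear stand-in for the decoder D

def seq_log_prob(tokens, c, style_id):
    """log p_D(x | c, l) = sum_t log p_D(x^t | c, l, x^{<t})  (Eqn. 1)."""
    s = np.eye(S)[style_id]            # style label embedding (one-hot here)
    prev = np.zeros(V)                 # start-of-sequence placeholder
    lp = 0.0
    for t in tokens:
        p = softmax(np.concatenate([c, s, prev]) @ W)
        lp += np.log(p[t])
        prev = np.eye(V)[t]            # condition on the growing prefix
    return lp

c = rng.normal(size=C)                 # c = E(x); the encoder is omitted here
x = [1, 3, 2]
l_ae = -seq_log_prob(x, c, style_id=0) # auto-encoding loss L_ae (Eqn. 2)
print(l_ae > 0.0)                      # a negative log-likelihood is positive
```

Swapping `style_id` at decoding time is what Eqn. (1) uses to generate the transferred sentence with the desired style.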

Domain Adaptive Text Style Transfer
In this section, we present Domain Adaptive Style Transfer (DAST) models to perform style transfer on a target domain by borrowing the strength from a source domain, while maintaining the transfer to be domain-specific.

Problem Definition
Suppose we have two sets of style-labelled sentences in the source domain $S$ and the target domain $T$, respectively. $x_i$ denotes the $i$-th source sentence, and $l_i$ denotes the corresponding style label, which belongs to a source style label set: $l_i \in \mathcal{L}_S$ (e.g., positive/negative). $l_i$ can be available or unknown. Likewise, with a slight abuse of notation, the pair $(x_i, l_i)$ represents a sentence and its style label in the target domain, where $l_i \in \mathcal{L}_T$.
We consider domain adaptation in two settings: (i) the source styles $\mathcal{L}_S$ are unknown, e.g., we may have a large corpus, such as Yahoo! Answers, but the underlying style of each sample is not available; (ii) the source styles are available and are the same as the target styles, i.e., $\mathcal{L}_T = \mathcal{L}_S$, e.g., both IMDB movie reviews and Yelp restaurant reviews have the same style classes (negative and positive sentiment).
In both scenarios, we assume that the target domain $T$ only has limited non-parallel data. With the help of the source domain data $S$, the goal is to transfer a target-domain sentence $x_i$ with style $l_i$ into a sentence $\hat{x}_i$ with a different style $\hat{l}_i$. The transferred sentence $\hat{x}_i$ should simultaneously: (i) preserve the main content of $x_i$; (ii) carry a style $\hat{l}_i$ different from $l_i$; and (iii) exhibit domain-specific characteristics of the target data distribution $T$.

DAST with unknown-stylized source data
In this section, we investigate the case where the source styles $\mathcal{L}_S$ are unknown. We first examine a drawback of limited target data to motivate our method. With limited target data, Eqn. (4) may yield undesirable transferred text, where the generated text tends to use the most discriminative words that the target style prefers while ignoring the content. This is because the classifier $C_T$ typically requires less data to train than the sequence auto-encoder $(E, D)$. The classifier objective $\mathcal{L}^T_{style}$ thus dominates Eqn. (4), leading the generator to produce sentences with the most representative stylized (e.g., positive or negative) words rather than preserving the content (see Table 5 for examples).
We consider alleviating this issue by leveraging massive source domain data to enhance the content-preserving ability, though the underlying styles in the source domain are unknown. By jointly training an auto-encoder on both the source and target domain data, the learned generic content information enables the model to yield better content preservation on the target domain.
To utilize the source data, we consider that $\mathcal{L}_S$ contains only a special unknown-style label $l_u$, separated from the target styles $\mathcal{L}_T$. The semantic representation of a source sentence $x_i$ is encoded by the encoder, i.e., $c_i = E(x_i)$. The decoder takes $c_i$ with the style $l_u$ to generate sentences in the source domain. The auto-encoding reconstruction objective on the source domain is:

$$\mathcal{L}^S_{ae}(E, D) = -\log p_D(x_i \mid c_i, l_u), \qquad (5)$$

where the encoder-decoder model $(E, D)$ is shared across both domains. The overall objective can therefore be written as:

$$\mathcal{L}_{DAST\text{-}C} = \mathcal{L}^T_{ae} + \mathcal{L}^T_{style} + \mathcal{L}^S_{ae}. \qquad (6)$$

This can be perceived as combining the source domain data with the target domain data to train a better encoder-decoder framework, while target-specific style information is learned through $\mathcal{L}^T_{style}$. Note that $\mathcal{L}^T_{ae}$ and $\mathcal{L}^S_{ae}$ are conditioned on domain-specific style labels, $\mathcal{L}_T$ and $l_u$, which implicitly encourages the model to learn domain-specific features. The decoder can thus generate target sentences adaptively with $\mathcal{L}_T$, while achieving favorable content preservation through the generic content information modeled by $\mathcal{L}^S_{ae}$. We refer to this model, illustrated in Figure 1 (left), as Domain Adaptive Style Transfer with generic Content preservation (DAST-C).
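A sketch of how the DAST-C objective (Eqn. (6)) combines the two domains, with the unknown-style label l_u treated as one extra entry in a shared label embedding table. The sizes, the scalar classifier-loss stand-in, and the non-autoregressive bag-of-words decoder are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V, C = 6, 4                            # toy vocab / content sizes
L_POS, L_NEG, L_U = 0, 1, 2            # target styles plus unknown style l_u
style_emb = rng.normal(size=(3, 3))    # shared style/label embedding table
W = rng.normal(size=(C + 3, V))        # shared decoder (E, D) stand-in

def recon_nll(tokens, c, label):
    """-log p_D(x | c, l): toy bag-of-words stand-in for the L_ae terms."""
    z = np.concatenate([c, style_emb[label]]) @ W
    p = np.exp(z - z.max()); p /= p.sum()
    return -sum(np.log(p[t]) for t in tokens)

c_tgt, c_src = rng.normal(size=C), rng.normal(size=C)
l_ae_T = recon_nll([1, 2], c_tgt, L_POS)  # target sentence: true style label
l_ae_S = recon_nll([3, 5], c_src, L_U)    # source sentence: unknown style l_u
l_style_T = 0.4                           # stand-in for the C_T classifier loss
loss = l_ae_T + l_style_T + l_ae_S        # Eqn. (6)
print(loss > 0.0)
```

Only the target-domain term carries a real style label; the source term merely teaches the shared encoder-decoder generic content reconstruction.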

DAST with stylized source data
We further explore the scenario where $\mathcal{L}_S = \mathcal{L}_T$. In this case, besides the generic content information, there is much style information in the source domain that could be leveraged; e.g., generic stylized expressions like "fantastic" and "terrible" for sentiment transfer apply to both restaurant and movie reviews. We thus consider borrowing the full strength of the source data by sharing learned knowledge of both the generic content and style information.
A straightforward way to achieve this is to train Eqn. (4) on both domains. However, simply mixing the two domains together will lead to undesirable style transfers, where the transfer is not domain-specific. For example, when adapting the IMDB movie reviews to the Yelp restaurant reviews, directly sharing the style transfer model without specifying the domain will inevitably result in generations like "The pizza is dramatic!".
To alleviate this problem, we introduce additional domain vectors, encouraging the model to perform style transfer in a domain-aware manner. The proposed DAST model is illustrated in Figure 1 (right). Consider two domain vectors, $d_S$ for the source domain and $d_T$ for the target domain. We rewrite the auto-encoding loss as:

$$\mathcal{L}^{S,T}_{ae}(E, D) = -\log p_D(x_i \mid c_i, d, l_i), \quad \text{with } d = d_S \text{ or } d_T \text{ according to the domain of } x_i, \qquad (7)$$

where the encoder-decoder model $(E, D)$ is shared across domains. The domain vectors $d_S$ and $d_T$, learned jointly with the model, implicitly guide the decoder to generate sentences with domain-specific characteristics. Note that the style labels are shared, i.e., $\mathcal{L}_T = \mathcal{L}_S$, which enables the model to learn generic style information from both domains.

On the other hand, explicitly learning precise stylized information within each domain is crucial to generating domain-specific styles. Thus, two domain-specific style classifiers ensure that the model learns the corresponding styles by conditioning on $(d_S, \hat{l}_i)$ in the source domain or $(d_T, \hat{l}_i)$ in the target domain:

$$\mathcal{L}^{S,T}_{style} = -\log p_{C_S}(\hat{l}_i \mid \hat{x}^S_i) - \log p_{C_T}(\hat{l}_i \mid \hat{x}^T_i), \qquad (8)$$

where $\hat{x}^S_i$ and $\hat{x}^T_i$ are the transferred sentences with the pre-specified style $\hat{l}_i$ in the source and target domains, respectively. The domain-specific style classifiers, $C_S$ and $C_T$, are trained separately on each domain. The signals from the classifiers, combined with the domain vectors and style labels, encourage the model to learn domain-specific styles. The overall training objective of the proposed DAST model is:

$$\mathcal{L}_{DAST} = \mathcal{L}^{S,T}_{ae} + \mathcal{L}^{S,T}_{style}. \qquad (9)$$

The domain-specific style classifiers enforce the model to learn domain-specific style information conditioned on $(d_S, \hat{l}_i)$ or $(d_T, \hat{l}_i)$, which in turn guides the model to generate sentences with domain-specific words. The model can thus distinguish domain-specific features and adaptively transfer styles in a domain-aware manner.
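The DAST objective differs from DAST-C mainly in that decoding additionally conditions on a learnable domain vector and that both domains contribute real style labels. A toy sketch (dimensions, scalar classifier-loss stand-ins, and the linear decoder are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(2)
c_dim, d_dim, s_dim, vocab = 4, 3, 3, 6
d_S = rng.normal(size=d_dim)            # learnable source domain vector
d_T = rng.normal(size=d_dim)            # learnable target domain vector
style_emb = rng.normal(size=(2, s_dim)) # styles shared across domains
W = rng.normal(size=(c_dim + d_dim + s_dim, vocab))  # shared decoder stand-in

def recon_nll(tokens, c, d, l):
    """-log p_D(x | c, d, l): decoding also conditions on a domain vector."""
    z = np.concatenate([c, d, style_emb[l]]) @ W
    p = np.exp(z - z.max()); p /= p.sum()
    return -sum(np.log(p[t]) for t in tokens)

c_src = rng.normal(size=c_dim)
c_tgt = rng.normal(size=c_dim)
l_ae = (recon_nll([1, 2], c_src, d_S, 0)     # source sentence with d_S
        + recon_nll([3, 4], c_tgt, d_T, 1))  # target sentence with d_T
l_style = 0.5 + 0.3   # stand-ins for the separate C_S and C_T classifier losses
loss = l_ae + l_style                        # overall DAST objective
print(loss > 0.0)
```

Because only the conditioning vector (`d_S` vs. `d_T`) differs between domains, the shared decoder can reuse generic stylized phrasing while still being steered toward domain-specific word choices.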

Experiments
We evaluate our proposed models on two tasks: sentiment transfer (positive-to-negative and negative-to-positive), and formality transfer (informal-to-formal). In both tasks, we make comparisons with previous approaches over multiple target domains. All experiments are conducted on one Nvidia GTX 1080Ti GPU.

Dataset
Statistics for the source and target corpora used in the experiments are summarized in Table 1.

Sentiment Transfer. For the source domain, we use the IMDB movie review dataset (Diao et al., 2014), following the filtering and preprocessing pipelines from . This results in 344k training samples with sentiment labels. For the target domain, both the Yelp restaurant review dataset and the Amazon product review dataset are from . For the test sets, we evaluate our methods using 1k human-transferred sentences, annotated by , on both the Yelp and Amazon datasets. In addition to the two standard sentiment datasets, we manually collected a Yahoo sentimental question dataset: 7k question samples with sentiment labels drawn from the Yahoo! Answers dataset (Zhang et al., 2015). We split the 7k sentimental questions into 4k/2k/1k train/dev/test sets, respectively. Note that the Yahoo sentiment dataset consists only of questions, which have different domain characteristics from the IMDB dataset. In all the sentiment experiments, we consider both transfer directions (positive-to-negative and negative-to-positive).
Formality Transfer. We use Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018) as the source dataset. The publicly released version of GYAFC covers two topics (Entertainment & Music and Family & Relationships), where each topic contains 50k paired informal and formal sentences written by humans. For the target domain, we use the Enron email conversation dataset, which covers several different fields such as Business, Politics, and Daily Life. We manually labeled 7k non-parallel sentences written in either the formal or informal style. We split the Enron dataset into 6k/500/500 samples for training, validation, and testing, respectively. Both the validation and test sets consist only of informal sentences, with the corresponding formal references annotated via a crowd-sourcing platform for evaluation. We only assess the informal-to-formal transfer direction in the formality transfer experiment.

Evaluation
Automatic Metrics. We evaluate the effectiveness of our DAST models with three automatic metrics: (i) Content Preservation. We assess content preservation according to n-gram statistics, measuring the BLEU score (Papineni et al., 2002) between generated sentences and human references on the target domain, referred to as human BLEU (hBLEU). When no human reference is available (e.g., Yahoo), we compute the BLEU score with respect to the input sentences.
(ii) Style Control. We generate samples from the model and measure the style accuracy with a style classifier pre-trained on the target domain, referred to as S-acc. (iii) Domain Control. To validate whether the generated sentences carry the characteristics of the target domain, we adopt a pre-trained domain classifier to measure the percentage of generated sentences that belong to the target domain, referred to as D-acc. All the pre-trained classifiers are implemented with TextCNN (Kim, 2014). The test accuracy of all classifiers used for evaluation is reported in Appendix A.1. Following Xu et al. (2018), we also evaluate all methods with a single unified metric, G-score, the geometric mean of style accuracy and hBLEU.
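The unified metric is straightforward to compute; a small helper following the geometric-mean definition above (the scores plugged in below are hypothetical, for illustration only):

```python
import math

def g_score(s_acc, hbleu):
    """G-score: geometric mean of style accuracy (S-acc) and hBLEU,
    both expressed on the same 0-100 scale."""
    return math.sqrt(s_acc * hbleu)

# Hypothetical scores, not results from the paper.
print(round(g_score(85.0, 20.0), 2))  # 41.23
```

The geometric mean penalizes degenerate trade-offs: a model that reaches high S-acc by copying style words while destroying content (near-zero hBLEU) gets a near-zero G-score.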

[Table 2: automatic evaluation results (D-acc, S-acc, hBLEU, G-score) on the Yelp and Amazon target domains with 100% target data.]

Human Evaluation. To assess the quality of transferred sentences, we conduct human evaluations on the facets of content preservation, style control, and fluency, following Mir et al. (2019). Previous works (Subramanian et al., 2018; Gong et al., 2019) ask workers to evaluate quality via a numerical score; however, we found that this empirically leads to high-variance results. Instead, we pair transferred sentences from two different models and ask workers to choose the sentence they prefer when compared to the input on each evaluation aspect. We provide a "No Preference" option for cases where workers find the qualities of the two sentences indistinguishable. Details of the human evaluation instructions are included in Appendix A.3. For each test, we randomly sample 100 sentences from the corresponding test set and collect three human responses for each pair on every evaluation aspect, resulting in 2700 responses in total.

Experimental Setup
The encoder E and the decoder D are implemented with one-layer GRUs (Cho et al., 2014); accordingly, the decoder is initialized with a concatenation of the content and style representations. TextCNN (Kim, 2014) is employed for the domain-specific style classifiers, pre-trained on the corresponding domains. After pre-training, the parameters of the classifiers are fixed. We use the hard-sampling trick (Logeswaran et al., 2018) to back-propagate the loss through discrete tokens from the classifier to the encoder-decoder model. During training, we assign each mini-batch the same amount of source and target data to balance the training. We make an extensive comparison with five state-of-the-art text style transfer models: CrossAlign , Delete&Retrieve , CycleRL (Xu et al., 2018), SMAE (Zhang et al., 2018c), and ControlGen . We also experiment with a simple and effective domain adaptation baseline, Finetune, which is trained with Eqn. (4) on the source domain and then fine-tuned on the target domain.
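The hard-sampling trick mentioned above can be sketched as a straight-through estimator: the forward pass emits a discrete one-hot token for the classifier, while in an autograd framework the backward pass would flow gradients through the soft distribution. The numpy sketch below shows only the forward arithmetic; the temperature and logits are illustrative:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def hard_sample(logits, tau=1.0):
    """Straight-through hard sampling: forward pass outputs a one-hot token.
    In an autograd framework one would return
        y = y_hard + y_soft - y_soft.detach()
    so the value equals y_hard but gradients follow y_soft."""
    y_soft = softmax(logits, tau)
    y_hard = np.eye(len(y_soft))[int(np.argmax(y_soft))]
    return y_hard, y_soft

y_hard, y_soft = hard_sample([2.0, 0.5, -1.0])
print(y_hard.tolist())   # [1.0, 0.0, 0.0] -- a discrete one-hot token
```

This lets the pretrained, frozen style classifiers score actual discrete generations while still providing a training signal to the encoder-decoder.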

Results
Model Comparisons. To evaluate the effectiveness of leveraging massive data from other domains, we compare our proposed DAST models with previous models trained only on the target domain (Table 2). We observe that by leveraging massive data from the IMDB dataset, our models outperform all baselines on the sentiment transfer tasks in both the Yelp and Amazon domains.
Notably, when the target domain has limited data (1%), all baselines trained only on the target domain fail completely at content preservation. Finetune preserves content better but suffers from catastrophic forgetting (Goodfellow et al., 2013) of the source domain information. As a result, its overall style transfer performance is still suboptimal. By contrast, with the help of the source domain, DAST obtains a considerable improvement in content preservation compared with the other baselines. Our model also attains favorable performance in terms of style transfer accuracy (S-acc), resulting in a good overall G-score. In general, we observe that DAST-C better preserves content information, while DAST further improves both content preservation and style control. Additionally, both DAST-C and DAST adapt to the target domain, as evidenced by the high domain accuracy (D-acc). The human evaluation results (Table 3) show a strong preference for DAST over DAST-C as well as ControlGen in terms of style control, content preservation, and fluency.
Finally, we evaluate our models on the Yahoo sentiment transfer task. As can be seen in Table 4, both DAST and DAST-C achieve successful style transfer even though the target data consists of questions, which have a large discrepancy with the source IMDB domain. Samples of Yelp and Yahoo sentiment transfer are shown in Table 5. We also investigate the effect of different source domain data in Appendix A.2.
Limiting the Target Domain Data. We further test the limits of our model by using as little target domain data as possible. Figure 2 shows the quantitative results with different percentages of target domain training data. When the target domain data is insufficient, especially below 10%, the content preservation ability of the baseline (trained with target data only) degrades rapidly despite a relatively high style transfer accuracy. This is less than desirable, because by retrieving sentences with the target style, a transferred sentence can easily exhibit the correct style while retaining barely any content of the input. Finetune improves content preservation but still suffers from the same problem with less target data. Note that DAST-C is not comparable to Finetune, as the former does not use the style information in the source domain.

Table 5: Qualitative samples of sentiment transfer (Yelp, Yahoo) and formality transfer (Enron).

Yelp (positive-to-negative):
Input: the service was great , food delicious , and the value impeccable .
ControlGen: the service was horrible , service , the service and very frustrated .
Finetune: the service was poor , food ... , and the experience were .
DAST-C: the service was horrible , food horrible , and the slow sparse .
DAST: the service was horrible , food bland , and the value lousy .
Human: service was poor and the food expensive and weak tasting .

Yelp (negative-to-positive):
Input: and the pizza was cold , greasy , and generally quite awful .
ControlGen: and the food was delicious , delicious , and freaking tasty , delicious .
Finetune: and the pizza was professional , friendly , and always have great .
DAST-C: and the pizza was fresh , greasy , and generally quite cool .
DAST: and the pizza was tasty , juicy , and definitely quite amazing .
Human: the pizza was warm , not greasy , and generally tasted great .

Yahoo (positive-to-negative):
Input: who is more romantic ? man or woman ?
ControlGen: which is more stupid ? and or why ?
Finetune: the is more expensive ? man or woman ?
DAST-C: who is more ugly ? man or woman ?
DAST: who is more crazy ? man or woman ?

Yahoo (negative-to-positive):
Input: why do stupid questions constantly receive intelligent answers ?
ControlGen: men do fantastic questions constantly receive intelligent bound !
Finetune: why do great questions read more entertaining answers ?
DAST-C: why do important questions constantly receive intelligent answers ?
DAST: why do nice questions constantly receive intelligent answers ?

Enron (informal-to-formal):
Input: ya 'll need to come visit us in austin .
ControlGen: could we need to look on saturday in enpower .
Finetune: you will need to go in bed with him .
DAST-C: you will need to visit town .
DAST: yes , you will need to visit us in austin .
Human: all of you should come visit us in austin .

Enron (informal-to-formal):
Input: are n't you suppose to be teaching some kids or something ?
ControlGen: are you not supposed to be disloyal some kids or something ?
Finetune: are you not to be able to be some man or something ?
DAST-C: are not you supposed to be teaching some kids or something ?
DAST: are you not supposed to be teaching some children or something ?
Human: are you not supposed to be instructing children ?
Both DAST models bring substantial improvements to content preservation and can still successfully manipulate the styles, resulting in consistently higher G-scores. This is presumably because our models adapt both the content and the style information from the source domain to consistently sustain style transfer on the target domain. By learning both generic and domain-specific stylized information, DAST outperforms DAST-C in terms of content preservation and style control. Even with 0.1% of the target domain data (400 samples), DAST attains a reasonable degree of text style transfer, whereas the model trained only on the target data generates entirely nonsensical sentences. Meanwhile, DAST succeeds in transferring sentences in a domain-aware manner, achieving consistently high domain accuracy.
Ablation Study. To investigate the effect of individual components and the training setup on overall performance, we conduct an ablation study in Table 6. The domain vectors enable the model to transfer sentences in a domain-aware manner and thus give the largest boost in domain accuracy. Without the domain-specific style classifiers, the model mixes the style information of both domains, resulting in worse style control and content preservation. Additionally, simply increasing the number of training examples (i.e., the row "w/o both") improves content preservation but introduces a data distribution discrepancy between the training data (Yelp+IMDB) and the test data (Yelp), as evidenced by the lower S-acc and D-acc scores.
In terms of the training setup, the source domain IMDB mostly helps content preservation, while accurate style information is mainly learned from the target domain Yelp. Finetune gives higher S-acc and D-acc but lower hBLEU due to catastrophic forgetting. Our proposed DAST successfully exploits the source domain data and thus yields balanced results on style and domain control while maintaining content preservation.
Non-parallel Style Transfer with Parallel Source Data. Finally, to verify the versatility of our proposed models in different scenarios, we investigate another domain adaptation setting, where the source domain data (GYAFC) is parallel but the target domain data (Enron) is non-parallel. Since parallel data is available in the source domain, we can simply add a sequence-to-sequence loss $\mathcal{L}^S_{s2s}$ on the source domain data to Eqn. (6) and Eqn. (9) to help the target domain, which lacks parallel data. The training objectives can be written as $\mathcal{L}^T_{ae} + \mathcal{L}^T_{style} + \mathcal{L}^S_{ae} + \mathcal{L}^S_{s2s}$ and $\mathcal{L}^{S,T}_{ae} + \mathcal{L}^{S,T}_{style} + \mathcal{L}^S_{s2s}$, respectively. Results are summarized in Table 7. DAST outperforms the other methods on both style control and content preservation while keeping the transferred sentences endowed with target-specific characteristics (D-acc). A strong human preference for DAST over the baselines can be observed in Table 3. Qualitative samples are provided in Table 5.
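The extended objectives simply sum a supervised sequence-to-sequence term over the parallel source pairs with the existing non-parallel losses. A trivial sketch (the scalar loss values are placeholders, not measurements from the paper):

```python
def dast_c_parallel(l_ae_T, l_style_T, l_ae_S, l_s2s_S):
    """DAST-C with parallel source data:
    L_ae^T + L_style^T + L_ae^S + L_s2s^S."""
    return l_ae_T + l_style_T + l_ae_S + l_s2s_S

def dast_parallel(l_ae_ST, l_style_ST, l_s2s_S):
    """DAST with parallel source data:
    L_ae^{S,T} + L_style^{S,T} + L_s2s^S."""
    return l_ae_ST + l_style_ST + l_s2s_S

# Placeholder loss values, for illustration only.
print(round(dast_c_parallel(2.0, 0.5, 1.5, 1.0), 2))  # 5.0
print(round(dast_parallel(3.0, 0.8, 1.0), 2))         # 4.8
```

The supervised term only touches the source domain, so the target domain still trains purely on non-parallel data.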

Conclusion
We present two simple yet effective domain adaptive text style transfer models that leverage massively available data from other domains to facilitate the transfer task in the target domain. The proposed models achieve better content preservation with the generic information learned from the source domain and simultaneously distinguish domain-specific information, which enables them to transfer text in a domain-adaptive manner. Extensive experiments demonstrate their robustness and applicability in various scenarios where the target data is limited.