Contextual Text Style Transfer

We introduce a new task, Contextual Text Style Transfer: translating a sentence into a desired style with its surrounding context taken into account. This brings two key challenges to existing style transfer approaches: (i) how to preserve the semantic meaning of the target sentence and its consistency with the surrounding context during transfer; (ii) how to train a robust model with limited labeled data accompanied by context. To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context. A classifier is further trained to ensure contextual consistency of the generated sentence. To compensate for the lack of parallel data, additional self-reconstruction and back-translation losses are introduced to leverage non-parallel data in a semi-supervised fashion. Two new benchmarks, Enron-Context and Reddit-Context, are introduced for formality and offensiveness style transfer. Experimental results on these datasets demonstrate the effectiveness of the proposed CAST model over state-of-the-art methods across style accuracy, content preservation and contextual consistency metrics.


Introduction
Text style transfer has been applied to many applications (e.g., sentiment manipulation, formalized writing) with remarkable success. Early work relies on parallel corpora with a sequence-to-sequence learning framework (Bahdanau et al., 2015; Jhamtani et al., 2017). However, collecting parallel annotations is highly time-consuming and expensive. There have also been studies on developing text style transfer models with non-parallel data (Hu et al., 2017; Prabhumoye et al., 2018; Subramanian et al., 2018), assuming that disentangling style information from semantic content can be achieved in an auto-encoding fashion with the introduction of additional regularizers (e.g., adversarial discriminators, language models). Code and datasets will be released at https://github.com/ych133/CAST.
Despite promising results, these techniques still have a long way to go for practical use. Most existing models focus on sentence-level rewriting. However, in real-world applications, sentences typically reside in a surrounding paragraph context. In formalized writing, the rewritten span is expected to align well with the surrounding context to keep a coherent semantic flow. For example, to automatically replace a gender-biased sentence in a job description document, a style transfer model that takes the sentence out of context may not be able to understand its proper meaning and intended message. Taking a single sentence as the sole input of a style transfer model may fail to preserve topical coherence between the generated sentence and its surrounding context, leading to low semantic and logical consistency at the paragraph level (see Example C in Table 4). Similar observations can be found in other style transfer tasks, such as offensive-to-non-offensive and political-to-neutral translation.
Motivated by this, we propose and investigate a new task: Contextual Text Style Transfer. Given a paragraph, the system aims to translate sentences into a desired style, while keeping the edited section topically coherent with its surrounding context. To achieve this goal, we propose a novel Context-Aware Style Transfer (CAST) model that jointly considers style translation and context alignment. To leverage parallel training data, CAST employs two separate encoders to encode the source sentence and its surrounding context, respectively. With the encoded sentence and context embeddings, a decoder is trained to translate the joint features into a new sentence in a specific style. A pre-trained style classifier is applied for style regularization, and a coherence classifier learns to regularize the generated target sentence to be consistent with the context. To overcome the data sparsity issue, we further introduce a set of unsupervised training objectives (e.g., a self-reconstruction loss and a back-translation loss) to leverage non-parallel data in a hybrid approach (Shang et al., 2019). The final CAST model is jointly trained with both parallel and non-parallel data via end-to-end training.
As this is a newly proposed task, we introduce two new datasets, Enron-Context and Reddit-Context, collected via crowdsourcing. The former contains 14,734 formal vs. informal paired samples from Enron (Klimt and Yang, 2004), an email dataset, and the latter contains 23,158 offensive vs. non-offensive paired samples from Reddit (Serban et al., 2017). Each sample contains an original sentence and a human-rewritten one in the target style, accompanied by its paragraph context. In experiments, we also leverage 60k formal/informal sentences from GYAFC (Rao and Tetreault, 2018) and 100k offensive/non-offensive sentences from Reddit (dos Santos et al., 2018) as additional non-parallel data for model training.
The main contributions of this work are summarized as follows: (i) We propose a new task -Contextual Text Style Transfer, which aims to translate a sentence into a desired style while preserving its style-agnostic semantics and topical consistency with the surrounding context. (ii) We introduce two new datasets for this task, Enron-Context and Reddit-Context, which provide strong benchmarks for evaluating contextual style transfer models. (iii) We present a new model -Context-Aware Style Transfer (CAST), which jointly optimizes the generation quality of target sentence and its topical coherency with adjacent context. Extensive experiments on the new datasets demonstrate that the proposed CAST model significantly outperforms state-of-the-art style transfer models.

Text Style Transfer
Text style transfer aims to modify an input sentence into a desired style while preserving its style-independent semantics. Previous work has explored this as a sequence-to-sequence learning task using parallel corpora with paired source/target sentences in different styles. For example, Jhamtani et al. (2017) pre-trained word embeddings by leveraging external dictionaries mapping Shakespearean words to modern English words, along with additional text. However, available parallel data in different styles are very limited. Therefore, there has been a recent surge of interest in a more realistic setting, where only non-parallel stylized corpora are available. A typical approach is: (i) disentangling the latent space into content and style features; then (ii) generating stylistic sentences by tweaking the style-relevant features and passing them through a decoder, together with the original content-relevant features (Xu et al., 2018).
Many of these approaches borrow the idea of an adversarial discriminator/classifier from the Generative Adversarial Network (GAN) framework (Goodfellow et al., 2014). For example, Fu et al. (2018) and Lample et al. (2018) used adversarial classifiers to force the decoder to transfer the encoded source sentence into a different style/language. Alternatively, disentanglement has been achieved by filtering stylistic words from input sentences. Another direction for text style transfer without parallel data is using back-translation (Prabhumoye et al., 2018) with a de-noising auto-encoding objective (Logeswaran et al., 2018; Subramanian et al., 2018).
Regarding tasks, sentiment transfer is one of the most widely studied problems. Transferring from informality to formality (Rao and Tetreault, 2018) is another direction of text style transfer, aiming to rewrite a given sentence as more formal text. dos Santos et al. (2018) presented an approach to transferring offensive text to non-offensive text based on social network data. In Prabhumoye et al. (2018), the authors proposed the political slant transfer task. However, none of these previous studies directly considered context-aware text style transfer, which is the main focus of this work.

Context-aware Text Generation
Our work is related to context-aware text generation (Mikolov and Zweig, 2012; Tang et al., 2016), which can be applied to many NLP tasks (Mangrulkar et al., 2018). For example, previous work has investigated language modeling with context information (Wang and Cho, 2015; Wang et al., 2017; Li et al., 2020), treating the preceding sentences as context. There are also studies on response generation for conversational systems (Sordoni et al., 2015b; Wen et al., 2015), where dialogue history is treated as context. Zang and Wan (2017) introduced a neural model to generate long reviews from aspect-sentiment scores given the topics. Vinyals and Le (2015) proposed a model to predict the next sentence given the previous sentences in a dialogue session. Sordoni et al. (2015a) presented a hierarchical recurrent encoder-decoder model to encode dialogue context. Our work is the first to explore context information in the text style transfer task.

Context-Aware Style Transfer
In this section, we first describe the problem definition and provide an overview of the model architecture in Section 3.1. Section 3.2 presents the proposed Context-Aware Style Transfer (CAST) model with supervised training objectives, and Section 3.3 further introduces how to augment the CAST model with non-parallel data in a hybrid approach.

Overview
Problem Definition The problem of contextual text style transfer is defined as follows. Given a dataset P = {(x_i, l_i, y_i, l̃_i, c_i)}_{i=1}^N, each instance includes: (i) the original sentence x_i with a style l_i; (ii) its corresponding rewritten sentence y_i in another style l̃_i; and (iii) the paragraph context c_i. x_i and y_i are expected to encode the same semantic content, but in different language styles (i.e., l_i ≠ l̃_i). The goal is to transform x_i in style l_i into y_i in style l̃_i, while keeping y_i semantically coherent with its context c_i. In practice, labeled parallel data may be difficult to garner. Ideally, additional non-parallel data U = {(x_i, l_i)}_{i=1}^N can be leveraged to enhance model training.
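The data layout above can be sketched as a simple record. This is purely illustrative; the field names and example values are ours, not from the paper's released data format.

```python
from dataclasses import dataclass

@dataclass
class ParallelInstance:
    """One instance from the parallel corpus P (field names are illustrative)."""
    x: str          # original sentence x_i
    src_style: str  # its style l_i (e.g., "informal")
    y: str          # human rewrite y_i in the target style
    tgt_style: str  # target style l~_i (e.g., "formal")
    context: str    # surrounding paragraph context c_i

# A hypothetical Enron-Context-like example (not an actual dataset entry).
ex = ParallelInstance(
    x="gonna send u the report tmrw",
    src_style="informal",
    y="I will send you the report tomorrow.",
    tgt_style="formal",
    context="Hi John, thanks for the update on the merger.",
)
assert ex.src_style != ex.tgt_style  # styles must differ (l_i != l~_i)
```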
Model Architecture The architecture of the proposed CAST model is illustrated in Figure 1. The hybrid model training process consists of two paths, one for parallel data and the other for non-parallel data. In the parallel path, a Seq2Seq loss and a contextual coherence loss are included, for the joint training of two encoders (Sentence Encoder and Context Encoder) and the Sentence Decoder. The non-parallel path is designed to further enhance the Sentence Encoder and Decoder with three additional losses: (i) a self-reconstruction loss; (ii) a back-translation loss; and (iii) a style classification loss. The final training objective, uniting both parallel and non-parallel paths, is formulated as:

L = L^P_s2s + λ_1 L^P_cohere + λ_2 L^U_recon + λ_3 L^U_back + λ_4 L^U_style,    (1)

where λ_1, λ_2, λ_3 and λ_4 are hyper-parameters to balance the different objectives. Each of these loss terms will be explained in the following subsections.
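The hybrid objective above can be sketched as a weighted sum of the five losses. The function below is a minimal illustration; the default λ values and the loss numbers in the example are placeholders, not tuned values from the paper.

```python
def total_loss(l_s2s, l_cohere, l_recon, l_back, l_style,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Combine the parallel-path losses (contextual Seq2Seq, coherence)
    with the non-parallel-path losses (self-reconstruction,
    back-translation, style classification) into one training objective.
    The lambda weights here are illustrative defaults."""
    lam1, lam2, lam3, lam4 = lambdas
    return l_s2s + lam1 * l_cohere + lam2 * l_recon + lam3 * l_back + lam4 * l_style

# With unit weights the five per-path losses simply add up.
assert total_loss(1.0, 0.5, 0.25, 0.25, 0.5) == 2.5
```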

Supervised Training Objectives
In this subsection, we discuss the training objective associated with parallel data, consisting of: (i) a contextual Seq2Seq loss; and (ii) a contextual coherence loss.
Contextual Seq2Seq Loss When parallel data is available, a Seq2Seq model can be directly learned for text style transfer. We denote the Seq2Seq model as (E, D), where the semantic representation of sentence x_i is extracted by the encoder E, and the decoder D aims to learn a conditional distribution of y_i given the encoded feature E(x_i) and style l̃_i:

L_s2s = -E_{(x_i, y_i)~P} [log p_D(y_i | E(x_i), l̃_i)].    (2)

However, in such a sentence-to-sentence style transfer setting, the context in the paragraph is ignored, which, if well utilized, could help improve generation quality such as paragraph-level topical coherence. Thus, to take advantage of the paragraph context c_i, we use two separate encoders E_s and E_c to encode the sentence and the context independently. The outputs of the two encoders are combined via a linear layer to obtain a context-aware sentence representation, which is then fed to the decoder to generate the target sentence. The model is trained to minimize the following loss:

L^P_s2s = -E_{(x_i, c_i, y_i)~P} [log p_D(y_i | E_s(x_i), E_c(c_i), l̃_i)].    (3)

Compared with Eqn. (2), the use of E_c(c_i) makes the text style transfer process context-dependent. The generated sentence is denoted as ỹ_i.

Contextual Coherence Loss To enforce contextual coherence (i.e., to ensure the generated sentence ỹ_i aligns with the surrounding context c_i), we train a coherence classifier that judges whether c_i is the context of ỹ_i, by adopting a language model with an objective similar to next sentence prediction (Devlin et al., 2019).
Specifically, assume that y_i is the t-th sentence of a paragraph p_i (i.e., p_i is obtained by placing y_i back into its surrounding context c_i). Based on this, we obtain a paragraph representation u_i via a language model encoder. Then, we apply a linear layer to the representation, followed by a tanh function and a softmax layer, to predict a binary label s_i, which indicates whether c_i is the right context for y_i:

u_i = LM(p_i),  p(s_i | y_i, c_i) = softmax(W_2 tanh(W_1 u_i + b_1) + b_2),    (4)

where LM represents the language model encoder, and s_i = 1 indicates that c_i is the context of y_i. f(·) is a softmax function with temperature τ, where the logits are the predicted network output with a dimension of the vocabulary size. Note that since ỹ_i consists of discrete tokens that are non-differentiable, we use the continuous feature f(ỹ_i) to generate ỹ_i as the input of the language model. We construct paired data {y_i, c_i, s_i}_{i=1}^N for training the classifier, where the negative samples are created by replacing a sentence in a paragraph with another random sentence. After pre-training, the coherence classifier is used to compute the contextual coherence loss:

L^P_cohere = -E [log p(s_i = 1 | ỹ_i, c_i)].    (5)

Intuitively, minimizing L^P_cohere encourages ỹ_i to blend better into its context c_i. Note that the coherence classifier is pre-trained, and remains fixed during the training of the CAST model. The above coherence loss can be used to update the parameters of E_s, E_c and D during model training.
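The negative-sample construction for the coherence classifier can be sketched as follows. This helper is a simplified illustration under our own naming, not the authors' released preprocessing code.

```python
import random

def make_coherence_pairs(paragraphs, rng=random.Random(0)):
    """Build {sentence, context, label} pairs for the coherence classifier.
    Positives pair a sentence with its true surrounding context (s=1);
    negatives swap in a random sentence from the corpus (s=0). For brevity
    we do not exclude the original sentence from the negative pool."""
    all_sents = [s for p in paragraphs for s in p]
    pairs = []
    for p in paragraphs:
        for t, sent in enumerate(p):
            context = p[:t] + p[t + 1:]        # the surrounding sentences c_i
            pairs.append({"y": sent, "c": context, "s": 1})
            neg = rng.choice(all_sents)        # random replacement sentence
            pairs.append({"y": neg, "c": context, "s": 0})
    return pairs

pairs = make_coherence_pairs([["A b.", "C d.", "E f."]])
assert len(pairs) == 6                      # 3 positives + 3 negatives
assert sum(p["s"] for p in pairs) == 3      # half are positive examples
```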

Unsupervised Training Objectives
For the contextual style transfer task, there are not many parallel datasets available with style-labeled paragraph pairs. To overcome the data sparsity issue, we propose a hybrid approach to leverage additional non-parallel data U = {(x i , l i )} N i=1 , which are abundant and less expensive to collect. In order to fully exploit U to enhance the training of the Sentence Encoder and Decoder (E s , D), we introduce three additional training losses, detailed below.

Reconstruction Loss
The reconstruction loss aims to encourage E_s and D to reconstruct the input sentence itself, if the desired style is the same as the input style. The corresponding objective is similar to Eqn. (2):

L^U_recon = -E_{(x_i, l_i)~U} [log p_D(x_i | E_s(x_i), l_i)].    (6)

Compared to Eqn. (2), here we encourage the decoder D to recover x_i's original style properties as accurately as possible, given the style label l_i. The self-reconstructed sentence is denoted as x̂_i.

Back-Translation Loss The back-translation loss requires the model to reconstruct the input sentence after a transformation loop. Specifically, the input sentence x_i is first transferred into the target style, i.e., x̃_i = D(E_s(x_i), l̃_i). Then the generated target sentence is transferred back into its original style, i.e., x̂_i = D(E_s(x̃_i), l_i). The back-translation loss is defined as:

L^U_back = -E_{(x_i, l_i)~U} [log p_D(x_i | E_s(x̃_i), l_i)],    (7)

where the source and target styles are denoted as l_i and l̃_i, respectively.
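The back-translation cycle above can be sketched with a stub transfer function. Here `transfer(sentence, style)` stands in for the composed encoder-decoder D(E_s(·), style); the toy model below is purely illustrative.

```python
def back_translate(x, src_style, tgt_style, transfer):
    """One back-translation cycle: transfer x into the target style, then
    transfer the result back; the loss compares the round trip against x.
    `transfer(sentence, style)` stands in for D(E_s(sentence), style)."""
    x_tilde = transfer(x, tgt_style)       # x~_i = D(E_s(x_i), l~_i)
    x_hat = transfer(x_tilde, src_style)   # x^_i = D(E_s(x~_i), l_i)
    return x_tilde, x_hat

# Toy "transfer model": formal style capitalizes, informal lower-cases.
toy = lambda s, style: s.capitalize() if style == "formal" else s.lower()
x_tilde, x_hat = back_translate("hello there", "informal", "formal", toy)
assert x_tilde == "Hello there"
assert x_hat == "hello there"   # perfect round trip => zero back-translation loss
```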

Style Classification Loss
To further boost the model, we use U to train a classifier that predicts the style of a given sentence, and regularize the training of (E_s, D) with the pre-trained style classifier. The classifier is trained with the objective:

L_C = -E_{(x_i, l_i)~U} [log p_C(l_i | x_i)],    (8)

where p_C(·) denotes the style classifier. After the classifier is trained, we keep its parameters fixed, and apply it to update the parameters of (E_s, D). The resulting style classification loss utilizing the pre-trained style classifier is defined as:

L^U_style = -E_{(x_i, l_i)~U} [log p_C(l̃_i | x̃_i)],    (9)

where x̃_i = D(E_s(x_i), l̃_i) is the transferred sentence.

New Benchmarks
Existing text style transfer datasets, either parallel or non-parallel, do not contain contextual information, and are thus unsuitable for the contextual transfer task. To provide benchmarks for evaluation, we introduce two new datasets, Enron-Context and Reddit-Context, derived from two existing datasets: Enron (Klimt and Yang, 2004) and Reddit Politics (Serban et al., 2017).

1) Enron-Context
To build a formality transfer dataset with paragraph contexts, we randomly sampled emails from the Enron corpus (Klimt and Yang, 2004). After pre-processing and filtering with NLTK (Bird et al., 2009), we asked Amazon Mechanical Turk (AMT) annotators to identify informal sentences within each email and rewrite them in a more formal style. Then, we asked a different group of annotators to verify that each rewritten sentence is more formal than the original.
2) Reddit-Context

Another typical style transfer task is offensive vs. non-offensive translation, for which we collected another dataset from the Reddit Politics corpus (Serban et al., 2017). First, we identified offensive sentences in the original dataset via sentence-level classification. After filtering out extremely long/short sentences, we randomly selected a subset of sentences (10% of the whole dataset) and asked AMT annotators to rewrite each offensive sentence into two non-offensive alternatives. After manually removing wrong or duplicate annotations, we obtained a total of 14,734 rewritten sentences for Enron-Context and 23,158 for Reddit-Context. We also limited the vocabulary size by replacing words with a frequency lower than 20/70 in the Enron/Reddit datasets with a special unknown token. Table 1 provides statistics on the two datasets. More details on AMT data collection are provided in the Appendix.
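The vocabulary truncation above (frequency thresholds of 20 for Enron-Context and 70 for Reddit-Context) can be sketched as follows; the helper and the `<unk>` token spelling are illustrative.

```python
from collections import Counter

def replace_rare_words(sentences, min_freq, unk="<unk>"):
    """Replace every token whose corpus frequency falls below `min_freq`
    with a special unknown token, limiting the vocabulary size
    (min_freq=20 for Enron-Context, 70 for Reddit-Context)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    keep = {w for w, c in counts.items() if c >= min_freq}
    return [" ".join(t if t in keep else unk for t in s.split())
            for s in sentences]

out = replace_rare_words(["a a b", "a c"], min_freq=2)
assert out == ["a a <unk>", "a <unk>"]   # 'b' and 'c' occur only once
```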

Experiments
In this section, we compare our model with state-of-the-art baselines on the two new benchmarks, and provide both quantitative analysis and human evaluation to validate the effectiveness of the proposed CAST model.

Datasets and Baselines
In addition to the two new parallel datasets, we also leverage non-parallel datasets for CAST model training. For formality transfer, one choice is Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018), crawled and annotated from two domains in Yahoo Answers. This corpus contains paired informal-formal sentences without context. We randomly selected a subset of sentences (28,375/29,774 formal/informal) from the GYAFC dataset as our training data. For offensiveness transfer, we utilize the Reddit dataset. Following dos Santos et al. (2018), we used a pre-trained classifier to extract 53,028/53,714 offensive/non-offensive sentences from Reddit posts as our training data. Table 2 provides the statistics of the parallel and non-parallel datasets used for the two style transfer tasks. For the non-parallel datasets, we split them into two parts: one for CAST model training ('Train'), and the other for style classifier pre-training. Similarly, the training sets of the parallel datasets are divided in two as well, for the training of CAST ('Train/Dev/Test') and the coherence classifier, respectively. We compare the CAST model with several baselines: (i) Seq2Seq: a Transformer-based Seq2Seq model (Eqn. (2)), taking sentences as the only input, trained on parallel data only; (ii) Contextual Seq2Seq: a Transformer-based contextual Seq2Seq model (Eqn. (3)), taking both context and sentence as input, trained on parallel data only; (iii) Hybrid Seq2Seq (Xu et al., 2019): a Seq2Seq model leveraging both parallel and non-parallel data; (iv) ControlGen (Hu et al., 2017): a state-of-the-art text style transfer model using non-parallel data; (v) MulAttGen (Subramanian et al., 2018): another state-of-the-art style transfer model that allows flexible control over multiple attributes.

Evaluation Metrics
The contextual style transfer task requires a model to generate sentences that: (i) preserve the original semantic content and structure of the source sentence; (ii) conform to the pre-specified style; and (iii) align with the surrounding context in the paragraph. Thus, we consider the following automatic metrics for evaluation:

Content Preservation. We assess the degree of content preservation during transfer by measuring BLEU scores (Papineni et al., 2002) between generated sentences and human references. Following Rao and Tetreault (2018), we also use GLEU as an additional metric for the formality transfer task, which was originally introduced for the grammatical error correction task (Napoles et al., 2015).
For offensiveness transfer, we include perplexity (PPL) as used in dos Santos et al. (2018), which is computed by a word-level LSTM language model pre-trained on non-offensive sentences.
Style Accuracy. Similar to prior work, we measure style accuracy using the prediction accuracy of the pre-trained style classifier over generated sentences (Acc.).
Context Coherence. We use the prediction accuracy of the pre-trained coherence classifier to measure how a generated sentence matches its surrounding context (Coherence).
The evaluation classifiers are trained separately from those used to train CAST, following dos Santos et al. (2018). For formality transfer, the style classifier and coherence classifier reach 91.35% and 86.78% accuracy, respectively, on their pre-training datasets. For offensiveness transfer, the corresponding accuracies are 93.47% and 84.96%.
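The classifier-based metrics above reduce to plain accuracy computations. The sketch below is illustrative; `predict` stands in for the pre-trained style or coherence classifier, and the toy predictor is our own.

```python
def classifier_accuracy(sentences, targets, predict):
    """Fraction of generated sentences whose predicted label matches the
    target; used for both style accuracy (target = desired style) and
    coherence (target = 1, i.e., the sentence fits its context)."""
    hits = sum(predict(s) == t for s, t in zip(sentences, targets))
    return hits / len(sentences)

# Toy style predictor: a sentence ending with a period counts as formal.
toy_style = lambda s: "formal" if s.endswith(".") else "informal"
acc = classifier_accuracy(["I will attend.", "gonna be there"],
                          ["formal", "formal"], toy_style)
assert acc == 0.5   # one of the two outputs hits the desired style
```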

Implementation Details
The context encoder, sentence encoder and sentence decoder are all implemented as one-layer Transformers with 4 heads. The hidden dimension of each head is 256, and the hidden dimension of the feed-forward sub-layer is 1024. The context encoder takes a maximum of 50 words from the surrounding context of the target sentence. For the style classifier, we use a standard CNN-based sentence classifier (Kim, 2014).
Since the non-parallel corpus U contains more samples than the parallel one P, we down-sample U to assign each mini-batch the same number of parallel and non-parallel samples to balance training, alleviating the 'catastrophic forgetting problem' described in Howard and Ruder (2018). We train the model using the Adam optimizer with a mini-batch size of 64 and a learning rate of 0.0005. The validation set is used to select the best hyper-parameters. Hard-sampling (Logeswaran et al., 2018) is used to back-propagate the loss through discrete tokens from the pre-trained classifiers to the model. For the ControlGen (Hu et al., 2017) baseline, we use the code provided by the authors with their default hyper-parameter setting. For Hybrid Seq2Seq (Xu et al., 2019) and MulAttGen (Subramanian et al., 2018), we re-implement the models following the original papers.
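The balanced-batch scheme above can be sketched as follows. The batching helper is our own illustration of the idea, not the paper's training loop.

```python
import random

def balanced_batches(parallel, nonparallel, batch_size=64, rng=None):
    """Yield mini-batches containing an equal number of parallel and
    non-parallel samples; the larger non-parallel set U is down-sampled
    so neither data source dominates a training step."""
    rng = rng or random.Random(0)
    half = batch_size // 2
    parallel = list(parallel)
    rng.shuffle(parallel)
    sampled = rng.sample(list(nonparallel), len(parallel))  # down-sample U to |P|
    for i in range(0, len(parallel) - half + 1, half):
        yield parallel[i:i + half] + sampled[i:i + half]

# Toy corpora: parallel pairs are 2-tuples, non-parallel samples 1-tuples.
P = [("x%d" % i, "y%d" % i) for i in range(64)]
U = [("u%d" % i,) for i in range(500)]
batch = next(balanced_batches(P, U))
assert len(batch) == 64                          # full mini-batch
assert sum(len(s) == 2 for s in batch) == 32     # exactly half parallel
```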

Experimental Results
Formality Transfer Results on the formality transfer task are summarized in Table 3. The CAST model achieves better performance than all the baselines. In particular, CAST boosts the GLEU and Coherence scores by a large margin. Hybrid Seq2Seq also achieves good performance by utilizing non-parallel data. By incorporating context information, Contextual Seq2Seq also improves over the vanilla Seq2Seq model. As expected, ControlGen does not perform well, since only non-parallel data is used for training.

Offensiveness Transfer Results are summarized in Table 3. CAST achieves the best performance on all the metrics except PPL. In terms of Coherence, Contextual Seq2Seq and CAST, which leverage context information, achieve better performance than the Seq2Seq baseline. Contextual Seq2Seq also improves BLEU, which differs from the observation in the formality transfer task. On PPL, CAST produces slightly worse performance than ControlGen and MulAttGen. We hypothesize that this is because our model tends to use the same non-offensive word to replace an offensive word, producing some atypical sentences, as discussed in dos Santos et al. (2018).

Qualitative Analysis Table 4 presents some generation examples from different models. We observe that CAST is better at replacing informal words with formal ones (Examples B and C), and generates more context-aware sentences (Examples A and C), possibly due to the use of the coherence and style classifiers. We also observe that the exploitation of context information can help the model preserve the semantic content of the original sentence (Example B).

Table 6: Results of pairwise human evaluation between CAST and three baselines on two style transfer tasks. Win/lose/tie indicate the percentage of results generated by CAST being better/worse/equal to the reference model.
Ablation Study To investigate the effectiveness of each component of the CAST model, we conduct detailed ablation studies and summarize the results in Table 5. Experiments show that the context encoder and the coherence classifier play an important role in the proposed model. The context encoder improves content preservation and style transfer accuracy, demonstrating the effectiveness of using context. The coherence classifier helps improve the coherence score, but does not contribute much to style accuracy. By using these two components, our model strikes a proper balance between translating to the correct style and maintaining contextual consistency. When both of them are removed (the 4th row), performance on all the metrics drops significantly. We also observe that without using non-parallel data, the model performs poorly, showing the benefit of the hybrid approach and of more data for this task.
Human Evaluation Considering the subjective nature of this task, we conduct human evaluation to judge model outputs in terms of content preservation, style control and context consistency. Given an original sentence along with its corresponding context and a pair of generated sentences from two different models, AMT workers were asked to select the better one based on these three aspects. The AMT interface also allows a neutral option, if the worker considers both sentences equally good on a given aspect. We randomly sampled 200 sentences from the test set, and collected three human responses for each pair. Table 6 reports the pairwise comparison results on both tasks. Based on human judgment, the quality of sentences transferred by CAST is significantly higher than that of the other methods across all three metrics. This is consistent with the experimental results on the automatic metrics discussed earlier.

Conclusion
In this paper, we present a new task, Contextual Text Style Transfer. Two new benchmark datasets are introduced for this task, which contain annotated sentence pairs accompanied by paragraph context. We also propose a new CAST model, which can effectively enforce content preservation and context coherence by exploiting abundant non-parallel data in a hybrid approach. Quantitative and human evaluations demonstrate that the CAST model significantly outperforms baseline methods that do not consider context information. We believe our model takes a first step towards modeling context information for text style transfer, and we will explore more advanced solutions, e.g., using a better encoder/decoder such as GPT-2 (Radford et al., 2019) or BERT (Devlin et al., 2019), adversarial learning (Zhu et al., 2020), or knowledge distillation (Chen et al., 2019).