Reformulating Unsupervised Style Transfer as Paraphrase Generation

Modern NLP defines the task of style transfer as modifying the style of a given sentence without appreciably changing its semantics, which implies that the outputs of style transfer systems should be paraphrases of their inputs. However, many existing systems purportedly designed for style transfer inherently warp the input's meaning through attribute transfer, which changes semantic properties such as sentiment. In this paper, we reformulate unsupervised style transfer as a paraphrase generation problem, and present a simple methodology based on fine-tuning pretrained language models on automatically generated paraphrase data. Despite its simplicity, our method significantly outperforms state-of-the-art style transfer systems on both human and automatic evaluations. We also survey 23 style transfer papers and discover that existing automatic metrics can be easily gamed and propose fixed variants. Finally, we pivot to a more real-world style transfer setting by collecting a large dataset of 15M sentences in 11 diverse styles, which we use for an in-depth analysis of our system.


Introduction
The task of style transfer on text data involves changing the style of a given sentence while preserving its semantics. Recent work in this area conflates style transfer with the related task of attribute transfer (Subramanian et al., 2019; He et al., 2020), in which modifications to attribute-specific content words (e.g., those that carry sentiment) warp both stylistic and semantic properties of a sentence (Preotiuc-Pietro et al., 2016). Attribute transfer has been criticized for its limited real-world applications: Pang (2019) argues that semantic preservation is critical for author obfuscation (Shetty et al., 2018), data augmentation (Xie et al., 2019; Kaushik et al., 2020), text simplification (Xu et al., 2015), and writing assistance (Heidorn, 2000). Moreover, semantic preservation (via paraphrases) has several applications, such as better translation evaluation (Sellam et al., 2020; Freitag et al., 2020) and adversarial defenses (Iyyer et al., 2018).
Figure 1: During training, STRAP applies a diverse paraphraser to an input sentence and passes the result through a style-specific inverse paraphraser to reconstruct the input. At test time, we perform style transfer by swapping out different inverse paraphrase models (Shakespeare → Twitter shown here). All generated sentences shown here are actual outputs from STRAP.
We propose to improve semantic preservation in style transfer by modeling the task as a controlled paraphrase generation problem. Our unsupervised method (Style Transfer via Paraphrasing, or STRAP) requires no parallel data between different styles and proceeds in three simple stages: (1) create pseudo-parallel data by feeding sentences from each style through a diverse paraphrase model; (2) train style-specific inverse paraphrase models that reconstruct the original sentences from their paraphrases; and (3) perform style transfer at test time with the inverse paraphraser of the target style. Our approach requires none of the finicky modeling paradigms popular in style transfer research: no reinforcement learning (Luo et al., 2019), variational inference (He et al., 2020), or autoregressive sampling during training (Subramanian et al., 2019).
Instead, we implement the first two stages of our pipeline by simply fine-tuning a pretrained GPT-2 language model (Radford et al., 2019).
Despite its simplicity, STRAP significantly outperforms the state of the art on formality transfer and Shakespeare author imitation datasets by 2-3x on automatic evaluations and 4-5x on human evaluations. We further show that only 3 out of 23 prior style transfer papers properly evaluate their models: in fact, a naïve baseline that randomly chooses to either copy its input or retrieve a random sentence written in the target style outperforms prior work on poorly-designed metrics.
Finally, we take a step towards real-world style transfer by collecting a large dataset CDS (Corpus of Diverse Styles) of 15M English sentences spanning 11 diverse styles, including the works of James Joyce, romantic poetry, tweets, and conversational speech. CDS is orders of magnitude larger and more complex than prior benchmarks, which generally focus on transferring between just two styles. We analyze STRAP's abilities on CDS, and will release it as a benchmark for future research. In summary, our contributions are: (1) a simple approach to perform lexically and syntactically diverse paraphrasing with pretrained language models; (2) a simple unsupervised style transfer method that models semantic preservation with our paraphraser and significantly outperforms prior work; (3) a critique of existing style transfer evaluation based on a naïve baseline that performs on par with prior work on poorly designed metrics; (4) a new benchmark dataset that contains 15M sentences from 11 diverse styles.

Style Transfer via Paraphrasing
We loosely define style as common patterns of lexical choice and syntactic constructions that are distinct from the content of a sentence, following prior work (Hovy, 1987; DiMarco and Hirst, 1993; Green and DiMarco, 1993; Kabbara and Cheung, 2016). While we acknowledge this distinction is not universally accepted, this treatment is critical for unlocking several real-world applications of style transfer (as argued in Section 1). Unfortunately, many modern style transfer systems do not respect this definition: a human evaluation (Table 2) shows that fewer than 25% of style-transferred sentences from two state-of-the-art systems (Subramanian et al., 2019; He et al., 2020) on formality transfer were rated as paraphrases of their inputs.
Motivated by this result, we reformulate style transfer as a controlled paraphrase generation task. We call our method STRAP, or Style Transfer via Paraphrasing. STRAP operates within an unsupervised setting: we have raw text from distinct target styles, but no access to parallel sentences paraphrased into different styles. To get around this lack of data, we create pseudo-parallel sentence pairs using a paraphrase model (Section 2.1) trained to maximize output diversity (Section 2.4). Intuitively, this paraphrasing step normalizes the input sentence by stripping away information that is predictive of its original style (Figure 2). The normalization effect allows us to train an inverse paraphrase model specific to the original style, which attempts to generate the original sentence given its normalized version (Section 2.2). Through this process, the model learns to identify and produce salient features of the original style without unduly warping the input semantics.

Creating pseudo-parallel training data
The first stage of our approach involves normalizing input sentences by feeding them through a diverse paraphrase model. Consider a corpus of sentences from multiple styles, where the set of all sentences from style $i$ is denoted by $X^i$. We first generate a paraphrase $z$ for every sentence $x \in X^i$ using a pretrained paraphrase model $f_{\text{para}}$,
$$z = f_{\text{para}}(x), \quad x \in X^i.$$
This process results in a dataset $Z^i$ of normalized sentences and allows us to form a pseudo-parallel corpus $(X^i, Z^i)$ between each original sentence and its paraphrased version. Figure 2 shows that this paraphrasing process has a powerful style normalization effect for our instantiation of $f_{\text{para}}$.

Style transfer via inverse paraphrasing
We use this pseudo-parallel corpus to train a style-specific model that attempts to reconstruct the original sentence $x$ given its paraphrase $z$. Since $f_{\text{para}}$ removes style identifiers from its input, the intuition behind this inverse paraphrase model is that it learns to insert stylistic features through the reconstruction process. Formally, the inverse paraphrase model $f^i_{\text{inv}}$ for style $i$ learns to reconstruct the original corpus $X^i$ using the standard language modeling objective with cross-entropy loss $\mathcal{L}_{\text{CE}}$,
$$\mathcal{L}^i = \sum_{x \in X^i} \mathcal{L}_{\text{CE}}\big(x, f^i_{\text{inv}}(z)\big), \quad z = f_{\text{para}}(x).$$
During inference, given an arbitrary sentence $s$ (in any particular style), we convert it to a sentence $\bar{s}^j$ in target style $j$ using a two-step process of style normalization with $f_{\text{para}}$ followed by stylization with the inverse paraphraser $f^j_{\text{inv}}$, as in
$$\bar{s}^j = f^j_{\text{inv}}(z), \quad z = f_{\text{para}}(s).$$
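The data flow of Sections 2.1 and 2.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the names `build_pseudo_parallel`, `transfer`, `toy_para`, and `toy_inv` are hypothetical, and the toy callables stand in for the fine-tuned GPT-2 paraphraser and inverse paraphrasers, which are not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # any sentence-to-sentence function


def build_pseudo_parallel(
    corpora: Dict[str, List[str]], f_para: Model
) -> Dict[str, List[Tuple[str, str]]]:
    """For each style i, pair every paraphrase z = f_para(x) with its source x."""
    return {
        style: [(f_para(x), x) for x in sentences]
        for style, sentences in corpora.items()
    }


def transfer(s: str, target_style: str, f_para: Model, f_inv: Dict[str, Model]) -> str:
    """Two-step inference: normalize with f_para, then stylize with f_inv[j]."""
    z = f_para(s)
    return f_inv[target_style](z)


# Toy stand-ins (assumptions for illustration, not the real models).
def toy_para(x: str) -> str:
    return x.lower().rstrip("!?.")


toy_inv: Dict[str, Model] = {
    "shout": lambda z: z.upper() + "!",
    "plain": lambda z: z.capitalize() + ".",
}

pairs = build_pseudo_parallel({"shout": ["HELLO THERE!"]}, toy_para)
output = transfer("HELLO THERE!", "plain", toy_para, toy_inv)
```

Each inverse paraphraser would be trained on the `(z, x)` pairs of its own style only, which is why swapping the entry of `f_inv` at test time is enough to change the target style.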

Paraphraser implementation with GPT-2
We fine-tune the large-scale pretrained GPT2-large language model (Radford et al., 2019) to implement both the paraphraser $f_{\text{para}}$ and the inverse paraphrasers $f^i_{\text{inv}}$ for each style. Starting from a pretrained LM improves both output fluency and generalization to small style-specific datasets (Section 5). We use the encoder-free seq2seq modeling approach described in Wolf et al. (2018), where input and output sequences are concatenated together with a separator token. We use Hugging Face's Transformers library (Wolf et al., 2019) to implement our models; see Appendix A.2 for more details about the architecture and hyperparameters.

Promoting diversity by filtering data
The final piece of our approach is how we choose training data for the paraphrase model $f_{\text{para}}$. We find that maximizing lexical and syntactic diversity of the output paraphrases is crucial for effective style normalization (Sections 5 and 6). We promote output diversity by training $f_{\text{para}}$ on an aggressively filtered subset of PARANMT-50M (Wieting and Gimpel, 2018), a large corpus of backtranslated text. Specifically, we apply three filters: (1) removing sentence pairs with more than 50% trigram or unigram overlap to maximize lexical diversity and discourage copying; (2) removing pairs with lower than 50% reordering of shared words, measured by Kendall's tau (Kendall, 1938), to promote syntactic diversity; and (3) removing pairs with low semantic similarity, measured by the SIM model from Wieting et al. (2019). After applying these filters, our training data size shrinks from 50M to 75K sentence pairs, which are used to fine-tune GPT-2; see Appendix A.1 for more details about the filtering process and its effect on corpus size.
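As a rough illustration of the unigram part of filter (1), the sketch below measures the fraction of paraphrase words that also occur in the source and keeps a pair only if that fraction is at most 50%. Whitespace tokenization, lowercasing, and the function names are assumptions made for this example; the exact preprocessing is not specified here.

```python
from collections import Counter


def unigram_overlap(source: str, paraphrase: str) -> float:
    """Fraction of paraphrase tokens that can be matched to a source token."""
    src_counts = Counter(source.lower().split())
    par_tokens = paraphrase.lower().split()
    if not par_tokens:
        return 0.0
    # Multiset intersection, so a repeated word is matched at most as often
    # as it appears in the source.
    shared = sum((src_counts & Counter(par_tokens)).values())
    return shared / len(par_tokens)


def keep_pair(source: str, paraphrase: str, max_overlap: float = 0.5) -> bool:
    """Keep only pairs that are lexically diverse enough."""
    return unigram_overlap(source, paraphrase) <= max_overlap


diverse = keep_pair("the cat sat on the mat", "a feline rested on the rug")
copied = keep_pair("the cat sat on the mat", "the cat sat on the mat")
```

In the first pair only "on" and "the" are shared (2 of 6 paraphrase tokens), so the pair is kept; the exact copy has full overlap and is discarded.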

Evaluating style transfer
Providing a meaningful comparison of our approach to existing style transfer systems is difficult because of (1) poorly-defined automatic and human methods for measuring style transfer quality (Pang, 2019; Mir et al., 2019; Tikhonov et al., 2019), and (2) misleading (or absent) methods of aggregating three individual metrics (transfer accuracy, semantic similarity, and fluency) into a single number. In this section, we describe the flaws in existing metrics and their aggregation (the latter illustrated through a naïve baseline), and we propose a new evaluation methodology to fix these issues.

Current state of style transfer evaluation
We conduct a survey of 23 previously-published style transfer papers (more details in Appendix A.9), which reveals three common properties on which style transfer systems are evaluated. Here, we discuss how prior work implements evaluations for each of these properties and propose improved implementations to address some of their downsides.
Transfer accuracy (ACC): Given an output sentence $\bar{s}^j$ and a target style $j$, a common way of measuring transfer success is to train a classifier to identify the style of a transferred sentence and report its accuracy ACC on generated sentences (i.e., whether $\bar{s}^j$ has a predicted style of $j$). 14 of 23 surveyed papers implement this style classifier with a 1-layer CNN (Kim, 2014).
Semantic similarity (SIM): A style transfer system can achieve high ACC scores without maintaining the semantics of the input sentence, which motivates also measuring how much a transferred sentence deviates in meaning from the input. 15 of 23 surveyed papers use n-gram metrics like BLEU (Papineni et al., 2002) against reference sentences, often along with self-BLEU with the input, to evaluate semantic similarity. Using BLEU in this way has many problems, including (1) unreliable correlations between n-gram overlap and human evaluations of semantic similarity (Callison-Burch et al., 2006), (2) discouraging output diversity (Wieting et al., 2019), and (3) not upweighting important semantic words over other words (Wieting et al., 2019; Wang et al., 2020). These issues motivate us to measure semantic similarity using the subword embedding-based SIM model of Wieting et al. (2019), which performs well on semantic textual similarity (STS) benchmarks in SemEval workshops (Agirre et al., 2016).
Fluency (FL): A system that produces ungrammatical outputs can still achieve high scores on both ACC and SIM, motivating a separate measure for fluency. Only 10 out of 23 surveyed papers performed a fluency evaluation, 9 of which used language model perplexity, which is a poor measure because (1) it is unbounded and (2) unnatural sentences with common words tend to have low perplexity (Mir et al., 2019; Pang, 2019). To tackle this, we replace perplexity with the accuracy of a RoBERTa-large classifier trained on the CoLA corpus (Warstadt et al., 2019), which contains sentences paired with grammatical acceptability judgments. In Table 1, we show that our classifier marks most reference sentences as fluent, confirming its validity.
Human evaluation: As automatic evaluations are insufficient for evaluating text generation (Liu et al., 2016; Novikova et al., 2017), 17 out of 23 surveyed style transfer papers also conduct human evaluation. In our work, we evaluate SIM and FL using human evaluations. As we treat style transfer as a paraphrase generation task, we borrow the three-point scale used previously to evaluate paraphrases (Kok and Brockett, 2010; Iyyer et al., 2018), which jointly captures SIM and FL. Given the original sentence and the transferred sentence, annotators on Amazon Mechanical Turk can choose one of three options: 0 for no paraphrase relationship; 1 for an ungrammatical paraphrase; and 2 for a grammatical paraphrase. A total of 150 sentence pairs were annotated per model, with three annotators per pair. More details on our setup, payment, and agreement are provided in Appendix A.10.

Aggregation of Metrics
So far, we have focused on individual implementations of ACC, SIM, and FL. After computing these metrics, it is useful to aggregate them into a single number to compare the overall style transfer quality across systems (Pang, 2019). However, only 5 out of the 23 papers aggregate these metrics, either at the corpus level (Xu et al., 2018; Pang and Gimpel, 2019) or sentence level (Li et al., 2018). Even worse, the corpus-level aggregation scheme can be easily gamed. Here, we describe a naïve system that outperforms state-of-the-art style transfer systems when evaluated using corpus-level aggregation, and we present a new sentence-level aggregation metric that fixes the issue.
The issue with corpus-level aggregation: Aggregating ACC, SIM, and FL is inherently difficult because they are inversely correlated with each other (Pang, 2019). Prior work has combined these three scores into a single number using geometric averaging (Xu et al., 2018) or learned weights (Pang and Gimpel, 2019). However, the aggregation is computed after averaging each metric independently across the test set (corpus-level aggregation), which is problematic since systems might generate sentences that optimize only a subset of metrics. For example, a Shakespeare style transfer system could output Wherefore art thou Romeo? regardless of its input and score high on ACC and FL, while a model that always copies its input would score well on SIM and FL (Pang, 2019).
A Naïve Style Transfer System: To concretely illustrate the problem, we design a naïve baseline that exactly copies its input with probability p and chooses a random sentence from the target style corpus for the remaining inputs, where p is tuned on the validation set (p = 0.4 / 0.5 for the Formality / Shakespeare datasets). When evaluated using geometric mean corpus-level aggregation (GM column of Table 1), this system outperforms state-of-the-art methods (UNMT, DLSM) on the Formality dataset despite not doing any style transfer at all!
Proposed Metric: A good style transfer system should jointly optimize all metrics. The strong performance of the naïve baseline with corpus-level aggregation indicates that metrics should be combined at the sentence level before averaging them across the test set (sentence aggregation). Unfortunately, only 3 out of 23 surveyed papers measure absolute performance after sentence-level aggregation, and all of them use the setup of Li et al. (2018), which is specific to human evaluation with Likert scales. We propose a more general alternative,
$$J(\text{ACC}, \text{SIM}, \text{FL}) = \sum_{x \in X} \frac{\text{ACC}(x) \cdot \text{SIM}(x) \cdot \text{FL}(x)}{|X|},$$
where $x$ is a sentence from a test corpus $X$. We treat ACC and FL at a sentence level as a binary judgement, ensuring incorrectly classified or disfluent sentences are automatically assigned a score of 0. As a sanity check, our naïve system performs extremely poorly on this new metric (Table 1), as input copying will almost always yield an ACC of zero, while random retrieval results in low SIM.
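The gap between the two aggregation schemes is easy to see on made-up numbers. The sketch below assumes per-sentence (ACC, SIM, FL) triples are already available; the values assigned to copied and retrieved outputs are illustrative placeholders rather than measurements, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Score = Tuple[float, float, float]  # (ACC, SIM, FL) for one output sentence


def naive_transfer(x: str, target_corpus: Sequence[str], p: float,
                   rng: random.Random) -> str:
    """Copy the input with probability p, else retrieve a random target sentence."""
    if rng.random() < p:
        return x
    return rng.choice(target_corpus)


def corpus_level_gm(scores: List[Score]) -> float:
    """Average each metric over the corpus first, then take the geometric mean."""
    n = len(scores)
    acc = sum(s[0] for s in scores) / n
    sim = sum(s[1] for s in scores) / n
    fl = sum(s[2] for s in scores) / n
    return (acc * sim * fl) ** (1.0 / 3.0)


def sentence_level_j(scores: List[Score]) -> float:
    """Multiply the metrics per sentence first, then average over the corpus."""
    return sum(acc * sim * fl for acc, sim, fl in scores) / len(scores)


# Illustrative scores: a copy keeps the meaning but misses the target style,
# while a retrieved sentence is in the target style but unrelated to the input.
copied: Score = (0.0, 1.0, 1.0)
retrieved: Score = (1.0, 0.1, 1.0)
naive_scores = [copied] * 50 + [retrieved] * 50

gm = corpus_level_gm(naive_scores)  # roughly 0.65
j = sentence_level_j(naive_scores)  # roughly 0.05
```

Under corpus-level aggregation the copies supply SIM and the retrievals supply ACC, so the averages look healthy even though no single output is good on both. Multiplying per sentence removes that loophole.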

Experiments & Results
We evaluate our method (STRAP) on two existing style transfer datasets, using the evaluation methodology proposed in Section 3. Our system significantly outperforms state of the art methods and the naïve baseline discussed in Section 3.2.

Datasets
We focus exclusively on semantics-preserving style transfer tasks, which means that we do not evaluate on attribute transfer datasets such as sentiment, gender, and political transfer. Specifically, we use two standard benchmark datasets for Shakespeare author imitation and formality transfer to compare STRAP against prior work. While both datasets contain parallel data, we only use it to automatically evaluate our model outputs; for training, we follow prior work by using the non-parallel train-validation-test splits from He et al. (2020). The Shakespeare author imitation dataset (Xu et al., 2012) contains 37k training sentences from two styles: William Shakespeare's original plays and their modernized versions. Shakespeare's plays are written in Early Modern English, which has a significantly different lexical (e.g., thou instead of you) and syntactic distribution compared to modern English. Our second dataset is Formality transfer (Rao and Tetreault, 2018), which contains 105k sentences, also from two styles. Sentences are written either in formal or informal modern English. Unlike formal sentences, informal sentences tend to have more misspellings, short forms (u instead of you), and non-standard usage of punctuation.

Comparisons against prior work
We compare STRAP on the Shakespeare / Formality datasets against the following baselines:
• COPY: a lower bound that simply copies its input, which has been used in prior work (Subramanian et al., 2019; Pang, 2019)
• NAÏVE: our method from Section 3.2 that randomly either copies its input or retrieves a sentence from the target style

Ablation studies
In this section, we perform several ablations on STRAP to understand which of its components contribute most to its improvements over baselines.
Overall, these ablations validate the importance of both paraphrasing and pretraining for style transfer.
Paraphrase diversity improves ACC: How critical is diversity in the paraphrase generation step? While our implementation of $f_{\text{para}}$ is trained on data that is heavily filtered to promote diversity, we also build a non-diverse paraphrase model by removing this diversity filtering of PARANMT-50M but keeping all other experimental settings identical. In Table 3, the -Div. PP rows show a drop in ACC across both datasets as well as higher SIM, which in both cases results in a lower J(·) score. A qualitative inspection reveals that the decreased ACC and increased SIM are both due to a greater degree of input copying, which motivates the importance of diversity.

Paraphrasing during inference improves ACC:
The diverse paraphraser $f_{\text{para}}$ is obviously crucial to train our model, as it creates pseudo-parallel data for training $f^i_{\text{inv}}$, but is it necessary during inference? We try directly feeding in the original sentence (without the initial paraphrasing step) to the inverse paraphrase model $f^i_{\text{inv}}$ during inference, shown in the -Inf. PP row of Table 3. While SIM and FL are largely unaffected, there is a large drop in ACC, bringing down the overall score (45.5 to 20.7 in Formality, 34.7 to 23.3 in Shakespeare). This supports our hypothesis that the paraphrasing step is useful for normalizing the input.
LM pretraining is crucial for SIM and FL: As we mainly observe improvements on FL and SIM compared to prior work, a natural question is how well STRAP performs without large-scale LM pretraining. We run an ablation study by replacing the GPT-2 implementations of $f_{\text{para}}$ and $f^i_{\text{inv}}$ with LSTM seq2seq models, which are trained with global attention (Luong et al., 2015) using OpenNMT (Klein et al., 2017) with mostly default hyperparameters. As seen in the -GPT2 row of Table 3, this model performs competitively with the UNMT / DLSM models on J(ACC,SIM,FL), which obtain 20.0 / 18.6 on Formality (Table 1), respectively. However, it is significantly worse than STRAP, with large drops in SIM and FL. This result shows the merit of both our algorithm and the boost that LM pretraining provides.
Nucleus sampling trades off ACC for SIM: While our best performing system uses a greedy decoding strategy, we experiment with nucleus sampling (Holtzman et al., 2020) by varying the nucleus p value in both Table 1 and Table 2. As expected, higher p improves diversity and trades off increased ACC for lowered SIM. We find that p = 0.6 is similar to greedy decoding on J(·) metrics, but higher p values degrade performance.
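For readers unfamiliar with nucleus sampling, the following sketch shows the truncation step on a single next-token distribution given as a plain list of probabilities. The function names and the toy distribution are assumptions for illustration; a real decoder would apply the same idea to model logits at every generation step.

```python
import random
from typing import Dict, List


def nucleus_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of most likely tokens whose mass reaches p, renormalized."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs: List[float], p: float, rng: random.Random) -> int:
    """Sample one token index from the truncated, renormalized distribution."""
    dist = nucleus_filter(probs, p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


next_token_probs = [0.5, 0.3, 0.15, 0.05]   # toy distribution over 4 tokens
narrow = nucleus_filter(next_token_probs, 0.1)  # only the top token survives
medium = nucleus_filter(next_token_probs, 0.6)  # the top two tokens survive
wide = nucleus_filter(next_token_probs, 0.9)    # the top three tokens survive
```

With a very small p only the most likely token remains, which coincides with greedy decoding; raising p admits lower-probability tokens and therefore more diverse (and potentially less faithful) outputs.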

Towards Real-World Style Transfer
All of our experiments and ablations thus far have been on the Shakespeare and Formality datasets, which contain just two styles each. To explore the ability of our system to perform style transfer between many diverse styles, we create the Corpus of Diverse Styles (CDS), a new non-parallel style transfer benchmark dataset with 11 diverse styles (15M sentences), and use it to evaluate STRAP.

Corpus of Diverse Styles:
To create CDS, we obtain data (Table 5) from existing academic research datasets (Godfrey et al., 1992; Blodgett et al., 2016) and public APIs or online collections like Project Gutenberg (Hart, 1992). We choose styles that are easy for human readers to identify at a sentence level (e.g., Tweets or Biblical text), and the left side of Figure 2 confirms that machines also cluster CDS into eleven distinct styles. While prior benchmarks involve a transfer between two styles, CDS has 110 potential transfer directions.
Table 6: A controlled comparison between models on 2 styles from CDS using automatic evaluation. ACC is calculated using our 11-way CDS classifier and SIM is with input. STRAP greatly outperforms prior work.
ing; qualitatively, the diverse model exhibits more lexical swaps and syntactic diversity.
Style Transfer on CDS: We measure STRAP's performance on CDS using Section 3's evaluation methodology. We sample 1K sentences from each style and use STRAP to transfer these sentences to each of the 10 other styles. Despite having to deal with many more styles than before, our system achieves 48.4% transfer accuracy (on an 11-way RoBERTa-large classifier), a paraphrase similarity score of 63.5, and 71.1% fluent generations, yielding a J(ACC,SIM,FL) score of 20.7. A breakdown of style-specific performance is provided in Appendix A.8. An error analysis shows that the classifier misclassifies some generations as styles sharing properties with the target style (Figure 3).
Common failures of our system (input → paraphrase → output):
Shak. → Bible: Have you importuned him by any means? → did you ever try to import him? → hast thou ever tried to import him?
Misunderstanding the word "importune": the model believes it refers to import rather than harass / bother.
1990. → Tweet: The machine itself is made of little straws of carbon. → the machine is made of straw. → Machine made of straw.
Dropping of important semantic words during diverse paraphrasing ("carbon") significantly warps the meaning of sentences.
Swit. → Shak.: well they offer classes out at uh Ray Hubbard → they're offering a course at Ray Hubbard's. → They do offer a course at the house of the Dukedom.
Hallucination of tokens irrelevant to the input ("house of the dukedom") to better reflect style distribution.
Subtle modifications in semantics since the models fail to understand their inputs.

Controlled comparisons:
To ground our CDS results in prior work, we compare STRAP with baselines from Section 4.2. We sample an equal number of training sentences from two challenging styles in CDS (Shakespeare, English Tweets) and train all three models (UNMT, DLSM, STRAP) on this subset of CDS. We could not find an easy way to perform 11-way style transfer in the baseline models without significantly modifying their codebase / model, due to the complex probabilistic formulation beyond 2 styles and separate modeling for each of the 110 directions. As seen in Table 6, STRAP greatly outperforms prior work, especially in SIM and FL. Qualitative inspection shows that baseline models often output arbitrary style-specific features, completely ignoring input semantics (explaining poor SIM but high ACC).
Qualitative Examples: Table 4 contains several outputs from STRAP; see Appendix A.11 for more examples. We also add more qualitative analysis of the common failures of our system.
2018), but this method can also warp semantics as seen in Subramanian et al. (2019); as such, we only use it to build our paraphraser's training data after heavy filtering. Our work relates to recent efforts that use Transformers in style transfer (Sudhakar et al., 2019; Dai et al., 2019). Closely related to our work is Gröndahl and Asokan (2019), who over-generate paraphrases using a complex handcrafted pipeline and filter them using proximity to a target style corpus. Instead, we automatically learn style-specific paraphrasers and do not need over-generation at inference. Relatedly, Preotiuc-Pietro et al. (2016) present qualitative style transfer results with statistical MT paraphrasers. Other, less closely related work on control and diversity in text generation is discussed in Appendix A.12.

Conclusion
In this work we model style transfer as a controlled paraphrase generation task and present a simple unsupervised style transfer method using diverse paraphrasing. We critique current style transfer evaluation using a survey of 23 papers and propose fixes to common shortcomings. Finally, we collect a new dataset containing 15M sentences from 11 diverse styles. Possible future work includes (1) exploring other applications of diverse paraphrasing, such as data augmentation; (2) performing style transfer at a paragraph level; (3) performing style transfer for styles unseen during training, using few exemplars provided during inference.
We train our paraphrase model in a seq2seq fashion using the PARANMT-50M corpus (Wieting and Gimpel, 2018), which was constructed by backtranslating (Sennrich et al., 2016) the Czech side of the CzEng parallel corpus (Bojar et al., 2016). This corpus is large and noisy, and we aggressively filter it to encourage content preservation and diversity maximization. We apply the following filters:
Content Filtering: We remove all sentence pairs that score lower than 0.5 on a strong paraphrase similarity model, the SIM model from Wieting et al. (2019), which achieves strong performance on the SemEval semantic textual similarity (STS) benchmarks (Agirre et al., 2016). We also filter sentence pairs by length, allowing a maximum length difference of 5 words between paired sentences. Finally, we remove very short and long sentences by only keeping sentence pairs with an average token length between 7 and 25.
Lexical Diversity Filtering: We only preserve backtranslated pairs with sufficient unigram distribution difference. We filter all pairs where more than 50% of words in the backtranslated sentence can be found in the source sentence. This is computed using the SQuAD evaluation scripts (Rajpurkar et al., 2016). Additionally, we remove sentences with more than 70% trigram overlap.
Syntactic Diversity Filtering: We discard all paraphrases that have a similar word ordering. We compare the relative ordering of the words shared between the input and backtranslated sentence by measuring the Kendall tau distance (Kendall, 1938), also known as the "bubble-sort" distance. We keep all backtranslated pairs that are at least 50% shuffled, where an identical ordering of words is 0% shuffled and a reverse ordering is 100% shuffled.
LangID Filtering: Finally, we discard all sentences where both the input and backtranslated sentence are classified as non-English using langdetect, the Python port of Nakatani (2010), https://github.com/Mimino666/langdetect.
Effect of each filter: We adopt a pipelined approach to filtering. The PARANMT-50M corpus size after each stage of filtering is shown in Table 8.
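The "percent shuffled" measure used for syntactic diversity filtering can be sketched as a normalized Kendall tau distance over shared words. The handling of repeated words and of pairs with fewer than two shared words is not specified in the text, so the choices below (ignore repeated words, return 0.0 when no comparison is possible) are assumptions made for this example, as is the function name.

```python
from itertools import combinations


def shuffle_fraction(source: str, paraphrase: str) -> float:
    """Fraction of shared-word pairs whose relative order differs (0.0 to 1.0)."""
    src = source.lower().split()
    par = paraphrase.lower().split()
    # Assumption: compare only words that occur exactly once in each sentence,
    # so that every shared word has a single unambiguous position.
    shared = [w for w in par if src.count(w) == 1 and par.count(w) == 1]
    if len(shared) < 2:
        return 0.0  # assumption: nothing to compare
    # Source positions, listed in paraphrase order.
    ranks = [src.index(w) for w in shared]
    pairs = list(combinations(ranks, 2))
    discordant = sum(1 for first, second in pairs if first > second)
    return discordant / len(pairs)


same_order = shuffle_fraction("a b c d", "a b c d")   # identical ordering
reversed_order = shuffle_fraction("a b c d", "d c b a")  # reverse ordering
one_swap = shuffle_fraction("a b c d", "b a c d")     # one adjacent swap
```

The count of discordant pairs equals the number of adjacent swaps bubble sort would need to restore the source order, which is why the measure is also called the bubble-sort distance. A pair would pass the filter when this fraction is at least 0.5.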

A.2 Generative Model Details
This section provides details of the seq2seq model used for both the paraphrase model and the style-specific inverse paraphrase models. Recent work (Radford et al., 2019) has shown that GPT2, a massive transformer trained on a large corpus of unlabeled text using the language modeling objective, is very effective at generating human-like text. We leverage the publicly available GPT2-large checkpoint by fine-tuning it on our custom datasets with a small learning rate. However, GPT2 is an unconditional language model with only a decoder network, whereas traditional seq2seq setups use separate encoder and decoder networks (Sutskever et al., 2014) with attention (Bahdanau et al., 2014). To avoid training an encoder network from scratch, we use the encoder-free seq2seq modeling approach described in Wolf et al. (2018), where both input and output sequences are fed to the decoder network, separated by a special token and distinguished by separate segment embeddings. Our model is implemented using the transformers library (Wolf et al., 2019).
Architecture: Let $x = (x_1, ..., x_n)$ represent the tokens in the input sequence and let $y = (y_{\text{bos}}, y_1, ..., y_m, y_{\text{eos}})$ represent the tokens of the output sequence, where $y_{\text{bos}}$ and $y_{\text{eos}}$ correspond to special beginning-of-sentence and end-of-sentence tokens. We feed the sequence $(x_1, ..., x_n, y_{\text{bos}}, y_1, ..., y_m)$ as input to GPT2 and train it on the next-word prediction objective for the tokens $y_1, ..., y_m, y_{\text{eos}}$ using the cross-entropy loss. During inference, the sequence $(x_1, ..., x_n, y_{\text{bos}})$ is fed as input and the tokens are generated in an autoregressive manner (Vaswani et al., 2017) until $y_{\text{eos}}$ is generated.
Every token in x and y is passed through a shared input embedding layer to obtain a vector representation of every token. To encode positional and segment information, learnable positional and segment embeddings are added to the input embedding consistent with the GPT2 architecture. Segment embeddings are used to denote whether a token belongs to sequence x or y.
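To make the input layout concrete, the sketch below assembles one training example from already-tokenized sequences. Tokens are plain strings for readability, `None` marks positions excluded from the loss, and the names `build_training_example`, `build_inference_prompt`, `<bos>`, and `<eos>` are placeholders chosen for this illustration rather than identifiers from the actual implementation.

```python
from typing import List, Optional, Tuple

BOS, EOS = "<bos>", "<eos>"  # placeholder special tokens (assumed names)


def build_training_example(
    x_tokens: List[str], y_tokens: List[str]
) -> Tuple[List[str], List[int], List[Optional[str]]]:
    """Return (inputs, segment ids, next-token targets) for one (x, y) pair."""
    # Model input: x_1..x_n, y_bos, y_1..y_m
    inputs = x_tokens + [BOS] + y_tokens
    # Segment 0 marks tokens of x; segment 1 marks tokens of y (including y_bos).
    segments = [0] * len(x_tokens) + [1] * (1 + len(y_tokens))
    # The loss covers only y_1..y_m, y_eos; positions inside x have no target.
    targets: List[Optional[str]] = [None] * len(x_tokens) + y_tokens + [EOS]
    return inputs, segments, targets


def build_inference_prompt(x_tokens: List[str]) -> Tuple[List[str], List[int]]:
    """At inference only x_1..x_n, y_bos is fed; generation continues until y_eos."""
    return x_tokens + [BOS], [0] * len(x_tokens) + [1]


inputs, segments, targets = build_training_example(["how", "are", "you"], ["how", "art", "thou"])
prompt, prompt_segments = build_inference_prompt(["how", "are", "you"])
```

The three lists are aligned position by position: the target at the `<bos>` position is the first output token, and the target at the last output token is `<eos>`.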
Other seq2seq alternatives: Note that our unsupervised style transfer algorithm is agnostic to the specific choice of seq2seq modeling. We wanted to perform transfer learning from massive left-to-right language models like GPT2, and found the encoder-free seq2seq approach simple and effective. Future work includes finetuning more recent models like T5 (Raffel et al., 2019) or BART (Lewis et al., 2019). These models use the standard seq2seq setup of separate encoder / decoder networks and pretrain them jointly using denoising autoencoding objectives based on language modeling.
Hyperparameter Details: We fine-tune GPT2-large using NVIDIA TESLA M40 GPUs for 2 epochs, using early stopping based on validation set perplexity. The models are fine-tuned using a small learning rate of 5e-5 and converge to a good solution fairly quickly, as observed in recent work (Li et al., 2020; Kaplan et al., 2020). Specifically, each experiment completed within a day of training on a single GPU, and many experiments with small datasets took much less time. We use a minibatch size of 10 sentence pairs and truncate sequences that are longer than 50 subwords in the input or output space. We use the Adam optimizer (Kingma and Ba, 2015) with the weight decay fix and a linear learning rate decay schedule, as implemented in the transformers library. Finally, we left-pad the input sequence to get a total input length of 50 subwords and right-pad the output sequence to get a total output length of 50 subwords. This special batching is necessary to use minibatches during inference time. Special symbols are used to pad the sequences and they are not considered in the cross-entropy loss. Our model has 774M trainable parameters, identical to the original GPT2-large.
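The padding scheme can be illustrated on integer token ids. In this sketch the pad id, the choice to truncate by keeping the first tokens, and the function name are assumptions; only the left-pad / right-pad layout and the exclusion of padding from the loss come from the description above.

```python
from typing import List, Tuple


def pad_pair(
    x_ids: List[int], y_ids: List[int], max_len: int = 50, pad_id: int = -1
) -> Tuple[List[int], List[int], List[int]]:
    """Left-pad the input, right-pad the output, and mask padding out of the loss."""
    x_ids = x_ids[:max_len]  # assumption: truncation keeps the first max_len ids
    y_ids = y_ids[:max_len]
    padded_x = [pad_id] * (max_len - len(x_ids)) + x_ids
    padded_y = y_ids + [pad_id] * (max_len - len(y_ids))
    # 1 where an output position contributes to the cross-entropy loss, else 0.
    loss_mask = [1] * len(y_ids) + [0] * (max_len - len(y_ids))
    return padded_x, padded_y, loss_mask


padded_x, padded_y, loss_mask = pad_pair([11, 12, 13], [21, 22], max_len=5, pad_id=0)
```

A plausible reading of why this layout helps at inference time is that left-padding places the last input token of every sequence in the batch at the same position, so generation can start from a common offset for all batch elements.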

A.3 Classifier Model Details
We fine-tune RoBERTa-large to build our classifier, using the official implementation in fairseq. We use a learning rate of 1e-5 for all experiments with a minibatch size of 32. All models were trained on a single NVIDIA RTX 2080ti GPU, with gradient accumulation to allow larger batch sizes. We train models for 10 epochs and use early stopping on the validation split accuracy. We use the Adam optimizer (Kingma and Ba, 2015) with modifications suggested in the RoBERTa paper (Liu et al., 2019). Consistent with the suggested hyperparameters, we use a learning rate warm-up for the first 6% of the updates and then decay the learning rate.
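The warm-up schedule can be sketched as below. We only state that the learning rate is decayed after warm-up; the linear decay to zero used here is an assumption, and `lr_at` is a hypothetical helper rather than a fairseq function.

```python
def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.06):
    """Linear warm-up over the first `warmup_frac` of updates,
    followed by an (assumed) linear decay to zero."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Toy schedule over 1000 updates: the peak is reached at update 60.
schedule = [lr_at(s, total_steps=1000) for s in range(1001)]
```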

A.4 OpenNMT Model Details
We train LSTM-based sequence-to-sequence models with attention using the PyTorch port of OpenNMT (Klein et al., 2017). We mostly used the default hyperparameter settings of OpenNMT-py. The only hyperparameter we modified was the learning rate schedule, since our datasets were small and overfit quickly. For the paraphrase model, we started decay after 11000 steps and halved the learning rate every 1000 steps. For Shakespeare, we started the decay after 3000 steps and halved the learning rate every 500 steps. For Formality, we started the decay after 6000 steps and halved the learning rate every 1000 steps. These modifications only slightly improved validation perplexity (by 3-4 points in each case).
We used early stopping on validation perplexity and checkpointed the model every 500 optimization steps. The other hyperparameters are the default OpenNMT-py settings: SGD optimization with learning rate 1.0, and an LSTM seq2seq model with global attention (Luong et al., 2015), 500 hidden units and embedding dimensions, and 2 layers each in the encoder and decoder.
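The three learning rate schedules above can be written as a single step function. This is a sketch, not OpenNMT-py code: `step_decay_lr` is a hypothetical helper, and it assumes the first halving happens exactly at the start-of-decay step, which may differ from the library's own convention.

```python
def step_decay_lr(step, start_decay, decay_every, base_lr=1.0, factor=0.5):
    """Constant learning rate until `start_decay`, then multiplied by
    `factor` at `start_decay` and every `decay_every` steps after that."""
    if step < start_decay:
        return base_lr
    n_decays = (step - start_decay) // decay_every + 1
    return base_lr * factor ** n_decays

# (start_decay, decay_every) pairs taken from the text above.
SCHEDULES = {
    "paraphrase": (11000, 1000),
    "shakespeare": (3000, 500),
    "formality": (6000, 1000),
}

lr_paraphrase = step_decay_lr(12000, *SCHEDULES["paraphrase"])
```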

A.5 More Comparisons with Prior Work
Please refer to Table 12 for an equivalent of Table 1 using BLEU scores.
We present more comparisons with prior work in Table 13. We use the generated outputs for the Formality test set available in the public repository of Luo et al. (2019) (including outputs from the algorithms described in Prabhumoye et al., 2018 and Li et al., 2018) and run them through our evaluation pipeline. We compare the results with our formality transfer model used in Table 1 and Table 2. We note significant performance improvements, especially in the fluency of the generated text. Note that there is a domain shift for our model, since we trained our model using the splits of He et al. (2020).
The only datasets used in Dai et al. (2019) were sentiment transfer benchmarks, which modify semantic properties of the sentence. We attempted to train the models in Dai et al. (2019) using their codebase on the Shakespeare dataset, but faced three major issues: (1) the number of epochs / iterations is missing. The early stopping criterion is not implemented or specified, and metrics were being computed on the test set every 25 training iterations, which is invalid practice for choosing the optimal checkpoint; (2) the codebase is specific to the Yelp sentiment transfer dataset in terms of maximum sequence length and evaluation, making it non-trivial to use for any other dataset; (3) despite our best efforts, we could not get the model to converge to a good minimum which would produce fluent text (besides word-by-word copying) when trained on the Shakespeare dataset.
Similarly, the datasets used in Sudhakar et al. (2019) modify semantic properties (sentiment, political slant etc.). On running their codebase on the Shakespeare dataset using the default hyperparameters, we achieved a poor performance of 53.1% ACC, 55.2 SIM and 56.5% FL, aggregating to a J(A,S,F) score of 18.4. Similarly, on the Formality dataset, performance was poor with 41.7% ACC, 67.8 SIM and 67.7% FL, aggregating to a J(A,S,F) score of 18.1. A qualitative inspection showed very little abstraction and nearly word-by-word copying from the input (due to the delete & generate nature of the approach), which explains the higher SIM score but lower ACC score (just like the COPY baseline in Table 1). Fluency was low despite GPT pretraining, perhaps due to the token deletion step in the algorithm.

A.6 Details of our Dataset, CDS
We provide details of our sources, the sizes of individual style corpora and representative examples from each style of our new benchmark dataset CDS in Table 14. We individually preprocessed each corpus to remove very short and long sentences, boilerplate text (common in Project Gutenberg articles) and section headings. More representative examples (along with our entire dataset) will be provided on the project page http://style.cs.umass.edu.
Style Similarity: In Figure 4 we plot the cosine similarity between styles using the averaged [CLS] vector of the trained RoBERTa-large classifier (inference over validation set). The off-diagonal elements show intuitive domain similarities, such as (Lyrics, Poetry); (AAE, Tweets); (Joyce, Shakespeare) or among classes from the Corpus of Historical American English.
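The style similarity computation can be sketched as follows, assuming the [CLS] vectors have already been extracted for each style. The two-dimensional toy vectors and the helper names are hypothetical and only illustrate the averaging and cosine steps.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def style_similarity_matrix(cls_by_style):
    """Return sorted style names and the matrix of pairwise cosine
    similarities between the averaged [CLS] vector of each style."""
    names = sorted(cls_by_style)
    centroids = {name: mean_vector(cls_by_style[name]) for name in names}
    matrix = [[cosine(centroids[a], centroids[b]) for b in names] for a in names]
    return names, matrix

# Toy stand-ins for per-sentence [CLS] vectors on the validation set.
toy_cls = {
    "lyrics": [[1.0, 0.0], [0.8, 0.2]],
    "poetry": [[0.9, 0.1], [0.7, 0.3]],
    "tweets": [[0.0, 1.0], [0.2, 0.8]],
}
names, matrix = style_similarity_matrix(toy_cls)
```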

A.7 Diverse Paraphrasing on CDS
We compare the quality and diversity of the paraphrases generated by our diverse and non-diverse paraphrasers on our dataset CDS in Table 16. Note that this is the pseudo-parallel training data for the inverse paraphrase model (described in Section 2.1 and Section 2.4) and not the actual style transferred sentences. Overall, the diverse paraphraser achieves high diversity, with 51% unigram change and 27% word shuffling, compared to 28% unigram change and 6% shuffling for the non-diverse paraphraser, while maintaining good semantic similarity (SIM = 72.5 vs 83.9 for non-diverse) even in complex stylistic settings.

Note that Fluency scores on this dataset could be misleading since even the original sentences from some styles are often classified as disfluent (Orig. FL). Qualitatively, this seems to happen for styles with rich lexical and syntactic diversity (like Romantic Poetry, James Joyce). These styles tend to be out-of-distribution for the fluency classifier trained on the CoLA dataset (Warstadt et al., 2019).

A.9 A Survey of Evaluation Methods
We present a detailed breakdown of evaluation metrics used in prior work in Table 10 and the implementations of the metrics in Table 11. Notably, only 3 out of 23 prior works use an absolute sentence-level aggregation evaluation. Other works either perform "overall A/B" testing, use flawed corpus-level aggregation or do not perform any aggregation at all. Note that while "overall A/B" testing cannot be gamed like corpus-level aggregation, it has a few issues: (1) it is a relative evaluation and does not provide an absolute performance score for future reference; (2) "A/B" testing requires human evaluation, which is expensive and noisy; (3) evaluating overall performance will require human annotators to be familiar with the styles and style transfer task setup; (4) Kahneman (2011) has shown that asking humans to give a single number for "overall score" is biased when compared to an aggregation of independent scores on different metrics. Luckily, the sentence-level aggregation in Li et al. (2018) does the latter and is the closest equivalent to our proposed J(·) metric.
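The difference between the two aggregation schemes can be illustrated with a toy example. The sketch assumes that sentence-level aggregation averages the per-sentence product of scores and that corpus-level aggregation multiplies corpus averages; the scores and function names are hypothetical. The toy system copies half of its inputs (perfect similarity, wrong style) and emits unrelated but stylistically correct text for the other half.

```python
def mean(values):
    return sum(values) / len(values)

def corpus_level(acc, sim, fl):
    """Aggregate after averaging each metric over the corpus."""
    return mean(acc) * mean(sim) * mean(fl)

def sentence_level(acc, sim, fl):
    """Aggregate per sentence first, then average over the corpus."""
    return mean([a * s * f for a, s, f in zip(acc, sim, fl)])

# Per-sentence scores in [0, 1] for four outputs of the toy system.
acc = [0.0, 0.0, 1.0, 1.0]   # style classifier correct?
sim = [1.0, 1.0, 0.0, 0.0]   # semantic similarity to the input
fl = [1.0, 1.0, 1.0, 1.0]    # fluency

corpus_score = corpus_level(acc, sim, fl)
sentence_score = sentence_level(acc, sim, fl)
```

No single output is both in the target style and faithful to its input, so the sentence-level score is zero, while the corpus-level score still rewards the system.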

A.10 Details on Human Evaluation
We conduct experiments on Amazon Mechanical Turk, annotating the paraphrase similarity of 150 sentences with 3 annotators each. We report the label chosen by two or more annotators, and collect additional annotations in the case of total disagreement. We pay workers 5 cents per sentence pair ($10-15 / hr). We only hire workers from USA, UK and Australia with a 95% or higher approval rating and at least 1000 approved HITs. Sentences where the input was exactly copied (after lower-casing and removing punctuation) are automatically assigned option 2 (paraphrase and grammatical). Even though these sentences are clearly not style transferred, we expect them to be penalized in J(ACC,SIM,FL) by poor ACC. We found that every experiment had a Fleiss kappa (Fleiss, 1971) of at least 0.13 and up to 0.45 (slight to moderate agreement according to Landis and Koch, 1977). A qualitative inspection showed that crowdworkers found it easier to judge sentence pairs in the Formality dataset than Shakespeare, presumably due to greater familiarity with modern English. We also note that crowdworkers had higher agreement for sentences which were clearly not paraphrases (like the UNMT / DLSM generations on the Formality dataset).
Table 2: To calculate SIM, we count the percentage of sentences to which humans assigned label 1 (ungrammatical paraphrase) or 2 (grammatical paraphrase). This is used as a binary value to calculate J(ACC, SIM). To calculate J(ACC, SIM, FL), we count sentences which are correctly classified and to which humans assigned label 2 (grammatical paraphrase). We cannot calculate FL alone using the popular 3-way evaluation, since the fluent sentences which are not paraphrases are not recorded.
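The label aggregation and the Table 2 calculations can be sketched as below. The sketch assumes label 0 denotes "not a paraphrase" (only labels 1 and 2 are named above); the function names and toy annotations are hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Label chosen by two or more annotators, or None on total
    disagreement (where additional annotations would be collected)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

def human_scores(examples):
    """examples: list of (classifier_correct, votes) pairs.

    Returns (SIM, J(ACC, SIM), J(ACC, SIM, FL)) as fractions."""
    labels = [majority_label(votes) for _, votes in examples]
    if any(label is None for label in labels):
        raise ValueError("unresolved disagreement: collect more annotations")
    n = len(examples)
    correct = [is_correct for is_correct, _ in examples]
    sim = sum(label in (1, 2) for label in labels) / n
    j_acc_sim = sum(c and label in (1, 2) for c, label in zip(correct, labels)) / n
    j_acc_sim_fl = sum(c and label == 2 for c, label in zip(correct, labels)) / n
    return sim, j_acc_sim, j_acc_sim_fl

toy = [
    (True, [2, 2, 1]),    # correct style, grammatical paraphrase
    (True, [1, 1, 0]),    # correct style, ungrammatical paraphrase
    (False, [2, 2, 2]),   # wrong style, grammatical paraphrase
    (True, [0, 0, 1]),    # correct style, not a paraphrase
]
sim, j_acc_sim, j_acc_sim_fl = human_scores(toy)
```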

A.11 More Example Generations
More examples are provided in Table 9. All of our style transferred outputs on CDS will be available on the project page of this work, http://style.cs.umass.edu.

A.12 More Related Work
Our inverse paraphrase model is a style-controlled text generator which automatically learns lexical and syntactic properties prevalent in the style's corpus. Explicit syntactically-controlled text generation has been studied previously using labels such as constituency parse templates (Iyyer et al., 2018;Akoury et al., 2019) or learned discrete latent templates (Wiseman et al., 2018). Syntax can also be controlled using an exemplar sentence (Chen et al., 2019;Guu et al., 2018;Peng et al., 2019). While style transfer requires the underlying content to be provided as input, another direction explores attribute-controlled unconditional text generation (Dathathri et al., 2020;Keskar et al., 2019;Zeng et al., 2020;Ziegler et al., 2019).
Diversity in text generation is often encouraged during inference time via heuristic modifications to beam search (Li et al., 2016;Vijayakumar et al., 2018), nucleus sampling (Holtzman et al., 2020) or submodular optimization (Kumar et al., 2019); in contrast, we simply filter our training data to increase diversity. Other algorithms learn to condition generation on latent variables during training (Bowman et al., 2016), which are sampled from at inference time to encourage diversity (Jain et al., 2017;Gupta et al., 2018;Park et al., 2019). Relatedly, Goyal and Durrett (2020) promote syntactic diversity of paraphrases by conditioning over possible syntactic rearrangements of the input.
Table 9: More example outputs from our model STRAP trained on our dataset CDS. Our project page will provide all 110k style transferred outputs generated by STRAP on CDS.
Hence, we have retrained our RoBERTa-large classifiers to reflect the new distribution.
*Note: While our system significantly outperforms prior work, we re-use the formality system used in Table 1 and Table 2 for these results, which was trained on Entertainment & Music (consistent with He et al. (2020)). There could be a training dataset mismatch between our model and the models from Luo et al. (2019), since the Formality dataset has two domains. This is not clarified in Luo et al. (2019) to the best of our knowledge.
Table 16: A detailed style-wise breakdown of the diverse paraphrase quality in CDS (the training data for the inverse paraphrase model, described in Section 2.1 and Section 2.4). The ideal paraphraser should score low on "Lexical" and "Syntactic" overlap and high on "Similarity".
Overall, our method achieves high diversity (51% unigram change and 27% word shuffling, compared to 28% unigram change and 6% shuffling for non-diverse), while maintaining good semantic similarity (SIM = 72.5 vs 83.9 for non-diverse) even in complex stylistic settings. We measure lexical overlap in terms of unigram F1 overlap using the evaluation scripts from Rajpurkar et al. (2016). Syntactic overlap is measured using Kendall's τ_B (Kendall, 1938) of shared vocabulary. A value of τ_B = 1.0 indicates no shuffling, whereas a value of τ_B = −1.0 indicates 100% shuffling (complete reversal). Finally, the SIM model from Wieting et al. (2019) is used for measuring similarity.
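The two overlap measures can be sketched in plain Python. This is a simplified illustration, not the evaluation scripts we used: tokenisation is lower-cased whitespace splitting, and each shared word is located by its first occurrence, both of which are assumptions. Because first-occurrence positions of distinct words are never tied, τ_B here reduces to (concordant − discordant) / (number of pairs).

```python
from collections import Counter
from itertools import combinations

def unigram_f1(source, paraphrase):
    """Unigram F1 overlap between two whitespace-tokenised sentences."""
    src, par = source.lower().split(), paraphrase.lower().split()
    overlap = sum((Counter(src) & Counter(par)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(par)
    recall = overlap / len(src)
    return 2 * precision * recall / (precision + recall)

def kendall_tau_shared(source, paraphrase):
    """Kendall's tau over the positions of words shared by both sentences.

    Returns None when fewer than two distinct words are shared."""
    src, par = source.lower().split(), paraphrase.lower().split()
    shared = [w for w in dict.fromkeys(src) if w in par]
    if len(shared) < 2:
        return None
    pos_src = [src.index(w) for w in shared]
    pos_par = [par.index(w) for w in shared]
    concordant = discordant = 0
    for i, j in combinations(range(len(shared)), 2):
        # Positions are distinct, so the product is never zero.
        if (pos_src[i] - pos_src[j]) * (pos_par[i] - pos_par[j]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

tau_same = kendall_tau_shared("a b c d", "a b c d")
tau_reversed = kendall_tau_shared("a b c d", "d c b a")
f1_half = unigram_f1("a b c d", "a b x y")
```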