StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer

Text style transfer aims to controllably generate text with targeted stylistic changes while keeping the core meaning of the source sentence constant. Existing style transfer benchmarks primarily focus on individual high-level semantic changes (e.g. positive to negative), which enable controllability at a high level but do not offer fine-grained control over sentence structure, emphasis, and content. In this paper, we introduce a large-scale benchmark, StylePTB, with (1) paired sentences undergoing 21 fine-grained stylistic changes spanning atomic lexical, syntactic, semantic, and thematic transfers of text, as well as (2) compositions of multiple transfers which allow modeling of fine-grained stylistic changes as building blocks for more complex, high-level transfers. By benchmarking existing methods on StylePTB, we find that they struggle to model fine-grained changes and have an even more difficult time composing multiple styles. As a result, StylePTB brings novel challenges that we hope will encourage future research in controllable text style transfer, compositional models, and learning disentangled representations. Solving these challenges would present important steps towards controllable text generation.


Introduction
At the heart of interactive AI systems lies the element of communication as a channel to convey intentions using different stylistic attributes. Research in human-AI interaction has focused on building dialog systems (Celikyilmaz et al., 2018), virtual assistants (Cooper et al., 2004), and intelligent agents (Kim et al., 2013; Liang et al., 2020a; Pittermann et al., 2010) that can communicate their intentions with specific styles for different situations, target audiences, and environments (Lample et al., 2019; Li et al., 2018). For example, expressing the same facts using either formal or informal styles can be more suitable for certain target audiences (Rao and Tetreault, 2018).

* Authors contributed equally.

[Figure 1: STYLEPTB provides a large-scale resource to study fine-grained compositional style transfer. The styles provided in STYLEPTB span lexical, syntax, semantic, and thematic aspects (DiMarco and Hirst, 1993), which can be composed to form high-level style transfers as commonly studied in existing benchmarks (e.g. Yelp for sentiment (Shen et al., 2017) and GYAFC for formality (Rao and Tetreault, 2018)). Examples: "The bad service of the waitresses make me dread going sometimes." → "The good service of the waitresses makes me dread going sometimes." → "The good service of the waitresses makes me enjoy going sometimes."; "I left three messages without a call back." → "I left three messages." → "I left three thankful messages."; "After 3 months they can't be too new now." → "After 3 months they can't be too new."]

What is a style in natural languages? Existing style transfer benchmarks primarily focus on individual high-level stylistic changes across sentiment (Shen et al., 2017), formality (Rao and Tetreault, 2018), politeness (Madaan et al., 2020), and writing styles (Jhamtani et al., 2017). Figure 1 provides motivating examples showing that the high-level style transfers commonly studied in existing benchmarks can in fact be seen as composed from a dictionary of fine-grained style constructs. This alternative way of studying styles brings additional flexibility that enables fine-grained control, with the possibility of composing a broader space of styles spanning tense, sentence structure, phrase emphasis, and the information contained in the sentence. However, the missing link is a benchmark dataset that offers this type of fine-grained style construct, along with the controllability to compose these stylistic transfers.
To fill this gap, we leverage research in linguistics to study formulations of styles across 4 representational categories (lexical, syntax, semantics, and thematics) that span the fundamental atomic transfers text can undergo (McDonald and Pustejovsky, 1985; DiMarco and Hirst, 1993). Using these insights, we introduce a large-scale benchmark with (1) paired sentences undergoing 21 fine-grained stylistic changes spanning the most atomic lexical, syntactic, semantic, and thematic style constructs, as well as (2) compositions of multiple transfers which model how fine-grained style constructs compose to form more complex, high-level transfers. Our dataset, called STYLEPTB, builds upon the Penn Treebank (Marcus et al., 1993) by annotating each sentence undergoing these fine-grained style constructs, resulting in a large-scale resource spanning 59,767 sentence pairs across 21 individual styles and an additional 35,887 sentence pairs across 32 compositions of multiple styles.
STYLEPTB allows us to study the performance of state-of-the-art style transfer models when faced with the new challenge of fine-grained style transfer. Interestingly, we observe that these models, while capable of performing high-level semantic changes, struggle with fine-grained changes, particularly in the syntactic and thematic domains. A second analysis in this paper examines how these models handle compositions of multiple style constructs as a step towards controllable high-level style transfer. However, we find that current models have an even more difficult time composing multiple styles. As a step towards this desideratum, we also propose an approach (CS-GPT) based on pre-trained language models (Radford et al., 2019) that achieves compositional style transfer. We believe that STYLEPTB will bring novel challenges that will encourage research in controllable generation, compositionality of styles, and learning disentangled representations (John et al., 2019). From a broader perspective, we conclude with the observation that controllable style transfer models trained on STYLEPTB can help mitigate social biases in pre-trained language models.

Related Work
Several lines of research have aimed to formalize styles in natural language through computational and linguistic perspectives (DiMarco and Hirst, 1993). The first systematic formulation of style was by McDonald and Pustejovsky (1985), later extended by DiMarco and Hirst (1993) to 4 representational categories covering lexical, syntax, thematic, and semantic aspects. Following this, there were early efforts applying stylistic analysis to dialog generation (Hovy, 1987), machine translation (DiMarco, 1994), and text generation (Gatt and Krahmer, 2018). We take advantage of this prior work when formalizing our new STYLEPTB dataset.
Evaluating style transfer is difficult due to the diversity of plausible transferred sentences. In addition to automatic scores such as BLEU, perplexity, or binary classification accuracy of style transfer (Hu et al., 2017; Lample et al., 2019; He et al., 2020), other automatic metrics (Fu et al., 2018; Mir et al., 2019) and human evaluation are also commonly used (Li et al., 2018; Shen et al., 2017).

Fine-Grained Style Constructs
As a step towards enabling fine-grained control with the possibility to compose a broader space of styles, we first define style constructs at fine-grained levels spanning lexical, syntactic, semantic, and thematic aspects. When selecting these style constructs, we have 2 goals in mind: (1) they should be representative of the four aspects (lexical, syntactic, semantic, thematic) following the formal categorizations in DiMarco and Hirst (1993), and (2) the transfers should be consistent (i.e. well-defined, such that if multiple annotators are asked to modify the same sentence, the results will be similar). With these goals in mind, we summarize the following 21 chosen fine-grained style constructs spanning 4 categories and provide detailed examples in Table 1.

Table 1: Examples of the 21 defined style constructs across lexical, syntactic, semantic, and thematic aspects found in STYLEPTB. Each row shows the original sentence and its transferred counterpart. Note that some thematic and semantic transfers require additional information, shown here in brackets.

LEXICAL
- Noun antonym replacement: "Investors will develop thicker skins and their confidence will return he says." → "Investors will develop thicker skins and their diffidence will return he says."
- Verb synonym replacement: "The meeting is expected to call for heightened austerity for two years." → "The meeting is anticipated to call for heightened austerity for two years."
- Verb antonym replacement: "He noted that higher gasoline price will help buoy the October totals." → "He ignored that higher gasoline prices will help buoy the October totals."
- ADJ synonym replacement: "Most other states have enacted similar bans." → "Most other states have enacted alike bans."
- ADJ antonym replacement: "It is also planning another night of original series." → "It is also planning another night of unoriginal series."
- Most frequent synonym replacement: "Republicans countered that long-range revenue estimates were unreliable." → "Republicans countered that long-range revenue judges were unreliable."
- Least frequent synonym replacement: "Merrill Lynch Capital Markets Inc. is the sole underwriter for the offering." → "Merrill Lynch Capital Markets Inc. is the sole investment-banker for the oblation."

SYNTAX
- To future tense: "It is also planning another night of original series." → "It will be also planning another night of original series."
- To present tense: "Sen. Mitchell urged them to desist." → "Sen. Mitchell urges them to desist."
- To past tense: "It is also planning another night of original series." → "It was also planning another night of original series."
- Active to passive: "He also received 20-year sentences for each of the 24 passengers injured." → "20-year sentences also were received by him for each of the 24 passengers injured."
- Passive to active: "Most bills are drafted by bureaucrats not politicians." → "Bureaucrats not politicians draft most bills."
- PP front to back: "In Indianapolis Lilly declined comment." → "Lilly declined comment in Indianapolis."
- PP back to front: "The dollar has been strong unlike 1987." → "Unlike 1987 the dollar has been strong."

SEMANTICS
- ADJ or ADV removal: "The controls on cooperatives appeared relatively liberal when first introduced." → "The controls on cooperatives appeared liberal when introduced."
- PP removal: "The controls on cooperatives appeared relatively liberal when first introduced." → "The controls appeared relatively liberal when first introduced."
- Substatement removal: "The controls on cooperatives appeared relatively liberal when first introduced." → "The controls on cooperatives appeared relatively liberal."
- Information addition: "He reports his business is up slightly from customers replacing old stock." + ['customer', 'waiting to buy', 'seafood'] → "He reports his business is up slightly from customers waiting to buy seafood and replacing old stock."

THEMATICS
- Verb/Action emphasis: "He intends to add to the litigation staff." + [add] → "Adding to the litigation staff is what he intends to do."
- Adjective emphasis: "The comparable year-earlier number was 56 million a spokesman said." + [comparable] → "A spokesman said the year-earlier number of 56 million was comparable."

Lexical transfers operate at the fine-grained lexicon level (i.e. vocabulary or words), covering word constitution (Heine et al., 2002) and word meaning (Cruse et al., 1986). As a starting point, we selected two types of lexical transfers: synonym/antonym replacements (6 transfers that replace nouns/verbs/adjectives with their synonyms/antonyms) and frequency-based replacements (2 transfers that replace words with their most/least frequently occurring synonyms). The synonym/antonym resources are taken from WordNet (Fellbaum, 2012).
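As an illustration of how a lexical replacement transfer operates, here is a minimal sketch. The actual dataset draws synonym/antonym sets from WordNet; the small lexicon below is a toy stand-in for illustration only.

```python
# Minimal sketch of a lexical synonym-replacement transfer.
# SYNONYMS is a toy stand-in; the released pipeline uses WordNet synsets.
SYNONYMS = {
    "expected": "anticipated",
    "similar": "alike",
}

def replace_synonyms(sentence, lexicon):
    """Replace every whitespace-delimited token that has a lexicon entry."""
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

print(replace_synonyms("The meeting is expected to call for heightened austerity",
                       SYNONYMS))
# -> The meeting is anticipated to call for heightened austerity
```

A real implementation would additionally restrict replacements by part of speech (so that only nouns, verbs, or adjectives are swapped, matching the 6 defined transfers).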
Syntax transfers modify the underlying grammatical rules that govern the structure of sentences (Chomsky, 2002) without affecting the content (Akmajian and Heny, 1980). We selected three simple types of syntax transfers: tense changes (3 transfers: to past/present/future tense), voice changes (2 transfers: active to/from passive), and proposition position changes (2 transfers: front to/from back).
Semantic transfers are changes to the meaning of sentences (Bagha, 2011) that not only extend beyond lexical- (Cruse et al., 1986) and syntax-level (Kratzer and Heim, 1998) changes, but also include modifications using indirect information such as referring (Strawson, 1950), situations (Barwise and Perry, 1981), or intentions and extensions (Allwood et al., 1977). As a starting point, we defined two simple types of semantic transfers: (1) Info removal: 3 transfers performing deletions at the word level (removing adjectives and adverbs), phrase level (removing propositions), and substatement level (removing entire substatements), representing referring and situations; and (2) Info addition: 1 transfer that adds a given piece of information regarding a particular phrase in the current sentence, representing extension.
Thematic transfers concern the placing of emphasis across different parts in a sentence (Stevenson et al., 1994) to highlight different aspects of the same event (DiMarco, 1994). We defined two emphatic transfers across adjectives and verbs (actions). As an example of adjective emphasis, "the hot meat is on the table" emphasizes location, while "the meat on the table is hot" emphasizes the hot temperature. To enforce consistency across annotators, we require adjective emphasis to rewrite the sentence into a be-statement of the emphasized adjective (as in the example above).
Analysis: To evaluate how useful these 21 selected atomic transfers are, we randomly sampled 50 sentence pairs from GYAFC and 50 sentences from Yelp with their reference transfers generated by the Deep Latent Sequence Model (He et al., 2020), and manually tried to complete the transfers by composing one or more of the 21 atomic transfers we defined, together with capitalization and word-spelling fixes. We found that 72% of transfers from GYAFC and 82% of transfers from Yelp can be done this way. Specifically, in GYAFC, 24% require one atomic transfer and another 48% require composing multiple atomic transfers; in Yelp, 52% require at most one atomic transfer and another 30% require composing multiple atomic transfers. These results suggest that STYLEPTB's dictionary of atomic styles is already a good start for studying compositional style transfer: STYLEPTB's atomic transfers and their compositions do indeed span a large percentage of current high-level style transfers.

The STYLEPTB Dataset
Using these selected 21 style constructs, we now illustrate the steps towards collecting and annotating parallel sentences across style transfers.

Dataset Preprocessing
We use the Penn Treebank (PTB) (Marcus et al., 1993) as our source of sentences. Additionally, the availability of parse trees in PTB allows us to automate the majority of syntactic transfers using rule-based methods. We begin with a total of 43,948 sentences in the full PTB before removing sentences that are incomplete, too long (over 12 words), or too short (fewer than 5 words). This leaves 7,719 sentences (see Figure 2 for statistics and Appendix A.1 for full details).
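The filtering step above can be sketched as follows, assuming whitespace tokenization (so punctuation tokens count toward sentence length, which is an assumption about the exact counting convention):

```python
# Keep sentences whose length is between 5 and 12 words (inclusive),
# mirroring the "over 12 words" / "fewer than 5 words" removal criteria.
def keep_sentence(sentence):
    n = len(sentence.split())
    return 5 <= n <= 12

corpus = [
    "Lilly declined comment in Indianapolis .",  # 6 tokens -> keep
    "Stocks fell .",                             # 3 tokens -> drop
]
filtered = [s for s in corpus if keep_sentence(s)]
```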

Generating transferred sentences
We give a brief overview of the data annotation process (see Appendix A.3 for full details).
Automated rule-based transfers: For 18 of the 21 transfers (lexical, syntax, and semantic transfers except Info Addition), we defined rule-based transfers using NLTK (Loper and Bird, 2002), parse trees (syntax, semantics), and WordNet (lexical). After human quality control, the total number of sentences transferred is listed in Table 2 (see Appendix A.2 for more details on automated generation and Appendix A.4 for human evaluation of the quality of generated sentences).

Transfers with human annotations: For the remaining 3 transfers, human annotators (via Amazon Mechanical Turk) manually rewrite the sentences due to the difficulty of automating the process. See Appendix A.3 for details on the data generation, human annotation, and quality assurance process for each of the three transfers. After annotation and quality control, we obtained 696 rewritten sentences for adjective emphasis, 1,201 rewritten sentences for verb emphasis, and 2,114 valid sentence-information pairs with their transferred sentence with information added.

Relative Difficulty of Transfers
Lexical transfers can be done by replacing individual words and are simple to evaluate. To evaluate the difficulty of the remaining 13 syntax, semantic, and thematic transfers, we calculated the token-level (i.e. word-level) Hamming distance between original and transferred sentences. Using this metric, we categorized these 13 transfers into easy, medium, and hard categories (see Table 3). We also evaluated semantic measures from BERT embeddings (Devlin et al., 2018) but found them less correlated with human judgment (see Appendix A.5).

[Figure 3: Example of generating sentence pairs that compose tense and voice changes. Starting from an original sentence, we sequentially apply parse tree transfers to obtain multiple transferred sentences, yielding multiple parallel pairs. We use transfer tokens (∆1, ∆2) to track changes (see Section 5 for details).]
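The token-level Hamming distance used to rank transfer difficulty can be sketched as follows. Hamming distance is only defined for equal-length sequences, so padding the shorter sentence is one plausible convention (an assumption, since the paper does not spell out how length mismatches are handled):

```python
def token_hamming(src, tgt):
    """Token-level Hamming distance between two sentences.
    The shorter sentence is padded with empty tokens, so length
    differences also count as edits (one plausible convention)."""
    a, b = src.split(), tgt.split()
    if len(a) < len(b):
        a += [""] * (len(b) - len(a))
    else:
        b += [""] * (len(a) - len(b))
    return sum(x != y for x, y in zip(a, b))

# Tense change alters a single token:
token_hamming("It is also planning another night",
              "It was also planning another night")  # -> 1
```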

Compositional Transfers
To allow for compositionality, we also generated compositional data that includes parallel pairs of sentences linked by multiple sequential transfers.
To compose automatic transfers, we applied a sequence of rule-based transfers starting with parse trees (see Table 4). To compose transfers that involve human annotations, we apply a sequence of "reverse" changes on the original sentences with parse trees (since human rewritten sentences no longer have parse trees), before chaining the sequence of automatic reverse transfers with the final human-annotated transfer (see Figure 3).
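The sequential chaining of rule-based transfers can be sketched as plain function composition: each transfer maps a sentence to a sentence, and applying a list of them yields a (source, target) pair linked by multiple transfer tokens. The two toy transfers below are simplified word-level stand-ins for the actual parse-tree rules:

```python
# Toy stand-ins for two rule-based transfers (the real ones edit parse trees).
def to_past(sentence):
    # simplified: only handles "is" -> "was"
    return sentence.replace(" is ", " was ")

def pp_removal(sentence):
    # simplified: drop a trailing "in <place>" prepositional phrase
    head, _, _ = sentence.partition(" in ")
    return head

def compose(sentence, transfers):
    """Apply a sequence of transfers, as when generating compositional pairs."""
    for t in transfers:
        sentence = t(sentence)
    return sentence

compose("Lilly is silent in Indianapolis", [to_past, pp_removal])
# -> "Lilly was silent"
```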

A Model for Compositional Transfer
We extend the pre-trained GPT2 language model (Radford et al., 2019) for parallel style transfer by giving it designated style transfer tokens as input in addition to the source sentence. For example, for each individual binary style s_i, we define a style transfer token ∆_i ∈ {0, 1, 2}, where ∆_i = 0 represents keeping s_i unchanged, ∆_i = 1 represents a change from s_i = 0 to s_i = 1, and vice versa for ∆_i = 2. We likewise extend the definition of ∆_i for styles taking more than 2 values. Given a parallel (source, target) pair (s, t), we define the appropriate transfer token ∆ ∈ {0, 1, 2} and train using maximum likelihood estimation to predict every word t_j, for j = 1, 2, ..., T, in the target sentence given the source and ∆:

θ* = argmax_θ Σ_{j=1}^{T} log p(t_j | t_1, ..., t_{j−1}, s, ∆; θ),

where θ denotes the pre-trained GPT2 parameters and θ* denotes the parameters after fine-tuning on STYLEPTB. Note that we also train the model to reconstruct the same source sentence again when setting ∆ = 0 (no style change), which we found to help bridge the domain shift between the data used to pre-train GPT2 and the sentences in STYLEPTB.
As a step towards compositionality, we also train with (source, target) pairs that undergo multiple atomic style transfers, as provided in STYLEPTB, resulting in multiple style transfer tokens ∆_i being activated at the same time. We call the resulting model CS-GPT (Compositional Style GPT) and show its architecture in Figure 4. Learning separate representations for each ∆_i results in disentangled style variables that can then be composed as desired. Another benefit of using disentangled style variables is the ability of a single model to perform multiple style transfers.
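The transfer-token conditioning can be sketched as packing the ∆ tokens, source, and target into one training sequence with the MLE loss applied only to the target positions. The token ids, separator, and label-masking convention below are illustrative assumptions, not the authors' released implementation:

```python
# Sketch: pack (source, target, deltas) into one GPT-style training example.
# Special-token ids and the -100 label mask are illustrative conventions.
def build_example(src_ids, tgt_ids, deltas, sep_id=0, ignore=-100):
    # one special id per style variable, e.g. delta_i in {0, 1, 2}
    delta_ids = [1000 + 10 * i + d for i, d in enumerate(deltas)]
    input_ids = delta_ids + src_ids + [sep_id] + tgt_ids
    # MLE loss only on the target: mask deltas, source tokens, and separator
    labels = [ignore] * (len(delta_ids) + len(src_ids) + 1) + tgt_ids
    return input_ids, labels

inp, lab = build_example([5, 6, 7], [8, 9], deltas=[1, 0])
# inp -> [1001, 1010, 5, 6, 7, 0, 8, 9]
# lab -> [-100, -100, -100, -100, -100, -100, 8, 9]
```

Setting all deltas to 0 recovers the reconstruction objective mentioned above, and multiple non-zero deltas correspond to compositional transfers in CS-GPT.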

Datasets and Metrics
We use STYLEPTB and evaluate on the 13 non-lexical transfers (since lexical changes work best with fixed word substitutions). Please refer to Appendix B.1 for dataset preprocessing details. Automated evaluation metrics consist of BLEU, METEOR, ROUGE_L, and CIDEr scores between generated and ground-truth sentences (Sharma et al., 2017). In addition, we performed human evaluations on random sets of 10 samples generated by each model for each transfer. Following prior work (He et al., 2020), 2 independent annotators each rated transferred sentences on three aspects (clarity/grammar, content preservation, style change) on a 1-5 Likert scale, and we report the average.
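For reference on how the n-gram overlap scores behave, here is a minimal, self-contained sentence-level BLEU (uniform weights, a single reference, no smoothing). The paper's reported scores come from standard evaluation toolkits; this sketch only illustrates the metric:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Returns 0 if any precision is zero
    (no smoothing), as in the plain BLEU definition."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ngrams.values())
        if total == 0:
            return 0.0
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect transfer scores 1.0 against its reference; shortened or altered outputs are penalized through the brevity penalty and clipped precisions.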

Baseline Models
We evaluate the following baselines commonly used in style transfer. Since none of these existing models handle compositions of styles, we train separate models on each of the 13 transfers.
1) GPT2: We fine-tune pre-trained GPT2 (Radford et al., 2019) on each transfer with the source as input and predicting the target using MLE, similar to Liu et al. (2020); Syed et al. (2020).
2) SEQ2SEQ: A standard sequence-to-sequence model trained from scratch on each transfer.

3) RETRIEVEEDIT: Given input x, a retriever is trained to pick a similar training example (x′, y′). We treat y′ as our prototype and use a trained editor to edit it into the desired output y (Guu et al., 2018; Madaan et al., 2020).

4) HUMAN: We also report human performance for each style transfer by having two independent human annotators manually perform the style transfer on 20 sampled sentences.

Results and Observations
We evaluate these 3 baseline models on the style transfers in STYLEPTB and show results in Table 5. We make the following observations:

Baseline comparisons: RETRIEVEEDIT performed on par with GPT2 on some transfers, such as To Future Tense, and performed significantly better than GPT2 on most transfers. When qualitatively inspecting the generated sentences, we found that while GPT2 can learn syntactic and semantic transfers, it struggles to reconstruct the rest of the sentence (e.g. producing word repetitions). This was not an issue for RETRIEVEEDIT, since it works by editing the sentence from the prototype. Both GPT2 and RETRIEVEEDIT significantly outperform SEQ2SEQ models on all 13 non-lexical transfers.

[Table 6: Human evaluation of style transfer models trained on the Verb Emphasis task. All approaches fall far short of human performance, which was judged by a separate human as having almost perfect clarity, content, and style metrics. GPT2 gets higher style scores while RETRIEVEEDIT excels at grammar and content preservation.]

Difficulties of transfers:
We also compare the relative difficulty of transfers based on the automatic metrics described in Section 4.3. In line with our Hamming distance metric, we found that thematic transfers are especially difficult: all three baselines struggled on them, which is intuitive because shifting emphasis requires entirely different structural changes depending on the sentence and the emphasized word. We found that GPT2 and SEQ2SEQ tend to struggle with grammar and word repetitions, while RETRIEVEEDIT sometimes follows the structural edits in the chosen (and often completely unfitting) examples, resulting in malformed outputs (see examples in Appendix C.1). All current methods fall significantly short of human performance, especially on hard transfers. Therefore, we believe that STYLEPTB brings novel challenges that will spark future research in modeling fine-grained style changes.
Human evaluation: We sampled 10 transferred sentences from each automatic generation model for each transfer and asked 2 independent annotators to rate them. We show average results below for one of the hard transfers (Verb Emphasis). From Table 6, we find that all approaches fall far short of human performance, which was judged by a separate human as having almost perfect clarity, content, and style metrics. Furthermore, GPT2 gets higher style scores while RETRIEVEEDIT excels at grammar and content preservation, which further supports our qualitative observations above. Full results for human evaluations are available in Table 17 in Appendix C.1.

Towards Compositionality of Styles
As a step towards learning compositional transfers, we implemented the following baselines: 1. GPT2: Sequentially applying the GPT2 model trained for single transfers multiple times to perform compositional transfers.
2. CS-GPT: Our proposed CS-GPT model (detailed in Section 5) trained on compositional transfer pairs found in STYLEPTB.
3. CS-GPT-ZERO: An ablation of CS-GPT trained only on individual style changes but tested in a zero-shot setting on compositional transfers.
We evaluated these models on two compositional transfers: Tense+Voice (composing tense changes and active/passive voice changes) and Tense+PP Removal (composing tense changes and PP removal). We used the numerical prefixes in the datasets as transfer tokens. The results are shown in Table 7, and we make the following observations:

CS-GPT works best for compositional transfers: CS-GPT significantly outperforms existing methods for compositional style transfer. This is expected, as CS-GPT is trained on the full compositional dataset, while CS-GPT-ZERO is only trained on part of the compositional data and the sequentially applied GPT2 is trained on single-transfer parallel data. Qualitatively, we observed that CS-GPT is able to perform each required transfer at the same time, producing outputs with relatively low reconstruction error compared to the other two methods. We include a few samples generated by the three models in Table 9, with more examples in Appendix C.2.
Zero-shot compositionality remains challenging: We included CS-GPT-ZERO to explore whether CS-GPT can learn to compose transfers in a zero-shot manner. While CS-GPT outperforms CS-GPT-ZERO and existing models, all still struggle to perform zero-shot compositions. We noticed that CS-GPT-ZERO usually performs only one of the necessary transfers: e.g. in a Tense+Voice task, CS-GPT-ZERO tends to make only the tense change, not the voice change. Quantitatively, on the Tense+PP Removal dataset, CS-GPT-ZERO performs much worse than either CS-GPT or sequentially applied GPT2; on the Tense+Voice dataset, CS-GPT-ZERO is similar to GPT2. We believe that sequentially applying GPT2 accumulates the errors made by each individual model.
Training on compositional styles may improve fine-grained styles: We observe that CS-GPT trained on compositional data achieves performance similar to GPT2 trained specifically for a single transfer, and sometimes even outperforms it, as shown in Table 8 (see Table 20 in the Appendix for full results). Therefore, CS-GPT leverages compositional structure and data to perform strongly on multiple single and compositional transfers with just one model.

Broader Impact: Mitigating Biases
Unconditional language models have been shown to perpetuate undesirable stereotypes during generation which disproportionately harm underrepresented social groups (Liang et al., 2020b; Ravfogel et al., 2020; Sheng et al., 2020). As one possible application of fine-grained style transfer (in addition to many others), we hypothesize that more fine-grained control over the generated outputs can help mitigate such biases.

[Table: Sample compositional outputs. To Future + Passive To Active, source: "NUM % was risen by sales to NUM billion from NUM billion." CS-GPT: "sales will rise NUM % to NUM billion from NUM billion." To Past + PP Removal, source: "the bond market was unmoved by the economic statistics." CS-GPT: "the bond market is unmoved."]

1. … Another technique that may be useful is to replace the key nouns with vague pronouns:
GPT2: The Hispanic man said that he was walking his dog when the suspect yelled.
Replace nouns: The hispanic man said that he was walking his dog when they yelled.
2. Semantic changes can also help by removing biased information from generated text using information removal. In the following two examples, the information removal is done by GPT2:
GPT2: The woman walks home from the theatre crying.
Info removal: The woman walks home from the theatre.
GPT2: The Asian man drove a black BMW to the station to meet with an officer who would be taking him to a jail.
Info removal: The Asian man drove a black BMW to meet with an officer.
We can also replace the removed part with uniformly sampled information using Info Addition. Since existing baselines currently struggle with this transfer, we present a manual example as proof of concept:
GPT2: The woman worked as a medical assistant in a health center.
Info removal: The woman worked in a health center.
Info addition with "worked as doctor": The woman worked as a doctor in a health center.
Finally, we performed a qualitative evaluation: we sampled 49 sentence completions using prompts from Sheng et al. (2019) that focus on stereotypical associations between gender and occupations (e.g. "The woman worked as a babysitter."). We obtained post-processed versions using Info removal followed by Info addition with uniformly sampled new occupations. When presented to two independent human annotators, 22 of the 49 sentences were judged as showing significantly lower bias, with the remainder showing little or no bias change, indicating that fine-grained style transfer presents a new perspective on mitigating social biases in language models (see Appendix D for evaluation details).

Conclusion
In this paper, we propose a large-scale benchmark, STYLEPTB, for fine-grained style transfer spanning atomic lexical, syntactic, semantic, and thematic changes as well as their compositions into high-level transfers. We show that STYLEPTB provides an important step towards training more controllable text generators and removing social biases from generated text. However, existing style transfer models struggle to perform fine-grained changes and have an even more difficult time composing multiple styles. As a result, STYLEPTB brings novel challenges that we hope will inspire future research in controllable text generation, compositional models, and style disentanglement.

A Dataset Construction
Here we provide more details on dataset pre-processing, annotation, quality control, post-processing, and statistics.

A.1 Dataset Preprocessing
We use the parts of the Penn Treebank (PTB) that have been used in training neural language models (Kim et al., 2015) as the source of sentences to transfer. The availability of parse trees for these sentences allows us to automate the majority of transfers using rule-based Python scripts. We begin with a total of 43,948 sentences in the full PTB before removing sentences that are incomplete, too long (over 12 words), or too short (fewer than 5 words). This leaves 7,719 sentences (see Figure 2 for statistics).
Note that the original sentences in this version of the treebank have all punctuation removed and represent the "n't" shorthand as a separate word (for example, "wasn't" is represented as the two words "was n't"). The transferred sentences we generated or collected in this new dataset follow the same format.

A.2 Programmatic Transfers
For 18 of the 21 transfers (including all lexical and syntax transfers, as well as all semantic transfers except Info Addition), we wrote Python scripts that utilize the parse trees of the sentences to complete the transfers. For the lexical transfers, synonyms/antonyms are extracted from WordNet (Fellbaum, 2012). For syntax transfers and information deletion transfers, we used NLTK tree-editing tools and lemmatizers to manipulate parse trees. Not every transfer is applicable to every sentence: for example, synonym replacement cannot be applied to a sentence in which no word has a synonym, and proposition front/back changes do not apply to sentences without propositions in the front or back. The total number of sentences transferred by our scripts is listed in Table 2.
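To give a flavor of what these scripts do, here is a heavily simplified, word-level stand-in for the to-future-tense rule. The released scripts operate on full PTB parse trees via NLTK, and the small verb mapping below is a toy assumption:

```python
# Toy word-level stand-in for the parse-tree to-future-tense rule.
# FUTURE maps a few present-tense verb forms to their future-tense rewrites.
FUTURE = {
    "is": "will be",
    "are": "will be",
    "urges": "will urge",
}

def to_future(sentence):
    return " ".join(FUTURE.get(tok, tok) for tok in sentence.split())

to_future("It is also planning another night of original series")
# -> "It will be also planning another night of original series"
```

Note the output matches the To Future Tense example in Table 1, including the slightly awkward "will be also planning" produced by purely mechanical insertion; the parse-tree rules handle auxiliaries and verb lemmas more carefully.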
Although the data collected for two syntax transfers, Passive To Active and Proposition Back To Front, is relatively scarce, this should not be a problem when training models for these transfers: their reverse transfers are also part of the dataset in much larger quantities, and we can simply swap the original/transferred sentences of the reverse transfers to obtain as much data for these two transfers as for the others.

A.3 Annotation Details
For the three remaining transfers, we asked human annotators to rewrite the sentences manually due to the difficulty of automating the process. Due to limited resources, we randomly selected 2,000 of the 7,719 selected sentences as original sentences for these three transfers.
We recruited annotators through Amazon Mechanical Turk (AMT). For each task, we designed a prompt with detailed instructions and plenty of examples to ensure consistency of the rewritten sentences. In addition, we screened workers by releasing small batches of tasks and checking whether their annotations were satisfactory. When the main batch of tasks was released, we also inspected random samples of each worker's rewritten sentences to ensure quality, rejecting submissions from workers who did not follow our consistency requirements. We also instructed workers to make sure the sentences they produced were grammatically correct and free of spelling mistakes, and rejected sampled rewritten sentences containing grammatical or spelling errors.
For Info Addition transfers, we used the Visual Genome dataset (Krishna et al., 2016) as the knowledge base for additional information. We first built a dictionary mapping each word to the attributes and relations in Visual Genome that contain the word, ordered by frequency of appearance in Visual Genome. Then, for each noun in the sentence, we selected the most frequent attribute and relation from Visual Genome containing that noun (if any) as additional information to be added to the sentence. Therefore, multiple sentence-information pairs may be created from the same original sentence. We ended up with 4,412 total pairs to be annotated. Since the added information may be unfitting or even contradictory in the context of the sentence (such as "milk in stock" in a sentence about stock markets), we asked workers to evaluate whether their rewritten sentences satisfy common sense, and we discarded rewritten sentences marked as violating it. We ended up with 2,117 rewritten sentences marked as satisfying common sense.
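The frequency-ordered dictionary described above can be sketched as follows. The records here are toy stand-ins for real Visual Genome attribute entries:

```python
from collections import Counter

# (noun, attribute phrase) pairs, as if parsed from Visual Genome.
vg_attributes = [
    ("milk", "milk in stock"),
    ("milk", "white milk"),
    ("milk", "milk in stock"),
    ("dog", "brown dog"),
]

def most_frequent_attribute(records):
    """Map each noun to its most frequent attribute phrase."""
    counts = Counter(records)          # count each (noun, attribute) pair
    by_noun = {}
    for (noun, attr), _ in counts.most_common():  # descending frequency
        by_noun.setdefault(noun, attr)            # keep only the top one per noun
    return by_noun

table = most_frequent_attribute(vg_attributes)
print(table["milk"])  # "milk in stock" appears twice, so it wins
```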
The web page used for the Information Addition task is shown in Figure 5, and the instructions for this task (which pop up when "view instructions" on the prompt page is clicked) are shown in Figure 6, together with detailed examples in the example tab next to it.
For the adjective emphasis and verb emphasis tasks, we used information from the parse trees to identify adjectives and verbs to be emphasized, filtering out words that should not be emphasized (such as "be" for verb emphasis). To ensure consistency, workers were instructed to strictly follow the required format for each emphasis task. If an emphasis rewrite in the required format was impossible, or if the original sentence already emphasized the word in the required format, workers were asked to submit "N/A", and we discarded these cases from our dataset. We started with 808 adjective emphasis tasks and 1,373 verb emphasis tasks; after discarding "N/A" results we retained 696 rewritten sentences for the adjective emphasis task and 1,201 for the verb emphasis task.
The web pages for the two emphasis tasks are shown in Figures 7 and 9, respectively, and the instructions for each emphasis task are shown in Figures 8 and 10, respectively. Finally, detailed statistics of the data collection process for these three transfers are shown in Table 10.

Table 11: Human evaluations of randomly sampled automatically generated sentence transfers. The results show that the programmatically generated transfer data is very reliable.

A.4 Human Evaluation of Automatically Generated Data
We evaluated the automatically generated parts of the dataset by asking three human annotators to rate sampled sentence transfers on three aspects (clarity/grammar, content preservation, style change) on a scale of 1-5. We found that most categories had perfect scores, and the lowest average score for any category of any task was 4.83. The full results are shown in Table 11.

A.5 Transfer Difficulty with Semantics Distance
To measure the semantic distance between original and transferred sentences in each transfer, we used pre-trained BERT models (Devlin et al., 2019) to compute the contextual representation of each sentence, and measured the average ℓ2 distance as well as cosine similarity between the representations of original and transferred sentences. The results are shown in Table 12. We find that this metric is not as effective as Token-Level Hamming Distance in determining the relative difficulty of transfers; therefore, we stick with the difficulty categories determined in Table 3.

Table 12: Average ℓ2 distance and cosine similarity between BERT pooled output vectors of original and transferred sentences for the syntax, semantic, and thematic transfers.
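The distance computation itself is straightforward once the sentence vectors are available. A sketch, assuming `u` and `v` stand in for the BERT pooled output vectors of an original/transferred pair (toy 3-dimensional vectors here):

```python
import math

def l2_and_cosine(u, v):
    """Return the ℓ2 distance and cosine similarity between two vectors."""
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    return l2, cos

u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 1.0]
l2, cos = l2_and_cosine(u, v)
# l2 = 1.0; cos = 2 / sqrt(6) ≈ 0.8165
```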

A.6 Compositional Transfers
To allow for compositionality, we also generated compositional data that include parallel pairs of sentences linked by multiple sequential transfers. To compose automatic transfers, we applied a sequence of rule-based transfers starting from parse trees. We use prefix labels to indicate the sequence of transfers undertaken. For example, when composing tense changes and active/passive voice changes, we use one label indicating the tense change (0 for no change, 1 for to future, 2 for to past, 3 for to present) and another indicating the voice change (0 for no voice change, 1 for Active To Passive, 2 for Passive To Active). Thus, a prefix of "2 1" means changing the sentence to past tense and to passive voice. The process of generating these data points is illustrated in Figure 3: we first generate active/passive pairs from the parse trees of original sentences, then apply tense changes on each pair to obtain both changes. Final statistics are shown in Table 4. To compose transfers that involve human annotations, we apply "reverse" changes to the original sentences with parse trees (since human-rewritten sentences no longer have parse trees). For example, to compose Active To Passive and Info Addition, we apply an automatic Passive To Active change on an original passive sentence A to generate active sentence B; if C is the human-annotated result of adding some information to A, then B to C is a composition of Active To Passive and Info Addition.
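The prefix-label scheme above can be made concrete with a small lookup (the label values are exactly those listed in the text; the helper function is illustrative):

```python
# Label vocabularies for the Tense+Voice compositional dataset.
TENSE = {0: "no change", 1: "to future", 2: "to past", 3: "to present"}
VOICE = {0: "no change", 1: "active to passive", 2: "passive to active"}

def describe_prefix(prefix: str) -> str:
    """Decode a two-token prefix like '2 1' into the transfers it requests."""
    t, v = (int(x) for x in prefix.split())
    return f"tense: {TENSE[t]}; voice: {VOICE[v]}"

print(describe_prefix("2 1"))  # past tense + active-to-passive
```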

B.1 Dataset Preprocessing
For transfers with additional input beyond the original sentence (additional information in Info Addition, the adjective to emphasize in Adjective Emphasis, etc.), we append the additional input to the end of the original sentence, separated by a semicolon token. When training Passive To Active and PP Back To Front, due to the low amount of data available, we also include data collected for their reverse operations with source and target swapped. For each transfer, we take all available parallel sentences and divide them into train, valid, and test sets in a 90%/5%/5% ratio. All numerals in the sentences are replaced with a "NUM" token when training the baselines.
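The preprocessing above can be sketched as follows; the shuffling seed and the whitespace-based numeral test are assumptions, not details from the paper:

```python
import random

def mask_numerals(sentence: str) -> str:
    """Replace numeral tokens with a NUM placeholder."""
    return " ".join(
        "NUM" if w.replace(",", "").replace(".", "").isdigit() else w
        for w in sentence.split()
    )

def split_pairs(pairs, seed=0):
    """Shuffle and split parallel pairs into 90% train / 5% valid / 5% test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_valid = int(0.9 * n), int(0.05 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_valid],
            pairs[n_train + n_valid:])

pairs = [(f"src {i}", f"tgt {i}") for i in range(100)]
train, valid, test = split_pairs(pairs)
print(len(train), len(valid), len(test))          # 90 5 5
print(mask_numerals("shares fell 15 points"))     # shares fell NUM points
```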

B.2 Hyperparameters
The hyperparameters used for all models trained in all experiments are shown in Table 13.
Note that in GPT2-based models, each iteration means one pass through all sentences in the training set, while in SEQ2SEQ and RETRIEVEEDIT each iteration means one pass through a batch of the training set. Also, the vector sizes of all GPT2 models are equal to those of the default pre-trained GPT2 (small) model with LM head.
The hyperparameters for RETRIEVEEDIT are the same as the defaults from the code provided by Hashimoto et al. (2018).

B.3 Model Parameters
Since the GPT2 baselines, CS-GPT, and CS-GPT-ZERO all use pre-trained GPT2 (small), each of those models has about 124M parameters. Under the hyperparameter settings described above, GRU+attn has about 2.4M parameters, and RETRIEVEEDIT has 51.8M parameters.

B.4 Training Resources and Time
All models except RETRIEVEEDIT were run on a single GPU on Google Colab. Training SEQ2SEQ for the full 185,000 iterations takes about 2 hours. Training GPT2 for the full 60 iterations takes between 1 and 4 hours (depending on the amount of parallel data for the specific transfer), although the best results (in terms of validation loss) can usually be achieved within the first 20 iterations. Training CS-GPT and CS-GPT-ZERO for the full 30 iterations takes about 4 hours on the compositional datasets (Tense+Voice, Tense+PP Removal), and the best results can be achieved within the first 10 iterations. Training each RETRIEVEEDIT model takes between 40 minutes and 1 hour.

C.1 Fine-grained Style Transfer
We show complete results of single-style experiments in Tables 14-16. In line with our Hamming distance metric, thematic transfers are especially difficult: all three baselines struggled on these tasks, which is intuitive because shifting emphasis requires completely different sentence-structure changes depending on the sentence and the emphasized word. We found that GPT2 and SEQ2SEQ tend to struggle with grammar and word repetitions, while RETRIEVEEDIT sometimes follows the structural edits of the retrieved (and often completely unfitting) examples, resulting in malformed outputs. Furthermore, all current methods fall significantly short of human performance, especially on hard transfers. Therefore, STYLEPTB brings novel challenges that will stimulate future research in modeling fine-grained style changes. Some examples of thematic transfers performed by the GPT2 and RETRIEVEEDIT models are shown below (note: in the input, along with the original sentence, the word to emphasize is shown in red). On style metrics, GPT2 achieves higher style scores while RETRIEVEEDIT excels at grammar and content preservation, which further supports our qualitative observations above.

C.2 Compositional Style Transfer
We present full results on compositional style transfer in Table 19 and show more examples of compositional transfers performed by CS-GPT, CS-GPT-ZERO, and SEQGPT in Table 18. CS-GPT significantly outperforms existing methods on all compositional style transfer tasks in both datasets. This is expected, as CS-GPT is trained on the full compositional datasets, while CS-GPT-ZERO is trained on only part of the compositional dataset and each stage of SEQGPT is trained on single-transfer parallel data. Qualitatively, we observed that CS-GPT is able to perform all required transfers at the same time, producing outputs with relatively low reconstruction error compared to the other two methods.
We also present full comparisons of CS-GPT and GPT2 on single style transfers in Table 20. We observe that CS-GPT often performs single transfers better than a GPT2 model trained specifically for that one task, and in the remaining cases CS-GPT and GPT2 have nearly the same performance. Therefore, CS-GPT leverages compositional structure and data to perform strongly on multiple single and compositional transfers with just one model.

Table 19: Results on compositions of transfers using sequentially applied GPT2 (SEQGPT), CS-GPT-ZERO (compositional model but no compositional data), and CS-GPT (both compositional model and data). The results show that CS-GPT significantly outperforms the other two methods, and that zero-shot composition remains challenging, as CS-GPT-ZERO does not perform well in comparison.

Table 21 (top): Human annotators found 21 out of 50 sentences generated by GPT2 from "The man worked as" and 28 out of 50 sentences generated by GPT2 from "The woman worked as" to exhibit gender bias. (Bottom): Out of the 49 biased sentences, after using style transfer to replace occupations with randomly sampled ones, human annotators found 22 of them to be significantly less biased, while the rest were either slightly less biased or neutral.
Text generated by GPT2 → Text after occupation replacement
The man worked as a security guard for the family's hotel until the weekend. → The man worked as a receptionist for the family's hotel until the weekend.
The man worked as a driver for his father. → The man worked as a personal care aide for his father.
The woman worked as a maid at a resort in a small town. → The woman worked as a driver at a resort in a small town.
The woman worked as a nurse at a facility. → The woman worked as a construction worker at a facility.
Table 22: Examples of sentences generated by GPT2, with occupation replacements, that were rated "Significantly Less Biased" after the change by human annotators.

D Mitigating Social Biases: Qualitative Evaluation
We created two prompts, "The man worked as" and "The woman worked as", and generated 50 sentences for each prompt with GPT2. Next, we determined biased words by taking the 1,000 closest vectors in GloVe word embeddings (Pennington et al., 2014) to "man" and "woman". We then marked a sentence as biased if the phrase describing the occupation in the sentence contained any biased words. With this standard, we found that 21 out of 50 sentences for "man" and 28 out of 50 for "woman" were biased. We then replaced the occupations in these 49 biased sentences with occupations sampled uniformly at random from all 100 generated sentences, and asked two independent human annotators to evaluate the 49 replaced sentences on a five-point scale of Significantly More Biased, Slightly More Biased, The Same, Slightly Less Biased, and Significantly Less Biased. On average, the annotators reported 22 sentences as significantly less biased than before the replacement, while all other sentences were rated either slightly less biased or neutral. The full results of this experiment are shown in Table 21. A few examples that were deemed Significantly Less Biased by both annotators are shown in Table 22.
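The bias check reduces to a set-membership test over the occupation phrase. A sketch, where the neighbor set is a tiny hypothetical stand-in for the 1,000 nearest GloVe vectors to "man"/"woman" computed in the real experiment:

```python
# Hypothetical stand-in for the nearest-neighbor word lists derived from
# GloVe embeddings of "man" and "woman".
biased_words = {"guard", "driver", "maid", "nurse"}

def is_biased(occupation_phrase: str) -> bool:
    """A sentence counts as biased if its occupation phrase contains a biased word."""
    return any(w in biased_words for w in occupation_phrase.lower().split())

print(is_biased("a security guard"))  # contains "guard"
print(is_biased("a receptionist"))    # no biased word
```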