Stylistic Transfer in Natural Language Generation Systems Using Recurrent Neural Networks

Linguistic style conveys the social context in which communication occurs and defines particular ways of using language to engage with the audiences to which the text is accessible. In this work, we are interested in the task of stylistic transfer in natural language generation (NLG) systems, which could have applications in the dissemination of knowledge across styles, automatic summarization and author obfuscation. The main challenges in this task involve the lack of parallel training data and the difficulty in using stylistic features to control generation. To address these challenges, we plan to investigate neural network approaches to NLG to automatically learn and incorporate stylistic features in the process of language generation. We identify several evaluation criteria, and propose manual and automatic evaluation approaches.


Introduction
Linguistic style is an integral aspect of natural language communication. It conveys the social context in which communication occurs and defines particular ways of using language to engage with the audiences to which the text is accessible.
In this work, we examine the task of stylistic transfer in NLG systems; that is, changing the style or genre of a passage while preserving its semantic content. For example, given texts written in one genre, such as Shakespearean texts, we would like a system that can convert them into another, say, that of Simple English Wikipedia. Currently, most knowledge available in textual form is locked into the particular data collection in which it is found. An automatic stylistic transfer system would allow that information to be more generally disseminated. For example, technical articles could be rewritten into a form that is accessible to a broader audience. Alternatively, stylistic transfer could also be useful for security or privacy purposes, such as in author obfuscation, where the style of the text is changed in order to mask the identity of the original author.
One of the main research challenges in stylistic transfer is the difficulty in using linguistic features to signal a certain style. Previous work in computational stylistics has identified a number of stylistic cues (e.g., passive vs. active sentences, repetitive usage of pronouns, ratio of adjectives to nouns, and frequency of uncommon nouns). However, it is unclear how a system would translate this knowledge into control over realization decisions in an NLG system. A second challenge is that it is difficult and expensive to obtain adequate training data. Given the large number of stylistic categories, it seems infeasible to collect parallel texts for all, or even a substantial number of, style pairs. Thus, we cannot directly cast this as a machine translation problem in a standard supervised setting.
Recent advances in deep learning provide an opportunity to address these problems. Work in image recognition using deep learning approaches has shown that it is possible to learn representations that separate aspects of the object from the identity of the object. For example, it is possible to learn features that represent the pose of a face (Cheung et al., 2014) or the direction of a chair (Yang et al., 2015), in order to generate images of faces/chairs with new poses/directions. We plan to design similar recurrent neural network architectures to disentangle the style from the semantic content in text. This setup not only requires less hand-engineering of features, but also allows us to frame stylistic transfer as a weakly supervised problem without parallel data, in which the model learns to disentangle and recombine latent representations of style and semantic content in order to generate output text in the desired style.
In the rest of the paper, we discuss our plans to investigate stylistic transfer with neural networks in more detail. We will also propose several evaluation criteria for stylistic transfer and discuss evaluation methodologies using human user studies.

Related Work
Capturing stylistic variation is a long-standing problem in NLP. Sekine (1997) and Ratnaparkhi (1999) consider the different categories in the Brown corpus to be domains. These include general fiction, romance and love story, and press: reportage. Gildea (2001), on the other hand, refers to these categories as genres. Different NLP sub-communities use the terms domain, style and genre to denote slightly different concepts (Lee, 2001). From a linguistic point of view, domains could be thought of as broad subject fields, while genre can be seen as a category assigned on the basis of external criteria such as intended audience, purpose, and activity type. Style conveys the social context in which communication occurs and defines particular ways of using language to engage with the audiences to which the text is accessible. Some linguists would argue that style and domain are two attributes characterizing genre (e.g., Lee, 2001), while others view genre and domain as aspects representing style (e.g., Moessner, 2001).
The notion of genre has been the focus of related NLP tasks. In genre classification (Petrenz and Webber, 2011; Sharoff et al., 2010; Feldman et al., 2009), the task is to categorize the text into one of several genres. In author identification (Houvardas and Stamatatos, 2006; Chaski, 2001), the goal is to identify the author of a text, while author obfuscation (Kacmarcik and Gamon, 2006; Juola and Vescovi, 2011) consists of modifying aspects of the text so that forensic analysis fails to reveal the identity of the author.
Pavlick and Tetreault (2016) present an analysis of formality in online written communication. They propose a set of linguistic features based on a study of human perceptions of formality across multiple genres. Those features are fed to a statistical model that classifies texts as having a formal or informal style. At the lexical level, Brooke et al. (2010) focused on constructing lexicons of formality that can be used in tasks such as genre classification or sentiment analysis. In Inkpen and Hirst (2004), a set of near-synonyms is given for a target word, and one synonym is selected based on several types of preferences, e.g., stylistic (degree of formality). We aim to generalize this work beyond the lexical level.
Closely related is the work of Xu et al. (2012), who propose using phrase-based machine translation systems to carry out paraphrasing while targeting a particular writing style. Since the problem is framed as a machine translation problem, it relies on parallel data where the source "language" is the original text to be paraphrased (in that case, Shakespeare texts) and the "translation" is the equivalent modern English version of those texts. Accordingly, for each source sentence, there exists a parallel sentence having the target style. They also present some baselines which do not make use of parallel sentences and instead rely on manually compiled dictionaries of expressions commonly found in Shakespearean English. In more recent work, Sennrich et al. (2016) carry out translation from English to German while controlling the degree of politeness. This is done in the context of neural machine translation by adding side constraints. Specifically, they mark up the source language of the training data (in this case, English) with a feature that encodes the use of honorifics seen in the target language (in this case, German). This allows them to control the honorifics that are produced at test time.
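To make the idea of a side constraint concrete, the following is a minimal sketch of how a source sentence might be marked up with a politeness feature. The tag strings, their position at the end of the sentence, and the function name are illustrative assumptions, not details taken from Sennrich et al. (2016).

```python
# Hypothetical tag vocabulary; the actual markup used in prior work may differ.
POLITENESS_TAGS = {"polite": "<polite>", "informal": "<informal>"}


def add_side_constraint(source_tokens, politeness=None):
    """Return a copy of the source tokens with an optional politeness tag appended.

    During training, the tag would be derived from the honorifics observed in
    the target sentence; at test time, the user chooses it to steer the output.
    """
    marked = list(source_tokens)
    if politeness is not None:
        marked.append(POLITENESS_TAGS[politeness])
    return marked


example = add_side_constraint(["are", "you", "ready", "?"], "polite")
```

Because the constraint is just an extra input token, the translation model itself needs no architectural change; it learns to associate the tag with the corresponding target-side forms.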

Proposed Approach
Recently, RNN-based models have been successfully used in machine translation (Cho et al., 2014b; Cho et al., 2014a; Sutskever et al., 2014) and dialogue systems (Wen et al., 2015). Thus, we propose to use an LSTM-based RNN model based on the encoder-decoder structure (Cho et al., 2014b) to automatically process stylistic nuances instead of hand-engineering features. The model is a variant of an autoencoder where the latent representation has two separate components: one for style and one for content. The learned stylistic features would be distinct from the content features and specific to each style category, such that they can be swapped between training and testing models to perform stylistic transfer. The separation, or disentanglement, between stylistic and content features is reinforced by modifying the training objective of Cho et al. (2014b), which maximizes the conditional log-likelihood of the output given the input. Our model is instead trained to maximize an objective that also includes a cross-covariance term dedicated to disentanglement.
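The exact form of the cross-covariance term is not fixed here. As a minimal sketch, assuming a penalty in the spirit of Cheung et al. (2014), one could take half the sum of squared cross-covariances between every style unit and every content unit over a mini-batch, and subtract it from the log-likelihood. The function names and the weight beta are hypothetical.

```python
def xcov_penalty(style_batch, content_batch):
    """Half the sum of squared cross-covariances between style and content units.

    Each argument is a list of per-example vectors (lists of floats), with one
    row per example in the mini-batch. The value is zero when no style unit
    covaries with any content unit across the batch.
    """
    n = len(style_batch)
    if n == 0 or n != len(content_batch):
        raise ValueError("batches must be non-empty and of equal size")
    s_dim, c_dim = len(style_batch[0]), len(content_batch[0])
    # Per-dimension means over the batch.
    s_mean = [sum(row[i] for row in style_batch) / n for i in range(s_dim)]
    c_mean = [sum(row[j] for row in content_batch) / n for j in range(c_dim)]
    total = 0.0
    for i in range(s_dim):
        for j in range(c_dim):
            cov = sum(
                (s[i] - s_mean[i]) * (c[j] - c_mean[j])
                for s, c in zip(style_batch, content_batch)
            ) / n
            total += cov * cov
    return 0.5 * total


def objective(log_likelihood, style_batch, content_batch, beta=1.0):
    """Quantity to maximize: log-likelihood minus the weighted penalty (beta is assumed)."""
    return log_likelihood - beta * xcov_penalty(style_batch, content_batch)
```

In a real system the same quantity would be computed on the encoder's activations inside an automatic differentiation framework so that its gradient can be backpropagated; the pure-Python version only illustrates the arithmetic.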
At a high level, our proposed approach consists of the following steps:
1. For a given style transfer task between two styles A and B, we will first collect relevant corpora for each of those styles.
2. Next, we will train the model on each of the styles (separately). This would allow the system to disentangle the content features from the stylistic features. At the end of this step, we will have (separately) the features that characterize style A and those that characterize style B.
3. During the testing phase, for a transfer, say, from style A to style B, the system is fed texts having style A while the stylistic latent variables of the model are fixed to be those learned for style B (from the previous step). This would force the model to generate text using style B. For a transfer from style B to A, the system is fed texts having style B and we fix the stylistic latent variables of the model to be those learned for style A.
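The test-time procedure in step 3 can be sketched as follows. This is only an illustration of the data flow under stated assumptions: the trained encoder and decoder are replaced by toy stand-ins operating on lists of numbers, and the style variables for B are assumed to be obtained by averaging the style component over the B corpus, which is one possible choice rather than a settled design decision.

```python
def mean_style(encode, corpus):
    """Average the style component over a corpus (an assumed way to fix a style)."""
    styles = [encode(x)[1] for x in corpus]
    dim = len(styles[0])
    return [sum(s[k] for s in styles) / len(styles) for k in range(dim)]


def transfer(encode, decode, source, target_style):
    """Keep the content code of the source and decode it with the target style."""
    content, _source_style = encode(source)
    return decode(content, target_style)


# Toy stand-ins for the trained networks (hypothetical): the first two numbers
# of an input play the role of content, the remaining ones the role of style.
def toy_encode(x, content_dim=2):
    return list(x[:content_dim]), list(x[content_dim:])


def toy_decode(content, style):
    return list(content) + list(style)


corpus_b = [[0.0, 0.0, 4.0], [1.0, 1.0, 6.0]]  # examples "written in style B"
style_b = mean_style(toy_encode, corpus_b)
text_a = [7.0, 8.0, -1.0]                       # an input "written in style A"
output = transfer(toy_encode, toy_decode, text_a, style_b)
```

The output keeps the content part of the style A input while carrying the style part estimated from corpus B; a transfer from B to A would swap the roles of the two corpora.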
We intend to apply the model to datasets with reasonably differing styles between training and testing. Examples include the complete works of Shakespeare (http://norvig.com/ngrams/shakespeare.txt), the Wikipedia Kaggle dataset (https://www.kaggle.com/c/wikichallenge/Data), the Oxford Text Archive (literary texts), and Twitter data. A future research direction would be to further improve the system to process texts that have differing but similar styles.

Evaluation
We first present a simple example that shows the input and output of the system during the testing phase. Assuming the system was trained on texts taken from Simple English Wikipedia, it would learn the stylistic features that are particular to that genre. During the testing phase, if we feed the system the following sentence taken from Shakespeare's play As You Like It (Act 1, Scene 1):
"As I remember, Adam, it was upon this fashion bequeathed me by will but poor a thousand crowns, and, as thou sayest, charged my brother on his blessing to breed me well. And there begins my sadness."
we expect the system to produce a version that might be similar to the following:
"I remember, Adam, that's exactly why my father only left me a thousand crowns in his will. And as you know, my father asked my brother to make sure that I was brought up well. And that's where my sadness begins."
We see three main criteria for the evaluation of stylistic transfer systems: soundness (i.e., the generated text is textually entailed by the original version), coherence (e.g., free of grammatical errors, proper word usage, etc.), and effectiveness (i.e., the generated text actually matches the desired style). We propose to evaluate systems using both human and automatic evaluations. Snippets of original and generated texts will be sampled and reviewed by human evaluators, who will judge them on these three criteria using Likert ratings. This type of evaluation is also used in related tasks, such as the evaluation of author obfuscation systems (Stamatatos et al., 2015). A future research direction is to investigate automatic evaluation measures similar to ROUGE and BLEU, which compare the content of the generated text against human-written gold standards using word or n-gram overlap.
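As an illustration of what an n-gram overlap measure computes, the sketch below implements a clipped n-gram precision of the kind used as a building block in BLEU. It is not the full BLEU or ROUGE metric (there is no brevity penalty, no combination across n-gram orders, and only a single reference), and the function names are our own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference.

    Each candidate n-gram is credited at most as many times as it occurs in
    the reference, so repeating a matching word does not inflate the score.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


generated = "and that is where my sadness begins".split()
gold = "and that's where my sadness begins".split()
unigram_score = clipped_precision(generated, gold, 1)
```

Such a score only captures overlap in content with a gold-standard rewrite; it says nothing about whether the output matches the desired style, which is why the three criteria above would still need to be judged separately.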

Conclusion
We present stylistic transfer as a challenging generation task. Our proposed research will address challenges to the task, such as the lack of parallel training data and the difficulty of defining features that represent style. We will exploit deep learning models to extract stylistic features that are relevant to generation without requiring explicit parallel training data between the source and the target styles. We plan to evaluate our methods using human judgments, according to criteria that we propose, derived from related tasks.