Decomposing Textual Information For Style Transfer

This paper focuses on latent representations that could effectively decompose different aspects of textual information. Using a framework of style transfer for texts, we propose several empirical methods to assess information decomposition quality. We validate these methods with several state-of-the-art textual style transfer methods. Higher quality of information decomposition corresponds to higher performance in terms of bilingual evaluation understudy (BLEU) between output and human-written reformulations.


Introduction
The arrival of deep learning seems transformative for many areas of information processing and is especially interesting for generative models (Hu et al., 2017b). However, natural language generation is still a challenging task due to a number of factors, including the absence of local information continuity, non-smooth disentangled representations (Bowman et al., 2015), and the discrete nature of textual information (Hu et al., 2017a). If the information needed for different natural language processing (NLP) tasks could be encapsulated in independent components of the obtained latent representations, one could work with different aspects of text independently. This could also naturally simplify transfer learning for NLP models and potentially make them more interpretable.
Despite the fact that content and style are deeply fused in natural language, style transfer for texts is often addressed in the context of disentangled latent representations (Hu et al., 2017a; Shen et al., 2017; Fu et al., 2018; Romanov et al., 2018; Tian et al., 2018). A majority of these works use an encoder-decoder architecture with one or multiple style discriminators to improve latent representations. An encoder takes a given sentence as an input and generates a style-independent content representation. The decoder then uses this content representation and a target style representation to generate a new sentence in the needed style. This approach seems intuitive and appealing but has certain difficulties. For example, Subramanian et al. (2018) question the quality and usability of disentangled representations for texts with an elegant experiment. The authors train a state-of-the-art architecture that relies on disentangled representations and show that an external artificial neural network can predict the style of the input using the semantic component of the obtained latent representation (which supposedly did not incorporate stylistic information).
In this work, we demonstrate that the decomposition of latent representations is, indeed, attainable with encoder-decoder based methods but depends on the architecture used. Moreover, architectures with higher quality of information decomposition perform better on the style transfer task.
The contribution of this paper is threefold: (1) we propose several ways to quantify the quality of the obtained latent semantic representations; (2) we show that the quality of such representations can differ significantly depending on the architecture used; (3) finally, we demonstrate that architectures with higher quality of information decomposition perform better in terms of BLEU (Papineni et al., 2002) between the output of a model and human-written reformulations.

Related Work
It is hard to define style transfer rigorously (Xu, 2017). Recent contributions in the field are therefore mostly motivated by empirical results and address specific, narrow aspects of style that can be measured empirically. Stylistic attributes of text include author-specific attributes (see (Xu et al., 2012) or (Jhamtani et al., 2017) on 'shakespearization'), politeness (Sennrich et al., 2016), the 'style of the time' (Hughes et al., 2012), gender or political slant (Prabhumoye et al., 2018), and formality of speech (Rao and Tetreault, 2018). All these attributes are defined with varying degrees of rigor. Meanwhile, the general notion of literary style is only addressed in a very broad context. For example, Hughes et al. (2012) show that the style of a text can be characterized quantitatively and not only with an expert opinion; Potash et al. (2015) demonstrate that stylized texts can be generated if a system is trained on a dataset of stylistically similar texts; and literary styles of authors can be learned end-to-end (Tikhonov and Yamshchikov, 2018a,b). In this particular submission we focus on a very narrow framework of sentiment transfer. There is a certain controversy over whether the sentiment of a text can be regarded as its stylistic attribute, see (Tikhonov and Yamshchikov, 2018c). However, there seems to be a certain agreement in the field that sentiment can be regarded as a viable attribute to be changed by a style transfer system. Addressing the problem of sentiment transfer, Kabbara and Cheung (2016) and Xu et al. (2018) estimate the quality of the style transfer with a pre-trained binary sentiment classifier. Fu et al. (2018) and Ficler and Goldberg (2017) generalize this ad-hoc approach and in principle enable the information decomposition approach. They define a style as a set of arbitrary quantitatively measurable categorical or continuous parameters that can be automatically estimated with an external classifier.
In this submission we stay within this empirical paradigm of literary style.
Generally speaking, a solution that works for one aspect of style may not be applicable to a different aspect of it. For example, the retrieve-edit approach of (Guu et al., 2018) works for sentiment transfer, and a delete-retrieve model also shows good results for sentiment transfer. However, these retrieval approaches can hardly be used for the style of the time, formality, or any other case in which the system is expected to paraphrase a given sentence to achieve the target style. To address this challenge, Hu et al. (2017a) propose a more general approach to controlled text generation that combines a variational autoencoder (VAE) with an extended wake-sleep mechanism, in which the sleep procedure updates both the generator and an external discriminator that assesses generated samples and feeds learning signals back to the generator. Labels for style were concatenated with the text representation of the encoder and used, together with "hard-coded" information about the sentiment of the output, as the input of the decoder. This approach is promising and is used in many recent contributions. Shen et al. (2017) use an adversarial loss to decompose information about the form of a sentence and apply a GAN to align hidden representations of sentences from two corpora. Fu et al. (2018) use an adversarial network to make sure that the output of the encoder does not include stylistic information. Hu et al. (2017a) also use an adversarial component to ensure there is no stylistic information within the representation. A dedicated component that controls the semantic component of the latent representation has also been proposed, together with a demonstration that the decomposition of style and content can be improved with an auxiliary multi-task objective for label prediction and an adversarial objective for bag-of-words prediction. Romanov et al. (2018) also introduce a dedicated component to control semantic aspects of latent representations and an adversarial-motivational training that includes a special motivational loss to encourage a better decomposition.
The framework of information decomposition within latent representations is challenged by an alternative family of neural machine translation approaches. These are works on style transfer with (Carlson et al., 2018) and without parallel corpora, in line with (Lample et al., 2017) and (Artetxe et al., 2017). In particular, Subramanian et al. (2018) state that learning a latent representation that is independent of the attributes specifying its style is rarely attainable. They experiment with the model developed in (Fu et al., 2018), where by design the discriminator, which is trained adversarially and jointly with the model, gets worse at predicting the sentiment of the input as the coefficient of the adversarial loss increases. The authors show that a classifier trained separately on the resulting encoder representations easily recovers the sentiment from a latent representation produced by the encoder.
In this paper, we show that, contrary to (Subramanian et al., 2018), decomposition of the stylistic and semantic information is attainable with autoencoder-type models and can be quantified. However, the quality of such decomposition depends heavily on the particular architecture. We propose three different measures of information decomposition quality and, using four different architectures, show that models with better information decomposition outperform the state-of-the-art models in terms of BLEU between output and human-written reformulations.

Style transfer
In this work we experiment with extensions of the model described in (Hu et al., 2017a), using the Texar framework. To generate plausible sentences with specific semantic and stylistic features, every sentence is conditioned on a representation vector $z$, which is concatenated with a particular code $c$ that specifies the desired attribute, see Figure 1. Under the notation introduced in (Hu et al., 2017a), the base autoencoder (AE) includes a conditional probabilistic encoder $E$, defined with parameters $\theta_E$, that infers the latent representation $z$ given input $x$. The generator $G$, defined with parameters $\theta_G$, is a GRU-RNN that generates an output $\hat{x}$, defined as a sequence of tokens $\hat{x} = \hat{x}_1, \ldots, \hat{x}_T$, conditioned on the latent representation $z$ and a stylistic component $c$; the two are concatenated and give rise to a generative distribution $\hat{x} \sim G(z, c) = p_G(\hat{x} \mid z, c)$.
The encoder and generator form an AE with the reconstruction loss given in Equation (1). This standard reconstruction loss, which drives the generator to produce realistic sentences, is combined with two additional losses. The first comes from a discriminator that provides extra learning signals, forcing the generator to produce coherent attributes that match the structured code in $c$. Since it is impossible to propagate gradients from the discriminator through the discrete sample $\hat{x}$, we use a deterministic continuous approximation: a "soft" generated sentence, denoted $\tilde{G} = \tilde{G}_\tau(z, c)$, with the "temperature" $\tau$ annealed towards zero ($\tau \to 0$) as training proceeds. The resulting soft generated sentence is fed into the discriminator to measure its fitness to the target attribute, which leads to the loss given in Equation (2).
Finally, under the assumption that each structured attribute of generated sentences is controlled through the corresponding code in $c$ and is independent of $z$, one would like to ensure that other, not explicitly modelled attributes do not entangle with $c$. This is addressed by the dedicated loss given in Equation (3). The training objective for the baseline, shown in Figure 1, is therefore a sum of the losses from Equations (1)-(3), given in Equation (4), where $\lambda_c$ and $\lambda_z$ are balancing parameters.
Figure 1: The generative model, where style is a structured code targeting sentence attributes to control. Blue dashed arrows denote the proposed independence constraint of latent representation and controlled attribute, see (Hu et al., 2017a) for the details.
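For reference, Equations (1)-(4) plausibly take the following form, following the notation of Hu et al. (2017a). This is a sketch under assumptions rather than the authors' verbatim definitions: the names $\mathcal{L}_{ae}$, $\mathcal{L}_{c}$, $\mathcal{L}_{z}$, the discriminator distribution $q_D$, and the sign conventions are assumed here.

```latex
\begin{align}
\mathcal{L}_{ae}(\theta_G, \theta_E; x) &= -\mathbb{E}_{q_E(z \mid x)}\left[\log p_G(x \mid z, c)\right] \tag{1} \\
\mathcal{L}_{c}(\theta_G) &= -\mathbb{E}_{p(z)p(c)}\left[\log q_D\big(c \mid \tilde{G}_\tau(z, c)\big)\right] \tag{2} \\
\mathcal{L}_{z}(\theta_G) &= -\mathbb{E}_{p(z)p(c)}\left[\log q_E\big(z \mid \tilde{G}_\tau(z, c)\big)\right] \tag{3} \\
\mathcal{L}_{baseline} &= \mathcal{L}_{ae} + \lambda_c \mathcal{L}_{c} + \lambda_z \mathcal{L}_{z} \tag{4}
\end{align}
```

Under this reading, (1) is the reconstruction term, (2) rewards soft outputs whose attribute the discriminator recognises as $c$, and (3) rewards soft outputs from which the encoder recovers the same $z$.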
Let us propose two further extensions of this baseline architecture. To improve the reproducibility of the research, the code of the studied models is open. Both extensions aim to improve the quality of information decomposition within the latent representation. In the first one, shown in Figure 2, a dedicated discriminator is added to the model to ensure that the latent representation does not contain stylistic information. The loss of this discriminator is defined in Equation (5). Here a discriminator, denoted $D_z$, tries to predict the code $c$ from the representation $z$. Combining the loss defined by Equation (4) with the adversarial component defined in Equation (5) yields the learning objective of Equation (6), where $\mathcal{L}_{baseline}$ is the sum defined in Equation (4) and $\lambda_{D_z}$ is a balancing parameter.
The second extension of the baseline architecture does not use an adversarial component $D_z$ that tries to eradicate information about $c$ from the component $z$. Instead, the system, shown in Figure 3, feeds the "soft" generated sentence $\tilde{G}$ into the encoder $E$ and checks how close the representation $E(\tilde{G})$ is to the original representation $z = E(x)$ in terms of the cosine distance. We further refer to it as the shifted autoencoder, or SAE. Ideally, both $E(\tilde{G}(E(x), c))$ and $E(\tilde{G}(E(x), \bar{c}))$, where $\bar{c}$ denotes an inverse style code, should be equal to $E(x)$ (this notation is valid under the assumption that every stylistic attribute is a binary feature). The loss of the shifted autoencoder, Equation (7), has two balancing parameters, $\lambda_{cos}$ and $\lambda_{cos^-}$, and two additional terms, namely the cosine distances between the softened output processed by the encoder and the encoded original input, defined as
$$\mathcal{L}_{cos}(x, c) = \cos\big(E(\tilde{G}(E(x), c)),\, E(x)\big), \qquad \mathcal{L}_{cos^-}(x, c) = \cos\big(E(\tilde{G}(E(x), \bar{c})),\, E(x)\big). \tag{8}$$
Figure 3: The generative model with a dedicated loss added to control that the semantic representation of the output, when processed by the encoder, is close to the semantic representation of the input.
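The two cosine terms of the shifted autoencoder can be illustrated with a minimal sketch. It assumes that "cosine distance" means one minus cosine similarity and that both terms are added to the baseline loss with positive weights; the function names `cosine_distance` and `sae_loss` and the default weights are hypothetical, and the encodings are passed in as plain lists rather than produced by a real encoder.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sae_loss(base_loss, z_input, z_same_style, z_inverse_style,
             lambda_cos=1.0, lambda_cos_inv=1.0):
    """Baseline loss plus two weighted cosine terms (assumed form).

    z_input         : E(x), encoding of the original sentence
    z_same_style    : encoding of the soft output generated with code c
    z_inverse_style : encoding of the soft output generated with the
                      inverse code (1 - c for a binary attribute)
    """
    l_cos = cosine_distance(z_same_style, z_input)
    l_cos_inv = cosine_distance(z_inverse_style, z_input)
    return base_loss + lambda_cos * l_cos + lambda_cos_inv * l_cos_inv
```

Both extra terms vanish only when the re-encoded outputs point in the same direction as `z_input`, which is the behaviour the SAE is meant to encourage for either style code.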
We also study a combination of both approaches described above, shown in Figure 4. Tikhonov et al. (2019) carry out a series of experiments for these architectures. In this contribution, we work with the same data set of human-labeled positive and negative reviews but focus solely on the quality of information decomposition.

Information decomposition for texts
As we have mentioned earlier, several recent contributions rely on the idea that decomposing different aspects of textual information into various components of a latent representation might be helpful for the task of style transfer. To our knowledge, this supposition is rarely addressed rigorously. The majority of the arguments in favor of architectures based on information decomposition are intuitive and qualitative rather than quantitative. Moreover, there are specific arguments against this idea.
In particular, Subramanian et al. (2018) show that information decomposition does not necessarily occur in autoencoder-based systems, using a method developed in (Fu et al., 2018). Subramanian et al. (2018) demonstrate that as training proceeds, the internal discriminator, which is trained adversarially and jointly with the model, gets worse at predicting the sentiment of the input. However, an external classifier trained separately on the resulting latent representations easily recovers the sentiment. This is a strong argument in favor of the idea that actual disentanglement does not happen. Instead of decomposing the semantic and stylistic aspects of information, the encoder merely 'tricks' the internal classifier and 'hides' stylistic information in the semantic component, ending up in some local optimum.

Empirical measure of information decomposition quality
The Yelp! reviews dataset (https://www.yelp.com/dataset), recently enhanced with human-written reformulations by (Tian et al., 2018), is currently one of the most frequently used benchmarks for textual style transfer. It consists of restaurant reviews split into two categories, namely, positive and negative. For every review there is a human-written reformulation with the opposite sentiment, which is commonly used as the ground truth for estimating task performance. We applied an empirical method to estimate the quality of information decomposition to the architectures described in Section 3 as well as to the architectures developed by (Tian et al., 2018). An external classifier was trained from scratch to predict the style of a message using the component $z$ of the latent representation produced by an encoder. If information decomposition does not happen, one would expect the accuracy of the external classifier to be close to 1. This would mean that, despite intuitive expectations, information about the style of a message is present in $z$. If decomposition were effective, the accuracy of the external classifier would be close to 0.5. Tikhonov et al. (2019) show that style transfer methods yield varying accuracy and BLEU across retrains, so in this paper the accuracy of the external classifier and BLEU between the system's output and human-written reformulations were measured after four independent retrains. Figure 5 shows the results of these experiments.
Figure 5: BLEU between system's output and human-written reformulations seems to be higher if the accuracy of an external classifier is closer to one half. Systems that decompose information better tend to show higher BLEU.
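The external-classifier test can be sketched on synthetic data. The example below is illustrative only: it assumes a logistic-regression probe and toy 4-dimensional "latent" vectors in which a hypothetical `leak` parameter controls how much style information is written into the first coordinate. It is not the classifier or the data used in the experiments, and all function names are hypothetical.

```python
import math
import random


def make_latents(n, leak, rng):
    """Toy latent vectors; `leak` shifts coordinate 0 by the style label."""
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        z = [rng.gauss(0.0, 1.0) for _ in range(4)]
        z[0] += leak * (2 * y - 1)
        features.append(z)
        labels.append(y)
    return features, labels


def train_probe(features, labels, epochs=100, lr=0.5):
    """Logistic regression trained from scratch with batch gradient descent."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-logit))
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of held-out examples whose style the probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


def probe_experiment(leak, seed=0):
    """Train the probe on one sample of latents and test it on another."""
    rng = random.Random(seed)
    train_x, train_y = make_latents(400, leak, rng)
    test_x, test_y = make_latents(400, leak, rng)
    w, b = train_probe(train_x, train_y)
    return probe_accuracy(w, b, test_x, test_y)
```

Calling `probe_experiment(leak=3.0)` mimics a representation that still carries style, so the probe's accuracy should approach 1, whereas `probe_experiment(leak=0.0)` mimics a fully decomposed representation, where accuracy should stay near one half.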
The fact that the external classifier always predicts style with an accuracy above one half could be partially attributed to the fact that full decomposition of sentiment and semantics is hardly attainable. For example, adjectives such as "delicious" or "yummy" combine positive sentiment with the semantics of taste, whereas "polite" or "friendly" in Yelp! reviews combine positive sentiment with the semantics of service. This internal entanglement of sentiment and semantics is discussed in detail in (Tikhonov and Yamshchikov, 2018c). It is essential to mention that the very fact that semantics and stylistics are entangled on the level of words does not rule out the theoretical possibility of building a latent representation where they are fully disentangled. In any case, Figure 5 demonstrates that the quality of the disentanglement is much better for SAE-type architectures. Since the shifted autoencoder controls the cosine distance between soft output and input, the encoder has to disentangle the semantic component rather than "hide" the sentiment information from the discriminator.
Figure 6 shows how state-of-the-art approaches compare to each other in terms of BLEU between output and human-written reformulations. All systems were retrained five times from scratch to report error margins of the methods, since the results are noisy. BLEU between output and human-written reformulations is higher for lower values of external classifier accuracy. Systems that perform better in terms of information decomposition outperform systems with lower quality of information decomposition. Moreover, the system that does not rely on the idea of disentangled latent representations at all shows weaker results than systems with high information disentanglement. It is important to note that there is a variety of methods to assess the quality of style transfer, such as the PINC (Paraphrase In N-gram Changes) score (Carlson et al., 2018), POS distance (Tian et al., 2018), language fluency, etc. The methodology of style transfer quality assessment is addressed in detail in (Tikhonov et al., 2019), but BLEU between output and input is a very natural all-purpose metric for tasks of this type and is common in the style transfer literature.
Figure 6: Overview of the BLEU between output and human-written reformulations of Yelp! reviews. The architecture with an additional discriminator, the shifted autoencoder (SAE) with additional cosine losses, and a combination of these two architectures, measured after five re-runs, outperform the baseline by (Hu et al., 2017a) as well as other state-of-the-art models. Results of (Romanov et al., 2018) are not displayed due to the absence of self-reported BLEU scores.
Tables 1-2 allow one to compare random examples for different architectures. Generally, the baseline and the discriminator-based architecture perform poorly once the syntax of a review is irregular or if there are omissions in the text. SAE-based architectures tend to preserve the semantic component better. They also add sentimentally charged words at random less often than the baseline and the discriminator-based architecture.
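To make the BLEU comparison concrete, here is a minimal sketch of sentence-level BLEU (Papineni et al., 2002) against a single reference. It assumes whitespace tokenization, n-grams up to length 4 with uniform weights, and no smoothing, so any candidate without a matching 4-gram scores zero; it is a simplified illustration and not the evaluation script behind the reported numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped counts: an n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates that are not longer than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(ref) / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)
```

With a human-written reformulation as `reference` and a system output as `candidate`, the score is 1 for an exact match and decreases as fewer n-grams are shared. Corpus-level BLEU, as usually reported, aggregates the n-gram counts over all sentence pairs before taking the geometric mean.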

Preservation of semantic component
Another way to quantify the quality of latent representations is to calculate the cosine distance and KL-divergence between semantic components of latent representations for the inputs and the corresponding outputs. If we believe that the latent representation captures the semantics of the input that should be preserved in the output, the ideal behavior of the system is to produce equal latent representations for both the input and the output phrase. Indeed, Figure 7 shows that SAE manages to learn a space of latent representations in which semantic components of inputs and outputs are always equal to each other. The architecture with an additional stylistic discriminator also shows lower cosine distances and lower KL-divergences than the baseline. These results are in line with the measurements discussed above in Section 4.1.
Figure 7: Comparison of cosine distances and KL-divergences between semantic components of latent representations for inputs and outputs. After 12 epochs of training, SAE makes the semantic component $z$ for every output equal to the semantic component for the corresponding input. The discriminator corresponds to lower values of KL-divergence and cosine distance than the baseline (Hu et al., 2017a).
To get an intuition of how the resulting latent space differs for different architectures, one can look at the t-SNE visualizations (Maaten and Hinton, 2008) of the latent representations of the data that different systems produce. In Figure 8, one can see that the baseline latent representations easily allow recovering the sentiment.
In contrast with the baseline, the architecture with an additional discriminator obtains better disentanglement. Figure 9 shows that in this case it is harder to recover the sentiment of a sentence from its latent representation. SAE not only shows a higher level of disentanglement but also produces equal semantic components for the input and the corresponding output. Judging by Figure 10, this makes SAE representations denser in certain areas of the semantic space and sparser in others.
Table 1: Random examples: input, human-written reformulation, and baseline output.
input | human | baseline
the carne asada burrito is awesome! | the carne asada burrito is awful! | the worst asada burrito is gross!
the rooms are not that nice and the food is not that good either. | the rooms were spacious and food was very well cooked | the rooms are excellent that nice and the food is not that good either.
it was so delicious; i've never had anything like it! | everything tasted bad, nothing i liked | it was so rude; i've never had anything like it!
so, that was my one and only time ordering the benedict there. | i will be ordering the benedict | so, that was my one and best time ordering the benedict there.
Table 2: Random examples for different architectures (each row lists three system outputs).
we did perfected want to incredible.
ridiculous a place to keep in mind. | would a place to keep in mind. | wont a place to keep in mind.
firstly, their project are generally higher than other places. | firstly, their draw are generally higher than other places. | sheila, their round are generally higher than other places.
horrific the trap -tea at the gut. | dumb the afternoon -tea at the rabbit. | wtf the afternoon -tea at the slim.
Aligning the results shown in Figures 5-10, one can clearly see several crucial things: (1) architectures based on the idea of disentangled latent representations show varying performance in terms of BLEU between output and human-written reformulations; (2) architectures with higher quality of information decomposition, in terms of correlation or KL-divergence between representations for input and output, show higher performance; (3) architectures that produce equal semantic components for a given input and the corresponding output show the highest performance; (4) these results are aligned with the empirical estimation of decomposition quality with external classifiers, which shows that architectures that more successfully disentangle the semantics of the input from its stylistics tend to perform better.
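The KL-divergence comparison between the semantic components of an input and its output can be sketched as follows. How a latent vector is turned into a probability distribution is not specified in the text, so the softmax normalization used here is an assumption, and the function names are hypothetical.

```python
import math


def softmax(v):
    """Map a real-valued vector to a probability distribution."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def latent_kl(z_input, z_output):
    """KL-divergence between softmax-normalized latent vectors (assumed form)."""
    return kl_divergence(softmax(z_input), softmax(z_output))


def mean_latent_kl(pairs):
    """Average divergence over (z_input, z_output) pairs."""
    return sum(latent_kl(z_in, z_out) for z_in, z_out in pairs) / len(pairs)
```

A system that reproduces the input's semantic component exactly, as reported for SAE, would give a mean divergence of zero under this measure; larger values indicate that the output's semantic component has drifted from that of the input.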

Conclusion
This paper addresses the question of information decomposition for the task of textual style transfer. We propose three new architectures that use latent representations to decompose the stylistic and semantic information of the input. Two different methods to assess the quality of such decomposition are proposed. It is shown that architectures that produce an equal semantic component of latent representations for the input and the corresponding output outperform state-of-the-art architectures in terms of BLEU between output and human-written reformulations. An empirical method to assess the quality of information decomposition is proposed. There is a correspondence between higher BLEU between output and human-written reformulations and better quality of information decomposition.
Figure 8: t-SNE visualisation of the obtained latent representations for the baseline (Hu et al., 2017a). Red dots represent positive reviews. Blue dots represent negative reviews. One can clearly see that stylistic information can be recovered from the representation.
Figure 9: t-SNE visualisation of the obtained latent representations for the architecture with an additional discriminator. Red dots represent positive reviews. Blue dots represent negative reviews. One can see that it is harder to recover stylistic information from the representation.
Figure 10: t-SNE visualisation of the obtained latent representations for the shifted autoencoder. Red dots represent positive reviews. Blue dots represent negative reviews. One can see that it is harder to recover stylistic information from the representation and that the structure of the representation space differs significantly from the latent representation space obtained by the baseline.