A Hierarchical Reinforced Sequence Operation Method for Unsupervised Text Style Transfer

Unsupervised text style transfer aims to alter text styles while preserving the content, without aligned data for supervision. Existing seq2seq methods face three challenges: 1) the transfer is weakly interpretable, 2) generated outputs struggle in content preservation, and 3) the trade-off between content and style is intractable. To address these challenges, we propose a hierarchical reinforced sequence operation method, named Point-Then-Operate (PTO), which consists of a high-level agent that proposes operation positions and a low-level agent that alters the sentence. We provide comprehensive training objectives to control the fluency, style, and content of the outputs and a mask-based inference algorithm that allows for multi-step revision based on the single-step trained agents. Experimental results on two text style transfer datasets show that our method significantly outperforms recent methods and effectively addresses the aforementioned challenges.


Introduction
Text style transfer aims to convert a sentence of one style into another while preserving the style-independent content (Shen et al., 2017; Fu et al., 2018). In most cases, aligned sentences are not available, which requires learning from non-aligned data. Previous work mainly learns disentangled content and style representations using seq2seq (Sutskever et al., 2014) models and decomposes the transfer into neutralization and stylization steps. Although impressive results have been achieved, three challenges remain: 1) the interpretability of the transfer procedure is still weak in seq2seq models, 2) generated sentences are usually highly stylized with poor content preservation, and 3) the trade-off between content preservation and style polarity is intractable.
Figure 1: Our proposed Point-Then-Operate (PTO) applied to a real test sample: "I will be going back and enjoying this great place !" → "I will be going back and enjoying this horrible place !" → "I will be going back and avoid this horrible place !" → "I will not be going back and avoid this horrible place !". A high-level agent (red squares) iteratively proposes operation positions, and a low-level agent (arrows) alters the sentence based on the high-level proposals. Compared with seq2seq methods, PTO is more interpretable and better preserves style-independent content.
To address these challenges, we propose a sequence operation-based method within the hierarchical reinforcement learning (HRL) framework, named Point-Then-Operate (PTO). It consists of a hierarchy of a high-level agent that proposes operation positions and a low-level agent that alters the sentence based on high-level proposals. We propose a policy-based training algorithm to model the key aspects in text style transfer, i.e., fluency, style polarity, and content preservation. For fluency, we use a language model reward; for style polarity, we introduce a classification confidence reward and an auxiliary classification task; for content preservation, we adopt a reconstruction reward and a self-supervised reconstruction loss. We introduce a mask-based inference algorithm that applies multi-step sequence operations to the input sentence, allowing for single-step training, which is more stable. Figure 1 shows an example of our method applied to a real test sample from Yelp.
Compared with existing seq2seq methods, our sequence operation method has three merits. 1) Interpretability: our method explicitly models where and how to transfer. 2) Content preservation: sequence operations are targeted at stylized parts; thus, style-independent content can be better preserved. 3) Controllable trade-off: the trade-off between content preservation and style polarity can be tuned in our method. Specifically, we tune it by biasing the number of operation steps. We conduct extensive experiments on two text style transfer datasets, i.e., Yelp and Amazon. We show that our proposed method outperforms recent methods and that it addresses the challenges of existing seq2seq methods. The contributions of this paper are:
• We propose a sequence operation method, i.e., Point-Then-Operate, for unsupervised text style transfer. The transfer procedure is modeled as explicit revisions on the input sentences, which improves interpretability and content preservation and makes the style-content trade-off controllable.
• The method is interpreted and trained in the HRL framework with a high-level agent that proposes operation positions and a low-level agent that applies explicit operations. We design comprehensive learning objectives to capture three important aspects of text style transfer and propose a mask-based inference algorithm that allows for multi-step revision based on the single-step trained agents.
• Experiments on Yelp and Amazon show that our method significantly improves BLEU, fluency, and content preservation compared with recent methods and effectively addresses the aforementioned challenges.

Related Work
Text Style Transfer Most work on text style transfer learns disentangled representations of style and content. We categorize these methods based on how they represent content. Hidden vector approaches represent content as hidden vectors, e.g., Hu et al. (2017) adversarially incorporate a VAE and a style classifier; Shen et al. (2017) propose a cross-aligned AE that adversarially aligns the hidden states of the decoder; Fu et al. (2018) design a multi-decoder model and a style-embedding model for better style representations; other work uses language models as style discriminators; John et al. (2018) utilize bag-of-words prediction for better disentanglement of style and content. Deletion approaches represent content as the input sentence with stylized words deleted, e.g., one line of work deletes stylized n-grams based on corpus-level statistics and stylizes the result based on similar, retrieved sentences; another jointly trains a neutralization module and a stylization module with reinforcement learning; Zhang et al. (2018a) facilitate the stylization step with a learned sentiment memory. As far as we know, two works avoid disentangled representations. Zhang et al. (2018b) construct a pseudo-aligned dataset with an SMT model and then learn two NMT models jointly and iteratively. A concurrent work, Luo et al. (2019), proposes to learn two dual seq2seq models between two styles via reinforcement learning, without disentangling style and content.
Sequence Operation Methods Our work is also closely related to sequence operation methods, which are widely used in SMT (Durrani et al., 2011, 2015; Pal et al., 2016) and have started to attract attention in NMT (Stahlberg et al., 2018). Compared with methods based on seq2seq models, sequence operation methods are inherently more interpretable (Stahlberg et al., 2018). Notably, our method is revision-based, i.e., it operates directly on the input sentence and does not generate from scratch as in machine translation systems.
Hierarchical Reinforcement Learning In this work, we adopt the Options Framework (Sutton et al., 1999) in HRL, in which a high-level agent learns to determine more abstract options and a low-level agent learns to take less abstract actions given the option. Recent work has shown that HRL is effective in various tasks, e.g., Atari games (Kulkarni et al., 2016), relation classification (Feng et al., 2018), relation extraction (Takanobu et al., 2018), and video captioning.

Formulation
We start by formalizing the problem of interest. Given two non-aligned sets of sentences $X_1 = \{x_1^{(1)}, \cdots, x_1^{(n)}\}$ of style $s_1$ and $X_2 = \{x_2^{(1)}, \cdots, x_2^{(m)}\}$ of style $s_2$, unsupervised text style transfer aims to learn two conditional distributions $p(x_{1 \to 2} \mid x_1)$ and $p(x_{2 \to 1} \mid x_2)$, which alter the style of a sentence and preserve the style-independent content. However, defining content is not trivial. Different from previous text style transfer methods that explicitly model content with disentangled representations, we implicitly model content with reconstruction, similar to the idea adopted in CycleGAN (Zhu et al., 2017). Given the discrete nature of natural language texts, we use sequence operations to approximate $p(x_{1 \to 2} \mid x_1)$ and $p(x_{2 \to 1} \mid x_2)$. In our notation, $x_{1 \to 2}$ and $x_{2 \to 1}$ are transferred sentences, which are the outputs of a text style transfer system; $\hat{x}_2$ and $\hat{x}_1$ are operated sentences, which are not necessarily fully transferred.

Our Approach
Our proposed sequence operation-based method, Point-Then-Operate (PTO), decomposes style transfer into two steps: 1) finding where to transfer and 2) determining how to transfer. It could be naturally formulated as an HRL problem, in which a high-level agent (i.e., pointer) proposes operation positions and a low-level agent (i.e., operators) alters the sentence based on high-level proposals.
In this section, we first briefly review the Options Framework in HRL. Then we introduce the proposed pointer module (§4.2) and operator modules (§4.3). The training algorithm is in §4.4, in which two extrinsic rewards, an intrinsic reward, and a self-supervised loss are proposed for fluency, style polarity, and content preservation. The inference algorithm is in §4.5, in which a mask mechanism is proposed to iteratively and dynamically apply sequence operations to the input.

Review: The Options Framework in HRL
The Options framework (Sutton et al., 1999) is a well-known formulation in HRL. We denote the state space as $S$, the option space as $O$, and the action space as $A$. The high-level agent learns a stochastic policy $\mu : S \times O \to [0, 1]$. The low-level agent learns a stochastic policy $\pi_o : S \times A \to [0, 1]$, conditioned on an option $o \in O$. Additionally, each option $o \in O$ has a low-level stochastic termination condition $\beta_o : S \to [0, 1]$, which indicates whether the current option should end. In each episode, the high-level agent executes a trajectory $(o_1, \cdots, o_L)$ based on $\mu$; once an option $o_t$ is sampled, the low-level agent executes a trajectory $(a_t^1, \cdots, a_t^{l_t})$ based on $\pi_{o_t}$, where $l_t$ depends on $\beta_{o_t}$. Intuitively, the flattened trajectory for one episode is $(o_1, a_1^1, \cdots, a_1^{l_1}, \cdots, o_L, a_L^1, \cdots, a_L^{l_L})$.
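The flattened trajectory described above can be illustrated with a minimal sketch. The function name `run_episode`, the callable interfaces for the two policies and the termination condition, and the toy policies are all hypothetical; the state is held fixed here for simplicity, which a real environment would not do.

```python
import random

def run_episode(mu, pi, beta, state, num_options, rng):
    """Roll out one episode and return the flattened trajectory.

    mu(state, rng)           -> an option (high-level policy)
    pi(option, state, rng)   -> an action (low-level policy under that option)
    beta(option, state, rng) -> True when the current option should end
    """
    trajectory = []
    for _ in range(num_options):
        option = mu(state, rng)
        trajectory.append(("option", option))
        while True:
            action = pi(option, state, rng)
            trajectory.append(("action", action))
            if beta(option, state, rng):
                break  # hand control back to the high-level agent
    return trajectory

# Toy setup (assumed): 3 options, 2 actions, termination with probability 0.5.
rng = random.Random(0)
traj = run_episode(
    mu=lambda s, r: r.randrange(3),
    pi=lambda o, s, r: r.randrange(2),
    beta=lambda o, s, r: r.random() < 0.5,
    state=None,
    num_options=4,
    rng=rng,
)
```

Every option is followed by at least one action, so the list has the interleaved shape of the flattened trajectory.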

Table 1: Operator modules. Parameters $\phi_1$, $\phi_2$, and $\phi_3$ are meant to generate their corresponding $\hat{w}$.
Module            Operation
$IF_{\phi_1}$     Insert a word $\hat{w}$ in Front of the position
$IB_{\phi_2}$     Insert a word $\hat{w}$ Behind the position
$Rep_{\phi_3}$    Replace it with another word $\hat{w}$
DC                Delete the Current word
DF                Delete the word in Front of the position
DB                Delete the word Behind the position
Skip              Do not change anything

High-Level Agent: Pointer
The high-level policy µ aims to propose operation positions; thus, we model it as an attention-based (Bahdanau et al., 2015) pointer network, which assigns normalized probability to each position.
Option Given a sentence $x = \{x_1, \cdots, x_T\}$, the option space is $O = \{1, \cdots, T\}$. Note that $T$ changes within an episode, since operations may change the length of a sentence.
State The state is represented by the sentence representation $h_T$ and each position representation $h_i$.
Policy We adopt an attention-based policy $\mu$:
$$\mu(i \mid x) = \frac{\exp(a(h_T, h_i))}{\sum_{t=1}^{T} \exp(a(h_T, h_t))} \quad (1)$$
where $a(\cdot, \cdot)$ is the scoring function for attention, and $i \in \{1, \cdots, T\}$ denotes each position in the input sentence.
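As an illustration of an attention-based pointer policy, the sketch below turns per-position scores into a normalized distribution over positions. The dot-product scoring function and the name `pointer_policy` are assumptions made for the example; the paper only states that some scoring function $a(\cdot, \cdot)$ is used.

```python
import math

def pointer_policy(h_sentence, h_positions):
    """Return a normalized probability for each position.

    Scores use a dot product between the sentence representation and each
    position representation (an assumed choice of scoring function).
    """
    scores = [sum(a * b for a, b in zip(h_sentence, h_i)) for h_i in h_positions]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-dimensional representations for a 3-word sentence.
probs = pointer_policy([1.0, 0.0], [[0.1, 0.3], [2.0, -1.0], [0.5, 0.5]])
```

Sampling an index from `probs` then corresponds to the high-level agent proposing an operation position.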

Low-Level Agent: Operators
The low-level policy $\pi$ alters the sentence around the position $i$ (i.e., option) sampled from $\mu$. We restrict the operations to those listed in Table 1. Note that these operations are sufficient to generate any natural language sentence in multiple steps.
Action Given the sentence $x = \{x_1, \cdots, x_T\}$ and the operation position $i$, the action of the low-level agent can be decomposed into two steps:
1. Operator selection. Select an operator module from Table 1.
2. Word generation (optional). Generate a word, if necessary, as specified in Table 1.
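The operator modules of Table 1 can be sketched as plain list edits. This is a minimal illustration, not the trained modules: the word to insert or replace is passed in by the caller rather than generated, the labels `IF` and `IB` for the two insertion operators are assumed by analogy with DF and DB, positions are 0-based, and treating an out-of-range neighbor as a no-op is an assumption.

```python
def apply_operator(tokens, i, op, word=None):
    """Apply one operator at 0-based position i and return a new token list."""
    result = list(tokens)
    if op == "IF":        # insert a word in front of the position
        result.insert(i, word)
    elif op == "IB":      # insert a word behind the position
        result.insert(i + 1, word)
    elif op == "Rep":     # replace the current word
        result[i] = word
    elif op == "DC":      # delete the current word
        del result[i]
    elif op == "DF":      # delete the word in front of the position
        if i > 0:
            del result[i - 1]
    elif op == "DB":      # delete the word behind the position
        if i + 1 < len(result):
            del result[i + 1]
    elif op == "Skip":    # leave the sentence unchanged
        pass
    else:
        raise ValueError(f"unknown operator: {op}")
    return result

# Replaying the three steps of the Figure 1 example.
sentence = "I will be going back and enjoying this great place !".split()
step1 = apply_operator(sentence, 8, "Rep", "horrible")
step2 = apply_operator(step1, 6, "Rep", "avoid")
step3 = apply_operator(step2, 1, "IB", "not")
```

Each step touches only the neighborhood of the proposed position, which is why the style-independent words survive unchanged.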
State Compared with the high-level agent, our low-level agent focuses on features that are more local. We map $x$ to $\{h_1, \cdots, h_T\}$ through a bi-LSTM encoder and take $h_i$ as the state representation.
Low-Level Termination Condition Different from the original Options Framework in which a stochastic termination condition β o is learned, we adopt a deterministic termination condition: the low-level agent takes one action in each option and terminates, which makes training easier and more stable. Notably, it does not harm the expressiveness of our method, since multiple options can be executed.
Policy for Operator Selection For training, we adopt a uniform policy for operator selection, i.e., we uniformly sample an operator module from Table 1. In preliminary experiments, we explored a learned policy for operator selection. However, we observed that the learned policy quickly collapses to a nearly deterministic choice of $Rep_{\phi_3}$. Our explanation is that, in many cases, replacing a stylized word is the optimal choice for style transfer. Thus, the uniform policy ensures that all operators are trained on sufficient and diversified data. For inference, we adopt a heuristic policy based on fluency and style polarity, detailed in §4.5.3.
Policy for Word Generation As shown in Table 1, three operators are parameterized, which are burdened with the task of generating a proper word to complete the action. For each parameterized operator $M$, the probability of generating $\hat{w}$ is conditioned on the low-level state, i.e., the representation of the operation position $i$.
Notably, for each $M$ we train two sets of parameters for $s_1 \to s_2$ and $s_2 \to s_1$. For readability, we omit the direction subscripts and assume that they can be inferred from context; parameters of the opposite direction are denoted as $\phi'_1$, $\phi'_2$, and $\phi'_3$.

Hierarchical Policy Learning
We introduce comprehensive training objectives to model the key aspects in text style transfer, i.e., fluency, style polarity, and content preservation. For fluency, we use an extrinsic language model reward; for style polarity, we use an extrinsic classification confidence reward and incorporate an auxiliary style classification task; for content preservation, we use a self-supervised reconstruction loss and an intrinsic reconstruction reward. In the following parts, we only illustrate equations related to $x_1 \to \hat{x}_2$ operations and $\hat{x}_2 \to x_1$ reconstructions for brevity; the opposite direction can be derived by swapping 1 and 2. The training algorithm is presented in Algorithm 1. A graphical overview is shown in Figure 2.
Algorithm 1: Point-Then-Operate Training
1: Input: Non-aligned sets of sentences $X_{1,2}$
2: Initialize $\theta$, $\phi_{1,2,3}$
3: Train language models $LM_2$ on $X_2$
4: Pre-train $\theta$ by optimizing $L^{\theta}_{cls}$ (Eq. 6)
5: for each iteration $i = 1, 2, \cdots, m$ do
6:   Sample $x_1$ from $X_1$
7:   Sample $i$ from $\mu_\theta(i \mid x_1)$ (Eq. 1)
8:   Uniformly sample $M$
...
11:  Update $\theta$ based on $L^{\theta}_{cls}$ and $\nabla_\theta J(\theta)$ (Eq. 6 and 9)
12:  Get $M'$ and $i'$
...
19:  Update $\phi$ with $\nabla_\phi J(\phi)$ (Eq. 11)
20:  end if
21: end for

Modeling Fluency
Language Model Reward To improve fluency, we adopt a language model reward. Let $LM_1$ and $LM_2$ denote the language models for $s_1$ and $s_2$, respectively. Given the generated word $\hat{w}$ in the operated sentence $\hat{x}_2$, the language model reward is defined in terms of $LM_2(\hat{w} \mid \hat{x}_2)$, which denotes the probability of $\hat{w}$ given the other words in $\hat{x}_2$. In our experiments, the probability is computed by averaging a forward LSTM-LM and a backward LSTM-LM.
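The averaging of a forward and a backward language model can be sketched as follows. The callable interface `lm(context, word) -> probability`, the function name, and the toy models are hypothetical stand-ins for trained LSTM-LMs; only the averaging step comes from the text.

```python
def bidirectional_word_prob(tokens, i, forward_lm, backward_lm):
    """Average the probability of tokens[i] under a forward and a backward LM.

    forward_lm sees the words to the left in reading order; backward_lm sees
    the words to the right, reversed so that it reads right-to-left.
    """
    left_context = tokens[:i]
    right_context = tokens[i + 1:][::-1]
    p_forward = forward_lm(left_context, tokens[i])
    p_backward = backward_lm(right_context, tokens[i])
    return 0.5 * (p_forward + p_backward)

# Toy language models (assumed) that only look at the word or its neighbor.
toy_forward = lambda context, word: 0.2 if word == "horrible" else 0.05
toy_backward = lambda context, word: 0.4 if context[-1:] == ["place"] else 0.1

sentence = "i will be going back and enjoying this horrible place !".split()
score = bidirectional_word_prob(sentence, 8, toy_forward, toy_backward)
```

Using both directions lets the score reflect how well the generated word fits the words on either side of it.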

Modeling Style Polarity
Classification Confidence Reward We observe that language models are not adequate to capture style polarity; thus, we encourage a larger change in the confidence of a style classifier by adopting a classification confidence reward $R_{conf}$, for which we reuse the classifier defined in Eq. 5.
Auxiliary Task: Style Classification In HRL, the high-level policy usually suffers from high gradient variance, since the estimated gradients depend on the poorly trained low-level policy. To stabilize the high-level policy learning, we introduce auxiliary supervision to the pointer. Specifically, we extend the pointer to an attention-based classifier $p(s_j \mid \cdot)$ for $j = 1, 2$ (Eq. 5). Let $\theta$ denote the parameters of the pointer. The auxiliary classification loss for $\theta$ is $L^{\theta}_{cls}$ (Eq. 6). The underlying assumption is that positions with larger attention weights for classification are more likely to be critical to style transfer.

Modeling Content Preservation
Self-Supervised Reconstruction Loss To improve content preservation, we propose a reconstruction loss that guides the operator modules with self-supervision. Suppose the word $w$ at the $i$-th position is deleted or replaced by operator $M$; we identify the reconstruction operator $M'$ and reconstruction position $i'$ in Table 2. Then $M'$ is updated with MLE, by operating on position $i'$ in $\hat{x}_2$ with $w$ as gold output. For those with two $(M', i')$ pairs, we uniformly sample one for training. Formally, the reconstruction loss is defined as
$$L_{rec} = -\log M'(w \mid \hat{x}_2, i') \quad (7)$$
where $M'(w \mid \hat{x}_2, i')$ is the probability that $M'$ generates $w$ when operating on position $i'$ of $\hat{x}_2$.
Reconstruction Reward One-to-one transfer (e.g., {delicious↔bland, caring↔unconcerned}) is usually preferable to many-to-one transfer (e.g., {delicious→bad, caring→bad}). Thus, we introduce a reconstruction reward for $Rep_{\phi_3}$ to encourage one-to-one transfer; it is defined in terms of $L_{rec}$, the reconstruction loss in Eq. 7.

Training with Single-Option Trajectory
Instead of executing multi-option trajectories, we only allow the high-level agent to execute a single option per episode during training, and leave the multi-option scenario to the inference algorithm (§4.5). We have two motivations for executing single-option trajectories: 1) executing multi-option trajectories is less tractable and stable, especially in the case of style transfer, which is sensitive to nuances in the sentence; 2) self-supervised reconstruction is ambiguous in a multi-option trajectory, i.e., the gold trajectory for reconstruction is not deterministic.

High-Level Policy Gradients
Since the language model reward is more local and increases the variance of estimated gradients, we only use the classification confidence reward for the high-level policy. The policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{i}\left[R_{conf} \cdot \nabla_\theta \log \mu_\theta(i \mid x_1)\right] \quad (9)$$
where gradients are detached from $R_{conf}$.
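To make the policy-gradient update concrete, the sketch below computes a single-sample REINFORCE gradient for a softmax policy over positions, taken with respect to the per-position scores. It is a generic illustration rather than the authors' implementation: the function names are hypothetical, the reward is treated as a constant (matching the remark that gradients are detached from it), and backpropagation into encoder parameters is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    max_logit = max(logits)
    exps = [math.exp(v - max_logit) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_logit_grad(logits, sampled_index, reward):
    """Gradient of reward * log softmax(logits)[sampled_index] w.r.t. logits.

    For a softmax policy this equals reward * (one_hot(sampled_index) - probs).
    """
    probs = softmax(logits)
    return [
        reward * ((1.0 if k == sampled_index else 0.0) - p)
        for k, p in enumerate(probs)
    ]

# A positive reward raises the score of the sampled position and lowers the rest.
grads = reinforce_logit_grad([0.0, 1.0, 2.0], sampled_index=2, reward=0.5)
```

Ascending this gradient makes positions whose edits earned a high reward more likely to be proposed again.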

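Low-Level Policy Gradients
All the extrinsic and intrinsic rewards are used for low-level policy learning. Specifically, the rewards for $\phi_1$ and $\phi_2$ combine the language model reward and the classification confidence reward, and the reward for $\phi_3$ additionally includes the reconstruction reward (Eq. 10). For $\phi = \phi_1, \phi_2, \phi_3$, the policy gradient is
$$\nabla_\phi J(\phi) = \mathbb{E}_{\hat{w}}\left[R \cdot \nabla_\phi \log M_\phi(\hat{w} \mid x_1, i)\right] \quad (11)$$
where $R$ is the reward for the corresponding $\phi$.
Overall Objectives The overall objectives for $\theta$ are the classification loss in Eq. 6 and the policy gradient in Eq. 9. The overall objectives for $\phi_{1,2,3}$ are the reconstruction loss in Eq. 7 and the policy gradients in Eq. 11.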

Inference
The main problems in applying single-step trained modules to the multi-step scenario are 1) previous steps of operations may influence later steps, and 2) we need to dynamically decide when the trajectory should terminate. We leverage a mask mechanism to address these problems. The basic idea is that given an input sentence, the high-level agent iteratively proposes operation positions for the low-level agent to operate around. In each iteration, the high-level agent sees the whole sentence but with some options (i.e., positions) masked in its policy. The trajectory termination condition is modeled by an additional pre-trained classifier. The algorithm for style transfer from s 1 to s 2 is detailed in Algorithm 2.

Masked Options
To tackle the first problem, we mask the options (i.e., positions) in the high-level policy that fall within the context of any word that has been inserted, replaced, or skipped (but not the context of deleted words). Note that we only mask the options in the policy but do not mask the words in the sentence (i.e., both agents still receive the complete sentence), since we cannot bias the state representations (§4.2 and §4.3) with masked tokens. We set the window size to 1 (i.e., three words are masked in each step). We find the use of a window necessary, since in many cases, e.g., negation and emphasis, a window size of 1 is capable of covering a complete semantic unit.
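The option-masking step can be sketched as follows. The helper names are hypothetical, and the sketch ignores the index shifts caused by insertions and deletions, which a full implementation would need to track.

```python
def mask_window(masked, position, length, window=1):
    """Add every position within `window` of `position` to the masked set."""
    start = max(0, position - window)
    end = min(length, position + window + 1)
    masked.update(range(start, end))
    return masked

def masked_policy(probs, masked):
    """Zero out masked positions and renormalize; None if nothing remains."""
    kept = [0.0 if k in masked else p for k, p in enumerate(probs)]
    total = sum(kept)
    if total == 0.0:
        return None
    return [p / total for p in kept]

# After operating at position 3 of a 6-word sentence, positions 2..4 are masked.
masked = mask_window(set(), position=3, length=6)
policy = masked_policy([0.1, 0.2, 0.3, 0.4], {2, 3})
```

Only the policy is masked; the sentence passed to the encoders stays complete, as the text requires.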

Termination Condition
A simple solution to the second problem is to terminate the trajectory if the operated sentence is confidently classified as the target style. The problem with this simple solution is that a highly stylized part may result in premature termination. For example, "Otherwise a terrible experience and we will go again" may be classified as negative with high confidence. Thus, we propose to mask words in the operated sentence for the termination condition. The masking strategy is the same as in §4.5.1, and masked words are replaced by <unk>. To tackle the excessive number of <unk> tokens, we train an additional classifier as defined in §4.4.2, but trained on sentences with words randomly replaced by <unk>.
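The masked termination check can be sketched as below. The threshold parameter, the function names, and the toy classifier are assumptions for illustration; in the paper the decision comes from a classifier trained on sentences with randomly inserted <unk> tokens.

```python
UNK = "<unk>"

def should_terminate(tokens, masked, target_prob, threshold):
    """Replace masked words by <unk>, then test the classifier's confidence."""
    visible = [UNK if k in masked else tok for k, tok in enumerate(tokens)]
    return target_prob(visible) >= threshold

def toy_negative_prob(tokens):
    """Toy stand-in classifier: confident only if it sees the word 'terrible'."""
    return 0.9 if "terrible" in tokens else 0.2

tokens = "otherwise a terrible experience and we will go again".split()
stops_unmasked = should_terminate(tokens, set(), toy_negative_prob, 0.8)
stops_masked = should_terminate(tokens, {1, 2, 3}, toy_negative_prob, 0.8)
```

Without masking, the already edited word dominates the decision and the trajectory would stop early; with masking, the remaining positive part of the sentence keeps the trajectory going.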

Inference Policy for Operator Selection
As discussed in §4.3, we adopt a heuristic inference policy for operator selection. Specifically, we enumerate each operator and select the operated sentence $\hat{x}_2$ that maximizes a criterion combining $LM_2(\hat{x}_2)$ and $p(s_j \mid \hat{x}_2)$, where $LM_2(\hat{x}_2)$ denotes the probability of $\hat{x}_2$ computed by the language model $LM_2$, $p(s_j \mid \cdot)$ is the classifier defined in §4.4.2, and $\eta$ is a balancing hyper-parameter.
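The enumeration over operators can be sketched as an argmax over candidate sentences. The specific criterion used here, the language model probability multiplied by the classifier probability raised to the power `eta`, is an assumed form chosen for illustration; the text only states that the criterion balances the two terms with a hyper-parameter. The toy scores are likewise hypothetical.

```python
def select_operated_sentence(candidates, lm_prob, target_prob, eta):
    """Pick the candidate maximizing lm_prob(s) * target_prob(s) ** eta.

    The multiplicative form is an assumption for this sketch.
    """
    def criterion(sentence):
        return lm_prob(sentence) * target_prob(sentence) ** eta
    return max(candidates, key=criterion)

# Toy scores: candidate "a" is more fluent, candidate "b" is more on-style.
lm_scores = {"a": 0.2, "b": 0.1}
style_scores = {"a": 0.1, "b": 0.9}

best_balanced = select_operated_sentence(
    ["a", "b"], lm_scores.get, style_scores.get, eta=1.0)
best_fluency_only = select_operated_sentence(
    ["a", "b"], lm_scores.get, style_scores.get, eta=0.0)
```

Raising `eta` shifts the choice toward candidates with stronger target-style confidence, while `eta = 0` reduces the criterion to fluency alone.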

Datasets
We conduct experiments on two commonly used datasets for unsupervised text style transfer, i.e., Yelp and Amazon, following the dataset splits used in previous work. Dataset statistics are shown in Table 3. For each dataset, a gold output written by crowd-workers on Amazon Mechanical Turk is available for each entry in the test set. Since gold outputs are not written for development sets, we tune the hyper-parameters on the development sets based on our intuition of English.
Yelp The Yelp dataset consists of business reviews and their labeled sentiments (from 1 to 5) from Yelp. Those labeled greater than 3 are considered as positive samples and those labeled smaller than 3 are negative samples.
Amazon The Amazon dataset consists of product reviews and labeled sentiments from Amazon (He and McAuley, 2016). Positive and negative samples are defined in the same way as Yelp.
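The labeling rule for both datasets can be written as a small helper. Returning `None` for a rating of exactly 3 is an assumption of this sketch; the text only defines the positive and negative classes.

```python
def sentiment_label(rating):
    """Map a 1-5 rating to a style label; a rating of 3 gets no label here."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return None

labels = [sentiment_label(r) for r in (1, 2, 3, 4, 5)]
```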
We observe that the Amazon dataset contains many neutral or wrongly labeled sentences, which greatly harms our HRL-based sequence operation method. Thus, on the Amazon dataset, we adopt a cross-domain setting, i.e., we train the modules on the Yelp training set using the Amazon vocabulary and test the method on the Amazon test set. Experimental results show the effectiveness of our method under this cross-domain setting.

Evaluation Metrics
Automatic Evaluation Following previous work (Shen et al., 2017), we pre-train a TextCNN style classifier (Kim, 2014) on each dataset and measure the style polarity of system outputs based on the classification accuracy. Also, based on the available human references, we adopt a case-insensitive BLEU metric, which is computed using the Moses multi-bleu.perl script.
Human Evaluation Following previous work (Shen et al., 2017), we also conduct human evaluations. For each input sentence and corresponding output, each participant is asked to give a score from 1 to 5 for fluency, content preservation, and style polarity. If a transfer gets scores of 4 or 5 on all three aspects, it is considered a successful transfer. We count the success rate over the test set for each system, which is denoted as Suc in Table 5.
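The success criterion above amounts to a simple count. The function name and the toy ratings are hypothetical; each rating is a (fluency, content preservation, style polarity) triple.

```python
def success_rate(ratings):
    """Fraction of transfers scoring 4 or 5 on all three aspects."""
    if not ratings:
        return 0.0
    successes = sum(1 for scores in ratings if min(scores) >= 4)
    return successes / len(ratings)

# Toy ratings: the second and fourth transfers each fail on one aspect.
rate = success_rate([(5, 4, 4), (3, 5, 5), (4, 4, 5), (5, 5, 2)])
```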

Baselines
We make a comprehensive comparison with state-of-the-art style transfer methods, including CrossAligned (Shen et al., 2017) and UnsuperMT (Zhang et al., 2018b), the latter of which produces pseudo-aligned data and iteratively learns two NMT models. The outputs of the first six baselines have been made public in previous work. The outputs of Back-Translate and UnpairedRL are obtained by running the publicly available code. We obtained the outputs of UnsuperMT from the authors of Zhang et al. (2018b).

Evaluation Results
Table 4 shows the results of automatic evaluation. It should be noted that the classification accuracy for human references is relatively low (74.7% on Yelp and 43.2% on Amazon); thus, we do not consider it a valid metric for comparison. For BLEU score, our method outperforms recent systems by a large margin, which shows that our outputs have higher overlap with reference sentences provided by humans.
To lighten the burden on human participants, we compare our proposed method to only four of the previous methods, selected based on their performance in automatic evaluation. Given the observation discussed in §5.1, we remove the wrongly labeled test samples for human evaluation. Table 5 shows the results of human evaluation. Our proposed method achieves the highest fluency and content preservation on Yelp and performs the best on all human evaluation metrics on Amazon.
When $p_{stop}$ is larger, classification accuracy drops and BLEU increases. Based on our observation of human references, we find that humans usually make minimal changes to the input sentence; thus, BLEU computed with human references can be viewed as an indicator of content preservation. From this perspective, Figure 3 shows that if we stop earlier, i.e., when the current style is closer to the source style, more content will be preserved and more weakly stylized words may be kept. Thus, a controllable trade-off is achieved by manually setting $p_{stop}$.

Ablation Studies
We conduct several ablation studies to show the effect of different components in our method:
Ablations of Operators To show that incorporating various operators is essential, we evaluate the performance of the following ablations: InsertOnly, ReplaceOnly, and DeleteOnly, in which operator choices are restricted to subsets of Table 1.

Ablation of Reconstruction Reward and Reconstruction Loss
To show the effectiveness of our reconstruction-based objectives, we remove the reconstruction reward and the reconstruction loss as an ablation.
Table 6 shows the ablation results. BLEU drops if operators are restricted to a fixed set, showing the necessity of cooperating operator modules. BLEU also drops if we remove the reconstruction loss and the reconstruction reward, indicating that the generated words overlap less with human references in this ablation case. As discussed in §5.4, we ignore Acc since it is low on human references.
Figure 1 is an example of our method applied to a test sample. The transfer starts from more stylized parts and ends at less stylized parts, while keeping neutral parts intact. It also shows that our method learns lexical substitution and negation in an unsupervised way. Table 7 displays some comparisons of different systems. It shows that our proposed method is better at performing local changes to reverse the style of the input sentence while preserving most style-independent parts.
Table 7: Sampled system outputs. The dataset and the original style for each input sentence are parenthesized.
Original (Yelp, negative): staffed primarily by teenagers that do n't understand customer service .
TemplateBased: staffed primarily by teenagers that huge portions and customer service are pretty good .
Del-Ret-Gen: staffed , the best and sterile by flies , how fantastic customer service .
UnpairedRL: staffed established each tech feel when great customer service professional .
UnsuperMT: staffed distance that love customer service .
Point-Then-Operate: staffed by great teenagers that do delightfully understand customer service .
Original (Yelp, positive): i will be going back and enjoying this great place !
TemplateBased: i will be going back and enjoying this i did not @unk
Del-Ret-Gen: i will be going back and will not be returning into this
UnpairedRL: i will be going back and enjoying this great place .
UnsuperMT: i wo n't be going back and sitting this @num .
Point-Then-Operate: i will not be going back and avoid this horrible place !
Original (Amazon, negative): i could barely get through it they taste so nasty .
TemplateBased: beautifully through it they taste so nasty .
Del-Ret-Gen: i have used it through and it is very sharp and it was very nasty .
UnpairedRL: i could barely get through it they taste so nasty .
UnsuperMT: i can perfect get through it they taste so delicious .
Point-Then-Operate: i could get through it they taste so good .
Original (Amazon, positive): i also prefered the blade weight and thickness of the wustof .
TemplateBased: i also prefered the blade weight and thickness of the wustof toe .
Del-Ret-Gen: i also prefered the blade and was very disappointed in the weight and thickness of the wustof .
UnpairedRL: i also sampled the comfortable base and follow of the uk .
UnsuperMT: i also encounter the blade weight and width of the guitar .
Point-Then-Operate: i only prefered the weight and thickness of the wustof .

Discussions
We study the system outputs and observe two cases that our method cannot properly handle:
Neutral Input The reconstruction nature of our method prefers stylized input to neutral input. We observe that it fails to convert some neutral inputs, e.g., "I bought this toy for my daughter about @num months ago.", which shows that the high-level policy is not well learned for some neutral sentences.

Adjacent Stylized Words
We introduce a window size of 1 in §4.5.1 to deal with most semantic units. However, we observe that in some cases two adjacent stylized words occur, e.g., "poor watery food". If the first step is to replace one of them, then the other will be masked in later iterations, leading to incomplete transfer; if the first step is deletion, our method performs well, since we do not mask the context of deletion as stated in §4.5.1. Notably, phrases like "completely horrible" are not among these cases, since "completely" itself is not stylized.
Experiments in this work show the effectiveness of our proposed method for positive-negative text style transfer. Given its sequence operation nature, we see potential for applying the method to other types of transfer that require local changes, e.g., polite-impolite and written-spoken, although further empirical verification is needed.

Conclusions
We identify three challenges of existing seq2seq methods for unsupervised text style transfer and propose Point-Then-Operate (PTO), a sequence operation-based method within the hierarchical reinforcement learning (HRL) framework, consisting of a hierarchy of agents for pointing and operating, respectively. We show that the key aspects of text style transfer, i.e., fluency, style polarity, and content preservation, can be modeled by comprehensive training objectives. To make the HRL training more stable, we provide an efficient mask-based inference algorithm that allows for single-option trajectories during training. Experimental results show the effectiveness of our method in addressing the challenges of existing methods.