Towards Fine-grained Text Sentiment Transfer

In this paper, we focus on the task of fine-grained text sentiment transfer (FTST). This task aims to revise an input sequence to satisfy a given sentiment intensity, while preserving the original semantic content. Different from the conventional sentiment transfer task, which only reverses the sentiment polarity (positive/negative) of text, the FTST task requires more nuanced and fine-grained control of sentiment. To this end, we propose a novel Seq2SentiSeq model. Specifically, the numeric sentiment intensity value is incorporated into the decoder via a Gaussian kernel layer to finely control the sentiment intensity of the output. Moreover, to tackle the lack of parallel data, we propose a cycle reinforcement learning algorithm to guide the model training. In this framework, the elaborately designed rewards balance both sentiment transformation and content preservation, while not requiring any ground truth output. Experimental results show that our approach outperforms existing methods by a large margin in both automatic evaluation and human evaluation.


Introduction
Text sentiment transfer aims to rephrase the input to satisfy a given sentiment label (or value) while preserving its original semantic content. It facilitates various NLP applications, such as automatically converting the attitude of reviews and fighting against offensive language in social media (dos Santos et al., 2018).
Previous work (Shen et al., 2017; Luo et al., 2019) on text sentiment transfer mainly focuses on the coarse-grained level: the reversal of positive and negative sentiment polarity.1 These methods are confined to scenarios with two discrete sentiment labels. To achieve more nuanced and precise sentiment control of text generation, we turn to fine-grained text sentiment transfer (FTST), which revises a sequence to satisfy a given sentiment intensity while keeping the semantic content unchanged. Taking Figure 1 as an example, given the same input and five sentiment intensity values ranging from 0 (most negative) to 1 (most positive), the system generates five different outputs that satisfy the corresponding sentiment intensities in relative order.

1 Joint work between WeChat AI and Peking University.

Figure 1: An example of the input and output of the fine-grained text sentiment transfer task. Input: "Tasty food and wonderful service." Outputs at target sentiment intensities 0.1, 0.5, 0.7, and 0.9: "Horrible food and terrible service!", "Food and service need improvement.", "Good food and service.", and "Amazing food and perfect service!!" The output reviews describe the same content (e.g. food/service) as the input while expressing different sentiment intensity.
There are two main challenges in the FTST task. First, it is hard to achieve fine-grained control of the sentiment intensity when generating a sentence. Previous work on coarse-grained text sentiment transfer usually uses a separate decoder for each sentiment label (Xu et al., 2018; Zhang et al., 2018b) or embeds each sentiment label into a separate vector (Fu et al., 2018). However, these methods are not feasible for fine-grained text sentiment transfer, since the target sentiment intensity is a real value rather than a discrete label. Second, parallel data is unavailable in practice. In other words, we can only access corpora labeled with fine-grained sentiment ratings or intensity values. Therefore, in the FTST task, we cannot train a generative model on ground truth outputs.

To tackle the two challenges mentioned above, we propose two corresponding solutions. First, in order to control the sentiment intensity of the generated sentence, we propose a novel sentiment-intensity-controlled sequence-to-sequence (Seq2Seq) model, Seq2SentiSeq. It incorporates the sentiment intensity value into the conventional Seq2Seq model via a Gaussian kernel layer. By this means, the model encourages the generation of words whose sentiment intensity is closer to the given intensity value during decoding. Second, due to the lack of parallel data, we cannot directly train the proposed model via MLE (maximum likelihood estimation). Therefore, we propose a cycle reinforcement learning algorithm to guide the model training without any parallel data. The designed reward balances both sentiment transformation and content preservation, while not requiring any ground truth output. Evaluation of the FTST task is also challenging and complex. In order to build a reliable automatic evaluation, we collect human references for the FTST task on the Yelp review dataset via crowdsourcing and design a series of automatic metrics.
The main contributions of this work are summarized as follows:
• We propose a sentiment-intensity-controlled generative model, Seq2SentiSeq, in which a sentiment intensity value is introduced via a Gaussian kernel layer to achieve fine-grained sentiment control of the generated sentence.
• In order to adapt to non-parallel data, we design a cycle reinforcement learning algorithm, CycleRL, to guide the model training in an unsupervised way.
• Experiments show that the proposed approach largely outperforms state-of-the-art systems in both automatic evaluation and human evaluation.
2 Proposed Model

Task Definition
Given an input sequence x and a target sentiment intensity value v_y, the FTST task aims to generate a sequence y which not only expresses the target sentiment intensity v_y, but also preserves the original semantic content of the input x. Without loss of generality, we limit the sentiment intensity value v_y to the range from 0 (most negative) to 1 (most positive).

Seq2SentiSeq: Sentiment Intensity Controlled Seq2Seq Model
Figure 2 presents a sketch of the proposed Seq2SentiSeq model. The model is based on the encoder-decoder framework, which takes a source text x as the input and outputs a target sentence y with the given sentiment intensity v y . In order to control the sentiment intensity of y, we introduce a Gaussian kernel layer into the decoder.

Encoder
We use a bidirectional RNN as the encoder to capture source content information. Each word in the source sequence x = (x_1, ..., x_m) is first represented by its semantic representation mapped by the semantic embedding E_c. The RNN reads the semantic representations in both directions and computes a hidden state for each word in each direction. We obtain the final hidden representation h_i of the i-th word by concatenating the hidden states from the two directions.

Decoder
Given the hidden representations {h_i}_{i=1}^{m} of the input sequence x and the target sentiment intensity value v_y, the decoder aims to generate a sequence y which not only describes the same content as the input sequence x, but also expresses a sentiment intensity close to v_y.
In order to control sentiment during decoding, we first embed each word with an additional sentiment representation, besides the original semantic representation. The semantic representation characterizes the semantic content of the word, while the sentiment representation characterizes its sentiment intensity. Formally, the hidden state s_t of the decoder at time-step t is computed as:

s_t = RNN(s_{t-1}, [E_c(y_{t-1}); E_s(y_{t-1}); c_t])   (1)

where E_s(y_{t-1}) refers to the sentiment representation of the word y_{t-1} mapped by the sentiment embedding matrix E_s, E_c(y_{t-1}) is its semantic representation, and the context vector c_t is computed by an attention mechanism in the same way as Luong et al. (2015).

Considering the two goals of the FTST task, sentiment transformation and content preservation, we model the final generation probability as a mixture of a semantic probability and a sentiment probability, where the former drives content preservation and the latter drives sentiment transformation. Similar to the traditional Seq2Seq model (Bahdanau et al., 2014), the semantic probability distribution over the whole vocabulary is computed as:

p_c(y_t) = softmax(W_c s_t)   (2)

where W_c is a trainable weight matrix. The sentiment probability measures how close the sentiment intensity of the generated word is to the target v_y. Normally, each word has a specific sentiment intensity. For example, the word "okay" has a positive intensity around 0.6, "good" is around 0.7, and "great" is around 0.8. However, conditioned on the previously generated words, the sentiment intensity of the current word may be totally different. For example, the phrase "not good" has a negative intensity around 0.3, while "extremely good" is around 0.9. That is to say, the sentiment intensity of each word at time-step t should be decided by both its sentiment representation E_s and the current decoder state s_t.
Therefore, we define a sentiment intensity prediction function g(E_s, s_t) as follows:

g(E_s, s_t) = sigmoid(W_s [E_s; s_t])   (3)

where W_s is a trainable parameter, and the sigmoid scales the predicted intensity value to [0, 1]. Intuitively, in order to achieve fine-grained control of sentiment, words whose predicted sentiment intensities are closer to the target sentiment intensity value v_y should be assigned higher probabilities. Taking Figure 2 as an example, at the 5-th time-step the word "good" should be assigned a higher probability than the word "bad", because the predicted intensity value g("good", s_4) is closer to the target sentiment intensity than g("bad", s_4). To favor words whose sentiment intensity is near v_y, we introduce a Gaussian kernel layer which places a Gaussian distribution centered at v_y, inspired by Luong et al. (2015) and Zhang et al. (2018a). Specifically, the sentiment probability is formulated as:

p_s(y_t) ∝ exp( -(g(E_s(y_t), s_t) - v_y)^2 / (2σ^2) )

normalized over the vocabulary, where σ is the standard deviation of the Gaussian kernel. The final generation probability is the mixture:

p(y_t) = γ p_c(y_t) + (1 - γ) p_s(y_t)   (6)

where γ balances content preservation and sentiment transformation.

Figure 3: Cycle reinforcement learning. Note that the upper encoder-decoder and the lower encoder-decoder are the same Seq2SentiSeq model.
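The Gaussian kernel layer and the semantic/sentiment mixture described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names, the explicit normalization over the vocabulary, and the default σ and γ values are choices made here for clarity.

```python
import numpy as np

def sentiment_probability(g, v_y, sigma=0.01):
    """Gaussian kernel over predicted per-word intensities g (shape [V]),
    centred at the target intensity v_y, normalised over the vocabulary."""
    logits = -((g - v_y) ** 2) / (2.0 * sigma ** 2)
    logits = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def mix_probabilities(p_c, p_s, gamma=0.5):
    """Final generation distribution: a gamma-weighted mixture of the
    semantic distribution p_c and the sentiment distribution p_s."""
    return gamma * p_c + (1.0 - gamma) * p_s
```

For example, with per-word predicted intensities [0.2, 0.7, 0.9] and target v_y = 0.7, the sharply peaked kernel (σ = 0.01) concentrates almost all of the sentiment probability mass on the second word.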

Training: Cycle Reinforcement Learning
A serious challenge of the FTST task is the lack of parallel data. Since the ground truth output y is unobserved, we cannot directly use maximum likelihood estimation (MLE) for training. To remedy this, we design a cycle reinforcement learning (CycleRL) algorithm. An overview of the training process is summarized in Algorithm 1. Two rewards are designed to encourage changing the sentiment while preserving the content, without the need for parallel data. The definitions of the two rewards and the corresponding gradients for the Seq2SentiSeq model S are introduced as follows.

Reward Design
We design a reward for each of the two goals of the FTST task (sentiment transformation and content preservation). Then, an overall reward r is calculated to balance these two goals and guide the model training.
Reward for sentiment transformation. A pre-trained sentiment scorer is used to evaluate how well the sampled sentence ŷ matches the target sentiment intensity value v_y. Specifically, the reward for sentiment transformation is formulated as:

r_s = 1 - |φ(ŷ) - v_y|   (7)

where φ refers to the pre-trained sentiment scorer, implemented as an LSTM-based linear regression model.

Reward for content preservation. Intuitively, if the model performs well in content preservation, it should be easy to back-reconstruct the source input x. Therefore, we define the reward for content preservation as the probability of the model reconstructing x based on the generated text ŷ and the source sentiment intensity value v_x:

r_c = p(x | ŷ, v_x; θ)   (8)

where θ denotes the parameters of the Seq2SentiSeq model.

Algorithm 1
The cycle reinforcement learning algorithm for training Seq2SentiSeq.
Input: A corpus D = {(x_i, v_i)} where each sequence x_i is labeled with a fine-grained sentiment intensity v_i
1: Initialize the pseudo-parallel data V_0 = {(x_i, ŷ_i)}
2: Pre-train the Seq2SentiSeq model S_θ using V_0
3: for each iteration t = 1, 2, ..., T do
4:   Sample a sentence x from D
5:   for k = 1, 2, ..., K do
6:     Sample an intensity value v^(k)
7:     Generate a target sequence ŷ^(k) = S_θ(x, v^(k))
8:     Compute the sentiment reward r_s^(k) based on Eq. 7
9:     Compute the content reward r_c^(k) based on Eq. 8
10:    Compute the total reward r^(k) based on Eq. 9
11:   end for
12:   Update θ using the rewards {r^(k)}_{k=1}^{K} based on Eq. 11
13:   Update θ using the cycle reconstruction loss in Eq. 12
14: end for

Overall reward. To encourage the model to improve both sentiment transformation and content preservation, the final reward r guiding the model training is designed to be the harmonic mean of the above two rewards:

r = (1 + β^2) · r_s · r_c / (β^2 · r_s + r_c)   (9)

where β is a harmonic weight that controls the trade-off between the two rewards.
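The two rewards and their harmonic combination (Eqs. 7-9) can be sketched as follows. This is a hedged reconstruction: the exact functional form of r_s, the recovery of r_c from a log-probability, and the zero-denominator guard are assumptions made here, not details confirmed by the excerpt.

```python
import math

def sentiment_reward(phi_pred, v_y):
    """r_s (Eq. 7): closeness of the scorer's predicted intensity
    phi_pred to the target v_y, both assumed to lie in [0, 1]."""
    return 1.0 - abs(phi_pred - v_y)

def content_reward(log_p_reconstruct):
    """r_c (Eq. 8): probability of back-reconstructing the source x
    from the sample, recovered here from its log-probability."""
    return math.exp(log_p_reconstruct)

def overall_reward(r_s, r_c, beta=1.0):
    """r (Eq. 9): harmonic mean of the two rewards; beta trades off
    sentiment transformation against content preservation."""
    denom = beta ** 2 * r_s + r_c
    return 0.0 if denom == 0.0 else (1.0 + beta ** 2) * r_s * r_c / denom
```

With β = 1 this reduces to the familiar F1-style harmonic mean, so a sample must score well on both sentiment and content to receive a high overall reward.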

Optimization
The goal of RL training is to minimize the negative expected reward:

L(θ) = - E_{ŷ ~ p(·|x, v_y; θ)}[ r(ŷ) ]   (10)

where ŷ^(k) is the k-th sequence sampled from the probability distribution p in Eq. 6, r^(k) is the reward of ŷ^(k), and θ are the parameters of the proposed model in Figure 2. By means of the policy gradient method (Williams, 1992), for each training example, the expected gradient of Eq. 10 can be approximated as:

∇_θ L(θ) ≈ - (1/K) Σ_{k=1}^{K} (r^(k) - b) ∇_θ log p(ŷ^(k) | x, v_y; θ)   (11)

where K is the sample size and b is a greedy-search decoding baseline that reduces the variance of the gradient estimate, implemented in the same way as Paulus et al. (2017).
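The estimator in Eq. 11 can be sketched as a surrogate loss. The names below are illustrative; in practice the log-probabilities come from the Seq2SentiSeq decoder and an autodiff framework backpropagates through them, whereas this NumPy sketch only computes the scalar quantities.

```python
import numpy as np

def policy_gradient_weights(rewards, baseline_reward):
    """Per-sample weights (r_k - b) / K for the REINFORCE estimate in
    Eq. 11; b is the reward of the greedy-decoded sequence."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - baseline_reward) / len(rewards)

def surrogate_loss(log_probs, rewards, baseline_reward):
    """Scalar surrogate whose gradient matches Eq. 11: minimizing
    -(1/K) * sum_k (r_k - b) * log p(y_k) raises the probability of
    samples that beat the baseline and lowers the rest."""
    w = policy_gradient_weights(rewards, baseline_reward)
    return -float(np.sum(w * np.asarray(log_probs, dtype=float)))
```

Note how a sample with reward above the baseline gets a positive weight (its log-probability is pushed up), while a below-baseline sample gets a negative weight.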
Nevertheless, RL training optimizes a specific metric, which may not guarantee the fluency of the generated text (Paulus et al., 2017), and it often suffers from unstable training (Li et al., 2017). The most direct remedy is to expose sentences from the training corpus to the decoder and train via MLE (also called teacher forcing). In order to expose the decoder to the original sentences from the training corpus, we borrow ideas from back-translation (Lample et al., 2018a,b). Specifically, the model first generates a sequence ŷ based on the input text x and the target sentiment intensity value v_y, and then reconstructs the source input x based on ŷ and the source sentiment intensity value v_x. The gradient of the cycle reconstruction loss is therefore defined as:

∇_θ L_cycle(θ) = - ∇_θ log p(x | S(x, v_y), v_x; θ)   (12)

where S refers to the Seq2SentiSeq model.
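The cycle reconstruction objective can be sketched as a thin wrapper. The interface is hypothetical: `score_logprob` stands in for the Seq2SentiSeq decoder's log-likelihood of reconstructing x conditioned on the generated ŷ and the source intensity v_x.

```python
def cycle_reconstruction_loss(score_logprob, x, y_hat, v_x):
    """Negative log-likelihood of reconstructing the source x from the
    model's own output y_hat and the source intensity v_x (Eq. 12).
    score_logprob(x, y_hat, v_x) is a hypothetical stand-in for the
    decoder's log p(x | y_hat, v_x; theta)."""
    return -score_logprob(x, y_hat, v_x)
```

Minimizing this loss with the true decoder plays the teacher-forcing role described above: the reconstruction target x is a real sentence from the corpus, which anchors the decoder to fluent text.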
Finally, we alternately update the model parameters θ based on Eq. 11 and Eq. 12.

Experimental Setup
In this section, we introduce the dataset, experiment settings, baselines, and evaluation metrics.

Dataset
We conduct experiments on the Yelp dataset, which consists of a large number of product reviews. Each review is assigned a sentiment rating ranging from 1 to 5. Since inter-annotator inconsistency is more severe for fine-grained ratings, we average the ratings of sentences whose Jaccard similarity exceeds 0.9. The averaged ratings are then normalized to [0, 1] as the sentiment intensity. Other data preprocessing is the same as Shen et al. (2017). Finally, we obtain a total of 640K sentences. We randomly hold out 630K for training, 10K for validation, and 500 for testing. Even though the sentiment intensity distribution of the training dataset is not uniform, the proposed framework samples target intensities uniformly from the interval [0, 1] with a step of 0.05 to guide the model training (Step 6 in Algorithm 1), which mitigates this imbalance.
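The two preprocessing steps can be sketched as follows. This is a simplified illustration under stated assumptions: token-set Jaccard similarity over whitespace-split words, and a linear mapping of 1-5 star ratings to [0, 1]; the paper's exact tokenization and grouping procedure may differ.

```python
def jaccard_similarity(sent_a, sent_b):
    """Token-level Jaccard similarity |A ∩ B| / |A ∪ B| between two
    sentences, treated as sets of whitespace-separated tokens."""
    a, b = set(sent_a.split()), set(sent_b.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def normalize_rating(avg_rating, low=1.0, high=5.0):
    """Map an averaged 1-5 star rating linearly to a [0, 1] intensity."""
    return (avg_rating - low) / (high - low)
```

Sentences whose pairwise Jaccard similarity exceeds 0.9 would have their ratings averaged before `normalize_rating` is applied.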

Experiment Settings
We tune hyper-parameters on the validation set. The vocabulary size is set to 10K. Both the semantic and sentiment embeddings are 300-dimensional and are learned from scratch. We implement both the encoder and the decoder as 1-layer LSTMs with a hidden size of 256, and the former is bidirectional. The batch size is 64. We pre-train our model for 10 epochs with the MLE loss on pseudo-parallel sentences constructed via Jaccard similarity, the same as Liao et al. (2018). The harmonic weight β in Eq. 9 is 1 and γ in Eq. 6 is 0.5. The standard deviation σ is set to 0.01 to yield suitably peaked distributions. The sample size K in Eq. 11 is set to 16. The optimizer is Adam (Kingma and Ba, 2014) with an initial learning rate of 10^-3 for pre-training and 10^-5 for CycleRL training. Dropout (Srivastava et al., 2014) is used to avoid overfitting.

Baselines
We compare our proposed method with the following two series of state-of-the-art systems.
Fine-grained systems aim to modify an input sentence to satisfy a given sentiment intensity. Liao et al. (2018) construct a pseudo-parallel corpus to train a combination of a Revised-VAE and a coupling component that models the pseudo-parallel data with three extra losses L_extra. We also consider SC-Seq2Seq (Zhang et al., 2018a), a specificity-controlled Seq2Seq model originally proposed for dialogue generation; to adapt it to this unsupervised task, we train it with the proposed CycleRL algorithm.
Coarse-grained systems aim to reverse the sentiment polarity (positive/negative) of the input, which can be regarded as a special case where the sentiment intensity is set below average (negative) or above average (positive). We compare our proposed method with the following state-of-the-art systems: CrossAlign (Shen et al., 2017), MultiDecoder (Fu et al., 2018), DeleteRetrieve, and Unpaired (Xu et al., 2018).

Evaluation Metrics
We adopt both automatic and human evaluation.

Automatic Evaluation
Automatic evaluation of FTST is an open and challenging issue, so we adopt a combination of multiple evaluation methods.
Content: To evaluate content preservation, we hired crowd-workers on CrowdFlower to write human references. For each review in the test dataset, crowd-workers are required to write five references with sentiment intensity values from V = [0.1, 0.3, 0.5, 0.7, 0.9]. The BLEU (Papineni et al., 2002) score between each human reference and the generated text of the same sentiment intensity then measures content preservation. Fluency: To measure fluency, we calculate the perplexity (PPL) of each generated sequence with a pre-trained bi-directional LSTM language model (Mousa and Schuller, 2017).
Sentiment: In order to measure how close the sentiment intensities of the outputs are to the target intensity values, we define three metrics. Given an input sentence x and a list of target intensity values V = [v_1, v_2, ..., v_N], the corresponding outputs of the model are [ŷ_1, ŷ_2, ..., ŷ_N]. We then use a pre-trained sentiment regression scorer to predict the sentiment intensity values of the outputs as V̂ = [v̂_1, v̂_2, ..., v̂_N]. Following Liao et al. (2018), we use the mean absolute error

MAE = (1/N) Σ_{i=1}^{N} |v_i - v̂_i|

between V and V̂ to measure the absolute gap.
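The MAE metric is straightforward to compute:

```python
def mean_absolute_error(targets, predictions):
    """MAE = (1/N) * sum_i |v_i - v_hat_i| between the target intensities
    and the intensities predicted by the sentiment regression scorer."""
    assert len(targets) == len(predictions) and len(targets) > 0
    return sum(abs(v, ) if False else abs(v - p)
               for v, p in zip(targets, predictions)) / len(targets)
```

A lower MAE means the generated sentences land closer, on average, to the requested intensities.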
Moreover, for the fine-grained text sentiment transfer task, we expect that a higher target sentiment intensity yields a more positive sentence. That is to say, the relative intensity ranking of all sentences generated from the same input also matters. Inspired by the Mean Reciprocal Rank metric, which is widely used in the Information Retrieval area, we design a Mean Relative Reciprocal Rank (MRRR) metric to measure this relative ranking. In addition, we also compare our model with the coarse-grained sentiment transfer systems. In order to make the results comparable, we regard generated results with sentiment intensity larger/smaller than 0.5 as positive/negative. Then we use a pre-trained binary TextCNN classifier (Kim, 2014) to compute the classification accuracy.
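The exact MRRR formula is not reproduced in this excerpt; the sketch below is one plausible reading of the idea, and should be treated as a hypothetical instantiation rather than the paper's definition: rank both intensity lists, then average the reciprocal of one plus each item's rank displacement, so perfectly ordered outputs score 1.0.

```python
def mean_relative_reciprocal_rank(targets, predictions):
    """Hypothetical MRRR sketch: compare the rank of each output's
    predicted intensity with the rank of its target intensity, and
    average 1 / (1 + |rank displacement|) over all outputs."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rt, rp = ranks(targets), ranks(predictions)
    return sum(1.0 / (1 + abs(a - b)) for a, b in zip(rt, rp)) / len(targets)
```

Under this reading, the metric rewards preserving the relative order of intensities even when the absolute values drift, which is exactly the complementary signal to MAE.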

Human Evaluation
We also perform human evaluation to assess the quality of the generated sentences more accurately. Each item contains the source input, the sampled target sentiment intensity value, and the output of the different systems. 500 items are distributed to 3 evaluators, who are required to score the generated sentences from 1 to 5 based on the input and the target sentiment intensity value in terms of three criteria: content, sentiment, and fluency. Content evaluates the degree of content preservation. Sentiment refers to how well the output matches the target sentiment intensity. Fluency measures whether the generated texts are fluent. For each metric, the average Pearson correlation coefficient of the scores given by the three evaluators is greater than 0.71, which ensures inter-evaluator agreement.

Evaluation Results
The automatic evaluation and human evaluation results are shown in Table 1. Our approach achieves the best performance on all metrics. More specifically, we have the following observations: (1) The proposed model Seq2SentiSeq obtains 8.6/3.1/0.98 points of absolute improvement over the best baseline in BLEU-1/BLEU-2/Content score, demonstrating the effectiveness of our approach in preserving the content of the input sentences.
(2) Our model controls the sentiment intensity more precisely according to the human sentiment scores, and it also obtains the best results in both sentiment mean absolute error (MAE) and relative sentiment rank (MRRR).

However, SC-Seq2Seq achieves the second-best MAE score while Revised-VAE + L_extra achieves the second-best MRRR score. We can infer that the two models excel at different aspects, and that MRRR provides a complementary perspective on the sentiment results.
(3) The proposed model generates more fluent sentences than all baselines. The main reason for these three observations is that we design two rewards that directly encourage content preservation and sentiment transformation during cycle reinforcement training. In addition, the cycle reconstruction loss effectively safeguards the fluency of the generated sentences, which is further verified in the ablation study. We also simplify our task to the setting of coarse-grained (positive/negative) sentiment transfer. Table 3 shows the binary sentiment accuracy of the representative systems. The proposed model achieves the best results among the fine-grained systems, and it is comparable to the best coarse-grained system.

Ablation Study
In this section, we further discuss the impact of each component of the proposed model. We retrain our model by ablating individual components: without pre-training, without cycle reconstruction (Eq. 12), and without reinforcement learning (Eq. 11). Table 2 shows the corresponding automatic and human evaluations. The performance declines most without pre-training. This reveals that reinforcement learning depends heavily on pre-training as a warm start, because the RL architecture is hard to train from scratch. Moreover, without pre-training the model tends to generate frequent words and short sentences, which obtain a low PPL score. The performance of the ablated version without cycle reconstruction also drops significantly, since cycle reconstruction plays the role of teacher forcing in our approach. Finally, even though the proposed Seq2SentiSeq without reinforcement learning can beat the best baseline in terms of human average score, reinforcement learning still boosts the performance of the proposed model by a large margin.

Table 4 shows example outputs on the Yelp dataset with five sentiment intensity values:

Input: the beer isn't bad, but the food was less than desirable.

Seq2SentiSeq outputs:
V=0.1: the beer is terrible, and the food was the worst.
V=0.3: the beer wasn't bad, and the food wasn't great too.
V=0.5: the food is ok, but not worth the drive to the strip.
V=0.7: the beer is good, and the food is great.
V=0.9: the wine is great, and the food is extremely fantastic.

Revised-VAE + L_extra outputs:
V=0.1: n't no about about no when about that was when about
V=0.3: the beer sucks, but the food is not typical time.
V=0.5: the beer is cheap, but the food was salty and decor.
V=0.7: i just because decent management salty were impersonal.
V=0.9: n't that about was that when was about as when was

This case demonstrates that our model can both preserve the content ("beer", "food") and change the sentiment to the desired intensity. More importantly, our model captures subtle sentiment differences between words and phrases, e.g., "the worst" → "bad" → "ok" → "good" → "extremely fantastic". In contrast, the Revised-VAE + L_extra system does not show this sentiment trend and may collapse when the intensity value V is very small (0.1) or very large (0.9). Our model may also occasionally suffer from semantic drift, e.g., "beer" being revised to "wine".

Analysis on Sentiment Representation
We also conduct an analysis to understand the sentiment representations of words introduced in our model. We take the 1,000 most frequent words from the training dataset and use a human-annotated sentiment lexicon (Hutto and Gilbert, 2014) to classify them into three categories: positive, neutral, and negative, obtaining 112 positive, 841 neutral, and 47 negative words. Finally, we apply t-SNE (Rauber et al., 2016) to visualize both the semantic and sentiment embeddings of the proposed model (Figure 2) after training. As shown in Figure 4, the distributions of the two embeddings are significantly different. In the semantic embedding space, most positive and negative words lie close together. On the contrary, in the sentiment embedding space, positive words are far from negative words. In conclusion, neighbors in the semantic embedding space are semantically related, while neighbors in the sentiment embedding space express similar sentiment intensity.

Related Work
Recently, there is a growing literature on the task of unsupervised sentiment transfer, which aims to reverse the sentiment polarity of a sentence while keeping its content unchanged, without parallel data (Fu et al., 2018; Tsvetkov et al., 2018; Xu et al., 2018; Lample et al., 2019). However, little research has focused on the fine-grained control of sentiment. Liao et al. (2018) exploit pseudo-parallel data constructed via heuristic rules, thus turning this task into a supervised setting. They then propose a model based on the Variational Autoencoder (VAE) that first disentangles the content factor from the source sentiment factor, and then combines the content with the target sentiment factor. However, the quality of the pseudo-parallel data is not satisfactory, which seriously hurts the performance of the VAE model. Different from them, we dynamically update the pseudo-parallel data via on-the-fly back-translation (Lample et al., 2018b) during training (Eq. 12).
Some other NLP tasks have also shown interest in controlling fine-grained attributes of text generation. For example, Zhang et al. (2018a) and Ke et al. (2018) propose to control specificity and diversity in dialogue generation. We borrow ideas from these works, but the motivation and models of our work differ substantially from theirs. The main differences are: (1) since sentiment depends on local context while specificity does not, our model is designed to take the local context (i.e., the previously generated words) s_t into consideration (e.g., Eq. 1, Eq. 3); (2) due to the lack of parallel data, we propose a cycle reinforcement learning algorithm to train the proposed model (Section 2.3).

Conclusion
In this paper, we focus on solving the fine-grained text sentiment transfer task, a natural extension of the binary sentiment transfer task that poses additional challenges. We propose a Seq2SentiSeq model to control the fine-grained sentiment intensity of the generated sentence. In order to train the proposed model without any parallel data, we design a cycle reinforcement learning algorithm. We apply the proposed approach to the Yelp review dataset, obtaining state-of-the-art results in both automatic evaluation and human evaluation.