Human-Like Decision Making: Document-level Aspect Sentiment Classification via Hierarchical Reinforcement Learning

Recently, neural networks have shown promising results on Document-level Aspect Sentiment Classification (DASC). However, these approaches often offer little transparency w.r.t. their inner working mechanisms and lack interpretability. In this paper, to simulate the steps a human takes when analyzing aspect sentiment in a document, we propose a new Hierarchical Reinforcement Learning (HRL) approach to DASC. This approach incorporates clause selection and word selection strategies to tackle the data noise problem in DASC. First, a high-level policy is proposed to select aspect-relevant clauses and discard noisy clauses. Then, a low-level policy is proposed to select sentiment-relevant words and discard noisy words inside the selected clauses. Finally, a sentiment rating predictor is designed to provide reward signals to guide both clause and word selection. Experimental results demonstrate the effectiveness of the proposed approach to DASC over state-of-the-art baselines.


Introduction
Document-level Aspect Sentiment Classification (DASC) is a fine-grained sentiment classification task in the field of sentiment analysis (Pang and Lee, 2007; Li et al., 2010). This task aims to predict the sentiment rating for each given aspect mentioned in a document-level review. For instance, Figure 1 shows a review document with four given aspects of a hotel (i.e., location, room, value, service). The goal of DASC is to predict the rating score towards each aspect by analyzing the whole document. In the last decade, this task has been drawing more and more interest from researchers in the Natural Language Processing community (Titov and McDonald, 2008; Yin et al., 2017). In previous studies, neural models have been shown to be effective for improving performance on DASC. Despite these advantages, such complex neural network approaches often offer little transparency w.r.t. their inner working mechanisms and suffer from a lack of interpretability. However, clearly understanding where and how such a model makes its decisions is rather important for developing real-world applications (Marcus, 2018). As human beings, if asked to evaluate the sentiment rating for a specific aspect in a document, we often perform sentiment prediction in two steps. First, we select some aspect-relevant snippets (e.g., sentences/clauses) inside the document. Second, we select some sentiment-relevant words (e.g., sentiment words) inside these snippets to make a rating decision. For instance, for aspect location in Figure 1, we first select the aspect-relevant clauses, i.e., Clause1 and Clause2, and then select the sentiment-relevant words, i.e., "close" and "very convenient", inside the two clauses to make the rating decision (5 stars).
Inspired by the above cognitive process of human beings, an ideal and interpretable solution for DASC is to select aspect-relevant clauses and sentiment-relevant words, discarding the noisy parts of a document when making a decision. Two major challenges arise in this solution, as illustrated below.
The first challenge is how to select aspect-relevant clauses and discard irrelevant and noisy clauses. For instance, for aspect location, Clause5, which mentions another aspect value (only 1 star), may introduce noise and should be discarded, because this noise can provide wrong signals that mislead the model into assigning a very low sentiment rating to aspect location. One possible way to alleviate this noise problem is to leverage the soft-attention mechanism proposed in previous studies. However, soft attention has the limitation that the softmax function always assigns small but non-zero probabilities to noisy clauses, which weakens the attention given to the few truly significant clauses for a particular aspect. Therefore, a well-behaved approach should discard noisy clauses for a specific aspect during model training.
The second challenge is how to select sentiment-relevant words and discard irrelevant and noisy words. For instance, for aspect location, the words "this" and "is" in Clause1 are noisy words and should be discarded, since they make no contribution to implying the sentiment rating. One possible way to alleviate this problem is again to leverage the soft-attention mechanism proposed in previous studies. However, soft attention may introduce additional noise and lacks interpretability, because it tends to assign higher weights to some domain-specific words rather than truly sentiment-relevant words (Mudinas et al., 2012; Zou et al., 2018). For instance, soft attention tends to regard the name of a hotel with a good reputation, "Hilton" in Clause3, as a positive word, which could mislead the model into assigning a higher rating to aspect room. Therefore, a well-behaved approach should highlight sentiment-relevant words and discard noisy words for a specific aspect during model training.
In this paper, we propose a Hierarchical Reinforcement Learning (HRL) approach with a high-level policy and a low-level policy to address the above two challenges in DASC. First, the high-level policy is leveraged to select aspect-relevant clauses and discard noisy clauses during model training. Then, the low-level policy is leveraged to select sentiment-relevant words and discard noisy words inside the selected clauses. Finally, a sentiment rating predictor is designed to provide reward signals to guide both clause and word selection. Empirical studies show that the proposed approach performs well by incorporating the clause and word selection strategies and significantly outperforms several state-of-the-art approaches, including those with the soft-attention mechanism.
Hierarchical Reinforcement Learning
Figure 2 shows the overall framework of our Hierarchical Reinforcement Learning (HRL) approach, which contains three components: a high-level policy for clause selection (Section 2.2); a low-level policy for word selection (Section 2.3); and a sentiment rating predictor that provides reward signals to guide both clause and word selection (Section 2.4).
As a preprocessing step, we adopt RST-style discourse segmentation (Mann and Thompson, 1988) to segment all documents in corpus C into Elementary Discourse Units (EDUs), and consider these EDUs as clauses, following previous work.
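As a rough illustration only, the sketch below splits a review into clause-like units using punctuation and a few discourse connectives. It is a crude stand-in for an actual RST discourse segmenter; the connective list and the function name are our own assumptions, not part of the tool used in the paper.

```python
import re
from typing import List

# Crude stand-in for an RST discourse segmenter: split on clause-final punctuation
# and a handful of discourse connectives. A real EDU segmenter is far more accurate.
CONNECTIVES = r"\b(?:but|although|because|while|and)\b"

def rough_clause_split(review: str) -> List[str]:
    pieces = re.split(r"(?<=[.;!?,])\s+|" + CONNECTIVES, review)
    return [p.strip() for p in pieces if p and p.strip()]

print(rough_clause_split(
    "The hotel is close to the subway, so it is very convenient, "
    "but the room is a little uncomfortable."))
```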
In summary, we formulate the task of DASC as a semi-Markov decision process (Sutton et al., 1999b), i.e., hierarchical reinforcement learning with a high-level policy and a low-level policy. In particular, our HRL approach for DASC works as follows. Given a review document with a clause sequence and an aspect, the high-level policy decides whether a clause mentions this aspect. If yes, the high-level policy selects this clause and launches the low-level policy, which scans the words inside this selected clause one by one in order to select sentiment-relevant words. Otherwise, the high-level policy skips the current clause and moves on to the next clause until all clauses in the review document are scanned. During clause and word selection, a sentiment rating predictor is employed to provide reward signals to guide the above clause and word selection.
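To make this workflow concrete, here is a minimal control-flow sketch of the two-level selection procedure. The function and argument names are hypothetical stand-ins for the policies and predictor introduced in the following subsections, not the authors' actual interfaces.

```python
from typing import Callable, List

def hrl_predict(clauses: List[List[str]],
                aspect: str,
                high_policy: Callable[[List[str], str], bool],
                low_policy: Callable[[str, str], bool],
                rating_predictor: Callable[[List[List[str]]], int]) -> int:
    """Scan clauses with the high-level policy; scan words of selected clauses
    with the low-level policy; predict a rating from what is kept."""
    kept_clauses = []
    for clause in clauses:                       # high-level pass over clauses
        if not high_policy(clause, aspect):      # option o_i = 0: skip noisy clause
            continue
        kept_words = [w for w in clause          # low-level pass over words
                      if low_policy(w, aspect)]  # action a_{i,j} = 1: keep word
        kept_clauses.append(kept_words)
    return rating_predictor(kept_clauses)        # rating prediction / reward signal
```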

Clause Selection with High-level Policy
Assume that a review document D with a given aspect x_aspect has been segmented into a clause sequence {u_1, ..., u_n}. The high-level policy π^h aims to select the clauses u_i that truly mention aspect x_aspect and discard noisy ones. Here, clause u_i consists of k_i words {x_{i,1}, ..., x_{i,k_i}}. Once a clause is selected, it is passed to the low-level policy for further word selection.
During clause selection, we adopt a stochastic policy as the high-level policy π^h, which generates a conditional probability distribution π^h(o | ·) over options (i.e., high-level actions). More specifically, we adopt an LSTM model LSTM^h to construct the high-level policy π^h for performing clause selection over the clause sequence. In LSTM^h, the hidden state v̂_i ∈ R^d of clause u_i and the memory cell c^h_i at the i-th time-step are given by

    (c^h_i, v̂_i) = f(v_i, c^h_{i-1}, v̂_{i-1}),

where v_i is the vector representation of clause u_i, initialized with the hidden state ŵ_{i,k_i} of the last word x_{i,k_i} in clause u_i. Here, ŵ_{i,k_i} is obtained from the pre-trained LSTM^l (see line 3 of Algorithm 1), and f denotes all gate functions and the update function of LSTM^h. Note that if o_i = 0, LSTM^h skips (i.e., discards) and does not encode clause u_i; the memory cell c^h_i and hidden state v̂_i at the current time-step i are then directly copied from the previous time-step i - 1.
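A minimal sketch of this skip-or-encode update, assuming a single-example PyTorch LSTMCell with d = 200; the function name and interface are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

d = 200
lstm_h = nn.LSTMCell(input_size=d, hidden_size=d)

def high_level_step(v_i, h_prev, c_prev, o_i):
    """One time-step of LSTM^h. v_i: (1, d) clause vector; o_i in {0, 1}."""
    if o_i == 0:                              # discard clause: copy states unchanged
        return h_prev, c_prev
    return lstm_h(v_i, (h_prev, c_prev))      # encode the selected clause

h = torch.zeros(1, d)
c = torch.zeros(1, d)
h, c = high_level_step(torch.randn(1, d), h, c, o_i=1)
```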
In principle, the high-level policy π^h uses a reward to guide clause selection over the clause sequence. It samples an option o_i with probability π^h(o_i | s^h_i; θ^h) at each state s^h_i. More concretely, the state, option and reward of π^h are defined as follows.
• State. The state s^h_i at the i-th time-step should provide adequate information for deciding whether to select a clause for x_aspect.
• Option. π^h samples option o_i ∈ {0, 1} with the conditional probability π^h(o_i | s^h_i; θ^h), which we define with a logistic function, where θ^h is the trainable parameter, ∼ denotes the sampling operation, and σ denotes the sigmoid function. A code sketch of this sampling step is given after the reward definition below.
• Reward. In order to select aspect-relevant clauses inside a clause sequence {u_1, ..., u_n}, given a sampled option trajectory τ^h = (s^h_1, o_1, r^h_1, ..., s^h_n, o_n, r^h_n) ∼ π^h, we compute the high-level cumulative reward r^h_i at the i-th time-step from three different terms: 1) The first term, log cos(v_a, v̂_t), is a cosine intermediate reward computed from the cosine similarity between the aspect embedding v_a ∈ R^d and the hidden state v̂_t ∈ R^d of the t-th clause u_t. This reward provides aspect supervision signals that guide the policy to select aspect-relevant clauses.
2) The second term, r^l(u_t) = Σ_{j=1}^{k_t} r^l_{t,j}, is an intermediate reward from the low-level policy after word selection in the selected clause u_t is finished. Note that if clause u_t is discarded, r^l(u_t) = 0. This reward provides feedback indicating how good the clause selection is.
3) The third term, log p_θ(y | v̂_n), is a delayed reward from the sentiment rating predictor. After LSTM^h finishes all options, we feed the last hidden state v̂_n of LSTM^h to the softmax decoder of the sentiment rating predictor and obtain the rating probability p_θ(y | v̂_n) of the ground-truth rating label y to compute this delayed reward. This reward provides additional signals that guide the policy to select discriminative clauses. Besides, γ is the discount factor, and λ_1, λ_2 and λ_3 are weight parameters.
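Below is a hedged sketch of the option sampling and the high-level reward terms described above, assuming sigmoid–Bernoulli sampling and a simple discounted accumulation of the per-step terms; the exact accumulation of the paper's Eq. (3) is not reproduced here, and class/function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighLevelPolicy(nn.Module):
    """Logistic scoring of the state s^h_i followed by Bernoulli sampling of o_i."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.scorer = nn.Linear(state_dim, 1)       # theta^h (weights and bias)

    def forward(self, state):
        p_select = torch.sigmoid(self.scorer(state)).squeeze(-1)  # pi^h(o_i = 1 | s^h_i)
        o_i = torch.bernoulli(p_select)                            # o_i ~ pi^h
        log_prob = torch.log(torch.where(o_i.bool(), p_select, 1 - p_select))
        return o_i, log_prob                                       # log-prob kept for REINFORCE

def high_level_returns(aspect_emb, clause_states, low_rewards, log_p_y,
                       lambdas=(0.25, 0.25, 0.5), gamma=0.8):
    """Per-step discounted returns built from the three reward terms in the text."""
    l1, l2, l3 = lambdas
    n = len(clause_states)
    step_r = [l1 * torch.log(torch.clamp(F.cosine_similarity(aspect_emb, s, dim=-1),
                                         min=1e-6))   # cosine intermediate reward
              + l2 * low_rewards[t]                    # reward passed up from pi^l
              for t, s in enumerate(clause_states)]
    returns = []
    for i in range(n):
        g = sum((gamma ** (t - i)) * step_r[t] for t in range(i, n))
        returns.append(g + l3 * log_p_y)               # shared delayed predictor reward
    return returns
```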

Word Selection with Low-level Policy
Given the word sequence {x_{i,1}, ..., x_{i,k_i}} of a clause u_i selected by the high-level policy, the low-level policy π^l aims to select the sentiment-relevant words x_{i,j} and discard noisy ones.
During word selection, we again adopt a stochastic policy as the low-level policy π^l, which generates a conditional probability distribution π^l(a | ·) over actions. Similar to clause selection, we adopt another LSTM model LSTM^l to construct the low-level policy π^l for performing word selection over the word sequence {x_{i,1}, ..., x_{i,k_i}} of each clause (note that, as shown in Figure 2, LSTM^l is shared by all clauses selected by the high-level policy). In LSTM^l, the hidden state ŵ_{i,j} ∈ R^d of word x_{i,j} and the memory cell c^l_j at the j-th time-step (here, we omit the clause index and only use j to denote the j-th time-step in the i-th clause u_i) are given analogously to LSTM^h. Note that if a_{i,j} = 0, word x_{i,j} is discarded, and the memory cell and hidden state of the current time-step are directly copied from the previous time-step. We next illustrate the state, action and reward of the low-level policy.
• State. The state s^l_{i,j} ∈ R^{3d} at the j-th time-step should provide adequate information for deciding whether to select a word. It is composed of three parts.
• Action. π^l samples an action a_{i,j} ∈ {0, 1} with the conditional probability π^l(a_{i,j} | s^l_{i,j}; θ^l). Similar to the high-level policy, we adopt a logistic function to define π^l(a_{i,j} | s^l_{i,j}; θ^l).
• Reward. Similarly, in order to select sentiment-relevant words inside a word sequence, we compute the low-level cumulative reward r^l_{i,j} at the j-th time-step from two terms: 1) Similar to the high-level policy, the first term, log p_θ(y | ŵ_{i,k_i}), is a delayed reward provided by the sentiment rating predictor. After LSTM^l finishes all actions, we feed the last hidden state ŵ_{i,k_i} of the i-th clause to the softmax decoder of the sentiment rating predictor to obtain this delayed reward. This reward provides rating supervision that guides the policy to select discriminative, i.e., sentiment-relevant, words.
2) The second term is a penalty delayed reward proportional to -N, where N = Σ_{j=1}^{k_i} a_{i,j} denotes the number of selected words. The basic idea of this penalty reward is to select as few words as possible, because sentiment-relevant words are usually a small subset of all words inside a clause. Note that we could also adopt external sentiment lexicons to achieve this goal, but sentiment lexicons are difficult to obtain in many real-world applications. Besides, λ_1 and λ_2 are the weight parameters of the two terms.
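A minimal sketch of this low-level reward, assuming the penalty is the fraction of selected words in the clause (the exact normalization of the penalty term is not recoverable from the source); names and weights follow the descriptions above.

```python
import torch

def low_level_reward(log_p_y, actions, weights=(0.6, 0.4)):
    """Delayed predictor reward plus a penalty that discourages selecting many words."""
    l1, l2 = weights
    actions = torch.as_tensor(actions, dtype=torch.float)   # a_{i,j} in {0, 1}
    n_selected = actions.sum()                                # N = sum_j a_{i,j}
    penalty = -n_selected / actions.numel()                   # assumed normalization by clause length
    return l1 * log_p_y + l2 * penalty
```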

Sentiment Rating Predictor
The goal of the sentiment rating predictor is twofold. On one hand, during model training, the sentiment rating predictor uses a softmax decoder to provide rating probabilities as the reward signals (see Eq. (3) and Eq. (6)) that guide both clause and word selection.
On the other hand, when model training is finished, i.e., both the high-level and low-level policies finish all their selections, the sentiment rating predictor performs DASC. Specifically, we regard the last state v̂_n of LSTM^h as the representation of all selected clauses and the last state ŵ_{n,k_n} of LSTM^l as the representation of all selected words. We then concatenate v̂_n and ŵ_{n,k_n} to compute the final representation z of the review document D: z = v̂_n ⊕ ŵ_{n,k_n}, where ⊕ denotes the concatenation operation. Finally, to perform DASC, we feed z to a softmax decoder as follows.

Softmax Decoder. We first feed z to a linear layer m = W z + b, where m ∈ R^C is the output vector and θ = {W, b} are trainable parameters. Then, the probability of labeling the document with sentiment rating ŷ ∈ [1, C] is computed by p_θ(ŷ | z) = exp(m_ŷ) / Σ_{c=1}^{C} exp(m_c). Finally, the label with the highest probability is the predicted sentiment rating for x_aspect.
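A minimal PyTorch sketch of this decoder, assuming d = 200 and a 5-way rating scale; the class name and the use of log-probabilities are illustrative choices rather than the authors' code.

```python
import torch
import torch.nn as nn

class RatingPredictor(nn.Module):
    """Concatenate the last high-level and low-level states and decode with softmax."""
    def __init__(self, d: int = 200, num_ratings: int = 5):
        super().__init__()
        self.decoder = nn.Linear(2 * d, num_ratings)        # W, b

    def forward(self, v_n, w_n):
        z = torch.cat([v_n, w_n], dim=-1)                   # z = v_n (+) w_n
        return torch.log_softmax(self.decoder(z), dim=-1)   # log p_theta(y | z)

# usage sketch
predictor = RatingPredictor()
log_probs = predictor(torch.randn(200), torch.randn(200))
predicted_rating = int(log_probs.argmax()) + 1              # ratings are 1-indexed
```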
Note that, for an aspect, if no clauses or words are finally selected from a review document, the model assigns a random rating for this aspect.

Model Training via Policy Gradient and Back-Propagation
The parameters in HRL are learned according to Algorithm 1. Specifically, these parameters can be divided into two groups: 1) θ^h and θ^l of the high-level policy π^h and the low-level policy π^l, respectively; 2) θ of LSTM^h, LSTM^l and the softmax decoder.
For θ^h of the high-level policy, we optimize with the policy gradient method REINFORCE (Williams, 1992; Sutton et al., 1999a). The policy gradient w.r.t. θ^h is obtained by differentiating the expected reward J(θ^h) to be maximized:

    ∇_{θ^h} J(θ^h) = E_{τ^h ∼ π^h} [ Σ_{i=1}^{n} R^h ∇_{θ^h} log π^h(o_i | s^h_i; θ^h) ],    (7)

where R^h = r^h_i - b(τ^h) is the advantage estimate of the high-level reward. Here, b(τ^h) is the baseline (Williams, 1992), which is used to reduce the variance of the high-level reward without altering its expectation. In practice, we sample several trajectories τ^h_1, τ^h_2, ..., τ^h_m over the clause sequence with the current high-level policy, assign each trajectory a reward according to the designed reward function, and estimate b(τ^h) as the average of those rewards. Similarly, the policy gradient w.r.t. θ^l of the low-level policy is given by

    ∇_{θ^l} J(θ^l) = E_{τ^l ∼ π^l} [ Σ_{j=1}^{k_i} R^l ∇_{θ^l} log π^l(a_{i,j} | s^l_{i,j}; θ^l) ],    (8)

where R^l = r^l_{i,j} - b(τ^l) is the advantage estimate of the low-level reward and b(τ^l) is likewise used to reduce the variance of the low-level reward.

Algorithm 1: Hierarchical reinforcement learning
1: Input: corpus C; a review document D with a clause sequence {u_1, ..., u_n}; a clause u_i with a word sequence {x_{i,1}, ..., x_{i,k_i}}.
2: Initialize parameters θ, θ^h and θ^l randomly;
3: Pre-train LSTM^l by forcing π^l to select all words for classification, and update θ of LSTM^l by Eq. (9);
4: Pre-train LSTM^h by forcing π^h to select all clauses for classification, and update θ of LSTM^h by Eq. (9);
5: Fix all parameters θ and update θ^h, θ^l as follows:
6: for review document D ∈ C do
7:   for clause u_i ∈ {u_1, ..., u_n} do
8:     Sample option o_i ∼ π^h(o_i | s^h_i; θ^h);
9:     if option o_i = 1 then
10:       for word x_{i,j} ∈ {x_{i,1}, ..., x_{i,k_i}} do
11:         Sample action a_{i,j} ∼ π^l(a_{i,j} | s^l_{i,j}; θ^l);
12:       end for
13:       Compute r^l_{i,j} by Eq. (6);
14:       Update θ^l by Eq. (8);
15:     end if
16:   end for
17:   Compute r^h_i by Eq. (3) and update θ^h by Eq. (7);
18: end for
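A hedged sketch of the REINFORCE update with an averaged-return baseline, as described above; the surrogate-loss formulation (negative advantage-weighted log-probabilities) is a standard implementation choice rather than the paper's exact code.

```python
import torch

def reinforce_loss(log_probs, returns, baseline):
    """Policy-gradient surrogate loss for one sampled trajectory.

    log_probs: list of log pi(o_i | s_i) tensors; returns: matching per-step returns;
    baseline: scalar b(tau), e.g. the mean return over m sampled trajectories.
    """
    loss = torch.zeros(())
    for lp, r in zip(log_probs, returns):
        advantage = (torch.as_tensor(r) - baseline).detach()  # R = r - b(tau), no grad through reward
        loss = loss - advantage * lp                           # minimizing this ascends J(theta)
    return loss

# baseline estimated from several trajectories sampled with the current policy:
# baseline = torch.stack([trajectory_return(t) for t in sampled_trajectories]).mean()
```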
For θ, we optimize with back-propagation. The objective of learning θ is to minimize the cross-entropy loss in the classification phase:

    J(θ) = - Σ_{(D, x_aspect, y) ∈ C} log p_θ(y | z) + δ ||θ||^2,    (9)

where (D, x_aspect, y) denotes a review document D with a given aspect x_aspect from corpus C, y is the ground-truth sentiment rating for aspect x_aspect, and δ is the weight of the L2 regularization term.
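A small sketch of this supervised objective, assuming the predictor returns log-probabilities as in the earlier decoder sketch and that δ = 1e-5 is the regularization weight; the function name and interface are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_loss(log_probs, gold_rating, parameters, delta=1e-5):
    """Cross-entropy on the gold rating plus L2 regularization over theta."""
    ce = F.nll_loss(log_probs.unsqueeze(0),
                    torch.tensor([gold_rating - 1]))   # ratings 1..C mapped to 0..C-1
    l2 = sum((p ** 2).sum() for p in parameters)
    return ce + delta * l2
```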

Experimental Settings
Data. We conduct our experiments on three public DASC datasets, i.e., TripUser, TripAdvisor (Wang et al., 2010) and BeerAdvocate (McAuley et al., 2012; Lei et al., 2016). In the experiments, we adopt a discourse segmentation tool to segment all reviews in the three datasets into EDUs (i.e., clauses). Moreover, we adopt the training/development/testing split (8:1:1) following Yin et al. (2017). Table 1 shows the statistics of the three datasets.

Implementation Details. We adopt the pre-trained 200-dimension word embeddings provided by Yin et al. (2017). The dimension of the LSTM hidden states is set to 200. The other hyper-parameters are tuned according to performance on the development set. Specifically, we adopt the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.012 for cross-entropy training and the SGD optimizer with a learning rate of 0.008 for all policy-gradient training. For the rewards of the high-level and low-level policies, γ is 0.8; λ_1, λ_2 and λ_3 of the high-level reward are 0.25, 0.25 and 0.5, respectively, and λ_1, λ_2 of the low-level reward are 0.6 and 0.4. Additionally, the batch size is set to 64, the regularization weight to 10^-5, and the dropout rate to 0.2.
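For convenience, the settings above can be collected into a single configuration; the key names below are illustrative, while the values are taken from the paper.

```python
# Training configuration reported in the paper, gathered as a plain dict for reference.
config = {
    "word_embedding_dim": 200,
    "lstm_hidden_dim": 200,
    "cross_entropy_optimizer": ("Adam", {"lr": 0.012}),
    "policy_gradient_optimizer": ("SGD", {"lr": 0.008}),
    "gamma": 0.8,
    "high_level_lambdas": (0.25, 0.25, 0.5),
    "low_level_lambdas": (0.6, 0.4),
    "batch_size": 64,
    "l2_weight": 1e-5,
    "dropout": 0.2,
    "split_ratio": (8, 1, 1),   # train / dev / test
}
```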
Evaluation Metrics. Performance is evaluated using Accuracy (Acc.) and MSE, following Yin et al. (2017). Moreover, a t-test is used to evaluate the significance of the performance difference between two approaches (Yang and Liu, 1999).
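For reference, minimal implementations of the two metrics over integer rating labels (the function names are our own).

```python
from typing import Sequence

def accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of documents whose predicted rating equals the gold rating."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def mse(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Mean squared error between predicted and gold ratings."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)
```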
Baselines. We compare HRL with the following baselines: 1) SVM (Yin et al., 2017). This approach adopts only unigram and bigram features to train an SVM classifier. 2) LSTM (Tang et al., 2015). This is a neural network approach to document-level sentiment classification which employs a gated LSTM to learn the text representation.
3) MAMC (Yin et al., 2017). This approach employs hierarchical iterative attention to learn aspect-specific representations. This is a state-of-the-art approach to DASC. 4) HARN. This approach adopts hierarchical attention to incorporate overall rating and aspect information so as to learn aspect-specific representations. This is another state-of-the-art approach to DASC. 5) HUARN. This approach extends HARN by integrating additional user information. This is another state-of-the-art approach to DASC. 6) C-HAN. This approach adopts hierarchical attention to incorporate clause and aspect information so as to learn the text representation. Although this is a state-of-the-art approach to sentence-level ASC, it can also be directly applied to DASC. 7) HS-LSTM. This is a reinforcement learning approach to text classification which employs a hierarchical LSTM to learn the text representation. 8) RL-Word-Selection. Our approach leveraging only the word selection strategy, via the low-level policy. 9) RL-Clause-Selection. Our approach leveraging only the clause selection strategy, via the high-level policy.

Experimental Results
Four state-of-the-art ASC approaches, MAMC, HARN, HUARN and C-HAN, all perform better than LSTM. These results confirm the helpfulness of considering aspect information in DASC. Besides, we find that the reinforcement learning based approach HS-LSTM, without considering aspect information, achieves performance comparable to MAMC, HARN and C-HAN, and even beats MAMC on the two datasets TripUser and TripAdvisor, which demonstrates that reinforcement learning is a good choice for learning text representations for DASC.

Our approaches RL-Word-Selection and RL-Clause-Selection outperform most of the above approaches and perform only slightly worse than HUARN. This result encourages performing clause or word selection in DASC. Among all these approaches, our approach HRL performs best and significantly outperforms (p-value < 0.01) the strong baseline HUARN, which actually considers other kinds of external information, such as the overall rating and user information. These results encourage performing both clause and word selection in DASC.
Ablation Study. Further, we conduct an ablation study of HRL to evaluate the contribution of each component. The results are shown in Table 3. From this table, we can see that: 1) Using the cosine intermediate reward in Eq. (3) improves Acc. by 2.87% on average across the three datasets.
2) Using the penalty delayed reward in Eq. (6) improves Acc. by 1.23% on average. 3) Concatenating the additional representation of the selected clauses in the rating predictor improves Acc. by 4.38%. 4) Concatenating the additional representation of the selected words in the rating predictor improves Acc. by 2.72%. 5) Using clause splitting instead of sentence splitting improves Acc. by 2.44%. This confirms that it is more appropriate to use clauses as the segmentation units than sentences, since 90% of clauses contain only one opinion expression, as reported in Bayoudhi et al. (2015). For instance, as shown in Figure 1, if we use sentences as the segmentation units, Clause1-Clause3 are assigned to one unit although they talk about two aspects, i.e., location and room. In this scenario, sentence selection will not be able to discard the noisy parts inside the unit.

Analysis and Discussion
Analysis of HRL Training. Figure 3 shows the average rewards (per epoch) of the high-level and low-level policies on BeerAdvocate. To clearly observe the change of the reward, following Lillicrap et al. (2016), all rewards are normalized to (0, 1). From this figure, we can see that both the high-level and low-level rewards increase as the training algorithm iterates. This result indicates that our HRL approach is capable of stably revising its policies to obtain more discriminative clauses and words for better performance.
Analysis of Clause and Word Selection. Figure 4 visualizes the clause selection and word selection of our HRL approach on a review document for aspects (a) location and (b) room; red denotes a selected clause, blue denotes a selected word, and other colors denote discarded tokens. From this figure, we can see that HRL is able to precisely select the aspect-relevant clauses, i.e., Clause1 and Clause2, for aspect location, and Clause3 and Clause4 for aspect room. Further, HRL is able to select the sentiment-relevant words, such as "close" and "very convenient" for aspect location, and "a little uncomfortable" and "nitpicking" for aspect room.
Error Type Breakdown. We analyze the error cases in the experiments and broadly categorize them into three types: (1) The first type of errors is due to negation words. For instance, for the review "The taste of this beer is not good, don't buy it", HRL can precisely select the sentiment word "good", but fails to select the negation word "not". This inspires us to optimize our approach to better capture negation scope in future work.
(2) The second type of errors is due to comparative opinions. For instance, for the review "The room of Sheraton is much better than this one.", HRL incorrectly predicts a high rating (5 stars) for aspect room, since the selected positive words describe another hotel rather than the one under review.

Related Work
Aspect Sentiment Classification. Traditional studies on DASC mainly focus on feature engineering to explore efficient features (Titov and McDonald, 2008; Lu et al., 2011; McAuley et al., 2012). Recently, neural networks, which automatically mine features, have shown promising results on DASC. Lei et al. (2016) focused on extracting rationales for aspects and built a neural text regressor to predict aspect ratings; Yin et al. (2017) focused on using hierarchical iterative attention to learn aspect-specific text representations for DASC; a hierarchical attention approach to DASC that incorporates both external user and overall rating information has also been proposed. Besides, neural networks have been widely adopted for a closely related task, i.e., sentence-level Aspect Sentiment Classification (Wang et al., 2016; Tang et al., 2016; Wang and Lu, 2018).

Reinforcement Learning. In recent years, reinforcement learning has been applied successfully to several NLP tasks. Guo (2015) employed deep Q-learning to improve the seq2seq model for text generation; Li et al. (2016) showed how to apply deep reinforcement learning to model future reward in chatbot dialogue; Takanobu et al. (2018) employed hierarchical reinforcement learning for relation extraction; and prior work combining LSTMs with reinforcement learning to learn structured representations for text classification is inspirational to our approach.
Unlike all the above studies, inspired by the cognitive process of human beings, this paper proposes a new HRL approach to the DASC task. To the best of our knowledge, this is the first attempt to address DASC with HRL.

Conclusion
In this paper, we propose a hierarchical reinforcement learning approach to DASC. The main idea of the proposed approach is to perform sentiment classification the way human beings do. Specifically, our approach employs a high-level policy and a low-level policy to perform clause selection and word selection in DASC, respectively. Experiments show that both clause and word selection are effective for DASC and that the proposed approach significantly outperforms several state-of-the-art baselines.
In future work, we would like to address other challenges in DASC, e.g., the negation detection problem, to further improve performance. Furthermore, we would like to apply our HRL approach to other sentiment analysis tasks, such as aspect and opinion co-extraction and dialogue-level sentiment analysis.