Document-Level Multi-Aspect Sentiment Classification as Machine Comprehension

Document-level multi-aspect sentiment classification is an important task for customer relation management. In this paper, we model the task as a machine comprehension problem where pseudo question-answer pairs are constructed from a small number of aspect-related keywords and aspect ratings. A hierarchical iterative attention model is introduced to build aspect-specific representations through frequent and repeated interactions between documents and aspect questions. We adopt a hierarchical architecture to represent both word-level and sentence-level information, and apply attention operations to aspect questions and documents alternately with a multiple-hop mechanism. Experimental results on the TripAdvisor and BeerAdvocate datasets show that our model outperforms classical baselines. We will release our code and data to support replication of the method.


Introduction
Document-level sentiment classification is one of the practical sentiment analysis tasks (Pang and Lee, 2007; Liu, 2010). Many Web sites, such as TripAdvisor, Yelp, and Amazon, provide platforms for users to post reviews of products or services. Most reviews are comprehensive and therefore long documents. Analyzing these documents to predict ratings of products or services is an important complementary way to improve customer relationship management. Recently, neural network based approaches have been developed and have become the state of the art for long-document sentiment classification (Tang et al., 2015a,b). However, predicting an overall score for each long document is not enough, because the document can mention different aspects of the corresponding product or service. For example, in Figure 1, there could be different aspects for a review of a hotel. These aspects help customer service better understand the major pros and cons of the product or service. Compared to the overall rating, users are less motivated to give aspect ratings. Therefore, it is more practically useful to perform document-level multi-aspect sentiment classification, predicting a rating for each aspect rather than a single overall rating.
Figure 1: Example: hotel review with aspects. Review: "The situation is good, it's very clean, but there is nothing special. Breakfast at downstairs is directly from grocery store. Water pressure is good! A decent choice for sleeping. New York is expensive place!" Ratings: Cleanliness: 5, Room: 4, Value: 2.
One straightforward approach for document-level multi-aspect sentiment classification is multi-task learning (Caruana, 1997). For neural networks, we can simply treat each aspect (e.g., rating from one to five) as a classification task, and let the tasks use separate softmax classifiers to extract task-specific representations at the top layer while sharing the input and hidden layers to mutually enhance the prediction results (Collobert et al., 2011; Luong et al., 2016). However, such an approach ignores the fact that the aspects themselves have semantic meanings. For example, as human beings, if we were asked to evaluate the aspect rating of a document, we would simply read the review, find aspect-related keywords, and look at the surrounding comments. Then, we would aggregate all the related snippets to make a decision.
In this paper, we propose a novel approach that treats document-level multi-aspect sentiment classification as a machine comprehension (Kumar et al., 2016; Sordoni et al., 2016) problem. To mimic how humans evaluate aspects, we create a list of keywords for each aspect. For example, when we work on the Room aspect, we generate keywords such as "room," "bed," and "view." Then we can ask pseudo questions, "How is the room?", "How is the bed?", and "How is the view?", and provide the answer "Rating 5." In this way, we can train a machine comprehension model to automatically attend to the corresponding text snippets in the review document to predict the aspect rating. Specifically, we introduce a hierarchical and iterative attention model to construct aspect-specific representations. We use a hierarchical architecture to build up different representations at both the word and sentence levels, interacting with aspect questions. At each level, the model consists of input encoders and iterative attention modules. The input encoder learns memories of documents and questions with a bi-directional LSTM (Bi-LSTM) model and a non-linear mapping respectively. The iterative attention module takes these memories as input and attends to them sequentially with a multiple-hop mechanism, performing effective interactions between documents and aspect questions.
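To make the pseudo question-answer construction concrete, the following minimal Python sketch pairs the keyword questions of an aspect with its rating. The function name `build_pseudo_qa`, the dictionary layout, and the question template are illustrative assumptions, not part of the released code; the model itself consumes the keywords rather than full question strings.

```python
def build_pseudo_qa(aspect_keywords, aspect_ratings):
    """Pair the keyword questions of each aspect with that aspect's rating.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for aspect, rating in aspect_ratings.items():
        # One pseudo question per keyword, following the "How is the ...?" pattern.
        questions = [f"How is the {kw}?" for kw in aspect_keywords[aspect]]
        pairs.append({
            "aspect": aspect,
            "questions": questions,
            "answer": f"Rating {rating}",
        })
    return pairs


# Keywords and rating taken from the Room example in the text.
aspect_keywords = {"Room": ["room", "bed", "view"]}
aspect_ratings = {"Room": 5}
pairs = build_pseudo_qa(aspect_keywords, aspect_ratings)
```

Each entry of `pairs` holds the three pseudo questions for Room together with the answer "Rating 5".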
To evaluate the effectiveness of the proposed model, we conduct extensive experiments on the TripAdvisor and BeerAdvocate datasets, and the results show that our model outperforms typical baselines. We also analyze the effects of the number of hops and the number of aspect words on performance. Moreover, a case study of attention results is performed at both the word and sentence levels.
The contributions of this paper are two-fold. First, we study document-level multi-aspect sentiment classification as a machine comprehension problem and introduce a hierarchical iterative attention model for it. Second, we demonstrate the effectiveness of the proposed model on two datasets, showing that our model outperforms classical baselines. The code and data for this paper are available at https://github.com/HKUST-KnowComp/DMSCMC.

Method
In this section, we introduce our proposed method.

Problem Definition and Hierarchical Framework
We first briefly introduce the problem we work on. Given a review, our task is to predict the ratings of different aspects. For example, in Figure 1, we predict the ratings of Cleanliness, Room, and Value. To achieve this, we assume that there are existing reviews with aspect ratings for machines to learn from. Formally, we denote the review document as $d$, containing a set of $T_d$ sentences $\{s_1, s_2, \ldots, s_{T_d}\}$. For the $t$-th sentence $s_t$, we use a set of words $w_1, w_2, \ldots, w_{|s_t|}$ to represent it, and use $\mathbf{w}_i$, $\mathbf{w}^w_i$ and $\mathbf{w}^p_i$ as the one-hot encoding, word embedding, and phrase embedding for $w_i$ respectively. The phrase embedding encodes the semantics of the phrase centered on the current word $w_i$ (e.g., hidden vectors learned by Bi-LSTM, shown in Section 2.2). For each aspect $q_k$ of the $K$ aspects $\{q_1, q_2, \ldots, q_K\}$, we use $N_k$ aspect-related keywords, $q_{k_1}, q_{k_2}, \ldots, q_{k_{N_k}}$, to represent it. Similarly, we use $\mathbf{q}_{k_i}$ and $\mathbf{q}^w_{k_i}$ as the one-hot encoding and word embedding for $q_{k_i}$ respectively.
There are several sophisticated methods for choosing aspect keywords (e.g., topic models). Here, we consider a simple approach in which five seed words are first manually selected for each aspect and more words are then obtained based on their cosine similarities with the seeds.
As shown in Figure 2 (left), our framework follows the idea of multi-task learning, which learns different aspects simultaneously. In this case, all tasks share the word representations and the architecture of the semantic model that feeds the final classifiers. Different from straightforward neural network based multi-task learning (Collobert et al., 2011), for each document $d$ and aspect $q_k$, our model uses both the content of $d$ and all the related keywords $q_{k_1}, q_{k_2}, \ldots, q_{k_{N_k}}$ as input. Since the keywords can cover most of the semantic meanings of the aspect, and we do not know which document mentions which semantic meaning, we build an attention model to decide this automatically (introduced in Section 2.3). Assuming that the keywords have been decided, we use a hierarchical attention model to select useful information from the review documents. As shown in Figure 2 (right), the hierarchical attention over keywords is applied at both the sentence level (to select meaningful words) and the document level (to select meaningful sentences). Thus, our model builds aspect-specific representations in a bottom-up manner.
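The seed-expansion step can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `expand_keywords`, the toy vocabulary and vectors, and the choice to score each candidate by its maximum cosine similarity to any seed are all assumptions, since the text only states that words are obtained based on cosine similarity with the seeds.

```python
import numpy as np


def expand_keywords(seeds, vocab, emb, n_keywords):
    """Return the seeds plus the most similar vocabulary words, n_keywords in total.

    Assumption: a candidate's score is its maximum cosine similarity to any seed.
    """
    # L2-normalise rows so that dot products equal cosine similarities.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    index = {word: i for i, word in enumerate(vocab)}
    seed_ids = [index[s] for s in seeds]
    scores = (unit @ unit[seed_ids].T).max(axis=1)
    scores[seed_ids] = -np.inf  # seeds are kept separately, never re-selected
    n_extra = max(n_keywords - len(seeds), 0)
    ranked = np.argsort(-scores)[:n_extra]
    return list(seeds) + [vocab[i] for i in ranked]


# Toy 2-d embeddings (hypothetical values, for illustration only).
vocab = ["room", "bed", "view", "price", "cheap"]
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])
keywords = expand_keywords(["room"], vocab, emb, n_keywords=3)
```

With these toy vectors, "bed" and "view" are closest to the seed "room", so they are the two words added.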
Specifically, we obtain sentence representations $s^k_1, s^k_2, \ldots, s^k_{T_d}$ using the input encoder (Section 2.2) and the iterative attention module (Section 2.3) at the word level. Then we take the sentence representations and the $k$-th aspect as input and apply the sentence-level input encoder and attention model to generate the document representation $d_k$ for final classification. As shown in Figure 2 (right), the attention model is applied twice, at different levels of the representation.

Input Encoder
The input module builds memory vectors for the iterative attention module and is applied at both the word and sentence levels. For a document, it converts the word sequence into the word-level memory $M^d_w$ and the sentence sequence into the sentence-level memory $M^d_s$. For an aspect question $q_k$, it takes the set of aspect-specific words $\{q_{k_i}\}_{1 \le i \le N_k}$ as input and derives the word-level memory $M^q_w$ and the sentence-level memory $M^q_s$. To construct $M^d_w$, we obtain word embeddings $\mathbf{w}^w_1, \mathbf{w}^w_2, \ldots, \mathbf{w}^w_{|s_t|}$ from an embedding matrix $E_A$ that covers all words in the corpus. Then, an LSTM (Hochreiter and Schmidhuber, 1997) is used as the encoder to produce hidden vectors of words based on the word embeddings. At each step, the LSTM takes the input $\mathbf{w}^w_t$ and derives a new hidden vector by $h_t = \mathrm{LSTM}(\mathbf{w}^w_t, h_{t-1})$. To preserve the subsequent context of each word, another LSTM is run over the word sequence in reverse order. Then the forward hidden vector $\overrightarrow{h}_t$ and the backward hidden vector $\overleftarrow{h}_t$ are concatenated as the phrase embedding $\mathbf{w}^p_t$. We stack these phrase embeddings together as the word-level memory $M^d_w$. Similarly, we feed sentence representations into another Bi-LSTM to derive the sentence-level memory $M^d_s$. Note that the sentence representations are obtained using the iterative attention module, as described by Eq. (5) in Section 2.3.
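The construction of the word-level memory can be illustrated with a small NumPy sketch. It uses the standard LSTM gate equations and randomly initialised weights; the helper names (`lstm_step`, `run_lstm`, `bilstm_memory`, `init_params`), the gate ordering, and the toy dimensions are assumptions made for illustration and are not taken from the released implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, params):
    """One standard LSTM step; gates stacked as [input, forget, output, candidate]."""
    W, U, b = params
    n = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c


def run_lstm(X, params, n_hidden, reverse=False):
    """Run an LSTM over the rows of X and return one hidden vector per position."""
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    out = np.zeros((len(X), n_hidden))
    order = reversed(range(len(X))) if reverse else range(len(X))
    for t in order:
        h, c = lstm_step(X[t], h, c, params)
        out[t] = h
    return out


def bilstm_memory(X, fwd_params, bwd_params, n_hidden):
    """Concatenate forward and backward hidden vectors into phrase embeddings."""
    forward = run_lstm(X, fwd_params, n_hidden)
    backward = run_lstm(X, bwd_params, n_hidden, reverse=True)
    return np.concatenate([forward, backward], axis=1)


def init_params(rng, n_in, n_hidden):
    # Hypothetical random initialisation, for illustration only.
    return (rng.normal(scale=0.1, size=(4 * n_hidden, n_in)),
            rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden)),
            np.zeros(4 * n_hidden))


rng = np.random.default_rng(0)
n_words, n_in, n_hidden = 7, 5, 3          # toy sizes
X = rng.normal(size=(n_words, n_in))       # stands in for the word embeddings of one sentence
fwd = init_params(rng, n_in, n_hidden)
bwd = init_params(rng, n_in, n_hidden)
M_d_w = bilstm_memory(X, fwd, bwd, n_hidden)  # one row (phrase embedding) per word
```

Each row of `M_d_w` has dimension `2 * n_hidden` because the forward and backward states are concatenated.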
Since we have question keywords as input, we also build a question memory to allow interactions between questions and documents, in the following way. We obtain $Q_k = \{\mathbf{q}^w_{k_i}\}_{1 \le i \le N_k}$ by looking up an embedding matrix $E_B$ that covers all question keywords. Then a non-linear mapping of $Q_k$ is applied to obtain the word-level question memory $M^q_w$, where $W^q_w$ is the parameter matrix that adapts $q_k$ at the word level. Similarly, we use another mapping to obtain the sentence-level question memory $M^q_s$, where $W^q_s$ is the parameter matrix that adapts $q_k$ at the sentence level.
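A minimal sketch of the question memory follows. The text specifies only a non-linear mapping with a parameter matrix per level; the use of tanh, the matrix shapes, the vocabulary size, and the random values below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_keywords, emb_dim, mem_dim = 20, 200, 200   # illustrative sizes
vocab_size = 1000                              # hypothetical keyword vocabulary size

E_B = rng.normal(scale=0.1, size=(vocab_size, emb_dim))    # keyword embedding matrix
keyword_ids = rng.integers(0, vocab_size, size=n_keywords)  # hypothetical keyword indices

Q_k = E_B[keyword_ids]                         # (N_k, emb_dim): embeddings of the aspect keywords
W_q_w = rng.normal(scale=0.1, size=(emb_dim, mem_dim))      # word-level parameter matrix
W_q_s = rng.normal(scale=0.1, size=(emb_dim, mem_dim))      # sentence-level parameter matrix

# Assumption: tanh is used as the non-linearity.
M_q_w = np.tanh(Q_k @ W_q_w)   # word-level question memory, one row per keyword
M_q_s = np.tanh(Q_k @ W_q_s)   # sentence-level question memory, one row per keyword
```

Both memories share the same keyword embeddings $Q_k$ and differ only in the parameter matrix, which lets the same keywords play different roles at the two levels.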

Iterative Attention Module
The iterative attention module (IAM) attends to and reads the memories of questions and documents alternately with a multi-hop mechanism, deriving aspect-specific sentence and document representations. As discussed in the introduction, the set of selected question keywords may not best characterize the aspect for every document. Thus, the IAM introduces a backward attention that uses document information (words or sentences) to select the useful keywords of each aspect, forming a document-specific question on which the attention model is built.
The IAM is illustrated in Figure 3. To obtain sentence representations, it takes $M^d_w$ and $M^q_w$ as input and performs $m$ iterations (hops). In each iteration, the IAM conducts four operations: (1) it attends to the question memory using the selective vector $p$ and summarizes the question memory vectors into a single vector $\hat{q}$; (2) it updates the selective vector using the previous one and $\hat{q}$; (3) it attends to the document (content) memory based on the updated selective vector and summarizes the memory vectors into a single vector $\hat{c}$; (4) it updates the selective vector using the previous one and $\hat{c}$.
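We unify operations (1) and (3) by an attention function $\hat{x} = A(p, M)$, where the memory $M$ is the question memory in operation (1), giving $\hat{x} = \hat{q}$, and the document memory in operation (3), giving $\hat{x} = \hat{c}$. The attention function $A$ scores each memory vector against the selective vector and returns the weighted sum $\hat{x} = \sum_i a_i M_i$. In its decomposition, $\mathbf{1}$ is a vector whose elements are all 1, which copies the selective vector to meet the dimension requirement; $W_a$ and $v_a$ are parameters; $a$ is the vector of attention weights over the memory vectors; and $M_i$ denotes the $i$-th row of $M$.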
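One possible instantiation of the attention function $A$ is sketched below. The scoring form $H = \tanh(M W_a \odot (\mathbf{1} p))$ followed by $a = \mathrm{softmax}(H v_a)$ is an assumption consistent with the roles of $\mathbf{1}$, $W_a$, $v_a$, $a$ and $M_i$ described above, and should not be read as the exact decomposition; the function names and toy sizes are likewise illustrative.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()


def attend(p, M, W_a, v_a):
    """Attention A(p, M): weight the rows of M by their match with p.

    Assumed scoring: H = tanh((M @ W_a) * p), a = softmax(H @ v_a).
    Returns the summary vector x_hat and the attention weights a.
    """
    H = np.tanh((M @ W_a) * p[None, :])  # broadcasting p over rows plays the role of 1 * p
    a = softmax(H @ v_a)                 # one weight per memory row
    x_hat = a @ M                        # weighted sum of memory rows
    return x_hat, a


rng = np.random.default_rng(0)
n_rows, dim = 6, 4                       # toy sizes
M = rng.normal(size=(n_rows, dim))       # a memory, e.g. a document or question memory
W_a = rng.normal(size=(dim, dim))
v_a = rng.normal(size=dim)
p = rng.normal(size=dim)                 # selective vector
x_hat, a = attend(p, M, W_a, v_a)
```

Under this assumed form, a zero selective vector produces uniform weights, so the first read of a memory is a plain average of its rows.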
Operations (2) and (4) are formulated as an update function $p_{2i-\{l\}} = U(\hat{x}, p_{2i-\{l\}-1})$, where $i$ is the hop index and $l$ can be 0 or 1, which corresponds to $\hat{x} = \hat{c}$ or $\hat{x} = \hat{q}$ respectively. We initialize $p_0$ as a zero vector. The update function $U$ can be a recurrent neural network (Xiong et al., 2017) or another heuristic weighting function. In this paper, we introduce a simple strategy, $p_{2i-\{l\}} = \hat{x}$, which ignores the previous selective vector but obtained results comparable to those of more complicated functions in our initial experiments. The multi-hop mechanism attends to different memory locations in different hops (Sukhbaatar et al., 2015), capturing different interactions between documents and questions. To preserve the information from these various interactions, we concatenate the $\hat{c}$ vectors of all hops as the final representation of a sentence:
$$s = [\hat{c}_1; \hat{c}_2; \ldots; \hat{c}_m] \qquad (5)$$
After obtaining the sentence representations, we feed them into the sentence-level input encoder, deriving the memories $M^d_s$ and $M^q_s$. Then, the aspect-specific document representation $d_k$ is obtained by the sentence-level IAM in a similar way.
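The four operations and the concatenation in Eq. (5) can be put together in a short sketch. The loop structure, the zero initialisation, and the simple update follow the description above; the attention scoring form, the sharing of attention parameters across hops, and all names and sizes are assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def attend(p, M, W_a, v_a):
    # Assumed scoring form; see the description of the attention function A.
    H = np.tanh((M @ W_a) * p[None, :])
    a = softmax(H @ v_a)
    return a @ M


def iterative_attention(M_d, M_q, doc_params, q_params, hops):
    """Alternate between question and document memories for `hops` iterations."""
    p = np.zeros(M_d.shape[1])               # p_0 is the zero vector
    c_hats = []
    for _ in range(hops):
        q_hat = attend(p, M_q, *q_params)    # (1) read the question memory
        p = q_hat                            # (2) simple update: p <- q_hat
        c_hat = attend(p, M_d, *doc_params)  # (3) read the document memory
        p = c_hat                            # (4) simple update: p <- c_hat
        c_hats.append(c_hat)
    return np.concatenate(c_hats)            # Eq. (5): [c_1; c_2; ...; c_m]


rng = np.random.default_rng(0)
dim, n_words, n_keywords, hops = 4, 6, 3, 2  # toy sizes
M_d = rng.normal(size=(n_words, dim))        # word-level document memory of one sentence
M_q = rng.normal(size=(n_keywords, dim))     # word-level question memory
# Assumption: one (W_a, v_a) pair per memory type, shared across hops.
doc_params = (rng.normal(size=(dim, dim)), rng.normal(size=dim))
q_params = (rng.normal(size=(dim, dim)), rng.normal(size=dim))
s = iterative_attention(M_d, M_q, doc_params, q_params, hops)
```

The resulting sentence representation has dimension `hops * dim`, which is why the number of hops directly affects the size of the input to the sentence-level encoder.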

Objective Function
For each aspect, we obtain the aspect-specific document representations $\{d_k\}_{1 \le k \le K}$. These representations are fed into classifiers, each of which includes a softmax layer. The softmax layer outputs the probability distribution over the $|Y|$ categories for the distributed representation, which is defined as:
$$p'(d, k) = \mathrm{softmax}(W^{class}_k d_k),$$
where $W^{class}_k$ is the parameter matrix. We define the cross-entropy between the gold sentiment distribution $p(d, k)$ and the predicted sentiment distribution $p'(d, k)$ as the classification loss function:
$$L = -\sum_{d \in D} \sum_{k=1}^{K} \sum_{i=1}^{|Y|} p_i(d, k) \log p'_i(d, k),$$
where the sums run over the training documents $D$, the $K$ aspects, and the $|Y|$ classes, and $p(d, k)$ is a one-hot vector with the same dimension as the number of classes, in which only the dimension associated with the ground-truth label is one and all others are zero.
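The per-aspect classifier and loss can be sketched in a few lines. Because the gold distribution is one-hot, the cross-entropy for one document and one aspect reduces to the negative log-probability of the gold class. The function name `aspect_cross_entropy`, the toy sizes, and the random values are illustrative assumptions.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def aspect_cross_entropy(d_k, W_class_k, gold_label):
    """Softmax classifier for one aspect and its cross-entropy with a one-hot gold label."""
    probs = softmax(W_class_k @ d_k)          # predicted distribution over classes
    return -np.log(probs[gold_label]), probs  # one-hot gold keeps only the gold-class term


rng = np.random.default_rng(0)
n_classes, dim, n_aspects = 5, 8, 3           # toy sizes
doc_reprs = rng.normal(size=(n_aspects, dim))                # one d_k per aspect
classifiers = rng.normal(scale=0.1, size=(n_aspects, n_classes, dim))
gold = [4, 3, 1]                              # hypothetical gold class indices (0-based)

losses = [aspect_cross_entropy(doc_reprs[k], classifiers[k], gold[k])[0]
          for k in range(n_aspects)]
total_loss = float(np.sum(losses))            # loss of one document, summed over aspects
```

Summing `total_loss` over all training documents gives the full objective.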

Experiment
In this section, we show experimental results to demonstrate our proposed algorithm.

Datasets
We conduct our experiments on the TripAdvisor (Wang et al., 2010) and BeerAdvocate (McAuley et al., 2012; Lei et al., 2016) datasets, which contain seven aspects (value, room, location, cleanliness, check in/front desk, service, and business service) and four aspects (feel, look, smell, and taste) respectively. We follow the processing step of Lei et al. (2016) by choosing the reviews with different aspect ratings; the resulting datasets are described in Table 1. We tokenize the datasets with Stanford CoreNLP and randomly split them into training, development, and testing sets in an 80/10/10% ratio.
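A random 80/10/10 split of this kind can be sketched as follows. The function name `split_dataset`, the fixed seed, and the integer stand-ins for reviews are assumptions for illustration; the actual splits are those distributed with the released data.

```python
import random


def split_dataset(items, seed=0, train=0.8, dev=0.1):
    """Shuffle and split into train/dev/test; the test set takes the remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])


# Integers stand in for review documents.
train_set, dev_set, test_set = split_dataset(range(1000))
```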

Baseline Methods
To demonstrate the effectiveness of the proposed method, we compare our model with the following baselines:
Majority uses the majority sentiment label in the development sets as the predicted label.
SVM uses unigrams and bigrams as text features and uses Liblinear (Fan et al., 2008) for learning.
SLDA refers to supervised latent Dirichlet allocation (Blei and Mcauliffe, 2010), which is a statistical model of labeled documents.
NBoW is a neural bag-of-words model that averages the embeddings of all words in a document and feeds the resulting embedding into an SVM classifier.
DAN is a deep averaging network model which consists of several fully connected layers with averaged word embeddings as input. A novel word dropout strategy is employed to boost model performance (Iyyer et al., 2015).
CNN applies a convolution operation over a sentence to extract features of neighboring words, then obtains a fixed-size representation with a pooling layer (Kim, 2014).
LSTM is a variant of the recurrent neural network and has been shown to be one of the state-of-the-art models for document-level sentiment classification (Tang et al., 2015a). We use LSTM to refer to Bi-LSTM, which captures both forward and backward semantic information.
HAN is the hierarchical attention network proposed for document classification. Note that the original HAN uses GRU as the encoder. In our experiments, LSTM-based HAN obtains slightly better results. Thus, we report the results of HAN with LSTM as the encoder.
We extend DAN, CNN, and LSTM with the hierarchical architecture and the multi-task framework; the corresponding models are MHDAN, MHCNN, and MHLSTM respectively. In addition, MHAN, which is HAN with multi-task learning, is also evaluated as a baseline.

Implementation Details
We implement all neural models using Theano (Theano Development Team, 2016). The model parameters are tuned on the development sets. Following Tang et al. (2015a), we learn 200-dimensional word embeddings with the Skip-gram model (Mikolov et al., 2013) on the in-domain corpus. The pre-trained word embeddings are used to initialize the embedding matrices $E_A$ and $E_B$. The dimensions of all hidden vectors are set to 200. For the TripAdvisor dataset, the hop numbers of the word-level and sentence-level iterative attention modules are set to 4 and 2 respectively. For the BeerAdvocate dataset, the hop numbers are set to 6 and 2. The number of selected keywords $N_k = N$ is set to 20. To avoid over-fitting, we use dropout and regularization as follows: (1) the regularization parameter is set to 1e-5; (2) the dropout rate is set to 0.3 and is applied to both sentence and document vectors. All parameters are trained with ADADELTA (Zeiler, 2012), which does not require setting an initial learning rate. To ensure fair comparisons, the baselines use the same settings as the proposed model, including word embeddings, dimensions of hidden vectors, and optimization details.

Results and Analyses
We use accuracy and mean squared error (MSE) as the evaluation metrics, and the results are shown in Table 2. Compared to SVM and SLDA, NBoW achieves 3% higher accuracy on both datasets, which shows that embedding features are more effective than traditional n-gram features on these two datasets. All neural network models outperform NBoW, which shows the advantages of neural networks in document sentiment classification.
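For reference, the two metrics can be computed as in the sketch below, where ratings are treated as integers. The function names and the toy ratings are illustrative and do not correspond to reported results.

```python
def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold rating."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted ratings."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


# Toy ratings, for illustration only.
gold = [5, 4, 2, 3]
pred = [5, 3, 2, 5]
acc = accuracy(gold, pred)            # 2 of 4 correct -> 0.5
mse = mean_squared_error(gold, pred)  # (0 + 1 + 0 + 4) / 4 -> 1.25
```

Accuracy treats all errors alike, while MSE penalizes predictions that are far from the gold rating more heavily, which is why the two are reported together.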
From the results of the neural networks, we can observe that DAN performs worse than LSTM and CNN, and that LSTM achieves slightly better results than CNN. A likely explanation is that the simple composition method, which averages the embeddings of the words in a document and ignores word order, may not be as effective for aspect classification as more flexible composition models such as LSTM and CNN. Additionally, we observe that multi-task learning and the hierarchical architecture are beneficial for neural networks. Among all baselines, MHAN and MHLSTM achieve comparable results and outperform the others.
Compared with MHAN and MHLSTM, our method achieves improvements of 1.5% (3% relative improvement) and 1.0% (2.5% relative improvement) on TripAdvisor and BeerAdvocate respectively, which shows that incorporating the iterative attention mechanism helps the deep neural network based model build more discriminative aspect-aware representations. Note that BeerAdvocate is relatively more difficult, since its ratings range from 1 to 10 while those of TripAdvisor range from 1 to 5. Moreover, a t-test is conducted over runs with random train/dev/test splits of the datasets and random initializations. The results on the test sets are described in Table 3 and show that the performance of our model is stable.

Case Study for Attention Results
In this section, we sample two sentences from TripAdvisor to show the visualization of attention results for case study. Both word-level and sentence-level attention visualizations are shown in Figure 4. We normalize the word weight by the sentence weight to make sure that only important words in a document are highlighted.
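One way to read this normalization is that each word weight is scaled by the weight of the sentence containing it, so that words in unimportant sentences are suppressed. The sketch below implements that reading; the multiplicative form, the function name `document_word_weights`, and the toy weights are assumptions for illustration.

```python
def document_word_weights(word_weights, sentence_weights):
    """Scale the word weights of each sentence by that sentence's weight.

    Assumption: normalization is a simple product of the two attention weights.
    """
    return [[s_w * w for w in words]
            for words, s_w in zip(word_weights, sentence_weights)]


# Toy attention weights: two sentences with three and two words.
word_weights = [[0.7, 0.2, 0.1], [0.5, 0.5]]
sentence_weights = [0.9, 0.1]
scaled = document_word_weights(word_weights, sentence_weights)
```

In this toy case the top word of the first sentence keeps a weight of about 0.63, while both words of the second sentence drop to 0.05.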
From the top figures in (a) and (b), we observe that our model assigns different attention weights for each aspect. For example, in the first sentence, the words comfortable and bed are assigned higher weights for the aspect Room, and the word clean is highlighted for the aspect Cleanliness. In the second sentence, the word internet is assigned a high attention value for Business. Moreover, the bottom figures in (a) and (b) show that (1) word weights vary across hops and (2) attention values in higher hops are more reasonable. Specifically, in the first sentence, the weight of the word clean is higher than that of comfortable in the first hop, while comfortable surpasses clean in higher hops. In the second sentence, we observe that the weight of the word internet increases with the number of hops. Thus, more sensible word weights are obtained through the proposed iterative attention mechanism. Similarly, figures (c) and (d) show that the conclusion drawn for words also holds for sentences. For the first sentence, the sentence weight for the aspect Room is lower than that for Cleanliness in the first hop, but surpasses it in the second hop. For the second sentence, the weight for Business becomes higher in the second hop.

Effects of Hop and Aspect Keywords
In this experiment, we investigate the effects of the hop number $m$ and the number of aspect keywords $N$ on performance. All the experiments are conducted on the development set. Due to lack of space, we only present the results on TripAdvisor; the results on BeerAdvocate behave similarly.
For the hop number, we vary $m$ from 1 to 7, and the results are shown in Figure 5 (left). We can see that (1) at the word level, the performance increases for m ≤ 4 but shows no further improvement for m > 4; (2) at the sentence level, the model performs best when m = 2. Moreover, the word-level hop number leads to larger variation in performance than the sentence-level hop number.
For the number of aspect keywords, we vary $N$ from 0 to 35 in increments of 5. Note that when N = 0, we use a learnable vector to represent the question memory. The results are shown in Figure 5 (right). We observe that the performance increases for N ≤ 20 and shows no further improvement for N > 20. This indicates that a small number of keywords is enough for the proposed model to achieve competitive results.

Related Work
Multi-Aspect Sentiment Classification. Multi-aspect sentiment classification has been studied extensively in the literature (Lu et al., 2011). There are also heuristic-based methods and sophisticated topic models in which multi-aspect sentiment classification is solved as a subproblem (Titov and McDonald, 2008; Wang et al., 2010; Diao et al., 2014; Pappas and Popescu-Belis, 2014). However, these approaches often rely on strict assumptions about words and sentences, for example, using the word syntax to determine whether a word is an aspect or a sentiment word, or relating a sentence to a specific aspect. Another related problem is aspect-based sentiment classification (Pontiki et al., 2014, 2016; Poria et al., 2016), which first extracts aspect expressions from sentences (Poria et al., 2014; Balahur and Montoyo, 2008) and then determines their sentiments. With the development of neural networks and word embeddings in NLP, neural network based models have achieved state-of-the-art results with less feature engineering. Tang et al. (2016) employed a deep memory network for aspect-based sentiment classification given the aspect location, and Lakkaraju et al. (2014) employed recurrent neural networks and their variants for extracting aspect-sentiment pairs. However, these tasks are at the sentence level. Another related research field is document-level sentiment classification, because we can treat single-aspect sentiment classification as an individual document classification task. This line of research includes Tang et al. (2015b), Chen et al. (2016), and Tang et al. (2016), which are based on neural networks in a hierarchical structure. However, they did not work on multiple aspects.
Machine Comprehension. Recently, neural network based machine comprehension (or reading) has been studied extensively in NLP, with the release of large-scale evaluation datasets (Hermann et al., 2015; Hill et al., 2016; Rajpurkar et al., 2016). Most of the related studies focus on the attention mechanism (Bahdanau et al., 2014), which was first proposed in machine translation and aims to handle long-distance dependencies between words. Hermann et al. (2015) used Bi-LSTM to encode the document and the query, and proposed the Attentive Reader and the Impatient Reader. The former attends to the document based on the query representation, and the latter attends to the document using the representation of each query token in an incremental manner. Memory Networks (Sukhbaatar et al., 2015) attend to and reason over document representations in a multi-hop fashion, enriching the interactions between documents and questions. The Dynamic Memory Network (Kumar et al., 2016) updates the memories of documents by re-running GRU models based on the derived attention weights. Meanwhile, the query representation is refined by another GRU model. The Gated-Attention Reader (Dhingra et al., 2016) proposes a novel attention mechanism based on multiplicative interactions between the query embeddings and the intermediate states of a recurrent neural network document reader. The Bi-Directional Attention Model (Xiong et al., 2017; Seo et al., 2017) fuses co-dependent representations of queries and documents in order to focus on the relevant parts of both. The Iterative Attention model (Sordoni et al., 2016) attends to the question and the document sequentially, which is related to our model. Different from the Iterative Attention model, our model focuses on document-level multi-aspect sentiment classification, is built in a hierarchical architecture, and uses different procedures in the iterative attention module. Another related research problem is visual question answering, which uses an image as the question context rather than a set of keywords as the question. Neural network based visual question answering (Lu et al., 2016; Xiong et al., 2016) is similar to the models proposed for text comprehension.

Conclusion
In this paper, we model document-level multi-aspect sentiment classification as a text comprehension problem and propose a novel hierarchical iterative attention model in which documents and pseudo aspect questions are interleaved at both the word and sentence levels to learn aspect-aware document representations in a unified model. Extensive experiments show that our model outperforms the other neural models that use the multi-task framework and hierarchical architecture.

Acknowledgments
This paper is partially supported by the National Natural Science Foundation of China (NSFC Grant Nos. 61472006 and 91646202) as well as the National Basic Research Program (973 Program No. 2014CB340405). This work was also supported by NVIDIA Corporation with the donation of the Titan X GPU, Hong Kong CERG Project 26206717, China 973 Fundamental R&D Program (No.2014CB340304), and the LORELEI Contract HR0011-15-2-0025 with DARPA. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. We also thank the anonymous reviewers for their valuable comments and suggestions that help improve the quality of this manuscript.