A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis

Opinion mining from customer reviews has become pervasive in recent years. Sentences in reviews, however, are usually classified independently, even though they form part of a review's argumentative structure. Intuitively, sentences in a review build and elaborate upon each other; knowledge of the review structure and sentential context should thus inform the classification of each sentence. We demonstrate this hypothesis for the task of aspect-based sentiment analysis by modeling the interdependencies of sentences in a review with a hierarchical bidirectional LSTM. We show that the hierarchical model outperforms two non-hierarchical baselines, obtains results competitive with the state-of-the-art, and outperforms the state-of-the-art on five multilingual, multi-domain datasets without any hand-engineered features or external resources.


Introduction
Sentiment analysis (Pang and Lee, 2008) is used to gauge public opinion towards products, to analyze customer satisfaction, and to detect trends. With the proliferation of customer reviews, more fine-grained aspect-based sentiment analysis (ABSA) has gained in popularity, as it allows aspects of a product or service to be examined in more detail.
Reviews, just like any coherent text, have an underlying structure. A visualization of the discourse structure according to Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) for the example review in Figure 1 reveals that sentences and clauses are connected via rhetorical relations such as Elaboration and Background.
[Figure 1: RST structure of an example review ("I love this restaurant." / "I am amazed at the quality of the food" / "that they cook with only simple ingredients."), with the relations Elaboration and Background.]
Intuitively, knowledge about the relations and the sentiment of surrounding sentences should inform the sentiment of the current sentence. If a reviewer of a restaurant has shown a positive sentiment towards the quality of the food, it is likely that their opinion will not change drastically over the course of the review. Additionally, overwhelmingly positive or negative sentences in the review help to disambiguate sentences whose sentiment is equivocal.
Neural network-based architectures that have recently become popular for sentiment analysis and ABSA, such as convolutional neural networks (Severyn and Moschitti, 2015), LSTMs (Vo and , and recursive neural networks (Nguyen and Shirai, 2015), however, are only able to consider intra-sentence relations, such as Background in Figure 1, and fail to capture inter-sentence relations, e.g. Elaboration, which rely on discourse structure and provide valuable clues for sentiment prediction.
We introduce a hierarchical bidirectional long short-term memory (H-LSTM) that is able to leverage both intra- and inter-sentence relations. The sole dependence on sentences and their structure within a review renders our model fully language-independent. We show that the hierarchical model outperforms strong sentence-level baselines for aspect-based sentiment analysis, while achieving results competitive with the state-of-the-art and outperforming it on several datasets without relying on any hand-engineered features or sentiment lexica.

Related Work
Aspect-based sentiment analysis. Past approaches use classifiers with expensive hand-crafted features based on n-grams, parts-of-speech, negation words, and sentiment lexica (Pontiki et al., 2014; Pontiki et al., 2015). The model by Zhang and Lan (2015) is the only approach we are aware of that considers more than one sentence. However, it is less expressive than ours, as it only extracts features from the preceding and subsequent sentence without any notion of structure. Neural network-based approaches include an LSTM that determines sentiment towards a target word based on its position (Tang et al., 2015) as well as a recursive neural network that requires parse trees (Nguyen and Shirai, 2015). In contrast, our model requires no feature engineering, no positional information, and no parser outputs, which are often unavailable for low-resource languages. We are also the first, to our knowledge, to frame sentiment analysis as a sequence tagging task.
Hierarchical models. Hierarchical models have been used predominantly for representation learning and generation of paragraphs and documents: Li et al. (2015) use a hierarchical LSTM-based autoencoder to reconstruct reviews and paragraphs of Wikipedia articles. Serban et al. (2016) use a hierarchical recurrent encoder-decoder with latent variables for dialogue generation. Denil et al. (2014) use a hierarchical ConvNet to extract salient sentences from reviews, while Kotzias et al. (2015) use the same architecture to learn sentence-level labels from review-level labels using a novel cost function. The model of Lee and Dernoncourt (2016) is perhaps the most similar to ours. While they also use a sentence-level LSTM, their class-level feed-forward neural network is only able to consider a limited number of preceding texts, while our review-level bidirectional LSTM is (theoretically) able to consider an unlimited number of preceding and successive sentences.

Model
In the following, we will introduce the different components of our hierarchical bidirectional LSTM architecture displayed in Figure 2.

Sentence and Aspect Representation
Each review consists of sentences, which are padded to length $l$ by inserting padding tokens. Each review in turn is padded to length $h$ by inserting sentences containing only padding tokens. We represent each sentence as a concatenation of its word embeddings $x_{1:l}$, where $x_t$ is the embedding of the $t$-th word in the sentence.
Every sentence is associated with an aspect. Aspects consist of an entity and an attribute, e.g. FOOD#QUALITY. Similarly to the entity representation of Socher et al. (2013), we represent every aspect $a$ as the average of its entity and attribute embeddings, $a = \frac{1}{2}(x_e + x_a)$, where $x_e, x_a \in \mathbb{R}^m$ are the $m$-dimensional entity and attribute embeddings respectively.
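The padding and aspect averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the padding token `<pad>`, the helper names, and the toy 3-dimensional embeddings are assumptions, and $l$ and $h$ are assumed to be at least as large as the longest sentence and review.

```python
PAD = "<pad>"  # hypothetical padding token; the paper does not name one


def pad_review(review, l, h):
    """Pad each sentence to l tokens and the review to h sentences."""
    padded = [sent + [PAD] * (l - len(sent)) for sent in review]
    # Append sentences that consist only of padding tokens.
    padded += [[PAD] * l for _ in range(h - len(padded))]
    return padded


def aspect_vector(aspect, entity_emb, attribute_emb):
    """Average the entity and attribute embeddings of an aspect such as FOOD#QUALITY."""
    entity, attribute = aspect.split("#")
    x_e, x_a = entity_emb[entity], attribute_emb[attribute]
    return [0.5 * (e + a) for e, a in zip(x_e, x_a)]


review = [["I", "love", "this", "restaurant"], ["Great", "food"]]
padded = pad_review(review, l=5, h=3)

# Toy embeddings with m = 3, chosen by hand for illustration only.
entity_emb = {"FOOD": [1.0, 0.0, 2.0]}
attribute_emb = {"QUALITY": [0.0, 1.0, 4.0]}
a = aspect_vector("FOOD#QUALITY", entity_emb, attribute_emb)
```

Here `padded` holds three sentences of five tokens each, the last consisting only of padding, and `a` is the element-wise mean of the two toy embeddings.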

LSTM
We use a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), which adds input, output, and forget gates to a recurrent cell. These gates allow it to model long-range dependencies that are essential for capturing sentiment.
For the $t$-th word in a sentence, the LSTM takes as input the word embedding $x_t$, the previous output $h_{t-1}$, and the previous cell state $c_{t-1}$, and computes the next output $h_t$ and cell state $c_t$. Both $h$ and $c$ are initialized with zeros.
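For reference, one common formulation of the update described above is given below. The weight matrices $W$, the bias vectors $b$, and the choice to apply each gate to the concatenation $[h_{t-1}; x_t]$ are notational assumptions; the text does not state which LSTM variant is used.

```latex
\begin{align*}
i_t &= \sigma(W_i [h_{t-1}; x_t] + b_i) \\
f_t &= \sigma(W_f [h_{t-1}; x_t] + b_f) \\
o_t &= \sigma(W_o [h_{t-1}; x_t] + b_o) \\
g_t &= \tanh(W_g [h_{t-1}; x_t] + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}
```

Here $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $g_t$ is the candidate cell update, $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $h_0 = c_0 = 0$.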

Bidirectional LSTM
Both on the review and on the sentence level, sentiment depends not only on preceding but also on succeeding words and sentences. A Bidirectional LSTM (Bi-LSTM) (Graves et al., 2013) allows us to look ahead by employing a forward LSTM, which processes the sequence in chronological order, and a backward LSTM, which processes the sequence in reverse order. The output $h_t$ at a given time step is then the concatenation of the corresponding states of the forward and backward LSTM.
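The alignment of forward and backward states can be sketched independently of the cell type. In the example below, a toy running-sum cell stands in for the LSTM so that the result is easy to check by hand; the function names are hypothetical.

```python
def run_recurrent(step, xs, init):
    """Apply step(x, state) -> state over xs and return the state after each input."""
    states, state = [], init
    for x in xs:
        state = step(x, state)
        states.append(state)
    return states


def bidirectional(step_fwd, step_bwd, xs, init):
    """Concatenate forward and backward states position by position."""
    forward = run_recurrent(step_fwd, xs, init)
    # Run over the reversed sequence, then flip back so index t lines up with xs[t].
    backward = run_recurrent(step_bwd, xs[::-1], init)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Toy cell standing in for an LSTM: the state is a one-element running sum.
def running_sum(x, state):
    return [state[0] + x]


outputs = bidirectional(running_sum, running_sum, [1, 2, 3], init=[0])
# outputs[t] == [sum of xs[:t + 1], sum of xs[t:]]
print(outputs)  # [[1, 6], [3, 5], [6, 3]]
```

Each output combines a summary of everything up to position t with a summary of everything from position t onwards, which is what lets the model look ahead.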

Hierarchical Bidirectional LSTM
Stacking a Bi-LSTM on the review level on top of sentence-level Bi-LSTMs yields the hierarchical bidirectional LSTM (H-LSTM) in Figure 2.
The sentence-level forward and backward LSTMs receive the sentence starting with the first and last word embedding, $x_1$ and $x_l$, respectively. The final output $h_l$ of both LSTMs is then concatenated with the aspect vector $a$ and fed as input into the review-level forward and backward LSTMs. The outputs of both LSTMs are concatenated and fed into a final softmax layer, which outputs a probability distribution over sentiments for each sentence.
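The data flow of the hierarchical model can be sketched in plain Python as below. This is an untrained forward pass under stated assumptions, not the authors' implementation: weights are random, biases are omitted for brevity, the dimensions are toy values, and three sentiment classes are assumed. All class and function names are hypothetical.

```python
import math
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class LSTM:
    """Single-layer LSTM with randomly initialized (untrained) weights, no biases."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate, applied to the concatenation [h_prev; x].
        self.W = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(n_hid + n_in)]
                         for _ in range(n_hid)] for gate in "ifog"}

    def run(self, xs):
        """Return the output h_t for every input; h and c start at zero."""
        h, c, outputs = [0.0] * self.n_hid, [0.0] * self.n_hid, []
        for x in xs:
            z = h + x
            i, f, o = ([sigmoid(v) for v in matvec(self.W[gate], z)] for gate in "ifo")
            cand = [math.tanh(v) for v in matvec(self.W["g"], z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


def encode_sentence(words, fwd, bwd):
    """Concatenate the final outputs of the forward and backward sentence LSTMs."""
    return fwd.run(words)[-1] + bwd.run(words[::-1])[-1]


def h_lstm(review, aspects, sent_fwd, sent_bwd, rev_fwd, rev_bwd, W_out):
    """Return one sentiment distribution per sentence of the review."""
    inputs = [encode_sentence(s, sent_fwd, sent_bwd) + a
              for s, a in zip(review, aspects)]
    forward = rev_fwd.run(inputs)
    backward = rev_bwd.run(inputs[::-1])[::-1]  # realign with sentence order
    return [softmax(matvec(W_out, f + b)) for f, b in zip(forward, backward)]


rng = random.Random(0)
k, m, d, n_classes = 4, 2, 3, 3  # toy sizes for word, aspect, and LSTM output dims
sent_fwd, sent_bwd = LSTM(k, d, rng), LSTM(k, d, rng)
rev_fwd, rev_bwd = LSTM(2 * d + m, d, rng), LSTM(2 * d + m, d, rng)
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(n_classes)]

# A review of 3 sentences with 5 word vectors each, plus one aspect vector per sentence.
review = [[[rng.gauss(0, 1) for _ in range(k)] for _ in range(5)] for _ in range(3)]
aspects = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(3)]
probs = h_lstm(review, aspects, sent_fwd, sent_bwd, rev_fwd, rev_bwd, W_out)
```

The result `probs` contains one probability distribution per sentence. Because the review-level LSTMs run over all sentence representations, each distribution depends on every sentence in the review, not only on its own.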

Datasets
For our experiments, we consider datasets in five domains (restaurants, hotels, laptops, phones, cameras) and eight languages (English, Spanish, French, Russian, Dutch, Turkish, Arabic, Chinese) from the recent SemEval-2016 Aspect-based Sentiment Analysis task (Pontiki et al., 2016), using the provided train/test splits. In total, there are 11 domain-language datasets containing 300-400 reviews with 1250-6000 sentences. Each sentence is annotated with none, one, or multiple domain-specific aspects and a sentiment value for each aspect.

Training Details
Our LSTMs have one layer and an output size of 200 dimensions. We use 300-dimensional word embeddings. We use pre-trained GloVe (Pennington et al., 2014) embeddings [...]. The entity and attribute embeddings of aspects have 15 dimensions and are initialized randomly. We use dropout of 0.5 after the embedding layer and after LSTM cells, a gradient clipping norm of 5, and no $\ell_2$ regularization.
We unroll the aspects of every sentence in the review, e.g. a sentence with two aspects occurs twice in succession, once with each aspect. We remove sentences with no aspect 8 and ignore predictions for all sentences that have been added as padding to a review so as not to force our model to learn meaningless predictions, as is commonly done in sequence-to-sequence learning (Sutskever et al., 2014). We segment Chinese data before tokenization. We train our model to minimize the cross-entropy loss, using stochastic gradient descent, the Adam update rule (Kingma and Ba, 2015), mini-batches of size 10, and early stopping with a patience of 10.
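The unrolling of aspects and the exclusion of padding sentences from the loss can be sketched as follows. The data layout, the function names, the `SERVICE#GENERAL` example aspect, and the mask representation are assumptions for illustration; the paper does not describe its implementation at this level.

```python
import math


def unroll_aspects(sentences):
    """Repeat each sentence once per annotated aspect; drop sentences without aspects."""
    unrolled = []
    for tokens, annotations in sentences:
        for aspect, sentiment in annotations:
            unrolled.append((tokens, aspect, sentiment))
    return unrolled


def masked_cross_entropy(probs, labels, mask):
    """Mean negative log-likelihood over real sentences only (mask == 1)."""
    losses = [-math.log(p[y]) for p, y, keep in zip(probs, labels, mask) if keep]
    return sum(losses) / len(losses)


sentences = [
    (["Great", "food", "but", "slow", "service"],
     [("FOOD#QUALITY", "positive"), ("SERVICE#GENERAL", "negative")]),
    (["We", "went", "on", "Friday"], []),  # no aspect: removed
]
unrolled = unroll_aspects(sentences)

# Two real sentences followed by one padding sentence whose prediction is ignored.
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.9, 0.05, 0.05]]
labels = [0, 1, 0]
mask = [1, 1, 0]
loss = masked_cross_entropy(probs, labels, mask)
```

The first sentence appears twice in `unrolled`, once per aspect, and the loss averages only over the two unmasked positions, so the prediction for the padding sentence contributes nothing to the gradient.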

Comparison models
We compare our model using random (H-LSTM) and pre-trained word embeddings (HP-LSTM) against the best model of the SemEval-2016 Aspect-based Sentiment Analysis task (Pontiki et al., 2016) for each domain-language pair (Best) as well as against the two best single models of the competition: IIT-TUDA (Kumar et al., 2016), which uses large sentiment lexicons for every language, and XRCE (Brun et al., 2016), which uses a parser augmented with hand-crafted, domain-specific rules. In order to ascertain that the hierarchical nature of our model is the deciding factor, we additionally compare against the sentence-level convolutional neural network of Ruder et al. (2016) (CNN) and against a sentence-level Bi-LSTM (LSTM), which is identical to the first layer of our model.
8 Labeling them with a NONE aspect and predicting neutral slightly decreased performance.

Results and Discussion
We present our results in Table 1. Our hierarchical model achieves results superior to the sentence-level CNN and the sentence-level Bi-LSTM baselines for almost all domain-language pairs by taking the structure of the review into account. We highlight examples where this improves predictions in Table 2.
In addition, our model shows results competitive with the best single models of the competition, while requiring no expensive hand-crafted features or external resources, thereby demonstrating its language and domain independence. Overall, our model compares favorably to the state-of-the-art, particularly for low-resource languages, where few hand-engineered features are available. It outperforms the state-of-the-art on four and five datasets using randomly initialized and pre-trained embeddings respectively.
[Table 2, caption partially recovered: ... knowledge of the quality of the green tea crème brulée helps the H-LSTM to predict the correct sentiment.]

Pre-trained embeddings
In line with past research (Collobert et al., 2011), we observe significant gains when initializing our word vectors with pre-trained embeddings across almost all languages. Pre-trained embeddings improve our model's performance for all languages except Russian, Arabic, and Chinese and help it achieve state-of-the-art results in the Dutch phones domain. We release our pre-trained multilingual embeddings so that they may facilitate future research in multilingual sentiment analysis and text classification.

Leveraging additional information
As annotation is expensive in many real-world applications, learning from only few examples is important. Our model was designed with this goal in mind and is able to extract additional information inherent in the training data. By leveraging the structure of the review, our model is able to inform and improve its sentiment predictions as evidenced in Table 2.
The large performance differential to the state-of-the-art for the Turkish dataset, where only 1104 sentences are available for training, and the performance gaps for high-resource languages such as English, Spanish, and French, however, indicate the limits of an approach such as ours that only uses data available at training time.
While using pre-trained word embeddings is an effective way to mitigate this deficit, for high-resource languages, solely leveraging unsupervised language information is not enough to perform on par with approaches that make use of large external resources (Kumar et al., 2016) and meticulously hand-crafted features (Brun et al., 2016). Sentiment lexicons are a popular way to inject additional information into models for sentiment analysis. We experimented with using sentiment lexicons by Kumar et al. (2016) but were not able to significantly improve upon our results with pre-trained embeddings. In light of the diversity of domains in the context of aspect-based sentiment analysis and many other applications, domain-specific lexicons (Hamilton et al., 2016) are often preferred. Finding better ways to incorporate such domain-specific resources into models, as well as methods to inject other forms of domain information, e.g. by constraining them with rules (Hu et al., 2016), is thus an important research avenue, which we leave for future work.

Conclusion
In this paper, we have presented a hierarchical model of reviews for aspect-based sentiment analysis. We demonstrate that by allowing the model to take into account the structure of the review and the sentential context for its predictions, it is able to outperform models that only rely on sentence information and achieves performance competitive with models that leverage large external resources and hand-engineered features. Our model achieves state-of-the-art results on 5 out of 11 datasets for aspect-based sentiment analysis.