Exploiting Document Knowledge for Aspect-level Sentiment Classification

Attention-based long short-term memory (LSTM) networks have proven to be useful in aspect-level sentiment classification. However, due to the difficulties in annotating aspect-level data, existing public datasets for this task are all relatively small, which largely limits the effectiveness of those neural models. In this paper, we explore two approaches that transfer knowledge from document-level data, which is much less expensive to obtain, to improve the performance of aspect-level sentiment classification. We demonstrate the effectiveness of our approaches on four public datasets from SemEval 2014, 2015, and 2016, and we show that attention-based LSTM benefits from document-level knowledge in multiple ways.


Introduction
Given a sentence and an opinion target (also called an aspect term) occurring in the sentence, aspect-level sentiment classification aims to determine the sentiment polarity in the sentence towards the opinion target. An opinion target, or target for short, refers to a word or a phrase describing an aspect of an entity. For example, in the sentence "This little place has a cute interior decor but the prices are quite expensive", the targets are interior decor and prices, and they are associated with positive and negative sentiment respectively.
A sentence may contain multiple sentiment-target pairs, thus one challenge is to separate different opinion contexts for different targets. For this purpose, state-of-the-art neural methods (Wang et al., 2016; Liu and Zhang, 2017; Chen et al., 2017) adopt attention-based LSTM networks, where the LSTM aims to capture sequential patterns and the attention mechanism aims to emphasize target-specific contexts for encoding sentence representations. Typically, LSTMs only show their potential when trained on large datasets. However, aspect-level training data requires the annotation of all opinion targets in a sentence, which is costly to obtain in practice. As such, existing public aspect-level datasets are all relatively small. Insufficient training data limits the effectiveness of neural models.
Despite the lack of aspect-level labeled data, enormous amounts of document-level labeled data, such as Amazon reviews, are easily accessible online. These reviews contain substantial linguistic patterns and naturally come with sentiment labels. In this paper, we hypothesize that aspect-level sentiment classification can be improved by employing knowledge gained from document-level sentiment classification. Specifically, we explore two transfer methods to incorporate this kind of knowledge: pretraining and multi-task learning.
In our experiments, we find that both methods are helpful and that combining them achieves significant improvements over attention-based LSTM models trained only on aspect-level data. We also illustrate by examples that additional knowledge from document-level data is beneficial in multiple ways. Our source code can be obtained from https://github.com/ruidan/Aspect-level-sentiment.

Related Work
Various neural models (Dong et al., 2014; Nguyen and Shirai, 2015; Vo and Zhang, 2015; Tang et al., 2016a,b; Wang et al., 2016; Liu and Zhang, 2017; Chen et al., 2017) have been proposed for aspect-level sentiment classification. The main idea behind these works is to develop neural architectures that are able to learn continuous features and capture the intricate relation between a target and context words. However, to sufficiently train these models, substantial aspect-level annotated data is required, which is expensive to obtain in practice.
We explore both pretraining and multi-task learning for transferring knowledge from document level to aspect level. Both methods are widely studied in the literature. Pretraining is a common technique used in computer vision, where low-level neural layers can be usefully transferred to different tasks (Krizhevsky et al., 2012; Zeiler and Fergus, 2014). In natural language processing (NLP), some efforts have been initiated on pretraining LSTMs (Dai and Le, 2015; Zoph et al., 2016; Ramachandran et al., 2017) for sequence-to-sequence models in both supervised and unsupervised settings, where promising results have been obtained. On the other hand, multi-task learning simultaneously trains on samples in multiple tasks with a combined objective (Collobert and Weston, 2008; Luong et al., 2015a), which has improved model generalization ability in certain cases. Mou et al. (2016) investigated the transferability of neural models in NLP applications with extensive experiments, showing that transferability largely depends on the semantic relatedness of the source and target tasks. For our problem, we hypothesize that aspect-level sentiment classification can be improved by employing knowledge gained from document-level sentiment classification, as these two tasks are highly related semantically.

Attention-based LSTM
We first describe a conventional implementation of an attention-based LSTM model for this task. We use it as a baseline model and extend it with pretraining and multi-task learning approaches for incorporating document-level knowledge.
The inputs are a sentence s = (w_1, w_2, ..., w_n) consisting of n words, and an opinion target x = (x_1, x_2, ..., x_m) occurring in the sentence, which is a subsequence of s consisting of m words. Each word is associated with a continuous word embedding e_w (Mikolov et al., 2013) from an embedding matrix E ∈ R^{V×d}, where V is the vocabulary size and d is the embedding dimension.
LSTM is used to capture sequential information, and outputs a sequence of hidden vectors:

[h_1, ..., h_n] = LSTM([e_{w_1}, ..., e_{w_n}]; θ_lstm)    (1)

An attention layer assigns a weight α_i to each word in the sentence. The final target-specific representation of the sentence s is then given by:

z = Σ_{i=1}^{n} α_i h_i    (2)

and α_i is computed as follows:

α_i = exp(β_i) / Σ_{j=1}^{n} exp(β_j)    (3)

β_i = f_score(h_i, t)    (4)

f_score(h_i, t) = tanh(h_i^T W_a t)    (5)

where t is the target representation, computed as the averaged word embedding of the target. f_score is a content-based function that captures the semantic association between a word and the target, for which we adopt the formulation used in (Luong et al., 2015b; He et al., 2017) with parameter matrix W_a ∈ R^{d×d}. The sentence representation z is fed into an output layer to predict the probability distribution p over sentiment labels on the target:

p = softmax(W_o z + b_o)    (6)

We refer to this baseline model as LSTM+ATT. It is trained via cross-entropy minimization:

J = - Σ_{i∈D} log p_i(c_i)    (7)

where D denotes the overall training corpus, c_i denotes the true label for sample i, and p_i(c_i) denotes the probability of the true label.
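The attention pooling described above (scoring each hidden vector against the target, normalizing the scores with a softmax, and taking the weighted sum) can be sketched in a few lines of numpy. This is an illustrative sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def attention_pool(H, t, W_a):
    """Target-specific attention pooling (illustrative sketch).

    H   : (n, d) matrix of LSTM hidden vectors, one row per word
    t   : (d,) target representation (averaged target embeddings)
    W_a : (d, d) attention parameter matrix
    Returns the sentence representation z and the weights alpha.
    """
    # Content-based score for each word: tanh(h_i^T W_a t)
    beta = np.tanh(H @ W_a @ t)          # shape (n,)
    # Softmax normalization (max subtracted for numerical stability)
    alpha = np.exp(beta - beta.max())
    alpha /= alpha.sum()
    # Weighted sum of hidden vectors gives the sentence representation
    z = alpha @ H                        # shape (d,)
    return z, alpha
```

In a real model, H would come from a trained LSTM and W_a would be learned jointly with the other parameters; here all inputs are arbitrary arrays.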

Transfer Approaches
LSTM+ATT is used as our aspect-level model, with parameter set θ_aspect. We also build a standard LSTM-based classifier on document-level training examples. This network is the same as LSTM+ATT apart from the lack of the attention layer. The training objective is also cross-entropy minimization as shown in equation (7), and the parameter set is θ_doc.
Pretraining (PRET): In this setting, we first train on document-level examples. The last hidden vector from the LSTM outputs is used as the document representation. We then initialize the relevant parameters of LSTM+ATT with the pretrained weights and train it on aspect-level examples to fine-tune them.
Multi-task Learning (MULT): This approach simultaneously trains two tasks - document-level and aspect-level classification. In this setting, the embedding layer (E) and the LSTM layer (θ_lstm) are shared by both tasks, and a document is represented as the mean vector over LSTM outputs. The other parameters are task-specific. The overall loss function is then given by:

L = J + λ·U    (8)

where U is the loss function of document-level classification and λ ∈ (0, 1) is a hyperparameter that controls the weight of U.
Combined (PRET+MULT): In this setting, we first perform PRET on document-level examples. We use the pretrained weights to initialize the parameters of both the aspect-level and the document-level model, and then perform MULT as described above.
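As a rough illustration of the MULT objective, the following sketch computes J + λ·U from the probabilities that each model assigns to the true labels. The function names, inputs, and structure are ours; the actual implementation trains both networks jointly with shared embedding and LSTM layers:

```python
import numpy as np

def cross_entropy(true_label_probs):
    """Negative log-likelihood of the true labels, given the
    predicted probabilities assigned to them (one per sample)."""
    return -np.sum(np.log(true_label_probs))

def multitask_loss(aspect_true_probs, doc_true_probs, lam=0.1):
    """Combined MULT objective: aspect-level loss J plus the
    document-level loss U weighted by lambda.  lam = 0.1 is the
    value used in the paper's experiments."""
    J = cross_entropy(aspect_true_probs)   # aspect-level task
    U = cross_entropy(doc_true_probs)      # document-level task
    return J + lam * U
```

In practice, gradients of this combined loss flow into the shared embedding and LSTM parameters from both tasks, while the attention and output layers remain task-specific.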

Datasets and Experimental Settings
We run experiments on four benchmark aspect-level datasets, taken from SemEval 2014 (Pontiki et al., 2014), SemEval 2015 (Pontiki et al., 2015), and SemEval 2016 (Pontiki et al., 2016). Following previous work (Tang et al., 2016b; Wang et al., 2016), we remove samples with conflicting polarities in all datasets.¹ Statistics of the resulting datasets are presented in Table 1. We derived two document-level datasets, from Yelp2014 and the Amazon Electronics dataset (McAuley et al., 2015) respectively. The original reviews were rated on a 5-point scale. We consider 3-class classification and thus label reviews with rating < 3, > 3, and = 3 as negative, positive, and neutral respectively. Each sampled dataset contains 30k instances with exactly balanced class labels. We pair up an aspect-level dataset and a document-level dataset when they are from a similar domain: the Yelp dataset is used with D1, D3, and D4 for PRET and MULT, and the Electronics dataset is used only with D2.
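The rating-to-label mapping above is straightforward; a trivial sketch (function name ours):

```python
def rating_to_label(rating):
    """Map a 5-point review rating to a 3-class sentiment label:
    ratings below 3 are negative, above 3 positive, exactly 3 neutral."""
    if rating < 3:
        return "negative"
    if rating > 3:
        return "positive"
    return "neutral"
```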
In all experiments, 300-dimension GloVe vectors (Pennington et al., 2014) are used to initialize the embedding matrix E when pretraining is not conducted for weight initialization. These vectors are also used for initializing E in the pretraining phase. Values for hyperparameters are obtained from experiments on development sets. We randomly sample 20% of the original training data from each aspect-level dataset as the development set and use only the remaining 80% for training. For all experiments, the dimension of the LSTM hidden vectors is set to 300, λ is set to 0.1, and we apply dropout with probability 0.5 on sentence/document representations before the output layer. We use RMSProp as the optimizer, with the decay rate set to 0.9 and the base learning rate set to 0.001. The mini-batch size is set to 32.

Model Comparison
Table 2 shows the results of LSTM, LSTM+ATT, PRET, MULT, PRET+MULT, and four representative prior works (Tang et al., 2016a,b; Wang et al., 2016; Chen et al., 2017). Both accuracy and macro-F1 are used for evaluation, as the label distributions are unbalanced. The reported numbers are the average values over 5 runs with random initialization for each method, and significance tests are conducted to test the robustness of the methods under random parameter initialization.
We observe that PRET is very helpful, and consistently gives a 1-3% increase in accuracy over LSTM+ATT across all datasets. The improvements in macro-F1 scores are even larger, especially on D3 and D4, where the labels are extremely unbalanced. MULT gives performance similar to LSTM+ATT on D1 and D2, but clear improvements can be observed on D3 and D4. The combination (PRET+MULT) overall yields better results.
There are two main reasons why the improvements in macro-F1 scores are more significant on D3 and D4 than on D1: (1) D1 contains many more neutral examples in its training set than D3 and D4. A model without any external knowledge might thus still be able to learn some neutral-related features on D1, but it is very hard to do so on D3 and D4.
(2) The numbers of neutral examples in the test sets of D3 and D4 are very small, so the precision and recall on the neutral class are largely affected by even a small difference in predictions (e.g., with 5 more neutral examples correctly identified, recall increases by more than 10% on both datasets). As a result, the macro-F1 scores on D3 and D4 are affected more.

Ablation Tests
Table 2 indicates that a large percentage of the performance gain comes from PRET. To better understand the transfer effects of the different layers - the embedding layer (E), the LSTM layer (θ_lstm), and the output layer (W_o, b_o) - we conduct ablation tests on PRET with different layers transferred from the document-level model to the aspect-level model. Results are presented in Table 3. "LSTM only" denotes the setting where only the LSTM layer is transferred, and "Without LSTM" denotes the setting where only the embedding and output layers are transferred. The key observations are: (1) Transfer is helpful in all settings; improvements over LSTM+ATT are observed even when only one layer is transferred. (2) Overall, transfers of the LSTM and embedding layers are more useful than transfer of the output layer. This is what we expect, since the output layer is normally more task-specific. (3) Transfer of the embedding layer is more helpful on D3 and D4. One possible explanation is that the label distribution is extremely unbalanced on these two datasets, and sentiment information is not adequately captured by GloVe word embeddings. With only a small number of training examples in the negative and neutral classes, the embeddings trained by aspect-level classification alone do not effectively capture the true semantics of the relevant opinion words, and transfer of the embedding layer can greatly help in this case.

Analysis
To better understand in which conditions the proposed method is helpful, we analyze a subset of test examples that are correctly classified by PRET+MULT but misclassified by LSTM+ATT. We find that the benefits brought by document-level knowledge typically show up in four ways.
First of all, to our surprise, LSTM+ATT makes obvious mistakes on some instances with common opinion words. Below are two examples where the target is enclosed in [] with its true sentiment indicated in the subscript:
1. "I was highly disappointed in the [food]_neg."
2. "This particular location certainly uses substandard [meats]_neg."
In the above examples, LSTM+ATT does attend to the right opinion words, but makes the wrong predictions. One possible reason is that GloVe word embeddings without PRET do not effectively capture sentiment information, while the aspect-level training samples are insufficient to capture it for certain words. PRET+MULT eliminates this kind of error.
Another finding is that our method helps to better capture domain-specific opinion words, thanks to additional knowledge from documents from a similar domain:
1. "The smaller [size]_pos was a bonus because of space restrictions."
2. "The [price]_pos is 200 dollars down."
LSTM+ATT attends to smaller correctly in the first example but makes the wrong prediction, as smaller can be negative in many contexts. It does not even capture down in the second example.
Thirdly, we find that LSTM+ATT makes a number of errors on sentences with negation words:
1. "I have experienced no problems, [works]_pos as anticipated."
2. "[Service]_neg not the friendliest to our party!"
LSTMs typically only show their potential on large datasets. Without sufficient training examples, the model may not be able to effectively capture various sequential patterns. Pretraining the network on a larger document-level corpus addresses this problem.
Lastly, PRET+MULT makes fewer errors on recognizing neutral instances. This can also be observed from the macro-F1 scores in Table 2. The lack of training examples makes the prediction of neutral instances very difficult for all previous methods. Knowledge from document-level examples with balanced labels compensates for this disadvantage.

Conclusion
The effectiveness of existing aspect-level neural models is limited by the difficulty of obtaining training data in practice. Our work is the first attempt to incorporate knowledge from document-level corpora for training aspect-level sentiment classifiers. We have demonstrated the effectiveness of our proposed approaches and analyzed the major benefits brought by the knowledge transfer. The proposed approaches can potentially be integrated with other aspect-level neural models to further boost their performance.