Aspect Based Sentiment Analysis with Aspect-Specific Opinion Spans

Aspect based sentiment analysis, predicting sentiment polarity of given aspects, has drawn extensive attention. Previous attention-based models emphasize using aspect semantics to help extract opinion features for classification. However, these works are either not able to capture opinion spans as a whole, or not able to capture variable-length opinion spans. In this paper, we present a neat and effective structured attention model by aggregating multiple linear-chain CRFs. Such a design allows the model to extract aspect-specific opinion spans and then evaluate sentiment polarity by exploiting the extracted opinion features. The experimental results on four datasets demonstrate the effectiveness of the proposed model, and our analysis demonstrates that our model can capture aspect-specific opinion spans.


Introduction
Aspect Based Sentiment Analysis (ABSA) (Pang and Lee, 2008; Liu, 2012) is an extensively studied sentiment analysis task on a fine-grained semantic level, i.e., opinion targets explicitly mentioned in sentences. Previous ABSA studies focused on a few sub-tasks, such as Aspect Sentiment Classification (ASC) (Wang et al., 2016; Ma et al., 2018), Aspect Term Extraction (ATE) (Li et al., 2018b; He et al., 2017), Aspect and Opinion Co-Extraction (Liu et al., 2013; Xu et al., 2018; Dai and Song, 2019), E2E-ABSA (a joint task of ASC and ATE) (Li et al., 2019a; He et al., 2019; Li et al., 2019b), Aspect Sentiment Triplet Extraction (ASTE) (Peng et al., 2019; Xu et al., 2020), etc. ASC analyzes the sentiment polarity of given aspects/targets in a review. For example, consider the review sentence "Food is usually very good, though occasionally I worry about freshness of raw vegetables in side orders." This review mentions two aspects, Food and raw vegetables, and for ASC the objective is to assign a positive sentiment to Food and a negative sentiment to raw vegetables.
Most of the previous works (Wang et al., 2016; Liu and Zhang, 2017; Li et al., 2018c; He et al., 2018; Li and Lu, 2019; Hu et al., 2019) adopt the attention mechanism (Bahdanau et al., 2015) to capture the semantic relatedness among the context words and the aspect, and learn aspect-specific features for sentiment classification.
* Lu Xu is under the Joint PhD Program between Alibaba and Singapore University of Technology and Design. Accepted in EMNLP 2020 (Conference on Empirical Methods in Natural Language Processing).
1 Our code is released at https://github.com/xuuuluuu/Aspect-Sentiment-Classification
However, it is challenging for attention-based approaches to consider an opinion span as a whole during feature extraction, because they rely heavily on neural models to learn the context structural information and perform feature extraction over individual hidden representations. Previous work (Wang and Lu, 2018) engages structured attention networks (Kim et al., 2017), which extend the previous attention mechanism to incorporate structure dependencies, to model the interaction among context words and perform soft-selections of word spans. In particular, they introduce two hand-coded regularizers to constrain the soft-selection process to attend to a few short opinion spans. However, such regularizers disturb the structure dependencies, and their method is not capable of emphasizing aspect-specific opinion spans for sentiment classification.
To better capture opinion features for aspect sentiment classification, we propose the MCRF-SA model, which introduces multiple conditional random fields (CRFs) (Lafferty et al., 2001) into a structured attention model. While exploiting the advantages of structured attention mechanisms, our model avoids the regularizers by relying on the complementarity among multiple CRFs. We also improve the previous position decay function (Li et al., 2018a; Tang et al., 2019) to reduce the importance of context words that are further away from the aspect, so as to emphasize aspect-specific opinion spans. Our multi-CRF layer with the effective decay function extracts aspect-specific features from different representation sub-spaces to overcome the previous limitations. The experimental results on the four datasets demonstrate the effectiveness of our model, and the analysis shows that its behavior is in line with our intuition.

Model Description
Given a context sequence $w^c = \{w_1, w_2, \ldots, w_n\}$ and an aspect sequence $w^a = \{w_i, \ldots, w_j\}$ ($1 \le i \le j \le n$), which is a sub-sequence of $w^c$, the goal of ASC is to predict the sentiment polarity $y \in \{positive, negative, neutral\}$ of the given aspect. Our model is constructed from a few neural layers: an input layer, an aspect-specific contextualized representation layer, a position decay layer, a multi-CRF structured attention layer, and a sentiment classification layer. Figure 1 presents the architecture of our MCRF-SA model.

Input Layer
The input of our model consists of the word embedding $w_t^{word}$ and the aspect indicator embedding $w_t^{as}$. The aspect indicator embedding differentiates aspect words from context words and is randomly initialized. The input representation $x_t$ is the concatenation of the two:
$$x_t = [w_t^{word}; w_t^{as}]$$
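As a minimal sketch of this input layer, assuming the two embeddings are concatenated per token, the snippet below builds the input vectors for a toy sentence. The names `build_inputs`, `word_vecs`, and `indicator_table` are illustrative and are not taken from the released code.

```python
import random

def build_inputs(word_vecs, aspect_span, indicator_table):
    """Concatenate each word embedding with an aspect-indicator embedding.

    word_vecs: list of n word embeddings (lists of floats).
    aspect_span: (i, j), 1-based inclusive positions of the aspect.
    indicator_table: {0: context-word embedding, 1: aspect-word embedding}.
    """
    i, j = aspect_span
    inputs = []
    for t, w in enumerate(word_vecs, start=1):
        is_aspect = 1 if i <= t <= j else 0
        inputs.append(list(w) + list(indicator_table[is_aspect]))
    return inputs

# Toy example: 4 words, 3-d word vectors, 2-d indicator embeddings
# (the indicator embeddings are randomly initialized, as in the text).
rng = random.Random(0)
word_vecs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
indicator_table = {k: [rng.uniform(-1, 1) for _ in range(2)] for k in (0, 1)}
x = build_inputs(word_vecs, aspect_span=(2, 3), indicator_table=indicator_table)
```

Each resulting vector has the word dimension plus the indicator dimension, so the downstream encoder sees which tokens belong to the aspect.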

Aspect-Specific Contextualized Representation
We employ a bi-directional GRU (Cho et al., 2014) to generate the contextualized representation. Since the input representation already contains the aspect information, the aspect-specific contextualized representation is obtained by concatenating the hidden states from both directions:
$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$
where $\overrightarrow{h}_t$ is the hidden state from the forward GRU and $\overleftarrow{h}_t$ is the hidden state from the backward GRU.

Position Decay
Following the previous work (Li et al., 2018a; Zhang et al., 2019; Tang et al., 2019), we also use a position decay function to reduce the influence of a context word on the aspect as the word gets further away from the aspect. We propose a higher-order decay function, which is more sensitive to distance, and the sensitivity can be tuned by γ on different datasets.
In this function, i and j are the starting and ending positions of the aspect, L is the maximum length of sentences across all datasets, and γ is a hyper-parameter; a larger value enables more influence from the context words that are close to the aspect. The decay weight then scales the contextualized representation of each word to give the decayed contextual word representation $r_t$.
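The decay can be illustrated with a short sketch. The piecewise form used here is an assumption chosen to match the description, not a confirmed formula: weight 1 inside the aspect span, and $(1 - d/L)^{\gamma}$ for a context word at distance $d$ from the nearest aspect boundary. The function names are illustrative.

```python
def position_decay(t, i, j, L, gamma):
    """Assumed decay weight for the word at 1-based position t.

    (i, j): start and end positions of the aspect; L: maximum sentence
    length; gamma: exponent controlling sensitivity to distance.
    """
    if t < i:
        distance = i - t
    elif t > j:
        distance = t - j
    else:
        distance = 0  # words inside the aspect keep full weight
    return (1.0 - distance / L) ** gamma

def decay_representations(hidden, i, j, L, gamma):
    """Scale each contextualized vector by its decay weight."""
    return [
        [position_decay(t, i, j, L, gamma) * v for v in h]
        for t, h in enumerate(hidden, start=1)
    ]

# Weights for a 7-word sentence whose aspect covers positions 3 and 4.
weights = [position_decay(t, i=3, j=4, L=10, gamma=2) for t in range(1, 8)]
```

Under this assumed form, raising γ lowers the weight of distant words faster than that of nearby words, which is the sensitivity to distance that the text describes.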

Multi-CRF Structured Attention
We use multiple linear-chain CRFs to intensively incorporate structure dependencies and capture the opinion spans that correspond to an aspect. In particular, we create a latent label (Wang and Lu, 2018) $z \in \{Yes, No\}$ to indicate whether each context word is part of an opinion span. Similar to Lample et al. (2016), given the sentence representation $x$, the CRF is defined as:
$$P(z \mid x) = \frac{\exp(score(z, x))}{\sum_{z'} \exp(score(z', x))}$$
where $score(z, x)$ is a score function defined as the summation of transition scores and emission scores from the Bi-GRU:
$$score(z, x) = \sum_{t} T_{z_t, z_{t+1}} + \sum_{t} E_{t, z_t}$$
where $T$ is a transition matrix and $T_{z_t, z_{t+1}}$ denotes the transition score from label $z_t$ to $z_{t+1}$. $E_{t, z_t}$ denotes the emission score of label $z_t$ at the $t$-th position; it is obtained from a linear layer that takes $r_t$ as input and returns a vector whose length is the label size.
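To make the scoring concrete, the sketch below computes the sequence score and the resulting distribution over label sequences by brute-force enumeration for a short sentence. It assumes that the score sums emissions at every position and transitions between adjacent positions only, with no start or end transitions. The label indices (0 = No, 1 = Yes), the toy numbers, and all function names are illustrative.

```python
import itertools
import math

def sequence_score(z, emissions, transitions):
    """Sum of emission scores E[t][z_t] and transition scores T[z_t][z_{t+1}]."""
    score = sum(emissions[t][z_t] for t, z_t in enumerate(z))
    score += sum(transitions[a][b] for a, b in zip(z, z[1:]))
    return score

def crf_distribution(emissions, transitions, num_labels=2):
    """Return {label sequence: probability} by enumerating every sequence."""
    n = len(emissions)
    scores = {
        z: sequence_score(z, emissions, transitions)
        for z in itertools.product(range(num_labels), repeat=n)
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    log_z = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores.values())
    )
    return {z: math.exp(s - log_z) for z, s in scores.items()}

emissions = [[0.2, 1.0], [0.5, 0.1], [0.0, 2.0]]  # E[t][label]
transitions = [[0.3, -0.2], [-0.1, 0.6]]          # T[from][to]
dist = crf_distribution(emissions, transitions)
```

Enumeration is exponential in the sentence length and is only meant to show what the normalizer sums over; in practice it is computed with dynamic programming.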

Marginal Inference
The latent labels introduced in the CRF layer indicate whether a word influences the given aspect's sentiment. Intuitively, the marginal probability of the Yes label indicates the influence of the current context word on the aspect's sentiment. We calculate the marginal distribution of the latent label with the forward-backward algorithm. With the marginal distribution, the sentence representation $s$ is obtained as the sum of the word representations weighted by their marginal probabilities of the Yes label. The final representation for classification is obtained by concatenating the sentence representations from all CRFs:
$$q = [s^{1}; s^{2}; \ldots; s^{a}]$$
where $a$ is the number of CRFs.
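The marginal inference step can be sketched in plain Python as follows. This is an illustrative log-space forward-backward pass under stated assumptions, not the released implementation. It assumes that emissions are already computed for each CRF, that there are no start or end transitions, that label index 1 stands for Yes, and that the sentence representation is the marginal-weighted sum of the word vectors. All names are hypothetical.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def crf_marginals(emissions, transitions):
    """Per-position label marginals of a linear-chain CRF (log-space)."""
    n, k = len(emissions), len(emissions[0])
    # alpha[t][b]: log-sum of scores of all prefixes ending in label b at t.
    alpha = [list(emissions[0])]
    for t in range(1, n):
        alpha.append([
            emissions[t][b]
            + logsumexp([alpha[t - 1][a] + transitions[a][b] for a in range(k)])
            for b in range(k)
        ])
    # beta[t][a]: log-sum of scores of all suffixes that follow label a at t.
    beta = [[0.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            logsumexp([transitions[a][b] + emissions[t + 1][b] + beta[t + 1][b]
                       for b in range(k)])
            for a in range(k)
        ]
    log_z = logsumexp(alpha[-1])
    return [[math.exp(alpha[t][b] + beta[t][b] - log_z) for b in range(k)]
            for t in range(n)]

def sentence_representation(marginals, reps, yes=1):
    """Sum of word vectors weighted by the marginal probability of Yes."""
    dim = len(reps[0])
    return [sum(m[yes] * r[d] for m, r in zip(marginals, reps))
            for d in range(dim)]

def multi_crf_representation(crf_params, reps):
    """Concatenate the sentence representations produced by each CRF."""
    q = []
    for emissions, transitions in crf_params:
        marginals = crf_marginals(emissions, transitions)
        q.extend(sentence_representation(marginals, reps))
    return q

reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word representations
crf_a = ([[0.2, 1.0], [0.5, 0.1], [0.0, 2.0]], [[0.3, -0.2], [-0.1, 0.6]])
crf_b = ([[1.0, 0.0], [0.0, 1.5], [0.4, 0.4]], [[0.0, 0.0], [0.0, 0.0]])
q = multi_crf_representation([crf_a, crf_b], reps)
```

Because each CRF has its own emission and transition scores, the CRFs can place their Yes mass on different words, and the concatenated vector keeps those views separate.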

Sentiment Classification
The sentence representation $q$ is passed to a sentiment classifier to obtain the distribution over sentiment polarities:
$$p(y \mid q) = softmax(Wq + b)$$
where $W$ and $b$ are learnable parameters of the sentiment classifier layer. We learn the model parameters by minimizing the negative log-likelihood.
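A minimal sketch of the classification step and its training loss, assuming a single linear layer followed by a softmax over the three polarities. The weights, the input vector, and the label order below are made-up toy values.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(q, W, b):
    """Return the polarity distribution softmax(W q + b)."""
    logits = [sum(w * x for w, x in zip(row, q)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

def negative_log_likelihood(probs, gold):
    """Loss for one instance whose gold label index is `gold`."""
    return -math.log(probs[gold])

LABELS = ["positive", "negative", "neutral"]  # assumed label order
q = [0.5, -1.0, 2.0, 0.1]
W = [[0.2, 0.0, 0.5, 0.0], [0.0, 0.3, -0.4, 0.1], [0.1, 0.1, 0.0, 0.0]]
b = [0.0, 0.1, -0.1]
probs = classify(q, W, b)
loss = negative_log_likelihood(probs, LABELS.index("positive"))
```

Summing this loss over the training instances gives the objective that the optimizer minimizes.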

Experimental Setup
Our proposed MCRF-SA model is evaluated on four benchmark datasets from three SemEval tasks: SemEval 2014 Task 4 (Pontiki et al., 2014), SemEval 2015 Task 12 (Pontiki et al., 2015), and SemEval 2016 Task 5 (Pontiki et al., 2016). Following the previous works (Tang et al., 2016; Wang and Lu, 2018; [...] Table 1. We use the 300d GloVe vectors (Pennington et al., 2014) to initialize our word embeddings. One-sixth of the instances are randomly selected from the original training dataset as the development dataset, and the model is trained only on the remaining data. With the development set, we tune our model hyper-parameters using an open-source black-box tuner (Alberto and Giacomo, 2018). We set the hidden size of the GRU to 32 or 64. The batch size is set to 64 or 96. The dropout rate is selected from 0.3 to 0.8, with a step size of 0.1. The dimension of the aspect indicator is selected from {50, 70, 90}. The value of γ in the position decay function is selected from {1, 2, 3}. The number of GRU layers is selected from {1, 2, 3}. We adopt Adam (Kingma and Ba, 2014) to optimize our model with a learning rate of 0.008. All hyper-parameters are selected based on the best performance on the development set.
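For reference, the search space and the development split described above can be written down as a small configuration sketch. The dictionary keys, the function name `split_train_dev`, and the random seed are illustrative assumptions; the black-box tuner itself is not reproduced here.

```python
import random

# Candidate values listed in the experimental setup.
SEARCH_SPACE = {
    "gru_hidden_size": [32, 64],
    "batch_size": [64, 96],
    "dropout": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "aspect_indicator_dim": [50, 70, 90],
    "gamma": [1, 2, 3],
    "gru_layers": [1, 2, 3],
}
# Settings that are fixed rather than tuned.
FIXED = {
    "word_embedding": "GloVe 300d",
    "optimizer": "Adam",
    "learning_rate": 0.008,
}

def split_train_dev(instances, dev_fraction=1 / 6, seed=0):
    """Randomly hold out a fraction of the training instances as a dev set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# Toy usage with 600 placeholder instance ids.
train, dev = split_train_dev(range(600))
```

A tuner would draw one value per key from `SEARCH_SPACE`, train on `train`, and keep the configuration that scores best on `dev`.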

Baselines
Our MCRF-SA model is compared with the following methods. SVM (Kiritchenko et al., 2014) is a support vector machine based method that integrates surface, lexicon, and parse features. ATAE-LSTM (Wang et al., 2016) is an LSTM (Hochreiter and Schmidhuber, 1997) based model, which has an extra attention layer to perform soft-selection over the context words. MemNet (Tang et al., 2016) introduces a deep memory network to implement attention mechanisms that learn the relatedness of context words towards the aspect. IAN (Ma et al., 2017) utilizes two LSTM based attention models to learn both context and aspect representations interactively. SA-LSTM-P (Wang and Lu, 2018) employs structured attention networks with multiple regularizers to capture the opinion spans for ASC. TNet-LF (Li et al., 2018a) implements a context-preserving mechanism to get the aspect-specific word representations and uses a Convolutional Neural Network (CNN) (Lecun et al., 1998) layer to obtain the sentence representation. TNet-ATT (Tang et al., 2019) is an extension of TNet-LF, and it provides an attention supervision mining mechanism to improve the previous model. ASCNN and ASGCN (Zhang et al., 2019) use a CNN and a Graph Convolutional Network (GCN) (Kipf and Welling, 2017), respectively, to capture long-range dependencies and syntactic information.

Experimental Results
Our proposed model shows significant improvements on the four datasets; Table 2 shows the performance comparisons. Our method outperforms SVM (Kiritchenko et al., 2014) by 2.7 and 7.15 accuracy points on 14Rest and 14Lap, respectively. This indicates that our neural approach extracts more effective features than hard-coded feature engineering. Compared to the attention-based methods, namely ATAE-LSTM (Wang et al., 2016), MemNet (Tang et al., 2016), IAN (Ma et al., 2017), and TNet-ATT (Tang et al., 2019), our MCRF-SA model pays more attention to the aspect-specific opinion spans, which brings significant performance improvement on the four datasets.
We also compare our model with methods that focus on word segmentations for sentiment classification. Our method outperforms the previous regularizer-guided structured attention model SA-LSTM-P (Wang and Lu, 2018) by more than 1.2 accuracy points on 14Rest and 14Lap. TNet-LF (Li et al., 2018a) and ASCNN (Zhang et al., 2019) employ a CNN to evaluate how much each word span contributes to the sentiment, but the kernel size limits the length of the span. ASGCN (Zhang et al., 2019) employs a GCN over the dependency tree to capture syntactic and dependency information. However, its performance relies heavily on the accuracy of the dependency trees. Our proposed multi-CRF structured attention, along with the position decay function, allows MCRF-SA to perform soft-selection of multiple aspect-specific opinion spans that influence the aspect's sentiment. The large performance gaps between our model and the baseline models confirm the effectiveness of our proposed architecture. Such results also demonstrate that sentiment classification can benefit greatly from aspect-specific opinion spans.
Furthermore, we observe that the performance on 15Rest is not as good as the other three datasets. Such behavior is caused by the different distribution of positive, neutral, and negative sentiment between training and test set, shown in Table 1.

Effect of Number of CRFs
To fully investigate the effect of the number of CRFs, we conduct additional experiments on 14Rest and 14Lap with the number of CRFs ∈ {1, 2, 3, ..., 16}. Figure 2 shows the experimental results. The model achieves the best performance when the number of CRFs equals 4. In particular, the performance largely plateaus when a large number of CRFs is adopted. We believe this is because the four benchmark datasets are relatively small, and an excessively large number of parameters may not be able to further extract useful features.
Figure 3 shows the marginal distributions (Equation 5) of SA-LSTM-P (Wang and Lu, 2018) and our MCRF-SA model. The aspect in the given example is "Indian food" with negative sentiment, and only our model predicts the correct sentiment. In the heat map of Figure 3b, the different marginal distributions of the four CRFs indicate that our model indeed captures different opinion features. It can be observed that MCRF-SA is able to attend to the two major opinion spans: "real" and "n't". The SA-LSTM-P model returns positive sentiment because it focuses too much on the wrong opinion words.

Case Study and Error Analysis
We also analyze some common errors from our MCRF-SA model, ASGCN, and TNet-ATT on the 14Lap dataset. We observe two major types of errors, and Table 3 shows the examples for error analysis. The first two sentences belong to the type 1 error and the last one presents a type 2 error. The first type of error appears frequently in neutral cases. In general, the neural models cannot reliably tell whether the negative expressions (e.g., "cost", "shouldn't") are associated with the target/aspect. The second type typically involves complicated sentence structures with non-trivial semantics, which require advanced language understanding capability.

Ablation Study
We examine the effectiveness of the major components of our MCRF-SA model, and Table 3 presents the ablation results on the 14Rest and 14Lap datasets. Without the aspect indicator, our model becomes a sentence-level sentiment classification method, which inevitably produces wrong predictions for sentences having multiple aspects with different sentiments. Removing the position decay function hurts the performance by 2.84 and 1.11 F1 points on 14Rest and 14Lap, respectively. Lastly, without the multi-CRF structured attention layer, the architecture becomes a simple Bi-GRU based model and the performance drops significantly, by 4.89 and 10.49 F1 points on 14Rest and 14Lap.

Conclusion
We propose a simple and effective MCRF-SA model to extract aspect-specific opinion span features. In particular, with the proposed multi-CRF structured attention layer and the effective position decay function, our model is capable of extracting various aspect-specific opinion span features from different representation sub-spaces. The experimental results demonstrate that our method effectively exploits the corresponding opinion features for sentiment classification. One future direction is to investigate how to integrate the two different attention mechanisms, namely the standard attention and structured attention for NLP applications.