Aspect Sentiment Classification Towards Question-Answering with Reinforced Bidirectional Attention Network

In the literature, existing studies on aspect sentiment classification (ASC) focus on individual non-interactive reviews. This paper extends the research to interactive reviews and proposes a new research task, namely Aspect Sentiment Classification towards Question-Answering (ASC-QA), for real-world applications. This new task aims to predict sentiment polarities for specific aspects from interactive QA style reviews. In particular, a high-quality annotated corpus is constructed for ASC-QA to facilitate corresponding research. On this basis, a Reinforced Bidirectional Attention Network (RBAN) approach is proposed to address two inherent challenges in ASC-QA, i.e., semantic matching between question and answer, and data noise. Experimental results demonstrate the great advantage of the proposed approach to ASC-QA against several state-of-the-art baselines.


Introduction
As a fine-grained sentiment analysis task, Aspect Sentiment Classification (ASC) aims to predict sentiment polarities (e.g., positive, negative, neutral) towards given aspects of a text, and it has drawn increasing interest in natural language processing and computational linguistics over the past few years (Jiang et al., 2011; Tang et al., 2016b). However, most of the existing studies on ASC focus on individual non-interactive reviews, such as customer reviews (Pontiki et al., 2014) and tweets (Mitchell et al., 2013; Vo and Zhang, 2015; Dong et al., 2014). For example, in a customer review "The food is delicious, but ambience is badly in need of improvement.", the customer mentions two aspects, i.e., "food" and "ambience", and expresses positive sentiment towards the former and negative sentiment towards the latter. Recently, a new interactive reviewing form, namely "Customer Question-Answering (QA)", has become increasingly popular, and a large number of such QA style reviews (as shown in Figure 1) can be found on several well-known e-commerce platforms (e.g., Amazon and Taobao). Compared to traditional non-interactive customer reviews, interactive QA style reviews are more reliable and convincing because answer providers are randomly selected from the real customers who have purchased the product (Shen et al., 2018a). To understand QA style reviews automatically, it is worthwhile to perform ASC on them.
However, we believe that Aspect Sentiment Classification towards QA (ASC-QA) is not easy work and this novel task faces at least two major challenges. On one hand, different from traditional non-interactive reviews with a single sequence structure, interactive QA style reviews consist of two parallel units, i.e., question and answer. Thus, it's rather difficult to infer the sentiment polarity towards an aspect based on a single question or single answer. Take Figure 1 as an example. A well-behaved approach to ASC-QA should match each question and answer bidirectionally so as to correctly determine the sentiment polarity towards a specific aspect.
On the other hand, different from common QA matching tasks such as question-answering (Shen et al., 2018a), ASC-QA focuses on extracting sentiment information towards a specific aspect and may suffer from much aspect-irrelevant noisy information. For instance, in Figure 1, although the words in the answer (e.g., "quite slow", "obtuse") and the question (e.g., "operating speed") are relevant to aspect "operating speed", they are noisy for the other aspect "battery life". These noisy words might provide wrong signals and mislead the model into assigning a negative sentiment polarity to aspect "battery life" and vice versa. Therefore, a well-behaved approach to ASC-QA should alleviate the effects of noisy words for a specific aspect in both question and answer during model training.
In this paper, we propose a reinforced bidirectional attention network approach to tackle the above two challenges. Specifically, we first propose a word selection model, namely Reinforced Aspect-relevant Word Selector (RAWS), to alleviate the effects of noisy words for a specific aspect by discarding noisy words and selecting only aspect-relevant words in a word sequence. On the basis of RAWS, we then develop a Reinforced Bidirectional Attention Network (RBAN) approach to ASC-QA, which employs two fundamental RAWS modules to perform word selection over the question and answer text respectively. In this way, RBAN is capable of not only addressing the semantic matching problem in the QA text pair, but also alleviating the effects of noisy words for a specific aspect on both the question and answer sides. Finally, we optimize RBAN via a reinforcement learning algorithm, i.e., policy gradient (Williams, 1992; Sutton et al., 1999). The main contributions of this paper are twofold:
• We propose a new research task, i.e., Aspect Sentiment Classification towards Question-Answering (ASC-QA), and construct a high-quality annotated benchmark corpus for this task.
• We propose an innovative reinforced bidirectional attention network approach to ASC-QA and validate the effectiveness of this approach through extensive experiments.

Data Collection and Annotation
We collect 150k QA style reviews from Taobao, the most famous e-commerce platform in China. The QA style reviews cover three different domains: Bags, Cosmetics and Electronics. Since corpus annotation is labor-intensive and time-consuming, we randomly select 10k QA text pairs from each domain to perform annotation. Specifically, following Pontiki et al. (2014), we define an aspect at two levels of granularity, i.e., aspect term and aspect category. Besides, following Pontiki et al. (2015), we define three sentiment polarities, i.e., positive, negative and neutral (mildly positive or mildly negative) towards both aspect terms and categories. In this way, each QA text pair is annotated with two tuples, i.e., (aspect term, polarity) and (aspect category, polarity).
For Tuple (Aspect Term, Polarity), we annotate the single/multi-word terms together with their corresponding polarities inside each QA text pair according to four main guidelines as follows: (1) We only annotate the aspect term when the related question and answer are matched. For example, the QA text pair in Figure 1 is annotated as ("battery life", positive) and ("operating speed", negative) due to words "durable", "slow" and "obtuse". However, in E1, the answer does not reply to the question correctly and thus the aspects of "macos" and "screen" will not be annotated.

E1: Q: Is macos good? How about the screen?
A: The shopkeeper is very warm-hearted.
(2) We only annotate the aspect term towards which an opinion is expressed. For example, in E2, the answer conveys only objective information without expressing opinions towards "phone" and thus "phone" will not be annotated. However, "case" will be annotated and tagged as neutral.
E2: Q: How is this phone? How about the case?
A: I bought this phone yesterday. Case is okay nothing great.
(3) We only annotate aspect terms which explicitly name particular aspects. For example, in E3, "this", "it" will not be annotated.
(4) When one aspect term has two different descriptions in both question and answer, the annotated aspect term should be consistent with the question. For example, in E4, the annotated aspect term should be "battery life" instead of "battery".

E4: Q: Is battery life durable?
A: Yes, this battery is very durable.
For Tuple (Aspect Category, Polarity), we first define 15, 16 and 10 aspect categories² (as shown in Table 1) for the domains of Bags, Cosmetics and Electronics respectively. Then, we annotate aspect categories (chosen from the above predefined category list) discussed in each QA text pair according to similar guidelines for aspect term. For example, there are two aspect categories discussed in Figure 1, i.e., Battery and System Performance, and annotated as (Battery, positive) and (System Performance, negative) respectively. Finally, we discard the QA text pairs which have no annotated term and category.
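The two annotation tuples and the final filtering step can be pictured with a small record type. The following is only an illustrative sketch: the class name `QAPair`, its field names and the helper `keep_annotated` are hypothetical, and the filter assumes that a pair is discarded when it has neither a term annotation nor a category annotation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The three polarities defined in the annotation scheme.
POLARITIES = ("positive", "negative", "neutral")


@dataclass
class QAPair:
    """One QA style review with its two kinds of annotated tuples (hypothetical layout)."""
    question: str
    answer: str
    # (aspect term, polarity), e.g. ("battery life", "positive")
    term_labels: List[Tuple[str, str]] = field(default_factory=list)
    # (aspect category, polarity), e.g. ("Battery", "positive")
    category_labels: List[Tuple[str, str]] = field(default_factory=list)


def keep_annotated(pairs: List[QAPair]) -> List[QAPair]:
    """Drop QA pairs that carry neither a term nor a category annotation (assumed reading)."""
    return [p for p in pairs if p.term_labels or p.category_labels]
```

Under this layout, E1 would be stored with two empty label lists and removed by `keep_annotated`, while E4 would keep its ("battery life", polarity) tuple.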
We assign two annotators to tag each QA text pair, and the Kappa consistency check value of the annotation is 0.81. When two annotators cannot reach an agreement, an expert makes the final decision, ensuring the quality of data annotation. Table 2 shows the statistics of the final corpus. To encourage future work in this line of research, the annotated corpus covering the three domains is released on GitHub³.
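For readers who want to reproduce an agreement check of this kind, the sketch below computes Cohen's kappa for two annotators. It assumes that the reported Kappa value is Cohen's kappa over per-item polarity labels, which the text does not state explicitly; the function name `cohen_kappa` is illustrative.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.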

Our Approach
In this section, we first introduce the word selection model, i.e., Reinforced Aspect-relevant Word Selector (RAWS) as illustrated in Figure 2, which functions as a fundamental module of our approach to alleviate the effects of noisy words (Section 3.1). On the basis of RAWS, we present the Reinforced Bidirectional Attention Network (RBAN) approach to ASC-QA as illustrated in Figure 3, which employs two RAWS modules to perform word selection over the question and answer text respectively (Section 3.2). Finally, we introduce our optimization strategy via policy gradient and back-propagation (Section 3.3).
² Aspect categories are defined and summarized through preliminary annotation.
³ https://github.com/jjwangnlp/ASC-QA

Reinforced Aspect-relevant Word Selector (RAWS)
Figure 2 shows the framework of the word selection model, i.e., Reinforced Aspect-relevant Word Selector (RAWS). Given an input word sequence $x = \{x_1, \ldots, x_E\}$, RAWS aims to discard noisy words and select only aspect-relevant words inside $x$ for a specific aspect $x_{aspect}$. In this way, RAWS effectively functions as a "hard" attention mechanism and thus cannot be directly optimized through back-propagation because it is non-differentiable, as discussed in Xu et al. (2015) and Shen et al. (2018b). To address this issue, we employ the reinforcement learning algorithm, i.e., policy gradient (Sutton et al., 1999), to model RAWS. In this fashion, RAWS acts as an agent which decides whether to select each word by following a policy network as follows.
Policy Network. In this paper, we adopt a stochastic policy network $p_\pi$ which provides a conditional probability distribution $p_\pi(o \mid \cdot)$ over the action sequence $o = \{o_1, \ldots, o_E\}$, where $o_i = 1$ means that word $x_i$ is selected and $o_i = 0$ means that it is discarded. More specifically, we adopt LSTM (Graves, 2013) to construct the policy network $p_\pi$ for performing word selection over word sequence $x$, denoted as $\mathrm{LSTM}_p$. In order to differentiate whether a word is selected or discarded, inspired by Lei et al. (2016), we incorporate the action result $o_i$ into the input $\hat{v}_i$ of $\mathrm{LSTM}_p$ at time-step $i$ and compute the hidden state $h_i \in \mathbb{R}^d$ of word $x_i$ as:
$$h_i = \mathrm{LSTM}_p(\hat{v}_i, h_{i-1}) \quad (1)$$
where $\hat{v}_i$ is the input at time-step $i$ that carries the action result $o_i$.
In principle, the policy network $p_\pi$ uses a Reward to guide the policy learning over word sequence $x$. It samples an Action $o_i$ with the probability $p_\pi(o_i \mid s_i; \theta_r)$ at each State $s_i$. In this paper, state, action and reward are defined as follows.
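• State. The state $s_i$ at the $i$-th time-step should provide adequate information for deciding whether to select a word for aspect $x_{aspect}$. Thus, the state $s_i \in \mathbb{R}^{4d}$ is composed of four parts, the first of which is the previous hidden state $h_{i-1}$.
• Action. At each time-step, the policy network decides whether to select word $x_i$ or discard it, which could be cast as a binary classification problem. Thus, we use a logistic function to compute $p_\pi(o_i \mid s_i; \theta_r)$:
$$o_i \sim p_\pi(o_i \mid s_i; \theta_r), \quad p_\pi(o_i = 1 \mid s_i; \theta_r) = \sigma(W_r s_i + b_r) \quad (2)$$
where $\theta_r = \{W_r \in \mathbb{R}^{1 \times 4d}, b_r \in \mathbb{R}\}$ is the parameter to be learned, $\sigma$ is the logistic function, and $\sim$ denotes the discrete action sampling operation.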
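The action step can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the released implementation: the state is passed in as an already-built list of floats, `weights` and `bias` stand in for $W_r$ and $b_r$, and the function names are hypothetical.

```python
import math
import random
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def select_probability(state: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability of selecting the current word, sigma(W_r s_i + b_r)."""
    if len(state) != len(weights):
        raise ValueError("state and weights must have the same length")
    return sigmoid(sum(w * s for w, s in zip(weights, state)) + bias)


def sample_action(state: Sequence[float], weights: Sequence[float], bias: float,
                  rng: random.Random) -> int:
    """Sample o_i in {0, 1}: 1 keeps the word, 0 discards it."""
    return 1 if rng.random() < select_probability(state, weights, bias) else 0
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when several action sequences are drawn for the same text.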
• Reward. In order to select aspect-relevant words inside word sequence $x$, we define an aspect-relevant reward $R$ (Eq. (3)) based on the cosine similarity between the aspect vector $v_a \in \mathbb{R}^d$ of $x_{aspect}$ and the last hidden state $h_E \in \mathbb{R}^d$ of $\mathrm{LSTM}_p$ after $p_\pi$ finishes all actions. Besides, it is worth mentioning that we regard the loss term $\log p(y \mid (P, x_{aspect}))$ presented in Eq. (10) from the classification phase as another, delayed loss reward. This loss reward, combined with the above cosine reward, provides adequate supervision signals to guide RAWS to select words that are both aspect-relevant and discriminative (e.g., sentiment words "slow" and "obtuse" for aspect "operating speed") for performing ASC-QA. $\gamma E'/E$ is an additional term for limiting the number of selected words, where $E' = \sum_{i=1}^{E} o_i$ denotes the number of selected words and $\gamma$ is a penalty weight (tuned to 0.01 on the development set).
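To make the three ingredients of the reward concrete, the sketch below combines them in one function. The exact form of Eq. (3) is not reproduced here, so the way the terms are combined (cosine similarity plus log-likelihood minus the length penalty) is an assumption made for illustration, and `raws_reward` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def raws_reward(aspect_vec: Sequence[float], last_hidden: Sequence[float],
                log_prob_gold: float, actions: Sequence[int],
                gamma: float = 0.01) -> float:
    """Assumed combination: cos(v_a, h_E) + log p(y | ...) - gamma * E'/E."""
    if not actions:
        raise ValueError("actions must not be empty")
    relevance = cosine(aspect_vec, last_hidden)    # aspect-relevant term
    penalty = gamma * sum(actions) / len(actions)  # E'/E: share of selected words
    return relevance + log_prob_gold - penalty
```

With everything else equal, a sequence that keeps fewer words receives a slightly higher reward, which is the role the text assigns to the $\gamma E'/E$ term.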

Reinforced Bidirectional Attention Network (RBAN)
Figure 3 shows the overall framework of our proposed reinforced bidirectional attention network (RBAN) approach to ASC-QA, which consists of three parts: 1) Word Encoder, 2) Reinforced Bidirectional Attention, and 3) Softmax Decoder.
Word Encoder. Given a QA text pair $P$ with an aspect $x_{aspect}$, let $x^q = \{x^q_i\}, \forall i \in [1, E_q]$ denote the word sequence in the question text, and $x^a = \{x^a_j\}, \forall j \in [1, E_a]$ denote the word sequence in the answer text. To alleviate the effects of noisy words for aspect $x_{aspect}$ in both the question and answer text, we make use of two RAWS modules (as introduced in Section 3.1) to perform word selection over question $x^q$ and answer $x^a$ respectively. More specifically, we employ two $\mathrm{LSTM}_p$ to construct policy networks $p^q_\pi$ and $p^a_\pi$ for sampling action $o^q$ over question $x^q$ and sampling action $o^a$ over answer $x^a$. Here, the two $\mathrm{LSTM}_p$ are denoted as $\mathrm{LSTM}^q_p$ and $\mathrm{LSTM}^a_p$ respectively. Therefore, according to Eq. (1), the hidden states $h^q_i, h^a_j \in \mathbb{R}^d$ of words $x^q_i$ and $x^a_j$ are computed as:
$$h^q_i = \mathrm{LSTM}^q_p(\hat{v}^q_i, h^q_{i-1}) \quad (4)$$
$$h^a_j = \mathrm{LSTM}^a_p(\hat{v}^a_j, h^a_{j-1}) \quad (5)$$
where $\hat{v}^q_i$ and $\hat{v}^a_j$ are the inputs that carry the action results $o^q_i$ and $o^a_j$.
Reinforced Bidirectional Attention. After obtaining the actions $o^q = [\ldots, o^q_i, \ldots]$ and $o^a = [\ldots, o^a_j, \ldots]$ over question $x^q$ and answer $x^a$, we employ a positional mask matrix $M \in \mathbb{R}^{E_q \times E_a}$ to calculate the matching matrix $S \in \mathbb{R}^{E_q \times E_a}$ between question and answer as:
$$S_{ij} = w^\top \tanh(W_1 h^q_i + W_2 h^a_j + b) + M_{ij} \quad (6)$$
where $S_{ij}$ denotes the similarity between the $i$-th question word and the $j$-th answer word; $M_{ij} = -\infty$ leads to $S_{ij} = -\infty$, indicating that the $i$-th question word or the $j$-th answer word has been regarded as a noisy word for $x_{aspect}$ and thus discarded by RAWS; $W_1, W_2 \in \mathbb{R}^{d \times d}$ and $w, b \in \mathbb{R}^d$ are the trainable parameters.
In order to mine semantic matching information between question and answer, we employ $S$ to compute attentions in both directions, which could be seen as a Question-to-Answer attention and an Answer-to-Question attention. Specifically, we first employ the row/column-wise softmax operation to get two normalized matrices $S^r$ and $S^c$:
$$S^r = \mathrm{softmax}_{\mathrm{row}}(S), \quad S^c = \mathrm{softmax}_{\mathrm{col}}(S) \quad (7)$$
where $S_{ij} = -\infty$ leads to $S^r_{ij}, S^c_{ij} = 0$ when the softmax operation is applied. This switches off the attentions between word $x^q_i$ and $x^a_j$ so as to filter the noisy word information and only mine the matching information relevant to aspect $x_{aspect}$.
Second, since each word $x^q_i$ in the question interacts with all words in answer $x^a$ and vice versa, its importance can be measured as the summation of the strengths of all these interactions, i.e., matching scores computed in Eq. (7). Therefore, we perform the row/column-wise summation operation over the normalized matching matrices, i.e., $\hat{\alpha}^a = \sum_i S^r_{i:}$ and $\hat{\alpha}^q = \sum_j S^c_{:j}$, where $\hat{\alpha}^a = [\ldots, \hat{\alpha}^a_j, \ldots] \in \mathbb{R}^{E_a}$ and $\hat{\alpha}^q = [\ldots, \hat{\alpha}^q_i, \ldots] \in \mathbb{R}^{E_q}$ are matching score vectors. Finally, the bidirectional attention is computed as follows:
• Question-to-Answer Attention (Q2A). We first perform the softmax operation over $\hat{\alpha}^a$ to compute the attention weight $\alpha^a_j$ of word $x^a_j$ in the answer text as $\alpha^a_j = \frac{\exp(\hat{\alpha}^a_j)}{\sum_{k=1}^{E_a} \exp(\hat{\alpha}^a_k)}$. Then, the vector $s^a \in \mathbb{R}^d$ of the answer text is computed as a weighted sum of hidden states $h^a_j$ based on the attention weight $\alpha^a_j$, i.e., $s^a = \sum_{j=1}^{E_a} \alpha^a_j h^a_j$.
• Answer-to-Question Attention (A2Q). Similar to question-to-answer attention, the question vector $s^q \in \mathbb{R}^d$ is computed based on the attention weight $\alpha^q_i$, i.e., $s^q = \sum_{i=1}^{E_q} \alpha^q_i h^q_i$.
Finally, we concatenate the answer vector $s^a$ and question vector $s^q$ so as to obtain the vector representation $r \in \mathbb{R}^{2d}$ of the QA text pair $P$, i.e., $r = s^a \oplus s^q$.
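The masking and the two attention directions can be traced on a toy matching matrix. The sketch below is an assumption-laden illustration, not the authors' code: it takes raw scores and the two action vectors as plain lists, treats a fully masked row or column as all zeros, and applies a plain softmax to the matching score vectors as the text describes (whether discarded words are masked again at that last step is not stated).

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def masked_softmax(scores: Sequence[float]) -> List[float]:
    """Softmax in which -inf entries get weight 0 (all -inf gives all zeros)."""
    finite = [s for s in scores if s != NEG_INF]
    if not finite:
        return [0.0] * len(scores)
    top = max(finite)
    exps = [0.0 if s == NEG_INF else math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_attention(
    scores: Sequence[Sequence[float]], o_q: Sequence[int], o_a: Sequence[int]
) -> Tuple[List[float], List[float], List[float], List[float]]:
    """Return (alpha_a_hat, alpha_q_hat, alpha_a, alpha_q) for one QA pair."""
    n_q, n_a = len(o_q), len(o_a)
    # A pair (i, j) survives only if both words were selected by RAWS.
    masked = [
        [scores[i][j] if o_q[i] == 1 and o_a[j] == 1 else NEG_INF for j in range(n_a)]
        for i in range(n_q)
    ]
    s_row = [masked_softmax(row) for row in masked]  # S^r: each row normalised
    # s_col_t[j][i] holds S^c_ij: each column normalised, stored transposed.
    s_col_t = [masked_softmax([masked[i][j] for i in range(n_q)]) for j in range(n_a)]
    alpha_a_hat = [sum(s_row[i][j] for i in range(n_q)) for j in range(n_a)]
    alpha_q_hat = [sum(s_col_t[j][i] for j in range(n_a)) for i in range(n_q)]
    # Final attention weights: softmax over the matching score vectors.
    return alpha_a_hat, alpha_q_hat, masked_softmax(alpha_a_hat), masked_softmax(alpha_q_hat)
```

The answer and question vectors would then be the weighted sums of the hidden states under `alpha_a` and `alpha_q`, concatenated to form the pair representation.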
Softmax Decoder. To perform ASC-QA, we feed the vector $r$ to a softmax classifier, i.e., $\beta = W r + b$, where $\beta \in \mathbb{R}^C$ is the output vector. Then, the probability of labeling the sentence with sentiment polarity $l \in [1, C]$ is computed by $p_\theta = \frac{\exp(\beta_l)}{\sum_{c=1}^{C} \exp(\beta_c)}$. Finally, the label with the highest probability stands for the predicted sentiment polarity towards aspect $x_{aspect}$.

Optimization via Policy Gradient and Back-Propagation
The parameters in RBAN are divided into two groups: 1) $\theta^q_r$ and $\theta^a_r$ for the policy networks $p^q_\pi$ and $p^a_\pi$ in the two fundamental RAWS modules; 2) $\theta$ for the remaining parts, including word embeddings, LSTM, bidirectional attention and softmax decoder.
For $\theta^q_r$, we optimize it with the policy gradient algorithm (Sutton et al., 1999). In detail, we first obtain an aspect-relevant reward $R^q$ according to Eq. (3) after $p^q_\pi$ finishes all actions. Then, the policy gradient w.r.t. $\theta^q_r$ is computed by differentiating the expected reward $J(\theta^q_r)$ to be maximized as follows:
$$\nabla_{\theta^q_r} J(\theta^q_r) = \mathbb{E}_{o^q \sim p^q_\pi}\Big[\sum_{i=1}^{E_q} R^q \, \nabla_{\theta^q_r} \log p^q_\pi(o^q_i \mid s^q_i; \theta^q_r)\Big] \quad (8)$$
where $\nabla_{\theta^q_r} J(\theta^q_r)$ is estimated by using Monte-Carlo simulation (Sutton et al., 1999) to sample some action sequences over question texts. Similarly, the policy gradient w.r.t. $\theta^a_r$ is computed as:
$$\nabla_{\theta^a_r} J(\theta^a_r) = \mathbb{E}_{o^a \sim p^a_\pi}\Big[\sum_{j=1}^{E_a} R^a \, \nabla_{\theta^a_r} \log p^a_\pi(o^a_j \mid s^a_j; \theta^a_r)\Big] \quad (9)$$
For $\theta$, we optimize it with back-propagation. In detail, the objective of learning $\theta$ is to minimize the cross-entropy loss function in the classification phase as follows:
$$J(\theta) = \mathbb{E}_{(P, x_{aspect}, y) \sim D}\big[-\log p(y \mid (P, x_{aspect}))\big] \quad (10)$$
where $(P, x_{aspect}, y)$ denotes a QA text pair $P$ with given aspect $x_{aspect}$ from dataset $D$, and $y$ is the ground-truth sentiment polarity towards aspect $x_{aspect}$. Note that, during model training, $\theta^q_r$ and $\theta^a_r$ are not updated in the early stage, and thus the two RAWS modules select all words in question and answer. When $\theta$ has been optimized until the loss over the development set no longer decreases significantly, we then begin to optimize $\theta$, $\theta^q_r$ and $\theta^a_r$ simultaneously.
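For a logistic policy, the policy-gradient term of one sampled action sequence has a closed form, because the gradient of $\log p_\pi(o_i \mid s_i)$ with respect to the weights is $(o_i - \sigma(W_r s_i + b_r))\, s_i$. The sketch below computes that single-sample estimate. It is a simplification under stated assumptions: the states are treated as fixed inputs (in the full model they depend on the LSTM and on earlier actions), and `reinforce_gradient` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reinforce_gradient(states: Sequence[Sequence[float]], actions: Sequence[int],
                       reward: float, weights: Sequence[float],
                       bias: float) -> Tuple[List[float], float]:
    """Single-sample estimate of sum_i R * grad log p(o_i | s_i) for a logistic policy."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for state, action in zip(states, actions):
        p_select = sigmoid(sum(w * x for w, x in zip(weights, state)) + bias)
        # d/dz log p(o | s) = o - sigma(z) for a Bernoulli-logistic policy.
        coeff = reward * (action - p_select)
        for k, x in enumerate(state):
            grad_w[k] += coeff * x
        grad_b += coeff
    return grad_w, grad_b
```

Averaging this estimate over several sampled action sequences gives the Monte-Carlo approximation, and the parameters are then moved in the direction of the gradient (ascent), since the expected reward is maximized.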

Experimentation
We systematically evaluate the performance of our proposed RBAN approach to ASC-QA on the corpus as described in Section 2.

Experimental Settings
Data Settings. As introduced in Section 2, we have annotated QA text pairs from three different domains listed in Table 2. For each domain, we randomly split the annotated data into training, development, and testing sets with the ratio of 8:1:1.
Word Embedding. We first adopt FudanNLP (Qiu et al., 2013) to perform word segmentation over our collected 150k Chinese QA text pairs. Then, we employ these QA text pairs to pre-train 200-dimension word vectors with skip-gram⁶.
Hyper-parameters. In all our experiments, word embeddings are optimized during training. The dimensions of LSTM hidden states are set to be 200. The other hyper-parameters are tuned according to the development set. Specifically, we adopt the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.01 for cross-entropy training and adopt the SGD optimizer with a learning rate of 0.002 for all policy gradient training. The regularization weight of parameters is $10^{-5}$, the dropout rate is 0.25 and the batch size is 32.
⁶ https://github.com/dav/word2vec
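The random 8:1:1 split described under Data Settings can be sketched as follows. The function name `split_8_1_1` and the fixed seed are illustrative choices; the text does not say how rounding or seeding was handled, so integer division with the remainder going to the test set is an assumption.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into training, development and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test
```

The split would be applied separately to each of the three domains, so that every domain has its own training, development and test sets.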
Evaluation Metrics. The performance is evaluated using Accuracy (Acc.) and Macro-F1 (F1). Moreover, a t-test is used to evaluate the significance (Yang and Liu, 1999).
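As a reminder of how the two metrics differ, the sketch below computes both from gold and predicted labels. It is a generic illustration with hypothetical function names; a class with no true positives is given an F1 of 0, which is one common convention and an assumption here.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of instances whose predicted polarity equals the gold polarity."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because Macro-F1 weights every class equally, a model that rarely predicts the neutral class is penalized much more by Macro-F1 than by Accuracy.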
Task Definition. Our proposed ASC-QA consists of two sub-tasks: 1) Term-level ASC-QA. Given a set of pre-identified aspect terms, this subtask is to determine the polarity towards each aspect term inside a QA text pair. 2) Category-level ASC-QA. Given a set of pre-identified aspect categories, this sub-task is to determine the polarity towards each aspect category discussed in a QA text pair.

Baselines
For comparison, we implement several state-of-the-art approaches to ASC as baselines. Since the input of all these approaches should be a single sequence, we concatenate question and answer text to generate a single sequence. Besides, we employ some QA matching approaches to ASC-QA and implement several basic versions of RBAN as baselines. Note that, for fair comparison, all the above baselines adopt the same pre-trained word embeddings as RBAN.
The baselines are listed as follows in detail:
1) LSTM (Wang et al., 2016). This approach only adopts a standard LSTM network to model the text without considering aspect information.
2) RAM (Chen et al., 2017). This is a state-of-the-art deep memory network approach to ASC.
3) GCAE (Xue and Li, 2018). This is a state-of-the-art approach to ASC which combines CNN and gating mechanisms to learn text representation.
4) S-LSTM (Wang and Lu, 2018). This is a state-of-the-art approach to ASC which considers structural dependencies between targets and opinion terms.
5) BIDAF (Seo et al., 2016). This is a QA matching approach to reading comprehension. We substitute its decoding layer with a softmax decoder to perform ASC-QA.
6) HMN (Shen et al., 2018a). This is a QA matching approach to coarse-grained sentiment classification towards QA style reviews.
7) MAMC (Yin et al., 2017). This is a QA matching approach to ASC which proposes a hierarchical iterative attention to learn the aspect-specific text representation.
8) RBAN w/o RAWS. Our RBAN approach without using RAWS modules.
9) RBAN w/o Q2A. Our RBAN approach without using question-to-answer attention.
10) RBAN w/o A2Q. Our RBAN approach without using answer-to-question attention.

Experimental Results
Table 3 shows the performances of different approaches to ASC-QA. From this table, we can see that all the three state-of-the-art ASC approaches, i.e., RAM, GCAE and S-LSTM, perform better than LSTM. This confirms the usefulness of considering aspect information in ASC. Besides, both the attention based approaches RAM and S-LSTM achieve comparable or better performance than GCAE. This result demonstrates the usefulness of a proper attention mechanism to model aspect information.
The two QA matching approaches, i.e., BIDAF and HMN, achieve comparable performance with the three state-of-the-art ASC approaches, and MAMC even beats all of them. This indicates the appropriateness of treating question and answer in a QA style review as two parallel units instead of a single sequence in ASC-QA.
Furthermore, our RBAN w/o RAWS approach (i.e., without considering aspect information) performs consistently better than MAMC. This encourages employing bidirectional attention to learn the representation vectors of both the question and answer in order to capture the sentiment information therein. Besides, it is interesting to notice that RBAN w/o A2Q (i.e., without question vector $s^q$) performs much better than RBAN w/o Q2A (i.e., without answer vector $s^a$). This is because the main sentiment polarity towards an aspect is usually expressed in the answer text.
In comparison, when using RAWS, RBAN performs best and significantly outperforms RBAN w/o RAWS (p-value < 0.05), which encourages discarding noisy words for a specific aspect on both the question and answer sides. Impressively, in the sub-task of Term-level ASC-QA, compared to LSTM, RBAN achieves average improvements of 7.97% (F1) and 8.67% (Acc.) in three domains. In the sub-task of Category-level ASC-QA, compared to LSTM, RBAN achieves average improvements of 9.1% (F1) and 9.23% (Acc.). Significance tests show that these improvements are all significant (p-value < 0.05). These results encourage incorporating both RAWS and bidirectional attention to tackle ASC-QA.

Analysis and Discussion
Case Study. We provide a qualitative analysis of our approach on the development set. Specifically, in Figure 4, we visualize the attention matrix $S^r$ in RBAN towards aspect "operating speed" in two cases, i.e., not using RAWS (Figure 4(a)) and using RAWS (Figure 4(b)). In Figure 4(a), where the color blue denotes attention weight (the darker the more important), we can find that both aspect "battery life" and aspect "operating speed" in the question have been successfully matched with their corresponding answer phrases, i.e., "very durable" and "quite slow and obtuse". However, RBAN without RAWS cannot discard noisy words (e.g., "battery life", "durable") for aspect "operating speed". In Figure 4(b), where the color white denotes that a word inside the question or answer has been discarded, we can find that RBAN is capable of effectively discarding noisy words such as "battery" and "durable" and highlighting significant words such as "slow" and "obtuse" for aspect "operating speed".
Error Analysis. We randomly analyze 100 error cases in the experiments, which can be roughly categorized into 5 types. 1) 27% of the errors occur because the answer is too short. An example is "Question: Is the screen good? Answer: No.". 2) 24% of the errors are due to negation words. An example is "the case is not good". Our approach fails to select the word "not" and incorrectly predicts positive polarity. This inspires us to optimize our approach so as to capture the negation scope better in the future. 3) 19% of the errors are due to wrong predictions on neutral instances. The shortage of neutral training examples makes the prediction of neutral instances very difficult. 4) 16% of the errors are due to comparative opinions. An example is "macos is much better than Windows". Our approach incorrectly predicts positive for aspect "Windows". 5) Finally, 14% of the errors are due to mistakes during Chinese word segmentation. An example is "好难看 (very ugly)", which is incorrectly segmented into "好 (good) | 难 (hard) | 看 (look)" and predicted as positive. This encourages improving the performance of word segmentation on informal customer reviews.

Related Work
Existing studies on Aspect Sentiment Classification (ASC) could be divided into two groups according to the level of text, i.e., sentence-level ASC and document-level ASC.
Sentence-level ASC is typically regarded as a sentence-level text classification task which aims to incorporate aspect information into a model. Recently, Wang et al. (2016) and Ma et al. (2017) propose attention based LSTM approaches to ASC by exploring the connection between an aspect and the content of a sentence. Tang et al. (2016b) and Chen et al. (2017) employ memory networks to model the context and aspect. Wang and Lu (2018) propose a segmentation attention to capture structural dependency between target and opinion terms.
Document-level ASC aims to predict sentiment ratings for aspects inside a long text. Traditional studies (Titov and McDonald, 2008; Wang et al., 2010; Pontiki et al., 2016) solve document-level ASC as a sub-problem by utilizing heuristic based methods or topic models. Recently, Lei et al. (2016) focus on extracting rationales for aspects in a document. A user-aware attention approach to document-level ASC has also been proposed. Yin et al. (2017) model document-level ASC as a machine comprehension problem, of which the input is also a parallel unit, i.e., question and answer. However, their question texts are pseudo questions that are artificially constructed. This is at odds with the fact that real-world question texts may also involve multiple aspects and sentiment information.
Unlike all the above studies, this paper performs ASC on a different type of text, i.e., QA style reviews. To the best of our knowledge, this is the first attempt to perform ASC on QA style reviews.

Conclusion
In this paper, we propose a new task, i.e., Aspect Sentiment Classification towards Question-Answering (ASC-QA). Specifically, we first build a high-quality human annotated benchmark corpus. Then, we design a reinforced bidirectional attention network (RBAN) approach to address ASC-QA. Empirical studies show that our proposed approach significantly outperforms several state-of-the-art baselines in the task of ASC-QA. In our future work, we would like to solve other challenges in ASC-QA such as data imbalance and negation detection to improve the performance. Furthermore, we would like to explore the effectiveness of our approach to ASC-QA in other languages.