Multi-Instance Multi-Label Learning Networks for Aspect-Category Sentiment Analysis

Aspect-category sentiment analysis (ACSA) aims to predict sentiment polarities of sentences with respect to given aspect categories. To detect the sentiment toward a particular aspect category in a sentence, most previous methods first generate an aspect category-specific sentence representation for the aspect category, then predict the sentiment polarity based on the representation. These methods ignore the fact that the sentiment of an aspect category mentioned in a sentence is an aggregation of the sentiments of the words indicating the aspect category in the sentence, which leads to suboptimal performance. In this paper, we propose a Multi-Instance Multi-Label Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN), which treats sentences as bags, words as instances, and the words indicating an aspect category as the key instances of the aspect category. Given a sentence and the aspect categories mentioned in the sentence, AC-MIMLLN first predicts the sentiments of the instances, then finds the key instances for the aspect categories, finally obtains the sentiments of the sentence toward the aspect categories by aggregating the key instance sentiments. Experimental results on three public datasets demonstrate the effectiveness of AC-MIMLLN.


Introduction
Sentiment analysis (Pang and Lee, 2008;Liu, 2012) has attracted increasing attention recently. Aspectbased sentiment analysis (ABSA) (Pontiki et al., 2014(Pontiki et al., , 2015(Pontiki et al., , 2016) is a fine-grained sentiment analysis task and includes many subtasks, two of which are aspect category detection (ACD) that detects the aspect categories mentioned in a sentence and * Equal contribution † Corresponding author 1 Data and code are available at https://github.com/l294265421/AC-MIMLLN While it was large and a bit noisy, the drinks were fantastic, and the food was superb.  aspect-category sentiment analysis (ACSA) that predicts the sentiment polarities with respect to the detected aspect categories. Figure 1 shows an example. ACD detects the two aspect categories, ambience and food, and ACSA predicts the negative and positive sentiment toward them respectively. In this work, we focus on ACSA, while ACD as an auxiliary task is used to find the words indicating the aspect categories in sentences for ACSA.
Since a sentence usually contains one or more aspect categories, previous studies have developed various methods for generating aspect categoryspecific sentence representations to detect the sentiment toward a particular aspect category in a sentence. To name a few, attention-based models Cheng et al., 2017;Tay et al., 2018;Hu et al., 2019) allocate the appropriate sentiment words for the given aspect category. Xue and Li (2018) proposed to generate aspect categoryspecific representations based on convolutional neural networks and gating mechanisms. Since aspectrelated information may already be discarded and aspect-irrelevant information may be retained in an aspect independent encoder, some existing methods (Xing et al., 2019;Liang et al., 2019) utilized the given aspect to guide the sentence encoding from arXiv:2010.02656v1 [cs.CL] 6 Oct 2020 scratch. Recently, BERT based models Jiang et al., 2019) have obtained promising performance on the ACSA task. However, these models ignored that the sentiment of an aspect category mentioned in a sentence is an aggregation of the sentiments of the words indicating the aspect category. It leads to suboptimal performance of these models. For the example in Figure 1, both "drinks" and "food" indicate the aspect category food. The sentiment about food is a combination of the sentiments of "drinks" and "food". Note that, words indicating aspect categories not only contain aspect terms explicitly indicating an aspect category but also contain other words implicitly indicating an aspect category . In Figure 1, while "drinks" and "food" are aspect terms explicitly indicating the aspect category food, "large" and "noisy" are not aspect terms implicitly indicating the aspect category ambience.
In this paper, we propose a Multi-Instance Multilabel Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN). AC-MIMLLN explicitly models the fact that the sentiment of an aspect category mentioned in a sentence is an aggregation of the sentiments of the words indicating the aspect category. Specifically, AC-MIMLLN treats sentences as bags, words as instances, and the words indicating an aspect category as the key instances  of the aspect category. Given a bag and the aspect categories mentioned in the bag, AC-MIMLLN first predicts the instance sentiments, then finds the key instances for the aspect categories, finally aggregates the sentiments of the key instances to get the bag-level sentiments of the aspect categories.
Our main contributions can be summarized as follows: • We propose a Multi-Instance Multi-Label Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN). AC-MIMLLN explicitly model the process that the sentiment of an aspect category mentioned in a sentence is obtained by aggregating the sentiments of the words indicating the aspect category.
• To the best of our knowledge, it is the first time to explore multi-instance multi-label learning in aspect-category sentiment analysis.
• Experimental results on three public datasets demonstrate the effectiveness of AC-MIMLLN.

Related Work
Aspect-Category Sentiment Analysis predicts the sentiment polarities with regard to the given aspect categories. Many methods have been developed for this task.  proposed an attention-based LSTM network, which can concentrate on different parts of a sentence when different aspect categories are taken as input. Some new attention-based methods Tay et al., 2018;Hu et al., 2019) allocated more appropriate sentiment words for aspect categories and obtained bertter performance. Ruder et al. (2016) modeled the interdependencies of sentences in a text with a hierarchical bidirectional LSTM. Xue and Li (2018) (Li et al., 2017;Schmitt et al., 2018; were proposed to avoid error propagation, which performed ACD and ACSA jointly. However, all these models mentioned above ignored that the sentiment of an aspect category discussed in a sentence is an aggregation of the sentiments of the words indicating the aspect category. Multi-Instance Multi-Label Learning (MIMLL) (Zhou and Zhang, 2006) deals with problems where a training example is described by multiple instances and associated with multiple class labels. MIMLL has achieved success in various applications due to its advantages on learning with complicated objects, such as image classification (Zhou and Zhang, 2006;Chen et al., 2013), text categorization (Zhang and Zhou, 2008), relation extraction (Surdeanu et al., 2012;Jiang et al., 2016), etc. In ACSA, a sentence contains multiple words (instances) and expresses sentiments to multiple aspect categories (labels), so MIMLL is suitable for ACSA. However, as far as our knowledge, MIMLL has not been explored in ACSA.
Multiple instance learning (MIL) (Keeler and Rumelhart, 1992) is a special case of MIMLL, where a real-world object described by a number of instances is associated with only one class label. Some studies (Kotzias et al., 2015;Angelidis and Lapata, 2018;Pappas and Popescu-Belis, 2014) have applied MIL to sentiment analysis. Angelidis and Lapata (2018) proposed a Multiple Instance Learning Network (MILNET), where the overarching polarity of a text is an aggregation of sentence or elementary discourse unit polarities, weighted by their importance. An attention-based polarity scoring method is used to obtain the importance of segments. Similar to MILNET, our model also uses an attention mechanism to obtain the importance of instances. However, the attention in our model is learned from the ACD task, while the attention in MILNET is learned from the sentiment classification task. Pappas and Popescu-Belis (2014) applied MIL to another subtask of ABSA. They proposed a multiple instance regression (MIR) model to assign sentiment scores to specific aspects of products. However, i) their task is different from ours, and ii) their model is not a neural network.

Model
In this section, we describe how to apply the multiinstance multi-label learning framework to the aspect-category sentiment analysis task. We first introduce the problem formulation, then describe our proposed Multi-Instance Multi-Label Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN).

Problem Formulation
In the ACSA task, there are N predefined aspect categories A = {a 1 , a 2 , ..., a N } and a predefined set of sentiment polarities P = {N eg, N eu, P os} (i.e., Negative, Neutral and Positive respectively). Given a sentence, S = {w 1 , w 2 , ..., w n } and the K aspect categories, The multi-instance multi-label learning assumes that, for the k-th aspect category, p k is an unknown function of the unobserved word-level sentiment distributions. AC-MIMLLN first produces a sentiment distribution p j for each word and then combines these into a sentence-level prediction: In this section, we introduce our proposed Multi-Instance Multi-Label Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN), which is based on the intuitive assumption that the sentiment of an aspect category mentioned in a sentence is an aggregation of the sentiments of the words indicating the aspect category.
In MIMLL, the words indicating an aspect category are called the key instances of the aspect category. Specifically, AC-MIMLLN contains two parts, an attention-based aspect category detection (ACD) classifier and an aspect-category sentiment analysis (ACSA) classifier. Given a sentence, the ACD classifier as an auxiliary task generates the weights of the words for every aspect category. The weights indicate the probabilities of the words being the key instances of aspect categories . The ACSA classifier first predicts the sentiments of the words, then obtains the sentence-level sentiment for each aspect category by combining the corresponding weights and the sentiments of the words. The overall model architecture is illustrated in Figure 2. While the ACD part contains four modules: embedding layer, LSTM layer, attention layer and aspect category prediction layer, the ACSA part also consists of four components: embedding layer, multi-layer Bi-LSTM, word sentiment prediction layer and aspect category sentiment prediction layer. In the ACD task, all aspect categories share the embedding layer and the LSTM layer, and have different attention layers and aspect category prediction layers. In the ACSA task, all aspect categories share the embedding layer, the multi-layer Bi-LSTM, and the word sentiment prediction layer, and have different aspect category sentiment prediction layers.

Input:
The input of our model is a sentence consisting of n words S = {w 1 , w 2 , ..., w n }.
Embedding Layer for ACD: The input of this layer is the sentence. With an embedding matrix W w , the sentence is converted to a sequence of vectors where,W w ∈ R d×|V | , d is the dimension of the word embeddings, and |V | is the vocabulary size. LSTM Layer: When LSTM (Hochreiter and Schmidhuber, 1997) is effective enough, attention mechanisms may not offer effective weight vectors (Wiegreffe and Pinter, 2019). In order to guarantee the effectiveness of the weights offered by attention mechanisms, we use a single-layer single-direction LSTM for ACD. This LSTM layer takes the word embeddings of the ACD task as input, and outputs hidden states H = {h 1 , h 2 , ..., h n }. At each time step i, the hidden state h i is computed by: The size of the hidden state is also set to be d.
Attention Layer: This layer takes the output of the LSTM layer as input, and produce an attention (Yang et al., 2016) weight vector for each predefined aspect category. For the j-th aspect category: where W j ∈ R d×d ,b j ∈ R d ,u j ∈ R d are learnable parameters, and α j ∈ R n is the attention weight vector.
Aspect Category Prediction Layer: We use the weighted hidden state as the sentence representation for ACD prediction. For the j-th category: where W j ∈ R d×1 and b j is a scalar.
Embedding Layer for ACSA: For ease of reference, we use different embedding layers for ACD and ACSA. This embedding layer converts the sentence S to a sequence of vectors Multi-Layer Bi-LSTM: The output of the embedding layer for ACSA are fed into a multi-layer Bidirectional LSTM (Graves et al., 2013) (Bi-LSTM). Each layer takes the output of the previous layer as input. Formally, given the hidden states of the (l − 1)-th layer, where Word Sentiment Prediction Layer: We use the hidden state h L i at the time step i of the L-th layer Bi-LSTM as the representation of the i-th word, and two fully connected layers are used to produce the i-th word sentiment prediction p i : where W 1 ∈ R d×d , W 2 ∈ R d×3 , b 1 ∈ R d , b 2 ∈ R 3 are learnable parameters. Note there is no softmax activation function after the fully connected layer, which lead it difficult to train our model.

Aspect Category Sentiment Prediction Layer:
We obtain the aspect category sentiment predictions by aggregating the word sentiment predictions based on the weights offered by the ACD task. Formally, for the j-th aspect category, its sentiment p j can be computed by: where p j ∈ R 3 , and α i j indicates the weight of the i-th word about the j-th aspect category from the weight vector α j offered by the ACD task.
Loss: For the ACD task 2 , as each prediction is a binary classification problem, the loss function is defined by: For the ACSA task, only the loss of the K aspect categories mentioned in the sentence is included, and the loss function is defined by: We jointly train our model for the two tasks. The parameters in our model are then trained by minimizing the combined loss function: where β is the weight of ACSA loss, λ is the L2 regularization factor and θ contains all parameters of our model. 2 ACD is an auxiliary task. Although AC-MIMLLN performs both ACD and ACSA, the aspect categories it detects (i.e., the results of ACD) are usually ignored in both training stage and testing stage. The reason is that our ACD classifier is simple, it can produce effective attention weights, but may not generate effective predictions for the ACD task. In this paper, we focus on ACSA and only evaluate the performance of AC-MIMLLN on ACSA.

MAMS-ACSA:
Since the test set of Rest14-hard is small, we also adopt the Multi-Aspect Multi-Sentiment dataset for Aspect Category Sentiment Analysis (denoted by MAMS-ACSA). MAMS-ACSA is released by Jiang et al. (2019), all sentences in which contain multiple aspect categories with different sentiment polarities. We select Rest14-hard and MAMS-ACSA that we call hard datasets because most sentences in Rest14 contain only one aspect or multiple aspects with the same sentiment polarity, which makes ACSA degenerate to sentence-level sentiment analysis (Jiang et al., 2019). Rest14-hard and MAMS-ACSA can measure the ability of a model to detect multiple different sentiment polarities in one sentence toward different aspect categories. Statistics of these three datasets are given in Table 1.

Comparison Methods
We compare AC-MIMLLN with various baselines.

Implementation Details
We implement our models in PyTorch (Paszke et al., 2017). We use 300-dimentional word vectors pretrained by GloVe (Pennington et al., 2014) to initialize the word embedding vectors. The batch sizes are set to 32 and 64 for non-BERT models on the Rest14(-hard) dataset and the MAMS-ACSA dataset, respectively, and 16 for BERT-based models. All models are optimized by the Adam optimizer (Kingma and Ba, 2014). The learning rates are set to 0.001 and 0.00002 for non-BERT models and BERT-based models, respectively. We set L = 3, λ = 0.00001 and β = 1. For the ACSA task, we apply a dropout of p = 0.5 after the embedding and Bi-LSTM layers. For AC-MIMLLN-BERT, ACD is trained first then both of ACD and ACSA are trained together. For other models, ACD and ACSA are directly trained jointly. We apply early stopping in training and the patience is 10. We run all models for 5 times and report the average results on the test datasets.

Experimental Results
Experimental results are illustrated in Table 2. According to the experimental results, we can come to the following conclusions. First, AC-MIMLLN outperforms all non-BERT baselines on the Rest14hard dataset and the MAMS-ACSA dataset, which indicates that AC-MIMLLN has better ability to detect multiple different sentiment polarities in one sentence toward different aspect categories. Second, AC-MIMLLN obtains +1.0% higher accuracy than AC-MIMLLN -w/o mil on the Rest14 dataset, +0.8% higher accuracy on the Rest14-hard dataset and +0.8% higher accuracy on the MAMS-ACSA dataset, which shows that the Multiple Instance Learning (MIL) framework is more suitable for

Rest14
Rest14-hard MAMS-ACSA KID(F 1 ) KISC(acc) KID(F 1 ) KISC(acc) KID(F 1 ) KISC (   the ACSA task. Third, AC-MIMLLN-BERT surpasses all BERT-based models on all three datasets, indicating that AC-MIMLLN can achieve better performance by using more powerful sentence encoders for ACSA. In addition, AC-MIMLLN can't outperform As-capsule on Rest14. The main reason is that AC-MIMLLN has poor perfmance on the aspect category misc (the abbreviation for anecdotes/miscellaneous) (see Table 3 and Figure 4 (f)).

Impact of Multi-Task Learning
AC-MIMLLN is a multi-task model, which performs ACD and ACSA simultaneously. Multi-task learning (Caruana, 1997) achieves improved performance by exploiting commonalities and differences across tasks. In this section, we explore the performance of AC-MIMLLN in different multi-task settings on the ACSA task. Specifically, we explore four settings: single-pipeline, single-joint, multipipeline and multi-joint. The "single" means that the ACSA task predicts the sentiment of one aspect category in sentences every time, while the "multi" means that the ACSA task predicts the sentiments of all aspect categories in sentences every time.
The "pipeline" indicates that ACD is trained first, then ACSA is trained, while the "joint" indicates ACD and ACSA are trained jointly. The multi-joint is AC-MIMLLN. Experimental results are shown in Table 5. First, we observe that, multi-* outperform all their counterparts, indicating modeling all aspect categories in sentences simultaneously can improve the per-  formance of the ACSA task. Second, *-joint surpass *-pipeline on the Rest14-hard dataset and the MAMS-ACSA dataset, which shows that training ACD and ACSA jointly can improve the perfomance on hard datasets. Third, *-joint obtain worse perfomance on the Rest14 dataset than *-pipeline. One possible reason is that Rest14 is simple and *-joint have bigger model capacity than *-pipeline and overfit on Rest14.

Impact of Multi-layer Bi-LSTM Depth
In this section, we explore the effect of the number of the Bi-LSTM layers. Experiments results are shown in Figure 3, which also contains the results of AC-MIMLLN-softmax. AC-MIMLLN-softmax is obtained by adding the softmax activation function to the word sentiment prediction layer of AC-MIMLLN. We observe that, when the number of Bi-LSTM layer increases, AC-MIMLLN usually obtains better performance, and AC-MIMLLNsoftmax obtains worse results. It indicates that AC-MIMLLN-softmax is hard to train when its complexity increases, while AC-MIMLLN can achieve better performance by using more powerful sentence encoders for ACSA.

Quality Analysis
In this subsection, we show the advantages of our model and analyze where the error lies in through  some typical examples and estimating the performance of our model detecting the key instances (KID) of the given aspect category and classifying the sentiments of the given key instances (KISC). We annotate the key instances for the aspect categories mentioned in sentences and their sentiment polarities on the test set of the three datasets. Models judge a word as a key instance if the weight of the word is greater than or equal to 0.1. Experimental results are illustrated in Table 4. Case Study Figure 4 visualizes the attention weights and the word sentiment prediction results of four sentences. Figure 4 (a) shows that, our model accurately finds the key instances "expensive" for the aspect category price and "food" for food, and assigns correct sentiments to both the aspect categories and the key instances. Compared with previous models, which generate aspect category-specific sentence representations for the ACSA task directly (e.g. BERT-pair-QA-B) or based on aspect category-related sentiment words (e.g. As-capsule), our model is more interpretable.
In Figure 4, (b) and (c) show that, both AC-MIMLLN and AC-MIMLLN-Affine can correctly predict the sentiments of the aspect categories, food and service. While AC-MIMLLN-Affine accurately find the key instance "service" for service, AC-MIMLLN assigns weights to all the words in the text snippet "service was dreadful!". This is because the LSTM-based ACD model in AC-MIMLLN can select useful words for both ACD and ACSA based on the context, which results in better performance (see Table 2). This also can explain why AC-MIMLLN has worse performance on detecting the key instances of the given aspect category than AC-MIMLLN-Affine (see Table 4).
Error Analysis In Figure 4 (d), the sentiments toward "drinks" and "dessert" (key instances of the aspect category food) should be neutral, however AC-MIMLLN assigns negative sentiment to "drinks" and positive sentiment to "dessert". Figure 4 (e) shows AC-MIMLLN-BERT also assigns wrong sentiments to "drinks" and "dessert". Table 4 shows that although AC-MIMLLN-BERT significantly improve the performance of KISC, it's results are also less than 80% on the Rest14-hard dataset and the MAMS-ACSA dataset.
In Figure 4 (f), AC-MIMLLN wrongly predict the sentiment of the aspect category misc, because it finds the wrong key instances for misc. Compared to other aspect categories, it's harder to decide which words are the key instances of misc for AC-MIMLLN, resulting in poor performance of AC-MIMLLN on the aspect category misc. Figure 4 (g) shows AC-MIMLLN-BERT correctly predict the sentiments of the aspect category misc, but also finds the wrong key instances for misc. Table 4 shows that all results on KID are less than 75%.

Conclusion
In this paper, we propose a Multi-Instance Multi-Label Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN). AC-MIMLLN predicts the sentiment of an aspect category mentioned in a sentence by aggregating the sentiments of the words indicating the aspect category in the sentence. Experimental results demonstrate the effectiveness of AC-MIMLLN. Since AC-MIMLLN finds the key instances for the given aspect category and predicts the sentiments of the key instances, it is more interpretable. In some sentences, phrases or clauses rather than words indicate the given aspect category, future work could consider multi-grained instances, including words, phrases and clauses. Since directly finding the key instances for some aspect categories is ineffective, we will try to first recognize all opinion snippets in a sentence, then assign these snippets to the aspect categories mentioned in the sentence.