Diversified Multiple Instance Learning for Document-Level Multi-Aspect Sentiment Classification

Neural document-level multi-aspect sentiment classification (DMSC) usually requires large amounts of manual aspect-level sentiment annotations, which are time-consuming and laborious to obtain. As document-level sentiment labels are widely available from online services, it is valuable to perform DMSC with such free document-level annotations. To this end, we propose a novel Diversified Multiple Instance Learning Network (D-MILN), which is able to achieve aspect-level sentiment classification with only document-level weak supervision. Specifically, we connect aspect-level and document-level sentiment by formulating this problem as multiple instance learning, providing a way to learn an aspect-level classifier from the back-propagation of document-level supervision. Two diversified regularizations are further introduced in order to avoid overfitting on document-level signals during training. Diversified textual regularization encourages the classifier to select aspect-relevant snippets, and diversified sentimental regularization prevents the aspect-level sentiments from being overly consistent with the document-level sentiment. Experimental results on the TripAdvisor and BeerAdvocate datasets show that D-MILN remarkably outperforms recent weakly-supervised baselines and is also comparable to the supervised method.


Introduction
Document-level multi-aspect sentiment classification (DMSC) is a fine-grained sentiment analysis task, aiming to predict the sentiments of aspects in a document consisting of several sentences. In previous studies, neural models have been shown to be effective for DMSC with the help of large amounts of aspect-level annotations (Chen et al., 2017; Xue and Li, 2018; Chen and Qian, 2019).

Review: "A great location to stay at. The room was ordinary, the bathroom looked like an after thought, the shower was extremely small. Also the front desk clerks provided minimum service. Besides, the price is very expensive." Ratings: overall 2, room 2, location 4, service 1, value 1.
Figure 1: A review example with sentiment labels.

Despite these advantages, the acquisition of aspect-level sentiment annotations remains a laborious and expensive endeavor. Fortunately, overall document-level sentiment annotations are relatively easy to obtain thanks to the widespread online reviews with overall star ratings. Therefore, it is practically meaningful to perform DMSC with weak supervision from document-level sentiment signals. However, this problem is far from solved. To the best of our knowledge, there is no neural model that is able to achieve DMSC with only document-level signals. Two main challenges need to be settled. First, the granularity of aspect-level sentiment and document-level sentiment is quite different, and it is unclear how to properly model the relation between them in order to transfer knowledge from the document level to the aspect level. Second, the relevant text for each aspect is unobserved. Without any constraint, a vanilla weakly supervised model easily overfits to document-level signals in terms of both sentiment and attended text, even though each aspect often has its own relevant text and a different sentiment (as shown in Figure 1). In that case, no matter whether the given aspect is location, room, service, or value, a vanilla model would pay most attention to the words "great", "ordinary", "small", "minimum" and "expensive", and transfer the negative sentiment from the document level to all aspects. As a result, the sentiment towards location is wrongly learned as negative, when it should be positive.
Accordingly, we propose a Diversified Multiple Instance Learning Network (D-MILN) to achieve DMSC with only document-level sentiment supervision. We formulate this problem as multiple instance learning (MIL; Keeler and Rumelhart, 1991) to model document-level sentiment as a combination of aspect-level sentiments. The aspects are regarded as instances and their sentiment distributions are predicted by an attention-based classifier, while the document is regarded as a bag and its sentiment distribution is computed as a combination of the aspect-level sentiment distributions. Thus, we provide a framework for learning an aspect-level classifier by optimizing the document-level predictions. Meanwhile, in order to avoid overfitting to document-level signals, we further propose two kinds of diversified regularization. Diversified textual regularization guides the aspect-level sentiment classifier to select aspect-relevant snippets. Diversified sentimental regularization controls the variance among aspect-level sentiments. Overall, our contributions are summarized as follows:
• We propose a novel diversified multiple instance learning network, which properly models the relation between aspect-level and document-level sentiment, and thus achieves DMSC with merely document-level supervision.
• Two kinds of diversified regularization are introduced to alleviate the key challenge of overfitting document-level signals and to improve the aspect-level sentiment classification performance.
• Comprehensive experiments are conducted on the BeerAdvocate and TripAdvisor benchmark datasets. The results verify the necessity and advantages of both our framework and diversified regularizations. Meanwhile, our D-MILN outperforms previous weakly supervised methods significantly and is also comparable to the supervised method with thousands of labeled instances per aspect.

Related Work
Document-level multi-aspect sentiment classification In previous studies, DMSC is usually addressed by supervised learning methods (Lei et al., 2016; Yin et al., 2017; Wang et al., 2019), where aspect-level annotations must be provided. However, human annotation of aspect-level sentiment is laborious and expensive; therefore, some research focuses on weakly supervised DMSC. This line of work can be further categorized into knowledge-supervised and document-level supervised methods. As for knowledge-supervised methods, Zeng et al. (2019) propose to use aspect-opinion word pairs as knowledge for supervision. Aspect-level sentiment classification is achieved by accomplishing another relevant objective: predicting an opinion word given an aspect. However, their model heavily depends on the performance of dependency parsing and on manually designed rules. As for document-level supervised methods, Wang et al. (2010, 2011) propose to use the document-level sentiment as supervision, which is similar to ours. Specifically, they propose a probabilistic graphical model for the task, which assumes the overall rating is generated from a weighted sum of the latent aspect ratings. However, this non-neural model adopts bag-of-words representations, which are insufficient for capturing word order and complex semantics. Furthermore, their model fails to consider the problem of overfitting to document-level signals.
Multiple Instance Learning Multiple instance learning is a form of weakly supervised learning where instances are arranged in bags and a label is provided for the entire bag (Keeler and Rumelhart, 1991). Most MIL methods (Zhou et al., 2009; Wei et al., 2014; Pappas and Popescu-Belis, 2017; Haußmann et al., 2017; Tu et al., 2019; Ilse et al., 2018; Wang and Wan, 2018) focus on bag-level performance, and only a few focus on instance-level performance. Apart from the loss defined at the bag level, Kotzias et al. (2015) also introduce a regularization based on instance similarities into the objective function. Peng and Zhang (2019) assign the bag-level label to instances under an i.i.d. assumption and directly define the loss function on the instance-level label predictions. Some works propose to apply MIL to the sentence-level sentiment classification task. Kotzias et al. (2015), Angelidis and Lapata (2018a), Wang and Wan (2018) and Angelidis and Lapata (2018b) propose to train sentence-level sentiment classifiers with document-level annotations. For these works, the content of each instance (i.e., the words in the sentence) is already given. However, for the DMSC task, the relevant text snippets for a given aspect, which are crucial for determining its sentiment, are not provided in advance. This makes the DMSC task quite different and challenging for applying MIL. Besides, these works never consider overfitting to bag-level supervision. To the best of our knowledge, this is the first work to apply MIL to the DMSC task.

Methodology
We first briefly introduce the problem we work on. Given a review, our task is to predict the sentiments of the aspects in the review. Formally, we denote the review document as $d$, which contains $I$ words $\{w_1, w_2, \cdots, w_I\}$, the sentiment label of the document as $l_d$, and the set of $J$ aspects mentioned in the document as $\{a_1, a_2, \cdots, a_J\}$. Same as Yin et al. (2017), each aspect $a_j$ is represented by $K$ aspect-related keywords $\{a_j^1, a_j^2, \cdots, a_j^K\}$, in order to cover most of the semantic meanings of the aspect (see Appendix A.1 for the keywords). Figure 2 shows the architecture of D-MILN, where Figure 2(a) is the entire workflow and Figure 2(b) is the detailed network for aspect and document encoding. First, the aspect-level attention-based classifier predicts a sentiment distribution for every mentioned aspect, denoted $p^{a_1}, p^{a_2}, \cdots, p^{a_J}$. Then, the document-level sentiment distribution $p^d$ is computed as a weighted sum of the aspect-level sentiment distributions. The diversified sentimental regularization shown in Figure 2(a) is applied to the aspect-level sentiment distributions to alleviate overfitting to the document-level sentiment. The diversified textual regularization shown in Figure 2(b) is applied to the attention weights to encourage the aspect-level classifier to select aspect-relevant snippets.

Aspect-level Sentiment Distribution
In this section, we introduce our aspect-level attention-based sentiment classifier.
Aspect encoding We first apply a one-layer MLP on top of the word embedding of each aspect-related keyword $a_j^k$:
$$q_j^k = \sigma(W_q e_j^k + b_q),$$
where $e_j^k$ is the word embedding of $a_j^k$, $W_q$ and $b_q$ are the parameters of the one-layer MLP, and $\sigma$ is a nonlinear activation. Then the final representation of aspect $a_j$ is calculated as $q_j = \sum_k c_k q_j^k$, where $c_k$ encodes the importance of each keyword for the given aspect:
$$c_k = \frac{\exp(w_c^\top q_j^k)}{\sum_{k'} \exp(w_c^\top q_j^{k'})},$$
and $w_c$ is a parameter to learn.
Document encoding We first convert the words in the given document into a sequence of embedding vectors $E = [e_1, e_2, \cdots, e_I]$. Usually, sentiments are expressed through phrases in the document (Fei et al., 2004). For example, "a lovely room" expresses a positive sentiment towards the aspect room. Since one-dimensional convolutional layers can serve as linguistic feature detectors that extract specific n-gram patterns (Kalchbrenner et al., 2014), we apply several one-dimensional convolutional layers on top of the word embeddings and obtain the final contextual features $[h_1, h_2, \cdots, h_I]$ for the input words.
Aspect-specific representations We obtain the aspect-specific representation as a weighted sum of the contextual features:
$$r_j^a = \sum_i \alpha_i^j h_i,$$
where $\alpha_i^j$ encodes the importance of word $w_i$ for determining the sentiment towards aspect $a_j$, and is calculated through an attention mechanism:
$$\alpha_i^j = \frac{\exp(q_j^\top W_a h_i)}{\sum_{i'} \exp(q_j^\top W_a h_{i'})}, \quad (4)$$
where $W_a$ is a bilinear term that captures the relevance between $q_j$ and $h_i$.
Prediction The aspect-specific representation is then used to predict the aspect-level sentiment distribution $p^{a_j}$:
$$p^{a_j} = \mathrm{softmax}(W_p r_j^a + b_p),$$
where $W_p$ and $b_p$ are the parameters of the softmax layer.
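As a concrete illustration, the aspect encoding, bilinear word attention, and prediction steps above can be sketched in NumPy. All shapes, parameter names, and toy inputs here are ours; the paper does not specify the MLP activation, so tanh is an assumption, and plain features stand in for the CNN's contextual output.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aspect_sentiment(H, keyword_emb, params):
    """Forward pass of the aspect-level classifier for one aspect.
    H: (I, d) contextual features h_i (stand-in for the CNN output).
    keyword_emb: (K, d) embeddings e_j^k of the K aspect keywords.
    params: dict holding W_q, b_q, w_c, W_a, W_p, b_p."""
    # Aspect encoding: one-layer MLP per keyword, then keyword attention c_k.
    Q = np.tanh(keyword_emb @ params["W_q"] + params["b_q"])  # (K, d_h)
    c = softmax(Q @ params["w_c"])                            # (K,)
    q = c @ Q                                                 # aspect vector q_j
    # Bilinear word attention (Equation 4) and aspect-specific representation.
    alpha = softmax(H @ params["W_a"] @ q)                    # (I,)
    r = alpha @ H                                             # r_j^a
    # Sentiment prediction over C classes.
    p = softmax(r @ params["W_p"] + params["b_p"])            # (C,)
    return p, alpha
```

The function returns both the class distribution and the attention weights, since the latter are what the diversified textual regularization later constrains.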

Document-level Sentiment Distribution
Since only document-level supervision is provided, we cannot directly use the aspect-level sentiment distribution $p^{a_j}$ for optimization. In order to connect aspect-level sentiment with document-level sentiment, we compute the document-level sentiment distribution as a weighted sum of the aspect-level distributions. Thus, by optimizing the document-level predictions, the parameters of the aspect-level sentiment classifier are learned through back-propagation. Specifically, the document-level distribution is:
$$p^d = \sum_j \beta_j \, p^{a_j},$$
where $\beta_j$ encodes the importance of aspect $a_j$ for determining the sentiment of the overall document.
To obtain $\beta_j$, we first average the aspect representations, $r^d = \frac{1}{J} \sum_j r_j^a$, and then use an attention mechanism to derive $\beta_j$:
$$\beta_j = \frac{\exp\big(v_r^\top \tanh(W_r [r_j^a; r^d] + b_r)\big)}{\sum_{j'} \exp\big(v_r^\top \tanh(W_r [r_{j'}^a; r^d] + b_r)\big)},$$
where $[r_j^a; r^d]$ is the concatenation of $r_j^a$ and $r^d$, and $W_r$, $b_r$ and $v_r$ are the parameters of the attention mechanism. After obtaining the document-level sentiment distribution, we train the model with respect to the document-level sentiment label and, in this way, the aspect-level sentiment classifier is learned through back-propagation.
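A minimal sketch of this MIL aggregation step follows; the dimension choices and toy inputs are ours, and the additive-attention form (tanh scoring) is our reading of the description above.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def document_distribution(P_a, R_a, W_r, b_r, v_r):
    """Combine aspect-level sentiment distributions into the document-level one.
    P_a: (J, C) aspect sentiment distributions p^{a_j}.
    R_a: (J, d) aspect-specific representations r_j^a."""
    r_d = R_a.mean(axis=0)                                       # average aspect repr.
    concat = np.hstack([R_a, np.tile(r_d, (R_a.shape[0], 1))])   # [r_j^a; r^d]
    beta = softmax(np.tanh(concat @ W_r + b_r) @ v_r)            # (J,) aspect weights
    p_d = beta @ P_a                                             # weighted sum
    return p_d, beta
```

Because beta is a softmax and each row of P_a sums to one, p_d is itself a valid distribution, which is what lets the document-level loss back-propagate into the aspect-level classifier.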

Diversified Regularizations
An aspect-level sentiment classifier learned simply in this way suffers from overfitting to the document-level supervision signals. First, given different aspects, the classifier tends to focus on the same snippets, which actually express the document-level sentiment. Second, the predicted aspect-level sentiments tend to be overly consistent with the document-level sentiment.
Diversified Textual Regularization To alleviate the first problem, diversified textual regularization is proposed to encourage the sentiment classifier to select aspect-relevant snippets via distant supervision. The main idea is that the aspect-level classifier should pay more attention to the words that co-occur with the given aspect in the same sentence. Specifically, given an aspect $a_j$, a distantly-labeled word selection vector $s^j$ is leveraged to guide the attention weight vector $\alpha^j$ in Equation 4. To obtain $s^j$, we first initialize the weights of all words in the document to 0. Second, we find the sentences that contain any keyword of the given aspect. Then we set the weights of the words in these sentences to 1. Finally, we normalize the weight vector. The diversified textual regularization is defined as the KL-divergence between $s^j$ and $\alpha^j$:
$$\mathcal{L}_{d\text{-}text} = \sum_j \mathrm{KL}(s^j \,\|\, \alpha^j).$$
Furthermore, there exist sentences that describe multiple aspects. As the parts related to different aspects are non-overlapping in most such sentences, we also apply orthogonal regularization (Lin et al., 2017; Hu et al., 2018) to guide the attention weights at a finer granularity:
$$\mathcal{L}_{orth} = \sum_{j \neq j'} \alpha^j \cdot \alpha^{j'}.$$
Minimizing the dot product between two attention weight vectors forces orthogonality between them, so that different aspects attend to different parts of the sentence with less overlap.
Diversified Sentimental Regularization Given a document, some of its aspects often have sentiments different from the document-level sentiment.
But simply fitting the document-level supervision leads the sentiments of all aspects to be the same as the document-level sentiment. To tackle this problem, we propose diversified sentimental regularization to control the variance among the aspect-level sentiment distributions. The variance is computed as follows:
$$\mathcal{L}_{d\text{-}senti} = \frac{1}{J} \sum_j \Big( p^{a_j}(l_d) - \frac{1}{J} \sum_{j'} p^{a_{j'}}(l_d) \Big)^2, \quad (11)$$
where $p^{a_j}(l_d)$ is the probability of class $l_d$ for aspect $a_j$. By maximizing $\mathcal{L}_{d\text{-}senti}$, the model allows the aspect-level sentiment distributions to differ, so that for some aspects the sentiment can diverge from the document-level sentiment $l_d$. Furthermore, instead of using cross-entropy loss, we propose to leverage a hinge loss to control the degree to which the document-level sentiment distribution $p^d$ fits the ground-truth label $l_d$. The hinge loss is defined as follows:
$$\mathcal{L}_{doc} = \max\big(0, \, t - p^d(l_d)\big), \quad (12)$$
where $p^d(l_d)$ is the probability of the ground-truth label $l_d$, and $t \in (0.5, 1.0]$ is the probabilistic margin, which gives tolerance to diverse aspect-level sentiment distributions.
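The regularization terms above can be sketched as plain functions. This is a toy NumPy illustration under our own assumptions: the KL direction (distant labels as target), the mean-squared-deviation form of the variance, and the tokenized-sentence input format are our reading of the text, not a verbatim implementation.

```python
import numpy as np

def selection_vector(sentences, keywords):
    """Distantly-labeled vector s_j: words in sentences containing an aspect
    keyword get weight 1, others 0, then the vector is normalized.
    `sentences` is a list of token lists covering the whole document."""
    w = []
    for sent in sentences:
        hit = 1.0 if any(k in sent for k in keywords) else 0.0
        w.extend([hit] * len(sent))
    s = np.array(w)
    return s / s.sum() if s.sum() > 0 else np.full_like(s, 1.0 / len(s))

def textual_reg(alpha, s, eps=1e-12):
    """L_{d-text} for one aspect: KL-divergence pulling attention alpha toward s."""
    return float(np.sum(s * (np.log(s + eps) - np.log(alpha + eps))))

def orthogonal_reg(alpha):
    """Sum of dot products between attention vectors of distinct aspects;
    alpha is (J, I), one attention row per aspect."""
    G = alpha @ alpha.T
    return float(G.sum() - np.trace(G))

def sentimental_reg(P_a, l_d):
    """L_{d-senti} (Equation 11): variance of p^{a_j}(l_d) across aspects."""
    p = P_a[:, l_d]
    return float(np.mean((p - p.mean()) ** 2))

def hinge_doc_loss(p_d, l_d, t=0.7):
    """Hinge loss (Equation 12): zero once p_d(l_d) reaches the margin t."""
    return max(0.0, t - p_d[l_d])
```

In D-MILN the textual term guides the attention of Equation 4 and is annealed over training, as described in the next subsection.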

Final Objective Function
The final objective function of D-MILN is a combination of the document-level loss and the diversified regularizations. To minimize clutter, we describe the objective function for a single document:
$$\mathcal{L} = \mathcal{L}_{doc} + \alpha^m \mathcal{L}_{d\text{-}text} + \beta \mathcal{L}_{orth} + \gamma \mathcal{L}_{d\text{-}senti}, \quad (13)$$
where $\alpha$, $\beta$, $\gamma$ are hyper-parameters and $m$ is the number of training steps. In the diversified textual regularization, the distant supervision imposes a relatively "hard" constraint on the attention weights, which may hurt the generalization of D-MILN, so we further introduce a decay factor $\alpha \in (0, 1)$. As the number of training steps $m$ increases, the weight of the diversified textual regularization decreases to zero, allowing the model to achieve better generalization. $\gamma$ controls the sentimental diversity among aspects: for $\gamma < 0$, sentimental diversity is encouraged; for $\gamma > 0$, it is discouraged.
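Assembling the terms of Equation 13 is then a one-liner. This sketch uses the hyper-parameter values reported in the implementation details as defaults and assumes the four loss terms are precomputed scalars.

```python
def dmiln_objective(l_doc, l_dtext, l_orth, l_dsenti, step,
                    alpha=0.999, beta=0.1, gamma=-0.1):
    """Final per-document objective: the textual term is annealed by
    alpha**step, and gamma < 0 turns the variance term into a reward,
    encouraging diversity among aspect-level sentiments."""
    return l_doc + (alpha ** step) * l_dtext + beta * l_orth + gamma * l_dsenti
```

With alpha = 0.999 the textual term's weight halves roughly every 700 steps (0.999^693 ≈ 0.5), so the distant supervision mainly shapes the early phase of training.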

Datasets
We evaluate our model on the TripAdvisor (Wang et al., 2010) and BeerAdvocate (McAuley et al., 2012) benchmark datasets, which contain seven predefined aspects (value, room, location, cleanliness, check-in/front desk, service, and business) and four predefined aspects (feel, look, smell, and taste), respectively. We run the same preprocessing steps as Zeng et al. (2019). The original ratings of the TripAdvisor and BeerAdvocate datasets are converted to binary scales, namely positive or negative; the exploration of fine-grained sentiment classification remains future work. The numbers of reviews with negative and positive overall sentiment are balanced. Table 1 shows the statistics of the two datasets. Both datasets are split into train/development/test sets with proportions 8:1:1. The development set is used to tune the hyper-parameters of all methods. We use accuracy as the evaluation metric. Note that both aspect-level and document-level sentiment annotations are provided in the datasets, but D-MILN only uses the document-level annotations for training.

Implementation Details
We adopt the pre-trained uncased 300-dimensional GloVe word embeddings (Pennington et al., 2014), which are trainable during the training process. In document encoding, we apply three one-dimensional convolutional layers with kernel widths of 3, 5, and 7, respectively. The number of filters is 200 for each convolutional layer. Batch normalization is applied to the output of the convolutional layers. The dimension of all hidden layers is 200. Dropout with rate 0.4 is applied to the embedding layer and to the final representations of aspects and document words. The values of $\alpha$, $\beta$, $\gamma$ in Equation 13 are 0.999, 0.1 and −0.1, respectively, and the probabilistic margin $t$ is 0.7. The batch size is 64. Parameter optimization is performed using Adam (Kingma and Ba, 2014) with a learning rate of 0.001. We run experiments on one Tesla V100 16GB GPU and each epoch takes several minutes. Our model has 438K parameters, not including word embeddings.

Compared Methods
Here, we compare our method with a variety of baselines, which fall into three categories. (1) Weakly supervised baselines, which we use to show the advantage of D-MILN under weak supervision. (2) MIL baselines. We formulate weakly supervised DMSC as MIL for the first time; by comparing with several simple MIL methods, we show the necessity of D-MILN's design. (3) Supervised baselines, against which we analyse the performance gap with supervised methods.

Weakly Supervised Baselines
Assign-O, which directly uses the overall sentiment of a review in the test set as the prediction for its aspects.
LRR (Wang et al., 2010), which is a probabilistic graphical model (non-neural model) that regards the aspect-level sentiments as latent variables and assumes the document-level sentiment is generated based on a weighted sum of the latent aspect sentiments. LRR only requires document-level annotations.
VWS-DMSC (Zeng et al., 2019), the previous state-of-the-art weakly supervised approach for DMSC. VWS-DMSC uses aspect-opinion word pairs as supervision: the sentiment of an aspect is treated as a latent variable and used to predict the opinion word of the given aspect. VWS-DMSC also uses document-level sentiment labels to train a document encoder.

MIL Baselines
Vanilla-MILN, which is derived by removing key components from D-MILN. Specifically, in Vanilla-MILN, the loss function is cross-entropy loss and the diversified regularizations are not applied.
Identity-MILN, which sets the aspect-level sentiment of training data to be identical with document-level labels, and directly trains the aspect-level attention-based sentiment classifier introduced in Section 3.1.
Explicit-MILN, in which the relevant snippets for each aspect are first extracted by the iterative method adopted in Wang et al. (2010), and then a CNN-based text classifier is applied to the extracted snippets to predict the aspect-level sentiment under the MIL framework.

Supervised Baselines
AB-DMSC, which is the attention-based aspect-level sentiment classifier introduced in Section 3.1. We directly train this classifier with the full set of aspect-level sentiment annotations; AB-DMSC thus serves as an upper bound for our model. AB-DMSC-{500, 1000, 2000, 5000}, which is the AB-DMSC model trained with {500, 1000, 2000, 5000} labeled instances per aspect. Since the sampled labeled data may vary between trials, we perform five trials of random sampling and report both the mean and standard deviation of the results.
N-DMSC (Yin et al., 2017), which is the state-of-the-art supervised neural model. N-DMSC is also trained with the full set of aspect-level sentiment annotations.

Table 2 shows the main results. It contains three blocks, corresponding to the three categories of systems. We compare D-MILN with them as follows.

Results and Analysis
(1) Weakly Supervised Baselines. Our model achieves the best performance compared with previous weakly supervised baselines. From Assign-O, we can see that directly transferring the document-level sentiment to aspects gives poor results, showing the difficulty and the necessity of properly modeling the relation between document-level and aspect-level sentiment. Our model outperforms the traditional probabilistic graphical model LRR by a substantial margin, which demonstrates the necessity of using neural networks to capture deep semantic features. Our model also significantly outperforms the previous state of the art, VWS-DMSC. VWS-DMSC relies on extracted aspect-opinion word pairs, but we find that there are no typical opinion words for some aspects in the corpus (e.g., look in BeerAdvocate). Besides, in VWS-DMSC, the document-level supervision is only used to train a document encoder, which ignores the relationship between aspects and documents. As D-MILN relies only on document-level signals, this further confirms that D-MILN properly models the relation between aspect-level and document-level sentiment.
(2) MIL Baselines. D-MILN outperforms all MIL baselines by a substantial margin. Meanwhile, we find that the simple MIL baselines often fail to improve performance over previous work (LRR and VWS-DMSC), showing the difficulty of achieving weakly-supervised DMSC with MIL. Furthermore, from Vanilla-MILN we can conclude that locating aspect-relevant snippets and overcoming overfitting to document-level supervision are the two key challenges for improving the performance of MIL on DMSC. The comparison with Identity-MILN suggests that our method reduces the noise brought by the document-level supervision signals, and the comparison with Explicit-MILN suggests that our method effectively selects aspect-relevant snippets.

Table 3: Accuracies on the two datasets in the ablation study.
(3) Supervised Baselines. We first find that AB-DMSC is comparable with N-DMSC, which demonstrates that our aspect-level sentiment classifier serves as a strong supervised baseline. D-MILN is comparable with AB-DMSC-2000. To analyse the performance gap between D-MILN and AB-DMSC, we conduct a case study (Appendix A.3) that qualitatively evaluates the aspect-level attention-based sentiment classifiers.

Ablation Study
To demonstrate the effectiveness of each component of D-MILN, we conduct an ablation study and list the results in Table 3. "-keywords" means using only the aspect term rather than its keywords to interact with the document. "-hinge loss" means replacing the hinge loss in Equation 12 with cross-entropy loss. "-d-senti" means removing the diversified sentimental regularization. "-d-text" means removing the diversified textual regularization. We can see that extending a single aspect term with a list of aspect-relevant keywords improves the classification performance on both datasets. The orthogonal regularization is much more useful on the TripAdvisor dataset, which indicates that it contains more sentences mentioning multiple aspects. Employing the diversified sentimental regularization alleviates overfitting to document-level signals and thus improves the classification performance. When the diversified textual regularization is removed, the results are much worse than when removing any other component, demonstrating that locating aspect-relevant snippets is crucial for correctly predicting aspect-level sentiments.

Effectiveness of Diversified Textual Regularization
To further demonstrate the effectiveness of the diversified textual regularization, we display the KL-divergence between the attention weight distributions of different aspect pairs in Figure 3. The attention weight distribution, calculated by Equation 4, indicates the importance of each document word to the given aspect. Large KL-divergences indicate that the aspect-level classifier selects distinct snippets for different aspects. For Vanilla-MILN, the KL-divergences are relatively small, which indicates that the model focuses on similar snippets for different aspects. For Vanilla-MILN+d-text, to which the diversified textual regularization is applied, the KL-divergences become larger and are similar to those of AB-DMSC, which is trained with aspect-level annotations and produces the most proper attention weights among the three models. These results indicate that the diversified textual regularization encourages the aspect-level sentiment classifier to select aspect-relevant snippets.
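The diagnostic behind Figure 3 can be reproduced in a few lines; the toy attention vectors below and the averaging over ordered pairs are our own illustrative choices.

```python
import numpy as np

def pairwise_attention_kl(alpha, eps=1e-12):
    """Mean KL-divergence between the attention distributions of all ordered
    aspect pairs; alpha is (J, I), each row a distribution over the I words."""
    J = alpha.shape[0]
    kls = [np.sum(alpha[j] * (np.log(alpha[j] + eps) - np.log(alpha[k] + eps)))
           for j in range(J) for k in range(J) if j != k]
    return float(np.mean(kls))
```

A value near zero means all aspects attend to essentially the same words (the Vanilla-MILN failure mode); larger values mean the aspects have picked distinct snippets.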

Hinge Loss for Diversified Sentimental Regularization
We further demonstrate that hinge loss is more compatible than cross-entropy loss with the diversified sentimental regularization. In Figure 4, we display the variances, calculated by Equation 11, among aspect-level sentiment distributions when different loss functions are adopted. The horizontal axis γ denotes the weight of the diversified sentimental regularization. When γ is 0.0, i.e., the diversified sentimental regularization is not applied, the variance is relatively small for both hinge loss and cross-entropy loss, which indicates that the predicted aspect-level sentiments are overly consistent with the document-level ones. When γ is −0.1, i.e., the diversity of sentiments is encouraged, the variance under hinge loss grows significantly more than under cross-entropy loss, which verifies that applying hinge loss allows the diversity among aspect-level sentiments to be controlled more effectively.

Figure 4: The variances among aspect-level sentiment distributions with different loss functions (BeerAdvocate and TripAdvisor).

Conclusion
In this paper, we propose a diversified multiple instance learning network to achieve DMSC with only document-level supervision. We formulate this problem as multiple instance learning, so as to model the relation between aspect-level and document-level sentiment. In order to guarantee a proper transfer from document-level supervision to aspect-level prediction, we further propose diversified textual regularization and diversified sentimental regularization. Through experiments on two benchmark datasets, we verify that D-MILN properly captures the interaction between the aspect level and the document level, and achieves a new state of the art on weakly supervised DMSC. Detailed comparisons also show the necessity and effectiveness of our diversified regularizations. In the future, we plan to further improve D-MILN with aspect-level annotations and to find an appropriate way to combine D-MILN with pre-training methods (Tian et al., 2020).

Table 4: Aspect-related keywords.

Review: "very unwelcoming staff - downright unfriendly. while the room be lovely, the staff be very unfriendly and discourteous. we be very easygoing people, and experienced traveller. however, the staff be very unwilling to answer basic question (unk airport unk and restaurant recommendation). one woman behind the desk just seem to be angry all the time. while i love barcelona - this hotel experience be very unk to unk. definitely not a service orient hotel."

Figure 5: Case study. The left blocks contain the words selected by AB-DMSC, the right blocks contain the words selected by D-MILN. We display the 20 words with the highest attention weights for each aspect. We manually label the words related to Room (in red) and Stuff (in green).

For BERT-doc, we find that the accuracy declines by more than 10% on both datasets when the aspect-level sentiment classifier is trained with document-level annotations under MIL, even though the classifier is BERT-based. BERT-enc-fix does not outperform D-MILN; we believe this is because the parameters of BERT have not been fine-tuned for the DMSC task. However, when the parameters of BERT are trainable, the performance degrades. By analysing the training loss curves of BERT-enc-fix and BERT-enc-train, depicted in Figure 6, we find that the loss of BERT-enc-train declines rapidly to a very low level, showing that it has overfitted the document-level supervision even though the diversified regularizations are applied. In summary, fine-tuning the parameters of BERT with the document-level annotations under MIL leads to overfitting the document-level sentiment and degrades the performance on aspect-level sentiment prediction. These results also point out a direction for future work: finding a way to effectively utilize pre-trained models with weak supervision.

A.3 Case study
To further analyse the performance gap between AB-DMSC and D-MILN, we conduct a qualitative case study of the learned attention mechanism of the aspect-level sentiment classifier. In Figure 5, the gold sentiment labels for room and stuff are positive and negative, respectively. AB-DMSC predicts both aspects correctly, while D-MILN predicts correctly only on stuff. For room, D-MILN not only picks the words describing it, but also selects the words describing stuff. Unfortunately, the words describing stuff express the opposite sentiment. In this case, the description of room is much shorter than that of stuff, and the only words describing room are surrounded by words describing stuff. Such unbalanced and mixed descriptions remain a challenge for D-MILN.