A Multi-Task Incremental Learning Framework with Category Name Embedding for Aspect-Category Sentiment Analysis

(T)ACSA tasks, including aspect-category sentiment analysis (ACSA) and targeted aspect-category sentiment analysis (TACSA), aim at identifying sentiment polarity on predefined categories. Incremental learning on new categories is necessary for real (T)ACSA applications. Though current multi-task learning models achieve good performance on (T)ACSA tasks, they suffer from catastrophic forgetting in (T)ACSA incremental learning tasks. In this paper, to make multi-task learning feasible for incremental learning, we propose the Category Name Embedding network (CNE-net). We share both the encoder and the decoder among all categories to weaken the catastrophic forgetting problem. Besides the original input sentence, we apply another input feature, i.e., the category name, for task discrimination. Our model achieves state-of-the-art results on two (T)ACSA benchmark datasets. Furthermore, we propose a dataset for (T)ACSA incremental learning, on which our model achieves the best performance compared with other strong baselines.


Introduction
Sentiment analysis has become an increasingly popular natural language processing (NLP) task in academia and industry. It provides realtime feedback on consumer experience and their needs, which helps producers to offer better services. To deal with the presence of multiple categories in one document, (T)ACSA tasks, including aspect-category sentiment analysis (ACSA) and targeted aspect-category sentiment analysis (TACSA), were introduced.
The main purpose of the ACSA task is to identify the sentiment polarity (i.e., positive, neutral, negative and none) of an input sentence with respect to specific predefined categories (Mohammad et al., 2018;Wu et al., 2018). For example, as shown in Table 1, given the input sentence "Food is always fresh and hot-ready to eat, but it is too expensive." and the predefined categories {food, service, price, ambience and anecdotes/miscellaneous}, the sentiment of category food is positive, the polarity of category price is negative, and the polarity is none for the others. In this task, models should capture both explicit and implicit expressions. For example, the phrase "too expensive" indicates negative polarity in the price category without directly mentioning "price".
To deal with ACSA involving both multiple categories and multiple targets, the TACSA task was introduced (Saeidi et al., 2016) to analyze sentiment polarity on a set of predefined target-category pairs. An example is shown in Table 1: given the targets "restaurant-1" and "restaurant-2" and the sentence "I like restaurant-1 because it's cheap, but restaurant-2 is too expensive", the category price is positive for target "restaurant-1" and negative for target "restaurant-2", while the polarity is none for all other target-category pairs. A mathematical definition of (T)ACSA is as follows: given a sentence s as input, a predefined set of targets T and a predefined set of aspect categories A, a model predicts the sentiment polarity y for each target-category pair {(t, a) : t ∈ T, a ∈ A}. For the ACSA task, there is only one target t in all (t, a) pairs. In this paper, to simplify the expression in TACSA, we use "predefined categories" as shorthand for predefined target-category pairs.
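The definition above can be illustrated with a minimal sketch that enumerates every target-category pair and assigns "none" to each pair without an annotation. The function names and the second category ("service") are hypothetical and chosen only for illustration; the targets and the price labels follow the restaurant example.

```python
from itertools import product

# Targets come from the restaurant example; the category list is a
# hypothetical choice for illustration (only "price" appears in the example).
TARGETS = ["restaurant-1", "restaurant-2"]
CATEGORIES = ["price", "service"]

def enumerate_pairs(targets, categories):
    """Return every (t, a) pair with t in T and a in A."""
    return list(product(targets, categories))

def label_sentence(annotated, targets, categories):
    """Assign a polarity to every pair, using 'none' where nothing is annotated."""
    return {pair: annotated.get(pair, "none")
            for pair in enumerate_pairs(targets, categories)}

# Annotated polarities for the example sentence.
annotated = {
    ("restaurant-1", "price"): "positive",
    ("restaurant-2", "price"): "negative",
}
labels = label_sentence(annotated, TARGETS, CATEGORIES)
```

With a single target, the same structure reduces to the ACSA setting.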
Multi-task learning, with a shared encoder but individual decoders for each category, is an approach to analyzing all the categories in one sample simultaneously for (T)ACSA (Akhtar et al., 2018;Schmitt et al., 2018). Compared with single-task approaches (Liang et al., 2019), multi-task approaches utilize category-specific knowledge in the training signals from each task and achieve better performance. However, current multi-task models

arXiv:2010.02784v1 [cs.CL] 6 Oct 2020

Table 1
Task | Sentence | Labels
ACSA | Food is always fresh and hot-ready to eat, but it is too expensive | (food, positive), (service, none), (price, negative), (ambience, none), (anecdotes/miscellaneous, none)
TACSA | I like restaurant-1 because it's cheap, but restaurant-2 is too expensive. |
On the other hand, the predefined categories in (T)ACSA tasks make it inflexible to apply a model to new categories, since in (T)ACSA applications the number of categories may vary over time. For example, fuel consumption, price level, engine power, space and so on are source categories to be analyzed in the gasoline automotive domain. In the electromotive domain, the source categories of the gasoline automotive domain are still used, while a new target category such as battery duration should also be analyzed. Incremental learning is a way to solve this problem. Therefore, it is necessary to propose an incremental learning task and an incremental learning model that handle new categories for (T)ACSA tasks.
Unfortunately, in current multi-task learning (T)ACSA models, the encoder is shared but the decoders for each category are individual. With this parameter sharing mechanism, only the shared encoder and the target-category decoders are fine-tuned during the fine-tuning process, while the decoders of the source categories remain unchanged. The mismatch between the fine-tuned encoder and the original source-category decoders may cause catastrophic forgetting in the source categories. In real applications, high accuracy is expected in both source categories and target categories. Previous research usually ties the decoders of different tasks together by mean regularization (Evgeniou and Pontil, 2004). Building on this, we go one step further and make the decoders identical by sharing them across all categories, in order to reduce catastrophic forgetting. This raises another question: how can each category be identified in a network whose encoder and decoder are both shared? In our approach, we solve this category discrimination problem with the input category name feature.
In this paper, we propose a multi-task category name embedding network (CNE-net). The multi-task learning framework makes full use of the training signals from all categories. To make it feasible for incremental learning, both the encoder and the decoder are shared among all categories. The category names are applied as another input feature for task discrimination. We also present a new task for (T)ACSA incremental learning. In particular, our contribution is threefold:
(1) We propose a multi-task CNE-net framework with both encoder and decoder shared to weaken the catastrophic forgetting problem in multi-task learning (T)ACSA models.
(3) We propose a new task for incremental learning in (T)ACSA. By sharing both the encoder layers and the decoder layers of all the tasks, we achieve better results than other baselines in both the source categories and the target category.

Related Work

Aspect-category Sentiment Analysis
The (T)ACSA task is to predict sentiment polarity on a set of predefined categories. It is able to analyze sentiment in an end-to-end way with explicit or implicit expressions (Mohammad et al., 2018;Wu et al., 2018). The earliest works mostly focused on feature engineering (Zirn et al., 2011;Wiebe, 2012;Wagner et al., 2014). Subsequently, Nguyen and Shirai (2015); Wang et al. (2017); Meisheri and Khadilkar (2018) applied neural network models to achieve higher accuracy. Ma et al. (2018) then introduced commonsense knowledge as additional features. Current approaches consist of multi-task models (Akhtar et al., 2018;Schmitt et al., 2018), which analyze all the categories in one sample simultaneously to make full use of all the features and labels in the training sample, and single-task models, which treat one category in one sample at a time (Jiang et al., 2019).

Multi-Task Learning
Multi-task learning (MTL) utilizes all the related tasks by sharing the commonalities while learning individual features for each sub-task. MTL has proven effective in many NLP tasks, such as information retrieval (Liu et al., 2015), machine translation (Dong et al., 2015), and semantic role labeling (Collobert and Weston, 2008). For the ACSA task, Schmitt et al. (2018) applied an MTL framework with a shared LSTM encoder and individual decoder classifiers for each category. Hu et al. (2019) handled the multiple aspects in MTL with constrained attention networks using orthogonal and sparse regularization.

Incremental Learning
Incremental learning is motivated by the need to add new abilities to a model without retraining the entire model. For example, Doan and Kalita (2016) presented several random forest models to perform sentiment analysis on customers' reviews. Many domain adaptation approaches utilizing transfer learning suffer from the "catastrophic forgetting" problem (French and Chater, 2002). To solve this problem, Rosenfeld and Tsotsos (2017) proposed an incremental learning Deep-Adaption-Network that constrains newly learned filters to be linear combinations of existing ones.
To the best of our knowledge, little research on (T)ACSA has addressed incremental learning on new categories. In this paper, we propose a (T)ACSA incremental learning task and the CNE-net model, which solves this problem with a multi-task learning approach using a shared encoder and a shared decoder. We also apply category names for task discrimination.

Datasets
This section describes the benchmark datasets we used to evaluate our model, the incremental learning task definition, the methodology to prepare the incremental learning dataset, and the evaluation metric.
The ACSA task was evaluated on SemEval-2014 Task4, a dataset on restaurant reviews. Our model provides a joint solution for sub-task 3 (Aspect Category Detection) and sub-task 4 (Aspect Category Sentiment Analysis). The sentiment polarities are y ∈ Y = {positive, neutral, negative, conflict and none}, and the categories are a ∈ A = {food, service, price, ambience and anecdotes/miscellaneous}. The conflict label indicates both positive and negative sentiment is expressed in one category (Pontiki et al., 2014).
The TACSA task was evaluated on the Sentihood dataset, which describes locations or neighborhoods of London and was collected from question answering platform of Yahoo. The sentiment polarities are y ∈ Y = {positive, negative and none}, the targets are t ∈ T = {Location1, and Location2}, and the aspect categories are a ∈ A = {general, price, transit-location, and safety}.

Evaluation Transfer Learning Datasets
Besides evaluating the model on existing (T)ACSA tasks, we also propose incremental learning tasks for (T)ACSA on a new category, based on the SemEval-2014 Task4 and Sentihood datasets, respectively.
Firstly, we split the categories into source categories and target categories. For the ACSA task, the source categories are {food, price, ambience and anecdotes/miscellaneous}, while the target category is {service}. For the TACSA task, the source categories are {general, transit-location, and safety}, while the target category is {price}. This choice was based on the amount of data with positive/negative/neutral polarity in the category, as well as the relevance of the category to real applications.

Table 2
Origin ACSA sample
{"text": "The only thing more wonderful than the food is the service.", "sentiment": {"food": "Positive", "service": "Positive", "price": None, "ambience": None, "anecdotes/miscellaneous": None } }
ACSA Sample-Source
{"text": "The only thing more wonderful than the food is the service.", "sentiment": {"food": "Positive", "price": None, "ambience": None, "anecdotes/miscellaneous": None } }
ACSA Sample-Target
{"text": "The only thing more wonderful than the food is the service.", "sentiment": {"service": "Positive" } }

Secondly, we prepare the training, validation and testing data for the incremental learning task by independently splitting the origin training, validation and test data into source-category data (Sample-Source), which contains labels only for source categories, and target-category data (Sample-Target), which contains the target-category label only. For example, as shown in Table 2, in the ACSA task, the origin labels {food: positive, service: positive, price: none, ambience: none, anecdotes/miscellaneous: none} are transformed to {food: positive, price: none, ambience: none, anecdotes/miscellaneous: none} in Sample-Source and {service: positive} in Sample-Target. The input sentences are kept the same as in the origin dataset. So that other researchers can quantitatively investigate the influence of the amount of target-category training data, we also created incremental learning data by combining all of Sample-Source with a sampled subset of Sample-Target. The sampling rate ranges from 0.0 to 1.0.
In this paper, the ACSA incremental learning dataset is created from SemEval14-Task ACSA dataset, and it is called SemEval14-Task-inc. The TACSA incremental learning dataset is created from Sentihood TACSA dataset, and it is called Sentihood-inc.
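The split described above can be sketched as follows, under stated assumptions: the function names `split_sample` and `build_incremental_data` are hypothetical, the sample layout follows the Table 2 example, and uniform random sampling with a fixed seed is an assumption, since the text does not state how the Sample-Target subset is drawn.

```python
import random

# Category split for the ACSA task, as stated in the text.
SOURCE_CATEGORIES = ["food", "price", "ambience", "anecdotes/miscellaneous"]
TARGET_CATEGORIES = ["service"]

def split_sample(sample, source_categories, target_categories):
    """Split one origin sample into its Sample-Source and Sample-Target views."""
    sentiment = sample["sentiment"]
    source = {"text": sample["text"],
              "sentiment": {c: sentiment[c] for c in source_categories}}
    target = {"text": sample["text"],
              "sentiment": {c: sentiment[c] for c in target_categories}}
    return source, target

def build_incremental_data(samples, rate, seed=0):
    """Combine all Sample-Source data with a `rate` fraction of Sample-Target data."""
    pairs = [split_sample(s, SOURCE_CATEGORIES, TARGET_CATEGORIES) for s in samples]
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    rng = random.Random(seed)  # assumption: seeded uniform sampling
    return sources + rng.sample(targets, round(rate * len(targets)))

origin = {
    "text": "The only thing more wonderful than the food is the service.",
    "sentiment": {"food": "Positive", "service": "Positive", "price": None,
                  "ambience": None, "anecdotes/miscellaneous": None},
}
sample_source, sample_target = split_sample(origin, SOURCE_CATEGORIES, TARGET_CATEGORIES)
```

A rate of 0.0 yields Sample-Source only, and a rate of 1.0 yields all of Sample-Source plus all of Sample-Target.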

Evaluation Metrics
We evaluated aspect category extraction (determining whether the sentiment is none for each category) and sentiment analysis (predicting the sentiment polarity) on the two datasets. For aspect category extraction, we use 1 − p as the probability that a category is not none, where p is the probability of the "none" class in this category. The evaluation metrics are the same as in Sun et al. (2019). For the origin SemEval-14 Task4 dataset, we use Micro-F1 for category extraction and accuracy for sentiment analysis. For the origin Sentihood dataset, we use Macro-F1, strict accuracy, and area under the curve (AUC) for category extraction, and AUC and strict accuracy for sentiment analysis. When evaluating the incremental learning task, we use the F1 metric (Micro-F1 for SemEval-14 and Macro-F1 for Sentihood) for category extraction and accuracy for sentiment analysis.
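As a minimal sketch of the extraction score, the snippet below computes 1 − p per category and a Micro-F1 over the resulting binary decisions. The 0.5 decision threshold, the class ordering and the toy probabilities are assumptions made for illustration; the text does not specify how the not-none probability is turned into a decision.

```python
def not_none_probability(probs, none_index):
    """Extraction score for one category: 1 - p, with p the 'none' probability."""
    return 1.0 - probs[none_index]

def micro_f1(gold, pred):
    """Micro-averaged F1 over flat lists of binary 'category present' decisions."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

THRESHOLD = 0.5  # assumption: not stated in the text
NONE_INDEX = 4   # assumed class order: positive, neutral, negative, conflict, none

# Toy class probabilities for three categories of one sentence.
probs_per_category = [
    [0.7, 0.1, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0, 0.8],
    [0.2, 0.1, 0.3, 0.0, 0.4],
]
gold_present = [True, False, False]
pred_present = [not_none_probability(p, NONE_INDEX) > THRESHOLD
                for p in probs_per_category]
score = micro_f1(gold_present, pred_present)
```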

Approach
In this section, we describe the architecture of CNE-net for (T)ACSA task. In BERT classification tasks, the typical approach is feeding sentence "[CLS]tokens in sentence[SEP]" into the model, while the token "[CLS]" is used as a feature for classification. In order to encode category names into BERT model, as well as analyze sentiment polarity of all the categories simultaneously, we made two significant differences from the original BERT, one on the encoder module and another on the decoder module.

Encoder with Category Name Embedding
In order to get a better category name embedding, and to make incremental learning across categories feasible, the category names are encoded into the model, as shown in Figure 1. In the ACSA task, the category names are "{food, service, price, ambience, and anecdotes/miscellaneous}", while in the TACSA task, the category names are "{location-1 general, location-1 price, location-1 transit-location, location-1 safety, location-2 general, location-2 price, location-2 transit-location, and location-2 safety}". We denote the output states of the BERT encoder as follows: the hidden state of [CLS] $h_{[CLS]} \in \mathbb{R}^{d}$, the hidden states of the words in the origin sentence $H_{sent} \in \mathbb{R}^{L_{sent} \times d}$, the hidden states of the separators $H_{[SEP]} \in \mathbb{R}^{n_{cat} \times d}$, and the hidden states of the category words $H_{cat-i} \in \mathbb{R}^{L_{cat-i} \times d}$ for the i-th category ($0 < i \le n_{cat}$), where $L_{sent}$ is the length of the input sentence, $d$ is the dimension of the hidden states, $n_{cat}$ is the number of categories fed into the model, and $L_{cat-i}$ is the length of the i-th category's input words.

Figure 1: CNE-net model architecture
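The following sketch shows one way the sentence and the category names could be laid out in a single input sequence, and how the positions of the hidden states named above could be recorded. The exact token layout is not given in the text: placing one [SEP] before each category name is an assumption chosen so that there are n_cat separators, and whitespace splitting stands in for BERT's WordPiece tokenizer. The function name `build_input` is hypothetical.

```python
def build_input(sentence_tokens, category_names):
    """Lay out: [CLS] sentence, then one '[SEP] category words' group per category.

    Returns the token list and the index spans of each hidden-state group
    (assumed layout, for illustration only).
    """
    tokens = ["[CLS]"] + list(sentence_tokens)
    spans = {"cls": 0, "sent": (1, len(tokens)), "sep": [], "cat": []}
    for name in category_names:
        spans["sep"].append(len(tokens))          # position of h_[SEP-i]
        tokens.append("[SEP]")
        start = len(tokens)
        tokens.extend(name.split())               # category words for H_cat-i
        spans["cat"].append((start, len(tokens)))
    return tokens, spans

tokens, spans = build_input("food is fresh".split(), ["food", "location-1 price"])
```

The spans index rows of the encoder output, so `spans["cat"][i]` selects the rows that form the i-th category's hidden states.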

Multi-Task Decoders
We propose three types of decoder for the (T)ACSA task, shown as parts (1), (2) and (3) of Figure 1. These decoders are multi-label classifiers, which apply a softmax classifier for sentiment analysis in each category. In Type 1, CNE-net-SEP, shown in part (1) of Figure 1, the hidden state of the separator token $h_{[SEP-i]}$ is applied directly as the feature representation for sentiment polarity analysis in each category. The probability of each polarity in category i is calculated as follows:

$$f_i = W_i^{\top} h + b_i, \qquad p_i = \mathrm{softmax}(f_i), \qquad h = h_{[SEP-i]} \tag{1}$$

where $f_i \in \mathbb{R}^{s}$ is the output logits for category i, $p_i \in \mathbb{R}^{s}$ is the output probability for category i, $W_i \in \mathbb{R}^{d \times s}$ and $b_i \in \mathbb{R}^{s}$ are randomly initialized parameters to be trained, and $s$ is the number of sentiment classes: $s = 5$ for {positive, neutral, negative, conflict and none} in SemEval14-Task4, and $s = 3$ for {positive, negative and none} in the Sentihood dataset. In our approach, $W_1 = W_2 = \dots = W_{n_{cat}}$ and $b_1 = b_2 = \dots = b_{n_{cat}}$.
In Type 2, CNE-net-CLS-att., in order to get a content-aware category embedding vector, we apply an attention mechanism in which $h_{[CLS]}$ serves as the query vector and $H_{cat-i}$ serves as both the key and the value matrix, as shown in part (2) of Figure 1. The output of this attention is the category embedding vector $e_{cat-i}$ for the i-th category. The probability for category i in Type 2 is calculated following equation (1) with $h = e_{cat-i}$.
Type 3, CNE-net-SEP-sent.-att., applies attention mechanisms to both the sentence embedding and the category name embedding, as shown in part (3) of Figure 1. Firstly, the sentence vector correlated with the i-th category, $h_{sent-i}$, is calculated by attention with the separator embedding $h_{[SEP-i]}$ serving as the query and the sentence embedding $H_{sent}$ serving as the key and value matrix. Secondly, similar to Type 2, the category embedding vector $e_{cat-i}$ for the i-th category is calculated by attention, with $h_{sent-i}$ serving as the query and $H_{cat-i}$ serving as the key and value matrix. The probability for category i in Type 3 is calculated following equation (1) with $h = e_{cat-i}$.
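The three decoder types can be compared side by side in a small pure-Python sketch. It assumes unscaled dot-product attention, which is a stand-in because the exact attention formula is not recoverable from the text, and it uses random vectors in place of BERT hidden states. All function and variable names are hypothetical. The single `W` and `b` passed to every category reflect the shared-decoder setting.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, rows):
    """Dot-product attention; `rows` serves as both keys and values.
    Assumption: unscaled dot-product scores."""
    weights = softmax([sum(q * r for q, r in zip(query, row)) for row in rows])
    return [sum(w * row[j] for w, row in zip(weights, rows))
            for j in range(len(query))]

def classify(h, W, b):
    """Softmax classifier: logits f = W^T h + b, with W of shape d x s."""
    logits = [sum(h[k] * W[k][c] for k in range(len(h))) + b[c]
              for c in range(len(b))]
    return softmax(logits)

def decode(kind, h_cls, H_sent, H_sep, H_cats, W, b):
    """Return one probability vector per category, sharing W and b across them."""
    outputs = []
    for i, H_cat in enumerate(H_cats):
        if kind == "SEP":                      # Type 1: separator state directly
            h = H_sep[i]
        elif kind == "CLS-att":                # Type 2: [CLS] attends to category words
            h = attend(h_cls, H_cat)
        elif kind == "SEP-sent-att":           # Type 3: two attention steps
            h_sent_i = attend(H_sep[i], H_sent)
            h = attend(h_sent_i, H_cat)
        else:
            raise ValueError(f"unknown decoder type: {kind}")
        outputs.append(classify(h, W, b))
    return outputs

# Toy hidden states standing in for BERT outputs (random, illustration only).
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d, s = 8, 5
H_sent = rand_matrix(6, d)
H_cats = [rand_matrix(n, d) for n in (1, 1, 2)]   # category name lengths
H_sep = rand_matrix(len(H_cats), d)
h_cls = rand_matrix(1, d)[0]
W, b = rand_matrix(d, s), [0.0] * s
probs = {kind: decode(kind, h_cls, H_sent, H_sep, H_cats, W, b)
         for kind in ("SEP", "CLS-att", "SEP-sent-att")}
```

Because the classifier parameters are not indexed by category, adding a category only lengthens `H_cats` and `H_sep`; no new decoder weights are needed in this sketch.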

Model Training
The CNE-net multi-task framework is trained end-to-end by minimizing the sum of the cross-entropy losses of all the categories. We employ L2 regularization to ease over-fitting. The loss function is given as follows:

$$L(\theta) = -\sum_{x \in D} \sum_{i=1}^{N} y_i^{\top} \log p_i(x; \theta) + \lambda \lVert \theta \rVert_2^2$$

where $D$ is the training dataset, $N$ is the number of categories, $Y$ = {positive, neutral, negative, conflict, none} is the set of sentiment classes (neutral and conflict are not included in the TACSA task), $y_i \in \mathbb{R}^{|Y|}$ is the one-hot label vector for the i-th category with the true label marked as 1 and the others marked as 0, $p_i(x; \theta)$ is the probability vector for the i-th category, and $\lambda$ is the L2 regularization weight. Besides L2 regularization, we also employ dropout and early stopping to ease over-fitting. When training incremental learning models, we follow the workflow of the incremental learning application. We first train a source-category model with the Sample-Source training data. We then fine-tune the source-category model with the Sample-Target training data to obtain the incremental learning model.
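A minimal sketch of the training objective described above: the sum of per-category cross-entropy terms plus an L2 penalty. The unnormalized sum over samples and the penalty form λ·Σθ² are assumptions; the function names and the toy values are hypothetical.

```python
import math

def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy between a one-hot label vector and a probability vector."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))

def multitask_loss(batch, params, l2_weight):
    """Sum of cross-entropy over all samples and categories, plus L2 penalty.

    batch: list of (labels, probs) pairs, where labels and probs each hold one
    vector per category. params: flat list of model parameters (toy stand-in).
    """
    data_term = sum(cross_entropy(y_i, p_i)
                    for labels, probs in batch
                    for y_i, p_i in zip(labels, probs))
    l2_term = l2_weight * sum(w * w for w in params)
    return data_term + l2_term

# One sample with two categories and three sentiment classes.
batch = [(
    [[1, 0, 0], [0, 0, 1]],                       # one-hot labels per category
    [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]],       # predicted probabilities
)]
loss = multitask_loss(batch, params=[1.0, 2.0], l2_weight=0.1)
```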

Experiment Settings
The pretrained uncased BERT-base model is used as the encoder in CNE-net. The number of Transformer blocks is 12, the number of self-attention heads is 12, and the hidden size of each self-attention head is 64. The total number of parameters in the BERT encoder is about 110M. During training, the dropout ratio is 0.1, the number of training epochs is 10, and the learning rate is 5e-5 with a warm-up ratio of 0.25.
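The settings above can be collected in a short sketch. The hidden size follows from 12 heads of size 64. The schedule shape is an assumption: the text gives only the peak learning rate and the warm-up ratio, so linear warm-up followed by linear decay to zero is used here as a common choice, not as the authors' confirmed schedule.

```python
NUM_LAYERS = 12
NUM_HEADS = 12
HEAD_SIZE = 64
HIDDEN_SIZE = NUM_HEADS * HEAD_SIZE   # 12 * 64 = 768
DROPOUT = 0.1
EPOCHS = 10
PEAK_LR = 5e-5
WARMUP_RATIO = 0.25

def learning_rate(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Learning rate at a few points of a hypothetical 1000-step run.
schedule = [learning_rate(step, 1000) for step in (0, 125, 250, 625, 1000)]
```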

Compared Methods
We compare the performance of our model with some state-of-the-art models.
• NRC-Canada (Kiritchenko et al., 2014): several binary one-vs-all SVM classifiers for this multi-class multi-label classification problem.
• AT-LSTM and ATAE-LSTM (Wang et al., 2016): an LSTM attention framework with aspect word embeddings concatenated with sentence word embeddings.
• Dmu-Entnet: a model with a delayed memory update mechanism to track different targets.
• Recurrent Entity Network (REN) (Ye and Li, 2020): a recurrent entity memory network that employs both word-level information and sentence-level hidden memory for entity state tracking.
In the TACSA task, besides these models, we also compare our model with the BERT-pair-QA-B model and the MTL model mentioned in the ACSA comparison methods.

Main Results
The performances of the compared methods and the three types of CNE-net are shown in Table 3 (ACSA task) and Table 4 (TACSA task). All the models with a BERT encoder (QA-B, MTL and our CNE-net) achieve better performance than the models without a BERT encoder (XRCE, NRC-Canada, AT-LSTM, ATAE-LSTM, SenticLSTM, Dmu-Entnet, and REN). Our CNE-net performs better than the QA-B and MTL frameworks in both ACSA and TACSA tasks. QA-B is a single-task approach, in which each category is trained independently. Our CNE-net is a multi-task learning framework. It performs better than QA-B by using shared semantic features and sentiment labels across all the categories. CNE-net also performs better than the MTL model since it encodes the category names as additional features to generate the representation of each category.

Table 3 (ACSA task). Columns: Model; Category Extraction (P, R, F); Sentiment Analysis (binary, 3-way, 4-way).
XRCE (Brun et al., 2014): P 83.23, R 81.37, F 82.29; binary -, 3-way -, 4-way 78.1
Our CNE-net-SEP-sent.-att. model achieves state-of-the-art results on all the evaluation metrics on both the SemEval14-Task4 and Sentihood datasets. The extraction F1 improves by 0.0080 on SemEval14-Task4 (from 0.9147 with QA-B to 0.9227 with CNE-net-SEP-sent.-att.) and by 0.010 on Sentihood (from 0.884 with MTL to 0.894 with CNE-net-SEP-sent.-att.). The accuracy metrics for sentiment analysis on SemEval14-Task4 are binary, 3-way and 4-way, which refer to accuracy over positive/negative (binary), positive/neutral/negative (3-way) and positive/neutral/negative/conflict (4-way). The sentiment classification accuracy improves by 0.012 on SemEval14-Task4 (4-way setting, from 0.859 with QA-B to 0.871 with CNE-net-SEP-sent.-att.) and by 0.004 on Sentihood (from 0.971 with MTL to 0.975 with CNE-net-SEP-sent.-att.). CNE-net-SEP uses [SEP] as the feature representation for sentiment classification. It performs the poorest among the three types of CNE-net, since a representation from the [SEP] token alone does not make full use of the sentence information and the category information. CNE-net-CLS-att. uses [CLS] as the sentence representation and applies an attention mechanism to relate the sentence representation to the category name hidden states, which yields the sentiment classification feature and achieves better performance. CNE-net-SEP-sent.-att. uses attention twice. The first attention builds a category-name-aware sentence embedding for each category, with [SEP] as the query and the sentence hidden states matrix as the key and value. The second attention applies each category-name-aware sentence embedding to generate the category representation, as in CNE-net-CLS-att. The combination of the category-name-aware sentence embedding and the sentence-aware category embedding makes it perform the best among the three types of CNE-net.
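The binary, 3-way and 4-way accuracies can be sketched as accuracy restricted to a subset of polarity classes. The procedure below is an assumption about how the restriction is applied: samples whose gold label lies outside the subset are skipped, and the prediction is the most probable class within the subset. The function name and the toy probabilities are hypothetical.

```python
SETTINGS = {
    "binary": ("positive", "negative"),
    "3-way": ("positive", "neutral", "negative"),
    "4-way": ("positive", "neutral", "negative", "conflict"),
}

def restricted_accuracy(gold_labels, prob_dicts, allowed):
    """Accuracy over samples whose gold label is in `allowed`, predicting the
    most probable class among `allowed` only (assumed evaluation procedure)."""
    kept = [(g, p) for g, p in zip(gold_labels, prob_dicts) if g in allowed]
    if not kept:
        return 0.0
    correct = sum(1 for g, p in kept
                  if max(allowed, key=lambda c: p[c]) == g)
    return correct / len(kept)

# Toy gold labels and predicted probabilities for one category.
gold = ["positive", "neutral", "negative", "conflict"]
probs = [
    {"positive": 0.6, "neutral": 0.2, "negative": 0.1, "conflict": 0.1},
    {"positive": 0.5, "neutral": 0.3, "negative": 0.1, "conflict": 0.1},
    {"positive": 0.1, "neutral": 0.2, "negative": 0.6, "conflict": 0.1},
    {"positive": 0.4, "neutral": 0.1, "negative": 0.2, "conflict": 0.3},
]
accuracies = {name: restricted_accuracy(gold, probs, classes)
              for name, classes in SETTINGS.items()}
```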

Incremental Learning Results
This section describes the performance on the incremental learning task. We train the models following the incremental learning workflow, as mentioned in section 4.3. We compare the results of mix-training (mix. for short; training on the mixture of Sample-Source and Sample-Target) and incremental learning (incre. for short), for both extraction F1 and sentiment accuracy. Firstly, we compare the performance in the target category, i.e., aspect category extraction F1 (extra. for short) and sentiment analysis accuracy (senti. for short), between mix-training and incremental learning. As the target-category performance in Table 5 shows, there is no significant difference between mix-training and incremental learning for either aspect extraction or sentiment analysis. For example, in SemEval14-Task-inc, the extraction F1 and sentiment accuracy of CNE-net-SEP-sent.-att. are 0.936 and 0.930 respectively in mix-training, while they are 0.937 and 0.932 respectively in incremental learning. In Sentihood-inc, the extraction F1 and sentiment accuracy of CNE-net-SEP-sent.-att. are 0.952 and 0.919 respectively in mix-training, while they are 0.954 and 0.920 respectively in incremental learning. This indicates that incremental learning does not decrease the performance in the target category. Our CNE-net-SEP-sent.-att. performs the best among all the models.
Secondly, we compare aspect extraction and sentiment analysis performance in the source categories after incremental learning, since both source categories and target categories require high accuracy. The extraction F1 and sentiment accuracy of the source categories after the incremental learning process, as well as in the mix-training process, are shown in Table 6. There is no significant difference in the sentiment accuracy of the source categories after training with incremental learning data. For example, in SemEval14-Task-inc, the sentiment accuracy of CNE-net-SEP-sent.-att. is 0.855 in mix-training, while it is 0.854 in incremental learning. This is probably because sentiment features are similar across categories, so the fine-tuning process does not make a great difference. However, for category extraction, MTL suffers from catastrophic forgetting after fine-tuning. In SemEval14-Task-inc, the extraction F1 of the MTL model in the source categories decreases from 0.898 in mix-training to 0.698 after incremental learning, while in Sentihood-inc, it decreases from 0.870 in mix-training to 0.757 after incremental learning. Fortunately, the QA-B model, as well as our CNE-nets, suffers less from this problem. In SemEval14-Task-inc, the extraction F1 of CNE-net-SEP-sent.-att. in the source categories is 0.913 after fine-tuning, while it is 0.916 in mix-training. In Sentihood-inc, the extraction F1 of CNE-net-SEP-sent.-att. in the source categories is 0.863 after fine-tuning, while it is 0.877 in mix-training.

Discussion
We have confirmed the effectiveness of CNE-nets for (T)ACSA tasks and (T)ACSA incremental learning tasks. However, a question remains: why does our model suffer less from catastrophic forgetting in incremental learning?
To answer this question, we compare the incremental learning performance of our CNE-net-SEP-sent.-att. with that of a similar model in which the decoders of the categories are unshared, i.e., $W_1 \neq W_2 \neq \dots \neq W_{n_{cat}}$ and $b_1 \neq b_2 \neq \dots \neq b_{n_{cat}}$ in equation (1) (CNE-net-SEP-sent.-att.-unshared). The results are shown in Table 7. There is no significant difference in the target category between the model with shared decoders and the model with unshared decoders, indicating that both the shared and the unshared model are able to obtain enough features for category extraction and sentiment analysis in the target category. More importantly, however, the extraction F1 in the source categories drops sharply in CNE-net-SEP-sent.-att.-unshared. In SemEval14-Task-inc, the extraction F1 decreases from 0.913 with the shared decoder to 0.842 with unshared decoders, while in Sentihood-inc, it decreases from 0.863 with the shared decoder to 0.796 with unshared decoders.
We believe the decreased extraction F1 in the source categories is due to the unshared decoders for each task: only the shared encoder and the target-category decoders are fine-tuned during the fine-tuning process, while the decoders of the source categories remain unchanged. The mismatch between the fine-tuned encoder and the original source-category decoders is the reason for the catastrophic forgetting problem in the category extraction evaluation. In our shared-decoder approach, both the encoder and the decoder are shared and fine-tuned, which weakens the catastrophic forgetting problem.

Conclusion
In this paper, in order to make multi-task learning feasible for incremental learning, we proposed CNE-net with different attention mechanisms. The category name features and the multitask learning structure help the model achieve state-of-the-art on ACSA and TACSA tasks. Furthermore, the shared encoder and decoder layers weaken catastrophic forgetting in the incremental learning task. We proposed a task for (T)ACSA incremental learning and achieved the best performance with CNE-net compared with other strong baselines. Further research may be concerned with zero-shot learning on new categories.