Using the Past Knowledge to Improve Sentiment Classification

This paper studies sentiment classification in the lifelong learning setting, in which a sequence of sentiment classification tasks is learned incrementally. It proposes a new lifelong learning model (called L2PG) that can retain and selectively transfer the knowledge learned in the past to help learn the new task. A key innovation of the proposed model is a novel parameter-gate (p-gate) mechanism that regulates the flow or transfer of the previously learned knowledge to the new task. Specifically, it selectively uses the network parameters (which represent the retained knowledge gained from the previous tasks) to assist the learning of the new task t. Knowledge distillation is also employed in the process to preserve the past knowledge by approximating the network output at the state when task t-1 was learned. Experimental results show that L2PG outperforms strong baselines, including even multi-task learning.


Introduction
A typical sentiment analysis (SA) or social media company that provides sentiment analysis services has to work for a large number of clients (Liu, 2012). Each client normally wants to study people's opinions about a particular category of products or services, which we also call a domain. If we regard each such study/project as a task, we can model an SA company's work on a large number of studies/projects for clients as performing a sequence of SA tasks. A natural question is whether, after analyzing opinions about a number of products or services (tasks), the SA system of the company can do better on a new task by retaining the knowledge learned from the previous tasks and selectively transferring it to the new task. The answer should be yes, because words and phrases used to express opinions or sentiments in different domains are similar and thus can mostly be shared or transferred across domains, although different domains do have domain-specific sentiment expressions. This is a lifelong learning setting (Thrun, 1998; Silver et al., 2013; Chen and Liu, 2016). This paper focuses on lifelong sentiment classification (Chen et al., 2015).
Problem Definition: We consider incrementally learning a sequence of supervised sentiment classification (SC) tasks, 1, ..., t, .... Each task t has a training dataset $D^t_{train} = \{(x^t_i, y^t_i)\}_{i=1}^{n_t}$, where $x^t_i$ is an input instance, $y^t_i$ is its label, and $n_t$ is the number of training examples of the t-th task. Our goal is to design a lifelong learning algorithm or neural network $f(\cdot; \theta^t)$ that can retain the knowledge learned in the past and selectively transfer the knowledge to improve the learning of each new task t. It is assumed that after each task is learned, its training data is deleted and thus not available to help learn any subsequent tasks. This is a common scenario in practice because clients usually want to ensure the confidentiality of their data and don't want their data shared or used by others.
This problem is clearly related to continual learning (CL) (Parisi et al., 2019; Li and Hoiem, 2017; Wu et al., 2018; Schwarz et al., 2018; Hu et al., 2019; Ahn et al., 2019), which also aims to learn a sequence of tasks incrementally. However, the main objective of current CL techniques is to solve the catastrophic forgetting (CF) problem (McCloskey and Cohen, 1989): in learning each new task, the network parameters need to be modified, and this modification can degrade accuracy on the previously learned tasks. In the problem defined above, our goal is instead to forward transfer the past knowledge to improve the new task learning. We don't need to ensure that the classifiers or models learned for previous tasks still work well.¹ Nevertheless, as we will see in the experiment section, the proposed method is able to outperform the current state-of-the-art CL algorithms. Although there is some existing work on lifelong sentiment classification (Chen et al., 2015), it is based on naive Bayes. Our deep learning model is based on an entirely different approach and performs markedly better.
To solve the proposed lifelong sentiment classification problem using a single neural network, two objectives have to be achieved. The first objective is to selectively transfer some pieces of knowledge learned in the past to assist the new task learning. Knowledge selection is critical here because not every piece of the past knowledge is useful to the new task (some is even harmful). The second objective is to preserve the knowledge learned in the past while learning the new task, because if many pieces of previous knowledge are corrupted by updates made in learning a new task, future tasks will not be able to benefit from them. This paper proposes a novel model, called L2PG (Lifelong Learning with Parameter-Gates), to achieve both objectives.
To achieve the first objective, we propose a novel mechanism called the parameter-gate (p-gate). It gives suitable importance values to the network parameters representing the past knowledge according to how useful they are to the new task, and transfers them to the new task to enable it to learn better. We split the parameters $\theta^t$ of the proposed model $f(\cdot; \theta^t)$ into three subsets: (1) the shared parameters $\theta^{s,t}$, (2) the task classification parameters $\theta^{c,t}$ and (3) the p-gate parameters. The shared parameters and the p-gate parameters are continuously updated with the learning of each new task t, whereas $\theta^{c,t}$ remains unchanged once task t is learned. In learning a new task t, we randomly initialize only the task classification parameters $\theta^{c,t}$. An input p-gate selects parameters (or knowledge) from the shared parameters $\theta^{s,t-1}$ of the network state after learning task t − 1 that are helpful to the new task t, and a block p-gate blocks the part of the previous training step's parameters of $\theta^{s,t}$ that is not useful (or is harmful) to task t.
To achieve the second objective, knowledge distillation (Hinton et al., 2015) is used to ensure that the updated network can preserve the previous model's knowledge in learning the new task.
¹ Lifelong learning and continual learning are often regarded as the same. Here, we follow (Thrun, 1998) and make this distinction.
This paper makes three main contributions:
• It proposes a novel deep learning model L2PG that uses a novel p-gate mechanism and knowledge distillation for lifelong sentiment classification. To the best of our knowledge, this approach has not been reported in the existing lifelong or continual learning literature.
• Unlike traditional gates that regulate the feature information flow through the sequence chain, the goal of the proposed p-gates is to select useful parameters (which represent the learned knowledge from previous tasks) to be transferred to the new task to make it learn better. In other words, p-gates regulate the knowledge transfer from the past to the present.
• It creates a 3-class sentence-level sentiment classification corpus from reviews of 10 diverse product categories for lifelong learning evaluation. Such evaluations need many tasks. To our knowledge, no existing sentence sentiment classification corpus fits this need.
Experimental results show that L2PG outperforms state-of-the-art baselines including multi-task learning, which optimizes all the tasks at the same time.

Related Work
Our work is related to sentiment classification (Liu, 2012), lifelong learning and continual learning. For sentiment classification, recent deep learning models have been shown to outperform traditional methods (Kim, 2014; Devlin et al., 2018; Shen et al., 2018; Qin et al., 2020). However, these models don't retain or transfer the knowledge to new tasks.
Lifelong learning: Most relevant to our work is lifelong learning (Thrun, 1998; Silver et al., 2013; Ruvolo and Eaton, 2013; Liu, 2014, 2016). For lifelong sentiment classification, Chen et al. (2015) used naive Bayes to leverage word probabilities under different classes in old tasks/domains as priors to help optimize the new task learning. The LNB method compared in our experiments works similarly, but it can improve the model of a previous task without retraining. Xia et al. (2017) proposed a voting method, but their method works on the same data from different time periods. Lv et al. (2019) proposed a model using two networks, one for knowledge retention and one for feature learning, but it was shown to be weaker than LNB. L2PG takes a very different approach and performs markedly better. Other work studied aspect-level sentiment classification, which is not the goal of L2PG. To the best of our knowledge, none of these methods used gated mechanisms to regulate the transfer of knowledge in the lifelong learning process.
Continual learning: It is similar to lifelong learning, but its main goal is to overcome catastrophic forgetting, i.e., to ensure that learning a new task does not cause the models learned for previous tasks to be forgotten (McCloskey and Cohen, 1989; Goodfellow et al., 2013). For example, LWF (Li and Hoiem, 2017) uses knowledge distillation loss to ensure that after learning a new task, it can still approximate the performance of the old tasks. EWC (Kirkpatrick et al., 2017) introduces constraints to control parameter changes when learning a new task. HAT (Serrà et al., 2018) masks units that are important to previous tasks by a hard attention. PGMA (Hu et al., 2019) generates a subset of parameters. Two reviews of continual learning can be found in the literature (e.g., Parisi et al., 2019). Our lifelong learning setting focuses on transferring the past knowledge to the current task. We don't ensure that the models learned in the past still work well after learning a new task. Progressive Networks (Rusu et al., 2016) also try to help future learning through knowledge transfer, but the approach is not scalable, as its network size grows quadratically in the number of tasks.
Knowledge Distillation Loss was proposed in (Hinton et al., 2015) for transferring knowledge in a large model to a smaller one. LWF uses knowledge distillation to help deal with forgetting. Dhar et al. (2019) proposed an information preserving penalty, attention distillation loss, to preserve the information about existing classes. This setting is different from ours as it incrementally learns more classes. Each of our tasks is an independent sentiment classification problem with multiple classes.

The Proposed L2PG Model
The working of the proposed model L2PG in learning the new task t is illustrated in Figure 1. Our learner $f(\cdot; \theta^t)$ consists of three modules and two loss functions. The first module is the shared knowledge module (SK), which consists of a CNN (i.e., convolutional neural network) with various filters. It contains the shared knowledge across tasks in its parameters $\theta^{s,t}$. The second module is the task classification module (TC) with parameters $\theta^{c,t}$, which is a fully connected layer for the classification of task t. There is one TC for each task and it is fixed once t is learned. The third module is the p-gate module (PG).

Figure 1: The proposed L2PG model. In learning task t, the parameters in the yellow boxes are temporary copies of the parameters of task t − 1 (a superscript • is used to indicate a copy) and are not changed (they are deleted after learning task t). The parameters in the blue boxes and blue disk are updated. Green lines are for knowledge distillation.
In learning each new task t, temporary copies of SK and of TC (in the yellow boxes of Figure 1) are made from the state of the network after task t − 1 was learned. For clarity, we use the superscript • to indicate a copy of something. For example, $\theta^{s,t-1,\bullet}$ and $\theta^{c,t-1,\bullet}$ denote the copies of $\theta^{s,t-1}$ and $\theta^{c,t-1}$, respectively. They are fixed and not updated during the learning of task t. SK (in the blue box) and PG (in the blue disk) are updated in learning task t, and are also used in testing. The goal of PG is to identify useful knowledge for task t from the parameters $\theta^{s,t-1,\bullet}$ of SK after task t − 1 training and to block the knowledge in SK that is unhelpful or harmful for the current task (see Sec. 3.3). Knowledge distillation is used to ensure that in learning task t, the knowledge gained from the previous tasks is not forgotten. The parameters of SK, TC and PG are updated through back propagation. The two loss functions used are the knowledge distillation loss and the cross entropy loss.
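The bookkeeping at the start of each task can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `begin_task`, the list-based parameter containers and the initialization range are all assumptions made for the example.

```python
import copy
import random

def begin_task(shared_sk, previous_tc, num_features, num_classes, seed=0):
    """Prepare the state for learning a new task t (illustrative only)."""
    rng = random.Random(seed)
    # Temporary copies of SK and TC from task t-1; never updated during task t.
    sk_copy = copy.deepcopy(shared_sk)
    tc_copy = copy.deepcopy(previous_tc)
    # The classifier of the new task is randomly initialized (range is assumed).
    new_tc = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
              for _ in range(num_classes)]
    return sk_copy, tc_copy, new_tc

# Toy parameters standing in for the network state after task t-1.
shared_sk = [0.2, -0.4, 0.7]
previous_tc = [[0.1, 0.0, -0.3], [0.5, 0.2, 0.1]]

sk_copy, tc_copy, new_tc = begin_task(shared_sk, previous_tc,
                                      num_features=3, num_classes=2)

# SK keeps changing while task t is learned; the copy keeps the old state.
shared_sk[0] = 9.9
```

The deep copies play the role of the yellow boxes in Figure 1: they can be discarded once task t is learned.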

Shared Knowledge Module (SK)
Let the training data of task t be $D^t_{train}$, and an instance of it with length L (after padding or cutting) be $x^t_i$ with label $y^t_i$. Training of SK (in the blue box of Figure 1) for the new task t starts with SK of the task t − 1 model $f(\cdot; \theta^{t-1})$. After training on task t, it becomes SK of the model $f(\cdot; \theta^t)$ for task t. During training, the input instance goes through SK to get advanced features to be used by task t's TC module. Let $V^t_{ij} \in \mathbb{R}^k$ be the word vector corresponding to the j-th word of $x^t_i$ and $X^t_i \in \mathbb{R}^{L \times k}$ be the embedding matrix of $x^t_i$. SK receives $X^t_i$ from the input layer, and then extracts advanced features $C^t_i$ in the form of n-grams, i.e.,
$$C^t_i = [c_1, c_2, \ldots, c_{L-n+1}] \quad (1)$$
where $c_j$ represents the output produced by the CNN's filter on $X^t_i[j : j + n - 1, :]$. Mathematically, a convolution operation consists of a filter $W^t \in \mathbb{R}^{n \times k}$ and a bias $b^t \in \mathbb{R}$. $c_j$ can be expressed as:
$$c_j = g(W^t \cdot X^t_i[j : j + n - 1, :] + b^t) \quad (2)$$
where g is a nonlinear activation function such as ReLU. We use a max-pooling operation over the feature map and take the maximum value $\hat{C}^t_i = \max\{C^t_i\}$ as the feature corresponding to this particular filter. Collecting the pooled features of all filters, the shared knowledge from SK of the current task t is
$$\hat{C}^t_i = f(X^t_i; \theta^{s,t}) \quad (3)$$
where $\theta^{s,t}$ is the whole set of parameters of SK of the current task t.
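The n-gram convolution and max-pooling step described above can be illustrated with a small pure-Python sketch for a single filter. The toy embedding matrix, filter values and function names are invented for the example; a real model would use many filters and a tensor library.

```python
def relu(v):
    return max(0.0, v)

def conv_feature_map(X, W, b, g=relu):
    """Apply one n x k filter W with bias b to every n-gram window of X.

    X is an L x k embedding matrix; the result has L - n + 1 entries.
    """
    L, n = len(X), len(W)
    feature_map = []
    for j in range(L - n + 1):
        s = b
        for r in range(n):
            for col in range(len(W[r])):
                s += W[r][col] * X[j + r][col]
        feature_map.append(g(s))
    return feature_map

def max_pool(feature_map):
    """Keep the strongest response of the filter over the whole sentence."""
    return max(feature_map)

# Toy input: L = 4 words, k = 2 embedding dimensions, one bigram filter (n = 2).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W = [[1.0, -1.0], [0.5, 0.5]]
b = 0.0

feature_map = conv_feature_map(X, W, b)   # [1.5, 0.0, 0.0]
pooled = max_pool(feature_map)            # 1.5
```

With filter sizes 3, 4 and 5 and 100 feature maps each, the same computation would yield a 300-dimensional pooled feature vector per instance.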

Task Classification Module (TC)
Using Eq. 3 we obtain a high-level representation of the input instance $x^t_i$. Then, we pass this pooled feature of $x^t_i$, denoted $\hat{C}^t_i$, through TC of task t to obtain the classification result,
$$\hat{y}^t_i = \mathrm{softmax}(W^t_c \hat{C}^t_i + b^t_c) \quad (4)$$
where $W^t_c$ and $b^t_c$ are the weight and bias of the classifier. Like SK above, we refer to the classifier from the TC module of the current task t as
$$\hat{y}^t_i = f(\hat{C}^t_i; \theta^{c,t}) \quad (5)$$
where $\theta^{c,t}$ is the set of all parameters of the TC (classifier) of the current task t. As mentioned earlier, TC is a fully connected layer (in the top blue box of Figure 1) and is randomly initialized.
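A fully connected layer followed by a softmax, as used by TC, can be sketched as below. The weights, the two-dimensional feature vector and the function names are made up for illustration; only the structure (linear map plus softmax) comes from the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, W_c, b_c):
    """One fully connected layer: a logit per class, then softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) + bias
              for row, bias in zip(W_c, b_c)]
    return softmax(logits)

# Toy classifier with 3 classes over a 2-dimensional pooled feature vector.
W_c = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_c = [0.0, 0.0, 0.0]

probs = classify([1.5, 0.2], W_c, b_c)
predicted_class = probs.index(max(probs))
```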

P-Gate Module (PG)
Recall that in learning the new task t, the proposed p-gate mechanism (PG) selectively transfers some pieces of knowledge from the parameters $\theta^{s,t-1}$ after task t − 1 is learned, i.e., from $f(\cdot; \theta^{t-1})$, to the current task t. At the same time, PG also needs to block the knowledge that is not helpful to the current task or that may cause forgetting for previous tasks. We achieve these goals using two p-gates, an input p-gate and a block p-gate.
The input p-gate uses the Sigmoid function to determine what proportion of each parameter in the SK from the previous task should help the current task to learn. The input p-gate is formulated as
$$z = \sigma(W_z \theta^{s,t-1,\bullet}) \quad (6)$$
where $\sigma$ is the Sigmoid function, $\theta^{s,t-1,\bullet}$ is a copy of $\theta^{s,t-1}$, the parameters of the network state after task t − 1 was learned (see the top yellow box in Figure 1), and $W_z$ is the set of trainable parameters of the input p-gate. $\theta^{s,t-1,\bullet}$ does not change during training. $z_{ij} \to 1$ means that the corresponding parameter is almost completely helpful to the learning of the current task, and $z_{ij} \to 0$ means that the parameter is of no help (or harmful) to the current task t.
The block p-gate blocks some of SK's parameters from the previous training step S − 1 in the training process of the current task t. $\theta^{s,t}_{S-1}$ serves as the initial value of the parameters $\theta^{s,t}_S$ of the current training step S. The block p-gate is formulated as
$$b = \sigma(W_b \theta^{s,t}_{S-1}) \quad (7)$$
where $W_b$ is the set of trainable parameters of the block p-gate. $b_{ij} \to 0$ means that the current parameter almost certainly has a negative effect on the subsequent learning or may lead to forgetting. Both the input p-gate's parameters $W_z$ and the block p-gate's parameters $W_b$ are trained by minimizing the loss function of the current task t's classification module TC. After this step of training using a batch of examples for task t is completed, SK's parameters of step S are revised by a combination operation (Eq. 8): the trained $\theta^{s,t}_S$ is replaced by a combination of the parameters passed by the block p-gate and the parameters of $\theta^{s,t-1,\bullet}$ selected by the input p-gate. This operation reduces the interference of the new task t with the existing knowledge learned in the past, which would otherwise cause forgetting.
After the parameter combination and revision is done, the training goes to the next step/iteration S + 1 using another batch of data. Note that this combination and replacement operation is not used if S is the last step of an epoch.
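To make the gating idea concrete, the sketch below applies two Sigmoid gates to flat lists of parameters and mixes the results. It rests on assumptions that the text does not confirm: the gates are taken to be element-wise, and the combination is taken to be an additive mix of the block-gated step parameters and the input-gated old parameters. All names and values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate(weights, params):
    """Element-wise Sigmoid gate (assumed form); each value lies in (0, 1)."""
    return [sigmoid(w * p) for w, p in zip(weights, params)]

def combine(b, theta_block_src, z, theta_prev_copy):
    """ASSUMED combination: block-gated parameters plus input-gated old ones."""
    return [bi * ts + zi * tp
            for bi, ts, zi, tp in zip(b, theta_block_src, z, theta_prev_copy)]

theta_prev_copy = [0.8, -0.5, 0.3]   # frozen copy of SK after task t-1
theta_prev_step = [0.6, -0.1, 0.9]   # SK parameters from training step S-1
W_z = [2.0, 0.0, -3.0]               # input p-gate parameters (toy values)
W_b = [1.0, 1.0, 1.0]                # block p-gate parameters (toy values)

z = gate(W_z, theta_prev_copy)       # how much old knowledge to let in
b = gate(W_b, theta_prev_step)       # how much of the current step to keep
theta_next = combine(b, theta_prev_step, z, theta_prev_copy)
```

A gate value near 1 passes the corresponding parameter almost unchanged, and a value near 0 suppresses it, which matches the reading of $z_{ij}$ and $b_{ij}$ given in the text.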

Objective of Optimization
In order for the model to retain old knowledge during the learning process, we use the knowledge distillation loss in (Hinton et al., 2015) to encourage the outputs of one network to approximate the outputs of another, similar to LWF. Therefore, when we start training task t, we first use $f(\cdot; \theta^{t-1})$ to get the softmax outputs $Y^{t-1,\bullet} = \{y^{t-1,\bullet}_i\}_{i=1}^{n_t}$ on the training data of task t. $\hat{Y}^{t-1} = \{\hat{y}^{t-1}_i\}_{i=1}^{n_t}$ denotes the softmax outputs of SK of task t combined with TC of task t − 1. The two are used to build the knowledge distillation loss. Let $Y^t = \{y^t_i\}_{i=1}^{n_t}$ be all ground truth labels of task t and $\hat{Y}^t = \{\hat{y}^t_i\}_{i=1}^{n_t}$ be the softmax outputs of $f(\cdot; \theta^t)$ used to build the cross entropy loss. $n_t$ is the number of training examples of task t.
We now present L2PG's optimization goals when sequentially learning each new task t.
Knowledge Distillation Loss: It is defined as:
$$L_D = -\sum_{i=1}^{n_t} \sum_{l} y'^{(l)}_i \log \hat{y}'^{(l)}_i, \qquad y'^{(l)}_i = \frac{(y^{t-1,\bullet,(l)}_i)^{1/K}}{\sum_j (y^{t-1,\bullet,(j)}_i)^{1/K}}, \qquad \hat{y}'^{(l)}_i = \frac{(\hat{y}^{t-1,(l)}_i)^{1/K}}{\sum_j (\hat{y}^{t-1,(j)}_i)^{1/K}} \quad (10)$$
where l and j index the classes and K is a hyperparameter. Hinton et al. (2015) suggest K > 1, which increases the weight of smaller logit values and encourages the network to better encode similarities among classes.
Classification Loss: The classification loss of the current learner $f(\cdot; \theta^t)$ for task t is the cross entropy of $Y^t$ and $\hat{Y}^t$,
$$L_C = -\sum_{i=1}^{n_t} y^t_i \cdot \log \hat{y}^t_i \quad (11)$$
So, the total loss is
$$L = L_C + \lambda L_D + \beta R(\theta^t) \quad (12)$$
where λ and β are hyperparameters, $R(\theta^t)$ is the regularization term (we use the L2 regularizer), and $\theta^t$ includes $\theta^{c,t}$, $\theta^{s,t}$, $W_b$ and $W_z$. The algorithm of L2PG for training the new task t is given in Algorithm 1.
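The loss terms can be sketched for a single instance as follows. This follows the LWF-style distillation loss that the text says it resembles, with probabilities raised to the power 1/K and renormalized; whether L2PG uses exactly this form is an assumption, as is the pairing of λ with the distillation term and β with the regularizer. The function names are hypothetical.

```python
import math

def soften(probs, K):
    """Raise probabilities to 1/K and renormalize; K > 1 flattens them."""
    powered = [p ** (1.0 / K) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def distillation_loss(old_probs, new_probs, K=2.0):
    """Cross entropy between softened old-model and new-model outputs."""
    y = soften(old_probs, K)
    y_hat = soften(new_probs, K)
    return -sum(a * math.log(b) for a, b in zip(y, y_hat))

def cross_entropy(one_hot, probs):
    """Classification loss for one instance with a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def total_loss(l_c, l_d, reg, lam=1.0, beta=1.0):
    """ASSUMED weighting: classification + lam * distillation + beta * reg."""
    return l_c + lam * l_d + beta * reg

old = [0.9, 0.1]          # output of the frozen task t-1 network
same = [0.9, 0.1]         # new network agrees with the old one
drifted = [0.5, 0.5]      # new network has drifted away

loss_same = distillation_loss(old, same)
loss_drifted = distillation_loss(old, drifted)
```

With K = 2, the distribution [0.9, 0.1] is softened to [0.75, 0.25], so the smaller class probability carries more weight in the loss than it would without softening.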

Experiments
We now evaluate L2PG and compare it with two main types of baselines, i.e., those under lifelong sentiment classification and those under continual learning for dealing with catastrophic forgetting.
Algorithm 1 L2PG: Learning the new task t
1: Input: Training set $D^t_{train}$ of task t, and shared parameters $\theta^{s,t-1,\bullet}$ and task classification parameters $\theta^{c,t-1,\bullet}$ from task t − 1.
...
9: $\hat{Y}^t = f(X^t_S; \theta^{s,t}_S, \theta^{c,t}_S)$;
10: Update parameters:
11: Parameters $\theta^{s,t}_S$, $\theta^{c,t}_S$, $W_z$ and $W_b$ are updated by minimizing Eq. 12;
12: // Use the trained p-gate parameters to select the knowledge for the next step
13: ...
16: end for

Datasets
We carried out experiments on two datasets. The first dataset is for document-level sentiment classification with two classes, positive and negative. It consists of reviews of 16 diverse kinds of products (domains) commonly used in multi-task text classification (Liu et al., 2017). The reviews of the first 14 products are from Amazon.com. The remaining two are movie reviews (IMDB and MR). The number of training and testing samples for each product (or task) is about 1,400 and 400, respectively. We call this dataset Mix-16, which gives us 16 tasks, one per product category/domain.
The second dataset is for sentence-level sentiment classification and was created by us. It consists of review sentences of 10 types of products/domains crawled from Amazon.com, which gives us 10 tasks. Each sentence is labeled as positive, negative or neutral. Sentences with conflicting opinions (e.g., both positive and negative) are not used. Sentence sentiment classification of each domain forms a task. The review sentences of each product were annotated by two annotators independently. We trained all the annotators and provided them with an annotation instruction document. After training, each of them was asked to annotate 50 sentences so that we could assess their annotation quality. They started their annotation only after we were satisfied with their annotations. After they completed their annotations, sentences with disagreements were identified and discussed by the annotators to come to an agreement. The Kappa score for annotator agreement was 0.7947.
Note that we are aware that there are some existing sentence sentiment classification datasets, but each of them is only from reviews of a single product. We are unable to create many different domain tasks from them to suit lifelong learning. Furthermore, they mostly have only two classes, positive and negative, which do not reflect all review sentences because many review sentences express no sentiment (neutral), e.g., "I bought this camera yesterday." That is why we created the new dataset with 10 different categories of products, which gives us 10 tasks for lifelong learning. We denote this dataset as Amazon-10.
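For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa for two annotators. The text only says "Kappa", so treating it as Cohen's kappa is an assumption, and the label sequences are invented toy data rather than the paper's annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy annotations of four sentences by two annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four sentences (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5.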

Baselines
We consider the following baselines for comparison with the proposed L2PG model. The feature extraction module (e.g., SK of L2PG) of all models including L2PG uses CNN (Kim, 2014) and each classifier is a fully connected layer (e.g., TC of L2PG for each task).
I-CNN: I-CNN is a single-task CNN classifier, where a separate CNN model is trained for each task independently, with no sharing of knowledge across tasks.
S-CNN: S-CNN is I-CNN but uses one CNN model (one feature extractor and one classifier) to incrementally learn all tasks. No mechanism is used to deal with knowledge transfer or forgetting.
LWF-T: This is a continual learning model based on Learning without forgetting (LWF) (Li and Hoiem, 2017). It uses knowledge distillation to deal with catastrophic forgetting. Since LWF was originally designed for image classification, we modified it for text classification using the same model as the above, i.e., CNN for the shared parameter module, one fully connected layer for each task's classifier (each task has its own classifier). When training the new task, the parameters of the task-specific classifiers of the previous tasks are fixed. We denote this LWF model as LWF-T.
HAT: This is a well-known algorithm for continual learning that deals with catastrophic forgetting (Serrà et al., 2018). Since HAT (or UCL below) was also designed for image classification, we again adapted it for text. HAT has almost no forgetting for image classification.
UCL: This is a recent continual learning model (Ahn et al., 2019) that improves on HAT.
LSC: This is the naive Bayes-based lifelong sentiment classification model in (Chen et al., 2015).
LNB: LNB is similar to LSC but is able to improve the model of a previous task without retraining. The system in (Lv et al., 2019) is not compared as it performed worse than LNB.
MTL: This is a multi-task learning baseline that, like L2PG, uses a CNN as the shared knowledge module and gives each task its own task-specific classifier, as in L2PG, HAT, and UCL. In (Li and Hoiem, 2017), MTL's performance was regarded as the upper bound of continual learning because the training data of all tasks are available during training. For L2PG, in contrast, the data of each sentiment classification task is assumed to be deleted after the task is learned.
Training details. For all models in our experiments, the word embeddings are randomly initialized as 300-dimensional vectors and then modified during training. We use filter sizes of [3, 4, 5] with 100 feature maps each in the CNN module, and a dropout rate of 0.5. In L2PG, we set the mini-batch size to 50, the learning rate to 0.001, the temperature T = 2 (the hyperparameter K of the knowledge distillation loss) and λ = β = 1. We use the same feature extractor CNN and classifier as the other models. For HAT and UCL, we modified their code for text and optimized their parameters (their original parameters performed poorly for text), but we did not change their algorithms. HAT and UCL need 300 and 100 epochs, respectively, to achieve their best results; for the others, 20 epochs are sufficient. For LSC and LNB, we use their original code. Note that LSC and LNB can only deal with two-class sentiment classification due to the limitation of their knowledge sharing mechanism. Thus we cannot run them on the second dataset, which has three classes.

Results and Analysis
For our lifelong learning setting, we use 5 random task sequences to compute the accuracy, as different task sequences may give different results. For each sequence, each task (also a domain) is used as the last task in turn to collect its test result. This is because we are only interested in improving the accuracy of the current/new task based on knowledge learned in the previous tasks. Table 2 and Table 3 give the mean accuracy of each task when it is the last task for Mix-16 and Amazon-10, respectively. The average accuracy of each column is given in the last row of each table.
Compare with Lifelong Sentiment Classification Models: LSC and LNB are designed only for 2-class lifelong sentiment classification. They cannot handle the three classes in Amazon-10 and thus have no result for it, so L2PG is compared to LSC and LNB only on Mix-16. In Table 2, we see that L2PG outperforms LSC and LNB by 2.53% and 2.77%, respectively. One reason is that LSC and LNB are naive Bayes approaches, which cannot model contextual relationships due to their conditional independence assumption on features (words). L2PG does not have this limitation.
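The last-task evaluation protocol described at the start of this section can be sketched as follows. The exact way the sequences are built is an assumption (each random sequence is reused with every task moved to the end in turn), the task names are invented, and `train_and_eval` is a hypothetical callback that would train on the given order and return the test accuracy of its final task.

```python
import random

def last_task_orders(tasks, num_sequences=5, seed=0):
    """For each random sequence, move every task to the last position in turn."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_sequences):
        seq = tasks[:]
        rng.shuffle(seq)
        for target in tasks:
            orders.append([t for t in seq if t != target] + [target])
    return orders

def mean_last_task_accuracy(orders, train_and_eval):
    """Average the accuracy of each task over the runs in which it is last."""
    per_task = {}
    for order in orders:
        per_task.setdefault(order[-1], []).append(train_and_eval(order))
    return {task: sum(accs) / len(accs) for task, accs in per_task.items()}

tasks = ["books", "dvd", "camera"]          # hypothetical domain names
orders = last_task_orders(tasks)            # 5 sequences x 3 tasks = 15 runs

# Placeholder evaluator so the sketch runs without a model.
means = mean_last_task_accuracy(orders, lambda order: 0.5)
```

Under this reading, a dataset with T tasks requires 5 × T training runs per model, each ending with the task whose accuracy is being reported.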

Compare with Continual Learning Models:
For the continual learning models LWF-T, HAT and UCL, to be consistent with the lifelong setting of L2PG, we also take turns to put each task last and use the final model to get the accuracy of the last task (there is no forgetting for the last task). The average accuracy of L2PG on both datasets is markedly higher than that of these models. For example, on Amazon-10, L2PG's average accuracy is 2.20% higher than LWF-T, 6.48% higher than HAT and 2.66% higher than UCL. As we can see, the continual learning models LWF-T and UCL (the latest algorithm), which only deal with catastrophic forgetting, also achieve better results than I-CNN, as the tasks are similar and share a great deal of knowledge (HAT is markedly worse). However, since they do not have specific mechanisms to perform knowledge transfer, they are weaker than L2PG.
Compare with MTL: Under the condition that the same CNN is used as the feature extractor and a fully connected layer is used as a task-specific classifier for each task, L2PG is on average 1.22% better than MTL on Mix-16 and 2.99% better than MTL on Amazon-10. MTL is often considered the upper bound of continual learning because it trains all the tasks together. However, its loss is the sum of the losses of all tasks, which does not mean that it optimizes for every individual task. L2PG in the lifelong learning setting tries to do the best for the new/current task.
Ablation Experiments and Analysis: To show the usefulness of each component of L2PG, we perform ablation experiments on the Amazon-10 data without using the knowledge distillation loss, the p-gate module (PG), or both. The results are given in Table 4.
When only the knowledge distillation loss is removed from L2PG (w/o $L_D$), which we call L2PG-NK, the average accuracy drops by about 1.15%, which indicates that using the knowledge distillation loss to actively preserve the old knowledge is useful. When only the p-gate module is removed from L2PG (w/o PG), which is actually LWF-T, the average accuracy drops by about 2.10%, which shows that our PG mechanism can choose and transfer the right knowledge to the new task. Without both the knowledge distillation loss and PG (w/o $L_D$ or PG), which is actually S-CNN, the result is much worse. Comparing L2PG-NK with LWF-T and S-CNN, we can see that L2PG-NK's average score is 0.95% higher than LWF-T and 1.74% higher than S-CNN, which indicates that even without the distillation loss, the PG mechanism can retain the past knowledge and use it effectively.
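As a consistency check on the ablation figures quoted above, the gaps between the variants follow directly from the reported drops. This is only arithmetic on the numbers in the text, assuming all differences are in percentage points of average accuracy on Amazon-10; the symbol $A(\cdot)$ for average accuracy is introduced here for illustration.

```latex
\begin{align*}
A(\text{L2PG}) - A(\text{L2PG-NK}) &\approx 1.15, \\
A(\text{L2PG}) - A(\text{LWF-T}) &\approx 2.10, \\
A(\text{L2PG-NK}) - A(\text{LWF-T}) &\approx 2.10 - 1.15 = 0.95, \\
A(\text{L2PG}) - A(\text{S-CNN}) &\approx 1.15 + 1.74 = 2.89.
\end{align*}
```

The third line reproduces the 0.95% gap stated in the text, and the last line gives the drop implied for the variant without both components.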

L2PG in the Continual Learning Setting
Here we run L2PG as a continual learning system. Like LWF-T, HAT and UCL, after all tasks are learned, L2PG is tested on every task's test data (note, in the lifelong learning setting, we only test on the last task). The continual learning results on the two datasets are presented in Figures 2 and 3, where six models are compared, namely, I-CNN, S-CNN, LWF-T, HAT, UCL and L2PG. From the figures, we observe that L2PG actually can outperform all the other five models. This is due to the fact that L2PG encourages knowledge transfer, while the continual learning systems LWF-T, HAT and UCL only focus on preserving the past knowledge.

Conclusion
This paper proposed an effective model L2PG for lifelong sentiment classification. L2PG not only can retain what it has learned, but also selectively transfer the past knowledge to learn the new task better. The key component is the proposed parameter gate (p-gate) mechanism that is able to select the right previously learned knowledge or parameters to transfer to the new task. Knowledge distillation is also employed to maintain the knowledge or models learned for the previous tasks. Empirical evaluation showed L2PG outperforms strong baselines in lifelong learning, continual learning, and even multi-task learning.