Bayes-enhanced Lifelong Attention Networks for Sentiment Classification

The classic deep learning paradigm learns a model from the training data of a single task and the learned model is also tested on the same task. This paper studies the problem of learning a sequence of tasks (sentiment classification tasks in our case). After each sentiment classification task is learned, its knowledge is retained to help future task learning. Following this setting, we explore attention neural networks and propose a Bayes-enhanced Lifelong Attention Network (BLAN). The key idea is to exploit the generative parameters of naive Bayes to learn attention knowledge. The learned knowledge from each task is stored in a knowledge base and later used to build lifelong attentions. The constructed lifelong attentions are then used to enhance the attention of the network to help new task learning. Experimental results on product reviews from Amazon.com show the effectiveness of the proposed model.


Introduction
In recent years, deep learning methods, powered by big data, have brought sentiment classification to a new height. However, these deep learning methods primarily learn a model from the training data of one task and test the learned model on the same task. In this paper, we study the problem of learning a sequence of sentiment classification tasks. This learning setting is called lifelong learning: a continual learning process in which the learner learns a sequence of tasks; after each task is learned, its knowledge is retained and later used to help new/future task learning (Thrun, 1998; Silver et al., 2013; Ruvolo and Eaton, 2013; Mitchell et al., 2018; Parisi et al., 2019). Lifelong learning has been covered extensively in the book by Chen and Liu (2018). The first lifelong learning method for sentiment classification (called lifelong sentiment classification) was introduced in (Chen et al., 2015). Following Chen et al. (2015), several other lifelong sentiment classification approaches were proposed (Xia et al., 2017; Lv et al., 2019; Xu et al., 2020). We discuss these and other related work in the next section. Among these, (Lv et al., 2019) is a pioneering lifelong sentiment classification method based on deep learning. Following these existing works, we treat classification in each domain (e.g., a type of product) as a learning task; thus, we use the terms domain and task interchangeably throughout the paper. Formally, we study the following problem.
Problem Definition: At any point in time, a learner has learned n sentiment classification tasks. Each task T_i (i = 1, ..., n) has its own data D_i. The learned knowledge is retained in a knowledge base. When faced with a new sentiment classification task T_{n+1}, the knowledge in the knowledge base is leveraged to help learn T_{n+1}. After T_{n+1} is learned, its knowledge is also incorporated into the knowledge base for future task learning.
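The protocol above can be sketched as a simple loop over tasks. Here `learn_task` and `extract_knowledge` are hypothetical placeholders standing in for the actual model training and knowledge mining described later:

```python
# A minimal sketch of the lifelong learning protocol described above.
# learn_task() and extract_knowledge() are hypothetical placeholders,
# not the actual BLAN training procedure.

def learn_task(data, knowledge_base):
    """Train a model on one task, optionally using retained knowledge."""
    # Here we only record which prior knowledge informed this task.
    return {"data": data, "used_knowledge": list(knowledge_base)}

def extract_knowledge(model):
    """Distill what should be retained from a learned task."""
    return {"summary": model["data"]}

def lifelong_learn(task_stream):
    knowledge_base = []  # knowledge retained from all previous tasks
    models = []
    for data in task_stream:
        model = learn_task(data, knowledge_base)          # help from the KB
        models.append(model)
        knowledge_base.append(extract_knowledge(model))   # retain for future tasks
    return models, knowledge_base

models, kb = lifelong_learn(["camera", "books", "kitchen"])
```

Note that the third task sees the knowledge of the first two, which is exactly the flow defined above.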
The above problem is a reformulation of the setting of Chen et al. (2015). As can be seen, its main challenges are two-fold: (1) what forms of knowledge should be retained from each task, and (2) how should the learner use that knowledge to help future task learning? These are also the central challenges of lifelong learning in general. This paper studies lifelong learning for sentiment classification. Below we consider an example.
"I bought this camera last month. It works good although a little heavy. Would recommend this." We (humans) can observe that (1) the first sentence shows a neutral sentiment (or no sentiment), (2) the second sentence expresses both positive (i.e., good) and negative (i.e., heavy) sentiments, and (3) the third sentence shows a strong positive (i.e., recommend) sentiment. The overall sentiment polarity of this example is positive. One reason we can understand this clearly is that we give different focuses/attentions to each word and each sentence: strong attention to the words good, heavy and recommend, and to the last two sentences. Another crucial reason is that we have accumulated a large amount of knowledge in the past and use it now to help understanding. For example, we know that the word "good" is positive for all products, while the word "heavy" is negative for cameras but may be positive for some other products. This phenomenon is known as domain generality and domain specificity, as introduced in (Li and Zong, 2008). The problem is particularly acute in lifelong learning, which must tackle a large number of (possibly never-ending) domains/tasks. This motivates us to use attention from previous tasks as knowledge to help future task learning.
Attention has become enormously popular as an essential component of neural networks, especially for a wide range of natural language processing tasks (Chaudhari et al., 2019). Attention was first introduced into neural networks for machine translation (Bahdanau et al., 2014). Since then, a large body of sentiment classification models using attention mechanisms have appeared, to name a few (Zhou et al., 2016; Yang et al., 2016; Gu et al., 2018; He et al., 2018; Peng et al., 2019). However, none of them is a lifelong sentiment classification method. Most of them learn in isolation, as in the classic deep learning paradigm, although some address multi-domain sentiment classification, which focuses on leveraging data in source domains to help a target domain. They are thus far from sufficient for our goal, as we aim to build a model that learns a sequence of sentiment classification tasks successively. In addition, whether attention is explainable (or interpretable) was recently discussed in (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). Letarte et al. (2018) and Clark et al. (2019) show the importance of attention for neural networks.
In this paper, we introduce a knowledge-enhanced attention model, called BLAN (Bayes-enhanced Lifelong Attention Network), for learning a sequence of sentiment classification tasks. The idea is that the model parameters of naïve Bayes are first generated from each task. These parameters are then transformed into knowledge, retained in the knowledge base, and later used to build lifelong attentions. Frequent Pattern Mining (Agrawal et al., 1994) is used to extract lifelong attentions. The extracted lifelong attentions are then used to enhance attention neural networks. This enhancement is especially useful when the training data of a task is small and attention neural networks alone do not work well. The experimental results demonstrate the superiority of the proposed model over multiple baseline methods. In summary, the paper makes the following contributions:
• The paper studies the problem of learning a sequence of sentiment classification tasks, where the tasks are learned incrementally.
• It proposes a Bayes-enhanced Lifelong Attention Network to address the aforementioned problem. The proposed method exploits the generative parameters of naïve Bayes to learn attention knowledge and the knowledge learned from each previous task is retained and later used to build lifelong attention to enhance neural networks for new task learning.
• Experimental results using two Amazon review datasets (involving 20 and 24 types of product domains) show that the proposed method achieves remarkable results.

Related Work
Lifelong learning, an advanced machine learning paradigm, has attracted significant research attention in a relatively short time; see the book (Chen and Liu, 2018). In brief, lifelong learning aims to imitate the human learning process and capability, i.e., accumulating the knowledge learned in the past and using it to help future learning and problem solving. Lifelong learning for sentiment classification (called lifelong sentiment classification) was first introduced in (Chen et al., 2015), which is based on optimization that exploits the knowledge from previous tasks to help future task learning. Following this work, Xia et al. (2017) investigated lifelong sentiment classification based on voting of individual task classifiers. Another work proposed a heuristic naïve Bayes that achieves forward knowledge transfer to improve future task learning, and backward knowledge transfer to improve the model of any past task without retraining on the past task's training data. Xu et al. (2020) proposed a continuous naïve Bayes learning framework for e-commerce product review sentiment classification by extending the parameter estimation mechanism of naïve Bayes. Lv et al. (2019) introduced a deep learning method for lifelong sentiment classification by jointly training two networks, a feature learning network and a knowledge retention network; it is the first lifelong sentiment classification method based on deep learning. We compare against this method in our experiments and show that it is inferior to our proposed method. (Shu et al., 2017; Wang et al., 2018a; Wang et al., 2018b) also proposed lifelong learning methods for sentiment classification, but they focus on aspect-based analysis, which differs from document-level sentiment classification (our focus in this paper). Several research topics are closely related to lifelong learning, most notably multi-task learning and transfer learning.
Multi-task learning vs. lifelong learning: Multi-task learning (Zhang and Yang, 2017) jointly optimizes the learning of multiple tasks. Although it is possible to make it continual, multi-task learning does not retain any knowledge. Lifelong learning is a continual learning process in which the learner learns tasks incrementally, accumulates knowledge from each task, and later uses it to help future task learning.
Transfer learning vs. lifelong learning: Transfer learning (Pan and Yang, 2010) uses labeled data from a source domain to help a target domain that has no or little labeled data. Unlike lifelong learning, transfer learning is not continual and has no knowledge retention. The source should be similar to the target (and is normally selected by the user). Domain adaptation is a type of transfer learning that establishes knowledge transfer from a labeled source domain to an unlabeled target domain. Domain adaptation with multiple sources (also called multi-source domain adaptation) aims to use labeled data collected from multiple sources to help target domain learning (Sun et al., 2015; Wu and Huang, 2016). However, these methods are only one-directional, i.e., the sources help the target, because the target has no or little labeled data.
Our work is also related to knowledge-enhanced attention neural networks. Knowledge-enhanced attention models have been introduced in (Chen et al., 2017; Zhou et al., 2018; Peters et al., 2019; Huang et al., 2020). However, they do not target the sentiment classification task. Although there is a large body of attention models for sentiment classification, as mentioned in the previous section, none of them uses enhanced attention mechanisms; they mainly use self-attention learned within the networks. Our proposed attention mechanism is a combination of self-attention and enhanced attention. We detail our attention mechanism and the proposed model next.

Proposed Model
The proposed BLAN model is a hierarchical architecture as shown in Figure 1. The key characteristic of BLAN is the Bayes-enhanced lifelong attention (to build knowledge-enhanced attention). This is done by using the generative model parameters P(w_i | c_k) of naïve Bayes, formulated as

P(w_i | c_k) = (λ + N_{c_k, w_i}) / (λ|V| + Σ_{w_j ∈ V} N_{c_k, w_j}),    (1)

where w_i is a word, c_k is the positive (+) or negative (-) class in our case, N_{c_k, w_i} is the number of times word w_i occurs in the training documents of class c_k, |V| is the size of the vocabulary V, and λ is the smoothing parameter. Below we provide an "ideal" case to illustrate how our model uses these parameters to perform Bayes-enhanced lifelong attention in a lifelong learning setting.

From Parameters to Knowledge (Attention): Recall the example in the introduction. Suppose the three words "good", "heavy" and "recommend" appear in the camera domain. We first transform the parameters to ratios as shown in Table 1, where R_{w|+} = P(w|+)/P(w|-) and R_{w|-} = P(w|-)/P(w|+). The ratios are then treated as attentions (i.e., knowledge) and retained in the Knowledge Base (KB). We use the ratios (not the original parameters) because we aim to reinforce the sentiment intensity of a sentiment word. Thus the KB stores two types of attention knowledge (the positive ratio R_{w|+} and the negative ratio R_{w|-}) for each task. From Table 1, we can see that the attention knowledge for each word is useful, meaningful and interpretable.
Word        P(w|+), P(w|-)    R_{w|+}, R_{w|-}
Good        0.8, 0.2          4.0, 1/4
Heavy       0.3, 0.7          3/7, 7/3
Recommend   0.9, 0.1          9.0, 1/9

Table 1: An "ideal" case from parameters to ratios. (Note: this is an ideal case. In practice, it is hard to achieve such a result for every word, but each word should have a relative sentiment polarity.)

Lifelong Attention: Suppose we are given a task sequence (T_1, ..., T_n, T_{n+1}, ...) and our model has learned the first n tasks. The above two types of knowledge (R_{w|+} and R_{w|-}) of each task have also been retained in the KB. The knowledge mining process is marked by the Knowledge Miner in Figure 1. Again, we use the terms domain and task interchangeably because, following existing lifelong sentiment classification work, we treat the classification in each domain as a learning task. We now focus on using the retained knowledge and the newly learned knowledge from the new/target task T_{n+1} to generate lifelong attentions (namely, lifelong word-level attention and lifelong sentence-level attention, as shown later) for the target task. As introduced earlier, we need to tackle the problems of domain generality and domain specificity. For each word w in the target task, we rely on the knowledge from all domains to address the former. To address the latter, we filter/remove the domain-specific knowledge from previous domains when helping new task/domain learning. Here we follow Wang et al. (2018a) and propose to filter the domain-specific knowledge using Frequent Pattern Mining (FPM) (Agrawal et al., 1994). A frequent pattern is a set of items that appears frequently (above a minimum frequency threshold, e.g., 10, called the minimum support) in a database of transactions. Taking positive attention knowledge as an example, we treat the positive attention knowledge of all words in each domain as items. The set of items in one attention distribution forms one transaction. Given all transactions, FPM finds the frequent items.
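The parameter-to-ratio transformation (Eq. (1) and Table 1) can be sketched as follows. The toy documents and smoothing value are illustrative, not the paper's data:

```python
from collections import Counter

def nb_word_probs(docs_by_class, lam=1.0):
    """Laplace-smoothed naive Bayes parameters P(w|c), as in Eq. (1)."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        probs[c] = {w: (lam + counts[w]) / (lam * len(vocab) + total)
                    for w in vocab}
    return probs

def attention_ratios(probs):
    """Transform the parameters to the retained ratios:
    R_{w|+} = P(w|+)/P(w|-) and R_{w|-} = P(w|-)/P(w|+)."""
    return {w: (probs["+"][w] / probs["-"][w],    # R_{w|+}
                probs["-"][w] / probs["+"][w])    # R_{w|-}
            for w in probs["+"]}

# toy training documents for one domain (illustrative)
docs = {"+": [["good", "recommend"], ["good"]],
        "-": [["heavy"], ["heavy", "heavy"]]}
ratios = attention_ratios(nb_word_probs(docs))
# "good" gets a positive ratio > 1 and a negative ratio < 1, as in Table 1
```

As in Table 1, a word's larger ratio indicates the class toward which it leans, which is exactly the sentiment-intensity signal the KB retains.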
Then, we treat the infrequent items as domain-specific knowledge and filter them out for each previous domain. Let R̂^t_{w|+} and R̂^t_{w|-} denote the retained ratios of word w after performing FPM on the t-th previous domain/task. The lifelong word-level attentions of this word for the target task T_{n+1} are formulated as

α^+_{l,w} = (1/n) Σ_{t=1}^{n} R̂^t_{w|+},    α^-_{l,w} = (1/n) Σ_{t=1}^{n} R̂^t_{w|-}.    (2)

Assume that all words in a sentence s are conditionally independent given the class (the naïve Bayes assumption). Then, the lifelong sentence-level attentions of s are formulated as

β^+_{l,s} = Π_{i=1}^{A} α^+_{l,w_i},    β^-_{l,s} = Π_{i=1}^{A} α^-_{l,w_i},    (3)

where A is the number of words in s.
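The filtering and aggregation described above can be sketched as follows. A simple support count stands in for a full FPM algorithm, and averaging the surviving ratios across domains is an illustrative aggregation choice; the thresholds are not the paper's:

```python
from collections import Counter

def lifelong_word_attention(domain_ratios, min_support=2):
    """Filter domain-specific knowledge with an FPM-style support count,
    then aggregate the surviving ratios over previous domains.
    domain_ratios: list of {word: ratio} dicts, one per previous domain."""
    support = Counter(w for ratios in domain_ratios for w in ratios)
    frequent = {w for w, c in support.items() if c >= min_support}
    att = {}
    for w in frequent:
        vals = [r[w] for r in domain_ratios if w in r]
        att[w] = sum(vals) / len(vals)   # averaging is an illustrative choice
    return att

def lifelong_sentence_attention(word_att, sentence, default=1.0):
    """Product over word attentions, following the naive Bayes
    conditional-independence assumption of Eq. (3)."""
    score = 1.0
    for w in sentence:
        score *= word_att.get(w, default)
    return score

# positive-ratio knowledge retained from three previous domains (toy values)
domains = [{"good": 4.0, "heavy": 0.4},
           {"good": 3.0, "cheap": 2.0},
           {"good": 5.0, "heavy": 2.5}]
att = lifelong_word_attention(domains)
```

Here "cheap" appears in only one domain, so it is treated as domain-specific and filtered out, while "good" survives with a strong positive attention.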
Bayes-enhanced Lifelong Attention Network: When the target task T_{n+1} arrives, our model first extracts the above two types of lifelong attentions. The extracted lifelong attentions are used to enhance the hierarchical neural network. Given a document d with Q sentences {s_1, ..., s_Q}, where sentence s_j contains words {w^j_1, ..., w^j_A} (j indexes the j-th sentence), the proposed BLAN works as follows. Word-level learner (WLL): For each word in sentence s_j, WLL first looks up the embedding vector from pre-trained word embeddings. The embedding vectors of all words in the sentence are fed into a bidirectional GRU (Bi-GRU) (Bahdanau et al., 2014). WLL then learns a representation for each word (e.g., w^j_i) by concatenating the forward hidden state →h^j_i and the backward hidden state ←h^j_i. The representations of the words in sentence s_j, with positive and negative attentions, are fed into a positive sentence-level learner (pSLL in Figure 1) and a negative sentence-level learner (nSLL in Figure 1), respectively.
Let H^j_w = [h^j_1, ..., h^j_A] and w^j = [w^j_1, ..., w^j_A]. The network attentions α̂_{w^j} in vector form are formulated as

α̂_{w^j} = softmax(v_w^T tanh(W_w H^j_w + b_w)),    (4)

where tanh(·) is the hyperbolic tangent function and W_w, b_w and v_w are trainable parameters. From Eq. (4), every attention value of the network is bounded in magnitude O(1) because of the softmax. To keep the lifelong attentions on the same scale, we also apply softmax to the lifelong word-level attentions:

α^+_{l,w^j} = softmax(α^+_{l,w^j}),    α^-_{l,w^j} = softmax(α^-_{l,w^j}),    (5)

where α^+_{l,w^j} and α^-_{l,w^j} are the lifelong word-level attentions of the words in w^j computed using Eq. (2).
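A minimal numerical sketch of the additive attention in Eq. (4) and the softmax rescaling of Eq. (5); the dimensions and random parameter values are illustrative, not trained ones:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def network_attention(H, W, b, v):
    """Additive attention over word hidden states, as in Eq. (4).
    H: (A, 2h) Bi-GRU states; W: (2h, 2h); b: (2h,); v: (2h,)."""
    u = np.tanh(H @ W + b)   # (A, 2h) hidden projection
    return softmax(u @ v)    # (A,) attention weights summing to 1

rng = np.random.default_rng(0)
A, h2 = 5, 8                                  # 5 words, hidden size 2h = 8
H = rng.normal(size=(A, h2))
alpha = network_attention(H, rng.normal(size=(h2, h2)),
                          rng.normal(size=h2), rng.normal(size=h2))

# lifelong word attentions (toy ratios) rescaled via softmax, as in Eq. (5)
lifelong = softmax(np.array([4.0, 3 / 7, 9.0, 1.0, 1.0]))
```

Both attention vectors now lie on the same simplex scale, which is the precondition for combining them in the enhancement step.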
For each word w^j_i (i = 1, ..., A) in w^j, we propose to use the lifelong word-level attention to enhance the network's attention:

α^+_{w^j_i} = α̂_{w^j_i} + α^+_{l,w^j_i},    (6)
α^-_{w^j_i} = α̂_{w^j_i} + α^-_{l,w^j_i},    (7)

where α^+_{w^j_i} and α^-_{w^j_i} are the final attentions on word w^j_i. The positive and negative unified representations of the words in sentence s_j are then respectively formulated as

s^+_j = Σ_{i=1}^{A} α^+_{w^j_i} h^j_i,    s^-_j = Σ_{i=1}^{A} α^-_{w^j_i} h^j_i.    (8)

Sentence-level learner (SLL): Two SLLs (positive and negative, i.e., pSLL and nSLL in Figure 1) receive the output (Eq. (8)) of WLL. Similar to WLL, pSLL and nSLL use Bi-GRUs to learn sentence representations and compute the network attentions analogously to Eq. (4), where W^+, b^+, v^+ and W^-, b^-, v^- are trainable parameters. As introduced above, we use Eq. (3) to compute the lifelong sentence-level attentions β^+_{l,s} and β^-_{l,s} for the sentences s in document d. These lifelong sentence-level attentions are further normalized as β^+_{l,s} = softmax(β^+_{l,s}), β^-_{l,s} = softmax(β^-_{l,s}).
Similar to WLL, each SLL also learns a unified representation of the sentences {s_1, ..., s_Q} in document d using knowledge-enhanced attentions. For each sentence s_j (j = 1, ..., Q), we use the Bayes-enhanced lifelong sentence-level attention to enhance the network's attention:

β^+_{s_j} = β̂^+_{s_j} + β^+_{l,s_j},    β^-_{s_j} = β̂^-_{s_j} + β^-_{l,s_j},    (9)

where β^+_{s_j} and β^-_{s_j} are the final attentions on sentence s_j. The unified representations of all sentences in document d are then formulated as

d^+ = Σ_{j=1}^{Q} β^+_{s_j} h^+_j,    d^- = Σ_{j=1}^{Q} β^-_{s_j} h^-_j.    (10)

Document-level learner (DLL): The learned d^+ and d^- are fed into a fully connected layer (FCL) to produce the positive sentiment score P_{+|d} and the negative sentiment score P_{-|d} of the document. DLL predicts the class label by arg max{P_{+|d}, P_{-|d}}. For the objective function, the motivation is that the proposed model should assign a large positive sentiment score P_{+|d} (and a small or even zero negative sentiment score P_{-|d}) to a positive document, and the reverse to a negative document. Thus, we propose to maximize the objective function

Π_{d ∈ D} θ(d)^{δ(y_d = +)} θ̄(d)^{δ(y_d = -)},    (11)

where D are the training documents, θ(d) = sigmoid(P_{+|d} - P_{-|d}), θ̄(d) = sigmoid(P_{-|d} - P_{+|d}), y_d is the label of d, and δ(z) = 1 if z holds and 0 otherwise. In practice, we solve the corresponding logarithmic minimization problem with a regularization term. Formally, the objective used to train our model is

L = - Σ_{d ∈ D} [δ(y_d = +) log θ(d) + δ(y_d = -) log θ̄(d)] + λ ||Θ||^2_2,    (12)

where λ is a regularization parameter (set to 0.01 in training) and Θ denotes the trainable parameters. For the new/target domain T_{n+1}, the training process of the proposed BLAN model is presented in Algorithm 1.
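The logarithmic form of the training objective can be sketched as follows. The score pairs, 0/1 label encoding and flat parameter list are illustrative stand-ins for the network's outputs and weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def blan_loss(scores, labels, params, lam=0.01):
    """Logarithmic minimization form of the objective: push P(+|d)
    above P(-|d) for positive documents and the reverse for negative
    ones, plus an L2 regularization term.
    scores: list of (P_pos, P_neg) pairs; labels: 1 for +, 0 for -."""
    loss = 0.0
    for (p_pos, p_neg), y in zip(scores, labels):
        theta = sigmoid(p_pos - p_neg)   # confidence that d is positive
        # note sigmoid(p_neg - p_pos) == 1 - theta, so both cases fold in
        loss += -math.log(theta) if y == 1 else -math.log(1.0 - theta)
    return loss + lam * sum(p * p for p in params)

# a well-separated positive document incurs a small loss,
# a wrongly ordered one a large loss
good = blan_loss([(5.0, 0.0)], [1], params=[0.1])
bad = blan_loss([(0.0, 5.0)], [1], params=[0.1])
```

The fold of θ̄(d) into 1 - θ(d) uses the identity sigmoid(-z) = 1 - sigmoid(z), so the sketch matches the two-term sum term by term.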

Datasets
Since the work most closely related to ours is lifelong sentiment classification (Chen et al., 2015; Xia et al., 2017; Lv et al., 2019), we evaluate sentiment classification using the same dataset as (Chen et al., 2015); Lv et al. (2019) also used this dataset. The dataset contains Amazon reviews of 20 types of product domains, denoted by Amazon(20). Each domain consists of 1,000 reviews, each assigned a sentiment label, i.e., positive (+) or negative (-). We did not use the dataset of (Xia et al., 2017) because their data for each task come from the same domain. In addition, we sampled a large Amazon review dataset from the SNAP corpus, which contains product reviews from 24 types of product domains. Following existing works (Pang et al., 2002; Blitzer et al., 2007; Chen et al., 2015), we treat reviews with rating > 3 as positive and reviews with rating < 3 as negative. We randomly sampled 10,000 reviews for each domain from the SNAP corpus. The sampled dataset is denoted by SNAP(24). Table 2: Name of each domain and the proportion of negative reviews in each domain.
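The rating-to-label convention above can be sketched as follows. Excluding 3-star reviews is one common reading of this convention, stated here as an assumption rather than the paper's exact rule:

```python
def rating_to_label(rating):
    """Map star ratings to sentiment labels following the convention
    of Pang et al. (2002): > 3 is positive, < 3 is negative.
    3-star reviews are excluded (an assumption of this sketch)."""
    if rating > 3:
        return "+"
    if rating < 3:
        return "-"
    return None  # neutral reviews are dropped

reviews = [(5, "great"), (1, "broke"), (3, "okay"), (4, "fine")]
labeled = [(text, rating_to_label(r)) for r, text in reviews
           if rating_to_label(r) is not None]
```

Applied to the toy reviews above, only the clearly positive and negative ones survive labeling.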
The name of each domain in the datasets Amazon(20) and SNAP(24), and the proportion of negative reviews in each domain, are shown in Table 2. The proportion of negative reviews per domain is in the range [4%, 31%]. It is worth stressing that the negative reviews in each domain of both datasets are minorities and thus very hard to classify.

Baselines
In this work, we use naïve Bayes (NB) parameters to enhance attention neural networks and propose a lifelong sentiment classification model. So, we compare with three types of methods, namely naïve Bayes, neural networks, and lifelong sentiment classification.
• Naïve Bayes: we compare our model with the classic NB method. We also compare with a strong variant of NB, called NBSVM (Wang and Manning, 2012). There are two models of NBSVM with uni-grams and bi-grams, denoted by NBSVM-uni and NBSVM-bi, respectively.
For our BLAN, we also created a variant without the Bayes-enhanced attentions, denoted by nBLAN. nBLAN is used to evaluate whether our Bayes-enhanced attentions are helpful.

Settings
Our experiment settings and model parameters are set as follows.
Training set, validation set and test set. For each dataset, the review documents in each domain are partitioned into training, validation and test sets with 80%, 10% and 10%, respectively. For each review document, we use NLTK to split it into sentences, then tokenize and lemmatize each sentence into a sequence of words. In building the vocabulary of each domain, we filter out words that appear fewer than 3 (and 5) times in the training data of each domain of Amazon(20) (and SNAP(24)).
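The preprocessing and vocabulary construction can be sketched as follows. Simple string splitting stands in for the NLTK sentence splitter, tokenizer and lemmatizer used in the paper, and the minimum-count threshold is applied per domain:

```python
from collections import Counter

def preprocess(text):
    """Split into sentences and tokenize. A crude stand-in for the
    NLTK sentence splitter, tokenizer and lemmatizer."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s.lower().split() for s in sentences]

def build_vocab(train_docs, min_count=3):
    """Keep only words appearing at least min_count times in a domain's
    training data (3 for Amazon(20), 5 for SNAP(24))."""
    counts = Counter(w for doc in train_docs for sent in doc for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

docs = [preprocess("Good camera. Good lens. A bit heavy."),
        preprocess("Good value. Heavy but good.")]
vocab = build_vocab(docs, min_count=3)
```

With this tiny toy corpus, only "good" (4 occurrences) clears the threshold; in practice each domain's training split is large enough to retain a full vocabulary.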

Table 3: ACC of both classes and F1 of the negative class for all methods under the non-lifelong and lifelong SC evaluations on Amazon(20) and SNAP(24). Note that the F1 of the negative class is more reliable than the ACC of both classes, as negative reviews are minorities in each dataset.
Model parameters. For NB, the smoothing parameter is set to 1 (Laplacian smoothing). For the other non-neural-network models (i.e., NBSVM-uni, NBSVM-bi, LLV, LSC and LNB), we use their default parameter settings. In addition, we followed Pang et al. (2002) in handling negation words, as performed in (Chen et al., 2015), for NB, LSC and LNB. For the neural network models (i.e., Conv-GRNN, LSTM-GRNN, HAN, SRK, nBLAN and BLAN), we use pre-trained GloVe.840B (Pennington et al., 2014) to initialize the word embeddings, with embedding dimension 300. The number of convolutional filters is 3, with kernel sizes 1, 2 and 3, respectively. The size of the LSTM and GRU hidden states is 300. The network parameters are updated using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001, and gradients are clipped using a norm of 0.9 during the Adam optimization. The dropout rate is set to 0.5 in the input layer. We use accuracy (ACC) over both classes in evaluation. We also use F1 of the negative class because the number of reviews in the negative class is small (see Table 2) and thus hard to classify.

Result Analysis
We evaluate sentiment classification performance in both the non-lifelong setting and the lifelong setting. In the non-lifelong setting, we train and test each system on each single domain (with no knowledge from other domains). In the lifelong setting, we follow (Chen et al., 2015) and treat each domain in turn as the future/target domain, with the remaining domains as past domains. That is, we run each system 20 (and 24) times on the dataset Amazon(20) (and SNAP(24)). Note that the non-lifelong baselines are tailored for a single domain. For a fair comparison, we combine the training data of all domains (past and target) to train each non-lifelong model and test it on each target domain's test data. The average ACC of both classes and F1 of the negative class over the 20 domains of Amazon(20) and the 24 domains of SNAP(24) are shown in Table 3. From the results, we make the following observations:
• The results of each method in the lifelong setting are better than in the non-lifelong setting on each dataset. This shows that lifelong learning is a promising research direction.
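The lifelong evaluation protocol above (each domain as the target, the rest as past domains) can be sketched as follows; `train_fn` and `test_fn` are hypothetical placeholders for model training and scoring:

```python
def lifelong_evaluation(domains, train_fn, test_fn):
    """Treat each domain in turn as the target, with all remaining
    domains as past domains, as in Chen et al. (2015).
    train_fn(past, target) -> model; test_fn(model, target) -> score."""
    scores = {}
    for target in domains:
        past = [d for d in domains if d != target]
        model = train_fn(past, target)
        scores[target] = test_fn(model, target)
    return scores

# toy run: the "score" is just how many past domains each run saw
runs = lifelong_evaluation(
    ["camera", "books", "kitchen"],
    train_fn=lambda past, target: {"past": past},
    test_fn=lambda model, target: len(model["past"]))
```

With 20 (or 24) domains this loop yields exactly the 20 (or 24) runs per system described above.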
• Our BLAN achieves the best ACC and F1 on both datasets. The results clearly show the superiority of BLAN.
• BLAN markedly outperforms the existing lifelong sentiment classification baselines (i.e., LLV, LSC, LNB and SRK), setting a strong new bar for this task.
• Attention based baseline (i.e., HAN) is inferior to our BLAN, especially on the smaller dataset Amazon(20). nBLAN (a variant of BLAN without Bayes-enhanced attention) is also inferior to BLAN. This shows that Bayes-enhanced attention (i.e., lifelong attention) works and is useful.
• Most of the neural network baselines (i.e., Conv-GRNN, LSTM-GRNN, HAN and SRK) are slightly superior to the non-neural-network baselines (i.e., NB, NBSVM-uni, NBSVM-bi, LLV, LSC and LNB), especially on the large dataset SNAP(24). This shows that neural-network-based methods are better at handling sentiment classification with large-scale data.

Case Study
Recall the example given in the introduction: "I bought this camera last month. It works good although a little heavy. Would recommend this." The final learned lifelong attentions from the dataset Amazon(20) are shown in Figure 2, where each number above the axis corresponds to a sentence ID and its location on the axis represents the attention value (the value below the axis) of that sentence. The color depth in the textbox represents the attention value of each word. We can see that our model works well for attention learning, and the learned lifelong attentions are interpretable and consistent with human intuition.

Conclusions
In this work, we studied deep learning for a sequence of sentiment classification tasks. After each task is learned, its knowledge is retained to help subsequent task learning. This problem is called lifelong sentiment classification. We proposed a novel knowledge-enhanced attention network model for this problem. The proposed model builds two types of lifelong attentions by exploiting the model parameters of naïve Bayes, and the built lifelong attentions are used to enhance the attention networks. Experimental results showed that the proposed model markedly outperforms a wide range of baselines. The proposed model handles the two-class (positive and negative) problem; in future work, we plan to extend it to multi-class problems. We also plan to address the catastrophic forgetting problem (Kirkpatrick et al., 2017) in neural networks to bring our method to a new height.