Lifelong Learning CRF for Supervised Aspect Extraction

This paper makes a focused contribution to supervised aspect extraction. It shows that if the system has performed aspect extraction from many past domains and retained their results as knowledge, Conditional Random Fields (CRF) can leverage this knowledge in a lifelong learning manner to extract in a new domain markedly better than the traditional CRF without using this prior knowledge. The key innovation is that even after CRF training, the model can still improve its extraction with experiences in its applications.


Introduction
Aspect extraction is a key task of opinion mining (Liu, 2012). It extracts opinion targets from opinion text. For example, from the sentence "The battery is good", it aims to extract "battery", which is a product feature, also called an aspect.
Aspect extraction is commonly done using a supervised or an unsupervised approach. The unsupervised approach includes methods such as frequent pattern mining (Hu and Liu, 2004; Popescu and Etzioni, 2005; Zhu et al., 2009), syntactic rules-based extraction (Zhuang et al., 2006; Wang and Wang, 2008; Wu et al., 2009; Zhang et al., 2010; Qiu et al., 2011; Poria et al., 2014), topic modeling (Mei et al., 2007; Titov and McDonald, 2008; Brody and Elhadad, 2010; Wang et al., 2010; Moghaddam and Ester, 2011; Mukherjee and Liu, 2012; Lin and He, 2009; Zhao et al., 2010; Jo and Oh, 2011; Fang and Huang, 2012; Wang et al., 2016), word alignment (Liu et al., 2013), label propagation (Zhou et al., 2013; Shu et al., 2016), and others (Zhao et al., 2015). This paper focuses on the supervised approach (Jakob and Gurevych, 2010; Choi and Cardie, 2010; Mitchell et al., 2013) using Conditional Random Fields (CRF) (Lafferty et al., 2001). It shows that the results of CRF can be significantly improved by leveraging prior knowledge automatically mined from the extraction results of previous domains, including domains without labeled data. The improvement is possible because, although every product (domain) is different, a fair number of aspects are shared across domains. For example, every review domain has the aspect price, and reviews of many products have the aspect battery life or screen. Those shared aspects may not appear in the training data but do appear in unlabeled data and the test data. We can exploit such sharing to help CRF perform much better.
Because we leverage knowledge gained in the past to help extraction in the new domain, we are using the idea of lifelong machine learning (LML) (Thrun, 1998; Silver et al., 2013), which is a continuous learning paradigm that retains the knowledge learned in the past and uses it to help future learning and problem solving with possible adaptations.
The setting of the proposed approach, L-CRF (Lifelong CRF), is as follows. A CRF model $M$ has been trained on a labeled training review dataset. At a particular point in time, $M$ has extracted aspects from data in $n$ previous domains $D_1, \ldots, D_n$ (which are unlabeled), and the extracted sets of aspects are $A_1, \ldots, A_n$. Now the system is faced with data from a new domain, $D_{n+1}$. $M$ can leverage reliable prior knowledge in $A_1, \ldots, A_n$ to extract from $D_{n+1}$ better than it could without this knowledge.
The key innovation of L-CRF is that even after supervised training, the model can still improve its extraction in testing or in its applications with experience. Note that L-CRF is different from semi-supervised learning (Zhu, 2005) because the $n$ previous (unlabeled) domain datasets used in extraction are not used, or not available, during model training.
There are prior LML works for aspect extraction, but they are all unsupervised methods. Supervised LML methods exist (Ruvolo and Eaton, 2013), but they are for classification rather than for sequence learning or labeling like CRF. A semi-supervised LML method is used in NELL (Mitchell et al., 2015), but it is heuristic and pattern-based. It does not use sequence learning and is not for aspect extraction. LML is related to transfer learning and multi-task learning (Pan and Yang, 2010), but they are also quite different (see  for details).
To the best of our knowledge, this is the first paper that uses LML to help a supervised extraction method to markedly improve its results.

Conditional Random Fields
CRF learns from an observation sequence $x$ to estimate a label sequence $y$: $p(y|x; \theta)$, where $\theta$ is a set of weights. Let $l$ denote a position in the sequence. The core parts of CRF are a set of feature functions $F = \{f_h(y_l, y_{l-1}, x_l)\}_{h=1}^{H}$ and their corresponding weights $\theta = \{\theta_h\}_{h=1}^{H}$.
Feature Functions: We use two types of feature functions (FF). One is the Label-Label (LL) FF:
$$f^{LL}_{ij}(y_l, y_{l-1}) = \mathbb{1}\{y_l = i\}\,\mathbb{1}\{y_{l-1} = j\}, \quad \forall i, j \in Y, \tag{1}$$
where $Y$ is the set of labels and $\mathbb{1}\{\cdot\}$ is an indicator function. The other is the Label-Word (LW) FF:
$$f^{LW}_{iv}(y_l, x_l) = \mathbb{1}\{y_l = i\}\,\mathbb{1}\{x_l = v\}, \quad \forall i \in Y, \forall v \in V, \tag{2}$$
where $V$ is the vocabulary. This FF returns 1 when the $l$-th word is $v$ and the $l$-th label is $v$'s specific label $i$; otherwise it returns 0. $x_l$ is the current word and is represented as a multi-dimensional vector. Each dimension in the vector is a feature of $x_l$.
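For reference, the feature functions and weights combine in the standard linear-chain CRF form (Lafferty et al., 2001) sketched below. The sequence length $L$ and the normalizer $Z(x)$ are notation introduced here for illustration and are not defined in the text above.

$$p(y \mid x; \theta) = \frac{1}{Z(x)} \exp\Big( \sum_{l=1}^{L} \sum_{h=1}^{H} \theta_h \, f_h(y_l, y_{l-1}, x_l) \Big), \qquad Z(x) = \sum_{y'} \exp\Big( \sum_{l=1}^{L} \sum_{h=1}^{H} \theta_h \, f_h(y'_l, y'_{l-1}, x_l) \Big)$$

Each feature function contributes its weight $\theta_h$ to the score of a labeling whenever it fires, and $Z(x)$ sums over all candidate label sequences $y'$ so that the scores form a probability distribution.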
Following the previous work in (Jakob and Gurevych, 2010), we use the feature set {W, -1W, +1W, P, -1P, +1P, G}, where W is the word and P is its POS-tag, -1W is the previous word, -1P is its POS-tag, +1W is the next word, +1P is its POS-tag, and G is the generalized dependency feature.
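As a rough illustration of this feature set, the sketch below builds the seven features for one position of a tagged sentence. The function name `token_features`, the padding value, and the placeholder G values are assumptions made for this example; in particular, G is treated here as one precomputed string per token, which simplifies how dependency patterns are actually obtained.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical value for positions outside the sentence


def token_features(words: List[str], pos_tags: List[str],
                   dep_patterns: List[str], l: int) -> Dict[str, str]:
    """Build the {W, -1W, +1W, P, -1P, +1P, G} features for position l."""
    def at(seq: List[str], i: int) -> str:
        # Return the element at index i, or the padding value if out of range.
        return seq[i] if 0 <= i < len(seq) else PAD

    return {
        "W": words[l],
        "-1W": at(words, l - 1),
        "+1W": at(words, l + 1),
        "P": pos_tags[l],
        "-1P": at(pos_tags, l - 1),
        "+1P": at(pos_tags, l + 1),
        "G": dep_patterns[l],
    }


words = ["The", "battery", "is", "good"]
pos_tags = ["DT", "NN", "VBZ", "JJ"]
dep_patterns = ["g0", "g1", "g2", "g3"]  # placeholders, not real patterns
features = token_features(words, pos_tags, dep_patterns, 1)
```

For the word "battery", the sketch yields W = "battery", -1W = "The", +1W = "is", and the matching POS tags.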
Under the Label-Word FF type, we have two sub-types of FF: Label-dimension FF and Label-G FF. Label-dimension FF is for the first 6 features, and Label-G is for the G feature.
The Label-dimension (Ld) FF is defined as
$$f^{Ld}_{i v^d}(y_l, x_l) = \mathbb{1}\{y_l = i\}\,\mathbb{1}\{x^d_l = v^d\}, \quad \forall i \in Y, \forall v^d \in V^d, \tag{3}$$
where $V^d$ is the set of observed values of feature $d \in \{W, -1W, +1W, P, -1P, +1P\}$, and we call $V^d$ feature $d$'s feature values. Eq. (3) is a FF that returns 1 when $x_l$'s feature $d$ equals the feature value $v^d$ and the variable $y_l$ (the $l$-th label) equals the label value $i$; otherwise it returns 0. We describe G and its feature function next, which also holds the key to the proposed L-CRF.
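The indicator behaviour of a Label-dimension FF can be sketched in a few lines. The factory name `make_label_dimension_ff` and the label values "ASPECT" and "OTHER" are hypothetical and chosen only for this example; the text does not name the label set.

```python
from typing import Callable, Dict

Features = Dict[str, str]


def make_label_dimension_ff(d: str, v_d: str, i: str) -> Callable[[str, Features], int]:
    """Return an indicator that is 1 iff y_l == i and feature d of x_l == v_d."""
    def ff(y_l: str, x_l: Features) -> int:
        return int(y_l == i and x_l.get(d) == v_d)
    return ff


# One feature function for dimension W, value "battery", label "ASPECT".
ff = make_label_dimension_ff("W", "battery", "ASPECT")
x_l = {"W": "battery", "P": "NN"}
fires = ff("ASPECT", x_l)        # label and feature value both match
silent = ff("OTHER", x_l)        # label does not match
```

A trained CRF holds one weight per such function, so the number of functions grows with the number of (label, feature value) pairs observed in the data.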

General Dependency Feature (G)
Feature G uses generalized dependency relations. What is interesting about this feature is that it enables L-CRF to use past knowledge in its sequence prediction at the test time in order to perform much better. This will become clear shortly. This feature takes a dependency pattern as its value, which is generalized from dependency relations.
The general dependency feature (G) of the variable $x_l$ takes a set of feature values $V^G$. Each feature value $v^G$ is a dependency pattern. The Label-G (LG) FF is defined as
$$f^{LG}_{i v^G}(y_l, x_l) = \mathbb{1}\{y_l = i\}\,\mathbb{1}\{x^G_l = v^G\}, \quad \forall i \in Y, \forall v^G \in V^G. \tag{4}$$
Such a FF returns 1 when the dependency feature of the variable $x_l$ equals a dependency pattern $v^G$ and the variable $y_l$ equals the label value $i$.

Dependency Relation
Dependency relations have been shown to be useful in many sentiment analysis applications (Johansson and Moschitti, 2010; Jakob and Gurevych, 2010).
A dependency relation is a 5-tuple: (type, gov, govpos, dep, deppos), where type is the type of the dependency relation, gov is the governor word, govpos is the POS tag of the governor word, dep is the dependent word, and deppos is the POS tag of the dependent word. The $l$-th word can be either the governor word or the dependent word in a dependency relation.
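One possible in-memory representation of such a relation is a named tuple, sketched below. The class name `DependencyRelation` and the helper `role_of` are assumptions for illustration; the example relation is the (nmod, battery, NN, camera, NN) relation used in this paper.

```python
from typing import NamedTuple, Optional


class DependencyRelation(NamedTuple):
    """A relation (type, gov, govpos, dep, deppos) as described in the text."""
    type: str
    gov: str
    govpos: str
    dep: str
    deppos: str

    def role_of(self, word: str) -> Optional[str]:
        """Return 'gov' or 'dep' if the word takes part in the relation, else None."""
        if word == self.gov:
            return "gov"
        if word == self.dep:
            return "dep"
        return None


rel = DependencyRelation("nmod", "battery", "NN", "camera", "NN")
role = rel.role_of("battery")
```

Matching by surface form, as `role_of` does, is a simplification: a real parser output would identify words by position in the sentence.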

Dependency Pattern
We generalize dependency relations into dependency patterns using the following steps:
1. For each dependency relation, replace the current word (governor word or dependent word) and its POS tag with a wildcard, since we already have the word (W) and the POS tag (P) features.
2. Replace the context word (the word other than the $l$-th word) in each dependency relation with a knowledge label to form a more general dependency pattern. Let the set of aspects annotated in the training data be $K_t$. If the context word in the dependency relation appears in $K_t$, we replace it with the knowledge label 'A' (aspect); otherwise with 'O' (other).
For example, consider the sentence "The battery of this camera is great." The dependency relations are given in Table 1. Assume the current word is "battery" and that "camera" is annotated as an aspect. The original dependency relation between "camera" and "battery" produced by a parser is (nmod, battery, NN, camera, NN). Note that we do not use the word positions in the relations in Table 1. First, since the current word's information (the word itself and its POS tag) in the dependency relation is redundant, we replace it with a wildcard. The relation becomes (nmod, *, camera, NN). Second, since "camera" is in $K_t$, we replace "camera" with the knowledge label 'A'. The final dependency pattern becomes (nmod, *, A, NN).
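The two generalization steps can be sketched as a small function. This is a minimal sketch under stated assumptions: the name `to_pattern` is hypothetical, words are matched by surface form rather than by position, and the layout of the pattern when the current word is the dependent, (type, label, govpos, *), is assumed by symmetry with the governor case shown in the example.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str, str, str]  # (type, gov, govpos, dep, deppos)


def to_pattern(rel: Relation, current: str, known_aspects: Set[str]) -> Tuple[str, ...]:
    """Generalize a dependency relation into a dependency pattern."""
    rel_type, gov, govpos, dep, deppos = rel
    if current == gov:
        # Step 1: the current word and its POS tag collapse into one wildcard.
        # Step 2: the context word becomes 'A' if it is a known aspect, else 'O'.
        label = "A" if dep in known_aspects else "O"
        return (rel_type, "*", label, deppos)
    if current == dep:
        label = "A" if gov in known_aspects else "O"
        return (rel_type, label, govpos, "*")  # assumed layout for this case
    raise ValueError("current word does not take part in the relation")


rel = ("nmod", "battery", "NN", "camera", "NN")
with_knowledge = to_pattern(rel, "battery", {"camera"})
without_knowledge = to_pattern(rel, "battery", set())
```

With "camera" in the knowledge set the result is (nmod, *, A, NN), matching the example; with an empty set the same relation yields (nmod, *, O, NN). The only thing that changes between the two calls is the knowledge set, which is the hook that lets a fixed model benefit from a larger set of known aspects.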
We now explain why dependency patterns can enable a CRF model to leverage past knowledge. The key is the knowledge label 'A' above, which indicates a likely aspect. Recall that in our problem setting, when we need to extract from the new domain $D_{n+1}$ using a trained CRF model $M$, we have already extracted from many previous domains $D_1, \ldots, D_n$ and retained their extracted sets of aspects $A_1, \ldots, A_n$. We can then mine reliable aspects from $A_1, \ldots, A_n$ and add them to $K_t$, which yields many more knowledge labels 'A' in the dependency patterns of the new data $D_{n+1}$ because aspects are shared across domains. This enriches the dependency pattern features, which consequently allows more aspects to be extracted from the new domain $D_{n+1}$.
We now present the L-CRF algorithm. As the dependency patterns for the general dependency feature do not use any actual words and can also use the prior knowledge, they are quite powerful for cross-domain extraction (the test domain is not used in training).
Let $K$ be a set of reliable aspects mined from the aspects extracted in past domain datasets using the CRF model $M$. Note that we assume that $M$ has already been trained using some labeled training data $D_t$. Initially, $K$ is $K_t$ (the set of all annotated aspects in the training data $D_t$). The more domains $M$ has worked on, the more aspects it extracts, and the larger the set $K$ gets. When faced with a new domain $D_{n+1}$, $K$ allows the general dependency feature to generate more dependency patterns related to aspects, because more context words receive the knowledge label 'A', as explained in the previous section. Consequently, CRF has more informed features to produce better extraction results.
L-CRF works in two phases: a training phase and a lifelong extraction phase. The training phase trains a CRF model $M$ using the training data $D_t$, which is the same as normal CRF training and will not be discussed further. In the lifelong extraction phase, $M$ is used to extract aspects from incoming domains ($M$ does not change and the domain data are unlabeled). All the results from the domains are retained in the past aspect store $S$. At a particular time, it is assumed that $M$ has been applied to $n$ past domains and is now faced with the $(n+1)$-th domain. L-CRF uses $M$ and reliable aspects (denoted $K_{n+1}$) mined from $S$, together with $K_t$ (so $K = K_t \cup K_{n+1}$), to extract from $D_{n+1}$. Note that the aspects $K_t$ from the training data are always considered reliable because they are manually labeled; thus $K_t$ is a subset of $K$. We cannot use all extracted aspects from past domains as reliable aspects because they contain many extraction errors. But those aspects that appear in multiple past domains are more likely to be correct. Thus $K$ contains those frequent aspects in $S$. The lifelong extraction phase is given in Algorithm 1.
Table 2: Annotation details of the datasets
Lifelong Extraction Phase: Algorithm 1 performs extraction on $D_{n+1}$ iteratively.

1. It generates features ($F$) on the data $D_{n+1}$ (line 3), and applies the CRF model $M$ on $F$ to produce a set of aspects $A_{n+1}$ (line 4).
2. $A_{n+1}$ is added to $S$, the past aspect store. From $S$, we mine a set of frequent aspects $K_{n+1}$. The frequency threshold is $\lambda$.
3. If $K_{n+1}$ is the same as $K_p$ from the previous iteration, the algorithm exits as no new aspects can be found. We use an iterative process because each extraction gives new results, which may increase the size of $K$, the reliable past aspects or past knowledge. The increased $K$ may produce more dependency patterns, which can enable more extractions.
4. Else: some additional reliable aspects are found. $M$ may extract additional aspects in the next iteration. Lines 10 and 11 update the two sets for the next iteration.
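The iterative steps above can be sketched as follows. This is a hedged reconstruction from the step descriptions, not the paper's Algorithm 1 listing. Assumptions: the names `frequent_aspects`, `lifelong_extract`, and `toy_extract` are hypothetical; an aspect counts as frequent when it occurs in at least $\lambda$ stored domain results; the current result is counted together with the store on each pass and appended only once at the end; `max_iters` is an added safety guard; and the toy extractor merely stands in for feature generation plus the trained CRF model.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

Data = List[Tuple[str, str]]  # toy data: (candidate word, context word) pairs


def frequent_aspects(store: List[Set[str]], lam: int) -> Set[str]:
    """Aspects that occur in at least `lam` of the stored domain results."""
    counts = Counter(a for aspects in store for a in aspects)
    return {a for a, c in counts.items() if c >= lam}


def lifelong_extract(extract: Callable[[Data, Set[str]], Set[str]], data: Data,
                     k_t: Set[str], store: List[Set[str]],
                     lam: int = 2, max_iters: int = 20) -> Tuple[Set[str], Set[str]]:
    k, k_prev, a_new = set(k_t), set(), set()
    for _ in range(max_iters):
        a_new = extract(data, k)                        # steps 1: features + model
        k_new = frequent_aspects(store + [a_new], lam)  # step 2: mine K_{n+1}
        if k_new == k_prev:                             # step 3: nothing new, stop
            break
        k = set(k_t) | k_new                            # step 4: K = K_t U K_{n+1}
        k_prev = k_new
    store.append(a_new)                                 # retain the final result in S
    return a_new, k


def toy_extract(data: Data, k: Set[str]) -> Set[str]:
    # Stand-in for the CRF: extract a candidate when its context word is known.
    return {cand for cand, ctx in data if ctx in k}


store = [{"screen", "price"}, {"screen", "size"}, {"price", "lens"}]
data = [("life", "battery"), ("resolution", "screen"), ("quality", "resolution")]
aspects, knowledge = lifelong_extract(toy_extract, data, {"battery"}, store, lam=2)
```

In this toy run the first pass extracts only "life"; mining the store then adds "screen" and "price" to the knowledge, so the second pass also extracts "resolution". The word "quality" is never extracted because "resolution" occurs in only one domain result and so never becomes reliable.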

Experiments
We now evaluate the proposed L-CRF method and compare with baselines.

Evaluation Datasets
We use two types of data for our experiments. The first type consists of seven (7) annotated benchmark review datasets from 7 domains (types of products). Since they are annotated, they are used in training and testing. The first 4 datasets are from (Hu and Liu, 2004), which actually has 5 datasets from 4 domains. Since we are mainly interested in results at the domain level, we did not use one of the domain-repeated datasets. The last 3 datasets of three domains (products) are from . These datasets are used to make up our CRF training data $D_t$ and test data $D_{n+1}$. The annotation details are given in Table 2.
The second type has 50 unlabeled review datasets from 50 domains or types of products . Each dataset has 1000 reviews. They are used as the past domain data, i.e., $D_1, \ldots, D_n$ ($n = 50$). Since they are not labeled, they cannot be used for training or testing.

Baseline Methods
We compare L-CRF with CRF. We will not compare with unsupervised methods, which have been shown improvable by lifelong learning . The frequency threshold λ in Algorithm 1 used in our experiment to judge which extracted aspects are considered reliable is empirically set to 2.
CRF: We use the linear-chain CRF from the implementation cited in footnote 2. Note that CRF uses all the features, including dependency features, that the proposed L-CRF uses, but does not employ the unlabeled data from the 50 domains used for lifelong learning.
CRF+R: It treats the reliable aspect set $K$ as a dictionary. It adds to the final results those reliable aspects in $K$ that are not extracted by CRF but appear in the test data. We want to see whether incorporating $K$ into the CRF extraction through dependency patterns, as L-CRF does, is actually needed.
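The CRF+R baseline amounts to a dictionary lookup on top of the CRF output, which can be sketched as below. The function name `crf_plus_r` and the toy sentence are assumptions for illustration, and only single-word aspects are handled; multi-word aspects would need phrase matching.

```python
from typing import Iterable, Set


def crf_plus_r(crf_aspects: Iterable[str], reliable: Iterable[str],
               test_tokens: Iterable[str]) -> Set[str]:
    """Add every reliable aspect that occurs in the test data to the CRF output."""
    in_test = set(reliable) & set(test_tokens)
    return set(crf_aspects) | in_test


crf_aspects = {"battery"}
reliable = {"price", "screen", "time"}
tokens = "the price is fine but it takes a long time to charge the battery".split()
result = crf_plus_r(crf_aspects, reliable, tokens)
```

In this toy case the lookup adds "price" but also "time", which is not used as an aspect in the sentence. Context-free matching of this kind is one way such a baseline can lose precision.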
We do not compare with domain adaptation or transfer learning because domain adaptation basically uses labeled data from the source domain to help learning in a target domain with few or no labeled data. Our 50 domains used in lifelong learning have no labels, so they cannot help in transfer learning. Although the target domain in transfer learning usually has a large quantity of unlabeled data, the 50 domains are not used as target domains in our experiments.

Experiment Setting
To compare the systems on the same training and test data, we use 200 sentences for training and 200 sentences for testing from each dataset. Using equal sizes avoids bias towards any dataset or domain, because we combine multiple domain datasets for CRF training. We conducted both cross-domain and in-domain tests. Our problem setting is cross-domain; in-domain tests are included for completeness. In both cases, we assume that extraction has been done for the 50 domains.
Cross-domain experiments: We combine 6 labeled domain datasets for training (1200 sentences) and test on the 7th domain (not used in training). This gives us 7 cross-domain results. This set of tests is particularly interesting because using a trained model in cross-domain situations saves manual labeling effort.
In-domain experiments: We train and test on the same 6 domains (1200 sentences for training and 1200 sentences for testing). Since there are 7 ways to choose the 6 domains, this also gives us 7 in-domain results.
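The two experimental setups can be enumerated with a short leave-one-domain-out loop. The domain names `d1` to `d7` and the function `make_splits` are placeholders for this sketch; the actual dataset names are those listed in Table 2.

```python
from typing import Dict, List

SENTENCES_PER_SPLIT = 200  # training and test sentences taken from each dataset


def make_splits(domains: List[str]) -> List[Dict[str, List[str]]]:
    """One cross-domain and one in-domain configuration per held-out domain."""
    splits = []
    for held_out in domains:
        train = [d for d in domains if d != held_out]
        splits.append({
            "train": train,              # training sentences of the other 6 domains
            "cross_test": [held_out],    # cross-domain: test on the unseen domain
            "in_test": train,            # in-domain: test sentences of the same 6
        })
    return splits


domains = [f"d{i}" for i in range(1, 8)]  # hypothetical names for the 7 domains
splits = make_splits(domains)
train_size = len(splits[0]["train"]) * SENTENCES_PER_SPLIT
```

Each of the 7 configurations trains on 6 x 200 = 1200 sentences; the cross-domain test uses the 200 test sentences of the held-out domain, and the in-domain test uses the 1200 test sentences of the 6 training domains.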
Evaluation Measures: We use the popular precision $P$, recall $R$, and $F_1$-score.
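For concreteness, the three measures can be computed as in the sketch below. It assumes that extracted and gold aspects are compared as sets of distinct terms with exact matching; the text does not state the matching granularity, so this is an assumption, as are the function name `prf1` and the toy inputs.

```python
from typing import Set, Tuple


def prf1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over sets of aspect terms (exact match)."""
    tp = len(predicted & gold)  # correctly extracted aspects
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


predicted = {"battery", "price", "time"}
gold = {"battery", "price", "screen", "lens"}
p, r, f1 = prf1(predicted, gold)
```

Here two of the three extracted terms are correct and two of the four gold aspects are found, so precision is 2/3, recall is 1/2, and $F_1$ is their harmonic mean, 4/7.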

Results and Analysis
All the experiment results are given in Table 3.
Cross-domain: Each −X in column 1 means that domain X is not used in training. X in column 2 means that domain X is used in testing. We can see that L-CRF is markedly better than CRF and CRF+R in $F_1$. CRF+R performs very poorly because of its low precision, which shows that treating the reliable aspect set $K$ as a dictionary is not a good idea.
In-domain: −X in the training and test columns means that the other 6 domains are used in both training and testing (thus in-domain). We again see that L-CRF is consistently better than CRF and CRF+R in $F_1$, although the gain is smaller. This is expected because most aspects that appear in the training data probably also appear in the test data, as both come from reviews of the same 6 products.

Conclusion
This paper proposed a lifelong learning method to enable CRF to leverage the knowledge gained from extraction results of previous domains (unlabeled) to improve its extraction. Experimental results showed the effectiveness of L-CRF. The current approach does not change the CRF model itself. In our future work, we plan to modify CRF so that it can consider previous extraction results as well as the knowledge in previous CRF models.