Lifelong-RL: Lifelong Relaxation Labeling for Separating Entities and Aspects in Opinion Targets

It is well-known that opinions have targets. Extracting such targets is an important problem of opinion mining because without knowing the target of an opinion, the opinion is of limited use. So far many algorithms have been proposed to extract opinion targets. However, an opinion target can be an entity or an aspect (part or attribute) of an entity. An opinion about an entity is an opinion about the entity as a whole, while an opinion about an aspect is just an opinion about that specific attribute or aspect of an entity. Thus, opinion targets should be separated into entities and aspects before use because they represent very different things about opinions. This paper proposes a novel algorithm, called Lifelong-RL, to solve the problem based on lifelong machine learning and relaxation labeling. Extensive experiments show that the proposed algorithm Lifelong-RL outperforms baseline methods markedly.


Introduction
A core problem of opinion mining or sentiment analysis is to identify each opinion/sentiment target and to classify the opinion/sentiment polarity on the target (Liu, 2012). For example, in a review sentence for a car, one wrote "Although the engine is slightly weak, this car is great." The person is positive (opinion polarity) about the car (opinion target) as a whole, but slightly negative (opinion polarity) about the car's engine (opinion target).
Past research has proposed many techniques to extract opinion targets (we will just call them targets hereafter for simplicity) and also to classify sentiment polarities on the targets. However, a target can be an entity or an aspect (part or attribute) of an entity. "Engine" in the above sentence is just one aspect of the car, while "this car" refers to the whole car. Note that in (Liu, 2012), an entity is called a general aspect. For effective opinion mining, we need to classify whether a target is an entity or an aspect because they refer to very different things. One can be positive about the whole entity (car) but negative about some aspects of it (e.g., engine) and vice versa. This paper aims to perform the target classification task, which, to our knowledge, has not been attempted before. Although in supervised extraction one can annotate entities and aspects with separate labels in the training data to build a model to extract them separately, in this paper our goal is to help unsupervised target extraction methods to classify targets. Unsupervised target extraction methods are often preferred because they save the time-consuming data labeling or annotation step for each domain.
Problem Statement: Given a set of opinion targets $T = \{t_1, \ldots, t_n\}$ extracted from an opinion corpus $d$, we want to classify each target $t_i \in T$ into one of three classes, entity, aspect, or NIL, which are called class labels. NIL means that the target is neither an entity nor an aspect and is used because target extraction algorithms can make mistakes. This paper does not propose a new target extraction algorithm. We use an existing unsupervised method, called Double Propagation (DP) (Qiu et al., 2011), for extraction. We only focus on target classification after the targets have been extracted. Note that an entity here can be a named entity, a product category, or an abstract product (e.g., "this machine" and "this product"). A named entity can be the name of a brand, a model, or a manufacturer. An aspect is a part or attribute of an entity, e.g., "battery" and "price" of the entity "camera".
Since our entities include not only traditional named entities (e.g., "Microsoft" and "Google") but also other expressions that refer to such entities, traditional named entity recognition algorithms are not sufficient. Pronouns such as "it," "they," etc., are not considered in this paper as co-reference resolution is out of the scope of this work.
We solve this problem in an unsupervised manner so that there is no need for labor-intensive manual labeling of the training data. One key observation of the problem is that although entities and aspects are different, they are closely related because aspects are parts or attributes of entities and they often have syntactic relationships in a sentence, e.g., "This phone's screen is super." Thus it is natural to solve the problem using a relational learning method. We employ the graph labeling algorithm, Relaxation Labeling (RL) (Hummel and Zucker, 1983), which performs unsupervised belief propagation on a graph. In our case, each target extracted from the given corpus $d$ forms a graph node and each relation identified in $d$ between two targets forms an edge. With some initial probability assignments, RL can assign each target node the most probable class label. Although some other graph labeling methods can be applied as well, the key issue here is that just using a propagation method in isolation is far from sufficient due to lack of information from the given corpus, which we detail in Section 5. We then employ Lifelong Machine Learning (LML) (Thrun, 1998; Chen and Liu, 2014b) to make a major improvement.
LML works as follows: The learner has performed a number of learning tasks in the past and has retained the knowledge gained so far. In the new/current task, it makes use of the past knowledge to help current learning and problem solving. Since RL is unsupervised, we can assume that the system has performed the same task on reviews of a large number of products/domains (or corpora). It has also saved all the graphs and classification results from those past domains in a Knowledge Base (KB). It then exploits this past knowledge to help classification in the current task/domain. We call this combined approach of relaxation labeling and LML Lifelong-RL. The approach is effective because there is a significant amount of sharing of targets and target relations across domains.
LML is different from the classic learning paradigm (supervised or unsupervised) because classic learning has no memory. It basically runs a learning algorithm on the given data in isolation without considering any past learned knowledge (Silver et al., 2013). LML aims to mimic human learning, which always retains the learned knowledge from the past and uses it to help future learning.
Our experimental results show that the proposed Lifelong-RL system is highly promising. The paradigm of LML helps improve the classification results greatly.

Related Work
Although many target extraction methods exist (Hu and Liu, 2004; Zhuang et al., 2006; Ku et al., 2006; Wang and Wang, 2008; Wu et al., 2009; Lin and He, 2009; Zhang et al., 2010; Mei et al., 2007; Li et al., 2010; Brody and Elhadad, 2010; Wang et al., 2010; Mukherjee and Liu, 2012; Fang and Huang, 2012; Zhou et al., 2013; Liu et al., 2013; Poria et al., 2014), we are not aware of any attempt to solve the proposed problem. As mentioned in the introduction, although in supervised target extraction one can annotate entities and aspects with different labels, supervised methods need manually labeled training data, which is time-consuming and labor-intensive to produce (Jakob and Gurevych, 2010; Choi and Cardie, 2010; Mitchell et al., 2013). Note that relaxation labeling was used for sentiment classification in (Popescu and Etzioni, 2007), but not for target classification. More details of opinion mining can be found in (Liu, 2012; Pang and Lee, 2008).
Our work is related to transfer learning (Pan and Yang, 2010), which uses the source domain labeled data to help target domain learning, which has little or no labeled data. Our work is not just using a source domain to help a target domain. It is a continuous and cumulative learning process. Each new task can make use of the knowledge learned from all past tasks. Knowledge learned from the new task can also help improve learning of any past task. Transfer learning is not continuous, does not accumulate knowledge over time and cannot improve learning in the source domain. Our work is also related to multi-task learning (Caruana, 1997), which jointly optimizes a set of related learning tasks. Clearly, multi-task learning is different as we learn and save information which is more realistic when a large number of tasks are involved.
Our work is most related to Lifelong Machine Learning (LML). Traditional LML focuses on supervised learning (Thrun, 1998; Ruvolo and Eaton, 2013; Chen et al., 2015). Recent work used LML in topic modeling (Chen and Liu, 2014a), which is unsupervised. Basically, they used topics generated from past domains to help current domain model inference. However, they are just for aspect extraction. So is the method in (Liu et al., 2016). They do not solve our problem. Their LML methods are also different from ours as we use a graph and results obtained in the past domains to augment the current task/domain graph to solve the problem.

Lifelong-RL: The General Framework
In this section, we present the proposed general framework of lifelong relaxation labeling (Lifelong-RL). We first give an overview of the relaxation labeling algorithm, which forms the base. We then incorporate it with the LML capability. The next two sections detail how this general framework is applied to our proposed task of separating entities and aspects in opinion targets.

Relaxation Labeling
Relaxation Labeling (RL) is an unsupervised graph-based label propagation algorithm that works iteratively. The graph consists of nodes and edges. Each edge represents a binary relationship between two nodes. Each node $t_i$ in the graph is associated with a multinomial distribution $P(L(t_i))$ ($L(t_i)$ being the label of $t_i$) on a label set $Y$. Each edge is associated with two conditional probability distributions $P(L(t_i) \mid L(t_j))$ and $P(L(t_j) \mid L(t_i))$, where $P(L(t_i) \mid L(t_j))$ represents how the label $L(t_j)$ influences the label $L(t_i)$ and vice versa. The neighbors $Ne(t_i)$ of a node $t_i$ are associated with a weight distribution $w(t_j \mid t_i)$ with $\sum_{t_j \in Ne(t_i)} w(t_j \mid t_i) = 1$.
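Given the initial values of these quantities as inputs, RL iteratively updates the label distribution of each node until convergence. Initially, we have $P^0(L(t_i))$. Let $\Delta P^{r+1}(L(t_i))$ be the change of $P(L(t_i))$ at iteration $r+1$. Given $P^r(L(t_i))$ at iteration $r$, $\Delta P^{r+1}(L(t_i))$ is computed by:

$$\Delta P^{r+1}(L(t_i)) = \sum_{t_j \in Ne(t_i)} \Big( w(t_j \mid t_i) \sum_{y \in Y} P(L(t_i) \mid L(t_j) = y)\, P^r(L(t_j) = y) \Big) \tag{1}$$

Then, the updated label distribution for iteration $r+1$, $P^{r+1}(L(t_i))$, is computed as follows:

$$P^{r+1}(L(t_i)) = \frac{P^r(L(t_i)) \big(1 + \Delta P^{r+1}(L(t_i))\big)}{\sum_{y \in Y} P^r(L(t_i) = y) \big(1 + \Delta P^{r+1}(L(t_i) = y)\big)} \tag{2}$$

Once RL ends, the final label of node $t_i$ is its highest probable label:

$$L(t_i) = \arg\max_{y \in Y} P(L(t_i) = y)$$

Note that $P(L(t_i) \mid L(t_j))$, $w(t_j \mid t_i)$ and $P^0(L(t_i))$ are provided by the user or computed based on the application context. RL uses these values as input and iteratively updates $P(L(t_i))$ based on Equations (1) and (2) until convergence. Next we discuss how to incorporate LML in RL.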
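The update loop can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard relaxation-labeling update: the support for a label is the neighbor-weighted sum of conditional probabilities times the neighbors' current beliefs, and the new belief is the old one scaled by one plus that support and then renormalized. The function name, the dictionary layout, the stopping tolerance and every number in the toy graph are assumptions made for the example.

```python
LABELS = ("entity", "aspect", "NIL")


def relaxation_labeling(p0, cond, weights, max_iter=100, tol=1e-6):
    """Iteratively update label distributions on a graph.

    p0:      {node: {label: initial probability}}
    cond:    {(i, j): {y_j: {y_i: P(L(i)=y_i | L(j)=y_j)}}}
    weights: {i: {j: w(j | i)}}, each inner dict summing to 1
    """
    p = {node: dict(dist) for node, dist in p0.items()}
    for _ in range(max_iter):
        new_p = {}
        for i in p:
            # Support for each label of i from its neighbours (assumed update rule).
            delta = {y: 0.0 for y in LABELS}
            for j, w in weights.get(i, {}).items():
                for y_i in LABELS:
                    delta[y_i] += w * sum(
                        cond[(i, j)][y_j][y_i] * p[j][y_j] for y_j in LABELS
                    )
            # Scale the old belief by (1 + support) and renormalise.
            scaled = {y: p[i][y] * (1.0 + delta[y]) for y in LABELS}
            z = sum(scaled.values())
            new_p[i] = {y: v / z for y, v in scaled.items()}
        change = max(abs(new_p[i][y] - p[i][y]) for i in p for y in LABELS)
        p = new_p
        if change < tol:
            break
    labels = {i: max(LABELS, key=lambda y: p[i][y]) for i in p}
    return p, labels


# Toy graph for "camera's battery"; every number below is illustrative.
uniform = {"entity": 1 / 3, "aspect": 1 / 3, "NIL": 1 / 3}
p0 = {
    "camera": {"entity": 0.7, "aspect": 0.2, "NIL": 0.1},
    "battery": {"entity": 0.4, "aspect": 0.4, "NIL": 0.2},
}
cond = {
    # battery's label given camera's label
    ("battery", "camera"): {
        "entity": {"entity": 0.1, "aspect": 0.8, "NIL": 0.1},
        "aspect": {"entity": 0.3, "aspect": 0.4, "NIL": 0.3},
        "NIL": uniform,
    },
    # camera's label given battery's label
    ("camera", "battery"): {
        "entity": {"entity": 0.1, "aspect": 0.6, "NIL": 0.3},
        "aspect": {"entity": 0.8, "aspect": 0.1, "NIL": 0.1},
        "NIL": uniform,
    },
}
weights = {"camera": {"battery": 1.0}, "battery": {"camera": 1.0}}
final_p, final_labels = relaxation_labeling(p0, cond, weights)
```

In this toy run the two nodes reinforce each other: "camera" drifts toward entity and "battery" toward aspect. A node without neighbors simply keeps its initial distribution.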

Lifelong Relaxation Labeling
For LML, it is assumed that at any time step, the system has worked on $u$ past domain corpora $D = \{d_1, \ldots, d_u\}$. For each past domain corpus $d \in D$, the same Lifelong-RL algorithm was applied and its results were saved in the Knowledge Base (KB). Then the algorithm can borrow some useful prior/past knowledge in the KB to help RL in the new/current domain $d_{u+1}$. Once the results of the current domain are produced, they are also added to the KB for future use.
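The control flow of this process can be sketched as a simple loop. The sketch below only illustrates the order of operations. The function `lifelong_process`, the `solve_domain` callable and the list-based KB are hypothetical names and structures introduced for the example.

```python
def lifelong_process(domains, solve_domain, kb=None):
    """Process domains one by one, accumulating results in a knowledge base.

    domains:      iterable of (domain_name, corpus) pairs
    solve_domain: hypothetical callable (corpus, kb) -> result that labels the
                  corpus with the help of whatever the KB already holds
    """
    kb = [] if kb is None else kb
    for name, corpus in domains:
        result = solve_domain(corpus, kb)  # sees results of past domains only
        kb.append((name, result))          # retained for future domains
    return kb


# Stand-in solver: reports how many past domains were available to it.
demo_kb = lifelong_process(
    [("Camera", "camera reviews"), ("Cellphone", "phone reviews"), ("Laptop", "laptop reviews")],
    lambda corpus, kb: len(kb),
)
```

Each domain is solved with the knowledge accumulated so far, and its own result becomes available to every later domain.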
We now detail the specific types of information or knowledge that can be obtained from the past domains to help RL in the future, which should thus be stored in the KB.
1. Prior edges: Some edges that are missing from the graph of the current domain may exist in the graphs of some past domains. Then, those edges and their associated probabilities can be borrowed.
2. Prior labels: Some nodes in the current new domain may also exist in some past domains. Their labels in the past domains are very likely to be the same as those in the current domain. Then, those prior labels can give us a better idea about the initial label probability distributions of the nodes in the current domain $d_{u+1}$.
To leverage those edges and labels from the past domains, the system needs to ensure that they are likely to be correct and applicable to the current task domain. This is a challenging problem. In the next two sections, we detail how to ensure these to a large extent in our application context along with how to compute those initial probabilities.

Initialization of Relaxation Labeling
We now discuss how the proposed Lifelong-RL general framework is applied to solve our problem. In our case, each node in the graph is an extracted target $t_i \in T$, and each edge represents a binary relationship between two targets. $T$ is the given set of all opinion targets extracted by an extraction algorithm from a review dataset/corpus $d$. The label set for each target is $Y = \{\text{entity}, \text{aspect}, \text{NIL}\}$. In this section, we describe how to use text clues in the corpus $d$ to compute $P(L(t_i) \mid L(t_j))$, $w(t_j \mid t_i)$ and $P^0(L(t_i))$. In the next section, we present how these quantities are improved using prior knowledge from the past domains in the LML fashion.

Text Clues for Initialization
We use two kinds of text clues, called type modifiers $M(t)$ and relation modifiers $M_R$, to compute the initial label distribution $P^0(L(t_i))$ and the conditional label distribution $P(L(t_i) \mid L(t_j))$ respectively.
Type Modifier: This has two kinds, $M_T = \{m_E, m_A\}$, where $m_E$ and $m_A$ represent entity modifier and aspect modifier respectively. For example, the word "this" as in "this camera is great" indicates that "camera" is probably an entity. Thus, "this" is a type modifier indicating $M(\text{camera}) = m_E$. "These" is also a type modifier. Aspect modifier is implicitly assumed when the number of appearances of entity modifiers is less than or equal to a threshold (see Section 4.2).
Relation Modifier: Given two targets, $t_i$ and $t_j$, we use $M_{t_j}(t_i)$ to denote the relation modifier through which the label of target $t_i$ is influenced by the label of target $t_j$. Relation modifiers are further divided into 3 kinds:
Conjunction modifier $m_c$: Conjoined items are usually of the same type. For example, in "price and service", "and service" indicates a conjunction modifier for "price" and vice versa.
Entity-aspect modifier $m_{A|E}$: A possessive expression indicates an entity and an aspect relation. For example, in "the camera's battery", "camera" indicates an entity-aspect modifier for "battery".
Aspect-entity modifier $m_{E|A}$: Same as above except that "battery" indicates an aspect-entity modifier for "camera".
Modifier Extraction: These modifiers are identified from the corpus $d$ using three syntactic rules. "This" and "these" are used to extract the type modifier $M(t) = m_E$. $C_{m_E}(t)$ is the occurrence count of that modifier on target $t$, which is used in determining the initial label distribution in Section 4.2.
Relation modifiers are identified by dependency relations $conj(t_i, t_j)$ and $poss(t_i, t_j)$ using the Stanford Parser (Klein and Manning, 2003). Each occurrence of a relation rule contributes one count of $M_{t_j}(t_i)$ for $t_i$ and one count of $M_{t_i}(t_j)$ for $t_j$. We use $C_{m_c,t_j}(t_i)$, $C_{m_{A|E},t_j}(t_i)$ and $C_{m_{E|A},t_j}(t_i)$ to denote the count of $t_j$ modifying $t_i$ with conjunction, entity-aspect and aspect-entity modifiers respectively. For example, "price and service" will contribute one count to $C_{m_c,\text{price}}(\text{service})$ and one count to $C_{m_c,\text{service}}(\text{price})$. Similarly, "camera's battery" will contribute one count to $C_{m_{A|E},\text{camera}}(\text{battery})$ and one count to $C_{m_{E|A},\text{battery}}(\text{camera})$.
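The counting step can be sketched as below. The sketch does not call a parser. It assumes each sentence has already been tokenized and parsed into `(relation, head, dependent)` triples, which is a hypothetical input format chosen for the example, and that a type modifier is detected when "this" or "these" immediately precedes a target. The function name `count_modifiers` and the string keys for the modifiers are likewise illustrative.

```python
from collections import Counter


def count_modifiers(sentences, targets):
    """Count type and relation modifiers over pre-parsed sentences.

    sentences: list of (tokens, deps), deps being (relation, head, dependent)
               triples such as ("poss", "battery", "camera")
    targets:   set of extracted target words
    Returns (type_counts, rel_counts), where rel_counts is keyed by
    (modifier, t_j, t_i), meaning "t_j modifies t_i with this modifier".
    """
    type_counts = Counter()
    rel_counts = Counter()
    for tokens, deps in sentences:
        # Entity type modifier: "this"/"these" directly before a target.
        for prev, tok in zip(tokens, tokens[1:]):
            if prev.lower() in ("this", "these") and tok in targets:
                type_counts[tok] += 1
        for rel, head, dep in deps:
            if head not in targets or dep not in targets:
                continue
            if rel == "conj":
                # Conjunction counts once in each direction.
                rel_counts[("mc", dep, head)] += 1
                rel_counts[("mc", head, dep)] += 1
            elif rel == "poss":
                # "camera's battery": the possessor (dep) is the entity side.
                rel_counts[("mA|E", dep, head)] += 1
                rel_counts[("mE|A", head, dep)] += 1
    return type_counts, rel_counts


demo_sentences = [
    (["This", "camera", "'s", "battery", "is", "good"], [("poss", "battery", "camera")]),
    (["price", "and", "service"], [("conj", "price", "service")]),
]
demo_targets = {"camera", "battery", "price", "service"}
type_counts, rel_counts = count_modifiers(demo_sentences, demo_targets)
```

The two demo sentences reproduce the counts described in the text: one entity modifier on "camera", one entity-aspect and one aspect-entity count for the possessive, and one conjunction count in each direction.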

Computing Initial Probabilities
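The initial label probability distribution of target $t$ is computed based on $C_{m_E}(t)$, i.e.,

$$P^0(L(t)) = \begin{cases} P_{m_E}(L(t)) & \text{if } C_{m_E}(t) > \alpha \\ P_{m_A}(L(t)) & \text{otherwise} \end{cases} \tag{3}$$

Here, we have two pre-defined distributions: $P_{m_E}$ and $P_{m_A}$, which have a higher probability on entity and aspect respectively. The parameter $\alpha$ is a threshold indicating that if the entity modifier rarely occurs, the target is more likely to be an aspect. These values are set empirically (see Section 6).
Let term $q(M_{t_j}(t_i) = m)$ be the normalized weight on the count for each kind of relation modifier $m \in M_R$:

$$q(M_{t_j}(t_i) = m) = \frac{C_{m,t_j}(t_i)}{C_{t_j}(t_i)}, \quad \text{where } C_{t_j}(t_i) = \sum_{m' \in M_R} C_{m',t_j}(t_i) \tag{4}$$

The conditional label distribution $P(L(t_i) \mid L(t_j))$ of $t_i$ given the label of $t_j$ is the weighted sum over the three kinds of relation modifiers:

$$P(L(t_i) \mid L(t_j)) = \sum_{m \in \{m_c,\, m_{A|E},\, m_{E|A}\}} q(M_{t_j}(t_i) = m)\, P_m(L(t_i) \mid L(t_j)) \tag{5}$$

where $P_{m_c}$, $P_{m_{A|E}}$, and $P_{m_{E|A}}$ are pre-defined conditional distributions. They are filled with values to model the label influence from neighbors and can be found in Section 6.
Finally, target $t_i$'s neighbor weight for target $t_j$, i.e., $w(t_j \mid t_i)$, is the ratio of the total count of relation modifiers $C_{t_j}(t_i)$ (summed over the three kinds) over the total of all $t_i$'s neighbors:

$$w(t_j \mid t_i) = \frac{C_{t_j}(t_i)}{\sum_{t_{j'} \in Ne(t_i)} C_{t_{j'}}(t_i)} \tag{6}$$

If $C_{t_j}(t_i) = 0$, $t_i$ and $t_j$ have no edge between them.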
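A compact sketch of these three initialization steps follows. It assumes the initial distribution is chosen by comparing the entity-modifier count with the threshold, that the relation-modifier weights are plain count ratios, and that the neighbor weights are count ratios over all neighbors. All function names and the numbers in the test tables are illustrative and are not the paper's settings.

```python
RELATIONS = ("mc", "mA|E", "mE|A")


def initial_distribution(c_mE, alpha, p_mE, p_mA):
    """Use the entity-leaning prior only if 'this/these' occurs more than alpha times."""
    return dict(p_mE) if c_mE > alpha else dict(p_mA)


def modifier_weights(rel_counts):
    """Normalise the counts of the three relation modifiers for one (t_i, t_j) pair."""
    total = sum(rel_counts.get(m, 0) for m in RELATIONS)
    if total == 0:
        raise ValueError("no relation modifier: the two targets share no edge")
    return {m: rel_counts.get(m, 0) / total for m in RELATIONS}


def conditional_distribution(q, tables):
    """Mix per-modifier tables tables[m][(y_i, y_j)] using the weights q[m]."""
    keys = tables[RELATIONS[0]].keys()
    return {k: sum(q[m] * tables[m][k] for m in RELATIONS) for k in keys}


def neighbor_weights(total_counts):
    """total_counts: {t_j: total modifier count on t_i}; zero counts mean no edge."""
    z = sum(total_counts.values())
    return {t_j: c / z for t_j, c in total_counts.items() if c > 0}
```

For instance, a target pair seen once with a conjunction and three times with a possessive gets weights 0.25 and 0.75 on the corresponding tables.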

Using Past Knowledge in Lifelong-RL
Because the review corpus $d_{u+1}$ in the current task domain may not be very large, and because we use high-quality syntactic rules to extract relations to build the graph to ensure precision, the number of relations extracted can be small and insufficient to produce an information-rich graph with accurate initial probabilities. We thus apply LML to help, using knowledge learned in the past. The proposed LML process in Lifelong-RL for our task is shown in Figure 1. Our prior knowledge includes type modifiers, relation modifiers and labels of targets obtained from past domains in $D$.
Each record in the KB is stored as a 9-tuple:

$$\big(d,\ t_i,\ t_j,\ M^d(t_i),\ M^d(t_j),\ C^d_{m,t_j}(t_i),\ C^d_{m,t_i}(t_j),\ L^d(t_i),\ L^d(t_j)\big)$$

where $d \in D$ is a past domain; $t_i$ and $t_j$ are two targets; $M^d(t_i)$ and $M^d(t_j)$ are their type modifiers; $C^d_{m,t_j}(t_i)$ and $C^d_{m,t_i}(t_j)$ are counts for relation modifiers; $L^d(t_i)$ and $L^d(t_j)$ are labels decided by RL. For example, the sentence "This camera's battery is good" forms: $(d, \text{camera}, \text{battery}, m_E, m_A, C_{m_{E|A},\text{battery}}(\text{camera}) = 1, C_{m_{A|E},\text{camera}}(\text{battery}) = 1, \text{entity}, \text{aspect})$. It means that in the past domain $d$, "camera" and "battery" are extracted targets. Since "camera" follows "this", its type modifier is $m_E$. Since "battery" is not identified by an entity modifier, it is $m_A$. The pattern "camera's battery" contributes one count for both relation modifiers $C_{m_{E|A},\text{battery}}(\text{camera})$ and $C_{m_{A|E},\text{camera}}(\text{battery})$. RL has labeled "camera" as entity and "battery" as aspect in $d$.
The next two subsections present how to use the knowledge in the KB to improve the initial assignments for the label distributions, conditional label distributions and neighborhood weight distributions in order to achieve better final labeling/classification results for the current/new domain $d_{u+1}$.
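One way to picture such a record is as a typed tuple. The field names below, the string codes for the modifiers and the domain name "Camera" are assumptions made for illustration. The paper only fixes the nine components and their meaning.

```python
from typing import Dict, NamedTuple


class KBRecord(NamedTuple):
    domain: str                    # past domain d
    t_i: str                       # first target
    t_j: str                       # second target
    type_mod_i: str                # type modifier of t_i in d ("mE" or "mA")
    type_mod_j: str                # type modifier of t_j in d
    rel_counts_i: Dict[str, int]   # counts of t_j modifying t_i, per relation modifier
    rel_counts_j: Dict[str, int]   # counts of t_i modifying t_j, per relation modifier
    label_i: str                   # label RL assigned to t_i in d
    label_j: str                   # label RL assigned to t_j in d


# The record formed by "This camera's battery is good".
record = KBRecord(
    "Camera", "camera", "battery", "mE", "mA",
    {"mE|A": 1}, {"mA|E": 1}, "entity", "aspect",
)
```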

Exploiting Relation Modifiers in the KB
If two targets in the current domain corpus have no edge, we can check whether relation modifiers of the same two targets exist in some past domains. If so, we may be able to borrow them. But to ensure suitability, two consistency checks are performed.
Label Consistency Check: Since RL makes mistakes, we need to ensure that relation modifiers in a record in the KB are consistent with target labels in that past domain. For example, "camera's battery" is confirmed by "camera" being labeled as entity and "battery" being labeled as aspect in a past domain d ∈ D. Without this consistency, the record may not be reliable and should be discarded from the KB.
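We define an indicator variable $I^d_{m,t_j}(t_i)$ to ensure that a record's relation modifier is consistent with the labels of its two targets:

$$I^d_{m,t_j}(t_i) = \begin{cases} 1 & \text{if } C^d_{m,t_j}(t_i) > 0,\ m = m_c \text{ and } L^d(t_i) = L^d(t_j) \\ 1 & \text{if } C^d_{m,t_j}(t_i) > 0,\ m = m_{A|E},\ L^d(t_i) = \text{aspect and } L^d(t_j) = \text{entity} \\ 1 & \text{if } C^d_{m,t_j}(t_i) > 0,\ m = m_{E|A},\ L^d(t_i) = \text{entity and } L^d(t_j) = \text{aspect} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

For example, if "camera" is labeled as entity and "battery" is labeled as aspect in the past domain $d$, we have $I^d_{m_{A|E},\text{camera}}(\text{battery}) = 1$ and $I^d_{m_{E|A},\text{battery}}(\text{camera}) = 1$.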
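A hedged sketch of this check is given below. The possessive cases follow the camera/battery example directly. The conjunction case assumes that "consistent" means the two conjoined targets received the same label, in line with the remark that conjoined items are usually of the same type. The function name and string codes are illustrative.

```python
def label_consistent(modifier, count, label_i, label_j):
    """Return 1 if a past record's relation modifier agrees with the RL labels.

    modifier: "mc", "mA|E" or "mE|A", describing how t_j modifies t_i
    count:    how often that modifier was observed in the past domain
    """
    if count <= 0:
        return 0
    if modifier == "mc":
        # Assumption: conjoined targets must share the same label.
        return int(label_i == label_j)
    if modifier == "mA|E":
        return int(label_i == "aspect" and label_j == "entity")
    if modifier == "mE|A":
        return int(label_i == "entity" and label_j == "aspect")
    return 0
```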
Type Consistency Check: Here we ensure that the type modifiers for two targets in the current domain $d_{u+1}$ are consistent with their type modifiers in the past domain $d \in D$. This is because an item can be an aspect in one domain but an entity in another. For example, if the current domain is "Cellphone", borrowing the relation "camera's battery" from domain "Camera" can introduce an error because "camera" is an aspect in domain "Cellphone".
The syntactic pattern "this" is a good indicator for this check. In the "Cellphone" domain, "its camera" or "the camera" are often mentioned but not "this camera". In the "Camera" domain, "this camera" is often mentioned. The type modifier of "camera" in "Cellphone" is $m_A$, but in "Camera" it is $m_E$.
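The check amounts to filtering past domains by matching type modifiers. The sketch below assumes the KB can be viewed as a mapping from each past domain to the type modifiers of its targets, which is a layout chosen here for illustration. The function name is hypothetical.

```python
def consistent_domains(past_type_mods, current_mods):
    """Return past domains whose type modifiers agree with the current domain.

    past_type_mods: {domain: {target: "mE" or "mA"}}
    current_mods:   {target: "mE" or "mA"} for the targets being checked
    """
    return [
        domain
        for domain, mods in past_type_mods.items()
        if all(mods.get(t) == m for t, m in current_mods.items())
    ]


past = {
    "Camera": {"camera": "mE", "battery": "mA"},
    "Cellphone": {"camera": "mA", "battery": "mA"},
    "DSLR": {"camera": "mE", "battery": "mA"},
}
similar = consistent_domains(past, {"camera": "mE", "battery": "mA"})
```

Here "Cellphone" is filtered out because "camera" carries an aspect modifier there, so its "camera's battery" knowledge is not borrowed for a camera-like domain.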
Updating Probabilities in Current Domain $d_{u+1}$: Edges for RL are in the forms of the conditional label distribution $P(L(t_i) \mid L(t_j))$ and the neighborhood weight distribution $w(t_j \mid t_i)$. We now discuss how to use the KB to estimate them more accurately.
Updating Conditional Label Distribution: Equation (5) states that the conditional label distribution $P(L(t_i) \mid L(t_j))$ is the weighted sum of the relation modifiers' label distributions $P_{m_c}$, $P_{m_{A|E}}$, and $P_{m_{E|A}}$. These 3 label distributions are pre-defined and given in Table 2. They are not changed. Thus, we update the conditional label distribution by updating the three relation modifiers' weights $q(M_{t_j}(t_i))$ with the knowledge in the KB. Recall that the three relation modifiers are $M_R = \{m_c, m_{A|E}, m_{E|A}\}$.
After the consistency checks, there can be multiple relation modifiers between two targets in similar past domains $D^s \subset D$. The number of domains supporting a relation modifier $m \in M_R$ can tell which kind of relation modifier is common and likely to be correct. For example, if across many past domains such as "Laptop", "Tablet", and "Cellphone", "camera and battery" appears more often than "camera's battery", then "camera" should be modified by "battery" more with $m_c$ than with $m_{E|A}$ (i.e., "camera" is likely to be an aspect).
Let $C^{d_{u+1}}_{m,t_j}(t_i)$ be the count of target $t_i$ being modified by target $t_j$ on relation $m$ in the current domain $d_{u+1}$ (not in the KB), and let $C^{d_{u+1}}_{t_j}(t_i) = \sum_{m \in M_R} C^{d_{u+1}}_{m,t_j}(t_i)$. The count $C^{(CL)}_{m,t_j}(t_i)$ is for updating the Conditional Label (CL) distributions considering the information in both the current domain $d_{u+1}$ and the KB. It is calculated as:

$$C^{(CL)}_{m,t_j}(t_i) = \begin{cases} C^{d_{u+1}}_{m,t_j}(t_i) & \text{if } C^{d_{u+1}}_{t_j}(t_i) > 0 \\ \sum_{d \in D^s} I^d_{m,t_j}(t_i) & \text{otherwise} \end{cases}$$

This equation says that if there is any relation modifier existing between the two targets in the new domain $d_{u+1}$, we do not borrow edges from the KB; otherwise, the number of similar past domains supporting the relation modifier $m$ is used. Recall that $I^d_{m,t_j}(t_i)$ is the result calculated by Equation (7) after the label consistency check.
We use the count $C^{(CL)}_{m,t_j}(t_i)$ to update $q^{d_{u+1}}(M_{t_j}(t_i))$ using Equation (4) in Section 4.2. Then the conditional label distribution accommodating relation modifiers in the KB, $P^{(LL1)}(L(t_i) \mid L(t_j))$, is calculated by Equation (5) using $q^{d_{u+1}}(M_{t_j}(t_i))$. LL1 denotes Lifelong Learning 1.
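The borrowing rule for the conditional-label counts can be sketched as follows. It assumes that counts from the current domain always take precedence and that, when the current domain has no modifier between the two targets, each relation modifier is weighted by the number of similar past domains that support it. The function name and example numbers are illustrative.

```python
def conditional_label_counts(current_counts, past_support):
    """Choose the counts used to weight the three relation-modifier tables.

    current_counts: {modifier: count in the current domain}
    past_support:   {modifier: number of similar past domains whose record
                     for this modifier passed the label consistency check}
    """
    if sum(current_counts.values()) > 0:
        return dict(current_counts)   # an edge exists already: do not borrow
    return dict(past_support)         # no edge: borrow domain-level support


borrowed = conditional_label_counts({}, {"mc": 5, "mE|A": 2, "mA|E": 0})
kept = conditional_label_counts({"mA|E": 1}, {"mc": 5, "mE|A": 2, "mA|E": 0})
```

Normalizing `borrowed` gives weights of 5/7 for the conjunction table and 2/7 for the aspect-entity table, so the more widely supported modifier dominates the borrowed edge.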
Updating Neighbor Weight Distribution: Equation (6) says that $w(t_j \mid t_i)$ is the importance of target $t_i$'s neighbor $t_j$ to $t_i$ among all $t_i$'s neighbors. When updating the conditional label distribution using the KB, the number of domains can decide which kind of relation modifier $m$ is more common between the two targets $t_i$ and $t_j$. But it cannot tell us that neighbor $t_j$ is more important to $t_i$ than another neighbor $t_{j'}$.
For example, given past domains such as "Laptop", "Tablet", "Cellphone", etc., no matter how many domains believe "camera" is an aspect given that "battery" is also an aspect, if the current domain is "All-in-one desktop computer", we should not consider the strong influences from "battery" in the past domains. We should rely more on the weights of "camera"'s neighbors provided by "All-in-one desktop computer". That means "mouse", "keyboard", "screen", etc., should have a stronger influence on "camera" than "battery" because most all-in-one desktops (e.g., iMac) do not have a battery.
We introduce another indicator variable $I^D_{m,t_j}(t_i) = \bigvee_{d \in D^s} I^d_{m,t_j}(t_i)$ to indicate whether target $t_j$ modified $t_i$ on relation $m$ in past similar domains $D^s$. It only considers the existence of a relation modifier $m$ among domains $D^s$.
The count $C^{(w)}_{t_j}(t_i)$ for updating the neighbor weight ($w$) distribution considers both the KB and the current domain $d_{u+1}$. It is as follows:

$$C^{(w)}_{t_j}(t_i) = \begin{cases} C^{d_{u+1}}_{t_j}(t_i) & \text{if } C^{d_{u+1}}_{t_j}(t_i) > 0 \\ \sum_{m \in M_R} I^D_{m,t_j}(t_i) & \text{otherwise} \end{cases}$$

where $C^{d_{u+1}}_{t_j}(t_i)$ is the total number of times that $t_j$ modifies $t_i$ in the new domain. This equation says that if there are relation modifiers existing between the two targets in the new domain $d_{u+1}$, we count the total times that $t_j$ modifies $t_i$ in the new domain; otherwise, we count the number of kinds of relation modifiers $m \in M_R$ that existed in past domains. Let $w^{(LL1)}(t_j \mid t_i)$ be the neighbor weight distribution considering knowledge from the KB and $d_{u+1}$. It is calculated by Equation (6) using $C^{(w)}_{t_j}(t_i)$. The initial label distribution $P^{d_{u+1},0}$ is calculated by Equation (3) using only type modifiers found in the new domain $d_{u+1}$. We use Lifelong-RL-1 to denote the method that employs $P^{(LL1)}(L(t_i) \mid L(t_j))$, $w^{(LL1)}(t_j \mid t_i)$ and $P^{d_{u+1},0}$ as inputs for RL.
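The neighbor-weight count differs from the conditional-label count in that borrowed knowledge contributes only existence, not the number of supporting domains. A minimal sketch, with an illustrative function name and inputs:

```python
def neighbor_weight_count(current_counts, seen_in_past):
    """Count used to weight neighbour t_j of t_i.

    current_counts: {modifier: count in the current domain}
    seen_in_past:   {modifier: True if any similar past domain has a
                     consistent record of t_j modifying t_i with it}
    """
    total = sum(current_counts.values())
    if total > 0:
        return total                                   # trust the current domain
    return sum(1 for seen in seen_in_past.values() if seen)  # at most 3 kinds
```

Because a borrowed edge can contribute at most three (one per kind of relation modifier), a pair that is popular across hundreds of past domains cannot outweigh neighbors actually observed in the current corpus.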

Exploiting Target Labels in the KB
Since we have target labels from past domains, we may have a better idea about the initial label probabilities of targets in the current domain $d_{u+1}$. For example, after labeling domains like "Cellphone", "Laptop", "Tablet," and "E-reader", we may have a good sense that "camera" is likely to be an aspect. To use such knowledge, we need to check if the type modifier of target $t$ in the current domain matches those in past domains and only keep those domains that have such a matching type modifier.
Let $D^s \subset D$ be the past domains consistent with target $t$'s type modifier in the current domain $d_{u+1}$. Let $C^{D^s}(L(t))$ be the number of domains in $D^s$ in which target $t$ is labeled as $L(t)$. Let $\lambda$ be the ratio that controls how much we trust knowledge from the KB. Then the initial label probability distribution $P^{d_{u+1},0}$, calculated by Equation (3) using only type modifiers found in $d_{u+1}$, is replaced by:

$$P^{(LL2),0}(L(t)) = \frac{|D| \times P^{d_{u+1},0}(L(t)) + \lambda\, C^{D^s}(L(t))}{|D| + \lambda |D|} \tag{8}$$

Similarly, let $D^s \subset D$ be the past domains consistent with both targets $t_i$'s and $t_j$'s type modifiers in $d_{u+1}$. Let $C^{D^s}(L(t_i), L(t_j))$ be the number of domains in $D^s$ in which $t_i$ and $t_j$ are labeled as $L(t_i)$ and $L(t_j)$ respectively. The conditional label probability distribution accommodating relation modifiers in the KB, $P^{(LL1)}(L(t_i) \mid L(t_j))$, is further updated to $P^{(LL2)}(L(t_i) \mid L(t_j))$ by Equation (9), which exploits the target labels in the KB (LL2 denotes Lifelong Learning 2). For example, given "this camera" and "battery" in the current domain, we are more likely to consider domains (e.g., "Film Camera" and "DSLR", but not "Cellphone") that have entity modifiers on "camera" and aspect modifiers on "battery". Then we count the number of those domains that label "camera" as entity and "battery" as aspect: $C^{D^s}(L(\text{camera}) = \text{entity}, L(\text{battery}) = \text{aspect})$. Similarly, we count domains having other types of target labels on "camera" and "battery". These counts form an updated conditional label distribution that estimates "camera" as an entity and "battery" as an aspect.
Note that $|D - D^s|$, the number of past domains not consistent with the targets' type modifiers, is added to $C^{D^s}(L(t_i) = \text{NIL})$ and $C^{D^s}(L(t_i) = \text{NIL}, L(t_j))$ for Equations (8) and (9) respectively to make the sum over $L(t_i)$ equal to 1. We use Lifelong-RL to denote this method, which uses $P^{(LL2),0}(L(t))$, $P^{(LL2)}(L(t_i) \mid L(t_j))$ and $w^{(LL1)}(t_j \mid t_i)$ as input for RL.
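The update of the initial label distribution in Equation (8), including the NIL padding just described, can be sketched as below. The sketch assumes that every consistent past domain contributes exactly one label for the target, so that the label counts sum to the number of consistent domains. The function name and the example numbers are illustrative.

```python
LABELS = ("entity", "aspect", "NIL")


def lifelong_initial(p_current, label_counts, num_past, lam):
    """Blend the current-domain initial distribution with past label counts.

    p_current:    {label: probability from the current domain only}
    label_counts: {label: number of consistent past domains giving that label}
    num_past:     total number of past domains |D|
    lam:          trust ratio for knowledge from the KB
    """
    counts = {y: label_counts.get(y, 0) for y in LABELS}
    # Past domains that are not type-consistent are added to the NIL count.
    counts["NIL"] += num_past - sum(counts.values())
    denom = num_past + lam * num_past
    return {y: (num_past * p_current[y] + lam * counts[y]) / denom for y in LABELS}


updated = lifelong_initial(
    {"entity": 0.6, "aspect": 0.3, "NIL": 0.1},
    {"aspect": 60, "entity": 10},
    num_past=100,
    lam=0.1,
)
```

With 100 past domains, of which 60 consistent ones label the target as aspect, the aspect probability rises from 0.3 to 36/110, and the three probabilities still sum to 1.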

Experiments
We now evaluate the proposed method and compare with baselines. We use the DP method for target extraction (Qiu et al., 2011). This method uses dependency relations between opinion words and targets to extract targets using seed opinion words. Since our paper does not focus on extraction, interested readers can refer to (Qiu et al., 2011) for details.

Experiment Settings
Evaluation Datasets: We use two sets of datasets. The first set consists of eight (8) annotated review datasets. We use each of them as the new domain data in LML to compute precision, recall, and $F_1$ scores. Five of them are from (Hu and Liu, 2004), and the remaining three are from (Liu et al., 2016). They have been used for target extraction, and thus have annotated targets, but no annotation on whether a target is an entity or aspect. We made this annotation, which is straightforward. We used two annotators to annotate the datasets. The Cohen's kappa is 0.84. Through discussion, the annotators reached complete agreement. Details of the datasets are listed in Table 1. Each cell is the number of distinct terms. These datasets are not very large but they are realistic because many products do not have a large number of reviews. The second set consists of unlabeled review datasets from 100 diverse products or domains (Chen and Liu 2014). Each domain has 1000 reviews. They are treated as past domain data in LML since they are not annotated and thus cannot be used for computing evaluation measures.
Evaluation Measures: We mainly use precision $P$, recall $R$, and $F_1$-score as evaluation measures. We take multiple occurrences of the same target as one count, and only evaluate entities and aspects. We will also give the accuracy results.
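As a hedged illustration of these measures, the sketch below computes precision, recall and $F_1$ for one class over distinct targets, so that repeated occurrences of a target count once. The function name and the toy gold/predicted labels are assumptions made for the example and are not taken from the evaluation data.

```python
def precision_recall_f1(gold, predicted, label):
    """Per-class scores over distinct targets.

    gold, predicted: {target: label}; each distinct target appears once
    label:           the class to evaluate ("entity" or "aspect")
    """
    pred_set = {t for t, y in predicted.items() if y == label}
    gold_set = {t for t, y in gold.items() if y == label}
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = {"camera": "entity", "battery": "aspect", "price": "aspect", "thing": "NIL"}
pred = {"camera": "entity", "battery": "aspect", "price": "entity", "thing": "aspect"}
entity_scores = precision_recall_f1(gold, pred, "entity")
```

In the toy data, labeling "price" as an entity halves entity precision while entity recall stays at 1.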
Compared Methods: We compare the following methods, including our proposed method, Lifelong-RL.
NER+TM: NER is Named Entity Recognition.
We can regard the extracted terms from a NER system as entities and the rest of the targets as aspects. However, a NER system cannot identify entities such as "this car" from "this car is great." Its result is rather poor. But our type modifier (TM) does that, i.e., if an opinion target appears after "this" or "these" in at least two sentences, TM labels the target as an entity; otherwise an aspect. However, TM cannot extract named entities. Its result is also rather poor. We thus combine the two methods to give NER+TM as they complement each other very well.
To make NER more powerful, we use two NER systems: Stanford-NER (Manning et al., 2014) and UIUC-NER (Ratinov and Roth, 2009). NER+TM treats the extracted entities by the three systems as entities and the rest of the targets as aspects.
NER+TM+DICT: We run NER+TM on the 100 datasets for LML to get a list of entities, which we call the dictionary (DICT). For a new task, if any target word is in the list, it is treated as an entity; otherwise an aspect.
RL: This is the base method described in Section 3. It performs relaxation labeling (RL) without the help of LML.
Lifelong-RL-1: This performs LML with RL but the current task only uses the relations in the KB from previous tasks (Section 5.1).
Lifelong-RL: This is our proposed final method. It improves Lifelong-RL-1 by further incorporating target labels in the KB from previous tasks (Section 5.2).
Parameter Settings: RL has 2 initial label distributions $P_{m_E}$ and $P_{m_A}$ and 3 conditional label distributions $P_{m_c}$, $P_{m_{E|A}}$ and $P_{m_{A|E}}$. Like other belief propagation algorithms, these probabilities need to be set empirically, as shown in Table 2. The parameter $\alpha$ is set to 1. Our LML method has one parameter $\lambda$ for Lifelong-RL. We set it to 0.1.

Table 3: Comparative results on Entity and Aspect in precision, recall and $F_1$-score. NER+TM+DICT's results are very poor and not included (see Section 6.2 for the average results).

Results Analysis
Table 3 shows the test results of all systems in precision, recall and $F_1$-score except NER+TM+DICT. NER+TM+DICT is not included due to space limitations and because it performed very poorly. The reason is that a target can be an entity in one domain but an aspect in another. Its average $F_1$-score for entity is only 49.2, and for aspect is only 50.2.
Entity Results Comparison: We observe from the table that although NER+TM combines NER and TM, its result for entities is still rather poor. We notice that phrases like "this price" cause low precision. Since it does not use many other relations and NER does not recognize many named entities that are written in lower case letters (e.g., "apple is good"), its recall is also low.
RL has a higher precision as it considers relation modifiers. However, its recall is low because it lacks information in its graph, which causes RL to make many wrong decisions. Lifelong-RL-1 introduces relation modifiers in KB from past domains into the current task. Both precision and recall increase markedly.
Lifelong-RL improves Lifelong-RL-1 further by considering target labels of past domains. Their counts improve the initial label probability distributions and conditional label probability distributions. For example, "this price" may appear in some domains but "price"'s target label is mostly aspect. We consider their counts in initial label distributions and thus rectify the initial distribution of "price". This makes "price" easier to be classified as aspect and thus improves the precision for entity.
Aspect Results Comparison: For aspects, the trend is the same but the improvements are not as dramatic as for entities. This is because the distribution of entities and aspects in the data is highly skewed. There are many more aspects than entities, as we can see from Table 1. When an entity term is wrongly classified as an aspect, it has much less impact on the aspect result than on the entity result.
Accuracy Results Comparison:

Conclusion
This paper studied the problem of classifying opinion targets into entities and aspects. To the best of our knowledge, this problem has not been attempted in the unsupervised opinion target extraction setting. But this is an important problem because without separating or classifying them one will not know whether an opinion is about an entity as a whole or about a specific aspect of an entity. This paper proposed a novel method based on relaxation labeling and the paradigm of lifelong machine learning to solve the problem. Experimental results showed the effectiveness of the proposed method.