A Knowledge-driven Approach to Classifying Object and Attribute Coreferences in Opinion Mining

Classifying and resolving coreferences of objects (e.g., product names) and attributes (e.g., product aspects) in opinionated reviews is crucial for improving opinion mining performance. However, the task is challenging because one often needs domain-specific knowledge (e.g., iPad is a tablet and has the aspect resolution) to identify coreferences in opinionated reviews. Also, compiling a handcrafted and curated domain-specific knowledge base for each domain is very time-consuming and arduous. This paper proposes an approach to automatically mine and leverage domain-specific knowledge for classifying object and attribute coreferences. The approach extracts domain-specific knowledge from unlabeled review data and trains a knowledge-aware neural coreference classification model to leverage (useful) domain knowledge together with general commonsense knowledge for the task. Experimental evaluation on real-world datasets involving five domains (product types) shows the effectiveness of the approach.


Introduction
Coreference resolution (CR) aims to determine whether two mentions (linguistic referring expressions) corefer, i.e., refer to the same entity in the discourse model (Jurafsky, 2000; Ding and Liu, 2010; Atkinson et al., 2015; Lee et al., 2017; Joshi et al., 2019; Zhang et al., 2019b). The set of coreferring expressions forms a coreference chain or a cluster. Consider the following example: [S1] I bought a green Moonbeam for myself.
[S2] I like its voice because it is loud and long.
Here all colored and/or underlined phrases are mentions. Considering S1 (sentence 1) and S2 (sentence 2), the three mentions "I" and "myself" in S1 and "I" in S2 all refer to the same person and form a cluster. Similarly, "its" in S2 refers to the object "a green Moonbeam" in S1, giving the cluster {"its" (S2), "a green Moonbeam" (S1)}. The mentions "its voice" and "it" in S2 refer to the same attribute of the object "a green Moonbeam" in S1 and form the cluster {"its voice" (S2), "it" (S2)}.
CR is beneficial for many downstream NLP tasks such as question answering (Dasigi et al., 2019), dialog systems (Quan et al., 2019), entity linking (Kundu et al.), and opinion mining (Nicolov et al., 2008). In particular, for opinion mining tasks (Liu, 2012; Wang et al., 2016; Zhang et al., 2018; Ma et al., 2020), Nicolov et al. (2008) reported that performance improves by 10% when CR is used. The study by Ding and Liu (2010) also supports this finding. Considering the example above, without resolving "it" in S2 it is difficult to infer the opinion about the attribute "voice" (i.e., that the voice, which "it" refers to, is "loud and long"). Although CR plays such a crucial role in opinion mining, only limited research has been done on CR for opinionated reviews. CR in opinionated reviews (e.g., Amazon product reviews) mainly concerns resolving coreferences involving objects and their attributes. The objects in reviews are usually the names of products or services, while attributes are aspects of those objects (Liu, 2012).
Resolving coreferences in text broadly involves three tasks (although they are often performed jointly or via end-to-end learning): (1) identifying the list of mentions in the text (known as mention detection); (2) given a pair of candidate mentions, making a binary classification decision: coreferring or not (referred to as coreference classification); and (3) grouping coreferring mentions (referring to the same discourse entity) to form a coreference chain (known as clustering). In reviews, mention detection is equivalent to extracting entities and aspects, which has been widely studied in opinion mining and sentiment analysis (Hu and Liu, 2004; Qiu et al., 2011; Xu et al., 2019; Luo et al., 2019; Dragoni et al., 2019; Asghar et al., 2019). Also, once the coreferring mentions are detected via classification, clustering them is straightforward 1 . Thus, following (Ding and Liu, 2010), we focus only on the coreference classification task in this work, which we refer to as the object and attribute coreference classification (OAC2) task from here on. We formulate the OAC2 problem as follows.
Problem Statement. Given a review text u (context), an anaphor 2 p, and a mention m which refers to either an object or an attribute (including their position information), our goal is to predict whether the anaphor p refers to mention m, denoted by a binary class y ∈ {0, 1}. Note: an anaphor here can be a pronoun (e.g., "it"), a definite noun phrase (e.g., "the clock"), or an ordinal (e.g., "the green one").
In general, classifying coreferences needs intensive knowledge support. For example, to determine that "it" refers to "its voice" in S2, we need to know that a "voice" can be described as "loud and long" and that "it" cannot refer to "a green Moonbeam" in S1, since "Moonbeam" is a clock, which cannot be described as "long".
Product reviews contain a great many such domain-specific concepts, such as brands (e.g., "Apple" in the laptop domain), product names (e.g., "T490" in the computer domain), and aspects (e.g., "hand" in the alarm clock domain), that often do not exist in general knowledge bases (KBs) like WordNet (Miller, 1998), ConceptNet (Speer and Havasi, 2013), etc. Moreover, even if a concept exists in a general KB, its semantics may differ from that in a given product domain. For example, "Moonbeam" in a general KB is understood as "the light of the moon" or the name of a song, rather than a clock (in the alarm clock domain). To encode such domain-specific concepts, we need to mine and feed domain knowledge (e.g., "clock" for "Moonbeam", "laptop" for "T490") to a coreference classification model. Existing CR methods (Zhang et al., 2019b) do not leverage such domain knowledge and thus often fail to resolve coreferences that require explicit reasoning over domain facts.
In this paper, we propose to automatically mine such domain-specific knowledge from unlabeled reviews and to leverage the useful pieces of the extracted domain knowledge, together with (general/commonsense) knowledge from general KBs, to solve the OAC2 task 3 . Note that the extracted domain knowledge and the general knowledge from existing general KBs are both considered candidate knowledge. To leverage this knowledge, we design a novel knowledge-aware neural coreference classification model that selects the useful (candidate) knowledge with an attention mechanism. We discuss our approach in detail in Section 3.
The main contributions of this work can be summarized as follows: 1. We propose a knowledge-driven approach to solving OAC2 in opinionated reviews. Unlike existing approaches that mostly deal with general CR corpora and pronoun resolution, we show the importance of leveraging domain-specific knowledge for OAC2.
2. We propose a method to automatically mine domain-specific knowledge and design a novel knowledge-aware coreference classification model that leverages both domain-specific and general knowledge.
3. We collect a new review dataset 4 with five domains or product types (including both unlabeled and labeled data) for evaluation. Experimental results show the effectiveness of our approach.

Related Work
Coreference resolution has long been studied in NLP. Early approaches were mainly rule-based (Hobbs, 1978) and feature-based (Ding and Liu, 2010; Atkinson et al., 2015), focusing on lexical and grammatical properties and semantic information. Recently, end-to-end deep neural models (Lee et al., 2017; Joshi et al., 2019) have dominated coreference resolution research, but they do not use external knowledge. Considering CR approaches that do use external knowledge, Aralikatte et al. (2019) solved the CR task by incorporating knowledge or information into reinforcement learning models. Emami et al. (2018) solved a binary-choice coreference resolution task by leveraging information retrieval results from search engines. Zhang et al. (2019a,b) solved pronoun coreference resolution by leveraging contextual features, linguistic features, and external knowledge, using knowledge attention. However, these works did not deal with opinionated reviews and did not mine or use domain-driven knowledge.
In regard to CR in opinion mining, Ding and Liu (2010) formally introduced the OAC2 task for opinionated reviews, which is perhaps the only prior study on this problem. However, it only focused on classifying coreferences in comparative sentences (not on all review sentences). We compare our approach with (Ding and Liu, 2010) in Section 4.
Many existing general-purpose CR datasets are not suitable for our task, including MUC-6 and MUC-7 (Hirschman and Chinchor, 1998), ACE (Doddington et al., 2004), OntoNotes (Pradhan et al., 2012), and WikiCoref (Ghaddar and Langlais, 2016). Bailey et al. (2015) proposed an alternative Turing test, comprising a binary-choice CR task that requires significant commonsense knowledge. A more recent study proposed visual pronoun coreference resolution in dialogues, which requires the model to incorporate image information. These datasets are also not suitable for us as they do not contain opinionated reviews. We do not focus on pronoun resolution here because, in opinion text such as reviews, discussions, and blogs, personal pronouns mostly refer to one person (Ding and Liu, 2010). Also, we aim to leverage domain-specific knowledge mined from (unlabeled) domain-specific reviews to help the CR task, which has not been studied in any of these existing CR works.

Proposed Approach
Model Overview. Our approach consists of three main steps. (1) Knowledge acquisition: given the (input) pair of mention m (e.g., "a green Moonbeam") and anaphor p (e.g., "it") and the context t (i.e., the review text), we acquire candidate knowledge involving m, denoted K_m, which consists of both domain knowledge (mined from unlabeled reviews) and general knowledge (compiled from existing general KBs) (discussed in Section 3.1). (2) Syntax-based span representation: we extract syntax-related phrases for mention m and anaphor p. Syntax-related phrases are noun phrases, verbs, or adjectives that have a dependency relation 5 with m (or p). For example, "bought" is a syntax-related phrase of the mention "a green Moonbeam", and "like" and "voice" are two syntax-related phrases of the anaphor "it" in the example review text in Section 1. Once the syntax-related phrases are extracted and the candidate knowledge is prepared for m and p, we learn vector representations of the phrases and the knowledge (discussed in Section 3.2), which are used in step 3. (3) Knowledge-driven OAC2 model: we select and leverage useful candidate domain knowledge together with general knowledge to solve the OAC2 task. Figure 1 shows our model architecture. Table 1 summarizes a (non-exhaustive) list of notations used repeatedly in subsequent sections.

Knowledge Acquisition
Domain Knowledge Mining. Given the mention m, we first split the mention into words. Here, we only keep the words that satisfy one of the following two conditions 6 : (1) the word is a noun (determined by its POS tag); (2) the word is part of a named entity (determined by NER). For example, "a westclox clock" yields the words "westclox" and "clock". We use the mention words as keys to search a domain knowledge base (KB) and retrieve domain knowledge K^d_m for m. To construct the domain KB, we use unlabeled review data from the particular domain. Specifically, all unlabeled sentences that contain mention words are extracted, and domain knowledge is collected from them. The elements of K^d_m are noun, adjective, and verb phrases co-occurring with m in the unlabeled review sentences.
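The mining step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the POS tags and named-entity flags are supplied by hand as stand-ins for a real tagger and NER system, and sentences are given as pre-chunked phrase lists.

```python
from collections import defaultdict

def mention_words(tokens, pos_tags, entity_flags):
    """Keep a mention's words that are nouns or part of a named entity.
    (The paper uses POS tagging and NER; here tags are hand-supplied.)"""
    return [w for w, pos, is_ent in zip(tokens, pos_tags, entity_flags)
            if pos == "NOUN" or is_ent]

def build_domain_kb(unlabeled_sentences, keywords):
    """Collect candidate domain knowledge: phrases co-occurring with the
    mention words in unlabeled review sentences."""
    kb = defaultdict(int)  # phrase -> co-occurrence count with the mention
    for phrases in unlabeled_sentences:
        if any(k in phrases for k in keywords):
            for ph in phrases:
                if ph not in keywords:
                    kb[ph] += 1
    return dict(kb)

# "a westclox clock" -> keep "westclox" (named entity) and "clock" (noun)
words = mention_words(["a", "westclox", "clock"],
                      ["DET", "PROPN", "NOUN"],
                      [False, True, False])

sentences = [["westclox", "alarm", "loud"],
             ["clock", "alarm", "hang"],
             ["camera", "lens"]]
kb = build_domain_kb(sentences, set(words))
```

The resulting counts feed directly into the tf-idf filtering step described next.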
Domain Knowledge Filtering. Some domain knowledge (i.e., co-occurring phrases) can be too general to help reason over the mention. For example, given the mention "Moonbeam", the verb "like" can relate to any object or attribute and is thus not very useful knowledge for describing the mention. To filter such unimportant phrases from K^d_m, we use tf-idf (Aizawa, 2003) scoring. Given mention m and a phrase k ∈ K^d_m, we compute the tf-idf score of k, denoted tf-idf_k, as

tf-idf_k = C_k · log( |T^d| / |{t ∈ T^d : k occurs in t}| )

where C_k denotes the co-occurrence count of phrase k with m in the unlabeled domain reviews T^d and |·| denotes set count. We retain phrase k in K^d_m only if tf-idf_k is no less than a threshold ρ.

General Knowledge Acquisition. General knowledge bases like ConceptNet, WordNet, etc. store facts as triples of the form (e_1, r, e_2), denoting that entity e_1 is related to entity e_2 by relation r, e.g., ("clock", "UsedFor", "set an alarm").
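The filtering step can be sketched as below. Note that the exact scoring formula is our reconstruction of the paper's tf-idf variant (co-occurrence count times inverse sentence frequency); ρ = 5.0 follows the threshold reported in the experiments.

```python
import math

def tfidf_filter(kb_counts, sentences, rho=5.0):
    """Keep only domain-knowledge phrases whose tf-idf-style score reaches
    the threshold rho. score = C_k * log(|T_d| / df_k), where df_k is the
    number of unlabeled sentences containing phrase k."""
    n = len(sentences)
    kept = {}
    for k, c_k in kb_counts.items():
        df = sum(1 for s in sentences if k in s)  # sentences mentioning k
        if df == 0:
            continue
        score = c_k * math.log(n / df)
        if score >= rho:
            kept[k] = score
    return kept

# "alarm" co-occurs often but in few sentences -> high idf, kept;
# "like" appears almost everywhere -> low idf, filtered out.
sents = [{"alarm", "like"}, {"alarm", "like"}] + [{"like"}] * 7 + [{"hand"}]
kept = tfidf_filter({"alarm": 6, "like": 9}, sents)
```

Generic verbs like "like" score low even with a high co-occurrence count, which is exactly the behavior the filtering step is after.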
To acquire and use general knowledge for mention m, we first split m into words (in the same way as during domain knowledge construction) and use these words as keywords to retrieve triples such that one of the entities (in a given triple) contains a word of m. Finally, we collect the set of entities (from the retrieved triples) as general knowledge for m by selecting, from each retrieved triple, the other entity (i.e., the one not involving a mention word).
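The retrieval just described can be sketched over a toy triple store. This is an illustrative linear scan with made-up triples; a real general KB such as ConceptNet would be queried through an indexed lookup.

```python
def retrieve_general_knowledge(mention_words, triples):
    """From (e1, relation, e2) triples, return the 'other' entity of every
    triple in which one entity contains a mention word."""
    knowledge = set()
    for e1, _rel, e2 in triples:
        if any(w in e1.split() for w in mention_words):
            knowledge.add(e2)
        elif any(w in e2.split() for w in mention_words):
            knowledge.add(e1)
    return knowledge

toy_kb = [("clock", "UsedFor", "set an alarm"),
          ("alarm clock", "AtLocation", "bedroom"),
          ("song", "HasProperty", "long")]
facts = retrieve_general_knowledge(["clock"], toy_kb)
```

For the mention word "clock", both "set an alarm" and "bedroom" are collected as general knowledge, while the "song" triple is ignored.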

Syntax-based Span Representation
Once the domain-specific and general knowledge for mention m is acquired, we extract all syntax-related phrases for m and anaphor p from the review text t (see "Model Overview" in Section 3). We denote the syntax-related phrases of m and p as S_m and S_p respectively.
We represent the mention, the anaphor, the syntax-related phrases, and the phrases of knowledge from the domain-specific and general KBs as spans (continuous sequences of words), and learn a vector representation for each span (a span vector) based on the embeddings of the words that compose the span. The span vectors are then used by our knowledge-driven OAC2 model (discussed in Section 3.3) for solving the OAC2 task. Below, we discuss span vector representation learning for a given span (corresponding to a syntax-related phrase or a phrase in a KB).
We use BERT (Devlin et al., 2019) to learn the vector representation for each span. To encode the words in a span, we use BERT's WordPiece tokenizer. Given a span x, let {x_i}_{i=1}^{N_1} be the output token embeddings of x from BERT, where N_1 is the total number of word-piece tokens for span x.
BERT is a neural model consisting of stacked attention layers. To incorporate syntax-based information, we want the head of a span, and words that have a modifier relation to the head, to have higher attention weights. To achieve this, we adopt syntax-based attention: the weight of a word in a span depends on the dependency parse of the span. Note that the dependency parsing of a span is different from what is described in Section 3.1: the dependency parsing in Section 3.1 extracts relations between chunks of words, while here we extract relations between single words.
An example is shown in the top-left corner of Figure 1. The head of "a green Moonbeam" is "Moonbeam", which we want to have the highest attention weight when computing the embedding of the span. The dependency-path distances of ("a", "Moonbeam") and ("green", "Moonbeam") are both 1.
To learn the span vector v_x for span x, we first compute an attention weight b_i for each x_i. The raw score is

f_i = FFN_1([x_i, x_head, x_i ⊙ x_head])

and the weights b_i are obtained by normalizing the scores f_i with a softmax restricted to tokens whose dependency-path distance to the head satisfies l_i ≤ L, where FFN_1 is a feed-forward layer that projects the input into a score f_i, ⊙ is element-wise multiplication, [·, ·] is concatenation, x_head is the head of the span, l_i is the distance to the head along the dependency path, and L is the attention window size. Next, we learn the attention-based representation of the span, denoted x̄, as

x̄ = Σ_i b_i x_i .

Finally, following (Lee et al., 2017), we concatenate the start and end word embeddings x_start and x_end, the attention-based representation x̄, and a length feature φ(x) to learn the span vector v_x:

v_x = FFN_2([x_start, x_end, x̄, φ(x)])

where FFN_2 is a feed-forward layer.
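The pooling idea can be illustrated with a stripped-down sketch. This is not the paper's model: the feed-forward scorer FFN_1 is replaced by a plain dot product with the head embedding, and the start/end/length features are omitted; only the window-masked softmax pooling is shown, with tiny hand-written 2-d embeddings.

```python
import math

def span_vector(token_vecs, head_idx, dep_dists, window=2):
    """Attention-pooled span vector: scores are dot products with the head
    embedding, masked so that only words within dependency-path distance
    l_i <= L of the head receive any weight."""
    head = token_vecs[head_idx]
    scores = [sum(a * b for a, b in zip(vec, head)) if dist <= window
              else float("-inf")  # outside the attention window
              for vec, dist in zip(token_vecs, dep_dists)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * vec[d] for w, vec in zip(weights, token_vecs))
              for d in range(len(head))]
    return weights, pooled

# "a green Moonbeam": head is "Moonbeam" (index 2); "a" and "green" are
# both at dependency-path distance 1 from the head (toy 2-d embeddings).
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = span_vector(vecs, head_idx=2, dep_dists=[1, 1, 0])
```

As desired, the head token ends up with the largest attention weight, so the pooled vector is dominated by "Moonbeam".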

Knowledge-driven OAC2 Model
The knowledge-driven OAC2 model leverages the syntax-related phrases together with the domain knowledge and general knowledge to solve the OAC2 task. The model first computes three relevance scores: (a) a contextual relevance score F_C between m and p, (b) a knowledge-based relevance score F_K between m and p, and (c) a relevance score F_SK between the knowledge and the syntax-related phrases (see Figure 1). These scores are then summed to compute the final prediction score F̂:

F̂ = sigmoid(F_C + F_K + F_SK) .

(a) Contextual Relevance Score (F_C). F_C is computed from the context t, mention m, and anaphor p. We use BERT to encode t. Let the output BERT embeddings of the words in t be {t_i}_{i=1}^{N_2}, where N_2 is the length of t, and let the span vector representations of m and p be v_m and v_p respectively. For each v ∈ {v_m, v_p}, we compute cross attention between t and v by scoring each context word with a feed-forward layer FFN_3 over [t_i, v, t_i ⊙ v] and normalizing the scores with a softmax. Through this interaction of {t_i}_{i=1}^{N_2} with v_m and v_p, we obtain attention-based vector representations {w^m_i}_{i=1}^{N_2} and {w^p_i}_{i=1}^{N_2} for m and p respectively. Next, for each context word we concatenate these vectors and their pointwise multiplication, sum the concatenated representations over the context, and feed the result to a feed-forward layer FFN_4 to compute F_C ∈ R^{1×1}:

F_C = FFN_4( Σ_i [w^m_i, w^p_i, w^m_i ⊙ w^p_i] ) .
(b) Knowledge-based Relevance Score (F_K). The OAC2 model leverages the external knowledge to compute a relevance score F_K between m and p. Let v_m and v_p be the span vectors for m and p, and let {v^K_i}_{i=1}^{N_3} be the span vectors for the phrases in K_m (see Section 3.1 and Table 1), where N_3 is the size of K_m. We compute F_K from v_m, v_p, and {v^K_i}_{i=1}^{N_3} as follows. To leverage the external knowledge, we first learn cross attention between the mention and the knowledge, scoring each knowledge phrase with a feed-forward layer FFN_5 over [v_m, v^K_i, v_m ⊙ v^K_i] and normalizing the scores with a softmax into weights a_i. Next, we learn an attention-based representation v̄_m of mention m as

v̄_m = Σ_i a_i v^K_i .

We then concatenate v_m, v_p, and the attention-based representation v̄_m and learn the interaction between them with a feed-forward layer FFN_6 to compute F_K ∈ R^{1×1}.
(c) Syntax-related Phrase Relevance Score (F_SK). F_SK measures the relevance between the knowledge (i.e., phrases) in K_m and the syntax-related phrases in S_m (S_p) corresponding to m (p).
Let v^K_i be the span vector for the i-th phrase in K_m, and let v^m_i (v^p_i) be the span vector for the i-th phrase in S_m (S_p). We concatenate these span vectors row-wise to form matrices M_K, M_m, and M_p. Next, we learn the interaction between these matrices using scaled dot-product attention (Vaswani et al., 2017):

M̄_m = softmax(M_m M_K^T / √d) M_K ,  M̄_p = softmax(M_p M_K^T / √d) M_K ,

where d is the span vector dimension. Finally, the syntax-related phrase relevance score F_SK ∈ R^{1×1} is computed from M̄_m and M̄_p with two feed-forward network layers FFN_7 and FFN_8.

Loss Function. As described above, given the three scores F_C, F_K, and F_SK, we sum them and feed the sum into a sigmoid function to get the final prediction F̂. The proposed model is trained end-to-end by minimizing the following cross-entropy loss L:

L = -(1/N) Σ_{i=1}^{N} [ y_i log F̂_i + (1 - y_i) log(1 - F̂_i) ]

where N is the number of training examples and y_i is the ground-truth label of the i-th training example.
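The final scoring and training objective can be sketched in a few lines. The three score values here are arbitrary placeholders standing in for the outputs of the scoring modules above.

```python
import math

def predict(f_c, f_k, f_sk):
    """Sum the three relevance scores and squash with a sigmoid to get the
    coreference probability F-hat."""
    return 1.0 / (1.0 + math.exp(-(f_c + f_k + f_sk)))

def bce_loss(preds, labels):
    """Binary cross-entropy over N training examples, matching the loss L."""
    n = len(preds)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, labels)) / n

p = predict(0.4, 0.3, 0.3)          # the three scores sum to 1.0
loss = bce_loss([p, 0.5], [1, 0])
```

A prediction above 0.5 corresponds to a positive (coreferring) decision, and the loss pushes the summed score toward +∞ for positive pairs and −∞ for negative ones.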

Experiments
We evaluate our proposed approach on five datasets from five different domains: (1) alarm clock, (2) camera, (3) cellphone, (4) computer, and (5) laptop, and perform both quantitative and qualitative analyses of the proposed model's predictive performance and its ability to use domain-specific knowledge.

Evaluation Setup
Labeled Data Collection. We use the product review dataset 7 from Chen and Liu (2014), where each product (domain) has 1,000 unlabeled reviews. For each domain, we randomly sample 100 reviews, extract a list of (mention, anaphor) pairs from each of those reviews, and label them manually with ground-truth labels. That is, given a review text and a candidate (mention, anaphor) pair, we assign a binary label denoting whether they corefer or not. In other words, we view each labeled example as a triple (u, m, p), consisting of the context u, a mention m, and an anaphor p. Considering the review example in Section 1, the triple ("I bought . . . loud and long", "a green Moonbeam", "its") is a positive example, since "a green Moonbeam" and "its" refer to the same entity (i.e., they are in the same coreference cluster). Negative examples are naturally constructed by selecting m and p from two different clusters under the same context, e.g., ("I bought . . . loud and long", "a green Moonbeam", "its voice").
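The positive/negative construction above can be sketched as follows; the cluster lists are taken from the running example, and the pairing scheme (all within-cluster pairs positive, all cross-cluster pairs negative) is our reading of the described procedure.

```python
def make_examples(context, clusters):
    """Build labeled (context, mention, anaphor, y) examples from
    coreference clusters: pairs inside a cluster are positives (y=1),
    pairs drawn from two different clusters are negatives (y=0)."""
    examples = []
    for i, ci in enumerate(clusters):
        # positives: every pair within the same cluster
        for a in range(len(ci)):
            for b in range(a + 1, len(ci)):
                examples.append((context, ci[a], ci[b], 1))
        # negatives: pairs across two different clusters, same context
        for cj in clusters[i + 1:]:
            for m in ci:
                for p in cj:
                    examples.append((context, m, p, 0))
    return examples

ctx = "I bought a green Moonbeam ... loud and long"
exs = make_examples(ctx, [["a green Moonbeam", "its"], ["its voice", "it"]])
```

With the two example clusters, this yields two positive pairs and four negative pairs under the same context.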
Next, we randomly split the set of all labeled examples (for a given domain) into 80% for training, 10% for development, and the remaining 10% for testing. The remaining 900 unlabeled reviews form the unlabeled domain corpus, which is used for domain-specific knowledge extraction (as discussed in Section 3.1). All review sentences and (mention, anaphor) pairs were annotated independently by two annotators, who strictly followed the MUC-7 annotation standard (Hirschman and Chinchor, 1998). The Cohen's kappa coefficient between the two annotators is 0.906. When a disagreement occurred, the two annotators adjudicated to make a final decision. Table 2 provides the statistics of the labeled data used for training, development, and testing in each of the five domains.
Knowledge Resources. We used three types of knowledge resources as listed below. The first two are general KBs, while the third one is our mined domain-specific KB.
2. SenticNet (Cambria et al., 2016). SenticNet is another commonsense knowledge base, containing 50k concepts associated with affective properties including sentiment information. To make this knowledge base fit deep neural models, we concatenate SenticNet embeddings with BERT embeddings to extend the embedding information.
3. Domain-specific KB. This is mined from the unlabeled review dataset as discussed in Sec 3.1.
Hyper-parameter Settings. Following previous work (Joshi et al., 2019), we use (Base) BERT 8 embeddings for context and knowledge representation (as discussed in Section 3). The number of training epochs is empirically set to 20. We train five models on the five datasets separately, because the domain knowledge learned from one domain may conflict with that from others. For generality and model extensibility, we use the same hyper-parameter settings for the models built on all five domains. We select the best model setting based on development-set performance, averaging the five F1-scores over the five datasets. The best model uses a maximum sequence length of 256, dropout of 0.1, a learning rate of 3e-5 with linear decay of 1e-4 for parameter learning, and ρ = 5.0 (the tf-idf threshold) for domain-specific knowledge extraction (Section 3.1). The baseline models are tuned in the same way as our model.
Baselines. We compare against the following state-of-the-art models from existing work on the CR task: (1) Review CR (Ding and Liu, 2010): A review-specific CR model that incorporates opinion-mining-based features and linguistic features.
(2) Review CR+BERT: For a fairer comparison, we further combine BERT with the features from (Ding and Liu, 2010) as additional features. Specifically, we combine them with the context-based BERT used to compute F_C (m, p) (see Section 3.3 (a)).
(4) C2f-Coref+BERT (Joshi et al., 2019): This model integrates BERT into C2f-Coref. We use its independent setting, which uses non-overlapping segments of a paragraph, as it is the best-performing variant in Joshi et al. (2019).
(5) Knowledge+BERT (Zhang et al., 2019b): A state-of-the-art knowledge-based model, which leverages different types of general knowledge and contextual information via an attention module over knowledge. The general knowledge includes the aforementioned OMCS, linguistic features, and selectional preference knowledge extracted from Wikipedia. For a fair comparison, we replace its entire LSTM-based encoder with a BERT-base transformer.
To accommodate the aforementioned baseline models in our setting, which takes the context, anaphor, and mention as input and performs binary classification, we change the input and output of the baseline models: each model computes a score between the mention and the anaphor and feeds the score to a sigmoid function to obtain a value in [0, 1]. Note that this setting is used consistently for all candidate models (including our proposed model).

Table 3: Performance (+ve F1 scores) of all models on all test datasets. Here, "cam", "com", "lap" are abbreviations for "camera", "computer", "laptop" respectively.
Evaluation Metrics. As we aim to solve the OAC2 problem, a focused coreference classification task, we use the standard F1-score metric, following the setting of the prior study (Ding and Liu, 2010). In particular, we report the positive (+ve) F1-score [F1(+)]. The average +ve F1-score is computed over the five domains.

Results and Analysis
Comparison with baselines. Table 3 reports the F1 scores of all models for each of the five domains and the average F1 over all domains. We observe the following: (1) Overall, our model performs best across all five domains, outperforming the no-knowledge baseline C2f-Coref+BERT by 3.14% on average and by 3.8% on the cellphone domain. (2) Knowledge+BERT is the strongest baseline, outperforming the other three baselines, which again shows the importance of leveraging external knowledge for the OAC2 task. However, our model achieves superior performance over Knowledge+BERT, indicating that leveraging domain-specific knowledge indeed helps. (3) C2f-Coref+BERT achieves better scores than C2f-Coref and Review CR, demonstrating that both representation (using pre-trained BERT) and neural architectures are important for feature fusion in this task.
Ablation study. To gain further insight, we ablate various components of our model, with the results reported in Table 4. For simplicity, we only show the average F1-scores over the five domain datasets. The results indicate how each knowledge resource or module contributes, from which we make the following observations. 1. From the comparison of knowledge resources in Table 4, we see that the domain knowledge contributes notably to the performance. 2. Considering the comparison of the various types of scores in Table 4, we see that disabling the context score F_C causes the highest drop in performance, showing the importance of contextual information for this task. Disabling the knowledge scores F_K and F_SK also hurts the predictive performance of the model.
3. From the comparison of attention mechanisms for span representation in Table 4, we see that an attention layer is necessary before summing the embeddings of the words in a span. Note that we use the syntax-based attention instead of the popular dot attention of (Vaswani et al., 2017) for span representation. The syntax-based attention layer performs slightly better than the dot attention layer, and we also prefer it for its better interpretability.
Qualitative Evaluation. We first give a real example showing the effectiveness of our model compared with the two baseline models C2f-Coref+BERT and Knowledge+BERT. Table 5 shows a sample in the alarm clock domain. Here the major difficulty is identifying "Moonbeam" as a "clock". Knowledge+BERT fails due to its lack of domain-specific knowledge. C2f-Coref+BERT fails as well because it tries to infer from contextual information alone, with no domain knowledge support. In contrast, with our domain-specific knowledge base incorporated, "Moonbeam" can be matched to knowledge such as "clock", "alarm", and "hang" (marked in green). Our model therefore handles this case successfully. In other words, in our model, not only the mention "a green Moonbeam" but also the syntax-related phrase "a gold band" of "the clock" is jointly considered in reasoning. This shows the modeling superiority of our knowledge-aware solution. Table 6 shows the effectiveness of our extraction module introduced in Section 3.1, especially the use of tf-idf to filter out useless knowledge.

Conclusion
This paper proposed a knowledge-driven approach to object and attribute coreference classification in opinion mining. The approach automatically extracts domain-specific knowledge from unlabeled data and leverages it together with general knowledge to solve the problem. We also created a set of annotated opinionated review data (covering 5 domains) for object and attribute coreference evaluation. Experimental results show that our approach achieves state-of-the-art performance.