Grid Tagging Scheme for Aspect-oriented Fine-grained Opinion Extraction

Aspect-oriented Fine-grained Opinion Extraction (AFOE) aims at extracting aspect terms and opinion terms from reviews in the form of opinion pairs, or additionally extracting the sentiment polarity of each aspect term to form opinion triplets. Because it involves several opinion factors, the complete AFOE task is usually divided into multiple subtasks and accomplished in a pipeline. However, pipeline approaches easily suffer from error propagation and are inconvenient in real-world scenarios. To this end, we propose a novel tagging scheme, Grid Tagging Scheme (GTS), to address the AFOE task in an end-to-end fashion with only one unified grid tagging task. Additionally, we design an effective inference strategy on GTS to exploit the mutual indications between different opinion factors for more accurate extraction. To validate the feasibility and compatibility of GTS, we implement three different GTS models respectively based on CNN, BiLSTM, and BERT, and conduct experiments on aspect-oriented opinion pair extraction and opinion triplet extraction datasets. Extensive experimental results indicate that GTS models outperform strong baselines significantly and achieve state-of-the-art performance.


Introduction
Aspect-oriented Fine-grained Opinion Extraction (AFOE) aims to automatically extract opinion pairs (aspect term, opinion term) or opinion triplets (aspect term, opinion term, sentiment) from review text, which is an important task for fine-grained sentiment analysis (Pang and Lee, 2007; Liu, 2012). In this task, aspect terms and opinion terms are two key opinion factors. An aspect term, also known as an opinion target, is a word or phrase in a sentence representing a feature or entity of products or services. An opinion term refers to the term in a sentence used to express attitudes or opinions explicitly. For example, in the sentence of Figure 1, "hot dogs" and "coffee" are two aspect terms, and "top notch" and "average" are two opinion terms.
To obtain the above two opinion factors, many works are devoted to the co-extraction of aspect terms and opinion terms in a joint framework (Wang et al., 2016, 2017; Li and Lam, 2017; Dai and Song, 2019). However, the extracted results of these works are two separate sets of aspect terms and opinion terms, and they neglect the pair relation between them, which is crucial for downstream sentiment analysis tasks and has many potential applications, such as providing sentiment clues for aspect-level sentiment classification (Pontiki et al., 2014), generating fine-grained opinion summarization (Zhuang et al., 2006), or analyzing in-depth opinions (Kobayashi et al., 2007).
Opinion pair extraction (OPE) is to extract all opinion pairs from a sentence in the form of (aspect term, opinion term). An opinion pair consists of an aspect term and a corresponding opinion term. This task needs to extract three opinion factors, i.e., aspect terms, opinion terms, and the pair relation between them. Figure 1 shows an example. We can see that the sentence "the hot dogs are top notch but average coffee!" contains two opinion pairs, respectively (hot dogs, top notch) and (coffee, average), where the former element is the aspect term and the latter is the corresponding opinion term. OPE can sometimes be complicated because an aspect term may correspond to several opinion terms and vice versa. Despite the great importance of OPE, it is still under-investigated, and only a few early works mentioned or explored this task (Hu and Liu, 2004; Zhuang et al., 2006; Klinger and Cimiano, 2013b; Yang and Cardie, 2013).
To address the above issues and facilitate the research of AFOE, we propose a novel tagging scheme, Grid Tagging Scheme (GTS), which transforms opinion pair extraction into one unified grid tagging task. In this grid tagging task, we tag all word-pair relations and then decode all opinion pairs simultaneously with our proposed decoding method. Accordingly, GTS can extract all opinion factors of OPE in one step, instead of in a pipeline. Furthermore, different opinion factors are mutually dependent and indicative in the OPE task. For example, if we know "average" is an opinion term in Figure 1, then "coffee" can probably be deduced as an aspect term because "average" is its modifier. To exploit these potential bridges, we specially design an inference strategy in GTS to yield more accurate opinion pairs. In the experiments, we implement three GTS models, respectively with CNN, BiLSTM, and BERT, to demonstrate the effectiveness and compatibility of GTS.
Besides OPE, we find that GTS is easily extended to aspect-oriented Opinion Triplet Extraction (OTE) by replacing the pair relation detection of OPE with specific sentiment polarity detection. OTE, also called aspect sentiment triplet extraction (ASTE) (Peng et al., 2019), is a new fine-grained sentiment analysis task and aims to extract all opinion triplets (aspect term, opinion term, sentiment) from a sentence. To tackle the task, Peng et al. (2019) propose a two-stage framework that still extracts the pair (aspect term, opinion term) in a pipeline, thus suffering from error propagation. In contrast, GTS can extract all opinion triplets simultaneously with only a unified grid tagging task.
The main contributions of this work can be summarized as follows: • We propose a novel tagging scheme, Grid Tagging Scheme (GTS). To the best of our knowledge, GTS is the first work to address the complete aspect-oriented fine-grained opinion extraction, including OPE and OTE, with one unified tagging task instead of pipelines. Besides, this new scheme is easily extended to other pair/triplet extraction tasks from text.
• For the potential mutual indications between different opinion factors, we design an effective inference strategy on GTS to exploit them for more accurate extractions.
• We implement three GTS neural models respectively with CNN, BiLSTM, and BERT, and conduct extensive experiments on both the OPE and OTE tasks to verify the compatibility and effectiveness of GTS.
The rest of this paper is organized as follows. Section 2 presents our proposed Grid Tagging Scheme. In Section 3, we introduce the models based on GTS and the inference strategy. Section 4 shows the experiment results. Section 5 and Section 6 are respectively related work and conclusions. Our code and data will be available at https://github.com/NJUNLP/GTS.

Grid Tagging Scheme
In this section, we first give the task definition of Opinion Pair Extraction (OPE) and Opinion Triplet Extraction (OTE), then explain how the two tasks are represented in Grid Tagging Scheme. Finally, we present how to decode opinion pairs or opinion triplets according to the tagging results in GTS.

Task Definition
We first introduce the definition of the OPE task. Given a sentence s = {w_1, w_2, ..., w_n} consisting of n words, the goal of the OPE task is to extract a set of opinion pairs P = {(a, o)_m}_{m=1}^{|P|} from the sentence s, where (a, o)_m is an opinion pair in s. The notations a and o respectively denote an aspect term and an opinion term. They are two non-overlapping spans in s.
As for the OTE task, it additionally extracts the corresponding sentiment polarity of each opinion pair (a, o), i.e., it extracts a set of opinion triplets T = {(a, o, c)_m}_{m=1}^{|T|} from the given sentence s, where c denotes the sentiment polarity and c ∈ {positive, neutral, negative}.
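As a concrete illustration, the expected outputs of the two tasks for the sentence in Figure 1 can be written down as plain Python structures. The span indices and the sentiment labels below are ours, for illustration only.

```python
# The sentence from Figure 1, tokenized into words.
sentence = ["the", "hot", "dogs", "are", "top", "notch",
            "but", "average", "coffee"]

# OPE output: a set of (aspect span, opinion span) pairs.
# Spans are (start, end) word indices, inclusive.
opinion_pairs = {((1, 2), (4, 5)),   # (hot dogs, top notch)
                 ((8, 8), (7, 7))}   # (coffee, average)

# OTE output: each pair additionally carries a sentiment polarity.
# The polarity labels here are illustrative.
opinion_triplets = {((1, 2), (4, 5), "positive"),
                    ((8, 8), (7, 7), "neutral")}
```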

Grid Tagging
To tackle the OPE task, Grid Tagging Scheme (GTS) uses four tags {A, O, P, N} to represent the relation of any word-pair (w_i, w_j) in a sentence.
Here the word-pair (w_i, w_j) is unordered, and thus the word-pairs (w_i, w_j) and (w_j, w_i) have the same relation. The meanings of the four tags are given in Table 1. In GTS, the tagging result of a sentence is like a grid after displaying it in rows and columns. For simplicity, we adopt an upper triangular grid. Figure 2 shows the tagging results of the sentence of Figure 1 in GTS.

Table 1: Meanings of the four tags in GTS.

Tags  Meanings
A     The two words of the word-pair (w_i, w_j) belong to the same aspect term.
O     The two words of the word-pair (w_i, w_j) belong to the same opinion term.
P     The two words of the word-pair (w_i, w_j) respectively belong to an aspect term and an opinion term, and the two terms form an opinion pair.
N     None of the above three relations holds for the word-pair (w_i, w_j).

Figure 2: A tagging example with GTS for the OPE task on the sentence "The hot dogs are top notch but average coffee". The spans highlighted in red are aspect terms and the spans in blue are opinion terms.
Specifically, the tag A represents that the two words of the word-pair (w_i, w_j) belong to the same aspect term. For example, the position of the word-pair (hot, dogs) in Figure 2 is filled with the tag A. Similarly, the tag O indicates that the two words of the word-pair (w_i, w_j) belong to the same opinion term. Notably, GTS also considers the word-pair (w_i, w_i), i.e., the relation of each word to itself, which helps represent a single-word aspect term or opinion term. The tag P represents that the two words of the word-pair (w_i, w_j) respectively belong to an aspect term and an opinion term, and the two terms form an opinion pair, such as the word-pairs (hot, top) and (dogs, top) in Figure 2. The last tag N denotes that none of the above relations holds for the word-pair (w_i, w_j).
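To make the tagging concrete, the following small sketch (ours, not the authors' released code) fills the upper-triangular OPE grid from gold spans and pairs; spans are (start, end) word indices, inclusive.

```python
# Build the upper-triangular GTS grid for OPE from gold annotations.
def build_ope_grid(n, aspects, opinions, pairs):
    grid = [["N"] * n for _ in range(n)]          # only cells with j >= i are used
    def fill_span(span, tag):
        s, e = span
        for i in range(s, e + 1):
            for j in range(i, e + 1):             # word-pairs inside the same term
                grid[i][j] = tag
    for span in aspects:
        fill_span(span, "A")
    for span in opinions:
        fill_span(span, "O")
    for (a_s, a_e), (o_s, o_e) in pairs:
        for i in range(a_s, a_e + 1):
            for j in range(o_s, o_e + 1):
                lo, hi = min(i, j), max(i, j)     # word-pairs are unordered
                grid[lo][hi] = "P"
    return grid

# "the hot dogs are top notch but average coffee" (Figure 2)
grid = build_ope_grid(9,
                      aspects=[(1, 2), (8, 8)],
                      opinions=[(4, 5), (7, 7)],
                      pairs=[((1, 2), (4, 5)), ((8, 8), (7, 7))])
```

Note how the diagonal cells (w_i, w_i) carry A or O for single-word terms such as "coffee" and "average".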
To deal with the OTE task, GTS replaces the previous P tag with the specific sentiment label. To be specific, GTS adopts the tag set {A, O, Pos, Neu, Neg, N} to denote the relation of word-pair in the OTE task. The three tags Pos, Neu, Neg respectively indicate positive, neutral, or negative sentiment expressed in the opinion triplet consisting of the word-pair (w i , w j ). A tagging example of the OTE task is shown in Figure 3.
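The tag-set change from OPE to OTE can be sketched as a one-line mapping; the function and constant names below are ours.

```python
# OTE tag set: the pair tag P is replaced by the sentiment tag of the
# opinion triplet that the word-pair takes part in.
OTE_TAGS = ("A", "O", "Pos", "Neu", "Neg", "N")
SENTIMENT_TAG = {"positive": "Pos", "neutral": "Neu", "negative": "Neg"}

def ope_to_ote_tag(ope_tag, sentiment):
    """Map an OPE grid tag to its OTE counterpart; `sentiment` is the
    polarity of the triplet containing the word-pair."""
    return SENTIMENT_TAG[sentiment] if ope_tag == "P" else ope_tag
```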
Figure 3: A tagging example with GTS for the OTE task on the sentence "The hot dogs are top notch but average coffee".
It can be concluded that Grid Tagging Scheme successfully transforms end-to-end aspect-oriented fine-grained opinion extraction into a unified tagging task by labeling the relations of all word-pairs.

Decoding Algorithm
In this subsection, we focus on how to decode the final opinion pairs or opinion triplets from the tagging results of all word-pairs. In fact, various methods can be applied to obtain these tagging results, and we adopt neural network models in this work (see Section 3).
After obtaining the predicted tagging results of a sentence in GTS, we could extract opinion pairs or opinion triplets by strictly matching the relations of word-pairs as in Figure 2 and Figure 3. However, this might yield low recall due to the abundant N tags in GTS. To address this issue, we relax the matching constraints and design a simple but effective method to decode opinion pairs or opinion triplets.
The decoding details for the OPE task are shown in Algorithm 1.

Algorithm 1 Decoding Algorithm for OPE
Input: The tagging results T of a sentence in GTS. T(w_i, w_j) denotes the predicted tag of the word-pair (w_i, w_j).
Output: Opinion pair set P of the given sentence.
 1: Initialize the aspect term set A, opinion term set O, and opinion pair set P with ∅.
 2: while a span left index l ≤ n and right index r ≤ n do
 3:     if T(w_i, w_i) = A for all l ≤ i ≤ r then
 4:         Regard the words {w_l, ..., w_r} as an aspect term and add it to A
 5:     end if
 6:     if T(w_i, w_i) = O for all l ≤ i ≤ r then
 7:         Regard the words {w_l, ..., w_r} as an opinion term and add it to O
 8:     end if
 9: end while
10: while a ∈ A and o ∈ O do
11:     while w_i ∈ a and w_j ∈ o do
12:         if any T(w_i, w_j) = P then
13:             Add the pair (a, o) to P
14:         end if
15:     end while
16: end while
17: return the set P

Firstly, we use the predicted tags of all (w_i, w_i) word-pairs on the main diagonal to recognize aspect terms and opinion terms, without considering other word-pair constraints. As lines 2 to 9 of Algorithm 1 show, the spans comprised of continuous A tags are regarded as aspect terms, and spans consisting of continuous O tags are detected as opinion terms. For an extracted aspect term a and an opinion term o, we consider that they form an opinion pair on condition that at least one word-pair (w_i, w_j) with w_i ∈ a and w_j ∈ o is labeled with the tag P, as shown in lines 11 to 15.
For the OTE task, the decoding differs from the OPE task only in lines 11 to 15 of Algorithm 1. Specifically, we count the predicted tags of all word-pairs (w_i, w_j) with w_i ∈ a and w_j ∈ o. The most frequently predicted sentiment tag c ∈ {Pos, Neu, Neg} is regarded as the sentiment polarity of the opinion triplet (a, o, c). If the predicted tags do not contain any tag in {Pos, Neu, Neg}, we consider that a and o cannot form an opinion triplet.
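A possible Python rendering of this relaxed decoding (the OPE case plus the sentiment voting for OTE) might look as follows. The data layout, a dict `tags` over upper-triangular index pairs (i, j) with i <= j, and all names are our own assumptions.

```python
from collections import Counter

def decode_ope(tags, n):
    """Relaxed decoding for OPE over predicted word-pair tags."""
    def spans_of(tag):
        spans, i = [], 0
        while i < n:
            if tags[(i, i)] == tag:                # read the main diagonal
                j = i
                while j + 1 < n and tags[(j + 1, j + 1)] == tag:
                    j += 1
                spans.append((i, j))               # maximal continuous span
                i = j + 1
            else:
                i += 1
        return spans
    aspects, opinions = spans_of("A"), spans_of("O")
    pairs = []
    for a in aspects:
        for o in opinions:
            # relaxed matching: a single P tag between the spans suffices
            if any(tags[(min(i, j), max(i, j))] == "P"
                   for i in range(a[0], a[1] + 1)
                   for j in range(o[0], o[1] + 1)):
                pairs.append((a, o))
    return pairs

def vote_sentiment(tags, a, o):
    """OTE variant: the most frequently predicted sentiment tag wins."""
    counts = Counter(tags[(min(i, j), max(i, j))]
                     for i in range(a[0], a[1] + 1)
                     for j in range(o[0], o[1] + 1))
    best = [(c, t) for t, c in counts.items() if t in ("Pos", "Neu", "Neg")]
    return max(best)[1] if best else None          # None: no valid triplet

# "the hot dogs are top notch but average coffee"
n = 9
tags = {(i, j): "N" for i in range(n) for j in range(i, n)}
for ij in [(1, 1), (1, 2), (2, 2), (8, 8)]:
    tags[ij] = "A"
for ij in [(4, 4), (4, 5), (5, 5), (7, 7)]:
    tags[ij] = "O"
for ij in [(1, 4), (1, 5), (2, 4), (2, 5), (7, 8)]:
    tags[ij] = "P"
pairs = decode_ope(tags, n)
```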

Validation Models
To verify the effectiveness and good compatibility of GTS, we respectively try three typical neural networks, i.e., CNN, BiLSTM, and BERT, as encoder implementations of GTS (Section 3.1). Besides, different opinion factors in AFOE mutually rely on and can benefit each other. Therefore, we design an inference strategy to exploit these potential indications in Section 3.2. Figure 4 shows the overall architecture of GTS models.

Encoding
Given a sentence s = {w_1, w_2, ..., w_n}, CNN, BiLSTM, or BERT can be used as the encoder of GTS to generate the representation r_ij of the word-pair (w_i, w_j).
CNN. We follow the design of state-of-the-art aspect term extraction model DE-CNN (Xu et al., 2018). It employs 2 embedding layers and a stack of 4 CNN layers to encode the sentence s, then generates the feature representation h i for each word w i . Dropout (Srivastava et al., 2014) is applied after the embedding and each ReLU activation. The details can be found in Xu et al. (2018).
BiLSTM. BiLSTM employs a standard forward Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and a backward LSTM to encode the sentence, then concatenates the hidden states of the two LSTMs as the representation h_i of each word w_i.
BERT. BERT adopts subword embeddings, position embeddings, and segment embeddings as the representation of each subword, then employs a multi-layer bidirectional Transformer (Vaswani et al., 2017) to generate the contextual representations {h_1, h_2, ..., h_n} of the given sentence s. For a more comprehensive description, readers can refer to Devlin et al. (2019).
To obtain a robust representation for the word-pair (w_i, w_j), we additionally employ an attention layer to enhance the connection between w_i and w_j. The details are as follows:

α_ij = softmax_j((W_a1 h_i)^T (W_a2 h_j + b_a)),
h̃_i = h_i + Σ_{j=1}^{n} α_ij h_j,

where W_a1 and W_a2 are weight matrices, and b_a is the bias. Note that the above attention is not applied on the representations of BERT, because BERT itself contains multiple self-attention layers. Finally, we concatenate the enhanced representations of w_i and w_j to represent the word-pair (w_i, w_j), i.e., r_ij = [h̃_i; h̃_j], where [·; ·] denotes the vector concatenation operation.
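As a rough numpy sketch of building r_ij = [h̃_i; h̃_j]: an attention layer enhances each word vector before concatenation. The bilinear score form below is our assumption, since the text only names the parameters W_a1, W_a2, and b_a.

```python
import numpy as np

def word_pair_reps(H, W_a1, W_a2, b_a):
    """H: (n, d) word representations from the encoder.
    Returns (n, n, 2d) word-pair representations r_ij = [h~_i; h~_j].
    The bilinear attention score is an assumed form, not the paper's exact one."""
    scores = (H @ W_a1) @ (H @ W_a2 + b_a).T        # (n, n) attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    H_tilde = H + weights @ H                       # residual enhancement
    n, d = H.shape
    return np.concatenate([np.repeat(H_tilde[:, None, :], n, axis=1),
                           np.repeat(H_tilde[None, :, :], n, axis=0)], axis=-1)

rng = np.random.default_rng(0)
n, d = 5, 8                                         # toy sizes
H = rng.standard_normal((n, d))
R = word_pair_reps(H,
                   rng.standard_normal((d, d)) * 0.1,
                   rng.standard_normal((d, d)) * 0.1,
                   rng.standard_normal(d) * 0.1)
```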

Inference on GTS
As aforementioned, different opinion factors of AFOE are mutually indicative. Therefore, we design the inference strategy in GTS to exploit these potential indications for facilitating AFOE.
In Grid Tagging Scheme, let us consider what is helpful to detect the relation of the word-pair (w_i, w_j). First, relations between w_i and other words (except w_j) can help the detection. For example, if the predicted tags of word-pairs containing w_i include A, the tag of word-pair (w_i, w_j) is less likely to be O, and vice versa. The same holds for the word w_j. Second, the previous prediction for (w_i, w_j) helps infer its tag in the current turn. To this end, we propose an inference strategy on GTS to exploit these indications by iterative prediction and inference. In the t-th turn, the feature representation z^t_ij and predicted probability distribution p^t_ij of the word-pair (w_i, w_j) are calculated as follows:

z^t_ij = [r_ij; p^{t-1}_ij; q^t_i; q^t_j],    (3)
q^t_i = maxpooling(p^{t-1}_{i,:}),    (4)
q^t_j = maxpooling(p^{t-1}_{j,:}),    (5)
p^t_ij = softmax(W_p z^t_ij + b_p).    (6)

In the above process, p^{t-1}_{i,:} represents all predicted probabilities between the word w_i and other words. In fact, p^{t-1}_{i,:} = (p^{t-1}_{1:i,i}, p^{t-1}_{i,i:n}) in GTS, as we use the upper triangular grid. Equations 4 and 5 aim to help infer the possible tags for (w_i, w_j) by observing the predictions between w_i/w_j and other words. The initial predicted probability p^0_ij and representation z^0_ij of (w_i, w_j) are set as:

z^0_ij = r_ij,    p^0_ij = softmax(W_0 z^0_ij + b_0).

Finally, the prediction p^L_ij in the final turn is used to extract fine-grained opinions according to Algorithm 1. L is a hyperparameter denoting the number of inference turns.
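The iterative inference can be sketched in numpy as follows. The max-pooling over the predictions sharing a word and the feedback of previous predictions follow the description above, but the linear heads (our names W0 and Wz) and the dense grid layout are our simplifications.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def iterate_inference(R, W0, Wz, b, L=2):
    """R: (n, n, d) word-pair features r_ij; W0/Wz: classification heads for
    the initial and later turns (our names); b: bias; L: inference turns."""
    p = softmax(R @ W0.T + b)                     # p^0_ij from r_ij alone
    for _ in range(L):
        q_i = p.max(axis=1)                       # (n, C): pooled over all j
        q_j = p.max(axis=0)                       # (n, C): pooled over all i
        z = np.concatenate([R, p,
                            np.broadcast_to(q_i[:, None, :], p.shape),
                            np.broadcast_to(q_j[None, :, :], p.shape)],
                           axis=-1)               # z^t_ij = [r_ij; p; q_i; q_j]
        p = softmax(z @ Wz.T + b)                 # p^t_ij
    return p

rng = np.random.default_rng(1)
n, d, C = 4, 6, 4                                 # toy sizes; C = number of tags
R = rng.standard_normal((n, n, d))
p = iterate_inference(R,
                      W0=rng.standard_normal((C, d)) * 0.1,
                      Wz=rng.standard_normal((C, d + 3 * C)) * 0.1,
                      b=np.zeros(C), L=2)
```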

Training Loss
We use y_ij to represent the ground truth tag of the word-pair (w_i, w_j). The unified training loss for AFOE is defined as the cross-entropy loss between the ground truth distribution and the predicted tagging distribution p^L_ij of all word-pairs:

L = − Σ_{i=1}^{n} Σ_{j=i}^{n} Σ_{c ∈ C} I(y_ij = c) log p^L_{ij|c},

where I(·) is the indicator function, and C denotes the label set.

Table 2: Statistics of the aspect-oriented fine-grained opinion extraction datasets. Here "#S", "#A", "#O", "#P", and "#T" respectively denote the numbers of sentences, aspect terms, opinion terms, opinion pairs, and opinion triplets. The "res" and "lap" represent datasets from the restaurant domain or laptop domain.
Datasets

To study aspect-oriented opinion term extraction, Fan et al. (2019) annotate and release four opinion pair datasets based on the SemEval Challenges (Pontiki et al., 2014, 2015, 2016). Table 2 shows their statistics, and we can observe that one sentence may contain multiple aspect terms or opinion terms. Besides, one aspect term may correspond to multiple opinion terms and vice versa. To evaluate the performance of different methods, we use precision, recall, and F1-score as the evaluation metrics. The extracted aspect terms and opinion terms are regarded as correct only if the predicted and ground truth spans are exactly matched.

Experimental Settings
Following the design of DE-CNN (Xu et al., 2018), we use double embeddings to initialize the word vectors of GTS-CNN and GTS-BiLSTM, which contain a domain-general embedding from 300-dimension GloVe (Pennington et al., 2014) pre-trained with 840 billion tokens and a 100-dimension domain-specific embedding trained with fastText (Bojanowski et al., 2017). The CNN kernel size on the domain-specific embedding is 3 and the others are 5. In GTS-BiLSTM, the dimension of the LSTM cell is set to 50. We adopt the Adam optimizer (Kingma and Ba, 2015) to optimize networks and the initial learning rate is 0.001. Dropout (Srivastava et al., 2014) is applied after the embedding layer with probability 0.5. As for GTS-BERT, we use the uncased BERT-Base version (https://github.com/google-research/bert) and set the learning rate to 5e-5. The mini-batch size is set to 32. The development set is used for early stopping. We run each model five times and report the average results.

Results of Opinion Pair Extraction
Compared Methods We summarize the ABSA studies and combine the state-of-the-art methods as our strong OPE baselines. They include: (I) CMLA (Wang et al., 2017) and RINANTE (Dai and Song, 2019) for the co-extraction of aspect terms and opinion terms (Co-extraction), and Dis-BiLSTM and C-GCN (Zhang et al., 2018) for Pair relation Detection (PD); (II) BiLSTM-ATT and DE-CNN (Xu et al., 2018) for Aspect term Extraction (AE), and Distance (Hu and Liu, 2004), Dependency (Zhuang et al., 2006), and IOG (Fan et al., 2019) for Aspect-oriented Opinion Term Extraction (AOTE). Note that our GTS models do not use sentiment label information when performing the OPE task. Table 3 shows the experiment results of different methods.
Observing the two types of pipeline methods, we can find that the pipeline of AE+AOTE seems to perform better than Co-extraction+PD. Specifically, the method RINANTE+IOG outperforms RINANTE+C-GCN significantly on the datasets 14res and 15res, though C-GCN is a strong relation classification model. This indicates that the detection of opinion pair relations might be more difficult than aspect-oriented opinion term extraction. Besides, RINANTE+IOG also achieves better performance than another strong method, DE-CNN+IOG, by F1-score margins of 1.75% and 1.12% on the datasets 14lap and 15res respectively, which validates the facilitation of the co-extraction strategy for aspect term extraction.

Table 4: The experiment results on the OTE task (%). Best and second-best results are respectively in bold and underlined. The results with § are retrieved from Peng et al. (2019). The marker "-" represents that the original code of the IMN method does not contain the necessary resources for running on the dataset 16res.

Compared with the strong pipelines DE-CNN+IOG and RINANTE+IOG, our three end-to-end GTS models all achieve obvious improvements, especially on the datasets 15res and 16res. Despite RINANTE using weak supervision to extend the training data to millions of instances, GTS-CNN and GTS-BiLSTM still obtain obvious improvements only through one unified tagging task without additional resources. This comparison shows that error propagation in pipeline methods limits the performance of OPE. There is no doubt that GTS-BERT achieves the best performance because of its powerful ability to model context. The results in Table 3 and the above analysis consistently demonstrate the effectiveness of GTS for the OPE task.

Results of Opinion Triplet Extraction
Compared Methods We use the latest OTE work of Peng et al. (2019) as a compared method. In addition, we also employ the state-of-the-art work IMN (He et al., 2019) and the first step of Peng et al. (2019) for extracting the (aspect term, sentiment) pair, then combine them with IOG as strong baselines. The experiment results are shown in Table 4.
We can observe that IMN+IOG outperforms Peng-unified-R+IOG obviously on the datasets 14res and 15res, because IMN uses multi-domain document-level sentiment classification data as auxiliary tasks. In contrast, GTS-CNN and GTS-BiLSTM still obtain about 3% improvements in F1-score over IMN+IOG without requiring additional document-level sentiment data. The overall experiment results on the OTE task again validate the effectiveness of GTS. Furthermore, GTS-BERT outperforms GTS-CNN and GTS-BiLSTM by only about 2%-3% on the datasets 15res and 16res, which to some extent shows the ability of the proposed tagging scheme itself besides the BERT encoder.

Table 5: The results of different methods on the extraction of aspect terms and opinion terms (%). The abbreviations "A" and "O" respectively denote aspect term extraction and opinion term extraction.

Results of Aspect Term Extraction and Opinion Term Extraction
To further analyze the performance of different methods, we also compare them on the extraction of aspect terms and opinion terms. We only report the F1-score on the datasets 14res and 15res due to limited space.
The experiment results are shown in Table 5. Compared to GTS-CNN and GTS-BiLSTM, we can see that RINANTE achieves comparable or better results on the dataset 14res, while it performs worse on the OPE task. This comparison indicates that pipeline methods suffer from error propagation. According to the results on the dataset 15res, our GTS models not only can address the OPE task and OTE task in an end-to-end way, but also improve the performance of aspect term extraction and opinion term extraction. This is because our novel tagging scheme and inference strategy can exploit potential connections between different opinion factors to facilitate extraction.

Ablation Study
To investigate the effects of the attention mechanism and inference strategy on GTS models, we conduct an ablation study on the OPE task. The experiment results are shown in Table 6.

After removing the attention mechanism, the performance of GTS-CNN and GTS-BiLSTM drops slightly, which indicates that the attention mechanism enhances the connections between words. Comparing the full models with the versions w/o inference, we find that the former outperform the latter significantly on all datasets. This is reasonable because the proposed inference strategy can leverage the potential bridges between different opinion factors and make more comprehensive predictions. As for the model GTS-BERT w/o inference, it represents that the number of inference turns is 0, and we show its results in the next section.

Effects of Inference Times

To investigate the effects of inference times on performance, we report the results of GTS models for the OPE task on the datasets 14res and 14lap with different inference times in Figure 5. It can be observed that the inference strategy brings significant improvements for the model GTS-CNN. On the whole, GTS-CNN and GTS-BiLSTM achieve the best results with 2 and 3 inference turns respectively on the two datasets, and GTS-CNN performs better than GTS-BiLSTM across different inference times. In contrast, GTS-BERT reaches a crest with only 1 inference turn because BERT already contains rich contextual semantics.

Related Work
In the literature, only a few works mentioned or explored opinion pair extraction. Hu and Liu (2004) employ frequent pattern mining to extract aspect terms, then regard the closest adjective to an aspect term as the corresponding opinion term. Zhuang et al. (2006) adopt dependency-tree based templates to identify opinion pairs after extracting the aspect term set and opinion term set. Recently, some works adopt neural networks to perform the subtasks of OPE, such as co-extraction of aspect terms and opinion terms (Wang et al., 2017; Dai and Song, 2019), aspect term extraction (Xu et al., 2018), and aspect-oriented opinion term extraction (Fan et al., 2019; Wu et al., 2020), and finally combine them to accomplish OPE in a pipeline. To avoid the error propagation of pipeline methods, some studies use joint learning based on traditional machine learning algorithms and hand-crafted features, including the Imperatively Defined Factor graph (IDF) (Klinger and Cimiano, 2013a), joint inference based on IDF (Klinger and Cimiano, 2013b), and Integer Linear Programming (ILP) (Yang and Cardie, 2013). However, these methods heavily depend on the quality of hand-crafted features and sometimes perform worse than pipeline methods (Klinger and Cimiano, 2013b).
Opinion triplet extraction is a new aspect-oriented fine-grained opinion extraction task (Peng et al., 2019). Inspired by works extracting the (aspect term, sentiment) pair in a joint model (He et al., 2019), Peng et al. (2019) propose a two-stage framework to extract opinion triplets. In the first stage, they use a neural model to extract the pairs (aspect term, sentiment) and unpaired opinion terms; they then detect the pair relation between aspect terms and opinion terms in the second stage. We can see that the key opinion pair extraction of aspect term and opinion term is still accomplished in a pipeline, and their approach therefore also suffers from error propagation.

Conclusions
Aspect-oriented fine-grained opinion extraction (AFOE), including opinion pair extraction (OPE) and opinion triplet extraction (OTE), is usually accomplished in a pipeline because it involves multiple opinion factors, thereby suffering from error propagation. In this paper, we propose a novel scheme, Grid Tagging Scheme (GTS), to address this task in an end-to-end way. By tagging the relations between all word-pairs, GTS successfully casts the extraction of all opinion factors of AFOE into a unified grid tagging task, and then uses the designed decoding algorithm to generate opinion pairs or opinion triplets. To exploit the potential mutual indications between different opinion factors, we design an effective inference strategy on GTS. Experiments with three different GTS models respectively based on CNN, BiLSTM, and BERT consistently indicate that our methods outperform strong baselines and achieve state-of-the-art performance on opinion pair extraction and opinion triplet extraction. Further analysis also validates the effectiveness of GTS and the inference strategy.