Open-Domain Targeted Sentiment Analysis via Span-Based Extraction and Classification

Open-domain targeted sentiment analysis aims to detect opinion targets along with their sentiment polarities from a sentence. Prior work typically formulates this task as a sequence tagging problem. However, such formulation suffers from problems such as huge search space and sentiment inconsistency. To address these problems, we propose a span-based extract-then-classify framework, where multiple opinion targets are directly extracted from the sentence under the supervision of target span boundaries, and corresponding polarities are then classified using their span representations. We further investigate three approaches under this framework, namely the pipeline, joint, and collapsed models. Experiments on three benchmark datasets show that our approach consistently outperforms the sequence tagging baseline. Moreover, we find that the pipeline model achieves the best performance compared with the other two models.


Introduction
Open-domain targeted sentiment analysis is a fundamental task in opinion mining and sentiment analysis (Pang et al., 2008;Liu, 2012). Compared to traditional sentence-level sentiment analysis tasks (Lin and He, 2009;Kim, 2014), the task requires detecting target entities mentioned in the sentence along with their sentiment polarities, thus being more challenging. Taking Figure 1 as an example, the goal is to first identify "Windows 7" and "Vista" as opinion targets and then predict their corresponding sentiment classes. Typically, the whole task can be decoupled into two subtasks. Since opinion targets are not given, we need to first detect the targets from the input text. This subtask, which is usually denoted as target extraction, can be solved by sequence tagging methods (Jakob and Gurevych, 2010;Liu et al., 2015;Wang et al., 2016a;Poria et al., 2016;Shu et al., 2017;Xu et al., 2018). Next, polarity classification aims to predict the sentiment polarities over the extracted target entities (Jiang et al., 2011;Dong et al., 2014;Tang et al., 2016a;Wang et al., 2016b;Chen et al., 2017;Xue and Li, 2018;Fan et al., 2018). Although lots of efforts have been made to design sophisticated classifiers for this subtask, they all assume that the targets are already given.
Rather than using separate models for each subtask, some works attempt to solve the task in a more integrated way, by jointly extracting targets and predicting their sentiments (Mitchell et al., 2013;Zhang et al., 2015;Li et al., 2019). The key insight is to label each word with a set of target tags (e.g., B, I, O) as well as a set of polarity tags (e.g., +, -, 0), or use a more collapsed set of tags (e.g., B+, I-) to directly indicate the boundary of targeted sentiment, as shown in Figure 2(a). As a result, the entire task is formulated as a sequence tagging problem, and solved using either a pipeline model, a joint model, or a collapsed model under the same network architecture. However, the above annotation scheme has several disadvantages in target extraction and polarity classification. Lee et al. (2016) show that, when using BIO tags for extractive question answering tasks, the model must consider a huge search space due to the compositionality of labels (the power set of all sentence words), thus being less effective. As for polarity classification, the sequence tagging scheme turns out to be problematic for two reasons. First, tagging polarity over each word ignores the semantics of the entire opinion target. Second, since predicted polarities over target words may be different, the sentiment consistency of a multi-word entity cannot be guaranteed, as mentioned by Li et al. (2019). For example, there is a chance that the words "Windows" and "7" in Figure 2(a) are predicted to have different polarities due to word-level tagging decisions.
To address these problems, we propose a span-based labeling scheme for open-domain targeted sentiment analysis, as shown in Figure 2(b). The key insight is to annotate each opinion target with its span boundary followed by its sentiment polarity. Under such annotation, we introduce an extract-then-classify framework that first extracts multiple opinion targets using a heuristic multi-span decoding algorithm, and then classifies their polarities with corresponding summarized span representations. The advantage of this approach is that the extractive search space grows only linearly with the sentence length, which is far smaller than that of the tagging method. Moreover, since the polarity is decided using the targeted span representation, the model is able to take all target words into account before making predictions, thus naturally avoiding sentiment inconsistency.
We take BERT (Devlin et al., 2018) as the default backbone network, and explore two research questions. First, we make an elaborate comparison between tagging-based models and span-based models. Second, following previous works (Mitchell et al., 2013;Zhang et al., 2015), we compare the pipeline, joint, and collapsed models under the span-based labeling scheme. Extensive experiments on three benchmark datasets show that our models consistently outperform sequence tagging baselines. In addition, the pipeline model firmly improves over both the joint and collapsed models. Source code is released to facilitate future research in this field (https://github.com/huminghao16/SpanABSA).

Related Work
Apart from sentence-level sentiment analysis (Lin and He, 2009;Kim, 2014), targeted sentiment analysis, which requires the detection of sentiments towards mentioned entities in the open domain, is also an important research topic.
As discussed in §1, this task is usually divided into two subtasks. The first is target extraction for identifying entities from the input sentence. Traditionally, Conditional Random Fields (CRF) (Lafferty et al., 2001) have been widely explored (Jakob and Gurevych, 2010;Wang et al., 2016a;Shu et al., 2017). Recently, many works concentrate on leveraging deep neural networks to tackle this task, e.g., using CNNs (Poria et al., 2016;Xu et al., 2018), RNNs (Liu et al., 2015), and so on. The second is polarity classification, assuming that the target entities are given. Recent works mainly focus on capturing the interaction between the target and the sentence, by utilizing various neural architectures such as LSTMs (Hochreiter and Schmidhuber, 1997;Tang et al., 2016a) with attention mechanism (Wang et al., 2016b;Fan et al., 2018), CNNs (Xue and Li, 2018;Huang and Carley, 2018), and Memory Networks (Tang et al., 2016b;Chen et al., 2017;Li and Lam, 2017).
Rather than solving these two subtasks with separate models, a more practical approach is to directly predict the sentiment towards an entity along with discovering the entity itself. Specifically, Mitchell et al. (2013) formulate the whole task as a sequence tagging problem and propose to use CRF with hand-crafted linguistic features. Zhang et al. (2015) further leverage these linguistic features to enhance a neural CRF model. Recently, Li et al. (2019) have proposed a unified model that contains two stacked LSTMs along with carefully-designed components for maintaining sentiment consistency and improving target word detection. Our work differs from these approaches in that we formulate this task as a span-level extract-then-classify process instead.

Figure 3: [...] (Devlin et al., 2018) that contains L pre-trained Transformer blocks (Vaswani et al., 2017). The last block's hidden states are used to (a) propose one or multiple candidate targets based on the probabilities of the start and end positions, (b) predict the sentiment polarity using the span representation of the given target.
The proposed span-based labeling scheme is inspired by recent advances in machine comprehension and question answering (Seo et al., 2017;Hu et al., 2018), where the task is to extract a continuous span of text from the document as the answer to the question (Rajpurkar et al., 2016). To solve this task, Lee et al. (2016) investigate several prediction strategies, such as BIO prediction and boundary prediction, and their results show that predicting the two endpoints of the answer is more beneficial than the tagging method. Wang and Jiang (2017) explore two answer prediction methods, namely the sequence method and the boundary method, finding that the latter performs better. Our approach is related to this line of work. However, unlike these works that extract one span as the final answer, our approach is designed to dynamically output one or multiple opinion targets.

Extract-then-Classify Framework
Instead of formulating the open-domain targeted sentiment analysis task as a sequence tagging problem, we propose to use a span-based labeling scheme as follows. We are given an input sentence $x = (x_1, ..., x_n)$ with length $n$, and a target list $T = \{t_1, ..., t_m\}$, where the number of targets is $m$ and each target $t_i$ is annotated with its start position, its end position, and its sentiment polarity. The goal is to find all targets from the sentence as well as predict their polarities.
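To make the labeling scheme concrete, the following minimal sketch represents each target as a (start, end, polarity) record and converts a collapsed tag sequence (e.g., B+, I+, O) into such records. The names `Target` and `collapsed_tags_to_targets` are hypothetical and introduced only for illustration; taking the polarity of a span from its B tag is an assumption of this sketch, not part of the proposed method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Target:
    start: int      # index of the first target token (inclusive)
    end: int        # index of the last target token (inclusive)
    polarity: str   # one of "+", "-", "0"


def collapsed_tags_to_targets(tags: List[str]) -> List[Target]:
    """Convert collapsed tags such as B+, I+, O into span-level targets."""
    targets: List[Target] = []
    start: Optional[int] = None
    polarity = ""
    for i, tag in enumerate(tags):
        if tag.startswith("B"):
            # A new target begins; close the previous one if it is still open.
            if start is not None:
                targets.append(Target(start, i - 1, polarity))
            start, polarity = i, tag[1:]
        elif tag.startswith("I") and start is not None:
            # Continuation of the currently open target.
            continue
        else:
            # An O tag (or a stray I tag) closes any open target.
            if start is not None:
                targets.append(Target(start, i - 1, polarity))
            start, polarity = None, ""
    if start is not None:
        targets.append(Target(start, len(tags) - 1, polarity))
    return targets


# "Great food but the service was dreadful !"
tags = ["O", "B+", "O", "O", "B-", "O", "O", "O"]
targets = collapsed_tags_to_targets(tags)
```

Because the polarity is attached to the whole span rather than to each word, a multi-word target cannot carry conflicting word-level polarities in this representation.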
The overall illustration of the proposed framework is shown in Figure 3. The basis of our framework is the BERT encoder (Devlin et al., 2018): we map word embeddings into contextualized token representations using pre-trained Transformer blocks (Vaswani et al., 2017) (§3.1). A multi-target extractor is first used to propose multiple candidate targets from the sentence (§3.2). Then, a polarity classifier is designed to predict the sentiment towards each extracted candidate using its summarized span representation (§3.3). We further investigate three different approaches under this framework, namely the pipeline, joint, and collapsed models in §3.4.

BERT as Backbone Network
We use Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), a pre-trained bidirectional Transformer encoder that achieves state-of-the-art performances across a variety of NLP tasks, as our backbone network.
We first tokenize the sentence $x$ using a 30,522 wordpiece vocabulary, and then generate the input sequence $\tilde{x}$ by concatenating a [CLS] token, the tokenized sentence, and a [SEP] token. Then for each token $\tilde{x}_i$ in $\tilde{x}$, we convert it into vector space by summing the token, segment, and position embeddings, thus yielding the input embeddings $h^0 \in \mathbb{R}^{(n+2) \times h}$, where $h$ is the hidden size.
Next, we use a series of $L$ stacked Transformer blocks to project the input embeddings into a sequence of contextual vectors $h^i \in \mathbb{R}^{(n+2) \times h}$ as:
$$h^i = \mathrm{TransformerBlock}(h^{i-1}), \quad \forall i \in [1, L]$$
Here, we omit an exhaustive description of the block architecture and refer readers to Vaswani et al. (2017) for more details.

Multi-Target Extractor
The multi-target extractor aims to propose multiple candidate opinion targets (Figure 3(a)). Rather than finding targets via sequence tagging methods, we detect candidate targets by predicting the start and end positions of the target in the sentence, as suggested in extractive question answering (Wang and Jiang, 2017;Seo et al., 2017;Hu et al., 2018). We obtain the unnormalized score as well as the probability distribution of the start position as:
$$g^s = w_s h^L, \quad p^s = \mathrm{softmax}(g^s)$$
where $w_s \in \mathbb{R}^h$ is a trainable weight vector. Similarly, we can get the probability of the end position along with its confidence score by:
$$g^e = w_e h^L, \quad p^e = \mathrm{softmax}(g^e)$$
During training, since each sentence may contain multiple targets, we label the span boundaries for all target entities in the list $T$. As a result, we can obtain a vector $y^s \in \mathbb{R}^{(n+2)}$, where each element $y^s_i$ indicates whether the $i$-th token starts a target, and also get another vector $y^e \in \mathbb{R}^{(n+2)}$ for labeling the end positions. Then, we define the training objective as the sum of the negative log probabilities of the true start and end positions on two predicted probabilities as:
$$\mathcal{L} = -\sum_{i=1}^{n+2} y^s_i \log p^s_i - \sum_{j=1}^{n+2} y^e_j \log p^e_j$$
At inference time, previous works choose the span $(k, l)$ $(k \le l)$ with the maximum value of $g^s_k + g^e_l$ as the final prediction. However, such a decoding method is not suitable for the multi-target extraction task. Moreover, simply taking the top-K spans according to the sum of the two scores is also not optimal, as multiple candidates may refer to the same text. Figure 4 gives a qualitative example to illustrate this phenomenon.
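As an illustration of the boundary prediction and its training objective, the sketch below computes position scores as dot products between token vectors and a weight vector, normalizes them with a softmax, and sums the negative log probabilities of the labeled start and end positions. It uses plain Python lists instead of a tensor library, and the function names and toy values are assumptions made for this example only.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def boundary_scores(hidden: List[List[float]], w: List[float]) -> List[float]:
    """Unnormalized score of each token: dot product with a weight vector."""
    return [sum(h_k * w_k for h_k, w_k in zip(h, w)) for h in hidden]


def boundary_loss(p_start: List[float], p_end: List[float],
                  y_start: List[int], y_end: List[int]) -> float:
    """Sum of negative log probabilities over labeled start and end positions."""
    loss = 0.0
    for p, y in zip(p_start, y_start):
        loss -= y * math.log(p)
    for p, y in zip(p_end, y_end):
        loss -= y * math.log(p)
    return loss


# Toy example: three tokens with hidden size 2 (values are made up).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
g_s = boundary_scores(hidden, [2.0, 1.0])   # start scores
g_e = boundary_scores(hidden, [2.0, 1.0])   # end scores (same weights for brevity)
p_s, p_e = softmax(g_s), softmax(g_e)
loss = boundary_loss(p_s, p_e, y_start=[0, 0, 1], y_end=[0, 0, 1])
```

Since the label vectors may contain several ones, the same loss covers sentences with multiple targets without any change.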
Figure 4:
Sentence: Great food but the service was dreadful!
Targets: food, service
Predictions: food but the service, food, Great food, service, service was dreadful, ...

To adapt to multi-target scenarios, we propose a heuristic multi-span decoding algorithm as shown in Algorithm 1. For each example, top-M indices are first chosen from the two predicted scores $g^s$ and $g^e$ (line 2), and the candidate span $(s_i, e_j)$ (denoted as $r_l$) along with its heuristic-regularized score $u_l$ are then added to the lists R and U respectively, under the constraints that the end position is no less than the start position and that the sum of the two scores exceeds a threshold γ (lines 3-8). Note that we heuristically calculate $u_l$ as the sum of two scores minus the span length (line 6), which turns out to be critical to the performance as targets are usually short entities. Next, we prune redundant spans in R using the non-maximum suppression algorithm (Rosenfeld and Thurston, 1971). Specifically, we remove the span $r_l$ that possesses the maximum score $u_l$ from the set R and add it to the set O (lines 10-11). We also delete any span $r_k$ that overlaps with $r_l$, which is measured with the word-level F1 function (lines 12-14). This process is repeated for the remaining spans in R, until R is empty or top-K target spans have been proposed (line 9).

Algorithm 1 Heuristic multi-span decoding
Input: g^s, g^e, γ, K
  g^s denotes the score of start positions
  g^e denotes the score of end positions
  γ is a minimum score threshold
  K is the maximum number of proposed targets
 1: Initialize R, U, O = {}, {}, {}
 2: Get top-M indices S, E from g^s, g^e
 3: for s_i in S do
 4:   for e_j in E do
 5:     if s_i ≤ e_j and g^s_{s_i} + g^e_{e_j} ≥ γ then
 6:       u_l = g^s_{s_i} + g^e_{e_j} − (e_j − s_i + 1)
 7:       r_l = (s_i, e_j)
 8:       R = R ∪ {r_l}, U = U ∪ {u_l}
 9: while R ≠ {} and size(O) < K do
10:   l = argmax U
11:   O = O ∪ {r_l}; R = R − {r_l}; U = U − {u_l}
12:   for r_k in R do
13:     if f1(r_l, r_k) ≠ 0 then
14:       R = R − {r_k}; U = U − {u_k}
15: return O
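A minimal Python sketch of the heuristic multi-span decoding procedure is given below, assuming the scores are plain lists of floats. The function names (`word_f1`, `top_indices`, `decode_spans`) and the toy scores are hypothetical; the released source code should be consulted for the actual implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive


def word_f1(a: Span, b: Span) -> float:
    """Word-level F1 between two spans; 0.0 means they do not overlap."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if overlap <= 0:
        return 0.0
    precision = overlap / (a[1] - a[0] + 1)
    recall = overlap / (b[1] - b[0] + 1)
    return 2 * precision * recall / (precision + recall)


def top_indices(scores: List[float], m: int) -> List[int]:
    """Indices of the m highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]


def decode_spans(g_start: List[float], g_end: List[float],
                 gamma: float, K: int, M: int = 20) -> List[Span]:
    # Collect candidate spans with their heuristic-regularized scores.
    candidates = []  # list of (u_l, r_l)
    for s in top_indices(g_start, M):
        for e in top_indices(g_end, M):
            total = g_start[s] + g_end[e]
            if s <= e and total >= gamma:
                # Length heuristic: penalize long spans.
                candidates.append((total - (e - s + 1), (s, e)))
    # Non-maximum suppression over the candidates.
    output: List[Span] = []
    while candidates and len(output) < K:
        best = max(candidates, key=lambda c: c[0])
        output.append(best[1])
        # Keep only spans that do not overlap the selected one
        # (this also removes the selected span itself).
        candidates = [c for c in candidates if word_f1(best[1], c[1]) == 0.0]
    return output


# Toy scores for "Great food but the service was dreadful !" (values are made up).
g_s = [2.0, 6.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
g_e = [0.0, 6.0, 0.0, 0.0, 5.0, 0.0, 2.0, 0.0]
spans = decode_spans(g_s, g_e, gamma=7.0, K=10, M=3)
```

In this toy case the overlapping candidates "Great food", "food but the service", and "service was dreadful" are suppressed, leaving the single-token spans for "food" and "service".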

Polarity Classifier
Typically, polarity classification is solved using either sequence tagging methods or sophisticated neural networks that separately encode the target and the sentence. Instead, we propose to summarize the target representation from contextual sentence vectors according to its span boundary, and use feed-forward neural networks to predict the sentiment polarity, as shown in Figure 3(b).
Specifically, given a target span $r$, we calculate a summarized vector $v$ using the attention mechanism (Bahdanau et al., 2014) over tokens in its corresponding bound $(s_i, e_j)$, similar to He et al. (2018):
$$\alpha = \mathrm{softmax}(w_\alpha h^L_{s_i:e_j}), \quad v = \sum_{t=s_i}^{e_j} \alpha_{t-s_i+1} h^L_t$$
The polarity score is obtained by applying two linear transformations with a Tanh activation in between, and is normalized with the softmax function to output the polarity probability as:
$$g^p = W_p \tanh(W_v v), \quad p^p = \mathrm{softmax}(g^p)$$
where $W_v \in \mathbb{R}^{h \times h}$ and $W_p \in \mathbb{R}^{k \times h}$ are two trainable parameter matrices.
We minimize the negative log probabilities of the true polarity on the predicted probability as:
$$\mathcal{J} = -\sum_{i=1}^{k} y^p_i \log p^p_i$$
where $y^p$ is a one-hot label indicating the true polarity, and $k$ is the number of sentiment classes. During inference, the polarity probability is calculated for each candidate target span in the set O, and the sentiment class that possesses the maximum value in $p^p$ is chosen.
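The span summarization and polarity prediction can be sketched as follows, using plain Python lists in place of tensors. The attention weight vector `w_alpha`, the helper names, and all numeric values are assumptions made for this illustration rather than details taken from the released code.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def span_representation(hidden: Matrix, start: int, end: int,
                        w_alpha: Vector) -> Vector:
    """Attention-weighted sum of the token vectors inside [start, end]."""
    tokens = hidden[start:end + 1]
    alpha = softmax([sum(a * b for a, b in zip(w_alpha, h)) for h in tokens])
    dim = len(tokens[0])
    return [sum(a * h[d] for a, h in zip(alpha, tokens)) for d in range(dim)]


def polarity_probs(v: Vector, W_v: Matrix, W_p: Matrix) -> Vector:
    """Two linear maps with a tanh in between, followed by a softmax."""
    inner = [math.tanh(x) for x in matvec(W_v, v)]
    return softmax(matvec(W_p, inner))


# Toy example: a two-token span, hidden size 2, three sentiment classes.
hidden = [[1.0, 0.0], [0.0, 1.0]]
v = span_representation(hidden, 0, 1, w_alpha=[0.0, 0.0])  # uniform attention
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_p = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = polarity_probs(v, W_v, W_p)
predicted_class = max(range(len(probs)), key=lambda i: probs[i])
```

A single probability vector is produced per span, so every word in a multi-word target receives the same predicted polarity by construction.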

Model Variants
Following Mitchell et al. (2013); Zhang et al. (2015), we investigate three kinds of models under the extract-then-classify framework:

Pipeline model
We first build a multi-target extractor where a BERT encoder is exclusively used. Then, a second backbone network is used to provide contextual sentence vectors for the polarity classifier. The two models are separately trained and combined as a pipeline during inference.

Joint model
In this model, each sentence is fed into a shared BERT backbone network that finally branches into two sibling output layers: one for proposing multiple candidate targets and another for predicting the sentiment polarity over each extracted target. A joint training loss $\mathcal{L} + \mathcal{J}$ is used to optimize the whole model. The inference procedure is the same as the pipeline model.

Collapsed model
We combine target span boundaries and sentiment polarities into one label space. For example, the sentence in Figure 2(b) has a positive span (3+, 4+) and a negative span (11-, 11-). We then modify the multi-target extractor by producing three sets of probabilities of the start and end positions, where each set corresponds to one sentiment class (e.g., $p^{s+}$ and $p^{e+}$ for positive targets). Then, we define three objectives to optimize towards each polarity. During inference, the heuristic multi-span decoding algorithm is performed on each set of scores (e.g., $g^{s+}$ and $g^{e+}$), and the output sets $O^+$, $O^-$, and $O^0$ are aggregated as the final prediction.

Setup
Datasets We conduct experiments on three benchmark sentiment analysis datasets, as shown in Table 1. LAPTOP contains product reviews from the laptop domain in SemEval 2014 ABSA challenges (Pontiki et al., 2014). REST is the union set of the restaurant domain from SemEval 2014, 2015 and 2016 (Pontiki et al., 2015, 2016). TWITTER is built by Mitchell et al. (2013), consisting of Twitter posts. Following Zhang et al. (2015); Li et al. (2019), we report the ten-fold cross validation results for TWITTER, as there is no train-test split. For each dataset, the gold target span boundaries are available, and the targets are labeled with three sentiment polarities, namely positive (+), negative (-), and neutral (0).

Metrics
We adopt the precision (P), recall (R), and F1 score as evaluation metrics. A predicted target is correct only if it exactly matches the gold target entity and the corresponding polarity. To separately analyze the performance of two subtasks, precision, recall, and F1 are also used for the target extraction subtask, while the accuracy (ACC) metric is applied to polarity classification.

Table 2: Main results on three benchmark datasets. A BERT LARGE backbone network is used for both the "TAG" and "SPAN" models. State-of-the-art results are marked in bold.

We refer readers to Devlin et al. (2018) for details on model sizes. We use the Adam optimizer with a learning rate of 2e-5 and warmup over the first 10% of steps to train for 3 epochs. The batch size is 32 and a dropout probability of 0.1 is used. The number of candidate spans M is set to 20 while the maximum number of proposed targets K is 10 (Algorithm 1). The threshold γ is manually tuned on each dataset. All experiments are conducted on a single NVIDIA P100 GPU card.
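The exact-match criterion behind the precision, recall, and F1 scores can be sketched as follows, assuming each sentence's predictions and gold annotations are given as sets of (start, end, polarity) triples. The function name `exact_match_prf` is hypothetical, and this is a simplified illustration rather than the official evaluation script.

```python
from typing import List, Set, Tuple

Triple = Tuple[int, int, str]  # (start, end, polarity)


def exact_match_prf(pred: List[Set[Triple]],
                    gold: List[Set[Triple]]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over all sentences.

    A prediction counts as correct only if its span boundaries and its
    polarity both match a gold target exactly.
    """
    n_correct = sum(len(p & g) for p, g in zip(pred, gold))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# One sentence: the second target has the right span but the wrong polarity.
gold = [{(1, 1, "+"), (4, 4, "-")}]
pred = [{(1, 1, "+"), (4, 4, "+")}]
p, r, f1 = exact_match_prf(pred, gold)
```

Dropping the polarity field from the triples would give the corresponding scores for the target extraction subtask alone.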

Baseline Methods
We compare the proposed span-based approach with the following methods:
TAG-{pipeline, joint, collapsed} are the sequence tagging baselines that involve a BERT encoder and a CRF decoder. "pipeline" and "joint" denote the pipeline and joint approaches that utilize the BIO and +/-/0 tagging schemes, while "collapsed" is the model following the collapsed tagging scheme (Figure 2(a)).
UNIFIED (Li et al., 2019) is the current state-of-the-art model on targeted sentiment analysis. It contains two stacked recurrent neural networks enhanced with multi-task learning and adopts the collapsed tagging scheme.
We also compare our multi-target extractor with the following method:
DE-CNN (Xu et al., 2018) is the current state-of-the-art model on target extraction, which combines a double embeddings mechanism with convolutional neural networks (CNNs).
Finally, the polarity classifier is compared with the following methods:
MGAN (Fan et al., 2018) uses a multi-grained attention mechanism to capture interactions between targets and sentences for polarity classification.
TNet is the current state-of-the-art model on polarity classification, which consists of a multi-layer context-preserving network architecture and uses CNNs as the feature extractor.

Main Results
We compare models under either the sequence tagging scheme or the span-based labeling scheme, and show the results in Table 2. We denote our approach as "SPAN", and use BERT LARGE as backbone networks for both the "TAG" and "SPAN" models to make the comparison fair.
Two main observations can be drawn from the table. First, although the "TAG" baselines already outperform the previous best approach ("UNIFIED"), they are all beaten by the "SPAN" methods. The best span-based method achieves 1.55%, 0.94% and 3.43% absolute gains on the three datasets compared to the best tagging method, indicating the efficacy of our extract-then-classify framework. Second, among the span-based methods, SPAN-pipeline achieves the best performance, which is similar to the results of Mitchell et al. (2013); Zhang et al. (2015). This suggests that there is only a weak connection between target extraction and polarity classification. The conclusion is also supported by the result of the SPAN-collapsed method, whose performance drops severely across all datasets, implying that merging polarity labels into target spans does not address the task effectively.

Analysis on Target Extraction
To analyze the performance on target extraction, we run both the tagging baseline and the multi-target extractor on three datasets, as shown in Table 3. We find that the BIO tagger outperforms our extractor on LAPTOP and REST. A likely reason for this observation is that the input sentences in these datasets are usually short (e.g., 98% of sentences are less than 40 words in REST), which limits the tagger's search space (the power set of all sentence words). As a result, the computational complexity has been largely reduced, which is beneficial for the tagging method. In order to confirm the above hypothesis, we plot the F1 score with respect to different sentence lengths in Figure 5. We observe that the performance of the BIO tagger dramatically decreases as the sentence length increases, while our extractor is more robust for long sentences. Our extractor manages to surpass the tagger by 16.1 F1 and 1.0 F1 when the length exceeds 40 on LAPTOP and REST, respectively. The above result suggests that our extractor is more suitable for long sentences due to the fact that its search space only increases linearly with the sentence length.
Since a trade-off between precision and recall can be adjusted according to the threshold γ in our extractor, we further plot the precision-recall curves under different ablations to show the effects of the heuristic multi-span decoding algorithm. As can be seen from Figure 6, removing the length heuristics results in consistent performance drops across the two datasets. By sampling incorrect predictions, we find that there are many targets closely aligned with each other, such as "perfect [size] + and [speed] + ", "[portions] + all at a reasonable [price] + ", and so on. The model without length heuristics is very likely to output the whole phrase as a single target, thus being totally wrong. Moreover, removing the non-maximum suppression (NMS) leads to significant performance degradations, suggesting that it is crucial to prune redundant spans that refer to the same text.

Figure 6: Precision-recall curves on LAPTOP and REST for target extraction. "NMS" and "heuristics" denote the non-maximum suppression and the length heuristics in Algorithm 1.

Analysis on Polarity Classification
To assess the polarity classification subtask, we compare the performance of our span-level polarity classifier with the CRF-based tagger in Table 5. The results show that our approach significantly outperforms the tagging baseline by achieving 9.97%, 8.15% and 15.4% absolute gains on three datasets, and firmly surpasses previous state-of-the-art models on LAPTOP. The large improvement over the tagging baseline suggests that detecting sentiment with the entire span representation is much more beneficial than predicting polarities over each word, as the semantics of the given target has been fully considered.
To gain more insights on performance improvements, we plot the accuracy of both methods with respect to different target lengths in Figure 7. We find that the accuracy of the span-level classifier only drops a little as the number of words increases on the LAPTOP and REST datasets. The performance of the tagging baseline, however, significantly decreases as the target becomes longer. This suggests that the tagging method indeed suffers from the sentiment inconsistency problem when it comes to multi-word target entities. Our span-based method, on the contrary, can naturally alleviate this problem because the polarity is classified by taking all target words into account.
Table 4 shows some qualitative cases sampled from the pipeline methods. As observed in the first two examples, the "TAG" model incorrectly predicts the target span by either missing the word "Mac" or proposing a phrase across two targets ("scallps and prawns"). A likely reason for its failure is that the input sentences are relatively longer, and the tagging method is less effective when dealing with them. But when it comes to shorter inputs (e.g., the third and the fourth examples), the tagging baseline usually performs better than our approach. We find that our approach may sometimes fail to propose target entities (e.g., "adjustments" in (3) and "feel" in (4)), which is due to the fact that a relatively large γ has been set. As a result, the model only makes cautious but confident predictions. In contrast, the tagging method does not rely on a threshold and is observed to have a higher recall. For example, it additionally predicts the entity "food" as a target in the second example. Moreover, we find that the tagging method sometimes fails to predict the correct sentiment class, especially when the target consists of multiple words (e.g., "battery cycle count" in (5) and "Casa La Femme" in (6)), indicating the tagger cannot effectively maintain sentiment consistency across words. Our polarity classifier, however, can avoid this problem by using the target span representation to predict the sentiment.

Conclusion
We re-examine the drawbacks of sequence tagging methods in open-domain targeted sentiment analysis, and propose an extract-then-classify framework with the span-based labeling scheme instead. The framework contains a pre-trained Transformer encoder as the backbone network. On top of it, we design a multi-target extractor for proposing multiple candidate targets with a heuristic multi-span decoding algorithm, and introduce a polarity classifier that predicts the sentiment towards each candidate using its summarized span representation. Our approach firmly outperforms the sequence tagging baseline as well as previous state-of-the-art methods on three benchmark datasets. Model analysis reveals that the main performance improvement comes from the span-level polarity classifier, and the multi-target extractor is more suitable for long sentences. Moreover, we find that the pipeline model consistently surpasses both the joint model and the collapsed model.