A Shared-Private Representation Model with Coarse-to-Fine Extraction for Target Sentiment Analysis

Target sentiment analysis aims to detect opinion targets in a sentence and recognize their sentiment polarities. Some models with span-based labeling have achieved promising results in this task. However, the relation between the target extraction task and the target classification task has not been well exploited. Besides, the span-based target extraction algorithm performs poorly on target phrases due to the maximum target length setting or the length penalty factor. To address these problems, we propose a novel framework, the Shared-Private Representation Model (SPRM), with a coarse-to-fine extraction algorithm. For jointly learning target extraction and classification, we design a Shared-Private Network, which encodes not only shared information for both tasks but also private information for each task. To avoid missing correct target phrases, we also propose a heuristic coarse-to-fine extraction algorithm that first gets the approximate interval of the targets by matching the nearest predicted start and end indexes and then extracts the targets with an extending strategy. Experimental results show that our model achieves state-of-the-art performance.


Introduction
Target sentiment analysis aims to detect the opinion targets explicitly mentioned in the sentences, referred to as target extraction, and predict the sentiment polarities over the opinion targets, referred to as target classification. For example, in the sentence "I love Windows 7 which is a vast improvement over Vista.", the user mentions two opinion targets, namely, "Windows 7" and "Vista", and expresses positive sentiment over the first target, and negative sentiment over the second one.
Traditional methods formulated the joint target extraction and classification task as a sequence labeling task. Under the sequence tagging scheme, several prevalent models have been applied, including Conditional Random Fields (CRF) (Mitchell et al., 2013; Zhang et al., 2015; Li and Lu, 2017), Gated Recurrent Units (GRU) (Ma et al., 2018), Long Short-Term Memory (LSTM) (Li et al., 2019a), Convolutional Neural Networks (CNN) (He et al., 2019), and Bidirectional Encoder Representations from Transformers (BERT) (Li et al., 2019b). Although these methods have achieved improved results, they suffer from the sentiment inconsistency problem of the sequence tagging scheme.
To address it, some methods with span-based labeling, which can ensure sentiment consistency within a span, have been proposed. (Zhou et al., 2019) proposed a span-based loss to predict whether the target within a span is correct. Another line of work proposed a span-based model that first predicts the boundaries of the targets and then predicts the sentiment polarities based on the corresponding features. Although deep learning methods, especially span-based methods, have achieved promising results, there are still some issues: 1) The relation between target extraction and target classification is not well exploited. Previous methods applied either a shared encoding module (Ma et al., 2018) or two private encoding modules (Luo et al., 2019), thus weakening the ability to represent the relation between the two tasks. As shown in Fig. 1, there exists both shared and private information between target extraction and target classification. Specifically, semantic and syntactic information is essential for both tasks, so it is shared information. On the other hand, for the target extraction sub-task, some information, such as noun and pronoun information, can be exploited but may interfere with target classification. Similarly, sentiment information may only be useful for target classification. 2) The span-based extraction algorithm still performs poorly on extracting target phrases. (Zhou et al., 2019) faces a trade-off between search space and target length: setting a small maximum target length may miss long target phrases, while setting a large maximum length brings a huge search space and many negative candidates. Another approach adopts a heuristic algorithm with a length penalty to avoid overlong targets. However, the length penalty makes the model inclined to ignore target phrases.

Figure 1: An example of shared and private information of target extraction and target classification
To solve these issues, we propose a novel framework, namely the Shared-Private Representation Model (SPRM) with a coarse-to-fine extraction algorithm. Inspired by (Bousmalis et al., 2016; Liu et al., 2016; Chen et al., 2018), we design a Shared-Private Network, which contains a shared encoding layer, namely Shared BERT (Devlin et al., 2018), and two private encoding layers, namely the Target Extraction Long Short-Term Memory (TE-LSTM) and the Target Classification Long Short-Term Memory (TC-LSTM). The two private networks provide task-specific features and improve the ability to model the two sub-tasks. Moreover, we propose a coarse-to-fine extraction algorithm, which obtains the approximate intervals of targets by matching predicted start/end boundaries and then applies an extending strategy instead of a penalty factor to extract target phrases correctly. Experiments on three benchmark datasets show that our model achieves state-of-the-art performance. Our contributions are summarized as follows:
• A Shared-Private Network is designed to learn shared and private representations for the two sub-tasks;
• A coarse-to-fine extraction algorithm is proposed to better extract target phrases;
• Experimental results show that our model achieves state-of-the-art performance.
2 Related Work

(Mitchell et al., 2013) formulated the task of target sentiment analysis as a sequence tagging problem and proposed to use a Conditional Random Field (CRF) with hand-crafted linguistic features. In that work, three ways are designed to solve the problem, namely the pipeline way, the collapsed way, and the joint way. The pipeline way uses two independent models to extract targets and predict the sentiment of the extracted targets separately. In the joint way, shared modules between the two sub-tasks are jointly trained. Finally, the collapsed way combines the labels of target extraction and target classification into the same label space and predicts the collapsed labels. Based on (Mitchell et al., 2013), rule-based methods (Zhang et al., 2015; Li and Lu, 2017) and deep-learning-based methods (Ma et al., 2018; Li et al., 2019a; Luo et al., 2019; He et al., 2019) have been proposed to solve the target sentiment analysis task under the sequence tagging scheme. Although these methods have achieved improved results, they suffer from the huge search space and sentiment inconsistency of the sequence tagging scheme.
To address it, some span-based models were proposed (Zhou et al., 2019), which solve the target sentiment analysis task by predicting the spans of the targets. (Zhou et al., 2019) proposed a span-based loss to predict whether the target candidate within a span is a correct target. Another work proposed an extract-then-classify framework, which first extracts targets using a heuristic multi-span decoding algorithm and then classifies their polarities with the corresponding summarized span representations. Compared to (Zhou et al., 2019), this extraction method better alleviates the problem of huge search space and achieves better results. However, there are still some issues with it. For instance, it simply implements the joint model by employing a shared backbone for the two sub-tasks, which ignores the private information of each task. In addition, the heuristic multi-span decoding algorithm involves manually set thresholds for different datasets, and a length penalty factor for avoiding overlong targets, which is not suitable for extracting target phrases.

Figure 2: The overall architecture of SPRM. "TE" and "TC" denote "Target Extraction" and "Target Classification", respectively.

Model
To solve the aforementioned issues, we simultaneously learn shared and private features for target extraction and classification in a unified framework, in which a coarse-to-fine extraction algorithm is designed. In this paper, we propose the Shared-Private Representation Model (SPRM), shown in Fig. 2, which effectively encodes the shared and private information of the target extraction and target classification sub-tasks at a low cost. Specifically, a Shared BERT Network is designed to encode as much shared information of both sub-tasks as possible, and two Private BiLSTMs are introduced to obtain supplementary private representations for each task with fewer parameters than BERT. Moreover, we design a coarse-to-fine algorithm that first gets the approximate intervals of the targets by matching the nearest predicted start and end indexes without any thresholds, and then obtains the final targets by extending an interval if the adjacent words are predicted as start/end boundaries. With this algorithm, targets can be extracted with reasonable length, since the nearest strategy avoids overlong targets while the extending strategy avoids missing target phrases.

Shared-Private Model
The overall architecture of the Shared-Private Model is shown in Fig. 2. It is composed of six components: an embedding layer, two Private BiLSTM networks for target extraction and target sentiment classification, a Shared BERT Network for both sub-tasks, and the output layers for target extraction and target classification. Given an input sentence, the embedding layer processes it with the tokenization and wordpiece embeddings of BERT (Devlin et al., 2018) and obtains the input embeddings E ∈ R^{n×d_e}, where n is the length of the processed sequence and d_e is the size of the embedding vectors.
For target sentiment analysis, both the shared information of the two sub-tasks and the private information of each sub-task should be considered. Therefore, a shared network is designed to encode the shared information between the two sub-tasks, such as the semantic and syntactic information of the input sentence:
V^s = f(E),  (1)

where f(·) is the function learning shared features and V^s is the learned shared feature. At the same time, the task-specific private information of target extraction (e.g., whether a word is a noun) and target classification (e.g., the sentiment information of each word) should be learned in private modules:
V^te = g_te(E),  V^tc = g_tc(E),

where g_te(·) and g_tc(·) are the functions learning the private features of the target extraction task and the target classification task, and V^te and V^tc are the private features. Based on the shared and private features, fusion modules are designed to obtain the final features for the two sub-tasks:
Ṽ^te = h_te(V^s, V^te),  Ṽ^tc = h_tc(V^s, V^tc),

where h_te(·) and h_tc(·) are the functions fusing the shared and private features of the target extraction task and the target classification task, and Ṽ^te and Ṽ^tc are the final features, which are fed into the output layers.
Finally, Ṽ^te and Ṽ^tc are fed into the Target Extraction Layer (TE-Layer) and the Target Classification Layer (TC-Layer) to generate the predictions, respectively. The model is trained by minimizing the sum of the target extraction loss and the polarity classification loss:

L = l_TE + l_TC,

where l_TE and l_TC are the losses of the target extraction task and the target classification task. Here, we omit an exhaustive description of the TC-Layer, as it is the same as the classification layer applied in prior span-based work.
In the following subsections, we detail the design of the aforementioned components, namely the shared module, the two private modules, the combination of shared and private features, and the TE-Layer.

Shared BERT
As shared features are used in both target extraction and target classification, the shared module needs a strong ability to learn a shared representation. In addition, shared features generally portray common information between the two sub-tasks, like semantic and syntactic information, which also exists in other NLP tasks. Therefore, the prevalent Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), a pre-trained bidirectional Transformer encoder that achieves state-of-the-art performance across a variety of NLP tasks, is chosen as the shared network.
Given the embeddings E, a series of stacked Transformer blocks is applied to project the input embeddings into a sequence of contextual vectors V^s ∈ R^{n×d_s}, where d_s is the dimension of the outputs.

Private BiLSTM
Although the Shared BERT has captured powerful features for the two sub-tasks, these shared features are task-invariant but not task-specific. Therefore, private modules should be designed to learn private features for the two sub-tasks, respectively.
Since the Shared BERT has extracted as much syntactic and semantic information as possible with a huge number of parameters, we adopt Bidirectional Long Short-Term Memory (BiLSTM) networks, which capture the relationships between words in a sentence with far fewer parameters than BERT, as the private modules. Specifically, we adopt two Private BiLSTM networks, namely TE-LSTM and TC-LSTM, to learn the private features for target extraction and target sentiment classification, respectively. Taking the same embeddings E as inputs, we obtain the BiLSTM outputs V^te ∈ R^{n×2d_p} and V^tc ∈ R^{n×2d_p}, where d_p is the hidden size of the BiLSTM networks.

Combination of Shared and Private Features
Since the dimension of the Private BiLSTM outputs is twice that of the Shared BERT output, we first project the outputs of the private modules into the same vector space as the shared features by employing fully connected layers after the private modules:

V̂^te = FC_te(V^te),  V̂^tc = FC_tc(V^tc),

where V̂^te, V̂^tc ∈ R^{n×d_s}. Then we simply apply a concatenation operation to get the final features at a low cost:

Ṽ^te = [V^s; V̂^te],  Ṽ^tc = [V^s; V̂^tc].
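The projection-plus-concatenation fusion can be sketched in plain Python as follows. This is a minimal illustration with toy dimensions: `linear` stands in for the fully connected layer, and the toy weight matrix is our assumption for demonstration, not the model's learned parameters.

```python
def linear(x, w):
    """Fully connected layer (no bias) applied to one token vector:
    x has len(w) inputs, and each column of w produces one output."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def fuse(shared_vec, private_vec, w_proj):
    """Project the 2*d_p-dimensional private vector down to d_s with a
    fully connected layer, then concatenate it with the shared vector."""
    projected = linear(private_vec, w_proj)  # R^{2*d_p} -> R^{d_s}
    return shared_vec + projected            # concatenation -> R^{2*d_s}

# Toy example: d_s = 2, 2*d_p = 4; this projection keeps the first two
# components (an arbitrary choice for illustration).
w_proj = [[1, 0], [0, 1], [0, 0], [0, 0]]
fused = fuse([0.5, -0.5], [3.0, 4.0, 5.0, 6.0], w_proj)
```

In practice the projection weights are trained jointly with the rest of the network; the sketch only shows how the dimensions line up before concatenation.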
3.2 Coarse-to-Fine Extraction Algorithm

Prior work proposed a heuristic algorithm based on the span-based labeling scheme and verified that span-based labeling performs better on target extraction than sequence tagging methods. However, that heuristic algorithm requires a manually set threshold for extracting targets and also performs poorly on target phrases due to the length penalty factor, which is designed to avoid overlong targets.
To address these issues, we propose a coarse-to-fine extraction algorithm. The approximate interval of a target is first obtained by matching the nearest predicted start and end indexes rather than by manually setting a threshold, and the final target is then extracted with a reasonable length by adopting an extending strategy, which extends an interval if the adjacent words are predicted as start/end boundaries.
The implementation of the coarse-to-fine extraction algorithm is described in detail in the following subsections, and Table 1 shows how the algorithm works on a concrete example. The coarse-to-fine extraction algorithm consists of three steps:
• Boundary prediction gets the predictions of start and end positions (Sec. 3.2.1);
• Coarse extraction generates approximate target intervals with the nearest-matching strategy;
• Fine extraction obtains the final targets with the extending strategy and the estimated target number.

Boundary Prediction
As mentioned in Sec. 3.1, Ṽ^te is fed into the TE-Layer to generate the predictions, and then the loss of the target extraction task, l_TE, is computed. Here, the TE-Layer is described in detail.
The start and end scores for each word in the sequence are obtained by first applying fully connected layers and then using a sigmoid function:

g^s = FC_s(Ṽ^te),  p^s = sigmoid(g^s),  (7)
g^e = FC_e(Ṽ^te),  p^e = sigmoid(g^e).  (8)

Different from prior span-based work, we employ a sigmoid function instead of the softmax function to get the scores, because the sigmoid function is better suited to binary classification, such as predicting whether a word is a start/end boundary here. Given the probabilities of the start and end positions of each word, the corresponding labels denoting whether a word is the start/end boundary of a target can be computed by the following steps.
Here p^s = {p^s_1, p^s_2, ..., p^s_n} and p^e = {p^e_1, p^e_2, ..., p^e_n} are the start and end scores, respectively. Taking these two scores, the start labels y^s = {y^s_1, y^s_2, ..., y^s_n}, and the end labels y^e = {y^e_1, y^e_2, ..., y^e_n} as inputs, we get the loss of target extraction:

l_TE = Σ_{i=1}^{n} [logloss(p^s_i, y^s_i) + logloss(p^e_i, y^e_i)],  (10)

where logloss(p_i, y_i) is an error function defined as follows:

logloss(p_i, y_i) = -y_i log(p_i) - (1 - y_i) log(1 - p_i).
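The boundary scoring and loss above can be sketched in plain Python. This is a minimal sketch assuming, as the equations suggest, that the per-position log losses are summed over the sequence; the exact reduction is our reading, not a statement of the released implementation.

```python
import math

def sigmoid(g):
    """Map a raw boundary score to a probability, as in Eqs. (7)-(8)."""
    return 1.0 / (1.0 + math.exp(-g))

def logloss(p, y):
    """Binary cross-entropy for one position: y = 1 if the token is a
    true start (or end) boundary, 0 otherwise."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def extraction_loss(p_s, p_e, y_s, y_e):
    """l_TE: start- and end-boundary log losses summed over the sequence."""
    return (sum(logloss(p, y) for p, y in zip(p_s, y_s))
            + sum(logloss(p, y) for p, y in zip(p_e, y_e)))
```

Because each position is an independent binary decision, the sigmoid lets several tokens carry high start (or end) probability at once, which the softmax would suppress.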

Coarse Extraction
The coarse extraction step first selects the top start/end boundaries and then generates the original set of target candidates with the nearest-matching strategy, which matches the nearest predicted start and end boundaries without any thresholds. Given the predicted start and end labels, we obtain the numbers of tokens predicted as start and end boundaries, namely nb_s and nb_e. Since enough candidates should be extracted to avoid missing correct ones, we take the maximum, nb = max(nb_s, nb_e), as the number of boundary candidates to consider.
Therefore, the top nb start/end boundary candidates from p^s and p^e are obtained, and the sets of start/end candidates, namely S and E, are generated.
Since a target generally consists of only a few tokens, we apply the nearest strategy to avoid overlong targets. Using this strategy, we match each start boundary candidate with the nearest end index in E to get the start target candidate set C_s. Similarly, the end target candidate set C_e is obtained by matching each end boundary candidate with the nearest start index in S. Finally, the approximate intervals of the target candidates are obtained.
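The nearest-matching step can be sketched as follows. This is our reading of the strategy, not the paper's released code; `starts` and `ends` are assumed to be the token indices selected as top boundary candidates.

```python
def coarse_extract(starts, ends):
    """Nearest-matching strategy: pair every predicted start with the
    nearest end at or after it (giving C_s), and every predicted end with
    the nearest start at or before it (giving C_e); keep the well-formed,
    de-duplicated intervals as the approximate target candidates."""
    c_s = [(s, min((e for e in ends if e >= s), default=None)) for s in starts]
    c_e = [(max((s for s in starts if s <= e), default=None), e) for e in ends]
    spans = {(a, b) for a, b in c_s + c_e if a is not None and b is not None}
    return sorted(spans)

# Two targets: a two-token span at positions 2-3 and a single token at 5.
candidates = coarse_extract([2, 5], [3, 5])
```

Because each start grabs only its nearest end (and vice versa), no candidate can stretch across an unrelated boundary, which is what keeps the intervals from growing overlong.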

Fine Extraction
To get the final targets, the fine extraction step first adopts an extending strategy and then selects targets based on the start/end probabilities and the computed target number. For target phrases, the boundaries of the nouns within them are usually also predicted as start/end positions. For example, the token 'blue' of the target phrase 'integrate bluetooth devices' is predicted as the start position of a target, as shown in Table 1. Therefore, an extending strategy, shown in Algorithm 1, is designed to extract complete targets. In the extending strategy, every candidate can be extended on both the left side (lines 3-4) and the right side (lines 5-6) if the adjacent word is predicted as a start or end boundary.
As mentioned before, the boundaries of the nouns in target phrases are usually predicted as start/end positions of a target. Therefore, the model may predict one or a few start/end positions for a target, which are generally adjacent to each other. In other words, the number of intervals that contain only labels predicted as true start/end boundaries can be used to infer the number of extracted targets nt. Specifically, the numbers of such intervals in label_s and label_e, namely nt_s and nt_e, are computed first, and then we use their average to estimate the number of targets:

nt = round((nt_s + nt_e) / 2).

With the target number nt, we sort the extended candidate set C in descending order by the sum of start and end probabilities and then choose the top nt candidates. Note that candidates overlapping already extracted targets are removed during selection.
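The extending strategy and the target-count estimate can be sketched together. This is our reading of Algorithm 1 and the counting rule; the names `is_start`/`is_end` (per-token 0/1 boundary labels) are illustrative assumptions.

```python
def extend(span, is_start, is_end):
    """Extending strategy: grow a candidate interval leftwards while the
    adjacent word is predicted as a start boundary, and rightwards while
    the adjacent word is predicted as an end boundary."""
    left, right = span
    while left - 1 >= 0 and is_start[left - 1]:
        left -= 1
    while right + 1 < len(is_end) and is_end[right + 1]:
        right += 1
    return (left, right)

def estimate_target_count(label_s, label_e):
    """nt = round((nt_s + nt_e) / 2), where nt_s and nt_e count the
    contiguous runs of predicted start and end boundaries."""
    def runs(labels):
        return sum(1 for i, l in enumerate(labels)
                   if l and (i == 0 or not labels[i - 1]))
    return round((runs(label_s) + runs(label_e)) / 2)
```

For a phrase whose inner noun also fires as a start boundary, the extension absorbs the inner boundary into one interval, so the phrase is recovered whole instead of being split or truncated by a length penalty.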

Datasets
We conduct experiments on three benchmark datasets, as shown in Table 2. LAPTOP contains product reviews from the laptop domain in SemEval 2014 (Pontiki et al., 2014). REST is the union of the restaurant-domain sets from SemEval 2014, 2015, and 2016 (Pontiki et al., 2015, 2016). TWITTER is built by (Mitchell et al., 2013) and consists of Twitter posts. Following (Zhang et al., 2015; Li et al., 2019a), we report ten-fold cross-validation results for TWITTER, as there is no standard train-test split. For each dataset, the gold target span boundaries are available, and the targets are labeled with sentiment polarities, namely positive (+), negative (-), and neutral (0).

Metrics
We adopt precision (P), recall (R), and F1 score as evaluation metrics. A predicted target is counted as correct only if it exactly matches a gold target span and the corresponding polarity. To analyze the performance of the two sub-tasks separately, precision, recall, and F1 are also used for the target extraction sub-task, while the accuracy (ACC) metric is applied to polarity classification.
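The exact-match evaluation can be sketched as follows. Representing targets as ((start, end), polarity) tuples is our assumption for illustration, not the datasets' storage format.

```python
def prf(pred, gold):
    """Exact-match precision / recall / F1: a prediction counts only if
    both its span and its polarity match a gold target."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Dropping the polarity from each tuple gives the extraction-only P/R/F1, and accuracy over correctly extracted spans gives the classification ACC.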

Model Settings
We use the publicly available BERT-Base model as the Shared BERT and refer readers to (Devlin et al., 2018) for its details. The model is fine-tuned with a learning rate of 3e-5 and warmup over the first 10% of steps. The batch size is 32, and a dropout probability of 0.1 is used.

Main Results
We report the results of SPRM and the baselines in Table 3. Two main observations can be obtained from the results. Firstly, the private information for the two sub-tasks can be well captured by applying the two private encoding components. Secondly, SPRM achieves 0.66%, 4.25%, and 1.76% absolute gains on the three datasets compared to the best SPAN method, SPAN-pipeline, indicating the efficacy of the Shared BERT. Therefore, SPRM obtains better performance with fewer parameters than SPAN-pipeline, which employs two separate BERT encoding networks for target extraction and target classification, respectively.

Effectiveness of Shared-Private Network
To verify the effectiveness of the Shared-Private Network, we conduct extensive experiments on the LAPTOP and REST datasets, and the experimental results are shown in Table 4. From the results, we observe that removing the Shared BERT makes the performance worse, since BERT has a strong ability to learn powerful features. Although the model performs well with BERT alone, the Private BiLSTMs still learn useful features for each sub-task and improve the performance. Specifically, the Private TE-LSTM is more effective than the Private TC-LSTM, as removing the former causes a larger drop in performance.
Moreover, we report the performance of SPAN and SPRM with different BERT backbone networks in Table 5 to further examine the effectiveness of the Shared-Private Network. We observe that SPRM with BERT-Base achieves results comparable to SPRM with BERT-Large, while SPAN-joint with BERT-Base performs significantly worse than SPAN-joint with BERT-Large. This shows that introducing the private layers improves performance with far fewer parameters than switching the backbone from BERT-Base to BERT-Large. Besides, SPRM with BERT-Base outperforms SPAN-pipeline with BERT-Large, which uses almost 5 times as many trainable parameters as SPRM with BERT-Base. Therefore, the Shared BERT not only connects the target extraction and target classification tasks to some extent but also reduces the number of parameters.

Effectiveness of Coarse-to-Fine Extraction Algorithm
To verify the effectiveness of the coarse-to-fine extraction algorithm, we replace it with a CRF and with the heuristic extraction algorithm of prior span-based work on the LAPTOP and REST datasets; the experimental results are shown in Table 6. Among the three extraction methods, the CRF performs worst, since it suffers from a huge search space. In addition, the coarse-to-fine extraction algorithm outperforms the heuristic extraction method, as our model applies a more flexible way to extract targets.

Analysis on Both Sub-Tasks
To analyze the performance of our model on target extraction and target sentiment classification, we compare it with previous approaches designed for both tasks and with state-of-the-art methods proposed for a single sub-task, namely DE-CNN (Xu et al., 2018) for target extraction and DMMN-SDCM (Lin et al., 2019) for target classification. The results for target extraction and target classification are shown in Table 7 and Table 8, respectively. On target extraction, our model does not achieve the best performance on all three datasets: SPRM outperforms SPAN by 1.37% and 4.33% on the LAPTOP and REST datasets but performs worse on the TWITTER dataset. On target sentiment classification, our model outperforms all the baselines by 0.11%, 0.40%, and 3.18% on the three datasets. The experimental results suggest one disadvantage of the joint model relative to the pipeline model: it can achieve the best performance on the overall target sentiment analysis task but cannot guarantee the best performance on both sub-tasks at the same time.

Qualitative Analysis
Table 9 shows some qualitative cases sampled from SPAN-pipeline and SPRM. We can observe that SPRM with the coarse-to-fine extraction algorithm extracts more accurate targets. The coarse-to-fine extraction algorithm computes the number of targets from the predicted scores of the start and end boundaries instead of a manually set threshold, so our method is more precise about the number of targets. Take example 6 in the table: the correct targets "Windows XP" and "Windows 7" are not extracted by SPAN-pipeline because its threshold filters them out incorrectly, while our method extracts all three correct targets since we infer the number of targets correctly. Example 1 also confirms this. In addition, our algorithm adopts the extending strategy instead of the length penalty, which avoids missing targets that consist of several words. Take example 4: the correct target is "chili signed food items", but SPAN-pipeline splits the gold target into two separate targets because of its length penalty. Our algorithm extracts "chili signed food items" correctly, since we obtain the original candidates with the nearest indexes and then extract the targets with the extending strategy.

Conclusion
In this paper, we propose a Shared-Private Representation Model (SPRM) with coarse-to-fine extraction for target sentiment analysis. To encode the information of the two sub-tasks of target sentiment analysis, a Shared-Private Network is proposed to learn shared as well as private features. Moreover, we design a coarse-to-fine extraction algorithm, which extracts targets without thresholds and adopts an extending strategy to better extract target phrases. Experiments on three benchmark datasets show the effectiveness of SPRM.