DOER: Dual Cross-Shared RNN for Aspect Term-Polarity Co-Extraction

This paper focuses on two related subtasks of aspect-based sentiment analysis, namely aspect term extraction and aspect sentiment classification, which we call aspect term-polarity co-extraction. The former task is to extract aspects of a product or service from an opinion document, and the latter is to identify the polarity expressed in the document about these extracted aspects. Most existing algorithms address them as two separate tasks and solve them one by one, or only perform one task, which can be complicated for real applications. In this paper, we treat these two tasks as two sequence labeling problems and propose a novel Dual crOss-sharEd RNN framework (DOER) to generate all aspect term-polarity pairs of the input sentence simultaneously. Specifically, DOER involves a dual recurrent neural network to extract the respective representation of each task, and a cross-shared unit to consider the relationship between them. Experimental results demonstrate that the proposed framework outperforms state-of-the-art baselines on three benchmark datasets.


Introduction
Aspect term extraction (ATE) and aspect sentiment classification (ASC) are two fundamental, fine-grained subtasks of aspect-based sentiment analysis. Aspect term extraction is the task of extracting the attributes (or aspects) of an entity upon which opinions have been expressed, and aspect sentiment classification is the task of identifying the polarities expressed on these extracted aspects in the opinion text (Hu and Liu, 2004). Consider the example in Figure 1, which contains comments that people expressed about the aspect terms "operating system", "preloaded software", "keyboard", "bag", "price", and "service", each labeled with its polarity. The polarities fall into four classes: positive (PO), conflict (CF), neutral (NT), and negative (NG).
* Tianrui Li is the corresponding author.
To facilitate practical applications, our goal is to solve ATE and ASC simultaneously. For ease of description and discussion, these two subtasks are referred to together as aspect term-polarity co-extraction. Both ATE and ASC have attracted a great deal of attention among researchers, but they are rarely solved together because of two challenges: 1) ATE and ASC are quite different tasks. ATE is an extraction or sequence labeling task (Jakob and Gurevych, 2010; Wang et al., 2016a), while ASC is a classification task (Jiang et al., 2011; Wagner et al., 2014; Tang et al., 2016a,b; Tay et al., 2018). Thus, they are naturally treated as two separate tasks and solved one by one in a pipelined manner. However, this two-stage framework is complicated and difficult to use in applications because it needs to train two models separately. There is also latent error propagation when an extracted aspect term is used to classify its corresponding polarity. Thus, due to the different natures of the two tasks, most current works focus either on extracting aspect terms (Yin et al., 2016; Luo et al., 2018) or on classifying aspect sentiment (Ma et al., 2017; Wang and Lu, 2018). A possible way to bridge the difference between the two tasks is to recast ASC as a sequence labeling task, so that ATE and ASC have the same formulation.
2) The number of aspect term-polarity pairs in a sentence is arbitrary. Considering the examples depicted in Figure 1, we can observe that some sentences contain two term-polarity pairs and some sentences contain one pair. Moreover, each aspect term can consist of any number of words, which makes the co-extraction task difficult to solve.
Some existing research has treated ATE and ASC as two sequence labeling tasks and dealt with them together. Mitchell et al. (2013) and  compared pipelined, joint, and collapsed approaches to extracting named entities and their sentiments. They found that the joint and collapsed approaches are superior to the pipelined approach. Li and Lu (2017) proposed a collapsed CRF model. It differs from the standard CRF in that it expands the node type at each word to capture sentiment scopes. Another interesting work comes from Li et al. (2019), where the authors proposed a unified model with the collapsed approach to do aspect term-polarity co-extraction. We can intuitively explain the pipelined, joint, and collapsed approaches through Figure 2. The pipelined approach first labels the given sentence using aspect term tags, e.g., "B" and "I" (the Beginning and Inside of an aspect term), and then feeds the aspect terms into a classifier to obtain their corresponding polarities. The collapsed approach uses collapsed labels as the tag set, e.g., "B-PO" and "I-PO". Each tag indicates both the aspect term boundary and its polarity. The joint approach labels each sentence with two different tag sets: aspect term tags and polarity tags.
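To make the relation between the joint and collapsed tag schemes concrete, the following minimal sketch converts between them for the example sentence. The function names `collapse` and `split` are hypothetical, and the positive polarity assigned to "preloaded software" is an assumption made for illustration.

```python
def collapse(ate_tags, asc_tags):
    """Merge joint tags (boundary tag, polarity tag) into collapsed tags."""
    return ["O" if a == "O" else f"{a}-{p}" for a, p in zip(ate_tags, asc_tags)]


def split(collapsed_tags):
    """Recover the two joint tag sequences from collapsed tags."""
    ate_tags, asc_tags = [], []
    for tag in collapsed_tags:
        if tag == "O":
            ate_tags.append("O")
            asc_tags.append("O")
        else:
            boundary, polarity = tag.split("-", 1)
            ate_tags.append(boundary)
            asc_tags.append(polarity)
    return ate_tags, asc_tags


words = "I love the operating system and the preloaded software .".split()
ate = ["O", "O", "O", "B", "I", "O", "O", "B", "I", "O"]
asc = ["O", "O", "O", "PO", "PO", "O", "O", "PO", "PO", "O"]  # assumed labels
collapsed = collapse(ate, asc)
```

The two schemes carry the same information here; the difference discussed in the text lies in whether one model representation or two has to account for both the boundary and the polarity.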
We believe that the joint approach is more feasible than the collapsed approach when integrating with neural networks, because the combined tags of the latter can easily confuse the learned representation. In the example in Figure 2, "operating system" is an aspect term, but its polarity "positive" actually comes from the word "love". The two should be learned separately because the meanings of these two groups of words are different. Using "B-PO I-PO" to capture the meaning of "operating system" and "love" simultaneously is therefore difficult in training (this will become clearer later). In contrast, the joint approach has separate representations and separate labels for ATE and ASC. Thus, an extra sentiment lexicon can improve the representation of ASC individually, and the interaction of ATE and ASC can further enhance the performance of each other.

[Figure 2: Labeling of the input sentence "I love the operating system and the preloaded software ." under the pipelined, collapsed, and joint approaches.]

In this paper, we propose a novel Dual crOss-sharEd RNN framework (DOER) to generate all aspect term-polarity pairs of a given sentence. DOER mainly contains a dual recurrent neural network (RNN) and a cross-shared unit (CSU). The CSU is designed to take advantage of the interactions between ATE and ASC. In addition, two auxiliary tasks, aspect length enhancement and sentiment enhancement, are integrated to improve the representations of ATE and ASC. An extra RNN cell called the Residual Gated Unit (ReGU) is also proposed to improve the performance of aspect term-polarity co-extraction. The ReGU uses a gate to transfer the input to the output like a skip connection (He et al., 2016), and thus can be stacked more deeply and obtain more useful features. In short, DOER generates aspect terms and their polarities simultaneously with an end-to-end method instead of building two separate models, which saves time and gives a unified solution for practical applications.
Our contributions are summarized as follows:
• A novel framework, DOER, is proposed to address the aspect term-polarity co-extraction problem in an end-to-end fashion. A cross-shared unit (CSU) is designed to leverage the interaction of the two tasks.
• Two auxiliary tasks are designed to enhance the labeling of ATE and ASC, and an extra RNN cell, ReGU, is proposed to improve the capability of feature extraction.

Methodology
The proposed framework is shown in Figure 3a. In this section, we first formulate the aspect term-polarity co-extraction problem and then describe the framework in detail.

Problem Statement
This paper deals with aspect term-polarity co-extraction, in which the aspect terms are explicitly mentioned in the text. We solve it as two sequence labeling tasks. Formally, we are given a review sentence $S$ with $n$ words from a particular domain, denoted by $S = \{w_i \mid i = 1, \ldots, n\}$. For each word $w_i$, the objective of ATE is to assign it a tag $t^a_i \in T^a$, and likewise, the objective of ASC is to assign it a tag $t^p_i \in T^p$, where $T^a = \{\mathrm{B}, \mathrm{I}, \mathrm{O}\}$ and $T^p = \{\mathrm{PO}, \mathrm{NT}, \mathrm{NG}, \mathrm{CF}, \mathrm{O}\}$. The tags B, I, and O in $T^a$ stand for the beginning of an aspect term, the inside of an aspect term, and other words, respectively. The tags PO, NT, NG, and CF indicate the polarity categories positive, neutral, negative, and conflict, respectively. The tag O in $T^p$ marks other words, as it does in $T^a$. Figure 2 shows a labeling example of the first sentence in Figure 1.
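As an illustration of this formulation, the sketch below builds the two tag sequences from annotated aspect spans. The helper `label_sentence`, the span format `(start, end_exclusive, polarity)`, and the example sentence are all hypothetical and are not taken from the datasets.

```python
T_A = ("B", "I", "O")
T_P = ("PO", "NT", "NG", "CF", "O")


def label_sentence(n_words, aspect_spans):
    """Return (ATE tags, ASC tags) for spans given as (start, end_exclusive, polarity)."""
    ate = ["O"] * n_words
    asc = ["O"] * n_words
    for start, end, polarity in aspect_spans:
        if polarity not in T_P or polarity == "O":
            raise ValueError(f"unknown polarity: {polarity}")
        for i in range(start, end):
            ate[i] = "B" if i == start else "I"
            asc[i] = polarity  # every word of the term carries the term's polarity
    return ate, asc


# Hypothetical sentence with two aspect terms of different lengths.
words = "The keyboard is great but the battery life is short .".split()
ate_tags, asc_tags = label_sentence(len(words), [(1, 2, "PO"), (6, 8, "NG")])
```

Both sequences have one tag per word, so the number of pairs and the length of each term are encoded implicitly by the B and I tags.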

Model Overview
We discuss the proposed framework DOER in detail below.
Word Embedding. Instead of adopting standard techniques to generate the embedding of each word $w_i$ by concatenating word embedding and char embedding, we use the double embeddings proposed in  as the initial word embeddings. The double embeddings contain two types: general-purpose embeddings and domain-specific embeddings, which are distinguished by whether the embeddings are trained on an in-domain corpus or not. Formally, each word $w_i$ is initialized with a feature vector $h_{w_i} \in \mathbb{R}^{d_G + d_D}$, where $d_G$ and $d_D$ are the first dimension sizes of the general-purpose embeddings $G \in \mathbb{R}^{d_G \times |V|}$ and the domain-specific embeddings $D \in \mathbb{R}^{d_D \times |V|}$, respectively, and $|V|$ is the size of the vocabulary. Hence, $h_{w_i} = G(w_i) \oplus D(w_i)$, where $\oplus$ means the concatenation operation. $h_g$ and $h_d$ in Figure 3a denote $G(w_i)$ and $D(w_i)$, respectively. All out-of-vocabulary words are randomly initialized, and all sentences are padded with zeros (or truncated when testing) to the maximum length of the training sentences.

Stacked Dual RNNs. The main architecture of DOER is a pair of stacked RNNs: one stacked RNN for ATE and one stacked RNN for ASC. Each RNN layer is a bidirectional ReGU (BiReGU). As shown in Figure 4, ReGU has two gates to control the flow of input and hidden state. Given input $x_t$ at time $t$ and the previous memory cell $c_{t-1}$, the new memory cell $c_t$ is calculated via the following equation:
$$c_t = (1 - f_t) \odot c_{t-1} + f_t \odot \tanh(W_i x_t), \tag{1}$$
and the new hidden state $h_t$ is then computed as
$$h_t = (1 - o_t) \odot c_t + o_t \odot \tilde{x}_t, \tag{2}$$
where $f_t = \sigma(W_f x_t + U_f c_{t-1})$ is a forget gate, $o_t = \sigma(W_o x_t + U_o c_{t-1})$ is a residual gate, and $\tilde{x}_t$ is $x_t$ or $\tanh(W_x x_t)$ according to whether the size of $x_t$ is equal to that of $c_t$ or not. $f_t$ controls the information flow from the previous timestamp to the next timestamp. $o_t$ controls the information flow from the previous layer to the next layer. $\sigma$ denotes the logistic function, $\tanh$ is the hyperbolic tangent function, and $\odot$ is element-wise multiplication. $W_*$ of size $d \times d_I$ and $U_*$ of size $d \times d$ are weight matrices, where $* \in \{i, f, o, x\}$. The bias vectors are omitted for simplicity. The size of $d_I$ changes with the dimension of the input. Its value is $d_G + d_D$ for the first layer of the stacked BiReGU.

Like a bidirectional LSTM (Graves and Schmidhuber, 2005), BiReGU holds two directional representations of the input. We concatenate the hidden states generated by ReGU in both directions for the same input as the output vector, which is expressed as
$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t, \tag{3}$$
where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ have the same formulation as Eq. (2) but different propagation directions. Thus, the size of $h_t$ is $2d$, and $d_I$ also becomes $2d$ when stacking a new BiReGU layer. We refer to the outputs of the dual BiReGU as $h_A$ and $h_P$ to differentiate ATE and ASC.
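The following pure-Python sketch runs one bidirectional ReGU layer on a toy sequence. It assumes the gate forms $f_t = \sigma(W_f x_t + U_f c_{t-1})$ and $o_t = \sigma(W_o x_t + U_o c_{t-1})$, a zero initial memory cell, and omitted biases; the helper names and the random initialisation are hypothetical and only serve the illustration.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regu_step(x, c_prev, W, U):
    """One ReGU step; W maps 'i','f','o','x' to d x d_I matrices, U maps 'f','o' to d x d."""
    f = [sigmoid(a + b) for a, b in zip(matvec(W["f"], x), matvec(U["f"], c_prev))]
    o = [sigmoid(a + b) for a, b in zip(matvec(W["o"], x), matvec(U["o"], c_prev))]
    candidate = [math.tanh(a) for a in matvec(W["i"], x)]
    c = [(1 - ft) * cp + ft * ca for ft, cp, ca in zip(f, c_prev, candidate)]
    # Skip path: use x directly when sizes match, otherwise project it to size d.
    x_tilde = x if len(x) == len(c) else [math.tanh(a) for a in matvec(W["x"], x)]
    h = [(1 - ot) * ct + ot * xt for ot, ct, xt in zip(o, c, x_tilde)]
    return h, c


def run_regu(xs, W, U, d):
    """Run a ReGU over a sequence, starting from a zero memory cell (assumed)."""
    c, hs = [0.0] * d, []
    for x in xs:
        h, c = regu_step(x, c, W, U)
        hs.append(h)
    return hs


def bi_regu(xs, forward, backward, d):
    """Concatenate forward and backward hidden states position by position."""
    h_fwd = run_regu(xs, *forward, d)
    h_bwd = run_regu(xs[::-1], *backward, d)[::-1]
    return [f + b for f, b in zip(h_fwd, h_bwd)]


def random_params(d, d_in, rng):
    """Hypothetical small random initialisation, for illustration only."""
    mat = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
    return {k: mat(d, d_in) for k in "ifox"}, {k: mat(d, d) for k in "fo"}


rng = random.Random(0)
d, d_in = 2, 3
xs = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(4)]
outputs = bi_regu(xs, random_params(d, d_in, rng), random_params(d, d_in, rng), d)
```

Each output vector has size $2d$, which is why a second stacked layer takes inputs of size $2d$ and can use the input directly on its skip path only when $d_I = d$.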
Cross-Shared Unit. When generating the representation after BiReGU, the information of ATE and ASC is separated from each other. However, the fact is that the labels of ATE and the labels of ASC have strong relations. For instance, if the label of ATE is O, the label for ASC should be O as well, and if the label of ASC is PO, the label for ATE should be B or I. Besides, both the labels of ATE and the labels of ASC have the information to imply the boundary of each aspect term.
The cross-shared unit (CSU) is used to consider the interaction of ATE and ASC. We first compute the composition vector $\alpha^M_{ij} \in \mathbb{R}^K$ through the following tensor operator:
$$\alpha^M_{ij} = \tanh\left((h^m_i)^\top G^m h^{\bar{m}}_j\right), \tag{4}$$
where $M \in \{A, P\}$, $m \in \{a, p\}$, $h^m_i \in h_M$, and $G^m \in \mathbb{R}^{K \times 2d \times 2d}$ are 3-dimensional tensors. $K$ is a hyperparameter. $A$, $a$ and $P$, $p$ are indexes of ATE and ASC, respectively: $\bar{m} = p$ and $M = A$ if $m = a$, and $\bar{m} = a$ and $M = P$ if $m = p$. Such tensor operators can be seen as multiple bilinear terms, which have the capability of modeling more complicated compositions between two vectors (Socher et al., 2013).
After obtaining the composition vectors, the attention score $S^M_{ij}$ is calculated as:
$$S^M_{ij} = v_m^\top \alpha^M_{ij}, \tag{5}$$
where $v_m \in \mathbb{R}^K$ is a weight vector used to weight each value of the composition vector, $M \in \{A, P\}$, and $m \in \{a, p\}$. Thus, $S^M_{ij}$ is a scalar. All the scalars $S^A_{ij}$ and $S^P_{ij}$ are gathered in two matrices $S^A$ and $S^P$, respectively. A higher score $S^A_{ij}$ indicates a higher correlation between aspect term $i$ and the polarity representation captured from the $j$-th word. Likewise, a higher score $S^P_{ij}$ indicates a higher correlation between aspect polarity $i$ and the representation of the aspect term captured from the $j$-th word. We use their related representations to enhance the original ATE and ASC features through:
$$h_M = h_M + \mathrm{softmax}_r(S^M)\, h_{\bar{M}},$$
where $\mathrm{softmax}_r$ is a row-based softmax function and $\bar{M}$ denotes the other task ($\bar{M} = P$ if $M = A$, and $\bar{M} = A$ if $M = P$). Such an operation lets ATE and ASC obtain enhanced information from each other. The process is shown in Figure 3b.
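A small pure-Python sketch of the cross-shared unit for one direction (enhancing one task's features with the other's) is given here. It assumes a tanh activation on the bilinear composition and a residual addition of the attended features; the function names and the toy dimensions are hypothetical.

```python
import math


def bilinear(u, G_k, w):
    """Compute u^T G_k w for one slice G_k of the composition tensor."""
    return sum(u[r] * G_k[r][c] * w[c] for r in range(len(u)) for c in range(len(w)))


def attention_scores(H_own, H_other, G, v):
    """S[i][j] = v . tanh([h_i^T G_k h'_j for each slice k])."""
    S = []
    for h_i in H_own:
        row = []
        for h_j in H_other:
            alpha = [math.tanh(bilinear(h_i, G_k, h_j)) for G_k in G]  # vector in R^K
            row.append(sum(v_k * a_k for v_k, a_k in zip(v, alpha)))
        S.append(row)
    return S


def softmax_rows(S):
    """Row-based softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def enhance(H_own, H_other, G, v):
    """Add to each own-task vector the attention-weighted sum of the other task's vectors."""
    A = softmax_rows(attention_scores(H_own, H_other, G, v))
    dim = len(H_own[0])
    return [
        [H_own[i][t] + sum(A[i][j] * H_other[j][t] for j in range(len(H_other)))
         for t in range(dim)]
        for i in range(len(H_own))
    ]


# Toy features for a two-word sentence with feature size 2 and K = 2.
H_A = [[1.0, 0.0], [0.0, 1.0]]
H_P = [[2.0, 2.0], [4.0, 4.0]]
G_a = [[[0.1, 0.0], [0.0, 0.1]], [[0.0, 0.2], [0.2, 0.0]]]
v_a = [1.0, -1.0]
H_A_enhanced = enhance(H_A, H_P, G_a, v_a)
```

Calling `enhance(H_P, H_A, G_p, v_p)` with a second tensor and weight vector would give the other direction, so the two branches keep separate parameters while still reading each other's features.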
Interface. To generate the final ATE tags and ASC tags, either a dense layer plus a softmax function or a Conditional Random Field (CRF) can be used. According to the comparison by Reimers and Gurevych (2017), using a CRF instead of a softmax classifier as the last layer can yield a performance increase for tasks with a high dependency between tags. Thus, we use the linear-chain CRF as our inference layer. Its log-likelihood is computed as follows:
$$L(W_c, b_c) = \sum_i \log p(y_i \mid h_i; W_c, b_c), \tag{6}$$
where $p(y \mid h; W_c, b_c)$ is the probability function of the CRF, and $W_c$ and $b_c$ are the weight and bias, respectively. The Viterbi algorithm is used to generate the final labels of ATE and ASC.
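For readers unfamiliar with the decoding step, here is a generic Viterbi sketch for a linear-chain model over the ATE tags. The emission and transition scores are made-up numbers, and the absence of start and end transitions is a simplifying assumption; this is not the trained CRF of the paper.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][k] is the score of tag k at position t;
    transitions[j][k] is the score of moving from tag j to tag k.
    """
    n_tags = len(tags)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_tags):
            best_prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev] + transitions[best_prev][k] + emissions[t][k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[k] for k in path]


TAGS = ["B", "I", "O"]
# Hypothetical scores: the transition O -> I is heavily penalised.
TRANSITIONS = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -10.0, 0.0]]
EMISSIONS = [[0.0, 0.0, 1.0], [0.8, 1.0, 0.0], [0.0, 1.0, 0.2]]
decoded = viterbi(EMISSIONS, TRANSITIONS, TAGS)
```

Picking the best tag at each position independently would give "O I I" for these scores, which starts a term with I; the transition scores steer the decoder to a sequence in which the term begins with B.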
Joint Output. After generating the labels for ATE and ASC in the inference layer, the last step is to obtain the aspect term-polarity pairs. It is convenient to get the aspect terms of the given sentence according to the meaning of the elements in T a . To generate the polarity of each aspect term, we use the aspect term as the boundary of polarity labels, and then count the number of each polarity category within the boundary and adopt the label that has the maximum number or the first label (if all the numbers of each polarity category are equal) as the final polarity. For example, the final polarity of "PO NT" is "PO", the final polarity of "PO PO" is also "PO", and the final polarity of "PO NT NT" is "NT". This method is simple and effective in our experiments.
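The pairing rule described above can be sketched as follows. The function names are hypothetical; the sketch counts the polarity labels exactly as predicted inside each term boundary (it does not treat an O label specially, which the text leaves unspecified) and ignores an I tag that has no preceding B, which is also an assumption.

```python
from collections import Counter


def vote(labels):
    """Most frequent label; ties go to the label that appears first."""
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:
        if counts[label] == top:
            return label


def extract_pairs(words, ate_tags, asc_tags):
    """Read aspect terms off the B/I tags and vote on the polarity inside each term."""
    pairs, i, n = [], 0, len(words)
    while i < n:
        if ate_tags[i] == "B":
            j = i + 1
            while j < n and ate_tags[j] == "I":
                j += 1
            pairs.append((" ".join(words[i:j]), vote(asc_tags[i:j])))
            i = j
        else:
            i += 1
    return pairs


words = "I love the operating system and the preloaded software .".split()
ate = ["O", "O", "O", "B", "I", "O", "O", "B", "I", "O"]
asc = ["O", "O", "O", "PO", "PO", "O", "O", "PO", "NT", "O"]  # hypothetical predictions
pairs = extract_pairs(words, ate, asc)
```

With these hypothetical predictions the second term receives the tied labels "PO NT", so the first label decides its polarity.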

Auxiliary Aspect Term Length Enhancement.
Although CRF is capable of considering the correlation of two adjacent labels, discontinuous labels may still be generated, especially for a long target aspect term. To alleviate the influence of the length of the aspect term, we design an auxiliary task that predicts the average length of aspect terms in each sentence when training the model. The prediction in ATE is computed as follows:
$$z_{u_A} = \sigma(W_{u_A}^\top \tilde{h}_A), \tag{7}$$
where $\tilde{h}_A \in \mathbb{R}^{2d}$ is the result of max-pooling over $h^{l_1}_A$, which is generated by the first RNN layer, and $W_{u_A} \in \mathbb{R}^{2d}$ is a weight parameter. We calculate the prediction loss through the mean squared error (MSE):
$$L_{u_A} = \lVert z_{u_A} - \hat{z}_u \rVert^2, \tag{8}$$
where $\hat{z}_u$ is the average length of aspect terms in a sentence after global normalization on the training dataset. ASC has a similar prediction process after the first layer of its stacked RNNs, with its own weight $W_{u_P}$ and hidden feature $\tilde{h}_P$ in place of $W_{u_A}$ and $\tilde{h}_A$. Its prediction loss is denoted by $L_{u_P}$.
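A minimal sketch of the length auxiliary task is given below, assuming a sigmoid output on the max-pooled first-layer features and a squared-error loss. The function names are hypothetical, and the normalization of the target (how the average length is scaled into the sigmoid's range) is left to the caller because the text does not specify it.

```python
import math


def average_term_length(ate_tags):
    """Average number of words per aspect term in one sentence (0.0 if there are none)."""
    n_terms = sum(1 for t in ate_tags if t == "B")
    n_words = sum(1 for t in ate_tags if t in ("B", "I"))
    return n_words / n_terms if n_terms else 0.0


def max_pool(H):
    """Element-wise maximum over the word positions of a sentence."""
    return [max(column) for column in zip(*H)]


def predict_length(H, w):
    """Sigmoid of the dot product between the weight vector and the pooled features."""
    pooled = max_pool(H)
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, pooled))))


def length_loss(H, w, target):
    """Squared error between the prediction and the normalized target length."""
    return (predict_length(H, w) - target) ** 2


# Toy first-layer features for a two-word sentence.
H_first_layer = [[1.0, 5.0], [3.0, 2.0]]
raw_target = average_term_length(["O", "B", "I", "O", "B"])
```

The target is a single number per sentence, so this loss adds a sentence-level signal about term length on top of the word-level tagging loss.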
Auxiliary Sentiment Lexicon Enhancement. As previously discussed, the polarity of an aspect term is usually inferred from its related opinion words. Thus, we also use a sentiment lexicon to guide ASC. Specifically, we train an auxiliary word-level classifier on the ASC branch to discriminate positive words from negative words based on the sentiment labels $\hat{Y}^{S}_{p}$. This means that we use a sentiment lexicon to map each word of a sentence to a sentiment label in training. For each ASC feature $h^{p,l_1}_i$ generated by the first RNN layer, we use a linear layer and the softmax function to get its sentiment label:
$$y^S_i = \mathrm{softmax}(W_s^\top h^{p,l_1}_i), \tag{9}$$
where $W_s \in \mathbb{R}^{2d \times c}$ is a weight parameter, and $c = 3$ means the sentiment label is one of the three elements in the set {positive, negative, none}. We use the cross-entropy error to calculate the loss of each sentence:
$$L_s = -\frac{1}{n} \sum_{i=1}^{n} I(\hat{y}^S_i) \left(\log y^S_i\right)^\top, \tag{10}$$
where $I(\hat{y}^S_i)$ is the one-hot vector of $\hat{y}^S_i \in \hat{Y}^{S}_{p}$.
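The word-level supervision can be sketched as follows. The toy lexicon is hypothetical (only "love" as a positive word is taken from the running example), the lowercasing of words is an assumption, and averaging the cross-entropy over the words of a sentence is assumed as well.

```python
import math

LABELS = ("positive", "negative", "none")


def lexicon_labels(words, lexicon):
    """Map each word to a sentiment label; words outside the lexicon get 'none'."""
    return [lexicon.get(word.lower(), "none") for word in words]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_loss(logits_per_word, gold_labels):
    """Cross-entropy between predicted distributions and lexicon labels, averaged over words."""
    total = 0.0
    for logits, gold in zip(logits_per_word, gold_labels):
        probs = softmax(logits)
        total -= math.log(probs[LABELS.index(gold)])
    return total / len(gold_labels)


toy_lexicon = {"love": "positive", "terrible": "negative"}  # hypothetical entries
words = "I love the keyboard".split()
gold = lexicon_labels(words, toy_lexicon)
uniform_logits = [[0.0, 0.0, 0.0] for _ in words]
loss = sentence_loss(uniform_logits, gold)
```

Because the labels come from a lexicon lookup, this auxiliary task needs no extra manual annotation beyond the lexicon itself.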

Joint Loss
On the whole, the proposed framework DOER has two branches: one for ATE labeling and the other for ASC labeling. Each of them is differentiable and thus can be trained with gradient descent. We equivalently use the negative of $L(W_c, b_c)$ in Eq. (6) as the error to minimize via back-propagation through time (BPTT) (Goller and Kuchler, 1996). Thus, the loss is as follows:
$$L_e = -\sum_i \log p(y_i \mid h_i; W_c, b_c). \tag{11}$$
Then, the losses from both tasks and the auxiliary tasks are combined into the joint loss of the entire model:
$$J(\Theta) = (L_a + L_p) + (L_{u_A} + L_{u_P} + L_s) + \frac{\lambda}{2} \lVert \Theta \rVert^2, \tag{12}$$
where $L_a$ and $L_p$, which have the same formulation as Eq. (11), denote the losses for aspect term and polarity, respectively. $\Theta$ represents the model parameters, containing all weight matrices $W$, $U$, $v$ and bias vectors $b$. $\lambda$ is a regularization parameter.
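As a small illustration of how the terms combine, the sketch below adds the two tagging losses, the three auxiliary losses, and an L2 penalty. The unweighted sum and the factor of one half on the penalty are assumptions, and `joint_loss` is a hypothetical name.

```python
def joint_loss(l_a, l_p, l_ua, l_up, l_s, params, lam=0.001):
    """Sum of task losses, auxiliary losses, and an L2 penalty over flattened parameters."""
    l2_penalty = sum(w * w for w in params)
    return (l_a + l_p) + (l_ua + l_up + l_s) + 0.5 * lam * l2_penalty


# Hypothetical loss values and a tiny parameter vector.
value = joint_loss(1.0, 2.0, 0.1, 0.2, 0.3, [1.0, 2.0], lam=0.5)
```

All five losses receive gradients through the shared lower layers, so the auxiliary terms shape the same features the taggers use.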

Datasets
We conduct experiments on two datasets from the SemEval challenges and one English Twitter dataset. The details of these benchmark datasets are summarized in Table 1. $S_L$ comes from SemEval 2014 (Pontiki et al., 2014) and contains laptop reviews, and $S_R$ consists of restaurant reviews merged from SemEval 2014, SemEval 2015 (Pontiki et al., 2015), and SemEval 2016 (Pontiki et al., 2016). We keep the official division of these datasets into training, validation, and testing sets. The reported results on $S_L$ and $S_R$ are averaged scores of 10 runs. $S_T$ consists of English tweets. Because it lacks a standard train-test split, we report ten-fold cross-validation results on $S_T$, as done in (Mitchell et al., 2013; Li et al., 2019). For the auxiliary task of sentiment lexicon enhancement, we use a sentiment lexicon to generate the labels when training the model. The evaluation metric is the F1 score based on the exact match of an aspect term and its polarity.
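The exact-match criterion can be sketched as follows: a predicted pair counts as correct only if both the term span and the polarity match a gold pair. Micro-averaging over all sentences and the span representation `(start, end, polarity)` are assumptions of this sketch, not details stated in the text.

```python
def f1_exact_match(gold, pred):
    """F1 over (start, end, polarity) triples; gold and pred hold one set per sentence."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One sentence: the second term is found, but with the wrong polarity.
gold = [{(3, 5, "PO"), (7, 9, "PO")}]
pred = [{(3, 5, "PO"), (7, 9, "NG")}]
score = f1_exact_match(gold, pred)
```

Under this criterion a correct term with a wrong polarity is penalized twice, once as a missed gold pair and once as a spurious prediction.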

Word Embeddings
To initialize the domain-specific word embeddings, we train word embeddings with CBOW (Mikolov et al., 2013) on Amazon reviews and Yelp reviews, which are in-domain corpora for laptops and restaurants, respectively. Thus, for $S_L$ we use the Amazon embeddings, and for $S_R$ we use the Yelp embeddings. The Amazon review dataset contains 142.8M reviews, and the Yelp review dataset contains 2.2M restaurant reviews. The embeddings for all these datasets are trained with Gensim, which contains an implementation of CBOW. The parameter min_count is set to 10 and iter is set to 200. We use the Amazon embeddings as the domain-specific word embeddings for $S_T$, because the Amazon corpus is large and comprehensive although not in the same domain. The general-purpose embeddings are initialized with the GloVe.840B.300d embeddings (Pennington et al., 2014), whose corpus is crawled from the Web.

Settings
In our experiments, the regularization parameter $\lambda$ is empirically set to 0.001, and $d_G$ and $d_D$ are set to 300 and 100, respectively. The hidden state size $d$ of ReGU is 300. The hyperparameter $K$ is set to 5. We use Adam (Kingma et al., 2014) as the optimizer with a learning rate of 0.001 and a batch size of 16. We also employ dropout (Srivastava et al., 2014) on the outputs of the embedding layer and the two BiReGU layers. The dropout rate is 0.5. To avoid the exploding gradient problem, we clip the gradient norm to 5. The maximum number of epochs is set to 50. The word embeddings are fixed during the training process. We implemented DOER using the TensorFlow library (Abadi et al., 2016), and all computations are done on an NVIDIA Tesla K40 GPU.
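Gradient-norm clipping, mentioned above as the guard against exploding gradients, can be sketched in a few lines. Whether the norm is taken globally over all parameters or per tensor is not stated in the text; this sketch assumes the global variant, and `clip_by_global_norm` here is a plain-Python stand-in rather than a library call.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= max_norm:
        return grads, norm
    scale = max_norm / norm
    return [[g * scale for g in vec] for vec in grads], norm


# A gradient with norm 10 is scaled down to norm 5; a smaller one is left untouched.
clipped, norm_before = clip_by_global_norm([[6.0, 8.0]])
unchanged, _ = clip_by_global_norm([[3.0, 4.0], [0.0]])
```

Rescaling keeps the direction of the update and only limits its size.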

Baseline Methods
To validate the performance of the proposed model DOER on the aspect term-polarity co-extraction task, a comparative experiment is conducted with the following baseline models:
• CRF-{pipelined, joint, collapsed}: These models leverage linguistically informed features with CRF to perform the sequence labeling task using the pipelined, joint, or collapsed approach (Mitchell et al., 2013).
• NN+CRF-{pipelined, joint, collapsed}: An improvement of (Mitchell et al., 2013) that concatenates target word embedding and context four-word embeddings, in addition to using linguistically informed features plus CRF, to perform the sequence labeling task. Because the officially released code depends on an outdated library, we reproduce the results with the original settings instead of using that code.
• Sentiment-Scope: A collapsed CRF model (Li and Lu, 2017), which expands the node types of CRF to capture sentiment scopes. The discrete features used in this model are exactly the same as in the above two groups of models.
• DE-CNN+TNet: DE-CNN and TNet are the current state-of-the-art models for ATE and ASC, respectively. DE-CNN+TNet combines them in a pipelined manner. We use the official TNet-AS variant as our TNet implementation.
• LM-LSTM-CRF: A language model enhanced LSTM-CRF model proposed in , which achieved competitive results on several sequence labeling tasks (code: https://github.com/LiyuanLucasLiu/LM-LSTM-CRF).
• E2E-TBSA: An end-to-end model of the collapsed approach proposed to address ATE and ASC simultaneously (Li et al., 2019) (code: https://github.com/lixin4ever/E2E-TBSA).
• S-BiLSTM: A stacked BiLSTM model with two layers that adopts the joint approach and has the same Embeddings, Interface, and Joint Output layers as DOER.
• S-BiReGU: Similar to S-BiLSTM, but with a ReGU cell instead of an LSTM cell.
We use two abbreviations, AuL and AuS, in the ablation study. AuL denotes the auxiliary task of aspect term length enhancement, and AuS denotes the auxiliary task of sentiment lexicon enhancement. All baselines have publicly available code, and we ran the officially released code to reproduce the baseline results, except for the NN+CRF variants, whose code relies on an outdated library as noted above.

Results and Analysis
Comparison Results. The comparison results are shown in Table 2, which reports F1 scores of aspect term-polarity pairs. As the results show, DOER obtains consistent improvements over the baselines. Compared to the best pipelined model, DE-CNN+TNet, the proposed framework improves by 3.88%, 5.24%, and 2.63% on $S_L$, $S_R$, and $S_T$, respectively. This indicates that a carefully designed joint model can achieve better performance than pipelined approaches on the aspect term-polarity co-extraction task. Seven collapsed models are also included in the comparison. Compared to the best of these collapsed approaches, E2E-TBSA, DOER improves by 2.36%, 2.87%, and 2.24% on $S_L$, $S_R$, and $S_T$, respectively. This result shows the potential of a joint model that considers the interaction between the two relevant tasks. Compared with existing works based on the joint approach, i.e., CRF-joint and NN+CRF-joint, DOER makes substantial gains as well. The improvements over DE-CNN+TNet and E2E-TBSA are statistically significant (p < 0.05).
Ablation Study. To test the effectiveness of each component of DOER, we conduct an ablation experiment, with results shown in the last block of Table 2. The fact that S-BiReGU outperforms S-BiLSTM indicates the effectiveness of ReGU in our task. Its residual architecture makes information transfer to the next layers more effective. With the help of CSU, S-BiReGU+CSU achieves better performance than S-BiReGU alone. We believe the interaction of information between ATE and ASC is essential for each task to improve the other. Although samples with long aspect terms are rare, the auxiliary task of aspect term length still improves the performance. The other auxiliary task, sentiment lexicon enhancement, also improves the representation of the proposed framework. Combining S-BiReGU, CSU, AuL, and AuS, the proposed DOER achieves superior performance. It mainly benefits from the features enhanced by the two auxiliary tasks and from the interaction of the two separate routes of ATE and ASC.
Results on ATE. Table 3 shows the results of aspect term extraction only. DE-CNN is the current state-of-the-art model on ATE, as mentioned above. Compared with it, DOER achieves new state-of-the-art scores. DOER* denotes DOER without the ASC part. As the table shows, DOER achieves better performance than DOER*, which indicates that the interaction between ATE and ASC yields better performance for ATE than conducting a single task.

Case Study. Table 4 shows some examples from S-BiLSTM, S-BiReGU+CSU, and DOER. As observed in the first and second rows, S-BiReGU+CSU and DOER predict the aspect term-polarity pair correctly but S-BiLSTM does not. With the constraint of CSU, erroneous words can be avoided, as shown in the second row. The two auxiliary tasks work well on top of the CSU. They can capture a better sentiment representation, e.g., in the third row, and alleviate misjudgment on long aspect terms, e.g., in the last row.

Impact of K. We investigate the impact of the hyperparameter $K$ of the CSU on the final performance. The experiment is conducted on $S_L$ by varying $K$ from 1 to 10 with a step of 1. As shown in Figure 5, the value 5 is the best choice for the proposed method on our task. Given the performance demonstrated in the figure, $K$ is set to 5 across all experiments for simplicity.

Visualization of Attention Scores in CSU. We also visualize the attention scores $S^A$ and $S^P$ to explore the effectiveness of CSU. As shown in Figure 6, $S^A$ and $S^P$ have different values, which indicates that ATE and ASC indeed interact with each other. The red dashed rectangle in Figure 6a shows that the model learns to focus on the word "OS" itself when labeling it in the ATE task. Likewise, the red dashed rectangle in Figure 6b shows that the model learns to focus on the word "great" instead of "OS" itself when labeling the word "OS" in the ASC task. The fact that the polarity of the target aspect "OS" is positive, which is inferred from "great", verifies that the system is doing the right job. In summary, we can conclude that the attention scores learned by CSU benefit the labeling process.

Related Work
Our work spans two major topics of aspect-based sentiment analysis: aspect term extraction and aspect sentiment classification. Each of them has been studied by many researchers. Hu and Liu (2004) extracted aspect terms using frequent pattern mining. Qiu et al. (2011) and Liu et al. (2015) proposed rule-based approaches exploiting either hand-crafted or automatically generated rules about syntactic relationships. Mei et al. (2007), He et al. (2011), and Chen et al. (2014) used topic modeling based on Latent Dirichlet Allocation (Blei et al., 2003). All of the above methods are unsupervised. For supervised methods, the ATE task is usually treated as a sequence labeling problem solved by CRF. For the ASC task, a large body of literature has tried to utilize the relation or position between the aspect terms and the surrounding context words as the relevant information or context for prediction (Tang et al., 2016a; Laddha and Mukherjee, 2016). Convolutional neural networks (CNNs) (Poria et al., 2016; Li and Xue, 2018), attention networks (Wang et al., 2016b; Ma et al., 2017; He et al., 2017), and memory networks are also active approaches. However, the above methods are proposed for either the ATE or the ASC task. Lakkaraju et al. (2014) proposed to use hierarchical deep learning to solve these two subtasks. Wu et al. (2016) utilized a cascaded CNN and a multi-task CNN to address aspect extraction and sentiment classification. Their main idea is to directly map each review sentence to pre-defined aspect terms by classification and then classify the corresponding polarities. We believe pre-defined aspect terms are in general insufficient for most analysis applications because they will almost certainly miss many important aspects in review texts. This paper regards ATE and ASC as two parallel sequence labeling tasks and solves them simultaneously.
Compared with methods that address them one by one using two separate models, our framework is easy to use in practical applications because it outputs all the aspect term-polarity pairs of input sentences at once. Similar to our work, Mitchell et al. (2013) and  are also about performing two sequence labeling tasks, but they extract named entities and their sentiment classes jointly. We have a different objective and utilize a different model. Li et al. (2019) have the same objective as ours. The main difference is that theirs is a collapsed approach whereas ours is a joint approach. The model proposed by Li and Lu (2017) is also a collapsed approach based on CRF. Its performance is heavily dependent on manually crafted features.

Conclusion
In this paper, we introduced a co-extraction task involving aspect term extraction and aspect sentiment classification for aspect-based sentiment analysis and proposed a novel framework DOER to solve the problem. The framework uses a joint sequence labeling approach and focuses on the interaction between two separate routes for aspect term extraction and aspect sentiment classification. To enhance the representation of sentiment and alleviate the difficulty of long aspect terms, two auxiliary tasks were also introduced in our framework. Experimental results on three benchmark datasets verified the effectiveness of DOER and showed that it significantly outperforms the baselines on aspect term-polarity co-extraction.