Joint Learning for Targeted Sentiment Analysis

Targeted sentiment analysis (TSA) aims at extracting targets and classifying their sentiment classes. Previous works only exploit word embeddings as features and do not exploit the full potential of neural networks when jointly learning the two tasks. In this paper, we carefully design the hierarchical stack bidirectional gated recurrent units (HSBi-GRU) model to learn abstract features for both tasks, and we propose an HSBi-GRU based joint model which allows the target label to influence the sentiment label. Experimental results on two datasets show that our joint learning model outperforms the baselines and demonstrate the effectiveness of HSBi-GRU in learning abstract features.


Introduction
Targeted sentiment analysis (TSA) aims to extract targets in a text and simultaneously predict their sentiment classes (Hu and Liu, 2004; Jin et al., 2009; Li et al., 2010; Yang and Cardie, 2013). For example, given the sentence "ESPN poll says Michael Jordan is the greatest basketball athlete", the targets are ESPN and Michael Jordan, and their sentiment classes are Neutral and Positive respectively.
Targeted sentiment analysis can be seen as two tasks: target extraction and sentiment classification. Some researchers have tackled the two tasks separately, e.g., target extraction (Liu et al., 2013; Wang et al., 2016a; Yin et al., 2016) and sentiment classification (Tang et al., 2016; Wang et al., 2016b; Ruder et al., 2016). More recently, several studies have attempted to perform the two tasks jointly and generally treat them as sequence labeling problems, where the B/I/O labels indicate target boundaries and the Positive/Neutral/Negative labels denote sentiment classes (Klinger and Cimiano, 2013; Yang and Cardie, 2013). Mitchell et al. (2013) explore labeling targets and their sentiment classes simultaneously using Conditional Random Fields (CRF) with traditional hand-crafted discrete features, and present three models (pipeline, joint and collapsed) that differ in how the two tasks are labeled. They find that the pipeline method outperforms the joint model on the tweet dataset. Further, Zhang et al. (2015) introduce word embedding representations into the CRF framework and find that combining word embeddings with hand-crafted features is beneficial for TSA in the pipeline, joint and collapsed settings alike.
With the success of deep learning techniques, neural networks have demonstrated their capability in sequence labeling (Collobert et al., 2011; Pei et al., 2014; Chen et al., 2015). However, Zhang et al. (2015) only use word embeddings to enrich the features, without taking full advantage of the potential of neural networks to automatically capture important sequence labeling features such as long-distance dependencies and character-level features.
To make better use of neural networks in exploring appropriate character-level features and high-level semantic features for the two tasks, we design a hierarchical multi-layer bidirectional gated recurrent unit network (HMBi-GRU). It first uses a multi-layer Bi-GRU over the letter sequence of each word to automatically learn character features (e.g., capitalization, noun suffixes, etc.), and then uses another multi-layer Bi-GRU over the concatenation of each word embedding and its character features to model long-distance dependencies between words. The learned character features also help with out-of-vocabulary words.
In the above example, the target labels and sentiment labels for Michael Jordan are "B-Person, I-Person" and "B-Positive, I-Positive" respectively, so the boundary information (B, I) of the target labels and the sentiment labels is consistent. Intuitively, a human annotator would first predict the target label and then assign the corresponding sentiment label. We therefore feed the target label information into the prediction of the sentiment label, so that our model is aware of the target boundaries when predicting sentiment. In addition, we introduce a transition matrix (Collobert et al., 2011) to model the dependencies between labels.
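To make the label scheme concrete, the following minimal sketch converts annotated target spans into the two parallel label sequences. The function name `spans_to_labels`, the span tuple layout and the `Organization` type assigned to ESPN are assumptions made for illustration; they are not taken from the datasets.

```python
def spans_to_labels(tokens, spans):
    """Build parallel target and sentiment label sequences.

    spans: list of (start, end_exclusive, entity_type, sentiment) tuples,
    a hypothetical annotation format chosen for this sketch.
    """
    target = ["O"] * len(tokens)
    sentiment = ["O"] * len(tokens)
    for start, end, entity_type, polarity in spans:
        for i in range(start, end):
            prefix = "B" if i == start else "I"
            target[i] = f"{prefix}-{entity_type}"
            sentiment[i] = f"{prefix}-{polarity}"
    return target, sentiment


tokens = "ESPN poll says Michael Jordan is the greatest basketball athlete".split()
spans = [(0, 1, "Organization", "Neutral"), (3, 5, "Person", "Positive")]
target_labels, sentiment_labels = spans_to_labels(tokens, spans)
```

Because both sequences are built from the same spans, their B/I/O prefixes agree at every position, which is the boundary consistency the joint model is meant to exploit.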
We conduct experiments on two datasets, and the results show that our models outperform the baselines, which verifies the effectiveness of neural networks for TSA. In the experiments, we find that target label information is important for predicting the sentiment label. We also analyze the contribution of the multi-layer Bi-GRU and the hierarchical architecture to learning character features and dependencies between words.

Model
This section describes our model in detail; the overall architecture is shown in Figure 1. Suppose that a sentence is composed of $n$ words $[w_1, w_2, \dots, w_n]$. Each word $w_i$ consists of $l_i$ characters $[c_1, c_2, \dots, c_{l_i}]$, where $l_i$ is the length of $w_i$. We embed all words and characters into low-dimensional real-valued vectors, which can be learned by a language model (Bengio et al., 2003; Mikolov et al., 2013). We represent the sentence as a matrix of word embeddings $W = [E_1, E_2, \dots, E_n] \in \mathbb{R}^{n \times d_w}$. Similarly, word $w_i$ is represented as a matrix of character embeddings $C_i \in \mathbb{R}^{l_i \times d_c}$, where $d_w$ and $d_c$ are the sizes of the word embeddings and character embeddings respectively.
First, we design a hierarchical two-layer architecture in which each layer is a multi-layer bidirectional Gated Recurrent Unit network (MBi-GRU). GRUs are well suited to modeling sequences because they help alleviate the vanishing and exploding gradient problems. For an MBi-GRU with $M$ stacked Bi-GRU layers, the hidden state $h_t^m$ on layer $m \in \{1, 2, \dots, M\}$ at time $t \in \{1, 2, \dots, n\}$ is computed recursively from the hidden states of the layer below, where the superscript of $h$ denotes the layer within the MBi-GRU and $h^0$ denotes the original inputs. BiGRU is a bidirectional GRU: it reads its inputs $x_t$, which can be word embeddings or the hidden states of another BiGRU, with a forward GRU and a backward GRU and combines the two resulting hidden states, where $\oplus$ denotes the concatenation of two vectors.
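For reference, one common way to write a stacked bidirectional GRU that is consistent with this description is sketched below. The arrow notation and the exact argument lists are assumptions of this sketch and may differ from the authors' original equations.

```latex
\begin{aligned}
h_t^{m} &= \mathrm{BiGRU}\big(h_t^{m-1}\big), \qquad m = 1, \dots, M, \quad h_t^{0} = x_t \\
\overrightarrow{h}_t &= \mathrm{GRU}\big(x_t, \overrightarrow{h}_{t-1}\big) \\
\overleftarrow{h}_t &= \mathrm{GRU}\big(x_t, \overleftarrow{h}_{t+1}\big) \\
h_t &= \overrightarrow{h}_t \oplus \overleftarrow{h}_t
\end{aligned}
```

In the first line, the BiGRU at time $t$ depends on the whole input sequence of layer $m-1$ through its forward and backward recurrent states, not only on $h_t^{m-1}$; the shorthand keeps only the layer recursion visible.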
With the matrix of character embeddings $C_i$ as input, we use an MBi-GRU to learn character-level abstract features for word $w_i$. We apply max-pooling over the hidden states of this MBi-GRU to obtain the character-level features $r_i \in \mathbb{R}^{2d_c}$ for word $w_i$. The character features of all words in a sentence form a new matrix $C \in \mathbb{R}^{n \times 2d_c}$. Next, we concatenate $C$ with the matrix of word embeddings $W$ and denote the concatenation as $F \in \mathbb{R}^{n \times (d_w + 2d_c)}$. With $F$ as input, we use another MBi-GRU to learn the hidden states that serve as the final representation of the sentence. The hierarchical two-layer MBi-GRU architecture can therefore learn high-level abstract features that take both character-level and word-level information into account.
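The pooling and concatenation steps can be illustrated with a small sketch in plain Python. The helper names and the toy dimensions ($d_w = 3$, $d_c = 2$) are assumptions chosen for readability; the character-level hidden states are given as constants and are not produced by a real GRU here.

```python
def max_pool_over_time(hidden_states):
    """Element-wise max over a sequence of equally sized hidden-state vectors."""
    return [max(column) for column in zip(*hidden_states)]


def word_features(word_embedding, char_hidden_states):
    """Concatenate a word embedding (d_w) with pooled character features (2 * d_c)."""
    return list(word_embedding) + max_pool_over_time(char_hidden_states)


# Toy sizes (assumed): d_w = 3 and d_c = 2, so each hidden state has 2 * d_c = 4 entries.
char_states = [
    [0.1, -0.5, 0.3, 0.0],   # hidden state at character 1
    [0.4, 0.2, -0.1, 0.7],   # hidden state at character 2
    [-0.2, 0.9, 0.0, 0.1],   # hidden state at character 3
]
r_i = max_pool_over_time(char_states)               # character-level features of one word
f_i = word_features([1.0, 2.0, 3.0], char_states)   # one row of the matrix F
```

Max-pooling makes the size of `r_i` independent of the word length, so words of any length contribute a row of the same width to $F$.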
After learning the final representation of the sentence, we first project the features $tf_i = h_i^M$ of each word into the target label space to obtain the target label scores $y_t^i$ (Eq. 5), where $W_p^t$ and $b_p^t$ are the weight matrix and bias of this projection. The boundary of a target should be the same as the boundary of its sentiment in the label sequence. In the example from Section 1, the target labels and sentiment labels of Michael Jordan are "B-Person, I-Person" and "B-Positive, I-Positive" respectively. To learn this kind of consistency, we feed the target label information into the prediction of the sentiment label scores $y_s^i$ (Eq. 6), where the input is $sf_i = h_i^M \oplus y_t^i$, and $W_s^t$ and $b_s^t$ are the weight matrix and bias respectively. In this way, our model knows the target label information when predicting the corresponding sentiment.
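The data flow of the two projections can be sketched as follows. The sketch assumes that both projections are plain affine maps (weight matrix times input plus bias), which is the simplest reading of the description; the function names and toy weights are hypothetical.

```python
def affine(weights, bias, vector):
    """Compute weights @ vector + bias with plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def label_scores(h_i, W_t, b_t, W_s, b_s):
    """Score target labels from h_i, then sentiment labels from h_i concatenated with y_t."""
    y_t = affine(W_t, b_t, h_i)      # target label scores (assumed affine projection)
    sf_i = list(h_i) + y_t           # sentiment features: hidden state plus target scores
    y_s = affine(W_s, b_s, sf_i)     # sentiment label scores
    return y_t, y_s


# Toy example: hidden size 2, three target labels, four sentiment labels.
h_i = [1.0, -1.0]
W_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_t = [0.0, 0.0, 0.5]
W_s = [[0.0, 0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 0.0, 1.0],
       [1.0, 1.0, 0.0, 0.0, 0.0]]
b_s = [0.0, 0.0, 0.0, 0.0]
y_t, y_s = label_scores(h_i, W_t, b_t, W_s, b_s)
```

The sentiment weight matrix has as many columns as the hidden size plus the number of target labels, which is how the target scores enter the sentiment prediction.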
In sequence labeling, there are usually dependencies between labels. In the target labeling task, for example, label I never follows label O. To take such label dependencies into account, we introduce the transition matrix $A_{i,j}$ proposed by Collobert et al. (2011), which scores the transition from label $i$ to label $j$.
Given the sentence $x = [w_1, w_2, \dots, w_n]$ and the scores $y_t = [y_t^1, y_t^2, \dots, y_t^n]$ and $y_s = [y_s^1, y_s^2, \dots, y_s^n]$ computed by Eq. 5 and Eq. 6, we obtain the score $s(y_t, x, \theta_t)$ of a target label sequence by summing the transition scores and the scores $y_t^i$ over the sentence (Eq. 7), where $A^t$ is the label transition matrix for target labeling, $\theta_t = \theta \cup \{A^t_{i,j}\}$, and $\theta$ denotes the parameters of the HMBi-GRUs.
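Following the sentence-level scoring of Collobert et al. (2011), this sum can be sketched as below. The symbol $l_i$ (the label assigned to word $w_i$) and the bracket notation $[\cdot]_{l}$ (the entry of a score vector for label $l$) are introduced only for this sketch, and the treatment of the first position is an assumption: the term $A^{t}_{l_0, l_1}$ can be a learned start score or simply be omitted.

```latex
s(y_t, x, \theta_t) = \sum_{i=1}^{n} \Big( A^{t}_{l_{i-1},\, l_i} + \big[\, y_t^{i} \,\big]_{l_i} \Big)
```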
Next, we normalize the target label scores over all possible target label paths (i.e., $Y_t$) with a softmax function:

$$p_t(y_t \mid x) = \frac{e^{s(y_t, x, \theta_t)}}{\sum_{\hat{y}_t \in Y_t} e^{s(\hat{y}_t, x, \theta_t)}} \qquad (8)$$

We can likewise use Eq. 7 and Eq. 8 to obtain the normalized sentiment label scores $p_s(y_s \mid x)$. To train our model, we define the loss function as:

$$loss = -\log(p_t(y_t \mid x)) - \log(p_s(y_s \mid x)) \qquad (9)$$

Finally, we obtain the target label sequence $y_t^*$ and the sentiment label sequence $y_s^*$ with the maximal scores:

$$y_t^* = \arg\max_{\hat{y} \in Y_t} s(\hat{y}, x, \theta_t), \qquad y_s^* = \arg\max_{\hat{y} \in Y_s} s(\hat{y}, x, \theta_s)$$

Both $y_t^*$ and $y_s^*$ can be computed by the Viterbi algorithm.
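The normalization and decoding steps can be sketched in plain Python for a single label sequence. This is a minimal illustration, assuming no special start or end transition scores; the function names, the label indexing (0 = O, 1 = B, 2 = I) and the toy scores are hypothetical. The normalizer is computed with the forward algorithm instead of enumerating every path.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(labels, emissions, transitions):
    """Sum of per-word label scores and label-to-label transition scores."""
    score = emissions[0][labels[0]]
    for i in range(1, len(labels)):
        score += transitions[labels[i - 1]][labels[i]] + emissions[i][labels[i]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all label paths."""
    k = len(emissions[0])
    alpha = list(emissions[0])
    for scores in emissions[1:]:
        alpha = [scores[j] + logsumexp([alpha[i] + transitions[i][j] for i in range(k)])
                 for j in range(k)]
    return logsumexp(alpha)


def negative_log_likelihood(labels, emissions, transitions):
    """-log p(labels | x); the joint loss adds one such term per task."""
    return log_partition(emissions, transitions) - path_score(labels, emissions, transitions)


def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score."""
    k = len(emissions[0])
    best, backpointers = list(emissions[0]), []
    for scores in emissions[1:]:
        step, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            step.append(best[prev] + transitions[prev][j] + scores[j])
        best = step
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for a three-word sentence; the O -> I transition is heavily penalized.
emissions = [[2.0, 0.5, 0.0], [0.0, 1.5, 1.0], [0.5, 0.0, 2.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, -1.0, 1.0], [0.0, 0.0, 0.5]]
best_path, best_score = viterbi(emissions, transitions)
loss = negative_log_likelihood([0, 1, 2], emissions, transitions)
```

With a strongly negative score for the O to I transition, decoding avoids label sequences in which I directly follows O, which is the kind of dependency the transition matrix is meant to capture.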

Setup
To validate the effectiveness of our model, we conduct experiments on two datasets, consisting of English tweets and Spanish tweets, which were constructed by Mitchell et al. (2013). Table 2 gives the statistics of the data, including the number of sentences, the number of targets, and the numbers of positive, negative and neutral targets. To evaluate system performance, we adopt Precision, Recall and F-measure as used in Zhang et al. (2015). In our experiments, we evaluate both target detection (DT) and targeted sentiment analysis (TSA), in which a target is counted as correct only when both its boundary and its sentiment are correctly recognized. We do not compare with Mitchell et al. (2013) because they only evaluate the beginning of each target along with the sentiment expressed towards it.
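The two evaluation settings can be sketched as follows for a single sentence. The sketch assumes that DT compares span boundaries only and that the sentiment of a span is read from the sentiment label at its first word; the function names are hypothetical, and a corpus-level evaluation would additionally need to key each span by its sentence.

```python
def extract_targets(target_labels, sentiment_labels):
    """Collect (start, end_exclusive, sentiment) for every B/I-labelled target span."""
    targets, start = [], None
    for i, label in enumerate(list(target_labels) + ["O"]):   # sentinel closes the last span
        if start is not None and not label.startswith("I"):
            targets.append((start, i, sentiment_labels[start].split("-", 1)[-1]))
            start = None
        if label.startswith("B"):
            start = i
    return targets


def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 between two collections of items."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def evaluate(gold_targets, predicted_targets):
    """DT compares span boundaries only; TSA also requires the sentiment to match."""
    dt = precision_recall_f1([t[:2] for t in gold_targets],
                             [t[:2] for t in predicted_targets])
    tsa = precision_recall_f1(gold_targets, predicted_targets)
    return dt, tsa


gold_t = ["B-Organization", "O", "O", "B-Person", "I-Person", "O"]
gold_s = ["B-Neutral", "O", "O", "B-Positive", "I-Positive", "O"]
pred_t = ["B-Organization", "O", "O", "B-Person", "I-Person", "O"]
pred_s = ["B-Positive", "O", "O", "B-Positive", "I-Positive", "O"]   # first sentiment is wrong

gold = extract_targets(gold_t, gold_s)
pred = extract_targets(pred_t, pred_s)
dt_scores, tsa_scores = evaluate(gold, pred)
```

In this toy case both spans are found, so DT is perfect, while only one of the two predicted sentiments matches, which halves the TSA scores.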
In our experiments, we use the embeddings from Pennington et al. (2014) and Cieliebak et al. (2017) for English words and Spanish words respectively. The character embeddings are initialized with Xavier initialization (Glorot and Bengio, 2010) and their dimension is 50. The embeddings of all unknown words, as well as all weight matrices and biases, are also initialized with Xavier initialization (Glorot and Bengio, 2010). The dimensions of the character-level and word-level hidden states in the MBi-GRU are set to 300 and 600 respectively. The number of layers in the multi-layer bidirectional GRU is set to 2. To avoid overfitting, we apply dropout to the embeddings, $sf_i$ and $tf_i$, with a dropout rate of 0.5. The word embeddings and character embeddings are fine-tuned during training. Finally, we use Adam (Kingma and Ba, 2014) to optimize all parameters of our model.
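The reported settings can be collected in a small configuration sketch, together with illustrative versions of the initialization and dropout steps. The dictionary keys and function names are hypothetical, the uniform variant of Xavier initialization is an assumption (the text does not say which variant is used), and inverted dropout is one common formulation among several.

```python
import math
import random

# Hyperparameters as reported in the text; the key names are hypothetical.
CONFIG = {
    "char_embedding_dim": 50,
    "char_hidden_dim": 300,
    "word_hidden_dim": 600,
    "num_bigru_layers": 2,
    "dropout_rate": 0.5,
    "optimizer": "adam",
}


def xavier_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_out x fan_in matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


def dropout(vector, rate, rng=random):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


weights = xavier_uniform(CONFIG["char_embedding_dim"], CONFIG["char_hidden_dim"],
                         random.Random(0))
dropped = dropout([1.0] * 8, CONFIG["dropout_rate"], random.Random(1))
```

Dropout of this kind is applied only during training; at test time the vectors are passed through unchanged.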

Baselines
To investigate the performance of our joint model, we compare it with the following baselines:
• Discrete uses traditional discrete features as inputs and a multi-label CRF that contains two separate output clique potentials and two separate edge clique potentials for target extraction and sentiment classification respectively. There are also links between the target label and the sentiment label of each word (Zhang et al., 2015).
• Neural uses word embeddings transformed by a non-linear function as inputs; everything else is the same as in the Discrete model (Zhang et al., 2015).
• Integrated integrates both discrete features and word embeddings into the same CRF framework; the other settings are the same as in Discrete (Zhang et al., 2015).
• Bi-GRU uses only word embeddings as inputs and employs a Bi-GRU to learn the representation of the sentence.
• MBi-GRU also uses word embeddings as inputs, but uses an MBi-GRU to model the sentence.
• HBi-GRU first uses a Bi-GRU to learn character-level features for each word. The character-level features and word embeddings are then concatenated as inputs to another Bi-GRU, which learns the final representation of the sentence.
• No-Target uses HMBi-GRU to learn the representation of the sentence, but $h_i^M$ (described in Section 2) is used to predict the target label and the sentiment label separately. No-Target does not let target label information affect the sentiment label, which is the main difference between No-Target and our model.
Note that Bi-GRU, MBi-GRU and HBi-GRU all use a transition matrix to model the dependencies between labels and feed target label information into the prediction of the sentiment label. Table 2 compares the performance of our models with the baselines. Discrete obtains the worst results on the English dataset, and Neural obtains the worst results on the Spanish dataset. Integrated greatly improves the performance on both datasets because discrete features and word embeddings complement each other.

Analysis
Bi-GRU greatly improves the performance compared with Discrete and Neural but performs worse than Integrated. This verifies the effectiveness of neural networks in TSA, but also shows that a simple neural network is not enough to obtain better results. MBi-GRU learns high-level features via a multi-layer bidirectional GRU and achieves results comparable to Integrated.
Nevertheless, Bi-GRU and MBi-GRU do not make use of character-level features. HBi-GRU incorporates character-level features by running a Bi-GRU over the letter sequence of each word. HBi-GRU improves TSA by about 1.85% and 1.16% on the two datasets respectively, compared with Integrated. The performance of HBi-GRU demonstrates the importance of character-level features in TSA and shows that the hierarchical architecture is good at learning multi-level (character-level and word-level) features.
Our model improves TSA by 3.20% and 2.59% and DT by 2.39% and 0.27% on the two datasets respectively, compared with the best existing system, Integrated. Compared with No-Target, our model feeds target label information into the prediction of the sentiment label and improves TSA by about 0.66% and 1.44% and DT by about 0.59% and 0.91% on the two datasets respectively. These improvements demonstrate that target label information plays an important role in predicting the sentiment label. Note that the DT results of our model are also better than those of No-Target. A possible reason is that the gradients from the sentiment loss have a positive effect on target detection.
In summary, our model achieves state-of-the-art results in DT and TSA on both datasets. Character-level features play an important role in DT and TSA, and HMBi-GRU is good at learning multi-level features. Introducing target label information into the prediction of the sentiment label is useful for learning boundary consistency.

Case Study
Here, we use a tweet from the English dataset as a case study: "Congratulations to our Champ Roger Federer ...". We apply both No-Target and our model to this tweet. No-Target wrongly regards Champ as the beginning of the target and ignores Federer. There are two reasons: the first letter of Champ is capitalized, which may mislead No-Target, and No-Target has no connection between the target label and the sentiment label. Our model, in contrast, feeds target label information into the prediction of the sentiment label, and therefore tends to force the target and sentiment labels to share the same boundary information.
This case study shows that target label information plays an important role in predicting the sentiment label because the two label sequences share the same boundary information.

Related Work
Early work on targeted sentiment analysis was based on subjects and features. For example, Yi et al. (2003) extracted all references to a given subject and determined the sentiment of each reference. Hu and Liu (2004) first proposed several techniques to mine the product features on which customers have expressed opinions and to determine the sentiment of those opinions, and Popescu and Etzioni (2007) used unsupervised methods to identify opinions with respect to features and determine their polarity. Jin et al. (2009) proposed a novel lexicalized HMM model to mine customer reviews of a product and extract highly specific product-related entities on which reviewers expressed their opinions; they also identified the sentiment of these opinion entities. The works of Yang and Cardie (2013) and Li et al. (2010) are similar to that of Jin et al. (2009). However, these works only take pre-defined features into account and cannot find new features. To automatically extract targets and predict their sentiment, Mitchell et al. (2013) first proposed a conditional random fields (CRF) framework to jointly detect entities and identify their sentiment. Building on Mitchell et al. (2013), Zhang et al. (2015) explored the effect of word embeddings and automatic feature combinations by extending a CRF baseline with neural networks.
We propose a neural network based joint model that extracts targets and their sentiments simultaneously. Our model takes full advantage of the potential of neural networks to capture sequence labeling features such as long-distance dependencies and character-level features. Furthermore, our model allows the target label to have a positive effect on the sentiment label, because the target label shares boundary information with the sentiment label.

Conclusion
In this paper, we propose an HMBi-GRU based joint model for targeted sentiment analysis. Our model simultaneously extracts targets and predicts their sentiment. Furthermore, it feeds target label information into the prediction of the corresponding sentiment label. Experiments show that well-designed neural networks can greatly improve the results of targeted sentiment analysis, and that target label information plays an important role in predicting the sentiment label.