Neural Networks for Open Domain Targeted Sentiment

Open domain targeted sentiment is the joint information extraction task that finds target mentions together with the sentiment towards each mention from a text corpus. The task is typically modeled as a sequence labeling problem.


Introduction
Targeted sentiment analysis has drawn growing research interest over the past few years. Compared with traditional sentiment analysis tasks, which extract the overall sentiment of a document, a sentence or a tweet, targeted sentiment analysis extracts the sentiment towards given target entities in a text, and is therefore practically more informative. An example is shown in Figure 1. There are at least two practical scenarios: (1) Certain entities of concern are specified, and the requirement is to extract the sentiment towards their mentions in a text. For example, one can be interested in the sentiment towards Google Inc., Microsoft and Facebook in financial news texts, or the sentiment towards Manchester United, Liverpool and Chelsea in tweets. (2) No specified target is given, and the requirement is to find sentiments towards entities in the open domain. For example, one might be interested in extracting the mentions of all persons and organizations, together with the sentiment towards each mention, from a news archive or a collection of novels.
There are two subtasks in targeted sentiment analysis, namely entity recognition and sentiment classification for each entity mention, which apply to both scenarios above. In scenario (1), entity recognition is relatively trivial, and can typically be achieved by pattern matching. Partly for this reason, most previous work has addressed targeted sentiment analysis as a pure classification task, assuming that target mentions are given (Jiang et al., 2011; Chen et al., 2012; Dong et al., 2014; Vo and Zhang, 2015). For scenario (2), a named entity recognition (NER) system can be used to extract targets, before the same targeted sentiment classification algorithms are applied. There has also been work that concentrates on extracting opinion targets (Jin et al., 2009; Jakob and Gurevych, 2010). In both cases, the data in Figure 1 can be used for training sentiment classifiers. Mitchell et al. (2013) took a different approach, extracting named entities and their sentiment classes jointly. They model the joint task as an extension to the NER task, where an extra sentiment label is assigned to each named entity, in addition to the entity label. As a result, the task can be solved using sequence labeling methods. As claimed by Mitchell et al. (2013), the joint task is particularly suitable when no extra resources are available for training separate syntactic analyzers or named entity recognizers. Such situations include tweets and low-resource languages/domains. Interestingly, because it contains entity information, the annotation in Figure 1 suffices for training joint entity and sentiment labelers even if it is the only resource available. The annotations in Figure 1 can be transformed into label sequences, as shown in Figure 2, which consists of two types of labels: the B/I/O labels indicate span boundaries, and the +/-/0 labels indicate sentiment classes.
The two types of labels can be assigned in a span→sentiment pipeline, or jointly as a multi-label task. Alternatively, as shown in Figure 2(b), the two types of labels can be collapsed into joint labels, such as B+ and I-, indicating the beginning of a positive entity and the middle of a negative entity, respectively. The collapsed labels allow joint entity recognition and sentiment classification to be achieved using a standard sequence labeler. Mitchell et al. (2013) compare a pipeline model, a joint model and a collapsed model under the same conditional random field (CRF) framework, finding that the pipeline method outperforms the joint model on a tweet dataset. Intuitively, the interaction between entity boundaries and sentiment classes might not be as strong as that between more closely-coupled sources of information, such as word boundaries and POS (Zhang and Clark, 2008), or named entities and constituents (Finkel and Manning, 2009), for which joint models significantly outperform pipeline models. On the other hand, there do exist cases where entity boundaries and sentiment classes reinforce each other. For example, in a tweet such as 'I like X.', the contextual pattern indicates both a positive sentiment and an entity in the place of X.
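As a concrete illustration, the mapping between the two label representations in Figure 2 can be sketched as follows. This is a minimal sketch with hypothetical helper names, not taken from the paper's released code:

```python
# Convert separate span (B/I/O) and sentiment (+/-/0) label sequences
# into collapsed labels such as "B+" or "I-", and back again.
# Non-entity words (O) carry no sentiment, so they stay "O".

def collapse(span_labels, sent_labels):
    """Merge boundary and sentiment labels into single collapsed tags."""
    collapsed = []
    for b, s in zip(span_labels, sent_labels):
        collapsed.append("O" if b == "O" else b + s)
    return collapsed

def split(collapsed_labels):
    """Inverse mapping back to (span, sentiment) label pairs."""
    spans, sents = [], []
    for tag in collapsed_labels:
        if tag == "O":
            spans.append("O")
            sents.append("_")       # NULL sentiment for non-entities
        else:
            spans.append(tag[0])
            sents.append(tag[1:])
    return spans, sents

# "So excited to meet my baby Farah !!!" with "baby Farah" a positive entity
spans = ["O", "O", "O", "O", "O", "B", "I", "O"]
sents = ["_", "_", "_", "_", "_", "+", "+", "_"]
print(collapse(spans, sents))  # ['O', 'O', 'O', 'O', 'O', 'B+', 'I+', 'O']
```

The collapsed representation is what allows a single off-the-shelf sequence labeler to solve both subtasks at once.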
Recently, neural network models have been increasingly used for sentiment analysis (Socher et al., 2013; Kalchbrenner et al., 2014; dos Santos and Gatti, 2014), achieving highly competitive results, which show the large potential of neural network models for this task. The main advantages of neural networks are two-fold. First, neural models use real-valued hidden layers to automatically learn feature combinations, which can capture complex semantic information that is difficult to express using traditional discrete manual features. Second, neural networks take distributed word embeddings as inputs, which can be trained from large-scale raw text, thus alleviating the scarcity of annotated data to some extent. In this paper, we exploit structured neural models for open targeted sentiment.
We take the CRF model of Mitchell et al. (2013) as the baseline, and explore two research questions. First, we make an empirical comparison between discrete and neural CRF models, and further combine the strengths of each model via feature integration. Second, we compare the effects of the pipeline, joint and collapsed models for open targeted sentiment analysis under the neural model settings. Our experiments show that the neural model gives competitive results compared with the discrete baseline, with relatively higher recalls. In addition, the integrated model significantly improves over both the discrete and the neural models.

Related Work
Targeted sentiment analysis is closely related to prior work on aspect-oriented (Hu and Liu, 2004), feature-oriented (Popescu and Etzioni, 2007) and topic-oriented (Yi et al., 2003) sentiment analysis. These related tasks typically concentrate on product review settings. In contrast, targeted sentiment analysis has a more general setting.
Recently, Wang et al. (2011) proposed a topic-oriented model, which extracts sentiments towards certain topics from tweets. Topics in their model resemble targets in our work, although topics are represented by hashtags, which exist in 14.6% of tweets and 27.5% of subjective tweets (Wang et al., 2011). In contrast, targeted sentiment analysis can identify all the mentions of target entities in tweets, thereby having larger coverage. The drawback is that the identification of mentions is subject to errors, and thus suffers from lower precision compared to hashtag matching.
Sequence labeling models have been used for extracting opinions and target entities as a joint task. Jin et al. (2009) use an HMM to extract opinion-bearing expressions and opinion targets. Li et al. (2010) improve the results by using a CRF to identify the opinion expressions and targets jointly. The task is sometimes referred to as fine-grained sentiment analysis (Wiebe et al., 2005). It is different from our setting in that the predicate-argument relation between opinion-bearing expressions and target entities is not explicitly modeled.
Recently, Yang and Cardie (2013) use a CRF to extract opinion-bearing expressions, opinion holders and opinion targets simultaneously. Their method is also centered on opinion-bearing expressions, and is therefore in line with Jin et al. (2009) and Li et al. (2010). In contrast, targeted sentiment analysis directly studies entity mentions and the sentiment towards each mention, without explicitly modeling the way in which the opinion is expressed. As a result, our task is more useful for applications such as broad-stroke reputation management, but offers less fine-grained operational insight. It also requires less fine-grained manual annotation.
As discussed in the introduction, targeted sentiment analysis falls into two main settings. The first is targeted sentiment classification, which assumes that entity mentions are given. Most previous work falls under this category (Jiang et al., 2011; Chen et al., 2012; Dong et al., 2014). The second is open domain targeted sentiment, which has been discussed by Mitchell et al. (2013). The task jointly extracts entities and sentiment classes, and is analogous to joint entity and relation extraction (Li and Ji, 2014) in that both are information extraction tasks with multi-label outputs.
Our work is related to the line of work on using neural networks for sentiment analysis. Socher et al. (2011) use recursive auto-encoders for sentiment analysis on the sentence level. They further extend the method to a syntactic treebank annotated with sentiment labels (Socher et al., 2013). More recently, Kalchbrenner et al. (2014) use a dynamic pooling network to model the structure of a sentence automatically, before classifying its sentiment. Deep belief networks have also been applied to semi-supervised sentiment classification. dos Santos and Gatti (2014) use deep convolutional neural networks with rich features to classify sentiments over tweets and movie reviews. These methods use different models to represent sentence structures, performing sentiment analysis on the sentence level, without modeling targets. Dong et al. (2014) perform targeted sentiment classification by using a recursive neural network to model the transmission of sentiment signals from opinion-bearing expressions to a target. They assume that the target mention is given, and perform three-way sentiment classification. In contrast, we apply a structured neural model for open domain targeted sentiment analysis, identifying and classifying all targets in a sentence simultaneously.

Discrete CRF Baselines
As shown in Figure 2, the input x to our tasks is a word sequence. Assuming no external resources, no POS tag is given for each input word x_i. For the pipeline and collapsed tasks, there is a single output label sequence y. For the joint task, there are two label sequences y and z, for entity and sentiment labels, respectively. We take the models of Mitchell et al. (2013) as our baseline, which are standard CRFs with discrete manual features. To facilitate comparison between the discrete baseline and our neural models, we give a unified formulation to all the models in this paper, introducing the neural and integrated models as extensions to the discrete models.
The baseline CRF structures for pipeline, joint and collapsed targeted sentiment analysis are shown in Figures 3(a), 3(b) and 3(c), respectively. In the figures, the input features are represented as black and white circles, indicating that they take 0/1 binary values. The labels O, B and I indicate a non-target, the beginning of a target, and part of a target, respectively. The labels +, −, 0 and φ indicate positive, negative, neutral and NULL sentiments, respectively. The NULL sentiment is assigned to O entities automatically, and modeled as a hidden variable in the pipeline and joint CRFs. The collapsed labels take combined meanings from their components.
The links between labels and inputs represent output clique potentials:

Ψ(x, y_i) = exp{θ · f(x, y_i)},

where f(x, y_i) is a discrete manual feature vector, and θ is the model parameter vector.
The links between labels represent edge clique potentials:

Φ(y_i, y_{i−1}) = exp{τ(y_i, y_{i−1})},

where τ(y_i, y_{i−1}) is the transition weight, which is also a model parameter.
For both the pipeline and collapsed models, the conditional probability of a label sequence given an input sequence is:

P(y | x) = (1 / Z(x)) · ∏_i Ψ(x, y_i) · ∏_i Φ(y_i, y_{i−1}),

where Z(x) is the partition function:

Z(x) = Σ_y ∏_i Ψ(x, y_i) · ∏_i Φ(y_i, y_{i−1}).

For the joint model, we apply a multi-label CRF structure, where there are two separate sets of output clique potentials Ψ_1(x, y_i) and Ψ_2(x, z_i) and two separate sets of edge clique potentials Φ_1(x, y_i, y_{i−1}) and Φ_2(x, z_i, z_{i−1}) for the label sets {B, I, O} and {+, −, 0}, respectively. In Figure 3(b), there are also links between the span label y_i and the sentiment label z_i for each word x_i. These links indicate label dependencies, which serve as constraints for decoding. For example, if y_i = O, then z_i must be φ.
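For intuition, the conditional probability above can be computed by brute force on a toy problem. This is an illustrative sketch only: real implementations compute Z(x) with the forward algorithm rather than enumeration, and the scores here are made-up log-potentials:

```python
import math
from itertools import product

# Brute-force illustration of P(y|x) = (1/Z(x)) prod Psi prod Phi, in log
# space: emit[i][y] stands in for log Psi(x, y_i), and trans[(y', y)] for
# log Phi(y, y').  Toy numbers only.

def seq_score(y, emit, trans):
    s = sum(emit[i][label] for i, label in enumerate(y))
    s += sum(trans[(y[i - 1], y[i])] for i in range(1, len(y)))
    return s

def prob(y, emit, trans, labels):
    n = len(emit)
    log_z = math.log(sum(math.exp(seq_score(c, emit, trans))
                         for c in product(labels, repeat=n)))
    return math.exp(seq_score(y, emit, trans) - log_z)

labels = ["O", "B+"]
emit = [{"O": 0.5, "B+": -0.2}, {"O": 0.1, "B+": 0.3}]
trans = {("O", "O"): 0.2, ("O", "B+"): 0.0,
         ("B+", "O"): 0.1, ("B+", "B+"): -1.0}

# Probabilities over all label sequences sum to one.
total = sum(prob(list(c), emit, trans, labels)
            for c in product(labels, repeat=2))
```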
We apply Viterbi decoding for all tasks, and training is performed using a max-margin objective, which is discussed in Section 6. Our training algorithm is different from that of Mitchell et al. (2013), but gives similar discrete CRF accuracies in our experiments. Wang and Mori (2009) also applied a max-margin training strategy to train CRF models. The set of features is taken from Mitchell et al. (2013) without changes, as shown in Table 1. Here the cluster features refer to Brown word clusters (Brown et al., 1992).
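The Viterbi decoder used for all tasks can be sketched as follows. Scores are illustrative; in a real decoder, forbidden transitions (such as O→I) can be handled with large negative transition weights:

```python
# Viterbi decoding over a label set, given per-position emission scores
# (the log output potentials) and pairwise transition scores (the log
# edge potentials).  A minimal sketch, not the paper's C++ implementation.

def viterbi(emit, trans, labels):
    """emit: list over positions of {label: score};
    trans: {(prev, cur): score}; returns the best label sequence."""
    best = [{y: emit[0][y] for y in labels}]   # best score ending in y
    back = []                                   # backpointers
    for i in range(1, len(emit)):
        scores, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans[(p, y)])
            ptr[y] = prev
            scores[y] = best[-1][prev] + trans[(prev, y)] + emit[i][y]
        best.append(scores)
        back.append(ptr)
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for ptr in reversed(back):                  # follow backpointers
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

path = viterbi(
    [{"O": 1.0, "B+": 0.0}, {"O": 0.0, "B+": 2.0}],
    {("O", "O"): 0.0, ("O", "B+"): 0.0,
     ("B+", "O"): 0.0, ("B+", "B+"): -5.0},
    ["O", "B+"])
# -> ["O", "B+"]
```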

Neural Models
We extend the discrete baseline system with two salient changes, which are illustrated in Figure 4. First, the input discrete features are replaced with continuous word embeddings. Each node in the input takes a real value between 0 and 1, as represented by the grey nodes in Figure 4. Second, a hidden neural layer h is added between the input nodes x and the label nodes y_i. Formally, the links between the input nodes x and the hidden nodes h_i for the node y_i in Figure 4 represent a feature combination function:

h_i = tanh(W · (e(x_{i−w}) ⊕ ··· ⊕ e(x_i) ⊕ ··· ⊕ e(x_{i+w})) + b),

where e is the embedding lookup function, ⊕ is the vector concatenation function, w is the size of the context window, the matrix W and vector b are model parameters, and tanh is the activation function.
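The feature combination function can be sketched in NumPy as follows, with a window of one word on each side. Dimensions, the padding token, and the random initialization are all illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Sketch of h_i = tanh(W (e(x_{i-1}) + e(x_i) + e(x_{i+1}) concatenated) + b)
# with window size 1 and toy dimensions.

rng = np.random.default_rng(0)
d_emb, d_hidden, window = 4, 5, 1

emb = {w: rng.standard_normal(d_emb)
       for w in ["<pad>", "my", "baby", "Farah"]}
W = rng.standard_normal((d_hidden, (2 * window + 1) * d_emb))
b = rng.standard_normal(d_hidden)

def hidden(words, i):
    """Concatenate embeddings in a window around position i, padding at edges."""
    ctx = []
    for j in range(i - window, i + window + 1):
        ctx.append(emb[words[j]] if 0 <= j < len(words) else emb["<pad>"])
    return np.tanh(W @ np.concatenate(ctx) + b)

h = hidden(["my", "baby", "Farah"], 1)  # hidden vector for "baby"
```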
The output clique potential of y_i becomes:

Ψ(x, y_i) = exp{σ(y_i) · h_i},

where σ is a model parameter, and the edge clique potentials remain the same as in the baseline. By using a hidden layer for automatic feature combination, the neural model is free of manual features, and can benefit from unsupervised embeddings. Decoding and training are performed using the same algorithms as the baseline. The main neural architectures in Figure 4 have been explored as conditional neural fields by Peng et al. (2009) and neural conditional random fields by Do et al. (2010), and are connected to the sentence-level likelihood neural networks of Collobert et al. (2011), as pointed out by Wang and Manning (2013b). The main differences between our model and the prior work lie in the multi-label settings and training details.

Integrated Models
Gleaning different sources of information, neural features and discrete linear features complement each other. As a result, a model that integrates both types of features can potentially achieve performance improvements. Most prior work attempts to add neural word embeddings into a discrete linear model (Turian et al., 2010; Yu et al., 2013; Guo et al., 2014), or to add discrete features into a neural model (Ma et al., 2014). We make a novel combination of the discrete models and the neural models by integrating both types of inputs into the same CRF framework. The architectures of the integrated models are shown in Figure 5. The main difference between Figure 5 and Figure 3 is the input layer. The integrated model takes both continuous word embeddings, which are shown as grey nodes, and discrete manual features, which are shown as black or white nodes, as the input.
A separate hidden layer is given to each type of input nodes, with the hidden layer for the embeddings being the same as in the neural baseline:

h_i = tanh(W_e · (e(x_{i−w}) ⊕ ··· ⊕ e(x_{i+w})) + b_e).

The hidden nodes g_i between the discrete features and the node y_i are:

g_i = tanh(W_d · f(x, y_i) + b_d).

Finally, the output clique potential of y_i becomes:

Ψ(x, y_i) = exp{σ_h(y_i) · h_i + σ_g(y_i) · g_i}.

The edge clique potentials remain the same as in the baseline models; the same training and decoding algorithms are used.
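The integrated log-potential can be sketched as follows. The parameter names (sigma_h, sigma_g, W_d, b_d), dimensions, and values are illustrative assumptions for one label y_i:

```python
import numpy as np

# Sketch of the integrated output potential: the log-potential for a label
# sums a neural term sigma_h . h_i and a discrete term sigma_g . g_i, where
# g_i = tanh(W_d f + b_d) over the sparse binary manual feature vector f.

rng = np.random.default_rng(1)
d_h, d_g, n_feats = 5, 3, 10

sigma_h = rng.standard_normal(d_h)   # weights over neural hidden layer
W_d = rng.standard_normal((d_g, n_feats))
b_d = rng.standard_normal(d_g)
sigma_g = rng.standard_normal(d_g)   # weights over discrete hidden layer

def log_potential(h_i, f_i):
    g_i = np.tanh(W_d @ f_i + b_d)          # hidden layer over discrete features
    return sigma_h @ h_i + sigma_g @ g_i    # combined clique score (log space)

f = np.zeros(n_feats)
f[[2, 7]] = 1.0                              # two active binary features
score = log_potential(rng.standard_normal(d_h), f)
```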

Training
We use a max-margin objective to train our model parameters Θ, which consist of θ, τ, W, b and σ for each model. The objective function is defined as:

min_Θ  Σ_{n=1}^{N} l(x_n, y_n, Θ) + (λ/2) ‖Θ‖²,

Wang and Manning (2013a) also investigated the integration of discrete and neural features in CRF models. They compared the effect of integration without hidden layers (i.e., Turian et al. (2010)) and with hidden layers (i.e., our method) for NER and chunking, finding that the former outperforms the latter. Our results are different from theirs: a hidden layer gives significant improvements on the targeted sentiment analysis task.
where {(x_n, y_n)}_{n=1}^{N} is the set of training examples, λ is a regularization parameter, and l(x_n, y_n, Θ) is the loss function for one example (x_n, y_n).
The loss function is defined as:

l(x_n, y_n, Θ) = max_y (s(x_n, y, Θ) + δ(y, y_n)) − s(x_n, y_n, Θ),

where s(x, y, Θ) = log P(y | x) is the log probability of y, and δ(y, y_n) is the Hamming distance between y and y_n. We use online learning to train the model parameters, updating them with the AdaGrad algorithm (Duchi et al., 2011). One thing to note is that our objective function is not differentiable because of the loss function l(x_n, y_n, Θ). We therefore use sub-gradients of l(x_n, y_n, Θ) instead, which can be computed as:

∂l/∂Θ = ∂s(x_n, ŷ, Θ)/∂Θ − ∂s(x_n, y_n, Θ)/∂Θ,

where ŷ is the predicted label sequence that maximizes the cost-augmented score in l(x_n, y_n, Θ).

Maximum-likelihood training is a commonly used alternative to max-margin training for neural networks. It has been applied to the models of Do et al. (2010) and Collobert et al. (2011), for example. However, our experiments show that maximum-likelihood training is not directly applicable to open domain targeted sentiment tasks. Although giving comparable overall accuracies on both entity and sentiment labels, it suffers from unbalanced sentiment labels, assigning the neutral sentiment to most entities. This problem can be addressed by imposing a polarity-sensitive cost during training, such as the sentence-level averaged F1-score between positive, negative and neutral labels. We skip these results due to space limitations. In contrast, max-margin training does not suffer from the label skew issue, thanks to the use of the Hamming loss in the objective function.

Data: Preprocessing is conducted to replace all usernames and URLs with the special tokens username and url, respectively. Following Mitchell et al. (2013), we report ten-fold cross-validation results. During training, we split 10% of the training corpus as the development corpus to tune hyper-parameters. Table 2 shows the corpus statistics.
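The max-margin (structured hinge) loss above can be illustrated on a toy problem by enumerating all label sequences; in practice, the max is computed with cost-augmented Viterbi decoding. The scorer below is a hypothetical stand-in for s(x, y, Θ), not the paper's model:

```python
from itertools import product

# Structured hinge loss with Hamming cost:
#   l = max_y [s(y) + delta(y, y_gold)] - s(y_gold)
# computed by brute-force enumeration for a 3-word toy example.

def hamming(y, y_gold):
    return sum(a != b for a, b in zip(y, y_gold))

def hinge_loss(score, y_gold, labels, n):
    best = max(score(y) + hamming(y, y_gold)
               for y in product(labels, repeat=n))
    return best - score(y_gold)

gold = ("O", "B+", "I+")

def score(y):
    # Toy scorer: 1.0 per correct position, 0.5 per wrong position,
    # so the margin over the gold sequence is too small at every position.
    return sum(1.0 if a == b else 0.5 for a, b in zip(y, gold))

loss = hinge_loss(score, gold, ["O", "B+", "I+"], 3)
# Each position can gain 0.5 from a mismatch (0.5 + 1.0 vs 1.0 + 0.0),
# so the cost-augmented max is 4.5 and the loss is 4.5 - 3.0 = 1.5.
```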
Parameters: For all the neural models, we set the hidden layer size |h| for neural features to 200, the hidden layer size |g| for discrete features to 30, the initial learning rate for AdaGrad to 0.01, and the regularization parameter λ to 10^−8. English and Spanish word embeddings are trained using the word2vec tool, with respective corpora of 20 million random tweets crawled via the Twitter API. The size of the word embeddings is 100. For English, there are 8,061 unique words, of which 25% are out-of-embedding-vocabulary (OOE) words; for Spanish, there are 14,648 unique words, of which 15% are OOE words.
Metrics: We take full-span metrics for evaluation, which is different from Mitchell et al. (2013), who evaluate mainly the beginning of spans. We measure the precision, recall and F-score of entity recognition (Entity), targeted sentiment analysis (SA; both entity and sentiment), and targeted subjectivity detection (Subjectivity; both entity and subjectivity, namely merging the + and − labels into a single "1" label, and performing two-way 0/1 subjectivity classification on entities). For SA, an entity is taken as correct only when the span and the sentiment are both correctly recognized. Similarly, for Subjectivity, an entity is taken as correct only when both the span and the subjectivity are correctly recognized.
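The full-span evaluation can be sketched as follows, with hypothetical helper names. Entity spans are extracted from collapsed label sequences as (start, end, sentiment) triples and scored by exact match, ignoring sentiment for the Entity metric:

```python
# Full-span scoring: a predicted entity counts for SA only when its span
# and sentiment both match the gold; for Entity, sentiment is ignored.

def spans(labels):
    """Extract (start, end, sentiment) triples from collapsed labels."""
    out, start = [], None
    for i, tag in enumerate(labels + ["O"]):   # sentinel closes open spans
        if tag.startswith("B") or tag == "O":
            if start is not None:
                out.append((start, i, labels[start][1:]))
                start = None
        if tag.startswith("B"):
            start = i
    return out

def f1(gold, pred, with_sentiment=True):
    g = {s if with_sentiment else s[:2] for s in gold}
    p = {s if with_sentiment else s[:2] for s in pred}
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = spans(["O", "B+", "I+", "O"])   # [(1, 3, '+')]
pred = spans(["O", "B-", "I-", "O"])   # [(1, 3, '-')]
# Correct span, wrong sentiment: Entity F1 is 1.0, SA F1 is 0.0.
```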
Code: We make the C++ implementations of the discrete, neural and combined models available under GPL at https://github.com/SUTDNLP/OpenTargetedSentiment.

Comparing Neural and Discrete Models
The main results on both the English and Spanish datasets are shown in Table 3, measured on the pipeline, the joint and the collapsed tasks, respectively. As can be seen from the table, the neural models give higher F-scores than the discrete CRF models on the English dataset, and comparable overall F-scores on the Spanish dataset. The gains on English are mostly attributed to improved recall, while the precision of the neural CRF models is relatively lower. A likely reason for this observation is that the neural model takes embedding inputs, which allow semantically similar words to be represented by similar vectors. As a result, the neural model can better capture patterns that do not occur in the training data. In contrast, the discrete model is based on manually defined binary features, which do not fire unless contained in the training data. Because discrete feature instantiation is based on exact matching, the discrete model gives relatively higher precision.
To further contrast the discrete and neural models, we plot the per-word accuracies of sentiment labels under both models in Figure 6. In the figure, each dot represents the accuracy of a sentence, measured in the pipeline task. The dots for both English and Spanish are scattered off the diagonal, showing that the two models make very different errors, which suggests that model integration can lead to better accuracy.

The Integrated Model
As shown in Table 3, the integrated model combines the relative advantages of both pure models, improving the recall over the discrete model and the precision over the neural model. In most cases, it gives the best results in terms of both precision and recall. For the English pipeline model, the integrated model improves the entity recognition F-score from 43.84% to 55.67% (significant at p < 10^−5 by pairwise t-test) over the discrete baseline, namely Mitchell et al. (2013). The overall SA F-score is improved from 31.73% to 40.06% (p < 10^−5). Similar improvements are achieved on the other test datasets.

Fine-tuning Word Embeddings
In the experiments above, word embeddings are fine-tuned for the neural models, but not for the integrated models. With fine-tuning, the embeddings of in-vocabulary words are treated as model parameters, and updated together with the other parameters during supervised training. This can improve the accuracy of the model by significantly enlarging the parameter space. However, it can make the embeddings of out-of-vocabulary (OOV) words less useful to the model, because the hidden layers are tuned with the adjusted embeddings. Figure 7 shows the effect of fine-tuning on the neural and integrated models using the Spanish data; similar findings apply to the English data. The neural model relies heavily on fine-tuning of embeddings, likely because the embeddings are its only source of input parameters. On the other hand, thanks to the rich discrete features in its parameter space, which suffice for capturing in-vocabulary patterns, the integrated model does not rely on fine-tuning of word embeddings; fine-tuning even causes slight overfitting and reduces performance. This makes the non-fine-tuned integrated model potentially advantageous in handling test data with many OOV words.
Comparing Pipeline, Joint and Collapsed Models
Mitchell et al. (2013) find that for the discrete CRF, the pipeline task gives competitive overall performance compared with the joint task. This suggests a relatively weak connection between entity boundary information and sentiment classes. We re-examine the comparison under the neural network setting, where automatic feature combination can be useful in capturing more subtle correlations between the two sources of information. As shown in Table 3, the overall results are similar to those of Mitchell et al. (2013), with both the neural and the integrated models demonstrating the same trends as the discrete baselines. A more detailed analysis, however, shows some relative strengths of the joint task. Table 4 gives the precision, recall and F-scores of subjectivity, and those of SA excluding neutral sentiment labels, on the Spanish data. Findings on the English dataset are consistent.
Table 4: Results on subjectivity and polarity.

The latter metrics highlight sentiment polarities, which can be relatively more useful. The joint task gives better F-scores on both metrics, which suggests that it is a viable choice for open targeted sentiment. When an external resource for entity recognition is available, the pipeline can be a favorable choice. On the other hand, although useful for some joint sequence labeling tasks (Ng and Low, 2004), the collapsed task does not seem to address the joint sentiment task as effectively. This finding is empirical, but consistent across our datasets.

Conclusion
We explored open domain targeted sentiment analysis using neural network models, which gave competitive results when evaluated against a strong discrete CRF baseline, with relatively higher recall. Given the complementary error distributions of the discrete and neural CRFs, we proposed a novel combination which significantly outperformed both models. Under the neural setting, we found that it is preferable to solve open targeted sentiment as a pipeline or joint multi-label task, but not as a joint task with collapsed labels.