TDParse: Multi-target-specific sentiment recognition on Twitter

Existing target-specific sentiment recognition methods consider only a single target per tweet, and have been shown to miss nearly half of the actual targets mentioned. We present a corpus of UK election tweets, with an average of 3.09 entities per tweet and more than one type of sentiment in half of the tweets. This requires a method for multi-target-specific sentiment recognition, which we develop by using the context around a target as well as syntactic dependencies involving the target. We present results of our method on both a benchmark corpus of single targets and the multi-target election corpus, showing state-of-the-art performance on both corpora and outperforming previous approaches to the multi-target sentiment task as well as deep learning models for single-target sentiment.


Introduction
Recent years have seen increasing interest in mining Twitter to assess public opinion on political affairs and controversial issues (Tumasjan et al., 2010; Wang et al., 2012) as well as products and brands (Pak and Paroubek, 2010). Opinion mining from Twitter is usually achieved by determining the overall sentiment expressed in an entire tweet. However, inferring the sentiment towards specific targets (e.g. people or organisations) is severely limited by such an approach, since a tweet may contain different types of sentiment expressed towards each of the targets mentioned. An early study by Jiang et al. (2011) showed that 40% of classification errors are caused by using tweet-level approaches that are independent of the target. Consider the tweet: "I will b voting 4 Greens ... 1st reason: 2 remove 2 party alt. of labour or conservative every 5 years. 2nd: fracking". The overall sentiment is positive, but there is a negative sentiment towards "labour", "conservative" and "fracking" and a positive sentiment towards "Greens". Examples like this are common in tweets discussing topics like politics. As has been demonstrated by the failure of election polls in both referenda and general elections (Burnap et al., 2016), it is important not only to understand the overall mood of the electorate, but also to distinguish and identify sentiment towards different key issues and entities, many of which are discussed on social media in the run-up to elections.
Recent developments in target-specific Twitter sentiment classification have explored different ways of modelling the association between target entities and their contexts. Jiang et al. (2011) propose a rule-based approach that utilises dependency parsing and contextual tweets. Dong et al. (2014), Tang et al. (2016a) and Zhang et al. (2016) have studied the use of different recurrent neural network models for such a task, but the gain in performance from the complex neural architectures is rather unclear.
In this work we introduce the multi-target-specific sentiment recognition task, building a corpus of tweets from the 2015 UK general election campaign suited to the task. In this dataset, target entities have been semi-automatically selected, and sentiment expressed towards multiple target entities as well as high-level topics in a tweet has been manually annotated. Unlike all existing studies on target-specific Twitter sentiment analysis, we move away from the assumption that each tweet mentions a single target; we introduce a more realistic and challenging task of identifying sentiment towards multiple targets within a tweet. To tackle this task, we propose TDParse, a method that divides a tweet into different segments, building on the approach introduced by Vo and Zhang (2015). TDParse exploits a syntactic dependency parser designed explicitly for tweets (Kong et al., 2014), and combines syntactic information for each target with its left-right context.
We evaluate and compare our proposed system both on our new multi-target UK election dataset and on the benchmarking dataset for single-target dependent sentiment (Dong et al., 2014). We show a clear state-of-the-art performance of TDParse over existing approaches for tweets with multiple targets, which encourages further research on the multi-target-specific sentiment recognition task. The data and code can be found at https://goo.gl/S2T1GO.

Related Work: Target-dependent Sentiment Classification on Twitter
The 2015 Semeval challenge introduced a task on target-specific Twitter sentiment (Rosenthal et al., 2015) which most systems (Boag et al., 2015; Plotnikova et al., 2015) treated in the same way as tweet-level sentiment. The best performing system in the 2016 Semeval Twitter challenge subtask B (Nakov et al., 2016), named Tweester, also performs tweet-level sentiment classification. This is unsurprising since tweets in both tasks only contain a single predefined target entity, and as a result a tweet-level approach is often sufficient. An exception to tweet-level approaches for this task, showing promise, is Townsend et al. (2015), who trained an SVM classifier for tweet segmentation, then used a phrase-based sentiment classifier for assigning sentiment around the target. The Semeval aspect-based sentiment analysis task (Pontiki et al., 2015; Pateria and Choubey, 2016) aims to identify sentiment towards entity-attribute pairs in customer reviews. This differs from our goal in the following ways: both the entities and attributes are limited to a predefined inventory of limited size; they are aspect categories reflected in the reviews rather than specific targets; and each review only has one target entity, e.g. a laptop or a restaurant. Also, sentiment classification in formal text such as product reviews is very different from that in tweets. Recently Vargas et al. (2016) analysed the differences between the overall and target-dependent sentiment of tweets for three events containing 30 targets, showing many significant differences between the corresponding overall and target-dependent sentiment labels, thus confirming that these are distinct tasks.
Early work tackling target-dependent sentiment in tweets (Jiang et al., 2011) designed target-dependent features manually, relying on the syntactic parse tree and a set of grammar-based rules, and incorporating the sentiment labels of related tweets to improve the classification performance. Recent work (Dong et al., 2014) used recursive neural networks and adaptively chose composition functions to combine child feature vectors according to their dependency type, to reflect sentiment signal propagation to the target. Their data-driven composition selection approach relies on the dependency types as features and a small set of rules for constructing target-dependent trees. Their manually annotated dataset contains only one target per tweet and has since been used for benchmarking by several subsequent studies (Vo and Zhang, 2015; Tang et al., 2016a; Zhang et al., 2016). Vo and Zhang (2015) exploit the left and right context around a target in a tweet and combine low-dimensional embedding features from both contexts and the full tweet using a number of different pooling functions. Despite not fully capturing semantic and syntactic information given the target entity, they show a much better performance than Dong et al. (2014), indicating that useful signals in relation to the target can be drawn from such a context representation. Both Tang et al. (2016a) and Zhang et al. (2016) adopt and integrate left-right target-dependent context into their respective recurrent neural network (RNN) models. While Tang et al. (2016a) propose two long short-term memory (LSTM) models showing competitive performance to Vo and Zhang (2015), Zhang et al. (2016) design a gated neural network layer between the left and right context in a deep neural network structure but require a combination of three corpora for training and evaluation. Results show that conventional neural network models like LSTM are incapable of explicitly capturing important context information of a target (Tang et al., 2016b). Tang et al. (2016a) also experiment with adding attention layers for LSTM but fail to achieve competitive results, possibly due to the small training corpus.
Going beyond the existing work, we study the more challenging task of classifying sentiment towards multiple target entities within a tweet. Using the syntactic information drawn from tweet-specific parsing, in conjunction with the left-right contexts, we show state-of-the-art performance in both single and multi-target classification tasks. We also show that the tweet-level approach that many sentiment systems adopted in both Semeval challenges fails to capture all target sentiments in a multi-target scenario (Section 5.1).

Creating a Corpus for Target Specific Sentiment in Twitter
We describe the design, collection and annotation of a corpus of tweets about the 2015 UK election.

Data Harvesting and Entity Recognition
We collected a corpus of tweets about the UK elections, as we wanted to select a political event that would trigger discussions on multiple entities and topics. Collection was performed through Twitter's streaming API by tracking 14 hashtags (#ukelection2015, #ge2015, #ukge2015, #ukgeneralelection2015, #bbcqt, #bbcsp, #bbcdp, #marrshow, #generalelection2015, #ge15, #generalelection, #electionuk, #ukelection and #electionuk2015). Data harvesting was performed between 7th February and 30th March 2015. This led to the collection of 712k tweets, from which a subset was sampled for manual annotation of target-specific sentiment. We also created a list of 438 topic keywords relevant to 9 popular election issues (EU and immigration, economy, NHS, education, crime, housing, defense, public spending, and environment and energy) for data sampling. The initial list of 438 seed words provided by a team of journalists was augmented by searching for similar words within a vector space on the basis of cosine similarity. Keywords are used both to identify thematically relevant tweets and as targets. We also consider named entities as targets. Sampling of tweets was performed by removing retweets and making sure each tweet contained at least one topic keyword from one of the 9 election issues, leading to 52,190 highly relevant tweets. For the latter we ranked tweets based on a "similarity" relation, where "similarity" is measured as a function of content overlap (Mihalcea, 2004). Formally, given a tweet $S_i$ represented by the set of $N$ words that appear in the tweet, $S_i = \{W_i^1, W_i^2, \ldots, W_i^N\}$, and our list of curated topic keywords $T$, the ranking function is defined in terms of the content overlap between $S_i$ and $T$ and of $|S_i|$, the total number of words in the tweet; unlike Mihalcea (2004) we prefer longer tweets. We used exact matching with flexibility on the special characters at either end.
TF-IDF normalisation and cosine similarity were then applied to the dataset to remove very similar tweets (empirically we set the cosine similarity threshold to 0.6). We also collected all external URLs mentioned in our dataset and their web content throughout the data harvesting period, filtering out tweets that only contain an external link or snippets of a web page. Finally we sampled 4,500 top-ranked tweets keeping the representation of tweets mentioning each election issue proportionate to the original dataset. For annotation we considered sentiment towards two types of targets: entities and topic keywords. Entities were processed in two ways: firstly, named entities (people, locations, and organisations) were automatically annotated by combining the output of Stanford Named Entity Recognition (NER) (Finkel et al., 2005), NLTK NER (Bird, 2006) and a Twitter-specific NER (Ritter et al., 2011). All three were combined for a more complete coverage of entities mentioned in tweets and subsequently corrected by removing wrongly marked entities through manual annotation. Secondly, to make sure we covered all key entities in the tweets, we also matched tweets against a manually curated list of 7 political-party names and added users mentioned therein as entities. The second type of targets matched the topic keywords from our curated list.
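The near-duplicate filtering step described above (TF-IDF weighting followed by a cosine similarity threshold of 0.6) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the whitespace tokenisation, the smoothed IDF formula, the greedy keep-first order and the function names `tfidf_vectors`, `cosine` and `drop_near_duplicates` are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenised document.

    Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1; the paper does not
    specify which TF-IDF variant was used.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drop_near_duplicates(tweets, threshold=0.6):
    """Greedily keep a tweet only if it is below the threshold to all kept tweets."""
    docs = [tweet.lower().split() for tweet in tweets]  # assumed tokenisation
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [tweets[i] for i in kept]


sample = [
    "labour will protect the nhs",
    "labour will protect the nhs now",
    "fracking is bad for the environment",
]
filtered = drop_near_duplicates(sample)
```

The pairwise comparison is quadratic in the number of tweets, which is acceptable for a sketch but would need an index or blocking strategy at the scale of the collected corpus.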

Manual Annotation of Target Specific Sentiment
We developed a tool for manual annotation of sentiment towards the targets (i.e. entities and topic keywords) mentioned in each tweet. The annotation was performed by nine PhD-level journalism students, each of them annotating approximately a ninth of the dataset, i.e. 500 tweets. Additionally, they annotated a common subset of 500 tweets consisting of 2,197 target entities, which was used to measure inter-annotator agreement (IAA). Annotators were shown detailed guidelines before taking up the task, after which they were redirected to the annotation tool itself (see Figure 1). Tweets were shown to annotators one by one, and they had to complete the annotation of all targets in a tweet to proceed. The tool shows a tweet with the targets highlighted in bold. Possible annotation actions consisted of: (1) marking the sentiment for a target as being positive, negative, or neutral, (2) marking a target as being mistakenly highlighted (i.e. 'doesnotapply') and hence removing it, and (3) highlighting new targets that our preprocessing step had missed, and associating a sentiment value with them. In this way we obtained a corrected list of targets for each tweet, each with an associated sentiment value.
Figure 1: Annotation tool for human annotation of target specific sentiment analysis
We measure inter-annotator agreement in two different ways. On the one hand, annotators achieved κ = 0.345 (z = 92.2, p < 0.0001) (fair agreement) when choosing targets to be added or removed. On the other hand, they achieved a similar score of κ = 0.341 (z = 77.7, p < 0.0001) (fair agreement) when annotating the sentiment of the resulting targets. It is worth noting that the sentiment annotation for each target involves choosing not only among positive, negative and neutral but also a fourth category, 'doesnotapply'.
The resulting dataset contains 4,077 tweets, with an average of 3.09 entity mentions (targets) per tweet. As many as 3,713 tweets have more than a single entity mention (target) per tweet, which makes the task different from Semeval 2015 task 10 subtask C (Rosenthal et al., 2015) and the target-dependent benchmarking dataset of Dong et al. (2014), where each tweet has only one target annotated and thus one sentiment label assigned. The number of targets in the 4,077 tweets to be annotated originally amounted to 12,874. However, the annotators unhighlighted 975 of them and added 688 new ones, so that the final number of targets in the dataset is 12,587. These are distributed as follows: 1,865 are positive, 4,707 are neutral, and 6,015 are negative. This distribution reflects the tendency of a theme like politics, where users tend to have more negative opinions. This is different from the Semeval dataset, which has a majority of neutral sentiment. Looking at the annotations provided for different targets within each tweet, we observe that 2,051 tweets (50.3%) have all their targets consistently annotated with a single sentiment value, 1,753 tweets (43.0%) have two different sentiments, and 273 tweets (6.7%) have three different sentiment values. These statistics suggest that providing a single sentiment for the entire tweet would not be appropriate in nearly half of the cases, confirming earlier observations (Jiang et al., 2011).
We also labelled each tweet containing one or more topics from the 9 election issues, and asked the annotators to mark the author's sentiment towards the topic. Unlike entities, topics may not be directly present in tweets. We compare topic sentiment with target/entity sentiment for 3,963 tweets from our dataset, adopting the approach by Vargas et al. (2016). Table 1 reports the individual $c(s_{target})$, $c(s_{topic})$ and joint $c(s_{target}, s_{topic})$ distributions of the target/entity sentiment $s_{target}$ and the topic sentiment $s_{topic}$. While $c(s_{target})$ and $c(s_{topic})$ report how often each sentiment category occurs in the dataset, the joint distribution $c(s_{target}, s_{topic})$ (the inner portions of the table) shows the discrepancies between target and topic sentiments. We observe marked differences between the two sentiment labels. For example, the topic sentiment is more neutral (1438.7 vs. 1104.1) and less negative (1930.7 vs. 2285.5) than the target sentiment. There are also a number of tweets expressing neutrality towards the topics mentioned but polarised sentiment towards targets (i.e. we observe $c(s_{topic} = neu \cap s_{target} = neg) = 258.6$ and $c(s_{topic} = neu \cap s_{target} = pos) = 101.4$), and vice versa. This emphasises the importance of distinguishing target entity sentiment not only from overall tweet sentiment but also from sentiment towards a topic.
Firstly we adopt the context-based approach by Vo and Zhang (2015), which divides each tweet into three parts (left context, target and right context), and where the sentiment towards a target entity results from the interaction between its left and right contexts. Such sentiment signal is drawn by mapping all the words in each context into low-dimensional vectors (i.e. word embeddings), using pre-trained embedding resources, and applying neural pooling functions to extract useful features.
Such a context set-up does not fully capture the syntactic information of the tweet and the given target entity, and by adding features from the full tweet (as done by Vo and Zhang (2015)) interactions between the left and right context are only implicitly modelled. Here we use a syntactic dependency parser designed explicitly for tweets (Kong et al., 2014) to find the parts of the tweet that are syntactically connected to each target. We then extract word embedding features from these syntactically dependent tokens $[D_1, \ldots, D_n]$ along their dependency path in the parse tree to the target, as well as from the left-target-right contexts (i.e. $L - T - R$). (Empirically, the proximity or location of such syntactic relations has not made much difference when used in feature weighting and is thus ignored.) Feature vectors generated from different contexts are concatenated into a final feature vector as shown in (2), where $P(X)$ represents a list of $k$ different pooling functions on an embedding matrix $X$:
$$F = [P(D), P(L), P(T), P(R)], \qquad P(X) = [f_1(X), \ldots, f_k(X)] \qquad (2)$$
Not only does this proposed framework make the learning process efficient without labour-intensive manual feature engineering and heavy architecture engineering for neural models, it also shows that complex syntactic and semantic information can be effectively drawn by simply concatenating different types of context together without the use of deep learning (other than pre-trained word embeddings).
Data set: We evaluate and compare our proposed system to the state-of-the-art baselines on a benchmarking corpus (Dong et al., 2014) that has been used by several previous studies (Vo and Zhang, 2015; Tang et al., 2016a; Zhang et al., 2016). This corpus contains 6,248 training tweets and 692 testing tweets with a sentiment class balance of 25% negative, 50% neutral and 25% positive.
Although the original corpus has only annotated one target per tweet, without specifying the location of the target, we expand this notion to consider cases where the target entity may appear more than once at different locations in the tweet, e.g.: "Nicki Minaj has brought back the female rapper. -really? Nicki Minaj is the biggest parody in popular music since the Lonely Island." Semantically it is more appropriate and meaningful to consider both target appearances when determining the sentiment polarity of "Nicki Minaj" expressed in this tweet. While it isn't clear if Dong et al. (2014) and Tang et al. (2016a) have considered this realistic same-target-multi-appearance scenario, Vo and Zhang (2015) and Zhang et al. (2016) do not take it into account when extracting target-dependent contexts. Contrary to these studies, we extend our system to fully incorporate the situation where a target appears multiple times at different locations in the tweet. We add another pooling layer in (2), where we apply a medium pooling function to combine the feature vectors extracted from each target appearance into the final feature vector for the sentiment classification of such targets. Now the feature extraction function $P(X)$ in (2) becomes:
$$P(X) = P_{medium}([P(X_1), \ldots, P(X_m)]) \qquad (3)$$
where $m$ is the number of appearances of the target, $X_1, \ldots, X_m$ are the embedding matrices extracted for each appearance, and $P_{medium}$ represents the dimension-wise medium pooling function.
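The feature construction described above can be illustrated with a small sketch: each context (dependency tokens, left, target, right) is an embedding matrix, each matrix is pooled dimension-wise, the pooled vectors are concatenated, and the vectors from several appearances of the same target are merged. The sketch makes several assumptions that are not confirmed by the text: the five pooling functions are taken from the experimental settings (max, min, mean, standard deviation, product), the standard deviation is the population variant, empty contexts map to zero vectors, and the "medium" pooling over appearances is read as a dimension-wise median. The function names are hypothetical.

```python
import math
import statistics

# Assumed pooling set: max, min, mean, (population) standard deviation, product.
POOLERS = (max, min, statistics.fmean, statistics.pstdev, math.prod)


def pool(matrix, dim):
    """Pool a list of word vectors column-wise with every function in POOLERS.

    An empty context (e.g. a target at the very start of a tweet) is mapped
    to a zero vector; this fallback is an assumption of the sketch.
    """
    if not matrix:
        return [0.0] * (dim * len(POOLERS))
    columns = list(zip(*matrix))
    features = []
    for pooler in POOLERS:
        features.extend(pooler(column) for column in columns)
    return features


def appearance_features(dep, left, target, right, dim):
    """Concatenate pooled features from the four contexts of one appearance."""
    return pool(dep, dim) + pool(left, dim) + pool(target, dim) + pool(right, dim)


def combine_appearances(per_appearance):
    """Merge feature vectors from several appearances, dimension by dimension.

    Assumption: the 'medium' pooling of the text is the median.
    """
    return [statistics.median(values) for values in zip(*per_appearance)]


left_context = [[1.0, 2.0], [3.0, 4.0]]          # two words, 2-d embeddings
pooled_left = pool(left_context, dim=2)
one_appearance = appearance_features([], left_context, [[0.5, 0.5]], [], dim=2)
merged = combine_appearances([[1, 10], [3, 20], [2, 30]])
```

With k pooling functions and d-dimensional embeddings, each context contributes k·d features, so one appearance yields 4·k·d features regardless of tweet length, which is what allows a linear classifier to be used downstream.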
Models: To investigate different ways of modelling target-specific context and evaluate the benefit of incorporating the same-target-multi-appearance scenario, we build these models:
• Semeval-best: a tweet-level model using various types of features, namely n-grams, lexica and word embeddings, with extensive data pre-processing and feature engineering. We use this model as a target-independent baseline as it approximates and beats the best performing system (Boag et al., 2015) in Semeval 2015 task 10. It also outperforms the highest ranking system, Tweester, on the Semeval 2016 corpus (by +4.0% in macro-averaged recall) and therefore constitutes a state-of-the-art tweet-level baseline.
• Naive-seg models: Naive-seg- slices each tweet into a sequence of sub-sentences by using punctuation (i.e. ',' '.' '?' '!'). Embedding features are extracted from each sub-sentence and pooling functions are applied to combine word vectors. Naive-seg extends it by adding features extracted from the left-target-right contexts, while Naive-seg+ extends Naive-seg by adding lexicon-filtered sentiment features.
• TDParse models: TDParse- uses a dependency parser to extract a syntactic parse tree to the target and maps all child nodes to low-dimensional vectors. Final feature vectors for each target are generated using neural pooling functions. While TDParse extends it by adding features extracted from the left-target-right contexts, TDParse+ uses three sentiment lexica for filtering words. TDParse+ (m) differs from TDParse+ by taking into account the 'same-target-multi-appearance' scenario. Both TDParse+ and TDParse+ (m) outperform state-of-the-art target-specific models.
• TDPWindow-N: the same as TDParse+ with a window to constrain the left-right context. For example, if N = 3 then we only consider 3 tokens on each side of the target when extracting features from the left-right context.
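Two of the simpler operations in the model list above, punctuation-based slicing (Naive-seg-) and the N-token context window (TDPWindow-N), can be sketched as below. This is only an illustration under assumptions: tokens are produced by whitespace splitting, the target is given as a token span, and the helper names `naive_segments` and `windowed_context` are hypothetical rather than taken from the released code.

```python
import re


def naive_segments(tweet):
    """Slice a tweet into sub-sentences at ',', '.', '?' and '!'."""
    parts = re.split(r"[,.?!]+", tweet)
    return [part.strip() for part in parts if part.strip()]


def windowed_context(tokens, target_start, target_end, n):
    """Return at most n tokens on each side of the target span.

    The target occupies tokens[target_start:target_end].
    """
    left = tokens[max(0, target_start - n):target_start]
    right = tokens[target_end:target_end + n]
    return left, right


segments = naive_segments("I like Greens, really! But fracking? No.")

tokens = "I will b voting 4 Greens next week for sure".split()
left, right = windowed_context(tokens, target_start=5, target_end=6, n=3)
```

Note that splitting on every full stop also breaks abbreviations and ellipses such as "alt." or "...", which is one reason this segmentation is only a rough approximation of parse-based segmentation.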

Experimental Settings
To compare our proposed models with Vo and Zhang (2015), we have used the same pre-trained embedding resources and pooling functions (i.e. max, min, mean, standard deviation and product). For classification we have used LIBLINEAR (Fan et al., 2008), which approximates a linear SVM. In tuning the cost factor C we perform five-fold cross validation on the training data over the same set of parameter values for both Vo and Zhang (2015)'s implementation and our system. This makes sure our proposed models are comparable with those of Vo and Zhang (2015).
Evaluation metrics: We follow previous work on target-dependent Twitter sentiment classification, and report our performance in accuracy, 3-class macro-averaged (i.e. negative, neutral and positive) $F_1$ score as well as 2-class macro-averaged (i.e. negative and positive) $F_1$ score, as used by the Semeval competitions (Rosenthal et al., 2015) for measuring Twitter sentiment classification performance.
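The three evaluation metrics named above can be computed as in the following sketch. The label strings and function names are assumptions for illustration; the 2-class score is taken here to be the mean of the positive and negative F1 scores while predictions are still made over all three classes, which is the usual Semeval convention.

```python
def f1_for_label(gold, pred, label):
    """F1 score of a single class; 0.0 when the class is never correctly predicted."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


def accuracy(gold, pred):
    """Fraction of targets whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


gold = ["pos", "neg", "neu", "neu", "neg", "pos"]
pred = ["pos", "neu", "neu", "neu", "neg", "neg"]

acc = accuracy(gold, pred)
f1_three_class = macro_f1(gold, pred, ["neg", "neu", "pos"])
f1_two_class = macro_f1(gold, pred, ["neg", "pos"])
```

Because the 2-class score ignores the neutral class, a model that over-predicts neutral can keep a reasonable accuracy while scoring poorly on it, which is why both macro-averaged scores are reported alongside accuracy.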

Experimental results and comparison with other baselines
We report our experimental results in Table 2 on the single-target benchmarking corpus (Dong et al., 2014), with three model categories: 1) tweet-level target-independent models, 2) target-dependent models without considering the 'same-target-multi-appearance' scenario and 3) target-dependent models incorporating the 'same-target-multi-appearance' scenario. We include the models presented in the previous section as well as models for target-specific sentiment from the literature where possible. Among the target-independent baseline models, Target-ind (Vo and Zhang, 2015) and Semeval-best have shown strong performance compared with SSWE and SVM-ind (Jiang et al., 2011) as they use more features, especially rich automatic features using the embeddings of Mikolov et al. (2013). Interestingly, they also perform better than some of the target-dependent baseline systems, namely SVM-dep (Jiang et al., 2011), Recursive NN and AdaRNN (Dong et al., 2014), showing the difficulty of fully extracting and incorporating target information in tweets. Basic LSTM models (Tang et al., 2016a) completely ignore such target information and as a result do not perform as well.
Among the target-dependent systems, neural network baselines have shown varying results. The adaptive recursive neural network, namely AdaRNN (Dong et al., 2014), adaptively selects composition functions based on the input data and thus performs better than a standard recursive neural network model (Recursive NN (Dong et al., 2014)). TD-LSTM and TC-LSTM from Tang et al. (2016a) model left-target-right contexts using two LSTM neural networks and by doing so incorporate target-dependent information. TD-LSTM uses two LSTM neural networks for modelling the left and right contexts respectively. TC-LSTM differs from (and outperforms) TD-LSTM in that it concatenates target word vectors with the embedding vectors of each context word. We also test the gated recurrent neural network models proposed by Zhang et al. (2016) on the same dataset. The gated models include: GRNN, which includes gates in its recurrent hidden layers; G3, which connects left-right context using a gated NN structure; and a combination of the two, GRNN+G3. Results show these gated neural network models do not achieve state-of-the-art performance. When we compare our target-dependent model TDParse+, which incorporates target-dependent features from syntactic parses, against the target-dependent models proposed by Vo and Zhang (2015), namely Target-dep, which combines full tweet (pooled) word embedding features with features extracted from left-target-right contexts, and Target-dep+, which adds target-dependent sentiment features on top of Target-dep, we see that our method beats both of these without using full tweet features (footnote 9). TDParse+ also outperforms the state-of-the-art TC-LSTM.
When considering the 'same-target-multi-appearance' scenario, our best model, TDParse+, improves its performance further (shown as TDParse+ (m) in Table 2). Even though TDParse doesn't use lexica, it shows competitive results to Target-dep+, which uses lexicon-filtered sentiment features. In the case of TDParse-, which uses exclusively features from syntactic parses, while it performs significantly worse than Target-ind, which uses only full tweet features, when the former is used in conjunction with features from left-target-right contexts it achieves better results than the equivalent Target-dep and Target-dep+. This indicates that syntactic target information derived from parses complements the left-target-right context representation well. Clausal segmentation of tweets or sentences can provide a simple approximation to parse-tree based models (Li et al., 2015). In Table 2 we can see our naive tweet segmentation models Naive-seg and Naive-seg+ also achieve competitive performance, suggesting to some extent that such a simple parse-tree approximation preserves the semantic structure of text and that useful target-specific information can be drawn from each segment or clause rather than the entire tweet.
Footnote 9: Note that the results reported in Vo and Zhang (2015) (71.1 in accuracy and 69.9 in F1) were not possible to reproduce by running their code with very fine parameter tuning, as suggested by the authors.
Table 2: Performance comparison on the benchmarking data (Dong et al., 2014)

[...] and applying our models described in Section 4.1.
We compare the results with our other developed baseline models in Section 4.1, including a tweet-level model, Semeval-best, and clausal-segmentation models that provide a simple parse-tree approximation, as well as state-of-the-art target-dependent models by Vo and Zhang (2015) and Zhang et al. (2016). The experimental set-up is the same as described in Section 4.2.
Data set: Our election data has a training/testing ratio of 3.70, containing 3,210 training tweets with 9,912 target entities and 867 testing tweets with 2,675 target entities.
Models: In order to limit our use of external resources we do not include Naive-seg+ and TDParse+ for evaluation as they both use lexica for feature generation. Since most of our tweets here contain N > 1 targets and the target-independent classifiers produce a single output per tweet, we evaluate each such output N times against the ground truth labels, to make different models comparable.
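The evaluation of target-independent classifiers described above, where one tweet-level prediction is compared against each of the N target labels in that tweet, can be sketched as follows. The data layout (one list of gold target labels per tweet) and the function names are assumptions made for illustration.

```python
def expand_tweet_predictions(tweet_predictions, gold_by_tweet):
    """Repeat each tweet-level prediction once per annotated target in that tweet."""
    expanded = []
    for prediction, gold_labels in zip(tweet_predictions, gold_by_tweet):
        expanded.extend([prediction] * len(gold_labels))
    return expanded


def flatten(gold_by_tweet):
    """Concatenate the per-tweet gold label lists into one target-level list."""
    return [label for gold_labels in gold_by_tweet for label in gold_labels]


# Two tweets: the first has three targets, the second has one.
gold_by_tweet = [["pos", "neg", "neg"], ["neg"]]
tweet_predictions = ["pos", "neg"]

target_level_pred = expand_tweet_predictions(tweet_predictions, gold_by_tweet)
target_level_gold = flatten(gold_by_tweet)
correct = sum(1 for g, p in zip(target_level_gold, target_level_pred) if g == p)
target_accuracy = correct / len(target_level_gold)
```

In this toy case the tweet-level model is right about the first tweet's overall sentiment but is still penalised for the two negative targets it cannot distinguish, which is the effect the multi-target evaluation is designed to expose.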
Results: Overall the models perform much worse than on the single-target benchmarking corpus, especially in 2-class $F_1$ score, indicating the challenge of multi-target-specific sentiment recognition. As seen in Table 3, although the feature-rich tweet-level model Semeval-best gives a reasonably strong baseline performance (as in Table 2), both it and Target-ind perform worse than the target-dependent baseline models Target-dep/Target-dep+ (Vo and Zhang, 2015), indicating the need to capture and utilise target-dependent signals in the sentiment classification model. The gated neural network models G3/GRNN/GRNN+G3 (Zhang et al., 2016) also perform worse than Target-dep+, while the combined model GRNN+G3 fails to boost performance, presumably due to the small corpus size.
Our final model TDParse achieves the best performance, especially in 3-class $F_1$ and 2-class $F_1$ scores, in comparison with other target-dependent and target-independent models. This indicates that our proposed models can provide better and more balanced performance between precision and recall. It also shows that the target-dependent syntactic information acquired from parse trees is beneficial for determining the target's sentiment, particularly when used in conjunction with the left-target-right contexts originally proposed by Vo and Zhang (2015) and in a scenario of multiple targets per tweet. Our clausal-segmentation baselines, the Naive-seg models, approximate such parse trees by identifying segments of the tweet relevant to the target, and as a result Naive-seg achieves competitive performance compared to other baselines.
Table 4: Performance analysis in S1, S2 and S3

State-of-the-art tweet level sentiment vs target-specific sentiment in a multi-target setting
To fully compare our multi-target-specific models against other target-dependent and target-independent baseline methods, we conduct an additional experiment by dividing our election data test set into three disjoint subsets, on the basis of the number of distinct target sentiment values per tweet: (S1) contains tweets having only one target sentiment, where the sentiment towards each target is the same; (S2) and (S3) contain two and three different types of targeted sentiment respectively (i.e. in S3, positive, neutral and negative sentiment are all expressed in each tweet). As described in Section 3.2, there are 2,051, 1,753 and 273 tweets in S1, S2 and S3 respectively. Table 4 shows results achieved by the tweet-level target-independent model Semeval-best, the state-of-the-art target-dependent baseline model Target-dep+, and our proposed final model TDParse, in each of the three subsets. We observe that Semeval-best performs the best in S1 compared to the two other models, but its performance gets worse when different types of target sentiment are mentioned in the tweet. It has the worst performance in S2 and S3, which again emphasises the need for multi-target-specific sentiment classification. Finally, our proposed final model TDParse consistently achieves better performance than Target-dep+ over all subsets, indicating its effectiveness even in the most difficult scenario S3.
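The partition into S1, S2 and S3 depends only on how many distinct sentiment values a tweet's targets carry, so it can be sketched in a few lines. The input layout (a mapping from a tweet identifier to its list of target labels) and the name `split_by_sentiment_variety` are assumptions for illustration; the sketch also assumes every tweet has at least one annotated target and that labels are restricted to positive, negative and neutral.

```python
def split_by_sentiment_variety(labels_by_tweet):
    """Group tweet ids by the number of distinct target sentiment values (1, 2 or 3)."""
    subsets = {1: [], 2: [], 3: []}
    for tweet_id, labels in labels_by_tweet.items():
        subsets[len(set(labels))].append(tweet_id)
    return subsets


labels_by_tweet = {
    "t1": ["neg", "neg", "neg"],          # S1: a single sentiment throughout
    "t2": ["pos", "neg", "neg", "neg"],   # S2: two distinct sentiments
    "t3": ["pos", "neu", "neg"],          # S3: all three sentiments
    "t4": ["neu"],                        # S1: one target only
}
subsets = split_by_sentiment_variety(labels_by_tweet)
```

Since the three groups are defined by mutually exclusive counts, every tweet falls into exactly one subset, which is what makes the per-subset scores in Table 4 directly comparable.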

Conclusion and Future work
In this work we introduce the challenging task of multi-target-specific sentiment classification for tweets. To support the study we have created a multi-target Twitter corpus on UK elections, which will be made publicly available. We develop a state-of-the-art approach which utilises the syntactic information from parse trees in conjunction with the left-right context of the target. Our method outperforms previous approaches on a benchmarking single-target corpus as well as on our new multi-target election data. Future work could investigate sentiment connections among all targets appearing in the same tweet as a multi-target learning task, as well as a hybrid approach that applies either Semeval-best or TDParse depending on the number of targets detected in the tweet.