Multi-task Learning of Negation and Speculation for Targeted Sentiment Classification

The majority of work in targeted sentiment analysis has concentrated on finding better methods to improve overall results. We show that these models are not robust to linguistic phenomena, specifically negation and speculation. We propose a multi-task learning method that incorporates information from syntactic and semantic auxiliary tasks, including negation and speculation scope detection, to create English-language models that are more robust to these phenomena. Further, we create two challenge datasets to evaluate model performance on negated and speculative samples. We find that multi-task models and transfer learning via language modelling can improve performance on these challenge datasets, but the overall results indicate that there is still much room for improvement. We release both the datasets and the source code at https://github.com/jerbarnes/multitask_negation_for_targeted_sentiment.


Introduction
Targeted sentiment analysis (TSA) involves jointly predicting entities which are the targets of an opinion, as well as the polarity expressed towards them (Mitchell et al., 2013). The TSA task, which is part of the larger set of fine-grained sentiment analysis tasks, can enable companies to provide better recommendations (Bauman et al., 2017), as well as give digital humanities scholars a quantitative approach to identifying how sentiment and emotions develop in literature (Alm et al., 2005;Kim and Klinger, 2019).
Modelling TSA has moved from sequence labeling using conditional random fields (CRFs) (Mitchell et al., 2013) or recurrent neural networks (RNNs) (Zhang et al., 2015a; Katiyar and Cardie, 2016; Ma et al., 2018) to Transformer models (Hu et al., 2019). However, all these improvements have concentrated on making the best of relatively small task-specific datasets. As annotation for fine-grained sentiment is difficult and often has low inter-annotator agreement (Wiebe et al., 2005; Øvrelid et al., 2020), this data tends to be small and of varying quality. This lack of high-quality training data prevents TSA models from learning complex, compositional linguistic phenomena. For sentence-level sentiment classification, incorporating compositional information from relatively small amounts of negation or speculation data improves both robustness and general performance (Councill et al., 2010; Cruz et al., 2016; Barnes et al., 2020). Furthermore, transfer learning via language modelling also improves fine-grained sentiment analysis (Hu et al., 2019; Li et al., 2019b). In this paper, we explore two research questions:

1. Does multi-task learning of negation and speculation lead to more robust targeted sentiment models?

* The authors contributed equally.
2. Does transfer learning based on languagemodelling already incorporate this information in a way that is useful for targeted sentiment models?
We explore a multi-task learning (MTL) approach to incorporate auxiliary task information into targeted sentiment classifiers for English. In order to investigate the effects of negation and speculation in detail, we also annotate two new challenge datasets which contain negated and speculative examples. We find that performance is negatively affected by negation and speculation, but that MTL and transfer learning (TL) models are more robust than single task learning (STL). TL reduces the improvements of MTL, suggesting that TL is similarly effective at learning negation and speculation. The overall performance on the challenge datasets, however, confirms that there is still room for improvement.
The contributions of the paper are the following: i) we introduce two English challenge datasets annotated for negation and speculation, ii) we propose a multi-task model to incorporate negation and speculation information and evaluate it across four English datasets, and iii) using the challenge datasets, we show the quantitative effect of negation and speculation on TSA.

Background and related work
Fine-grained sentiment analysis is a complex task which can be broken into four subtasks (Liu, 2015): i) opinion holder extraction, ii) opinion target extraction, iii) opinion expression extraction, and iv) resolving the polarity relationship between the holder, target, and expression. From these four subtasks, targeted sentiment analysis (TSA) (Jin and Ho, 2009; Chen et al., 2012; Mitchell et al., 2013) reduces the fine-grained task to only the second and final subtasks, namely extracting the opinion target and classifying the polarity towards it.
English TSA datasets include MPQA (Wiebe et al., 2005), the SemEval Laptop and Restaurant reviews (Pontiki et al., 2014, 2016), and Twitter datasets (Mitchell et al., 2013; Wang et al., 2017). Further annotation projects have led to review datasets for Arabic, Dutch, French, Russian, and Spanish (Pontiki et al., 2016) and Twitter datasets for Spanish (Mitchell et al., 2013) and Turkish (Pontiki et al., 2016). Prior work has also explored the effects of different phenomena on TSA through error analysis and challenge datasets. Wang et al. (2017), Xue and Li (2018), and Jiang et al. (2019) showed the difficulties of polarity classification of targets in texts with multiple different polarities through the distinct sentiment error splits, the hard split, and the MAMS challenge dataset, respectively. Both Kaushik et al. (2020) and Gardner et al. (2020) augment document sentiment datasets by asking annotators to create counterfactual examples for the IMDB dataset. More recently, Ribeiro et al. (2020) showed how sentence-level sentiment models are affected by various linguistic phenomena including negation, semantic role labelling, temporal changes, and named entity recognition. Previous approaches to modelling TSA have often relied on general sequence labelling models, e.g. CRFs (Mitchell et al., 2013), probabilistic graphical models (Klinger and Cimiano, 2013), RNNs (Zhang et al., 2015b; Ma et al., 2018), and more recently pretrained Transformer models (Li et al., 2019b).
Multi-task and transfer learning: The main idea of MTL (Caruana, 1993) is that a model which receives signal from two or more correlated tasks will more quickly develop a useful inductive bias, allowing it to generalize better. This approach has gained traction in NLP, where several benchmark datasets have been created (Wang et al., 2019a,b). Under some circumstances, MTL can also be seen as a kind of data augmentation, where a model takes advantage of the extra training data available for an auxiliary task to improve the main task (Kshirsagar et al., 2015; Plank, 2016). Much work on MTL uses hard parameter sharing (Caruana, 1993), which shares all parameters across some layers of a neural network. When the main task and auxiliary task are closely related, this approach can be an effective way to improve model performance (Collobert et al., 2011; Peng and Dredze, 2017; Martínez Alonso and Plank, 2017; Augenstein et al., 2018), although it is often preferable to make predictions for low-level auxiliary tasks at lower layers of a multi-layer MTL setup (Søgaard and Goldberg, 2016), which we refer to as hierarchical MTL.
Transfer learning methods (Mikolov et al., 2013; Peters et al., 2018a; Devlin et al., 2019) can leverage unlabeled data, but require training large models on large amounts of data. However, it seems even these models can be sensitive to negation (Ettinger, 2020; Ribeiro et al., 2020; Kassner and Schütze, 2020). Specific to TSA, previous research has used MTL to incorporate document-level sentiment (He et al., 2019), or to jointly learn to extract opinion expressions (Li et al., 2019b; Chen and Qian, 2020).

Negation and Speculation Detection
As negation is such a common linguistic phenomenon and one that has a direct impact on sentiment, previous work has shown that incorporating negation information is crucial for accurate sentiment prediction. Feature-based approaches did this by including features from negation detection modules (Das and Chen, 2007;Councill et al., 2010;Lapponi et al., 2012), while it has now become more common to assume that neural models learn negation features in an end-to-end fashion (Socher et al., 2013). However, recent research suggests that end-to-end models are not able to robustly interpret the effect of negation on sentiment (Barnes et al., 2019), and that explicitly learning negation can improve sentiment results (Barnes, 2019;Barnes et al., 2020).
On the other hand, speculation refers to whether a statement is described as a fact, a possibility, or a counterfact (Saurí and Pustejovsky, 2009). Although there are fewer speculation annotated corpora available (Vincze et al., 2008;Kim et al., 2013;Konstantinova et al., 2012), including speculation information has shown promise for improving sentiment analysis at document-level (Cruz et al., 2016).
There has, however, been little research on how these phenomena specifically affect fine-grained approaches to sentiment analysis. This is important because, compared to document- or sentence-level tasks where there is often a certain redundancy in the sentiment signal, for fine-grained tasks negation and speculation often completely change the sentiment (see Table 2), making their identification and integration within fine-grained sentiment models essential.

Data
We perform the main experiments on four English-language datasets: the Laptop dataset from SemEval 2014 (Pontiki et al., 2014); the Restaurant dataset, which combines the SemEval 2014 (Pontiki et al., 2014), 2015 (Pontiki et al., 2015), and 2016 (Pontiki et al., 2016) restaurant data; the Multi-Aspect Multi-Sentiment (MAMS) dataset (Jiang et al., 2019); and finally the Multi-Perspective Question Answering (MPQA) dataset (Wiebe et al., 2005). Table 7 of Appendix A shows the distribution of the sentiment classes. We take the pre-processed Laptop and Restaurant datasets from Li et al. (2019a), and use the train, dev, and test splits that they provide. We use the NLTK word tokenizer to tokenise the Laptop, Restaurant, and MPQA datasets and spaCy for the MAMS dataset.
We choose datasets that differ largely in their domain, size, and annotation style in order to determine whether any trends we see are robust to these data characteristics or whether they are instead correlated with them. We convert all datasets to a targeted setup by extracting only the aspect targets and their polarity. Following recent work (Li et al., 2019a,b), we use the unified tagging scheme and convert all data to BIOUL format with unified sentiment tags, e.g. B-POS for a beginning tag with positive sentiment, so that we can cast the TSA problem as a sequence labeling task.
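As an illustration of the tagging scheme, the span-to-BIOUL conversion can be sketched as follows. This is a hypothetical helper, not the released preprocessing code; the exclusive-end span convention and the polarity label names are assumptions for the example:

```python
def to_bioul(tokens, targets):
    """Convert (start, end, polarity) target spans into unified BIOUL tags.

    B-/I-/L- mark the beginning, inside, and last token of a multi-token
    target, U- marks a single-token (unit) target, and O marks all other
    tokens. `end` is exclusive, as in Python slicing.
    """
    tags = ["O"] * len(tokens)
    for start, end, polarity in targets:
        if end - start == 1:
            tags[start] = f"U-{polarity}"
        else:
            tags[start] = f"B-{polarity}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{polarity}"
            tags[end - 1] = f"L-{polarity}"
    return tags

tokens = ["this", "is", "good", ",", "inexpensive", "sushi", "."]
print(to_bioul(tokens, [(4, 6, "POS")]))
# → ['O', 'O', 'O', 'O', 'B-POS', 'L-POS', 'O']
```

With this encoding every token receives exactly one tag, so standard sequence labelling architectures and CRF decoders apply unchanged.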
The statistics for these datasets are shown in Table 1. MAMS has the largest number of training targets (11,162), followed by Restaurant (3,896) and Laptop (2,044), while MPQA has the fewest (1,264). MPQA, however, has the longest average targets (6.3 tokens), compared to 1.3-1.5 tokens for the other datasets. This derives from the fact that entire phrases are often targets in MPQA. Finally, due to the annotation criteria, the MAMS data also has the highest proportion of sentences containing multiple aspects with multiple polarities: nearly 100% in train, compared to less than 10% for Restaurant.

Annotation for negation and speculation
Although negation and speculation are prevalent in the original data (negation and speculation occur in 13-25% and 9-20% of the sentences, respectively), it is difficult to pry apart improvement on the original data from improvement on these two phenomena. Therefore, we further annotate the dev and test sets for the Laptop and Restaurant datasets and, when possible, insert negation and speculation cues into sentences lacking them; we call the resulting datasets Laptop_Neg, Laptop_Spec, Restaurant_Neg, and Restaurant_Spec. While inserting negation into new sentences is quite trivial, as one can always negate full clauses (e.g. It's good → It's not true that it's good), adding speculation often requires rewording the sentence; we did not include sentences that speculation made unnatural. Inserting negation and speculation cues often leads to a change in polarity from the original annotation, as shown in the example in Table 2. We finally keep all sentences that contain a negation or speculation cue, including those that occur naturally in the data. As this process could introduce errors regarding the polarity expressed towards the targets, we doubly annotate the polarity for 50 sentences from the original dev data, the negated dev data, and the speculation dev data and calculate Cohen's Kappa scores. The statistics and inter-annotator agreement (IAA) scores are shown in Table 1; the Kappa scores (0.67-0.71) confirm the quality of the annotations. Table 7 of Appendix A shows the distribution of the sentiment classes.
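The Cohen's Kappa used for the IAA can be computed directly from the two annotators' label sequences. The following is a minimal stdlib-only sketch, not the authors' evaluation code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance given each annotator's
    label marginals."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two annotators' label distributions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators agree on 3 of 4 polarity labels.
print(cohens_kappa(["pos", "pos", "neg", "neg"],
                   ["pos", "neg", "neg", "neg"]))
# → 0.5
```

Kappa thus discounts the raw 75% agreement of the toy example for the agreement the two marginal distributions would produce by chance.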

Auxiliary task data
For the multi-task learning experiments, we use six auxiliary tasks: negation scope detection on the Conan Doyle dataset (NEG_CD) (Morante and Daelemans, 2012); negation detection (NEG_SFU) and speculation detection (SPEC) on the SFU NegSpec dataset (Konstantinova et al., 2012); and Universal Part-of-Speech tagging (UPOS), dependency relation prediction (DR), and full lexical analysis (LEX) on the Streusle dataset (Schneider and Smith, 2015). We show the train, dev, and test splits, as well as the number of labels, label entropy, and label kurtosis (Martínez Alonso and Plank, 2017) in Table 3. An example sentence with auxiliary labels is shown in Appendix B. Although it may appear that the SFU dataset is an order of magnitude larger than the Conan Doyle dataset, in reality most of its training sentences do not contain annotations, leaving similarly sized data once these are filtered out. As with the sentiment data, we convert the auxiliary tasks to BIO format and treat them as sequence labelling tasks.

Experiments
We experiment with a single task baseline (STL) and a hierarchical multi-task model with a skip connection (MTL), both of which are shown in Figure 1. For the STL model, we first embed a sentence and then pass the embeddings to a bidirectional LSTM (Bi-LSTM). These features are then concatenated to the input embeddings and fed to a second Bi-LSTM layer, ending with the token-wise sentiment predictions from the CRF tagger. For the MTL model, we additionally use the output of the first Bi-LSTM layer as features for a separate auxiliary task CRF tagger. As seen in Figure 1, the STL model and the MTL main task model use the same green layers. The MTL model additionally uses the pink layer for the auxiliary task, adding less than 3.4% trainable parameters for all auxiliary tasks except LEX, which adds 221.4% due to its large label set (see Table 3). Furthermore, at inference time the MTL model is as efficient as STL, given that it only uses the green layers when predicting the targeted sentiment; this is shown empirically in Table 20 of Appendix F.

Table 2:
original: this is good, inexpensive sushi.
negated: this is not good, inexpensive sushi.
speculative: I'm not sure if this is good, inexpensive sushi.

Embeddings: For the embedding layer, we perform experiments using 300-dimensional GloVe embeddings (hyperparameter details are given in Tables 17 and 18 of Appendix E, along with Figure 2 showing the expected validation scores from the hyperparameter tuning). For the MTL model, a single epoch involves training for one epoch on the auxiliary task and then an epoch on the main task, as previous work has shown that training the lower-level task first improves overall results (Hashimoto et al., 2017). In this work, we assume all of the auxiliary training tasks are conceptually lower than TSA.
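The wiring of the two models can be made concrete with a toy sketch. The stub "layers" below are stand-ins (the names and dict-based interface are our own, not the released code); each simply records its application, so the data flow through the shared lower Bi-LSTM, the skip connection, and the auxiliary branch is visible:

```python
def forward(tokens, layers, multi_task=False):
    """Schematic forward pass of the STL/MTL models described above."""
    e = layers["embed"](tokens)        # token embeddings
    h1 = layers["bilstm1"](e)          # shared lower Bi-LSTM
    # Skip connection: the input embeddings are concatenated to the
    # first layer's output before the second, main-task-only Bi-LSTM
    # (string concatenation here stands in for vector concatenation).
    h2 = layers["bilstm2"]([ei + hi for ei, hi in zip(e, h1)])
    out = {"sentiment": layers["main_crf"](h2)}  # token-wise CRF tags
    if multi_task:
        # The auxiliary CRF head branches off the shared lower layer
        # only, so main-task inference costs the same as in STL.
        out["auxiliary"] = layers["aux_crf"](h1)
    return out

def stub(name):
    # Toy layer: wraps each input so the computation path is readable.
    return lambda xs: [f"{name}({x})" for x in xs]

layers = {k: stub(k)
          for k in ("embed", "bilstm1", "bilstm2", "main_crf", "aux_crf")}
preds = forward(["not", "good"], layers, multi_task=True)
```

For example, `preds["auxiliary"][0]` is `"aux_crf(bilstm1(embed(not)))"`, showing that the auxiliary prediction never touches the second Bi-LSTM, while the sentiment path passes through both Bi-LSTMs plus the skip connection.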
Evaluation: For all experiments, we run each model five times (Reimers and Gurevych, 2017) and report the mean and standard deviation. We also take the distribution of the five runs to perform significance testing (Reimers and Gurevych, 2018), eliminating the need for Bonferroni correction. Following Dror et al. (2018), we use the non-parametric Wilcoxon signed-rank test (Wilcoxon, 1945) for the F1 metrics and the more powerful parametric Welch's t-test (Welch, 1947) for the accuracy metric. (For the TL embeddings, the transformer layers and the output from the non-contextualised character encoder, in total 7 layers, are weighted and summed.)
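For intuition, Welch's t statistic (with the Welch-Satterthwaite degrees of freedom) over two sets of run scores can be sketched as below. In practice a library routine such as SciPy's `ttest_ind(..., equal_var=False)` would be used; this stdlib-only version is illustrative, not the authors' evaluation script:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with (possibly) unequal variances,
    e.g. accuracy scores from five runs each of two models."""
    na, nb = len(sample_a), len(sample_b)
    # Per-sample variance of the mean (sample variance / n).
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Toy example: five accuracy scores per model.
t, df = welch_t([61.0, 62.0, 63.0, 64.0, 65.0],
                [62.0, 63.0, 64.0, 65.0, 66.0])
```

Unlike Student's t-test, Welch's variant does not assume the two models' run-score variances are equal, which suits comparisons between differently regularised models.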

Results
We report the F1 score for target extraction (F1-a), the macro F1 (F1-s) and accuracy (acc-s) for sentiment classification over all targets that have been correctly identified by the model, and finally the F1 score for the full targeted task (F1-i). The F1-i results are shown in Table 4, and the other metrics for the test split are reported in Tables 9 and 10 of Appendix C.
The MTL models outperform STL on four of the eight experiments (see Table 4), although the STL TL model is significantly better than the majority of MTL models on MPQA. Of the MTL models, NEG_CD + GloVe performs best on MPQA (18.88), DR + GloVe is best on Restaurant (66.06), and LEX is the best model on Laptop (54.85) with GloVe and on Restaurant (71.77) with TL. The TL models consistently outperform the GloVe models, by an average of 5.4 percentage points (pp) across all experiments, and give the best performance on all datasets.
The results suggest that transfer learning reduces the beneficial effects of MTL. At the same time, the results suggest that MTL does not hurt the STL models, as no STL model is significantly better than all of the MTL models across the datasets and embeddings for the F1-i metric.

Challenge Dataset Results
In order to isolate the effects of negation and speculation on the results, we test all models trained on the original Laptop and Restaurant datasets on the Laptop_Neg, Restaurant_Neg, Laptop_Spec, and Restaurant_Spec test splits. Tables 5 and 6 show the results for negation and speculation, respectively. The results for the dev split and the F1-s results for the test split are shown in Appendix D.
Firstly, all models perform comparatively worse on the challenge datasets, dropping an average of 24 and 25 pp in F1-i on the negation and speculation data, respectively. Nearly all of this drop comes from poorer classification (acc-s, F1-s), while target extraction (F1-a) is relatively stable. This demonstrates the importance of resolving negation and speculation for TSA and the usefulness of the annotated data for determining these effects.
Table 5: The bold values represent the best model, while highlighted models are those that perform better than the single task baseline. Marked models are significantly worse (p < 0.05) than the best performing model on the respective dataset, metric, and embedding.

On Laptop_Neg and Restaurant_Neg, incorporating negation auxiliary tasks gives an average improvement of 3.8 pp on the F1-i metric when using GloVe embeddings. More specifically, MTL with negation improves the sentiment classification scores, but does not help extraction. This makes sense conceptually, as negation has little effect on whether or not a word is part of a sentiment target. Instead,
jointly learning dependency relations (DR) and full lexical analysis (LEX) improves extraction results. Furthermore, when using TL instead of GloVe embeddings, the best MTL model (NEG_SFU) does marginally beat the STL TL equivalent on average, indicating that multi-task learning is still able to contribute something on top of transfer learning. On Laptop_Spec and Restaurant_Spec, MTL models improve results when using GloVe embeddings, with the additional speculation (SPEC) and dependency relation (DR) data improving the F1-i metric by 0.5 pp and 0.49 pp, respectively, on average. However, with TL, MTL only leads to benefits on the Restaurant dataset. Unlike the negation results, the speculation results appear to be helped more by syntactic auxiliary tasks like DR than by semantic tasks like NEG_CD and, to some extent, NEG_SFU.
The best MTL GloVe models on the original datasets (LEX and DR, respectively) also outperform the STL GloVe models on the challenge data, indicating that MTL leads to greater robustness. (The development F1-i result for LEX on the Laptop dataset is worse than STL by 0.05, but for all other F1-i Laptop results LEX is better than STL.) When comparing the STL model using GloVe and TL, the model improves on average by 9.55 pp on the negation dataset compared to 3.65 pp on the speculation dataset, suggesting that transfer learning is less effective for speculation.

Table 6: Sentiment (acc-s), extraction (F1-a) and full targeted (F1-i) results for the Laptop_Spec and Restaurant_Spec test splits, where the values represent the mean (standard deviation) of five runs with different random seeds. The bold values represent the best model, while highlighted models are those that perform better than the single task baseline. Marked models are significantly worse (p < 0.05) than the best performing model on the respective dataset, metric, and embedding.


Conclusion

In this paper, we have compared the effects of MTL using various auxiliary tasks for TSA and have created negation and speculation annotated challenge datasets for TSA in order to isolate the effects of these phenomena. We show that TSA methods are drastically affected by negation and speculation in the data. These effects can be reduced either by incorporating auxiliary task information into the model through MTL or through transfer learning. Additionally, MTL of negation can lead to small improvements when combined with transfer learning. Returning to the two original research questions, we conclude that in general 1) MTL using negation (speculation) as an auxiliary task does make TSA models more robust to negated (speculative) samples, and 2) transfer learning seems to incorporate much of the same knowledge. Additionally, incorporating syntactic information as an auxiliary task within MTL creates models that are more robust to both negation and speculation. Neither MTL nor TL currently guarantees improved performance (compare the performance of LEX using GloVe (28.59) to when it uses TL (28.56) in Table 6 for the Laptop dataset). Additionally, the results from the challenge datasets indicate that different auxiliary tasks improve the performance of different subtasks of TSA. This may suggest that target extraction and sentiment classification should not be treated as a single collapsed labelling task, as the two tasks are too dissimilar (Hu et al., 2019). Future work should consider using pipeline or joint approaches, where each subtask can be paired with the most beneficial auxiliary tasks. This decoupling could also allow MTL and transfer learning to complement each other more.
Finally, in order to improve reproducibility and to encourage further work, we release the code, datasets, and trained models associated with this paper, along with hyperparameter search details and compute infrastructure (Appendix E), the number of parameters and runtime details (Appendix F), and further detailed dev and test results (Appendices C and D), in line with the results checklist from Dodge et al.

Table 7: Sentiment class distribution statistics as a percentage of the number of targets (samples) for the sentiment datasets used in the experiments. pos, neu, neg, and both represent the sentiment classes positive, neutral, negative, and both, respectively.

Table 9: acc-s, F1-s, extraction (F1-a) and full targeted (F1-i) results for the Laptop and MPQA test splits, where the values represent the mean (standard deviation) of five runs with a different random seed. The bold values represent the best model, while highlighted models are those that perform better than the single task baseline. Marked models are statistically significantly worse than the best performing model on the respective dataset, metric, and embedding at a 95% confidence level.

Table 16: acc-s, F1-s, extraction (F1-a) and full targeted (F1-i) results for the speculation development split, where the values represent the mean (standard deviation) of five runs with a different random seed. The bold values represent the best model, while highlighted models are those that perform better than the single task baseline. Marked models are statistically significantly worse than the best performing model on the respective dataset, metric, and embedding at a 95% confidence level.

Table 20: Run/inference times for STL and MTL models trained on the Laptop dataset using either GloVe or TL embeddings. Each model was timed in seconds (s) to generate predictions for 800 sentences taken from the Laptop test split; this process was repeated five times, and we report the minimum (min) and maximum (max) time to generate predictions for those 800 sentences. We report these timings across different model configurations based on different batch sizes at prediction time and different devices. The trained MTL model used in this experiment was the MTL (NEG_SFU) version, chosen as it contains the largest number of total parameters, as shown in Table 19. All of these times were measured with the model already loaded into memory, using the Python timeit library.
Additionally, the GPU used was a GeForce GTX 1060 6GB, the CPU was an AMD Ryzen 5 1600, and the machine had 16GB of RAM.