Sentiment Analysis Is Not Solved! Assessing and Probing Sentiment Classification

Neural methods for sentiment analysis have led to quantitative improvements over previous approaches, but these advances are not always accompanied by a thorough analysis of the qualitative differences. Therefore, it is not clear what outstanding conceptual challenges for sentiment analysis remain. In this work, we attempt to discover what challenges still prove a problem for sentiment classifiers for English and to provide a challenging dataset. We collect the subset of sentences that an (oracle) ensemble of state-of-the-art sentiment classifiers misclassifies and then annotate them for 18 linguistic and paralinguistic phenomena, such as negation, sarcasm, and modality. Finally, we provide a case study that demonstrates the usefulness of the dataset to probe the performance of a given sentiment classifier with respect to linguistic phenomena.


Introduction
Over the last 15 years, approaches to sentiment analysis which concentrated on creating and curating sentiment lexicons (Turney, 2002; Liu et al., 2005) or used n-grams for classification (Pang et al., 2002) have been replaced by models that are able to exploit compositionality (Socher et al., 2013; Irsoy and Cardie, 2014) or implicitly learn relations between tokens (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018). These neural models push the state of the art to over 90% accuracy on binary sentence-level sentiment analysis.
Although these methods show a quantitative improvement over previous approaches, they are not often accompanied by a thorough analysis of the qualitative differences. This has led to the current situation, where we are aware of quantitative, but not qualitative, differences between state-of-the-art sentiment classifiers. It also means that we are not aware of the outstanding conceptual challenges that we still face in sentiment analysis.
In this work, we attempt to discover what conceptual challenges still prove a problem for all state-of-the-art sentiment methods for English. To do so, we train and test three state-of-the-art machine learning classifiers (BERT, ELMo, and a BiLSTM) as well as a bag-of-words classifier on six sentence-level sentiment datasets available for English. We then collect the subset of sentences that all models misclassify and annotate them for 18 linguistic and paralinguistic phenomena, such as negation, sarcasm, modality, or world knowledge. We present this new data as a challenging dataset for future research in sentiment analysis, which enables probing the problems that sentiment classifiers still face in more depth.
Specifically, the contributions of this work are:
• the creation of a challenging sentiment dataset from previously available data,
• the annotation of errors in this dataset for 18 linguistic and paralinguistic phenomena,
• a thorough analysis of the dataset, and
• a practical use case demonstrating how the dataset can be used to probe the particular types of errors made by a new model.
The rest of the paper is organized into related work (Section 2), a description of the experimental setup (Section 3), a brief description of the dataset (Section 4), an in-depth analysis (Section 5), a case-study that demonstrates the usefulness of the dataset (Section 6), and finally a conclusion (Section 7).

arXiv:1906.05887v1 [cs.CL] 13 Jun 2019
Related Work
Neural networks are now ubiquitous in NLP tasks, often giving state-of-the-art results. However, they are known for being "black boxes" which are not easily interpretable. Recent interest in interpreting these methods has led to new lines of research which attempt to discover what linguistic phenomena neural networks are able to learn (Linzen et al., 2016; Gulordava et al., 2018; Conneau et al., 2018), how robust neural networks are to perturbations in input data (Ribeiro et al., 2018; Ebrahimi et al., 2018; Schluter and Varab, 2018), and what biases they propagate (Park et al., 2018; Zhao et al., 2018; Kiritchenko and Mohammad, 2018).
Specifically within the task of sentiment analysis, certain linguistic phenomena are known to be challenging. Negation is one of the aspects of language that most clearly affects expressions of sentiment and that has been studied widely within sentiment analysis (see Wiegand et al. (2010) for an early survey). The difficulties of resolving negation for sentiment analysis include determining negation scope (Hogenboom et al., 2011; Lapponi et al., 2012; Reitan et al., 2015) and semantic composition (Wilson et al., 2005; Choi and Cardie, 2008; Kiritchenko and Mohammad, 2016).
Verbal polarity shifters have also been studied. Schulder et al. (2018) annotate verbal shifters at the sense level. They conclude that, although individual negation words are more frequent in the Amazon Product Review Data corpus, the overall frequency of negation words and shifters is likely similar. This suggests that there is a Zipfian tail of shifters which are not often handled within sentiment analysis.
Furthermore, the linguistic phenomenon of modality has also been shown to be problematic. Both Narayanan et al. (2009) and Liu et al. (2014) explore the effect of modality on sentiment classification and find that explicitly modeling certain modalities improves classification results. They advocate for a divide-and-conquer approach, which would address the various realizations of modality individually. Benamara et al. (2012) perform linguistic experiments using native speakers concerning the effects of both negation and modality on opinions, and similarly find that the type of negation and modality determines the final interpretation of polarity.
The sentiment models inspected in these analyses, however, were lexicon-, word-, and n-gram-based models. It is not clear that neural networks have the same weaknesses, as they have been shown to deal with compositionality and long-distance dependencies to some degree (Socher et al., 2013; Linzen et al., 2016). Additionally, the authors did not attempt to discover from the data what phenomena were present that could affect sentiment. In the current paper we aim to provide a systematic analysis of error types found across a range of datasets, domains and classifiers.

Experimental Setup
In these experiments, we test three state-of-the-art models for sentence-level sentiment classification. We choose to focus on sentence-level classification for three reasons: 1) sentence-level classification is a popular and useful task, 2) there is a large amount of high-quality annotated data available, and 3) annotation of linguistic phenomena is easier at sentence level than at document level. It is also likely that most phenomena that occur at sentence level, e.g., negation, comparative sentiment, or modality, will transfer to other sentiment tasks.

Datasets
In order to discover a subset of sentences that all state-of-the-art models are unable to correctly predict, we collect six English-language datasets previously annotated for sentence-level sentiment from five domains (news wire, hotel reviews, movie reviews, Twitter, and micro-blogs). Table 1 shows the statistics for each of the datasets.
MPQA The Multi-Perspective Question Answering (MPQA) Opinion Corpus (Wilson et al., 2005) provides contextual polarity annotations for English news documents from the world press. The annotations are private state frames, which include annotations for text anchor, source, target, and attitude type, among others. We extract sentiment-labeled sentences by taking only those sentences that have sentiment annotations. Additionally, we remove sentences that contain both positive and negative sentiment. This leaves a three-class (positive, neutral, negative) sentence-level dataset.
OpeNER The Open Polarity Enhanced Named Entity Recognition (OpeNER) sentiment datasets (Agerri et al., 2013) contain hotel reviews annotated for 4-class (strong positive, positive, negative, strong negative) sentiment classification. We take the English dataset, where self-attention networks give state-of-the-art results (Ambartsoumian and Popowich, 2018).

Stanford Sentiment Treebank
The Stanford Sentiment Treebank (Socher et al., 2013) contains 11,855 English sentences from movie reviews which have been annotated at each node of a constituency parse tree. Contextualized word representations combined with a bi-attentive sentiment network currently give state-of-the-art results (Peters et al., 2018).
Täckström dataset The Täckström dataset (Täckström and McDonald, 2011) contains product reviews which have been annotated at both document and sentence level for three-class sentiment, although the sentence-level annotations also have a "not relevant" label. We keep the sentence-level annotations, which gives 3,662 sentences annotated for three-class sentiment.
Thelwall dataset The Thelwall dataset derives from datasets provided with SentiStrength (Thelwall et al., 2010). It contains microblogs annotated for both positive and negative sentiment on a scale from 1 to 5. We map these to single sentiment labels such that sentences which are clearly positive (pos >= 3 and neg < 3) are given the positive label, clearly negative sentences (pos < 3 and neg >= 3) the negative label, and clearly neutral sentences (3 < pos > 2 and 3 < neg > 2) the neutral label. We discard all other sentences, which finally leaves 6,334 annotated sentences.

Models
In order to gain an idea of what errors most models suffer from, we test three state-of-the-art models on the datasets. Additionally, we use a bag-of-words model, as it is a strong baseline for text classification. For the SINGLE setup, we train all models on the training and development data for each dataset and test on the corresponding test set, therefore avoiding domain problems.
BERT The BERT model (Devlin et al., 2018) is a bidirectional transformer that is pretrained on two tasks: 1) a cloze-like language modeling task and 2) a binary next-sentence prediction task. It is pretrained on 3,300 million words from the BooksCorpus (Zhu et al., 2015) and English Wikipedia. We fine-tune the available pretrained model on each sentiment dataset.
ELMo We use the bi-attentive classification network from Peters et al. (2018). The network uses word embeddings as well as character-based embeddings created by a character-level CNN-BiLSTM network. The word representations are first passed through a feedforward layer, and then through a sequence-to-sequence network with biattention. This new representation of the text is combined with the original representation and passed through another sequence-to-sequence network. Finally, a max, min, mean and self-attention pool representation is created from this last sequence. For classification, these features are sent to a maxout layer.
BiLSTM Bidirectional long short-term memory (BiLSTM) networks have been shown to be strong baselines for sentiment tasks (Tai et al., 2015; Barnes et al., 2017). We implement a single-layered BiLSTM which takes pretrained skipgram embeddings as input, creates a sentence representation by concatenating the final hidden layer of both left and right LSTMs, and then passes this representation to a softmax layer for classification. Additionally, dropout serves as a regularizer.
Bag-of-Words classifier Finally, bag-of-words classifiers are strong baselines for sentiment and, when combined with other features, can still give state-of-the-art results for sentiment tasks (Mohammad et al., 2013). Therefore, we train a linear SVM on a bag-of-words representation of the training sentences.
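As an illustration of the bag-of-words representation mentioned above, the following minimal sketch builds count vectors from whitespace-tokenized, lowercased sentences. The tokenization, the function names build_vocabulary and bag_of_words, and the toy sentences are assumptions made for illustration only; the paper does not specify its preprocessing, and the linear SVM that would consume these vectors is not shown.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Assign a column index to every lowercased token seen in training."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(sentence, vocab):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Toy training sentences (hypothetical, not taken from any of the datasets).
train = ["a great great film", "a dull film"]
vocab = build_vocabulary(train)
features = [bag_of_words(s, vocab) for s in train]
```

Because word order is discarded, phenomena such as negation scope are invisible to this representation unless n-gram features are added.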

Model performance
Table 2 shows the accuracy of the models on the six tasks. The two models that use pretrained language models (ELMo and BERT) perform best, with an average difference of 11.8 percentage points between the language model classifiers and the standard models (BOW and BiLSTM). The error rates range between 8.3 on OpeNER and 20.5 on SST (see Table 3), indicating that there are differences in the difficulty of the datasets due to domain and annotation characteristics.
Additional experiments on a MERGED setup, where the labels from OpeNER and SST are mapped to the three-class setup and a single model is trained on the concatenation of the training sets from all datasets, indicate that no clear performance gain is achieved. We therefore prefer to avoid the problem of domain differences and keep only the original results.
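The label mapping behind the MERGED setup can be sketched as follows. The paper does not state how strong labels are mapped, so this sketch assumes that they are collapsed into their polarity; the names TO_THREE_CLASS and merge_datasets and the toy examples are hypothetical.

```python
# Assumption: strong labels are collapsed into the corresponding polarity.
TO_THREE_CLASS = {
    "strong positive": "positive",
    "positive": "positive",
    "neutral": "neutral",
    "negative": "negative",
    "strong negative": "negative",
}


def merge_datasets(datasets):
    """Concatenate (sentence, label) pairs from all datasets with three-class labels."""
    merged = []
    for examples in datasets.values():
        for sentence, label in examples:
            merged.append((sentence, TO_THREE_CLASS[label]))
    return merged


# Toy input (hypothetical sentences, not taken from the datasets).
datasets = {
    "opener": [("lovely hotel", "strong positive"), ("dirty room", "negative")],
    "mpqa": [("the vote was held", "neutral")],
}
merged = merge_datasets(datasets)
```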

Challenging Dataset
We create a challenging dataset by collecting the subset of test sentences that all of the sentiment systems predicted incorrectly (statistics are shown in Table 3). After removing sentences with incorrect gold labels, there are a total of 1,113 sentences in the dataset, with a similar number of positive, neutral, and negative labels and fewer strong labels. This is expected, as only two datasets have strong labels.
Furthermore, the main sources of examples are the SemEval task (249), the Stanford Sentiment Treebank (452) and the Thelwall dataset (215), while the Täckström dataset (129), MPQA (39) and OpeNER (29) contribute much less. This is a result of both dataset size and difficulty.
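The selection criterion described above, keeping only test sentences that every system misclassifies, can be sketched in a few lines. The function name challenge_subset, the model keys, and the toy predictions are illustrative assumptions, not the released code.

```python
def challenge_subset(gold, predictions):
    """Return indices of test sentences that every model predicts incorrectly.

    gold: list of gold labels.
    predictions: dict mapping a model name to its list of predicted labels.
    """
    return [
        i
        for i, label in enumerate(gold)
        if all(preds[i] != label for preds in predictions.values())
    ]


# Toy predictions for four test sentences (hypothetical values).
gold = ["positive", "negative", "neutral", "positive"]
predictions = {
    "bert": ["positive", "positive", "positive", "negative"],
    "elmo": ["positive", "negative", "negative", "negative"],
    "bilstm": ["negative", "positive", "positive", "neutral"],
    "bow": ["negative", "positive", "positive", "negative"],
}
hard_indices = challenge_subset(gold, predictions)
```

In this toy example only the last two sentences are kept, since at least one model is correct on each of the first two.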

Dataset Analysis
In order to give a clearer view of the data found in the dataset, we annotate these instances using 19 linguistic and paralinguistic labels. While most of these come from previous attempts to qualitatively analyze sentiment classifiers (Hu and Liu, 2004; Das and Chen, 2007; Pang and Lee, 2008; Socher et al., 2013; Barnes et al., 2018), others (incorrect label, no sentiment, morphology) emerged during the error annotation process. We further chose to manually annotate the polarity of each sentence irrespective of the gold label in order to be able to locate possible annotation errors during our analysis. The annotation scheme and (manually constructed) examples of each label are shown in Table 6. Note that we did not limit the number of labels that the annotator could assign to each sentence; in principle, all suitable labels should be assigned during annotation. An initial analysis of the errors shown in Table 5 and Figure 1 reveals that the most common errors come from the no-sentiment (214), mixed (185), non-standard spelling and hashtags (180), desirable elements (144), and strong label (122) categories.
The distribution of errors across labels (strong negative: 106, negative: 299, neutral: 303, positive: 296, strong positive: 109) compared to the gold distribution (strong negative: 294, negative: 1742, neutral: 2249, positive: 2402, strong positive: 475) shows that strong negative is the most difficult and least common class, while positive is the easiest to classify. In the following we briefly discuss the error categories, also showing examples for each.
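The comparison between the two distributions can be made explicit by dividing the number of errors per label by the number of gold instances of that label. The short sketch below uses only the counts reported in the paragraph above; the variable names are illustrative.

```python
# Counts as reported in the text.
errors = {
    "strong negative": 106,
    "negative": 299,
    "neutral": 303,
    "positive": 296,
    "strong positive": 109,
}
gold = {
    "strong negative": 294,
    "negative": 1742,
    "neutral": 2249,
    "positive": 2402,
    "strong positive": 475,
}

# Fraction of each gold class that ends up in the challenging dataset.
rates = {label: errors[label] / gold[label] for label in errors}

for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:16s} {100 * rate:5.1f}%")
```

Under this reading, roughly 36% of the strong negative sentences are misclassified by all models, compared with about 12% of the positive sentences.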

Mixed Polarity
The largest set of errors, with 185 sentences labeled, are what we refer to as "mixed" polarity sentences. These are sentences where two differing polarities are expressed, either towards two separate entities, or towards the same entity. While the first can be solved by a more fine-grained approach (aspect-level or targeted sentiment), the second is more difficult and is often considered a category of its own (Shamma et al., 2009; Saif et al., 2013; Kenyon-Dean et al., 2018). An analysis of the mixed category errors reveals that while most of the examples are in the "neutral" category (45%), the other 55% are annotated as having mostly positive or negative sentiment. This is a confusing situation for both annotators and sentiment classifiers, and a direct product of performing sentence-level classification rather than aspect-level classification. Nearly a third of the errors contain "but" clauses, which could be correctly classified by splitting them.
A more problematic situation is found among nearly 20% of the examples (34), where the annotator found the original label to be completely incorrect.
Non-standard spelling Most errors in this category (180 total) are labeled either negative (49%) or positive (29%), with almost no strong positive or strong negative, which comes mainly from the fact that the noisier datasets do not contain the strong labels.
Around a third of the examples contain hashtags that clearly express the sentiment of the whole sentence, e.g., "#imtiredof this SNOW and COLD weather!!!". This indicates the need to properly deal with hashtags in order to correctly classify sentiment.
Idioms Sentences labeled as containing idioms are spread relatively uniformly across labels. Learning these correctly from sentence-level annotations is unlikely, especially because they are seldom found repeatedly, even in a training corpus of decent size. Therefore, incorporating idiomatic information from external data sources may be necessary to improve the classification of sentences within this category.
Strong Labels This category (122 total) is particularly difficult for sentiment classifiers for several reasons. First, strong negative sentiment is often expressed in an understated or ironic manner, for example, "Better at putting you to sleep than a sound machine." For strong positive examples in the dataset, there is often difficult vocabulary and morphologically creative use of language, e.g., "It is a kickass , dense sci-fi action thriller hybrid that delivers and then some.", while strong negative examples often contain sarcasm or non-standard spelling, e.g., "All prints of this film should be sent to and buried on Pluto."
Negation Negation, which accounts for 97 errors, directly affects the classification of polar sentences (Wiegand et al., 2010). Therefore, we look at the differences between correctly and incorrectly classified sentences containing negation by analyzing 100 correctly and incorrectly classified sentences containing negation.
From our analysis, there is no specific negator that is more difficult to resolve regarding its effect on sentiment classification.
We also perform an analysis of negation scope under the assumption that when a negator occurs farther from its negated element, it is more difficult for the sentiment classifier to correctly resolve the negation. Let d be the distance between the negator n and the relevant sentiment element se, such that d = |ind(se) − ind(n)|, where the function ind returns the index of a token in a sentence. We find that the incorrectly classified examples have an average d of 2.7, while the correctly classified examples have an average d of 2.5. This seems to rule out a problem of negation scope as the underlying difference.
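The distance d defined above can be computed directly from token indices. The sketch below assumes whitespace-tokenized sentences in which the negator and the sentiment element have already been identified by hand and takes the first occurrence of each; the function names and the two example sentences are illustrative (the second is adapted from the clausal negation example in the next paragraph).

```python
def negation_distance(tokens, negator, sentiment_element):
    """d = |ind(se) - ind(n)|, using the first occurrence of each token."""
    return abs(tokens.index(sentiment_element) - tokens.index(negator))


def average_distance(examples):
    """Mean d over (tokens, negator, sentiment_element) triples."""
    distances = [negation_distance(t, n, se) for t, n, se in examples]
    return sum(distances) / len(distances)


# Illustrative examples; negator and sentiment element are given by hand.
examples = [
    ("the film is not good".split(), "not", "good"),
    ("i do n't think it is a particularly interesting film".split(),
     "n't", "interesting"),
]
avg_d = average_distance(examples)
```

Here the first sentence has d = 1 and the second d = 6, which illustrates how clausal negation tends to place the negator further from the sentiment element.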
High-level or clausal negation occurs when the negator negates a full clause, rather than an adjective or noun phrase, e.g., "I don't think it is a particularly interesting film". In the dataset this phenomenon is more prevalent in the incorrectly classified examples (8%) than in the correctly classified examples (3%), but does not occur often in absolute terms.
The main source of difference regarding correctly classifying examples involving negation seems to be irrelevant negation. Irrelevant negation refers to cases where a sentence contains a negation but where the sentiment-bearing expression is not within the scope of negation. In our data, there is a strong difference in the distribution of irrelevant negation in correctly and incorrectly classified examples (80% vs. 25%, respectively), suggesting that sentiment classifiers learn to ignore most occurrences of negation.
World Knowledge Examples from the dataset where world knowledge is necessary to correctly classify a sentence (81 sentences) include comparisons with entities commonly associated with positive or negative polarity, e.g., "Elicits more groans from the audience than Jar Jar Binks, Scrappy Doo and Scooby Dumb, all wrapped up into one.", analogies, e.g., "Adam Sandler is to Gary Cooper what a gnat is to a racehorse.", or rating scales, e.g., "10/10 overall".
This category is also highly correlated with sarcasm and irony. In fact, irony is often defined as "violating expectations" (Hao and Veale, 2010), which presupposes that we possess world knowledge containing expectations of a situation.
Amplified Amplifiers occur mainly in negative and strong positive examples, such as "It's an awfully derivative story." Most of the amplified sentences found in the dataset (71/79) contain amplifiers other than "very", such as "super", "incredibly", or "so".
Comparative Comparative sentiment, with 68 errors, is known to be difficult (Hu and Liu, 2004; Liu, 2012), as it is necessary to determine which entity is on which side of the inequality. Sentences like "Will probably stay in the shadow of its two older, more accessible Qatsi siblings" are difficult for sentiment classifiers that do not model this phenomenon explicitly.
Sarcasm/Irony Sarcasm and irony (58 errors), which are often treated separately from sentiment analysis (Filatova, 2012; Barbieri et al., 2014), are present mainly in negative and strong negative examples in the dataset. Correctly capturing sarcasm and irony is necessary to classify some negative and strong negative examples, e.g., "If Melville is creatively a great whale, this film is canned tuna."
Shifters Shifters (50 errors), such as "abandon", "lessen", or "reject", are less common within the dataset, but normally move positive polarity words towards a more negative sentiment. The most common shifter is the word "miss", used as in "We miss the quirky amazement that used to come along for an integral part of the ride."
Reduced Reducers tend to lead to positive or neutral sentiment, e.g., "It was a lot less hassle."

Case Study: Training with phrase-level annotations
As a case study for the usage of the dataset presented here, we evaluate a model that has access to more compositional information. Besides having sentence-level annotations, the SST dataset also contains annotations for each phrase in a constituency tree, which gives considerably more training data, specifically 155,019 annotated phrases vs. 8,544 annotated sentences. It has been claimed that this data allows models to learn more compositionality (Socher et al., 2013). Therefore, we fine-tune the best performing model (BERT) on this data and test on our dataset. The BERT model trained on phrases achieves 55.1 accuracy on the SST dataset, versus 53.0 for the model trained only on sentence-level annotations.
Table 7 shows that the model trained on the SST phrases performs much better overall on the dataset than the model trained on SST sentences. Using the error annotations in the challenge dataset, we find that results improve greatly on the sentences which contain the labels negation, world knowledge, amplified, emoji, and reduced, while the model performs worse on irony and shifters and equally on morphology. This analysis seems to indicate that phrase-level annotations help primarily with learning compositional sentiment (negation, amplified, reduced), while other phenomena, such as irony or morphology, do not receive improvements. This supports the claim that training on the phrase-level annotations improves a sentiment model's ability to classify compositional sentiment, while also demonstrating the usefulness of our dataset for introspection.
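The kind of probing used in this case study can be sketched as a per-category accuracy comparison between two models. The data layout (one set of error labels per sentence), the function names, and the toy predictions below are assumptions for illustration; the improvement is computed here as a simple difference in accuracy, which may differ from how the last column of Table 7 is defined.

```python
from collections import defaultdict


def per_category_accuracy(gold, predicted, categories):
    """Accuracy per error label; categories[i] is the set of labels on sentence i."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p, cats in zip(gold, predicted, categories):
        for cat in cats:
            total[cat] += 1
            correct[cat] += int(g == p)
    return {cat: correct[cat] / total[cat] for cat in total}


def compare_models(gold, preds_a, preds_b, categories):
    """Return {category: (accuracy of A, accuracy of B, B minus A)}."""
    acc_a = per_category_accuracy(gold, preds_a, categories)
    acc_b = per_category_accuracy(gold, preds_b, categories)
    return {cat: (acc_a[cat], acc_b[cat], acc_b[cat] - acc_a[cat]) for cat in acc_a}


# Toy data (hypothetical): four annotated sentences and two models.
gold = ["negative", "positive", "negative", "neutral"]
sentence_model = ["positive", "positive", "positive", "negative"]
phrase_model = ["negative", "positive", "positive", "negative"]
categories = [{"negation"}, {"negation", "amplified"}, {"irony"}, {"amplified"}]

report = compare_models(gold, sentence_model, phrase_model, categories)
```

Since a sentence can carry several labels, it contributes to the accuracy of every category it is annotated with, so the per-category counts do not sum to the number of sentences.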

Conclusion and Future Work
In this paper, we tested three state-of-the-art sentiment classifiers and a baseline bag-of-words classifier on six English sentence-level sentiment datasets. We gathered the sentences that all methods misclassified in order to create a dataset. Additionally, we performed a fine-grained annotation of error types in order to provide insight into the kinds of problems sentiment classifiers have. We will release both the code and the annotated data with the hope that future research will utilize this resource to probe sentiment classifiers for qualitative differences, rather than rely only on quantitative scores, which often obscure the plentiful challenges that still exist.
Many of the phenomena found in the dataset, e.g., negation or modality, have been discussed in depth by Liu (2012). However, the dataset that resulted from this work demonstrates that modern neural methods still fail on many examples of these phenomena. Additionally, our dataset enables a quick analysis of qualitative differences between models, probing their performance with respect to the linguistic and paralinguistic categories of errors.
Additionally, many of the findings from this paper are likely to vary to a degree for other languages, due to typological differences, as well as differences in available training data.The annotation method proposed in this paper, however, should enable the creation of similar analyses and datasets in other languages.
We expect that this approach to creating a dataset is also easily transferable to other tasks which are affected by linguistic or paralinguistic phenomena, such as hate speech detection or sarcasm detection.It would be more useful to have some knowledge of the phenomena that could affect the task beforehand, but a careful error analysis can also lead to insights which can be translated into annotation labels.
Regarding ways of moving forward, there are already many sources of data for the linguistic phenomena we have analyzed in this work, ranging from datasets annotated for negation (Morante and Blanco, 2012; Liu et al., 2018), irony (Van Hee et al., 2018), and emoji (Barbieri et al., 2018) to datasets for idioms (Muzny and Zettlemoyer, 2013) and their relationship with sentiment (Jochim et al., 2018). We believe that discovering ways to explicitly incorporate this available information into state-of-the-art sentiment models may provide a way to improve current approaches. Multi-task learning (Caruana, 1993) and transfer learning (Peters et al., 2018; Devlin et al., 2018; Howard and Ruder, 2018) have shown promise in this respect, but have not been exploited for improving sentiment classification with regard to these specific phenomena.

Figure 1: Distribution of labels across error categories.

Table 1: Statistics for the sentence-level annotations in each dataset.

Table 2: Accuracy of models on the sentiment datasets, where a different classifier is trained for each dataset.

Table 3: Statistics of the dataset, including the number of sentences from each dataset and for each label, the percentage of the original dataset kept in the dataset, and the average length (in tokens) of sentences.

Table 5: Number of labels for each category in the annotation study. Bold numbers indicate the five most frequent sources of errors. The total number of labels does not sum to the number of sentences in the dataset, as each sentence can have multiple labels.

Table 6: Categories and examples for error annotation guidelines.

Table 7: Per-category accuracy and relative improvement (last column) of the BERT model trained on SST sentences (8,544) and SST phrases (155,019).