CrystalNest at SemEval-2017 Task 4: Using Sarcasm Detection for Enhancing Sentiment Classification and Quantification

This paper describes a system developed for the SemEval-2017 shared sentiment analysis task and its subtasks. A key feature of our system is its embedded ability to detect sarcasm in order to enhance the performance of sentiment classification. We first constructed an affect-cognition-sociolinguistics sarcasm feature model and trained an SVM-based classifier for detecting sarcastic expressions in general tweets. For sentiment prediction, we developed CrystalNest, a two-level cascade classification system whose features combine the sarcasm score derived from our sarcasm classifier, sentiment scores from Alchemy, the NRC lexicon, n-grams, word embedding vectors, and part-of-speech features. We found that the features derived from sarcasm detection consistently benefited key sentiment analysis evaluation metrics, to different degrees, across the four subtasks A-D.


Introduction
Sentiment analysis, also known as opinion mining, is the study of the feelings and opinions expressed in user-generated content. Sarcasm detection, though closely related, is a different topic of interest. As a classification task, the primary objective of sentiment analysis is to determine whether a message is positive, negative, or neutral. In contrast, the objective of sarcasm detection is to determine whether a message is sarcastic or not sarcastic.
To illustrate, let us look at two short text examples. Example 1 expresses a positive sentiment which has a slight mixed feeling, but it is not sarcastic. A very similar-looking Example 2 is sarcastic, and its underlying sentiment is negative. In computational linguistics and NLP, detecting sarcasm is receiving increasing research interest (e.g., González-Ibáñez et al., 2011; Reyes et al., 2012; Liebrecht et al., 2013; Riloff et al., 2013; Rajadesingan et al., 2015; Bamman and Smith, 2015). While these studies recognized the linkage between sarcasm and sentiment and proposed various techniques for detecting sarcasm, none directly studied the impact of sarcasm detection on sentiment analysis. Maynard and Greenwood (2014) were among the first to explore how to use sarcasm-related information to improve sentiment analysis. They proposed a rule-based method involving five rules, such as using "#sarcasm" to flip a sentiment from positive to negative. However, their evaluation was performed on a relatively small test dataset of 400 tweets.
We believe that sentiment analysis systems will benefit from a systematically embedded ability to detect sarcasm. In the following, we describe our approach and present supportive findings evaluated on a large set of test data provided by SemEval-2017 Task 4 (Rosenthal et al., 2017).
* Both authors contributed to this research equally. For correspondence, please contact yangyp@ihpc.a-star.edu.sg.

Sarcasm Detection: An Affect-Cognition-Sociolinguistics (ACS) Feature Model
In order to capture discriminative and explainable sarcasm features, we sought to design a feature model based on a review and synthesis of related studies in fields such as natural language processing, linguistics, psychology, speech and communication, as well as neuroscience. Our model characterizes sarcasm with three key feature groups: affect-related, cognition-related, and sociolinguistics-related features. Figure 1 presents an overview of the proposed sarcasm detection method, which we name "Crystalace". Crystalace subsequently produces a key feature, i.e., the sarcasm score, for the final CrystalNest sentiment analysis system. Crystalace's core processing layer is the affect-cognition-sociolinguistics sarcasm feature model (sections 2.1-2.4). Crystalace also includes a supporting layer that pre-processes raw text into crystallized text (section 2.5) for effective feature extraction.
Figure 1: The Key Components of the Crystalace Sarcasm Detection Method

Affect-related features
A fundamental understanding of sarcasm is that it involves a negative emotional connotation conveyed through a seemingly positive expression (Brant, 2012). Riloff et al. (2013) suggested that the counts of positive and negative words, as well as their location and order, are useful features in sarcasm detection. Rajadesingan et al. (2015) further used the strength of positive and negative words and found that strength-related features (e.g., the count of very positive words in a tweet) were among the top ten sarcasm features in their study.
In our model, beyond the valence and strength-related features, we propose to incorporate the intensity aspect of affective expressions. Conceptually, psychologists have characterized emotion with two fundamental dimensions: the strength dimension (Osgood et al., 1957 called it "evaluation"), in which an expression has a positive or negative meaning that is strong, moderate or weak, and the intensity dimension (Shaver et al., 1987), which further concerns what Osgood et al. called motivational "potency" and physical "activity". With the intensity dimension, anger-based expressions (high in potency), for example, can be differentiated from sadness-based expressions (low in potency). Because sarcasm is characterized by an underlying emotional connotation (Brant, 2012), it is conceivable that expressers would tend to use seemingly positive emotion words, such as joy or gratitude words, to imply underlying negative mental experiences such as contempt or disapproval. Thus, in addition to the strength dimension, we explore capturing variance in emotional intensity to further differentiate sarcastic from non-sarcastic expressions.
Beyond words, Twitter users often use capitalization and special punctuation to highlight their affective experiences, which can be useful cues to sarcasm. For example, users tend to capitalize certain letters to express strong feelings. Others may use repeated exclamation marks ("!!!"). Therefore, we treat these capitalization and punctuation patterns as affect-related features. Lastly, we include the percentage of first-person singular pronouns (I, me, mine, etc.) as a feature, as research in linguistic psychology has indicated that such words give an expresser power to make an emotional connection with the audience (Cohen, 2014).
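The surface-level affect cues described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the extraction code used in the system: the function name, the pronoun list beyond "I, me, mine", and the choice of alphabetic characters as the denominator for the uppercase percentage are all hypothetical.

```python
import re

# Assumption: the source lists "I, me, mine etc."; "my" and "myself" are added here.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def surface_affect_features(tweet: str) -> dict:
    """Compute three illustrative affect-related surface features for one tweet."""
    letters = [c for c in tweet if c.isalpha()]
    # Assumption: percentage is taken over alphabetic characters only.
    uppercase_pct = 100.0 * sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    # Count runs of two or more consecutive exclamation marks, e.g. "!!!".
    exclamation_runs = len(re.findall(r"!{2,}", tweet))

    tokens = re.findall(r"[A-Za-z']+", tweet.lower())
    first_person_pct = (
        100.0 * sum(t in FIRST_PERSON_SINGULAR for t in tokens) / len(tokens) if tokens else 0.0
    )
    return {
        "uppercase_pct": uppercase_pct,
        "exclamation_runs": exclamation_runs,
        "first_person_pct": first_person_pct,
    }


features = surface_affect_features("I LOVE Mondays!!! So much fun for me")
```

The function returns a plain dictionary so that these values can be concatenated with the lexicon-based affect features.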

Cognition-related features
Besides affect, sarcasm is also significantly associated with cognitive processes. As Haiman (1998) puts it, what is essential to sarcasm is that it is "overt irony intentionally used by the speaker as a form of verbal aggression". Neuropsychological studies have also indicated that damage to certain cognitive functions of the brain impairs people's ability to recognize sarcasm (Shamay-Tsoory et al., 2005; Davis et al., 2016). Because sarcasm is intentional, constructing it requires a degree of deliberation. Thus, a sarcastic tweet is likely to exhibit a high degree of lexical complexity and to have been written by an individual with high cognitive complexity. Conversely, an individual with low cognitive complexity would tend to communicate feelings more straightforwardly.
In linguistics, certain words have been found to reveal "depth of thinking" (Tausczik and Pennebaker, 2009). These include cognitive process words (e.g., because), conjunctions (e.g., although), prepositions (e.g., to) and words longer than six letters. In addition, psycholinguistic analysis of tweets has suggested that a well-prepared and well-constructed tweet is correlated with higher lexical density, which is marked by information-carrying words (Hu et al., 2013). Therefore, we include nouns, negation, verbs, adjectives, numbers, and quantifiers, which are information-carrying words, in this feature category.

Sociolinguistics-related features
In verbal communication, average pitch, pitch slope, and laughter or responses to questions have been found to be prosodic cues to sarcastic utterances (Tepperman et al., 2006). On online digital platforms such as Twitter, users do not have facial and vocal cues at their disposal to communicate sarcastic expressions (Burgers, 2010). Consequently, they find alternative and "creative" ways to signal sarcasm to their intended audiences. Users may use hashtags to highlight a specific key phrase for easy search by others, use at-mentions to bring attention to a specific user, or use emoticons to provide cues to their underlying feelings. Therefore, we incorporate user-created hashtags, at-mentions, URLs and emoticons in our feature model.
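As a rough illustration of how such Twitter-specific cues might be counted, the sketch below uses regular expressions and a small emoticon set. It is an assumption-laden example: the function name, the patterns, and the emoticon list (taken from examples in the feature appendix) are hypothetical, and it covers only the four cue types named above.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")
# Assumption: a tiny illustrative emoticon set; a real system would use a larger list.
EMOTICONS = {":)", ":(", ":/", ":P", "=D", ":d", "D:"}


def sociolinguistic_counts(tweet: str) -> dict:
    """Count hashtags, at-mentions, URLs and emoticons in one tweet."""
    urls = URL_RE.findall(tweet)
    # Remove URLs first so that '#' fragments inside links are not counted as hashtags.
    text = URL_RE.sub(" ", tweet)
    return {
        "hashtags": len(HASHTAG_RE.findall(text)),
        "mentions": len(MENTION_RE.findall(text)),
        "urls": len(urls),
        "emoticons": sum(tok in EMOTICONS for tok in text.split()),
    }


counts = sociolinguistic_counts("@bob great service again :/ #thanks #not http://t.co/x#y")
```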

Features Extraction
In total, our proposed sarcasm feature model includes 82 features. The affect-related features comprise 50 valence-based, strength-based, intensity-based, and other indirect affective features. The cognition-related features comprise 26 depth-of-thinking features (e.g., prep, conj). The sociolinguistics-related features comprise 6 Twitter-specific contextual cue features (e.g., #, @).
Appendix A shows the full list of the 82 features, the feature codes and the respective linguistic resources/tools used for feature extraction.

Tweets Preprocessing
To support effective feature extraction, we designed a procedure to pre-process raw tweets. The first step is hashtag segmentation (Davidov et al., 2010), which involves tokenizing each hashtag such that the words can be more readily captured by existing lexical sources (e.g., #shitnooneeversay becomes shit no one ever say). The second step is misspelt word correction, which shortens any run of more than two identical consecutive letters to two letters (e.g., greaaat becomes greaat, awwww becomes aww), such that intentionally misspelt words are standardized for the subsequent step. The third step is expression substitution. Even after the first two steps, many tweets could still contain a great variety of unusual expressions. Therefore, we constructed a list that maps such expressions to more common words or phrases that carry a similar meaning, referencing Internet resources such as Urban Dictionary and Wikipedia. For example, gonna becomes going to, :/ becomes annoyed, aww becomes sweet, classier becomes excellent, rainy becomes bad weather, and sneezing becomes poor health. Note that we do not remove stop words: although removing stop words helps in classic NLP tasks, it has been found to harm sentiment analysis performance (Saif et al., 2014).
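The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the system's implementation: the substitution entries are the examples given above, while the tiny vocabulary, the function names, and the fewest-words dynamic-programming segmenter are assumptions introduced for the example.

```python
import re

# Substitution examples taken from the text; a real mapping would be much larger.
SUBSTITUTIONS = {
    "gonna": "going to", ":/": "annoyed", "aww": "sweet",
    "classier": "excellent", "rainy": "bad weather", "sneezing": "poor health",
}
# Hypothetical toy vocabulary used only to segment the example hashtag.
VOCAB = {"shit", "no", "one", "ever", "say"}


def segment_hashtag(body: str, vocab: set) -> str:
    """Split a hashtag body into known words, preferring the fewest words."""
    body = body.lower()
    best = [None] * (len(body) + 1)  # best[i]: word list covering body[:i]
    best[0] = []
    for end in range(1, len(body) + 1):
        for start in range(end):
            word = body[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsegmented body when no full segmentation exists.
    return " ".join(best[-1]) if best[-1] is not None else body


def preprocess(tweet: str, vocab: set = VOCAB) -> str:
    # Step 1: hashtag segmentation (the '#' is dropped).
    text = re.sub(r"#(\w+)", lambda m: segment_hashtag(m.group(1), vocab), tweet)
    # Step 2: shorten runs of three or more identical letters to two.
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)
    # Step 3: substitute unusual expressions token by token.
    return " ".join(SUBSTITUTIONS.get(tok.lower(), tok) for tok in text.split())


print(preprocess("gonna be greaaat awwww :/ #shitnooneeversay"))
```

Step 2 is restricted to letters so that repeated punctuation such as "!!!" remains available as an affect cue.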

Sarcasm Classifier
To train and evaluate our sarcasm classifier, we downloaded the annotated tweets dataset from Riloff et al. (2013), pre-processed the tweets, and trained a linear SVM classifier using our ACS-based feature model. Similar to the final condition reported in Riloff et al. (2013), we also added unigram and bigram features to complement the theoretical feature model. We then ran 10-fold cross-validation to evaluate our method's performance. The results in Table 1 show that our ACS-based method obtained an F1-score of .60, an improvement of .09 over the best condition reported in Riloff et al.'s original study. Based on these results, we trained the final Crystalace sarcasm classifier using the full dataset.

System Description
Our sarcasm detection-enhanced sentiment analysis system, CrystalNest, is designed with five feature groups and a cascade classifier with two levels of training. The following sections provide the development details.

Sarcasm and Sentiment Features
We used our Crystalace sarcasm classifier and the Alchemy Language API to form a two-dimensional feature vector. Alchemy Language is a component of the cognitive APIs offered on IBM Watson Developer Cloud. The first dimension of this feature vector contains the confidence score obtained from the sarcasm classifier, and the second dimension contains the confidence score obtained by calling Alchemy.

NRC SemEval-2015 English Twitter Lexicons Features
We also leveraged the NRC SemEval-2015 English Twitter Sentiment Lexicons 8, which aim to capture the degree of positiveness of a given word or phrase (Rosenthal et al., 2015), and a list of negator 9 words to extract a six-dimensional feature vector for each tweet. This feature vector contains the counts of positive, negative, neutral, and negator words, respectively, as well as the maximum and minimum sentiment strengths for a given tweet.
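A six-dimensional vector of this kind could be computed roughly as in the sketch below. The toy lexicon, its scores, the negator list, and the rule that any non-negator token with a zero or missing score counts as neutral are all assumptions made for illustration; they are not taken from the NRC resource or from the system.

```python
def lexicon_features(tokens, lexicon, negators):
    """Return [n_positive, n_negative, n_neutral, n_negators, max_score, min_score]."""
    # Assumption: tokens missing from the lexicon receive a score of 0.0 (neutral).
    scores = [lexicon.get(tok, 0.0) for tok in tokens if tok not in negators]
    n_positive = sum(s > 0 for s in scores)
    n_negative = sum(s < 0 for s in scores)
    n_neutral = sum(s == 0 for s in scores)
    n_negators = sum(tok in negators for tok in tokens)
    return [
        n_positive, n_negative, n_neutral, n_negators,
        max(scores, default=0.0), min(scores, default=0.0),
    ]


# Hypothetical toy lexicon and negator list.
toy_lexicon = {"love": 0.8, "awful": -0.9, "rain": -0.2}
toy_negators = {"not", "never", "no"}
vector = lexicon_features("i do not love this awful rain".split(), toy_lexicon, toy_negators)
```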

N-grams Features
N-grams are a common feature used for sentiment analysis. We extracted unigrams and bigrams from each tweet without removing stop words.
To build the n-gram dictionary, we downloaded 25,000 general tweets using Twitter's Streaming API and extracted all possible unigrams and bigrams from those tweets. After extraction, we filtered these unigrams and bigrams based on their occurrences and removed all that appeared fewer than three times in our tweets dataset. We then used this n-gram dictionary to represent a tweet in a feature space where each feature dimension holds the number of occurrences of the corresponding n-gram in the tweet.
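The dictionary construction and count-based representation described above can be sketched as follows. Whitespace tokenization, lowercasing, the function names, and counting total occurrences (rather than the number of tweets containing an n-gram) are assumptions of this example.

```python
from collections import Counter


def ngrams(tokens):
    """Return all unigrams and bigrams of a token list as tuples."""
    return [(tok,) for tok in tokens] + list(zip(tokens, tokens[1:]))


def build_dictionary(corpus, min_count=3):
    """Map each n-gram seen at least min_count times to a feature index."""
    counts = Counter(g for tweet in corpus for g in ngrams(tweet.lower().split()))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {g: i for i, g in enumerate(kept)}


def vectorize(tweet, dictionary):
    """Represent a tweet as per-n-gram occurrence counts."""
    vec = [0] * len(dictionary)
    for g in ngrams(tweet.lower().split()):
        idx = dictionary.get(g)
        if idx is not None:
            vec[idx] += 1
    return vec


corpus = ["so good", "so good today", "so good so bad", "not bad"]
dictionary = build_dictionary(corpus)      # keeps ("good",), ("so",), ("so", "good")
vector = vectorize("so so good", dictionary)
```

Stop words are deliberately left in, which matches the choice stated in the text.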

Word Embedding Features
Word embedding has been used in recent Twitter sentiment analysis methods (Zhang et al., 2015; Rouvier and Favre, 2016) due to its ability to represent the semantic and syntactic meaning of a word as a low-dimensional feature vector. Here, we used Gensim-based Sentence2Vec to convert the tweets into 500-dimensional feature vectors. To train the word-embedding model, we downloaded approximately 8 million general tweets from Twitter using the Twitter Streaming API.
8 http://saifmohammad.com/WebPages/lexicons.html
9 http://dictionary.cambridge.org/grammar/britishgrammar/questions-and-negative-sentences/negation and https://www.grammarly.com/handbook/sentences/negatives/1/negatives/

Tweet Part-of-Speech (POS) Features
Lastly, we extracted 25-dimensional part-of-speech (Owoputi et al., 2013) features for each tweet without any preprocessing, as the TweetPOS tool has been specially designed to capture tweet-specific linguistic elements. These features help to capture cues such as tweet-specific linguistic counts, punctuation, as well as conversational markers including hashtags, at-mentions, emoticons and URLs.

Cascade Sentiment Classifier
For our final system, we used a cascade classification approach to predict the sentiment outcome. Before extracting the features, tweets are preprocessed as described in Section 2.5. For each of the five feature groups described in sections 3.1-3.5, we used a linear SVM to train three different classifiers using a one-against-all approach for the positive, negative and neutral classes. For each of these classifiers (first-level classification), we used SemEval-2013 training data for training and SemEval-2016 and SemEval-2017 test tweets for final evaluation.
After obtaining the outputs from all three classifiers of each feature group, we formed a 15-dimensional feature vector and used a Naive Bayes classifier as the final classifier. For this final classifier (second-level classification), we used SemEval-2016 test data for training and SemEval-2017 test data for final evaluation.
For the topic-based tweet quantification subtask D, we calibrated CrystalNest using a dynamic base-sentiment selection approach, as there was no clear prior knowledge to determine whether topic-specific information would benefit or harm quantification performance. We first obtained two sets of sentiment scores (sentiment_general and sentiment_topic) by using Alchemy to compute each individual tweet's sentiment score without and with the specific topic information, respectively. When sentiment_general and sentiment_topic agreed on the same polarity, we used that consensus polarity. When sentiment_general and sentiment_topic produced conflicting polarities for a given tweet, we assigned that tweet the "majority voted" polarity of the other tweets under the same topic. Using this dynamic approach, we found that the error terms were reduced compared to those resulting from relying on either sentiment_general or sentiment_topic alone as the base sentiment feature.
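To make the shape of the second-level input concrete, the sketch below assembles a 15-dimensional vector from five feature groups and three one-against-all scorers per group. It is only an illustration: the group names, the use of linear decision scores as first-level outputs, and the helper functions are assumptions, and the second-level Naive Bayes training itself is not shown.

```python
FEATURE_GROUPS = ["sarcasm_sentiment", "nrc_lexicon", "ngrams", "embedding", "pos"]
CLASSES = ["positive", "negative", "neutral"]


def make_linear_scorer(weights, bias):
    """Return a function computing a linear decision score w.x + b."""
    def score(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return score


def stack_first_level(tweet_features, scorers):
    """Concatenate the first-level outputs into one second-level feature vector.

    tweet_features: feature group name -> feature vector for one tweet
    scorers: (feature group, class) -> callable giving a one-against-all score
    """
    stacked = []
    for group in FEATURE_GROUPS:
        for cls in CLASSES:
            stacked.append(scorers[(group, cls)](tweet_features[group]))
    return stacked


# Hypothetical weights purely to exercise the stacking step.
scorers = {
    (g, c): make_linear_scorer([float(i), float(j)], 0.5)
    for i, g in enumerate(FEATURE_GROUPS)
    for j, c in enumerate(CLASSES)
}
features = {g: [1.0, 2.0] for g in FEATURE_GROUPS}
second_level_input = stack_first_level(features, scorers)  # length 15
```

In a full pipeline, vectors built this way for the second-level training tweets would be passed to a Naive Bayes learner.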
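The selection rule for subtask D can be sketched as below. Several details are assumptions, since the text does not specify them: the majority vote is taken over the consensus polarities of the other tweets in the same topic, and ties or topics without any consensus tweet fall back to the general (topic-free) polarity. The field and function names are hypothetical.

```python
from collections import Counter, defaultdict


def resolve_polarities(tweets):
    """Pick one polarity per tweet from its general and topic-aware polarities.

    tweets: list of dicts with keys 'topic', 'general', 'topic_aware'.
    """
    # Collect consensus votes per topic from tweets whose two polarities agree.
    votes_by_topic = defaultdict(Counter)
    for t in tweets:
        if t["general"] == t["topic_aware"]:
            votes_by_topic[t["topic"]][t["general"]] += 1

    resolved = []
    for t in tweets:
        if t["general"] == t["topic_aware"]:
            resolved.append(t["general"])  # consensus: keep it
            continue
        ranked = votes_by_topic.get(t["topic"], Counter()).most_common()
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        if ranked and not tie:
            resolved.append(ranked[0][0])  # majority vote within the topic
        else:
            resolved.append(t["general"])  # assumption: fallback when no clear majority
    return resolved


example = [
    {"topic": "x", "general": "positive", "topic_aware": "positive"},
    {"topic": "x", "general": "positive", "topic_aware": "positive"},
    {"topic": "x", "general": "negative", "topic_aware": "negative"},
    {"topic": "x", "general": "negative", "topic_aware": "positive"},
    {"topic": "y", "general": "negative", "topic_aware": "positive"},
]
labels = resolve_polarities(example)
```

Because conflicting tweets never contribute votes, a tweet's own conflicting polarities cannot influence the majority that is used to resolve it.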

Results
We evaluated the proposed approach using the official test datasets provided by SemEval-2017 Task 4's subtasks A-D. Tables 2-4 summarize the results. For subtasks A & B, recall and F1 scores are assessed as averaged scores according to the task organizers (see Rosenthal et al., 2017 for a detailed discussion of the evaluation metrics).

The test data provided by SemEval-2017 Task 4 is so far one of the largest annotated sentiment analysis test datasets. Subtask A consists of 12,284 annotated tweets, Subtasks B and D consist of 6,185 annotated tweets, and Subtask C consists of 12,379 annotated tweets. The results indicated that CrystalNest consistently outperformed the full-fledged, off-the-shelf sentiment analysis service offered by Alchemy. Furthermore, when we experimented with the subsystem combining only Alchemy and sarcasm features, the sarcasm classifier also improved on Alchemy's base sentiment features in subtasks A, B and D, most notably in the two two-point subtasks B and D. In comparison with other participating systems, CrystalNest obtained relatively good rankings in subtask A (#18 out of 37 systems), subtask B (#9 out of 23), subtask C (#6 out of 15) and subtask D (#4 out of 15).

Conclusion
This paper described a new sentiment analysis system featuring a sarcasm detection classifier in conjunction with other complementary features derived from Alchemy, NRC sentiment lexicon, n-grams, word embedding vectors, and part-of-speech features. The evaluation results using sentiment analysis subtasks A-D test data provided initial evidence on the value of embedding sarcasm detection in sentiment analysis systems. For future work, we plan to explore deep learning methods and conduct more experiments to further augment the system performance.

Appendix A
Feature description | Feature code
 | pos4SS
Count of 3-scored words (awesome, fantastic, great, wow*, joy*) | strengthp3SS
Count of 2-scored words (fun, glad, thank, nice*, brillian*) | strengthp2SS
Count of 1-scored words (ok, peace*) | strengthp1SS
Count of -1-scored words (dark, lost) | strengthn1SS
Count of -2-scored words (against, aloof) | strengthn2SS
Count of -3-scored words (envy*, foe*) | strengthn3SS
Count of -4-scored words (cry, fear) | strengthn4SS
Absolute value of highest positive strength score of words (e.g., 3 is returned if a tweet contains "excitement" and "amused", which have SentiStrength scores of 3 and 2 respectively) | maxpstrengthSS
Absolute value of lowest negative strength score of words (e.g., 4 is returned if a tweet contains "anguish" and "alone", which have SentiStrength scores of -4 and -2 respectively) |
 | intensityp3EI
Count of 2-scored intensity words (love, awesome, glad, fun, :P, =D) | intensityp2EI
Count of 1-scored intensity words (thank, cooperative, concern, :), :d) | intensityp1EI
Count of 0-scored intensity words (great, haze, fulfill, sick, sleepy) | intensity0EI
Count of -1-scored intensity words (anger, annoyed) | intensityn1EI
Count of -2-scored intensity words (sorry, agh, :/) | intensityn2EI
Count of -3-scored intensity words (hate, resented, D:) | intensityn3EI
Absolute value of highest positive score of intensity words | maxpintensityEI
Absolute value of lowest negative score of intensity words | minnintensityEI
Percentage of uppercase characters | uppcase