#NonDicevoSulSerio at SemEval-2018 Task 3: Exploiting Emojis and Affective Content for Irony Detection in English Tweets

This paper describes the participation of the #NonDicevoSulSerio team at SemEval2018-Task3, which focused on Irony Detection in English Tweets and was articulated in two tasks addressing the identification of irony at different levels of granularity. We participated in both proposed tasks: Task A is a classical binary classification task to determine whether a tweet is ironic or not, while Task B is a multiclass classification task devoted to distinguishing different types of irony, where systems have to predict one out of four labels describing verbal irony by clash, other verbal irony, situational irony, and non-irony. We addressed both tasks by proposing a model built upon a well-engineered feature set involving both syntactic and lexical features, and a wide range of affective-based features, covering different facets of sentiment and emotions. We analyzed the use of new features that take advantage of the affective information conveyed by emojis. Along this line, we also tried to exploit the possible incongruity between the sentiment expressed in the text and in the emojis included in a tweet. We used a Support Vector Machine classifier, and obtained promising results. We also carried out experiments in an unconstrained setting.


Introduction
The use of creative language and figurative language devices such as irony has been shown to be pervasive in social media (Ghosh et al., 2015). The presence of these devices makes the process of mining social media texts challenging, especially because they can influence and twist the sentiment polarity of an utterance in different ways. Glossing over differences across the theoretical accounts proposed in various disciplines (Gibbs and Colston, 2007; Grice, 1975; Wilson and Sperber, 1992; Attardo, 2007; Giora, 2003), irony can be defined as an incongruity between the literal meaning of an utterance and its intended meaning (Karoui et al., 2017). The term irony mainly covers two phenomena: verbal and situational irony (Attardo, 2006). Situational irony refers to events or situations which fail to meet expectations, such as for instance "warnings the dangerous effect of smoking on the cigarette advertisement", while verbal irony occurs when the speaker intends to communicate a meaning different from what he/she is literally saying. Most of the time it involves the intention of communicating an opposite meaning, and this kind of opposition can be expressed by polarity contrast. However, this is not the only possibility, and social media messages reflect this variety well, including different expressions of verbal irony and descriptions of situational irony (Van Hee et al., 2016a; Sulis et al., 2016). Automatic irony detection is an important task for improving sentiment analysis (Reyes et al., 2013; Maynard and Greenwood, 2014). However, detecting irony automatically from textual messages is still a challenging task for scholars (Joshi et al., 2017). The linguistic and social factors which affect the perception of irony contribute to making the task complex.
In this paper, we describe the irony detection systems we developed for participating in SemEval2018-Task3: Irony Detection in English Tweets (Van Hee et al., 2018). Our systems use a support vector classifier that exploits novel handcrafted features, including lexical, syntactical and affective-based features. We participated in 3 different scenarios (Task A constrained, Task A unconstrained, and Task B unconstrained). The official results show that our system outperformed all systems in the unconstrained setting on both tasks and was able to achieve a reasonable score in Task A constrained, ranking in the top ten out of 44 submissions.

We performed our experiments using a support vector machine classifier with a radial basis function kernel. We exploited different kinds of features (lexical, syntactical and affective-based), which have been proven effective in the literature for identifying ironic phenomena. In addition, we investigated the use of novel features aimed at exploiting the information conveyed by emojis, studying in particular the incongruity between the sentiment expressed in the text and in the emojis of a tweet.

Structural Features
Structural features consist of lexical and syntactical features which characterize Twitter data. Features of this kind have proven beneficial in several tasks dealing with Twitter data, and we selected the most relevant ones for irony detection.
Hashtag Presence: binary value, 0 if the tweet contains no hashtag and 1 if it contains at least one.
Hashtag Count: number of hashtags contained in the tweet.
Mention Count: number of mentions contained in the tweet.
Exclamation Mark Count: number of exclamation marks contained in the tweet.
Upper Case Count: number of upper case characters in the tweet.
Link Count: number of links (http) contained in the tweet.
Link Presence: binary value, 0 if the tweet contains no link and 1 if at least one link is found.
Has Quote: binary value, 0 if no quote (" " or ' ') is found in the tweet and 1 if at least one pair of quotes (" " or ' ') is found.
Intensifiers & Overstatement Words Count: number of intensifiers and words typically used in ironic overstatements found in the tweet.
Emoji Presence: binary value, 0 if no emoji is found in the tweet and 1 if at least one emoji is found.
Repeated Character: binary value, 0 if the tweet contains no repeated character and 1 if at least one word contains a character repeated at least three times consecutively.
Text Length: the number of characters in the tweet.
Conjunction Count: the number of conjunctions found in the tweet.
Verb Count: the number of verbs found in the tweet.
Noun Count: the number of nouns found in the tweet.
Adjective Count: the number of adjectives found in the tweet.
We use the Stanford PoS-Tagger to obtain the counts of conjunctions, verbs, nouns, and adjectives.
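As an illustration, several of the surface-level features listed above can be computed with regular expressions. The following is a minimal sketch and not the system's actual implementation: the function name, the regular expressions, and the small intensifier list are assumptions made for this example, and the PoS-based and emoji-based counts are omitted.

```python
import re

# Hypothetical, tiny intensifier list used only for illustration; the word
# list actually used by the system is not reproduced here.
INTENSIFIERS = {"very", "really", "so", "totally", "absolutely"}


def structural_features(tweet: str) -> dict:
    """Compute a subset of the structural features for one tweet."""
    hashtags = re.findall(r"#\w+", tweet)
    mentions = re.findall(r"@\w+", tweet)
    links = re.findall(r"https?://\S+", tweet)
    tokens = re.findall(r"[A-Za-z']+", tweet.lower())
    # Note: the single-quote branch can also fire on two apostrophes,
    # so this is only a rough approximation of "has a quoted span".
    has_quote = re.search(r"\"[^\"]+\"|'[^']+'", tweet)
    # A character followed by at least two copies of itself, i.e. three
    # or more consecutive identical word characters.
    repeated = re.search(r"(\w)\1{2,}", tweet)
    return {
        "hashtag_presence": int(bool(hashtags)),
        "hashtag_count": len(hashtags),
        "mention_count": len(mentions),
        "exclamation_count": tweet.count("!"),
        "upper_case_count": sum(ch.isupper() for ch in tweet),
        "link_count": len(links),
        "link_presence": int(bool(links)),
        "has_quote": int(bool(has_quote)),
        "intensifier_count": sum(tok in INTENSIFIERS for tok in tokens),
        "repeated_character": int(bool(repeated)),
        "text_length": len(tweet),
    }


example = 'Sooo happy, really!!! @bob "great" day #not http://t.co/x'
features = structural_features(example)
```

Each tweet is thus mapped to a fixed-length numeric vector, which is the form a support vector machine expects as input.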

Affective-Based Features
Affective features were proven effective in prior work to detect irony in tweets. We exploited available affective resources to extract affective information, trying to capture multiple facets of affect (sentiment polarity and emotions) by selecting a few resources developed for English, which refer to both categorical and dimensional models of emotions.
AFINN. AFINN is a sentiment lexicon consisting of English words labeled with a valence score between -5 and 5. We used a normalized version of AFINN, in which the valence score is already normalized to the range between 0 and 1.
Emolex. Emolex (Mohammad and Turney, 2013) was developed by using crowdsourcing. Emolex contains 14,182 words associated with eight primary emotions based on (Plutchik, 2001).
EmoSenticNet. EmoSenticNet (EmoSN) (Poria et al., 2013) is an enriched version of SenticNet, where emotion labels were added by mapping WordNet-Affect labels to the SenticNet concepts. WordNet-Affect labels refer to Ekman's six basic emotions.
Linguistic Inquiry and Word Count (LIWC). The LIWC dictionary (Pennebaker et al., 2001) has 4,500 words distributed into 64 different emotional categories including positive and negative. Here we only use the positive (PosEMO) and negative emotion (NegEMO) categories.
Dictionary of Affect in Language (DAL). DAL was developed by (Whissell, 2009) and is composed of 8,742 English words. These words are labeled with three scores representing the emotion dimensions Pleasantness, Activation, and Imagery.
Emoji Sentiment Ranking. Since we observed the presence of many emojis in Twitter data, we used the emoji sentiment ranking lexicon by (Novak et al., 2015) to get the sentiment score of each emoji in the tweet. We also tried to detect sentiment incongruity between text and emoji in the same tweet. We used VADER (Hutto and Gilbert, 2014) to extract the polarity score of the text.
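The idea of text-emoji sentiment incongruity can be illustrated with a small sketch. The text does not specify how the incongruity feature is computed, so the rule below (text polarity and emoji polarity having opposite signs) is an assumption. The two toy lexicons are made-up placeholders standing in for VADER and the Emoji Sentiment Ranking, and `normalize_valence` shows one plausible min-max rescaling of a valence score from [-5, 5] to [0, 1].

```python
# Toy lexicons: every score below is a made-up placeholder, not a value
# taken from AFINN, VADER, or the Emoji Sentiment Ranking.
TOY_WORD_POLARITY = {"love": 0.8, "great": 0.7, "hate": -0.8, "awful": -0.7}
TOY_EMOJI_POLARITY = {"😊": 0.6, "😍": 0.7, "😒": -0.4, "😭": -0.1}


def normalize_valence(score, low=-5.0, high=5.0):
    """Min-max rescale a valence score from [low, high] to [0, 1]."""
    return (score - low) / (high - low)


def mean_polarity(items, lexicon):
    """Average the lexicon scores of the items found in the lexicon."""
    scores = [lexicon[item] for item in items if item in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def emoji_incongruity(tweet):
    """Return 1 if text and emoji polarity have opposite signs, else 0.

    Assumed rule for illustration only: a tweet with no scored words or
    no scored emojis gets polarity 0.0 on that side and is never flagged.
    """
    text_polarity = mean_polarity(tweet.lower().split(), TOY_WORD_POLARITY)
    # Iterating over characters is enough for single-codepoint emojis.
    emoji_polarity = mean_polarity(list(tweet), TOY_EMOJI_POLARITY)
    return int(text_polarity * emoji_polarity < 0)


flag = emoji_incongruity("I love mondays 😒")
```

In this sketch a positive word paired with a negative emoji raises the flag, which mirrors the intuition that a clash between the two channels may signal irony.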

Task Description and Dataset
SemEval2018-Task3's organizers proposed two subtasks related to the topic of detecting irony in Twitter automatically (Van Hee et al., 2018). SubTask A is a binary classification task, where every system should determine whether a tweet is ironic or not ironic. SubTask B is defined as a multi-class classification problem, where the aim is to classify each tweet into one of four categories: verbal irony by polarity contrast, other verbal irony, situational irony, and not irony. In both tasks, organizers allowed submissions in two scenarios: constrained and unconstrained. In the unconstrained setting, participants were allowed to exploit external data from other corpora annotated with irony labels in the training phase. Standard evaluation metrics were proposed for the task, including precision, recall, accuracy, and F1-score.
Dataset. The organizers provided 3,834 training tweets and 784 test tweets for both tasks. Table 1 shows the dataset distribution. Data were collected by using three irony-related hashtags: #irony, #sarcasm, and #not. Datasets for both tasks were manually labeled by using the fine-grained annotation scheme in (Van Hee et al., 2016b). A two-layer annotation has been applied to the same tweets: the first layer concerns the presence or absence of irony, and the second identifies the type of irony when irony is present. As a consequence, as Table 1 shows, there is a class imbalance in the SubTask B dataset: the non-ironic class dominates (50%), followed by verbal irony by polarity contrast (25%), other verbal irony (13%) and situational irony (12%). The irony-related hashtags were removed from the final dataset release.

Experimental Setup
We built our supervised systems based on the available training data. In this phase, performance was evaluated based on the mean F1-score, using 10-fold cross validation. We chose an SVM classifier with a radial basis function kernel. Our system implementation is freely available for research purposes on GitHub. We relied on a feature selection process to improve system performance: we carried out an ablation test on our feature sets to find the configuration with the highest F1-score. We decided to participate in three different scenarios: SubTask A constrained, SubTask A unconstrained, and SubTask B unconstrained. For the unconstrained scenario in SubTask A, we used the available corpora from previous work. We tried adding new data in balanced proportion (1500 ironic and 1500 non-ironic). We also balanced the ironic data across hashtags (500 #irony, 500 #sarcasm, and 500 #not) from three different corpora, with the aim of enriching the training data with ironic samples of various provenance and trying to avoid biases. The distribution and source of our additional data can be seen in Table 2.
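The exact ablation procedure is not detailed in the text. One common variant, shown here purely as an assumption, is greedy backward elimination: starting from all feature groups, repeatedly drop the group whose removal improves the score the most, and stop when no removal helps. In the sketch below, `evaluate` is a hypothetical callable that would return the mean F1-score from 10-fold cross validation for a given feature subset; a toy scoring function with made-up contributions stands in for it.

```python
def greedy_ablation(features, evaluate):
    """Greedy backward elimination over feature groups.

    `evaluate(subset)` must return a score to maximize (for instance the
    mean F1-score under cross validation). Returns the selected subset
    and its score.
    """
    selected = list(features)
    best_score = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        drop = None
        for feat in selected:
            candidate = [f for f in selected if f != feat]
            score = evaluate(candidate)
            if score > best_score:
                # Remember the single removal that helps the most.
                best_score, drop = score, feat
                improved = True
        if improved:
            selected.remove(drop)
    return selected, best_score


# Toy stand-in for cross-validated F1: made-up additive contributions.
TOY_CONTRIBUTION = {
    "hashtag_count": 0.05,
    "link_count": 0.04,
    "emolex": 0.03,
    "noisy_feature": -0.02,
}


def toy_evaluate(subset):
    return 0.5 + sum(TOY_CONTRIBUTION[f] for f in subset)


selected, score = greedy_ablation(list(TOY_CONTRIBUTION), toy_evaluate)
```

With the toy scores, only the feature with a negative contribution is removed. With a real classifier each call to `evaluate` retrains the model, so the cost grows quadratically with the number of feature groups.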
In SubTask B, we used a pipeline approach consisting of three classification steps. First, we classify tweets as ironic or non-ironic (with a configuration similar to SubTask A). Second, we classify the tweets judged ironic in step one into two categories: verbal irony by polarity contrast and the rest (other verbal irony+situational irony). In this second step, we add more training data to the other verbal irony+situational irony class to overcome the imbalance issue. We decided to use only additional tweets marked with the #irony hashtag, relying on the analysis in (Sulis et al., 2016), which suggests that the polarity reversal phenomenon is relevant in messages marked with #sarcasm and #not, but less relevant for messages tagged with #irony. In the last step, we classify between other verbal irony and situational irony. Table 3 shows the features selected for each submitted system based on our ablation test.
Table 4 shows our experimental results based on four different metrics: accuracy, precision, recall, and F1-score. For experiments on the training set we used 10-fold cross validation, and we report the score for each metric. However, the F1-score was used as the criterion to tune the configuration. Official Codalab results show that our system ranked 10th out of 44 submissions on SubTask A and 9th out of 32 on SubTask B. We obtained an F1-score of 0.6216 (best system: 0.7054) on SubTask A and 0.4131 (best system: 0.5074) on SubTask B. However, our system outperformed all systems in the unconstrained setting on both tasks. Based on our analysis, several stylistic features were very effective in Task A (both in constrained and unconstrained settings). In particular, Twitter-specific symbols such as hashtags, mentions, and URLs were very useful for discriminating non-ironic tweets. In addition, we found that affective resources were very helpful in Step 2 and Step 3 of Task B, especially Emolex (Step 2) and EmoSenticNet (Step 3).
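The three-step cascade described above for SubTask B can be sketched as a chain of binary decisions. This is an illustrative outline under stated assumptions: the function and parameter names are hypothetical, each step is passed in as a callable returning a boolean, and the keyword rules at the bottom are toy stand-ins for trained classifiers.

```python
# Class indices for the four SubTask B categories.
LABELS = {
    0: "not irony",
    1: "verbal irony by polarity contrast",
    2: "other verbal irony",
    3: "situational irony",
}


def cascade_predict(tweet, is_ironic, is_polarity_contrast, is_situational):
    """Classify one tweet with three binary decisions applied in sequence."""
    # Step 1: ironic vs. non-ironic.
    if not is_ironic(tweet):
        return 0
    # Step 2: verbal irony by polarity contrast vs. the rest.
    if is_polarity_contrast(tweet):
        return 1
    # Step 3: situational irony vs. other verbal irony.
    return 3 if is_situational(tweet) else 2


# Toy keyword rules standing in for the three trained classifiers.
def toy_step1(tweet):
    return "yeah right" in tweet or "typical" in tweet


def toy_step2(tweet):
    return "yeah right" in tweet


def toy_step3(tweet):
    return "fire station" in tweet


label = cascade_predict(
    "typical, the fire station burned down", toy_step1, toy_step2, toy_step3
)
```

One consequence of this design is that errors propagate: a tweet wrongly rejected in Step 1 can never be recovered by the later steps.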

Result and Analysis
Another important finding is that additional data on Task A did not improve the classifier performance. In contrast, additional tweets marked with #irony on Task B were very useful for handling the imbalanced dataset in Step 2 (verbal irony by polarity contrast vs other verbal irony+situational irony). Our classifier was able to achieve a high F1-score in the training phase in this case. Furthermore, we also found that our new features for capturing affective information in emojis (e.g. emoji incongruity) were very helpful in classifying between ironic and non-ironic data.
Table 5 shows the confusion matrix of our classification results on SubTask B (class labels: 0: not irony; 1: verbal irony by polarity contrast; 2: other irony; 3: situational irony). Our system performed quite well in Step 1 (irony vs non-irony) and Step 2 (verbal irony by polarity contrast vs other verbal irony+situational irony). However, our system struggled to distinguish between other verbal irony and situational irony (Step 3). Our system obtained very low precision in detecting situational irony, and this has a large impact on the macro-averaged F-score. Finding a feature that discriminates between other verbal irony and situational irony was, indeed, the main challenge for us in Task B.
A qualitative error analysis was also conducted. We found many tweets that were difficult to understand without context, such as:
(tw1) "Produce Mobile Apps http://t.co/3OV57ZhqcH http://t.co/wX1DbI8W9M"
(tw2) "#Consensus of Absolute Hilarious -#MichaelMann to lecture on #Professional #Ethics for #Climate #Scientists? http://t.co/pD0TEMq1Z0"
The first tweet is labelled as situational irony and originally included a #not hashtag before the link. Even for humans it is very difficult to grasp the ironic intention behind the tweet once the #not hashtag is removed and without access to the information in the URL, which was inactive anyway. The second example was labelled as other verbal irony. Although it is very difficult to resolve the context of this tweet, accessing the URL it contains was helpful in understanding the ironic intent.
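To see why low precision on a single class weighs so heavily on the macro-averaged F-score, recall that macro averaging gives every class the same weight regardless of its size. The sketch below computes per-class precision, recall and F1 from a confusion matrix and then averages the F1 values. The matrix is a made-up toy example, not the confusion matrix of Table 5, and the function names are chosen for this illustration.

```python
def per_class_scores(confusion):
    """Return a (precision, recall, f1) tuple for each class.

    confusion[i][j] is the number of items whose gold class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        true_pos = confusion[c][c]
        predicted = sum(confusion[row][c] for row in range(n))
        gold = sum(confusion[c])
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / gold if gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(confusion)
    return sum(f1 for _, _, f1 in scores) / len(scores)


# Made-up 4-class matrix (rows: gold, columns: predicted). The last class
# is small and often confused with the others.
toy_confusion = [
    [80, 10, 5, 5],
    [10, 30, 5, 5],
    [5, 5, 5, 5],
    [5, 2, 3, 2],
]
toy_macro = macro_f1(toy_confusion)
```

Because each class contributes one quarter of the average in a four-class setting, a small class with a poor F1 lowers the macro score as much as a large class would.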

Conclusion
This paper described the participation of the #NonDicevoSulSerio team at SemEval2018-Task3: Irony Detection in English Tweets. We proposed to use several stylistic features and exploited several affective resources to deal with this task. Based on our evaluation and analysis, classifying irony into its several types (verbal irony by polarity contrast, other verbal irony, and situational irony) is a very challenging task. In particular, finding relevant features to discriminate between other verbal irony and situational irony will be the main focus of our future research. In this respect, capturing semantic incongruity by exploiting word embedding semantic similarity is an issue worth exploring (Joshi et al., 2015).