NLPRL-IITBHU at SemEval-2018 Task 3: Combining Linguistic Features and Emoji pre-trained CNN for Irony Detection in Tweets

This paper describes our participation in SemEval-2018 Task 3 on Irony Detection in Tweets. We combine linguistic features with the pre-trained activations of a convolutional neural network (CNN) trained on an emoji prediction task, and feed the combined feature set into an XGBoost classifier. Subtask-A involves classifying tweets as ironic or non-ironic, whereas Subtask-B involves classifying each tweet as non-ironic, verbal irony, situational irony or other verbal irony. We observe that combining features from these two different feature spaces improves our system's results. We leverage the SMOTE algorithm to handle the class imbalance in Subtask-B. Our final model achieves F1-scores of 0.65 and 0.47 on Subtask-A and Subtask-B respectively, ranking 4th on both subtasks and outperforming the baseline by 6% on Subtask-A and 14% on Subtask-B.


Introduction
According to the Merriam-Webster dictionary 1 , one of the meanings of irony is 'the use of words to express something other than and especially the opposite of the literal meaning' (e.g. I love getting spam emails.). Irony can take different forms, such as verbal, situational and dramatic irony; sarcasm is also categorized as a form of verbal irony. Various attempts have been made in the past to detect sarcasm (Joshi et al., 2017). Sarcastic texts are characterized by the presence of humor and ridicule, which are not always present in ironic texts (Kreuz and Glucksberg, 1989). The absence of these characteristics makes automatic irony detection a more difficult problem than sarcasm detection.
Irony detection is important for many natural language understanding systems. For example, people often use irony to express their opinions on social media platforms like Twitter (Buschmeier et al., 2014), so detecting irony in social texts can aid in improving opinion analysis.
The SemEval 2018 Task 3 (Van Hee et al., 2018) consists of two subtasks. Subtask-A involves predicting whether a tweet is ironic or not, and Subtask-B involves categorizing a tweet as Non-Ironic, Verbal Irony (by means of a polarity contrast), Situational Irony, or Other Forms of Verbal Irony. The task organizers use macro-averaged F1 rather than accuracy to force systems to perform well on all four classes of tweets, as described in Section 3.1.
Systems built in the past primarily used handcrafted linguistic features for the classification of ironic texts (Buschmeier et al., 2014; Farías et al., 2016). In our system, we combine such features with the pre-trained activations of a neural network. Our results show that the two types of features complement each other: their combination surpasses the results of using either the linguistic or the pre-trained activation features individually by a large margin. We use the XGBoost classifier (Chen and Guestrin, 2016), as it performs on par with neural networks when the training data is small.
Our results indicate that oversampling techniques like SMOTE (Chawla et al., 2002) can also be used to oversample the representations generated using neural networks to improve performance on imbalanced datasets.
The rest of the paper is organized as follows: Section 2 gives a detailed description of how our system was built, Section 3 then describes the experimental setup and the results obtained and Section 4 concludes the paper.

Proposed Approach
For modeling irony in tweets, our system makes use of a combination of features. These features can be classified into two broad groups:
• Linguistic features (structure and user behavior)
• Pre-trained activations of a neural network
These features were concatenated, and the XGBoost classifier (Chen and Guestrin, 2016) was used to perform the classification.
For subtask B, to counter the imbalance in the dataset, which might lead classifiers to favor the majority class in classification, we used SMOTE for oversampling the data (Chawla et al., 2002). Then we used XGBoost Classifier again for classification into various classes.
The details of the classifier parameters are provided in Section 2.2. Basic preprocessing of tweets was performed before feature extraction: removing hash symbols ('#'), expanding contractions ('doesn't' to 'does not'), removing links and quotation marks, and normalizing the text to lower case. We explicitly mention those features whose extraction requires the original (unprocessed) tweets.
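The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the exact rules used by the system: the contraction map and the URL regex are simplified assumptions.

```python
import re

# Small illustrative contraction map; the actual system presumably uses a fuller list.
CONTRACTIONS = {"doesn't": "does not", "can't": "can not", "won't": "will not"}

def preprocess(tweet):
    text = tweet.lower()                           # normalize to lower case
    for short, full in CONTRACTIONS.items():       # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", "", text)       # remove links
    text = text.replace("#", "")                   # drop hash symbols, keep the tag word
    text = text.replace('"', "").replace("'", "")  # strip quotation marks
    return " ".join(text.split())                  # collapse extra whitespace
```

Note that contractions are expanded before apostrophes are stripped, so forms like doesn't are preserved as does not rather than mangled.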

Feature Extraction
Our system generates a 72-dimensional handcrafted feature vector based on linguistic structure and user behavior. We then combine this with a 2304-dimensional feature vector generated from the activations of a pre-trained CNN. The combined features are categorized into 11 broad classes:
Contrast Based Features: A contrast of sentiments is characteristic of sarcastic and ironic texts (Rajadesingan et al., 2015), e.g. I love being ignored #not. For capturing contrast, we use the affect score of lemmas (Warriner et al., 2013) and the sentiment score of words based on SentiStrength (Thelwall et al., 2010). The final feature vector consists of:
• The difference between the highest and lowest sentiment values of the words present in the tweet. (1 feature)
• The difference between the highest and lowest affect scores of the words present in the tweet. (1 feature)
• The length of the longest unimodal sequence and the number of sentiment polarity transitions. (2 features)
• The sum of sentiment scores and the counts of positive and negative n-grams. (4 features)
Readability Based Features: Ironic texts are usually complex, and hence we use the total number of syllables in the tweet, along with the number of polysyllabic words, as features. Following the Automated Readability Index (Senter and Smith, 1967), the standard deviation, average and median of word length serve as indicators of the complexity of the text (Rajadesingan et al., 2015).
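The contrast-based features above can be illustrated with a toy sentiment lexicon. The word scores below are invented for illustration (the system uses SentiStrength and the Warriner et al. affect norms), and "longest run of same-polarity words" is one plausible reading of the longest unimodal sequence.

```python
# Toy word-level sentiment scores standing in for SentiStrength output.
TOY_SENTIMENT = {"love": 3, "ignored": -2, "not": -1}

def contrast_features(words):
    scores = [TOY_SENTIMENT.get(w, 0) for w in words]
    # Spread between the most positive and most negative word in the tweet.
    spread = max(scores) - min(scores)
    signed = [s for s in scores if s != 0]  # drop neutral words
    # Number of polarity transitions (positive <-> negative flips).
    transitions = sum(1 for a, b in zip(signed, signed[1:]) if a * b < 0)
    # Longest run of consecutive words sharing the same polarity.
    longest = run = 1 if signed else 0
    for a, b in zip(signed, signed[1:]):
        run = run + 1 if a * b > 0 else 1
        longest = max(longest, run)
    return spread, transitions, longest
```

On the example tweet "i love being ignored not", the positive love against the negative ignored yields a large spread and one polarity flip, which is exactly the kind of contrast these features are meant to capture.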
Incongruity of Context: Ironic similes are common in literature (e.g. as clear as mud in which both clear and mud are sentiment neutral words.). Due to this neutrality, the lexicon based methods are unable to capture the incongruity present. Therefore, maximum and minimum GloVe (Pennington et al., 2014) cosine similarity between any two words in a tweet are used as features in our system (Joshi et al., 2016).
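The incongruity features can be sketched with toy vectors in place of real 300-dimensional GloVe embeddings (the 3-d vectors below are invented for illustration):

```python
import math

# Toy 3-d vectors standing in for 300-d GloVe embeddings.
TOY_VECS = {
    "clear": [0.9, 0.1, 0.0],
    "as":    [0.2, 0.2, 0.2],
    "mud":   [0.0, 0.9, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def incongruity_features(words):
    # Max and min cosine similarity over all word pairs in the tweet.
    vecs = [TOY_VECS[w] for w in words if w in TOY_VECS]
    sims = [cosine(u, v) for i, u in enumerate(vecs) for v in vecs[i + 1:]]
    return (max(sims), min(sims)) if sims else (0.0, 0.0)
```

A low minimum similarity signals that the tweet contains a highly incongruent word pair (such as clear and mud), even when both words are sentiment-neutral.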
Repetition-based Features: Users often change their writing style to depict sarcasm and irony, analogous to the change of tone in speech while expressing sarcasm, e.g. Loooovvveeeeeee when my phone gets wiped. We use the count of words with repeated characters, and the count of 'senti words' (sentiment score ≥ 2 or ≤ -2) with repeated characters, as our features (Rajadesingan et al., 2015).
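Elongated words can be detected with a simple regular expression, as in the sketch below; the 'senti word' variant would additionally filter the matched words through a sentiment lexicon.

```python
import re

# Words containing a character repeated three or more times, e.g. "loooovvveee".
ELONGATED = re.compile(r"(\w)\1{2,}")

def repetition_count(words):
    return sum(1 for w in words if ELONGATED.search(w))
```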
Presence of Markers: Discourse markers are words that help in expressing ideas and performing specific functions (Farías et al., 2016). Our system uses a curated list of discourse markers. Similarly, we also use lists of intensifiers (e.g. heck), laughter words (e.g. lmao, lol), interjections (e.g. oops) and swear words (e.g. shit), as their appearance in a tweet indicates unexpectedness, which can in turn serve as an indicator of irony. We use the counts of these different types of words as separate features.
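The per-category marker counts can be computed as below. The lists here are short illustrative stand-ins; the actual system uses fuller curated lists.

```python
# Illustrative marker lists; the actual system uses fuller curated lists.
MARKER_LISTS = {
    "laughter": {"lol", "lmao", "rofl"},
    "interjection": {"oops", "wow", "ouch"},
    "intensifier": {"heck", "really", "so"},
}

def marker_counts(words):
    # One count per marker category, each used as a separate feature.
    return {name: sum(1 for w in words if w in markers)
            for name, markers in MARKER_LISTS.items()}
```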
Word Count Features: According to (2016), ironic tweets depict their content in fewer words than normal tweets, so we use the word count of the tweet as a feature. Additionally, Kreuz and Caucci (2007) suggest that the counts of adjectives and adverbs can also serve as markers of ironic content. We also use the preposition count as a separate feature.
Semantic Similarity: Ironic tweets that span multiple lines often contain lines that are highly semantically dissimilar to each other (Farías et al., 2016). We use the WordNet-based similarity function (Mihalcea et al., 2006) available online 2 to obtain a similarity score, which is used as a feature.
Polarity and Subjectivity: Ironic texts are usually subjective and often convey something negative (or positive) about the target (Wallace et al., 2015). We use the Polarity and Subjectivity Scores (Sentiment Score) generated using TextBlob as features in our model (Loria et al., 2014).
URL Counts: We observed in the training set that users often used irony to express their opinion about online content, e.g. blogs, images, tweets, etc. For specifying the context of a comment (tweet), they often add a URL to the original content. So we used the counts of URLs in a tweet as a feature. Our system requires raw tweets for extracting this feature.
Apart from the above features, we also experimented with the Named Entity count and the occurrence of popular hashtags (e.g. #hypocrisy), using a curated list, as features (Van Hee, 2017).

Pre-trained CNN Features
Apart from extracting linguistic features from tweets, we leverage the activations of a Convolutional Neural Network (CNN) pre-trained on an emoji prediction task. We use DeepMoji 3 (Felbo et al., 2017), a model trained on 1.2 billion tweets containing emojis and evaluated on eight benchmark datasets within sentiment, emotion and sarcasm detection. Since sarcasm is a form of verbal irony that expresses ridicule or contempt (Long and Graesser, 1988), we believe that transferring the knowledge of a CNN that performs well on sarcasm detection can improve the results of the irony detection task.

Classifiers
We construct XGBoost (Chen and Guestrin, 2016) feature-based classifiers for irony detection using the above features. Based on 10-fold cross-validation performance, the best performing parameters turn out to be the default parameters of the XGBoost classifier package 4 .

Handling Class Imbalance
The data provided for subtask-B is highly skewed.
To perform well on every class of irony, we used an oversampling technique, SMOTE (Chawla et al., 2002). In SMOTE, to generate a new synthetic sample, one of the k nearest neighbors of a minority-class instance is chosen at random, and the new sample is generated on the line joining the instance and the chosen neighbor. We use the SMOTE implementation available in the imblearn package (Lemaître et al., 2017) for our system, with k_neighbors set to 5.
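While the system relies on the imblearn implementation, the core interpolation step of SMOTE can be sketched in a few lines (a simplified illustration, not the library code):

```python
import math
import random

def smote_sample(instance, minority, k=5, rng=random):
    """Generate one synthetic minority-class sample: pick one of the k nearest
    neighbors of `instance` at random and interpolate on the line joining them."""
    neighbors = sorted((x for x in minority if x != instance),
                       key=lambda x: math.dist(instance, x))[:k]
    chosen = rng.choice(neighbors)
    gap = rng.random()  # random position on the segment [instance, chosen]
    return [a + gap * (b - a) for a, b in zip(instance, chosen)]
```

Because the synthetic point always lies between an instance and one of its nearest minority-class neighbors, oversampling densifies the minority region of feature space rather than simply duplicating examples.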

Dataset and Metrics
The annotated tweet corpus provided for training consists of 1390 instances of Verbal Irony by means of a polarity contrast, 205 instances of Other Types of Verbal Irony, 316 Situational Irony instances, and 1923 Non-Ironic instances. Our system only uses the training data provided by the organizers; no other annotated data is used (Constrained System). The test dataset for Subtask-A contains 473 non-ironic tweets and 311 ironic tweets. For Subtask-B, the 311 ironic tweets are further divided into Verbal Irony by means of a polarity contrast (164), Situational Irony (85) and Other Forms of Verbal Irony (62).
The evaluation metric used for ranking teams in Sub-task A is the F1 score of the positive (Ironic) class whereas in Subtask-B, the organizers use macro averaged F1 (average of F1 for each class) as an evaluation metric for ranking teams.
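Concretely, with per-class precision $P_c$ and recall $R_c$ over the four Subtask-B classes, the metric is the unweighted mean of the per-class F1 scores:

$$F1_c = \frac{2\, P_c R_c}{P_c + R_c}, \qquad \text{macro-}F1 = \frac{1}{4} \sum_{c=1}^{4} F1_c$$

Because each class contributes equally regardless of its size, a system cannot score well by ignoring the small Situational Irony and Other Verbal Irony classes.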

Results and Discussion
We present the results achieved by our individual approaches, as well as by their combinations, in the results table.
• Our submitted models achieve 4th position on the public leaderboard 5 on both Subtask-A and Subtask-B, beating the task baselines by about 6% and 14%, respectively, on the test set.
• Leveraging the DeepMoji model for the irony detection task yields a considerable improvement over purely linguistic features (0.03 and 0.12). This is because the model is trained on over a billion tweets and benchmarked on sarcasm detection among other domains. As stated earlier, sarcasm is a verbal form of irony (Long and Graesser, 1988), so transfer learning works well as the domains are quite similar.
• Our combination of linguistic features with the pre-trained CNN achieves F-scores of 0.65 and 0.42, an improvement of at least 0.03 on Subtask-A and a significant improvement on Subtask-B, compared to linguistic features alone. The higher scores point to the power of combining different feature spaces, as the two feature sets specialize in different types of tweets.
• The use of the SMOTE oversampling technique leads to an F-score of 0.47 on Subtask-B, an improvement of 0.05 over the (Linguistic + Pre-trained CNN) model.
• The improvement in scores due to linguistic features is not as pronounced in Subtask-B as in Subtask-A. One possible reason is that linguistic features cannot capture the fine-grained differences between the different forms of irony.