INAOE-UPV at SemEval-2018 Task 3: An Ensemble Approach for Irony Detection in Twitter

This paper describes an ensemble approach to SemEval-2018 Task 3. The proposed method combines two well-established text classification methods with a novel approach that captures ironic content by exploiting a lexicon tailored for irony detection. We experimented with different ensemble settings. The results show that our method performs well at detecting the presence of ironic content in Twitter.


Introduction
Social media provide a perfect scenario for exploiting language beyond its literal sense through figurative devices such as irony. Correctly identifying the real intention behind user-generated content is a major challenge for several areas of computational linguistics. For example, in Sentiment Analysis (SA), the presence of irony can undermine the performance of systems dedicated to this task. Several disciplines study irony from different perspectives. The most prevalent definition is that of Grice (1975), which states that the function of irony is to effectively communicate the opposite of the literal interpretation of a given utterance.
Nowadays, with the growing interest in irony detection, there are several approaches for addressing this task. Probably the most widely used one exploits characteristics extracted from the text on its own, such as n-grams, punctuation marks, and part-of-speech labels, among others (Riloff et al., 2013; Ptáček et al., 2014). Inherent aspects of irony, such as its very subjective component, have also been considered (Reyes et al., 2013; Barbieri et al., 2014). Other methods take advantage of information coming from the context in which a given utterance is produced (Rajadesingan et al., 2015). There are also approaches exploiting deep learning techniques and word embeddings (Poria et al., 2016; Ghosh and Veale, 2016; Joshi et al., 2016; Nozza et al., 2016). A less explored strategy for addressing irony detection is the use of ensemble methods. Fersini et al. (2015) and Liu et al. (2014) compared the performance of ensemble approaches against traditional classifiers; the best results were obtained by the ensemble setting.
In this paper we describe our participation in SemEval-2018 Task 3: Irony detection in English tweets (Van Hee et al., 2018). The INAOE-UPV system explores the use of an ensemble approach that considers different combinations of three methods. The main contribution of our approach lies in the use of a list of potentially ironic and non-ironic terms in order to identify irony in tweets.

Method Description
In order to determine the presence of ironic content in tweets, we propose an ensemble of different methods, namely a bag-of-words classifier and a word embeddings classifier, as well as a voting scheme based on a list of potentially ironic and non-ironic terms.

Individual classifiers

Ironic/nonironic Orientation (irO)
This approach attempts to capture the ironic and non-ironic connotation of the words in a tweet in order to identify the presence of ironic content. Building a lexicon for irony detection is not a trivial task. It has been recognized in (Nozza et al., 2016) that a lexicon for irony detection can be derived by using a huge amount of data.
To develop a lexicon for irony detection, we need to measure how strongly a word is associated with an ironic or non-ironic sense. A widely exploited measure in SA for developing lexica is Pointwise Mutual Information (PMI) (Church and Hanks, 1990). We decided to adopt a similar strategy to generate two lists of terms associated with ironic and non-ironic senses. As a starting point we took advantage of a set of corpora from the state of the art in irony detection (henceforth benchmark-corpora). The datasets we used are described in (Reyes et al., 2013; Riloff et al., 2013; Barbieri et al., 2014; Ptáček et al., 2014; Mohammad et al., 2015; Ghosh et al., 2015; Sulis et al., 2016; Karoui et al., 2017). Overall, more than 165,000 tweets were used to generate the two lists of words: ironic terms and nonironic terms. We calculated the PMI score for each term in the benchmark-corpora. After that, we selected only those terms with a PMI score greater than zero.
In order to determine the class of an instance, we assigned a vote (v) to each word (w) in a given tweet (t). First, we filter out the stopwords in each tweet. Then, we search for the most similar term in each of the lists in order to determine whether w is more related to an ironic or a nonironic sense. Specifically, for each list we compute a score given by the highest cosine similarity between w and each of the N terms in that list. As expected, the score for the words in t that are directly included in ironic terms or nonironic terms will be 1.
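The PMI-based selection can be illustrated with a minimal sketch. It assumes tweet-level (document frequency) counts and the standard definition PMI(w, c) = log2(P(w, c) / (P(w) P(c))); the text does not state how the probabilities were estimated, and the function name `pmi_lexicon` is hypothetical.

```python
import math
from collections import Counter


def pmi_lexicon(tokenized_tweets, labels, target):
    """Return {term: PMI(term, target)} keeping only terms with PMI > 0.

    Assumption: probabilities are estimated from tweet-level counts,
    i.e. a term is counted once per tweet in which it occurs.
    """
    n_docs = len(tokenized_tweets)
    n_target = sum(1 for y in labels if y == target)
    doc_freq = Counter()    # tweets containing the term
    joint_freq = Counter()  # tweets of class `target` containing the term
    for tokens, y in zip(tokenized_tweets, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            if y == target:
                joint_freq[term] += 1

    p_class = n_target / n_docs
    lexicon = {}
    for term, joint in joint_freq.items():
        p_term = doc_freq[term] / n_docs
        p_joint = joint / n_docs
        score = math.log2(p_joint / (p_term * p_class))
        if score > 0:  # keep only positively associated terms
            lexicon[term] = score
    return lexicon
```

Calling the function once with the ironic label and once with the non-ironic label would yield the two lists of terms; a term that is equally frequent in both classes gets a PMI of zero and is dropped from both.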
After this, the vote v(w) is assigned according to the following criterion: Finally, the class of a tweet is determined by the sum of the votes from all words in t.
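A minimal sketch of this voting scheme follows. The exact vote criterion is not reproduced in the text, so the rule used here is an assumption: a word votes +1 when its best similarity to the ironic list is higher, -1 when its best similarity to the non-ironic list is higher, and 0 on a tie, and a tweet is labeled ironic when the sum is positive. The toy `vectors` dictionary and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_similarity(word, terms, vectors):
    """Highest cosine similarity between `word` and any term in `terms`."""
    if word in terms:
        return 1.0  # words contained in the list score 1, as in the text
    if word not in vectors:
        return 0.0
    return max(
        (cosine(vectors[word], vectors[t]) for t in terms if t in vectors),
        default=0.0,
    )


def iro_predict(tokens, ironic_terms, nonironic_terms, vectors,
                stopwords=frozenset()):
    """Sum the per-word votes (ASSUMED criterion: +1 / -1 / 0 on tie)."""
    total = 0
    for w in tokens:
        if w in stopwords:
            continue
        s_iro = best_similarity(w, ironic_terms, vectors)
        s_non = best_similarity(w, nonironic_terms, vectors)
        if s_iro > s_non:
            total += 1
        elif s_non > s_iro:
            total -= 1
    # Assumed decision rule: positive sum means ironic.
    return "ironic" if total > 0 else "non-ironic"
```

Other criteria, for instance weighting each vote by the similarity margin, would fit the same structure.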

Bag-of-words based classifier (BOW)
This approach is based on a bag-of-words (Salton et al., 1975) representation of the tweets. It uses unigrams as binary features. For the classification it employs an SVM classifier. From here on, we use the acronym BOW to refer to this approach.
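The binary unigram representation can be sketched as follows. This covers only the feature extraction step; the resulting vectors could be passed to any SVM implementation. The helper names are hypothetical and tokenization is assumed to have been done beforehand.

```python
def build_vocabulary(tokenized_tweets):
    """Map each distinct unigram in the training tweets to a column index."""
    vocab = {}
    for tokens in tokenized_tweets:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def binary_unigram_vector(tokens, vocab):
    """Return a 0/1 vector marking which vocabulary unigrams occur."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            vec[idx] = 1
    return vec
```

Because the features are binary, a unigram repeated within a tweet contributes the same value as a single occurrence.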

Word Embeddings based classifier (wEmb)
This approach is based on the use of word embeddings. In particular, it employs embeddings pre-trained on the Google News corpus (Mikolov et al., 2013) using the Continuous Bag-of-Words (CBOW) model. In this case, tweets are represented by the centroid of the vectors of their words. As in the BOW approach, the classification is done by an SVM classifier. From now on, the acronym wEmb will be used to refer to this approach.
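The centroid representation can be sketched as below. The embedding table is modeled as a plain dictionary from word to vector, and returning a zero vector when no word of the tweet has an embedding is an assumption, since the text does not say how that case is handled.

```python
def tweet_centroid(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have one.

    `embeddings` is a dict {word: list of floats of length `dim`}.
    Assumption: tweets with no covered token map to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

Tokens without an embedding, such as many hashtags and mentions, simply do not contribute to the average.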

Ensemble approaches for irony detection
We explored the use of different techniques that rely on the words in each tweet in order to identify the presence of irony. Each of the techniques we exploited has its own advantages and limitations. The BOW model captures the topics present in the vocabulary as well as discursive markers used in an ironic writing style. On the other hand, wEmb makes it possible to capture abstract semantics of the words regardless of the data available for the task. With respect to irO, it attempts to simulate the interpretative process carried out to understand the ironic intention. Irony comprehension at an initial stage involves getting the literal sense of words (Giora and Fein, 1999) and then recognizing the figurative intention behind them. Thus, our method quantifies how many words are likely to be used in a literal or figurative sense before deciding whether a tweet is ironic or not. By proposing an ensemble that uses all the methods together, we attempt to encompass different aspects of the use of vocabulary when the ironic phenomenon is present. Below, we introduce the ensemble approaches proposed for capturing the presence of irony in Twitter.

Coverage-based ensemble (ENS cov)
It is composed of BOW and wEmb. In Twitter data, there are many terms, such as mentions, hashtags, emoji, and URLs, that are unlikely to have an embedding. However, such terms are indeed covered by a model like BOW. To take advantage of both methods, we decided to combine them by considering a simple criterion that depends on the coverage rate of the word embeddings (cov emb) in each single tweet. That is, if the cov emb is greater than 75%, the tweet is classified by the wEmb model; otherwise the decision is made by the BOW approach.
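The routing rule can be sketched as follows. The two classifiers are passed in as callables, and the function names and the treatment of empty tweets (coverage 0, hence routed to BOW) are assumptions made for illustration.

```python
def embedding_coverage(tokens, embeddings):
    """Fraction of the tweet's tokens that have a pre-trained embedding."""
    if not tokens:
        return 0.0  # assumption: an empty tweet counts as uncovered
    covered = sum(1 for t in tokens if t in embeddings)
    return covered / len(tokens)


def ens_cov_predict(tokens, embeddings, wemb_predict, bow_predict,
                    threshold=0.75):
    """Use wEmb when coverage is strictly above the threshold, else BOW."""
    if embedding_coverage(tokens, embeddings) > threshold:
        return wemb_predict(tokens)
    return bow_predict(tokens)
```

With this rule, a four-token tweet in which exactly three tokens have an embedding (coverage of 75%) is still handled by BOW, because the condition is "greater than".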

Majority vote ensemble (ENS vot)
In this approach, the decisions from the three individual methods (irO, BOW and wEmb) are combined following a majority vote strategy.
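For the binary case, the majority vote reduces to a few lines. This sketch assumes each individual method returns a class label and that there are three voters and two classes, so a tie cannot occur; the function name is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most of the individual methods."""
    return Counter(predictions).most_common(1)[0][0]


# Example with the three methods (irO, BOW, wEmb) voting on one tweet.
label = majority_vote(["ironic", "non-ironic", "ironic"])
```

With more than two classes, three voters can disagree completely, so a tie-breaking rule or a weighting of the votes would be needed.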

Task Description
This year, Task 3 of SemEval-2018, Irony detection in English tweets (Van Hee et al., 2018), was dedicated to the identification of ironic content in Twitter. The task is composed of two subtasks. In Task A (Ironic vs. non-ironic), the aim was to identify whether a tweet contains an ironic intention or not. In Task B (Different types of irony), the objective was to classify a tweet into one of four classes: (i) verbal irony realized through a polarity contrast, (ii) other verbal irony, (iii) situational irony, and (iv) non-ironic. Participants were allowed to submit two different kinds of systems: Constrained (C), where only data provided for the task were used for training purposes, and Unconstrained (U), where additional data were exploited.

Task A
In order to address Task A, we applied two different ensemble approaches. Our first submission was based on the coverage-based ensemble using a constrained setting (henceforth taskA ENS cov C). The second submission (henceforth taskA ENS vot U) used the majority vote ensemble built on an unconstrained setting. The BOW and wEmb models were trained using only the training set provided by the organizers. In contrast, irO relies on the benchmark corpora. Additionally, for building the lists of ironic and non-ironic terms we used a set of tweets containing the hashtags #irony and #sarcasm collected during the 2016 US Elections week, as well as the training data provided for the task.
For experimental purposes, we applied three-fold cross-validation on the training data during the development phase of the shared task. Table 1 shows the obtained results in terms of F1-score. First, we evaluated each of the methods described in Section 2 individually (the first three rows in Table 1). The first two rows present the results of BOW and wEmb when assessed using only the training data. Meanwhile, irO exploits both data from the task and external data. The highest result was achieved by the wEmb model. Despite being a basic method for identifying irony in tweets, our proposed approach (irO) achieves good performance even in comparison to powerful techniques such as word embeddings. Regarding the ensemble approaches, the best performance was reached by the majority vote approach.

Task B
In order to address Task B, we employed two different configurations of the majority vote approach (henceforth taskB ENS vot U1 and taskB ENS vot U2), adding an additional criterion: in both cases, when the result of irO indicates the presence of irony, we assigned one of the ironic classes by exploiting three different lists of words (one for each ironic class in Task B) created following the same strategy described in Section 2.1. For taskB ENS vot U1, the BOW and wEmb models were trained using the four classes in Task B, while for taskB ENS vot U2 four binary classifiers were trained considering the combinations between the ironic classes and the non-ironic class in Task B. A weighted voting strategy was adopted in both ensembles.
The three methods were also evaluated individually for Task B. The best performance was achieved by BOW. The irO method performs better than wEmb, probably because little data was available for training the classifier. Neither of the ensemble methods improves on the baseline, i.e., the BOW results.

Official Results
Our best result was in the constrained version of Task A (we ranked in the 11th position). Our intuition is that, when the data are retrieved during the same time frame, the probability of sharing a similar vocabulary (in terms of trending-topic hashtags, mentions, etc.) is higher than when using external data. Therefore, an approach exploiting only data provided in the task could perform better than one using additional data. In the unconstrained setting, we observed a drop in performance. In spite of this, we ranked in the 2nd position when only unconstrained systems were considered.
Concerning Task B, our approach showed worse performance than in Task A. The results of the two submissions were quite different, probably due to the number of classifiers involved in each ensemble. Overall, all the teams participating in the shared task had a lower performance in Task B, demonstrating the difficulty of the task. It is important to highlight that the taskB ENS vot U2 submission ranked in the 3rd position when only the unconstrained setting was considered.

Error Analysis
We analyze those instances that were misclassified by our submissions in Task A, observing different kinds of errors:
• Tweets where the ironic sense highly depends on the context where they are produced. In the following example it is not possible to understand the ironic intention without having more information: @LukeLPearson hmm... let me think about that
• Tweets containing terms often used in ironic instances, such as "really". This is a disadvantage of word-based methods where terms highly related to a particular class provoke misleading classifications when they appear in other classes. The following is an example of this: I'm really excited for next semester
• Tweets containing several hashtags. Most of the time our methods predicted such instances as ironic being in reality non-ironic: @NormanWalshUK Stunning work. #british #textiles #footwear #madeinbritain #not-anike-clone

Conclusions
In this paper we describe our participation in SemEval-2018 Task 3. We propose an ensemble method that combines well-known techniques with a novel approach based on the words in a tweet to identify the presence of irony. From the results, we observe that our approach obtained relatively good results considering its simplicity. As future work, it could be interesting to enhance the tailored lexicon by exploiting more data and other strategies for collecting words that are likely to be used to convey an ironic sense. Moreover, considering different criteria for assigning the votes in our approach is also a matter for further experiments.