NIHRIO at SemEval-2018 Task 3: A Simple and Accurate Neural Network Model for Irony Detection in Twitter

This paper describes our NIHRIO system for SemEval-2018 Task 3 “Irony detection in English tweets.” We propose a simple Multilayer Perceptron neural network architecture with several types of input features: lexical, syntactic, semantic and polarity features. Our system achieves high performance in both the binary and multi-class irony detection subtasks. In particular, we rank third using the accuracy metric and fifth using the F1 metric. Our code is available at: https://github.com/NIHRIO/IronyDetectionInTwitter


Introduction
Mining Twitter data has attracted increasing research attention in many NLP applications, such as sentiment analysis (Pak and Paroubek, 2010; Kouloumpis et al., 2011; Agarwal et al., 2011; Liu et al., 2012; Rosenthal et al., 2017; Cambria et al., 2018) and stock market prediction (Bollen et al., 2011; Vu et al., 2012; Bartov et al., 2015; Nofer and Hinz, 2015; Oliveira et al., 2017). Davidov et al. (2010) and Reyes et al. (2013) have shown that Twitter data includes a high volume of "ironic" tweets. For example, a user can use positive words in a Twitter message to convey her intended negative meaning (e.g., "It is awesome to go to bed at 3 am #not"). This makes it a research challenge to assign correct sentiment labels to ironic tweets (Bosco et al., 2013; Ghosh et al., 2015; Nozza et al., 2017; Kannangara, 2018).
To handle that problem, much attention has been focused on automatic irony detection in Twitter (Davidov et al., 2010; Reyes et al., 2013; Barbieri and Saggion, 2014; Rajadesingan et al., 2015; Sulis et al., 2016; Karoui et al., 2017; Joshi et al., 2017; Huang et al., 2017; Ravi and Ravi, 2017). In this paper, we propose a neural network model for irony detection in tweets. Our model obtains the fifth-best performance in both the binary and multi-class irony detection subtasks in terms of F1 score (Van Hee et al., 2018). Details of the two subtasks can be found in the task description paper (Van Hee et al., 2018). We briefly describe the subtasks as follows:
Subtask 1 (A): Ironic vs non-ironic This first subtask is a binary classification problem, in which we predict whether or not a tweet is ironic. For example, "I just love when you test my patience!! #not" is ironic, but "Had no sleep and have got school now #not happy" is non-ironic.
Subtask 2 (B): Different types of irony This second subtask is a multi-class classification problem, where we predict the correct label of a tweet from four classes: (1) non-irony, (2) verbal irony by means of a polarity contrast, (3) other verbal irony and (4) situational irony.
The remainder of this paper is organized as follows: We describe the ironic tweet dataset provided by the SemEval-2018 Task 3 in Section 2. We then describe our system in Section 3. The experimental results and conclusion are detailed in Section 4 and Section 5, respectively.

Dataset
The dataset consists of 4,618 tweets (2,222 ironic + 2,396 non-ironic) that are manually labelled by three students. Some pre-processing steps were applied to the dataset; for example, the emoji icons in a tweet are replaced by a describing text using the Python emoji package. Additionally, all the ironic hashtags, such as #not, #sarcasm, #irony, in the dataset have been removed. This makes it difficult to correctly predict the label of a tweet. For example, "@coreybking thanks for the spoiler!!!! #not" is an ironic tweet, but without #not it probably reads as a non-ironic tweet. The dataset is split into the training and test sets as detailed in Table 1. Note that there is also an extended version of the training set, which contains the ironic hashtags. However, we train our model only on the training set without the ironic hashtags, as it is in line with the test set.
Our data pre-processing step: Tweet normalization is an important pre-processing step, as around 15% of tweets contain 50% or more out-of-vocabulary tokens (Han and Baldwin, 2011). We normalize each tweet from the dataset using the lexicon-based approach proposed by Han et al. (2012), which relies on a manually constructed normalization dictionary (e.g., "reeeaaalll" is normalized to "real"). We then replace all tagged users and URLs by the specific word tokens "<USER>" and "<URL>", respectively. This is because they are unlikely to be correlated with the irony labels.
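The pre-processing described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the two-entry dictionary `NORMALIZATION_DICT`, the regular expressions and the function name `preprocess` are all assumptions made for the example, and the real normalization lexicon is far larger.

```python
import re

# Hypothetical, tiny stand-in for the normalization dictionary.
NORMALIZATION_DICT = {"reeeaaalll": "real", "u": "you"}

# Assumed patterns for tagged users and URLs.
USER_PATTERN = re.compile(r"@\w+")
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(tweet: str) -> str:
    """Normalize tokens with the dictionary, then mask users and URLs."""
    tokens = tweet.split()
    normalized = " ".join(NORMALIZATION_DICT.get(tok.lower(), tok) for tok in tokens)
    normalized = URL_PATTERN.sub("<URL>", normalized)
    return USER_PATTERN.sub("<USER>", normalized)


example = preprocess("@coreybking thanks for the spoiler http://t.co/abc")
```

Whitespace tokenization is used here only to keep the sketch short; a tweet-aware tokenizer would handle punctuation attached to words.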

Our modeling approach
We first describe our MLP-based model for ironic tweet detection in Section 3.1. We then present the features used in our model in Section 3.2.

Neural network model
We propose to use the Multilayer Perceptron (MLP) model (Hornik et al., 1989) to handle both irony detection subtasks. Figure 1 presents an overview of our model architecture, which includes an input layer, two hidden layers and a softmax output layer. Given a tweet, the input layer represents the tweet by a feature vector which concatenates lexical, syntactic, semantic and polarity feature representations. The two hidden layers, with the ReLU activation function, take the input feature vector and select the most important features, which are then fed into the softmax layer for irony detection and classification. Table 2 shows the number of lexical, syntactic, semantic and polarity features used in our model.
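To make the architecture concrete, the forward pass of such a network can be sketched in plain Python. This is an illustrative sketch only: the layer sizes, the random initialization and the helper names (`dense`, `init_layer`, `mlp_forward`) are assumptions, and the actual system is implemented in TensorFlow with tuned hyperparameters.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    # weights holds one row of input weights per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(x, layers):
    """Apply ReLU hidden layers, then a softmax output layer."""
    *hidden_layers, output_layer = layers
    for weights, bias in hidden_layers:
        x = relu(dense(x, weights, bias))
    weights, bias = output_layer
    return softmax(dense(x, weights, bias))


rng = random.Random(0)
# Toy sizes: 6 input features, two hidden layers of 4 units, 2 output classes.
layers = [init_layer(6, 4, rng), init_layer(4, 4, rng), init_layer(4, 2, rng)]
probs = mlp_forward([0.2, 0.0, 1.0, 0.5, 0.0, 3.0], layers)
```

For subtask 2 the output layer would have four units, one per class, with the rest of the structure unchanged.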

Features
Lexical features: Our lexical features include 1-, 2-, and 3-grams at both the word and character levels. For each type of n-gram, we utilize only the top 1,000 n-grams based on the term frequency-inverse document frequency (tf-idf) values. That is, each n-gram appearing in a tweet becomes an entry in the feature vector, with its tf-idf value as the feature value. We also use the number of characters and the number of words as features.
Syntactic features: We use the NLTK toolkit to tokenize and annotate part-of-speech tags (POS tags) for all tweets in the dataset. We then use all the POS tags with their corresponding tf-idf values as our syntactic features and feature values, respectively.
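The n-gram tf-idf features can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than the paper's specification: the smoothed idf formula `log(N / df) + 1`, ranking n-grams by their summed tf-idf over the corpus, and the function names.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_features(docs_ngrams, top_k):
    """Return (vocabulary, feature rows) restricted to the top_k n-grams."""
    n_docs = len(docs_ngrams)
    doc_freq = Counter(g for grams in docs_ngrams for g in set(grams))
    # Assumed idf variant; other smoothing schemes are equally plausible.
    idf = {g: math.log(n_docs / df) + 1.0 for g, df in doc_freq.items()}
    rows = []
    for grams in docs_ngrams:
        term_freq = Counter(grams)
        rows.append({g: count * idf[g] for g, count in term_freq.items()})
    # Assumed ranking criterion: total tf-idf mass over the corpus.
    totals = Counter()
    for row in rows:
        totals.update(row)
    vocab = [g for g, _ in totals.most_common(top_k)]
    return vocab, [[row.get(g, 0.0) for g in vocab] for row in rows]


tweets = ["i love mondays", "i love sleep"]
bigrams = [word_ngrams(t.split(), 2) for t in tweets]
vocab, features = tfidf_features(bigrams, top_k=2)
```

The same function can be applied to character n-grams from `char_ngrams` or to sequences of POS tags, with `top_k=1000` matching the cut-off stated in the text.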
Semantic features: A major challenge when dealing with tweet data is that the lexicon used in a tweet is informal and differs greatly from tweet to tweet. The lexical and syntactic features do not seem to capture that property well. To handle this problem, we apply three approaches to compute tweet vector representations. Firstly, we employ 300-dimensional pre-trained word embeddings from GloVe (Pennington et al., 2014) to compute a tweet embedding as the average of the embeddings of words in the tweet.
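The averaging step can be sketched as below. The three-dimensional toy vectors stand in for the 300-dimensional GloVe embeddings, and skipping out-of-vocabulary words (with a zero vector when no word is known) is an assumption, since the text does not say how such words are handled.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        # Assumed fallback when no token is in the embedding vocabulary.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embeddings; real GloVe vectors would be loaded from a file.
toy_embeddings = {"love": [1.0, 0.0, 2.0], "mondays": [3.0, 2.0, 0.0]}
tweet_vector = average_embedding(["i", "love", "mondays"], toy_embeddings, dim=3)
```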
Secondly, we apply the latent semantic indexing (Papadimitriou et al., 1998) to capture the underlying semantics of the dataset. Here, each tweet is represented as a vector of 100 dimensions.
Thirdly, we also extract a tweet representation by applying the Brown clustering algorithm (Brown et al., 1992; Liang, 2005), a hierarchical clustering algorithm which groups words with similar meaning and syntactic function together. Applying the Brown clustering algorithm, we obtain a set of clusters, where each word belongs to only one cluster. For example, in Table 3, words that indicate the members of a family (e.g., "mum", "dad") or positive sentiment (e.g., "interesting", "awesome") are grouped into the same cluster. We run the algorithm with different numbers of clusters (i.e., 80, 100, 120) to capture multiple semantic and syntactic aspects. For each clustering setting, we use the number of tweet words in each cluster as a feature. After that, for each tweet, we concatenate the features from all the clustering settings to form a cluster-based tweet embedding.
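Given word-to-cluster assignments produced by a Brown clustering run, the cluster-based tweet embedding can be sketched as follows. The toy assignments, the integer cluster ids and the function names are assumptions made for illustration; the clustering itself is not reimplemented here.

```python
def cluster_counts(tokens, word_to_cluster, n_clusters):
    """Count how many tweet words fall into each cluster."""
    counts = [0] * n_clusters
    for tok in tokens:
        cluster_id = word_to_cluster.get(tok)
        if cluster_id is not None:
            counts[cluster_id] += 1
    return counts


def cluster_embedding(tokens, clusterings):
    """Concatenate the count vectors from every clustering setting."""
    features = []
    for word_to_cluster, n_clusters in clusterings:
        features.extend(cluster_counts(tokens, word_to_cluster, n_clusters))
    return features


# Two toy clustering settings with 2 and 3 clusters.
coarse = ({"mum": 0, "dad": 0, "awesome": 1}, 2)
fine = ({"mum": 0, "dad": 1, "awesome": 2}, 3)
tokens = ["my", "mum", "and", "dad", "are", "awesome"]
embedding = cluster_embedding(tokens, [coarse, fine])
```

Under this reading, the three settings of 80, 100 and 120 clusters would yield a vector of 300 counts per tweet.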
Polarity features: Motivated by verbal irony by means of polarity contrast, as in "I really love this year's summer; weeks and weeks of awful weather", we use the number of polarity signals appearing in a tweet as the polarity features. The signals include positive words (e.g., love), negative words (e.g., awful), positive emoji icons and negative emoji icons. We use the sentiment dictionaries provided by Hu and Liu (2004) to identify positive and negative words in a tweet. We further use boolean features that check whether or not a negation word is in a tweet (e.g., not, n't).
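A sketch of these polarity features is given below. The word sets are tiny hypothetical stand-ins for the Hu and Liu (2004) dictionaries, the emoji placeholder tokens are invented for the example, and treating any token ending in "n't" as a negation is an assumption.

```python
# Hypothetical miniature lexicons; the real dictionaries are much larger.
POSITIVE_WORDS = {"love", "awesome"}
NEGATIVE_WORDS = {"awful", "hate"}
NEGATION_WORDS = {"not", "n't"}
# Invented placeholder tokens for emoji replaced by describing text.
POSITIVE_EMOJI = {":smile:"}
NEGATIVE_EMOJI = {":angry:"}


def polarity_features(tokens):
    """Count polarity signals and flag the presence of a negation word."""
    lowered = [tok.lower() for tok in tokens]
    return {
        "n_positive_words": sum(tok in POSITIVE_WORDS for tok in lowered),
        "n_negative_words": sum(tok in NEGATIVE_WORDS for tok in lowered),
        "n_positive_emoji": sum(tok in POSITIVE_EMOJI for tok in lowered),
        "n_negative_emoji": sum(tok in NEGATIVE_EMOJI for tok in lowered),
        "has_negation": any(
            tok in NEGATION_WORDS or tok.endswith("n't") for tok in lowered
        ),
    }


summer = polarity_features(
    "i really love this year's summer weeks and weeks of awful weather".split()
)
```

In the example tweet, one positive and one negative word co-occur, which is the kind of polarity contrast these counts are meant to expose.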

Implementation details
We use TensorFlow (Abadi et al., 2015) to implement our model. Model parameters are learned to minimize the cross-entropy loss with L2 regularization. Figure 2 shows our training mechanism. In particular, we follow a 10-fold cross-validation based voting strategy. First, we split the training set into 10 folds. Each time, we combine 9 folds to train a classification model and use the remaining fold to find the optimal hyperparameters. Table 4 shows the optimal settings for each subtask.
In total, we have 10 classification models to produce 10 predicted labels for each test tweet. Then, we use the voting technique to return the final predicted label.
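The final voting step can be sketched as a simple majority vote over the per-model predictions. The function names are illustrative, and the tie-breaking behaviour (the label seen first among the tied ones wins) is an assumption, since the text does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the model predictions."""
    return Counter(labels).most_common(1)[0][0]


def vote_over_models(predictions_per_model):
    """predictions_per_model[m][i] is model m's label for test tweet i."""
    return [majority_vote(votes) for votes in zip(*predictions_per_model)]


# Three toy models voting on two test tweets.
final_labels = vote_over_models([[1, 0], [1, 1], [0, 1]])
```

With 10 models, `predictions_per_model` would hold 10 lists, one per fold-trained classifier.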

Metrics
The metrics used to evaluate our model include accuracy, precision, recall and F1. Accuracy is calculated using all classes in both subtasks. The remaining metrics are calculated using only the positive label in subtask 1, or per class label (i.e., macro-averaged) in subtask 2. A detailed description of the metrics can be found in Van Hee et al. (2018).
Table 5 shows our official results on the test set for subtask 1 with regard to the four metrics. By using a simple MLP neural network architecture, our system achieves a high performance, ranking third and fifth out of forty-four teams using the accuracy and F1 metrics, respectively. Table 6 presents our results on the test set for subtask 2. Our system also achieves a high performance here, ranking third and fifth out of thirty-two teams using the accuracy and F1 metrics, respectively. We also show in Table 7 the performance of our system on the different class labels. Among the ironic classes, our system achieves the best performance on verbal irony by means of a polarity contrast, with an F1 of 60.73%. Note that the performance on the situational class is not high. The reason is probably that the number of situational tweets in the training set is small (205/3,834), i.e. not enough to learn a good classifier.
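For reference, the four metrics can be computed as in the sketch below. It assumes that macro-averaging means taking the unweighted mean of the per-class precision, recall and F1 values; the official scorer described in Van Hee et al. (2018) remains the authoritative definition.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for a single class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def macro_precision_recall_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class scores (assumed macro definition)."""
    scores = [precision_recall_f1(y_true, y_pred, label) for label in labels]
    return tuple(sum(column) / len(labels) for column in zip(*scores))


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]
binary_scores = precision_recall_f1(y_true, y_pred, label=1)
macro_scores = macro_precision_recall_f1(y_true, y_pred, labels=[0, 1])
```

Subtask 1 corresponds to `precision_recall_f1` with the ironic class as `label`, while subtask 2 corresponds to `macro_precision_recall_f1` over the four classes.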

Discussions
Apart from the described MLP models, we have also tried other neural network models, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Convolutional Neural Network (CNN) for sentence classification (Kim, 2014). We found that LSTM achieves much higher performance than MLP does on the extended training set containing the ironic hashtags (about 92% vs 87% with 10-fold cross-validation using F1 on subtask 1). However, without the ironic hashtags, its performance is lower than MLP's. We also employed popular machine learning techniques, such as SVM (Hearst et al., 1998), Logistic Regression (Harrell, 2001) and Ridge Regression Classifier (Le Cessie and Van Houwelingen, 1992), but none of them produces results as good as MLP does. We have also implemented ensemble models, such as voting, bagging and stacking. We found that with the 10-fold cross-validation based voting strategy, our MLP models produce the best irony detection and classification results.

Conclusion
We have presented our NIHRIO system for participating in SemEval-2018 Task 3 on "Irony detection in English tweets". We proposed to use a Multilayer Perceptron to handle the task using various features, including lexical, syntactic, semantic and polarity features. Our system was ranked fifth with regard to F1 score in both the binary and multi-class irony detection subtasks.