THU_NGN at SemEval-2018 Task 3: Tweet Irony Detection with Densely connected LSTM and Multi-task Learning

Detecting irony is an important task for mining fine-grained information from social web messages. To this end, SemEval-2018 Task 3 aims to detect ironic tweets (subtask A) and their irony types (subtask B). To address this task, we propose a system based on a densely connected LSTM network with a multi-task learning strategy. In our dense LSTM model, each layer takes the outputs of all previous layers as input. The last LSTM layer outputs the hidden representations of texts, which are then used in three classification tasks. In addition, we incorporate several types of features to improve model performance. Our model achieved an F-score of 70.54 (ranked 2/43) in subtask A and 49.47 (ranked 3/29) in subtask B. The experimental results validate the effectiveness of our system.


Introduction
Figurative language such as irony is widely used in web messages such as tweets to convey sentiment in indirect ways. Identifying ironic texts helps to understand the social web better and has many applications such as sentiment analysis (Ghosh and Veale, 2016). Irony detection techniques are therefore important for improving the performance of sentiment analysis. For example, the tweet "Monday mornings are my fave :) #not" is ironic with negative sentiment, but it would probably be classified as positive by a standard sentiment analysis model (Van Hee et al., 2016b). Thus, capturing the ironic information in texts is useful for predicting sentiment more accurately (Van Hee et al., 2016a).
However, determining whether a text is ironic is challenging, since the differences between ironic and non-ironic texts are usually subtle. For example, the tweet "Love this weather #not" is ironic, but the similar tweet "Hate this weather #not happy" is non-ironic. Different approaches have been proposed to recognize the complex irony in texts. Existing methods are mainly based on rules or machine learning techniques (Joshi et al., 2017). Rule-based methods usually depend on lexicons to identify irony (Khattri et al., 2015; Maynard and Greenwood, 2014). However, these methods cannot utilize the contextual information in texts. Traditional machine learning methods such as SVM (Desai and Dave, 2016) are also effective in this task, but they usually need manual feature engineering (Barbieri et al., 2014). Recently, deep learning techniques have been successfully applied to this task. For example, Ghosh et al. (2016) propose a CNN-LSTM model to classify ironic and non-ironic tweets. Their method significantly improves classification performance without heavy feature engineering. However, existing methods are designed to detect irony in tweets with explicit irony-related hashtags. For example, tweets with #irony or #sarcasm hashtags are very likely to be ironic, so models may focus on these hashtags rather than the contextual information.
To fill this gap, SemEval-2018 Task 3 (Van Hee et al., 2018) aims to detect irony in tweets without explicit irony hashtags. Subtask A aims to determine whether a tweet is ironic; subtask B aims to identify the irony type of a tweet: verbal irony by means of a polarity contrast, other verbal irony, or situational irony. Several examples are as follows:
• Verbal irony by means of a polarity contrast: "I love waking up with migraines #not"
• Situational irony: "most of us didn't focus in the #ADHD lecture. #irony"
To address this problem, we propose a system based on a densely connected LSTM model (Wu et al., 2017) with multi-task learning techniques. In our model, each LSTM layer takes the outputs of all previous LSTM layers as input, so different levels of contextual information can be learned at the same time. Our model is required to predict in three tasks simultaneously: 1) identifying the missing irony-related hashtags; 2) classifying tweets as ironic or non-ironic; 3) classifying the irony types. With this multi-task learning strategy, the model can combine information from the different tasks to improve performance. The experimental results on both subtasks validate the effectiveness of our method.

Densely Connected LSTM with Multi-task Learning

The architecture of our densely connected LSTM model is shown in Figure 1. We denote this model as Dense-LSTM; the details are introduced in the following paragraphs. Our code is available at https://github.com/wuch15/SemEval-2018-task3-THU_NGN.git. In our model, the embedding layer converts the input tweets into sequences of dense vectors. The POS tag features P_i are one-hot encoded and concatenated with the word embedding vectors E_i. The affective words and creative language in tweets are usually important irony clues, and since these words usually have specific POS tags, adding these features helps our model capture the ironic information better. We use the tweetokenize tool to tokenize tweets and the Ark-Tweet-NLP tool to obtain their POS tags (Owoputi et al., 2013).
The first Bi-LSTM layer takes the sequential vectors as input. For the j-th Bi-LSTM layer, its output H_j is fed into all LSTM layers after it. As shown in Figure 1, the blue dashed lines represent such over-layer connections. All inputs of an LSTM layer are concatenated together, so the input of the j-th (j > 1) layer is [H_1; ...; H_{j-1}]. This means each layer can learn different levels of information at the same time. Since irony information is complex, jointly using all levels of information is beneficial for predicting irony more accurately. The last LSTM layer outputs the hidden representation H of texts, which is concatenated with the sentiment features and the sentence embedding features. The sentiment features provide additional sentiment information for detecting irony, such as the sentiment polarity assigned by lexicons. They are generated via the AffectiveTweets package in Weka provided by Mohammad and Bravo-Marquez (2017); we use the TweetToLexiconFeatureVector (Bravo-Marquez et al., 2014) and TweetToSentiStrengthFeatureVector (Thelwall et al., 2012) filters in this package. The embedding of a sentence is obtained by averaging the embeddings of all its words, using the 100-dim pre-trained embeddings provided by Bravo-Marquez et al. (2016). By incorporating this vector representation of the tweet, the irony information becomes easier to capture.
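The dense wiring described above can be sketched in a few lines of NumPy. This is a minimal illustration of the connectivity pattern only: `fake_bi_lstm` is a random-projection stand-in for a real Bi-LSTM, and the dimensions (700-dim concatenated word embeddings, 25 POS tags) are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def fake_bi_lstm(x, hidden_dim=200):
    # Stand-in for a real Bi-LSTM: maps (seq_len, in_dim) -> (seq_len, 2*hidden_dim).
    # A real implementation would use recurrent cells; a fixed random projection
    # suffices to illustrate the wiring between layers.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((x.shape[-1], 2 * hidden_dim)) * 0.01
    return np.tanh(x @ w)

def dense_lstm_stack(embeddings, pos_onehot, num_layers=4):
    # Input to the first layer: word embeddings concatenated with POS one-hots.
    x = np.concatenate([embeddings, pos_onehot], axis=-1)
    outputs = []  # H_1, ..., H_j: the output of every Bi-LSTM layer so far
    for j in range(num_layers):
        # Layer j (j > 1) reads the concatenation [H_1; ...; H_{j-1}],
        # so every layer sees all previous levels of contextual information.
        layer_in = x if j == 0 else np.concatenate(outputs, axis=-1)
        outputs.append(fake_bi_lstm(layer_in))
    return outputs[-1]  # hidden representation H from the last layer

seq = np.zeros((30, 700))  # 30 tokens, 400+300-dim concatenated embeddings
pos = np.zeros((30, 25))   # one-hot POS tags (25 tags assumed)
H = dense_lstm_stack(seq, pos)
print(H.shape)             # (30, 400): 2 * 200-dim bidirectional hidden states
```

In a real system, H would then be concatenated with the sentiment and sentence-embedding features before the dense prediction layers.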
Three dense layers with ReLU activation are used to predict for three different tasks: determining the missing ironic hashtag (i.e. #not, #sarcasm, #irony, or none of them) (task 1); identifying ironic or non-ironic (task 2); and identifying the irony type (task 3). Thus, the objective function of our model can be formulated as:

L = α_1 L_1 + α_2 L_2 + α_3 L_3,

where L_i and α_i denote the loss function of task i and its weight. L_1 and L_2 are the categorical and binary cross-entropy respectively. In addition, the numbers of tweets with different irony types are very unbalanced. Motivated by the cost-sensitive cross-entropy used by Santos et al. (2009), we formulate L_3 as follows:

L_3 = -(1/N) Σ_{i=1}^{N} w_{y_i} log(ŷ_{i,y_i}),

where N is the number of tweets, y_i is the irony type of the i-th tweet, ŷ_{i,y_i} is the prediction score for that type, and w_{y_i} is the loss weight of irony type label y_i, defined as w_j = N / (C · N_j), where C is the number of irony types and N_j is the number of tweets with irony type label j. Thus, the infrequent irony types gain relatively larger loss weights. With this multi-task learning method, our model can incorporate different types of information such as the irony hashtags. In addition, classifying ironic/non-ironic and classifying the irony types are similar tasks, so the performance of both can be improved by sharing information between them.
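The cost-sensitive weighting can be sketched as follows. Note that the exact definition of w_j was elided in the original text; w_j = N / (C · N_j) is a standard inverse-frequency form consistent with the statement that infrequent types receive larger weights, so treat it as a plausible reading rather than the paper's exact formula.

```python
import numpy as np

def irony_type_loss_weights(labels):
    # w_j = N / (C * N_j): N total tweets, C irony types, N_j tweets of type j.
    # Rarer irony types receive proportionally larger loss weights.
    labels = np.asarray(labels)
    N = len(labels)
    classes, counts = np.unique(labels, return_counts=True)
    C = len(classes)
    return {int(c): N / (C * n) for c, n in zip(classes, counts)}

def weighted_cross_entropy(y_true, y_pred, weights):
    # L3 = -(1/N) * sum_i w_{y_i} * log(yhat_{i, y_i})
    losses = [-weights[y] * np.log(p[y]) for y, p in zip(y_true, y_pred)]
    return float(np.mean(losses))

y = [0, 0, 0, 1, 2]              # toy imbalanced irony-type labels
w = irony_type_loss_weights(y)
print(w)                         # {0: 0.555..., 1: 1.666..., 2: 1.666...}
```

The majority class (label 0) gets a weight below 1 while the two rare classes get weights above 1, which counteracts the imbalance during training.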
To further improve the performance of our system, we use an ensemble strategy that averages the classification results predicted by 10 models, each trained with a random dropout rate. In this way the classification results are voted on by different models, which improves performance.
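Averaging the ensemble's predictions can be sketched as below. The three lambda "models" are toy stand-ins for the 10 networks trained with random dropout rates; `ensemble_predict` is an illustrative helper, not a function from the paper's codebase.

```python
import numpy as np

def ensemble_predict(models, x):
    # Average the class-probability outputs of all models, then take the argmax.
    # `models` is any sequence of callables mapping inputs to probabilities.
    probs = np.mean([m(x) for m in models], axis=0)
    return probs.argmax(axis=-1)

# Toy models: each returns probabilities for one example over two classes.
m1 = lambda x: np.array([[0.6, 0.4]])
m2 = lambda x: np.array([[0.4, 0.6]])
m3 = lambda x: np.array([[0.3, 0.7]])
print(ensemble_predict([m1, m2, m3], None))  # [1]: averaged probabilities favor class 1
```

Averaging probabilities before the argmax (soft voting) lets a confident minority outweigh a lukewarm majority, which is typically more stable than majority voting on hard labels.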

Dataset and Experimental Settings
The detailed statistics of the dataset in this task are shown in Table 1. V-irony, O-irony and S-irony represent the three irony types respectively: verbal irony by means of a polarity contrast, other types of verbal irony, and situational irony (Van Hee et al., 2018). In subtask A, system performance is evaluated by the F-score for the positive class. In subtask B, the macro-averaged F-score over all classes is used as the metric. We combine two pre-trained word embeddings: 1) the embeddings provided by Godin et al. (2015), which are trained on a corpus of 400 million tweets; 2) the embeddings provided by Barbieri et al. (2016), which are trained on 20 million tweets. Their dimensions are 400 and 300 respectively, and they are concatenated together as the word embeddings.
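Concatenating the two embedding tables per word can be sketched as follows. The dictionary lookup and the zero-vector fallback for out-of-vocabulary words are illustrative assumptions; the paper does not specify its OOV handling.

```python
import numpy as np

def concat_embeddings(word, emb1, emb2, dim1=400, dim2=300):
    # Look up the word in both pre-trained tables; fall back to zeros
    # for words missing from a table (assumed OOV handling).
    v1 = emb1.get(word, np.zeros(dim1))
    v2 = emb2.get(word, np.zeros(dim2))
    return np.concatenate([v1, v2])  # 700-dim joint representation

# Toy stand-ins for the Godin et al. (400-dim) and Barbieri et al. (300-dim) tables.
emb1 = {"irony": np.ones(400)}
emb2 = {"irony": np.ones(300)}
print(concat_embeddings("irony", emb1, emb2).shape)  # (700,)
```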
In our network, the Dense-LSTM model has 4 Bi-LSTM layers with 200-dim hidden states. The hidden dimensions of the dense layers are set to 300. The dropout rate of each layer is set to a random number between 0.2 and 0.4 when using the ensemble strategy, and to a fixed value of 0.3 in the comparative experiments without it. In subtask A, the loss weights α of the three tasks are set to 0.5, 1 and 0.5 respectively; in subtask B, they are 0.5, 0.5 and 1. We use RMSProp as the optimizer, and the batch size is set to 64. We use 10% of the training data for validation to select the hyperparameters above.
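For reference, the hyperparameters above can be collected in one place. The values are taken directly from the text; the dictionary structure itself is just an illustrative way to organize them.

```python
# Hyperparameters as reported in the paper, gathered into a config dict.
CONFIG = {
    "lstm_layers": 4,
    "lstm_hidden_dim": 200,
    "dense_hidden_dim": 300,
    "dropout_range": (0.2, 0.4),   # random per ensemble member; fixed 0.3 without ensemble
    "optimizer": "rmsprop",
    "batch_size": 64,
    "validation_split": 0.1,
    # Task loss weights (alpha_1, alpha_2, alpha_3) differ per subtask.
    "loss_weights": {"subtask_A": (0.5, 1.0, 0.5), "subtask_B": (0.5, 0.5, 1.0)},
}
print(CONFIG["loss_weights"]["subtask_A"])  # (0.5, 1.0, 0.5)
```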

Performance Evaluation
We compare the performance of different methods: 1) SVM, the benchmark system using an SVM with a BOW model; 2) CNN, using a CNN with a global average pooling layer to obtain the hidden vector h, which is used to predict in the three tasks; 3) LSTM, using one Bi-LSTM layer to obtain h; 4) 2-layer LSTM, using 2 Bi-LSTM layers; 5) Dense-LSTM, our Dense-LSTM model; 6) Dense-LSTM+ens, our Dense-LSTM model with the ensemble strategy. We apply the multi-task learning technique to all models except the SVM benchmark. The results are shown in Table 2. They show that our Dense-LSTM model significantly outperforms the baselines. Since the layers in our Dense-LSTM learn from all previous outputs, the model can combine different levels of contextual information to capture high-level irony clues. In addition, our model predicts more accurately via the ensemble: since models with random dropout extract different information, we can take advantage of all of them by voting. The ensemble strategy can also reduce the noise in the dataset and make our system more stable (Xia et al., 2011).

Effectiveness of Multi-task Learning
The performance of our Dense-LSTM model using different combinations of training tasks is shown in Table 3. Note that we do not apply model ensemble here. Compared with the models trained on task 2 or task 3 only, combining both tasks improves performance. This may be because the two tasks have inherent relatedness and can share rich mutual information. Learning to predict the missing ironic hashtags (task 1) also improves performance: since the ironic hashtags are often important irony clues, identifying them helps our model mine ironic information better.

Influence of Pre-trained Word Embedding
We compare the performance of our model using different combinations of pre-trained embeddings. The results are illustrated in Table 4, where emb1 and emb2 denote the embeddings provided by Godin et al. (2015) and Barbieri et al. (2016) respectively. The results show that the pre-trained embeddings are important for capturing irony information, and that combining both embeddings achieves the best performance.

Influence of Additional Features
The influence of different features on our model is shown in Table 5. According to this table, all features improve the classification performance in both subtasks, and the combination of the three features achieves better performance. The improvement brought by the POS tag features is the most significant. Affective words are important irony clues, and they are usually verbs, adjectives or hashtags; since these words usually have specific POS tags, incorporating the POS tag features helps to identify them and capture the ironic information better. The sentiment features also improve our model. The sentiment polarity of ironic tweets is usually negative, but these texts often contain positive sentiment words. Since our sentiment features are obtained from several different sentiment and emotion lexicons, they can assign sentiment scores to texts, which provides rich information for detecting irony. The sentence embedding can also slightly improve the performance: it contains information about each word in the sentence, so it helps to capture word information and identify the overall sentiment of texts. The combination of all three types of features takes advantage of all of them and gains a significant performance improvement, which validates the effectiveness of each type of feature.

Conclusion
Detecting irony in web texts is an important task for mining fine-grained sentiment information. To address this problem, we developed a system based on a densely connected LSTM model to participate in SemEval-2018 Task 3. In our model, every LSTM layer takes the outputs of all previous layers as input, so different levels of information can be learned at the same time.
In addition, we propose to train our model jointly on three different tasks: identifying the missing irony hashtags, determining ironic or non-ironic, and classifying the irony types. These tasks have inherent relatedness, so performance can be improved by sharing mutual information among them. Our system achieved F-scores of 70.54 and 49.47, ranking 2nd and 3rd in the two subtasks respectively. The experimental results validate the effectiveness of our method.