Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model

Sarcasm is a subtle form of language in which people express the opposite of what is implied. Previous work on sarcasm detection focused on text. However, more and more social media platforms like Twitter allow users to create multi-modal messages, including texts, images, and videos. It is insufficient to detect sarcasm from multi-modal messages based only on texts. In this paper, we focus on multi-modal sarcasm detection for tweets consisting of texts and images in Twitter. We treat text features, image features and image attributes as three modalities and propose a multi-modal hierarchical fusion model to address this task. Our model first extracts image features and attribute features, and then leverages attribute features and a bidirectional LSTM network to extract text features. Features of the three modalities are then reconstructed and fused into one feature vector for prediction. We create a multi-modal sarcasm detection dataset based on Twitter. Evaluation results on the dataset demonstrate the efficacy of our proposed model and the usefulness of the three modalities.


Introduction
Merriam Webster defines sarcasm as "a mode of satirical wit depending for its effect on bitter, caustic, and often ironic language that is usually directed against an individual". It has the magical power to disguise the hostility of the speaker (Dews and Winner, 1995) while enhancing the effect of mockery or humor on the listener. Sarcasm is prevalent on today's social media platforms, and its automatic detection bears great significance in customer service, opinion mining, online harassment detection and all sorts of tasks that require knowledge of people's real sentiment.
Twitter has become a focus of sarcasm detection research due to its ample resources of publicly available sarcastic posts. Previous works on Twitter sarcasm detection focus on the text modality and propose many supervised approaches, including conventional machine learning methods with lexical features (Bouazizi and Ohtsuki, 2015; Ptáček et al., 2014), and deep learning methods (Wu et al., 2018; Baziotis et al., 2018).
However, a detector that sees only the text modality can never be certain of the true intention of the simple tweet "What a wonderful weather!" until the dark clouds in the attached picture (Figure 1(a)) are seen. Images, which are ubiquitous on social platforms, can help reveal (Figure 1(a)), affirm (Figure 1(b)) or disprove the sarcastic nature of tweets, and are thus intuitively crucial to Twitter sarcasm detection tasks.
In this work, we propose a multi-modal hierarchical fusion model for detecting sarcasm in Twitter. We leverage three types of features, namely text, image and image attribute features, and fuse them in a novel way. During early fusion, the attribute features are used to initialize a bi-directional LSTM network (Bi-LSTM), which is then used to extract the text features. The three features then undergo representation fusion, where they are transformed into reconstructed representation vectors. A modality fusion layer computes a weighted average of the vectors and passes the result to a classification layer to yield the final result. Our results show that all three types of features contribute to the model performance. Furthermore, our fusion strategy successfully refines the representation of each modality and is significantly more effective than simply concatenating the three types of features.
Figure 1: Examples of image modality aiding sarcasm detection. (a) "What a wonderful weather!" The image is necessary for the sarcasm to be spotted due to the contradiction of dark clouds in the image and "wonderful weather" in the text; (b) "Yep, totally normal <user>. Nothing is off about this. Nothing at all. #itstoohotalready #climatechangeisreal" The image affirms the sarcastic nature of the tweet by showing the weather is actually very "hot" and is not at all "totally normal".
Our main contributions are summarized as follows:
• We propose a novel hierarchical fusion model to address the challenging multi-modal sarcasm detection task in Twitter. To the best of our knowledge, we are the first to deeply fuse the three modalities of image, attribute and text, rather than naïve concatenation, for Twitter sarcasm detection.
• We create a new dataset for multi-modal Twitter sarcasm detection and release it.
• We quantitatively show the significance of each modality in Twitter sarcasm detection. We further show that to fully unleash the potential of images, we need to consider image attributes, a form of high-level abstract information that bridges the gap between texts and images.
Related Works

Sarcasm Detection
Various methods have been proposed for sarcasm detection from texts. Earlier methods extract carefully engineered discrete features from texts (Davidov et al., 2010; Riloff et al., 2013; Ptáček et al., 2014; Bouazizi and Ohtsuki, 2015), including n-grams, word sentiment, punctuation, emoticons, part-of-speech tags, etc. More recently, researchers leverage the powerful techniques of deep learning to get more precise semantic representations of tweet texts (e.g., Ghosh and Veale, 2016).
However, little has been revealed so far on how to effectively combine textual and visual information to boost the performance of Twitter sarcasm detection. Schifanella et al. (2016) simply concatenate manually designed features or deep learning based features of texts and images to make predictions with two modalities. Different from this work, we propose a hierarchical fusion model to deeply fuse three modalities.

Other Multi-Modal Tasks
Sentiment analysis is a task related to sarcasm detection. Much research on multi-modal sentiment analysis deals with video data (Wang et al., 2016; Zadeh et al., 2017), where text, image and audio data can usually be aligned and support each other. Though the inputs are different, their fusion mechanisms can be inspiring to our task. Poria, Cambria, and Gelbukh (2015) use multiple kernel learning to fuse different modalities. Zadeh et al. (2017) build their fusion layer by outer product instead of simple concatenation in order to get more features. Gu et al. (2018b) align text and audio at word level and apply several attention mechanisms. Gu et al. (2018a) first introduce a modality fusion structure attempting to reveal the actual importance of multiple modalities, but their methods are quite different from our hierarchical fusion techniques.
Inspiration can also be drawn from other multi-modal tasks, such as visual question answering (VQA), where an image and a query sentence are provided as model inputs. A question-guided attention mechanism is proposed in VQA tasks (Chen et al., 2015) and can boost model performance compared to models using global image features. An attribute prediction layer is introduced (Wu et al., 2016) as a way to incorporate high-level concepts into the CNN-LSTM framework. Wang et al. (2017) exploit a handful of off-the-shelf algorithms, gluing them with a co-attention model, and achieve generalizability as well as scalability. Yang et al. (2014) try image emotion extraction tasks with image comments and propose a model to bridge images and comment information by learning Bernoulli parameters.

Proposed Hierarchical Fusion Model
Figure 2 shows the architecture of our proposed hierarchical fusion model. In this work, we treat text, image and image attribute as three modalities. The image attribute modality has been shown to boost model performance by adding high-level concepts of the image content (Wu et al., 2016). Modality fusion techniques are proposed to make full use of the three modalities. In the following paragraphs, we first define raw vectors and guidance vectors, and then briefly introduce our hierarchical fusion techniques.
For the image modality, we use a pre-trained and fine-tuned ResNet model to obtain 14 × 14 regional vectors of the tweet image, which are defined as the raw image vectors, and average them to get our image guidance vector. For the (image) attribute modality, we use another pre-trained and fine-tuned ResNet model to predict 5 attributes for each image, the GloVe embeddings of which are considered as the raw attribute vectors. Our attribute guidance vector is a weighted average of the raw attribute vectors. We use a Bi-LSTM to obtain our text vectors. The raw text vectors are the concatenated forward and backward hidden states for each time step of the Bi-LSTM, while the text guidance vector is the average of the above raw vectors.
In the belief that the attached image could aid the model's understanding of the tweet text, we apply non-linear transformations on the attribute guidance vector and feed the result to the Bi-LSTM as its initial hidden state. This process is named early fusion. In order to utilize multi-modal information to refine representations of all modalities, representation fusion is proposed, in which feature vectors of the three modalities are reconstructed using raw vectors and guidance vectors. The refined vectors of the three modalities are combined into one vector with a weighted average instead of simple concatenation in the process of modality fusion. Lastly, the fused vector is passed into a two-layer fully-connected neural network to obtain the classification result. More details of our model are provided below.

Image Feature Representation
We use ResNet-50 V2 (He et al., 2016) to obtain representations of tweet images. We remove the last fully-connected (FC) layer of the pre-trained model and replace it with a new one for the sake of model fine-tuning. Following Wang et al. (2017), an input image I is re-sized to 448 × 448 and divided into 14 × 14 regions. Each region $I_i$ (i = 1, 2, . . . , 196) is then sent through the ResNet model to obtain a regional feature representation $v_{\mathrm{region}_i}$, the raw image vector of that region.
As described before, the image guidance vector $v_{\mathrm{image}}$ is the average of all regional image vectors:
$$v_{\mathrm{image}} = \frac{1}{N_r} \sum_{i=1}^{N_r} v_{\mathrm{region}_i}$$
where $N_r$ is the number of regions and is 196 in this work.
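As a minimal sketch of the region split and the averaging step, the following pure-Python example computes the 14 × 14 region boxes of a 448 × 448 image and the element-wise mean of a set of regional vectors. The helper names `region_boxes` and `guidance_vector` are hypothetical, and the two-dimensional toy vectors merely stand in for real ResNet outputs.

```python
def region_boxes(size=448, grid=14):
    """Return (left, top, right, bottom) pixel boxes for a grid x grid split."""
    step = size // grid  # 448 // 14 = 32 pixels per region side
    return [(col * step, row * step, (col + 1) * step, (row + 1) * step)
            for row in range(grid) for col in range(grid)]


def guidance_vector(raw_vectors):
    """Element-wise mean of equally sized raw vectors."""
    count = len(raw_vectors)
    dim = len(raw_vectors[0])
    return [sum(vec[d] for vec in raw_vectors) / count for d in range(dim)]


boxes = region_boxes()
# Toy stand-in for ResNet outputs: one 2-dimensional vector per region.
toy_regional = [[float(i), 1.0] for i in range(len(boxes))]
v_image = guidance_vector(toy_regional)
```

The same averaging helper applies to any modality whose guidance vector is a plain mean of its raw vectors.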

Attribute Feature Representation
Previous work (Wu et al., 2016) in image captioning and visual question answering introduces attributes as high-level concepts of images. In their work, single-label and multi-label losses are proposed to train the attribute prediction CNN, whose parameters are transferred to generate the final image representation. While they use parameter sharing for better image representation with attribute-labeling tasks, we take a more explicit approach.
We treat attributes as an extra modality bridging the tweet text and image, by directly using the word embeddings of five predicted attributes of each tweet image as the raw attribute vectors.
We first train an attribute predictor with ResNet-101 and COCO image captioning dataset (Lin et al., 2014). We build the multi-label dataset by extracting 1000 attributes from sentences of the COCO dataset. We use a ResNet model pretrained on ImageNet (Russakovsky et al., 2015) and fine-tune it on the multi-label dataset. Then the attribute predictor is used to predict five attributes a i (i = 1, . . . , 5) for each image.
We generate the attribute guidance vector by weighted average. Raw attribute vectors $e(a_i)$ are passed through a two-layer neural network to obtain the attention weights $\alpha_i$, which are then used to construct the attribute guidance vector $v_{\mathrm{attr}}$ as a weighted average of the raw attribute vectors. Here $a_i$ is the i-th image attribute, literally a word out of a vocabulary of 1000; $e$ is the GloVe embedding operation; $W_1$ and $W_2$ are the weight matrices and $b_1$ and $b_2$ the biases of the two-layer network; $N_a$ is the number of attributes, and is 5 in our settings.
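The attention-weighted average described above can be sketched as follows. This is an illustration under stated assumptions, not the exact formulation: the tanh hidden activation and the softmax normalization are assumed, the function name `attribute_guidance` is hypothetical, and the 4-dimensional toy embeddings stand in for real GloVe vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_guidance(embeddings, W1, b1, W2, b2):
    """Return (attention weights, weighted average of the raw attribute vectors)."""
    scores = []
    for emb in embeddings:
        # Hidden layer; tanh is an assumed activation, not given in the text.
        hidden = [math.tanh(sum(w * x for w, x in zip(row, emb)) + b)
                  for row, b in zip(W1, b1)]
        # Output layer: one unnormalized weight per attribute.
        scores.append(sum(w * h for w, h in zip(W2, hidden)) + b2)
    alpha = softmax(scores)  # assumed normalization
    dim = len(embeddings[0])
    v_attr = [sum(a * emb[d] for a, emb in zip(alpha, embeddings)) for d in range(dim)]
    return alpha, v_attr


# Five toy attribute embeddings (N_a = 5), each of dimension 4.
toy_embeddings = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0, 1.0]]
W1 = [[0.5, -0.5, 0.0, 0.0], [0.0, 0.0, 0.5, -0.5]]  # hidden size is half the input size
b1 = [0.0, 0.0]
W2 = [1.0, 1.0]
b2 = 0.0
alpha, v_attr = attribute_guidance(toy_embeddings, W1, b1, W2, b2)
```

With all-zero weights the scores are equal, so the guidance vector reduces to the plain mean of the raw attribute vectors.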

Text Feature Representation
A bidirectional LSTM (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) is used to obtain the representation of the tweet text. In the operations performed by the LSTM at time step t, $x_t$ and $h_t$ are the input state and hidden state, respectively; σ is the sigmoid function; ⊙ denotes element-wise product. The text guidance vector is the arithmetic average of the hidden states over all time steps:
$$v_{\mathrm{text}} = \frac{1}{L} \sum_{t=1}^{L} h_t$$
where L is the length of the tweet text.
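For illustration, the sketch below runs a textbook LSTM cell in both directions, concatenates the forward and backward hidden states at each time step to form the raw text vectors, and averages them into the text guidance vector. The gate parameterization (a single weight matrix over the concatenation of the input and the previous hidden state) is the standard formulation and is an assumption here, as are all function and key names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(params, name, z, activation):
    """Apply one gate: activation(W z + b), with (W, b) stored under params[name]."""
    W, b = params[name]
    return [activation(sum(w * zi for w, zi in zip(row, z)) + bi)
            for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One standard LSTM step on the concatenation [x_t; h_prev]."""
    z = x_t + h_prev  # list concatenation
    i = gate(params, "input", z, sigmoid)
    f = gate(params, "forget", z, sigmoid)
    o = gate(params, "output", z, sigmoid)
    g = gate(params, "candidate", z, math.tanh)
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def run_lstm(inputs, h0, c0, params):
    """Return the hidden state at every time step."""
    h, c, states = h0, c0, []
    for x_t in inputs:
        h, c = lstm_step(x_t, h, c, params)
        states.append(h)
    return states


def text_vectors(inputs, fwd_params, bwd_params, init_fwd, init_bwd):
    """Raw text vectors [h_forward; h_backward] per step, plus their average."""
    fwd = run_lstm(inputs, init_fwd[0], init_fwd[1], fwd_params)
    bwd = run_lstm(inputs[::-1], init_bwd[0], init_bwd[1], bwd_params)[::-1]
    raw = [hf + hb for hf, hb in zip(fwd, bwd)]
    length = len(raw)
    v_text = [sum(r[d] for r in raw) / length for d in range(len(raw[0]))]
    return raw, v_text


def zero_params(input_size, hidden_size):
    """Toy all-zero parameters, used only to keep the example easy to follow."""
    width = input_size + hidden_size
    return {name: ([[0.0] * width for _ in range(hidden_size)], [0.0] * hidden_size)
            for name in ("input", "forget", "output", "candidate")}


toy_inputs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
params = zero_params(input_size=2, hidden_size=1)
raw_text, v_text = text_vectors(toy_inputs, params, params,
                                init_fwd=([0.0], [1.0]), init_bwd=([0.0], [0.0]))
```

The `init_fwd` and `init_bwd` arguments are the place where non-zero initial states can be supplied instead of the usual zeros.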

Early Fusion
The Bi-LSTM initial states are usually set to zeroes in text classification tasks, but they are a potential spot where multi-modal information could be infused to promote the model's comprehension of the text. In the proposed model, we apply the non-linearly transformed attribute guidance vector as the initial state of the Bi-LSTM:
$$[h_{f0}; c_{f0}; h_{b0}; c_{b0}] = \mathrm{ReLU}(W v_{\mathrm{attr}} + b)$$
where $h_{f0}$, $c_{f0}$ are the forward LSTM initial states and $h_{b0}$, $c_{b0}$ are the backward LSTM initial states; [; ] is vector concatenation; ReLU denotes element-wise application of the Rectified Linear Unit activation function; W and b are the weight matrix and bias.
We also try to use image guidance vector for early fusion, in which the LSTM initial states are obtained with means similar to the one described above, but it does not perform very well, as will be discussed in the experiments.
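The early fusion step can be sketched as a single ReLU-activated affine map whose output is split into the four initial states. The function name `early_fusion_states`, the equal-sized split and the toy weights are assumptions made for illustration only.

```python
def early_fusion_states(v_guidance, W, b, hidden_size):
    """Map a guidance vector to ((h_f0, c_f0), (h_b0, c_b0)) via ReLU(W v + b)."""
    out = [max(0.0, sum(w * x for w, x in zip(row, v_guidance)) + bias)
           for row, bias in zip(W, b)]
    if len(out) != 4 * hidden_size:
        raise ValueError("W must have 4 * hidden_size rows")
    # Split the concatenated output into four equally sized chunks.
    h_f0, c_f0, h_b0, c_b0 = (out[k * hidden_size:(k + 1) * hidden_size]
                              for k in range(4))
    return (h_f0, c_f0), (h_b0, c_b0)


# Toy example: 2-dimensional guidance vector, LSTM hidden size 1.
toy_v_attr = [1.0, -2.0]
toy_W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_b = [0.0, 0.0, 0.0, 2.0]
forward_init, backward_init = early_fusion_states(toy_v_attr, toy_W, toy_b, hidden_size=1)
```

Passing the image guidance vector instead of the attribute guidance vector gives the image-based variant mentioned above.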

Representation Fusion
Inspired by the attention mechanism in VQA tasks, representation fusion aims at reconstructing the feature vectors $v_{\mathrm{image}}$, $v_{\mathrm{text}}$, $v_{\mathrm{attr}}$ with the help of low-level raw vectors (namely, the hidden states $\{h_t\}$ over all time steps for the text modality, the 196 regional vectors for the image modality, and the five attribute embeddings for the attribute modality) and high-level guidance vectors from different modalities.
We denote $X_m^{(i)}$ as the i-th raw vector from modality m (which may be text, image or attribute). The key in this stage is to calculate the weight for each $X_m^{(i)}$.
To leverage as much information as possible and more accurately model the relationship between multiple modalities, we exploit information from all three modalities, more explicitly the guidance vectors $v_n$ where n could be text, image or attribute, when calculating the weights of raw vectors in each modality. For the i-th raw vector of each modality m, we calculate three guided weights $\alpha_{mn}^{(i)}$ from the guidance vectors of different modalities n. The final reconstruction weight for the raw vector is the average of the normalized guided weights. Here m, n ∈ {text, image, attr} denote modalities.
After representation fusion, $v_{\mathrm{image}}$, $v_{\mathrm{text}}$, $v_{\mathrm{attr}}$, previously denoted as guidance vectors, are now considered feature vectors of each modality and ready to serve as inputs of the next layer.
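The weighting scheme above can be illustrated with the following sketch for a single modality m. It assumes that each guided weight comes from a two-layer tanh scorer applied to the concatenation of a raw vector and a guidance vector, that normalization is a softmax over the raw vectors, and that the reconstructed vector is the weighted sum of the raw vectors; the names `guided_scores` and `reconstruct` are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_scores(raw_vectors, guide, scorer):
    """Unnormalized guided weights of every raw vector under one guidance vector."""
    W1, b1, W2, b2 = scorer
    scores = []
    for x in raw_vectors:
        z = x + guide  # concatenation of raw vector and guidance vector
        hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
                  for row, b in zip(W1, b1)]  # assumed tanh hidden layer
        scores.append(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return scores


def reconstruct(raw_vectors, guides, scorers):
    """Average the normalized guided weights, then take the weighted sum."""
    weights = [0.0] * len(raw_vectors)
    for n, guide in guides.items():
        normalized = softmax(guided_scores(raw_vectors, guide, scorers[n]))
        weights = [w + a / len(guides) for w, a in zip(weights, normalized)]
    dim = len(raw_vectors[0])
    v_m = [sum(w * x[d] for w, x in zip(weights, raw_vectors)) for d in range(dim)]
    return weights, v_m


# Toy raw vectors of one modality and toy guidance vectors of all three.
raw_attr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
guides = {"text": [0.5, 0.5], "image": [1.0, -1.0, 0.0], "attr": [0.2, 0.8]}
scorers = {
    "text": ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0], [1.0, -1.0], 0.0),
    "image": ([[0.0] * 5, [0.0] * 5], [0.0, 0.0], [0.0, 0.0], 0.0),
    "attr": ([[0.0] * 4, [0.0] * 4], [0.0, 0.0], [0.0, 0.0], 0.0),
}
weights, v_attr_refined = reconstruct(raw_attr, guides, scorers)
```

Because every guided weight vector is normalized before averaging, the final reconstruction weights of a modality also sum to one.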

Modality Fusion
Instead of simply concatenating the feature vectors from different modalities to form a longer vector, we perform modality fusion motivated by the work of Gu et al. (2018a). The feature vector for each modality m, denoted as $v_m$, is first transformed into a fixed-length form $v'_m$. A two-layer feed-forward neural network is implemented to calculate the attention weight for each modality m, which is then used in the weighted average of the transformed feature vectors $v'_m$. The result is a single, fixed-length vector $v_{\mathrm{fused}}$.
Here m is one of the three modalities and $\tilde{\alpha}$ is a vector containing $\tilde{\alpha}_m$; $W_{m1}$, $W_{m2}$, $W_{m3}$ are weight matrices; $b_{m1}$, $b_{m2}$, $b_{m3}$ are biases; $v_m$ represents the reconstructed feature vectors from the representation fusion process.
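A minimal sketch of modality fusion follows. It assumes tanh activations in both the scorer and the fixed-length transform and a softmax over the three modality scores; these choices, the dictionary keys and the helper `make_params` are illustrative assumptions rather than the exact formulation.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, x, b):
    """Return W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def modality_fusion(features, params):
    """Return (attention weight per modality, fused fixed-length vector)."""
    names = list(features)
    scores, transformed = [], []
    for m in names:
        p = params[m]
        hidden = [math.tanh(h) for h in affine(p["W1"], features[m], p["b1"])]
        scores.append(affine(p["W2"], hidden, p["b2"])[0])  # one score per modality
        # Fixed-length transform of the feature vector (tanh is assumed).
        transformed.append([math.tanh(t) for t in affine(p["W3"], features[m], p["b3"])])
    alpha = softmax(scores)
    size = len(transformed[0])
    v_fused = [sum(a * t[d] for a, t in zip(alpha, transformed)) for d in range(size)]
    return dict(zip(names, alpha)), v_fused


def make_params(in_size, out_size, fill):
    """Toy constant parameters; the hidden size is half the input size."""
    hidden = in_size // 2
    return {"W1": [[fill] * in_size for _ in range(hidden)], "b1": [0.0] * hidden,
            "W2": [[1.0] * hidden], "b2": [0.0],
            "W3": [[fill] * in_size for _ in range(out_size)], "b3": [0.0] * out_size}


features = {"text": [1.0, 2.0], "image": [0.5, -0.5, 1.0, 0.0], "attr": [0.0, 1.0]}
params = {m: make_params(len(v), out_size=2, fill=0.1) for m, v in features.items()}
alpha_by_modality, v_fused = modality_fusion(features, params)
```

Unlike concatenation, the output length does not grow with the number of modalities, and the weights show how much each modality contributes to a given prediction.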

Classification Layer
We use a two-layer fully-connected neural network as our classification layer. The activation functions of the hidden layer and the output layer are element-wise ReLU and sigmoid functions, respectively. The loss function is cross entropy.
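The classification layer and its loss can be sketched as below, assuming a single sigmoid output unit that gives the probability of the sarcastic class and the binary form of cross entropy. The function names and toy weights are hypothetical.

```python
import math


def classify(v_fused, W_hidden, b_hidden, w_out, b_out):
    """Two-layer classifier: ReLU hidden layer, sigmoid output probability."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v_fused)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


def cross_entropy(prob, label, eps=1e-12):
    """Binary cross entropy for one example; label is 1 (sarcastic) or 0."""
    prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


prob = classify([1.0, -1.0], W_hidden=[[1.0, 0.0]], b_hidden=[0.0],
                w_out=[2.0], b_out=-2.0)
loss = cross_entropy(prob, label=1)
```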

Dataset and Preprocessing
There is no publicly available dataset for evaluating the multi-modal sarcasm detection task, and thus we build our own dataset, which will be released later. We collect and preprocess our data similarly to Schifanella et al. (2016). We collect English tweets containing a picture and some special hashtag (e.g., #sarcasm, etc.) as positive examples (i.e., sarcastic) and collect English tweets with images but without such hashtags as negative examples (i.e., not sarcastic). We further clean up the data as follows. First, we discard tweets containing sarcasm, sarcastic, irony, ironic as regular words. We also discard tweets containing URLs in order to avoid introducing additional information. Furthermore, we discard tweets with words that frequently co-occur with sarcastic tweets and thus may express sarcasm, for instance jokes, humor and exgag. We divide the data into training set, development set and test set with a ratio of 80%:10%:10%. In order to evaluate models more accurately, we manually check the development set and the test set to ensure the accuracy of the labels. The statistics of our final dataset are listed in Table 1.
For preprocessing, we first replace mentions with a certain symbol <user>. We then separate words, emoticons and hashtags with the NLTK toolkit. We also separate the hashtag sign # from hashtags and replace capitals with their lowercases. Finally, words appearing only once in the training set and words not appearing in the training set but appearing in the development set or test
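Part of the cleaning and preprocessing described above can be sketched with the standard library alone. The regular expressions below are a simplified stand-in for the NLTK tokenizer, not the tooling actually used, and the names `keep_tweet` and `preprocess` are hypothetical; the sketch covers only URL filtering, the regular-word filter, mention replacement, hashtag splitting and lowercasing.

```python
import re

SARCASM_WORDS = {"sarcasm", "sarcastic", "irony", "ironic"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Order matters: keep <user> whole, split '#' from hashtags, then words, then symbols.
TOKEN_RE = re.compile(r"<user>|#|\w+(?:'\w+)?|[^\w\s]")


def keep_tweet(text):
    """Drop tweets with URLs or with sarcasm words used as regular (non-hashtag) words."""
    if URL_RE.search(text):
        return False
    words = [w.strip(".,!?:;\"'()") for w in text.lower().split()]
    regular = {w for w in words if not w.startswith("#")}
    return not (regular & SARCASM_WORDS)


def preprocess(text):
    """Replace mentions, lowercase, and split into tokens with '#' separated."""
    text = MENTION_RE.sub("<user>", text)
    return TOKEN_RE.findall(text.lower())


tokens = preprocess("Yep, totally normal @Bob. #ClimateChangeIsReal")
```

Keeping the filter and the tokenizer separate makes it possible to apply the filter during data collection and the tokenizer at training time.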

Training Details
Pre-trained models. The pre-trained ResNet model is available online. The word embeddings and attribute embeddings are trained on the Twitter dataset using GloVe (Pennington et al., 2014).
Fine-tuning. Parameters of the pre-trained ResNet model are fixed during training. Parameters of word and attribute embeddings are updated during training.
Optimization. We use the Adam optimizer (Kingma and Ba, 2014) to optimize the loss function.
Hyper-parameters. The hidden layer size in the neural networks described in the fusion techniques is half of its input size. Other hyper-parameters are listed in Table 2.

Comparison Results
Table 3 shows the comparison results (F-score and accuracy) of the baseline models and our proposed model. We implement models with one or multiple modalities as baseline models. We also present the results of naïve solutions (all negative, random) for this task.
Random. It randomly predicts whether a tweet is sarcastic or not.
Text(Bi-LSTM). Bi-LSTM is one of the most popular methods for addressing many text classification problems. It leverages a bidirectional LSTM network for learning text representations and then uses a classification layer to make predictions.
Text(CNN). CNN is also one of the state-of-the-art methods to address text classification problems. We implement text CNN (Kim, 2014) as a baseline model.
Image. Image vectors after the pooling layer of ResNet are inputs of the classification layer. We only update parameters of the classification layer.
Attr. Since image attribute is one of the modalities in our proposed model, we also try to use only attribute features to make predictions. The attribute feature vectors are inputs of the classification layer.
Concat. Previous work (Schifanella et al., 2016) concatenates different feature vectors of different modalities as the input of the classification layer. We implement this concatenation model with our feature vectors of different modalities and apply it for classification. The number in parentheses is the number of modalities we use: (2) means concatenating text features and image features, while (3) means concatenating all text, image and attribute features.
We can see that the models based only on the image or attribute modality do not perform well, while Text(Bi-LSTM) and Text(CNN) models perform much better, indicating the important role of text modality. The Concat(3) model outperforms Concat(2), because adding attributes as a new modality actually introduces external semantic information of images and helps the model when it fails to extract valid image features. Our proposed hierarchical fusion model further improves the performance and achieves the state-of-the-art scores, revealing that our fusion model leverages features of three modalities in a more effective way.
We further apply sign tests between our proposed model and the Text(Bi-LSTM), Concat(2) and Concat(3) models. The null hypotheses are that our proposed model does not perform better than each baseline model. The statistics of the sign tests are listed in Table 4. All significance levels are less than 0.05. Therefore, all of the null hypotheses are rejected and our proposed model performs significantly better than the baseline models.
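For reference, a one-sided exact sign test can be computed as sketched below. The pairing is an assumption: `wins` counts test examples that the proposed model classifies correctly and the baseline does not, `losses` counts the reverse, and ties are dropped. The function name and the counts are hypothetical and are not taken from Table 4.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses  # ties are assumed to be removed beforehand
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical counts for illustration only.
p_value = sign_test_p_value(wins=5, losses=0)
reject_null = p_value < 0.05
```

A small p-value means that so many wins would be unlikely if the two models were equally good on the examples where they disagree.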

Component Analysis of Our Model
We further evaluate the influence of early fusion, representation fusion, as well as different modality representations in early fusion on the final performance. The evaluation results are listed in Table 5. We can see that the removal of early fusion decreases the performance, which shows that early fusion can improve the text representation. Early fusion with attribute representation performs better than that with image representation, indicating the gap between text representation and image representation. If representation fusion is removed, the performance also decreases, which indicates that representation fusion is necessary and that it can refine the feature representation of each modality.
Visualization Analysis

Running Examples
Figure 3 shows some sarcastic examples that our proposed model predicts correctly while the model with only the text modality fails to label them right. It shows that with our model, images and attributes can contribute to sarcasm detection. For example, an image with a dangerous tackle and a text saying 'not dangerous' convey strong sarcasm in example (a). 'Respectful customers' is contradicted by the messy parcels as well as the attribute 'messy' in example (b). Without images, successfully detecting these sarcasm instances is almost impossible. The model with only the text modality fails to detect sarcasm in example (c), even though the word so is repeated several times. However, with image and attribute modalities, our proposed model correctly detects sarcasm in these tweets.
Figure 4: Attention visualization of sarcastic tweets
Figure 4 shows the attention of some examples at the representation fusion stage. Our model can successfully focus on the appropriate parts of the image, the essential words in the sentences and the important attributes. For example, our model pays more attention to the unamused face emoji and the word 'amazing' in the text, and pays more attention to the gloomy sky in example (a); this tweet is therefore predicted as sarcastic because of the inconsistency of these two modalities. In example (b), our model focuses on the word 'serious' in the text and on the simple meal in the picture that contradicts the 'good breakfast', revealing that this tweet should be sarcastic. In example (c), the word 'yum', the attribute 'meat' and the food in the image indicate the sarcastic meaning of the tweet.

Error Analysis
Figure 5 shows an example that our model fails to label correctly.
Figure 5 (tweet text): yo <user> thanks for the yearly fee reminder! Here's to you! #planetfitness #hiddenfee #mrmet Attributes: ball, holding, shoes, little, white
In the example, the insulting gesture in the picture contrasts with the phrase 'thanks for'. However, the model is unable to obtain the common sense that this gesture is insulting. Therefore, the attention on this picture does not focus on the insulting gesture. Moreover, the attributes do not reveal the insulting meaning of the picture either, and thus our model fails to predict this tweet as sarcastic.

Conclusion and Future Work
In this paper we propose a new hierarchical fusion model to make full use of three modalities (images, texts and image attributes) to address the challenging multi-modal sarcasm detection task. Evaluation results demonstrate the effectiveness of our proposed model and the usefulness of the three modalities. In future work, we will incorporate other modalities such as audio into the sarcasm detection task and we will also investigate how to make use of common sense knowledge in our model.