Reasoning with Multimodal Sarcastic Tweets via Modeling Cross-Modality Contrast and Semantic Association

Sarcasm is a sophisticated linguistic phenomenon used to express the opposite of what one really means. With the rapid growth of social media, multimodal sarcastic tweets are widely posted on various social platforms. In a multimodal context, sarcasm is no longer a purely linguistic phenomenon, and due to the short-text nature of social media, the opposition is more often manifested via cross-modality expressions. Thus, traditional text-based methods are insufficient to detect multimodal sarcasm. To reason with multimodal sarcastic tweets, in this paper we propose a novel method for modeling cross-modality contrast in the associated context. Our method models both cross-modality contrast and semantic association by constructing the Decomposition and Relation Network (namely D&R Net). The decomposition network represents the commonality and discrepancy between image and text, and the relation network models the semantic association in the cross-modality context. Experimental results on a public dataset demonstrate the effectiveness of our model in multimodal sarcasm detection.


Introduction
Sarcasm is a sophisticated linguistic phenomenon, defined by the Merriam-Webster Dictionary as 'The use of words that mean the opposite of what you really want to say, especially in order to insult someone, to show irritation, or to be funny'. It can not only disguise the hostility of the speaker, but also enhance the effect of mockery or humor on the listener (Tay et al., 2018). As an important clue for analyzing people's true sentiment and intentions in communication from implicit expressions, automatic sarcasm detection plays a significant role in various applications that require knowledge of people's sentiment or opinion (Cai et al., 2019), such as customer service and political stance detection. Existing work on sarcasm detection mainly focuses on text data. Early feature engineering approaches rely on the signal indicators of sarcasm, such as syntactic patterns, lexical indicators and special symbols (González-Ibánez et al., 2011). As sarcasm is often associated with an implicit contrast or disparity between the conveyed sentiment and the user's situation in context (Riloff et al., 2013), contextual contrast information at the conversation, tweet or word level is also employed to detect sarcasm in text (Bamman and Smith, 2015; Rajadesingan et al., 2015; Joshi et al., 2016). Recently, deep learning based methods have been adopted to train end-to-end neural networks (Baziotis et al., 2018; Tay et al., 2018), achieving state-of-the-art performance.
With the fast growth and diversification of social media, multimodal sarcastic tweets which convey abundant user sentiment are widely posted on various social platforms. There is a great demand for multimodal sarcasm detection to facilitate various applications. However, traditional text-based methods are not applicable for detecting multimodal sarcastic tweets (Fig.1). In a multimodal context, sarcasm is no longer a purely linguistic phenomenon, but rather the combined expression of multiple modalities (i.e. text, image, etc.). As the short text in a tweet often has insufficient contextual information, the contextual contrast implied in multimodal sarcasm is typically conveyed by cross-modality expressions. For example, in Fig.1b, we cannot reason about the sarcastic intention simply from the short text 'Perfect flying weather in April' until we notice the downpour outside the airplane window in the attached image. Therefore, compared to text-based methods, the essential research issue in multimodal sarcasm detection is the reasoning of cross-modality contrast in the associated situation.
Several related works on multimodal sarcasm detection have been proposed (Schifanella et al., 2016; Cai et al., 2019; Castro et al., 2019). However, they mainly focus on the fusion of multimodal data and do not address the above key research issue in reasoning with multimodal sarcasm. There are still two main research challenges for multimodal sarcasm detection. First, since sarcasm commonly manifests with a contrastive theme, the detection model must be able to reason about the cross-modality contrast or incongruity of situations. Second, to ensure that cross-modality contrast is assessed in the associated common ground, the detection model needs a mechanism to concentrate on the semantically associated aspects of situations in the cross-modality context. This contextual contrast and semantic association information, in turn, can provide salient evidence to interpret the detection of multimodal sarcasm.
To tackle the above challenges, in this paper, we propose a novel method to model both cross-modality contrast and semantic association by constructing the Decomposition and Relation Network (i.e. D&R Net) for the multimodal sarcasm detection task. The decomposition network implicitly models cross-modality contrast information by representing the commonality and discrepancy between image and text in tweets. The relation network explicitly captures the semantic association between image and text via a cross-modality attention mechanism. The main contributions of our work are as follows:
• We identify the essential research issue in multimodal sarcasm detection, and propose a method to model cross-modality contrast in the associated context of multimodal sarcastic tweets.
• We construct the Decomposition and Relation Network (D&R Net) to implicitly represent the contextual contrast and explicitly capture the semantic association between image and text, which provides the reasoning ability and word-level interpretability for multimodal sarcasm detection.
• We compare our model with the existing state-of-the-art methods, and experimental results on a publicly available dataset demonstrate the effectiveness of our model in multimodal sarcasm detection.
2 Related Work

Textual Sarcasm Detection
Traditional sarcasm detection takes text-based approaches, including feature engineering, context based and neural network models. Earlier feature engineering approaches are based on the insight that sarcasm usually occurs with specific signals, such as syntactic patterns (e.g. using high-frequency words and content words), lexical indicators (e.g. interjections and intensifiers) (González-Ibánez et al., 2011), or special symbols (e.g. '?', '!', hashtags and emojis) (Felbo et al., 2017). As sarcasm is often associated with an implicit contrast or disparity between the conveyed sentiment and the user's situation in context (Riloff et al., 2013), some studies rely on this basic character of sarcasm to detect contextual contrast at different linguistic levels, including the immediate communicative context between speaker and audience (Bamman and Smith, 2015), the historical context between current and past tweets (Rajadesingan et al., 2015; Joshi et al., 2015), or the word-level context obtained by computing semantic similarity (Hernández-Farías et al., 2015; Joshi et al., 2016). Recently, researchers have utilized the powerful techniques of neural networks to obtain more precise semantic representations of sarcastic text and to model the sequential information of sarcastic context. Some approaches consider the contextual tweets of the target tweet, using an RNN model to represent contextual tweets and modeling the relationship between target and contextual tweets for sarcastic text classification (González-Ibánez et al., 2011). To incorporate more indicative information, user embeddings (Amir et al., 2016), emotion, sentiment, personality (Poria et al., 2016), speaker's psychological profile (Ghosh and Veale, 2017), cognitive features (Mishra et al., 2017), and syntactic features (Baziotis et al., 2018) are also incorporated into CNN/LSTM models to enhance the performance.
Furthermore, to overcome the black box problem of neural network model and reasoning with sarcasm, some novel methods such as neural machine translation framework (Peled and Reichart, 2017), and intra-attention mechanism (Tay et al., 2018) are explored to improve the interpretability of sarcasm detection.

Multimodal Sarcasm Detection
With the prevalence of multimodal tweets, multimodal sarcasm detection has gained increasing research attention recently. Schifanella et al. (2016) first tackled this task as a multimodal classification problem, concatenating manually designed features of image and text to classify sarcasm. Cai et al. (2019) extend the input modalities to triple features (i.e. text feature, image feature and image attributes), and propose a hierarchical fusion model for the task. Castro et al. (2019) first propose the video-level multimodal sarcasm detection task and address it with feature engineering via SVM. However, these methods pay more attention to the fusion of multimodal features, and do not consider the cross-modality contrast and semantic association information that is essential for reasoning about multimodal sarcastic tweets.
In this paper, we propose a novel method to model the cross-modality contrast and semantic association in multimodal context by constructing the Decomposition and Relation Network (D&R Net), which enables our model to reason with multimodal sarcastic tweets and provides pertinent evidence for interpretation.

Proposed Method

Fig.2 illustrates the overall architecture of our proposed D&R Net for multimodal sarcasm detection, which is composed of four modules: preprocessing, encoding, decomposition network and relation network. We first preprocess the image and text inputs and extract adjective-noun pairs (ANPs) from each image. We then encode these triple inputs into hidden representations. After that, we learn to represent the commonality and discrepancy between image and text in the decomposition network, as well as the multi-view semantic association information in the relation network. Finally, we feed these cross-modality representations into the classification module for multimodal sarcasm detection.

Preprocessing
Standard image, text and visual attributes (e.g. sunset, scene, snow) are utilized in previous multimodal sarcasm detection work (Cai et al., 2019). To enhance image semantic understanding, we adopt a better way to obtain more visual semantic information by extracting extra adjective-noun pairs from each image (e.g. great sunset, pretty scene, fresh snow in Fig.2). Thus, our model accepts triple inputs.
Formally, we denote the extracted ANPs of each image as $\{P_i\}_{i=1}^{N}$, where $N$ is the number of adjective-noun pairs; each pair $P_i$ contains an adjective word $A_i$, a noun word $N_i$ and the probability value $p_i$ that this kind of ANP exists in the attached image.

Encoding
In the encoding module, we map these triple inputs into hidden representations. All textual words are first mapped into word embeddings. (1) For each text, we utilize a bi-directional long short-term memory (BiLSTM) network to represent the textual sequence as hidden representation vectors and incorporate the contextual information. It maps each word embedding $w_j$ into a hidden state $h^w_j \in \mathbb{R}^d$. (2) For each ANP, we directly compute the max-pooling result of its adjective and noun word embeddings as the hidden representation.
(3) For each image, we adopt a pre-trained convolutional neural network to extract the image feature, and also encode the result into the $d$-dimensional space.
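As a concrete illustration, the ANP encoding step amounts to an element-wise max over the adjective and noun word embeddings. A minimal NumPy sketch (the 4-dimensional embeddings and the pair "heavy snow" are toy values, not from our experiments):

```python
import numpy as np

def encode_anp(adj_emb: np.ndarray, noun_emb: np.ndarray) -> np.ndarray:
    """Encode an adjective-noun pair as the element-wise max of its
    adjective and noun word embeddings (max-pooling)."""
    return np.maximum(adj_emb, noun_emb)

# Toy 4-dimensional embeddings for a hypothetical ANP "heavy snow".
adj = np.array([0.2, -0.5, 0.9, 0.0])    # embedding of "heavy"
noun = np.array([0.4, -0.1, 0.3, -0.2])  # embedding of "snow"
h_p = encode_anp(adj, noun)              # ANP hidden representation
```

Each coordinate of `h_p` keeps the larger of the two word values, so the pair representation has the same dimension as a single word embedding.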

Decomposition Network (D-Net)
We focus on contextual contrast of multimodal sarcastic tweets and design the decomposition network (D-Net) to represent the commonality and discrepancy of image and text in high-level spaces.

Cross-modality Decomposition
The D-Net breaks down the raw visual or textual representation into a shared subspace and a unique visual or textual subspace through three layers. The shared layer tends to extract invariant shared features $f^{*}_{shared}$ of image and text, while the image or text layer is forced to decompose the image or text into unique variant contrast features $f^{*}_{unique}$, which can be defined as

$$f^{*}_{shared} = W_{shared} f_{*}, \qquad f^{*}_{unique} = P_{*} f_{*},$$

where $f_{*}$ is the feature of input modality $* \in \{image, text\}$, $f_{image}$ is the raw image encoding representation $H_m$, $f_{text}$ is the last hidden state $h^w_T$ of the BiLSTM which is used as the overall representation of the text, and $W_{shared} \in \mathbb{R}^{d_s \times d}$, $P_{*} \in \mathbb{R}^{d_u \times d}$ are the projection matrices of the shared space and the unique visual and textual spaces.

Decomposition Fusion
In multimodal sarcastic tweets, we expect our model to focus more on the opposition between the information of different modalities. Thus, we reinforce the discrepancy between image and text and, on the contrary, weaken their commonality. Specifically, we combine the above unique variant contrast features as the cross-modality contrast representation.
$$r_{dec} = f^{image}_{unique} \oplus f^{text}_{unique},$$

where $\oplus$ denotes the concatenation operation.
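The decomposition and fusion steps can be sketched as follows. The dimensions are toy values (our implementation uses $d=200$ and $d_s=d_u=40$), the random vectors stand in for the real encodings, and any non-linearity on the projections is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_s, d_u = 8, 4, 4  # toy sizes; the paper uses d=200, d_s=d_u=40

# One shared projection matrix and one unique projection per modality.
W_shared = rng.standard_normal((d_s, d))
P = {"image": rng.standard_normal((d_u, d)),
     "text": rng.standard_normal((d_u, d))}

f = {"image": rng.standard_normal(d),  # stand-in for raw image encoding H_m
     "text": rng.standard_normal(d)}   # stand-in for last BiLSTM state h^w_T

shared = {m: W_shared @ f[m] for m in f}  # commonality features f*_shared
unique = {m: P[m] @ f[m] for m in f}      # discrepancy features f*_unique

# Decomposition fusion: keep only the unique (contrast) parts.
r_dec = np.concatenate([unique["image"], unique["text"]])  # shape (2*d_u,)
```

Discarding the shared features here is the "weaken commonality" design choice described above; the shared projection still matters during training through the orthogonal constraint.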

Relation Network (R-Net)
We propose the relation network (R-Net) to fully capture the contextual association between image and text from multiple views.

ANP-Aware Cross-Modality Attention
The relationship between image and text is usually multi-coupled; that is, the text may involve multiple entities in the image, whereas different regions of the image may also involve different text words. We have already extracted multiple ANPs as the visual semantic information, which is beneficial for modeling multi-view associations between image and text according to different views of the ANPs. Thus, we propose the ANP-aware cross-modality attention layer to align textual words and ANPs by utilizing each ANP to query each textual word and computing their pertinence. We first calculate the cross interactive attention matrix $S \in \mathbb{R}^{N \times T}$ to measure how text words and image ANPs relate.
$$s_{ij} = h^{p\top}_i W h^w_j,$$

where $W \in \mathbb{R}^{d \times d}$ is the parameter of the bi-linear function, and each score $s_{ij} \in S$ indicates the semantic similarity between the $i$-th ANP encoding $h^p_i \in H_p$ and the $j$-th text word encoding $h^w_j \in H_w$. We then compute the cross-modality attention weight $\alpha_{ij}$ of the $i$-th ANP for the $j$-th textual word by normalizing the $i$-th row of the attention matrix $S$, and calculate the weighted average of the textual hidden states as the $i$-th ANP-aware textual representation $r_i \in \mathbb{R}^d$:

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{k=1}^{T} \exp(s_{ik})}, \qquad r_i = \sum_{j=1}^{T} \alpha_{ij} h^w_j.$$

Thus, we query the text $N$ times with different ANPs to get multi-view textual representations $[r_1, r_2, \ldots, r_N]$. Our proposed ANP-aware cross-modality attention mechanism is a variant of multi-head attention (Vaswani et al., 2017) and can be considered a cross-modality adaptation of the topic-aware mechanism (Wei et al., 2019), modeling the cross-modality association between image and text from multiple ANP-aware views. Next, we detail how to fuse these representations into the final text representation.
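A minimal sketch of the ANP-aware cross-modality attention under toy dimensions; `H_p`, `H_w` and `W` below are random stand-ins for the learned ANP encodings, textual hidden states and bi-linear parameter:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
N, T, d = 3, 5, 8                  # 3 ANPs, 5 text words, toy dim 8
H_p = rng.standard_normal((N, d))  # ANP encodings h^p_1..h^p_N
H_w = rng.standard_normal((T, d))  # textual hidden states h^w_1..h^w_T
W = rng.standard_normal((d, d))    # bi-linear attention parameter

S = H_p @ W @ H_w.T       # cross interactive attention matrix, shape (N, T)
alpha = softmax(S, axis=1)  # row-normalize: one distribution per ANP
R = alpha @ H_w             # multi-view textual representations r_1..r_N
```

Each row of `alpha` sums to one, so row $i$ of `R` is the weighted average of text states queried by the $i$-th ANP.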

ANP-Probability Fusion
We extract ANPs from each image and only select the Top $N$ ANPs according to their extracted probability values $[p_1, p_2, \ldots, p_N]$. Hence, different textual representations should be influenced by different ANP probability values. Thus, we get the final cross-modality association representation $r_{rel} \in \mathbb{R}^d$ by calculating the weighted average of these ANP-aware textual representations $[r_1, r_2, \ldots, r_N]$ according to the normalized ANP probability distribution:

$$\beta_i = \frac{p_i}{\sum_{k=1}^{N} p_k}, \qquad r_{rel} = \sum_{i=1}^{N} \beta_i r_i.$$
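This fusion is simply a probability-weighted average of the multi-view representations. A small sketch with hypothetical SentiBank probabilities (random values stand in for the ANP-aware textual representations):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 3, 8
R = rng.standard_normal((N, d))  # ANP-aware textual representations r_1..r_N
p = np.array([0.9, 0.6, 0.3])    # hypothetical extracted ANP probabilities

beta = p / p.sum()               # normalized ANP probability distribution
r_rel = beta @ R                 # final cross-modality association vector
```

Higher-confidence ANPs thereby contribute proportionally more to the final text representation than low-confidence ones.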

Sarcasm Classification
Finally, we feed the above acquired cross-modality contrast and semantic association representations, denoted as $r_{dec}$ and $r_{rel}$ respectively, into a top fully-connected layer and use the sigmoid function for binary sarcasm classification:

$$\hat{y} = \sigma\left(w_s (r_{dec} \oplus r_{rel}) + b_s\right),$$

where $w_s \in \mathbb{R}^{1 \times (2d_u + d)}$ and $b_s \in \mathbb{R}^1$ are the parameters of the fully-connected layer.
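The classification step reduces to a single linear map over the concatenated representations followed by a sigmoid; a sketch with random stand-in parameters:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d_u, d = 4, 8
r_dec = rng.standard_normal(2 * d_u)  # cross-modality contrast representation
r_rel = rng.standard_normal(d)        # semantic association representation

w_s = rng.standard_normal(2 * d_u + d)  # fully-connected layer weights
b_s = 0.0                               # bias

y_hat = sigmoid(w_s @ np.concatenate([r_dec, r_rel]) + b_s)
label = int(y_hat > 0.5)  # 1 = sarcasm, 0 = non-sarcasm
```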

Optimization
Our model optimizes two losses: the classification loss and the orthogonal loss. We use the cross-entropy function as the sarcasm classification loss:

$$\mathcal{L}_{cls} = -\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],$$

where $y_i$ is the ground truth of the $i$-th sample (i.e., 1 for sarcasm and 0 for non-sarcasm), and $\hat{y}_i$ is the label predicted by our model. In the D-Net (Subsection 3.3), we share the same matrix $W_{shared}$ for both image and text to ensure that they are projected into the same subspace. Besides, during initialization and training, to ensure that the decomposed unique subspaces are unrelated to or in conflict with each other, we impose an additional orthogonal constraint between each unique projection matrix $P_{*}$ and the shared projection matrix $W_{shared}$.
We convert these orthogonal constraints into the following orthogonal loss:

$$\mathcal{L}_{orth} = \sum_{* \in \{image, text\}} \left\| W_{shared} P_{*}^{\top} \right\|_F^2,$$

where $\|\cdot\|_F^2$ denotes the squared Frobenius norm. We finally minimize the combined loss function:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{orth},$$

where $\lambda$ is the weight of the orthogonal loss.
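A small sketch of the two losses. The toy matrices below are chosen so that the rows of each unique projection are orthogonal to the rows of the shared projection, which makes the orthogonal loss exactly zero; the exact set of constraint terms (one per unique space) is our reading of the text above:

```python
import numpy as np

def bce(y, y_hat, eps: float = 1e-12) -> float:
    """Binary cross-entropy classification loss (mean over samples)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def orthogonal_loss(W_shared: np.ndarray, P_list) -> float:
    """Sum of squared Frobenius norms of W_shared @ P_*^T."""
    return float(sum(np.linalg.norm(W_shared @ P.T, "fro") ** 2
                     for P in P_list))

# Toy projections: each unique space is orthogonal to the shared space,
# so the orthogonal loss vanishes.
W_shared = np.array([[1.0, 0.0, 0.0, 0.0]])
P_img = np.array([[0.0, 1.0, 0.0, 0.0]])
P_txt = np.array([[0.0, 0.0, 1.0, 0.0]])

lam = 0.5  # weight of the orthogonal loss
total = bce([1, 0], [0.9, 0.2]) + lam * orthogonal_loss(W_shared, [P_img, P_txt])
```

When the subspaces overlap, the Frobenius term grows and pushes the projections apart during training.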

Implementation Details
For a fair comparison, we adopt the same data preprocessing as (Cai et al., 2019): replacing mentions with the certain symbol user, cleaning up samples whose regular words include 'sarcasm'-related words (e.g. sarcasm, sarcastic, irony, ironic) and co-occurring words (e.g. jokes, humor, exgag), and removing stop words and URLs. We tokenize the text with the NLTK toolkit and embed each token into a 200-dimensional word embedding with GloVe (Pennington et al., 2014). For image preprocessing, we first resize each image to 224*224 and utilize a pre-trained ResNet (He et al., 2016) to extract the image feature. We also use the SentiBank toolkit to extract 1200 ANPs and select the Top 5 ANPs as the visual semantic information of each image. We encode the multimodal inputs into a 200-dimensional hidden space, and set the dimensions of both the invariant shared feature and the unique variant contrast feature to 40. Finally, we optimize our model with the Adam update rule with learning rate 0.01, mini-batch size 128, and orthogonal loss weight 0.5. Dropout and early stopping are utilized to avoid overfitting.

Comparison with multimodal baselines
Our work focuses on multimodal sarcasm detection using the image and text modalities. Thus, we compare our model with the only two existing related models that use the same modalities.
• MLP+CNN (Schifanella et al., 2016) directly concatenates manually designed features of image and text for sarcasm classification.
• Hierarchical FM (Cai et al., 2019) incorporates additional image attributes as visual semantic information and generates feature representations via a hierarchical fusion framework.

We compare our model with these multimodal baseline models using the F1-score and Accuracy metrics. Table 1 shows the comparative results. The MLP+CNN model simply treats multimodal sarcasm detection as a general multimodal classification task, directly concatenating multimodal features for classification. Thus, it obtains the worst performance. Hierarchical FM performs better than MLP+CNN by incorporating additional attributes that provide visual semantic information and by generating better feature representations via a hierarchical fusion framework. However, these multimodal baselines pay more attention to the fusion of multimodal features. In contrast, our D&R Net captures the essence of multimodal sarcasm by modeling cross-modality contrast in the associated context and achieves the best performance.

Comparison with unimodal baselines
To further explore the effects of multimodal inputs for sarcasm detection, we compare our model with the representative text-based sarcasm detection models and an image-based baseline model.
• ResNet (He et al., 2016) is widely used in many image classification tasks with prominent performance. As there is no related work on image sarcasm detection, we fine-tune it for image sarcasm classification.
• CNN (Kim, 2014) is a well-known model for many text classification tasks, which captures n-gram features by multichannel parameterized sliding windows.
• BiLSTM (Graves and Schmidhuber, 2005) is a popular recurrent neural network that models the text sequence and incorporates bidirectional context information.
• MIARN (Tay et al., 2018) learns the intra-sentence relationship and sequential composition of sarcastic text, and is the state-of-the-art method for text-only sarcasm detection.
We use F1-score and Accuracy as the evaluation metrics. Table 3 shows the comparative results of our model and these unimodal baseline models. Though ResNet demonstrates superior performance in many image classification tasks, it performs relatively poorly in the sarcasm detection task. This is because the sarcastic intention or visual contrast context in the image is usually not obvious. CNN and BiLSTM simply treat sarcasm detection as a text classification task, ignoring the contextual contrast information. Thus, their performance is worse than that of MIARN, which focuses on textual context to model the contrast information between individual words and phrases. However, due to the nature of short text, relying on textual information alone is often insufficient, especially in multimodal tweets where the cross-modality context plays the most important role. Our D&R Net performs better than the unimodal baselines, demonstrating the usefulness of modeling multiple modalities to provide additional cues through reasoning about contextual contrast and association.

Ablation Study
To evaluate the performance of each component used in our D&R Net, we conduct the detailed ablation studies on various variants of our model. The ablation results are shown in Table 4.
In general, we find that these variants underperform our full model. The most obvious declines come from directly removing our two core modules, D-Net and R-Net (see rows 1, 3). Comparing these two variants, we find that removing D-Net causes a greater performance drop than removing R-Net. This suggests that modeling the cross-modality contrast in D-Net is more useful than the cross-modality association in R-Net. After removing the D-Net, the model only accepts the text and ANP inputs. Thus, we further incorporate image information by directly concatenating the image encoding in the final fusion layer (see row 2). The improvement compared with -D-Net shows the effectiveness of using the image modality for multimodal sarcasm detection. Similarly, we also add the representation of ANPs to the fusion layer after removing the R-Net module (see row 4). However, the performance unexpectedly continues to decrease. One possible reason is that the fusion of ANPs disturbs the original decomposition results in spite of using triple inputs. It is worth mentioning that replacing our ANPs with the noun attributes used in (Cai et al., 2019) underperforms our model (see row 5). This result indicates that ANPs are more useful than noun attributes in modeling the semantic association between image and text. This is because the adjective-noun words in ANPs are more semantically informative than noun-only words. Finally, we notice that our ANP-probability fusion (i.e. ANP-P.F.) strategy obtains reasonable performance compared with the standard pooling operations MaxPool and AvgPool (see rows 6, 7), with the ANP-probability weighted average performing the best.

Case Study
In this section, we provide case studies on several practical examples to illustrate that our D&R Net really learns to reason about multimodal sarcastic tweets with interpretability. For text-only or image-only models, it is almost impossible to detect the sarcastic intention of Fig.3a and 3b. We also show the ANPs extracted from each image, and these ANPs indeed provide useful information for sarcasm detection. For example, the ANPs heavy snow, cloudy mountains and misty winter of Fig.3a are in sharp conflict with the text word 'Spring', conveying a strong sarcastic intention. In addition, our extracted ANPs are more semantically meaningful than the noun-only attributes used in (Cai et al., 2019). The wet road and empty street are more informative than the noun-only words road and street in Fig.3b. The cute girls and energetic performance are more in line with the text words 'so beautiful' compared with the noun-only words girls and performance in Fig.3d, helping discriminate between sarcasm and non-sarcasm.

Attention Visualization
Our proposed ANP-aware cross-modality attention mechanism explicitly calculates the cross interactive attention between text words and image ANPs, providing explainable reasoning evidence for sarcasm detection. We further illustrate this attention mechanism by visualizing its outputs on two multimodal sarcastic tweets in Fig.4. The results show that our proposed attention mechanism works well for multimodal sarcasm detection by explicitly identifying the relationships between image regions and text words. For instance, in Fig.4a, the user satirically mentions an eclipse because too many clouds cover the sun. Our D&R Net accurately detects the sarcastic intention by focusing on the text words 'eclipse', '!', 'EclipseDay' with multiple visual semantic ANP views: stormy, fluffy, lovely and rainy clouds. In Fig.4b, our model pays more attention to the textual phrase 'these lovely books' with the stupid sign, strange sign, and bad sign ANPs which refer to the emoji in the attached image. Consequently, it is easy for our model to detect the sarcastic intention that the books are NOT 'lovely' at all.

Conclusion
In this paper, we identify the essential research issue in multimodal sarcasm detection. To model the cross-modality contrast in the associated context of multimodal sarcastic tweets, we propose the D&R Net to represent the commonality and discrepancy between image and text as well as the multi-view semantic associations in the cross-modality context. Our model is capable of reasoning with multimodal sarcastic tweets with word-level interpretation. Experimental results on a public dataset show that our model achieves state-of-the-art performance compared with the existing models.