Building a Bridge: A Method for Image-Text Sarcasm Detection Without Pretraining on Image-Text Data

Sarcasm detection on social media posts that combine text and images is becoming more challenging. Previous work on image-text sarcasm detection mainly fused summaries of the text and image: separate sub-models read the text and the image to produce summaries, which are then fused. Recently, several multi-modal models based on the BERT architecture have been proposed, such as ViLBERT. However, these models can only be pretrained on image-text data. In this paper, we propose an image-text model for sarcasm detection that uses pretrained BERT and ResNet without any further pretraining. BERT and ResNet have been pretrained on far larger text and image corpora than any available image-text data. We connect the vector spaces of BERT and ResNet to exploit this additional data, use the pretrained Multi-Head Attention of BERT to model the text and image jointly, and propose a 2D-Intra-Attention to extract the relationships between words and image regions. In experiments, our model outperforms the state-of-the-art model.


Introduction
It is becoming popular for people to express their emotions and feelings on social media with text accompanied by images, which makes sarcasm detection more challenging. Sometimes only when the text and image are read together can one tell whether a post is sarcastic. In the examples in Figure 1, taken from the multimodal Twitter dataset (Cai et al., 2019), the images contain information that is necessary to determine whether the posts are sarcastic.
Previous work on image-text sarcasm detection (Cai et al., 2019) and on image-text sentiment analysis (Gaspar and Alexandre, 2019; Huang et al., 2019; Kruk et al., 2019) generally follows two steps: (1) summarize the image and the text; (2) fuse the two summaries. Although some works explore early fusion, it is still limited, and details of the text and image are dropped during summarization.
Recently, several multi-modal models based on the BERT architecture have been proposed, such as ViLBERT (Lu et al., 2019a,b), LXMERT (Tan and Bansal, 2019), VisualBERT, and B2T2 (Alberti et al., 2019). However, these models are pretrained only on image-text data. In contrast, BERT can be pretrained on much larger text data than image-text data, and ResNet can likewise make use of more image data.
In linear algebra, matrix multiplication can be understood as a kind of vector space transformation. In this paper, we provide a new perspective on this task: the vector-space-transformation perspective. We propose a model that connects the text and image and design a Bridge Layer to build the connection. The low-level and high-level image features are passed into BERT (Devlin et al., 2019) as BERT embeddings.
We use the pretrained BERT and the pretrained ResNet directly, without any further pretraining for this task. Any BERT-like model based on the Transformer (Vaswani et al., 2017), and any visual model, can be plugged into our model in the future. Moreover, our model does not require huge computing resources or time for pretraining.
Based on the idea that sarcasm relies on semantic relationships and contrasts between words, Tay et al. (2018) use a softmax and a max function to extract these relationships and contrasts. However, the max function drops some information. In this paper, we propose a method called 2D-Intra-Attention with a 2D-softmax to handle the 2D relationships. Assuming $n$ is the number of inputs, the 2D-softmax considers $n^2$ relationships at a time, whereas the max function (Tay et al., 2018) considers only $n$. Besides, we also add the image features into the 2D-Intra-Attention, so the relationships between words and images are considered as well.

Figure 1: (a) "packing is so relaxing"; (b) "finally! a thermometer that meets my precision requirements for cooking."
In experiments, our model outperforms the state-of-the-art model. Our main contributions are summarized as follows:

• We connect the text and image: we use image features extracted by a pretrained ResNet as input to the pretrained BERT and utilize the Multi-Head Attention of BERT to model the image features.
• We propose a 2D-softmax to model the 2D relationships, considering $n^2$ relationships, and add image features to the 2D-Intra-Attention to extract the relationships and contrasts between words and images.
• We use the pretrained BERT and ResNet directly. Our model can adopt new BERT-like models or visual models in the future and does not require extra computing resources and time for pretraining.

Multimodal Sarcasm Detection
Previous works explored the characteristics and behavior of the reader for multimodal sarcasm detection (Mishra et al., 2016a,b). Some works tried to introduce visual information into sarcasm detection (Schifanella et al., 2016; Cai et al., 2019), but the fusion is mainly applied to summaries. Besides sarcasm detection, there is also work on multimodal sentiment analysis (Wang et al., 2017; Zadeh et al., 2017; Poria et al., 2015; Gu et al., 2018; Gaspar and Alexandre, 2019; Huang et al., 2019). Some ideas of multimodal sentiment analysis are similar to multimodal sarcasm detection, so it should also be possible to adapt our method to sentiment analysis in the future.
Approach

Figure 2 shows the architecture of our model. Our model contains two parts: Image-Text Fusion and 2D-Intra-Attention.

Image-Text Fusion
Image-Text Fusion includes a BERT Layer, a ResNet Layer, and a Bridge Layer. In this paper, the term "BERT" refers to BERT-like models.

ResNet Layer
The ResNet Layer provides both the details and a summary of an image. A "Block" of the ResNet Layer in Figure 2 is a "building block" in (He et al., 2016), which contains either two 3x3 convolution layers, or two 1x1 convolution layers and one 3x3 convolution layer. The image features are the 7x7 tensor after "Block, 512", called feature 7x7 in the following, and the 1x1 tensor after "avg pool", called feature 1x1 in the following.
The feature 7x7 provides the details of the image, and the feature 1x1 provides a summary of it. In this way, every word in the text can pay attention to different parts of the image and obtain more detailed information.
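For concreteness, the following sketch shows how the feature 7x7 and feature 1x1 could be taken from a torchvision ResNet-50; the function name and wiring here are illustrative assumptions, not the authors' released code.

```python
import torch
import torchvision.models as models

# A minimal sketch of extracting the two image features described above,
# assuming a standard torchvision ResNet-50 pretrained on ImageNet.
resnet = models.resnet50(pretrained=True)
resnet.eval()

def extract_image_features(images: torch.Tensor):
    """images: a batch of shape (B, 3, 224, 224)."""
    x = resnet.conv1(images)
    x = resnet.bn1(x)
    x = resnet.relu(x)
    x = resnet.maxpool(x)
    x = resnet.layer1(x)
    x = resnet.layer2(x)
    x = resnet.layer3(x)
    x = resnet.layer4(x)              # feature 7x7: (B, 2048, 7, 7)
    feature_7x7 = x
    feature_1x1 = resnet.avgpool(x)   # feature 1x1: (B, 2048, 1, 1)
    return feature_7x7, feature_1x1
```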

Bridge Layer
The Bridge Layer builds the connection between ResNet and BERT. It is essential because ResNet and BERT are pretrained in different spaces, so the image features of ResNet cannot be passed into BERT directly. The term "space" here means the vector space or semantic space and describes the representations of ResNet and BERT. The Bridge Layer maps the image features from the ResNet space into the BERT space.
Formally, the feature 7x7 and the feature 1x1 are passed through two separate 1x1 convolutions, one for each feature. For both convolutions, the kernel size is 1x1, the stride is 1, the padding is 0, the number of input channels equals the number of channels of the image features (e.g., 1024 or 2048), and the number of output channels equals the hidden size of BERT (e.g., 768 or 1024). A 1x1 convolution here is equivalent to a fully connected layer, so either is feasible when implementing the Bridge Layer. The outputs of the two 1x1 convolutions are flattened and passed into BERT as BERT embeddings, as shown in Figure 2.
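A minimal sketch of such a Bridge Layer is shown below; the channel and hidden-size defaults (2048 and 768) are examples, and the module layout is an assumption rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    """Two 1x1 convolutions that map ResNet channels to the BERT hidden size."""
    def __init__(self, in_channels=2048, bert_hidden=768):
        super().__init__()
        self.conv_7x7 = nn.Conv2d(in_channels, bert_hidden, kernel_size=1, stride=1, padding=0)
        self.conv_1x1 = nn.Conv2d(in_channels, bert_hidden, kernel_size=1, stride=1, padding=0)

    def forward(self, feature_7x7, feature_1x1):
        # (B, 2048, 7, 7) -> (B, 768, 7, 7) -> (B, 49, 768)
        g_detail = self.conv_7x7(feature_7x7).flatten(2).transpose(1, 2)
        # (B, 2048, 1, 1) -> (B, 768, 1, 1) -> (B, 1, 768)
        g_summary = self.conv_1x1(feature_1x1).flatten(2).transpose(1, 2)
        # 49 detail vectors plus 1 summary vector, now in BERT's embedding space
        return torch.cat([g_detail, g_summary], dim=1)   # (B, 50, 768)
```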
The purpose of the Bridge Layer is only to build the connection and perform the transformation, not to learn new features. We believe other choices, such as a 3x3 convolution, are suboptimal because they are more likely to overfit than to learn something useful. The task of learning image information should be left to ResNet, and the task of integrating image and text should be left to BERT.

BERT Layer
The BERT Layer has two kinds of inputs. One is the normal text input. The other is the image features that have been mapped into the BERT space by the Bridge Layer. The text is passed through the embedding layer and then the Transformer, whereas the image features are passed into the Transformer directly, skipping the embedding layer.
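The following is a rough sketch of this wiring using the internals of Hugging Face Transformers' BertModel (`bert.embeddings`, `bert.encoder`); it is an illustration under those assumptions, not the authors' implementation.

```python
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

def encode(text_ids, image_embeds):
    """text_ids: (B, T) token ids; image_embeds: (B, G, 768) from the Bridge Layer."""
    # The text goes through the embedding layer (token + position + segment embeddings)...
    text_embeds = bert.embeddings(input_ids=text_ids)        # (B, T, 768)
    # ...while the image features skip it and are appended directly.
    hidden = torch.cat([text_embeds, image_embeds], dim=1)    # (B, T+G, 768)
    # The concatenated sequence is fed to BERT's Transformer encoder.
    # (This sketch omits the restricted attention pattern for image features
    # described in the next paragraphs.)
    outputs = bert.encoder(hidden, head_mask=[None] * bert.config.num_hidden_layers)
    return outputs[0]                                          # (B, T+G, 768)
```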
The text and image features are passed through Multi-Head Attention in different ways. Formally, the attention for the words of the text is

$$\tilde{v}_i^{(l)} = \mathrm{Attention}\!\left(v_i^{(l)};\ v_1^{(l)},\dots,v_{|v|}^{(l)},\ g_1^{(l)},\dots,g_{|g|}^{(l)}\right),$$

where $v_i^{(l)}$ denotes the $i$-th word at the $l$-th layer, $g_i^{(l)}$ denotes the $i$-th image feature at the $l$-th layer, $|v|$ denotes the number of words, and $|g|$ denotes the number of image features. In this way, every word $v_i$ can pay attention to the other words and the image features. A word can get detailed information from the 7x7 features and a summary of the image from the 1x1 feature.
However, the attention for the image features is

$$\tilde{g}_i^{(l)} = \mathrm{Attention}\!\left(g_i^{(l)};\ g_i^{(l)}\right),$$

that is, an image feature $g_i$ can only "see" itself. Even though the Bridge Layer has mapped the image features into the BERT space, the mapped image features are still not text, and BERT was never pretrained on image features. Besides, the CNN of ResNet has a stronger capacity to learn images and their spatial relationships, and the relationships between image features have already been learned in ResNet. Passing $g_i$ through Multi-Head Attention in this way is similar to passing it through a fully connected layer. One head of the normal Multi-Head Attention (Vaswani et al., 2017) is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where $Q$, $K$, $V$ are the query, key, and value. For the attention of $g_i$, the output of the softmax is 1 for $g_i$ and 0 for the others. Therefore, the attention of one vector $g_i$ in one head of the Multi-Head Attention becomes

$$\mathrm{Attention}(g_i) = I_{g_i}\,(H W^V) = g_i\,W^V,$$

where $H \in \mathbb{R}^{n \times m}$ is the layer input consisting of all words and image features; $W^Q \in \mathbb{R}^{m \times d}$, $W^K \in \mathbb{R}^{m \times d}$, and $W^V \in \mathbb{R}^{m \times d}$ are the attention parameters as in (Vaswani et al., 2017); $m$ denotes the hidden size of BERT and $d$ denotes the hidden size of one head of the Multi-Head Attention. $I_{g_i} \in \mathbb{R}^{1 \times n}$ is a vector whose $(i + |v|)$-th element is 1 and whose other elements are 0, and $n$ is the total number of words and image features, i.e., $n = |v| + |g|$.

The attention for $g_i$ therefore only maps $g_i$ from the previous-layer semantic space into the next-layer semantic space with $W^V$.
Letting $g_i$ "see" the other words and image features does not perform well because it introduces noise and overfitting. On the other hand, $g_i$ has to "see" itself so that the Multi-Head Attention layer can map $g_i$ from the previous layer into the next layer. If we instead used the Bridge Layer to map the image features from the image space directly into the next-layer semantic space, the model could not utilize the existing parameters of BERT and would need to learn the same information from scratch.
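One way to realize this restricted pattern is with an additive attention mask applied to the scores before the softmax; the sketch below is an illustration under that assumption, not necessarily how the model was implemented.

```python
import torch

def build_attention_mask(num_words: int, num_images: int) -> torch.Tensor:
    """Additive attention mask for a sequence of num_words word positions
    followed by num_images image positions: words may attend to everything,
    while each image feature may attend only to itself."""
    n = num_words + num_images
    mask = torch.zeros(n, n)
    # Block every key for image-feature queries...
    mask[num_words:, :] = float("-inf")
    # ...except the diagonal, so g_i can still "see" itself.
    idx = torch.arange(num_words, n)
    mask[idx, idx] = 0.0
    return mask  # added to the attention scores before the softmax
```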

Space Transformation
To explain the idea behind Image-Text Fusion, namely that matrix multiplication is a kind of vector space transformation, Figure 3 shows the process from the space view. The Bridge Layer is the connection between the image space and the BERT embedding space: the image features are projected from the image space into the BERT embedding space by the Bridge Layer. The Multi-Head Attention of BERT then projects the image features from the BERT embedding space into the BERT layer space, and at every subsequent layer projects them from the previous-layer space into the next-layer space.

Some Details
In this section, we introduce some important details about the Image-Text Fusion.

Learning Rate

Since BERT and ResNet are pretrained models, they usually use a small learning rate. However, the Bridge Layer is not pretrained and needs to build the connection quickly, so its learning rate should be greater than that of BERT and ResNet. In this way, the Bridge Layer can learn fast and catch up with BERT and ResNet. This is very important because using the same learning rate hinders convergence. Empirically, we suggest that the learning rate of the Bridge Layer be at least 10x greater than that of BERT and ResNet.
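A minimal sketch of this setup with PyTorch parameter groups is shown below; the module names (`bert`, `resnet`, `bridge`) follow the earlier sketches, and the values match the learning rates reported in the training details.

```python
import torch

# Give the Bridge Layer a larger learning rate than the pretrained parts.
optimizer = torch.optim.Adam([
    {"params": bert.parameters(),   "lr": 1e-5},
    {"params": resnet.parameters(), "lr": 1e-5},
    {"params": bridge.parameters(), "lr": 1e-3},  # at least 10x larger
])
```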

Length Limitation
The input length of BERT must be at most 512 tokens because only 512 position embeddings are pretrained. However, the image features skip the embedding layer, so introducing image features does not affect the input length limitation of the text.

Image Features Sequence
The Multi-Head Attention does not model position by itself, which is why BERT introduces position embeddings. However, since the image features skip the position embeddings, the input order of the image features does not matter.

2D-Intra-Attention
We propose a 2D-softmax function to handle the 2D scores, and we add image features to the 2D-Intra-Attention to explore the contrasts and disparities between words and images. Formally, we define the outputs of BERT for both words and images as $\{h_i\}_{i=1}^{n}$, where $n$ is the total number of words and image features. In other words, $h_i$ is the output for the $i$-th word when $i \le |v|$ and for the $(i-|v|)$-th image feature when $i > |v|$. A pair is defined as

$$h_{ij} = [h_i; h_j],$$

where $[\cdot\,;\cdot]$ denotes concatenation and $h_{ij}$ is a vector. Every $h_{ij}$ is passed through a fully connected layer to get the score $s_{ij}$:

$$s_{ij} = W_s h_{ij} + b_s,$$

where $W_s \in \mathbb{R}^{1 \times 2m}$ and $b_s \in \mathbb{R}^{1}$ are learnable parameters, and $m$ denotes the hidden size of BERT. The values $s_{ij}$ with $i = j$ are then masked, which is similar to (Tay et al., 2018), and the values $s_{ij}$ with $i > |v|$ and $j > |v|$ are masked as well. The 2D-softmax is

$$a_{ij} = \frac{\exp(s_{ij})}{\sum_{p=1}^{n}\sum_{q=1}^{n}\exp(s_{pq})}.$$

This 2D-softmax considers $s_{ij}$ along two dimensions instead of only one. The attention weight $\hat{a}_i$ is then calculated as

$$\hat{a}_i = \sum_{p=1}^{n}\frac{a_{ip}}{2} + \sum_{q=1}^{n}\frac{a_{qi}}{2},$$

which projects the 2D weights $a_{ij}$ into the 1D weights $\hat{a}_i$; the $a_{ip}$ and $a_{qi}$ are divided by 2 because every $a_{ij}$ is added twice. The final step of the 2D-Intra-Attention is

$$\hat{h} = W_a \sum_{i=1}^{n} \hat{a}_i\, h_i,$$

where $W_a \in \mathbb{R}^{m \times m}$ is a learnable parameter. With the 2D-softmax, $n^2$ pairs can be considered. For example, if a word has obvious contrasts with many other words or with parts of the image, the attention weight of that word will be high.
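The sketch below illustrates the 2D-Intra-Attention for a single example (batch dimension omitted); the layer layout and variable names are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class TwoDIntraAttention(nn.Module):
    """Illustrative sketch of the 2D-Intra-Attention with a 2D-softmax."""
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(2 * hidden_size, 1)                     # W_s, b_s
        self.proj = nn.Linear(hidden_size, hidden_size, bias=False)    # W_a

    def forward(self, h, num_words):
        # h: (n, m) BERT outputs for words followed by image features
        n, m = h.size()
        # All n^2 concatenated pairs h_ij = [h_i; h_j]: shape (n, n, 2m)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, m),
                           h.unsqueeze(0).expand(n, n, m)], dim=-1)
        s = self.score(pairs).squeeze(-1)                              # (n, n) scores s_ij
        # Mask the diagonal and the image-image pairs
        mask = torch.eye(n, dtype=torch.bool, device=h.device)
        mask[num_words:, num_words:] = True
        s = s.masked_fill(mask, float("-inf"))
        # 2D-softmax: normalize over all n^2 scores at once
        a = torch.softmax(s.view(-1), dim=0).view(n, n)
        # Project the 2D weights to 1D: each a_ij contributes half to i and half to j
        a_hat = a.sum(dim=1) / 2 + a.sum(dim=0) / 2                    # (n,)
        # Weighted sum of the projected BERT outputs
        return (a_hat.unsqueeze(-1) * self.proj(h)).sum(dim=0)         # (m,)
```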

Final Fusion
The concatenation of the [CLS] output of BERT, the $\hat{h}$ from the 2D-Intra-Attention, and the feature 1x1 from ResNet is passed through a fully connected layer and a sigmoid function for classification.
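A small sketch of this classifier is given below; the dimensions (768 for BERT, 2048 for the raw ResNet feature 1x1) are assumptions based on the models mentioned above.

```python
import torch
import torch.nn as nn

class FinalFusion(nn.Module):
    """Concatenate [CLS], the 2D-Intra-Attention output, and the 1x1 image
    feature, then apply a fully connected layer and a sigmoid."""
    def __init__(self, bert_hidden=768, image_channels=2048):
        super().__init__()
        self.fc = nn.Linear(2 * bert_hidden + image_channels, 1)

    def forward(self, cls_vec, h_hat, feature_1x1):
        # cls_vec: (B, 768), h_hat: (B, 768), feature_1x1: (B, 2048)
        fused = torch.cat([cls_vec, h_hat, feature_1x1], dim=-1)
        return torch.sigmoid(self.fc(fused)).squeeze(-1)   # sarcasm probability
```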

Training Details
In this section, we will introduce the details and hyper-parameters for training our model.
Optimizer The optimizer is Adam (Kingma and Ba, 2014) for BERT with linear schedule and a warm-up ratio of 0.05.
Learning rate The learning rate for RoBERTa and ResNet50 is 1e-5, and for other parameters including Bridge Layer is 1e-3.
Image preprocessing For prediction, we resize the original image so that its smaller edge is 224 and then crop the image at the center. For training, we apply data augmentation to the images, including random cropping and random changes to the brightness, contrast, and saturation of the image.
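A sketch of this preprocessing with torchvision is shown below; the exact augmentation parameters (resize size before the random crop, jitter strengths) are assumptions, as the paper does not report them.

```python
from torchvision import transforms

# Evaluation: smaller edge becomes 224, then center crop.
eval_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Training: random crop plus random brightness/contrast/saturation changes.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
```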
Parameters number The number of parameters of our model for experiments is 151M. The learnable parameters are initialized by (He et al., 2015).

GPU & Environment
The model runs on a single NVIDIA GeForce RTX 2080 Ti GPU. Due to the limited GPU RAM, we use gradient accumulation for training. The operating system is Ubuntu 18.04. We use PyTorch 1.4.0 (Paszke et al., 2019) and Transformers 2.4.1 to implement our model. We also use mixed-precision training with NVIDIA Apex 0.1 (Micikevicius et al., 2017) to accelerate our model.
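The following is a minimal sketch of gradient accumulation; `model`, `train_loader`, and `accumulation_steps` are placeholders (the paper does not report the accumulation step count), and `optimizer` is the one from the earlier sketch.

```python
import torch

accumulation_steps = 4  # assumed value, not reported in the paper
optimizer.zero_grad()
for step, (text_ids, images, labels) in enumerate(train_loader):
    probs = model(text_ids, images)                       # sarcasm probabilities, shape (B,)
    loss = torch.nn.functional.binary_cross_entropy(probs, labels.float())
    (loss / accumulation_steps).backward()                # scale so accumulated gradients average out
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```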
Running time It takes an average of 343 seconds per epoch. We run the model 10 times and record the best result.

Metrics
The metrics for evaluation are F1-score, precision, recall, and accuracy, which are implemented by Scikit-learn (Pedregosa et al., 2011).
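For completeness, the four metrics can be computed as in the sketch below (the helper function is ours, only the scikit-learn calls are given by the paper's setup).

```python
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def evaluate(y_true, y_pred):
    """Compute the four reported metrics from binary labels and predictions."""
    return {
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```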

Comparison
The dataset for the experiments is the multimodal image-text Twitter dataset (Cai et al., 2019), which contains paired images and text as shown in Figure 1.
The descriptions of the other compared models are as follows:
DMAF Deep Multimodal Attentive Fusion (DMAF) (Huang et al., 2019). We include this image-text sentiment analysis model in the comparison since sarcasm detection and sentiment analysis share some similarities.

ViLBERT ViLBERT (Lu et al., 2019a,b) is a pretrained visual-text model that extends the BERT architecture to a multi-modal model. ViLBERT was first proposed in (Lu et al., 2019a) and then improved by multi-task training in (Lu et al., 2019b).

Table 1 shows the results. Since ViLBERT is based on BERT (Devlin et al., 2019), we also use BERT (Devlin et al., 2019) in our model to give a fair comparison. Our model with BERT outperforming the other models verifies the effectiveness of our model. Moreover, because our model can adopt different pretrained models, if we use RoBERTa, which was proposed around the same time as ViLBERT, our model outperforms the other models significantly. One improvement of RoBERTa comes from using larger data, and our model can make use of that data by adopting RoBERTa.
Our model outperforms ViLBERT and other pretrained visual-text models mainly because ViLBERT is pretrained only on limited image-text data. In contrast, our model utilizes more unsupervised text data and image data and only needs to learn a transformation.
BERT + 2D-Intra-Att A text-only model that uses the output of the 2D-Intra-Attention, whose inputs are the outputs of BERT, for classification. The 1D-Intra-Attention (Tay et al., 2018) was designed for text-only models, so we also add the 2D-Intra-Attention to this text-only model to compare the two attentions.

Concatenation of BERT and ResNet
An image-text model that concatenates the [CLS] output of BERT, whose input is the text, with the output of ResNet for classification. In other words, the model uses image features but does not feed them into BERT.

Concatenation of BERT and ResNet + 2D-Intra-Att
An image-text model that concatenates the [CLS] output of BERT, whose input is the text, the output of ResNet, and the output of the 2D-Intra-Attention for classification.

Image-Text Fusion
The Image-Text Fusion part described in this paper. It is important to note that the output of ResNet is used for classification in the Final Fusion rather than in the Image-Text Fusion. Here we do not use the output of ResNet for classification but only the [CLS], as shown in Figure 4, so the image information must go through the Bridge Layer and the BERT Layer to reach the classifier. If the Bridge Layer could not transform the image features or the BERT Layer could not integrate text and image, the result would be similar to BERT or even worse, because the image features may introduce noise.
Image-Text Fusion with Bridge Layer using 3x3 conv The Image-Text Fusion in this paper that uses the [CLS] of BERT for classification, with the Bridge Layer using a 3x3 convolution with padding 1 instead of a 1x1 convolution.

Table 2 shows the results. Both Image-Text Fusion and BERT use only the [CLS] of BERT for classification; the difference is that the BERT Layer of Image-Text Fusion has image input. This is evidence that the BERT Layer and the Bridge Layer are effective, because the image information must go through them to reach the classifier, and they must handle the image inputs well to give a better result. With image input, the score of Concatenation of BERT and ResNet improves by 0.93% over BERT but is still worse than ViLBERT. Image-Text Fusion achieves a 1.63% improvement over BERT and outperforms ViLBERT even without the 2D-Intra-Attention.
The poor result of Image-Text Fusion with the Bridge Layer using a 3x3 convolution verifies the effectiveness of using a 1x1 convolution in the Bridge Layer. Our idea for the Bridge Layer is to perform only a transformation, so that the model can utilize the pretrained parameters as much as possible instead of learning them from scratch.

Conclusion
In this paper, we propose an image-text model for image-text sarcasm detection with a novel way to integrate image and text information. Our model outperforms the state-of-the-art model. Compared with other multi-modal models, our model utilizes more text and image data instead of only image-text data, and it can adopt different pretrained language models and visual models directly without any further pretraining.