Neural Sarcasm Detection using Conversation Context

Social media platforms and discussion forums such as Reddit and Twitter are filled with figurative language. Sarcasm is one such category of figurative language whose presence in a conversation makes language understanding a challenging task. In this paper, we present a deep neural architecture for sarcasm detection. We investigate various pre-trained language representation models (PLRMs) such as BERT and RoBERTa and fine-tune them on the Twitter dataset. We experiment with a variety of PLRMs, either on the Twitter utterance in isolation or utilizing the contextual information along with the utterance. Our findings indicate that by taking into consideration the three most recent utterances, the model is able to classify a conversation as sarcastic or not more accurately. Our best performing ensemble model achieves an overall F1 score of 0.790, which ranks us second on the leaderboard of the Sarcasm Shared Task 2020.


Introduction
Sarcasm can be defined as a communicative act of intentionally using words or phrases which tend to transform the polarity of a positive utterance into its negative counterpart and vice versa. The significant increase in the usage of social media channels has generated content that is sarcastic and ironic in nature. The apparent reason for this is that social media users tend to use various figurative language forms to convey their message. The detection of sarcasm is thus vital for several NLP applications such as opinion mining, sentiment analysis, etc. (Maynard and Greenwood, 2014). This has led to a considerable amount of research in the sarcasm detection domain among the NLP community in recent years. The dataset used in this work is provided by the organizers of the Sarcasm Shared Task FigLang-2020. Using the same approach, we rank 8th with an F1 score of 0.702 on the Reddit dataset leaderboard, but we do not describe those results here as we could not test all our experiments within the timing constraints of the Shared Task.
The Shared Task on Sarcasm Detection 2020 aims to explore various approaches for sarcasm detection in a given textual utterance. Specifically, the task is to understand how much conversation context is needed or helpful for sarcasm detection. Our approach for this task focuses on utilizing various state-of-the-art PLRMs and fine-tuning them to detect whether a given conversation is sarcastic. We apply an ensembling strategy consisting of models trained on different-length conversational contexts to make more accurate predictions. Our best performing model (Team name - nclabj) achieves an F1 score of 0.790 on the test data in the CodaLab evaluation platform.

Problem Description
The dataset assigned for this task is collected from the popular social media platform, Twitter. Each training instance contains the following fields: "label" (i.e., "SARCASM" or "NOTSARCASM"), "response" (the Tweet utterance), and "context" (i.e., the conversation context of the "response"). Our objective here is to take as input a response along with its optional conversational context and predict whether the response is sarcastic or not. This problem can be modeled as a binary classification task. The predictions on the test set are evaluated against the true labels. Three performance metrics, namely Precision, Recall, and F1 Score, are used for the final evaluation.


Related Work
Earlier approaches model sarcasm detection as a classification problem (Ghosh et al., 2015) or consider sarcasm as a contrast between a positive sentiment and a negative situation (Riloff et al., 2013; Maynard and Greenwood, 2014; Joshi et al., 2015, 2016b; Ghosh and Veale, 2016). Recently, a few works have taken into account additional context information along with the utterance. Wallace et al. (2014) demonstrate how additional contextual information beyond the utterance is often necessary for humans as well as computers to identify sarcasm. Schifanella et al. (2016) propose a multi-modal approach to combine textual and visual features for sarcasm detection. Joshi et al. (2016a) model sarcasm detection as a sequence labeling task instead of a classification task. Ghosh et al. (2017) show that the conditional LSTM network (Rocktäschel et al., 2015) and LSTM networks with sentence-level attention on context and response achieve a significant improvement over an LSTM model that reads only the response. Therefore, the new trend in the field of sarcasm detection is to take into account additional context information along with the utterance. The objective of this Shared Task is to investigate how much of this context information is necessary to classify an utterance as sarcastic or not.

System Description
We describe our proposed system for sarcasm detection in this section. We frame this problem as a binary classification task and apply a transfer learning approach to classify the tweet as either sarcastic or not. We experiment with several state-of-the-art PLRMs like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), as well as pre-trained embedding representation models such as ELMo (Peters et al., 2018), USE (Cer et al., 2018), etc., and fine-tune them on the assigned Twitter dataset. We briefly review these models in the following subsections. For fine-tuning, we add additional dense layers and train the entire model in an end-to-end manner. Figure 1 illustrates one such approach for fine-tuning a RoBERTa model. We sequentially unfreeze the layers with each ongoing epoch. We apply a model ensembling strategy called "majority voting", as shown in Figure 2, to produce our final predictions on the test data. In this ensemble technique, we take the predictions of several models and choose the label predicted by the maximum number of models.

Embeddings from Language Models (ELMo)
ELMo introduces a method to obtain deep contextualized word representations. Here, the researchers build a bidirectional language model (biLM) with a two-layered bidirectional LSTM architecture and obtain the word vectors through a learned function of the internal states of the biLM. This model is trained on a corpus of 30 million sentences, and thus the word embeddings obtained using this model can be used to increase the classification performance in several NLP tasks. For our task, we utilize the ELMo embeddings to obtain a feature representation of the words in the input utterance and pass it through three dense layers to perform the binary classification task.

Universal Sentence Encoder (USE)
USE presents an approach to create embedding vector representations of complete sentences, specifically targeting transfer learning to other NLP tasks. There are two variants of USE based on trade-offs between compute resources and accuracy. The first variant uses an encoding sub-graph of the Transformer architecture (Vaswani et al., 2017) to construct sentence embeddings and achieves higher performance figures. The second variant is a light model that uses a deep averaging network (DAN) (Iyyer et al., 2015), in which the input embeddings for words and bi-grams are first averaged and then passed through a feedforward neural network to obtain sentence embeddings. We utilize the USE embeddings from the Transformer variant on our data and perform the classification task by passing them through three dense layers.
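The averaging step at the heart of the DAN variant can be illustrated with a toy sketch (function name, dimensions, and weights are purely illustrative, not the actual USE implementation):

```python
# Sketch of the deep averaging network (DAN) idea behind the lighter USE
# variant: token embeddings are averaged, then passed through a
# feedforward layer. Toy 3-d vectors and an identity weight matrix.

def dan_sentence_embedding(token_embeddings, weight, bias):
    """Average token vectors, then apply one feedforward layer with ReLU."""
    dim = len(token_embeddings[0])
    avg = [sum(vec[i] for vec in token_embeddings) / len(token_embeddings)
           for i in range(dim)]
    out = []
    for row, b in zip(weight, bias):
        z = sum(w * x for w, x in zip(row, avg)) + b
        out.append(max(0.0, z))  # ReLU
    return out

# Two "token embeddings"; identity weights and zero bias leave the average
# unchanged, so the sentence embedding is the element-wise mean.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
b = [0.0, 0.0, 0.0]
print(dan_sentence_embedding(tokens, W, b))  # -> [2.0, 2.0, 2.0]
```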

Bidirectional Encoder Representations from Transformers (BERT)
BERT, a Transformer language model, achieved state-of-the-art results on eleven NLP tasks. BERT is trained on two pre-training tasks. In the first task, known as masked language modeling (MLM), 15% of the words in each sequence are randomly masked, and the model is trained to predict the masked words. In the second task, known as next sentence prediction (NSP), the model is given two sentences and tries to predict whether one sentence follows the other. Once this pre-training phase is completed, the model can be extended to classification-related tasks with minimal changes. This is also known as BERT fine-tuning, which we apply for our sarcasm detection task.
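The MLM input corruption can be sketched as follows. This is a simplified, illustrative recipe (the standard BERT setup additionally replaces 80% of the selected tokens with [MASK], 10% with a random token, and leaves 10% unchanged); the function name and toy vocabulary are our own:

```python
# Illustrative sketch of BERT's masked-language-modeling corruption:
# ~15% of tokens are selected as prediction targets; of those, 80% become
# [MASK], 10% a random vocabulary token, and 10% stay unchanged.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)            # model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            targets.append(None)           # not a prediction target
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
assert len(masked) == len(tokens)
```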

Dataset Preparation
The dataset assigned for this task is collected from Twitter. There are 5,000 English Tweets for training and 1,800 English Tweets for testing. We use 10% of the training data as a validation set to tune the hyper-parameters of our model. We apply several preprocessing steps to clean the given raw data. Apart from standard preprocessing steps such as lowercasing, removal of punctuation and emojis, expansion of contractions, etc., we remove the usernames from the tweets. Also, since hashtags generally consist of phrases in CamelCase, we split them into individual words, as they carry essential information about the tweet. To incorporate contextual information along with a given tweet, we prepare the data in the manner shown in Table 1. For data in which only the previous two turns are available, only those two turns are considered in the CON3 & CON cases illustrated in Table 1. We fix the maximum sequence length based on the coverage of the data (greater than the 90th percentile) in the training and test set. This sequence length is determined by considering each word as a single token.
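Two of the steps above, CamelCase hashtag splitting and prepending the most recent context turns, can be sketched as follows (function names and the exact regular expression are our own illustrative choices, not the shared-task code):

```python
import re

def split_hashtag(tag):
    """Split a CamelCase hashtag like '#SarcasmDetection' into words."""
    body = tag.lstrip("#")
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", body)

def build_input(response, context, n_turns=3):
    """Concatenate the most recent n_turns of context with the response
    (the CON3 setting); fewer turns are used when fewer are available."""
    turns = context[-n_turns:]
    return " ".join(turns + [response])

print(split_hashtag("#SarcasmDetection"))        # -> ['Sarcasm', 'Detection']
print(build_input("oh sure", ["t1", "t2", "t3", "t4"]))  # -> 't2 t3 t4 oh sure'
```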

Implementation Details
Here, we describe a detailed set up of our experiments and the different hyper-parameters of our models for better reproducibility. We experiment with various advanced state-of-the-art methodologies such as ELMo, USE, BERT, and RoBERTa. We use the validation set to tune the hyper-parameters. We use Adam (Kingma and Ba, 2014) optimizer in all our experiments. We use dropout regularization (Srivastava et al., 2014) and early stopping (Yao et al., 2007) to prevent overfitting. We use a batch size of {2, 4, 8, 16} depending on the model size and the sequence length.
Firstly, the data is prepared as mentioned in subsection 5.1. For fine-tuning the ELMo, USE, and BERT LARGE models, we use the modules from TensorFlow Hub and wrap them in a Keras Lambda layer whose weights are also trained during the fine-tuning process. We add three dense layers {512, 256, 1} with a dropout of 0.5 between these layers. A ReLU activation function is applied after the first two layers, whereas a sigmoid is used at the final layer. The ELMo and USE models are trained for 20 epochs, while BERT LARGE is trained for 5 epochs. During training, only the best model based on the minimum validation loss is saved using the Keras ModelCheckpoint callback. Instead of using a threshold value of 0.5 for binary classification, a whole range of threshold values from 0.1 to 0.9 with an interval of 0.01 is experimented with on the validation set. The threshold value for which the highest validation accuracy is obtained is selected as the final threshold and is applied to the test set to get the test class predictions.
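The threshold search described above can be sketched as a simple sweep over the validation predictions (the function name and the toy scores are illustrative):

```python
# Sketch of the validation-threshold search: instead of a fixed 0.5
# cut-off, every threshold in [0.1, 0.9] (step 0.01) is scored on the
# validation set and the best-accuracy threshold is kept.

def best_threshold(probs, labels):
    thresholds = [round(0.1 + 0.01 * i, 2) for i in range(81)]
    def accuracy(t):
        preds = [1 if p >= t else 0 for p in probs]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(thresholds, key=accuracy)

# Toy validation scores: positives cluster above 0.3, negatives below.
probs  = [0.9, 0.4, 0.35, 0.2, 0.1, 0.05]
labels = [1,   1,   1,    0,   0,   0]
t = best_threshold(probs, labels)
assert 0.2 < t <= 0.35  # any threshold in this band separates the toy data
```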
(TensorFlow Hub modules used: https://tfhub.dev/google/elmo/2, https://tfhub.dev/google/universal-sentence-encoder-large/3, and https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1.)
For fine-tuning the RoBERTa LARGE model, we use the fastai (Howard and Gugger, 2020) framework and utilize PLRMs from HuggingFace's Transformers library (Wolf et al., 2019). The HuggingFace library contains a collection of state-of-the-art PLRMs which is widely used by the research and practitioner communities. Incorporating the HuggingFace library with fastai allows us to utilize powerful fastai capabilities such as Discriminative Learning Rates, Slanted Triangular Learning Rates, and Gradual Unfreezing on the powerful pre-trained Transformer models. For our experiment, we first extract the pooled output (i.e., the last-layer hidden state of the first token of the sequence (CLS token), further processed by a linear layer and a Tanh activation function). It is then passed through a linear layer with two neurons followed by a softmax activation function. We use a learning rate of 1e-5 and utilize the "1cycle" learning rate policy for super-convergence, as suggested by Smith (2015). We gradually unfreeze the layers and train in a 1cycle manner. After unfreezing the last three layers, we unfreeze all layers and train in the same 1cycle manner. We stop the training when the validation accuracy does not improve for three consecutive epochs.
We use a simple ensembling technique called majority voting to ensemble the predictions of different models and further improve the test accuracy.
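The majority-voting step can be sketched as follows (the function name and example labels are illustrative; with an odd number of models, ties cannot occur):

```python
# Majority voting: for each test example, take the label predicted by the
# largest number of models.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-model label lists, one label per example."""
    n_examples = len(predictions[0])
    final = []
    for i in range(n_examples):
        votes = Counter(model[i] for model in predictions)
        final.append(votes.most_common(1)[0][0])
    return final

# Three models voting on four test examples.
model_preds = [
    ["SARCASM",    "NOTSARCASM", "SARCASM", "NOTSARCASM"],
    ["SARCASM",    "SARCASM",    "SARCASM", "NOTSARCASM"],
    ["NOTSARCASM", "NOTSARCASM", "SARCASM", "SARCASM"],
]
print(majority_vote(model_preds))
# -> ['SARCASM', 'NOTSARCASM', 'SARCASM', 'NOTSARCASM']
```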

Results and Error Analysis
Here, we compare and discuss the results of our experiments. First, we summarize the results of the individual models on the test set using different variants of the data in Tables 2 & 3. From Table 2, we can observe that adding context information of specific lengths helps improve the classification performance in almost all the models. The USE results are better than those of the ELMo model, since the Transformers utilized in USE handle sequential data comparatively better than the LSTMs used in ELMo. On the other hand, BERT LARGE outperforms USE as the length of the context history increases. The highest test accuracy by BERT LARGE is obtained on the CON3 variant of the data, indicating that adding the most recent three turns of context history helps the model classify more accurately. This hypothesis is further supported by the experiments, where a similar trend occurs with the RoBERTa LARGE model. Since the results obtained by RoBERTa are comparatively better than those of the other models, we train this model once again on the same train and validation data with a different weight initialization. By doing this, we obtain a variety of models to build our final ensemble architecture. The evaluation metrics used are Precision (Pr), Recall (Re), and F1-score (F1).
As observed in Table 3, RoBERTa fine-tuned on the CON3 variant of the data outperforms all other approaches. In the case of fine-tuning PLRMs like BERT LARGE & RoBERTa LARGE on this data, we can observe the importance of the most recent three turns of context history. From the experiments, we conclude that by increasing the context history along with the utterance, the model can learn a better representation of the utterance and classify the correct class more accurately. Finally, the RoBERTa model outperforms every other model because it is an optimized and improved version of the BERT model. Table 4 summarizes the results of our various ensemble models. For ensembling, we choose different variants of the best performing models on the test data and apply majority voting to get the final test predictions. We experiment with several combinations of the models and report here the results of some of the best performing ensembles. We can observe that the ensemble model consisting of the top three individual models gives us the best results.

Conclusion & Future Work
In this work, we have presented an effective methodology to tackle the sarcasm detection task on the Twitter dataset by framing it as a binary classification problem. We showed that by fine-tuning PLRMs on a given utterance along with its specific-length context history, we could successfully classify the utterance as sarcastic or not. We experimented with different lengths of context history and concluded that by taking into account the most recent three conversation turns, the model was able to obtain the best results. The fine-tuned RoBERTa LARGE model outperformed all other models we experimented with in terms of precision, recall, and F1 score. We also demonstrated that we could obtain a significant gain in performance by using a simple ensembling technique called majority voting.
In the future, we would like to explore these PLRMs on other publicly available datasets. We also aim to dive deeper into the context history and derive insights about which contextual parts help the model improve the classification result. We also wish to investigate more complex ensemble techniques to observe the resulting performance gain.