A Report on the 2020 Sarcasm Detection Shared Task

Detecting sarcasm and verbal irony is critical for understanding people’s actual sentiments and beliefs. Thus, the field of sarcasm analysis has become a popular research problem in natural language processing. As the community working on computational approaches for sarcasm detection is growing, it is imperative to conduct benchmarking studies to analyze the current state-of-the-art, facilitating progress in this area. We report on the shared task on sarcasm detection we conducted as a part of the 2nd Workshop on Figurative Language Processing (FigLang 2020) at ACL 2020.


Introduction
Sarcasm and verbal irony are types of figurative language in which speakers usually mean the opposite of what they say. Recognizing whether a speaker is ironic or sarcastic is essential for downstream applications to correctly understand speakers' intended sentiments and beliefs. Consequently, in the last decade, the problem of irony and sarcasm detection has attracted considerable interest from computational linguistics researchers. The task has usually been framed as a binary classification task (sarcastic vs. non-sarcastic) using either the utterance in isolation or adding contextual information such as conversation context, author context, visual context, or cognitive features (Davidov et al., 2010; Tsur et al., 2010; González-Ibáñez et al., 2011; Riloff et al., 2013; Maynard and Greenwood, 2014; Wallace et al., 2014; Ghosh et al., 2015; Muresan et al., 2016; Amir et al., 2016; Mishra et al., 2016; Ghosh and Veale, 2017; Felbo et al., 2017; Ghosh et al., 2017; Hazarika et al., 2018; Tay et al., 2018; Oprea and Magdy, 2019; Majumder et al., 2019; Castro et al., 2019; Ghosh et al., 2019).
In this paper, we report on the shared task on sarcasm detection that we conducted as part of the 2nd Workshop on Figurative Language Processing (FigLang 2020) at ACL 2020. The task aims to study the role of conversation context for sarcasm detection. Two types of social media content are used as training data for the two tracks: a microblogging platform (Twitter) and an online discussion forum (Reddit). Table 1 and Table 2 show examples of three-turn dialogues, where Response is the sarcastic reply. Without using the conversation context (the Context_i turns), it is difficult to identify the sarcastic intent expressed in Response. The shared task is designed to benchmark the usefulness of modeling the entire conversation context (i.e., all the prior dialogue turns) for sarcasm detection.
Section 2 discusses the current state of research on sarcasm detection with a focus on the role of context. Section 3 provides a description of the shared task, datasets, and metrics. Section 4 contains brief summaries of each of the participating systems whereas Section 5 reports a comparative evaluation of the systems and our observations about trends in designs and performance of the systems that participated in the shared task.

Table 2: Sarcastic replies to conversation context in Twitter. The Response turn is a reply to the Context 2 turn, which is a reply to the Context 1 turn.
Turns: Message
Context 1: This is the greatest video in the history of college football.
Context 2: Hes gonna have a short career if he keeps smoking . Not good for your health
Response: Awesome !!! Everybody does it. That's the greatest reason to do something.

Related Work
A considerable amount of work on sarcasm detection has considered the utterance in isolation when predicting the sarcastic or non-sarcastic label. Initial approaches used feature-based machine learning models that rely on different types of features, from lexical (e.g., sarcasm markers, word embeddings) to pragmatic, such as emoticons or learned patterns of contrast between positive sentiment and negative situations (Davidov et al., 2010; Veale and Hao, 2010; González-Ibáñez et al., 2011; Liebrecht et al., 2013; Riloff et al., 2013; Maynard and Greenwood, 2014; Ghosh et al., 2015; Ghosh and Muresan, 2018). Recently, deep learning methods have been applied to this task (Ghosh and Veale, 2016; Tay et al., 2018). For excellent surveys on sarcasm and irony detection, see Wallace (2015) and Joshi et al. (2017). However, even humans sometimes have difficulty recognizing sarcastic intent when considering an utterance in isolation (Wallace et al., 2014). Recently, an increasing number of researchers have started to explore the role of contextual information for irony and sarcasm analysis. The term context loosely refers to any information that is available beyond the utterance itself (Joshi et al., 2017). A few researchers have examined author context (Bamman and Smith, 2015; Khattri et al., 2015; Rajadesingan et al., 2015; Amir et al., 2016; Ghosh and Veale, 2017), multi-modal context (Schifanella et al., 2016; Cai et al., 2019; Castro et al., 2019), eye-tracking information (Mishra et al., 2016), or conversation context (Bamman and Smith, 2015; Wang et al., 2015; Joshi et al., 2016; Zhang et al., 2016; Ghosh et al., 2017; Ghosh and Veale, 2017).
Regarding shared tasks on figurative language analysis, Van Hee et al. (2018) recently conducted a SemEval task on irony detection in Twitter focusing on utterances in isolation. Besides the binary classification task of identifying ironic tweets, the authors also conducted a multi-class irony classification task to identify the specific type of irony: whether a tweet contains verbal irony, situational irony, or other types of irony. In contrast, the current shared task aims to study the role of conversation context for sarcasm detection. In particular, we focus on benchmarking the effectiveness of modeling the conversation context (e.g., all the prior dialogue turns or a subset of the prior dialogue turns) for sarcasm detection.

Task Description
The design of our shared task is guided by two specific issues. First, we plan to leverage a particular type of context, the entire prior conversation context, for sarcasm detection. Second, we plan to investigate the systems' performance on conversations from two types of social media platforms: Twitter and Reddit. Both of these platforms allow writers to mark whether their messages are sarcastic (e.g., the #sarcasm hashtag in Twitter and the "/s" marker in Reddit).
The competition is organized in two phases: training and evaluation. By making available common datasets and frameworks for evaluation, we hope to contribute to the consolidation and strengthening of the growing community of researchers working on computational approaches to sarcasm analysis.
Khodak et al. (2017) introduced the self-annotated Reddit Corpus, a very large collection of sarcastic and non-sarcastic posts (over one million) curated from different subreddits such as politics, religion, sports, technology, etc. This corpus contains self-labeled sarcastic posts, where users label their posts as sarcastic by appending the "/s" marker to the end of the post. For any such sarcastic post, the corpus also provides the full conversation context, i.e., all the prior turns that took place in the dialogue.

Reddit Training Dataset
We select the training data for the Reddit track from Khodak et al. (2017), applying a couple of criteria. First, we choose sarcastic responses with at least two prior turns. Note that for many responses in our training corpus the number of prior turns is much larger. Second, we curated sarcastic responses from a variety of subreddits such that no single subreddit (e.g., politics) dominates the training corpus. In addition, we avoid responses from subreddits that we believe are too specific and narrow (e.g., a subreddit dedicated to a specific video game) and that might not generalize well. The non-sarcastic partition of the training dataset is collected from the same set of subreddits that are used to collect sarcastic responses. We ended up selecting 4,400 posts (as well as their conversation context) for the training dataset, equally balanced between sarcastic and non-sarcastic posts.

Twitter Training Dataset
For the Twitter dataset, we have relied upon the annotations that users assign to their tweets using hashtags. The sarcastic tweets were collected using the hashtags #sarcasm and #sarcastic. As non-sarcastic utterances, we consider sentiment tweets, i.e., we adopt the methodology proposed in related work (Muresan et al., 2016). Such sentiment tweets do not contain the sarcasm hashtags but include hashtags that contain positive or negative sentiment words. The positive tweets express direct positive sentiment and are collected based on positive hashtags such as #happy, #love, #lucky. Likewise, the negative tweets express direct negative sentiment and are collected based on negative hashtags such as #sad, #hate, #angry. Classifying sarcastic utterances against sentiment utterances is a considerably harder task than classifying against random objective tweets, since many sarcastic utterances also contain sentiment terms. Because we rely on self-labeled tweets, it is always possible that sarcastic tweets were mislabeled with sentiment hashtags or that users did not use the #sarcasm hashtag at all. We manually evaluated around 200 sentiment tweets and found very few such cases in the training corpus.
Similar to the Reddit dataset, we apply a few criteria while selecting the training dataset. First, we select sarcastic or non-sarcastic tweets only when they appear in a dialogue (i.e., the tweet begins with the "@"-user symbol) and have at least two prior turns as conversation context. Second, for the non-sarcastic posts, we maintain a strict upper limit (i.e., no greater than 10%) for any sentiment hashtag. Third, we apply heuristics such as avoiding short tweets, discarding tweets with only multiple URLs, etc. We end up selecting 5,000 tweets for training, balanced between sarcastic and non-sarcastic tweets.
Figure 1 presents a plot of the training utterances by context length for the Reddit and Twitter tracks. Although the numbers are comparable for utterances with context length equal to two or three, the Twitter corpus contains a much larger share of utterances with longer context (i.e., more prior turns).
Figure 1: Plot of Reddit (blue) and Twitter (orange) training datasets on the basis of context length. X-axis represents context length (i.e., number of prior turns) and Y-axis represents the % of training utterances.
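The Twitter selection criteria described in this section can be illustrated with a minimal sketch. It is not the organizers' collection code: the function names, the `MIN_TOKENS` threshold, and the exact reading of "short tweets" and "only multiple URLs" are assumptions made for illustration, and the 10% per-hashtag cap is omitted because it operates on the whole corpus rather than on a single tweet.

```python
import re
from typing import List, Optional

# Hashtags named in the text; the real collection may have used more.
SARCASM_TAGS = {"#sarcasm", "#sarcastic"}
SENTIMENT_TAGS = {"#happy", "#love", "#lucky", "#sad", "#hate", "#angry"}

URL_RE = re.compile(r"https?://\S+")
MIN_TOKENS = 3  # assumed threshold for "short tweets"; not given in the paper


def label_tweet(text: str) -> Optional[str]:
    """Assign a self-label from hashtags, or None if no label applies."""
    tags = {t.lower() for t in re.findall(r"#\w+", text)}
    if tags & SARCASM_TAGS:
        return "sarcastic"
    if tags & SENTIMENT_TAGS:
        return "non-sarcastic"
    return None


def is_eligible(text: str, prior_turns: List[str]) -> bool:
    """Keep replies with at least two prior turns and some non-URL content."""
    if not text.startswith("@"):      # must be part of a dialogue
        return False
    if len(prior_turns) < 2:          # at least two prior turns as context
        return False
    content = URL_RE.sub("", text)    # drop URLs before measuring length
    tokens = [t for t in content.split() if not t.startswith("@")]
    return len(tokens) >= MIN_TOKENS
```

A tweet would enter the training pool only if `label_tweet` returns a label and `is_eligible` returns `True`; balancing the two classes would be a separate sampling step.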

Evaluation Data
The Twitter data for evaluation is curated similarly to the training data. For Reddit, we do not use Khodak et al. (2017); instead, we collected new sarcastic and non-sarcastic responses from Reddit. First, for sarcastic responses we use the same set of subreddits as in the training dataset, thus keeping the same genre between the evaluation and training data. For the non-sarcastic partition, we use the same set of subreddits and submission threads as for the sarcastic partition. For both tracks, the evaluation dataset contains 1,800 instances partitioned equally between the sarcastic and the non-sarcastic categories.

Training Phase
In the first phase, data is released for training and/or development of sarcasm detection models (both Reddit and Twitter). Participants can choose to partition the training data further to a validation set for preliminary evaluations and/or tuning of hyper-parameters. Likewise, they can also elect to perform cross-validation on the training data.

Evaluation Phase
In the second phase, instances for evaluation are released. Each participating system generated predictions for the evaluation instances, for up to N models (see footnote 1). Predictions are submitted to the CodaLab site and evaluated automatically against the gold labels. CodaLab is an established platform for organizing shared tasks (Leong et al., 2018) because it is easy to use, provides easy communication with the participants (e.g., allows mass-emailing), and tracks all the submissions, updating the leaderboard in real time. The metric used for evaluation is the average F1 score between the two categories, sarcastic and non-sarcastic. The leaderboards displayed the Precision, Recall, and F1 scores in descending order of the F1 scores, separately for the two tracks, Twitter and Reddit.
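As an illustration of the evaluation metric, the following is a minimal sketch of F1 averaged over the two categories. It assumes the average is the unweighted mean of the per-class F1 scores (macro-F1); the exact computation in the CodaLab scorer is not given here, and the label strings and function names are placeholders.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], cls: str) -> float:
    """F1 score for one class, treating that class as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    classes: Sequence[str] = ("sarcastic", "non-sarcastic"),
) -> float:
    """Unweighted mean of per-class F1 over the given classes."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)
```

Because the evaluation sets are balanced, this unweighted mean gives both categories equal influence on the final score.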

Systems
The shared task started on January 19, 2020, when the training data was made available to all the registered participants. We released the evaluation data on February 25, 2020. Submissions were accepted until March 16, 2020. Overall, we received an overwhelming number of submissions: 655 for the Reddit track and 1070 for the Twitter track. The CodaLab leaderboard showcases results from 39 systems for the Reddit track and 38 systems for the Twitter track, respectively. Out of all submissions, 14 shared task system papers were submitted. In the following section we summarize each system paper. We also put forward a comparative analysis based on their performance and the choice of features/models in Section 5. Interested readers can refer to the individual teams' papers for more details. But first, we discuss the baseline classification model that we used.

Baseline Classifier
As the baseline, we use previously published work that used conversation context to detect sarcasm from social media platforms such as Twitter and Reddit (Ghosh et al., 2018). Ghosh et al. (2018) proposed a dual LSTM architecture with hierarchical attention, where one LSTM models the conversation context and the other models the sarcastic response. The hierarchical attention (Yang et al., 2016) implements two levels of attention: one at the word level and another at the sentence level. We used their system based on only the immediate conversation context (i.e., the immediate prior turn; see footnote 2). This is denoted as LSTM_attn in Table 3 and Table 4.
Footnote 1: N is set to 999.
Footnote 2: https://github.com/Alex-Fabbri/deep_learning_nlp_sarcasm

System Descriptions
We describe the participating systems in the following section (in alphabetical order).
abaruah (Baruah et al., 2020): Fine-tuned a BERT model (Devlin et al., 2018) and reported results with varying maximum sequence length (corresponding to varying levels of context inclusion, from just the response to the entire context). They also reported results of a BiLSTM with FastText embeddings (of response and entire context) and an SVM based on char n-gram features (again on both response and entire context). One interesting result was that the SVM with discrete features performed better than the BiLSTM. They achieved their best results with BERT on the response and the most immediate context.
ad6398 (Kumar and Anand, 2020): Reported results comparing multiple transformer architectures (BERT, SpanBERT (Joshi et al., 2020), RoBERTa) both in single-sentence classification (with concatenated context and response string) and sentence-pair classification (with context and response being separate inputs to a Siamese-type architecture). Their best result was obtained with a RoBERTa + LSTM model.
aditya604 (Avvaru et al., 2020): Used BERT on a simple concatenation of the last-k context texts and the response text. The authors included details of data cleaning (de-emojification, hashtag text extraction, apostrophe expansion) as well as experiments on other architectures (LSTM, CNN, XLNet) and varying sizes of context (5, 7, complete) in their report. The best results were obtained by BERT with a context of length 7 for the Twitter dataset and BERT with a context of length 5 for the Reddit dataset.
amitjena40 (Jena et al., 2020): Used a time-series-analysis-inspired approach for integrating context. Each text in the conversational thread (context and response) was individually scored using BERT, and Simple Exponential Smoothing (SES) was used to get the probability of the final response being sarcastic. They used the final response label as a pseudo-label for scoring the context entries, which is not theoretically grounded: if the final response is sarcastic, the previous context dialogue cannot be assumed to be sarcastic (with respect to its preceding dialogue). However, the effect of this error is attenuated by the exponentially decreasing contribution of context to the final label under the SES scheme.
AnandKumaR (Khatri and P, 2020): Experimented with traditional ML classifiers such as SVM and Logistic Regression over embeddings from BERT and GloVe (Pennington et al., 2014). Using BERT as a feature extraction method, as opposed to fine-tuning it, was not beneficial, and Logistic Regression over GloVe embeddings outperformed it in their experiments. Context was used in their best model, but no details were available about the depth of context usage (full vs. immediate). Additionally, they only experimented with Twitter data and no submission was made to the Reddit track. They provided details of the data cleaning measures for their experiments, which involved stopword removal, lowercasing, stemming, punctuation removal, and spelling normalization.
andy3223 (Dong et al., 2020): Used transformer-based architectures for sarcasm detection, reporting the performance of three architectures: BERT, RoBERTa, and ALBERT (Lan et al., 2019). They considered two models: the target-oriented model, where only the target (i.e., the sarcastic response) is modeled, and the context-aware model, where the context is also modeled with the target. The authors conducted an extensive hyper-parameter search, set the learning rate to 3e-5 and the number of epochs to 30, and used different seed values (21, 42, 63) for three runs. Additionally, they set the maximum sequence length to 128 for the target-oriented models and to 256 for the context-aware models.
burtenshaw (Lemmens et al., 2020): Employed an ensemble of four models: LSTM (on word, emoji and hashtag representations), CNN-LSTM (on GloVe embeddings with discrete punctuation and sentiment features), MLP (on sentence embeddings through Infersent (Conneau et al., 2017)), and SVM (on character and stylometric features). The first three models (except SVM) used the last two immediate contexts along with the response.

duke DS (Gregory et al., 2020): The authors conducted an extensive set of experiments using discrete features, DNNs, and transformer models, but reported results only on the Twitter track. Regarding discrete features, one of the novelties in their approach is including a predictor to identify whether the tweet is political or not, since many sarcastic tweets are on political topics. Regarding the models, the best-performing model is an ensemble of five transformers: BERT-base-uncased, RoBERTa-base, XLNet-base-cased, RoBERTa-large, and ALBERT-base-v2.

kalaivani.A (kalaivani A and D, 2020): Compared traditional machine learning classifiers (e.g., Logistic Regression, Random Forest, XGBoost, Linear SVC, Gaussian Naive Bayes) on discrete bag-of-words features and Doc2Vec features with LSTM models on Word2Vec embeddings (Mikolov et al., 2013) and BERT models. For context usage, they report results on using the isolated response, the isolated context, and the context and response combined (it is unclear how deep the context usage is). The best performance in their experiments was achieved by BERT on the isolated response.
miroblog (Lee et al., 2020): Implemented a classifier composed of BERT followed by BiLSTM and NeXtVLAD (Lin et al., 2018), a differentiable pooling mechanism that empirically performed better than Mean/Max pooling. They employed an ensembling approach for including context of varying length and reported that gains in F1 beyond a context of length three are negligible. With these two contributions alone, their model outperformed all others. Additionally, they devised a novel approach to data augmentation (i.e., Contextual Response Augmentation) from unlabelled conversational contexts based on the next sentence prediction confidence score of BERT. Leveraging large-scale unlabelled conversation data from the web, their model outperformed the second-best system by 14% and 8.4% for Twitter and Reddit respectively (absolute F1 score).
salokr/vaibhav (Srivastava et al., 2020): Employed a CNN-LSTM based architecture on BERT embeddings to utilize the full context thread and the response. The entire context, after encoding through BERT, is passed through CNN and LSTM layers to get a representation of the context. Convolution and dense layers over this summarized context representation and the BERT encoding of the response make up the final classifier.
taha (ataei et al., 2020): Reported experiments comparing SVM on character n-gram features, LSTM-CNN models, Transformer models, as well as a novel usage of aspect-based sentiment classification approaches such as Interactive Attention Networks (IAN) (Ma et al., 2017), Local Context Focus (LCF)-BERT (Zeng et al., 2019), and BERT-Attentional Encoder Network (AEN) (Song et al., 2019). For the aspect-based approaches, they viewed the last dialogue turn of the conversational context as the aspect of the target response. LCF-BERT was their best model for the Twitter task, but due to computational resource limitations they were not able to try it for the Reddit task (where BERT on just the response text performed best).
tanvidadu (Dadu and Pant, 2020): Fine-tuned a RoBERTa-large model (355 million parameters with a vocabulary of over 50K) on the response and its two immediate contexts. They reported results on three different types of inputs: a response-only model, concatenation of the two immediate contexts with the response, and using an explicit separator token between the response and the final context. The best result is reported in the setting where they used the separator token.
Table 3 and Table 4 present the results for the Reddit track and the Twitter track, respectively. We show the rank of the submitted systems (best result from their submitted reports) both in terms of the system submissions (out of 14) and in terms of their rank on the CodaLab leaderboard. Note that for a couple of entries we observe a discrepancy between their best reported system(s) and the leaderboard entries. For the sake of fairness, in such cases we selected the leaderboard entries to present in Table 3 and Table 4 (see footnote 4). Also, out of the 14 system descriptions, duke DS and AnandKumaR report performance on the Twitter dataset only. Across both tracks, we observe that the majority of the models outperformed the LSTM_attn baseline (Ghosh et al., 2018). Almost all the submitted systems used the transformer architecture, which seems to perform better than RNN architectures, even without any task-specific fine-tuning. Although most of the models are similar and perform comparably, one particular system, miroblog, outperformed the other models in both tracks, improving over the 2nd-ranked system by more than 7% F1-score in the Reddit track and by 14% F1-score in the Twitter track.
Footnote 4: Also, for such cases (e.g., abaruah), under the Approach column we reported the approach described in the system paper, which does not necessarily reflect the scores of Table 3.
Table 4: Performance of the best system per team and baseline for the Twitter track. We include two ranks: ranks from the submitted systems as well as the Leaderboard ranks from the CodaLab site.

Results and Discussions
In the following paragraphs, we inspect the performance of the different systems more closely. We discuss a couple of particular aspects.
Context Usage: One of the prime motivating factors for conducting this shared task was to investigate the role of contextual information. We notice the most common approach for integrating context was simply concatenating it with the response text. Novel approaches include:
1. Taking immediate context as aspect for response in Aspect-based Sentiment Classification architectures (taha)
2. CNN-LSTM based summarization of entire context thread (salokr)
3. Time-series fusion with proxy labels for context (amitjena40)
4. Ensemble of multiple models with different depth of context (miroblog)
5. Using explicit separator between context and response when concatenating (tanvidadu)
Depth of Context: Results suggest that beyond three context turns, gains from context information are negligible and may also reduce the performance due to sparsity of long context threads. The depth of context required is dependent on the architecture and CNN-LSTM based summarization of context thread (salokr) was the only approach that effectively used the whole dialogue.
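Two of the context-integration strategies discussed above can be sketched in a few lines: concatenating the last k context turns with the response using an explicit separator, and fusing per-turn sarcasm scores with Simple Exponential Smoothing. This is an illustrative sketch, not any team's implementation. The function names, the default `k`, the separator string (the actual token depends on the tokenizer), and the smoothing factor `alpha` are all assumptions, and the per-turn scores are taken as given rather than produced by a BERT classifier.

```python
from typing import List, Sequence


def build_input(context: Sequence[str], response: str,
                k: int = 3, sep: str = " [SEP] ") -> str:
    """Join the last k context turns and the response with a separator.

    `sep` is a placeholder; a real system would use its tokenizer's
    own separator token.
    """
    recent: List[str] = list(context[-k:]) if k > 0 else []
    return sep.join(recent + [response])


def ses_fuse(turn_scores: Sequence[float], alpha: float = 0.5) -> float:
    """Simple Exponential Smoothing over per-turn scores.

    `turn_scores` is ordered oldest turn first, with the response last,
    so earlier turns contribute exponentially less to the result.
    """
    if not turn_scores:
        raise ValueError("turn_scores must not be empty")
    level = turn_scores[0]
    for score in turn_scores[1:]:
        level = alpha * score + (1 - alpha) * level
    return level
```

For example, `build_input(["a", "b", "c", "d"], "r", k=2)` keeps only the two most recent turns, which matches the observation that context beyond about three turns adds little.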
Discrete vs. Embedding Features: The leaderboard was dominated by Transformer-based architectures, and we saw submissions using BERT, RoBERTa, and other variants. Other sentence embedding architectures such as Infersent and CNN/LSTM over word embeddings were also used but had middling performance. Discrete features were involved in only two submissions (burtenshaw and duke DS) and were the focus of the burtenshaw system.
Leveraging other datasets: The large difference between the best model (miroblog) and the other systems can be attributed to their dataset augmentation strategies. Using just the context thread as a negative example when the context+response is a positive example is a straightforward approach for augmentation from labeled dialogues. Their novel contribution lies in leveraging large-scale unlabelled dialogue threads, showing another use of BERT by using the NSP confidence score for assigning pseudo-labels.
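The labeled-dialogue augmentation mentioned above can be illustrated with a small sketch. It reflects one possible reading of the description, not the miroblog implementation: for each sarcastic example, the context thread alone is re-cast as a non-sarcastic example by treating its last turn as the response. The dictionary keys, the label encoding (1 for sarcastic, 0 for non-sarcastic), and the function name are assumptions, and the NSP-based pseudo-labeling of unlabelled threads is not covered.

```python
from typing import Dict, List


def augment_with_context_negatives(examples: List[Dict]) -> List[Dict]:
    """Add one negative example per sarcastic dialogue.

    Each example is a dict with keys "context" (list of turns, oldest
    first), "response" (str) and "label" (1 = sarcastic, 0 = not).
    """
    augmented = list(examples)
    for ex in examples:
        # Need at least two context turns so the new example keeps some context.
        if ex["label"] == 1 and len(ex["context"]) >= 2:
            augmented.append({
                "context": ex["context"][:-1],   # drop the last context turn
                "response": ex["context"][-1],   # ...and use it as the response
                "label": 0,                      # assumed to be non-sarcastic
            })
    return augmented
```

The assumption that a context turn is non-sarcastic is itself noisy, so in practice such examples would likely need to be checked or down-weighted.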
Analysis of predictions: Finally, we conducted an error analysis based on the predictions of the systems. We particularly focused on addressing two questions. First, we investigate whether any particular pattern exists in the evaluation instances that are wrongly classified by the majority of the systems. Second, we compare the predictions of the top-performing systems to identify instances correctly classified by the candidate system but missed by the remaining systems. Here, we attempt to recognize specific characteristics that are unique to a model, if any. Instead of looking at the predictions of all the systems, we decided to analyze only the top-three submissions in both tracks because of their high performance. We identify 80 instances (30 sarcastic) from the Reddit evaluation dataset and 20 instances (10 sarcastic) from the Twitter evaluation set, respectively, that are missed by all the top-performing systems. Our interpretation of this finding is that these test instances belong to a variety of topics, including sarcastic remarks on baseball teams, internet bills, vaccination, etc., to which models probably do not generalize well from the training data. For both Twitter and Reddit, we also found many sarcastic examples that contain common non-sarcastic markers such as laughs (e.g., "haha"), jokes, and, in the Twitter track, positive-sentiment emoticons (e.g., ":)"). We did not find any correlation with context length: the instances have varied context lengths, from two to six.
While analyzing the predictions of individual systems, we noted that miroblog correctly classifies the largest number of instances in both tracks. In fact, miroblog has successfully predicted over two hundred examples (with almost equal distribution of sarcastic and non-sarcastic instances) in comparison to the second-ranked and third-ranked systems for both tracks. As stated earlier, this can be attributed to their data augmentation strategies, which have helped miroblog's models to generalize best. However, we still notice that instances with subtle humor or positive sentiment are missed by the best-performing models even if they are pretrained on very large-scale corpora. We foresee that models that are able to detect subtle humor or witty wordplay will perform even better in a sarcasm detection task.

Conclusion
This paper summarizes the results of the shared task on sarcasm detection using conversations from two social media platforms (Reddit and Twitter), organized as part of the 2nd Workshop on Figurative Language Processing at ACL 2020. This shared task aimed to investigate the role of conversation context for sarcasm detection. The goal was to understand how much conversation context is needed or helpful for sarcasm detection. For Reddit, the training data was sampled from the standard corpus from Khodak et al. (2017), whereas we curated a new evaluation dataset. For Twitter, both the training and the test datasets are new and were collected using standard hashtags. We received 655 submissions (from 39 unique participants) and 1070 submissions (from 38 unique participants) for the Reddit and Twitter tracks, respectively. We provided brief descriptions of each of the participating systems for which a shared task paper was submitted (14 systems).
We notice that almost every submitted system has used transformer-based architectures, such as BERT, RoBERTa, and other variants, emphasizing the increasing popularity of pre-trained language models for various classification tasks. The best systems, however, have employed a clever mix of ensemble techniques and/or data augmentation setups, which seem to be a promising direction for future work. We hope that some of the teams will make their implementations publicly available, which would facilitate further research on improving performance on the sarcasm detection task.