Transformer-based Context-aware Sarcasm Detection in Conversation Threads from Social Media

We present a transformer-based sarcasm detection model that accounts for the context of the entire conversation thread to make more robust predictions. Our model uses deep transformer layers to perform multi-head attention across the target utterance and the relevant context in the thread. The context-aware models are evaluated on two social media datasets, from Twitter and Reddit, and show 3.1% and 7.0% improvements over their baselines. Our best models achieve F1 scores of 79.0% and 75.0% on the Twitter and Reddit datasets respectively, making ours one of the highest performing systems among the 36 participants in this shared task.


Introduction
Sarcasm is a form of figurative language that implies a negative sentiment while displaying a positive sentiment on the surface (Joshi et al., 2017). Because of its conflicting nature and linguistic subtlety, sarcasm detection has been considered one of the most challenging tasks in natural language processing. Furthermore, when sarcasm is used on social media platforms such as Twitter or Reddit to express users' nuanced intents, the language is often full of spelling errors, acronyms, slang, emojis, and special characters, which adds another level of difficulty to this task.
Despite its challenges, sarcasm detection has recently gained substantial attention because it can add a crucial layer to deep contextual understanding in various applications such as author profiling, harassment detection, and irony detection (Van Hee et al., 2018). Many computational approaches have been proposed to detect sarcasm in conversations (Ghosh et al., 2015; Joshi et al., 2015, 2016). However, most of the previous studies use the utterances in isolation, which makes it hard even for humans to detect sarcasm without the context. Thus, it is essential to interpret the target utterance along with contextual information comprising textual features from the conversation thread, metadata about the conversation from external sources, or visual context (Bamman and Smith, 2015; Ghosh and Veale, 2017; Ghosh et al., 2018).
This paper presents a transformer-based sarcasm detection model that takes both the target utterance and its context and predicts whether the target utterance involves sarcasm. Our model uses a transformer encoder to coherently generate the embedding representations for the target utterance and the context by performing multi-head attention (Section 4). This approach is evaluated on two datasets collected from Twitter and Reddit (Section 3), and shows significant improvements over the baseline using only the target utterance as input (Section 5). Our error analysis illustrates that the context-aware model can catch subtle nuances that cannot be captured by the target-oriented model (Section 6).

Related Work
Like most other types of figurative language, sarcasm is not necessarily complicated to express, but understanding it requires comprehensive context and commonsense knowledge rather than its literal sense alone (Van Hee et al., 2018). Various approaches have been presented for this task.
Most earlier work took the target utterance without context as input. Both explicit and implicit incongruity features were explored (Joshi et al., 2015). To detect whether certain words in the target utterance involve sarcasm, several approaches based on distributional semantics were proposed (Ghosh et al., 2015). Additionally, word embedding-based features such as distance-weighted similarities were adopted to capture subtle forms of context incongruity (Joshi et al., 2016). Nonetheless, it is difficult to detect sarcasm by considering only the target utterances in isolation.
Non-textual features such as the properties of the author, audience, and environment were also taken into account (Bamman and Smith, 2015). Both linguistic and context features were used to distinguish between information-seeking and rhetorical questions in forums and tweets (Oraby et al., 2017). Traditional machine learning methods such as Support Vector Machines were used to model sarcasm detection as a sequential classification task over the target utterance and its surrounding utterances (Wang et al., 2015). Recently, deep learning methods using LSTMs were introduced that consider the prior turns as well as the succeeding turns (Ghosh et al., 2018).

Data Description
Given a conversation thread from either Twitter or Reddit, the target utterance is the turn to be predicted as involving sarcasm or not, and the context is an ordered list of the other utterances in the thread. Table 1 shows examples of conversation threads where the target utterances involve sarcasm.

(a) Sarcasm example from Twitter:
C1: This feels apt this morning but I don't feel fine ... <URL>
C2: @USER it is what's going round in the heads of many I know ...
T:  @USER @USER I remember a few months back we were saying the Americans shouldn't tell us how to vote on brexit

(b) Sarcasm example from Reddit:
C1: Promotional images for some guy's Facebook page
C2: I wouldn't let that robot near me
T:  Sounds like you don't like science, you theist sheep

Table 1: Examples of conversation threads where the target utterances involve sarcasm. Ci: i'th utterance in the context, T: the target utterance.

The Twitter data is collected by using the hashtags #sarcasm and #sarcastic. The Reddit data is a subset of the Self-Annotated Reddit Corpus, which consists of 1.3 million sarcastic and non-sarcastic posts (Khodak et al., 2017). Every target utterance is annotated with one of two labels, SARCASM and NOT_SARCASM. Table 2 shows the statistics of the two datasets provided by this shared task.

Approach
Two types of transformer-based sarcasm detection models are used for our experiments:
a) The target-oriented model takes only the target utterance as input (Section 4.1).
b) The context-aware model takes both the target utterance and the context utterances as input (Section 4.2).
These two models are coupled with the latest transformer encoders, e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2020), and ALBERT (Lan et al., 2019), and compared to evaluate how much impact the context makes on predicting whether or not the target utterance involves sarcasm.

Target-oriented Model
Figure 1a shows the overview of the target-oriented model. Let W = {w_1, ..., w_n} be the input target utterance, where w_i is the i'th token in W and n is the maximum number of tokens in any target utterance. W is first prepended with the special token c representing the entire target utterance, which creates the input sequence I_to = {c} ⊕ W. I_to is then fed into the transformer encoder, which generates the sequence of embeddings {e_c} ⊕ E_w, where E_w = {e_{w_1}, ..., e_{w_n}} is the embedding list for W and (e_c, e_{w_i}) are the embeddings of (c, w_i) respectively. Finally, e_c is fed into the linear decoder to generate the output vector o_to that makes the binary decision of whether or not W involves sarcasm.

Context-aware Model
Figure 1b shows the overview of the context-aware model. Let L_i be the i'th utterance in the context. Then V = L_1 ⊕ ··· ⊕ L_k = {v_1, ..., v_m} is the concatenated list of tokens in all context utterances, where k is the number of utterances in the context, v_1 is the first token in L_1, and v_m is the last token in L_k. The input sequence I_to from Section 4.1 is appended with the special token s, representing the separator between the target utterance and the context, and then with V, which creates the input sequence I_ca = I_to ⊕ {s} ⊕ V. I_ca is then fed into the transformer encoder, which generates the sequence of embeddings {e_c} ⊕ E_w ⊕ {e_s} ⊕ E_v, where E_v = {e_{v_1}, ..., e_{v_m}} is the embedding list for V, and (e_s, e_{v_i}) are the embeddings of (s, v_i) respectively. Finally, e_c is fed into the linear decoder to generate the output vector o_ca that makes the same binary decision to detect sarcasm.
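The construction of the two input sequences can be sketched as follows. This is a minimal illustration, using BERT-style [CLS] and [SEP] strings to stand in for the special tokens c and s, and plain token lists in place of the actual subword tokenizer output; the function names are ours, not the paper's implementation.

```python
def build_target_input(target_tokens):
    # I_to = {c} ⊕ W: prepend the special classification token c
    return ["[CLS]"] + target_tokens

def build_context_input(target_tokens, context_utterances):
    # V = L_1 ⊕ ... ⊕ L_k: flatten all context utterances into one token list
    context_tokens = [tok for utt in context_utterances for tok in utt]
    # I_ca = I_to ⊕ {s} ⊕ V: append the separator token s, then the context
    return build_target_input(target_tokens) + ["[SEP]"] + context_tokens

# Toy example mirroring Table 1b (Reddit)
target = ["Sounds", "like", "you", "don't", "like", "science"]
context = [["Promotional", "images"], ["I", "wouldn't", "let", "that", "robot", "near", "me"]]
i_ca = build_context_input(target, context)
```

In both cases the encoder embedding of the leading [CLS] position (e_c) is what the linear decoder consumes for the binary decision.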

Data Split
For all our experiments, a mixture of the Twitter and Reddit datasets is used. The Twitter training set provided by the shared task consists of 5,000 tweets, where the labels are equally balanced between SARCASM and NOT_SARCASM (Table 2). We find, however, that 4.82% of them are duplicates, which are removed before data splitting. As a result, 4,759 tweets are used for our experiments. Labels in the Reddit training set are also equally balanced, and no duplicates are found in this dataset.
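The duplicate-removal step above can be sketched as follows; we assume exact string matching as the duplicate criterion and keep the first occurrence of each tweet (the function name and criterion are our illustration, not the paper's code).

```python
def deduplicate(utterances):
    # Keep the first occurrence of each utterance, drop exact repeats
    seen, unique = set(), []
    for utt in utterances:
        if utt not in seen:
            seen.add(utt)
            unique.append(utt)
    return unique

tweets = ["a", "b", "a", "c", "b"]
clean = deduplicate(tweets)  # under this criterion, 5,000 tweets reduce to 4,759
```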

Models
Three types of transformers are used for our experiments, namely BERT-Large (Devlin et al., 2019), RoBERTa-Large (Liu et al., 2020), and ALBERT-xxLarge (Lan et al., 2019), to compare performance among current state-of-the-art encoders. Every model is run three times, and the average scores as well as standard deviations are reported. All models are trained on the combined Twitter + Reddit training set and evaluated on the combined development set (Table 3).
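Aggregating the three runs is straightforward; a minimal sketch (the function name is ours, and the sample standard deviation shown here is one common choice):

```python
import statistics

def summarize(scores):
    # Mean and sample standard deviation of per-run F1 scores
    return statistics.mean(scores), statistics.stdev(scores)

mean_f1, std_f1 = summarize([78.0, 79.0, 80.0])  # three runs of one model
```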

Experimental Setup
After an extensive hyper-parameter search, we set the learning rate to 3e-5 and the number of epochs to 30, and use different seed values, 21, 42, and 63, for the three runs. Additionally, based on the statistics of each dataset, the maximum sequence length is set to 128 for the target-oriented models and to 256 for the context-aware models, reflecting the different input sequence lengths required by the two approaches.
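The setup above can be collected into a single configuration sketch; the dict layout is our illustration, not the paper's actual code.

```python
# Hyper-parameters reported in the experimental setup
CONFIG = {
    "learning_rate": 3e-5,
    "epochs": 30,
    "seeds": [21, 42, 63],        # one seed per run; scores averaged over runs
    "max_seq_len": {
        "target_oriented": 128,   # target utterance only
        "context_aware": 256,     # target + separator + context
    },
}
```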

Results
The baseline scores are provided by the organizers: 60.0% for Reddit and 67.0% for Twitter, using a single-layer LSTM attention model (Ghosh et al., 2018). Table 4 shows the results achieved by our target-oriented (Section 4.1) and context-aware (Section 4.2) models on the combined development set. The RoBERTa-Large model gives the highest F1 scores in both the target-oriented and context-aware settings. The context-aware model using RoBERTa-Large shows an improvement of 1.1% over its target-oriented counterpart, so this model is used for our final submission to the shared task. Note that it may be possible to achieve higher performance by fine-tuning hyper-parameters for the Twitter and Reddit datasets separately, which we will explore in future work.

Analysis
For a better understanding of our final model, errors from the following three situations are analyzed (TO: target-oriented, CA: context-aware):
• TwCc: TO is wrong and CA is correct.
• TcCw: TO is correct and CA is wrong.
• TwCw: both TO and CA are wrong.
Table 6 shows examples for every error situation. For TwCc, TO predicts it to be NOT_SARCASM. In this example, it is difficult to tell whether the target utterance involves sarcasm without having the context. For TcCw, CA predicts it to be NOT_SARCASM. It appears that the target utterance is long enough to provide sufficient features for TO to make the correct prediction, whereas considering the extra context may add noise that leads CA to the incorrect decision. For TwCw, both TO and CA predict it to be NOT_SARCASM. This example seems to require deeper reasoning to make the correct prediction.
(a) Example where TO is wrong and CA is correct:
C1: who has ever cared about y * utube r * wind .
C2: @USER Back when YouTube was beginning it was a cool giveback to the community to do a super polished high production value video with YT talent . Not the same now . The better move for them would be to do like 5-6 of them in several categories to give that shine .
T:  @USER @USER I look forward to the eventual annual Tubies Awards livestream .

(b) Example where TO is correct and CA is wrong:
C1: I am asking the chairs of the House and Senate committees to investigate top secret intelligence shared with NBC prior to me seeing it.
C2: @USER Good for you, sweetie! But using the legislative branch of the US Government to fix your media grudges seems a bit much.
T:  @USER @USER @USER you look triggered after someone criticizes me, are conservatives skeptic of ppl in power?

(c) Example where both TO and CA are wrong:
C1: If I could start my #Brand over, this is what I would emulate my #Site to look like .. And I might, once my anual contract with #WordPress is up . Even tho I don't think is very; I can't help but to find ... <URL> <URL>
C2: @USER There is no design on it except for links ?
T:  @USER It's the of what #Works in this current #Mindset of #MassConsumption; wannabe fast due to caused by, and being just another and. is the light, bringing color back to this sad world of and.

Table 6: Examples of the three error situations. Ci: i'th utterance in the context, T: the target utterance.
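The three error situations can be assigned mechanically by comparing each model's prediction against the gold label; a minimal sketch with our own function name:

```python
def error_category(gold, to_pred, ca_pred):
    # Compare target-oriented (TO) and context-aware (CA) predictions to gold
    to_ok, ca_ok = to_pred == gold, ca_pred == gold
    if not to_ok and ca_ok:
        return "TwCc"  # TO wrong, CA correct
    if to_ok and not ca_ok:
        return "TcCw"  # TO correct, CA wrong
    if not to_ok and not ca_ok:
        return "TwCw"  # both wrong
    return "both_correct"  # not an error situation
```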

Conclusion
This paper explores the benefit of considering relevant context for the task of sarcasm detection. Three types of state-of-the-art transformer encoders are adapted to establish strong baselines for the target-oriented models, which are compared to the context-aware models; the latter show significant improvements on both the Twitter and Reddit datasets and become one of the highest performing models in this shared task.