Reactive Supervision: A New Method for Collecting Sarcasm Data

Sarcasm detection is an important task in affective computing, requiring large amounts of labeled data. We introduce reactive supervision, a novel data collection method that utilizes the dynamics of online conversations to overcome the limitations of existing data collection techniques. We use the new method to create and release a first-of-its-kind large dataset of tweets with sarcasm perspective labels and new contextual features. The dataset is expected to advance sarcasm detection research. Our method can be adapted to other affective computing domains, thus opening up new research opportunities.


Introduction
Sarcasm is ubiquitous in human conversations. As a form of insincere speech, the intent behind a sarcastic utterance is integral to its meaning. Perceiving a sarcastic utterance as genuine will often result in a complete reversal of the intended meaning, and vice versa (Gibbs, 1986). It is therefore crucial for affective computing systems and tasks, such as sentiment analysis and dialogue systems, to automatically detect sarcasm from the perspective of the author as well as the reader in order to avoid misunderstandings. Oprea and Magdy (2019) recently pioneered the study of intended sarcasm (by the author) vs. perceived sarcasm (by the reader) in the context of sarcasm detection tasks. The training of models for these tasks requires large amounts of labeled sarcasm data, with Twitter becoming a major source due to its popularity as a social network as well as the huge amounts of conversational text its users generate. Previous works describe three methods for collecting sarcasm data: distant supervision, manual annotation, and manual collection.
To improve quality, manual annotation asks humans to label given tweets as sarcastic or not. Since finding sarcasm in a large corpus is "a needle-in-a-haystack problem" (Liebrecht et al., 2013), manual annotation can be combined with distant supervision (Riloff et al., 2013). Still, low inter-annotator reliability is often reported (Swanson et al., 2014), resulting not only from the subjective nature of sarcasm but also from the lack of cultural context (Joshi et al., 2016). Moreover, neither method collects both sarcasm perspectives: distant supervision collects intended sarcasm, while manual annotation can only collect perceived sarcasm.
Lastly, in manual collection, humans are asked to gather and report sarcastic texts, either their own (Oprea and Magdy, 2020) or those of others (Filatova, 2012). However, both manual methods are slower and more expensive than distant supervision, resulting in smaller datasets.
To overcome the above limitations, we propose reactive supervision, a novel conversation-based method that offers automated, high-volume, "in-the-wild" collection of high-quality intended and perceived sarcasm data. We use our method to create and release the SPIRS sarcasm dataset.

Reactive Supervision
Reactive supervision exploits the frequent use in online conversations of a cue tweet: a reply that highlights sarcasm in a prior tweet. Figure 1 (left panel) shows an example: user C posts a sarcastic tweet, B replies without catching the sarcasm, and a third user A then alerts B by replying with a cue tweet (She was just being sarcastic!). Since A replies to B but refers to the sarcastic author in the 3rd person (She), C is necessarily the author of the perceived sarcastic tweet. Similarly, Figure 1 (right panel) shows how a 1st-person cue (I was just being sarcastic!) can be used to unequivocally label intended sarcasm.
To capture sarcastic tweets, we thus first search for cue tweets (using the query phrase "being sarcastic", often used in responses to sarcastic tweets), then carefully examine each cue tweet to identify the corresponding sarcastic tweet.
The following formalizes our method.

Method
Definitions We define a thread to be a sequence of tweets {t_n, t_{n-1}, ..., t_1}, where t_{i+1} is a reply to t_i, for i = 1, ..., n − 1. Tweets are listed in reverse chronological order, with t_1 being the root tweet. The corresponding author sequence is a_n a_{n-1} ... a_1, where we replace the original author names with consecutive capital letters (A, B, C, ...), starting with a_n = A. For example, Figure 1 (right panel) depicts a thread of length n = 4 with author sequence ABAC. Here a_4 = a_2 = A, a_3 = B, and a_1 = C is the author of the root tweet.
Algorithm Given a thread {t_n, t_{n-1}, ..., t_1} with cue tweet t_n by a_n = A, our aim is to identify the sarcastic tweet among {t_{n-1}, ..., t_1}. We first examine the personal subject pronoun used in the cue (I, you, s/he) and map it to a grammatical person class (1st, 2nd, 3rd). This informs us whether the sarcastic author is also the author of the cue (1st), its addressee (2nd), or another party (3rd). For each person class we then apply a heuristic to identify the sarcastic tweet.
For example, for a 1st-person cue tweet (e.g., I was just being sarcastic!), the sarcastic tweet must also be authored by A. If the earlier tweets in the thread contain exactly one tweet from A, it is unambiguously the sarcastic tweet. Otherwise, if there are two or more earlier tweets from A (or none), the sarcastic tweet cannot be unambiguously pinpointed and the entire thread is discarded. We formalize this rule by requiring the author sequence to match the regular expression /^A[^A]*(A)[^A]*$/, where the capturing group (A) corresponds to the sarcastic tweet. Representing the author sequence as a string of letters is what makes regular expressions applicable. 2nd- and 3rd-person cues produce corresponding rules and patterns. Table 1 lists the three person classes, corresponding regular expressions, and example author sequences.
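As a minimal sketch, the 1st-person rule can be applied directly with Python's `re` module (the pattern is the one given above; the helper name and return convention are ours):

```python
import re
from typing import Optional

# The 1st-person rule: the author sequence starts with the cue author A,
# and exactly one earlier tweet by A may appear; the capturing group (A)
# marks the sarcastic tweet. (Helper name is ours, not from the paper.)
FIRST_PERSON = re.compile(r"^A[^A]*(A)[^A]*$")

def find_sarcastic_index(author_sequence: str) -> Optional[int]:
    """Return the index (0 = cue tweet) of the sarcastic tweet, or None."""
    m = FIRST_PERSON.match(author_sequence)
    return m.start(1) if m else None

print(find_sarcastic_index("ABAC"))   # 2: the second A wrote the sarcastic tweet
print(find_sarcastic_index("ABACA"))  # None: two earlier tweets by A, discard
```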

Advantages
Additional Tweet Types Along with each sarcastic tweet, we collect the oblivious tweet (the unsuspecting reply to the sarcastic tweet) when available. As far as we know, this is the first work that identifies and collects oblivious texts, a new type of data that can improve research on the (mis)understanding of sarcasm, with applications such as automated assistive systems for people with emotional or cognitive disabilities. If the sarcastic tweet is a reply, we also capture the eliciting tweet, which is the tweet that evoked the sarcastic reply. We provide more details in Appendix A.

Extraction of Semantic Relations
Because it identifies the various tweet types (cue, oblivious, sarcastic, eliciting), reactive supervision can be understood more abstractly as capturing semantic dependency relations between utterances. Reactive supervision can thus be useful in the context of discourse analysis.
Context-Aware Annotation Our method uses cues from thread participants, who therefore serve as de facto annotators. As participants are familiar with the conversation's context, we overcome some quality issues of using external annotators, who are often unfamiliar with the conversation context due to cultural and social gaps (Joshi et al., 2016).
Sarcasm Perspective Previous datasets contain either intended or perceived sarcasm, but not both (Oprea and Magdy, 2019). Our method identifies and labels both intended and perceived sarcasm within the same data context: by their essence, 1stperson cue tweets capture intended sarcasm, while 2nd-and 3rd-person cues capture perceived sarcasm. We label a tweet as perceived sarcasm when at least one reader perceives the tweet as sarcastic and posts a cue tweet. Detecting perceived sarcasm is useful, for example, for training algorithms that flag sensitive texts which might be (mis)perceived as sarcastic (even by a single reader).

Faster Data Collection
We tested González-Ibáñez et al. (2011)'s distant supervision method of collecting tweets ending with #sarcasm and related hashtags, fetching 171 tweets/day on average. During the same period, our method collected 312 tweets/day on average, an 82% rate improvement.

Summary of Advantages

Table 2 summarizes the advantages of our best-of-all-worlds method over other approaches. Reactive supervision offers automated, in-the-wild, and context-aware collection of intended and perceived sarcasm data. (On the extraction of semantic relations, cf. Hearst (1992), who uses patterns to automatically extract lexical relations between words.)

Algorithm 1: Data collection pipeline.

SPIRS Dataset
We implemented reactive supervision using a 4-step pipeline (see Algorithm 1):
1. Fetch calls the Twitter Search API to collect cue tweets, using "being sarcastic" as the query.
2. Classify is a rule-based, precision-oriented classifier that labels each cue as 1st-, 2nd-, or 3rd-person according to the pronoun used (I, you, s/he). If the cue cannot be accurately classified (e.g., no pronoun is found, the cue contains multiple pronouns, or negation words are present), it is classified as unknown and discarded.
3. Traverse calls the Twitter Lookup API to retrieve the thread, starting from the cue tweet and repeatedly fetching the parent tweet up to the root tweet.
4. Finally, Match matches the thread's author sequence against the corresponding regular expression. Unmatched sequences are discarded; otherwise, the sarcastic tweet is identified and saved along with the cue tweet, as well as the eliciting and oblivious tweets when available.
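The Classify and Match steps above can be sketched as follows (the API-calling Fetch and Traverse steps are omitted; the function names, pronoun lists, and the single simplified pattern are our assumptions, not the paper's exact rules):

```python
import re

# Hedged sketch of pipeline steps 2 (Classify) and 4 (Match).
# Pronoun and negation word lists are illustrative only.
PRONOUNS = {"i": "1st", "you": "2nd", "he": "3rd", "she": "3rd"}
NEGATIONS = {"not", "never"}

def classify_cue(text):
    """Map a cue tweet to a person class, or 'unknown' (then discarded)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if NEGATIONS & set(tokens):
        return "unknown"  # e.g., "I was not being sarcastic"
    found = {PRONOUNS[t] for t in tokens if t in PRONOUNS}
    return found.pop() if len(found) == 1 else "unknown"

# One pattern per person class (only the 1st-person rule shown here).
PATTERNS = {"1st": re.compile(r"^A[^A]*(A)[^A]*$")}

def match_thread(person, author_sequence):
    """Return the index of the sarcastic tweet in the author sequence, or None."""
    pattern = PATTERNS.get(person)
    m = pattern.match(author_sequence) if pattern else None
    return m.start(1) if m else None

print(classify_cue("I was just being sarcastic!"))  # 1st
print(match_thread("1st", "ABAC"))                  # 2
```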
The pipeline collected 65K cue tweets containing the phrase "being sarcastic", and their corresponding threads, over 48 days in October and November 2019. 77% of the cues were classified as unknown and discarded, leaving 15,000 English sarcastic tweets. In addition, 10,648 oblivious and 9,156 eliciting tweets were automatically captured. Table 3 summarizes the SPIRS dataset. We added 15,000 negative instances by sampling random English tweets captured during the same period, discarding tweets with sarcasm-related words or hashtags.

Sarcastic tweets can be either root tweets or replies. We found that the majority of intended sarcasm tweets are replies (78.4%), while the majority of perceived sarcasm tweets are root tweets (77.0%). Further dataset statistics on author sequence and tweet position distributions are available in Appendices B and C.
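As a quick sanity check, the reported rates follow from the counts above (the counts are from the text; the derived figures are computed here):

```python
# Counts reported in the text; the rates below are derived from them.
cues, kept, days = 65_000, 15_000, 48

print(round(100 * (1 - kept / cues)))  # 77 -> % of cues discarded as unknown
print(round(kept / days))              # 312 -> sarcastic tweets collected per day
```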

Experiments and Analysis
We present dataset baselines for three tasks: sarcasm detection, sarcasm detection with conversation context, and sarcasm perspective classification, a new task enabled by our dataset.

Sarcasm Detection
The first experiment is sarcasm detection. We trained a total of three models: a CNN (100 filters with a kernel size of 3) and a BiLSTM (100 units), both max-pooled and Adam-optimized with a learning rate of 0.0005; data was preprocessed as described in Tay et al. (2018), and the embedding layer was preloaded with GloVe embeddings (Twitter data, 100 dimensions) (Pennington et al., 2014). We also fine-tuned a pre-trained base uncased BERT model (Devlin et al., 2019). For all three models, we used 5-fold cross-validation for training, holding out 20% of the data for testing.
Results are shown in Table 5 (top panel). BERT is the best performing model, with 70.3% accuracy. We compared SPIRS's classification results to the Ptáček et al. (2014) dataset, commonly used in sarcasm benchmarks. We found that Ptáček's accuracy is significantly higher (86.6%). We posit that this is because sarcasm is confounded with locale in the Ptáček dataset (sarcastic tweets are from worldwide users; non-sarcastic tweets are from users near Prague), so classifiers learn features correlated with locale. We tested our hypothesis by replacing our negative samples with Ptáček's, which indeed boosted accuracy by 19.1%.

Detection with Conversation Context
Our second sarcasm classification experiment uses conversation context by adding eliciting and oblivious tweets to the model. As far as we know, this is the first sarcasm-related task that uses oblivious texts. Our model concatenated the outputs of three identical 100-unit BiLSTMs (one per tweet type: sarcastic, oblivious, eliciting) and fed the result into dense layers for classification. Tweets without surrounding context were not used in this task. Results are shown in Table 5 (middle panel). Accuracy for the full-context model was 74.7% (MCC 0.398).

Ablation Study
We conducted context ablation experiments to identify the contribution of each tweet type. We found that removing the eliciting tweets reduces accuracy by 0.5% and MCC by 0.026. Removing the oblivious tweets, however, lowered accuracy by 3.4% to 71.4%, and the MCC dropped significantly by 31%, from 0.398 to 0.275. This illustrates the importance of the new oblivious text data provided in the dataset and suggests its usefulness in sarcasm-related tasks.

Perspective Classification
Taking advantage of the new labels in our dataset, we propose a new task to classify a sarcastic text's perspective: intended vs. perceived. Our results are displayed in Table 5 (bottom panel), demonstrating the superiority of BERT over the other models, with an accuracy of 68.2% and MCC of 0.366.

Error Analysis
We carefully examined the errors to analyze the causes of perspective misclassification. We observed that misclassified-as-intended tweets (e.g., "You're lost!", "Omg that was so [...]") tend to be short. We posit that longer, more informative texts make sarcasm easier to perceive; hence, short perceived sarcasm or long intended sarcasm might introduce errors. Analysis of the dataset's word count distribution supports our hypothesis (see Figure 2).

(Table 5: mean and standard deviation were calculated using 5-fold cross-validation; N is the number of instances after preprocessing; * dataset classes were balanced using majority class downsampling.)
Looking for further error sources, we inspected short intended tweets that were misclassified, for example "great friends i have!" and "My mom is so beautiful". These tweets can be read as root tweets and not as replies, yet most intended sarcasm tweets are replies while most perceived sarcasm tweets are root tweets (see Section 3). We hypothesize that the classifier learns discourse-related features (original tweet vs. reply tweet), which can lead to these errors. Further analysis of sarcasm perspective and its interplay with sarcasm pragmatics is a promising avenue for future research.

Conclusion
We present an innovative method for collecting sarcasm data that exploits the natural dynamics of online conversations. Our approach has multiple advantages over all existing methods. We used it to create and release SPIRS, a large sarcasm dataset with multiple novel features. These new features, including labels for sarcasm perspective and unique context (e.g., oblivious texts), offer opportunities for advances in sarcasm detection.
Reactive supervision is generalizable. By modifying the cue tweet selection criteria, our method can be adapted to related domains such as sentiment analysis and emotion detection, thereby advancing the quality and quantity of data collection and offering new research directions in affective computing.

A Search Pattern Production
We construct the regular expression for capturing all tweet types (sarcastic, oblivious, and eliciting) given a 3rd-person cue tweet; similar logic produces the patterns for 1st- and 2nd-person cues. The cue tweet author (A) refers to the sarcastic tweet author in the 3rd person (e.g., She was being sarcastic!); we thus assume that A's tweet is a response to a second author B, but refers to a third author C (the sarcastic author). To unambiguously pinpoint the sarcastic tweet, C can only appear once in the author sequence. Moreover, only A, B, and C can participate in the thread. Finally, C's tweet can either be a root tweet or a reply to another tweet. The combination of these constraints leads to the regular expression /^A(A*B[AB]*)(C)([AB]*)$/, where the capturing group (C) is the sarcastic tweet and ([AB]*) represents optional tweets from A or B. If the author sequence matches the regular expression, we can unambiguously identify the sarcastic author and the corresponding sarcastic tweet. We also use the search pattern to find the oblivious and eliciting tweets. We assume that the cue tweet (A) is triggered by an oblivious tweet from B. Thus, if (A*B[AB]*) contains exactly one B, we designate the corresponding tweet as oblivious. Likewise, ([AB]*) contains the eliciting tweet. Table 6 lists the search patterns for the three person classes. Note that the 2nd-person pattern does not include an oblivious tweet because A's cue tweet is a response to a sarcastic tweet from B, i.e., it is not triggered by an oblivious tweet.

B Author Sequence Distribution

Table 7 shows the most common author sequences in SPIRS. The different colors correspond to the different tweet types. The most common pattern for 1st-person cues is ABAC (as in Figure 1, right panel). AB is the most common pattern for 2nd-person cues, denoting a sarcastic root tweet followed immediately by a cue tweet (e.g., Why are you being sarcastic?). For 3rd-person cues, the most common pattern is ABC (as in Figure 1, left panel).
Note that some patterns appear in more than one person class. For example, ABA appears in both 1st-and 2nd-person classes, while ABAC appears in both 1st-and 3rd-person.
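The 3rd-person search pattern and its group logic from Appendix A can be sketched in Python (the regular expression is assembled from the groups named in the text; the helper function and its return convention are ours):

```python
import re

# 3rd-person pattern: group 1 (A*B[AB]*) holds the potential oblivious tweet,
# group 2 (C) the sarcastic tweet, and group 3 ([AB]*) the optional earlier
# tweets; the first of them is the one the sarcastic tweet replies to
# (the eliciting tweet, since the sequence is reverse chronological).
THIRD_PERSON = re.compile(r"^A(A*B[AB]*)(C)([AB]*)$")

def extract_tweet_types(seq):
    """Return indices of (oblivious, sarcastic, eliciting) tweets, or None."""
    m = THIRD_PERSON.match(seq)
    if not m:
        return None
    pre, tail = m.group(1), m.group(3)
    oblivious = m.start(1) + pre.index("B") if pre.count("B") == 1 else None
    sarcastic = m.start(2)
    eliciting = m.start(3) if tail else None
    return oblivious, sarcastic, eliciting

print(extract_tweet_types("ABC"))   # (1, 2, None): as in Figure 1, left panel
print(extract_tweet_types("ABCB"))  # (1, 2, 3): eliciting tweet present
```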

C Tweet Position Distribution
Reactive supervision enables the measurement of conversation position statistics for sarcastic tweets on Twitter. Given a thread {t_n, ..., t_i = s, ..., t_1} with cue tweet t_n, sarcastic tweet t_i = s, and root tweet t_1, we define the position of the sarcastic tweet as the distance i − 1 between the sarcastic tweet and the root. Furthermore, the cue lag is the distance n − i between the cue and the sarcastic tweet. Table 8 shows the distribution of sarcastic tweets by position and cue lag in the SPIRS dataset. Root tweets (position = 0) account for 39% of sarcastic tweets. A further 39% of sarcastic tweets are direct replies to root tweets (position = 1). Interestingly, only 25% of cue tweets are direct replies to their sarcastic targets (lag = 1), while an overwhelming 71% have a lag of 2, mostly reflecting a response to an intermediate oblivious tweet. We further find that the average thread length is 3.9 tweets, while the average lag is 1.8 tweets.
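The position and cue-lag definitions above amount to simple index arithmetic (a minimal sketch; the helper name is ours):

```python
# For a thread t_n, ..., t_1 (reverse chronological, t_1 = root), a sarcastic
# tweet t_i has position i - 1 (distance from the root) and cue lag n - i
# (distance from the cue tweet t_n).
def position_and_lag(n, i):
    """Return (position, cue_lag) for sarcastic tweet t_i in a thread of length n."""
    return i - 1, n - i

# Figure 1 (right panel): thread ABAC of length 4, sarcastic tweet t_2.
print(position_and_lag(4, 2))  # (1, 2): a direct reply to the root, cue lag 2
```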