Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue

The use of irony and sarcasm in social media allows us to study them at scale for the first time. However, their diversity has made it difficult to construct a high-quality corpus of sarcasm in dialogue. Here, we describe the process of creating a large-scale, highly diverse corpus of online debate forum dialogue, and our novel methods for operationalizing classes of sarcasm in the form of rhetorical questions and hyperbole. We show that we can use lexico-syntactic cues to reliably retrieve sarcastic utterances with high accuracy. To demonstrate the properties and quality of our corpus, we conduct supervised learning experiments with simple features, and show that we achieve both higher precision and F than previous work on sarcasm in debate forum dialogue. We apply a weakly-supervised linguistic pattern learner and qualitatively analyze the linguistic differences in each class.


Introduction
Irony and sarcasm in dialogue constitute a highly creative use of language signaled by a large range of situational, semantic, pragmatic and lexical cues. Previous work draws attention to the use of both hyperbole and rhetorical questions in conversation as distinct types of lexico-syntactic cues defining diverse classes of sarcasm (Gibbs, 2000).
Theoretical models posit that a single semantic basis underlies sarcasm's diversity of form, namely "a contrast" between expected and experienced events, giving rise to a contrast between what is said and a literal description of the actual situation (Colston and O'Brien, 2000; Partington, 2007). This semantic characterization has not been straightforward to operationalize computationally for sarcasm in dialogue. Riloff et al. (2013) operationalize this notion for sarcasm in tweets, achieving good results. Joshi et al. (2015) develop several incongruity features to capture it, but although they improve performance on tweets, their features do not yield improvements for dialogue.
Previous work on the Internet Argument Corpus (IAC) 1.0 dataset aimed to develop a high-precision classifier for sarcasm in order to bootstrap a much larger corpus (Lukin and Walker, 2013), but obtained a precision of only 0.62, with a best F of 0.57, which is not high enough for bootstrapping (Riloff and Wiebe, 2003; Thelen and Riloff, 2002). Justo et al. (2014) experimented with the same corpus, using supervised learning, and achieved a best precision of 0.66 and a best F of 0.70. Joshi et al. (2015)'s explicit congruity features achieve precision around 0.70 and best F of 0.64 on a subset of IAC 1.0.
We therefore need a larger and more diverse corpus of sarcasm in dialogue. It is difficult to gather sarcastic data efficiently, because only about 12% of the utterances in written online debate forum dialogue are sarcastic (Walker et al., 2012a), and it is difficult to achieve high reliability for sarcasm annotation (Filatova, 2012; Swanson et al., 2014; González-Ibáñez et al., 2011; Wallace et al., 2014). Thus, our contributions are:
• We develop a new, larger corpus, using several methods that filter out non-sarcastic utterances to skew the distribution toward sarcastic utterances. We put the filtered data out for annotation, and are able to achieve high annotation reliability.
• We present a novel operationalization of both rhetorical questions and hyperbole to develop subcorpora to explore the differences between them and general sarcasm.
• We show that our new corpus is of high quality by applying supervised machine learning with simple features to explore how different corpus properties affect classification results. We achieve a highest precision of 0.73 and a highest F of 0.74 on the new corpus with basic n-gram and Word2Vec features, showcasing the quality of the corpus, and improving on previous work.
• We apply a weakly-supervised learner to characterize linguistic patterns in each corpus, and describe the differences across generic sarcasm, rhetorical questions and hyperbole in terms of the patterns learned.
• We show for the first time that it is straightforward to develop very high precision classifiers for NOT-SARCASTIC utterances across our rhetorical questions and hyperbole subtypes, due to the nature of these utterances in debate forum dialogue.

Creating a Diverse Sarcasm Corpus
There has been relatively little theoretical work on sarcasm in dialogue that has had access to a large corpus of naturally occurring examples. Gibbs (2000) analyzes a corpus of 62 conversations between friends and argues that a robust theory of verbal irony must account for the large diversity in form. He defines several subtypes, including rhetorical questions and hyperbole:
• Rhetorical Questions: asking a question that implies a humorous or critical assertion
• Hyperbole: expressing a non-literal meaning by exaggerating the reality of a situation
Other categories of irony defined by Gibbs (2000) include understatements, jocularity, and sarcasm (which he defines as a critical/mocking form of irony). Other work has also tackled jocularity and humor, using different approaches for data aggregation, including filtering by Twitter hashtags, or analyzing laugh-tracks from recordings (Reyes et al., 2012; Bertero and Fung, 2016).
Previous work has not, however, attempted to operationalize these subtypes in any concrete way. Here we describe our methods for creating a corpus for generic sarcasm (Gen) (Sec. 2.1), rhetorical questions (RQ), and hyperbole (Hyp) (Sec. 2.2) using data from the Internet Argument Corpus (IAC 2.0). Table 1 provides examples of SARCASTIC and NOT-SARCASTIC posts from the corpus we create. Table 2 summarizes the final composition of our sarcasm corpus.

Generic Dataset (Gen)
We first replicated the pattern-extraction experiments of Lukin and Walker (2013) on their dataset using AutoSlog-TS (Riloff, 1996), a weakly-supervised pattern learner that extracts lexico-syntactic patterns associated with the input data. We set up the learner to extract patterns for both SARCASTIC and NOT-SARCASTIC utterances. Our first discovery is that we can classify NOT-SARCASTIC posts with very high precision, ranging between 80% and 90%.
Because our main goal is to build a larger, more diverse corpus of sarcasm, we use the high-precision NOT-SARCASTIC patterns extracted by AutoSlog-TS to create a "not-sarcastic" filter. We did this by randomly selecting a new set of 30K posts (restricting to posts with between 10 and 150 words) from IAC 2.0 (Abbott et al., 2016), and applying the high-precision NOT-SARCASTIC patterns from AutoSlog-TS to filter out any posts that contain at least one NOT-SARCASTIC cue. We end up filtering out two-thirds of the pool, keeping only posts that did not contain any of our high-precision NOT-SARCASTIC cues. We acknowledge that this may also filter out sarcastic posts, but we expect it to increase the ratio of sarcastic posts in the remaining pool.
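The filtering step described above can be illustrated with a minimal sketch. It assumes that cues can be approximated by lowercase substring matches, whereas the AutoSlog-TS patterns in the paper are lexico-syntactic; the cue strings, the example posts, and the helper name `keep_for_annotation` are hypothetical and are not taken from the actual pattern set.

```python
# Hypothetical stand-ins for high-precision NOT-SARCASTIC cues.
# The real cues are AutoSlog-TS lexico-syntactic patterns, not substrings.
NOT_SARCASTIC_CUES = ["according to", "the study found"]


def keep_for_annotation(post, cues, min_words=10, max_words=150):
    """Keep a post only if it is 10-150 words long and contains no cue."""
    n_words = len(post.split())
    if not (min_words <= n_words <= max_words):
        return False
    lowered = post.lower()
    return not any(cue in lowered for cue in cues)


posts = [
    "According to the report cited above, the mutation rate is stable "
    "across all of the sampled populations.",
    "Oh sure, because that plan worked out so wonderfully the last time "
    "somebody decided to try it.",
    "Too short to keep.",
]

# Posts that survive the filter would be sent out for annotation.
remaining = [p for p in posts if keep_for_annotation(p, NOT_SARCASTIC_CUES)]
```

In this toy pool, the first post is dropped because it contains a cue and the third because it is shorter than 10 words, so only the second post remains.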
We put out the remaining 11,040 posts on Mechanical Turk. As in Lukin and Walker (2013), we present the posts in "quote-response" pairs, where the response post to be annotated is presented in the context of its "dialogic parent", another post earlier in the thread, or a quote from another post earlier in the thread (Walker et al., 2012b). In the task instructions, annotators are presented with a definition of sarcasm, followed by one example of a quote-response pair that clearly contains sarcasm, and one pair that clearly does not. Each task consists of 20 quote-response pairs that follow the instructions. Figure 1 shows the instructions and layout of a single quote-response pair presented to annotators. As in Lukin and Walker (2013) and Walker et al. (2012b), annotators are asked a binary question: Is any part of the response to this quote sarcastic?
To help filter out unreliable annotators, we create a qualifier consisting of a set of 20 manually selected quote-response pairs (10 that should receive a SARCASTIC label and 10 that should receive a NOT-SARCASTIC label). A Turker must pass the qualifier with a score above 70% to participate in our sarcasm annotation tasks.
Our baseline ratio of sarcasm in online debate forums dialogue is the estimated 12% sarcastic posts in the IAC, which was found previously by Walker et al. by gathering annotations for sarcasm, agreement, emotional language, attacks, and nastiness from a subset of around 20K posts from the IAC across various topics (Walker et al., 2012a). Similarly, in his study of recorded conversation among friends, Gibbs cites 8% sarcastic utterances among all conversational turns (Gibbs, 2000).
We choose a conservative threshold: a post is only added to the sarcastic set if at least 6 out of 9 annotators labeled it sarcastic. Of the 11,040 posts we put out for annotation, we thus obtain 2,220 new posts, giving us a ratio of about 20% sarcasm, significantly higher than our baseline of 12%. We choose this conservative threshold to ensure the quality of our annotations, and we leave aside posts that 5 out of 9 annotators label as sarcastic for future work, noting that we can get even higher ratios of sarcasm by including them (up to 31%). The percentage agreement between [...]
We then expand this set, using only 3 highly reliable Turkers (based on our first round of annotations), giving them an exclusive sarcasm qualification to do additional HITs. We gain an additional 1,040 posts for each class when using majority agreement (at least 2 out of 3 sarcasm labels) for the additional set (to add to the 2,220 original posts). The average percent agreement with the majority vote is 89% for these three annotators. We supplement our sarcastic data with 2,360 not-sarcastic posts from the original data of Lukin and Walker (2013) that follow our 150-word length restriction, and complete the set with 900 posts that were filtered out by our NOT-SARCASTIC filter, resulting in a total of 3,260 posts per class (6,520 total posts).
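As a rough check on the thresholds and counts above, the following sketch encodes the two voting rules (at least 6 of 9 in the first round, at least 2 of 3 in the second) and reproduces the corpus arithmetic. The function name `is_sarcastic` and the example vote lists are illustrative assumptions; only the numeric counts come from the text.

```python
def is_sarcastic(votes, min_sarcastic):
    """Return True if enough annotators voted 'sarcastic' (True)."""
    return sum(votes) >= min_sarcastic


# Illustrative vote lists (True = annotator labeled the post sarcastic).
first_round = [True] * 6 + [False] * 3    # meets the 6-of-9 threshold
borderline = [True] * 5 + [False] * 4     # 5 of 9: left aside for future work
second_round = [True, True, False]        # meets the 2-of-3 majority

# Counts reported in the text.
sarcasm_ratio = 2220 / 11040              # about 0.20
sarcastic_total = 2220 + 1040             # two annotation rounds
not_sarcastic_total = 2360 + 900          # earlier data plus filtered-out posts
```

Both class totals come to 3,260, which matches the balanced 6,520-post Gen set.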
Rows 1 and 2 of Table 1 show examples of posts that are labeled sarcastic in our final generic sarcasm set. Using our filtering method, we are able to reduce the number of posts annotated from our original 30K to around 11K, achieving a percentage of 20% sarcastic posts, even though we choose to use a conservative threshold of at least 6 out of 9 sarcasm labels. Since the number of posts being annotated is only a third of the original set size, this method reduces annotation effort, time, and cost, and helps us shift the distribution of sarcasm to more efficiently expand our dataset than would otherwise be possible.

Rhetorical Questions and Hyperbole
The goal of collecting additional corpora for rhetorical questions and hyperbole is to increase the diversity of the corpus, and to allow us to explore the semantic differences between SARCASTIC and NOT-SARCASTIC utterances when particular lexico-syntactic cues are held constant. We hypothesize that identifying surface-level cues that are instantiated in both sarcastic and not-sarcastic posts will force learning models to find deeper semantic cues to distinguish between the classes.
Using a combination of findings in the theoretical literature, and observations of sarcasm patterns in our generic set, we developed a regex pattern matcher that runs against the 400K unannotated posts in the IAC 2.0 database and retrieves matching posts, only pulling posts that have parent posts and a maximum of 150 words. Table 3 shows only a small subset of the "more successful" regex patterns we defined for each class.
Cue annotation experiments. After running a large number of retrieval experiments with our regex pattern matcher, we select batches of the resulting posts that mix different cue classes to put out for annotation, in such a way as to not allow the annotators to determine what regex cues were used. We then successively put out various batches for annotation by 5 of our highly-qualified annotators, in order to determine what percentage of posts with these cues are sarcastic. Table 3 summarizes the results for a sample set of cues, showing the number of posts found containing the cue, the subset that we put out for annotation, and the percentage of posts labeled sarcastic in the annotation experiments. For example, for the hyperbolic cue "wow", 977 utterances with the cue were found, 153 were annotated, and 44% of those were found to be sarcastic (i.e. 56% were found to be not-sarcastic). Posts with the cue "oh wait" had the highest sarcasm ratio, at 87%. It is the distinction between the sarcastic and not-sarcastic instances that we are specifically interested in. We describe the corpus collection process for each subclass below.
It is important to note that using particular cues (regex) to retrieve sarcastic posts does not result in posts whose only cue is the regex pattern. We demonstrate this quantitatively in Sec. 4. Sarcasm is characterized by multiple lexical and morphosyntactic cues: these include the use of intensifiers, elongated words, quotations, false politeness, negative evaluations, emoticons, and tag questions inter alia. Table 4 shows how sarcastic utterances often contain combinations of multiple indicators, each playing a role in the overall sarcastic tone of the post.
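A minimal sketch of the cue-retrieval step might look as follows. The cue names "wow" and "oh wait" come from the text, but the exact regular expressions, the `Post` record, and the `matching_cues` helper are assumptions made for illustration; the actual patterns in Table 3 may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    text: str
    parent_id: Optional[int]  # None if the post has no parent (hypothetical field)


# Assumed regexes for two cues named in the text.
CUE_PATTERNS = {
    "wow": re.compile(r"\bwow\b", re.IGNORECASE),
    "oh wait": re.compile(r"\boh,?\s+wait\b", re.IGNORECASE),
}


def matching_cues(post, max_words=150):
    """Return the cue names found in a post that has a parent and <= 150 words."""
    if post.parent_id is None or len(post.text.split()) > max_words:
        return []
    return [name for name, pattern in CUE_PATTERNS.items()
            if pattern.search(post.text)]


posts = [
    Post("Wow, what a thoughtful reply.", parent_id=7),
    Post("Oh wait, you never cited anything.", parent_id=12),
    Post("Wow, nobody asked.", parent_id=None),
    Post("I agree with the previous point.", parent_id=3),
]
retrieved = {i: matching_cues(p) for i, p in enumerate(posts)}
```

Retrieved posts would then be mixed across cue classes before annotation, so that annotators cannot infer which cue selected a given post.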

Sarcastic Utterance
Forgive me if I doubt your sincerity, but you seem like a troll to me. I suspect that you aren't interested in learning about evolution at all. Your questions, while they do support your claim to know almost nothing, are pretty typical of creationist "prove it to me" questions.
Wrong again! You obviously can't recognize refutation when its printed before you.
I haven't made the tag "you liberals" derogatory. You liberals have done that to yourselves! I suppose you'd rather be called a social reformist! Actually, socialist is closer to a true description.

Rhetorical Questions. There is no previous work on distinguishing sarcastic from non-sarcastic uses of rhetorical questions (RQs). RQs are syntactically formulated as a question, but function as an indirect assertion (Frank, 1990). The polarity of the question implies an assertion of the opposite polarity, e.g. Can you read? implies You can't read. RQs are prevalent in persuasive discourse, and are frequently used ironically (Schaffer, 2005; Ilie, 1994; Gibbs, 2000). Previous work focuses on their formal semantic properties (Han, 1997), or distinguishing RQs from standard questions (Bhattasali et al., 2015).
We hypothesized that we could find RQs in abundance by searching for questions in the middle of a post, that are followed by a statement, using the assumption that questions followed by a statement are unlikely to be standard information-seeking questions. We test this assumption by randomly extracting 100 potential RQs as per our definition and putting them out on Mechanical Turk to 3 annotators, asking them whether or not the questions (displayed with their following statement) were rhetorical. According to majority vote, 75% of the posts were rhetorical.
We thus use this "middle of post" heuristic to obviate the need to gather manual annotations for RQs, and developed regex patterns to find RQs that were more likely to be sarcastic. A sample of the patterns, number of matches in the corpus, the numbers we had annotated, and the percent that are sarcastic after annotation are summarized in Table 3.
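The "middle of post" heuristic can be sketched as below. The sentence splitter is a naive regex, and the interpretation that a candidate RQ is any question immediately followed by a non-question sentence is an assumption; the paper does not specify how sentences were segmented.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def candidate_rqs(post):
    """Return (question, following statement) pairs found inside a post."""
    sentences = split_sentences(post)
    pairs = []
    for current, following in zip(sentences, sentences[1:]):
        # A question followed by a statement is treated as a likely RQ.
        if current.endswith("?") and not following.endswith("?"):
            pairs.append((current, following))
    return pairs


example = ("So you do not wish to have a logical debate? Alrighty then. "
           "god bless you anyway, brother.")
pairs = candidate_rqs(example)
lone_question = candidate_rqs("What time does the debate start?")
```

A post consisting only of a question yields no candidates, which fits the assumption that such questions are more likely to be information-seeking.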
Rhetorical Questions and Self-Answering
So you do not wish to have a logical debate? Alrighty then. god bless you anyway, brother.
Prove that? You can't prove that i've given nothing but insults. i'm defending myself, to mackindale, that's all. do you have a problem with how i am defending myself against mackindale? Apparently.

We extract 357 posts following the intermediate question-answer pairs heuristic from our generic (Gen) corpus. We then supplement these with posts containing RQ cues from our cue-annotation experiments: posts that received 3 out of 5 sarcastic labels in the experiments were considered sarcastic, and posts that received 2 or fewer sarcastic labels were considered not-sarcastic. Our final rhetorical questions corpus consists of 851 posts per class (1,702 total posts). Table 5 shows some examples of rhetorical questions and self-answering from our corpus.
Hyperbole. Hyperbole (Hyp) has been studied as an independent form of figurative language that can coincide with ironic intent (McCarthy and Carter, 2004; Cano Mora, 2009), and previous computational work on sarcasm typically includes features to capture hyperbole (Reyes et al., 2013). Kreuz and Roberts (1995) describe a standard frame for hyperbole in English where an adverb modifies an extreme, positive adjective, e.g. "That was absolutely amazing!" or "That was simply the most incredible dining experience in my entire life." Colston and O'Brien (2000) provide a theoretical framework that explains why hyperbole is so strongly associated with sarcasm. Hyperbole exaggerates the literal situation, introducing a discrepancy between the "truth" and what is said, as a matter of degree. A key observation is that this is a type of contrast (Colston and Keller, 1998; Colston and O'Brien, 2000). In their framework:
• An event or situation evokes a scale;
• An event can be placed on that scale;
• The utterance about the event contrasts with actual scale placement.
Figure 2 illustrates that the scales that can be evoked range from negative to positive, undesirable to desirable, unexpected to expected and certain to uncertain. Hyperbole moves the strength of an assertion further up or down the scale from the literal meaning; the degree of movement corresponds to the degree of contrast. Depending on what they modify, adverbial intensifiers like totally, absolutely, incredibly shift the strength of the assertion to extreme negative or positive.
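The adverb-plus-extreme-adjective frame discussed above can be approximated with a simple regex, shown here as a sketch. The intensifiers totally, absolutely, and incredibly are named in the text; the remaining intensifiers and the short adjective list are hypothetical placeholders, since a real system would more plausibly rely on part-of-speech tags than on a fixed word list.

```python
import re

# "totally", "absolutely", "incredibly" come from the text; the rest are assumed.
INTENSIFIERS = ["totally", "absolutely", "incredibly", "utterly", "simply"]
# Hypothetical list of extreme adjectives, for illustration only.
EXTREME_ADJECTIVES = ["amazing", "amazed", "incredible", "brilliant"]

FRAME = re.compile(
    r"\b(%s)\s+(%s)\b" % ("|".join(INTENSIFIERS), "|".join(EXTREME_ADJECTIVES)),
    re.IGNORECASE,
)


def find_frame(utterance):
    """Return the (intensifier, adjective) pair if the frame occurs, else None."""
    match = FRAME.search(utterance)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


hit = find_frame("That was absolutely amazing!")
second_hit = find_frame("My goodness...i'm utterly amazed at the number of men out there")
miss = find_frame("The report was published yesterday.")
```

This sketch only matches an intensifier immediately followed by an adjective, so it would miss looser variants such as "simply the most incredible dining experience".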

Hyperbole with Intensifiers
Wow! I am soooooooo amazed by your come back skills... another epic fail!
My goodness...i'm utterly amazed at the number of men out there that are so willing to decide how a woman should use her own body!
Oh do go on. I am so impressed by your 'intellectuall' argument. pfft.
I am very impressed with your ability to copy and paste links now what this proves about what you know about it is still unproven.

Table 6 shows examples of hyperbole from our corpus, showcasing the effect that intensifiers have in terms of strengthening the emotional evaluation of the response. To construct a balanced corpus of sarcastic and not-sarcastic utterances with hyperbole, we developed a number of patterns based on the literature and our observations of the generic corpus. The patterns, the number of matches on the whole corpus, the numbers we had annotated, and the percent that are sarcastic after annotation are summarized in Table 3. Again, we extract a small subset of examples from our Gen corpus (30 per class), and supplement them with posts that contain our hyperbole cues (considering them sarcastic if they received at least 3/5 sarcastic labels, not-sarcastic otherwise). The final hyperbole dataset consists of 582 posts per class (1,164 posts in total).
To recap, Table 2 summarizes the total number of posts for each subset of our final corpus.

Learning Experiments
Our primary goal is not to optimize classification results, but to explore how results vary across different subcorpora and corpus properties. We also aim to demonstrate that the quality of our corpus makes it more straightforward to achieve high classification performance. We apply both supervised learning using SVM (from Scikit-Learn (Pedregosa et al., 2011)) and weakly-supervised linguistic pattern learning using AutoSlog-TS (Riloff, 1996). These reveal different aspects of the corpus.
Supervised Learning. We restrict our supervised experiments to a default linear SVM learner with Stochastic Gradient Descent (SGD) training and L2 regularization, available in the Scikit-Learn toolkit (Pedregosa et al., 2011). We use 10-fold cross-validation, and only two types of features: n-grams and Word2Vec word embeddings. We expect Word2Vec to be able to capture semantic generalizations that n-grams do not (Socher et al., 2013; Li et al., 2016). The n-gram features include unigrams, bigrams, and trigrams, including sequences of punctuation (for example, ellipses or "!!!"), and emoticons. We use GoogleNews Word2Vec features (Mikolov et al., 2013). (We also test our own custom 300-dimensional embeddings created for the dialogic domain using the Gensim library (Řehůřek and Sojka, 2010), and a very large corpus of user-generated dialogue. While this custom model works well for other tasks on IAC 2.0, it did not work well for sarcasm classification, so we do not discuss it further.)
Table 7 summarizes the results of our supervised learning experiments on our datasets using 10-fold cross-validation. The data is balanced evenly between the SARCASTIC and NOT-SARCASTIC classes, and the best F-measures for each class are shown in bold. The default W2V model (trained on Google News) gives the best overall F-measure of 0.74 on the Gen corpus for the SARCASTIC class, while n-grams give the best NOT-SARCASTIC F-measure of 0.73. Both of these results are higher F than previously reported for classifying sarcasm in dialogue, and we might expect that feature engineering could yield even greater performance.
On the RQ corpus, n-grams provide the best F-measure for SARCASTIC at 0.70 and NOT-SARCASTIC at 0.71. Although W2V performs well, the n-gram model includes features involving repeated punctuation and emoticons, which the W2V model excludes. Punctuation and emoticons are often used as distinctive features of sarcasm (e.g. "Oh, really?!?!", [emoticon-rolleyes]).
For the Hyp corpus, the best F-measure for both the SARCASTIC and NOT-SARCASTIC classes again comes from n-grams, with F-measures of 0.65 and 0.68 respectively. It is interesting to note that the overall results of the Hyp data are lower than those for Gen and RQs, likely due to the smaller size of the Hyp dataset.
To examine the effect of dataset size, we compare F-measure (using the same 10-fold cross-validation setup) for each dataset while holding the number of posts per class constant. Figure 3 shows the performance of each of the Gen, RQ, and Hyp datasets at intervals of 100 posts per class (up to the maximum size of 582 posts per class for Hyp, and 851 posts per class for RQ). From the graph, we can see that as a general trend, the datasets benefit from larger dataset sizes. Interestingly, the results for the RQ dataset are very comparable to those of Gen. The Gen dataset eventually gets the highest sarcastic F-measure (0.74) at its full dataset size of 3,260 posts per class.
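To make the n-gram features used in the supervised experiments more concrete, here is a minimal sketch of a feature extractor that keeps punctuation runs and emoticon tokens. The tokenizer regex and the bracketed emoticon format (modeled on the "[emoticon-rolleyes]" example in the text) are assumptions; the paper does not describe its tokenization. The resulting counts could be passed to a linear classifier such as scikit-learn's `SGDClassifier` with hinge loss and an L2 penalty, though the paper does not name the exact class it used.

```python
import re
from collections import Counter

# Assumed token types: bracketed emoticons, words, and runs of . ! ? characters.
TOKEN = re.compile(r"\[emoticon-[a-z]+\]|[a-z0-9']+|[!?.]+")


def tokenize(text):
    """Lowercase the text and keep words, punctuation runs, and emoticon tokens."""
    return TOKEN.findall(text.lower())


def ngram_features(text, max_n=3):
    """Count unigrams, bigrams, and trigrams over the token sequence."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = tokenize("Oh, really?!?! [emoticon-rolleyes]")
features = ngram_features("Oh, really?!?! [emoticon-rolleyes]")
```

Keeping "?!?!" as a single token is what lets an n-gram model pick up on repeated punctuation, which an off-the-shelf word-embedding lookup would typically discard.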
Weakly-Supervised Learning. AutoSlog-TS is a weakly supervised pattern learner that only requires training documents labeled broadly as SARCASTIC or NOT-SARCASTIC. AutoSlog-TS uses a set of syntactic templates to define different types of linguistic expressions. The left-hand side of Table 8 lists each pattern template and the right-hand side illustrates a specific lexico-syntactic pattern (in bold) that represents an instantiation of each general pattern template for learning sarcastic patterns in our data. In addition to these 17 templates, we added patterns to AutoSlog for adjective-noun, adverb-adjective and adjective-adjective, because these patterns are frequent in hyperbolic sarcastic utterances. The examples in Table 8 show that Colston's notion of contrast shows up in many learned patterns, and that the source of the contrast is highly variable. For example, Row 1 implies a contrast with a set of people who are not your mother.
Row 5 contrasts what you were asked with what you've (just) done. Row 10 contrasts chapter 12 and chapter 13 (Hirschberg, 1985). Row 11 contrasts what I am allowed vs. what you have to do.
AutoSlog-TS computes statistics on the strength of association of each pattern with each class, i.e. P(SARCASTIC | p) and P(NOT-SARCASTIC | p), along with the pattern's overall frequency. We define two tuning parameters for each class: θ_f, the frequency with which a pattern occurs, and θ_p, the probability with which a pattern is associated with the given class. We do a grid search, testing the performance of our pattern thresholds for θ_f = {2-6} in intervals of 1 and θ_p = {0.60-0.85} in intervals of 0.05. Once we extract the subset of patterns passing our thresholds, we search for these patterns in the posts in our development set, classifying a post as a given class if it contains θ_n ={1,
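The threshold search over learned patterns can be sketched as follows. The ranges for θ_f and θ_p follow the text; the values tried for θ_n are an assumption (the source text is cut off at that point), as are the use of "greater than or equal" comparisons, the substring matching, the toy pattern statistics, the development posts, and the choice to rank settings by precision and then recall.

```python
from itertools import product

# Hypothetical pattern statistics: pattern -> (frequency, P(SARCASTIC | pattern)).
PATTERN_STATS = {
    "oh wait": (6, 0.87),
    "i bet": (9, 0.80),
    "according to": (12, 0.20),
}

# Hypothetical development set: (post, is_sarcastic).
DEV_SET = [
    ("Oh wait, that never happened.", True),
    ("I bet you read that on a blog.", True),
    ("According to the survey, turnout rose.", False),
    ("Nice weather today.", True),
]

THETA_F = range(2, 7)                                     # 2..6 in steps of 1
THETA_P = [round(0.60 + 0.05 * i, 2) for i in range(6)]   # 0.60..0.85
THETA_N = (1, 2)                                          # assumed values


def select_patterns(stats, theta_f, theta_p):
    """Keep patterns whose frequency and class probability pass the thresholds."""
    return {p for p, (freq, prob) in stats.items()
            if freq >= theta_f and prob >= theta_p}


def classify(post, patterns, theta_n):
    """Label a post as the class if it contains at least theta_n selected patterns."""
    text = post.lower()
    return sum(1 for p in patterns if p in text) >= theta_n


def precision_recall(dev_set, patterns, theta_n):
    tp = fp = fn = 0
    for post, is_sarcastic in dev_set:
        predicted = classify(post, patterns, theta_n)
        if predicted and is_sarcastic:
            tp += 1
        elif predicted:
            fp += 1
        elif is_sarcastic:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def grid_search(stats, dev_set):
    """Return ((precision, recall), (theta_f, theta_p, theta_n)) for the best setting."""
    results = []
    for theta_f, theta_p, theta_n in product(THETA_F, THETA_P, THETA_N):
        patterns = select_patterns(stats, theta_f, theta_p)
        scores = precision_recall(dev_set, patterns, theta_n)
        results.append((scores, (theta_f, theta_p, theta_n)))
    return max(results, key=lambda r: r[0])


best_scores, best_params = grid_search(PATTERN_STATS, DEV_SET)
```

Raising θ_p or θ_n shrinks the set of posts that get labeled, which is the usual trade of recall for precision in this kind of pattern-based classifier.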

Linguistic Analysis
Here we aim to provide a linguistic characterization of the differences between the sarcastic and the not-sarcastic classes. We use the AutoSlog-TS pattern learner to generate patterns automatically, and the Stanford dependency parser to examine relationships between arguments (Riloff, 1996;Manning et al., 2014). Table 10 shows the number of sarcastic patterns we extract with AutoSlog-TS, with a frequency of at least 2 and a probability of at least 0.75 for each corpus. We learn many novel lexico-syntactic cue patterns that are not the regex that we search for. We discuss specific novel learned patterns for each class below.
Generic Sarcasm. We first examine the different patterns learned on the Gen dataset. Table 9 shows examples of extracted patterns for each class. We observe that the NOT-SARCASTIC patterns appear to capture technical and scientific language, while the SARCASTIC patterns tend to capture subjective language that is not topic-specific. We observe an abundance of adjective and adverb patterns for the sarcastic class, although we do not use adjective and adverb patterns in our regex retrieval method. Instead, such cues co-occur with the cues we search for, expanding our pattern inventory as we show in Table 10.
Rhetorical Questions. We notice that while the NOT-SARCASTIC patterns generated for RQs are similar to the topic-specific NOT-SARCASTIC patterns we find in the general dataset, there are some interesting features of the SARCASTIC patterns that are more specific to the RQs. Many of our sarcastic questions focus specifically on attacks on the mental abilities of the addressee. This generalization is made clear when we extract and analyze the verb, subject, and object arguments using the Stanford dependency parser (Manning et al., 2014) for the questions in the RQ dataset.

If these dummies don't have a problem with information increasing, but do have a problem with beneficial information increasing, don't you think there is a problem?
Table 11: Attacks on Mental Ability in RQs

Hyperbole. [...] involves adverbs and adjectives, as noted above. We did not use this pattern to retrieve hyperbole, but because each hyperbolic sarcastic utterance contains multiple cues, we learn an expanded class of patterns for hyperbole. Table 12 illustrates some of the new adverb adjective patterns that are frequent, high-precision indicators of sarcasm.
We learn a number of verbal patterns that we had not previously associated with hyperbole, as shown in Table 13. Interestingly, many of these instantiate the observations of Cano Mora (2009) on hyperbole and its related semantic fields: creating contrast by exclusion, e.g. no limit and no way, or by expanding a predicated class, e.g. everyone knows. Many of them are also contrastive. Table 12 shows just a few examples, such as though it in no way and so much knowledge.

Pattern | Freq | Example
no way | 4 | that is a pretty impresive education you are working on (though it in no way makes you a shoe in for any political position).
so much | 17 | but nooooooo we are launching missiles on libia thats solves alot .... because we gained so much knowledge and learned from our mistakes
oh dear | 12 | oh dear, he already added to the gene pool
how much | 8 | you have no idea how much of a hippocrit you are, do you
exactly what | 5 | simone, exactly what is a gun-loving fool anyway, other than something you...

Pattern | Freq | Example
i bet | 9 | i bet there is a university thesis in there somewhere
you don't see | 7 | you don't see us driving in a horse and carriage, do you
everyone knows | 9 | everyone knows blacks commit more crime than other races
I wonder | 5 | hmm i wonder ware the hot bed for violent christian extremists is
you trying | 7 | if you are seriously trying to prove your god by comparing real life things with fictional things, then yes, you have proved your god is fictional
Table 13: Verb Patterns in Hyperbole

Conclusion and Future Work
We have developed a large-scale, highly diverse corpus of sarcasm using a combination of linguistic analysis and crowd-sourced annotation. We use filtering methods to skew the distribution of sarcasm in posts to be annotated to 20-31%, much higher than the estimated 12% distribution of sarcasm in online debate forums. We note that when using Mechanical Turk for sarcasm annotation, it is possible that the level of agreement signals how lexically-signaled the sarcasm is, so we settle on a conservative threshold (at least 6 out of 9 annotators agreeing that a post is sarcastic) to ensure the quality of our annotations. We operationalize lexico-syntactic cues prevalent in sarcasm, finding cues that are highly indicative of sarcasm, with ratios up to 87%. Our final corpus consists of data representing generic sarcasm, rhetorical questions, and hyperbole.
We conduct supervised learning experiments to highlight the quality of our corpus, achieving a best F of 0.74 using very simple feature sets. We use weakly-supervised learning to show that we can also achieve high precision (albeit with a low recall) for our rhetorical questions and hyperbole datasets; much higher than the best precision that is possible for the Generic dataset. These high precision values may be used for bootstrapping these two classes in the future.
We also present qualitative analysis of the different characteristics of rhetorical questions and hyperbole in sarcastic acts, and of the distinctions between sarcastic/not-sarcastic cues in generic sarcasm data. Our analysis shows that the forms of sarcasm and its underlying semantic contrast in dialogue are highly diverse.
In future work, we will focus on feature engineering to improve results on the task of sarcasm classification for both our generic data and subclasses. We will also begin to explore evaluation on real-world data distributions, where the ratio of sarcastic/not-sarcastic posts is inherently unbalanced. As we continue our analysis of the generic and fine-grained categories of sarcasm, we aim to better characterize and model the great diversity of sarcasm in dialogue.