Are you serious?: Rhetorical Questions and Sarcasm in Social Media Dialog

Effective models of social dialog must understand a broad range of rhetorical and figurative devices. Rhetorical questions (RQs) are a type of figurative language whose aim is to achieve a pragmatic goal, such as structuring an argument, being persuasive, emphasizing a point, or being ironic. While there are computational models for other forms of figurative language, rhetorical questions have received little attention to date. We expand a small dataset from previous work, presenting a corpus of 10,270 RQs from debate forums and Twitter that represent different discourse functions. We show that we can clearly distinguish between RQs and sincere questions (0.76 F1). We then show that RQs can be used both sarcastically and non-sarcastically, observing that non-sarcastic (other) uses of RQs are frequently argumentative in forums, and persuasive in tweets. We present experiments to distinguish between these uses of RQs using SVM and LSTM models that represent linguistic features and post-level context, achieving results as high as 0.76 F1 for “sarcastic” and 0.77 F1 for “other” in forums, and 0.83 F1 for both “sarcastic” and “other” in tweets. We supplement our quantitative experiments with an in-depth characterization of the linguistic variation in RQs.


Introduction
Theoretical frameworks for figurative language posit eight standard forms: indirect questions, idiom, irony and sarcasm, metaphor, simile, hyperbole, understatement, and rhetorical questions (Roberts and Kreuz, 1994). While computational models have been developed for many of these forms, rhetorical questions (RQs) have received little attention to date. Table 1 shows examples of RQs from social media in debate forums and Twitter, where their use is prevalent. RQs are defined as utterances that have the structure of a question, but which are not intended to seek information or elicit an answer (Rohde, 2006; Frank, 1990; Ilie, 1994; Sadock, 1971). RQs are often used in arguments and expressions of opinion, in advertisements and other persuasive domains (Petty et al., 1981), and are frequent in social media and other types of informal language.

[Table 1: Examples of RQs and their self-answers. (a) RQs in forums dialog — Row 4: "You lost this debate Skeptic, why drag it back up again? There are plenty of other subjects that we could debate instead." (b) RQs in Twitter dialog — Row 5: "Are you completely revolting? Then you should slide into my DMs, because apparently thats the place to be. #Sarcasm" Row 6: "Do you have problems falling asleep? Reduce anxiety, calm the mind, sleep better naturally [link]" Row 7: "The officials messed something up? I'm shocked I tell you. SHOCKED." Row 8: "Does ANY review get better than this? From a journalist in New York."]
Corpus creation and computational models for some forms of figurative language have been facilitated by the use of hashtags on Twitter, e.g. the #sarcasm hashtag (Bamman and Smith, 2015; Riloff et al., 2013; Liebrecht et al., 2013). Other figurative forms, such as similes, can be identified via lexico-syntactic patterns (Qadir et al., 2015, 2016; Veale and Hao, 2007). RQs are not marked by a hashtag, and their syntactic form is indistinguishable from that of standard questions (Han, 2002; Sadock, 1971). Previous theoretical work examines the discourse functions of RQs and compares the overlap in discourse functions across all forms of figurative language (Roberts and Kreuz, 1994). For RQs, 72% of subjects assign to clarify as a function, 39% assign discourse management, 28% mention to emphasize, 56% assign negative emotion, and another 28% mention positive emotion.1 The discourse functions of clarification, discourse management, and emphasis are clearly related to argumentation. One of the largest overlaps in discourse function between RQs and other figurative forms is with irony/sarcasm (62% overlap), and there are many studies describing how RQs are used sarcastically (Gibbs, 2000; Ilie, 1994).
To better understand the relationship between RQs and irony/sarcasm, we expand on a small existing dataset of RQs in debate forums from our previous work (Oraby et al., 2016), ending up with a corpus of 2,496 RQs and the self-answers or statements that follow them. We use the heuristic described in that work to collect a completely novel corpus of 7,774 RQs from Twitter. Examples from our final dataset of 10,270 RQs and their following self-answers/statements are shown in Table 1. We observe great diversity in the use of RQs, ranging from sarcastic and mocking (such as the forum post in Row 2), to offering advice based on some anticipated answer (such as the tweet in Row 6).
In this study, we first show that RQs can clearly be distinguished from sincere, information-seeking questions (0.76 F1). Because we are interested in how RQs are used sarcastically, we define our task as distinguishing sarcastic uses from other uses of RQs, observing that non-sarcastic RQs are often used argumentatively in forums (as opposed to the more mocking sarcastic uses), and persuasively in Twitter (as frequent advertisements and calls-to-action). To distinguish between sarcastic and other uses, we perform classification experiments using SVM and LSTM models, exploring different levels of context, and showing that adding linguistic features improves classification results in both domains.
This paper provides the first in-depth investigation of the use of RQs in different forms of social media dialog. We present a novel task, dataset,2 and results aimed at understanding how RQs can be recognized, and how sarcastic and other uses of RQs can be distinguished.

Related Work
Much of the previous work on RQs has focused on RQs as a form of figurative language, and on describing their discourse functions (Schaffer, 2005; Gibbs, 2000; Roberts and Kreuz, 1994; Frank, 1990; Petty et al., 1981). Related work in linguistics has primarily focused on the differences between RQs and standard questions (Han, 2002; Ilie, 1994; Han, 1997). For example, Sadock (1971) shows that RQs can be followed by a yet clause, and that the discourse cue after all at the beginning of the question leads to its interpretation as an RQ. Phrases such as by any chance are primarily used in information-seeking questions, while negative polarity items such as lift a finger or budge an inch can only be used with RQs, e.g. Did John help with the party? vs. Did John lift a finger to help with the party?
RQs were introduced into the DAMSL coding scheme when it was applied to the Switchboard corpus (Jurafsky et al., 1997). To our knowledge, the only computational work utilizing that data is by Bhattasali et al. (2015), who used n-gram language models with pre- and post-context to distinguish RQs from regular questions in SWBD-DAMSL. Using context improved their results to 0.83 F1 on a balanced dataset of 958 instances, demonstrating that context information could be very useful for this task.
Although it has been observed in the literature that RQs are often used sarcastically (Gibbs, 2000; Ilie, 1994), previous work on sarcasm classification has not focused on RQs (Bamman and Smith, 2015; Riloff et al., 2013; Liebrecht et al., 2013; Filatova, 2012; González-Ibáñez et al., 2011; Davidov et al., 2010). Riloff et al. (2013) investigated the utility of sequential features in tweets, emphasizing a subtype of sarcasm that consists of an expression of positive emotion contrasted with a negative situation, and showed that sequential features performed much better than features that did not capture sequential information. More recent work on sarcasm has focused specifically on sarcasm identification on Twitter using neural network approaches (Poria et al., 2016; Ghosh and Veale, 2016; Zhang et al., 2016; Amir et al., 2016).
Other work emphasizes features of semantic incongruity in recognizing sarcasm (Joshi et al., 2015;Reyes et al., 2012). Sarcastic RQs clearly feature semantic incongruity, in some cases by expressing the certainty of particular facts in the frame of a question, and in other cases by asking questions like "Can you read?" (Row 2 in Table 1), a competence which a speaker must have, prima facie, to participate in online discussion.
To our knowledge, our previous work is the first to consider the task of distinguishing sarcastic vs. not-sarcastic RQs, where we construct a corpus of sarcasm of three types: generic, RQ, and hyperbole, and provide simple baseline experiments using n-grams (0.70 F1 for SARC and 0.71 F1 for NOT-SARC) (Oraby et al., 2016). Here, we adopt the same heuristic for gathering RQs and expand the corpus in debate forums, also collecting a novel Twitter corpus. We show that we can distinguish between the SARCASTIC and OTHER uses of RQs that we observe, such as argumentation and persuasion in forums and Twitter, respectively. We show that linguistic features aid in the classification task, and explore the effects of context using traditional and neural models.

Corpus Creation
Sarcasm is a prevalent discourse function of RQs. In previous work, we observe both sarcastic and not-sarcastic uses of RQs in forums, and collect a set of sarcastic and not-sarcastic RQs in debate by using a heuristic stating that an RQ is a question that occurs in the middle of a turn, and which is answered immediately by the speaker themselves (Oraby et al., 2016). RQs are thus defined intentionally: the speaker indicates that their intention is not to elicit an answer by not ceding the turn.3

In this work, we are interested in a closer analysis of RQs in social media. We use the same RQ-collection heuristic from previous work to expand our corpus of SARCASTIC vs. OTHER uses of RQs in debate forums, and create another completely novel corpus of RQs in Twitter. We observe that the other uses of RQs in forums are often argumentative, aimed at structuring an argument more emphatically, clearly, or concisely, whereas in Twitter they are frequently persuasive in nature, aimed at advertising or grabbing attention. Table 2 shows examples of sarcastic and other uses of RQs in our corpus, and we describe our data collection methods for both domains below.

Debate Forums: The Internet Argument Corpus (IAC 2.0) (Abbott et al., 2016) contains a large number of discussions about politics and social issues, making it a good source of RQs. Following our previous work (Oraby et al., 2016), we first extract RQs in posts whose length varies from 10-150 words, and collect five annotations for each RQ paired with the context of its following statement.

3 We acknowledge that this method may miss RQs that do not follow this heuristic, but opt to use this conservative pattern for expanding the data to avoid introducing extra noise.
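The turn-internal self-answer heuristic lends itself to a simple sketch. The naive sentence splitter and pairing logic below are our simplifications for illustration, not the authors' actual extraction code:

```python
import re

def extract_rq_pairs(post_text):
    """Find candidate RQ + self-answer pairs: a question followed
    immediately by a non-question statement from the same speaker,
    i.e. the speaker answers without ceding the turn."""
    # Naive sentence split on terminal punctuation (a simplification).
    sentences = [s.strip() for s in re.findall(r'[^.?!]+[.?!]', post_text)]
    pairs = []
    for i in range(len(sentences) - 1):
        is_question = sentences[i].endswith('?')
        next_is_statement = not sentences[i + 1].endswith('?')
        if is_question and next_is_statement:
            pairs.append((sentences[i], sentences[i + 1]))
    return pairs
```

A question that ends the turn is never paired, matching the requirement that the speaker answer the question themselves.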
We ask Turkers to specify whether or not the RQ-response pair is sarcastic, as a binary question. We count a post as "sarcastic" if the majority of annotators (at least 3 of the 5) labeled the post as sarcastic. Including the 851 posts per class from previous work (Oraby et al., 2016), this resulted in 1,248 sarcastic posts out of 4,840 (25.8%), a significantly larger percentage than the estimated 12% sarcasm ratio in debate forums (Swanson et al., 2014). We then balance the 1,248 sarcastic RQs with an equal number of RQs that 0 or 1 annotators voted as sarcastic, giving us a total of 2,496 RQ pairs. For our experiments, all annotators had above 80% agreement with the majority vote.
Twitter: We also extract RQs, defined as above, from a set of 80,000 tweets with a #sarcasm, #sarcastic, or #sarcastictweet hashtag. We use the hashtags as "labels", as in other work (Riloff et al., 2013; Reyes et al., 2012). This yields 3,887 sarcastic RQ tweets, again balanced with 3,887 RQ pairs from a set of random tweets (not containing any sarcasm-related hashtags). We remove all sarcasm-related hashtags and username mentions (prefixed with an "@") from the posts, for a total of 7,774 RQ tweets.
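This cleanup step can be sketched with simple regular expressions. The hashtag list comes from the text above; the cleanup order and whitespace handling are our assumptions:

```python
import re

# The three sarcasm-related hashtags named in the text; longest first so
# "#sarcastictweet" is not left half-matched.
SARCASM_TAGS = re.compile(r'#sarcastictweet\b|#sarcastic\b|#sarcasm\b', re.IGNORECASE)
MENTIONS = re.compile(r'@\w+')  # username mentions prefixed with "@"

def clean_tweet(tweet):
    """Strip sarcasm-related hashtags and @-mentions, then tidy whitespace."""
    tweet = SARCASM_TAGS.sub('', tweet)
    tweet = MENTIONS.sub('', tweet)
    return re.sub(r'\s+', ' ', tweet).strip()
```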

Experimental Results
In this section, we present experiments classifying rhetorical vs. information-seeking questions, then sarcastic vs. other uses of RQs.

RQs vs. Information-Seeking Qs
By definition, fact-seeking questions are not RQs. We take advantage of the annotations provided for subsets of the IAC, in particular the subcorpus that distinguishes FACTUAL posts from EMOTIONAL posts (Abbott et al., 2016; Oraby et al., 2015).4 Table 3 shows examples of FACTUAL/INFO-SEEKING questions.
To test whether RQ and FACTUAL/INFO-SEEKING questions are easily distinguishable, we randomly select a sample of 1,020 questions from our forums RQ corpus, and balance them with the same number of questions from the FACT corpus. We divide the question data into 80% train and 20% test, and train an SVM classifier from Scikit-Learn (Pedregosa et al., 2011) with GoogleNews Word2Vec (W2V) (Mikolov et al., 2013) features. We perform a grid search on our training set using 3-fold cross-validation for parameter tuning, and report results on our test set, clearly distinguishing the two classes (0.76 F1).
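A minimal sketch of this pipeline with scikit-learn, using a tiny random embedding table in place of the pretrained GoogleNews vectors and toy questions in place of the corpus (dimensions, vocabulary, and hyperparameter grid are placeholder choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy embedding table standing in for GoogleNews Word2Vec (dim 300 in the paper).
DIM = 50
vocab = {w: rng.normal(size=DIM) for w in
         "do you really think that is true when the next train arrives".split()}

def embed(text):
    """Average the word vectors of all in-vocabulary tokens (the W2V
    representation used for the SVM)."""
    vecs = [vocab[w] for w in text.lower().split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Toy labeled questions: 1 = rhetorical, 0 = information-seeking.
rq = ["do you really think that is true"] * 10
info = ["when the next train arrives"] * 10
X = np.array([embed(t) for t in rq + info])
y = np.array([1] * 10 + [0] * 10)

# 3-fold grid search over SVM hyperparameters, as described in the text.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=3)
grid.fit(X, y)
```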

Sarcastic vs. Other Uses of RQs
Next, we focus on distinguishing SARCASTIC from OTHER uses of RQs in forums and Twitter. We divide the full RQ data from each domain (2,496 forums posts and 7,774 tweets, balanced between the two classes) into 80% train and 20% test data. We experiment with two models, an SVM classifier from Scikit-Learn (Pedregosa et al., 2011), and a bidirectional LSTM model (Chollet, 2015) with a TensorFlow backend (Abadi et al., 2016). We perform a grid search using cross-validation on our training set for parameter tuning, and report results on our test set. For each of the models, we establish a baseline with W2V features (GoogleNews-trained Word2Vec of size 300 (Mikolov et al., 2013) for the debate forums, and Twitter-trained Word2Vec of size 400 (Godin et al., 2015) for the tweets). We experiment with different embedding representations, finding that we achieve best results by averaging the word embeddings for each input when using the SVM, and creating an embedding matrix (number of words by embedding size) for each input when using the LSTM.5 For our LSTM model, we experiment with various layer architectures from previous work (Poria et al., 2016; Ghosh and Veale, 2016; Zhang et al., 2016; Amir et al., 2016). For our final model (shown in Figure 1), we use a sequential embedding layer, a 1D convolutional layer, max-pooling, a bidirectional LSTM, a dropout layer, and a sequence of dense and dropout layers with a final sigmoid activation layer for the output.
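A minimal Keras sketch of that layer sequence follows. The filter count, kernel size, hidden sizes, and dropout rates are our placeholder choices, since the paper reports only the order of the layers:

```python
from tensorflow.keras import layers, models

def build_rq_classifier(vocab_size=10000, embed_dim=300):
    """Embedding -> 1D convolution -> max-pooling -> bidirectional LSTM
    -> dropout -> dense/dropout stack -> sigmoid output."""
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Conv1D(64, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # binary SARC vs. OTHER
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```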
For additional features, we experiment with using post-level scores (the frequency of each category in the input, normalized by word count) from the Linguistic Inquiry and Word Count (LIWC) tool (Pennebaker et al., 2001). We experiment with which LIWC categories to include as features on our training data, and end up with a set of 20 categories for each domain,6 as shown in Table 5. When adding features to the LSTM model, we include dense and merge layers to concatenate features, followed by the dense and dropout layers and sigmoid output.
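The post-level scoring can be sketched with a toy two-category lexicon standing in for LIWC (the real LIWC dictionaries are proprietary and also match word stems, which we omit here):

```python
def liwc_scores(text, categories):
    """Frequency of each category's words in the input, normalized by
    the total word count of the post."""
    tokens = text.lower().split()
    n = len(tokens) or 1  # avoid division by zero on empty posts
    return {cat: sum(tok in words for tok in tokens) / n
            for cat, words in categories.items()}
```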
We experiment with different levels of textual context in training for both the forums and Twitter data (keeping our test set constant, always testing on only the RQ and self-answer portion of the text). We are motivated by the intuition that training on larger context will help us identify more informative segments of RQs at test time. Specifically, we train on:
• RQ: only the RQ and its self-answer
• Pre+RQ: the preceding context and the RQ
• RQ+Post: the RQ and following context
• FullText: the full text or tweet (all context)

Table 6 presents our results on the classification task by model for each domain, showing P, R, and F1 scores for each class (forums in Table 6a and Twitter in Table 6b). For each domain, we present the same experiments for both models (SVM and LSTM), first showing a W2V baseline (Rows 1 and 6 in both tables), then adding in LIWC (Rows 2 and 7), and finally presenting results for W2V and LIWC features on different context levels (Rows 2-5 for SVM and Rows 7-10 for LSTM). We observe that while the SVM results with LIWC features do not change significantly depending on the training context (Rows 3-5), the LSTM model is highly sensitive to context changes for the SARC class (Rows 8-10). Some interesting findings emerge when training on different context granularities for LSTM: our best LSTM results for the SARC class come from training on the RQ+Post context (0.75 F1 in Row 9), and on the Pre+RQ context for the OTHER class (0.76 F1 in Row 8). We note that this increase in the SARC class from plain word embeddings to word embeddings combined with LIWC and context is larger than the increase in the OTHER class, indicating that post-level context for SARC captures more diverse instances in training.

5 In future work, we plan to further explore the effects of different embedding representations on model performance.
6 We discuss some of the highly-informative LIWC categories by domain in Section 5.
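Assembling the four training variants can be sketched as follows. The four-way split of a post into preceding context, RQ, self-answer, and following context is taken from the text; the function signature is our assumption:

```python
def context_variants(pre, rq, answer, post):
    """Build the four training inputs from a post split into its
    preceding context, the RQ, its self-answer, and following context.
    Test inputs always use only the "RQ" variant."""
    rq_unit = f"{rq} {answer}"
    return {
        "RQ": rq_unit,
        "Pre+RQ": f"{pre} {rq_unit}".strip(),
        "RQ+Post": f"{rq_unit} {post}".strip(),
        "FullText": f"{pre} {rq_unit} {post}".strip(),
    }
```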
We also note that these results beat our previous baselines using only ngram features on the smaller original dataset of 851 posts per class (0.70 F1 for SARC, 0.71 F1 for NOT-SARC) (Oraby et al., 2016).

We investigate why certain context features benefit each class differently for the LSTM. Table 7 shows examples of single posts, divided into Pre, RQ, and Post. Looking at Row 1, it is clear that while the RQ and self-answer portion may not appear to be sarcastic, the Post context makes the sarcasm much more pronounced. This is frequent in the case of sarcastic debate posts, where the speaker often ends with a sharp remark, an interjection (like "gasp!!!"), or emoticons (like a wink ;) or roll-eyes 8-)). In the case of the OTHER forums posts, the RQ is often nestled within sequences of questions, or other RQ and self-answer pairs (Row 2). Again, while the SVM results do not vary based on changes in context, there is a large improvement in the OTHER class for the LSTM when using RQ+Post level context, giving us our best OTHER class results. Table 9, Row 4 shows an example of the "calls-to-action" that are frequent and distinctive in non-sarcastic Twitter RQs, asking users to visit a link at the end of the tweet (the Post context). In the case of the SARC tweet in Row 3, the extra tweet-level context (such as initial exclamations/interjections) aids in highlighting the sarcasm, but is limited in length compared to the forums posts, explaining the smaller gain from context in the Twitter domain for SARC.
Comparing both domains, we observe that the results for tweets in Table 6b are much higher than the results for forums in Table 6a, noting that this could be a result of less lexical diversity and a larger amount of data, making them more distinguishable than the more varied forums posts. We plan to explore these differences more extensively in future work.

Linguistic Characteristics of RQs by Class and Domain
In this section, we discuss linguistic characteristics we observe in our SARCASTIC vs. OTHER uses of RQs using the most informative LIWC features. Previous work has observed that FACTUAL utterances are often very heavy on technical jargon (Oraby et al., 2015): this is also true of factual questions. When analyzing differences in LIWC categories in our factual vs. RQ data, we find that our factual questions are slightly longer on average than the RQs (14 words on average compared to 12). We also find significant differences in "function" word categories (p < 0.05, unpaired t-test), marking the use of personal references, and in "affective processes" (p < 0.005). Both categories are more prevalent in the RQs than in the FACT questions, indicating more emotional language that is targeted towards the second party.
A qualitative analysis of our SARCASTIC vs. OTHER data shows that sarcastic RQs in forums are often followed by short statements that serve to point attention or mock, whereas the other RQ-self-response pairs often serve as a technique to concisely structure an argument. RQs in Twitter are frequently advertisements (persuasive communication) (Petty et al., 1981), making them more distinguishable from the more diverse sarcastic instances. Tables 8 and 9 show examples of LIWC features that are most characteristic of each domain and class based on our experiments. For ranking, we show the learned feature weight (FW) for each class, found by performing 10-fold cross-validation on each training set using an SVM model with only LIWC features. In Table 8, Row 1, we observe that 2nd person mentions are frequent in the sarcastic debate forums posts (referring to the other person in the debate), while in the Twitter domain, they come up as significant features in the non-sarcastic tweets, where they are used as methods to persuade readers to interact: click a link, like, comment, share (Table 9, Row 6). Likewise, "informal" words and more "verbal speech style" non-fluencies, including exclamations and social media slang ("netspeak"), also appear in sarcastic debate (Table 8, Rows 2 and 4). Features of sarcastic forums include exclamations (Table 8, Row 3), often used in a hyperbolic or figurative manner (McCarthy and Carter, 2004; Roberts and Kreuz, 1994). We find that sarcastic tweets frequently include sets of exclamations/interjections strung together with commas (Table 9, Row 1), and are often shorter than the tweets in the non-sarcastic class (Table 9, Row 3). Table 8 shows that "interrogatives" are a strong feature of argumentative forums (Row 7), as well as the use of technical jargon (including quantifiers and health words tied to domain-specific topics, such as abortion) (Row 8).
Table 9 indicates that OTHER tweets frequently contain forms of advertisement and calls-to-action involving 2 nd person references (Row 7). Similarly, RQ tweets are sometimes used to express frustration ("swear words" in Row 5), or increase engagement with references to "friends" and followers (Row 8).
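The feature-weight ranking used in the tables above can be sketched with scikit-learn. Averaging the linear-SVM coefficients over the folds is our assumption about how the per-fold weights are combined:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def rank_liwc_features(X, y, names, folds=10):
    """Rank features by mean linear-SVM weight over k folds; positive
    weights lean toward class 1, negative toward class 0."""
    weights = []
    for train_idx, _ in KFold(n_splits=folds, shuffle=True,
                              random_state=0).split(X):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        weights.append(clf.coef_[0])
    mean_w = np.mean(weights, axis=0)
    # Sort by absolute weight, most informative first.
    return sorted(zip(names, mean_w), key=lambda p: -abs(p[1]))
```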

Conclusions
In this study, we expand on a small corpus from previous work to create a large corpus of RQs in two domains where RQs are prevalent: debate forums and Twitter. To our knowledge, this is the first in-depth study dedicated to sarcasm and other uses of RQs in social media. We present supervised learning experiments using traditional and neural models to classify sarcasm in each domain, providing analysis of unique features across domains and classes, and exploring the effects of training of different levels of context.
We first show that we can distinguish between information-seeking and rhetorical questions (0.76 F1). We then focus on classifying sarcasm in only the RQs, showing that there are distinct linguistic differences between the methods of expression used in RQs across forums and Twitter. For forums, we show that we are able to distinguish between the sarcastic and other uses (noting they are often argumentative) with 0.76 F1 for SARC and 0.77 F1 for OTHER, improving on our baselines from previous work on a smaller dataset (Oraby et al., 2016).
We also explore sarcastic and other uses of RQs on Twitter, noting that other non-sarcastic uses of RQs are often advertisements, a form of persuasive communication not represented in debate dialog. We show that we can distinguish between sarcastic and other uses of RQ in Twitter with scores of 0.83 F1 for both the SARC and OTHER classes. We observe that tweets are generally more easily distinguished than the more diverse forums, and that the addition of linguistic categories from LIWC greatly improves classification performance. We also note that the LSTM model is more sensitive to context changes than the SVM model, and plan to explore the differences between the models in greater detail in future work.
Other future work also includes expanding our dataset to capture more instances of what may characterize RQs across these domains to improve performance, and also to analyze other interesting domains, such as Reddit. We believe that it will be possible to improve our results by using more robust models, and also by developing features to represent the sequential properties of RQs by further utilizing the larger context of the surrounding dialog in our analysis.