Magnets for Sarcasm: Making Sarcasm Detection Timely, Contextual and Very Personal

Sarcasm is a pervasive phenomenon in social media, permitting the concise communication of meaning, affect and attitude. Concision requires wit to produce and wit to understand, which demands from each party knowledge of norms, context and a speaker’s mindset. Insight into a speaker’s psychological profile at the time of production is a valuable source of context for sarcasm detection. Using a neural architecture, we show significant gains in detection accuracy when knowledge of the speaker’s mood at the time of production can be inferred. We focus on sarcasm detection on Twitter, and show that the mood exhibited by a speaker over tweets leading up to a new post is as useful a cue for sarcasm as the topical context of the post itself. The work opens the door to an empirical exploration not just of sarcasm in text but of the sarcastic state of mind.


Introduction
Oscar Wilde memorably described sarcasm as "the lowest form of wit but the highest form of intelligence." Though sarcasm lacks the sophistication of irony, and does little to conceal the speaker's disdain for a target, it is a figurative device that requires as much intelligence from its consumers as its producers. The concision with which sarcasm and irony allow speakers to conflate propositional content and affective stance makes them pervasive modes of communication in the 140-character tweets of Twitter. By combining an overtly positive attitude with a meaning that is more deserving of scorn, sarcasm allows speakers to communicate disappointment about a state of affairs in a way that bites (or etymologically "cuts the flesh" of) an addressee. It pairs the feeling the speaker would wish to experience ("I love it when ...") with the state of affairs that up-ends this feeling ("... my friends forget my birthday"). It often combines politeness with mockery to disguise the appearance of hostility while heightening its effect on a listener (Brown and Levinson, 1978; Dews and Winner, 1995). It establishes a wry environment (Dews and Winner, 1999) that has its roots in social norms and the speaker's state of mind.
Psychological theories of irony, such as echoic reminder theory (Kreuz and Glucksberg, 1989) and implicit display theory (Utsumi, 2000b), have yet to fully translate into text-analytic methods. Neuropsychology researchers who have sought patterns of brain activity to identify the neural correlates of sarcasm note that an understanding of sarcasm is highly dependent not just on the context of an utterance but on the state-of-mind and personality of the speaker, as well as on facial expressions and prosody (Shamay-Tsoory et al., 2005). Without the latter markers, purely textual detection must depend largely on the content and context of an utterance, though speaker personality and state-of-mind can also be approximated via text-analytic means. Probabilistic classification models that exploit textual cues, such as the juxtaposition of positive sentiment and negative situations (Riloff et al., 2013), discriminative words and punctuation marks, and emoticon usage (González-Ibánez et al., 2011), have achieved good performance across domains, yet these models typically suffer from an absence of psychological insight into a speaker and topical insight into the context of utterance production.
Kreuz and Link (2002) argue that the likelihood of sarcasm is proportional to the amount of knowledge shared by speaker and audience, which includes knowledge of the world and knowledge of the speaker and audience. Personality is defined by Olver and Mooradian (2003) as the "enduring characteristics of the individual", though mood, which is changeable, is perhaps just as useful if sampled in a timely fashion. The difference between personality and mood can be likened to that between climate and weather. Tausczik and Pennebaker (2010) have developed a Twitter-based mood analysis web service at AnalyzeWords.com which uses a variety of psycholinguistic criteria and the LIWC (Linguistic Inquiry and Word Count) resource 1 to quantify the recent mood (i.e. the recent weather) of a user along 11 dimensions ranging from Arrogance/Remoteness to Anger and Analyticity. To exploit the stable personality of an online user, Celli et al. (2016) sought a correlation between Big Five personality traits (Costa and McCrae, 2008) and the LIWC-quantifiable dimensions found in re-tweets amongst Twitter users. Rajadesingan et al. (2015) have also shown how relevant aspects of personality can be acquired from a speaker's past tweets. Since personality and mood can each influence the detection process, they underpin our first research question: To what extent can the quantifiable dimensions of either lead to a better understanding of sarcasm?
Reliable detection depends as much on the context of an utterance, which provides the motivation for sarcasm, as on its content. Consider, for example:
Speaker Utterance: @MSNBC of course all of those jobs will be in China
In reply to @realDonaldTrump: I will be the greatest jobs-producing president that God ever created.
The speaker's sarcastic intent cannot be grasped without knowledge of the larger context. This issue provides our second research question: How can we usefully incorporate utterance context into a neural network model of sarcasm detection? Sarcasm is ubiquitous but always in flux, relying on a changing swirl of socially relevant viewpoints. The following tweet is sarcastic by virtue of its echoic mockery of a widely ventilated opinion: Time to get my Sunday dose of #fakenews from the failing @nytimes.
This raises the third research question that we explore in the following sections: How can we train our sarcasm detection model to exploit evolving social norms and public opinions?
1 https://liwc.wpengine.com/

Related Work and Ideas
Sarcasm has been extensively researched by linguists and psychologists (Gibbs and Clark, 1992; Gibbs and Colston, 2007; Kreuz and Glucksberg, 1989; Utsumi, 2000a), yet due to the limited availability of stimuli, sarcasm detection in text has relied chiefly on the recognition of stock patterns and lexical cues. Sarcasm often highlights failed expectations by engaging in a pragmatic pretense that is designed to be seen through (Campbell and Katz, 2012), so cues such as interjections, intensifiers, punctuation and markers of non-veridicality and hyperbole play a crucial role in recognizing sarcastic intent. Likewise, stock plaudits such as "yay!" or "great!" are common in sarcastic product reviews, while hashtags such as #sarcasm, as compressed vehicles for user intent, are often used to self-annotate sarcastic texts. Liebrecht et al. (2013) used topic-specific information and n-grams as discriminative features, while Lukin and Walker (2013) showed that phrases such as "no way", "Oh really?" and "not so much" serve to flag a sarcastic intent when used with specific linguistic patterns.
Capelli et al. (1990) and Woodland and Voyer (2011) suggest that contextual awareness is a necessary precursor to identifying sarcasm. Sarcasm is a response to a motivating context that appears to force a rueful incongruity between a text and its context. Exploiting the principle of inferability (see Kreuz, 1996), Bamman and Smith (2015) modeled shared common knowledge by extracting features from context, the author, and the audience. Khattri et al. (2015) identified sarcasm by seeking a strong contrast in affect toward named entities in current vs. historical tweets, while Rajadesingan et al. (2015) also exploited a contrast in statistically-derived author traits across current and historical tweets. Zhang et al. (2016) use similar sources of contextual information to show the effectiveness of a neural network over more traditional approaches involving manually-selected, discrete features, claiming that automatic feature induction can uncover more subtle markers of sarcasm. Amir et al. (2016) argue that sarcasm detection hinges on speaker modeling, and exploited user embeddings to quantify incongruity between utterances and the behavioral traits of their authors. These methods measure the disparity between an utterance and expectations arising from knowledge of context or speaker or both together.
We build on this double-grounding for sarcasm to improve detection in a neural network model of sarcasm and thereby address our first two research questions. We model the speaker at the time of utterance production using mood indicators derived from the most recent prior tweets, and model context using features derived from the proximate cause of the new utterance, the tweet to which an utterance is a response. For our third research question, we present a novel feedback-based annotation scheme that engages authors of training/test tweets in a process of explicit annotation, feeding new examples back into the model. Section 3 outlines the kind and source of features exploited in the model. Section 4 outlines our methods of data collection and annotation. Section 5 presents the neural network model, while Sections 6 and 7 present our experimental set-up and analysis of results. Finally, Section 8 offers some closing remarks.

Psychological dimensions and Sarcasm
We cannot perceive a user's state-of-mind directly on Twitter, but we might infer one's current disposition from an analysis of recent tweets, as linguistic expressions tend to be congruent with an author's state-of-mind (Campbell and Katz, 2012). An informative if low-res psychological portrait is sketched by web services such as AnalyzeWords (Tausczik and Pennebaker, 2010), which analyzes the most recent 1000 words or so of a Twitter user using LIWC to score the user on 11 dimensions: Upbeat, Worried, Angry, Depressed, Plugged in, Personable, Arrogant, Spacy, Analytic, Sensory and In-the-moment. Sarcasm is often perceptible in the incongruity between utterance and context, but it can also be conveyed by an incongruity between text and recent mood.
To understand the relationship between these 11 dimensions (each scored from 0 to 100) and a propensity for sarcasm, we performed a k-Nearest Neighbors (KNN) clustering of the Twitter users that provide the tweets of our sarcastic dataset. The AnalyzeWords snapshot of each user was taken at the time of that user's tweet in the dataset. A value of 30 for k was chosen empirically to ensure a decent size for the clusters. By calculating Spearman correlations between each group and the 11 AnalyzeWords dimensions, we estimated the affinity for sarcasm of different dimensions. Unsurprisingly, we observed that clusters showing a high correlation with negative dimensions, such as Angry, also tend to use positive expressions such as "funny" and "wow" to mark sarcasm. Here is an example:
@realDonaldTrump They can all fit in your head? Wow! Have you seen someone about this?
Unless one knows that @realDonaldTrump often elicits anger, or that the author scored 83 (of 100) for Angry, this tweet might seem quite positive. Valence shifters such as "not" might also suggest literal positivity if not for the implicit anger of the author. At the time of the following tweet, AnalyzeWords scored its author as Angry=98:
@realdonaldtrump funny the founder of the birther movement is saying that he's not racist #trumpbirther
Polarizing figures such as @realDonaldTrump are magnets for sarcasm on Twitter. By identifying these magnets, we can better detect the sarcasm of a tweet that offers plaudits for negative qualities. We use AnalyzeWords to obtain the popular affective feelings for common addressees by averaging the affective dimensions of the users that tweet at them. The top 5 magnets for sarcasm in our dataset of 18K sarcastic tweets are @hillaryClinton, @realDonaldTrump, @bernieSanders, @AP and @megynKelly. Of these, @hillaryClinton is the biggest target for Angry tweets while @megynKelly is the biggest target for Analytic tweets.
Addressees in the political domain score high for both angry tweets and analytic tweets: people analyze the news and shoot the messenger. We see much less analyticity (a tendency to use complex expressions linked with logical connectives) in tweets about popular entertainers. To mock such targets, users tend to use affective words that contrast with overall public opinion. The magnet with the highest mean Angry score for the tweets that target him is @realDonaldTrump, yet 63% of the affective words in the tweets that target him in the dataset are positive. Knowing that @realDonaldTrump is a magnet for anger can help a sarcasm detector overcome this positive bias.
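The averaging step described above, in which an addressee's affective profile is the mean of the AnalyzeWords scores of the users who tweet at them, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(addressee, scores)` input format, and the case-insensitive handling of handles are all assumptions made for the example.

```python
from collections import defaultdict

# The 11 AnalyzeWords dimensions named in the text, in a fixed order.
DIMENSIONS = [
    "Upbeat", "Worried", "Angry", "Depressed", "Plugged in", "Personable",
    "Arrogant", "Spacy", "Analytic", "Sensory", "In-the-moment",
]


def addressee_profiles(tweets):
    """Average the authors' mood scores over all tweets sent to each addressee.

    `tweets` is an iterable of (addressee_handle, scores) pairs, where
    `scores` holds the 11 scores (0 to 100) of the tweet's author.
    Handles are lower-cased so that @realDonaldTrump and @realdonaldtrump
    are pooled (an assumption of this sketch).
    """
    sums = defaultdict(lambda: [0.0] * len(DIMENSIONS))
    counts = defaultdict(int)
    for addressee, scores in tweets:
        key = addressee.lower()
        for k, value in enumerate(scores):
            sums[key][k] += value
        counts[key] += 1
    return {
        key: dict(zip(DIMENSIONS, (total / counts[key] for total in sums[key])))
        for key in sums
    }


def top_magnet(profiles, dimension):
    """Return the addressee with the highest mean score on one dimension."""
    return max(profiles, key=lambda handle: profiles[handle][dimension])
```

A detector could then look up the mean Angry score of a tweet's addressee and treat a high value as a prior against reading positive words literally.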

Dataset Construction
Tweets with sarcastic intent are often misclassified due to a lack of shared context or knowledge between speaker and annotator. Opposing social beliefs and a dearth of topical or personal knowledge can lead to serious misjudgments. Relevant tweet sets can be harvested by searching for sarcasm-specific hashtags (e.g. #sarcasm), but this approach overlooks tweets that are not explicitly tagged as sarcastic by their authors. Thus we have devised a feedback-based system that contacts tweet authors directly after the fact to ask for their authoritative self-annotations for a potentially sarcastic tweet.

Data collection
To collect annotations from authors for their own tweets, we used a Twitterbot named @onlinesarcasm to exploit the "retweet with comment" function in Twitter. The bot chooses randomly from tweets that are addressed to any of 700 top Twitter users (as listed by TwitterCounter.com), as we expect high-profile figures to be magnets for sarcasm from others. The bot retweets a chosen tweet (s_i) to its author, appending a yes/no question (q_i) as a comment to elicit a reply.
At the time of retweeting s_i, the 11 AnalyzeWords.com dimensions (aw_i) of the tweet's author (u_i) are saved, along with the context tweet (s_j) by author (u_j) that provoked s_i. Authors respond to the bot by favoriting/retweeting the bot's request or via a reply (re_i) containing #Yes or #No. Author responses often contain more than a simple #Yes or #No, and so, after observing a series of responses, the following linguistic rules were used to extract the training annotations:
• If the number of retweets (r_i) or likes (l_i) for q_i is non-zero or re_i contains #Yes, then s_i is deemed positive for sarcasm.
• If re_i contains #No or an explicit mention of 'not sarcastic' or 'no sarcasm' or 'truth', then s_i is deemed negative for sarcasm.
We discarded any s_i lacking a context tweet s_j. Using author feedback, a dataset of 40K tweets was collected, comprising 18K tweets acknowledged as sarcastic and 22K deemed non-sarcastic. For another test set, we collected 1200 tweets: 550 tweets acknowledged as sarcastic by their authors and 650 acknowledged to be non-sarcastic.
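The two labelling rules and the context filter can be expressed as a small function. The sketch below is illustrative only: the names `label_tweet` and `keep_example` are hypothetical, and since the text does not say which rule wins when a response satisfies both, the positive rule is checked first as an assumption.

```python
import re

# Phrases that the second rule treats as an explicit denial of sarcasm.
NEGATIVE_PHRASES = ("not sarcastic", "no sarcasm", "truth")


def label_tweet(retweets, likes, reply_text):
    """Apply the two annotation rules to the feedback on one bot question.

    Returns True (sarcastic), False (non-sarcastic) or None when the
    feedback matches neither rule. Checking the positive rule first is an
    assumption of this sketch; the source does not specify precedence.
    """
    reply = (reply_text or "").lower()
    hashtags = set(re.findall(r"#(\w+)", reply))
    if retweets > 0 or likes > 0 or "yes" in hashtags:
        return True
    if "no" in hashtags or any(phrase in reply for phrase in NEGATIVE_PHRASES):
        return False
    return None


def keep_example(context_tweet):
    """Mirror the filter in the text: drop tweets that lack a context tweet."""
    return bool(context_tweet)
```

For example, `label_tweet(0, 0, "#No, just the truth")` yields `False`, while a reply with no recognised marker yields `None` and the tweet would be left unlabelled.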

External datasets
In addition to our own training and test sets, whose annotations come directly from tweet authors, we also used 5 Twitter datasets where tweet information, fetched by tweet identifier, contains the identifier of the context tweet (Ptáček et al., 2014; Bamman and Smith, 2015; Rajadesingan et al., 2015; Cliche, 2014), from which motivating contexts can be discerned for each. (This contextual requirement prevents us from considering even more of the available sarcasm datasets.) For the context tweets s_j for each s_i in these sets we collected the most recent linked tweets of s_i. To obtain the 11 AnalyzeWords.com dimensions for tweet authors, we collect the 50 tweets of u_i posted just prior to s_i, and use the LIWC to estimate the 11 dimensions (Anger, Arrogance, etc.) from those tweets. As AnalyzeWords.com does not provide retrospective analyses, and as its code is not public, we reverse-engineered a substitute using the LIWC by following the creators' guidelines in (Tausczik and Pennebaker, 2010). For subsequent evaluations, the 5 external datasets were split into 3 parts each: 80% for training, 10% for development/tuning, and 10% for testing.
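Two of the data-preparation steps above are mechanical enough to sketch: selecting the 50 tweets an author posted just before the target tweet, and splitting a dataset 80/10/10. The code below is a hedged illustration; the function names, the `(timestamp, text)` timeline format, and the use of a seeded random shuffle before splitting are assumptions, as the text does not say how the split was drawn.

```python
import random


def prior_tweets(timeline, posted_at, n=50):
    """Return the n tweets posted most recently before `posted_at`, oldest first.

    `timeline` is an iterable of (timestamp, text) pairs for one author.
    """
    earlier = sorted((t for t in timeline if t[0] < posted_at), key=lambda t: t[0])
    return earlier[-n:] if n > 0 else []


def split_dataset(examples, seed=0):
    """Split examples into 80% train, 10% development and 10% test.

    Shuffling with a fixed seed is an assumption made for reproducibility.
    Any remainder from integer division goes to the test part.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```

The text of the selected prior tweets would then be passed to whatever LIWC-based scorer stands in for AnalyzeWords.com; that scorer is not sketched here because its details are not given.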

The Neural Network Model
Ghosh and Veale (2016) described an Artificial Neural Network (ANN) model built around layers of CNNs (Convolutional Neural Networks) and LSTMs (Long Short-Term Memory networks) for sarcasm detection, designed to efficiently capture contrasting text signals of sarcasm within a tweet. We build here on this model as shown in Fig. 1, adding input features for the psychological profile of the author and the context of the tweet to those for the tweet itself. The LSTM layer (Hochreiter and Schmidhuber, 1997) captures dependencies amongst non-adjacent contrasting signals for sarcasm within each s_i. We extend this architecture to include a context tweet s_j for each s_i, but instead of concatenating s_j and s_i at the input layer, we stitch them together after the LSTM layer. The text input layer is initialized with embeddings from Google's Word2Vec model (Mikolov et al., 2013) with a dimension setting of 300. To further integrate features reflecting the state of mind of the speaker at utterance-time, the values aw_i (i = 1...11) for each s_i are concatenated with the feature vectors of s_j and s_i in the merge layer. We use a bi-directional LSTM (BLSTM) and forego a max-pooling layer to increase throughput to the BLSTM. We prevent overfitting using a dropout layer with a dropout rate of 0.25 after the BLSTM layers. The concatenation layer combines the feature maps of the source and context tweets (s_i and s_j) along with a vector of aw_1...aw_11 for the author u_i. The concatenation yields a merge layer of size f(2(|s| - m + 1) + l), where f, |s|, m and l are, respectively, the number of BLSTM units, the length of the input sequence, the width of the CNN filter and the length of aw.
Figure 1: A Neural Architecture for Detecting Sarcasm in Contextualized Utterances
Notice that the features for a tweet s_i and its immediate context s_j, which we consider the proximate cause of the sarcasm (if any) in s_i, are concatenated only after they have passed through separate sets of CNN and LSTM layers (CNN1 + BLSTM1 and CNN2 + BLSTM2). It is important to keep a tweet and its context separate for as long as possible, as the model is designed to recognize an inherent incongruity between the two. This incongruity becomes diffuse if the inputs are combined too soon. EAW is the embedding layer for the 11 AnalyzeWords dimensions; the concatenation layer combines the vectors of s_j, s_i and aw, and passes the concatenated features to a Deep Neural Network (DNN) to discriminate the two classes (sarcasm vs. non-sarcasm). The code is developed using Keras.
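As a worked check on the merge-layer size, the arithmetic can be written out directly. This sketch reads the size as f(2(|s| - m + 1) + l), which assumes that each width-m filter is applied without padding and with stride 1, so that a sequence of length |s| yields |s| - m + 1 positions per tweet branch. The function names and the example values are hypothetical and are not the paper's hyper-parameters.

```python
def conv_output_length(seq_len, filter_width):
    """Number of positions produced by an unpadded, stride-1 1D convolution."""
    if filter_width < 1 or filter_width > seq_len:
        raise ValueError("filter width must be between 1 and the sequence length")
    return seq_len - filter_width + 1


def merge_layer_size(f, seq_len, filter_width, n_dims):
    """Size of the merge layer under the reading f * (2 * (|s| - m + 1) + l).

    f            -- number of BLSTM units
    seq_len      -- |s|, length of each input sequence
    filter_width -- m, width of the CNN filter
    n_dims       -- l, length of the aw vector (11 in the text)
    """
    per_tweet = conv_output_length(seq_len, filter_width)
    # Two tweet branches (target and context) plus the embedded aw vector.
    return f * (2 * per_tweet + n_dims)


# Illustrative values only: f=4, |s|=30, m=3, l=11.
example_size = merge_layer_size(4, 30, 3, 11)
```

With these illustrative values each branch contributes 28 positions, so the merge layer has 4 * (2 * 28 + 11) = 268 units.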

Evaluation and Experimental Setup
Success with a neural architecture requires apt input features and an equally apt selection of hyper-parameters. After performing a grid search over hyper-parameters, the best configuration of the CNN, LSTM and DNN layers places 1280 hidden memory units into each layer and uses a CNN filter width of 3. A simple baseline uses only the textual content of a tweet s_i without a context s_j or an affective profile aw of the author u_i. To appreciate the contribution of different input sources of information, we trained the network on different combinations of these sources.

Addressee information
If s_i is addressed to u_j, this information can provide additional insights into the tone of s_i. In the TTIA (Target Tweet Including Addressee) setting, the name of the addressee (but not an estimation of the public opinion of the addressee, as so few addressees are actually famous) is added to the baseline along with s_i. If the addressee is a magnet for sarcasm, aspects of this magnetism should still impress themselves on the network during training.
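The difference between the two text-only settings can be illustrated with a small preprocessing helper: one variant keeps the addressee handle in the tweet text and the other removes it. This is only a sketch under an assumption, since the text does not describe how handles are excluded; here every @mention is stripped with a regular expression, and the function name `target_text` is hypothetical.

```python
import re

# Matches Twitter-style handles such as @realDonaldTrump.
MENTION = re.compile(r"@\w+")


def target_text(tweet, include_addressee):
    """Return the tweet text with or without addressee handles.

    include_addressee=True keeps the text as posted (the TTIA setting);
    include_addressee=False removes all @mentions and tidies whitespace,
    which is this sketch's assumed way of excluding the addressee.
    """
    if include_addressee:
        return tweet
    stripped = MENTION.sub("", tweet)
    return " ".join(stripped.split())
```

Keeping the handle as an ordinary token lets the network associate particular addressees with sarcasm during training, without any explicit estimate of public opinion about them.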

Contextual Information
In a variant of the baseline called CT (Context Tweet), the features of the tweet s_j to which the target s_i is a response are also added as inputs to the model, to be stitched together with the features of s_i at the concatenation layer. Changes in performance with and without CT will allow us to estimate the value of context in sarcasm detection.

Author Profile Information
The 11-dimensional AnalyzeWords snapshot aw for author u_i at the time s_i is posted offers valuable insights into the intent of u_i. In the PD (Psychological Dimensions) configuration, the 11 affective dimensions aw_i are added to the model. They pass through an embedding layer to be combined with utterance (and possibly context) features at the concatenation layer. To determine the relative contribution of each dimension aw_i to detection competence, we trained the model in two extra modes. In the first, we fed the model with false values for each aw_i, varying values from 0 to 100, to observe the effects on accuracy when e.g. Angry is over- or under-estimated for u_i. In the second extra mode, we excised each aw_i, one at a time in different training runs, to quantify the effect of its absence on the model.
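The two extra training modes amount to two transformations of an author's 11-value profile: overwriting one dimension with a false value, and removing one dimension altogether. The sketch below shows both on a plain list of scores. The helper names and the step size of 10 used in the sweep are assumptions for illustration; the text says only that values were varied from 0 to 100.

```python
# The 11 AnalyzeWords dimensions named in the text, in a fixed order.
DIMENSIONS = [
    "Upbeat", "Worried", "Angry", "Depressed", "Plugged in", "Personable",
    "Arrogant", "Spacy", "Analytic", "Sensory", "In-the-moment",
]


def with_false_value(profile, dimension, value):
    """Return a copy of the profile with one dimension set to a false value."""
    if not 0 <= value <= 100:
        raise ValueError("scores range from 0 to 100")
    k = DIMENSIONS.index(dimension)
    return profile[:k] + [value] + profile[k + 1:]


def without_dimension(profile, dimension):
    """Return a 10-value copy of the profile with one dimension excised."""
    k = DIMENSIONS.index(dimension)
    return profile[:k] + profile[k + 1:]


def sweep(profile, dimension, step=10):
    """Profiles with one dimension varied from 0 to 100 (step is assumed)."""
    return [with_false_value(profile, dimension, v) for v in range(0, 101, step)]
```

In the first mode each perturbed profile would replace the true one at training or test time; in the second, the model is retrained once per excised dimension on the shortened vectors.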

Automatic adaptation
Online sarcasm is often used to comment on the vagaries of politics and current affairs. As topicality is of the essence, we expect that a model that regularly acquires new author-annotated training data to bootstrap itself will adapt better to the times and yield better results. To estimate the benefits of bootstrapping, we tested the model on an evolving version of the dataset that acquired new training data each week for a month in August 2016.

Results & Analysis
Table 3 shows the recall (R), precision (P) and F-score (F1) for our model, called Sarcasm Magnet, with alternate configurations on different datasets. The configuration for each setup is given in the second row (e.g. the addition of context tweets requires the use of two LSTMs and two CNNs). Setup TTEA is the baseline which uses only the text of a target tweet; it excludes addressee handles, context tweets (CT) and the psychological dimensions (PD) of authors. Setup TTIA adds addressee handles to the baseline to give our model a small boost, mostly in recall. Setup TTEA+CT adds context tweets to the baseline, yielding a significant boost since a good deal of sarcasm is conversational in nature. In this setup the most significant improvement in recall was observed with the Bamman dataset (Bamman and Smith, 2015). In setup TTIA+CT, which uses context and addressee handles, no significant improvement over TTEA+CT is observed, except for precision on Bamman's dataset. In setup TTEA+PD, the affective profile of each author at tweet-time is added to the baseline to yield a significant boost in performance almost as large as that for TTIA+CT. Setup TTIA+CT+PD includes all available information sources (addressee, context, and psychological profile). This column reports (in parentheses) the performance on each dataset by the dataset creator's own system, which is either publicly available or re-implemented from their paper. The results show that Sarcasm Magnet beats the state of the art for these datasets.
Table 1 shows the effect on the model's performance in the absence of specific aw_i values. A boost in recall and a drop in precision show the bias of the model shifting towards sarcasm when the space of non-sarcastic tweets overlaps with that of sarcastic tweets in the absence of an aw_i that confirms literal intent. So political tweets may be misclassified as sarcasm in the absence of values for Angry, Depressed, and Worried, suggesting that sarcastic authors often seem less angry, depressed or worried. A drop in precision and recall when the Arrogant/Remote, Analytic, Plugged in and In-the-moment dimensions are absent suggests that sarcastic people are more socially active and aware, and smarter but more arrogant.
Table 1: Performance of the model when a specific dimension aw_i is omitted from training.
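For readers who wish to reproduce the reported measures, precision, recall and F-score for one class can be computed from gold and predicted labels as below. This is the standard definition rather than code from the paper, and the function name is hypothetical; the tables report the measures separately for the sarcastic and non-sarcastic classes, which corresponds to calling the function once per class.

```python
def precision_recall_f1(gold, predicted, positive=True):
    """Compute precision, recall and F1 for the class labelled `positive`.

    `gold` and `predicted` are equal-length sequences of labels.
    Undefined ratios (no predicted or no gold positives) are returned as 0.0.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A model that drifts towards predicting sarcasm gains true positives and false positives together, which is why removing a dimension that confirms literal intent can raise recall while lowering precision.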

A Tale of Two Contexts
The CT and PD additions each bring significant improvements in F-score, yet when added jointly they bring no significant increases over either used individually. Each is a form of context, drawn from a different source and reflecting different intuitions, that ultimately offers much the same insights. The impact of the 11 aw dimensions is lower on the 5 external datasets than on the new feedback-based dataset, no doubt because the AnalyzeWords.com snapshot of authors in the latter could be taken directly at tweet-time, whilst for the former it was retrospectively approximated using our own jerry-rigged version based on the LIWC. If the official web service were to allow retrospective analyses of Twitter users at specific times, we are confident the improvements on the external datasets would mirror those on our own dataset. For now it is interesting to note the effectiveness of the AnalyzeWords.com service at affectively profiling Twitter users at specific times, which is to say, at specific contexts in their Twitter timelines. The service boils down the most recent tweets (approx. 1000 words in total) to 11 dimensions that are more than simple functions of the lexical scores in the LIWC. Rather, it analyzes the selected text as a coherent product of a coherent mind-set, to measure a local propensity for hostility, optimism, depression, emotional detachment and preference for reason. To use our earlier analogy, AnalyzeWords.com forecasts the psychological weather around an author, not the user's stable climate. Though we may often speak of a "sarcastic personality" as a stable aspect of some speakers, most users of sarcasm will not fall into this category. As such, insight into the recent mindset of an author is more valuable to a detector than knowledge of one's personality overall.

Rolling With The Punches
Our feedback-annotated dataset was collected during a fertile period for sarcasm online: the height of the 2016 US presidential campaign. The main body of the new dataset was collected and annotated (as described earlier) in the early summer of 2016. During the month of August we acquired additional annotated training data in four weekly tranches, to incrementally adapt the model to an evolving political and social context. As shown in

Conclusions & Future work
Context is vital to the understanding of the fruits of any figurative device, whether metaphor, irony or sarcasm. We have explored two sources of contextual information in this work: the linguistic context of the utterance itself, which we take to be another utterance that is the proximate cause of the text under consideration, and the psychological context of the utterance's author, which we take to be the mind-set that is apparent in the author's most recent writings on Twitter. Each source of context is ultimately grounded in a text and understood in text-analytic terms. It is perhaps not so surprising then that each kind of context yields similarly large improvements to a neural model of sarcasm detection when added in isolation, but no large improvements over either alone when both are combined in a single model. This work makes three principal contributions to the computational analysis of sarcasm. First, as outlined above, it shows how different kinds of context, from the linguistic to the psychological, can be usefully incorporated to yield improved detection. Second, it shows how accurate annotation of training data can be automated on Twitter by going directly to the source of each training text, to obtain a definitive answer as to its figurative status. So the resulting neural model does not learn to approximate the reasoning of independent human annotators but the mind-set and intent of the authors themselves. Third, and perhaps most usefully for future work by others, this feedback-based dataset will be made available for use by other researchers and in other evaluations. Importantly, this dataset is not merely a collection of yes/no annotated texts, even if the yeses and nos come from authoritative sources. For each text in the dataset, we can provide the linguistic context to which it is a response, and furthermore, we can provide a psychological snapshot of the author at the time the tweet was posted on Twitter. In the end we believe this is the most valuable contribution of the work, as it will allow others to incorporate an understanding of personality and mind-set into their own models of that most personal and moody of figurative devices, sarcasm.
Table 3: Evaluation of Sarcasm Magnet (P: precision; R: recall; F1: F-score; TTEA: target tweet excluding addressee; TTIA: target tweet including addressee; CT: Context Tweet; PD: Psychological Dimensions; S: sarcastic; NS: non-sarcastic). All results are for the Sarcasm Magnet model; when available, results obtained by other authors on their own datasets are in parentheses. *Sarcasm Magnet is the name of the current system and its associated dataset.