Harnessing Cognitive Features for Sarcasm Detection

In this paper, we propose a novel mechanism for enriching the feature vector, for the task of sarcasm detection, with cognitive features extracted from eye-movement patterns of human readers. Sarcasm detection has been a challenging research problem, and its importance for NLP applications such as review summarization, dialog systems and sentiment analysis is well recognized. Sarcasm can often be traced to incongruity that becomes apparent as the full sentence unfolds. This presence of incongruity- implicit or explicit- affects the way readers eyes move through the text. We observe the difference in the behaviour of the eye, while reading sarcastic and non sarcastic sentences. Motivated by his observation, we augment traditional linguistic and stylistic features for sarcasm detection with the cognitive features obtained from readers eye movement data. We perform statistical classification using the enhanced feature set so obtained. The augmented cognitive features improve sarcasm detection by 3.7% (in terms of F-score), over the performance of the best reported system.


Introduction
Sarcasm is an intensive, indirect and complex construct that is often intended to express contempt or ridicule 1 . Sarcasm, in speech, is multi-modal, involving tone, body-language and gestures along with linguistic artifacts used in speech. Sarcasm in text, on the other hand, is more restrictive when it comes to such non-linguistic modalities. This makes recognizing textual sarcasm more challenging for both humans and machines. 1 The Free Dictionary Sarcasm detection plays an indispensable role in applications like online review summarizers, dialog systems, recommendation systems and sentiment analyzers. This makes automatic detection of sarcasm an important problem. However, it has been quite difficult to solve such a problem with traditional NLP tools and techniques. This is apparent from the results reported by the survey from Joshi et al. (2016). The following discussion brings more insights into this.
Consider a scenario where an online reviewer gives a negative opinion about a movie through sarcasm: "This is the kind of movie you see because the theater has air conditioning". It is difficult for an automatic sentiment analyzer to assign a rating to the movie and, in the absence of any other information, such a system may not be able to comprehend that prioritizing the air-conditioning facilities of the theater over the movie experience indicates a negative sentiment towards the movie. This gives an intuition to why, for sarcasm detection, it is necessary to go beyond textual analysis.
We aim to address this problem by exploiting the psycholinguistic side of sarcasm detection, using cognitive features extracted with the help of eye-tracking. A motivation to consider cognitive features comes from analyzing human eyemovement trajectories that supports the conjecture: Reading sarcastic texts induces distinctive eye movement patterns, compared to literal texts. The cognitive features, derived from human eye movement patterns observed during reading, include two primary feature types: 1. Eye movement characteristic features of readers while reading given text, comprising gaze-fixaions (i.e,longer stay of gaze on a visual object), forward and backward saccades (i.e., quick jumping of gaze between two positions of rest).
2. Features constructed using the statistical and deeper structural information contained in graph, created by treating words as vertices and saccades between a pair of words as edges.
The cognitive features, along with textual features used in best available sarcasm detectors, are used to train binary classifiers against given sarcasm labels. Our experiments show significant improvement in classification accuracy over the state of the art, by performing such augmentation.

Feasibility of Our Approach
Since our method requires gaze data from human readers to be available, the methods practicability becomes questionable. We present our views on this below.

Availability of Mobile Eye-trackers
Availability of inexpensive embedded eye-trackers on hand-held devices has come close to reality now. This opens avenues to get eye-tracking data from inexpensive mobile devices from a huge population of online readers non-intrusively, and derive cognitive features to be used in predictive frameworks like ours. For instance, Cogisen: (http://www.sencogi.com) has a patent (ID: EP2833308-A1) on "eye-tracking using inexpensive mobile web-cams".

Applicability Scenario
We believe, mobile eye-tracking modules could be a part of mobile applications built for e-commerce, online learning, gaming etc. where automatic analysis of online reviews calls for better solutions to detect linguistic nuances like sarcasm. To give an example, let's say a book gets different reviews on Amazon. Our system could watch how readers read the review using mobile eye-trackers, and thereby, decide whether the text contains sarcasm or not. Such an application can horizontally scale across the web and will help in improving automatic classification of online reviews.
Since our approach seeks human mediation, one might be tempted to question the approach of relying upon eye-tracking, an indirect indicator, instead of directly obtaining man-made annotations. We believe, asking a large number of internet audience to annotate/give feedback on each and every sentence that they read online, following a set of annotation instructions, will be extremely intrusive and may not be responded well. Our system, on the other hand, can be seamlessly integrated into existing applications and as the eye-tracking process runs in the background, users will not be interrupted in the middle of the reading. This, thus, offers a more natural setting where human mediation can be availed without intervention.

Getting Users' Consent for Eye-tracking
Eye-tracking technology has already been utilized by leading mobile technology developers (like Samsung) to facilitate richer user experiences through services like Smart-scroll (where a user's eye movement determines whether a page has to be scrolled or not) and Smart-lock (where user's gaze position decides whether to lock the screen or not). The growing interest of users in using such services takes us to a promising situation where getting users' consent to record eyemovement patterns will not be difficult, though it is yet not the current state of affairs.
Disclaimer: In this work, we focus on detecting sarcasm in non-contextual and short-text settings prevalent in product reviews and social media. Moreover, our method requires eye-tracking data to be available in the test scenario.

Related Work
Sarcasm, in general, has been the focus of research for quite some time. In one of the pioneering works Jorgensen et al. (1984) explained how sarcasm arises when a figurative meaning is used opposite to the literal meaning of the utterance. In the word of Clark and Gerrig (1984), sarcasm processing involves canceling the indirectly negated message and replacing it with the implicated one. Giora (1995), on the other hand, define sarcasm as a mode of indirect negation that requires processing of both negated and implicated messages. Ivanko and Pexman (2003) define sarcasm as a six tuple entity consisting of a speaker, a listener, Context, Utterance, Literal Proposition and Intended Proposition and study the cognitive aspects of sarcasm processing.
Most of the previously done work on sarcasm detection uses distant supervision based techniques (ex: leveraging hashtags) and stylistic/pragmatic features (emoticons, laughter expressions such as "lol" etc). But, detecting sarcasm in linguistically well-formed structures, in absence of explicit cues or information (like emoticons), proves to be hard using such linguistic/stylistic features alone.
With the advent of sophisticated eyetrackers and electro/magneto-encephalographic (EEG/MEG) devices, it has been possible to delve deep into the cognitive underpinnings of sarcasm understanding. Filik (2014), using a series of eye-tracking and EEG experiments try to show that for unfamiliar ironies, the literal interpretation would be computed first. They also show that a mismatch with context would lead to a re-interpretation of the statement, as being ironic. Camblin et al. (2007) show that in multi-sentence passages, discourse congruence has robust effects on eye movements. This also implies that disrupted processing occurs for discourse incongruent words, even though they are perfectly congruous at the sentence level. In our previous work (Mishra et al., 2016), we augment cognitive features, derived from eye-movement patterns of readers, with textual features to detect whether a human reader has realized the presence of sarcasm in text or not.
The recent advancements in the literature discussed above, motivate us to explore gaze-based cognition for sarcasm detection. As far as we know, our work is the first of its kind.

Eye-tracking Database for Sarcasm Analysis
Sarcasm often emanates from incongruity (Campbell and Katz, 2012), which enforces the brain to reanalyze it (Kutas and Hillyard, 1980). This, in turn, affects the way eyes move through the text. Hence, distinctive eye-movement patterns may be observed in the case of successful processing of sarcasm in text in contrast to literal texts. This hypothesis forms the crux of our method for sarcasm detection and we validate this using our previously released freely available sarcasm dataset 2 (Mishra et al., 2016) Table 1: T-test statistics for average fixation duration time per word (in ms) for presence of sarcasm (represented by S) and its absence (NS) for participants P1-P7. information.

Document Description
The database consists of 1,000 short texts, each having 10-40 words. Out of these, 350 are sarcastic and are collected as follows: (a) 103 sentences are from two popular sarcastic quote websites 3 , (b) 76 sarcastic short movie reviews are manually extracted from the Amazon Movie Corpus (Pang and Lee, 2004) by two linguists. (c) 171 tweets are downloaded using the hashtag #sarcasm from Twitter. The 650 non-sarcastic texts are either downloaded from Twitter or extracted from the Amazon Movie Review corpus. The sentences do not contain words/phrases that are highly topic or culture specific. The tweets were normalized to make them linguistically well formed to avoid difficulty in interpreting social media lingo. Every sentence in our dataset carries positive or negative opinion about specific "aspects". For example, the sentence "The movie is extremely well cast" has positive sentiment about the aspect "cast". The annotators were seven graduate students with science and engineering background, and possess good English proficiency. They were given a set of instructions beforehand and are advised to seek clarifications before they proceed. The instructions mention the nature of the task, annotation input method, and necessity of head movement minimization during the experiment.

Task Description
The task assigned to annotators was to read sentences one at a time and label them with with binary labels indicating the polarity (i.e., positive/negative). Note that, the participants were not instructed to annotate whether a sentence is sarcastic or not., to rule out the Priming Effect (i.e., if sarcasm is expected beforehand, processing incongruity becomes relatively easier (Gibbs, 1986)). The setup ensures its "ecological validity" in two ways: (1) Readers are not given any clue that they have to treat sarcasm with special attention. This is done by setting the task to polarity annotation (instead of sarcasm detection).
(2) Sarcastic sentences are mixed with non sarcastic text, which does not give prior knowledge about whether the forthcoming text will be sarcastic or not.
The eye-tracking experiment is conducted by following the standard norms in eye-movement research (Holmqvist et al., 2011). At a time, one sentence is displayed to the reader along with the "aspect" with respect to which the annotation has to be provided. While reading, an SR-Research Eyelink-1000 eye-tracker (monocular remote mode, sampling rate 500Hz) records several eye-movement parameters like fixations (a long stay of gaze) and saccade (quick jumping of gaze between two positions of rest) and pupil size.
The accuracy of polarity annotation varies between 72%-91% for sarcastic texts and 75%-91% for non-sarcastic text, showing the inherent difficulty of sentiment annotation, when sarcasm is present in the text under consideration. Annotation errors may be attributed to: (a) lack of patience/attention while reading, (b) issues related to text comprehension, and (c) confusion/indecisiveness caused due to lack of context. For our analysis, we do not discard the incorrect annotations present in the database. Since our system eventually aims to involve online readers for sarcasm detection, it will be hard to segregate readers who misinterpret the text. We make a rational assumption that, for a particular text, most of the readers, from a fairly large population, will be able to identify sarcasm. Under this assumption, the eye-movement parameters, averaged across all readers in our setting, may not be significantly distorted by a few readers who would have failed to identify sarcasm. This assumption is applicable for both regular and multi-instance based classifiers explained in section 6.

Analysis of Eye-movement Data
We observe distinct behavior during sarcasm reading, by analyzing the "fixation duration on the text" (also referred to as "dwell time" in the lit- S2: The lead actress is terrible and I cannot be convinced she is supposed to be some forensic genius. S1: I'll always cherish the original misconception I had of you.. Figure 1: Scanpaths of three participants for two negatively polar sentences sentence S1 and S2. Sentence S1 is sarcastic but S2 is not. erature) and "scanpaths" of the readers.

Variation in the Average Fixation Duration per Word
Since sarcasm in text can be expected to induce cognitive load, it is reasonable to believe that it would require more processing time (Ivanko and Pexman, 2003). Hence, fixation duration normalized over total word count should usually be higher for a sarcastic text than for a non-sarcastic one. We observe this for all participants in our dataset, with the average fixation duration per word for sarcastic texts being at least 1.5 times more than that of non-sarcastic texts. To test the statistical significance, we conduct a twotailed t-test (assuming unequal variance) to compare the average fixation duration per word for sarcastic and non-sarcastic texts. The hypothesized mean difference is set to 0 and the error tolerance limit (α) is set to 0.05. The t-test analysis, presented in Table 1, shows that for all participants, a statistically significant difference exists between the average fixation duration per word for sarcasm (higher average fixation duration) and nonsarcasm (lower average fixation duration). This affirms that the presence of sarcasm affects the duration of fixation on words.
It is important to note that longer fixations may also be caused by other linguistic subtleties (such as difficult words, ambiguity and syntactically complex structures) causing delay in comprehension, or occulomotor control problems forcing readers to spend time adjusting eye-muscles. So, an elevated average fixation duration per word may not sufficiently indicate the presence of sarcasm. But we would also like to share that, for our dataset, when we considered readability (Flesch readability ease-score (Flesch, 1948)), number of words in a sentence and average character per word along with the sarcasm label as the predictors of average fixation duration following a linear mixed effect model (Barr et al., 2013), sarcasm label turned out to be the most significant predictor with a maximum slope. This indicates that average fixation duration per word has a strong connection with the text being sarcastic, at least in our dataset. We now analyze scanpaths to gain more insights into the sarcasm comprehension process.

Analysis of Scanpaths
Scanpaths are line-graphs that contain fixations as nodes and saccades as edges; the radii of the nodes represent the fixation duration. A scanpath corresponds to a participant's eye-movement pattern while reading a particular sentence. Figure 1 presents scanpaths of three participants for the sarcastic sentence S1 and the non-sarcastic sentence S2. The x-axis of the graph represents the sequence of words a reader reads, and the y-axis represents a temporal sequence in milliseconds.
Consider a sarcastic text containing incongruous phrases A and B. Our qualitative scanpathanalysis reveals that scanpaths with respect to sarcasm processing have two typical characteristics. Often, a long regression -a saccade that goes to a previously visited segment -is observed when a reader starts reading B after skimming through A. In a few cases, the fixation duration on A and B are significantly higher than the average fixation duration per word. In sentence S1, we see long and multiple regressions from the two incongruous phrases "misconception" and "cherish", and a few instances where phrases "always cherish" and "original misconception" are fixated longer than usual. Such eye-movement behaviors are not seen for S2.
Though sarcasm induces distinctive scanpaths like the ones depicted in Figure 1 in the observed examples, presence of such patterns is not sufficient to guarantee sarcasm; such patterns may also possibly arise from literal texts. We believe that a combination of linguistic features, readability of text and features derived from scanpaths would help discriminative machine learning models learn sarcasm better.

Features for Sarcasm Detection
We describe the features used for sarcasm detection in Table 2. The features enlisted under lexical,implicit incongruity and explicit incongruity are borrowed from various literature (predominantly from Joshi et al. (2015)). These features are essential to separate sarcasm from other forms semantic incongruity in text (for example ambiguity arising from semantic ambiguity or from metaphors). Two additional textual features viz. readability and word count of the text are also taken under consideration. These features are used to reduce the effect of text hardness and text length on the eye-movement patterns.

Simple Gaze Based Features
Readers' eye-movement behavior, characterized by fixations, forward saccades, skips and regressions, can be directly quantified by simple statistical aggregation (i.e., either computing features for individual participants and then averaging or performing a multi-instance based learning as explained in section 6). Since these eye-movement attributes relate to the cognitive process in reading (Rayner and Sereno, 1994), we consider these as features in our model. Some of these features have been reported by Mishra et al. (2016) for modeling sarcasm understandability of readers. However, as far as we know, these features are being introduced in NLP tasks like textual sarcasm detection for the first time. The values of these features are believed to increase with the increase in the degree of surprisal caused by incongruity in text (except skip count, which will decrease).

Complex Gaze Based Features
For these features, we rely on a graph structure, namely "saliency graphs", derived from eye-gaze information and word sequences in the text.

Constructing Saliency Graphs:
For each reader and each sentence, we construct a "saliency graph", representing the reader's atten-   tion characteristics. A saliency graph for a sentence S for a reader R, represented as G = (V, E), is a graph with vertices (V ) and edges (E) where each vertex v ∈ V corresponds to a word in S (may not be unique) and there exists an edge e ∈ E between vertices v 1 and v 2 if R performs at least one saccade between the words corresponding to v1 and v2. Figure 2 shows an example of a saliency graph.A saliency graph may be weighted, but not necessarily connected, for a given text (as there may be words in the given text with no fixation on them). The "complex" gaze features derived from saliency graphs are also motivated by the theory of incongruity. For instance, Edge Density of a saliency graph increases with the number of distinct saccades, which could arise from the complexity caused by presence of sarcasm. Similarly, the highest weighted degree of a graph is expected to be higher, if the node corresponds to a phrase, incongruous to some other phrase in the text.

The Sarcasm Classifier
We interpret sarcasm detection as a binary classification problem. The training data constitutes Features P(1) P(-1) P(avg) R(1) R(-1) R(avg) F(1) F(-   994 examples created using our eye-movement database for sarcasm detection. To check the effectiveness of our feature set, we observe the performance of multiple classification techniques on our dataset through a stratified 10-fold cross validation. We also compare the classification accuracy of our system and the best available systems proposed by Riloff et al. (2013) and Joshi et al. (2015) on our dataset. Using Weka (Hall et al., 2009) and LibSVM (Chang and Lin, 2011) APIs, we implement the following classifiers: • Näive Bayes classifier • Support Vector Machines (Cortes and Vapnik, 1995) (Xu and Frank, 2004) 6.1 Results Table 3 shows the classification results considering various feature combinations for different classifiers and other systems. These are: • Unigram (with principal components of unigram feature vectors), • Sarcasm (the feature-set reported by Joshi et al. (2015) subsuming unigram features and features from other reported systems) • Gaze (the simple and complex cognitive features we introduce, along with readability and word count features), and • Gaze+Sarcasm (the complete set of features). For all regular classifiers, the gaze features are averaged across participants and augmented with linguistic and sarcasm related features. For the MILR classifier, the gaze features derived from each participant are augmented with linguistic features and thus, a multi instance "bag" of features is formed for each sentence in the training data. This multi-instance dataset is given to an MILR classifier, which follows the standard multi instance assumption to derive class-labels for each bag.
For all the classifiers, our feature combination outperforms the baselines (considering only unigram features) as well as (Joshi et al., 2015), with the MILR classifier getting an F-score improvement of 3.7% and Kappa difference of 0.08. We also achieve an improvement of 2% over the baseline, using SVM classifier, when we employ our feature set. We also observe that the gaze features alone, also capture the differences between sarcasm and non-sarcasm classes with a highprecision but a low recall.
To see if the improvement obtained is statistically significant over the state-of-the art system with textual sarcasm features alone, we perform McNemar test. The output of the SVM classifier using only linguistic features used for sarcasm detection by Joshi et al. (2015) and the output of the MILR classifier with the complete set of features are compared, setting threshold α = 0.05. There was a significant difference in the classifier's accuracy with p(two-tailed) = 0.02 with an odds-ratio of 1.43, showing that the classification accuracy improvement is unlikely to be observed by chance in 95% confidence interval.

Considering Reading Time as a Cognitive
Feature along with Sarcasm Features One may argue that, considering simple measures of reading effort like "reading time" as cognitive feature instead of the expensive eye-tracking features for sarcasm detection may be a cost-effective solution. To examine this, we repeated our experiments with "reading time" considered as the only cognitive feature, augmented with the textual features. The F-scores of all the classifiers turn out to be close to that of the classifiers considering sarcasm feature alone and the difference in the improvement is not statistically significant (p > 0.05). One the other hand, F-scores with gaze features are superior to the F-scores when reading time is considered as a cognitive feature.

How Effective are the Cognitive Features
We examine the effectiveness of cognitive features on the classification accuracy by varying the input training data size. To examine this, we create a stratified (keeping the class ratio constant) random train-test split of 80%:20%. We train our classifier with 100%, 90%, 80% and 70% of the training data with our whole feature set, and the feature combination from Joshi et al. (2015). The goodness of our system is demonstrated by improvements in F-score and Kappa statistics, shown in Figure 3. We further analyze the importance of features by ranking the features based on (a) Chi squared test, and (b) Information Gain test, using Weka's attribute selection module. Figure 4 shows the top 20 ranked features produced by both the tests. For both the cases, we observe 16 out of top 20 features to be gaze features. Further, in each of the cases, Average Fixation Duration per Word and Largest Regression Position are seen to be the two most significant features. Table 4 shows a few example cases from the experiment with stratified 80%-20% train-test split.

Example Cases
• Example sentence 1 is sarcastic, and requires extra-linguistic knowledge (about poor living conditions at Manchester). Hence, the sarcasm detector relying only on textual features is unable to detect the underlying incongruity. However, our system predicts the label successfully, possibly helped by the gaze features.
• Similarly, for sentence 2, the false sense of presence of incongruity (due to phrases like "Helped me" and "Can't stop") affects the system with only linguistic features. Our system, though, performs well in this case also.
• Sentence 3 presents a false-negative case where it was hard for even humans to get the sarcasm. This is why our gaze features (and subsequently the complete set of features) account for erroneous prediction.
• In sentence 4, gaze features alone falseindicate presence of incongruity, whereas the system predicts correctly when gaze and linguistic features are taken together.
From these examples, it can be inferred that, only gaze features would not have sufficed to rule out the possibility of detecting other forms of incongruity that do not result in sarcasm.

Error Analysis
Errors committed by our system arise from multiple factors, starting from limitations of the eyetracker hardware to errors committed by linguistic tools and resources. Also, aggregating various eye-tracking parameters to extract the cognitive features may have caused information loss in the regular classification setting.

Conclusion
In the current work, we created a novel framework to detect sarcasm, that derives insights from human cognition, that manifests over eye movement patterns. We hypothesized that distinctive eye-movement patterns, associated with reading sarcastic text, enables improved detection of sarcasm. We augmented traditional linguistic features with cognitive features obtained from readers' eye-movement data in the form of simple gaze-based features and complex features derived from a graph structure. This extended feature-set improved the success rate of the sarcasm detector by 3.7%, over the best available system. Using cognitive features in an NLP Processing system like ours is the first proposal of its kind.
Our general approach may be useful in other NLP sub-areas like sentiment and emotion analysis, text summarization and question answering, where considering textual clues alone does not prove to be sufficient. We propose to augment this work in future by exploring deeper graph and gaze features. We also propose to develop models for the purpose of learning complex gaze feature representation, that accounts for the power of individual eye movement patterns along with the aggregated patterns of eye movements.