Harnessing Sequence Labeling for Sarcasm Detection in Dialogue from TV Series ‘Friends’

This paper is a novel study that views sarcasm detection in dialogue as a sequence labeling task, where a dialogue is made up of a sequence of utterances. We create a manually-labeled dataset of dialogue from the TV series 'Friends' annotated with sarcasm. Our goal is to predict sarcasm in each utterance, using the sequential nature of a scene. We show a performance gain using sequence labeling as compared to classification-based approaches. Our experiments are based on three sets of features: one is derived from information in our dataset, and the other two are from past works. Two sequence labeling algorithms (SVM-HMM and SEARN) outperform three classification algorithms (two configurations of SVM, and Naive Bayes) for all these feature sets, with an increase in F-score of around 4%. Our observations highlight the viability of sequence labeling techniques for sarcasm detection of dialogue.


Introduction
Sarcasm is defined as 'the use of irony to mock or convey contempt'. An example of a sarcastic sentence is 'Being stranded in traffic is the best way to start the week', where the positive word 'best' together with the undesirable situation 'being stranded in traffic' conveys the sarcasm. Because sarcasm has an implied sentiment (negative) that differs from the surface sentiment (positive, due to the presence of 'best'), it poses a challenge to sentiment analysis systems that aim to determine the polarity of text (Pang and Lee, 2008).
Some sarcastic expressions may be more difficult to detect. Consider the possibly sarcastic statement 'I absolutely love this restaurant'. Unlike in the traffic example above, sarcasm in this sentence, if any, can be understood only using context that is 'external' to the sentence, i.e., beyond common world knowledge. This external context may be available in the conversation that this sentence is a part of. For example, the conversational context may be situational: the speaker discovers a fly in her soup, then looks at her date and says, 'I absolutely love this restaurant'. The conversational context may also be verbal: her date says, 'They've taken 40 minutes to bring our appetizers', to which the speaker responds, 'I absolutely love this restaurant'.
Both these examples point to the intuition that for dialogue (i.e., data where more than one speaker participates in a discourse), conversational context is often a clue for sarcasm.
For such dialogue, prior work in sarcasm detection (determining whether a text is sarcastic or not) captures context in the form of classifier features such as the topic's probability of evoking sarcasm, or the author's tendency to use sarcasm (Rajadesingan et al., 2015;Wallace, 2015). In this paper, we present an alternative hypothesis: sarcasm detection of dialogue is better formulated as a sequence labeling task, instead of classification task.
The central message of our work is the efficacy of sequence labeling as a learning mechanism for sarcasm detection in dialogue, and not the set of features that we propose, although we experiment with three feature sets. For our experiments, we create a manually labeled dataset of dialogues from the TV series 'Friends'. Each dialogue is considered to be a sequence of utterances, and every utterance is annotated as sarcastic or non-sarcastic (details in Section 3). It may be argued that a TV series episode is dramatized and hence does not reflect real-world conversations. However, although the script of 'Friends' is dramatized to suit the situational comedy genre, this takes away little from its relevance to real-life conversations, except for the volume of sarcastic sentences. Therefore, our findings can, in theory, be extended to real-life utterances. Also, datasets that are not based on real-world conversations have been used in prior work: emotion detection in children's stories by Zhang Z (2014), and speech transcripts of an MTV show in Rakov and Rosenberg (2013). As a first step in the direction of using sequence labeling, our dataset is a good 'controlled experiment' environment (the details are discussed in Section 2). In fact, the use of a dataset from a new genre (TV series transcripts, specifically) has potential for future work in sarcasm detection. Our dataset, without the actual dialogues from the show (owing to copyright restrictions), may be available on request.

[Figure 1: Illustration of our hypothesis for sarcasm detection of conversational text (such as dialogue); A, B, C, D indicate four utterances]
Based on information available in our dataset (names of speakers, etc.), we present new features. We then compare two sequence labelers (SEARN and SVM-HMM) with three classifiers (SVM with oversampled data, SVM with undersampled data, and Naïve Bayes), for this set of features and also for features from two prior works. For our novel features as well as for features reported in prior work, sequence labeling algorithms outperform classification algorithms: there is an improvement of 3-4% in F-score when sequence labelers are used, as compared to classifiers, for sarcasm detection on our dialogue dataset. Since many dialogue datasets (tweet conversations, chat transcripts, etc.) are currently available, our findings will be useful to future work that leverages such conversational context.
The rest of the paper is organized as follows. Section 2 motivates the approach and presents our hypothesis. Section 3 describes our dataset, while Section 4 presents the features we use (three feature sets: novel features based on our dataset, and features from two past works). The experiment setup is described in Section 5, and results are given in Section 6. We present a discussion on which types of sarcasm are handled better by sequence labeling, along with an error analysis, in Section 7, and describe related work in Section 8. Finally, we conclude in Section 9.

Motivation & Hypothesis
In dialogue, multiple participants take turns to speak. Consider the following snippet from 'Friends' involving two of the lead characters, Ross and Chandler.
[Chandler is at the table. Ross walks in, looking very tanned.]
Chandler: Hold on! There is something different.
Ross: I went to that tanning place your wife suggested.
Chandler: Was that place... The Sun?
Chandler's statement 'Was that place... The Sun?' is sarcastic. The sarcasm can be understood based on two kinds of contextual information: (a) general knowledge (that the sun is indeed hot), and (b) conversational context (in the previous utterance, Ross states that he went to a tanning place). Without (b), the sarcasm cannot be understood. Thus, dialogue presents a peculiar opportunity: using the sequential nature of text for the task at hand.
We hypothesize that 'for sarcasm detection of dialogue, sequence labeling performs better than classification'. To validate our hypothesis, we consider two feature configurations: (a) novel features designed for our dataset, and (b) features as given in two prior works. To further understand where exactly sequence labeling techniques do better, we also present a discussion of which linguistic types of sarcasm benefit the most from sequence labeling in place of classification. Figure 1 summarizes the scope of this paper. We consider two formulations for sarcasm detection of conversational text. In the first (classification), a sequence is broken down into individual instances; an instance given as input to a classification algorithm returns an output for that instance alone. In the second (sequence labeling), a sequence given as input to a sequence labeling algorithm returns a sequence of labels as output. In the rest of the paper, we use the following terms: 1. Utterance: A single unit of dialogue spoken by one character at a turn. 2. Scene/Sequence: A scene is a sequence of utterances, in which different speakers take turns to speak. We use the terms 'scene' and 'sequence' interchangeably.
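To make the contrast concrete, the following minimal Python sketch packages the same scene under the two formulations. The feature dictionaries and labels are purely illustrative, not our actual representation.

```python
# A scene is a sequence of (feature_vector, label) pairs;
# labels: 1 = sarcastic, 0 = non-sarcastic. Illustrative values only.
scene = [
    ({"w:hold": 1, "speaker:Chandler": 1}, 0),   # "Hold on! There is something different."
    ({"w:tanning": 1, "speaker:Ross": 1},  0),   # "I went to that tanning place ..."
    ({"w:sun": 1, "speaker:Chandler": 1},  1),   # "Was that place... The Sun?"
]

# Formulation 1 (classification): utterances become independent instances.
X_clf = [feats for feats, _ in scene]
y_clf = [label for _, label in scene]

# Formulation 2 (sequence labeling): the whole scene is one example, so the
# learner can exploit dependencies between neighbouring labels.
X_seq = [[feats for feats, _ in scene]]
y_seq = [[label for _, label in scene]]
```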

'Friends' Dataset
Datasets based on literary/creative works have been explored in the past. One such example is emotion classification of children's stories by Zhang Z (2014). Similarly, we create a sarcasm-labeled dataset that consists of transcripts of a comedy TV show, 'Friends' (by Bright/Kauffman/Crane Productions, and Warner Bros. Entertainment Inc.). We download these transcripts from OpenSubtitles, as given by Lison and Tiedemann (2016), with additional pre-processing from a fan-contributed website, http://www.friendstranscripts.tk. Each scene begins with a description of the location/situation, followed by a series of utterances spoken by characters. Figure 2 shows an illustration of our dataset; this is (obviously) a dummy example that has been anonymized. The reason behind choosing a TV show transcript as our dataset was to restrict ourselves to a small set of characters (so as to leverage speaker-specific features) who use a lot of humor. These characters are often sarcastic towards each other because of their inter-personal relationships. In fact, past linguistic studies also show that sarcasm is more common between familiar speakers, and often friends (Gibbs, 2000). A typical snippet is:
[Scene: Chandler and Monica's room. Chandler is packing when Ross knocks on the door and enters...]
Ross: Hey!
Chandler: Hey!
Ross: You guys ready to go?
Chandler: Not quite. Monica's still at the salon, and I'm just finishing packing.
Our annotators are linguists with experience of more than 8k hours of annotation, and are not authors of this paper. A complete scene is visible to the annotators at a time, so that they understand the complete context of the scene. They annotate every utterance in a scene with one of two labels: sarcastic or non-sarcastic. The two annotators separately perform this annotation over multiple sessions. To minimize bias beyond the scope of this annotation, we selected annotators who had never watched 'Friends' before this annotation task.

[Figure 2: Example from our Dataset: Part of a Scene]
The annotations may be available on request, subject to copyright restrictions. Every utterance is annotated with a label, while the description of a scene is not annotated.
The inter-annotator agreement for a subset of 105 scenes (around 1600 utterances) is 0.44. This is comparable with other manually annotated datasets in sarcasm detection. Table 1 shows the relevant statistics of the complete dataset (in addition to the 105 scenes mentioned above). There are 17338 utterances in 913 scenes. Out of these, 1888 utterances are labeled as sarcastic. The average length of a scene is 18.6 utterances. Table 2 shows additional statistics. Table 2(a) shows that Chandler is the character with the highest proportion of sarcastic utterances (22.24%). Table 2(b) shows that sarcastic utterances have a higher surface positive word score (1.55) than non-sarcastic (0.97) or overall utterances (1.03). This validates the past observation that sarcasm is often expressed through positive words (and sometimes contrasted with negation). Finally, Table 2(c) shows that sarcastic utterances also have a higher proportion of non-verbal indicators (action words) (28.23%) than non-sarcastic or overall utterances.
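Statistics of this kind can be computed along the following lines. This is only a sketch: the utterance fields (speaker, label, tokens) and the positive-word lexicon are hypothetical names, not our exact pipeline.

```python
from collections import Counter

def speaker_sarcasm_proportion(utterances):
    # Table 2(a)-style statistic: percentage of each speaker's utterances
    # that are labeled sarcastic.
    total, sarcastic = Counter(), Counter()
    for u in utterances:
        total[u["speaker"]] += 1
        if u["label"] == "sarcastic":
            sarcastic[u["speaker"]] += 1
    return {s: 100.0 * sarcastic[s] / total[s] for s in total}

def avg_positive_word_score(utterances, positive_lexicon):
    # Table 2(b)-style statistic: average count of positive-lexicon words
    # per utterance (the 'surface positive word score').
    scores = [sum(tok in positive_lexicon for tok in u["tokens"])
              for u in utterances]
    return sum(scores) / len(scores)
```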

Features
To ensure that our hypothesis is not dependent on the choice of features, we show our results for two configurations: (a) when dataset-derived features (i.e., novel features designed based on our dataset) are used, and (b) when features reported in prior work are used. We describe these in the following subsections.

Dataset-derived Features
We design our dataset-derived features based on information available in our dataset. An utterance consists of three parts:
1. Speaker: The name of the speaker is the first word of an utterance, and is followed by a colon. In case of the second utterance in Figure 2, the speaker is 'Ross', while in the third, the speaker is 'Chandler'.
2. Spoken words: The words spoken by the character form the body of the utterance.
3. Action words: Actions that a speaker performs while speaking the utterance are indicated in parentheses. These are useful clues that form additional context. Unlike the speaker and spoken words, action words may or may not be present. In the second utterance in Figure 2, there are no action words, while in the third utterance, 'action Chandler does while reading this' are the action words.
Based on this information, we design three categories of features (listed in Table 3):
1. Lexical Features: These are unigrams in the spoken words. We experimented with both count and boolean representations, and the results are comparable. We report values for the boolean representation.

2. Conversational Context Features:
In order to capture conversational context, we use three kinds of features. The first is action words: unigrams indicated within parentheses. The intuition is that a character 'raising her eyebrows' (an action) is different from saying "raising her eyebrows". The second is the sentiment score of the utterance: two values, a positive score and a negative score, which are the counts of positive and negative words present in the utterance. The third is the sentiment score of the previous utterance. This captures phenomena such as a negative remark from one character eliciting sarcasm from another, similar to the restaurant example in the introduction. Thus, for the third utterance in Figure 2, the sentiment score of Chandler's utterance forms the sentiment score feature, while that of Ross's utterance forms the sentiment score of the previous utterance.

3. Speaker Context Features:
We use the name of the speaker and the name of the speaker-listener pair as features. The listener is assumed to be the speaker of the previous utterance in the sequence. The speaker feature aims to capture the sarcastic nature of each character, while the speaker-listener feature aims to capture interpersonal interactions between characters. In the context of the third utterance in Figure 2, the speaker is 'Chandler' while the speaker-listener pair is 'Chandler-Ross'.
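A minimal sketch of these three feature groups is given below. The utterance fields and the sentiment lexicons are assumed names for illustration; this is not our exact implementation.

```python
def extract_features(utt, prev_utt, pos_lexicon, neg_lexicon):
    feats = {}
    # 1. Lexical features: boolean unigrams over the spoken words.
    for tok in set(utt["spoken_words"]):
        feats["w=" + tok] = 1
    # 2. Conversational context: action-word unigrams (text in parentheses) ...
    for tok in set(utt.get("action_words", [])):
        feats["act=" + tok] = 1
    # ... sentiment score of this utterance (positive/negative word counts) ...
    feats["pos"] = sum(t in pos_lexicon for t in utt["spoken_words"])
    feats["neg"] = sum(t in neg_lexicon for t in utt["spoken_words"])
    # ... and sentiment score of the previous utterance, if there is one.
    if prev_utt is not None:
        feats["prev_pos"] = sum(t in pos_lexicon for t in prev_utt["spoken_words"])
        feats["prev_neg"] = sum(t in neg_lexicon for t in prev_utt["spoken_words"])
        # 3. Speaker context: the listener is the previous utterance's speaker.
        feats["pair=" + utt["speaker"] + "-" + prev_utt["speaker"]] = 1
    feats["speaker=" + utt["speaker"]] = 1
    return feats
```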

Features from Prior Work
We also compare our results with features presented in two prior works: (1) features given in González-Ibánez et al. (2011), and (2) features given in Buschmeier et al. (2014). These two prior works were chosen based on what information was available in our dataset for the purpose of reimplementation. For example, approaches that use Twitter profile information or the follower/friend structure of Twitter cannot be computed for our dataset.

Experiment Setup
We experiment with three classification techniques and two sequence labeling techniques:
1. Classification Techniques: We use Naïve Bayes and SVM as classification techniques. We use the Naïve Bayes implementation provided in Scikit (Pedregosa et al., 2011). For SVM, we use SVM-Light (Joachims, 1999). Since SVM does not do well on datasets with a large class imbalance (Akbani et al., 2004), an effect we also observe, we use sampling to deal with this skew, as done in Kotsiantis et al. (2006). We experiment with two configurations (a sketch of both follows this list):
• SVM (Oversampled), i.e., SVM (O): Sarcastic utterances are duplicated to match the count of non-sarcastic utterances.
• SVM (Undersampled), i.e., SVM (U): Random non-sarcastic utterances are dropped to match the count of sarcastic utterances.
2. Sequence Labeling Techniques: We use SVM-HMM by Altun et al. (2003) and SEARN by Daumé III et al. (2009). SVM-HMM is a sequence labeling algorithm that combines Support Vector Machines and Hidden Markov Models. SEARN is a sequence labeling algorithm that integrates search and learning to solve prediction problems; the implementation of SEARN that we use relies on a perceptron as the base classifier. Daumé III et al. (2009) show that SEARN outperforms other sequence labeling techniques (such as CRF) for tasks like character recognition and named entity class identification.
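The two sampling configurations can be realized along the following lines. These random-sampling helpers are a simple sketch of the strategy, not our exact procedure.

```python
import random

def oversample(instances):
    # SVM (O): duplicate sarcastic utterances until the classes balance
    # (assumes sarcastic utterances are the minority class, as in our data).
    sarcastic = [x for x in instances if x["label"] == "sarcastic"]
    other = [x for x in instances if x["label"] != "sarcastic"]
    return instances + random.choices(sarcastic, k=len(other) - len(sarcastic))

def undersample(instances):
    # SVM (U): drop random non-sarcastic utterances to match the sarcastic count.
    sarcastic = [x for x in instances if x["label"] == "sarcastic"]
    other = [x for x in instances if x["label"] != "sarcastic"]
    return sarcastic + random.sample(other, k=len(sarcastic))
```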
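For the sequence labelers, each scene must be presented as one sequence. The sketch below writes utterances in an SVM-light-style format with a per-sequence qid, which is the convention SVM-HMM follows; this is an assumption-laden sketch of the data preparation, and the SVM-HMM documentation should be consulted for the exact details.

```python
def write_svmhmm(scenes, path):
    # scenes: list of scenes; each utterance is (feature_dict, label),
    # with feature names already mapped to positive integer ids.
    with open(path, "w") as f:
        for qid, scene in enumerate(scenes, start=1):
            for feats, label in scene:
                tag = 2 if label == "sarcastic" else 1   # labels must be 1..k
                body = " ".join("%d:%g" % (fid, val)
                                for fid, val in sorted(feats.items()))
                f.write("%d qid:%d %s\n" % (tag, qid, body))
```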
Thus, we wish to validate our hypothesis in the case of: (a) our novel dataset-derived features, and (b) two feature sets reported in prior work. We report weighted average values of precision, recall and F-score computed using five-fold cross-validation for all experiments, and class-wise precision, recall and F-score wherever necessary. The folds are created on the basis of sequences and not utterances; this means that a sequence does not get split across different folds.
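One way to realize such scene-level folds with scikit-learn is GroupKFold, where all utterances from the same scene share a group id. This is only a sketch of the idea, not necessarily the mechanism we used; X, y and scene_ids are assumed variables.

```python
from sklearn.model_selection import GroupKFold

# X: utterance feature vectors; y: labels; scene_ids: parallel list of
# scene identifiers. GroupKFold keeps each scene within a single fold.
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=scene_ids):
    ...  # train on train_idx, evaluate on test_idx
```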

Results
Section 6.1 describes the performance of models that use dataset-derived features (as given in Section 4.1), while Section 6.2 does so for features from prior work (as given in Section 4.2).


Performance on Dataset-derived Features

Table 4 compares the performance of the two formulations, classification and sequence labeling, for our dataset-derived features. When classification techniques are used, we obtain the best F-score of 79.8% with SVM (O). However, when sequence labeling techniques are used, the best F-score is 84.2%. Table 5 shows class-wise precision/recall values for these techniques. The best precision for the sarcastic class is obtained with SVM-HMM, i.e., 35.8%. The best F-score for the sarcastic class is obtained with SVM (O) (29%), whereas that for the non-sarcastic class is obtained with SVM-HMM (93.6%). Tables 4 and 5 show that it is due to higher recall that sequence labeling techniques perform better than classification techniques.
It may be argued that the benefit in the case of sequence labeling is due to our features, and is not a benefit of the sequence labeling formulation itself. Hence, we ran all five techniques with all possible combinations of features (a sketch of this search follows). Table 6 shows the best performance obtained by each technique, and the corresponding (best) feature combination. The table can be read as follows: SVM (O) obtains an F-score of 81.2% when spoken words, speaker, speaker-listener and sentiment score are used as features. The table shows that even if we consider the best performance of each technique (with different feature sets), the classifiers are not able to perform as well as sequence labeling: the best sequence labeling algorithm (SVM-HMM) gives an F-score of 84.4%, while the best classifier (SVM (O)) has an F-score of 81.2%. We emphasize that both SVM-HMM and SEARN have higher recall values than the three classification techniques.
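The exhaustive search over feature combinations can be sketched as follows; feature_groups and evaluate are hypothetical placeholders for our feature sets and the cross-validated scoring routine.

```python
from itertools import combinations

feature_groups = ["spoken_words", "action_words", "speaker",
                  "speaker_listener", "sentiment", "prev_sentiment"]

best_score, best_combo = float("-inf"), None
for r in range(1, len(feature_groups) + 1):
    for combo in combinations(feature_groups, r):
        score = evaluate(combo)      # weighted F-score via five-fold CV
        if score > best_score:
            best_score, best_combo = score, combo
```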
These findings show that for our novel set of dataset-derived features, sequence labeling works better than classification.

Performance on Features Reported in Prior Work
We now present our evaluation on the two feature sets reported in prior work: those given by González-Ibánez et al. (2011) and by Buschmeier et al. (2014). Table 7 compares classification techniques with sequence labeling techniques for features given in González-Ibánez et al. (2011), while Table 8 shows the corresponding values for features given in Buschmeier et al. (2014). For features by González-Ibánez et al. (2011), SVM (O) gives the best F-score among classification techniques (79%), whereas SVM-HMM shows an improvement of 4% over that value. Recall increases by 11.8% when sequence labeling techniques are used instead of classification.
In the case of features by Buschmeier et al. (2014), the improvement in performance achieved by using sequence labeling as against classification is 2.8%. The best recall for classification techniques is 77.8% (for SVM (O)). In this case as well, recall increases by 10% for sequence labeling.
These findings show that for two feature sets reported in prior work, sequence labeling works better than classification.

Discussion
In the previous sections, we showed quantitatively that sequence labeling techniques perform better than classification techniques. In this section, we delve into the question: 'What does this improved performance mean, in terms of the forms of sarcasm that sequence labeling techniques handle better than classification?' To understand the implication of using sequence labeling, we randomly select 100 examples that were correctly labeled by sequence labeling techniques but incorrectly labeled by classification techniques. Our annotators manually annotated each of them with one of the four categories of sarcasm given in Camp (2012). Table 9 shows the proportion of these utterances. Like-prefixed and illocutionary sarcasm are the types that require context for understanding sarcasm. We observe that around 71% of our examples belong to these two types. This means that our intuition that sequence labeling will better capture conversational context is reflected in the forms of sarcasm for which sequence labeling improves over classification.
On the other hand, examples where our system makes errors can be grouped as: • Topic Drift: Eisterhold et al. (2006) state that topic change/drift is a peculiarity of sarcasm. For example, when Phoebe gets irritated with another character talking for a long time, she says, "See? Vegetarianism benefits everyone". This was misclassified by our system.
• Short expressions: Short expressions occurring in the context of a conversation may express sarcasm. Expressions such as "Oh God, is it" and "Me too" were misclassified as non-sarcastic. However, in the context of the scene, these were sarcastic utterances.
• Dry humor: In the context of a conversation, sarcasm may be expressed in response to a long, serious description. Our system was unable to capture such sarcasm in some cases. For example, when a character gives a long description of the advantages of a particular piece of clothing, Chandler asks sarcastically, "Are you aware that you're still talking?".
• Implications in popular culture: The utterance "Ok, I smell smoke. Maybe that's cause someone's pants are on fire" was misclassified by our system. The popular saying 'Liar, liar, pants on fire' was the context that was missing in our case.
• Background knowledge: When a petite girl walks in, Rachel says, "She is so cute! You could fit her right in your little pocket". Such utterances were misclassified because the background knowledge required to understand the sarcasm is not available in the conversational context.
• Long-range connection: In comedy shows like Friends, humor is often created by introducing a concept in the initial part and then repeating it as an impactful, sarcastic remark. For example, in beginning of an episode, Ross says that he has never grabbed a spoon before -and at the end of the episode, he says with a sarcastic tone "I grabbed a spoon".
• Incongruity with situation in the scenes: Utterances that were incongruent with non-verbal situations could not be adequately identified. For example, Ross enters an office wearing a piece of steel bandaged to his nose. In response, the receptionist says, "Oh, that's attractive".
• Sarcasm as a part of a longer sentence: In several utterances, sarcasm is a subset of a longer sentence, and hence, the non-sarcastic portion may dominate the rest of the sentence.
These errors point to future directions in which sequence labeling algorithms may be optimized to improve their impact on sarcasm detection.

Table 9: Proportion of utterances of different types of sarcasm that were correctly labeled by sequence labeling but incorrectly labeled by classification techniques

Related Work
Sarcasm detection approaches using different features have been reported (Tepperman et al., 2006; Kreuz and Caucci, 2007; Veale and Hao, 2010; González-Ibánez et al., 2011; Reyes et al., 2012; Buschmeier et al., 2014). However, Wallace et al. (2014) show how context beyond the target text (i.e., extra-textual context) is necessary for humans as well as machines in order to identify sarcasm. Following this, the new trend in sarcasm detection is to explore the use of such extra-textual context (Khattri et al., 2015; Rajadesingan et al., 2015; Bamman and Smith, 2015; Wallace, 2015). Wallace (2015) uses metadata from Reddit (www.reddit.com) to predict sarcasm in a Reddit comment. Rajadesingan et al. (2015) present a suite of classifier features that capture different kinds of context: context related to the author, the conversation, etc. The work closest to ours is by Wang et al. (2015). They use a labeled dataset of 1500 tweets, the labels for which are obtained automatically, and show that a sequence labeling algorithm works well to detect sarcasm of a tweet with a pseudo-sequence generated using additional context. Our work substantially differs from theirs: (a) They do not deal with dialogue. (b) Their goal is to predict sarcasm of a single tweet, using a series of past tweets as context, i.e., they obtain a prediction only for the last element in the sequence, with no consideration of the other elements; our goal is to predict sarcasm in every element of the sequence, a more rigorous task (note that the two differ in the way precision/recall values are computed). (c) Their 'gold' standard dataset is annotated by an automatic classifier, whereas every textual unit (utterance) in our gold-standard dataset is manually labeled, making our dataset, and hence our findings, more reliable. (d) They consider three types of sequences: conversational, historical and topic-based. Historical context is a series of tweets by the same author, while topic-based context is a series of tweets containing a hashtag present in the tweet to be classified. We do not use these two because they do not suit our dataset.
Several approaches for sequence labeling in sentiment classification have been studied. Zhao et al. (2008) perform sentiment classification using conditional random fields. Zhang Z (2014) deal with emotion classification: using a dataset of children's stories manually annotated at the sentence level, they employ an HMM to identify sequential structure and a classifier to predict the emotion in a particular sentence. Mao and Lebanon (2006) present an isotonic CRF that predicts global and local sentiment of documents, with additional mechanisms for author-specific distributions and smoothing of sentiment curves. Yessenalina et al. (2010) present a joint learning algorithm for sentence-level subjectivity labeling and document-level sentiment labeling. Choi and Cardie (2010) use sequence learning to jointly identify the scope of opinion polarity expressions and their polarity labels; they experiment with the MPQA corpus, which is labeled at the sentence level for polarity as well as intensity. Specialized sequence labeling techniques like these are the next step to our first step: showing whether sequence labeling techniques are helpful at all for sarcasm detection of dialogue.

Conclusion & Future Work
We explored how sequence labeling can be used for sarcasm detection of dialogue. We formulated sarcasm detection of dialogue as a task of labeling each utterance in a sequence, with one among two labels: sarcastic and non-sarcastic. For our experiments, we created a manually annotated dataset of transcripts from a popular TV show 'Friends'. Our dataset consisted of 913 scenes where every utterance was annotated as sarcastic or not.
We experiment with: (a) a novel set of features derived from our dataset, and (b) sets of features from two prior works. Our dataset-derived features are: (a) lexical features, (b) conversational context features, and (c) speaker context features. Using these features, we compared two classes of learning techniques: classifiers (SVM (undersampled), SVM (oversampled) and Naïve Bayes) and sequence labeling techniques (SVM-HMM and SEARN). For our classifiers, the best F-score was obtained with SVM (O) (79.8%), while the best F-score for sequence labeling techniques was obtained using SVM-HMM (84.2%). Even for the best combinations of our features for each algorithm, both sequence labeling techniques outperformed the classifiers. In addition, we also experimented with features introduced in two prior works. We observed an improvement of 2.8% for features in Buschmeier et al. (2014) and 4% for features in González-Ibánez et al. (2011) when sequence labeling techniques were used as against classifiers. In all cases, sequence labeling techniques had substantially higher recall than classification techniques (10% in the case of Buschmeier et al. (2014), 12% in the case of González-Ibánez et al. (2011)). To understand which forms of sarcasm get correctly labeled by sequence labeling (and not by classification), we manually evaluated 100 examples; 71% of these consisted of sarcasm that could be understood only with conversational context. Our error analysis points to interesting future work for sarcasm detection of dialogue, such as long-range connections, lack of conversational clues, and sarcasm as part of longer utterances.
Thus, we observe that for sarcasm detection on our dataset, across different feature configurations, sequence labeling performs better than classification. Our observations establish the efficacy of sequence labeling techniques for sarcasm detection of dialogue.
Repeating these experiments for other forms of dialogue (such as Twitter conversations, chat transcripts, etc.) is an imperative line of future work. A combination of unified sarcasm and emotion detection using sequence labeling is another promising direction. It would also be interesting to see whether deep learning-based models that perform sequence labeling outperform those that perform classification.