Predicting Audience’s Laughter During Presentations Using Convolutional Neural Network

Public speaking plays an important role in schools and workplaces, and the proper use of humor contributes to effective presentations. To automatically evaluate speakers’ humor usage, we build a presentation corpus containing humorous utterances based on TED talks. Compared to previous data resources supporting humor recognition research, ours has several advantages: (a) both positive and negative instances come from a homogeneous data set, (b) it contains a large number of speakers, and (c) it is open. Focusing on lexical cues for humor recognition, we systematically compare a newly emerging text classification method based on Convolutional Neural Networks (CNNs) with a well-established conventional method using linguistic knowledge. The CNN method has two advantages: it achieves higher detection accuracy and it learns essential features automatically.


Introduction
The ability to make effective presentations has been found to be linked with success at school and in the workplace (Hill and Storey, 2003; Stevens, 2005). Humor plays an important role in successful public speaking, e.g., helping to reduce public speaking anxiety, which is often regarded as the most prevalent type of social phobia, generating shared amusement to boost persuasive power, and serving as a means to attract attention and reduce tension (Xu, 2016).
Automatically simulating an audience's reactions to humor will not only be useful for presentation training, but also improve conversational systems by giving machines more empathetic power. The present study reports our efforts in recognizing utterances that cause laughter in presentations. These include building a corpus from TED talks and using Convolutional Neural Networks (CNNs) in the recognition.
The remainder of the paper is organized as follows: Section 2 briefly reviews the previous related research; Section 3 describes the corpus we collected from TED talks; Section 4 describes the text classification methods; Section 5 reports on our experiments; finally, Section 6 discusses the findings of our study and plans for future work.

Previous Research
Humor recognition refers to the task of deciding whether a sentence or spoken utterance expresses a certain degree of humor. In most of the previous studies (Mihalcea and Strapparava, 2005; Purandare and Litman, 2006; Yang et al., 2015), humor recognition was modeled as a binary classification task.
In the seminal work (Mihalcea and Strapparava, 2005), a corpus of 16,000 "one-liners" was created using daily joke websites to collect humorous instances while using formal writing resources (e.g., news titles) to obtain non-humorous instances. Three humor-specific stylistic features, including alliteration, antonymy, and adult slang were utilized together with content-based features to build classifiers. In a recent work (Yang et al., 2015), a new corpus was constructed from the Pun of the Day website. Yang et al. (2015) explained and computed stylistic features based on the following four aspects: (a) Incongruity, (b) Ambiguity, (c) Interpersonal Effect, and (d) Phonetic Style. In addition, Word2Vec (Mikolov et al., 2013) distributed representations were utilized in the model building.
Beyond lexical cues from text inputs, other research has also utilized speakers' acoustic cues (Purandare and Litman, 2006; Bertero and Fung, 2016b). These studies have typically used audio tracks from TV shows and their corresponding captions in order to categorize characters' speaking turns as humorous or non-humorous based on canned laughter.
Convolutional Neural Networks (CNNs) have recently been successfully used in several text categorization tasks (e.g., review rating, sentiment recognition, and question type recognition). Kim (2014), Johnson and Zhang (2015), and Zhang and Wallace (2015) suggested that a simple CNN setup, which entails one layer of convolution on top of word embedding vectors, achieves excellent results on multiple tasks. Deep learning has recently been applied to computational humor research (Bertero and Fung, 2016a,b). In Bertero and Fung (2016b), a CNN was found to be the best model that uses both acoustic and lexical cues for humor recognition. However, it did not outperform the Logistic Regression (LR) model when using text inputs exclusively. Beyond treating humor detection as a binary classification task, Bertero and Fung (2016a) formulated the recognition as a sequential labeling task and utilized Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997) on top of CNN models (serving as feature extractors) to utilize context information among utterances.
From the brief review, it is clear that there is a great need for an open corpus that can support investigating humor in presentations. CNN-based text categorization methods have been applied to humor recognition (e.g., in Bertero and Fung (2016b)) but with limitations: (a) a rigorous comparison with the state-of-the-art conventional method examined in Yang et al. (2015) is missing; (b) CNN's performance in the previous research is not quite clear; and (c) some important techniques that can improve CNN performance (e.g., using varied-sized filters and dropout regularization (Hinton et al., 2012)) were not applied. The present study is meant to address these limitations.

TED Talk Data
TED Talks are recordings from TED conferences and other special TED programs. Many factors in a presentation can make an audience laugh, such as the spoken content and the presenter's nonverbal behavior. In the present study, we focused on the transcripts of the talks. Most transcripts contain the markup '(Laughter)', which marks where the audience laughed aloud during the talk. This special markup was used to determine utterance labels.
We collected 1,192 TED Talk transcripts. An example transcription is given in Figure 1. The collected transcripts were split into sentences using the Stanford CoreNLP tool (Manning et al., 2014). In this study, sentences containing or immediately followed by '(Laughter)' were used as 'Laughter' sentences, as shown in Figure 1; all other sentences were defined as 'No-Laughter' sentences. Following Mihalcea and Strapparava (2005) and Yang et al. (2015), we selected equal numbers (n = 4,726 each) of 'Laughter' and 'No-Laughter' sentences. To minimize possible topic shifts between positive and negative instances, for each positive instance, we randomly picked one negative instance nearby (the context window was 7 sentences in this study). For example, in Figure 1, a negative instance (corresponding to 'sent-2') was selected from the nearby sentences ranging from 'sent-7' to 'sent+7'. More details about this data set can be found in Lee et al. (2016). The TED data set can be obtained by contacting the authors.
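The labeling and negative-sampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, a stand-alone '(Laughter)' token is assumed to mark the preceding sentence as positive, and negatives are assumed to be drawn without replacement, which the text does not specify.

```python
import random

LAUGHTER_TAG = "(Laughter)"


def label_sentences(sentences):
    """Strip the markup and return (clean_sentences, labels).

    A label is True when the sentence contained the markup or was
    immediately followed by a stand-alone markup token (assumption).
    """
    cleaned, labels = [], []
    for sent in sentences:
        has_tag = LAUGHTER_TAG in sent
        text = " ".join(sent.replace(LAUGHTER_TAG, " ").split())
        if not text:
            # Nothing left but markup: credit the previous sentence.
            if has_tag and labels:
                labels[-1] = True
            continue
        cleaned.append(text)
        labels.append(has_tag)
    return cleaned, labels


def sample_negatives(labels, window=7, seed=0):
    """Pair each positive index with one random negative index nearby."""
    rng = random.Random(seed)
    used, pairs = set(), []
    for i, is_positive in enumerate(labels):
        if not is_positive:
            continue
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        candidates = [j for j in range(lo, hi)
                      if not labels[j] and j not in used]
        if candidates:
            j = rng.choice(candidates)
            used.add(j)  # assumption: each negative is used at most once
            pairs.append((i, j))
    return pairs


talk = ["A.", "B.", "C. (Laughter)", "D.", "(Laughter)", "E."]
cleaned, labels = label_sentences(talk)
pairs = sample_negatives(labels, window=1)
```

Sampling the negative from the same neighborhood as the positive keeps both instances on roughly the same topic, which is the stated motivation for the context window.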

sent-7: . . .
sent-2 (No-Laughter): He has no memory of the past, no knowledge of the future, and he only cares about two things: easy and fun.
sent-1: Now, in the animal world, that works fine.
Laughter: If you're a dog and you spend your whole life doing nothing other than easy and fun things, you're a huge success! (Laughter)
sent+1: And to the Monkey, humans are just another animal species.
sent+7: . . .
Figure 1: An excerpt from TED talk "Tim Urban: Inside the mind of a master procrastinator" (http://bit.ly/2l1P3RJ)

Conventional Model
Following Yang et al. (2015), we applied Random Forest (Breiman, 2001) to perform humor recognition by using the following two groups of features. The first group are humor-specific stylistic features covering the following 4 categories: Incongruity (2), Ambiguity (6), Interpersonal Effect (4), and Phonetic Pattern (4). The second group are semantic distance features, including the humor label classes from 5 sentences in the training set that are closest to the sentence being evaluated (found by using a k-Nearest Neighbors (kNN) method), and each sentence's averaged Word2Vec representations (n = 300). More details can be found in Yang et al. (2015).
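To make the semantic distance features of the conventional model more concrete, the sketch below averages word vectors per sentence and collects the labels of the k closest training sentences. It is only an illustration under assumptions: cosine similarity is assumed as the closeness measure (the text does not name one), the toy 2-dimensional embeddings stand in for 300-dimensional Word2Vec vectors, and all function names are hypothetical.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens in a sentence."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_label_features(query_vec, train_vecs, train_labels, k=5):
    """Return the labels of the k training sentences closest to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    return [train_labels[i] for i in ranked[:k]]


# Toy embeddings (hypothetical values, 2 dimensions instead of 300).
emb = {"dog": [1.0, 0.0], "cat": [0.9, 0.1],
       "tax": [0.0, 1.0], "law": [0.1, 0.9]}
train_sents = [["dog"], ["cat"], ["tax"], ["law"]]
train_labels = [1, 1, 0, 0]
train_vecs = [average_vector(s, emb, 2) for s in train_sents]

query = average_vector(["dog", "cat"], emb, 2)
neighbor_labels = knn_label_features(query, train_vecs, train_labels, k=3)
```

In a full pipeline, the neighbor labels and the averaged vector would be concatenated with the stylistic features before being passed to the Random Forest classifier.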

CNN model
The setup of our CNN-based text classifier follows Kim (2014). Figure 2 depicts the model's details. From the input texts on the left to the prediction labels on the right, tensors of different shapes flow through the network, which solves the classification task in an end-to-end mode.
First, tokenized text strings were converted to a 2D tensor with shape (L × d), where L represents the sentences' maximum length and d represents the word-embedding dimension. In this study, we utilized the Word2Vec (Mikolov et al., 2013) embedding vectors (d = 300) that were trained on 100 billion words of Google News. Next, the embedding matrix was fed into a 1D convolution network with multiple filters. To cover varied receptive fields, we used filters of sizes f_w − 1, f_w, and f_w + 1. For each filter size, n_f filters were utilized. Then, max pooling, which finds the largest value in a vector, was applied to each feature map (3 × n_f feature maps in total) output by the 1D convolution. Finally, the maximum values from all 3 × n_f filters were concatenated into a flattened vector and passed through a fully connected (FC) layer to predict two possible labels (Laughter vs. No-Laughter). Note that we applied dropout regularization (Hinton et al., 2012) to the inputs of both the 1D convolution and the FC layer; dropout randomly sets a proportion of the units feeding a layer to zero during model training in order to reduce over-fitting. With cross-entropy as the loss function, the whole sequential network (all weights and biases) can be optimized by any stochastic gradient descent (SGD) variant, e.g., Adam (Kingma and Ba, 2014), Adadelta (Zeiler, 2012), and so on.
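The convolution and pooling stages described above can be illustrated with a small pure-Python forward pass. This is a sketch rather than the model used in the study: it assumes unpadded ("valid") convolution and a ReLU activation, uses random filters with toy dimensions, and omits the dropout and FC layers; the function names are hypothetical.

```python
import random


def conv1d_max_pool(sentence, filt, bias=0.0):
    """Apply one filter of width len(filt) to an (L x d) sentence matrix,
    then ReLU and max-over-time pooling, yielding a single value."""
    width = len(filt)
    best = 0.0  # the maximum over ReLU outputs is never below zero
    for start in range(len(sentence) - width + 1):
        act = bias
        for offset in range(width):
            word_vec = sentence[start + offset]
            act += sum(x * w for x, w in zip(word_vec, filt[offset]))
        best = max(best, act)
    return best


def make_filter_bank(fw, nf, d, rng):
    """Create nf random filters for each of the widths fw-1, fw, fw+1."""
    return [
        [[[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(width)]
         for _ in range(nf)]
        for width in (fw - 1, fw, fw + 1)
    ]


def extract_features(sentence, filter_bank):
    """Concatenate the pooled value of every filter into one flat vector."""
    return [conv1d_max_pool(sentence, filt)
            for filters in filter_bank for filt in filters]


rng = random.Random(0)
L, d, fw, nf = 10, 4, 3, 2  # toy sizes; the study uses d = 300
sentence = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(L)]
features = extract_features(sentence, make_filter_bank(fw, nf, d, rng))
print(len(features))  # 3 * nf pooled values, one per filter
```

Because pooling keeps one value per filter, the feature vector has a fixed length of 3 × n_f regardless of sentence length, which is what allows it to feed a fixed-size FC layer.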

Experiments
We used two corpora: the TED Talk corpus (denoted as TED) and the Pun of the Day corpus (denoted as Pun). Note that we normalized words in the Pun data to lowercase to avoid a possibly elevated result caused by a special pattern: in the original format, all negative instances started with capital letters. The Pun data allows us to verify that our implementation of the conventional model is consistent with the work reported in Yang et al. (2015).
In our experiment, we first divided each corpus into two parts. The smaller part (the Dev set) was used for setting various hyper-parameters used in text classifiers. The larger part (the CV set) was then used in a 10-fold cross-validation setup to obtain a stable and comprehensive model evaluation result. For the Pun data, the Dev set contains 482 sentences, while the CV set contains 4344 sentences. For the TED data, the Dev set contains 1046 utterances, while the CV set contains 8406 utterances. Note that, with the goal of building a speaker-independent humor detector, when partitioning our TED data set, we always kept all utterances of a single talk within the same partition.
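One simple way to keep all utterances of a talk in the same partition is to assign folds at the talk level and let each utterance inherit the fold of its talk. The sketch below assumes a round-robin assignment over shuffled talk identifiers; the function name and the toy identifiers are hypothetical, and the study's actual partitioning procedure may differ.

```python
import random


def talk_level_folds(talk_ids, n_folds=10, seed=0):
    """Return a fold index per utterance so that a talk never spans folds.

    talk_ids[i] is the identifier of the talk that utterance i came from.
    """
    talks = sorted(set(talk_ids))
    random.Random(seed).shuffle(talks)
    # Round-robin over the shuffled talks balances the talks per fold.
    fold_of_talk = {talk: i % n_folds for i, talk in enumerate(talks)}
    return [fold_of_talk[talk] for talk in talk_ids]


# Toy data: 20 talks with 3 utterances each.
talk_ids = ["talk%02d" % (i // 3) for i in range(60)]
folds = talk_level_folds(talk_ids)

# For cross-validation round k, utterances with fold == k form the test
# set and all remaining utterances form the training set.
test_round_0 = [i for i, f in enumerate(folds) if f == 0]
```

Since talks differ in length, folds built this way are balanced by talk count rather than by utterance count.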
When building conventional models, we developed our own feature extraction scripts and used the SKLL Python package for building Random Forest models. When implementing CNN, [...] Bergstra et al. (2013). Our code implementation was based on https://github.com/shagunsodhani/CNN-Sentence-Classifier. After running 200 iterations of tweaking, we ended up with the following selection: f_w is 6 (entailing that the filter sizes are (5, 6, 7)), n_f is 100, dropout_1 is 0.7, dropout_2 is 0.35, and optimization uses Adam (Kingma and Ba, 2014). When training the CNN model, we randomly selected 10% of the training data as the validation set for early stopping to avoid over-fitting.

On the Pun data, the CNN model shows consistently improved performance over the conventional model, as suggested in Yang et al. (2015). In particular, precision increased from 0.762 to 0.864. On the TED data, we also observed that the CNN model helps to increase precision (from 0.515 to 0.582) and accuracy (from 52.0% to 58.9%). The empirical evaluation results suggest that the CNN-based model has an advantage on the humor recognition task.

In addition, in terms of system development time, generating and implementing the features of the conventional model would take days or even weeks, whereas the CNN model automatically learns its feature representation and can adjust the features automatically across data sets. This makes the CNN model quite versatile for supporting different tasks and data domains. Compared with the humor recognition results on the Pun data, the results on the TED data are still quite low, and more research is needed to fully handle humor in authentic presentations.
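The early-stopping step mentioned above, in which 10% of the training data is held out for validation, can be sketched as follows. The patience-based stopping rule and its value are assumptions, since the stopping criterion is not stated in the text; the function names and the validation losses are made up for illustration.

```python
import random


def split_validation(n_examples, fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training indices for validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(round(n_examples * fraction)))
    return indices[n_val:], indices[:n_val]


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch whose weights would be kept.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs (hypothetical criterion).
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


train_idx, val_idx = split_validation(100)
# Made-up per-epoch validation losses; the last value is never reached
# because training stops after three epochs without improvement.
kept = early_stopping_epoch([0.70, 0.65, 0.66, 0.64, 0.67, 0.68, 0.69, 0.60])
```

A larger patience tolerates noisier validation curves at the cost of longer training.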

Discussion
For the purpose of monitoring how well speakers can use humor during their presentations, we have created a corpus from TED talks. Compared to the existing corpora, ours has the following advantages: (a) it was collected from authentic talks, rather than from TV shows performed by professional actors based on scripts; (b) it contains about 100 times more speakers compared to the limited number of actors in existing corpora. We compared two types of leading text-based humor recognition methods: a conventional classifier (e.g., Random Forest) based on human-engineered features vs. an end-to-end CNN method, which relies on its inherent representation learning. We found that the CNN method has better performance. More importantly, the representation learning of the CNN method makes it very efficient when facing new data sets.
Stemming from the present study, we envision that more research is worth pursuing: (a) for presentations, cues from other modalities such as audio or video will be included, similar to Bertero and Fung (2016b); (b) context information from multiple utterances will be modeled by using sequential modeling methods.