UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

Figure 1: An example of the UR-FUNNY dataset. UR-FUNNY presents a framework to study the dynamics of humor in multimodal language. Machine learning models are given a sequence of sentences with the accompanying modalities of vision and acoustic. Their goal is to detect whether or not the sequence will trigger immediate laughter by detecting whether or not the last sentence constitutes a punchline.

Abstract
Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it remains understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding the multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.

Introduction
Humor is a unique communication skill that removes barriers in conversations. Research shows that effective use of humor allows a speaker to establish rapport (Stauffer, 1999), grab attention (Wanzer et al., 2010), introduce a difficult concept without confusing the audience (Garner, 2005) and even build trust (Vartabedian and Vartabedian, 1993). Humor involves multimodal communicative channels including effective use of words (text), accompanying gestures (vision) and sounds (acoustic). The ability to mix and align those modalities appropriately is often unique to individuals, giving rise to many different styles. Styles include gradually building up to a punchline using text, audio, video or any combination of them, a sudden twist to the story with an unexpected punchline (Ramachandran, 1998), creating a discrepancy between modalities (e.g., something funny being said without any emotion, also known as dry humor), or just laughing with the speech to stimulate the audience to mirror the laughter (Provine, 1992).
Modeling humor using a computational framework is inherently challenging due to factors such as: 1) Idiosyncrasy: often humorous people are also the most creative ones (Hauck and Thomas, 1972). This creativity in turn adds to the dynamic complexity of how humor is expressed in a multimodal manner. Use of words, gestures, prosodic cues and their (mis)alignments are toolkits that a creative user often experiments with. 2) Contextual Dependencies: humor often develops through time as speakers plan for a punchline in advance. There is a gradual build up in the story with a sudden twist using a punchline (Ramachandran, 1998). Some punchlines when viewed in isolation (as illustrated in Figure 1) may not appear funny. The humor stems from the prior build up, cross-referencing multiple sources, and its delivery. Therefore, a full understanding of humor requires analyzing the context of the punchline.
Understanding the unique dependencies across modalities and their impact on humor requires knowledge from multimodal language, a recent research trend in the field of natural language processing (Zadeh et al., 2018b). Studies in this area aim to explain natural language from the three modalities of text, vision and acoustic. In this paper, alongside computational descriptors for text, gestures such as smiles and vocal properties such as loudness are measured and put together in a multimodal framework to define humor recognition as a multimodal task.
The main contribution of this paper to the NLP community is introducing the first multimodal language (including text, vision and acoustic modalities) dataset of humor detection named "UR-FUNNY". This dataset opens the door to understanding and modeling humor in a multimodal framework. The studies in this paper present performance baselines for this task and demonstrate the impact of using all three modalities together for humor modeling.

Background
The dataset and experiments in this paper are connected to the following areas:
Humor Analysis: Humor analysis has been among the active areas of research in both natural language processing and affective computing. Notable datasets in this area include "16000 One-Liners" (Mihalcea and Strapparava, 2005), "Pun of the Day" (Yang et al., 2015), "PTT Jokes" (Chen and Soo, 2018), "Ted Laughter" (Chen and Lee, 2017), and "Big Bang Theory" (Bertero et al., 2016). The above datasets have studied humor from different perspectives. For example, "16000 One-Liners" and "Pun of the Day" focus on joke detection (a binary joke vs. not-joke task), while "Ted Laughter" focuses on punchline detection (whether or not a punchline triggers laughter). Similar to "Ted Laughter", UR-FUNNY focuses on punchline detection. Furthermore, each punchline is accompanied by context sentences to properly model the build up of humor. Unlike previous datasets where negative samples were drawn from a different domain, UR-FUNNY uses a challenging negative sampling case where samples are drawn from the same videos. Furthermore, UR-FUNNY is the only humor detection dataset which incorporates all three modalities of text, vision and audio. Table 1 shows a comparison between previously proposed datasets and the UR-FUNNY dataset. From a modeling perspective, humor detection has been done using hand-crafted and non-neural models (Yang et al., 2015), as well as neural RNN and CNN models for detecting humor in Yelp (de Oliveira et al., 2017) and TED talks (Chen and Lee, 2017). Newer approaches (Chen and Soo, 2018) have used highway networks on the "16000 One-Liners" and "Pun of the Day" datasets. There have been very few attempts at using extra modalities alongside language for detecting humor, mostly limited to adding simple audio features (Rakov and Rosenberg, 2013; Bertero et al., 2016). Furthermore, these attempts have been restricted to certain topics and domains (such as the "Big Bang Theory" TV show (Bertero et al., 2016)).
Multimodal Language Analysis: Studying natural language from the modalities of text, vision and acoustic is a recent research trend in natural language processing (Zadeh et al., 2018b). Notable works in this area present novel multimodal neural architectures (Wang et al., 2019; Pham et al., 2019; Hazarika et al., 2018), multimodal fusion approaches (Tsai et al., 2018; Zadeh et al., 2018a; Barezi et al., 2018) as well as resources (Poria et al., 2018a; Zadeh et al., 2018c, 2016; Park et al., 2014; Rosas et al., 2013; Wöllmer et al., 2013). Multimodal language datasets mostly target multimodal sentiment analysis, emotion recognition (Zadeh et al., 2018c; Busso et al., 2008), and personality traits recognition (Park et al., 2014). The UR-FUNNY dataset is similar to the above datasets in diversity (speakers and topics) and size, with the main task of humor detection. Beyond the scope of multimodal language analysis, the dataset and studies in this paper have similarities to other applications in multimodal machine learning such as language and vision studies, robotics, image captioning, and media description (Baltrušaitis et al., 2019).

UR-FUNNY Dataset
In this section we present the UR-FUNNY dataset. We first discuss the data acquisition process, and subsequently present statistics of the dataset as well as multimodal feature extraction and validation.

Data Acquisition
A suitable dataset for the task of multimodal humor detection should be diverse in a) speakers: modeling the idiosyncratic expressions of humor may require a dataset with a large number of speakers, and b) topics: different topics exhibit different styles of humor, as the context and punchline can be entirely different from one topic to another.
TED talks are among the most diverse idea sharing channels, in both speakers and topics. Speakers from various backgrounds, ethnic groups and cultures present their thoughts through a widely popular channel. The topics of these presentations are diverse, from scientific discoveries to everyday ordinary events. As a result of the diversity in speakers and topics, TED talks span a broad spectrum of humor. Therefore, this platform presents a unique resource for studying the dynamics of humor in a multimodal setup.
TED videos include manual transcripts and audience markers. Transcriptions are highly reliable, which in turn allows for aligning the text and audio. This property makes TED talks a unique resource for the newest continuous fusion trends. Transcriptions also include reliably annotated markers for audience behavior. Specifically, the "laughter" marker has been used in NLP studies as an indicator of humor (Chen and Lee, 2017). Previous studies have identified the importance of both punchline and context in understanding and modeling humor. In a humorous scenario, the context is the gradual build up of a story and the punchline is a sudden twist to the story which causes laughter (Ramachandran, 1998). Using the provided laughter marker, the sentence immediately before the marker is considered the punchline, and the sentences prior to the punchline (but after the previous laughter marker) are considered the context.
We collect 1866 videos as well as their transcripts from the TED portal. These 1866 videos are chosen from 1741 different speakers and span 417 topics. The laughter markup is used to identify 8257 humorous punchlines in the transcripts (Chen and Lee, 2017). The context is extracted from the sentences prior to the punchline (until the previous humor instance or the beginning of the video is reached). Using a similar approach, 8257 negative samples are chosen at random intervals where the last sentence is not immediately followed by a laughter marker. The last sentence is treated as a punchline and, as with the positive instances, the context is chosen from the preceding sentences. This negative sampling uses sentences from the same distribution, as opposed to datasets which use sentences from other distributions or domains as negative samples (Yang et al., 2015; Mihalcea and Strapparava, 2005). After this negative sampling, there is a homogeneous 50% split in the dataset between positive and negative examples.
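The instance construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Sentence` class, its `laughter_after` flag (standing in for the transcript's laughter marker), the function names and the toy sentences `s0` to `s4` are all hypothetical. The text does not specify how the context of a negative sample is bounded beyond "similar to the positive instances", so the sketch walks back to the previous laughter marker.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sentence:
    text: str
    laughter_after: bool  # hypothetical flag: a laughter marker follows this sentence


Instance = Tuple[List[str], str]  # (context sentences, punchline)


def extract_positives(sentences: List[Sentence]) -> List[Instance]:
    """The sentence right before each laughter marker is the punchline; the
    sentences since the previous marker (or the start) form its context."""
    instances, context = [], []
    for s in sentences:
        if s.laughter_after:
            instances.append((context, s.text))
            context = []  # the next context starts after this marker
        else:
            context.append(s.text)
    return instances


def sample_negative(sentences: List[Sentence], rng: random.Random) -> Instance:
    """Pick a random sentence NOT followed by laughter as a pseudo-punchline and
    take the sentences back to the previous laughter marker as its context."""
    candidates = [i for i, s in enumerate(sentences) if not s.laughter_after]
    end = rng.choice(candidates)
    start = end
    while start > 0 and not sentences[start - 1].laughter_after:
        start -= 1
    return [s.text for s in sentences[start:end]], sentences[end].text


# Toy transcript: laughter follows only s2.
talk = [Sentence("s0", False), Sentence("s1", False), Sentence("s2", True),
        Sentence("s3", False), Sentence("s4", False)]
positives = extract_positives(talk)
negative = sample_negative(talk, random.Random(0))
```

Because negatives come from the same talk as positives, a model cannot separate the classes by topic or speaker alone, which is the point of the same-distribution sampling.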
Using forced alignment, we mark the beginning and end of each sentence in the video as well as words and phonemes in the sentences (Yuan and Liberman, 2008). Therefore, an alignment is established between text, audio and video. Utilizing this alignment, the timing of punchline as well as context is extracted for all instances in the dataset.

Dataset Statistics
The high level statistics of the UR-FUNNY dataset are presented in Table 2. The total duration of the entire dataset is 90.23 hours. There are a total of 1741 distinct speakers and a total of 417 distinct topics in the UR-FUNNY dataset. Figure 2.e shows the word cloud of the topics based on the log-frequency of each topic. The five most frequent topics are technology, science, culture, global issues and design. There are in total 16514 video segments of humor and non-humor instances (equal splits of 8257). The average duration of each data instance is 19.67 seconds, with the context lasting an average of 14.7 seconds and the punchline an average of 4.97 seconds. The average number of words in a punchline is 16.14 and the average number of words in a context sentence is 14.80. Figure 2 shows an overview of some of the important statistics of the UR-FUNNY dataset. Figure 2.a shows the distribution of punchlines for humor and non-humor cases based on the number of words. There is no clear distinction between humor and non-humor punchlines, as both follow a similar distribution. Similarly, Figure 2.b shows the distribution of the number of words per context sentence. Both humor and non-humor context sentences follow the same distribution. The majority (≥ 90%) of punchlines have a length of less than 32 words.
Figure 2.d shows the distribution of punchline and context sentence length in seconds. Figure 2.c shows the distribution of the number of context sentences per humor and non-humor data instance. The number of context sentences per humor and non-humor case is also roughly the same. The statistics in Figure 2 show that there are no trivial or degenerate distinctions between humor and non-humor cases. Therefore, classification of humor versus non-humor cases cannot be done based on simple measures (such as the number of words); it requires understanding the content of sentences. Table 3 shows the standard train, validation and test folds of the UR-FUNNY dataset. These folds share no speaker with each other; hence the standard folds are speaker independent (Zadeh et al., 2016). This minimizes the chance of overfitting to the identity of the speakers or their communication patterns.
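A speaker-independent split can be obtained by partitioning speakers rather than individual instances. The sketch below is illustrative only: the function name, the 80/10/10 ratios and the toy speaker identifiers are assumptions, since the actual fold sizes are given in Table 3.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def speaker_independent_split(
    instances: Sequence[Tuple[str, str]],  # (speaker_id, instance_id) pairs
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    seed: int = 0,
) -> List[List[Tuple[str, str]]]:
    """Split at the speaker level so no speaker appears in two folds."""
    by_speaker: Dict[str, List[str]] = defaultdict(list)
    for speaker, instance in instances:
        by_speaker[speaker].append(instance)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(ratios[0] * len(speakers))
    n_val = int(ratios[1] * len(speakers))
    groups = (
        speakers[:n_train],
        speakers[n_train:n_train + n_val],
        speakers[n_train + n_val:],
    )
    # Every instance follows its speaker into exactly one fold.
    return [[(sp, x) for sp in group for x in by_speaker[sp]] for group in groups]


# Toy data: 10 hypothetical speakers with 3 instances each.
toy = [(f"spk{i}", f"clip{i}_{j}") for i in range(10) for j in range(3)]
train, val, test = speaker_independent_split(toy)
```

Splitting by instance instead would let the same speaker appear in both training and test data, which is exactly the identity leakage the standard folds are designed to avoid.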

Extracted Features
For each modality, the extracted features are as follows: Language: GloVe word embeddings (Pennington et al., 2014) are used as pre-trained word vectors for the text features. The P2FA forced alignment model (Yuan and Liberman, 2008) is used to align the text and audio at the phoneme level. From the forced alignment, we extract the timing annotations of context and punchline at the word level. Then, the acoustic and visual cues are aligned at the word level by interpolation.
Visual: The OpenFace facial behavior analysis tool (Baltrušaitis et al., 2016) is used to extract facial expression features at a rate of 30 frames/sec. We extract all facial Action Unit (AU) features based on the Facial Action Coding System (FACS) (Ekman, 1997). Rigid and non-rigid facial shape parameters are also extracted (Baltrušaitis et al., 2016). We observed that the camera angle and position change frequently during TED presentations. However, for the majority of the time, the camera stays focused on the presenter. Due to the volatile camera work, the only consistently available source of visual information was the speaker's face.
UR-FUNNY dataset is publicly available for download alongside all the extracted features.

Multimodal Humor Detection
In this section, we first outline the problem formulation for performing binary multimodal humor detection on the UR-FUNNY dataset. We then proceed to study the UR-FUNNY dataset through the lens of a contextualized extension of the Memory Fusion Network (MFN) (Zadeh et al., 2018a), a state-of-the-art model in multimodal language.

Problem Formulation
The UR-FUNNY dataset is a multimodal dataset with three modalities of text, vision and acoustic. We denote the set of these modalities as $M = \{t, v, a\}$. Each of the modalities comes in a sequential form. We assume word-level alignment between modalities (Yuan and Liberman, 2008). Since the frequency of the text modality is lower than that of vision and acoustic (i.e. vision and acoustic have a higher sampling rate), we use expected visual and acoustic descriptors for each word. After this process, each modality has the same sequence length (each word is accompanied by a single vision vector and a single acoustic vector).
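One way to read "expected descriptors for each word" is as the mean of all frame-level feature vectors that fall inside the word's time span. The sketch below implements that reading; it is an assumption about the procedure, and the function name, the frame rate and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_per_word(
    frames: Sequence[Sequence[float]],       # frame-level feature vectors
    frame_rate: float,                       # frames per second
    word_spans: Sequence[Tuple[float, float]],  # (start_sec, end_sec) per word
) -> List[List[float]]:
    """Return one mean feature vector per word (assumed form of 'expected')."""
    dim = len(frames[0])
    per_word = []
    for start, end in word_spans:
        # Clamp indices so every word gets at least one frame.
        lo = min(int(start * frame_rate), len(frames) - 1)
        hi = min(max(lo + 1, int(end * frame_rate)), len(frames))
        window = frames[lo:hi]
        per_word.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return per_word


# Toy example: 5 one-dimensional frames at 10 frames/sec, two words.
toy_frames = [[0.0], [2.0], [4.0], [6.0], [8.0]]
toy_spans = [(0.0, 0.2), (0.2, 0.5)]
word_features = average_per_word(toy_frames, 10.0, toy_spans)
```

After this step the visual and acoustic sequences have exactly one vector per word, so all three modalities share the same sequence length.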
Each data sample in UR-FUNNY can be described as a triplet $(l, P, C)$, with $l$ being a binary label for humor or non-humor. $P$ is the punchline and $C$ is the context. Both punchline and context have multiple modalities: $P = \{P_m ; m \in M\}$ and $C = \{C_m ; m \in M\}$. If there are $N_C$ context sentences accompanying the punchline, then $C_m = [C_{m,1}, C_{m,2}, \ldots, C_{m,N_C}]$; that is, the context sentences run from the first sentence to the last ($N_C$-th) sentence. $K_P$ is the number of words in the punchline and $\{K_{C_n}\}_{n=1}^{N_C}$ are the numbers of words in each of the context sentences, respectively. As examples of this notation, $P_{m,k}$ refers to the $k$th entry in modality $m$ of the punchline, and $C_{m,n,k}$ refers to the $k$th entry in modality $m$ of the $n$th context sentence.
Models developed on the UR-FUNNY dataset are trained on triplets $(l, P, C)$. During testing, only the tuple $(P, C)$ is given, and the model must predict $l$. $l$ is the label for laughter, specifically whether or not the inputs $P$ and $C$ are likely to trigger laughter.
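The triplet notation maps naturally onto a small container type. The sketch below is a hypothetical in-memory layout, not the format of the released dataset files; the class name, the modality keys and the helper `is_word_aligned` are assumptions chosen to mirror the symbols above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Vector = List[float]
WordSeq = List[Vector]  # one feature vector per word


@dataclass
class HumorSample:
    label: int                         # l: 1 for humor, 0 for non-humor
    punchline: Dict[str, WordSeq]      # P_m, indexed by modality m
    context: Dict[str, List[WordSeq]]  # C_m = [C_{m,1}, ..., C_{m,N_C}]


def is_word_aligned(sample: HumorSample, modalities: Sequence[str] = ("t", "v", "a")) -> bool:
    """Check that all modalities share K_P, N_C and every K_{C_n}."""
    if len({len(sample.punchline[m]) for m in modalities}) != 1:
        return False
    if len({len(sample.context[m]) for m in modalities}) != 1:
        return False
    n_context = len(sample.context[modalities[0]])
    return all(
        len({len(sample.context[m][n]) for m in modalities}) == 1
        for n in range(n_context)
    )


# Toy sample: punchline of 2 words, one context sentence of 3 words,
# with made-up feature sizes (2 for text, 1 for vision and acoustic).
toy_sample = HumorSample(
    label=1,
    punchline={"t": [[0.1, 0.2]] * 2, "v": [[0.3]] * 2, "a": [[0.4]] * 2},
    context={"t": [[[0.1, 0.2]] * 3], "v": [[[0.3]] * 3], "a": [[[0.4]] * 3]},
)
aligned = is_word_aligned(toy_sample)
```

Here `toy_sample.punchline["v"][k]` plays the role of $P_{v,k}$ and `toy_sample.context["a"][n][k]` the role of $C_{a,n,k}$ (with zero-based indices).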

Contextual Memory Fusion Baseline
The Memory Fusion Network (MFN) is among the state-of-the-art models for several multimodal datasets (Zadeh et al., 2018a). We devise an extension of the MFN model, named Contextual Memory Fusion Network (C-MFN), as a baseline for humor detection on the UR-FUNNY dataset. This is done by introducing two components that allow context to be involved in the MFN model: 1) the Unimodal Context Network, where information from each modality is encoded using $|M|$ Long Short-Term Memory (LSTM) networks, and 2) the Multimodal Context Network, where unimodal context information is fused (using self-attention) to extract the multimodal context information. We discuss the components of the C-MFN model in the remainder of this section.

Unimodal Context Network
To model the context, we first model each modality within the context. The Unimodal Context Network (Figure 3) consists of $|M|$ LSTMs, one for each modality $m \in M$, denoted as $\mathrm{LSTM}_m$. For each context sentence $n$ of each modality $m \in M$, $\mathrm{LSTM}_m$ is used to encode the information into a single vector $h_{m,n}$. This single vector is the last output of $\mathrm{LSTM}_m$ given $C_{m,n}$ as input. The recurrence step for each LSTM is the utterance of each word (due to word-level alignment, the vision and acoustic modalities also follow this time-step). The output of the Unimodal Context Network is the set $H = \{h_{m,n} ; m \in M, 1 \le n < N_C\}$.
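The per-sentence encoding can be illustrated with a tiny, dependency-free LSTM. This is a sketch under stated assumptions: the `TinyLSTM` class is a hypothetical, untrained stand-in for $\mathrm{LSTM}_m$ with random weights, the hidden size and toy inputs are made up, and a practical implementation would use a deep learning library instead.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Minimal single-layer LSTM (hypothetical stand-in for LSTM_m)."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random):
        self.hidden_dim = hidden_dim
        n_in = input_dim + hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                      for _ in range(hidden_dim)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_dim for g in "ifog"}

    def _gate(self, name: str, z: Vector, act) -> Vector:
        return [act(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(self.W[name], self.b[name])]

    def encode(self, sentence: List[Vector]) -> Vector:
        """Run over the words of one sentence and return the last output."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in sentence:  # one recurrence step per word
            z = list(x) + h
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            o = self._gate("o", z, sigmoid)
            g = self._gate("g", z, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def unimodal_context_network(
    context: Dict[str, List[List[Vector]]], lstms: Dict[str, TinyLSTM]
) -> Dict[str, List[Vector]]:
    """Return H with H[m][n] = h_{m,n}: one vector per modality and sentence."""
    return {m: [lstms[m].encode(sent) for sent in sents] for m, sents in context.items()}


rng = random.Random(0)
toy_lstms = {"t": TinyLSTM(3, 4, rng), "v": TinyLSTM(2, 4, rng)}
toy_context = {
    "t": [[[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]], [[0.5, 0.5, 0.5]]],
    "v": [[[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.2]]],
}
H = unimodal_context_network(toy_context, toy_lstms)
```

Each modality has its own LSTM, but all of them step once per word, which is what the word-level alignment makes possible.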

Multimodal Context Network
The Multimodal Context Network (Figure 4) learns a multimodal representation of the context based on the output $H$ of the Unimodal Context Network. Sentences and modalities in the context can form complex asynchronous spatio-temporal relations. For example, during the gradual buildup of the context, the speaker's facial expression may be affected by an arbitrary previously uttered sentence. Transformers (Vaswani et al., 2017) are a family of neural models that specialize in finding various temporal relations between their inputs through self-attention. By concatenating the representations $h_{m \in M, n}$ (i.e. for all $|M|$ modalities of the $n$th context sentence), a self-attention model can be applied to find asynchronous spatio-temporal relations in the context. We use an encoder with 6 intermediate layers to derive a multimodal representation $\hat{H}$ conditioned on $H$. $\hat{H}$ is also spatio-temporal (as the outputs of the encoders in a transformer are). The output of the Multimodal Context Network is the output $\hat{H}$ of the encoder.
Figure 4: The output $H$ of the Unimodal Context Network is connected to an encoder module to get the multimodal output $\hat{H}$. For the details of the components outlined in orange, please refer to the authors' original paper (Vaswani et al., 2017). Best viewed in color.
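The concatenate-then-attend idea can be sketched with a single scaled dot-product self-attention step. This is a deliberate simplification and not the encoder used in the paper: it has one head, no learned query, key or value projections, no feed-forward sublayers and only one layer instead of six. The function names and toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_modalities(H: Dict[str, List[Vector]],
                      modalities: Sequence[str] = ("t", "v", "a")) -> List[Vector]:
    """Build one vector per context sentence by concatenating h_{m,n} over m."""
    n_sentences = len(H[modalities[0]])
    return [sum((H[m][n] for m in modalities), []) for n in range(n_sentences)]


def self_attention(X: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with queries = keys = values = X."""
    d = len(X[0])
    out = []
    for query in X:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out


# Toy unimodal outputs for two context sentences.
toy_H = {
    "t": [[1.0, 0.0], [0.0, 1.0]],
    "v": [[0.5], [0.5]],
    "a": [[0.2], [0.8]],
}
X = concat_modalities(toy_H)
H_hat = self_attention(X)
```

Every output row mixes information from all context sentences, so a facial cue in one sentence can influence the representation of another, which is the asynchronous relation the text describes.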

Memory Fusion Network (MFN)
After learning unimodal ($H$) and multimodal ($\hat{H}$) representations of the context, we use a Memory Fusion Network (MFN) to model the punchline (Figure 5). The MFN contains 2 types of memories: a System of LSTMs with $|M|$ unimodal memories to model each modality in the punchline, and a Multi-view Gated Memory which stores multimodal information. We use a simple trick to combine the Context Networks (Unimodal and Multimodal) with the MFN: we initialize the memories in the MFN using the outputs $H$ (unimodal representation) and $\hat{H}$ (multimodal representation). For the System of LSTMs, this is done by initializing the LSTM cell state of modality $m$ with $D_m(h_{m, 1 \le n < N_C})$. $D_m$ is a fully connected neural network that maps the information from $h_{m, 1 \le n < N_C}$ (the $m$th modality in the context) to the cell state of the $m$th LSTM in the System of LSTMs. The Multi-view Gated Memory is initialized based on a non-linear projection $D(\hat{H})$, where $D$ is a fully connected neural network. Similar to the context, where modalities are aligned at the word level, the punchline is also aligned the same way. Therefore a word-level implementation of the MFN is used, where a word and its accompanying vision and acoustic descriptors are used as input to the System of LSTMs at each time-step. The Multi-view Gated Memory is updated iteratively at every recurrence of the System of LSTMs using a Delta-memory Attention Network.
The final prediction of humor is conditioned on the last state of the System of LSTMs and the Multi-view Gated Memory, using an affine mapping with sigmoid activation.
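The final prediction step, an affine map followed by a sigmoid, can be written out as follows. This is a sketch only: it assumes the last LSTM states and the memory are concatenated into one feature vector before the affine map, and the function name, weights and inputs are hypothetical.

```python
import math
from typing import List, Sequence


def predict_humor(
    lstm_states: Sequence[Sequence[float]],  # last state of each unimodal LSTM
    memory: Sequence[float],                 # last Multi-view Gated Memory state
    weights: Sequence[float],                # affine weights (assumed given)
    bias: float,
) -> float:
    """Return the predicted probability of humor via sigmoid(w . x + b)."""
    features: List[float] = [x for state in lstm_states for x in state] + list(memory)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated feature size")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# With all-zero weights and bias, the model is maximally uncertain.
p_uninformed = predict_humor([[1.0], [2.0]], [3.0], [0.0, 0.0, 0.0], 0.0)
# A positive weight on an active feature pushes the probability above 0.5.
p_positive = predict_humor([[1.0], [2.0]], [3.0], [1.0, 0.0, 0.0], 0.0)
```

A threshold of 0.5 on this probability would give the binary humor or non-humor decision; the threshold itself is an assumption, as the text does not state one.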

Experiments
In the experiments of this paper, our goal is to establish a performance baseline for the UR-FUNNY dataset. Furthermore, we aim to understand the role of context and punchline, as well as the role of individual modalities, in the task of humor detection. For all the experiments, we use the proposed contextual extension of the Memory Fusion Network (C-MFN), together with two variants, C-MFN (P) and C-MFN (C), which model only the punchline and only the context, respectively. These variants of the C-MFN allow for studying the importance of punchline and context in modeling humor. Furthermore, we compare the performance of the C-MFN variants in the following scenarios: (T) only the text modality is used, without vision and acoustic; (T+V) text and vision modalities are used without acoustic; (T+A) text and acoustic modalities are used without vision; (A+V) only vision and acoustic modalities are used; (T+A+V) all modalities are used together.
We compare the performance of C-MFN variants across the above scenarios. This allows for understanding the role of context and punchline in humor detection, as well as the importance of different modalities. All the models for our experiments are trained using categorical cross-entropy. This measure is calculated between the output of the model and ground-truth labels.
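For a two-class problem, categorical cross-entropy reduces to the familiar binary cross-entropy over the predicted probability of the positive class. The sketch below shows that reduced form; the function name, the clipping constant `eps` and the toy values are assumptions.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean of -[y log p + (1 - y) log(1 - p)] over all samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# An uninformed model (p = 0.5 everywhere) scores log(2) per sample.
loss_uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])
# Confident correct predictions give a lower loss than confident wrong ones.
loss_good = binary_cross_entropy([0.9, 0.1], [1, 0])
loss_bad = binary_cross_entropy([0.1, 0.9], [1, 0])
```

Since the dataset has an even split between humor and non-humor, a loss near $\log 2$ indicates a model that has learned nothing beyond the class prior.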

Results and Discussion
The results of our experiments are presented in Table 4. Results demonstrate that both context and punchline information are important as C-MFN outperforms C-MFN (P) and C-MFN (C) models.
Punchline is the most important component for detecting humor as the performance of C-MFN (P) is significantly higher than C-MFN (C).
Models that use all modalities (T+A+V) outperform models that use only one or two modalities (T, T+A, T+V, A+V). Between text (T) and nonverbal behaviors (A+V), text proves to be the most important modality. In most cases, both the vision and acoustic modalities improve on the performance of text alone (T+V, T+A).
Based on the above observations, each neural component of the C-MFN model is useful in improving the prediction of humor. The results also indicate that modeling humor from a multimodal perspective yields successful results.
The human performance on the UR-FUNNY dataset is 82.5%.
The results from Table 4 demonstrate that while a state-of-the-art model can achieve a reasonable level of success in modeling humor, there is still a large gap between human-level performance and the state of the art. Therefore, the UR-FUNNY dataset presents new challenges to the field of NLP, specifically the research areas of humor detection and multimodal language analysis.

Conclusion
In this paper, we presented a new multimodal dataset for humor detection called UR-FUNNY. This dataset is the first of its kind in the NLP community. Humor detection is done from the perspective of predicting laughter, similar to (Chen and Lee, 2017). UR-FUNNY is diverse in both speakers and topics. It contains three modalities of text, vision and acoustic. We study this dataset through the lens of a Contextual Memory Fusion Network (C-MFN). Results of our experiments indicate that humor can be better modeled if all three modalities are used together. Furthermore, both context and punchline are important in understanding humor. The dataset and the accompanying experiments will be made publicly available.