Telling the Whole Story: A Manually Annotated Chinese Dataset for the Analysis of Humor in Jokes

Humor plays an important role in human communication, which makes it an important problem for natural language processing. Prior work on the analysis of humor focuses on whether a text is humorous or on its degree of funniness, but this is insufficient to explain why the text is funny. We therefore create a humor dataset of 9,123 manually annotated jokes in Chinese. We propose a novel annotation scheme that describes how humor arises in text. Specifically, our annotations of linguistic humor contain not only the degree of funniness, as in previous work, but also the key words that trigger humor as well as the character relationship, scene, and humor category. We report reasonable agreement between annotators. We also conduct an analysis and exploration of the dataset. To the best of our knowledge, we are the first to approach humor annotation with the aim of exploring the underlying mechanism of humor, which may contribute to a significantly deeper analysis of humor. We also contribute a scarce and valuable dataset, which we will release publicly.


Introduction
Humor plays an important role in human communication: it not only serves to exchange ideas or convey messages, but also supports emotion regulation, such as provoking laughter, generating amusement, and reducing stress (Wooten, 1996; Morse, 2007). In particular, with the rapid growth of social media applications such as Facebook and Twitter, an increasing number of individuals are using these public platforms to post humorous texts. Humor often arises when two incongruous concepts are applied and examined through one semantic frame (Lefcourt, 2001; Paulos, 2008). The two concepts often involve a semantic disconnection in forms such as contradiction and contrast/comparison. Humor sometimes also arises from ambiguity (Yang et al., 2015), such as unexpected homophones or homographs.
The importance and complexity of humor have thus gained attention in natural language processing (NLP), and many computational approaches to it have been proposed (Binsted et al., 2006; Yang et al., 2015; Baziotis et al., 2017; Ortega-Bueno et al., 2018; Liu et al., 2018). Corpora are fundamental in NLP for sound analysis of humor and for high-quality automatic humor detection. Scholars have studied humor resources in English and in other languages. Mihalcea and Strapparava (2005) constructed a humorous witticism dataset of 16,000 texts for humor identification in English sentences. The data come from one-liners, Reuters titles, BNC sentences, and proverbs, and are annotated as humorous or non-humorous. Reyes et al. (2013) established an English irony dataset of 40,000 tweets for studying irony on Twitter. The dataset contains the irony label and specific non-irony hashtags (education, humor, and politics). Zhang and Liu (2014) established an English humor corpus of 3,000 tweets to recognize humor on Twitter. The dataset contains annotations of humorous tweets, non-humorous tweets and humorous non-tweets. Potash et al. (2017) built an English dataset of 12,734 tweets for studying the comparative ranking of humor. The dataset comes from the midnight TV program Hashtag Wars, which was published on Twitter. Castro et al. (2017) established a humorous text corpus containing 33,531 tweets for detecting humor in Spanish tweets. The dataset includes a humorous/non-humorous annotation and a humor level annotation. The humor level annotation is based on a 5-point scale, with 1 signifying the lowest level and 5 the highest. Castro et al. (2018) revised the Spanish Twitter corpus with crowd-sourced annotations and presented a dataset of 27,000 tweets in total. Specifically, the authors used five different emojis to represent the five degrees of humor instead of the 5-point numeric annotation.
However, while previous work focuses on textual humor annotation of humorous/non-humorous and degree of funniness, such annotations do not provide adequate knowledge and scenarios to explain how humor arises, so they may not provide a deep analysis of the underlying mechanism of humor. In addition, as the majority of data came from Twitter, the data source lacks variety. To this end, we create a dataset on humor with 9,123 manually annotated jokes in Chinese. We propose a novel scheme with annotations of key words that make text humorous as well as character relationship, scene, humor category, and degree of funniness. The annotation agreement analyses for multiple annotators are described. We also conduct analysis and exploration on the dataset. Our contributions are as follows.
• We propose a novel annotation scheme to explain how humor arises in text. Unlike previous work, we annotate not only what is humorous, but also what causes humor.
• We contribute a new, sizeable, and scarce joke dataset, which is being released publicly and is particularly valuable for languages other than English.

Data Collection
To make the dataset objective and comprehensive, we collected joke data from a variety of fields, covering both diachronic and synchronic dimensions. We also selected jokes based on a four-dimensional model. On the time axis, our dataset includes jokes from books, literary journals, etc. published over the past decade, which satisfies the diachronic requirement. It also includes jokes posted on websites and microblogs, many of which are novel, which satisfies the synchronic requirement. On the spatial axis, the dataset contains both domestic and translated foreign jokes. On the subject axis, the perceived intensity of jokes varies from person to person, owing to their varying backgrounds and senses of humor. On the style axis, jokes from books cover various themes and are relatively canonical, while online jokes are more oral and informal. The source information is in Table 1.
• Relationship: We annotated the relationship between the main characters, such as teacher-student, doctor-patient, lovers, and superior-subordinate, because identifying the relationship between people in jokes helps to clarify the contextual coordinates of the joke (Popa, 2005).
• Scene: The scene refers to the place where the joke occurs. Previous studies indicate that humor plays an important role in places including the campus (Morrison et al., 2012), the workplace (Blumenfeld and Alpern, 1994), the family (Lovorn, 2008) and public space (Thornton, 2007). We therefore selected campus, workplace, family and public space for the annotation of scene in humor.
• Category: There is no consensus on the categories of humor in the literature. Based on our investigation of a wide range of literature, we focused on the eight most frequently appearing types of humor: homophonic, harmonic, antiphrasis, analogy, euphemism, irony, exaggeration, and reversal.
• Keyword: We define key words as words that trigger humor and that may carry conflicting, incongruous, ambiguous, or implicit meanings in jokes. The special word pair in the contrasting text spans triggers the production of humor. Specifically, we annotated the keywords based on the idea of contrasting text spans in humor, in the format of a prototype. An annotation example is shown in Figure 1.
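The annotation items described above can be pictured as one record per joke. The following Python sketch is only illustrative: the class name `JokeAnnotation`, the field names, and the validation rules are assumptions rather than the released data format, and the funniness scale is left unconstrained because it is not specified in this section.

```python
from dataclasses import dataclass
from typing import List

# Label sets as named in the paper; "others" appears only in the
# dataset analysis, so including it here is an assumption.
SCENES = {"campus", "workplace", "family", "public space", "others"}
CATEGORIES = {
    "homophonic", "harmonic", "antiphrasis", "analogy",
    "euphemism", "irony", "exaggeration", "reversal",
}


@dataclass
class JokeAnnotation:
    """Hypothetical container for the annotations of a single joke."""
    text: str            # the joke itself
    relationship: str    # e.g. "teacher-student" (open vocabulary assumed)
    scene: str           # one of SCENES
    category: str        # one of CATEGORIES
    keywords: List[str]  # words that trigger the humor
    funniness: float     # degree of funniness; scale not specified here

    def __post_init__(self) -> None:
        # Reject labels outside the closed label sets.
        if self.scene not in SCENES:
            raise ValueError(f"unknown scene: {self.scene!r}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


# Illustrative values only; they are not taken from the dataset.
example = JokeAnnotation(
    text="(joke text)",
    relationship="teacher-student",
    scene="campus",
    category="irony",
    keywords=["sing", "hit"],
    funniness=3,
)
```

Keeping the scene and category as closed label sets makes invalid annotations fail early, while the relationship field is left open because the paper lists its values only by example.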

Keyword annotation
The keyword is the most challenging of the six annotating items. Following Van Hee et al. (2016), our annotation of key words is at the relation level, which involves identifying the incongruous or ambiguous vocabulary that results in a comic effect. To identify key words, the annotators followed the guidelines below:
• Read the entire text to establish a general understanding of its meaning.
• For each word in the text, establish its meaning in context.
• Determine which words carry incongruous, conflicting, ambiguous, or unexpected meanings, or strong emotions, that make the text humorous in the given context.
• Decide whether the contextual meaning can be understood.
• If yes, mark the word as a key word.
Figure 2 shows an example of keyword annotation in humor. The words "sing" and "hit" act as keywords because they are the main indicators of humor: "she sings badly, and it sounds like she is being beaten and screaming." The comparison of "sing" and "hit" invokes the humor, so they are keywords.

Annotation process
Eight postgraduate students and one PhD student worked together to complete the annotation of the joke dataset. The postgraduate students were divided into four groups of two, and each group annotated the jokes using cross-validation. The PhD student acted as arbitrator. During the annotation process, when the two annotators agreed on the annotation result, the marking was complete; when they disagreed, the arbitrator attempted to resolve the disagreement. When the arbitrator's judgment was inconsistent with the judgments of both annotators, one of two cases applied. Case 1: If the inconsistency was in the degree of humor, we used the average value of the three people. Case 2: If there was any disagreement about the generation mechanism, it was discussed by the whole group of nine people, and the mechanism receiving the largest number of votes was the final result.
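The resolution procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tooling actually used: the function names are hypothetical, it is assumed that the arbitrator's judgment is adopted whenever it matches one of the two annotators, and tie-breaking in the group vote is not specified in the text (here the label that reaches the top count first wins).

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_funniness(a: float, b: float,
                      arbiter: Optional[float] = None) -> float:
    """Resolve the degree of humor for one joke (Case 1)."""
    if a == b:
        return a  # the two annotators agree; marking is complete
    if arbiter is None:
        raise ValueError("disagreement requires an arbitrator score")
    if arbiter in (a, b):
        return arbiter  # assumption: arbitrator sides with one annotator
    # Arbitrator differs from both: average the three scores.
    return (a + b + arbiter) / 3


def resolve_label(a: str, b: str,
                  arbiter: Optional[str] = None,
                  group_votes: Optional[Sequence[str]] = None) -> str:
    """Resolve a categorical label such as the humor mechanism (Case 2)."""
    if a == b:
        return a
    if arbiter is None:
        raise ValueError("disagreement requires an arbitrator label")
    if arbiter in (a, b):
        return arbiter  # assumption: arbitrator sides with one annotator
    if not group_votes:
        raise ValueError("three-way disagreement requires a group vote")
    # Whole-group vote: the label with the most votes is the final result.
    return Counter(group_votes).most_common(1)[0][0]
```

For example, `resolve_funniness(3, 4, 5)` averages the three scores, while `resolve_label("irony", "reversal", "analogy", votes)` falls back to the vote of the whole group.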

Annotation agreement and challenges
To evaluate inter-annotator agreement, we had three annotators annotate the same 600 sentences and computed Fleiss' kappa (Fleiss, 1971). The agreement was κ = 0.85 on the relationship annotation, κ = 0.79 on the scene annotation, κ = 0.71 on the category annotation, κ = 0.65 on the humor level annotation, and κ = 0.59 on the keyword annotation.
Figure 3: Quantity of joke scenes
The keyword and humor level annotations were the most challenging parts of the annotation, due to the subjective nature of cognition and the different background knowledge of people. To minimize the problems annotators faced, we held a seminar once a week to discuss the ambiguities. Then, the guide gave an authoritative explanation. Finally, the ambiguous points and the corresponding measures were added to the annotation guide manual to help annotators make judgments quickly and correctly when they encountered the same problem.
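For reference, Fleiss' kappa for N items, each rated by n annotators into k categories, is κ = (P − Pe) / (1 − Pe), where P is the mean per-item agreement and Pe is the agreement expected by chance. The following Python sketch shows the standard computation; the function name and the count-matrix input format are assumptions, and how the keyword annotations were mapped to categories for this statistic is not specified here.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from a count matrix.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Proportion of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Extent of agreement among annotators on each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items       # observed agreement
    p_e = sum(p * p for p in p_j)    # chance agreement
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy input: 4 items, 3 annotators, 2 categories (not data from the paper).
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

In the setting described above, the matrix would have 600 rows with three ratings each, and one such matrix would be built per annotation item.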

Dataset Analysis
The dataset contains 9,123 jokes, 39,977 sentences and 4,110,592 words in total, with an average of 4.38 sentences per joke. The jokes fall into five scenes (where the joke occurs): workplace, campus, family, public space and others. Family accounts for 42% of the dataset, perhaps because family life accounts for the largest proportion of life as a whole. These statistics suggest that the jokes originate from everyday life and are familiar to the general public, which indicates that the audience for jokes is very wide. The specific distribution is in Figure 3. Figure 4 shows that the vocabulary varies across joke scenes. For instance, the top five high-frequency words in campus jokes are "school, class, exciting, suitable, answer"; in family jokes, "son, mom, regret, homework, see"; in workplace jokes, "boring, boss, curious, forget, tell"; and in public space jokes, "man, bus, surprising, cry, calmly". Notably, some high-frequency words are related to the scene in which they occur. For instance, the high-frequency words in workplace jokes include "boss, boring, forget", which not only is in line with the character of the workplace, but also supports the validity of this classification to some extent.
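A per-scene frequency count of the kind summarized in Figure 4 could be computed as in the following Python sketch. It assumes that each joke has already been segmented into words and stripped of stop words; the segmentation tool is not named in this paper, so the function `top_words_by_scene` and its input format are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def top_words_by_scene(
    jokes: Iterable[Tuple[str, List[str]]], k: int = 5
) -> Dict[str, List[str]]:
    """Return the k most frequent words for each scene.

    Each element of `jokes` is (scene_label, tokens), where tokens is the
    pre-segmented word list of one joke (an assumed input format).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for scene, tokens in jokes:
        counts[scene].update(tokens)
    # Keep only the words, ordered from most to least frequent.
    return {
        scene: [word for word, _ in counter.most_common(k)]
        for scene, counter in counts.items()
    }


# Toy input for illustration only; not data from the dataset.
toy_jokes = [
    ("campus", ["school", "class", "school"]),
    ("family", ["son", "mom", "son"]),
    ("campus", ["class", "school"]),
]
toy_top = top_words_by_scene(toy_jokes, k=2)
```

Counting per scene rather than over the whole corpus is what makes scene-specific words such as "boss" or "school" stand out.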
We also analyzed the humor categories because they may be associated with the underlying mechanism of humor. The quantitative statistics for humor categories are shown in Figure 5.
Our annotation contains not only the degree of funniness, but also the key words that trigger humor, as well as the character relationship, scene, and humor category. Specifically, we have improved on the study of Castro et al. (2018) by providing evidence of what causes humor and explaining how humor arises in text. Furthermore, our data come from a range of sources in numerous domains rather than only from Twitter.

Conclusion
We propose a novel annotation scheme to explain how humor arises in text. Unlike previous work, we annotate not only what is humorous, but also what causes humor. Creating the dataset involved nine volunteer students over eight months. We will release the dataset publicly. With 9,123 Chinese jokes and 39,977 sentences in total, and with fine-grained annotation of humor, it provides a new, sizeable, and scarce joke resource, which is particularly valuable for languages other than English and for scholars in many disciplines, such as computational, linguistic, and cognitive studies.