Supporting Comedy Writers: Predicting Audience’s Response from Sketch Comedy and Crosstalk Scripts

Sketch comedy and crosstalk are two popular types of comedy. They can relieve people's stress and thus benefit their mental health, especially when the performances and scripts are of high quality. However, writing a script is time-consuming and high quality is difficult to achieve. In order to minimise the time and effort needed to produce an excellent script, we explore ways of predicting the audience's response from comedy scripts. For this task, we present a corpus of annotated scripts from popular television entertainment programmes of recent years. The annotations include a) text classification labels, indicating which actors' lines made the studio audience laugh; and b) information extraction labels, i.e. the text spans that made the audience laugh immediately after the performers said them. The corpus will also be useful for dialogue systems and discourse analysis, since our annotations cover entire scripts. In addition, we evaluate several baseline algorithms. Experimental results demonstrate that BERT models achieve the best predictions among all the baseline methods. Furthermore, we conduct an error analysis and investigate predictions across scripts with different styles.


Introduction
Comedy plays a major role in people's lives in that it relieves stress and anxiety (Williams et al., 2005; Sarıtaş et al., 2019). There are two popular types of comedy: sketch comedy and crosstalk. A sketch comedy usually presents a short story and is performed by multiple comedians in various short scenes, while in a crosstalk performance, which is similar to a talk show, there are usually two performers telling humorous stories behind a desk. Although these two types of comedy are different, both of them are performed based on scripts. A script breaks down a story into pieces along with the details that describe which performer should take what action or say which lines at a specific point (Blake, 2014). Therefore, the quality of the script is critical and it directly influences whether the audience enjoys the performance.

[*] The research was conducted during non-working time. The idea of this research was inspired by a discussion with my friend about an entertainment TV programme in which the comedians mentioned the difficulties of producing a high-quality script.
[1] The corpus and source code can be freely downloaded from https://github.com/createmomo/supporting-comedy-writers
However, it is difficult for script writers to ensure a high-quality comedy script while remaining productive. Firstly, writers have to assess whether audiences will react as expected, in particular laughing at specific points. This requires rehearsing multiple times to continuously improve the script, which is time-consuming and can be costly. Secondly, to develop laughter triggers, writers need to identify points in the script where performers could use funny body moves or tones, or tell amusing stories, to make the audience laugh. Thirdly, the more times a script is publicly performed, the less laughter it brings, since the audience becomes too familiar with it. Thus, it is essential for comedy writers to constantly explore new laughter triggers.
Since natural language processing (NLP) has been widely and successfully applied in many fields (Carrera-Ruvalcaba et al., 2019; Rao and McMahan, 2019), we investigate how NLP methods can support comedy writers in producing high-quality scripts more efficiently. This paper frames this challenge as a new task: the prediction of the audience's response to sketch comedy and crosstalk scripts. To address this challenge, we explore the use of two different NLP methodologies: 1) Text Classification: we predict whether or not an actor's lines can make audiences laugh; in other words, we formulate the prediction task as a binary text classification problem. 2) Information Extraction: we predict the text spans within an actor's lines that indicate the specific words triggering the audience's laughter.

Table 1: Annotation examples. In the first column, we highlight the text spans that trigger laughs from the audience. Note that we also collected the performers' moves (e.g., "duly closes his eyes" in the third example).
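To make the two formulations concrete, the following minimal sketch shows one possible way to represent an annotated line in Python: a binary laughter label for the classification task, and character-offset trigger spans that can be converted to BIO tags for the extraction task. The schema, field names and example sentence are illustrative assumptions, not the corpus's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AnnotatedLine:
    """One actor's line with both annotation types (illustrative schema)."""
    speaker: str
    text: str
    laugh: int  # classification label: 1 if the line made the audience laugh
    trigger_spans: List[Tuple[int, int]] = field(default_factory=list)  # char offsets

    def bio_tags(self) -> List[str]:
        """Per-character BIO tags for the span-extraction formulation."""
        tags = ["O"] * len(self.text)
        for start, end in self.trigger_spans:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return tags

# Toy example (not from the corpus): the span "incredibly big" triggers laughter.
line = AnnotatedLine(speaker="A",
                     text="The house looked incredibly big!",
                     laugh=1,
                     trigger_spans=[(17, 31)])
assert "".join(line.bio_tags()[17:31]) == "B" + "I" * 13
```

Per-character tags fit the corpus naturally, since the annotations and laughter rates below are measured at the character level for the Chinese scripts.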
Contributions Firstly, we introduce a Chinese corpus of annotated comedy scripts collected from popular TV entertainment programmes. Our annotations include both text classification and information extraction labels; Tables 1 and 2 present annotation examples. The corpus can be used to build intelligent systems that support script writing for comedy writers, and may also be useful for dialogue system research and discourse analysis. Secondly, we evaluate a number of NLP methods, and the results demonstrate that BERT models (Devlin et al., 2019) achieve the best prediction performance among all methods. We also conduct an error analysis which may be useful for further improving the performance. Lastly, we experimentally show that our corpus can be used to predict laughter triggers for scripts whose styles are very different from those in the training data.

Related Work
Our work is closely related to humour detection, which has been widely studied in natural language processing for many years. Mihalcea and Strapparava (2006) pioneered computational humour recognition, framing it as a text classification problem over short texts, while a later study (2019) proposed a regression task that predicts the humour score of a tweet. Li et al. (2020) collected Chinese Internet slang expressions and combined them with a humour detection method to analyse the sentiment of Weibo posts. It should be noted that the examples in all of the corpora used or constructed in the above-mentioned studies are independent of each other. Since our corpus is based on entire scripts, the annotated lines and text spans may also benefit researchers interested in modelling long-context-aware algorithms for understanding humour. Apart from the studies on short text fragments, Bertero (2019) and Hasan et al. (2019) created corpora from television (TV) sitcoms such as The Big Bang Theory and from TED talks respectively; their goal is to predict whether or not a sequence of text will trigger immediate laughter. Yang et al. (2015) and Zhang et al. (2019) extracted key words such as sing, sign language and pretty handy from jokes, which are similar to our information extraction annotations.

Corpus

Data Collection
Source Selection In order to ensure the high quality of the scripts, we carefully selected thirty performances (with a total duration of approximately 473 minutes), including both sketch comedies and crosstalks, whose leading roles are famous Chinese comedians. These performances were broadcast on well-known Chinese TV entertainment programmes such as the Chinese New Year Gala and Ace VS Ace[7]. Since a large studio audience was present at the recording of these performances, the annotators can judge whether the audience laughed based on the performance videos. Please refer to the appendix for the full list of performances, which gives details of their titles, leading comedians and sources. Lastly, we manually typed up the actors' lines for each performance and completed thirty scripts. Although there may be differences between our scripts and the real scripts used by the comedians in terms of format or content, we assume that our scripts contain the key information of the real scripts, i.e., the actors' lines. The corpus can therefore be useful for the development of intelligent assistance systems for comedy script writing.

[7] https://es.wikipedia.org/wiki/Ace_vs_Ace
Diversity We also took comedy style into consideration. In order to ensure diversity and balance: a) The performances were selected from three main types of sources[8], as shown in Table 3 together with the topic descriptions of the selected performances. It can be observed that the corpus covers a wide range of topics. b) As a preliminary study, we selected six popular Chinese comedians who have varied and distinctive styles, and we chose five representative performances of each comedian. Table 4 presents the corpus statistics and Figure 1 shows the laughter rates of each script. The highest line-level and character-level rates are 45.39% and 13.12%, while the lowest are 16.03% and 3.49%. We note that the character-level laughter rates vary across scripts, which may be due to the density of laughter triggers within a line or to the topic of the script.

[8] The three sources are: Chinese New Year Galas, the annual televised Chinese New Year celebrations which are the most viewed TV shows in China and consist of various performances including sketch comedies and crosstalks; Reality Shows, programmes that show the unscripted actions of participants such as playing games and talking (we selected the shows in which comedians were involved); and Comedy Competition Shows, programmes where different comedians present their comedy performances to a studio audience and the winners are selected based on the audience's votes.

Table 3: Sources and topics of the selected performances.

Chinese New Year Galas:
- Love stories and blind dates between old people;
- Reflecting social phenomena to call for a better society (e.g. avoid judging people by their appearances, do not spoil children, care more about lonely seniors, a woman building a good relationship with her mother-in-law, spend more time with children, be wary of scams);
- Funny family stories during Spring Festival.

Reality Shows:
- Stories that happened in ancient times;
- Stories about young people (e.g. encountering ex-boyfriends or ex-girlfriends, relationships between best friends, blind dates);
- Reflecting social phenomena to call for a better society (e.g. give seats to vulnerable people).

Comedy Competition Shows:
- Love stories;
- Hot topics (e.g. supporting the COVID-19 frontline fighters);
- Funny stories that happened among friends and in families;
- Reflecting social phenomena to call for a better society (e.g. be wary of scams, care more about orphans in orphanages).

Table 4: Corpus statistics. "# of Actors' Lines" and "# of Characters" correspond to the total number of lines and characters in our corpus respectively; "Laughter Rate" is the rate of lines/characters that trigger laughter.
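To make the two laughter rates in Table 4 precise, the following minimal sketch computes them from span-annotated lines. The tuple-based schema is an illustrative assumption, and the character-level rate is computed here as the fraction of characters falling inside trigger spans.

```python
from typing import List, Tuple

# Each line: (text, laugh_label, trigger_spans); this schema is illustrative.
Line = Tuple[str, int, List[Tuple[int, int]]]

def laughter_rates(script: List[Line]) -> Tuple[float, float]:
    """Return (line-level rate, character-level rate) for one script."""
    laugh_lines = sum(label for _, label, _ in script)
    total_chars = sum(len(text) for text, _, _ in script)
    trigger_chars = sum(end - start
                        for _, _, spans in script
                        for start, end in spans)
    return laugh_lines / len(script), trigger_chars / total_chars

# Toy two-line script: one laughter-triggering line with a 14-character span.
toy = [("The house looked incredibly big!", 1, [(17, 31)]),
       ("Let me show you around.", 0, [])]
line_rate, char_rate = laughter_rates(toy)
print(f"line-level: {line_rate:.2%}, character-level: {char_rate:.2%}")
```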

Annotation
The annotation was completed on the Doccano platform (Nakayama et al., 2018) by two native Chinese speakers. The annotations were produced based on the studio audience's responses as observed in the videos, not on the annotators' own responses.
Annotation Instruction Annotating text classification labels is straightforward: annotators are requested to simply assign label 1 to the lines that make the audience laugh, and 0 to the others. With regard to the information extraction annotations, annotators are requested to identify text spans, which are usually phrases. A span consists of the words that made the audience laugh immediately after the comedians said them. For example, as indicated in Table 2, the span incredibly big was annotated. In this case, annotating only big would be considered incorrect, because the comedian was using incredibly to strongly emphasise big, which was her first impression of a man's house on a blind date. Annotating only incredibly would also be incorrect, because the main reason the audience laughed was that the comedian said the house looked big.[9]

[9] The house is actually small. Since there is almost no furniture in the house, the comedian said it looked big.

Annotation Process The annotation process was as follows. Firstly, the annotators discussed conflicting annotations after several attempts to annotate the same three scripts. Secondly, once agreement on how to resolve the conflicts had been reached, they started to annotate their assigned scripts. Afterwards, since information extraction annotation is more complex than classification annotation, we measured its quality by computing three types of inter-annotator agreement: we asked the annotators to annotate the same six scripts having different styles and then calculated the Overall Percent Agreement (OPA), Fleiss's kappa (Fleiss, 1971) and Randolph's kappa (Randolph, 2005). We found that the agreement rates were high (OPA 98.09%, Fleiss's kappa 0.85, Randolph's kappa 0.96). This is due to the fact that the discussions about resolving conflicts were in-depth and the laughter triggers were usually clear in the lines.
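As a sketch of how such agreement numbers can be computed, the snippet below treats each character as an item labelled by the two annotators as inside (1) or outside (0) a trigger span, and derives OPA plus the two kappas with statsmodels. The toy ratings matrix is invented for illustration; this is not the authors' exact evaluation code.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy item-by-rater matrix: rows are characters, columns are the two annotators;
# a value of 1 means the character lies inside an annotated trigger span.
ratings = np.array([
    [1, 1], [1, 1], [0, 0], [0, 1], [0, 0],
    [1, 1], [0, 0], [0, 0], [1, 1], [0, 0],
])

# Overall Percent Agreement: fraction of items on which the annotators agree.
opa = (ratings[:, 0] == ratings[:, 1]).mean()

# Convert to an item-by-category count table, then compute both kappas.
table, _ = aggregate_raters(ratings)
print(f"OPA: {opa:.2%}")
print(f"Fleiss's kappa:   {fleiss_kappa(table, method='fleiss'):.2f}")
print(f"Randolph's kappa: {fleiss_kappa(table, method='randolph'):.2f}")
```

Randolph's kappa assumes a uniform chance distribution over categories, which is why it can be higher than Fleiss's kappa when one label (here "outside a span") dominates.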

Baselines and Results Discussion
In order to understand how well machine learning methods work on our corpus, we evaluate the performance of a number of models on 5-fold cross-validation random splits of the scripts in our corpus, and report the average results in this section.[10] All the BERT models were pre-trained using a mixture of large Chinese corpora.[11] Please refer to the appendix for the results of each fold, the statistics of the splits, the computing infrastructure, each model's running time, parameter details and hyper-parameter settings.

[10] Model implementations were adapted from https://github.com/649453932/Chinese-Text-Classification-Pytorch, https://github.com/luopeixiang/named_entity_recognition and Zhao et al. (2019).
[11] More details are listed in https://github.com/dbiir/UER-py/wiki/Modelzoo.
Baselines Tables 5 and 6 respectively present the results of text classification and information extraction. BERT-base achieves the best F1-scores among all the methods. We also note that the classification recall of RCNN (Lai et al., 2015) is much higher than that of the other methods; we therefore suggest this model if users prefer a classifier with high recall.
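As a starting point for reproducing the classification baseline, the sketch below fine-tunes a Chinese BERT on the binary line-level task using the Hugging Face transformers library. This is a stand-in under stated assumptions: our experiments used UER-py pre-trained checkpoints and the implementations listed in footnote 10, while the bert-base-chinese checkpoint, toy data and training loop here are illustrative only.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizerFast

# Stand-in checkpoint; the paper's experiments used UER-py pre-trained models.
MODEL = "bert-base-chinese"
tokenizer = BertTokenizerFast.from_pretrained(MODEL)
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy training pairs: (actor's line, laughter label); real data comes from scripts.
train_pairs = [("这房子看起来特别大！", 1), ("我带你参观一下。", 0)]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # illustrative number of epochs
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy loss when labels are provided
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```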
In addition, we observe that the scores are not high, especially for the information extraction task. The reason may be that audience laughter depends heavily on the conversational context, which was not considered by the baselines. Taking a longer conversational context of a line into consideration is therefore a worthwhile research direction. Tables 7 and 8 present examples of incorrect predictions. We further investigate whether laughter triggers can be predicted for scripts whose styles differ from the styles in the training data. Firstly, since the six comedians in the corpus have distinctive comedy styles, we split the entire corpus in a 6-fold cross-validation manner, where the comedies in each fold are performed by the same leading comedian. Secondly, we train the baseline models on five of the folds and evaluate performance on the remaining fold. Tables 9 and 10 present the average results; the full results are available in the appendix. The results demonstrate that laughter triggers can be detected even when the styles in the training data are very different from those in the test data.
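This cross-style evaluation corresponds to a grouped, leave-one-comedian-out split; a minimal sketch with scikit-learn is shown below, where the script identifiers and comedian labels are hypothetical placeholders.

```python
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical script ids and the leading comedian of each performance
# (the real corpus has five scripts per comedian; truncated here for brevity).
scripts = ["s01", "s02", "s03", "s04", "s05", "s06"]
comedians = ["A", "A", "B", "B", "C", "C"]

# Each fold holds out every script of one comedian, so the test style is unseen.
for train_idx, test_idx in LeaveOneGroupOut().split(scripts, groups=comedians):
    held_out = comedians[test_idx[0]]
    print(f"test comedian {held_out}:", [scripts[i] for i in test_idx])
    # ...train the baselines on the remaining scripts and evaluate here...
```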

Conclusion and Future Work
We study the prediction of laughter triggers from comedy scripts using text classification and information extraction methods. Firstly, we introduce a corpus of high-quality, annotated sketch comedy and crosstalk scripts. Secondly, we evaluate a number of baselines and find that BERT models achieve the best performance. We note that the information extraction performance was very low, indicating that this task is particularly challenging. We also conduct an error analysis of incorrect predictions; the errors suggest that incorporating rich context information may further improve performance, so it is worth investigating a model which can take such information into consideration. Furthermore, it is also worth extending the corpus to a multimodal one by aligning scripts with the corresponding audio or video, because certain intonations or scenes can also make audiences laugh; a multimodal corpus could also benefit the creation of silent comedy. Enriching the corpus with scripts in other languages is another potential direction. Lastly, the encouraging cross-style prediction performance shows the usefulness of our corpus for predicting new scripts with different styles. It would also be interesting to explore human performance by asking annotators to make predictions based purely on the scripts of unwatched comedies, and to investigate whether script writers find the model predictions insightful.
We hope this study will benefit script writing by inspiring the community to develop intelligent systems for comedy writers and other artists in the field. The corpus might also be useful for researchers who are working on related or similar tasks, such as discourse analysis and humorous response generation for dialogue systems.

A Appendices
A.1 Computing Resources
Table 11 describes the details of the computing resources used for all of our experiments. These resources are freely available from Paperspace.

A.2 Model Details
Below we present the model hyper-parameter values and the average running time of one epoch.

BERT Models We use the same hyper-parameter settings as in the text classification models, with the exception of Batch Size = 16. The average running times of BERT-tiny, BERT-small and BERT-base for information extraction are 34.00s, 55.73s and 120s respectively.

Table 17: Performance of information extraction in predicting the scripts performed by specific leading comedians.