Stylistic Variation in Television Dialogue for Natural Language Generation

Conversation is a critical component of storytelling, where key information is often revealed by what a character says and how s/he says it. We focus on the issue of character voice and build stylistic models with linguistic features related to natural language generation decisions. Using a dialogue corpus of the television series The Big Bang Theory, we apply content analysis to extract relevant linguistic features and build character-based stylistic models, and we test the model fit through a user perceptual experiment on Amazon's Mechanical Turk. The results are encouraging: human subjects tend to perceive the generated utterances as more similar to the character they are modeled on than to another, randomly chosen character.


Introduction
Conversation is an essential component of social behavior, one of the primary means by which humans express emotions, moods, attitudes, and personality. Conversation is also critical to storytelling, where key information is often revealed by what a character says, how s/he says it, and how s/he reacts to what other characters say. Here we focus on the issue of character voice. One way to produce believable, dramatic dialogue is to build stylistic models with linguistic features related to natural language generation (NLG) decisions. Television dialogue exemplifies many different linguistic styles designed to express dramatic characters. We therefore construct a corpus of television character dialogue from The Big Bang Theory (BBT) and apply content analysis and language modeling techniques to extract relevant linguistic features and build character-based stylistic models. We then evaluate the fit of these character models through a generation experiment that measures user perceptions of the characters.
Our work can be applied to storytelling applications such as video games, interactive narrative, chatbots, or education systems where dialogue with personalities may improve user experience.

Research from corpus linguistics includes Bednarek's work using Gilmore Girls to compare the dramedy genre to other genres (Bednarek, 2011a), and Quaglio's work comparing Friends with unscripted conversations (Quaglio, 2009). Other related research focuses on characterization through dialogue. For example, Bubel explored the friendship among characters in Sex and the City (Bubel, 2005), and Bednarek analyzed shifts in the linguistic style of characters from Gilmore Girls (Bednarek, 2011b) and The Big Bang Theory (Bednarek, 2012).
Research in computational stylistics (or stylometry) focuses on the use of quantitative methods to study writing style and characterize authors, with applications ranging from classical literary texts to modern forensic texts and online reviews, to name a few (Stamatatos, 2009). Principal component analysis has been used to analyze variation in word usage, focusing on the challenge of relating features to meanings in text, a relation that is not fixed but depends on context (Schreibman et al., 2008).
There is an extensive amount of research on story generation (narrative content), which tends to focus on plots and character development to achieve narrative goals. One approach to creating stories collects detailed event descriptions from crowd participants, covering characters' intentions, facial expressions, and actions (Li et al., 2014). The authors also used the Google N-Gram Corpus and Project Gutenberg to help select different types of sentences (most/least probable, most fictional, most interesting details) and different sentiments (most positive/negative). Our work is also related to character modeling from film dialogue for NLG, except that we focus on TV series because they offer more dialogue.
Despite overlaps, our work differs in that we: 1) extract linguistic stylistic features based on personality studies from psychology; 2) focus on features that can be generated given our current system; 3) find significant features and use them as building blocks; 4) create models using techniques such as standard scores and classification; and 5) apply the models to applications such as natural language generation.

Natural Language Generation Engine
PYPER (Bowden et al., 2016) is a spin-off implementation of PERSONAGE (Mairesse and Walker, 2007) in Python that provides new controls for expressive NLG. It is currently part of the M2D (Monolog-to-Dialogue) generation framework (Bowden et al., 2016), whose architecture we briefly describe below (Figure 1).
The EST framework (Rishes et al., 2013) produces a story annotated with SCHEHERAZADE (Elson and McKeown, 2009) as a list of sentences represented as Deep Syntactic Structures (DSyntS). A DSyntS is a dependency-tree structure whose nodes contain lexical information about words; it is the input format for the surface realizer RealPro (Lavoie and Rambow, 1997). M2D converts the story (a list of DSyntS) into two-speaker dialogue by accepting input parameters that control the allocation of content, pragmatic markers, etc.
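To make the data structure concrete, here is a minimal, hypothetical sketch of a DSyntS-style dependency tree in Python. The class and attribute names are ours for illustration only; RealPro's actual DSyntS format is an XML representation with its own attribute inventory.

```python
from dataclasses import dataclass, field

@dataclass
class DSyntSNode:
    """One node of a Deep Syntactic Structure: a lexeme plus grammatical attributes."""
    lexeme: str
    attrs: dict = field(default_factory=dict)     # e.g. {"class": "verb", "tense": "past"}
    children: list = field(default_factory=list)  # dependents (subject, object, modifiers)

# "The crow sat on the branch" as a tiny dependency tree
sit = DSyntSNode(
    "sit", {"class": "verb", "tense": "past"},
    children=[
        DSyntSNode("crow", {"class": "common_noun", "article": "def"}),
        DSyntSNode("on", {"class": "preposition"},
                   children=[DSyntSNode("branch", {"class": "common_noun", "article": "def"})]),
    ])

def lexemes(node):
    """Pre-order traversal of the lexemes in the structure."""
    return [node.lexeme] + [w for c in node.children for w in lexemes(c)]

print(lexemes(sit))  # → ['sit', 'crow', 'on', 'branch']
```

A generator that manipulates DSyntS (as PYPER does) edits nodes and subtrees of such a structure before handing it to the realizer.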

Corpus
We parsed fan-transcribed BBT scripts, seasons 1-4 and part of season 5, to obtain scenes, speakers, and utterances. The series centers on 5 characters: 4 of them (all male) are scientists/engineers working at Caltech, and 1 (Penny) is a waitress. The comedy's theme rests on the contrast between the male characters' geekiness and Penny's social skills. Two additional female characters, both scientists, were introduced as love interests for two of the main male characters and have since become main characters themselves.

Stylistic Features Extraction
After extracting dialogic utterances from transcripts, we extract features reflecting particular linguistic behaviors for each character. Table 1 describes major feature sets, which include sentiment polarity, dialogue act, passive voice, word categories from LIWC (Pennebaker et al., 2001), tag questions, etc.
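As an illustration of one such feature, the sketch below estimates a character's tag-question rate from a few surface patterns. The patterns are our own illustrative approximation, not the paper's actual extraction rules.

```python
import re

# Illustrative surface patterns for tag questions (our approximation,
# not the paper's actual extraction rules).
TAG_QUESTION = re.compile(
    r",\s*(isn't|aren't|don't|doesn't|didn't|won't|can't|right|huh|okay)"
    r"(\s+(it|you|we|they|he|she|i))?\s*\??\s*$",
    re.IGNORECASE)

def tag_question_rate(utterances):
    """Fraction of a character's utterances that end in a tag question."""
    if not utterances:
        return 0.0
    hits = sum(1 for u in utterances if TAG_QUESTION.search(u.strip()))
    return hits / len(utterances)

sample = ["That's so weird, right?", "You're kidding, huh?", "I love this place."]
print(tag_question_rate(sample))  # 2 of 3 utterances end in a tag
```

Each feature set in Table 1 ultimately reduces to per-character rates of this kind, which the next section standardizes across characters.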

Character Stylistic Models
We calculate a standard score (z-value) for each feature to measure differences among the main characters: Leonard, Sheldon, Penny, Howard, Raj, Bernadette, and Amy. A more robust measure might be warranted given the small population and the normality assumption, but on review the z-scores capture enough of the relative differences among characters. Character models are composed of significant features with |z| ≥ 1. While using only features with |z| ≥ 2 might be a stricter choice, our NLG engine can manipulate many of the features at the |z| ≥ 1 level.
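The standardization step can be sketched as follows: for each feature, a character's usage rate is converted to a z-score against the mean and (population) standard deviation over all characters, and only features with |z| above the threshold enter that character's model. The toy rates below are invented for illustration.

```python
from statistics import mean, pstdev

def significant_features(rates, threshold=1.0):
    """Build character models from per-character feature rates.

    rates: {feature: {character: usage rate}}
    Returns {character: {feature: z}} keeping only features with |z| >= threshold,
    where z standardizes a character's rate against all characters' rates.
    """
    models = {}
    for feature, by_char in rates.items():
        values = list(by_char.values())
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # feature shows no variation across characters
        for char, rate in by_char.items():
            z = (rate - mu) / sigma
            if abs(z) >= threshold:
                models.setdefault(char, {})[feature] = z
    return models

# Invented toy rates for one feature across five characters
rates = {"tag_question": {"Penny": 0.08, "Sheldon": 0.01, "Leonard": 0.03,
                          "Howard": 0.02, "Raj": 0.02}}
print(significant_features(rates))  # only Penny's rate stands out with |z| >= 1
```

Raising `threshold` to 2.0 reproduces the stricter |z| ≥ 2 cut discussed below.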
The number of significant features for each character, with examples, is shown in Table 2. For |z| ≥ 1, Sheldon, Penny, Bernadette, and Amy each have over 200 significant features; Sheldon, more specifically, has close to 400. When we narrow the threshold to |z| ≥ 2, the number of significant features for Bernadette and Amy decreases by over 85%, for Leonard, Penny, Howard, and Raj by 70%, and for Sheldon by 54%.

Generating Expressive Utterances
The workflow for generation is to 1) annotate stories using SCHEHERAZADE; 2) use EST to automatically translate the annotated stories into deep syntactic structures (DSyntS); 3) have PYPER read and manipulate the DSyntS to add expressive elements; and 4) send the "expressive" DSyntS to RealPro (Lavoie and Rambow, 1997), a sentence realizer, for generation. We focus on step 3, where we use our learned character stylistic models to add expressive elements to generic sentences.

Table 1: Major feature sets.
2. Sentiment Polarity. Overall polarity, polarity of sentences, etc., using SENTIWORDNET to calculate positive, negative, and neutral scores.
3. Dialogue Act. A Naive Bayes classifier trained on the NPS Chat Corpus' 15 dialogue act types using simple features; we also determine the "First Dialogue Act", the dialogue act of the first sentence of each turn.
4. Merge Ratio. Regular expressions detect the merging of the subject and verb of two propositions.
5. Passive Voice. Third-party software (see text) detects passive sentences.
6. Concession Polarity. Look for concession cues, then calculate the polarity of the concession portion.
7. LIWC Categories. Word categories from the Linguistic Inquiry and Word Count (LIWC) text analysis software.
8. Markers - PERSONAGE. Words used in PERSONAGE for generation, selected based on psychological studies identifying pragmatic markers of personality that affect the utterance.
9. Tag Questions.
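The Dialogue Act feature in Table 1 relies on a Naive Bayes classifier over simple word features. A tiny self-contained sketch of that idea is shown below, with a toy training set standing in for the NPS Chat Corpus and its 15 dialogue-act types.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesDA:
    """Tiny multinomial Naive Bayes over bag-of-words features, illustrating
    the kind of dialogue-act classifier described in Table 1 (toy data only)."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of samples
        for text, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        words = text.lower().split()
        n = sum(self.label_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.label_counts[label] / n)  # log prior
            for w in words:  # add-one (Laplace) smoothed log likelihoods
                score += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return score

        return max(self.label_counts, key=log_posterior)

# Toy stand-in for NPS Chat dialogue-act data
train = [
    ("what is that", "whQuestion"), ("where are you", "whQuestion"),
    ("yes i agree", "yAnswer"), ("yes that is right", "yAnswer"),
    ("hello there", "Greet"), ("hi everyone", "Greet"),
]
clf = NaiveBayesDA().fit(train)
print(clf.predict("what are you doing"))  # → whQuestion
```

In the paper's setting, the classifier is trained on the NPS Chat Corpus, and the dialogue act of the first sentence of each turn is additionally recorded as the "First Dialogue Act" feature.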

Mapping Stylistic Features to NLG Decisions
The rewritten and better-controlled PYPER allows a more useful mapping from character models to NLG decisions. For example, hedge insertion patterns are kept in a library to which new entries can easily be added. As an example, a partial mapping for LIWC categories is shown in Table 3. When multiple features map to the same PYPER parameter, we calculate a weighted average of the features.
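The weighted-average combination can be sketched as below. The feature names, weights, and parameter name are hypothetical, since the actual mapping tables live inside PYPER.

```python
def parameter_value(feature_zs, mapping):
    """Combine several model features into one generation-parameter value.

    feature_zs: {feature: z} from a character model.
    mapping: {feature: weight} -- the features feeding one PYPER parameter
             (names and weights here are illustrative, not PYPER's actual tables).
    Returns the weighted average of the mapped features' z-scores.
    """
    total_w = sum(w for f, w in mapping.items() if f in feature_zs)
    if total_w == 0:
        return 0.0  # none of the mapped features is significant for this character
    return sum(feature_zs[f] * w for f, w in mapping.items() if f in feature_zs) / total_w

# Hypothetical character model and mapping for a hedge-related parameter
sheldon = {"liwc_insight": 1.8, "liwc_cogmech": 1.2}
hedge_mapping = {"liwc_insight": 0.5, "liwc_cogmech": 0.5, "liwc_tentative": 1.0}
print(parameter_value(sheldon, hedge_mapping))  # (1.8*0.5 + 1.2*0.5) / 1.0 = 1.5
```

Features absent from a character's model simply drop out of the average, so a parameter is driven only by that character's significant features.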

Narrative Content
Our narrative content comes from fables and blog stories: 1 fable (The Fox and the Crow) and 6 blog stories about a garden, a protest, a squirrel, a bug, an employer, and a storm (Gordon et al., 2007). We use The Fox and the Crow as an example to describe our process, shown in Figure 2. Some phrases are highlighted to show how they were annotated and translated; many complicated sentences have been broken down into shorter ones. Note that some additional descriptions (adjectives) were added to give PYPER enough search space to exercise its expressive parameters, so that the characters' personalities come through in different variations of the story. The final, expressive version of the story shows different stylistic features, such as converting a statement to a question and adding expressions inspired by the characters' dialogue, such as Typical.

Evaluation with User Perceptual Experiment
We used Mechanical Turk to get user feedback on the generated dialogue. The dialogue output by PYPER was post-processed to remove typos and minor grammatical issues. Referring to the MTurk survey (one HIT) in Figure 3, we first show some information about the character of interest (Sheldon, in this case), followed by two sets of dialogue: one modeled on Sheldon and the other on a different, random character. The worker does not know which one was modeled on Sheldon. S/he was asked to pick the dialogue that sounded most similar to Sheldon, along with providing reasons. Referring to the results in Table 5, we used three participants per pair of compared characters per story. The character in the left-most column is the modeled character, and the compared-to character was the "other random character" in the survey. Each circle indicates a worker's choice: a filled circle (•) means the worker matched the generated dialogue to the intended character; otherwise an empty circle (◦) is shown.
The probability that at least two out of three participants agree on the right character is > 50% (Table 4), while the probability that all three agree on the right character is 31.3%, which is higher than chance (12.5%). Overall, the 7 characters over all 7 stories were recognized about 65.5% of the time (out of 882 ratings). Per character, over all 7 stories, Penny was recognized the most, 82.5% of the time, followed by Leonard (78.6%), Bernadette (66.7%), Amy and Sheldon (both 61.9%), Howard (57.9%), and finally Raj, who was recognized the least, 49.2% of the time.
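The chance baselines can be checked with a short binomial computation: with two dialogues per HIT, a random guesser is right with p = 0.5, so all three raters agree on the right character by chance 0.5³ = 12.5% of the time, and at least two of three are right by chance 50% of the time.

```python
from math import comb

def p_at_least_k(n, k, p):
    """P(at least k successes in n independent Bernoulli(p) trials)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance baselines for three raters guessing between two dialogues (p = 0.5):
print(p_at_least_k(3, 3, 0.5))  # all three right by chance: 0.125
print(p_at_least_k(3, 2, 0.5))  # at least two right by chance: 0.5
```

The observed 31.3% unanimous-agreement rate is thus well above the 12.5% chance level.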

Character Analysis
In this section we provide some insights into the results, together with participants' comments about their perception of the characters through the generated dialogue.
Figure 2: The Fox and the Crow at each stage of the generation pipeline.

Original fable: A Crow was sitting on a branch of a tree with a piece of cheese in her beak when a Fox observed her and set his wits to work to discover some way of getting the cheese.

SCHEHERAZADE & EST: The crow sat on the tall tree's branch. The crow has elegant talons. The crow has a good beak. The crow has ugly wings. The crow has bad eyes. The delicious cheese was in the crow's beak. The hungry fox observed the crow. The fox has good claws. The fox has bad teeth. The fox tried to discover for the fox to get the cheese.

PYPER: Monologue to Dialogue
Speaker 1: The crow sat on the tall tree's branch. The crow has elegant talons, and a good beak. The crow has ugly wings, and bad eyes.
Speaker 2: The delicious cheese was in the crow's beak. The hungry fox observed the crow. The fox has good claws, but bad teeth.
Speaker 1: The fox tried to discover for the fox to get the cheese. The fox came and stood under the tall tree. The fox looked toward the crow.
Speaker 2: The fox said the fox saw the crow. The fox said the crow's beauty was incomparable. The fox said the crow's plumage's hue was exquisite.

PYPER: M2D + Stylistic Parameters
Speaker 1: She sat on the tall tree's branch. She has elegant talons, and a good beak. She has ugly wings, and bad eyes.
Speaker 2: He has good claws, but bad teeth.
Speaker 1: What was in her beak?
Speaker 2: …

We found the following pairs of characters most distinguishable from each other: …-Leonard, and Sheldon-Howard/Leonard. Note that the comparison is not symmetrical, because in the survey we gave a "known" (reference) character, which is the first column in the table.
It is not surprising to see Penny being different from most of the male characters, as it is the premise of BBT. Raj is an exception, mainly due to his lack of (expressive) dialogue, though he is definitely different from Sheldon. It is also believable that Leonard is similar to many characters, as he is the most "normal" character out of the group.
We further explore a few characters below.

Perception of Penny in comparison to Leonard (most distinguishable)
Penny is one of the best expressed characters in the experiment, missing only one selection in comparison to Leonard (95.2%) and two in comparison to Sheldon (90.5%). Here we take a look at the comparison with Leonard, where 20 (out of 21) Penny-modeled generated dialogues were rated more similar to Penny, and only 1 (out of 21) Leonard-modeled generated dialogues was rated more similar to Penny. Overall, participants' perception of the Penny-modeled generated dialogue seems to agree with Penny's personality, capturing her "bubbly cheerfulness", as one worker put it. Some notable descriptions include:
-talkative, randomness, random pauses, better wording, more personality
-seek feedback from others, lots of questions, not always sure of what she's saying, hesitation
-good mix of colloquialisms and Penny-like filler, some brief, fairly simple statements
-stand-out word choices: magic, huh?, mhmm, let's see, that..., the crow needed what?, oh gosh, I mean, damn yeah
Participants perceived the Leonard-modeled generated dialogue as not suitable for Penny, mostly because of his bland language. Here are some notable descriptions:
-too simple, monotone, boring, direct, bare, straightforward, matter-of-fact, boxy, bland, not enough questioning for Penny
-too much adverb usage on precision or intellect for Penny
-not like Penny to use complex words and phrases
-not like Penny to use: technically, darn
-too rude for her to use, since she wants people to like her: everybody knows that, obviously
The MTurk worker behind the one missed selection cited Penny being a very simple speaker, implying that her dialogue would contain brief and simple statements. While this is true, she also uses quite a few fillers and questions around her "simple" dialogue to sound chatty.

Perception of Penny in comparison to Bernadette (least distinguishable)
It is not surprising to see Penny being least distinguishable from Bernadette (57.1%). Bernadette was introduced in the series as Penny's friend and coworker, also working as a waitress. Her role on the show seemed more similar to Penny's (friendly and sociable) than to everyone else's (nerdy and socially awkward), even though she eventually became a scientist.
While the Bernadette model contains chatty word choices (similar to Penny's), it also contains "intellect" word choices. However, due to the randomness of the generated dialogue, where not all features are expressed/activated, some dialogues/stories might not show enough of her nerdy side. For example, precise adverbs such as essentially and particularly are more likely to be used by a scientist/engineer (Bernadette) but not by Penny.
In terms of stories, Bug and Garden did the best at distinguishing the character pair, while Employer and Storm did the worst (none of the Penny-modeled dialogues sounded like Penny).

Perception of Sheldon in comparison to Penny (most distinguishable)
Since Sheldon differs the most from Penny (85.7%), we focus on comments by participants who confused the two characters. It turns out that certain phrases intended for Penny were perceived as "arrogant" when spoken by Sheldon. Here are the actual comments by participants:
-"mmhm..." I can picture coming from Sheldon in an irritated manner. "...you are kidding, right" would be said by Sheldon in an arrogant and condescending manner.
-"You might be interested in knowing..." sounds like an arrogant Sheldon line, followed by the "Oh God..." I can actually picture Sheldon saying this line.
-"You might be interested in knowing..." is used twice in Dialogue 2, and would be something Sheldon might say to make another person feel inferior.
Perception of Sheldon in comparison to Leonard (least distinguishable)
As roommates and colleagues at work, their similarity is understandable. Here are summarized comments by participants describing the dialogue:
-matter-of-fact, straightforward
-clear, unhesitant
-shorter, more direct sentences; to the point
-use "technically"
-do not use a long string of adjectives

Leonard
For Leonard, Penny is considered the most distinguishable. Even though Leonard is considered less nerdy than the other male characters, his language is still very different from Penny's.
Amy being the least distinguishable from Leonard is also believable. Amy, although her language closely resembles Sheldon's, is also interested in relationships and friendship (e.g., with Penny and Bernadette).
Here are some participants' comments on perceiving Amy's dialogue as Leonard's dialogue: -intelligently spoken but also have a natural tone to them -quick and to the point without over complicating things -"I mean..." sounds like Leonard in his somewhat whiny manner -Leonard sometimes smooths things over for Sheldon so he doesn't get upset. I think he would soften some things he says when he uses "I think" or "I mean" -intelligent yet normal way of speaking -both dialogue work okay really

Other Observations
Leonard and Penny represent the opposites-attract couple. The biggest differentiating factor is that Penny's dialogue is perceived as more emotional than Leonard's.
A general theme for Leonard's dialogue is that his speech pattern is "normal", implying that everyone else has more stylized dialogue. This is an interesting observation because Leonard is not "normal" relative to the general population; he is characterized as a typical nerd. Yet he is "normal" relative to his friends and therefore easier to identify in many cases.
According to Brooks and Hébert (2006), individuals' social identities are largely shaped by popular media: what it means to be white, black, male, female, heterosexual, homosexual, etc. Since characters are expressed through language, which is connected to a character's identity as an individual and as part of a community (Hurst, 2011), media such as television often provide the first (and sometimes the only) impression of certain groups of people.
In the context of BBT and the significant features we used to represent characters, it seems that Penny's language represents the typical female register identified by Lakoff (1973): hedging, emotional emphasis, adjectives, etc. This contrasts with the male characters as scientists, who tend to be more matter-of-fact.
Do scientists talk differently from the general population? Our results suggest "yes", in that Penny's language mostly contrasts with the male scientists' language. Such contrast is also reflected in the real world (e.g., the percentage of scientists versus the general U.S. population who believe in climate change).
What makes the show interesting is the "in-between" characters: the female scientists Amy and Bernadette.
The perception of the dialogue showed that the Penny-Bernadette and Leonard-Amy pairs shared some similar language. With the right intention and scripts, the media can help narrow the perception and narrative gap between scientists and the general public.

Conclusion and Future Work
We explored character voice in the TV show BBT by building stylistic models that relate linguistic features of character dialogue to natural language generation decisions. These models are then used by an expressive NLG engine to transform regular sentences into expressive versions. The generated, expressive dialogue is then used in a perceptual experiment to see how users perceive the expressed personalities. Our results were encouraging in that people were able to perceive differences among characters, though some better than others. For the characters that were hard to distinguish, participants' comments provided great insight into how to better express the extracted features through NLG.
One possible line of future work is to use people's blogs as a source for speaker-specific models. Another is to use the character models to drive the monologue-to-dialogue process that created the stories used in our experiment. For example, if a character sounds mostly negative, the process can try to allocate all negative sentences to that story character's dialogue.
We believe our work can be applied to storytelling applications such as video games, interactive narrative, chatbots, or education systems, where dialogue with personality may improve user experience in a more controllable way (than using a neural network for generation, for example).