Gender-Distinguishing Features in Film Dialogue

Film scripts provide a means of examining generalized western social perceptions of accepted human behavior. In particular, we focus on how dialogue in films describes gender, identifying linguistic and structural differences in speech for men and women and in same-gender and different-gender pairs. Using the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil et al., 2012a), we identify significant linguistic and structural features of dialogue that differentiate genders in conversation and analyze how those effects relate to existing literature on gender in film.


Introduction
Film characterizations often rely on archetypes as shorthand to conserve narrative space. This effect comes out strongly when examining gender representations in films: assumptions about stereotypical gender roles can help establish expectations for characters and tension. It is also worth examining whether the gendered behavior in film reflects known language differences across gender lines, such as women's tendency towards speaking less or more politely (Lakoff, 1973), or the phenomenon of "troubles talk," a ritual in which women build relationships through talking about frustrating experiences or problems in their lives (Jefferson, 1988) in contrast to a more male process of using language primarily as a means of retaining status and attention (Tannen, 1991). We look at a large sample of scripts from well-known films to try to better understand how features of conversation vary with character gender.
We begin by examining utterances made by individual characters across a film, focusing on the classification task of identifying whether a speaker is male or female. We hypothesize that in film, speech between the two gender classes differs significantly. We isolate interesting lexical and structural features from the language models associated with male and female speech, subdividing to examine particular film genres to evaluate whether features are systematically different across all genres or whether distinguishing features differ on a per-genre basis.
We then focus on the text of conversations between two characters to identify whether the two speakers are both male, both female, or of opposite genders. One belief about gendered conversation expressed in films is that women and men act fundamentally differently around each other than around people of the same gender, due partly to differences in the function of speech as perceived by men and women (Tannen, 1991). We look into features that explore the hypothesis that there are significant differences in how men and women speak to each other that are not accounted for merely by the combination of a male and a female language model, and find distinguishing features in each of these three classes of language. Finally, we look at whether these conversation features have predictive power on the duration of a relationship in a film.

Data Description
Our dataset comes from the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil et al., 2012a), a collection of dialogues from 617 film scripts. Of the characters in the corpus, 3015 have pre-existing gender labels. We obtain another 1358 gender labels for the remaining characters by taking the top 1000 US baby names for boys and girls and treating any character whose first listed name is on only one of these two lists as having the respective gender of that list. Based on hand-verification of a sample of 100 of these newly added labels, we achieved 94% labeling accuracy, implying that the 4373 character labels have about 98% accuracy overall. In practice, many of the mislabeled names seem to belong to characters named for their job title or last name, suggesting that these characters contribute fairly little to the dialogue. We investigated using IMDb data as an additional resource but discovered that variations in character naming make this task complex.
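The labeling rule and the accuracy estimate above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function names and the tiny name lists are hypothetical, and the overall estimate assumes that the 3015 pre-existing labels are all correct.

```python
def label_gender_by_name(character_name, boy_names, girl_names):
    """Return 'm' or 'f' if the first listed name is on exactly one list, else None."""
    parts = character_name.strip().lower().split()
    if not parts:
        return None
    first = parts[0]
    in_boys = first in boy_names
    in_girls = first in girl_names
    if in_boys and not in_girls:
        return "m"
    if in_girls and not in_boys:
        return "f"
    return None  # on both lists (ambiguous) or on neither (unknown)


def combined_accuracy(n_original, n_added, added_accuracy, original_accuracy=1.0):
    """Expected accuracy over all labels (assumes original labels are correct by default)."""
    correct = n_original * original_accuracy + n_added * added_accuracy
    return correct / (n_original + n_added)


# Hypothetical stand-ins for the two top-1000 baby name lists.
boys = {"james", "jordan", "michael"}
girls = {"mary", "jordan", "susan"}

names = ["MARY ANN", "Jordan", "michael", "SHERIFF"]
labels = [label_gender_by_name(name, boys, girls) for name in names]

# 3015 original labels plus 1358 name-based labels at 94% sampled accuracy.
overall = combined_accuracy(3015, 1358, 0.94)
print(labels, round(overall, 3))
```

Under these assumptions the estimate is (3015 + 0.94 × 1358) / 4373, which is consistent with the roughly 98% figure quoted above.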
Women are less prominent than men across all films, both holding fewer roles (30% of all roles in major films in 2014) and a smaller proportion of lead roles within them (Lauzen, 2015). This observation is matched quite well in the Movie-Dialogs corpus, where after supplementing gender labels, only 33.1% of characters are female (previously, 32.0% of the original characters were female). In addition, we record 4676 unique relationships (judged by having one or more conversations) with known character genders. A chi-squared test comparing the actual relationships to the distribution of gender pairs expected from our character set shows that the characters are not intermingling independently of gender (p < 10^-5): only 374 relationships are between women, against an expected 509, while 2225 are between men, against an expected 2099.
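The expected counts quoted above can be reproduced approximately with a short calculation. The sketch below is an illustration under stated assumptions, not the exact test that was run: it uses the 33% female baseline, derives the female-male count by subtraction from the 4676 total, and assumes a goodness-of-fit test with 2 degrees of freedom (for which the chi-squared survival function has the closed form exp(-x/2)).

```python
import math


def expected_pair_counts(n_relationships, p_female):
    """Expected FF / FM / MM counts if two characters were paired independently of gender."""
    p_male = 1.0 - p_female
    return {
        "FF": n_relationships * p_female ** 2,
        "FM": n_relationships * 2 * p_female * p_male,
        "MM": n_relationships * p_male ** 2,
    }


def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic summed over all categories."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)


def chi_squared_p_value_df2(statistic):
    """Survival function of the chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


n = 4676
observed = {"FF": 374, "MM": 2225}
observed["FM"] = n - observed["FF"] - observed["MM"]  # derived, not reported in the text

expected = expected_pair_counts(n, 0.33)  # assumed 33% female baseline
stat = chi_squared_statistic(observed, expected)
p_value = chi_squared_p_value_df2(stat)
print({k: round(v) for k, v in expected.items()}, round(stat, 1), p_value)
```

With these inputs the expected female-female and male-male counts round to 509 and 2099, and the resulting p-value falls well below 10^-5.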
Subdividing our data further, we find that certain film genres in this dataset have a disproportionate representation of certain genders and gender pairs. Table 1 shows the significant differences within genres between the actual and expected numbers of characters and relationships of each gender type. Though we hypothesized that the gender gap may have narrowed over time, we find the gender ratio fairly consistent across time in our corpus, as shown in Figure 1.

Feature Engineering
Our text processing uses the Natural Language Toolkit (NLTK) (Bird et al., 2009). We use a simple tokenizer that treats any sequence of alphanumerics as a word for our classifiers, splitting on punctuation and whitespace characters. We elect not to stem or remove stopwords, as non-contentful variation in language is important for our analysis.
Based on the theory that women use more hedging (Lakoff, 1973), we hypothesized that strength of sentiment or signals of arousal or dominance might also signal gender differences in conversation. We used sentiment labels from VADER (Hutto and Gilbert, 2014) and a list of 13,915 English words with scores describing valence, arousal, and dominance (Warriner et al., 2013). We group these features and several nonlexical discourse features into several primary groups, described in Table 2. We also experimented with part-of-speech labels using the Stanford POS tagger (Toutanova et al., 2003), but found they do not significantly influence results.
Table 1: Chi-squared test results on number of characters of each gender and number of gender relationship pairs given gender proportions. The character gender test is done in comparison to the 33% female baseline expectation for that number of characters, whereas the gender-pair test is done with respect to the proportion of gender pairs expected if one were to randomly draw two characters for each of the relationships observed. Only genres with more than 100 observed characters with assigned gender were included. Stars mark significance levels of p=0.05*, 0.01**, 0.001***, and 0.0001****.
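The tokenization described in this section (any run of alphanumerics is a word, with no stemming or stopword removal) might look like the sketch below. It uses only the standard library rather than NLTK; lowercasing and the restriction to ASCII letters and digits are assumptions made for the illustration.

```python
import re

# Any maximal run of ASCII letters or digits counts as one token;
# punctuation and whitespace act purely as separators.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(utterance):
    """Split an utterance into lowercase alphanumeric tokens (no stemming, no stopword removal)."""
    return TOKEN_PATTERN.findall(utterance.lower())


tokens = tokenize("We'll see, Mr. Smith -- please!")
print(tokens)  # ['we', 'll', 'see', 'mr', 'smith', 'please']
```

Note that contractions split at the apostrophe under this rule, so "we'll" yields the two tokens "we" and "ll".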
We surveyed several types of simple classifiers in our prediction tasks: Gaussian and Multinomial Naive Bayes, and Logistic Regression. These implementations came from the scikit-learn Python library (Pedregosa et al., 2011).

Controlling Data
In comparing the language of males and females, we want to ensure that confounding factors do not produce spuriously significant results; the classification tasks should not yield better or worse results because of the structure of our dataset or the data we used to train and test. First, we select equal numbers of males and females from each movie. Second, we select only characters that have a non-trivial amount of speech in the film: for single-speaker analysis, we use only those with at least 3 conversations with other characters, 10 utterances, and 100 words spoken in total. This removes 45% of the characters from the original dataset. While the specific thresholds are arbitrary, they were chosen after examining random character dialogs by hand. Third, we control for the language of a given movie or the style of its screenwriter(s) by using a leave-one-label-out split when running our classifiers.
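The first two controls can be sketched as a filter followed by per-movie balanced sampling. The record layout (dictionaries with `movie`, `gender`, `conversations`, `utterances`, and `words` keys) and the function names are hypothetical; only the three thresholds come from the text.

```python
import random
from collections import defaultdict

MIN_CONVERSATIONS, MIN_UTTERANCES, MIN_WORDS = 3, 10, 100


def has_enough_speech(character):
    """Apply the three minimum-speech thresholds to one character record."""
    return (character["conversations"] >= MIN_CONVERSATIONS
            and character["utterances"] >= MIN_UTTERANCES
            and character["words"] >= MIN_WORDS)


def balanced_sample(characters, seed=0):
    """Keep an equal number of qualifying male and female characters from each movie."""
    rng = random.Random(seed)
    by_movie = defaultdict(lambda: {"m": [], "f": []})
    for character in characters:
        if has_enough_speech(character):
            by_movie[character["movie"]][character["gender"]].append(character)
    selected = []
    for movie in sorted(by_movie):
        groups = by_movie[movie]
        k = min(len(groups["m"]), len(groups["f"]))  # size of the smaller gender group
        for gender in ("m", "f"):
            selected.extend(rng.sample(groups[gender], k))
    return selected


def make(movie, gender, conversations, utterances, words):
    return {"movie": movie, "gender": gender, "conversations": conversations,
            "utterances": utterances, "words": words}


cast = [
    make("m1", "m", 5, 40, 600),
    make("m1", "m", 4, 25, 300),
    make("m1", "f", 6, 30, 450),
    make("m1", "f", 2, 12, 150),  # dropped: fewer than 3 conversations
    make("m2", "m", 3, 10, 100),
    make("m2", "f", 9, 80, 90),   # dropped: fewer than 100 words
]
chosen = balanced_sample(cast)
print([(c["movie"], c["gender"]) for c in chosen])
```

In this toy cast, the second movie contributes no characters because it has no qualifying female speaker to pair with its male speaker.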
Similarly, for conversations, we balance the gender classes (male-male, female-male, and female-female) by including from each film the same number of conversations from each class. This results in a set of roughly 3500 conversations, a substantial subset of the original corpus that represents a variety of dialogue lengths and is less affected by the gender variation within particular films, so that we avoid classifying film content.

Evaluating Individual Gender Features
We first examine the language differences in male and female utterances, selecting an equal number k_i of random male and female characters from each movie i. We then develop language models based upon the unigram, bigram, and trigram frequencies across all utterances from selected male characters versus female characters. As our focus is on usage of common words, we use raw term frequency instead of boolean features or TF-IDF weighting. While this does not fully control for the amount of speech of a given gender, it does control for variation in gender ratios and conversation subjects within films and genres.
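As an illustration of the raw term-frequency features, the following sketch counts unigrams, bigrams, and trigrams over tokenized utterances. It assumes that n-grams do not cross utterance boundaries, which the text does not specify; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(utterances, max_n=3):
    """Raw frequencies of 1- to max_n-grams, counted within each utterance."""
    counts = Counter()
    for tokens in utterances:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


utterances = [["you", "love", "me"], ["you", "love"]]
counts = ngram_counts(utterances)
print(counts["you"], counts["you love"], counts["you love me"])  # 2 2 1
```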
We analyze the interesting n-grams using the weighted log-odds ratio metric with an informative Dirichlet prior (Monroe et al., 2008), distinguishing the significant tokens based upon one-tailed z-scores. Notably, with a large vocabulary, it is expected that some terms will have large z-scores by chance. We therefore only highlight n-grams with z-scores of greater magnitude than what arose in 19 out of 20 tests of random reshufflings of the lines of dialogue between gender classes (equivalent to a 95% confidence level). The important n-grams are displayed in Figure 2.
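A compact sketch of the weighted log-odds z-score of Monroe et al. (2008) is given below. For a word w with counts y_w in each class, class totals n, prior pseudo-counts a_w, and a_0 the sum of all a_w, the estimate is delta_w = log((y_w^a + a_w) / (n^a + a_0 - y_w^a - a_w)) - log((y_w^b + a_w) / (n^b + a_0 - y_w^b - a_w)), with approximate variance 1/(y_w^a + a_w) + 1/(y_w^b + a_w). The choice of prior here (pooled counts from both classes, scaled by a hypothetical `prior_scale`) is an assumption; the text does not say which prior was used.

```python
import math
from collections import Counter


def log_odds_z_scores(counts_a, counts_b, prior_scale=1.0):
    """z-scores of the log-odds ratio (class a vs. class b) with a Dirichlet prior.

    Positive scores favor class a; negative scores favor class b.
    Assumes the vocabulary contains at least two distinct words.
    """
    vocab = set(counts_a) | set(counts_b)
    # Assumed informative prior: pooled counts from both classes.
    prior = {w: prior_scale * (counts_a.get(w, 0) + counts_b.get(w, 0)) for w in vocab}
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = sum(prior.values())
    scores = {}
    for w in vocab:
        y_a = counts_a.get(w, 0) + prior[w]
        y_b = counts_b.get(w, 0) + prior[w]
        delta = math.log(y_a / (n_a + a0 - y_a)) - math.log(y_b / (n_b + a0 - y_b))
        variance = 1.0 / y_a + 1.0 / y_b
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy counts, invented for illustration only.
female = Counter({"please": 8, "he": 12, "the": 50})
male = Counter({"please": 2, "he": 4, "the": 50, "hell": 10})

z = log_odds_z_scores(female, male)
for word in sorted(z, key=z.get, reverse=True):
    print(word, round(z[word], 2))
```

Swapping the two classes flips the sign of every score, so a single call covers both directions of the comparison.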
The findings here conform to what we would expect, such as cursing as a male-favored practice (Cressman et al., 2009) and polite words like greetings and "please" as more favored by women (Holmes, 2013). Interesting as well is the predominance of references to women in men's speech and men in women's speech: "she" and "her" are strongly favored by male speakers, while "he" and "him" are strongly favored by female speakers (p < 0.00001). We also observe that, in contrast to men's cursing, adverbial emphatics like "so" and "really" are favored by women, conforming to classic hypotheses about gendered language in the real world (Pennebaker et al., 2003; Lakoff, 1973).

Predicting Speaker Gender
Given only the words a character has spoken in conversations over the course of the movie, can we accurately predict the character gender?
As outlined in Controlling Data, we select characters equitably from each movie, each having spoken a significant amount during the movie. Using this method, we obtain 552 male and 552 female characters. We extract features from all the lines spoken by each of these characters (as outlined in Feature Engineering) and train and test the scikit-learn classifiers described earlier in 10-fold cross-validation. Using a Logistic Regression classifier with different features, we obtain 72.2% classification accuracy (per-feature accuracy is given in Table 3). A multinomial Naive Bayes classifier, to which we applied the more appropriate leave-one-label-out cross-validation method to split training and test data, performs slightly better at 73.6%.
Table 3: Performance of single-speaker gender classification. Bolded outcomes are those not statistically significantly different from the best result (using a two-tailed z-test).
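The leave-one-label-out split mentioned above holds out every sample that shares one label while training on the rest. The sketch below is a plain-Python illustration that assumes the label is the movie each character comes from; the function name is hypothetical. Recent scikit-learn releases provide the same kind of splitter as `LeaveOneGroupOut` in `sklearn.model_selection`.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    for held_out in sorted(by_group):
        test = by_group[held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


# One entry per character: the (hypothetical) movie it belongs to.
movies = ["m1", "m1", "m2", "m3", "m3", "m3"]
splits = list(leave_one_group_out(movies))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Because no movie ever appears on both sides of a split, a classifier cannot score well simply by memorizing a film's vocabulary or a screenwriter's style.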

Evaluating Relationship Text
While the previous section demonstrates systemic differences in language between male and female speakers, an additional factor to consider is who the participants in each dialogue are. We hypothesize that, in addition to lexical differences between men and women, movies also show significant content differences between pairs of interacting genders, such that conversations between men and women have different content than same-gendered conversations. We examine this hypothesis by repeating the analysis performed on single characters throughout a film on individual conversations from films. We use the controlled dataset described in the Methods section, this time contrasting each class of gender pair: male-male, female-male, and female-female (MM, FM, and FF, respectively). We include the most significant words in each class in Table 4. As with the single-gender analysis, we see that men seem to speak about women with other men, and women about men with other women. We also note that several pronouns, including "she" and "he" from before, are statistically less probable in two-gendered conversations. This is an interesting signal of men speaking differently around men than around women. In conjunction with the high log-odds ratio of "feel", "you", and "you love" favoring dual-gendered conversations, it suggests that men and women talking to each other are more likely to talk about feelings and each other, while they are more likely to talk about experiences of the other-gendered people in their lives with their same-gendered friends. While this finding does not fully support the claim that women and men are not friends in films, it does suggest that men and women in films typically interact with each other in a way distinct from how men and women speak when context is not considered. It also contrasts with the typical understanding of sharing personal problems as a female practice (Tannen, 1991), as it seems that both men and women in films use words discussing feelings and people of the other gender.
Figure 2: Tokens with significance plotted with respect to log-odds ratio. We ran 20 randomization trials and found that in those trials, the largest magnitude z-score we saw was 4.7. Blue labels at the top refer to female words above that significance magnitude, while orange labels at the bottom refer to words below that significance.
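The randomization check behind the significance threshold (reshuffle lines of dialogue between classes, recompute the scores, and record the largest magnitude seen) can be sketched as below. This is an illustration only: `max_abs_null_score` is a hypothetical helper, and the toy `count_difference` scorer stands in for the log-odds z-score so that the block stays self-contained.

```python
import random
from collections import Counter


def max_abs_null_score(lines_a, lines_b, score_fn, trials=20, seed=0):
    """Largest |score| over `trials` random reassignments of lines to the two classes."""
    rng = random.Random(seed)
    pooled = list(lines_a) + list(lines_b)
    largest = 0.0
    for _ in range(trials):
        rng.shuffle(pooled)
        # Preserve the original class sizes after shuffling.
        fake_a, fake_b = pooled[:len(lines_a)], pooled[len(lines_a):]
        scores = score_fn(fake_a, fake_b)
        largest = max(largest, max(abs(s) for s in scores.values()))
    return largest


def count_difference(lines_a, lines_b):
    """Toy scorer (NOT the log-odds z-score): per-word difference in raw counts."""
    a = Counter(w for line in lines_a for w in line.split())
    b = Counter(w for line in lines_b for w in line.split())
    return {w: a[w] - b[w] for w in set(a) | set(b)}


lines_a = ["please come in", "so nice"]
lines_b = ["get out", "hell no", "come on"]
threshold = max_abs_null_score(lines_a, lines_b, count_difference)
print(threshold)
```

Any real score whose magnitude exceeds the returned value is larger than anything that arose from chance reassignment in those trials.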

Predicting Gender Pairs
In order to focus on the linguistic differences of the content of conversations between our gender pair classes instead of the success of per-character gender classifiers, we took as our additional classification task the problem of predicting the gender pair of the speakers in a conversation. This task is considerably more difficult than most, as conversations are often short and will include multiple speakers. We again use leave-one-label-out training to avoid learning dialogue cues from movies. While we can again attain better accuracy with a multinomial Naive Bayes classifier on LEX features, for our objective of simply demonstrating that features provide indication of gender differences, we are satisfied to use logistic regression to incorporate all features.
As Table 5 shows, the only features producing significant improvement over a random accuracy baseline of 33% are lexical, structural, and discourse features. While the fact that lexical content has distinguishing power is perhaps unsurprising, given the preceding analysis, it is somewhat more surprising that simpler structural and discourse features also produce significant results.
While there are no obvious significant structural differences, one can spot minor variation in Figure 3 that seems to provide the slight improvement above random in our classification. We observe in Figure 3a that while utterance length is significantly higher for all-male than all-female conversations, two-gender conversations seem to behave more like all-female conversations on average. Figure 3b looks again at speaker utterances in combination with their imbalance between speakers, the "delta" average utterance length. Our comparison shows a significant difference between men talking to men and men talking to women. As delta utterance length is defined here as average female utterance length minus average male utterance length, this demonstrates that women are speaking in shorter utterances than men in male-female conversations, in contrast to having longer utterances overall. Word length is also significantly shorter for women than men in single-gender conversations, but in this case, the two-gendered value appears to be just the interpolation of the two single-gender values, suggesting that word length is not decreased for male characters in two-gender conversation.
Table 4: The six top words and z-scores correlated with the topic positively and negatively when comparing log-odds ratios for each gender class with respect to the other two. While a z-score of magnitude 2.8 has a significance of p < 0.003, the size of the considered vocabulary makes it unsurprising that several words have scores of this magnitude randomly; however, in twenty trials of randomization of the text between classes, only one z-score emerged greater than magnitude 3.1. We therefore infer that z-scores higher than 3.1 or lower than -3.1 are unlikely to be the consequence of random variation between classes.
We can also see some interesting discourse features in Figure 3c. While looking at the data confirms that the average type-to-token ratio does not differ between our three conversation classes, we find that the type-token ratio difference is significantly higher for conversations between two genders, which suggests that two-gender conversations may have an increased probability of demonstrating one character as less articulate than another. Looking into the data, this slightly but insignificantly favors women having a higher type-to-token ratio than men, suggesting they use more unique words in their speech than do men in conversation. Finally, we note that conversations with women have significantly higher unigram similarity than conversations between men. This hints that there may be some linguistic mirroring effect that women in film demonstrate more than men, which may relate to the hypothesis that women coordinate language more to build relationships (Danescu-Niculescu-Mizil et al., 2012b; Tannen, 1991).
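The structural and discourse features discussed above could be computed along the following lines. This is a sketch under assumptions: the exact feature definitions are given in Table 2, which is not reproduced here, so the unigram similarity is taken to be the Jaccard overlap of the two speakers' vocabularies and the type-token ratio difference is taken as an absolute difference. Only the definition of delta utterance length (female minus male average) comes from the text.

```python
def type_token_ratio(tokens):
    """Distinct words divided by total words (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_utterance_length(utterances):
    """Average number of tokens per utterance (0.0 if there are none)."""
    return sum(len(u) for u in utterances) / len(utterances) if utterances else 0.0


def conversation_features(female_utts, male_utts):
    """Features for one female-male conversation, given tokenized utterances per speaker."""
    f_tokens = [t for u in female_utts for t in u]
    m_tokens = [t for u in male_utts for t in u]
    f_types, m_types = set(f_tokens), set(m_tokens)
    union = f_types | m_types
    return {
        # Female minus male average utterance length, as defined in the text.
        "delta_utterance_length":
            mean_utterance_length(female_utts) - mean_utterance_length(male_utts),
        # Assumed definition: absolute gap between the speakers' type-token ratios.
        "ttr_difference":
            abs(type_token_ratio(f_tokens) - type_token_ratio(m_tokens)),
        # Assumed definition: Jaccard overlap of the two vocabularies.
        "unigram_similarity":
            len(f_types & m_types) / len(union) if union else 0.0,
    }


female = [["i", "feel", "fine"], ["do", "you"]]
male = [["you", "feel", "what", "you", "feel", "now"]]
features = conversation_features(female, male)
print(features)
```

For same-gender conversations the same function applies with the two arguments standing for the two speakers, although the sign of the delta then carries no gender meaning.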

Relationship Prediction
In addition to testing the prediction of genders in conversations and relationships, we attempted to use the same features to distinguish from a single conversation whether a relationship would be short (3 or fewer conversations) or long (more than 3 conversations). We tested on a dataset of conversations split evenly between gender pairs and between long and short relationships, using leave-one-label-out cross-validation to test conversations from one relationship at a time. With a multinomial Naive Bayes classifier, we are able to achieve 60 ± 2% accuracy with a combination of n-gram features, gender labels, and structural and discourse features. Performing ablation with each feature set used, we find that results worsen when omitting either structural features (54 ± 2%) or n-gram features (54 ± 2%), but that omitting gender from the classification does not significantly impact the classification accuracy (60 ± 2%). Some of this result is predictable from the limits of the data: controlling for the number of conversations in a relationship heavily limits the number of possible short female relationships. Our dataset has few labels for minor female roles, and thus short, explicitly female-female relationships are hard to find. In addition, though, analysis of the lexical features that predict relationship length suggests that the difference is fairly subtle, more so than a gender divide might suggest: the positive indicators of a long relationship that are significant relative to random variation are "it," "we," and "we ll", while the negative indicators are "name," "he," and "mr." This suggests that the identification of a collective "we" might show a longer connection, but very little else obviously signals a relationship's length.

Related Work
There exists prior work analyzing the differences in language between male and female writing, by Argamon, Koppel, Fine, and Shimoni (Argamon et al., 2003). Herring and Paolillo at Indiana University have shown relations in the style and content of weblogs to the gender of the writer (Herring and Paolillo, 2006). The investigative strategy we use for comparing n-gram probabilities stems from work done by Monroe, Colaresi, and Quinn on distinguishing the contentful differences in language of conservatives and liberals on political subjects (Monroe et al., 2008). Recently, researchers used a simpler version of n-gram analysis to distinguish funded from not-funded Kickstarter campaigns based on linguistic cues (Mitra and Gilbert, 2014).

Conclusion
Finding words that are stereotypically male or female can be done rather quickly and roughly. Yet more sophisticated techniques provide more reliable and believable data. Isolating the right subset of the data with proper control methods, and then extracting useful information from this subset, yields interesting and significant results. In our small dataset, we find that simple lexical features were by far the most useful for prediction, and that sentiment and structure prove less effective in the setting of our movie scripts corpus. We also isolate several simpler discourse features that suggest interesting differences between single-gender and two-gender conversations and gendered speech.