Joint Estimation and Analysis of Risk Behavior Ratings in Movie Scripts



Introduction
In one of the longest running movie franchises in history, fictional British Secret Service agent James Bond is more often than not portrayed as an extremely charming gentleman, a cold-blooded killer, a smoker, and a severe alcoholic (Wilson et al., 2018). This combination of traits is not unique: other critically acclaimed films, such as The Exorcist (Friedkin, 1973), Pulp Fiction (Tarantino, 1994), and A Clockwork Orange (Kubrick, 1972), follow narratives where the main characters engage in a similar collection of risk behaviors. The portrayals of these risk behaviors typically include acts of violence, sexual behaviors, and substance abuse, shown respectively in scenes of fighting, bloodshed, and gunplay; intercourse and nudity; and alcohol, smoking, and drug use. While these portrayals tend to attract audiences (Barranco et al., 2017) and facilitate a movie's global market reach (Sparks et al., 2005), they have long sparked concerns about the potential side effects of repeated exposure. This is particularly the case for at-risk populations, such as children and adolescents, for whom this exposure has been linked to an increased risk of engaging in violence (Anderson and Bushman, 2001; Bushman and Huesmann, 2001), smoking and alcohol consumption (Sargent et al., 2005; Dal Cin et al., 2008), and earlier sexual initiation (Brown et al., 2006).
Although various automated tools have been designed to recognize portrayals of risk behaviors (e.g., Chen et al., 2011; Liu et al., 2008), many rely on cinematic principles from film theory such as illumination, rapid shot transitions, or musical score selection (Brezeale and Cook, 2008). This limits their practical impact to an almost-final edition of the content, where visual and sound effects have already been added and modifications are too late or too expensive to implement. Hence, there is an opportunity to identify these depictions at an earlier stage of content creation, so as to offer useful insights to film-makers and movie producers during the complex creative process.
To this end, our work leverages two key insights. First, while all of these works focus on a specific behavior, risk behaviors frequently co-occur with one another both in real life (Brener and Collins, 1998) and in entertainment media (Bleakley et al., 2014, 2017; Thompson and Yokota, 2004). Second, the language used in movie scripts can characterize portrayals of risk behaviors at the earliest form of content creation, even before production begins. Examples include Mr. Bond ordering his usual alcoholic drink, Pulp Fiction's main characters plotting to kill someone, or the evil incarnate in The Exorcist cursing in a sexually explicit manner.
The present work, to the best of our knowledge, is the first to model the co-occurrence of risk behaviors from linguistic cues found in movie scripts. Our proposed model is a multi-task approach that predicts a movie script's violent, sexual and substance-abusive content from vectorial representations of the characters' utterances. We hypothesize that this multi-task approach will improve violent content classification and provide insights into its relation to other dimensions of risk behaviors depicted in film media.
Specifically, the contributions of this work are:
1. A multi-task model that significantly improves the state of the art for violent content rating prediction by leveraging the co-occurrence of sexual and substance-abusive content.
2. MovieBERT¹: A domain-specific fine-tuned BERT model (Devlin et al., 2019), further pre-trained over a large collection of film and TV scripts. We use this model to obtain better representations for the semantics of a character's language.
3. A novel large-scale analysis of the joint portrayals of violence, sex, and substance abuse in film, and of their relation to other ratings.

Related Work
To understand the prevalence of risk behaviors in film and TV, social scientists have often relied on relatively small human-annotated data sets (typically under 100 titles). This includes a study of portrayals of violence in 74 to 77 films from the last decade (Yokota and Thompson, 2000; Webb et al., 2007), as well as portrayals of teenage sex in 90 of the top-grossing films (Callister et al., 2011). Among other findings, these studies provide evidence that MPAA² ratings (the primary rating system used for films in the U.S.) are overly sensitive to sexual content, and less effective at identifying other types of risk behaviors (Tickle et al., 2009; Thompson and Yokota, 2004). However, most of these works are limited to the study of a particular behavior, even though risk behaviors frequently co-occur with one another in media (Bleakley et al., 2014, 2017; Thompson and Yokota, 2004).
¹ https://github.com/usc-sail/mica-riskybehavior-identification
² Motion Picture Association of America
The task of identifying risk behaviors from language is perhaps closely related to that of recognizing Abusive Language (AL; Waseem et al. 2017). AL is an umbrella term that covers offensive language, including sexist and racist language, and hate speech. AL computational models are usually designed using popular document classification techniques (Mironczuk and Protasiewicz, 2018), based on features such as n-grams (Nobata et al., 2016), affective language (Wiegand et al., 2018), and distributed semantic representations (Wulczyn et al., 2017). Recent efforts (e.g., Mozafari et al., 2019) explore a supervised fine-tuning approach that starts from pre-trained models of highly contextualized word representations from transformers (Devlin et al., 2019).
Most similar to our work are efforts in predicting a single movie-level rating from language either in movie scripts (Martinez et al., 2019; Shafaei et al., 2019) or in transcripts (Mohamed and Ha, 2020). These works explore the use of recurrent neural networks (RNN) over sequences of vector representations, each composed of the concatenation of lexical, semantic and sentiment features, to learn a movie representation from which the target rating is predicted. There are two notable differences between these and our proposed model. First, our model incorporates additional information in the form of other prediction targets (i.e., a multi-task paradigm) and multiple attention layers (Vaswani et al., 2017). The former is motivated by the previously mentioned notion that characters tend to engage in joint portrayals of risk behaviors (Bleakley et al., 2017); the latter allows the model to jointly attend to information from different representation sub-spaces. Second, these previous works explore an early-fusion method where linguistic features are concatenated and fed to a self-attention mechanism on top of the RNN layer. This assumes that, in an effort to construct a meaningful interpretation of the features, the attention layer will be powerful enough to disentangle different aspects of language, such as semantics and sentiment. Instead, we use a late-fusion approach where we separate semantics from sentiment, and direct them through different pathways in our model, all the way up to independent attention layers. Thus, our attention layers have the relatively easier task of identifying what is of importance for a particular view of language in a particular task. This allows our model to attend to what is being said (semantics) and, independently, how it is being said (sentiment). We expect this to be more informative about the content of each utterance, leading to a better representation.

Method
Our model learns to map sequences of character utterance representations to overall movie-level ratings. Each representation is composed of two parts: one representing its semantics, and one for its sentiment. These representations are obtained from models that were trained on larger out-of-domain corpora but have been validated on related tasks in domains similar to those we study in this work (e.g., classification of movie review sentiment (Pagliardini et al., 2018)). Our decision to start from character utterance representations (as opposed to word representations) comes from the limited number of labeled, expert-curated content ratings in our dataset (see Section 4).

Semantic representations
The unique aspect of this work is the use of highly contextualized vector representations for the particular domain of movie scripts to predict content ratings. These techniques have shown remarkable success on a variety of NLP tasks such as sentiment classification (Devlin et al., 2019) and identifying AL in social media (Mozafari et al., 2019).
B. Highly-contextualized representations: Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) is a novel language model that outperforms its predecessors due to an innovative architecture that incorporates information from both the left and right contexts. This is done through an interlacing of n fully connected dense layers, each with a multi-head attention layer (Vaswani et al., 2017). From BERT, we obtain vector representations for every utterance. These come from either of two pre-trained models: (a) BERT-base (n = 12; 768-dimensional), and (b) BERT-large (n = 24; 1024-dimensional), both trained on a large corpus of documents from Wikipedia and BookCorpus.

Figure 1: Multi-task model for content rating classification: Each utterance is represented by semantic and sentiment features, fed to independent RNN encoders. The sequence of hidden states from the encoders serves as input for task-specific layers (gray boxes).

C. MovieBERT:
A common approach to implement models that produce near state-of-the-art results is to fine-tune large pre-trained models (such as BERT) for a particular task. This aims to keep the generalization power of the original model while also adapting its vocabulary to the language used in a particular domain. Following this idea, here we fine-tune a BERT-base model by continuing its training over the 6,000 movie scripts dataset. Our adapted model, movieBERT, consists of 12 transformer layers that learn a 768-dimensional representation of a movie script. We train this model over an 85%-15% train-test data split and, as in Devlin et al. (2019), we optimize the model for two tasks: next-sentence prediction and masked language modeling. In the former, the model has to predict whether one sentence follows another; in the latter, a random word in a sentence is masked with a token, and the model has to recover the original word. We initialize the weights of our model with those from the pre-trained BERT-base model, and continue training for 10,000 steps, using the base model's parameters: learning rate of $2 \times 10^{-5}$, batch size of 32, and sequence length of 128. MovieBERT achieves 96.5% accuracy on the next-sentence prediction task, and 65.9% accuracy on the masked language model, an absolute improvement over the BERT-base model of 24.5% and 12.43%, respectively. To obtain sentence-level representations, we concatenate and then average-pool the output of the last 2 layers.

Figure 2: Risk behavior rating co-occurrence: on average, when one risk-behavior rating increases so do the others. Error bars denote 95% confidence intervals.

Sentiment representations
Previous works show the benefits of including lexical features that capture the sentiment expressed in language for media content prediction tasks (Martinez et al., 2019; Shafaei et al., 2019). However, most approaches to sentiment analysis on movie scripts rely on manually constructed sentiment lexica (e.g., Lapata 2018, 2015). These lexica have a limited vocabulary, which is costly to scale or adapt to new domains. In contrast, here we explore neural-network-based sentiment models that learn representations from language used in the related task of movie reviews (Socher et al., 2013). While we are aware of the possible mismatch between the language used in movie reviews and that of movie scripts, our work relies on the assumption that these reviews provide a good initial step towards capturing sentiment expressed in movie scripts. These models not only learn how words are used from a larger vocabulary but also consider the relations between these words, which may allow them to generalize better to unseen data. In this work, we experiment with two neural-based models: bidirectional long short-term memory models (Bi-LSTM; Tai et al. 2015), and bidirectional encoder representations from transformers (Devlin et al., 2019). We chose these models because they provide a good trade-off between the number of parameters and the performance on the sentiment prediction task (Barnes et al., 2017), and due to their outstanding performance in NLP tasks. Our sentiment representations are obtained from the last hidden state of the Bi-LSTM, and the second-to-last layer of the BERT transformer.

Role of Movie Genre
Movie genres relate the elements of a story, plot, setting and characters to a specific category. Categorizing a movie indirectly assists in shaping the characters and the story of the movie, and determines the plot and best setting to use. Thus, movie genre contains information on the type of content one could expect in a movie (especially for the case of violent content (Martinez et al., 2019)). For this reason, our models include movie genre as an additional feature. Genres for each movie were obtained from IMDb³ and transformed into a multi-hot encoding.
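To make the multi-hot encoding concrete, the following is a minimal sketch. The function name `multi_hot` and the five-genre vocabulary are illustrative assumptions, not taken from the released code; the dataset described later spans 23 genres.

```python
def multi_hot(genres, vocabulary):
    """Return a 0/1 vector with one slot per genre in `vocabulary`."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for name in genres:
        if name in index:  # genres outside the vocabulary are ignored
            vector[index[name]] = 1
    return vector


# Hypothetical vocabulary, for illustration only.
GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]

# A movie tagged with two genres activates two positions.
g = multi_hot(["Horror", "Drama"], GENRES)
print(g)  # [0, 0, 1, 1, 0]
```

Unlike a one-hot encoding, this lets a single movie belong to several genres at once, which matches how genre tags are typically listed.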

Ratings Prediction Model
Our model (see Fig. 1) takes a sequence of utterance representations as input, and outputs predictions for the target content ratings. Formally, let $K$ be the number of content ratings to output (number of tasks), and $\{u_t\}_{t=1}^{N}$ be a sequence of $N$ character utterances. For each $u_t$, we obtain features $f_{1t}$ and $f_{2t}$, corresponding to the semantic and sentiment aspects of language respectively. These representations are input to separate bi-directional RNN layers. To improve model generalization, a dropout layer (probability $p$) was added after the feature extraction layer. Each RNN takes a sequence of representations and outputs a sequence of $m$ hidden vectors $\{h_{j1}, \ldots, h_{jm}\}$, $h_{jl} \in \mathbb{R}^d$, where $j = 1, 2$ corresponds to semantic and sentiment features respectively. Each hidden vector represents a state of conversational context, i.e., what is being said in relation to what has been previously said. This context is important because most utterances are not independent of one another, but follow a conversation thread.
Both hidden-vector sequences $\{h_{1i}\}_{i=1}^{m}$ and $\{h_{2i}\}_{i=1}^{m}$ go through $K$ task-specific units, indexed by $k \in \{1, \ldots, K\}$ and represented as gray boxes in Fig. 1. Each task-specific unit is composed of a sequence of four layers: (i) two separate self-attention mechanisms; (ii) a concatenation layer; (iii) a $z$-dimensional dense layer, and (iv) a softmax prediction layer. Self-attention aggregates the sequence of hidden vectors into a representation of what characters say during the movie. These attention layers, denoted by $\{\alpha_{kj} \in \mathbb{R}^m : j = 1, 2\}$, are not shared between the tasks, to allow them to focus on what is important for their particular type of content. We chose this approach as it showed improved performance over our initial experiments with multi-head attention (Vaswani et al., 2017). Each attention output corresponds to a weighted sum of the hidden states with the $\alpha_{kj}$ weights, $A_{kj} = \sum_{i=1}^{m} \alpha_{kji} \cdot h_{ji}$. In the concatenation layer, these aggregated representations are coupled with the movie genre $g$ as $v_k = [A_{k1}; A_{k2}; g]$, and serve as inputs for a $z$-dimensional dense layer. This yields $s_k = \phi(W_k v_k + b_k)$, where $\phi$ is a ReLU function, and $W_k$, $b_k$ are the weight matrix and bias vector to be learned. We predict the ratings through a prediction layer as $\hat{y}_k = \mathrm{softmax}(s_k)$. The complete model is trained by minimizing the aggregated loss $L = \sum_{k=1}^{K} l_k(y_k, \hat{y}_k)$, where $l_k$ is the cross-entropy loss associated with the $k$-th task.
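The forward pass of a single task-specific unit can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' Keras implementation: the attention scores are assumed to be a dot product with a learned vector followed by a softmax (the text does not specify the scoring function), the dense layer is given z = 3 outputs so that the softmax is applied directly to $s_k$ as written above, and all weights and inputs are toy values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, w):
    """Pool m hidden vectors into one: A = sum_i alpha_i * h_i.

    ASSUMPTION: alpha = softmax(h_i . w) for a learned vector w.
    """
    scores = [sum(h_d * w_d for h_d, w_d in zip(h, w)) for h in hidden]
    alpha = softmax(scores)
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]


def task_unit(h_sem, h_sent, genre, params):
    """One task-specific unit: two attentions, concatenation, dense, softmax."""
    a_sem = attend(h_sem, params["w_sem"])     # A_k1 (semantic view)
    a_sent = attend(h_sent, params["w_sent"])  # A_k2 (sentiment view)
    v = a_sem + a_sent + genre                 # v_k = [A_k1; A_k2; g]
    s = [max(0.0, sum(w * x for w, x in zip(row, v)) + b)  # ReLU(W_k v_k + b_k)
         for row, b in zip(params["W"], params["b"])]
    return softmax(s)                          # y_hat_k


def total_loss(targets, predictions):
    """L = sum_k cross_entropy(y_k, y_hat_k), with y_k given as a class index."""
    return sum(-math.log(pred[y]) for y, pred in zip(targets, predictions))


# Toy inputs: m = 2 hidden vectors of size d = 2 per view, 2 genres, z = 3.
h_sem = [[1.0, 0.0], [0.0, 1.0]]
h_sent = [[0.5, 0.5], [0.2, 0.8]]
genre = [1.0, 0.0]
params = {
    "w_sem": [0.0, 0.0],
    "w_sent": [1.0, -1.0],
    "W": [[0.1] * 6, [0.2] * 6, [-0.1] * 6],
    "b": [0.0, 0.0, 0.0],
}
y_hat = task_unit(h_sem, h_sent, genre, params)
loss = total_loss([1], [y_hat])
```

In the full model there would be K such units, each with its own `params`, all reading the same two hidden-state sequences; `total_loss` then sums the per-task cross-entropy terms.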

Data
We collected a large number of movie scripts from three publicly available sources. The first source was related works that shared their movie script datasets (Gorinski and Lapata, 2018; Ramakrishna et al., 2017); the second source was online collections of produced scripts⁴, and the final source was online communities where non-produced scripts are shared⁵. In total we collected 12,706 scripts, some of which correspond to produced films or TV episodes. To improve the quality of this dataset, we clean it by extracting text, limiting to files with more than 1,000 lines, and replacing non-ASCII characters. In case of any error, we remove the file from the collection. This procedure resulted in 6,057 movie scripts spanning 23 genres with an average of 1450.6 utterances per movie (σ = 456.11, M = 1447.0). We use this collection to fine-tune movieBERT.
To evaluate the performance of our model, and directly compare it to previous work, we manually align a subset of 989 movie scripts from our dataset to the content ratings found in Martinez et al. (2019). These ratings come from Common Sense Media (CSM), a non-profit organization that promotes safe technology and media for children⁶. CSM experts rate movies from 0 (lowest) to 5 (highest), with each rating manually checked by the executive editor to ensure consistency across raters. A manual inspection of the dataset revealed that the movies with the lowest scores across all risk behaviors correspond to the romantic genre, whereas the movies with the most risky content were in the horror genre. Additionally, we investigate whether CSM expert raters capture the co-occurrence of risk behavior portrayals. Figure 2 shows that, on average, when one risk-behavior rating increases so do the others. This was corroborated by significant positive Spearman's correlations between violence and sexual content ($r_s = 0.161$, $p < 0.001$); violence and substance-abuse ($r_s = 0.129$, $p < 0.001$), and sexual content and substance-abuse ($r_s = 0.467$, $p < 0.001$).
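For readers unfamiliar with Spearman's correlation, the sketch below shows how the coefficient can be computed for two lists of ratings: each list is converted to ranks (ties receive their average rank, which matters for 0 to 5 ratings) and Pearson's correlation is taken over the ranks. The helper names and the toy ratings are illustrative assumptions; the significance test is omitted.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's r_s: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for six movies (not data from the paper).
violence = [0, 1, 3, 3, 4, 5]
sex = [1, 0, 2, 4, 3, 5]
r_s = spearman(violence, sex)
```

In practice a library routine such as `scipy.stats.spearmanr` would return both the coefficient and its p-value.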

Preprocessing
We follow a procedure similar to that described in Martinez et al. (2019), which discards scene headers, actions and transitions to represent a movie script as a sequence of actors speaking one after another. This leads to a natural formulation of a sequence learning model for capturing the dialog narrative using recurrent neural networks. Additionally, we transformed the five-point ratings to three categories using a median split on each rating to counter class imbalance and to be consistent with previous work. The distribution of the ratings is shown in Table 1.
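One plausible reading of the median split into three categories is sketched below: ratings below the median become LOW, ratings equal to the median become MED, and ratings above it become HIGH. This mapping is an assumption made for illustration; the exact rule used in Martinez et al. (2019) is not restated here, and the ratings are toy values.

```python
from statistics import median


def median_split(ratings):
    """Map numeric ratings to LOW / MED / HIGH around the sample median.

    ASSUMPTION: values equal to the median form the middle class.
    """
    med = median(ratings)
    labels = []
    for r in ratings:
        if r < med:
            labels.append("LOW")
        elif r > med:
            labels.append("HIGH")
        else:
            labels.append("MED")
    return labels


# Hypothetical violence ratings for seven movies.
labels = median_split([0, 1, 3, 3, 3, 4, 5])
```

The split is computed separately for each rating (violence, sex, substance abuse), so the same numeric score can fall into different categories for different behaviors.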

Experimental Setup
In this section we discuss the model implementation, parameter selection, baseline models and sensitivity analysis setup.

Model Implementation
Our model was implemented in Keras⁷. Although not common in most deep-learning approaches, we performed 10-fold cross-validation (CV) to obtain a more reliable estimate of our model's performance. In each fold, the model was trained until convergence (i.e., the difference in loss between consecutive epochs was less than $10^{-8}$). To prevent over-fitting, we used the Adam optimizer with a small learning rate (0.001), a batch size of 16, and a high dropout probability ($p = 0.5$). For the RNN layer, we used Gated Recurrent Units (GRU). For the sentiment models, Bi-LSTM parameters were informed by the work of Tai et al. (2015): 50-dimensional hidden representation, dropout ($p = 0.1$), trained with the Adam optimizer on a batch size of 25 and an $L_2$ penalty of $10^{-4}$. To allow for a fair comparison, all the BERT pre-trained models and movieBERT had the same set of parameters as the BERT-base model: 12 layers, 768 dimensions, learning rate of $2 \times 10^{-5}$, sequence length of 128 and batch size of 32. For the initial experiments, we set the model parameters to a hidden dimension size of $d = 16$, to help prevent overfitting, and a sequence length of $m = 500$, which is approximately the duration of one movie act (i.e., one third). This selection was informed by previous works (Martinez et al., 2019; Shafaei et al., 2019).

Table 2: 10-fold cross validation multi-task classification performance. Precision (P), recall (R) and F1 macro average scores reported (percentages). Models trained independently for each task are denoted by double-line. The best model (shown in bold) performs significantly better than baseline for violence (perm. test $n = 10^5$, $p = 0.002$) and substance-abuse ($n = 10^5$, $p = 0.006$).

Experiments
In our first set of experiments, we compare the predictive power of each of the proposed features for predicting risk behavior content. In a second set, we explore how varying the number of dimensions ($d \in \{8, 16, 32, 64\}$) and the utterance sequence length ($m \in \{100, 300, 500, 1000\}$) impacts the performance of our model. Additionally, we explore the individual contribution of each feature to the overall prediction task using ablation studies. For all experiments, we report macro-average precision, recall and F-score (F1) estimated through 10-fold cross validation.
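As a reference for how macro-averaged scores are formed, the sketch below computes per-class precision, recall and F1 and then takes their unweighted mean over the three classes. Averaging the per-class F1 values (rather than computing F1 from the macro precision and recall) is an assumed convention, the same one used by `sklearn.metrics.precision_recall_fscore_support` with `average="macro"`; the labels and predictions are toy values.

```python
def macro_prf(y_true, y_pred, labels):
    """Macro-averaged precision, recall and F1 over `labels`."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical gold labels and predictions for four movies.
y_true = ["LOW", "LOW", "MED", "HIGH"]
y_pred = ["LOW", "MED", "MED", "HIGH"]
p, r, f1 = macro_prf(y_true, y_pred, ["LOW", "MED", "HIGH"])
```

Because every class contributes equally to the mean, macro averaging does not let a frequent class dominate the score, which is useful when the rating categories are imbalanced.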

Baselines
As baselines, we compare against: (i) AL classification (Nobata et al., 2016), since AL likely includes sexual and drug-related terms; (ii) the state of the art for violence rating prediction from movie scripts (Martinez et al., 2019), and (iii) BERT-only document classification systems (Adhikari et al., 2019). Additionally, to measure whether the performance improves with the inclusion of co-occurring risk behaviors, we compare our model against the same architecture without the multi-task approach.
Results

Table 2 presents the classification performance for the baselines and our proposed model. In line with previous results (Martinez et al., 2019; Shafaei et al., 2019), we observe that including sentiment features (either in the form of lexica or neural network representations) greatly improves the model performance. Even without the multi-task framework, our model architecture shows significant improvement over the baselines (permutation test, $n = 10^5$, all $p < 0.05$). This is likely due to our design choice of reducing the model complexity by focusing just on the informative features (i.e., semantic, sentiment and genre) instead of dealing with redundant features (e.g., n-grams, word2vec, AL lexica). By including the co-occurrence information in the form of additional tasks, our proposed multi-task model with task-specific attention gained 1.22 F1 percentage points on average. It also results in the best model (movieBERT + sentiment + movie-genre) with F1 = 67.7% for ($d = 16$, $m = 500$), performing significantly better than the previous state-of-the-art model for violent content rating prediction (perm. test, $n = 10^5$, $p = 0.002$) as well as the AL baselines for violence (perm. test, $n = 10^5$, $p = 0.005$) and substance-abuse content (perm. test, $n = 10^5$).

Figure 3: 10-fold cross validation multi-task classification performance based on GRU dimension (d) and sequence length (m).

Classification Results
While the proposed model also improves sexual content rating prediction, this improvement is non-significant (p > 0.05). As previously mentioned, this could be attributed to the fact that MPAA ratings are particularly sensitive to sexual content (Thompson and Yokota, 2004). In fact, filmmakers are advised to avoid the repeated usage of sexually derived words, either as an expletive or in a sexual context, so as to avoid a non-family-friendly rating (Myers, 2018). Thus, they might refer to sexual acts through the use of euphemisms or innuendos, which the model seems unable to pick up on. Our experiments in using BERT for sentiment representations (last row in Table 2) did not significantly improve performance any further (p > 0.05). Future work will explore further fine-tuning to better capture affective language.
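The significance values in this section come from permutation tests. The text does not state which quantity is permuted, so the following is only a sketch of one common variant: a paired sign-flip test on per-fold score differences between two models. The function name, the per-fold scores, and the default number of resamples are assumptions for illustration (the paper reports $n = 10^5$ permutations).

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean paired difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so signs are flipped at random and the resulting mean
    differences are compared against the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-fold F1 scores for two models over 10 folds.
model_a = [0.68, 0.67, 0.69, 0.66, 0.70, 0.68, 0.67, 0.69, 0.68, 0.66]
model_b = [0.62, 0.61, 0.64, 0.60, 0.63, 0.62, 0.61, 0.63, 0.62, 0.60]
p_value = paired_permutation_test(model_a, model_b)
```

A small p-value indicates that a difference as large as the observed one rarely arises when the two models' scores are exchangeable.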

Performance Analysis
Parameter Selection: We evaluate model performance under different selections of parameters, namely the number of hidden dimensions in the GRU layer ($d$) and the length of the character utterance sequences ($m$). The model performance for different dimensions is presented in the left section of Fig. 3. For all tasks, we notice an improvement in performance for $d = 16$, which drops for higher dimensions. This suggests that the larger models are overfitting the data. There is a slight improvement for sexual content estimation for $d = 8$ (F1 = 48.1), but its performance is not significantly different from the original model (perm. test, p > 0.05).
With respect to $m$, the right section of Fig. 3 presents the F1 performance of the multi-task model. Overall, we see that longer sequences improve the model's performance. However, there was no significant difference between the performance of $m = 500$ and that of $m = 1000$ (perm. test, p > 0.05). Although we did not test sequences longer than 1000 utterances, the shrinking performance gains between increments of $m$ lead us to believe that the model is saturated, which suggests that longer sequences will not provide any significant performance gains.
Ablation studies: Table 3 shows the individual contributions of each of the three representations. We find that semantic representations are the most important source of information. Removing this feature results in an average performance drop of 7.83 F1 points. This difference in performance was significant for the violence (perm. test, $n = 10^5$, $p = 0.003$) and substance-abuse (perm. test, $n = 10^5$, $p < 0.0001$) tasks. The second most informative feature was genre, closely followed by sentiment, with average performance drops of 1.2 and 0.96 respectively. These results suggest that, while useful, our sentiment features still have scope for improvement. In particular, we note that a potential limiting factor might be the possible mismatch between the language used in movie reviews and that of the movie scripts. A study on how to bridge this possible mismatch will be part of our future work.
Attention Analysis: Finally, we verify our assumption that the attention layers are correctly identifying the important aspects of language with respect to each behavior. We do so by exploring how the attention weights are distributed across the movie scripts. Each of the 6 attention layers (two per task: one for semantics and one for sentiment) learns an $m$-dimensional weight vector, where each entry corresponds to a particular utterance in the sequence. The higher the weight, the more importance the model assigns to that particular utterance. For example, for the violent behavior task, we would expect utterances assigned a higher attention weight to be more reflective of violent expressions than utterances with lower attention weights. To verify that each attention layer is correctly focusing on the behavior we are interested in, we set up a hypothesis test where we compare the maximum weight of each attention layer for movies rated HIGH against movies rated LOW on each behavior. Our null hypothesis is that there will be no difference in the way attention concentrates weights for different levels of the behavior. We reject this null hypothesis for the semantics of the violence task (Mann-Whitney U = 59377.5, n1 = 356, n2 = 304, p = 0.015), and for the sentiment in the sexual content task (Mann-Whitney U = 52937.5, n1 = 214, n2 = 446, p = 0.011).
These results suggest that our model picks up on violence by focusing on the content of the words, whereas identification of sexual behaviors is dependent on the emotional aspects of the language.
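The Mann-Whitney U statistic used in this comparison can be computed directly from its definition, as in the sketch below: for every pair made of one HIGH-rated and one LOW-rated movie, count whether the first has the larger maximum attention weight, with ties counting one half. The function name and the weights are hypothetical; the p-value computation is omitted and would normally come from a library routine such as `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x against sample y.

    Counts pairs (x_i, y_j) with x_i > y_j; a tie contributes 0.5.
    The two directions always satisfy U_x + U_y = len(x) * len(y).
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical maximum attention weights per movie.
high_rated = [0.031, 0.040, 0.052]
low_rated = [0.012, 0.020, 0.031]
u_high = mann_whitney_u(high_rated, low_rated)
u_low = mann_whitney_u(low_rated, high_rated)
```

A U value close to len(x) * len(y) means that nearly every HIGH-rated movie has a larger peak attention weight than every LOW-rated one; a value near half of that product indicates no systematic difference.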

Co-Occurrence Analysis
In this section, we focus on some of the insights that our proposed model may provide film-makers and producers during the creative process. In particular, our analysis centers on three insights: first, on understanding how joint portrayals of risk behaviors appear on screen; second, on identifying temporal patterns that arise from these joint portrayals; and finally, on showcasing the relation between risk behaviors and MPAA ratings. For this analysis, we re-trained the best performing model over the complete movie script dataset (n = 989).
On the relation between joint portrayals of risk behaviors. We find a strong association between predictions of substance-abuse and sexual content: the odds for a movie script to be rated high on sexual content are twice as high when it has a high rating in substance-abuse compared to when it has a low rating (95% Confidence Interval [CI] 2.01 to 34.05). Moreover, we find that the odds of rating high on all three risk behaviors simultaneously are inversely proportional to the predicted violence rating (95% CI, HIGH: 0.11 to 0.82 and MED: 0.12 to 0.88). This suggests that film-makers compensate for low levels of violence with joint portrayals of sexual and substance-abuse behaviors.
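To illustrate how an odds ratio and its confidence interval relate to counts of movies, the sketch below uses a 2x2 contingency table and the standard log-odds (Woolf) interval. This is an assumed, simplified procedure: the text does not state how the reported odds were estimated (they may come from a regression model), and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: high substance-abuse and high sex     b: high substance-abuse, not high sex
    c: low substance-abuse and high sex      d: low substance-abuse, not high sex
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts of movies in each cell.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

An interval that excludes 1 indicates an association between the two ratings at the chosen confidence level; intervals entirely below 1, such as those reported for violence above, indicate reduced odds.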
On the temporal patterns of the joint portrayals. If there is a temporal relation between the portrayals, when the model picks up a cue for a particular behavior at time t (i.e., a spike in the attention signal), we expect to see a corresponding spike in the attention signal of another task some time after t. To compute this relation, for each movie script we obtained the maximum correlation and its corresponding time lag (∆ ∈ [−m, m]) by using the sample cross-correlation function (CCF) between the attention weights of each task. CCF is a measure of similarity between two time series as a function of the displacement of one relative to the other. As an example, Fig. 4 shows the co-evolution of attention weights and the lags corresponding to their maximum correlation for two movies. On average, attention to the sexual sentiment content precedes attention to violence semantics by ∆ = 15.50 utterances (95% CI, 10.88 to 17.4), with an average correlation coefficient of $r_z = 0.192 \pm 0.02$. This lag increases for movies with higher content ratings on both violence and sex (∆ = 21.46, $r_z = 0.202$), whereas movies with low sex and violent content have almost no temporal difference, and a significantly lower correlation coefficient (∆ = 0.75, $r_z = 0.172$; perm. test, $n = 10^5$, $p = 0.034$). These results suggest, as Bleakley et al. (2014) point out, that characters engage in sexual and violent behaviors within a small time span of one another.
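The lag search described above can be sketched as follows: the sample cross-correlation between two attention-weight series is evaluated at every lag in a window, and the lag with the highest correlation is kept. The sign convention (a positive lag means the first series leads the second), the function names, and the toy series are assumptions for illustration.

```python
import math


def cross_correlation(x, y, lag):
    """Sample cross-correlation between x[t] and y[t + lag]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # a constant series carries no correlation signal
    if lag >= 0:
        pairs = zip(x[:n - lag], y[lag:])
    else:
        pairs = zip(x[-lag:], y[:n + lag])
    return sum((a - mx) * (b - my) for a, b in pairs) / (sx * sy)


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] that maximizes the cross-correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: cross_correlation(x, y, k))


# Toy attention signals: the second spikes two utterances after the first.
sex_sentiment = [0, 0, 1, 0, 0, 0, 0, 0]
violence_semantics = [0, 0, 0, 0, 1, 0, 0, 0]
lag = best_lag(sex_sentiment, violence_semantics, max_lag=7)
print(lag)  # 2
```

Swapping the two arguments flips the sign of the returned lag, so the order in which the tasks are passed determines which one is reported as leading.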
On the relation between risk behaviors and MPAA ratings. Finally, we measure the relation between the predicted risk behaviors and the movie's MPAA rating. We find that as sexual content increases, the association between violent (or substance-abuse) content and MPAA rating decreases. Specifically, movies with a high sexual rating are more likely to be rated R⁸, irrespective of their violent or substance-abuse content (odds ratio OR = 12.172; 95% CI: 7.86 to 19.46). In contrast, the MPAA rating of a movie with low sexual content is strongly associated with both its violent content rating ($\chi^2(6) = 18.595$, $p = 0.004$) and its substance-abuse content rating ($\chi^2(3) = 17.99$, $p < 0.001$). These results point to the oversensitivity of MPAA raters towards sexual content and corroborate previous findings from small manually annotated samples of films (Tickle et al., 2009; Thompson and Yokota, 2004).

Conclusion
We designed a multi-task model to capture the co-occurrence of depictions of violent content as well as sexual and substance abuse risk behaviors in film through the language data available in scripts. Our proposed model achieves significant improvements over previous state-of-the-art models for violent content rating prediction. While complementing audio-visual methods, our language-based models can be used to identify subtleties in the way risk behavior content is portrayed, before production begins, offering a valuable tool for content creators and decision makers in entertainment media.

Figure 4: (a) The Exorcist (Friedkin, 1973), and (b) From Russia With Love (Young, 1963). Sex-sentiment (green) leads the violence-semantics (red) by 31 (ρ = 0.23) and 203 (ρ = 0.29) utterances respectively.

Future Work
Our overarching goal is to identify when (and how often) characters are portrayed as targets of risk behaviors, especially when those characters are women and minorities. The next step towards this goal would be to recognize when characters refer to one another, and how this contributes to the movie-level risk behavior rating. We hope this leads to tools that can be helpful during the creative process, rather than after the fact.