Analyzing Framing through the Casts of Characters in the News

We present an unsupervised model for the discovery and clustering of latent “personas” (characterizations of entities). Our model simultaneously clusters documents featuring similar collections of personas. We evaluate this model on a collection of news articles about immigration, showing that personas help predict the coarse-grained framing annotations in the Media Frames Corpus. We also introduce automated model selection as a fair and robust form of feature evaluation.


Introduction
Social science tells us that communication almost inescapably involves framing: choosing "a few elements of perceived reality and assembling a narrative that highlights connections among them to promote a particular interpretation" (Entman, 2007). Memorable examples include loaded phrases (death tax, war on terror), but the literature attests a much wider range of linguistic means toward this end (Pan and Kosicki, 1993; Greene and Resnik, 2009; Choi et al., 2012; Baumer et al., 2015).
Framing is associated with several phenomena to which NLP has been applied, including ideology (Lin et al., 2006; Hardisty et al., 2010; Iyyer et al., 2014), sentiment (Pang and Lee, 2008; Feldman, 2013), and stance (Walker et al., 2012; Hasan and Ng, 2013). Although such author attributes are interesting, framing scholarship is concerned with persistent patterns of representation of particular issues (without necessarily tying these to the states or intentions of authors) and the effects that such patterns may have on public opinion and policy. We also note that NLP has often been used in large-scale studies of news and its relation to other social phenomena (Leskovec et al., 2009; Gentzkow and Shapiro, 2010; Niculae et al., 2015).
Can framing be automatically recognized? If so, social-scientific studies of framing will be enabled by new measurements, and new applications might bring framing effects to the consciousness of everyday readers. Several recent studies have begun to explore unsupervised framing analysis of political text using autoregressive and hierarchical topic models (Nguyen et al., 2013;Nguyen et al., 2015;Tsur et al., 2015), but most of these conceptualize framing along a single dimension. Rather than trying to place individual articles on a continuum from liberal to conservative or positive to negative, we are interested in discovering broad-based patterns in the ways in which the media communicate about issues.
Here, our focus is on the narratives found in news stories, specifically the participants in those stories. Insofar as journalists make use of archetypal narratives (e.g., the struggle of an individual against a more powerful adversary), we expect to see recurring representations of characters in these narratives (Schneider and Ingram, 1993;Van Gorp, 2010). A classic example is the contrast between "worthy" and "unworthy" victims (Herman and Chomsky, 1988). More recently, Glenn Greenwald has pointed out how he was repeatedly characterized as an activist or blogger, rather than a journalist during his reporting on the NSA (Greenwald, 2014).
Our model builds on the "Dirichlet persona model" (DPM) introduced by Bamman et al. (2013) for the unsupervised discovery of what they called "personas" in short film summaries (e.g., the "dark hero"). As in the DPM, we operationalize personas as mixtures of textually-expressed characteristics: what they do, what is done to them, and their descriptive attributes. We begin by providing a description of our full model, after which we highlight the differences from the DPM. This paper's main contributions are:
• We strengthen the DPM's assumptions about the combinations of personas found in documents, applying a Dirichlet process prior to infer patterns of co-occurrence (§3). The result is a clustering of documents based on the collections of personas they use, discovered simultaneously with those personas.
• Going beyond named characters, we allow Bamman-style personas to account for entities like institutions, objects, and concepts (§5).
• We find that our model produces interpretable clusters that provide insight into our corpus of immigration news articles (§6).
• We propose a new kind of evaluation based on Bayesian optimization. Given a supervised learning problem, we treat the inclusion of a candidate feature set (here, personas) as a hyperparameter to be optimized alongside other hyperparameters (§7).
• In the case of U.S. news stories about immigration, we find that personas are, in many cases, helpful for automatically inferring the coarse-grained framing and tone employed in a piece of text, as defined in the Media Frames Corpus (Card et al., 2015) (§7).

Model Description
The plate diagram for the new model is shown in Figure 1 (right), with the original DPM (Bamman et al., 2013) shown on the left. As evidence, the model considers tuples ⟨w, r, e, i⟩, where w is a word token and r is the category of syntactic relation¹ it bears to an entity with index e mentioned in the document with index i. The model's generative story explains this evidence as follows:
1. Let there be K topics as in LDA (Blei et al., 2003). Each topic φ_k ∼ Dir(γ) is a multinomial over the V words in the vocabulary, drawn from a Dirichlet parameterized by γ.
2. For each of P personas p, and for each syntactic relation type r, define a multinomial ψ_{p,r} over the K topics, each drawn from a Dirichlet parameterized by β.
3. Assume an infinite set of distributions over personas drawn from a base distribution H. Each of these θ_j ∼ Dir(α) is a multinomial over the P personas, with an associated probability of being selected π_j, drawn from the stick-breaking process with hyperparameter λ.
4. For each document i:
(a) Draw a cluster assignment s_i ∼ π, with corresponding multinomial distribution over personas θ_{s_i}.
(b) For each entity e participating in i:
i. Draw e's persona p_e ∼ θ_{s_i}.
ii. For every ⟨r, w⟩ tuple associated with e in i, draw z ∼ ψ_{p_e,r}, then w ∼ φ_z.
The DPM (Figure 1, left) has a similar generative story, except that each document has a unique distribution over personas. As such, step 4(a) is replaced with a draw from a symmetric Dirichlet distribution θ_i ∼ Dir(α).
¹ We adopt the terminology from Bamman et al. (2013) of "agent", "patient", and "attribute", even though these categories of relations are defined in terms of syntactic dependencies.
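The generative story above can be simulated directly. The following is a minimal sketch, not the authors' implementation: it truncates the stick-breaking process at a fixed number of clusters, picks relation types uniformly at random (in the model, r is observed rather than generated), and uses small illustrative sizes. All function names, default values, and the truncation level are assumptions made for this example.

```python
import random

def dirichlet(concentration, n, rng):
    """Draw from a symmetric Dirichlet over n outcomes via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def stick_breaking(lam, truncation, rng):
    """Truncated stick-breaking weights pi_1..pi_T with concentration lam."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        b = rng.betavariate(1.0, lam)
        weights.append(remaining * b)
        remaining *= 1.0 - b
    weights.append(remaining)  # leftover mass is assigned to the last stick
    return weights

def categorical(probs, rng):
    """Sample an index in proportion to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(n_docs, rng, K=4, P=3, V=20,
             relations=("agent", "patient", "attribute"),
             alpha=0.5, beta=0.5, gamma=0.5, lam=1.0, truncation=10,
             entities_per_doc=3, tuples_per_entity=4):
    phi = [dirichlet(gamma, V, rng) for _ in range(K)]             # step 1: topics
    psi = {(p, r): dirichlet(beta, K, rng)                         # step 2: personas
           for p in range(P) for r in relations}
    theta = [dirichlet(alpha, P, rng) for _ in range(truncation)]  # step 3: casts
    pi = stick_breaking(lam, truncation, rng)
    tuples = []
    for i in range(n_docs):                                        # step 4
        s_i = categorical(pi, rng)                                 # 4(a): cluster
        for e in range(entities_per_doc):                          # 4(b)
            p_e = categorical(theta[s_i], rng)                     # i: persona
            for _ in range(tuples_per_entity):
                r = rng.choice(relations)   # assumption: r is observed in the model
                z = categorical(psi[(p_e, r)], rng)                # ii: topic
                w = categorical(phi[z], rng)                       # ii: word
                tuples.append((w, r, e, i))
    return tuples

sample = generate(n_docs=5, rng=random.Random(0))
```

Each returned element is a ⟨w, r, e, i⟩ tuple, with w given as a vocabulary index. Replacing the shared `theta[s_i]` with a fresh Dirichlet draw per document would give the DPM variant described above.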

Clustering Stories
The DPM assumes that each document has a unique distribution (θ i ) from which its personas are drawn. However, for entities mentioned in news articles (as well as for the dramatis personae of films), we would expect certain types of personas to occur together frequently, such as articles about lawmakers and laws. Thus we would like to cluster documents based on their "casts" of personas. To do this, we have added a Dirichlet process (DP) prior on the document-specific distribution over personas (step 3), which allows the number of clusters to adapt to the size and complexity of the corpus (Antoniak, 1974;Escobar and West, 1994).
Although the model admits an unbounded number of distributions over personas, the properties of DPs are such that the number used by D documents will tend to be much less than D. As a result, inference under this model provides topics φ (distributions over words) interpretable as textual descriptors of entities, personas ψ (distributions over reusable topics), and clusters of articles s with associated distributions over personas θ.
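The claim that D documents tend to occupy far fewer than D clusters can be checked by simulating the Chinese restaurant process prior on its own. This is a sketch of the prior only (the collapsed Gibbs sampler would additionally weight each choice by the likelihood of the document's personas), and the concentration value and function name are assumptions.

```python
import random

def crp_assignments(n_docs, lam, rng):
    """Assign n_docs documents to clusters under a CRP with concentration lam."""
    counts = []        # counts[j] = number of documents currently in cluster j
    assignments = []
    for _ in range(n_docs):
        # An existing cluster is chosen in proportion to its size,
        # a new cluster in proportion to lam.
        weights = counts + [lam]
        j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if j == len(counts):
            counts.append(1)
        else:
            counts[j] += 1
        assignments.append(j)
    return assignments, counts

assignments, counts = crp_assignments(1000, lam=1.0, rng=random.Random(0))
print(len(counts), "clusters for", len(assignments), "documents")
```

Under this prior the expected number of occupied clusters grows only logarithmically in the number of documents, which is why the count stays small relative to D.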
Following Bamman et al. (2013), we perform inference using collapsed Gibbs sampling, collapsing out the distributions over words (φ), topics (ψ), and personas (θ), as well as π. On each iteration, we first sample a cluster for each document, followed by a persona for each entity, followed by a topic for each tuple. Because we assume a conjugate base measure, sampling clusters can be done efficiently using the Chinese restaurant process (Aldous, 1985) for story types, personas, and topics, with slice sampling for hyperparameters (α, β, γ, λ). Because such algorithms are well known to NLP readers, we have relegated details to the supplementary material.
During sampling, we discard samples from the first 10,000 iterations, and collect one sample from every tenth iteration for the following 1,000 iterations. We sample hyperparameters every 20 iterations for the first 500 iterations, and every 100 thereafter.
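The sampling schedule can be written down as two small predicates. This sketch assumes 1-based iteration numbers and that the first retained sample is the tenth iteration after burn-in; the text does not specify either detail, and the function names are illustrative.

```python
def collects_sample(it, burn_in=10_000, n_collect=1_000, thin=10):
    """True if 1-based iteration `it` is after burn-in and is every tenth iteration."""
    return burn_in < it <= burn_in + n_collect and (it - burn_in) % thin == 0

def resamples_hyperparameters(it):
    """Every 20 iterations during the first 500, every 100 afterwards."""
    return it % 20 == 0 if it <= 500 else it % 100 == 0

total = 11_000
kept = [it for it in range(1, total + 1) if collects_sample(it)]
hyper = [it for it in range(1, total + 1) if resamples_hyperparameters(it)]
print(len(kept), "samples kept;", len(hyper), "hyperparameter updates")
```

With these settings, 100 samples are retained per run, and these are the samples that the later feature proportions are computed over.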

Dataset
The Media Frames Corpus (MFC; Card et al., 2015) consists of annotations for approximately 4,200 articles about immigration taken from 13 U.S. newspapers over the years 1980-2012. The annotations for these articles are in terms of a set of 15 general-purpose "framing dimensions" (such as Politics and Legality), developed to be broadly applicable to a variety of issues, and to be recognizable in text (by trained annotators). Each article has been annotated with a "primary frame" (the overall dominant aspect of immigration being emphasized), as well as an overall "tone" (pro, neutral, or anti), which is the extent to which a pro-immigration advocate would like to see the article in print, without implying any stance taken by the author. The MFC contains at least two independent annotations for each article; agreement on the primary frame and tone was established through discussion in cases of initial disagreement. A complete list of these framing dimensions is given in the supplementary material.
In order to train our model on a larger collection of articles, we use the original corpus of articles from which the annotated articles in the MFC were drawn. This produces a corpus of approximately 37,000 articles about immigration; we train the persona model on this larger dataset, only using the smaller set for evaluation on a secondary task. Note that the MFC annotations are not used by our model; rather, we hypothesize that the personas it discovers may serve as features to help predict framing, and this serves as one of our evaluations (§7).

Identifying Entities
The original focus of the DPM was on named characters in movies, which could be identified using named entity recognition and pronominal coreference (Bamman et al., 2013), or name matching for pre-defined characters (Bamman et al., 2014). Here, we are interested in applying our model to entities about which we assume no specific prior knowledge.
In order to include a broader set of entities, we preprocess the corpus and apply a series of filters. First, we obtain lemmas, part-of-speech tags, dependencies, coreference resolution, and named entities from the Stanford CoreNLP pipeline (Manning et al., 2014), as well as supersense tags from the AMALGrAM tagger (Schneider and Smith, 2015). For each document, we consider all tokens with a NN* or PRP part of speech as possible entities, partially clustered by coreference. We then merge all clusters (including singletons) within each document that share a non-pronominal mention word.
Next, we exclude all clusters lacking at least one mention classified as a person, organization, location, group, object, artifact, process, or act (by CoreNLP or AMALGrAM). From these, we extract ⟨w, r, e, i⟩ tuples using extraction patterns lightly adapted from Bamman et al. (2013). (The complete set of patterns is given in the supplementary material.) To further restrict the set of entities to those that have sufficient evidence, we construct a vocabulary for each of the three relations, and exclude words that appear fewer than three times in the corresponding vocabulary. We then apply one last filter to exclude entities that have fewer than three qualifying tuples across all mentions. From the dataset described in §4, we extract 128,655 entities, mentioned using 11,262 different mention words, with 575,910 tuples and 11,104 distinct ⟨r, w⟩ pairs.
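The last two filters (rare words per relation, then entities with too little evidence) can be sketched as follows. This is an illustrative reading of the description above, assuming tuples are already extracted as (w, r, e, i) and that an entity is identified by its document index together with its entity index; the function name and toy data are hypothetical.

```python
from collections import Counter

def filter_tuples(tuples, min_word_count=3, min_entity_tuples=3):
    """Return the (w, r, e, i) tuples that survive both frequency filters."""
    tuples = list(tuples)
    # Per-relation vocabulary: a word is counted separately for each relation r.
    vocab = Counter((r, w) for w, r, e, i in tuples)
    qualifying = [t for t in tuples if vocab[(t[1], t[0])] >= min_word_count]
    # Keep only entities with enough qualifying tuples across all mentions.
    per_entity = Counter((i, e) for w, r, e, i in qualifying)
    return [t for t in qualifying if per_entity[(t[3], t[2])] >= min_entity_tuples]

toy = (
    [("arrest", "patient", 0, 0)] * 3      # entity 0 in document 0
    + [("vacillate", "agent", 0, 0)]       # rare word, dropped by the vocabulary filter
    + [("arrest", "patient", 0, 1)] * 2    # entity 0 in document 1: too few tuples
)
kept_tuples = filter_tuples(toy)
```

In the toy data only the three "arrest" tuples of the first entity survive: the rare word is removed first, and the second entity then falls below the three-tuple threshold.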

Exploratory Analysis
Here we discuss our model, as estimated on the corpus of 37,000 articles discussed in §4 with 50 personas and 100 topics; these values were not tuned.
A cursory examination of topics shows that each tends to be a group of either verbs or attributes. Personas, on the other hand, blend topics to include all three relation types. The estimated Dirichlet hyperparameters are all ≪ 1, giving sparse (and hence easily scanned) distributions over personas, topics, and words. Table 1 shows all 50 personas. For each p, we show (i) the mention words most strongly associated with p, and (ii) ⟨r, w⟩ pairs associated with the persona. (To save space, "I" denotes immigrant.) Recall that, like the Dirichlet persona model, our model says nothing about the mention words; they are not included as evidence during inference. Nonetheless, each persona is strongly associated with a sparse handful of mention words, and we find that labeling each persona by its most strongly associated mention word (excluding immigrant) is often sensible (these are capitalized in Table 1), though in some cases the relation words differentiate strongly (e.g., the group personas, IDs 17 and 18 in Table 1).
The model finds expected participants (such as workers, political candidates, and refugees), but also more conceptual entities, such as laws, bills (IDs 3, 37), and the U.S.-Mexican border (ID 5), which looms large in the immigration debate. Some interesting distinctions are discovered, such as two of the worker personas, one high-skilled and residing legally (ID 48), the other illegal (ID 49).
Using the original publication dates of the articles, we can estimate the frequency of appearance of each persona within immigration coverage by summing the posterior distribution over personas for each entity mention, and plotting these frequencies across time. (Note that time metadata is not given to the model as evidence.) We find immediately that personas can signal events. Figure 2 shows these temporal trajectories for a small, selected set of personas. Although bills and laws are conceptually similar, and have similar trajectories from 1980 to 2005, they are strongly divergent in 2006 and 2010. These are particularly notable years for immigration policy, corresponding to the failed Comprehensive Immigration Reform Act of 2006 (Senate bill S.2611) and Arizona's controversial anti-immigration laws from 2010. Refugees, by contrast, show a marked spike around the year 2000. Inspection showed this persona to be strongly tied to the case of Elián González, which received a great deal of media attention in that year.
The main advantage of the extended model over the DPM is being able to cluster articles by "casts." During sampling, thousands of clusters are created (and mostly destroyed). Ultimately, our inference procedure settled on approximately 110 clusters, and we consider two examples. Figure 3 shows the temporal trajectories of the two clusters with the greatest representation of the refugee persona. Both show the characteristic spike around the year 2000. The top personas for these two clusters are given in Table 2. Type A, which includes a story with the headline "Protesters vow to keep Elián in U.S.," emphasizes political aspects, while type B (e.g., "Court says no to rights for refugees") emphasizes legal aspects. Note that Political and Legality are two of the framing dimensions used in the MFC.

Table 1 (rows 1-10 only):
ID | Mention words | Relations
1 | AGENT police official authority | federal_m tell_p find_a arrest_a local_m tell_a
2 | ASYLUM crime refugee asylum seeker | political_m seek_p grant_p commit_p serious_m deny_p
3 | BILL law immigration reform measure | comprehensive_m pass_a pass_p make_a have_a support_p
4 | BOAT van crime document | criminal_m other_m have_p use_a use_p be_a
5 | BORDER border patrol border agent | mexican_m cross_p secure_p southern_m u.s.-mexico_m close_p
6 | BUSH official mcnary people I | have_a tell_a want_a tell_p former_m call_a
7 | CANDIDATE bush romney leader | republican_m presidential_m democratic_m have_a call_a support_a
8 | CARD document visa status | green_m new_m get_p temporary_m fake_m permanent_m
9 | CARD visa state document | consular_m federal_m have_a mexican_m receive_p get_p
10 | COMPANY country I state nation | have

Do these persona-cast clusters relate to frames? For the five most common story clusters (which have no overlap with the two refugee story types), Figure 4 shows the number of annotated articles with each of the primary frames if we assign each article to its most likely cluster. The second and fifth clusters correlate particularly well with primary frames (Political and Crime, respectively). This is further reinforced by looking at the most frequent persona for each of these story clusters, which are candidate (ID 7) for the second and immigrant (ID 22), characterized by illegal_m and arrest_p, for the fifth.

Experiments: Personas and Framing
We evaluate personas as features for automatic analysis of framing and tone, as defined in the MFC (§4). Specifically, we build multi-class text classifiers (separately) for the primary frame and the tone of a news article, for which there are 15 and 3 classes, respectively. Because there are only a few thousand annotated articles, we applied 10-fold cross-validation to estimate performance. Features are derived from our model by considering each persona and each story cluster as a potential feature. A document's feature values for story types are the proportion of samples in which it was assigned to each cluster. Persona feature values are similarly derived by the proportion of samples in which each entity was assigned to each persona, with the persona values for each entity in each document summed into a single set of persona values per document. We did not use the topics (z) discovered by our model as features.
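The feature construction just described amounts to averaging indicator vectors over the collected Gibbs samples. The sketch below assumes each document's sampled cluster indices and each entity's sampled persona indices are available as plain lists; the function names and toy values are illustrative.

```python
def cluster_features(cluster_samples, n_clusters):
    """Proportion of collected samples in which the document sat in each cluster."""
    feats = [0.0] * n_clusters
    for s in cluster_samples:
        feats[s] += 1.0 / len(cluster_samples)
    return feats

def persona_features(entity_samples, n_personas):
    """Per-entity persona proportions, summed over the entities of one document."""
    feats = [0.0] * n_personas
    for samples in entity_samples:          # one list of sampled personas per entity
        for p in samples:
            feats[p] += 1.0 / len(samples)
    return feats

# One document, four collected samples, three clusters and three personas.
doc_clusters = cluster_features([2, 2, 2, 0], n_clusters=3)
doc_personas = persona_features([[1, 1, 1, 1], [0, 1, 0, 1]], n_personas=3)
print(doc_clusters, doc_personas)
```

Note that the persona values for a document sum to its number of entities rather than to one, since the per-entity proportions are added together.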

Experiment 1: Direct Comparison
For the first experiment, we train independent multiclass logistic regression classifiers for predicting primary frame and tone. We consider adding persona and/or story cluster features to baseline classifiers based only on unigrams and bigrams with binarized counts, a simple but robust baseline (Wang and Manning, 2012). In all cases, we use L1 regularization and use 5-fold cross-validation within each split's training set to determine the strength of regularization. We then repeat this for each of the 10 folds, thereby producing one prediction (of primary frame and tone) for every annotated article. The results of this experiment are given in Table 3; for predicting the primary frame, classifiers that used persona and/or story cluster features achieve higher accuracy than the bag-of-words baseline (W); the classifier using personas from our model but not story clusters is significantly better than the baseline. The enhanced models are also more compact, on average, using fewer effective features. A benefit to predicting tone is also observed, but it did not reach statistical significance.

Experiment 2: Automatic Evaluation
Although bag-of-n-grams models are known to be a strong baseline for text classification, researchers familiar with the extensive catalogue of features offered by NLP will potentially see them as a straw man. We propose a new and more rigorous method of comparison, in which a wide range of features are offered to an automatic model selection algorithm for each of the prediction tasks, with the features to be evaluated withheld from the baseline. Because no single combination of features and regularization strength is best for all situations, it is an empirical question which features are best for each task. We therefore make use of Bayesian optimization (Bayesopt) to make as many modeling decisions as possible (Pelikan, 2005; Snoek et al., 2012; Bergstra et al., 2015; Yogatama et al., 2015).
In particular, let F be the set of features that might be used as input to any text classification algorithm. Let f be a new feature that is being proposed. Allow the inclusion or exclusion of each feature in the feature set to be a hyperparameter to be optimized, along with any additional decisions such as input transformations (e.g., lowercasing), and feature transformations (e.g., normalization). Using an automatic model selection algorithm such as Bayesian optimization, allow the performance on the validation set to guide choices about all of these hyperparameters on each iteration, and set up two independent experiments.
For the first condition, A_1, allow the algorithm access to all features in F. For the second, A_2, allow the algorithm access to all features in F ∪ {f}. After R iterations of each, choose the best model or the best set of models from each of A_1 and A_2 (M_1 and M_2, respectively), based on performance on the validation set. Finally, compare the selected models in terms of performance on the test set (using an appropriate metric such as F_1), and examine the features included in each of the best models. If f is a helpful feature, we should expect to see that (a) F_1(M_2) > F_1(M_1), and (b) f is included in the best model(s) found by A_2.
If F_1(M_2) > F_1(M_1) but f is not included in the best models from A_2, this suggests that the performance improvement may simply be a matter of chance, and there is no evidence that f is helpful. By contrast, if f is included in the best models, but F_1(M_2) is not significantly better than F_1(M_1), this suggests that f is offering some value, perhaps in a more compressed form of the useful signal from other features, but does not actually offer better performance.
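The two-condition protocol can be sketched in a few lines. To keep the example self-contained, plain random search stands in for Bayesian optimization, and `toy_evaluate` is a hypothetical placeholder for training a classifier and scoring it on validation and test data; neither reflects the actual experimental setup.

```python
import random

def search(features, evaluate, R, rng):
    """Try R random configurations; return the one with the best validation score."""
    best = None
    for _ in range(R):
        config = {
            # Inclusion of each feature set is itself a hyperparameter.
            "features": frozenset(f for f in features if rng.random() < 0.5),
            "lowercase": rng.random() < 0.5,
        }
        val, test = evaluate(config)
        if best is None or val > best["val"]:
            best = {"config": config, "val": val, "test": test}
    return best

def compare(F, f, evaluate, R=40, seed=0):
    """Run condition A_1 (features F) and A_2 (features F plus f) independently."""
    m1 = search(sorted(F), evaluate, R, random.Random(seed))
    m2 = search(sorted(F | {f}), evaluate, R, random.Random(seed + 1))
    return {
        "test_M1": m1["test"],
        "test_M2": m2["test"],
        "f_selected": f in m2["config"]["features"],
    }

def toy_evaluate(config):
    """Hypothetical scorer: pretends unigrams and personas each carry half the signal."""
    score = len(config["features"] & {"unigrams", "personas"}) / 2
    return score, score  # (validation score, test score)

result = compare({"unigrams", "bigrams", "brown_clusters"}, "personas", toy_evaluate)
print(result)
```

Reading the result follows the two criteria above: evidence for f requires both a higher test score for M_2 and the presence of f among the selected features, which is why the sketch reports the two separately.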
For this experiment, we use the tree-structured Parzen estimator for Bayesian optimization (Bergstra et al., 2015), with L1-regularized logistic regression as the underlying classifier, and set R = 40. In addition to the entities and story clusters identified by these models, we allow these classifiers access to a large set of features, including unigrams, bigrams, parts of speech, named entities, dependency tuples, ordinal sentiment values (Manning et al., 2014), multi-word expressions (Justeson and Katz, 1995), supersense tags (Schneider and Smith, 2015), Brown clusters (Brown et al., 1992), frame semantic features (Das et al., 2010), and topics produced by standard LDA (Blei et al., 2003). The inclusion or exclusion of each feature is determined automatically on each iteration, along with feature transformations (removal of rare words, lowercasing, and binary or normalized counts).
The baseline, denoted "B," offers all features except personas and story clusters to Bayesopt; we consider adding DPM personas, our model's personas, and our model's personas and story clusters. Table 4 shows test-set accuracy for each setup, averaged across the three best models returned by Bayesopt.
Using this more rigorous form of evaluation, approximately the same accuracy is obtained in all experimental conditions. However, we can still gain insight into which features are useful by examining those selected by the best models in each condition. For primary frame prediction, both personas and story clusters are included by the best models in every case where they have been offered as possible features, as are unigrams, dependency tuples, and semantic frames. Other commonly-selected features include bigrams and part-of-speech tags. For predicting tone, personas are only included by half of the best models, with the most common features being unigrams, bigrams, semantic frames, and Brown clusters. As expected, the best models in each condition obtain better performance than the models from experiment 1, thanks to the inclusion of additional features and transformations.
This secondary evaluation suggests that for this task, persona features are useful in predicting the primary frame, but are unable to offer improved performance over existing features, such as semantic frames. However, the fact that both personas and story clusters are included by all the best models for predicting the primary frame suggests that they are competitive with other features, and perhaps offer useful information in a more compact form.

Qualitative Evaluation
Prior to exposure to any output of our model, one of the co-authors on this paper (Gross, who has expertise in both framing and the immigration issue) prepared a list of personas he expected to frequently occur in American news coverage of immigration. Given the example of the "skilled immigrant," he listed 22 additional named personas, along with a few examples of things they do, things done to them, and attributes.
The list he prepared includes several different characterizations of immigrants (low-skilled, unauthorized, legal, citizen children, undocumented children, refugees, naturalized citizens), non-immigrant personas (U.S. workers, smugglers, politicians, officials, border patrol, vigilantes), related pairs (pro / anti advocacy groups, employers / guest workers, criminals / victims), and a few more conceptual entities (the border, bills, executive actions). Of these, almost all are arguably represented in the personas we have discovered. However, there is rarely a perfect one-to-one mapping: predefined personas are sometimes merged (e.g., "the border" and "border patrols") or split (e.g., legislation, employers, and various categories of immigrants). Personas which don't emerge from our model include smugglers, guest workers, vigilantes, and victims of immigrant criminals. On the other hand, our model proposes far more non-person entities, such as ID cards, courts, companies, jobs, and programs.
These partial matchings between predefined personas and the results of our model are generally identifiable by comparing the names given to the predefined personas to the most commonly occurring mention words and attributes of our discovered personas. The attributes and action words given to the predefined personas are harder to evaluate, as many of them are rare (e.g., politicians "vacillate") or compound phrases (e.g., low-skilled immigrants "do jobs Americans won't do") that tend to miss the more obvious properties captured by our model. For example, the employer persona captured by our model engages in actions like hire, employ, and pay. By contrast, the terms given for the predefined "business owners" persona are "lobby" and "rely on immigrant labor." Our unsupervised discovery of this persona can clearly be matched to the predefined persona in this case, but doesn't provide such fine-grained insight into how they might be characterized.
The best match between predefined and discovered personas is the U.S.-Mexican border. Of the words given for the predefined persona, almost all are more frequently associated with border than with any other discovered persona ("Mexican-U.S.," "lawless," "porous," "unprotected," "guarded," and "militarized"). The most commonly associated words discovered by our model that are missing from the predefined description include crossed, secured, southern, and closed.
While this qualitative evaluation helps to demonstrate the face validity of our model, it would be better to have a more comprehensive set of predefined personas, based on input from additional experts. Moreover, it also illustrates the challenge of trying to match the output of an unsupervised model to expected results. Not only is some merging and splitting of categories inevitable, there was a mismatch in this case in the types of entities to be described (people as opposed to more abstract entities), and the ways of describing them (rare but specific words as opposed to more generic but potentially obvious terms).

Related Work
Much NLP has focused on identifying entities or events (Ratinov and Roth, 2009; Ritter et al., 2012), analyzing schemas or narrative events in terms of characters (Chambers and Jurafsky, 2009), inferring the relationships between entities (Iyyer et al., 2016), and predicting personality types from text (Flekova and Gurevych, 2015). Bamman also applied variants of the DPM to characters in novels (Bamman et al., 2014).
Previous work on sentiment, stance, and opinion mining has focused on recognizing stance or political sentiment in online ideological debates (Somasundaran and Wiebe, 2010; Hasan and Ng, 2014; Sridhar et al., 2015), and other forms of social media (O'Connor et al., 2010; Agarwal et al., 2011), and recently through the lens of connotation frames (Rashkin et al., 2016). Opinion mining and sentiment analysis are the subject of ongoing research in NLP and have long served as test platforms for new methodologies (Socher et al., 2013; İrsoy and Cardie, 2014; Tai et al., 2015).
Framing is arguably one of the most important concepts in the social sciences, with roots in sociology, psychology, and mass communication (Gitlin, 1980; Benford and Snow, 2000; D'Angelo and Kuypers, 2010); the scope and relevance of framing is widely debated (Rees et al., 2001), with many authors applying the concept of framing to analyzing documents on particular issues (Baumgartner et al., 2008; Berinsky and Kinder, 2006).

Conclusion
We have extended models for discovering latent personas to simultaneously cluster documents by their "casts" of personas. Our exploration of the model's inferences and their incorporation into a challenging text analysis task (characterizing coarse-grained framing in news articles) demonstrate that personas are a useful abstraction when applying NLP to social-scientific inquiry. Finally, we introduced a Bayesian optimization approach to rigorously assess the usefulness of new features in machine learning tasks.