Hierarchical Encoders for Modeling and Interpreting Screenplays

While natural language understanding of long-form documents remains an open challenge, such documents often contain structural information that can inform the design of models encoding them. Movie scripts are an example of such richly structured text – scripts are segmented into scenes, which decompose into dialogue and descriptive components. In this work, we propose a neural architecture to encode this structure, which performs robustly on two multi-label tag classification tasks without using handcrafted features. We add a layer of insight by augmenting the encoder with an unsupervised ‘interpretability’ module, which can be used to extract and visualize narrative trajectories. Though this work specifically tackles screenplays, we discuss how the underlying approach can be generalized to a range of structured documents.


Introduction
As natural language understanding of sentences and short documents continues to improve, interest in tackling longer-form documents such as academic papers (Ren et al., 2014; Bhagavatula et al., 2018), novels (Iyyer et al., 2016) and screenplays (Gorinski and Lapata, 2018) has been growing. Analyses of such documents can take place at multiple levels, e.g. identifying both document-level labels (such as genre) and narrative trajectories (how do levels of humor and romance vary over the course of a romantic comedy?). However, one key challenge for these tasks is the low signal-to-noise ratio in lengthy texts (as indicated by the performance of such models on curated datasets like NarrativeQA (Kočiský et al., 2018)), which makes it difficult to apply end-to-end (E2E) neural network solutions that have recently achieved state-of-the-art results on other tasks (Barrault et al., 2019; Williams et al., 2018; Wang et al., 2019). Instead, models either rely on a) a pipeline that provides a battery of syntactic and semantic information from which to craft features (e.g., the BookNLP pipeline (Bamman et al., 2014) for literary text, graph-based features (Gorinski and Lapata, 2015) for movie scripts, or outputs from a discourse parser (Ji and Smith, 2017) for text categorization) and/or b) the linguistic intuitions of the model designer to select features relevant to the task at hand (e.g., rather than ingest the entire text, Bhagavatula et al. (2018) only consider certain sections like the title and abstract of an academic publication). While there is much to recommend these approaches, E2E neural modeling offers several key advantages: it obviates the need for auxiliary feature-generating models, minimizes the risk of error propagation, and offers improved generalization across large-scale corpora. This work explores how the inherent structure of a document class can be leveraged to facilitate an E2E approach.
* Work done during an internship at Netflix.
We focus on screenplays, investigating whether we can effectively extract key information by first segmenting them into scenes, and further exploiting the structural regularities within each scene.
With an average of >20k tokens per script in our evaluation corpus, extracting salient aspects is far from trivial. Through a series of carefully controlled experiments, we show that a structure-aware approach significantly improves document classification by effectively collating sparsely distributed information. Further, this method produces both document- and scene-level embeddings, which can be used downstream to visualize narrative trajectories of interest (e.g., the prominence of various themes across a script). The overarching strategy of this work is to incorporate structural priors as biases into the neural architecture itself (e.g., Socher et al. (2013), Strubell et al. (2018), inter alia), whereby, as Henderson (2020) observes, "locality in the model structure can reflect locality in the linguistic structure" to boost accuracy over feature-engineering approaches. The methods we propose can readily generalize to any long-form text with an exploitable internal structure, including novels (chapters), theatrical plays (scenes), chat logs (turn-taking), online games (levels/rounds/gameplay events), and academic texts (sections and subsections).
We begin by detailing how a script can be formally decomposed first into scenes and further into granular elements with distinct discourse functions, in §2. We then propose an encoder based on hierarchical attention (Yang et al., 2016) that effectively leverages this structure in §3. In §5.3, the predictive performance of the hierarchical encoder is validated on two multi-label tag prediction tasks, one of which rigorously establishes the utility of modeling structure at multiple granularities (i.e. at the level of line, scene, and script). Notably, while the resulting scene-encoded representation is useful for prediction tasks, it is not amenable to easy interpretation or examination. To shed light on the encoded document representations, in §4, we propose an unsupervised interpretability module that can be attached to an encoder of any complexity.
§5.5 outlines our application of this module to the scene encoder, and the resulting visualizations of the screenplay, which illustrate how plot elements vary over the course of the narrative arc. §6 draws connections to related work, before concluding.

Script Structure
Movie and television scripts (or screenplays) are traditionally segmented into scenes, with a rough rule of thumb being that each scene lasts about a minute on-screen. A scene is not necessarily a distinct narrative unit (which is most often a sequence of several consecutive scenes), but is constituted by a piece of continuous action at a single location. Fig. 1 contains a segment of a scene from the screenplay for Pulp Fiction, a 1994 American film. These segments tend to follow a standard format. Each scene starts with a scene heading or 'slug line' that briefly describes the scene setting. A sequence of statements follows, and screenwriters typically use formatting to distinguish between dialogue and action statements (Argentini, 1998). A dialogue statement identifies the character who utters it either on- or off-screen (the latter is often indicated with '(V.O.)' for voice-over). Parentheticals might be used to include special instructions regarding dialogue delivery. Action statements are all non-dialogue constituents of the screenplay, "often used by the screenwriter to describe character actions, camera movement, appearance, and other details" (Pavel et al., 2015). In this work, we consider action and dialogue statements, as well as character identities for each dialogue segment, ignoring slug lines and parentheticals.
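The decomposition of a scene into action and dialogue statements can be illustrated with a small parser. This is a sketch under stated assumptions, not the preprocessing used in this work: the slug-line prefixes (`INT.`/`EXT.`), the indentation threshold of 20 characters for character cues, and the name `parse_scene` are all hypothetical choices, and the sample lines are invented.

```python
import re
from typing import List, Tuple

# Assumption: slug lines begin with INT. or EXT. (a common convention).
SLUG = re.compile(r"^(INT\.|EXT\.)")


def parse_scene(lines: List[str]) -> List[Tuple[str, str, str]]:
    """Return (kind, speaker, text) triples; kind is 'action' or 'dialogue'."""
    statements = []
    speaker = ""  # non-empty while we are inside a dialogue block
    for raw in lines:
        text = raw.strip()
        indent = len(raw) - len(raw.lstrip())
        if not text:
            speaker = ""  # a blank line ends the current dialogue block
        elif SLUG.match(text) or (text.startswith("(") and text.endswith(")")):
            continue  # slug lines and parentheticals are ignored
        elif text.isupper() and indent >= 20:
            # Hypothetical cue rule: deeply indented, all-caps line.
            # Markers such as (V.O.) are stripped from the character name.
            speaker = re.sub(r"\s*\(.*?\)", "", text)
        elif speaker:
            statements.append(("dialogue", speaker, text))
        else:
            statements.append(("action", "", text))
    return statements


sample = [
    "INT. COFFEE SHOP - MORNING",
    "",
    "A man and a woman sit in a booth.",
    "",
    " " * 25 + "YOUNG MAN (V.O.)",
    " " * 15 + "(quietly)",
    " " * 10 + "Forget it.",
    "",
    "The waitress pours coffee.",
]
parsed = parse_scene(sample)
```

Each line of a multi-line dialogue block becomes its own statement here; a fuller version would likely merge them.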

Hierarchical Scene Encoders
The large size of a movie script makes it computationally infeasible for recurrent encoders to ingest these screenplays as single blocks of text. Instead, we propose a hierarchical encoder that mirrors the structure of a screenplay (§2): a sequence of scenes, each of which is an interwoven sequence of action and dialogue statements. The encoder is three-tiered, as illustrated in Fig. 2, and processes the text of a script as follows.

Model Architecture
First, an action-statement encoder transforms the sequence of words in an action statement (represented by their pretrained word embeddings) into an action statement embedding. Next, an action-scene encoder transforms the chronological sequence of action statement embeddings within a scene into an action scene embedding. Analogously, a dialogue-statement encoder and a dialogue-scene encoder generate dialogue statement embeddings and aggregate them into dialogue scene embeddings. To incorporate character information, characters are represented as embeddings (randomly initialized and updated during model training), and an average of embeddings of all characters with at least one dialogue in the scene is computed. Finally, the action, dialogue and averaged character embeddings for a scene are concatenated into a single scene embedding. Scene-level predictions can be obtained by feeding scene embeddings into a subsequent neural module, e.g. a feedforward layer for supervised tagging. Alternatively, a final script encoder can be used to transform the sequence of scene embeddings into a script embedding representing the entire screenplay. A key assumption underlying the model is that action and dialogue statements, as instances of written narrative and spoken language respectively, are distinct categories of text that must be processed separately. We evaluate this assumption in §5.3.
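The three-tier composition described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: every tier is a simple average (a bag-of-embeddings encoder), plain Python lists stand in for tensors, each scene is assumed to contain at least one action statement, one dialogue statement and one speaker, and the names `mean_pool`, `encode_scene` and `encode_script` as well as the toy embeddings are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_scene(action: List[List[str]], dialogue: List[List[str]],
                 speakers: List[str], word_emb: Dict[str, Vector],
                 char_emb: Dict[str, Vector]) -> Vector:
    """Concatenate action, dialogue and averaged character embeddings."""
    # Tier 1 (words -> statement) nested inside tier 2 (statements -> scene).
    action_vec = mean_pool([mean_pool([word_emb[w] for w in s]) for s in action])
    dialogue_vec = mean_pool([mean_pool([word_emb[w] for w in s]) for s in dialogue])
    # Average over the distinct characters who speak in the scene.
    char_vec = mean_pool([char_emb[c] for c in sorted(set(speakers))])
    return action_vec + dialogue_vec + char_vec  # list concatenation


def encode_script(scene_embeddings: List[Vector]) -> Vector:
    """Tier 3: scene embeddings -> a single script embedding."""
    return mean_pool(scene_embeddings)


# Toy embeddings, purely illustrative.
word_emb = {"gun": [1.0, 0.0], "fires": [0.0, 1.0], "run": [1.0, 1.0]}
char_emb = {"A": [0.5], "B": [1.5]}
scene = encode_scene([["gun", "fires"]], [["run"]], ["A", "B", "A"],
                     word_emb, char_emb)
script = encode_script([scene, scene])
```

Replacing `mean_pool` with a recurrent or attention-based encoder at any tier leaves the overall composition unchanged.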

Encoders
The proposed model incorporates strong inductive biases regarding the overall structure of input documents. In addition, since the encoders described in §3.1 are underspecified, we evaluate three instantiations of the encoder components:
1. Sequential (GRU): A bidirectional GRU (Bahdanau et al., 2015) encodes input sequences (of words, statements or scenes). Given a sequence of input embeddings e_1, ..., e_T, we obtain GRU outputs c_1, ..., c_T, and use c_T as the recurrent encoder's final output.
2. Sequential with Attention (GRU + Attn): Attention (Bahdanau et al., 2015) is used to combine c_1, ..., c_T. This allows more or less informative inputs to be filtered accordingly. We calculate attention weights using a parametrized vector p of the same dimensionality as the GRU outputs (Sukhbaatar et al., 2015; Yang et al., 2016), and use these weights to compute the final encoder output.
3. Bag-of-Embeddings with Attention (BoE + Attn): These encoders disregard sequential information to compute an attention-weighted average of the encoder's inputs.
In contrast, a bag-of-embeddings (BoE) encoder computes a simple average of its inputs. While defining a far more constrained function space than recurrent encoders, BoE and BoE + Attn representations have the advantage of remaining in the input word embedding space. We leverage this property in §4 where we develop an interpretability layer on top of the encoder outputs.
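The attention-based pooling used by the GRU + Attn and BoE + Attn encoders can be sketched as below. The scoring function is an assumption: each input is scored by its dot product with the parameter vector p, the scores are normalized with a softmax, and the output is the weighted sum of the inputs. For GRU + Attn the inputs would be the GRU outputs c_1, ..., c_T; for BoE + Attn they would be the input embeddings e_1, ..., e_T. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(inputs: Sequence[Vector], p: Vector) -> List[float]:
    """Assumed scoring: dot product of each input with p, then softmax."""
    scores = [sum(pi * xi for pi, xi in zip(p, x)) for x in inputs]
    return softmax(scores)


def attention_pool(inputs: Sequence[Vector], p: Vector) -> Vector:
    """Attention-weighted sum of the inputs."""
    weights = attention_weights(inputs, p)
    dim = len(inputs[0])
    return [sum(w * x[i] for w, x in zip(weights, inputs)) for i in range(dim)]


# With p = 0 all weights are equal, so the result is the plain BoE average.
pooled = attention_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
```

Because the output is a convex combination of the inputs, a BoE + Attn encoder applied to word embeddings stays within the span of those embeddings, which is the property exploited in §4.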

Loss for Tag Classification
The final script embedding is passed into a feedforward classifier (FFNN). As both supervised learning tasks in our evaluation are multi-label classification problems, we use a variant of a simple multi-label one-versus-rest loss, where correlations among tags are ignored. The tag sets have high cardinalities and the fractions of positive samples are inconsistent across tags (see Appendix Tables 3 & 4); this motivates the use of a reweighted loss function (Eq. 5), in which N is the number of samples, L is the number of tag labels, y ∈ {0, 1} is the target label, z is the output of the FFNN, σ is the sigmoid function, and λ_j is the ratio of positive to negative samples for the j-th tag label (precomputed over the entire training set, since the development set is too small to tune this parameter). With this loss function, we account for label imbalance without tuning separate thresholds for each tag on the validation set.
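One loss consistent with these definitions is a binary cross-entropy in which the negative-class term for tag j is scaled by λ_j, so that positives and negatives carry equal total weight for every tag. The sketch below implements that form; which term carries λ_j is an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def log_sigmoid(z: float) -> float:
    """Stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pos_neg_ratios(labels: List[List[int]]) -> List[float]:
    """lambda_j = (#positive samples) / (#negative samples) for each tag j."""
    ratios = []
    for j in range(len(labels[0])):
        pos = sum(row[j] for row in labels)
        neg = len(labels) - pos  # assumes every tag has at least one negative
        ratios.append(pos / neg)
    return ratios


def reweighted_loss(logits: List[List[float]], labels: List[List[int]],
                    lam: List[float]) -> float:
    """Assumed form: -(1/NL) sum_ij [ y log s(z) + lam_j (1 - y) log(1 - s(z)) ]."""
    n, l = len(labels), len(labels[0])
    total = 0.0
    for z_row, y_row in zip(logits, labels):
        for z, y, lam_j in zip(z_row, y_row, lam):
            # log(1 - sigmoid(z)) equals log(sigmoid(-z)).
            total += y * log_sigmoid(z) + lam_j * (1 - y) * log_sigmoid(-z)
    return -total / (n * l)


labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
lam = pos_neg_ratios(labels)
loss = reweighted_loss([[0.0, 0.0]] * 4, labels, lam)
```

In the toy example each tag has one positive and three negatives, so λ_j = 1/3 and the three negatives together weigh as much as the single positive.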

Interpreting Scene Embeddings
As the complexity of learning methods used to encode sentences and documents has increased, so has the need to understand the properties of the encoded representations. Probing methods (Linzen et al., 2016; Conneau et al., 2018) gauge the information captured in an embedding by evaluating its performance on downstream classification tasks, either with manually collected annotations (Shi et al., 2016) or self-supervised proxies (Adi et al., 2016). In our case, it is laborious and expensive to collect such annotations at the scene level (requiring domain experts), and the proxy evaluation tasks proposed in literature do not probe the narrative properties we wish to surface. Instead, we take inspiration from Iyyer et al. (2016) to learn an unsupervised scene descriptor model that can be trained without relying on such annotations. Using a dictionary learning technique (Olshausen and Field, 1997), the model learns to represent each scene embedding as a weighted mixture of various topics estimated over the entire corpus. It thus acts as an 'interpretability layer' that can be applied over the scene encoder. This model is similar in spirit to dynamic topic models (Blei and Lafferty, 2006), with the added advantage of producing topics that are both more coherent and more interpretable than those generated by LDA (He et al., 2017; Mitcheltree et al., 2018).

Scene Descriptor Model
The model has three main components: a scene encoder whose outputs we wish to interpret, a set of topics or descriptors that are the 'basis elements' used to describe an interpretable scene, and a predictor that predicts weights over descriptors for a given scene embedding. The scene encoder uses the text of a given scene s_t to produce a corresponding scene embedding v_t. This encoder can take any form, from an extractor that derives a hand-crafted feature set from the scene text, as in Gorinski and Lapata (2018), to the scene encoder in §3.
To probe the contents of scene embedding v_t, we compute a descriptor-based representation w_t ∈ R^d in terms of a descriptor matrix R ∈ R^{k×d} that stores k topics or descriptors (Eq. 6): [...]; we use the former in §5.5. Furthermore, we can incorporate additional recurrence into the model by modifying Eq. 6 to add the previous state (Eq. 7). Descriptors are initialized either randomly (Glorot and Bengio, 2010) or with the centroids of a k-means clustering of the input word embeddings. For the predictor, f is a two-layer FFNN with ReLU activations and a softmax layer that transforms v_t (from the scene encoder) into a 100-dimensional intermediate state and then into o_t.
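The descriptor-based representation can be sketched as follows, under stated assumptions: the predictor's final softmax yields a weight vector o_t over the k descriptors, the representation is the weighted combination w_t = R^T o_t of the rows of R, and the recurrent variant interpolates between the current weights and the previous state o_{t-1} with coefficient α (an interpolation in the style of Iyyer et al. (2016); the exact form used in this work is not reproduced here). The feedforward part of the predictor is omitted, so the functions take pre-softmax scores directly, and all names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def descriptor_weights(scores: Sequence[float],
                       prev: Optional[Vector] = None,
                       alpha: float = 0.5) -> Vector:
    """o_t from predictor scores; optionally mixed with the previous state."""
    o = softmax(scores)
    if prev is None:
        return o
    # Assumed recurrence: alpha * current + (1 - alpha) * previous.
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(o, prev)]


def descriptor_embedding(R: List[Vector], o: Vector) -> Vector:
    """w_t = R^T o_t, where R has k rows (descriptors) of dimension d."""
    d = len(R[0])
    return [sum(o_k * row[i] for o_k, row in zip(o, R)) for i in range(d)]


R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # k = 3 toy descriptors, d = 2
o1 = descriptor_weights([0.0, 0.0, 0.0])
w1 = descriptor_embedding(R, o1)
o2 = descriptor_weights([0.0, 0.0, 0.0], prev=[1.0, 0.0, 0.0])
```

Since o_t sums to one in both variants, w_t is always a convex combination of the descriptors.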

Reconstruction Task
We wish to minimize the reconstruction error between two scene representations: (1) the descriptor-based embedding w_t, which depends on the scene embedding v_t, and (2) an attention-weighted bag-of-words embedding for s_t. This encourages the computed descriptor weights to be indicative of the scene's actual content (the portions of its text that indicate attributes of interest such as genre, plot, and mood). We use a BoE+Attn scene encoder (§3.2) pretrained on the tag classification task (bottom right of Fig. 3), which yields a vector u_t ∈ R^d for scene s_t. The scene descriptor model is then trained using a hinge loss objective (Weston et al., 2011) to minimize the reconstruction error between w_t and u_t, with an additional orthogonality constraint on R to encourage semantically distinct descriptors. The objective uses n negative samples u_1, ..., u_n selected from other scenes in the same screenplay. Using a BoE+Attn scene encoder to produce the "target" u_t forces w_t (and therefore the rows of R) to lie in the same space as the input word embeddings. Thus, a given descriptor can be semantically interpreted by querying its nearest neighbors in the word embedding space. The predicted descriptor weights for a scene s_t are obtained by running a forward pass through the model.
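A contrastive hinge objective of this kind can be sketched as below. The exact form is an assumption modeled on Iyyer et al. (2016): each negative sample contributes max(0, 1 − w_t·u_t + w_t·u_i), and the orthogonality constraint is expressed as λ times the squared Frobenius norm of R Rᵀ − I. The default weight of 10 mirrors the value of λ reported in §5.5, assuming that hyperparameter is the orthogonality weight. Function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hinge_loss(w: Vector, u: Vector, negatives: List[Vector],
               margin: float = 1.0) -> float:
    """Sum over negatives of max(0, margin - w.u + w.u_neg)."""
    pos = dot(w, u)
    return sum(max(0.0, margin - pos + dot(w, neg)) for neg in negatives)


def orthogonality_penalty(R: List[Vector]) -> float:
    """Squared Frobenius norm of (R R^T - I); zero for orthonormal rows."""
    total = 0.0
    for i, row_i in enumerate(R):
        for j, row_j in enumerate(R):
            target = 1.0 if i == j else 0.0
            total += (dot(row_i, row_j) - target) ** 2
    return total


def descriptor_objective(w: Vector, u: Vector, negatives: List[Vector],
                         R: List[Vector], lam: float = 10.0) -> float:
    """Hinge reconstruction loss plus weighted orthogonality penalty."""
    return hinge_loss(w, u, negatives) + lam * orthogonality_penalty(R)


value = descriptor_objective([1.0, 0.0], [1.0, 0.0],
                             [[0.0, 1.0], [1.0, 0.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
```

In the toy call only the second negative, which coincides with the target, violates the margin, and the orthonormal R incurs no penalty.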

Evaluation
We evaluate the proposed script encoder and its variants through two supervised multilabel tag prediction tasks, and a qualitative analysis via the unsupervised extraction of descriptor trajectories.

Datasets
We base our evaluation on the ScriptBase-J corpus released by Gorinski and Lapata (2018) to directly compare our approach with the multilabel encoder proposed in Gorinski and Lapata (2018) and to provide an open-source evaluation standard.² In this corpus, each movie is associated with a set of expert-curated tags that range across 6 tag attributes: mood, plot, genre, attitude, place, and flag; in addition, we also evaluate on an internal dataset of labels assigned to the same movies by in-house domain experts, across 3 tag attributes: genre, plot, and mood. The two taxonomies are distinct (see Appendix Table 3).

Script Preprocessing
As in Pavel et al. (2015), we leverage the standard screenplay format (Argentini, 1998) to extract structured representations of scripts (formatting cues include capitalization and tab-spacing; see Fig. 1 and Table 1 for an example). Filtering erroneously processed scripts removes 6% of the corpus, resulting in a total of 857 scripts. We hold out 20% (172 scripts) for evaluation and use the rest for training. The average number of tokens per script is around 23k; additional statistics are shown in Appendix Table 1.
² https://github.com/EdinburghNLP/scriptbase
To keep within GPU memory limits, we split extremely long scenes to retain no more than 60 action and 60 dialogue lines per scene. The vocabulary is composed of words with at least 5 occurrences across the script corpus. The number of scripts per tag value ranges from high (e.g. for some Genre tags) to low (for most Plot and Mood tags) in both datasets (see Appendix Table 4), which along with high tag cardinality for each attribute motivates the use of the reweighted loss in Eq. 5.
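These two preprocessing steps can be sketched as follows. The splitting strategy is an assumption: the text states only the cap of 60 lines, so the sketch simply cuts a list of statements into consecutive chunks of at most 60. The names `build_vocab` and `chunk` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocab(scripts_tokens: Iterable[List[str]], min_count: int = 5) -> Set[str]:
    """Keep words that occur at least min_count times across the corpus."""
    counts = Counter(tok for tokens in scripts_tokens for tok in tokens)
    return {tok for tok, c in counts.items() if c >= min_count}


def chunk(statements: List[str], max_len: int = 60) -> List[List[str]]:
    """Split a list of statements into consecutive pieces of at most max_len."""
    return [statements[i:i + max_len] for i in range(0, len(statements), max_len)]


vocab = build_vocab([["a"] * 3, ["a", "a", "b"]])
pieces = chunk([f"line {i}" for i in range(130)])
```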

Experimental Setup
All inputs to the hierarchical scene encoder are 100-dimensional GloVe embeddings (Pennington et al., 2014). Our sequential models are bi-GRUs with a single 50-dimensional hidden layer in each direction, resulting in 100-dimensional outputs. The attention parameter p is 100-dimensional; BoE models naturally output 100-dimensional representations, and character embeddings are 10-dimensional. The script encoder's output is passed through a linear layer with sigmoid activation and binarized by thresholding at 0.5.
One simplification we make is to use the same encoder type for all encoders described in §3.1. However, particular encoder types might suit different tiers of the architecture: e.g. scene embeddings could be aggregated in a permutation-invariant manner, since narratives are interwoven and scenes may not be truly sequential.
We implement the script encoder on top of AllenNLP (Gardner et al., 2017) and PyTorch (Paszke et al., 2019), and all experiments are conducted on an AWS p2.8xlarge machine. We use the Adam optimizer with an initial learning rate of 0.005, clip gradients at a maximum norm of 5, and use no dropout. The model is trained for up to 20 epochs to maximize average precision score, with early stopping if the validation metric does not improve for 5 consecutive epochs.
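The stopping rule can be written down in a few lines. The sketch below is framework-agnostic plain Python, not the AllenNLP trainer used for the experiments; `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation average precision.

```python
from typing import Callable, Tuple


def train_with_early_stopping(run_epoch: Callable[[int], float],
                              max_epochs: int = 20,
                              patience: int = 5) -> Tuple[int, float]:
    """Return (best_epoch, best_score) under a patience-based stopping rule."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = run_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Toy validation curve: the late spike at epoch 7 is never reached.
curve = [0.10, 0.20, 0.15, 0.15, 0.15, 0.15, 0.15, 0.90]
seen = []


def fake_epoch(epoch: int) -> float:
    seen.append(epoch)
    return curve[epoch]


result = train_with_early_stopping(fake_epoch)
```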

Tag Prediction Experiments
ScriptBase-J also comes with loglines, or short, 1-2 sentence human-crafted summaries of the movie's plot and mood (see Appendix Table 2). A model trained on these summaries can be expected to provide a reasonable baseline for tag prediction, since logline curators are likely to highlight information relevant to this task. The Loglines model is a bi-GRU with inputs of size 100 (GloVe embeddings) and hidden units of size 50 in each direction, whose output feeds into a linear classifier.

Table 2: Investigation of the effects of different architectural (BoE +/- Attn, GRU +/- Attn) and structural choices on a tag prediction task, using an internally tagged dataset: F-1 scores with sample standard deviation in parentheses. Across the 3 tag attributes we find that modeling sentential and scene-level structure helps, and attention helps extract representations more salient to the task at hand.
Table 2 contains results for the tag prediction task on our internally-tagged dataset. First, a set of models trained using action and dialogue inputs are used to evaluate the architectural choices in §3.1. We find that modeling recurrence at sentential and scene levels and selecting relevant words/scenes with attention are prominent factors in the robust improvement over the Loglines baseline (see the first five rows in Table 2).
Next, we assess the effect that various structural elements of a screenplay have on classification performance. Notably, the difficulty of the prediction task is directly related to the number of labels per tag attribute: higher-cardinality tag attributes with correlated tag values (like plot and mood) are far more difficult to predict than lower-cardinality tags with more discriminable values (like genre). We find that adding character information to the best-performing GRU + Attn model (+Char) improves prediction of genre, while using both dialogue and action statements improves performance on plot and mood when compared to using only one or the other. We also evaluate (1) a 2-tier variant of the GRU+Attn model without action/dialogue-statement encoders (i.e., all action statements are concatenated into a single sequence of words and passed into the action-scene encoder, and similarly with dialogue) and (2) a variant similar to Yang et al. (2016) (HAN) that does not distinguish between action and dialogue (i.e., all statements in a scene are encoded using a single statement encoder and statement embeddings are passed to a scene encoder, the output of which is passed into the script encoder). Both models perform slightly better than GRU+Attn on genre, but worse on plot and mood, indicating that incorporating hierarchy and distinguishing between dialogue and action statements helps on the more difficult prediction tasks.
For the results in Table 3, we compared the GRU+Attn configuration in Table 2 (HSE) with an implementation of Gorinski and Lapata (2018) (G&L) that was run on the train-test split described above. G&L contains a number of handcrafted lexical, graph-based, and interactive features that were designed for optimal performance on screenplay analysis. In contrast, HSE directly encodes standard screenplay structure into a neural network architecture, and is an alternative, arguably more lightweight way of building a domain-specific textual representation.
Our results are comparable, with the exception of 'place', which can often be identified deterministically from scene headings.

Similarity-based F-1
Results in Tables 2 and 3 check for an exact match between predicted and true tag values to report standard multi-label F-1 scores (one-vs-rest classification evaluation, micro-averaged over tag attributes). However, the characteristics of our tag taxonomies suggest that this measure may not be ideal, since human-crafted tag sets include dozens of highly correlated, overlapping values, and the dataset includes instances of missing tags. A standard scoring procedure may underestimate model performance when, e.g., a prediction of 'Crime' is equally penalized for target labels of 'Heist' and 'Romance' (see Appendix Table 5).
We use a similarity-based scoring procedure (see Maynard et al. (2006) for related approaches) to assess the impact of such effects. In particular, we calculate cosine similarities between tag embeddings trained on a similar task (see Appendix for details) and evaluate a prediction based on the percentile of its similarity to the actual label. Such a measure takes into account the latent relationships among tags via similarity thresholding, wherein a prediction is counted as correct if it is sufficiently similar to the target. The percentile cutoff can be varied to estimate model performance as a function of the threshold percentile.
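The thresholding step can be sketched as follows. The details are assumptions: a pair's percentile is taken to be the share of tag pairs (within one attribute) with strictly lower cosine similarity, so that no pair of distinct tags reaches the 100th percentile and a cutoff of 100 reduces to exact matching. The tag embeddings are toy values and the function names are hypothetical; aggregating matches into F-1 is left out.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_percentiles(tag_emb: Dict[str, Vector]) -> Dict[Tuple[str, str], float]:
    """Percentile of each tag pair's similarity among all pairs of the attribute."""
    pairs = list(combinations(sorted(tag_emb), 2))
    sims = {pair: cosine(tag_emb[pair[0]], tag_emb[pair[1]]) for pair in pairs}
    values = list(sims.values())
    return {pair: 100.0 * sum(v < s for v in values) / len(values)
            for pair, s in sims.items()}


def is_match(pred: str, target: str,
             percentiles: Dict[Tuple[str, str], float], cutoff: float) -> bool:
    """A prediction counts as correct if exact or sufficiently similar."""
    if pred == target:
        return True
    return percentiles[tuple(sorted((pred, target)))] >= cutoff


toy = {"Crime": [1.0, 0.0], "Heist": [0.9, 0.1], "Romance": [0.0, 1.0]}
pct = similarity_percentiles(toy)
```

With these toy embeddings, 'Crime' is accepted for a target of 'Heist' at a cutoff of 60 but not for 'Romance', matching the intuition behind the example above.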
In Fig. 4 we re-evaluate the GRU + Attn model outputs (row 5 in Table 2) with this evaluation metric to examine how our results might vary if we adopted a similarity-based scoring procedure. When the similarity percentile cutoff equals 100, the result is identical to the standard F-1 score. Even decreasing the cutoff to the 90th percentile shows striking improvements for high-cardinality attributes (180% for mood and 250% for plot). Notably, using a similarity-based scoring procedure for complex tag taxonomies may yield results that more accurately reflect human perception of the model's performance (Maynard et al., 2006).

Qualitative Scene-level Analysis
To extract narrative trajectories with the scene descriptor model, we analyze the scene encoder from the GRU+Attn model, which performs best on the Plot and Mood tag attributes and does reasonably well on Genre. Similarly to Iyyer et al. (2016), we limit the input vocabulary for the BoE+Attn encoders that yield target vectors u_t to words occurring in at least 50 movies (7.3% of the training set), while also filtering the 500 most frequent words in the corpus. We set the number of descriptors k to 25 to allow for a wide range of topics while keeping manual examination feasible.
Further modeling choices are evaluated using the semantic coherence metric (Mimno et al., 2011), which assesses the quality of word clusters induced by topic modeling algorithms. These choices include: the presence of recurrence in the predictor (i.e. toggling between Eqns. 6 and 7, with α = 0.5) and the value of hyperparameter λ. While the k-means initialized descriptors score slightly higher on semantic coherence, they remain close to the initial centroids and do not reflect the corpus as well as the randomly initialized version, which is the initialization we eventually used. We also find that incorporating recurrence and λ = 10 (tuned using simple grid search) result in the highest coherence.
The outputs of the scene descriptor model are shown in Table 4 and Figure 5. Table 4 presents five example descriptors, each identified by the representative words closest to them in the word embedding space (topic names are manually annotated). Figure 5 presents the narrative trajectories of a subset of descriptors for three screenplays: Pretty Woman, Pulp Fiction, and Pearl Harbor, using a streamgraph (Byron and Wattenberg, 2008). The descriptor weight o_t (Eq. 6) as a function of scene number/order is rescaled and smoothed, with the width of a color band indicating the weight value. A critical event for each screenplay is indicated by a letter on each trajectory. A qualitative analysis of such events indicates general alignment between scripts and their topic trajectories, and the potential applicability of this method to identifying significant moments in long-form documents.

Topic        Words
Violence     fires blazes explosions grenade blasts
Residential  loft terrace courtyard foyer apartments
Military     leadership army victorious commanding elected
Vehicles     suv automobile wagon sedan cars
Geography    sand slope winds sloping cliffs
Table 4: Examples of retrieved descriptors. Trajectories for "Violence", "Military", and "Residential" are shown in Fig. 5.

Related Work
Computational narrative analysis of large texts has been explored in a range of contexts (Mani, 2012) over the past few decades (Lehnert, 1981). Recent work has analyzed narrative from plot (Chambers and Jurafsky, 2008;Goyal et al., 2010) and character (Elsner, 2012;Bamman et al., 2014) perspectives. While movie narratives have received attention (Bamman et al., 2013;Chaturvedi et al., 2018;Kar et al., 2018), the computational analysis of entire screenplays is not as common.
Notably, Gorinski and Lapata (2015) introduced a summarization method for scripts, extracting graph-based features that summarize the key scene sequences. Gorinski and Lapata (2018) built on top of this work, crafting additional features for a specially-designed multilabel encoder, while also emphasizing the difficulty of the tag prediction task. Our work suggests an orthogonal approach using automatically learned scene representations instead of feature-engineered inputs. We also consider the possibility that at least some of the task difficulty owes not to the length or richness of the text, but rather to the complexity of the tag taxonomy. The pattern of results we obtain from a similarity-based scoring measure offers a brighter picture of model performance, and suggests that the standard multilabel F1 measure may not be appropriate for such complex tag sets (Maynard et al., 2006).
Nevertheless, dealing with long-form text remains a significant challenge. One possible solution is to infer richer representations of latent structure using a structured attention mechanism (Liu and Lapata, 2018), which might highlight key dependencies between scenes in a script. Another method could be to define auxiliary tasks as in Jiang and Bansal (2018) to encourage better selection. Lastly, sparse versions of the softmax function (Martins and Astudillo, 2016) could be used to address the sparse distribution of salient information across a screenplay.

Conclusion
In this work, we propose and evaluate various neural network architectures for learning fixed-dimensional representations of full-length film scripts. We hypothesize that a network design mimicking the documents' internal structure will boost performance. Experiments on two tag prediction tasks support this hypothesis, confirming the benefits of using hierarchical attention-based models and of incorporating distinctions between various scene components directly into the model. In order to explore the information contained within scene-level embeddings, we present an unsupervised technique for bootstrapping scene "descriptors" and visualizing their trajectories over the course of the screenplay. For future work, we plan to investigate richer ways of representing character identities, which could allow character embeddings to be compared across movies and linked to character archetypes. A persona-based characterization of the screenplay would provide a complementary view to the current plot-based analysis.
Scripts and screenplays are an underutilized and underanalyzed data source in modern NLP; indeed, most work on narratology in NLP concentrates on short stories and book/movie summaries. This paper shows that capitalizing on their rich internal structure largely obviates the need for feature engineering, or other more complicated architectures, a lesson that may prove instructive in other areas of discourse processing. Our hope is that these results encourage more people to work on this fascinating domain.

A.1 Additional Dataset Statistics
In this section, we present additional statistics on the evaluation sets used in this work.
Min     10th %   90th %   Max
4,025   16,240   29,376   52,059
Table 5: Statistics on the number of tokens per script in the ScriptBase-J corpus.
We use the same script corpus with two different tag sets: the Jinni tags provided with ScriptBase and a tag set designed by internal annotators.

A.2 Tag Similarity Scoring
To estimate tag-tag similarity percentiles, we calculate the distance between tag embeddings learned via an auxiliary model trained on a related supervised learning task. In our case, the related task is to predict the audience segment of a movie, given a tag set. The general approach is easily replicable via any model that projects tags into a well-defined similarity space (e.g., knowledge-graph embeddings or tag-based autoencoders). Given a tag embedding space, the similarity percentile of a pair of tag values is estimated as follows. For a given tag attribute, the pairwise cosine distance between tag embeddings is computed for all tag-tag value pairs. For a given pair, its similarity percentile is then calculated with reference to the overall distribution for that attribute.
Similarity thresholding simplifies the tag prediction task by significantly reducing the perplexity of the tag set, while only marginally reducing its cardinality. Cardinality can be estimated via permutations. If n is the cardinality of the tag set, the number of permutations p of different tag pairs (k = 2) is:
$$p(n, k) = \frac{n!}{(n - k)!}$$
which, for k = 2, gives $p = n(n - 1)$, i.e. $n^2 - n - p = 0$. Likewise, the entropy of a list of n distinct tag values of varying probabilities is given by:
$$H(X) = H(\mathrm{tag}_1, \ldots, \mathrm{tag}_n) = -\sum_{i=1}^{n} \mathrm{tag}_i \log_2 \mathrm{tag}_i \tag{10}$$
The perplexity over tags is then simply $2^{H(X)}$.
As the similarity threshold decreases, the number of tags treated as equivalent correspondingly increases. Mapping these "equivalents" to a shared label in our list of tag values allows us to calculate updated values for tag (1) perplexity and (2) cardinality. As illustrated by Table 10, rather than leading to large reductions in the overall cardinality of the tag set, similarity thresholding mainly serves to decrease perplexity by eliminating redundant/highly similar alternatives. Thus, thresholding at once significantly decreases the complexity of the prediction task, while yielding a potentially more representative picture of model performance.
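The quantities above are straightforward to compute. The sketch below is illustrative only: the tag probabilities and the mapping of "equivalent" tags to a shared label are invented, and the function names are hypothetical.

```python
import math
from typing import Dict, Iterable


def pair_permutations(n: int, k: int = 2) -> int:
    """p(n, k) = n! / (n - k)!"""
    return math.factorial(n) // math.factorial(n - k)


def perplexity(probs: Iterable[float]) -> float:
    """2 ** H(X), with H(X) the base-2 entropy of the tag distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


def merge_equivalents(probs: Dict[str, float],
                      mapping: Dict[str, str]) -> Dict[str, float]:
    """Pool the probability mass of tags mapped to a shared label."""
    merged: Dict[str, float] = {}
    for tag, p in probs.items():
        key = mapping.get(tag, tag)
        merged[key] = merged.get(key, 0.0) + p
    return merged


tags = {"Crime": 0.25, "Heist": 0.25, "Romance": 0.5}
before = perplexity(tags.values())
after = perplexity(merge_equivalents(tags, {"Heist": "Crime"}).values())
```

In this toy case, treating 'Heist' as equivalent to 'Crime' lowers the perplexity from about 2.83 to 2.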