Inducing Semantic Micro-Clusters from Deep Multi-View Representations of Novels

Automatically understanding the plot of novels is important both for informing literary scholarship and for applications such as summarization or recommendation. Various models have addressed this task, but their evaluation has remained largely intrinsic and qualitative. Here, we propose a principled and scalable framework leveraging expert-provided semantic tags (e.g., mystery, pirates) to evaluate plot representations in an extrinsic fashion, assessing their ability to produce locally coherent groupings of novels (micro-clusters) in model space. We present a deep recurrent autoencoder model that learns richly structured multi-view plot representations, and show that they i) yield better micro-clusters than less structured representations; and ii) are interpretable, and thus useful for further literary analysis or labeling of the emerging micro-clusters.


Introduction
For the literature aficionado, the quest for the next novel to read can be daunting: the sheer number of novels of different styles, topics and genres is difficult to navigate. It is intuitively clear that readers select novels based on specific but potentially diverse and structured preferences: they might prefer novels of a particular theme (small-town romance) or mood (dark), or choose based on character types (grumpy boss), character relations (love, enmity) and their development. These preferences also manifest in the organization of online book stores and recommendation platforms. For example, the Amazon book catalog contains semantic tags provided by experts (publishers), including labels of character types (pirates) or theme (secret baby romance), to aid focused search for novels of interest.
Although these tags are already fairly granular, many cover large sets of novels (e.g., the tag secret baby romance covers almost 4,000 novels), which limits their utility for exhaustive exploration and calls for even finer-grained micro-groupings. Can we instead automatically induce fine-grained novel clusters in an unsupervised, data-driven way?
We propose a framework to learn structured, interpretable book representations that capture different aspects of the plot, and verify that such representations are rich enough to support downstream tasks like generating interpretable book groupings. A real-world application of this work is content-based book recommendation based on diverse and interpretable book characteristics. Content-based recommendation has been criticized for the limited complexity of the features it typically employs (limited content analysis; Lops et al. (2011); Adomavicius and Tuzhilin (2005)). This work addresses this problem by inducing complex, structured and interpretable representations. Our contributions are two-fold.
First, assuming that richly structured book tags call for rich content representations (which expert taggers arguably possess), we describe a deep unsupervised model for learning multi-view representations of novel plots. We use the term view to refer to specific types of plot characteristics (e.g., pertaining to events, characters or mood), and multi-view to refer to combinations of these views. We use multi-view book representations to construct meaningful and locally coherent neighbourhoods in model space, which we will refer to as micro-clusters. To this end, we extend a recent autoencoder model (Iyyer et al., 2016) to learn multi-view representations of books. Our model encodes properties of characters (view $v_1$), relations between characters (view $v_2$), and their respective trajectories over the plot (footnote 2). View-specific encodings are learnt in an unsupervised way from raw text as separate sets of word clusters which are jointly optimized to encode relevant and distinct information. These properties are crucial for applications such as book recommendation, because they allow us to i) explain why particular books are similar based on the inferred latent structure and ii) find similarities based on important and distinct aspects of a novel (character types or interactions). Our framework of unsupervised multi-view learning is very flexible and can straightforwardly be applied to learn arbitrary kinds and numbers of views from raw text.
Secondly, we propose an empirical evaluation framework. Before we can use models to extend existing categories as discussed above, it must be shown that the representations capture existing associations. To this end, we investigate whether micro-clusters derived from induced representations resemble reference clusters defined as groups of novels sharing tags in the Amazon catalog. While automatic induction of plot representations has attracted considerable attention (see Jockers (2013)), evaluation has remained largely qualitative and intrinsic. To the best of our knowledge, we are the first to investigate the utility of automatically induced plot representations on an extrinsic task at scale. We evaluate micro-clusters as local neighbourhoods in model space containing 10,000 novels under 50 reference tags.
We show that rich multi-view representations produce better micro-clusters compared to competitive but simpler models, and that interpretability of the learnt representations is not compromised despite the more complex objective. We also qualitatively demonstrate that high-quality micro-clusters emerge from a smaller, more homogeneous data set of Gutenberg novels.

Related Work
Automatically learning representations of book plots, as structured summaries of their content, has attracted much attention (cf. Jockers (2013) for a review). Unsupervised models have been
Footnote 2: We argue that both characters and their relations evolve throughout the plot: heroes pick up new attitudes or skills, and utilize those to different extents; relations change and develop over time (hate to love, friendship to enmity and back).
Other work focused on the dynamics of a plot, learning trajectories of relations between two characters (Iyyer et al., 2016; Chaturvedi et al., 2017). Iyyer et al. (2016) combine dictionary learning (Olshausen and Field, 1997) with deep recurrent autoencoders to learn interpretable character relationship descriptors. They show that their deep model learns better representations than conceptually similar topic models (Gruber et al., 2007; Chang et al., 2009). Here, we extend the model of Iyyer et al. (2016) to simultaneously induce multiple views on the plot.
Methodologically, our work falls into the class of multi-view learning, and we propose a novel formulation of the model objective which encourages encoding of distinct information in the views. Our objective function is inspired by prior work in multi-task learning and deep domain adaptation for classification (Ganin and Lempitsky, 2015; Ganin et al., 2016). They train neural networks to simultaneously learn classifiers which are accurate on their target task and are agnostic about feature fluctuation pertaining to domain shift. We adapt this idea to unsupervised models with a reconstruction objective and learn multi-view representations which efficiently encode the input data and, at the same time, learn to only encode information relevant for the particular view.
Evaluating induced plot representations is notoriously difficult. Most evaluation has resorted to manual inspection, or crowd-sourced human judgments of the coherence and interpretability of the representations (Iyyer et al., 2016; Chaturvedi et al., 2017). While such evaluations demonstrated that the induced representations are qualitatively valuable, it is not clear whether they are rich and general enough to be used for downstream tasks and applications. Others have used automatically created gold-standards of re-occurring character names across scripts ('gang member') (Bamman et al., 2013), prototypical plot templates (tropes, e.g., 'corrupt corporate executive') or manually created gold-standards of character types (Vala et al., 2016) or their relations (Massey et al., 2015; Chaturvedi et al., 2017) to automatically measure the intrinsic value of learnt representations. Here, we investigate how these results extend to extrinsic tasks, and use structured plot representations for the task of inducing micro-clusters of novels.
Elsner (2012) departs from the above pattern, suggesting an extrinsic, albeit artificial, evaluation paradigm. Approaching plot understanding from the angle of its utility for summarization, they use kernel methods to learn character-centric plot representations. They evaluate their trained models on their ability to differentiate between real and artificially distorted novels (e.g., with shuffled chapters). While this evaluation is extrinsic and quantitative, it leverages artificial data and it is not clear how the results extend to real-world summaries.
Language features were previously used in content-based book recommendation, e.g., as bags-of-words (Mooney and Roy, 1999) or semantic frames (Clercq et al., 2014). Both works use structured databases and plot summaries rather than the raw book text. Other work used topic models to augment a recommender system of scientific articles (Wang and Blei, 2011). Similar to our work, these works emphasize the added value of interpretable representations and recommendations. However, they do not leverage the raw content of entire novels and the richness of information encoded in those.

Multi-View Novel Representations
We first provide an intuitive description of Relationship Modeling Networks (RMN; Iyyer et al. 2016), and our extension (henceforth MVPlot), which jointly induces temporally aware multi-view representations of novel plots. Afterwards we describe the MVPlot model in technical detail. Iyyer et al. (2016) introduce the RMN, an unsupervised model which learns interpretable plot representations in terms of types of relations between pairs of book characters, and their development over time. Given a book and a character pair, the model learns relation types as word clusters (not unlike topics in a topic model (Blei et al., 2003)) from local contexts mentioning both characters. In addition the RMN learns for each character pair how these relations vary over time as a trajectory of relations. Methodologically, the RMN combines a deep recurrent autoencoder with dictionary learning, where terms in the dictionary are relationship descriptors. The RMN learns to efficiently encode local text spans as a linear combination of these relation descriptors.

Intuition
We extend RMNs to induce temporally aware multi-view representations of novel plots. Multiple interpretable views are induced jointly within one process in an unsupervised way. The core of our model closely corresponds to the structure of the RMN (as technically described in Section 3.2). However, we provide the model with distinct types of informative input for each view, and reformulate the objective in a way that jointly optimizes parameterizations of all views to encode distinct information (cf. Section 3.3).
Our MVPlot model learns two views, properties associated with individual characters ($v_1$) and relations between character pairs ($v_2$, as in the RMN), together with their respective development over the course of the plot (examples of descriptors learnt by MVPlot for both views are shown in Table 1). Our modeling framework, however, is very general in the sense that any number and type of views can be learnt jointly as long as input with relevant signals can be provided for each view. For example, we could naturally extend the model described here with a 'plot' view to capture properties of the story which are not related to any character.

The MVPlot Model
We now formally describe the MVPlot model for learning multi-view plot representations encoding individual character properties ($v_1$), character pair relationships ($v_2$), and their respective trajectories. The full model is shown in Figure 1.

[Figure 1: The MVPlot model. Legend: $e_t$, span embedding; $e_b$, book embedding; $h_t$, hidden input representation.]

The input to our model consists of two corpora of text spans, one for each view, $S_{v_1}$ and $S_{v_2}$. The corpora consist of different sets of relevant view-specific local contexts, as described in Section 5. Given a book $b$ and a character $c$, $S^{c,b}_{v_1}$ contains linearly ordered text spans (ordered with respect to their occurrence in the novel) which mention only $c$; given a book $b$ and a pair of characters $c_1$ and $c_2$, $S^{c_1,c_2,b}_{v_2}$ contains linearly ordered text spans which mention both $c_1$ and $c_2$, but no other character. The rest of the input preparation follows Iyyer et al. (2016). We map each text span $s_t$ into word embedding space by mapping each word $w$ to its 300-dimensional GloVe embedding $e_w$ (Pennington et al., 2014), pre-trained on Common Crawl, and averaging the word embeddings,

$$e_t = \frac{1}{|s_t|} \sum_{w \in s_t} e_w. \tag{1}$$

We provide MVPlot with a trainable matrix $B$ of dimensions $b \times n$, where $b$ is the number of books in our data set, and each row $e_b$ is an $n$-dimensional book embedding encoding background information (e.g., about its general setting or style) which is relevant to neither view of MVPlot. (The RMN learns background encodings for characters in addition to the book embeddings. We omit this for MVPlot as this information is explicitly learned in the views.) Finally, the span embedding and the corresponding book embedding are concatenated and passed through a ReLU non-linearity (cf. Figure 1, bottom),

$$h_t = \mathrm{ReLU}(W_h [e_t; e_b]), \tag{2}$$

where $W_h$ is a trainable weight matrix.

Model architecture. MVPlot uses the architecture of the RMN autoencoder, but replicates it for each input view, $v_1$ and $v_2$ (cf. Figure 1, center). Each part induces an encoding of view-specific information. The feed-forward pass, described below, is identical for both parts; the loss and backpropagation, however, differ (cf. Section 3.3).
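The input preparation described above can be illustrated with a minimal NumPy sketch. It assumes toy 4-dimensional vectors in place of the 300-dimensional GloVe embeddings, and a projection matrix `W_h` applied before the ReLU (an assumption borrowed from the RMN; the names `span_embedding`, `hidden_input` and `W_h` are hypothetical and not taken from the authors' code).

```python
import numpy as np

def span_embedding(span, word_vectors):
    """Average the embeddings of all words in a text span."""
    return np.mean([word_vectors[w] for w in span], axis=0)

def hidden_input(e_t, e_b, W_h):
    """Concatenate span and book embedding, project, and apply ReLU."""
    return np.maximum(0.0, W_h @ np.concatenate([e_t, e_b]))

rng = np.random.default_rng(0)
# Toy 4-dimensional vectors standing in for 300-dimensional GloVe embeddings.
word_vectors = {w: rng.normal(size=4) for w in ["captain", "sailed", "storm"]}
B = rng.normal(size=(3, 2))        # book embedding matrix: 3 books, n = 2
W_h = rng.normal(size=(5, 4 + 2))  # hypothetical projection to 5 hidden units

e_t = span_embedding(["captain", "sailed", "storm"], word_vectors)
h_t = hidden_input(e_t, B[1], W_h)  # span taken from the book in row 1 of B
```

In a trained model `B` and `W_h` would be learnt by backpropagation; here they are random so that only the shapes and the data flow are shown.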
We describe the feed-forward pass for $v_2$, noting that it works analogously for $v_1$. The latent input representation $h_t$ (eqn (2)) is passed through a softmax layer which returns a weight vector over descriptors,

$$d_t^{v_2} = \mathrm{softmax}(W_d^{v_2} h_t). \tag{3}$$

Descriptors are the rows of the $k \times d$ descriptor matrix $R_{v_2}$, each row being one $d$-dimensional descriptor (similar to a topic in a topic model). The input $e_t$ is reconstructed through the dot product of $d_t^{v_2}$ and the descriptor matrix $R_{v_2}$,

$$r_t^{v_2} = R_{v_2}^{\top} d_t^{v_2}. \tag{4}$$

Like in the original RMN, we want to capture the temporal development of character relations or properties. Intuitively, we assume that the relations between (or properties of) characters at time $t$ depend on their relations (or properties) at time $t-1$. As in the RMN, we factor the descriptor weights of the previous time step, $d_{t-1}^{v_2}$, into the representation at time $t$, such that

$$d_t^{v_2} = \alpha \, \mathrm{softmax}(W_d^{v_2} [h_t; d_{t-1}^{v_2}]) + (1 - \alpha) \, d_{t-1}^{v_2}, \tag{5}$$

where $\alpha$ is an interpolation weight.

Output. First, the model induces property descriptors (rows in $R_{v_1}$) and relationship descriptors (rows in $R_{v_2}$). Both sets of descriptors are optimized to reconstruct model input in GloVe embedding space (cf. Section 3.3 for details). They consequently themselves live in GloVe word embedding space, and can be visualized through their nearest neighbours in this space. In addition, for each book $b$, character $c$ and character pair $\{c_1, c_2\}$ in $b$, sequences of weight vectors over relations, $T^{c_1,c_2,b}_{v_2}$, and over properties, $T^{c,b}_{v_1}$, are induced, which encode their trajectories of relations and properties, respectively. We will utilize these trajectories for inducing micro-clusters of novels (Section 6.1).
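The feed-forward pass for a single view can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the temporal smoothing follows the interpolated form used by the RMN, with a fixed weight `alpha`, and all parameters (`W_d`, `R`) are random stand-ins rather than trained values. Function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # shift for numerical stability
    return z / z.sum()

def forward_step(h_t, d_prev, W_d, R, alpha):
    """One time step for one view: descriptor weights and reconstruction."""
    # Weights over the k descriptors, conditioned on the previous time step.
    d_new = softmax(W_d @ np.concatenate([h_t, d_prev]))
    # Interpolate with the previous weights to smooth the trajectory.
    d_t = alpha * d_new + (1.0 - alpha) * d_prev
    # Reconstruct the input as a weighted sum of descriptor rows.
    r_t = R.T @ d_t
    return d_t, r_t

rng = np.random.default_rng(0)
k, d, hidden = 3, 4, 5
R = rng.normal(size=(k, d))             # descriptor matrix, one row per descriptor
W_d = rng.normal(size=(k, hidden + k))  # acts on the concatenation [h_t; d_prev]

d_prev = np.full(k, 1.0 / k)            # uniform weights before the first span
trajectory = []
for _ in range(6):                      # six consecutive spans of one sequence
    h_t = rng.normal(size=hidden)
    d_prev, r_t = forward_step(h_t, d_prev, W_d, R, alpha=0.5)
    trajectory.append(d_prev)
```

Because each step is a convex combination of two probability vectors, every entry of `trajectory` remains a distribution over the `k` descriptors, which is what makes the trajectories directly comparable across books.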

The Multi-View Loss
We formulate our loss as a hinge loss within the contrastive max-margin framework. Our objective is to learn parameters for each view $v \in \{v_1, v_2\}$ which efficiently encode view-specific input in a low-dimensional space from which the original input can be reconstructed with high accuracy. In addition, we want to learn view-specific parameters which encode distinct information such that, when utilized together, they provide an improved embedding of the data. Intuitively, we achieve this by discouraging parameters of view $v_1$ from accurately reconstructing input spans from view $v_2$, and vice versa. Our loss combines these two objectives as follows.

The first part of the loss corresponds to the loss of the RMN. We use negative sampling to induce parameters for each view which reconstruct their respective view-specific input well. Formally, assuming model input from view $v_1$, $e_t^{v_1}$, we construct a set of $I = 10$ 'negative inputs' $\{e_{n_1}^{v_1}, \ldots, e_{n_I}^{v_1}\}$ which are sampled at random from the $v_1$ input corpus. We want to learn parameters encoding view $v_1$ to reconstruct the input such that the inner product between the true input $e_t^{v_1}$ and its reconstruction $r_t^{v_1}$ is higher than the inner product between $r_t^{v_1}$ and any of the negative samples $e_{n_i}^{v_1}$ by a margin of at least 1,

$$J^{v_1}(\theta) = \sum_{t} \sum_{i=1}^{I} \max\left(0,\; 1 - r_t^{v_1} \cdot e_t^{v_1} + r_t^{v_1} \cdot e_{n_i}^{v_1}\right), \tag{6}$$

where $\theta$ refers to the set of all model parameters.
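As a rough illustration of the contrastive max-margin term described above, the following sketch computes the hinge loss for one time step against a set of randomly drawn negatives. The corpus is a random toy matrix, and the function names are hypothetical; it is not the authors' training code.

```python
import numpy as np

def sample_negatives(corpus, rng, n_neg=10):
    """Draw n_neg span embeddings uniformly at random from a view's corpus."""
    idx = rng.choice(len(corpus), size=n_neg, replace=False)
    return corpus[idx]

def hinge_loss(r_t, e_t, negatives):
    """Max-margin loss: the reconstruction r_t should score the true input
    e_t at least 1 higher than it scores every negative sample."""
    pos = r_t @ e_t
    neg = negatives @ r_t
    return float(np.maximum(0.0, 1.0 - pos + neg).sum())

rng = np.random.default_rng(0)
corpus_v1 = rng.normal(size=(50, 4))  # toy stand-in for v1 span embeddings
negatives = sample_negatives(corpus_v1, rng)

# Hand-checkable case: the first negative ties with the true input
# (contributes 1), the second is orthogonal to r_t (contributes 0).
loss = hinge_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                  np.array([[1.0, 0.0], [0.0, 1.0]]))
```

The loss is zero only when every negative is separated from the true input by the full margin, so negatives that look like the true span keep contributing gradient.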
We add an orthogonality-encouraging regularizing term to this objective in order to obtain view-specific descriptors which are distant from each other (Hyvärinen and Oja, 2000),

$$X^{v_1}(\theta) = \lVert R_{v_1} R_{v_1}^{\top} - I \rVert.$$

The loss is defined analogously for input of view $v_2$. Note that so far the loss is defined in an entirely view-specific way (e.g., the $v_1$ loss in equation (6) is independent of the $v_2$ parameters). We break this independence by adding a second term to our loss function, which ensures that view-specific parameters encode only relevant information. That is, we want $v_2$-specific parameters to only encode $v_2$-specific information, and vice versa. Assuming model input from view $v_1$, $e_t^{v_1}$, we learn parameters for view $v_2$ that reconstruct the input poorly. Again, we use the max-margin framework, maximizing the margin between the (high) quality of the reconstruction of $e_t^{v_1}$ from $v_1$ parameters, $r_t^{v_1}$, and the (poor) quality of the reconstruction from $v_2$ parameters, $r_t^{v_2}$,

$$K^{v_1}(\theta) = \sum_{t} \max\left(0,\; 1 - r_t^{v_1} \cdot e_t^{v_1} + r_t^{v_2} \cdot e_t^{v_1}\right). \tag{7}$$

The update is defined analogously, swapping $v_1$ and $v_2$ subscripts, when the true input stems from $v_2$. The full loss is defined as a weighted linear combination of its terms (eqns (6) and (7)),

$$L^{v_1}(\theta) = J^{v_1}(\theta) + \beta \, K^{v_1}(\theta) + \lambda \, X^{v_1}(\theta), \tag{8}$$

where $J^{v_1}$ denotes the view-specific loss of equation (6), and $\beta$ and $\lambda$ weight the cross-view term and the regularizer, respectively.
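The two additional loss components discussed above can be sketched in a few lines. This is a minimal illustration under assumptions: the orthogonality penalty is taken to be the Frobenius norm of $R R^{\top} - I$ on row-normalised descriptors (as in the RMN), and the weights `beta` and `lam` are hypothetical hyperparameters whose values are not given in the text.

```python
import numpy as np

def orthogonality_penalty(R):
    """Frobenius norm of R R^T - I, computed on row-normalised descriptors."""
    R_unit = R / np.linalg.norm(R, axis=1, keepdims=True)
    return float(np.linalg.norm(R_unit @ R_unit.T - np.eye(R.shape[0])))

def cross_view_loss(r_own, r_other, e_t):
    """Hinge term that is zero once the input's own view reconstructs e_t
    better than the other view by a margin of at least 1."""
    return float(max(0.0, 1.0 - r_own @ e_t + r_other @ e_t))

def total_loss(j_own, k_cross, penalty, beta=1.0, lam=1e-3):
    """Weighted combination of the terms; beta and lam are assumed weights."""
    return j_own + beta * k_cross + lam * penalty

e_t = np.array([1.0, 0.0])  # a toy v1 input
# The other view reconstructs something orthogonal to e_t: no penalty.
separated = cross_view_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]), e_t)
# The other view reconstructs e_t just as well: the margin is violated.
entangled = cross_view_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]), e_t)
```

The cross-view term only compares the two reconstructions of the same input, so it pushes the views apart without requiring any labels for what each view should contain.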

Semantic Micro-Cluster Evaluation
MVPlot induces structured representations of a novel b as relation trajectories between character pairs in b, and property trajectories of individual characters in b. Are those representations rich and informative enough to produce meaningful and interpretable micro-clusters of novels? In Section 6.1 we evaluate the quality of such micro-clusters, i.e., local novel neighbourhoods in model space. We propose an objective and empirical evaluation employing expert-provided semantic novel tags in the Amazon catalog. Novels listed in the Amazon catalog are tagged with respect to their genre (e.g., mystery, romance).
They are further labelled with refinements pertaining to diverse information like character types or mood, which take different sets of values depending on the genre, and are therefore well suited as an objective reference for evaluating the diverse information captured by our model. Table 2 lists example tags for the refinement character type.
All tags are provided by publishers and can consequently be taken as a reliable source of information. In our evaluation we assume that novels which share a tag are related to each other. We use this tag-overlap metric to evaluate local neighbourhoods of book representations in model space. We selected a set of 50 representative tags from the Amazon catalog and did not tune this set for our evaluation. The full tag set is included in the supplementary material. Note that while this scheme provides an empirical way of evaluating plot representations, it may not capture their full potential: our models are not explicitly tuned towards producing micro-clusters which are coherent with respect to our gold-standard tags, and may encode additional structure which is not probed in this evaluation. That said, we consider this evaluation a good procedure for assessing the relative quality of different models, in the sense that a better model should produce micro-clusters that better correspond to reference clusters derived from gold-standard tags.

Data
We evaluate our model on two data sets. First, we create a diverse data set by sampling 10,000 digital novels under our 50 gold-standard tags (cf. Section 4) of the Amazon catalog (Amazon). Our second data set consists of 3,500 novels from Project Gutenberg, a large digital collection of freely available novels consisting primarily of classic literature (Gutenberg). The Amazon novels are already labelled with genre and refinement tags, such that evaluation using our gold-standard is straightforward. While Gutenberg novels come with the advantage of being freely available, they are unlabelled, and not fully covered by our 50 gold-standard tags. We therefore restrict our quantitative analysis to the Amazon data set. However, we also report qualitative results on the Gutenberg corpus, demonstrating that our model induces meaningful novel representations for corpora of varying size and diversity.
Both data sets were pre-processed with the BookNLP pipeline (Bamman et al., 2014) for coreference resolution of character mentions. We filtered stop-words and low-frequency words by discarding the 500 most frequent words and those which occur in fewer than 100 novels, and discarded novels that are fewer than 100 sentences long or contain fewer than 5 characters.
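The vocabulary filtering step described above could look roughly like the sketch below. It assumes that "most frequent" refers to corpus-wide token counts and that each novel is already tokenised; the function name `build_vocabulary` and its parameters are hypothetical.

```python
from collections import Counter

def build_vocabulary(novels, n_top=500, min_novels=100):
    """Keep words that are neither among the n_top most frequent tokens
    nor found in fewer than min_novels novels.

    novels: list of token lists, one per novel.
    """
    term_freq = Counter(tok for novel in novels for tok in novel)
    doc_freq = Counter(tok for novel in novels for tok in set(novel))
    too_frequent = {tok for tok, _ in term_freq.most_common(n_top)}
    return {tok for tok in term_freq
            if tok not in too_frequent and doc_freq[tok] >= min_novels}

# Toy corpus with thresholds scaled down accordingly.
toy_novels = [["the", "ship", "the"], ["the", "ship", "storm"], ["the", "sea"]]
vocab = build_vocabulary(toy_novels, n_top=1, min_novels=2)
```

In the toy corpus "the" is removed as the single most frequent word, while "storm" and "sea" are removed because each occurs in only one novel.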
We created view-specific input corpora as follows: (1) a relation corpus of chronologically ordered sequences of 20-word text spans for character pairs $\{c_1, c_2\}$ in a book $b$, $S^{c_1,c_2,b}_{v_2}$, which mention only $c_1$ and $c_2$ and no other character, with a margin of 10 words for the Amazon corpus (1 word for the smaller Gutenberg corpus); and (2) a property corpus of chronologically ordered sequences of 20-word text spans for individual characters $c$ in book $b$, $S^{c,b}_{v_1}$, which mention only $c$, using the same margins as above.
We keep only sequences of length $n$ time steps such that $5 \le n \le 250$. We only keep pair sequences if we also obtain sequences for each individual character conforming to the above criteria. Table 3 summarizes statistics on our input corpora.
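The sequence filtering rules above can be expressed compactly. The sketch below assumes the span sequences have already been extracted and are stored in dictionaries keyed by book and character identifiers; this data layout and the function name `filter_sequences` are illustrative assumptions.

```python
def filter_sequences(pair_seqs, char_seqs, min_len=5, max_len=250):
    """Apply the length limits and the pair/character consistency rule.

    pair_seqs: {(book, c1, c2): [span, ...]}  relation-view sequences
    char_seqs: {(book, c): [span, ...]}       property-view sequences
    """
    def length_ok(seq):
        return min_len <= len(seq) <= max_len

    chars = {key: seq for key, seq in char_seqs.items() if length_ok(seq)}
    # A pair sequence survives only if both of its characters also have a
    # valid single-character sequence in the same book.
    pairs = {
        (b, c1, c2): seq
        for (b, c1, c2), seq in pair_seqs.items()
        if length_ok(seq) and (b, c1) in chars and (b, c2) in chars
    }
    return pairs, chars

pair_seqs = {
    ("b1", "anna", "ben"): ["span"] * 6,
    ("b1", "anna", "carl"): ["span"] * 6,  # dropped: carl has no valid sequence
    ("b1", "ben", "carl"): ["span"] * 3,   # dropped: too short
}
char_seqs = {
    ("b1", "anna"): ["span"] * 10,
    ("b1", "ben"): ["span"] * 7,
    ("b1", "carl"): ["span"] * 2,          # dropped: too short
}
pairs, chars = filter_sequences(pair_seqs, char_seqs)
```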

Evaluation
Section 6.1 quantitatively evaluates the quality of local neighbourhoods in model space induced from the Amazon corpus against our proposed gold-standard. Section 6.2 evaluates the quality of the descriptors induced from the Amazon and Gutenberg corpora, both through crowdsourcing and through illustrative examples.
Models. We put the performance of MVPlot into perspective by comparing it against the RMN. MVPlot induces both character properties and relations, and is trained on both the relation-view and property-view input, while the RMN only induces pair relationships and can only utilize relation-view input. In addition, we report a frequency baseline (cosine), which uses both property-view and relation-view input. We concatenate all input spans of a given view for a particular novel, construct its term frequency vector, and use cosine similarity to compute the nearest neighbours of each novel.
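The frequency baseline is simple enough to sketch in full. The version below uses raw term counts and plain cosine similarity over sparse dictionaries for a single view; the toy novels and the function names are illustrative only.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def nearest_neighbours(novel_spans, query, k=10):
    """Rank all other novels by cosine similarity to the query novel.

    novel_spans: {title: [span, ...]}, each span being a list of tokens.
    """
    # Concatenate all spans of a novel into one term-frequency vector.
    tf = {title: Counter(tok for span in spans for tok in span)
          for title, spans in novel_spans.items()}
    scored = [(cosine(tf[query], tf[t]), t) for t in tf if t != query]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

toy = {
    "a": [["ship", "sea"], ["storm"]],
    "b": [["ship", "sea"]],
    "c": [["ball", "dance"]],
}
neighbours = nearest_neighbours(toy, "a", k=2)
```

Novel "b" shares two of the three terms of "a" and is ranked first; "c" shares none and receives a similarity of zero.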

Nearest Neighbours Evaluation
We evaluate local neighbourhoods in model space using the 500 most popular novels by their number of Amazon reviews as reference novels from the Amazon corpus. For each reference novel we compute the 10 nearest neighbours as described below. We measure neighbourhood quality using the gold-standard tags from Section 4, regarding neighbours as relevant if at least one tag is shared with the reference novel. We report precision at rank 10 (P@10) and mean average precision (MAP).
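The two reported metrics can be computed from ranked neighbour lists and tag sets as sketched below. One detail is an assumption: average precision is normalised by the number of relevant neighbours retrieved within the ranked list, since the exact normalisation is not stated in the text. The tag data is a toy example.

```python
def precision_at_k(relevant, k=10):
    """Fraction of the top-k neighbours that are relevant."""
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """Mean of the precision values at each rank holding a relevant item."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def evaluate(neighbours, tags, k=10):
    """neighbours: {reference: ranked list of novels}; tags: {novel: set}.
    A neighbour is relevant if it shares at least one tag with the reference."""
    p_scores, ap_scores = [], []
    for ref, ranked in neighbours.items():
        relevant = [bool(tags[ref] & tags[n]) for n in ranked[:k]]
        p_scores.append(precision_at_k(relevant, k))
        ap_scores.append(average_precision(relevant))
    return sum(p_scores) / len(p_scores), sum(ap_scores) / len(ap_scores)

tags = {"r": {"pirates"}, "a": {"pirates", "mystery"},
        "b": {"romance"}, "c": {"pirates"}}
p_at_3, mean_ap = evaluate({"r": ["a", "b", "c"]}, tags, k=3)
```

For the toy reference novel the relevance pattern is (relevant, not relevant, relevant), giving a precision at rank 3 of 2/3 and an average precision of (1/1 + 2/3) / 2 = 5/6.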
Method. MVPlot represents a book $b$ in terms of trajectories of weight vectors over relation descriptors, $T^{b}_{v_2}$, and property descriptors, $T^{b}_{v_1}$. The RMN only learns relation descriptors and their trajectories. For both models, we map each induced trajectory for book $b$ to a fixed-size $k$-dimensional vector representation by averaging the time-specific weight vectors, for example for a $v_2$ trajectory of $n$ time steps,

$$T^{c_1,c_2,b}_{v_2} = \frac{1}{n} \sum_{t=1}^{n} d_t^{v_2},$$

and equivalently for $v_1$ trajectories, $T^{c,b}_{v_1}$. We compute the similarity between two books $\{b_r, b_c\}$ as follows. We align the $v_2$ trajectory vector of each character pair $\{c_1, c_2\}$ in $b_r$, $T^{c_1,c_2,b_r}_{v_2}$, to its closest neighbouring character pair vector in $b_c$, $T^{c_1,c_2,b_c}_{v_2}$, by Euclidean distance, and compute the overall book similarity in terms of character relations between $b_r$ and $b_c$, $\mathrm{sim}^{b_r,b_c}_{v_2}$, as the average over all distances.
We obtain $\mathrm{sim}^{b_r,b_c}_{v_1}$ in an analogous process. For cosine and MVPlot we obtain a final, multi-view similarity by averaging the similarity scores obtained in each view's space,

$$\mathrm{sim}^{b_r,b_c} = \frac{1}{2}\left(\mathrm{sim}^{b_r,b_c}_{v_1} + \mathrm{sim}^{b_r,b_c}_{v_2}\right).$$

For the RMN we compute similarity only in character relation space.
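The book comparison described above can be sketched with NumPy as follows. The score is computed as an average of Euclidean distances, so lower values mean more similar books; the function names and the dictionary layout for a book's per-view vectors are assumptions made for illustration.

```python
import numpy as np

def trajectory_vector(trajectory):
    """Collapse a trajectory (time steps x k) into one k-dimensional vector."""
    return np.mean(trajectory, axis=0)

def view_distance(ref_vectors, cand_vectors):
    """Align each reference vector to its closest candidate vector by
    Euclidean distance and average the resulting distances."""
    ref = np.asarray(ref_vectors, dtype=float)
    cand = np.asarray(cand_vectors, dtype=float)
    # Pairwise distance matrix of shape (len(ref), len(cand)).
    dists = np.linalg.norm(ref[:, None, :] - cand[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

def book_distance(ref_book, cand_book):
    """Average the per-view scores; each book is {"v1": [...], "v2": [...]}."""
    return float(np.mean([view_distance(ref_book[v], cand_book[v])
                          for v in ("v1", "v2")]))

ref_book = {"v1": [[0.0, 0.0]], "v2": [[0.0, 0.0], [1.0, 1.0]]}
cand_book = {"v1": [[3.0, 4.0]], "v2": [[0.0, 0.0], [3.0, 1.0]]}
score = book_distance(ref_book, cand_book)
mean_vec = trajectory_vector([[0.2, 0.8], [0.6, 0.4]])
```

Note that the alignment is asymmetric: it runs from the reference book to the candidate, so swapping the two books can change the score when they have different numbers of characters or pairs.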
Results. Table 4 presents micro-cluster quality in terms of P@10 and MAP. The full MVPlot model statistically significantly outperforms all other models (intra-view comparisons, except for MVPlot $v_1$ versus cosine $v_1$, are also statistically significant). The same pattern emerges (Table 4 bottom). Combining information from both views boosts performance compared to the single-view versions. This confirms that MVPlot indeed encodes distinct and relevant information in the respective views.

[Table 4: Micro-cluster quality (P@10 and MAP) by model and view.]
While cosine is a strong baseline, its representations are not structured or interpretable. It consequently does not provide sufficient information for applications like book tagging or recommendation with respect to specific aspects or criteria. Similarly, RMN cannot learn representations of multiple, distinct views of the plot.
To better understand the information encoded in the individual views of MVPlot, we took a closer look at the refinement tags for which the single-character view of MVPlot ($v_1$) has the clearest advantage over the pair view ($v_2$), and vice versa. We computed tag-wise F1-scores for the two MVPlot variants. Table 5 lists the book tags for which the scores of the two views diverge the most.
In terms of types of refinements, view $v_2$ suffers most when predicting book categories or topical tags ('sports', 'second chances'), while view $v_1$ is particularly deficient at predicting character types. While this seems counterintuitive, we hypothesize that character types are to a large extent defined by their interactions with, or relations to, other characters. Topical information, however, is encoded robustly in the properties of individual characters.

Evaluating Induced Descriptors
This evaluation investigates whether induced relation descriptors indeed capture relational information. We evaluate the interpretability of the induced descriptors, comparing the $v_2$ (relation) descriptors induced by RMN and MVPlot. We apply both models to both the Amazon and the Gutenberg corpus, and report results on both data sets.
Method. We collect crowdsourced judgments on Amazon Mechanical Turk to qualitatively evaluate the learnt descriptors, following Chaturvedi et al. (2017). In each task a worker is shown one induced descriptor as a set of its 10 closest words in GloVe space (like in Table 1), and is asked to indicate whether "the words in the group describe a relation, event or interaction between people". We compare the proportion of positive answers, i.e., the number of descriptors considered relevant, for RMN descriptors and MVPlot pair descriptors. We collect 30 judgments for each of the k = 50 descriptors induced by the respective models.
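Aggregating such judgments is straightforward; the sketch below computes the share of descriptors that at least a given fraction of annotators marked as relevant. The 50% threshold and the toy votes are illustrative assumptions, and the function name is hypothetical.

```python
def relevant_fraction(judgments, threshold=0.5):
    """Share of descriptors judged relevant by at least `threshold`
    of their annotators.

    judgments: {descriptor_id: [True/False vote per annotator]}
    """
    relevant = sum(
        1 for votes in judgments.values()
        if sum(votes) / len(votes) >= threshold
    )
    return relevant / len(judgments)

# Three toy descriptors with 30 judgments each.
toy_judgments = {
    "d1": [True] * 20 + [False] * 10,  # 67% positive
    "d2": [True] * 10 + [False] * 20,  # 33% positive
    "d3": [True] * 15 + [False] * 15,  # exactly 50% positive
}
share = relevant_fraction(toy_judgments)
```

Sweeping `threshold` from 0 to 1 yields a curve of the kind one would plot to compare two models' descriptor sets across agreement levels.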
Results. Figure 2 displays our results. We observe a similar pattern of ratings across models and corpora, e.g., around 50% of the descriptors are labelled as relevant by at least 50% of the annotators. None of the differences are statistically significant, which lets us conclude that interpretability of induced descriptors is comparable for the RMN and MVPlot. This is encouraging because it confirms that representation interpretability is not compromised by MVPlot's more complex objective.

Table 1 displays examples of property and relation descriptors induced by MVPlot from the Gutenberg corpus. We can see that the different views indeed capture differing information (e.g., a $v_1$ descriptor refers to individuals' titles, while a $v_2$ descriptor refers to a love relation). Despite the smaller size and more homogeneous nature of the Gutenberg corpus, MVPlot learns meaningful representations from it, demonstrating the flexibility of our model. Figure 3 further illustrates this, displaying example local neighbourhoods of four reference novels (left) with their eight nearest neighbours ordered by proximity (left to right). The neighbourhoods are intuitively meaningful, and particularly impressive bearing in mind that the full model space contains 3,500 novels. While most neighbourhoods are dominated by novels of the same author, some exceptions emerge. Row two, for example, contains novels by Thomas Hardy and Charles Dickens, who are both known for biographical 19th century novels focusing on class and social changes.

Conclusions
Content-based micro-clustering of novels is a complex but interesting task. In order to eventually augment the diverse associations humans have, models must be able to pick up rich and structured signals from raw text. This paper presented a deep recurrent autoencoder which learns multi-view representations of plots, and introduced a principled evaluation framework using clusters based on expert-provided book tags.
Our evaluation showed that rich multi-view representations are better suited to recover such reference clusters compared to each individual view, as well as compared to simpler, but competitive models which induce less structured representations. Our view-specific representations are interpretable, which allows us to analyse and explain the emerging micro-clusters. They might reveal previously unnoticed parallels between novels and may be useful for literary analysis or content-based recommendation. This is an exciting avenue for future work.
Our method is general and scalable both in terms of its input, utilizing raw text with only automatic pre-processing, and in terms of the number of distinct views it can learn. We described an objective function which encourages views to encode distinct information. In future work we plan to explore joint learning of more and different views.
Our approach relies strongly on the assumption that text spans mentioning two characters contain information about character relations, and that text spans mentioning one character contain information about the character's properties. While our results suggest that these assumptions are valid, they are arguably crude. In the future we plan to define more targeted input, e.g., by using semantic and syntactic information from dependency parses.
In this work we induced dual-view representations of book content. However, we emphasize that the proposed method is very general. The number and kinds of views, as well as underlying data are in no way constrained, as long as relevant view-specific input can be defined. In the context of novel representation it would be interesting to induce additional views, for example one that captures the mood of a novel. Another interesting avenue for future work would be to apply the framework to questions arising in the digital humanities, e.g., to extract different views from news articles.
The presented model and evaluation are designed with the objective to detect different kinds of similarity between novels, with the ultimate goal to enrich human-provided genres and tags. We described a first step in this direction, verifying that the learnt information is meaningful and can reproduce human-created semantic book tags. Expert book tags exist for a wide variety of information (mood, theme, characters), and provide a rich evaluation environment for learnt representations. We invite the community to join us in exploring the full space of opportunities and evaluating induced representations holistically in the future.