UvA-DARE (Digital Academic Repository) A Data-Oriented Model of Literary Language

We consider the task of predicting how literary a text is, with a gold standard from human ratings. Aside from a standard bigram baseline, we apply rich syntactic tree fragments, mined from the training set, and a series of hand-picked features. Our model is the first to distinguish degrees of highly and less literary novels using a variety of lexical and syntactic features, and explains 76.0 % of the variation in literary ratings.


Introduction
What makes a literary novel literary? This seems first of all to be a value judgment; but to what extent is this judgment arbitrary, determined by social factors, or predictable as a function of the text? The last explanation is associated with the concept of literariness, the hypothesized linguistic and formal properties that distinguish literary language from other language (Baldick, 2008). Although the definition and demarcation of literature is fundamental to the field of literary studies, it has received surprisingly little empirical study. Common wisdom has it that literary distinction is attributed in social communication about novels and that it lies mostly outside of the text itself (Bourdieu, 1996), but an increasing number of studies argue that in addition to social and historical explanations, textual features of various complexity may also contribute to the perception of literature by readers (cf. Harris, 1995; McDonald, 2007). The current paper shows that not only lexical features but also hierarchical syntactic features and other textual characteristics contribute to explaining judgments of literature.
Our main goal in this project is to answer the following question: are there particular textual conventions in literary novels that contribute to readers judging them to be literary? We address this question by building a model of literary evaluation to estimate the contribution of textual factors. This task has been considered before with a smaller set of novels (restricted to thrillers and literary novels), using bigrams (van Cranenburgh and Koolen, 2015). We extend this work by testing on a larger, more diverse corpus, and by applying rich syntactic features and several hand-picked features to the task. This task is first of all relevant to literary studies, to reveal to what extent literature is empirically associated with textual characteristics. However, practical applications are also possible; e.g., an automated model could help a literary publisher decide whether the work of a new author fits its audience; or it could be used as part of a recommender system for readers.
Literary language is arguably a subjective notion. A gold standard could be based on the expert opinions of critics and literary prizes, but we can also consider the reader directly, which, in the form of a crowdsourced survey, more easily provides a statistically adequate number of responses. We therefore base our gold standard on a large online survey of readers with ratings of novels.
Literature comprises some of the most rich and sophisticated language, yet stylometry typically does not exploit linguistic information beyond part-of-speech (POS) tags or grammar productions, when syntax is involved at all (cf., e.g., Stamatatos et al., 2009; Ashok et al., 2013). While our results confirm that simple features are highly effective, we also employ full syntactic analyses and argue for their usefulness. We consider tree fragments: arbitrarily-sized connected subgraphs of parse trees (Swanson and Charniak, 2012; Bergsma et al., 2012; van Cranenburgh, 2012). Such features are central to the Data-Oriented Parsing framework (Scha, 1990; Bod, 1992), which postulates that language use derives from arbitrary chunks (e.g., syntactic tree fragments) of previous language experience. In our case, this suggests the following hypothesis.
HYPOTHESIS 1: Literary authors employ a distinctive inventory of lexico-syntactic constructions (e.g., a register) that marks literary language.
Next we provide an analysis of these constructions which supports our second hypothesis.
HYPOTHESIS 2: Literary language invokes a larger set of syntactic constructions when compared to the language of non-literary novels, and therefore more variety is observed in the parse tree fragments whose occurrence frequencies are correlated with literary ratings.
The support provided for these hypotheses suggests that the notion of literature can be explained, to a substantial extent, from textual factors, which contradicts the belief that external, social factors are more dominant than internal, textual factors.

Task, experimental setup
We consider a regression problem of a set of novels and their literary ratings. These ratings have been obtained in a large reader survey (about 14k participants), in which 401 recent, bestselling Dutch novels (as well as works translated into Dutch) were rated on a 7-point Likert scale from definitely not to highly literary. The participants were presented with the author and title of each novel, and provided ratings for novels they had read. The ratings may have been influenced by well-known authors or titles, but this does not affect the results of this paper because the machine learning models are not given such information. The task we consider is to predict the mean rating for each novel. We exclude 16 novels that have been rated by fewer than 50 participants. 91 % of the remaining novels have a t-distributed 95 % confidence interval of width < 0.5; e.g., given a mean of 3, the confidence interval typically ranges from 2.75 to 3.25. Therefore for our purposes the ratings form a reliable consensus. Novels rated as highly literary have smaller confidence intervals, i.e., show a stronger consensus. Where a binary distinction is needed, we call a rating of 5 or higher 'literary.'

Since we aim to extract relevant features from the texts themselves and the number of novels is relatively small, we apply cross-validation, so as to exploit the data to the fullest extent while maintaining an out-of-sample approach. We divide the corpus in 5 folds of roughly equal size, with the following constraints: (a) novels by the same author must be in the same fold, since we want to rule out any influence of author style on feature selection or model validation; (b) the distribution of literary ratings in each fold should be similar to the overall distribution (stratification).
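The two fold constraints can be illustrated with a minimal sketch. The procedure below is an assumption made for illustration, not the method used in the experiments: authors are kept together by assigning whole author groups to a fold, and the rating distribution is roughly balanced by sorting the groups on mean rating and dealing them out in turn. The function name `assign_folds` and the toy data are hypothetical.

```python
from collections import defaultdict


def assign_folds(novels, n_folds=5):
    """Map each title to a fold index.

    novels: list of (title, author, rating) tuples (hypothetical format).
    """
    # Constraint (a): group novels by author so an author never spans folds.
    by_author = defaultdict(list)
    for title, author, rating in novels:
        by_author[author].append((title, rating))
    # Constraint (b), crudely: sort author groups by mean rating and deal
    # them out round-robin, so each fold receives low and high ratings.
    groups = sorted(
        by_author.values(),
        key=lambda group: sum(r for _, r in group) / len(group),
    )
    folds = {}
    for i, group in enumerate(groups):
        for title, _ in group:
            folds[title] = i % n_folds
    return folds


# Toy data, invented for the example.
novels = [
    ("a1", "A", 2.0), ("a2", "A", 2.5), ("b1", "B", 6.0), ("c1", "C", 4.0),
    ("d1", "D", 3.0), ("e1", "E", 5.0), ("f1", "F", 1.5),
]
folds = assign_folds(novels)
```

This sketch does not balance fold sizes when authors differ widely in the number of novels; a real setup would need to handle that as well.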
We control for length and potential particularities of the start of novels by considering sentences 1000-2000 of each novel. We exclude 18 novels with fewer than 2000 sentences. Together with the constraint of at least 50 ratings, this brings the total number of novels we consider to 369.
We evaluate the effectiveness of the features using a ridge regression model, with 5-fold cross-validation; we do not tune the regularization. The results are presented incrementally, to illustrate the contribution of each feature relative to the features before it. This makes it possible to gauge the effective contribution of each feature while taking any overlap into account.
We use R² as the evaluation metric, expressing the percentage of variance explained (perfect score 100); this shows the improvement of the predictions over a baseline model that always predicts the mean value (4.2, in this dataset). A mean baseline model is therefore defined to have an R² of 0. Other baseline models, e.g., always predicting 3.5 or 7, attain negative R² scores, since they perform worse than the mean baseline. Similarly, a random baseline will yield a negative expected R².
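The behaviour of the metric under constant baselines can be checked with a short sketch. The five ratings below are invented for the example (chosen so that their mean is 4.2); `r2_score` is a hypothetical helper written out in plain Python, expressed as a percentage as in the text.

```python
def r2_score(actual, predicted):
    """Percentage of variance explained relative to the mean baseline."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 100 * (1 - ss_res / ss_tot)


ratings = [2.0, 3.5, 4.0, 5.5, 6.0]  # toy ratings with mean 4.2
mean_rating = sum(ratings) / len(ratings)

mean_baseline = r2_score(ratings, [mean_rating] * len(ratings))  # zero
low_baseline = r2_score(ratings, [3.5] * len(ratings))  # negative
high_baseline = r2_score(ratings, [7.0] * len(ratings))  # negative
perfect = r2_score(ratings, ratings)  # 100
```

Any constant prediction other than the mean has a larger residual sum of squares than the mean itself, which is why those baselines score below zero.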

Basic features
Sentence length, direct speech, vocabulary richness, and compressibility are simple yet effective stylometric features. We count direct speech sentences by matching on specific punctuation; this provides a measure of the amount of dialogue versus narrative text in the novel. Vocabulary richness is defined as the proportion of words in a text that appear in the top 3000 most common words of a large reference corpus (Sonar 500; Oostdijk et al., 2013); this reflects, inversely, the proportion of difficult or unusual words. Compressibility is defined as the bzip2 compression ratio of the texts; the intuition is that a repetitive and predictable text will be highly compressible. CLICHES is the number of cliché expressions in the texts based on an external dataset of 6641 clichés (van Wingerden and Hendriks, 2015); clichés, being marked as informal and unoriginal, are expected to be more prevalent in non-literary texts. Table 1 shows the results of these features. Several other features were also evaluated but were either not effective or did not achieve appreciable improvements when these basic features are taken into account; notably Flesch readability (Flesch, 1948), average dependency length (Gibson, 2000), and D-level (Covington et al., 2006).
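As an illustration, four of these basic features might be computed along the following lines. This is a sketch under stated assumptions rather than the implementation used here: the quotation marks that signal direct speech, the direction of the compression ratio (compressed size divided by original size), and the tiny `common_words` set standing in for the top-3000 reference list are all choices made for the example.

```python
import bz2


def basic_features(sentences, common_words):
    """Compute simple stylometric features.

    sentences: list of tokenized sentences (lists of non-empty strings).
    common_words: set of frequent words (stand-in for a reference list).
    """
    tokens = [tok.lower() for sent in sentences for tok in sent]
    text = "\n".join(" ".join(sent) for sent in sentences).encode("utf8")
    # Assumption: a sentence is direct speech if it opens with a quote mark.
    quotes = ("'", '"', "\u2018", "\u201c")
    n_speech = sum(1 for sent in sentences if sent and sent[0].startswith(quotes))
    return {
        "mean_sentence_length": len(tokens) / len(sentences),
        "direct_speech": n_speech / len(sentences),
        "common_word_proportion": sum(t in common_words for t in tokens) / len(tokens),
        # Assumption: ratio = compressed size / original size, so a lower
        # value means a more repetitive, predictable text.
        "compression_ratio": len(bz2.compress(text)) / len(text),
    }


# Toy input, invented for the example.
sentences = [
    ["'", "Yes", ",", "but", "'", "he", "said", "."],
    ["He", "left", "."],
]
features = basic_features(sentences, {"he", "yes", "but", "said", "left"})
```

On texts this short the bzip2 header dominates, so the compression ratio only becomes informative on samples of realistic length, such as the 1000 sentences per novel considered here.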

Automatically induced features
In this section we consider extracting syntactic features, as well as three (sub)lexical baselines.
TOPICS is a set of 50 topic weights induced with Latent Dirichlet Allocation (LDA; Blei et al., 2003) from the corpus (for details, cf. Jautze et al., 2016).
Furthermore, we use character and word n-gram features. For words, bigrams present a good trade-off in terms of informativeness (a bigram frequency is more specific than the frequency of an individual word) and sparsity (three or more consecutive words result in a large number of n-gram types with low frequencies). For character n-grams, n = 4 achieved good performance in previous work (e.g., Stamatatos, 2006).
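For concreteness, the two n-gram feature types can be extracted as follows. This is a generic sketch; the helper names are hypothetical and details such as tokenization, case folding, and treatment of sentence boundaries are left out.

```python
from collections import Counter


def word_bigrams(tokens):
    """Count pairs of consecutive tokens."""
    return Counter(zip(tokens, tokens[1:]))


def char_ngrams(text, n=4):
    """Count overlapping character n-grams, including spaces."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


bigrams = word_bigrams("to be or not to be".split())
fourgrams = char_ngrams("abcdabcd")
```

Both functions return sparse counts; in a regression setting these would typically be normalized by text length before use.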
We note three limitations of n-grams. First, the fixed n: larger or discontiguous chunks are not extracted. Combining n-grams does not help since a linear model cannot capture feature interactions, nor is the consecutive occurrence of two features captured in the bag-of-words representation. Second, larger n implies a combinatorial explosion of possible features, which makes it desirable to select the most relevant features. Finally, word and character n-grams are surface features without linguistic abstraction. One way to overcome these limitations is to turn to syntactic parse trees and mine them for relevant features unrestricted in size.
Specifically, we consider tree fragments as features, which are arbitrarily-sized fragments of parse trees. If a parse tree is seen as consisting of a sequence of grammar productions, a tree fragment is a connected subsequence thereof. Compared to bag-of-words representations, tree fragments can capture both syntactic and lexical elements; and these combine to represent constructions with open slots (e.g., to take NP into account), or sentence templates (e.g., "Yes, but ...", he said). Tree fragments are thus a very rich source of features, and larger or more abstract features may prove to be more linguistically interpretable.
We present a data-driven method for extracting and selecting tree fragments. Due to combinatorics, there is an exponential number of possible fragments given a parse tree. For this reason it is not feasible to extract all fragments and select the relevant ones later; we therefore use a strategy to directly select fragments for which there is evidence of re-use by considering commonalities in pairs of trees. This is done by extracting the largest common syntactic fragments from pairs of trees (Sangati et al., 2010; van Cranenburgh, 2014). This method is related to tree-kernel methods (Collins and Duffy, 2002; Moschitti, 2006), with the difference that it extracts an explicit set of fragments. The feature selection approach is based on relevance and redundancy (Yu and Liu, 2004), similar to Swanson and Charniak (2013). Kim et al. (2011) also use tree fragments, for authorship attribution, but with a frequent tree mining approach; the difference with our approach is that we extract the largest fragments attested in each tree pair, which are not necessarily the most frequent.
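The idea of extracting largest common fragments from a pair of trees can be illustrated with a simplified sketch. It is not the algorithm of Sangati et al. (2010): trees are assumed to be nested tuples `(label, child, ...)` with words as plain strings, discontinuous constituents and binarization are ignored, and fragments contained in a larger fragment are not filtered out. All function names are hypothetical.

```python
def production(node):
    """The node label plus its children: words as-is, subtrees as (label,)."""
    return (node[0],) + tuple(
        child if isinstance(child, str) else (child[0],) for child in node[1:]
    )


def common_fragment(a, b):
    """Largest fragment rooted at both a and b, or None if productions differ."""
    if production(a) != production(b):
        return None
    children = []
    for child_a, child_b in zip(a[1:], b[1:]):
        if isinstance(child_a, str):
            children.append(child_a)  # shared word
        else:
            sub = common_fragment(child_a, child_b)
            # Children that expand differently become substitution sites,
            # represented by a bare label.
            children.append(sub if sub is not None else (child_a[0],))
    return (a[0],) + tuple(children)


def nodes(tree):
    """Yield all non-terminal nodes of a tree."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)


def common_fragments(t1, t2):
    """Set of maximal common fragments over all pairs of nodes."""
    found = {common_fragment(a, b) for a in nodes(t1) for b in nodes(t2)}
    found.discard(None)
    return found


# Toy trees, invented for the example.
t1 = ("S", ("NP", ("DT", "the"), ("N", "dog")), ("VP", ("V", "barked")))
t2 = ("S", ("NP", ("DT", "the"), ("N", "cat")), ("VP", ("V", "slept")))
fragments = common_fragments(t1, t2)
```

For these two toy trees the largest shared fragment is a sentence template in which the determiner is lexicalized while the noun and verb remain open slots, which is the kind of mixed lexical and syntactic feature described above. Comparing all node pairs is quadratic in tree size, so this sketch only shows the principle.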

Preprocessing
We parse the 369 novels with Alpino (Bouma et al., 2001). The parse trees include discontinuous constituents; non-terminal labels consist of syntactic categories, function tags, and selected morphological features;³ and constituents are binarized head-outward with a markovization of h=1, v=1 (Klein and Manning, 2003).
For a fragment to be attested in a pair of parse trees, its labels need to match exactly, including the aforementioned categories, tags, and features. The h=1 binarization implies that fragments may contain partial constituents; i.e., a contiguous sequence of children from an n-ary constituent.
Figure 1 shows an example parse tree; for brevity, this tree is rendered without binarization. The non-terminal labels consist of a syntactic category (shown in red), followed by a function tag (green). The part-of-speech tags additionally have morphological features (black) in square brackets. Some labels contain percolated morphological features, prefixed by a colon.

Mining syntactic tree fragments
The procedure is divided in two parts. The first part concerns fragment extraction:
1. Given texts divided in folds F_1 ... F_n, each C_i is the set of parse trees obtained from parsing all texts in F_i. Extract the largest common fragments of the parse trees in all pairs of folds ⟨C_i, C_j⟩ with i < j. A common fragment f of parse trees t_1, t_2 is a connected subgraph of t_1 and t_2. The result is a set of initial candidates that occur in at least two different texts, stored separately for each pair of folds ⟨C_i, C_j⟩.
2. Count occurrences of all fragments in all texts.
Fragment selection is done separately w.r.t. each test fold. Given test fold i, we consider the fragments found in training folds {1..n} \ i; e.g., given n = 5, for test fold 1 we select only from the fragments and their counts as observed in training folds 2-5. Given a set of fragments from training folds, selection proceeds as follows:
1. Zero count threshold: remove fragments that occur in less than 5 % of texts (too specific to particular novels); frequency threshold: remove fragments that occur less than 50 times across the corpus (too rare to reliably detect a correlation with the ratings).
2. Relevance threshold: select fragments by considering the correlation of their counts with the literary ratings of the novels in the training folds. Apply a simple linear regression based on the Pearson correlation coefficient, and use an F-test to filter out fragments whose p-value > 0.05. The F-test determines significance based on the number of datapoints N, and the correlation r; the effective threshold is approximately |r| > 0.11.
3. Redundancy removal: greedily select the most relevant fragment and remove other fragments that are too similar to it. Similarity is measured by computing the correlation coefficient between the feature vectors of two fragments, with a cutoff of |r| > 0.5. Experiments where this step was not applied indicated that it improves performance.

³ [...] as infinite verbs, auxiliary verbs, proper nouns, subordinating conjunctions, personal pronouns, and postpositions.
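The relevance and redundancy steps can be sketched as follows. This is a simplified illustration, not the code used in the experiments. Two assumptions are made: the critical value of the F-test is approximated with the normal distribution (for simple linear regression F = t² = r²(N − 2)/(1 − r²), so a significance level translates into a cutoff on |r|), and the training set is taken to hold about 295 novels (four fifths of 369). The names `r_threshold` and `select` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def r_threshold(n, alpha=0.05):
    """Smallest |r| that is significant at level alpha given n datapoints."""
    # Assumption: normal approximation to the t critical value.
    t = NormalDist().inv_cdf(1 - alpha / 2)
    return t / sqrt(t * t + n - 2)


def select(counts, ratings, max_similarity=0.5):
    """counts: dict mapping each fragment to its list of per-novel counts."""
    cutoff = r_threshold(len(ratings))
    relevance = {f: abs(pearson(c, ratings)) for f, c in counts.items()}
    # Relevance threshold: keep significantly correlated fragments,
    # most relevant first.
    candidates = sorted(
        (f for f in counts if relevance[f] > cutoff),
        key=relevance.get, reverse=True,
    )
    # Redundancy removal: greedily keep a fragment only if it is not too
    # similar to any fragment that was already selected.
    selected = []
    for f in candidates:
        if all(abs(pearson(counts[f], counts[g])) <= max_similarity
               for g in selected):
            selected.append(f)
    return selected


threshold = r_threshold(295)  # assumed training set size

# Toy counts, invented for the example.
ratings = [1, 2, 3, 4, 5, 6, 7, 8]
counts = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 6, 8, 10, 12, 14, 17],  # nearly a copy of "a"
    "c": [5, 5, 5, 5, 5, 5, 5, 5],      # constant, uncorrelated
}
chosen = select(counts, ratings)
```

Under these assumptions the cutoff for 295 novels comes out close to the |r| > 0.11 stated above. In the toy example only the first fragment survives: the second is redundant with it and the third is not correlated with the ratings.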
Note that there is some risk of overfitting since fragments are both extracted and selected from the training set. However, this is mitigated by the fact that fragments are extracted from pairs of folds, while selection is constrained to fragments that are attested and significantly correlated across the whole training set.
The values for the thresholds were chosen manually and not tuned, since the limited number of novels is not enough to provide a proper tuning set.

[...] tuning these parameters we pick fixed values of C=100 and ε=0, reducing regularization compared to the default of C=1 and disabling the threshold.
Cf. Table 3 for the scores. The syntactic fragments perform best, followed by character 4-grams and word bigrams. We report scores for each of the 5 folds separately because the variance between folds is high. However, the differences between the feature types are relatively consistent. The variance is not caused by the distribution of ratings, since the folds were stratified on this. Nor can it be explained by the agreement in ratings per novel, since the 95 % confidence intervals of the individual ratings for each novel were of comparable width across the folds. Lastly, author gender, genre, and whether the novel was translated do not differ markedly across the folds. It seems most likely that the novels simply differ in how predictable their ratings are from textual features.
In order to gauge to what extent these automatically induced features are complementary, we combine them in a single model together with the basic features; cf. the scores in Table 4. Both character 4-grams and syntactic fragments still provide a relatively large improvement over the previous features, taking into account the inherent diminishing returns of adding more features.
Figure 2 shows a bar plot of the ten novels with the largest prediction error with the fragment and word bigram models. Of these novels, 9 are highly literary and underestimated by the model. For the other novel (Smeets, Afrekening) the literary rating is overestimated by the model. Since this top 10 is based on the mean prediction from both models, the error is large for both models. This does not change when the top 10 errors using only fragments or bigrams are inspected; i.e., the hardest novels to predict are hard with both feature types.

What could explain these errors? At first sight, there is no obvious commonality between the literary novels that are predicted well, or between the ones with a large error; e.g., whether the novels have been translated or not does not explain the error. A possible explanation is that the successfully predicted literary novels share a particular (e.g., rich) writing style that sets them apart from other novels, while the literary novels that are underestimated by the model are not marked by such a writing style. It is difficult to confirm this directly by inspecting the model, since each prediction is the sum of several thousand features, and the contributions of these features form a long tail. If we define the contribution of a feature as the absolute value of its weight times its relative frequency in the document, then in case of Barnes, The sense of an ending, the top 100 features contribute only 34 % of the total prediction.
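The contribution measure defined above is simple to compute for a linear model. The sketch below uses invented weights and frequencies, and the helper name `top_k_share` is hypothetical; it returns the percentage of the summed contributions that is due to the k largest ones.

```python
def top_k_share(weights, rel_freqs, k=100):
    """Percentage of total contribution accounted for by the top k features."""
    # Contribution: absolute value of weight times relative frequency.
    contributions = sorted(
        (abs(w * f) for w, f in zip(weights, rel_freqs)), reverse=True
    )
    return 100 * sum(contributions[:k]) / sum(contributions)


# Toy model with four features, invented for the example.
share = top_k_share([2.0, -1.0, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5], k=1)
```

A low share for a large k, as reported for the novel by Barnes, indicates that the prediction is spread thinly over many features rather than driven by a handful of them.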
Table 5 gives the basic features for the top 4 literary novels with the largest error and contrasts them with 4 literary novels which are well predicted. The most striking difference is sentence length: the underestimated literary novels have markedly shorter sentences. Voskuil and Franzen have a higher proportion of direct speech (they are in fact the only literary novels in the top 10 novels with the most direct speech). Lastly, the underestimated novels have a higher proportion of common words (lower vocabulary richness). These observations are compatible with the explanation suggested above, that a subset of the literary novels share a simple, readable writing style with non-literary novels. Such a style may be more difficult to detect than a literary style with long and complex sentences, or rich vocabulary and phraseology, because a simple, well-crafted sentence may not offer overt surface markers of stylization. Book reviews appear to support this notion for The sense of an ending: "A slow burn, measured but suspenseful, this compact novel makes every slyly crafted sentence count" (Tonkin, 2011); and "polished phrasings, elegant verbal exactness and epigrammatic perceptions" (Kemp, 2011).
In order to test whether the amount of data is sufficient to learn to predict the ratings, we construct a learning curve for different training set sizes; cf. Figure 3. The set of novels is shuffled once, so that initial segments of different size represent random samples. The novels are sampled in 5 % increments (i.e., 20 models are trained). The graphs show the cross-validated scores.
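The sampling scheme for the learning curve can be sketched as follows; the function name and the fixed random seed are assumptions made for the example. Because the list is shuffled only once, every smaller sample is a prefix of every larger one.

```python
import random


def learning_curve_subsets(novels, step_percent=5, seed=0):
    """Return nested random samples of increasing size."""
    shuffled = list(novels)
    random.Random(seed).shuffle(shuffled)  # shuffle once, then take prefixes
    n = len(shuffled)
    sizes = [n * pct // 100 for pct in range(step_percent, 101, step_percent)]
    return [shuffled[:size] for size in sizes]


subsets = learning_curve_subsets(range(369))
```

Each subset would then be evaluated with the same cross-validation setup as the full corpus, yielding one point on the curve per subset.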
The graphs show that increasing the number of novels has a large effect on performance. The curve is steep up to 30 % of the training set, and the performance keeps improving steadily but more slowly up to the last data point. Since the performance is relatively flat starting from 85 %, we can conclude that the k-fold cross-validation with k = 5 provides an adequate estimate of the model's performance if [...]

A similar experiment was performed varying the number of features. Here the performance plateaus quickly and reaches an R² of 53.0 % at 40 %, and grows only slightly from that point.

Metadata features
In addition to textual features, we also include three (categorical) metadata features not extracted from the text, but still an inherent feature of the novel in question: GENRE, TRANSLATED, and AUTHOR GENDER; cf. Table 6 for the results. Figure 4 shows a visualization of the predictions in a scatter plot.
GENRE is the coarse genre classification (Fiction, Suspense, Romantic, Other), derived from the publisher's categorization. Genre alone is already a strong predictor, with an R² of 58.3. However, this score is arguably misleading, because the predictions are very coarse due to the discrete nature of the feature.
A striking result is that the variables AUTHOR GENDER and TRANSLATED increase the score, but only when they are both present.Inspecting the mean ratings shows that translated novels by female authors have an average rating of 3.8, while originally Dutch male authors are rated 5.0 on average; the ratings of the other combinations lie in between these extremes.This explains why the combination works better than either feature on its own, but due to possible biases inherent in the makeup of the corpus, such as which female or translated authors are published and selected for the corpus, no conclusions on the influence of gender or translation should be drawn from these datapoints.

Previous work
Table 7 shows an overview of previous work on the task of predicting the (literary) quality of novels. Note that the datasets and targets differ, therefore none of the results are directly comparable. For example, regression is a more difficult task than binary classification, and recognizing the difference between an average and highly literary novel is more difficult than distinguishing either from a different domain or genre (e.g., newswire). Louwerse et al. (2008) discriminate literature from other texts using Latent Semantic Analysis. Ashok et al. (2013) use bigrams, POS tags, and grammar productions to predict the popularity of Gutenberg texts. van Cranenburgh and Koolen (2015) predict the literary ratings of texts, as in the present paper, but only using bigrams, and on a smaller, less diverse corpus. Compared to previous work, this paper gives a more precise estimate of how well shades of literariness can be predicted from a diverse range of features, including larger and more abstract syntactic constructions.

Analysis of selected tree fragments
An advantage of parse tree fragments is that they offer opportunities for interpretation in terms of linguistic aspects as well as basic distributional aspects such as shape and size.
Figure 5 shows three fragments ranked highly by the correlation metric, as extracted from the first fold. The first fragment shows an incomplete constituent, indicated by the ellipses as first and last leaves. Such incomplete fragments are made possible by the binarization scheme (cf. Sec. 4.1).
Table 8 shows a breakdown of fragment types in the first fold. In contrast with n-grams, we also see a large proportion of purely syntactic fragments, and fragments mixing both lexical elements and substitution sites. In the case of discontinuous fragments, it turns out that the majority has a positive correlation; this might be due to being associated with more complex constructions.
Figure 6 shows a breakdown by fragment size (defined as number of non-terminals), distinguishing fragments that are positively versus negatively correlated with the literary ratings.
Note that 1 and 3 are special cases corresponding to lexical (e.g., DT → the) and binary grammar productions (e.g., NP → DT N), respectively. The fragments with 2, 4, and 6 non-terminals are not as common because an even number implies the presence of unary nodes. Except for fragments of size 1, the frontier of fragments can consist of either substitution sites or terminals (since we distinguish only the number of non-terminals). On the one hand smaller fragments corresponding to one or two grammar productions are most common, and are predominantly positively correlated with the literary ratings. On the other hand there is a significant negative correlation between fragment size and literary ratings (r = -0.2, p < 0.001); i.e., smaller fragments tend to be positively correlated with the literary ratings.

It is striking that there are more positively than negatively correlated fragments, while literary novels are a minority in the corpus (88 out of 369 novels are rated 5 or higher). Additionally, the breakdown by size shows that the larger number of positively correlated fragments is due to a large number of small fragments of size 3 and 5; however, combinatorially, the number of possible fragment types grows exponentially with size (as reflected in the initial set of recurring fragments), so larger fragment types would be expected to be more numerous. In effect, the selected negatively correlated fragments ignore this distribution by being relatively uniform with respect to size, while the [...]

What could explain the peak of positively correlated, small fragments? In order to investigate the peak of small fragments, we inspect the 40 fragments of size 3 with the highest correlations. These fragments contain indicators of unusual or more complex sentence structure:
• DU, dp: discourse phenomena for which no specific relation can be assigned (e.g., discourse relations beyond the sentence level).
• appositive NPs, e.g., 'John the artist.'
• a complex NP, e.g., containing punctuation, nested NPs, or PPs.
• an NP containing an adjective used nominally or an infinitive verb.
On the other hand, most non-literary fragments are top-level productions containing ROOT or clause-level labels, for example to introduce direct speech.

Another way of analyzing the selected fragments is by frequency. When we consider the total frequencies of selected fragments across the corpus, there is a range of 50 to 107,270. The bulk of fragments have a low frequency (before fragment selection 2 is by far the most dominant frequency), but the tail is very long. Except for the fact that there is a larger number of positively correlated fragments, the histograms have a very similar shape.
Lastly, Figure 7 shows a breakdown by the syntactic categories and function tags of the root node of the fragments. The positively correlated fragments are spread over a larger variety of both syntactic categories and function tags. This means that for most labels, the number of positively correlated fragments is higher; the exceptions are ROOT, SV1 (a verb-initial phrase, not part of the top 15), and the absence of a function tag (indicative of a non-terminal directly under the root node). All of these exceptions point to a tendency for negatively correlated fragments to represent templates of complete sentences.

Conclusion
The answer to the main research question is that literary judgments are non-arbitrary and can be explained to a large extent from the text itself: there is an intrinsic literariness to literary texts. Our model employs an ensemble of textual features that show a cumulative improvement on predictions, achieving a total score of 76.0 % variation explained. This result is remarkably robust: not just broad genre distinctions, but also finer distinctions in the ratings are predicted.
The experiments showed one clear pattern: literary language tends to use a larger set of syntactic constructions than the language of non-literary novels. This also provides evidence for the hypothesis that literature employs a specific inventory of constructions. All evidence points to a notion of literature which to a substantial extent can be explained purely from internal, textual factors, rather than being determined by external, social factors.

Figure 1: A parse tree fragment from Franzen, The Corrections. Original sentence: something terrible was going to happen.

Figure 3: Learning curve when varying training set size. The error bars show the standard error.

Figure 4: A scatter plot of regression predictions and actual literary ratings. Original/translated titles. Note the histograms beside the axes showing the distribution of ratings (top) and predictions (right).

Figure 5: Three fragments whose frequencies in the first fold have a high correlation with the literary ratings. Note the different scales on the y-axis. From left to right: Blue: complex NP with comma; Green: quoted speech; Red: adjunct PP with indefinite article.

Figure 7: Breakdown by category (above) and function tag (below) of fragment root (top 15 labels).
Table 2 lists the number of fragments extracted from folds 2-5 after each of these steps.

Table 2: The number of fragments in folds 2-5 after each filtering step.

Table 3: Regression evaluation. R² scores on the 5 cross-validation folds.

Table 8: Breakdown of fragment types selected in the first fold.