A Simple Approach to Classify Fictional and Non-Fictional Genres

In this work, we deploy a logistic regression classifier to ascertain whether a given document belongs to the fiction or non-fiction genre. For genre identification, previous work has proposed three classes of features, viz., low-level (character-level and token counts), high-level (lexical and syntactic information) and derived features (type-token ratio, average word length or average sentence length). Using the recursive feature elimination with cross-validation (RFECV) algorithm, we perform feature selection experiments on an exhaustive set of nineteen features (belonging to all the classes mentioned above) extracted from Brown corpus text. As a result, two simple features, viz., the ratio of the number of adverbs to adjectives and the number of adjectives to pronouns, turn out to be the most significant. Subsequently, our classification experiments aimed at genre identification of documents from the Brown and Baby BNC corpora demonstrate that the performance of a classifier containing just the two aforementioned features is on par with that of a classifier containing the exhaustive feature set.


Introduction
Texts written in any human language can be classified in various ways, one of them being into fiction and non-fiction genres. These categories/genres can refer either to the actual content of the write-up or to the writing style used, and in this paper, we use the latter meaning. We associate fiction writing with a literary perspective, i.e., an imaginative form of writing which has its own purpose of communication, whereas non-fiction writing is written in a matter-of-fact manner, though the contents may or may not refer to real-life incidents (Lee, 2001). The distinction between imaginative and informative prose is very important and can have several practical applications. For example, one could use software to identify news articles which are expected to be written in a matter-of-fact manner but tend to use an imaginative writing style to unfairly influence the reader. Another application could be in publishing houses, which could use such software to automatically filter out article/novel submissions that do not meet certain expected aspects of fiction writing style.
The standard approach to solving such text classification problems is to identify a large enough set of relevant features and feed it into a machine learning algorithm. In the genre identification literature, three types of linguistic features have been discussed, i.e., high-level, low-level and derived features (Karlgren and Cutting, 1994; Kessler et al., 1997; Douglas, 1992; Biber, 1995). High-level features include lexical and syntactic information, whereas low-level features involve character-level and various types of token count information. The lexical features deal with word frequency statistics such as the frequency of content words, function words or specific counts of each pronoun, etc. Similarly, the syntactic features incorporate statistics of parts of speech, i.e., nouns, verbs, adjectives and adverbs, grammatical functions such as active and passive voices, and affective markers such as modal auxiliary verbs. The character-level features involve punctuation usage, word count, word length and sentence length. Lastly, the derived features involve ratio metrics such as the type-token ratio, average word length or average sentence length. Broadly, previous work has involved combining different features to represent a particular aspect of the document and developing a model that classifies different genres, sentiments or opinions.
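To make the derived features concrete, they can be computed in a few lines of code. The following is an illustrative sketch (the function and field names are our own, not from the original implementation), operating on pre-tokenized sentences:

```python
def derived_features(sentences):
    """Derived ratio features from a list of tokenized sentences
    (each sentence is a list of word tokens)."""
    tokens = [tok.lower() for sent in sentences for tok in sent]
    types = set(tokens)  # distinct word forms
    return {
        "type_token_ratio": len(types) / len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
    }

feats = derived_features([["the", "cat", "sat"], ["the", "dog", "ran", "away"]])
```

In practice one would first strip punctuation tokens, since word- and punctuation-based statistics are treated as separate features in this work.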
Notably, researchers have adopted the frequentist approach (Sichel, 1975; Zipf, 1932, 1945) and used lexical richness (Tweedie and Baayen, 1998) as a prominent cue for genre classification (Burrows, 1992; Stamatatos et al., 1999, 2000). These studies contend that statistical distributions derived from word frequencies can serve as the de facto arbiter for document classification. In this regard, Stamatatos and colleagues have shown that the most frequent words in the training corpus, as well as in the entire English language, are good features for detecting the genre type (Stamatatos et al., 2000). With respect to syntactic and semantic properties of the text, previous studies have used various parts-of-speech counts in terms of numbers of types and tokens (Rittman et al., 2004; Rittman, 2007; Rittman and Wacholder, 2008; Cao and Fang, 2009). Researchers have also investigated the efficacy of count vs. ratio features and their impact on classification performance. In general, a large number of features often causes a machine learning model to overfit. Hence, concerning the derived ratio features, Kessler et al. (1997) argue in their genre identification study that ratio features tend to reduce over-fitting as well as the computational cost of training.
Although these earlier approaches have made very good progress in text classification, and are very powerful from an algorithmic perspective, they do not provide many insights into the linguistic and cognitive aspects of the fiction and non-fiction genres. The main objective of our work is to extract the features that are most relevant to this particular classification problem and can help us understand the underlying linguistic properties of these genres. We begin by extracting nineteen linguistically motivated features belonging to various types (described at the outset) from the Brown corpus and then perform feature selection experiments using the recursive feature elimination with cross-validation (RFECV) algorithm (Guyon et al., 2002). Interestingly, we find that a classifier containing just two simple ratio features, viz., the ratio of the number of adverbs to adjectives and of the number of adjectives to pronouns, performs as well as a classifier containing an exhaustive set of features from the prior work described above [96.31% and 100% classification accuracy for the Brown (Francis and Kučera, 1989) and British National corpus (BNC Baby, 2005), respectively]. To the best of our knowledge, this is the best accuracy reported in the literature so far. Essentially, we find that texts from the fiction genre tend to have a higher ratio of adverbs to adjectives, and texts from the non-fiction genre tend to have a higher ratio of adjectives to pronouns. We discuss the implications of this finding for style guides for non-fiction writing (Zinsser, 2006) as well as standard advice proffered to creative writers (King, 2001).
In Section 2, we share details about our linguistic features design, data set and experimental methodology. Section 3 presents the experiments conducted as a part of our study and discusses their critical findings. Finally, Section 4 summarizes the conclusions of the study and discusses the implications of our findings.

Data and Methods
For our experiments, we use the Brown Corpus (Francis and Kučera, 1989), one of the earliest collections of annotated texts of present-day American English, available free of cost with the NLTK toolkit (Loper and Bird, 2002). The nature of the distribution of texts in the Brown corpus makes it convenient for our studies. The Brown corpus consists of 500 text samples distributed among 15 categories/genres, which are further divided into two major classes, namely, informative prose and imaginative prose. As per our proposed definition in this study, we associate informative prose with the non-fiction genre and imaginative prose with the fiction genre. We conduct a binary classification task to separate text samples into these two genres (i.e., fiction and non-fiction) using our proposed linguistic features. Out of the 15 genres, we excluded the 5 genres of humour, editorial, lore, religion and letters from our dataset, as it is difficult to accurately associate them with either the fiction or the non-fiction genre. Finally, the fiction category consists of 5 subcategories, namely: fiction, mystery, romance, adventure, and science fiction. Similarly, the non-fiction category includes 5 subcategories, namely: news, hobbies, government, reviews, and learned. This leads us to use 324 samples out of the 500 articles in the Brown corpus, of which 207 samples fall under the fiction category and 117 under non-fiction. Despite having fewer samples, the non-fiction category has a higher total word count (479,708 words) than fiction (284,429 words), as well as a higher total number of sentences (21,333 vs. 18,151). Hence, we chose to divide the data by subcategories rather than balancing the number of samples or words. Table 1 provides more details regarding the documents in these genres.
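The category-to-genre mapping described above can be sketched as follows. The category identifiers here follow NLTK's naming for the Brown corpus (e.g. science_fiction); the mapping itself reflects our selection of subcategories:

```python
# Brown corpus categories retained for each genre (NLTK identifiers);
# humor, editorial, lore, religion and letters are excluded.
FICTION = {"fiction", "mystery", "romance", "adventure", "science_fiction"}
NON_FICTION = {"news", "hobbies", "government", "reviews", "learned"}

def genre_label(category):
    """Map a Brown corpus category to our binary genre label."""
    if category in FICTION:
        return "fiction"
    if category in NON_FICTION:
        return "non-fiction"
    return None  # excluded category
```

With NLTK one would then iterate over `brown.fileids(categories=cat)` for each retained category to collect the 324 samples.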
To further our understanding of the model's classification performance on the Brown corpus and to investigate its applicability to British English, we use the British National Corpus (BNC Baby, 2005). This helps us examine the model's predictions more robustly. Baby BNC consists of four categories, namely, fiction, newspaper, spoken and academic. Due to the clear demarcation between these categories, we use only fiction documents (25 samples), labeled as fiction, and academic documents (30 samples), labeled as non-fiction, for our experiments. Finally, we apply our algorithm to the articles in the news category (97 samples) to check whether they fall under the fiction or the non-fiction genre.
Keeping in mind the binary nature of our classification task, we use logistic regression (LR) as our classifier (McCullagh and Nelder, 1989). Among many classification algorithms, LR is one of the most informative. By informative, we mean that it not only gives a measure of the relevance of a feature (its coefficient value) but also the nature of its association with the outcome (negative or positive). It models the binary dependent variable using a linear combination of one or more predictor values (features) with the help of the following equations, where φ is the estimated response probability:

z_i = β_0 + β · x_i (1)

φ(z_i) = 1 / (1 + e^(−z_i)) (2)

where x_i is the feature vector for text i, β is the estimated weight vector, and β_0 is the intercept of the linear regression equation.
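The two equations amount to a sigmoid applied to a weighted sum of the features. A minimal sketch, with purely illustrative weights for a hypothetical two-feature input (the intercept below is made up; it is not reported in this paper):

```python
import math

def predict_proba(x, beta, beta0):
    """Estimated response probability (phi) for feature vector x, Eqs. (1)-(2)."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))  # Eq. (1)
    return 1.0 / (1.0 + math.exp(-z))                  # Eq. (2)

# Illustrative call with two ratio features and made-up weights/intercept.
p = predict_proba([1.0, 0.5], [2.73, -2.90], 0.0)
```

A document is then assigned to the positive class whenever the returned probability exceeds 0.5.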

Experiments and Results
This section describes our experiments aimed at classifying texts into the fiction and non-fiction genres using machine learning. The next subsection describes in detail the various linguistic features we deploy and the use of feature selection to identify the most useful features. Subsequently, Section 3.2 provides the results of our classification experiments.

Linguistic Features and Feature Selection
We compute different low-level and high-level features as discussed in Section 1 and then take their ratios as the relative representative metric for the classification task. Table 2 depicts the features used in this work. Some of the ratio features, such as the average token/type (punctuation) ratio and the hyphen/exclamation ratio, have been explored in earlier work (Kessler et al., 1997). For calculating high-level ratio features, we use tags from two kinds of POS tagsets, i.e., gold standard tags provided as part of the Brown Corpus (87 tags) and automatic tags (based on the 36-tag Penn Treebank tagset) predicted by the Stanford tagger 1 (Toutanova et al., 2003). Grammatical categories like noun, verb, and adjective are inferred from the POS tags using the schema given in Table 3. We consider both personal pronouns and wh-pronouns as part of the pronoun category. We use the recursive feature elimination with cross-validation (RFECV) method to eliminate non-significant features. Recursive feature elimination (Guyon et al., 2002) follows a greedy search algorithm to select the best performing features. It forms models iteratively with different combinations of features and removes the worst performing features at each step, thus yielding the best performing set of features. The motivation behind these experiments is not only to obtain a good accuracy score but also to decipher the importance of these features and to understand their impact on writing. After applying RFECV to the automatically tagged Brown Corpus, we get all features as the optimum set of features. We attribute this result to the POS-tagging errors introduced by the Stanford tagger. So we apply our feature selection method to features extracted from the Brown Corpus with gold standard tags. Here, 13 of the 19 features are marked as non-significant, and we obtain the six most significant features (shown in bold in Table 2).
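The RFECV step can be reproduced with scikit-learn's RFECV, which our experiments rely on. The sketch below runs it on synthetic data standing in for the 19 extracted features; the dataset and parameter choices are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for 19 document-level features, only a few informative.
X, y = make_classification(n_samples=200, n_features=19, n_informative=3,
                           n_redundant=0, random_state=0)

# Recursively drop the weakest feature, scoring each subset by cross-validation.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
selected = selector.support_  # boolean mask over the 19 features
```

In our setting, X would hold one row per Brown corpus document and one column per feature from Table 2.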
Subsequently, we extract these six features from the automatically tagged Brown Corpus, and feature selection on this set reveals only two of these features as being the most significant (underlined in Table 2). The two most notable features which emerge from our second feature selection experiment are the adverb/adjective ratio and the adjective/pronoun ratio. The noun/pronoun ratio feature gets eliminated in the process. Figure 1 illustrates how both these ratios provide distinct clusters of data points belonging to the fiction and non-fiction genres (and even their subgenres). Thus, the Brown corpus tagset, which encodes finer distinctions in grammatical categories than the Penn Treebank tagset, does help in isolating a set of six significant ratio features. These features are useful for identifying the final two POS ratios based on automatic tags.
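The two surviving features are simple to compute from an automatically tagged document. A sketch assuming Penn Treebank tags (the tag-to-category grouping here approximates the schema of Table 3, which we do not reproduce; personal and wh-pronouns both count toward the pronoun category, as stated above):

```python
def pos_ratio_features(tags):
    """Compute the two selected ratio features from a sequence of
    Penn Treebank POS tags emitted by an automatic tagger."""
    adjectives = sum(1 for t in tags if t.startswith("JJ"))  # JJ, JJR, JJS
    adverbs = sum(1 for t in tags if t.startswith("RB"))     # RB, RBR, RBS
    pronouns = sum(1 for t in tags if t in {"PRP", "PRP$", "WP", "WP$"})
    return {
        "adverb_adjective_ratio": adverbs / adjectives if adjectives else 0.0,
        "adjective_pronoun_ratio": adjectives / pronouns if pronouns else 0.0,
    }
```

The zero-denominator fallback is our own choice for robustness on very short texts; full-length documents always contain all three categories.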

Classification Experiments
As described in the previous section, we apply logistic regression to individual files of two data sets (Brown Corpus and Baby British National Corpus) after extracting various low-level features and features encoding ratios of POS tags based on automatic tags emitted by the Stanford tagger (see Table 2). We train a logistic regression classifier with ten-fold cross-validation and L1 regularization, and report the accuracy achieved over the total number of files in our test sets. We use the Scikit-learn 2 (Pedregosa et al., 2011) library for our classification experiments. We do not report the individual performance of non-significant features. We report results for three data sets, each tagged using the Stanford POS tagger. For the first two data sets, we calculate testing accuracy for ten different combinations of training and testing sets, and report the mean accuracy with standard deviation, along with the most frequent class baseline accuracy. For the third data set, only one training and testing split exists, and we therefore report the testing accuracy and the most frequent class baseline accuracy accordingly. The most frequent class baseline is the percentage accuracy obtained if a model labels all the data points as the most frequent class in the data (non-fiction in our study). Table 4 illustrates our results. We also use another performance metric, accuracy gain, which is a more rigorous and interpretable measure than standard accuracy. The accuracy gain percentage is calculated as:

Accuracy Gain % = (acc − baseline) / (100 − baseline) × 100 (3)

where 'acc' is the reported mean accuracy of the model and 'baseline' is the mean of the most frequent class baseline.
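Equation (3) is straightforward to compute. For instance, a model at 75% accuracy against a 50% baseline closes half of the remaining gap, for an accuracy gain of 50%:

```python
def accuracy_gain(acc, baseline):
    """Accuracy gain % over the most-frequent-class baseline, Eq. (3)."""
    return (acc - baseline) / (100.0 - baseline) * 100.0

gain = accuracy_gain(75.0, 50.0)  # half the remaining gap closed
```

Unlike raw accuracy, this measure is 0% for a model that merely matches the majority-class baseline and 100% for a perfect model, regardless of class imbalance.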
We begin with the Brown Corpus and take 117 sample texts from the non-fiction and 207 from the fiction categories. Our training set consists of 60% of the total sample size, whereas the testing set comprises the remaining 40% of samples. We have four combinations of feature sets (refer to Row 1 of Table 4). It can be noted that the two-feature model performed better than the models corresponding to the six features and the low-level ratio features, and performs as well as the 19-feature model. To make the evaluation more robust, we follow the same approach for the combination of the Brown corpus and Baby BNC, with 147 sample texts from the non-fiction and 232 from the fiction categories. Baby BNC has been included to check the impact of British English on the performance of the model. One may observe that the model performed even better when exposed to Baby BNC. Similar observations can be made about the accuracy of the two-feature model (refer to Row 2 of Table 4). In our final experiment, we use the Brown corpus for training and the Baby BNC for testing with the available set of features. In this case, the features obtained after feature selection on the exhaustive feature set result in 100% classification accuracy (Row 3 of Table 4). This result also suggests that the ratio features apply broadly and that high-level POS ratios are not affected by bias due to the language variety (i.e., British vs. American English). However, the low performance of the 19-feature model (53% classification accuracy) shows how larger feature sets are prone to overfitting.
The two most significant features, the adverb/adjective ratio and the adjective/pronoun ratio, have regression coefficients of 2.73 and -2.90, respectively. Thus, fiction documents tend to have higher values for the ratio of the number of adverbs to adjectives and lower values for the ratio of the number of adjectives to pronouns. It is worth noting that the high accuracy scores of more than 95% obtained using 19 features on the first two datasets are in the vicinity of the accuracy scores given by only these two features. Also, the fact that the F1 scores are close to the accuracy values indicates that the results are robust.
Finally, in order to check the dominant tendencies in the behaviour of classifiers containing different feature sets, we examine the predictions of various classifiers using a separate test set consisting of the 97 news documents in the Baby BNC corpus. We also study model predictions using different training sets. Initially, we use the same data sets mentioned in the last two rows of Table 5. It can be observed that most of the samples are classified as non-fiction, as expected. Also, removing news articles from the Brown corpus non-fiction category does not impact the results, indicating the unbiased behavior of the model. However, an important conclusion one can draw from the results in Table 5 is that both the two-feature and the six-feature models are quite stable compared to their 19-feature counterpart. Even the introduction of news samples from Baby BNC into the training data does not seem to help the predictions of the 19-feature model. This shows the vulnerability of more complex models to slight changes in the training data.

Discussion and Conclusion
In this paper, we have identified two important features that can be very helpful in classifying the fiction and non-fiction genres with high accuracy. Fiction articles, i.e., those written with an imaginative flavor, tend to have a higher adverb/adjective ratio of POS tags, whereas non-fiction articles, i.e., those written in a matter-of-fact manner, tend to have a higher adjective/pronoun ratio. This not only helps in classification using machine learning but also provides useful linguistic insights. A glance at the percentages of each of these grammatical categories, computed over the total number of words in the dataset (Figure 2), reveals several aspects of the genres themselves. In both corpora, the trends are roughly the same. In fiction, adjectives and adverbs occur in roughly similar proportions, while non-fiction displays almost double the number of adjectives compared to adverbs. Also, the percentage of pronouns varies sharply across the two genres in both our datasets, as compared to adjectives and adverbs. Figure 3 presents a much more nuanced picture of personal pronouns in the Brown corpus. Fiction displays a greater percentage of third person masculine and feminine pronouns, as well as of the first person singular pronoun, compared to non-fiction, while both genres have comparable percentages of the first-person plural we and us. Moreover, differences in modification strategies using adverbs vs. wh-pronouns require further exploration. Even the usage of punctuation marks differs across genres (Figure 4). It is worth noting that many guides to writing both fiction (King, 2001) and non-fiction (Zinsser, 2006) advise writers to avoid the overuse of both adverbs and adjectives. In a statistical study of classic works of English literature, Blatt (2017) also points to adverb-usage patterns in the works of renowned authors.
Nobel prize-winning writer Toni Morrison's oft-cited dispreference for adverbs is analyzed quantitatively to show that on average she used 76 adverbs per 10,000 words (compared to 80 by Hemingway, with much higher numbers for the likes of Steinbeck, Rushdie, Salinger, and Wharton). The cited work discusses Morrison's point about eliminating prose like She says softly: the preceding scene would be described such that the emotion in the speech is conveyed to the reader without the explicit use of the adverb softly. In fact, Sword (2016) advocates the strategy of using expressive verbs encoding the meaning of adverbs as well, as exemplified below (adverb in bold and paraphrase verb italicized): 1. She walked painfully (dragged) toward the car.
A long line of research argues that adjectives and adverbs are strong indicators of affective language and serve as important features in text classification tasks such as automatic genre identification (Rittman et al., 2004; Rittman, 2007; Rittman and Wacholder, 2008; Cao and Fang, 2009). In this regard, Rittman and Wacholder (2008) propound that both these grammatical classes have sentimental connotations and capture human personality along with their expression of judgments. For our classifier, rather than the number of adjectives, it is the relative balance of adjectives and adverbs that determines the identity of a particular genre. A large-scale study is needed to validate whether this conclusion generalizes to the English language as a whole. Thus, prescriptions for both technical and creative writing should be based on systematic studies involving large-scale comparisons of fictional texts with other non-fiction genres. Since our classification is based on the ratios of these POS tags taken across the whole document, it is difficult to identify a few sentences which can demonstrate the role of our features (adverb/adjective and adjective/pronoun ratios) convincingly.
Qualitatively, the importance of adjectives can be comprehended with the help of an excerpt taken from a sample file of the Brown corpus (fileid cp09; adjectives in bold): "Out of the church and into his big car, it tooling over the road with him driving and the headlights sweeping the pike ahead and after he hit college, his expansiveness, the quaint little pine board tourist courts, cabins really, with a cute naked light bulb in the ceiling (unfrosted and naked as a streetlight, like the one on the corner where you used to play when you were a kid, where you watched the bats swooping in after the bugs, watching in between your bouts at hopscotch), a room complete with moths pinging the light and the few casual cockroaches cruising the walls, an insect Highway Patrol with feelers waving." After removing adjectives (identified using Brown corpus tags), we get: "Out of the church and into his car, it tooling over the road with him driving and the headlights sweeping the pike ahead and after he hit college, his expansiveness, the little pine board tourist courts, cabins really, with a light bulb in the ceiling (unfrosted and naked as a streetlight, like the one on the corner where you used to play when you were a kid, where you watched the bats swooping in after the bugs, watching in between your bouts at hopscotch), a room with moths pinging the light and the few cockroaches cruising the walls, an insect Highway Patrol with feelers waving." Although the text with adjectives removed still belongs to the fiction genre, we can clearly see the role that these words can play in enhancing the imaginative quotient of the text. However, counter-intuitively, Figure 2 shows that texts in the non-fiction genre tend to have a higher percentage of adjectives than texts in the fiction genre, although the latter have a higher percentage of adverbs.
Hence, this example reiterates the point that the role played by our salient features (the adverb/adjective and adjective/pronoun ratios) in classifying the fiction and non-fiction genres is difficult to appreciate from only a few lines of text. An interesting question is to find the minimum length of text required for accurate classification into the fiction and non-fiction genres, and to identify more significant features in this regard, which we will take up in the future. We also intend to carry out this study on a much larger dataset in order to verify the efficacy of our features.