A Multi-task Approach to Predict Likability of Books

We investigate the value of feature engineering and neural network models for predicting successful writing. Similar to previous work, we treat this as a binary classification task and explore new strategies to automatically learn representations from book contents. We evaluate our feature set on two different corpora created from Project Gutenberg books. The first presents a novel approach for generating the gold standard labels for the task and the other is based on prior research. Using a combination of hand-crafted and recurrent neural network learned representations in a multitask learning setting, we obtain the best performance of 73.50% weighted F1-score.


Introduction
Every year millions of new books are published, but only a few of them turn into commercial successes, and even fewer achieve critical praise in the form of prestigious awards or meaningful sales. Editors have the difficult task of making the go/no-go decision for all manuscripts they receive, and the revenue for their publishing house depends on the accuracy of that judgment. The website www.litrejections.com documents some of the biggest mistakes in the history of the publishing industry, including Agatha Christie, J.K. Rowling, and Dr. Seuss, all of whom received many rejection letters before landing their first publishing deal.
Many factors contribute to the eventual success of a given book. Internal factors such as plot, story line, and character development all have a role in the likability of a book. External factors such as author reputation and marketing strategy are arguably equally relevant. Some factors might even be out of the control of an author or publishing house, such as the current trends, the competition from books released simultaneously, and the historical and contextual factors inherent to society.
Previous work by Ganjigunte Ashok et al. (2013) demonstrated relevant results using stylistic features to predict the success of books. Their definition of success was a function of the number of downloads from Project Gutenberg. However, downloading a book is not by itself an indicator of a highly liked or a commercially successful book. We instead propose to use the rating from reviewers collected from Goodreads as a measure of success. We also propose features and deep learning techniques that have not been used before on this problem, and validate their usefulness in two different tasks: success prediction and genre classification. Our key contributions are the following:
• We provide a new benchmark dataset for predicting successful books in a more realistic class distribution. This data set is available to the community from this link 1 .
• We show that sentiment analysis using SenticNet sentics is an accurate way to model emotion in books.
• We provide the first results on using recurrent neural networks (RNN) to discover book content representations that are useful for classification tasks such as success prediction and genre detection.
• We show that the multitask approach, simultaneously evaluating success and genre prediction, benefits from its constituent tasks to obtain better performance than the single success prediction task approach.

Previous work
Predicting the success of books is a difficult task, even for an experienced editor. Researchers have studied related tasks, for example predicting the quality of text from lexical features, syntactic features and different measures of density. Pitler and Nenkova (2008) found a strong correlation between user-perceived text quality and the likelihood measures of the vocabulary as computed by a language model, as well as the likelihood measures of discourse relations, as determined by a language model trained on discourse relations. Louis and Nenkova (2013) proposed a combination of genre-specific and readability features with topic-interest metrics for the prediction of great writing in science articles. While some of the features in this prior work were relevant to our task, our goal is different and more aligned to Ganjigunte Ashok et al. (2013), since we aim to model success in books of different genres. Ganjigunte Ashok et al. (2013) investigated the correlation between writing style and number of downloads. The authors analyzed lexical features, production rules, constituents, and sentiment features of books downloaded from Project Gutenberg 2 . They obtained an average accuracy of 70.38% using only unigram features with Support Vector Machines (SVM) as the classifier.
Deep learning representations have seen their share of successes in Natural Language Processing (NLP) tasks (Zheng et al., 2013; Gao et al., 2014; Glorot et al., 2011). In particular, RNN models have been successfully applied in several scenarios where temporal dependencies provide relevant information (Ian Goodfellow and Courville, 2016; LeCun et al., 2015). Kiros et al. (2015) used RNN models to learn language models from books using an unsupervised approach. Also, word embeddings and Paragraph Vector (Le and Mikolov, 2014) have been shown to achieve state-of-the-art performance in several text classification and sentiment classification tasks. These techniques are able to learn distributed vector representations that capture semantic and syntactic relationships between words. Collobert and Weston (2008) jointly trained a single Convolutional Neural Network (CNN) architecture on different NLP tasks and showed that multitask learning improves generalization across the shared tasks. Other researchers (Ian Goodfellow and Courville, 2016; Søgaard and Goldberg, 2016) have also reached similar conclusions.

Dataset
We experimented with two book collections: one prepared by Ganjigunte Ashok et al. (2013) 3 and the other constructed by us to evaluate a new definition of success. We refer to the first dataset as EMNLP13 and the second dataset as Goodreads.
The EMNLP13 collection contained Project Gutenberg books from eight different genres. The authors created a balanced dataset containing 100 books per genre, resulting in a total of 800 books. We manually reviewed the dataset and found missing or irrelevant content in 58 books: a total of 53 books contained Project Gutenberg license information repeated verbatim, and five books contained only the audio recording certificate in place of the actual book content. We removed the license-related text, since lexical features might be erroneously biased, and replaced the five files with the actual content of the books. Except for these corrections, the data we used is the same as that presented in Ganjigunte Ashok et al. (2013).
We also identified some odd adjudications. For example, 'The Prince And The Pauper' is a popular book by Mark Twain that was adapted into various films and stage plays. Also, 'The Adventures of Captain Horn' was the third best selling book of 1895 (Hackett, 1967). Both these books are labeled as unsuccessful due to their low download counts. We suspect as well that some of the counts are inflated by college students doing English or Literature assignments that may not be directly related to the potential commercial success of a book.
To address these concerns, we propose a new approach to creating gold labels for successful books based on public reviews rather than download counts. We collected a new set of Project Gutenberg books for this benchmarking. We mapped the books to their review pages on Goodreads 4 , a website where book lovers can search, review, and rate books. We consider only those books that have been rated by at least 10 people. We use the average star rating and total number of reviews for labeling each book. We then set an average rating of 3.5 as the threshold for success, such that books with average rating < 3.5 are classified as Unsuccessful. Table 1 shows the data distribution of our books. To our knowledge, we have one of the largest collections of books, as researchers generally work with a small number of books (Coll Ardanuy and Sporleder, 2014; Goyal et al., 2010; van Cranenburgh and Koolen, 2015).
Success Definitions Comparison: After compiling and labeling both datasets, we compared the two definitions of success. To do this, we downloaded the Project Gutenberg download counts for the books in the Goodreads dataset and labeled them using the Ganjigunte Ashok et al. (2013) definition of success. Since they only considered books in the extremes of download counts, we could only label 399 books in the Goodreads dataset using their definition. We found that 142 books had different labels according to the two definitions. Of these mismatched books, 19.7% were labeled as unsuccessful under the download-based definition despite having ratings ≥ 3.5 and being reviewed by more than 100 reviewers. Table 2 details the discrepancies between the two definitions.
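The rating-based labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Book` record and its field names are hypothetical, and the sketch assumes the minimum-ratings filter is applied before the 3.5 threshold.

```python
from dataclasses import dataclass
from typing import Optional

MIN_RATINGS = 10         # from the text: books rated by at least 10 people
SUCCESS_THRESHOLD = 3.5  # from the text: average rating < 3.5 is Unsuccessful


@dataclass
class Book:
    # Hypothetical record; field names are illustrative only.
    title: str
    avg_rating: float
    num_ratings: int


def label_book(book: Book) -> Optional[str]:
    """Return the gold label, or None if the book has too few ratings."""
    if book.num_ratings < MIN_RATINGS:
        return None  # excluded from the dataset
    if book.avg_rating >= SUCCESS_THRESHOLD:
        return "Successful"
    return "Unsuccessful"
```

A book with an average rating of exactly 3.5 is treated as Successful here, since the text only assigns Unsuccessful to ratings strictly below the threshold.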

Methodology
We investigated a wide range of textual features in an attempt to capture the topic, sentiment, writing style, and readability for each book. This set included both new and previously used features. We also explored techniques for automatically learning representations from text using neural networks, which have been shown to be successful in various text classification tasks (Kiros et al., 2015;LeCun et al., 2015). These techniques include word embeddings, document embeddings, and recurrent neural networks.

Hand-crafted text features
Lexical: We used skip-grams, char n-grams, and typed char n-grams (Sapkota et al., 2015) with term frequency-inverse document frequency (TF-IDF) as the weighting scheme. Sapkota et al. (2015) showed that classical character n-grams lose some information by merging instances of n-grams like 'the', which could be a prefix (thesis), a suffix (breathe), or a standalone word (the). They separated character n-grams into ten categories representing grammatical classes, like affixes, and stylistic classes, like beg-punct and mid-punct, which reflect the position of punctuation marks in the n-gram. The purpose of these features is to correlate success with an author's word choice.
Constituents: We computed the normalized counts of 'SBAR', 'SQ', 'SBARQ', 'SINV', and 'S' syntactic tag sets from the parse tree of each sentence in each book, following the method of Ganjigunte Ashok et al. (2013) to determine the syntactic style of the authors.
Sentiment: We computed sentence neutrality, positivity, and negativity using SentiWordNet (Baccianella et al., 2010), along with the counts of nouns, verbs, adverbs, and adjectives. We averaged these scores for every 50 consecutive sentences in order to evaluate change in sentiment throughout the course of each book, because we anticipate emotions like suspense, anger, and happiness to contribute to the success of the book.
SenticNet Concepts: We extracted sentiment concepts from the books using the Sentic Concept Parser 5 . The parser chunks a sentence into noun and verb clauses, and extracts concepts from them using part-of-speech (POS) bigram rules. We modeled these as binary bag-of-concepts (BoC) features. We also extracted average polarity, sensitivity, attention, pleasantness, and aptitude scores for the concepts defined in the SenticNet-3.0 knowledge base, which contains semantics and sentics associated with 30,000 common-sense concepts (Cambria and Hussain, 2015).
Writing density: We computed the number of words, characters, uppercase words, exclamations, question marks, as well as the average word length, sentence length, words per sentence, and lexical diversity of each book, with the expectation that successful and unsuccessful writings will have dissimilar distributions of these density metrics.
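As a rough illustration of the writing density features, the sketch below computes them for a raw text string. The tokenization and sentence-splitting rules (simple regular expressions) and the definition of lexical diversity as the ratio of distinct lowercase words to total words are assumptions made for this example; the text does not specify them.

```python
import re


def writing_density(text: str) -> dict:
    """Compute simple density statistics for one book (illustrative only)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    return {
        "num_words": n_words,
        "num_chars": len(text),
        "num_uppercase_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "num_exclamations": text.count("!"),
        "num_question_marks": text.count("?"),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "lexical_diversity": len({w.lower() for w in words}) / n_words if n_words else 0.0,
    }
```

Single-letter words such as "I" are deliberately not counted as uppercase words, so that ordinary first-person narration does not inflate that count.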

Neural network learned representations
Representation learning techniques are able to learn a set of features automatically from the raw data. Our hypothesis is that the learned representation can capture the complex factors that influence the success of a book.
Word embeddings with Book2Vec: In contrast with Word2Vec, which learns a representation for individual words, Doc2Vec learns a representation for text fragments or even for full documents. We trained the Doc2Vec module of the Gensim (Řehůřek and Sojka, 2010) Python library on all the books in the Goodreads dataset to obtain a 500-dimensional dense vector representation for each book. Using Doc2Vec, we first trained a distributed memory (DM) model with two approaches: concatenation of context vectors (DMC) and sum of context word vectors (DMM). Then we trained a distributed bag of words (DBoW) model and combined it with the DMC and the DMM for a total of five different models. We set the number of iterations to 50 epochs and shuffled the training data in each pass. We called these book vectors Book2Vec. Furthermore, we created two 300-dimensional vector representations for each book by averaging the vectors of each word in the book using pre-trained Word2Vec vectors from the Google News dataset 6 and our own Word2Vec trained with ∼350M words from 5,000 random books crawled from Project Gutenberg.
Multitask RNN method: When dealing with variable-length data such as time series or plain text, traditional approaches like feed-forward neural networks are not easily adapted to model sequential data, since they expect fixed-size input. RNNs are designed for sequential input, but one of their limitations is that they have problems dealing with long sequences (Pascanu et al., 2013). We propose a strategy to represent large documents, such as books, with an aggregated representation. Figure 1 depicts the proposed multitask method. The overall strategy uses an RNN to learn a model of sequences of sentences. Each sentence is represented by the average of the Word2Vec representations of its constituent words. The RNN is composed of 2 hidden layers with 32 hidden gated recurrent units (GRU) each, and the output is a softmax layer. We train the RNN in a supervised fashion using the success categorization and the book genre as labels. The RNN serves as a feature extractor, and the last hidden state for each sequence acts as its representation. At training time, all sentences from one book are extracted and divided into chunks of 128 sentences. The book's success/genre labels are assigned to each sequence. To make the book label assignment at testing time, we average the predictions of all sequences extracted from each book. Using 128 sentences has a threefold motivation: (1) mitigate the vanishing gradient problem (Pascanu et al., 2013), (2) obtain more examples from one book, and (3) be a power of 2 to efficiently use the GPU.
An interesting property of neural networks is that the same learning approach, i.e., stochastic gradient descent, still holds for more complex architectures as long as the objective cost function is differentiable. We take advantage of this property to build a unified neural network that addresses both genre and success prediction using a single model. These kinds of multitask architectures are also useful as regularizers (Ian Goodfellow and Courville, 2016). In particular, our cost function J(X, Y) is defined over the following quantities: x_i represents the i-th sample, and y_succ and y_gen are the success and genre labels, respectively. The rnn(·) function represents the forward propagation over the recurrent neural network, and h represents the last hidden state. ŷ_succ and ŷ_gen represent the predictions for the two labels. Notice that both of them are computed using the same unified representation h. z_succ and z_gen represent two different linear transformations over h that map to the number of classes.
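The chunking, the two output heads over a shared representation, and the book-level averaging described above can be sketched in plain Python. The sketch assumes that J(X, Y) is an unweighted sum over samples of the two per-task cross-entropy losses; the loss form, the equal weighting, the handling of a final short chunk, and all function names are assumptions, not details given in the text. The last hidden state h is passed in directly rather than computed by a GRU.

```python
import math
from typing import List, Sequence, Tuple

Head = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def linear(h: Sequence[float], head: Head) -> List[float]:
    """One linear transformation z(h) mapping h to class scores."""
    weights, biases = head
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]


def multitask_loss(h: Sequence[float], y_succ: int, y_gen: int,
                   head_succ: Head, head_gen: Head) -> float:
    """Assumed per-sample objective: sum of the two cross-entropy losses.

    Both predictions are computed from the same shared representation h.
    """
    y_hat_succ = softmax(linear(h, head_succ))
    y_hat_gen = softmax(linear(h, head_gen))
    return -math.log(y_hat_succ[y_succ]) - math.log(y_hat_gen[y_gen])


def chunk_sentences(sentences: Sequence, size: int = 128) -> List[Sequence]:
    """Split a book into consecutive chunks of `size` sentences.

    Assumption: a final shorter chunk is kept as is.
    """
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def average_predictions(chunk_probs: Sequence[Sequence[float]]) -> List[float]:
    """Book-level prediction: mean of the per-chunk class probabilities."""
    n = len(chunk_probs)
    return [sum(p[k] for p in chunk_probs) / n for k in range(len(chunk_probs[0]))]
```

Because both heads read the same h, the gradient of either loss term flows into the shared recurrent layers, which is what lets one task act as a regularizer for the other.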

Experiments on Goodreads dataset
We merged books from different genres, and then randomly divided the data into a 70:30 training/test ratio, while maintaining the distribution of Successful and Unsuccessful classes per genre. As a preprocessing step we converted all words to lowercase and removed infrequent tokens having document frequency ≤ 2. For our tagging and parsing needs, we used the Stanford parser (Socher et al., 2013). We then trained a LibLinear Support Vector Machine (SVM) 7 classifier with L2 regularization using the hand-crafted features described in Section 4. We tuned the C parameter on the training set with 3-fold grid search cross-validation over different values of 1e{-4,...,4}.
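A split that preserves the Successful/Unsuccessful distribution within each genre can be sketched as follows. The tuple layout `(book_id, genre, label)`, the function name, and the rounding rule are assumptions for illustration. `C_GRID` reads "1e{-4,...,4}" as the nine powers of ten from 1e-4 to 1e4, which is also an assumption about the exact grid.

```python
import random
from collections import defaultdict

# Assumed reading of the C grid: integer exponents from -4 to 4.
C_GRID = [10.0 ** k for k in range(-4, 5)]


def stratified_split(books, test_fraction=0.3, seed=0):
    """Split (book_id, genre, label) tuples into train and test lists.

    Each (genre, label) group is shuffled and split separately, so the
    class distribution per genre is roughly the same in both parts.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for book in books:
        groups[(book[1], book[2])].append(book)

    train, test = [], []
    for key in sorted(groups):  # sorted for reproducibility
        members = list(groups[key])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```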
With the features used by Ganjigunte Ashok et al. (2013), we obtained the highest weighted F1-score of 0.659 with word bigram features. We set this value as our baseline. In order to study the effect of the multitask approach, we devised analogous experiments to our proposed multitask RNN method and predicted both genre and success together for the features described in Section 4. Hence we have two settings for the classification experiments, Single task (ST) and Multitask (MT).
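For reference, the weighted F1-score used as the evaluation metric can be computed as below, assuming the usual definition in which the per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name is illustrative.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1  # weight by class support
    return score
```

Weighting by support matters on the Goodreads data because the Successful and Unsuccessful classes are not balanced, so an unweighted average would overstate the influence of the smaller class.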
Since we had average rating information, we also modeled the problem as a regression problem and predicted the average rating using only the content of the books. Our work differs from other researchers in this aspect, as most of them (Lei et al., 2016;Li et al., 2011;Mudambi et al., 2014) use review content instead of the actual book content to predict the average rating. We used the Elastic Net regression algorithm with l1 ratio tuned over range {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99} with 3-fold grid search cross-validation of the training data.
Parameter tuning for RNN: We trained 25 models with randomly sampled hyper-parameters for the learning rate, weight initialization ranges, and regularization parameters, and chose the model with the best validation performance. Random search of this kind is preferable to grid search when training deep models (Bergstra and Bengio, 2012). We used the ADAM algorithm (Kingma and Ba, 2014) to update the model parameters. Since these models are prone to overfitting because of the high number of parameters, we applied gradient clipping, max-norm weight constraints, early stopping, and dropout as regularization strategies.
Table 3 shows the results with our new proposed feature sets for the classification and regression tasks. In the ST setting, except for the character n-gram features, all proposed hand-crafted features individually had a weighted F1-score less than the word bigram baseline. On the other hand, the neural network methods obtained better results than the baseline. We obtained the highest weighted F1-scores of 0.695 and 0.731 with the Book2Vec method in the ST and MT settings, respectively. The results show that the MT approach is better than the ST approach. The genre prediction task likely acted as a regularizer for the success prediction task. Also, we found that modeling the entire book as a vector, rather than modeling it as the average of word vectors, gave better performance. Although the ST Book2Vec performs better than the MT RNN method, the difference is very small. We performed McNemar's test on these methods and found that the difference was not statistically significant (p = 0.5). The MT RNN method had the lowest mean squared error (MSE) for the regression task, at 0.125. The character n-gram proved to be one of the most important hand-crafted features, whereas the clausal feature was the least important one. Individually, writing density and readability features seemed to be weak features. We assumed that the sentiment changes in books would be an important characteristic for the task. However, the results in Table 3 show an unimpressive F1-score of 0.610 for sentiment features. On the other hand, the bag of sentic concepts model with average scores for sensitivity, attention, pleasantness, aptitude, and polarity gave an F1-score of 0.670, above the word bigram baseline of 0.659. This result points to the relevance of performing a more nuanced sentiment analysis beyond lexical statistics for this task.
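The McNemar's test used above to compare two classifiers on the same test set can be sketched as follows. The text does not say which variant was used; this sketch assumes the exact two-sided binomial form, which depends only on the books that one classifier labels correctly and the other does not. The function name is illustrative.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on one test set."""
    # b: A correct, B wrong; c: A wrong, B correct (the discordant pairs).
    b = sum(1 for t, a, c_ in zip(y_true, pred_a, pred_b) if a == t and c_ != t)
    c = sum(1 for t, a, c_ in zip(y_true, pred_a, pred_b) if a != t and c_ == t)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(b, c)
    # Under the null hypothesis, discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A large p-value, such as the 0.5 reported here, means the discordant pairs are split too evenly to conclude that one method is better than the other.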

Results on Goodreads dataset
Our next set of experiments included the combinations of hand-crafted and neural network representations. Some of the best combination results are shown in Table 4. Out of the different possible feature combinations, we obtained the highest weighted F1 score of 0.735 by combining handcrafted and learned representations in the MT setting. We observed that combining low performing hand-crafted features like readability, syntactic clauses, and skip grams with neural representation boosted their performance. Likewise for the regression task, the MT RNN representation proved to be a better choice, as its combination with other features generally lowered the MSE. The best combinations for the regression task lowered the MSE to 0.123. Deep learning and hand-crafted methods may capture complementary sources of information, which upon combination boost performance.

Results on EMNLP13 dataset
We tried to reproduce the results reported in Ganjigunte Ashok et al. (2013) by re-implementing their system. Unlike our setup, they performed experiments on individual genres and reported average accuracy across all genres. Our results were similar but not as close as we expected, even after extensive experimentation and an extended search for parameter optimization. For most of their features we obtained a lower accuracy 8 . The differences may be due to a combination of the curating process we described in Section 3 that corrected content in the books used, as well as the different set of parameter values we explored for tuning the classifier. As pointed out by Fokkens et al. (2013), even seemingly small differences in preprocessing can prevent reproducibility. Hence, we take the best accuracy of our re-implementation (71.25%) as the baseline for this dataset. Table 5 shows the results from some of our best feature sets. The features that worked best for the Goodreads data also worked best for the EMNLP13 data. Notably, with the combination of the sentic concepts and scores, typed n-grams, and writing density, we obtained an average accuracy of 73.00%, higher than the baseline score of 71.25% for this dataset.
The RNN performance was very low in comparison with the hand-crafted features. We attribute this behavior to the small size of this particular training dataset and to the evaluation setup. Notice that Ganjigunte Ashok et al. (2013) experimented per genre, i.e., trained a single classifier per genre. Thus, in a 5-fold approach we only have 80 samples to train and 20 to test. Additionally, we must take out some samples from the training data for validation. It has been empirically shown that one of the key elements in the success of representation learning strategies is a large amount of data, on the order of tens of thousands of samples at least. Moreover, in the EMNLP13 dataset, it is not possible to take advantage of the multitask approach because there is only one target genre in each experiment.
8 There was a maximum 4% difference for some features.
Table 6: Discriminative Features (entries recoverable from the extracted table: mr., mrs., john, thou, amor, pen, his, and, the, ing, n's, ed, say, she, plus several character n-grams containing quotation marks)
Table 6 lists some of the features that were highly weighted by the classifier. For the sentic concepts, salient features included important adjectives, verbs and relations; all objects that might trigger a crucial event. Similarly, for the character n-gram features, honorific titles, stop words, common word endings, and especially n-grams with quotation marks were highly weighted. Quotation marks indicate the exchange of dialogues between characters. This suggests that dialogue is an important aspect of novels. Word n-gram features also support this suggestion. Features like s/he said, said Person Name were also highly weighted. Moreover, pronouns and titles related to male gender also had high weights. Features like i was, . i, i am also had high weights. This might be an indication that books with first person narration tend to be more successful. Another interesting observation was that the number of question marks in a book was also consistently positively correlated with success. This might suggest that readers enjoy books consisting of dialogue or interaction between the characters. We also calculated the maximal information coefficient (MIC) and correlation coefficient (CC) for the writing density as well as the readability features against the average rating. Generally, readers prefer books with high writing density (0.19 MIC, 0.25 CC) and somewhat complex writing (0.17 MIC, 0.21 CC).
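The correlation coefficient between a feature and the average rating can be computed as in the sketch below. The text does not name the type of coefficient; this example assumes Pearson's linear correlation. MIC is not covered here, since it requires a dedicated estimator.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between a feature column and the average ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (norm_x * norm_y)
```

Values such as 0.25 and 0.21 indicate a weak positive linear relationship, which is consistent with these features being weak predictors on their own.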

Analysis of learned representations
In order to investigate deep vectors, we projected them onto 2-dimensional space using t-SNE. Figure 2 suggests that the vectors successfully capture genre-related concepts, as books from the same genre are close to each other in the 2D space. We then performed an 8-way genre classification experiment using a random stratified division of the data into a 70:30 training/test ratio. We obtained an accuracy of 62.50% and F1-score of 69.30% for the EMNLP13 and Goodreads datasets, respectively. These scores were well above the random baseline of 12.50% accuracy and 15.23% F1-score for the EMNLP13 and Goodreads datasets, respectively. We further found that Poetry and Drama were the most accurately classified genres, whereas Fiction was the most difficult to classify.
In order to further investigate the representations learned by RNN for successful and unsuccessful books, we plotted the 2D t-SNE projection of the book representations. Figure 3 shows the projection of vectors for the Short stories genre. The visualization shows that the RNN is able to cluster the book vectors into two separate regions. Furthermore, to investigate what else the RNN might be learning, we plotted some books by the same authors. Figure 3 also shows books from authors Jack London and Alan E. Nourse. The four books by Jack London and the two books by Alan E. Nourse are very close to each other. We thus infer that along with learning peculiarities of successful and unsuccessful classes, the RNN was able to capture features related to the style of authors.

How much content is needed for success prediction?
Humans are good at detecting poor writing after reading just a few pages. We wanted to investigate if it is the same for machines. We devised stratified 3-fold cross-validation exploratory experiments on training data by gradually increasing the content of the books in the training fold. The results are shown in Figure 4. We see that the cross-validation score gradually increases until we reach 200 sentences, after which it plateaus. Hence, we conclude that about 200 sentences is the minimum amount of content the classifier needs.

Conclusions
In this paper we propose new features for predicting the success of books. We used two main feature categories: hand-crafted and RNN-learned features. Hand-crafted features included typed character n-grams and sentic concepts. For the learned features we proposed two different strategies based on neural networks. The first extends Word2Vec-type representations to work in large documents such as books, and the second one uses an RNN to capture sequential patterns in large texts. We evaluated our methods on our Goodreads dataset, whose classes are not based on download counts, but rather are a function of average star ratings and number of reviewers. Our results outperform state-of-the-art methods. We conclude that neither deep learning nor hand-crafted features consistently outperform the other; both methods capture complementary information, which upon combination gives better performance. Also, the multitask setting is preferable to the single task setting, as the multitask approach helps the classifier better generalize during learning by letting constituent tasks act as regularizers. As our next steps, we plan to investigate features that capture plot-related aspects, such as character profiles and interaction through social network analysis, historical setting, and other feature-learning strategies.