A Genre-Aware Attention Model to Improve the Likability Prediction of Books

Likability prediction of books has many uses. Readers, writers, and the publishing industry can all benefit from automatic book likability prediction systems. To make reliable decisions, these systems need to assimilate information from different aspects of a book in a sensible way. We propose a novel multimodal neural architecture that incorporates genre supervision to assign weights to individual feature types. Our method dynamically tailors the weights given to feature types based on the characteristics of each book. Our architecture achieves competitive results and even outperforms the state of the art for this task.


Introduction
Book likability prediction is an important but challenging task. It can be a valuable resource for supporting buying decisions: choosing a book can be daunting for readers, considering the overwhelming number of books being published. Being able to predict how a book will fare in the market also has economic value for the publishing industry, which can use such predictions to increase revenue. The current process is guided by humans, which makes it error-prone, highly subjective, and non-scalable.
An alternative to the human-guided process is to design a reliable automatic system that predicts the likability of books. Such a system, we argue, must be able to take into account the many aspects involved in the eventual success of a book. These include not only the topic of the book and the writing style of the author but also, in the case of creative writing, elements such as creativity, plot structure, and the flow of sentiments (Hall, 2012; Archer and Jockers, 2016). Other relevant aspects influencing readers' interest in a book could be its cover and its title. We believe that, in addition to incorporating these different aspects, it is equally important to have a robust mechanism that gives higher weight to the most relevant aspects while disregarding the noisy or redundant ones. Traditionally, this is achieved by searching through multiple feature combination experiments for an optimal combination of feature types (Yang and Pedersen, 1997; Forman, 2003). The main problem with these methods is that they are time-consuming and too rigid: the resulting feature types are fixed for every document. In some books, the style of the author may contribute more than the specific topic, whereas the reverse may be true for other books. These methods cannot dynamically assign weights to different features based on the characteristics of a particular test instance. A more flexible scheme that adjusts feature weights based on the current book is likely to lead to better results. This paper addresses this problem by introducing a novel method that automatically combines information from different aspects and learns to weight them dynamically for each book in order to improve likability prediction. Our method also extends the attention model to incorporate domain-specific information such as the genre of books.
As far as we know, we are the first to use genre supervision while computing attention weights and to apply these weights to the problem of feature importance. There are many potentially relevant aspects of books that make them likable by readers. Here we focus on different textual modalities, such as the lexical, stylistic, syntactic, and neural representations, along with the visual modality from book covers. Our main contributions in this paper are as follows:
• We propose a novel neural architecture, which incorporates genre supervision for computing attention weights to learn the importance of hand-engineered and deep learning features coming from different modalities for predicting the likability of books.
• We show through our results that an adaptive combination of features with the genre-aware attention model performs better than strong baselines and also outperforms the state of the art.
• We present visualizations that increase interpretability of our results and also demonstrate the advantages of our model.
Along with these contributions, we also show that book cover images contain sufficient information by themselves to perform likability classification, although their contribution becomes negligible in the presence of strong textual features.

Methodology
We propose a model that we call the Genre-Aware Attention model (GA), which dynamically weights features coming from different aspects of a book by using genre supervision. We first feed our textual and visual features through a non-linear layer to learn higher-level feature representations. We then use our genre-aware attention model to compute appropriate weights for these feature representations. The motivation to add genre information comes from our previous work showing that adding genre classification as an auxiliary task to success prediction improved results. Moreover, it is reasonable to expect that different genres have different sets of features that are more relevant when trying to predict whether readers will like the book. For instance, in Science Fiction, the theme may be more relevant than, say, in Drama, where the characters and their interactions or their struggles might be more relevant for likability prediction.

Features
For our features, we build on prior work that provides a comprehensive exploration of different hand-crafted features and neural representations. That work showed that a combination of writing density (WR) (distribution of words, characters, sentences, and paragraphs), Book2Vec, and recurrent neural network representations (RNN) works well for books. Similar to that work, our textual features consist of word, character, and typed character n-grams (Sapkota et al., 2015), syntactic features, sentiment and sentic concepts and scores (SCS) (Cambria et al., 2014), style-related WR and readability (R), and neural representations learned using Word2Vec (Mikolov et al., 2013), Doc2Vec, and RNN. We consider these categories of textual features as different modalities or sources, since they capture different aspects of a book and are generated by different processes. In addition to these features, we also add visual information extracted from the book covers. To extract the visual features, we rely on state-of-the-art visual feature extractors, VGG (Simonyan and Zisserman, 2014) and Resnet (He et al., 2016), initialized with the weights trained on the Imagenet dataset. The source code and data for this paper can be downloaded from https://github.com/sjmaharjan/genre_aware_attention

Genre-Aware Attention Model
Figure 1 shows the overall architecture of our Genre-Aware Attention model. Let $X$ be a collection of books. For a book $x \in X$, let $x_1, x_2, \ldots, x_n$ be the feature representations from the different textual modalities and the visual modality. Since these features have different dimensions, we first pass them through a non-linear layer to project them into a space with the same dimension using Equation 1. This allows us to perform a weighted average of features from different modalities according to their importance:

$$h_i = \mathrm{selu}(W_h x_i + b_h) \tag{1}$$

where $i$ is the index of the modality whose feature representation is fed into the network, $W_h$ is the weight matrix, $x_i$ is the input feature vector for the $i$th modality, $b_h$ is the bias, and selu (Klambauer et al., 2017) is the activation function. Not all of these feature vectors are equally important to the final representation and, in turn, to the likability prediction task. We use the genre-aware attention mechanism to learn the importance of each of these features for our task and aggregate them to get the final representation. The final book representation $r$ is the weighted sum of the $h_i$ vectors:

$$r = \sum_{i=1}^{n} \alpha_i h_i \tag{2}$$

where $\alpha_i$ are the weights measuring the importance of the different modalities. The GA model uses the genre vector $g \in \mathbb{R}^{d_g}$ ($d_g$ being the dimension of the genre vector) while computing the $\alpha$ weights. The $\alpha_i$ weights are computed as follows:

$$\alpha_i = \frac{\exp(\mathrm{score}(h_i, g))}{\sum_{i'=1}^{n} \exp(\mathrm{score}(h_{i'}, g))} \tag{3}$$

and the score(.) function is defined as:

$$\mathrm{score}(h_i, g) = v^{\top}\, \mathrm{selu}(W_a h_i + W_g g + b_a) \tag{4}$$

where $W_a$ and $W_g$ are the weight matrices, $b_a$ is the bias, and $v$ is the weight vector. The addition of $W_g g$ incorporates genre supervision. These parameters are shared across all modalities. This prevents the parameter explosion that is likely to occur when the number of modalities is high, which is the case for us. To further investigate the effect of the genre, we also experiment with concatenating the genre vector $g$ to the final weighted average $r$ of the vectors from different modalities to obtain $[r; g]$. The dotted line from the genre vector $g$ represents this in Figure 1. We then use a non-linear layer with sigmoid activation to project the book representation (either $r$ or the concatenation $[r; g]$) to class probabilities:

$$\hat{p} = \sigma(W_c r + b_c) \tag{5}$$

where $W_c$ is the weight matrix and $b_c$ is the bias vector. Finally, we train the network by minimizing the binary cross entropy loss using Adam (Kingma and Ba, 2015):

$$\mathcal{L} = -\sum_{i} \left[ p_i \log \hat{p}_i + (1 - p_i) \log(1 - \hat{p}_i) \right] \tag{6}$$

where $p_i$ and $\hat{p}_i$ are the true label and the prediction for the $i$th book, respectively.
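The forward pass described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the projection weights are kept per modality (the inputs have different sizes), the genre vector is taken to be one-hot, the classifier has a single output so $W_c$ reduces to a vector `w_c`, and all sizes, names, and random weights are hypothetical.

```python
import math
import random

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(values):
    """Element-wise SELU activation (Klambauer et al., 2017)."""
    return [SELU_SCALE * (v if v > 0 else SELU_ALPHA * (math.exp(v) - 1.0))
            for v in values]


def affine(weights, x, bias):
    """Return W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ga_forward(xs, g, params):
    """Genre-aware attention forward pass for a single book."""
    # Project every modality into the same d-dimensional space.
    hs = [selu(affine(W, x, b)) for x, (W, b) in zip(xs, params["proj"])]
    # score(h_i, g) = v^T selu(W_a h_i + W_g g + b_a); W_a, W_g, v are shared.
    genre_term = affine(params["W_g"], g, params["b_a"])  # W_g g + b_a
    zero = [0.0] * len(genre_term)
    scores = []
    for h in hs:
        pre = [a + c for a, c in zip(affine(params["W_a"], h, zero), genre_term)]
        scores.append(sum(vi * si for vi, si in zip(params["v"], selu(pre))))
    alphas = softmax(scores)
    # Book representation: attention-weighted sum of the projected modalities.
    r = [sum(a * h[k] for a, h in zip(alphas, hs)) for k in range(len(hs[0]))]
    logit = sum(w * rk for w, rk in zip(params["w_c"], r)) + params["b_c"]
    p_hat = 1.0 / (1.0 + math.exp(-logit))
    return alphas, r, p_hat


def binary_cross_entropy(p_true, p_hat):
    return -(p_true * math.log(p_hat) + (1.0 - p_true) * math.log(1.0 - p_hat))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: three modalities, projection size d, attention size d_a.
rng = random.Random(0)
dims, d, d_a, d_g = [5, 3, 4], 4, 6, 8
params = {
    "proj": [(rand_matrix(d, n, rng), [0.0] * d) for n in dims],
    "W_a": rand_matrix(d_a, d, rng),
    "W_g": rand_matrix(d_a, d_g, rng),
    "b_a": [0.0] * d_a,
    "v": [rng.uniform(-0.1, 0.1) for _ in range(d_a)],
    "w_c": [rng.uniform(-0.1, 0.1) for _ in range(d)],
    "b_c": 0.0,
}
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for n in dims]
g = [0.0] * d_g
g[2] = 1.0  # one-hot genre vector (an assumption of this sketch)

alphas, r, p_hat = ga_forward(xs, g, params)
loss = binary_cross_entropy(1.0, p_hat)
```

Because the attention parameters are shared, adding a modality adds only its own projection layer; the attention weights always sum to one across modalities.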

Dataset
We experiment with a dataset collected in prior work. The dataset consists of books from eight different genres: Detective Mystery, Drama, Fiction, Historical Fiction, Love Stories, Poetry, Science Fiction, and Short Stories. These books have been reviewed by at least ten reviewers. Based on the average rating received by the books on Goodreads, the books were labeled into two categories: Successful and Unsuccessful. The collection has a total of 1,003 books. However, the dataset did not include book covers, so we augmented it by downloading the covers from Goodreads. Since this dataset only contains publicly available books, all of them were published over 100 years ago. For some of the books, the cover image on Goodreads showed only the title on a plain background. We manually searched for these books with Google Image Search and found the actual covers for most of them. However, even after an exhaustive search, we were unable to obtain proper covers for 21 books. We did not remove these books from the dataset, so that our results remain comparable with prior results on it.

Experiments and Results
We used the same train and test folds as prior work on this dataset for all of our experiments. The dataset consists of 349 books belonging to the Unsuccessful class and 654 books belonging to the Successful class. Since the dataset is imbalanced, we follow that work and use the weighted F1-score to evaluate performance.
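To make the metric concrete, the sketch below computes a support-weighted F1-score by hand and applies it to a predictor that always outputs the majority class. It uses the class counts of the full collection as input, which is an assumption made for illustration only; the value it produces need not match a baseline computed on a particular test split. The function name `weighted_f1` is hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_label
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        score += f1 * n_label / total
    return score


# Class counts of the full collection; always predict the larger class.
y_true = ["Successful"] * 654 + ["Unsuccessful"] * 349
y_pred = ["Successful"] * len(y_true)
majority_f1 = weighted_f1(y_true, y_pred)
print(round(majority_f1, 3))
```

The minority class contributes an F1 of zero here, which is why the weighted score stays close to one half even though about 65% of the predictions are correct.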

Baselines
The most naive baseline is to predict the majority class for all test instances. This majority class baseline yields a weighted F1-score of 50.6% for the likability classification task and helps us understand whether our proposed model is actually learning from the data at all. Apart from this, we compare with previously reported results and define several other baselines to validate our proposed model. All of the baseline methods are listed below:
Mah'17: The current state of the art for this dataset. It reports several results on various combinations of textual features.
Mah'17+Vis: This method extends the Mah'17 method with the addition of visual features. Similar to them, we use the SVM classifier under two settings: Single-task (ST) and Multitask (MT). In ST, we simply predict the likability of books. In MT, along with predicting likability, we also predict genre simultaneously. This experiment allows us to make a direct comparison with Mah'17 regarding the effect of adding visual modalities.
Concatenation: Similar to GA, we first feed the features from different modalities through non-linear layers, each having the same number of neurons. We then concatenate the outputs to obtain the final representation for a book. We send this representation to a sigmoid layer for success prediction.
Average Pooling: Instead of concatenation, we take an average of the features after passing them through the non-linear layer. This is comparable to an attention model assigning equal weights to all modalities.
Attention: We use a multilayer perceptron to learn the appropriate weights for each of the features from different modalities. This method is similar to our proposed method, except that we do not use genre information for computing the attention weights. We compute the score(.) as $v^{\top}\, \mathrm{selu}(W_a h_i + b_a)$, without the genre information. This experiment helps us understand the importance of genre in computing weights for the feature types.
Bilinear Model: We combine the non-linearly transformed modalities $h_1, \ldots, h_n$ using a bilinear form over a pair of modalities $(h_i, h_j)$ plus a bias vector (Socher et al., 2013; Laha and Raykar, 2016; Fukui et al., 2016; Gao et al., 2016). This operation gives us a $k$-dimensional vector. In the case of more than two modalities, we first create $\binom{n}{2}$ pairs of these modalities and combine each of them using a bilinear form. The final book vector is the concatenation of the resulting vectors from each of these pairs. Bilinear models are used in the visual question answering community to fuse visual and textual information (Fukui et al., 2016). This experiment helps us understand how our proposed model compares with other state-of-the-art multimodal approaches.
For all these models as well, we also performed additional experiments by concatenating the genre vector g with the final representations r obtained from each of these models to study the significance of including genre explicitly for likability prediction.
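The fusion baselines differ mainly in how the projected vectors are merged and in how their size grows with the number of modalities. The sketch below illustrates this with toy vectors. It assumes, purely for illustration, that each bilinear pair has its own $d \times d \times k$ weight tensor and a $k$-dimensional bias; the text does not specify the parameterization, and all function names and sizes are hypothetical.

```python
from math import comb


def concatenate(hs):
    """Concatenation baseline: one long vector of size n * d."""
    return [value for h in hs for value in h]


def average_pool(hs):
    """Average Pooling baseline: element-wise mean of the modality vectors."""
    return [sum(h[k] for h in hs) / len(hs) for k in range(len(hs[0]))]


def weighted_sum(hs, alphas):
    """Attention-style fusion: weighted sum of the modality vectors."""
    return [sum(a * h[k] for a, h in zip(alphas, hs)) for k in range(len(hs[0]))]


def bilinear_output_size(n, k):
    """Size of the book vector when each of the C(n, 2) pairs yields k values."""
    return comb(n, 2) * k


def bilinear_param_count(n, d, k):
    """Parameters if every pair owns a d x d x k tensor plus k biases (assumed)."""
    return comb(n, 2) * (d * d * k + k)


hs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]  # toy projected modalities
uniform = [1.0 / len(hs)] * len(hs)

pooled = average_pool(hs)
same_as_uniform_attention = weighted_sum(hs, uniform)

for n in (2, 5, 10):
    print(n, bilinear_output_size(n, k=100), bilinear_param_count(n, d=100, k=100))
```

Average pooling coincides with attention under uniform weights, while the pairwise bilinear fusion grows quadratically in the number of modalities under the assumed parameterization.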

Experimental Settings
For the experiments involving the SVM classifier, we tuned the C hyper-parameter over the values {1e-4, . . . , 1e4} by performing a grid search with three-fold cross-validation over the training data and then used the best hyper-parameters to train the final model.

Results
Table 1 compares our results with the different baselines. We experimented with both low performing and high performing features and their combinations, as identified in prior work. We obtained the best weighted F1-score of 75.4% with our proposed GA+Genre concatenation model. This is 4.2% and 8.7% above the corresponding results reported by Mah'17 with their MT and ST settings, respectively. We also see a significant improvement of 6.5% (over MT) and 22.2% (over ST) when using RNN features with our proposed method as compared to Mah'17. These results indicate that our method learns higher-quality book representations than Mah'17's state-of-the-art methods.
The results also show that it is beneficial to use at least some form of attention over just Average Pooling. This suggests that using all available features without regard to their individual contribution to the task at hand can actually worsen performance. Our proposed model is capable of assigning importance to these features, and the results clearly show that this works to our benefit. The results also demonstrate the added advantage of using genre supervision while computing feature weights. There is a considerable improvement in performance over the Attention method after taking the genre information into account using our GA method. We suspect that the genre meta-information helps the model learn more specialized weights based on the genre of the books.
With the neural baseline methods like Concatenation and Average pooling, we do not always see improvement in performance after combining the genre information with the final book representation. Apart from these two, the combination of genre information does improve the results for other methods. The Bilinear and Attention methods seem to be able to utilize this information well. However, none of these methods are capable of doing better than our method. GA and GA+Genre concatenation models always achieve the best performance for all experiments. This also illustrates the latent power of our method to better exploit domain information like genre for performance improvement.
Another interesting finding is that with the addition of multiple modalities, the performance of Bilinear methods degrades to the majority class baseline (Table 1, last row). This may be due to parameter explosion with the increase in the number of modalities. However, our method is able to selectively weight the feature sources and discount the effect of redundant and irrelevant features to obtain the best performance, even with a larger number of modalities. In short, we see that our proposed method is able to cope with feature pollution and parameter explosion.
Next, we investigate the addition of visual information to the textual information for the likability prediction of books. Under the ST setting with SVMs, we see that the low performing textual features benefit significantly from the addition of visual features, sometimes even outperforming the MT setting (Table 1, rows 1-4). However, the visual features are not able to contribute much when combined with strong textual features that were already performing well. On the other hand, for the MT setting, the performance decreases for most of the feature combinations with the addition of the visual modality. We suspect that book covers are not very helpful at predicting genre and thus the MT setting does not do well with additional visual features.
Visual Results: Our next set of experiments considers only the visual information for books' likability prediction. Even though we believe that the current corpus might not be ideal for using cover features, it is still interesting to explore whether the current book covers have sufficient information to perform likability classification with reasonable accuracy. We used VGG and Resnet to extract features from book cover images. We replaced the top layers with a dense layer of 256 neurons and a classification layer (eight neurons with softmax for genre classification and one neuron with sigmoid activation for success classification). We also added a dropout layer between the dense and the classification layers. The layers were initialized with weights trained on the Imagenet dataset.
Table 2 shows the results with only the visual features for likability and genre classification of books under the ST and MT settings. We obtain the highest weighted F1-scores of 61.8% and 25.9% for the likability and genre classification tasks, respectively. With the neural experimental setup, we get similar performance under the ST and the MT settings for both tasks.
We also experimented with transferring the visual feature vectors to the SVM classifier under the ST and the MT settings. We saw a decrease in performance under the MT settings with both the VGG and Resnet features (Table 2, last two rows). This is the opposite of the Mah'17 results for the textual features as seen in Table 1. This may be because the textual features are better at both the likability and the genre classification tasks individually, whereas the visual features are not as good as the textual features for the genre classification task. Iwana et al. (2016) also concluded that genre classification with book covers is a difficult task, as book covers have images with few visual features or ambiguous features.
These results also empirically verify the decrease in performance for the MT settings with the addition of visual features for likability prediction. Although these results are significantly lower (p<0.001) than our best results, they are still better than the majority baseline (50.6% and 10.7% for the success and genre classification tasks, respectively). These results support our hypothesis that books' cover images correlate with the likability of books. They also indicate the need to extract other features that consider different aspects of books.

Attention Weights Visualization
Figure 2 shows the attention weights given by the best model to the different feature types for the books in the test set. The purpose of this visualization is to understand which aspects of a book are deemed to be more important by the model. The figure shows that most of the weights are assigned to the Char 5-gram and the RNN representations. The results in Table 1 also support that RNN features are indeed among the most important features. The contribution of the visual representations is negligible in the presence of strong textual features. The results in Table 1 also validate this finding. These two textual features also dominate the other, weaker textual features: as for the visual features, we see negligible weights assigned to the other textual features as well. Our model seems to have learned that the Char 5-gram and the RNN features can cover the information given by the rest of the features. The Char 5-gram feature is capable of capturing the content, topic, and style of a text and as such might be able to cover the Unigram and Sentic Concepts features. Likewise, the Book2Vec features may be non-essential in the presence of the RNN representations. The model is reducing redundant information that does not aid the classification task and instead might just add noise.
To validate that the features given the top weights by our model are indeed the best features for the task, we ran an experiment with only the Char 5-gram and the RNN features. We obtained a weighted F1-score of 73.6% with just these two features. This score is close to the best score of 75.4%, showing that these are indeed good features for the task. Also, note that our model was able to identify this feature set automatically, whereas traditional methods would have entailed performing multiple experiments ($2^n - 1$ experiments, where $n$ is the number of feature types), which is often not feasible to do exhaustively. There is still an extra boost when using the whole feature set rather than just the Char 5-gram and the RNN features. Since our method tailors the feature weights to each book and its genre as well, the boost likely comes from the presence of other visual and textual features, which at least for some books must be informative.
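The $2^n - 1$ figure is the number of non-empty subsets of $n$ feature types, each of which would need its own experiment in an exhaustive search. A short sketch makes the growth tangible; the four feature names are an illustrative subset chosen here, not the full feature inventory, and `feature_subsets` is a hypothetical helper.

```python
from itertools import combinations


def feature_subsets(feature_types):
    """Yield every non-empty combination of feature types."""
    for size in range(1, len(feature_types) + 1):
        yield from combinations(feature_types, size)


# Illustrative subset of feature types (the real inventory is larger).
feature_types = ["Char 5-gram", "RNN", "Typed n-grams", "Visual"]
subsets = list(feature_subsets(feature_types))
print(len(subsets))  # equals 2**4 - 1

# The count doubles (plus one) with every additional feature type.
for n in (5, 10, 20):
    print(n, 2 ** n - 1)
```

Even a modest number of feature types therefore requires more training runs than is practical, whereas the attention model is trained once.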
We just saw that only two out of all feature types are given most of the weights. However, the results in Table 1 show that even without these features, we are able to get good performance. To understand this, we analyze a model that does well without these two features. Figure 3 plots the average attention weights for a model that combines Sentic Concepts and Scores, Writing Density, Typed n-grams, and Visual features. We see that the weights now shift to Typed n-grams and to Sentic Concepts and Scores. The topic and content captured through Sentic Concepts and the style captured through Typed n-grams prove important. These features capture different aspects of books and are not strongly correlated with one another. Our model is capable of figuring out that in the absence of Char 5-gram, which encompasses all this information, these other features need to be made more prominent. We can also see that the model relies on three different feature types to capture the same amount of information as captured by the two best ones from before.
Figure 4: Average attention weights with respect to genre for the best features from two models.
Figure 4 further breaks down the attention weights by genre for RNN and Char 5-gram, and for Typed n-grams and Sentic Concepts and Scores. From the figure, it is evident that different genres respond differently to each feature type. Comparing the two models, we see that Char 5-gram activates similarly to Typed n-grams, and RNN similarly to Sentic Concepts, for different genres. Figure 5 shows the feature importance for the Char 5-gram and RNN feature types for six different books having different attention weights for the two features. This validates our assumption that the model is able to dynamically learn and assign weights to different modalities, not only according to the genre but also according to the characteristics of each book. The high variance of attention weights for the top features in Figures 2 and 3 also supports this claim. This gives an edge to our model and helps it excel over all other methods.
We then took the books that were misclassified when we used only the visual features but were correctly predicted after combination with the textual features. As expected, we found that the books without proper covers were misclassified by the visual features, but upon addition of the textual features they were correctly classified. Figure 6 shows the cover images of two such books. A cover with only a title on a plain background leaves little information for the visual modality. Similarly, we also analyzed the books that were correctly classified by visual features only and misclassified when textual features were added. Figure 7 shows two such books. Both the cover image and the title (present on the cover) of these two books seem to be interesting and are very likely to attract a reader's attention.
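The per-genre breakdown of attention weights discussed in this section amounts to grouping the per-book weights by genre and averaging them per feature type. A minimal sketch follows; the records and their numbers are invented toy inputs for illustration only and are not results from the paper, and `mean_attention_by_genre` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_attention_by_genre(records):
    """Average attention weight per (genre, feature type).

    records: iterable of (genre, {feature_type: attention_weight}) pairs,
    one pair per book.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for genre, weights in records:
        for feature, weight in weights.items():
            grouped[genre][feature].append(weight)
    return {
        genre: {feature: mean(values) for feature, values in features.items()}
        for genre, features in grouped.items()
    }


# Toy per-book attention weights (made-up numbers, not paper results).
records = [
    ("Poetry", {"Char 5-gram": 0.6, "RNN": 0.4}),
    ("Poetry", {"Char 5-gram": 0.2, "RNN": 0.8}),
    ("Drama", {"Char 5-gram": 0.7, "RNN": 0.3}),
]
averages = mean_attention_by_genre(records)
for genre, features in averages.items():
    print(genre, features)
```

Keeping the per-book lists before averaging also makes it straightforward to inspect the spread of weights within a genre.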

Related Work
Prior work has shown stylistic traits to be useful features for predicting the success of books (Ashok et al., 2013; Underwood and Sellers, 2016). Ashok et al. (2013) used stylistic features extracted from the first 1K sentences of books to distinguish highly successful literature from less successful literature. van Cranenburgh and Bod (2017) used lexical and rich syntactic tree features to distinguish highly literary novels from less literary ones. Louis and Nenkova (2013) defined genre-specific and general features to predict article quality in the science journalism domain. Later work compared against Ashok et al. (2013) and presented a new dataset for the book success prediction task; its multitask approach, combining deep representations and hand-crafted features, improved the classification results. Other work also showed that modeling the sequential flow of emotions across entire books improves likability prediction. Iwana et al. (2016) used neural networks to learn relationships between book covers and genre. They showed that book covers tend to have carefully designed color and tone, objects, and text. Our work relies on the hand-engineered and deep learning features of prior work but differs in how these features are combined to produce meaningful book representations.
The attention mechanism (Bahdanau et al., 2014) has been successfully applied to enhance document representations for several tasks, including text classification (Wang et al., 2016b), sentiment classification (Kar et al., 2017; Nguyen and Shirai, 2015; Wang et al., 2016a), question answering (Tan et al., 2015; Chen et al., 2016a; Hermann et al., 2015), named entity recognition (Bharadwaj et al., 2016; Aguilar et al., 2017), summarization (Rush et al., 2015), and image captioning (Xu et al., 2015). Zhang et al. (2017) used summary vectors and position vectors while computing the attention weights for the slot filling problem. Chen et al. (2016b) applied user preferences and product characteristics as attention over words and sentences in reviews to learn the final representations of the sentences and reviews. They used these representations for sentiment classification and showed that adding user information was much more effective in enhancing the document representations than adding product information. Similar to their idea, we fuse the genre information while computing attention weights.

Conclusions and Future Work
We present a novel method that fuses the information coming from different modalities using a genre-aware attention mechanism to predict the likability of books. We showed that our proposed method outperforms strong baselines and the state of the art by learning to distinguish the important features from irrelevant or redundant ones. Other methods suffered from either feature pollution or parameter explosion and yielded low performance. Our results also showed that book cover images by themselves have sufficient information to perform success prediction. However, the difficulty of predicting genre from book covers decreased the performance in multi-task settings with additional visual features. We also used different visualizations to support our findings and improve the interpretability of our model. As future work, we will extend the proposed method to include components that learn weights for individual feature elements and not only for the entire feature type. This is likely to result in higher-quality multimodal representations.