On the “Calligraphy” of Books

Authorship attribution is a natural language processing task that has been widely studied, often by considering small order statistics. In this paper, we explore a complex network approach to assign the authorship of texts based on their mesoscopic representation, in an attempt to capture the flow of the narrative. Indeed, as reported in this work, such an approach allowed the identification of the dominant narrative structure of the studied authors. This has been achieved due to the ability of the mesoscopic approach to take into account relationships between different, not necessarily adjacent, parts of the text, which is able to capture the story flow. The potential of the proposed approach has been illustrated through principal component analysis, a comparison with the chance baseline method, and network visualization. Such visualizations reveal individual characteristics of the authors, which can be understood as a kind of calligraphy.


Introduction
The ever-increasing availability of public content on the Internet (including books, tweets, and blog posts) has led to many new developments in several natural language processing (NLP) areas, such as machine translation, sentiment analysis, and authorship attribution. Recently, advancements in the latter task have been achieved by using complex networks (Antiqueira et al., 2006; Amancio et al., 2011; Lahiri and Mihalcea, 2013; Marinho et al., 2016; Akimushkin et al., 2017).
The network models used in many of these works are based on word co-occurrence. In this approach, each distinct word is represented by a node, and edges connect adjacent words. Although this networked representation has proven successful in many tasks, it is not without its share of problems. Co-occurrence networks do not portray the topical structure found in many texts and are usually devoid of community structure (de Arruda et al., 2016). In order to overcome this disadvantage, some techniques have been devoted to the mesoscopic representation of texts (de Arruda et al., 2016, 2017). de Arruda et al. (2017) proposed a novel networked model, in which each node represents a set of consecutive paragraphs, while weighted edges express the similarity between nodes. Their proposed network is able to extract the organization and flow of text by effectively capturing the similarity between the blocks of text. In addition, their method was employed to distinguish between real and shuffled texts. However, mesoscopic networks have not been applied to tackle other NLP tasks.
Most researchers in the field of authorship attribution assume that each author has a signature (known as an authorial fingerprint) that distinguishes his/her writing from that of others (Juola, 2006). So inspired, we decided to test the hypothesis that these authorial fingerprints are also visible at a mesoscopic scale. At this scale, distinctive graphical patterns of the course of the text emerge, akin to a "discourse calligraphy" of the author. Thus, in order to classify texts according to their authorship, we created mesoscopic networks from texts and employed a set of topological measurements. In particular, the main goal of this paper is to probe whether the authors' writing styles correlate with the story flow of their books. This paper is structured as follows: Section 2 briefly describes the problem and some complex network approaches for authorship attribution. The process to create mesoscopic networks is explained in Section 3, which also describes the selected measurements and the machine learning algorithms. The dataset and the obtained results are reported in Section 4. Finally, Section 5 outlines our conclusions and prospects for future work.

Related Work
Authorship attribution methods attempt to find the most likely author of a document (Stamatatos, 2009). Since the seminal work conducted by Mosteller and Wallace (1964), authorship attribution has been a widely studied problem and several different approaches have been proposed. One of the first approaches consisted in analyzing the frequency of common words, such as to or the, in order to classify political essays according to their authorship (Mosteller and Wallace, 1964).
Since then, the method of Mosteller and Wallace (1964) has been enhanced to incorporate different attributes capable of qualifying writing styles. These include lexical, character, syntactic, and semantic features (Stamatatos, 2009). Simple lexical and character features (e.g. frequency and burstiness of words and characters, average lengths of texts, and others) have been used in several works, as reported by Grieve (2007), Koppel et al. (2009), and Stamatatos (2009). Most of these works have achieved good results by using, for example, the frequency of stopwords. Examples of syntactic information include the frequencies of POS tags and constituency-based parsing tree rules (Baayen et al., 1996; Gamon, 2004; Hirst and Feiguina, 2007). Finally, semantic features can be extracted from semantic dependency graphs and from the semantic roles associated with some words (Gamon, 2004; Argamon et al., 2007).
The usage of network analysis in authorship attribution has already been studied from different perspectives. Antiqueira et al. (2006), one of the first works in the area, extracted some measurements from co-occurrence networks and discovered that these could be used to characterize the writing style of authors. Amancio et al. (2011) combined network measurements with the distribution of words to characterize the authorship of several books. Lahiri and Mihalcea (2013) carried out an in-depth authorship attribution study using more than 100 features extracted from co-occurrence networks. They found that local features (those extracted from individual nodes) outperform global features in the authorship attribution problem.
Apart from using traditional network measurements, the frequency of network motifs involving three nodes (Milo et al., 2002) was found useful to characterize the writing style (Marinho et al., 2016). Instead of considering the text as a static structure, Akimushkin et al. (2017) studied the topology evolution of co-occurrence networks extracted from different sections of the text. Unlike most of the previously mentioned works, in which stopwords are usually removed, Segarra et al. (2013) proposed an authorship attribution method based on networks formed only by stopwords.

Methods
In this section, we describe the process to create mesoscopic networks from raw texts. We also detail the network measurements and machine learning methods.

Mesoscopic Approach
There are several ways to represent texts as complex networks, such as co-occurrence, syntactic, semantic or similarity networks (Mihalcea and Radev, 2011; Cong and Liu, 2014). In this study, we adopt the mesoscopic network approach proposed by de Arruda et al. (2017). Such networks are able to represent the unfolding of the text along time, which is normally overlooked by traditional approaches. Moreover, these networks were used to classify documents as real or shuffled texts, using only simple statistics. The high accuracy rate obtained in that classification task led us to infer that mesoscopic networks are able to represent structural aspects of real texts, such as the organization and development of the author's ideas.
In order to create the network from a given text ($T$), some preprocessing steps can be applied. In our study, we removed the stopwords, and the remaining words were lemmatized. Figure 1 illustrates the methodology used to create mesoscopic networks. In the first step, shown in Figure 1(a), the text is partitioned into a set of paragraphs, $T = (p_0, p_1, p_2, \cdots)$, where $p_i$ is a sequence of the preprocessed words belonging to the same paragraph $i$. Unlike co-occurrence networks, where nodes represent words, in mesoscopic networks nodes encompass sequences of $\Delta$ consecutive paragraphs. More specifically, each possible subsequent set with $\Delta$ paragraphs, $W_i^\Delta = (p_i, p_{i+1}, \cdots, p_{i+\Delta-1})$, represents a network node, as shown in Figure 1(b).

Figure 1: First, the text $T$ is divided into subsequent paragraphs (a). Overlapping windows with $\Delta = 2$ paragraphs are shown in (b). Then, the tf-idf map is computed for all windows (c). Each pair of nodes (windows) $i$ and $j$ is now connected by an edge, weighted by the cosine similarity between their respective tf-idf maps (d). Next, in the network pruning phase, the edges with the lowest weights are removed until the network reaches a given average degree $\langle k \rangle$. The network in (e) illustrates the obtained unweighted mesoscopic network with $\langle k \rangle = 2$.

So as to account for the importance of the words in a given paragraph, we applied the tf-idf (Manning and Schütze, 1999) statistic, which was originally proposed to quantify the importance of a given word $w$ in a document $d$ given a corpus $D$.
$$\text{tf-idf}(w, d, D) = \frac{f_{w,d}}{n} \log\frac{|D|}{d_w},$$
where $f_{w,d}$ is the frequency of word $w$ in the document $d$, $n$ is the total number of words in the document $d$, $|D|$ represents the total number of documents and $d_w$ is the number of documents in which $w$ occurs at least once. In order to apply the tf-idf measurement, we considered all the possible windows of subsequent paragraphs, $W_i^\Delta$, as the set of documents $D$ (see Figure 1(c)). Finally, for each pair of nodes $i$ and $j$, a respective edge is created and its weight is calculated according to the cosine similarity between tf-idf($W_i^\Delta$) and tf-idf($W_j^\Delta$), where tf-idf($W_i^\Delta$) is a tf-idf vector of all words, computed from a given set of paragraphs $W_i^\Delta$. This step is illustrated in Figure 1(d).
In order to convert the network from weighted to unweighted, the edges with the lowest weights can be removed, as described in Section 3.2. It should be noted that edges originating from adjacent paragraphs tend to have higher weights because of the implied overlap. Figure 1(e) shows an example of an unweighted network. In our experiment, we set $\Delta = 20$, as empirically determined elsewhere (de Arruda et al., 2017).
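The construction of the weighted network can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names (`make_windows`, `tfidf_vectors`, `cosine`, `mesoscopic_edges`) are hypothetical, each paragraph is assumed to be an already preprocessed list of words, and the tf-idf form is the standard one, with term frequency normalized by window length and a natural logarithm.

```python
import math
from collections import Counter


def make_windows(paragraphs, delta):
    """Overlapping windows W_i = (p_i, ..., p_{i+delta-1}), flattened to word lists."""
    return [
        [word for paragraph in paragraphs[i:i + delta] for word in paragraph]
        for i in range(len(paragraphs) - delta + 1)
    ]


def tfidf_vectors(windows):
    """One sparse tf-idf map (word -> weight) per window; windows act as documents."""
    n_docs = len(windows)
    doc_freq = Counter()
    for words in windows:
        doc_freq.update(set(words))  # count each word once per window
    vectors = []
    for words in windows:
        counts = Counter(words)
        total = len(words)
        vectors.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mesoscopic_edges(paragraphs, delta):
    """Complete weighted graph: {(i, j): cosine similarity} for every i < j."""
    vectors = tfidf_vectors(make_windows(paragraphs, delta))
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }
```

Note that a word occurring in every window receives a zero weight under this formula, so two windows that share only such words have zero similarity.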

Network Pruning
Mesoscopic networks are complete weighted graphs, i.e. every node is connected to every other node (Newman, 2010). In this paper, we repeatedly removed the edges with the lowest weights until each network reached a fixed average degree $\langle k \rangle$. The average degree of a network $g$, with $E$ edges and $N$ nodes, is defined as
$$\langle k \rangle = \frac{2E}{N}.$$
We used several values of $\langle k \rangle$, ranging from 5 to 50, in steps of 5.
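A minimal sketch of this pruning step is given below. It assumes the weighted network is stored as a dictionary mapping node pairs to weights (a hypothetical representation, as is the function name). Removing the weakest edges one by one until the target is reached is equivalent to keeping the strongest E = ⟨k⟩N/2 edges, which is what the sketch does; how ties between equal weights are broken is not specified in the text and is arbitrary here.

```python
def prune_to_average_degree(edges, n_nodes, avg_degree):
    """Return an unweighted adjacency map keeping only the strongest edges.

    edges: {(i, j): weight} for a complete graph on nodes 0..n_nodes-1.
    The number of kept edges E satisfies 2E/N = avg_degree (up to rounding).
    """
    target_edges = round(avg_degree * n_nodes / 2)
    strongest = sorted(edges, key=edges.get, reverse=True)[:target_edges]
    adjacency = {node: set() for node in range(n_nodes)}
    for i, j in strongest:
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency


# Toy example: a complete graph on 4 nodes pruned to average degree 2.
toy_edges = {
    (0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.7,
    (0, 3): 0.6, (0, 2): 0.2, (1, 3): 0.1,
}
toy_network = prune_to_average_degree(toy_edges, n_nodes=4, avg_degree=2)
```

Although the average degree is fixed by construction, individual node degrees can still differ, which is why the degree distribution remains informative after pruning.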

Network Measurements
The following network measurements were extracted from the networks. Most of these measurements (apart from assortativity) apply to a single node. So, in order to obtain a more global characterization, we calculated the average, standard deviation and skewness (third moment) of each distribution. The obtained statistics from these distributions were then used as features in the machine learning methods.
Degree: The degree quantifies the number of connections of a node (Costa et al., 2007). Even though the average degree of all networks is the same as a consequence of network pruning, the degree of each node may still vary inside the network. Therefore, we used the standard deviation and skewness of this measurement, disregarding the average.
Average Degree of Neighbors: The average degree of neighbors (Pastor-Satorras et al., 2001) quantifies how well connected the neighbors of a node are.
Assortativity: As described by Newman (2003), the assortativity quantifies how likely it is for a given node to connect to other nodes with similar degree. Negative values of assortativity are obtained when nodes tend to connect to others with very different degrees. When nodes connect only to others with the same degree, the assortativity becomes one. Null assortativity indicates that there is no correlation.
Clustering Coefficient: This measurement reflects how well interconnected the neighbors of a given node are (Watts and Strogatz, 1998).
Accessibility (h = {2, 3}): The accessibility of a node i is based on Shannon's entropy (Shannon and Weaver, 1963) of the probability of accessing nodes at the h-th concentric level, centered at i, by a given dynamics starting at that node (Travençolo and Costa, 2008). Here, we adopted the self-avoiding random walk as the reference dynamics.
Symmetry (h = {2, 3, 4}): This measurement (Silva et al., 2016b), obtained for each node i, quantifies the symmetry of the topology around i. It can be understood as a normalization of the accessibility, and includes two components: backbone, where edges between nodes from the same concentric level are discarded, and merged, where nodes that share edges in the same level are merged.
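As an illustration of how a node-level measurement is turned into features, the sketch below computes the accessibility of every node and then summarizes the resulting distribution by its mean, standard deviation and skewness. It rests on several assumptions not stated in the text: accessibility is taken as the exponential of the Shannon entropy of the probabilities of reaching each node after exactly h steps of a self-avoiding walk, walks that get stuck before h steps are discarded and the remaining probabilities renormalized, and skewness is the standardized third moment. The function names are hypothetical, and the exhaustive path enumeration is only practical for small networks and small h.

```python
import math


def saw_end_probabilities(adjacency, start, h):
    """Probability of ending at each node after h self-avoiding steps from start."""
    probs = {}

    def walk(node, visited, steps_left, p):
        if steps_left == 0:
            probs[node] = probs.get(node, 0.0) + p
            return
        options = [n for n in adjacency[node] if n not in visited]
        for nxt in options:  # uniform choice among unvisited neighbors
            walk(nxt, visited | {nxt}, steps_left - 1, p / len(options))

    walk(start, {start}, h, 1.0)
    total = sum(probs.values())
    # Assumption: renormalize over walks that completed all h steps.
    return {n: p / total for n, p in probs.items()} if total > 0 else {}


def accessibility(adjacency, start, h):
    """exp(Shannon entropy) of the end-node probabilities; 0.0 if no walk completes."""
    probs = saw_end_probabilities(adjacency, start, h)
    if not probs:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy)


def summarize(values):
    """Mean, standard deviation, and skewness (standardized third moment)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    skew = sum((v - mean) ** 3 for v in values) / n / std ** 3 if std > 0 else 0.0
    return mean, std, skew


# Toy example: a star with center 0 and four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
features = summarize([accessibility(star, node, 2) for node in star])
```

Under these assumptions, a node that reaches m nodes with equal probability has accessibility m, so the measurement can be read as an effective number of reachable nodes.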
Network visualization can provide means to better understand the structure of a given book's story by organizing, into an embedding space, the topology of the obtained network. We applied a visualization methodology based on force-directed graph drawing (Silva et al., 2016a). Specifically, this method is based on the Fruchterman and Reingold (1991) (FR) algorithm, which simulates a system of particles that attract and repel one another. The attractive force, f a , reflects the node connectivity, while the repulsive force, f r , acts between all pairs of nodes. A gravitational force, f g , can also be added. We adopted f a = 0.0002, f r = 1.25, and f g = 0.001.

Machine Learning Methods
Several classifiers, namely Decision Trees, Random Forest, kNN, Logistic Regression, SVM, and Naive Bayes (Duda et al., 2000), were tested in order to choose the most adequate ones. Support Vector Machines (SVM) and Random Forest were selected. We used the Linear SVM implementation (with default parameters), and Random Forest with 50 trees, both available in Scikit-learn (Pedregosa et al., 2011). We employed the leave-one-out cross-validation technique, in which only one dataset instance is used for testing while all the others are used for training the classifier. Feature selection was attempted, but no particular subset of features stood out. Therefore, all measurements were considered.
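The leave-one-out protocol can be sketched as follows. To keep the example free of external dependencies, a 1-nearest-neighbor classifier stands in for the Linear SVM and Random Forest used in the paper; it is a substitute chosen for illustration only, and the function names are hypothetical. Any function with the same `predict(train_X, train_y, x)` signature could be plugged in instead.

```python
import math


def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in classifier: label of the closest training instance (Euclidean)."""
    best = min(range(len(train_X)), key=lambda k: math.dist(train_X[k], x))
    return train_y[best]


def leave_one_out_accuracy(X, y, predict=nearest_neighbor_predict):
    """Hold out each instance once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Toy example: two well-separated "authors" with two "books" each.
toy_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_y = ["a", "a", "b", "b"]
toy_accuracy = leave_one_out_accuracy(toy_X, toy_y)
```

With five texts per author, leave-one-out leaves four training texts for the held-out book's author in every fold, which makes it a natural choice for a dataset of this size.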

Results and Discussion
In this section, we describe the selected dataset and present the obtained results organized in two parts: (i) the complete set of authors; and (ii) four authors representing major types of works.

Dataset
In order to investigate whether authors can be distinguished by the story flow in their works, we created mesoscopic networks from several texts. Our dataset is composed of 100 English texts written by 20 distinct authors (five texts per author) extracted from Machicao et al. (2016). The selected 20 authors are: Andrew Lang, Arthur Conan Doyle, B. M. Bower, Bram Stoker, Charles Darwin, Charles Dickens, Edgar Allan Poe, H. G. Wells, Hector H. Munro (Saki), Henry James, Herman Melville, Horatio Alger, Jane Austen, Mark Twain, Nathaniel Hawthorne, P. G. Wodehouse, Richard Harding Davis, Thomas Hardy, Washington Irving, and Zane Grey. The whole dataset was obtained from the Project Gutenberg repository. The complete list of used texts is presented in Table 2.

Complete Set of Authors
In the first experiment, we used all the books by all 20 authors, yielding the results presented in Table 1. Remarkably, although the chance baseline for this experiment is only 5% (each author has the same probability of being randomly selected), our best result was as high as 35%. Moreover, 17 (48.5%) out of the 35 books correctly classified by our method were written by only 4 authors, namely Andrew Lang, B. M. Bower, Hector H. Munro (Saki), and Henry James.

Table 1: Accuracy rate in discriminating the authorship of texts.
We also performed a pairwise classification. The obtained results were compared with a traditional approach usually employed in the literature, the analysis of the most frequent words. For this experiment, we used the original texts of each book, extracted the frequency of the 20 most frequent words, and then used an SVM classifier. Figure 2 shows the accuracies for the traditional features, and Figure 3 illustrates the pairwise classification accuracies when mesoscopic networks were used to model each text. For the latter, we did not select a single average degree $\langle k \rangle$, but rather combined all the degrees listed in Table 1. The accuracies were obtained with the SVM classifier.
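The traditional baseline and the pairwise setup can be sketched as follows. The text does not state whether the 20 most frequent words are chosen per book or over the whole collection, nor whether raw or relative frequencies are used; this sketch assumes a collection-wide vocabulary and relative frequencies. The function names are hypothetical, and each text is assumed to be an already tokenized list of words.

```python
from collections import Counter
from itertools import combinations


def top_words(texts, n_words=20):
    """The n_words most frequent words over all texts (assumed collection-wide)."""
    counts = Counter()
    for tokens in texts:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n_words)]


def frequency_features(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


def author_pairs(authors):
    """All unordered pairs of distinct authors, one binary task per pair."""
    return list(combinations(sorted(set(authors)), 2))


# Toy example with two tiny "books".
toy_texts = [["the", "the", "of"], ["the", "a"]]
toy_vocabulary = top_words(toy_texts, n_words=2)
toy_features = [frequency_features(t, toy_vocabulary) for t in toy_texts]
```

For 20 authors, `author_pairs` yields 190 binary classification tasks, each with ten books (five per author).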
A careful examination of Figures 2 and 3 reveals that in some cases, except for the squares with lighter colors, our results are on par with those obtained with the frequency of the 20 most frequent words (mainly stopwords). Moreover, our method even achieved higher accuracies in some combinations. See, for example, authors Grey and Munro, for which 7 and 6, respectively, of our results were better than those of the traditional approach. One thing that we should note, and which will be revisited in the following subsection, is that it is hard for mesoscopic networks to distinguish Edgar Allan Poe from Charles Darwin. In this case, we obtained an accuracy rate of 50%, in contrast to the 80% achieved by the other approach.

Small Set of Authors
Out of the 20 authors considered in the previous subsection, we selected four authors, namely Charles Darwin, Thomas Hardy, Edgar Allan Poe, and Mark Twain. They were chosen because two of them have several novels (Thomas Hardy and Mark Twain), Edgar Allan Poe is best known for writing short stories and Charles Darwin wrote about his scientific theories and observations. The accuracy rate obtained in classifying them with the mesoscopic representation rose to 65% (Random Forest) and 50% (SVM), compared with the chance baseline of 25% for four authors. The Principal Component Analysis (PCA) (Jolliffe, 2002) considering these four authors is presented in Figure 4.
The PCA results indicate a clear partitioning between the groups of books associated with each author. Remarkably, one of Thomas Hardy's books (A Changed Man and Other Tales) fell between those of Edgar Allan Poe and Charles Darwin. Such a good partitioning is a consequence of the quite different mesoscopic networks obtained for these authors, as depicted in Figure 5.
The mesoscopic networks presented in Figure 5 unveil interesting aspects, including an unexpected similarity to intricate calligraphic shapes. Note that the books which contain tales or short stories, such as those by Edgar Allan Poe, as well as the book A Changed Man and Other Tales, present a similar chain-like topology with a few cycles. Moreover, most of these cycles appear at a relatively small scale. Interestingly, the scientific books of Charles Darwin also present this chain-like structure, which is probably related to the nature of his writings, describing his theories, observations, and findings.
It is important to highlight that a full visual analysis with all 20 authors was beyond the scope of this experiment. Our primary goal was to perform a preliminary investigation of the books through geometrical approaches.

Conclusion
Complex network methods have been applied with growing success to several natural language processing tasks. In some of these approaches, a chunk of text is represented as a co-occurrence network, which reflects the syntactic relationship between words (Cancho and Solé, 2001). Although this is a well-known representation, it is not without its share of problems. Those networks, for example, are unable to represent the topical structure found in many texts. So as to overcome such a limitation, a mesoscopic representation has been recently proposed (de Arruda et al., 2017). The main goal of that approach was to take into account the semantic relationship between chunks of text. More specifically, the network nodes correspond to texts from consecutive paragraphs, while the edges are weighted by the similarity between the respective texts. Statistics of some local topological measurements were used to characterize the books' mesoscopic networks. We tested the hypothesis that such a representation is useful for assigning authorship to documents. In particular, we advocated that fingerprints left by each author are visible at a mesoscopic scale.
The obtained accuracy rates, which in one case surpassed the chance baseline by 40 percentage points, suggest that the proposed approach is capable of revealing characteristics of writing styles. In addition, we performed an alternative classification, in which all pairs of distinct authors were considered. In some cases our method provided better results than those obtained with traditional features. Such a result indicates that features obtained from mesoscopic networks can be used as a complement to more traditional features of texts. In order to better understand the unfolding of texts, we selected authors whose works include short stories, novels, and scientific writing. A set of topological features was estimated and projected using PCA. Interestingly, in this projected space, a book of tales written by Thomas Hardy ended up closer to Edgar Allan Poe's books, which are also composed of short stories. Even more surprisingly, the patterns obtained by the visualization turned out to be quite representative of the different types of works, suggesting a "calligraphy". Such visualizations reveal intricate discourse patterns in the books.
The goal of this paper was not to provide state-of-the-art results for authorship attribution, given that most traditional approaches in the literature have achieved results as high as 90% (Grieve, 2007; Koppel et al., 2009). Instead, we report an approach that can be used to obtain novel stylometric features, as well as to complement traditional methods.
Future work could apply a similar approach to other related tasks (such as authorship verification, plagiarism detection, and topic segmentation) and also extend the mesoscopic representation to include different granularity levels, such as sentences or chapters. Another possibility is to investigate the relationship between the emotional content of a text and its topology.

The bluish nodes represent the windows formed by paragraphs from the beginning of the book and the greenish ones represent the windows formed by paragraphs from the end of the book. The order of the windows can be seen in the legend, where N represents the last window.