On the Relevance of Syntactic and Discourse Features for Author Profiling and Identification

The majority of approaches to author profiling and author identification focus mainly on lexical features, i.e., on the content of a text. We argue that syntactic and discourse features play a significantly more prominent role than they have been given in the past. We show that they achieve state-of-the-art performance in author and gender identification on a literary corpus while keeping the feature set small: the feature set is composed of only 188 features and still outperforms the winner of the PAN 2014 shared task on author verification in the literary genre.


Introduction
Author profiling and author identification are two tasks in the context of the automatic derivation of author-related information from textual material. In the case of author profiling, demographic author information such as gender or age is to be derived; in the case of author identification, the goal is to predict the author of a text, selected from a pool of potential candidates. The basic assumption underlying author profiling is that, as a result of being exposed to similar influences, authors who share demographic traits also share linguistic patterns in their writings. The assumption underlying author identification is that the writing style of an author is unique enough to be characterized accurately and to be distinguishable from the style of other authors. State-of-the-art approaches commonly use large amounts of lexical features to address both tasks. We show that with a small number of features, most of them syntactic or discourse-based, we outperform the best models in the PAN 2014 author verification shared task (Stamatatos et al., 2014) on a literary genre dataset and achieve state-of-the-art performance in author and gender identification on a different literary corpus.
In the next section, we briefly review the related work. In Section 3, we describe the experimental setup and the features that are used in the experiments. Section 4 presents the experiments and their discussion. Finally, in Section 5, we draw some conclusions and sketch the future line of our research in this area.

Related Work
Author identification in the context of the literary genre has attracted attention beyond NLP research circles, e.g., due to the work by Aljumily (2015), who addressed the allegations that Shakespeare did not write some of his best plays using clustering techniques with function word frequencies, word n-grams, and character n-grams.
Another example of this type of work is (Gamon, 2004), where the author classifies the writings of the Brontë sisters using as features sentence length, the number of nominal/adjectival/adverbial phrases, function word frequencies, part-of-speech (PoS) trigrams, constituency patterns, semantic information, and n-gram frequencies. In the field of author profiling, several works address gender identification specifically. Schler et al. (2006) and Koppel et al. (2002) extract function words, PoS tags, and the 1,000 words with the highest information gain. Sarawgi et al. (2011) use long-distance syntactic patterns based on probabilistic context-free grammars, token-level language models, and character-level language models.
In what follows, we focus on the identification of the author profiling trait 'gender' and on author identification as such. For both, feature engineering is crucial, and for both the tendency is to use word/character n-grams and/or function and stop word frequencies (Mosteller and Wallace, 1963; Aljumily, 2015; Gamon, 2004; Argamon et al., 2009), PoS tags (Koppel et al., 2002; Mukherjee and Liu, 2010), or patterns captured by context-free-grammar-derived linguistic patterns; see, e.g., (Raghavan et al., 2010; Sarawgi et al., 2011; Gamon, 2004). When syntactic features are mentioned, often function words and punctuation marks are meant; see, e.g., (Amuchi et al., 2012; Abbasi and Chen, 2005; Cheng et al., 2009). However, it is well known from linguistics and philology that deeper syntactic features, such as sentence structure, the frequency of specific phrasal and syntactic dependency patterns, and discourse structure are relevant characteristics of the writing style of an author (Crystal and Davy, 1969; DiMarco and Hirst, 1993; Burstein et al., 2003).

Experimental Setup
State-of-the-art techniques for author profiling / identification usually draw upon large quantities of features; e.g., Burger et al. (2011) use more than 15 million features, and Argamon et al. (2009) and Mukherjee and Liu (2010) more than 1,000. This limits their application in practice. Our goal is to demonstrate that the use of syntactic dependency and discourse features allows us to reduce the total number of features to fewer than 200 and still achieve competitive performance with a standard classification technique. For this purpose, we use Support Vector Machines (SVMs) with a linear kernel in four different experiments. Let us now introduce these features and the data on which the trained models have been tested.
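As an illustration, this classification setup can be sketched with scikit-learn. The toy data below (40 random instances with 188 features and an artificially injected class signal) is purely hypothetical and stands in for the real feature matrix; it is not the paper's data or code.

```python
# Sketch of the setup: a linear-kernel SVM over a fixed 188-dimensional
# feature matrix, evaluated with 10-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# 40 toy instances x 188 features, two balanced classes.
X = rng.normal(size=(40, 188))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 0] += 3.0  # inject a class signal so the toy example is learnable

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV as in the experiments
print(round(scores.mean(), 2))
```

In the actual experiments, each row of `X` would hold the 188 surface, syntactic, and discourse feature values of one text instance.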

Feature Set
We extracted 188 surface-oriented, syntactic dependency, and discourse structure features for our experiments. The surface-oriented features are few, since syntactic and discourse structure features are assumed to reflect the unconscious stylistic choices of authors better than surface-oriented features do.
For feature extraction, Python and its Natural Language Toolkit (NLTK), a dependency parser (Bohnet, 2010), and a discourse parser (Surdeanu et al., 2015) are used.
The feature set is composed of six subgroups of features: Character-based features are composed of the ratios between upper-case characters, periods, commas, parentheses, exclamation marks, colons, digits, semicolons, hyphens, and quotation marks and the total number of characters in a text.
Word-based features are composed of the mean number of characters per word, vocabulary richness, acronyms, stopwords, first-person pronouns, the usage of words composed of two or three characters, the standard deviation of word length, and the difference between the longest and the shortest words.
Sentence-based features are composed of the mean number of words per sentence, standard deviation of words per sentence and the difference between the maximum and minimum number of words per sentence in a text.
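To make these surface-oriented subgroups concrete, a few of the ratios above can be computed as follows. This is a simplified sketch with naive regex tokenization; the function name and the tokenization are ours, not the paper's.

```python
# Illustrative extraction of some character-, word- and sentence-based features.
import re
import statistics

def surface_features(text):
    chars = len(text)
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lens = [len(w) for w in words]
    sent_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return {
        "upper_ratio": sum(c.isupper() for c in text) / chars,
        "comma_ratio": text.count(",") / chars,
        "mean_word_len": statistics.mean(word_lens),
        "word_len_sd": statistics.pstdev(word_lens),
        "word_len_range": max(word_lens) - min(word_lens),
        "mean_sent_len": statistics.mean(sent_lens),
        "sent_len_range": max(sent_lens) - min(sent_lens),
    }

feats = surface_features("It was a dark night. The wind, cold and sharp, howled.")
print(feats["mean_sent_len"])  # → 5.5
```

Each ratio is normalized by text length (characters, words, or sentences), so that chapters of different lengths remain comparable.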
Dictionary-based features consist of the ratios of discourse markers, interjections, abbreviations, curse words, and polar words (positive and negative words in the polarity dictionaries described in (Hu and Liu, 2004)) with respect to the total number of words in a text.
Syntactic features Three types of syntactic features are distinguished: 1. Part-of-Speech features are given by the relative frequency of each PoS tag in a text, the relative frequency of comparative/superlative adjectives and adverbs, and the relative frequency of the present and past tenses. In addition to the fine-grained Penn Treebank tags, we introduce general grammatical categories (such as 'verb', 'noun', etc.) and calculate their frequencies.
2. Dependency features reflect the occurrence of syntactic dependency relations in the dependency trees of the text. The dependency tagset used by the parser is described in (Surdeanu et al., 2008). We extract the frequency of each individual dependency relation per sentence, the percentage of modifier relations used per tree, the frequency of adverbial dependencies (they give information on manner, direction, purpose, etc.), the ratio of modal verbs with respect to the total number of verbs, and the percentage of verbs that appear in complex tenses referred to as "verb chains" (VCs).
3. Tree features measure the tree width, the tree depth, and the ramification factor. Tree depth is defined as the maximum number of nodes between the root and a leaf node; the width is the maximum number of siblings at any of the levels of the tree; and the ramification factor is the mean number of children per level. In other words, the tree features characterize the complexity of the dependency structure of the sentences.
These measures are also applied to subordinate and coordinate clauses.
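The three tree-shape measures can be sketched on a toy dependency tree represented as child lists. The representation and the edge-counting depth convention are our simplifications, not the parser's actual output format.

```python
# Depth, width and ramification factor of a (toy) dependency tree.
from collections import defaultdict

def tree_shape(children, root):
    levels = defaultdict(list)  # level -> nodes at that level
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        levels[depth].append(node)
        for c in children.get(node, []):
            stack.append((c, depth + 1))
    depth = max(levels)                                   # root-to-leaf distance
    width = max(len(nodes) for nodes in levels.values())  # widest level
    # ramification: mean number of children per level that has any children
    per_level = [sum(len(children.get(n, [])) for n in nodes)
                 for _, nodes in sorted(levels.items())]
    branching = [x for x in per_level if x > 0]
    ramification = sum(branching) / len(branching)
    return depth, width, ramification

# Toy tree for "She wrote a long letter": wrote -> {She, letter}, letter -> {a, long}
tree = {"wrote": ["She", "letter"], "letter": ["a", "long"]}
print(tree_shape(tree, "wrote"))  # → (2, 2, 2.0)
```

Deeper, wider trees with higher ramification correspond to syntactically more complex sentences, which is exactly what these features are meant to capture.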
Discourse features characterize the discourse structure of a text. To obtain the discourse structure, we use Surdeanu et al. (2015)'s discourse parser, which receives as input a raw text, divides it into Elementary Discourse Units (EDUs) and links them via discourse relations that follow the Rhetorical Structure Theory (Mann and Thompson, 1988).
We compute the frequency of each discourse relation per EDU (dividing the number of occurrences of each discourse relation by the number of EDUs per text) and additionally take into account the shape of the discourse trees by extracting their depth, width and ramification factor.
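A minimal sketch of this per-EDU normalization; the relation list below is a hypothetical stand-in for the output of the discourse parser.

```python
# Frequency of each discourse relation, normalized by the number of EDUs.
from collections import Counter

def discourse_relation_freqs(relations, n_edus):
    counts = Counter(relations)
    return {rel: c / n_edus for rel, c in counts.items()}

# Hypothetical parse: 4 EDUs linked by 3 RST relations.
rels = ["Elaboration", "Elaboration", "Contrast"]
print(discourse_relation_freqs(rels, n_edus=4))  # → {'Elaboration': 0.5, 'Contrast': 0.25}
```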

Datasets
We use two datasets. The first dataset is a corpus of chapters (henceforth referred to as "LiteraryDataset") extracted from novels downloaded from the "Project Gutenberg" website. Novels from 18 different authors were selected. Three novels per author were downloaded and divided by chapter, labeled with the gender and name of the author, as well as with the book they correspond to. All of the authors are British and lived in roughly the same time period. Half of the authors are male and half female. The dataset is composed of 1,793 instances.
The second dataset is publicly available and was used in the PAN 2014 author verification task (Stamatatos et al., 2014). It contains groups of literary texts that are written by the same author and a text whose author is unknown (henceforth, "PANLiterary").

Experiments
As already mentioned above, we carried out four experiments; the first three of them on the LiteraryDataset and the fourth on the PANLiterary dataset.

Gender Identification
The gender identification experiment is cast as a supervised binary classification problem. Table 1 shows in the column 'Accuracy Gen' the performance of the SVM with each feature group separately, as well as with the full set and with some feature combinations. The performance of the majority class classifier (MajClassBaseline) and of four different baselines, in which the 300 most frequent token n-grams (2-5 grams were considered) are used as classification features, is also shown for comparison.
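A stripped-down version of such an n-gram baseline, using token bigrams and a linear SVM. The sentences and labels are toy examples for illustration only, not the experimental code.

```python
# Token bigram baseline: the most frequent bigrams as classification features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

texts = ["she walked to the town", "she rode to the town",
         "he sailed over the sea", "he rowed over the sea"]
labels = [0, 0, 1, 1]

# Keep at most 300 bigram features, as in the baselines described above.
vec = CountVectorizer(ngram_range=(2, 2), max_features=300)
X = vec.fit_transform(texts)
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(vec.transform(["she went to the town"])))
```

Such baselines rely purely on surface token co-occurrence, which is what the syntactic and discourse features are contrasted against.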
The n-gram baselines outperform the SVM trained on any individual feature group, except the syntactic features, which means that syntactic features are crucial for the characterization of the writing style of both genders. Using only this group of features, the model obtains an accuracy of 88.94%, which is very close to its performance with the complete feature set. When discourse features are added, the accuracy further increases.

Author Identification
The second experiment classifies the texts from the LiteraryDataset by their authors. It is an 18-class classification problem, which is considerably more challenging. Table 1 (column 'Accuracy Auth') shows the performance of our model with 10-fold cross-validation when using the full set of features and different feature combinations.
The results of the 10-fold author identification experiment show that syntactic dependency features are also the most effective for the characterization of the writing style of the authors. The model with the full set of features obtains 88.34% accuracy, which outperforms the n-gram baselines. The high accuracy of syntactic dependency features compared to the other sets of features shows again that dependency syntax is a very powerful profiling tool that has not been used to its full potential in the field.
Analyzing the confusion matrix of the experiment, some interesting conclusions can be drawn; due to the lack of space, let us focus on only a few of them. For instance, the novels by Elizabeth Gaskell are confused with the novels by Mary Anne Evans, Jane Austen, and Margaret Oliphant. This is likely because not only do all of these authors share their gender, but Austen is also considered to be one of the main influences on Gaskell. Even though Agatha Christie is predicted correctly most of the time, when she is confused with another author, it is with Arthur Conan Doyle. This may not be surprising since Arthur Conan Doyle and, more specifically, the books about Sherlock Holmes greatly influenced her writing, resulting in many detective novels with Hercule Poirot as protagonist (Christie's personification of Sherlock Holmes). Other mispredictions (such as the confusion of Bram Stoker with Elizabeth Gaskell) require a deeper analysis and possibly also highlight the need for more training material.

Source Book Identification
To further prove the profiling potential of syntactic and discourse features, we carried out an additional experiment. The goal was to identify from which of the 54 books a given chapter comes, making use of syntactic and discourse features only. Using the same method and 10-fold cross-validation, an accuracy of 83.01% was achieved. The interesting part of this experiment is the error analysis. "Silas Marner", written by Mary Anne Evans (known as George Eliot), is one of the books that created the highest confusion; it is often confused with "The Mill on the Floss" by the same author. "Kidnapped" by Robert Louis Stevenson, which is very different from the other considered books by the same author, is confused with "Treasure Island", also by Stevenson, and "Great Expectations" by Charles Dickens. "Pride and Prejudice" by Jane Austen is confused with "Sense and Sensibility", also by her. The majority of confusions are between books by the same author, which supports our point further: syntactic and discourse structures constitute very powerful, underused profiling features (recall that for this experiment, we used only syntactic and discourse features; none of the features was content- or surface-oriented). When the full set of features was used, the accuracy improved to 91.41%. In that case, the main sources of confusion were between "Agnes Grey" and "The Tenant of Wildfell Hall", both by Anne Brontë, and between "Silas Marner" and "The Mill on the Floss", both by G. Eliot.

PAN Author Verification
The literary dataset in the PAN 2014 shared task on author verification contains pairs of text instances where one text is written by a specific author, and the goal is to determine whether the other instance is also written by the same author. Note that the task of author verification is different from the task of author identification. To apply our model in this context, we compute the feature values for each pair of known-anonymous instances and subtract the feature values of the known instance from the feature values of the anonymous one; the feature values are normalized. As a result, a feature difference vector for each pair is computed. The vector is labeled so as to indicate whether both instances were written by the same author or not.
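The pair-difference construction can be sketched as follows. The feature values and bounds are toy numbers, and the min-max normalization scheme is our assumption, as the text does not specify which normalization was used.

```python
# Per-pair feature-difference vector: anonymous minus known, after normalization.
import numpy as np

def pair_difference(known, anonymous, lo, hi):
    """Min-max normalize both feature vectors to [0, 1], then subtract."""
    known = (np.asarray(known) - lo) / (hi - lo)
    anonymous = (np.asarray(anonymous) - lo) / (hi - lo)
    return anonymous - known

# Two toy features with hypothetical value ranges [0, 10] and [0, 1].
lo = np.array([0.0, 0.0])
hi = np.array([10.0, 1.0])
diff = pair_difference(known=[5.0, 0.2], anonymous=[7.0, 0.6], lo=lo, hi=hi)
print(diff)  # small differences suggest the same author
```

Each difference vector then becomes one training instance, labeled as same-author or different-author.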
The task performance measure is computed by multiplying the area under the ROC curve (AUC) and the "c@1" score, a metric that takes unanswered instances into account. In our case, the classifier outputs a prediction for each test instance, so that the c@1 score is equivalent to accuracy. Table 2 shows the performance of our model compared to the winner and the second-ranked system of the English literary text section of the shared task (cf. (Modaresi and Gross, 2014) and (Zamani et al., 2014) for details). Our model outperforms the task baseline as well as the best performing approach of the shared task, the META-CLASSIFIER (MC), by a large margin. The task baseline is the best-performing language-independent approach of the PAN 2013 shared task. MC is an ensemble of all systems that participated in the task in that it uses for its decision the averaged probability scores of all of them.

Table 2: Performance of our model compared to other participants on the "PANLiterary" dataset

Table 3 lists the 20 features with the highest information gain in each experiment (features starting with a capital letter are discourse relations; 'sentence range' is defined as the difference between the minimum and maximum number of words per sentence). Among the most distinctive features are, for instance, syntactic objects (OBJ), commas, predicative complements of control verbs (OPRD), and adjective modifiers (AMOD). It is interesting to note that the Elaboration discourse relation is distinctive in the first two experiments, while the usage of the Contrast relation becomes relevant to gender and book identification. These features are not helpful in the PANLiterary experiment, where no distinctive discourse patterns were found in the small dataset. The discourse tree width and the subordinate clause width are distinctive in the author identification experiment, while they are not in the other experiments.
This is likely because they can serve as indicators of the structural complexity of a text and thus of the idiosyncrasy of an individual's writing style, much like punctuation marks such as periods and commas, which are typical stylistic features. Discourse markers, words with positive sentiment, first-person plural pronouns, Wh-adverbs, and modal verbs are distinctive features in the gender identification experiment. The fact that the usage of positive words is only relevant in the gender identification experiment could be caused by differences in the expressiveness/emotiveness of the writings of men and women. Punctuation marks become very distinctive in the book identification experiment, where the usage of colons, semicolons, parentheses, commas, periods, exclamation marks, and quotation marks is among the most relevant features. Syntactic shape features are distinctive in the author identification and PANLiterary experiments, while not as impactful in the rest of the experiments.

Conclusions and Future Work
We have shown that syntactic dependency and discourse features, which have been largely neglected in state-of-the-art proposals so far, play a significant role in the tasks of gender and author identification and author verification. With more than 88% accuracy in both gender and author identification within the literary genre, our models that use them beat competitive baselines. In the future, we plan to experiment with further features and other author profiling traits.