Continuous N-gram Representations for Authorship Attribution

This paper presents work on using continuous representations for authorship attribution. In contrast to previous work, which uses discrete feature representations, our model learns continuous representations for n-gram features via a neural network jointly with the classification layer. Experimental results demonstrate that the proposed model outperforms the state-of-the-art on two datasets, while producing comparable results on the remaining two.


Introduction
Authorship attribution is the task of identifying the author of a text. This field has attracted attention due to its relevance to a wide range of applications including forensic investigation (e.g. identifying the author of anonymous documents or phishing emails) (Chaski, 2005; Grant, 2007; Lambers and Veenman, 2009; Iqbal et al., 2010; Gollub et al., 2013) and plagiarism detection (Kimler, 2003; Gollub et al., 2013).
From a machine learning perspective, the task can be treated as a form of text classification. Let D = {d_1, d_2, ..., d_n} be a set of documents and A = {a_1, a_2, ..., a_m} a fixed set of candidate authors; the task of authorship attribution is to assign an author to each of the documents in D. The challenge in authorship attribution is that identifying the topic preference of each author is not sufficient; it is necessary to also capture their writing style (Stamatatos, 2013). This task is more difficult than determining the topic of a text, which is generally possible by identifying domain-indicative lexical items, since writing style cannot be fully captured by an author's choice of vocabulary.
Previous studies have found that word and character-level n-grams are the most effective features for identifying authors (Peng et al., 2003; Stamatatos, 2013; Schwartz et al., 2013). Word n-grams can represent the local structure of texts and document topic (Coyotl-Morales et al., 2006; Wang and Manning, 2012). On the other hand, character n-grams have been shown to be effective for capturing stylistic and morphological information (Koppel et al., 2011; Sapkota et al., 2015).
However, previous work relied on discrete feature representations, which suffer from data sparsity and do not consider the semantic relatedness between features. To address this problem, we propose the use of continuous n-gram representations learned jointly with the classifier as a feedforward neural network. Continuous n-gram representations combine the advantages of n-gram features and continuous representations. The proposed method outperforms the prior state-of-the-art approaches on two out of four datasets while producing comparable results for the remaining two.

Related Work
An extensive array of authorship attribution work has focused on utilizing content words and character n-grams. The topical preference of authors can be inferred from their choice of content words. For example, Seroussi et al. (2013) used the Author-Topic (AT) model (Rosen-Zvi et al., 2004), an extension of Latent Dirichlet Allocation (Blei et al., 2003), to obtain author representations. Experiments on several datasets yielded state-of-the-art performance.
Character n-grams have been widely used and have the advantage of being able to capture stylistic information. By using only the 2,500 most frequent 3-grams, Plakias and Stamatatos (2008) successfully achieved 80.8% accuracy on the CCAT10 dataset, while Sapkota et al. (2015) reported slightly lower performance using only affix and punctuation 3-grams. Escalante et al. (2011) represent documents using a set of local histograms. This approach achieved an accuracy of 86.4%.
Besides being effective indicators of an author's writing style, both content words and character n-grams are also straightforward to extract from documents and are therefore widely used for authorship attribution. More complex features which require deeper textual analysis are also useful for the problem but have been used less frequently, since the complexity of the analysis required can hinder performance (Stamatatos, 2009). There have been several attempts to utilize semantic features for authorship attribution tasks, e.g. (McCarthy et al., 2006; Argamon et al., 2007; Brennan and Greenstadt, 2009; Bogdanova and Lazaridou, 2014). These approaches commonly use WordNet as a source of semantic information about words and phrases. For example, McCarthy et al. (2006) used WordNet to detect causal verbs, while Brennan and Greenstadt (2009) used it to extract word synonyms. Our proposed model does not rely on any external linguistic resources, such as WordNet, making it more portable to new languages and domains.

Continuous n-gram Representations
This work focuses on learning continuous n-gram representations for authorship attribution tasks. Continuous representations have been shown to be helpful in a wide range of tasks in natural language processing (Bengio et al., 2003; Mikolov et al., 2013). Unlike previous authorship attribution work, which uses discrete representations, we represent each n-gram as a continuous vector and learn these representations in the context of the authorship attribution task being considered.
To learn the n-gram feature representations jointly with the classifier, we adopt the shallow neural network architecture of fastText, which was recently proposed by Joulin et al. (2016). This model is similar to a standard linear classifier, but instead of representing a document with a discrete feature vector, the model represents it with a continuous vector obtained by averaging the continuous vectors for the features present. More formally, fastText predicts the probability distribution over the labels for a document as follows:

p(y | d) = softmax(B A x)

where x is the frequency vector of features for the document, the weight matrix A is a dictionary containing the embeddings learned for each feature, and B is a weight matrix that is learned to predict the label correctly using the learned representations (essentially averaged feature embeddings).
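The forward pass of this architecture can be sketched in a few lines of numpy (a minimal illustration; the variable names and dimensions are ours, not from the fastText implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fasttext_predict(x, A, B):
    """Predict label probabilities for one document.

    x: (V,)   n-gram frequency vector for the document
    A: (V, k) embedding matrix, one k-dimensional vector per n-gram
    B: (m, k) output weights, one row per candidate author
    """
    # Average the embeddings of the n-grams present in the document.
    doc_vec = (A.T @ x) / max(x.sum(), 1.0)   # (k,)
    return softmax(B @ doc_vec)               # (m,) distribution over authors

# Toy example: vocabulary of 5 n-grams, 3 candidate authors, k = 4.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
B = rng.normal(size=(3, 4))
x = np.array([2.0, 0.0, 1.0, 0.0, 1.0])  # n-gram counts for the document
p = fasttext_predict(x, A, B)
```

During training, A and B are both updated by backpropagation, so the n-gram embeddings are learned jointly with the classification layer.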
Since the documents in this model are represented as bags of discrete features, sequence information is lost. To recover some of this information we consider feature n-grams, similar to the way convolutional neural networks incorporate word order (Kim, 2014), but with a simpler architecture.
The proposed model ignores long-range dependencies that could conceivably be captured using alternative architectures, such as recurrent neural networks (RNNs) (Mikolov et al., 2010; Luong et al., 2013). However, topical and stylistic information is contained in shorter word and character sequences, for which shallow neural network architectures with n-gram feature representations are likely to be sufficient, while having the advantage of being much faster to run. This is particularly important for authorship attribution tasks, which normally involve documents that are much longer than the single sentences which RNNs typically model.

Datasets
We use four datasets in our experiments: Judgment, CCAT10, CCAT50 and IMDb62. These datasets differ in the number of authors and in document size, which allows us to test our approach in different scenarios. All datasets were made available by the authors of their respective papers. For the Judgment dataset, we run experiments with 10-fold cross-validation.
CCAT10 (Stamatatos, 2008). This dataset is a subset of Reuters Corpus Volume 1 (RCV1) (Rose et al., 2002) and consists of newswire stories by 10 authors labelled with the code CCAT (which indicates corporate/industrial news). The corpus was divided into 50 training and 50 test texts per author. In the experiments we follow prior work (Stamatatos, 2013) and measure accuracy using the train/test partition provided.
CCAT50. This corpus is a larger version of CCAT10, with 5,000 documents from 50 authors in total. As in CCAT10, each author has 50 training and 50 test documents.

Model Variations
We perform experiments with three variations of our approach:
• Continuous word n-grams. In this model we use word unigrams and bigrams, with the vocabulary restricted to the 700 most common words.
• Continuous character n-grams. Following previous work (Sanderson and Guenter, 2006), we use character n-grams up to n = 4, as this was found to be the best value of n for short English texts. We follow Zhang et al. (2015) by setting the vocabulary to the 70 most common characters, including letters, digits, and some punctuation marks.
• Continuous word and character n-grams. This model combines the word and character n-gram features of the two models above.
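The feature extraction underlying these variants can be sketched as follows (a simplified illustration: the function names are ours, and the handling of out-of-vocabulary items in the paper is an assumption):

```python
from collections import Counter

def word_ngrams(text, n_values=(1, 2)):
    """Extract word unigrams and bigrams from a document."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def char_ngrams(text, max_n=4):
    """Extract character n-grams up to length max_n."""
    grams = []
    for n in range(1, max_n + 1):
        grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams

def build_vocab(docs, extractor, size):
    """Keep only the `size` most frequent features over the corpus,
    mirroring the paper's cutoffs (700 words / 70 characters)."""
    counts = Counter()
    for doc in docs:
        counts.update(extractor(doc))
    return {g for g, _ in counts.most_common(size)}

grams = word_ngrams("the quick brown fox")
```

Each document is then mapped to a frequency vector over the retained features before being fed to the classifier.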

Hyperparameter Tuning and Training Details
For all datasets, models were trained with the Adam update rule (Kingma and Ba, 2015) and early stopping on the development sets. Since none of the datasets has a standard development set, we randomly selected 10% of the training data for this purpose. Both word and character embeddings were initialized using Glorot uniform initialization (Glorot and Bengio, 2010). Keras's (Chollet, 2015) implementation of fastText was used for the experiments. The softmax function was used in the output layer without the hashing trick, which was sufficient for our experiments given the relatively small datasets. Code to reproduce the experiments is available from https://github.com/yunitata/continuous-n-gram-AA.
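For reference, a single Adam update (Kingma and Ba, 2015) can be sketched as below; the default hyperparameters shown are the standard ones, with the learning rate matching the 0.001 used for the smaller datasets (the function and variable names are ours):

```python
import numpy as np

def adam_step(param, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter array.

    state holds the first/second moment estimates and the step counter.
    """
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, (m, v, t)

# Toy usage: one update step on a one-element parameter array.
p = np.array([1.0])
state = (np.zeros_like(p), np.zeros_like(p), 0)
g = np.array([0.5])
p, state = adam_step(p, g, state)
```

In practice the framework applies this update to the embedding matrix A and the output weights B after each mini-batch.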
For the Judgment, CCAT10 and CCAT50 datasets, an embedding size of 100, a dropout rate of 0.75, a learning rate of 0.001 and a mini-batch size of 5 were used, and the model was trained for 150 epochs. The values for the dropout rate and mini-batch size were chosen using a grid search on the CCAT10 development set; the other hyperparameter values (i.e. learning rate and embedding size) were fixed. For IMDb62 we used the same dropout rate, but in order to speed up training on this dataset, the learning rate, embedding size, mini-batch size and number of epochs were set to 0.01, 50, 32 and 20 respectively.

Results
Table 2 demonstrates that the character models are superior to the word models. In particular, we found that models which employ character-level n-grams appear to be more suitable for datasets with a large number of authors, i.e. CCAT50 and IMDb62. To explore this further, we ran an additional experiment varying the number of authors on a subset of IMDb62. For each author we use 200 documents, with 10% of the data used as the development set and another 10% as the test set. Figure 1 shows a steep decrease in the accuracy of the word models as the number of authors increases; the drop in accuracy of the character n-gram model is less pronounced. Character models also achieve a slightly better result on the Judgment dataset, which consists of only three authors. This can be explained by the fact that the documents in this corpus are significantly longer (almost ten and four times longer than those in IMDb62 and CCAT50 respectively; see Table 1). The large number of word n-grams makes it more difficult to learn good parameters for them. Combining word and character n-grams only produced a very small improvement on this dataset.

Domain Influence
The majority of previous work on authorship attribution has concluded that content words are more effective for datasets where the authors can be discriminated by document topic (Peng et al., 2004; Luyckx, 2010). Seroussi et al. (2013) show that the Judgment and IMDb62 datasets fall into this category and that approaches based on topic models achieve high accuracy (more than 90%). However, our results demonstrate that stylistic information from continuous character n-grams can outperform word-based approaches on both datasets. In addition, these results also support the superiority of character n-grams reported in previous work (Peng et al., 2003; Stamatatos, 2013; Schwartz et al., 2013).

Feature Contributions
An ablation study was performed to further explore the influence of different types of features by removing a single class of n-grams at a time. For this experiment the character model was used on the two CCAT datasets. Three feature types are defined:
1. Punctuation N-gram: A character n-gram which contains punctuation symbols. There are 34 punctuation symbols in total.
2. Space N-gram: A character n-gram that contains at least one whitespace character.
3. Digit N-gram: A character n-gram that contains at least one digit.
In addition, we also assess the influence of the length of the character n-grams. Results are presented in Table 3.

Table 3: Results of feature ablation experiment.

Table 3 demonstrates that removing punctuation and space n-grams leads to performance drops on both datasets. On the other hand, leaving out digit n-grams and bi-grams improves accuracy on the CCAT10 dataset. The other n-gram types do not seem to affect the results much.
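The ablated feature sets can be constructed with a simple filter (a sketch; the exact punctuation inventory used in the paper is an assumption, since `string.punctuation` contains 32 symbols rather than the 34 the paper counts):

```python
import string

PUNCT = set(string.punctuation)
DIGITS = set(string.digits)

def ablate(ngrams, feature_type):
    """Drop one class of character n-grams, as in the ablation study."""
    if feature_type == "punctuation":
        return [g for g in ngrams if not any(c in PUNCT for c in g)]
    if feature_type == "space":
        return [g for g in ngrams if not any(c.isspace() for c in g)]
    if feature_type == "digit":
        return [g for g in ngrams if not any(c in DIGITS for c in g)]
    raise ValueError(f"unknown feature type: {feature_type}")

# Toy feature list mixing plain, space, punctuation, and digit bigrams.
feats = ["th", "e ", "t.", "19", "he"]
kept = ablate(feats, "space")
```

The model is then retrained on each ablated feature set and the resulting accuracies are compared against the full character model.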

Conclusion
This paper proposed continuous n-gram representations for authorship attribution tasks. Using four authorship attribution datasets, we showed that this model is effective for identifying the writing style of authors. Our experimental results provide evidence that continuous representations are suitable for a stylistic (as opposed to topical) text classification task such as authorship attribution.