Variational Autoencoder with Embedded Student-t Mixture Model for Authorship Attribution

Traditional computational authorship attribution describes a classification task in a closed-set scenario. Given a finite set of candidate authors and corresponding labeled texts, the objective is to determine which of the authors has written another set of anonymous or disputed texts. In this work, we propose a probabilistic autoencoding framework to deal with this supervised classification task. Variational autoencoders (VAEs) have had tremendous success in learning latent representations. However, existing VAEs are currently still bound by limitations imposed by the assumed Gaussianity of the underlying probability distributions in the latent space. In this work, we extend a VAE with an embedded Gaussian mixture model to a Student-t mixture model, which allows for independent control of the "heaviness" of the respective tails of the implied probability densities. Experiments on an Amazon review dataset indicate superior performance of the proposed method.

Supervised authorship attribution traditionally refers to the task of analyzing the linguistic patterns of a text in order to determine who, from a finite set of enrolled authors, has written a document of unknown authorship. Nowadays, the focus of this closed-set scenario has shifted from literary to social media authorship attribution, where methods have been developed to deal with large-scale datasets of small-sized online texts. Examples are provided by the work of (Rocha et al., 2017), (Boenninghoff et al., 2019a), (Theophilo et al., 2019), and (Tschuggnall et al., 2019).
The ADHOMINEM system proposed by (Boenninghoff et al., 2019b) is a linguistically motivated deep learning topology that can be seen as a feature extractor for such tasks. The original ADHOMINEM system is trained on a large dataset of Amazon reviews, each written by one of 784,649 distinct authors. The key aspect of ADHOMINEM is that it is not designed to recognize authors but, instead, to recognize a difference in authorship between two given texts. As such, the system produces internal neural features, i.e. observation space features (see Fig. 1), that do not directly represent authorship but rather the stylistic characteristics that distinguish authorship. Since ADHOMINEM is trained with a large number of authors and since the space of writing style variations across authors is generally almost as big as the space of writing style variations itself, it can be argued that features produced by ADHOMINEM can serve as a proxy representation for writing style variations in general. The features produced by ADHOMINEM, therefore, have favorable properties for a variety of writing-style-based classification tasks, including the supervised authorship attribution pursued in this paper.
We are proposing the use of a Variational Autoencoder (VAE) framework to identify the authorship of a text based on its ADHOMINEM features. The VAE maps ADHOMINEM features into a latent space in which mapped features cluster when they stem from texts written by the same author. In a VAE framework, a probabilistic model is fitted in the latent space, which, in turn, can be exploited for the targeted classification task. In our case, we show that the application of a Student-t Mixture Model (SMM) in the latent space leads to a performance that is superior to the application of the commonly employed Gaussian Mixture Models (GMM).
The conventional VAE framework, as published by (Kingma and Welling, 2013) for example, combines unsupervised deep learning with variational Bayesian methods. A VAE relies on a probabilistic graphical model in the form of a directed acyclic graph, in which the hidden representations of an encoder network as well as the reconstructed outputs of a subsequent decoder network are treated as random variables. More precisely, the encoder defines a variational inference network, using high-dimensional observations to estimate an approximate posterior distribution in the latent space, and the decoder itself is a generative network, mapping latent representations back to distributions over the observation space. The framework is used to generate compressed, approximate representations for virtually any type of patterned input. Depending on the targeted application, we may remove either the encoder or the decoder from the framework, once the joint training of the combined encoder-decoder system has been completed. A VAE can be understood as a single-class probabilistic autoencoder, since it is assumed that all latent representations are sampled from the same Gaussian distribution. Different extensions of the conventional VAE, e.g. (Sohn et al., 2015), (Dilokthanakul et al., 2016), (Nalisnick et al., 2016), (Sønderby et al., 2016), (Johnson et al., 2016), (Nalisnick and Smyth, 2017), (Ebbers et al., 2017), (Lin et al., 2018), (Takahashi et al., 2018), (Davidson et al., 2018), (Domke and Sheldon, 2018), and (Abiri and Ohlsson, 2019), have been proposed. Particularly relevant to our work is the paper by (Jiang et al., 2017), in which the authors broadened the conventional VAE concept by generalizing the assumption of strictly Gaussian distributions to mixtures of Gaussians. This structure represents our baseline in the following.
The advantage of using the Student-t model is that we obtain a means to independently control the heaviness of the respective tails of each distribution. Our generalization of the framework can be successfully employed in a variety of common machine learning tasks:

• Unsupervised learning: The basic architecture of our proposed method provides a generic recipe to autonomously group high-dimensional data into meaningful clusters.

• Supervised learning: The derived loss function of our training method carries a cross-entropy term, which can be used to directly fuse class label information into the learning task. We are thereby able to enforce learning in a predefined/supervised direction as well.

• Semi-supervised learning: In some cases, we may have a large amount of training data, only a small subset of which is labeled. In this situation, we can utilize our method to first pre-train the model in a supervised manner and then refine the model with the unlabeled data in an unsupervised fashion.

Preliminaries
Let O = {o_n}_{n=1}^N = {o_1, ..., o_N} denote a training set of observation vectors o_n ∈ R^L for n ∈ {1, ..., N}. We assume that the o_n are independent and identically distributed samples from either a continuous or a discrete random variable. Furthermore, we use X = {x_1, ..., x_N} to denote a collection of N low-dimensional latent representation vectors x_n ∈ R^D, where each x_n is associated with a corresponding observation o_n. Following (Murphy, 2012), we define the Student-t distribution for the n-th latent representation x_n, assuming that this D-dimensional vector belongs to the k-th cluster with k ∈ {1, ..., K}, as

S(x_n | μ_k, Σ_k, ν_k) = Γ((ν_k + D)/2) / (Γ(ν_k/2) (ν_k π)^{D/2} det(Σ_k)^{1/2}) · (1 + δ_nk/ν_k)^{−(ν_k + D)/2},   (1)

with the Mahalanobis distance δ_nk = (x_n − μ_k)^T Σ_k^{−1} (x_n − μ_k), where μ_k defines the D-dimensional mean vector of the k-th class, Σ_k denotes the D × D scale matrix and ν_k ∈ (0, ∞) is the number of degrees of freedom. For ν_k → ∞, the Student-t distribution tends towards a Gaussian distribution with the same mean vector and covariance matrix. Alternatively, we can understand the Student-t distribution as a marginalization with respect to a hidden variable, i.e.
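As a sanity check on this definition, the log-density can be evaluated numerically. The following sketch (NumPy/SciPy, with hypothetical test values) implements the Student-t log-density and illustrates the Gaussian limit for large ν_k:

```python
import numpy as np
from scipy.special import gammaln

def multivariate_t_logpdf(x, mu, Sigma, nu):
    """Log-density of the D-dimensional Student-t distribution S(x | mu, Sigma, nu)."""
    D = mu.shape[0]
    diff = x - mu
    # Mahalanobis distance delta = (x - mu)^T Sigma^{-1} (x - mu)
    delta = diff @ np.linalg.solve(Sigma, diff)
    return (gammaln((nu + D) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * D * np.log(nu * np.pi)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * (nu + D) * np.log1p(delta / nu))
```

For very large ν the value approaches the corresponding multivariate Gaussian log-density, matching the limiting behavior stated above.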
S(x_n | μ_k, Σ_k, ν_k) = ∫_0^∞ N(x_n | μ_k, Σ_k/u_nk) G(u_nk | ν_k/2, ν_k/2) du_nk,

where u_nk > 0 is the hidden scale variable. The normal distribution is defined as N(x | μ, Σ) = (2π)^{−D/2} det(Σ)^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ)). The term G(·) is the Gamma distribution, given in the form G(u | α, β) = (β^α / Γ(α)) u^{α−1} exp(−βu) for u > 0 and α, β > 0. A finite SMM is defined as a weighted sum of multivariate Student-t distributions: with Σ_{k=1}^K π_k = 1 we may write p(x_n) = Σ_{k=1}^K Pr(z_nk = 1) p(x_n | z_nk = 1) = Σ_{k=1}^K π_k S(x_n | μ_k, Σ_k, ν_k). As mentioned by (Svensén and Bishop, 2005) and (Archambeau and Verleysen, 2007), we can view the Student-t distribution in Eq. (1) as the marginalization of a Gaussian-Gamma distribution, obtained by integrating out the hidden scale variable u_nk. This infinite mixture of normal distributions with the same mean vector but varying covariance matrices can be incorporated into a generative process. Omitting the dependency on the hyper-parameters μ_k, Σ_k and ν_k, Eq. (1) can be rewritten as p(x_n | z_nk = 1) = ∫_0^∞ p(x_n | u_nk, z_nk = 1) p(u_nk | z_nk = 1) du_nk, where z_nk ∈ {0, 1} is an indicator variable showing whether the n-th observation belongs to the k-th class. Consequently, our generative model in the latent space is augmented by the scale parameter u_nk as an additional latent variable.
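The scale-mixture identity can be checked empirically: sampling u ~ G(ν/2, ν/2) and then x ~ N(μ, σ²/u) should reproduce the moments of a direct Student-t draw. A one-dimensional sketch (parameter values and sample sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
nu, mu, sigma2 = 5.0, 0.0, 1.0
n = 200_000

# Hierarchical sampling: u ~ Gamma(nu/2, rate=nu/2), then x ~ N(mu, sigma2 / u).
# NumPy's gamma takes a scale parameter, so rate nu/2 becomes scale 2/nu.
u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
x_mix = mu + rng.standard_normal(n) * np.sqrt(sigma2 / u)

# Direct Student-t sampling for comparison
x_t = mu + np.sqrt(sigma2) * rng.standard_t(df=nu, size=n)
```

Both sample sets should exhibit the Student-t variance ν/(ν−2) = 5/3 for ν = 5.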

The Generative Model
Let ξ = {π_k, μ_k, Σ_k, ν_k}_{k=1}^K denote the set of all hyper-parameters of an SMM. We can generate observation samples o_n for the proposed Student-t VAE with the following five steps: 1. Choose a cluster for the n-th observation by sampling the one-hot vector z_n ∼ p(z_n), where Pr(z_nk = 1) = π_k and z_n = [z_n1, ..., z_nK]^T. 2. Sample the n-th scale vector u_n ∼ p_ξ(u_n | z_n), where p(u_nk | z_nk = 1) = G(u_nk | ν_k/2, ν_k/2) and u_n = [u_n1, ..., u_nK]^T. 3. Sample a new latent representation for the n-th observation, x_n ∼ N(x_n | μ_k, Σ_k/u_nk) given z_nk = 1. 4. Decode a parameter set for the n-th observation o_n, {μ_n} = Decoder_θ(x_n). The set θ summarizes all weights and bias terms of the decoder network.

5. Sample an observation o_n ∼ p_θ(o_n | x_n) from the decoded parameter set.
The generative process for the proposed Student-t VAE as a graphical model is illustrated in Fig. 2.
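The five generative steps can be sketched end-to-end. The "decoder" below is a stand-in linear map and all hyper-parameter values are hypothetical; the paper's actual decoder is a neural network:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, L, N = 3, 2, 4, 500

# Hypothetical SMM hyper-parameters xi = {pi_k, mu_k, Sigma_k, nu_k}
pi = np.array([0.5, 0.3, 0.2])
mus = rng.normal(size=(K, D)) * 3.0
Sigmas = np.stack([np.eye(D) for _ in range(K)])
nus = np.array([4.0, 8.0, 30.0])

# Stand-in linear "decoder" in place of the neural network Decoder_theta
W_dec = rng.normal(size=(D, L))

obs = np.empty((N, L))
for n in range(N):
    k = rng.choice(K, p=pi)                               # step 1: sample z_n
    u = rng.gamma(nus[k] / 2.0, 2.0 / nus[k])             # step 2: u_nk ~ G(nu/2, nu/2)
    x = rng.multivariate_normal(mus[k], Sigmas[k] / u)    # step 3: x_n ~ N(mu_k, Sigma_k/u)
    mu_o = x @ W_dec                                      # step 4: decode parameters
    obs[n] = mu_o + 0.1 * rng.standard_normal(L)          # step 5: sample o_n
```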

Approximate Inference
At this point it is notationally beneficial to define the set H = {Z, U, X} = {{z_nk, u_nk}_{k=1}^K, x_n}_{n=1}^N of all latent variables of our proposed framework. We apply the mean-field approximation to find an analytical expression for the approximate joint posterior distribution q_φ(H). The symbol φ is used to represent the set of all weights and bias terms of the underlying encoder network. Suppose the joint posterior distribution of H can be factored such that q_φ(H|O) = Π_i q_φ(H_i|O); then each factor of the posterior distribution can be obtained according to (Bishop, 2006). In our context, the product Π_i q(H_i|O) represents a suitable factorization of the joint posterior distribution of all latent variables. One possible approximate factorization is q_φ(H|O) = Π_{n=1}^N q_φ(x_n | o_n) q(u_n, z_n). The employed generative model implies that there is a statistical dependency between x_n and (z_n, u_n). It can be argued, however, that we may ignore this dependency in our case because the posterior distribution in the latent space is encoded by the second neural network, i.e. q_φ(x_n | o_n) = N(x_n | μ_n^(x|o), Σ_n^(x|o)) with diagonal covariance matrix Σ_n^(x|o). Note that the posterior distribution of z_n and u_n in Eq. (9) does not directly depend on φ, which is important for the calculation of the loss function discussed in Section 2.4. It is not necessary to approximate the joint posterior distribution of z_n and u_n, as it is possible to analytically determine the marginal distributions q(z_nk) and q(u_nk|z_nk) given the joint distribution q(z_nk, u_nk). We can apply the mean-field approximation to compute the joint distribution q(z_nk = 1, u_nk). The marginal distribution can be derived via q(z_nk = 1) = ∫_0^∞ q(z_nk = 1, u_nk) du_nk. We can now determine the posterior distribution q(u_nk | z_nk = 1) = G(u_nk | α_k, β_nk), where α_k = (ν_k + D)/2 and β_nk = (ν_k + E_q[δ_nk])/2, with the expected Mahalanobis distance E_q[δ_nk] taken w.r.t. q_φ(x_n | o_n), define the hyper-parameters for q(u_nk | z_nk = 1).
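The Gamma posterior hyper-parameters can be sketched for a point estimate of x_n, i.e. ignoring the encoder-covariance contribution to β_nk (that simplification is an assumption of this sketch):

```python
import numpy as np

def gamma_posterior_params(x, mu_k, Sigma_k, nu_k):
    """Hyper-parameters of q(u_nk | z_nk = 1) = G(u_nk | alpha_k, beta_nk).

    Sketch: beta_nk is computed from the Mahalanobis distance of a point
    estimate of x_n only, without the expectation over q_phi(x_n | o_n)."""
    D = mu_k.shape[0]
    diff = x - mu_k
    delta = diff @ np.linalg.solve(Sigma_k, diff)  # Mahalanobis distance
    alpha = (nu_k + D) / 2.0
    beta = (nu_k + delta) / 2.0
    return alpha, beta
```

Note that for a point close to the cluster mean, the posterior mean E[u_nk] = α_k/β_nk exceeds one, i.e. the effective covariance Σ_k/u_nk shrinks.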

The Variational Lower Bound and the Loss Function
Following (Kingma and Welling, 2013), the negative of the derived evidence lower bound provides a loss function, i.e.
The lower bound for the n-th observation can, thus, be partitioned into the following six terms:

L(θ, φ, ξ; o_n) = E_q(z_n)[ln p(z_n)]   (12)
+ E_q(u_n, z_n)[ln p_ξ(u_n | z_n)]   (13)
+ E_q[ln p(x_n | u_n, z_n)]   (14)
+ E_q(x_n|o_n)[ln p_θ(o_n | x_n)]   (15)
− E_q(x_n|o_n)[ln q_φ(x_n | o_n)]   (16)
− E_q(u_n, z_n)[ln q(u_n, z_n)].   (17)
For the sake of clarity, we will discuss each term of the above expression separately. First, we may note that Term (17) remains constant during the gradient-based update phase, i.e. we have E_q(u_n, z_n)[ln q(u_n, z_n)] = constant, since there is no dependency on the update parameters in θ, φ and ξ. By assuming ergodicity, we can make the following approximation for Term (15): E_q(x_n|o_n)[ln p_θ(o_n | x_n)] ≈ (1/T) Σ_{t=1}^T ln p_θ(o_n | x_{n,t}). For the re-parameterization trick, x_{n,t} is obtained as follows: ε_t ∼ N(0_{D×1}, I_{D×D}) and x_{n,t} = μ_n^(x|o) + σ_n^(x|o) ⊙ ε_t, which is fed into the decoder: {μ_{n,t}} = Decoder_θ(x_{n,t}). Considering Term (16), we may exploit the entropy of multivariate Gaussian distributions, i.e. −E_q(x_n|o_n)[ln q_φ(x_n | o_n)] = H[q_φ(x_n | o_n)].
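The re-parameterization trick for Term (15) amounts to a deterministic transformation of standard-normal noise, so gradients can flow through the sampling step. A minimal sketch with hypothetical encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
D, T = 16, 20_000

# Hypothetical encoder outputs mu_n^(x|o) and sigma_n^(x|o) for one observation
mu_x = rng.normal(size=D)
sigma_x = np.exp(rng.normal(size=D) * 0.1)   # strictly positive std-devs

# Re-parameterization: x_{n,t} = mu + sigma * eps, with eps_t ~ N(0, I)
eps = rng.standard_normal((T, D))
x_samples = mu_x + sigma_x * eps
```

Because the noise ε_t is independent of the encoder parameters, the sample statistics of x_{n,t} match the encoder's predicted mean and standard deviation.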

Interpretation of the Lower Bound
As a result, the Evidence Lower Bound (ELBO) for the n-th observation, defined through Terms (12) to (17), can be rewritten. We obtain the following, more compact expression:

L(θ, φ, ξ; o_n) ≈ (1/T) Σ_{t=1}^T ln p_θ(o_n | x_{n,t}) + H[q_φ(x_n | o_n)] + Σ_{k=1}^K γ_nk ln ρ_nk + constant,   (22)

where γ_nk denotes the responsibility of the k-th cluster for x_n and ln ρ_nk = ln q(z_nk = 1) − H[G(u_nk | α_k, β_nk)] − (D/2) ln 2π. Here, H[G(u_nk | α_k, β_nk)] represents the entropy of the Gamma distribution with parameters α_k and β_nk. Similarly to the ELBO of the conventional VAE, we may interpret the function of each term of the derived lower bound in Eq. (22). The first term represents the reconstruction error, measuring how well the encoder-decoder framework fits the dataset. The second term can be seen as a regularizer quantifying the output of the decoder. Following the maximum entropy principle, it will maximize the uncertainty with regard to possibly missing information. The third term, the cross-entropy, evaluates the clustering or classification. In the case of supervised learning, γ_nk is replaced by the true class labels. All terms in Eq. (22) can easily be computed batch-wise. The training procedure of the proposed Student-t VAE is summarized in Algorithm 1. Its core steps per mini-batch are: (i) encode each observation o_n into μ_n^(x|o) and σ_n^(x|o); (ii) for t = 1, ..., T, draw ε_t ∼ N(0, I) and set x_{n,t} = μ_n^(x|o) + σ_n^(x|o) ⊙ ε_t; (iii) for k = 1, ..., K, compute the hyper-parameters α_k = (ν_k + D)/2 and β_nk, and evaluate ln q(z_nk) = ln π_k + (ν_k/2) ln(ν_k/2) − ln Γ(ν_k/2) − (1/2) ln det(Σ_k) + ln Γ(α_k) − α_k ln β_nk as well as ln ρ_nk = ln q(z_nk) − H[G(u_nk | α_k, β_nk)] − (D/2) ln 2π; (iv) determine and update the gradients of the parameters in θ, φ and ξ.
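The entropy H[G(u_nk | α_k, β_nk)] appearing in ln ρ_nk has the standard closed form H = α − ln β + ln Γ(α) + (1 − α)ψ(α) for a rate-parameterized Gamma distribution. A small sketch:

```python
import numpy as np
from scipy.special import gammaln, digamma

def gamma_entropy(alpha, beta):
    """Differential entropy of G(u | alpha, beta), with beta a rate parameter."""
    return alpha - np.log(beta) + gammaln(alpha) + (1.0 - alpha) * digamma(alpha)
```

For α = β = 1 this reduces to the entropy of the unit-rate exponential distribution, which is 1.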

Evaluation
The following two sections provide results of two experiments. In the first, we applied the proposed method to purely synthetic data in order to produce a graphical representation that illustrates how the method works in an unsupervised learning scenario. Results are discussed in Section 3.1. In the second experiment, we applied the proposed method to a supervised authorship attribution task and compared the results to a selection of reference methods. Details are discussed in Section 3.2.
We refer to the proposed SMM-based Variational Autoencoder method, i.e. the Student-t VAE method discussed in Section 2, simply as the tVAE method. As reference methods, we have implemented a GMM-based Variational Autoencoder, referred to as gVAE, and two types of Support Vector Machines (SVMs), one linear and one non-linear. The gVAE system was inspired by (Ebbers et al., 2017) and is, in structure, very similar to the method presented by (Jiang et al., 2017). For the sake of a fair comparison, we ensured that the network architectures of the tVAE and the gVAE implementations were exactly the same. Both algorithms are implemented in Python, where the training of the neural networks is accomplished with TensorFlow. The code is available to interested readers upon request.
Our implementations contain the following modifications relative to (Ebbers et al., 2017): It is ensured that the mixing weights sum to one and that each covariance matrix is invertible. Hence, instead of directly updating π_k, Σ_k and ν_k, we introduced auxiliary variables such that [π_1, ..., π_K]^T = Softmax([m_1, ..., m_K]^T), ν_k = ln(exp(n_k) + exp(2.0 + ε)) and Σ_k = C_k C_k^T + σ_k^2 I_{D×D}, where σ_k^2 is a fixed hyper-parameter. More precisely, we enforced ν_k > 2 by applying this modified softplus function, and we enforced the positive definiteness of Σ_k through a Cholesky decomposition by constructing trainable lower-triangular matrices C_k with exponentiated (positive) diagonal elements. We computed and updated the gradients of n_k, m_k and C_k with respect to the loss function.
Figure 3: Results of the Student-t VAE after training with the synthetic pinwheel data set.
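The described re-parameterizations can be sketched as follows (the exact softplus offset and diagonal handling are assumptions; the paper only states that ν_k > 2 holds and that Σ_k is built from a lower-triangular C_k with exponentiated diagonal):

```python
import numpy as np

def constrained_mixture_params(m, n, C_raw, sigma2=0.01):
    """Map unconstrained auxiliary variables to valid SMM hyper-parameters (sketch).

    m:     (K,) logits        -> mixing weights pi_k summing to one (softmax)
    n:     (K,) raw dofs      -> nu_k > 2 via a shifted softplus (offset assumed)
    C_raw: (K, D, D) matrices -> positive-definite scale matrices Sigma_k
    """
    K, D, _ = C_raw.shape
    pi = np.exp(m - m.max())
    pi /= pi.sum()                                # softmax: sums to one
    nu = 2.0 + np.log1p(np.exp(n))                # shifted softplus: nu_k > 2
    # Lower-triangular Cholesky factors with exponentiated (positive) diagonal
    tril = np.tril(C_raw, k=-1)
    diag = np.stack([np.diag(np.exp(np.diagonal(C_raw[k]))) for k in range(K)])
    C = tril + diag
    Sigma = C @ np.transpose(C, (0, 2, 1)) + sigma2 * np.eye(D)
    return pi, nu, Sigma
```

By construction, every Σ_k returned here is symmetric positive definite, every ν_k exceeds 2, and the mixing weights form a valid categorical distribution.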

Synthetic Data Experiment for Clustering
To illustrate the inner workings of the tVAE method, we first performed clustering on low-dimensional synthetic spirals of noisy data. The dataset as well as the clustering results are shown in Fig. 3. It is the same dataset used in (Dilokthanakul et al., 2016) and (Johnson et al., 2016). Encoder and decoder are fully connected feed-forward networks (with ReLU activation functions) of the form L-H-H-D and D-H-H-L, respectively, where L = 2 defines the dimension of the observations, H = 512 represents the number of hidden nodes and D = 2 is the latent space dimension. We used the Adam optimizer (Kingma and Ba, 2014) to calculate parameter updates. A key problem for the training of VAEs with embedded mixture models is an over-regularization behavior that occurs at the beginning of the training phase. Following (Yeung et al., 2017), it is caused by the regularization term of the ELBO. Both the prior distribution and the posterior can be decomposed into univariate distributions, and therefore we can also decompose the Kullback-Leibler (KL) term into a sum of per-dimension contributions, where x_n,d is the d-th component of x_n. As mentioned in (Yeung et al., 2017), the model has to minimize the KL term en bloc and not component-wise. One obvious option for the model is to enforce a large number of components x_n,d helping to minimize the KL term, which means these components are (close to) zero. Similarly, our model has to maximize the cross-entropy in Eq. (22), which includes maximizing β_nk ≈ Tr(Σ_n^(x|o)) in Algorithm 1. It can happen that the covariances Σ_k get smaller and smaller, except for a single global class. This leads to the "anti-clustering behavior" mentioned in (Jiang et al., 2017).
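For a diagonal Gaussian posterior, the per-dimension KL contributions mentioned above have a familiar closed form; the sketch below uses a standard-normal prior per dimension for illustration (an assumption, since the actual prior here is a Student-t mixture):

```python
import numpy as np

def kl_diag_gauss_vs_std(mu, sigma2):
    """Component-wise KL( N(mu_d, sigma2_d) || N(0, 1) ) for a diagonal Gaussian.

    Each output entry is one per-dimension contribution; the model is
    penalized on their sum, not on each entry individually."""
    return 0.5 * (sigma2 + mu ** 2 - 1.0 - np.log(sigma2))

# Hypothetical posterior parameters for a 3-dimensional latent variable
mu = np.array([0.0, 1.0, -2.0])
sigma2 = np.array([1.0, 0.5, 2.0])
kl_d = kl_diag_gauss_vs_std(mu, sigma2)   # one KL term per latent dimension
```

A dimension with μ_d = 0 and σ²_d = 1 contributes exactly zero, which is why the model can "switch off" latent components to minimize the KL term en bloc.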
To handle the over-regularization problem, we first trained a GMM to initialize the parameters π_k, μ_k and Σ_k of each Student-t mixture component. A similar strategy was suggested in (Dilokthanakul et al., 2016). With a simple modification, we can circumvent the tendency of early class merging: if we treat the obtained GMM weights as class labels for the first 10-15 iterations (alternatively, one can randomly assign cluster labels), the neural networks become sufficiently stable and the merging effect is eliminated. The degrees of freedom ν_k were initialized with ≈ 5 for all k. The number of clusters K was known a priori and therefore kept fixed for all presented experiments.
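The mixture initialization can be sketched as follows. For brevity, a hard-assignment (k-means style) fit stands in for the full GMM training used in the paper, and the synthetic data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
K, D = 3, 2
# Stand-in latent representations (in practice: encoder outputs)
x = np.concatenate([rng.normal(loc=c, scale=0.3, size=(200, D))
                    for c in ([-2.0, -2.0], [0.0, 2.0], [2.0, -1.0])])

# Hard-assignment clustering -- a simplified stand-in for the GMM fit
centers = x[[0, 200, 400]].copy()   # one seed point per (known) blob
for _ in range(25):
    labels = np.argmin(((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    centers = np.stack([x[labels == k].mean(axis=0) for k in range(K)])

# Initialize the Student-t mixture components from the fitted clusters
pi_init = np.array([(labels == k).mean() for k in range(K)])
mu_init = centers
Sigma_init = np.stack([np.cov(x[labels == k].T) for k in range(K)])
nu_init = np.full(K, 5.0)   # degrees of freedom initialized near 5, as in the paper
```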
After the completion of the training we used the decoder to sample new observations. Fig. 3b illustrates the linearly separable, learned manifolds in latent space by plotting samples drawn from each mixture component according to Eqs. (5) and (6). In Fig. 3c and 3d the mean values and new sampled observations are shown after applying the decoder to the latent data.

Authorship Attribution
In the previous section, we have shown that the proposed tVAE model has the ability to learn a non-linear generative process. Next, we examine to what extent the tVAE framework can be used to accomplish authorship attribution for Amazon review data.

Feature Extraction
As already discussed in Section 1, we are using the ADHOMINEM system by (Boenninghoff et al., 2019b) as a feature extraction mechanism to convert variable-length documents D_n (for n = 1, 2, ...) into fixed-length feature vectors o_n in observation space. The feature vectors o_n capture the stylistic characteristics of the associated documents D_n. Documents may consist of known words as well as unknown character concatenations embedded within multiple sentences. The core of ADHOMINEM is a two-level hierarchical attention-based bidirectional LSTM network (Hochreiter and Schmidhuber, 1997). Besides pre-trained word embeddings, ADHOMINEM also provides a characters-to-word encoding layer to take the specific uses of prefixes and suffixes as well as spelling errors into account.

Amazon Reviews Dataset
ADHOMINEM was trained on a large-scale corpus of short Amazon reviews. The dataset is described in (Boenninghoff et al., 2019b) and consists of 9,052,606 reviews written by 784,649 authors, with document lengths varying between 80 and 1000 tokens. For the evaluation of the proposed tVAE method, we randomly selected 21,172 reviews written by 30 authors. All texts from the selected authors had been excluded from the training procedure of ADHOMINEM. Each author contributed at least 503 reviews and at most 1,000 reviews.

Hyper-Parameter Tuning and Regularization
Encoder and decoder were fully connected feed-forward networks (with a tanh activation function) of the form L-H-D and D-H-L, respectively, where L = 200 defines the dimension of the observations, H = (D + L)/2 represents the number of hidden nodes and D is the latent space dimension. In all experiments, the Adam optimizer from (Kingma and Ba, 2014) was used to update the model parameters. Gradients were normalized so that their l2-norm was less than or equal to 1. Furthermore, we added an l1-regularization term, J_{φ,θ} = β · Σ_{(W,b) ∈ {φ,θ}} (||vec(W)||_1 + ||b||_1), to reduce overfitting. The terms W and b represent the weights and bias terms of the encoder/decoder networks. Our hyper-parameter tuning is based on a grid search over the following parameter-set combinations: step size α ∈ {0.001, 0.002, 0.003, 0.004, 0.005}, σ_k^2 ∈ {0.001, 0.01, 0.1, 0.5, 0.9}, D ∈ {20, 50, 100, 150, 200} and β ∈ {0.001, 0.005, 0.01, 0.05}.
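The gradient normalization and the l1 penalty can be sketched framework-agnostically. Treating the norm constraint as a joint norm over all gradient tensors is an assumption of this sketch; per-tensor clipping would also satisfy the stated constraint:

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint l2-norm is <= max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [g * scale for g in grads], total

def l1_penalty(params, beta=0.01):
    """l1-regularization term over all weight matrices and bias vectors."""
    return beta * sum(float(np.abs(p).sum()) for p in params)
```

Gradients whose joint norm already lies below the threshold pass through unchanged, so the clipping only affects unusually large update steps.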

Results
A 5-fold cross-validation was performed to evaluate the models in terms of average error rates. In a first step, we reserved 10% of the data as a development set and 10% of the data as a test set. In addition, we addressed the challenge of training the autoencoders with a smaller number of labeled data items by varying the size of the labeled training data from 20% to 100% (reviews were dropped out randomly). Using the best models found (depending on the hyper-parameters), we evaluated the performance of all methods with respect to the average error rate. Table 1 compares the proposed tVAE method against the conventional gVAE method, as well as a linear and a non-linear SVM. The lowest error rate for each setup is shown in bold face. It can be seen that our tVAE model is able to (slightly) outperform all baseline methods. For all methods, the performance gradually improves as the number of training documents is increased. In addition, Fig. 5 shows the performance results w.r.t. the dimension of the latent variable x_n. It is apparent that the best choice for D increases when more reviews are added to the training set. For 20% of the training data, the optimal dimension is D = 50, for 40% we have D = 100, and for more than 60%, D = 200 yields the lowest error rate. Fig. 4 displays the resulting, i.e. learned, degrees of freedom ν_k for all clusters (i.e. authors). The plot clearly shows that for smaller latent dimensions D, the cluster distributions are approximately Gaussian. With an increase in latent dimension, the mixture components become more heavy-tailed, making the Student-t distribution a better fit. Fig. 4, thus, provides a strong numerical justification for the move to the proposed tVAE model.

Conclusion
Variational autoencoders (VAEs) have proven their benefit in many tasks. They provide an attractive machine-learning framework that combines the strengths of neural-network training with the power of uncertainty metrics derived from statistical models. They can learn a low-dimensional manifold to summarize the most salient characteristics of data and provide a natural, statistical interpretation, both in the latent as well as in the observation space. In our work, we are addressing the question of whether VAEs can benefit from a statistical model that allows for more heavy-tailed distributions. The promise of heavy-tailed distributions is that susceptibility to outliers is curtailed and that the overall capacity of the model is improved.
Towards that goal, we have proposed and evaluated a VAE that is equipped with an embedded Student-t mixture model. It incorporates an assumption of Student-t distributed data into the joint learning mechanism for the latent manifold and its statistical distribution. Variational inference is performed by trying to simultaneously solve both tasks: jointly learning a nonlinear mapping to transform a given dataset of interest onto a (lower-dimensional) subspace and grouping the latent representations into meaningful categories.
We have derived a variational learning algorithm to accomplish this goal and we have shown its benefit for learning latent representations, both on synthetic data as well as a real-world authorship attribution task. Our more flexible model provided a capacity to obtain better results than an SVM-based classifier as well as a standard Gaussian VAE.