Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text

There exist few text-specific methods for unsupervised anomaly detection, and for those that do exist, none utilize pre-trained models for distributed vector representations of words. In this paper we introduce a new anomaly detection method—Context Vector Data Description (CVDD)—which builds upon word embedding models to learn multiple sentence representations that capture multiple semantic contexts via the self-attention mechanism. Modeling multiple contexts enables us to perform contextual anomaly detection of sentences and phrases with respect to the multiple themes and concepts present in an unlabeled text corpus. These contexts in combination with the self-attention weights make our method highly interpretable. We demonstrate the effectiveness of CVDD quantitatively as well as qualitatively on the well-known Reuters, 20 Newsgroups, and IMDB Movie Reviews datasets.


Introduction
Anomaly Detection (AD) (Chandola et al., 2009; Pimentel et al., 2014) is the task of discerning rare or unusual samples in a corpus of unlabeled data. A common approach to AD is one-class classification (Moya et al., 1993), where the objective is to learn a model that compactly describes "normality," usually under the assumption that most of the unlabeled training data is "normal," i.e. non-anomalous. Deviations from this description are then deemed to be anomalous. Examples of one-class classification methods are the well-known One-Class SVM (OC-SVM) (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax and Duin, 2004).
Applying AD to text is useful for many applications, including discerning anomalous web content (e.g. posts, reviews, or product descriptions), automated content management, spam detection, and characterizing news articles so as to identify similar or dissimilar novel topics. Recent work has found that proper text representation is critical for designing well-performing machine learning algorithms. Given the exceptional impact that universal vector embeddings of words (Bengio et al., 2003) such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText, or dynamic vector embeddings of text by language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have had on NLP, it is somewhat surprising that there has been no work on adapting AD techniques to use such unsupervised pre-trained models. Existing AD methods for text still typically rely on bag-of-words (BoW) text representations (Manevitz and Yousef, 2001, 2007; Mahapatra et al., 2012; Kannan et al., 2017).
In this work, we introduce a novel one-class classification method that takes advantage of pre-trained word embedding models for performing AD on text. Starting with pre-trained word embeddings, our method, Context Vector Data Description (CVDD), finds a collection of transforms to map variable-length sequences of word embeddings to a collection of fixed-length text representations via a multi-head self-attention mechanism. These representations are trained along with a collection of context vectors such that the context vectors and representations are similar while keeping the context vectors diverse. Training these representations and context vectors jointly allows our algorithm to capture multiple modes of normalcy which may, for example, correspond to a collection of distinct yet non-anomalous topics. Disentangling such modes allows for contextual anomaly detection with sample-based explanations and enhanced model interpretability.

Context Vector Data Description
In this section we introduce Context Vector Data Description (CVDD), a self-attentive, multi-context one-class classification method for unsupervised AD on text. We first describe the CVDD model and objective, followed by a description of its optimization procedure. Finally, we present some analysis of CVDD.

Multi-Head Self-Attention
Here we describe the problem setting and the multi-head self-attention mechanism which lies at the core of our method. Let S = (w_1, . . . , w_ℓ) ∈ R^(d×ℓ) be a sentence or, more generally, a sequence of ℓ ∈ N words (e.g. a phrase or document), where each word is represented by some d-dimensional vector. Given some pre-trained word embedding, let H = (h_1, . . . , h_ℓ) ∈ R^(p×ℓ) be the corresponding p-dimensional vector embeddings of the words in S. The vector embedding H might be some universal word embedding (e.g. GloVe, fastText) or the hidden vector activations of sentence S given by some language model (e.g. ELMo, BERT).

The aim of multi-head self-attention (Lin et al., 2017) is to define a transformation that accepts sentences S^(1), . . . , S^(n) of varying lengths ℓ^(1), . . . , ℓ^(n) and returns a vector of fixed length, thereby allowing us to apply more standard AD techniques. The idea here is to find a fixed-length vector representation of size p via a convex combination of the word vectors in H. The coefficients of this convex combination are adaptive weights which are learned during training.

We now describe the model in more detail. Given the word embeddings H ∈ R^(p×ℓ) of a sentence S, the first step of the self-attention mechanism is to compute the attention matrix A ∈ (0, 1)^(ℓ×r) by

A = softmax(tanh(H^⊤ W_1) W_2),    (1)

where W_1 ∈ R^(p×d_a) and W_2 ∈ R^(d_a×r). The tanh-activation is applied element-wise and the softmax column-wise, thus making each vector a_k of the attention matrix A = (a_1, . . . , a_r) a positive vector that sums to one, i.e. a weighting vector. The r vectors a_1, . . . , a_r are called attention heads, with each head giving a weighting over the words in the sentence. The dimension d_a is the internal dimensionality and thus sets the complexity of the self-attention module. We then obtain a fixed-length sentence embedding matrix M = (m_1, . . . , m_r) ∈ R^(p×r) from the word embeddings H by applying the self-attention weights A as

M = H A.    (2)
Thus, each column m_k ∈ R^p is a weighted linear combination of the vector embeddings h_1, . . . , h_ℓ ∈ R^p with weights a_k ∈ R^ℓ given by the respective attention head k, i.e. m_k = H a_k. Often, a regularization term P such as

P_n(W_1, W_2) = (1/n) Σ_{i=1}^n ‖A^(i)⊤ A^(i) − I‖²_F    (3)

is added to the learning objective to promote the attention heads to be nearly orthogonal and thus capture distinct views that focus on different semantics and concepts of the data. Here, I denotes the r × r identity matrix, ‖·‖_F is the Frobenius norm, and A^(i) = A(H^(i); W_1, W_2) is the attention matrix corresponding to sample S^(i).
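To make the mechanism concrete, the following is a minimal PyTorch sketch of the self-attention module of Eqs. (1) and (2). The class and variable names are ours, not from a reference implementation, and padding of variable-length sentences is omitted for brevity (padded positions would need to be masked before the softmax).

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Multi-head self-attention pooling: maps (batch, l, p) word embeddings
    to r fixed-length sentence embeddings of size p each."""

    def __init__(self, p, d_a=150, r=3):
        super().__init__()
        self.W1 = nn.Linear(p, d_a, bias=False)  # W_1 in R^{p x d_a}
        self.W2 = nn.Linear(d_a, r, bias=False)  # W_2 in R^{d_a x r}

    def forward(self, H):
        # A = softmax(tanh(H^T W_1) W_2), softmax over the word dimension
        A = torch.softmax(self.W2(torch.tanh(self.W1(H))), dim=1)  # (batch, l, r)
        # M = H A: each column m_k is a convex combination of word vectors
        M = torch.bmm(H.transpose(1, 2), A)                        # (batch, p, r)
        return M, A
```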

The CVDD Objective
In this section, we introduce an unsupervised AD method for text. It aims to capture multiple distinct contexts that may appear in normal text. To do so, it leverages the multi-head self-attention mechanism (described in the previous section), with the heads focusing on distinct contexts (one head per context).
We first define a notion of similarity. Let sim(u, v) be the cosine similarity between two vectors u and v, i.e.

sim(u, v) = ⟨u, v⟩ / (‖u‖ ‖v‖),    (4)

and denote by d(u, v) the cosine distance between u and v, i.e.

d(u, v) = (1/2) (1 − sim(u, v)) ∈ [0, 1].    (5)
As before, let r be the number of attention heads. We now define the context matrix C = (c_1, . . . , c_r) ∈ R^(p×r) to be a matrix whose columns c_1, . . . , c_r are vectors in the word embedding space R^p. Given an unlabeled training corpus S^(1), . . . , S^(n) of sentences (or phrases, documents, etc.), which may vary in length ℓ^(i), and their corresponding word vector embeddings H^(1), . . . , H^(n), we formulate the Context Vector Data Description (CVDD) objective as follows:

min_{C, W_1, W_2} J_n(C, W_1, W_2) = (1/n) Σ_{i=1}^n Σ_{k=1}^r σ_k(H^(i)) d(c_k, m_k^(i)),    (6)

where m_k^(i) = m_k(H^(i); W_1, W_2) is the kth head's embedding of sample S^(i), and σ_1(H), . . . , σ_r(H) are input-dependent weights with Σ_k σ_k(H) = 1, which we specify in detail further below. That is, CVDD finds a set of vectors c_1, . . . , c_r ∈ R^p with small cosine distance to the attention-weighted data representations m_k of the kth head. This causes the network to learn attention weights that extract the most common concepts and themes from the data. We call the vectors c_1, . . . , c_r ∈ R^p context vectors because they represent a compact description of the different contexts that are present in the data. For a given text sample S^(i), the corresponding embedding m_k^(i) provides a sample representation with respect to the kth context.
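A sketch of the inner term of Eq. (6) in the same PyTorch style as above (helper names are ours; `sigma` holds the weights σ_k(H), which are specified below):

```python
import torch
import torch.nn.functional as F

def cosine_distance(u, v, dim):
    # d(u, v) = (1/2) (1 - sim(u, v)) as in Eq. (5)
    return 0.5 * (1.0 - F.cosine_similarity(u, v, dim=dim))

def cvdd_loss(C, M, sigma):
    # C: (p, r) context vectors, M: (batch, p, r) head embeddings,
    # sigma: (batch, r) weights with rows summing to one
    dists = cosine_distance(C.unsqueeze(0), M, dim=1)  # (batch, r): d(c_k, m_k)
    return (sigma * dists).sum(dim=1).mean()           # empirical mean of Eq. (6)
```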
Multi-context regularization To promote the context vectors c_1, . . . , c_r to capture diverse themes and concepts, we regularize them towards orthogonality:

P(C) = ‖C^⊤ C − I‖²_F.    (7)

We can now state the overall CVDD objective as

min_{C, W_1, W_2} J_n(C, W_1, W_2) + λ P(C),    (8)

where J_n(C, W_1, W_2) is the objective function from Eq. (6) and λ > 0. Because CVDD minimizes the cosine distance, regularizing the context vectors c_1, . . . , c_r to be orthogonal implicitly regularizes the attention weight vectors a_1, . . . , a_r to be orthogonal as well, a phenomenon which we also observed empirically. Despite this, we found that regularizing the context vectors as in (7) allows for faster, more stable optimization in comparison to regularizing the attention weights as in (3). We suspect this is because in (3), P = P_n(W_1, W_2) depends nonlinearly on the attention network weights W_1 and W_2 as well as on the data batches, whereas the gradients of P(C) in (7) can be computed directly. Empirically, we found that selecting λ ∈ {1, 10} gives reliable results with the desired effect that CVDD learns multiple distinct contexts.
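The penalty of Eq. (7) is a one-liner; a hedged sketch consistent with the helpers above:

```python
import torch

def context_penalty(C):
    # P(C) = ||C^T C - I||_F^2, pushing the context vectors towards
    # unit norm and mutual orthogonality
    I = torch.eye(C.shape[1], device=C.device)
    return ((C.t() @ C - I) ** 2).sum()

# Overall objective, Eq. (8):
# loss = cvdd_loss(C, M, sigma) + lam * context_penalty(C)
```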
Optimization of CVDD We optimize the CVDD objective jointly over the self-attention network weights {W_1, W_2} and the context vectors c_1, . . . , c_r using stochastic gradient descent (SGD) and its variants (e.g. Adam (Kingma and Ba, 2014)). Thus, CVDD training scales linearly in the number of training batches, and training is carried out until convergence. Since the self-attention module is just a two-layer feed-forward network, the computational cost of training CVDD is very low. However, evaluating a pre-trained model to obtain word embeddings may add to the computational cost (e.g. in the case of large pre-trained language models), in which case parallelization strategies (e.g. using GPUs) should be exploited. We initialize the context vectors with the centroids obtained from running k-means++ (Arthur and Vassilvitskii, 2007) on the sentence representations formed by averaging the word embeddings. We empirically found that this initialization strategy improves optimization speed as well as performance.
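The following condensed training sketch illustrates this procedure under our assumptions; `train_embs`, `loader`, `r`, `p`, and `lam` are placeholders rather than names from the paper, the `sigma` helper is defined in the next subsection's sketch, and the two-phase learning rate schedule is the one used later in the experiments.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

# k-means++ initialization of the context vectors on mean sentence embeddings
avg = np.stack([H.mean(axis=0) for H in train_embs])  # train_embs: list of (l_i, p) arrays
centers = KMeans(n_clusters=r, init="k-means++").fit(avg).cluster_centers_
C = torch.nn.Parameter(torch.tensor(centers.T, dtype=torch.float32))  # (p, r)

attn = SelfAttention(p, d_a=150, r=r)
opt = torch.optim.Adam(list(attn.parameters()) + [C], lr=1e-2)

for epoch in range(100):
    if epoch == 40:                    # simple two-phase learning rate schedule
        for g in opt.param_groups:
            g["lr"] = 1e-3
    for H in loader:                   # batches of (padded) word embeddings
        M, _ = attn(H)
        loss = cvdd_loss(C, M, sigma(C, M)) + lam * context_penalty(C)
        opt.zero_grad()
        loss.backward()
        opt.step()
```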

Weighting contexts in the CVDD objective
There is a natural motivation to consider multiple vectors for representation because sentences or documents may have multiple contexts, e.g. cinematic language, movie theme, or sentiment in movie reviews. This raises the question of how these context representations should be weighted in a learning objective. For this, we propose to use a parameterized softmax over the r cosine distances of a sample S with embedding H in our CVDD objective:

σ_k(H) = exp(−α d(c_k, m_k(H))) / Σ_{j=1}^r exp(−α d(c_j, m_j(H)))    (9)

for k = 1, . . . , r, with α ∈ [0, +∞). The α parameter allows us to balance the weighting between two extreme cases: (i) α = 0, which results in all contexts being equally weighted, i.e. σ_k(H) = 1/r for all k, and (ii) α → ∞, in which case the softmax approximates the argmin function, i.e. only the closest context k_min = argmin_k d(c_k, m_k) has weight σ_{k_min} = 1, whereas σ_k = 0 for k ≠ k_min. Traditional clustering methods typically only consider the argmin, i.e. the closest representatives (e.g. the nearest centroid in k-means). For learning multiple sentence representations as well as contexts from data, however, this might be ineffective and result in poor representations, because the optimization gets "trapped" by the closest context vectors, which strongly depend on the initialization. Not penalizing the distances to other context vectors also does not foster the extraction of multiple contexts per sample. For this reason, we initially set α = 0 in training and then gradually increase the α parameter using some annealing strategy. Thus, learning initially focuses on extracting multiple contexts from the data before sample representations gradually get fine-tuned w.r.t. their closest contexts.
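A sketch of these weights, reusing the `cosine_distance` helper above; whether gradients should flow through σ is an implementation choice not pinned down here, so we detach the distances in this sketch and treat σ as a pure weighting.

```python
import torch

def sigma(C, M, alpha=0.0):
    # Parameterized softmax over the r contextual distances, Eq. (9):
    # alpha = 0 gives uniform weights 1/r; large alpha concentrates
    # weight on the closest context (approaching the argmin).
    dists = cosine_distance(C.unsqueeze(0), M, dim=1).detach()  # (batch, r)
    return torch.softmax(-alpha * dists, dim=1)

# Example annealing schedule (as in the experiments): raise alpha every 20 epochs
alpha_schedule = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
```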

Contextual Anomaly Detection
Our CVDD learning objective and the introduction of context vectors allow us to score the "anomalousness" of a sample in relation to these various contexts, i.e. to determine anomalies contextually. We naturally define the anomaly score w.r.t. context k for some sample S with embedding H as the cosine distance d(c_k, m_k(H)) of the contextual embedding m_k(H) to the respective context vector c_k. A greater distance of m_k(H) to c_k implies a more anomalous sample w.r.t. context k. A straightforward choice for an overall anomaly score then is to take the average over the contextual anomaly scores:

s(S) = (1/r) Σ_{k=1}^r d(c_k, m_k(H)).    (12)

One might, however, select different weights for different contexts, as particular contexts might be more or less relevant in certain scenarios. Using word lists created from the most similar attention-weighted sentences to a context vector provides an interpretation of the particular context.
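Scoring then reuses the pieces above; a minimal sketch:

```python
import torch

@torch.no_grad()
def anomaly_scores(C, attn, H):
    # Per-context scores d(c_k, m_k(H)) and their average, Eq. (12)
    M, _ = attn(H)
    ctx = cosine_distance(C.unsqueeze(0), M, dim=1)  # (batch, r) contextual scores
    return ctx, ctx.mean(dim=1)                      # overall score per sample
```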

Avoiding Manifold Collapse
Neural approaches to AD and clustering (Fard et al., 2018) are prone to converge to degenerate solutions in which the data is transformed to a small manifold or a single point. CVDD may potentially also suffer from this manifold collapse phenomenon. Indeed, there exists a theoretically optimal solution (C*, W_1*, W_2*) for which the (non-negative) CVDD objective (6) becomes zero due to trivial representations. This is the case for (C*, W_1*, W_2*) where

m_k(H^(i); W_1*, W_2*) = c_k*   for all i = 1, . . . , n,    (13)

holds, i.e. if the contextual representation m_k(· ; W_1*, W_2*) is a constant mapping. In this case, all contextual data representations have collapsed onto the respective context vectors and are independent of the input sentence S with embedding H. Because the pre-trained embeddings H are fixed, and the self-attention embedding must be a convex combination of the columns of H, it is difficult for the network to overfit to a constant function. A degenerate solution may only occur if there exists a word which occurs at the same location in all training samples. This property of CVDD, however, might be used "as a feature" to uncover such simple common structure in the data, so that appropriate pre-processing steps can be carried out to rule out such "Clever Hans" behavior (Lapuschkin et al., 2019). Finally, since we normalize the contextual representations m_k as well as the context vectors c_k via cosine similarity, a trivial collapse to the origin (m_k = 0 or c_k = 0) is also avoided.

Related Work
Our method is related to works from unsupervised representation learning for NLP, methods for AD on text, as well as representation learning for AD.
Vector representations of words (Bengio et al., 2003; Collobert and Weston, 2008), or word embeddings, have been key to many substantial advances in NLP in recent history. Well-known examples include word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText. Approaches for learning sentence embeddings have also been introduced, including SkipThought (Kiros et al., 2015), ParagraphVector (Le and Mikolov, 2014), Conceptual Sentence Embedding (Wang et al., 2016), Sequential Denoising Autoencoders (Hill et al., 2016), and FastSent (Hill et al., 2016). In a comparison of unsupervised sentence embedding models, Hill et al. (2016) show that the optimal embedding critically depends on the targeted downstream task. For specific applications, more complex deep models such as recurrent (Chung et al., 2014), recursive (Socher et al., 2013), or convolutional (Kim, 2014) networks that learn task-specific dynamic sentence embeddings usually perform best. Recently, large language models like ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018) that learn dynamic sentence embeddings in an unsupervised manner have proven to be very effective for transfer learning, beating the state of the art in many downstream tasks. Such large deep models, however, are computationally very intensive. Finally, no method for learning representations of words or sentences specifically for the AD task has been proposed yet.
There are only a few works addressing AD on text. Manevitz and Yousef study one-class classification of documents using the OC-SVM (Schölkopf et al., 2001; Manevitz and Yousef, 2001) and a simple autoencoder (Manevitz and Yousef, 2007). Liu et al. (2002) consider learning from positively labeled as well as unlabeled mixed data of documents, a setting they call "partially supervised classification" that is similar to one-class classification. Kannan et al. (2017) introduce a non-negative matrix factorization method for AD on text that is based on block coordinate descent optimization. Mahapatra et al. (2012) include external contextual information for detecting anomalies using LDA clustering. All the above works, however, only consider document-to-word co-occurrence text representations. Other approaches rely on specific hand-crafted features for particular domains or types of anomalies (Guthrie, 2008; Kumaraswamy et al., 2015). None of the existing methods make use of pre-trained word models that were trained on huge corpora of text.
Learning representations for AD, or Deep Anomaly Detection (Chalapathy and Chawla, 2019), has seen great interest recently. Such approaches are motivated by applications on large and complex datasets and by the limited scalability of classic, shallow AD techniques and their need for manual feature engineering. Deep approaches aim to overcome those limitations by automatically learning relevant features from the data and by mini-batch training for improved computational scaling. Most existing deep AD works are in the computer vision domain and show promising, state-of-the-art results for image data (Andrews et al., 2016; Schlegl et al., 2017; Golan and El-Yaniv, 2018; Hendrycks et al., 2019). Other works examine deep AD on general high-dimensional point data (Sakurada and Yairi, 2014; Xu et al., 2015; Erfani et al., 2016; Chen et al., 2017). Few deep approaches examine sequential data, and those that do exist focus on AD for time series using LSTM networks (Bontemps et al., 2016; Malhotra et al., 2015, 2016). As mentioned earlier, there exists no representation learning approach for AD on text.

Experiments
We evaluate the performance of CVDD quantitatively in one-class classification experiments on the Reuters-21578 and 20 Newsgroups datasets, as well as qualitatively in an application on IMDB Movie Reviews to detect anomalous reviews.

Experimental Details
Pre-trained Models We employ the pre-trained GloVe (Pennington et al., 2014) as well as fastText word embeddings in our experiments. For GloVe we consider the 6B-token vector embeddings with p = 300 dimensions that have been trained on the Wikipedia and Gigaword 5 corpora. For fastText we consider the English word vectors, also with p = 300 dimensions, which have been trained on Wikipedia and the English webcrawl. We also experimented with dynamic word embeddings from the BERT (Devlin et al., 2018) language model but did not observe improvements over GloVe or fastText on the considered datasets that would justify the added computational cost.
Baselines We consider three baselines for aggregating word vector embeddings into fixed-length sentence representations: (i) mean, (ii) tf-idf weighted mean, and (iii) max-pooling. It has been repeatedly observed that the simple average sentence embedding is a strong baseline in many tasks (Wieting et al., 2016; Arora et al., 2017; Rücklé et al., 2018). Max-pooling is commonly applied over hidden activations (Lee and Dernoncourt, 2016). The tf-idf weighted mean is a natural sentence embedding baseline that incorporates document-to-term co-occurrence statistics. For AD, we then consider the OC-SVM (Schölkopf et al., 2001) with cosine kernel (which in this case is equivalent to SVDD (Tax and Duin, 2004)) on these sentence embeddings, where we always train with hyperparameters ν ∈ {0.05, 0.1, 0.2, 0.5} and report the best result.
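As an illustration of this baseline pipeline (our own sketch; `train_embs`, `test_embs`, and `y_test` are placeholders): since the cosine kernel equals a linear kernel on L2-normalized vectors, the mean-embedding OC-SVM baseline can be run as follows.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

# Mean sentence embeddings, L2-normalized so the linear kernel is the cosine kernel
X_train = normalize(np.stack([H.mean(axis=0) for H in train_embs]))
X_test = normalize(np.stack([H.mean(axis=0) for H in test_embs]))

aucs = []
for nu in (0.05, 0.1, 0.2, 0.5):
    ocsvm = OneClassSVM(kernel="linear", nu=nu).fit(X_train)
    scores = -ocsvm.decision_function(X_test)   # higher score = more anomalous
    aucs.append(roc_auc_score(y_test, scores))  # y_test: 0 normal, 1 anomalous
best_auc = max(aucs)
```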
CVDD configuration We employ self-attention with d_a = 150 for CVDD and present results for r ∈ {3, 5, 10} attention heads. We use Adam (Kingma and Ba, 2014) with a batch size of 64 for optimization and first train for 40 epochs with a learning rate of η = 0.01, after which we train another 60 epochs with η = 0.001, i.e. we establish a simple two-phase learning rate schedule. For weighting contexts, we consider the case of equal weights (α = 0) as well as a logarithmic annealing strategy α ∈ {0, 10^−4, 10^−3, 10^−2, 10^−1} where we update α every 20 epochs. For multi-context regularization, we choose λ ∈ {1, 10}.
Data pre-processing On all three datasets, we always lowercase the text and strip punctuation, numbers, and redundant whitespace. Moreover, we remove stopwords using the stopwords list from the nltk library (Bird et al., 2009) and only consider words with a minimum length of 3 characters.
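A sketch of this pipeline (assuming the nltk stopwords corpus has been downloaded):

```python
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, strip punctuation and numbers, collapse whitespace via split,
    # remove stopwords, and keep only words of at least 3 characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOPWORDS and len(t) >= 3]
```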

One-Class Classification of News Articles
Setup We perform one-class classification experiments on the Reuters-21578 and 20 Newsgroups topic classification datasets, which allow us to quantitatively evaluate detection performance via the AUC measure using the ground-truth labels at test time. Such use of classification datasets is the typical evaluation approach in the AD literature (Erfani et al., 2016; Golan and El-Yaniv, 2018). For the multi-label Reuters dataset, we only consider the subset of samples that have exactly one label, and we select the classes such that there are at least 100 training examples per class. For 20 Newsgroups, we consider the six top-level subject matter groups computer, recreation, science, miscellaneous, politics, and religion as distinct classes. In every one-class classification setup, one of the classes is the normal class and the remaining classes are considered anomalous. We always train the models only on the training data from the respective normal class. We then perform testing on the test samples from all classes, where samples from the normal class get label y = 0 ("normal") and samples from all the remaining classes get label y = 1 ("anomalous") for determining the AUC.
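Schematically, the evaluation protocol looks as follows (a sketch; `fit` and `score` are placeholders for any of the models above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def one_class_auc(train_docs, train_labels, test_docs, test_labels,
                  normal_class, fit, score):
    # fit: trains a model on normal-class documents only;
    # score: returns anomaly scores on documents, higher = more anomalous
    model = fit([d for d, y in zip(train_docs, train_labels) if y == normal_class])
    y = np.array([0 if y_t == normal_class else 1 for y_t in test_labels])
    return roc_auc_score(y, score(model, test_docs))
```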

Results
The results are presented in Table 1. Overall, we can see that CVDD shows competitive detection performance. We compute the AUCs for CVDD from the average anomaly score over the contextual anomaly scores as defined in (12). We find CVDD performance to be robust over λ ∈ {1, 10} and results are similar for weighting contexts equally (α = 0) or employing the logarithmic annealing strategy. The CVDD results in Table 1 are averages over those hyperparameters.
We get an intuition of the theme captured by a CVDD context vector by examining a list of top words for this context. We create such lists by counting the top words according to the highest self-attention weights from the most similar test sentences per context vector. Table 2 shows an example of such context word lists for three CVDD contexts. Such lists may guide a user in weighting and selecting relevant contexts in a specific application. Following this thought, we also report the best single-context detection performance in AUC to illustrate the potential of contextual anomaly detection. Those results are given in the c* column of Table 1 and demonstrate the possible boosts in performance. We highlighted the respective best contexts in Table 2 and present word lists of the best contexts of all the other classes in Table 3.

Table 4: Top words per context on IMDB Movie Reviews for the CVDD model with r = 10 contexts.

c1: great, excellent, good, superb, well, wonderful, nice, best, terrific, beautiful
c2: awful, downright, stupid, inept, pathetic, irritating, annoying, inane, unfunny, horrible
c3: plot, characters, story, storyline, scenes, narrative, subplots, twists, tale, interesting
c4: two, one, three, first, five, four, part, every, best, also
c5: think, anybody, know, would, say, really, want, never, suppose, actually
c6: actions, development, efforts, establishing, knowledge, involvement, policies, individuals, necessary, concerning
c7: film, filmmakers, filmmaker, movie, syberberg, cinema, director, acting, filmmaking, actors
c8: head, back, onto, cut, bottom, neck, floor, flat, thick, front
c9: william, john, michael, richard, davies, david, james, walter, robert, gordon
c10: movie, movies, porn, sex, watch, teen, best, dvd, scenes, flick
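A sketch of how such context word lists can be produced from a trained model (array and argument names are ours): for each context, take the test sentences whose head embedding m_k is closest to c_k, and within them count the words carrying the highest attention weight.

```python
from collections import Counter

def top_words(k, sents, dists, attns, n_sents=100, n_words=10):
    # sents: tokenized sentences; dists: (n, r) numpy array of contextual
    # distances; attns: list of (l_i, r) attention matrices from the model
    closest = dists[:, k].argsort()[:n_sents]  # sentences most similar to context k
    counts = Counter(sents[i][attns[i][:, k].argmax()] for i in closest)
    return [w for w, _ in counts.most_common(n_words)]
```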
One can see that those contexts indeed appear to be typical of what one would expect as a characterization of those classes. Note that the OC-SVM on the simple mean embeddings establishes a strong baseline, as has been observed on other tasks. Moreover, the tf-idf weighted embeddings prove especially beneficial on the larger 20 Newsgroups dataset for filtering out "general language contexts" (similar to stop words) that are less discriminative for the characterization of a text corpus. A major advantage of CVDD is its strong interpretability and its potential for contextual AD, which allows relevant contexts to be singled out.

Detecting Anomalous Movie Reviews
Setup We apply CVDD for detecting anomalous reviews in a qualitative experiment on IMDB Movie Reviews. For this, we train a CVDD model with r = 10 context vectors on the full IMDB train set of 25 000 movie review samples. After training, we examine the most anomalous and most normal reviews according to the CVDD anomaly scores on the IMDB test set which also includes 25 000 reviews. We use GloVe word embeddings and keep the CVDD model settings and training details as outlined in Section 4.1.
Results Table 4 shows the top words for each of the r = 10 CVDD contexts of the trained model. We can see that the different contexts of the CVDD model capture the different themes present in the movie reviews well. Note, for example, that c_1 and c_2 represent positive and negative sentiment respectively, c_3, c_7, and c_10 refer to different aspects of cinematic language, and c_9 captures names. Figure 1 in the introduction depicts qualitative examples from this experiment. Figure 1a shows the movie reviews having the highest anomaly scores. The top anomaly is a review that repeats the same phrase many times. From examining the most anomalous reviews, though, the dataset seems to be quite clean in general. Figure 1c shows the most normal reviews w.r.t. the first three contexts, i.e. the samples that have the lowest respective contextual anomaly scores.
Here, the self-attention weights give a sample-based explanation for why a particular review is normal in view of the respective context.

Conclusion
We presented a new self-attentive, multi-context one-class classification approach for unsupervised anomaly detection on text which makes use of pretrained word models. Our method, Context Vector Data Description (CVDD), enables contextual anomaly detection and has strong interpretability features. We demonstrated the detection performance of CVDD empirically and showed qualitatively that CVDD is well capable of learning distinct, diverse contexts from unlabeled text corpora.
Acknowledgments

... and Netflix, and ARO W911NF-12-1-0241 and W911NF-15-1-0484. MK and RV acknowledge support by the German Research Foundation (DFG) award KL 2698/2-1 and by the German Federal Ministry of Education and Research (BMBF) awards 031L0023A, 01IS18051A, and 031B0770E. Part of the work was done while MK was a sabbatical visitor of the DASH Center at the University of Southern California.