Identifying Semantically Deviating Outlier Documents

A document outlier is a document that substantially deviates in semantics from the majority of documents in a corpus. Automatic identification of document outliers can be valuable in many applications, such as screening health records for medical mistakes. In this paper, we study the problem of mining semantically deviating document outliers in a given corpus. We develop a generative model to identify frequent and characteristic semantic regions in the word embedding space to represent the given corpus, and a robust outlierness measure which is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets show that our method can achieve up to a 135% improvement over baselines in terms of recall at the top 1% of the outlier ranking.


Introduction
Technology today has made it unprecedentedly easy to collect and store documents in an increasing number of domains. Automatic text analysis (e.g. document clustering, summarization, topic modeling) becomes more useful and more in demand as the corpus size grows. Some trending domains (e.g. health records) call for a new analytical task, mining outlier documents: given a corpus, identify a small number of documents which substantially deviate from the semantic focuses of the given corpus. Outlier documents can provide valuable insights or imply potential errors. For example, an outlier health record from records of the same disease could indicate a new variation of the disease if it has an abnormal symptom description, or a medical error if it has an abnormal treatment description. A previous study (Hauskrecht et al., 2013) uses structured data in health records to show the importance of this application, and points out that further improvement should be achieved by leveraging text data.
* Research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS-1320617, IIS 16-18481, and NSF IIS 17-04532, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Existing work has studied a related albeit different task, novel document detection (Kasiviswanathan et al., 2012, 2013; Zhang et al., 2002, 2004), where one aims to identify from a document stream whether a newly arriving document is novel or redundant. In other words, this task assumes all the previous documents are known to be "normal", and only checks if a new document is novel. In our task, no document is known to be normal, and there could be multiple outliers in the corpus. Outlier detection (Chandola et al., 2009; Hodge and Austin, 2004) is a popular topic in data mining, but few studies focus on text data. A study (Guthrie, 2008) identifies anomalous text segments in a document, but mainly based on writing styles. We focus on studying semantically deviating documents.
The problem of detecting outlier documents has its unique challenges. First, different words or phrases may be used to indicate the same semantic meaning, which introduces lexical sparsity. Second, finding proper words or phrases to characterize the corpus is non-trivial. Semantically frequent words or phrases can still be too general or too vague. Third, a document can carry extremely rich and noisy signals, most of which are not helpful to determine whether it is an outlier.
We tackle the problem of mining outlier documents in the following steps. We leverage word embedding (Mikolov et al., 2013) to capture the semantic proximities between words and/or phrases, in order to solve the sparsity issue. Then we propose a generative model to identify semantic regions in the embedded space frequently mentioned by documents in the corpus. The model represents each semantic region with a von Mises-Fisher distribution. We also learn a concentration parameter for each region with our model, and develop a selection method to identify semantically specific regions which can better represent the corpus, and filter regions with largely uninformative words.
As the final step, we design a robust outlierness measure emphasizing only the words or phrases in a document relatively close to the semantic focuses identified, and eliminating the noises and redundant information.
The remainder of the paper is organized as follows. Section 2 introduces the preprocessing of data sets and clarifies the notations. Section 3 proposes the methodology to mine outlier documents. Section 4 describes the experiment setup, Section 5 presents the results and Section 6 concludes the paper.

Preliminaries
In this section, we formalize the problem and then briefly describe the preprocessing step.

Notations
The notations used in this study are introduced here. A document is represented as a sequence $d_i = (w_{i1}, w_{i2}, \cdots, w_{in_i})$, where each $w_{ij} \in V$ represents a word or phrase from a given vocabulary $V$ and $n_i$ denotes the length of $d_i$. We refer to a set of documents as a corpus, represented as $D = \{d_1, d_2, \cdots, d_{|D|}\}$. Notice that $w_{ij}$ may refer to a unigram word or a multi-gram phrase. Although it is nontrivial to appropriately segment a document into a mixed sequence of words and phrases, it is not the focus of our paper. A recently developed phrase mining technique (Liu et al., 2015) is used to extract quality phrases and segment the documents.
Word embedding provides vectorized representations of words and phrases to capture their semantic proximity. We assume there is an effective word embedding technique (e.g. Mikolov et al., 2013), $f : V \to \mathbb{R}^{\nu}$, where $f$ is the transforming function that takes a word or a phrase as input and projects it into a $\nu$-dimensional vector as its distributed representation. The semantic proximity between two words or phrases $w$ and $w'$ can be preserved by the cosine similarity between their embedded vectors: $\cos(f(w), f(w')) = \frac{f(w)^{\top} f(w')}{\|f(w)\| \, \|f(w')\|}$.
This work studies how to effectively rank documents in a corpus based on how much they deviate from the semantic focuses of the corpus. Given a set of documents $D$, our objective is to design an outlierness measure $\Omega : D \to \mathbb{R}$, such that documents with larger outlierness $\Omega(d)$ semantically deviate more from the majority of $D$.

Preprocessing
We perform several steps of preprocessing to derive the input representation of each document in a given corpus.
Phrase mining. SegPhrase, a recently developed phrase-mining method (Liu et al., 2015), is utilized to automatically identify quality phrases in a corpus. After being trained on one corpus, SegPhrase is also capable of segmenting unseen documents into chunks of phrases with mixed lengths. We train SegPhrase on an external corpus $D_e$ to obtain the list of quality phrases. Then for each corpus $D$ given for outlier detection, we employ the trained SegPhrase to chunk each document into a sequence of words and quality phrases.
Word embedding. We adopt word embedding as a preprocessing step to capture the semantic proximity between words/phrases. Instead of using the raw text, similar to Liu et al. (2015), we use the sequence derived from SegPhrase as input to the word embedding algorithm. In particular, word2vec (Mikolov et al., 2013) is utilized in our experiments, but it can be seamlessly replaced by any other embedding method.
We run the embedding algorithm on the external corpus $D_e$, the same corpus used in phrase mining. As $D_e$ is sufficiently large, only a few words or phrases in $D$ never appear in $D_e$; these are simply discarded in the experiments.
Stop words removal. We remove stop words, as well as the words or phrases ranked high within a certain quantile in terms of document frequency (DF) in the external corpus $D_e$. Such words or phrases usually carry background noise, and obstruct outlier detection.
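The document-frequency filter described above can be sketched as follows. This is only an illustration under stated assumptions: the quantile is taken over the vocabulary of the external corpus, ties are broken alphabetically, and the function names `high_df_terms` and `filter_document` as well as the toy corpus are hypothetical.

```python
from collections import Counter

def high_df_terms(external_corpus, top_fraction):
    """Return the terms whose document frequency lies in the top
    `top_fraction` of the external-corpus vocabulary (assumed reading)."""
    df = Counter()
    for doc in external_corpus:
        df.update(set(doc))  # count each term once per document
    # Rank by descending document frequency; ties broken alphabetically.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    cutoff = int(len(ranked) * top_fraction)
    return set(ranked[:cutoff])

def filter_document(doc, removed_terms, stop_words):
    """Drop stop words and high-DF terms from a segmented document."""
    return [t for t in doc if t not in removed_terms and t not in stop_words]

# Toy external corpus of already segmented documents (hypothetical data).
external = [
    ["the", "patient", "reported", "fever"],
    ["the", "patient", "received", "antibiotics"],
    ["the", "study", "reported", "results"],
    ["the", "patient", "recovered"],
]
removed = high_df_terms(external, top_fraction=0.25)
cleaned = filter_document(["the", "patient", "reported", "fever"],
                          removed, stop_words={"the"})
```

With nine distinct terms and a fraction of 0.25, the two most frequent terms are removed, so only the rarer, more informative tokens of the example document remain.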

Mining Outlier Documents
Our framework consists of the following steps. First, we leverage a generative model to identify semantic "regions" in the word embedding space frequently mentioned by documents in the given corpus. Second, we develop a selection method to further remove semantic regions that are too general to properly characterize the given corpus, and only keep regions that are both frequent and semantically specific, denoted as "semantic focuses". Finally, we calculate the outlierness measure for each document based on the mined semantic focuses. We design a robust outlierness measure which is less sensitive to noisy words or phrases in documents.

Embedded von Mises-Fisher Allocation
We start with a generative model to identify the frequent semantic regions in the word embedding space.
Since we use cosine similarity to capture the semantic proximities between two words or phrases, the magnitude of the embedding vector of each word can be omitted in this part. We use $x_{ij} = f(w_{ij}) / \|f(w_{ij})\|$ to represent the unit vector with the same direction as the embedded vector of $w_{ij}$, and use $X$ to represent the collection of all $x_{ij}$ where $1 \le i \le |D|$ and $1 \le j \le n_i$.
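As a small illustration of why the magnitude can be dropped, the sketch below normalizes two vectors and checks that the dot product of the unit vectors equals the cosine similarity of the raw vectors. The three-dimensional vectors are made-up values, not real word2vec output, and the helper names are hypothetical.

```python
import math

def normalize(vec):
    """Return the unit vector pointing in the same direction as vec."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]

def cosine_similarity(u, v):
    """Cosine similarity computed directly from the raw vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings f(w) and f(w') (hypothetical values).
f_w1 = [1.0, 2.0, 2.0]
f_w2 = [2.0, 0.0, 1.0]

x1, x2 = normalize(f_w1), normalize(f_w2)
dot_of_units = sum(a * b for a, b in zip(x1, x2))
```

Here `dot_of_units` and `cosine_similarity(f_w1, f_w2)` agree up to floating-point error, which is why the model can work with unit vectors only.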
In order to characterize a semantic region in the embedded space, we introduce the von Mises-Fisher (vMF) distribution. The vMF distribution is widely adopted in directional statistics, which studies the distribution of normalized vectors on a sphere. The probability density function of the vMF distribution is explicitly instantiated by the cosine similarity. It is an ideal distribution for our task because we use cosine similarity to measure the semantic proximity. Moreover, as we will see later, it allows us to characterize how specific each semantic region is, which is helpful in further identification of semantic focuses for outlier detection.
We first introduce the formalization of the von Mises-Fisher distribution.

Von Mises-Fisher (vMF) distribution.
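A $\nu$-dimensional unit random vector $x$ (i.e. $x \in \mathbb{R}^{\nu}$ and $\|x\| = 1$) follows a von Mises-Fisher distribution $\mathrm{vMF}(\cdot \mid \mu, \kappa)$ if its probability density function is $p(x \mid \mu, \kappa) = C_{\nu}(\kappa) \exp(\kappa \mu^{\top} x)$, where $C_{\nu}(\kappa) = \frac{\kappa^{\nu/2 - 1}}{(2\pi)^{\nu/2} I_{\nu/2 - 1}(\kappa)}$ is the normalization constant and $I_r(\cdot)$ denotes the modified Bessel function of the first kind at order $r$. The two parameters in the vMF distribution are the mean direction $\mu$ and the concentration parameter $\kappa$ respectively, where $\mu \in \mathbb{R}^{\nu}$, $\|\mu\| = 1$ and $\kappa > 0$. The distribution is concentrated around the mean direction $\mu$, and is more concentrated if the concentration parameter $\kappa$ is larger.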
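For readers who want to evaluate the standard vMF density numerically, here is a minimal pure-Python sketch. It assumes the textbook form of the density (normalization constant based on the modified Bessel function of the first kind) and computes the Bessel function from its power series in log space; the function names and the truncation at 500 terms are illustrative choices, adequate for moderate values of kappa.

```python
import math

def log_bessel_i(order, x, terms=500):
    """log I_order(x) for x > 0, from the power series
    I_v(x) = sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1)), summed in log space."""
    log_half_x = math.log(x / 2.0)
    log_terms = [
        (2 * k + order) * log_half_x
        - math.lgamma(k + 1)
        - math.lgamma(k + order + 1)
        for k in range(terms)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def vmf_log_density(x, mu, kappa):
    """log vMF(x | mu, kappa) for unit vectors x and mu of equal dimension."""
    nu = len(x)
    order = nu / 2.0 - 1.0
    log_norm = (order * math.log(kappa)
                - (nu / 2.0) * math.log(2.0 * math.pi)
                - log_bessel_i(order, kappa))
    cosine = sum(a * b for a, b in zip(x, mu))  # equals mu^T x for unit vectors
    return log_norm + kappa * cosine

mu = [0.0, 0.0, 1.0]
at_mean = vmf_log_density(mu, mu, kappa=2.0)
orthogonal = vmf_log_density([1.0, 0.0, 0.0], mu, kappa=2.0)
sharper = vmf_log_density(mu, mu, kappa=10.0)
```

In three dimensions the normalization constant reduces to kappa / (4 * pi * sinh(kappa)), which gives a convenient sanity check. The density is highest at the mean direction, and a larger kappa puts more mass there.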

Embedded von Mises-Fisher allocation.
We propose a generative model by regarding each document as a bag of normalized embedded vectors, analogous to the bag-of-words representation of documents utilized in typical topic models (e.g., LDA (Blei et al., 2001)). The major difference is that the data to be generated is now a bag of normalized embedded vectors for each document, and should be generated from a mixed vMF distribution instead of a mixed multinomial distribution.
In the formal description of the model, $T > 0$ is an integer indicating the number of semantic regions, namely the number of vMF distributions in our mixture model. We regularize the vMF parameters by the following prior distributions. We assume the mean direction $\mu_t$ of each vMF distribution is generated from a prior vMF distribution $\mathrm{vMF}(\cdot \mid \mu_0, C_0)$, while the concentration parameter $\kappa_t$ is generated from a log-normal prior $\mathrm{logNormal}(\cdot \mid m_0, \sigma_0^2)$. A similar design is also adopted in (Gopal and Yang, 2014).
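One way to write down a generative process consistent with the description in this subsection is sketched below. It is a reading of the text rather than a verbatim restatement: the priors on $\mu_t$ and $\kappa_t$ and the symbols $\pi_i$ and $z_{ij}$ come from the surrounding paragraphs, while the Dirichlet hyperparameter $\alpha$ is an assumed symbol that is not named in the text.

```latex
\begin{align*}
\mu_t    &\sim \mathrm{vMF}(\cdot \mid \mu_0, C_0),            & t &= 1, \dots, T \\
\kappa_t &\sim \mathrm{logNormal}(\cdot \mid m_0, \sigma_0^2), & t &= 1, \dots, T \\
\pi_i    &\sim \mathrm{Dirichlet}(\cdot \mid \alpha),          & i &= 1, \dots, |D| \\
z_{ij}   &\sim \mathrm{Categorical}(\cdot \mid \pi_i),         & j &= 1, \dots, n_i \\
x_{ij}   &\sim \mathrm{vMF}(\cdot \mid \mu_{z_{ij}}, \kappa_{z_{ij}}), & j &= 1, \dots, n_i
\end{align*}
```

Under this reading the model mirrors LDA, with the per-topic multinomial over words replaced by a per-region vMF distribution over unit vectors.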
Parameter inference. We infer the parameters by Gibbs sampling. Because both the von Mises-Fisher distribution and the Dirichlet distribution have conjugate priors, we can integrate out the parameters $\mu_t$ and $\pi_i$ and develop a collapsed Gibbs sampler for $z_{ij}$. The sampling probability depends on the number of words in the $i$-th document assigned to the $t$-th von Mises-Fisher distribution without taking $w_{ij}$ into account, and on the sum of word vectors assigned to semantic region $t$ without counting $w_{ij}$; both quantities are defined with the indicator function $\delta(\cdot)$.
We can also derive a collapsed Gibbs sampler for the concentration parameters $\kappa_t$, which involves $n_{\cdot t}$, the number of words in semantic region $t$.
While sampling $z_{ij}$ is relatively trivial, sampling $\kappa_t$ is not straightforward. A similar difficulty is also mentioned in (Gopal and Yang, 2014). We employ a Metropolis-Hastings algorithm with another log-normal distribution centered at the current $\kappa_t$ value as the proposal distribution.
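A generic Metropolis-Hastings update with a log-normal proposal can be sketched as follows. This is not the paper's sampler: the conditional posterior of $\kappa_t$ is replaced by a stand-in target (the log-normal prior alone, with $m_0 = \log 100$ and variance 0.01), "centered at the current value" is interpreted as a proposal whose log-mean is the log of the current value, and the step size and function names are assumptions.

```python
import math
import random

def mh_step_log_normal(kappa, log_target, step, rng):
    """One Metropolis-Hastings update for a positive scalar.

    The proposal is log-normal around the current value, i.e.
    log(proposal) ~ Normal(log(kappa), step^2)."""
    proposal = kappa * math.exp(rng.gauss(0.0, step))
    # The proposal is asymmetric, so the Hastings correction is
    # q(kappa | proposal) / q(proposal | kappa) = proposal / kappa.
    log_accept = (log_target(proposal) - log_target(kappa)
                  + math.log(proposal) - math.log(kappa))
    if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
        return proposal
    return kappa

def log_prior(kappa, m0=math.log(100.0), var0=0.01):
    """Unnormalized log-density of logNormal(m0, var0); a stand-in target."""
    return -math.log(kappa) - (math.log(kappa) - m0) ** 2 / (2.0 * var0)

rng = random.Random(0)
kappa = 50.0
samples = []
for iteration in range(6000):
    kappa = mh_step_log_normal(kappa, log_prior, step=0.1, rng=rng)
    if iteration >= 1000:  # discard burn-in
        samples.append(kappa)
mean_log_kappa = sum(math.log(k) for k in samples) / len(samples)
```

Because the proposal multiplies the current value by a positive factor, every sample stays positive, which matches the constraint $\kappa_t > 0$. In a full sampler, `log_prior` would be replaced by the log of the conditional posterior of $\kappa_t$.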
After obtaining a sample from the posterior distribution of the $z_{ij}$'s and $\kappa_t$'s, we can easily obtain the MAP estimates of the mean directions $\mu_t$ and of the mixing distribution $\pi_i$ of each document.
Discussions. We notice that there are some topic models (Das et al., 2015; Batmanghelich et al., 2016) proposed for similar data, where words are represented as embedding vectors. Our model is proposed independently for the purpose of identifying semantic focuses, which serves the task of outlier detection. Existing models may lack signals for the following outlier detection steps and hence cannot be directly plugged in. However, it is possible to adapt certain models to the outlier detection task.

Identifying Semantic Focuses
The semantic regions learned from the Embedded vMF Allocation model provide a set of candidates frequently mentioned by documents in the corpus. However, not all of them are semantic focuses of the corpus; some are too general to distinguish outlier documents from normal ones.
We notice that uninformative semantic regions (e.g. a semantic region containing {"percent", "average", "compare", ...}) tend to have a more scattered distribution over embedded vectors, possibly because of the diverse contexts of their usage. In contrast, corpus-specific semantic regions are more concentrated (e.g. a semantic region containing {"drugs", "antidepressant", "prescription", ...}). Modeling semantic regions by vMF distributions provides us with a parsimonious signal to characterize how concentrated a semantic region is, i.e. the concentration parameter $\kappa_t$. This allows us to simply filter out unqualified semantic regions whose concentration parameters are too small and obtain high-quality semantic focuses.
Let a binary variable $\varphi_t$ ($t = 1, 2, \cdots, T$) indicate whether the $t$-th vMF distribution is a semantic focus. Suppose a user specifies a threshold parameter $0 \le \beta \le 1$. We can determine $\varphi_t$ by estimating the log-normal distribution that generates all $\kappa_t$'s, $\mathrm{logNormal}(\hat{m}, \hat{\sigma}^2)$, where $\hat{m}$ and $\hat{\sigma}^2$ are estimated from the values $\log \kappa_t$ (e.g. as their sample mean and variance). Let $\hat{F}_{\kappa}(\cdot)$ be its cumulative distribution function. We assign $\varphi_t = 1$ for semantic regions with $\kappa_t \ge \hat{F}_{\kappa}^{-1}(\beta)$, and filter out all the other semantic regions as $\varphi_t = 0$.
Although the parameter $\beta$ needs to be set manually, our experiments suggest the performance is not very sensitive to its value.
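The selection rule above can be sketched in a few lines. The sketch assumes that the log-normal is fitted with the mean and population standard deviation of the log concentrations (one plausible estimator; the text does not pin it down here), and the function name and the example concentrations are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev

def select_semantic_focuses(kappas, beta):
    """Flag region t as a semantic focus (1) when kappa_t is at or above the
    beta-quantile of a log-normal fitted to all concentrations.

    Requires 0 < beta < 1 and at least two distinct kappa values."""
    logs = [math.log(k) for k in kappas]
    m_hat = fmean(logs)
    sigma_hat = pstdev(logs)  # assumed estimator of the log-scale spread
    # The beta-quantile of logNormal(m, s^2) is exp of the Normal(m, s^2) quantile.
    threshold = math.exp(NormalDist(m_hat, sigma_hat).inv_cdf(beta))
    flags = [1 if k >= threshold else 0 for k in kappas]
    return flags, threshold

# Hypothetical concentration parameters of T = 5 semantic regions.
kappas = [20.0, 40.0, 80.0, 160.0, 320.0]
flags, threshold = select_semantic_focuses(kappas, beta=0.55)
```

With these values the threshold falls between 80 and 160, so only the two most concentrated regions are kept as semantic focuses.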

Document Outlierness
In this subsection, we start with a straightforward definition of outlierness based on the mined semantic focuses. Then we present several refinements to improve its robustness.

Baseline outlierness measure.
A straightforward intuition is to assume that outlier documents have, on average, fewer words or phrases drawn from semantic focuses. To estimate this, we first need to calculate the probability of each word being drawn from the semantic focuses.
It is then possible to take the expected percentage of words not drawn from semantic focuses in each document as the outlierness, which we refer to as Equation (1). However, due to the noisiness of text data, this assumption oversimplifies the characterization of outlier documents. In practice, we observe the following two issues: lexically general words/phrases, and noisy content in documents.
Penalizing lexically general words and phrases. Not all words or phrases close to semantic focuses are strong indicators of normal documents. General words (e.g. "science") can happen to be semantically close to a semantic focus, but are not as specific as most other words close to it (e.g. "medical research"). Therefore, we utilize a background corpus $D_{bg}$ to calculate the specificity of a word. Assuming the actual mention of the word can be chosen from either the general background or a corpus-specific vocabulary, we express the probability that a word is corpus-specific in terms of the following quantities: $nd(w) = |\{d_i \mid w \in d_i, d_i \in D\}|$ is the number of documents in $D$ containing word $w$; $nd_{bg}(w) = |\{d_i \mid w \in d_i, d_i \in D_{bg}\}|$ is the number of documents containing word $w$ in the background corpus $D_{bg}$; and $\lambda_{ij}$ is a binary random variable indicating whether $w_{ij}$ is specific enough.
We call a word orthodox if it is not only semantically close to a semantic focus of the corpus, but also sufficiently specific. We denote the probability that a word or phrase $w_{ij}$ in document $d_i$ is orthodox by $P(\phi_{ij} = 1 \mid x_{ij}, w_{ij})$, where $\phi_{ij} = 1$ indicates that $w_{ij}$ (or equivalently $x_{ij}$) is orthodox. Now, we can define a second outlierness measure as the expected percentage of words that are not orthodox, which we refer to as Equation (2).
Noisy content in documents. We present the second issue with an example. We compare a normal document in a corpus of New York Times news articles with tag "Health" to another document originally from another corpus, but with its outlierness calculated with regard to the semantic focuses of the "Health" corpus.
In Figure 1(a), we show the distribution of the inferred orthodox probability $P(\phi_{ij} = 1 \mid x_{ij}, w_{ij})$ by ranking the words or phrases according to their probability value. We can observe that the outlier document barely has any words or phrases that are surely orthodox, while the normal document has 5% of words or phrases with a probability no less than 0.8 of being orthodox. However, if we simply take the average, these two documents become indistinguishable, as the average is substantially dominated by the "tail" where most words or phrases in either document are clearly not orthodox. Let $n_i^{\phi}$ be a random variable indicating the true number of orthodox words or phrases in document $d_i$. Since $n_i^{\phi}$ follows a Poisson-Binomial distribution, we can plot the probability distribution of $n_i^{\phi}$ normalized by the length of the document, as shown in Figure 1(b). It can be observed that the difference between the normalized expectations $E[n_i^{\phi}] / |d_i|$ of the two documents is insignificant. Therefore, the measure described in Equation (2) will be unable to tell the difference between these two documents.
This example illustrates why the strategy of taking the average over the whole document can make mistakes, and also provides an important insight. As long as a document has a (potentially small) portion of words or phrases that are highly certain to be orthodox, it should not be considered as an outlier. Based on the above observation, we propose a third outlierness measure.

Orthodox quantile outlierness.
We define a quantile-based outlierness measure to rank document outliers. Notice that the random variable $n_i^{\phi}$ follows a Poisson-Binomial distribution: it is the total number of successful trials when one tosses a coin for each word or phrase in the document to determine whether it is orthodox, with success probability $P(\phi_{ij} = 1 \mid x_{ij}, w_{ij})$.
Moreover, we define the first $\frac{1}{1-\theta}$-quantile of the Poisson-Binomial distribution of $n_i^{\phi}$ (Equation (3)), where $0 < \theta < 1$ is a given parameter close to 1. Intuitively, it measures the maximum lower bound of $n_i^{\phi}$ we can guarantee with confidence $\theta$. Based on Equation (3), we can give a formalized definition of our proposed outlierness, in which the $\frac{1}{1-\theta}$-quantile is normalized by the document length with a smoothing constant. The cumulative probability distribution of a Poisson-Binomial distribution can be efficiently calculated by dynamic programming (Chen and Liu, 1997).
The advantage of the last proposed outlierness measure is that it places more emphasis on the highly orthodox words or phrases and eliminates the noise from a number of relatively uncertain ones.
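The contrast between the average-based and the quantile-based measures can be illustrated with a short sketch. Several details are assumptions made for illustration: the Poisson-Binomial distribution is computed with a simple quadratic-time recursion (not necessarily the algorithm of Chen and Liu), the outlierness is taken as one minus the guaranteed count divided by the document length plus a smoothing constant of 1, and the function names and probabilities are made up.

```python
def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli trials,
    built one trial at a time by dynamic programming."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this word is not orthodox
            nxt[k + 1] += mass * p       # this word is orthodox
        pmf = nxt
    return pmf

def guaranteed_orthodox_count(probs, theta):
    """Largest n such that P(number of orthodox words >= n) >= theta."""
    pmf = poisson_binomial_pmf(probs)
    tail = 1.0  # P(N >= 0)
    best = 0
    for n in range(len(pmf)):
        if tail < theta:
            break
        best = n
        tail -= pmf[n]  # now P(N >= n + 1)
    return best

def expected_outlierness(probs):
    """Average-based measure: expected fraction of non-orthodox words."""
    return 1.0 - sum(probs) / len(probs)

def quantile_outlierness(probs, theta=0.95, smoothing=1.0):
    """Quantile-based measure (assumed form, with an assumed smoothing constant)."""
    return 1.0 - guaranteed_orthodox_count(probs, theta) / (len(probs) + smoothing)

# Two 20-word documents with the same average orthodox probability (0.108).
normal_doc = [0.99, 0.99] + [0.01] * 18   # two clearly orthodox words
outlier_doc = [0.108] * 20                # no word is clearly orthodox
```

Both documents receive the same average-based score, but with $\theta = 0.95$ the first document is guaranteed two orthodox words while the second is guaranteed none, so only the quantile-based score separates them.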

Data Sets
New York Times News (NYT). We collected 41,959 news articles published in 2013 from The New York Times API. Each article is assigned a unique label indicating the section in which the article is published, such as Arts, Travel, Sports, and Health. There are 9 section labels in total in our collected data set. We treat the articles in each section as a corpus $D_s$. Thereby we have a set of corpora $\mathcal{D} = \{D_s\}$, without overlapping documents. We also have an external news data set $D_e$ crawled from Google News, with 51,114 news articles published in 2015 without any label information.
ArnetMiner Paper Abstracts (ARNET). We employ abstracts of papers published in the field of computer science up to 2013, collected by ArnetMiner (Tang et al., 2008), and assign each paper to a field, according to Wikipedia. We use papers from a set of domains to serve as an external corpus $D_e$, while papers in other domains form different corpora $\mathcal{D} = \{D_s\}$. Each domain (e.g., data mining, computational biology, and computer graphics) forms a corpus $D_s$. Again, notice that the corpora do not have overlapping documents with each other. A summary is presented in Table 1.
Benchmark generation. Since we do not have true labels for outliers in a corpus, we use an injection method to generate outlier detection benchmarks. For each data set, we randomly select a corpus $D_s \in \mathcal{D}$ and mark all of its documents as "normal documents". We then randomly select another corpus $D_{s'} \in \mathcal{D}$, $D_{s'} \ne D_s$, inject $\omega$ documents from $D_{s'}$ into $D_s$ and mark them as outliers. We confine $\omega$ to be a small integer of at most 1% of $|D_s|$. More concretely, $\omega$ is an integer uniformly sampled from $(0, 0.01|D_s|]$.
For each data set, we randomly generate 10 outlier detection benchmarks, and evaluate the overall performance by the average performance on all the benchmarks.
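The injection procedure can be sketched as follows. The function name, the placeholder documents and the fallback to a single outlier for very small corpora are assumptions made for illustration.

```python
import math
import random

def make_benchmark(normal_corpus, other_corpus, rng):
    """Inject omega documents from another corpus as labeled outliers.

    omega is drawn uniformly from the integers in (0, 0.01 * |normal_corpus|];
    for corpora with fewer than 100 documents we fall back to one outlier
    (an assumption, not specified in the text)."""
    max_outliers = max(1, math.floor(0.01 * len(normal_corpus)))
    omega = rng.randint(1, max_outliers)
    injected = rng.sample(other_corpus, omega)
    documents = [(doc, 0) for doc in normal_corpus]   # label 0: normal
    documents += [(doc, 1) for doc in injected]       # label 1: outlier
    rng.shuffle(documents)
    return documents

# Placeholder corpora (hypothetical document identifiers).
health = [f"health_{i}" for i in range(1000)]
sports = [f"sports_{i}" for i in range(500)]

rng = random.Random(7)
benchmark = make_benchmark(health, sports, rng)
num_outliers = sum(label for _, label in benchmark)
```

Repeating this with different random seeds and corpus pairs yields the multiple benchmarks over which performance is averaged.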

Methods Evaluated
We compare the performances of the following methods.
Cosine similarity based. We characterize each document as a vector, and use the negative average cosine similarity between each document and the corpus as outlierness. We use two different ways to vectorize documents: TF-IDF weighted, and paragraph2vec (Le and Mikolov, 2014). The two methods are denoted as TFIDF-COS and P2V-COS respectively.
KL divergence based. We represent each document as a probability distribution, and the entire corpus as another probability distribution. Then we use the KL-divergence between each document and the entire corpus as the outlierness. We also use two different ways to calculate the probability distribution. The first is to estimate the unigram distribution for each document and the entire corpus respectively, denoted as UNI-KL. The other is to first perform LDA on the entire corpus with 10 topics, and then infer topical allocation distribution of each document and the entire corpus. This method is represented as TM-KL.
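A minimal sketch of the unigram variant (UNI-KL) is given below. The direction of the divergence (document relative to corpus) and the additive smoothing constant are assumptions, since the description above does not fix them; the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def unigram_distribution(tokens, vocabulary, epsilon=1e-3):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + epsilon * len(vocabulary)
    return {w: (counts[w] + epsilon) / total for w in vocabulary}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Toy corpus: three medical documents and one sports document.
corpus = [
    ["drug", "dose", "patient"],
    ["patient", "drug", "trial"],
    ["dose", "trial", "patient"],
    ["goal", "match", "team"],
]
vocabulary = sorted({w for doc in corpus for w in doc})
corpus_dist = unigram_distribution([w for doc in corpus for w in doc], vocabulary)
scores = [
    kl_divergence(unigram_distribution(doc, vocabulary), corpus_dist)
    for doc in corpus
]
```

The sports document shares no vocabulary with the rest of the corpus, so it receives the largest divergence and would be ranked first as an outlier by this baseline.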
Our method. Our quantile-based method is denoted as VMF-Q. We also provide two baselines derived from our own method as an ablation analysis. One method abandons the quantile-based outlierness and instead uses the expected orthodox percentage as in Equation (2), denoted as VMF-E. The other method further removes the penalty on lexically general words and phrases, using Equation (1), denoted as VMF-SF.

Evaluation Measures
In most outlier detection applications, people are more concerned with recall. We measure the performance by recall at a certain percentage. More specifically, we compute the recall of outlier detection if the user checks a certain percentage r of the top-ranked documents in the output results. Since the percentage of outliers in our benchmark generation does not exceed 1%, the perfect result for any r ≥ 1% should be 1.0.
We choose r to be 1%, 2%, and 5% respectively and evaluate different methods with recall at top-r (percentage). We also report the performance in terms of mean average precision (MAP).
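Both evaluation measures can be computed from a ranked list as sketched below. Rounding the cut-off to the nearest integer, the function names and the toy scores are assumptions; MAP is then the mean of `average_precision` over all benchmarks.

```python
def recall_at_top(scores, labels, r):
    """Recall of outliers among the top fraction r of documents, ranked by
    decreasing outlierness. Assumes at least one document is labeled 1."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    k = max(1, int(round(r * len(scores))))
    found = sum(labels[i] for i in ranked[:k])
    return found / sum(labels)

def average_precision(scores, labels):
    """Average of the precision values at the rank of each true outlier."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits

# Toy ranking of 200 documents with two outliers, at ranks 1 and 5.
scores = [1.0, 0.5, 0.9, 0.8, 0.7] + [0.1] * 195
labels = [1, 1, 0, 0, 0] + [0] * 195

recall_1 = recall_at_top(scores, labels, 0.01)
recall_5 = recall_at_top(scores, labels, 0.05)
ap = average_precision(scores, labels)
```

In this example the top 1% (two documents) contains one of the two outliers, the top 5% contains both, and the average precision is (1/1 + 2/5) / 2.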

Parameter Configurations
All benchmark data sets are preprocessed as described in Section 2. In the NYT data set we remove words or phrases within the top 20% with respect to document frequency, while in the ARNET data set we remove the top 10%. The document frequency is calculated based on a background corpus $D_{bg}$, which is the same as the external corpus of NYT. Word embeddings are trained on the external data set $D_e$ using the code of Mikolov et al. (2013) with default parameter configurations, where the embedded vector length is set to 200. For paragraph2vec, we learn length-100 vectors for each document along with the external data set to guarantee sufficient training data.
For the prior vMF distribution, we set $C_0 = 0.1$, a sufficiently small number so that the prior distribution is close to a uniform distribution. $\mu_0$ is set as a normalized all-1 vector. We also set $m_0 = \log(100)$ and $\sigma_0^2 = 0.01$. The total number of Gibbs sampling steps is set to 50 times the total count of $z_{ij}$'s (i.e. $\eta = 50$). The number of vMF distributions $T$ is set to 20 in the NYT data set and 10 in the ARNET data set, due to the smaller sizes of corpora in the ARNET data set.
To determine semantic focuses, we set threshold parameter β = 0.55 for both data sets. The confidence parameter θ in outlierness calculation is set to 0.95 in both data sets. Our experiments later will show the performance is relatively robust to different configurations of both parameters.

Results
We present the experimental results in this section.
Performance comparison. Table 2 shows the performance of different outlier document detection methods. It can be observed that our method outperforms all the baselines in both data sets. In both data sets, VMF-Q achieves a 45% to 135% increase over the baselines in terms of recall when examining the top 1% of the outlier ranking. Generally, the performance of most methods is lower in the ARNET data set compared to NYT, potentially because of the relatively short document lengths and more technical terminology in ARNET.
Ablation analysis. Both refinements of the outlierness measure benefit the performance. Specifically, by changing the average-based outlierness to the quantile-based outlierness, the recall@1% can be improved by 50-75%, and the recall@5% can also be improved by more than 17%.
Sensitivity studies of parameters. We study whether our proposed method is sensitive to the confidence parameter θ and the filtering threshold parameter β. We compare the performance of VMF-Q by varying each parameter on both data sets. Figure 2(a) and 2(b) show that the performance is not very sensitive to different values of θ, as long as θ is sufficiently large (close to 1). Figure 2(c) and 2(d) show that the performance is relatively stable when β is between 0.5 and 0.7, but drops a little when β is set to a larger value.

Human judgments.
We compare VMF-Q to VMF-E and P2V-COS respectively by crowdsourcing, without artificially inserting "outliers". We conduct this experiment on two corpora in the NYT data set, with topics "Health" and "Art" respectively. To compare two methods, we randomly select pairs of documents d i and d j such that both are ranked as top-10% outliers by at least one method, but their orders in the two rankings disagree. We conduct the experiments on CrowdFlower. Online crowd workers are given d i and d j as well as other documents in the corpus, and are asked to judge which one of d i and d j deviates more from the corpus. For each corpus, we select 200 pairs of documents.
Before taking the questions, each crowd worker needs to go through at least 10 "test questions" for which we know the correct answer. These questions are constructed by taking one document from the corpus as d i and another document not from the corpus as d j . Therefore, the one not from the corpus should be the answer. A crowd worker needs to achieve an accuracy of no less than 80% to be eligible to work on actual questions, and the accuracy needs to be maintained over 80% during the work, which is measured by "test questions" hidden among the actual questions. Each question is answered by 3 workers. The final answer is determined by majority voting. Figure 3 presents the results. On both corpora, significantly more workers agree with VMF-Q than with P2V-COS, at significance level α = 0.05. This further verifies that our method VMF-Q can achieve better performance than the P2V-COS baseline. On the other hand, on both data sets we can still observe more workers favoring VMF-Q over VMF-E, but the difference is not as large as the difference between VMF-Q and P2V-COS.
Case study. We also conduct a case study to show how our proposed method outperforms other baselines. Table 3 shows two pairs of documents in "Health" corpus of NYT data set. The left two columns show some comparing methods and their higher ranked outlier documents. The row of "Crowds" shows the outlier document chosen by human workers from the crowdsourcing platform, with a consensus of opinions from multiple workers.
In the first document pair, document A is about gun control policy and is substantially irrelevant to "Health" topic, while document B is about lung infection cases. Document A is a significant outlier, and VMF-Q and VMF-E also agree with our intuition. However, paragraph2vec (P2V) ranks document B higher, probably because it tries to summarize the entire document.
In the second document pair, document B is clearly not an outlier as the story is about a new book of AIDS. In comparison, document A discussing U.S. population is an outlier. However, a great part of document B is about the content of the book, which confuses baselines P2V and VMF-E, as both methods tend to summarize the entire document and highly relevant words like "AIDS" are overwhelmed by the majority of the document. The only method that agrees with human annotators is VMF-Q.

Table 3: Case study of documents in "Health" corpus of NYT data set. We present several pairs of documents and how different methods rank the pair. The "Outlier" column indicates the document ranked higher in the outlier document ranking generated by the corresponding methods, and the row "Crowds" shows the ranking given by human evaluators.

First document pair
Document A: CHICAGO (AP) States with the most gun control laws have the fewest gun-related deaths, according to a study that suggests sheer quantity of measures might make a difference ...
Document B: A prominent Scottish bagpiping school has warned pipers around to world to clean their instruments regularly after one of its longtime members nearly died of a lung infection ...
Method     Outlier
P2V-COS    Doc B
VMF-E      Doc A
VMF-Q      Doc A
Crowds     Doc A

Second document pair
Document A: ATLANTA There's more evidence that U.S. births may be leveling off after years of decline. The number of babies born last year only slipped a little, ...
Document B: Young men in a state prison for juveniles and professors of library science from the University of South Carolina have joined forces to fight AIDS with a graphic novel ...
Method     Outlier
P2V-COS    Doc B
VMF-E      Doc B
VMF-Q      Doc A
Crowds     Doc A

Conclusion
In this paper, we propose a novel task of detecting document outliers from a given corpus. We propose a generative model to identify semantic focuses of a corpus, each represented as a vMF distribution in the embedded space. We also design a document outlierness measure. We experimentally verify the effectiveness of our methods. We hope this work provides insights for further studies on outlier document texts in specific domains, and in more challenging settings such as detecting outliers from crowdsourced data.