Unsupervised Anomaly Detection in Parole Hearings using Language Models

Each year, thousands of roughly 150-page parole hearing transcripts in California go unread because legal experts lack the time to review them. Yet, reviewing transcripts is the only means of public oversight in the parole process. To assist reviewers, we present a simple unsupervised technique for using language models (LMs) to identify procedural anomalies in long-form legal text. Our technique highlights unusual passages that suggest further review could be necessary. We score each passage with a contrastive perplexity, defined as the scaled difference between the passage's perplexities under two LMs: one fine-tuned on the target (parole) domain, and another pre-trained on out-of-domain text, which normalizes for grammatical or syntactic anomalies. We present quantitative analysis of the results and note that our method has identified some important cases for review. We are also excited about potential applications in unsupervised anomaly detection, and present a brief analysis of results for detecting fake TripAdvisor reviews.


Introduction
California houses America's largest "lifer" population, with 25% of its 115,000 prisoners serving life sentences. Each year, the Board of Parole Hearings (BPH) conducts thousands of parole hearings to decide whether to grant prisoners early release. As California has enacted legislation to reduce its prison population, the number of hearings is scheduled to double this year and continue to rise for the foreseeable future. While each hearing is transcribed into about 150 pages of dialogue and sent to the BPH and governor's office for review, capacity constraints mean that, in practice, only grants of parole are reviewed. Legal scholars who painstakingly analyzed small subsets of transcripts have found that parole decisions are sometimes made in an arbitrary and capricious manner (Bell, 2019), but they lack the resources for ongoing review.
PRESIDING COMMISSIONER: Let me ask you a question, Mr. [REDACT]. Are you angry?
INMATE [REDACT]: No.
PRESIDING COMMISSIONER: You seem kind of like you're a smart ass. I don't mean to say that rudely, but are you a smart ass?
Figure 1: Example of a semantic anomaly
To help alleviate these capacity constraints and allow for greater review of parole denials, we propose an automatic anomaly detection system that allows reviewers to focus their attention on the most anomalous portions of text in each hearing. The lack of gold anomaly labels precludes the use of many supervised anomaly detection techniques, so instead we propose using language models trained on the parole transcripts to perform unsupervised anomaly detection.
Defining an "anomaly" in this context is challenging. There are many ways in which a piece of text might be unusual without constituting grounds for additional review. We distinguish primarily between non-semantic, semantic, and procedural anomalies. We define a non-semantic anomaly as an irregularity in the linguistic structure of a piece of text (for instance, a sentence fragment). A semantic anomaly, by contrast, is one caused by the meaning of the text. In the context of a parole hearing, a conversation that deviates substantially from the typical topics of discussion would constitute a semantic anomaly. Finally, a procedural anomaly is an irregularity that indicates the hearing differed substantively from the prescribed guidelines. Often, a procedural anomaly will also be a semantic anomaly. Figure 1 represents such a case, as it both includes language atypical for a parole hearing and, more generally, indicates a breakdown in communication between the commissioner and the parole candidate. We note that there are also, of course, legal anomalies that do not manifest as atypical language.
A language model (LM) provides an organic way to identify unusual text through its perplexity score. We hypothesize that many procedural anomalies can be identified by examining statistical anomalies in the texts of transcripts, which would seemingly allow for their detection by an LM. However, most instances of unusual text found by a naive LM are non-semantic, consisting of typos, ungrammatical sentences, etc. To solve this problem, we instead use a pair of language models. We define our anomaly metric, the contrastive perplexity score, as the scaled difference between the perplexity of one LM, which has been fine-tuned on the target domain, and the perplexity of another LM, which has only been pre-trained on out-of-domain text. Non-semantic anomalies will have high perplexity under both LMs (and thus low contrastive perplexity), so the second LM acts as a "normalizer" for non-semantic content. We present our results on a human-annotated subset of the parole data. Our method recalls 71% of human-labeled procedural anomalies while only asking experts to review 50% of the text of each transcript. We also show that our method can be extended to other domains where a large labeled corpus of anomalous text is unavailable, namely the task of opinion spam detection in TripAdvisor reviews.

Related Work
Anomaly detection (AD) techniques cover a range of problem settings. Schölkopf et al. (1999) Text is a challenging regime for AD because of the importance of domain-dependence: what is shocking in one case might be mundane in another. Few, if any, universal features for AD exist. General approaches for text AD include non-negative matrix factorization (Kannan et al., 2017) and the use of "selectional preferences" (Dasigi and Hovy, 2014). One notable approach, studied in the discourse coherence literature, is to focus on local abnormalities in topics. Li and Jurafsky (2017) and Lin et al. (2011) present deep models for identifying incoherent passages of text, but discourse coherence studies much shorter text than parole hearings. To address longer text, our approach, like that of Guthrie et al. (2008), splits each document into segments ranked by anomaly score.
Our strategy of using LMs for AD has precedents, but these primarily use much simpler LMs and target AD contexts that require more supervision than is available in the parole hearing setting. Laskov (2006, 2007) and Aktolga et al. (2011) use n-gram LMs to identify anomalous sections and documents in a corpus of American bills presented before Congress. Axelrod et al. (2011) and Xu et al. (2019) also explore using a "baseline" LM for translation and discourse coherence, respectively.

Approach
Our model uses GPT-2, a transformer-based LM pre-trained on WebText, a corpus scraped from the internet (Radford et al., 2019; Vaswani et al., 2017). The following three observations motivate our approach to identifying anomalous text:
(1) The perplexity of a fine-tuned LM on a target domain yields a score that measures both genre-specific semantic anomalies and general language anomalies (e.g. ungrammatical inputs, misspellings, incoherence).
(2) The perplexity of an LM only pre-trained on many domains represents solely general language anomalies.
(3) Putting the two together, the difference in perplexity between a fine-tuned language model and a pre-trained language model gives a "semantic anomaly score" of a piece of text.
We define the contrastive perplexity LM anomaly score to be the scaled difference in perplexities observed from two models. One model, the fine-tuned LM, is fit to a target corpus of text, without any supervision on which passages are anomalies. The other model, the normalizer LM, is the out-of-the-box GPT-2 model (Radford et al., 2019; Vaswani et al., 2017).
For a mundane piece of text, both $\mathrm{pplx}_{\text{fine-tuned}}$ and $\mathrm{pplx}_{\text{norm.}}$ are low. For a non-semantic anomaly, both are high. In both cases, contrastive perplexity is low. However, for a semantic anomaly, we expect $\mathrm{pplx}_{\text{fine-tuned}}$ to be high, because of its sensitivity to the text's context domain, and $\mathrm{pplx}_{\text{norm.}}$ to be low, because the text may not otherwise be unusual in general English, leading to high contrastive perplexity.
Because the fine-tuned LM achieves a lower perplexity, we use β to re-scale the perplexity output of the normalizer and ensure the models operate at the same scale. While β can be tuned as a hyperparameter, a reasonable and balanced choice is the ratio between the mean perplexities achieved by the fine-tuned model and the normalizer model on a validation dataset.
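The score described above can be sketched in a few lines. This is a minimal illustration, not the exact implementation: it assumes the per-token log-probabilities of a chunk under each LM are already available, it reads "scaled difference" as pplx_fine-tuned − β · pplx_norm. (a form inferred from the three cases discussed above), and the function names and toy numbers are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one chunk from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def fit_beta(finetuned_pplx, normalizer_pplx):
    """beta = mean fine-tuned perplexity / mean normalizer perplexity,
    both measured on the same validation chunks."""
    mean_ft = sum(finetuned_pplx) / len(finetuned_pplx)
    mean_norm = sum(normalizer_pplx) / len(normalizer_pplx)
    return mean_ft / mean_norm

def contrastive_perplexity(pplx_ft, pplx_norm, beta):
    """Assumed form of the score: fine-tuned perplexity minus the
    re-scaled normalizer perplexity."""
    return pplx_ft - beta * pplx_norm

# Toy validation perplexities (made up for illustration).
val_ft = [8.0, 10.0, 12.0]
val_norm = [20.0, 25.0, 30.0]
beta = fit_beta(val_ft, val_norm)

# Three toy chunks: mundane, non-semantic anomaly, semantic anomaly.
mundane = contrastive_perplexity(9.0, 22.0, beta)
non_semantic = contrastive_perplexity(40.0, 100.0, beta)
semantic = contrastive_perplexity(40.0, 25.0, beta)
```

In this toy setup only the third chunk, which is surprising to the fine-tuned model but unremarkable to the normalizer, receives a clearly elevated score.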

Anomaly Aggregation
We can use our LM anomaly score to identify the top k chunks of anomalous text for a given set of documents directly. In a completely unsupervised setting, with no labels as to which documents (or chunks) are anomalies, there is no way to associate the absolute contrastive perplexity scores with the predictive target. However, if given a clean dataset (i.e. a validation set that is labeled and known not to contain anomalies) we can instead anchor the scores to the clean dataset and detect anomalies by performing an out-of-distribution test.
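One simple way to anchor scores to a clean dataset is sketched below. The specific test is our assumption, since the text does not fix one: we take an empirical quantile of the clean scores as a threshold and flag any chunk that exceeds it. The function names, the quantile level `q`, and the toy scores are all hypothetical.

```python
import math

def empirical_quantile(values, q):
    """Nearest-rank q-quantile of a list of numbers, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def flag_out_of_distribution(scores, clean_scores, q=0.95):
    """Flag each score that exceeds the q-quantile of the scores
    observed on the clean (anomaly-free) dataset."""
    threshold = empirical_quantile(clean_scores, q)
    return [s > threshold for s in scores]

# Toy contrastive perplexity scores on a clean validation set.
clean_scores = [0.1, 0.3, -0.2, 0.5, 0.0, 0.2, 0.4, -0.1, 0.6, 0.3]

# Scores for three new chunks; only those above the clean quantile are flagged.
flags = flag_out_of_distribution([0.2, 4.0, 0.7], clean_scores, q=0.9)
```

The quantile level trades reviewer workload against recall: a lower `q` flags more chunks for review.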

Baselines
We compare our model to a number of unsupervised baseline models. Within AD, most existing algorithms are unsuitable for our task (e.g. due to the need for supervision or incompatibility with long-form text). The most straightforward baseline is simply the fine-tuned GPT-2 model alone. We also compare our work to an unsupervised topic-modeling baseline that should also be agnostic to non-semantic anomalies, like Misra et al. (2008). We fit a latent Dirichlet allocation (LDA) model (Blei et al., 2003) to our training corpus, then compute the mean representation and covariance matrix over topics on a held-out portion of data. At prediction time, we compute the LDA representation $f(x)$ for some text $x$ and use its Mahalanobis distance from the mean representation as our anomaly score:
$$d(x) = \sqrt{(f(x) - \mu_T)^\top \Sigma_T^{-1} (f(x) - \mu_T)},$$
where $\mu_T$ and $\Sigma_T$ are the sample mean and covariance over the topic mixture embeddings, respectively.
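The Mahalanobis scoring step of the LDA baseline can be sketched as follows, using only the standard library. This is an illustration under stated assumptions rather than the baseline's actual code: the topic vectors are made up, the helper names are hypothetical, and because topic proportions sum to one (which makes the full covariance singular) the toy vectors keep only the first two of three topics.

```python
import math

def mean_and_covariance(rows):
    """Sample mean and (unbiased) covariance of a list of vectors."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis(fx, mu, cov):
    """sqrt((f(x) - mu)^T cov^{-1} (f(x) - mu)), without forming cov^{-1}."""
    diff = [a - b for a, b in zip(fx, mu)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

# Toy held-out topic mixtures (first two of three topics, made up).
held_out = [[0.6, 0.3], [0.5, 0.4], [0.7, 0.2], [0.55, 0.3], [0.65, 0.35]]
mu_T, sigma_T = mean_and_covariance(held_out)

typical = mahalanobis([0.6, 0.3], mu_T, sigma_T)
unusual = mahalanobis([0.1, 0.1], mu_T, sigma_T)
```

A chunk whose topic mixture sits far from the held-out mean, relative to the held-out spread, receives the larger distance and hence the higher anomaly score.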

Parole Hearings
Our analysis is performed over the complete set of parole hearing transcripts in California between January 2007 and July 2018, which totals 30,734 transcripts. Each document is a transcript of an hours-long conversation between the parole board and a candidate (other parties are occasionally also present), which ends in a decision from the parole board. Transcribed, each hearing is roughly 27,000 tokens long. We train our model on a training corpus of 27,577 transcripts, each split into non-overlapping chunks of 1024 tokens. We fit β on a validation corpus of 1,963 transcripts, with chunksize 256. The training chunksize was selected to maximize efficiency of the underlying GPT-2 model, while the smaller validation chunksize better matches the scale at which we expect to observe linguistic anomalies. We collected a held-out test corpus of anomalies over 315 transcripts by asking undergraduate and law students to label instances of anomalous language. Out of 82,959 chunks, students found 179 anomalies. An experienced parole attorney checked the anomalies and confirmed 68. Student reviewers were asked to identify semantic anomalies and the expert was asked to determine which of those were also procedural anomalies. While we believe that this offers a viable estimate of the true set of procedural anomalies, it leaves out anomalies that are not manifested by irregular language. To evaluate our model's recall, we investigate the tradeoff between the share of the expert's "true anomalies" we recover and the number of chunks human reviewers must read. We asked the parole attorney to review our model's predictions at a fixed threshold. We compute the mean reciprocal rank (MRR), rather than precision, because a single anomaly suffices to flag a whole transcript for review: only the rank of the highest scoring anomaly affects reviewer time. Details are given in Appendix B.
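The MRR computation can be illustrated with a short sketch. It follows the standard definition and is not taken from Appendix B: for each transcript, chunks are ranked by anomaly score and the reciprocal of the rank of the first confirmed anomaly is recorded. Treating a transcript with no confirmed anomaly as contributing 0 is our assumption, and the names and toy data are hypothetical.

```python
def reciprocal_rank(scores, is_anomaly):
    """1 / rank of the highest-scoring true anomaly in one transcript.
    Returns 0.0 if the transcript has no true anomaly (an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if is_anomaly[idx]:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(transcripts):
    """Average reciprocal rank over (scores, labels) pairs, one per transcript."""
    total = sum(reciprocal_rank(s, y) for s, y in transcripts)
    return total / len(transcripts)

# Two toy transcripts: per-chunk scores and expert labels.
transcripts = [
    ([0.9, 0.2, 0.5], [False, False, True]),  # true anomaly ranked 2nd
    ([0.1, 0.8, 0.3], [False, True, False]),  # true anomaly ranked 1st
]
mrr = mean_reciprocal_rank(transcripts)
```

Because only the first true anomaly in each ranking matters, lower-ranked anomalies in the same transcript do not change the value.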

Hotel Reviews
Our second experiment is performed over the Deceptive Opinion Spam dataset (Ott et al., 2011, 2013). The dataset consists of 1,600 short human-generated reviews of 20 hotels in the Chicago area. 800 of these reviews were scraped from TripAdvisor and are marked "authentic"; the remaining 800 reviews are marked "anomalous" and were generated by Mechanical Turk workers. In order to fine-tune GPT-2, we use a collection of TripAdvisor reviews collected by Wang et al. (2010). 3 We only include the 171,016 reviews that were shorter than 1024 tokens and longer than 30 tokens. Additionally, we hold out 10,000 reviews to fit µ and Σ for the LDA baseline.

Model & Training
We use the GPT-2 base model for all of our experiments, trained for 48 hours using the Adam optimizer with an initial learning rate of $10^{-5}$ and linear decay.

Parole Hearings
Our fine-tuned and normalizer models achieve mean perplexities of 9.22 and 22.99 (β = 0.40), respectively, on the validation set with fixed chunksize 256. Figure 2 describes the tradeoff between recall and the percentage of the transcript human reviewers must read, for our model and baselines, as we vary the threshold. Contrastive perplexity outperforms all baselines, but overall recall is low. We also observe that the LM anomaly score produced by our model is not well-conserved across documents. Rather than using a global threshold for our model, we can instead ask reviewers to always use the top k predictions for each document. Table 1 shows recall for different values of k.
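The per-document top-k strategy can be sketched as below. This is an illustrative reading of the procedure, with hypothetical function names and toy scores: each document contributes its k highest-scoring chunks regardless of how its scores compare with those of other documents, and recall is the share of labeled anomalies that fall within those selections.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring chunks of one document."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def recall_at_top_k(documents, k):
    """documents: list of (scores, set of anomalous chunk indices).
    Returns the fraction of all labeled anomalies found in the top k
    chunks of their own document."""
    found = total = 0
    for scores, anomalies in documents:
        selected = top_k_indices(scores, k)
        found += len(selected & anomalies)
        total += len(anomalies)
    return found / total if total else 0.0

# Toy documents. The second has uniformly higher scores, so a single
# global threshold would treat the two documents very differently.
documents = [
    ([5.0, 1.0, 3.0, 0.5], {0, 3}),
    ([100.0, 120.0, 90.0, 150.0], {3}),
]
recall_k1 = recall_at_top_k(documents, 1)
recall_k4 = recall_at_top_k(documents, 4)
```

Selecting per document fixes the reviewer's workload at k chunks per transcript, which sidesteps the problem of scores that are not comparable across documents.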
We evaluate our model's precision at the threshold that yields an average of 55 chunks per document (corresponding to about 52% of average transcript length) and recall of 0.68, marked on the plot. At this threshold, our model achieves an MRR of 0.227. Student annotators achieve 0.264 precision (note that, because the ratings from the students were not ranked, it is not possible to compute their MRR). The low human precision underscores the intrinsic difficulty of the task and the level of disagreement between human annotators over what constitutes an anomaly.
3 We ensured that there is no overlap between the reviews used for fine-tuning and the Deceptive Opinion Spam dataset.

Hotel Reviews
Our fine-tuned and normalizer models achieve mean perplexities of 22.62 and 53.60 (β = 0.42), respectively, on the validation set of "real" TripAdvisor reviews. Figure 3 shows the ROC curve of our model compared to baselines, using our unsupervised LM anomaly measure as a "fake review classifier" on the Deceptive Opinion Spam dataset. Our model achieves an F1 of 0.537 at the optimal threshold. With manually tuned β = 1.0, we achieve an F1 of 0.679. While below the 0.898 F1 achieved by the best fully supervised models (Ott et al., 2011), this indicates that our model is a promising unsupervised predictor.
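A threshold sweep of the kind implied by "F1 at the optimal threshold" can be sketched as follows. We assume here that the optimal threshold is the one that maximizes F1 against the dataset's labels; the function names, scores, and labels are hypothetical and do not reproduce the reported numbers.

```python
def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean labels (True = fake)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1(scores, actual):
    """Try every observed score as a threshold (flag if score >= threshold)
    and return the best (F1, threshold) pair."""
    best = (0.0, None)
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        f1 = f1_score(predicted, actual)
        if f1 > best[0]:
            best = (f1, threshold)
    return best

# Toy anomaly scores for six reviews and whether each is fake.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [False, False, True, True, True, False]
f1, threshold = best_f1(scores, labels)
```

Note that choosing the threshold this way uses labels, so it measures how well the unsupervised score separates the classes rather than a fully label-free pipeline.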

Discussion & Conclusion
We present a novel contrastive perplexity-based approach for unsupervised anomaly detection. We define semantic and non-semantic anomalies, and present evidence that our model can distinguish between them better than other unsupervised baselines. Detecting procedural anomalies in legal cases is easier with structured data, but that data is often not readily available. Our approach seeks to support legal decision makers in identifying anomalous cases for review when structured records are unavailable.
Our experiments on an unexplored dataset of 30,734 parole hearing transcripts have identified troubling cases for review. However, our quantitative evaluations also show the difficulty of defining a semantic anomaly consistently. Our results on detecting fake hotel reviews indicate that our approach becomes more powerful when anomaly-free documents are available to perform an out-of-distribution test.
In future work, we seek to use conditional LMs to bridge the gap between our unsupervised method and settings in which some structured data is available.