Neural Legal Judgment Prediction in English

Legal judgment prediction is the task of automatically predicting the outcome of a court case, given a text describing the case’s facts. Previous work on using neural models for this task has focused on Chinese; only feature-based models (e.g., using bags of words and topics) have been considered in English. We release a new English legal judgment prediction dataset, containing cases from the European Court of Human Rights. We evaluate a broad variety of neural models on the new dataset, establishing strong baselines that surpass previous feature-based models in three tasks: (1) binary violation classification; (2) multi-label classification; (3) case importance prediction. We also explore if models are biased towards demographic information via data anonymization. As a side-product, we propose a hierarchical version of BERT, which bypasses BERT’s length limitation.


Introduction
Legal information is often represented in textual form (e.g., legal cases, contracts, bills). Hence, legal text processing is a growing area in NLP with various applications such as legal topic classification (Nallapati and Manning, 2008; Chalkidis et al., 2019), court opinion generation (Ye et al., 2018) and analysis (Wang et al., 2012), legal information extraction (Chalkidis et al., 2018), and entity recognition (Cardellino et al., 2017; Chalkidis et al., 2017). Here, we focus on legal judgment prediction, where given a text describing the facts of a legal case, the goal is to predict the court's outcome (Aletras et al., 2016; Șulea et al., 2017; Luo et al., 2017; Zhong et al., 2018; Hu et al., 2018).
Such models may assist legal practitioners and citizens, while reducing legal costs and improving access to justice (Lawlor, 1963; Katz, 2012; Stevenson and Wagoner, 2015). Lawyers and judges can use them to estimate the likelihood of winning a case and come to more consistent and informed judgments, respectively. Human rights organizations and legal scholars can employ them to scrutinize the fairness of judicial decisions, unveiling if they correlate with biases (Doshi-Velez and Kim, 2017; Binns et al., 2018).
This paper contributes a new publicly available English legal judgment prediction dataset of cases from the European Court of Human Rights (ECHR). Unlike Aletras et al. (2016), who provide only features from approx. 600 ECHR cases, our dataset is substantially larger (∼11.5k cases) and provides access to the raw text. As a second contribution, we evaluate several neural models in legal judgment prediction for the first time in English. We consider three tasks: (1) binary classification (i.e., violation of a human rights article or not), the only task considered by Aletras et al. (2016); (2) multi-label classification (type of violation, if any); (3) case importance detection. In all tasks, neural models outperform an SVM with bag-of-words (Aletras et al., 2016; Medvedeva et al., 2018), the only method tested in English legal judgment prediction so far. As a third contribution, we use an approach based on data anonymization to study, for the first time, whether the legal predictive models are biased towards demographic information or factual information relevant to human rights. Finally, as a side-product, we propose a hierarchical version of BERT (Devlin et al., 2019), which bypasses BERT's length limitation and leads to the best results.

ECHR Dataset
ECHR hears allegations that a state has breached human rights provisions of the European Convention of Human Rights. 2 Our dataset contains approx. 11.5k cases from ECHR's public database. 3 For each case, the dataset provides a list of facts extracted using regular expressions from the case description, as in Aletras et al. (2016) 4 (see Fig. 1). Each case is also mapped to articles of the Convention that were violated (if any). An importance score is also assigned by ECHR (see Section 3).
The dataset is split into training, development, and test sets (Table 1). The training and development sets contain cases from 1959 through 2013, and the test set from 2014 through 2018. The training and development sets are balanced, i.e., they contain equal numbers of cases with and without violations. We opted to use a balanced training set to make sure that our data and consequently our models are not biased towards a particular class. The test set contains more (66%) cases with violations, which is the approximate ratio of cases with violations in the database. We also note that 45 out of 66 labels are not present in the training set, while another 11 are present in fewer than 50 cases. Hence, the dataset of this paper is also a good testbed for few-shot learning.
3 Legal Prediction Tasks

Binary Violation
Given the facts of a case, we aim to classify it as positive if any human rights article or protocol has been violated and negative otherwise.

Multi-label Violation
Similarly, the second task is to predict which specific human rights articles and/or protocols have been violated (if any). The total number of articles and protocols of the European Convention of Human Rights is 66 to date. For that purpose, we define a multi-label classification task where no labels are assigned when there is no violation.
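The relation between the first two tasks can be sketched as a multi-hot encoding, where an all-zero vector stands for "no violation". This is only an illustration: the four-label inventory and the function names below are hypothetical (the dataset uses L = 66 labels), and the paper does not prescribe this exact representation.

```python
from typing import List

# Hypothetical, shortened label inventory; the dataset itself has 66 labels.
LABELS = ["Article 3", "Article 5", "Article 6", "Article 13"]


def encode_violations(violated: List[str], labels: List[str] = LABELS) -> List[int]:
    """Return a multi-hot vector; an all-zero vector means no violation."""
    violated_set = set(violated)
    return [1 if label in violated_set else 0 for label in labels]


def binary_label(multi_hot: List[int]) -> int:
    """Binary violation label: 1 if any article was violated, else 0."""
    return int(any(multi_hot))


example = encode_violations(["Article 3", "Article 13"])
no_violation = encode_violations([])
```

Under this encoding, the binary task is simply the logical OR over the multi-label targets.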

Case Importance
We also predict the importance of a case on a scale from 1 (key case) to 4 (unimportant) in a regression task. These scores, provided by the ECHR, denote a case's contribution in the development of case-law, allowing legal practitioners to identify pivotal cases. Overall in the dataset, the scores are: 1 (1,096 documents), 2 (904), 3 (2,982) and 4 (6,496), indicating that approx. 10% are landmark cases, while the vast majority (83%) are considered more or less unimportant for further review.

2 An up-to-date copy of the European Convention of Human Rights is available at https://www.echr.coe.int/Documents/Convention_ENG.pdf.
3 See https://hudoc.echr.coe.int. Licensing conditions are compatible with the release of our dataset.
4 Using regular expressions to segment legal text from ECHR is usually trivial, as the text has a specific structure.
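The percentages quoted for the importance scores can be checked with a few lines of arithmetic. The sketch assumes that "landmark" refers to score 1 and that "more or less unimportant" covers scores 3 and 4; that grouping is an interpretation of the text rather than something it states explicitly.

```python
# Document counts per importance score, as reported in the text.
counts = {1: 1096, 2: 904, 3: 2982, 4: 6496}

total = sum(counts.values())
share = {score: n / total for score, n in counts.items()}

# Assumed grouping: score 1 = landmark, scores 3 and 4 = low importance.
landmark_pct = 100 * share[1]
low_importance_pct = 100 * (share[3] + share[4])

print(total, round(landmark_pct), round(low_importance_pct))
```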

Neural Models
BiGRU-Att: The first model is a BIGRU with self-attention (Xu et al., 2015) where the facts of a case are concatenated into a word sequence. Words are mapped to embeddings and passed through a stack of BIGRUs. A single case embedding ($h$) is computed as the sum of the resulting context-aware embeddings ($h = \sum_i a_i h_i$) weighted by self-attention scores ($a_i$). The case embedding ($h$) is passed to the output layer using a sigmoid for binary violation, softmax for multi-label violation, or no activation for case importance regression.
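The attention-weighted sum that yields the case embedding can be sketched in plain Python. The scoring function is an assumption: the text does not say how the scores a_i are obtained, so the sketch uses a dot product with a hypothetical learned vector u followed by a softmax. The BIGRU outputs are replaced by toy vectors.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(states: List[List[float]], u: List[float]) -> List[float]:
    """Compute h = sum_i a_i * h_i, with a = softmax(h_i . u) (assumed scorer)."""
    scores = [sum(x * w for x, w in zip(h_i, u)) for h_i in states]
    a = softmax(scores)
    dim = len(states[0])
    return [sum(a_i * h_i[d] for a_i, h_i in zip(a, states)) for d in range(dim)]


# Toy context-aware embeddings standing in for BIGRU outputs.
states = [[1.0, 0.0], [0.0, 1.0]]
h = attention_pool(states, u=[0.0, 0.0])  # zero scorer gives uniform weights
```

With a zero scoring vector the weights are uniform, so the case embedding reduces to the mean of the word states.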

HAN:
The Hierarchical Attention Network (Yang et al., 2016) is a state-of-the-art model for text classification. We use a slightly modified version where a BIGRU with self-attention reads the words of each fact, as in BIGRU-ATT, producing fact embeddings. A second-level BIGRU with self-attention reads the fact embeddings, producing a single case embedding that goes through a similar output layer as in BIGRU-ATT.

LWAN:
The Label-Wise Attention Network (Mullenbach et al., 2018) has been shown to be robust in multi-label classification. Instead of a single attention mechanism, LWAN employs L attentions, one for each possible label. This produces L case embeddings ($h^{(l)} = \sum_i a_{l,i} h_i$) per case, each one specialized to predict the corresponding label. Each of the case embeddings goes through a separate linear layer (L linear layers in total), each with a sigmoid, to decide if the corresponding label should be assigned. Since this is a multi-label model, we use it only in multi-label violation.
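A minimal sketch of label-wise attention follows, assuming dot-product scoring with one hypothetical query vector per label; the text only states that there are L attentions and L linear layers with sigmoids. All parameter values here are toy placeholders, not trained weights.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def label_wise_attention(states, queries, weights, biases) -> List[float]:
    """Return one probability per label.

    For label l: a_l = softmax(h_i . q_l) (assumed scorer),
    h^(l) = sum_i a_{l,i} h_i, p_l = sigmoid(w_l . h^(l) + b_l).
    """
    dim = len(states[0])
    probs = []
    for q, w, b in zip(queries, weights, biases):
        a = softmax([dot(h_i, q) for h_i in states])
        h_l = [sum(a_i * h_i[d] for a_i, h_i in zip(a, states)) for d in range(dim)]
        probs.append(1.0 / (1.0 + math.exp(-(dot(h_l, w) + b))))
    return probs


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word states
num_labels = 3                                  # toy L; the dataset has 66
zeros = [[0.0, 0.0] for _ in range(num_labels)]
probs = label_wise_attention(states, zeros, zeros, [0.0] * num_labels)
```

Because each label has its own sigmoid, any subset of labels (including none) can be predicted for a case.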

BERT:
For a new task, a task-specific layer is added on top of BERT and is trained jointly by fine-tuning on task-specific data. We add a linear layer on top of BERT, with a sigmoid, softmax, or no activation, for binary violation, multi-label violation, and case importance, respectively. BERT can process texts up to 512 wordpieces, whereas our case descriptions are up to 2.6k words, so we truncate them to BERT's maximum length, which hurts its performance. This also highlights an important limitation of BERT in processing long documents, a common characteristic in legal text processing.
To surpass BERT's maximum length limitation, we also propose a hierarchical version of BERT (HIER-BERT). First, BERT-BASE reads the words of each fact, producing fact embeddings. Then a self-attention mechanism reads the fact embeddings, producing a single case embedding that goes through a similar output layer as in HAN.
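To illustrate why encoding each fact separately sidesteps the length limit, the following toy sketch counts how many tokens an encoder with a 512-token limit would actually see under flat truncation versus per-fact encoding. It does not run BERT; the function names are hypothetical and placeholder tokens stand in for wordpieces.

```python
from typing import List

MAX_LEN = 512  # BERT's wordpiece limit, as stated in the text


def tokens_seen_flat(facts: List[List[str]], max_len: int = MAX_LEN) -> int:
    """Flat setup: facts are concatenated, then truncated to max_len tokens."""
    concatenated = [tok for fact in facts for tok in fact]
    return len(concatenated[:max_len])


def tokens_seen_hierarchical(facts: List[List[str]], max_len: int = MAX_LEN) -> int:
    """Hierarchical setup: the limit applies to each fact independently."""
    return sum(len(fact[:max_len]) for fact in facts)


# Toy case with three facts of 400 placeholder tokens each (1,200 in total).
facts = [["tok"] * 400 for _ in range(3)]
flat = tokens_seen_flat(facts)
hierarchical = tokens_seen_hierarchical(facts)
```

In this toy case the flat encoder sees well under half of the text, whereas the hierarchical setup covers all of it; a single fact longer than the limit would still be truncated.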

Experimental Setup
Hyper-parameters: We use pre-trained GLOVE (Pennington et al., 2014) embeddings (d = 200) for all experiments. Hyper-parameters are tuned by randomly sampling 50 combinations and selecting the values with the best development loss in each task. Given the best hyper-parameters, we perform five runs for each model, reporting mean scores and standard deviations. We use categorical cross-entropy loss for the classification tasks and mean absolute error for the regression task, Glorot initialization (Glorot and Bengio, 2010), Adam (Kingma and Ba, 2015) with default learning rate 0.001, and early stopping on the development loss.
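The tuning procedure (sample 50 combinations, keep the one with the lowest development loss) can be sketched as a plain random search. The search space and the toy loss function below are hypothetical placeholders; the actual ranges used in the experiments are not given in this text, and in practice the loss function would train a model and evaluate it on the development set.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "hidden_units": [100, 200, 300],
    "dropout": [0.1, 0.2, 0.3, 0.4],
    "batch_size": [8, 16, 32],
}


def sample_config(rng: random.Random) -> Dict[str, float]:
    """Draw one value per hyper-parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(dev_loss_fn: Callable[[Dict], float],
                  n_trials: int = 50, seed: int = 0) -> Tuple[Dict, float]:
    """Return the sampled configuration with the lowest development loss."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        loss = dev_loss_fn(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss


def toy_dev_loss(cfg: Dict) -> float:
    """Placeholder for 'train the model, return its development loss'."""
    return abs(cfg["dropout"] - 0.2) + abs(cfg["hidden_units"] - 200) / 1000


best_cfg, best_loss = random_search(toy_dev_loss)
```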
Baselines: A majority-class (MAJORITY) classifier is used in binary violation and case importance. A second baseline (COIN-TOSS) randomly predicts violation or not in the binary violation task. We also compare our methods against a linear SVM with bag-of-words features (most frequent [...] multi-label task; and Support Vector Regression (BOW-SVR) for the case importance prediction.
Table 2: Macro precision (P), recall (R), F1 for the binary violation prediction task (± std. dev.).

Binary Violation Results
Table 2 (upper part) shows the results for binary violation. We evaluate models using macro-averaged precision (P), recall (R), and F1. The weak baselines (MAJORITY, COIN-TOSS) are widely outperformed by the rest of the methods. BIGRU-ATT outperforms in F1 (79.5 vs. 71.8) the previous best performing method (Aletras et al., 2016) in English judicial prediction. This is aligned with results in Chinese (Luo et al., 2017; Zhong et al., 2018; Hu et al., 2018). HAN slightly improves over BIGRU-ATT (80.5 vs. 79.5), while being more robust across runs (0.2% vs. 2.7% std. dev.). BERT's poor performance is due to the truncation of case descriptions, while HIER-BERT, which uses the full case, leads to the best results. We omit BERT from the following tables, since it performs poorly.
Fig. 1 shows the attention scores over words and facts of HAN for a case that ECHR found to violate Article 3, which prohibits torture and 'inhuman or degrading treatment or punishment'. Although fact-level attention wrongly assigns high attention to the first fact, which seems irrelevant, it then successfully focuses on facts 2-4, which report that police officers beat the applicant for several hours, that the applicant complained, was referred for forensic examination, diagnosed with concussion, etc.
Word attention also successfully focuses on words like 'concussion', 'bruises', 'damaged', but it also highlights entities like 'Kharkiv', its 'District Police Station' and 'City Prosecutor's office', which may be indications of bias.
Model Biases: We next investigate how sensitive our models are to demographic information appearing in the facts of a case. Our assumption is that an unbiased model should not rely on information about nationality, gender, age, etc. To test the sensitivity of our models to such information, we train and evaluate them on an anonymized version of the dataset. The data is anonymized by using SPACY's (https://spacy.io) Named Entity Recognizer, replacing all recognized entities with type tags (e.g., 'Kharkiv' → LOCATION).
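The replacement step of the anonymization can be sketched independently of the entity recognizer. In the sketch below the entity spans are supplied by hand as (start, end, tag) character offsets; in the setup described above they would come from a named entity recognizer. The function name and the example sentence are hypothetical.

```python
from typing import List, Tuple


def anonymize(text: str, entities: List[Tuple[int, int, str]]) -> str:
    """Replace each non-overlapping (start, end, tag) character span by its tag."""
    pieces, cursor = [], 0
    for start, end, tag in sorted(entities):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(tag)                 # type tag instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last entity
    return "".join(pieces)


text = "The applicant was taken to the Kharkiv District Police Station."
start = text.index("Kharkiv")
spans = [(start, start + len("Kharkiv"), "LOCATION")]
anonymized = anonymize(text, spans)
```

Training and evaluating on text processed this way removes the surface form of the entity while keeping its type visible to the model.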
While neural methods seem to exploit named entities among other information, as in Figure 1, the results in Table 2 indicate that performance is comparable even when this information is masked, with the exception of HIER-BERT, which has noticeably worse results (2%) compared to using non-anonymized data, suggesting model bias. We speculate that HIER-BERT is more prone to overfitting compared to the other neural methods that rely on frozen GLOVE embeddings, because the embeddings of BERT's wordpieces are trainable and thus can freely adjust to the vocabulary of the training documents, including demographic information.
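For reference, macro-averaged scores for the binary task can be computed as below. Two conventions are assumed here and are not confirmed by the text: macro-F1 is taken as the mean of per-class F1 scores, and a class that is never predicted gets precision 0. The toy data mimics a test set with 66% positive cases and a majority-class predictor.

```python
from typing import List, Sequence, Tuple


def macro_prf(gold: List[int], pred: List[int],
              classes: Sequence[int] = (0, 1)) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall and F1 (mean of per-class scores)."""
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division rule
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        ps.append(p_c)
        rs.append(r_c)
        fs.append(f_c)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n


gold = [1] * 66 + [0] * 34       # toy test set: 66% violations
majority_pred = [1] * 100        # always predict the majority class
p, r, f1 = macro_prf(gold, majority_pred)
```

Because macro-averaging weights both classes equally, a majority-class predictor scores poorly even though it is right on two thirds of the cases.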

Multi-label Violation Results
Table 3 reports micro-averaged precision (P), recall (R), and F1 results for all methods, now including LWAN, in multi-label violation prediction. The results are also grouped by label frequency for all (OVERALL), FREQUENT, and FEW labels (articles), counting frequencies on the training subset.
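Micro-averaging pools true positives, false positives and false negatives over all labels and cases before computing the scores. A minimal sketch on toy multi-hot vectors follows; the function name and the data are illustrative only.

```python
from typing import List, Tuple


def micro_prf(gold: List[List[int]], pred: List[List[int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over multi-hot label vectors."""
    tp = fp = fn = 0
    for g_row, p_row in zip(gold, pred):
        for g, p in zip(g_row, p_row):
            tp += int(g == 1 and p == 1)
            fp += int(g == 0 and p == 1)
            fn += int(g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Three toy cases, three labels; the second case has no violation.
gold = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
micro_p, micro_r, micro_f1 = micro_prf(gold, pred)
```

Since counts are pooled, frequent labels dominate the micro-averaged scores, which is one reason for also reporting results per frequency group.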
We observe that predicting specific articles that have been violated is a much more difficult task than predicting if any article has been violated in a binary setup (cf. Table 2). Overall, HIER-BERT outperforms BIGRU-ATT and LWAN (60.0 vs. 57.6 micro-F1), the latter of which is tailored for multi-label tasks, while being comparable with HAN (60.0 vs. 59.9 micro-F1). All models under-perform in labels with FEW training examples, demonstrating the difficulty of few-shot learning in ECHR legal judgment prediction. The main reason is that labels in the FEW group, 11 in total, are extremely rare and have been assigned in 1.25% of the documents across all datasets, while the most frequent 4 labels overall (Articles 3, 5, 6 and 13) have been assigned in approx. 42% of the documents.
Table 3: Micro precision, recall, F1 in multi-label violation for all, frequent, and few training instances.

Case Importance Results
Table 4 shows the mean absolute error (MAE) obtained when predicting case importance. Surprisingly, MAJORITY outperforms the rest of the methods. As already noted in Section 3, the distribution of importance scores is highly skewed in favour of the majority class, thus MAJORITY can correctly predict the score in most cases with zero mean absolute error (MAE). BOW-SVR performs worse than BIGRU-ATT, while HAN is 10% and 3% better, respectively. HIER-BERT further improves the results, outperforming HAN by 17%.
While MAJORITY has the lowest mean absolute error, it cannot distinguish important from unimportant cases, thus it is practically useless. To evaluate the methods on that matter, we measure the correlation between the gold scores and each method's predictions with SPEARMAN's ρ. HIER-BERT has the best ρ (.527), indicating a moderate positive correlation (> 0.5), which is not the case for the rest of the methods. The overall results indicate that a case's importance cannot be predicted solely by the case facts and possibly also relies on background knowledge (e.g., judges' experience, court's history, rarity of article's violation).
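The contrast between MAE and rank correlation can be reproduced on toy data. The sketch below implements Spearman's ρ as the Pearson correlation of average ranks and returns None when one side is constant, since the coefficient is undefined in that case. The gold scores and predictions are invented for illustration and are not taken from the paper's experiments.

```python
from statistics import mean
from typing import List, Optional


def mae(gold: List[float], pred: List[float]) -> float:
    """Mean absolute error."""
    return mean(abs(g - p) for g, p in zip(gold, pred))


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> Optional[float]:
    """Spearman's rho; None if either input is constant (undefined)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


gold = [4, 4, 4, 4, 4, 4, 3, 1]                     # skewed toy scores
majority = [4] * len(gold)                          # constant prediction
model = [3.2, 3.4, 3.1, 3.5, 3.3, 3.0, 2.8, 2.5]    # invented predictions
```

On this toy data the constant predictor has the lower MAE but no defined correlation, while the invented model predictions have a higher MAE and a positive ρ, mirroring the argument above.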

Discussion
We can only speculate that HAN's fact embeddings distill importance-related features from each fact, allowing its second-level GRU to operate on a sequence of fact embeddings that are being exploited by the fact-level attention mechanism and provide a more concise view of the entire case. The same applies to HIER-BERT, which relies on BERT's fact embeddings and the same fact-level attention mechanism. By contrast, BIGRU-ATT operates on a single long sequence of concatenated facts, making it more difficult for its BIGRU to combine information from multiple, especially distant, facts. This may explain the good performance of HAN and HIER-BERT across all tasks.

Related Work
Previous work on legal judgment prediction in English used linear models with features based on bags of words and topics to represent legal textual information extracted from cases (Aletras et al., 2016; Medvedeva et al., 2018).
More sophisticated neural models have been considered only in Chinese. Luo et al. (2017) use HANs to encode the facts of a case and a subset of predicted relevant law articles to predict criminal charges that have been manually annotated. In their experiments, the importance of few-shot learning is not taken into account, since the criminal charges that appear fewer than 80 times are filtered out. However, in reality, a court is able to judge even under rare conditions. Hu et al. (2018) focused on few-shot charge prediction using a multi-task learning scenario, predicting in parallel a set of discriminative attributes as an auxiliary task. Both the selection and annotation of these attributes are manually crafted and dependent on the court. Zhong et al. (2018) decompose the problem of charge prediction into different subtasks that are tailored to the Chinese criminal court using multi-task learning.

Limitations and Future Work
The neural models we considered outperform previous feature-based models, but provide no justification for their predictions. Attention scores (Fig. 1) provide some indications of which parts of the texts affect the predictions most, but are far from being justifications that legal practitioners could trust; see also Jain and Wallace (2019). Providing valid justifications is an important priority for future work and an emerging topic in the NLP community. In this direction, we plan to expand the scope of this study by exploring the automated analysis of additional resources (e.g., relevant case law, dockets, prior judgments) that could then be utilized in a multi-input fashion to further improve performance and justify system decisions. We also plan to apply neural methods to data from other courts, e.g., the European Court of Justice, the US Supreme Court, and multiple languages, to gain a broader perspective of their potential in legal justice prediction. Finally, we plan to adapt bespoke models proposed for the Chinese Criminal Court (Luo et al., 2017; Zhong et al., 2018; Hu et al., 2018) to data from other courts and explore multi-task learning.

Figure 1 :
Attention over words (colored words) and facts (vertical heat bars) as produced by HAN.

Table 1 :
Statistics of the ECHR dataset. The size of the label set (ECHR articles) per case (C) is L = 66.

Table 4 :
Mean Absolute Error and Spearman's ρ for case importance. Importance ranges from 1 (most important) to 4 (least). * Not Applicable.