LEGAL-BERT: The Muppets straight out of Law School

BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original BERT out of the box, (b) adapt BERT by additional pre-training on domain-specific corpora, and (c) pre-train BERT from scratch on domain-specific corpora. We also propose a broader hyper-parameter search space when fine-tuning for downstream tasks and we release LEGAL-BERT, a family of BERT models intended to assist legal NLP research, computational law, and legal technology applications.

Typically, transfer learning with language models requires a computationally heavy step, in which the language model is pre-trained on a large corpus, and a less expensive step, in which the model is fine-tuned for downstream tasks. When using BERT, the first step can be omitted as the pre-trained models are publicly available. Being pre-trained on generic corpora (e.g., Wikipedia, Children's Books, etc.), BERT has been reported to under-perform in specialised domains, such as biomedical or scientific text (Beltagy et al., 2019). To overcome this limitation there are two possible strategies: either further pre-train (FP) BERT on domain-specific corpora, or pre-train BERT from scratch (SC) on domain-specific corpora. Consequently, to employ BERT in specialised domains one may consider three alternative strategies before fine-tuning for the downstream task (Figure 1): (a) use BERT out of the box, (b) further pre-train (FP) BERT on domain-specific corpora, and (c) pre-train BERT from scratch (SC) on domain-specific corpora with a new vocabulary of sub-word units.
In this paper, we systematically explore strategies (a)-(c) in the legal domain, where BERT adaptation has yet to be explored. As with other specialised domains, legal text (e.g., laws, court pleadings, contracts) has distinct characteristics compared to generic corpora, such as specialised vocabulary, particularly formal syntax, semantics based on extensive domain-specific knowledge, etc., to the extent that legal language is often classified as a 'sublanguage' (Tiersma, 1999; Williams, 2007; Haigh, 2018). Note, however, that our work contributes more broadly towards a better understanding of domain adaptation for specialised domains.
Our key findings are: (i) Further pre-training (FP) or pre-training BERT from scratch (SC) on domain-

Related Work
Most previous work on the domain adaptation of BERT and variants does not systematically explore the full range of the above strategies and mainly targets the biomedical or broader scientific domains.   BERT-BASE, or by pre-training BERT-BASE from scratch (SC) on a domain-specific corpus, i.e., the model is randomly initialized and the vocabulary is created from scratch. Improvements were reported in downstream tasks in both cases. Sung et al. (2019) further pre-trained BERT-BASE on textbooks and question-answer pairs to improve short answer grading for intelligent tutoring systems.
One shortcoming is that no previous work investigates the effect of varying the number of pre-training steps, with the exception of . More importantly, when fine-tuning for the downstream task, all previous work blindly adopts the hyper-parameter selection guidelines of Devlin et al. (2019) without further investigation. Finally, no previous work considers the effectiveness and efficiency of smaller models (e.g., fewer layers) in specialised domains. The full capacity of larger and computationally more expensive models may be unnecessary in specialised domains, where syntax may be more standardized, the range of topics discussed may be narrower, terms may have fewer senses, etc. We also note that although BERT is the current state of the art in many legal NLP tasks (Chalkidis et al., 2019c,a,d), no previous work has considered its adaptation for the legal domain. is highly skewed towards generic language, using a vocabulary of 30k sub-words that better suits these generic corpora. Nonetheless, we expect that prolonged in-domain pre-training will be beneficial.
LEGAL-BERT-SC has the same architecture as BERT-BASE with 12 layers, 768 hidden units and 12 attention heads (110M parameters). We use this architecture in all our experiments unless otherwise stated. We use a newly created vocabulary of equal size to BERT's vocabulary. We also experiment with LEGAL-BERT-SMALL, a substantially smaller model, with 6 layers, 512 hidden units, and 8 attention heads (35M parameters, 32% of the size of BERT-BASE). This light-weight model trains approx. 4 times faster, while also requiring fewer hardware resources. Our hypothesis is that such a specialised BERT model can perform well against generic BERT models, despite its fewer parameters.
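The parameter counts quoted above can be approximately reproduced with a back-of-the-envelope calculation. The sketch below is an illustration rather than the authors' code: the function name `bert_param_count` is hypothetical, and it assumes a standard BERT encoder layout (feed-forward width of 4× the hidden size, 512 positions, 2 segment types, a pooler layer) and a vocabulary of 30,522 sub-words, the size of the original uncased BERT vocabulary. Note that the number of attention heads does not change the parameter count, since the heads split the same projection matrices.

```python
def bert_param_count(hidden, layers, vocab_size=30522, max_positions=512,
                     type_vocab=2, intermediate=None):
    """Count weights and biases of a BERT-style encoder (assumed standard layout)."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumption: feed-forward width is 4x hidden size
    layer_norm = 2 * hidden  # scale and bias vectors

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (with biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block (with biases), plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    # Dense pooler applied to the first token representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(hidden=768, layers=12)   # BERT-BASE / LEGAL-BERT-SC shape
small = bert_param_count(hidden=512, layers=6)   # LEGAL-BERT-SMALL shape

print(f"12 layers, 768 hidden units: {base / 1e6:.1f}M parameters")
print(f" 6 layers, 512 hidden units: {small / 1e6:.1f}M parameters")
print(f"size ratio: {small / base:.0%}")
```

Under these assumptions the two shapes come out at roughly 110M and 35M parameters, with a ratio of about 32%, consistent with the figures reported in the text.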

Experimental Setup
Pre-training Details: To be comparable with BERT, we train LEGAL-BERT for 1M steps (approx. 40 epochs) over all corpora (Section 3), in batches of 256 samples, each including up to 512 sentencepiece tokens. We used Adam with a learning rate of 1e−4, as in the original implementation. We trained all models with the official BERT code using v3 TPUs with 8 cores from Google Cloud Compute Services.
Legal NLP Tasks: We evaluate our models on text classification and sequence tagging using three datasets. EURLEX57K (Chalkidis et al., 2019b) is a large-scale multi-label text classification dataset of EU laws, also suitable for few and zero-shot learning. ECHR-CASES (Chalkidis et al., 2019a) contains cases from the European Court of Human Rights (Aletras et al., 2016) and can be used for binary and multi-label text classification. Finally, CONTRACTS-NER (Chalkidis et al., 2017, 2019d) is a dataset for named entity recognition on US contracts consisting of three subsets: contract header, dispute resolution, and lease details. We replicate the experiments of Chalkidis et al. (2019c,a,d) when fine-tuning BERT for all datasets.
Tune your Muppets! As a rule of thumb for fine-tuning BERT on downstream tasks, Devlin et al. (2019) suggested a minimal hyper-parameter tuning strategy relying on a grid search over the following ranges: learning rate ∈ {2e−5, 3e−5, 4e−5, 5e−5}, number of training epochs ∈ {3, 4}, batch size ∈ {16, 32}, and a fixed dropout rate of 0.1. These suggestions, which are not well justified, are blindly followed in the literature (Alsentzer et al., 2019; Beltagy et al., 2019; Sung et al., 2019). Given the relatively small size of the datasets, we use batch sizes ∈ {4, 8, 16, 32}. Interestingly, in preliminary experiments, we found that some models still underfit after 4 epochs, the maximum suggested; thus we use early stopping based on validation loss, without a fixed maximum number of training epochs. We also consider an additional lower learning rate (1e−5) to avoid overshooting local minima, and an additional higher dropout rate (0.2) to improve regularization. Figure 4 (top two bars) shows that our enriched grid search (tuned) has a substantial impact in most of the end-tasks compared to the default hyper-parameter strategy of Devlin et al. (2019). We adopt this strategy for LEGAL-BERT.
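To make the difference between the two search spaces concrete, the following sketch enumerates the default grid of Devlin et al. (2019) and the enriched grid described above, and shows one simple way to implement early stopping on validation loss. It is a minimal illustration, not the authors' code: the helper names `expand` and `best_epoch_with_early_stopping`, the `patience` value, and the example loss curve are assumptions introduced here.

```python
from itertools import product

# Default grid suggested by Devlin et al. (2019), as listed in the text.
DEFAULT_GRID = {
    "learning_rate": [2e-5, 3e-5, 4e-5, 5e-5],
    "epochs": [3, 4],
    "batch_size": [16, 32],
    "dropout": [0.1],
}

# Enriched grid: extra learning rate, smaller batches, extra dropout rate.
# There is no "epochs" entry because early stopping decides when to stop.
TUNED_GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 4e-5, 5e-5],
    "batch_size": [4, 8, 16, 32],
    "dropout": [0.1, 0.2],
}


def expand(grid):
    """Return every hyper-parameter combination of a grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def best_epoch_with_early_stopping(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss, stopping once
    the loss has not improved for `patience` consecutive epochs.
    The patience value is an assumption; the text does not specify one."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


default_configs = expand(DEFAULT_GRID)
tuned_configs = expand(TUNED_GRID)
print(len(default_configs), "default configurations")
print(len(tuned_configs), "enriched configurations")

# Made-up validation losses: the model keeps improving beyond epoch 4.
losses = [0.90, 0.70, 0.60, 0.55, 0.50, 0.52, 0.53, 0.54]
print("best epoch:", best_epoch_with_early_stopping(losses))
```

In the made-up curve the best checkpoint falls at epoch 5, which a fixed maximum of 4 epochs would have missed; this is the kind of underfitting the text describes.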

Experimental Results
Pre-training Results: Figure 2 presents the training loss across pre-training steps for all versions of LEGAL-BERT. LEGAL-BERT-SC performs much better on the pre-training objectives than LEGAL-BERT-SMALL, as expected given the different sizes of the two models. At the end of its pre-training, LEGAL-BERT-SMALL has similar loss to that of BERT-BASE pre-trained on generic corpora (arrow in Figure 2). When we consider the additional pre-training of BERT on legal corpora (LEGAL-BERT-FP), we observe that it adapts faster and better in specific sub-domains (esp. ECHR cases, US contracts), compared to using the full collection of legal corpora, where the training loss does not reach that of LEGAL-BERT-SC.
End-task Results: Figure 3 presents the results of all LEGAL-BERT-FP variants on development data. The optimal strategy for further pre-training varies across datasets. Thus in subsequent experiments on test data, we keep for each end-task the variant of LEGAL-BERT-FP with the best development results. Figure 4 shows the perplexities and end-task results (minimum, maximum, and averages over multiple runs) of all BERT variants considered, now on test data. Perplexity indicates how well a BERT variant predicts the language of an end-task. We expect models with similar perplexities to also have similar performance. In all three datasets, a LEGAL-BERT variant almost always leads to better results than the tuned BERT-BASE. In EURLEX57K, the improvements are less substantial for all, frequent, and few labels (0.2%), in agreement with the small drop in perplexity (2.7). In ECHR-CASES, we again observe small differences in perplexities (1.1 drop) and in the performance on the binary classification task (0.8% improvement). By contrast, we observe a more substantial improvement in the more difficult multi-label task (2.5%), indicating that the LEGAL-BERT variations benefit from in-domain knowledge. On CONTRACTS-NER, the drop in perplexity is larger (5.6), which is reflected in the increase in F1 on the contract header (1.8%) and dispute resolution (1.6%) subsets. In the lease details subset, we also observe an improvement (1.1%).
Impressively, LEGAL-BERT-SMALL is comparable to LEGAL-BERT across most datasets, while it can fit in most modern GPU cards. This is important for researchers and practitioners with limited access to large computational resources. It also provides a more memory-friendly basis for more complex BERT-based architectures. For example, deploying a hierarchical version of BERT for ECHR-CASES (Chalkidis et al., 2019a) leads to a 4× memory increase.
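For readers unfamiliar with the perplexity figures used in the end-task comparison, the sketch below shows the usual definition: the exponential of the average negative log-likelihood that a model assigns to the correct tokens. This excerpt does not spell out how perplexity was measured, so the function below is only an assumed, generic formulation; for a masked language model the probabilities would typically be those of the masked-out tokens, and the example probabilities are invented.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigns to the correct tokens:
    exp of the mean negative log-likelihood. Lower is better."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Invented probabilities for the same five tokens under two models.
generic_model = [0.20, 0.05, 0.30, 0.10, 0.25]
in_domain_model = [0.40, 0.15, 0.35, 0.30, 0.25]

print(f"generic:   {perplexity(generic_model):.2f}")
print(f"in-domain: {perplexity(in_domain_model):.2f}")
```

A model that assigns higher probability to the tokens actually observed in the end-task text obtains a lower perplexity, which is why a drop in perplexity is read as better coverage of the target language.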
We showed that the best strategy to port BERT to a new domain may vary, and one may consider either further pre-training or pre-training from scratch. Thus, we release LEGAL-BERT, a family of BERT models for the legal domain achieving state-of-the-art results in three end-tasks. Notably, the performance gains are stronger in the most challenging end-tasks (i.e., multi-label classification in ECHR-CASES and contract header, lease details in CONTRACTS-NER) where in-domain knowledge is more important. We also release LEGAL-BERT-SMALL, which is 3 times smaller but highly competitive with the other versions of LEGAL-BERT. Thus, it can be adopted more easily in low-resource test-beds. Finally, we show that an expanded grid search when fine-tuning BERT for end-tasks has a drastic impact on performance and thus should always be adopted. In future work, we plan to explore the performance of LEGAL-BERT on more legal datasets and tasks. We also intend to explore the impact of further pre-training LEGAL-BERT-SC and LEGAL-BERT-SMALL on specific legal sub-domains (e.g., EU legislation).

A Legal NLP datasets
Below are the details of the legal NLP datasets we used for the evaluation of our models:
• EURLEX57K (Chalkidis et al., 2019b) contains 57k legislative documents from EURLEX with an average length of 727 words. All documents have been annotated by the Publications Office of EU with concepts from EUROVOC. The average number of labels per document is approx. 5, while many of them are rare. The dataset is split into training (45k), development (6k), and test (6k) documents.
• ECHR-CASES (Chalkidis et al., 2019a) contains approx. 11.5k cases from ECHR's public database. For each case, the dataset provides a list of facts. Each case is also mapped to articles of the Human Rights Convention that were violated (if any). The dataset can be used for binary classification, where the task is to identify if there was a violation or not, and for multi-label classification where the task is to identify the violated articles.
• CONTRACTS-NER (Chalkidis et al., 2017, 2019d) contains approx. 2k US contracts from EDGAR. Each contract has been annotated with multiple contract elements such as title, parties, dates of interest, governing law, jurisdiction, amounts and locations, which have been organized in three groups (contract header, dispute resolution, lease details) based on their position in contracts.
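As an illustration of how the two ECHR-CASES tasks described above relate, the sketch below derives both targets from the list of violated articles of a case: a binary label (any violation or none) and a multi-hot vector over a label space. The function name `echr_targets`, the article identifiers, and the label space are hypothetical placeholders chosen for this example, not the dataset's actual schema.

```python
def echr_targets(violated_articles, label_space):
    """Build the binary and multi-label targets for one case.

    violated_articles: article identifiers found to be violated (may be empty).
    label_space: ordered list of all article identifiers used as labels.
    """
    violated = set(violated_articles)
    binary = int(len(violated) > 0)                       # 1 if any article was violated
    multi_hot = [int(article in violated) for article in label_space]
    return binary, multi_hot


# Hypothetical label space, for illustration only.
LABELS = ["2", "3", "5", "6", "8"]

print(echr_targets(["3", "6"], LABELS))  # a case with two violated articles
print(echr_targets([], LABELS))          # a case with no violation
```

A case with no violated articles yields a negative binary label and an all-zero multi-label vector, which is what makes the multi-label task strictly harder than the binary one.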

B Implementation details and results on downstream tasks
Below we describe the implementation details for fine-tuning BERT and LEGAL-BERT on the three downstream tasks:
Recently there has been a debate on the over-parameterization of BERT (Kitaev et al., 2020; Rogers et al., 2020). In that direction, most studies suggest a parameter sharing technique (Lan et al., 2019) or distillation of BERT by decreasing the number of layers (Sanh et al., 2019). However, the main bottleneck of transformers on modern hardware is not primarily the total number of parameters, which is often misinterpreted as the number of stacked layers. Instead, Out Of Memory (OOM) issues mainly arise in wider models, in terms of hidden units' dimensionality and the number of attention heads, which affects gradient accumulation in feed-forward and multi-head attention layers (see Table 2). Table 2 shows that LEGAL-BERT-SMALL, despite having 3× and 2× the parameters of ALBERT and ALBERT-LARGE, has faster training and inference times. We expect models overcoming such limitations to be widely adopted by researchers and practitioners with limited resources. In the same direction, Google released several lightweight versions of BERT.
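The argument that width and head count, rather than parameter count, drive memory use can be illustrated with a rough count of intermediate activations per layer. The sketch below is a simplified, assumed model of memory use (it counts tensor elements only, ignores optimizer state and implementation details, and assumes a feed-forward width of 4× the hidden size); the function names are hypothetical. The point is that these sizes depend on hidden size, number of heads and sequence length, and are unaffected by sharing parameters across layers.

```python
def attention_score_elements(batch, seq_len, heads):
    """Elements in one layer's attention score tensor: batch x heads x seq x seq."""
    return batch * heads * seq_len * seq_len


def hidden_state_elements(batch, seq_len, width):
    """Elements in one activation tensor of the given width: batch x seq x width."""
    return batch * seq_len * width


# Shapes taken from the text; feed-forward width of 4x hidden is an assumption.
shapes = {
    "12 heads, 768 hidden units": {"hidden": 768, "heads": 12},
    " 8 heads, 512 hidden units": {"hidden": 512, "heads": 8},
}

for name, cfg in shapes.items():
    scores = attention_score_elements(batch=1, seq_len=512, heads=cfg["heads"])
    ffn = hidden_state_elements(batch=1, seq_len=512, width=4 * cfg["hidden"])
    print(f"{name}: {scores:,} attention scores, "
          f"{ffn:,} feed-forward activations per layer and sample")
```

Because a model that shares weights across layers still has to store these per-layer tensors for back-propagation, a narrower model with more parameters can plausibly be cheaper to train than a wider model with fewer, in line with the comparison reported in Table 2.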