Large-Scale Multi-Label Text Classification on EU Legislation

We consider Large-Scale Multi-Label Text Classification (LMTC) in the legal domain. We release a new dataset of 57k legislative documents from EUR-LEX, annotated with ∼4.3k EUROVOC labels, which is suitable for LMTC, few- and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with label-wise attention perform better than other current state-of-the-art methods. Domain-specific WORD2VEC and context-sensitive ELMO embeddings further improve performance. We also find that considering only particular zones of the documents is sufficient. This allows us to bypass BERT’s maximum text length limit and fine-tune BERT, obtaining the best results in all but zero-shot learning cases.


Introduction
Large-scale multi-label text classification (LMTC) is the task of assigning to each document all the relevant labels from a large set, typically containing thousands of labels (classes). Applications include building web directories (Partalas et al., 2015), labeling scientific publications with concepts from ontologies (Tsatsaronis et al., 2015), and assigning diagnostic and procedure labels to medical records (Mullenbach et al., 2018; Rios and Kavuluru, 2018). We focus on legal text processing, an emerging NLP field with many applications (e.g., legal judgment prediction (Nallapati and Manning, 2008; Aletras et al., 2016), contract element extraction (Chalkidis et al., 2017), obligation extraction (Chalkidis et al., 2018)), but limited publicly available resources.
Our first contribution is a new publicly available legal LMTC dataset, dubbed EURLEX57K, containing 57k English EU legislative documents from the EUR-LEX portal, tagged with ∼4.3k labels (concepts) from the European Vocabulary (EUROVOC). EUROVOC contains approx. 7k labels, but most of them are rarely used, hence they are under-represented (or absent) in EURLEX57K, making the dataset also appropriate for few- and zero-shot learning. EURLEX57K can be viewed as an improved version of the dataset released by Mencia and Fürnkranz (2007), which has been widely used in LMTC research, but is less than half the size of EURLEX57K (19.6k documents, 4k EUROVOC labels) and more than ten years old.
As a second contribution, we experiment with several neural classifiers on EURLEX57K, including the Label-Wise Attention Network of Mullenbach et al. (2018), called CNN-LWAN here, which was reported to achieve state-of-the-art performance in LMTC on medical records. We show that a simpler BIGRU with self-attention (Xu et al., 2015) outperforms CNN-LWAN by a wide margin on EURLEX57K. Furthermore, by replacing the CNN encoder of CNN-LWAN with a BIGRU, we obtain even better results on EURLEX57K. Domain-specific WORD2VEC (Mikolov et al., 2013) and context-sensitive ELMO embeddings (Peters et al., 2018) yield further improvements. We thus establish strong baselines for EURLEX57K. As a third contribution, we investigate which zones of the documents are more informative on EURLEX57K, showing that considering only the title and recitals of each document leads to almost the same performance as considering the full document. This allows us to bypass BERT's (Devlin et al., 2018) maximum text length limit and fine-tune BERT, obtaining the best results for all but zero-shot labels. To our knowledge, this is the first application of BERT to an LMTC task, which provides further evidence of the superiority of pretrained language models with task-specific fine-tuning, and establishes an even stronger baseline for EURLEX57K and LMTC in general.

Related Work
You et al. (2018) explored RNN-based methods with self-attention on five LMTC datasets that had also been considered by Liu et al. (2017), namely RCV1 (Lewis et al., 2004), Amazon-13K (McAuley and Leskovec, 2013), Wiki-30K and Wiki-500K (Zubiaga, 2012), as well as the previous EUR-LEX dataset (Mencia and Fürnkranz, 2007), reporting that attention-based RNNs produced the best results overall (4 out of 5 datasets). Mullenbach et al. (2018) investigated the use of label-wise attention in LMTC for medical code prediction on the MIMIC-II and MIMIC-III datasets (Johnson et al., 2017). Their best method, Convolutional Attention for Multi-Label Classification, called CNN-LWAN here, employs one attention head per label and was shown to outperform weak baselines, namely logistic regression, plain BIGRUs, and CNNs with a single convolution layer.
Rios and Kavuluru (2018) consider few- and zero-shot learning on the MIMIC datasets. They propose Zero-shot Attentive CNN, called ZERO-CNN-LWAN here, a method similar to CNN-LWAN that also exploits label descriptors. Although ZERO-CNN-LWAN did not outperform CNN-LWAN overall on MIMIC-II and MIMIC-III, it obtained much better results in few-shot and zero-shot learning, as did other variants of ZERO-CNN-LWAN that exploit the hierarchical relations of the labels with graph convolutions.
We note that the label-wise attention methods of Mullenbach et al. (2018) and Rios and Kavuluru (2018) were not compared to strong generic text classification baselines, such as attention-based RNNs (You et al., 2018) or the Hierarchical Attention Network (HAN) (Yang et al., 2016), which we investigate below.

The New Dataset
As already noted, EURLEX57K contains 57k legislative documents from EUR-LEX, with an average length of 727 words (Table 1). Each document contains four major zones: the header, which includes the title and the name of the legal body enforcing the legal act; the recitals, which are legal background references; the main body, usually organized in articles; and the attachments (e.g., appendices, annexes). Some of the LMTC methods we consider need to be fed with documents split into smaller units. These units are often sentences, but in our experiments they are sections, so we pre-processed the raw text accordingly, treating the header, the recitals zone, each article of the main body, and the attachments as separate sections.

Subset   Documents (D)   Words/D   Labels/D
Train    45,000          729       5
Dev.     6,000           714       5
Test     6,000           725       5
Total    57,000          727       5

Table 1: Statistics of the EUR-LEX dataset.
All the documents of the dataset have been annotated by the Publications Office of the EU with multiple concepts from EUROVOC. While EUROVOC includes approx. 7k concepts (labels), only 4,271 (59.31%) are present in EURLEX57K, of which only 2,049 (47.97%) have been assigned to more than 10 documents. Similar distributions were reported by Rios and Kavuluru (2018) for the MIMIC datasets. We split EURLEX57K into training (45k documents), development (6k), and test (6k) subsets. We also divide the 4,271 labels into frequent (746 labels), few-shot (3,362), and zero-shot (163), depending on whether they were assigned to more than 50, fewer than 50 but at least one, or no training documents, respectively.
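For illustration, the following minimal Python sketch shows how such a split can be derived from training-set label assignments. The helper name and data representation are assumptions; labels assigned to exactly 50 training documents are grouped with the few-shot labels here, which is one reading of the thresholds above.

```python
from collections import Counter

def split_labels(train_doc_labels, all_labels, threshold=50):
    """Partition labels into frequent / few-shot / zero-shot groups by the
    number of *training* documents each label is assigned to.

    train_doc_labels: iterable of label lists, one list per training document.
    all_labels:       set of all labels present in the dataset.
    """
    counts = Counter(label for labels in train_doc_labels for label in labels)
    frequent = {l for l in all_labels if counts[l] > threshold}   # more than 50 training docs
    zero_shot = {l for l in all_labels if counts[l] == 0}         # never seen in training
    few_shot = all_labels - frequent - zero_shot                  # 1..50 training docs
    return frequent, few_shot, zero_shot
```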

Methods
Exact Match, Logistic Regression: A first naive baseline, Exact Match, assigns only labels whose descriptors can be found verbatim in the document. A second one uses Logistic Regression with feature vectors containing TF-IDF scores of n-grams (n = 1, 2, . . . , 5).
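A minimal sketch of such a Logistic Regression baseline with scikit-learn; the vocabulary cap and other hyper-parameter values are illustrative assumptions, not the settings used in our experiments, and `train_texts`, `train_labels` (lists of label lists), and `test_texts` are assumed to be given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-label setup: one binary logistic regression per label (one-vs-rest)
# over TF-IDF features of word n-grams with n = 1..5.
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train_labels)
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 5), max_features=200_000),  # feature cap is an assumption
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(train_texts, y_train)
scores = clf.predict_proba(test_texts)  # per-label probabilities, used to rank labels per document
```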
BIGRU-ATT: The first neural method is a BIGRU with self-attention (Xu et al., 2015). Each document is represented as the sequence of its word embeddings, which go through a stack of BIGRUs (Figure 1a). A document embedding $h$ is computed as the sum of the resulting context-aware embeddings, weighted by the self-attention scores $a_i$ (i.e., $h = \sum_i a_i h_i$), and goes through a dense layer with $L$ output units and sigmoids, producing one probability per label.

HAN: The Hierarchical Attention Network (Yang et al., 2016) first encodes the words of each section with a BIGRU with self-attention, then encodes the resulting section embeddings with a second BIGRU with self-attention, producing the document embedding, which again goes through a dense layer with $L$ output units and sigmoids.

CNN-LWAN, BIGRU-LWAN: The Label-Wise Attention Network of Mullenbach et al. (2018), CNN-LWAN here, uses a CNN encoder; in our BIGRU-LWAN variant we replace it with a BIGRU (Figure 1c). In both cases the encoder converts the word embeddings into context-sensitive embeddings $h_i$, much as in BIGRU-ATT. Unlike BIGRU-ATT, however, both CNN-LWAN and BIGRU-LWAN use $L$ independent attention heads, one per label, generating $L$ document embeddings ($h^{(l)} = \sum_i a_{l,i} h_i$, $l = 1, \dots, L$) from the sequence of vectors $h_i$ produced by the CNN or BIGRU encoder, respectively. Each document embedding $h^{(l)}$ is specialized to predict the corresponding label and goes through a separate dense layer ($L$ dense layers in total) with a sigmoid, to produce the probability of the corresponding label.
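For concreteness, a minimal PyTorch sketch of the label-wise attention mechanism in BIGRU-LWAN; layer sizes, initialization, and other details are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class BiGRULWAN(nn.Module):
    """BIGRU encoder with one attention head and one output unit per label."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        d = 2 * hidden_dim
        self.label_queries = nn.Parameter(torch.randn(num_labels, d) * 0.01)  # one attention query per label
        self.out_weight = nn.Parameter(torch.randn(num_labels, d) * 0.01)     # label-specific dense layers
        self.out_bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, token_ids):                                  # (B, T) word ids
        h, _ = self.bigru(self.embed(token_ids))                   # (B, T, 2H) context-aware embeddings h_i
        att = torch.softmax(h @ self.label_queries.T, dim=1)       # (B, T, L) attention scores a_{l,i}
        docs = torch.einsum('btl,btd->bld', att, h)                # (B, L, 2H) label-specific doc embeddings h^(l)
        logits = (docs * self.out_weight).sum(-1) + self.out_bias  # (B, L) one logit per label
        return torch.sigmoid(logits)                               # probability per label
```

Training minimizes binary cross-entropy over the $L$ sigmoid outputs; BIGRU-ATT corresponds to the special case of a single shared attention head followed by a single dense layer with $L$ outputs.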
ZERO-CNN-LWAN, ZERO-BIGRU-LWAN: Rios and Kavuluru (2018) designed a model similar to CNN-LWAN, called ZACNN in their work and ZERO-CNN-LWAN here, to deal with rare labels. In ZERO-CNN-LWAN, the attention scores $a_{l,i}$ and the label probabilities are produced by comparing the $h_i$ vectors that the CNN encoder produces and the label-specific document embeddings $h^{(l)}$, respectively, to label embeddings. Each label embedding is the centroid of the pretrained word embeddings of the label's descriptor; consult Rios and Kavuluru (2018) for further details. By contrast, CNN-LWAN and BIGRU-LWAN do not consider the descriptors of the labels. We also experiment with a variant of ZERO-CNN-LWAN that we developed, dubbed ZERO-BIGRU-LWAN, where the CNN encoder is replaced by a BIGRU.
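A rough sketch of the label-embedding mechanism, reflecting our reading of Rios and Kavuluru (2018); the projection sizes, non-linearities, and exact scoring function are assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class ZeroShotLWANHead(nn.Module):
    """Label-wise attention and scoring driven by frozen label embeddings
    (centroids of the pretrained word embeddings of each label's descriptor),
    so that even unseen (zero-shot) labels can be scored."""
    def __init__(self, enc_dim, label_emb):                 # label_emb: (L, E) descriptor centroids
        super().__init__()
        self.label_emb = nn.Parameter(label_emb, requires_grad=False)  # prior knowledge kept "as is"
        self.att_proj = nn.Linear(enc_dim, label_emb.size(1))          # encoder states -> label space
        self.out_proj = nn.Linear(enc_dim, label_emb.size(1))          # doc embeddings -> label space

    def forward(self, h):                                    # h: (B, T, enc_dim) from a CNN or BIGRU encoder
        att = torch.softmax(torch.tanh(self.att_proj(h)) @ self.label_emb.T, dim=1)  # (B, T, L)
        docs = torch.einsum('btl,btd->bld', att, h)          # (B, L, enc_dim) label-specific doc embeddings
        scores = (torch.tanh(self.out_proj(docs)) * self.label_emb).sum(-1)          # compare to label embeddings
        return torch.sigmoid(scores)                         # (B, L) probability per label
```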
BERT: BERT (Devlin et al., 2018) is a language model based on Transformers (Vaswani et al., 2017) pretrained on large corpora. For a new target task, a task-specific layer is added on top of BERT. The extra layer is trained jointly with BERT by fine-tuning on task-specific data. We add a dense layer on top of BERT, with sigmoids, that produces a probability per label. Unfortunately, BERT can currently process texts up to 512 wordpieces, which is too small for the documents of EURLEX57K. Hence, BERT can only be applied to truncated versions of our documents (see below).
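A minimal sketch of this setup with the Hugging Face transformers library; the model name, the use of the pooled representation, and the training details are assumptions made for illustration, and `texts` and `gold_labels` are assumed to be given.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertForLMTC(nn.Module):
    """BERT with a dense output layer producing one probability per label."""
    def __init__(self, num_labels, model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.pooler_output)  # logits; sigmoids applied via BCEWithLogitsLoss

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Documents must fit BERT's 512-wordpiece limit, so they are truncated
# (e.g., to the header + recitals, or to the first 512 tokens).
batch = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors='pt')
model = BertForLMTC(num_labels=4271)
logits = model(batch['input_ids'], batch['attention_mask'])
loss = nn.BCEWithLogitsLoss()(logits, gold_labels.float())  # gold_labels: (B, 4271) binary matrix
```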

Experiments
Evaluation measures: Common LMTC evaluation measures are precision (P@K) and recall (R@K) at the top K predicted labels, averaged over test documents, micro-averaged F1 over all labels, and nDCG@K (Manning et al., 2009). However, P@K and R@K unfairly penalize methods when the gold labels of a document are fewer or more than K, respectively. Similar concerns have led to the introduction of R-Precision and nDCG@K in Information Retrieval (Manning et al., 2009), which we believe are also more appropriate for LMTC. Note, however, that R-Precision requires the number of gold labels per document to be known beforehand, which is unrealistic in practical applications. Therefore we propose using R-Precision@K (RP@K), where K is a parameter. This measure is the same as P@K if there are at least K gold labels; otherwise K is reduced to the number of gold labels. Figure 2 shows RP@K for the three best systems, macro-averaged over test documents. Unlike P@K, RP@K does not decline sharply as K increases, because it replaces K by the number of gold labels when the latter is lower than K. For K = 1, RP@K is equivalent to P@K, as confirmed by Fig. 2. For large values of K that almost always exceed the number of gold labels, RP@K asymptotically approaches R@K, as also confirmed by Fig. 2 (see Appendix C for a more detailed discussion of the evaluation measures). In our dataset, there are 5.07 labels per document on average, hence K = 5 is reasonable (evaluating at other values of K leads to similar conclusions; see Fig. 2 and Appendix D).
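As a concrete, hypothetical example of the difference between P@K and RP@K for a document with fewer than K gold labels (label names are invented):

```python
def rp_at_k(ranked, gold, k):
    """R-Precision@K for one document: correct labels among the top K,
    divided by min(K, number of gold labels)."""
    hits = sum(1 for label in ranked[:k] if label in gold)
    return hits / min(k, len(gold))

ranked = ['c1', 'c2', 'c3', 'c4', 'c5']   # hypothetical system ranking
gold = {'c1', 'c3'}                        # document with only 2 gold labels
# P@5 = 2/5 = 0.4 although both gold labels are ranked in the top 5;
# RP@5 = 2/min(5, 2) = 1.0, i.e., the document is scored as perfectly labeled.
print(rp_at_k(ranked, gold, 5))            # 1.0
```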
Setup: Hyper-parameters are tuned using the HYPEROPT library, selecting the values with the best loss on development data. For the best hyper-parameter values, we perform five runs and report mean scores on test data. For statistical significance tests, we take the run of each method with the best performance on development data, and perform two-tailed approximate randomization tests (Dror et al., 2018) on test data. Unless otherwise stated, we used 200-D pretrained GLOVE embeddings (Pennington et al., 2014).
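A minimal sketch of this tuning loop with HYPEROPT; the search space shown is illustrative rather than our exact space, and `train_and_evaluate` is a hypothetical helper that trains a model with the given hyper-parameters and returns the development loss.

```python
from hyperopt import Trials, fmin, hp, tpe

space = {                                            # illustrative search space
    'hidden_units': hp.choice('hidden_units', [200, 300, 400]),
    'num_layers': hp.choice('num_layers', [1, 2]),
    'dropout': hp.uniform('dropout', 0.1, 0.5),
    'batch_size': hp.choice('batch_size', [8, 12, 16]),
}

def objective(params):
    return train_and_evaluate(params)                # hypothetical helper: returns dev loss to minimize

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)                                          # hyper-parameter values with the best development loss
```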
Full documents: The first five horizontal zones of Table 2 report results for full documents. The naive baselines are weak, as expected. Interestingly, for all, frequent, and even few-shot labels, the generic BIGRU-ATT performs better than CNN-LWAN, which was designed for LMTC. HAN also performs better than CNN-LWAN for all and frequent labels. However, replacing the CNN encoder of CNN-LWAN with a BIGRU (BIGRU-LWAN) leads to the best results, indicating that the main weakness of CNN-LWAN is its vanilla CNN encoder.
The zero-shot versions of CNN-LWAN and BIGRU-LWAN outperform all other methods on zero-shot labels (Table 2), in line with the findings of Rios and Kavuluru (2018), because they exploit label descriptors, but more importantly because they have a component that uses prior knowledge as is (i.e., the label embeddings are frozen). Exact Match also performs better on zero-shot labels, for the same reason (i.e., the prior knowledge is intact). BIGRU-LWAN, however, is still the best method in few-shot learning. All the differences between the best (bold) and other methods in Table 2 are statistically significant (p < 0.01).

First 512 tokens: Given that the concatenation of the header and the recitals (H+R) contains enough information and is shorter than 500 tokens in 83% of our dataset's documents, we also apply BERT to the first 512 tokens of each document (truncated to BERT's maximum length), comparing to BIGRU-LWAN also operating on the first 512 tokens. (The approximate randomization tests detected no statistically significant difference in this case; p = 0.20.)

Limitations and Future Work
One major limitation of the investigated methods is that they are unsuitable for Extreme Multi-Label Text Classification, where there are hundreds of thousands of labels (Liu et al., 2017; Zhang et al., 2018; Wydmuch et al., 2018), as opposed to the LMTC setting of our work, where the labels are in the order of thousands. We leave the investigation of methods for extremely large label sets for future work. Moreover, RNN-based (and GRU-based) methods have a high computational cost, especially for long documents. We plan to investigate more computationally efficient methods, e.g., dilated CNNs (Kalchbrenner et al., 2017) and Transformers (Vaswani et al., 2017; Dai et al., 2019). We also plan to experiment with hierarchical flavors of BERT to surpass its length limitations. Furthermore, experimenting with more datasets, e.g., RCV1, Amazon-13K, Wiki-30K, and MIMIC-III, is left for future work.

Appendix
A EURLEX57K Statistics

Figure 3 shows the distribution of labels across EURLEX57K documents. Of the ∼7k EUROVOC labels, fewer than 50% appear in more than 10 documents. Such an aggressive Zipfian distribution has also been noted in medical code prediction (Rios and Kavuluru, 2018), where similar thesauri are used to classify documents, demonstrating the practical importance of few-shot and zero-shot learning.
B Hyper-parameter Tuning

Table 5 shows the best hyper-parameters returned by HYPEROPT. Concerning BERT, we set the dropout rate and learning rate to 0.1 and 5e-5, respectively, as suggested by Devlin et al. (2018), while the batch size was set to 8 due to GPU memory limitations. Finally, we noticed that the model did

C Evaluation Measures
The macro-averaged versions of R@K and P@K are defined as follows:

$R@K = \frac{1}{T} \sum_{t=1}^{T} \frac{S_t(K)}{R_t}, \qquad P@K = \frac{1}{T} \sum_{t=1}^{T} \frac{S_t(K)}{K}$

where T is the total number of test documents, K is the number of labels to be selected per document, $S_t(K)$ is the number of correct labels among those ranked in the top K for the t-th document, and $R_t$ is the number of gold labels of the t-th document. Although these measures are widely used in LMTC, we question their appropriateness for the following reasons:

1. R@K leads to excessive penalization when documents have more than K gold labels. For example, evaluating at K = 1 for a single document with 5 gold labels returns R@1 = 0.20, even if the system returned a correct label. The system is penalized, even though it was not allowed to return more than one label.
2. P@K does the same for documents with fewer than K gold labels. For example, evaluating at K = 5 for a single document with a single gold label returns at most P@5 = 0.20, even if the gold label is ranked first.
3. Both measures over- or under-estimate performance on documents whose number of gold labels largely diverges from K. This is clearly illustrated in Figure 2 of the main article.
4. Because of these drawbacks, neither measure reliably singles out the best methods.
Based on the above arguments, we believe that R-Precision@K (RP@K) and nDCG@K lead to a more informative and fair evaluation. Both measures adjust to the number of gold labels per document, without over- or under-estimating performance when documents have few or many gold labels. The macro-averaged versions of the two measures are defined as follows:

$RP@K = \frac{1}{T} \sum_{t=1}^{T} \frac{S_t(K)}{\min(K, R_t)}, \qquad nDCG@K = \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{k=1}^{K} \frac{rel_t(k)}{\log_2(1+k)}}{\sum_{k=1}^{\min(K, R_t)} \frac{1}{\log_2(1+k)}}$

Again, T is the total number of test documents, K is the number of labels to be selected, $S_t(K)$ is the number of correct labels among those ranked in the top K for the t-th document, and $R_t$ is the number of gold labels of that document; $rel_t(k)$ is 1 if the label ranked at position k for the t-th document is a gold label of that document, and 0 otherwise. In the main article we report results for K = 5, because the majority of the documents of EURLEX57K (57.7%) have at most 5 labels. The detailed distributions can be seen in Figure 4.
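For reference, a straightforward Python implementation of the two macro-averaged measures with binary relevance; variable names are ours, and each document is assumed to have at least one gold label, as in EURLEX57K.

```python
import math

def macro_rp_at_k(all_ranked, all_gold, k):
    """Macro-averaged RP@K: correct labels in the top K, divided by min(K, R_t)."""
    scores = []
    for ranked, gold in zip(all_ranked, all_gold):
        hits = sum(1 for label in ranked[:k] if label in gold)
        scores.append(hits / min(k, len(gold)))
    return sum(scores) / len(scores)

def macro_ndcg_at_k(all_ranked, all_gold, k):
    """Macro-averaged nDCG@K with binary relevance rel_t(k)."""
    scores = []
    for ranked, gold in zip(all_ranked, all_gold):
        dcg = sum(1.0 / math.log2(i + 2) for i, label in enumerate(ranked[:k]) if label in gold)
        idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
        scores.append(dcg / idcg)
    return sum(scores) / len(scores)
```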

D Experimental Results
In Tables 6-9, we present additional results for the main measures used across the LMTC literature (P@K, R@K, RP@K, nDCG@K).