Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases

Interpretability or explainability is an emerging research field in NLP. From a user-centric point of view, the goal is to build models that provide proper justification for their decisions, similar to those of humans, by requiring the models to satisfy additional constraints. To this end, we introduce a new application on legal text where, contrary to mainstream literature targeting word-level rationales, we conceive rationales as selected paragraphs in multi-paragraph structured court cases. We also release a new dataset comprising European Court of Human Rights cases, including annotations for paragraph-level rationales. We use this dataset to study the effect of already proposed rationale constraints, i.e., sparsity, continuity, and comprehensiveness, formulated as regularizers. Our findings indicate that some of these constraints are not beneficial in paragraph-level rationale extraction, while others need re-formulation to better handle the multi-label nature of the task we consider. We also introduce a new constraint, singularity, which further improves the quality of rationales, even compared with noisy rationale supervision. Experimental results indicate that the newly introduced task is very challenging and there is a large scope for further research.


Introduction
Model interpretability (or explainability) is an emerging field of research in NLP (Lipton, 2018; Jacovi and Goldberg, 2020). From a model-centric point of view, the main focus is to demystify a model's inner workings, for example targeting self-attention mechanisms (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019), and more recently Transformer-based language models (Clark et al., 2019; Kovaleva et al., 2019; Rogers et al., 2020). From a user-centric point of view, the main focus is to build models that learn to provide proper justification for their decisions, similar to those of humans (Zaidan et al., 2007; Lei et al., 2016), by requiring the models to satisfy additional constraints. Here we follow a user-centric approach to rationale extraction, where the model learns to select a subset of the input that justifies its decision. To this end, we introduce a new application on legal text where, contrary to mainstream literature targeting word-level rationales, we conceive rationales as automatically selected paragraphs in multi-paragraph structured court cases. While previous related work mostly targets binary text classification tasks (DeYoung et al., 2020), our task is a highly skewed multi-label text classification task. Given a set of paragraphs that refer to the facts of each case (henceforth facts) in judgments of the European Court of Human Rights (ECtHR), the model aims to predict the allegedly violated articles of the European Convention of Human Rights (ECHR). We adopt a rationalization by construction methodology (Lei et al., 2016), where the model is regularized to satisfy additional constraints that reward the model if its decisions are based on concise rationales it selects, as opposed to inferring explanations from the model's decisions in a post-hoc manner (Ribeiro et al., 2016; Alvarez-Melis and Jaakkola, 2017; Murdoch et al., 2018).
Legal judgment prediction has been studied in the past for cases ruled by the European Court of Human Rights (Aletras et al., 2016; Medvedeva et al., 2018; Chalkidis et al., 2019) and for Chinese criminal court cases (Luo et al., 2017; Hu et al., 2018; Zhong et al., 2018), but there is no precedent of work investigating the justification of the models' decisions. As in other domains (e.g., financial, biomedical), explainability is a key feature in the legal domain, which may potentially improve the trustworthiness of systems that abide by the principle of the right to explanation (Goodman and Flaxman, 2017).

Figure 1: A depiction of the ECtHR process: The applicant(s) request a hearing from ECtHR regarding specific accusations (alleged violations of ECHR articles) against the defendant state(s), based on facts. The Court (judges) assesses the facts and the rest of the parties' submissions, and rules on the violation or not of the allegedly violated ECHR articles. Here, prominent facts referred to in the court's assessment are highlighted.

We investigate the explainability of the decisions of state-of-the-art models, comparing the paragraphs they select to those selected by legal professionals, both litigants and lawyers, in alleged violation prediction. In the latter task, introduced in this paper, the goal is to predict the accusations (allegations) made by the applicants. The accusations can usually be predicted given only the facts of each case. By contrast, in the previously studied legal judgment prediction task, the goal is to predict the court's decision; this is much more difficult and relies heavily on case law (precedent cases).
Although the new task (alleged violation prediction) is simpler than legal judgment prediction, models that address it (and their rationales) can still be useful in the judicial process (Fig. 1). For example, they can help applicants (plaintiffs) identify alleged violations that are supported by the facts of a case. They can help judges identify more quickly facts that support the alleged violations, contributing towards more informed judicial decision making (Zhong et al., 2020). They can also help legal experts identify previous cases related to particular allegations, helping analyze case law (Katz, 2012). Our contributions are the following:

• We introduce rationale extraction for alleged violation prediction in ECtHR cases, a more tractable task compared to legal judgment prediction. This is a multi-label classification task that requires paragraph-level rationales, unlike previous work on word-level rationales for binary classification.
• We study the effect of previously proposed rationale constraints, i.e., sparsity, continuity (Lei et al., 2016), and comprehensiveness, formulated as regularizers. We show that continuity is neither beneficial nor required in paragraph-level rationale extraction, while comprehensiveness needs to be re-formulated for the multi-label nature of the task we consider. We also introduce a new constraint, singularity, which further improves the rationales, even compared with silver (noisy) rationale supervision.
• We release a new dataset for alleged article violation prediction, comprising 11k ECtHR cases in English, with silver rationales obtained from references in court decisions, and gold rationales provided by ECHR-experienced lawyers. 1 To the best of our knowledge, this is also the first work on rationale extraction that fine-tunes end-to-end pre-trained Transformer-based models. 2

1 Our dataset is publicly available at https://huggingface.co/datasets/ecthr_cases; see usage example in Appendix E.
2 Others fine-tuned such models only partially, i.e., the top two layers, or not at all (DeYoung et al., 2020).

Related Work

Legal judgment prediction: Initial work on legal judgment prediction in English used linear models with features based on bags of words and topics, applying them to ECtHR cases (Aletras et al., 2016; Medvedeva et al., 2018). More recently, we experimented with neural methods (Chalkidis et al., 2019), showing that hierarchical RNNs (Yang et al., 2016) and a hierarchical variation of BERT (Devlin et al., 2019) that encodes paragraphs outperform linear classifiers with bag-of-word representations. In all previous work, legal judgment prediction is tackled in an over-simplified experimental setup where only textual information from the cases themselves is considered, ignoring many other important factors that judges consider, most importantly general legal argument and past case law. Also, Aletras et al. (2016), Medvedeva et al. (2018) and Chalkidis et al. (2019) treat ECtHR judgment prediction as a binary classification task per case (any article violation or not), while the ECtHR actually considers and rules on the violation of individual articles of the European Convention of Human Rights (ECHR).
In previous work (Chalkidis et al., 2019), we also attempted to predict which particular articles were violated, assuming, however, that the Court considers all the ECHR articles in each case, which is not true. In reality, the Court considers only alleged violations of particular articles, argued by applicants. Establishing which articles are allegedly violated is an important preliminary task when preparing an ECtHR application. Instead of oversimplifying the overall judgment prediction task, we focus on the preliminary task and use it as a test-bed for generating paragraph-level rationales in a multi-label text classification task for the first time.
Legal judgment prediction has also been studied in Chinese criminal cases (Luo et al., 2017;Hu et al., 2018;Zhong et al., 2018). Similarly to the literature on legal judgment prediction for ECtHR cases, the aforementioned approaches ignore the crucial aspect of justifying the models' predictions.
Given the gravity that legal outcomes have for individuals, explainability is essential to increase the trust of both legal professionals and laypersons in system decisions and to promote the use of supportive tools (Barfield, 2020). To the best of our knowledge, our work is the first step in this direction for the legal domain, but it is also applicable in other domains (e.g., biomedical), where justifications of automated decisions are essential.
Rationale extraction by construction: Contrary to earlier work that required supervision in the form of human-annotated rationales (Zaidan et al., 2007; Zhang et al., 2016), Lei et al. (2016) introduced a self-supervised methodology to extract rationales (that supported aspect-based sentiment analysis predictions), i.e., gold rationale annotations were used only for evaluation. Furthermore, models were designed to produce rationales by construction, contrary to work studying saliency maps (generated by a model without explainability constraints) using gradients or perturbations at inference time (Ribeiro et al., 2016; Alvarez-Melis and Jaakkola, 2017; Murdoch et al., 2018). Lei et al. (2016) aimed to produce short coherent rationales that could replace the original full texts, maintaining the model's predictive performance. The rationales were extracted by generating binary masks indicating which words should be selected, and two additional loss regularizers were introduced, which penalize long rationales and fragmented masks (ones that select non-consecutive words). Follow-up work proposed a further constraint to ensure that the rationales contain all the relevant information, formulated as a minimax game where two players, one using the predicted binary mask and another using the complement of this mask, aim to correctly classify the text; if the first player fails to outperform the second, the model is penalized. Other work uses a Generative Adversarial Network (GAN) (Goodfellow et al., 2014), where a generator producing factual rationales competes with a generator producing counterfactual rationales to trick a discriminator; this GAN was not designed to perform classification: given a text and a label, it produces a rationale supporting (or not) the label. More recent work decoupled the model's predictor from the rationale extractor to produce inherently faithful explanations, ensuring that the predictor considers only the rationales and not other parts of the text.
Faithfulness refers to how accurately an explanation reflects the true reasoning of a model (Lipton, 2018;Jacovi and Goldberg, 2020).
All the aforementioned work conceives rationales as selections of words, targeting binary classification tasks even when this is inappropriate. For instance, DeYoung et al. (2020), among others, over-simplified the task of the multi-passage reading comprehension (MultiRC) dataset (Khashabi et al., 2018), turning it into a binary classification task with word-level rationales, while sentence-level rationales seem more suitable.

Responsible AI: Our work complies with the ECtHR data policy. By no means do we aim to build a 'robot' lawyer or judge, and we acknowledge the possible harmful impact (Angwin et al., 2016; Dressel and Farid, 2018) of irresponsible deployment. Instead, we aim to support fair and explainable AI-assisted judicial decision making and empirical legal studies. We consider our work part of ongoing critical research on responsible AI (Elish et al., 2021) that aims to provide explainable and fair systems to support human experts.

The New ECtHR Dataset
The court (ECtHR) hears allegations regarding breaches of human rights provisions of the European Convention of Human Rights (ECHR) by European states (Fig. 1). The court rules on a subset of all ECHR articles, which are predefined (alleged) by the applicants (plaintiffs). Our dataset comprises 11k ECtHR cases and can be viewed as an enriched version of the ECtHR dataset of Chalkidis et al. (2019), which did not provide ground truth for alleged article violations (articles discussed) and rationales. The new dataset includes the following:

Facts: Each judgment includes a list of numbered paragraphs that represent the facts of the case, i.e., they describe the main events that are relevant to the case. We hereafter call these paragraphs facts for simplicity. Note that the facts are presented in chronological order. Not all facts have the same impact or hold crucial information with respect to alleged article violations and the court's assessment; i.e., facts may refer to information that is trivial or otherwise irrelevant to the legally crucial allegations against defendant states.
Allegedly violated articles: Judges rule on specific accusations (allegations) made by the applicants (Harris, 2018). In ECtHR cases, the judges discuss and rule on the violation, or not, of specific articles of the Convention. The articles to be discussed (and ruled on) are put forward (as alleged article violations) by the applicants and are included in the dataset as ground truth; we identify 40 violable articles in total. In our experiments, however, the models are not aware of the allegations. They predict the Convention articles that will be discussed (the allegations) based on the case's facts, and they also produce rationales for their predictions. Models of this kind could be used by potential applicants to help them formulate future allegations (articles they could claim to have been violated), as already noted, but here we mainly use the task as a test-bed for rationale extraction.
Violated articles: The court decides which allegedly violated articles have indeed been violated. These decisions are also included in our dataset and could be used for full legal judgment prediction experiments (Chalkidis et al., 2019). However, they are not used in the experiments of this work.
Silver allegation rationales: Each decision of the ECtHR includes references to facts of the case (e.g., "See paragraphs 2 and 4.") and case law (e.g., "See Draci vs. Russia (2010)."). We identified references to each case's facts and retrieved the corresponding paragraphs using regular expressions. These are included in the dataset as silver allegation rationales, on the grounds that the judges refer to these paragraphs when ruling on the allegations.
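This extraction step can be sketched with a small regular expression; the pattern below and the example sentence are illustrative assumptions, not the exact expressions used to build the dataset (real decisions use more varied phrasings, e.g., ranges such as "paragraphs 10-14"):

```python
import re

# Hypothetical pattern for fact references such as "see paragraphs 2 and 4"
# or "see paragraph 17"; illustrative only.
REF_RE = re.compile(r"see paragraphs?\s+([\d,\s]+(?:and\s+\d+)?)", re.IGNORECASE)

def fact_references(decision_text):
    """Return the set of fact-paragraph numbers referenced in a decision."""
    refs = set()
    for match in REF_RE.finditer(decision_text):
        refs.update(int(n) for n in re.findall(r"\d+", match.group(1)))
    return refs

print(sorted(fact_references(
    "The Court notes the applicant's complaint (see paragraphs 2 and 4) "
    "and the medical reports (see paragraph 17).")))  # → [2, 4, 17]
```

The referenced paragraph numbers are then mapped back to the facts section to obtain the silver mask over paragraphs.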
Gold allegation rationales: A legal expert with experience in ECtHR cases annotated a subset of 50 test cases to identify the relevant facts (paragraphs) of the case that support the allegations (alleged article violations). In other words, each identified fact justifies (hints at) one or more alleged violations.

Task definition: In this work, we investigate alleged violation prediction, a multi-label text classification task where, given the facts of an ECtHR case, a model predicts which of the 40 violable ECHR articles were allegedly violated according to the applicant(s). The model also needs to identify the facts that most prominently support its decision.

Methods
We first describe a baseline model that we use as our starting point. It adopts the framework proposed by Lei et al. (2016), which generates rationales by construction: a text encoder sub-network reads the text; a rationale extraction sub-network produces a binary mask indicating the most important words of the text; and a prediction sub-network classifies a hard-masked version of the text. We then discuss additional constraints that have been proposed to improve word-level rationales, which can be added to the baseline as regularizers. We argue that one of them is not beneficial for paragraph-level rationales. We also consider variants of previous constraints that better suit multi-label classification tasks and introduce a new one.

Baseline Model
Our baseline is a hierarchical variation of BERT (Devlin et al., 2019) with hard attention, dubbed HIERBERT-HA. Each case (document) D is viewed as a list of facts (paragraphs) D = [P_1, ..., P_N]. Each paragraph is a list of tokens P_i = [w_1, ..., w_{L_i}]. We first pass each paragraph independently through a shared BERT encoder (Fig. 2) to extract context-unaware paragraph representations P_i^[CLS], using the [CLS] embedding of BERT. Then, a shallow encoder with two Transformer layers (Vaswani et al., 2017) produces contextualized paragraph embeddings, which are in turn projected to two separate spaces by two different fully-connected layers, K and Q, with SELU activations (Klambauer et al., 2017). K produces the paragraph encoding P_i^K, to be used for classification; and Q produces the paragraph encoding P_i^Q, to be used for rationale extraction. The rationale extraction sub-network passes each P_i^Q encoding independently through a fully-connected layer with a sigmoid activation to produce soft attention scores a_i ∈ [0, 1]. The attention scores are then binarized using a 0.5 threshold, leading to hard attention scores z_i (z_i = 1 iff a_i > 0.5). The hard-masked document representation D_M is obtained by hard-masking the paragraph encodings and max-pooling:

D_M = maxpool([z_1 · P_1^K, ..., z_N · P_N^K])

D_M is then fed to a dense layer with sigmoid activations, which produces a probability estimate per label, Ŷ = [ŷ_1, ..., ŷ_|A|], in our case per article of the Convention, where |A| is the size of the label set. For comparison, we also experiment with a model that masks no facts, dubbed HIERBERT-ALL.

The thresholding that produces the hard (binary) masks z_i is not differentiable. To address this problem, Lei et al. (2016) used reinforcement learning (Williams, 1992), while Bastings et al. (2019) proposed a differentiable mechanism relying on the reparameterization trick (Louizos and Welling, 2017).
We follow a simpler trick from prior work, where during backpropagation the thresholding is detached from the computation graph, allowing the gradients to bypass the thresholding and reach the soft attentions a_i directly. In previous work, we proposed a hierarchical variation of BERT with self-attention (Chalkidis et al., 2019). In parallel work, Yang et al. (2020) proposed a similar Transformer-based Hierarchical Encoder (SMITH) for long document matching.
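The forward pass of the hard-attention masking and pooling can be sketched in plain Python; the two-dimensional paragraph vectors stand in for the real P_i^K encodings and are made up for illustration (in a real implementation the detach trick additionally lets gradients bypass the binarization, e.g., `z = hard + (a - a.detach())` in PyTorch):

```python
def hard_mask(a, threshold=0.5):
    """Binarize soft attention scores a_i into hard scores z_i (forward pass)."""
    return [1.0 if ai > threshold else 0.0 for ai in a]

def masked_maxpool(paragraph_vecs, z):
    """Zero out unselected paragraph encodings, then max-pool per dimension
    to obtain the hard-masked document representation D_M."""
    masked = [[zi * x for x in p] for p, zi in zip(paragraph_vecs, z)]
    return [max(col) for col in zip(*masked)]

a = [0.9, 0.2, 0.6]                       # soft attention scores per paragraph
P = [[1.0, -2.0], [5.0, 5.0], [0.5, 3.0]]  # toy paragraph encodings P_i^K
z = hard_mask(a)
print(z)                     # → [1.0, 0.0, 1.0]
print(masked_maxpool(P, z))  # → [1.0, 3.0]
```

Note that the highly scored middle paragraph ([5.0, 5.0]) does not contribute to D_M because its attention score falls below the threshold.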

Rationale Constraints as Regularizers
Sparsity: Modifying the word-level sparsity constraint of Lei et al. (2016) for our paragraph-level rationales, we also hypothesize that good rationales include a small number of facts (paragraphs) that sufficiently justify the allegations; the other facts are trivial or secondary. For instance, an introductory fact like "The applicant was born in 1984 and lives in Switzerland." does not support any allegation, while a fact like "The applicant contended that he had been beaten by police officers immediately after his arrest and later during police questioning." suggests a violation of Article 3 "Prohibition of Torture". Hence, we use a sparsity loss to control the number of selected facts:

L_s = | (1/N) Σ_{i=1..N} z_i − T |  (1)

where T is a predefined threshold specifying the desired percentage of selected facts per case. We can estimate T from silver rationales (Table 1).
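A minimal sketch of the sparsity loss, assuming the fraction of selected facts (the mean of the hard mask) is compared directly to the target T:

```python
def sparsity_loss(z, T=0.3):
    """L_s: absolute deviation of the selected fraction of facts from the
    target T (estimated from the silver rationales)."""
    return abs(sum(z) / len(z) - T)

z = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]         # 2 of 10 facts selected
print(round(sparsity_loss(z, T=0.3), 4))   # → 0.1
```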
Continuity: In their work on word-level rationales, Lei et al. (2016) also required the selected words to be contiguous, to obtain more coherent rationales. In other words, the transitions between selected (z_i = 1) and not selected (z_i = 0) words in the hard mask should be minimized. This is achieved by adding the following continuity loss:

L_c = Σ_{i=2..N} | z_i − z_{i−1} |  (2)

In paragraph-level rationale extraction, where entire paragraphs are masked, the continuity loss forces the model to select contiguous paragraphs. In ECtHR cases, however, the facts are self-contained and internally coherent paragraphs (or single sentences). Hence, we hypothesize that the continuity loss is not beneficial in our case. Nonetheless, we empirically investigate its effect.
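A sketch of the continuity penalty; it is normalized by the number of adjacent pairs here purely for readability (any constant scaling is absorbed by the regularizer weight λ_c):

```python
def continuity_loss(z):
    """L_c: average number of transitions between selected and unselected
    positions; minimized when the selection forms one contiguous block."""
    return sum(abs(z[i] - z[i - 1]) for i in range(1, len(z))) / (len(z) - 1)

print(continuity_loss([1, 1, 1, 0, 0]))  # one contiguous block → 0.25
print(continuity_loss([1, 0, 1, 0, 1]))  # fully scattered     → 1.0
```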
Comprehensiveness: We also adapt the comprehensiveness loss of previous work, which was introduced to force the hard mask Z = [z_1, ..., z_N] to (ideally) keep all the words (in our case, paragraphs about facts) of the document D that support the correct decision Y. In our task, Y = [y_1, ..., y_|A|] is a binary vector indicating the Convention articles the court discussed (gold allegations) in the case of D. Intuitively, the complement Z^c of Z, i.e., the hard mask that selects the words (in our case, facts) that Z does not select, should not select sufficient information to predict Y. Given D, let D_M, D_M^c be the representations of D obtained with Z, Z^c, respectively; let Ŷ, Ŷ^c be the corresponding probability estimates; and let L_p, L_p^c be the classification losses, typically total binary cross-entropy, measuring how far Ŷ, Ŷ^c are from Y. In its original form, the comprehensiveness loss requires L_p^c to exceed L_p by a margin h:

L_g = max(0, L_p + h − L_p^c)  (3)
While this formulation may be adequate in binary classification tasks, in multi-label classification it is very hard to pre-select a reasonable margin, given that cross-entropy is unbounded, that the distribution of true labels (articles discussed) is highly skewed, and that some labels are easier to predict than others. To make the selection of h more intuitive, we propose a reformulation of L_g that operates on class probabilities rather than classification losses. The right-hand side of Eq. 3 becomes:

max(0, h − (1/|A|) Σ_{a=1..|A|} (2y_a − 1)(ŷ_a − ŷ_a^c))  (4)

The margin h is now easier to grasp and tune. It encourages the same gap between the probabilities predicted with Z and Z^c across all labels (articles). We also experiment with a third variant of comprehensiveness, which does not compare the probabilities we obtain with Z and Z^c, comparing instead the two latent document representations:

L_g = cos(D_M, D_M^c)  (5)

where cos denotes cosine similarity. This variant forces D_M and D_M^c to be as dissimilar as possible, without requiring a preset margin.
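The representation-based variant reduces to a plain cosine similarity between the two pooled document vectors; in the sketch below the vectors are toy stand-ins for D_M and D_M^c, and whether the loss is additionally clipped at zero is left open:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def comprehensiveness_repr(d_masked, d_complement):
    """Representation-based L_g: minimizing it pushes the document vector
    built from the rationale Z away from the one built from Z^c."""
    return cosine(d_masked, d_complement)

print(round(comprehensiveness_repr([1.0, 0.0], [0.0, 1.0]), 4))  # orthogonal → 0.0
print(round(comprehensiveness_repr([1.0, 1.0], [2.0, 2.0]), 4))  # same direction → 1.0
```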
Singularity: A limitation of the comprehensiveness loss (any variant) is that it only requires the mask Z to be better than its complement Z^c. This does not guarantee that Z is better than every other mask. Consider a case where the gold rationale consists of three facts and Z selects only two of them. The model may produce better predictions with Z than with Z^c, and D_M may be very different from D_M^c in Eq. 5, but Z is still not the best mask. To address this limitation, we introduce the singularity loss L_r, which requires Z to be better than a mask Z_r, randomly generated per training instance and epoch, that selects as many facts as the sparsity threshold T allows:

L_r = γ · L_g(Z, Z_r), with γ = 1 − cos(Z, Z_r)  (6)

Here L_g(Z, Z_r) is any variant of L_g, but now using Z_r instead of Z^c; and γ regulates the effect of L_g(Z, Z_r) by considering the cosine distance between Z_r and Z. The more Z and Z_r overlap, the less we care if Z performs better than Z_r.
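A sketch of the random-mask machinery behind singularity: sampling Z_r at the target sparsity and computing the weight γ as the cosine distance between the two masks (γ then scales L_g(Z, Z_r)). The uniform sampling scheme is an assumption of this sketch:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def random_mask(n, T, rng):
    """Random mask Z_r selecting floor(n*T) facts, resampled per instance/epoch."""
    z = [0.0] * n
    for i in rng.sample(range(n), int(n * T)):
        z[i] = 1.0
    return z

def singularity_weight(z, z_r):
    """gamma = cosine distance between Z and Z_r: the more the two masks
    overlap, the less Z is penalized for not beating Z_r."""
    return 1.0 - cosine(z, z_r)

rng = random.Random(0)
z = [1.0, 1.0, 0.0, 0.0, 0.0]
z_r = random_mask(5, 0.4, rng)           # selects 2 of 5 facts at random
print(z_r, round(singularity_weight(z, z_r), 3))
```

When Z_r happens to equal Z, γ is 0 and the term vanishes; when the two masks are disjoint, γ is 1 and the full L_g(Z, Z_r) penalty applies.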
The total loss of our model is computed as follows:

L = L_p + L_p^c + L_p^r + λ_s L_s + λ_c L_c + λ_g L_g + λ_r L_r  (7)

Again, L_p is the classification loss; L_p^c, L_p^r are the classification losses when using Z^c, Z_r, respectively; and all λs are tunable hyper-parameters.
We include L_p^c in Eq. 7 because otherwise the network would have no incentive to make D_M^c and Ŷ^c competitive in prediction; and similarly for L_p^r.

Rationale supervision: For completeness, we also experimented with a variant that utilizes silver rationales for noisy rationale supervision (Zaidan et al., 2007). In this case the total loss becomes:

L = L_p + λ_s L_s + λ_ns · MAE(Z, Z_s)  (8)

where MAE is the mean absolute error between the predicted mask Z and the silver mask Z_s, and λ_ns weighs the effect of MAE in the total loss.
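The supervision term is a plain mean absolute error between the predicted and silver masks, sketched below on toy masks:

```python
def mae_mask_loss(z_pred, z_silver):
    """Mean absolute error between the predicted rationale mask and the
    silver mask derived from references in the court's decision."""
    return sum(abs(p - s) for p, s in zip(z_pred, z_silver)) / len(z_pred)

z_pred   = [1, 0, 1, 0, 0]   # facts selected by the model
z_silver = [1, 1, 0, 0, 0]   # facts referenced in the decision
print(mae_mask_loss(z_pred, z_silver))  # 2 mismatches over 5 facts → 0.4
```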

Experimental Setup
For all methods, we conducted grid-search to tune the hyper-parameters λ*. We used the Adam optimizer (Kingma and Ba, 2015) across all experiments with a fixed learning rate of 2e-5. 7 All methods rely on LEGAL-BERT-SMALL (Chalkidis et al., 2020), a variant of BERT (Devlin et al., 2019) with 6 layers, 512 hidden units and 8 attention heads, pre-trained on legal corpora. Based on this model, we were able to use up to 50 paragraphs of 256 words each on a single 32GB GPU. In preliminary experiments, we found that the proposed model relying on a shared paragraph encoder, i.e., one that passes the same context-aware paragraph representations P_i^[CLS] to both the Q and K sub-networks, as in Fig. 2, has comparable performance and better rationale quality, compared to a model with two independent paragraph encoders, as used in the literature (Lei et al., 2016). 8 For all experiments, we report the average and standard deviation across five runs. We evaluate: (a) classification performance, (b) faithfulness (Section 2), and (c) rationale quality, while respecting a given sparsity threshold (T).
Classification performance: Given the label skewness, we evaluate classification performance using micro-F1, i.e., for each Convention article, we compute its F1, and micro-average over articles.
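Micro-F1 pools true positives, false positives, and false negatives over all (case, article) pairs before computing precision and recall; a minimal sketch with made-up label sets:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-case label sets: counts are pooled across
    all cases and labels before precision/recall are computed."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [{3, 6}, {5}]       # allegedly violated articles per case
pred = [{3}, {5, 13}]      # model predictions
print(round(micro_f1(gold, pred), 3))  # → 0.667
```

Because counts are pooled, frequent articles dominate the score, which is the intended behavior under heavy label skew.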
Faithfulness: Recall that faithfulness refers to how accurately an explanation reflects the true reasoning of a model. To measure faithfulness, we report sufficiency and comprehensiveness (DeYoung et al., 2020). Sufficiency measures the difference between the predicted probabilities for the gold (positive) labels when the model is fed with the whole text (Ŷ_f^+) and when the model is fed only with the predicted rationales (Ŷ^+). Comprehensiveness (not to be confused with the homonymous loss of Eq. 3-5) measures the difference between the predicted probabilities for the gold (positive) labels obtained when the model is fed with the full text (Ŷ_f^+) and when it is fed with the complement of the predicted rationales (Ŷ_c^+). We also compare classification performance (again using micro-F1) in both cases, i.e., when considering masked inputs (using Z) and complementary inputs (using Z^c).
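Both faithfulness scores reduce to average probability drops on the gold labels; a toy sketch (the probabilities below are made up for illustration):

```python
def mean_drop(p_full, p_other):
    """Average drop in predicted probability for the gold (positive) labels
    when the input is restricted from the full text to another view."""
    return sum(f - o for f, o in zip(p_full, p_other)) / len(p_full)

p_full       = [0.90, 0.80]  # gold-label probabilities, full input
p_rationale  = [0.85, 0.78]  # input restricted to the predicted rationale Z
p_complement = [0.40, 0.30]  # input restricted to the complement Z^c

sufficiency = mean_drop(p_full, p_rationale)         # lower is better
comprehensiveness = mean_drop(p_full, p_complement)  # higher is better
print(round(sufficiency, 3), round(comprehensiveness, 3))  # → 0.035 0.5
```

A faithful rationale thus barely hurts the prediction when kept (low sufficiency) and badly hurts it when removed (high comprehensiveness).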
7 In preliminary experiments, we tuned the baseline model on development data as a stand-alone classifier and found that the optimal learning rate was 2e-5, searching in the set {2e-5, 3e-5, 4e-5, 5e-5}. The optimal drop-out rate was 0.
8 See Appendix B for additional details and results.

Rationale quality: Faithful explanations (of system reasoning) are not always appropriate for users (Jacovi and Goldberg, 2020), thus we also evaluate rationale quality from a user perspective. The latter
can be performed in two ways. Objective evaluation compares predicted rationales with gold annotations, typically via Recall, Precision, F1 (comparing system-selected to human-selected facts in our case). In subjective evaluation, human annotators review the extracted rationales. We opt for an objective evaluation, mainly due to lack of resources. As rationale sparsity (number of selected paragraphs) differs across methods, which affects Recall, Precision, F1, we evaluate rationale quality with mean R-Precision (mRP) (Manning et al., 2009).

Experimental Results

Table 2 reports the classification performance of HIERBERT-ALL (no masking, no rationales), across ECHR articles. F1 is 72.5% or greater for most of the articles with 1,000 or more training instances. The scores are higher for articles 2, 3, 5, 6 because, according to the legal expert who provided the gold allegation rationales, (i) there is a sufficient number of cases regarding these articles, and (ii) the interpretation and application of these articles is more fact-dependent than those of other articles, such as articles 10 or 11 (Harris, 2018). On the other hand, although there is a fair amount of training instances for articles 13, 14, 34, and 46, these articles are triggered in a variety of ways, many of which turn on legal procedural technicalities.

[Table 3, truncated in this copy, appears here: per-method test results, reporting sparsity, micro-F1 with Z and with Z^c, sufficiency, and comprehensiveness.]
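R-Precision scores the top-R selected facts, with R set per case to the number of gold rationale facts, so methods with different sparsity are compared on an equal footing; a sketch assuming facts are ranked by their soft attention scores:

```python
def r_precision(gold_facts, ranked_facts):
    """R-Precision: precision among the top-R ranked facts, where R is the
    number of gold rationale facts for the case."""
    r = len(gold_facts)
    return len(set(ranked_facts[:r]) & set(gold_facts)) / r

def mean_r_precision(cases):
    """mRP: average R-Precision over cases."""
    return sum(r_precision(g, ranked) for g, ranked in cases) / len(cases)

# per case: (gold rationale facts, facts ranked by attention score)
cases = [([2, 5, 7], [5, 2, 9, 7]),
         ([1], [3, 1])]
print(round(mean_r_precision(cases), 3))  # (2/3 + 0) / 2 → 0.333
```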

Tuning the λ Hyper-parameters
Instead of tuning all the λ* hyper-parameters of Eq. 7 simultaneously, we adopt a greedy, but more intuitive strategy: we tune one λ at a time, fix its value, and proceed to the next; λs that have not yet been tuned are set to zero, i.e., the corresponding regularizer is not used yet. We begin by tuning λ_s, aiming to achieve a desirable level of sparsity without harming classification performance. We set the sparsity threshold of L_s (Eq. 1) to T = 0.3 (select approx. 30% of the facts), which is the average sparsity of the silver rationales (Table 1). We found that λ_s = 0.1 achieves the best overall results on development data, thus we use this value for the rest of the experiments. To check our hypothesis that continuity (L_c) is not beneficial in our task, we tuned λ_c on development data, confirming that the best overall results are obtained for λ_c = 0. Thus we omit L_c in the rest of the experiments.

Comprehensiveness/Singularity Variants
Next, we tuned and compared the variants of the comprehensiveness loss L_g (Table 4). Targeting the label probabilities (Eq. 4) instead of the losses (Eq. 3) leads to lower rationale quality. Targeting the document representations (Eq. 5) has the best rationale quality results, retaining (as with all versions of L_g) the original classification performance (micro-F1) of Table 2. Hence, we keep the L_g variant of Eq. 5 in the remaining experiments of this section, with the corresponding λ_g value (1e-3).

Table 5: Development results for the singularity loss variants (micro-F1 | sparsity | mRP on silver rationales | mRP on gold rationales):
Eq. 3, 6: 73.4 ± 0.8 | 32.8 ± 2.8 | 36.9 ± 3.6 | 39.0 ± 3.9
Eq. 4, 6: 72.5 ± 0.7 | 32.0 ± 1.0 | 39.7 ± 3.1 | 42.6 ± 3.8
Eq. 5, 6: 72.8 ± 0.3 | 31.5 ± 0.9 | 33.0 ± 2.7 | 35.5 ± 2.6

Concerning the singularity loss L_r (Table 5), targeting the label probabilities (Eq. 4, 6) provides the best rationale quality, compared to all the methods considered. Interestingly, Eq. 5, which performed best in L_g (Table 4), does not perform well in L_r, which uses L_g (Eq. 6). We suspect that in L_r, where we use a random mask Z_r that may overlap with Z, requiring the two document representations D_M, D_M^r to be dissimilar (when using Eq. 5, 6) may be a harsh regularizer with negative effects.

Table 3 presents results on test data. The models that use the hard attention mechanism and are regularized to extract rationales under certain constraints (HIERBERT-HA + L*) have comparable classification performance to HIERBERT-ALL. Furthermore, although paragraph embeddings are contextualized and probably leak some information for all methods, our proposed extensions of the rationale constraints better approximate faithfulness, while also respecting sparsity. Our proposed extensions lead to low sufficiency (lower is better, ↓), i.e., there is only a slight deterioration in label probabilities when we use the predicted rationale instead of the whole input.
They also lead to high comprehensiveness (higher is better, ↑); we see a 20% deterioration in label probabilities when using the complement of the rationale instead of the whole input. Interestingly, our variant with the singularity loss (Eq. 4, 6) is more faithful than the model that uses supervision on silver rationales (Eq. 8).

Rationale Quality
We now consider rationale quality, focusing on HIERBERT-HA variants without rationale supervision. Similarly to our findings on development data (Tables 4, 5), we observe (Table 6) that using (a) our version of the comprehensiveness loss (Eq. 5) or (b) our singularity loss (Eq. 4, 6) achieves better results compared to former methods, with (b) giving the best results. The singularity loss is better in both settings (silver or gold test rationales), even compared to a model that uses supervision on silver rationales. The random masking of the singularity loss, which guides the model to learn to extract masks that perform better than any other mask, proved particularly beneficial for rationale quality. Similar observations derive from the results on the full test set considering silver rationales. In general, however, we observe that the rationales extracted by all models are far from human rationales, as indicated by the poor results (mRP, F1) on both silver and gold rationales. Hence, there is ample scope for further research.

Qualitative Analysis
Quality of silver rationales: Comparing silver rationales with the gold ones annotated by the legal expert, we find that silver rationales are not complete, i.e., they are usually fewer than the gold ones. They also include additional facts that have not been annotated by the expert. According to the expert, these facts do not support the allegations, but are included for technical reasons (e.g., "The national court did not accept the applicant's allegations."). Nonetheless, ranking methods by their rationale quality measured on silver rationales produces the same ranking as measuring on gold rationales in the common subset of cases (Table 6). Hence, it may be possible to use silver rationales, which are available for the full dataset, to rank systems participating in ECtHR rationale generation challenges.
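Both kinds of disagreement (missing gold paragraphs and extra "technical" paragraphs) can be quantified per case with simple set operations over paragraph indices; a hypothetical example, with all indices invented for illustration:

```python
# Hypothetical paragraph indices for one case (not real data).
silver = {2, 5, 11}       # silver rationales: incomplete, with one extra fact
gold = {2, 5, 7, 9}       # gold rationales annotated by the legal expert

missing = gold - silver   # gold paragraphs absent from the silver set
extra = silver - gold     # "technical" paragraphs only in the silver set
recall = len(silver & gold) / len(gold)
print(sorted(missing), sorted(extra), recall)  # [7, 9] [11] 0.5
```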
Model bias: Low mRP with respect to gold rationales means that the models partially rely on non-causal reasoning, i.e., they select secondary facts that do not justify the allegations according to the legal expert. In other words, the models are sensitive to specific language; e.g., they misuse (are easily fooled by) references to health issues and medical examinations as support for alleged violations of Article 3, or references to appeals in higher courts as support for Article 5, even when there is no concrete evidence. 11 Manually inspecting the predicted rationales, we did not identify bias on demographics, although such spurious features may be buried in the contextualized paragraph encodings (P_i^[CLS]). In general, de-biasing models could benefit rationale extraction, and we aim to investigate this direction in future work (Huang et al., 2020).
Plausibility: Plausibility refers to how convincing an interpretation is to humans (Jacovi and Goldberg, 2020). While the legal expert annotated all facts relevant to the allegations, according to his manual review allegations can also be justified by sub-selections (parts) of the rationales. Thus, although a method may fail to extract all the available rationales, the provided (incomplete) set of rationales may still be a convincing explanation. Properly estimating plausibility across methods requires a subjective human evaluation, which we did not conduct due to lack of resources.

Conclusions and Future work
We introduced a new application of rationale extraction in a new legal text classification task concerning alleged violations in ECtHR cases. We also released a dataset for this task to foster further research. Moreover, we compared various rationale constraints in the form of regularizers and introduced a new one (singularity) that improves faithfulness and rationale quality in a paragraph-level setup, compared both with silver and gold rationales.
In the future, we plan to investigate more constraints that may better fit paragraph-level rationale extraction, and to explore techniques to de-bias models and improve rationale quality. Paragraph-level rationale extraction can also be conceived as self-supervised extractive summarization to denoise long documents, a direction we plan to explore in the challenging task of case law retrieval (Locke and Zuccon, 2018).

C Annotation of Gold Rationales
The full dataset has the following characteristics:
• There are 1,000 cases in the test set. These are the most recent, ruled from October 5, 2017 until July 7, 2019.
• The average number of facts (paragraphs) per case is 25.2 ranging from 5 to 259.

Based on the above statistics and the opinion of the legal expert, for the gold rationales we considered a subset of 50 cases with the following characteristics:
• Each case should consist of 25 ± 10 facts.
• The cases should be as representative as possible with respect to the defendants (European states).
• The cases should have allegations in a subset of the articles {2, 3, 5, 6}, whose interpretation is more fact-dependent, based on the literature (Harris, 2018) and our empirical results (Table 2).
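The selection criteria above amount to a simple predicate over each case; a hypothetical sketch (the dictionary keys "facts" and "allegations" are our invention, not the dataset's actual schema):

```python
# Hypothetical filter reproducing the selection criteria for the 50
# gold-annotated cases. Each case is a dict with assumed keys
# "facts" (list of factual paragraphs) and "allegations" (set of articles).

EXPLAINABLE_ARTICLES = {2, 3, 5, 6}

def eligible(case):
    n_facts = len(case["facts"])
    return (15 <= n_facts <= 35                     # 25 +/- 10 facts
            and bool(case["allegations"])            # at least one allegation
            and case["allegations"] <= EXPLAINABLE_ARTICLES)

cases = [
    {"facts": ["..."] * 20, "allegations": {3, 5}},  # eligible
    {"facts": ["..."] * 50, "allegations": {3}},     # too many facts
    {"facts": ["..."] * 20, "allegations": {3, 8}},  # Article 8 outside the set
]
print([eligible(c) for c in cases])  # [True, False, False]
```

(The remaining criterion, representativeness across defendant states, is a property of the whole sample rather than of any single case, so it is not captured by a per-case predicate.)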
In brief, the annotation guidelines were:
• The annotator (legal expert) inspects (reads) all the factual paragraphs of the case and selects one or more articles from the predefined set {2, 3, 5, 6} that, according to the text, should have been argued by the applicants.
• The annotator selects the factual paragraphs that "clearly" indicate allegations for the selected article(s), annotated in the first step.
The legal expert's performance, compared to the gold allegedly violated articles, is 92.3% micro-F1 (Table 8). In a few cases, the legal expert selected more articles (hypothesized allegations for Articles 3 and 5) than the gold ones. As he suggested, it is a common trend for applicants, based on the legal opinion of their attorneys, to raise allegations only for the few articles that they believe can be justified and proved to be violated; e.g., if a citizen has no concrete evidence (documents) of his torture, his lawyer may advise him not to raise this issue in his application. The legal expert also missed a few allegations for Articles 2 and 6. The best of our models (HIERBERT-SA + L_s + L_r) achieves 87.6% micro-F1 on the same subset.
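For reference, multi-label micro-F1, as used to compare the expert's hypothesized allegations against the gold ones, can be computed as below; the labels are toy values (the 92.3% figure in the text comes from the real annotations, not this example):

```python
# Minimal multi-label micro-F1 over per-case label sets.
def micro_f1(gold, pred):
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    return 2 * tp / (2 * tp + fp + fn)

gold = [{3}, {5, 6}, {2}]
pred = [{3, 5}, {5, 6}, {2}]  # one extra hypothesized allegation (Article 5)
print(round(micro_f1(gold, pred), 3))
```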

D Additional Experimental Results
For completeness, we report results on development data for the sparsity loss (L_s) in Table 9 and the continuity loss (L_c) in Table 10, for different values of the λ_* hyper-parameters. In Section 5.6, we reported rationale quality on a subset of the test data that includes both silver and gold allegation rationales. For completeness, in Table 11 we report results on the full test set for silver rationales. We observe that all findings hold, in particular the ranking of the methods with respect to the subset with silver and gold rationales. Furthermore, we observe that rationale quality on the full test set is slightly inferior in most cases (by 2-4%), which is expected, as the sample annotated by the expert included only cases with allegations for articles that are more explainable.

E Using the ECtHR dataset via the datasets library
The dataset is available at https://archive.org/details/ECtHR-NAACL2021, but you can easily load and use it in Python with two lines of code:

from datasets import load_dataset
dataset = load_dataset("ecthr_cases")
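Each record pairs a case's factual paragraphs with the allegedly violated articles and paragraph-level rationales. The field names below are our assumption about the released schema, illustrated on a mock record rather than the real data (the snippet therefore runs without downloading anything):

```python
# Mock record illustrating the assumed schema of the released dataset
# (field names are our assumption; the paragraph texts are invented).
record = {
    "facts": ["4. The applicant was arrested on 1 March 2010 ...",
              "5. He complained of ill-treatment during his detention ..."],
    "labels": ["3", "5"],        # allegedly violated ECHR articles
    "silver_rationales": [1],    # indices of rationale paragraphs in "facts"
}

# Retrieve the paragraphs marked as (silver) rationales.
rationale_paragraphs = [record["facts"][i] for i in record["silver_rationales"]]
print(rationale_paragraphs[0])
```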

F Examples of extracted Rationales from ECtHR cases with comments
In Fig. 3-7, we present examples of ECtHR cases. The highlighting (green background colour) indicates gold rationales. The dots (green dots on the left) indicate rationales extracted by our best model, HIERBERT-HA + L_s + L_g (Eq. 4, 6). In the caption of each figure, we include short comments explaining false positives (paragraphs the model wrongly selected) and false negatives (paragraphs the model wrongly missed).

[…], 13 and 20 clearly indicate a plausible violation of the right to liberty (Article 5), as they refer to the continuous extension of the applicant's detention, but our model was unable to extract them, and was thus unable to predict this allegation. The model targeted only paragraphs that indicate ill-treatment, which is connected to a plausible violation of Article 3 (Prohibition of Torture).

[…] clearly indicate that the applicant's health (life) was at risk and the authorities did not pay attention, but these paragraphs were not selected by the model. Instead, paragraph 10 states that the applicant initially informed the authorities of his medical history and they provided medication. This is an indication of the model's sensitivity to language describing health issues (tuberculosis) in general, rather than to specific, well-defined allegations of ill-treatment on the merits.

Figure 6: (BRAJOVIC AND OTHERS v. MONTENEGRO, No. 52529/12) A causal inference would connect paragraph 8 (initial trial) with paragraphs 20-22 (subsequent trials) to infer mistrial, because there was no verdict after a reasonable period of time. Instead, the model seems to be sensitive to references to the involvement of higher courts as justification of mistrial (paragraphs 10, 13, 18, and 21). This suggests that the model probably follows poor (greedy) reasoning, i.e., if the applicant appealed to higher courts, then there was a mistrial.

Figure 7: (RAJAK v. MONTENEGRO, No. 71998/11) Similarly to the case presented in Fig.
6, the main argument in this case is mistrial because there was no verdict within a reasonable period of time (paragraphs 5 and 18-19). The model selected paragraph 11, which does not indicate plausible violations. Given the model's prediction of allegations with respect to Article 1 of the 1st Protocol, on the protection of property, we believe that paragraph 11 was selected by the model as justification on that matter.