Legal Judgment Prediction via Topological Learning

Legal Judgment Prediction (LJP) aims to predict the judgment result based on the facts of a case and is a promising application of artificial intelligence techniques in the legal field. In real-world scenarios, legal judgment usually consists of multiple subtasks, such as the decisions on applicable law articles, charges, fines, and the term of penalty. Moreover, there exist topological dependencies among these subtasks. While most existing works focus only on a specific subtask of judgment prediction and ignore the dependencies among subtasks, we formalize the dependencies among subtasks as a Directed Acyclic Graph (DAG) and propose a topological multi-task learning framework, TopJudge, which incorporates multiple subtasks and DAG dependencies into judgment prediction. We conduct experiments on several real-world large-scale datasets of criminal cases in the civil law system. Experimental results show that our model achieves consistent and significant improvements over baselines on all judgment prediction tasks. The source code can be obtained from https://github.com/thunlp/TopJudge.


Introduction
Legal Judgment Prediction (LJP) aims to predict the judgment results of legal cases according to their fact descriptions. It is a critical technique for legal assistant systems. On the one hand, LJP can provide low-cost but high-quality legal consulting services to the masses who are unfamiliar with legal terminology and complex judgment procedures. On the other hand, it can serve as a handy reference for professionals (e.g., lawyers and judges) and improve their work efficiency.
LJP has been studied for decades (Kort, 1957; Ulmer, 1963; Nagel, 1963; Keown, 1980; Segal, 1984; Lauderdale and Clark, 2012; Ye et al., 2018), and most existing works formalize LJP as a text classification task. For example, some works (Liu et al., 2004; Liu and Hsieh, 2006) propose to extract shallow textual features (e.g., characters, words, and phrases) for charge prediction. Katz et al. (2017) predict the US Supreme Court's decisions based on efficient features from case profiles. Luo et al. (2017) propose an attention-based neural model for charge prediction by incorporating the relevant law articles.
Despite these efforts in designing efficient features and employing advanced NLP techniques, LJP is still confronted with two major challenges: Multiple Subtasks in Legal Judgment: Practically, legal judgment usually consists of detailed and complicated subclauses, such as charges, the term of penalty, and fines. Specifically, for countries with the civil law system (e.g., China, France, and Germany), the prediction of relevant articles is also considered one of the fundamental subtasks, which guides the prediction for other subtasks. In other words, all these subtasks compose the complete form of judgment prediction. Nevertheless, existing works on LJP usually focus on one specific subtask of judgments, which does not conform to real scenarios. Although some methods (Luo et al., 2017) are developed to predict law articles and charges at the same time, their models are designed for a specific set of subtasks and are hard to scale to other subtasks.
Topological Dependencies between Subtasks: For human judges, there exists a strict order among the subtasks of legal judgment. As illustrated in Fig. 1, given the fact description of a specific case, a judge in the civil law system first decides which law articles are relevant to the scenario, and then determines the charges according to the instructions of relevant law articles. Based on these results, the judge further confirms the term of penalty and fines. How to simulate the judicial logic of human judges and model the topological dependencies among legal subtasks will deeply influence the creditability and interpretability of judgment prediction.
As stated above, conventional works cannot handle these two challenges due to both the limitation of specific tasks and neglecting topological dependencies. To address these issues, we propose to model the multiple subtasks in judgment prediction jointly under a novel multi-task learning framework.
We model the topological dependencies among these subtasks with a Directed Acyclic Graph (DAG), which means all subtasks are arranged in topological order. If the judgment of the j-th subtask t_j depends on the output of the i-th subtask t_i, then t_i appears earlier than t_j in this order. It is notable that such a formulation provides an explicit explanation of the dependency relations among subtasks.
Accordingly, we introduce topological learning for LJP and propose a unified framework, named TOPJUDGE. Specifically, given the encoded representation of the fact description, TOPJUDGE predicts the outputs of all the subtasks following the topological order, and the output of a specific subtask is affected by all the subtasks it depends on. In contrast with conventional multi-task learning, our model takes the explicit topological dependencies of LJP subtasks into consideration and is flexible enough to handle other LJP subtasks. Moreover, the topological order of legal dependencies renders our model interpretable and reliable.
To verify the effectiveness and flexibility of TOPJUDGE, we conduct a series of experiments on several real-world large-scale datasets. Experimental results show that our model achieves significant and consistent improvements over state-of-the-art models on all tasks and datasets. To summarize, we make several noteworthy contributions as follows: (1) We are the first to explore and formalize the multiple subtasks of legal judgment under a joint learning framework. Moreover, we formulate the dependencies among the subtasks of LJP as a form of DAG and introduce this prior knowledge to enhance judgment prediction.
(2) We propose a novel judgment prediction framework, TOPJUDGE, to unify multiple subtasks and make judgment predictions through topological learning. This model can handle subtasks with any form of DAG dependencies, which has been verified in the experiments.
(3) We carry out experiments on several large-scale real-world datasets, and our model significantly and consistently outperforms all the baselines on all subtasks.

Judgment Prediction
Employing automatic analysis techniques for legal judgment has drawn attention from researchers in the legal field for decades. Early works usually focus on analyzing existing legal cases in specific scenarios with mathematical and statistical algorithms (Kort, 1957;Ulmer, 1963;Nagel, 1963;Keown, 1980;Segal, 1984;Lauderdale and Clark, 2012).
With the development of machine learning and text mining techniques, more researchers formalize this task under text classification frameworks. Most of these studies attempt to extract efficient features from text content (Liu and Hsieh, 2006; Lin et al., 2012; Aletras et al., 2016; Sulea et al., 2017) or case annotations (e.g., dates, terms, locations, and types) (Katz et al., 2017). However, these conventional methods can only utilize shallow textual features and manually designed factors, both of which require massive human effort and usually suffer from generalization issues when applied to other scenarios.
Inspired by the success of neural networks on NLP tasks (Kim, 2014; Baharudin et al., 2010; Tang et al., 2015), researchers began to handle LJP by incorporating neural models with legal knowledge. For example, Luo et al. (2017) present an attention-based neural network that jointly models charge prediction and relevant article extraction. Follow-up work incorporates 10 discriminative legal attributes to predict few-shot and confusing charges. Nevertheless, these models are designed for specific subtasks and are thus non-trivial to extend to more subtasks of LJP with complex dependencies. Besides, Ye et al. (2018) utilize a Seq2Seq model to generate court views from fact descriptions and predicted charges in Chinese civil law.

Multi-task Learning
Multi-task learning (MTL) aims to exploit the commonalities and differences across relevant tasks by solving them at the same time. It can transfer useful information among various tasks and has been applied to a wide range of areas, including NLP (Collobert and Weston, 2008), speech recognition (Deng et al., 2013), and computer vision (Girshick, 2015;Mao et al., 2017).
There have been numerous successful applications of MTL to NLP tasks. Most works follow the hard parameter sharing setting, sharing representations or some encoding layers among relevant tasks. For example, Collobert and Weston (2008) use shared word embeddings for part-of-speech tagging and semantic role labeling. Other work shares the encoding layers of input queries to address query classification and information retrieval. Dong et al. (2015) and Luong et al. (2016) propose to share encoders or decoders to improve one (many) to many neural machine translation. Firat et al. (2016) propose to share the attention mechanism in multi-way, multilingual machine translation. Besides hard parameter sharing, soft parameter sharing is another common approach in MTL. It assumes that each task owns its specific parameters and that the parameters of different tasks should be close to each other. For example, Duong et al. (2015) employ the L2 distance for regularization, while Yang and Hospedales (2017) use the trace norm. Other work introduces gates among task-specific RNN layers to control the information flow. Ruder et al. (2017) introduce a model that can decide the amount of sharing between different NLP tasks. There are also works focusing on growing the set of tasks (Hashimoto et al., 2017) or handling unlabeled data (Augenstein et al., 2018). In this work, we introduce a topological learning framework, TOPJUDGE, to handle multiple subtasks in LJP. Different from conventional MTL models, which focus on how to share parameters among relevant tasks, TOPJUDGE models the explicit dependencies among these subtasks with scalable DAG forms.

Method
In the following parts, we first give the essential definitions of the LJP task. We then introduce the DAG dependencies among the subtasks of LJP. Finally, we describe the neural encoder for fact representation and the judgment predictor for the subtasks with DAG dependencies. The overall framework of TOPJUDGE is shown in Fig. 2.

Problem Formulation
We focus on the LJP tasks in civil law. Suppose the fact description of a case is a word sequence x = {x_1, x_2, ..., x_n}, where n is the length of x and each word x_i comes from a fixed vocabulary W. Based on the fact description x, the task of LJP, T, aims to predict the judgment results of applicable law articles, charges, the term of penalty, fines, and so on. Formally, we assume T contains |T| subtasks, i.e., T = {t_1, t_2, ..., t_|T|}, each of which is a classification task. For the i-th subtask t_i ∈ T, we aim to predict the corresponding result y_i from the label set of t_i. Take the subtask of charge prediction for example: the corresponding label set should contain Theft, Traffic Violation, Intentional Homicide, and so on.

DAG Dependencies of Subtasks
We assume that the dependencies among the multiple subtasks of LJP form a DAG. As a result, the task list T should satisfy topological constraints. Formally, we use the notation t_i → t_j to denote that the j-th subtask depends on the i-th subtask, and let D_j = {t_i | t_i → t_j} be the set of subtasks that t_j directly depends on. The task list T can then be ordered so that for every dependency t_i → t_j, t_i appears before t_j. (1) We demonstrate the flexibility of our formulation by describing two special cases: (1) As shown in Fig. 3 (a), if no dependencies exist, i.e., D_j = ∅, it corresponds to the typical MTL setting where we simultaneously make predictions for all subtasks.
(2) As shown in Fig. 3 (b), if each task only depends on its previous task, i.e., D j = {t j−1 }, it forms a sequential learning process.
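Both special cases, and any DAG in between, reduce to computing a topological order over the subtasks before prediction. The following is a minimal sketch of this ordering step; the task names and the dictionary layout of the D_j sets are illustrative, not from the paper:

```python
from collections import deque

def topological_order(deps):
    """Order subtasks so each task appears after every task it depends on.

    `deps` maps each subtask name to the set D_j of subtasks it directly
    depends on (all-empty sets reproduce the plain multi-task setting).
    """
    indegree = {t: len(d) for t, d in deps.items()}
    # successors[i] = tasks that depend on task i
    successors = {t: [] for t in deps}
    for t, parents in deps.items():
        for parent in parents:
            successors[parent].append(t)
    queue = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for child in successors[t]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(deps):
        raise ValueError("dependencies contain a cycle, not a DAG")
    return order

# Dependencies for criminal cases as in Fig. 3 (c):
deps = {"law": set(), "charge": {"law"}, "penalty": {"law", "charge"}}
print(topological_order(deps))  # ['law', 'charge', 'penalty']
```

Kahn's algorithm is used here because it also detects cycles, which enforces the DAG assumption at construction time.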

Neural Encoder for Fact Descriptions
We employ a fact encoder to generate the fact description's vector representation as the input of TOPJUDGE. In the following part, we briefly introduce an encoder based on Convolutional Neural Networks (CNN) (Kim, 2014).
Taking a word sequence x as input, the CNN encoder computes the text representation through three layers, i.e., lookup layer, convolution layer and pooling layer.
Lookup We first convert each word x_i in x into its word embedding x̂_i ∈ R^k, where k is the dimension of word embeddings. The word embedding sequence is then represented as x̂ = {x̂_1, x̂_2, ..., x̂_n}. (2)

Convolution A convolution operation involves a convolution matrix W ∈ R^{m×(h·k)}, which is applied to a sliding window of length h with m filters to produce a feature vector c_i = W · x̂_{i:i+h−1} + b, (3) where x̂_{i:i+h−1} is the concatenation of the word embeddings within the i-th window and b ∈ R^m is the bias vector. By applying the convolution over each window, we obtain c = {c_1, ..., c_{n−h+1}}.
Pooling We apply per-dimension max-pooling over c and obtain the final fact representation d ∈ R^m, where d_j = max(c_{1,j}, ..., c_{n−h+1,j}). (4)
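The lookup, convolution, and pooling steps above can be sketched in NumPy as follows; the toy vocabulary size, dimensions, and random parameters are purely illustrative:

```python
import numpy as np

def cnn_encode(token_ids, emb, W, b):
    """Sketch of the CNN fact encoder: lookup, convolution, max-pooling.

    emb: (|V|, k) embedding table; W: (m, h*k) filter matrix; b: (m,) bias.
    Returns the m-dimensional fact representation d.
    """
    x = emb[token_ids]                      # lookup: (n, k)
    n, k = x.shape
    m, hk = W.shape
    h = hk // k                             # sliding-window length
    # convolution: one m-dim feature vector per window of h consecutive words
    c = np.stack([W @ x[i:i + h].reshape(-1) + b for i in range(n - h + 1)])
    return c.max(axis=0)                    # per-dimension max-pooling: (m,)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))                      # toy vocab of 100, k=8
W, b = rng.normal(size=(16, 3 * 8)), np.zeros(16)    # m=16 filters, h=3
d = cnn_encode(np.array([4, 17, 2, 9, 55]), emb, W, b)
print(d.shape)  # (16,)
```

In practice the paper uses multiple filter widths (Kim, 2014); this sketch keeps a single width h for brevity.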

Judgment Predictor over DAG
Based on the DAG assumption, we obtain an ordered task list T* = [t_1, t_2, ..., t_|T|]. For each task t_j ∈ T, we aim to predict its judgment result y_j based on the fact representation vector d and the judgment results of its dependent tasks. For prediction, we employ a specific LSTM cell for each task and compute the output of each task in topological order. More specifically, for each task t_j ∈ T, we obtain its final judgment result through three steps, i.e., cell initialization, task-specific representation, and prediction.
Cell Initialization As stated above, the prediction result of t_j is conditioned on the fact representation d and the outputs of all dependent tasks y_k, ∀t_k ∈ D_j. Hence, we initialize the states of t_j as h̃_j = Σ_{t_i ∈ D_j} W^h_{i,j} · h_i + b^h_j and c̃_j = Σ_{t_i ∈ D_j} W^c_{i,j} · c_i + b^c_j. (5) Here, h_i and c_i are the hidden state and memory cell of t_i, and h̃_j and c̃_j are the initial hidden state and memory cell of t_j. W_{i,j} and b_j are transformation matrices and bias vectors specific to t_i and t_j.
Task-Specific Representation Taking the fact representation d, the initial hidden state h̃_j, and the initial memory cell c̃_j as inputs, we process them with an LSTM cell (Hochreiter and Schmidhuber, 1997).
We regard the final hidden state h_j as the task-specific representation of task t_j. The final cell state c_j is used to compose the initial states of the downstream tasks by Eq. 5.

Prediction With the representation h_j, we apply an affine transformation followed by a softmax and obtain the final prediction as ŷ_j = softmax(W^p_j · h_j + b^p_j). (6) Here, W^p_j and b^p_j are parameters specific to task t_j.
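To make the information flow concrete, here is a simplified sketch of prediction over the DAG. A plain tanh update stands in for the per-task LSTM cell, and all names, shapes, and the parameter layout are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_over_dag(fact, order, deps, params):
    """Run prediction task by task in topological order.

    Cell initialization sums transformed states of all dependent tasks;
    a tanh update replaces the paper's LSTM cell for brevity."""
    h, preds = {}, {}
    for t in order:
        # cell initialization: combine states of the dependent tasks
        h0 = params[t]["b_init"].copy()
        for dep in deps[t]:
            h0 += params[t]["W_dep"][dep] @ h[dep]
        # task-specific representation (an LSTM cell in the real model)
        h[t] = np.tanh(params[t]["W_fact"] @ fact + h0)
        # prediction: affine transformation followed by softmax
        preds[t] = softmax(params[t]["W_p"] @ h[t] + params[t]["b_p"])
    return preds

rng = np.random.default_rng(1)
order = ["law", "charge", "penalty"]
deps = {"law": set(), "charge": {"law"}, "penalty": {"law", "charge"}}
labels = {"law": 3, "charge": 3, "penalty": 2}   # toy label-set sizes
sd, fd = 4, 6                                    # toy state and fact dims
params = {t: {"W_dep": {dep: rng.normal(size=(sd, sd)) for dep in deps[t]},
              "b_init": np.zeros(sd),
              "W_fact": rng.normal(size=(sd, fd)),
              "W_p": rng.normal(size=(labels[t], sd)),
              "b_p": np.zeros(labels[t])}
          for t in order}
preds = predict_over_dag(rng.normal(size=fd), order, deps, params)
```

Because each task reads only the states of tasks earlier in the order, the whole computation stays differentiable end to end.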
With the prediction result ŷ_j, we minimize the cross-entropy between ŷ_j and the ground-truth label y_j as follows: L_j = −Σ_k y_{j,k} · log(ŷ_{j,k}). (7)

Training
We use the cross-entropy loss for each subtask and sum up the losses to train TOPJUDGE: L = Σ_{j=1}^{|T|} λ_j · L_j, (8) where λ_j is the weight factor for subtask t_j. The DAG dependencies of the subtasks ensure that our model is differentiable and can be trained in an end-to-end fashion. In practice, we set all weights λ_j to 1, and employ Adam (Kingma and Ba, 2015) for optimization. We also apply dropout (Srivastava et al., 2014) on the fact representation to prevent overfitting.
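The training objective, a weighted sum of per-subtask cross-entropy losses with all weights set to 1, can be sketched as follows (the probability values are toy numbers):

```python
import numpy as np

def cross_entropy(pred, gold):
    """Negative log-probability assigned to the gold label."""
    return -np.log(pred[gold])

def total_loss(preds, golds, lambdas):
    """Weighted sum of per-subtask cross-entropy losses."""
    return sum(lambdas[t] * cross_entropy(preds[t], golds[t]) for t in preds)

preds = {"law": np.array([0.7, 0.2, 0.1]),       # toy predicted distributions
         "charge": np.array([0.6, 0.4]),
         "penalty": np.array([0.5, 0.3, 0.2])}
golds = {"law": 0, "charge": 1, "penalty": 2}     # toy gold label indices
lambdas = {t: 1.0 for t in preds}                 # the paper sets all weights to 1
loss = total_loss(preds, golds, lambdas)
```

Since every term is a standard cross-entropy, gradients flow back through all subtasks jointly.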

Experiments
To evaluate the proposed TOPJUDGE framework, we conduct a series of experiments on LJP over three large-scale datasets of criminal cases in China. We select three representative judgment prediction subtasks for comparison, including law articles, charges, and the terms of penalty. One of the datasets is drawn from Xiao et al. (2018). For all the datasets mentioned above, as the documents are well-structured and human-annotated, we can easily extract fact descriptions, applicable law articles, charges, and the terms of penalty from each document using regular expressions. We have manually checked a randomly sampled set of cases, and extraction errors are negligible.

Dataset Construction
In real-world scenarios, some cases involve multiple defendants and multiple charges, which increases the complexity of judgment prediction. As our model aims to explore the effectiveness of considering topological dependencies between various subtasks, we filter out these cases and leave handling them as future work.
Meanwhile, there are also some infrequent charges and law articles, such as money laundering, smuggling of nuclear materials, and tax evasion. We filter out these infrequent charges and law articles and only keep those with frequencies greater than 100. For the term of penalty, we divide the terms into non-overlapping intervals. We list detailed statistics of these datasets in Table 1.
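The construction steps above (frequency filtering and binning the term of penalty) might look like the following sketch; the field names and the interval boundaries are illustrative, as the paper does not publish its exact cut points:

```python
from collections import Counter

def filter_rare(cases, min_freq=100):
    """Keep cases whose charge and law article both occur more than
    `min_freq` times in the corpus (field names are hypothetical)."""
    charge_freq = Counter(c["charge"] for c in cases)
    article_freq = Counter(c["article"] for c in cases)
    return [c for c in cases
            if charge_freq[c["charge"]] > min_freq
            and article_freq[c["article"]] > min_freq]

# Hypothetical non-overlapping intervals (upper bounds in months) for
# the term of penalty.
BINS = [0, 6, 9, 12, 24, 36, 60, 84, 120]

def penalty_bin(months):
    """Map a prison term in months to its interval index."""
    for i, upper in enumerate(BINS[1:]):
        if months <= upper:
            return i
    return len(BINS) - 1  # everything beyond the last bound
```

Binning turns the term of penalty into a classification task consistent with the other two subtasks.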

Baselines
For comparison, we employ the following text classification models and judgment prediction methods as baselines: TFIDF+SVM: We employ term frequency-inverse document frequency (TFIDF) (Salton and Buckley, 1988) to extract word features and utilize an SVM (Suykens and Vandewalle, 1999) for text classification.
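The feature-extraction half of this baseline can be sketched in pure Python; a linear SVM (e.g., from scikit-learn) would then be trained on the resulting vectors. This is a minimal stand-in, not the exact implementation:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf = term count / document length; idf = log(N / document frequency).
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each word once per document
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf})
    return vectors

docs = [["defendant", "stole", "property"],
        ["defendant", "caused", "fire"]]
vecs = tfidf(docs)
```

Note that words appearing in every document (here "defendant") get zero weight, which is exactly why TF-IDF suppresses boilerplate legal phrasing.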
CNN: We employ CNN with multiple filter widths (Kim, 2014) for fact encoding and classification.
Hierarchical LSTM (HLSTM): Tang et al. (2015) employ hierarchical neural networks to learn document representations in sentiment classification. Based on this work, we employ one LSTM for sentence representations and another to obtain the representation of complete fact descriptions. Pipeline Model (PM): To demonstrate the advantage of TOPJUDGE in modeling subtasks jointly, we also implement a pipelined method for comparison. Here, we train 3 separate CNN classifiers for law articles, charges, and the term of penalty. For each subtask, the input is the concatenation of the fact representation and the embeddings of the predicted labels of previous subtasks.
Besides, we compare our model with conventional MTL methods that do not consider the dependencies among subtasks as in Fig. 3 (a). These methods are denoted as CNN-MTL and HLSTM-MTL, where we implement the fact encoder as in Fig. 2 using CNN or HLSTM respectively.

Experimental Settings
As the case documents are written in Chinese with no spaces between words, we employ THULAC (Sun et al., 2016) for word segmentation. Afterward, we adopt the Skip-Gram model (Mikolov et al., 2013) to pre-train word embeddings on these case documents, with the embedding size set to 200 and the frequency threshold set to 25.
For all models, we set the fact representation and task-specific representation size to 256. Meanwhile, we set the maximum sentence length to 128 words and maximum document length to 32 sentences.
For training, the learning rate of the Adam optimizer is 10^−3, and the dropout probability is 0.5. We set the batch size to 128 for all models. We train every model for 16 epochs and evaluate the final model on the testing set.
Here, the macro-precision/recall/F1 are calculated by averaging the precision/recall/F1 of each category.
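This macro-averaging can be sketched as follows (the label lists are toy examples):

```python
def macro_prf(golds, preds, num_classes):
    """Macro precision/recall/F1: compute P, R, F1 per class, then average."""
    ps, rs, fs = [], [], []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(golds, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(golds, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(golds, preds) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f)
    n = num_classes
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

golds = [0, 0, 1, 1, 2]
preds = [0, 1, 1, 1, 2]
p, r, f1 = macro_prf(golds, preds, 3)
```

Unlike accuracy, this metric gives each category equal weight, which is why it exposes the imbalance issues discussed in the error analysis.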

Results and Analysis
We evaluate the performance on three LJP subtasks, including law articles (denoted as t_1), charges (denoted as t_2), and the terms of penalty (denoted as t_3). Experimental results are shown in Tables 2, 3, and 4. Note that we implement TOPJUDGE with the dependency relationship in Fig. 3 (c), i.e., D_{t_2} = {t_1} and D_{t_3} = {t_1, t_2}. (9) This means that the prediction of charges depends on law articles, and the prediction of the term of penalty depends on both law articles and charges. Such explicit dependencies conform to the judicial logic of human judges, which will be verified in later sections. These results show that: (1) The proposed TOPJUDGE model outperforms the other baselines significantly on most subtasks and datasets. It demonstrates the effectiveness and robustness of our proposed framework.
(2) Compared with conventional single-task models, e.g., CNN and HLSTM, MTL methods take advantage of the correlation among relevant subtasks and thus achieve promising improvements. It indicates the importance of modeling LJP subtasks jointly.
(3) Moreover, TOPJUDGE significantly outperforms typical MTL models, especially on the prediction of charges and the terms of penalty. It verifies the rationality and importance of modeling dependencies over LJP subtasks with a DAG. To further illustrate the significance of legal dependencies and explore how the DAG dependencies influence the performance, we evaluate the performance of TOPJUDGE under various DAG architectures. Using Eq. 9 as the full dependencies, we remove the dependency between law articles and the term of penalty (t_1 → t_3, corresponding to the sequential form in Fig. 3 (b)), the dependency between law articles and charges (t_1 → t_2), and all dependencies, respectively. Results are summarized in Table 5.

Ablation Analysis
We observe that the performance of TOPJUDGE decreases on all tasks after removing either dependency. More specifically, when we drop the dependencies t_1 → t_3 and t_1 → t_2 respectively, significant decreases are observed for t_3 and t_2 correspondingly. This demonstrates that incorporating dependencies is beneficial for the relevant subtasks, verifying their guiding role in the civil law system.
Meanwhile, we note that there are two main differences between TOPJUDGE and traditional multi-task models, namely the Cell Initialization and the Task-Specific Representation. If we eliminate Cell Initialization from TOPJUDGE, the dependencies are no longer represented in the model and it becomes similar to CNN-MTL. If we eliminate the Task-Specific Representation, TOPJUDGE becomes the same as the Pipeline Model. In short, the main improvement of our model comes from combining the two.

Case Study
We give some intuitive examples to demonstrate the significance of TOPJUDGE on LJP subtasks.
As shown in Table 6, case 1 is about negligently causing a fire. The fact description of this case states "The defendant pulled up weeds in the fields and piled them up in haphazard stacks. Afterward, he lighted them up and triggered the forest fires..." TOPJUDGE predicts all judgments correctly, while CNN-MTL fails to predict the charge and the term of penalty. Moreover, CNN-MTL produces conflicting judgments, i.e., "crime of arson" and "1-2 years", because it neglects the dependencies among these subtasks. According to the legal provisions of law article 115, the crime of arson should be sentenced to more than 10 years. Case 2 in Table 6 is further evidence of the insufficiency of conventional MTL on LJP. This case is about picking quarrels and provoking troubles. Both CNN-MTL and TOPJUDGE succeed in predicting the relevant law article (i.e., law article 293 on the crime of affray). However, CNN-MTL is confused between "crime of affray" and "crime of intentional destruction or damage of properties", two charges similar to each other. Conversely, TOPJUDGE can utilize the prediction result of law articles and consequently prevent this confusion.
To summarize, modeling the explicit dependencies among various subtasks can remarkably help the LJP model address the issue of predicting conflicting results.

Error Analysis
Prediction errors made by our proposed model can be traced to the following causes.
Data Imbalance. For the subtasks of law articles and charges, our model achieves more than 90% accuracy, but only about 60% macro-F1. This issue is much more severe on the subtask of the term of penalty, where our model yields a poor performance of only 30% macro-F1. This poor performance is mainly due to the imbalance of category labels, e.g., there are only a few training instances where the term is "life imprisonment or death penalty". Most judgment prediction approaches perform poorly (especially in Recall) on these labels, as shown in Fig. 4. Instance weighting schemes can be introduced to address this issue in future work.
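One simple instance-weighting scheme of the kind suggested here (a possible future direction, not evaluated in the paper) weights each class inversely to its frequency:

```python
from collections import Counter

def inverse_freq_weights(labels, smooth=1.0):
    """Weight each class inversely to its (smoothed) frequency,
    normalized so the average weight over classes is 1."""
    freq = Counter(labels)
    raw = {c: 1.0 / (n + smooth) for c, n in freq.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}

# Toy imbalanced label distribution for the term-of-penalty subtask:
labels = ["fixed-term"] * 95 + ["life-or-death"] * 5
weights = inverse_freq_weights(labels)
```

These per-class weights would multiply each instance's loss, so rare labels such as "life imprisonment or death penalty" contribute more to the gradient.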
Incomplete Information. Following existing LJP works, we predict the final judgment according to the fact descriptions, which are incomplete compared with the whole body of materials relevant to a case. In Chinese law, there are certain circumstances under which the sentence can be shortened. For example, minors usually receive a lightened penalty, and those guilty of misdemeanors may be released pending trial after paying a security deposit. However, such information is not included in the fact descriptions. The lack of such information also raises difficulties for judgment prediction, especially for the prediction of the term of penalty. In Fig. 4, we can see that the highest error rate comes from the cases with a short term of penalty. Our model fails to distinguish the cases with no penalty from those with a 0-6 month term of imprisonment.

Conclusion
In this paper, we focus on the task of legal judgment prediction (LJP) and address multiple subtasks of judgment prediction with a topological learning framework. Specifically, we formalize the explicit dependencies over these subtasks in a DAG form and propose a novel MTL framework, TOPJUDGE, which integrates the DAG dependencies. Experimental results on three LJP subtasks and three different datasets show that TOPJUDGE outperforms all single-task baselines and conventional MTL models consistently and significantly.
In the future, we will seek to explore the following directions: (1) We will explore more LJP subtasks and more scenarios of cases such as multiple defendants and charges to investigate the effectiveness of TOPJUDGE. (2) We will explore how to incorporate into LJP the temporal factors, which are not considered in this work.