Charge-Based Prison Term Prediction with Deep Gating Network

Judgment prediction for legal cases has attracted much research effort because of its practical use, and its ultimate goal is prison term prediction. While existing work merely predicts the total prison term, in reality a defendant is often charged with multiple crimes. In this paper, we argue that charge-based prison term prediction (CPTP) not only better fits realistic needs, but also makes the total prison term prediction more accurate and interpretable. We collect the first large-scale structured dataset for CPTP and evaluate several competitive baselines. Based on the observation that fine-grained feature selection is the key to achieving good performance, we propose the Deep Gating Network (DGN) for charge-specific feature selection and aggregation. Experiments show that DGN achieves state-of-the-art performance.


Introduction
Judgment prediction (Kort, 1957; Ulmer, 1963; Segal, 1984; Liu et al., 2004; Liu and Hsieh, 2006) aims at automatically predicting the judgment result given a textual description of a legal case (an example is given in Figure 1). Recently, there has been a resurgence of interest in this task due to the availability of more data and new machine learning techniques (Luo et al., 2017; Zhong et al., 2018b).
Judgment prediction can be decomposed into several sub-tasks: (a) relevant law article extraction (Liu and Hsieh, 2006; Liu and Liao, 2005), (b) charge prediction (Liu and Hsieh, 2006; Luo et al., 2017), and (c) prison term prediction (Zhong et al., 2018a). The dependencies among them have also been studied by Zhong et al. (2018b). While effective methods exist for sub-tasks (a) and (b) (e.g., in the CAIL2018 competition (Zhong et al., 2018a), both charge prediction and article prediction attained $F_{micro}$ over 95%), prison term prediction remains the performance bottleneck.
Figure 1: Case description: On July 7, 2017, when the defendant Cui XX was drinking in a bar, he came into conflict with Zhang XX…… After arriving at the police station, he refused to cooperate with the policeman and bit the policeman on the arm…… Result of judgment: Cui XX was sentenced to 12 months imprisonment for creating disturbances and 12 months imprisonment for obstructing public affairs…… Charge #1: creating disturbances, term: 12 months. Charge #2: obstructing public affairs, term: 12 months.
* Both authors contributed equally.
In this paper, we improve the accuracy of prison term prediction by decomposing it into a set of charge-based prison term predictions (CPTPs). In this way, more subtle and sophisticated interactions between textual description and a specific charge can be captured, resulting in more precise term predictions for individual charges. Meanwhile, CPTPs also shed light on the prediction of the total prison term.
On the other hand, CPTP also poses challenges, for two reasons. First, the case description can be very lengthy, and not all parts are relevant to a specific charge. Second, the charge-related descriptions are often interleaved, making it difficult to associate a specific charge with its corresponding information.
To address the above problems, we propose the Deep Gating Network (DGN) for gradually filtering and aggregating charge-specific information at different levels of granularity. Specifically, we stack multiple blocks, each consisting of an LSTM layer and a charge-specific gating layer, to generate a focused charge-based representation of the case description. Finally, the whole document representation is obtained by a convolutional neural network.
To conduct the experiments, we construct a new dataset, which contains more than 200,000 criminal cases. To show the effectiveness of the proposed approach, we compare it with several strong baselines adapted from aspect-based sentiment classification (Wang et al., 2016; Tang et al., 2016; Chen et al., 2017). Experiments show that our method achieves significantly better results than all of them. In addition, when we leverage the results of charge-based term predictions for total prison term prediction, our method also surpasses several strong baselines that are directly aimed at total prison term prediction.
In summary, our contributions are as follows:
• We formally define the task of charge-based prison term prediction and collect the first dataset for it.
• We propose the Deep Gating Network (DGN). Experiments show that our method achieves state-of-the-art performance.
• We show that the accuracy of the total term prediction is also improved by a simple heuristic integration of individual charge-based term predictions.

Problem Definition & Dataset Construction
We formally define the task of charge-based prison term prediction as follows. The inputs are a case description $x = \{x_1, x_2, \cdots, x_n\}$ and a set of corresponding charges $c = \{c_1, c_2, \cdots, c_k\}$, where $n$ and $k$ are the length of the case description and the number of charges, respectively. The goal is to predict the prison terms $y = \{y_1, y_2, \cdots, y_k\}$, where $y_j$ is the prison term corresponding to charge $c_j$.
To the best of our knowledge, there is no existing structured dataset for the above task. We thus collect and construct a dataset based on the published records from the Supreme People's Court of China, where each criminal case document includes the accusation by the procuratorate, the court view, and the result of judgment. Following Xiao et al. (2018), we take the accusation by the procuratorate as the input textual description. The charges and the corresponding prison terms are extracted from the result of judgment using regular expressions like "sentence to months imprisonment for ". We build 238,749 well-structured cases in total (an example is given in Figure 1). The collected cases are further split into the training set, the validation set, and the test set. The statistics of the dataset are detailed in Table 1. The range of possible prison terms is [1, 240] (in months). The dataset has a broad coverage of common charges: 157 different types of charges are involved.
Table 1: Statistics of the dataset (columns: #single, #multiple, total).
Figure 2: The architecture of DGN
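The regular-expression extraction described above can be illustrated with a minimal sketch. The pattern below is a hypothetical English stand-in written for the example in Figure 1. The actual records are in Chinese and the original expressions are not reproduced here, so `TERM_PATTERN` and `extract_charge_terms` are illustrative names only.

```python
import re
from typing import List, Tuple

# Hypothetical English pattern: "<N> months imprisonment for <charge>".
# The charge text ends before " and <N> months", before punctuation, or at end of string.
TERM_PATTERN = re.compile(
    r"(\d+) months imprisonment for ([a-z ]+?)(?= and \d+ months|[.,;]|$)"
)


def extract_charge_terms(judgment: str) -> List[Tuple[str, int]]:
    """Return (charge, term in months) pairs found in a result-of-judgment sentence."""
    return [
        (charge.strip(), int(months))
        for months, charge in TERM_PATTERN.findall(judgment)
    ]


result_of_judgment = (
    "Cui XX was sentenced to 12 months imprisonment for creating disturbances "
    "and 12 months imprisonment for obstructing public affairs."
)
pairs = extract_charge_terms(result_of_judgment)
# pairs == [("creating disturbances", 12), ("obstructing public affairs", 12)]
```

Each extracted pair corresponds to one (charge $c_j$, term $y_j$) training instance for the same case description.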

Deep Gating Network
At the bottom layer of DGN, each word $x_i$ is mapped into a low-dimensional vector $h_i^{(0)}$ according to a word embedding table.
DGN then starts to construct charge-specific representations gradually. $L$ identical blocks are hierarchically stacked. The $l$-th block takes the output of the $(l-1)$-th layer, $h_i^{(l-1)}$, as input. Each block transforms its input semantic vectors into more sophisticated and focused representations based on gated feature selection and combination.
Specifically, each gating block consists of a bi-LSTM layer for context aggregation and a gating layer for charge-specific feature filtering.
In the gating layer, $g_i^{(l)}$ is the gate for the $i$-th vector of the $l$-th gating block, and it is applied through element-wise multiplication ($\odot$). The gate $g_i^{(l)}$ is computed using the target charge embedding $c_j$, so the gating layers can select charge-specific features according to the target charge.
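As a minimal sketch of one gating layer, the code below makes an assumption that is not given in the text above: the gate is a sigmoid over a linear map of the concatenated context vector and charge embedding. The bi-LSTM outputs are replaced by random stand-in vectors, and all names (`charge_gate`, `gating_layer`, `W`, `b`) are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def charge_gate(h: Vector, c: Vector, W: List[Vector], b: Vector) -> Vector:
    """Assumed gate: sigmoid(W [h; c] + b), one value in (0, 1) per hidden feature."""
    hc = h + c  # concatenate the context vector and the charge embedding
    return [sigmoid(sum(w * v for w, v in zip(row, hc)) + bias)
            for row, bias in zip(W, b)]


def gating_layer(states: List[Vector], c: Vector,
                 W: List[Vector], b: Vector) -> List[Vector]:
    """Filter every context vector with its own gate via element-wise product."""
    filtered = []
    for h in states:
        g = charge_gate(h, c, W, b)
        filtered.append([g_k * h_k for g_k, h_k in zip(g, h)])
    return filtered


random.seed(0)
d_h, d_c, n = 4, 3, 5  # hidden size, charge embedding size, sequence length
states = [[random.uniform(-1, 1) for _ in range(d_h)] for _ in range(n)]  # stand-in for bi-LSTM outputs
charge = [random.uniform(-1, 1) for _ in range(d_c)]                      # stand-in charge embedding
W = [[random.uniform(-1, 1) for _ in range(d_h + d_c)] for _ in range(d_h)]
b = [0.0] * d_h
filtered = gating_layer(states, charge, W, b)
```

Because every gate value lies strictly between 0 and 1, each feature can only be attenuated, never amplified. Changing the charge embedding changes which features pass through for the same case description.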

Convolutional Neural Network
Convolutional neural networks (CNNs) have been effective in modeling sequential data (Kim, 2014; Hu et al., 2014; Pang et al., 2016). A CNN uses convolution operations (with multiple groups of filters) for n-gram feature extraction. The sequence-level representation is then obtained through max-pooling, where the most salient n-gram features are detected and selected.
In this work, we use a CNN with filter widths of 1, 2, 3, 4, and 5. The number of filters for each width is 256. We concatenate the outputs of the different filters to form the final document representation $z$.
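The multi-width convolution and max-pooling described above can be sketched in plain Python. This is an illustration under assumptions: the filters are random and untrained, biases and nonlinearities are omitted, and short inputs are zero-padded. The names `conv_max_pool` and `document_representation` are hypothetical.

```python
import random
from typing import List, Sequence

Vector = List[float]


def conv_max_pool(seq: List[Vector], filters: List[Vector], width: int) -> Vector:
    """Apply each filter to every window of `width` tokens, then max-pool over positions."""
    dim = len(seq[0])
    # Zero-pad short inputs so that at least one window exists (an assumption of this sketch).
    padded = seq + [[0.0] * dim] * max(0, width - len(seq))
    # Each window is the concatenation of `width` consecutive token vectors.
    windows = [sum(padded[i:i + width], []) for i in range(len(padded) - width + 1)]
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows) for f in filters]


def document_representation(seq: List[Vector],
                            widths: Sequence[int] = (1, 2, 3, 4, 5),
                            num_filters: int = 256,
                            seed: int = 0) -> Vector:
    """Concatenate the max-pooled outputs of all filter widths into one vector z."""
    rng = random.Random(seed)
    dim = len(seq[0])
    z: Vector = []
    for width in widths:
        filters = [[rng.uniform(-0.1, 0.1) for _ in range(width * dim)]
                   for _ in range(num_filters)]
        z.extend(conv_max_pool(seq, filters, width))
    return z


rng = random.Random(1)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]  # 10 tokens, 8 dims each
z = document_representation(tokens)
# len(z) == 5 widths * 256 filters == 1280
```

With five widths and 256 filters per width, the concatenated representation has 1280 dimensions regardless of the input length.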

Output and Training
The charge-specific document representation $z$ is passed to a fully connected layer with ReLU activation for the final prediction:
$$\hat{y} = \mathrm{ReLU}(W_o z + b_o) \quad (1)$$
where $W_o$ and $b_o$ are trainable parameters.
Since the Mean Squared Error (MSE) loss cannot reflect the relative deviation ratio between the prediction and the ground-truth, we take the logarithm before estimating their difference.
To alleviate the impact of outliers and stabilize training, we propose to use the Huber Loss (Huber, 1964), applied to the log difference $\delta$ between the prediction and the ground truth. In its standard form, with $a$ set to 1 in our experiments:
$$L_a(\delta) = \begin{cases} \frac{1}{2}\delta^2 & \text{if } |\delta| \le a \\ a\left(|\delta| - \frac{1}{2}a\right) & \text{otherwise} \end{cases}$$
Total Term Prediction. Although our model is trained to predict the prison term for a specific charge, it can be readily adapted to predict the total term by a simple heuristic integration of the individual charge-based prison term predictions. There are certain regulations for the combined punishment of crimes in Chinese legislation. For simplicity, we take the average of the maximum and the sum of the individual charge-specific term predictions. The total term prediction is also capped at 240 months.
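The total-term heuristic described above can be written in a few lines. This is a minimal sketch; the function name `total_term` and the constant name are illustrative.

```python
from typing import Sequence

MAX_TERM_MONTHS = 240  # upper bound of the prison term range, in months


def total_term(charge_terms: Sequence[float]) -> float:
    """Average of the maximum and the sum of per-charge terms, capped at 240 months."""
    estimate = (max(charge_terms) + sum(charge_terms)) / 2.0
    return min(estimate, MAX_TERM_MONTHS)


# Two charges predicted at 12 months each: (max 12 + sum 24) / 2
print(total_term([12, 12]))  # 18.0
```

For a single charge the maximum and the sum coincide, so the heuristic returns the charge-based prediction unchanged.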

Evaluation Metrics
For evaluation, we adopt the official score function (S metric) of the CAIL2018 Competition (Zhong et al., 2018a). The score function measures the log difference $\delta$ between the predicted value $\hat{y}$ and the gold value $y$ as in Eq 2. The final score $s(\delta)$ is a piecewise function that decreases as $\delta$ grows, so smaller deviations earn higher scores. For more details about the S metric, we refer interested readers to Zhong et al. (2018a). We also report the exact match (EM) rate and the error-tolerant accuracy Acc@p, where $p$ is the maximum acceptable error rate. Formally, a prediction is considered "correct" if and only if its value is in the range $[y(1 - p), y(1 + p)]$.
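The error-tolerant accuracy and exact match rate can be sketched as follows. Rounding real-valued predictions to whole months for EM is an assumption of this sketch, and the sample values are made up for illustration only.

```python
from typing import Sequence


def acc_at_p(preds: Sequence[float], golds: Sequence[float], p: float) -> float:
    """Fraction of predictions inside [y(1 - p), y(1 + p)]."""
    hits = sum(1 for y_hat, y in zip(preds, golds)
               if y * (1 - p) <= y_hat <= y * (1 + p))
    return hits / len(golds)


def exact_match(preds: Sequence[float], golds: Sequence[float]) -> float:
    """Fraction of predictions equal to the gold term after rounding (assumed)."""
    hits = sum(1 for y_hat, y in zip(preds, golds) if round(y_hat) == y)
    return hits / len(golds)


# Illustrative values only, not results from the paper.
preds = [12.2, 9.0, 30.0, 8.0]
golds = [12, 12, 24, 12]
print(acc_at_p(preds, golds, 0.25))  # 0.75
print(exact_match(preds, golds))     # 0.25
```

With $p = 0.25$, a gold term of 12 months accepts any prediction between 9 and 15 months, so 9.0 counts as correct and 8.0 does not.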

Compared Methods
The task of charge-based prison term prediction is similar in spirit to aspect-based sentiment classification (Pang et al., 2008), where multiple classification decisions are made given one text description and different target entities. This suggests that neural architectures proposed for aspect-based sentiment classification may also be suitable for our task. The adaptation from classification to regression is easily accomplished by replacing the original final layer with that of Eq 1. Specifically, we adapt the following models:
• ATAE-LSTM (Wang et al., 2016): it concatenates the aspect embedding with the output of an LSTM and uses self-attention to obtain an aspect-based representation.
• MemNet (Tang et al., 2016): it uses multi-hop attention over the word embeddings of a sentence, where the aspect embedding is regarded as the initial key.
• RAM (Chen et al., 2017): it also uses multi-hop attention for aspect-specific representation learning, while the attention results at different time steps are aggregated by a recurrent neural network.
• TNet: it has a similar architecture to DGN. The major difference is that it employs a Transformation Network for mixing the information in the aspect embedding and token representations rather than the explicit gates in our model.
The aspect embedding in the above models is replaced by the charge embedding in our experiments. In addition, we also compare with popular models for total term prediction (Zhong et al., 2018a,b):
• CNN (Kim, 2014): the case description is encoded by a CNN with multiple filter widths, followed by max-pooling.
• RNN (Hochreiter and Schmidhuber, 1997): a bi-LSTM is used for case description encoding, where the final states are regarded as the document representation.
• RCNN (Lai et al., 2015): we stack a CNN on top of the LSTM states for the final representation.

Main Results
The results of charge-based prison term prediction are shown in Table 2. Our model achieves the best results on all four metrics. In addition, the margins between our model and the others are remarkably wide. Aspect-based sentiment models give only moderate performance, which we attribute to the length of the case descriptions: they are so long that more rigorous feature selection, such as that performed by DGN, is needed. Our model selects and aggregates features in an explicit way, which is more efficient and effective in dealing with charge-specific descriptions that are often spread out across lengthy case documents in CPTP.
Table 3 presents the results of the total term prediction. Although our method is not directly trained to make the final prediction, its performance surpasses all baselines, which confirms that the charge-based breakdown can indeed help total prison term prediction.

Depth of DGN
To study the impact of the number of DGN blocks, we test our model with various depths and show the results in Fig 3. As shown, the performance improves as the depth of DGN increases until it reaches 3, after which the performance begins to drop, likely due to overfitting.

Effects of Log Huber Loss
We compare Log Huber Loss (LHL) with Mean Squared Error (MSE), Mean Absolute Error (MAE), and Huber Loss (HL). We also tried Log Cosh Loss (LCL), but it did not converge. As shown in Figure 4, Log Huber Loss performs best on all metrics for all models. The improvement is most significant on the S metric, which suggests that making the loss function consistent with the evaluation metric is beneficial.
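A minimal sketch of the Log Huber Loss compared above is given below. The exact form of the log difference is an assumption: the `+ 1` offset is added here so that the logarithm stays defined when a prediction is zero, and the function names are illustrative.

```python
import math


def log_huber_loss(y_hat: float, y: float, a: float = 1.0) -> float:
    """Huber loss on the log difference between prediction and gold term."""
    # Assumed log difference; the +1 offset keeps log() defined at zero.
    delta = math.log(y_hat + 1.0) - math.log(y + 1.0)
    if abs(delta) <= a:
        return 0.5 * delta * delta       # quadratic region for small deviations
    return a * (abs(delta) - 0.5 * a)    # linear region limits the pull of outliers


def mse_loss(y_hat: float, y: float) -> float:
    return (y_hat - y) ** 2


# Doubling a short term and doubling a long term (illustrative values):
small = (log_huber_loss(2.0, 1.0), mse_loss(2.0, 1.0))
large = (log_huber_loss(120.0, 60.0), mse_loss(120.0, 60.0))
```

MSE scores the two cases at 1 and 3600, while the log-based loss keeps them within the same order of magnitude, reflecting relative rather than absolute deviation.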

Error Analysis
So far, our model has the best results on prison term prediction. In this section, we conduct an in-depth analysis to answer the following questions: (1) In which cases does our model fail to deliver accurate predictions? (2) What are the prospects for further improvement? After carefully analyzing 100 examples, we roughly classify them into the following categories.
Lengthy Description. Some cases are extremely complicated, especially those involving gangs. These descriptions are often lengthy and involve multiple criminal suspects.
Incomplete Information. In some cases, the input case description does not contain sufficient information for precise prediction. Note that we only take the accusation by the procuratorate as input, which is incomplete compared to the whole set of materials relevant to a case. For example, if a defendant reoffends within a short period, he/she shall be given a heavier punishment.
Rare Cases. Some special circumstances influence the prison term, yet rarely occur in the training set. For example, if a defendant causes injuries to others due to excessive defense, he/she shall be given a lighter punishment. This knowledge is easily understood by humans, but hard for machine learning models to learn.

Ethical Discussions
Although research on prison term prediction has considerable potential to improve efficiency and fairness in criminal justice, there are certain ethical concerns worth discussing. First, does the training data provide unbiased and sufficient examples? For example, some may worry that the model may treat people differently based on race, social class, age, and so on (Tonry, 2014). Past discrimination may be learned by models. Also, as society develops, new forms of crime will appear. A model trained on historical data may fail in these new cases.
Second, is the learned system robust enough? Some subtle details may significantly affect the result of judgment. For example, numerical values such as the amount stolen or the quantity of drugs are often not expressed uniformly across case descriptions, making them hard for neural models to learn. Some infrequent words, such as named entities, may also cause undesirable interference.
A mistake in legal judgment is serious: it means people losing years of their lives in prison, or dangerous criminals being released to reoffend. We should pay attention to how to avoid judges' over-dependence on the system, and it is necessary to consider its application scenarios. In practice, we recommend deploying our system in the "Review Phase", where other judges check the judgment result given by a presiding judge. Our system can serve as one anonymous checker.
In summary, judgment prediction is an emerging technology at an exploratory stage. We should be aware of the risks and prevent any inappropriate use of the technology.

Conclusion
In this paper, we formally presented the task of charge-based prison term prediction and introduced the first large-scale dataset for it. To tackle the problem of noisy and entangled descriptions of legal cases, we proposed the Deep Gating Network for charge-specific information filtering. Experiments show that our model significantly improves the accuracy of charge-based prison term prediction, as well as of total term prediction. Finally, we discussed some ethical problems of the proposed techniques that warrant careful consideration.