An Element-aware Multi-representation Model for Law Article Prediction

Existing work has shown that using law articles as external knowledge can improve the performance of Legal Judgment Prediction. However, these approaches do not fully exploit law article information, and most current work handles only single-label samples. In this paper, we propose a Law Article Element-aware Multi-representation Model (LEMM), which makes full use of law article information and supports multi-label samples. The model uses the labeled elements of law articles to extract fact description features from multiple angles, generating multiple representations of a fact for classification. Every label has a law-aware fact representation that encodes more information. To capture the dependencies between law articles, the model also introduces a self-attention mechanism between the multiple representations. Compared with baseline models such as TopJudge, our model improves accuracy by 5.84%, macro F1 by 6.42%, and micro F1 by 4.28%.


Introduction
Legal Judgment Prediction (LJP) aims to predict a law case's judgment results given a fact description text. LJP mainly contains three sub-tasks: law article prediction, charge prediction, and term-of-penalty prediction. In the civil law system, correctly predicting the relevant law articles can help improve the accuracy of charge prediction (Luo et al., 2017). The investigation of law article prediction is therefore significant for LJP.
Law article prediction aims to predict a case's relevant law articles given its fact description (hereinafter abbreviated as fact). In this task, law articles play an essential role as external information. Luo et al. (2017) use candidate law articles to improve the performance of the charge prediction task. However, current research has two main limitations. One is that certain law articles are considerably similar, which makes them difficult to distinguish; using the representation of an entire law article to extract fact information is not discriminative enough. The other is that most works (Zhong et al., 2018a; Yang et al., 2019; Liu et al., 2019) predict only on single-label examples, whereas in actual judgments many cases involve multiple relevant law articles (Zhong et al., 2018b).
The human judging process mainly compares the elements of a law article with the case description, such as the subject of the crime (a person or a specific identity), the object of the crime (a person or thing), the purpose and motive of the crime, the harmful behavior, the harmful result, and the crime scene (time or place).
To make full use of law article information and reduce the confusion in distinguishing different law articles, we design a Law Article Element-aware Multi-representation Model (LEMM). Being based on law article elements, LEMM is closer to human cognitive logic and more interpretable. We call it LEMM because it extracts fact features specifically by using law article elements and generates multiple law-aware fact representations. Each label has its own fact representation in classification, which benefits the law article prediction task; using a single vector to distinguish the correct articles among more than 100 relevant law articles is inadequate. Since each law-aware fact representation treats a law article as an individual unit, while law articles have dependencies among them, we capture the relationships between them via a self-attention mechanism. LEMM achieves excellent performance on all evaluation indicators.

Structurally Labeling the Law Articles
Judging whether a law article and a case are relevant mainly depends on whether the key elements (including the subject, object, purpose, motive, harmful behavior, harmful result, and circumstances of the crime) are consistent with the law. Therefore, we divide law articles into seven elements: 1. the crime subject; 2. the crime object; 3. the purpose and motive of the crime; 4. the harmful behavior; 5. the harmful result; 6. the crime occasion; and 7. the supplementary explanation. Since a law article may contain multiple crimes, such a law article has multiple groups of elements corresponding to the different crimes. As shown in Figure 1, we first divide the content of a law article according to crime and then label the various elements. For elements that are not specified or restricted, we mark them as None. We label 183 candidate law articles of the CAIL dataset (Xiao et al., 2018), covering a total of 202 crimes.

LEMM Model
The fact is a word sequence {w_1, w_2, ..., w_m}. The model uses labeled law articles to help extract features of the fact. A labeled law article contains the name of the crime and the elements of the crime. The name of the crime is a word sequence {w_1, w_2, ..., w_n}. The elements of the crime are seven word sequences {ele_1, ele_2, ..., ele_7}, where ele_i is {w_1, w_2, ..., w_{ik}}.
Our model contains five components:
Encoder: encodes the law article elements and the fact.
Feature Extraction: uses element representations to extract word-level and document-level fact representations via an attention mechanism.
Fusion: fuses the word-level and document-level fact representations into law-aware representations.
Relation Extraction: extracts the dependencies between law articles via self-attention.
Classification: decides whether each law article is relevant.

Encoder
The Encoder component contains two encoders: an element encoder and a fact encoder.

Element Encoder
The Element Encoder uses a BiGRU (Cho et al., 2014) to encode the crime name and the crime elements. It takes the hidden state of the last token as the representation of the input. This process is shown below:

Fact Encoder
The Fact Encoder also uses a BiGRU. It takes the hidden state of the last token as the document-level representation of the fact, F, and each hidden state as the corresponding word representation x_i.
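The two encoders can be sketched with a toy bidirectional recurrent pass. This is a minimal stand-in, not the paper's implementation: a simple Elman-style cell with fixed scalar weights replaces the learned GRU gates, purely to illustrate how the per-token states x_i and the last-token representation F are produced.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=0.5):
    # Minimal Elman-style cell standing in for a GRU cell:
    # each feature is updated as h' = tanh(w_h*h + w_x*x).
    return [math.tanh(w_h * hi + w_x * xi) for hi, xi in zip(h, x)]

def bi_encode(tokens, dim):
    """Bidirectional encoding (BiGRU stand-in).

    Returns (word_states, seq_rep): per-token states are the
    concatenated forward/backward hidden states (the x_i of the
    Fact Encoder); seq_rep is the last token's state, used as the
    sequence representation (F, or an element representation).
    """
    fwd, h = [], [0.0] * dim
    for x in tokens:                 # forward pass
        h = rnn_step(h, x)
        fwd.append(h)
    bwd, h = [], [0.0] * dim
    for x in reversed(tokens):       # backward pass
        h = rnn_step(h, x)
        bwd.append(h)
    bwd.reverse()
    word_states = [f + b for f, b in zip(fwd, bwd)]
    seq_rep = word_states[-1]
    return word_states, seq_rep

# Toy fact of 3 "word vectors" of dimension 2.
fact = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states, F = bi_encode(fact, dim=2)
```

The element encoder would run the same routine over the crime name and over each of the seven element word sequences, keeping only `seq_rep`.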

Word-level Feature Extraction
We use each law article element as a query to generate word-level representations of the fact via an attention mechanism. The calculation is shown below, where rep_wi is the word-level representation of the fact extracted by ele_i and f is a non-linear function.
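A minimal sketch of this element-queried attention follows. The paper only says f is non-linear, so a plain dot-product score with a softmax is used here as an illustrative choice, not as the paper's exact scoring function.

```python
import math

def softmax(xs):
    m = max(xs)                     # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def element_attention(ele, words):
    """rep_wi: attend over fact word states x_1..x_m with the element
    representation ele_i as the query, then return the weighted sum."""
    weights = softmax([dot(ele, x) for x in words])
    dim = len(words[0])
    return [sum(a * x[j] for a, x in zip(weights, words)) for j in range(dim)]

# Toy example: 3 word states and one element query, all 2-dimensional.
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ele = [1.0, 0.0]
rep_w = element_attention(ele, words)
```

Running this once per element yields the seven word-level representations rep_w1..rep_w7 for one crime.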

Document-level Feature Extraction
To further strengthen the interaction between the fact and the law articles, we also use the crime name representations to extract the document-level features of the fact, rep_d, via an element-wise product.
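The element-wise (Hadamard) product itself is straightforward; a one-line sketch:

```python
def doc_level_rep(name_rep, fact_rep):
    """rep_d: element-wise product of the crime-name representation
    and the document-level fact representation F."""
    return [a * b for a, b in zip(name_rep, fact_rep)]

# Toy 3-dimensional representations.
rep_d = doc_level_rep([0.5, 2.0, 1.0], [4.0, 0.5, 3.0])
```

Each crime name thus produces one document-level feature vector per fact.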

Fusion
The Fusion component fuses the word-level and document-level representations. Since the word-level representations are based on the crime elements, the document-level representations are derived from the crime name, and each crime belongs to a law article, we first perform crime-level fusion and then law-article-level fusion.

Crime-aware Fusion
We use linear fusion to combine the crime name and the seven elements corresponding to the crime. The document-level fact representation generated by the crime name and the word-level fact representations generated by the crime elements are concatenated and passed through a linear function for fusion.
Here f is a linear function, [;] denotes concatenation, and chAware is the crime-aware representation.
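The concatenate-then-linear step can be sketched as follows. The weight matrix and toy dimensions are hypothetical; in the model they would be learned parameters.

```python
def linear(vec, W, b):
    # f: a plain affine map; W is out_dim x in_dim.
    return [sum(wij * vj for wij, vj in zip(row, vec)) + bi
            for row, bi in zip(W, b)]

def crime_aware_fusion(rep_d, word_reps, W, b):
    """chAware: concatenate the document-level representation with the
    seven word-level representations, then apply the linear function f."""
    concat = list(rep_d)
    for r in word_reps:
        concat.extend(r)
    return linear(concat, W, b)

# Toy sizes: dim-2 representations, 1 + 7 vectors -> 16-dim concat, fused to dim 2.
rep_d = [1.0, 0.0]
word_reps = [[0.1, 0.2]] * 7
W = [[0.1] * 16, [0.2] * 16]   # hypothetical learned weights
b = [0.0, 0.0]
chAware = crime_aware_fusion(rep_d, word_reps, W, b)
```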

Law-aware Fusion
The Law-aware Fusion combines crime-aware representations at the granularity of a law article. Some law articles contain only one crime, so for these we take the crime-aware representation directly as the article-aware representation.
articleAware_s = chAware_u (10)
Here articleAware_s is the fact representation generated by the s-th law article and chAware_u is the fact representation generated by the u-th crime, where crime u belongs to lawarticle_s and lawarticle_s contains only this one crime. When multiple crimes occur in one law article, we want to select the prominent features of the crime-aware representations. Since argmax does not propagate gradients, we use softmax instead.
chAware_{vt} is the t-th position of the v-th crime-aware representation vector, and s_{mvt} is the softmax score of the t-th position of the v-th crime-aware representation.
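A sketch of this soft selection for a multi-crime article: at each position t, the crime-aware values chAware_{vt} are weighted by a softmax over crimes (a differentiable stand-in for argmax) and summed. This reading of the formula is our reconstruction from the text.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def law_aware_fusion(crime_reps):
    """articleAware for an article with several crimes: per position t,
    softmax-weight chAware_{vt} across crimes v and sum."""
    dim = len(crime_reps[0])
    fused = []
    for t in range(dim):
        col = [rep[t] for rep in crime_reps]   # chAware_{vt} over v
        s = softmax(col)                       # s_{mvt}
        fused.append(sum(w * c for w, c in zip(s, col)))
    return fused

# Two crimes in one article, dim-3 crime-aware representations.
articleAware = law_aware_fusion([[3.0, 0.0, 1.0], [0.0, 3.0, 1.0]])
```

Where one crime's value dominates a position, the fused value stays close to it; where the crimes agree, the value is preserved exactly.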

Relation Extraction
Throughout the pipeline from the Encoder through Feature Extraction to Fusion, each law article is treated as an independent individual; so far we have not taken the interaction between law articles into account. To extract this interaction, we use the self-attention mechanism (Vaswani et al., 2017) to calculate the relationships between law articles.
input_i is the new fact representation used to decide whether the i-th law article is relevant.
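A minimal sketch of this step, applying scaled dot-product self-attention over the article-aware representations. For brevity the learned Q/K/V projection matrices of the standard formulation are omitted, so Q = K = V = the representations themselves; this is an illustrative simplification, not the paper's exact parameterization.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(reps):
    """Scaled dot-product self-attention over the article-aware
    representations; returns one input_i per law article."""
    d = len(reps[0])
    out = []
    for q in reps:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in reps])
        out.append([sum(w * v[j] for w, v in zip(weights, reps))
                    for j in range(d)])
    return out

# Toy: three article-aware representations of dimension 2.
inputs = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output input_i is a mixture of all article-aware representations, so dependencies between articles flow into every label's decision vector.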

Classification
We have generated multiple article-aware representations for one fact, and each representation input_i corresponds to one law article. We use these representations for classification separately: each label has its own vector for prediction, which helps retain more feature information. Unlike other multi-label classifiers, which select labels by thresholding a softmax output, we use multiple binary classifications.
MLP is a multi-layer perceptron.
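A sketch of the per-label binary decision: each law-aware representation is passed through an MLP with a sigmoid output and thresholded independently. The layer sizes, weights, and the 0.5 threshold are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vec, W1, W2):
    # A minimal two-layer perceptron: tanh hidden units, one sigmoid
    # output (biases omitted for brevity).
    hidden = [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in W1]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)))

def predict(article_inputs, W1, W2, threshold=0.5):
    """One independent binary decision per law article, instead of a
    single softmax over all labels."""
    return [mlp(x, W1, W2) > threshold for x in article_inputs]

# Toy: 3 law-aware fact representations of dim 2, shared MLP weights.
W1 = [[1.0, -1.0], [0.5, 0.5]]   # hypothetical learned weights
W2 = [1.0, 1.0]
labels = predict([[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]], W1, W2)
```

Because each decision is independent, any subset of articles can be predicted relevant, which is what multi-label cases require.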

Experiments
This section covers data selection, experimental parameter settings, baseline models, and detailed experimental results.

Dataset and Evaluation
We use the CAIL 2018 small dataset (Xiao et al., 2018). CAIL (Chinese AI and Law Challenge) is a criminal case dataset released for competition by the Supreme People's Court of China; details can be found in Xiao et al. (2018). Considering the severe long-tail distribution of samples in the dataset, we select only samples whose relevant law articles occur more than 300 times. To study the model's performance on low-frequency samples, we also conduct experiments on the complete small dataset. We use the correct rate and micro/macro accuracy, precision, recall, and F1 as evaluation indicators.
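For clarity, the micro/macro F1 indicators for the multi-label setting can be computed as below: micro-F1 pools true/false positive and false negative counts across all labels, while macro-F1 averages the per-label F1 scores. The toy gold/prediction sets are illustrative only.

```python
def prf(tp, fp, fn):
    # Precision, recall, F1 from raw counts, guarding zero divisions.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro_f1(gold, pred, n_labels):
    """Multi-label micro/macro F1 over per-sample label sets."""
    tps = [0] * n_labels; fps = [0] * n_labels; fns = [0] * n_labels
    for g, p in zip(gold, pred):
        for l in range(n_labels):
            if l in p and l in g:
                tps[l] += 1
            elif l in p:
                fps[l] += 1
            elif l in g:
                fns[l] += 1
    micro = prf(sum(tps), sum(fps), sum(fns))[2]
    macro = sum(prf(t, f, n)[2] for t, f, n in zip(tps, fps, fns)) / n_labels
    return micro, macro

# Toy: 3 samples, 3 candidate law articles (labels 0..2).
gold = [{0, 1}, {1}, {2}]
pred = [{0}, {1}, {1, 2}]
micro, macro = micro_macro_f1(gold, pred, 3)
```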

Experimental Parameter Setting
We use the Thulac (Li and Sun, 2009) tool for word segmentation and CBOW (Rong, 2014) to train word vectors on the training data and law article content. The dimension of the word vectors is 300. Due to the great length of the facts, we keep only the first 256 words of each fact. The hidden size is 512, the optimizer is Adam, and the learning rate is 2e-4.

Experimental Results
We compare our model with LSTM (Cheng et al., 2016), BiLSTM, CNN (Kim, 2014), and the current state-of-the-art TopJudge model. For these baselines, the hidden size is 512, the maximum word length is 256, the kernel sizes are [3, 3, 3], and the pooling sizes are [3, 3, 3].
(1) Our model achieves outstanding performance on all evaluation indicators. Compared with TopJudge, it improves macro and micro accuracy by 12.42% and 3.34% respectively, and macro and micro recall by 14.56% and 9.18% respectively.
(2) The performance of TopJudge (the current state-of-the-art model) on the two datasets is worse than that of LSTM and BiLSTM. Based on this result, we suspect that joint learning of TopJudge's three subtasks causes more error propagation, and that term-of-penalty prediction is greatly affected by external factors.

Ablation Experiment
We compare the LEMM model with some variant models on the filtered dataset. The experimental results are shown in Table 3. -R denotes removing the law article relation extraction module. The fact-art variant uses the entire word sequence of the law article instead of the labeled elements. The ablation experiment shows that the law article relationship contributes significantly to the improvement of accuracy and recall. Nevertheless, the precision of the model drops slightly when the law article relationship is included; there might be some noise introduced in extracting the relationships, which affects the precision of the model.
Performance drops sharply without the manually labeled law article elements. This verifies that the labeled law article information is useful for extracting fact features.

Conclusion
We propose a model that predicts relevant law articles on multi-label samples by simulating the human judging process. Our proposed LEMM model uses elements of manually labeled law articles to generate multiple representations of a fact. It uses self-attention to capture dependencies between law articles and builds a unique representation for each candidate label for prediction. The experiments verify that the element-aware multi-representation better extracts features of the factual information and that the dependencies between law articles are beneficial to the law article prediction task. The model achieves state-of-the-art performance on the benchmark datasets. It also narrows the gap between experimental and practical applications on multi-label samples.