Joint Entity and Relation Extraction for Legal Documents with Legal Feature Enhancement

In recent years, the wealth of information contained in Chinese legal documents has attracted considerable attention following the large-scale release of judgment documents on China Judgments Online. There is a pressing need to enable machines to understand the semantic information in these documents, which are written in natural language. Information extraction provides a way to mine the valuable information implied in unstructured judgment documents. We propose a Legal Triplet Extraction System for drug-related criminal judgment documents. The system extracts entities and semantic relations jointly and benefits from the proposed legal lexicon feature and a multi-task learning framework. Furthermore, we manually annotate a dataset for Named Entity Recognition and Relation Extraction in the Chinese legal domain, which supports training supervised triplet extraction models and evaluating model performance. Our experimental results show that introducing the legal feature and the multi-task learning framework is feasible and effective for the Legal Triplet Extraction System. The F1 score of triplet extraction reaches 0.836 on the legal dataset.


Introduction
Automatic extraction of information from legal documents is crucial for legal document analysis and related business processing. Information extraction techniques are pivotal modules for downstream judicial applications such as assisted review of case documents, identification of criminal case facts and assisted generation of legal documents. In addition, through named entity recognition and relation extraction, judgment documents can be transformed into triplets that capture entity pairs and their interrelations within the fact description. The structured legal triplets contribute to legal knowledge graph construction, which benefits query capabilities and interpretability in judicial applications. A large number of real judgment documents have been released on China Judgments Online. The abundant information contained in these judgment documents is worthy of in-depth study thanks to the authenticity and typicality of the documents. In this work, we conduct information extraction based on these public legal texts.
Recently, information extraction techniques have developed rapidly. Neural networks for named entity recognition (Lample et al., 2016; Zhu and Wang, 2019) and deep learning methods for relation extraction (Nguyen and Grishman, 2015; Zhou et al., 2016) have been extensively investigated. Early methods perform triplet extraction as a pipeline. They first recognize the entities in the texts and then classify each entity pair into predefined relation types. These methods suffer from error propagation caused by the incorrect and redundant entities produced by the first step. They also neglect the interactions between the prediction of entities and relations. To address these challenges, joint learning methods have been intensively studied.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
Initially, joint models based on neural networks (Miwa and Bansal, 2016; Zheng et al., 2017a) apply a parameter sharing mechanism to extract entities and relations in a single model. However, these models essentially treat entity recognition and relation extraction as two separate steps and make their predictions in a pipeline. The novel tagging scheme (Zheng et al., 2017b) converts the joint extraction task into a sequence tagging problem and decodes the entities and relations together. However, it cannot handle overlapping triplets since it assigns each token only one tag. Zeng et al. (2018) propose a joint extraction method based on the Sequence-to-Sequence (Seq2Seq) model with a copy mechanism, which handles the overlapping problem. More recent Seq2Seq-based methods (Nayak and Ng, 2019; Zeng et al., 2020) improve both the accuracy and the computational efficiency of joint entity and relation extraction.
In this work, we concentrate on triplet extraction in the Chinese legal domain, especially on drug-related criminal cases. There are twelve types of drug-related crimes, and three of them account for the most cases, i.e. drug trafficking, illegal possession of drugs and providing venues for drug users. Taking the representation of the facts of drug-related cases into account, we define four relation types, i.e. traffic in, sell drug to, possess and provide shelter for, which cover these three most frequent drug-related crime types according to the Criminal Law of the People's Republic of China. Specifically, traffic in denotes the fact that the suspect deals in drugs, whereas sell drug to denotes that the suspect conducts a drug trade with someone else. These two relations generally share the same head entity in one case, which gives rise to overlapping triplets.
Considering both performance and the need to handle overlapping triplets, we propose a Legal Triplet Extraction System based on the Seq2Seq model to accomplish joint entity and relation extraction from drug-related criminal judgment documents. The system consists of three main components: an encoder, a decoder and a sequence tagging layer. Concretely, the encoder converts the source sentences into semantic vectors with legal feature enhancement. We explore the pre-trained language model BERT for the encoder in light of its remarkable performance on multiple NLP tasks. The decoder is an efficient network able to solve the overlapping problem, inspired by previous work (Nayak and Ng, 2019). The sequence tagging layer serves as an auxiliary task and assists the encoder in learning the entity boundary information. Our main contributions can be summarized as follows:
(i) We propose a Legal Triplet Extraction System based on the Seq2Seq framework, which can jointly extract the entities and the interrelationships from legal texts.
(ii) We focus on the triplet extraction on the drug-related criminal cases, and thus introduce the lexicon of drug names as the legal feature into the model. Furthermore, an auxiliary sequence tagging task is utilized to strengthen the entity recognition ability of the model.
(iii) We manually annotate a dataset for Named Entity Recognition and Relation Extraction in Chinese legal domain, based on the drug-related criminal judgment documents. The dataset is valuable for training supervised triplet extraction models and evaluating the model performance.
Related Work

Joint Entity and Relation Extraction
Information extraction techniques are crucial for many downstream natural language tasks such as knowledge graph construction and question answering. Early works regard the triplet extraction task as two separate subtasks, i.e. named entity recognition (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Zhu and Wang, 2019) and relation classification (Zeng et al., 2014; Nguyen and Grishman, 2015; Zhou et al., 2016; Wu and He, 2019). They use pipeline methods to extract the triplets, where all entities in the texts are recognized first and then combined into entity pairs for relation prediction. The pipeline methods suffer from error propagation and ignore the interactions between the prediction of entities and relations.
To tackle the aforementioned challenges, joint models for triplet extraction have been well studied. Initial methods for joint extraction (Hoffmann et al., 2011; Li and Ji, 2014; Miwa and Sasaki, 2014; Ren et al., 2017) are feature-based, which rely on heavy feature engineering and require considerable manual effort and domain knowledge. With the development of deep neural networks, models that automatically learn to extract features have emerged. Some neural-network-based joint models achieve the interaction between entity recognition and relation classification by sharing parameters (Miwa and Bansal, 2016; Zheng et al., 2017a). The tagging methods use novel tagging schemes to convert the triplet extraction task into a sequence tagging problem and jointly extract the entities and relations (Zheng et al., 2017b). The Seq2Seq model (Sutskever et al., 2014) is another method for joint extraction. Zeng et al. (2018) propose CopyRE, a Seq2Seq-based model with a copy mechanism, which can solve the overlapping problem. Other works based on the Seq2Seq model (Nayak and Ng, 2019; Zeng et al., 2020) optimize the decoder of CopyRE and improve both the accuracy and the computational efficiency of triplet extraction. In addition, Takanobu et al. (2019) apply a hierarchical reinforcement learning framework to deal with overlapping triplets in joint extraction. Fu et al. (2019) and Sun et al. (2019) take advantage of the graph structure and use graph neural networks to jointly learn the entities and relations.

Pre-trained Language Model
Pre-trained language models have a long history and a rich literature. They capture the meaning of words dynamically by considering their context. Conneau et al. (2017) show the effectiveness of universal sentence representations trained with supervision on the Stanford Natural Language Inference datasets. Some approaches use learned representations as features in a model for the downstream task. At different granularities, there are word embedding methods (Mikolov et al., 2013) as well as sentence embedding methods (Logeswaran and Lee, 2018). ELMo (Peters et al., 2018) and its predecessor generate context-sensitive word representations through stacked bidirectional LSTMs and a residual structure. Among unsupervised fine-tuning approaches, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) improves the state-of-the-art results on many natural language processing tasks. On this basis, BERT with Whole Word Masking (BERT-wwm) (Cui et al., 2019) is proposed for Chinese NLP tasks. XLNet is a generalized autoregressive pre-training method. RoBERTa improves the training procedure of BERT and achieves state-of-the-art results on various tasks with substantial improvement.
The pre-trained language model BERT benefits many NLP tasks thanks to its abundant prior knowledge. However, it captures general language information from a large-scale corpus during training, and therefore lacks task-specific and domain-specific knowledge. Several studies aim to integrate knowledge into the BERT model. SG-Net incorporates explicit syntactic constraints into the attention mechanism in order to guide text modeling with syntax. Works on embedding knowledge bases into pre-trained language models (Liu et al., 2020; Peters et al., 2019) contribute to introducing domain knowledge.

Legal Triplet Extraction System
In this section, we describe our method for extracting the triplets that occur in the fact descriptions of judgment documents. We start by introducing the triplet extraction task in the Chinese legal domain and present a manually annotated dataset for legal triplet extraction. We then introduce each component of the proposed Legal Triplet Extraction System and the training procedure in detail. The architecture of the triplet extraction system is shown in Figure 1.

Task Description and Dataset Construction
In this research, we concentrate on triplet extraction in the Chinese legal domain, especially on drug-related criminal cases. Given a fact description sentence $S$ from the judgment documents, the target of the legal triplet extraction system is to identify the triplets in the form of $\langle e_1, r, e_2 \rangle$, where $e_1$ and $e_2$ are entities in $S$, and $r$ is the relationship between them.
There are twelve types of drug-related crimes. We focus on the three types with the most cases, i.e. drug trafficking, illegal possession of drugs and providing venues for drug users. In practice, it happens that one case involves more than two crimes. The relevant crimes need to be distinguished and reorganized for the measurement of penalty. In order to describe the key criminal behaviors covering the three drug-related crime types, we summarize four relation types according to the criminal law, i.e. traffic in, sell drug to, possess and provide shelter for. Concretely, traffic in and sell drug to represent the relationships in drug trafficking cases. The former denotes the fact that the suspect deals in drugs and the latter means that the suspect conducts a drug trade with someone else. The relation type possess denotes that the suspect holds a certain amount of narcotic drugs, and provide shelter for describes the fact that the suspect provides shelter for others to ingest or inject drugs. These two relations cover the crime types of illegal possession of drugs and providing venues for drug users, respectively.
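As an illustration of the task output, the following minimal sketch represents a triplet as a small data structure and detects head entities shared by several triplets. The class name `Triplet`, the helper `shared_heads`, the underscore spellings of the relation names and the example fact are all hypothetical and are not taken from the annotated dataset.

```python
from dataclasses import dataclass

# The four relation types named in the text; the underscore spellings are our own.
RELATION_TYPES = ("traffic_in", "sell_drug_to", "possess", "provide_shelter_for")


@dataclass(frozen=True)
class Triplet:
    head: str      # e_1, e.g. the suspect
    relation: str  # r, one of RELATION_TYPES
    tail: str      # e_2, e.g. a drug or another person

    def __post_init__(self):
        if self.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation}")


def shared_heads(triplets):
    """Return the head entities that appear in more than one triplet."""
    counts = {}
    for t in triplets:
        counts[t.head] = counts.get(t.head, 0) + 1
    return {head for head, n in counts.items() if n > 1}


# Hypothetical fact: "Zhang sold methamphetamine to Li."
example = [
    Triplet("Zhang", "traffic_in", "methamphetamine"),
    Triplet("Zhang", "sell_drug_to", "Li"),
]
overlapping = shared_heads(example)
```

In this hypothetical example both triplets share the head entity "Zhang", which is the situation in which a one-tag-per-token scheme cannot represent both relations.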
To enable legal information extraction, we manually annotate the entities and their relations in drug-related criminal judgment documents downloaded from China Judgments Online. The fact descriptions are first extracted from the raw documents with rules. We select 1750 fact descriptions as the instances to be annotated, and annotate the relations between all entity pairs in every instance. The annotated data supports supervised training and performance evaluation.

Encoder with Legal Feature Enhancement
The encoder converts the source sentences into semantic vectors. We explore RoBERTa to encode the contextual information. The architecture of the pre-trained model is the same as BERT (Devlin et al., 2018), which is an $L$-layer bidirectional Transformer encoder (Vaswani et al., 2017). For $\text{RoBERTa}_{\text{BASE}}$, $L$ is 12, and for $\text{RoBERTa}_{\text{LARGE}}$, $L$ is 24. In this work, we use $\text{RoBERTa}_{\text{BASE}}$ for the encoder. The hidden state vectors of its last layer are used as the general representation of each token in the input sentence $S$, denoted as $H^L$.
The encoding of the pre-trained model tends to capture a general text representation but lacks domain knowledge. To make up for the lack of legal domain information, we add legal feature enhancement to the encoder. The accuracy of the recognized entities is crucial for the performance of triplet extraction, so a lexicon feature for the entities is worth exploring. For triplet extraction on Chinese drug-related judgment documents, we first build a drug name lexicon $\mathrm{Lexicon}_{Drug}$ as the legal feature. We collect the scientific names of all kinds of drugs and the common expressions for them recorded in the judgment documents. The drug name lexicon includes the possible expressions of all existing drugs, both in written and spoken language.
Given an input sequence $S = (w_1, w_2, \ldots, w_N)$, where $w_i$ denotes the $i$-th token and $N$ denotes the length of the sequence, we match it against the drug name lexicon and find all subsequences that may form drug expressions. We define $S[i:j]$ as the subsequence of $S$ that begins with token $w_i$ and ends with token $w_j$. We use a mask matrix $M_D$ to represent the legal feature of drug names. $M_D$ is an $N \times N$ matrix whose element $m_{ij}$ at the $i$-th row and the $j$-th column denotes whether the subsequence $S[i:j]$ is a drug expression:

$$m_{ij} = \begin{cases} 1, & \text{if } S[i:j] \in \mathrm{Lexicon}_{Drug} \\ 0, & \text{otherwise} \end{cases}$$

We calculate the legal-domain-specific representation of the input sentence with an extra Transformer encoder layer (Vaswani et al., 2017). The layer has two sub-layers, i.e. a multi-head self-attention mechanism and a position-wise feed-forward network. Each sub-layer is followed by a residual connection and layer normalization. The general representation of the input sentence $H^L$ is first projected into distinct query, key and value representations for each attention head $h$, denoted as $Q_h$, $K_h$ and $V_h$ respectively. The output of the self-attention function, denoted as $Att^D_h$, is computed with the legal feature mask matrix $M_D$:

$$Att^D_h = \mathrm{softmax}\left(\frac{M_D * (Q_h K_h^{\top})}{\sqrt{d_k}}\right) V_h$$

where the operator '*' denotes the mask operation in the attention computation and $d_k$ is the dimension of the key vectors. We concatenate the outputs of all the attention heads and pass the result through the feed-forward sub-layer. The final output of the feature-masked Transformer encoder, which integrates the drug lexicon feature, is denoted as $H^D$. Finally, we take a weighted average of the general representation $H^L$ and the feature-fused representation $H^D$ to obtain the legal feature enhanced representation $H_{Encoder}$:

$$H_{Encoder} = \gamma H^L + (1 - \gamma) H^D$$

where $\gamma$ is the weighting parameter. In this work, we employ $\gamma = 0.5$.
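The lexicon matching and the final weighted average described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `build_drug_mask` and `fuse` and the two-entry lexicon are hypothetical, tokens are taken to be single characters, indices are 0-based (the text counts from 1), and assigning γ to the general representation rather than to the feature-fused one is our own choice (with γ = 0.5 the two choices coincide).

```python
def build_drug_mask(tokens, lexicon):
    """Return the N x N matrix M_D with m[i][j] = 1 iff tokens[i..j] is in the lexicon.

    Indices are 0-based here, whereas the text counts tokens from 1.
    """
    n = len(tokens)
    max_len = max((len(name) for name in lexicon), default=0)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        # Spans with more tokens than the longest lexicon entry cannot match.
        for j in range(i, min(n, i + max_len)):
            if "".join(tokens[i:j + 1]) in lexicon:
                mask[i][j] = 1
    return mask


def fuse(h_general, h_feature, gamma=0.5):
    """Weighted average of two equally shaped representations (lists of rows)."""
    return [
        [gamma * a + (1.0 - gamma) * b for a, b in zip(row_g, row_f)]
        for row_g, row_f in zip(h_general, h_feature)
    ]


# Hypothetical two-entry lexicon and a character-level sentence.
lexicon = {"冰毒", "海洛因"}
tokens = list("贩卖冰毒")
mask = build_drug_mask(tokens, lexicon)
fused = fuse([[1.0, 2.0]], [[3.0, 4.0]])
```

Here the only non-zero entry of `mask` marks the span covering the last two characters, which spell a drug name from the hypothetical lexicon.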

Decoder
The decoder aims to predict the entity pairs and their interrelationships in the input sentence. A Long Short-Term Memory (LSTM) network is used to decode the triplet sequence $T$. We represent each entity by the indexes of its first and last tokens, so the entities of a triplet can be extracted from the original text by locating them. Moreover, we obtain the relation type of an entity pair with a relation classifier. Given the legal feature enhanced representation $H_{Encoder}$ from the encoder, the triplet sequence $T = (t_0, t_1, t_2, \ldots, t_M)$ is decoded, where $t_k$ denotes the $k$-th triplet in the sequence and $M$ denotes the length of the triplet sequence $T$. Since $t_0$ is the beginning of the decoded sequence, it has no practical meaning and is set to a zero vector. $t_k$ ($k > 0$) represents a triplet $\langle e_1, r, e_2 \rangle$, consisting of the starting and ending indexes of $e_1$ and $e_2$ and the relation type between them. The decoder keeps operating until the relation type of the current triplet is 'NA' or the sequence length reaches the default maximum. For each time step $k$, we define the hidden state vector of the decoder as $h^{Decoder}_k$ and the representation of the triplets decoded before step $k$ as $t_{pr}$, which is computed as the sum of the previous triplets, $t_{pr} = \sum_{j=0}^{k-1} t_j$.
We first calculate the encoder-decoder attention vector $a_k$ from the encoder representation $H_{Encoder}$, the last hidden state vector $h^{Decoder}_{k-1}$ of the decoder and $t_{pr}$ presented earlier. Then the hidden state vector of the decoder at time step $k$ is computed with an LSTM cell. The input of the cell is the concatenation of $t_{pr}$ and $a_k$, which contains the information of the decoded triplets as well as the encoder representation. This process is denoted as:

$$a_k = \mathrm{Attention}(H_{Encoder}, h^{Decoder}_{k-1}, t_{pr})$$
$$h^{Decoder}_k = \mathrm{LSTM}([t_{pr}; a_k], h^{Decoder}_{k-1})$$

Finally, we predict the indexes of the entity pair and the relation type that form a triplet based on $H_{Encoder}$ and $h^{Decoder}_k$. The hidden state vector $h^{Decoder}_k$ is repeated along the input sequence length $N$ to obtain the matrix $H^{Decoder}_k$. The two representations from the encoder and the decoder are concatenated and passed through a bi-directional LSTM layer. The probabilities of each token in the input sequence being the beginning and the end of the entity $e_1$ are computed as:

$$H^1_k = \mathrm{BiLSTM}([H_{Encoder}; H^{Decoder}_k])$$
$$p_{b1} = \mathrm{softmax}(W_{b1} H^1_k), \quad p_{e1} = \mathrm{softmax}(W_{e1} H^1_k)$$

where $p_{b1}$ and $p_{e1}$ denote the probabilities of each token being the beginning and the end of $e_1$. The indexes of the tail entity $e_2$ are calculated in a similar way, except for the input of the BiLSTM layer, whose output is denoted as $H^2_k$. $p_{b2}$ and $p_{e2}$ are the probabilities of each token being the beginning and the end of $e_2$.
$W_{b1}$, $W_{e1}$, $W_{b2}$ and $W_{e2}$ are trainable parameters of the index prediction layers. The embeddings of $e_1$ and $e_2$ in the current triplet, denoted as $e^1_k$ and $e^2_k$, are computed from the hidden vectors $h^1_i$ and $h^2_i$, where $h^1_i$ and $h^2_i$ are the $i$-th hidden vectors in $H^1_k$ and $H^2_k$. The conditional probability of the relation type between $e_1$ and $e_2$ is predicted by a relation classifier. Moreover, the representation $t_k$ of the current triplet is computed with $r_k$, the embedding of the relation type between $e_1$ and $e_2$ at time step $k$, and $W_r$, a trainable parameter matrix.
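The control flow of the decoder can be sketched as a loop that stops at 'NA' or at the maximum number of triplets and keeps a running sum of the decoded triplet vectors. The sketch below makes several assumptions that are not stated in the text: the callable `step` is a hypothetical stand-in for the attention, the LSTM cell, the index prediction layers and the relation classifier; and `best_span` picks the span by maximising the product of the begin and end probabilities under the constraint begin ≤ end, which is one common way to read off a span from two distributions.

```python
def best_span(p_begin, p_end):
    """Pick (begin, end) with begin <= end maximising p_begin[b] * p_end[e]."""
    best, best_score = (0, 0), -1.0
    for b, pb in enumerate(p_begin):
        for e in range(b, len(p_end)):
            score = pb * p_end[e]
            if score > best_score:
                best, best_score = (b, e), score
    return best


def decode(step, dim, max_triplets=10):
    """Decode triplets until the relation is 'NA' or max_triplets is reached.

    `step` is a hypothetical callable: it receives t_pr, the sum of the triplet
    vectors decoded so far, and returns
    (p_b1, p_e1, p_b2, p_e2, relation, triplet_vector).
    """
    triplets = []
    t_pr = [0.0] * dim  # t_0 is a zero vector
    for _ in range(max_triplets):
        p_b1, p_e1, p_b2, p_e2, relation, t_k = step(t_pr)
        if relation == "NA":
            break
        triplets.append((best_span(p_b1, p_e1), relation, best_span(p_b2, p_e2)))
        # t_pr accumulates the element-wise sum of all decoded triplet vectors.
        t_pr = [a + b for a, b in zip(t_pr, t_k)]
    return triplets


# Scripted outputs standing in for a trained model (purely illustrative numbers).
script = [
    ([0.9, 0.1, 0.0], [0.8, 0.1, 0.1], [0.0, 0.2, 0.8], [0.0, 0.1, 0.9], "traffic_in", [1.0, 2.0]),
    ([0.9, 0.1, 0.0], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], "sell_drug_to", [0.5, 0.5]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0], "NA", [0.0, 0.0]),
]
seen = []


def scripted_step(t_pr):
    seen.append(list(t_pr))
    return script[len(seen) - 1]


decoded = decode(scripted_step, dim=2)
```

Because each step emits a complete triplet, the same head span can be emitted in several steps, which is how this style of decoder accommodates overlapping triplets.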

Sequence Tagging Layer
The recognition of entity spans is crucial for triplet extraction. It not only determines the accuracy of the entities in a triplet, but also partly influences the relation prediction of the triplet because of the decoding process. We use a sequence tagging layer to conduct entity span recognition. This auxiliary task introduces entity boundary information into the model. The input of the sequence tagging layer is the legal feature enhanced representation $H_{Encoder}$ from the encoder. We use a multi-label classifier to predict the entity span tag for each token in the input sequence, and the probability of the tag sequence $X$ is computed from this representation. We adopt the BIO tagging scheme to distinguish entity boundaries. Concretely, the tag 'B' denotes the beginning token of an entity, the tag 'I' denotes a token in a multi-token entity other than the first token, and 'O' means that the token does not belong to any entity.
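The BIO scheme described above can be illustrated with a short helper that turns entity spans into a tag sequence. The function name `spans_to_bio` and the use of inclusive, 0-based (begin, end) index pairs are assumptions made for this sketch.

```python
def spans_to_bio(n_tokens, spans):
    """Convert inclusive (begin, end) entity spans into a BIO tag sequence."""
    tags = ["O"] * n_tokens
    for begin, end in spans:
        if not 0 <= begin <= end < n_tokens:
            raise ValueError(f"span out of range: {(begin, end)}")
        tags[begin] = "B"  # first token of the entity
        for i in range(begin + 1, end + 1):
            tags[i] = "I"  # remaining tokens of a multi-token entity
    return tags


# Two hypothetical entities in a six-token sentence.
tags = spans_to_bio(6, [(0, 1), (4, 5)])
```

Since only boundaries are tagged here, this corresponds to a scheme without entity types.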

Training Details
In the training process, we use the ground-truth labels to obtain the relation embeddings. We minimize the negative log-likelihood loss for the prediction of the entity indexes and the relation types. Given all training examples $\{(S_i, T_i)\}_{i=1}^{H}$, the loss function of the decoder is denoted as $L_{Dec}$, where $H$ is the number of training examples, $M$ is the length of the decoded triplet sequence, and $\alpha$ is a weighting parameter.
The sequence tagging layer takes part in the computation only during training, where it assists in learning the entity boundary information. Given all training examples $\{(S_i, X_i)\}_{i=1}^{H}$, the loss function $L_{Tag}$ is the sentence-level negative log-likelihood of the tag sequences. The final loss of the Legal Triplet Extraction System is defined as the weighted summation of $L_{Dec}$ and $L_{Tag}$, where $\beta$ denotes the weighting parameter:

$$L = L_{Dec} + \beta L_{Tag}$$

Experiments and Results

Dataset and Experimental Settings
We use the legal dataset described in the Task Description and Dataset Construction section to evaluate our proposed Triplet Extraction System. The dataset consists of 1750 fact descriptions from drug-related criminal judgment documents downloaded from China Judgments Online. We split the dataset at a ratio of 4:1 to obtain the training set and the test set.
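A 4:1 split of 1750 instances yields 1400 training and 350 test instances. The sketch below shows one way to produce such a split; the function name `split_dataset`, the random shuffling and the fixed seed are assumptions, since the text does not state how the instances were assigned to the two sets.

```python
import random


def split_dataset(instances, train_parts=4, test_parts=1, seed=0):
    """Shuffle the instances and split them at the ratio train_parts:test_parts."""
    items = list(instances)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = len(items) * train_parts // (train_parts + test_parts)
    return items[:n_train], items[n_train:]


# Integer ids stand in for the 1750 annotated fact descriptions.
train_set, test_set = split_dataset(range(1750))
```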
We use the pre-trained language model RoBERTa-wwm-ext, Chinese (Cui et al., 2019) for the encoder. The length of the input sequence $N$ is set to 512 and the length of the triplet sequence $M$ is 10. The dimensions of the encoder representation and the hidden vector of the decoder are both 768. The weighting parameters $\gamma$, $\alpha$ and $\beta$ are set to 0.5, 1 and 1 respectively.
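For reference, the settings above can be gathered in one place together with the weighted summation of the two training losses. This is a minimal sketch: the class name `Settings`, its field names and the function `total_loss` are hypothetical, and applying β to the tagging loss rather than to the decoder loss is an assumption (with β = 1 both readings give the same value).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    max_input_len: int = 512  # N, length of the input sequence
    max_triplets: int = 10    # M, length of the triplet sequence
    hidden_dim: int = 768     # encoder representation and decoder hidden size
    gamma: float = 0.5        # weight in the encoder representation average
    alpha: float = 1.0        # weight inside the decoder loss
    beta: float = 1.0         # weight between decoder and tagging losses


def total_loss(loss_dec, loss_tag, settings):
    """Weighted summation of the decoder loss and the tagging loss (assumed form)."""
    return loss_dec + settings.beta * loss_tag


settings = Settings()
# Illustrative loss values only, not measured results.
combined = total_loss(1.5, 0.5, settings)
```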

Experimental Results and Analysis
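The Precision, Recall and $F_1$-score of the extracted triplets are used as evaluation metrics. The evaluation metrics are defined as follows:

$$\mathrm{Precision} = \frac{\mathit{correct\_num}}{\mathit{predict\_num}}, \quad \mathrm{Recall} = \frac{\mathit{correct\_num}}{\mathit{true\_num}}, \quad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where $\mathit{correct\_num}$, $\mathit{predict\_num}$ and $\mathit{true\_num}$ are the number of triplets extracted correctly, the number of triplets extracted by the system and the number of true triplets. A triplet is regarded as correctly extracted only if the beginning and end of both entities and the relation of the triplet are all correct.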
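The exact-match criterion can be sketched by treating each triplet as a hashable tuple of (head span, relation, tail span) and counting set intersections. The function name `triplet_prf`, the tuple layout and the example spans are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def triplet_prf(predicted, gold):
    """Precision, recall and F1 under exact match of spans and relation."""
    predicted, gold = set(predicted), set(gold)
    correct_num = len(predicted & gold)
    precision = correct_num / len(predicted) if predicted else 0.0
    recall = correct_num / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spans: the second prediction has a wrong tail end index,
# so it does not count as correct.
gold = [((0, 1), "traffic_in", (5, 6)), ((0, 1), "sell_drug_to", (8, 9))]
pred = [((0, 1), "traffic_in", (5, 6)), ((0, 1), "sell_drug_to", (8, 10))]
scores = triplet_prf(pred, gold)
```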
We compare the performance of the Legal Triplet Extraction System with PNDec, the state-of-the-art method for joint entity and relation extraction proposed by Nayak and Ng (2019). The main results on the legal dataset are shown in Table 1.
In Table 1, PNDec denotes the baseline method proposed by Nayak and Ng (2019), which implements joint entity and relation extraction with a bi-directional LSTM encoder and a pointer network-based decoder. +BERT Enc denotes the model whose encoder is replaced by the pre-trained language model, namely RoBERTa-wwm-ext, Chinese. On this basis, +STL is the model with the added Sequence Tagging Layer, which enables the encoder to learn entity boundary information better. Our model further enhances the encoder with the legal lexicon feature. The results show the benefit of the abundant prior knowledge of BERT, with an increase of 8.1% in $F_1$-score. Furthermore, they indicate that the auxiliary sequence tagging task helps extract the triplets, since there are improvements of 1.4%, 1.5% and 1.4% in Precision, Recall and $F_1$-score, respectively. The proposed legal feature enhancement on the encoder further advances joint extraction in the Chinese legal domain, with an improvement of 1% in Precision.

Effectiveness of the Auxiliary Sequence Tagging Task: To further investigate the importance of the auxiliary sequence tagging task, we evaluate the effect of adding a sequence tagging layer to the joint model. We run the experiment on both the model with the BERT encoder and the model with the BiLSTM encoder. For the model with the BERT encoder, we use a multi-label classifier to tag the sequence, whereas a Conditional Random Fields (CRF) decoder (Lafferty et al., 2001) is used for sequence tagging in the model with the BiLSTM encoder. The results on the legal dataset are shown in Table 2. Two sequence tagging schemes are used in the experiment. +STL with type denotes that the sequence tags consist of entity boundary and entity type, while +STL w/o type denotes that the tags consist of entity boundary only.

The results in Table 2 suggest that the auxiliary sequence tagging task helps the model learn the features of entities. Adding the sequence tagging task improves the models with both encoders. For the model with the BERT encoder, the tagging scheme with entity type increases the F1-score by 0.8%, and the scheme without entity type improves it by 1.4% compared with the single model. For the model with the BiLSTM encoder, the tagging scheme with entity type is more effective, with an increase of 1.6%.
Effectiveness of the Legal Lexicon Feature: To evaluate the effectiveness of the legal feature enhancement on the encoder, we compare triplet extraction models without and with legal lexicon feature enhancement. We experiment with models that include a sequence tagging layer, based on both the BiLSTM encoder and the BERT encoder. The results on the legal dataset are summarized in Table 3.
The experimental results show that the legal feature improves the $F_1$-score by 0.8% on the BiLSTM-based model and by 0.5% on the BERT-based model. This illustrates that the proposed legal feature enhancement benefits legal triplet extraction, especially in Precision. The legal feature is built on the specifics of the legal domain and of the information extraction task. In Chinese drug-related judgment documents, the names of drugs are a representative feature that assists in the recognition of drug entities. Consequently, integrating the lexicon feature of drug names improves the extraction precision of the model.


Comparison with Pipelining Method
We compare the performance of our joint learning-based method with the traditional pipelining method. We conduct two pipeline-based experiments on the legal dataset to demonstrate the efficiency and the performance of our Legal Triplet Extraction System. The results are shown in Table 4.
Pipelining in Table 4 denotes the method that simply combines the two steps of triplet extraction, i.e. entity recognition and relation classification, without any training constraints. On this basis, Pipelining-Rules denotes the method that takes the redundant negative entity pairs into account and trains the relation extraction model with negative sampling. These two methods use BERT-Base, Chinese for entity recognition and RoBERTa-wwm-ext, Chinese for relation extraction. Between the two steps, the entity pairs are filtered by pre-defined rules so as to reduce the number of negative entity pairs. In contrast, our model is based on joint learning and does not need any entity filter rules or training constraints. There is an absolute increase in Precision, and the F1-score improves by 3.5% compared with the pipelining method with rules. We choose the joint model without the legal lexicon feature for the sake of fairness.

Conclusion
In this paper, we introduce a triplet extraction system that extracts triplets from unstructured criminal judgment documents. We explore a pre-trained language model for the system. The system extracts the entities and the semantic relations jointly with the assistance of legal feature enhancement. In addition, we manually annotate a dataset for information extraction in the Chinese legal domain, in order to train supervised models and evaluate model performance. Experiments show that the adoption of legal feature enhancement and the multi-task learning framework improves the performance of legal triplet extraction. For future work, we will explore more effective legal features for the legal triplet extraction system. Information extraction on other crimes will be carried out as well.