Obligation and Prohibition Extraction Using Hierarchical RNNs

We consider the task of detecting contractual obligations and prohibitions. We show that a self-attention mechanism improves the performance of a BILSTM classifier, the previous state of the art for this task, by allowing it to focus on indicative tokens. We also introduce a hierarchical BILSTM, which converts each sentence to an embedding, and processes the sentence embeddings to classify each sentence. Apart from being faster to train, the hierarchical BILSTM outperforms the flat one, even when the latter considers surrounding sentences, because the hierarchical model has a broader discourse view.


Introduction
Legal text processing (Ashley, 2017) is a growing research area, comprising tasks such as legal question answering (Kim and Goebel, 2017), contract element extraction (Chalkidis et al., 2017), and legal text generation (Alschner and Skougarevskiy, 2017). We consider obligation and prohibition extraction from contracts, i.e., detecting sentences (or clauses) that specify what should or should not happen (Table 1). This task is important for legal firms and legal departments, especially when they process large numbers of contracts to monitor the compliance of each party. Methods that automatically identify (e.g., highlight) sentences (or clauses) specifying obligations and prohibitions would allow lawyers and paralegals to inspect contracts more quickly. They would also be a step towards populating databases with information extracted from contracts, alongside methods that extract contractors, particular dates (e.g., start and end dates), applicable law, legislation references, etc. (Chalkidis and Androutsopoulos, 2017).
Figure 1: Heatmap visualizing the attention scores of BILSTM-ATT for some examples of Table 1.
Obligation and prohibition extraction is a kind of deontic sentence (or clause) classification (O'Neill et al., 2017). Different firms may use different or finer deontic classes (e.g., distinguishing between payment and delivery obligations), but obligations and prohibitions are the most common coarse deontic classes. Using similar classes, O'Neill et al. (2017) reported that a bidirectional LSTM (BILSTM) classifier (Graves et al., 2013) outperformed several others (including logistic regression, SVM, AdaBoost, and Random Forests) in legal sentence classification, possibly because long-term dependencies (e.g., modal verbs or negations interacting with distant dependents) are common and crucial in legal texts, and LSTMs cope with long-term dependencies better than methods relying on fixed-size context windows.
We improve upon the work of O'Neill et al. (2017) in four ways. First, we show that self-attention (Yang et al., 2016) improves the performance of the BILSTM classifier, by allowing the system to focus on indicative words (Fig. 1), which fits the target task, where nested clauses are frequent.

Data
We experimented with a dataset containing 6,385 training, 1,595 development, and 1,420 test sections (articles) from the main bodies (excluding introductions, covers, recitals) of 100 randomly selected English service agreements. The sections were preprocessed by a sentence splitter, which in clause lists (Examples 4-6 in Table 1) treats the introductory clause and each nested clause as separate sentences, since each nested clause may belong in a different class. The splitter produced 31,545 training, 8,036 development, and 5,563 test sentences/clauses. Table 2 shows their distribution in the six gold (correct) classes. Each section was annotated by a single law student (5 students in total). All the annotations were checked and corrected by a single paralegal expert, who produces annotations of this kind on a daily basis, based on strict guidelines of the firm that provided the data.
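As a rough illustration of the clause-list preprocessing described above, the following sketch splits a clause list into its introductory clause and nested clauses. The actual splitter is not described in detail in this paper; the enumerator pattern and the example text here are assumptions for illustration only.

```python
import re

def split_clause_list(section_text):
    """Split a clause list into the introductory clause and nested clauses.

    A rough approximation of the splitter described above: nested clauses
    are assumed to start with enumerators like '(a)', '(b)', ..., and the
    introductory clause is everything before the first enumerator.
    """
    # Split at positions just before enumerators such as '(a) ', '(b) '.
    parts = re.split(r'(?=\([a-z]\)\s)', section_text)
    return [p.strip() for p in parts if p.strip()]

text = ("The Supplier shall not: (a) disclose any Confidential Information; "
        "(b) assign this Agreement.")
clauses = split_clause_list(text)
```

Each resulting string would then be classified separately, since, as noted above, nested clauses of the same list may belong to different classes.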
We used pre-trained 200-dimensional word embeddings and pre-trained 25-dimensional POS tag embeddings, obtained by applying WORD2VEC (Mikolov et al., 2013) to approx. 750k and 50k English contracts, respectively, as in our previous work (Chalkidis et al., 2017). We also pre-trained 5-dimensional token shape embeddings (e.g., all capitals, first letter capital, all digits), obtained as in our previous work (Chalkidis and Androutsopoulos, 2017). Each token is represented by the concatenation of its word, POS, and shape embeddings (Fig. 2, bottom). Unknown tokens are mapped to pre-trained POS-specific 'unk' embeddings (e.g., 'unk-n', 'unk-vb'). The dataset of Table 2 has no overlap with the corpus of contracts that was used to pre-train the embeddings. When splitting the dataset into training, development, and test subsets, similar sections were first clustered and entire clusters were then assigned to a single subset, to avoid having similar sections (e.g., based on boilerplate clauses) in different subsets.
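The token representation just described can be sketched as follows. The dimensions and all vector values below are made up for illustration; the paper's embeddings are 200-, 25-, and 5-dimensional, giving 230-dimensional token vectors.

```python
# Toy sketch of the token representation: each token is the concatenation
# of its word, POS, and shape embeddings, with POS-specific 'unk'
# embeddings for unknown words. All vectors and dimensions are made up.
WORD_EMB = {"shall": [0.1, 0.2], "pay": [0.3, 0.1]}
POS_EMB = {"md": [0.5], "vb": [0.7], "n": [0.2]}
SHAPE_EMB = {"lower": [0.0], "allcaps": [1.0]}
UNK_EMB = {"unk-n": [0.9, 0.9], "unk-vb": [0.8, 0.8]}

def embed_token(word, pos, shape):
    # Fall back to the POS-specific 'unk' embedding for unknown words.
    word_vec = WORD_EMB.get(word, UNK_EMB.get("unk-" + pos, [0.0, 0.0]))
    return word_vec + POS_EMB[pos] + SHAPE_EMB[shape]

e = embed_token("shall", "md", "lower")      # known word
u = embed_token("indemnify", "vb", "lower")  # unknown word -> 'unk-vb'
```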
Methods

BILSTM: The first classifier we considered processes a single sentence (or clause) at a time. It feeds the concatenated word, POS, and shape embeddings (e_1, ..., e_n ∈ R^230) of the tokens w_1, w_2, ..., w_n of the sentence to a forward LSTM, and (in reverse order) to a backward LSTM, obtaining forward and backward hidden states. The concatenation of the last hidden states of the two LSTMs (∈ R^600) is fed to a multinomial Logistic Regression (LR) layer, which produces a probability per class.
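A minimal sketch of this flat classifier follows; for brevity, a toy gateless recurrent cell stands in for the LSTM cells, and all weights and dimensions are made up (this is not the paper's architecture, only its overall shape).

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=0.5):
    # One step of a toy element-wise recurrent cell
    # (a stand-in for an LSTM cell, not a real LSTM).
    return [math.tanh(w_h * hi + w_x * xi) for hi, xi in zip(h, x)]

def bi_rnn(token_embs, hidden=2):
    """Toy bidirectional RNN over token embeddings; returns the
    concatenation of the last forward and last backward states,
    mirroring the BILSTM sentence representation described above."""
    fwd = [0.0] * hidden
    for e in token_embs:
        fwd = rnn_step(fwd, e[:hidden])
    bwd = [0.0] * hidden
    for e in reversed(token_embs):
        bwd = rnn_step(bwd, e[:hidden])
    return fwd + bwd

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

# Toy 2-dimensional token embeddings for a 3-token sentence.
sent = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4]]
rep = bi_rnn(sent)  # concatenated last states, fed to the LR layer

# Toy multinomial LR layer with two classes (weights are made up).
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
probs = softmax([sum(w * r for w, r in zip(row, rep)) for row in weights])
```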
Figure 2: BILSTM with self-attention (ATT nodes), used on its own (BILSTM-ATT) or as the sentence encoder of the hierarchical BILSTM (H-BILSTM-ATT, Fig. 3). In X-BILSTM-ATT, the two LSTM chains also consider the words of surrounding sentences. The red dashed line is a drop-out layer.

BILSTM-ATT:
When self-attention is added (Fig. 2), the sentence (or clause) is represented by the weighted sum of the hidden states:

h = Σ_{t=1..n} a_t h_t    (1)

where a_t is the attention score of h_t. Again, h is then fed to a multinomial LR layer. Figure 1 visualizes the attention scores (a_1, ..., a_n) of BILSTM-ATT when reading some of the sentences (or clauses) of Table 1. The attention scores are higher for modals, negations, words that indicate obligations or prohibitions (e.g., 'obliged', 'only'), and tokens indicating nested clauses (e.g., '(a)', ':', ';'), which allows BILSTM-ATT to focus more on these tokens (the corresponding states) when computing the sentence representation (h).
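The weighted sum of Eq. 1 can be sketched as below. Scoring each state by a dot product with a learned context vector u is one common formulation of self-attention (Yang et al., 2016); the exact parameterization used in the paper may differ, and all values here are toy.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def attend(states, u):
    """Self-attention as a weighted sum of BiLSTM states (Eq. 1).

    The attention scores a_t are a softmax over dot products of each
    state h_t with a context vector u (an assumed parameterization).
    """
    scores = softmax([sum(ui * hi for ui, hi in zip(u, h)) for h in states])
    dim = len(states[0])
    h = [sum(a * s[i] for a, s in zip(scores, states)) for i in range(dim)]
    return h, scores

states = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.0]]  # toy BiLSTM hidden states
u = [1.0, 1.0]                                  # toy attention context vector
h, a = attend(states, u)
```

The scores a_1, ..., a_n are exactly what Figure 1 visualizes: higher weights on indicative tokens pull the sentence representation h towards their hidden states.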
X-BILSTM-ATT: In an extension of BILSTM-ATT, called X-BILSTM-ATT, the BILSTM chain is fed with the token embeddings (e_t) not only of the sentence being classified, but also of the previous (and following) tokens (faded parts of Fig. 2), up to 150 previous (and 150 following) tokens, 150 being the maximum sentence length in the dataset. This might allow the BILSTM chain to 'remember' key parts of the surrounding sentences (e.g., a previous clause ending with 'shall not:') when producing the context-aware embeddings (states h_t) of the current sentence. The self-attention mechanism still considers only the states (h_t) of the tokens of the current sentence, and the sentence representation (h) is still computed as in Eq. 1.
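The context construction for X-BILSTM-ATT can be sketched as follows. This is a simplification: the span bookkeeping used to restrict self-attention to the current sentence is an assumption about implementation details not spelled out above.

```python
def context_window(sentences, i, k=150):
    """Build the token sequence fed to X-BILSTM-ATT: the tokens of
    sentence i plus up to k preceding and k following tokens from
    neighboring sentences. Also returns the (start, end) span of the
    current sentence, so self-attention can be restricted to it."""
    before = [t for s in sentences[:i] for t in s][-k:]
    current = sentences[i]
    after = [t for s in sentences[i + 1:] for t in s][:k]
    tokens = before + current + after
    return tokens, (len(before), len(before) + len(current))

sents = [["The", "Supplier", "shall", "not", ":"],
         ["(a)", "disclose", "Confidential", "Information", ";"]]
tokens, span = context_window(sents, 1, k=3)
```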
H-BILSTM-ATT: The hierarchical BILSTM classifier, H-BILSTM-ATT, considers all the sentences (or clauses) of an entire section. Each sentence (or clause) is first turned into a sentence embedding (h ∈ R^600), as in BILSTM-ATT (Fig. 2). The sequence of sentence embeddings is then fed to a second BILSTM (Fig. 3), whose hidden states (h_t ∈ R^600) are treated as context-aware sentence embeddings. The latter are passed on to a multinomial LR layer, producing a probability per class for each sentence (or clause) of the section. We hypothesized that H-BILSTM-ATT would perform better, because it considers an entire section at a time, and salient information about a sentence or clause (e.g., that the opening clause of a list contains a negation or modal) can be 'condensed' in its sentence embedding and interact with the sentence embeddings of distant sentences or clauses (e.g., a nested clause several clauses after the opening one) in the upper BILSTM (Fig. 3).
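Structurally, the two-level model can be sketched as below. For brevity, a mean of token embeddings stands in for the BILSTM-ATT sentence encoder, and a toy gateless bidirectional RNN stands in for the upper BILSTM; all dimensions and weights are made up.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=0.5):
    # Toy element-wise recurrent cell (stand-in for an LSTM cell).
    return [math.tanh(w_h * hi + w_x * xi) for hi, xi in zip(h, x)]

def bi_rnn_states(vecs, hidden):
    """Toy bidirectional RNN returning one concatenated state per input."""
    fwd, f = [0.0] * hidden, []
    for v in vecs:
        fwd = rnn_step(fwd, v[:hidden])
        f.append(fwd)
    bwd, b = [0.0] * hidden, []
    for v in reversed(vecs):
        bwd = rnn_step(bwd, v[:hidden])
        b.append(bwd)
    b.reverse()
    return [x + y for x, y in zip(f, b)]

def encode_sentence(token_embs):
    # Stand-in for the BILSTM-ATT sentence encoder: mean of embeddings.
    n = len(token_embs)
    return [sum(e[i] for e in token_embs) / n
            for i in range(len(token_embs[0]))]

section = [[[0.1, 0.2], [0.3, 0.4]],     # sentence 1 (token embeddings)
           [[0.5, 0.1]],                 # sentence 2
           [[0.2, 0.2], [0.0, 0.6]]]     # sentence 3
sent_embs = [encode_sentence(s) for s in section]
ctx_embs = bi_rnn_states(sent_embs, hidden=2)
# Each ctx_embs[i] is a context-aware sentence embedding that would be
# fed to the LR layer to classify sentence i.
```

The key design point survives the simplification: information from any sentence of the section can reach the classification of any other sentence through the upper recurrent layer.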

Experimental Results
Hyper-parameters were tuned by grid-searching the following sets and selecting the values with the best validation loss: LSTM hidden units {100, 200, 300}, batch size {8, 16, 32}, drop-out rate {0.4, 0.5, 0.6}. The red dashed lines of Figs. 2-3 are drop-out layers. We used categorical cross-entropy loss, Glorot initialization (Glorot and Bengio, 2010), Adam (Kingma and Ba, 2015) with learning rate 0.001, and early stopping on the validation loss. Table 3 reports the precision, recall, F1 score, and area under the precision-recall curve (AUC) per class, as well as micro- and macro-averages. The self-attention mechanism (BILSTM-ATT) leads to clear overall improvements (in macro and micro F1 and AUC, Table 3) compared to the plain BILSTM, supporting the hypothesis that self-attention allows the classifier to focus on indicative tokens. Allowing the BILSTM to consider tokens of neighboring sentences (X-BILSTM-ATT) does not lead to any clear overall improvement.
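The grid search described above can be sketched as follows; the validation-loss function here is a mock with an arbitrary formula (the real one would train each configuration and evaluate it on the development set).

```python
import itertools

# The hyper-parameter sets searched above.
GRID = {
    "hidden_units": [100, 200, 300],
    "batch_size": [8, 16, 32],
    "dropout": [0.4, 0.5, 0.6],
}

def mock_validation_loss(cfg):
    # Stand-in for training + evaluation; arbitrary deterministic formula.
    return (abs(cfg["hidden_units"] - 300) / 300
            + abs(cfg["dropout"] - 0.5)
            + cfg["batch_size"] / 100)

def grid_search(grid, eval_fn):
    keys = sorted(grid)
    best_cfg, best_loss = None, float("inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        loss = eval_fn(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

best, loss = grid_search(GRID, mock_validation_loss)
```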
The hierarchical H-BILSTM-ATT clearly outperforms the other three methods, supporting the hypothesis that considering entire sections and allowing the sentence embeddings to interact in the upper BILSTM (Fig. 3) is beneficial.
Notice that the three flat methods (BILSTM, BILSTM-ATT, X-BILSTM-ATT) obtain markedly lower F1 and AUC scores than H-BILSTM-ATT in the classes that correspond to nested clauses (obligation list item, prohibition list item). This is because the flat methods have no (or only a limited, in the case of X-BILSTM-ATT) view of the preceding sentences, which often indicate whether a nested clause is an obligation or a prohibition (see, e.g., Examples 4-6 in Table 1).
H-BILSTM-ATT is also much faster to train than BILSTM and BILSTM-ATT (Table 4), even though it has more parameters, because it converges faster (5-7 epochs vs. 12-15). X-BILSTM-ATT is particularly slow, because its BILSTM processes the same sentences multiple times: once when they are classified and again when they serve as context for neighboring sentences.

Related Work

O'Neill et al. (2017) used classes similar to ours (Table 2), but also included permissions, which we did not consider. Waltl et al. (2017) classified statements from German tenancy law into 22 classes (including prohibition, permission, consequence), using active learning with Naive Bayes, LR, and MLP classifiers, experimenting with 504 sentences. Kiyavitskaya et al. (2008) used grammars, word lists, and heuristics to extract rights, obligations, exceptions, and other constraints from US and Italian regulations. Asooja et al. (2015) employed SVMs with n-gram and manually crafted features to classify paragraphs of money laundering regulations into five classes (e.g., enforcement, monitoring, reporting), experimenting with 212 paragraphs.
In previous work (Chalkidis et al., 2017; Chalkidis and Androutsopoulos, 2017) we focused on extracting contract elements (e.g., contractor names, legislation references, start and end dates, amounts), a task similar to named entity recognition. The best results were obtained by stacked BILSTMs (Irsoy and Cardie, 2014) or stacked BILSTM-CRF models (Ma and Hovy, 2016); hierarchical BILSTMs were not considered. By contrast, in this paper we considered obligation and prohibition extraction, treating it as a sentence (or clause) classification task, and showing the benefits of employing a hierarchical BILSTM model that considers both the sequence of words in each sentence and the sequence of sentences.

Yang et al. (2016) proposed a hierarchical RNN with self-attention to classify texts. A first bidirectional RNN turns the words of each sentence into a sentence embedding, and a second one turns the sentence embeddings into a document embedding, which is fed to an LR layer. Yang et al. use self-attention in both RNNs, to assign attention scores to words and sentences. We classify sentences (or clauses), not entire texts, hence our second BILSTM does not produce a document embedding and does not use self-attention. Also, Yang et al. experimented with reviews and community question answering logs, whereas we considered legal texts.

Conclusions and Future Work
We presented the legal text analytics task of detecting contractual obligations and prohibitions. We showed that self-attention improves the performance of a BILSTM classifier, the previous state of the art for this task, by allowing the BILSTM to focus on indicative tokens. We also introduced a hierarchical BILSTM (also using attention), which converts each sentence to an embedding and then processes the sentence embeddings to classify each sentence. Apart from being faster to train, the hierarchical BILSTM outperforms the flat one, even when the latter considers the surrounding sentences, because the hierarchical model has a broader view of the discourse.
Further performance improvements may be possible by considering deeper self-attention mechanisms (Pavlopoulos et al., 2017), stacking BILSTMs (Irsoy and Cardie, 2014), or pre-training the BILSTMs with auxiliary tasks (Ramachandran et al., 2017). The hierarchical BILSTM with attention of this paper may also be useful in other sentence, clause, or utterance classification tasks, for example in dialogue turn classification (Xie and Ling, 2017), detecting abusive user comments in on-line discussions (Pavlopoulos et al., 2017), and discourse segmentation (Hearst, 1997). We would also like to investigate replacing its BILSTMs with sequence-labeling CNNs (Bai et al., 2018), which may lead to efficiency improvements.

Figure 3:
Upper part of the hierarchical BILSTM (H-BILSTM-ATT). The sentence embeddings (SE_i) are generated by the encoder of Fig. 2.

Table 1:
Examples of sentences and clauses, with human annotations of classes. Terms that are highly indicative of the classes are shown in bold and underlined here, but are not marked by the annotators.
Second, we introduce a hierarchical BILSTM with self-attention, which classifies sentences, not entire texts (e.g., news articles or product reviews). It outperforms a flat BILSTM that classifies each sentence independently, even when the latter considers neighbouring sentences, because the hierarchical BILSTM has a broader view of the discourse. Third, we experiment with a dataset an order of magnitude larger than the dataset of O'Neill et al. Fourth, we introduce finer classes (Tables 1-2).

Table 3:
Precision, recall, F1, and AUC scores, with the best results in bold and gray background.

Table 4:
Training times and parameters to learn.