Sentence Classification for Investment Rules Detection

In recent years, compliance requirements for the banking sector have grown considerably, making current compliance processes difficult to maintain. Any process that accelerates the identification and implementation of compliance requirements can help address this issue. The contributions of the paper are twofold: we propose a new NLP task, investment rule detection, and a group of methods to identify such rules. We show that the proposed methods perform well and are fast, and can thus be deployed in production.


Introduction
Compliance requirements have grown dramatically in recent years, especially in the financial sector. Investment funds are obliged by law to publish their investment strategy at a very detailed level. According to Thomson Reuters, in 2015 there was on average one regulatory change every 12 minutes (Thomson Reuters, 2015). However, implementing each regulatory change takes months, so any process that helps spot regulatory changes can accelerate this updating process. This is important because an investment fund that does not follow these rules precisely will be fined by the corresponding regulatory institution. In fact, in recent years the income dedicated to fines and settlements has increased almost 45-fold for the biggest EU and US banks (Kaminski and Robu, 2016).
The compliance departments of depositary banks are in charge of checking that these rules are actually followed. To avoid sanctions, they define a 4-eye protocol for rule identification: two or more people read and highlight the investment rules in the prospectus of each investment fund they control. Once two people have highlighted the same prospectus, a third person enters all the rules into the system. Identifying the rules is time-consuming and tedious. This process takes days for human readers; we propose a method that takes seconds thanks to machine learning. Although previous work has acknowledged the importance of having the rules isolated (Cashman et al., 2002; Beale, 2004), current systems assume that the rules have already been identified and translated into executable code.
In this paper, we propose to detect investment rules using binary classification of sentences. In Section 2, we present the state of the art in sentence classification. In Section 3.1, we give the details of the data and the posed problem. The proposed solutions are given in Section 3.2 and the obtained results in Section 3.3. Section 4 concludes the paper and outlines future work.

Related Works
Sentence Classification. Sentence classification is a classic research area in natural language processing. Approaches prior to 2010 focus mostly on extracting document meaning through representative features that are used as input to classic machine learning algorithms, such as SVM, k-NN, or Naive Bayes (see (Khan et al., 2010) for a review of the topic). The rise of Deep Learning techniques has also affected the sentence classification literature, with the appearance of methods based on CNNs. More specifically, a modification of (Collobert et al., 2011) was proposed by Kim (Kim, 2014), showing how a simple model together with pre-trained word representations can perform very well. However, the use of word embeddings has been challenged for CNNs: (Zhang, 2014, 2015) propose a semi-supervised setting that learns a small text-region representation. Zhang et al. propose a CNN based directly on character representations, without explicitly encoding words. Because CNNs are highly dependent on the window size, (Lai et al., 2015; Visin et al., 2015) propose the use of Recurrent Convolutional Neural Networks to overcome this issue. (Guggilla et al., 2016) propose the use of LSTMs for the classification of online user comments. To avoid problems due to lack of data, (Liu et al., 2016) propose multitask learning using LSTMs.
Word embeddings. The lack of large databases of tagged data is a common problem for Deep Learning models. Collobert et al. (Collobert et al., 2011) empirically demonstrated the usefulness of unsupervised word representations for a variety of NLP tasks, and since then it has been widely accepted that, for small and medium-sized databases (< 10k samples), the use of word embeddings improves the final results. Word embeddings is the name given to a group of language-model methods that map words into a vector space. They were introduced by Bengio et al. (Bengio et al., 2003), who proposed a statistical language model based on shallow neural networks. The goal was to predict the next word given the previous context in the sentence, which was a major advance with respect to n-grams. Collobert et al. (Collobert et al., 2011) set the neural network architecture for many current approaches. Mikolov et al. (Mikolov et al., 2013) proposed a simplified model (word2vec) that can be trained on larger corpora. They also showed how semantic relationships emerge from this training. Pennington et al. (Pennington et al., 2014), with GloVe, maintain the semantic capacity of word2vec while introducing the statistical information from latent semantic analysis (LSA), showing that this improves results on semantic and syntactic tasks.

Rule detection in prospectuses
In this section we present the problem of rule detection in investment fund prospectuses and our proposal for tackling it.

The data
Investment fund prospectuses are documents in which the fund informs the regulatory institution and its future clients of its investment strategy, its risk management, the company structure, etc. Most of these documents are publicly available on the web page of the regulatory authority; see for instance (AMF, 2018) for French documents. The investment rules that we want to identify are very precise rules that can be of different kinds and are, in general, very different from other sentences in the same text, as can be observed in Table 1.
The gold standard database. The data used in the supervised part of the model consists of around 3.5k annotated sentences for each language (English and French). The sentences are split into two classes: label 1 is used for rules and label 0 for non-rules, as shown in Table 1.

Proposed methods
In this subsection we detail the proposed algorithms. The task requires several pre-processing steps for data preparation before training or inference. First, the document is segmented into a list of sentences. Each sentence is then tokenized into multiple elements, based mostly on space and punctuation characters. Finally, each token is mapped to a unique id, so that each sentence becomes a list of integers that is fed to the model.
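The pre-processing steps can be illustrated with a minimal sketch. The regular expressions, the reserved ids for padding and unknown tokens, and all function names are assumptions made for illustration; the paper does not specify the segmenter or tokenizer that was actually used.

```python
import re

# Assumption: a naive splitter on sentence-final punctuation followed by space.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumption: numbers (with decimals), words, and single punctuation marks.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")

PAD_ID, UNK_ID = 0, 1  # hypothetical reserved ids


def split_sentences(document):
    """Segment a document into a list of sentences."""
    return [s.strip() for s in SENTENCE_END.split(document) if s.strip()]


def tokenize(sentence):
    """Lower-case a sentence and split it on spaces and punctuation."""
    return TOKEN.findall(sentence.lower())


def build_vocab(tokenized_sentences):
    """Assign a unique integer id to every distinct token."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for tokens in tokenized_sentences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids; unseen tokens fall back to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]


document = "The SICAV may invest in OTC markets. The management fee is 0.1%."
sentences = split_sentences(document)
tokenized = [tokenize(s) for s in sentences]
vocab = build_vocab(tokenized)
ids = [encode(t, vocab) for t in tokenized]
```

Each entry of `ids` is the list of integers for one sentence; the decimal in "0.1%" is kept as a single token and does not trigger a sentence split.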
Word embeddings. The word vectors are initialized using the GloVe algorithm of Pennington et al. (Pennington et al., 2014) and then fine-tuned along with the model parameters during training. We used a corpus of fund prospectuses and Wikipedia pages to train domain-specific word embeddings. This is justified by the fact that some words used in prospectuses are uncommon in general language use and are thus not included in the available word vectors pre-trained on Wikipedia or Common Crawl alone.
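As a rough illustration of the initialization step, the sketch below builds an embedding table from vectors stored in the GloVe text format (one word per line followed by its vector components). The zero padding row and the small random initialization for words without a pre-trained vector are assumptions; the function names and toy vectors are hypothetical.

```python
import random


def parse_glove_lines(lines):
    """Parse GloVe text format: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Return one row per vocabulary id, initialized from pre-trained vectors."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        if idx == 0:
            continue  # assumption: id 0 is padding and stays at zero
        if word in vectors:
            matrix[idx] = list(vectors[word])
        else:
            # assumption: small random vector for words missing from GloVe
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix


# Toy vectors for illustration only.
glove_lines = ["fund 0.1 0.2 0.3", "invest 0.4 0.5 0.6"]
vocab = {"<pad>": 0, "fund": 1, "invest": 2, "sicav": 3}
matrix = build_embedding_matrix(vocab, parse_glove_lines(glove_lines), dim=3)
```

In a trainable model, these rows would serve only as starting values and would be updated together with the classifier parameters.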

Linear network model
The linear network model is a logistic regression applied to an unweighted average of dense word vectors. The advantage of this model is that it is simple while still benefiting from the unsupervised pre-training of the word embeddings. This also means that it is very fast and computationally cheap compared to the other models presented here. In Figure 1, we can see the overall architecture of the model.

Table 1: Example sentences and their tags (1 = rule, 0 = non-rule).
Sentence | Tag
The Fund will invest at least 70% of its net assets in sub investment grade corporate debt securities with a credit rating equivalent to BB+ or lower and denominated in USD. | 1
The SICAV may invest in OTC markets. | 1
The Company may not invest in gold, spot commodities, or real estate | 1
The management fee is 0.1% | 0
The asset manager JP Morgan assigns BNP Security Services as its depositary bank. | 0
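The forward pass of the linear network model can be sketched in a few lines: average the word vectors of a sentence, then apply a logistic regression. The toy embeddings, weights, and function names are hypothetical, and training is omitted.

```python
import math


def average_embeddings(token_ids, embeddings):
    """Unweighted mean of the word vectors of a sentence."""
    dim = len(embeddings[0])
    if not token_ids:
        return [0.0] * dim
    sums = [0.0] * dim
    for idx in token_ids:
        for d in range(dim):
            sums[d] += embeddings[idx][d]
    return [s / len(token_ids) for s in sums]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_rule_probability(token_ids, embeddings, weights, bias):
    """Logistic regression applied to the averaged sentence vector."""
    avg = average_embeddings(token_ids, embeddings)
    return sigmoid(sum(w * x for w, x in zip(weights, avg)) + bias)


# Toy 2-dimensional embeddings for a 3-word vocabulary (illustration only).
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights, bias = [2.0, -2.0], 0.0
p = predict_rule_probability([1, 1, 2], embeddings, weights, bias)
p_shuffled = predict_rule_probability([2, 1, 1], embeddings, weights, bias)
```

Because the average ignores token positions, `p` and `p_shuffled` are equal: the model cannot distinguish two sentences that contain the same words in a different order.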

Convolutional Neural Network
We used a CNN architecture similar to the one introduced in (Kim, 2014). It consists of the following layers:
• Convolutional layer: Three 1-dimensional convolution layers are applied in parallel to the input embedding sequence. Each convolution layer uses a different filter size {3, 4, 5} and captures sentence information at a different scale (3-gram, 4-gram, 5-gram). The convolution filters learn translation-invariant representations, which is useful for language because it allows weights to be shared between neurons and thus significantly reduces the number of weights compared to a fully connected layer. We use 100 filters for each layer and ReLU as the non-linearity of the convolution layers.
• Max-pooling: Applies a max operation across the sequence and returns an output that has the same size as the number of filters in each convolution layer.
• Concat layer: Concatenates the outputs of the max-pooling operations.
• Linear layer: Applies a linear mapping from the concat layer to the output.
In Figure 2, we can see the overall architecture of the model.
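The layer stack above can be traced with a small forward-pass sketch written in plain Python. The weights are random and untrained, the embedding dimension of 8 and the sentence length of 12 are arbitrary assumptions, and all function names are hypothetical; the sketch only shows how the shapes flow through the network.

```python
import random


def conv1d_relu(seq, filters, biases):
    """Slide each filter over the token sequence and apply ReLU.

    seq: list of token vectors; filters: one [width][dim] weight block each.
    Returns a feature map of shape [positions][num_filters].
    """
    width = len(filters[0])
    feature_map = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        row = []
        for filt, b in zip(filters, biases):
            z = b + sum(w * x
                        for w_vec, x_vec in zip(filt, window)
                        for w, x in zip(w_vec, x_vec))
            row.append(max(0.0, z))
        feature_map.append(row)
    return feature_map


def max_over_time(feature_map):
    """Keep the maximum activation of each filter across all positions."""
    return [max(column) for column in zip(*feature_map)]


def cnn_features(seq, branches):
    """Run the parallel branches and concatenate their pooled outputs."""
    features = []
    for filters, biases in branches:
        features.extend(max_over_time(conv1d_relu(seq, filters, biases)))
    return features


def random_branch(rng, width, dim, num_filters):
    filters = [[[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                for _ in range(width)] for _ in range(num_filters)]
    return filters, [0.0] * num_filters


rng = random.Random(0)
dim, num_filters = 8, 100  # dim is an assumption; 100 filters as in the text
branches = [random_branch(rng, w, dim, num_filters) for w in (3, 4, 5)]
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(12)]
features = cnn_features(sentence, branches)  # 3 branches x 100 filters
output_weights = [rng.uniform(-0.1, 0.1) for _ in features]
logit = sum(w * f for w, f in zip(output_weights, features))  # linear layer
```

The concatenated feature vector has 300 entries regardless of sentence length, which is what allows a single linear layer to follow. Sentences shorter than the largest filter would need padding, which this sketch does not handle.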

Bi-directional Long Short-Term Memory
The Bi-LSTM model was first introduced in (Graves and Schmidhuber, 2005). Here, we used a specific model that consists of the following layers:
• Forward LSTM: Sequential layer that is applied to the list of word embeddings from the first token in the sentence to the last token and outputs the LSTM cell state of the last token of the sentence.
• Backward LSTM: Sequential layer that is applied to the list of word embeddings from the last token in the sentence to the first token and outputs the LSTM cell state of the first token of the sentence.
• Concat layer: Concatenates the output of each LSTM layer.
• Linear layer: Applies a linear mapping from the concat layer to the output.
In Figure 3, we can see the overall architecture of the model.
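The two directions and the concatenation can be sketched with a standard LSTM cell in plain Python. The weights are random and untrained, the sizes are arbitrary assumptions, and all names are hypothetical. The sketch returns the cell state, following the wording of the list above; returning the hidden state instead would be a one-line change.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lstm_final_cell_state(seq, params, hidden):
    """Run a standard LSTM over seq; return the cell state after the last step."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in seq:
        # One matrix holds the input, forget, output, and candidate weights.
        z = [a + b for a, b in zip(matvec(params["W"], x + h), params["b"])]
        i = [sigmoid(v) for v in z[0:hidden]]
        f = [sigmoid(v) for v in z[hidden:2 * hidden]]
        o = [sigmoid(v) for v in z[2 * hidden:3 * hidden]]
        g = [math.tanh(v) for v in z[3 * hidden:]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return c


def init_params(rng, dim, hidden):
    return {
        "W": [[rng.uniform(-0.1, 0.1) for _ in range(dim + hidden)]
              for _ in range(4 * hidden)],
        "b": [0.0] * (4 * hidden),
    }


def bilstm_features(seq, fwd, bwd, hidden):
    """Concatenate the final states of the forward and backward passes."""
    forward = lstm_final_cell_state(seq, fwd, hidden)
    backward = lstm_final_cell_state(seq[::-1], bwd, hidden)
    return forward + backward


rng = random.Random(0)
dim, hidden = 4, 3  # arbitrary sizes for illustration
fwd, bwd = init_params(rng, dim, hidden), init_params(rng, dim, hidden)
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
features = bilstm_features(sentence, fwd, bwd, hidden)  # length 2 * hidden
```

Unlike the averaged representation of the linear model, these features change when the token order changes, since each state depends on everything read before it.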

Implementation details
We used Keras (Chollet et al., 2015) with the TensorFlow backend throughout our experiments. We used the Adadelta optimizer (Zeiler, 2012) with a learning rate of 0.001 and a batch size of 50. For regularization, a dropout (Srivastava et al., 2014) of 0.5 is applied after the concat layer for the LSTM and CNN models and after the average layer for the linear network model. Binary cross-entropy is used as the loss for all models.
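To make the loss and regularization concrete, the following sketch implements binary cross-entropy, inverted dropout, and mini-batching in plain Python. It is not the Keras implementation used in the experiments; the clipping constant `eps` and the function names are assumptions, while the dropout rate of 0.5 and batch size of 50 are taken from the text.

```python
import math
import random


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y log p + (1 - y) log(1 - p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def dropout(features, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(features)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def batches(samples, batch_size=50):
    """Yield consecutive mini-batches of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


loss = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals log(2)
dropped = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
batch_sizes = [len(b) for b in batches(list(range(120)))]
```

With a rate of 0.5, surviving activations are doubled during training so that their expected value matches the one seen at inference time, when dropout is disabled.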

Results
We present a performance comparison of the architectures described above in terms of accuracy, precision, and recall, and also in terms of inference time, as it is also an important metric to consider when deploying a model in a production environment.

[Table: accuracy, precision, and recall per model; values not recoverable]
The convolutional model seems to yield slightly better results on average than the Bi-LSTM, which is in line with the results presented in (Guggilla et al., 2016). Both the Bi-LSTM and the CNN outperform the linear network model because they take into account the order of the tokens in the sentence, while the linear network model does not.
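For reference, accuracy, precision, and recall for the binary rule / non-rule task can be computed from the confusion counts as sketched below. The labels are toy values invented for illustration and are not results from the paper; the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with label 1 (rule) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fraction of sentences flagged as rules that really are rules.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Fraction of true rules that the classifier found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

For rule detection, recall is worth watching closely, since a missed rule is a rule that never reaches the compliance system.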

Model | Time per sample (s)
Linear | 1.2e-4
CNN | 3.1e-4
Bi-LSTM | 1.8e-3

Because of its simplicity, the linear network model is the fastest of the three, and the Bi-LSTM is about 6 times slower than the CNN while giving worse results.
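A per-sample inference time of this kind can be measured along the following lines. The sketch is an assumption about the procedure, since the paper does not describe how the timings were taken; `dummy_predict` is a hypothetical stand-in for a trained model.

```python
import time


def time_per_sample(predict, samples, repeats=3):
    """Best average wall-clock time per sample over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in samples:
            predict(sample)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(samples))
    return best


def dummy_predict(token_ids):
    """Hypothetical placeholder for a model's prediction function."""
    return sum(token_ids) % 2


samples = [[1, 2, 3, 4]] * 1000
seconds = time_per_sample(dummy_predict, samples)

# Ratio of the reported Bi-LSTM and CNN timings.
slowdown = 1.8e-3 / 3.1e-4
```

With the reported figures, `slowdown` is roughly 5.8, consistent with the statement that the Bi-LSTM is about six times slower than the CNN.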

Conclusions and further work
We have presented a method to detect and isolate mandatory rules in regulatory documents. The objective is to automate the detection of investment rules in prospectuses using a classifier. This spares compliance experts the tedious work of reading documents that are sometimes as long as 500 pages and take days to read, only to select very few sentences.
We described the frameworks used and the preprocessing steps, and compared multiple classification models in terms of accuracy, precision, recall, and inference time. The results show that convolutional neural networks offer the best trade-off between accuracy and execution time and are thus the best model for this task.