Fast Approach to Build an Automatic Sentiment Annotator for Legal Domain using Transfer Learning

This study proposes a novel way of identifying the sentiment of the phrases used in the legal domain. The added complexity of the language used in law, and the inability of the existing systems to accurately predict the sentiments of words in law are the main motivations behind this study. This is a transfer learning approach, which can be used for other domain adaptation tasks as well. The proposed methodology achieves an improvement of over 6% compared to the source model's accuracy in the legal domain.


Introduction
Sentiment analysis tasks are a common component in many Natural Language Processing (NLP) applications. As described by Esuli and Sebastiani (2007), sentiment analysis or sentiment classification is a recent methodology, aligned with information retrieval and computational linguistics, that is concerned with the opinion a given text expresses towards something.
In many recent studies involving NLP in various domains, it is common to reuse the seminal RNTN (Recursive Neural Tensor Network) model (Socher et al., 2013b) trained on movie reviews for sentiment analysis. However, this trained model is biased towards the movie review text on which it is based. The traditional way to remedy this problem is to retrain the model from scratch using the same algorithm. However, the algorithm proposed by Socher et al. (2013b) is labour intensive, given that it requires a sufficiently large corpus that has been manually annotated for sentiment. This difficulty is the reason most natural language processing applications reuse the original model despite the mismatch between the domain it was trained on and the domain to which it is being applied.
Law is a field involving grand collisions of ideas, most of which are in the form of written text, thus open to linguistic research. However, the language used in these documents is rather complex and esoteric to a certain degree, which makes it challenging to be utilized in intelligent systems. Lawyers, paralegals, and other legal professionals spend a considerable part of their time reading transcripts of past court cases, taking notes and collecting precedents to make their case stronger in court. This task is cumbersome and time-consuming. Therefore, it is an open opportunity for computer scientists to introduce efficient methods and tools for the legal domain. From this point onward we shall be referring to the court case transcripts as court cases.
In this study, we propose a novel way of applying sentiment analysis on the contents (words/phrases) of court cases. This analysis is useful in the NLP or Natural Language Understanding (NLU) tasks where it is vital to identify the stakeholder bias in each of the statements. Similarly, sentiment analysis in legal text can become useful in automating the following tasks related to legal literature.
• Identifying the arguments in a court case
• Identifying the arguments which were supportive of or against a certain party in a court case
• Identifying or synthesizing counterarguments for a given argument in a court case
To identify the application of sentiment classification in the legal domain, consider the following example, which was extracted from a legal case (Supreme Court, 2018).
The District Court concluded that Lee's counsel had performed deficiently.
In the above example, the phrase had performed deficiently induces a negative sentiment towards Lee's counsel. The sentiment of concluded that denotes that the court agrees with the inner sentence. The complete sentence denotes that the court's opinion towards Lee's counsel is negative. Consider the following, extracted from the same case:
...the Government conceded that Lee's counsel had performed deficiently.
This sentence contains the same inner sentence, but in the legal domain the phrase conceded that indicates a situation where the government initially disagreed but eventually had to agree. That phrase induces a negative sentiment on the inner sentence, which is negative towards Lee's counsel. Therefore, it is fair to assume that the government and Lee's counsel were on the same side in this situation.
The above-mentioned facts indicate the importance of identifying the sentiment of a statement towards a party in a court case. In the proposed approach, the sentiment of a given phrase is classified into one of two classes: negative and non-negative. This classification criterion is selected because the major use case is classifying terms and entities as supporting or referring to either the plaintiff or the defendant. Therefore, the proposed methodology is focused on identifying as many of the statements with negative sentiment as possible. If required, the proposed approach can be extended to explicitly identify positive sentiment as well, following the same methodology.
In this approach, we propose a novel methodology to perform transfer learning on the RNTN model mentioned in Socher et al. (2013b) and build a target model. Given that this is a transfer learning approach, the model trained on manually annotated movie reviews is used as the initial source model, rather than creating a new comparable manually annotated dataset for the legal domain.
For testing purposes, we created a manually annotated target domain test dataset in which each phrase belongs to one of the two classes: negative or non-negative. The target system shows a recall of 70.14% for identifying phrases with negative sentiment in the legal domain. Furthermore, the overall accuracy of the system is above 76% in classifying the sentiment of a given phrase correctly. Compared with the results of the source RNTN model (Socher et al., 2013b), this is a 6% improvement in accuracy. The approach proposed in this study can be tried on other domain adaptation tasks related to sentiment classification as well.

Background
Owing to the difficulties in handling legal jargon, efficient and effective computing applications in the field are somewhat sparse. The study by Schweighofer and Winiwarter (1993) claims that there is a significant vacuum in computerized applications for the field of law, which has resulted in an information crisis. The fact that legal vocabulary has words of mixed origin, such as English and Latin, has been raised as a reason for the difficulty of creating computing applications for the legal domain (Sugathadasa et al., 2018).
However, recently, there have been attempts to build legal ontologies (Jayawardana et al., 2017a,b) as well as attempts to calculate similarity measures in legal domain text and build information retrieval systems thereof (Sugathadasa et al., 2018). Given the popularity of knowledge embedding, a number of studies have also attempted to embed legal jargon in vector spaces (Nay, 2016). A more recent study by Ratnayaka et al. (2018) uses discourse relations in an attempt to identify relationships among sentences in court case transcripts.
Social media is one of the most used domains for research in sentiment analysis due to the availability of plentiful data. Social media platforms usually contain opinions expressed by people on various topics including politics, sports, entertainment, and others. For instance, Pak and Paroubek (2010) analyze the language in Twitter posts of millions of users and present a method to automatically collect a corpus with positive and negative sentiments. The authors performed statistical linguistic analysis on the collected corpus and built a sentiment classification system for micro-blogging, using a Naive Bayes classifier with N-grams and part-of-speech tags as features. This method is not suitable for analyzing legal text because of the inherent objectivity that needs to be preserved in law.
Sentiment classification is also known as opinion mining (Esuli and Sebastiani, 2007). As such, the study on opinion mining in legal blogs (Conrad and Schilder, 2007) is the closest implementation to this study that we have found in the literature. The approach by Conrad and Schilder (2007) uses the Lingpipe toolkit, whose sentiment annotation is based on a character-based language model, for sentiment classification. Further, the data set used for evaluation is based on movie reviews, customer reviews, and the MPQA corpus (Wiebe et al., 2005).
SentiWordNet (Esuli and Sebastiani, 2007; Baccianella et al., 2010) classifies synsets of WordNet (Miller, 1995) into three classes: negative, positive, and objective. Synsets that do not contain opinionated content are assigned to the objective class, while synsets that do contain opinionated content are considered subjective and are further classified as negative or positive depending on the sentiment they carry.
There have been numerous studies built upon SentiWordNet (Esuli and Sebastiani, 2007; Baccianella et al., 2010) which attempt to classify the sentiments of phrases and sentences. One such study by Ohana and Tierney (2009) proposes a methodology to perform opinion mining on movie reviews using a support vector machine where some of the features were calculated using WordNet. This achieves an accuracy of 69.35%, and the authors claim that the inaccuracies in SentiWordNet feature calculations are caused by SentiWordNet's reliance on glosses. Lu et al. (2012) evaluate SentiWordNet for identifying opposing opinion networks in forum discussions. The average SentiWordNet opinion score of words is used to identify whether a user's comment on a given post has a for or against relationship. The accuracy achieved using the SentiWordNet opinion score of words is 56%.
The method proposed by Socher et al. (2013b) provides an algorithm to identify the sentiment of a phrase or a sentence in a supervised manner using a deep learning model of the type Recursive Neural Tensor Network (RNTN). It is claimed that this model can identify the sentiment of a word considering its context. For that study, a dataset of movie reviews was created in which each sentence was broken into phrases and each phrase was annotated by human judges. The authors claim a testing accuracy of 80.7% at the phrase level for a test set drawn from the same dataset. Further, the authors claim that the proposed model can be trained over any domain by following the provided methodology. While this is theoretically possible, doing so for the legal domain in a practical implementation that covers a sufficiently large corpus is difficult. This claim is substantiated by the dataset of the original research (Socher et al., 2013b), which utilized 215,154 manually annotated phrases (from 11,855 sentences) with over 5355 unique words. In comparison, the legal corpus used in our study has a vocabulary exceeding 17000 words. The difficulties are not merely of scale, given that the linguistic complexity of legal jargon exceeds that of the average text corpus (Jayawardana et al., 2017b,c; Sugathadasa et al., 2018).
It is observed that the Recursive Neural Tensor Network (RNTN) model by Socher et al. (2013b) shows better accuracy in sentiment classification compared to other models. However, the bias of the trained model towards the movie reviews on which it was trained is a difficulty that needs to be overcome. For this purpose, several studies (Raina et al., 2007; Socher et al., 2013a) claim the process of domain adaptation to be a suitable solution. Domain adaptation is a sub-category of Transfer Learning (Raina et al., 2007).
While the generic process of transfer learning is defined as the process where a "learning model is trained using data from a certain domain and tested with respect to a different domain" (Raina et al., 2007), the specific case of domain adaptation occurs when the task is the same in both the source and the target model. Given that both this study and the original study by Socher et al. (2013b) work on sentiment classification, the transfer learning done in this study falls under the definition of domain adaptation (Raina et al., 2007). Even though transfer learning is not very common in the NLP field, it is extensively used in other fields such as image classification (Quattoni et al., 2008; Raina et al., 2007).
The aim of this study is likewise to build a sentiment classifier specific to the legal domain. However, preparing a manually labeled data set for training is a costly process in terms of time and human effort. Therefore, a transfer learning approach is used to adapt the RNTN model (Socher et al., 2013b) to the legal domain. Since the task is the same in both the source model (Socher et al., 2013b) and the target model for the legal domain, this adaptation belongs to the subcategory called domain adaptation (Raina et al., 2007; Socher et al., 2013a).

Methodology
Given that the transfer learning process described in this study uses the Recursive Neural Tensor Network (RNTN) model proposed by Socher et al. (2013b) as the source model, we make numerous references to that model throughout the paper. Therefore, to avoid clutter, the model proposed by Socher et al. (2013b) is referred to as the Socher Model in the remainder of this paper. The main methodological contributions of this study are discussed in this section.
In brief, first, it is required to select the vocabulary from a corpus comprised of legal case transcripts. Then we input a set of words extracted from that corpus to the Socher Model for sentiment annotation. After that, three human annotators check for words with deviated sentiments based on the classified classes. Using that identified set, we perform a transfer learning method to identify the sentiment of a given phrase in the legal domain. All these steps are further elaborated in the following sub-sections.

Selecting the Vocabulary
Given the size of the corpus (phrases extracted from legal text), the availability of human annotators, and the time available, it is not feasible to analyze and modify the sentiment of every word in a corpus. Therefore, it is required to select the vocabulary (unique words in the corpus) such that the end-model can correctly classify the sentiment of most of the phrases from the legal domain while not squandering human annotator time on words that occur rarely. To this end, first, the stop-words (Lo et al., 2005) are removed from the text by utilizing the classical stop-word list known as the Van stop-list (Van Rijsbergen, 1979). Next, the term frequency of each word in the corpus is calculated, and only the top 95% of the words, ranked by frequency, are added to the vocabulary.
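The selection step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name select_vocabulary, the tiny stop-word set (standing in for the Van stop-list), the regular-expression tokenizer, and the reading of "top 95%" as a fraction of the unique words ranked by frequency are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only;
# the study uses the Van stop-list (Van Rijsbergen, 1979).
STOP_WORDS = {"the", "a", "is", "of", "that", "with", "had"}


def select_vocabulary(phrases, stop_words=STOP_WORDS, keep_fraction=0.95):
    """Rank non-stop-words by term frequency and keep the top fraction.

    Assumption: `keep_fraction` is applied to the number of unique words.
    """
    counts = Counter(
        word
        for phrase in phrases
        for word in re.findall(r"[a-z']+", phrase.lower())
        if word not in stop_words
    )
    # most_common() sorts by descending frequency.
    ranked = [word for word, _ in counts.most_common()]
    if not ranked:
        return []
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]


phrases = [
    "The court concluded that the counsel had performed deficiently",
    "The court conceded",
]
vocabulary = select_vocabulary(phrases)
```

Lowering keep_fraction trades coverage of rare legal terms for less annotator effort, which is the trade-off the paragraph above describes.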

Assigning Sentiments for the Selected Vocabulary
The selected vocabulary (set of individual words) is given to the sentiment annotator Socher Model as input. The model classifies the sentiment into one of the five classes as in Table 3.2. This class scheme made sense for the movie reviews on which the Socher Model was trained. However, in the application of this study, the basic requirement of finding sentiment in court cases is to identify whether a given statement is against the plaintiff's claim or not. Therefore, we define two classes for sentiment: negative and non-negative. Three human judges analyze the selected vocabulary and, separately and independently, classify each word into one of the two classes depending on its sentiment. If at least two judges agree, the word is assigned the class on which those judges agreed. For the same word, the output from the Socher Model belongs to one of the five classes mentioned above. We map the output from the Socher Model to the two classes we define in Table 3. For a given word, if the sentiment values assigned by the Socher Model and the human judges do not agree under this mapping, we define that the Socher Model's output has deviated from its actual sentiment. For example:
Sentence: Sam is charged with a crime.
Socher Model's output: positive
Human judges' annotation: negative
The word charged has several meanings depending on the context. As the Socher Model was trained using movie reviews, the sentiment of the word charged is identified as positive. Although the sentiment of the term crime is recognized as negative, the sentiment of the whole sentence is output as positive. But in the legal domain, charged refers to a formal accusation. Therefore, the sentiment for the above sentence should have been negative. From the selected vocabulary, all the words with deviated sentiments are identified and listed separately for further processing.
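The voting and deviation check can be sketched as below. The five-to-two mapping is an assumption made for illustration, since the mapping table is not reproduced in this text; it is chosen so that a neutral output counts as non-negative, which is consistent with the treatment of the word insufficient later in the paper. The names FIVE_TO_TWO, majority_label, and is_deviated are hypothetical.

```python
from collections import Counter

# Assumed mapping from the five Socher Model classes to the two target
# classes; the paper's own mapping table is not available here.
FIVE_TO_TWO = {
    "very negative": "negative",
    "negative": "negative",
    "neutral": "non-negative",
    "positive": "non-negative",
    "very positive": "non-negative",
}


def majority_label(judge_labels):
    """Return the class on which at least two of the three judges agree."""
    label, votes = Counter(judge_labels).most_common(1)[0]
    if votes < 2:
        raise ValueError("no class was chosen by at least two judges")
    return label


def is_deviated(model_class, judge_labels):
    """True when the mapped model output disagrees with the judges' majority."""
    return FIVE_TO_TWO[model_class] != majority_label(judge_labels)


# The word "charged": the model says positive, the judges say negative.
charged_deviated = is_deviated("positive", ["negative", "negative", "non-negative"])
```

With three judges and two classes a majority always exists, so the error branch only guards against malformed input.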

Brief description of the Socher Model
In the preceding subsection, we came across a situation where the sentiment values from the Socher Model do not match the actual sentiment value because of the difference in domains. There are also words, such as insufficient, which are not recognized by the model because they were not included in the training data set. One approach to solve this is to manually annotate the phrases extracted from legal case transcripts as the Socher Model suggests, which would require a considerable amount of human effort and time. Instead, we can change the model such that the desired output is obtained from the same trained Socher Model without explicitly training on phrases from the legal domain. Hence, this method is called a transfer learning method.
In order to change the model, it is first required to understand the internals of the Socher Model. When a phrase is provided as input, the model first generates a binary tree corresponding to the input, in which each leaf node represents a single word. Each leaf node is represented as a d-dimensional vector. The parent nodes are also d-dimensional vectors, which are computed in a bottom-up fashion according to some function g. The function g is composed of a neural tensor layer. Through the training process, the neural tensor layer and the word vectors are adjusted to support the relevant sentiment value. The neural tensor layer identifies the sentiment according to the structure of the words that make up the phrase. For example:
phrase: not guilty
sentiment: non-negative
Both words in the above phrase have negative sentiment if we consider each of them individually. But the composition of those words has the structure of negating a negative sentiment term or phrase. Hence the phrase has a non-negative sentiment. If the input was a phrase like very bad, the neural tensor layer has the ability to identify that the term very increases the negativity in the sentiment. The hidden process is the same as in the preceding example.
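To make the bottom-up computation concrete, the following sketch composes two child vectors into a parent vector with a tensor layer of the form p = tanh(c^T V c + W c), where c is the concatenation of the children, which is our reading of the composition function g in Socher et al. (2013b). The function name compose, the dimension d = 2, and all numeric values are illustrative assumptions; the trained model's parameters are not shown here.

```python
import math


def compose(left, right, V, W):
    """Compute a parent vector from two d-dimensional child vectors.

    V holds d slices of size 2d x 2d (the tensor) and W is a d x 2d matrix.
    Output entry k is tanh(c^T V[k] c + W[k] . c) with c = [left; right].
    """
    c = list(left) + list(right)  # concatenated children, length 2d
    n = len(c)
    parent = []
    for k in range(len(left)):
        bilinear = sum(c[i] * V[k][i][j] * c[j] for i in range(n) for j in range(n))
        linear = sum(W[k][j] * c[j] for j in range(n))
        parent.append(math.tanh(bilinear + linear))
    return parent


# Illustrative values only (d = 2): a zero tensor and a W that copies `left`.
d = 2
V_zero = [[[0.0] * (2 * d) for _ in range(2 * d)] for _ in range(d)]
W_copy_left = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
parent = compose([0.5, -0.5], [0.1, 0.2], V_zero, W_copy_left)
```

Because the same V and W are applied at every node of the tree, changing only the leaf vectors, as the study proposes, leaves the learned composition behaviour (for example negation by not) untouched.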

Adjusting Word Vector Values in RNTN Model
The requirement of the system is to identify the sentiment of a given phrase. The proposed approach leaves the neural tensor layer unmodified. We simply substitute the word vectors of the individual words whose sentiments deviate between the Socher Model and the human annotation (see Section 3.2). Vectors for the words which were not in the vocabulary of the training set used to train the RNTN model must be instantiated. The vectors of the words which are not deviated (according to the definition provided in Section 3.2) remain the same. As the words with deviated sentiments in the vocabulary are already known, we initialize the vectors for those words according to their sentiment annotation. Since the model is not trained explicitly, each vector is initialized by substituting the vector of a word whose sentiment is not deviated, that is, a word for which the Socher Model output matches its actual sentiment. When choosing the substitute, we consider the part-of-speech tag of the word. For that purpose, the part-of-speech tagger of Toutanova et al. (2003) is used. The substitution of vectors is carried out as shown in Table 2.
The number of words which have deviated sentiments is considerably lower than the size of the selected vocabulary. The vectors of the remaining words are not changed in the modification process. The neural tensor layer also remains unchanged from the Socher Model trained on movie reviews (Socher et al., 2013b). Since the vectors for words with deviated sentiments are initialized according to the part-of-speech tag as shown in Table 2, it is fair to assume that the proposed implementation does not harm the structure corresponding to the linguistic features of English when deciding the sentiment. Consider the sentence "evidence is insufficient." as an example. The term "insufficient" is not in the vocabulary of the Socher Model due to the limited vocabulary of its training data set. Therefore, the Socher Model outputs the sentiment of that word as neutral, which marks it as a word with a deviated sentiment. Following Table 2, the sentiment-related vector is instantiated by substituting the vector of wrong, as the part-of-speech tag of insufficient is JJ (Santorini, 1990). Therefore, the modified version of the RNTN model is capable of identifying the sentiment of the above sentence as negative. Figure 1 shows how the sentiment is induced through the newly instantiated word vector.
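The substitution step can be sketched as a dictionary update. Only the entry that maps a negative adjective (tag JJ) to the donor word wrong comes from the text; the second entry, the function name substitute_vectors, the word vectors, and the two-dimensional size are hypothetical placeholders, since Table 2 is not reproduced here.

```python
# Assumed excerpt of a substitution table: (POS tag, annotated sentiment) -> donor word.
# ("JJ", "negative") -> "wrong" is taken from the text; the other entry is a placeholder.
SUBSTITUTES = {
    ("JJ", "negative"): "wrong",
    ("JJ", "non-negative"): "fine",  # hypothetical donor word
}


def substitute_vectors(word_vectors, deviated_words):
    """Return a copy of `word_vectors` where each deviated word takes its donor's vector.

    `deviated_words` maps a word to (pos_tag, human_sentiment). Words that are
    not listed keep their original vectors, and nothing else in the model changes.
    """
    updated = dict(word_vectors)
    for word, (pos_tag, sentiment) in deviated_words.items():
        donor = SUBSTITUTES[(pos_tag, sentiment)]
        updated[word] = list(word_vectors[donor])
    return updated


# Illustrative two-dimensional vectors; real vectors come from the trained source model.
source_vectors = {"wrong": [-0.8, 0.1], "fine": [0.6, 0.2], "evidence": [0.0, 0.1]}
target_vectors = substitute_vectors(source_vectors, {"insufficient": ("JJ", "negative")})
```

Copying the donor vector rather than aliasing it keeps later edits to one word from silently changing another.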

There are also scenarios where the term is in the vocabulary of the Socher Model but has a different sentiment in the legal domain. Consider the sentence "Sam is charged with a crime", which was mentioned in section 3.2.
In section 3.2, we identified that the term charged denotes a different sentiment in the legal domain compared to movie reviews. The source RNTN model outputs a positive sentiment for that sentence because the term charged is identified as having a positive sentiment according to the movie reviews domain. That term is the cause of such an output from the source model. Figure 3 indicates how the change we introduced in the target model (in section 3.2) induces the correct sentiment up to the root level of the phrase. Therefore, the target model identifies the sentiment correctly for the given phrase.
Figure 1: Sentiment Prediction for a phrase with words not in source's vocabulary but in target's vocabulary
To improve the recall in identifying phrases with negative sentiment, we have added another rule to the classification criteria. The source RNTN model (Socher Model) provides a score for each of the five classes such that the five scores sum to 1. If the negative sentiment class has the highest score, the sentiment label of the phrase is negative. Otherwise, the phrase can still be classified as having a negative sentiment if the score for the negative sentiment class is above 0.4. If neither condition is met, the phrase is classified as having a non-negative sentiment. Section 4 provides observations and results regarding the improved criteria.
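The two-step decision rule can be sketched as follows. The 0.4 threshold is from the text; treating the "negative sentiment class" as the combination of the very negative and negative scores is an assumption, as are the function name classify and the example scores.

```python
NEGATIVE_THRESHOLD = 0.4  # threshold stated in the text


def classify(scores, negative_labels=("very negative", "negative"),
             threshold=NEGATIVE_THRESHOLD):
    """Map five class scores (summing to 1) to negative / non-negative.

    Assumption: the negative score is the sum over `negative_labels`.
    """
    top_label = max(scores, key=scores.get)
    negative_score = sum(scores.get(label, 0.0) for label in negative_labels)
    # Rule 1: a negative class has the highest score.
    # Rule 2: otherwise, the negative score still exceeds the threshold.
    if top_label in negative_labels or negative_score > threshold:
        return "negative"
    return "non-negative"


# Hypothetical scores where neutral wins but the negative mass is 0.42.
borderline = {
    "very negative": 0.20,
    "negative": 0.22,
    "neutral": 0.45,
    "positive": 0.10,
    "very positive": 0.03,
}
label = classify(borderline)
```

The second rule only ever turns non-negative predictions into negative ones, so it can raise recall for the negative class but cannot raise its precision.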

Experiments and Results
The proposed approach in this paper is based on transfer learning. Therefore, we needed to create a gold standard for the sentiments of phrases and sentences in the legal domain in order to evaluate the model. The phrases and sentences for the test data set were randomly picked from legal case transcripts of the United States Supreme Court. During the selection process, we selected an equal number of phrases for both classes according to the Socher Model. Each of these phrases and sentences was annotated by three human annotators. Since the classification is binary, we pick the sentiment class for each test subject based on the majority of votes. In the end, the test data set contains nearly 1500 annotations, which are used in the evaluation process.
In the experiment, we compare the sentiment class picked by the human judges with that picked by the modified RNTN model. As the baseline model, we use the source RNTN model (Socher Model) to check the impact of the proposed transfer learning approach. The results from the baseline model are shown in Table 3, and the results from the target model are shown in Table 4. According to Table 3 and Table 4, there is a 10% improvement in recall for phrases with negative sentiment. The reason is that there are many words which occur in the legal domain but are unknown to the movie reviews corpus. In addition, we have introduced a new criterion based on a threshold for the score of the negative class to improve the recall. With that criterion, the precision in identifying phrases with a negative sentiment is 84.41%, whereas the precision of the baseline model (Socher Model) for the negative sentiment class is 79.62%, which is a lower value. Since the test dataset is not heavily skewed towards one class, it is fair to consider the accuracy of the system in predicting the sentiment of any given phrase. The baseline model shows an accuracy of 70.17%, while the target model shows 76.80%. The improvement in accuracy is above 6%.
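For reference, the metrics quoted above are derived from a two-class confusion matrix as in the sketch below, with negative treated as the class of interest. The counts are hypothetical and are not the values of Table 3 or Table 4; the function name binary_metrics is likewise an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy with 'negative' as the class of interest.

    tp: negative phrases predicted negative; fp: non-negative predicted negative;
    fn: negative predicted non-negative; tn: non-negative predicted non-negative.
    """
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=70, fp=10, fn=30, tn=90)
```

Applied to the reported accuracies, the gain is 76.80 - 70.17 = 6.63 percentage points.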

The observed results in Table 3 and Table 4 show that there is a 6% improvement in accuracy with respect to the baseline model. There are a few reasons behind these results. As we randomly selected phrases from the legal case transcripts corpus, only 45% of the phrases actually contained words for which we had substituted the sentiment vector. Therefore, the output of the baseline model and the target model was the same for 55% of the phrases. If we compare the outputs of the baseline model and the target model, they differ for 9.5% of the total phrases. Therefore, the difference between the two models rests on that 9.5% of the total phrases.

Conclusion and Future Work
This study is focused on building an automatic sentiment annotator for legal texts based on the Recursive Neural Tensor Network (RNTN) model mentioned in Socher et al. (2013b). Furthermore, this study can be identified as a transfer learning approach, as it is not required to prepare a training data set specifically for the legal domain. Instead, this approach uses the same training data set stated in Socher et al. (2013b). This task can be recognized as a domain adaptation task. The proposed approach achieves a 70.14% recall in identifying phrases with negative sentiments (an improvement of 10% compared to the source model). The accuracy of the target model is above 76%, which is a 6% improvement over the source model.
The proposed methodology can be adjusted for domain adaptation tasks in domains other than law, which broadens the applicability of this study. To build the target model, it is not required to prepare manually annotated training data for the specific domain. Another advantage is that if improvements are introduced to the source model, those improvements can be inherited by the target model as well. The major disadvantage associated with this model is that the accuracy of the target model will be limited by the source model on most occasions. In other words, it is hard to exceed the accuracy shown by the source model for its own domain.
There are words which produce one sentiment when they are combined but provide completely different sentiments when considered as individual elements. If we consider the term "cover up" in the legal domain, it has a meaning of hiding some mistake or crime. Therefore, it should have a negative sentiment. But the individual terms do not indicate negative sentiment. Therefore, the results can be further improved by considering bi-grams and tri-grams.
The improved version of the Stanford CoreNLP sentiment annotator (Socher et al., 2013b) could be used for further research on using machine learning for the legal domain. Furthermore, the transfer learning method we have described in this study is adjustable for any domain to build an automated sentiment annotator.