A Deep Neural Network Sentence Level Classification Method with Context Information

In the sentence classification task, context formed from sentences adjacent to the sentence being classified can provide important information for classification. This context is, however, often ignored. Where methods do make use of context, only small amounts are considered, making it difficult to scale. We present a new method for sentence classification, Context-LSTM-CNN, that makes use of potentially large contexts. The method also utilizes long-range dependencies within the sentence being classified, using an LSTM, and short-span features, using a stacked CNN. Our experiments demonstrate that this approach consistently improves over previous methods on two different datasets.


Introduction
Artificial neural networks (ANNs), and especially deep neural networks (DNNs), give state-of-the-art results for sentence classification tasks. Usually, sentences are treated as separate instances for the task. However, in many situations the sentence that is the focus of classification appears in a context that can provide additional information. For example, in the sentences below from the IEMOCAP dataset, it is difficult to classify M02 as showing excitement without the prior context:
• M01: I got it. I got accepted to U.S.C.
• F01: Oh, for real?
• M02: Yes! I just found out today. I just got the letter.
Our work is motivated by sentence classification in the text of medical records, in which complex judgements may be made across several sentences, each adding weight and nuance to a point. We believe, however, that the technique is more widely applicable. In order to test generalisability and to allow reproducibility, we therefore present an evaluation of the method on publicly available, non-medical corpora.
Previous work on using context for sentence classification used LSTM and CNN network layers to encode the surrounding context, giving an improvement in classification accuracy (Lee and Dernoncourt, 2016). However, the use of CNN and LSTM layers imposes a significant computational cost when training the network, especially if the context is large. For this reason, the approach presented by Lee and Dernoncourt (2016) is explicitly intended for sequential, short-text classification.
In many cases, however, the context available is of significant size. We therefore introduce a new method, Context-LSTM-CNN¹, which is based on the computationally efficient FOFE (Fixed-size Ordinally Forgetting Encoding) method (Zhang et al., 2015), and an architecture that combines an LSTM and a CNN for the focus sentence. The method consistently improves over results obtained from LSTM alone, CNN alone, or the two combined, with little increase in training time.
This paper makes three contributions: 1) a demonstration of the importance of context in some sentence classification tasks; 2) an adaptation of existing datasets for such sentence classification tasks, in order to support reproducibility of evaluations; 3) a neural architecture for sentence classification that outperforms previous methods, and can include context of arbitrary size without incurring a large computational cost.

Related work
Since their introduction (Collobert et al., 2011), CNNs with word embedding language models have become common for text classification tasks (Kim, 2014; Conneau et al., 2017). One limitation of the original CNN approach is the loss of long-distance dependencies. In order to deal with this in image and speech recognition tasks, Sainath et al. (2015) combined CNNs with a Recurrent Neural Network (RNN) layer. Zhou et al. (2015) subsequently applied this to text classification. However, the CNN-RNN approach was originally devised for sequence labelling, is biased towards later words in the sequence, and does not perform better than CNN alone. Huynh et al. (2016) suggested reversing the architecture to first apply the RNN, followed by a CNN with pooling to obtain global features. This gave results that improved over CNN-RNN, but not over CNN alone. In this paper, we build on the approach of Huynh et al. (2016) by replacing the GRU-based RNN (Cho et al., 2014) with an LSTM (Hochreiter and Schmidhuber, 1997) and by using multiple kernel sizes and more features in the subsequent CNN layer.
Lee and Dernoncourt (2016) showed that when classifying short texts, accuracy can be boosted by adding a CNN- or LSTM-derived vector representation of the surrounding context. For long contexts (such as patient records, which may include well over 100 sentences), however, this incurs a significant additional computational cost. In this paper, we therefore apply an adaptation of the FOFE encoding (Zhang et al., 2015) to encode context.
¹ The code is publicly available at https://github.com/deansong/contextLSTMCNN

Model
The Context-LSTM-CNN model is shown in Figure 1. It is based on the following components:
1. Input layer using word embeddings to encode the words of the focus sentence.
2. Bi-directional LSTM applied to the word embeddings of the focus sentence.
3. CNN on the outputs of the LSTM.
4. FOFE applied to word embeddings of both left and right context.
5. A final output layer.
In brief, an LSTM layer is used to encode the focus sentence. This is followed by convolutional layers with small kernels and max-pooling to extract local features at specific points from the LSTM outputs. In addition to processing the focus sentence, we also encode the full left and right contexts using an adaptation of FOFE applied to our embeddings. This encodes any variable-length context into a fixed-length embedding, thus allowing us to include large contexts without rapidly increasing the computational cost.
In detail, the full network takes three inputs. The first is the sequence of words $X = (x_1, x_2, \ldots, x_T)$, where $T$ is the length of the sentence to be classified, and where each $x_i$ is a word embedding for the respective word in this sentence. Embeddings are pre-trained by Word2Vec (Mikolov et al., 2013) on the corpus used for the respective experiment. The embeddings are not updated during the training of our network.
The second and third inputs are the left and right contexts, which connect to the FOFE encoders. Each context is a sequence of sentences $X_C = (s_1, s_2, \ldots, s_N)$, where each sentence is a sequence of word embeddings $s_n = (x_1, x_2, \ldots, x_U)$ from the same embedding space as $X$.
The first component of the inputs, derived from the focus sentence, is processed by a bi-directional LSTM with one layer, in order to capture long-distance dependencies within the sentence. Since LSTMs impose a significant computational cost for very long sequences, we only use this layer for the input representing the focus sentence, and not for the left and right contexts.
The LSTM generates outputs $h^{lstm} = (h_1, h_2, \ldots, h_T)$, which are passed on to the convolutional layer (CNN) in order to learn local features for different kernel sizes $l$ from the history-aware outputs of the LSTM. For each of several kernel sizes, we generate $f$ different features, to give CNN outputs $c^{l}_{cnn} = (c_1, c_2, \ldots, c_{T-l+1})$. For each CNN output $c^{l}_{cnn}$, we use max-over-time pooling to extract the most significant feature, and dropout to make the learned features more robust.
We use an adapted version of FOFE to provide information about the left and right contexts of the focus. Instead of the original 1-of-k FOFE representation, we apply FOFE encoding to word2vec embeddings. This gives a weighted sum of the context word embeddings, with weights decreasing exponentially with distance from the focus.
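The convolution and max-over-time pooling applied to the LSTM outputs can be illustrated with a minimal pure-Python sketch for a single feature and a single kernel size. The function names, the toy values, and the omission of any non-linearity and of dropout are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import List, Sequence


def conv_feature_map(h: Sequence[Sequence[float]],
                     kernel: Sequence[Sequence[float]],
                     bias: float = 0.0) -> List[float]:
    """Slide one kernel of size l over the sequence h_1..h_T.

    `kernel` holds l weight vectors, one per position in the window.
    Returns T - l + 1 activations (no padding, no non-linearity).
    """
    l = len(kernel)
    if l > len(h):
        raise ValueError("kernel is longer than the input sequence")
    feature_map = []
    for t in range(len(h) - l + 1):
        activation = bias
        for k in range(l):
            activation += sum(w * v for w, v in zip(kernel[k], h[t + k]))
        feature_map.append(activation)
    return feature_map


def max_over_time(feature_map: Sequence[float]) -> float:
    """Keep only the strongest activation of a feature across all positions."""
    return max(feature_map)


# Toy stand-in for LSTM outputs: T = 4 time steps, 2 dimensions (made-up values).
h = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
# One hypothetical kernel of size l = 2.
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv_feature_map(h, kernel)  # [3.0, 1.0, 7.0], length T - l + 1 = 3
pooled = max_over_time(feature_map)        # 7.0
```

In the full model, this would be repeated for each of the $f$ features and for every kernel size, and the pooled values concatenated.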
The embedding $z$ for a sentence $(x_1, x_2, \ldots, x_U)$ is initialised to $z_1 = x_1$, and then calculated recursively for $u \in \{2, \ldots, U\}$ as $z_u = \alpha \cdot z_{u-1} + x_u$. Unrolling the recursion gives $z_U = \sum_{u=1}^{U} \alpha^{U-u} x_u$, so the last word has weight 1 and earlier words have exponentially smaller weights. The parameter $\alpha$ is the forgetting factor, which controls how fast the weights diminish for words farther from the end of the sentence. This method is fast and compactly encodes the words of a sentence in a single embedding vector.
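The recursion above can be written in a few lines. The following is a minimal pure-Python sketch, assuming embeddings are given as plain lists of floats; the function name `fofe_encode` and the toy vectors are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def fofe_encode(embeddings: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """Encode a sequence of vectors with z_1 = x_1, z_u = alpha * z_{u-1} + x_u."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    z = list(embeddings[0])  # z_1 = x_1
    for x in embeddings[1:]:
        z = [alpha * z_i + x_i for z_i, x_i in zip(z, x)]
    return z


# Three toy 2-dimensional "word embeddings" (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

encoded = fofe_encode(sentence, alpha=0.5)      # [1.25, 1.5]
order_free = fofe_encode(sentence, alpha=1.0)   # [2.0, 2.0], a plain sum
```

With `alpha=1.0` the encoding reduces to an unweighted sum of the embeddings, so word order no longer affects the result.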
For our use of FOFE, we encode all sentences in the document to the left and right of the focus sentence, in two hierarchical steps. First, we encode each context sentence into a FOFE embedding $z^{sent}$, with a slowly decreasing $\alpha_{sent}$. Following this, the left context FOFE encodings are themselves encoded into a single context embedding using a rapidly decreasing $\alpha_{cont}$. This is calculated starting with $z^{cont}_1 = z^{sent}_1$ and is calculated for $m \in \{2, \ldots, |C_{left}|\}$ as $z^{cont}_m = \alpha_{cont} \cdot z^{cont}_{m-1} + z^{sent}_m$. The right context FOFE encodings are encoded in the same way, starting with $z^{cont}_{|C_{right}|} = z^{sent}_{|C_{right}|}$ and recursively applying the same formula for $m \in \{|C_{right}|, \ldots, 2\}$, that is, moving from the farthest sentence towards the focus. This gives a heavy bias towards sentences more local to the focus sentence, but only slightly decreases the importance of words within each sentence.
The final FOFE embeddings for the left and right contexts are then put through a dense linear layer to obtain the hidden layer outputs, which are combined with the LSTM-CNN outputs. The concatenated outputs from the dense FOFE layers and from the CNN layer for all kernel sizes are then used as input to a final softmax output layer.
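The two-level encoding can be sketched as follows, again in pure Python. The names `fofe`, `encode_context` and the `side` and `dim` parameters are hypothetical, and returning a zero vector for an empty context is an assumption; the text does not say how a focus sentence at the start or end of a document is handled.

```python
from typing import List, Sequence

Sentence = Sequence[Sequence[float]]  # one embedding vector per word


def fofe(vectors: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """z_1 = x_1, then z_u = alpha * z_{u-1} + x_u."""
    z = list(vectors[0])
    for x in vectors[1:]:
        z = [alpha * a + b for a, b in zip(z, x)]
    return z


def encode_context(sentences: Sequence[Sentence], alpha_sent: float,
                   alpha_cont: float, side: str, dim: int) -> List[float]:
    """Encode a left or right context given in document order."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if not sentences:
        return [0.0] * dim  # assumption: empty context maps to a zero vector
    # Step 1: one FOFE code per context sentence.
    sent_codes = [fofe(s, alpha_sent) for s in sentences]
    # Step 2: FOFE over sentence codes, always ending at the sentence nearest
    # the focus, so that the nearest sentence receives weight 1.
    if side == "right":
        sent_codes.reverse()  # start from the farthest right-context sentence
    return fofe(sent_codes, alpha_cont)


# Two toy context sentences with 1-dimensional embeddings (made-up values).
s1 = [[1.0], [1.0]]  # sentence code 2.0 when alpha_sent = 1
s2 = [[4.0]]         # sentence code 4.0

left = encode_context([s1, s2], 1.0, 0.5, side="left", dim=1)    # [5.0]
right = encode_context([s1, s2], 1.0, 0.5, side="right", dim=1)  # [4.0]
```

In the left context `s2` is adjacent to the focus and keeps full weight, whereas in the right context `s1` is adjacent, which is why the same two sentences produce different codes.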

Experiments
We compare the performance of five different network architectures: 1) CNN only; 2) LSTM only; 3) LSTM-CNN; 4) LSTM context encoded LSTM-CNN (L-LSTM-CNN), in which one left and one right context sentence are encoded by an LSTM; and 5) Context-LSTM-CNN (C-LSTM-CNN). We use the following two datasets for evaluation:
Interactive Emotional Dyadic Motion Capture Database (Busso et al., 2008) (IEMOCAP). Originally created for the analysis of human emotions based on speech and video, a transcript of the speech component is available for NLP research. Each sentence in the dialogue is annotated with one of 10 types of emotion. There is a class imbalance in the labelled data, and so we follow the approach of Chernykh et al. (2017) and only use sentences classified with one of four labels ('Anger', 'Excitement', 'Neutral' and 'Sadness'). For this dataset, instead of using left and right contexts, we assign all sentences from one person to one context and all sentences from the other person to the other context. While only the sentences with the four classes of interest are used for classification, all sentences of the dialogue are used as context. This results in a set of 4936 labelled sentences with average sentence length 14 and average document length 986.
Drug-related Adverse Effects (Gurulingappa et al., 2012) (ADE). This dataset contains sentences sampled from the abstracts of medical case reports. For each sentence, the annotation indicates whether adverse effects of a drug are being described ('Positive') or not ('Negative'). The original release of the data does not contain the document context, which we reconstructed from PubMed. Sentences for which the full abstract could not be found were removed, resulting in 20,040 labelled sentences, with average sentence length 21 and average document length 129.
In all experiments, five-fold cross validation was used for evaluation (for comparison with Huynh et al. (2016)). For each fold, training was run for 50 epochs with a minibatch size of 64 and the Adamax optimization algorithm. We used word2vec embeddings with 50 dimensions (suggested as sufficient by Lai et al. (2016)). For the LSTM, 64 hidden units were used. For the CNN, layers for kernel sizes 2 to 6 were included in the network, and 64 features were used for each.
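For reference, the settings stated above can be gathered in one place. This is a small sketch using a Python dataclass; the class and field names are hypothetical, and reading "kernel sizes 2 to 6" as the five sizes 2, 3, 4, 5 and 6 is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as stated in the text (field names are illustrative)."""
    embedding_dim: int = 50            # word2vec dimensions
    lstm_hidden_units: int = 64
    kernel_sizes: Tuple[int, ...] = (2, 3, 4, 5, 6)  # assumed inclusive range
    features_per_kernel: int = 64
    epochs: int = 50
    batch_size: int = 64
    folds: int = 5                     # five-fold cross validation
    optimizer: str = "adamax"

    @property
    def pooled_cnn_features(self) -> int:
        """Size of the concatenated max-pooled CNN output over all kernel sizes."""
        return len(self.kernel_sizes) * self.features_per_kernel


config = ExperimentConfig()
pooled_size = config.pooled_cnn_features  # 5 kernel sizes * 64 features = 320
```

Under these assumptions, the CNN contributes 320 pooled features to the final layer, before the dense FOFE outputs are concatenated.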

Effect of Forgetting Factors
We examined the effect of the two context encoder hyperparameters, $\alpha_{cont}$ (context-level forgetting factor) and $\alpha_{sent}$ (sentence-level forgetting factor), on classification performance over the IEMOCAP dataset. We tested both in the range 0.1 to 1 in steps of 0.1. Results are shown in Figure 2. Accuracy improves as $\alpha_{cont}$ increases, but drops at $\alpha_{cont} = 1$, at which point all context sentences are given equal weight. This may be because context closest to the focus sentence is more important than distant context. Therefore, we select $\alpha_{cont} = 0.9$ in all experiments.
For $\alpha_{sent}$, performance always increases as $\alpha_{sent}$ increases, with best results at $\alpha_{sent} = 1$, at which point all words in the sentence contribute equally to the context code. This implies that for individual sentences in the context, it is preferable to lose word order than to down-weight any individual word. In all experiments, we therefore set the sentence-level forgetting factor to $\alpha_{sent} = 1$.

Evaluation Results
Table 1 shows the mean and standard deviation of accuracy over the cross validation folds, and the training time, for both datasets. CNN alone performs better than LSTM alone in both tasks. The combined LSTM-CNN network consistently improves performance beyond both CNN alone and LSTM alone. Both context-based models (L-LSTM-CNN and C-LSTM-CNN) perform better than the non-context-based models, but note that L-LSTM-CNN increases training time by approximately 1.5x, whereas C-LSTM-CNN shows only a marginal increase in training time, with a large increase in accuracy on the IEMOCAP corpus. Table 2 shows the F1-measure for each class in the two datasets. Again, Context-LSTM-CNN outperforms the other models on all classes for all datasets. C-LSTM-CNN improves on average by 6.28 over L-LSTM-CNN, 10.16 over LSTM-CNN, 11.4 over CNN and 13.29 over LSTM.
We conducted a t-test between L-LSTM-CNN and C-LSTM-CNN. On IEMOCAP, C-LSTM-CNN is significantly better than L-LSTM-CNN (p = 0.002). On ADE, C-LSTM-CNN is not significantly better than L-LSTM-CNN (p = 0.128). This may be because ADE sentences are less context dependent. Alternatively, as the ADE task is relatively easy, with all models able to achieve about 90% accuracy, a context-based approach might not be able to further improve the accuracy.
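As an illustration of how such a comparison over cross validation folds might be computed, the sketch below derives a paired t statistic using only the Python standard library. The text does not state whether a paired or unpaired test was used, so the paired form is an assumption, and the per-fold accuracies are made-up placeholder values, not results from this paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples, with len(a) - 1 degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Placeholder per-fold accuracies for two models (NOT values from the paper).
model_a_folds = [0.70, 0.72, 0.71, 0.69, 0.73]
model_b_folds = [0.66, 0.67, 0.68, 0.65, 0.66]

t_stat = paired_t_statistic(model_a_folds, model_b_folds)
degrees_of_freedom = len(model_a_folds) - 1
```

The p-value would then be read from a Student's t distribution with the given degrees of freedom, for example with a statistics package.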

Conclusion
In this paper we introduced a new ANN model, Context-LSTM-CNN, that combines the strengths of LSTM and CNN with the lightweight context encoding algorithm FOFE. Our model shows a consistent improvement over both non-context-based models and an LSTM context encoded model for the sentence classification task.