Hierarchical Neural Networks for Sequential Sentence Classification in Medical Scientific Abstracts

Prevalent models based on artificial neural networks (ANNs) for sentence classification often classify sentences in isolation, without considering the context in which the sentences appear. This hampers traditional sentence classification approaches when they are applied to sequential sentence classification, where structured prediction is needed for better overall classification performance. In this work, we present a hierarchical sequential labeling network that uses the contextual information within surrounding sentences to help classify the current sentence. Our model outperforms the state-of-the-art results by 2%-3% on two benchmarking datasets for sequential sentence classification in medical scientific abstracts.


Introduction
Since 1665, over 50 million scholarly research articles have been published (Jinha, 2010), with approximately 2.5 million new scientific papers coming out each year (Ware and Mabe, 2015). While this enormous corpus provides us with the ability to conclusively accept or reject hypotheses and yields insight into promising research directions, its sheer volume makes it increasingly hard to extract useful information from the literature in an efficient and timely manner. An automatic and intelligent tool that helps users locate the information of interest quickly and comprehensively is therefore highly desirable.
When searching for relevant literature in a certain field, investigators first check the abstracts of scientific papers to see whether they match the criteria of interest. This process can be expedited if the abstracts are structured; that is, if the rhetorical structural elements of scientific abstracts such as purpose, methods, results, and conclusions (American National Standards Institute, 1979) are explicitly stated. However, even today, a significant portion of scientific abstracts is still unstructured, which causes great difficulty in information retrieval. In this paper, we develop a machine-learning-based approach to automatically categorize sentences in scientific abstracts into rhetorical sections so that the desired information can be efficiently retrieved.
In a scientific abstract, each sentence can be assigned to a rhetorical structural element sequentially. This rhetorical structure profiling process can be formulated as a sequential sentence classification task, as the element assignment of any single sentence is greatly associated with the assignments of the surrounding sentences. This is in contrast to the general sentence classification problem, where each sentence is classified individually and no contextual information can be used. Previous state-of-the-art methods relied on Conditional Random Fields (CRFs) to take into account the inter-dependence between subsequent labels, which improved joint sentence classification performance by considering the label sequence information. In this work, we add a bi-directional long short-term memory (bi-LSTM) layer over the representations of individual sentences so that it can encode the contextual content and semantics from preceding and succeeding sentences for better categorical inference of the current one.
In this work, we present a hierarchical neural network model for the sequential sentence classification task, which we call a hierarchical sequential labeling network (HSLN). Our model first uses an RNN or CNN layer to individually encode the sentence representation from the sequence of word embeddings, then uses a bi-LSTM layer that takes the individual sentence representations as input and outputs contextualized sentence representations, subsequently uses a single-hidden-layer feed-forward network to transform each sentence representation into a probability vector, and finally optimizes the predicted label sequence jointly via a CRF layer. We evaluate our model on two benchmarking datasets, PubMed RCT (Dernoncourt and Lee, 2017) and NICTA-PIBOSO (Kim et al., 2011), which were both generated from the PubMed database. Our key contributions are summarized as follows:
1. Based on the previous best performing architecture for sequential sentence classification, we add one more layer to extract contextual information from surrounding sentences for more accurate prediction of the current one. Together with the CRF algorithm, this allows us to make use of not only the preceding labels' information but also the content and semantics of adjacent sentences to infer the label of the target sentence.
2. We remove the need for a character-based word embedding component without sacrificing performance. For individual sentence encoding, we propose the use of a CNN module as an alternative to RNN for small datasets, suffering less from over-fitting as evidenced by our experiments. Moreover, we incorporate attention-based pooling in both RNN and CNN models to further improve the performance.
3. We adopt dropout with expectation-linear regularization instead of the standard one to reduce the performance gap between training and test phases.
4. We obtain state-of-the-art results on two datasets for sequential sentence classification in medical abstracts, outperforming the previous best models by at least 2% in terms of F1 scores.

Related Work
Previous systems for sequential sentence classification concentrate on the rhetorical structure analysis of biomedical abstracts. They are mainly based on naive Bayes (Ruch et al., 2007), support vector machine (SVM) (McKnight and Srinivasan, 2003; Yamamoto and Takagi, 2005; Liu et al., 2013), Hidden Markov Model (HMM) (Lin et al., 2006), and CRF (Kim et al., 2011; Hassanzadeh et al., 2014; Hirohata et al., 2008; Chung, 2009). All these methods heavily rely on numerous carefully hand-engineered features such as lexical (bag-of-words (BOW)), semantic (hypernyms, synonyms), structural (part of speech (POS) tags, lemmas, orthographic shapes, headings), statistical (statistical distributions of token types) and sequential (sentence position, surrounding features, predicted labels) features. In contrast, current emerging artificial neural network (ANN) based models have removed the need for manually selected features; instead, features are self-learned from the token and/or character embeddings. These deep learning models have revolutionized the natural language processing (NLP) field with state-of-the-art results achieved in various tasks, including the most relevant text classification task (Kim, 2014; Zhang et al., 2016; Conneau et al., 2017; Lai et al., 2015; Ma et al., 2015). Most of these models are built upon deep CNNs or RNNs as well as combinations of them, where CNN is good at extracting local n-gram features while RNN is suitable for sequence modeling.
The above-mentioned works for short-text classification do not consider any context of sentence semantics in the models, making them underperform in the sequential sentence classification scenario, where surrounding sentences can play a big role in inferring the label of the current sentence. Recent works that apply deep neural networks to the sequential sentence classification problem include a system in which the preceding utterances were used to help classify the current utterance in a dialog into the corresponding dialogue act. The most recent work, from Dernoncourt et al., used a CRF layer to optimize the predicted label sequence, where the preceding labels influence the current label. This model outperformed the state-of-the-art results on two datasets, PubMed RCT and NICTA-PIBOSO, for sentence classification in medical abstracts.

Proposed Model
Notation We denote scalars in italic lowercase (e.g., $k$), vectors in bold italic lowercase (e.g., $\boldsymbol{s}$) and matrices in italic uppercase (e.g., $W$). Colon notations $x_{i:j}$ and $\boldsymbol{s}_{i:j}$ are used to denote the sequence of scalars $(x_i, x_{i+1}, \ldots, x_j)$ and vectors $(\boldsymbol{s}_i, \boldsymbol{s}_{i+1}, \ldots, \boldsymbol{s}_j)$.
Our model is composed of four components: the word embedding layer, the sentence encoding layer, the context enriching layer, and the label sequence optimization layer. In the following sections they will be discussed in detail.

Word Embedding Layer
Given a sentence $w = w_1 w_2 \cdots w_N$ comprising $N$ words, this layer maps each word to a real-valued vector as its lexical-semantic representation. Word representations are encoded by the column vectors of the embedding matrix $W^{word} \in \mathbb{R}^{d_w \times |V|}$, where $d_w$ is the dimension of the word vector and $V$ is the vocabulary of the dataset. Each column $W^{word}_i \in \mathbb{R}^{d_w}$ is the word embedding vector for the $i$-th word in the vocabulary. The word embeddings $W^{word}$ can be pre-trained on large unlabeled datasets using unsupervised algorithms such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText.
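The column lookup described above can be illustrated with a minimal Python sketch. The toy vocabulary, the 3-dimensional vectors, the `<unk>` fallback column, and the function name `embed` are all assumptions made for illustration; they are not taken from the datasets or the released code.

```python
# Toy vocabulary: word -> column index. Hypothetical, for illustration only.
vocab = {"patients": 0, "were": 1, "randomized": 2, "<unk>": 3}

# W_word has shape d_w x |V| (here 3 x 4); column i is the vector of word i.
# The numbers are made up; real embeddings would be pre-trained.
W_word = [
    [0.1, 0.0, 0.7, 0.0],
    [0.4, 0.2, 0.1, 0.0],
    [0.3, 0.9, 0.5, 0.0],
]

def embed(sentence):
    """Map each word of a tokenized sentence to its column of W_word."""
    vectors = []
    for word in sentence:
        # Assumption: out-of-vocabulary words share a single <unk> column.
        i = vocab.get(word, vocab["<unk>"])
        vectors.append([row[i] for row in W_word])
    return vectors

e = embed(["patients", "were", "randomized"])
```

Each sentence thus becomes a sequence of $N$ vectors of dimension $d_w$, which is what the sentence encoding layer consumes.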

Sentence Encoding Layer
This layer takes as input the embedding vector of each token in a sentence from the word embedding layer and produces a vector $\boldsymbol{s}$ to encode this sentence. The sequence of embedding vectors is first processed by a bi-directional RNN (bi-RNN) or CNN layer, similar to those used previously for text classification (Kim, 2014; Liu et al., 2016). This layer outputs a sequence of hidden states $\boldsymbol{h}_{1:N}$ ($\boldsymbol{h} \in \mathbb{R}^{d_{hs}}$) for a sentence of $N$ words, with each hidden state corresponding to a word. To form the final representation vector $\boldsymbol{s}$ of this sentence, attention-based pooling is used, which can be described using the following equations:
$$A = \mathrm{softmax}\left(U_s \tanh\left(W_s H + \boldsymbol{b}_s\right)\right) \qquad (1)$$
$$S = A H^{\top} \qquad (2)$$
where $H = [\boldsymbol{h}_1 \, \boldsymbol{h}_2 \cdots \boldsymbol{h}_N] \in \mathbb{R}^{d_{hs} \times N}$, $W_s \in \mathbb{R}^{d_a \times d_{hs}}$ is the transformation matrix for soft alignment, $\boldsymbol{b}_s \in \mathbb{R}^{d_a}$ is the bias vector, $U_s \in \mathbb{R}^{r \times d_a}$ is the token-level context matrix used to measure the relevance or importance of each token with respect to the whole sentence, softmax is performed along the second dimension of its input matrix, and $A \in \mathbb{R}^{r \times N}$ is the attention matrix.
Here each row of $U_s$ is a context vector $\boldsymbol{u}_s \in \mathbb{R}^{d_a}$ and it is expected to reflect an aspect or component of the semantics of a sentence. To represent the overall semantics of the sentence, we use multiple context vectors to focus on different parts of this sentence.
Finally, the sentence encoding vector $\boldsymbol{s} \in \mathbb{R}^{r d_{hs}}$ is obtained by reshaping the matrix $S \in \mathbb{R}^{r \times d_{hs}}$ into a vector.
Figure 1: Model architecture. w: original word; e: word embedding vector; h: sentence-level hidden state output by the bi-RNN or CNN layer; s: sentence representation vector; h : abstract-level hidden state output by the bi-LSTM layer; r: sentence label probability vector; y: predicted sentence label.
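As an illustration of the attention-based pooling in this section, the following pure-Python sketch computes an attention matrix and a flattened sentence vector from toy inputs. It assumes the form $A = \mathrm{softmax}(U_s \tanh(W_s H + \boldsymbol{b}_s))$ and $S = A H^{\top}$, which matches the stated dimensions; the tanh nonlinearity, the helper names (`matmul`, `softmax`, `attention_pool`), and all numeric values are assumptions rather than details confirmed by the text.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def attention_pool(H, W_s, b_s, U_s):
    """H: d_hs x N, W_s: d_a x d_hs, b_s: length d_a, U_s: r x d_a."""
    proj = matmul(W_s, H)                                  # d_a x N
    hidden = [[math.tanh(v + b) for v in row]              # bias broadcast over tokens
              for row, b in zip(proj, b_s)]
    scores = matmul(U_s, hidden)                           # r x N
    A = [softmax(row) for row in scores]                   # each row sums to 1 over tokens
    H_T = [list(col) for col in zip(*H)]                   # N x d_hs
    S = matmul(A, H_T)                                     # r x d_hs
    s = [v for row in S for v in row]                      # flatten to length r * d_hs
    return A, s

# Toy example: d_hs = 2, N = 3 tokens, d_a = 2, r = 2 context vectors.
H = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
W_s = [[0.5, -0.5], [1.0, 0.0]]
b_s = [0.0, 0.1]
U_s = [[1.0, 0.0], [0.0, 1.0]]
A, s = attention_pool(H, W_s, b_s, U_s)
```

With an all-zero context matrix the attention weights become uniform, so the pooled vector reduces to the mean of the hidden states; learned context vectors shift this average toward the more relevant tokens.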

Context Enriching Layer
This layer takes as input the sequence of individual sentence encoding vectors in a given abstract of $n$ sentences obtained from the last sentence encoding layer, with each vector corresponding to a sentence. It outputs a new sequence of contextualized sentence encoding vectors, which are enriched with the contextual information from surrounding sentences. Specifically, the sequence of individual sentence encoding vectors is input into a bi-LSTM layer, which produces a sequence of hidden state vectors $\boldsymbol{h}_{1:n}$ ($\boldsymbol{h} \in \mathbb{R}^{d_{hd}}$), each corresponding to a sentence. Each of these vectors is subsequently input to a feed-forward neural network with only one hidden layer to get the corresponding probability vector $\boldsymbol{r} \in \mathbb{R}^{l}$, which represents the probability that this sentence belongs to each label, where $l$ is the number of labels.
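The final step of this layer, mapping one contextualized hidden state to a label probability vector, can be sketched as below. This is a minimal illustration only: the tanh hidden activation, the softmax output, the parameter names (`W1`, `b1`, `W2`, `b2`), and the toy numbers are assumptions, since the text specifies only a feed-forward network with one hidden layer. The bi-LSTM itself is omitted.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def label_probabilities(h, W1, b1, W2, b2):
    """One hidden layer (assumed tanh), then softmax over the l labels."""
    hidden = [math.tanh(v) for v in affine(W1, h, b1)]
    return softmax(affine(W2, hidden, b2))

# Toy example: d_hd = 2, hidden size 3, l = 2 labels. All values are made up.
h = [0.2, -0.4]
W1 = [[0.5, 0.1], [-0.3, 0.8], [0.0, 1.0]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5], [-0.5, 0.3, 0.2]]
b2 = [0.0, 0.0]
r = label_probabilities(h, W1, b1, W2, b2)
```

The same network is applied to every sentence position, yielding the sequence $\boldsymbol{r}_{1:n}$ that the next layer consumes.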

Label Sequence Optimization Layer
Within the abstract, the sequence of sentence categories implicitly follows some patterns. For example, the category Results is typically followed by Conclusion, and the category Methods typically comes after Background. Making use of such patterns can boost the classification performance via the CRF algorithm (Lample et al., 2016). Given the sequence of probability vectors $\boldsymbol{r}_{1:n}$ from the preceding context enriching layer for an abstract of $n$ sentences, this layer outputs a sequence of labels $y_{1:n}$, where $y_i$ represents the predicted label assigned to the $i$-th sentence.
In the CRF algorithm, in order to model dependencies between subsequent labels, we incorporate a matrix $T$ that contains the transition probabilities between two subsequent labels; we define $T[i, j]$ as the probability that a sentence with label $i$ is followed by a sentence with label $j$. The score of a label sequence $y_{1:n}$ is defined as the sum of the probabilities of individual labels and the transition probabilities:
$$s(y_{1:n}) = \sum_{i=1}^{n} \boldsymbol{r}_i[y_i] + \sum_{i=2}^{n} T[y_{i-1}, y_i] \qquad (3)$$
The score in the above equation can be transformed into the probability of a certain label sequence by taking a softmax operation over all possible label sequences:
$$p(y_{1:n}) = \frac{e^{s(y_{1:n})}}{\sum_{\hat{y}_{1:n} \in Y} e^{s(\hat{y}_{1:n})}} \qquad (4)$$
where $Y$ denotes the set of all possible label sequences. During the training phase, the objective is to maximize the probability of the gold label sequence. In the testing phase, given an input sequence, the corresponding sequence of predicted labels is chosen as the one that maximizes the score, computed via the Viterbi algorithm (Forney, 1973).
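The scoring and decoding steps of this layer can be sketched in a few lines of Python. The sketch assumes the score of a label sequence is the sum of per-sentence label scores $\boldsymbol{r}_i[y_i]$ plus the transition scores $T[y_{i-1}, y_i]$, and that the sequence probability is a softmax of this score over all label sequences. The function names and the toy values of `r` and `T` are hypothetical, and the brute-force normalizer is only practical for tiny examples.

```python
import math
from itertools import product

def sequence_score(r, T, y):
    """Per-sentence label scores plus transition scores between neighbours."""
    score = sum(r[i][y[i]] for i in range(len(y)))
    score += sum(T[y[i - 1]][y[i]] for i in range(1, len(y)))
    return score

def viterbi(r, T):
    """Return the highest-scoring label sequence and its score."""
    n, l = len(r), len(r[0])
    best = list(r[0])   # best[j]: best score of a prefix ending in label j
    back = []           # back[i-1][j]: best previous label for label j at position i
    for i in range(1, n):
        new_best, pointers = [], []
        for j in range(l):
            k = max(range(l), key=lambda p: best[p] + T[p][j])
            new_best.append(best[k] + T[k][j] + r[i][j])
            pointers.append(k)
        best = new_best
        back.append(pointers)
    last = max(range(l), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

def sequence_probability(r, T, y):
    """Softmax of the score over all l**n label sequences (brute force)."""
    n, l = len(r), len(r[0])
    z = sum(math.exp(sequence_score(r, T, c)) for c in product(range(l), repeat=n))
    return math.exp(sequence_score(r, T, y)) / z

# Toy example: n = 3 sentences, l = 2 labels. All values are made up.
r = [[2.0, 0.5], [0.3, 0.4], [0.1, 1.5]]
T = [[0.5, -1.0], [-2.0, 1.0]]
path, score = viterbi(r, T)
```

In this toy case the first sentence taken alone prefers label 0, yet the jointly decoded sequence assigns label 1 throughout because the transition scores reward staying in label 1; this is the kind of correction that joint optimization provides.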

Datasets
We evaluate our model on two sources of benchmarking datasets on medical scientific abstracts, where each sentence of the abstract is annotated with one label that is associated with the rhetorical structure. Table 1 summarizes the statistics of the two datasets.
NICTA-PIBOSO This dataset was shared from the ALTA 2012 Shared Task (Amini et al., 2012), the goal of which is to build automatic sentence classifiers that can map the sentences from biomedical abstracts into a set of pre-defined categories for Evidence-Based Medicine (EBM).
PubMed RCT This new dataset was curated by Dernoncourt and Lee (2017) and is currently the largest dataset for sequential sentence classification. It is based on the PubMed database of biomedical literature and each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, and conclusion. Table 2 presents an example abstract comprising structured sentences with their annotated labels.

Training Settings
For both datasets, test performance is assessed at the training epoch with the best validation performance, and F1 scores (weighted by support, i.e., the number of true instances for each label) are reported as the results. The token embeddings were pre-trained on a large corpus combining Wikipedia, PubMed, and PMC texts (Moen and Ananiadou, 2013) using the word2vec tool (denoted as "Word2vec-wiki+P.M."). They are fixed during the training phase to avoid over-fitting.
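The support-weighted F1 score used for reporting can be made concrete with a short sketch: the F1 score is computed per label and then averaged with weights equal to the number of gold instances of each label. The function name `weighted_f1` and the toy label sequences are illustrative assumptions, not part of the evaluation scripts.

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Average of per-label F1 scores, weighted by each label's support in gold."""
    support = Counter(gold)
    total = 0.0
    for label in support:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = support[label] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += support[label] * f1
    return total / len(gold)

# Toy example with three labels; the sentences and labels are made up.
gold = ["Background", "Background", "Method", "Result"]
pred = ["Background", "Method", "Method", "Result"]
score = weighted_f1(gold, pred)
```

Here Background and Method each reach an F1 of 2/3 and Result reaches 1, so the weighted average is (2 * 2/3 + 1 * 2/3 + 1 * 1) / 4 = 0.75.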
We also tried other types of word embeddings, such as the word2vec embeddings pre-trained on the Google News dataset (denoted as "Word2vec-News"), word2vec embeddings pre-trained on the Wikipedia corpus (denoted as "Word2vec-wiki"), GloVe embeddings pre-trained on the corpus of Wikipedia 2014 + Gigaword 5 (denoted as "Glove-wiki"), fastText embeddings pre-trained on Wikipedia (denoted as "FastText-wiki"), and fastText embeddings initialized with the standard GloVe Common Crawl embeddings and then fine-tuned on PubMed abstracts plus MIMIC-III notes (denoted as "FastText-P.M.+MIMIC"). The comparison results are summarized in the next section.
The model is trained using the Adam optimization method (Kingma and Ba, 2014). The learning rate is initially set to 0.003 and decayed by a factor of 0.9 after each epoch. For regularization, dropout (Srivastava et al., 2014) is applied to each layer. In the version of dropout used in practice (e.g., the dropout function implemented in the TensorFlow and PyTorch libraries), the model ensemble generated by dropout in the training phase is approximated by a single model with scaled weights in the inference phase, resulting in a gap between training and inference. To reduce this gap, we adopted the dropout with expectation-linear regularization introduced by Ma et al. (2016) to explicitly control the inference gap and thus improve generalization performance.
Hyperparameters were optimized via grid search based on the validation set and the best configuration is shown in Table 3. The window sizes of the CNN encoder in the sentence encoding layer are 2, 3, 4 and 5. The RNN encoder in the sentence encoding layer is set as LSTM for the PubMed datasets and gated recurrent unit (GRU) for the NICTA-PIBOSO dataset. Code for this work is available online.
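The learning-rate schedule in this section can be written out explicitly. The sketch below assumes that "decayed by 0.9" means multiplying the rate by 0.9 after every epoch and that epochs are counted from zero; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial=0.003, decay=0.9):
    """Learning rate used during the given epoch (epoch 0 is the first one)."""
    return initial * decay ** epoch

# Rates for the first five epochs under the assumed multiplicative schedule.
schedule = [learning_rate(epoch) for epoch in range(5)]
```

Under this reading the rate falls from 0.003 in the first epoch to about 0.00243 in the third, and to roughly a third of its initial value after ten epochs.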

Results and Discussion
Table 4 compares our model against the best performing models in the literature (Liu et al., 2013). There are two variants of our model, which differ in the implementation of the sentence encoding layer: the model that uses a bi-RNN to encode the sentence is called HSLN-RNN, while the model that uses the CNN module is named HSLN-CNN. We have evaluated both model variants on all datasets. As evidenced by Table 4, our model outperforms the best published methods by 2%-3% in F1 score on both benchmarks.
Table 4: Comparison of F1 scores (weighted by support, i.e., the number of true instances for each label) between our model and the best published methods. The presented results of our model are evaluated on the test set of the run with the highest F1 score on the validation set.
Table 5 presents the ablation analysis of our model (on the PubMed 20k dataset), where we remove one component at a time and quantify the performance drop (reported in F1 scores). As can be seen from Table 5, our HSLN-CNN model uniformly suffers a little more from the component removal than the HSLN-RNN model, indicating that the HSLN-RNN model is more robust. When the context enriching layer is removed, both models experience the most significant performance drop and are only on par with the previous state-of-the-art results, strongly demonstrating that this proposed component is the key to the performance improvement of our model. Furthermore, even without the label sequence optimization layer, our model still significantly outperforms the best published methods that are empowered by this layer, indicating that the context enriching layer we propose can help optimize the label sequence by considering the context information from the surrounding sentences. Last but not least, the dropout regularization and attention-based pooling components we add to our system further improve the model to a limited extent.
Table 5: Ablation analysis. F1 scores are reported. "− context" is our model without the context enriching layer. "− seq. opt." is our model without the label sequence optimization layer. "− dropout reg." is our model using the standard dropout strategy without the expectation-linear regularization. "− attention" refers to the model without attention-based pooling, i.e., in the sentence encoding layer, the final hidden state is used for the HSLN-RNN model while max-pooling is used for the HSLN-CNN model.
Tables 6 and 7 detail the results of classification for each label in terms of performance scores (precision, recall and F1) and confusion matrix, respectively (for our HSLN-RNN model trained on the PubMed 20k dataset). These show that the classifier is very good at predicting the labels Methods, Results and Conclusions, whereas the greatest difficulty the classifier has is in distinguishing Background sections from Objectives sections. One fifth of Background sentences are incorrectly classified as Objectives, while around one fourth of Objectives sentences are wrongly assigned to the label of Background. We conjecture that this difficulty mainly comes from the fact that the difference in writing style between Background and Objectives sentences is less obvious than for the other sections of the abstract. Moreover, our model has some difficulty in telling Methods sentences apart from Results sentences.
Table 8 presents a few examples of prediction errors that are produced by our HSLN-RNN model trained on the PubMed 20k dataset. This error analysis suggests that one of the biggest sources of model error could be the debatable gold standard labels of the dataset. For example, the sentence "Depressive disorders are one of the leading components of the global burden of disease with a prevalence of up to 14% in the general population." is indeed introducing the background of the problem (depressive disorders) on which this article is going to focus; however, the gold label classifies it into the Objective category.
For another instance, the sentence "A post hoc analysis was conducted with the use of data from the evaluation study of congestive heart failure and pulmonary artery catheterization effectiveness (escape)." belongs to the Result label according to the gold standard, but it would more sensibly be classified as Method.
Figure 2 presents an example of the transition matrix after the HSLN-RNN model has been trained on the PubMed 20k dataset, which encodes the transition probability between two subsequent labels. It effectively reflects which label is most likely to follow the current one. For example, by comparing the transition scores in the Result row in Figure 2, we can conclude that a sentence pertaining to the Result is typically followed by a sentence pertaining to the Conclusion and is unlikely to be followed by a sentence in the Background category (transition scores of 2.48 vs -5.46), which makes sense. From this transition matrix, we can figure out the most probable label sequence: Background → Objective → Method → Result → Conclusion, which is also consistent with our expectations.
In order to test the importance of pretrained word embeddings, we performed experiments with different sets of publicly released word embeddings, as well as our locally curated word embeddings, to initialize our model. Table 9 gives the performance of six different word embeddings for our HSLN-RNN model trained on the PubMed 20k dataset. According to Table 9, the training methods that create the word embeddings do not have a strong influence on model performance, but the corpus they are trained on does. The combination of Wikipedia and PubMed abstracts as the corpus for unsupervised word embedding training yields the best result, and the individual use of either the Wikipedia corpus or the PubMed abstracts performs much worse. Although the dataset we are using for evaluation is also from PubMed abstracts, using only the PubMed abstracts together with MIMIC notes without the Wikipedia corpus does not guarantee a better result (see the "FastText-P.M.+MIMIC" embeddings in Table 9), which may be because the corpus of PubMed abstracts plus MIMIC notes (about 12.8 million abstracts and 1 million notes) is not large enough for good embedding training compared with corpora of at least a billion tokens, such as Wikipedia.
Table 8: Examples of prediction errors of our HSLN-RNN model trained on the PubMed 20k dataset. Each sentence is followed by the PMID of the abstract that this sentence belongs to, which is enclosed in square brackets. The "Predicted" column indicates the label predicted by our model for a given sentence. The "Gold" column indicates the gold label of the sentence.
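The way per-label scores and misclassification rates of the kind discussed in this section are read off a confusion matrix can be sketched as follows. The two-label matrix below uses made-up counts, chosen only so that the rates equal the "one fifth" and "one fourth" fractions quoted in the text; it is not the actual Table 7, and the function names are hypothetical.

```python
def per_label_metrics(confusion, labels):
    """confusion[i][j]: number of sentences with gold label i predicted as label j."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        gold_total = sum(confusion[i])                  # row sum: gold instances
        pred_total = sum(row[i] for row in confusion)   # column sum: predictions
        precision = tp / pred_total if pred_total else 0.0
        recall = tp / gold_total if gold_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

def confusion_rate(confusion, labels, gold, predicted):
    """Fraction of sentences with the given gold label that received `predicted`."""
    i, j = labels.index(gold), labels.index(predicted)
    return confusion[i][j] / sum(confusion[i])

# Hypothetical counts for two labels only (NOT the values of Table 7).
labels = ["Background", "Objective"]
confusion = [[80, 20],
             [25, 75]]
metrics = per_label_metrics(confusion, labels)
bg_as_obj = confusion_rate(confusion, labels, "Background", "Objective")
obj_as_bg = confusion_rate(confusion, labels, "Objective", "Background")
```

Row-normalized rates such as `bg_as_obj` describe how often a gold label is missed (the complement of recall), whereas precision depends on the column sums, so a label can have high recall and still low precision when other labels are frequently mistaken for it.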

Conclusion
In this work, we have presented an ANN-based hierarchical sequential labeling network to classify sentences that appear sequentially in text. We demonstrate that incorporating the contextual information from surrounding sentences, by using an LSTM layer to sequentially process the encoded sentence representations, improves the overall quality of predictions. Our model outperforms the state-of-the-art results by 2%-3% on two datasets for sequential sentence classification in medical abstracts. We expect that our proposed model can be generalized to any problem that is related to sequential sentence classification, such as the paragraph-level sequential sentence categorization in full-text articles for better text mining and document retrieval (Westergaard et al., 2018).

Future Work
Although the whole PubMed database contains over 2 million abstracts, part of them accompanied by full-text articles, only a small fraction of them are structured and contain the label information utilized in this work. Inspired by the work of Howard and Ruder (2018), we plan to make use of the remaining unannotated abstracts or full texts to pre-train our model and then fine-tune it on the target annotated datasets so that the performance can be further boosted.