Neural Networks for Joint Sentence Classification in Medical Paper Abstracts

Existing models based on artificial neural networks (ANNs) for sentence classification often classify sentences individually, without incorporating the context in which the sentences appear. However, traditional sentence classification approaches have been shown to greatly benefit from jointly classifying subsequent sentences, for example with conditional random fields. In this work, we present an ANN architecture that combines the effectiveness of typical ANN models for classifying sentences in isolation with the strength of structured prediction. Our model outperforms the state of the art on two different datasets for sequential sentence classification in medical abstracts.


Introduction
Over 50 million scholarly articles have been published (Jinha, 2010), and the number of articles published every year keeps increasing (Druss and Marcus, 2005; Larsen and Von Ins, 2010). Approximately half of them are biomedical papers. While this repository of human knowledge abounds with useful information that may unlock new, promising research directions or provide conclusive evidence about phenomena, its sheer size makes it increasingly difficult to take advantage of all available information. Technology that helps users quickly locate the information of interest is therefore highly desirable.
When researchers search for previous literature, for example, they often skim through abstracts in order to quickly check whether the papers match their criteria of interest. This process is easier when abstracts are structured, i.e., when the text in an abstract is divided into semantic headings such as objective, method, result, and conclusion. However, a significant portion of published paper abstracts is unstructured, which makes it more difficult to quickly access the information of interest. Classifying each sentence of an abstract into an appropriate heading can therefore significantly reduce the time needed to locate the desired information.
We call this the sequential sentence classification task, in order to distinguish it from general text classification or sentence classification that does not have any context. Besides aiding humans, this task may also be useful for automatic text summarization, information extraction, and information retrieval.
In this paper, we present a system based on ANNs for the sequential sentence classification task. Our model makes use of both token and character embeddings for classifying sentences, and has a sequence optimization layer that is learned jointly with other components of the model. We evaluate our model on the NICTA-PIBOSO dataset as well as a new dataset we compiled based on the PubMed database.
Unlike traditional approaches that rely on manually engineered features, recent approaches to natural language processing (NLP) based on artificial neural networks (ANNs) are trained to automatically learn features based on word as well as character embeddings. Moreover, ANN-based models have achieved state-of-the-art results on various NLP tasks, including the most relevant task of text classification (Socher et al., 2013; Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015; Conneau et al., 2016; Xiao and Cho, 2016; dos Santos and Gatti, 2014). For text classification, many ANN models use word embeddings (Socher et al., 2013; Kim, 2014; Kalchbrenner et al., 2014; Gehrmann et al., 2017), and most recent works are based on character embeddings (Zhang et al., 2015; Conneau et al., 2016; Xiao and Cho, 2016). Approaches combining word and character embeddings have also been explored (dos Santos and Gatti, 2014).
However, most existing works using ANNs for short-text classification do not use any context. This is in contrast with sequential sentence classification, where each sentence in a text is classified taking into account its context, i.e., the surrounding sentences and possibly the whole text. One exception is a recent work on dialog act classification, where each utterance in a dialog is classified into its dialog act; however, only the preceding utterances were used, as the system was designed with real-time applications in mind.

Model
In the following, we denote scalars in italic lowercase (e.g., $k$, $b_f$), vectors in bold lowercase (e.g., $\mathbf{s}$, $\mathbf{x}_i$), and matrices in italic uppercase (e.g., $W_f$). We use the colon notations $x_{i:j}$ and $\mathbf{v}_{i:j}$ to denote the sequences of scalars $(x_i, x_{i+1}, \ldots, x_j)$ and vectors $(\mathbf{v}_i, \mathbf{v}_{i+1}, \ldots, \mathbf{v}_j)$, respectively.

ANN model
Our ANN model (Figure 1) consists of three components: a hybrid token embedding layer, a sentence label prediction layer, and a label sequence optimization layer.

Hybrid token embedding layer
The hybrid token embedding layer takes a token as input and outputs its vector representation, utilizing both token embeddings and character embeddings. Token embeddings are a direct mapping $V_T(\cdot)$ from token to vector, which can be pre-trained on large unlabeled datasets using programs such as word2vec (Mikolov et al., 2013b; Mikolov et al., 2013a; Mikolov et al., 2013c) or GloVe (Pennington et al., 2014). Character embeddings are defined analogously, as a direct mapping $V_C(\cdot)$ from character to vector.
Let $z_{1:\ell}$ be the sequence of $\ell$ characters that comprise a token $x$. Each character $z_i$ is first mapped to its embedding $\mathbf{c}_i = V_C(z_i)$, and the resulting sequence $\mathbf{c}_{1:\ell}$ is input to a bidirectional LSTM, which outputs the character-based token embedding $\mathbf{c}$.
The output $\mathbf{e}$ of the hybrid token embedding layer for the token $x$ is the concatenation of the character-based token embedding $\mathbf{c}$ and the token embedding $\mathbf{t} = V_T(x)$.
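As a concrete illustration, here is a minimal PyTorch sketch of this layer; the class name, dimensions, and framework choice are assumptions made for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HybridTokenEmbedding(nn.Module):
    """Maps a token to the concatenation of its character-based embedding
    (final states of a character bidirectional LSTM) and its token embedding."""
    def __init__(self, n_tokens, n_chars, token_dim=200, char_dim=25, char_lstm_dim=25):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, token_dim)  # V_T: token -> vector
        self.char_emb = nn.Embedding(n_chars, char_dim)     # V_C: character -> vector
        self.char_lstm = nn.LSTM(char_dim, char_lstm_dim,
                                 bidirectional=True, batch_first=True)

    def forward(self, token_id, char_ids):
        # char_ids: (1, number of characters in the token)
        c_seq = self.char_emb(char_ids)               # (1, len, char_dim)
        _, (h, _) = self.char_lstm(c_seq)             # h: (2, 1, char_lstm_dim)
        c = torch.cat([h[0], h[1]], dim=-1)           # character-based embedding c
        t = self.token_emb(token_id)                  # token embedding t
        return torch.cat([c, t], dim=-1)              # hybrid embedding e
```

For instance, `HybridTokenEmbedding(10000, 100)(torch.tensor([42]), torch.tensor([[3, 1, 4]]))` returns a single hybrid embedding of dimension 2 * 25 + 200 = 250.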

Sentence label prediction layer
Let $x_{1:m}$ be the sequence of tokens in a given sentence, and $\mathbf{e}_{1:m}$ the corresponding embeddings output by the hybrid token embedding layer. The sentence label prediction layer takes the sequence of vectors $\mathbf{e}_{1:m}$ as input and outputs a vector $\mathbf{a}$, where the $k$-th element of $\mathbf{a}$, denoted $\mathbf{a}[k]$, reflects the probability that the given sentence has label $k$.
To achieve this, the sequence $\mathbf{e}_{1:m}$ is first input to a bidirectional LSTM, which outputs the vector representation $\mathbf{s}$ of the given sentence. The vector $\mathbf{s}$ is subsequently input to a feedforward neural network with one hidden layer, which outputs the corresponding probability vector $\mathbf{a}$.
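A matching sketch of this layer, under the same illustrative assumptions (PyTorch, made-up dimensions):

```python
import torch
import torch.nn as nn

class SentenceLabelPrediction(nn.Module):
    """Pools the hybrid token embeddings e_1:m into a sentence vector s with a
    bidirectional LSTM, then maps s to a probability vector a over labels."""
    def __init__(self, emb_dim=250, lstm_dim=100, hidden_dim=100, n_labels=5):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, lstm_dim, bidirectional=True, batch_first=True)
        self.ffn = nn.Sequential(                    # one hidden layer
            nn.Linear(2 * lstm_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_labels),
        )

    def forward(self, e):                            # e: (1, m, emb_dim)
        _, (h, _) = self.lstm(e)                     # final fwd/bwd hidden states
        s = torch.cat([h[0], h[1]], dim=-1)          # sentence vector s
        return torch.softmax(self.ffn(s), dim=-1)    # probability vector a
```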

Label sequence optimization layer
The label sequence optimization layer takes the sequence of probability vectors $\mathbf{a}_{1:n}$ from the label prediction layer as input, and outputs a sequence of labels $y_{1:n}$, where $y_i$ is the label assigned to the $i$-th sentence. In order to model dependencies between subsequent labels, we incorporate a matrix $T$ that contains the transition probabilities between two subsequent labels; we define $T[i, j]$ as the probability that a sentence with label $i$ is followed by a sentence with label $j$. The score of a label sequence $y_{1:n}$ is defined as the sum of the probabilities of individual labels and the transition probabilities:

$$s(y_{1:n}) = \sum_{i=1}^{n} \mathbf{a}_i[y_i] + \sum_{i=2}^{n} T[y_{i-1}, y_i].$$

These scores can be turned into probabilities of the label sequences by taking a softmax over all possible label sequences:

$$p(\hat{y}_{1:n}) = \frac{e^{s(\hat{y}_{1:n})}}{\sum_{y_{1:n} \in \mathcal{Y}^n} e^{s(y_{1:n})}},$$

where $\mathcal{Y}$ is the set of all possible labels and $\mathcal{Y}^n$ the set of all label sequences of length $n$. During the training phase, the objective is to maximize the log probability of the gold label sequence. In the testing phase, given the sentences of an abstract, the corresponding sequence of predicted labels is chosen as the one that maximizes the score.
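To make the score and the test-time argmax concrete, here is a small NumPy sketch; `a` is an (n, |C|) array of per-sentence label probabilities and `T` the transition matrix, following the notation above. The Viterbi-style decoder is the standard technique the argmax implies, not code from the paper.

```python
import numpy as np

def sequence_score(a, T, y):
    """s(y_1:n) = sum_i a_i[y_i] + sum_{i>=2} T[y_{i-1}, y_i]."""
    score = a[0, y[0]]
    for i in range(1, len(y)):
        score += a[i, y[i]] + T[y[i - 1], y[i]]
    return score

def decode(a, T):
    """Return the label sequence maximizing s(y_1:n) (Viterbi decoding)."""
    n, C = a.shape
    best = a[0].copy()                   # best score of prefixes ending in each label
    back = np.zeros((n, C), dtype=int)   # backpointers
    for i in range(1, n):
        cand = best[:, None] + T + a[i][None, :]  # cand[p, c]: best path via prev label p
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    y = [int(best.argmax())]
    for i in range(n - 1, 0, -1):        # follow backpointers from the end
        y.append(int(back[i, y[-1]]))
    return y[::-1]
```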
Computing the denominator $\sum_{y_{1:n} \in \mathcal{Y}^n} e^{s(y_{1:n})}$ can be done in $O(n|C|^2)$ time using dynamic programming, where $|C|$ denotes the number of classes, as demonstrated below. Let $A_{n, y_n}$ be the log of the sum of the exponentiated scores of all sequences of length $n$ whose last label is $y_n$. It satisfies the recurrence

$$A_{n, y_n} = \mathbf{a}_n[y_n] + \log \sum_{y_{n-1} \in \mathcal{Y}} e^{A_{n-1, y_{n-1}} + T[y_{n-1}, y_n]}.$$

Since $A_{n, y_n}$ can be computed in $\Theta(|C|)$ time given $\{A_{n-1, y_{n-1}} \mid y_{n-1} \in \mathcal{Y}\}$, computing $\{A_{n, y_n} \mid y_n \in \mathcal{Y}\}$ takes $\Theta(|C|^2)$ time per step. Consequently, computing the full table, and hence the denominator, takes $O(n|C|^2)$ time.
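The recursion above translates directly into a few lines of code; this sketch computes the log of the denominator with the standard forward algorithm, using log-sum-exp for numerical stability and the same illustrative names as before.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(a, T):
    """Log of the sum over all label sequences of e^{s(y_1:n)}, in O(n|C|^2) time.
    a: (n, |C|) per-sentence label scores; T: (|C|, |C|) transition scores."""
    A = a[0].copy()                                  # A_{1, y_1} = a_1[y_1]
    for i in range(1, len(a)):
        # A_{i, y_i} = a_i[y_i] + log sum_{y_{i-1}} e^{A_{i-1, y_{i-1}} + T[y_{i-1}, y_i]}
        A = a[i] + logsumexp(A[:, None] + T, axis=0)
    return logsumexp(A)                              # sum out the last label
```

The training loss is then the negative of `sequence_score(a, T, gold) - log_partition(a, T)`, i.e., the negative log probability of the gold label sequence.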

Datasets
We evaluate our model on the sequential sentence classification task using two medical abstract datasets, PubMed 20k RCT and NICTA-PIBOSO, where each sentence of an abstract is annotated with one label.

Training
The model is trained using stochastic gradient descent (SGD), updating all parameters at each gradient step, i.e., token embeddings, character embeddings, the parameters of the bidirectional LSTMs, and the transition probabilities. For regularization, dropout is applied to the character-enhanced token embeddings before the label prediction layer. We selected the hyperparameters manually, although hyperparameter optimization techniques could also be used (Bergstra et al., 2011).
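As a minimal runnable sketch of this regime, with a toy module standing in for the full model and hyperparameters that are assumptions for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for the full model; dropout regularizes the (here simulated)
# character-enhanced token embeddings before the prediction layers.
model = nn.Sequential(nn.Dropout(p=0.5),
                      nn.Linear(250, 100), nn.Tanh(),
                      nn.Linear(100, 5))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()     # stands in for the CRF-style sequence loss

e = torch.randn(8, 250)             # simulated hybrid token embeddings
gold = torch.randint(0, 5, (8,))    # simulated gold labels

model.train()
optimizer.zero_grad()
loss = loss_fn(model(e), gold)      # maximizing log-probability = minimizing NLL
loss.backward()                     # gradients flow to all parameters jointly
optimizer.step()                    # one SGD step updates everything at once
```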

Results and Discussion
Table 2 compares our model against several baselines as well as the best performing model (Lui, 2012) in the ALTA 2012 Shared Task, in which 8 competing research teams participated to build the most accurate classifier for the NICTA-PIBOSO corpus. The first baseline (LR) is a classifier based on logistic regression using n-gram features extracted from the current sentence: it does not use any information from the surrounding sentences. The second baseline (Forward ANN) uses the model from the dialog act classification work discussed in the introduction: it computes sentence embeddings for each sentence, then classifies the current sentence given a few preceding sentence embeddings as well as the current sentence embedding. The third baseline (CRF) is a CRF that uses n-grams as features: each output variable of the CRF corresponds to a label for a sentence, and the sequence the CRF considers is the entire abstract. The CRF baseline therefore uses both preceding and succeeding sentences when classifying the current sentence. Lastly, (Lui, 2012), the best performing system on NICTA-PIBOSO published in the literature, developed an approach called feature stacking, a meta-learner that combines multiple feature sets.
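For reference, an LR baseline of this kind can be sketched in a few lines with scikit-learn; the feature settings and example sentences below are assumptions, not the exact baseline configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training sentences and labels, for illustration only.
sentences = ["To assess the efficacy of treatment X in patients with Y.",
             "Patients were randomly assigned to two groups."]
labels = ["OBJECTIVE", "METHODS"]

lr_baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram and bigram features
    LogisticRegression(max_iter=1000),
)
# Each sentence is classified in isolation: no surrounding-sentence context.
lr_baseline.fit(sentences, labels)
print(lr_baseline.predict(["Blood pressure was measured at 6 months."]))
```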

Model             PubMed 20k   NICTA
Full model           89.9       82.7
- character emb      89.7       82.7
- pre-train          88.7       78.0
- token emb          88.9       77.0
- seq opt            85.0       72.8

Table 3: Ablation analysis (F1-scores). "- character emb" is our model using only token embeddings, without character-based token embeddings. "- pre-train" is our model where token embeddings are initialized with random values instead of pre-trained embeddings. "- token emb" is our model using only character-based token embeddings, without token embeddings. "- seq opt" is our model without the label sequence optimization layer.

[Figure 2: Transition matrix learned on PubMed 20k RCT. The rows represent the label of the previous sentence, the columns represent the label of the current sentence.]
The LR system performs honorably on PubMed 20k RCT (F1-score: 83.1), but quite poorly on NICTA-PIBOSO (F1-score: 71.6): this suggests that using the surrounding sentences may be more important in NICTA-PIBOSO than in PubMed 20k RCT. The Forward ANN system performs better than the LR system, and worse than the CRF: this is expected, as the Forward ANN system only uses information from the preceding sentences but does not use any information from the succeeding sentences, unlike the CRF.
Our model performs better than the CRF system and the (Lui, 2012) system. We hypothesize that the following four factors give an edge to our model.

No human-engineered features: Unlike most other systems, our model does not rely on any human-engineered features.

No n-grams: While other systems rely heavily on n-grams, our model maps each token to a token embedding and feeds it as input to an RNN. This helps combat data scarcity: for example, "chronic tendonitis" and "chronic tendinitis" are two different bigrams, but share the same meaning, and their token embeddings should therefore be very similar.

Structured prediction: The labels for all sentences in an abstract are predicted jointly, which improves the coherency between the predicted labels in a given abstract. The ablation analysis presented in Table 3 shows that the sequence optimization layer is the most important component of the ANN model.

Joint learning: Our model learns the features and token embeddings jointly with the sequence optimization.

The sequence information is mostly contained in the transition matrix. Figure 2 presents an example of the transition matrix after the model has been trained on PubMed 20k RCT. We can see that it effectively reflects transitions between different labels. For example, it learned that the first sentence of an abstract is most likely to be discussing either the objective (0.23) or the background (0.26). By the same token, a sentence pertaining to the methods is typically followed by a sentence pertaining to the methods (0.25) or the results (0.17).

[Table 4: Examples of prediction errors of our model on PubMed 20k RCT. The "predicted" column indicates the label predicted by our model for a given sentence; the "actual" column indicates the gold label of the sentence. Our model takes into account all the sentences present in the abstract in which the classified sentence appears.]
Tables 5 and 6 detail the results of our model for each label in PubMed 20k RCT. The main difficulty the classifier has is distinguishing background sentences from objective sentences. In particular, a third of the objective sentences are incorrectly classified as background, which causes the recall for objective and the precision for background to be low. The classifier also has some difficulty distinguishing method sentences from result sentences. Table 4 presents a few examples of prediction errors. Our error analysis suggests that a fair number of sentence labels are debatable. For example, the sentence "We conducted a randomized study comparing strategies X and Y." belongs to the background according to the gold label, but most humans would classify it as an objective.

Conclusions
In this article we have presented an ANN architecture to classify sentences that appear in sequence. We demonstrated that jointly predicting the classes of all sentences in a given text improves the quality of the predictions. Our model outperforms the state of the art on two datasets for sequential sentence classification in medical abstracts.