PICO Element Detection in Medical Text via Long Short-Term Memory Neural Networks

Successful evidence-based medicine (EBM) applications rely on answering clinical questions by analyzing large medical literature databases. In order to formulate a well-defined, focused clinical question, a framework called PICO is widely used, which identifies the sentences in a given medical text that belong to the four components: Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O). In this work, we present a Long Short-Term Memory (LSTM) neural network based model to automatically detect PICO elements. By jointly classifying subsequent sentences in the given text, we achieve state-of-the-art results on PICO element classification compared to several strong baseline models. We also make our curated data public as a benchmarking dataset so that the community can benefit from it.


Introduction
The paradigm of evidence-based medicine (EBM) involves the incorporation of current best evidence, such as the reports of randomized controlled trials (RCTs), into decision making for patient care (Sackett, 1997). Such evidence, integrated with the physician's own expertise and patient-specific factors, can lead to better patient outcomes and higher quality health care (Sackett et al., 1996). In practice, successful EBM applications rely on answering clinical questions via analysis of large medical literature databases such as PubMed. And most often, a PICO framework is used to formulate a well-defined, focused clinical question, which decomposes the question into four parts: Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O) (Richardson et al., 1995).
Typically, the analyses that underlie EBM begin by selecting a set of potentially relevant papers, which are then further refined by human judgment to form the evidence base on which the answer to a specific question depends. To facilitate this selection process, it would be advantageous if all papers (or at least their abstracts) were organized according to the PICO foci. Unfortunately, a significant portion of the medical literature contains either unstructured or sub-optimally structured abstracts, without specifically identified PICO elements. We therefore introduce a method that automates the identification of PICO elements in medical abstracts, in order to enable the automated selection of potentially relevant articles for a proposed study.
In this paper, we present a system based on artificial neural networks (ANN) that treats the extraction of PICO elements from medical abstracts as a classification task at the sentence level. Our key contributions are as follows:
1. Previous methods for PICO element extraction focused on shallow models such as Naive Bayes (NB), Support Vector Machines (SVM) and Conditional Random Fields (CRF), which are limited in modeling capacity. To significantly boost performance, we propose a Long Short-Term Memory (LSTM) based ANN model for this task.
2. Most previous systems detected the PICO elements one by one; thus several classifiers needed to be built and trained separately, which is sub-optimal in efficiency. That approach also cannot take advantage of shared structure among the individual classifiers. In this work we extract PICO components simultaneously from any given medical abstract.
3. In all previous works, the only dataset that was used for both training and testing and was also made public is that of Kim et al. (2011). However, this dataset contains only 1000 abstracts, which is not enough for an ANN-based deep learning model to generalize well. Therefore, we curate a dataset comprising tens of thousands of abstracts and make it public as a benchmark dataset.
4. Instead of treating PICO detection as a single-sentence classification problem, as is usually done, we view it as a sequential sentence classification task, in which the labels of all sentences in an abstract are predicted jointly. In this way, information from the context sentences can be used to help predict the current sentence, which improves the classification accuracy considerably. Leveraging this strategy, we obtain state-of-the-art PICO element extraction accuracy, significantly outperforming all previous methods.

Related Work
Many previous user studies have validated that the use of the PICO framework or similar schemas by clinicians improves literature search for clinical questions (Schardt et al., 2007; Boudin et al., 2010c; Znaidi et al., 2015). This has greatly fueled academic interest in the development of systems for automatic PICO element detection. Over the last decade, the research progress for this task can be summarized according to three aspects: models for classification, dataset generation, and task formulation. Many well-known machine learning techniques have been proposed to build stronger models for this task, including Naive Bayes (NB) (Huang et al., 2013; Boudin et al., 2010a; Demner-Fushman and Lin, 2007), Random Forest (RF) (Boudin et al., 2010a), Support Vector Machine (SVM) (Boudin et al., 2010a; Hansen et al., 2008), Conditional Random Field (CRF) (Kim et al., 2011; Chung, 2009; Chung and Coiera, 2007) and Multi-Layer Perceptron (MLP) (Boudin et al., 2010a; Huang et al., 2011). In addition, Boudin et al. (2010b) proposed a location-based weighting strategy as an extension to the language modeling approach, inspired by the special distribution pattern of PICO elements in medical abstracts. All these models rely heavily on careful selection of hand-engineered features, including lexical features such as bag of words (BOW), stemmed words and cue words/verbs, and semantic features such as synonyms and hypernyms provided by ontologies (e.g., WordNet). As an important complement to this line of work, Dernoncourt et al. (2016) recently proposed a model based on emerging deep ANN architectures such as LSTM, both to further boost performance and to remove the need for hand-crafted features. However, that work did not address PICO element detection.
To generate datasets for both training and testing, earlier works mainly relied on manual annotation, which resulted in small corpora on the order of hundreds of abstracts (Demner-Fushman and Lin, 2007; Dawes et al., 2007; Chung, 2009; Kim et al., 2011). Later works made use of the structural information embedded in abstracts whose authors clearly stated distinctive sentence headings (Boudin et al., 2010a; Huang et al., 2011, 2013). Specifically, some abstracts contain explicit headings such as "PATIENTS", "SAMPLE" or "OUTCOMES", which can be used to locate sentences corresponding to PICO elements. In this way, tens of thousands of PubMed abstracts that contain PICO elements can be automatically compiled into a well-annotated dataset, which increases the dataset size by two orders of magnitude.
In terms of task formulation, most previous works focused on categorizing one PICO class at a time using an individual classifier (Boudin et al., 2010a; Huang et al., 2013). Therefore, in order to detect all four PICO components, one would need to build and train four individual models, which is inefficient. Furthermore, it is hard to resolve label conflicts when different models make different predictions for the same sentence. These limitations were resolved by working directly on the labels of interest for EBM, allowing multi-label classification instead of binary classification and allowing sentences that are unrelated to the labels of interest to be assigned to an "Other" category (Kim et al., 2011; Demner-Fushman and Lin, 2007). This is a more realistic setting and ought to provide better insight into the performance we should expect for this kind of task.

The Proposed Model
First we introduce our notation. We denote scalars in italic lowercase (e.g., $k$), vectors in bold lowercase (e.g., $\mathbf{s}$) and matrices in italic uppercase (e.g., $W$). The colon notations $x_{i:j}$ and $\mathbf{s}_{i:j}$ denote the sequence of scalars $(x_i, x_{i+1}, \ldots, x_j)$ and the sequence of vectors $(\mathbf{s}_i, \mathbf{s}_{i+1}, \ldots, \mathbf{s}_j)$, respectively.
Our model is composed of three components: the token embedding layer, the sentence-level label inference layer, and the label sequence optimization layer (Figure 1). They are discussed in detail in the following sections.
Figure 1: Model architecture. w: original token; e: token embedding; h: bi-LSTM hidden state; s: sentence representation vector; r: sentence label probability vector; y: predicted sentence label. Replacing the bi-LSTM with a convolutional neural network (CNN) did not improve the results; we therefore used the bi-LSTM.

Token Embedding Layer
This layer takes as input a sentence $w$ comprising $N$ words, $w = [w_1, w_2, \ldots, w_N]$, and outputs the corresponding vector representation of each token. Token representations are encoded as column vectors of the embedding matrix $W^{word} \in \mathbb{R}^{d_w \times |V|}$, where $d_w$ is the dimension of the word vectors and $V$ is the vocabulary of the dataset. Each column $W^{word}_i \in \mathbb{R}^{d_w}$ is the word embedding vector of the $i$-th word in the vocabulary. To transform a word $w$ into its corresponding embedding vector $\mathbf{e}_w$, we use the following equation:
$$\mathbf{e}_w = W^{word} \mathbf{v}_w,$$
where $\mathbf{v}_w$ is the one-hot vector of word $w$, of dimension $|V|$, which has 1 at the index of $w$ and zero in all other positions. The word embeddings $W^{word}$ can be pre-trained on large unlabeled datasets using unsupervised algorithms such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016).
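The lookup described above can be illustrated with a small sketch. The vocabulary and matrix values below are made up for illustration and are not from the paper. The sketch only shows that multiplying the embedding matrix by a one-hot vector selects a single column.

```python
# Toy embedding matrix with d_w = 3 and |V| = 4 (illustrative values only).
vocab = ["patients", "received", "placebo", "daily"]  # hypothetical vocabulary
W_word = [
    [0.1, 0.4, 0.7, 1.0],
    [0.2, 0.5, 0.8, 1.1],
    [0.3, 0.6, 0.9, 1.2],
]  # shape d_w x |V|; column i is the embedding of vocab[i]


def one_hot(word):
    """Return v_w: 1 at the word's vocabulary index, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


def embed(word):
    """Compute e_w = W_word v_w as an explicit matrix-vector product."""
    v = one_hot(word)
    return [sum(row[j] * v[j] for j in range(len(vocab))) for row in W_word]


def embed_lookup(word):
    """Equivalent shortcut: read the column of W_word directly."""
    i = vocab.index(word)
    return [row[i] for row in W_word]
```

In practice the product is never materialized; frameworks implement the same operation as an index lookup, which is what `embed_lookup` mimics.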

Sentence-level Label Inference Layer
This layer takes as input the embedding vectors $\mathbf{e}$ of the tokens in a sentence from the token embedding layer and produces a vector $\mathbf{r} \in \mathbb{R}^{l}$ that represents the probability that the sentence belongs to each label, where $l$ is the number of labels. To this end, the sequence of embedding vectors is first input into a bi-directional LSTM (bi-LSTM), which outputs a sequence of hidden states $\mathbf{h}_{1:N}$ ($\mathbf{h} \in \mathbb{R}^{d_h}$) for a sentence of $N$ words, with each hidden state corresponding to one token. To form the final representation vector $\mathbf{s}$ of the sentence, attentive pooling is used (Yang et al., 2016). Here $\mathbf{u}_s \in \mathbb{R}^{d_s}$ is the token-level context vector used to measure the relevance or importance of each token with respect to the whole sentence, and $W_s \in \mathbb{R}^{d_s \times d_h}$ is the transformation matrix for soft alignment.
The obtained vector s is subsequently input to a feed-forward neural network with only one hidden layer, which outputs the corresponding probability vector r.
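As a rough illustration of this layer, the sketch below implements attentive pooling in the style of Yang et al. (2016), followed by a one-hidden-layer classifier. It is not the authors' exact formulation. The bias `b_s`, the tanh projection, the ReLU hidden activation and the softmax output are assumptions, and all function and parameter names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def attentive_pool(H, W_s, b_s, u_s):
    """Pool N hidden states (each of length d_h) into one sentence vector s.

    Each state is projected with W_s (d_s x d_h), compared with the context
    vector u_s (length d_s), and the resulting weights average the states.
    """
    scores = []
    for h in H:
        u_t = [math.tanh(x + b) for x, b in zip(matvec(W_s, h), b_s)]
        scores.append(sum(a * b for a, b in zip(u_t, u_s)))
    alphas = softmax(scores)
    d_h = len(H[0])
    s = [sum(alpha * h[k] for alpha, h in zip(alphas, H)) for k in range(d_h)]
    return s, alphas


def label_probs(s, W1, b1, W2, b2):
    """One-hidden-layer feed-forward network mapping s to label probabilities r."""
    hidden = [max(0.0, x + b) for x, b in zip(matvec(W1, s), b1)]  # ReLU assumed
    logits = [x + b for x, b in zip(matvec(W2, hidden), b2)]
    return softmax(logits)  # softmax output assumed
```

With a zero context vector all tokens receive equal weight, so the pooled vector reduces to the mean of the hidden states. A trained `u_s` shifts weight toward the tokens that matter most for the label.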

Label Sequence Optimization Layer
Each medical abstract consists of several sentences whose categories follow certain patterns; for example, a "Results" sentence is typically followed by a "Conclusion" sentence. Such patterns can be exploited to improve classification performance by means of a conditional random field (CRF). Given the sequence of probability vectors $\mathbf{r}_{1:n}$ from the preceding label inference layer for an abstract of $n$ sentences, this layer outputs a sequence of labels $y_{1:n}$, where $y_i$ is the predicted label assigned to the $i$-th sentence.
In order to model dependencies between subsequent labels, we incorporate a matrix $T$ that contains the transition probabilities between two subsequent labels; we define $T[i, j]$ as the probability that a sentence with label $i$ is followed by a sentence with label $j$. The score of a label sequence $y_{1:n}$ is defined as the sum of the probabilities of the individual labels and the transition probabilities:
$$s(y_{1:n}) = \sum_{i=1}^{n} \mathbf{r}_i[y_i] + \sum_{i=2}^{n} T[y_{i-1}, y_i].$$
This score can be transformed into the probability of a label sequence by applying a softmax over all possible label sequences:
$$p(y_{1:n}) = \frac{e^{s(y_{1:n})}}{\sum_{\hat{y}_{1:n} \in Y} e^{s(\hat{y}_{1:n})}},$$
where $Y$ denotes the set of all possible label sequences. During the training phase, the objective is to maximize the probability of the gold label sequence. In the testing phase, given an input sequence, the predicted label sequence is the one that maximizes the score, which is found with the Viterbi algorithm (Forney, 1973).
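The scoring, normalization and decoding steps of this layer can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the entries of `R` (one row per sentence) and `T` are treated as unnormalized scores, and that the normalizer is computed with the standard forward algorithm rather than by enumerating every label sequence. All names are hypothetical.

```python
import math


def logsumexp(xs):
    """Stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(R, T, y):
    """s(y) = sum_i R[i][y_i] + sum_{i>=2} T[y_{i-1}][y_i]."""
    s = sum(R[i][y[i]] for i in range(len(y)))
    s += sum(T[y[i - 1]][y[i]] for i in range(1, len(y)))
    return s


def log_partition(R, T):
    """Forward algorithm: log of the sum of exp(score) over all label sequences."""
    n_labels = len(T)
    alpha = list(R[0])
    for r in R[1:]:
        alpha = [
            r[j] + logsumexp([alpha[i] + T[i][j] for i in range(n_labels)])
            for j in range(n_labels)
        ]
    return logsumexp(alpha)


def sequence_probability(R, T, y):
    """p(y) = exp(s(y)) / sum over all sequences of exp(s(y'))."""
    return math.exp(sequence_score(R, T, y) - log_partition(R, T))


def viterbi(R, T):
    """Return the label sequence with the highest score."""
    n_labels = len(T)
    best = list(R[0])  # best score of any path ending in each label
    backpointers = []
    for r in R[1:]:
        ptr, new_best = [], []
        for j in range(n_labels):
            i_best = max(range(n_labels), key=lambda i: best[i] + T[i][j])
            ptr.append(i_best)
            new_best.append(best[i_best] + T[i_best][j] + r[j])
        backpointers.append(ptr)
        best = new_best
    y = [max(range(n_labels), key=lambda j: best[j])]
    for ptr in reversed(backpointers):
        y.append(ptr[y[-1]])
    return y[::-1]
```

Training would minimize `log_partition(R, T) - sequence_score(R, T, gold)`, which is the negative log-probability of the gold sequence. Both the forward pass and Viterbi decoding cost on the order of n·l² operations for n sentences and l labels, instead of the lⁿ needed for explicit enumeration.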

Dataset Preparation
The dataset used in this study is curated from MEDLINE, a freely accessible database of medical articles. Specifically, we extracted 489,026 abstracts from PubMed by applying the following search limits: 1. Text Availability: Abstract; 2. Languages: English; 3. Publication Types: Randomized Controlled Trial (search conducted on 2017/08/28). Among them, abstracts with structured section headings were selected for automatic annotation of sentence categories. Although the P, I and O headings were our detection targets, we also annotated the other types of sentences with one of the AIM (A), METHOD (M), RESULTS (R) and CONCLUSION (C) labels to facilitate the use of our CRF label sequence optimization method. Note that, although we have 7 labels in total, we only care about the detection accuracy of the P, I and O labels and thus mainly discuss their performance in the following sections. In this study, the Comparison component was incorporated into the I category, since the "COMPARISON" section also refers to a kind of intervention in an RCT. In fact, very few abstracts with comparison labels are found in PubMed.
We mapped each section heading to one of the 7 labels based on whether it contains key words that belong to that label, as shown in Table 1 (section headings are only used to generate gold labels and are not used for model training or inference). In very rare cases, the section heading of a sentence may contain key words of more than one category, in which case the sentence is assigned multiple labels according to Table 1. Table 2 presents a typical abstract with its section headings annotated with the 7 labels. A total of 24,668 abstracts contain at least one of the P/I/O labels. There are 21,198 abstracts with P-labels, 13,712 with I-labels and 20,473 with O-labels (Table 3). Note that the abstracts in PubMed follow a diversity of rhetorical structures and only a small fraction of them contain PICO elements based on their section headings.
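The heading-based annotation can be sketched as a simple keyword match. The keyword lists below are hypothetical placeholders and do not reproduce the paper's Table 1. Only the mechanism follows the description above: substring matching on the heading, with multiple labels when several categories match.

```python
# Hypothetical keyword lists for illustration; NOT the paper's Table 1.
# "C" is the CONCLUSION label; comparison headings are folded into "I".
KEYWORDS = {
    "A": ["aim", "objective", "background", "purpose"],
    "P": ["patient", "participant", "subject", "sample", "population"],
    "I": ["intervention", "comparison", "treatment"],
    "O": ["outcome", "measure"],
    "M": ["method", "design", "setting"],
    "R": ["result", "finding"],
    "C": ["conclusion", "interpretation"],
}


def labels_for_heading(heading):
    """Return every label whose keywords occur in the section heading."""
    h = heading.lower()
    return sorted(
        label for label, kws in KEYWORDS.items() if any(k in h for k in kws)
    )
```

Every sentence under a heading inherits that heading's labels. A heading that matches no keyword list returns an empty list, and how such sentences are handled is left open in this sketch.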

Training Settings
Ten-fold cross-validation was employed to assess the results statistically, where abstracts were randomly split into 10 equal partitions. Nine of them were used for training and the remaining one for testing. This step repeats for ten rounds. For each round of training, 10% of the training set was randomly extracted as the development set for early stopping, that is, the test set was evaluated at the highest development set performance, which is measured by the average F1 score of all three P/I/O labels.
The token embeddings were pre-trained on a large corpus combining Wikipedia, PubMed and [...]

Tables 5 and 6 detail the classification results for each label in terms of performance scores (precision, recall and F1) and the confusion matrix, respectively (for one fold). It can be seen that the classifier is very good at predicting the AIM, RESULTS and CONCLUSION labels but has difficulty distinguishing among the PARTICIPANTS, INTERVENTION, OUTCOME and METHOD labels. Indeed, the PARTICIPANTS, INTERVENTION and OUTCOME sections can be regarded as more specific aspects of the METHOD description, so it is naturally more difficult to tell the P/I/O elements apart from the METHOD section. Since our main goal is to accurately extract the P/I/O components from a given abstract, we only discuss their performance in the following.

Table 5: Results in terms of precision (p), recall (r) and F-measure (F1) on the test set for each class obtained by our model for one of the ten folds. Columns: Category, p (%), r (%), F1 (%), Support.
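For reference, per-class precision, recall, F1 and support can be derived from a confusion matrix as in the sketch below. The orientation (rows are gold labels, columns are predicted labels) is an assumption, since the paper's Table 6 is not reproduced here, and the function name is hypothetical.

```python
def per_class_scores(confusion, labels):
    """Compute precision, recall, F1 and support for each label.

    confusion[i][j] is the number of sentences whose gold label is labels[i]
    and whose predicted label is labels[j] (assumed orientation).
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        support = sum(confusion[k])                   # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f1": f1, "support": support}
    return scores
```

The development-set criterion used for early stopping would then be the mean of the `f1` entries for the P, I and O labels.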
Table 7 compares our model against several previously widely-used baseline models. Since there is no benchmarking dataset, we cannot compare with published best models (this is one of the reasons why we want to publish this dataset).
The first baseline is a logistic regression (LR) classifier that uses n-grams as features. In this scenario, each sentence is predicted individually, without considering context information from the surrounding sentences. Likewise, the second baseline, MLP, first computes the vector representation of each sentence by max pooling over the embeddings of all tokens in the sentence, then classifies the current sentence via a neural network with three hidden layers (with hidden layer dimensions of 400, 400 and 200, respectively). The third baseline, in contrast, is a CRF model that also uses n-grams as features (only the first 100 tokens of each sentence were used, since most sentences are shorter than 100 tokens) and outputs the most probable label sequence for the whole abstract. The CRF baseline therefore takes into account both preceding and succeeding sentences when classifying the current sentence.
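The max pooling step of the MLP baseline can be illustrated in a few lines. The function name and toy vectors are hypothetical. The operation takes the element-wise maximum over all token embeddings of a sentence.

```python
def max_pool(token_embeddings):
    """Element-wise max over a sentence's token embeddings (each of length d_w)."""
    return [max(dim) for dim in zip(*token_embeddings)]
```

The result has the same dimension as a single token embedding regardless of sentence length, which is what lets a fixed-size MLP consume it. Token order is discarded, unlike in the bi-LSTM encoder of the proposed model.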
As shown in Table 7, the LR baseline performs worst, which is reasonable considering that it is a very shallow model and only uses local sentence information. By comparison, the MLP model also considers only features from the current sentence but performs better than LR because its modeling capacity is much larger. By incorporating the surrounding sentences, the CRF baseline performs even better than the MLP system, which confirms that context information is useful in sequential classification problems.
Table 7: Performance in terms of precision (p), recall (r) and F-measure (F1) on the test set for several baselines and our proposed model (average over 10-fold cross-validation). Since the dataset used here was introduced in this work, there is no previously published method for reference.
Last but most importantly, our proposed model performs much better than all the baselines for all three P/I/O labels. The advantages of our model and the reasons for its improved performance are summarized below:
No human-engineered features Our model does not rely on any hand-engineered features, which require much domain experience and are quite difficult to craft.
No n-gram features Unlike many other systems that rely heavily on n-grams, our model simply uses the token embedding vector to represent each token and feeds it into the recurrent neural network (RNN) model for inference. In this way, embeddings pre-trained on large corpora can encode the syntactic and semantic information of words for better language understanding. This also helps combat the word scarcity problem. For example, the alternatively spelled tokens "tendonitis" and "tendinitis" are two different unigrams, but they have the same meaning, and this similarity is reflected in their closely aligned embedding vectors.
Joint prediction Instead of predicting each sentence one by one, our model classifies all sentences in one abstract jointly, which improves the overall classification performance by imposing coherence constraints between subsequent predicted labels. This improvement is clearly evidenced by Table 8.
Sequence modeling An RNN model is good at modeling sequences such as sentences because it considers the dependencies between tokens, which cannot be accounted for by context-free models such as those using bag-of-words features. In addition, the long-term memory of the LSTM enables the model to cope with long sentences.
Table 8: Ablation analysis. F1-scores from 10-fold cross-validation are reported. "-sequence optimization" is our model without the label sequence optimization layer.
Figure 2 presents an example of the transition matrix after the model has been trained, which encodes the transition probability between two subsequent labels. It effectively reflects which label is most likely to follow the current one. For example, a sentence pertaining to RESULTS is typically followed by a sentence pertaining to CONCLUSION (1.16), which makes sense. From this transition matrix, we can derive the most probable label sequence: A → M → P → I → O → R → C, which is also consistent with our observations.
Table 9 presents a few examples of prediction errors that are related to P/I/O labels. This error analysis suggests that part of the model error comes from the ambiguity between some label pairs, such as O and M, O and R, and I and M. For example, the sentence "Plasma volume and total body haemoglobin were determined at rest." can be regarded as a METHOD description in a general sense, but it can also be further specified as an OUTCOME. On the other hand, a fair number of gold labels are indeed debatable. For instance, the sentence "Iron supplementation was given to one group as a substitution remedy, another group was given iron and folic acid and the third group was without supplementation during the collection period." belongs to the PARTICIPANT label according to the gold standard, but it would make more sense to classify it as an INTERVENTION.

Conclusion
In this work we have presented an LSTM-based ANN architecture to detect the PICO elements in medical RCT abstracts. We demonstrated that using a more advanced LSTM model and jointly predicting the classes of all sentences in a given text improve the overall classification performance for PICO components. By publishing our curated dataset for benchmarking, we hope to encourage competing approaches and the development of more effective and efficient methods in the future.

Table 9: Examples of prediction errors of our model that are related to P/I/O labels. The "Predicted" column indicates the label predicted by our model for a given sentence. The "Gold" column indicates the gold label of the sentence.

Sentence | Predicted | Gold
The study included 16 patients who were randomized into one of three 6-month treatment protocols. | P | M
Referral service doing n-of-1 trials at the requests of community and academic physicians. | I | M
Iron supplementation was given to one group as a substitution remedy, another group was given iron and folic acid and the third group was without supplementation during the collection period. | I | P
Plasma urea and creatinine concentrations and angiotensin converting enzyme activity were measured at the start of the study and the end of each treatment period. | O | R
Heart rate was recorded continuously throughout the maneuvre, while blood was sampled for catecholamine determinations prior to the start of straining and again approximately 10 s following the end of straining. | O | I
Plasma volume and total body haemoglobin were determined at rest. | O | M