Patient Risk Assessment and Warning Symptom Detection Using Deep Attention-Based Neural Networks

We present an operational component of a real-world patient triage system. Given a specific patient presentation, the system assesses the level of medical urgency and issues the most appropriate recommendation in terms of best point of care and time to treat. We use an attention-based convolutional neural network architecture trained on 600,000 doctor notes in German. We compare two approaches, one that uses the full text of the medical notes and one that uses only a selected list of medical entities extracted from the text. These approaches achieve 79% and 66% precision, respectively; at a confidence threshold of 0.6, precision increases to 85% and 75%, respectively. In addition, a method to detect warning symptoms is implemented to render the classification task transparent from a medical perspective. The method is based on the learning of attention scores and a method of automatic validation using the same data.


Introduction
Several intelligent triage systems have recently been developed that attempt to evaluate automatically the risk related to specific patient conditions and direct patients to the appropriate care provider (Semigran et al., 2015). The work presented here is part of an interactive triage system being developed for industrial applications. The system takes patient demographics and symptoms as input, assesses their current medical conditions and suggests where and by when the patients should seek medical care. A key feature of the system is the detection of warning symptoms, namely, red flags. This is crucial to distinguish potential emergencies from common or less urgent cases and therefore provides the medical rationale behind a given recommendation. In addition, for triage systems that involve a dialogue with patients through multiple question-and-answer interactions (such as Ada (2018)), warning symptom detection is fundamental to determine the most informative questions to ask patients.
We propose a model that assesses patient risk and detects warning symptoms based on a large volume of doctor notes in German, sometimes even mixed with Swiss German expressions. In this context, assessing patient risk can be regarded as a supervised text classification task, where the content of the medical records represents the feature space, and the recommendations assigned by medical professionals are the ground truth labels. The use of recurrent neural networks (RNN) has been proposed to solve text classification tasks (Tang et al., 2015). However, the proposed RNN models must be modified to be consistent with the requirement that warning symptoms must be detected, because in RNNs it is generally not possible to know which hidden states are most relevant.
To address these challenges, we propose an integrated approach to assess patient risk and detect warning symptoms simultaneously using an attention-based convolutional neural network (ACNN), which is a combination of a convolutional neural network (CNN) and an attention mechanism (Kim, 2014; Yang et al., 2016; Du et al., 2017). To the best of our knowledge, such an integrated approach is applied for the first time to the medical domain.
The main contributions of this paper are twofold. First, we propose a neural network architecture that can be used simultaneously for text classification and the detection of important words. Comparing our model to other neural architectures of similar complexity, we achieve competitive classification results. The model is especially useful to explain the recommendation rationale in classification scenarios, where the given input consists of a set of extracted entities, rather than full text. Second, a formal pipeline to detect warning symptoms based on learned importance factors is applied in an industrial application. Our model identifies symptoms that indicate a medical emergency. These warning symptoms can then be used by intelligent medical care services or in an ontology.
Related Work

Text Classification with Deep Learning
Traditional text classification approaches represent documents with sparse lexical features, such as n-grams, and use a linear model or kernel methods on this representation (Wang and Manning, 2012; Joachims, 1998). More recently, deep learning technologies have been applied to text categorization problems. RNNs are designed to handle sequences of any length and capture long-term dependencies. Both sequence-based (Tang et al., 2015) and tree-structured (Tai et al., 2015) models have achieved remarkable results in document modeling.
Moreover, CNN models have achieved high accuracy on text categorization. For example, Kim (2014) used one convolutional layer (with multiple widths and filters) followed by a max pooling layer over time. Johnson and Zhang (2015) built a model that uses up to six convolutional layers, followed by three fully connected classification layers. Conneau et al. (2016) published a model with a 32-layer character-level CNN that achieved a significant improvement on a large dataset. Models that combine CNN and RNN components for document classification also yield competitive results on several public datasets (Zhou et al., 2015; Lai et al., 2015).
To the best of our knowledge, few research efforts have focused on augmenting CNNs for text classification with attention mechanisms. In fact, attention layers are more typically coupled with RNNs in order to better handle long-term dependencies (Yang et al., 2016). Interestingly, Du et al. (2017) used a CNN not as a classifier, but to compute the attention weights to apply to the hidden layers of an RNN. An example of combining attention layers with a CNN is the work by Shen and Huang (2016). However, the authors do not augment the CNN features using attention weights. They use an attention mechanism to compute sentence-level features, which they then concatenate to the convolutional features to ultimately perform the classification.

Intelligent Triage Systems
Intelligent triage systems inform patients where and when they should seek medical care, based on methods such as expert rules, Bayesian inference and deep learning (Semigran et al., 2015). For example, Symptomate (2018) uses a Bayesian network and a medical database for triage advice. Clinical records written by medical experts have also been used to make triage suggestions with deep learning technologies. Li et al. (2017) use a shallow CNN model to predict a patient's diseases from the corresponding admission notes. Nigam (2016) applied an LSTM model to the multilabel classification task of assigning ICD-9 labels to medical notes.

Data Processing
To build the triage application described here, we used 600,000 case records written in German and collected over the past five years. This is only 50% of the total available data, as we selected only those cases treated by top-ranked doctors. Case records contain demographic data such as age and gender, previous illnesses, and a full-text description of the patient's current medical condition. Potential diagnoses consistent with the symptom description are listed.
The descriptions in the records are expressed in formal medical language as well as in layman's terminology. The notes are not always written in complete sentences and include misspellings, dialect vocabulary, non-standard medical abbreviations and inconsistent punctuation. This is a challenge for the linguistic processing of case files.
The original case records are very unevenly distributed over ten recommendation classes (a combination of a point-of-care and a time-to-treat class). To mitigate this problem and for the purpose of this work, the original classes, (emergency, urgent), (grundversorger, urgent), (specialist, urgent), (grundversorger, within a day), (specialist, within a day), (grundversorger, not urgent), (specialist, not urgent), (telecare, -), where grundversorger denotes a primary care provider, were merged, with the help of healthcare professionals, into three categories: Urgent Care, General Practice, Telecare. The categorization of cases is shown in Table 1.

NLP Pipeline
A natural language processing (NLP) pipeline extracted medically relevant concepts associated with each written case. The pipeline consisted of the following stages: (1) data preprocessing for misspelling correction and abbreviation expansion, (2) named entity recognition (NER) and (3) concept clustering. Acronyms and abbreviations used unambiguously were linked to the corresponding entities directly in the dictionaries. Ambiguous acronyms and abbreviations were resolved, when possible, using algorithms that include context for disambiguation. For NER, we used a rule-based medical entity extraction system built with IBM Watson Explorer, using algorithms based on dictionary look-up and advanced rules. This allowed us to detect 51 entity types in the following categories: anatomy, physiology, symptoms, diseases, medical procedures, medicines, negated symptoms, negated diseases, ability/inability of, foreign-body objects, negations, patient information, symptom characterization, disease characterization, time expressions. The distinction between symptoms and diagnoses was made using existing ontologies, where these semantic types were assigned with the help of a team of clinical experts. The dictionaries used in the NER were built partly from existing German-language medical dictionaries and ontologies (German terms mapped to UMLS, ICD-10, MedDRA, etc.) and partly from the list of words contained in the case records. The dictionaries therefore contain a mapping of technical and layman's terms. The NLP pipeline was designed to detect and resolve the negated mentions of the entities listed above (using German language-specific negation particles or expressions), which are very frequent in this type of record. Only 31 entity types in the categories symptoms, diseases, ability/inability of, negated symptoms, negated diseases were included in the current final list.
The average number of extracted annotations per case was 70 for all entities, but only 17 for the selected entities. Performance was evaluated using the manual annotations of a set of ground truth cases performed by a team of clinical experts. Concept clustering is a hierarchical procedure that allowed us to group annotations describing the same medical concept. The same entity may be expressed in a variety of forms (compound vs. simple nouns, dialect or common language vs. medical terminology). Concept clustering is performed either at the dictionary level or by algorithms based on similarity between lemmas associated with the annotations. In this paper, we benchmark the classification approach that uses the extracted concepts against the one that uses the full text.

Model Architecture
The overall architecture of the attention-based CNN is shown in Fig. 1. It consists of several components: a word embedding look-up layer obtained using word2vec (Mikolov et al., 2013), a CNN-based n-gram encoder, an n-gram level attention layer and several fully-connected layers. By means of word embeddings, each word is represented as a real-valued vector. The word embedding look-up layer is a word embedding table $T \in \mathbb{R}^{n \times k}$, where $n$ is the total vocabulary size and $k$ is the embedding dimension. The parameters of the embedding table were fine-tuned during the training phase.

N-Gram Encoder
We used a 2D convolution layer (Kim, 2014) to encode the word sequence into n-gram representations, thus capturing contextual information. For a given document, a 2D convolution filter $w \in \mathbb{R}^{m \times k}$ was applied to a window of $m$ words to produce a new feature. Following Kim (2014), a feature $c_i$ was generated from a window of words $x_{i:i+m-1}$ by
$$c_i = g(w \cdot x_{i:i+m-1} + b),$$
where $b$ is a bias term and $g$ is a non-linear activation function. This filter was applied to each possible window of words in the sentence, $x_{1:m}, x_{2:m+1}, \ldots, x_{n-m+1:n}$, to produce a feature map
$$c = [c_1, c_2, \ldots, c_{n-m+1}],$$
with $c \in \mathbb{R}^{n-m+1}$. By applying multiple filters (denoted $f$) on $x_{i:i+m-1}$, we obtained a new representation of the document. By setting different values for $m$, we obtained different n-gram representations of the documents.
This operation was useful in our application setting because these layers create local region embeddings by n-grams. Moreover, this allowed us to compute the attention factors for a combination of several symptoms. This in turn enabled us to detect pairs and even triplets of symptoms that are harmless if they appear individually, yet become red flags when they appear together. For example, the individual symptoms pain in arm and sudden nausea are no cause for concern. However, if a patient experiences both, this might indicate an impending heart attack.
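As an illustration of the n-gram encoder, the following minimal sketch slides a single filter over a toy document and returns one feature per window. It is not the implementation used in the system: the function name `ngram_feature_map`, the bias `b`, the `tanh` activation and the toy values are assumptions made for the example.

```python
import math

def ngram_feature_map(x, w, b=0.0, activation=math.tanh):
    """Slide one filter w (m rows of k weights) over a document x (n rows of k values).

    Returns the n - m + 1 features, one per window of m consecutive words.
    The bias and the tanh activation are illustrative assumptions.
    """
    m, n = len(w), len(x)
    features = []
    for i in range(n - m + 1):
        total = b
        # Element-wise product of the filter with the window x[i:i+m], summed up.
        for row_x, row_w in zip(x[i:i + m], w):
            total += sum(a * c for a, c in zip(row_x, row_w))
        features.append(activation(total))
    return features

# Toy document: 4 words with embedding dimension k = 2 (hypothetical values).
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]  # window size m = 2
feature_map = ngram_feature_map(doc, bigram_filter)
```

Using several filters and several window sizes simply repeats this computation, giving one feature map per filter and one n-gram representation per window size.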

N-Gram Level Attention Layer
For each n-gram representation, we wanted to derive a corresponding fully-connected representation for the document. As different n-grams are of different importance to the document, we introduced an attention mechanism to extract n-grams that are relevant to the meaning of the document and aggregated the representation of those informative n-grams to form a document vector. The relevant n-grams then became candidates for warning symptoms. More specifically, following Yang et al. (2016), the attention mechanism was defined such that
$$u_{it} = \tanh(W_w v_{it} + b_w),$$
where $v_{it}$ refers to the $t$th row of the $i$th-gram representation, and $W_w$ and $b_w$ are the weights and bias of a one-layer neural network. That is, we first fed the n-gram annotations $v_{it}$ through this network to obtain $u_{it}$ as a hidden representation of $v_{it}$. Then we measured the importance of the n-gram as the similarity of $u_{it}$ with a word-level context vector $u_w$ and obtained a normalized importance weight $\alpha_{it}$ through a softmax function:
$$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_{t'} \exp(u_{it'}^\top u_w)}.$$
The context vector $u_w$ can be regarded as a high-level representation of a fixed query "what is the most informative word?" used in memory networks (Sukhbaatar et al., 2015; Kumar et al., 2016). Context vector $u_w$ was randomly initialized and jointly learned during the training process. Thereafter, we computed the document vector $s_i$ as a weighted sum of the n-gram annotations based on the weights:
$$s_i = \sum_t \alpha_{it} v_{it}.$$
Finally, all n-gram document level representations were flattened into a one-dimensional vector (flat connection layer in Fig. 1) plus patient gender and age (a + 1 in Fig. 1). This vector was then fed into a multilayer perceptron (MLP) for classification.
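The attention step described above can be sketched in a few lines of plain Python. This is a sketch under stated assumptions rather than the trained layer: the `tanh` activation follows Yang et al. (2016), and the names `attend`, `W`, `b` and the toy numbers are hypothetical.

```python
import math

def attend(v, W, b, u_w):
    """Attention over n-gram annotations.

    v   : list of T annotation vectors (each of dimension d)
    W, b: weights (a x d) and bias (length a) of the one-layer network (assumed tanh)
    u_w : context vector (length a)
    Returns the normalized weights alpha and the weighted-sum document vector.
    """
    scores = []
    for v_t in v:
        # Hidden representation u_t of the annotation v_t.
        u_t = [math.tanh(sum(w_rj * v_tj for w_rj, v_tj in zip(W_r, v_t)) + b_r)
               for W_r, b_r in zip(W, b)]
        # Similarity with the context vector.
        scores.append(sum(p * q for p, q in zip(u_t, u_w)))
    # Numerically stable softmax over the T positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]
    # Document vector: attention-weighted sum of the annotations.
    d = len(v[0])
    s_doc = [sum(alpha[t] * v[t][j] for t in range(len(v))) for j in range(d)]
    return alpha, s_doc

# Toy example with T = 3 annotations of dimension d = 2 (hypothetical values).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, s_doc = attend(annotations, identity, [0.0, 0.0], [1.0, 1.0])
```

In this toy case the third annotation aligns best with the context vector and therefore receives the largest weight; in the model, such high-weight n-grams are the candidates for warning symptoms.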

Warning Symptom Detection
Warning symptoms, or red flags, indicate the need for urgent medical care. The ACNN model is able to distinguish the importance of each symptom in the final classification. We calculated the attention score for each symptom $s_j$ as follows:
$$\mathrm{score}(s_j) = \frac{1}{\mathrm{occur}(s_j)} \sum_{c_i \in C} \Phi(c_i, s_j)\, \mathrm{att}(s_j),$$
where $\Phi(c_i, s_j)$ is equal to 1 if symptom $s_j$ is contained in case record $c_i$ and zero elsewhere; $C$ is the set of urgent care cases in the data; $\mathrm{occur}(s_j)$ is the total number of occurrences of symptom $s_j$; and $\mathrm{att}(s_j)$ is the attention weight returned by the ACNN for $s_j$ in $c_i$. The attention weights gave us a measurement of the warning level of the symptoms. This procedure was applied for all classes to detect the most important symptoms that drive the model's prediction. As expected for the other classes, the model assigns high attention weights to non-warning symptoms.
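A minimal sketch of one way to aggregate attention weights into per-symptom scores is given below. It assumes that the score of a symptom is its attention weight averaged over the cases of one class in which it occurs, and that occurrences are counted within that same set of cases; the function name, the data layout and all values are hypothetical.

```python
from collections import defaultdict

def attention_scores(cases):
    """Average attention weight per symptom over a set of cases.

    cases: list of dicts mapping each symptom in a case to the attention
    weight the model assigned to it in that case (assumed layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for case in cases:
        for symptom, weight in case.items():
            totals[symptom] += weight   # sum of attention weights over cases
            counts[symptom] += 1        # number of cases containing the symptom
    return {symptom: totals[symptom] / counts[symptom] for symptom in totals}

# Hypothetical urgent care cases with made-up attention weights.
urgent_cases = [
    {"vomiting blood": 0.9, "cough": 0.1},
    {"vomiting blood": 0.7, "electric shock": 0.3},
]
scores = attention_scores(urgent_cases)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking the symptoms by this score within the urgent care class yields a candidate list of warning symptoms; the same aggregation can be repeated for the other classes.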

Training Details
We conducted a detailed evaluation of this model on both the original full-text dataset and a dataset of a few selected medical entities (see Section 3.1.1 for details), denoted for simplicity as the symptoms dataset. All neural network models were implemented with TensorFlow and Keras. The vocabulary size, average document size and maximum document length are 134,000, 62.9 and 959 words for the full-text dataset; and 20,000, 14.15 and 94 for the symptoms dataset. We used 90% of the data for training, 5% for validation and 5% for testing, sampled at random. Both datasets were preprocessed by removing stop words and low-occurrence words and zero-padding the documents. We learned 200-dimensional word embeddings on our datasets with word2vec over 25 iterations. The embeddings were different for each dataset.
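The random 90/5/5 split and the zero-padding step can be sketched as follows. The helper names, the fixed seed and the use of 0 as the padding id are assumptions for the example, not details given in the paper.

```python
import random

def split_90_5_5(items, seed=0):
    """Shuffle the items and split them into 90% train, 5% validation, 5% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def zero_pad(token_ids, max_len):
    """Truncate or right-pad a list of token ids with zeros to exactly max_len."""
    return token_ids[:max_len] + [0] * max(0, max_len - len(token_ids))

train, val, test = split_90_5_5(range(100))
padded = zero_pad([12, 7, 3], max_len=5)
```

Applied to 600,000 records, these proportions give 540,000 training cases and 30,000 cases each for validation and testing.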
We tuned our parameters on a validation set of 30,000 cases and report the results on a separate test set of 30,000 cases. For model-specific parameters, we used grid search to find the optimal values. We used a cross-entropy loss function and trained with mini-batches of 256 and the Adam optimizer for five epochs. The learning rate was between 0.001 and 0.003; regularization was performed by weight decay of 0.0001 and a dropout of 0.8 was applied to every MLP layer. The attention vector size was set to 100, and the window size was set from 1 to 5. For each n-gram extraction, we used up to 128 filters for 2D convolution.

Model Comparison
In this section, we compare our system to the following approaches:
CLSTM (Zhou et al., 2015) applies a CNN model on text and feeds consecutive window features directly to an LSTM model.
Kim CNN (Kim, 2014) uses 2D convolution windows to extract an n-gram representation followed by max-pooling.
BiGRU Attention Network (Yang et al., 2016) consists of RNNs applied on both word and sentence level to extract a hidden state. An attention mechanism is applied after the bidirectional gated recurrent units.
The results on the datasets with the full text and the symptoms only are shown in Tables 3 and 4, respectively. All the analyzed models show similar performance in the classification task. For all models, the performance decreases as we move from the full text dataset to the symptoms dataset because the medical and contextual information also diminishes by taking into account only the extracted symptom concepts.

Result Analysis
In this section, we compare our ACNN model with the state-of-the-art deep learning models to obtain a benchmark on our triage use case. We also describe how our approach, a combination of convolutional neural networks and attention mechanisms, equals the performance of existing models with the advantage of being explainable.
Kim CNN uses 2D convolution windows to extract n-gram representations. Max pooling was then applied to each of the filter outputs. A single value was retained for each feature map. This might work well for short sentences containing only a few "leading" words indicating the category. For longer documents, however, all information about n-grams is lost apart from the strongest signal. The presence of highly important symptoms in clinical data is the reason why this model performs well especially for urgent care and telecare classes. This hypothesis is supported by the number of symptoms with large attention scores found in the ACNN model for these classes.

Table 4: Same as Table 3 but on the symptoms dataset, where $s_1$, $s_2$, $s_3$ are urgent care, general practice and telecare cases, respectively.
The BiGRU Attention Network applies an attention layer after bidirectional GRU components. For a given word in a sentence, it encodes information about the word context in that sentence. However, compared to a 2D convolution window, only a single context window is used. It is not trivial to choose the optimal window size. Thus, it is difficult to detect warning symptom pairs or triplets. For 2D convolution in our model, identifying such pairs or triplets is more straightforward because attention factors are also learned for 2-grams and 3-grams. Another limitation of GRU models is that they rely on fully sequential data. In our use case, however, the data is composed of several separate phrases, words or incomplete sentences.
Our ACNN combines the merits of 2D convolution and attention mechanisms by stacking 2D convolution layers to extract contextual information and an attention mechanism to assign importance factors to different symptoms and combinations thereof.

Warning Symptom Detection
Owing to the lack of ground truth, we used the following method to evaluate the warning symptoms detected by the ACNN. First, we measured the recall of the ACNN on urgent care cases containing only symptom concepts. Then, a new dataset was created by removing from each case record the 1-gram with the highest attention score, calculated as described in Section 3.3. For urgent cases, we expected the removed 1-grams to be highly important signals of medical urgency, hence warning symptoms. For instance, starke Brustschmerzen (severe chest pain) would be removed from the case described in Section 3.1.1. We then compared the ACNN recall for the urgent cases on the new dataset (Attention Drop) with the recall on the original symptoms dataset (Baseline). This procedure was performed on all the classes for validation. The decrease in recall demonstrates the importance of the detected warning symptoms for classifying urgent cases correctly. To verify that the detected warning symptoms are indeed highly informative, we furthermore generated datasets in which either random symptoms (Random Drop) or symptoms that appear most frequently in urgent cases (Frequency Drop) are dropped.
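The ablation procedure above can be illustrated with the following sketch, which builds the Attention Drop and Random Drop variants of a single case and defines the recall used for the comparison. The data layout (a dict from symptom to attention score), the helper names and the example symptoms other than starke Brustschmerzen are hypothetical.

```python
import random

def attention_drop(case, k=1):
    """Remove the k symptoms with the highest attention score from a case."""
    dropped = set(sorted(case, key=case.get, reverse=True)[:k])
    return [s for s in case if s not in dropped]

def random_drop(case, k=1, seed=0):
    """Remove k symptoms chosen uniformly at random (baseline ablation)."""
    dropped = set(random.Random(seed).sample(list(case), k))
    return [s for s in case if s not in dropped]

def recall(predicted, gold, target):
    """Fraction of cases with gold label `target` that were predicted as `target`."""
    relevant = [p for p, g in zip(predicted, gold) if g == target]
    if not relevant:
        return 0.0
    return sum(p == target for p in relevant) / len(relevant)

# Hypothetical urgent case: symptom -> attention score.
case = {"starke Brustschmerzen": 0.9, "Husten": 0.1, "Fieber": 0.2}
without_top = attention_drop(case)
without_random = random_drop(case)
baseline_recall = recall(["urgent", "telecare", "urgent"],
                         ["urgent", "urgent", "urgent"], "urgent")
```

Re-classifying the ablated cases and recomputing `recall` for each class gives the quantities compared between the Baseline, Attention Drop, Random Drop and Frequency Drop settings.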
As shown in Table 5, dropping the attention-detected warning symptoms led to the largest decrease in performance. The difference became even more distinct if two symptoms instead of one were removed from the cases.
Performance also decreased for the urgent care and general practice classes, whereas almost flat behavior was found for the telecare class, as expected. In the latter case, the random, frequency and attention drops showed the same results because several features had the same attention scores. Manual inspection of the symptoms with the highest attention scores further supports these results. The darker the color of the symptom in Figure 2, the higher its attention factor in the model. In the examined samples, darker colors did indeed correlate with symptoms that made patients require urgent care, such as vomiting blood and electric shock.
With single or double removals for the full-text dataset, a much lower decrease in performance was observed because of the higher number of features per case.

Explainable Deep Learning
In current research, but especially in medical industry applications, transparent or explainable machine learning models are becoming increasingly important. Some machine learning models have become so complex, they are black boxes. End users need to understand why a certain recommendation was made.
In our application, the attention mechanism on which we based our warning (and non-warning) symptom detection represents a transparent method of reasoning why a given case belongs to a certain class.
For instance, by analyzing the patient symptoms with the highest attention scores, it becomes apparent why a case would be predicted to be urgent, general practice or telecare. Table 6 shows some examples with high/low attention scores computed using 1-gram attention values for urgent care, general practice and telecare classes. As can be seen, the symptoms with the highest score in the urgent cases are the most severe, whereas the symptoms in the telecare cases are less severe. In other words, symptoms with a high/low score for a given class are the most/least relevant ones for that class. As expected, if the model predicts an urgent (non-urgent) class, the model assigns a higher weight to warning (non-warning) symptoms. The computation of 1-gram feature scores results in 2,000 (3,600), 734 (3,700), 1,500 (3,800) features with scores of > 0.8 (< 0.2) for s 1 , s 2 and s 3 , respectively. The use of an attention layer on n-gram representations allowed us to compute feature relevance including correlations between pairs, triplets, etc. An example of scores of feature pairs obtained by extracting the attention weights for the 2-grams is shown in Tables 7 and 8. Strong correlation between feature pairs is found for the cases where the score of the pair is much higher than those of the single features. The computation of 2-gram feature scores results in 12,000 (28,000), 4,800 (13,000), 10,000 (24,000) features with scores of > 0.8 (< 0.2) for s 1 , s 2 and s 3 , respectively.
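To make the use of the score thresholds and the pair comparison concrete, the sketch below buckets features by score (> 0.8 and < 0.2, as in the text) and flags pairs whose score clearly exceeds that of either member. The `margin` value, the function names and all scores are assumptions chosen for illustration; the paper does not state how "much higher" is quantified.

```python
def bucket_by_score(scores, high=0.8, low=0.2):
    """Return the features scoring above `high` and those scoring below `low`."""
    strong = sorted(f for f, s in scores.items() if s > high)
    weak = sorted(f for f, s in scores.items() if s < low)
    return strong, weak

def synergistic_pairs(pair_scores, single_scores, margin=0.3):
    """Flag pairs whose score exceeds both single-feature scores by `margin`.

    The margin is a hypothetical cut-off, not a value from the paper.
    """
    flagged = []
    for (f_i, f_j), score in pair_scores.items():
        if score - max(single_scores[f_i], single_scores[f_j]) > margin:
            flagged.append((f_i, f_j))
    return flagged

# Made-up 1-gram and 2-gram scores for the urgent care class.
single = {"pain in arm": 0.3, "sudden nausea": 0.25,
          "vomiting blood": 0.95, "cough": 0.1}
pairs = {("pain in arm", "sudden nausea"): 0.9,
         ("cough", "pain in arm"): 0.35}
strong, weak = bucket_by_score(single)
flagged = synergistic_pairs(pairs, single)
```

With these toy numbers, only the pair (pain in arm, sudden nausea) is flagged, mirroring the kind of combination that is harmless in isolation but becomes a red flag jointly.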

Confidence
To reach higher performance in an operative triage application, we define a confidence score in the classification, based on which the system decides whether to trust the recommendation. In Table 9 and Table 10 we show the same results obtained in Tables 3 and 4, respectively, discarding all test cases in which the predicted probability of the classifier was lower than 0.6. With the chosen threshold, we discarded roughly 30% of the cases. Overall, a performance improvement of between 5% and 10% is observed. In future work, we plan to apply additional techniques, e.g., based on hierarchical decision trees, to minimize medical risk even further.
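The confidence filter can be sketched as a simple threshold on the highest predicted class probability. The function name and the toy probabilities are hypothetical; only the threshold of 0.6 comes from the text.

```python
def filter_by_confidence(probabilities, threshold=0.6):
    """Keep (case index, predicted class) pairs whose top probability reaches the threshold."""
    kept = []
    for idx, probs in enumerate(probabilities):
        top = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
        if probs[top] >= threshold:
            kept.append((idx, top))
    return kept

# Made-up class probabilities for four test cases (urgent care, general practice, telecare).
predictions = [
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.30, 0.60],
    [0.50, 0.30, 0.20],
]
kept = filter_by_confidence(predictions)
coverage = len(kept) / len(predictions)
```

Precision is then computed only on the retained cases, while `coverage` reports the share of cases for which the system issues a recommendation at all.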

Conclusion
We have described an attention-based CNN model to assess patient risk and to detect warning symptoms, which will be used in an industrial application for medical triage. We achieved a precision of 79% on the full-text dataset and 66% on the symptoms set. At a confidence threshold of 0.6, precision increases to 85% and 75%, respectively. The learned attention weights allowed us to compute the symptom relevance, i.e., the attention score, which is then used to extract warning symptoms more precisely and to make the recommendation rationale transparent.

Results tables (column headers): Dataset, P($s_1$), R($s_1$), F($s_1$), P($s_2$), R($s_2$), F($s_2$), P($s_3$), R($s_3$), F($s_3$).
Tables 7 and 8 (column headers): $(f_i, f_j)$, score of $f_i$, score of $f_j$, score of $(f_i, f_j)$.
Table 10: Same as Table 4, applying a threshold of 0.6 to the probabilities.