An efficient representation of chronological events in medical texts

In this work we addressed the problem of capturing sequential information contained in longitudinal electronic health records (EHRs). Clinical notes, a particular type of EHR data, are a rich source of information, and practitioners often develop ad hoc solutions to make the most of the sequential information contained in free text. We proposed a systematic methodology for learning from chronological events available in clinical notes. The proposed path signature framework creates a non-parametric hierarchical representation of sequential events of any type, which can be used as features for downstream statistical learning tasks. The methodology was developed and externally validated using the largest secondary care mental health EHR data in the UK on the specific task of predicting the survival risk of patients diagnosed with Alzheimer’s disease. The signature-based model was compared to a common survival random forest model. Our results showed a 15.4% increase in risk prediction AUC at 20 months after the first admission to a specialist memory clinic, and the signature method outperformed the baseline mixed-effects model by 13.2%.


Introduction
Electronic health records (EHRs) have now become ubiquitous and offer novel opportunities for clinical research by supporting the development of intelligent decision support systems and the improvement of patients' care. One of the distinct features of EHRs is that the data are collected over time and might be seen as health data streams, allowing researchers to study longitudinal trends and make inferences about the progression of disease, treatments and outcomes. However, the proper representation of sequential medical events still remains a challenge. Moreover, longitudinal clinical notes exhibit a multi-level hierarchical structure, where events are described and embedded in sentences, sentences in paragraphs, eventually resulting in chronologically ordered documents. Recent works have addressed the problem of capturing this information directly from raw texts by introducing novel neural network architectures, such as attention-based recurrent neural networks (Bai et al., 2018) and time-aware Transformers (Zhang et al., 2020). When dealing with chronological clinical notes, practitioners make multiple decisions on how to structure and transform these sequential events, which are often simplifications of medical histories.

In this work we proposed a different methodology to address the problem of learning from events found in clinical notes, by first extracting them using natural language processing and then representing their sequential order by means of path signatures. The signature (Lyons, 2014) is a non-parametric representation of heterogeneous sequential data; it offers a feature extraction method for longitudinal events and can naturally be integrated within a general data mining pipeline. To demonstrate the methodology, we used the largest secondary care mental health EHR data in the UK to develop a survival prognostic model for patients diagnosed with Alzheimer's disease.

Data
The data in this study were sourced from the UK-Clinical Record Interactive Search system (UK-CRIS), which provides a research platform (https://crisnetwork.co/) for data mining and analysis using de-identified real-world observational electronic patient records from twelve secondary care UK Mental Health NHS Trusts (Goodday et al., 2020). UK-CRIS provides access to structured information, such as ICD-10 coded diagnoses, quality of life scales and demographic information, as well as various unstructured texts, such as clinical summaries, discharge letters and progress notes. The study cohort jointly comprised records from 24,108 patients diagnosed with Alzheimer's disease and various types of dementia, containing more than 3.7 million individual clinical documents from two centres: Oxford and Southern Health Foundation NHS Trusts.

The field of clinical NLP in general, and of mental health and Alzheimer's research in particular, suffers from a dearth of gold-annotated data. This is due to the shortage of trained annotators with a clinical background who are also authorised to access sensitive patient-level data. Therefore, to develop a robust information extraction (IE) model from a limited amount of data, we leveraged the idea of transfer learning using the publicly available MIMIC-III corpus (Johnson et al., 2016), which comprises information on patients admitted to intensive care units (ICU) and contains more than 2.1 million clinical notes, as well as 505 discharge summaries gold-annotated by clinical experts from the 2018 n2c2 challenge (Henry et al., 2020). The study was independently approved by the Oxfordshire and Southern Health NHS Foundation Trust Research Ethics Committees.

Information extraction model
The information extraction model was developed to identify diagnoses, medications and the Mini-Mental State Examination (MMSE) cognitive health assessment score (Pangman et al., 2000). Additionally, the identified entities were classified according to several attributes, such as the 'experiencer' modality (i.e., whether the MMSE was actually referring to a patient or to a family member), temporal information (i.e., the date of diagnosis or MMSE score) and negations (i.e., discontinued medications) (Harkema et al., 2009; Gligic et al., 2019). Such negated drug mentions were discarded in order to extract the most accurate information. Generic and brand drug names were normalised using the British National Formulary (BNF), the core pharmaceutical reference book (Committee et al., 2019). The architecture of the named entity recognition model comprised a hybrid approach of ontology-based fuzzy pattern matching and a bi-directional LSTM neural network with an attention mechanism (Bahdanau et al., 2014) for sequence classification. The GloVe word embeddings (Pennington et al., 2014) were fine-tuned on both MIMIC-III and UK-CRIS data (Vaci et al., 2020a; Kormilitzin et al., 2020). The developed IE model was trained only on data from the Oxford Health NHS Trust instance and externally validated on a sample of data from the regionally different Southern Health NHS Foundation Trust.
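To make the extraction target concrete, the following is a deliberately simplified sketch that pulls MMSE scores out of free text with a regular expression. It is a stand-in for illustration only and not the hybrid NER model described above: the pattern, the function name `extract_mmse` and the example notes are assumptions, and the sketch ignores the experiencer, temporal and negation attributes.

```python
import re

# Hypothetical pattern: the token "MMSE", an optional colon, a one- or
# two-digit score and an optional "/30" denominator.
MMSE_PATTERN = re.compile(r"\bMMSE\s*:?\s*(\d{1,2})(?:\s*/\s*30)?", re.IGNORECASE)


def extract_mmse(note):
    """Return all plausible MMSE scores (0-30) mentioned right after 'MMSE'."""
    scores = [int(match.group(1)) for match in MMSE_PATTERN.finditer(note)]
    return [score for score in scores if 0 <= score <= 30]


# Synthetic notes written for this illustration.
notes = [
    "Patient deteriorated: MMSE 23/30 as compared to 25/30 previously.",
    "Today MMSE 19, the patient didn't respond to Rivastigmine.",
    "Severely deteriorated (MMSE 14/30), stop Donepezil.",
]

for note in notes:
    print(extract_mmse(note))
```

Note that the historical score "25/30" in the first note is not captured because it does not directly follow the token "MMSE"; resolving such mentions and their dates is exactly what the attribute classification step is for.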

The signature of a path
Repeated measurements, speech, text, time series or any other sequential data might be seen as a path-valued random variable. Formally, a path X of finite length in d dimensions can be described by the mapping

$$X : [a, b] \to \mathbb{R}^d, \qquad X_t = \left(X^1_t, \dots, X^d_t\right),$$

where each coordinate $X^i_t$ is real-valued and parametrised by $t \in [a, b]$. The signature representation S of a path X is defined as an infinite series:

$$S(X) = \left(1, S^{(1)}, \dots, S^{(d)}, S^{(1,1)}, S^{(1,2)}, \dots\right),$$

where each term is a k-fold iterated integral of the path X labelled by the multi-index $i_1, \dots, i_k$:

$$S^{(i_1, \dots, i_k)} = \int_{a < t_1 < \dots < t_k < b} dX^{i_1}_{t_1} \cdots dX^{i_k}_{t_k}. \tag{2}$$

However, in many real-life applications the first k terms of the signature, truncated at level L, give a satisfactory approximation. Intuitively, it is analogous to statistical moments of a d-dimensional vector-valued random variable, such as the mean, variance or higher moments. One can define statistical moments of a path-valued random variable, which are essentially the signature moments (Chevyrev and Oberhauser, 2018) defined in Eq. (2). The signature S(X) completely characterises a path X up to tree-like equivalence and is invariant to reparameterisation (Hambly and Lyons, 2010). The signature can also be expressed in a more compact form known as the log-signature (Liao et al., 2019; Morrill et al., 2020a), which is the formal power series of log S(X), while carrying the same information.

Informally, the path signature captures the order of events. For example, consider two sequences $X_1 = aabba$ and $X_2 = baaab$ consisting of a simple vocabulary with only two letters {a, b}. The sequences might be presented as paths in 2d space as shown in Fig. 1. Each linear segment between two points (Fig. 1) corresponds to a single letter in the sequence and the arrows denote the temporal direction of the sequence. Despite the same number of letters in the sequences {a = 3, b = 2}, the order of letters matters. The signature easily picks up the differences, and the first four levels of the log-signatures of the paths are shown in Table 1. The first-order signature terms $S^{(i)}$ are the increments along the i-th direction (i.e. the distance between the endpoints), for example, $S^{(1)} = 3 - 0 = 3$ and $S^{(2)} = 2 - 0 = 2$, as can be seen in Figure 1. The second order corresponds to the area enclosed by a path and a chord connecting the endpoints (Chevyrev and Kormilitzin, 2016).

Table 1:
Level      1        2      3            4
S(X_1)     3, 2     1      -0.5, -1     -1/3, -0.5, 0
S(X_2)     3, 2     0      1.5, 0.5     0, 0, 0
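As a minimal sketch of the two-letter example above, the first two signature levels of a piecewise linear path can be accumulated segment by segment using Chen's identity. The helper names (`letters_to_path`, `signature_level2`, `levy_area`) are hypothetical, and this pure-Python code is an illustration under the assumption that each letter is a unit step along its own axis; it is not the implementation used in the study.

```python
def letters_to_path(sequence, alphabet="ab"):
    """Map each letter to a unit step along its own axis, starting at the origin."""
    position = [0.0] * len(alphabet)
    path = [tuple(position)]
    for letter in sequence:
        position[alphabet.index(letter)] += 1.0
        path.append(tuple(position))
    return path


def signature_level2(path):
    """Return signature levels 1 and 2 of a piecewise linear path.

    Each linear segment with increment dx updates the running signature via
    Chen's identity: S2 <- S2 + S1 (x) dx + (dx (x) dx) / 2, then S1 <- S1 + dx.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for start, end in zip(path, path[1:]):
        dx = [end[i] - start[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2


def levy_area(s2, i=0, j=1):
    """Second-level log-signature term: the antisymmetric part of level 2."""
    return 0.5 * (s2[i][j] - s2[j][i])


for name, sequence in [("X_1", "aabba"), ("X_2", "baaab")]:
    increments, level2 = signature_level2(letters_to_path(sequence))
    print(name, increments, levy_area(level2))
```

For aabba this gives increments (3, 2) and a second-level log-signature term of 1; for baaab it gives (3, 2) and 0, in agreement with the first three entries of each row of Table 1.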

Independent and outcome variables
The independent variables used in the prognostic model were medications and the MMSE scores collected over time. The dependent outcome variable was right-censored time to death in months. A synthetic example of a patient's records (Table 2) and the corresponding algorithmically extracted longitudinal data are presented in Table 3.
The outcome variable was encoded as a tuple, for example (True, 34.17), indicating that a person died 34.17 months after the very first visit to a specialist memory clinic. The patient was treated with two different medications with a changing pattern and was eventually tapered off medication because no further improvement was expected.
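The outcome encoding can be sketched as follows. This is an illustration under stated assumptions: the function name `encode_outcome`, the month length of 365.25 / 12 days and the dates are all hypothetical, and the source does not specify how months were computed or rounded.

```python
from datetime import date

# Assumed average month length in days; the original conversion is not specified.
DAYS_PER_MONTH = 365.25 / 12


def encode_outcome(first_visit, death_date, last_follow_up):
    """Return (event_observed, time_in_months) for right-censored survival data.

    If the death date is known, the event is observed; otherwise the patient
    is censored at the last follow-up date.
    """
    if death_date is not None:
        elapsed_days = (death_date - first_visit).days
        return True, round(elapsed_days / DAYS_PER_MONTH, 2)
    elapsed_days = (last_follow_up - first_visit).days
    return False, round(elapsed_days / DAYS_PER_MONTH, 2)


# Hypothetical dates chosen only to exercise both branches.
deceased = encode_outcome(date(2017, 1, 1), date(2018, 1, 1), date(2019, 1, 1))
censored = encode_outcome(date(2017, 1, 1), None, date(2019, 1, 1))
print(deceased, censored)
```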

Baseline longitudinal data summarisation
The signature transformation might be seen as a hierarchical statistical summarisation ("feature extraction") of the longitudinal data along the temporal dimension. In order to benchmark the proposed method, we used a time-honoured linear mixed-effects regression as a baseline model for longitudinal summarisation. Specifically, the longitudinal MMSE scores of each patient were modelled using a linear regression, and the resulting coefficients (an intercept and a slope) were used as features representing the progression of the MMSE over time. The median number of medication categories was used as an additional feature, resulting in three features for each patient.
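The three baseline features can be sketched for a single patient as below. This is a simplification rather than the mixed-effects model itself: it fits an ordinary least-squares line per patient, and it assumes that the median is taken over the number of medication categories recorded at each visit. The function names and the numbers are hypothetical.

```python
from statistics import median


def fit_line(times, values):
    """Ordinary least-squares fit of values = intercept + slope * times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = s_tv / s_tt
    intercept = mean_v - slope * mean_t
    return intercept, slope


def baseline_features(times, mmse_scores, medications_per_visit):
    """Return [intercept, slope, median number of medication categories]."""
    intercept, slope = fit_line(times, mmse_scores)
    return [intercept, slope, median(medications_per_visit)]


# Hypothetical patient: months since first visit, MMSE scores and the number
# of medication categories recorded at each visit.
features = baseline_features([0.0, 10.0, 20.0], [24.0, 22.0, 20.0], [1, 1, 0])
print(features)
```

A per-patient fit needs at least two visits at distinct times; pooling information across patients for sparse records is one reason to prefer a mixed-effects formulation.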

Survival random forests
The common statistical approach to analysing time-to-event survival data is based on the linear Cox model (Collett, 2015). However, Miao et al. (2015) showed that a survival random forest (SRF) approach (Ishwaran et al., 2008) outperformed the linear Cox model, based on Harrell's concordance index (C-index) (Harrell et al., 1982), and, as expected, was capable of identifying non-linear effects of the input variables, unlike the linear Cox model. Therefore, we chose the SRF as the preferred method. The SRF approach was implemented in Python using the "scikit-survival" package (Pölsterl et al., 2015). Harrell's C-index is a goodness-of-fit measure for risk score models. It is commonly used to evaluate risk models in survival analysis, where data may be right-censored, and corresponds to the rank correlation between predicted risk scores and observed time points, similarly to Kendall's τ.
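The idea behind the concordance index can be sketched in a few lines of pure Python. This is an illustrative implementation and not the library routine used in the study: the function name `harrell_c_index` is hypothetical, pairs with tied event times are simply skipped, and ties in the predicted risk count as half-concordant.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs ranked correctly by the risk scores.

    A pair (i, j) is comparable when subject i has an observed event and a
    strictly shorter time than subject j. It is concordant when subject i
    also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: survival times in months, event indicators, risk scores.
times = [5.0, 10.0, 15.0, 20.0]
events = [True, False, True, True]
print(harrell_c_index(times, events, [0.9, 0.6, 0.7, 0.1]))
print(harrell_c_index(times, events, [0.9, 0.6, 0.1, 0.7]))
```

In this toy example there are four comparable pairs. The first set of scores orders all of them correctly, while the second misorders one pair.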

Information extraction model
We used a hybrid approach to developing an IE model, which consisted of training a baseline model using MIMIC-III and n2c2 annotated data. Specifically, the named-entity recognition (NER) model comprised a transition-based system based on the chunking model (Lample et al., 2016), where tokens were represented as hashed and embedded representations of the prefix, suffix, shape and lemmatised features of words, followed by rule-based matching using the BNF vocabulary. The IE model was implemented using the "spaCy" Python library (https://spacy.io), including identification of negations and temporal information as well as classification of relationships between word tokens using linguistic features, such as part-of-speech tags and dependencies. Finally, the active learning tool "Prodigy" (https://prodi.gy) was used for iterative model improvement (Vaci et al., 2020b). The IE model showed consistent performance on both validation and external validation data sets (Table 5). The annotation schema was developed following the recommendations of Pustejovsky and Stubbs (2012). The token-level performance metrics were evaluated using the SemEval schema (Segura Bedmar et al., 2013) and the inter-annotator agreement (IAA) of two clinical annotators was computed using the F1 score.

Table 2:
Date          Clinical note
              Today I saw a patient diagnosed with Alzheimer's, who deteriorated: MMSE 23/30 as compared to 25/30 from 1st January. Started on Rivastigmine.
12-Feb-2017   Today MMSE 19, the patient didn't respond to Rivastigmine and was changed to Donepezil.
01-Apr-2019   The patient stopped responding to Donepezil and severely deteriorated (MMSE 14/30), stop Donepezil.

Prognostic model
Four prognostic models were developed and compared to each other. All models estimated the survival probability of a patient diagnosed with Alzheimer's disease since their first admission to a memory clinic. We compared signature ("Sig", Sec. 2.3) versus non-signature ("Non-sig", Sec. 2.5) models. We also estimated the added value of the sequential information contained in the treatment course with medications. Specifically, we used two sets of input variables: {time, MMSE} and {time, MMSE, medications}, where time corresponds to the date of the MMSE score or prescribed medication as presented in Table 3. For the "Sig" model, the input variables were first transformed into signatures, where the categorical medication names were one-hot encoded and augmented with numerical MMSE scores to create a path. For the "Non-sig" model, the longitudinal MMSE scores were summarised by means of linear models adjusting for each patient, and the median number of distinct medications was computed. Both model types were trained and validated using the same folds of stratified 5-fold cross-validation (with a fixed random seed). The quality of predictions was assessed using Harrell's C-index and the results are summarised in Table 7. The signatures were computed using the "esig" Python library (https://esig.readthedocs.io/); alternative libraries are also available (Reizenstein and Graham, 2018; Kidger and Lyons, 2020). We also estimated the time-dependent area under the receiver operating characteristic curve (AUC ROC) (Lambert and Chevret, 2016). It is a natural extension of the common AUC ROC analysis to possibly censored survival times, where the patients' cognitive health is usually better at the very first visit to a memory clinic, while their condition may deteriorate later. The time-dependent cumulative dynamic AUC ROC of all four models is presented in Fig. 2.
The signature features outperformed the non-signature ones at all times, and the inclusion of sequential information from switching medications improved AUC ROC at later times. However, both models struggled to reliably predict outcomes beyond 3 years. This is due to the limitations of the predictors and the number of patients available after 3 years rather than the capacity of our model.
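The construction of a path from {time, MMSE, medications} can be sketched as follows. This is one plausible encoding under stated assumptions, not necessarily the one used in the study: every event is assumed to carry both an MMSE score and the currently prescribed medication (or `None` when no medication is given), and the function name `build_path`, the times and the medication vocabulary are hypothetical.

```python
def build_path(events, medication_vocabulary):
    """Turn chronological events into a multi-dimensional path.

    Each event is (time_in_months, mmse_score, medication_or_None). Each path
    point is [time, mmse, one-hot medication indicators...], so the path has
    2 + len(medication_vocabulary) coordinates.
    """
    path = []
    for time, mmse, medication in sorted(events, key=lambda event: event[0]):
        one_hot = [1.0 if name == medication else 0.0 for name in medication_vocabulary]
        path.append([float(time), float(mmse)] + one_hot)
    return path


# Hypothetical events loosely modelled on the synthetic example in Table 2.
vocabulary = ["rivastigmine", "donepezil"]
events = [
    (0.0, 23, "rivastigmine"),
    (1.4, 19, "donepezil"),
    (27.0, 14, None),
]
path = build_path(events, vocabulary)
for point in path:
    print(point)
```

The resulting list of points is the kind of stream that a signature library would take as input, with the truncation level chosen as a hyperparameter.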

Discussion and future direction
Unstructured longitudinal electronic health records, such as free-text clinical notes, inherently contain rich information about patients' health and outcomes over time. The right analytical tools capable of capturing sequential information can therefore maximise the utilisation of longitudinal EHRs and can be valuable for supporting clinical decisions and prognostic models. In this work we implemented a signature-based approach to represent chronological events extracted from clinical notes using natural language processing. Extracted chronological events can be seen as a trajectory (path) embedded in a high-dimensional multi-modal space of events (i.e. different medications, interventions, measures, etc.) and the signature uniquely characterises the path in the most succinct way. The signature-based feature extraction approach was compared to handcrafted features, comprising a slope and an intercept of MMSE scores over time and the median number of medications for each patient. The signatures represent a hierarchical collection of features, where the first order is proportional to linear statistical moments (i.e. the mean) and is not sensitive to the order of data points, as illustrated in Table 1. We demonstrated that the sequential information about medications, as captured by the signatures, significantly improved the time-dependent AUC (Figure 2). In future work we will extend the proposed framework to include the structured information available in EHRs (i.e. lab results, coded procedures or clinical encounters) and will develop an interpretability framework to make the signature-based models explainable.

Figure 2: Time-dependent AUC of risk prediction over time since the first admission to a memory clinic.

Table 1: The first k = 8 terms of the log-signature expansion up to level L = 4. The difference between the two sequences X_1 and X_2 is apparent starting from the second level.

Table 2: A synthetic example of chronological medical records.

Table 3: Extracted and chronologically structured data from Table 2.

Table 4: The number of gold-annotated instances in the training, validation and external validation data sets.

Table 5: Performance (shown in %) of the information extraction model. IAA: inter-annotator agreement.

Table 6: Summary statistics of the extracted data for survival analysis. Survival time is shown as mean(std) in months. The MMSE scores were not observed for censored people later in time, while the date of death was recorded in hospital.

Table 7: Harrell's C-index of the four models. Values are reported as mean(std) over 5-fold cross-validation.