Identifying robust markers of Parkinson’s disease in typing behaviour using a CNN-LSTM network

There is an urgent need for non-intrusive tests that can detect early signs of Parkinson’s disease (PD), a debilitating neurodegenerative disorder that affects motor control. Recent promising research has focused on disease markers evident in the fine-motor behaviour of typing. Most work to date has focused solely on the timing of keypresses without reference to the linguistic content. In this paper we argue that the identity of the key combinations being produced should impact how they are handled by people with PD, and provide evidence that natural language processing methods can thus be of help in identifying signs of disease. We test the performance of a bi-directional LSTM with convolutional features in distinguishing people with PD from age-matched controls typing in English and Spanish, both in clinics and online.


Introduction
Parkinson's disease is a neurodegenerative disease that affects approximately 1% of people over the age of 60 (De Lau and Breteler, 2006). Its cardinal manifestations include bradykinesia (slowness of movement), tremor and rigidity. These result from the degeneration of dopaminergic neurons in the basal ganglia (an area of the brain responsible for action selection). A particular challenge in the treatment of PD is that by the time such motor signs are present, over 50% of neurons in the affected area of the basal ganglia (the substantia nigra) have been lost (Fearnley and Lees, 1991). While neuroimaging can pick up on these changes (Barber et al., 2017), such procedures are prohibitively expensive and cannot be performed on whole populations. There is thus an urgent need for cheap and easy-to-administer measures that can be used for the identification of at-risk individuals.
† Equal contribution.
1 Code, models and data used in this paper can be found at: http://typingresearch.com/conll2020/
A long-used simple motor test for PD is the alternating finger tapping test (Burns and DeJong, 1960). This test involves asking a person to alternately tap an index finger in two locations a set distance apart on a surface or on a keyboard (Giovannoni et al., 1999). People with PD are typically able to perform fewer taps over a 30 second period than people with no diagnosis. While such measures have proved useful, they suffer from a clear lack of specificity: slowing of movement is also a strong predictor of other neurodegenerative disorders, such as Alzheimer's disease (Roalf et al., 2018). Furthermore, neural degeneration is unlikely to be detected by as coarse-grained a measure as tapping rate until the disease is relatively advanced. If specificity and earlier detection are to be achieved, more targeted tests will be required.
There is good theoretical reason to think that more PD-specific markers will be present in recordings of learned serial order behaviours, such as making a cup of tea, driving or typing. Analysis of the production of such frequently-performed behaviours, and their underlying neurobiology, often distinguishes between habit (the automatic production of routinised movements) and goal-directed responses (behaviours that involve top-down planning; Dolan and Dayan 2013). There is substantial evidence that the degeneration of the basal ganglia in PD primarily affects areas responsible for automatic behaviours (Sharman et al., 2013), and results in a shift in the balance of habitual and goal-directed control (Hadj-Bouziane et al., 2013). Redgrave et al. (2010) predict that people with early-stage or prodromal PD will have a problem initiating their automatic behaviours.
This paper focuses on the detection of markers of PD in one such behaviour, that of typing. It is motivated by the prediction that people with PD will, from very early on and potentially prodromally (before the emergence of the acute symptoms that allow conventional diagnosis), change the way that they type, as they lose capacity for automatic control. Natural language processing provides us with techniques that we can use to pick up on those changes. As a motor behaviour, typing has been the focus of previous work on detecting or monitoring PD (see related work section). Some such work has focused on coarse-grained measures such as typing speed. These suffer from the lack of specificity associated with tapping measures. Other promising work has looked at more detailed timing measures. However, this work has continued to ignore the identity of the sequences being typed. Different sequences of keys present different motor challenges due to the position of the keys on the keyboard and the hand used. The extent to which typing these will be facilitated by prior automatisation depends on the relative frequency with which they have been typed (Behmer and Crump, 2016). We therefore expect consideration of the content of typing to be critical in picking up on PD-related changes.
We describe a method for using a combined convolutional neural network (CNN) and long short-term memory (LSTM) network to distinguish people with PD from age-matched people with no diagnosis. One motivation for this choice is that we want to pick up on the fine temporal details of the sequential data, in order to provide a measure that will be specific to PD. However, the difficulty in picking up on such subtle information is that any typing dataset will also contain cruder information, such as the average differences in overall timing across participants. In order to tackle this, we normalise all temporal variables using robust scaling: subtracting each participant's median value from all their datapoints and dividing by their interquartile range. We thereby require our network to pick up on more subtle and potentially disease-specific information.
The contributions of this article are as follows:
• We introduce a new task and data type of urgent clinical importance.
• We show that when we remove coarse-grained differences between people with PD and controls we are able to detect a strong (and, we suggest, more disease-specific) signal.
• We provide evidence that adding character information to a CNN-LSTM that contains only timing information improves performance across datasets.

Related work
A simple motor test that we might consider a precursor to the use of typing is the alternating finger tapping test (Burns and DeJong, 1960). This test involves asking a person to alternately tap an index finger in two locations a set distance apart on a surface or on a keyboard (Giovannoni et al., 1999). Noyce et al. (2014) report that the number of key taps in 30 seconds (averaged across hands) can be used to distinguish patients from controls, identifying 50% of true positives with only 15% false positives. Using the same measure (selecting the worst performing limb in patients and comparing it with the best performing limb in controls), Hasan et al. (2019) report an AUC of 0.87. While tapping tests are widely used they suffer from a lack of specificity to PD. While they distinguish people with PD from control participants with considerable success, there is good reason to think that they will struggle to distinguish PD from other neurodegenerative disorders. Roalf et al. (2018) report an AUC of 0.68 in distinguishing people with PD from people with Alzheimer's disease using a single tapping test.
In pursuit of an easier-to-gather alternative to finger-tapping tests, Austin et al. (2011) examine the interkey intervals (IKIs) of people typing usernames while logging in to a website. They found a moderate-to-strong correlation between the participants' median IKIs during typing and the mean time between finger taps in a 10 second period. Building on this, Giancardo et al. (2016) used key-hold times during transcription typing in order to distinguish 42 people with recently-diagnosed PD (off medication) from 43 controls. Properties of the distribution of hold times for each patient were used in an ε-support vector regression to generate a unique score. In a two-fold cross-validation this achieved a combined AUC of 0.81, comparable to an AUC of 0.75 achieved with an alternating finger tapping test on the same sample. Adams (2017) logged key events during regular computer use over an extended period by 20 patients and 33 controls. Information about hold times and IKIs, including measures of variance and of asymmetry between hands, was used in a classification ensemble of eight different classification methods. This ensemble, trained on the new data, achieved an AUC of 0.97 on the 85 participants from Giancardo et al. (2016).
All of the work described above has represented typing behaviour with summary statistics rather than as sequences. Furthermore, it has analysed the timing of keystrokes without considering what is being typed. The one exception to this latter point is the work of Bannard et al. (2019), who look at the accuracy of typing while copying text using engineered features. They predict that people with PD, while making more errors in general, should make fewer 'habit slips'. This is when a well-learned sequence of key presses is produced in an inappropriate context, such as typing t-h-i-n-g when the intended word is t-h-i-n because -i-n-g is a frequent sequence. They find that this is the case, and that adding this information to a generalised additive regression model predicting disease progression gives an improvement in fit relative to a model just including timing information.

Datasets
We perform analyses of the following three datasets, representing two different usage contexts (recruited and tested in a clinic, and recruited and tested remotely online) and two different languages: English and Spanish. All participants were tested via a browser-based app which presents a series of sentences to be copy-typed, and collects the identity and timing of each key-press. All datasets contain information about key down timing (when the typist pressed each key), and the online-recruited dataset additionally contains information about key up timing (when they released each key). Summary statistics are found in table 1.

In-clinic English copy-typing
Sixteen patients and 25 age-matched controls were recruited and tested during a visit to a hospital clinic in the UK (see Bannard et al. 2019). Patients were recruited to be in the early stages of PD (Hoehn-Yahr stages 0–2.5, UPDRS < 20 in the medicated state), with normal cognitive function and < 5 years from a confirmed diagnosis. All patients were asked to type 15 sentences, all of which were taken from English-language Wikipedia articles, and ranged from 10 to 25 words (average of µ = 19 words) in length. The experimental protocol was approved by NHS Health Research Authority (no. STH18662TK). All participants were tested twice; for patients, this was once before taking their morning medication and once after. On a five point self assessment of their typing ability, ranging from none to secretarial proficiency, control participants reported an average 3.1 (4% no experience) and patients reported an average 2.7 (12% no experience).

In-clinic Spanish copy-typing
Eleven patients and nine age-matched controls were tested during a visit to a hospital clinic in Spain (see Bannard et al. 2019). The inclusion criteria for patients were the same as for the clinic-tested English sample. All patients were asked to type 30 sentences, all of which were taken from Spanish language Wikipedia articles, and ranged from 12 to 25 words (average of µ = 18 words) in length. The experimental protocol was approved by HM Hospitales, Spain (no. 14.11.710-GHM). Participants were tested only once. Six of the patients were tested prior to taking their morning medication and five after. On a five point self assessment of their typing ability, ranging from none to secretarial proficiency, control participants reported an average 3.9 (0 had no experience) and patients reported an average 3.7 (0 had no experience).

Online English copy-typing
For this newly-collected dataset, 130 controls and 100 people with PD were recruited and tested online. The people with PD were recruited via the recruitment service of a major US-based Parkinson's charity. The control participants were recruited via a participant recruitment service. All participants were aged between 50 and 90 and identified as resident in the US. Patients were recruited to be self-reportedly in the early stages of PD (Hoehn-Yahr stages 0 − 3 as indicated by responses to a questionnaire), and within five years of a diagnosis. The sentences typed were the same as those typed by the in-clinic English sample. The experimental protocol was approved by the University of Liverpool Ethics Committee (no. 4572). Of the 100 people with PD, 24 reported that they either do not take medication or had not taken any medication yet that day. On a five point self assessment of their typing ability, ranging from novice to expert, control participants reported an average 2.6 (13% novices), medicated people with PD an average 3 (4% novices), and unmedicated people with PD an average 3.1 (8% novices). Note that in contrast to what we see in the clinic-collected datasets, the people with PD here rate their typing ability more highly than the controls. Unlike the in-clinic samples, this dataset contains information about both key down timing and key up timing.

Method
We implement a neural language model (NLM) which receives two different types of information in a variety of combinations:
(1) Character identity information: one-hot encoded character sequences, or character sequences encoded with a continuous bag-of-words model (Mikolov et al., 2013).
(2) Keypress timing information: the inter-key interval (IKI), the time elapsed between consecutive key down events; the hold-time, the time elapsed between the key down and key up events for a specific character; and the pause, the time difference between the key up event of one key press and the key down event of the next.
The temporal information is shown pictorially in fig. 1.

Figure 1: Pictorial description of the compression and release of keyboard keys and the temporal information that results from those actions (hold-time, pause and inter-key interval, shown along a time axis).
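The three timing measures can be illustrated with a minimal sketch. It assumes that each key press is logged as a character with key-down and key-up timestamps in milliseconds, already sorted by key-down time; the `KeyEvent` type and the `timing_features` function are hypothetical names introduced here for illustration and are not taken from the released code.

```python
from typing import List, NamedTuple, Optional, Tuple


class KeyEvent(NamedTuple):
    """One logged key press (hypothetical record layout)."""
    char: str
    down: float            # key-down timestamp in ms
    up: Optional[float]    # key-up timestamp in ms, None if not recorded


Features = Tuple[Optional[float], Optional[float], Optional[float]]


def timing_features(events: List[KeyEvent]) -> List[Features]:
    """Return (IKI, hold-time, pause) per key; None where undefined."""
    feats = []
    prev = None
    for ev in events:
        # IKI: time between consecutive key-down events.
        iki = ev.down - prev.down if prev is not None else None
        # Hold-time: key-down to key-up of the same key.
        hold = ev.up - ev.down if ev.up is not None else None
        # Pause: key-up of the previous key to key-down of this key.
        pause = (ev.down - prev.up
                 if prev is not None and prev.up is not None else None)
        feats.append((iki, hold, pause))
        prev = ev
    return feats
```

For datasets without key-up timestamps, only the IKI is defined in this sketch, which mirrors the situation of the in-clinic datasets.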
Different timing information is available in different datasets as reported in §3. We adapt our data representation accordingly. We are interested here in the value of character information over timing, and examine its utility by building models with just timing and then with timing and character.
Our approach is inspired by the recent work of Kim (2014), Kim et al. (2016) and Zhang et al. (2015). The main component is the temporal convolutional module (Zhang et al., 2015), which computes a one-dimensional (1D) convolution over characters. Convolutional neural networks (CNN) employ layers with convolving filters (Kim, 2014) which are applied to local features (derived in our case from the character and timing information listed above).
Diverging from their approach, we use a smaller number of convolutional layers followed by a bidirectional long short-term memory (LSTM) layer (Schuster and Paliwal, 1997). As such, our architecture is able to extract both local and global features as described in the work by Zhou et al. (2015) who utilise a similar architecture. For a detailed description of the model architecture see fig. 2 and appendix B.
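To make the temporal convolution concrete, the following is a minimal sketch of a single 1D filter sliding over a sequence of per-key feature vectors, followed by a ReLU. It is an illustration only: the function name, the single-filter setup, the absence of padding and the choice of ReLU are assumptions made here, and the sketch does not reproduce the layer configuration of the actual network (see fig. 2 and appendix B for that).

```python
from typing import List, Sequence


def conv1d(seq: Sequence[Sequence[float]],
           kernel: Sequence[Sequence[float]],
           bias: float = 0.0) -> List[float]:
    """Slide one filter over a sequence of feature vectors.

    seq    : K time steps, each a vector of m features.
    kernel : w time steps, each a vector of m weights.
    Returns K - w + 1 activations (no padding), passed through a ReLU.
    """
    width = len(kernel)
    out = []
    for t in range(len(seq) - width + 1):
        total = bias
        for i in range(width):
            # Dot product between the filter row and the input row it covers.
            for x, w in zip(seq[t + i], kernel[i]):
                total += x * w
        out.append(max(0.0, total))  # ReLU non-linearity
    return out
```

Each output value depends only on a window of `width` consecutive keys, which is what makes the convolutional features local; the bidirectional LSTM that follows is what combines them across the whole sentence.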

Data representation
Our model takes sentences (as sequences of characters and/or key press timing information) as input. Before introducing the construction process of sentence sequences we shall give a detailed description of their elements.
First, for the sake of comparison, we conduct experiments with two different character-identity representations. The default representation is one-hot encoding of characters where each unique character is associated with an index $i$ such that the representation of a character is a binary vector $c$ where $c_i = 1$ and $c_j = 0, \forall j \neq i$. We also evaluate using a continuous vector representation of characters, which is an adaptation of the commonly-used continuous bag-of-words (CBOW) embedding (Mikolov et al., 2013). While for word embeddings the CBOW algorithm learns the representation by predicting words from the surrounding context, our character level adaptation utilises the same algorithm but for the task of predicting characters from their context. We learn the character embeddings from a corpus of 100,000 Wikipedia articles, such that we obtain a character dictionary where each character is associated with a unique continuous vector representation of 50 dimensions.
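The one-hot representation can be sketched as follows. The helper names `build_index` and `one_hot` are hypothetical, and assigning indices in sorted character order is an arbitrary choice made for this illustration; the CBOW variant is not shown.

```python
from typing import Dict, List


def build_index(alphabet: str) -> Dict[str, int]:
    """Assign each unique character an index i (sorted order, by assumption)."""
    return {ch: i for i, ch in enumerate(sorted(set(alphabet)))}


def one_hot(ch: str, index: Dict[str, int]) -> List[int]:
    """Binary vector c with c_i = 1 for the character's index and 0 elsewhere."""
    vec = [0] * len(index)
    vec[index[ch]] = 1
    return vec
```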
The datasets in §3 contain, in addition to the characters used, a timestamp for each character key-down press, $t^{d}$. However, only for the online English copy-typing dataset in §3.3 are timestamps for key-up events, denoted $t^{up}$, available. We define the order of a character sequence as the order of the associated key-down timestamps indexed by $k$ such that $t^{d}_{k-1} \le t^{d}_{k}, \forall k \in \{1, \dots, K\}$. In most end-to-end deep learning one typically omits feature engineering and lets the networks learn feature representations from large datasets. Here, however, we are dealing with relatively small (in the context of deep learning) datasets. In particular the in-clinic English and Spanish datasets contain just 1165 and 575 sentences respectively (see table 1 for further details). To aid learning we thus engineer a set of three features from the key press timestamps.
As discussed in §1 and §2, we expect the effects of bradykinesia among people with PD (PwPD) to result in differences in the average timings of keystrokes. However, the goal of our experiments is not to maximise performance on any one dataset, but rather to find evidence of more robust PD-specific typing characteristics that can help improve the specificity of PD detection systems. To this end we attempt to mute the coarse-grained, between-group differences in our data by employing participant-level standardisation of all timing related features. This is done by computing the median and interquartile range of all key press timing features for each participant, and then robustly scaling their corresponding sentences by subtracting the median and dividing by the interquartile range.
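A minimal sketch of this participant-level robust scaling is given below. It assumes that the values of one timing feature (for example, all IKIs) have been pooled per participant, and it uses the "inclusive" quartile definition from Python's `statistics` module; the function names and the quartile convention are assumptions made for illustration, since the text does not specify them.

```python
from statistics import median, quantiles
from typing import Dict, List


def robust_scale(values: List[float]) -> List[float]:
    """Subtract the median and divide by the interquartile range."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate case (no spread): return zeros rather than divide by 0.
        return [0.0 for _ in values]
    return [(v - med) / iqr for v in values]


def scale_by_participant(data: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Apply robust scaling separately to each participant's values."""
    return {pid: robust_scale(vals) for pid, vals in data.items()}
```

Because the median and interquartile range are computed per participant, a uniformly slower typist and a uniformly faster typist end up with the same scaled values, which is the coarse-grained difference the scaling is meant to remove.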
Finally a sentence is represented as

$$X_n = x_{n,1} \oplus x_{n,2} \oplus \dots \oplus x_{n,K_n},$$

where $\oplus$ is the concatenation operator, $n$ indexes each sentence, $k$ indexes each character within each sentence and $x_{n,k}$ is the character identity encoding vector (one-hot or CBOW) appended with the timing features associated with the key press of that character. The longest sentence in any dataset has length $K_{max}$, and any encoded sentence with $K_n < K_{max}$, $\forall n \in \{1, \dots, N\}$, is padded with $|K_{max} - K_n|$ all-zero vectors so that all encoded sentences $X_n$ have the same size: $X_n \in [0, 1]^{K_{max} \times m}$.
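The construction of a padded sentence matrix can be sketched as follows, assuming the character vectors and timing vectors have already been computed for each key press. The function name `encode_sentence` is hypothetical, and padding at the end of the sentence is an assumption of this sketch.

```python
from typing import List, Sequence


def encode_sentence(char_vectors: Sequence[Sequence[float]],
                    timing_vectors: Sequence[Sequence[float]],
                    k_max: int) -> List[List[float]]:
    """Build a K_max x m matrix for one sentence.

    Each row is a character encoding with its timing features appended;
    rows beyond the sentence length are all-zero padding.
    """
    if not char_vectors or len(char_vectors) != len(timing_vectors):
        raise ValueError("need one timing vector per character vector")
    rows = [list(c) + list(t) for c, t in zip(char_vectors, timing_vectors)]
    if len(rows) > k_max:
        raise ValueError("sentence longer than k_max")
    m = len(rows[0])
    # Pad with all-zero vectors up to the length of the longest sentence.
    rows.extend([0.0] * m for _ in range(k_max - len(rows)))
    return rows
```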

Text pre-processing
Here we will briefly discuss the most important preprocessing steps. The complete procedure, with detailed description, can be found in appendix A. First, following the recommendation of Zhang et al. (2015), all sentences are converted to lower-case. Second, in this study we partially 'implement' the error correction employed by the participant. While we are interested in the errors that participants make, and indeed Bannard et al. (2019) show that the error types made can be indicative of disease status, we assume that the process by which they notice and correct those errors will be idiosyncratic and not informative regarding our classification goals. Consider the following example sentence, taken from the dataset described in §3.1:

Books include Penguin Island, a satire on the F←Dreyfus afffair←←←←air.

Here the user has employed five corrective actions (backspaces), which we indicate with ←. For each sentence we implement and then delete all but one of these backspace actions, leaving only the first errorfully pressed key (the first f in the fff) and a single backspace symbol. Thus the text becomes:

Books include Penguin Island, a satire on the F←Dreyfus afff←air.
The single correction character is left in the text to be used as an indicator for the NLM in the downstream classification task. When only a single corrective action occurs it is simply left unamended, as shown above. For an example see fig. 5, where the correction character passed to the NLM is ω.
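One way to read the correction-collapsing step is sketched below. For a run of n consecutive backspaces, the sketch removes the n characters that the run deletes, then keeps the first of those characters followed by a single correction marker. The token used for a raw backspace, the marker symbol, the function name, and the rule that a run never reaches back past an earlier marker are all assumptions made for this illustration; appendix A describes the actual procedure.

```python
from typing import List

BACKSPACE = "\b"   # raw backspace token in the key log (assumed encoding)
MARKER = "←"       # single correction symbol kept in the text (arbitrary choice)


def collapse_corrections(keys: List[str]) -> str:
    """Collapse each run of backspaces to one erroneous key plus one marker."""
    out: List[str] = []
    i = 0
    while i < len(keys):
        if keys[i] != BACKSPACE:
            out.append(keys[i])
            i += 1
            continue
        # Measure the length of this run of consecutive backspaces.
        j = i
        while j < len(keys) and keys[j] == BACKSPACE:
            j += 1
        run = j - i
        # Count deletable characters, stopping at an earlier marker.
        deletable = 0
        while (deletable < run and deletable < len(out)
               and out[-1 - deletable] != MARKER):
            deletable += 1
        if deletable:
            first_error = out[len(out) - deletable]
            del out[len(out) - deletable:]
            out.append(first_error)   # keep only the first erroneous key
        out.append(MARKER)            # keep a single correction marker
        i = j
    return "".join(out)
```

Under this reading a single backspace leaves the text unchanged apart from the marker replacing the raw backspace token, which matches the statement that single corrective actions are left unamended.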

Experimental setup
The purpose of our experiments is to understand the effect of including character information in the classification of PD patients, when employing copy-typing as a diagnosis medium. Using the model discussed in §4 we conduct multiple binary-classification experiments to distinguish sentences written by people with PD (PwPD) from those written by age-matched controls. The same exercise is undertaken to classify participants themselves. We evaluate performance by measuring the area under the receiver operating characteristic curve (AUC). This is a common approach when dealing with a two-class prediction problem (binary classification), in which the outcomes are labelled either as positive (PwPD) or negative (control). The AUC scores reported in §6 are calculated on the test sets. For sentence classification we use participant-level five-fold cross-validation, ensuring that sentences from any one participant do not exist in both the train and test set. We report the mean and standard deviation over folds. For participant classification we aggregate the sentence classification probabilities using logistic regression with leave-one-out cross-validation and employ bootstrapping to report mean and standard deviation. The model is applied to the datasets described in detail in §3, with summary statistics given in table 1.
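Two ingredients of this evaluation can be sketched in a few lines: assigning whole participants (rather than individual sentences) to folds, and computing the AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function names, the round-robin fold assignment and the fixed random seed are assumptions of this sketch; the logistic-regression aggregation and the bootstrapping are not shown.

```python
import random
from typing import Dict, Hashable, List, Sequence


def participant_folds(participants: Sequence[Hashable],
                      n_folds: int = 5,
                      seed: int = 0) -> Dict[Hashable, int]:
    """Map each participant id to one fold, so that all of a participant's
    sentences land on the same side of every train/test split."""
    ids = sorted(set(participants))
    random.Random(seed).shuffle(ids)
    return {pid: i % n_folds for i, pid in enumerate(ids)}


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    labels: 1 for positive (PwPD), 0 for negative (control).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A sentence belongs to the test set of fold f exactly when its participant is mapped to f, which is what prevents sentences from one typist appearing in both the train and test sets.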
We conduct hyperparameter search, model introspection and ablation studies. Each dataset is preprocessed according to the procedures outlined in §4.1 and §4.2, and split into train, test and validation sets. This partitioning reduces the number of samples which can be used for learning the model. Our datasets are small compared to those typically used for deep learning. We deal with this in multiple ways, as detailed in appendix C.

Results
Our main experiments, as outlined above, involve the use of timing information that has been robustly scaled at the participant level in order to remove coarse-grained differences between groups. To aid understanding of the data, however, we will first report the performance of a classifier that uses the information that we have removed (the median and interquartile range of keypresses) as the sole features. The AUCs for logistic regressions using these features as predictors can be seen in table 3.

Table 3: Results from logistic regression models with median and interquartile range for inter-key intervals as features. We report mean AUC (and SDs) for medicated PwPD vs. controls in the On column and unmedicated PwPD vs. controls in the Off column. The Spanish PwPD are mixed in medication status but treated as a single group due to the small sample size.

Dataset              Off            On
In-clinic English    0.76 (0.14)    0.76 (
In-clinic Spanish    0.91 (0.09)    N/A

As can be seen, the performance of these classifiers is good in some cases, particularly for the in-clinic Spanish dataset. However, performance is variable, being poorest for the online dataset. This pattern of results is to be expected and our goal here is not to surpass their performance but to see how we can perform with more PD-specific features. The results for our main models, using the robust-scaled data, are reported in table 2 and fig. 3. For all datasets, the addition of character information gives an improvement in performance over timing-only models. The dataset on which the simple IKI summary-statistic models reported above do worst (the online English dataset) is the dataset on which the best performance is reported here. This is likely because it is the largest dataset and thus the best suited to deep learning methods. This suggests that performance improvements will be possible for the network models with larger datasets.

Model interpretation
Deep learning models are often criticised for being black box machines and the interpretation of deep learning techniques is a growing area of interest (Buhrmester et al., 2019). We apply one such technique, Gradient-weighted Class Activation Mapping (Grad-CAM; Selvaraju et al. 2017), to our model. Grad-CAM is commonly used to analyse how CNN-based computer vision models make decisions and highlight regions in the image that the model deems important. We repurpose Grad-CAM to analyse which part of a sentence a CNN-based NLM deems important for classification. For illustration we have included an example visualisation where we have applied our model to a sentiment analysis task where Grad-CAM highlights the parts of the sentence that indicate it should be classified as having positive sentiment (see fig. 4). We use the same approach to produce visualisations that highlight the parts of a sentence that our model uses to make a distinction between PwPD and controls.
Figure 4: Example of Grad-CAM visualisation where we apply our network to a sentiment analysis task (input sentence "the mic is great", colour scale from 0 to 1). Here we see the Grad-CAM highlighting the word "great" as important for determining that the sentence has positive sentiment.
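For a one-dimensional convolutional layer, the Grad-CAM computation of Selvaraju et al. (2017) reduces to two steps: average the gradient of the class score over positions to obtain one weight per channel, then take the ReLU of the weighted sum of channel activations at each position. The sketch below assumes that the layer activations and the gradients of the class score with respect to them have already been extracted from the network; the function name and the final rescaling to the range 0 to 1 (a common visualisation step) are assumptions made here.

```python
from typing import List, Sequence


def grad_cam_1d(activations: Sequence[Sequence[float]],
                gradients: Sequence[Sequence[float]]) -> List[float]:
    """Grad-CAM over a sequence.

    activations[t][c] : activation of channel c at position t.
    gradients[t][c]   : d(class score) / d(activations[t][c]).
    Returns one importance value per position, rescaled to [0, 1].
    """
    n_pos = len(activations)
    n_ch = len(activations[0])
    # Channel weights: gradients averaged over all positions.
    alphas = [sum(gradients[t][c] for t in range(n_pos)) / n_pos
              for c in range(n_ch)]
    # Weighted sum of channels at each position, followed by ReLU.
    cam = [max(0.0, sum(alphas[c] * activations[t][c] for c in range(n_ch)))
           for t in range(n_pos)]
    peak = max(cam)
    return [v / peak for v in cam] if peak > 0 else cam
```

Positions with values near 1 are those whose features most increase the score of the chosen class, which is how keys are coloured in the visualisations discussed in this section.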
Grad-CAM plots for example PwPD and control participants from the online English dataset for each of our sentences can be found in appendix D. These images show the gradients of the second and final convolutional layer for the positive diagnosis ("typist has PD") classification class. Grad-CAM visualisations for all participants and sentences, all convolutional layers and all classification classes, can be accessed at http://typingresearch.com/conll2020/. An illustrative two-word excerpt from one of our sentences typed by a single PwPD can be seen in fig. 5. The first word, "different" (typed with a single corrected error on the second character by this typist), is mostly blue, indicating that there is little in the sequence of keystrokes that the model takes as indicating that it was typed by a PwPD, while the second word, "pronunciation", spans more colours, indicating that it contains keystrokes that are indicative of its being typed by a PwPD.
Looking across participants we see that certain parts of the sentences are consistently more important for classification than others, as indicated by their having gradients that diverge between PwPD and controls. The first thing that this illustrates is that key identity matters, confirming the conclusions of our ablation study. It also allows us to look at what properties the most discriminative key sequences have in common. While no single property can be identified as the clearest marker of PD, we can identify suggestive patterns that are useful in understanding typing in this population. This exploration serves as an illustration of how model interpretation can provide potential mechanistic hypotheses to be tested in future work.
One pattern that the network seems to pick up on can be seen in fig. 5 by observing the changes in gradients for the keys with respect to the inter-key intervals seen below the sequence. There is a sequence of keys with high gradients at the end of the first word and beginning of the second that have relatively low IKIs. A notable property of this subsequence is that each of the keys is typically typed with a different hand from the previous key. Figure 6 shows the means and standard deviations for (robustly-scaled) key hold times and inter-key intervals for the word "pronunciation". The letters are colour coded according to whether the character is typed with the same hand (red) as the preceding character or the other hand (black). This is based on the approximation that the leftmost 5 columns of the keyboard (from Q, A and Z to T, G and B) are typed with the left hand and the rightmost 5 columns are typed with the right (Feit et al., 2016). Moving between keys when switching hands is fairly straightforward to perform, while switching between keys with the same hand requires considerable agility. We observe in our data that the typing speed of PwPD is affected by this more than that of the controls. The network picks up on this and has a tendency toward higher gradients at between-hand transitions where the typing speed has a relative dip for a participant. Figure 7 provides further illustration of this widespread pattern.

Figure 7: Key transitions for a single example sentence plotted along three dimensions. The y-axis indicates the discriminative ability of each key transition, indicated by the t-value for a comparison of the gradients for that keypress in patients and controls, such that a high value indicates that patients have consistently higher gradients than controls. The x-axis represents the extent to which the distribution of scaled IKIs for each keystroke differs between patients and controls, again using a t-test (so that a high value indicates that patients have more consistently higher scaled IKIs than controls). The circles containing bigrams that involve a within-hand transition are shown in red and the circles containing across-hand-transition bigrams are shown in blue. The bigrams that have the highest values on the y scale (that have gradients that are most consistently higher in patients than in controls) are those that have a lower value on the x scale (they are associated with a relative dip in IKI in patients that is consistently more pronounced than anything seen in controls) and are shown in blue (involve a between-hand transition). This is apparent from the high ratio of blue to red circles in the top left quadrant and indicates that the model takes a dip in inter-key intervals for between-hand-transition bigrams as a marker of PD.
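The within-hand versus between-hand labelling of key transitions can be sketched as follows, using the approximation described in the text that the leftmost five letter columns are typed with the left hand and the rightmost five with the right. The sketch assumes a standard QWERTY layout and handles letters only; the names `hand` and `transition_type` are hypothetical.

```python
from typing import Optional

# Leftmost five columns (Q/A/Z through T/G/B) and rightmost five columns,
# letters only, on an assumed QWERTY layout.
LEFT_HAND = set("qwertasdfgzxcvb")
RIGHT_HAND = set("yuiophjklnm")


def hand(ch: str) -> Optional[str]:
    """Return 'L' or 'R' for a letter, or None for any other key."""
    ch = ch.lower()
    if ch in LEFT_HAND:
        return "L"
    if ch in RIGHT_HAND:
        return "R"
    return None


def transition_type(prev: str, cur: str) -> Optional[str]:
    """Label a bigram as a 'within' or 'between' hand transition."""
    h_prev, h_cur = hand(prev), hand(cur)
    if h_prev is None or h_cur is None:
        return None  # spaces, punctuation and corrections are left unlabelled
    return "within" if h_prev == h_cur else "between"
```

Applied to the word "pronunciation", this labels p-r and r-o as between-hand transitions and o-n, n-u and u-n as within-hand transitions.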
A second property of key sequences that appears to be important is their transitional probability. There is good reason to think that PwPD will have difficulty deploying learned habits in typing. We know that the timing of keystrokes in typists is sensitive to the transitional probabilities between keys (Behmer and Crump, 2016), and we can take this as a marker of acquired habits. We would expect this relationship to be altered in PwPD. Mixed effects modelling with by-participant random intercepts and slopes confirms that this is the case in our data, with an increase of 26% of an IKI interquartile range for each unit of standard deviation in bigram surprisal (negative log probability of each character given the previous character) for controls, and a significant 3% lower increase in PwPD (p < 0.01) across all participants. This indicates a reduced sensitivity to decreases in key transition probabilities in PwPD relative to controls. It is also the case that gradients are significantly higher for keys with high surprisal. Figure 8 displays the three-way relationship between gradient divergence, IKI divergence and surprisal and suggests that the model is picking up on the reduced effect of transitional probabilities on inter-key intervals for PwPD relative to controls.

Figure 8: Contour, from red (low) to yellow (high), of gradient divergence (t-value for PwPD-Control comparison) for different values of scaled IKI divergence (t-value for PwPD-Control comparison again) and bigram surprisal. Gradient divergence is greatest for key transitions with high surprisal for which PwPD have low scaled IKI relative to controls. The model appears to pick up on the dampening of surprisal-related IKI spikes for PwPD relative to controls.
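Bigram surprisal, the negative log probability of a character given the previous character, can be estimated from character counts as in the sketch below. The reference corpus, the use of base-2 logarithms and the lack of smoothing are assumptions of this illustration; the standardisation of surprisal to units of standard deviation used in the mixed effects model is not shown.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def bigram_surprisal(corpus: str) -> Dict[Tuple[str, str], float]:
    """Return -log2 P(cur | prev) for every bigram observed in the corpus."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    # Count each character as a context (every position except the last).
    prev_counts = Counter(corpus[:-1])
    return {
        (prev, cur): -math.log2(count / prev_counts[prev])
        for (prev, cur), count in pair_counts.items()
    }
```

Frequent transitions receive low surprisal and rare ones high surprisal, so a typist who relies on learned habits would be expected to show longer IKIs as surprisal increases.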

Conclusion
In this paper we have provided evidence that natural language processing techniques, and in particular CNN-LSTM networks, can identify markers of Parkinson's disease in logged typing behaviour. Critically, there is good reason to think that the markers identified will have high specificity with regard to Parkinson's disease. While simple motor tests like the finger tapping test, and summary timing statistics from typing data, are widely used to distinguish PwPD from people without the disease, they rely on disease signs that PD has in common with other disorders, namely general slowing. In this work we first remove this disease sign from the data and then use a CNN-LSTM to pick up on more subtle changes in performance. We report very promising performance using this approach. We further report on an analysis of the gradients in our model which suggests that it is picking up on plausible effects of PD seen in the data. Previous work has sought to distinguish PwPD from controls by observing how rapidly and consistently they press keys when typing. However, such work begins by discarding potentially valuable information: the identity of the keys pressed. We found that including key identity in our data/model provided a performance improvement relative to timing-only models. We found an improvement in performance (increased AUC) in identifying patients among participants tested in clinics in both English and Spanish. Furthermore, we found a substantial leap in performance on the more difficult task of discriminating PD patients from controls in a new large dataset recruited and tested online.
These results suggest that NLP techniques allow us to identify theoretically-motivated markers of PD (Redgrave et al., 2010) in typing data. These incorporate both speed and character information, and so may be more robust than currently-used markers. Future work will of course require that we test this directly, by collecting typing data from people with other neurological disorders and using these markers for multi-class classification. This work is only the tip of the iceberg in terms of the contribution that NLP can make to the task of detecting signs of Parkinson's disease, and potentially other movement disorders, in typing data.

C Model training
To counteract over-fitting we use an array of standard techniques including dropout (Srivastava et al., 2014), weight regularisation and early stopping (Goodfellow et al., 2016, §7). Additionally we employ a task-specific training schedule to aid feature learning in the convolutional layers. We first split our sentences into word pairs such that we effectively increase the number of samples, e.g. Books include Penguin Island... → [Books include, Penguin Island, ...]. Given that the convolutional layers operate locally on the sentence, we can then pre-train the filters on this augmented dataset in a more stochastic optimisation process. We then use the standard protocol (Chollet, 2017, §5.3) for transfer learning by freezing the convolutional filter weights before training ensues on the sentence datasets until performance on the validation set stops improving. Finally we unlock the convolutional filters and re-start training on the sentence dataset with a lower learning rate and larger batch size. The models are trained via Adam optimisation (Kingma and Ba, 2014) over shuffled mini-batches, with early stopping terminating training if validation loss does not improve for 16 epochs. The initial learning rate is set to 0.001 and is decreased by a factor of 0.5 if the validation loss does not improve for 10 epochs. We use a batch size of 16 for the word-pair convolutional filter pre-training and the first round of training on the sentence datasets; for the second tuning round we increase the batch size to 32 and start with a learning rate of $10^{-4}$. For regularisation we use dropout with probability 0.5 and L2 regularisation with factor $10^{-6}$ on all convolutional layers.
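Two parts of this schedule lend themselves to a short sketch: the word-pair augmentation and the combination of learning-rate reduction with early stopping. The sketch below is one possible reading rather than the training code itself. It assumes non-overlapping word pairs, counts "no improvement" from the last epoch at which the validation loss reached a new minimum, and restarts the learning-rate counter after each reduction; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def word_pairs(sentence: str) -> List[str]:
    """Split a sentence into consecutive, non-overlapping word pairs."""
    words = sentence.split()
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]


def plateau_schedule(val_losses: Sequence[float],
                     lr: float = 1e-3,
                     factor: float = 0.5,
                     lr_patience: int = 10,
                     stop_patience: int = 16) -> Tuple[int, List[float]]:
    """Replay a validation-loss history.

    Returns the index of the last epoch run and the learning rate that
    was in effect during each epoch.
    """
    best = float("inf")
    since_best = 0    # epochs since the validation loss last improved
    since_drop = 0    # epochs since improvement or the last lr reduction
    lrs: List[float] = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)
        if loss < best:
            best, since_best, since_drop = loss, 0, 0
            continue
        since_best += 1
        since_drop += 1
        if since_drop >= lr_patience:
            lr *= factor          # halve the learning rate on a plateau
            since_drop = 0
        if since_best >= stop_patience:
            return epoch, lrs     # early stopping
    return len(val_losses) - 1, lrs
```

With a sentence of odd length the final element returned by `word_pairs` is a single word; whether such fragments are kept or dropped is not specified in the text.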