Grammatical error detection in transcriptions of spoken English

We describe the collection of transcription corrections and grammatical error annotations for the CROWDED Corpus of spoken English monologues on business topics. The corpus recordings were crowdsourced from native speakers of English and from learners of English with German as their first language. The new transcriptions and annotations were obtained from a different set of crowdworkers: we analyse the 1108 new crowdworker submissions and propose that they can be used for automatic transcription post-editing and for grammatical error correction in speech. To explore the data further, we train grammatical error detection models in various configurations, including pre-trained and contextual word representations as input, additional features and auxiliary objectives, and extra training data from written error-annotated corpora. We find that a model concatenating pre-trained and contextual word representations as input performs best, and that the additional information does not lead to further performance gains.


Introduction
We introduce a new resource for speech-centric natural language processing (speech NLP): more than a thousand transcriptions and error annotations for 383 distinct recordings from the CROWDED Corpus. CROWDED is a crowdsourced English corpus of short monologues on business topics, recorded by both native and non-native speakers. It was created in response to the lack of speech corpora freely available for research use, and the lack of appropriate native speaker reference corpora with which language learners' exam monologues can be compared. In this new project, crowdworkers were asked first to correct existing speech transcriptions and then to edit the resulting transcriptions to make them more fluent. These new annotations enable both post-editing of noisy speech transcriptions, such as might come from automatic speech recognisers, and grammatical error correction for spoken English.
There has been a marked increase in openly-available NLP resources in recent years, especially for English, but these have on the whole been sourced from written texts. Public resources for speech NLP, on the other hand, are relatively scarce, even for English -though English is again by far the best served in this respect. Obtaining linguistic data from large, distributed, online workers ('crowdsourcing') has become a well-established practice. With the contribution of these new annotations, we note that the CROWDED Corpus now features a thousand recordings which have all been transcribed, and of which almost 40% now have improved transcriptions and error annotations thanks to this work. The entire corpus has been collated through crowdsourcing means.
In this paper we describe the method for collecting the new data, analyse the annotations received, and report on some initial grammatical error detection (GED) experiments. In the GED experiments we trial various configurations which have been successful in GED for written texts. We found that we can identify errors in the transcriptions fairly reliably using a publicly-available sequence labeller adapted to take contextual word representations as additional input, similar to previous work on GED in written corpora (Rei and Yannakoudakis, 2016; Bell et al., 2019). All new transcriptions and annotations referred to in this paper are made available on the corpus website 1 .

The CROWDED Corpus
The original collection of the corpus demonstrated that crowdsourcing is a fast, feasible and cost-effective method of corpus collation.
The speakers responded to 20 questions designed to prompt spontaneous monologues of up to one minute about imagined business scenarios. For instance, in the scenario of being asked to give advice about setting up a new shop, speakers were asked to respond to prompts such as the following: what do you think is the best location for a retail business? what are the most effective ways to advertise a new shop? why is it important to find good suppliers? There were 4 different scenarios: setting up a new shop, preparing for business visitors from a foreign country, running a taxi company, and the pros and cons of sponsoring sports events. Each scenario had 5 associated questions. Speakers who identified themselves as speakers of both languages were prompted to answer 10 questions in English and 10 in German; in the monolingual setting, speakers were asked to respond to all 20 questions in English.
There was a quality control filter on the collected recordings, flagging and removing any with poor audio quality or in which the speaker failed to respond appropriately. Approved recordings were passed to CrowdFlower workers for transcription, but only the English transcriptions were returned from CrowdFlower with acceptable quality: the German recordings were not successfully transcribed, either because crowdworkers falsely claimed to know the language, or because the job settings requiring German competence did not properly filter the worker pool. Therefore we focus on the English section of the corpus in the remainder of the paper, and intend to revisit the German data in future work. In what follows we describe the subset of recordings which have been error annotated since the initial release of the corpus, and report on our experiments to automatically detect grammatical errors in the transcriptions.

Crowdsourcing transcription correction & error annotation
Each recording in the CROWDED Corpus has been transcribed twice, but it is not clear how to resolve these into a single transcription version. In this new work, we use an ASR-based method for merging different transcription versions, developed by van Dalen et al. (2015). In that work the method was shown to produce a combined transcription which is more accurate than either version on its own, and the same is true for our data (section 3.2). We then uploaded the merged transcriptions to the Prolific platform (Palan and Schitter, 2018), along with the audio recordings, where crowdworkers reviewed the new transcriptions and edited them where necessary. We required that workers were educated to at least GCSE level (a U.K. exam aimed at 16-year-olds) or equivalent, had an approval rate of at least 95% from previous studies, and listed English as a first language. Prolific requires a fair rate of pay to workers, above or equivalent to the U.K.'s minimum hourly wage, but outputs tend to be of better quality than those from other crowdsourcing services (Peer et al., 2017). We asked workers to correct and annotate 12 transcriptions as a unit of work, a task estimated to take 30 minutes, for which they were paid £3.10 (equivalent to approximately US$4 at the time). Noting that there is a service fee payable to Prolific, in common with other services, our funding allowed us to pay 100 workers in total.
One drawback of crowdsourcing from the researcher's point of view is the lack of training and contact time with workers. On the other hand, its main advantages are the scale and speed of data collection, along with evaluations from population groups one might not normally reach in campus-based studies (Paolacci and Chandler, 2014; Difallah et al., 2018). Another issue is the bursty nature of responses: upon publishing the task there tends to be a rush of early submissions. The consequence is that the central record of transcription corrections does not keep pace with the issuing of work, and so a few transcriptions are issued to workers many times, whereas the intention was to limit the number of new annotations per item and thereby achieve greater coverage of the original corpus. This is a flaw which we will address in future work, either with a redesign of the workflow or with a shift to more expensive but more evenly-paced and controllable local annotation.
Besides transcription correction, we asked Prolific workers to apply minimal grammatical error corrections to their updated transcriptions, in order to make them "sound like something you would expect to hear or produce yourself in English". With this statement we intended to convey the error correction task to crowdworkers in a straightforward way without the use of jargon: to alter the text into something linguistically acceptable in that worker's judgement, without referring to notions of grammar or 'correctness' which may have particularly strong connotations for some.
We developed the annotation web-app in R Shiny (Chang et al., 2020), adapted from Caines et al. (2017a), with a simple user interface and text instructions kept to a minimum so as not to overload the workers with information. The relevant audio recordings were provided for unlimited playback, and workers were presented with the 'machine-made' transcription formed from the original two CrowdFlower transcriptions, along with the prompt for context. There were two text boxes: the first for the worker to edit the existing transcription so that it better matched the recording, and the second containing a copy of that updated transcription, ready for the second task of error correction. A screenshot of the transcription correction and error annotation web-app is shown in Figure 1.
After quality checks and capping the number of submissions for a single item at 10, we reduced our 1200 submissions from Prolific (12 submissions from 100 workers) to 1108 submissions covering 383 unique recordings (mean: 2.9 per recording) which were selected at random from the original thousand recordings in the CROWDED Corpus. Figure 2 shows a histogram of transcriptions and annotations per recording. This is the set of data we work with in the remainder of this paper, totalling 39.7K word tokens (mean: 35.9 per transcription).

Error-annotated CROWDED dataset
Of the 1108 new transcriptions we received from Prolific workers, 80% had been updated in some way compared to the merged versions of the original CrowdFlower transcriptions. We aligned the merged and corrected transcriptions using the ERRANT toolkit (Bryant et al., 2017), which lists the edit operations needed to transform one text into another. These lists indicate that across the whole dataset there were 12 edits, whether replacements, deletions or insertions, for every 100 tokens in the original transcriptions. For transcriptions from the native speakers of English in the dataset, corrections were applied to 7% of word tokens, whereas for learners of English the correction rate was 17%, indicating that transcription is a harder and consequently more error-prone task with learner English speech.
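As a concrete illustration, the per-100-token edit rate can be approximated with a simple token-level alignment. ERRANT itself uses a linguistically-informed alignment with typed edits, so this difflib-based sketch is only an approximation, with constructed example sentences:

```python
import difflib

def edit_rate(orig_tokens, corr_tokens):
    """Edits (replacements, deletions, insertions) per 100 source tokens,
    using difflib's longest-matching-block alignment. ERRANT counts
    individual typed edits, so its numbers will differ slightly."""
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=corr_tokens)
    edits = sum(1 for op, *_ in matcher.get_opcodes() if op != "equal")
    return 100 * edits / max(len(orig_tokens), 1)

orig = "i think the best location are near the station".split()
corr = "i think the best location is near the station".split()
print(round(edit_rate(orig, corr), 1))  # one edit over nine tokens: 11.1
```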
Taking these new Prolific transcriptions as the ground truth, we can calculate the word error rate (WER) of the original CROWDED transcriptions obtained from CrowdFlower. Recall that each recording was transcribed twice in the original study: let us randomly assign each version to a transcription set for comparison with the new transcriptions. Thus the mean WER is 19.4% for version 1 averaged across all new versions from Prolific, and for version 2 it is 18.5%. WER for the merged transcriptions drops to 12%, confirming the benefit of that method.
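WER here is the standard word-level minimum edit distance divided by the reference length. A minimal self-contained sketch (not the exact evaluation script used in this work), with made-up example strings:

```python
def wer(ref, hyp):
    """Word error rate: minimum number of substitutions, insertions and
    deletions to turn the hypothesis into the reference, divided by the
    reference length, computed by dynamic programming."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# one substitution ('location'/'locations') and one deletion ('the'): 2/7
print(wer("the best location is near the station",
          "the best locations is near station"))
```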
On average there are just under 3 Prolific submissions for each original CROWDED transcription, with a median of 2, a minimum of 1 and a maximum of 10. Within each recording's set of updated transcriptions, string similarity is high, indicating general agreement. We calculate one-versus-rest string distances using optimal string alignment, the restricted Damerau-Levenshtein distance (van der Loo, 2014), for each transcription against the other transcriptions for that recording. On average the string distance within transcription sets is 18 characters, set against an average transcription length of 200 characters. This is good, but also underlines the fact that transcription is a subjective process: not all transcribers perceive the same words, especially when the speaker is unclear or the audio is degraded 4 .
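The optimal string alignment distance is Levenshtein distance extended with transpositions of adjacent characters (with the restriction that no substring is edited more than once). In practice we used an existing implementation; the following is an illustrative version, together with a sketch of the one-versus-rest comparison:

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein):
    substitutions, insertions, deletions, plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def one_vs_rest(transcripts):
    """Mean OSA distance from each transcription to the others in its set."""
    return [sum(osa_distance(t, u) for j, u in enumerate(transcripts) if j != i)
            / (len(transcripts) - 1)
            for i, t in enumerate(transcripts)]

print(osa_distance("advertise", "advretise"))  # one transposition: 1
```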
Meanwhile, 80% of the corrected transcriptions were edited for grammatical errors in some fashion, at an average of 21 edits per 100 word tokens. Again we align and type the identified errors using ERRANT. Table 2 in the appendix follows the format of Table 4 in Bryant et al. (2019), listing the frequency of error types as proportional distributions. We present these statistics for the whole dataset, and then for the native speakers of English and the learners of English separately, with statistics from the FCE Corpus for comparison.
Note that the edit rate in grammatical error correction of the transcriptions is not greatly different between the native speaker and learner groups at 18.9% and 23.9% respectively. The native speakers do produce more word tokens per recording (51.3 versus 27.0) as might be expected in comparing fluent native speakers of any language to learners with varying levels of proficiency. The main differences in terms of edit types are that unnecessary word tokens occur in native speaker transcriptions more than they do in learner transcriptions, where the replacement error type is the most common.
The distribution of error types (the middle section of Table 2) is broadly similar between native speaker and learner groups, with more 'other' errors in native speaker transcriptions than in learner transcriptions, in which there are more determiner, preposition and verb errors. In other words, there are more of the formal errors in learner speech which are typically found in written learner corpora: for comparison the most common error types in the FCE Corpus (Yannakoudakis et al., 2011) are 'other', prepositions and determiners (Bryant et al., 2019), though the 'other' type is much less frequent at just 13.3% of the errors in that corpus. Punctuation and spelling are the next most frequent error types in the FCE, each forming more than 9% of the total edit count.
In our CROWDED annotations the most frequent error types are 'other', punctuation and nouns, followed by determiners, orthography and verb. The 'other' group is the largest by far, one reason being that the ERRANT typology was designed for written language, and so disfluencies such as filled pauses, partial words, and false starts fall into this category (Caines et al., 2017b). It may seem odd that punctuation and orthography errors feature in the correction of speech transcriptions, but the former type of edit was invited in the Shiny web-app with the request to "insert full-stops (periods) to break up the text if necessary" (Figure 1). Again, we did not aim to be overly prescriptive in defining this task, adhering to previous work indicating that 'speech-unit delimitation' in transcriptions is an intuitive task which depends on a feel for appropriate delimitation based on some combination of syntax, semantics and prosody (Moore et al., 2016). One could more definitely require that the units be syntactically or semantically coherent but this was more instruction than we wished to give in a crowdsourcing interaction.
Orthography edits relate to the speech-unit delimitation task: they are changes in character casing due to the insertion of full-stops by the crowdworkers, and as such are not errors made by the speaker but rather a step towards making the transcriptions more human-readable. Included in the 'other' error category are filled pauses ('er', 'um', etc) which are filtered before GED because we can remove these with a rule. Note that filled pauses are a common occurrence in naturalistic speech, and hence are produced as much by native speakers -273 instances, or 7% of the transcription edits made for this group -as by the learners (256 instances, or 5.6% of edits).
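Such a rule can be as simple as filtering a fixed inventory of filled-pause forms. The inventory below is an assumption for illustration and may differ from the exact list we applied:

```python
# Hypothetical inventory of filled-pause forms; the rule used for the
# corpus may cover a different set.
FILLED_PAUSES = {"er", "erm", "um", "uh", "mm", "hmm"}

def strip_filled_pauses(tokens):
    """Remove filled pauses before GED, ignoring case and any
    attached sentence punctuation."""
    return [t for t in tokens if t.lower().strip(".,") not in FILLED_PAUSES]

print(strip_filled_pauses("well um the best location er is near the station".split()))
```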

Grammatical error detection
Automatic grammatical error detection in natural language is a well-developed area of research which tends to involve one of several established datasets: for example, the FCE Corpus (Yannakoudakis et al., 2011), CoNLL-2014 (Ng et al., 2014), JFLEG (Napoles et al., 2017), and the Write & Improve Corpus (W&I; Bryant et al., 2019) among others. These corpora all contain written essays, whether written for language exams (FCE, JFLEG) or for practice and learning (CoNLL-2014, W&I). Thus techniques for GED are advanced and tuned to the written domain. We evaluate how well these methods transfer to spoken language.
GED is usually treated as a separate task from grammatical error correction (GEC), which involves proposing edits to the original text. GEC could be an area for future work with the CROWDED Corpus, but at first we wish to explore GED for this speech dataset, anticipating that performance will be quite different from GED on written texts, in which error types more often relate to word forms, punctuation and spelling (Table 2).
The state-of-the-art approach to GED involves sequence labelling word tokens as correct or incorrect with a bi-directional LSTM: the original model (Rei and Yannakoudakis, 2016) has evolved to include forwards and backwards language modelling objectives (Rei, 2017) and contextual word representations concatenated to pre-trained word-level representations (Bell et al., 2019). In addition, multi-task learning has proven effective for GED, with auxiliary predictions of error types, part-of-speech tags and grammatical relations aiding model performance (Rei and Yannakoudakis, 2017). We take these insights forward to GED in the CROWDED Corpus, running a series of experiments with modifications to the publicly available sequence labeller released with Rei (2017) 5 . Note that F 0.5 has been the standard evaluation metric for GED since Ng et al. (2014), and it weights precision twice as much as recall.
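For reference, F 0.5 is the F-beta measure with beta = 0.5, which weights precision twice as heavily as recall:

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score: beta < 1 favours precision over recall;
    F0.5 is the standard evaluation metric for GED."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# High-precision systems are rewarded: P=0.8, R=0.4 still scores ~0.67
print(round(f_beta(0.8, 0.4), 3))
```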

Data pre-processing
As explained in section 3.2, the Prolific transcriptions and error-corrected versions were tokenized and aligned with ERRANT: we then converted the resulting M2 format files into CoNLL-style tables in readiness for the sequence labeller. In the first column of the tables are the word tokens, one on each line, while in the final column is a 'correct' (c) or 'incorrect' (i) label for that token. Note that 'missing' error types are carried by the word token following the missing item, and that error labels are shared across tokens if they have been split from a single white-space delimited token (e.g. an erroneous "it's" would carry an 'i' label on both "it" and "'s").
We created ten train-development-test data splits in order to carry out ten-fold cross-validation in our GED experiments. Transcriptions were assigned to data splits in batches associated with distinct recordings. That is, where we have multiple transcriptions and annotations for a single recording, these are placed together in the same split. For each fold, recording sets were randomly selected from the 383 in the corpus until we had filled the development and test splits. As there are 1108 transcriptions in the corpus, we sought out a minimum of 110 transcriptions for the development and test splits in each fold.
Most of the folds have 110 or 111 transcriptions in development and test: the largest such split contains 115 transcriptions. We make our pseudo-random splits available for the sake of reproducibility.
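The grouped splitting procedure can be sketched as follows. This is a reconstruction under the constraints described above, not the released script:

```python
import random

def grouped_splits(recording_ids, n_folds=10, min_eval=110, seed=0):
    """Assign transcriptions to folds by recording, so that all versions of
    one recording land in the same split. For each fold, whole recording
    sets are drawn at random into dev and then test until each holds at
    least min_eval transcriptions (so splits may slightly overshoot, as
    with the 110-115 range reported); the remainder forms the train split."""
    by_rec = {}
    for i, rec in enumerate(recording_ids):
        by_rec.setdefault(rec, []).append(i)
    folds = []
    for fold in range(n_folds):
        rng = random.Random(seed + fold)
        recs = list(by_rec)
        rng.shuffle(recs)
        dev, test = [], []
        it = iter(recs)
        while len(dev) < min_eval:
            dev += by_rec[next(it)]
        while len(test) < min_eval:
            test += by_rec[next(it)]
        train = [i for r in it for i in by_rec[r]]
        folds.append((train, dev, test))
    return folds
```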
We opted for cross-validation rather than a single train-development-test split because, with several recordings each being associated with many transcriptions, there is a danger of over-concentrating the smaller splits (development and test) with many similar texts. In the scenario of only having one dataset split, conclusions about the generalisability of our GED models would have therefore been limited. We add extra information to the text files from several sources: morpho-syntactic labels obtained from a parser, n-gram frequencies from several corpora, the identification of complex words, ERRANT, and prosodic features from the CROWDED audio recordings.
Specifically, the morpho-syntactic labels come from the pre-trained English Web Treebank (UD v2.4) parsing model for UDPipe (Bies et al., 2012;Nivre et al., 2019;Straka and Straková, 2017;Wijffels, 2019). For each word token we obtain a lemma, Universal part-of-speech tag (UPOS), Penn Treebank part-of-speech tag (XPOS), head token number and dependency relation directly from UDPipe's output. The motivation for preparing such information was that certain error types may occur with tell-tale morpho-syntactic signals, and furthermore predicting such labels has been of benefit in multi-task learning approaches to GED (Rei and Yannakoudakis, 2017).
The n-gram frequencies were obtained for values of n = {1, 2, 3} from the following corpora: the CROWDED Corpus itself, the British National Corpus (BNC Consortium, 2001), and the One Billion Word Benchmark (Chelba et al., 2014). In this case there were 6 values per corpus: a unigram frequency, two bigram frequencies with the target word in both first and second position of the gram, and three trigram frequencies with the target word in all three positions. The intuition here is that frequency information may be useful in identifying ungrammatical word sequences: if the n-gram is low frequency, that might indicate an error.
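The six frequency values per corpus can be gathered as follows; a small sketch with constructed sentences in place of the corpora listed above:

```python
from collections import Counter

def ngram_counts(corpus_sentences, max_n=3):
    """Count all 1- to max_n-grams in a (whitespace-tokenized) corpus."""
    counts = Counter()
    for sent in corpus_sentences:
        toks = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return counts

def ngram_features(tokens, i, counts):
    """Six features for token i: its unigram count, the two bigrams
    containing it (target first/second), and the three trigrams containing
    it (target in each position). Out-of-range slots are padded with None,
    which never matches a counted n-gram, giving a count of 0."""
    t = lambda *idx: tuple(tokens[j] if 0 <= j < len(tokens) else None
                           for j in idx)
    grams = [t(i), t(i - 1, i), t(i, i + 1),
             t(i - 2, i - 1, i), t(i - 1, i, i + 1), t(i, i + 1, i + 2)]
    return [counts.get(g, 0) for g in grams]

counts = ngram_counts(["the best location", "the best shop"])
print(ngram_features("the best location".split(), 1, counts))
```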
For each word token in the CROWDED Corpus we added a binary complexity label obtained from a model pre-trained on separate data (Gooding and Kochmar, 2019). Words in the complexity training data were labelled as complex or not by twenty crowdworkers (Yimam et al., 2017), and the model is currently state-of-the-art for the complex word identification task. We expect that complex words are more likely to be involved in or around grammatical errors. In total 4485 word tokens were identified as complex -for instance, 'guarantee', 'presentation', and 'suppliers'.
Error types were obtained from ERRANT (Bryant et al., 2017) based on the alignment of the Prolific transcription and its error-corrected version. Predicting error types, such as R:NOUN (replace noun), M:DET (missing determiner), and so on, as an auxiliary task improved GED performance in previous work (Rei and Yannakoudakis, 2017).
Finally, we extracted a number of prosodic values associated with each word token from the audio recordings. First we force aligned the transcriptions with the audio using the SPPAS toolkit (Bigi, 2015). Based on the resulting token start and end timestamps we could then calculate token durations, durations of any pauses preceding or following word tokens, and a number of values relating to the pitch and amplitude of the speaker's voice measured in 10 millisecond increments.
Thus for each token we collect the speaker's initial and final fundamental frequency (F0), along with the minimum, maximum, mean and standard deviation. We do the same for voice amplitude, or energy (E). Values of F0 and E are first smoothed with a 5-point median filter (Fried et al., 2019), in common with the prosodic feature extraction described in previous work (Lee and Glass, 2012; Moore et al., 2016). The motivation for collecting such values is that speakers may display certain prosodic patterns, such as pausing before or after a word, where they find speech production difficult and may therefore produce a grammatical error, or where they realise that they have just made, or are in the process of making, an error.
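The 5-point median filter replaces each value with the median of its surrounding window, which suppresses isolated pitch-tracking errors while preserving the overall contour; a minimal sketch (edge windows are truncated to the available points, which may differ from the exact implementation used):

```python
def median_filter(values, width=5):
    """Running median over a window of `width` points; windows at the
    edges of the track are truncated rather than padded."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half): i + half + 1])
        out.append(window[len(window) // 2])
    return out

# 0 (unvoiced frame) and 400 (octave error) are plausible F0 tracking
# glitches; the filter removes both spikes.
f0 = [120, 118, 0, 121, 119, 400, 122, 120]
print(median_filter(f0))
```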

Experiment configuration
Our initial experiments did not involve additional features or auxiliary tasks: fundamentally we initialise input word vectors with pre-trained word representations and train the sequence labeller to predict whether each word token is a grammatical error. At first we ran several hyperparameter tuning experiments with manual search, recognising that grid search or random search might be more thorough, but also wishing to keep computational cost to a minimum (Strubell et al., 2019). Furthermore we tuned hyperparameters based on intuition and experience so as to offer good coverage of likely optimal values. The state-of-the-art for GED involves contextual word representations concatenated to pre-trained representations. We tried several different pre-trained word representations as input to the model: English fastText vectors trained on 600B tokens from Common Crawl (Mikolov et al., 2018), Wikipedia2Vec trained on an April 2018 English Wikipedia dump (Yamada et al., 2020), and English GloVe trained on 840B tokens from Common Crawl (Pennington et al., 2014).

Table 1: CROWDED Corpus GED with a sequence labeller and pre-trained (GloVe 300 ) or contextual (BERT BASE ) word representations as input. Precision, recall and F 0.5 are averaged over 10-fold cross-validation executed 10 times (standard deviations in brackets), with average time per run in hours.
The extra information described in section 4.1 was used for various experiments with additional features concatenated to the input embeddings, or as auxiliary objectives in multi-task learning settings. The additional features were treated as discrete or real values: real values were optionally normalised to a mean of 0 and standard deviation of 1, and discrete features were encoded as one-hot vectors (dimensionality reduction was attempted, with no positive impact on performance). For auxiliary objectives we experimented with weights of 1.0, 0.1 and 0.01. As in previous work such as Lu et al. (2019), we introduce error-annotated learner essays as additional training data, namely the FCE, W&I and JFLEG corpora. We try three settings for each corpus: training on the written data only and evaluating on CROWDED, fine-tuning a pre-trained written GED model on CROWDED training texts, or combining the written and CROWDED corpora for training from the outset.
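The two feature encodings can be sketched as follows, with a hypothetical part-of-speech vocabulary for illustration:

```python
def one_hot(value, vocabulary):
    """Discrete feature -> one-hot vector over a fixed vocabulary,
    with a final 'unknown' slot for out-of-vocabulary values."""
    vec = [0.0] * (len(vocabulary) + 1)
    vec[vocabulary.index(value) if value in vocabulary else -1] = 1.0
    return vec

def z_normalise(values):
    """Real-valued feature -> zero mean, unit standard deviation
    (population standard deviation; constant features map to zeros)."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd if sd else 0.0 for v in values]

print(one_hot("NOUN", ["DET", "NOUN", "VERB"]))
print(z_normalise([1.0, 2.0, 3.0]))
```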
Recall that we set up our dataset for 10-fold cross-validation. As well as 10-folds, all experiments involve 10 different random seeds to provide us with a fair picture of variance in model training and performance. F 0.5 is the primary metric and results are reported as the mean of all seeds and folds (i.e. 10 folds x 10 seeds = 100 values). Average time is reported per experimental run (i.e. per set of cross-validation experiments, or per random seed).

Results
In terms of hyperparameter search, the best performing model involves contextual word representations from BERT BASE concatenated to pre-trained GloVe representations. The AdaDelta optimizer with a learning rate of 1.0 out-performed Adam with a learning rate of 0.001, and a batch size of 32 was better than 64 or 16. We label this best model 'GloVe 300 +BERT BASE '. This result is compared with a baseline GloVe 300 approach without BERT BASE representations in Table 1. The difference between the two models is statistically significant (Wilcoxon signed-rank test: p < 0.001).
Full hyperparameter experiments are reported in Table 3 (appendix). Note that GloVe 300 +BERT BASE is not the best across the board but rather has the best combination of precision and recall: experiments with both the Adam optimizer and a larger batch size of 64 achieve higher precision; a smaller batch size of 16 achieves higher recall. Nevertheless GloVe 300 +BERT BASE is the best performing model overall.
We report results with additional features and auxiliary objectives in Table 4 in the appendix (from the second table section downwards). In summary, no additional features or auxiliary objectives out-performed the GloVe 300 +BERT BASE model, which we take to mean that BERT representations are very strong, and that for a dataset of this size any additional information only adds noise. Alternatively, it may be that different feature types or auxiliary objectives are required for spoken error detection. We draw this conclusion from experiments with a GloVe 300 model (no BERT) in which additional features do improve performance (top section, Table 4), so they are evidently not unhelpful in themselves.

Figure 3: Recall for each edit and error type in the CROWDED and FCE corpora using the GloVe 300 +BERT BASE GED model. Selected types with at least 1% of edits or errors in each corpus.
Similarly, experiments with extra training data from written corpora do show improvement over the GloVe 300 model (no BERT) but not over the best GloVe 300 +BERT BASE model (appendix Table 5), whether combining written and spoken corpora from the outset or pre-training on written texts and fine-tuning on spoken data. It is possible that larger such corpora are needed to show any gain over a +BERT model: Lu et al. (2019) use the 14-million-word Cambridge Learner Corpus (Nicholls, 2003) and do see improvements in GED on other spoken corpora (NICT JLE and a Cambridge Assessment dataset) after fine-tuning.

Analysis
State-of-the-art performance with a model similar to GloVe 300 +BERT BASE yields an F 0.5 of .573 for the FCE Corpus (Bell et al., 2019). Performance on CROWDED data is quite a bit lower: about .4 at best. We analyse performance on each of the edit and error types listed in Table 2, calculating recall for each type in the FCE test set and in one of the ten CROWDED test sets. Figure 3 shows the edits and errors separately, including only those types which represent at least 1% of the edits in the test set. The proportion each type represents is indicated by datapoint size, CROWDED points are dark and FCE points are light, and the y-axis shows recall.
The first indicator of worse performance is the difference between CROWDED and FCE recall for the 'replacement' edit type (the majority edit type). The 'missing' edit type is also worse for CROWDED, while 'unnecessary' edits are detected with better recall in CROWDED than in the FCE. Recall on the majority error type, 'other', is a little better in CROWDED than in the FCE, but for determiners and nouns it is notably worse. We also see that spelling recall is very high for the FCE, whereas this error type is absent from CROWDED (except for some transcription anomalies). Of course this plot only tells part of the story: precision of GED is much higher on the FCE (the state-of-the-art is currently .650), and therefore we are predicting more false positives in CROWDED, which merits further investigation in future work.

Conclusion
In this paper we have presented a new resource for speech NLP: 1108 separate corrected transcriptions and grammatical error annotations for 383 distinct English recordings from the CROWDED Corpus. These are available for research use 6 and complement the existing CROWDED audio files and original transcriptions available in the same place. These data enable further research into automatic post-editing of speech transcriptions for readability (inserting full-stops and orthographic correction), which we have not explored here but could do in future work. In addition the error annotations allow experiments in grammatical error detection and correction (GED and GEC).
We undertook GED experiments in this work, using methods shown to be highly effective on written corpora. We find that a combination of contextual and pre-trained word representations as inputs to a bi-directional LSTM leads to good performance in sequence labelling word tokens in speech transcriptions as correct or incorrect. Performance is still some way off that for written GED, which may be accounted for by the small size of the dataset compared to written ones, the fact that the word representations are trained on written rather than spoken data, and the fairly different composition of error types found in CROWDED compared to equivalent written corpora. These factors, along with the possible need for extra pre-processing of the transcriptions to handle characteristic features of spoken language such as disfluencies, ellipsis and sections of unclear speech, may be explored in future work.
Another future improvement will be to augment the CROWDED Corpus with new data: new tasks, new languages, new recordings and annotation. Note that there are hundreds of CROWDED recordings which do not yet have the transcription updates and error annotation described here, so if funding allows there is unrealised potential in the existing data, besides adding to it with new data. Further insight may come from adapting the error typology to the spoken domain, in particular subdividing the majority 'other' error type into new types specific to spoken language, and then learning to better detect and correct these. We will then be able to improve upon the best GloVe 300 +BERT BASE model with techniques for GED which are tailored to the kinds of errors found in spoken language.

Table 4: CROWDED Corpus GED with additional features and auxiliary objectives. Precision, recall and F 0.5 averaged over 10-fold cross-validation executed 10 times (standard deviations in brackets), with average time per run in hours. Features are marked as discrete or real-valued, with discrete features encoded as one-hot vectors and their width noted; real-valued features may be normalised to values between 0 and 1. Auxiliary objectives are marked as discrete or real-valued targets, with varying weights.

Table 5: CROWDED Corpus GED with additional training data. Precision, recall and F 0.5 averaged over 10-fold cross-validation executed 10 times (standard deviations in brackets; dots indicate that there was only one run and hence no standard deviation), with average time per run in hours. The point of introduction of CROWDED training data is noted as 'none' (no CROWDED data used in training), 'outset' (mixed with the written corpus) or 'fine-tuning' (used to update the weights of a pre-trained model). *Asterisked experiments were run with 4 random seeds rather than 10 due to their long duration.