Using Wikipedia Edits in Low Resource Grammatical Error Correction

We develop a grammatical error correction (GEC) system for German using a small gold GEC corpus augmented with edits extracted from Wikipedia revision history. We extend the automatic error annotation tool ERRANT (Bryant et al., 2017) for German and use it to analyze both gold GEC corrections and Wikipedia edits (Grundkiewicz and Junczys-Dowmunt, 2014) in order to select, as additional training data, Wikipedia edits containing grammatical corrections similar to those in the gold corpus. Using a multilayer convolutional encoder-decoder neural network GEC approach (Chollampatt and Ng, 2018), we evaluate the contribution of Wikipedia edits and find that carefully selected Wikipedia edits increase performance by over 5%.


Introduction and Previous Work
In the past decade, there has been a great deal of research on grammatical error correction for English, including a series of shared tasks, Helping Our Own in 2011 and 2012 (Dale and Kilgarriff, 2011; Dale et al., 2012) and the CoNLL 2013 and 2014 shared tasks (Ng et al., 2013, 2014), which have contributed to the development of larger English GEC corpora. On the basis of these resources, along with advances in machine translation, the current state-of-the-art English GEC systems use ensembles of neural MT models (Chollampatt and Ng, 2018) and hybrid systems with both statistical and neural MT models (Grundkiewicz and Junczys-Dowmunt, 2018).
In addition to using gold GEC corpora, which are typically fairly small in the context of MT-based approaches, research in GEC has taken a number of alternate data sources into consideration, such as artificially generated errors (e.g., Wagner et al., 2007; Foster and Andersen, 2009; Yuan and Felice, 2013), crowd-sourced corrections (e.g., Mizumoto et al., 2012), or errors from native language resources (e.g., Cahill et al., 2013; Grundkiewicz and Junczys-Dowmunt, 2014). For English, Grundkiewicz and Junczys-Dowmunt (2014) extracted pairs of edited sentences from the Wikipedia revision history and filtered them based on a profile of gold GEC data in order to extend the training data for a statistical MT GEC system, finding that the addition of filtered edits improved the system's F 0.5 score by ~2%. For languages with more limited resources, native language resources such as Wikipedia offer an easily accessible source of additional data.
Using a similar approach that extends existing gold GEC data with Wikipedia edits, we develop a neural machine translation grammatical error correction system for a new language, in this instance German, for which there are only small gold GEC corpora but plentiful native language resources.

Data and Resources
The following sections describe the data and resources used in our experiments on GEC for German. We create a new GEC corpus for German along with the models needed for the neural GEC approach presented in Chollampatt and Ng (2018). Throughout this paper we will refer to the source sentence as the original and the target sentence as the correction.

Gold GEC Corpus
As we are not aware of any standard corpora for German GEC, we create a new grammatical error correction corpus from two German learner corpora that have been manually annotated following similar guidelines. In the Falko project, annotation guidelines were developed for minimal target hypotheses, minimal corrections that transform an original sentence into a grammatical correction, and these guidelines were applied to advanced learner essays (Reznicek et al., 2012). The MERLIN project (Boyd et al., 2014) adapted the Falko guidelines and annotated learner texts from a wide range of proficiency levels. We extract pairs of original sentences and corrections from all annotated sentence spans in FalkoEssayL2 v2.4 (248 texts), FalkoEssayWhig v2.0 (196 texts), and MERLIN v1.1 (1,033 texts) to create the new Falko-MERLIN GEC Corpus, which contains 24,077 sentence pairs. The corpus is divided into train (80%), dev (10%), and test (10%) sets, keeping all sentences from a single learner text within the same partition.
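The document-level split described above can be sketched as follows. This is a minimal illustration, not the released code: the function name and data layout (a dict mapping text IDs to sentence pairs) are our own.

```python
import random

def split_by_text(texts, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition a corpus into train/dev/test so that all sentence
    pairs from a single learner text stay in the same partition.

    `texts` maps a text ID to its list of (original, correction) pairs.
    """
    ids = sorted(texts)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])
    splits = {"train": ids[:cut1], "dev": ids[cut1:cut2], "test": ids[cut2:]}
    # Flatten each partition back into sentence pairs.
    return {name: [pair for i in members for pair in texts[i]]
            for name, members in splits.items()}
```

Splitting by text ID rather than by sentence avoids near-duplicate sentences from the same essay leaking between train and test.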
An overview of the Falko-MERLIN GEC Corpus is shown in Table 1 with the number of errors per sentence and errors per token as analyzed by ERRANT for German (see section 3.1). On average, the Falko corpus (advanced learners) contains longer sentences with fewer errors per token while the MERLIN corpus (all proficiency levels) contains shorter sentences with more errors per token. A more detailed ERRANT-based analysis is presented in Figure 2 in section 3.2.

Wikipedia
In our experiments, we use German Wikipedia dumps of articles and revision history from June 1, 2018. Wikipedia edits are extracted from the revision history using Wiki Edits (Grundkiewicz and Junczys-Dowmunt, 2014) with a maximum sentence length of 60 tokens, since 99% of the Falko and MERLIN sentences are shorter than 60 tokens. For training the subword embeddings, plain text is extracted from the German Wikipedia articles using WikiExtractor.
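The length constraint can be expressed as a simple predicate over extracted sentence pairs. This is a sketch using whitespace tokenization; Wiki Edits itself applies its own tokenization and additional extraction heuristics.

```python
def keep_edit_pair(original, correction, max_len=60):
    """Keep an extracted revision pair only if both sides are
    non-empty, actually differ, and fit the 60-token limit chosen
    to cover 99% of the Falko and MERLIN sentences."""
    o, c = original.split(), correction.split()
    return 0 < len(o) <= max_len and 0 < len(c) <= max_len and o != c
```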

BPE Model and Subword Embeddings
We learn a byte pair encoding (BPE) (Sennrich et al., 2016) with 30K symbols using the corrections from the Falko-MERLIN training data plus the complete plain Wikipedia article text. As suggested by Chollampatt and Ng (2018), we encode the Wikipedia article text using the BPE model and learn fastText embeddings (Bojanowski et al., 2017) with 500 dimensions.
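For illustration, the core of BPE learning (Sennrich et al., 2016) can be sketched in a few lines; in practice one would use the subword-nmt or fastText toolkits rather than this toy version.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy byte pair encoding: repeatedly merge the most frequent
    adjacent symbol pair. `words` is a frequency dict of word types;
    "</w>" marks the end of a word."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i < len(sym) - 1 and (sym[i], sym[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

The learned merge operations are then applied to new text, so rare and unseen words are represented as sequences of frequent subword units.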

Language Model
For reranking, we train a language model on the first one billion lines (~12 billion tokens) of the deduplicated German Common Crawl corpus (Buck et al., 2014).

Method
We extend the Falko-MERLIN GEC training data with sentence-level Wikipedia edits that include similar types of corrections. In order to automatically analyze German GEC data, we extend ERRANT from English to German (section 3.1) and use its analyses to select suitable Wikipedia edits (section 3.2).

ERRANT
ERRANT, the ERRor ANnotation Tool (Felice et al., 2016; Bryant et al., 2017), analyzes pairs of English sentences from a GEC corpus to identify the types of corrections performed. The tokens in a pair of sentences are aligned using Damerau-Levenshtein edit distance with a custom substitution cost that incorporates linguistic information (lemmas, POS tags, and characters) to promote alignments between related word forms. After the individual tokens are aligned, neighboring edits are evaluated to determine whether two or more edits should be merged into one longer edit, such as merging wide → widespread followed by spread → ∅ into a single edit wide spread → widespread.
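The alignment idea can be illustrated with a standard weighted edit distance. The cost function below is a stand-in for ERRANT's actual linguistically informed cost, and the suffix-stripping "lemmatizer" is purely hypothetical; transpositions, which full Damerau-Levenshtein adds, are omitted for brevity.

```python
def align_cost(orig, corr, sub_cost):
    """Levenshtein alignment cost over token lists with a custom
    substitution cost, in the spirit of ERRANT's aligner.
    Insertions and deletions cost 1; substitutions cost sub_cost(a, b)."""
    m, n = len(orig), len(corr)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if orig[i - 1] == corr[j - 1] else sub_cost(orig[i - 1], corr[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def toy_sub_cost(a, b):
    """Hypothetical cost: cheap when a crude suffix-stripping 'lemmatizer'
    maps both tokens to the same string, so related word forms align."""
    stem = lambda t: t.lower().rstrip("en")
    return 0.5 if stem(a) == stem(b) else 1.5
```

With such a cost, substituting bestandenen for bestanden is cheaper than deleting one token and inserting the other, so the aligner prefers to link the two related word forms.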
To assign an error type to a correction, ERRANT uses a rule-based approach that considers information about POS tags, lemmas, stems, and dependency parses. To extend ERRANT for German, we adapted and simplified the English error types, relying on UD POS tags instead of language-specific tags as much as possible. Our top-level German ERRANT error types are shown with examples in Table 2. For substitution errors, a FORM subtype indicates that the original and corrected tokens share the same lemma (e.g., S:ADJ:FORM in Figure 1). In ERRANT for English, all linguistic annotation is performed with spaCy. We preserve as much of the spaCy pipeline as possible using spaCy's German models; however, the lemmatizer is not sufficient and is replaced with the TreeTagger lemmatizer. All our experiments are performed with spaCy 2.0.11 and spaCy's default German model. The word list for detecting spelling errors comes from Hunspell igerman98-20161207 and the mapping of STTS to UD tags comes from TuebaUDConverter (Çöltekin et al., 2017).
An example of a German ERRANT analysis is shown in Figure 1, for a sentence meaning 'Congratulations on passing your exam.' The first token is analyzed as an adjective substitution error where both adjectives have the same lemma (S:ADJ:FORM), the inflected deverbal adjective bestandenen 'passed' is inserted before Prüfung 'exam' (I:ADJ), and the past participle bestanden 'passed' is deleted at the end of the sentence (D:VERB). Note that ERRANT does not analyze Prüfung bestanden → bestandenen Prüfung as a word order error because the reordered word forms are not identical. In cases like these and ones with longer distance movement, which is a frequent type of correction in non-native German texts, ERRANT has no way to indicate that these two word forms are related or that this pair of edits is coupled.

Filtering Edits with ERRANT
Even though the Wiki Edits algorithm (Grundkiewicz and Junczys-Dowmunt, 2014) extracts only sentence pairs with small differences, many edits relate to content rather than grammatical errors, such as inserting a person's middle name or updating a date. In order to identify the most relevant Wikipedia edits for GEC, we analyze the gold GEC corpus and Wikipedia edits with ERRANT and then filter the Wikipedia edits based on a profile of the gold GEC data.
First, sentences with ERRANT error types that indicate content or punctuation edits are discarded: 1) sentences with only punctuation, proper noun, and/or OTHER error types, 2) sentences with edits modifying only numbers or non-Latin characters, and 3) sentences with OTHER edits longer than two tokens. Second, the ERRANT profile of the gold corpus is used to select edits that: 1) include an original token edited in the gold corpus, 2) include the same list of error types as a sentence in the gold corpus, 3) include the same set of error types as a sentence in the gold corpus for two or more error types, or 4) have a set of error types whose Jaccard similarity coefficient with the error type set of some gold sentence is greater than 0.5, i.e., J(Gold, Wiki) = |Gold ∩ Wiki| / |Gold ∪ Wiki| > 0.5.

After ERRANT-based filtering, approximately one third of the sentences extracted with Wiki Edits remain. The distribution of selected ERRANT error types for the Falko and MERLIN gold GEC corpora vs. the unfiltered and filtered Wikipedia edit corpora is shown in Figure 2 in order to provide an overview of the similarities and differences between the data. As intended, filtering Wikipedia edits as described above decreases the number of potentially content-related PNOUN and OTHER edits while increasing the proportion of other types of edits. In both the unfiltered and filtered Wikipedia edit corpora, the overall frequency of errors remains lower than in the Falko-MERLIN GEC Corpus: 1.7 vs. 2.8 errors per sentence and 0.08 vs. 0.18 errors per token.
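The profile-matching step can be made concrete as follows. This sketch implements only the set-identity and Jaccard-similarity criteria (the token-overlap and exact-list criteria are omitted), and the function names are our own.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient between two sets of error types."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def keep_wiki_sentence(wiki_types, gold_profiles, threshold=0.5):
    """Keep a Wikipedia edit sentence if its set of ERRANT error types
    matches some gold sentence: either an identical set of two or more
    error types, or a Jaccard similarity above the threshold."""
    wiki = set(wiki_types)
    for gold in gold_profiles:
        gold = set(gold)
        if wiki == gold and len(wiki) >= 2:
            return True
        if jaccard(wiki, gold) > threshold:
            return True
    return False
```

Comparing sets of error types rather than raw edits lets a single gold profile license many superficially different Wikipedia edits with the same grammatical character.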

Results and Discussion
We evaluate the effect of extending the Falko-MERLIN GEC Corpus with Wikipedia edits for a German GEC system using the multilayer convolutional encoder-decoder neural network approach from Chollampatt and Ng (2018), using the same parameters as for English. We train a single model for each condition and evaluate on the Falko-MERLIN test set using the M2 scorer (Dahlmeier and Ng, 2012). The results, presented in Table 3, show that the addition of both unfiltered and filtered Wikipedia edits to the Falko-MERLIN GEC training data leads to improvements in performance; however, larger numbers of unfiltered edits (>250K) do not consistently lead to improvements, similar to the results for English in Grundkiewicz and Junczys-Dowmunt (2014). For filtered edits, in contrast, increasing the number of additional edits from 100K to 1M continues to lead to improvements, with an overall improvement of 5.2 F 0.5 for 1M edits over the baseline without additional reranking.
In contrast to the results for English in Chollampatt and Ng (2018), edit operation (EO) reranking decreases scores in conditions with gold GEC training data in our experiments, and reranking with a web-scale language model (LM) does not consistently increase scores, although both reranking methods lead to increases in recall. The best result of 45.22 F 0.5 is obtained with Falko-MERLIN + 1M Filtered Wiki Edits with language model reranking that normalizes scores by sentence length.
An analysis of the performance on Falko vs. MERLIN shows stronger results for MERLIN, with 44.19 vs. 46.52 F 0.5 for Falko-MERLIN + 1M Filtered Wiki Edits + LM-Norm. We expected the advanced Falko essays to benefit from being more similar to Wikipedia than MERLIN; however, MERLIN may simply contain more spelling and inflection errors that are easy to correct given a small amount of context.
In order to explore the possibility of developing GEC systems for languages with fewer resources, we trained models solely on Wikipedia edits, which leads to a huge drop in performance (45.22 vs. 24.37 F 0.5 ). However, the genre differences may be too large to draw solid conclusions and this approach may benefit from further work on Wikipedia edit selection, such as using a language model to exclude some Wikipedia edits that introduce (rather than correct) grammatical errors.

Future Work
The combined basis of ERRANT and Wiki Edits makes it possible to explore MT-based GEC approaches for languages with limited gold GEC resources. The current German ERRANT error analysis approach can be easily generalized to rely on a pure UD analysis, which would make it possible to apply ERRANT to any language with a UD parser and a lemmatizer. Similarly, the process of filtering Wikipedia edits could use alternate methods in place of a gold reference corpus, such as a list of targeted token or error types, to generate GEC training data for any language with resources similar to a Wikipedia revision history.
For the current German GEC system, a detailed error analysis for the output could identify the types of errors where Wikipedia edits make a significant contribution and other areas where additional data could be incorporated, potentially through artificial error generation or crowd-sourcing.

Conclusion
We provide initial results for grammatical error correction for German using data from the Falko and MERLIN corpora augmented with Wikipedia edits that have been filtered using a new German extension of the automatic error annotation tool ERRANT (Bryant et al., 2017). Wikipedia edits are extracted using Wiki Edits (Grundkiewicz and Junczys-Dowmunt, 2014), profiled with ERRANT, and filtered with reference to the gold GEC data. We evaluate our method using the multilayer convolutional encoder-decoder neural network GEC approach from Chollampatt and Ng (2018) and find that augmenting a small gold German GEC corpus with one million filtered Wikipedia edits improves the performance from 39.22 to 44.47 F 0.5 and additional language model reranking increases performance to 45.22. The data and source code for this paper are available at: https://github.com/adrianeboyd/boyd-wnut2018/