Error-repair Dependency Parsing for Ungrammatical Texts

We propose a new dependency parsing scheme which jointly parses a sentence and repairs grammatical errors by extending the non-directional transition-based formalism of Goldberg and Elhadad (2010) with three additional actions: SUBSTITUTE, DELETE, and INSERT. Because these actions may cause an infinite loop in derivation, we also introduce simple constraints that ensure parser termination. We evaluate our model with respect to dependency accuracy and grammaticality improvements for ungrammatical sentences, demonstrating the robustness and applicability of our scheme.


Introduction
Robustness has always been a desirable property for natural language parsers: humans generate a variety of noisy outputs, such as ungrammatical webpages, speech disfluencies, and the text in language learners' essays. Such non-canonical text contains grammatical errors such as substitutions, insertions, and deletions. For example, a nonnative speaker of English might write "*I look in forward hear from you", where in is inserted, to is deleted, and hearing is incorrectly replaced with hear.
We propose a novel dependency parsing scheme that jointly parses and repairs ungrammatical sentences with these sorts of errors. The parser is based on the non-directional easy-first (EF) parser introduced by Goldberg and Elhadad (2010) (GE herein), which iteratively adds the most probable arc until the parse tree is completed. These actions are called ATTACHLEFT and ATTACHRIGHT depending on the direction of the arc. We extend the EF parsing scheme to be robust to ungrammatical inputs by correcting grammatical errors with three new actions: SUBSTITUTE, INSERT, and DELETE. These new actions do not add an arc between tokens; instead, they edit a single token. As a result, the parser is able to jointly parse and correct grammatical errors in the input sentence. We call this new scheme Error-Repair Non-Directional Easy-First parsing (EREF). Since the new actions may greatly increase the search space (e.g., infinite substitutions), we also introduce simple constraints to avoid such issues. We first describe the technical details of EREF (§2) and then evaluate our EREF parser with respect to dependency accuracy (robustness) and grammaticality improvements (§3). Finally, we position this effort at the intersection of noisy text parsing and grammatical error correction (§4).
Figure 1: Solid arrows represent ATTACHRIGHT and ATTACHLEFT in Goldberg and Elhadad (2010). Dotted arcs correspond to actions for each step. Following the notation of GE, arcs are directed from a child to its parent.

Model
Non-directional Easy-first Parsing. Let us begin with a brief review of the non-directional easy-first (EF) parsing scheme proposed by GE, which is the foundation of our proposed scheme described in the following sections.
The EF parser has a list of partial structures $p_1, \ldots, p_k$ (called pending) initialized with sentence tokens $w_1, \ldots, w_n$, and it keeps updating pending through derivations. Unlike left-to-right (e.g., shift-reduce) parsing algorithms (Yamada and Matsumoto, 2003; Nivre, 2004), EF iteratively selects the best pair of adjoining tokens and chooses the direction of attachment: ATTACHLEFT or ATTACHRIGHT. Once the action is committed, the corresponding dependency arc is added and the child token is removed from pending. The first two derivations in Figure 1 depict ATTACHRIGHT and ATTACHLEFT. Pseudocode is shown in Algorithm 1 (lines 1, 3-12).
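The EF loop described above can be illustrated with a minimal Python sketch. This is not GE's implementation: the function names, the `toy_score` table, and the scoring signature are hypothetical, and the sketch scores only the two head words rather than the feature templates used by GE. It also assumes that ATTACHLEFT makes the left item the parent and ATTACHRIGHT makes the right item the parent, which is a convention chosen here for illustration.

```python
from typing import Callable, List, Set, Tuple

Arc = Tuple[int, int]  # (parent index, child index)
Scorer = Callable[[str, str, str], float]  # (action, left word, right word) -> score


def easy_first_parse(tokens: List[str], score: Scorer) -> Set[Arc]:
    """Greedy easy-first loop: repeatedly commit the best-scoring attachment."""
    pending = list(range(len(tokens)))  # indices of the remaining partial structures
    arcs: Set[Arc] = set()
    while len(pending) > 1:
        candidates = []
        for i in range(len(pending) - 1):
            left, right = pending[i], pending[i + 1]
            lw, rw = tokens[left], tokens[right]
            # Assumed convention: ATTACHLEFT -> left item is the parent,
            # ATTACHRIGHT -> right item is the parent.
            candidates.append((score("ATTACHLEFT", lw, rw), -i, left, right))
            candidates.append((score("ATTACHRIGHT", lw, rw), -i, right, left))
        # Highest score wins; ties go to the leftmost pair.
        _, _, parent, child = max(candidates)
        arcs.add((parent, child))
        pending.remove(child)  # the child leaves pending once it is attached
    return arcs


def toy_score(action: str, left_word: str, right_word: str) -> float:
    """Hypothetical hand-written scores standing in for a trained model."""
    table = {
        ("ATTACHRIGHT", "the", "dog"): 2.0,
        ("ATTACHRIGHT", "dog", "barks"): 1.0,
    }
    return table.get((action, left_word, right_word), 0.0)


arcs = easy_first_parse(["the", "dog", "barks"], toy_score)
```

With these toy scores, the parser first attaches "the" under "dog" and then "dog" under "barks", leaving "barks" as the single remaining item in pending.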
The parser is trained using the structured perceptron (Collins, 2002) to choose which action to take given a set of features expanded from templates. The cost of actions is computed at every step by checking their validity: whether a new arc is included in the gold parse and whether the child already has all its children. See GE for further description of feature templates and structured perceptron training. Since there may be multiple valid sequences of actions and it is important to examine a large search space, the parser is allowed to explore (possibly incorrect) actions with a certain probability, termed learning with exploration by Goldberg and Nivre (2013).
Error-repair variant of EF. The error-repair non-directional easy-first parsing scheme (EREF) is a variant of EF. We add three new actions, SUBSTITUTE, DELETE, and INSERT, which we denote $Acts_{ER}$. We do not include a swapping action (Nivre, 2009) for word reordering errors, since these errors are even less frequent than the other error types (Leacock et al., 2014). SUBSTITUTE replaces a token with a grammatically more probable token, DELETE removes an unnecessary token, and INSERT inserts a new token at a designated index. These actions are shown in Figure 1 and Algorithm 1 (lines 13-25). Because the length of pending decreases as attachments occur, the parser also keeps the token indices in repaired (line 5), which holds all tokens in the sentence throughout the parsing process. Furthermore, the parser updates token indices in pending and repaired when INSERT or DELETE occurs. Technically, when a token at index $i$ is deleted/inserted, the parser decrements/increments every index $k \geq i$ (as of before the action) in pending, in repaired, and among the parents and children in the (partial) dependency tree (Arcs).

Algorithm 1: Error-repair non-directional parsing
Input: ungrammatical sentence = $w_1 \ldots w_n$
Output: a set of dependency arcs (Arcs), repaired sentence ($\hat{w}_1 \ldots \hat{w}_m$)
...
 9 if best ∈ Acts then
10   (parent, child) ← edgeFor(best)
11   Arcs.add((parent, child))
12   pending.remove(child)
...
25   Arcs.updateIndex()
26 end
27 return Arcs, repaired

To find the best candidate for SUBSTITUTE and INSERT efficiently, we restrict candidates to tokens with the same part of speech or to a pre-defined candidate list. We select the best candidate by comparing the n-gram language model scores of the candidates within the same surrounding context. As in EF, the cost of $Acts_{ER}$ during training is based on validity. The validity of the new actions is computed from the edit distance ($d$) between the gold tokens ($w^*_1 \ldots w^*_r$) and the sentence state that the parser stores in repaired ($\hat{w}_1 \ldots \hat{w}_m$). When the edit distance after taking an action ($d_{after}$) is smaller than before ($d_{before}$), we regard the action as valid (Algorithm 2).
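The validity check for the error-repair actions can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 2: the function names are hypothetical, and the sketch assumes a standard token-level Levenshtein distance with unit costs for insertion, deletion, and substitution.

```python
from typing import List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance with unit costs (assumed cost model)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def is_valid_edit(gold: List[str], before: List[str], after: List[str]) -> bool:
    """An edit is valid when it moves the repaired sentence closer to the gold tokens."""
    return edit_distance(after, gold) < edit_distance(before, gold)


gold = "I look forward to hearing from you".split()
before = "I look in forward hear from you".split()
after_delete_in = "I look forward hear from you".split()   # DELETE "in"
after_delete_you = "I look in forward hear from".split()   # DELETE "you"

print(edit_distance(before, gold))                    # 3
print(is_valid_edit(gold, before, after_delete_in))   # True
print(is_valid_edit(gold, before, after_delete_you))  # False
```

Deleting the spurious "in" reduces the distance to the gold sentence and is therefore valid, whereas deleting "you" moves the sentence further from the gold tokens.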
One serious concern with EREF is that the new actions may cause an infinite loop during parsing (e.g., infinite SUBSTITUTE actions, or an alternating sequence of DELETE and INSERT). To avoid this, we introduce two constraints: (1) an edit flag and (2) an edit limit. The edit flag is assigned to each token as a property, and the parser is not allowed to execute $Acts_{ER}$ on a token whose flag is on. The flag is turned on when the parser executes $Acts_{ER}$ on a token whose flag is off. For the INSERT action, the flag of the inserted token is turned on, while the flag of the subsequent token (which gave rise to the INSERT) is not. The edit limit is set to the number of tokens in the sentence, and the parser is not allowed to perform $Acts_{ER}$ when the total number of executed $Acts_{ER}$ exceeds the limit. These two constraints prevent the parser from falling into an infinite loop and keep parsing in the same order of time complexity as GE. We also add the following constraints to avoid unreasonable derivations: (i) a word with a dependent cannot be deleted and (ii) child words cannot be substituted. All the constraints are implemented in the isLegal() function in Algorithm 1 (line 8). We note that these constraints not only prevent undesirable derivations but also lead to more efficient exploration of the search space during training.
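A minimal sketch of how the edit flag, the edit limit, and the two derivation constraints might be checked is given below. It is not the isLegal() function of Algorithm 1: the `Token` and `EditState` classes, the field names, and `apply_edit` are hypothetical, and constraint (ii) is read here as "a token that already has a head cannot be substituted", which is an interpretation rather than something the text spells out.

```python
from dataclasses import dataclass
from typing import List, Optional

ER_ACTIONS = {"SUBSTITUTE", "DELETE", "INSERT"}


@dataclass
class Token:
    form: str
    edited: bool = False      # edit flag
    num_dependents: int = 0   # arcs already attached below this token
    has_head: bool = False    # True once the token is attached as a child


@dataclass
class EditState:
    edit_limit: int           # set to the number of tokens in the sentence
    edits_done: int = 0


def is_legal(action: str, token: Token, state: EditState) -> bool:
    if action not in ER_ACTIONS:
        return True                       # attachments are not restricted here
    if state.edits_done > state.edit_limit:
        return False                      # edit limit exceeded
    if token.edited:
        return False                      # edit flag is on
    if action == "DELETE" and token.num_dependents > 0:
        return False                      # (i) a word with a dependent cannot be deleted
    if action == "SUBSTITUTE" and token.has_head:
        return False                      # (ii) assumed reading: child words stay fixed
    return True


def apply_edit(action: str, tokens: List[Token], index: int,
               state: EditState, new_form: Optional[str] = None) -> None:
    if action not in ER_ACTIONS or not is_legal(action, tokens[index], state):
        raise ValueError(f"illegal action {action} at index {index}")
    if action == "SUBSTITUTE":
        tokens[index].form = new_form
        tokens[index].edited = True
    elif action == "DELETE":
        del tokens[index]
    else:  # INSERT: only the new token is flagged, not the token that triggered it
        tokens.insert(index, Token(new_form, edited=True))
    state.edits_done += 1


tokens = [Token(w) for w in ["I", "look", "in", "forward"]]
state = EditState(edit_limit=len(tokens))
apply_edit("SUBSTITUTE", tokens, 1, state, "looked")
apply_edit("INSERT", tokens, 3, state, "to")
```

Because every error-repair action either flags a token or counts against the edit limit, the number of edits in any derivation is bounded, which is the property the two constraints are meant to guarantee.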

Experiment
Data and Evaluation. We evaluate EREF with respect to dependency parsing accuracy (Exp1) and grammaticality improvement (Exp2).$^1$ In the first experiment, as in GE, we train and evaluate our parser on the English dataset from the Penn Treebank (PTB) (Marcus et al., 1993) with the Penn2Malt conversion program (Sections 2-21 for training, 22 for tuning, and 23 for testing). We use the PTB for the dependency experiment, since no corpus of ungrammatical text has human dependency annotation on the corrected text.
We choose the following most frequent error types that are used in the CoNLL 2013 shared task (Ng et al., 2013):
1. Determiner (substitution, deletion, insertion)
2. Preposition (substitution, deletion, insertion)
3. Noun number (singular vs. plural)
4. Verb form (tense and aspect)
5. Subject-verb agreement
Regarding the candidate sets for INSERT and SUBSTITUTE actions, following Rozovskaya and Roth (2014), we focus on the most common candidates for each error type, setting the determiner candidates to be {a, an, the, φ (as deletion)}, preposition candidates to be {on, about, from, for, of, to, at, in, with, by, φ}, and verb forms to be {VB(P|Z|G|D|N)}. We build a 5-gram language model on English Gigaword with the KenLM Toolkit (Heafield, 2011) for EREF to select the best candidate.
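Candidate selection with a language model can be sketched as below. The sketch does not call KenLM: `lm_score` is a hypothetical callable that maps a list of words to a score (higher is better), and `toy_lm_score` is a stand-in that merely counts hand-picked bigrams. The empty string plays the role of φ (deletion), and the window of four words on each side is an assumption loosely motivated by the 5-gram model.

```python
from typing import Callable, List, Sequence

PREPOSITIONS = ["on", "about", "from", "for", "of", "to", "at", "in", "with", "by"]


def best_candidate(tokens: List[str], index: int, candidates: Sequence[str],
                   lm_score: Callable[[List[str]], float], window: int = 4) -> str:
    """Return the candidate that scores best in the context around tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    best_form, best_score = tokens[index], float("-inf")
    for cand in candidates:
        middle = [cand] if cand else []   # "" stands for deletion
        s = lm_score(left + middle + right)
        if s > best_score:
            best_form, best_score = cand, s
    return best_form


# Hypothetical stand-in for a real n-gram model: counts known-good bigrams.
GOOD_BIGRAMS = {("forward", "to"), ("to", "hearing")}


def toy_lm_score(words: List[str]) -> float:
    return float(sum((a, b) in GOOD_BIGRAMS for a, b in zip(words, words[1:])))


tokens = "I look forward in hearing".split()
choice = best_candidate(tokens, 3, PREPOSITIONS + [""], toy_lm_score)
print(choice)  # to
```

Note that a deletion candidate yields a shorter context than a substitution candidate, so a real scorer would likely need some form of length normalization for the comparison to be fair.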
We manually inject grammatical errors into PTB at specified error rates, similarly to the GenERRate toolkit by Foster and Andersen (2009), which is designed to inject synthetic errors into sentences to improve grammatical error detection.
We train and tune EREF models with different token-level error injection rates from 5% (E05) to 20% (E20), because language learner corpora generally contain around 5% to 15% token-level errors, depending on the learners' proficiency (Leacock et al., 2014). Since the error injection is stochastic, we train each model with 10 runs and take the average parser performance on the test set.
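The idea of stochastic, rate-controlled error injection can be illustrated with a simplified sketch. This is neither the injection procedure used in the experiments nor GenERRate: the function name is hypothetical, only determiner and preposition deletions and substitutions are covered (no insertions, noun number, verb form, or agreement errors), and because only closed-class tokens are eligible, the resulting token-level error rate is lower than `rate`.

```python
import random
from typing import List

DETERMINERS = ["a", "an", "the"]
PREPOSITIONS = ["on", "about", "from", "for", "of", "to", "at", "in", "with", "by"]


def inject_errors(tokens: List[str], rate: float, rng: random.Random) -> List[str]:
    """Corrupt eligible tokens with probability `rate` (deletion or substitution)."""
    noisy = []
    for tok in tokens:
        pool = next((p for p in (DETERMINERS, PREPOSITIONS) if tok.lower() in p), None)
        if pool is None or rng.random() >= rate:
            noisy.append(tok)            # keep the token unchanged
            continue
        if rng.random() < 0.5:
            continue                     # deletion error: drop the token
        # substitution error: pick a different word from the same confusion set
        noisy.append(rng.choice([c for c in pool if c != tok.lower()]))
    return noisy


sentence = "I look forward to hearing from you".split()
clean = inject_errors(sentence, 0.0, random.Random(0))   # no errors injected
noisy = inject_errors(sentence, 1.0, random.Random(0))   # every eligible token corrupted
```

Passing an explicit `random.Random` instance makes each run reproducible, which is convenient when averaging over several stochastic injections as described above.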
As a baseline, we use the original parser as described by GE, which is equivalent to EREF with training on an error-free corpus (E00). Since the EF baseline does not allow error correction during parsing, we pre-process the test data with a grammatical error correction system similar to Rozovskaya and Roth (2014), where a combination of classifiers for each error type corrects grammatical errors.
Table 2: [...] (Heilman et al., 2014). The first row shows the number of sentences in which at least one change is made. Bold numbers show statistically significant improvements.
For evaluation, we jointly parse and correct grammatical errors in the test set with different error injection rates (from 0% to 20%). It is important to note that the number of tokens in the parser output and in the oracle may differ because of error injection into the test set and $Acts_{ER}$ during parsing. To handle this mismatch, we evaluate dependency accuracy with alignment (Favre et al., 2010) in the spirit of SParseval (Roark et al., 2006), where the tokens of the hypothesis and the oracle are aligned before the dependency accuracy is calculated.
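One way to compute dependency accuracy over token sequences of different lengths is sketched below. It is not the procedure of Favre et al. (2010) or SParseval: the function name is hypothetical, the alignment uses Python's `difflib.SequenceMatcher` and therefore links only identical tokens, and the score is normalized by the number of gold tokens, which is an assumption made for this illustration.

```python
from difflib import SequenceMatcher
from typing import List


def aligned_uas(gold_tokens: List[str], gold_heads: List[int],
                hyp_tokens: List[str], hyp_heads: List[int]) -> float:
    """Unlabeled accuracy after aligning tokens; heads are indices, -1 marks the root."""
    matcher = SequenceMatcher(a=gold_tokens, b=hyp_tokens, autojunk=False)
    gold_to_hyp = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            gold_to_hyp[block.a + k] = block.b + k
    correct = 0
    for g, h in gold_to_hyp.items():
        gold_head, hyp_head = gold_heads[g], hyp_heads[h]
        if gold_head == -1:
            correct += hyp_head == -1                 # both must be the root
        elif gold_to_hyp.get(gold_head) == hyp_head:  # heads must align as well
            correct += 1
    return correct / len(gold_tokens)                 # unaligned gold tokens count as errors


gold_tokens, gold_heads = ["the", "dog", "barks"], [1, 2, -1]
hyp_tokens = ["the", "big", "dog", "barks"]
score_all_correct = aligned_uas(gold_tokens, gold_heads, hyp_tokens, [2, 2, 3, -1])
score_one_wrong = aligned_uas(gold_tokens, gold_heads, hyp_tokens, [3, 2, 3, -1])
```

In the second call the head of "the" is wrong in the hypothesis, so only two of the three gold tokens receive credit.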
In the second experiment, we use the Treebank of Learner English (TLE) (Berzak et al., 2016) to see the grammaticality improvement in a real scenario. TLE contains 5,124 sentences and 2.69 (std 1.9) token errors per sentence. The average sentence length is 19.06 (std 9.47). TLE also provides dependency labels and POS tags on the raw (ungrammatical) sentences. It is important to note that TLE has dependency annotation only for the original ungrammatical sentences, and therefore we do not compute dependency accuracy in this experiment. Since the corpus size is small, we train EREF (E05 to E20) on 100k sentences from Annotated Gigaword (Napoles et al., 2012) and use TLE as a test set. Spelling errors are ignored because EREF can use the POS information. Grammaticality is evaluated by a regression model (Heilman et al., 2014), which scores grammaticality on an ordinal scale (from 1 to 4).
Results. Table 1 shows the results for unlabeled dependency accuracy (UAS).$^2$ As previously reported (Foster, 2007; Cahill, 2015), our experiment also shows that parser performance deteriorates as the error rate in the test corpus increases. On the error-free test set (0%), the baseline (EF pipeline) outperforms the EREF models, and accuracy is lower when the parser is trained on noisier data. The differences among the models become small when the test set has a 10% error injection rate. As the rate increases further, the trend in parser accuracy reverses: when the test set has 15% or more noise, E20 is the most accurate parser. This trend is captured by the slope of deterioration $\nabla = \frac{\mathrm{accuracy}_{20\%} - \mathrm{accuracy}_{0\%}}{20}$ in Table 1; a parser trained on noisier training data shows a smaller decline and more robustness.$^3$ This indicates that EREF is more robust than the vanilla EF on ungrammatical texts because it jointly parses and corrects errors.
Table 2 shows the grammaticality improvement (1-4 scale) on the TLE corpus, and Table 3 shows successful and failed corrections by EREF. Models trained with low error-injection rates (E05 and E10) show little improvement in grammaticality because they are too conservative to make edits. The models with higher error-injection rates (E15 and E20) achieve statistically significant improvements of 0.1 to 0.3. There is still room for improvement in the number of corrections. This is probably because TLE contains a variety of errors (e.g., collocation, punctuation) in addition to the five error types we focus on. To deal with other error types, we can extend EREF by adding more actions, although this increases the search space.
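The slope of deterioration used with Table 1 is a simple difference quotient over the 0% to 20% range of test-set error rates. A small sketch follows; the function name is hypothetical, and the accuracy values in the example call are placeholders chosen for easy arithmetic, not results from Table 1.

```python
def deterioration_slope(acc_at_0: float, acc_at_20: float, rate_span: float = 20.0) -> float:
    """Change in accuracy per percentage point of injected test-set errors."""
    return (acc_at_20 - acc_at_0) / rate_span


# Placeholder accuracies (not taken from Table 1): a drop from 90.0 to 80.0.
slope = deterioration_slope(90.0, 80.0)
print(slope)  # -0.5
```

A slope closer to zero corresponds to a smaller decline in accuracy as the test set becomes noisier, which is the sense of robustness discussed above.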
From a practical perspective, the level of ungrammaticality of the input should be known ahead of time. This is an issue to be addressed in future research.
[...] search space and training time. The primary goal of this experiment is to see whether EREF is able to detect and correct grammatical errors.
$^3$ The baseline model without preprocessing always underperformed the preprocessed baseline.

Related Work
Our work lies at the intersection of parsing noncanonical texts and grammatical error correction.
Joint dependency parsing and disfluency detection has been pursued (Rasooli and Tetreault, 2013, 2014; Honnibal and Johnson, 2014; Wu et al., 2015; Yoshikawa et al., 2016), where a parser jointly parses and detects disfluencies (e.g., reparandum and interregnum) in a given speech utterance. Our work could be considered an extension that adds SUBSTITUTE and INSERT actions, although we depend on the easy-first non-directional parsing framework instead of a left-to-right strategy. Importantly, the DELETE action is easier to handle than the SUBSTITUTE and INSERT actions, because the latter two raise challenging issues of candidate word generation and of avoiding an infinite loop in derivation. We have addressed these issues as explained in §2.
In terms of the literature on grammatical error correction, this work is closely related to Dahlmeier and Ng (2012), who present an error correction decoder with the easy-first strategy. The decoder iteratively corrects the most probable ungrammatical token by applying different classifiers for each error type. The EREF parser also depends on the easy-first strategy to find the index of an ungrammatical token to be deleted, inserted, or substituted, but it parses and corrects errors jointly, whereas the decoder is designed as a grammatical error correction framework rather than a parser.
There is a line of work on parsing ungrammatical sentences (e.g., web forum text) by adapting an existing parsing scheme to domain-specific annotations (Petrov and McDonald, 2012; Cahill, 2015; Berzak et al., 2016; Nagata and Sakaguchi, 2016). Although we share an interest in dealing with ungrammatical sentences, EREF focuses on a parsing scheme for repairing grammatical errors instead of adapting a parser with a domain-specific annotation scheme.
More broadly, our work can also be regarded as one of the joint parsing and text normalization tasks, such as joint spelling correction and POS tagging (Sakaguchi et al., 2012) and joint word segmentation and POS tagging (Kaji and Kitsuregawa, 2014; Qian et al., 2015).

Conclusions
We have presented an error-repair variant of the non-directional easy-first dependency parser. We have introduced three new actions, SUBSTITUTE, INSERT, and DELETE, into the parser so that it jointly parses and corrects grammatical errors in a sentence. To address the issue of non-terminating derivations caused by the new actions, we have proposed simple constraints that keep track of the editing history of each token and the total number of edits during derivation. The experimental results have demonstrated the robustness of EREF parsers compared with EF, as well as grammaticality improvement. Our work is positioned at the intersection of noisy text parsing and grammatical error correction. EREF is a flexible formalism not only for grammatical error correction but also for other tasks that jointly edit and parse a given sentence.