Adapting Sequence Models for Sentence Correction

In a controlled experiment of sequence-to-sequence approaches for the task of sentence correction, we find that character-based models are generally more effective than word-based models and models that encode subword information via convolutions, and that modeling the output data as a series of diffs improves effectiveness over standard approaches. Our strongest sequence-to-sequence model improves over our strongest phrase-based statistical machine translation model, with access to the same data, by 6 M2 (0.5 GLEU) points. Additionally, in the data environment of the standard CoNLL-2014 setup, we demonstrate that modeling (and tuning against) diffs yields similar or better M2 scores with simpler models and/or significantly less data than previous sequence-to-sequence approaches.


Introduction
The task of sentence correction is to convert a natural language sentence that may or may not have errors into a corrected version. The task is envisioned as a component of a learning tool or writing assistant, and has seen increased interest since 2011, driven by a series of shared tasks (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2014). Most recent work on language correction has focused on the data provided by the CoNLL-2014 shared task (Ng et al., 2014), a set of corrected essays by second-language learners. The CoNLL-2014 data consists of only around 60,000 sentences, and as such, competitive systems have made use of large amounts of corrected text without annotations, and in some cases lower-quality crowd-annotated data, in addition to the shared data. In this data environment, it has been suggested that statistical phrase-based machine translation (MT) with task-specific features is the state-of-the-art for the task (Junczys-Dowmunt and Grundkiewicz, 2016), outperforming word- and character-based sequence-to-sequence models (Yuan and Briscoe, 2016; Xie et al., 2016; Ji et al., 2017), phrase-based systems with neural features (Chollampatt et al., 2016b,a), re-ranking output from phrase-based systems (Hoang et al., 2016), and combining phrase-based systems with classifiers trained for hand-picked subsets of errors (Rozovskaya and Roth, 2016).
We revisit the comparison across translation approaches for the correction task in light of the Automated Evaluation of Scientific Writing (AESW) 2016 dataset, a correction dataset containing over 1 million sentences, holding constant the training data across approaches. The dataset was previously proposed for the distinct binary classification task of grammatical error identification.
Experiments demonstrate that pure character-level sequence-to-sequence models are more effective on AESW than word-based models and models that encode subword information via convolutions over characters, and that representing the output data as a series of diffs significantly increases effectiveness on this task. Our strongest character-level model achieves statistically significant improvements over our strongest phrase-based statistical machine translation model by 6 M2 (0.5 GLEU) points, with additional gains when including domain information. Furthermore, in the partially crowd-sourced data environment of the standard CoNLL-2014 setup, in which there are comparatively few professionally annotated sentences, we find that tuning against the tags marking the diffs yields similar or superior effectiveness relative to existing sequence-to-sequence approaches despite using significantly less data, with or without using secondary models. All code is available at https://github.com/allenschmaltz/grammar.

Background and Methods
Task We follow recent work and treat the task of sentence correction as translation from a source sentence (the unedited sentence) into a target sentence (a corrected version in the same language as the source). We do not make a distinction between grammatical and stylistic corrections.
We assume a vocabulary V of natural language word types (some of which have orthographic errors). Given a sentence $s = [s_1 \cdots s_I]$, where $s_i \in V$ is the i-th token of the sentence of length I, we seek to predict the corrected target sentence $t = [t_1 \cdots t_J]$, where $t_j \in V$ is the j-th token of the corrected sentence of length J. We are given both s and t for supervised training in the standard setup. At test time, we are only given access to sequence s. We learn to predict sequence t (which is often identical to s).
Sequence-to-sequence We explore word and character variants of the sequence-to-sequence framework. We use a standard word-based model (WORD), similar to that of Luong et al. (2015), as well as a model that uses a convolutional neural network (CNN) and a highway network over characters (CHARCNN), based on the work of , instead of word embeddings as the input to the encoder and decoder. With both of these models, predictions are made at the word level. We also consider the use of bidirectional versions of these encoders (+BI).
Our character-based model (CHAR+BI) follows the architecture of the WORD+BI model, but the input and output consist of characters rather than words. In this case, the input and output sequences are converted to a series of characters and whitespace delimiters. The output sequence is converted back to t prior to evaluation.
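The character-level conversion described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the delimiter symbol `<space>` and the function names are assumptions introduced here, and the optional `atomic` argument only reflects the fact that some symbols may need to be kept whole rather than split into characters.

```python
SPACE = "<space>"  # hypothetical whitespace delimiter symbol (an assumption)


def to_chars(sentence, atomic=frozenset()):
    """Split a whitespace-tokenized sentence into characters and delimiters.

    Tokens listed in `atomic` are emitted whole instead of being split.
    """
    symbols = []
    for i, token in enumerate(sentence.split()):
        if i > 0:
            symbols.append(SPACE)  # mark the boundary between two words
        if token in atomic:
            symbols.append(token)
        else:
            symbols.extend(token)  # one symbol per character
    return symbols


def from_chars(symbols):
    """Invert to_chars: rebuild the word-level sentence prior to evaluation."""
    return "".join(" " if sym == SPACE else sym for sym in symbols)
```

For a whitespace-tokenized sentence, `from_chars(to_chars(sentence))` returns the original string, which is the round trip needed before scoring the output at the word level.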
The WORD models encode and decode over a closed vocabulary (of the 50k most frequent words); the CHARCNN models encode over an open vocabulary and decode over a closed vocabulary; and the CHAR models encode and decode over an open vocabulary.
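As an illustration of what a closed vocabulary implies for the WORD models, the sketch below keeps the most frequent word types and maps everything else to a placeholder. The `<unk>` symbol and the function names are assumptions made for this example; the text above specifies only the vocabulary size of 50k.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical out-of-vocabulary symbol (an assumption)


def build_vocab(sentences, max_size=50_000):
    """Keep the max_size most frequent whitespace-delimited word types."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(max_size)}


def to_closed_vocab(sentence, vocab):
    """Replace every token outside the vocabulary with the UNK symbol."""
    return [tok if tok in vocab else UNK for tok in sentence.split()]
```

A character-level model avoids this mapping entirely, since any word can be spelled from the character inventory.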
Our contribution is to investigate the impact of sequence-to-sequence approaches (including those not considered in previous work) in a series of controlled experiments, holding the data constant. In doing so, we demonstrate that on a large, professionally annotated dataset, the most effective sequence-to-sequence approach can significantly outperform a state-of-the-art SMT system without augmenting the sequence-to-sequence model with a secondary model to handle low-frequency words (Yuan and Briscoe, 2016) or an additional model to improve precision or intersecting a large language model (Xie et al., 2016). We also demonstrate improvements over these previous sequence-to-sequence approaches on the CoNLL-2014 data and competitive results with Ji et al. (2017), despite using significantly less data.
The work of Schmaltz et al. (2016) applies WORD and CHARCNN models to the distinct binary classification task of error identification.
Additional Approaches The standard formulation of the correction task is to model the output sequence as t above. Here, we also propose modeling the diffs between s and t. The diffs are provided in-line within t and are described via tags marking the starts and ends of insertions and deletions, with replacements represented as deletion-insertion pairs, as in the following example selected from the training set: "Some key points are worth <del> emphasiz </del> <ins> emphasizing </ins> .". Here, "emphasiz" is replaced with "emphasizing". The models, including the CHAR model, treat each tag as a single, atomic token.
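One way to produce such a diff-tagged target from a sentence pair is sketched below using Python's standard `difflib`. This is an assumption about how the annotation could be derived, not a description of how the AESW tags were created; in particular, the grouping of multi-word edits may differ from the dataset's own annotation.

```python
from difflib import SequenceMatcher


def tag_diffs(source, target):
    """Return the target with in-line <del>/<ins> tags relative to the source."""
    s, t = source.split(), target.split()
    out = []
    matcher = SequenceMatcher(a=s, b=t, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(s[i1:i2])
            continue
        # A replacement is emitted as a deletion followed by an insertion.
        if op in ("delete", "replace"):
            out += ["<del>", *s[i1:i2], "</del>"]
        if op in ("insert", "replace"):
            out += ["<ins>", *t[j1:j2], "</ins>"]
    return " ".join(out)
```

Applied to the source "Some key points are worth emphasiz ." and its correction, this reproduces the tagged example quoted above.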
The diffs enable a means of tuning the model's propensity to generate corrections by modifying the probabilities generated by the decoder for the 4 diff tags, which we examine with the CoNLL data. We include four bias parameters associated with each diff tag, and run a grid search between 0 and 1.0 to set their values based on the tuning set.
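The tuning procedure can be sketched as below. Two points are assumptions made for illustration: the bias is added to the decoder's log-probability for each tag (the text says only that the probabilities for the four tags are modified), and the grid uses three values per tag. The `score_fn` argument is a hypothetical callback that decodes the tuning set with the given biases and returns the metric being tuned.

```python
import itertools

DIFF_TAGS = ("<ins>", "</ins>", "<del>", "</del>")


def apply_tag_biases(log_probs, biases):
    """Add a per-tag bias to one decoding step's scores (token -> log prob)."""
    return {tok: lp + biases.get(tok, 0.0) for tok, lp in log_probs.items()}


def grid_search(score_fn, values=(0.0, 0.5, 1.0)):
    """Try every combination of bias values and keep the best-scoring one."""
    best_score, best_biases = None, None
    for combo in itertools.product(values, repeat=len(DIFF_TAGS)):
        biases = dict(zip(DIFF_TAGS, combo))
        score = score_fn(biases)  # e.g. metric on the tuning set (assumed)
        if best_score is None or score > best_score:
            best_score, best_biases = score, biases
    return best_score, best_biases
```

Raising the bias on the opening tags makes the decoder more willing to start an edit, which trades precision for recall.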
It is possible for models with diffs to output invalid target sequences (for example, inserting a word without using a diff tag). To fix this, a deterministic post-processing step is performed (greedily from left to right) that returns to source any non-source tokens outside of insertion tags. Diffs are removed prior to evaluation. We indicate models that do not incorporate target diff annotation tags with the designator -DIFFS.
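To make the notion of an invalid sequence concrete, the sketch below splits a diff-tagged output into its source side and its target side and checks the source side against the input. It illustrates tag removal and a validity check only; it does not implement the greedy repair step, and the function names are assumptions introduced here.

```python
def apply_diffs(tagged):
    """Split a diff-tagged sequence into (source_side, target_side) tokens."""
    src, tgt, mode = [], [], None
    for tok in tagged.split():
        if tok in ("<del>", "<ins>"):
            mode = tok
        elif tok in ("</del>", "</ins>"):
            mode = None
        elif mode == "<del>":
            src.append(tok)  # deleted words exist only in the source
        elif mode == "<ins>":
            tgt.append(tok)  # inserted words exist only in the target
        else:
            src.append(tok)  # untagged words are shared by both sides
            tgt.append(tok)
    return src, tgt


def is_valid(source, tagged):
    """A tagged output is consistent if its source side equals the input."""
    return apply_diffs(tagged)[0] == source.split()
```

The target side returned by `apply_diffs` is the tag-free sentence that would be passed to evaluation. An output that inserts a word without tags fails `is_valid`, because the untagged word is not in the source.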
The AESW dataset provides the paragraph context and a journal domain (a classification of the document into one of nine subject categories) for each sentence. For the sequence-to-sequence models we propose modeling the input and output sequences with a special initial token representing the journal domain (+DOM). GLEU and M2 differences on test are statistically significant via paired bootstrap resampling (Koehn, 2004; Graham et al., 2014) at the 0.05 level, resampling the full set 50 times.
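Paired bootstrap resampling can be sketched as follows. The `exact_match` metric is a stand-in chosen so that the example runs on its own; it is an assumption, not GLEU or M2, and the function names are hypothetical. The default of 50 resamples follows the figure stated above.

```python
import random


def exact_match(hyps, refs):
    """Stand-in corpus metric (an assumption): share of exact matches."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap(metric, sys_a, sys_b, refs, n_resamples=50, seed=0):
    """Return the share of resampled test sets on which system A beats B."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_resamples):
        # Draw sentence indices with replacement; both systems are scored
        # on the same resampled set, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([sys_a[i] for i in idx], sample_refs)
        score_b = metric([sys_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins += 1
    return wins / n_resamples
```

Under the usual reading of this test, a win rate of at least 0.95 for system A corresponds to significance at the 0.05 level.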

Experiments
Data AESW (Daudaravicius, 2016) consists of sentences taken from academic articles annotated with corrections by professional editors used for the AESW shared task. The training set contains 1,182,491 sentences, of which 460,901 sentences have edits. We set aside a 9,947 sentence sample from the original development set for tuning (of which 3,797 contain edits), and use the remaining 137,446 sentences as the dev set (of which 53,502 contain edits). The test set contains 146,478 sentences. The primary focus of the present study is conducting controlled experiments on the AESW dataset, but we also investigate results on the CoNLL-2014 shared task data in light of recent neural results (Ji et al., 2017) and to serve as a baseline of comparison against existing sequence-to-sequence approaches (Yuan and Briscoe, 2016; Xie et al., 2016). We use the common sets of public data appearing in past work for training: the National University of Singapore (NUS) Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) and the Lang-8 data (Tajiri et al., 2012; Mizumoto et al., 2012). The Lang-8 dataset of corrections is large but is crowd-sourced and is thus of a different nature than the professionally annotated AESW and NUCLE datasets. We use the revised CoNLL-2013 test set as a tuning/dev set and the CoNLL-2014 test set (without alternatives) for testing. We do not make use of the non-public Cambridge Learner Corpus (CLC) (Nicholls, 2003), which contains over 1.5 million sentence pairs.
Evaluation We follow past work and use the Generalized Language Evaluation Understanding (GLEU) and MaxMatch (M2) metrics (Dahlmeier and Ng, 2012).
Parameters All our models, implemented with OpenNMT (Klein et al.), are 2-layer LSTMs with 750 hidden units. For the WORD model, the word embedding size is also set to 750, while for the CHARCNN and CHAR models we use a character embedding size of 25. The CHARCNN model has a convolutional layer with 1000 filters of width 6 followed by max-pooling, which is fed into a 2-layer highway network. Additional training details are provided in Appendix A. For AESW, the WORD+BI model contains around 144 million parameters, the CHARCNN+BI model around 79 million parameters, and the CHAR+BI model around 25 million parameters.
Statistical Machine Translation As a baseline of comparison, we experiment with a phrase-based machine translation approach (SMT) shown to be state-of-the-art for the CoNLL-2014 shared task data in previous work (Junczys-Dowmunt and Grundkiewicz, 2016), which adds task-specific features and the M2 metric as a scorer to the Moses statistical machine translation system. The SMT model follows the training, parameters, and dense and sparse task-specific features that generate state-of-the-art results for CoNLL-2014 shared task data, as implemented in publicly available code. However, to compare models against the same training data, we remove language model features associated with external data. We experiment with tuning against M2 (+M2) and BLEU (+BLEU). Models trained with diffs were only tuned with BLEU, since the tuning pipeline from previous work is not designed to handle removing such annotation tags prior to M2 scoring. Table 1 shows the full set of experimental results on the AESW development and test data.

Results and Analysis: AESW
The CHAR+BI+DOM model is stronger than the WORD+BI+DOM and CHARCNN+DOM models by 2.9 M2 (0.2 GLEU) and 3.3 M2 (0.3 GLEU), respectively. The sequence-to-sequence models were also more effective than the SMT models, as shown in Table 1. We find that training with target diffs is beneficial across all models, with an increase of about 5 M2 points for the WORD+BI model, for example. Adding +DOM information slightly improves effectiveness across models.
We analyzed deletion, insertion, and replacement error types. Table 2 compares effectiveness across replacement errors. We found the CHARCNN+BI models were less effective than CHARCNN variants in terms of GLEU and M2, and the strongest CHARCNN models were eclipsed by the WORD+BI models in terms of the GLEU and M2 scores. However, Table 2 shows CHARCNN+DOM is stronger on lower frequency replacements than WORD models. The CHAR+BI+DOM model is relatively strong on article and punctuation replacements, as well as errors appearing with low frequency in the training set and overall across deletion and insertion error types, which are summarized in Table 3.
Errors never occurring in training The comparatively high Micro F0.5 score (18.66) for the CHAR+BI+DOM model on replacement errors (Table 2) never occurring in training is a result of a high precision (92.65) coupled with a low recall (4.45). This suggests some limited capacity to generalize to items not seen in training. A selectively chosen example is the replacement from "discontinous" to "discontinuous", which never occurs in training. However, similar errors of low edit distance also occur once in the dev set and never in training, but the CHAR+BI+DOM model never correctly recovers many of these errors, and many of the correctly recovered errors are minor changes in capitalization or hyphenation.
Footnote: [...] filtered against the NUCLE corpus, hurt effectiveness for the phrase-based models. This is likely a reflection of the domain-specific nature of the academic text and LaTeX holder symbols appearing in the text. Here, we conduct controlled experiments without introducing additional domain-specific monolingual data.
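The precision, recall, and Micro F0.5 values quoted for replacement errors never occurring in training are related by the standard F-beta formula, which with beta = 0.5 weights precision more heavily than recall. The function name below is an assumption made for this sketch.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported for CHAR+BI+DOM on unseen replacement errors.
score = f_beta(92.65, 4.45)
print(round(score, 2))  # 18.66
```

The result shows how a very high precision can yield a moderate F0.5 even when recall is below 5.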
Error frequency About 39% of the AESW training sentences have errors, and of those sentences, on average, 2.4 words are involved in changes in deletions, insertions, or replacements (i.e., the count of words occurring between diff tags) per sentence. In the NUCLE data, about 37% of the sentences have errors, of which on average, 5.3 words are involved in changes. On the AESW dev set, if we only consider the 9545 sentences in which 4 or more words are involved in a change (average of 5.8 words in changes per sentence), the CHAR+BI model is still more effective than SMT+BLEU, with a GLEU score of 67.21 vs. 65.34. The baseline GLEU score (No Change) is 60.86, reflecting the greater number of changes relative to the full dataset (cf. Table 1).
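The two statistics used here, the share of sentences with edits and the number of words occurring between diff tags, can be computed from diff-tagged sentences as sketched below. The function names are assumptions; the tag inventory and the sentence counts come from the text.

```python
DIFF_OPEN = {"<del>", "<ins>"}
DIFF_CLOSE = {"</del>", "</ins>"}


def changed_word_count(tagged):
    """Count the words occurring between diff tags in one tagged sentence."""
    inside, count = False, 0
    for tok in tagged.split():
        if tok in DIFF_OPEN:
            inside = True
        elif tok in DIFF_CLOSE:
            inside = False
        elif inside:
            count += 1
    return count


def edit_stats(tagged_sentences):
    """Return (share of sentences with edits, mean changed words per edited sentence)."""
    counts = [changed_word_count(s) for s in tagged_sentences]
    edited = [c for c in counts if c > 0]
    share = len(edited) / len(counts)
    mean_changed = sum(edited) / len(edited) if edited else 0.0
    return share, mean_changed


# Share of AESW training sentences with edits, from the reported counts.
aesw_share = 460_901 / 1_182_491
```

With the reported counts, `aesw_share` rounds to 39%, matching the figure above.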

Re-annotation
The AESW dataset only provides 1 annotation for each sentence, so we perform a small re-annotation of the data to gauge effectiveness in the presence of multiple annotations. We collected 3 outputs (source, gold, and generated sentences from the CHAR+BI+DOM model) for 200 randomly sampled sentences, reannotating to create 3 new references for each sentence. The GLEU scores for the 200 original source, CHAR+BI+DOM, and original gold sentences evaluated against the 3 new references were 79.79, 81.72, and 84.78, respectively, suggesting that there is still progress to be made on the task relative to human levels of annotation. Table 4 shows the results on the CoNLL dev set, and Table 5 contains the final test results.

Results and Analysis: CoNLL
Since the CoNLL data does not contain enough data for training neural models, previous works add the crowd-sourced Lang-8 data; however, this data is not professionally annotated. Since the distribution of corrections differs between the dev/test and training sets, we need to tune the precision and recall.
As shown in Table 4, WORD+BI effectiveness increases significantly by tuning the weights assigned to the diff tags on the CoNLL-2013 set. (Note that we are tuning the weights on this same CoNLL-2013 set.) Without tuning, the model very rarely generates a change, albeit with a high precision. After tuning, it exceeds the effectiveness of WORD+BI-DIFFS. The comparatively low effectiveness of WORD+BI-DIFFS is consistent with past sequence-to-sequence approaches utilizing data augmentation, additional annotated data, and/or secondary models to achieve competitive levels of effectiveness. Table 5 shows that WORD+BI is within 0.2 M2 of Ji et al. (2017), despite using over 1 million fewer sentence pairs, and exceeds the M2 scores of Xie et al. (2016) and Yuan and Briscoe (2016) without the secondary models of those systems. We hypothesize that further gains are possible utilizing the CLC data and moving to the character model. (The character model is omitted here due to the long training time of about 4 weeks.) Notably, SMT systems (with LMs) are still more effective than reported sequence-to-sequence results, as in Ji et al. (2017), on CoNLL.

Conclusion
Our experiments demonstrate that on a large, professionally annotated dataset, a sequence-to-sequence character-based model of diffs can lead to considerable effectiveness gains over a state-of-the-art SMT system with task-specific features, ceteris paribus. Furthermore, in the crowd-sourced environment of the CoNLL data, in which there are comparatively few professionally annotated sentences in training, modeling diffs enables a means of tuning that improves the effectiveness of sequence-to-sequence models for the task.