A Nested Attention Neural Hybrid Model for Grammatical Error Correction

Grammatical error correction (GEC) systems strive to correct both global errors in word order and usage, and local errors in spelling and inflection. Building on recent work in neural machine translation, we propose a new hybrid neural model with nested attention layers for GEC. Experiments show that the new model can effectively correct errors of both types by incorporating word-level and character-level information, and that the model significantly outperforms previous neural models for GEC as measured on the standard CoNLL-14 benchmark dataset. Further analysis also shows that the superiority of the proposed model can be largely attributed to the use of the nested attention mechanism, which has proven particularly effective in correcting local errors that involve small edits in orthography.


Introduction
One of the most successful approaches to grammatical error correction (GEC) is to cast the problem as (monolingual) machine translation (MT), where we translate from possibly ungrammatical English sentences to corrected ones (Brockett et al., 2006; Gao et al., 2010; Junczys-Dowmunt and Grundkiewicz, 2016). Such systems, which are based on phrase-based MT models that are typically trained on large sets of sentence-correction pairs, can correct global errors in word order and usage as well as local errors in spelling and inflection. The approach has proven superior to systems based on local classifiers that can only fix focused errors in prepositions, determiners, or inflected forms (Rozovskaya and Roth, 2016).
Recently, neural machine translation (NMT) systems have achieved substantial improvements in translation quality over phrase-based MT systems (Sutskever et al., 2014; Bahdanau et al., 2015). Thus, there is growing interest in applying neural systems to GEC (Yuan and Briscoe, 2016; Xie et al., 2016). In this paper, we significantly extend previous work and explore new neural models to meet the unique challenges of GEC.
The core component of most NMT systems is a sequence-to-sequence (S2S) model which encodes a sequence of source words into a vector and then generates a sequence of target words from the vector. Unlike phrase-based MT models, the S2S model can capture long-distance, or even global, word dependencies, which are crucial to correcting global grammatical errors and helping users achieve native-speaker fluency (Sakaguchi et al., 2016). Thus, the S2S model is expected to perform better on GEC than phrase-based models. However, as we will show in this paper, to achieve the best performance on GEC, we still need to extend the standard S2S model to address several task-specific challenges, which we describe below.
First, a GEC model needs to deal with an extremely large vocabulary that consists of a large number of words and their (mis)spelling variations.
Second, the GEC model needs to capture structure at different levels of granularity in order to correct errors of different types. For example, while correcting spelling and local grammar errors requires only word-level or sub-word-level information, e.g., violets → violates (spelling) or violate → violates (verb form), correcting errors in word order or usage requires global semantic relationships among phrases and words.
Standard approaches in neural machine translation, also applied to grammatical error correction by Yuan and Briscoe (2016), address the large vocabulary problem by restricting the vocabulary to a limited number of high-frequency words and resorting to standard word translation dictionaries to provide translations for the words that are out of the vocabulary (OOV). However, this approach often fails to take the OOVs' context into account when making correction decisions, and does not generalize well to correcting words that are unseen in the parallel training data. An alternative approach, proposed by Xie et al. (2016), applies a character-level sequence-to-sequence neural model. Although the model eliminates the OOV issue, it cannot effectively leverage word-level information for GEC, even when used together with a separate word-based language model.

Our solution to the challenges mentioned above is a novel hybrid neural model with nested attention layers that infuses both word-level and character-level information. The architecture of the model is illustrated in Figure 1. The word-level information is used for correcting global grammar and fluency errors, while the character-level information is used for correcting local errors in spelling or inflected forms. Contextual information is crucial for GEC. By combining embedding vectors and attention at both the word and character levels, the proposed model represents all contextual words, including OOVs, in a unified context vector representation. In particular, as we will discuss in Section 5, the character-level attention layer captures the most useful information for correcting local errors that involve small edits in orthography.
Our model differs substantially from the word-level S2S model of Yuan and Briscoe (2016) and the character-level S2S model of Xie et al. (2016) in the way we infuse information at both the word level and the character level. We extend the word-character hybrid model of Luong and Manning (2016), which was originally developed for machine translation, by introducing a character attention layer. This allows the model to learn substitution patterns at both the character level and the word level in an end-to-end fashion, using sentence-correction pairs. We validate the effectiveness of our model on the CoNLL-14 benchmark dataset (Ng et al., 2014). Results show that the proposed model outperforms all previous neural models for GEC, including the hybrid model of Luong and Manning (2016), which we apply to GEC for the first time. When integrated with a large word-based n-gram language model, our GEC system achieves an F 0.5 of 45.15 on CoNLL-14, substantially exceeding the previously reported top performance of 40.56, achieved by a neural model combined with an external language model (Xie et al., 2016).

Related Work
A variety of classifier-based and MT-based techniques have been applied to grammatical error correction. The CoNLL-14 shared task overview paper of Ng et al. (2014) provides a comparative evaluation of approaches. Two notable advances after the shared task have been in the areas of combining classifiers and phrase-based MT (Rozovskaya and Roth, 2016) and adapting phrase-based MT to the GEC task (Junczys-Dowmunt and Grundkiewicz, 2016). The latter work has reported the highest performance to date on the task, an F 0.5 score of 49.5 on the CoNLL-14 test set. This method integrates discriminative training toward the task-specific evaluation function, a rich set of features, and multiple large language models. Neural approaches to the task are less explored. We believe that the advances of Junczys-Dowmunt and Grundkiewicz (2016) are complementary to the ones we propose for neural MT, and could be integrated with neural models to achieve even higher performance.
Two prior works explored sequence-to-sequence neural models for GEC (Xie et al., 2016; Yuan and Briscoe, 2016), while Chollampatt et al. (2016) integrated neural features in a phrase-based system for the task. Neural models were also applied to the related sub-task of grammatical error identification (Schmaltz et al., 2016). Yuan and Briscoe (2016) demonstrated the promise of neural MT for GEC but did not adapt the basic sequence-to-sequence model with attention to the task's unique challenges, falling back to traditional word-alignment models to address vocabulary coverage with a post-processing heuristic. Xie et al. (2016) built a character-level sequence-to-sequence model, which achieves open vocabulary and character-level modeling, but has difficulty with global word-level decisions.
The primary focus of our work is the integration of character- and word-level reasoning in neural models for GEC, to capture global fluency errors and local errors in spelling and closely related morphological variants, while obtaining open-vocabulary coverage. This is achieved with the help of character- and word-level encoders and decoders with two nested levels of attention. Our model is inspired by advances in sub-word-level modeling in neural machine translation. We build mostly on the hybrid model of Luong and Manning (2016), expanding its capability to correct rare words through fine-grained character-level attention. We directly compare our model to the one of Luong and Manning (2016) on the grammar correction task. Alternative methods for MT include modeling of word pieces to achieve an open vocabulary (Sennrich et al., 2016) and, more recently, fully character-level modeling (Lee et al., 2017). None of these models integrates two nested levels of attention, although an empirical evaluation of these approaches for GEC would also be interesting.

Nested Attention Hybrid Model
Our model is hybrid, and uses both word-level and character-level representations. It consists of a word-based sequence-to-sequence model as a backbone, plus additional character-level encoder, decoder, and attention components, which focus on words that are outside the word-level model's vocabulary.

Word-based sequence-to-sequence model as backbone
The word-based backbone closely follows the basic neural sequence-to-sequence architecture with attention as proposed by Bahdanau et al. (2015) and applied to grammatical error correction by Yuan and Briscoe (2016). For completeness, we give a sketch here. It uses recurrent neural networks to encode the input sentence and to decode the output sentence.
Given a sequence of embedding vectors x = (x_1, ..., x_T) corresponding to a sequence of input words, the encoder creates a corresponding context-specific sequence of hidden state vectors e = (h_1, ..., h_T). The hidden state h_t at time t concatenates the states of a forward and a backward recurrent network:

  h_t = [h^f_t; h^b_t],  h^f_t = GRU_{enc_f}(x_t, h^f_{t-1}),  h^b_t = GRU_{enc_b}(x_t, h^b_{t+1})   (1)

where GRU_{enc_f} and GRU_{enc_b} stand for gated recurrent unit functions as described in Cho et al. (2014). We use the symbol GRU with different subscripts to represent GRU functions using different sets of parameters (for example, the enc_f and enc_b subscripts denote the parameters of the forward and backward word-level encoder units). The decoder network is also an RNN using GRU units, and defines a sequence of hidden states \bar{d}_1, ..., \bar{d}_S used to define the probability of an output sequence y_1, ..., y_S. The context vector c_s at time step s is computed as:

  c_s = \sum_k α_{sk} h_k   (2)

where the attention weights are:

  α_{sk} = exp(φ_1(d_s)^T φ_2(h_k)) / \sum_j exp(φ_1(d_s)^T φ_2(h_j))   (3)

Here φ_1 and φ_2 denote feedforward linear transformations followed by a tanh nonlinearity. The next hidden state \bar{d}_s is then defined as:

  d_s = GRU_{dec}(\bar{d}_{s-1}, y_{s-1}),  \bar{d}_s = ReLU(W[d_s; c_s])   (4)

where y_{s-1} is the embedding of the output token at time s−1 and ReLU indicates rectified linear units (Hahnloser et al., 2000).
The probability of each target word y_s is computed as p(y_s | y_{<s}, x) = softmax(g(\bar{d}_s)), where g is a function that maps the decoder state into a vector whose size is the dimensionality of the target vocabulary.
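To make the attention computation concrete, the sketch below computes attention weights and a context vector in the spirit of Eqs. (2)-(3). It is a deliberately simplified, pure-Python illustration: the score here is a plain dot product between the decoder state and each encoder state, whereas the model described above first applies the learned transformations φ_1 and φ_2; all names are ours, not the paper's.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_context(d_s, encoder_states):
    # Toy version of Eqs. (2)-(3): score each encoder state h_k against
    # the decoder state d_s (plain dot product here), normalize the
    # scores into weights alpha, and take the weighted sum as c_s.
    scores = [sum(di * hi for di, hi in zip(d_s, h)) for h in encoder_states]
    alpha = softmax(scores)
    c_s = [sum(alpha[k] * encoder_states[k][i] for k in range(len(alpha)))
           for i in range(len(d_s))]
    return alpha, c_s
```

The weights sum to one and the context vector has the same dimensionality as the encoder states, exactly as in Eq. (2).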
The model is trained by minimizing the cross-entropy loss, which for a given (x, y) pair is:

  Loss_w(x, y) = − \sum_s log p(y_s | y_{<s}, x)   (5)

For parallel training data C, the loss is the sum of Loss_w(x, y) over all sentence pairs (x, y) in C.

Hybrid encoder and decoder with two nested levels of attention
The word-level backbone models a limited vocabulary of source and target words, and represents out-of-vocabulary tokens with special UNK symbols. In the standard word-level NMT approach, valuable information is lost for source OOV words, and target OOV words are predicted using post-processing heuristics.

Hybrid encoder
Our hybrid architecture overcomes the loss of source information in the word-level backbone by building up compositional representations of the source OOV words using a character-level recurrent neural network with GRU units. These representations are used in place of the special source UNK embeddings in the backbone, and contribute to the contextual encoding of all source tokens. For example, a three-word input sentence where the last term is out-of-vocabulary will be represented as the following vector of embeddings in the word-level model: x = (x_1, x_2, x_3), where x_3 would be the embedding of the UNK symbol.
The hybrid encoder builds up a word embedding for the third word based on its character sequence x^c_1, ..., x^c_M. The encoder computes a sequence of hidden states e^c = (h^c_1, ..., h^c_M) for this character sequence, using a forward character-level GRU network:

  h^c_m = GRU_{enc_c}(x^c_m, h^c_{m-1})   (6)

The last state h^c_M is used as an embedding of the unknown word, so the sequence of embeddings for our example three-word sequence becomes x = (x_1, x_2, h^c_M). We use the same dimensionality for word embedding vectors x_i and composed character sequence vectors h^c_M to ensure the two ways of defining embeddings are compatible. Our hybrid source encoder architecture is similar to the one proposed by Luong and Manning (2016).
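A minimal sketch of how a character-level encoder composes an embedding for an OOV word, per Eq. (6): run a forward recurrent network over the character vectors and keep the final hidden state. For brevity a plain tanh RNN cell stands in for the GRU, and the character vectors and weights are illustrative assumptions, not the paper's parameters.

```python
import math

def tanh_rnn_step(h, x, W_h, W_x):
    # One recurrent step; a plain tanh cell stands in for the GRU.
    return [math.tanh(sum(W_h[i][j] * h[j] for j in range(len(h))) +
                      sum(W_x[i][j] * x[j] for j in range(len(x))))
            for i in range(len(h))]

def char_embedding(word, char_vec, W_h, W_x):
    # Run the character RNN left to right (Eq. 6) and return the final
    # hidden state h_M, which serves as the OOV word's embedding.
    h = [0.0] * len(W_h)
    for ch in word:
        h = tanh_rnn_step(h, char_vec(ch), W_h, W_x)
    return h
```

In the full model this composed vector simply replaces the UNK embedding (x_3 in the example above), so the word-level encoder treats known and unknown words uniformly.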

Nested attention hybrid decoder
In traditional word-based sequence-to-sequence models, special target UNK tokens are used to represent outputs that are outside the target vocabulary. A post-processing UNK-replacement method is then used (Cho et al., 2015; Yuan and Briscoe, 2016) to replace these special tokens with target words. The hybrid model of Luong and Manning (2016) uses a jointly trained character-level decoder to generate the target words corresponding to UNK tokens, and outperforms the traditional approach on the machine translation task.
However, unlike machine translation, models for grammar correction conduct "translation" in the same language, and often need to apply a small number of local edits to the character sequence of a source word corresponding to the target UNK word.For example, rare but correct words such as entity names need to be copied as is, and local spelling errors or errors in inflection need to be corrected.The architecture of Luong and Manning (2016) does not have direct access to a source character sequence, but only uses a single fixed-dimensionality embedding of source unknown words aggregated with additional contextual information from the source.
To address the needs of the grammatical error correction task, we propose a novel hybrid decoder with two nested levels of attention: word-level and character-level. The character-level attention serves to provide the decoder with direct access to the relevant source character sequence.
More specifically, the probability of each target word is defined as follows: for words in the target vocabulary, the probability is defined by the word-level backbone; for words outside the vocabulary, the probability of each token is the probability of UNK according to the backbone, multiplied by the probability of the word's character sequence.
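This factorization can be written down directly. The sketch below (our own illustrative naming, not the paper's code) scores an in-vocabulary word with the word-level softmax probability, and an OOV word with the UNK probability plus the character decoder's log-probability for its spelling:

```python
import math

def target_word_logprob(word, word_probs, char_seq_logprob, vocab):
    # In-vocabulary words: probability from the word-level softmax.
    # OOV words: word-level UNK probability times the probability the
    # character-level decoder assigns to the word's character sequence
    # (a sum in log space).
    if word in vocab:
        return math.log(word_probs[word])
    return math.log(word_probs["<unk>"]) + char_seq_logprob(word)
```

Here `word_probs` stands for the backbone's softmax output and `char_seq_logprob` for the character-level decoder; both are hypothetical interfaces for illustration.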
The probability of the target character sequence corresponding to an UNK token at position s in the target is defined using a character-level decoder. As in Luong and Manning (2016), the "separate path" architecture is used to capture the relevant context and define the initial state for the character-level decoder:

  \hat{d}_s = ReLU(\hat{W}[d_s; c_s])

where \hat{W} are parameters different from W; \hat{d}_s is not used by the word-level model in predicting the subsequent tokens, but only to initialize the character-level decoder.
To be able to attend to the relevant source character sequence when generating the target character sequence, we use the concept of hard attention (Xu et al., 2015), but use an arg-max approximation for inference instead of sampling.A similar approach to represent discrete hidden structure in a variety of architectures is used in Kong et al. (2017).
The source index z_s corresponding to the target position s is defined according to the word-level attention model:

  z_s = argmax_k α_{sk}

where α_{sk} are the intermediate outputs of the word-level attention model described in Eq. (3).
The character-level decoder generates a character sequence y^c = (y^c_1, ..., y^c_N), conditioned on the initial vector \hat{d}_s and the source index z_s. The characters are generated using a hidden state vector d^c_n at each time step, via softmax(g_c(d^c_n)), where g_c maps the state to the target character vocabulary space.
If the source word x_{z_s} is in the source vocabulary, the model is analogous to the one of Luong and Manning (2016) and does not use character-level attention: the source context is available only in aggregated form, to initialize the state of the decoder. The state d^c_n for step n in the character-level decoder is defined as follows, where GRU_{dec_c} denotes the gated recurrent cell of this decoder:

  d^c_n = GRU_{dec_c}(d^c_{n-1}, y^c_{n-1}),  with d^c_0 = \hat{d}_s

In contrast, if the corresponding token in the source x_{z_s} is also an out-of-vocabulary word, we define a second, nested level of character attention and use it in the character-level decoder. The character-level attention focuses on individual characters of the source word x_{z_s}. If e^c are the source character hidden vectors computed as in Eq. (6), the recurrence equations for the character-level decoder with nested attention are:

  d^c_n = GRU_{dec_c}(\bar{d}^c_{n-1}, y^c_{n-1}),  \bar{d}^c_n = ReLU(W^c[d^c_n; c^c_n])

where c^c_n is the context vector obtained using character-level attention over the sequence e^c and the last state of the character-level decoder d^c_n, computed following Equations (2), (3), and (4), but using a different set of parameters.
These equations show that the character-level decoder with nested attention can use both the word-level state \hat{d}_s and the character-level context c^c_n and hidden state d^c_n to perform global and local editing operations.
Since we introduced two architectures for the character-level decoder, depending on whether the source word x_{z_s} is OOV, the combined loss function for end-to-end training is defined as:

  Loss = Loss_w + α · Loss_c1 + β · Loss_c2

Here Loss_w is the standard word-level loss in Eq. (5); the character-level losses Loss_c1 and Loss_c2 are the losses for target OOV words corresponding to source known and unknown tokens, respectively. α and β are hyper-parameters that balance the loss terms.
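As a worked illustration of the combined objective (assuming the three per-term losses have already been computed elsewhere):

```python
def combined_loss(loss_w, loss_c1, loss_c2, alpha=0.5, beta=0.5):
    # Joint objective: word-level loss plus the two character-level
    # losses, weighted by hyper-parameters alpha and beta (both are
    # set to 0.5 in the experiments reported in this paper).
    return loss_w + alpha * loss_c1 + beta * loss_c2
```

With α = β = 0.5, a word-level loss of 1.0 and character-level losses of 2.0 and 3.0 combine to 1.0 + 1.0 + 1.5 = 3.5.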
As seen, our proposed nested attention hybrid model uses character-level attention only when both a predicted target word and its corresponding source input word are OOV. While the model can be naturally generalized to integrate character-level attention for known words as well, the original hybrid model proposed by Luong and Manning (2016) does not use any character-level information for known words. Thus, for a controlled evaluation of the impact of adding character-level attention alone, in this paper we limit character-level attention to OOV words, which already use characters as a basis for building their embedding vectors. A thorough investigation of the impact of character-level information in the encoder, attention, and decoder for known words as well is an interesting topic for future research.

Decoding for word-level and hybrid models
Beam-search is used to decode hypotheses according to the word-level backbone model.For the hybrid model architecture, word-level beam search is conducted first; for each target UNK token, character-level beam-search is used to generate a corresponding target word.

Dataset and Evaluation
We use standard publicly available datasets for training and evaluation. One data source is the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013), which is provided as a training set for the CoNLL-13 and CoNLL-14 shared tasks. From the original corpus of about 60K parallel sentences, we randomly selected close to 5K sentence pairs for use as a validation set, and 45K parallel sentences for use in training. A second data source is the Cambridge Learner Corpus (CLC) (Nicholls, 2003), from which we extracted a substantially larger set of parallel sentences. Finally, we used additional training examples from the Lang-8 Corpus of Learner English v1.0 (Tajiri et al., 2012). As the Lang-8 data is crowd-sourced, we used heuristics to filter out noisy examples: we removed sentences longer than 100 words and sentence pairs where the correction was substantially shorter than the input text. Table 2 shows the number of sentence pairs from each source used for training.
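The Lang-8 filtering heuristics can be sketched as follows. The 100-word cap is stated above, while the 0.5 length-ratio threshold for "substantially shorter" is our illustrative assumption, since the text does not give an exact value:

```python
def keep_pair(source, correction, max_len=100, min_len_ratio=0.5):
    # Drop sentences longer than max_len words, and pairs whose
    # correction is much shorter than the input (a likely sign of a
    # noisy, non-parallel Lang-8 example).  min_len_ratio is an
    # assumed threshold for "substantially shorter".
    src_words = source.split()
    cor_words = correction.split()
    if len(src_words) > max_len or len(cor_words) > max_len:
        return False
    if len(cor_words) < min_len_ratio * len(src_words):
        return False
    return True
```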
We evaluate the performance of the models on the standard sets from the CoNLL-14 shared task (Ng et al., 2014). We report final performance on the CoNLL-14 test set without alternatives, and analyze model performance on the CoNLL-13 development set (Dahlmeier et al., 2013). We use the development and validation sets for model selection.
The sizes of all datasets in number of sentences are shown in Table 1. We report performance in F 0.5 -measure, as calculated by m2scorer, the official implementation of the scoring metric in the shared task (available at http://www.comp.nus.edu.sg/~nlp/sw/m2scorer.tar.gz). Given system outputs and gold-standard edits, m2scorer computes the F 0.5 measure of a set of system edits against a set of gold-standard edits.
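For reference, F 0.5 weights precision twice as heavily as recall, which is why it is the standard metric for GEC (unnecessary "corrections" are penalized more than missed ones). A small sketch of the F_β computation over edit counts:

```python
def f_beta(tp, fp, fn, beta=0.5):
    # tp/fp/fn: counts of system edits that match / don't match the
    # gold edits, and gold edits the system missed.  beta = 0.5
    # weights precision more heavily than recall.
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

Note that a high-precision/low-recall system scores better under F 0.5 than the reverse, as the second assertion below illustrates.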

Baseline
We evaluate our model in comparison to the strong baseline of a word-based neural sequence-to-sequence model with attention, with post-processing for handling out-of-vocabulary words (Yuan and Briscoe, 2016); we refer to this model as word NMT+UNK replacement. Like Yuan and Briscoe (2016), we use a traditional word-alignment model (GIZA++) to derive a word-correction lexicon from the parallel training set. However, in decoding, we do not use GIZA++ to find the corresponding source word for each target OOV, but follow Cho et al. (2015), Section 3.3, and use the NMT system's attention weights instead. The target OOV is then replaced by the most likely correction of the source word from the word-correction lexicon, or by the source word itself if there are no available corrections.
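The UNK-replacement post-processing step can be sketched as follows: align each target UNK to the source position with the highest attention weight, then look that source word up in the word-correction lexicon, copying it verbatim when no correction is available. The data structures and names here are illustrative:

```python
def replace_unks(target_tokens, source_tokens, attention, lexicon):
    # attention[s][k] is the word-level attention weight alpha_{sk} of
    # target position s on source position k; lexicon maps a source
    # word to its most likely correction.
    out = []
    for s, tok in enumerate(target_tokens):
        if tok != "<unk>":
            out.append(tok)
            continue
        weights = attention[s]
        z = max(range(len(weights)), key=lambda k: weights[k])
        src_word = source_tokens[z]
        out.append(lexicon.get(src_word, src_word))
    return out
```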

Training Details and Results
The embedding size for all word- and character-level encoders and decoders is set to 1000, and the hidden unit size is also 1000. To reproduce the model of Yuan and Briscoe (2016), we selected the word vocabulary for the baseline by choosing the 30K most frequent words in the source and target respectively to form the source and target vocabularies. In preliminary experiments for the hybrid models, we found that selecting the same vocabulary of 30K words for both the source and target based on combined frequency was better (by 0.003 in F 0.5), and we use that method for vocabulary selection instead. However, no gain was observed from using such a vocabulary selection method in the baseline. Although the source and target vocabularies in the hybrid models are the same, the embedding parameters for source and target words are not shared, as in the word-level model.
The hyper-parameters for the losses in our models are selected based on the development set and set as follows: α = β = 0.5. All models are trained with a mini-batch size of 128 (batches are shuffled), an initial learning rate of 0.0003, and a 0.95 learning-rate decay ratio applied if the cost increases over two consecutive 100-iteration intervals. The gradient is rescaled whenever its norm exceeds 10, and dropout is used with a probability of 0.15. Parameters are initialized uniformly at random. We perform inference on the validation set every 5,000 iterations to log the word-level and character-level costs; we save parameter values for the model every 10,000 iterations, as well as at the end of each epoch. The stopping point for training is selected based on development set F 0.5 among the top 20 parameter sets with the best validation-set value of the loss function.
Training of the nested attention hybrid model takes approximately five days on a Tesla k40m GPU.The basic hybrid model trains in around four days and the word-level backbone trains in approximately three days.
Table 3 shows the performance of the baseline and our nested attention hybrid model on the development and test sets. In addition to the word-level baseline, we include the performance of a hybrid model with a single level of attention, which follows the work of Luong and Manning (2016) for machine translation, and is the first application of a hybrid word/character-level model to grammatical error correction. Based on hyper-parameter selection, the character-level component weight of the loss is α = 1 for the basic hybrid model.
As shown in Table 3, our implementation of the word NMT+UNK replacement baseline approaches the performance of the one reported in Yuan and Briscoe (2016) (38.77 versus 39.9). We attribute the difference to differences in the training set and the word-alignment methods used. Our reimplementation serves to provide a controlled experimental evaluation of the impact of hybrid models and nested attention on the GEC task. As seen, our nested attention hybrid model substantially improves upon the baseline, achieving a gain of close to 3 points on the test set. The hybrid word/character model with a single level of attention brings a large improvement as well, showing the importance of character-level information for this task. We delve deeper into the impact of nested attention for the hybrid model in Section 5.

Integrating a Web-scale Language Model
The value of large language models for grammatical error correction is well known, and such models have been used in classifier and MT-based systems.To establish the potential of such models in word-based neural sequence-to-sequence systems, we integrate a web-scale count-based language model.In particular, we use the modified Kneser-Ney 5-gram language model trained from Common Crawl (Buck et al., 2014), made available for download by Junczys-Dowmunt and Grundkiewicz (2016).
Candidates generated by the neural models are reranked using the following linear interpolation of log probabilities:

  s(y|x) = log P_NN(y|x) + λ log P_LM(y)

Here λ is a hyper-parameter that balances the weights of the neural network model and the language model. We tuned λ separately for each neural model variant, by exploring values in the range [0.0, 2.0] with step size 0.1 and selecting according to development set F 0.5. The selected values of λ are 1.6 for word NMT + UNK replacement and 1.0 for the nested attention model.

Table 4: F 0.5 results on the CoNLL-13 and CoNLL-14 test sets of main model architectures, when combined with a large language model.
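The reranking itself reduces to a few lines. In the sketch below, candidates are (sentence, neural log-probability) pairs, and the language-model scorer is an illustrative stand-in:

```python
def rerank(candidates, lm_logprob, lam):
    # Score each candidate by s(y|x) = log P_NN(y|x) + lam * log P_LM(y)
    # and return the highest-scoring sentence.
    scored = [(y, nn_lp + lam * lm_logprob(y)) for y, nn_lp in candidates]
    return max(scored, key=lambda t: t[1])[0]
```

Setting lam = 0 recovers the neural model's own 1-best output, which makes the tuning range [0.0, 2.0] a natural choice.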
Table 4 shows the impact of the LM when combined with the neural models implemented in this work. The table also lists the results reported by Xie et al. (2016) for their character-level neural model combined with a large word-level language model. Our best results exceed the ones reported in the prior work by more than 4 points, although we should note that Xie et al. (2016) used a smaller parallel data set for training.

Analysis
We analyze the impact of sub-word-level information and the two nested levels of attention in more detail by looking at the performance of the models on different segments of the data. In particular, we analyze the performance of the models on sentences containing OOV source words versus ones without OOV words, and on corrections to orthographically similar versus dissimilar word forms.

Performance by Segment: OOV versus Non-OOV
We present a comparative performance analysis of the models on the CoNLL-13 development set. First, we divide the set into two segments, OOV and non-OOV, based on whether there is at least one OOV word in the given source input; Table 5 shows results on these segments. Note that character-level information is used only for OOV words in the nested attention model; extending it to known words could improve performance on the non-OOV segment as well. Table 6 shows an example where the nested attention hybrid model successfully corrects a misspelling resulting in an OOV word on the source side, whereas the baseline word-level system simply copies the source word without fixing the error (since this particular error is not observed in the parallel training set).

Impact of Nested Attention on Different Error Types
To analyze more precisely the impact of the additional character-level attention introduced by our design, we investigate the OOV segment in more detail. The concept of an edit, which is also used by the official M2 score metric, is defined as a minimal pair of corresponding sub-strings in a source sentence and a correction. For example, in the sentence fragment pair "Even though there is a risk of causing harms to someone, people still are prefers to keep their pets without a leash." → "Even though there is a risk of causing harm to someone, people still prefer to keep their pets without a leash.", the minimal edits are "harms → harm" and "are prefers → prefer". The F 0.5 score is computed using weighted precision and recall of the set of a system's edits against one or more sets of reference edits.
For our in-depth analysis, we classify edits in the OOV segment into two types, small changes and large changes, based on whether the source and target phrases of the edit are orthographically similar. More specifically, we say that the target and source phrases are orthographically similar iff: the character edit distance is at most 2 and the source or target is at most 8 characters long, or the edit ratio is below 0.25, where

  edit ratio = character edit distance / (min(len(src), len(tgt)) + 0.1)

and len(·) denotes the number of characters, with src and tgt denoting the source and target phrases of the edit. There are 307 gold edits in the "small changes" portion of the CoNLL-13 OOV segment, and 481 gold edits in the "large changes" portion.
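The "small change" test is easy to restate in code. The sketch below implements the character edit distance and the classification rule exactly as defined above:

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming over a
    # rolling row of the edit matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_small_change(src, tgt):
    # "Small change" iff edit distance <= 2 with a side of at most 8
    # characters, or edit ratio below 0.25.
    d = edit_distance(src, tgt)
    if d <= 2 and min(len(src), len(tgt)) <= 8:
        return True
    return d / (min(len(src), len(tgt)) + 0.1) < 0.25
```

Under this rule, "harms → harm" and "violets → violates" are small changes, while "are prefers → prefer" is a large one.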
Our hypothesis is that the additional character-level attention layer is particularly useful for modeling edits among orthographically similar words. Table 7 contrasts the impact of character-level attention on the two portions of the data. We can see that the gains in the "small changes" portion are indeed quite large, indicating that the fine-grained character-level attention empowers the model to more accurately correct confusions among phrases with high character-level similarity. The impact in the "large changes" portion is slightly positive in precision and slightly negative in recall. Thus most of the benefit of the additional character-level attention stems from improvements in the "small changes" portion.

Table 8 shows an example input which illustrates the precision gain of the nested attention hybrid model. The input sentence has a source OOV word which is correct. The hybrid model introduces an error in this word, because it uses only a single source context vector, aggregating the character-level embedding of the source OOV word together with other source words. The additional character-level attention layer in the nested hybrid model enables the correct copying of this long source OOV word, without employing the heuristic mechanism of the word-level NMT system.

Conclusions
We have introduced a novel hybrid neural model with two nested levels of attention: word-level and character-level. The model addresses the unique challenges of the grammatical error correction task and achieves the best reported results on the CoNLL-14 benchmark among fully neural systems. Our nested attention hybrid model deeply combines the strengths of word- and character-level information in all components of an end-to-end neural model: the encoder, the attention layers, and the decoder. This enables it to correct both global word-level and local character-level errors in a unified way. The new architecture contributes substantial improvements in the correction of confusions among rare or orthographically similar words, compared to word-level sequence-to-sequence and non-nested hybrid models.

Figure 1:
Architecture of the Nested Attention Hybrid Model.
Table 8:
An example where the nested attention hybrid model outperforms the non-nested model.

Table 1 :
Overview of the datasets used.

Table 2 :
Training data by source.

Table 5 :
F 0.5 results on the CoNLL-13 set of main model architectures, on different segments of the set according to whether the input contains OOVs.

Table 6 :
An example sentence from the OOV segment where the nested attention hybrid model improves performance.

Table 7:
Precision, Recall, and F 0.5 results on CoNLL-13, on the "small changes" and "large changes" portions of the OOV segment.