Adapting Grammatical Error Correction Based on the Native Language of Writers with Neural Network Joint Models

An important aspect of the task of grammatical error correction (GEC) that has not yet been adequately explored is adaptation based on the native language (L1) of writers, despite the marked influences of L1 on second language (L2) writing. In this paper, we adapt a neural network joint model (NNJM) using L1-specific learner text and integrate it into a statistical machine translation (SMT) based GEC system. Specifically, we train an NNJM on general learner text (not L1-specific) and subsequently train on L1-specific data using a Kullback-Leibler divergence regularized objective function in order to preserve the generalization of the model. We incorporate this adapted NNJM as a feature in an SMT-based English GEC system and show that adaptation achieves significant F0.5 score gains on English texts written by L1 Chinese, Russian, and Spanish writers.


Introduction
Grammatical error correction (GEC) deals with the automatic correction of errors (spelling, grammar, and collocation errors), particularly in non-native written text. The native language (L1) background of the writer has a noticeable influence on the errors made in second language (L2) writing, and considering this factor can potentially improve the performance of GEC systems. For example, consider the following sentence written by a Finnish writer (Jarvis and Odlin, 2000): "When they had escaped in the police car they sat under the tree." The preposition 'in' appears to be grammatically correct. However, in the given context, the preposition 'from' is the correct choice in place of 'in'. Finnish learners of English tend to overgeneralize the use of the preposition 'in'. Knowledge of L1 makes this correction more probable whenever the preposition 'in' appears in texts written by Finnish writers. Similarly, Chinese learners of English tend to make frequent verb tense and verb form errors, since Chinese lacks verb inflection (Shaughnessy, 1977). The cross-linguistic influence of L1 on L2 writing is a highly complex phenomenon, and the errors made by learners cannot be directly attributed to the similarities or differences between the two languages. As Ortega (2009) points out, learners seem to operate on two complementary principles: "what works in L1 may work in L2 because human languages are fundamentally alike; but if it sounds too L1-like, it will probably not work in L2". In this paper, we follow a data-driven approach to model these influences and adapt GEC systems using L2 texts written by writers of the same L1 background.
The two most popular approaches to grammatical error correction are the classification approach (Rozovskaya et al., 2014) and the statistical machine translation (SMT) approach (Junczys-Dowmunt and Grundkiewicz, 2014). The SMT approach has emerged as a popular paradigm for GEC because of its ability to learn text transformations from ill-formed to well-formed text, enabling it to correct a wide variety of errors, including complex errors that are difficult to handle for the classification approach (Rozovskaya and Roth, 2016). The phrase-based SMT approach has been used in state-of-the-art GEC systems (Rozovskaya and Roth, 2016; Hoang et al., 2016). The SMT approach does not model error types specifically, nor does it require linguistic analysis such as parsing and part-of-speech (POS) tagging. We adopt a phrase-based SMT approach to GEC in this paper. Additionally, we implement and incorporate a neural network joint model (NNJM) (Devlin et al., 2014) as a feature in our SMT-based GEC system. An NNJM is easy to integrate into the SMT decoding framework because it uses a fixed-window context, and it has been shown to improve SMT-based GEC. We adapt the NNJM to L1-specific data (i.e., English text written by writers of a particular L1) and obtain significant improvements over the baseline, which uses an unadapted NNJM. Adaptation is done by taking the unadapted NNJM trained on general domain data (i.e., not L1-specific) using a log likelihood objective function with self-normalization (Devlin et al., 2014) as the starting point, and training for subsequent iterations on the smaller L1-specific in-domain data with a modified objective function that includes a Kullback-Leibler (KL) divergence regularization term. This modified objective function prevents overfitting on the smaller in-domain data and preserves the generalization capability of the NNJM. We show that this method of adaptation also works on very small amounts of high-quality L1-specific data (50-100 essays).
In summary, the two major contributions of this paper are as follows. (1) This is the first work that performs L1-based adaptation for GEC using the SMT approach and covering all error types. (2) We introduce a novel method of NNJM adaptation and demonstrate that this method can work with in-domain data that are much smaller than the general domain data.

Related Work
In the past decade, there has been increasing attention on GEC in English, mainly due to the growing number of English as second language (ESL) learners around the world. The popularity of this problem grew further through the Helping Our Own (HOO) shared tasks (Dale and Kilgarriff, 2011; Dale et al., 2012) and the CoNLL shared tasks. The majority of the published work on GEC aimed at building classifiers or rule-based systems for specific error types and combined them to build hybrid systems (Rozovskaya et al., 2014).
The cross-linguistic influences between L1 and L2 have mainly been used for the task of native language identification (Massung and Zhai, 2016). They have also been used in typology prediction (Berzak et al., 2014) and in predicting error distributions in ESL data (Berzak et al., 2015). L1-based adaptation has previously been shown to improve GEC for specific error types using the classification approach. Rozovskaya and Roth (2010) corrected preposition errors by restricting the candidate corrections to those observed in L1-specific data. They further added artificial training data that mimic the error frequency in L1-specific text to improve accuracy. In their later work, Rozovskaya and Roth (2011) focused on L1-based adaptation for preposition and article correction by modifying the prior probabilities in the naïve Bayes classifier at decision time based on L1-specific ESL learner text. Both approaches use native data for training, but rely on non-native L1-specific text to introduce artificial errors or to modify the prior probabilities. Dahlmeier and Ng (2011) implemented a system to correct collocation errors by adding paraphrases derived from L1 into the confusion set. Specifically, they use a bilingual L1-L2 corpus to obtain L2 paraphrases that are likely to be translated to the same phrase in L1. There is no prior work on L1-based adaptation for GEC using the machine translation approach, which is one of the most popular approaches for GEC.
With the availability of large-scale error corrected data (Mizumoto et al., 2011), the statistical machine translation (SMT) approach to GEC became popular and was employed in state-of-the-art GEC systems. A comparison of the classification approach and the machine translation approach can be found in Rozovskaya and Roth (2016), among others. Recently, an end-to-end neural machine translation framework was proposed for GEC (Yuan and Briscoe, 2016), which was shown to achieve competitive results. Neural network joint models have been shown to improve SMT-based GEC systems due to their ability to model words and phrases in a continuous space, their access to larger contexts from the source side, and their ability to learn non-linear mappings from input to output. In this paper, we exploit the advantages of the SMT approach and neural network joint models (NNJMs) by adapting an NNJM based on the L1 background of the writers and integrating it into the SMT framework. We perform KL divergence regularized adaptation to prevent overfitting on the smaller in-domain data. KL divergence regularization was previously used by Yu et al. (2013) for speaker adaptation. Joty et al. (2015) proposed another NNJM adaptation method, which uses a regularized objective function that encourages a network trained on general-domain data to be closer to an in-domain NNJM. Other adaptation techniques used in SMT include mixture modeling (Foster and Kuhn, 2007; Moore and Lewis, 2010; Sennrich, 2012) and alternative decoding paths (Koehn and Schroeder, 2007).

A Machine Translation Framework for Grammatical Error Correction
We formulate GEC as a translation task from a possibly erroneous input sentence to a corrected sentence. We use the popular phrase-based SMT system Moses, which employs a log linear model to find the best correction hypothesis T* given an input sentence S:

T^* = \arg\max_T \sum_i \mu_i f_i(T, S)

where \mu_i and f_i(T, S) are the i-th feature weight and feature function, respectively. We use the standard features in Moses, without any re-ordering models. The two main components of an SMT system are the translation model (TM) and the language model (LM). The TM (typically, a phrase table), responsible for generating hypotheses, is trained using parallel data, i.e., learner-written sentences (source data) and their corresponding corrected sentences (target data). It also scores the hypotheses using features like forward and inverse phrase translation probabilities and lexical weights. The LM is trained on well-formed text and ensures the fluency of the corrected output. The feature weights \mu_i are computed by minimum error rate training (MERT), optimizing the F0.5 measure (Junczys-Dowmunt and Grundkiewicz, 2014) on a development set. The F0.5 measure computed using the MaxMatch scorer is the standard evaluation metric for GEC used in the CoNLL-2014 shared task, weighting precision twice as much as recall. Apart from the TM and the n-gram LM, we add a neural network joint model (NNJM) (Devlin et al., 2014) as a feature, following prior work reporting that an NNJM improves the performance of a state-of-the-art SMT-based GEC system. Unlike recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, NNJMs have a feed-forward architecture that relies on a fixed context. This makes it easy to integrate NNJMs into a machine translation decoder as a feature.
The feature value is given by log P(T|S), the sum of the log probabilities of individual target words in the hypothesis T given their contexts:

\log P(T|S) = \sum_{i=1}^{|T|} \log P(t_i | h_i)

where |T| is the number of words in the target hypothesis (corrected sentence), t_i is the i-th target word, and h_i is the context of t_i. The context h_i consists of the n−1 previous target words and m source words surrounding the source word that is aligned to the target word t_i. Each dimension in the output layer of the neural network gives the probability of a word t in the output vocabulary given its context h:

P(t|h) = \frac{\exp(s(t, h))}{\sum_{t' \in V_o} \exp(s(t', h))}

where s(t, h) is the unnormalized output score before the softmax, and V_o is the output vocabulary.
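The feature computation above can be sketched as follows. This is a minimal illustration rather than our Theano implementation: `make_context` and the `log_prob` stand-in are hypothetical helpers, and the padding and alignment conventions shown are assumptions.

```python
import math

PAD = "<pad>"

def make_context(source, target, i, align, n=5, m=5):
    """Build the fixed-window context h_i for target word i: the n-1
    previous target words plus m source words centered on the source
    word aligned to target[i] (align[i] gives that source index).
    Out-of-range positions are padded."""
    prev = [PAD] * max(0, (n - 1) - i) + target[max(0, i - (n - 1)):i]
    a = align[i]
    window = [source[j] if 0 <= j < len(source) else PAD
              for j in range(a - m // 2, a - m // 2 + m)]
    return prev + window

def log_p_hypothesis(source, target, align, log_prob):
    """log P(T|S) = sum over target positions of log P(t_i | h_i).
    log_prob(t, h) is a stand-in for the trained NNJM's conditional."""
    return sum(log_prob(t, make_context(source, target, i, align))
               for i, t in enumerate(target))
```

With n = 5 and m = 5 every context has exactly (n−1) + m = 9 tokens, which is what makes the model easy to score inside a phrase-based decoder.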
The neural network parameters, which include the weights, biases, and embedding matrices, are trained using backpropagation with stochastic gradient descent (LeCun et al., 1998). Instead of using the noise contrastive estimation (NCE) loss as in prior work, we use the log likelihood objective function with a self-normalization term, similar to Devlin et al. (2014):

L = \sum_{i=1}^{N} \left[ \log P(t_i | h_i) - \alpha (\log Z(h_i))^2 \right]

where N is the number of training instances, t_i is the target word in the i-th training instance, and Z(h) = \sum_{t' \in V_o} \exp(s(t', h)) is the softmax normalization term. Each training instance consists of a target word t and its context h. \alpha is the self-normalization coefficient, which we set to 0.1, following Devlin et al. (2014). The training can be done efficiently on GPUs. We adapt this NNJM using L1-specific learner text with a Kullback-Leibler divergence regularized objective function, as described in Section 4.
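A sketch of this self-normalized objective, written as a loss to minimize; `scores_fn` is a hypothetical stand-in for the network's output layer, which returns the unnormalized scores s(t, h) for every vocabulary word.

```python
import math

def self_normalized_loss(batch, scores_fn, alpha=0.1):
    """Average negative log likelihood with a self-normalization
    penalty (Devlin et al., 2014). batch: list of (target_index,
    context) training instances."""
    total = 0.0
    for t_idx, h in batch:
        s = scores_fn(h)
        log_z = math.log(sum(math.exp(v) for v in s))  # log Z(h)
        log_p = s[t_idx] - log_z                       # log P(t|h)
        total += log_p - alpha * log_z ** 2            # objective term
    return -total / len(batch)
```

The alpha term drives log Z(h) toward 0, so at decoding time the raw score s(t, h) can be used directly as an approximate log probability without summing over the vocabulary.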

KL Divergence Regularized Adaptation
We first train an NNJM on the general-domain data (the erroneous and corrected texts, not considering the L1 of the writers) as described in the previous section. Let p_GD(y|h) be the probability distribution estimated by the general-domain NNJM. Starting from this NNJM, subsequent iterations of training are done using the L1-specific in-domain data alone. The in-domain data consist of the erroneous texts written by writers of a specific L1 and their corresponding corrected texts. This adaptive training uses a modified objective function with a regularization term K, which minimizes the KL divergence between p_GD(y|h) and the network's output probability distribution p(y|h) (Yu et al., 2013):

K = \sum_{i=1}^{N} \sum_{y \in V_o} p_{GD}(y | h_i) \log p(y | h_i)

The term K prevents the estimated probability distribution from deviating too much from that of the general domain NNJM during training. Our final objective function for the adaptation step is a linear combination of the terms in L and K, with a regularization weight \lambda and a self-normalization coefficient \alpha:

L_{adapt} = \sum_{i=1}^{N} \left[ (1 - \lambda) \log p(t_i | h_i) + \lambda \sum_{y \in V_o} p_{GD}(y | h_i) \log p(y | h_i) - \alpha (\log Z(h_i))^2 \right]

We integrate the unadapted NNJM and the adapted NNJM independently into our SMT-based GEC system in order to compare the effect of adaptation.
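The adaptation objective can be sketched as follows; `scores_fn` and `p_gd` are hypothetical stand-ins for the network being adapted and the frozen general-domain NNJM. Note that \sum_y p_GD(y|h) log p(y|h) differs from −KL(p_GD || p) only by the entropy of p_GD, which is constant during training, so maximizing this term keeps p close to p_GD.

```python
import math

def kl_adapt_loss(batch, scores_fn, p_gd, lam=0.5, alpha=0.1):
    """Sketch of the KL divergence regularized adaptation objective,
    negated for minimization. p_gd(h) returns the general-domain
    NNJM's full distribution over the vocabulary for context h."""
    total = 0.0
    for t_idx, h in batch:
        s = scores_fn(h)
        log_z = math.log(sum(math.exp(v) for v in s))
        log_p = [v - log_z for v in s]            # log p(y|h) for all y
        likelihood = (1.0 - lam) * log_p[t_idx]   # in-domain data term
        closeness = lam * sum(q * lp for q, lp in zip(p_gd(h), log_p))
        total += likelihood + closeness - alpha * log_z ** 2
    return -total / len(batch)
```

Equivalently (Yu et al., 2013), this is cross-entropy training against the interpolated target distribution (1 − λ) · one_hot(t_i) + λ · p_GD(·|h_i), which is why λ = 0 recovers unregularized fine-tuning and λ = 1 pins the model to the general-domain distribution.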

Other Adaptation Methods
We compare our method against two other adaptation methods previously used in SMT. Translation Model Interpolation: Following Sennrich (2012), we linearly interpolate the features in two phrase tables, one trained on in-domain data (L1-specific learner text) and the other on out-of-domain data. The interpolation weights are set by perplexity minimization using phrase pairs extracted from our in-domain development set. The lexical weights are re-computed from the lexical counts, and the interpolation weights are renormalized if a phrase pair exists in only one of the phrase tables.
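The interpolation step can be sketched as below. This is a simplification: real Moses phrase tables carry several probability features plus lexical weights (which Sennrich (2012) recomputes from interpolated lexical counts), whereas this sketch interpolates a single probability per phrase pair. When a pair occurs in only one table, renormalizing the weights over the tables that contain it reduces to keeping that table's probability unchanged.

```python
def interpolate_phrase_tables(in_dom, out_dom, w_in):
    """Linear interpolation of two phrase tables, each mapping a
    (source_phrase, target_phrase) pair to a translation probability.
    w_in is the in-domain interpolation weight."""
    w_out = 1.0 - w_in
    merged = {}
    for pair in set(in_dom) | set(out_dom):
        if pair in in_dom and pair in out_dom:
            merged[pair] = w_in * in_dom[pair] + w_out * out_dom[pair]
        else:
            # pair in only one table: renormalized weight is 1.0
            merged[pair] = in_dom.get(pair, out_dom.get(pair))
    return merged
```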
Neural Domain Adaptation Model: Joty et al. (2015) proposed an adaptation of the NNJM for SMT (NDAM). They first train an NNJM using in-domain data, and then perform regularized adaptation on the general domain data (the concatenation of in-domain and out-of-domain data), which restricts the model from drifting away from the in-domain NNJM. Specifically, they add a regularization term J to the objective function in their adaptive training step:

J = \sum_{i=1}^{N} p_{ID}(t_i | h_i) \log p(t_i | h_i)

where p_ID(y|h) is the probability distribution estimated by the in-domain NNJM. NDAM has the following drawbacks compared to our method: (1) Regularization is done using the probabilities of the target words alone and not over the entire probability distribution over all words, leading to weak regularization. (2) If the in-domain data is too small, the probability distribution learnt by the in-domain NNJM will be overfitted. Therefore, encouraging adaptation to be closer to this probability distribution may not yield a good model. Our method, on the other hand, can utilize in-domain data of very small sizes to fine-tune a general NNJM. (3) Their method requires retraining of the model on the complete training data in order to adapt to each domain. On the contrary, our method can adapt a single general model to different domains using small in-domain data, leading to a considerable reduction in training time.
We re-implement their method by incorporating this regularization term into the log likelihood objective function with self-normalization, L (Equation 2), during adaptive training.

Data and Evaluation
The training data consist of two corpora: the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) and the Lang-8 Learner Corpora v2 (Mizumoto et al., 2011). We extract texts written by learners who learn only English from Lang-8. A language identification tool, langid.py (Lui and Baldwin, 2011), is then used to obtain purely English sentences. In addition, we remove noisy source-target sentence pairs in Lang-8 where the ratio of the lengths of the source and target sentences is outside [0.5, 2.0], or their word overlap ratio is less than 0.2. A sentence pair where the source or target sentence has more than 80 words is also removed from both NUCLE and Lang-8. The statistics of the data after pre-processing are shown in Table 1. We obtain L1-specific in-domain data for adaptation based on the L1 information provided in Lang-8. Adaptation is performed on English texts written by learners of three different L1 backgrounds: Chinese, Russian, and Spanish. The statistics of the in-domain data from Lang-8 for each L1 are given in Table 2. For each L1, its out-of-domain data are obtained by excluding the L1-specific in-domain data (from Table 2) from the combined training data (CONCAT).
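The noise filter above can be sketched as a single predicate. One caveat: the exact definition of "word overlap ratio" is not spelled out in the text, so Jaccard overlap between the word sets is used here as an assumption; the length-ratio and overlap filters apply to Lang-8, while the 80-word cap applies to both corpora.

```python
def keep_pair(src, tgt, max_len=80, lo=0.5, hi=2.0, min_overlap=0.2):
    """Return True if a source-target sentence pair passes the filters:
    neither side longer than 80 words, source/target length ratio in
    [0.5, 2.0], and word overlap (Jaccard, an assumption) >= 0.2."""
    s, t = src.split(), tgt.split()
    if not s or not t or len(s) > max_len or len(t) > max_len:
        return False
    ratio = len(s) / len(t)
    if ratio < lo or ratio > hi:
        return False
    overlap = len(set(s) & set(t)) / len(set(s) | set(t))
    return overlap >= min_overlap
```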

We use the First Certificate in English (FCE) corpus for development and testing. The corpus identifies the L1 of each writer. We extract the scripts written by Chinese, Russian, and Spanish native writers. We split the data for each L1 into two roughly equal parts based on the number of scripts, of which one part is used as the development data and the other part is used as the test data. Splitting based on the number of scripts ensures that there is no overlap between the writers of the development and test data, as each script is written by a unique learner. The details of the FCE dataset corresponding to each L1 are given in Table 3. For evaluation, we use the F0.5 measure, computed by the M2 scorer v3.2, as our evaluation metric. The error annotations in FCE are converted to the format required by the M2 scorer. The statistics of error annotations after converting to this format are given in Table 3. To deal with the instability of parameter tuning in SMT, we perform five runs of tuning and calculate statistical significance by the stratified approximate randomization test, as recommended by Clark et al. (2011).

Baseline SMT-based GEC system
We use Moses (Version 3) to build all our SMT-based GEC systems. The phrase table of the baseline system (S CONCAT) is trained using the complete training data. We use two 5-gram language models (LMs) trained using KenLM (Heafield et al., 2013). One LM is trained on the English Wikipedia (about 1.78 billion tokens) and the other on the target side of the complete training data. The system is tuned using MERT, optimizing the F0.5 measure on the L1-specific development data in Table 2. For comparison, we show two other baselines, S IN and S OUT, where the phrase tables for each L1 are trained on the in-domain data only (Table 2) and the out-of-domain data only, respectively. The results of the above baseline GEC systems on the L1 Chinese, Russian, and Spanish FCE test data are summarized in Table 4. We enhance the baseline S CONCAT with an NNJM feature, as described in the following subsection.

NNJM Adaptation
We implement the NNJM in Python using the deep learning library Theano (Bergstra et al., 2010; http://deeplearning.net/software/theano) in order to use the massively parallel processing power of GPUs for training. We first train an NNJM (NNJM BASELINE) on the complete training data for 10 epochs. The source context window size is set to 5 and the target context window size is set to 4, making it a (5+5)-gram joint model. Training is done using stochastic gradient descent with a mini-batch size of 128 and a learning rate of 0.1. To speed up training and decoding, a single hidden layer neural network is used, with an input embedding dimension of 192 and 512 hidden units. We use a self-normalization coefficient of 0.1. We pick the 16,000 and 32,000 most frequent words on the source and target sides as our source context vocabulary and target context vocabulary, respectively. The output vocabulary is set to be the same as the target vocabulary. The vocabulary is selected from the complete training data, and not based on the L1-specific in-domain data. We add the self-normalized NNJM as a feature to our baseline GEC system S CONCAT to build a stronger baseline. This is referred to as S CONCAT + NNJM BASELINE in Table 4.
We perform adaptation on NNJM BASELINE by training for 10 additional epochs using the in-domain training data alone. We use the same hyperparameters, network structure, and vocabulary, but with the KL divergence regularized objective function (regularization weight λ = 0.5). We train an adapted NNJM (NNJM ADAPTED) specific to each L1. We integrate these into our baseline GEC system, and the adapted systems are referred to as S CONCAT + NNJM ADAPTED in Table 4. The results are averaged over five runs of tuning and evaluation. Our evaluation shows that each adapted system S CONCAT + NNJM ADAPTED significantly outperforms every baseline system (S IN, S OUT, S CONCAT, and S CONCAT + NNJM BASELINE) on all three L1s (p < 0.01).

Comparison to Other Adaptation Techniques
We compare our method to the two adaptation techniques described in Section 5: Translation Model Interpolation (TM INT) (Sennrich, 2012) and the Neural Domain Adaptation Model (NDAM) (Joty et al., 2015). The optimization of interpolation weights for TM INT is done using the L1-specific FCE development data. NDAM is trained on the complete training data (CONCAT) for 10 epochs, regularized using an in-domain NNJM that is also trained for 10 epochs on the L1-specific in-domain data from Lang-8. For NDAM, we use the same vocabulary and hyperparameters as our NNJMs.
The results are shown in the rows TM INT + NNJM BASELINE and S CONCAT + NDAM in Table 4. Our evaluation shows that for L1 Russian and L1 Spanish, our adapted system S CONCAT + NNJM ADAPTED significantly outperforms both TM INT + NNJM BASELINE and S CONCAT + NDAM (p < 0.01), but the improvement is not statistically significant for L1 Chinese.
Our evaluation also shows that the combination of TM INT and the adapted NNJM is similar (for L1 Chinese and Russian) or worse (for L1 Spanish) in performance compared to S CONCAT + NNJM ADAPTED. This is because NNJM ADAPTED is by itself a form of translation model adaptation (since it considers source- and target-side contexts), and hence using TM INT along with it does not bring in any new information and may even hurt performance when the in-domain data is very small (as in the case of Spanish).

Effect of Adaptation Data
We also perform adaptation on the L1-specific FCE development set in Table 3 (which is also our development data for the GEC systems), instead of the in-domain data from Lang-8 in Table 2. Our neural network overfits easily on the FCE development set due to its much smaller size. Hence, we perform adaptive training for only 2 epochs on top of NNJM BASELINE. The results are shown in the row S CONCAT + NNJM ADAPTED (FCE) in Table 4. Although the FCE development data is much smaller than the L1-specific in-domain data from Lang-8, we observe similar improvements on all three L1s. This is likely due to the similarity of the development and test sets, which are obtained from the same FCE corpus. These experiments show that smaller, high-quality L1-specific error-annotated data (1,000-2,000 sentences) similar to the target data can be used for adaptation, giving competitive results compared to using much larger in-domain data (20,000-250,000 sentences) from other sources.
We also perform experiments with smaller general domain data. To do this, we select a subset of CONCAT composed of the in-domain data of the three L1s along with 300,000 sentences randomly selected from the rest of CONCAT. This corpus is referred to as SMALL-CONCAT (623,717 sentences and 7,990,659 source tokens). We perform both KL divergence regularized NNJM adaptation (NNJM SMALL-ADAPTED) and Neural Domain Adaptation Model adaptation (Joty et al., 2015) (NDAM SMALL), and compare them to an NNJM trained on SMALL-CONCAT (NNJM SMALL-BASELINE). We use these NNJMs with our S CONCAT baseline. The results are summarized in Table 4. When the ratio between the in-domain data and the general domain data is high, neither adaptation method significantly improves over the unadapted NNJM. In the case of L1 Spanish, where the in-domain data is smaller, KL divergence regularized adaptation significantly outperforms both the unadapted NNJM and NDAM.

Effect of Regularization
To analyze the effect of regularization when smaller data are used, we vary the regularization weight λ in our objective function (Section 4). The results are shown in Figure 1. λ = 0 corresponds to no regularization: training depends completely on the in-domain data, apart from using the general NNJM as a starting point. On the other hand, setting λ = 1 forces the distribution learnt by the network to be similar to that of the unadapted model. We see that having no regularization (λ = 0) fails on all three L1s, due to overfitting on the smaller in-domain data. The effect of varying regularization is more pronounced on L1 Russian and Spanish, as the general NNJM has seen much less in-domain data for these L1s compared to L1 Chinese.

Evaluation on Benchmark Dataset
We also evaluate our system on the benchmark CoNLL-2014 shared task test set for GEC in English. The CoNLL-2014 test set consists of 1,312 sentences with annotations from two annotators. We also perform evaluation on the extension of the CoNLL-2014 test set (Bryant and Ng, 2015), which contains eight additional sets of annotations over the two sets of annotations provided in the original test set. Following the settings of the CoNLL-2014 shared task, we tune our unadapted baseline system and the L1-adapted systems on the CoNLL-2013 shared task test set consisting of 1,381 test sentences. The results are summarized in Table 5. We find that only the systems adapted to L1 Chinese improve over the unadapted baseline system (S CONCAT + NNJM BASELINE). When the smaller-sized, high-quality FCE data is used for adaptation, the margin of improvement is higher. This could be due to the large proportion of Chinese learner written text in the CoNLL-2014 test set, as the essays were written by students of the National University of Singapore, mostly native Chinese speakers. Adaptation to L1 Russian and Spanish does not help the system on the CoNLL-2014 test set. We also compare our baseline SMT-based system with other state-of-the-art GEC systems. Our baseline system achieves the best F0.5 score among systems using the SMT approach alone, making it a competitive SMT-based GEC baseline. Overall, Rozovskaya and Roth (2016) achieve the best F0.5 score (47.40) after adding classifier components, a spelling checker, and punctuation and capitalization correction components in a pipeline with their SMT-based system. However, their SMT-based system alone achieves an F0.5 score of only 39.48.

Discussion and Error Analysis
Our results show that L1-based adaptation of the NNJM using L1-specific in-domain data from Lang-8 significantly improves the F0.5 score of the GEC system on the three L1s by 1.27 (Chinese), 0.88 (Russian), and 0.64 (Spanish). We observe similar gains when the smaller in-domain development data from FCE is used for adaptation. These results show that adaptation based on L1 is beneficial for targeted error correction based on the native language of the writers. Our results also show that the proposed method of NNJM adaptation is scalable to different sizes of in-domain and general domain data and outperforms other adaptation methods such as phrase table interpolation (Sennrich, 2012) and the Neural Domain Adaptation Model (NDAM) (Joty et al., 2015). We perform error analysis on four error types which are difficult for non-native learners of English.
We compare the outputs produced by our adapted system (S CONCAT + NNJM ADAPTED) and the baseline (S CONCAT + NNJM BASELINE). We perform a per-error-type quantitative analysis by observing the difference in per-error-type F0.5 scores, averaged over five runs of tuning and evaluation, between the system and the baseline. Computing per-error-type F0.5 scores is difficult for SMT-based GEC systems, as the error types of the corrections proposed by an SMT-based GEC system cannot be determined trivially. To overcome this difficulty, we attempt to determine the error type of the proposed corrections by matching them to the available human annotations (the source/target phrase without the surrounding context) in the complete FCE dataset (1,244 scripts). We select those sentences from the test data where the error type of every correction proposed by the baseline and the system can be determined. This process selects 928, 1,102, and 2,179 sentences for L1 Chinese, Russian, and Spanish, respectively. The differences in per-error-type F0.5 scores between the system and the baseline are shown in Table 6. For Chinese, the largest gain in F0.5 is observed for determiner errors, which are frequent in our L1 Chinese FCE test set (10.02%). Moreover, we see that adaptation improves verb form errors by approximately 0.4% F0.5; verb form errors are the most frequent error type in our L1 Chinese test set (14.46%). The highest gain for L1 Russian also comes from determiner errors, the most frequent error type in our FCE test data for L1 Russian (13.55%). Similarly, the highest gain for L1 Spanish comes from preposition errors, the most frequent error type for L1 Spanish (12.51%).
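The matching step above can be sketched as follows. The function and data-structure names are illustrative: `annotations` is assumed to map context-free (source phrase, target phrase) pairs, collected from the full FCE annotations, to error-type labels.

```python
def assign_error_types(corrections, annotations):
    """Assign an error type to each proposed correction of a sentence
    by looking up the (source_phrase, target_phrase) pair, without
    surrounding context, in the human annotations. Returns the list
    of labels, or None if any correction cannot be matched, in which
    case the sentence is excluded from the per-error-type analysis."""
    labels = []
    for pair in corrections:
        label = annotations.get(pair)
        if label is None:
            return None
        labels.append(label)
    return labels
```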
From a practical standpoint, the adapted system can be used as an educational aid in English classes of local students in non-English-speaking countries, where all the students share the same L1 and their L1 is known in advance. The adapted system can give focused feedback to learners by correcting mistakes frequently made by learners having the same L1. Also, NNJM adaptation proposed in this paper can be done using a small number of essays (50-100 essays) in a relatively short time (20-30 minutes), making it easy to adapt GEC systems in practice.

Conclusion
In this paper, we perform NNJM adaptation using L1-specific learner text with a KL divergence regularized objective function. We integrate adaptation into an SMT-based GEC system. The systems with adapted NNJMs significantly outperform unadapted baselines. We also demonstrate the necessity of regularization when adapting on smaller in-domain data. Our method of adaptation performs better than other adaptation methods, especially when smaller in-domain data is used. Our results show that adapting GEC systems for learners of similar L1 background gives significant improvement and can be adopted in practice to improve GEC systems.