Improving Chinese Grammatical Error Correction with Corpus Augmentation and Hierarchical Phrase-based Statistical Machine Translation

In this study, we describe our system submitted to the 2nd Workshop on Natural Language Processing Techniques for Educational Applications (NLP-TEA-2) shared task on Chinese grammatical error diagnosis (CGED). We use a statistical machine translation method already applied to several similar tasks (Brockett et al., 2006; Chiu et al., 2013; Zhao et al., 2014). In this research, we examine corpus augmentation and explore alternative translation models including syntax-based and hierarchical phrase-based models. Finally, we show variations using different combinations of these factors.


Introduction
The concept of "translating" an erroneous sentence into a correct one was first explored by Brockett et al. (2006). They proposed a statistical machine translation (SMT) system with a noisy channel model to automatically correct erroneous sentences written by learners of English as a Second Language (ESL).
SMT toolkits have become increasingly popular for grammatical error correction. In the CoNLL-2014 shared task on English grammatical error correction (Ng et al., 2014), four of the 13 participating teams used a phrase-based SMT system. Grammatical error correction using a phrase-based SMT system can be improved by tuning on evaluation metrics such as F0.5 (Kunchukuttan et al., 2014; Wang et al., 2014) or even by a combination of different tuning algorithms (Junczys-Dowmunt and Grundkiewicz, 2014). In addition, SMT can be merged with other methods. For example, language model-based and rule-based methods can be integrated into a single sophisticated but effective system (Felice et al., 2014).
For Chinese, SMT has also been used to correct spelling errors (Chiu et al., 2013). Furthermore, as shown in NLP-TEA-1, an SMT system can be applied to Chinese grammatical error correction if a large-scale learner corpus is available (Zhao et al., 2014).
In this study, we extend our previous SMT-based system (Zhao et al., 2014) to the NLP-TEA-2 shared task on Chinese grammatical error diagnosis. The main contributions of this study are as follows:
• We investigate the hierarchical phrase-based model (Chiang et al., 2005) and determine that it yields higher recall, and thus a higher F score, than does the phrase-based model, at the cost of lower accuracy.
• We increase our Chinese learner corpus by web scraping (Yu et al., 2012; Cheng et al., 2014) and show that the greater the size of the learner corpus, the better the performance.
• We perform minimum error-rate training (Och, 2003) using several evaluation metrics and demonstrate that tuning improves the final F score.

Hierarchical phrase-based model
A hierarchical phrase-based model for SMT was first suggested by Chiang et al. (2005). The system first obtains word alignments; then, instead of extracting phrase alignments, it extracts rules in the form of synchronous context-free grammar (SCFG) rules. In a Chinese error correction task, such error-correction rules are extracted as follows:

The symbols $X$ and $X_i$ here are non-terminals and represent all possible phrases. In addition, glue rules are used to combine a sequence of $X$s to form an $S$. The glue rules are given as:

$$S \rightarrow (X_1, X_1)$$
$$S \rightarrow (S_1 X_2, S_1 X_2)$$

A complete derivation of this simple example can then be written:

To determine the weight of a derivation, this model utilizes features such as generation probability, lexical weights, and phrase penalty. In addition, to avoid too many distinct yet similar translations, rules are constrained by certain filters that, for example, limit the length of the initial phrase or the number of non-terminals per rule.

Lang-8 Learner Corpus
The Lang-8 Chinese Learner Corpus was built by extracting error-correct sentence pairs from the Internet (Mizumoto et al., 2011; Zhao et al., 2014). We used it as the training corpus for our SMT-based grammatical error diagnosis system in NLP-TEA-1.
However, after we analyzed the word-level edit distance (ED) between error-correct sentence pairs, we found that the corpus as it stands may not be suitable for training our translation model. As Figure 1 shows, most pairs in the NLP-TEA-2 training data have an ED from 1 to 3, whereas the Lang-8 Chinese Corpus contains many pairs with an ED greater than 4. This is reasonable because the NLP-TEA-2 training data are extracted from essays written by high-level Chinese learners and, in most cases, these learners produce only one- or two-word mistakes. By contrast, Lang-8 is a language exchange social networking website where sentences are written by language learners of any level. If we use this corpus as it currently exists, sentence pairs with too large an ED may confuse the SMT system. Therefore, we cleaned the Lang-8 Chinese Learner Corpus by randomly sampling sentence pairs whose ED is between 4 and 8 and deleting sentence pairs whose ED is greater than 8. This ensures that it has an ED distribution similar to that of the NLP-TEA-2 training data. After cleaning, the number of sentences in the corpus decreased from 95,000 to approximately 58,000. The ED distribution of the Lang-8 Chinese Learner Corpus shown in Figure 1 is the one prior to cleaning.
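The cleaning step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the `keep_rate` parameter are hypothetical, and the actual sampling rate used for pairs with an ED between 4 and 8 is not reported in this paper.

```python
import random


def word_edit_distance(src, tgt):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(tgt) + 1))
    for i, s in enumerate(src, start=1):
        curr = [i]
        for j, t in enumerate(tgt, start=1):
            cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1,          # delete a source word
                            curr[j - 1] + 1,      # insert a target word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def clean_corpus(pairs, keep_rate=0.5, seed=0):
    """Filter (erroneous, corrected) token-list pairs by edit distance.

    ED > 8       : always dropped.
    4 <= ED <= 8 : kept with probability keep_rate (assumed value).
    ED < 4       : always kept.
    """
    rng = random.Random(seed)
    kept = []
    for src, tgt in pairs:
        ed = word_edit_distance(src, tgt)
        if ed > 8:
            continue
        if ed >= 4 and rng.random() >= keep_rate:
            continue
        kept.append((src, tgt))
    return kept
```

Both sentences of a pair are assumed to be word-segmented already, so the distance counts words rather than characters.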

HSK Dynamic Essay Corpus
In this shared task, we augment the Chinese learner corpus with another learner corpus extracted from the Internet (Yu et al., 2012; Cheng et al., 2014). The HSK Dynamic Essay Corpus (http://nlp.blcu.edu.cn/online-systems/hsk-language-libindexing-system.html) is one such corpus, built by Beijing Language and Culture University. In this corpus, approximately 11,000 essays are collected from HSK Chinese tests taken by foreign Chinese language learners, and erroneous sentences are annotated with special marks.
For example: 这就{CQ 要}由有关部门和政策管理制度来控制。
However, separating an erroneous sentence and its corresponding corrected sentence from an annotated sentence such as the one above is not easy because the annotation does not give the position information of reordering errors. Moreover, such separation is also difficult for more complex errors, for example, a "ba (把)" error (a special active-voice construction in Chinese) or a "bei (被)" error (a special passive-voice construction in Chinese), if we depend only on such marks.
Thus, we extracted only sentences having insertion, deletion, or replacement errors. We also cleaned the HSK corpus by deleting sentence pairs whose ED is too large, as described above. As a result, the corpus now contains approximately 59,000 sentences. The distribution of ED in the combined corpus is shown in Figure 2.
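As a rough illustration of how an error-correct pair might be recovered from an annotated sentence, the sketch below handles only the `{CQ …}` mark that appears in the example above, which we read as a word missing from the learner's sentence. The function name `extract_pair` is hypothetical, and sentences carrying any other mark are simply skipped rather than guessed at.

```python
import re

# Mark for a missing word, as in the example "这就{CQ 要}由...".
MISSING = re.compile(r"\{CQ\s*([^{}]*)\}")
# Any brace that does not open a CQ mark (unsupported in this sketch).
OTHER_MARK = re.compile(r"\{(?!CQ)")


def extract_pair(annotated):
    """Return (erroneous, corrected) for a CQ-only annotated sentence.

    Returns None when the sentence contains marks this sketch does not
    handle, so that such sentences can be skipped.
    """
    if OTHER_MARK.search(annotated):
        return None
    erroneous = MISSING.sub("", annotated)      # learner omitted the word
    corrected = MISSING.sub(r"\1", annotated)   # correction restores it
    return erroneous, corrected
```

Applied to the example sentence, the erroneous side drops 要 and the corrected side keeps it.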

Tuning
As previously described, an SMT system with tuning has been shown to perform better than one without tuning. Because this shared task uses several evaluation metrics such as accuracy, F1 score, and false positive (FP) rate, we tune our system on a linear combination of these metrics with minimum error rate training (MERT) (Och, 2003) at the identification level. The shared task defines three levels. Detection level: all error types are regarded as incorrect. Identification level: all error types should be clearly identified, i.e., Redundant, Missing, Disorder, and Selection. Position level: the system results should be perfectly identical to the quadruples of the gold standard. We also tried to tune at the position level, but we omit these results because this attempt mostly failed.

Our linear evaluation score is computed as follows:

$$\mathrm{Score} = \alpha \cdot \mathrm{Accuracy} + \beta \cdot F_{0.5} + \gamma \cdot (1 - \mathrm{FP\_rate}), \qquad \alpha + \beta + \gamma = 1.0$$
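A minimal sketch of this score, computed from confusion-matrix counts, is given below. The function names are hypothetical, and the sketch assumes the usual definitions: FP rate = FP / (FP + TN) and F_β = (1 + β²)·P·R / (β²·P + R).

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def tuning_score(tp, fp, tn, fn, alpha, beta, gamma):
    """Linear tuning objective: alpha*Accuracy + beta*F0.5 + gamma*(1 - FP_rate).

    Note that the weight named beta here is the mixing weight of the
    objective, not the 0.5 inside F0.5.
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1.0")
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    f05 = f_beta(precision, recall, 0.5)
    return alpha * accuracy + beta * f05 + gamma * (1.0 - fp_rate)
```

In an actual MERT run, such a function would play the role of the metric that the tuner maximizes on the development set.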
We conducted a series of preliminary experiments to find the most effective set of parameters. We followed Kunchukuttan et al. (2014) and Wang et al. (2014) in using F0.5 instead of F1. In other words, we wanted our system to favor precision because, as Ng et al. (2014) state for CoNLL-2014, "it is important for a grammar checker that its proposed corrections are highly accurate in order to gain user acceptance." However, we discovered that even with the parameter set α=0.0, β=1.0, and γ=0.0, we still failed to reach a satisfactory correction rate.
Finally, we used α=0.5, β=0.0, and γ=0.5 as the final parameter set for the phrase-based and hierarchical phrase-based systems because it produced the greatest number of corrections at the identification level in our in-house experiments. In addition, our in-house experiments revealed that an improper parameter set could produce a result that looks reasonable in terms of the evaluation score but is unacceptable in practice. We discuss this aspect with reference to an experiment on a syntax-based system in the next section.

Official Runs
We followed the WAT2015 baseline system to build phrase-based and hierarchical phrase-based SMT systems. This involves segmenting words using the Stanford Word Segmenter version 2014-01-04, running GIZA++ v1.07 on the training corpus in both directions, and parsing Chinese sentences with the Berkeley parser (for Java 1.7). We ran Moses v2.11 for decoding using the same parameters as the WAT2015 baseline. We trained two hierarchical phrase-based systems on corpora of different sizes, depending on whether the HSK corpus was included. For error classification, we followed Zhao et al. (2014) to identify error types and locate the positions of errors.
All three runs we submitted are shown in Table 1. In addition, the results of our runs at position level are shown in Table 2. RUN3 produced more corrections and obtained a higher F1 score at position level than did the other runs. However, it is inferior in terms of accuracy and FP rate compared to RUN2. At position level, the phrase-based system generated only 15 correct predictions and among them only one Disorder and no Selection types appeared. By contrast, the hierarchical system performed much better, as it successfully predicted seven Disorder and five Selection types. In addition, it produced more correct predictions on Missing and Redundant types.

[Figure 2: Distribution of edit distance in the combined corpus (x-axis: edit distance; y-axis: percent).]

Hierarchical Phrase-based Model
We provide an example from the official test set to explain why the hierarchical phrase-based system appears to be more effective than the phrase-based one. The following Chinese sentence is used:

B1-1033: 其中有一个人丢护照了。 (One of them lost his passport.)

In a hierarchical phrase-based system, according to the synchronous CFG rules, the partial derivation of the phrase "丢 护照 了 (lost his passport)" is:

$$(X, X) \rightarrow (\text{丢}\ X_1,\ \text{丢}\ X_1) \rightarrow (\text{丢}\ X_2\ \text{了},\ \text{丢 了}\ X_2) \rightarrow (\text{丢 护照 了},\ \text{丢 了 护照})$$

where $X$ denotes any phrase. Because "了 X" wrongly written as "X 了" is a typical Disorder error in Chinese sentences, the hierarchical phrase-based system extracts the rule X→(X 了, 了 X) and weights it highly when training on the corpus. In effect, the model captures syntactic errors in sentences. By contrast, the phrase-based system lacks the ability to identify syntactic errors. Therefore, the phrase-based model is less effective than the hierarchical phrase-based system, as it failed to select a correct translation such as "丢 了 X."
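To make the reordering step concrete, the toy sketch below applies a single synchronous rule with one gap to a segmented sentence. It is not how a decoder such as Moses works internally (a real decoder scores many competing derivations); the function `apply_rule` and the segmentation of the example sentence are assumptions made for illustration only.

```python
def apply_rule(tokens, source, target, gap="X"):
    """Rewrite the first span matching `source` according to `target`.

    `source` and `target` are token lists sharing exactly one gap symbol,
    e.g. source = ["丢", "X", "了"], target = ["丢", "了", "X"].
    The gap must cover at least one token. If nothing matches, a copy of
    the input is returned unchanged.
    """
    k = source.index(gap)
    left, right = source[:k], source[k + 1:]
    for i in range(len(tokens)):
        if tokens[i:i + len(left)] != left:
            continue
        gap_start = i + len(left)
        for j in range(gap_start + 1, len(tokens) + 1):
            if tokens[j:j + len(right)] == right:
                filler = tokens[gap_start:j]
                rewritten = []
                for sym in target:
                    rewritten.extend(filler if sym == gap else [sym])
                return tokens[:i] + rewritten + tokens[j + len(right):]
    return list(tokens)


sentence = "其中 有 一 个 人 丢 护照 了 。".split()
corrected = apply_rule(sentence, ["丢", "X", "了"], ["丢", "了", "X"])
print(" ".join(corrected))
```

Because the gap can stand for a phrase of any length, the same rule also covers longer objects between 丢 and 了, which a fixed phrase pair cannot do.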

Corpus Augmentation
According to the results shown in Table 4, expanding the corpus has a beneficial effect. In RUN1, the F1 score of 0.024 means that the system produced almost no correction predictions. However, after we increased the corpus size, the F1 score increased to 0.10. The improvement in F1 score with corpus augmentation is illustrated in Figure 3. In terms of F1 score, our RUN3 ranks exactly in the middle of the 15 runs submitted by all teams.

Tuning
To determine the effect of tuning on the two systems, we ran a test on the NLP-TEA-1 training set offered by the organizers. Table 3 shows a contrast between the tuned and untuned systems. As with the English grammatical error correction task, MERT clearly boosts the F1 score in this task. We tuned the system using the Z-MERT toolkit (Zaidan, 2009).

To compare different syntax-based systems, we also developed a string-to-tree (s2t) SMT system. However, in our attempt to tune it, we failed to obtain a good set of parameters. We first tried the parameter set (0.5, 0.0, 0.5), which performs most effectively with the phrase-based model. However, it failed to improve the F1 score, as shown in Table 4.

Table 4: Tuning result that suits the evaluation score but is unacceptable because of its low precision and recall.

The system is clearly optimized to achieve the best performance in terms of FP rate and accuracy. However, as our experiments showed, this is because the system produces almost exclusively negative predictions: increasing the number of true negatives improves both the accuracy and the FP rate, but it leads to low precision and recall. We determined that α=0.5, β=0.0, γ=0.5 may not be a "good" parameter set in this situation, even though it seemed acceptable in a preliminary experiment. Unfortunately, we did not identify any parameter set that generates more acceptable results than the s2t system without tuning.
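The following sketch illustrates why the objective with α=0.5, β=0.0, γ=0.5 can reward a system that predicts almost nothing. The counts are invented for illustration (a balanced test set of 500 correct and 500 erroneous sentences) and are not taken from our experiments; the function name is hypothetical, and the FP rate is assumed to be FP / (FP + TN).

```python
def acc_fpr_objective(tp, fp, tn, fn, alpha=0.5, gamma=0.5):
    """Objective with the F-score weight set to zero:
    alpha*Accuracy + gamma*(1 - FP_rate)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return alpha * accuracy + gamma * (1.0 - fp_rate)


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical system A: flags nothing (all predictions are negative).
all_negative = dict(tp=0, fp=0, tn=500, fn=500)
# Hypothetical system B: finds 200 real errors but raises 200 false alarms.
active = dict(tp=200, fp=200, tn=300, fn=300)

score_a = acc_fpr_objective(**all_negative)
score_b = acc_fpr_objective(**active)
print(score_a, recall(all_negative["tp"], all_negative["fn"]))
print(score_b, recall(active["tp"], active["fn"]))
```

Under these invented counts both systems have the same accuracy, yet the all-negative system obtains the higher objective value while detecting no errors at all, which is the kind of degenerate optimum described above.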

Conclusion
We have described the Chinese grammatical error correction system based on SMT of the TMU-NLP team. First, we examined hierarchical phrase-based and string-to-tree translation models of SMT on CGED. Second, we constructed an error-correction parallel corpus based on the HSK Dynamic Essay Corpus, which is nearly equal in size to the Lang-8 Chinese Learner Corpus. We then cleaned and combined the two into a single expanded corpus. Third, we tuned the system with a linear combination of evaluation metrics using MERT. Finally, we showed that the augmented corpus considerably improved performance. In addition, the hierarchical phrase-based translation model generated a higher F1 score than did the phrase-based model.
For future research, we will attempt to expand the corpus further. A possible direction for building a large-scale parallel corpus is to introduce errors artificially into correct sentences. This has already been applied to an English error correction task by Yuan and Felice (2013). In addition, we confirmed that our system produces correct predictions in the generated N-best output. However, these oracle predictions were not selected during decoding. To solve this, we will employ a much more powerful language model, such as the Google n-gram model, as well as a re-ranking approach on the N-best output.