A Hybrid System for Chinese Grammatical Error Diagnosis and Correction

This paper introduces the DM_NLP team's system for the NLPTEA 2018 shared task of Chinese Grammatical Error Diagnosis (CGED), which detects and corrects grammatical errors in texts written by Chinese as a Foreign Language (CFL) learners. The task aims not only at detecting four types of grammatical errors, namely redundant words (R), missing words (M), bad word selection (S), and disordered words (W), but also at recommending corrections for errors of the M and S types. We propose a hybrid system of four models operating in two stages: a detection stage and a correction stage. In the detection stage, we first used a BiLSTM-CRF model, together with some handcrafted features, to tag potential errors by sequence labeling. We then designed three Grammatical Error Correction (GEC) models whose outputs help tune the detection results. In the correction stage, candidates generated by the three GEC models are merged to produce the final corrections for the M and S types. Our system reached the highest precision in the correction subtask, the most challenging part of this shared task, and ranked in the top 3 on F1 scores for error position detection.


Introduction
More and more people are learning a second or third language out of interest, as a career advantage, or even as a personal challenge. Chinese is one of the oldest and most versatile languages in the world, and many people choose to learn it, so the number of CFL learners is growing rapidly. However, Chinese can be difficult to learn, because it differs from other languages in many ways. For example, Chinese has neither singular/plural inflection nor verb tense; its expressions are quite flexible and its structural grammar is loose. These traits cause a lot of trouble for CFL learners, so the demand for Chinese Grammatical Error Diagnosis (CGED) as well as Correction (CGEC) is growing rapidly. GEC for English has been studied for many years, with shared tasks such as CoNLL-2013 (Ng et al., 2013) and CoNLL-2014 (Ng et al., 2014), while such studies on Chinese are still scarce. The CGED shared task (Gaoqi et al., 2017; Lee et al., 2016, 2015; Yu et al., 2014) gives researchers an opportunity to build systems and exchange ideas in this field, helping the community flourish to the benefit of all CFL learners. Compared with previous years, this year's NLPTEA CGED shared task requires participants to generate candidate corrections for errors of the M and S types. This correction subtask is more challenging and valuable, so we focused on it and achieved the highest precision in this subtask.

[* Equal contribution. † This work was done while the author was at Alibaba Group.]
This paper is organized as follows: Section 2 describes related work on both English and Chinese GEC. The dataset is described in Section 3. Section 4 illustrates our two-stage hybrid system and its four models. Section 5 presents the evaluation and discussion of the hybrid model. Section 6 concludes the paper and discusses future work.

Related Work
Earlier attempts at GEC involve rule-based models (Heidorn et al., 1982; Bustamante and León, 1996) and classifier-based approaches (Han et al., 2004; Rozovskaya and Roth, 2011), which can cope with only specific types of errors.
As a sentence may contain multiple errors of different types, a practical GEC system should be able to cope with most of them, which is difficult to achieve with rule-based or classifier models alone. The combination of rule-based and classifier models (Rozovskaya et al., 2013) can correct multiple errors, but only when the errors are independent of each other; it cannot solve the problem of dependent errors.
To address more complex errors, machine translation (MT) models have been proposed and developed by many researchers. Statistical Machine Translation (SMT) was dominant for the past two decades. Brockett et al. (2006) propose an SMT model for GEC, and round-trip translation was later also used for GEC (Madnani et al., 2012). A POS-factored SMT system was proposed (Yuan and Felice, 2013) to correct five types of errors in text. Felice et al. (2014) propose a pipeline of a rule-based system and a phrase-based SMT system augmented by a sizeable web-based language model. The word-level Levenshtein distance between source and target can be used as a translation model feature (Junczys-Dowmunt and Grundkiewicz, 2014) to enhance the model. A rule-based method and an n-gram statistical method were combined (Wu et al., 2015) into a hybrid system for a CGED shared task. Recently, Napoles and Callison-Burch (2017) proposed a lightweight approach to GEC called Specialized Machine Translation for Error Correction.
More recently, Neural Machine Translation (NMT) systems have achieved substantial improvements in this field (Sutskever et al., 2014; Bahdanau et al., 2014). Inspired by this progress, Sun et al. (2015) utilize a Convolutional Neural Network (CNN) for article error correction. A Recurrent Neural Network (RNN) has also been used (Yuan and Briscoe, 2016) to map sentences from the learner space to the expert space. Recently, Ji et al. (2017) proposed a hybrid neural model with nested attention layers for GEC.

Dataset Description
The dataset is provided by the shared task on CGED at the 5th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2018). The NLPTEA CGED task has been held since 2014 and has produced several sets of training data for this field.
Each instance in the CGED training dataset is composed of an original sentence with a unique sentence number 'sid', some 'target edits', and a corrected sentence. The original sentences are Chinese sentences written by CFL learners that contain grammatical errors. All errors are divided into four types: redundant words (denoted R), missing words (M), word selection errors (S), and word ordering errors (W). Some typical examples are shown in Table 1.
Each edit in the 'target edits' indicates an error type and the position at which it occurs in the original sentence. If an input sentence contains one or more grammatical errors, the 'target edits' will include multiple items, each of the form [start-off, end-off, error-type], where start-off and end-off respectively denote the starting and ending positions of the grammatical error, and error-type is one of R, M, S, and W. For each original sentence in the test dataset, the developed system should predict the 'target edits' in the same format as the training set, and for errors of type S and M it should also predict candidate corrections.
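As an illustration of this editing format, here is a minimal Python sketch. The `Edit` tuple mirrors the [start-off, end-off, error-type] triple from the task definition, while the semicolon-separated textual form parsed below is our own assumption for illustration, not the official file format.

```python
from typing import List, NamedTuple

class Edit(NamedTuple):
    start: int   # start-off: position of the first erroneous character
    end: int     # end-off: position of the last erroneous character
    err: str     # one of "R", "M", "S", "W"

def parse_edits(raw: str) -> List[Edit]:
    """Parse a flat 'target edits' string such as '4, 5, S; 8, 8, M'.

    The semicolon/comma layout is a hypothetical serialization; only the
    [start-off, end-off, error-type] triple comes from the task definition.
    """
    edits = []
    for part in raw.split(";"):
        s, e, t = (x.strip() for x in part.split(","))
        edits.append(Edit(int(s), int(e), t))
    return edits

print(parse_edits("4, 5, S; 8, 8, M"))
```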
We also used an external dataset, Lang-8, to train our GEC models. It contains more than 700,000 items, each consisting of an original sentence and one or more corresponding corrected sentences.

System Description
We proposed a hybrid system for this year's CGED shared task, which contains two stages: the detection stage and the correction stage. In the detection stage, given a sentence s_i composed of characters [c_1, c_2, ..., c_n], our system generates an edit set E_i containing one or more errors of this sentence in the form [sid, start, end, err], where start and end denote that the span [c_start, ..., c_end] of this sentence has an error of type err. Then, in the correction stage, for err ∈ {M, S}, our system generates candidate corrections for [c_start, ..., c_end]. If err is M, c_start must equal c_end, and the correction will be inserted at this position. The whole pipeline of our hybrid system is shown in Figure 1.
Our system consists of four models: the BiLSTM-CRF model, which tags possible errors by sequence labeling in the detection stage, and three GEC models, which convert Chinese sentences from the 'learner space' to the 'expert space'. The GEC models not only generate candidate corrections for M and S errors in the correction stage, but also help the BiLSTM-CRF model tag possible error positions in the detection stage. The three GEC models are a rule-based model, an NMT model, and an SMT model, which are able to cope with different types of grammatical errors.

BiLSTM-CRF
In the detection stage, we treated error detection as a sequence labeling problem and utilized the BiLSTM-CRF model to produce the corresponding label sequence in BIO encoding (Kim et al., 2004). More specifically, given an input sentence composed of characters [c_1, c_2, ..., c_n], the model predicts the label L_i of c_i for i ∈ {1, 2, ..., n}. Since prior knowledge can be exploited in this task, we incorporated many additional features into the sequence labeling, including character bigrams, part-of-speech (POS) tags, POS scores, Adjacent Word Collocation (AWC), and Dependent Word Collocation (DWC), as used in (Xie et al., 2017).
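The BIO labels produced by the tagger must eventually be turned into [start, end, err] spans. A minimal sketch of that conversion follows; the 1-based offsets are our assumption, and the handling of malformed I- labels is simplified:

```python
def bio_to_edits(labels):
    """Convert a BIO label sequence (e.g. 'B-S', 'I-S', 'O') into
    (start, end, err) spans with 1-based character offsets."""
    edits, start, err = [], None, None
    for i, lab in enumerate(labels, 1):
        if lab.startswith("B-"):
            if start is not None:          # close the previous span
                edits.append((start, i - 1, err))
            start, err = i, lab[2:]
        elif lab.startswith("I-") and start is not None and lab[2:] == err:
            continue                       # span keeps growing
        else:
            if start is not None:          # 'O' or inconsistent label ends span
                edits.append((start, i - 1, err))
            start, err = None, None
    if start is not None:                  # span running to end of sentence
        edits.append((start, len(labels), err))
    return edits

print(bio_to_edits(["O", "B-S", "I-S", "O", "B-R"]))
```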

Rule-based Model
The rule-based model starts by segmenting the Chinese characters into chunks, which incorporates useful prior grammatical information to identify possible out-of-vocabulary errors. The segments are looked up in a dictionary built from Gigawords (Graff and Chen, 2005), and if a segment is out of vocabulary, it goes through the following steps:

1. If the segment consists of two or more characters and turns out to be in the dictionary after permuting its characters, it is added to the candidate list.

2. Keys with the same or similar Pinyin (the Romanization system for Standard Chinese) or similar strokes to the segment are generated. The generated keys for the segment itself, concatenated with those of the previous or next segments, are added to the candidate list of possible corrections.
After these steps, the candidate list of all possible corrections is processed with a language model to identify whether there might be an out-of-vocabulary error and to estimate its probability. The negative log-likelihood of a size-5 sliding window indicates whether the top-scored candidate should replace the original segment.
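The candidate-ranking step can be sketched with a toy language model. The add-one-smoothed character bigram LM below merely stands in for the paper's Gigaword-trained model, and the sliding window is simplified to a single left/right context string:

```python
import math
from collections import Counter

class BigramLM:
    """Toy add-one-smoothed character bigram LM (illustrative stand-in
    for the Gigaword-trained language model in the paper)."""
    def __init__(self, corpus):
        self.uni = Counter()
        self.bi = Counter()
        for sent in corpus:
            chars = ["<s>"] + list(sent)
            self.uni.update(chars)
            self.bi.update(zip(chars, chars[1:]))
        self.V = len(self.uni)

    def nll(self, text):
        """Negative log-likelihood of a text window under the bigram LM."""
        chars = ["<s>"] + list(text)
        return -sum(math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.V))
                    for a, b in zip(chars, chars[1:]))

def best_candidate(lm, left, candidates, right):
    """Score each candidate inside its surrounding context window and
    return the one with the lowest negative log-likelihood."""
    return min(candidates, key=lambda c: lm.nll(left + c + right))

lm = BigramLM(["今天天气很好", "今天天气不错"])
print(best_candidate(lm, "今天", ["天气", "天汽"], "很好"))
```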

NMT GEC Model
The NMT model can capture complex relationships between the original and corrected sentences in GEC. We used the encoder-decoder structure (Bahdanau et al., 2014) with the general attention mechanism (Luong et al., 2015), with a two-layer LSTM for both encoder and decoder. To enhance the ability of the NMT models, we trained four of them with different parallel data pairs and configurations, as described in Section 5.1. These four NMT models are denoted N_j, where j ∈ {1, 2, 3, 4} is the model index, and the correction of sentence s_i generated by N_j is denoted C_iN_j. We used character-based NMT because most Chinese characters carry meaning of their own, quite unlike English letters, and a Chinese word's meaning often depends on the meanings of its characters. For example, the word 昨天 (yesterday) can be split as [yester] + [day]: the second character 天 means day, and the first is not a word when taken alone, but it is distinctive enough to give the whole word its meaning. Moreover, the errors in original sentences make word-based tokenization worse, introducing a larger and lower-quality vocabulary. We therefore chose character-based NMT for the CGEC problem.
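Character-level tokenization itself is trivial, which is part of its appeal: every character becomes its own token, so a learner misspelling such as 高心 (for 高兴, "happy", our illustrative example) does not create a new out-of-vocabulary unit the way a word segmenter would.

```python
def char_tokenize(sentence):
    """Character-level tokenization used as the NMT input/output units:
    every Chinese character is its own token."""
    return list(sentence)

# The misspelled 心 stays a known character token rather than forming
# an unseen 'word' 高心.
print(char_tokenize("昨天我很高心"))
```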

SMT GEC Model
The SMT model consists of two components: a language model and a translation model. The language model is learned from a monolingual corpus of the target language, while the parameters of the translation model are calculated from the parallel corpus. We used the noisy channel model (Brown et al., 1993) to combine the language model and the translation model, and incorporated beam search to decode the result.
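The noisy channel combination described above can be written as follows (standard form; the symbols are ours), where s is the learner sentence, c a candidate correction, P(c) the language model, and P(s | c) the translation model:

```latex
\hat{c} = \operatorname*{arg\,max}_{c} P(c \mid s)
        = \operatorname*{arg\,max}_{c} \, P(s \mid c)\, P(c)
```

Beam search then approximates this argmax over the space of candidate corrections.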
To explore the abilities of SMT models with different configurations, we trained six SMT models with different data granularities and monolingual datasets, as described in Section 5.1. These six SMT models are denoted S_j, where j ∈ {1, 2, 3, 4, 5, 6} is the model index, and the correction of sentence s_i generated by S_j is denoted C_iS_j.

Grammatical Error Detection and Correction
For the detection stage, we used the BiLSTM-CRF model described in Section 4.1 to tag possible errors by generating a label for each character of sentence s_i. Each label sequence was then converted to the editing format [sid, start, end, err]. Next, we used the correction results generated by our three GEC models to help tune the detection result. For an original sentence s_i, we predicted the corrected sentence C_iM with GEC model M, where M could be an NMT model N_j or an SMT model S_j. After obtaining the predicted correction, we converted it to the editing format [sid, start, end, err], consistent with the detection result of the BiLSTM-CRF model. The conversion from C_iM to the editing format is based on the minimum editing distance, and we only considered errors of type R, M, or S. On one hand, these three types of errors are simple and clear, and can be derived by comparing s_i and C_iM with high confidence. On the other hand, errors of type W are more complicated, and the diversity of our GEC models would introduce a great deal of noise into the original result for this type. Considering that there may exist many edit traces between a specific pair of s_i and C_iM, we kept the edit list that minimized the editing distance between s_i and C_iM.
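The minimum-edit-distance conversion from an (s_i, C_iM) pair to R/M/S edits can be sketched with Python's standard difflib. The 1-based offset convention and the way M edits record their insertion point are our assumptions; the mapping of delete/insert/replace opcodes to R/M/S follows the paper's error definitions.

```python
import difflib

def extract_edits(src, cor):
    """Align an original sentence with its predicted correction and emit
    CGED-style edits. Only R, M, and S are produced, mirroring the
    paper's decision to ignore W (word-order) errors here."""
    edits = []
    matcher = difflib.SequenceMatcher(None, src, cor)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":        # chars only in src -> redundant (R)
            edits.append(("R", i1 + 1, i2))
        elif tag == "insert":      # chars only in cor -> missing (M)
            edits.append(("M", i1 + 1, i1 + 1, cor[j1:j2]))
        elif tag == "replace":     # differing span -> word selection (S)
            edits.append(("S", i1 + 1, i2, cor[j1:j2]))
    return edits

# A redundant-word case: the correction drops the extra 了.
print(extract_edits("我喜欢吃了苹果", "我喜欢吃苹果"))  # → [('R', 5, 5)]
```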
With the edits e_ij of sentence s_i generated by the BiLSTM-CRF and GEC models, the next step of our system is to ensemble all those edits. We tried two ensemble methods. One is merging, which combines all detections generated by the BiLSTM-CRF model as well as the GEC models and takes the union of their editing sets. The other is voting, in which we set a voting threshold thre and accept edit e_ij if T_ij ≥ thre, where T_ij is the number of times edit e_ij appears for sentence s_i.

In the correction stage, we used the editing set E_i generated in the detection subtask. For each edit e_ij in E_i whose error type is M or S, we selected the candidate characters from the corresponding correction sentences predicted by our GEC models. Finally, all correction candidates generated by the different GEC models were collected and merged to create the submission file with detections as well as corrections.
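The two ensemble strategies are easy to sketch. Below, edits are plain tuples, and `thre` plays the role of the paper's voting threshold; the tuple representation is our assumption.

```python
from collections import Counter

def merge_edits(edit_lists):
    """Merging: take the union of all models' editing sets."""
    return sorted(set(e for edits in edit_lists for e in edits))

def vote_edits(edit_lists, thre=2):
    """Voting: keep an edit only if at least `thre` models propose it
    (T_ij >= thre in the paper's notation)."""
    counts = Counter(e for edits in edit_lists for e in edits)
    return sorted(e for e, c in counts.items() if c >= thre)

model_outputs = [
    [("R", 5, 5)],                  # e.g. BiLSTM-CRF
    [("R", 5, 5), ("M", 3, 3)],     # e.g. an NMT model
    [("M", 3, 3)],                  # e.g. an SMT model
]
print(merge_edits(model_outputs))
print(vote_edits(model_outputs, thre=2))
```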

Data Split and Experiment Setting
To train the BiLSTM-CRF model, we collected the CGED datasets from 2015, 2016, 2017, and 2018. We split off 20% of the 2017 training data as the validation dataset, denoted '17-dev', and used all the rest for training. We used character and word embeddings pre-trained on Gigawords and kept them fixed; all other parameters were initialized randomly.
To train our GEC models, we used the external Lang-8 dataset described in Section 3. Because each original sentence can have more than one corrected sentence, we used two approaches to generate parallel data pairs. The first is to use only the corrected sentence with the smallest edit distance from the original sentence; the resulting training data is denoted data_ed. The second is to use all corrected sentences of the corresponding original sentence; the resulting training data is denoted data_all.
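The two pairing schemes can be sketched as follows. `data_ed` keeps only the nearest correction by Levenshtein distance, while `data_all` keeps every correction; the function names are ours.

```python
def levenshtein(a, b):
    """Standard character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def make_pairs(orig, corrections, use_all=False):
    """data_all keeps every (orig, correction) pair; data_ed keeps only
    the correction closest to orig in edit distance."""
    if use_all:
        return [(orig, c) for c in corrections]
    best = min(corrections, key=lambda c: levenshtein(orig, c))
    return [(orig, best)]

print(make_pairs("abc", ["abd", "xyz"]))            # data_ed style
print(make_pairs("abc", ["abd", "xyz"], use_all=True))  # data_all style
```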
For the NMT models, we used the pre-trained embeddings in different parts of the model. The first option was not to use them at all, forcing the model to learn proper embeddings by itself. Considering that the dataset is not large enough for the model to learn embeddings from scratch, we also tested using the pre-trained embeddings for both the encoder and the decoder. But since the embeddings were trained on Gigaword (Graff and Chen, 2005), which is quite different from sentences written by CFL learners, we additionally tried using the pre-trained embeddings only in the decoder. The configurations of our four NMT GEC models N_j, j ∈ {1, 2, 3, 4} are shown in Table 2. In the 'Network' column, 'BiLSTM' means bi-directional LSTM (Schuster and Paliwal, 1997), and in the 'Embed' column, 'enc-dec' means using pre-trained embeddings for both the encoder and decoder parts of the model.

For the SMT models, we trained the language model on different corpora, including Gigaword, the Chinese Wikipedia corpus (Denoyer and Gallinari, 2006), and a corpus we constructed ourselves from CGED and Lang-8 correct sentences. Besides, we also tested different granularities, i.e., character-level versus phrase-level translation models. It is worth mentioning that using data_all significantly outperformed data_ed, so we only ran detailed experiments on data_all because of the time limitation of the contest. The configurations of our six SMT models S_j, j ∈ {1, 2, 3, 4, 5, 6} are shown in Table 3.

Many excellent tools spare us the heavy burden of implementing models from scratch. We implemented the NMT GEC models with the OpenNMT toolkit (Klein et al., 2017); for the SMT GEC models, we implemented the language model with the KenLM toolkit (Heafield, 2011) and the translation model with Moses (Koehn et al., 2007).
For the Lang-8 dataset, we found that of the 717,241 lines of data, 474,638 lines contained traditional Chinese. Traditional Chinese conveys no more information than its simplified counterpart but makes the vocabulary much larger, so we used the opencc toolkit to convert all traditional Chinese to simplified Chinese.

Table 4: Experiments of Grammatical Error Detection on the 17-dev dataset by merging eleven models. The corresponding configurations of the models in the 'NMT-type' and 'SMT-type' columns can be found in Table 2 and Table 3. The values in the 'Detection', 'Identification', and 'Position' columns are all F1 scores.

Table 5: Experiments of Grammatical Error Detection on the 17-dev dataset by voting among eleven models. The corresponding configurations of the models in the 'NMT-type' and 'SMT-type' columns can be found in Table 2 and Table 3. (Columns: NMT-type, SMT-type, FP-rate, Detection, Identification, Position.)

Experiment Result
The evaluation for the NLPTEA CGED shared task consists of four metrics: 'Detection' (determine whether the sentence contains errors), 'Identification' (determine the error types), 'Position' (determine the positions of errors), and 'Correction' (determine candidate corrected words for the M and S error types). These four subtasks range from easy to hard; the last metric is the most valuable and is the one we paid most attention to. The former three metrics relate to the detection stage, and the last relates to the correction stage.
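A simplified reading of these set-based metrics can be sketched as precision/recall/F1 over edit sets; the official scorer is more detailed, so this is only an illustration of the scoring idea.

```python
def prf(gold, pred):
    """Precision, recall, and F1 over sets of edits, e.g. at the
    'Position' level where an edit is a (type, start, end) tuple."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exactly matching edits
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# System finds one of two gold edits and proposes nothing spurious.
print(prf([("R", 5, 5), ("M", 3, 3)], [("R", 5, 5)]))
```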

Grammatical Error Detection
We used different parameters and initial states of the BiLSTM-CRF model to get eight different results in the detection stage. Each of the three GEC models can also generate results in the editing format described in Section 4.5. We used different methods to ensemble those eleven models, including merging and voting as explained in Section 4.5. Because both the NMT and SMT models have several configurations, we tried all combinations of N_j, j ∈ {1, ..., 4} and S_j, j ∈ {1, ..., 6}, together with the fixed rule-based model; part of the experiment results with merging is shown in Table 4, and with voting in Table 5.
Tables 4 and 5 show that the voting method is more powerful than the merging method on all metrics except 'Detection', which is the easiest subtask. We also found that different combinations of models can cope with different types of errors and generate results that excel at different subtasks. To make the best use of the corrections generated by our translation models, we preferred the combination that performed best on the 'Position' metric, so we chose the voting method with threshold 2, applied to the test dataset with N_2 and S_4.

Grammatical Error Correction
We found that our GEC models focus on different types of errors, as shown in Table 6 on the official CGED 2018 test data, denoted '18-test'. Table 7 shows some cases in which our different models generated various types of corrections for the same original sentence.
As shown in Table 6, the rule-based model can correct word selection errors that share similar morphology or pronunciation with the ground-truth characters. Because it focuses on the correction of word selection errors, it yields high precision on the error correction problem. The SMT model can handle some errors of type R, even when the redundant part seems reasonable in the local context. The NMT model is good at correcting many types of errors, including simple missing-word and redundant-word errors. It can also insert punctuation in the middle of the original sentence.
Table 7 shows that, given an original sentence, different GEC models can give different corrections. In the first two rows, the rule-based model and the SMT model give different corrections for the same position of the original sentence, and both corrections are reasonable. In the last two rows, the NMT model and the SMT model give corrections at different positions. The ensemble of these models is helpful because together they can generate corrections for many parts of the original sentences, and when they produce different candidates for the same position, we use the voting method to determine the final output.
We ran an ablation test after the release of the CGED 2018 ground-truth labels. Given the error detection results generated by the BiLSTM-CRF model in the detection stage, we used different combinations of the three GEC models to generate candidate corrections for errors of type S and M. As mentioned before, we picked the model combination that performed best on the 'Position' metric in Table 5 to make the best use of the candidates generated by our GEC models. It is worth mentioning that our rule-based GEC model is not customized for this dataset, and the errors made by CFL learners are quite different from those made by native speakers, which leads to its relatively low precision. The result of the combination of all three models is slightly better than the version we submitted to the CGED shared task because we fixed a small bug in the GEC model. The ablation study showed that combining the three GEC models improved the F1 score of the Correction subtask significantly.

Conclusion and Future Work
This paper describes our system for the NLPTEA 2018 CGED shared task. We proposed a two-stage hybrid system that combines the BiLSTM-CRF model with three GEC models. In the detection stage, we used the correction results generated by the GEC models to tune the error tags produced by the BiLSTM-CRF model; in the correction stage, the outputs of our GEC models were merged to generate candidate corrections for errors of type S or M. Our system achieved the highest precision in the 'Correction' subtask, the most challenging part of this shared task, and ranked in the top 3 on F1 scores for error position detection.
In the future, we will further explore the strengths as well as limitations of three GEC models in our system and find a better method to combine them.