Chinese Grammatical Error Diagnosis by Conditional Random Fields

This paper reports how we built a Chinese grammatical error diagnosis system based on conditional random fields (CRF). The system can detect four types of grammatical errors in learners' essays: redundant words, missing words, word selection errors, and word ordering (disorder) errors. Our system achieved the best false positive rate in the 2015 NLP-TEA-2 CGED shared task, as well as the best precision at all three diagnosis levels.


Introduction
Learning Chinese as a foreign language is a rising trend. Since Chinese has its own unique grammar, it is hard for foreign learners to write correct sentences. A computer system that can diagnose grammatical errors will help learners learn Chinese faster (Yu et al., 2014; Wu et al., 2010; Yeh et al., 2014; Chang et al., 2014).
In the NLP-TEA-2 CGED shared task data set, there are four types of errors in the learners' sentences: Redundant, Selection, Disorder, and Missing. The research goal is to build a system that can detect an error, identify its type, and point out its position in the sentence.

Methodology
Our system is based on the conditional random field (CRF) (Lafferty et al., 2001). CRF has been used in many natural language processing applications, such as named entity recognition, word segmentation, information extraction, and parsing (Wu and Hsieh, 2012). Each task requires its own feature set and labeled training data.

The CRF can be regarded as a sequential labeling tagger: given an input sequence X, the CRF generates the corresponding label sequence Y based on the trained model. Each label in Y is taken from a task-specific tag set, and how to define and interpret the labels is task-dependent work for the developers. Mathematically, the model is defined as:

P(Y|X) = (1/Z(X)) exp(Σ_k λ_k f_k(Y, X)),

where Z(X) is the normalization factor, f_k is a feature in the feature set, and λ_k is its corresponding weight.

In this task, X is the input sentence and Y is the corresponding sequence of error type labels. We define the tag set as {O, R, M, S, D}, corresponding to no error, redundant, missing, selection, and disorder, respectively. Figure 1 shows a snapshot of our working file. The first column is the input sentence X, and the third column is the labeled tag sequence Y; the second column is the part-of-speech (POS) tag of the word in the first column. Combinations of the words and POS tags serve as the features in our system. The POS set used in our system, listed in Table 1, is a simplified POS set provided by CKIP 1. Figure 2 (at the end of the paper) shows the framework of the proposed system. The system is built on CRF++, a linear-chain CRF toolkit developed by Kudo 2.
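To make the model concrete, the following is a minimal, self-contained sketch (not the shared-task system itself) that scores candidate tag sequences over the {O, R, M, S, D} tag set and computes P(Y|X) by brute-force enumeration of Z(X). The (emission, transition) feature functions and the hand-picked weights are purely illustrative assumptions.

```python
from itertools import product
from math import exp

# Tag set from the paper: no error, Redundant, Missing, Selection, Disorder
TAGS = ["O", "R", "M", "S", "D"]

def score(tokens, tags, weights):
    """Sum of weighted feature functions over one candidate tag sequence.
    The (emission, transition) indicator features are illustrative only."""
    s, prev = 0.0, "<s>"
    for tok, tag in zip(tokens, tags):
        s += weights.get(("emit", tok, tag), 0.0)    # token/tag feature
        s += weights.get(("trans", prev, tag), 0.0)  # tag bigram feature
        prev = tag
    return s

def crf_probability(tokens, tags, weights):
    """P(Y|X) = exp(score(X, Y)) / Z(X); Z(X) is computed here by
    brute-force enumeration over all tag sequences (toy inputs only;
    real CRF toolkits use dynamic programming instead)."""
    z = sum(exp(score(tokens, list(ys), weights))
            for ys in product(TAGS, repeat=len(tokens)))
    return exp(score(tokens, tags, weights)) / z

# Hand-picked toy weights, not learned from the shared-task data.
w = {("emit", "了", "R"): 2.0, ("trans", "O", "R"): 0.5}
sent = ["走", "二十", "分鐘", "了"]
p_redundant = crf_probability(sent, ["O", "O", "O", "R"], w)
```

Under these toy weights, the sequence tagging the final "了" as Redundant receives a higher probability than the all-O sequence, which is the kind of decision the trained model makes at scale.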

Training phase
In the training phase, a training sentence is first segmented into terms, and each term is labeled with its POS tag and error type tag. Our system then uses the CRF++ learning algorithm to train a model. The features used in CRF++ are expressed by templates. Table 12 (at the end of the paper) shows one sentence in our training set, and Table 13 (at the end of the paper) shows all the feature templates used in our system with their values for that example. The format of each template is %x[row, col], where row is the relative row offset from the current token within a sentence and col is the column index shown in Figure 1. The feature templates used in our system are combinations of the terms and POS tags of the input sentences. For example, the first feature template is "Term+POS": if an input sentence contains the same term with the same POS, the feature value is 1; otherwise it is 0. The second feature template is "Term+Previous Term": if an input sentence contains the same term bi-gram, the feature value is 1; otherwise it is 0.
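As an illustration of how such templates are expanded, the sketch below mimics CRF++-style %x[row, col] substitution over a (term, POS) working file. The function name and the out-of-range padding convention are our own, not CRF++ internals.

```python
import re

def expand_template(template, rows, t):
    """Expand a CRF++-style feature template such as "U00:%x[0,0]/%x[0,1]"
    for the token at position t. `rows` is the working file as a list of
    (term, POS) tuples; %x[row,col] reads column `col` at relative row
    offset `row` from the current position."""
    def repl(m):
        off, col = int(m.group(1)), int(m.group(2))
        i = t + off
        if 0 <= i < len(rows):
            return rows[i][col]
        # Out-of-range padding token (our own convention, not CRF++'s)
        return "_B-" if off < 0 else "_B+"
    return re.sub(r"%x\[(-?\d+),(\d+)\]", repl, template)

# Example columns in the layout of Figure 1: (term, POS)
rows = [("他", "N"), ("是", "V"), ("學生", "N")]
f_term_pos = expand_template("U00:%x[0,0]/%x[0,1]", rows, 1)  # "Term+POS"
f_bigram = expand_template("U01:%x[-1,0]/%x[0,0]", rows, 1)   # "Previous Term+Term"
```

For the middle token, the first template expands to "U00:是/V" and the second to "U01:他/是", matching the "Term+POS" and "Term+Previous Term" templates described above.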

Test phase
In the test phase, our system uses the trained model to detect and identify the errors in an input sentence. Table 2, Table 3, and Table 4 show the labeling results for example sentences with the error types Redundant, Selection, Disorder, and Missing. If all the system-predicted tags in the fourth column are the same as the tags in the third column, the system has labeled the sentence correctly. In the formal run, accuracy, precision, recall (Cleverdon, 1972), and F-score (van Rijsbergen, 1979) are considered. The evaluation metrics are defined as follows; the notation is listed in Table 6.

                 System-predicted tag
                     A        B
Known tag   A       tpA      eAB
            B       eBA      tpB
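Assuming A is the "no error" class and B the "with error" class (our reading of the notation, for illustration), the metrics can be computed from the confusion matrix above as:

```python
def metrics(tpA, eAB, eBA, tpB):
    """Accuracy, precision, recall, F-score, and false positive rate from
    the two-class confusion matrix (eAB = known tag A, predicted as B).
    B is treated as the positive class."""
    total = tpA + eAB + eBA + tpB
    accuracy = (tpA + tpB) / total
    precision = tpB / (tpB + eAB) if (tpB + eAB) else 0.0
    recall = tpB / (tpB + eBA) if (tpB + eBA) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    fpr = eAB / (tpA + eAB) if (tpA + eAB) else 0.0  # false positive rate
    return accuracy, precision, recall, f_score, fpr

# Example with made-up counts (not shared-task figures):
acc, prec, rec, f1, fpr = metrics(tpA=90, eAB=10, eBA=30, tpB=70)
```

Note the trade-off visible in the formulas: predicting error tags conservatively lowers eAB (and hence the false positive rate) while raising eBA, which lowers recall, exactly the behavior our system exhibits.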

Data set
Our training data consists of the Training Data and Test Data from NLP-TEA1 (Chang et al., 2012) and the Training Data from NLP-TEA2. Figure 3 (at the end of the paper) shows the format of the data set. Table 7 shows the number of sentences in our training set.

Experiments result
In the formal run of the NLP-TEA-2 CGED shared task, there were 6 participants, and each team submitted 3 runs. Table 8 shows the false positive rates. Our system has the lowest false positive rate, 0.082, which is much lower than the average. Table 9, Table 10, and Table 11 show the formal-run results of our system compared to the average at the Detection level, Identification level, and Position level, respectively. Our system achieved the highest precision at all three levels, and its accuracy is fair; however, its recall is relatively low. The numbers in boldface are the best performance among the 18 runs in this year's formal run.

Error analysis on the official test result
There are 1000 sentences in the official test set of the 2015 CGED shared task. Our system labeled them with the CRF model trained on the official training set and the available data set from last year. In the training set, the number of O tags dominates the other tags, for sentences both with and without an error. For example, one of the test sentences is labeled:

{O(他們)，O(從)，O(公車站)，O(走路)，O(走)，O(二十)，O(分鐘)，O(才)，O(到)，O(電影院)，R(了)}
Therefore, our system tends to label words with tag O, which is part of the reason it achieved the lowest false positive rate this year. Our system also has high accuracy and precision, but its recall is lower than that of the other systems. Below we analyze the causes and discuss how to remedy these shortcomings.
We find 11 major mistake types in our system's results:
1. Giving two error tags in one sentence.
2. Failing to label the Missing tag.
3. Failing to label the Disorder tag.
4. Failing to label the Redundant tag.
5. Failing to label the Selection tag.
6. Labeling a correct sentence with the Missing tag.
7. Labeling a correct sentence with the Redundant tag.
8. Labeling a correct sentence with the Disorder tag.
9. Labeling a correct sentence with the Selection tag.
10. Labeling a Selection error with the Redundant tag.
11. Labeling a Disorder error with the Missing tag.

Analysis of the error cases:
1. Giving two error tags in one sentence: In the official training set and test set, a sentence has at most one error type. However, our method may label more than one error tag in one sentence. For example, a system output: {他是很聰明學生，O(他)，R(是)，O(很)，O(聰明)，M(學生)}. Currently, we do not rule out the possibility that a sentence contains more than one error; we believe that in a real application there may be a need to handle such cases. However, our system could compare the confidence value of each tag and retain only one error tag per sentence. In this case, "新有的" is also bad Chinese; it should be "新建的". However, the word segmentation result makes it hard for our system to detect the error.
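The confidence-based post-processing mentioned for case 1 could be sketched as follows. The per-position confidence values are hypothetical inputs (e.g., CRF marginal probabilities), not something our current system produces.

```python
def keep_single_error(tags, confidences):
    """Keep at most one error tag per sentence: among the non-O positions,
    retain the tag with the highest confidence and reset the rest to O."""
    error_positions = [i for i, t in enumerate(tags) if t != "O"]
    if len(error_positions) <= 1:
        return list(tags)
    best = max(error_positions, key=lambda i: confidences[i])
    return [t if (i == best or t == "O") else "O"
            for i, t in enumerate(tags)]

# The double-tagged example from the text: R on "是" and M on "學生".
pred = keep_single_error(["O", "R", "O", "O", "M"],
                         [0.9, 0.6, 0.9, 0.9, 0.8])
```

With these illustrative confidences, the lower-confidence R tag is reset to O and only the M tag survives, matching the one-error-per-sentence assumption of the official data.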

5. Failing to label the Selection tag: We believe that recognizing selection errors requires more knowledge than the limited training set can provide. In one case, "本來" should be "就"; however, in a different context, "本來想" + "但是…" could be correct.
11. Labeling a Disorder error with the Missing tag: Since a Disorder error may involve more than two words, it is harder to train a good model for it than for the other types. For example, for one system output, the correct sentence should be "到了中國新年的時候". A grammar rule such as "到了" + Event + "的時候" might help.
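A rule like the one above could be prototyped with a regular expression. Approximating the Event slot as a run of non-punctuation characters is our assumption, not part of the shared-task system.

```python
import re

# Sketch of the rule 到了 + Event + 的時候; the Event slot is approximated
# by a lazy run of characters that are not Chinese punctuation.
RULE = re.compile(r"到了(?P<event>[^，。！？]+?)的時候")

def matches_rule(sentence):
    """Return the Event span if the sentence fits 到了 + Event + 的時候,
    otherwise None."""
    m = RULE.search(sentence)
    return m.group("event") if m else None
```

Applied to the corrected sentence "到了中國新年的時候", the rule extracts "中國新年" as the Event; a rule-based layer of such patterns could then veto or adjust CRF tags for Disorder cases.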

Conclusion and Future work
This paper reports our approach to the NLP-TEA-2 CGED Shared Task evaluation.
Based on a CRF model, we built a system that achieved the lowest false positive rate and the highest precision in the official run. The approach uniformly handles the four error types: Redundant, Missing, Selection, and Disorder. Our error analysis of the difficult cases suggests that building a better system requires more features and more training data. The system could be further improved by integrating a rule-based component.
Due to limitations of time and resources, our system has not been tested under different experimental settings. In the future, we will test our system with more feature combinations, covering both POS labeling and sentence parsing.