Chinese Grammatical Error Diagnosis Based on CRF and LSTM-CRF model

Learners of Chinese as a foreign language often make grammatical errors because of negative transfer from their native languages. However, few grammar checking applications have been developed to support these learners. The goal of this paper is to develop a tool that automatically diagnoses four types of grammatical errors in Chinese sentences written by foreign learners: redundant words (R), missing words (M), bad word selection (S) and disordered words (W). We use a conventional linear CRF model with specific feature engineering and an LSTM-CRF model to solve the CGED (Chinese Grammatical Error Diagnosis) task. We improve both models, and our submitted results achieve a better false positive rate and accuracy than the average of all runs from CGED2018 at all three evaluation levels.


Introduction
Nowadays, more and more foreigners learn Chinese as a second language. Unlike English, Chinese has no verb tenses or plural forms, and the same meaning can be expressed in many different ways, so Chinese is considered one of the most difficult languages in the world (Bo Zheng et al., 2016). Chinese as a Foreign Language (CFL) learners often make grammatical errors such as redundant words (R), missing words (M), word selection errors (S), and word ordering errors (W), due to negative language transfer, over-generalization, teaching methods, learning strategies and other reasons. Natural Language Processing (NLP) systems that can detect and correct grammatical errors are invaluable to language learners (Leacock et al., 2010). However, few grammar checking applications have been developed to support CFL learners. The goal of the CGED (Chinese Grammatical Error Diagnosis) task is to develop NLP techniques to automatically diagnose grammatical errors in Chinese sentences written by CFL learners.
In this paper, we use both a conventional linear CRF model (Lafferty et al., 2001) with specific feature engineering and an LSTM-CRF model to solve the CGED task. Many researchers have used these two models in the past few years, but we improve both of them. For the CRF model, we integrate a syntactic feature: the character itself, the POS feature and the syntactic feature are combined through templates to generate 50 combinatorial features. For the LSTM-CRF model, most researchers use only tag transition features in the CRF layer. The major improvement of our work is that more conventional sparse CRF features are incorporated into the CRF layer, such as bag of POS n-gram features, word features and tag transition features.
The rest of the paper is organized as follows: Section 2 gives the definition of the CGED task. Section 3 introduces the two methods we use to solve the CGED task. Section 4 describes the dataset we use and the evaluation results on the validation set and the test set. Section 5 presents the conclusion and future work.

Task Definition
The task of CGED is defined as follows: given a Chinese sentence, the goal of a CGED tool is to diagnose four types of grammatical errors: redundant words (R), missing words (M), word selection errors (S) and word ordering errors (W).
The input sentence may contain one or more such errors. The developed tool should indicate each error type and its position in the given sentence. Specifically, if an input sentence contains grammatical errors, the output for each error should include four items: the id of the sentence, the positions of the starting and ending characters at which the grammatical error occurs, and the error type, which should be one of the defined errors: "R", "M", "S", and "W". Example sentences and corresponding notes are shown in Table 1 and Table 2.
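To make the four output items concrete, the following minimal Python sketch represents one diagnosed error as a record. The class name `Diagnosis`, the helper `format_diagnosis`, the sentence id `S001`, the comma-separated layout and the 1-based inclusive positions are illustrative assumptions, not the official submission format of the shared task.

```python
from dataclasses import dataclass

# The four error types defined by the task.
ERROR_TYPES = {"R", "M", "S", "W"}


@dataclass(frozen=True)
class Diagnosis:
    """One diagnosed error (hypothetical record type for illustration)."""
    sentence_id: str
    start: int        # position of the first erroneous character (assumed 1-based)
    end: int          # position of the last erroneous character (assumed inclusive)
    error_type: str   # one of "R", "M", "S", "W"

    def __post_init__(self):
        # Reject records that cannot be valid diagnoses.
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")
        if not 1 <= self.start <= self.end:
            raise ValueError("positions must satisfy 1 <= start <= end")


def format_diagnosis(d: Diagnosis) -> str:
    """Render the four items on one line (assumed layout)."""
    return f"{d.sentence_id}, {d.start}, {d.end}, {d.error_type}"
```

For instance, `format_diagnosis(Diagnosis("S001", 6, 6, "R"))` describes a redundant word at position 6 of a sentence with the made-up id `S001`.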

Methodology
We use two different models to solve the CGED task. The first is a traditional model based on Conditional Random Fields (CRF) with specific feature engineering. Many researchers chose CRF-based models for the CGED2016 and CGED2017 tasks. Previous research shows that a CRF model with carefully designed feature templates can perform at the same level as neural networks (Lung-Hao Lee et al., 2016), especially when the training data is limited. The second is an LSTM-CRF model with conventional sparse CRF features. The LSTM-CRF model has also been used by other researchers (Bo Zheng et al., 2016), and LSTMs have proved effective in various applications that involve sequence modeling. In this work, we improve both the CRF model and the LSTM-CRF model.

CRF model with feature engineering
Table 1: Two errors are found in the sentence above: one is a word ordering error (W) from position 3 to 5, the other is a word selection error (R) from position 16 to 17.

Error Type              R    M
Error position-Start    6    19
Error position-End      6    19

Table 2: Two errors are found in the sentence above: one is a redundant word (R) error at position 6, the other is a missing word (M) error at position 19.

Conditional random fields (CRFs), an extension of both Maximum Entropy Models (MEMs) and Hidden Markov Models (HMMs), have been used to solve natural language processing problems such as word segmentation, information extraction and parsing. The CGED task can be considered a sequence labeling problem that assigns each Chinese character in a sentence a tag containing the error type (R, M, S, W). Because a CRF is a sequence labeling model with a flexible feature space, it can solve the CGED task given a feature set and labeled training data. The model can be defined as:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{k} \lambda_k f_k(x, y)\Big)$$

where $Z(x)$ is the normalization factor, $f_k$ are the features and $\lambda_k$ are the corresponding feature weights. x is the character sequence of a training sentence (the first column of Table 3), and y is the sequence of error type labels (the fourth column of Table 3).
For example, the label 'B-S' indicates that the character is the beginning of a word selection error. The CRF model generates the label sequence y for the input sequence x. The second column of Table 3 is the POS (part-of-speech) feature. Because the task is solved at the character level, the POS tag of each word is projected onto its characters by attaching position indicators ('B-' and 'I-') to the tag. We use the segmenter and POS tagger of LTP, the Chinese Language Technology Platform (Wanxiang Che et al., 2010), to tag the training sentences.
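The projection of word-level POS tags onto characters can be sketched as follows. This is a minimal illustration that assumes segmentation and tagging have already been done; the function name `project_pos_to_chars`, the example words and the tags `r` and `v` are made up for the example and do not call LTP itself.

```python
def project_pos_to_chars(words, pos_tags):
    """Split each word into characters and attach 'B-'/'I-' to its POS tag.

    The first character of a word receives 'B-<tag>' and every following
    character of the same word receives 'I-<tag>'.
    """
    rows = []
    for word, tag in zip(words, pos_tags):
        for index, char in enumerate(word):
            prefix = "B-" if index == 0 else "I-"
            rows.append((char, prefix + tag))
    return rows


# Hypothetical segmented and tagged input (pronoun, verb, verb).
example = project_pos_to_chars(["我", "喜欢", "学习"], ["r", "v", "v"])
```

Each resulting pair corresponds to one row of the character-level training file, with the character in the first column and its POS feature in the second.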
The third column of Table 3 is the syntactic feature of the character, which is derived from the dependency parse of the sentence. Dependency parsing provides a representation of the grammatical relations between words in a sentence. Specifically, it can be used to identify the grammatical components of a sentence, such as the subject, and to analyze the relationships between these components. LTP is also used to parse the sentences. The parsing output for the sample sentence is "2:SBV 0:HED 5:ADV 5:ATT 2:VOB". Table 4 describes the meaning of these tags. The number indicates which word in the sentence the current word is related to. For example, 2:SBV means that the second word and the current word are in a subject-predicate relationship. Figure 1 and Figure 2 show the dependency parses and make the grammatical relations clearer: Figure 1 is the sentence with grammatical errors and Figure 2 is its correction. The number in the output is used as the syntactic feature.

CRF++ (Kudo et al., 2007), a linear-chain CRF software tool, is used to build the CRF model. To train a model with CRF++, we first need to define templates. We use 50 templates to generate 50 combinatorial features, which are listed in Table 5. The format of each template is %x[row,col], where row is the row position relative to the current character and col is the column index. The template %x[0,0]/%x[0,1] denotes the feature that combines the current character with its POS tag. For a character in the sample sentence of Table 3 whose POS tag is B-v, %x[0,0]/%x[0,1] yields that character followed by "/B-v".
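How a template macro turns into a feature string can be illustrated with a small Python sketch that mimics the %x[row,col] expansion. It is not CRF++ itself: the function `expand_template`, the `_PAD` placeholder for positions outside the sentence, and the sample rows (character, POS feature, head index) are assumptions made for illustration.

```python
import re

# Matches one macro such as %x[0,1] or %x[-1,0].
_MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand_template(template, rows, position):
    """Expand every %x[row,col] macro for the token at `position`.

    `row` is an offset relative to the current token and `col` selects a
    column of the training file. Offsets that fall outside the sentence
    are replaced by '_PAD' (a simplification chosen for this sketch).
    """
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_PAD"

    return _MACRO.sub(lookup, template)


# Hypothetical rows: (character, POS feature, syntactic feature).
rows = [("我", "B-r", "2"), ("喜", "B-v", "0"), ("欢", "I-v", "0")]
current_char_and_pos = expand_template("%x[0,0]/%x[0,1]", rows, 1)
```

With these rows, the template %x[0,0]/%x[0,1] at the second character combines that character with its own POS feature, while a template such as %x[-1,0]/%x[0,0] would combine the previous character with the current one.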

LSTM-CRF model
The LSTM-CRF model is currently a strong baseline for sequence labeling. Compared with a conventional Bi-LSTM neural network, the LSTM-CRF model can directly model the probability distribution of the label sequence through a CRF layer, and it achieves better performance on several datasets (Z. Huang et al., 2015; X. Ma et al., 2016). An illustrative graph is shown in Figure 3. Under this framework, a neural network (i.e. an LSTM) computes feature scores for the CRF, which are called neural features. Like conventional sparse CRF features, these neural features are used directly to compute the score of a given label sequence.
An LSTM-CRF model can efficiently capture past input features via an LSTM layer and other user-specified sparse features (e.g. transition features, n-gram features) via a CRF layer. Although many features are considered in our case, we take only the tag transition feature as an example here for simplicity. We modify the objective function so that neural features and conventional sparse CRF features are weighted differently. Dynamic programming can be used to efficiently compute $[A]_{i,j}$ and the optimal tag sequence for inference. The modified CRF layer then models the conditional probability of a possible output sequence s given the input sequence x as:

$$p(s \mid x; W) = \frac{\exp\big(\mathrm{score}(x, s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(x, s')\big)}$$

where $\mathrm{score}(x, s)$ is the score of a sentence x with tag sequence s. Maximum likelihood training chooses the parameters W such that the log-likelihood $\mathcal{L}(W)$ is maximized.
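The role of dynamic programming in a CRF layer can be sketched in plain Python. The sketch below covers only a standard linear-chain layer with per-position tag scores and a tag transition matrix; it does not reproduce the modified objective or the sparse features of this paper. The names `emissions` (standing in for the neural feature scores), `transitions` (standing in for the tag transition scores) and all numeric values are assumptions.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, tags):
    """Score of one tag sequence: per-position scores plus transition scores."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def _logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_partition(emissions, transitions):
    """Forward algorithm: log of the sum of exp(score) over all tag sequences."""
    alpha = list(emissions[0])
    k = len(alpha)
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + _logsumexp([alpha[i] + transitions[i][j] for i in range(k)])
            for j in range(k)
        ]
    return _logsumexp(alpha)


def viterbi(emissions, transitions):
    """Highest-scoring tag sequence, found with back-pointers."""
    k = len(emissions[0])
    best = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        prev, best, pointers = best, [], []
        for j in range(k):
            i_best = max(range(k), key=lambda i: prev[i] + transitions[i][j])
            pointers.append(i_best)
            best.append(prev[i_best] + transitions[i_best][j] + emissions[t][j])
        back.append(pointers)
    path = [max(range(k), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def brute_force_log_partition(emissions, transitions):
    """Exhaustive check of log_partition, feasible only for tiny inputs."""
    k, n = len(emissions[0]), len(emissions)
    return _logsumexp(
        [sequence_score(emissions, transitions, p) for p in product(range(k), repeat=n)]
    )


# Toy scores for a 3-position sentence and 2 tags (made-up numbers).
E = [[1.0, 0.0], [0.1, 0.9], [0.5, 0.4]]
A = [[0.3, -0.2], [-0.5, 0.6]]
log_prob_best = sequence_score(E, A, viterbi(E, A)) - log_partition(E, A)
```

The log-probability of a tag sequence is its score minus the log-partition value, which is the quantity that maximum likelihood training increases for the gold sequences.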
The training algorithm is given in Table 6.

Table 6: The LSTM-CRF training procedure.

In most LSTM-CRF based models (Z. Huang et al., 2015; X. Ma et al., 2016; M. Rei et al., 2016; L. Kong et al., 2016; G. Lample et al., 2016), only tag transition features are considered in the CRF layer. In our case, more conventional sparse CRF features are incorporated into the CRF layer. Specifically, we consider the following features defined over the inputs:
• Word features. Words that appear around the current position, within a window of size 3.
• POS tag features. POS tags that appear around the current position, within a window of size 3.
• Word n-gram features. Word n-grams that contain the current position, for n = 2, 3, 4.
• Bag-of-words features. The bag of words that contains the current word, within a window of size 5.
• Tag transition features. Tag n-grams that contain the current position, for n = 2.
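The input-side features in the list above can be sketched as a small extraction function. This is an illustration under stated assumptions: a window of size 3 is read as positions i-1 to i+1, a window of size 5 as positions i-2 to i+2, the feature string formats and the `<PAD>` token are invented for the example, and tag transition features are left to the CRF layer because they are defined over output tags.

```python
def sparse_features(words, pos_tags, i):
    """Collect sparse feature strings for position i of a sentence."""
    n = len(words)

    def token(sequence, j):
        # Positions outside the sentence map to a padding token.
        return sequence[j] if 0 <= j < n else "<PAD>"

    features = []
    # Word and POS tag features in a window of size 3 (assumed i-1 .. i+1).
    for offset in (-1, 0, 1):
        features.append(f"w[{offset}]={token(words, i + offset)}")
        features.append(f"p[{offset}]={token(pos_tags, i + offset)}")
    # Word n-grams that contain position i, for n = 2, 3, 4.
    for size in (2, 3, 4):
        for start in range(i - size + 1, i + 1):
            if start >= 0 and start + size <= n:
                features.append(f"ng{size}=" + "_".join(words[start:start + size]))
    # Bag of words in a window of size 5 (assumed i-2 .. i+2), order ignored.
    for word in sorted(set(words[max(0, i - 2):min(n, i + 3)])):
        features.append(f"bow={word}")
    return features


# Hypothetical segmented sentence with made-up POS tags.
feats = sparse_features(["我", "喜欢", "学习", "汉语"], ["r", "v", "v", "n"], 1)
```

Each feature string would then receive its own weight per tag in the CRF layer, alongside the neural scores.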

Dataset
We collect datasets from CGED-HSK-2016, CGED-2017 and CGED-2018 as our training set and validation set. Table 7 shows the distributions of error types in both the training set and validation set. The ratio of training set size to validation set size is about 8:1. Besides the sentences with grammatical errors, 1539 correct sentences are added into the validation set.

Validation
We use the validation set to evaluate the CRF models with and without the syntactic feature. CRF-1 refers to the model with the syntactic feature and CRF-2 refers to the model without it. The results in Table 8 show that the syntactic feature improves the performance of the CRF model. Therefore, the CRF model with both the Part-Of-Speech (POS) feature and the syntactic feature is used in our final run.
We also study the effectiveness of the handcrafted features in our LSTM-CRF model. Experiment results are shown in Table 9. LSTM-CRF-1 refers to the LSTM-CRF model with the handcrafted features defined in Section 3.2. LSTM-CRF-2 refers to the LSTM-CRF model without handcrafted features (i.e. only the tag transition feature is considered). The results show that feature engineering in the CRF part improves performance (i.e. the F1 value) by about 2%, so we use the LSTM-CRF-1 model as our final model.

Evaluation Results
In the CGED2018 shared task, 12 teams submitted results, for a total of 32 runs. Our team submitted three runs. Run1 and Run2 are based on the CRF model trained with training sets of different sizes, while Run3 is based on the LSTM-CRF model. The average of all runs is calculated over the 32 runs of the 12 teams. Table 10 shows the false positive rates of our three runs and the average of all runs. FP (False Positive) is the number of sentences in which non-existent grammatical errors are identified as errors, so a lower false positive rate is better. The best false positive rate of our team is 0.1255 (Run3), which is much lower than the average rate of all runs. Table 11, Table 12 and Table 13 show the evaluation results at the detection level, identification level and position level. The submitted results of our
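The sentence-level metrics discussed in this section can be illustrated with a short Python sketch. It uses the standard confusion-matrix definitions (false positive rate as FP / (FP + TN)) and assumes, without having checked the official scorer, that they match the shared task evaluation. The function name and the toy labels are made up.

```python
def detection_metrics(gold, predicted):
    """Sentence-level detection metrics from parallel lists of booleans.

    True means the sentence contains (gold) or is flagged as containing
    (predicted) at least one grammatical error.
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)

    def ratio(numerator, denominator):
        # Avoid division by zero on degenerate inputs.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy example: six sentences, three of which truly contain errors.
metrics = detection_metrics(
    [True, True, False, False, False, True],
    [True, False, True, False, False, True],
)
```

In this toy example one of the three correct sentences is wrongly flagged, so the false positive rate is one third.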

Conclusion and Future Work
In this paper, we study the task of Chinese grammatical error diagnosis and propose two models to address it: a conventional linear CRF with specific feature engineering and an LSTM-CRF model. We improve both models on the basis of previous research and obtain a better false positive rate and accuracy than the average of all runs from CGED2018 at all three evaluation levels (detection level, identification level and position level). However, none of the three runs performs well on recall, which should be improved in the future. Future work includes exploring semi-CRFs and neural semi-CRFs for the CGED shared task and exploring more task-specific features, such as phonological and grapheme features.