CYUT-III Team Chinese Grammatical Error Diagnosis System Report in NLPTEA-2018 CGED Shared Task

This paper reports how we built a Chinese grammatical error diagnosis system for the NLPTEA-2018 CGED shared task. In 2018, we submitted three runs with three different approaches. The first is a pattern-based approach using frequent error pattern matching. The second is a sequential labelling approach using conditional random fields (CRF). The third is a rewriting approach using a sequence-to-sequence (seq2seq) model. The three approaches have different properties that aim to optimize different performance metrics, and the formal run results show the differences as we expected.


Introduction
Learning Chinese as a foreign language is becoming popular. However, it is very hard for a foreign learner to write a correct Chinese sentence. We believe that a computer system that can diagnose grammatical errors will help learners learn Chinese faster.
Since 2014, the NLP-TEA workshop has provided a Chinese Grammatical Error Diagnosis (CGED) shared task to promote research on error diagnosis. The organizer provides a learner corpus tagged with error labels. There are four types of errors in the learners' sentences: Redundant, Selection, Disorder, and Missing. The research goal is to build a system that can detect the errors, identify the type of each error, and point out the position of the error in the sentence (Yu et al., 2014). This year, CGED added a new requirement: for missing-word and word-selection errors, systems are required to recommend at most 3 corrections. If one of the corrections for such an instance is identical to the gold standard, the instance is regarded as a correct case.
In 2018, we submitted three formal runs using three different approaches. The first two are based on previous work: the first is a pattern-based approach using frequent error pattern matching and language model scoring; the second is a sequential labelling approach using conditional random fields (CRF), which performed well in 2015 and 2016. The third is a new approach, which we call the rewriting approach, based on a sequence-to-sequence (seq2seq) model. In the following sections, we introduce the three approaches, discuss the formal run results, and give conclusions and future work.

Pattern-Based Approach
The pattern matching approach is a classic approach that has been used in many previous works (Wu et al., 2010; Chen et al., 2011). Each pattern contains a frequent error term in which a character is replaced by a similar one. This is based on the assumption that students often confuse similar characters (Liu et al., 2009). The advantage of pattern matching is its stability; the main drawback is the high cost of collecting the patterns.
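The matching step can be sketched as follows. The two patterns below are hypothetical examples built from similar-character confusions (在/再, 得/的), not patterns from our actual corpus, and the pattern table format is an illustrative assumption:

```python
def match_error_patterns(sentence, patterns):
    """Scan a sentence for known error patterns and report each hit as
    (start, end, wrong, correction), with 0-indexed character offsets."""
    hits = []
    for wrong, right in patterns.items():
        start = sentence.find(wrong)
        while start != -1:
            hits.append((start, start + len(wrong), wrong, right))
            start = sentence.find(wrong, start + 1)
    return hits

# Hypothetical error patterns: erroneous string -> corrected string.
ERROR_PATTERNS = {"在见": "再见", "我得书": "我的书"}

hits = match_error_patterns("明天在见", ERROR_PATTERNS)
```

In the real system, each candidate hit would additionally be scored with a language model before an error is reported.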
The system is based on our previous work; the error patterns were collected from a native student essay corpus in traditional Chinese. Before testing, the test data is converted into traditional Chinese characters by MS-Word 2010.

Sequential Labelling Approach
The second run uses a sequential labelling approach based on the conditional random field (CRF) model (Lafferty, 2001), which performed well in CGED 2015 and 2016 (Chen et al., 2015; Chen et al., 2016b). CRF has been used in many NLP applications, such as named entity recognition, word segmentation, information extraction, and parsing. Applying it to a new task requires a task-specific feature set and labeled training data. The CRF model is used as a sequential labeling tagger: given a sequence X, the CRF generates the corresponding label sequence Y based on the trained model. Each label in Y is taken from a specific tag set, which needs to be defined for each task. How to define and interpret the labels is task-dependent work for the developers.
Mathematically, the model can be defined as:

P(Y|X) = (1/Z(X)) exp( Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, X, t) )

where Z(X) is the normalization factor, {f_k} is a set of feature functions, and λ_k is the corresponding weight, which is learned in the training process. In the CGED task, X is the input sentence and Y is the corresponding sequence of error type labels. We define the tag set as {O, R, M, S, D}, corresponding to no error, redundant, missing, selection, and disorder, respectively. Table 1 shows a sample of the CRF sequential labeling dataset in our working file. The first column is the input sentence X, and the third column is the labeled tag sequence Y. The second column is the part-of-speech (POS) tag of the word in the first column. The combinations of words and POS tags are the features in our system. The POS set used in our system is a simplified POS set provided by CKIP 1 .
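The word-and-POS feature extraction described above can be sketched as follows. The feature template is a simplified assumption in the style commonly fed to CRF toolkits, and the sample words and POS tags are hypothetical, not taken from our training file:

```python
# Tag set: no error, redundant, missing, selection, disorder.
TAGS = {"O", "R", "M", "S", "D"}

def token_features(words, pos_tags, i):
    """Build a feature dict for token i from the word and POS columns
    (an assumed template; the real system's template may differ)."""
    feats = {
        "word": words[i],
        "pos": pos_tags[i],
        "word+pos": words[i] + "/" + pos_tags[i],
    }
    if i > 0:
        feats["prev_word"] = words[i - 1]
        feats["prev_pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(words) - 1:
        feats["next_word"] = words[i + 1]
        feats["next_pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats

# Hypothetical sentence with simplified CKIP-style POS tags.
words = ["我", "昨天", "去", "了", "学校"]
pos   = ["Nh", "Nd", "VC", "Di", "Nc"]
X = [token_features(words, pos, i) for i in range(len(words))]
```

Each feature dict corresponds to one row of the working file; a CRF toolkit would learn the weights λ_k over these features and the tag transitions.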

Rewriting Approach
This year, we propose a new approach, called the rewriting approach. Given a sentence with grammatical errors, the system rewrites it and outputs a sentence without grammatical errors. This idea is inspired by RNN encoder-decoder models, which have been used in much deep learning research. In such models, with the help of a large training set, a sequence can be transformed into another corresponding sequence. Among them, sequence-to-sequence (seq2seq) models (Sutskever et al., 2014; Cho et al., 2014) have been applied successfully to a variety of NLP tasks such as machine translation, speech recognition, text summarization, and conversation generation (Wu et al., 2017). In this task, we adopt the seq2seq model as it is used in Neural Machine Translation (NMT), which was the very first testbed for seq2seq models.

Seq2seq Model
Our rewriting approach system is built on the TensorFlow sequence-to-sequence (seq2seq) model 2 with long short-term memory (LSTM) cells. The training set is the 2017 and 2018 CGED training data. Figure 1 shows the training flowchart of our system. The first step collects all the vocabulary in the training corpus to build a dictionary. Then the word2vec model (Mikolov et al., 2013) is used to find the vector representation of each word. The sentences written by the students and the corresponding corrected sentences are used to train the seq2seq model. Since we do not have a validation set to find a good early stopping point, the termination criterion of training is an empirical value: a perplexity of 100.
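The dictionary-building step in the flowchart can be sketched as follows. The special tokens and their ids follow the usual seq2seq convention and are an assumption, not details taken from our actual implementation:

```python
from collections import Counter

def build_vocab(segmented_sentences, min_count=1):
    """Collect all vocabulary in the (pre-segmented) training corpus and
    assign integer ids. Reserved tokens use the common seq2seq
    convention (an assumption): padding, decoder start, end of
    sentence, and unknown word."""
    counts = Counter(w for sent in segmented_sentences for w in sent)
    vocab = {"<pad>": 0, "<go>": 1, "<eos>": 2, "<unk>": 3}
    for word, c in counts.most_common():
        if c >= min_count:
            vocab[word] = len(vocab)
    return vocab

# Tiny hypothetical corpus of segmented learner sentences.
corpus = [["我", "去", "学校"], ["我", "在", "学校"]]
vocab = build_vocab(corpus)
ids = [vocab.get(w, vocab["<unk>"]) for w in corpus[0]]
```

The resulting id sequences are what the encoder and decoder consume; the word2vec vectors are then looked up per id to form the embedding layer.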

Preprocessing
The sentences are segmented by the Jieba 3 word segmentation toolkit. The size of the vocabulary set is 5,424, which is not very large compared to the corpora used in other NLP tasks.

Post-processing
After the input is rewritten by the system, the system compares the rewritten sentence to the input sentence. We assume the rewritten sentence is the correct one and report the differences as grammatical errors.
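A minimal sketch of this comparison using character-level alignment follows; the mapping from edit operations to the S/R/M error types and the 1-indexed positions are our illustrative assumptions, not the exact rules of the system:

```python
import difflib

def diff_errors(original, rewritten):
    """Align the learner sentence with the system's rewrite and report
    the differences as (type, start, end, wrong, correction) tuples,
    assuming the rewritten sentence is correct."""
    sm = difflib.SequenceMatcher(a=original, b=rewritten)
    errors = []
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == "replace":   # characters changed -> selection error
            errors.append(("S", a0 + 1, a1, original[a0:a1], rewritten[b0:b1]))
        elif op == "delete":  # characters removed -> redundant error
            errors.append(("R", a0 + 1, a1, original[a0:a1], ""))
        elif op == "insert":  # characters added -> missing error
            errors.append(("M", a0 + 1, a0 + 1, "", rewritten[b0:b1]))
    return errors
```

For example, if the system rewrites "我在去学校" as "我再去学校", the single replaced character is reported as a selection error at position 2.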

Metrics
In the formal run, accuracy, precision, recall, and F-score are reported at three different levels. The false positive rate is reported at the detection level.
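For reference, the detection-level metrics can be computed as follows; this is a sketch of the standard definitions, not the official evaluation script:

```python
def detection_metrics(gold, pred):
    """Detection-level scores: each sentence is labeled 1 (has an error)
    or 0 (correct). A flagged erroneous sentence is a true positive;
    a flagged correct sentence is a false positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    acc = (tp + tn) / len(gold)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": acc, "precision": prec,
            "recall": rec, "f1": f1, "fpr": fpr}

# Hypothetical gold labels and system predictions for five sentences.
metrics = detection_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

The identification and position levels are scored the same way, but a prediction only counts as a true positive when the error type (and, at position level, the exact span) also matches.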

Formal Run result
The performance of our systems is shown in the following tables, compared to the average of all 32 formal runs in 2018. Table 2 shows the false positive rate, the only metric that should be as low as possible. As we expected, the run 1 pattern-based approach gives the lowest FPR among all 32 runs. Table 3 shows the performance evaluation at the detection level. At this level, the run 2 sequential labelling approach performs well in both accuracy and precision. The recall is also improved over the 2016 performance (Chen et al., 2016a). The rewriting approach gives the highest recall and a high F1, but poor accuracy and precision. This is also as we expected, since the training corpus is too small and the vocabulary size is also too small.

Conclusion and Future Works
This paper reports our approaches to the NLP-TEA-5 CGED shared task evaluation. By comparing three different approaches, we find that systems can be tuned to optimize different performance metrics.
Our systems achieve the best false positive rate at the detection level with the pattern matching approach, high accuracy and precision with the sequential labelling approach, and high recall and F1 with the rewriting approach.
Due to limitations of time and resources, our systems were not tested under different experimental settings. In the future, we will use a larger corpus to train a better rewriting system to improve the performance of error diagnosis.

Acknowledgments
This study is conducted under the "III System-of-systems driven emerging service business development Project" of the Institute for Information Industry which is subsidized by the Ministry of Economic Affairs of the Republic of China.