Overview of NLPTEA-2018 Shared Task Chinese Grammatical Error Diagnosis

This paper presents the NLPTEA 2018 shared task for Chinese Grammatical Error Diagnosis (CGED), which seeks to identify grammatical error types, their range of occurrence, and recommended corrections within sentences written by learners of Chinese as a foreign language. We describe the task definition, data preparation, performance metrics, and evaluation results. Of the 20 teams registered for this shared task, 13 teams developed systems and submitted a total of 32 runs. System performance showed clear progress, reaching an F1 of 36.12% at the position level and 25.27% at the correction level. All data sets with gold standards and scoring scripts are made publicly available to researchers.


Introduction
Automated grammar checking for learners of English as a foreign language has made considerable progress. Helping Our Own (HOO) is a series of shared tasks in correcting textual errors (Dale and Kilgarriff, 2011; Dale et al., 2012). The shared tasks at CoNLL 2013 and 2014 focused on grammatical error correction, increasing the visibility of educational application research in the NLP community (Ng et al., 2013; 2014).
Many of these learning technologies focus on learners of English as a Foreign Language (EFL), while relatively few grammar checking applications have been developed to support Chinese as a Foreign Language (CFL) learners. Those applications which do exist rely on a range of techniques, such as statistical learning (Chang et al., 2012; Wu et al., 2010; Yu and Chen, 2012), rule-based analysis (Lee et al., 2013), neural network modelling (Zheng et al., 2016; Zhou et al., 2017) and hybrid methods (Lee et al., 2014).
In response to the limited availability of CFL learner data for machine learning and linguistic analysis, the ICCE-2014 workshop on Natural Language Processing Techniques for Educational Applications (NLP-TEA) organized a shared task on diagnosing grammatical errors for CFL (Yu et al., 2014). Subsequent editions of this shared task in NLP-TEA were collocated with ACL-IJCNLP 2015 (Lee et al., 2015) and COLING 2016, and its name has been fixed since then as Chinese Grammatical Error Diagnosis (CGED). The shared task was organized again as part of IJCNLP 2017 (Rao et al., 2017), and once more in conjunction with the NLP-TEA workshop at ACL 2018. The main purpose of these shared tasks is to provide a common setting so that researchers who approach the tasks using different linguistic factors and computational techniques can compare their results. Such technical evaluations allow researchers to exchange their experiences, advance the field, and eventually develop optimal solutions to this shared task.
The rest of this paper is organized as follows. Section 2 describes the task in detail. Section 3 introduces the constructed datasets. Section 4 proposes evaluation metrics. Section 5 reports the results of the participants' approaches. Conclusions are finally drawn in Section 6.

Task Description
The goal of this shared task is to develop NLP techniques to automatically diagnose (and further correct) grammatical errors in Chinese sentences written by CFL learners. Such errors are defined as PADS: redundant words (denoted as a capital "R"), missing words ("M"), word selection errors ("S"), and word ordering errors ("W"). An input unit may contain one or more such errors. The developed system should indicate which error types are embedded in the given unit (containing 1 to 5 sentences) and the positions at which they occur. Each input unit is given a unique number "sid". If an input unit contains no grammatical errors, the system should return "sid, correct". If an input unit contains grammatical errors, the output should include four items, "sid, start_off, end_off, error_type", where start_off and end_off respectively denote the positions of the starting and ending characters at which the grammatical error occurs, and error_type should be one of the defined errors: "R", "M", "S", and "W". Each character or punctuation mark occupies one space for counting positions. Example sentences and corresponding notes are shown in Table 1. This year there is only one track, based on HSK data.
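The output convention described above can be illustrated with a minimal sketch. The function name `format_output`, the sample identifier `unit-001`, and the assumption that positions are counted from 1 are illustrative choices and are not specified by the task description.

```python
from typing import List, Tuple

# The four error types defined by the task.
VALID_TYPES = {"R", "M", "S", "W"}


def format_output(sid: str, errors: List[Tuple[int, int, str]]) -> List[str]:
    """Return the result lines for one input unit.

    `errors` holds (start_off, end_off, error_type) triples.
    An empty list means the unit is judged to be correct.
    """
    if not errors:
        return [f"{sid}, correct"]
    lines = []
    for start_off, end_off, error_type in errors:
        if error_type not in VALID_TYPES:
            raise ValueError(f"unknown error type: {error_type!r}")
        # Assumption: character positions start at 1 and the range is inclusive.
        if start_off < 1 or end_off < start_off:
            raise ValueError(f"invalid range: {start_off}-{end_off}")
        lines.append(f"{sid}, {start_off}, {end_off}, {error_type}")
    return lines


if __name__ == "__main__":
    # Hypothetical unit with a redundant word and a word selection error.
    for line in format_output("unit-001", [(4, 4, "R"), (6, 7, "S")]):
        print(line)
    print(format_output("unit-002", [])[0])
```

A unit with several errors yields one line per error, while an error-free unit yields the single line "sid, correct".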

Datasets
The learner corpora used in our shared task were taken from the writing section of the HSK (Pinyin of Hanyu Shuiping Kaoshi, Test of Chinese Level) (Cui et al, 2011;Zhang et al, 2013). Native Chinese speakers were trained to manually annotate grammatical errors and provide corrections corresponding to each error. The data were then split into two mutually exclusive sets as follows.
(1) Training Set: All units in this set were used to train the grammatical error diagnostic systems.
Each unit contains 1 to 5 sentences with annotated grammatical errors and their corresponding corrections. All units are represented in SGML format, as shown in Table 2. We provide 402 training units with a total of 1,067 grammatical errors, categorized as redundant (208 instances), missing (298), word selection (474) and word ordering (87).
In addition to the data sets provided, participating research teams were allowed to use other public data for system development and implementation. Use of other data should be specified in the final system report.
(2) Test Set: This set consists of testing units used for evaluating system performance. Table 3 shows statistics for this year's test set. According to sampling from the writing sections of the HSK, over 40% of the sentences contain no error. This proportion, sampled from the online Dynamic Corpus of HSK (http://bcc.blcu.edu.cn/hsk), was reproduced in the test set in order to measure how prone the systems are to false positives. The distribution of error types (shown in Table 4) is similar to that of the training set.

Performance Metrics
Table 5 shows the confusion matrix used for evaluating system performance. In this matrix, TP (True Positive) is the number of sentences with grammatical errors that are correctly identified by the developed system; FP (False Positive) is the number of sentences in which non-existent grammatical errors are identified as errors; TN (True Negative) is the number of sentences without grammatical errors that are correctly identified as such; FN (False Negative) is the number of sentences with grammatical errors which the system incorrectly identifies as being correct.
The criteria for judging correctness are determined at three levels as follows.
(1) Detection-level: Binary classification of a given sentence, that is, correct or incorrect, should be completely identical with the gold standard. All error types will be regarded as incorrect.
(2) Identification-level: This level could be considered as a multi-class categorization problem. All error types should be clearly identified. A correct case should be completely identical with the gold standard of the given error type.
(3) Position-level: In addition to identifying the error types, this level also judges the occurrence range of the grammatical error. That is to say, the system results should be perfectly identical with the quadruples of the gold standard.
In addition to the traditional criteria used in past shared tasks, a correction level was introduced in CGED 2018.
(4) Correction-level: For the error types of Selection and Missing, recommended corrections are required. At most 3 recommended corrections are allowed for each S and M type error. The number of corrections recommended influences the precision and F1 at this level, so the reliability of the recommendations is tested as well.
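The first three levels above can be sketched as set comparisons between gold and system annotations for one unit. This is a simplified reading of the criteria under stated assumptions, not the official scoring script: the names `Quad`, `detection_match`, `identification_hits` and `position_hits` are hypothetical, and the sample annotations are invented for illustration.

```python
from typing import Set, Tuple

# One annotation: (sid, start_off, end_off, error_type).
Quad = Tuple[str, int, int, str]


def detection_match(gold: Set[Quad], pred: Set[Quad]) -> bool:
    """Detection level: both sides agree on whether the unit has any error."""
    return bool(gold) == bool(pred)


def identification_hits(gold: Set[Quad], pred: Set[Quad]) -> Set[Tuple[str, str]]:
    """Identification level: error types present in both, ignoring positions."""
    gold_types = {(sid, etype) for sid, _, _, etype in gold}
    pred_types = {(sid, etype) for sid, _, _, etype in pred}
    return gold_types & pred_types


def position_hits(gold: Set[Quad], pred: Set[Quad]) -> Set[Quad]:
    """Position level: quadruples that are identical in gold and prediction."""
    return gold & pred


if __name__ == "__main__":
    gold = {("u1", 4, 4, "R"), ("u1", 6, 7, "S")}
    pred = {("u1", 4, 4, "R"), ("u1", 6, 8, "S"), ("u1", 10, 10, "M")}
    print(detection_match(gold, pred))
    print(sorted(identification_hits(gold, pred)))
    print(sorted(position_hits(gold, pred)))
```

In this invented example the selection error is credited at the identification level but not at the position level, because its predicted range differs from the gold range.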
The following metrics are measured at all levels with the help of the confusion matrix.
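As a minimal sketch, the metrics reported in the results (false positive rate, accuracy, precision, recall and F1) can be computed from the confusion matrix counts as follows. The standard definitions of these metrics are assumed here; the class name `Confusion`, the function `metrics`, and the sample counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Confusion:
    tp: int  # erroneous cases correctly flagged
    fp: int  # non-existent errors flagged
    tn: int  # correct cases left unflagged
    fn: int  # erroneous cases missed


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(c: Confusion) -> Dict[str, float]:
    """Compute the evaluation metrics, assuming their standard definitions."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "fpr": _ratio(c.fp, c.fp + c.tn),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in metrics(Confusion(tp=6, fp=2, tn=8, fn=4)).items():
        print(f"{name}: {value:.4f}")
```

Returning 0.0 for an empty denominator is a design choice of this sketch, so that a run with no positive predictions does not raise an exception.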

Evaluation Results
At the correction level, DM_NLP achieved the best precision (0.2932 and 0.3077) in the correction and top-3 correction tracks, while HFL's runs reached the best F1 scores (0.1723 and 0.2527).
Ten participants submitted 11 reports on their systems. Although neural networks have achieved good performance in various NLP tasks, traditional statistical models and pipelines were still widely implemented in the CGED task. LSTM+CRF has become a standard implementation. Unlike in CGED 2017, participants began to rethink the importance of feature selection and statistics.
In summary, no single submitted system was superior across all metrics, indicating the difficulty of developing systems for effective grammatical error diagnosis, especially in CFL contexts. From the organizers' perspective, a good system should have a high F1 score and a low false positive rate. Overall, HFL, DM_NLP, and CMMC-BDRC achieved relatively better performance.

Table 9 summarizes the approaches and resources for each of the submitted systems, according to the first drafts of their system reports (some details were not yet clearly described). PkU_ICL, NCYU and IIT(BHU) did not submit reports on their systems. Although neural networks have achieved good performance in various NLP tasks, traditional pipelines were still widely implemented in the CGED task. CRF, a sequence labelling model with a flexible feature space, was chosen by DM_NLP, CMMC, ECNU, HFL, walker and UIUC for their system pipelines. UIUC built its pipeline from a CRF and post-processing alone, achieving comparable results. NTOU based its runs on matching frequent sub-sentences against an Internet corpus.
For LSTM modelling, the choice of features played an important role and strongly influenced system performance. Besides characters and words, part-of-speech (POS) tags based on word segmentation were widely used. ePMI, cPMI, Adjacent Word Collocation (AWC), Dependent Word Collocation (DWC), and Contextualized Char Representation are features newly implemented in this task.
For the LSTM itself, AutoNLP applied policy gradient in modelling. One participant added an additional memory gate to the neural unit, a fairly common trick in machine translation, which helped their system achieve F1 scores above 50% at the position level and above 40% at the correction level. These submissions were withdrawn because of suspected overfitting to the test set. Although such results cannot reflect real achievement in this task, the phenomenon is still meaningful in particular contexts, such as computer-assisted essay correction.
At the correction level, DM_NLP applied rule-based, NMT and SMT models and merged the generated results in a hybrid pipeline. HFL also followed the strategy of multi-model merging, using PMI scoring and a seq2seq network. Their pipelines are shown in Fig. 1.
A wider variety of additional resources appeared in CGED 2018. Besides Gigawords and the Wikipedia Corpus, Google Ngram, People's Daily, and Chinese 5gram are resources newly introduced in this task. More impressively, CMMC used a domain dictionary from L2 teaching to generate pseudo writing data to augment the training set, improving their performance in all aspects.
Table 9: Summary of approaches and additional resources used by the submitted systems.

Conclusion
This study describes the NLP-TEA 2018 shared task for Chinese grammatical error diagnosis, including task design, data preparation, performance metrics, and evaluation results. Regardless of actual performance, all submissions contribute to the common effort to develop Chinese grammatical error diagnosis systems, and the individual reports in the proceedings provide useful insights into computer-assisted language learning for CFL learners. We hope the data sets collected and annotated for this shared task can facilitate and expedite future development in this research area. Therefore, all data sets with gold standards and scoring scripts are publicly available online at http://www.cged.science.