Ling@CASS Solution to the NLP-TEA CGED Shared Task 2018

In this study, we employ sequence-to-sequence learning to model the task of grammatical error correction. The system takes potentially erroneous sentences as input and outputs corrected sentences. To break through the bottleneck imposed by the very limited size of the manually labeled data, we adopt a semi-supervised approach. Specifically, we adapt correct sentences written by native Chinese speakers to generate pseudo grammatical errors of the kind made by learners of Chinese as a second language. We use the pseudo data to pre-train the model, and the CGED data to fine-tune it. Aware of the importance of precision in a grammatical error correction system in real scenarios, we use ensembles to boost precision. Using inputs as simple as Chinese characters, the ensembled system achieves a precision of 86.56% in the detection of erroneous sentences, and a precision of 51.53% in the correction of errors of the Selection and Missing types.


Introduction
An inter-language is an idiolect developed by a learner of a second language (L2). Characteristically, it preserves some features of the first language (L1) and overgeneralizes some L2 linguistic rules. Investigating the grammatical errors made by L2 learners discloses their error patterns, which benefits the teaching and learning process. It also promotes the development of systems that can correct the grammatical errors made by L2 learners automatically.
The rest of this paper is organized as follows: Section 2 briefly introduces the definition of the NLP-TEA CGED Shared Task 2018. Section 3 gives a quick review on previous studies. Section 4 describes the generation of pseudo data in detail. Section 5 introduces the modeling of the correction task using sequence to sequence learning. Section 6 analyses the experimental results. Finally, conclusions and prospects are drawn in Section 7.

NLP-TEA CGED Shared Task 2018
The goal of the Chinese Grammatical Error Diagnosis (CGED) Shared Task at the workshop on NLP Techniques for Educational Applications (NLP-TEA) is to develop NLP techniques that automatically correct grammatical errors in Chinese sentences written by L2 learners. The shared task allows researchers using different linguistic knowledge and computational techniques to compare their results on the basis of common datasets and evaluation frameworks. Grammatical errors made by L2 speakers fall into different types. In CGED, four types are defined: Missing words ("M"), Redundant words ("R"), word Selection errors ("S"), and Word ordering errors ("W"). Notably, this categorization differs from the traditional linguistic one, in which errors are typically categorized as misuses of determiners, prepositions, noun forms, verb forms, subject-verb agreement, etc. The error categories in CGED correspond, respectively, to the four operations defined in the Damerau-Levenshtein distance (Bard, 2006): insertions, deletions, substitutions, and transpositions. These operations are used to edit one sequence into another.
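The correspondence between error types and edit operations can be made concrete with a minimal implementation of the (restricted) Damerau-Levenshtein distance, which allows adjacent transpositions; the mapping from operations to CGED error types is noted in the comments. This is an illustrative sketch, not part of the shared-task toolkit:

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein distance: the minimum number of
    insertions, deletions, substitutions, and adjacent transpositions
    needed to turn sequence a into sequence b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion    -> Redundant ("R")
                          d[i][j - 1] + 1,         # insertion   -> Missing ("M")
                          d[i - 1][j - 1] + cost)  # substitution -> Selection ("S")
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition -> Word order ("W")
    return d[m][n]
```

Diagnosing an erroneous sentence can thus be viewed as recovering the cheapest sequence of such operations that maps it onto its correction.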
A participating system should indicate the types and positions of the errors, and propose corrections for errors of the S and M types. Systems are evaluated on four tasks: the detection of errors, the identification of error types, the identification of error positions, and the corrections.

Previous Solutions: A Quick Review
Lee et al. (2013) employed handcrafted linguistic rules to detect grammatical errors made by learners of Chinese as a second language. Their system was further integrated with n-gram models to detect the errors (Lee et al., 2014). Most previous studies treat the diagnosis of grammatical errors as a sequence labeling problem: they assign a B/I/O tag to each word in an input sentence, or to each character in a word, to detect the errors. Yu and Chen (2012) proposed using Conditional Random Fields (CRF) (Lafferty et al., 2001) to detect Chinese word ordering errors. Cheng et al. (2014) adopted a Support Vector Machine (SVM) (Hearst et al., 1998) to identify Chinese word ordering errors. In recent years, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks have been popular for this task (Zheng et al., 2016; Yang et al., 2017). Various features have been taken as inputs to the sequence labeling models, including characters, words, Part-of-Speech (POS) tags (Zheng et al., 2016), dependency information, and Pointwise Mutual Information (Yang et al., 2017), among many others.

Pseudo Labeling
The manually labeled dataset for the task of grammatical error correction is of very limited size. Since manual labeling is both labor- and time-consuming, the size of the dataset has been a bottleneck for the performance of automatic error correction systems. There have been several approaches to tackling this problem. Cahill et al. (2013) and Grundkiewicz and Junczys-Dowmunt (2014) use error corrections extracted from the Wikipedia revision history as training corpora. Furthermore, many studies adopt a semi-supervised approach that automatically generates a large-scale pseudo dataset, and have reported promising results (Foster and Andersen, 2009; Rozovskaya and Roth, 2010; Dickinson, 2010; Imamura et al., 2012; Felice and Yuan, 2014; Rozovskaya et al., 2017).

Error Types
In our study, the pseudo data are generated based on a close observation of the errors collected from the manually labeled dataset.

Missing
It is observed that missing words are often functional words, as in Sentence 1, in which a particle and a preposition are missing. (The erroneous sentence is labeled E; the correct sentence, C. The erroneous phrases are in bold.) Sentences 2 and 3 show another type of missing error, caused by improper uses of ellipsis.

Redundant
Of all the redundant errors in the CGED dataset, functional words are among the most frequent. For instance, the particle and the conjunction in Sentences 4-5 are redundant.

Selection
Selection errors often occur when near-synonyms are misused, as shown in Sentences 6-7. The differences in usage between these near-synonyms are subtle.

Word Order
Word ordering errors are typically related to the modification of verbs. For instance, the modifiers of the verbs, the auxiliary verb and the adverbs, are misplaced in Sentences 8-10.

Data Generation
Based on the above observations, we adapt sentences written by native Chinese speakers to generate ungrammatical sentences. The canonical sentences come from 12 series of textbooks for students learning Chinese as a second language, 7 series of textbooks for native Chinese students, and the People's Daily newspaper. The sentences are filtered with a length threshold and with the controlled vocabularies for teaching Chinese as a second language (Hanban, 2001, 2010). The sentences are tokenized using LTP (Che et al., 2010). Then, errors of the redundant-word, missing-word, word-selection, and word-ordering types are generated using the operations of insertion, deletion, substitution, and transposition, respectively. All adaptations are done at the word level. Two million sentences are adapted in this way.

Missing
(1) To make erroneous sentences with missing words, we randomly select a position in the input sentence.
(2) If the word at that position is a functional word, or a content word with an antecedent in the same sentence, we drop it. Example sentences are shown below.
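The two steps above can be sketched as follows. The set of functional POS tags and the example tags are illustrative assumptions (the actual system relies on LTP's tags), and the antecedent check on content words is omitted for brevity:

```python
import random

# POS tags treated as functional words -- this tag set is an illustrative
# assumption; the actual system relies on LTP's tag inventory.
FUNCTIONAL_POS = {"u", "p", "c"}  # particles, prepositions, conjunctions

def make_missing_error(words, pos_tags, rng=random):
    """Drop one functional word from a tokenized sentence to create a
    Missing-type pseudo error; return None if no position qualifies."""
    candidates = [i for i, tag in enumerate(pos_tags) if tag in FUNCTIONAL_POS]
    if not candidates:
        return None
    i = rng.choice(candidates)
    return words[:i] + words[i + 1:]
```

Sentences in which no position qualifies are simply skipped rather than corrupted arbitrarily.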

Redundant
(1) Randomly select a position in the input sentence.
(2) Randomly select a word according to word frequencies, and insert it at that position.
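A sketch of this insertion procedure, assuming the vocabulary is available with attached frequency counts (the interface is our assumption):

```python
import random

def make_redundant_error(words, vocab, freqs, rng=random):
    """Insert a frequency-weighted random word at a random position,
    producing a Redundant-type pseudo error."""
    i = rng.randrange(len(words) + 1)
    extra = rng.choices(vocab, weights=freqs, k=1)[0]
    return words[:i] + [extra] + words[i:]
```

Frequency weighting keeps the inserted words realistic, since redundant errors in the CGED data are dominated by common functional words.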

Selection
(1) Randomly select a position in the input sentence.
(2) Substitute the word at that position with one of its near-synonyms.
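A sketch of the substitution step, using a toy near-synonym table as a stand-in for the synonym resource the real system would need (the table entries are purely illustrative):

```python
import random

# Toy near-synonym table -- purely illustrative; a real system would draw
# substitutes from a Chinese synonym resource.
NEAR_SYNONYMS = {"帮助": ["帮忙"], "知道": ["了解", "认识"]}

def make_selection_error(words, rng=random):
    """Substitute one word that has a near-synonym, yielding a
    Selection-type pseudo error; return None if no word qualifies."""
    candidates = [i for i, w in enumerate(words) if w in NEAR_SYNONYMS]
    if not candidates:
        return None
    i = rng.choice(candidates)
    return words[:i] + [rng.choice(NEAR_SYNONYMS[words[i]])] + words[i + 1:]
```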

Word Order
(1) Randomly select a position in the input sentence.
(2) Transpose the word at that position with an adjacent word.
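A sketch of the transposition step, under the assumption that the selected word is swapped with its right-hand neighbor:

```python
import random

def make_word_order_error(words, rng=random):
    """Swap a randomly chosen word with its right neighbor (an adjacent
    transposition), producing a Word-ordering pseudo error; return None
    for sentences too short to transpose."""
    if len(words) < 2:
        return None
    i = rng.randrange(len(words) - 1)
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out
```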

Ling@CASS Solution: Methodology and System Development
A new task, the correction of errors of the Missing and Selection types, has been introduced in CGED 2018. This calls for a reconsideration of the appropriateness of sequence labeling models (Sakaguchi et al., 2017). Unlike the B/I/O tag set, which is closed, the corrections for the Missing and Selection error types form an open set. In addition, the corrections generally give rise to output sentences whose lengths differ from those of the inputs. Therefore, the correction task goes beyond the capabilities of sequence labeling models. Sequence-to-sequence learning (seq2seq) maps an input sequence to an output sequence of varying length, and is nowadays the mainstream model for machine translation (Klein et al., 2017). The correction task can be modeled as a translation task, in which the ungrammatical sentences are in the source language and the corrections are in the target language. Translation models have been used in several previous studies on grammatical error correction (Schmaltz et al., 2016; Chaitanya, 2017; Yuan and Felice, 2013).
FairSeq presents state-of-the-art performance on machine translation in terms of both accuracy and speed (Gehring et al., 2017). FairSeq differs significantly from previous seq2seq models in that its architecture is based entirely on Convolutional Neural Networks (CNN) instead of the prevalent Recurrent Neural Networks (RNN), so that computations can be fully parallelized during training and optimization.
In our study, we employ the FairSeq model. The FairSeq models are pre-trained with the pseudo-labeled data, and fine-tuned with the manually labeled data delivered in CGED. The inputs to the FairSeq models are as simple as Chinese characters and the POS tags of the characters. The POS tags are produced by LTP (Che et al., 2010). We use the default settings of FairSeq, except that the character embeddings have 512 dimensions. The embeddings are randomly initialized, and we do NOT use any other resources.
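Since the model consumes characters while LTP tags words, the word-level POS tags presumably have to be projected onto every character of the corresponding word. A minimal sketch of such a projection (the projection scheme itself is our assumption, not stated above):

```python
def char_level_inputs(words, pos_tags):
    """Flatten a word-segmented sentence into characters, copying each
    word's POS tag onto every character it contains, to build the
    character/POS input pairs fed to the seq2seq model."""
    chars, char_pos = [], []
    for word, tag in zip(words, pos_tags):
        for ch in word:
            chars.append(ch)
            char_pos.append(tag)
    return chars, char_pos
```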

Evaluation on Corrections
As shown in Table 1, we have four basic system configurations, which differ in their use of the pseudo corpus and of POS tags. The evaluation in Table 1 reveals that the pseudo data improves both precision and recall in the correction of word selection and missing errors, whereas the POS tags do not make a significant contribution.
In real scenarios of grammatical error diagnosis, the evaluation metrics of precision, recall, and F1 are not equally important. A teacher would always prefer a grammatical error correction system with high precision, even at low recall, to a system that returns a lot of noise. Aware of the importance of precision in practice, we further use ensembles to boost it. The tag "(>1)" indicates that a correction has been confirmed by at least two basic systems; "(>2)", by at least three. The ensembled systems steadily achieve a precision greater than 50%, with a recall greater than 8%. These performances are much higher than the best among the CGED 2018 submissions, whose precision is 29.32% and recall is 1.58%.
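The "(>1)" voting scheme can be sketched as follows, assuming each basic system emits its corrections as a set of (position, error type, correction) tuples (this representation is our assumption):

```python
from collections import Counter

def ensemble_corrections(system_outputs, min_votes=2):
    """Keep only the corrections proposed by at least `min_votes` basic
    systems -- the "(>1)" scheme, which trades recall for precision.
    Each system output is a set of (position, error_type, correction) tuples."""
    votes = Counter()
    for output in system_outputs:
        votes.update(output)
    return {correction for correction, n in votes.items() if n >= min_votes}
```

Raising `min_votes` from 2 to 3 yields the stricter "(>2)" variant.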
The official submission of our team to CGED 2018 is the result of an ensemble of systems 3 and 4, whose results are simply merged.

Figure 1: Impacts of Pseudo Data
We also evaluated the systems on the detection of errors, and on the identification of error types and positions. Figure 2 shows a detailed analysis of the precision of the identification of error positions for all four types of errors. It reveals that the current pseudo data has a positive impact on the precision for all error types except word ordering errors, which indicates that the word ordering pseudo data has much room for improvement. Figure 1 shows that identifying the positions of these errors poses different levels of difficulty to the systems. While the ensembled systems are proficient in handling word ordering errors, they have the most difficulty with redundant errors. Table 2 shows that the ensembled system 1+3 (>1) achieves a False Positive Rate (FPR) of 4.48% and a precision of 86.56% in the detection of erroneous sentences, better than the best FPR of 4.99% and the best precision of 82.76% among the CGED 2018 submissions, respectively.

Conclusion and Future Work
In CGED 2018, we employ sequence-to-sequence learning to model the task of grammatical error correction. We adopt a semi-supervised approach to break through the bottleneck of the very limited size of the manually labeled data. Specifically, we adapt correct sentences written by native Chinese speakers to generate pseudo grammatical errors of the kind made by learners of Chinese as a second language. The pseudo data are used to pre-train the model and give rise to improvements in both precision and recall. Aware of the importance of precision in a grammatical error correction system in real scenarios, we use ensembles to boost precision. The use of pseudo data has a positive impact on the identification of missing errors, redundant errors, and word selection errors.
In future work, we will use multi-task learning to jointly optimize the four tasks together (Luong et al., 2015). In addition, we will investigate more sophisticated techniques for generating pseudo data.