Detecting Simultaneously Chinese Grammar Errors Based on a BiLSTM-CRF Model

In the process of learning and using Chinese, many learners of Chinese as a foreign language (CFL) make grammatical errors due to negative transfer from their native languages. This paper introduces our system, which can simultaneously diagnose four types of grammatical errors, namely redundant (R), missing (M), selection (S), and disorder (W), in the NLPTEA-5 shared task. We propose a bidirectional LSTM-CRF neural network (BiLSTM-CRF) that combines a BiLSTM and a CRF, without hand-crafted features, for Chinese Grammatical Error Diagnosis (CGED). Evaluation covers three levels: detection, identification, and position. At the detection and identification levels, our system achieved the third-highest recall and good F1 scores.


Introduction
With the rapid development of China's economy, a "Chinese fever" has spread around the world and more foreigners are beginning to learn Chinese. Writing is an important part of learning Chinese, and grammar is the basis of writing. When writing and communicating in Chinese, learners of Chinese as a foreign language (CFL) may make grammatical errors due to negative transfer from their native languages.
Traditional learning methods for CFL rely heavily on manual work to point out grammar errors, which costs a great deal of time and labor. To reduce the workload of manual identification, it is necessary to explore effective methods for Chinese Grammatical Error Diagnosis (CGED). In the field of natural language processing, CGED is a great challenge because of the flexibility and irregularity of Chinese, and a series of CGED evaluation tasks has therefore been organized.
The CGED evaluation tasks provide a platform for researchers to study the automatic detection of Chinese grammatical errors. The CGED 2018 evaluation task defines four categories of Chinese grammatical error: redundant (R), selection (S), missing (M), and disorder (W). Table 1 gives an example sentence for each error type.
In this paper, we treat the CGED 2018 shared task as a character-based sequence labeling task. We propose a bidirectional LSTM-CRF (BiLSTM-CRF) neural network that combines an LSTM and a CRF for sequence labeling, without any hand-crafted features. First, we use a BiLSTM network to learn the information in the sentence and extract features; then we use a CRF for sequence labeling to complete Chinese grammatical error detection automatically.

The rest of this paper is organized as follows: Section 2 briefly introduces related work in this field. Section 3 introduces the model we propose. Section 4 discusses the experiments and analyzes the results, including data preprocessing, hyperparameters, and experimental results. Finally, conclusions and future work are presented.

Related Work
Automatic detection of grammatical errors is one of the most important tasks in the field of natural language processing. Researchers have already done a lot of work on English grammatical error diagnosis. For example, Helping Our Own (HOO) is a series of shared tasks on correcting textual errors (Dale and Kilgarriff, 2011; Dale et al., 2012). The CoNLL-2013 and CoNLL-2014 shared tasks (Ng et al., 2013; Ng et al., 2014) focused on grammatical error correction, and many approaches were proposed, such as N-gram language model-based methods (Hdez et al., 2014), statistical machine translation methods (Felice et al., 2014), and machine learning methods (Wang et al., 2014).
Compared with English, the study of Chinese grammatical error diagnosis started later. Researchers have likewise proposed many methods, such as statistical learning methods (Chang et al., 2012), rule-based methods (Lee et al., 2013), and hybrid methods.
However, due to the lack of corpora and the limitations of technology, research progress has been greatly limited. The CGED shared tasks (Yu et al., 2014; Lee et al., 2015, 2016; Rao et al., 2017) provided researchers with a good platform to present their work. In the CGED 2016 shared task, a CRF-based model achieved good precision (Liu et al., 2016), and a model based on CRF+LSTM obtained good results (Zheng et al., 2016). In CGED 2017, researchers used features such as part of speech, collocation words, and N-grams, proposed BiLSTM+CRF models trained separately for each error type, analyzed errors through model fusion, and made great progress on CGED (Xie et al., 2017; Liao et al., 2017).
In this paper, we propose a bidirectional LSTM-CRF neural network (BiLSTM-CRF) for CGED. Our approach has the following characteristics: (1) Unlike previous methods that train a model for each error type, our system trains only one model for all error types and predicts multiple error types at the same time.
(2) Our model captures sentence-level features based on the powerful long-term memory ability of BiLSTM and uses CRF for sequence labeling.
(3) The model learns only from the character sequence itself, without any hand-crafted features.

Model
In this paper, we treat Chinese grammatical error diagnosis as a character-level sequence labeling task, with the tag set R (Redundant), S (Selection), M (Missing), W (Word Order), and C (Correct). The BiLSTM-CRF model presented in this paper is shown in Figure 1; it consists of an Embedding Layer, a BiLSTM Layer, and a CRF Layer.
(1) Embedding Layer: transforms each word index into a word vector.
(2) BiLSTM Layer: learns the information of each word and extracts features from the sentence.
(3) CRF Layer: performs sequence labeling over the BiLSTM features, modeling the dependencies among adjacent tags.

Embedding Layer
The Embedding Layer transforms words into distributed representations that capture their syntactic and semantic meanings. We therefore use word embeddings to represent the words in the sentence.
Given a sentence S, we describe it as S = (w_1, w_2, …, w_{n-1}, w_n), a sequence of words, each drawn from a vocabulary V. Words are represented by distributional vectors w_i ∈ R^d drawn from a word embedding matrix W ∈ R^{d×|V|}. After the Embedding Layer, we obtain X = (x_1, x_2, …, x_{n-1}, x_n).
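As a purely illustrative sketch (not the authors' code), the lookup performed by the Embedding Layer amounts to replacing each word index with a row of the matrix W; the toy vocabulary and dimension d = 4 below are assumptions:

```python
import random

# Hypothetical illustration of the Embedding Layer: each character index
# is mapped to a d-dimensional vector via a lookup table W of size |V| x d.
random.seed(0)

def build_embedding_matrix(vocab_size, dim):
    """Randomly initialised embedding matrix W (|V| rows, d columns)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(indices, W):
    """Replace each word index with its corresponding row of W."""
    return [W[i] for i in indices]

vocab = {"我": 0, "朋": 1, "友": 2}          # toy vocabulary (assumption)
W = build_embedding_matrix(len(vocab), 4)    # d = 4 for illustration
X = embed([vocab[c] for c in "我朋友"], W)
print(len(X), len(X[0]))                     # 3 vectors of dimension 4
```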

BiLSTM Layer
Due to the powerful long-term memory ability of the LSTM, LSTM-based neural networks with access to both past and future contexts have proven effective for sequence labeling. The hidden states of a bidirectional LSTM capture both past and future context information and support sequence labeling for each token. Basically, an LSTM unit is composed of three multiplicative gates that control the proportions of information to forget and to pass on to the next time step: an input gate with weight matrices W_{xi}, W_{hi}, W_{ci} and bias b_i; a forget gate with weight matrices W_{xf}, W_{hf}, W_{cf} and bias b_f; and an output gate with weight matrices W_{xo}, W_{ho}, W_{co} and bias b_o, together with the cell weights W_{xc}, W_{hc} and bias b_c. Formally, the formulas (1) to update an LSTM unit at time t are:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t ⊙ tanh(c_t)                                        (1)

where σ is the element-wise sigmoid function and ⊙ is the element-wise product, x_t is the input vector at time t, and h_t is the hidden state vector storing all the useful information at (and before) time t.
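To make formulas (1) concrete, here is a minimal scalar (d = 1) sketch of one peephole-LSTM update in plain Python; the weight values are arbitrary assumptions chosen only for illustration:

```python
import math

# One LSTM step following formulas (1). All weights are scalars (d = 1)
# purely to keep the sketch readable; real layers use matrices.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; p maps weight/bias names to scalar values."""
    i_t = sigmoid(p["Wxi"] * x_t + p["Whi"] * h_prev + p["Wci"] * c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] * x_t + p["Whf"] * h_prev + p["Wcf"] * c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * math.tanh(p["Wxc"] * x_t + p["Whc"] * h_prev + p["bc"])
    o_t = sigmoid(p["Wxo"] * x_t + p["Who"] * h_prev + p["Wco"] * c_t + p["bo"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

params = {k: 0.5 for k in ("Wxi", "Whi", "Wci", "bi", "Wxf", "Whf", "Wcf", "bf",
                           "Wxc", "Whc", "bc", "Wxo", "Who", "Wco", "bo")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):       # run the cell over a short input sequence
    h, c = lstm_step(x, h, c, params)
print(round(h, 4))
```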
Mathematically, the input of the BiLSTM Layer is the sequence X = (x_1, x_2, …, x_{n-1}, x_n) of word vectors from the Embedding Layer. The output of the BiLSTM Layer is a sequence of hidden states, one for each input vector, denoted h = (h_1, h_2, …, h_{n-1}, h_n). Each final hidden state is the concatenation of the forward and backward hidden states: h_t = [→h_t ; ←h_t].
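The concatenation step can be sketched as follows (the hidden-state values are placeholders, not real model outputs):

```python
# Concatenate forward and backward hidden states position by position:
# h_t = [forward h_t ; backward h_t].
def concat_states(forward, backward):
    """forward[t] and backward[t] are hidden-state vectors (plain lists)."""
    return [f + b for f, b in zip(forward, backward)]

fwd = [[0.1, 0.2], [0.3, 0.4]]     # forward states for a 2-token input
bwd = [[0.5, 0.6], [0.7, 0.8]]     # backward states for the same input
h = concat_states(fwd, bwd)
print(h[0])                        # [0.1, 0.2, 0.5, 0.6]
```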

CRF Layer
Since natural language sentences obey many syntactic constraints, the relationships among adjacent tags are very important for the CGED shared task. If we simply fed the hidden states of the BiLSTM Layer directly into a softmax layer for tag prediction, those constraints could be violated, and the correlations among adjacent tags would be difficult to model. The conditional random field (CRF) is the most commonly used method for structured prediction; its basic idea is to use a series of potential functions to approximate the conditional probability of the output label sequence given the input word sequence. We take the sequence of hidden states from the BiLSTM Layer, h = (h_1, h_2, …, h_{n-1}, h_n), as the input to the CRF Layer. The output of the CRF Layer is our final predicted label sequence y = (y_1, y_2, …, y_{n-1}, y_n), where each y_t belongs to the tag set and y ranges over the set of all possible label sequences. From the hidden state sequence, the conditional probability of the output sequence is

p(y | h; W, b) = ∏_{t=1}^{n} ψ_t(y_{t-1}, y_t, h) / Σ_{y'} ∏_{t=1}^{n} ψ_t(y'_{t-1}, y'_t, h),        (2)

where ψ_t(y', y, h) = exp(W_{y',y}^T h_t + b_{y',y}); W and b are the weight matrices, and the subscripts indicate that we extract the weight vector and bias for the given label pair (y', y). To train the CRF Layer, we use classical maximum conditional likelihood estimation; over a training set {(h^(i), y^(i))}, the final log-likelihood of the weights is

L(W, b) = Σ_i log p(y^(i) | h^(i); W, b).        (3)

Finally, the Viterbi algorithm is used to decode the optimal output sequence at prediction time.

Experiments and Results Analysis
In this paper, based on the CGED series of evaluations, we adopted the datasets of the CGED 2016 and CGED 2018 shared tasks as our training data; we then manually deleted some incorrect sentences from the training set and rebuilt the dataset. The CGED 2017 test set was selected as the validation set, and the CGED 2018 test set was used as the test set. We selected the BiLSTM-CRF model for the CGED 2018 shared task. This section covers data preprocessing, parameter settings, and results analysis on the validation and test sets.

Data Preprocessing
Since the CGED evaluation task involves identifying incorrect boundary positions, word segmentation may cause misalignment between word boundaries and the corresponding error intervals. It may also cause overlaps among multiple error types. Therefore, in this paper we use characters as the unit for Chinese grammatical error diagnosis. Unlike previous methods that trained a model for each error type, our system trains only one model, which can identify all four types of errors simultaneously.
Following the data preprocessing method of Liu et al. (2016), we extracted correct and incorrect sentences from the corpus according to the manual annotation, and then marked each character with its corresponding label: redundant (R), missing (M), selection (S), disorder (W), or correct (C). Some preprocessing examples are shown in Table 2.
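The labeling step described above can be sketched as follows. The 1-based, inclusive span convention here is an assumption of this sketch, chosen to match the "可看/S" example discussed later, not a statement about the actual annotation format:

```python
# Sketch of character-level labelling: given (start, end, type) error
# annotations over a sentence, tag characters inside each span with the
# error type and everything else with C (correct).
def label_sentence(sentence, errors):
    tags = ["C"] * len(sentence)
    for start, end, etype in errors:      # spans assumed 1-based, inclusive
        for i in range(start - 1, end):
            tags[i] = etype
    return list(zip(sentence, tags))

# "可看" (positions 9-10) is a selection (S) error:
pairs = label_sentence("我朋友的努力真是可看的。", [(9, 10, "S")])
print(pairs)    # 可/S 看/S, all other characters tagged C
```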

Parameter Settings
In this paper, word vectors are randomly initialized, and the word vector dimension is 50. The optimized parameters are as follows:
· Word vector dimension: 50
· Hidden size: 50
· Adam learning rate: 0.001
· Epochs: 300
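Gathered as a configuration dict for reference (the key names are assumptions of this sketch, not the authors' actual code):

```python
# Hyperparameters from the paper, collected in one place.
CONFIG = {
    "embedding_dim": 50,      # word vector dimension
    "hidden_size": 50,        # BiLSTM hidden size
    "optimizer": "adam",
    "learning_rate": 0.001,
    "epochs": 300,
}
print(CONFIG)
```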

Experiments Results
In this paper, we conduct experiments with two different models: a CRF model (M1) and the BiLSTM-CRF model (M2).

CRF model: The CRF model adds a variety of grammatical features such as bigram and trigram features. The choice of features directly affects the performance of the model; this experiment adopts a feature window of length 7 and uses bigram and trigram templates to extract features.

BiLSTM-CRF model: The BiLSTM-CRF model combines a BiLSTM and a CRF for sequence labeling. First, the BiLSTM network learns information in the sentence and extracts features; then the CRF performs sequence labeling to complete the CGED task automatically.
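The window-7 unigram/bigram/trigram feature extraction for the CRF baseline could look roughly like the sketch below; the template names and padding symbols are hypothetical, since the exact templates of the system are not specified:

```python
# Hypothetical n-gram feature templates for a character-level CRF:
# unigram, bigram and trigram features from a window of length 7
# (3 characters on each side of the current position).
def ngram_features(chars, i, window=3):
    """Features for position i from chars[i-window .. i+window]."""
    padded = ["<B>"] * window + list(chars) + ["<E>"] * window
    j = i + window                      # index of chars[i] in padded
    feats = []
    for off in range(-window, window + 1):          # 7 unigrams
        feats.append(f"U[{off}]={padded[j + off]}")
    for off in range(-window, window):              # 6 bigrams
        feats.append(f"B[{off}]={padded[j + off]}{padded[j + off + 1]}")
    for off in range(-window, window - 1):          # 5 trigrams
        feats.append(f"T[{off}]={padded[j + off]}"
                     f"{padded[j + off + 1]}{padded[j + off + 2]}")
    return feats

f = ngram_features("我朋友的努力", 2)   # features for "友"
print(len(f))                           # 7 + 6 + 5 = 18 features
```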
The results on the validation set: The validation set used in this paper is the test set of the CGED 2017 shared task. The two models were evaluated on the validation set; the results are shown in Table 3.
From Table 3, we can see that the CRF model has a lower false positive rate (FPR) than the BiLSTM-CRF model and achieves better precision at the detection and identification levels, because the CRF model has access to more feature information, such as bigrams and trigrams. However, neither the CRF model nor the BiLSTM-CRF model performs well at the position level; we believe both fall short in identifying position boundaries. In future work, we will focus on the position level by adding character position features.
The results on the test set: The test set is that of the CGED 2018 shared task, on which we submitted only one run. Table 4 lists our submitted result (Run1) and the test result of the CRF model.
At the error detection and identification levels, our system achieves the third-highest recall and a good F1 value. However, it performs poorly at the position level and on FPR. Since our system recognizes four types of errors simultaneously, which increases the difficulty of recognition, it more easily labels a correct sentence as erroneous, resulting in a worse FPR on the test set. In addition, because our system works at the character level, the lack of word collocation information also lowers position-level performance, even though the BiLSTM network has a powerful long-term memory. Another reason for the low position-level performance is that the tags do not distinguish between positions. For example:
Error: 我/C 朋/C 友/C 的/C 努/C 力/C 真/C 是/C 可/S 看/S 的/C 。/C
Correction: 我朋友的努力真是有效的。

(My friend's efforts are really effective)
In this sentence, "可看" should be corrected to "有效". Since the two "/S" tags are not distinguished from each other, we believe this contributes to the low position-level performance.

Conclusion
On the basis of the CGED series of evaluation tasks, this paper proposes a BiLSTM-CRF neural network model for Chinese grammatical error detection. It performs well at the detection and identification levels, with an especially high recall, but poorly at the position level. In future work, we will add external features, such as part of speech, character position, and collocation features, to improve the performance of our system.