Chinese Grammatical Error Diagnosis Based on Policy Gradient LSTM Model

Chinese Grammatical Error Diagnosis (CGED) is a natural language processing shared task of the NLPTEA 2018 workshop, held during ACL 2018. The goal of the task is to diagnose Chinese sentences containing four kinds of grammatical errors and to locate those errors. A Chinese grammatical error diagnosis system is an important tool that can help learners of Chinese automatically diagnose grammatical errors in many scenarios. However, owing to the characteristics of the Chinese language and the limitations of the available datasets, traditional models suffer from an extreme imbalance between positive and negative samples and from vanishing gradients. In this paper, we propose a sequence labeling method based on a Policy Gradient LSTM model and apply it to this task to address these problems. The results show that our model achieves higher precision at a lower false positive rate (FPR), and that the model is convenient to optimize online.


Introduction
In English and many other languages, the space is a good approximation of a word delimiter: spaces separate a sentence into words. Unlike English, written Chinese has no such separator. A sentence consists of adjacent Chinese characters, and sentences, but not words, are delimited. This makes it difficult for a machine, or for a learner without a foundation in Chinese, to analyze Chinese grammar, because the first obstacle is Chinese word segmentation (Xue, 2003). Compared with English, Chinese has neither singular/plural inflection nor verb tense changes, and it uses more short sentences and fewer clauses. In addition, the same word may express different meanings in different contexts, which introduces ambiguity. All of these properties make learning Chinese difficult.
Most non-native learners of Chinese need professional teachers to guide them and correct their grammatical errors. However, online teaching has recently become the main channel for language learning, which calls for a system that can automatically diagnose the grammatical errors of a large number of learners and give them advice. The study of automatic Chinese grammatical error diagnosis is therefore important. The goal of Chinese Grammatical Error Diagnosis (CGED) is to build a system that can automatically diagnose errors in Chinese sentences. The errors are defined as redundant words (denoted by a capital "R"), missing words ("M"), word selection errors ("S"), and word ordering errors ("W"). Evaluation covers three levels: the detection level, the identification level, and the position level.
At present, most methods regard Chinese grammatical error diagnosis as a sequence labeling task (Settles and Craven, 2008), for example by building the sequence labeling model with a conditional random field (Lafferty et al., 2001) or with an LSTM (Hochreiter and Schmidhuber, 1997). However, the characteristics of the Chinese language lead to an obvious problem when constructing a Chinese grammatical error diagnosis model: the imbalance between positive and negative samples. For example, the correct labeling result for one sentence is "NNNNNNNNPNNNNNNPNNNNNNNNNNN", where N denotes a negative label (the position is not erroneous) and P denotes a positive label (the position is erroneous). Even in a sentence that is not very long, the proportion of positive and negative labels is seriously unbalanced. In the example above the ratio is 2:27, which is a serious problem for a Chinese grammatical error diagnosis model.
To address this problem, we propose a Policy Gradient-based model to tag Chinese sentences. Like recent work, we use an LSTM model and treat the task as a sequence labeling problem. In addition, we use the Policy Gradient method to deal with the imbalance between positive and negative samples. The results show that our method achieves better results.
This paper is organized as follows. Section 2 introduces related work. Section 3 briefly describes the CGED shared task. Section 4 presents our methodology, including data preparation, the model description, and the details of the policy gradient method. Section 5 reports the experimental settings and results. Finally, Section 6 concludes the paper and presents future work.

Related Work
The English Grammatical Error Correction task was held for two consecutive years as one of the shared tasks of the Conference on Computational Natural Language Learning (CoNLL). Researchers applied many different methods to the task and achieved good results (Tou et al., 2017). Among them, Junczys-Dowmunt and Grundkiewicz (2014) used phrase-based translation optimized for F-score with a combination of kb-MIRA and MERT, together with augmented language models and task-specific features, and obtained good results. As a general-purpose sequence model, the Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) has achieved good results on many natural language processing tasks in recent years, including text classification, machine translation, and sequence labeling. Yuan and Briscoe (2016) used an encoder-decoder model similar to neural machine translation for English grammatical error correction and achieved good results.
Compared with English, research on Chinese grammatical error diagnosis has a short history, and both datasets and effective methods are lacking. Yu and Chen (2012) used a CRF-based model to build a Chinese word ordering error detection model and obtained higher accuracy on their experimental dataset. In recent years, Chinese grammatical error diagnosis has been organized as the NLPTEA CGED shared task, and many researchers in natural language processing have studied it and proposed several effective methods (Yu et al., 2014; Lee et al., 2015, 2016). HIT proposed a CRF+BiLSTM model based on character embeddings and bigram embeddings. On the CGED-HSK dataset of the NLP-TEA-3 shared task, their system achieved the best F1 scores at all three levels.

CGED Task Description
The goal of the NLPTEA CGED task is to use a model to perform grammatical diagnosis on datasets of Chinese sentences written by Chinese as a Foreign Language (CFL) learners. The datasets contain four types of errors: redundant words (denoted by a capital "R"), missing words ("M"), word selection errors ("S"), and word ordering errors ("W"). An input sentence may contain one or more such errors, or no error at all.
The developed system should indicate which error types occur in the given sentence and the positions at which they occur. Some typical examples are shown in Table 1, which lists input and output samples of the CGED shared task. Each sentence has a unique id, each output error carries the sentence id, and the numbers in Errors give the index of the error location. Correctness is judged at three levels: the detection level, the identification level, and the position level.

Methodology
In this section, we introduce our entire process for the CGED task, including data preprocessing, model construction, and the construction of an objective function based on the Policy Gradient. As in previous work, we treat the CGED task as a sequence labeling problem: given a sentence x, our model generates a corresponding label sequence y. Each label in y is a token from a specific tag set. We use "O" as the tag of a correct character, "B-X" for the beginning position of an error of type "X", and "I-X" for the middle and ending positions of an error of type "X".
First, we introduce our data preprocessing for the CGED task, including bigram feature construction, POS annotation, and the label settings. Second, we introduce the construction of the ensemble model that combines the bigram feature, the POS feature, and the character embedding. Finally, we introduce the idea and the mathematical formulation of the objective function based on the Policy Gradient.

Data Preparation
First, we train bigram embeddings for all Chinese sentences in the dataset. We convert each original character sequence to a bigram sequence and then train the bigram embeddings with word2vec (Mikolov et al., 2013) on the resulting bigram sequences. These vectors are used to generate the input sentence features when building the model.
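The character-to-bigram conversion can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the paper: the function name `to_bigrams` and the end-of-sentence padding token `</s>` are assumptions, chosen so that every character receives exactly one bigram. The word2vec training step itself is not shown.

```python
def to_bigrams(sentence, pad="</s>"):
    """Return one bigram per character: the character joined with its successor.

    The padding token for the last character is an assumption of this sketch.
    """
    chars = list(sentence)
    padded = chars + [pad]
    return [padded[i] + padded[i + 1] for i in range(len(chars))]


# Each character of the input is paired with the character that follows it.
example = to_bigrams("我根本")
```

Keeping one bigram per character means the bigram sequence stays aligned with the character sequence and the label sequence.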
We use the part-of-speech (POS) feature to improve the performance of the system. For each Chinese sentence in the dataset, we generate a corresponding character-level POS tag sequence, in which B-pos marks the first character of a word with the POS tag pos and I-pos marks its middle and final characters.
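The expansion from word-level POS tags to character-level B-pos/I-pos tags can be sketched as below. The input format (a list of `(word, pos)` pairs from some segmenter and tagger) and the tag names `r` and `d` are hypothetical; the paper does not state which tagger or tag set was used.

```python
def char_level_pos(tagged_words):
    """Expand (word, pos) pairs into one B-pos / I-pos tag per character."""
    tags = []
    for word, pos in tagged_words:
        for i in range(len(word)):
            # The first character of a word gets B-, all later characters get I-.
            tags.append(("B-" if i == 0 else "I-") + pos)
    return tags


# Hypothetical tagger output for a two-word fragment.
pos_tags = char_level_pos([("我", "r"), ("根本", "d")])
```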
We assign each character in the sentence a separate tag that also encodes the character's position. We use "O" as the tag of a correct character, "B-X" for the beginning position of an error of type "X", and "I-X" for the middle and ending positions of an error of type "X". In the CGED task this yields 8 labels: B-W, I-R, B-R, B-M, I-S, I-W, B-S, O. After preprocessing, each sample can be represented by the structure shown in Table 2. The input of each sample during training consists of the three parts shown as input features in Table 2, and the label sequence of each sample is composed of the 8 predefined labels.
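The labeling scheme can be illustrated with a short sketch that turns annotated error spans into a character-level label sequence. It assumes that each error is given as a `(start, end, type)` triple with 1-based, inclusive character positions. That format and the function name `error_labels` are assumptions of this example, not details confirmed by the text.

```python
def error_labels(length, errors):
    """Build a B-X / I-X / O label sequence for a sentence of `length` characters.

    `errors` holds (start, end, type) triples with 1-based inclusive positions
    (an assumed input format).
    """
    labels = ["O"] * length
    for start, end, err_type in errors:
        labels[start - 1] = "B-" + err_type
        # Remaining positions of the span, if any, are marked as inside.
        for pos in range(start, end):
            labels[pos] = "I-" + err_type
    return labels


# A five-character sentence with a word selection error on characters 2-3.
labels = error_labels(5, [(2, 3, "S")])
```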

Model Description
We regard Chinese grammatical error diagnosis as a sequence labeling task and first use an LSTM to construct the sequence labeling model. The LSTM network is a variant of the recurrent neural network (RNN) and has a better ability to capture long-term dependencies. Given a sequence of input vectors $X = x_1, x_2, \ldots, x_T = \{x_t\}_1^T$, a recurrent unit $H$ computes a sequence of hidden vectors $h = h_1, h_2, \ldots, h_T = \{h_t\}_1^T$ and a sequence of output symbols $\hat{Y} = \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T = \{\hat{y}_t\}_1^T$ by iterating the recurrence over the time steps $t = 1, \ldots, T$, where the outputs are normalized with $\mathrm{softmax}(z_m) = e^{z_m} / \sum_i e^{z_i}$.
The recurrent unit $H$ represents the computation of the LSTM network. A typical LSTM network consists of input gates, forget gates, output gates, and memory cells. The input gate controls which information of the current time step is written into the memory cell, the forget gate controls which history information the memory cell forgets at the current time step, and the output gate controls which information is output as $h_t$ given the current memory cell state. Each gate consists of a sigmoid neural network layer and a point-wise multiplication operation.
In this work, the input at time step $t$ is computed from three features through a nonlinear activation function $\sigma$: $c_t$ is the character embedding, which is initialized immediately, $b_t$ is the bigram vector of the current time step, and $p_t$ is the discrete POS feature. These three simple features are combined into the input vector for time step $t$. The ensemble model is shown in Figure 1.
Table 2: A snapshot of our training data after the pre-processing (example sentence: 我根本不能了解这妇女辞职回家的现象。)
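A conventional way to write the recurrence for the hidden vectors and output symbols is sketched below. The weight matrices $W_{xh}$, $W_{hh}$, $W_{hy}$ and the bias vectors $b_h$, $b_y$ are assumed names that do not appear in the text, so this should be read as a standard form of such a model rather than as the exact equations used here.

```latex
\begin{align}
h_t &= H\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right), \\
\hat{y}_t &= \mathrm{softmax}\left(W_{hy}\, h_t + b_y\right), \qquad
\mathrm{softmax}(z_m) = \frac{e^{z_m}}{\sum_i e^{z_i}} .
\end{align}
```

Under this form, each $\hat{y}_t$ is a probability distribution over the 8 labels at position $t$.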

Policy Gradient
Deep Reinforcement Learning (DRL) is divided into value-based deep RL (Mnih et al., 2015) and policy-based deep RL (Lillicrap et al., 2015) in terms of implementation [16]. In value-based deep RL, a neural network is usually used as a Q function to estimate the return that an action can obtain in the current environment, as in the Deep Q-network (DQN). For example, Mnih et al. (2013) present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. Their model is a convolutional neural network trained with a variant of Q-learning.
Policy-based deep RL represents the policy by a deep network with weights $u$, that is, $a = \pi(a \mid s, u)$ or $a = \pi(s, u)$, where $\pi$ is the policy expressed by the neural network and $u$ denotes the learnable network parameters. The objective function $L(u)$ is defined as the total discounted reward, where $r_1, r_2, \ldots$ denote the returns obtained at each step. In this paper, each return is the return of the tagging result for one token. $\gamma \in [0, 1]$ is the discount factor, which indicates the importance of future returns. We set $\gamma = 0.9$. To make high-value actions more likely, the gradient of the stochastic policy $\pi(a \mid s, u)$ is computed using $Q^\pi$, a function value that measures the return of each action.
In this paper, a token whose tag "O" is labeled successfully receives a return of 1, and a failed tag receives a return of -1. For all other error labels ("B-W", "I-W", "B-M", ...), a successful tag receives a return of 10 and a failed tag receives a return of -10. Finally, the parameters $u$ are updated by stochastic gradient ascent. Our ensemble model is shown in Figure 2, where $Q^\pi(X)$ represents the reward after label "X" (for example, "B-R") is tagged, and $y_t$ represents the policy obtained by the network.
Finally, the output $\pi(a \mid s, u)\, Q^\pi(s, a)$ of the network is obtained from the known policy $\pi$ and reward $Q$. This output is used to calculate the policy gradient $\partial L(u) / \partial u$, and the gradient is then used to update the network parameters.
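For reference, the standard forms of a discounted-reward objective, the corresponding policy gradient, and a gradient ascent update are sketched below. They follow the usual textbook formulation of policy-based RL and are assumed, not confirmed, to match the formulation used here. The expectation operator $\mathbb{E}$ and the learning rate $\alpha$ are symbols introduced for this sketch.

```latex
\begin{align}
L(u) &= \mathbb{E}\left[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \,\middle|\, \pi(\cdot, u) \right], \\
\frac{\partial L(u)}{\partial u} &= \mathbb{E}\left[ \frac{\partial \log \pi(a \mid s, u)}{\partial u}\, Q^{\pi}(s, a) \right], \\
u &\leftarrow u + \alpha\, \frac{\partial L(u)}{\partial u} .
\end{align}
```

With $\gamma = 0.9$, a return that arrives $k$ steps later is weighted by $0.9^k$.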
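The reward scheme and the discounted sum of returns can be sketched in a few lines. Several details here are assumptions rather than facts from the text: the reward magnitude is keyed on the gold label (1 for "O", 10 for any error label), and the surrogate objective uses the log-probability form common in REINFORCE-style training instead of the product $\pi(a \mid s, u)\, Q^\pi(s, a)$ mentioned above. All function names are hypothetical.

```python
import math

GAMMA = 0.9  # discount factor stated in the text


def token_reward(gold, predicted):
    """+1/-1 for a gold "O" token, +10/-10 for a gold error label (assumed keying)."""
    magnitude = 1.0 if gold == "O" else 10.0
    return magnitude if predicted == gold else -magnitude


def discounted_return(rewards, gamma=GAMMA):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one sentence."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def surrogate_objective(chosen_probs, rewards):
    """Sum of log pi(a_t | s_t) * Q_t over tokens (REINFORCE-style, an assumption)."""
    return sum(math.log(p) * q for p, q in zip(chosen_probs, rewards))


gold = ["O", "B-R", "O"]
predicted = ["O", "O", "O"]  # the model misses the redundant-word error
rewards = [token_reward(g, p) for g, p in zip(gold, predicted)]
```

Because a missed error label costs ten times as much as a missed "O", the rare positive labels carry more weight in the objective, which is how this scheme targets the imbalance between positive and negative samples.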

Experiments
In this section, we describe the entire experimental process. We first introduce the datasets and how they are divided, and then briefly introduce the CGED evaluation method. Finally, we report the results of our proposed model on the validation dataset and on the evaluation dataset.

Dataset and criteria
To train the model, we combine the CGED2017 training set and the CGED2018 training set into one training dataset. The CGED2017 training set provides 10,449 training units with a total of 26,448 grammatical errors, categorized as redundant (5,852 instances), missing (7,010), word selection (11,591), and word ordering (1,995). The CGED2018 training set contains a total of 1,067 grammatical errors, categorized as redundant (208 instances), missing (298), word selection (87), and word ordering (474). In addition, we use the CGED2017 test set as the validation set during training. It contains a total of 4,871 grammatical errors, categorized as redundant (1,060 instances), missing (1,269), word selection (2,156), and word ordering (386).
Correctness is judged at three levels:
(1) Detection level: binary classification of a given sentence, that is, correct or incorrect, should be completely identical with the gold standard. All error types are regarded as incorrect.
(2) Identification level: this level can be considered a multi-class categorization problem. All error types should be clearly identified. A correct case should be completely identical with the gold standard of the given error type.
(3) Position level: in addition to identifying the error types, this level also judges the occurrence range of the grammatical error. That is, the system results should be perfectly identical with the quadruples of the gold standard.
The false positive rate (FPR), accuracy (Acc), precision (Pre), recall (Rec), and F1 score (F1) are measured at all levels with the help of the confusion matrix.
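The five metrics follow directly from the four cells of a confusion matrix. The sketch below uses the standard definitions. The function name `confusion_metrics` and the choice to return 0.0 when a denominator is zero are assumptions of this example, not part of the official evaluation tool.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute FPR, Acc, Pre, Rec and F1 from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Returning 0.0 for an empty denominator is a convention of this sketch.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "FPR": ratio(fp, fp + tn),
        "Acc": ratio(tp + tn, tp + fp + tn + fn),
        "Pre": precision,
        "Rec": recall,
        "F1": ratio(2 * precision * recall, precision + recall),
    }


# Illustrative counts only; these are not results from the paper.
scores = confusion_metrics(tp=6, fp=2, tn=8, fn=4)
```

The same function applies at each of the three levels. What changes is how a prediction is matched against the gold standard to count a true positive.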

Experiment results
Using the data partitioning above, we train our proposed Policy Gradient-based model on the training set until convergence, and then test the trained model on the validation set and the evaluation set.

Results on Validation Dataset
We use the model's results on the validation dataset to select the model with the best hyper-parameters. Table 4 shows the results.

Results on evaluation Dataset
We test on the final evaluation dataset, the CGED2018 test set, and the results are shown in Table 5. As we can see, our model obtains better identification-level and position-level scores while also obtaining a better detection-level score.
Our model obtains good results at all three levels. Moreover, the Policy Gradient-based model can easily be applied to online tasks, where it optimizes the network through continuous interaction while attempting to obtain the maximum reward.

Conclusion and Future Work
This paper proposes a method based on the policy gradient and applies it to the NLPTEA 2018 CGED shared task. We use the value function method of deep reinforcement learning to map the labeling results to rewards, in order to address the imbalance between positive and negative samples in Chinese grammatical error diagnosis. Moreover, our system can be applied to online optimization as easily as a deep reinforcement learning model. We verify the effectiveness of the Policy Gradient through experiments on the validation dataset and the evaluation dataset.
In the future, we hope to better solve the problem of sequence labeling with imbalanced positive and negative samples in Chinese grammatical error diagnosis through deep reinforcement learning strategies. In terms of the Policy Gradient, we hope to define reward functions that are more in line with the task requirements and to optimize the entire network. In addition, we hope to optimize the network through multiple rounds of online annotation results and to conduct further online experiments. Ultimately, the network should achieve good labeling results while also coping with the challenges posed by changes in online data.