Contextualized Character Representation for Chinese Grammatical Error Diagnosis

A growing number of people are learning Chinese as a second language, which makes the automatic diagnosis of Chinese grammatical errors an important challenge. In this paper, we propose a Chinese grammatical error diagnosis (CGED) model with contextualized character representation. Compared to a traditional model using LSTM (Long Short-Term Memory), our model performs better and does not need many hand-crafted features.


Introduction
With the rapid development of China, more and more non-native speakers are beginning to learn Chinese. Writing is a very important part of learning Chinese. However, Chinese differs from English in several ways, such as the absence of tense changes, which makes it difficult for many learners to find their own mistakes in writing. Traditional teaching methods cost a lot of labor and time, so it is important to establish an automatic diagnosis system for Chinese grammatical errors. This is also the purpose of this shared task.
The task of CGED2018 is to automatically diagnose grammatical errors in Chinese sentences written by second language learners. There are four error types: redundant words (denoted by a capital "R"), missing words ("M"), word selection errors ("S") and word ordering errors ("W"). Table 1 shows examples of these errors. A CGED system needs to detect the location of each error and give its type. For errors of type S and M, the model can give at most three correction candidates. In this paper, we regard the CGED task as a sequence labeling problem (Zheng et al., 2016) and propose a CGED model with contextualized character representation. This model better accounts for the different meanings that the same word or character can take in Chinese texts. The experimental results show that our model outperforms the baseline without hand-crafted features.
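To make the sequence labeling view concrete, the following sketch turns annotated error spans into one tag per character. The tag set (O for a correct character, plus R, M, S, W), the 1-based inclusive span format, and the name `spans_to_tags` are assumptions made for illustration; the exact tagging scheme used by the model is not specified here.

```python
from typing import List, Tuple

ERROR_TYPES = {"R", "M", "S", "W"}


def spans_to_tags(num_chars: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, type) error spans into per-character tags.

    Positions are assumed to be 1-based and inclusive; "O" marks a
    character that is not covered by any error span.
    """
    tags = ["O"] * num_chars
    for start, end, err_type in spans:
        if err_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {err_type}")
        if not 1 <= start <= end <= num_chars:
            raise ValueError(f"span out of range: ({start}, {end})")
        for i in range(start - 1, end):
            tags[i] = err_type
    return tags


# A 5-character sentence with a selection error on characters 2-3
# and a redundant character at position 5 (made-up annotations).
tags = spans_to_tags(5, [(2, 3, "S"), (5, 5, "R")])
print(tags)
```

A sequence labeler is then trained to predict this tag sequence from the character sequence, and predicted runs of identical non-O tags are read back as error spans.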

Character Embedding
Words are the smallest units of semantic expression in Chinese texts, and the same word may express different meanings in different contexts. The same holds for single characters. For example, the character "打" means "dozen" in the word "一打" (a dozen) but "play" in the word "打鼓" (play the drum). Using the same character vector to represent a character in every context is therefore inaccurate, and can sometimes cause a large semantic deviation. To address this ambiguity, we propose to use a contextualized character representation for CGED. Choi et al. (2016) suggest that each dimension of a word vector may represent some semantic information of the word. Different texts call for different parts of this information, so the unneeded parts should be ignored. In other words, depending on the context, some dimensions of the word embedding vector should be masked out. We adopt the method of Choi et al. (2016) in our model.

Building Contextualized Character Representation
Here, x_t denotes the character representation at time step t, T denotes the text representation, and M is the maximum sequence length of the sentence. NN_ξ : R^{CE} → R^{TE} is a feedforward neural network parametrized by ξ, where CE is the character embedding size and TE is the text representation size. We then use T to compute contextualized character vectors, which replace the traditional character vectors as the input to the LSTM sequence labeling model.
Here, σ is the sigmoid activation function, which bounds the output between 0 and 1, W_m is the weight used to compute the mask, and b_m is the bias. ⊙ denotes element-wise multiplication.
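The masking step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the exact model: it assumes, following Choi et al. (2016), that T is the average of NN_ξ(x_t) over time steps, that NN_ξ is a single tanh layer, and that one mask is shared by all characters of the sentence. The names `contextualize`, `W_ff` and `b_ff` are hypothetical, and padding is ignored.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def contextualize(X, W_ff, b_ff, W_m, b_m):
    """Mask character embeddings with a sentence-level context vector.

    X: (M, CE) character embeddings of one sentence.
    Returns the (M, CE) masked embeddings and the (CE,) mask.
    """
    H = np.tanh(X @ W_ff + b_ff)    # (M, TE): NN_xi per time step (assumed one tanh layer)
    T = H.mean(axis=0)              # (TE,): text representation (assumed average)
    mask = sigmoid(T @ W_m + b_m)   # (CE,): one gate per embedding dimension
    return X * mask, mask           # element-wise product, broadcast over time steps


rng = np.random.default_rng(0)
M, CE, TE = 6, 8, 4                 # toy sizes, far smaller than in the experiments
X = rng.normal(size=(M, CE))
X_ctx, mask = contextualize(
    X,
    rng.normal(size=(CE, TE)), np.zeros(TE),
    rng.normal(size=(TE, CE)), np.zeros(CE),
)
print(X_ctx.shape, mask.shape)
```

Because the mask depends on the whole sentence, the same character receives a different vector in different sentences, which is the property the sequence labeler relies on.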
We use the mask to obtain the contextualized character representation, which better represents the meaning of each character and better captures the information we need from the text.
In the given Chinese texts, we find that a relatively long sentence may contain only one or two errors. Although a sentence may contain multiple errors, the total number of errors is still insufficient. Tables 2 and 3 give the number of errors in CGED2018.
After the errors are divided into four categories, each category contains few examples, which may hinder the training of the model.

Function of Save Model
We first follow the traditional practice of saving the model that achieves the highest accuracy on the development set. However, when the development set reaches its highest accuracy, the output of the model on the test set is poor. Analyzing the results, we see that the model learns more from the correct characters and less from the errors, so it judges most test sentences to be correct. We therefore propose to save the model not when the development set reaches maximum accuracy, but when Eq. 4 reaches its maximum on the development set.
In Eq. 4, p denotes the label output by the model for a character and y denotes the ground-truth label. The significance of Eq. 4 is that, when saving the model, we want it to detect more erroneous information and to ignore some correct information. In this way the model can capture more error information even when there are few errors in a sentence.

Loss Function
Saving the model with Eq. 4 lets it detect more error information, but this is still not enough. Tables 6, 7, 8 and 9 also show that the results improve, but only to a limited extent.
In the traditional LSTM model for sequence labeling, the cross-entropy loss (Eq. 7) is generally used as the loss function.
However, the number of correct characters in the dataset is still much larger than the number of incorrect characters, which may impair the training of the model. To address this issue, we add a second loss term, Eq. 8, to loss_1.
Here, we use mask_r to keep the correct positions in the training tags, forcing the model to capture more error information. The overall loss function is Eq. 11.

Correction System
The correction system in our model follows the method proposed by Chen et al. (2016). Since we mainly deal with the detection problem, we simplify their method and propose only one candidate correction.
Chen et al. (2016) compute an n-gram score for each word to judge whether the word is correct and to propose correction candidates. If the original word has the highest score, it is considered correct. If a candidate word scores higher than the original word, the original word is considered wrong, and the candidate with the highest score is taken as the correction.
SL(S) = \sum_{n} \left( n \cdot \sum_{u \in SubStr(S, n)} \log(gsf(u)) \right) \quad (12)
Eq. 12 gives the length-weighted string log-frequency score SL(S), where S is the sentence after word or character segmentation, SubStr(S, n) is the set of all substrings of S with n words or characters, and gsf(u) is the frequency of the substring u. Matching a higher-order n-gram is preferable to matching a lower-order one, so to increase the accuracy of correction, Chen et al. (2016) weight each n-gram by its length to favor higher-order n-grams.
We use this score for errors of type S. To reduce the amount of computation, we keep only the 2-gram and 3-gram terms. An example of word n-grams is shown in Table 4.
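A length-weighted log-frequency score restricted to 2-grams and 3-grams can be sketched as below. This is an illustrative reading rather than the exact scorer: it assumes the length weight is a multiplication by n, that n-grams absent from the frequency table contribute nothing, and that frequencies are stored in a plain dictionary keyed by token tuples. The names `sl_score` and `gsf` (as a dict) and the toy counts are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def substrings(tokens: Sequence[str], n: int) -> List[NGram]:
    """All contiguous n-token substrings of the sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sl_score(tokens: Sequence[str], gsf: Dict[NGram, int],
             orders: Sequence[int] = (2, 3)) -> float:
    """Sum n * log(frequency) over all n-grams of the given orders."""
    score = 0.0
    for n in orders:
        for u in substrings(tokens, n):
            freq = gsf.get(u, 0)
            if freq > 0:                 # unseen n-grams are skipped (assumption)
                score += n * math.log(freq)
    return score


# Toy frequency table with made-up counts.
gsf = {("a", "b"): 10, ("b", "c"): 100, ("a", "b", "c"): 10}
score = sl_score(["a", "b", "c"], gsf)
print(score)
```

With these toy counts the two 2-grams contribute 2 · (log 10 + log 100) and the single 3-gram contributes 3 · log 10, so a matched 3-gram weighs more than a 2-gram of the same frequency.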
When an error of type S is a word, we calculate the SL score of the word. We use the dictionaries of characters with similar pronunciation and similar shape from Wu et al. (2013), convert the characters into simplified Chinese, and merge the two dictionaries into one dictionary of candidate characters. When choosing a replacement word, we prefer words that differ from the original word in only one character. We replace each character in the word in turn and calculate the score of each result separately. The candidate word with the highest score is selected as the correction.
When an error of type S is a single character, we calculate the SL score for the character. The candidate dictionary is used directly to replace the character, and the score of each replacement is calculated. The character with the highest score is considered correct.
For errors of type M, we also calculate the SL score using 2-grams and 3-grams. We first search the word dictionary for words that contain the character labeled M, and then calculate the score of each candidate. The candidate with the highest score is taken as the correction.
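The character-level substitution for S-type errors can be sketched as a search over a confusion dictionary. The sketch below is a simplified illustration: the function name `best_replacement`, the toy bigram scorer, and the placeholder symbols and counts are all hypothetical, and any sentence scorer (such as an SL-style score) could be passed in as `score`.

```python
import math
from typing import Callable, Dict, List, Sequence


def best_replacement(chars: Sequence[str], position: int,
                     candidates: Dict[str, List[str]],
                     score: Callable[[Sequence[str]], float]) -> str:
    """Return the character at `position` that maximizes the sentence score.

    The original character is kept unless a candidate scores strictly higher.
    """
    best_char = chars[position]
    best_score = score(chars)
    for cand in candidates.get(chars[position], []):
        trial = list(chars)
        trial[position] = cand
        trial_score = score(trial)
        if trial_score > best_score:
            best_char, best_score = cand, trial_score
    return best_char


# Toy bigram counts over placeholder symbols (made-up data).
BIGRAM_COUNTS = {("a", "x"): 1, ("a", "y"): 50, ("y", "b"): 50, ("a", "z"): 5}


def toy_score(chars: Sequence[str]) -> float:
    """Sum of log bigram counts; unseen bigrams are skipped."""
    total = 0.0
    for i in range(len(chars) - 1):
        count = BIGRAM_COUNTS.get((chars[i], chars[i + 1]), 0)
        if count > 0:
            total += math.log(count)
    return total


choice = best_replacement(["a", "x", "b"], 1, {"x": ["y", "z"]}, toy_score)
print(choice)
```

Keeping the original character on ties mirrors the rule that the original is considered correct when no candidate scores higher.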

Baseline
In this experiment, we build a Bi-LSTM model for sequence labeling as our baseline. Unlike in traditional sequence labeling, the grammatical errors present in learner text can make word segmentation inaccurate, so we use character embeddings instead of word embeddings.

Hyper-parameter and Data
We use word2vec to pretrain our character embeddings on a wiki corpus, which we also use to build our n-gram dictionaries. The character embedding size is 400 and the Bi-LSTM has 256 hidden units. The batch size is 32. We train the model with the Adam optimizer and a learning rate of 0.001.
The training data comes from NLPTEA2016 and NLPTEA2018, and we set aside part of the NLPTEA2016 data as the development set. We use two test sets, from NLPTEA2016 and NLPTEA2017. Table 5 shows the data in detail.

Evaluation Method
According to Lee et al. (2016), the evaluation covers three levels: detection level, identification level and position level. This year a correction level is added.
Detection level: Determines whether a sentence is correct or not. If it contains any error, the sentence is incorrect. In other words, sentences are classified into two categories.
Identification level: For a given type of error, the system output should be exactly the same as the gold standard. This can be considered a multi-class classification problem.
Position level: The system results should be perfectly identical to the quadruples of the gold standard.
Correction level: Correction candidates must be given for characters marked as S and M. The model can recommend at most 3 corrections for each error.
The following metrics are measured at the detection, identification and position levels.
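The metrics reported in the result tables (false positive rate, accuracy, precision, recall and F1) follow the usual confusion-matrix definitions. The sketch below computes them at the detection level, where each sentence is either erroneous or correct; the function name `detection_metrics`, the boolean input format, and the convention of returning 0.0 for an empty denominator are assumptions for illustration.

```python
from typing import Dict, Sequence


def detection_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Detection-level metrics; True means "the sentence contains an error"."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0   # empty denominator -> 0.0 (assumption)

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "fpr": ratio(fp, fp + tn),         # correct sentences flagged as erroneous
        "accuracy": ratio(tp + tn, len(gold)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


metrics = detection_metrics(
    gold=[True, True, False, False],
    pred=[True, False, True, False],
)
print(metrics)
```

A model that labels almost every sentence as correct can reach high accuracy with very low recall on such imbalanced data, which is why the false positive rate and F1 are reported alongside accuracy.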

Result
In this part, we show our experimental results on the CGED2016 test set. The results on the CGED2017 dataset are similar and are therefore not given. The first part of Tables 6, 7, 8 and 9 compares the original model with the model that uses the new save function and with the model that uses the reconstruction loss. The γ of the model with the reconstruction loss is set to 0.5. The experiments show that both modifying the save function and rebuilding the loss function clearly improve the error detection of the model. The results of combining the two methods are also given. Error detection improves, but too many errors are detected and the correct information is ignored. We therefore adjust the weight γ in Eq. 11 to obtain a more reasonable model.
The second part of Tables 6, 7, 8 and 9 shows models with the new save function and the reconstruction loss for different values of γ in Eq. 11. As the weight decreases, the false positive rate decreases significantly, which indicates that the model captures more correct information. A γ of 0.2 or 0.1 is more suitable for our task. When the weight is too large, the false positive rate is too high, meaning that many correct sentences are flagged as erroneous, which is not consistent with the objectives of this task. At 0.05, the F1 values at all levels are too low, so we use 0.1 as the weight in the following experiments.
Table 7: Results on detection level: ACC represents accuracy. Pre means precision. Re is recall.
The third part of Tables 6, 7, 8 and 9 shows the experimental results of our proposed model with the new save function and the reconstruction loss. γ is set to 0.1. The F1 results show that the proposed model improves over the baseline model. The model can also detect error information well without hand-crafted features. We also tried adding hand-crafted information to the model to improve the results, namely POS (part-of-speech) information. Since we work with characters, we use the POS of the word containing a character as the POS of that character. The results show that POS is useful in Chinese error detection: for errors, POS may provide information that helps the model detect them better. Tables 10, 11, 12 and 13 show the detection results we submitted to CGED2018. Table 14 shows our correction results in CGED2018. Since our model proposes only one candidate, the results on Correction and Top3 Correction are the same.

Related Work
The Chinese grammatical error diagnosis task has been studied for a long time. From the initial statistical methods to current machine learning approaches, it has received more and more attention.

Table 8: Results on identification level: ACC represents accuracy. Pre means precision. Re is recall.
Zhang et al. (2000) searched for the optimal string among all possible derivations of the input sentence, using character substitution, insertion and deletion operations with a traditional word 3-gram language model. Chen et al. (2013) still used n-grams as the main method and added Web resources to improve detection results. Lin and Chu (2015) used n-grams to build a scoring system that gives better correction options. Yeh et al. (2017) built on n-grams and used the KMP algorithm to speed up the search for correction candidates.
With the continuous rise of machine learning in recent years, the field of natural language processing has increasingly turned to it, and the diagnosis of Chinese grammatical errors has followed. Grammatical error detection is usually treated as a sequence labeling task (Zheng et al., 2016). Huang and Wang (2016) used Bi-LSTM to annotate the errors in a sentence. Shiue et al. (2017) combined machine learning with traditional n-gram methods, using Bi-LSTM to detect the location of errors and adding linguistic information such as POS and n-grams. ... the probability of each character, and used two strategies to decide whether a character is correct or not. Liao et al. (2017) used an LSTM+CRF model to capture dependencies between outputs and thus better detect errors. Yang et al. (2017) added more linguistic information to the LSTM+CRF model, such as POS, n-gram, PMI score and dependency features.

Conclusion
As more and more people learn Chinese, the automatic diagnosis of Chinese grammatical errors is becoming increasingly important. This paper proposes a contextualized character representation for CGED, together with solutions to the error sparsity problem, which improve on the baseline approach.
In the future, we will add this contextualized character representation to models that are better at Chinese grammatical error diagnosis, such as Bi-
Table 13: Results on position level in CGED2018: Pre means precision. Re is recall.