Word Vector/Conditional Random Field-based Chinese Spelling Error Detection for SIGHAN-2015 Evaluation

In order to detect Chinese spelling errors, especially in essays written by foreign learners, a word vector/conditional random field (CRF)-based detector is proposed in this paper. The main idea is to project each word in a test sentence into a high-dimensional vector space and to examine the relationships among the words using a CRF. The results are then used to constrain the time-consuming language model rescoring procedure. Official SIGHAN-2015 evaluation results show that our system achieved reasonable performance, with accuracies of about 0.601/0.564 and F1 scores of about 0.457/0.375 at the detection/correction levels.


1 Introduction
Chinese spelling check can be treated as an abnormal word sequence detection and correction problem. Conventional approaches often rely heavily on a language model (LM) trained from a large text corpus (for example, Chinese Gigaword¹) to find potential errors and provide suitable candidate words (Bengio 2003, Wang 2013) to replace them. These approaches can usually be applied successfully to essays written by Chinese elementary or junior high school students.
However, for essays written by foreign learners, conventional LM methods may not be as helpful, because the writing behaviors of foreign learners usually differ from those of native Chinese writers. Foreign learners may embed spelling errors in rarely used word sequences, which receive low LM scores even though they are grammatically or syntactically correct. For example:
(" " should be " ")
(" " should be " ")
(" " should be " ")
They may also produce semantic errors in sentences that are grammatically and syntactically correct and therefore receive high LM scores. Errors of this type are difficult, if not impossible, to detect using only LMs trained on conventional Chinese text corpora. For example:
(" " should be " ")
To deal with these errors properly, it is necessary to understand foreign learners' writing behaviors. Therefore, this paper focuses on how to learn these behaviors automatically. Our main idea is to transform the problem into a machine learning task: vector representations of the words are first constructed, and a CRF-based approach is then adopted to detect the errors.
¹ http://www.aclclp.org.tw/use_asbc_c.php

2 Overview of the proposed system
The block diagram of our system is shown in Fig. 1. There are four main components: (1) a frontend of misspelling correction rules, (2) a CRF-based parser, (3) a word vector/CRF-based spelling error detector and (4) a 120K tri-gram LM.
Basically, our approach uses the error detection results to guide and speed up the time-consuming LM rescoring procedure. The system iteratively exchanges potential error words with their confusable ones and examines the modified sentence using the tri-gram LM. The final goal is to produce a modified sentence with the maximum LM score. In this way, potential Chinese spelling errors can be detected and corrected.
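The rescoring loop described above can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: the toy trigram table `lm`, the confusion sets, the flat `floor` penalty for unseen trigrams and the greedy search order are all assumptions introduced for the example, and English tokens stand in for Chinese words.

```python
def trigram_logprob(sentence, lm, floor=-10.0):
    """Sum trigram log-probabilities; unseen trigrams get a flat floor (assumption)."""
    padded = ["<s>", "<s>"] + list(sentence) + ["</s>"]
    return sum(lm.get(tuple(padded[i - 2:i + 1]), floor)
               for i in range(2, len(padded)))


def rescore(sentence, suspects, confusion, lm):
    """Greedily swap suspect words with confusable ones while the LM score improves."""
    best = list(sentence)
    best_score = trigram_logprob(best, lm)
    improved = True
    while improved:
        improved = False
        for pos in suspects:  # only positions flagged by the error detector
            for cand in confusion.get(best[pos], []):
                trial = best[:pos] + [cand] + best[pos + 1:]
                score = trigram_logprob(trial, lm)
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return best, best_score


# Hypothetical toy data, for illustration only.
lm = {
    ("<s>", "<s>", "see"): -1.0,
    ("<s>", "see", "you"): -1.0,
    ("see", "you", "there"): -1.0,
    ("you", "there", "</s>"): -1.0,
}
confusion = {"their": ["there"], "there": ["their"]}
corrected, score = rescore(["see", "you", "their"], [2], confusion, lm)
print(corrected, score)
```

Restricting `suspects` to the positions flagged by the detector is what keeps the number of LM evaluations small.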
Since the details of our parser, LM modules and character replacement procedure can be found in (Wang 2013), only the newly added word vector/CRF-based error detection module is described in the following subsections.
Fig. 1: The schematic diagram of the proposed Chinese spelling checker. There are four modules: a rule-based frontend, a CRF-based parser, a tri-gram LM and a word vector/CRF-based spelling error detector. Among them, the spelling error detector is newly added for the SIGHAN-2015 evaluation.

3 Word Vector/CRF-based Spelling Error Detector
Fig. 2 shows the block diagram of the word vector/CRF-based Chinese spelling error detection module. Its two main components, word2vec and CRF, are discussed in the following subsections.

3.1 Word vector representation
The word-to-vector (word2vec) algorithm proposed by Tomas Mikolov (Mikolov 2013a, 2013b) is adopted in this paper to encode words. It uses the CBOW (continuous bag-of-words, as shown in Fig. 3) representation to project each word into a high-dimensional vector space. These representations have been shown to be capable of capturing deep linguistic information beyond surface words. Therefore, CBOW is used here to reveal the properties of, and relationships between, normal and abnormal word sequences.
Fig. 3: The CBOW word-to-vector encoding architecture, which predicts the current word from its context.
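To make the CBOW idea concrete, the following sketch computes the prediction step only: the vectors of the context words are averaged and compared with an output vector for each candidate word through a softmax. The tiny two-dimensional vectors and the names `in_vecs` and `out_vecs` are made up for illustration; the actual word2vec training procedure (Mikolov 2013a, 2013b) is not shown.

```python
import math


def cbow_predict(context, in_vecs, out_vecs):
    """Return P(word | context) under a CBOW-style model (forward pass only)."""
    dim = len(next(iter(in_vecs.values())))
    # Hidden layer: average of the input vectors of the context words.
    hidden = [sum(in_vecs[w][d] for w in context) / len(context)
              for d in range(dim)]
    # Score every candidate word by a dot product with its output vector.
    scores = {w: sum(h * v for h, v in zip(hidden, vec))
              for w, vec in out_vecs.items()}
    # Numerically stable softmax over the candidate words.
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hypothetical toy vectors, for illustration only.
in_vecs = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
out_vecs = {"x": [2.0, 0.0], "y": [0.0, 2.0]}
probs = cbow_predict(["a", "b"], in_vecs, out_vecs)
print(max(probs, key=probs.get))
```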

3.2 CRF Chinese spelling error detector
To detect potential spelling errors, the word vectors and parser outputs are combined into a feature sequence for the CRF error detector. The CRF then learns from a set of labeled samples (ground truth) to distinguish correct from incorrect word spellings. Fig. 4 shows a typical example of the feature sequence extracted from a training sample. Here each word is transformed into a 5-dimensional vector consisting of (1) the length of the word, (2) its POS tag, (3) its reduced POS tag, (4) its word class index and (5) its ground-truth label (correct or erroneous spelling).
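As an illustration of this feature layout, the sketch below builds one row per word with the five fields listed above. The `reduced_pos` mapping (first letter of the full tag), the class lookup table, the default class -1 for unknown words and the label symbols C/E are hypothetical choices made for this example; they are not specified in this paper.

```python
def reduced_pos(pos):
    """Hypothetical reduction: keep only the first letter of the full POS tag."""
    return pos[0]


def to_crf_rows(tagged_words, word_class, error_positions):
    """Build [length, POS, reduced POS, class index, label] rows, one per word."""
    rows = []
    for i, (word, pos) in enumerate(tagged_words):
        label = "E" if i in error_positions else "C"  # E = error, C = correct
        rows.append([
            str(len(word)),                  # (1) word length
            pos,                             # (2) POS tag
            reduced_pos(pos),                # (3) reduced POS tag
            str(word_class.get(word, -1)),   # (4) word class index
            label,                           # (5) ground-truth label
        ])
    return rows


def format_rows(rows):
    """One tab-separated line per word, the column layout many CRF toolkits read."""
    return "\n".join("\t".join(row) for row in rows)


# Hypothetical toy sentence, for illustration only.
rows = to_crf_rows([("ab", "Na"), ("c", "VC")], {"ab": 17}, {1})
print(format_rows(rows))
```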

4.1 System setting
Basically, the parser, the 120K tri-gram LM and the word vector representation were all trained using the Sinica Balanced Corpus version 4.0². There are in total about 4.4 billion words in the corpus. For the parser, the F-measure of word segmentation is 96.72% and 97.67% on the original and manually corrected corpora, respectively. The accuracy of the 47-type POS tagging is about 94.24%. To build the word vector representation, a window of 17 (8+1+8) words was used. Each word was first projected into a 200-dimensional CBOW vector and then clustered into one of 1024 classes.
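The 17-word window can be read as 8 words on each side of the target word. A minimal sketch of how such (target, context) pairs could be extracted is given below; the function name and the truncation of the window at sentence boundaries are assumptions of this example.

```python
def context_windows(tokens, half_window=8):
    """Yield (target, context) pairs with up to half_window words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        yield target, left + right


# Hypothetical 20-token sentence, for illustration only.
tokens = [f"w{i}" for i in range(20)]
pairs = list(context_windows(tokens))
target, context = pairs[10]
print(target, len(context))
```

Words near the beginning or end of a sentence simply receive a shorter context under this truncation rule.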
On the other hand, to build the CRF-based spelling error detector, the Bake-off 2014 and SIGHAN-2015 development corpora were used. There are in total 106,815 words in the training set, of which only 4,537 are incorrect. The test set contains 11,808 words, including 498 errors.
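The degree of class imbalance follows directly from these counts. The short calculation below derives the error rates and, as a reference point, the accuracy that a trivial detector labelling every word as correct would reach; this baseline is added here for illustration and is not part of the reported experiments.

```python
def error_rate(n_errors, n_words):
    """Fraction of words that are misspelled."""
    return n_errors / n_words


train_rate = error_rate(4537, 106815)  # training set counts from the text
test_rate = error_rate(498, 11808)     # test set counts from the text

# Accuracy of a trivial "everything is correct" detector (illustrative baseline).
train_baseline = 1.0 - train_rate
test_baseline = 1.0 - test_rate

print(f"training error rate: {train_rate:.2%}, baseline accuracy: {train_baseline:.2%}")
print(f"test error rate: {test_rate:.2%}, baseline accuracy: {test_baseline:.2%}")
```

Both sets contain roughly 4.2% erroneous words, so word-level accuracy alone says little about how well the errors themselves are detected.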

4.2 Error detection frontend results
First of all, Fig. 5 shows a typical output of the word vector/CRF-based spelling error detector. Note that the last column in Fig. 5 shows the correctness scores reported by the CRF. If the score is higher than 0.5, the corresponding word is treated as correct; otherwise a spelling error is reported. For example, the last word " " has a very low score of 0.0048 and is therefore labelled as an error. Moreover, Table 1 shows the evaluation results of the error detection frontend on the Bake-off 2014 and SIGHAN-2015 development corpora. The table shows that the detection results on the training set are quite good, but on the test set there is a serious bias issue. This may be due to over-fitting, since the numbers of correctly and incorrectly spelled word samples in the training set are unbalanced. To alleviate this difficulty, we will try to lower the detector's decision threshold for the following LM rescoring procedure to cover more hypotheses.

4.3 Overall detection and correction results
Finally, three system configurations (Run1~3) were tested to explore different LM rescoring search spaces, using three different CRF score thresholds: 0.999, 0.98 and 0.95. Among them, the search space of Run1 is very restricted and that of Run3 is much larger than the others. Table 2 shows the official evaluation results given by the SIGHAN-2015 evaluation organizer. Table 2 shows that Run1 had the lowest false positive and recall rates in both measures. On the other hand, Run3 had the highest recall rates and F1 scores but produced many more false alarms.
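How the threshold controls the size of the rescoring search space can be illustrated as below. The sketch assumes that a word is passed to LM rescoring when its error score reaches the run's threshold, and that the error score is one minus the correctness score output by the CRF; both the comparison rule and the example scores are assumptions, and only the three threshold values come from the text.

```python
RUN_THRESHOLDS = {"Run1": 0.999, "Run2": 0.98, "Run3": 0.95}


def suspect_positions(correct_scores, threshold):
    """Positions whose assumed error score (1 - correctness score) reaches the threshold."""
    return [i for i, s in enumerate(correct_scores) if 1.0 - s >= threshold]


# Hypothetical per-word correctness scores for one sentence.
correct_scores = [0.90, 0.04, 0.015, 0.0005]
search_space = {run: suspect_positions(correct_scores, t)
                for run, t in RUN_THRESHOLDS.items()}
print(search_space)
```

Under these assumptions a lower threshold flags more words, so Run3 hands the largest set of positions to the LM and Run1 the smallest.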
In summary, these results show that our approach achieved reasonable performance. However, the settings of our systems (even Run3) were still too conservative. Therefore, there is still room to lower the threshold further in order to improve the F1 scores.
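The trade-off between false alarms and recall can be made concrete with the usual confusion-matrix definitions. The helper below is a generic sketch, and the counts in the example are invented for illustration; they are not the official SIGHAN-2015 figures.

```python
def detection_metrics(tp, fp, tn, fn):
    """False positive rate, accuracy, precision, recall and F1 from raw counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented counts: a conservative and a more permissive configuration.
conservative = detection_metrics(tp=20, fp=5, tn=135, fn=40)
permissive = detection_metrics(tp=35, fp=25, tn=115, fn=25)
for name, m in (("conservative", conservative), ("permissive", permissive)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this invented example the permissive setting gains recall and F1 at the cost of a higher false positive rate, the same pattern described for Run3 relative to Run1.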

5 Conclusions
In this paper, a word vector/CRF-based Chinese spelling error detector has been added to improve our spelling check system. Evaluation results show that our systems achieved reasonable performance. In particular, configuration Run3 achieved accuracies of about 0.601/0.564 and F1 scores of about 0.457/0.375 at the detection/correction levels, respectively. Experimental results also showed that our error detector frontend suffered from a serious over-fitting problem. Besides, the time-consuming LM rescoring procedure should be replaced with a candidate word predictor (for example, the CBOW structure shown in Fig. 3). These two issues will be studied further in the future. Finally, our latest traditional Chinese parser is available online at http://parser.speech.cm.nctu.edu.tw.