NTOU Chinese Spelling Check System in Sighan-8 Bake-off



Introduction
Automatic spell checking is a basic and important technique in building NLP systems. It has been studied since the 1960s, when Blair (1960) and Damerau (1964) made the first attempts to solve the spelling error problem in English. Spelling errors in English can be grouped into two classes: non-word spelling errors and real-word spelling errors.
A non-word spelling error occurs when the written string cannot be found in a dictionary, such as in fly fron* Paris. The typical approach is finding a list of candidates from a large dictionary by edit distance or phonetic similarity (Mitton, 1996; Deorowicz and Ciura, 2005; Carlson and Fette, 2007; Chen et al., 2007; Mitton, 2008; Whitelaw et al., 2009).
A real-word spelling error occurs when one word is mistakenly used for another word, such as in fly form* Paris. Typical approaches include using confusion sets (Golding and Roth, 1999; Carlson et al., 2001), contextual information (Verberne, 2002; Islam and Inkpen, 2009), and other methods (Pirinen and Linden, 2010; Amorim and Zampieri, 2013).
The spelling error problem in Chinese is quite different. Because there is no word delimiter in a Chinese sentence and almost every Chinese character can be considered a one-character word, most errors are real-word errors.
On the other hand, there is also the illegal-character error, where a hand-written symbol is not a legal Chinese character (and is thus not collected in a dictionary). Such an error cannot occur in a digital document, because all characters in Chinese character sets such as Big5 or Unicode are legal.
There have been many attempts to solve the spelling error problem in Chinese (Chang, 1994; Zhang et al., 2000; Cucerzan and Brill, 2004; Li et al., 2006; Liu et al., 2008). Among them, lists of visually and phonologically similar characters play an important role in Chinese spelling check (Liu et al., 2011).
This bake-off is the third Chinese spelling check evaluation project. A CSC system is evaluated at two levels: error detection and error correction. The task is organized based on previous research (Wu et al., 2010; Chen et al., 2011; Liu et al., 2011; Yu et al., 2014).

NTOU CSC System Description
This year, the architecture of the NTOU CSC system mostly follows the previous version, except that three new preference rules are added. The architecture of the previous NTOU CSC system is explained as follows. Figure 1 shows the architecture of the NTOU Chinese spelling checking system. A sentence under consideration is first word-segmented. New sentences are generated by replacing spelling error candidates with their similar characters, one at a time. These new sentences are also word-segmented, and their likelihoods of being acceptable Chinese sentences are measured and ranked by n-gram language models. If the new sentence with the top-1 likelihood is better than the original sentence, a spelling error is reported.

There are six kinds of confusion sets used in this system. One of them was generated from the Four-Corner Code system, proposed by us in CSC 2014 (Chu and Lin, 2014). The other five were provided by the organizers of CSC 2013 (Wu et al., 2013): characters with the same sound in the same tone, characters with the same sound in different tones, characters with a similar sound in the same tone, characters with a similar sound in different tones, and visually similar characters. There are three cases of spelling error candidates in our system. Two of them were described in our CSC 2014 system description paper; multi-word replacement is explained in Section 3.1.
One-character word replacement: every one-character word in the original sentence is considered a spelling error candidate and is replaced with the similar characters in its confusion set. For example, "座" in Topic A2-0101-2 is a one-character word and its similar characters are 柞坐雁挫..., so the replacement proceeds as follows.
A2-0101-2, Original:
Multi-character word replacement: the method to create multi-character word confusion sets was proposed by Lin and Chu (2015). Given a multi-character word, if replacing one of its characters with a similar character yields another legal word, the two words are collected into each other's multi-character word confusion sets. The resource used to create our word confusion sets is the Revised Mandarin Dictionary published by the Ministry of Education 1 .
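The one-character word replacement step can be sketched as follows. This is an illustrative outline only: the names `CONFUSION_SETS` and `generate_candidates` are ours, and the tiny confusion set is a truncated stand-in for the six real sets used by the system.

```python
# Hypothetical sketch of one-character word replacement; the dictionary below
# holds only a toy confusion set, not the system's real resources.
CONFUSION_SETS = {
    "座": ["柞", "坐", "挫"],   # sound- or shape-similar characters
}

def generate_candidates(segmented):
    """Yield (index, new_sentence) pairs by replacing each one-character
    word with every member of its confusion set, one at a time."""
    for i, word in enumerate(segmented):
        if len(word) == 1:                        # one-character word only
            for similar in CONFUSION_SETS.get(word, []):
                yield i, segmented[:i] + [similar] + segmented[i + 1:]

cands = list(generate_candidates(["我", "座", "在", "椅子", "上"]))
```

Each generated sentence would then be re-segmented and scored by the language models before any error is reported.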

Rule-1 No error in personal names:
discard a replacement if it produces a personal name; errors are unlikely to appear in personal names. Take C1-1701-2 in the CLP Bakeoff 2014 CSC test set as an example. When the one-character word "位" is replaced by its similar character "魏", "魏產齡" is identified as a personal name, so the replacement is discarded.
C1-1701-2, Original segmented: 每 位 產 齡 婦女
Replaced and discarded: 每 魏產齡(PERSON) 婦女
Rule-2 Stopword filtering:
discard a replacement if the original character is a personal pronoun (你 'you', 我 'I', 他她它祂牠 'he/she/it') or one of the numbers from 1 to 10 (一二三四五六七八九十).
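The two filtering rules above can be sketched as a single predicate. The function and parameter names are illustrative, and `is_personal_name` stands in for the system's named-entity recognizer, which the paper does not describe.

```python
# Illustrative sketch of Rule-1 and Rule-2; names are our own invention.
STOPWORDS = set("你我他她它祂牠一二三四五六七八九十")

def keep_replacement(original_char, replaced_segmented, is_personal_name):
    # Rule-2: never replace pronouns or the numerals 1-10.
    if original_char in STOPWORDS:
        return False
    # Rule-1: discard the replacement if it produces a personal name.
    if any(is_personal_name(w) for w in replaced_segmented):
        return False
    return True

# "位" -> "魏" makes "魏產齡" a personal name, so the replacement is dropped.
ok = keep_replacement("位", ["每", "魏產齡", "婦女"], lambda w: w == "魏產齡")
```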
The n-gram preference score is defined as [P(S_new) / P(S_org)] - 1, where P(S) is the probability of a sentence S under a language model. When sorting, the word-bigram preference score has the highest priority, the word-unigram preference score the second priority, and the POS-bigram preference score the lowest priority.
If the top-1 sentence is a newly generated sentence and all of its preference scores are not lower than predefined thresholds, it is reported as an error together with the location of the replacement; otherwise, "no error" is reported. The threshold is 0.0571 for the word-bigram preference score, 0.0171 for the word-unigram score, and 0 for the POS-bigram score.
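The scoring and decision step can be written compactly. A minimal sketch, assuming the sentence probabilities are supplied externally; the threshold values come from the text, while the function names are ours.

```python
# Thresholds from the paper; the probability values are assumed inputs.
THRESHOLDS = {"word_bigram": 0.0571, "word_unigram": 0.0171, "pos_bigram": 0.0}

def preference_score(p_new, p_org):
    """[P(S_new) / P(S_org)] - 1: positive when the new sentence is likelier."""
    return p_new / p_org - 1

def report_error(scores):
    """Report an error only if every preference score clears its threshold."""
    return all(scores[k] >= THRESHOLDS[k] for k in THRESHOLDS)

scores = {"word_bigram": 0.10, "word_unigram": 0.02, "pos_bigram": 0.01}
```

Candidate sentences would be sorted by the tuple (word-bigram, word-unigram, POS-bigram) of scores, matching the priority order described above.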

Multi-word replacement
In our observation, a spelling error occurs in at least three different cases. The first case is that the error alone is identified as a one-character word. The second case is that one character in a multi-character word is misused but the wrong word is still a legal word. The third case is that the erroneous character, combined with the character to its left or right, is misidentified as a multi-character word. Take Topic 00043 in the SIGHAN7 Bakeoff 2013 CSC Datasets as an example. The error "帶" occurs in a multi-character word "膠帶", but the correct word "塑膠袋" is a longer word.
Topic 00043, Original: 外面也會包塑膠帶啦
Segmented: 外面 也 會 包 塑 膠帶 啦
Correct: 外面 也 會 包 塑膠袋 啦
To deal with such an error case, we propose a new replacement procedure: if a multi-character word is preceded or followed by a one-character word, each character in the multi-character word is substituted with its similar characters, one by one. Again, take Topic 00043 as an example. "外面" and "膠帶" are multi-character words adjacent to one-character words, so they are candidates of spelling errors. By replacing "外", "面", "膠", and "帶" with their similar characters, new sentences are generated as follows.
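The adjacency condition that triggers multi-word replacement can be sketched as follows. This is an illustrative outline, not the paper's implementation; it only enumerates which character positions become candidates.

```python
# Sketch of the multi-word replacement trigger: a multi-character word that
# neighbors a one-character word has every character tried for substitution.
def multiword_candidates(segmented):
    """Yield (word_index, char_index) positions eligible for
    similar-character substitution."""
    for i, word in enumerate(segmented):
        if len(word) < 2:
            continue
        left_one = i > 0 and len(segmented[i - 1]) == 1
        right_one = i + 1 < len(segmented) and len(segmented[i + 1]) == 1
        if left_one or right_one:
            for j in range(len(word)):
                yield i, j

sent = ["外面", "也", "會", "包", "塑", "膠帶", "啦"]
cands = list(multiword_candidates(sent))
```

On the Topic 00043 segmentation, this marks exactly the characters "外", "面", "膠", and "帶" as candidates, as in the example above.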

Preference rules
Three kinds of preference rules are proposed this year to deal with special cases: Simplified Chinese characters or variants, sentence-final particles, and DE-particles. If any rule matches, an error is reported immediately.

Rule 1: Simplified and variant Chinese character detection
Because the sentences in the datasets are written in Traditional Chinese, all Simplified Chinese characters or variants of Traditional Chinese characters appearing in the datasets are marked as errors.
A mapping table (Lin et al., 2012) from variants (including Simplified Chinese characters) to their corresponding Traditional Chinese characters is adopted to correct such errors.
Take B1-0840-2 in the CLP Bakeoff 2014 CSC Datasets as an example of Simplified Chinese character replacement, where "尔" is a Simplified Chinese character and should be replaced with its corresponding Traditional Chinese character "爾" directly.
B1-0840-2, Original: 首尔是韓國的首都
Correct: 首爾是韓國的首都
Take B1-3981-1 in the CLP Bakeoff 2014 CSC Datasets as an example of variant replacement, where "偺" is a variant of the more common Traditional Chinese character "咱", so it should be replaced directly.
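Rule 1 amounts to a direct table lookup. A minimal sketch, assuming a character-level mapping table; the two entries below come from the paper's examples, and the dictionary name is ours.

```python
# Tiny stand-in for the variant-to-Traditional mapping table of Lin et al.
# (2012); a real table would contain thousands of entries.
VARIANT_TO_TRADITIONAL = {"尔": "爾", "偺": "咱"}

def fix_variants(sentence):
    """Replace every Simplified/variant character and record its position
    (1-based, as in the bake-off answer format)."""
    out, errors = [], []
    for pos, ch in enumerate(sentence, start=1):
        if ch in VARIANT_TO_TRADITIONAL:
            errors.append(pos)
            ch = VARIANT_TO_TRADITIONAL[ch]
        out.append(ch)
    return "".join(out), errors

fixed, errs = fix_variants("首尔是韓國的首都")
```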

Rule 2: Sentence-final particle detection
In our observation, some sentence-final particles were frequently misspelled in the datasets, including "嗎", "吧", and "啊". We collected the errors in the dataset whose corrections were these particles and created the following three replacement rules:
1. If a sentence ends with the one-character word "碼" or "馬", it is replaced with "嗎".
2. If a sentence ends with the one-character word "把" or "巴", it is replaced with "吧".
3. If a sentence ends with the one-character word "阿", it is replaced with "啊".
The following examples show the application of these rules.
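The three ending rules translate directly into a lookup on the last token. A minimal sketch, assuming the final character has been segmented as a one-character word as the rules require; the function and table names are ours.

```python
# The three sentence-final particle rules as a character-to-character table.
FINAL_PARTICLE_RULES = {"碼": "嗎", "馬": "嗎", "把": "吧", "巴": "吧", "阿": "啊"}

def fix_final_particle(segmented):
    """Apply a rule only when the sentence ends with a matching
    one-character word; otherwise return the sentence unchanged."""
    if segmented and len(segmented[-1]) == 1:
        last = segmented[-1]
        if last in FINAL_PARTICLE_RULES:
            return segmented[:-1] + [FINAL_PARTICLE_RULES[last]]
    return segmented

out = fix_final_particle(["你", "好", "馬"])
```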

Rule 3: DE-particle detection
In Chinese, "的", "得", and "地" serve as function words in various cases. They are grouped together and receive the special POS tag "DE".
However, although their usages differ, they are easily confused with one another, even by native speakers.

Table 1. Replacement Rules for DE-particles (columns: Patterns, Correction)
To deal with this kind of error, we extracted the most frequently seen POS patterns from the training set. Table 1 lists the six patterns learned and used in our system. To demonstrate how these rules are applied, take B1-0184-3 in the CLP Bakeoff 2014 CSC Datasets as an example. The DE-particle "得" is followed by a common noun (whose POS is "Na") and matches the first DE-particle replacement rule in Table 1, so it is replaced with "的".
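A pattern rule of this kind can be encoded as a (DE-character, following-POS) pair mapped to a correction. Only the one rule spelled out in the text is reproduced here; the rule encoding, function name, and the other five Table 1 patterns are our assumptions.

```python
# Sketch of a DE-particle rule: "得" followed by a common noun (POS "Na")
# is corrected to "的". The list format is illustrative, not the paper's.
DE_RULES = [(("得", "Na"), "的")]

def fix_de_particles(words, pos_tags):
    """Apply each (DE char, next POS) -> correction rule left to right."""
    words = list(words)
    for i in range(len(words) - 1):
        for (de, next_pos), correction in DE_RULES:
            if words[i] == de and pos_tags[i + 1] == next_pos:
                words[i] = correction
    return words

out = fix_de_particles(["我", "得", "書"], ["Nh", "DE", "Na"])
```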

Google N-gram Scoring Functions
As described in Section 2, our previous language models were trained on the Academia Sinica Balanced Corpus (ASBC). We found that the volume and vocabulary of ASBC were not large enough, so we use the Chinese Web 5-gram dataset 2 instead.
Several n-gram scoring functions have been proposed by Lin and Chu (2015). Some examples from the Chinese Web 5-gram dataset are given here:

Moreover, in order to avoid the interference of word segmentation errors, we further designed likelihood scoring functions that use substring frequencies instead of word n-gram frequencies.
By removing the spaces inside the n-grams of the Chinese Web 5-gram dataset, we constructed a new dataset containing the corresponding substrings with their web frequencies; the n-grams in the previous example then become single strings. Note that if two different n-grams become identical after removing the spaces, they are merged into one entry whose frequency is the sum of their frequencies. Simplified Chinese words were translated into Traditional Chinese in advance. Let Zhar(S) be the number of Chinese or other characters in a sentence S. Given a sentence S, let SubStr(S, n) be the set of all substrings of S whose Zhar values are n. We define the Google string frequency gsf(u) of a string u to be its frequency in the modified Chinese Web 5-gram dataset; if a string does not appear in that dataset, its gsf value is defined to be 0.
Equation 1 gives the definition of the averaged weighted log frequency score GS_wgt(S), which sums the logarithms of the frequencies of all substrings of length n, averages the scores at the same n level, and multiplies the average by log n. Using the same error detection algorithm as described in Section 2, a top-1 replacement is reported as an error correction only if its Google n-gram preference score is no lower than the threshold 0.0002.
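A hedged reconstruction of how GS_wgt(S) might be computed, following the prose description (per-length averaging of log frequencies, weighted by log n). The n range of 2 to 5, treating unseen strings as contributing a log frequency of 0, and all names are our assumptions; they are not taken from Equation 1 itself.

```python
import math

def substrings(s, n):
    """All contiguous substrings of s with n characters (simplified: every
    character of s is assumed to count toward Zhar)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def gs_wgt(s, gsf, max_n=5):
    """Averaged weighted log frequency score: for each length n, average
    log gsf(u) over SubStr(s, n), weight by log n, and sum over n."""
    score = 0.0
    for n in range(2, max_n + 1):
        subs = substrings(s, n)
        if not subs:
            continue
        # gsf of an unseen string is 0; clamp to 1 so its log term is 0.
        level = sum(math.log(max(gsf.get(u, 0), 1)) for u in subs) / len(subs)
        score += math.log(n) * level
    return score

score = gs_wgt("外面也會", {"外面": 100, "面也": 5})
```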

Experimental Results
We submitted two formal runs this year, produced by two different statistics-based systems.
The first system checks the word-unigram, word-bigram, and POS-bigram preference scores of the top-1 sentence to decide the occurrence of a spelling error, as described in Section 2. The second system instead uses Google n-gram preference scores to check the occurrence of a spelling error, as described in Section 3.3. Tables 2 and 3 show the evaluation results of the formal runs. As we can see, the first system guesses errors more accurately but too cautiously. The second system proposes more errors, so it achieves a higher recall rate and a higher F-score.

Conclusion
This is the third time we have participated in a Chinese spelling check evaluation project. Based on our previous CSC system, we proposed three preference rules to handle three special cases: (1) Simplified Chinese characters or variants, (2) sentence-final particles, and (3) DE-particles. Moreover, a new sentence-likelihood scoring function, the averaged weighted log frequency score, which uses Google n-gram frequency information, was proposed.
Two formal runs were submitted this year. The first was predicted by three n-gram language models trained on the large ASBC corpus. The second was predicted by the system that uses Google n-gram averaged weighted log frequency scores to decide the occurrence of errors. The evaluation results show that the system using Google n-gram frequency information outperformed the traditional language models.