Confusionset-guided Pointer Networks for Chinese Spelling Check

This paper proposes Confusionset-guided Pointer Networks for the Chinese Spelling Check (CSC) task. More concretely, our approach uses an off-the-shelf confusionset to guide character generation. To this end, our novel Seq2Seq model jointly learns to copy a correct character from the input sentence through a pointer network or to generate a character from the confusionset rather than from the entire vocabulary. We conduct experiments on three human-annotated datasets, and the results demonstrate that our proposed generative model outperforms all competitor models by a large margin of up to 20% F1 score, achieving state-of-the-art performance on all three datasets.


Introduction
Everyday writing contains different types of errors; one that occurs frequently is misspelling a character because of its similarity to another character in sound, shape, and/or meaning. Spelling check is the task of detecting and correcting such problematic usage of language. Although spelling check tools have been useful, detecting and fixing errors in natural language, especially in Chinese, remains far from solved. Notably, Chinese is very different from alphabetical languages (e.g., English). First, there are no word delimiters between Chinese words. Second, the error detection task is difficult due to its context-sensitive nature, i.e., errors can often be determined only at the phrase or sentence level and not at the character level.
In this paper, we propose a novel neural architecture for the Chinese Spelling Check (CSC) task. For the task at hand, it is intuitive that the generated sentence and the input sentence usually share most characters and the same sentence structure, with the exception of a few incorrect characters. This is unlike other generative tasks (e.g., neural machine translation or dialog translation), in which the output differs greatly from the input.
To this end, this paper proposes a novel Confusionset-guided copy mechanism that achieves significant performance gains over competitor approaches. Copy mechanisms (Gulcehre et al., 2016) enable the copying of words directly from the input via pointing, providing an extremely appropriate inductive bias for the CSC task. More concretely, our model jointly learns to select appropriate characters to copy or, when an incorrect character occurs, to generate a correct character from the vocabulary. The key novelty of our work, however, is the infusion of Confusionsets with Pointer Networks, which helps reduce the search space and vastly improves the probability of generating correct characters. Experimental results on three benchmark datasets demonstrate that our model outperforms all competitor models, obtaining performance gains of up to 20% in F1 score.

Our Proposed Model
Given an input, we represent the input sentence as $X = \{c^s_1, c^s_2, \cdots, c^s_n\}$, where $c^s_i$ is a Chinese character and $n$ is the number of characters. We map $X$ to an output sentence $Y = \{c^t_1, c^t_2, \cdots, c^t_n\}$ by maximizing the probability $P(Y|X)$. Our model consists of an encoder and a decoder similar to (Sutskever et al., 2014), as shown in Figure 1. The encoder maps $X$ to a higher-level representation with a bidirectional LSTM (BiLSTM) architecture (Hochreiter and Schmidhuber, 1997). The decoder is also a recurrent neural network, with an attention mechanism to attend to the encoded representation, and generates $Y$ one character at a time. In our setting, the length of $Y$ is constrained to equal the length of $X$.
Confusionset M. A confusionset, a prepared set of commonly confused characters, plays a key role in spelling error detection and correction. Most Chinese characters have counterparts that are similar in shape or pronunciation. According to statistics on incorrect Chinese characters collected from the Internet (Liu et al., 2010), 83% of these errors were related to phonological similarity and 48% to visual similarity between the characters involved. To reduce the search space while ensuring that the target characters are not excluded, we build a confusionset matrix $M \in \mathbb{R}^{n \times w}$, where $n$ is the number of characters in $X$ and $w$ is the size of the vocabulary, and each element is 0 or 1. Take the input "这使我永生难望" as an example: the 7th character "望" is a spelling error, and its confusionset is "汪圣忘晚往完万网···". In $M[7]$, the positions of these confusable characters are set to 1 and the rest are set to 0.
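As a minimal sketch of how such a 0/1 matrix could be built, the snippet below marks, for each input character, the vocabulary positions of its confusable characters. The function name `build_confusion_mask`, the toy vocabulary, and the toy confusionset are illustrative assumptions, not the resources used in the paper. Rows are 0-indexed in the code, so $M[7]$ in the text corresponds to `mask[6]`.

```python
def build_confusion_mask(sentence, vocab, confusion_sets):
    """Return an n x w 0/1 matrix (list of lists).

    Row i has a 1 at every vocabulary index whose character is in the
    confusionset of sentence[i], and 0 elsewhere.
    """
    index = {ch: k for k, ch in enumerate(vocab)}
    mask = []
    for ch in sentence:
        row = [0] * len(vocab)
        for cand in confusion_sets.get(ch, ()):
            if cand in index:  # skip candidates outside the vocabulary
                row[index[cand]] = 1
        mask.append(row)
    return mask


# Toy data (hypothetical): a 10-character vocabulary and a tiny confusionset.
vocab = ["这", "使", "我", "永", "生", "难", "望", "忘", "往", "万"]
confusion_sets = {"望": ["忘", "往", "万"]}
mask = build_confusion_mask("这使我永生难望", vocab, confusion_sets)
```

In this toy setup only the last row contains non-zero entries, because "望" is the only character given a confusionset.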

Encoder
Before diving into the model, we first explain why it operates at the character level. Chinese, unlike alphabetic languages such as English, has no explicit delimiter between words, so our neural network model operates on characters. One reason is that even state-of-the-art word segmenters make segmentation errors, and texts with spelling errors exacerbate this problem. Incorrectly segmented results might impair the encoder's ability to capture the semantic representation of $X$.
The encoder reads $X$ and outputs a sequence of vectors, one per character in the sentence, which are selectively accessed during decoding via a soft attention mechanism. We use a bidirectional LSTM network over the character embeddings to obtain the hidden state $h^s_i$ for each time step $i$, where $h^s_i$ is the concatenation of the forward hidden state $\overrightarrow{h}^s_i$ and the backward hidden state $\overleftarrow{h}^s_i$, and $e^s_i$ is the character embedding for $c^s_i$ in $X$.

Decoder
The decoder uses another LSTM that produces a distribution over the next target character given the source vectors $[h^s_1, h^s_2, \cdots, h^s_n]$, the previously generated target characters $\hat{Y}_{<j} = [\hat{c}^t_1, \hat{c}^t_2, \cdots, \hat{c}^t_{j-1}]$, and $M \in \mathbb{R}^{n \times w}$. The decoder hidden state $h^t_j$ summarizes the target sentence up to the $j$-th character and is computed from the previous state and $e^t_j$, the embedding of $c^t_{j-1}$. Note that during training the ground truth $c^t_{j-1}$ is fed into the network to predict $c^t_j$, while at test time the most probable $\hat{c}^t_{j-1}$ is used.
We extend this decoder with an attention-based model (Luong et al., 2015): at every decoding time step, an attention score $a^s_i$ is computed for each encoder hidden state $h^s_i$ using the attention mechanism of (Vinyals et al., 2015). The source vectors are multiplied by their respective attention weights and summed into a new vector $h^{t'}_j$, the summary of the source vectors. $h^{t'}_j$ is then combined with the current decoder hidden state $h^t_j$ to produce a context vector $C_j$; $U$, $W_1$, $W_2$, and $W$ are the trainable parameters of these computations.
$C_j$ is then used to generate two distributions: one over the vocabulary, obtained by applying an affine transformation to $C_j$ followed by a softmax, and one over the input sentence, for which we use the copy mechanism. Additionally, we add the location information $Loc_j$ of the corresponding character $c^s_j$ in $X$, which gives the decoder knowledge of previous (soft) alignments at each time step. $Loc_j$ is a vector of length $n$ initialized to 0; at time step $j$, its $j$-th element is set to 1 and the others remain 0. The hidden state used to generate the distribution over the input sentence concatenates the context information with $Loc_j$, where $[\cdot;\cdot]$ denotes the concatenation operation.
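The attention step described above can be sketched as follows. The displayed equations are not reproduced in this text, so the additive scoring form $U^\top \tanh(W_1 h^t_j + W_2 h^s_i)$, the tanh applied to the context vector, and all function names and toy values are assumptions modeled on the cited pointer-network attention, not necessarily the authors' exact formulation.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h_t, enc_states, W1, W2, U):
    """Return attention weights and the weighted sum of encoder states."""
    query = matvec(W1, h_t)
    scores = []
    for h_s in enc_states:
        key = matvec(W2, h_s)
        hidden = [math.tanh(q + k) for q, k in zip(query, key)]
        scores.append(sum(u * h for u, h in zip(U, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    summary = [sum(w * h_s[d] for w, h_s in zip(weights, enc_states))
               for d in range(dim)]
    return weights, summary


def context_vector(h_t, summary, W):
    """Combine the decoder state and source summary (assumed: tanh(W [h; h']))."""
    return [math.tanh(v) for v in matvec(W, h_t + summary)]


# Toy values (hypothetical): 2-dimensional states, 3 source positions.
h_t = [0.5, -0.5]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
U = [1.0, 1.0]
W = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]

weights, summary = attend(h_t, enc_states, W1, W2, U)
ctx = context_vector(h_t, summary, W)
```

The weights form a probability distribution over source positions, and `ctx` plays the role of $C_j$ from which the vocabulary and copy distributions would be derived.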
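To train the pointer network, we define the position label $L^{loc}_j$ at decoding time step $j$ as the position of the target character $c^t_j$ in $X$ when it occurs in the input sentence, and as $n+1$ otherwise. Position $n+1$ is a sentinel token deliberately appended to the end of $X$, which allows us to compute the loss even if $c^t_j$ does not occur in the input sentence. The loss is then defined between the predicted distribution $L_j$ over input positions and the label $L^{loc}_j$.
At inference time, $\hat{c}^t_j$ is copied from the input position selected by the pointer network unless that position is the sentinel, in which case it is generated from the vocabulary distribution multiplied element-wise ($\odot$) by $M[j]$. $M[j]$ limits the scope of generated characters, based on the assumption that the correct character is contained in the confusionset of the erroneous character.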
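One way to read the inference rule is sketched below: copy when the pointer distribution prefers a real input position, and otherwise generate from the vocabulary distribution restricted by the confusionset row. The function name `decode_step`, the argmax decision rule, and the toy probabilities are assumptions for illustration, since the original equation is not reproduced in this text.

```python
def decode_step(pointer_probs, vocab_probs, mask_row, source_chars, vocab):
    """Choose the output character for one decoding step.

    pointer_probs has len(source_chars) + 1 entries; the last entry is the
    sentinel position, meaning the target is not in the input sentence.
    """
    n = len(source_chars)
    best_pos = max(range(n + 1), key=lambda i: pointer_probs[i])
    if best_pos < n:
        # The pointer selected a real input position: copy that character.
        return source_chars[best_pos]
    # Sentinel selected: generate, keeping only confusionset candidates.
    masked = [p * m for p, m in zip(vocab_probs, mask_row)]
    best_id = max(range(len(vocab)), key=lambda k: masked[k])
    return vocab[best_id]


# Toy example (hypothetical numbers) for the input "难望".
source_chars = ["难", "望"]
vocab = ["望", "忘", "往"]
mask_row = [0, 1, 1]  # confusionset of "望" within this toy vocabulary

copied = decode_step([0.8, 0.1, 0.1], [0.6, 0.3, 0.1], mask_row, source_chars, vocab)
generated = decode_step([0.1, 0.2, 0.7], [0.6, 0.3, 0.1], mask_row, source_chars, vocab)
```

In the second call the unmasked vocabulary distribution would pick "望" again; multiplying by the mask row removes it, so the most probable confusable character is chosen instead. If a mask row were all zeros, this sketch would fall back to the first vocabulary entry, a case a real implementation would need to handle explicitly.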

Experiments
Train data. We use a large annotated corpus containing spelling errors, in which the erroneous characters are either visually or phonologically similar to the correct ones, generated by a previously proposed automatic approach. In addition, a small fraction of three human-annotated training datasets provided in (Wu et al., 2013; Tseng et al., 2015) is also included in our training data.
Test data. To evaluate the effectiveness of our proposed model, we test our trained model on benchmark datasets from three shared tasks of CSC (Wu et al., 2013; Tseng et al., 2015). Since these testing datasets are written in traditional Chinese, we convert them into simplified Chinese characters using OpenCC.
Statistics of the experimental data, including the training datasets, the testing datasets, and the Confusionsets used in our model, are shown in Table 1.
Evaluation metrics. We adopt precision, recall, and F1 score, which are widely used as evaluation metrics in CSC tasks.
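For reference, these three metrics can be computed from true-positive, false-positive, and false-negative counts as in the sketch below. How a detection-level or correction-level hit is counted follows the shared-task definitions and is not shown here; the function name and the counts in the example are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Each ratio is defined as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct detections, 2 false alarms, 8 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```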
Table 1 columns: Name, Data Size (lines), Avg. Sentence Length, # of Errors.
Table 2: Experimental results of detection-level and correction-level performance on three testing datasets (%). + and - denote using Confusionsets and not using Confusionsets, respectively.
Baseline models. We compare our model with two baseline methods for CSC. The first is N-gram language modeling with a pre-constructed confusionset (LMC), which is widely used in CSC for its simplicity and power (Yu and Li, 2014; Xie et al., 2015). Characters in a sentence are replaced with candidates from the confusionset, and the sentence probabilities before and after the replacement are compared to determine whether the sentence contains spelling errors. We re-implement the pipeline proposed in (Xie et al., 2015). The second is the sequence labeling method (SL), which casts Chinese spelling error detection as a sequence tagging problem on characters, in which correct and incorrect characters are tagged as 1 and 0, respectively. We follow a baseline model that implements an LSTM-based sequence tagging model.
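The replace-and-rescore idea behind LMC can be illustrated with a toy bigram model. This is only a sketch of the general technique, not the re-implemented pipeline of (Xie et al., 2015): the function names, the log-probability table, the `floor` value for unseen bigrams, the `margin` threshold, and the restriction to a single replacement are all assumptions.

```python
def sentence_logprob(sentence, bigram_logprob, floor=-10.0):
    """Sum bigram log-probabilities; unseen bigrams get the floor value."""
    chars = ["<s>"] + list(sentence)
    return sum(bigram_logprob.get((a, b), floor)
               for a, b in zip(chars, chars[1:]))


def lmc_correct(sentence, confusion_sets, bigram_logprob, margin=1.0):
    """Try single-character replacements from the confusionset.

    Returns the best-scoring candidate if it beats the original sentence
    by more than `margin` (in log-probability), else the original.
    """
    base = sentence_logprob(sentence, bigram_logprob)
    best, best_score = sentence, base
    for i, ch in enumerate(sentence):
        for cand in confusion_sets.get(ch, ()):
            trial = sentence[:i] + cand + sentence[i + 1:]
            score = sentence_logprob(trial, bigram_logprob)
            if score > best_score and score - base > margin:
                best, best_score = trial, score
    return best


# Toy model (hypothetical log-probabilities).
bigram_logprob = {("<s>", "难"): -1.0, ("难", "忘"): -1.0, ("难", "望"): -8.0}
confusion_sets = {"望": ["忘"]}

fixed = lmc_correct("难望", confusion_sets, bigram_logprob)
unchanged = lmc_correct("难望", confusion_sets, bigram_logprob, margin=10.0)
```

The two calls differ only in the threshold, which shows how the choice of `margin` decides whether a sentence is flagged as erroneous.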

Model Hyperparameters
The training hyperparameters are selected based on the results of the validation set. The dimension of word embedding is set to 300 and the hidden vector is set to 512 in both the encoder and decoder. The dimension of the attention vector is also set to 512 and the dropout rate is set to 0.5 for regularization. The mini-batched Adam (Kingma and Ba, 2014) algorithm is used to optimize the objective function. The batch size and base learning rates are set to 64 and 0.001, respectively.
Results. As shown in Table 2, we compare our confusionset-guided pointer networks with two baseline methods. As expected, except for two precision results that are lower than those of LMC, our model consistently improves over the other models in both detection-level and correction-level evaluation. One reason might be that SL treats spelling check as a character-level classification task, so the information available at the current time step is somewhat constrained, whereas our generative model can use both the location information and the whole input through the attention mechanism, and the copy mechanism also makes decoding more effective. As for LMC, how to set a threshold probability for judging whether a given sentence is correct remains to be explored, and there is a considerable trade-off between precision and recall, as reported in (Jia et al., 2013).
Utility of M. By comparing the experimental results of Ours− and Ours+, we observe that the latter achieves better performance, which validates the effectiveness of Confusionsets in improving the probability of generating correct target characters.

Discussion and Future Work
In everyday Chinese writing there is a variety of problematic language usage, one kind of which is the spelling error addressed in this paper. Such spelling errors mainly arise from the similarity of Chinese characters in sound, shape, and/or meaning, and the task is to detect the misspelled characters and replace them with the correct ones. Besides spelling errors, grammar errors are also common in Chinese writing, and correcting them requires insertion, deletion, and even re-ordering. Take "我真不不明白，为啥他要自杀。" (Translation: I really don't understand why he committed suicide.) as an example: we need to delete the duplicated character "不" to make the sentence correct. However, our model is unable to handle such errors because we constrain the generated sentence to have the same length as the input sentence in order to incorporate Confusionsets as a guiding resource. In future work, we hope to extend the idea proposed in this paper to train a model capable of handling different types of errors, since a generative model can produce outputs of different lengths. One concern is that we need to reconsider how to incorporate Confusionsets into the encoder-decoder architecture.

Related Work
Most CSC-related studies have emerged from a series of shared tasks (Wu et al., 2013; Tseng et al., 2015; Fung et al., 2017; Gaoqi et al., 2018), which involve automatic detection and correction of spelling errors for a given sentence. Earlier work in CSC focuses mainly on unsupervised methods such as language models with a pre-constructed confusionset (Yu and Li, 2014). Subsequently, some work cast CSC as a sequential labeling problem, in which conditional random fields (CRF) (Lafferty et al., 2001) and gated recurrent networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) have been employed to model the problem (Zheng et al., 2016; Xie et al., 2017; Wu et al., 2018). More recently, motivated by a series of remarkable successes achieved by neural network-based sequence-to-sequence learning (Seq2Seq) in various natural language processing (NLP) tasks (Sutskever et al., 2014), generative models have also been applied to the spelling check task by treating it as an encoder-decoder problem (Xie et al., 2016; Ge et al., 2018).

Conclusion and Future Work
We proposed a novel end-to-end confusionset-guided encoder-decoder model for the Chinese Spelling Check (CSC) task. By infusing Confusionsets into the copy mechanism, our proposed approach achieves a large performance gain over competitive baselines, demonstrating its effectiveness on the CSC task.