Context-aware Stand-alone Neural Spelling Correction

Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. In contrast, humans can easily infer the corresponding correct words from misspellings and their surrounding context. Inspired by this, we address the stand-alone spelling correction problem, which corrects only the spelling of each token without additional token insertion or deletion, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-tuning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.


Introduction
A spelling corrector is an important and ubiquitous pre-processing tool in a wide range of applications, such as word processors, search engines and machine translation systems. Thanks to a surprisingly robust language processing system that denoises scrambled spellings, humans can solve spelling correction with relative ease (Rawlinson, 1976). However, spelling correction is challenging for a machine, because words can be misspelled in many ways, and a machine has difficulty fully utilizing contextual information.
Misspellings can be categorized into non-word misspellings, which are out-of-vocabulary, and real-word misspellings, which are not (Klabunde, 2002). Dictionary look-up can detect non-word misspellings, while real-word spelling errors are harder to detect, since these misspellings are in the vocabulary (Mays et al., 1991; Wilcox-O'Hearn et al., 2008). In this work, we address the stand-alone spelling correction task (Li et al., 2018), which corrects the spelling of each token without introducing new tokens or deleting tokens, so that the original information is maximally preserved for downstream tasks.
We formulate stand-alone spelling correction as a sequence labeling task and jointly detect and correct misspellings. Inspired by the human language processing system, we propose a novel solution with the following aspects: (1) We encode both spelling information and global context information in the neural network. (2) We enhance real-word correction performance by initializing the model from a pre-trained language model (LM).
(3) We strengthen the model's robustness to unseen non-word misspellings by augmenting the training dataset with synthetic character-level noise. As a result, our best model¹ outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.

Approach
We use the transformer-encoder (Vaswani et al., 2017) to encode the input sequences and denote it as Encoder. As illustrated in Figure 1, we present both a Word+Char encoder and a Subword encoder, because we believe the former is better at encoding spelling information, while the latter has the benefit of utilizing a large pre-trained LM.
Word+Char encoder. We use a word encoder to extract global context information and a character encoder to encode spelling information. As shown in equation 1, in order to denoise the noisy word sequence S* into the clean sequence S, we first separately encode S* using a word-level transformer-encoder Encoder_word, and each noisy spelling sequence C*_k of token k via a character-level transformer-encoder Encoder_char. For Encoder_word, we replace non-word misspellings, i.e., OOV words, with a unk token. For Encoder_char, we treat each character as a token and each word as a "sentence", so each word's character-sequence embedding h_char^k is independent of the others. Since the transformer-encoder (Vaswani et al., 2017) computes contextualized token representations, we take h_char, the [CLS] token representation of each character sequence, as the local character-level representation of S*. Finally, we jointly predict S by concatenating the local and global context representations.
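The two-encoder combination can be sketched as follows; the transformer encoders are replaced with trivial embedding-lookup stand-ins (all dimensions and the ord-based character vocabulary are hypothetical) purely to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

D_WORD, D_CHAR, VOCAB = 8, 4, 10  # hypothetical small dimensions

def encode_word_level(word_ids):
    # Stand-in for Encoder_word: one contextual vector per word.
    # (The paper uses a word-level transformer encoder; a lookup table
    # keeps this sketch runnable.)
    emb = rng.standard_normal((VOCAB, D_WORD))
    return emb[word_ids]                          # (seq_len, D_WORD)

def encode_char_level(char_seqs):
    # Stand-in for Encoder_char: each word's character sequence is encoded
    # independently; the first ([CLS]-style) position is taken as h_char^k.
    emb = rng.standard_normal((128, D_CHAR))      # one row per character code
    cls = []
    for chars in char_seqs:
        h = emb[[ord(c) % 128 for c in chars]]    # (n_chars, D_CHAR)
        cls.append(h[0])                          # [CLS]-style representation
    return np.stack(cls)                          # (seq_len, D_CHAR)

def joint_representation(word_ids, char_seqs):
    # Concatenate global (word-level) and local (character-level)
    # representations; this joint vector feeds the correction classifier.
    h_word = encode_word_level(word_ids)
    h_char = encode_char_level(char_seqs)
    return np.concatenate([h_word, h_char], axis=-1)  # (seq_len, D_WORD + D_CHAR)

h = joint_representation([1, 2, 3], ["teh", "cat", "sat"])
```

The key structural point is that `h_char` is computed per word, independently of neighboring words, while `h_word` is contextualized over the whole sentence.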
Subword encoder. Alternatively, we use subword tokenization to address the spelling and context information simultaneously. Formally, as shown in equation 2, given a noisy subword token sequence S*_sub, we encode it using a transformer-encoder Encoder_sub and simply use an affine layer to predict the sequence of each subword token's corresponding correct word token S_sub in the BIO2 tagging scheme (Sang and Veenstra, 1999).
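The affine prediction head over subword hidden states can be sketched as follows (the hidden size, label count, and example BIO2 tags are hypothetical; in the real model the hidden states come from the Encoder_sub transformer):

```python
import numpy as np

rng = np.random.default_rng(1)
N_SUBWORDS, HIDDEN, N_LABELS = 5, 8, 6   # hypothetical sizes

# Stand-in for Encoder_sub output: one hidden vector per subword token.
H = rng.standard_normal((N_SUBWORDS, HIDDEN))

# Affine layer predicting, for each subword token, a word-level
# correction label.
W = rng.standard_normal((HIDDEN, N_LABELS))
b = np.zeros(N_LABELS)
logits = H @ W + b
pred = logits.argmax(-1)                 # one label per subword token

# BIO2-style alignment: the first subword piece of a word gets a B- tag,
# continuation pieces get I- tags, so subword predictions can be merged
# back into word-level corrections (tags below are illustrative only).
tags = ["B-the", "B-cat", "I-cat", "B-sat", "I-sat"]
```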
¹ https://github.com/jacklxc/StandAloneSpellingCorrection

Furthermore, we fine-tune our Subword encoder model with a pre-trained LM initialization to enhance real-word misspelling correction performance.
We use cross-entropy loss as our training objective. Finally, in addition to the natural misspelling noise, we apply a synthetic character-level noise to the training set to enhance the model's robustness to unseen misspelling patterns. The details will be introduced in section 3.1.

Dataset
Since we cannot find a sentence-level misspelling dataset, we create one by using the sentences in the 1-Billion-Word-Language-Model-Benchmark (Chelba et al., 2013) as gold sentences and randomly replacing words with misspellings from a word-level natural misspelling list (Mitton, 1985; Belinkov and Bisk, 2017) to generate noisy input sentences. In a real scenario, there will always be unseen misspellings after model deployment, regardless of the size of the misspelling list used for training. Therefore, we only use 80% of our full word-level misspelling list for the train and dev sets. To strengthen the robustness of the model to various noisy spellings, we also add noise from a character-level synthetic misspelling list (Belinkov and Bisk, 2017) to the training set. As a result, real-word misspellings contribute approximately 28% of the total misspellings in both the dev and test sets. The details are described in Section A.1.
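The noisy-sentence generation can be sketched as follows, with a tiny hypothetical misspelling dictionary standing in for the merged Mitton (1985) / Belinkov and Bisk (2017) lists; the replacement-count policy of Section A.1 is simplified here to a per-word probability p:

```python
import random

# Hypothetical miniature misspelling list (real lists map thousands of words).
MISSPELLINGS = {"the": ["teh", "hte"], "receive": ["recieve"], "there": ["thier"]}

def make_noisy(sentence, rng, p=0.3):
    # Replace each word that has a known misspelling with probability p;
    # tokens without an entry are left untouched, so sentence length is
    # preserved (the stand-alone setting: no insertions or deletions).
    noisy = []
    for w in sentence.split():
        if w in MISSPELLINGS and rng.random() < p:
            noisy.append(rng.choice(MISSPELLINGS[w]))
        else:
            noisy.append(w)
    return " ".join(noisy)

rng = random.Random(0)
noisy = make_noisy("the cat is there", rng, p=1.0)
```

Because tokens are replaced one-for-one, the gold sentence directly provides the per-token correction labels.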

Results
Performance Metrics We compare word-level precision, recall and F0.5 score, which emphasizes precision more. We also provide accuracy for reference in Table 1, because both baselines were evaluated with accuracy. Table 3 shows the definitions of true positive (TP), false positive (FP), false negative (FN) and true negative (TN) in this work to avoid confusion. We calculate the metrics as P = TP / (TP + FP), R = TP / (TP + FN), and F_β = (1 + β²) · P · R / (β² · P + R), where β = 0.5 in this work.
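The metrics reduce to the standard precision/recall/F-beta computation; the TP/FP/FN counts below are made-up numbers for illustration only:

```python
def f_beta(tp, fp, fn, beta=0.5):
    # Word-level precision and recall over misspelling corrections;
    # beta = 0.5 weights precision more heavily than recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

p, r, f05 = f_beta(tp=80, fp=10, fn=20)
```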

[Table 1: per-model Dev and Test Acc, P, R and F0.5 scores; row 1 is ScRNN (Sakaguchi et al., 2017).]
[Table 2: per-model real-word and non-word correction performance.]
Baselines. Sakaguchi et al. (2017) proposed the semi-character recurrent neural network (ScRNN), which takes the first and last characters, together with a bag-of-characters of the remaining characters, as features for each word, then uses an LSTM (Hochreiter and Schmidhuber, 1997) to predict each original word. MUDE uses a transformer-encoder (Vaswani et al., 2017) to encode character sequences as word representations and an LSTM (Hochreiter and Schmidhuber, 1997) to correct each word; it also uses Gated Recurrent Units (GRU) (Cho et al., 2014) for character-level correction as an auxiliary task during training. We train ScRNN and MUDE, both stand-alone neural spelling correctors, on our dataset as baselines.
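The ScRNN feature extraction described above can be sketched as follows (our illustration, not the original implementation; the short-word handling is an assumption):

```python
from collections import Counter

def semi_character_features(word):
    # ScRNN-style features: the first character, a bag-of-characters of
    # everything in between (order is discarded, mirroring human robustness
    # to internal letter scrambling), and the last character.
    if len(word) <= 2:
        return word[:1], Counter(), word[-1:]
    return word[0], Counter(word[1:-1]), word[-1]

first, bag, last = semi_character_features("spelling")
```

Note that the bag representation makes "spelling" and "seplling" map to identical features, which is exactly the invariance ScRNN exploits.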
Overview. As row 11 of Table 1 shows, fine-tuning the Subword (WordPiece (Peters et al., 2018)) encoder model with LM initialization (ERNIE 2.0 (Sun et al., 2019)) on the dataset augmented with synthetic character-level misspellings yields the best performance. Without leveraging a pre-trained LM, the Word+Char encoder model trained on the augmented dataset performs best (row 6). In fact, the differences between these approaches are small. In Table 2, we report real-word and non-word correction performance to explain the effect of each training technique applied. Note that, as shown in Figure 1, because non-word misspellings are already preprocessed, detecting them is trivial, so all models reach a non-word recall of 1.000.
As Table 2 shows, strong models perform well on both real-word and non-word misspellings overall. Although our models perform better on non-word misspellings than on real-word misspellings, the significant improvement of our models over the baselines comes from real-word misspellings, due to the use of the pre-trained LM. In the following paragraphs, we state our claims and support them with our experimental results.
Spelling correction requires both spelling and context information. As Table 2 shows, without context information, the character encoder model (row 3) performs poorly on real-word misspellings. Conversely, the word encoder model (row 4) performs well on real-word misspellings but poorly on non-word misspellings, due to the lack of spelling information. The combined Word+Char encoder model (row 5) leverages both spelling and context information and thus improves F0.5 by nearly 40% absolute in Table 1, even outperforming the LM-initialized model (row 10). Both baseline models (rows 1 and 2) perform poorly because they perform spelling corrections over character sequences, disregarding the semantics of the context, as their poor real-word performance in Table 2 (rows 1 and 2) suggests. On the other hand, since subword embeddings essentially subsume character embeddings, an additional character encoder does not improve the performance of the Subword encoder model (Table 1, row 8).
Pre-trained LM facilitates spelling correction. As row 10 of Table 1 shows, fine-tuning the model with pre-trained LM weight initialization improves both precision and recall over the Subword encoder model (row 7). The LM pre-training mainly improves real-word recall, as Table 2 row 10 suggests. Pre-trained LMs are trained with multiple unsupervised pre-training tasks on a much larger corpus than ours, which virtually expands both the training task and the training set.
Because most neural language models are trained at the subword level, we are unable to obtain a pre-trained LM-initialized version of the Word+Char encoder model (row 5). Nonetheless, we hypothesize that such a model would yield very promising performance given sufficient training data and proper LM pre-training tasks.
Training on additional synthetic character-level noise improves model robustness. As rows 6, 9 and 11 of Tables 1 and 2 show, in addition to frequently occurring natural misspellings, training models on text with synthetic character-level noise improves test performance, driven mainly by improved precision on non-word misspellings. Note that the train and dev sets cover only 80% of the candidate natural misspellings. Adding character-level noise to the training data essentially increases the variety of misspelling patterns, which makes the model more robust to unseen misspelling patterns.
In recent years, there have been several attempts to develop better spelling correction algorithms based on neural networks (Etoori et al., 2018). Similar to our baselines ScRNN (Sakaguchi et al., 2017) and MUDE, Li et al. (2018) proposed a nested RNN to hierarchically encode characters into word representations, then correct each word using a nested GRU (Cho et al., 2014). However, these previous works train models on either natural misspellings only (Sakaguchi et al., 2017) or synthetic misspellings only, and focus on denoising the input text from an orthographic perspective without leveraging the retained semantics of the noisy input.
On the other hand, Tal Weiss proposed Deep Spelling (Weiss), which uses the sequence-to-sequence architecture (Sutskever et al., 2014; Bahdanau et al., 2014) to generate corrected sentences. Note that Deep Spelling is essentially not a spelling corrector, since spelling correction must focus only on the misspelled words rather than transforming whole sentences. For similar reasons, spelling correction also differs from grammatical error correction (GEC) (Zhang and Wang, 2014; Junczys-Dowmunt et al., 2018).
As background, pre-trained neural LMs (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2019; Sun et al., 2019), trained on large corpora with various pre-training tasks, have recently achieved enormous success on various benchmarks. These LMs capture the probability of a word or a sentence given its context, which plays a crucial role in correcting real-word misspellings. However, all of the LMs mentioned are based on subword embeddings, such as WordPiece (Peters et al., 2018) or Byte Pair Encoding (Gage, 1994), to avoid OOV words.

Conclusion
We leverage novel approaches to combine spelling and context information for stand-alone spelling correction, achieving state-of-the-art performance. Our experiments give insights on how to build a strong stand-alone spelling corrector: (1) combine both spelling and context information, (2) leverage a pre-trained LM, and (3) augment the training data with synthetic character-level noise.

A.1 Dataset Details
We keep the most frequent words in the 1-Billion-Word-Language-Model-Benchmark dataset (Chelba et al., 2013) as our word vocabulary Ψw, and all characters in Ψw form our character vocabulary Ψc. After deleting sentences containing OOV words, we randomly divide the remaining sentences into three datasets: S_train, S_dev and S_test. We merge the two word-level misspelling lists (Mitton, 1985; Belinkov and Bisk, 2017) to get a misspelling list Ω. We randomly choose 80% of all misspellings in Ω to form a known misspelling list Ω̃.
To strengthen the robustness of the model to various noisy spellings, we also utilize the methods in Belinkov and Bisk (2017) , namely, swap, middle random, fully random and keyboard type, to generate character-level synthetic misspellings. To encourage the model to learn contextual information, we add an additional method, random generate, to generate arbitrary character sequences as misspellings.
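The four Belinkov and Bisk (2017) noise types plus our random generate method can be sketched as follows (the QWERTY neighbor table is abbreviated and hypothetical, and the short-word edge-case handling is our own choice):

```python
import random

QWERTY_NEIGHBORS = {"a": "sq", "s": "adw", "e": "wr"}  # abbreviated, hypothetical

def swap(word, rng):
    # swap: exchange two adjacent internal characters (first/last kept fixed).
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def middle_random(word, rng):
    # middle random: shuffle all internal characters, keep first and last.
    if len(word) < 4:
        return word
    mid = list(word[1:-1])
    rng.shuffle(mid)
    return word[0] + "".join(mid) + word[-1]

def fully_random(word, rng):
    # fully random: shuffle every character of the word.
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)

def keyboard_typo(word, rng):
    # keyboard type: replace one character with a QWERTY neighbor.
    idxs = [i for i, c in enumerate(word) if c in QWERTY_NEIGHBORS]
    if not idxs:
        return word
    i = rng.choice(idxs)
    return word[:i] + rng.choice(QWERTY_NEIGHBORS[word[i]]) + word[i + 1:]

def random_generate(length, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # random generate: an arbitrary character sequence as a misspelling,
    # forcing the model to rely on context rather than spelling.
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(0)
noised = middle_random("spelling", rng)
```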
While replacing gold words with misspellings, for a sentence with n words the number of replaced words is m = max(⌊αn⌋, 1), where α = min(|N(0, 0.2)|, 1.0) and N(0, 0.2) denotes a sample from a Gaussian distribution with mean 0 and standard deviation 0.2.
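The replacement count can be sampled as in this sketch (assuming ⌊·⌋ rounding of αn, which the extracted text does not make explicit):

```python
import random

def num_replacements(n, rng):
    # alpha = min(|N(0, 0.2)|, 1.0); m = max(floor(alpha * n), 1).
    # The floor is our assumption; the max(..., 1) guarantees every
    # noisy sentence contains at least one misspelling.
    alpha = min(abs(rng.gauss(0.0, 0.2)), 1.0)
    return max(int(alpha * n), 1)

rng = random.Random(0)
m = num_replacements(20, rng)
```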
The dev set is created with misspellings from the sampled list Ω̃, and the test set with misspellings from the full list Ω. We compare two train sets: the first has only natural misspellings from Ω̃, and the second has natural as well as synthetic misspellings, denoted +random char in Section 3.2. We always use the same dev and test sets, which contain only natural misspellings, for comparison. Table 4 shows the parameters of our stand-alone spelling correction dataset. We will release the dataset and code after this paper is published.

A.2 Implementation Details
We use PaddlePaddle² for the network implementation and keep the same configuration for the Subword encoders as ERNIE 2.0 (Sun et al., 2019). We tune the models by grid search on the dev set according to F0.5 score. The detailed hyperparameters are shown in Table 5. In addition, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-5 and linear decay. We used