Contextual Text Denoising with Masked Language Model

Recently, with the help of deep learning models, significant advances have been made in different Natural Language Processing (NLP) tasks. Unfortunately, state-of-the-art models are vulnerable to noisy texts. We propose a new contextual text denoising algorithm based on a ready-to-use masked language model. The proposed algorithm does not require retraining of the model and can be integrated into any NLP system without additional training on paired cleaning data. We evaluate our method under synthetic noise and natural noise and show that the proposed algorithm can use context information to correct noisy text and improve performance on noisy inputs in several downstream tasks.


Introduction
Based on prior knowledge and the contextual information in sentences, humans can understand noisy texts, such as misspelled words, without difficulty. However, NLP systems break down on noisy text. For example, Belinkov and Bisk (2017) showed that modern neural machine translation (NMT) systems could not even translate texts with moderate noise. An illustrative example of English-to-Chinese translation using Google Translate is presented in Table 1.
Text correction systems are widely used in real-world scenarios to address the problem of noisy text inputs. Simple rule-based and frequency-based spell-checkers are of limited use for complex language systems. More recently, modern neural Grammatical Error Correction (GEC) systems have been developed with the help of deep learning (Zhao et al., 2019; Chollampatt and Ng, 2018). These GEC systems rely heavily on annotated GEC corpora, such as CoNLL-2014 (Ng et al., 2014). The parallel GEC corpora, however, are expensive, limited, and even unavailable for many languages. Another line of research focuses on training a robust model that inherently deals with noise. For example, Belinkov and Bisk (2017) train robust character-level NMT models using noisy training datasets, including both synthetic and natural noise. On the other hand, Malykh et al. (2018) consider robust word vectors. These methods require retraining the model on new word vectors or noisy data. Retraining is expensive and affects the performance on clean text. For example, in Belinkov and Bisk (2017), robustness sacrifices about 7 BLEU points on clean text in the EN-FR translation task.
In this paper, we propose a novel text denoising algorithm based on a ready-to-use masked language model (MLM; Devlin et al., 2018). Note that we use English BERT; for other languages, an MLM pretrained on that specific language is needed. The design follows the human cognitive process: humans can use the context, the spelling of the wrong word (Mayall et al., 1997), and even the location of the letters on the keyboard to correct noisy text. The MLM essentially mimics this process, as the model predicts the masked words based on their context. There are several benefits of the proposed method:
• Our method can make accurate corrections based on the context and semantic meaning of the whole sentence, as Table 1 shows.
• The pre-trained masked language model is ready-to-use (Devlin et al., 2018; Liu et al., 2019). No extra training or data is required.
• Our method makes use of Word Piece embeddings (Wu et al., 2016)

Method
Our denoising algorithm cleans the words in the sentence in sequential order. Given a word, the algorithm first generates a candidate list using the MLM and then filters the list to select a single candidate. In this section, we first briefly introduce the masked language model, and then describe the proposed denoising algorithm.

Masked Language Model
A masked language model (MLM) masks some words in a sentence and then predicts the masked words based on the contextual information. Specifically, given a sentence $x = \{x_i\}_{i=1}^{L}$ with $L$ words, the MLM models $p(x_j \mid x_1, ..., x_{j-1}, [MASK], x_{j+1}, ..., x_L)$, where [MASK] is a masking token over the $j$-th word. An MLM can in fact recover multiple masks together; here we only present the case with one mask for notational simplicity. In this way, unlike a traditional language model that works in left-to-right order (i.e., $p(x_j \mid x_1, ..., x_{j-1})$), the MLM is able to use both the left and right context. As a result, the MLM can make a more accurate prediction.
In the following, we use the pre-trained masked language model BERT (Devlin et al., 2018), so no training process is involved in developing our algorithm.
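To make the contrast between left-to-right and two-sided context concrete, here is a toy, count-based sketch. It is not BERT and not the method used in this paper: the four-sentence corpus and the `predict` helper are hypothetical and only illustrate why the right-hand context can change the prediction for a masked position.

```python
from collections import Counter

# Hypothetical toy corpus; a real MLM such as BERT is trained on far more text.
CORPUS = ["the cat meowed", "the cat slept", "the cat purred", "the dog barked"]

def predict(left, right=None):
    """Return the most frequent middle word given the left word and,
    optionally, the right word as well."""
    counts = Counter()
    for sentence in CORPUS:
        words = sentence.split()
        for j in range(1, len(words) - 1):
            if words[j - 1] != left:
                continue
            if right is not None and words[j + 1] != right:
                continue
            counts[words[j]] += 1
    return counts.most_common(1)[0][0] if counts else None

# Query: "the [MASK] barked"
left_to_right = predict("the")            # only sees "the"
two_sided = predict("the", "barked")      # sees "the" and "barked"
```

With only the left context, the most frequent continuation of "the" in this corpus wins; once the right context "barked" is also used, the prediction changes to the word that fits both sides.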

Denoising Algorithm
The algorithm cleans every word in the sentence in left-to-right order, except for punctuation and numbers, by masking the words one at a time. For each word, the MLM first provides a candidate list using a transformed sentence. The cleaned word is then selected from the list. The whole process is summarized in Algorithm 1.
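The outer loop described above could be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: `propose_candidates` and `distance` are hypothetical callables standing in for the MLM-based candidate generation and the edit distance, and passing the partially cleaned sentence as context is our own assumption.

```python
def denoise_sentence(words, propose_candidates, distance):
    """Clean words left to right; punctuation and numbers are left untouched."""
    cleaned = list(words)
    for j, word in enumerate(words):
        if not any(ch.isalpha() for ch in word):
            continue  # skip punctuation and numbers
        # Assumption: earlier corrections are visible as context for later words.
        candidates = propose_candidates(cleaned, j)
        if candidates:
            cleaned[j] = min(candidates, key=lambda w: distance(w, word))
    return cleaned

def stub_candidates(context, j):
    # Stand-in for the MLM: a fixed candidate list regardless of context.
    return ["there", "are", "three", "trees", ",", "2"]

def mismatch_distance(a, b):
    # Crude placeholder for edit distance: position-wise mismatches plus length gap.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

result = denoise_sentence(["thre", "aer", "2", "treas", "."],
                          stub_candidates, mismatch_distance)
```

Keeping candidate generation and selection as separate callables mirrors the two-stage description in the text and lets either stage be replaced independently.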
Text Masking. The first step is to convert the sentence $x$ into a masked form $x'$. With the use of Word Piece tokens, each word can be represented by several different tokens. Suppose the $j$-th word (the one that needs to be cleaned) is represented by the $j_s$-th token through the $j_e$-th token; we need to mask them out together. For the same reason, the number of tokens of the expected cleaned word is unknown. We therefore use different numbers of masks to create the masked sentences $\{x'_n\}_{n=1}^{N}$, where $x'_n$ denotes the masked sentence with an $n$-gram mask. Specifically, given $x = x_1, ..., x_{j_s}, ..., x_{j_e}, ..., x_L$, the masked form is $x'_n = x_1, ..., [MASK] \times n, ..., x_L$. We mask each word in the noisy sentence in order. The number of masks $N$ cannot be too small or too large. With a small $N$, the candidate list will fail to capture the right answer. With a large enough $N$, however, the best-scoring candidate would fit the noisy text perfectly, that is, it would overfit to the noise. Empirically, we find that $N = 4$ is sufficiently large to obtain decent performance without too much overfitting.
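The masking step can be sketched in a few lines. The token list below is a hypothetical Word Piece split of the misspelling "appel", chosen only for illustration; `masked_variants` is an illustrative helper name, not part of any library.

```python
MASK = "[MASK]"

def masked_variants(tokens, start, end, max_masks=4):
    """Replace tokens[start..end] (inclusive) by n mask tokens, for n = 1..max_masks."""
    return [tokens[:start] + [MASK] * n + tokens[end + 1:]
            for n in range(1, max_masks + 1)]

# Hypothetical tokenization: the noisy word spans token positions 2 and 3.
tokens = ["i", "like", "app", "##el", "pie"]
variants = masked_variants(tokens, 2, 3)
```

Each variant corresponds to one $x'_n$: all pieces of the noisy word are removed together, and the number of masks is varied because the token count of the clean word is unknown.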
Text Augmentation. Since the wrong word is also informative, we augment each masked text $x'_n$ by concatenating the original text $x$. Specifically, the augmented text is $\tilde{x}_n = x'_n \, [SEP] \, x$, where [SEP] is a separation token. Compared with directly leaving the noisy word in the original sentence, the masking and augmentation strategy is more flexible, because the number of tokens of the expected word does not necessarily equal that of the noisy word. Besides, the model pays less attention to the noisy words, which may otherwise bias the prediction of the clean word.
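A minimal sketch of the augmentation step follows. Placing the masked text before the separator and the original text after it is an assumption made for this illustration, and `augment` is a hypothetical helper name.

```python
def augment(masked_tokens, original_tokens, sep="[SEP]"):
    """Concatenate the masked text and the original noisy text around a separator."""
    return " ".join(masked_tokens + [sep] + original_tokens)

augmented = augment(["i", "like", "[MASK]", "pie"],
                    ["i", "like", "appel", "pie"])
```

The noisy word stays visible to the model in the second segment, while the masked slot in the first segment is free to take a different number of tokens.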

Candidate Selection
The algorithm then constructs a candidate list using the MLM; the candidates are semantically suitable for the masked position in the sentence. We first construct a candidate list $V_c^n$ for each augmented text $\tilde{x}_n$, and then combine them to obtain the final candidate list $V_c = V_c^1 \cup \cdots \cup V_c^N$. Note that we need to handle multiple masks when $n > 1$. We therefore first find the $k$ most probable word pieces for each mask and then enumerate all possible combinations to construct the candidate list. Specifically,
$V_c^n = \text{Top-}k\{p([MASK]_1 = w \mid \tilde{x}_n)\}_{w \in V} \times \cdots \times \text{Top-}k\{p([MASK]_n = w \mid \tilde{x}_n)\}_{w \in V}$,
where $V$ is the whole vocabulary and $\times$ means the Cartesian product.
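The top-k and Cartesian-product construction can be sketched as below. The per-mask probability dictionaries are made-up numbers standing in for MLM outputs, and joining pieces by stripping the "##" prefix is a simplification assumed here; how pieces are merged into words is not specified in the text.

```python
from itertools import product

def top_k(probs, k):
    """Return the k most probable word pieces from a {piece: probability} dict."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [piece for piece, _ in ranked[:k]]

def join_pieces(pieces):
    """Merge word pieces into one word, dropping the '##' continuation prefix."""
    word = pieces[0]
    for piece in pieces[1:]:
        word += piece[2:] if piece.startswith("##") else piece
    return word

def candidates_for_n(mask_probs, k):
    """Enumerate all combinations of the top-k pieces of each mask."""
    per_mask = [top_k(p, k) for p in mask_probs]
    return {join_pieces(combo) for combo in product(*per_mask)}

def candidate_list(probs_by_n, k_by_n):
    """Union of the per-n candidate sets."""
    merged = set()
    for n, mask_probs in probs_by_n.items():
        merged |= candidates_for_n(mask_probs, k_by_n[n])
    return merged

# Hypothetical MLM outputs for n = 1 and n = 2 masks.
probs_by_n = {
    1: [{"apple": 0.6, "cherry": 0.3, "the": 0.1}],
    2: [{"app": 0.5, "pump": 0.4, "a": 0.1},
        {"##le": 0.7, "##kin": 0.2, "##s": 0.1}],
}
candidates = candidate_list(probs_by_n, {1: 2, 2: 2})
```

Because the product grows as $k^n$, a smaller $k$ for larger $n$ keeps the list manageable; some combinations (such as "appkin") are not real words and are expected to be filtered out by the later selection step.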
There may be multiple words that make sense for the replacement. In this case, the spelling of the wrong word is useful for finding the most likely correct word. We therefore use the edit distance to select the final word:
$w_c = \arg\min_{w \in V_c} E(w, x_j)$,
where $E(w, x_j)$ represents the edit distance between $w$ and the noisy word $x_j$.
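The selection rule can be sketched as follows, assuming the Levenshtein distance (unit-cost insertion, deletion, and substitution) as the edit distance; the text does not state which variant is used. The candidate words are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def select_word(candidates, noisy_word):
    """Pick the candidate closest in spelling to the noisy word."""
    return min(candidates, key=lambda w: edit_distance(w, noisy_word))

chosen = select_word(["luggage", "language", "package"], "langage")
```

When several candidates tie on distance, `min` returns the first one encountered, so in practice the candidate order (for example, by MLM probability) would act as a tie-breaker.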
Algorithm 1: Denoising with MLM

We test the performance of the proposed text denoising method on three downstream tasks: neural machine translation, natural language inference, and paraphrase detection. All experiments are conducted with NVIDIA Tesla V100 GPUs. We use the pretrained PyTorch BERT-large (with whole word masking) as the masked language model. For the denoising algorithm, we use at most N = 4 masks for each word; the detailed configuration of the size of the candidate list is shown in Table 2. We use a large candidate list for a single word piece, which covers most cases. For multiple masks, a smaller list is good enough.
For all tasks, we train the task-specific model on the original clean training set. We then compare the model performance on different test sets, including the original test data, the noisy test data, and the cleaned noisy test data. We use a commercial-level spell-checker API as our baseline method.
In this section, we first introduce how the noise is generated, and then present experimental results on three NLP tasks.

Noise
To control the noise level, we randomly pick words from the testing data to be perturbed with a certain probability. For each selected word, we consider two perturbation settings: artificial noise and natural noise. Under the artificial noise setting, we separately apply four kinds of noise, each with a certain probability:
• Swap: We swap two letters per word.
• Delete: We randomly delete a letter in the middle of the word.
• Replace: We randomly replace a letter in a word with another.
• Insert: We randomly insert a letter in the middle of the word.
Following the setting in Belinkov and Bisk (2017), the first and the last characters remain unchanged.
For the natural noise, we follow the experiment of Belinkov and Bisk (2017), which harvests naturally occurring errors (typos, misspellings, etc.) from the edit histories of available corpora. It generates a lookup table of all possible errors for each word. We replace the selected words with the corresponding noise in the lookup table according to their settings.
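The four synthetic perturbations could be implemented roughly as below. Swapping two adjacent letters, drawing replacement and inserted letters from the lowercase ASCII alphabet, and leaving very short or non-alphabetic words untouched are assumptions of this sketch, not details stated in the text.

```python
import random
import string

def swap(word, rng):
    """Swap two adjacent interior letters (assumed adjacent)."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def delete(word, rng):
    """Delete one interior letter."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1:]

def replace(word, rng):
    """Replace one interior letter with a different lowercase letter."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    letter = rng.choice([c for c in string.ascii_lowercase if c != word[i]])
    return word[:i] + letter + word[i + 1:]

def insert(word, rng):
    """Insert one lowercase letter between the first and last characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]

def add_noise(words, rate, rng):
    """Perturb each alphabetic word with probability `rate`,
    choosing one of the four noise types uniformly."""
    ops = (swap, delete, replace, insert)
    return [rng.choice(ops)(w, rng) if w.isalpha() and rng.random() < rate else w
            for w in words]

rng = random.Random(0)
noisy = add_noise("we evaluate contextual text denoising".split(), 0.2, rng)
```

Every operation only touches interior positions, so the first and last characters of a word are preserved, matching the constraint stated above.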
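The lookup-table substitution can be sketched in the same style. The table entries below are invented examples, not items from the harvested error corpora, and `natural_noise` is a hypothetical helper name.

```python
import random

def natural_noise(words, error_table, rate, rng):
    """With probability `rate`, replace a word by one of its recorded misspellings.
    Words without an entry in the table are left unchanged."""
    noisy = []
    for word in words:
        errors = error_table.get(word)
        if errors and rng.random() < rate:
            noisy.append(rng.choice(errors))
        else:
            noisy.append(word)
    return noisy

# Hypothetical lookup table: clean word -> observed misspellings.
ERROR_TABLE = {"receive": ["recieve", "recive"], "definitely": ["definately"]}

sample = natural_noise(["i", "definitely", "receive", "mail"],
                       ERROR_TABLE, 1.0, random.Random(0))
```

Unlike the synthetic setting, only words that appear in the table can be perturbed, so the effective noise rate depends on the table's coverage.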

Neural Machine Translation
We conduct the English-to-German translation experiments on the TED talks corpus from the IWSLT 2014 dataset. The data contains about 160,000 sentence pairs for training and 6,750 pairs for testing.
For the artificial noise setting, we perturb 20% of the words and apply each noise type with probability 25%. For the natural noise setting, we also perturb 20% of the words. All experiment results are summarized in Table 3, where we use the BLEU score (Papineni et al., 2002) to evaluate translation quality. As can be seen, both the fairseq model and Google Translate suffer a significant performance drop on noisy texts under both natural and synthetic noise. When using the spell-checker, the performance drops even further. In contrast, our proposed method alleviates the performance drop.

Natural Language Inference
We test the algorithm on the Natural Language Inference (NLI) task, which is one of the most challenging tasks related to the semantics of sentences. We base our experiment on the SNLI (Stanford Natural Language Inference; Bowman et al., 2015) corpus. We use accuracy as the evaluation metric for SNLI.
We use the state-of-the-art 400-dimensional Hierarchical BiLSTM with Max Pooling (HBMP) (Talman et al., 2019). The implementation follows the publicly released code. We use the same noise setting as in the NMT experiments. All results are presented in Table 4. We observe a performance improvement with our method. To see whether the denoising algorithm would introduce noise into clean texts, we also apply the algorithm to the original sentences and check whether performance degrades. It can be seen that, unlike the traditional robust model approach, applying the denoising algorithm to clean samples has little influence on performance.

As shown in Table 4, the accuracy is very close to the original one under the artificial noise. Natural noise contains punctuation and is more complicated than artificial noise. As a result, inference becomes much harder in this setting.

Paraphrase Detection
We conduct paraphrase detection experiments on the Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), which consists of 5800 sentence pairs extracted from news sources on the web. It is manually labelled for the presence or absence of semantic equivalence.
We evaluate the performance using the state-of-the-art model: fine-tuned RoBERTa (Liu et al., 2019). All implementation details follow the publicly released code. All experiment results are summarized in Table 5. We increase the size of the candidate list to 10000 + 25 + 27 + 16 = 10068 because there are many proper nouns, which are hard to predict.
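The candidate list size above is consistent with taking the top $k$ pieces per mask and enumerating $k^n$ combinations for $n$ masks. The per-mask values $k = 5, 3, 2$ for $n = 2, 3, 4$ used below are inferred from $25 = 5^2$, $27 = 3^3$, and $16 = 2^4$; they are an assumption of this sketch rather than values stated in the text.

```python
def candidate_list_size(top_k_by_n):
    """Upper bound on the candidate list size: sum of k**n over mask counts n
    (duplicates across different n are not removed)."""
    return sum(k ** n for n, k in top_k_by_n.items())

# n -> top k; k for n >= 2 is inferred from the reported sizes (assumption).
size = candidate_list_size({1: 10000, 2: 5, 3: 3, 4: 2})
```

The single-mask list dominates the total, which matches the remark that a large list is used for one word piece while much smaller lists suffice for multiple masks.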

Conclusion and Future Work
In this paper, we present a novel text denoising algorithm using a ready-to-use masked language model. We show that the proposed method can recover noisy text from contextual information without any training or data. We further demonstrate the effectiveness of the proposed method on three downstream tasks, where our method alleviates the performance drop. A promising future research topic is how to design a better candidate selection rule rather than merely using the edit distance. We can also try to use GEC corpora, such as CoNLL-2014, to further fine-tune the denoising model in a supervised way to improve the performance.

Table 1 :
Illustrative example of spell-checker and contextual denoising.

Table 2 :
Size of the candidate list. Columns: No. of [MASK] (n), Top k, Size.

Table 3 :
BLEU scores of EN-to-DE translation

Table 4 :
SNLI classification accuracy with artificial noise and natural noise.
* : Applying denoising algorithm on original texts.