TNT: Text Normalization Based Pre-training of Transformers for Content Moderation

In this work, we present TNT (Text Normalization based pre-training of Transformers), a new language pre-training model for content moderation. Inspired by the masking strategy and by text normalization, TNT learns language representations by training transformers to reconstruct text corrupted by four operation types typically seen in text manipulation: substitution, transposition, deletion, and insertion. The normalization involves predicting both operation types and token labels, so TNT learns from more challenging tasks than the standard task of masked word recovery. Experiments demonstrate that TNT outperforms strong baselines on the hate speech classification task, and additional text normalization experiments and case studies show that TNT is a promising new approach to misspelling correction.


Introduction
Language model pre-training (self-supervised or unsupervised learning) has recently been a popular thread in natural language processing (NLP) research due to its universal representation capacity (Radford et al., 2018; Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020). It has thus been widely used in a multitude of language processing tasks such as named entity recognition, sentiment analysis, question answering, and content moderation (Bodapati et al., 2019). In addition, the masking pre-training paradigm introduced by BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) has been employed for other tasks such as image processing (Trinh et al., 2019), optical flow (Liu et al., 2019a), and audio-visual co-segmentation (Rouditchenko et al., 2019).
Recently, many variants have been proposed to further improve the pre-training procedure (Liu et al., 2019b; Sun et al., 2019), and they have consistently advanced the state-of-the-art performance on multiple downstream natural language understanding tasks (Leaderboard). Almost all of these studies train language models by predicting masked words in different manners; the underlying mechanism is the Cloze task (Taylor, 1953). The pre-training model itself, however, has not been fully exploited to address complicated yet feasible tasks. It is reasonable to expect that models can learn a better universal language representation if the pre-training procedure is aligned with more challenging tasks.
In this article, we attempt to improve the language representation by proposing TNT: Text Normalization based pre-training of Transformers. TNT enhances language learning with a text normalization pre-training objective inspired by misspelling correction. Specifically, TNT randomly manipulates tokens from the input text. The objective is then to reconstruct the original tokens of the manipulated words based on the context by predicting both the recovery operation type and the original token labels, as illustrated in Fig. 1. Unlike the masked language model, TNT has to make two predictions to reconstruct the original text. In addition, TNT is not given the prediction positions in advance, which aligns with the fact that misspelling correction must perform position-agnostic prediction in both aspects.
Perpetrators often intentionally obfuscate certain words about groups, or abusive words, through misspelling or leetspeak (e.g., "f@ggot", "ph*ck", "w.e.t.b.a.c.k.") (Perea et al., 2008). This can easily sidestep content moderation algorithms, as exemplified in Table 5. To assess the learning capacity of TNT for obfuscated text and to reduce training cost, TNT is pre-trained on only one dataset and then applied to three datasets for a hate speech detection task, where it achieves better results than strong baselines. Furthermore, we conduct an experiment on misspelling correction and demonstrate that TNT has appealing language understanding capacity.
Our contributions are summarized as follows: (1) we introduce text normalization into the language pre-training paradigm, which involves the challenging tasks of predicting both operation types and token labels; (2) we show that TNT advances the challenging downstream text classification task, which benefits content moderation; (3) TNT offers a new perspective on misspelling correction.

Related Works
BERT (Devlin et al., 2019) introduces the bidirectional encoder of the well-known Transformer to learn contextual representations of text, underpinned by the attention mechanism (Vaswani et al., 2017). It randomly masks a certain portion of tokens from the input and then learns to predict these masked words. This cloze-task based pre-training strategy enables BERT to advance the state-of-the-art performance on various key NLP tasks, and it has inspired a plethora of subsequent works in the community (Yang et al., 2019; Sun et al., 2019; Liu et al., 2019b). Among them, several are closely related to our approach: XLNet (Yang et al., 2019) and StructBERT improve the masking by imposing permutation and shuffling among words and sentences. StructBERT is one of the current state-of-the-art algorithms topping the GLUE leaderboard¹ (Leaderboard) and is most similar to this work. TNT differs from StructBERT in that it is inspired by the need for misspelling correction, and therefore allows not only permutation of words, but also deletion and insertion.
For online abusive language moderation, BERT has also been shown to be effective and largely advances the overall performance (Bodapati et al., 2019). In addition, early works formulate hate speech detection as generic text classification, or alternatively focus on certain ethnic groups or on building blacklists of swear words (Nobata et al., 2016a; Badjatiya et al., 2017). Misspelling correction is also a long-standing problem in NLP (Hirst and Budanitsky, 2005; Islam and Inkpen, 2009; Bassil, 2012) and has been widely used in real-world scenarios such as word processors and email spell checking.

¹The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. It consists of 9 sentence- or sentence-pair language understanding tasks.
Pre-training

Architecture
We develop TNT based on a multi-layer bidirectional Transformer encoder, as used in popular language models like BERT (Devlin et al., 2019). The well-known masking strategy employed in BERT and subsequent works is inspired by the cloze task for human-level language understanding, in which people are required to fill in omitted words in a passage based on context. Unlike the cloze procedure, text normalization involves more diversified and challenging tasks. The goal is to rewrite a sentence that was not properly formed, either due to misspelling or due to intentional manipulation. A perpetrator misspells on purpose, with the intention of evading detection, through substitution, transposition, deletion, and insertion, as exemplified in Table 1. The task of text normalization is to understand the manipulated sentence and normalize it to its correct form. Therefore, motivated by text normalization, TNT adopts the pre-training tasks of substitution, transposition, deletion, and insertion. The masking procedure can be viewed as a special case of the text normalization objective, under substitution.
For substitution (e.g., masking) and insertion, predicting the corresponding token label alongside the operation type is required to reconstruct the text. Specifically, we have a normalization type o ∈ O = {0, 1, 2, 3, 4}, where o is the operation type with values 0 (substitution), 1 (transposition), 2 (deletion), 3 (insertion), or 4 (identity), as exemplified in Table 1. Token labels l ∈ V are the ground-truth tokens for the corresponding types, where V is the vocabulary. Note that for transposition, deletion, and identity, no token labels are required to reconstruct the original text; we therefore introduce a special token symbol [NLB] as the placeholder. Fig. 1 illustrates the generation procedure of pre-training instances and the joint training of the two subtasks.
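As a concrete illustration, the instance generation described above can be sketched as follows. The operation codes and the [NLB] placeholder follow the text; the sampling procedure (per-position rates, and recovering a deleted token via the following position) is our own simplified reading, not the paper's released implementation.

```python
import random

# Recovery-operation codes from the paper:
SUB, TRANS, DEL, INS, IDENT = 0, 1, 2, 3, 4
NLB = "[NLB]"  # placeholder label when no token needs to be predicted

def make_instance(tokens, vocab, rate=0.05, rng=None):
    """Corrupt `tokens` and emit (corrupted, op_labels, token_labels).

    The labels name the operation that RECOVERS the original text at each
    position: e.g. a junk token the generator inserted is labelled "delete",
    while a substituted position carries the original token as its label.
    """
    rng = rng or random.Random(0)
    corrupted, ops, labels = [], [], []
    i = 0
    while i < len(tokens):
        r = rng.random()
        if r < rate:  # substitute with a random vocabulary token
            corrupted.append(rng.choice(vocab))
            ops.append(SUB)
            labels.append(tokens[i])
        elif r < 2 * rate and i + 1 < len(tokens):  # swap with the next token
            corrupted += [tokens[i + 1], tokens[i]]
            ops += [TRANS, TRANS]
            labels += [NLB, NLB]
            i += 1
        elif r < 3 * rate:  # insert a junk token; recovery = delete it
            corrupted += [rng.choice(vocab), tokens[i]]
            ops += [DEL, IDENT]
            labels += [NLB, NLB]
        elif r < 4 * rate and i + 1 < len(tokens):  # delete a token; the next
            corrupted.append(tokens[i + 1])         # position must re-insert it
            ops.append(INS)
            labels.append(tokens[i])
            i += 1
        else:  # leave the token untouched
            corrupted.append(tokens[i])
            ops.append(IDENT)
            labels.append(NLB)
        i += 1
    return corrupted, ops, labels
```

Because the three output sequences are aligned position by position, the model can predict an operation type and a token label at every slot without being told the manipulation positions in advance.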

Objective
Given an input sequence with manipulated tokens, the operation type and token label objectives can be written as

$\mathcal{O}_{\text{operation}} = \arg\max_{\theta_o} \sum_{i=1}^{M} w^o_i \log p(o_i \mid t_1, \dots, t_M; \theta_o)$

$\mathcal{O}_{\text{label}} = \arg\max_{\theta_l} \sum_{i=1}^{M} w^l_i \log p(l_i \mid t_1, \dots, t_M; \theta_l)$

respectively. Here $t_i$ is the observed token from the input sentence; $o_i$ and $l_i$ are the ground-truth operation type and token label, as illustrated in Fig. 1; $M$ is the maximum length of the input sequence; $\theta_o$ and $\theta_l$ are sets of trainable parameters; and $w^o_i$ and $w^l_i$ are the weights for operation type and token label, respectively. The overall pre-training objective is $\mathcal{O} = \mathcal{O}_{\text{operation}} + \mathcal{O}_{\text{label}}$.

Experiments
We perform the hate speech classification and misspelling correction tasks based on the pre-trained model.

Datasets
Our primary dataset is extracted from user comments on Yahoo News and Finance and consists of 1.43M labeled comments. Among them, 7% of the comments are labeled as abusive (including hate speech and profanity). The labeled data were collected as follows: comments reported as "abusive" for any reason by users of Yahoo properties are sent to in-house trained raters for review, and the decisions of the raters form the labels. Further details can be found in (Nobata et al., 2016b). In addition, we experimented on two publicly available hate speech datasets: Twitter, and Wikipedia (Wiki) (Agrawal and Awekar, 2018). The Wiki set is a collection of discussions among editors on talk pages for improving Wiki articles, which includes inflammatory posts. The statistics of the three datasets are shown in Table 2.
We split the dataset into train/development/test sets with a ratio of 70%/10%/20%. We generate the vocabulary, pre-train the language modeling tasks, and train the hate speech prediction task using only the training set. We tune hyper-parameters on the development set and report final results on the test set.
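A minimal sketch of this protocol, assuming a simple shuffled split (the paper does not specify the exact splitting procedure):

```python
import random

def split(examples, seed=0):
    """70/10/20 train/dev/test split; the shuffle-and-slice scheme and the
    fixed seed are our assumptions, not the paper's exact procedure."""
    rng = random.Random(seed)
    ex = list(examples)
    rng.shuffle(ex)
    n = len(ex)
    n_tr, n_dev = int(0.7 * n), int(0.1 * n)
    return ex[:n_tr], ex[n_tr:n_tr + n_dev], ex[n_tr + n_dev:]
```

Fixing the seed keeps the split reproducible, so that vocabulary generation, pre-training, and fine-tuning all see exactly the same training portion.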

Experiment Setups
For TNT pre-training, the substitution follows the masking setting in BERT. We also set a 5% manipulation rate each for transposition, deletion, and insertion; the rest of the tokens remain unchanged. The vocabulary generation, wordpiece tokenization, learning rate, weight decay, warm-up, and other training settings follow BERT. The model size is reduced to a quarter of the original BERT. The maximum length of the input sequence M is set to 256. The parameter scale is O = (V + M + S) × H + L × 12H² + H², where V and S are the vocabulary size |V| and the segment type size, and H and L are the hidden layer dimension and the number of transformer block layers, respectively.
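The parameter-scale formula can be checked with a small helper; the concrete V, M, S, H, and L values below are hypothetical placeholders, not the paper's exact configuration:

```python
def param_scale(V, M, S, H, L):
    """Parameter count from the formula in the text:
    (V + M + S) * H embedding parameters, L * 12H^2 for the transformer
    blocks, and a final H^2 projection."""
    return (V + M + S) * H + L * 12 * H ** 2 + H ** 2

# Hypothetical quarter-size configuration for illustration only.
n = param_scale(V=30000, M=256, S=2, H=256, L=3)
```

The 12H² term per layer follows from the standard Transformer block: 4H² for the attention projections (query, key, value, output) and 8H² for the two feed-forward matrices of size H × 4H.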
We mainly report two models with wordpiece and character inputs, as detailed in Table 3. The main difference is that manipulation and normalization are performed on different levels; wordpiece- and character-level operations are exemplified in Fig. 1. For both TNT models, we run the pre-training procedure for 64 epochs. We pre-train quarter-size BERT (wordpiece) and StructBERT (wordpiece) with the same dataset, epochs, and wordpiece vocabulary. In the fine-tuning phase, the batch size, the number of batches, and the learning rate are set to 64, 10, and 2e-5, respectively. For all models, we discard the sentence-pair training objective.

Results
For the downstream hate speech classification, we fine-tune on top of the aggregate embedding [CLS] of wordpiece TNT, as BERT does. We adopt the threshold-free AUC@ROC and AUC@PR (Davis and Goadrich, 2006), as well as the threshold-based (0.5 used here) F1 score and Matthews correlation coefficient (MCC); MCC and AUC@PR are generally regarded as balanced metrics. We first report the comparison results on the Yahoo set. As shown in Table 4, TNT outperforms BERT and StructBERT on the test set. In particular, it achieves more than 1% and 2% improvement in terms of MCC and F1 score, respectively. The performance gain can reasonably be attributed to the new training objective. We further fine-tune the pre-trained models directly on the Twitter and Wikipedia sets, respectively; the advantage of TNT over the baselines still holds. The superiority of TNT on the classification task over multiple datasets signals that the text normalization based training strategy is a promising direction for better universal language representation learning.

Although all three sets are user-generated content, Wiki users are somewhat different. As Wiki itself is a collaborative knowledge repository, editors are likely to attack others due to disputes over specific domain knowledge, whereas Yahoo and Twitter users are the general public, who post comments and tweets more casually. In this context, misspellings in Wiki are likely to be less severe and less intentional than in the other sets. The way we develop the model enables it to learn better representations especially for garbled text compared to standard masking schemes; thus, the performance gain is more salient on Yahoo and Twitter.
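For reference, the threshold-based MCC reported above can be computed as follows; this is a generic implementation of the standard definition, not the authors' evaluation code.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary predictions
    (already thresholded at 0.5, as in the setup described above)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Degenerate case (a row or column of the confusion matrix is empty):
    # conventionally reported as 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC ranges from -1 to 1 and, unlike accuracy or F1, accounts for all four confusion-matrix cells, which is why it is considered balanced on a set where only 7% of comments are abusive.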
To understand the performance difference more intuitively, Table 5 presents some specific error case analysis, where toxic comments are created by users to attack a certain group of people. The key parts are all intentionally manipulated to evade detection.

Misspelling Correction
Misspelling correction is a long-standing research topic (Islam and Inkpen, 2009; Whitelaw et al., 2009; Bassil, 2012) and has been widely commercialized as a service, such as Bing Spell Check (Bing) and Grammarly (Grammarly). TNT can readily be employed for misspelling correction. We evaluate TNT here using its character-level variant without additional fine-tuning. We aggressively misspell the test set of Yahoo in Table 2, corrupting 15% of each sample, and then employ the pre-trained TNT model to recover the text. For comparison, we also examine an open-source autocorrection tool (Autocorrect) as a reference. Edit distance (Distance) and BLEU (BLEU) are adopted to measure the distance and similarity between corrected samples and the original ones, as detailed in Table 7. TNT performs significantly better than the dictionary look-up algorithm. In addition, we cross-check the results between TNT and popular commercial spell check products through the case studies reported in Table 6. Among the tools, Bing leads the performance, followed by Google Docs and Grammarly, while Autocorrect performs the worst. Overall, TNT functions very well, particularly on case 1, which combines multiple challenging misspellings. It is noted that "received" and "definitely" are two of the most commonly misspelled words (Words), yet TNT fails to correct "definately"; "definitely" is not strongly determined by the context of the whole sentence here, and the limited training set might also restrict the correction capacity.
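The edit-distance side of this evaluation can be reproduced with a standard Levenshtein implementation; this is a generic sketch, not the authors' scoring script.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings: the metric used (alongside
    BLEU) to compare corrected samples with the originals."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]
```

For example, the failure case discussed above is a single substitution away from the correct form: `edit_distance("definately", "definitely")` is 1. Averaging this distance between each corrected sample and its original gives the Distance column of Table 7.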

Discussions and future work
We conducted experiments on three classification tasks. The Yahoo Finance and News dataset, while one of the largest in the context of hate speech classification, is nevertheless small in the context of language modeling. We plan to perform large-scale pre-training and evaluation on the GLUE datasets for a comprehensive analysis.
This work targets sentence-level language understanding. As far as we know, no data is available for misspelled words in the context of sentences, so we have to generate the evaluation set ourselves. The main goal here is not to develop a more powerful misspelling corrector, but rather to propose a new and stronger language modeling approach; we thus do not set up a strict and comprehensive evaluation for an apples-to-apples comparison on spelling correction. We will continue to explore this line in the future.

Conclusion
In this work, we propose TNT, a new language representation training strategy. TNT improves language modeling by training a transformer to reconstruct text from four operation types typically seen in text manipulation. We show that when fine-tuned for the content moderation task of detecting hate speech, the new model performs better than state-of-the-art baselines. We also demonstrate its effectiveness in misspelling correction.