Shuffled-token Detection for Refining Pre-trained RoBERTa

State-of-the-art transformer models have achieved robust performance on a variety of NLP tasks. Many of these approaches employ domain-agnostic pre-training tasks to train models that yield highly generalized sentence representations, which can then be fine-tuned for specific downstream tasks. We propose refining a pre-trained NLP model using the objective of detecting shuffled tokens. We take a sequential approach, starting with the pre-trained RoBERTa model and training it further with our objective. Applying a random shuffling strategy at the word level, we found that our approach enables the RoBERTa model to achieve better performance on 4 out of 7 GLUE tasks. Our results indicate that learning to detect shuffled tokens is a promising approach for learning more coherent sentence representations.


Introduction
Pre-training natural language models has been shown to greatly improve model performance on a wide range of NLP tasks (Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018). State-of-the-art models that use transformers and deep bi-directional representations of text, such as BERT, RoBERTa, and ALBERT (Devlin et al., 2019; Lan et al., 2020), have achieved superior results by pre-training on large, general corpora to learn rich representations from unlabeled data. Particularly helpful when labeled training data is scarce, unsupervised pre-training has become the first step for many language models to build powerful linguistic representations before fine-tuning on downstream target tasks.
BERT-style models use masked language modeling (MLM), and sometimes next sentence prediction, as pre-training tasks. While these tasks have been shown to produce transferable sentence representations for many NLP tasks, using additional domain-agnostic pre-training tasks such as sentence shuffling may improve model performance.
A seminal cognitive psychology study demonstrated that humans have a well-trained ability to parse shuffled sentences (McCusker et al., 1981). Moreover, it has been shown that pre-trained models sometimes overlook word order when making predictions (Pham et al., 2020), and that encouraging models to capture word order improves classification performance. Shuffling as a pre-training task may therefore help transformer models achieve even better performance on NLP tasks.
Drawing inspiration from recent work on reconstructing shuffled text (Lewis et al., 2020; Raffel et al., 2020), we propose that pre-training the RoBERTa model with a token modification discrimination head on randomly shuffled sentences provides a constructive learning objective, which helps the model learn coherent representations and recognize the key pieces of a sentence and their associations. To substantiate this argument, we design experiments to examine the performance of RoBERTa with the proposed approach. The results demonstrate that pre-training the model with shuffled sentences improves the scores on a majority of GLUE tasks.

Related Work
Shuffling sentences and words has often been used as a downstream task to evaluate model performance. One relevant example is the work by Sakaguchi et al. (2017) to develop a semi-character RNN model that surpasses previous spell-check methodologies on the Cmadbrigde Uinervtisy effect, where humans can easily reconstruct the shuffled token. Yang and Gao (2019) explored the performance of BERT on a shuffled sentence downstream task and highlighted some induced bias in the model that is the cause of incorrect predictions for noisy inputs. While the authors propose removing the induced bias from the representations to improve results, they do not consider the possibility of pre-training the model with shuffled sentences.
The use of unordered or noisy data in model training itself has proven effective. A number of studies have focused on using shuffled input to create useful sentence representation vectors for language models. Kiros et al. (2015) developed the skip-thoughts method to accomplish the task of reconstructing sentence order from a shuffled input. The authors used an encoder-decoder RNN model at the sentence level that allows a sentence to predict its adjacent sentences. Logeswaran et al. (2016) explored how sentence ordering tasks can help models learn text coherence. Using an RNN-based approach, they train models to identify the correct ordering of sentences and show that models learn both document structure and useful sentence representations during this task. Jernite et al. (2017) employed discourse-based learning objectives to help models understand discourse coherence. Specifically, given some sentences, they ask the model to predict whether the sentences are in order, whether a given sentence follows a set of sentences, or which conjunction joins the sentences. They showed that training models with these objectives achieves a significant reduction in computational training costs and is also effective when using unlabeled data.
There are a number of papers that focus on word-level shuffling, as opposed to sentence-level shuffling. Hill et al. (2016) developed the Sequential Denoising Autoencoder (SDAE) method, where a sentence is corrupted using a noise function determined by free parameters. After a certain percentage of words have been corrupted, an LSTM encoder-decoder model is tasked with predicting the original sentence from the corrupted version. The authors demonstrate that training with noisy inputs allowed SDAE to significantly outperform regular SAE models, which did not introduce word-level noise.
One closely related paper in the field of computer vision leverages the use of shuffled input in model training. Noroozi and Favaro (2016) employ a CNN model that is trained to solve jigsaw puzzles to determine correct spatial representation. Their results show that using shuffled input helps models learn that images are made up of different parts, and their relationship to the whole.
Finally, a variety of studies demonstrate that further pre-training after the general-purpose BERT pre-training leads to better results than simply fine-tuning on the downstream task. Domain-specific pre-training such as BioBERT, story ending prediction with TransBERT (Li et al., 2019), and video caption classification with VideoBERT (Sun et al., 2019) are all examples where expanding the pre-training tasks for BERT has improved model performance. TransBERT in particular demonstrates that further pre-training using targeted supervised tasks achieves better results than relying only on the unsupervised pre-training in BERT.

Methodology
Consider a sequence of tokens x. We first obtain x_shuffled from x by shuffling a set of tokens of x. Given x_shuffled, we detect whether tokens are shuffled or not by using a token modification discrimination head on top of the RoBERTa base model. Our choice of the discriminative head is motivated by the recent success of ELECTRA (Clark et al., 2019).
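The text does not spell out the loss used by the discrimination head. As a minimal sketch, assuming an ELECTRA-style objective in which the head emits one logit per token and is trained with binary cross-entropy against the shuffled/not-shuffled labels, the per-sequence loss could be computed as follows. The function name `token_detection_loss` and the plain-Python formulation are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def token_detection_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over tokens (assumed ELECTRA-style objective).

    logits[t] is the head's raw score for token t; labels[t] is 1 if the
    token was shuffled and 0 otherwise.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not logits:
        raise ValueError("at least one token is required")
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

Under this assumption, an uninformative head (all logits zero) incurs a loss of log 2 per token, and the loss approaches zero as the logits become confidently correct.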

Creating Shuffled Tokens for Training
We permute text sequences at the word level, rather than the sub-word level, based on a probability p. One straightforward way to achieve this is to shuffle the words of a sequence first and then tokenize the shuffled sequence with RobertaTokenizer. However, this approach is problematic, since the number of sub-words after tokenization may differ between the original and the shuffled sentence. To ensure that the sub-words belonging to a word stay together and are not shuffled apart, we create a mapping from each sub-word to its corresponding word. We then tokenize the original sequence and shuffle the tokens based on this mapping, so that all the sub-words belonging to a word occur together. Finally, we define a target tensor with a binary label for each token that specifies whether the token was shuffled or not.
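The procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `toy_tokenize` is a hypothetical stand-in for a real sub-word tokenizer, each word is selected independently with probability p, and the selected words are permuted among themselves. A real RoBERTa tokenizer is sensitive to leading spaces, which this sketch ignores.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a sub-word tokenizer: 3-character pieces."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def shuffle_words(
    words: Sequence[str],
    tokenize_word: Callable[[str], List[str]] = toy_tokenize,
    p: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Shuffle whole words and return (sub-word tokens, per-token labels)."""
    rng = rng or random.Random()
    # Select word positions independently with probability p.
    picked = [i for i in range(len(words)) if rng.random() < p]
    # Permute the selected positions among themselves.
    permuted = picked[:]
    rng.shuffle(permuted)
    # source[pos] is the index of the original word now placed at pos.
    source = list(range(len(words)))
    for dst, src in zip(picked, permuted):
        source[dst] = src
    tokens: List[str] = []
    labels: List[int] = []
    for pos, src in enumerate(source):
        # Tokenize word by word so a word's sub-words always stay together.
        pieces = tokenize_word(words[src])
        tokens.extend(pieces)
        # Label 1 only if the word actually moved; a selected word that the
        # permutation maps back to its own position keeps label 0.
        labels.extend([int(src != pos)] * len(pieces))
    return tokens, labels
```

A call such as `shuffle_words("the quick brown fox".split(), p=0.5, rng=random.Random(0))` returns the token list and a label list of equal length. Labeling by "did the word move" rather than "was the word selected" is a design choice of this sketch; the text does not say which convention was used.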

Shuffling Strategy
We randomly permute the words in a sequence based on a probability p for the experiments highlighted in Section 4. Note that a fraction ≥ p of the input tokens would be shuffled, since one or more input tokens (sub-words) belong to a single word.

RoBERTa Model with Token Modification Discrimination Head

Baseline
As our baseline approach, we trained the RoBERTa base model with the token modification discrimination head for detecting masked tokens instead of detecting shuffled tokens. The baseline training was done for the same number of optimization steps as the proposed approach for a fair comparison.

Dataset for Shuffled-Token Detection
We extracted 133K articles from Wikidump. 2 We used each paragraph in the extracted text as a data sample for our model. We filtered out samples that were either spaces-only or had more than 512 tokens after tokenizing with the pretrained RobertaTokenizer of the roberta-base model. We finally randomly split the samples into 1.3M for training and 14K for validation.
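The filtering and splitting step could look roughly like the sketch below. The helper name `filter_and_split`, the `count_tokens` callable, and the fixed seed are assumptions introduced for illustration. With HuggingFace transformers, `count_tokens` might be something like `lambda s: len(tokenizer(s)["input_ids"])`, although whether the 512-token limit counts special tokens is not stated in the text.

```python
import random
from typing import Callable, List, Sequence, Tuple


def filter_and_split(
    paragraphs: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
    n_valid: int = 14_000,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Drop unusable paragraphs, then split randomly into (train, valid)."""
    kept = [
        p for p in paragraphs
        # Discard whitespace-only samples and samples that are too long.
        if p.strip() and count_tokens(p) <= max_tokens
    ]
    random.Random(seed).shuffle(kept)
    n_valid = min(n_valid, len(kept))
    return kept[n_valid:], kept[:n_valid]
```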
Dataset for masked token detection. We used the same Wikidump dataset for the baseline approach as well, where we continue training the pre-trained RoBERTa on the objective of detecting masked tokens.

Implementation
We built our model using HuggingFace transformers (Wolf et al., 2020). All experiments have been performed using the RoBERTa base model with the token modification discrimination head described in Section 3.3.
The hyperparameters used in our experiments follow the hyperparameters of the RoBERTa base model except for the warmup steps, batch size, peak learning rate, and the maximum training steps. For our experiments, we use 100 linear warmup steps followed by linear decay of the learning rate outlined in Figure 3.
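For concreteness, a linear warmup followed by linear decay can be written as a small function of the step index. This is a sketch of the schedule's shape only; the default values mirror numbers reported in this section (100 warmup steps, 1000 maximum steps, peak learning rate 1e-4), and the function name is an assumption. HuggingFace transformers provides `get_linear_schedule_with_warmup` with the same shape, though the text does not state which implementation was used.

```python
def linear_warmup_decay_lr(
    step: int,
    peak_lr: float = 1e-4,
    warmup_steps: int = 100,
    max_steps: int = 1000,
) -> float:
    """Learning rate at a given optimization step."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at max_steps, then stay at 0.
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)
```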
To find the optimal peak learning rate and maximum number of steps, we performed a hyperparameter search over the learning rates {1e-4, 5e-5, 1e-6} and over maximum steps in [100, 1000] with a step size of 100. The change in learning rate with increasing optimization steps for the different peak learning rates is shown in Figure 3. The validation loss over an increasing number of optimization steps for the different learning rates is illustrated in Figure 2, and the training loss is outlined in Figure 4. We observe that the minimum training loss, as well as the minimum validation loss, is achieved with the peak learning rate of 1e-4. Moreover, both the training loss and the validation loss keep decreasing up to 1000 steps, which suggests that training the model for more steps could be beneficial. The optimal maximum number of steps, as shown in Figures 2 and 4, is 1000. For training our baseline approach of detecting masked tokens, we set the learning rate to 1e-4.
The probability of masking tokens (sub-words) in the baseline approach was fixed to 0.15, as done in previous work (Devlin et al., 2019). For the proposed approach, we also set the probability p of shuffling tokens (words) to 0.15.
2 Timestamp May 9th, 2020. We used the scripts from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#getting-the-data to extract the data.
On using large batch sizes. Pre-training procedures have been shown to be effective when using large batch sizes. Training our model directly with a very large batch size required computational power beyond what was available. To alleviate this problem, we used gradient accumulation for 64 steps with a per-GPU batch size of 16. We used distributed training on 4 Nvidia K80 GPUs to train our models. The effective batch size during training was 4096.
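The effective batch size follows from 16 samples per GPU × 4 GPUs × 64 accumulation steps = 4096. The sketch below checks that arithmetic and illustrates, on a toy one-parameter regression, why averaging gradients over equally sized micro-batches reproduces the full-batch gradient. All function names and the toy model are illustrative assumptions, not part of the described training code.

```python
from typing import List, Sequence, Tuple


def effective_batch_size(per_gpu_batch: int, num_gpus: int, accumulation_steps: int) -> int:
    """Number of samples contributing to one optimizer update."""
    return per_gpu_batch * num_gpus * accumulation_steps


def mean_gradient(w: float, batch: Sequence[Tuple[float, float]]) -> float:
    """Gradient w.r.t. w of the mean squared error of the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(
    w: float, data: List[Tuple[float, float]], micro_batch_size: int
) -> float:
    """Accumulate micro-batch gradients, each scaled by 1 / number of chunks.

    Matches the full-batch gradient when all micro-batches have equal size.
    """
    chunks = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    return sum(mean_gradient(w, chunk) / len(chunks) for chunk in chunks)
```

Calling `effective_batch_size(16, 4, 64)` reproduces the value reported above.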

Downstream Evaluation
We evaluate our approach on 7 GLUE tasks using the metrics outlined in Table 1. For a fair comparison, we use the same set of fine-tuning hyperparameters for each approach on the downstream tasks. We compare our approach against (a) the baseline approach, where the training objective is detecting masked tokens, and (b) the plain pre-trained RoBERTa base model. The values of hyperparameters used for GLUE fine-tuning are outlined in

Results and Analysis
Table 1 presents the results for the 7 GLUE tasks. Our model trained to detect randomly shuffled tokens performs the best in 4 of the 7 downstream tasks, namely CoLA, MRPC, QNLI and RTE. Interestingly, the scores for the baseline, where the objective is to detect masked tokens, are sometimes worse than the plain pre-trained RoBERTa's scores. For example, the CoLA score using plain pre-trained RoBERTa is 0.557, whereas the score obtained by the baseline is 0.508.
The performance of the proposed approach on individual tasks gives us insight into which aspects of natural language our model learned better. Our model's performance on CoLA, which tests the grammatical correctness of a sentence, is better, indicating that the pre-training task may have enhanced the model's ability to learn grammatical information. Moreover, better performance on RTE, MRPC and QNLI suggests that, with the proposed approach, the model better understands semantic relationships such as similarity and entailment.
However, random shuffling hurts the performance of the model on WNLI significantly in comparison to the baseline. This may be due to the fact that WNLI forms a pair of sentences by replacing the ambiguous pronouns with their referents. Since we are shuffling the words, it is likely that the nouns will be shuffled, resulting in misleading replacement of the ambiguous pronoun.
Our baseline model outperforms the shuffled-token detection approach on the SST-2 task, which predicts the sentiment polarity of movie reviews. One possible explanation is that shuffling negations in the presence of contrasting conjunctions can significantly change the sentiment associated with the sentence.

Conclusion and Future Work
In this paper, we examine the performance of the RoBERTa model with a token modification discrimination head on detecting randomly shuffled tokens. We have demonstrated that detecting shuffled tokens is a challenging yet advantageous task, which allows the model to learn coherent representations of sentences. In this work, we start with the pre-trained RoBERTa base model and train it further on the shuffled-token detection task.
For future work, the model can be further explored by expanding the shuffling strategy. One possible strategy is part of speech (POS) shuffling,