Revisiting Pre-Trained Models for Chinese Natural Language Processing

Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and a series of variants has been proposed to further improve the performance of pre-trained language models. In this paper, we revisit Chinese pre-trained language models to examine their effectiveness in a non-English language and release a series of Chinese pre-trained language models to the community. We also propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways, especially the masking strategy, which adopts MLM as correction (Mac). We carried out extensive experiments on eight Chinese NLP tasks to revisit the existing pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT achieves state-of-the-art performance on many NLP tasks, and we also present detailed ablations with several findings that may help future research. https://github.com/ymcui/MacBERT


Introduction
Bidirectional Encoder Representations from Transformers (BERT) has become enormously popular and has proven effective in recent natural language processing studies, as it utilizes large-scale unlabeled training data to generate enriched contextual representations. On several popular machine reading comprehension benchmarks, such as SQuAD (Rajpurkar et al., 2018), CoQA (Reddy et al., 2019), QuAC (Choi et al., 2018), NaturalQuestions (Kwiatkowski et al., 2019), and RACE (Lai et al., 2017), most of the top-performing models are based on BERT and its variants (Zhang et al., 2019; Ran et al., 2019), demonstrating that pre-trained language models have become new fundamental components in the natural language processing field.
Starting from BERT, the community has made great and rapid progress on optimizing pre-trained language models, such as ERNIE (Sun et al., 2019a), XLNet, RoBERTa, SpanBERT, ALBERT (Lan et al., 2019), ELECTRA (Clark et al., 2020), etc. However, training Transformer-based (Vaswani et al., 2017) pre-trained language models is not as easy as training word embeddings or other traditional neural networks. Typically, training a powerful BERT-large model, a 24-layer Transformer with 330 million parameters, to convergence requires high-memory computing devices such as TPUs, which are very expensive. On the other hand, though various pre-trained language models have been released, most of them are for English, and there are few efforts on building powerful pre-trained language models for other languages.
In this paper, we aim to build a series of Chinese pre-trained language models and release them to the public to facilitate the research community, as Chinese and English are among the most spoken languages in the world. We revisit the existing popular pre-trained language models and adjust them to the Chinese language to see whether these models generalize well in a language other than English. Besides, we also propose a new pre-trained language model called MacBERT, which replaces the original MLM task with an MLM as correction (Mac) task and mitigates the discrepancy between the pre-training and fine-tuning stages. Extensive experiments are conducted on eight popular Chinese NLP datasets, ranging from sentence-level to document-level, covering machine reading comprehension, text classification, etc. The results show that the proposed MacBERT gives significant gains over other pre-trained language models on most tasks, and detailed ablations are given to better examine the composition of the improvements. The contributions of this paper are as follows.
• Extensive empirical studies are carried out to revisit the performance of Chinese pre-trained language models on various tasks with careful analyses.
• We propose a new pre-trained language model called MacBERT that mitigates the gap between the pre-training and fine-tuning stages by masking words with similar words, which has proven effective on downstream tasks.
• To further accelerate future research on Chinese NLP, we create and release the Chinese pre-trained language model series to the community.

Related Work
In this section, we revisit the techniques of the representative pre-trained language models in the recent natural language processing field. The overall comparisons of these models, as well as the proposed MacBERT, are depicted in Table 1. We elaborate on their key components in the following subsections.

BERT
BERT (Bidirectional Encoder Representations from Transformers)  has proven to be successful in natural language processing studies. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all Transformer layers. Primarily, BERT consists of two pre-training tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
• MLM: Randomly masks some of the tokens from the input, and the objective is to predict the original word based only on its context.
• NSP: Predicts whether sentence B is the next sentence of sentence A.
Later, they further proposed a technique called whole word masking (wwm) to optimize the original masking in the MLM task. In this setting, instead of randomly selecting WordPiece (Wu et al., 2016) tokens to mask, we always mask all of the tokens corresponding to a whole word at once. This explicitly forces the model to recover the whole word in the MLM pre-training task instead of just recovering WordPiece tokens (Cui et al., 2019a), which is much more challenging. As whole word masking only affects the masking strategy of the pre-training process, it brings no additional burden on downstream tasks. Moreover, as training pre-trained language models is computationally expensive, they also released all the pre-trained models as well as the source code, which stimulated great interest in the community in the research of pre-trained language models.
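The grouping step at the heart of whole word masking can be sketched in a few lines. This is a simplified illustration of the idea rather than the released implementation; the `whole_word_mask` helper and its greedy sampling policy are our own assumptions:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Pick token indices to mask at whole-word granularity.

    WordPiece continuation tokens (prefixed with "##") are grouped with
    the preceding token so that a split word is always masked as a unit.
    """
    rng = random.Random(seed)
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)       # continuation piece joins its word
        else:
            words.append([i])         # a new whole word starts here
    rng.shuffle(words)
    budget = max(1, round(len(tokens) * mask_prob))
    masked = set()
    for word in words:
        if len(masked) >= budget:
            break
        masked.update(word)           # mask all pieces of the word at once
    return sorted(masked)

tokens = ["we", "use", "a", "language", "model", "to",
          "pre", "##di", "##ct", "the", "next", "word"]
picked = whole_word_mask(tokens)
```

Either all three pieces of "pre ##di ##ct" (indices 6-8) are selected or none of them is, which is exactly the property whole word masking adds over the original token-level masking.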

ERNIE
ERNIE (Enhanced Representation through kNowledge IntEgration) (Sun et al., 2019a) is designed to optimize the masking process of BERT, which includes entity-level masking and phrase-level masking. Different from selecting random words in the input, entity-level masking masks named entities, which are often formed by several words. Phrase-level masking masks consecutive words, which is similar to the N-gram masking strategy.

XLNet

Yang et al. (2019) argue that the existing pre-trained language models that are based on autoencoding, such as BERT, suffer from a discrepancy between the pre-training and fine-tuning stages, because the masking symbol [MASK] never appears in the fine-tuning stage. To alleviate this problem, they proposed XLNet, which is based on Transformer-XL. XLNet mainly modifies BERT in two ways. The first is to maximize the expected likelihood over all permutations of the factorization order of the input, which they call the Permutation Language Model (PLM). The second is to change the autoencoding language model into an autoregressive one, which is similar to traditional statistical language models.

RoBERTa
RoBERTa (Robustly Optimized BERT Pre-training Approach) aims to keep the original BERT architecture but make more precise modifications to show the power of BERT, which was underestimated. They carried out careful comparisons of various components in BERT, including the masking strategies, training steps, etc. After thorough evaluations, they came up with several useful conclusions to make BERT more powerful, mainly including 1) training longer with bigger batches and longer sequences over more data; and 2) removing the next sentence prediction task and using dynamic masking.

ALBERT
ALBERT (A Lite BERT) (Lan et al., 2019) primarily tackles BERT's high memory consumption and slow training speed. ALBERT introduces two parameter-reduction techniques. The first is factorized embedding parameterization, which decomposes the embedding matrix into two small matrices. The second is cross-layer parameter sharing, where the Transformer weights are shared across all layers of ALBERT, which significantly reduces the number of parameters. Besides, they also proposed the sentence-order prediction (SOP) task to replace the traditional NSP pre-training task.
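The parameter saving from the factorized embedding can be checked with simple arithmetic; the vocabulary and embedding sizes below are illustrative placeholders, not ALBERT's exact configuration:

```python
def embedding_params(vocab_size, hidden_size, emb_size=None):
    """Embedding-layer parameter count: a single V x H matrix for BERT,
    or the ALBERT-style factorization V x E + E x H when emb_size is set."""
    if emb_size is None:
        return vocab_size * hidden_size
    return vocab_size * emb_size + emb_size * hidden_size

full = embedding_params(30000, 1024)           # 30,720,000 parameters
factored = embedding_params(30000, 1024, 128)  # 3,971,072 parameters
```

With a small E, the embedding cost grows with V * E instead of V * H, which is where most of the factorization's savings come from.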

ELECTRA
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) (Clark et al., 2020) employs a new generator-discriminator framework that is similar to a GAN (Goodfellow et al., 2014). The generator is typically a small MLM that learns to predict the original words of the masked tokens. The discriminator is trained to discriminate whether an input token was replaced by the generator. Note that, to achieve efficient training, the discriminator only needs to predict a binary "replaced or not" label, unlike MLM, which must predict the exact masked word. In the fine-tuning stage, only the discriminator is used.
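The discriminator's training signal is just a per-token binary comparison between the original sequence and the generator's output. A minimal sketch (the `replacement_labels` helper is our own illustration, not ELECTRA's actual code):

```python
def replacement_labels(original_tokens, corrupted_tokens):
    """1 where the generator's sample differs from the original token
    (i.e. "replaced"), 0 where it happens to match."""
    assert len(original_tokens) == len(corrupted_tokens)
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]

orig = ["the", "chef", "cooked", "the", "meal"]
corr = ["the", "chef", "ate", "the", "meal"]  # generator's MLM sample
labels = replacement_labels(orig, corr)       # [0, 0, 1, 0, 0]
```

Note that when the generator happens to predict the original word, the label is 0 even though the position was masked, which is how ELECTRA defines the task.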

Chinese Pre-trained Language Models
While we believe most of the conclusions in previous works hold for English, we wonder whether these techniques still generalize well to other languages. In this section, we illustrate how the existing pre-trained language models are adapted to the Chinese language. Furthermore, we also propose a new model called MacBERT, which adopts the advantages of previous models as well as newly designed components. Note that, as these models all originate from BERT without changing the nature of the input, no modification needs to be made in the fine-tuning stage, making it very flexible to replace one model with another.

BERT-wwm & RoBERTa-wwm
In the original BERT, a WordPiece tokenizer (Wu et al., 2016) was used to split the text into WordPiece tokens, where some words are split into several small fragments. Whole word masking (wwm) mitigates the drawback of masking only a part of a whole word, which would be easier for the model to predict. For Chinese, the WordPiece tokenizer no longer splits words into small fragments, as Chinese characters are not formed by alphabet-like symbols. Instead, we use a traditional Chinese Word Segmentation (CWS) tool to split the text into words. In this way, we can adopt whole word masking in Chinese to mask whole words instead of individual Chinese characters. For the implementation, we strictly followed the original whole word masking code and did not change other components, such as the percentage of words masked. We use LTP (Che et al., 2010) for Chinese word segmentation to identify word boundaries.
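The mapping from CWS output back to character-level tokens can be sketched as below. This is our own simplified illustration: `words` stands for the segmenter output (e.g. from LTP), and we assume the BERT input is tokenized into single Chinese characters:

```python
def char_mask_indices(words, selected_word_ids):
    """Expand word-level masking decisions to character-token indices,
    so every character of a selected word is masked together."""
    indices, offset = [], 0
    for wid, word in enumerate(words):
        if wid in selected_word_ids:
            indices.extend(range(offset, offset + len(word)))
        offset += len(word)
    return indices

words = ["使用", "语言", "模型", "来", "预测", "下一个", "词", "的", "概率"]
# Masking the word 模型 covers both of its characters, token indices 4 and 5.
picked = char_mask_indices(words, {2})  # [4, 5]
```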

[Figure 1: Examples of different masking strategies. Original Chinese sentence: 使用语言模型来预测下一个词的概率。 English: "we use a language model to predict the probability of the next word." WordPiece tokenization: "we use a language model to pre ##di ##ct the pro ##ba ##bility of the next word ." A similar-word replacement variant: "we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word ."]
Note that whole word masking only affects the selection of tokens to mask in the pre-training stage. The input to BERT still uses the WordPiece tokenizer to split the text, identical to the original BERT. Similarly, whole word masking can also be applied to RoBERTa, where the NSP task is not adopted. An example of whole word masking is depicted in Figure 1.

MacBERT
In this paper, we take advantage of previous models and propose a simple modification that leads to significant improvements on fine-tuning tasks; we call this model MacBERT (MLM as correction BERT). MacBERT shares the same pre-training tasks as BERT, with several modifications. For the MLM task, we make the following modifications.
• We use whole word masking as well as N-gram masking strategies for selecting candidate tokens to mask, with percentages of 40%, 30%, 20%, and 10% for word-level unigrams through 4-grams.
• Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, we use similar words for masking. A similar word is obtained from the Synonyms toolkit (Wang and Hu, 2017), which is based on word2vec (Mikolov et al., 2013) similarity calculations. If an N-gram is selected for masking, we find similar words for each word individually. In rare cases, when there is no similar word, we degrade to random word replacement.
• We mask 15% of the input words, of which 80% are replaced with similar words, 10% are replaced with random words, and the remaining 10% are kept unchanged.
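Putting the three modifications together, the word-level masking procedure can be sketched roughly as follows. This is a simplified illustration under our own assumptions: `similar` stands in for the Synonyms toolkit lookup (falling back to a random word when no similar word exists), and the 80/10/10 split is applied as an independent per-word draw rather than over the masked set as a whole:

```python
import random

NGRAM_WEIGHTS = [(1, 0.4), (2, 0.3), (3, 0.2), (4, 0.1)]  # unigram..4-gram

def mac_mask(words, similar, vocab, mask_prob=0.15, seed=0):
    """Sketch of MLM-as-correction masking over a segmented word list."""
    rng = random.Random(seed)
    out = list(words)
    budget = max(1, round(len(words) * mask_prob))
    covered = set()
    while len(covered) < budget:
        # Pick an N-gram length with the 40/30/20/10 distribution.
        n = rng.choices([g for g, _ in NGRAM_WEIGHTS],
                        [w for _, w in NGRAM_WEIGHTS])[0]
        start = rng.randrange(max(1, len(words) - n + 1))
        for i in range(start, min(start + n, len(words))):
            if i in covered:
                continue
            covered.add(i)
            roll = rng.random()
            if roll < 0.8:    # replace with a similar word
                out[i] = similar.get(words[i], rng.choice(vocab))
            elif roll < 0.9:  # replace with a random word
                out[i] = rng.choice(vocab)
            # else: keep the original word (10%)
    return out
```

Real pre-training works on much longer sequences and enforces the 15% budget more carefully; the sketch only shows the control flow of selection and replacement.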
For the NSP-like task, we adopt the sentence-order prediction (SOP) task introduced by ALBERT (Lan et al., 2019), where negative samples are created by switching the original order of two consecutive sentences. We ablate these modifications in Section 6.1 to better demonstrate the contribution of each component.
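Constructing SOP training pairs is straightforward — a minimal sketch, with the label convention (1 = correct order) being our own choice:

```python
import random

def make_sop_example(sent_a, sent_b, rng):
    """Build one sentence-order prediction sample from two consecutive
    sentences: keep the order (label 1) or swap it (label 0)."""
    if rng.random() < 0.5:
        return (sent_a, sent_b), 1  # positive: original order
    return (sent_b, sent_a), 0      # negative: swapped order

rng = random.Random(42)
pair, label = make_sop_example("第一句。", "第二句。", rng)
```

Unlike NSP, the negative sample reuses the same two sentences, so the model cannot solve the task from topic cues alone and must model discourse order.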

Setups for Pre-Trained Language Models
We downloaded the Wikipedia dump (as of March 25, 2019) and pre-processed it with WikiExtractor.py, resulting in 1,307 extracted files. The dump contains both Simplified and Traditional Chinese. After cleaning the raw text (such as removing HTML tags) and splitting it into documents, we obtain about 0.4B words. As the Chinese Wikipedia data is relatively small, we also use extended training data for these pre-trained language models (marked with ext in the model name). The in-house collected extended data contains encyclopedia, news, and question answering web text, amounting to 5.4B words, over ten times bigger than Chinese Wikipedia. Note that we always use extended data for MacBERT and omit the ext mark. To identify the boundaries of Chinese words, we use LTP (Che et al., 2010) for Chinese word segmentation. We use the official create_pretraining_data.py script to convert raw input text into pre-training examples.
To better acquire knowledge from an existing pre-trained language model, we did NOT train our base-level models from scratch but initialized them from the official Chinese BERT-base, inheriting its vocabulary and weights. However, for the large-level models, we had to train from scratch, while still using the same vocabulary provided by the base-level model.
For training the BERT series, we adopt the scheme of first training on a maximum length of 128 tokens and then on 512. However, we empirically found that this results in insufficient adaptation for long-sequence tasks, such as reading comprehension. Therefore, for RoBERTa and MacBERT, we directly use a maximum length of 512 throughout the pre-training process. For batch sizes below 1024, we adopt BERT's original ADAM with weight decay optimizer (Kingma and Ba, 2014), and we use the LAMB optimizer (You et al., 2019) for better scalability with larger batches. Pre-training was done either on a single Google Cloud TPU v3-8 (equal to a single TPU) or on a TPU Pod v3-32 (equal to 4 TPUs), depending on the magnitude of the model. Specifically, for MacBERT-large, we trained for 2M steps with a batch size of 512 and an initial learning rate of 1e-4. The training details are shown in Table 3. For clarity, we do not list the 'ext' models, whose other parameters are the same as those of the corresponding models not trained on extended data.

Setups for Fine-tuning Tasks
To thoroughly test these pre-trained language models, we carried out extensive experiments on various natural language processing tasks covering a wide spectrum of text lengths, i.e., from sentence-level to document-level. Task details are shown in Table 2. Specifically, we choose the following eight popular Chinese datasets.
• Machine Reading Comprehension (MRC): CMRC 2018, DRCD, CJRC.
• Single Sentence Classification (SSC): ChnSentiCorp (Tan and Zhang, 2008), THUCNews (Li and Sun, 2007).
• Sentence Pair Classification (SPC): XNLI, LCQMC, BQ Corpus.
To make a fair comparison, for each dataset we keep the same hyper-parameters (such as maximum length, warm-up steps, etc.) and only tune the initial learning rate, from 1e-5 to 5e-5, for each task. Note that the initial learning rates were tuned on the original Chinese BERT, and further gains might be achieved by tuning the learning rate individually for each model. We run each experiment ten times to ensure the reliability of the results. The best initial learning rate is determined by selecting the best average development set performance. We report both the maximum and average scores to evaluate peak and average performance.
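The selection procedure amounts to the following — a small sketch with made-up scores; `dev_scores` maps each candidate learning rate to its repeated-run development results:

```python
from statistics import mean

def pick_learning_rate(dev_scores):
    """Pick the initial learning rate with the best average dev score,
    and report (max, mean) per setting for the peak/average tables."""
    summary = {lr: (max(s), mean(s)) for lr, s in dev_scores.items()}
    best_lr = max(summary, key=lambda lr: summary[lr][1])
    return best_lr, summary

runs = {2e-5: [88.1, 88.4, 87.9], 3e-5: [88.6, 88.2, 88.5]}
best, summary = pick_learning_rate(runs)  # best == 3e-5 on these numbers
```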
For all models except for ELECTRA, we use the same initial learning rate setting for each task, as depicted in Table 2. For ELECTRA models, we use a universal initial learning rate of 1e-4 for base-level models and 5e-5 for large-level models as suggested in Clark et al. (2020).
As the pre-training data are quite different

Machine Reading Comprehension
Machine Reading Comprehension (MRC) is a representative document-level modeling task, which requires answering questions based on given passages. We mainly test these models on three datasets: CMRC 2018, DRCD, and CJRC.
• CMRC 2018: A span-extraction machine reading comprehension dataset, similar to SQuAD (Rajpurkar et al., 2016), in which a passage span is extracted to answer the given question.
• DRCD: This is also a span-extraction MRC dataset but in Traditional Chinese.
• CJRC: Similar to CoQA (Reddy et al., 2019), with yes/no questions, no-answer questions, and span-extraction questions. The data is collected from Chinese law judgment documents. Note that we only use small-train-data.json for training.
The results are depicted in Tables 4, 5, and 6. Using additional pre-training data results in further improvement, as shown by the comparison between BERT-wwm and BERT-wwm-ext. This is why we use extended data for RoBERTa, ELECTRA, and MacBERT. Moreover, the proposed MacBERT yields significant improvements on all reading comprehension datasets. It is worth mentioning that our MacBERT-large achieves a state-of-the-art F1 of 60% on the challenge set of CMRC 2018, which requires deeper text understanding.
It should also be noted that, though DRCD is a Traditional Chinese dataset, training with additional large-scale Simplified Chinese data also has a great positive effect. As Simplified and Traditional Chinese share many identical characters, using a powerful pre-trained language model with only a small amount of Traditional Chinese data can also bring improvements, without converting Traditional Chinese characters into Simplified ones. Regarding CJRC, where the text is written in the professional style of Chinese law, BERT-wwm shows moderate but not salient improvement over BERT, indicating that further domain adaptation is needed for fine-tuning tasks on non-general domains. However, increasing the general training data results in improvement, suggesting that when there is not enough domain data, large-scale general data can also serve as a remedy.

Single Sentence Classification
For single sentence classification tasks, we select the ChnSentiCorp and THUCNews datasets. We use ChnSentiCorp to evaluate sentiment classification, where the text should be classified as either positive or negative. THUCNews is a dataset that contains news in different genres, where the text is typically very long. In this paper, we use a version that contains 50K news articles evenly distributed over 10 domains, including sports, finance, technology, etc. (https://github.com/gaussic/text-classification-cnn-rnn). The results show that MacBERT gives moderate improvements over the baselines, as these datasets have already reached very high accuracies.

Sentence Pair Classification
For sentence pair classification tasks, we use XNLI (Chinese portion), the Large-scale Chinese Question Matching Corpus (LCQMC), and the BQ Corpus, which require the model to take two sequences as input and predict their relation. We can see that MacBERT outperforms the other models, but the improvements are moderate: there is a slight improvement in the average score, but the peak performance is not as good as that of RoBERTa-wwm-ext-large. We suspect that these tasks are less sensitive to subtle differences in the input than the reading comprehension tasks, as sentence pair classification only needs to generate a unified representation of the whole input, resulting in a moderate improvement.

Discussion
While our models achieve significant improvements on various Chinese tasks, we wonder where the essential components of the improvements come from. To this end, we carried out detailed ablations on MacBERT to demonstrate the effectiveness of its components, and we also compare the claims of existing pre-trained language models in English to see whether their modifications still hold true in another language.

Effectiveness of MacBERT
We carried out ablations to examine the contribution of each component in MacBERT, evaluated thoroughly on all fine-tuning tasks. The results are shown in Table 9. The overall average scores are obtained by averaging the test scores of each task (EM and F1 metrics are averaged before the overall averaging). From a general view, removing any component of MacBERT results in a decline in average performance, suggesting that all modifications contribute to the overall improvements. Specifically, the most effective modifications are N-gram masking and similar word replacement, the modifications to the masked language model task. Comparing N-gram masking and similar word replacement, we see clear pros and cons: N-gram masking seems to be more effective on text classification tasks, while the performance on reading comprehension tasks seems to benefit more from similar word replacement. Combining the two tasks compensates for each other's weaknesses and gives better performance on both genres.
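For concreteness, the overall average described above can be computed as follows; the scores are made-up placeholders, not results from Table 9:

```python
from statistics import mean

def overall_average(task_scores):
    """Average each task's metrics first (e.g. EM and F1), then average
    those per-task numbers across all tasks."""
    return mean(mean(metrics) for metrics in task_scores.values())

scores = {"CMRC 2018": [68.5, 88.0],  # EM and F1, averaged first
          "XNLI": [78.0],
          "LCQMC": [87.0]}
avg = overall_average(scores)  # mean of 78.25, 78.0, 87.0
```

Pre-averaging EM and F1 keeps a two-metric task from counting twice in the overall score.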
The NSP task does not show as much importance as the MLM task, demonstrating that it is much more important to design a better MLM task to fully unleash the text modeling power. We also compared the next sentence prediction and sentence-order prediction (Lan et al., 2019) tasks to judge which is more powerful. The results show that the sentence-order prediction task indeed performs better than the original NSP, though not saliently so. The SOP task requires identifying the correct order of two consecutive sentences rather than distinguishing a random sentence, which is much easier for the machine. Removing the SOP task results in noticeable declines on reading comprehension tasks compared to text classification tasks, suggesting that it is necessary to design an NSP-like task to learn the relations between two segments (for example, between passage and question in a reading comprehension task).

Investigation on MLM Task
As discussed in the previous section, the dominant pre-training task is the masked language model and its variants. The masked language model task relies on two aspects: 1) the selection of the tokens to be masked, and 2) the replacement of the selected tokens. In the previous section, we demonstrated the effectiveness of the selection of masking tokens, such as whole word masking, N-gram masking, etc. Now we investigate how the replacement of the selected tokens affects the performance of pre-trained language models. To investigate this problem, we plot CMRC 2018 and DRCD performance at different pre-training steps. Specifically, we follow the original masking percentage of 15% of the input sequence, in which 10% of the masked tokens remain unchanged. For the remaining 90% of the masked tokens, we classify the strategies into four categories.
• Partial Mask: the original BERT implementation, with 80% of the tokens replaced by [MASK] and 10% replaced by random words.
• All Mask: all 90% of the tokens are replaced by the [MASK] token.
• Random Replace: all 90% of the tokens are replaced by random words.
• Similar Word Replace (MacBERT): the [MASK] token is abandoned and the tokens are replaced by similar words instead.
We only plot the steps from 1M to 2M, as these are more stable than the first 1M steps. The results are depicted in Figure 2. The pre-training models that mostly rely on [MASK] for masking (i.e., partial mask and all mask) yield worse performance, indicating that the discrepancy between pre-training and fine-tuning is an actual problem that affects overall performance. We also noticed that if we do not leave 10% of the tokens unchanged (i.e., no identity projection), there is a consistent decline, indicating that masking with the [MASK] token is less robust and vulnerable to the absence of identity projection for negative sample training.
To our surprise, a quick fix, namely abandoning the [MASK] token completely and replacing all 90% of the masked tokens with random words, yields consistent improvements over the [MASK]-dependent masking strategies. This also strengthens the claim that the original masking method, which relies on the [MASK] token that never appears in the fine-tuning task, results in a discrepancy and worse performance. To refine this further, in this paper we propose to use similar words for masking instead of randomly picking a word from the vocabulary, as a random word does not fit the context and may break the naturalness of language model learning; a traditional N-gram language model is based on natural sentences rather than manipulated, disfluent ones. If we use similar words for masking, the fluency of the sentence is much better than with random words, and the whole task transforms into a grammar correction task, which is more natural and free of the pre-training and fine-tuning discrepancy. From the chart, we can see that MacBERT yields the best performance among the four variants, which verifies our assumptions.

Conclusion
In this paper, we revisit pre-trained language models in Chinese to see whether the techniques in these state-of-the-art models generalize well in a language other than English. We created a series of Chinese pre-trained language models and proposed a new model called MacBERT, which modifies the masked language model (MLM) task into a language correction task and mitigates the discrepancy between the pre-training and fine-tuning stages. Extensive experiments were conducted on various Chinese NLP datasets, and the results show that the proposed MacBERT gives significant gains on most tasks. Detailed ablations show that more focus should be placed on the MLM task rather than the NSP task and its variants, as we found that no NSP-like task shows a decisive advantage over the others. With the release of this series of Chinese pre-trained language models, we hope to further accelerate natural language processing research in the Chinese community.
In the future, we would like to investigate an effective way to determine the masking ratios instead of heuristic ones to further improve the performance of the pre-trained language models.

We also carried out experiments on text classification tasks, such as XNLI, but our XLNet-mid only gives near 74% accuracy on the test set, while BERT-base reaches 77.8%. We have not figured out the exact issue and also did not find other successful Chinese XLNet models in the community. We will investigate the issue and update these results once we figure it out, through our open-source implementation repository.