ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations

The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. It omits information carried by larger text granularity, and thus the encoders cannot easily adapt to certain combinations of characters. This leads to a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose ZEN, a BERT-based Chinese text encoder enhanced by n-gram representations, where different combinations of characters are considered during training, so that potential word or phrase boundaries are explicitly pre-trained and fine-tuned with the character encoder (BERT). ZEN therefore incorporates the comprehensive information of both the character sequence and the words or phrases it contains. Experimental results illustrate the effectiveness of ZEN on a series of Chinese NLP tasks, where state-of-the-art results are achieved on most tasks while requiring fewer resources than other published encoders. It is also shown that reasonable performance is obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data. The code and pre-trained models of ZEN are available at https://github.com/sinovation/ZEN.


Introduction
Pre-trained text encoders (Peters et al., 2018b; Devlin et al., 2018; Radford et al., 2018, 2019; Yang et al., 2019) have drawn much attention in natural language processing (NLP), because state-of-the-art performance can be obtained for many NLP tasks using such encoders. In general, these encoders are implemented by training a deep neural model on large unlabeled corpora. Although the use of big data brings success to these pre-trained encoders, it is still unclear whether existing encoders have effectively leveraged all useful information in the corpus. Normally, the pre-training procedures are designed to learn on tokens corresponding to small units of text (e.g., word pieces for English, characters for Chinese) for efficiency and simplicity. However, some important information carried by larger text units may be lost for certain languages when we use a standard encoder such as BERT. For example, in Chinese, text semantics are greatly affected by recognizing valid n-grams. This means a pre-trained encoder can potentially be improved by incorporating the boundary information of important n-grams.
Recently, there have been studies adapting BERT for Chinese with word information, yet they are limited to maintaining the original BERT structure, augmented with learning from weakly supervised word information or requiring external knowledge. As an example, a representative study in Cui et al. (2019) proposed to use the whole-word masking strategy to mitigate the limitation of word information. They used an existing segmenter to produce possible words in the input sentences, and then trained a standard BERT on the segmented texts by masking whole words. Sun et al. (2019a) proposed to perform both entity-level and phrase-level masking to learn knowledge and information from the pre-training corpus. However, their approaches are limited in the following senses. First, both methods rely on the word masking strategy, so the encoder can only be trained with existing word and phrase information. Second, similar to the original BERT, the masking strategy results in a mismatch between pre-training and fine-tuning, i.e., no word/phrase information is retained when the encoders are applied to downstream prediction tasks. Third, incorrect word segmentation or entity recognition results propagate errors into the pre-training process and thus may negatively affect the generalization capability of the encoder.
In this paper, we propose ZEN, a Chinese (Z) text encoder Enhanced by representing N-grams, which provides an alternative way to improve character-based encoders (e.g., BERT) by using larger text granularity. To train our model, one uses an n-gram lexicon from any possible source, such as pre-defined dictionaries or n-gram lists extracted via unsupervised approaches. This lexicon is then matched against the training texts and used to highlight possible combinations of characters that indicate likely salient content during the training process. Our model then integrates the representations of these n-gram contexts with the character encoder. Similarly, the fine-tuning process on any task-specific dataset further enhances ZEN with such n-gram representations. An important feature of our method is that while the model explicitly takes advantage of n-gram information, it only outputs character-level encodings that are consistent with BERT, so downstream tasks are not affected. ZEN extends the original BERT structure and explicitly incorporates representations of larger-granularity text into it, which is different from (and complementary to) previous methods that relied on weak supervision such as masking: although the character encoder may still use masking as a learning objective, the encoded n-grams are represented explicitly. To mitigate error propagation, in the n-gram encoder we use the attention mechanism to dynamically weigh n-grams, so that truly useful n-grams are emphasized while noisy ones receive low weights and contribute less to learning. In addition, the n-gram vocabulary is collected from an extra large corpus, and it can be easily adapted to any source from different domains to ensure incorporating the most important n-grams before training ZEN.
Our experiments follow the standard procedure, i.e., training ZEN on the Chinese Wikipedia dump and fine-tuning it on several Chinese downstream NLP tasks. Experimental results demonstrate its validity and effectiveness: state-of-the-art performance is achieved on many tasks using n-grams that are automatically learned from the training data rather than from external or prior knowledge. In particular, our method outperforms some existing encoders trained on much larger corpora.

ZEN
The overall architecture of ZEN is shown in Figure 1, where the backbone model (character encoder) is BERT (Devlin et al., 2018), enhanced by n-gram information represented by a multi-layer encoder. Since the basis of BERT is well explained in previous studies (Devlin et al., 2018; Yu and Jiang, 2019), in this paper we focus on the details of ZEN, explaining how n-grams are processed and incorporated into the character encoder.

N-gram Extraction
High-quality text representation plays an important role in obtaining good performance for many NLP tasks (Song et al., 2017; Zhu et al., 2019; Liu and Lapata, 2019), where a powerful encoder is required to model rich contextual information. Inspired by studies (Song et al., 2009; Song and Xia, 2012; Ouyang et al., 2017; Kim et al., 2018; Peng et al., 2018; Higashiyama et al., 2019; Tian et al., 2020c; Li et al., 2020) that leverage the large-granularity contextual information carried by n-grams to enhance text representation for Chinese, we propose ZEN to enhance character-based text encoders (e.g., BERT) by leveraging n-grams. In doing so, we extract n-grams prior to pre-training ZEN through two different steps. The first step is to prepare an n-gram lexicon (denoted as L), for which one can use any unsupervised method to extract n-grams from large corpora. The second step of n-gram extraction is performed during pre-training, where n-grams in L are matched against each training instance $c = (c_1, c_2, ..., c_i, ..., c_{k_c})$ with $k_c$ characters. Once these n-grams are extracted, we use an n-gram matching matrix (denoted as M) to record the positions of the extracted n-grams in each training instance. M is thus a $k_c \times k_n$ matrix, where $k_n$ is the number of n-grams extracted from $c$, and each element is given by

$$m_{i,j} = \begin{cases} 1 & \text{if } c_i \in n_j \\ 0 & \text{otherwise} \end{cases}$$

where $n_j$ is the $j$-th extracted n-gram. A sample M for an input text is shown in the bottom part of Figure 1.
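To make this step concrete, below is a minimal Python sketch of the matching process (the function names and the maximum n-gram length are our own illustrative choices, not taken from the released ZEN code): it scans a character sequence for lexicon entries and builds the 0/1 matrix M defined above.

```python
from typing import List, Set, Tuple

def extract_ngrams(chars: List[str], lexicon: Set[str],
                   max_n: int = 8) -> List[Tuple[int, int, str]]:
    """Return every lexicon n-gram found in the character sequence
    as (start, end, ngram) spans, with end exclusive."""
    spans = []
    for i in range(len(chars)):
        for n in range(2, max_n + 1):
            if i + n > len(chars):
                break
            cand = "".join(chars[i:i + n])
            if cand in lexicon:
                spans.append((i, i + n, cand))
    return spans

def build_matching_matrix(chars: List[str],
                          spans: List[Tuple[int, int, str]]) -> List[List[int]]:
    """k_c x k_n matrix M with M[i][j] = 1 iff character i is covered
    by the j-th extracted n-gram."""
    M = [[0] * len(spans) for _ in range(len(chars))]
    for j, (start, end, _) in enumerate(spans):
        for i in range(start, end):
            M[i][j] = 1
    return M

# Toy usage with a hypothetical two-entry lexicon:
chars = list("提高速度")
spans = extract_ngrams(chars, {"提高", "速度"})
M = build_matching_matrix(chars, spans)  # a 4 x 2 matrix of 0/1
```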

Encoding N-grams
[Figure 1: The overall architecture of ZEN. NSP and MLM refer to the two BERT objectives, next sentence prediction and masked language model, respectively; [MSK] is the masked token. The incorporation of n-grams into the character encoder is illustrated by the addition operation shown in blue. The bottom part presents the n-gram extraction and preparation for the given input instance.]

As shown in the right part of Figure 1, ZEN employs another encoder
to represent all n-grams, whose information is thus encoded at different levels matching the corresponding layers in BERT. We adopt Transformer (Vaswani et al., 2017) as this encoder, a multi-layer architecture that can model the interactions among all n-grams through their representations in each layer. This modeling power is of high importance for ZEN because, in a given context, salient n-grams are more useful than random others, and such salient n-grams are expected to be emphasized in pre-training. This effect can be achieved by the multi-head self-attention (MhA) mechanism in Transformer (Clark et al., 2019). In detail, the Transformer for n-grams is the same as its original version for sequence modeling, except that it does not encode n-gram positions, because all n-grams are treated equally without a sequential order. Each n-gram extracted for the input is represented by an embedding from the n-gram embedding matrix. Therefore, for all extracted n-grams, we obtain the $j$-th n-gram embedding $e_j$ as the input, denote its representation at layer $l$ of the n-gram encoder by $\mu_j^{(l)}$, and formulate the encoding process across layers by

$$\mu_j^{(l+1)} = \mathrm{MhA}\big(\mu_j^{(l)}, U^{(l)}, U^{(l)}\big)$$

where $\mu_j^{(l)}$ is used as the query ($Q$) vector to calculate the attention over all other input n-grams from the same layer, and $U^{(l)}$ refers to the matrix that stacks all n-gram representations at layer $l$ and serves as the key ($K$) and value ($V$) in MhA. This encoding process is repeated layer by layer along with the character encoder.
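The following is a hedged PyTorch sketch of one such n-gram encoder layer, assuming a standard post-norm Transformer block (the class name and feed-forward details are our illustration; the released ZEN implementation may differ). The key point is that self-attention is applied over the bag of n-grams with no positional encoding.

```python
import torch
import torch.nn as nn

class NgramEncoderLayer(nn.Module):
    """Multi-head self-attention over the set of n-grams; no positional
    encoding is used because n-grams carry no sequential order."""
    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden),
                                 nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        # U: (batch, k_n, hidden); each mu_j serves as a query over all
        # n-gram representations U, which act as keys and values.
        attn_out, _ = self.attn(U, U, U)
        U = self.norm1(U + attn_out)
        return self.norm2(U + self.ffn(U))

# Embeddings for a 104K-entry lexicon, run through 6 layers as in ZEN:
ngram_emb = nn.Embedding(104_000, 768)
layers = nn.ModuleList([NgramEncoderLayer() for _ in range(6)])
U = ngram_emb(torch.tensor([[3, 17, 42]]))  # (1, k_n=3, 768)
for layer in layers:
    U = layer(U)
```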

Representing N-grams in Pre-training

Let $\upsilon_i^{(l)}$ and $\mu_{i,k}^{(l)}$ represent the embeddings of the $i$-th character and the $k$-th n-gram associated with this character at layer $l$. The enhanced representation for this character is computed by

$$\upsilon_i^{(l)*} = \upsilon_i^{(l)} + \sum_k \mu_{i,k}^{(l)}$$

where $\upsilon_i^{(l)*}$ is the resulting embedding sent to the next layer. Herein $+$ and $\sum$ refer to element-wise addition; therefore $\upsilon_i^{(l)*} = \upsilon_i^{(l)}$ when no n-gram covers this character. For the entire layer $l$, this enhancement can be formulated by

$$V^{(l)*} = V^{(l)} + M \cdot U^{(l)}$$

where $V^{(l)}$ is the embedding matrix for all characters, and its combination with $U^{(l)}$ can be done directly through M. This process is repeated for each layer in the backbone BERT except for the last one. The final output of all character embeddings from the last layer is sent to optimize the BERT objectives, i.e., mask recovery and next sentence prediction. Note that, since there is masking in BERT training, when a character is masked, the n-grams that cover it are not considered.
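A minimal sketch of this layer-wise combination, assuming batched tensors (the function name and shapes are our own illustration; the exclusion of masked characters is omitted for brevity):

```python
import torch

def enhance_characters(V: torch.Tensor, U: torch.Tensor,
                       M: torch.Tensor) -> torch.Tensor:
    """Layer-wise enhancement V* = V + M @ U.

    V: (batch, k_c, hidden) character representations at layer l
    U: (batch, k_n, hidden) n-gram representations at layer l
    M: (batch, k_c, k_n)    0/1 matching matrix; row i sums the
                            embeddings of the n-grams covering character i
    """
    return V + torch.bmm(M, U)

V = torch.randn(2, 64, 768)   # 2 instances, 64 characters each
U = torch.randn(2, 16, 768)   # 16 extracted n-grams per instance
M = torch.zeros(2, 64, 16)    # all-zero rows leave characters unchanged
enhanced = enhance_characters(V, U, M)
```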
Experiment Settings

For fine-tuning, we choose seven NLP tasks and their corresponding benchmark datasets for our experiments, many of which have been used in previous studies (Cui et al., 2019; Sun et al., 2019a,b): Chinese word segmentation (CWS), part-of-speech (POS) tagging, named entity recognition (NER), document classification (DC), sentiment analysis (SA), sentence pair matching (SPM), and natural language inference (NLI).

The n-gram lexicon L is extracted from the Chinese Wikipedia dump and prepared by sorting the n-grams (except for unigrams) according to their frequencies. We try cut-off thresholds between 5 and 40, where all n-grams with frequency lower than the threshold are filtered out; the resulting sizes of L under different thresholds range from 179K to 64K n-grams, and our main experiments use a cut-off of 15, resulting in 104K n-grams in the lexicon. All n-gram embeddings are randomly initialized. For the backbone BERT in ZEN, we use the same structure as in previous work (Devlin et al., 2018; Sun et al., 2019a; Cui et al., 2019), i.e., 12 layers with 12 self-attention heads, 768 dimensions for hidden states, a max input length of 512, etc. The pre-training tasks employ the same masking strategy and next sentence prediction as in Devlin et al. (2018), so that ZEN can be compared with BERT on a fair basis. We use the same parameter settings for the n-gram encoder as in BERT, except that we use only 6 layers and set 128 as the maximum number of n-grams extracted per instance. The resulting ZEN requires only 20% additional inference time (averaged over the seven tasks) compared with the original BERT base model. We adopt mixed precision training (Micikevicius et al., 2017) via the Apex library to speed up the training process. Each ZEN model is trained in parallel on 4 NVIDIA Tesla V100 GPUs with 16GB memory.
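As a rough illustration of the lexicon preparation described above (a sketch under our own assumptions, e.g., the maximum n-gram length; this is not the released preprocessing script), the frequency cut-off can be implemented as:

```python
from collections import Counter
from typing import Iterable, Set

def build_lexicon(sentences: Iterable[str], max_n: int = 8,
                  cutoff: int = 15) -> Set[str]:
    """Count every n-gram (n >= 2) in the corpus and keep those whose
    frequency reaches the cut-off threshold."""
    counts = Counter()
    for sent in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {gram for gram, freq in counts.items() if freq >= cutoff}
```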
Our task-specific fine-tuning uses hyperparameters similar to those reported in Cui et al. (2019), with slightly different settings for the max input sequence length. Most previous studies report their performance on the development sets of the aforementioned tasks, and we follow them in doing so to provide a reference and comparison. There are also other studies demonstrating the effectiveness of ZEN on CWS (Tian et al., 2020c), POS tagging (Tian et al., 2020a), constituency parsing, and NER (Nie et al., 2020a,b), in which models equipped with the ZEN encoder consistently outperform the ones with BERT.

The main results are reported in Table 2, where (R) denotes encoders pre-trained from random initialization and (P) denotes encoders initialized from pre-trained BERT. The performance gap between ZEN (P) and BERT (P) is larger than that in the R setting, which illustrates that learning an encoder from a reliable initialization is more important and that integrating n-gram information contributes a better enhancement on well-learned encoders. Comparing the two types of tasks, we notice that token-level tasks, i.e., CWS, POS and NER, demonstrate a bigger improvement of ZEN over BERT than sentence-level tasks, because the potential boundary information presented by n-grams is essential for providing better guidance in labeling each character; particularly for CWS and NER, this boundary information is directly related to the outputs. The reason behind this improvement is that in token-level tasks, high-frequency n-grams, such as fixed expressions and common phrases, which may have less varied meanings than ordinary combinations of characters and random character sequences, are in many cases valid chunks in a sentence that carry key semantic information. Meanwhile, sentence-level tasks show a roughly similar trend in the improvement of ZEN over BERT, which also demonstrates the capability of combining both character and n-gram information in a text encoder.

We also compare ZEN (P) with existing pre-trained encoders on the same NLP tasks, with their results listed in the middle part of Table 2; for NEZHA (Wei et al., 2019), we only report the results on the tasks they conducted. B and L denote the base and large BERT backbone models, respectively. Note that although there are other pre-trained encoders exploiting entity knowledge or multi-modal signals, they are not compared in this paper because external information is required in their work (e.g., KnowBERT (Peters et al., 2019)). In fact, even without using such external information, ZEN still achieves state-of-the-art performance on many of the tasks experimented.
In general, the results clearly indicate the effectiveness of ZEN. In detail, the comparison between ZEN and BERT-wwm shows that, when starting from pre-trained BERT, ZEN outperforms BERT-wwm on all tasks for which BERT-wwm has reported results.
This observation suggests that explicitly representing n-grams and integrating them into BERT has an advantage over the masking strategy, and that using n-grams rather than words may tolerate error propagation better, since word segmentation is unreliable in many cases. The comparison between ZEN and the ERNIE encoders also illustrates the superiority of enhancing BERT with n-grams. For example, ZEN shows a consistent improvement over ERNIE 1.0 even though significantly larger non-public datasets were used in its pre-training. Compared to ERNIE 2.0, which used many more pre-training tasks and significantly more non-public training data, ZEN is still competitive on the SA, SPM and NLI tasks. Particularly, ZEN outperforms ERNIE 2.0 (B) on SA (TEST) and SPM (TEST), which indicates that the n-gram-enhanced character-based encoder of ZEN can achieve performance comparable to approaches using significantly more resources. Since the two approaches are complementary to each other, one might be able to combine them to achieve higher performance. Moreover, ZEN and ERNIE 2.0 (L) have comparable performance on certain tasks (e.g., SA and SPM), which further confirms the power of ZEN even though the model of ERNIE 2.0 (L) is significantly larger. Similar results are also observed for ZEN and NEZHA, where ZEN again illustrates its effectiveness when compared to a model learned with a larger model size, more data, and more tricks applied in pre-training. However, for NLI, ZEN's performance is not as good as ERNIE 2.0 and NEZHA (B & L), which indicates that their models are good at the inference task owing to their larger model sizes and large-scale corpora with more prior knowledge. To examine whether ZEN can be scaled up by increasing model parameters, we also conducted experiments with ZEN-large (corresponding to BERT-large); our initial results confirm that increasing the parameters of both the character and n-gram encoders improves performance on all seven downstream tasks (detailed results are not presented in this paper owing to space limitations).

More importantly, to show that the improvement of ZEN over BERT is statistically significant, we conduct Student's t-tests between ZEN (P) and BERT (P). The p-values are computed by running the same model ten times on each task, and the results are shown in Table 2 with asterisks. Note that we measure the 95% confidence interval for the difference between the two models on all seven tasks. For all experiments except for the dev set of DC, the p-value is smaller than 0.05, which shows that ZEN consistently outperforms BERT, not by chance, and thus confirms the validity and effectiveness of our model design with n-gram encoding.
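For reference, the significance test amounts to comparing two samples of ten scores per task; below is a minimal sketch with placeholder numbers (not the paper's actual scores, and whether the test is paired or independent is our assumption):

```python
from scipy import stats

# Ten fine-tuning runs per model on one task (placeholder values).
zen_scores  = [96.1, 96.0, 96.2, 96.1, 95.9, 96.3, 96.0, 96.1, 96.2, 96.0]
bert_scores = [95.7, 95.6, 95.8, 95.7, 95.5, 95.8, 95.6, 95.7, 95.8, 95.6]

t_stat, p_value = stats.ttest_ind(zen_scores, bert_scores)
significant = p_value < 0.05  # 95% confidence level, as in the paper
```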

Pre-training with Small Corpus
Pre-trained models usually require a large corpus for training. However, in many applications in specialized domains, a large corpus may not be available. For such applications with limited training data, ZEN, with n-gram enhancement, is expected to encode text much more effectively. Therefore, to further illustrate the advantage of ZEN, we conduct an experiment that uses a small corpus to pre-train BERT and ZEN. In detail, we prepare a corpus with 1/10 the size of the entire Chinese Wikipedia by randomly selecting sentences from it. Then both encoders are pre-trained on it with random initialization and tested on the same NLP tasks as in the previous experiment. The results are reported in Table 3. In general, the same trend is shown in this experiment as in the previous one: ZEN consistently outperforms BERT on all tasks. This observation confirms that representing n-grams provides stable enhancement when our model is trained on corpora of different sizes. In detail, these results also reveal that n-gram information helps more on some tasks, e.g., CWS, NER and NLI, than on the others. The reason is not surprising, since the boundary information carried by n-grams can play a pivotal role in these tasks. Overall, this experiment simulates the situation of pre-training a text encoder with limited data, which could otherwise be a decisive barrier in cold-start scenarios, and thus demonstrates that ZEN has the potential to perform well in this situation.

Analysis
We analyze ZEN with respect to several factors affecting its performance; details are presented in this section.

Effects of Pre-training Epochs
The number of pre-training epochs is another factor affecting the performance of pre-trained encoders. In this analysis, we use CWS and SA as two probing tasks to test the performance of different encoders (BERT and ZEN) against the number of pre-training epochs. The pre-trained models at certain epochs are fine-tuned on these tasks, and the results are illustrated in Figures 2 and 3. We have the following observations. First, for both P and R models, ZEN shows better curves than those of BERT on both tasks, which indicates that ZEN achieves higher performance at comparable pre-training stages. Second, in the R setting, ZEN shows noticeably faster convergence than BERT, especially during the first few epochs of pre-training. This demonstrates that n-gram information improves the encoder's performance when pre-training starts from random initialization.

Effects of N-gram Extraction Threshold
To explore how the n-gram extraction cut-off threshold affects the performance of ZEN, we test it with different thresholds for n-gram lexicon extraction. Similar to the previous experiment, we use CWS and SA as the probing tasks in this analysis. The first analysis, on threshold-performance relations, is presented in Figure 4, where we vary the threshold from 0 to 40 and use a maximum of 128 n-grams in pre-training. We observe that the best-performing ZEN on both tasks is obtained when the threshold is set to 15: performance improves as the threshold increases toward 15 and degrades as it goes beyond 15. This observation confirms that either too many (lower threshold) or too few (higher threshold) n-grams in the lexicon are less helpful in enhancing ZEN's performance, since there is a balance between introducing enough knowledge and introducing noise.
In the second analysis, given the optimal threshold (i.e., 15), we investigate the performance of ZEN with different maximum numbers of n-grams extracted per input sequence in pre-training. We test numbers ranging from 0 (no n-grams encoded in ZEN) to 128, with the results shown in Figure 5 (the X-axis is in log view with base 2). It shows that 32 ($2^5$) gives a good trade-off between performance and computation, although there is a small gain from using more n-grams. This analysis illustrates that ZEN requires only a small number of n-grams to achieve good performance.

Visualization of N-gram Representations
Case studies are conducted on selected instances to further illustrate the effectiveness of n-gram representations in pre-training ZEN. Figures 6 and 7 visualize the weights of the n-grams extracted from two input instances when they are encoded by ZEN across different layers. In general, 'valid' n-grams are favored over others, e.g., 提高 (improve) and 波士顿 (Boston) have higher weights than 会提高 (will improve) and 士顿 (Ston), especially n-grams that have crossing ambiguities in the context, e.g., 高速 (high speed) should not be considered in the first instance, so 速度 (speed) receives a higher weight than it. This observation illustrates that ZEN is able not only to distinguish phrasal n-grams from others but also to select appropriate ones according to the context. Interestingly, across layers, long (and valid) n-grams, e.g., 提高速度 (speed up) and 波士顿咨询 (Boston Consulting Group), tend to receive higher weights at higher layers, which implicitly indicates that such n-grams carry more semantic than morphological information. We note that the information encoded in BERT follows a similar layer-wise order, as suggested in Jawahar et al. (2019). The observations from this case study not only illustrate the details of how n-grams enhance the pre-trained model, but also suggest that ZEN provides a potential solution to some text analysis tasks, e.g., chunking and keyphrase extraction.
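Such a visualization can be reproduced with a simple heatmap once per-layer n-gram weights are collected; the sketch below uses random placeholder weights (the weight-collection step, e.g., averaging attention mass over heads, is our assumption, not the paper's stated procedure):

```python
import matplotlib.pyplot as plt
import numpy as np

ngrams = ["提高", "速度", "高速", "提高速度"]  # n-grams from one instance
weights = np.random.rand(6, len(ngrams))      # placeholder: (layers, n-grams)

fig, ax = plt.subplots()
im = ax.imshow(weights, aspect="auto", cmap="Blues")
ax.set_xticks(range(len(ngrams)))
ax.set_xticklabels(ngrams)
ax.set_xlabel("extracted n-gram")
ax.set_ylabel("n-gram encoder layer")
fig.colorbar(im, ax=ax, label="weight")
plt.show()
```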

Related Work
Representation learning of text has attracted much attention in recent years with the rise of deep learning in NLP (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014). There is considerable interest in representing text with contextualized information (Ling et al., 2015; Melamud et al., 2016; Bojanowski et al., 2017; Song et al., 2018; Peters et al., 2018a). Following this paradigm, pre-trained models have been proposed and proven useful in many NLP tasks (Devlin et al., 2018; Radford et al., 2018, 2019; Yang et al., 2019; Liu et al., 2019c). In detail, such models can be categorized into two types: autoregressive and autoencoding encoders. The former behave like normal language models that predict the probability distributions of text units following observed texts; these models, such as GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019), are trained to encode a uni-directional context. Differently, the autoencoding models, such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019), leverage bidirectional context, and encode texts by reconstructing the masked tokens in each text instance according to their context from both sides. Particularly for Chinese, many enhanced pre-trained models have been proposed that utilize word-level information in one way or another, because words carry important linguistic information. For example, ERNIE 1.0 (Sun et al., 2019a) adopted a multi-level masking strategy performed on different levels of text; its improved version, ERNIE 2.0 (Sun et al., 2019b), used a continual pre-training strategy that benefits from multi-task learning with more parameters in the model. Recently, BERT-wwm (Cui et al., 2019) enhanced Chinese BERT with a simple masking of whole words. In addition, there are other recent studies that enhanced BERT for Chinese language processing, such as optimizing training via special optimization techniques (Wei et al., 2019) or prior knowledge (Liu et al., 2019b). All these studies reveal that processing larger granularities of text is helpful in Chinese, which is consistent with previous findings in many Chinese NLP tasks (Wu et al., 2015; Higashiyama et al., 2019). Compared to the aforementioned studies, ZEN provides an alternative solution that explicitly encodes n-grams into character-based encoding, rather than incorporating word/phrase information through weak supervision, i.e., masking.

Conclusion and Future Work
In this paper, we proposed ZEN, a pre-trained Chinese text encoder enhanced by n-gram representations, where different combinations of characters are extracted, encoded and integrated while training a backbone model, i.e., BERT. In ZEN, given a sequence of Chinese characters, n-grams are extracted and their information is effectively incorporated into the character encoder. Different from previous work, ZEN provides an alternative way of learning from larger text granularity for pre-trained models, where the structure of BERT is extended by another Transformer-style encoder to represent the extracted n-grams for each input text instance.
Experiments on several NLP tasks demonstrated the validity and effectiveness of ZEN: state-of-the-art results were obtained on them while ZEN is built upon the BERT base model and requires less training data and no knowledge from external sources compared to other existing Chinese text encoders. Experiments on a small corpus also showed ZEN's efficiency and its ability to learn from limited data. Further analyses revealed the factors affecting ZEN's performance, where the quality of the n-gram lexicon and the number of n-grams used for each input are more important than the number of training epochs.
Note that ZEN employs a different method to incorporate word (n-gram) information, which could be complementary to other previous approaches; therefore, it is potentially beneficial to combine it with other character-encoding approaches. For future work, we plan to enlarge ZEN, apply it to other languages that lack white-space tokenization, and compare different n-gram extraction methods, e.g., obtaining n-grams via Byte Pair Encoding (Sennrich et al., 2015) or WordPiece tokenization (Wu et al., 2016).