Pre-training with Meta Learning for Chinese Word Segmentation

Recent research shows that pre-trained models (PTMs) are beneficial to Chinese Word Segmentation (CWS). However, the PTMs used in previous works usually adopt language modeling as the pre-training task, lacking task-specific prior segmentation knowledge and ignoring the discrepancy between the pre-training task and downstream CWS tasks. In this paper, we propose MetaSeg, a CWS-specific pre-trained model that employs a unified architecture and incorporates a meta learning algorithm into a multi-criteria pre-training task. Empirical results show that MetaSeg can utilize common prior segmentation knowledge from different existing criteria and alleviate the discrepancy between pre-trained models and downstream CWS tasks. Moreover, MetaSeg achieves new state-of-the-art performance on twelve widely used CWS datasets and significantly improves model performance in low-resource settings.


Introduction
Chinese Word Segmentation (CWS) is a fundamental task for Chinese natural language processing (NLP), which aims at identifying word boundaries in a sentence composed of continuous Chinese characters. It provides a basic component for other NLP tasks such as named entity recognition (Li et al., 2020), dependency parsing, and semantic role labeling (Xia et al., 2019).
Generally, most previous studies model CWS as a character-based sequence labeling task (Xue, 2003; Zheng et al., 2013; Chen et al., 2015; Ma et al., 2018). Recently, pre-trained models (PTMs) such as BERT (Devlin et al., 2019) have been introduced into CWS, providing prior semantic knowledge and boosting the performance of CWS systems. Yang (2019) directly fine-tunes BERT on several CWS benchmark datasets. Other work fine-tunes BERT in a multi-criteria learning framework, where each criterion shares a common BERT-based feature extraction layer and has a separate projection layer. Meng et al. (2019) combine Chinese character glyph features with pre-trained BERT representations. Tian et al. (2020) propose a neural CWS framework, WMSEG, which utilizes memory networks to incorporate wordhood information into the pre-trained model ZEN (Diao et al., 2019). PTMs have proved quite effective when fine-tuned on downstream CWS tasks. However, the PTMs used in previous works usually adopt language modeling as the pre-training task. Thus, they usually lack task-specific prior knowledge for CWS and ignore the discrepancy between pre-training tasks and downstream CWS tasks.
To deal with the aforementioned problems of PTMs, we consider introducing a CWS-specific pre-trained model based on existing CWS corpora, to leverage prior segmentation knowledge. However, there are multiple inconsistent segmentation criteria for CWS, where each criterion represents a unique style of segmenting a Chinese sentence into words, as shown in Table 1. Meanwhile, we can easily observe that different segmentation criteria share a large proportion of word boundaries, such as the boundaries between the word units "李娜 (Li Na)", "进入 (entered)" and "半决赛 (the semi-final)", which are the same for all segmentation criteria. This shows that common prior segmentation knowledge is shared by different criteria.
In this paper, we propose a CWS-specific pre-trained model, METASEG. To leverage the segmentation knowledge shared by different criteria, METASEG utilizes a unified architecture and introduces a multi-criteria pre-training task. Moreover, to alleviate the discrepancy between pre-trained models and downstream unseen criteria, a meta learning algorithm (Finn et al., 2017) is incorporated into the multi-criteria pre-training task of METASEG.
Experiments show that METASEG significantly outperforms previous works and achieves new state-of-the-art results on twelve CWS datasets. Further experiments show that METASEG generalizes better to downstream unseen CWS tasks in low-resource settings and improves recall for Out-Of-Vocabulary (OOV) words. To the best of our knowledge, METASEG is the first task-specific pre-trained model especially designed for CWS.

Related Work
Recently, PTMs have been used for CWS and achieve good performance (Devlin et al., 2019). These PTMs usually exploit fine-tuning as the main way of transferring prior knowledge to downstream CWS tasks. Specifically, some methods directly fine-tune PTMs on CWS tasks (Yang, 2019), while others fine-tune them in a multi-task framework. Besides, other features have also been incorporated into PTMs and fine-tuned jointly, including Chinese glyph features (Meng et al., 2019), wordhood features (Tian et al., 2020), and so on. Although PTMs improve CWS systems significantly, their pre-training tasks like language modeling still have a wide discrepancy with downstream CWS tasks and lack CWS-specific prior knowledge.
Task-specific pre-trained models have lately been studied to introduce task-specific prior knowledge into multiple NLP tasks. Specifically designed pre-training tasks are introduced to obtain the task-specific pre-trained models, and these models are then fine-tuned on the corresponding downstream NLP tasks, such as named entity recognition (Xue et al., 2020), sentiment analysis (Ke et al., 2020) and text summarization. In this paper, we propose a CWS-specific pre-trained model, METASEG.

Approach
Like other task-specific pre-trained models (Ke et al., 2020), the pipeline of METASEG is divided into two phases: a pre-training phase and a fine-tuning phase. In the pre-training phase, we design a unified architecture and incorporate a meta learning algorithm into a multi-criteria pre-training task, to obtain a CWS-specific pre-trained model which has less discrepancy with downstream CWS tasks. In the fine-tuning phase, we fine-tune the pre-trained model on downstream CWS tasks, to leverage the prior knowledge learned in the pre-training phase.
In this section, we describe METASEG in three parts. First, we introduce the Transformer-based unified architecture. Second, we elaborate on the multi-criteria pre-training task with the meta learning algorithm. Finally, we give a brief description of the downstream fine-tuning phase.

The Unified Architecture
In traditional CWS systems (Chen et al., 2015; Ma et al., 2018), the CWS model usually adopts a separate architecture for each segmentation criterion. An instance of the CWS model is created for each criterion and trained on the corresponding dataset independently. Thus, a model instance can only serve one criterion, without sharing any segmentation knowledge with other criteria.
To better leverage the common segmentation knowledge shared by multiple criteria, METASEG employs a unified architecture based on the widely used Transformer network (Vaswani et al., 2017), with a shared encoder and decoder for all criteria, as illustrated in Figure 1.
The input to the unified architecture is an augmented sentence, composed of a specific criterion token plus the original sentence, representing both criterion and text information. In the embedding layer, the augmented sentence is transformed into input representations by summing the token, segment and position embeddings. The Transformer network is used as the shared encoder layer, encoding the input representations into hidden representations through blocks of multi-head attention and position-wise feed-forward modules (Vaswani et al., 2017). A shared linear decoder with softmax then maps the hidden representations to a probability distribution over segmentation labels. The segmentation labels consist of the four CWS labels {B, M, E, S}, denoting word-beginning, word-middle, word-ending and single-character word, respectively.
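As an illustration of this labeling scheme, the mapping from a segmented sentence to {B, M, E, S} labels can be sketched as follows (a minimal sketch; the function name is ours, not part of the paper's code):

```python
def words_to_bmes(words):
    """Convert a list of segmented words into character-level
    {B, M, E, S} labels (Begin, Middle, End, Single)."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.append("B")
            labels.extend("M" * (len(word) - 2))
            labels.append("E")
    return labels

# "李娜 / 进入 / 半决赛" (Li Na / entered / the semi-final)
print(words_to_bmes(["李娜", "进入", "半决赛"]))
# ['B', 'E', 'B', 'E', 'B', 'M', 'E']
```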
Formally, the unified architecture can be described as a probabilistic model P_θ(Y|X), which represents the probability of the segmentation label sequence Y given the augmented input sentence X. The model parameters θ are invariant across criteria c, and would capture the common segmentation knowledge shared by different criteria.

[Figure 1: The input is composed of a criterion token and a sentence, where the criterion can vary with the same sentence. The output is the corresponding sequence of segmentation labels under the given criterion.]

Multi-Criteria Pre-training with Meta Learning
In this part, we describe multi-criteria pre-training with meta learning for METASEG. We construct a multi-criteria pre-training task to fully mine the shared prior segmentation knowledge of different criteria. Meanwhile, to alleviate the discrepancy between pre-trained models and downstream CWS tasks, a meta learning algorithm (Finn et al., 2017) is used for the pre-training optimization of METASEG.
Multi-Criteria Pre-training Task

As mentioned in Section 1, a variety of CWS corpora already exist (Emerson, 2005; Jin and Chen, 2008). These corpora usually follow inconsistent segmentation criteria, and the human-annotated data for each criterion is insufficient. Each criterion is usually used to fine-tune a CWS model separately on a relatively small dataset, ignoring the knowledge shared across criteria. In our multi-criteria pre-training task, by contrast, multiple criteria are used jointly for pre-training, to capture the common segmentation knowledge shared by different existing criteria. First, nine public CWS corpora (see Section 4.1) with diverse segmentation criteria are merged into a joint multi-criteria pre-training corpus D_T. Every sentence under each criterion is augmented with its criterion and then incorporated into the joint corpus. To represent criterion information, we add a specific criterion token in front of the input sentence, such as [pku] for the PKU criterion (Emerson, 2005). We also add [CLS] and [SEP] tokens to the sentence beginning and ending respectively, as in Devlin et al. (2019). This augmented input sentence represents both criterion and text information, as shown in Figure 1.
Then, we randomly pick 10% sentences from the joint multi-criteria pre-training corpus D T and replace their criterion tokens with a special token [unc], which means undefined criterion. With this design, the undefined criterion token [unc] would learn criterion-independent segmentation knowledge and help to transfer such knowledge to downstream CWS tasks.
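The corpus augmentation described above can be sketched as follows. This is a simplified illustration: the helper name and tokenization details are our own assumptions, while the criterion-token prefix, the [CLS]/[SEP] wrapping, and the 10% [unc] replacement rate follow the text:

```python
import random

def augment_sentence(chars, criterion, unc_rate=0.10, rng=random):
    """Prefix a criterion token (replaced by [unc] with probability
    unc_rate) and wrap the character sequence with [CLS]/[SEP]."""
    token = "[unc]" if rng.random() < unc_rate else f"[{criterion}]"
    return ["[CLS]", token] + list(chars) + ["[SEP]"]

print(augment_sentence("李娜进入半决赛", "pku", rng=random.Random(0)))
# ['[CLS]', '[pku]', '李', '娜', '进', '入', '半', '决', '赛', '[SEP]']
```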
Finally, given a pair of an augmented sentence X and segmentation labels Y from the joint multi-criteria pre-training corpus D_T, our unified architecture (Section 3.1) predicts the probability of segmentation labels P_θ(Y|X). We use the standard negative log-likelihood (NLL) loss as the objective function for this multi-criteria pre-training task:

\mathcal{L}(\theta; D_T) = -\sum_{(X,Y) \in D_T} \log P_\theta(Y \mid X)

Meta Learning Algorithm

The objective of most PTMs is to maximize performance on the pre-training task (Devlin et al., 2019), which leads to a discrepancy between pre-trained models and downstream tasks. Besides, a CWS model pre-trained on the multi-criteria pre-training task could still have a discrepancy with downstream unseen criteria, because the downstream criteria may not appear in pre-training. To alleviate this discrepancy, we utilize a meta learning algorithm (Lv et al., 2020) for the pre-training optimization of METASEG. The main objective of meta learning is to maximize generalization performance on potential downstream tasks, which prevents pre-trained models from overfitting the pre-training task. As shown in Figure 2, by introducing the meta learning algorithm, pre-trained models have less discrepancy with downstream tasks instead of inclining towards the pre-training task.

[Figure 2: PT represents the multi-criteria pre-training task, with the solid line representing the pre-training phase; DT represents the downstream CWS task, with the dashed line representing the fine-tuning phase; θ represents the pre-trained model parameters.]
The meta learning algorithm treats the pre-training task T as one of the downstream tasks. It optimizes meta parameters θ_0, from which the task-specific model parameters θ_k are obtained by k gradient descent steps over the training data D_T^{train} of task T:

\theta_i = \theta_{i-1} - \alpha \nabla_{\theta_{i-1}} \mathcal{L}(\theta_{i-1}; D^{train}_{T,i}), \quad i = 1, \dots, k, \tag{1}

where α is the learning rate and D^{train}_{T,i} is the i-th batch of training data. Formally, the task-specific parameters θ_k can be denoted as a function U_k of the meta parameters θ_0:

\theta_k = U_k(\theta_0). \tag{2}

To maximize the generalization performance on task T, we optimize the meta parameters θ_0 on a batch of test data D^{test}_T:

\min_{\theta_0} \mathcal{L}(U_k(\theta_0); D^{test}_T). \tag{3}

This meta optimization can be achieved by gradient descent, so the update rule for the meta parameters θ_0 is

\theta_0 \leftarrow \theta_0 - \beta \nabla_{\theta_0} \mathcal{L}(\theta_k; D^{test}_T), \tag{4}

where β is the meta learning rate. The gradient in Equation 4 can be rewritten as

\nabla_{\theta_0} \mathcal{L}(\theta_k; D^{test}_T) = \nabla_{\theta_k} \mathcal{L}(\theta_k; D^{test}_T) \prod_{i=1}^{k} \left( I - \alpha \nabla^2_{\theta_{i-1}} \mathcal{L}(\theta_{i-1}; D^{train}_{T,i}) \right) \approx \nabla_{\theta_k} \mathcal{L}(\theta_k; D^{test}_T), \tag{5}

where the last step adopts a first-order approximation for computational simplicity (Finn et al., 2017).

Specifically, the meta learning algorithm for pre-training optimization is described in Algorithm 1. It can be divided into two stages: i) the meta train stage, which updates task-specific parameters by k gradient descent steps over training data; ii) the meta test stage, which updates meta parameters by one gradient descent step over test data. The hyper-parameter k is the number of gradient descent steps in the meta train stage; the meta learning algorithm degrades to the normal gradient descent algorithm when k = 0. The returned meta parameters θ_0 are used as the pre-trained model parameters of METASEG.
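The two-stage meta step can be sketched in miniature as follows. This is a toy first-order version operating on plain floats with vanilla SGD in both stages, not the paper's actual AdamW/Transformer setup; function names are ours:

```python
def meta_step(theta, grad_fn, train_batches, test_batch,
              k=1, alpha=0.01, beta=0.01):
    """One meta-optimization step (first-order approximation).

    theta         -- meta parameters (list of floats), i.e. theta_0
    grad_fn(p, b) -- loss gradient at parameters p on batch b
    """
    # Meta-train stage: k gradient descent steps from theta_0 over
    # the training batches, producing task-specific parameters theta_k.
    fast = list(theta)
    for i in range(k):
        g = grad_fn(fast, train_batches[i])
        fast = [p - alpha * gi for p, gi in zip(fast, g)]

    # Meta-test stage: the gradient of the test loss at theta_k is
    # applied directly to theta_0 (first-order approximation).
    g = grad_fn(fast, test_batch)
    return [p - beta * gi for p, gi in zip(theta, g)]

# Toy check: fit w in the loss (w*x - y)^2 with x=1, y=2.
def quad_grad(params, batch):
    (w,), (x, y) = params, batch
    return [2 * (w * x - y) * x]

theta = [0.0]
for _ in range(200):
    theta = meta_step(theta, quad_grad, [(1.0, 2.0)], (1.0, 2.0),
                      k=1, alpha=0.1, beta=0.1)
# theta[0] converges towards 2.0
```

With k = 0 the inner loop is skipped and the update reduces to one ordinary gradient descent step, matching the degenerate case described above.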

Downstream Fine-tuning
After the pre-training phase described in Section 3.2, we obtain the pre-trained model parameters θ_0, which capture prior segmentation knowledge and have less discrepancy with downstream CWS tasks. We fine-tune these pre-trained parameters θ_0 on a downstream CWS corpus, to transfer the prior segmentation knowledge.
For format consistency, we process sentences from the given downstream corpus in the same way as in Section 3.2, by adding the criterion token.

Datasets

The WTB, UD and ZX datasets are kept for the downstream fine-tuning phase, while the other nine datasets are combined into the joint multi-criteria pre-training corpus (Section 3.2), which amounts to nearly 18M words.
For the CTB6, WTB, UD, ZX and CNC datasets, we use the official split of training, development, and test sets. For the rest, we use the official test set and randomly pick 10% of the samples from the training data as the development set. We pre-process all these datasets following four procedures:

1. Convert traditional Chinese datasets, such as CITYU, AS and CKIP, into simplified Chinese;

2. Convert full-width tokens into half-width;

3. Replace continuous runs of English letters and digits with unique tokens;

4. Split sentences into shorter clauses by punctuation.

Table 2 presents the statistics of the processed datasets.

1 http://corpus.zhonghuayuwen.org/
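Procedures 2-4 can be sketched as follows (the traditional-to-simplified conversion of procedure 1 needs an external mapping table and is omitted). The token names `<eng>` and `<num>` are illustrative placeholders, since the paper does not specify the unique tokens it uses:

```python
import re

def normalize_fullwidth(text):
    """Convert full-width ASCII (U+FF01-U+FF5E) and the
    ideographic space (U+3000) to half-width counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII range
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

def replace_alnum_runs(text, letter_tok="<eng>", digit_tok="<num>"):
    """Replace runs of English letters / digits with single tokens."""
    text = re.sub(r"[A-Za-z]+", letter_tok, text)
    return re.sub(r"[0-9]+", digit_tok, text)

def split_clauses(text):
    """Split a sentence into shorter clauses at Chinese punctuation."""
    return [c for c in re.split(r"(?<=[，。！？；])", text) if c]

print(normalize_fullwidth("ＡＢ１２"))                  # AB12
print(split_clauses("李娜进入半决赛，她赢了。"))
# ['李娜进入半决赛，', '她赢了。']
```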

Hyper-Parameters
We employ METASEG with the same architecture as BERT-Base (Devlin et al., 2019), which has 12 Transformer layers, a hidden size of 768 and 12 attention heads. In the pre-training phase, METASEG is initialized with the released parameters of the Chinese BERT-Base model 2 and then pre-trained with the multi-criteria pre-training task. The maximum input length is 64, with batch size 64 and dropout rate 0.1. We adopt the AdamW optimizer (Loshchilov and Hutter, 2019) with β_1 = 0.9, β_2 = 0.999 and a weight decay rate of 0.01. The optimizer is implemented within the meta learning algorithm, where both the learning rate α and the meta learning rate β are set to 2e-5 with a linear warm-up proportion of 0.1. The number of meta train steps is set to k = 1 according to downstream performance. The pre-training process runs for nearly 127,000 meta test steps, amounting to (k + 1) × 127,000 gradient descent steps, which takes about 21 hours on one NVIDIA Tesla V100 32GB GPU card.
In the fine-tuning phase, we set the maximum input length to 64 for all criteria except WTB (128), with batch size 64. We fine-tune METASEG with the AdamW optimizer under the same settings as the pre-training phase, but without meta learning. METASEG is fine-tuned for 5 epochs on each downstream dataset.
In low-resource settings, experiments are performed on the WTB dataset with a maximum input length of 128. We evaluate METASEG at sampling rates of 1%, 5%, 10%, 20%, 50% and 80%. The batch size is 1 for the 1% sampling rate and 8 for the rest. Other hyper-parameters are kept the same as in the fine-tuning phase.
The standard F1 score is used to evaluate the performance of all models. We report the F1 score of each model on the test set according to its best checkpoint on the development set, following common practice.
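The standard word-level F1 score for CWS counts a predicted word as correct only when both of its boundaries match the gold segmentation. A minimal reference implementation over {B, M, E, S} label sequences (function names are ours):

```python
def labels_to_spans(labels):
    """Recover word spans (start, end) from a {B, M, E, S} sequence."""
    spans, start = [], 0
    for i, lab in enumerate(labels):
        if lab in ("B", "S"):
            start = i
        if lab in ("E", "S"):
            spans.append((start, i + 1))
    return spans

def segmentation_f1(gold_labels, pred_labels):
    """Word-level F1: a word is correct only when both of its
    boundaries match the gold segmentation."""
    gold = set(labels_to_spans(gold_labels))
    pred = set(labels_to_spans(pred_labels))
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, against the gold labels of "李娜/进入/半决赛" (B E B E B M E), a prediction that over-segments the middle word (B E S S B M E) gets 2 of 4 predicted words right and 2 of 3 gold words right, giving F1 = 4/7.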

Results on Pre-training Criteria
After pre-training, we fine-tune METASEG on each pre-training criterion. Table 3 shows F1 scores on the test sets of the nine pre-training criteria in two blocks. The first block displays the performance of previous works. The second block displays three models implemented by us: BERT-Base is the fine-tuned model initialized with the official BERT-Base parameters; METASEG (w/o fine-tune) is our proposed pre-trained model used directly for inference without fine-tuning; METASEG is the fine-tuned model initialized with pre-trained METASEG parameters. From the second block, we observe that fine-tuned METASEG outperforms fine-tuned BERT-Base on each criterion, with a 0.26% improvement on average. This shows that METASEG is more effective when fine-tuned for CWS. Even without fine-tuning, METASEG (w/o fine-tune) still performs better than the fine-tuned BERT-Base model, indicating that our proposed pre-training approach is the key factor in the effectiveness of METASEG. Fine-tuned METASEG performs better than the variant without fine-tuning, showing that downstream fine-tuning is still necessary for a specific criterion. Furthermore, METASEG achieves state-of-the-art results on eight of the nine pre-training criteria, demonstrating the effectiveness of our proposed methods.

Results on Downstream Criteria
To evaluate the knowledge transfer ability of METASEG, we perform experiments on three unseen downstream criteria which are absent from the pre-training phase. Table 4 shows F1 scores on the test sets of the three downstream criteria. The first block displays previous works on these downstream criteria, while the second block displays the three models implemented by us (see Section 4.2.1 for details).
Results show that METASEG outperforms the previous best model by 0.56% on average, achieving new state-of-the-art performance on the three downstream criteria. Moreover, METASEG (w/o fine-tune) in effect performs zero-shot inference on the downstream criteria and still achieves an 87.28% average F1 score. This shows that METASEG does learn common prior segmentation knowledge in the pre-training phase, even though it has never seen these downstream criteria before.
Compared with BERT-Base, METASEG has the same architecture but a different pre-training task. It can easily be observed that METASEG with fine-tuning outperforms BERT-Base by 0.46% on average. This indicates that METASEG alleviates the discrepancy between pre-trained models and downstream CWS tasks more effectively than BERT-Base.

Ablation Studies
We perform further ablation studies on the effects of meta learning (ML) and multi-criteria pretraining (MP), by removing them consecutively from the complete METASEG model. After removing both of them, METASEG degrades into the normal BERT-Base model. F1 scores for ablation studies on three downstream criteria are illustrated in Table 5.
We observe that the average F1 score drops by 0.12% when removing the meta learning algorithm (-ML), and drops by a further 0.34% when additionally removing the multi-criteria pre-training task (-ML-MP). This demonstrates that meta learning and multi-criteria pre-training are both significant for the effectiveness of METASEG.

Low-Resource Settings
To better explore the downstream generalization ability of METASEG, we perform experiments on the downstream WTB criterion in low-resource settings. Specifically, we randomly sample a given rate of instances from the training set and fine-tune the pre-trained METASEG model on the down-sampled training sets. These settings imitate realistic low-resource circumstances where human-annotated data is insufficient. The performance at different sampling rates is evaluated on the same WTB test set and reported in Table 6. Results show that METASEG outperforms BERT-Base at every sampling rate. The margin grows as the sampling rate decreases, reaching 6.20% at the 1% sampling rate. This demonstrates that METASEG generalizes better on the downstream criterion in low-resource settings.
When the sampling rate drops from 100% to 1%, the F1 score of BERT-Base decreases by 7.60% while that of METASEG decreases by only 2.37%. The performance of METASEG at the 1% sampling rate still reaches 91.60% with only 8 training instances, comparable with the performance of BERT-Base at the 20% sampling rate. This indicates that METASEG makes better use of prior segmentation knowledge and can learn from less data, which would significantly reduce the need for human annotation.

Out-of-Vocabulary Words
Out-of-Vocabulary (OOV) words are words that appear in the inference phase but not in the training phase. OOV words are a critical source of errors in CWS tasks. We evaluate recall for OOV words on the test sets of all twelve criteria in Table 7.
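OOV recall as defined here can be computed by restricting recall to gold words absent from the training vocabulary. A minimal sketch (function names are ours; words are matched by both surface form and character span, so repeated words at different positions count separately):

```python
def word_spans(words):
    """Attach character-offset spans to a word sequence."""
    spans, pos = [], 0
    for w in words:
        spans.append((w, (pos, pos + len(w))))
        pos += len(w)
    return spans

def oov_recall(train_vocab, gold_sents, pred_sents):
    """Recall restricted to gold words absent from the training
    vocabulary."""
    total = correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        pred_set = set(word_spans(pred))
        for word, span in word_spans(gold):
            if word not in train_vocab:
                total += 1
                if (word, span) in pred_set:
                    correct += 1
    return correct / total if total else 0.0
```

For example, if "李娜" never occurred in training, a prediction that splits it into "李" and "娜" misses the only OOV word and gets OOV recall 0.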
Results show that METASEG outperforms BERT-Base on ten of the twelve criteria and improves recall for OOV words by 0.99% on average. This indicates that METASEG benefits from our proposed pre-training methodology and recognizes more OOV words in the inference phase.

Non-Pretraining Setup
To investigate the contribution of multi-criteria pre-training to the performance of METASEG, we perform experiments with a non-pretraining baseline, Transformer. This baseline has the same architecture and is trained directly from scratch on the same nine datasets (Section 4.2.1), but has no pre-training phase, unlike METASEG. F1 scores for Transformer and METASEG are displayed in Table 8. Results show that METASEG outperforms the non-pretraining Transformer on every criterion and achieves a 2.40% gain on average, even with the same datasets and architecture. This demonstrates that multi-criteria pre-training is vital to the effectiveness of METASEG and that the performance gain does not come merely from the large dataset size.
Moreover, METASEG has the generalization ability to transfer prior knowledge to downstream unseen criteria, which could not be achieved by the non-pretraining counterpart Transformer.

Visualization
To visualize the discrepancy between pre-trained models and downstream criteria, we plot similarities of three downstream criteria with METASEG and BERT. Specifically, we extract the criterion token embeddings of three downstream criteria WTB, UD and ZX. We also extract the undefined criterion token embeddings of METASEG and BERT as representations of these two pre-trained models. We compute cosine similarities between three criteria embeddings and two pre-trained model embeddings, and illustrate them in Figure 3.
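The similarity used here is plain cosine similarity between the extracted criterion-token embedding vectors; as a minimal reference:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))
```

A downstream criterion embedding c is then placed in the plot by the pair (cosine(c, e_MetaSeg), cosine(c, e_BERT)), where the two reference embeddings are those of the [unc] tokens of the respective models.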
We can observe that the similarities of all three downstream criteria lie above the dashed line, indicating that all three downstream criteria are more similar to METASEG than to BERT. The closer a criterion is to the upper left corner, the more similar it is to METASEG. Therefore, WTB is the most similar criterion to METASEG among the three, which qualitatively corresponds to the observation that the WTB criterion has the largest performance gain in Table 4. These visualization results show that our proposed approach effectively alleviates the discrepancy between pre-trained models and downstream CWS tasks, making METASEG more similar to the downstream criteria.

Conclusion
In this paper, we propose a CWS-specific pre-trained model, METASEG, which employs a unified architecture and incorporates a meta learning algorithm into a multi-criteria pre-training task. Experiments show that METASEG makes good use of common prior segmentation knowledge from different existing criteria and alleviates the discrepancy between pre-trained models and downstream CWS tasks. METASEG also generalizes better in low-resource settings and achieves new state-of-the-art performance on twelve CWS datasets.
Since the discrepancy between pre-training tasks and downstream tasks also exists in other NLP tasks and other languages, in the future we will explore whether the approach of pre-training with meta learning presented in this paper can be applied to tasks and languages beyond Chinese word segmentation.