Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks for Chinese language processing. Previous studies have demonstrated that jointly performing them can be an effective one-step solution to both tasks and this joint task can benefit from a good modeling of contextual features such as n-grams. However, their work on modeling such contextual features is limited to concatenating the features or their embeddings directly with the input embeddings without distinguishing whether the contextual features are important for the joint task in the specific context. Therefore, their models for the joint task could be misled by unimportant contextual information. In this paper, we propose a character-based neural model for the joint task enhanced by multi-channel attention of n-grams. In the attention module, n-gram features are categorized into different groups according to several criteria, and n-grams in each group are weighted and distinguished according to their importance for the joint task in the specific context. To categorize n-grams, we try two criteria in this study, i.e., n-gram frequency and length, so that n-grams having different capabilities of carrying contextual information are discriminatively learned by our proposed attention module. Experimental results on five benchmark datasets for CWS and POS tagging demonstrate that our approach outperforms strong baseline models and achieves state-of-the-art performance on all five datasets.


Introduction
Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks in Chinese natural language processing (NLP). Although they can be treated as two separate tasks in a sequential order, previous studies have demonstrated that processing them jointly in a unified sequence labeling framework, where CWS and POS tags are predicted in a single step, can be more effective (Ng and Low, 2004; Jiang et al., 2008; Wang et al., 2011; Sun, 2011; Zeng et al., 2013; Zheng et al., 2013; Zhang et al., 2014; Kurita et al., 2017; Shao et al., 2017). In doing so, existing studies mainly focused on incorporating contextual information (e.g., n-grams) as features into their joint taggers, which had been widely used as an effective way to improve model performance, especially before neural models were widely adopted. Although neural models are powerful in modeling long text sequences, external features from larger granular texts have still been demonstrated to be useful in existing neural models (Zheng et al., 2013; Kurita et al., 2017; Shao et al., 2017). In these models, contextual features are leveraged by directly concatenating their embeddings with the character embeddings in the embedding layer, where all contextual features are treated equally without distinguishing their importance to the joint tagging process in the specific context. However, this concatenation approach fails to consider that different contextual features could contribute differently to the joint task in a specific context, especially when there are ambiguities in the input sentence.

Figure 1: An illustration of the different roles of different n-gram features for the character "大" (highlighted in yellow) in an example sentence. In this case, "放大" (enlarge) and "大道" (avenue) are two n-gram features associated with "大", where the former one (in red color) suggests incorrect joint CWS and POS labels while the latter one (in green color) suggests the correct labels.

For example, in the sentence in Figure 1, there are two n-gram features associated with the character "大" (highlighted in yellow), i.e., "放大" (enlarge) and "大道" (avenue). In this case, both n-grams are common Chinese words; the former (in red color) suggests incorrect CWS and POS tagging results, while the latter (in green color) suggests the correct results. If a model treats both n-gram features equally, it could be misled by the former (i.e., "放大" (enlarge) in red color). Therefore, it is important to distinguish the contributions of different n-grams to the joint task in a specific context, and extra effort is needed to leverage such n-grams effectively. In addition, considering that n-grams with different properties could also contribute differently to the joint task, it can be helpful to categorize the n-gram features into groups according to these properties and then model them separately.
In this paper, we propose a neural character-based joint CWS and POS tagger with multi-channel attention (MCATT) of character n-grams to improve the joint task. Specifically, to tag each character in an input sentence, the proposed MCATT first extracts the n-grams associated with the character from a pre-constructed lexicon and then categorizes such n-grams according to a specific metric (i.e., frequency or length). Next, we feed all n-grams within the same category into one channel, where those n-grams are compared and weighted according to their contribution to the joint label prediction in a specific context. Afterwards, the attentions from different channels are combined to help with the tagging process for each corresponding character. Compared to normal attention, where all associated n-grams are compared and weighted together without categorization, the multi-channel design provides an alternative approach to discriminatively leveraging n-grams with different properties, so that the weights for n-grams with similar properties are computed within their own channel rather than globally against all other n-grams. This multi-channel mechanism could be helpful for leveraging infrequent yet important n-grams, because the parameters for those n-grams are updated infrequently during training, so that models with normal attention may fail to distinguish these infrequent but important n-grams in a specific context. We evaluate our proposed model on five widely used benchmark datasets. Our model with multi-channel attention outperforms strong baselines and achieves state-of-the-art results on all datasets.

Our Approach
The architecture of our approach is shown in Figure 2. The left side illustrates the backbone model following the sequence labeling paradigm; the right side elaborates the multi-channel attention module used to incorporate contextual n-gram information into the backbone model. Formally, given an input sentence X = x_1 x_2 ··· x_i ··· x_l, where l is the input sequence length, our approach predicts its corresponding joint CWS and POS label sequence Y = y_1 y_2 ··· y_i ··· y_l by

    Ŷ = arg max_Y p(Y | X, MA(X, S))

where MA denotes the multi-channel attention module. Let N denote a lexicon consisting of a list of n-grams collected from the entire corpus; S ⊂ N is the set of n-grams in X that appear in N. The details of applying the multi-channel attention of n-grams to this framework are provided below.
Figure 2: The overall architecture of our character-based model for the joint CWS and POS tagging with an example input and output. On the left is the backbone model following the sequence labeling paradigm; on the right is the multi-channel attention module with n-grams categorized by their length. Different attention channels for n-grams associated with "大" (big) are highlighted with distinct colors.

The Multi-channel Attentions
N-grams have been used as useful contextual features to enhance text representation for CWS and POS tagging in many studies (Song et al., 2009; Song and Xia, 2013; Shao et al., 2017). However, for joint CWS and POS tagging, previous approaches to leveraging n-gram features are limited to directly concatenating the n-gram embeddings with the input character embedding, where unimportant n-grams may mislead the model and result in incorrect predictions. Therefore, assigning appropriate weights to different n-grams with respect to their contexts is a potentially effective solution (Higashiyama et al., 2019; Tian et al., 2020b) to the joint task, and we propose multi-channel attention to address this problem. In detail, we first categorize n-grams by a specific metric, which in this study is either their frequency or their length, and then model the grouped n-grams in separate attention channels. As a result, the contributions of salient n-grams are highlighted, and the attention weights are not dominated by frequent n-grams or short ones that tend to appear in more sentences. Our model is thus able to leverage the highlighted n-grams accordingly and avoid being misled by unimportant ones.
To train the attention module, for each instance X, we collect all n-grams in X that appear in N to form the set S used in the attention module. The multi-channel attention works as follows. In the first step, all n-grams are categorized into n groups according to their frequencies in a corpus or their lengths; we denote the grouped n-grams as S = {S_1, S_2, ···, S_k, ···, S_n}, the j-th n-gram in group S_k as s_j^(k), and the embedding of s_j^(k) as e_j^(k). Afterwards, for character x_i, the attention weight p_i,j^(k) of each n-gram s_j^(k) is the inner product of the hidden vector h_i and e_j^(k) if x_i is a component of s_j^(k), and 0 otherwise; e.g., in Figure 2, p_3,3^(4) = 0 because "大" is a component of s_1^(4) and s_2^(4) but is not a part of s_3^(4). As a result, the attention of each channel is computed by

    a_i^(k) = Σ_j p_i,j^(k) · e_j^(k)

Finally, the overall attention of different n-grams for x_i is the concatenation of the attentions from all channels:

    a_i = δ_1 a_i^(1) ⊕ δ_2 a_i^(2) ⊕ ··· ⊕ δ_n a_i^(n)

with a trainable positive parameter δ_k to balance the contribution of each channel.
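To make the mechanism concrete, the per-channel weighting and concatenation can be sketched in pure Python. This is an illustrative sketch rather than the paper's implementation: the function names are ours, and the softmax normalization of the masked inner-product scores is an assumption about how the per-channel weights are normalized.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multi_channel_attention(h_i, channels, deltas):
    """h_i: hidden vector of character x_i.
    channels: one list per channel; each entry is a pair
      (n-gram embedding e_j, whether x_i is a component of that n-gram).
    deltas: per-channel balance weights (delta_k in the paper).
    Returns the concatenation of the weighted per-channel attentions."""
    out = []
    for k, channel in enumerate(channels):
        # Score is the inner product of h_i and the n-gram embedding;
        # n-grams that do not contain x_i are masked out (-inf before
        # softmax, i.e. zero weight after normalization).
        scores = [dot(h_i, e) if inside else float("-inf")
                  for e, inside in channel]
        weights = softmax(scores)
        dim = len(channel[0][0])
        # Per-channel attention: weighted sum of the n-gram embeddings.
        a_k = [sum(w * e[d] for w, (e, _) in zip(weights, channel))
               for d in range(dim)]
        # Scale by delta_k and concatenate across channels.
        out.extend(deltas[k] * v for v in a_k)
    return out
```

In this sketch an n-gram outside the current character contributes nothing to its channel, and each channel's output is rescaled by its trainable δ_k before the channels are concatenated.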

Joint Tagging with the Attentions
To leverage the n-grams through the proposed attention module, we first obtain the hidden vector h_i of each x_i in the input sequence from the encoder (e.g., BERT (Devlin et al., 2019)) of the backbone model. Next, we feed the resulting h_i to the attention module and obtain its output a_i, which contains the weighted contextual information carried by the n-gram features. Then, we incorporate this weighted information into the backbone model by concatenating a_i with h_i and aligning the resulting vector to the output dimension with a trainable matrix W_d:

    u_i = W_d · (a_i ⊕ h_i)

Afterwards, we pass u_i to a conditional random field (CRF) decoder to estimate the joint label y_i for x_i.
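This projection step can be sketched in a few lines of pure Python (the helper name is ours; in practice W_d is a learned parameter matrix, not a hand-set one):

```python
def joint_tag_input(a_i, h_i, W_d):
    """Compute u_i = W_d . (a_i (+) h_i), where (+) denotes vector
    concatenation and W_d has shape (out_dim, len(a_i) + len(h_i))."""
    v = a_i + h_i  # list concatenation plays the role of (+) here
    return [sum(w * x for w, x in zip(row, v)) for row in W_d]
```

The resulting u_i would then be scored by the CRF decoder over the joint CWS and POS label set.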

Datasets
In our experiments, five Chinese benchmark datasets are used, including CTB5, CTB6, CTB7, and CTB9 from the Penn Chinese TreeBank (Xue et al., 2005) and the Chinese GSD Treebank of Universal Dependencies (UD) (Nivre et al., 2016). All CTB datasets are in simplified Chinese, while UD is in traditional Chinese. Following Shao et al. (2017), we translate the UD dataset into simplified Chinese before experiments are conducted. To obtain the training, development, and test data for each dataset, we follow previous studies (Jiang et al., 2008; Wang et al., 2011; Zhang et al., 2014; Shao et al., 2017) to split CTB5, CTB6, CTB7, and CTB9, and use the official splits for UD. Since UD contains two types of POS tags, namely universal and language-specific tags, we follow the notation in Shao et al. (2017) and mark the former as UD1 and the latter as UD2. For the POS tag sets, CTB has 33 tags; UD1 and UD2 have 15 and 42 tags, respectively. The statistics of all five datasets, in terms of the number of words and sentences in the training, development, and test sets, are reported in Table 1. We also report out-of-vocabulary (OOV) rates in the development and test sets.

Lexicon Construction
To enhance joint CWS and POS tagging through the multi-channel attention, we need to construct the lexicon N, which is simply a list of n-grams. In this study, we do not want our approach to rely on existing n-gram resources. Therefore, we use three unsupervised methods to obtain n-grams from each dataset, namely accessor variety (AV) (Feng et al., 2004), description length gain (DLG) (Kit and Wilks, 1999), and pointwise mutual information (PMI) (Sun et al., 1998).
Accessor Variety Given a character n-gram s, let the left access number L_av(s) be the number of distinct characters that precede s in the corpus; the right access number R_av(s) is defined similarly. The AV score of s is the minimum of the left and right access numbers:

    AV(s) = min(L_av(s), R_av(s))

In general, n-grams with higher AV scores are more likely to be words in Chinese. Since AV is sensitive to the size of the dataset, in our experiments we use different thresholds for the five datasets: 2 for CTB5, 3 for CTB6, 4 for CTB7, 5 for CTB9, and 1 for UD. For each dataset, we collect all n-grams whose AV scores are higher than the corresponding threshold to build the lexicon N.
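As an illustration, AV can be computed from raw text in a few lines of Python. This is our own sketch, not the paper's code; in particular, treating all sentence boundaries as a single boundary symbol is an assumption, and the thresholding step described above would be applied to the returned scores afterwards.

```python
from collections import defaultdict

def accessor_variety(corpus, max_n=4):
    """Compute AV(s) = min(L_av(s), R_av(s)) for every character
    n-gram s (up to length max_n) in `corpus`, a list of sentences."""
    left, right = defaultdict(set), defaultdict(set)
    for sent in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                s = sent[i:i + n]
                # Collect distinct preceding / following characters;
                # '<' and '>' stand in for sentence boundaries.
                left[s].add(sent[i - 1] if i > 0 else "<")
                right[s].add(sent[i + n] if i + n < len(sent) else ">")
    return {s: min(len(left[s]), len(right[s])) for s in left}
```

Building the lexicon then amounts to keeping the n-grams whose score exceeds the dataset-specific threshold.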
Description Length Gain DLG measures the wordhood of an n-gram s according to the change in the description length of a dataset D with and without treating s as a segment; formally, DLG(s) is calculated by

    DLG(s) = DL(D) − DL(D[r → s] ⊕ s)

where D[r → s] ⊕ s is the revised dataset obtained from the original D by replacing all occurrences of s in D with a single symbol r and appending the original n-gram s to the end. The description length of a corpus D is calculated by

    DL(D) = − Σ_{x ∈ V} c(x) · log (c(x) / |D|)

where V is a character vocabulary containing all character types appearing in D, c(x) denotes the count of character x in D, and |D| is the total number of characters in D. In our experiments, the threshold for DLG is set to 0; that is, for each dataset D, its lexicon N contains all n-grams whose DLG scores are higher than that threshold.
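A sketch of DLG under these definitions follows (our own illustration; the choice of replacement symbol and of the base-2 logarithm are ours, and the corpus is modeled as a single string):

```python
import math
from collections import Counter

def description_length(text):
    """DL(D) = -sum over character types x of c(x) * log2(c(x) / |D|)."""
    counts = Counter(text)
    total = len(text)
    return -sum(c * math.log2(c / total) for c in counts.values())

def dlg(text, s, marker="\x00"):
    """DLG(s) = DL(D) - DL(D[r -> s] (+) s): the gain from replacing
    every occurrence of s with a fresh symbol r and appending one
    copy of s to the end of the revised corpus."""
    revised = text.replace(s, marker) + s
    return description_length(text) - description_length(revised)
```

On a toy corpus such as "abababab", the repeated segment "ab" yields a positive gain, matching the intuition that good segments compress the corpus.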
Pointwise Mutual Information PMI measures the co-occurrence of two adjacent characters x′ and x″ by

    PMI(x′, x″) = log ( p(x′x″) / (p(x′) · p(x″)) )

where p is the probability of a given n-gram (i.e., x′, x″, and x′x″) in a dataset. For each dataset, we check all the character bi-grams in the corpus; a delimiter is inserted between two adjacent characters if their PMI score is below a threshold. The n-grams in the resulting segmented corpus form the lexicon N.
In our experiments, we set the threshold for all datasets to 0.
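The PMI-based segmentation step can be sketched as follows (our own sketch: probabilities are estimated with simple relative frequencies, and only the resulting segments themselves, not all their sub-n-grams, are collected into the lexicon):

```python
import math
from collections import Counter

def pmi_ngrams(sentences, threshold=0.0):
    """Insert a boundary between adjacent characters whose PMI falls
    below `threshold`, then return the set of resulting segments."""
    char_counts = Counter(c for s in sentences for c in s)
    pair_counts = Counter(s[i:i + 2] for s in sentences
                          for i in range(len(s) - 1))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    def pmi(a, b):
        # PMI(a, b) = log( p(ab) / (p(a) * p(b)) )
        p_ab = pair_counts[a + b] / n_pairs
        return math.log(p_ab / ((char_counts[a] / n_chars)
                                * (char_counts[b] / n_chars)))

    lexicon = set()
    for s in sentences:
        start = 0
        for i in range(len(s) - 1):
            if pmi(s[i], s[i + 1]) < threshold:
                lexicon.add(s[start:i + 1])  # close current segment
                start = i + 1
        lexicon.add(s[start:])  # last segment of the sentence
    return lexicon
```

Pairs that co-occur less often than chance get a negative PMI and are split apart, while strongly associated pairs stay inside one segment.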
To construct N, we perform the aforementioned unsupervised methods on the raw text of the training set and the development set combined for each dataset. Next, we filter out n-grams whose frequency is no more than a threshold. Finally, for all datasets we keep the n-grams whose lengths are within the range [1, 10]. Table 2 shows the sizes of the lexicons for the five datasets. In the result tables, "Freq." and "Len." refer to the n-gram categorization strategies based on n-gram frequency and n-gram length; "AV", "DLG", and "PMI" stand for the different ways to construct the lexicon N; "N/A" is the abbreviation for "not applicable".

Implementations
Since text representation plays an important role in model performance (Conneau et al., 2017; Song et al., 2017; Song et al., 2018), in our experiments we try two well-known Chinese text encoders as the backbone model: the Chinese version of pre-trained BERT (Devlin et al., 2019) and ZEN (Diao et al., 2019). For both BERT and ZEN, we follow their default settings (i.e., for both, 12 layers of multi-head attention on character encoding with the dimension of hidden vectors set to 768; for ZEN, 6 additional layers of n-gram representations). For models with the multi-channel attention module, we use two criteria to categorize the n-grams fed into different channels. The first is frequency, where n-grams whose counts in the dataset fall within the same range [c_k, c_k+1) are categorized into one group and are compared and weighted within the same channel of the attention module. In our experiments, we set c_k = 2^k for k ∈ [1, 10] and c_11 = +∞. The second criterion is n-gram length, where n-grams with the same n value are in the same group and fed into the same channel.
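The frequency-based channel assignment can be expressed as a small helper (a sketch under our reading of the thresholds: counts below c_1 = 2 fall outside every channel):

```python
def frequency_channel(count, k_max=10):
    """Map an n-gram's corpus count to an attention channel index.
    Channel k covers counts in [2**k, 2**(k+1)) for k = 1..k_max-1,
    and channel k_max covers [2**k_max, +inf), i.e. c_{k_max+1} = +inf.
    Counts below 2**1 fall outside every channel (returns None)."""
    if count < 2:
        return None
    for k in range(1, k_max):
        if count < 2 ** (k + 1):
            return k
    return k_max
```

For example, a count of 5 lands in channel 2 (range [4, 8)), and any count of 1024 or more lands in the last, unbounded channel.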
For other settings, we randomly initialize the n-gram embeddings used in the attention module, with their dimension matching the hidden vector size of the BERT/ZEN encoder (i.e., 768); we set the dropout rate to 0.2, the training batch size to 16, and the learning rate to 1e-5. We fine-tune all parameters in BERT and ZEN and use the negative log-likelihood loss function to optimize all models. For evaluation, we follow previous studies (Zheng et al., 2013; Kurita et al., 2017; Shao et al., 2017) to use the F scores of the segmentation and the joint label, where the latter is the main focus of this paper. We train all models on the training set, preserve the one achieving the highest joint F score on the development set, and finally evaluate it on the test set.

Table 4: F scores of segmentation and joint tagging on the test set of five datasets from previous studies, and our models and baselines (using BERT/ZEN encoder) with and without the multi-channel attentions.

Overall Performance
In the experiments, we test our model with the multi-channel attention module under different settings: two different encoders (i.e., BERT and ZEN), two strategies to categorize n-grams (one based on n-gram frequency, denoted by "Freq.", and the other based on n-gram length, denoted by "Len."), and three ways to construct the lexicon N. In addition, we also run baseline models without the attention module, as well as ones using normal (single-channel) attention (denoted by "Norm Att.") to model all n-grams. The results (the F scores of the segmentation and joint labels) of our models and the baseline models on the development sets are reported in Table 3. There are several observations. First, the multi-channel attention works well with different encoders (i.e., BERT and ZEN); it helps both segmentation and the joint task consistently on all datasets compared with the BERT/ZEN baseline without it. Second, the proposed multi-channel attention module can be applied with different n-gram categorization strategies (i.e., by n-gram frequency and by length). Notably, even though the performance of the BERT/ZEN model with normal attention is already rather good on the joint task, the multi-channel attention is still able to further boost its performance. This observation shows that grouping and modeling n-grams in different channels can better leverage the n-gram features than modeling them together in normal attention. Third, our model is robust with respect to different ways of constructing N, with similar results observed across AV, DLG, and PMI on all datasets. For example, on CTB7, the absolute differences of the F score between our models under different settings are no more than 0.06%.
Moreover, we also compare our experimental results with representative studies from the past decade on the test sets of the five benchmark datasets. The F scores of those studies and the ones from our models and baselines are reported in Table 4, where the lexicon N used for our models and baselines is constructed based on PMI. From the results, our model with BERT and multi-channel attentions outperforms all baselines and previous studies on all datasets with respect to the F score of the joint labels. In addition, when equipped with ZEN, our model can further outperform BERT-based models on the F scores of the joint CWS and POS tagging task. Compared with previous studies, where extra knowledge or resources are used, such as well-defined dictionaries (Wang et al., 2011; Zhang et al., 2014), syntactic features from manually crafted resources (Zhang and Clark, 2010), information of Chinese radicals (Shao et al., 2017), or large auto-processed data, our approach only leverages resources from the datasets themselves, which reduces the cost of training a joint CWS and POS tagger. Overall, the above results demonstrate that weighting n-grams separately is an appropriate approach to improving joint CWS and POS tagging without requiring extra knowledge.

Figure 3: The effect of different n-gram groups categorized by frequency and length on CTB5, where the n-gram lexicon N is constructed by PMI. The segmentation and joint tagging F scores of the models (using BERT encoder) with normal and multi-channel attentions are illustrated in (a) and (c), where N is constructed by including n-grams whose frequency is in range [2^i, +∞) (1≤i≤10) and whose length is in range [1, n] (1≤n≤10), respectively. The weights (i.e., δ_k in Eq. 4) assigned to n-gram groups categorized by frequency and length in the multi-channel attention module are shown in (b) and (d), respectively.

Effect of N-gram Categorization Methods
We analyze the effect of the different categorization methods, i.e., n-gram frequency and n-gram length, on the joint task. For the frequency-based method, we try different frequency thresholds from 2^1 to 2^10, where n-grams whose frequency in the dataset is less than 2^i (1 ≤ i ≤ 10) are ignored in the attention module; for the length-based method, we try numbers from 1 to 10, where n-grams with lengths from 1 to n (n ≤ 10) are considered. We run experiments with our models and normal attention using the BERT encoder under these settings on CTB5 with the lexicon N constructed by PMI; the curves (F scores) of the models with the two categorization methods are reported in Figure 3(a) and 3(c), respectively.
For the frequency-based categorization method, the performance of the models with normal and multi-channel attentions drops when the frequency threshold increases (see Figure 3(a)). Yet, our model shows a smaller drop than the normal attention model, indicating that multi-channel attention can enhance the attention learned on the same data. We also find that the curves tend to stabilize when the frequency threshold reaches 128. This could be explained by the fact that, although frequent n-grams may provide useful information for the task, this information can also be learned by the backbone, leading to a relatively small improvement. In addition, we compare the weights assigned to each n-gram group in Figure 3(b), from which we find that n-grams with frequency in [4, 8) receive a relatively high weight. One possible reason could be that these n-grams provide important cues for the joint task that are hard for the backbone model to learn, because these n-grams do not appear frequently. Therefore, by categorizing n-grams based on their frequency, the model can highlight important n-grams that are infrequent, and thus is not dominated by frequently appearing n-grams.

For the length-based categorization method, the performance of both models with normal and multi-channel attentions tends to improve with a higher n-gram length threshold. In this case, our model shows a bigger improvement over the normal attention model, and the curves tend to stabilize when the X-axis reaches 6 (see Figure 3(c)). A possible reason could be that the number of new n-grams being leveraged decreases as the n-gram length threshold rises, so that their influence on the overall performance is hard to observe. Similarly, we also compare the weights assigned to n-grams grouped by their length and illustrate them in Figure 3(d). In the histogram, we find that bi-grams receive the highest weight over all n-gram groups, which could be attributed to the fact that most words in Chinese contain two characters. On the contrary, n-grams with more characters tend to have less influence on the task, because it is uncommon to see very long words in Chinese.

N-gram Analyses
To explore how our model leverages n-grams with different properties, we apply our model (with the BERT encoder) trained on CTB5 with the PMI lexicon construction method to the whole data of CTB5, and for each n-gram we sum its assigned attention weights. N-gram examples categorized by frequency and length are shown in Table 5 and Table 6, respectively, where n-grams and their assigned attentions are presented in decreasing order. For the n-gram examples in both tables, we find that almost all n-grams with high weights are normal Chinese words or phrases. Because the lexicon is constructed by an unsupervised method, many n-grams in it are not well-formed words; this observation thus indicates that our model in both settings can distinguish important n-grams and assign them high weights. In addition, for each frequency group in Table 5, there are cases where long n-grams receive a higher weight than short n-grams; e.g., among n-grams in group [2^5, 2^6), the weight assigned to the seven-character n-gram "比去年同期增长" (increase from the same period last year) is 2.8 × 10^−3, while the weight assigned to the four-character n-gram "中国政府" (Chinese government), which is more frequent than the seven-character n-gram in CTB5, is 2.5 × 10^−3. Similarly, among the n-gram examples in Table 6, top-ranked long n-grams tend to have higher weights than top-ranked short ones. This observation shows that our model with multi-channel attention can appropriately model important long n-grams, i.e., assigning high weights to long n-grams that are common words or phrases, even though they may be infrequent in the dataset.

Related Work
There are basically two approaches to CWS and POS tagging: performing the two tasks in a pipeline framework, or treating them as a joint task where the two tasks are conducted simultaneously, which is known as joint CWS and POS tagging. Ng and Low (2004) provided a comprehensive study comparing the two approaches and found that the joint approach outperforms the pipeline one. Therefore, in the past two decades, the majority of studies on CWS and POS tagging applied the joint approach (Ng and Low, 2004; Jiang et al., 2008; Jiang et al., 2009; Wang et al., 2011; Sun, 2011; Zeng et al., 2013), where n-grams are widely used as features carrying contextual information to improve model performance. Recently, neural methods, especially recurrent neural networks (e.g., bi-LSTM), have demonstrated their effectiveness in encoding contextual information, and thus significantly improve model performance in joint CWS and POS tagging. Even so, improvements can still be obtained when n-grams are incorporated into the neural taggers (Zheng et al., 2013; Kurita et al., 2017; Shao et al., 2017; Tian et al., 2020a). For example, Kurita et al. (2017) used a stacked bi-LSTM model to incorporate n-grams and achieved state-of-the-art results on CTB5 and CTB7. In addition to n-grams, approaches leveraging external resources have also been used to improve joint CWS and POS tagging: Shao et al. (2017) leveraged n-grams and radical information of Chinese characters to enhance model performance; other work pre-trained character embeddings on large data where both segmentation and POS labels are auto-tagged by an existing model. Compared to these studies, our model provides a way to leverage n-grams through a multi-channel attention mechanism, where n-grams are categorized by their frequencies or lengths, and the n-grams in the same category are compared and weighted together.
Therefore, the n-grams that contribute more to the joint task in a specific context are highlighted and the model will not be dominated by the frequent or short n-grams (short n-grams also tend to be frequent) because their parameters are intensively updated during training.

Conclusion
In this paper, we propose a neural character-based tagger for joint CWS and POS tagging, where a multi-channel attention mechanism is used to leverage contextual information carried by n-grams. In detail, for the multi-channel attention, we categorize n-grams according to a specific metric, such as their frequency or length, and model the n-grams separately in each attention channel. In doing so, n-grams with different properties (frequencies or lengths) can be weighted and distinguished separately in a specific context, so that they are treated appropriately for the joint CWS and POS tagging task. Experimental results on five Chinese benchmark datasets show that our approach with the multi-channel attention works well with n-grams extracted by different methods and provides consistent improvements over strong baseline taggers (i.e., BERT and ZEN) that do not use it. In particular, our model with ZEN achieves state-of-the-art performance for joint CWS and POS tagging on all datasets.