Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are important fundamental tasks for Chinese language processing, and joint learning of the two is an effective one-step solution for both. Previous studies on joint CWS and POS tagging mainly follow the character-based tagging paradigm and introduce contextual information such as n-gram features or sentential representations from recurrent neural models. However, in many cases the joint tagging needs to model not only context features but also the knowledge attached to them (e.g., syntactic relations among words), and limited effort has been made by existing research to meet such needs. In this paper, we propose a neural model named TwASP for joint CWS and POS tagging following the character-based sequence labeling paradigm, where a two-way attention mechanism incorporates both the context features and their corresponding syntactic knowledge for each input character. In particular, we use existing language processing toolkits to obtain the auto-analyzed syntactic knowledge for the context, and the proposed attention module can learn and benefit from such knowledge even though its quality may be imperfect. Our experiments illustrate the effectiveness of the two-way attentions for joint CWS and POS tagging, where state-of-the-art performance is achieved on five benchmark datasets.


Introduction
Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental and crucial tasks in natural language processing (NLP) for Chinese. The former aims to find word boundaries in a sentence; the latter, on top of the segmentation results, assigns a POS tag to each word to indicate its syntactic property in the sentence. Combining CWS and POS tagging into a joint task has been shown to perform better than conducting the two tasks separately in a pipeline (Ng and Low, 2004). Therefore, many studies were proposed in the past decade for joint CWS and POS tagging (Jiang et al., 2008, 2009; Sun, 2011; Zeng et al., 2013; Zheng et al., 2013; Kurita et al., 2017; Shao et al., 2017). These studies, whether based on conventional approaches (Jiang et al., 2008, 2009; Sun, 2011; Zeng et al., 2013) or deep learning (Zheng et al., 2013; Kurita et al., 2017; Shao et al., 2017), focused on incorporating contextual information into their joint tagger. In addition, it is well known that syntactic structure is able to capture and provide information on long-distance dependencies among words. For example, Figure 1 shows an example of local ambiguity, where the green highlighted part has two possible interpretations: "报告 VV/书 NN" (report a book) and "报告书 NN" (the report). The ambiguity can be resolved with syntactic analysis; for instance, the dependency structure, if available, would prefer the first interpretation. While the subject and the object of the sentence (highlighted in yellow) are far away from the ambiguous part in the surface word order, they are much closer in the dependency structure (the subject depends on "报告 VV" and "书 NN" depends on the object). This example shows that syntactic structure provides useful cues for CWS and POS tagging.

Figure 2: The architecture of TwASP for joint CWS and POS tagging with the two-way attention mechanism, presented with example context features and their dependency knowledge (highlighted in yellow) from auto-analyzed results for a character (i.e., "分" (split), highlighted in green) in the given sentence.
Syntactic knowledge can be obtained from manually constructed resources such as treebanks and grammars, but such resources require considerable effort to create and might not be available for a particular language or domain. A more practical alternative is to use syntactic structures automatically generated by off-the-shelf toolkits. Some previous studies (Huang et al., 2007; Jiang et al., 2009; Wang et al., 2011) verified this idea for the task by learning from auto-processed corpora. However, these studies treat the auto-processed corpora as gold references and thus are unable to use them discriminatively according to their quality (the resulting knowledge is inaccurate in many cases). Therefore, the way to effectively leverage such auto-generated knowledge for joint CWS and POS tagging is not fully explored.
In this paper, we propose a neural model named TwASP with a two-way attention mechanism to improve joint CWS and POS tagging by learning from auto-analyzed syntactic knowledge, which is generated by existing NLP toolkits and provides necessary (although imperfect) information for the task. In detail, for each input character, the proposed attention module extracts the context features associated with the character and their corresponding knowledge instances according to the auto-analyzed results, then computes attentions separately for features and knowledge in each attention way, and finally concatenates the attentions from the two ways to guide the tagging process. In doing so, our model can identify the important auto-analyzed knowledge based on its contribution to the task and thus avoid being misled by inferior knowledge instances. Compared to another prevailing model, the key-value memory network (Miller et al., 2016), which can also learn from pair-wisely organized information, the two-way attentions fully leverage both the features and their knowledge rather than using one only to weight the other. We experiment with three types of knowledge, namely POS labels, syntactic constituents, and dependency relations. The experimental results on five benchmark datasets illustrate the effectiveness of our model, which achieves state-of-the-art performance for the joint task on all datasets. We also perform several analyses, which confirm the validity of using two-way attentions and demonstrate that our model can be further improved by using multiple types of knowledge simultaneously.

The Model
The architecture of TwASP is illustrated in Figure 2. The left part shows the backbone of the model for joint CWS and POS tagging following the character-based sequence labeling paradigm, where the input is a character sequence X = x_1 x_2 ... x_i ... x_l and the output is a sequence of joint labels Y = y_1 y_2 ... y_i ... y_l. To enhance the backbone, the proposed two-way attention module (shown in the right part of Figure 2) takes the syntactic knowledge produced from the input sentence, analyzes it, and feeds it to the tagging process. In this section, we first introduce the auto-analyzed knowledge, then explain how the two-way attentions consume such knowledge, and finally describe how the joint CWS and POS tagging works with the resulting attentions.

Auto-analyzed Knowledge
Auto-analyzed knowledge has been demonstrated to be an effective type of resource for helping NLP systems understand texts (Song et al., 2017; Seyler et al., 2018; Huang and Carley, 2019). One challenge in leveraging external knowledge for the joint task is that gold-standard annotations, especially syntactic annotations, are extremely rare for text in most domains. An alternative solution is to use off-the-shelf NLP systems to produce such knowledge, which has proved useful in previous studies (Huang et al., 2007; Jiang et al., 2009; Wang et al., 2011). Rather than processing an entire corpus and then extracting features or training embeddings from the resulting corpus as in previous studies, our model does not treat the knowledge as gold reference: it generates auto-analyzed knowledge for each sentence and learns the weights of the corresponding features. Formally, for a character sequence X, let S and K denote the lists of context features and knowledge for X, respectively. For each character x_i in X, let S_i = [s_{i,1}, s_{i,2}, ..., s_{i,j}, ..., s_{i,m_i}] and K_i = [k_{i,1}, k_{i,2}, ..., k_{i,j}, ..., k_{i,m_i}] be the sublists of S and K for x_i, where s_{i,j} and k_{i,j} denote a context feature and a knowledge instance, respectively. In this paper, we use three types of syntactic knowledge for the joint task, namely POS labels, syntactic constituents, and dependency relations: POS labels indicate the syntactic information of individual words, syntactic constituents provide the structural grouping information for a text span, and dependencies offer dependency relations between words. Figure 3 shows an example sentence and the corresponding S and K. For the character "分" (highlighted in green), its S_i and K_i are highlighted in yellow. In order to distinguish the same knowledge appearing with different context features, we use a feature-knowledge combination tag to represent each knowledge instance (e.g., "分子 NN", "分子 NP", and "分子 dobj" in Figure 3).
We explain each type of knowledge below.

POS Labels As shown in Figure 3 (a), for each x_i (e.g., x_6 = "分"), we use a 2-word window on both sides to extract context features from S to form S_i (i.e., S_6 = ["分子", "结合", "成", "时"]), and then get their corresponding knowledge instances of POS labels from K to form K_i (i.e., K_6 = ["分子 NN", "结合 VV", "成 VV", "时 LC"]).

Syntactic Constituents As shown in Figure 3 (b), the rule for extracting syntactic constituency knowledge is as follows. We start with the word containing the given character x_i, go up the constituency tree to the first ancestor whose label is in a pre-defined syntactic label list, then use all the words under this node to select context features from S, and finally combine those words with the syntactic label of the node to select knowledge instances from K. For example, for x_6 = "分", the lowest syntactic node governing "分子" is NP (highlighted in yellow); thus S_6 = ["分子"] and K_6 = ["分子 NP"]. Another example is x_5 = "成": the lowest acceptable node on its syntactic path is VP; therefore, S_5 = ["结合", "成", "分子"] and K_5 = ["结合 VP", "成 VP", "分子 VP"].

Dependency Relations Given a character x_i, let w_i be the word that contains x_i. The context features S_i include w_i, w_i's governor, and w_i's dependents in the dependency structure; those words, combined with their inbound dependency relation labels, form K_i. For example, for x_6 = "分", w_6 = "分子", which depends on "结合" with the dependency label dobj. Therefore, S_6 = ["分子", "结合"] and K_6 = ["分子 dobj", "结合 root"].
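To make the extraction rules concrete, the POS-label case can be sketched as below. This is our own simplified illustration, not a released implementation: the function name, the flat (words, pos_tags) input format, and the left-to-right ordering of the window are assumptions, and the toolkit output is taken as already segmented and tagged.

```python
def extract_pos_context(words, pos_tags, char_index, window=2):
    """Return (S_i, K_i) for the character at position char_index.

    words    : toolkit-segmented words covering the sentence
    pos_tags : auto-analyzed POS label of each word, aligned with words
    """
    # Locate the word that contains the character.
    start, word_idx = 0, None
    for j, w in enumerate(words):
        if start <= char_index < start + len(w):
            word_idx = j
            break
        start += len(w)
    if word_idx is None:
        return [], []
    # Take a 2-word window on both sides of the containing word,
    # truncated at the sentence boundaries.
    lo = max(0, word_idx - window)
    hi = min(len(words), word_idx + window + 1)
    s_i = words[lo:hi]
    # Feature-knowledge combination tags, e.g. "分子 NN".
    k_i = [f"{w} {t}" for w, t in zip(words[lo:hi], pos_tags[lo:hi])]
    return s_i, k_i
```

For the running example, the character "分" inside "分子" yields its containing word plus the neighbors within the window, each paired with its auto-analyzed POS label.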

Two-Way Attentions
Attention has been shown to be an effective method for incorporating knowledge into NLP systems (Kumar et al., 2018; Margatina et al., 2019), but it cannot be directly applied to features and knowledge organized in pair-wise form. Previous studies on the joint task normally concatenate the embeddings of context features and knowledge instances directly into the character embeddings, which can be problematic for incorporating auto-analyzed, error-prone syntactic knowledge obtained from off-the-shelf toolkits.
To handle both the features and their knowledge instances for X, we use a two-way attention design with separate attentions for S and K. The two ways, namely the feature way and the knowledge way, are identical in architecture, each with a feed-forward attention module (Raffel and Ellis, 2015). For each x_i, its S_i and K_i are fed into the feature attention way and the knowledge attention way, respectively, the attention is computed within each way, and the resulting attention vectors are combined and fed back to the backbone model.
Take the feature way as an example. The attention weight p_{i,j} for each context feature s_{i,j} is computed by

p_{i,j} = exp(h_i · e^s_{i,j}) / Σ_{j'=1}^{m_i} exp(h_i · e^s_{i,j'})

where h_i is the vector from a text encoder for x_i and e^s_{i,j} is the embedding of s_{i,j}. Then we obtain the weighted embedding a^s_i over all s_{i,j} in S_i via

a^s_i = Σ_{j=1}^{m_i} p_{i,j} e^s_{i,j}

where Σ denotes an element-wise sum over the weighted embeddings. For the knowledge way, the same process is applied to obtain a^k_i by distinguishing and weighting each knowledge instance k_{i,j}. Finally, the output of the two attention ways is obtained through the concatenation of the two vectors: a_i = a^s_i ⊕ a^k_i.
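A minimal numpy sketch of one attention way and of the two-way combination, under the formulation above: scores are dot products between the encoder vector h_i and each embedding, normalized with a softmax; the function names and the embedding dimension are our own choices.

```python
import numpy as np

def attention_way(h_i, embeddings):
    """Weight a list of feature (or knowledge) embeddings by h_i."""
    E = np.stack(embeddings)              # (m_i, d)
    scores = E @ h_i                      # dot product with the encoder vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the m_i items
    return weights @ E                    # weighted element-wise sum -> (d,)

def two_way_attention(h_i, feature_embs, knowledge_embs):
    a_s = attention_way(h_i, feature_embs)    # feature way
    a_k = attention_way(h_i, knowledge_embs)  # knowledge way
    return np.concatenate([a_s, a_k])         # a_i = a^s_i ⊕ a^k_i, (2d,)
```

Because each way runs its own softmax, a feature and its paired knowledge instance can receive different weights, which is the point of separating the two ways.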

Joint Tagging with Two-way Attentions
To functionalize the joint tagging, the two-way attentions interact with the backbone model through the encoded vector h_i and the attention output a_i for each x_i. For h_i, one can apply many prevailing encoders, e.g., Bi-LSTM or BERT (Devlin et al., 2019). Once a_i is obtained, we concatenate it with h_i and send the result through a fully connected layer to align the dimension of the output for final prediction:

o_i = W (h_i ⊕ a_i) + b

where W and b are trainable parameters. Afterwards, a conditional random field (CRF) layer is used to estimate the probability of y_i over all possible joint CWS and POS tags given x_i and y_{i-1}:

p(y_i | x_i, y_{i-1}) = exp(W_c^{(y_{i-1}, y_i)} o_i + b_c^{(y_{i-1}, y_i)}) / Σ_{y'} exp(W_c^{(y_{i-1}, y')} o_i + b_c^{(y_{i-1}, y')})

Here, W_c and b_c are the weight matrix and the bias vector, respectively, and they are estimated using the (y_{i-1}, y_i) tag pairs in the gold standard.
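The prediction head can be sketched as follows. This is a per-step illustration under our reading of the equations above: W_c and b_c are indexed by the previous tag, and a full CRF would additionally normalize over whole label sequences and decode with Viterbi rather than independently per character.

```python
import numpy as np

def tag_probs(h_i, a_i, W, b, W_c, b_c, prev_tag):
    """p(y_i | x_i, y_{i-1}) over all joint CWS-POS tags.

    W   : (d_o, d_h + d_a) projection, b: (d_o,) bias
    W_c : (T, T, d_o) pair-indexed weights, b_c: (T, T) pair-indexed biases
    """
    v = np.concatenate([h_i, a_i])        # h_i ⊕ a_i
    o_i = W @ v + b                       # align dimensions for prediction
    # Unnormalized score for each candidate tag y given y_{i-1}.
    logits = W_c[prev_tag] @ o_i + b_c[prev_tag]
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With all-zero CRF parameters the distribution is uniform over the T tags, which gives a simple sanity check on the shapes.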

Datasets
We employ five benchmark datasets in our experiments. Four of them, namely CTB5, CTB6, CTB7, and CTB9, are from the Penn Chinese TreeBank (Xue et al., 2005), and the fifth is the Chinese part of Universal Dependencies (UD) (Nivre et al., 2016), of which we use version 2.4, downloaded from https://universaldependencies.org/. The CTB datasets are in simplified Chinese characters while the UD dataset is in traditional Chinese. Following Shao et al. (2017), we convert the UD dataset into simplified Chinese before conducting experiments on it. CTB uses 33 POS tags, and we split CTB5-CTB9 following previous studies (Wang et al., 2011; Jiang et al., 2008; Shao et al., 2017). In addition, because the data in CTB9 come from eight genres, i.e., broadcast conversation (BC), broadcast news (BN), conversational speech (CS), discussion forums (DF), magazine articles (MZ), newswire (NW), SMS/chat messages (SC), and weblog (WB), we also use CTB9 in a cross-domain study (see Section 3.4). UD uses two POS tagsets, namely the universal tagset (15 tags) and a language-specific tagset (42 tags for Chinese). We refer to the corpus with the two tagsets as UD1 and UD2, respectively, and use the official train/dev/test splits in our experiments. The statistics for the aforementioned datasets are in Table 1.

Implementation
To obtain the aforementioned three types of knowledge, we use two off-the-shelf toolkits, the Stanford CoreNLP Toolkit (SCT) (Manning et al., 2014) and the Berkeley Neural Parser (BNP) (Kitaev and Klein, 2018): the former tokenizes and parses a Chinese sentence, producing the POS tags, phrase structure, and dependency structure of the sentence; the latter performs POS tagging and syntactic parsing on a pre-tokenized sentence. Both toolkits were trained on CTB data and thus produce CTB POS tags. To extract knowledge, we first use SCT to automatically segment sentences and then run both SCT and BNP for POS tagging and parsing. Table 2 shows the size of S and K for all the datasets. We test the model with three encoders: two of them, Bi-LSTM and BERT (Devlin et al., 2019), are widely used; the third is ZEN (Diao et al., 2019), a recently released Chinese encoder pre-trained with n-gram information that outperforms BERT on many downstream tasks. For the Bi-LSTM encoder, we set its hidden state size to 200 and use the character embeddings released by Shao et al. (2017) to initialize its input representations. For BERT and ZEN, we follow their default settings, e.g., 12 layers of self-attention with a hidden dimension of 768.
For the two-way attention module, we randomly initialize the embeddings for all context features and their corresponding knowledge instances, although one could also use pre-trained embeddings (Song et al., 2018; Grave et al., 2018; Yamada et al., 2020) for them. We use SCT version 3.9.2, downloaded from https://stanfordnlp.github.io/CoreNLP/, and ZEN from https://github.com/sinovation/ZEN. For all the models, we set the maximum character length of the input sequence to 300 and use the negative log-likelihood loss function. Other hyper-parameters of the models are tuned on the dev set, and the tuned models are evaluated on the test set of each dataset (each genre for CTB9). F-scores for word segmentation and for the joint CWS-POS tags are used as the main evaluation metrics in all experiments, computed with the evaluation script from https://github.com/chakki-works/seqeval.

Table 3: Experimental results (F-scores for segmentation and joint tagging) of TwASP using different encoders with and without auto-analyzed knowledge on the five benchmark datasets. "Syn." and "Dep." refer to syntactic constituents and dependency relations, respectively. The results of SCT and BNP are also reported as references, where * marks that the segmentation and POS tagging criteria of the toolkits differ from those of the UD dataset.

Overall Performance
In our main experiment, we run TwASP on the five benchmark datasets using the three encoders, i.e., Bi-LSTM, BERT, and ZEN. The F-scores of word segmentation and joint CWS and POS tagging are in Table 3, which also includes the performance of the baselines without attention and of the two toolkits (i.e., SCT and BNP). The results of SCT and BNP on the UD dataset are poor because the toolkits were trained on CTB, which uses different segmentation and POS tagging criteria.
There are several observations. First, for all encoders, the two-way attentions provide consistent improvements over the baselines with different types of knowledge. In particular, although the baseline model already performs well when BERT (or ZEN) serves as the encoder, the attention module is still able to further improve its performance with the knowledge produced by the toolkits, even though the toolkits themselves have worse-than-baseline results on the joint task. Second, among the different types of knowledge, POS labels are the most effective for the joint task. For instance, among BERT-based models, the one enhanced by POS knowledge from SCT achieves the best performance on most datasets, which is not surprising because such knowledge matches the outcome of the task. In addition, for BERT-based models enhanced by knowledge from BNP (i.e., BERT + POS (BNP) and BERT + Syn. (BNP)), syntactic constituents provide more improvement than POS labels on all CTB datasets. This observation can be explained by the fact that BNP is originally designed for constituency parsing with CTB criteria; the syntactic constituents are complicated but effective when accurate. Third, while SCT and BNP were trained on CTB, whose tagset is very different from the two tagsets of UD, TwASP still outperforms the baselines on UD with the knowledge provided by SCT and BNP, indicating that syntactic knowledge is useful even when it follows different word segmentation and POS tagging criteria. Table 4 shows the results of our best models (i.e., BERT and ZEN with POS (SCT)) and of previous studies on the same datasets. Our approach outperforms previous studies on the joint task and achieves new state-of-the-art performance on all datasets.
While some previous studies use auto-analyzed knowledge (Wang et al., 2011), they regard such knowledge as gold reference and consequently can suffer from errors in the auto-analyzed results. In contrast, our proposed model is able to selectively model the input information and to discriminate useful knowledge instances through the two-way attentions.

Cross-Domain Performance
Domain variance is an important factor affecting the performance of NLP systems (Guo et al., 2009; McClosky et al., 2010; Song and Xia, 2013). To further demonstrate the effectiveness of TwASP, we conduct cross-domain experiments on the eight genres of CTB9, using BERT and ZEN as the baselines and their versions enhanced with POS knowledge from SCT. In doing so, we test on each genre with the models trained on the data from all other genres. The results for both segmentation and the joint task are reported in Table 5, where the SCT results are also included as a reference. The comparison between the baselines and TwASP with POS knowledge clearly shows the consistency of the performance improvement from two-way attentions: for both BERT and ZEN, TwASP outperforms the baselines on the joint labels for all genres. In addition, similar to the observations from the previous experiment, both accurate and inaccurate POS knowledge are able to help the joint task. For example, although the SCT results on several genres (e.g., CS, DF, SC) are much worse than those of the BERT baseline, the POS labels produced by SCT can still enhance TwASP on word segmentation and joint tagging through the proposed two-way attention module.

The Effect of Two Attention Ways
In the first analysis, we compare our two-way attention with normal attention. For normal attention, we experiment with three ways of incorporating context features and knowledge: (1) using context features and knowledge together in the attention, where all features and knowledge instances are treated equally; (2) using context features only; and (3) using knowledge only. We run these experiments with the BERT encoder and POS knowledge from SCT on CTB5 and report the results in Table 6. Overall, the two-way attentions outperform all three settings for normal attention, which clearly indicates the validity of using two attention ways for features and knowledge (compared to (1)), as well as the advantage of learning from both of them (compared to (2) and (3)). Interestingly, among the three settings, (3) outperforms (1), which may be explained by the fact that, with normal attention, mixing feature and knowledge instances makes it difficult to weight them appropriately for the joint task.
There are other methods for using both context features and knowledge in a neural framework, such as key-value memory networks (kvMN) (Miller et al., 2016), which were demonstrated to improve CWS by Tian et al. (2020). We therefore compare our approach with a kvMN in which context features are mapped to keys and knowledge to values. We follow the standard protocol of the kvMN: addressing keys by S_i and reading values from K_i through the corresponding knowledge for each key, computing weights from all key embeddings, and outputting the weighted embeddings of all values. The result of the kvMN is reported in the last row of Table 6; its performance is not as good as that of the two-way attentions, and is even worse than normal attention with setting (3). The reason could be straightforward: the output of a kvMN is built upon the value (knowledge) embeddings, so information from the key (context feature) embeddings does not directly contribute to it other than by providing weights for the values. As a result, the kvMN acts in a similar yet inferior way to setting (3), where only knowledge is used; it is inferior because, in the kvMN, the value weights are computed with respect to the contribution of the keys rather than of the knowledge instances themselves.
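For contrast, the kvMN read operation described above can be sketched as below (function name is ours). Note that the keys influence the output only through the weights, which is exactly the limitation discussed in this comparison.

```python
import numpy as np

def kvmn_read(h_i, key_embs, value_embs):
    """Key-value memory read: address by keys, output over values only."""
    K = np.stack(key_embs)                # (m_i, d) context-feature keys
    V = np.stack(value_embs)              # (m_i, d) knowledge values
    scores = K @ h_i                      # address memory by the keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over memory slots
    return w @ V                          # weighted sum over values only
```

Unlike the two-way attentions, the returned vector contains no direct contribution from the key embeddings themselves.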

Knowledge Ensemble
Since every type of knowledge works well in our model, it is natural to investigate how the model performs when multiple types of knowledge are used together. To this end, we run experiments on CTB5 to test our BERT-based TwASP with knowledge ensemble, where two ensemble strategies, i.e., averaging and concatenation, are applied with respect to how the a_i for each knowledge type is combined with the others. The results are reported in Table 7.

Table 7: Comparison of different knowledge ensemble results, presented as the joint tagging F-scores from our BERT-based TwASP on CTB5. The two ensemble strategies are averaging and concatenation of the attentions from different knowledge types. As a reference, the best result on CTB5 for a BERT-based model without knowledge ensemble is 96.77%, achieved by BERT + POS (SCT) (see Table 3).

In this table, the first seven rows (ID: 1-7) indicate that different types of knowledge are combined according to whether they come from the same toolkit (ID: 1-5) or belong to the same category (ID: 6 and 7); the last row (ID: 8) covers the case where all types of knowledge are combined.
There are several observations. First, compared to using only one type of knowledge (cf. Table 3), knowledge ensemble improves model performance, with more knowledge types contributing to better results. The best model is thus obtained when all knowledge (from each toolkit and from both toolkits) is used. Second, knowledge of the same type from different toolkits may complement each other and thus enhance model performance, which is confirmed by the results of the models ensembling POS (or Syn+Dep) information from both SCT and BNP. Third, between the two ensemble strategies, concatenation tends to perform better than averaging, which is not surprising since concatenation effectively turns the model into a multi-way structure for knowledge integration.
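The two ensemble strategies amount to the following operations on the attention vectors a_i produced for the different knowledge types (function names are ours): averaging preserves the vector dimension, while concatenation grows it with each added knowledge type.

```python
import numpy as np

def ensemble_average(attn_vectors):
    """Average the attention vectors of several knowledge types."""
    return np.mean(np.stack(attn_vectors), axis=0)   # keeps dimension d

def ensemble_concat(attn_vectors):
    """Concatenate them instead, yielding an n*d-dimensional vector."""
    return np.concatenate(attn_vectors)
```

The dimensional growth under concatenation is what gives the downstream layer a separate slot per knowledge type, consistent with the multi-way interpretation above.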

Case Study
When a toolkit provides accurate knowledge, it is not surprising that our two-way attention model benefits from it. Interestingly, even when the toolkit provides inaccurate output, our model may still be able to benefit from it. Figure 4 shows such an example, where our system is BERT + Dep with SCT knowledge and the baseline is BERT without two-way attention. The sentence contains an ambiguous character bigram, "马上", which has two possible interpretations: "马上 AD" (immediately) and "马 NN/上 LC" (on the horse). The second one is correct, yet the baseline tagger chooses the former because "马上" (immediately) is a very common adverb. Although SCT also chooses the wrong segmentation and thus produces an incorrect dependency structure, our system is still able to produce the correct segmentation and POS tags. One plausible explanation is that the inaccurate dependency structure includes an advmod link between "马上" (immediately) and "很好" (very good). Because such a dependency pair seldom appears in the corpus, the attention from this knowledge is weak and hence encourages our system to choose the correct word segmentation and POS tags.

Related Work
There are basically two approaches to CWS and POS tagging: performing POS tagging right after word segmentation in a pipeline, or conducting the two tasks simultaneously, known as joint CWS and POS tagging. In the past two decades, many studies have shown that joint tagging outperforms the pipeline approach (Ng and Low, 2004; Jiang et al., 2008, 2009; Wang et al., 2011; Sun, 2011; Zeng et al., 2013). In recent years, neural methods have started to play a dominant role in this task (Zheng et al., 2013; Kurita et al., 2017; Shao et al., 2017), and some of them have tried to incorporate extra knowledge. For example, Kurita et al. (2017) explored modeling n-grams to improve the task; Shao et al. (2017) extended the idea by incorporating pre-trained n-gram embeddings, as well as radical embeddings, into character representations. Other work tried to leverage the knowledge in character embeddings trained on a corpus automatically tagged by a baseline tagger. Compared to these previous studies, TwASP provides a simple yet effective neural model for joint tagging, without requiring a complicated mechanism for incorporating different features or pre-processing a corpus.

Conclusion
In this paper, we propose a neural approach with a two-way attention mechanism to incorporate auto-analyzed knowledge for joint CWS and POS tagging, following the character-based sequence labeling paradigm. The proposed attention module learns and weights context features and their corresponding knowledge instances in two separate ways, and uses the combined attentions from the two ways to enhance the joint tagging. Experimental results on five benchmark datasets illustrate the validity and effectiveness of our model: the two-way attentions can be integrated with different encoders and provide consistent improvements over baseline taggers, and our model achieves state-of-the-art performance on all the datasets. Overall, this work presents an elegant way to use auto-analyzed knowledge to enhance neural models with existing NLP tools. For future work, we plan to apply the same methodology to other NLP tasks.