Improving Chinese Word Segmentation with Wordhood Memory Networks

Contextual features always play an important role in Chinese word segmentation (CWS). Wordhood information, as one of the contextual features, has proved useful in many conventional character-based segmenters. However, this feature receives less attention in recent neural models, and it is also challenging to design a framework that can properly integrate wordhood information from different wordhood measures into existing neural frameworks. In this paper, we therefore propose a neural framework, WMSEG, which uses memory networks to incorporate wordhood information with several popular encoder-decoder combinations for CWS. Experimental results on five benchmark datasets indicate that the memory mechanism successfully models wordhood information for neural segmenters and helps WMSEG achieve state-of-the-art performance on all those datasets. Further experiments and analyses also demonstrate the robustness of our proposed framework with respect to different wordhood measures and the effectiveness of wordhood information in cross-domain experiments.


Introduction
Unlike most written languages in the world, Chinese does not use explicit delimiters (e.g., white space) to separate words in written text. Therefore, Chinese word segmentation (CWS) conventionally serves as the first step in Chinese language processing, especially for many downstream tasks such as text classification (Zeng et al., 2018), question answering (Liu et al., 2018), and machine translation (Yang et al., 2018).
In the past two decades, the mainstream methodology has treated CWS as a character-based sequence labeling task (Tseng et al., 2005; Song et al., 2006; Sun and Xu, 2011; Pei et al., 2014; Chen et al., 2015; Zhang et al., 2016; Ma et al., 2018; Higashiyama et al., 2019), where various studies were proposed to effectively extract contextual features to help better predict segmentation labels for each character (Zhang et al., 2013; Zhou et al., 2017; Higashiyama et al., 2019). Among all the contextual features, the ones measuring the wordhood of n-grams have proved helpful in many non-neural CWS models (Sun et al., 1998; Xue and Shen, 2003). Later, following the track of the sequence labeling methodology, recent approaches with neural networks proved to be powerful for this task (Chen et al., 2015; Ma et al., 2018; Higashiyama et al., 2019). However, since neural networks (e.g., LSTM) are considered to provide a good model of contextual dependencies, less attention is paid to explicitly leveraging the wordhood information of n-grams in the context, as had previously been done in non-neural models. Although some studies sidestepped the idea by incorporating contextual n-grams (Pei et al., 2014; Zhou et al., 2017) or word attention (Higashiyama et al., 2019) into the sequence labeling process, they are limited to either concatenating word and character embeddings or requiring a well-defined word lexicon. Therefore, it has not been fully explored what would be the best way of representing contextual information such as wordhood features in neural CWS models. Moreover, considering that there are various choices of wordhood measures, it is also a challenge to design a framework that can incorporate different wordhood features so that the entire CWS approach can be general while being effective in accommodating the input from any measure.
Figure 1: The architecture of WMSEG. "N" denotes a lexicon constructed by wordhood measures. N-grams (keys) appearing in the input sentence "部分居民生活水平" (some residents' living standard) and the wordhood information (values) of those n-grams are extracted from the lexicon. Then, together with the output from the text encoder, the n-grams (keys) and their wordhood information (values) are fed into the memory module, whose output passes through a decoder to produce the final predictions of segmentation labels for every character in the input sentence.

In this paper, we propose WMSEG, a neural framework with a memory mechanism, to improve
CWS by leveraging wordhood information. In detail, we utilize key-value memory networks (Miller et al., 2016) to incorporate character n-grams with their wordhood measurements in a general sequence labeling paradigm, where the memory module can be combined with different prevailing encoders (e.g., BiLSTM and BERT) and decoders (e.g., softmax and CRF). For the memory, we map n-grams and their wordhood information to keys and values, respectively, and one can use different wordhood measures to generate such information. Then, for each input character, the memory module addresses all the n-grams in the key list that contain the character and uses their corresponding values to generate an output vector that enhances the decoder in assigning a segmentation label to the character. Experimental results on five widely used benchmark datasets confirm that WMSEG with wordhood information can improve CWS over powerful baseline segmenters and outperform previous studies, with state-of-the-art performance observed on all the datasets. Further experiments and analyses are also performed to investigate different factors affecting WMSEG's performance.

The Proposed Framework
Following previous studies, we regard CWS as a character-based sequence labeling task. The architecture of WMSEG is illustrated in Figure 1, where the general sequence labeling paradigm is the top part with a memory module inserted between the encoder and the decoder. The model predicts a tag (e.g., tag B for the first character in a word) for each character, and the predicted tag sequence is then converted to word boundaries in the system output. The bottom part of the figure starts with a lexicon N, which is simply a list of n-grams and can be built by various methods (see Section 2.1). Given an input sentence X = x_1 x_2 ... x_i ... x_l, for each character x_i in X, our approach uses the lexicon N to generate (keys, values) for x_i and sends them to the memory module. In all, the process of WMSEG to perform CWS can be formalized as

Ŷ = argmax_{Y ∈ T^l} p(Y | X, M(X, N))    (1)

where T denotes the set of all types of segmentation labels and l stands for the length of the input sentence X. The output Y is the corresponding label sequence for X, with Ŷ representing the best label sequence according to the model. M is the memory module proposed in this paper, which consumes X and N and provides the corresponding wordhood information for X to maximize p.
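To make the tag-to-segmentation step concrete, below is a minimal sketch (ours, not part of the paper; the function name and the greedy handling of malformed tag sequences are assumptions) that converts a predicted {B, I, E, S} tag sequence into words.

```python
def tags_to_words(chars, tags):
    """Convert a character sequence and its predicted B/I/E/S tags into words.

    B: beginning of a multi-character word, I: inside, E: ending,
    S: single-character word. Malformed sequences are handled greedily.
    """
    words, buffer = [], []
    for ch, tag in zip(chars, tags):
        buffer.append(ch)
        if tag in ("E", "S"):          # a word ends at this character
            words.append("".join(buffer))
            buffer = []
    if buffer:                          # flush any trailing characters
        words.append("".join(buffer))
    return words

# e.g. tags_to_words("部分居民生活水平", list("BEBEBEBE"))
# -> ['部分', '居民', '生活', '水平']
```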
In the rest of this section, we describe the construction of the n-gram lexicon, the proposed wordhood memory networks, and how they are integrated with different encoders and decoders.

Lexicon Construction
To build the wordhood memory networks, the first step is to construct the lexicon N, because the keys in the memory module are built upon N, where each n-gram in N is stored as a key (n-gram and key are therefore equivalent in the memory). In this study, N is simply a list of n-grams and, technically, it can be constructed through many existing resources or automatic methods. Compared to using an off-the-shelf lexicon or the word dictionary from the training data, it is hypothesized that, for the purpose of incorporating wordhood information into the general sequence labeling framework, unsupervised wordhood measures, such as accessor variety (AV), pointwise mutual information (PMI) (Sun et al., 1998), and description length gain (DLG) (Kit and Wilks, 1999), would perform better. For example, AV measures the wordhood of an n-gram k by

AV(k) = min(L_av(k), R_av(k))    (2)

where L_av(k) and R_av(k) denote the number of different character types that can precede (left access number) or follow (right access number) the n-gram k. Normally, the higher the AV score is, the more likely the n-gram forms a word.
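As an illustration, here is a minimal sketch (ours; the function name, the n-gram length limit, and the sentence-boundary placeholders are assumptions) of computing AV scores over raw text and keeping high-AV n-grams as the lexicon N.

```python
from collections import defaultdict

def accessor_variety(corpus_lines, max_n=4):
    """Compute AV(k) = min(L_av(k), R_av(k)) for every n-gram k (1 <= n <= max_n).

    L_av / R_av count the distinct character types that precede / follow k;
    '<S>' and '</S>' mark sentence boundaries.
    """
    left, right = defaultdict(set), defaultdict(set)
    for line in corpus_lines:
        chars = ["<S>"] + list(line.strip()) + ["</S>"]
        for n in range(1, max_n + 1):
            for i in range(1, len(chars) - n):
                k = "".join(chars[i:i + n])
                left[k].add(chars[i - 1])    # left accessor
                right[k].add(chars[i + n])   # right accessor
    return {k: min(len(left[k]), len(right[k])) for k in left}

# Keep n-grams whose AV exceeds a dataset-specific threshold as the lexicon N, e.g.:
# lexicon = {k for k, av in accessor_variety(lines).items() if av >= 5}
```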

Wordhood Memory Networks
Encoding both n-grams and the wordhood information they carry requires an appropriate framework for CWS. Compared with other network structures that can exploit n-grams, such as the attention mechanism, key-value memory networks are more appropriate for modeling such pairwise knowledge through the mapping between keys and values.
In the memory, we map n-grams and their wordhood information to keys and values, respectively. Following Miller et al. (2016), in this subsection we illustrate how our memory module generates and operates on the (key, value) pairs for each x_i.

N-gram Addressing
For each x_i in a training/test instance, there are normally many n-grams in N that contain x_i. The n-gram addressing step therefore generates all n-grams from x_i's context (including x_i itself) and keeps only the ones that appear in N, yielding a key list K_i = [k_{i,1}, k_{i,2}, ..., k_{i,j}, ...]. For example, in the input sentence shown in Figure 1, the n-grams that contain the character x_4 = "民" (people) form the list K_4 = ["民" (people), "居民" (resident), "民生" (livelihood), "居民生活" (residents' life)], which are highlighted in the dashed boxes at the bottom part of the figure.

Table 1: Rules to map a key k_{i,j} to its value v_{i,j} according to the position of x_i in k_{i,j}.

Rule                                           v_{i,j}
x_i is the beginning of the key k_{i,j}        v_B
x_i is inside the key k_{i,j}                  v_I
x_i is the ending of the key k_{i,j}           v_E
x_i is the single-character key k_{i,j}        v_S
Then, the memory module activates the corresponding keys in it, addresses their embeddings (denoted as e^k_{i,j} for each k_{i,j}), and computes the probability distribution over them with

p_{i,j} = exp(h_i · e^k_{i,j}) / Σ_{j'} exp(h_i · e^k_{i,j'})    (3)

for each key, where h_i is the vector for x_i, which can be generated from any text encoder.
Wordhood Reading

Values in the memory represent the wordhood information for a given x_i and k_{i,j} pair, which is not a straightforward mapping, because x_i may have different roles in each k_{i,j}. For example, k_{i,j} delivers different wordhood information when x_i appears at the beginning or the ending of k_{i,j}. Therefore, we set rules in Table 1 to read a value for a key according to the position of x_i in k_{i,j} (also illustrated in Figure 1), so that every n-gram maps to one of the values v_B, v_I, v_E, and v_S. To illustrate that, in the aforementioned example, the n-grams in K_4 for x_4 = "民" are mapped to the values [v_S, v_E, v_B, v_I] (see Figure 1). As a result, each K_i for x_i has a list of values, denoted V_i = [v_{i,1}, v_{i,2}, ..., v_{i,j}, ...]. Then the total wordhood memory for x_i is computed from the weighted sum of all keys and values by

o_i = Σ_j p_{i,j} e^v_{i,j}    (4)

where e^v_{i,j} is the embedding for v_{i,j}. Afterwards, o_i is summed element-wise with h_i and the result is passed through a fully connected layer by

a_i = W_o (h_i + o_i)    (5)

where W_o is a trainable parameter and the output a_i ∈ R^{|T|} is a weight vector with each of its dimensions corresponding to a segmentation label.
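The addressing and reading steps in Eqs. 3-5 can be summarized by the following PyTorch-style sketch (ours, a simplified single-position version; the class name, tensor shapes, and the absence of batching are assumptions):

```python
import torch
import torch.nn as nn

class WordhoodMemory(nn.Module):
    """Sketch of the key-value memory: keys are n-grams, values are {v_B, v_I, v_E, v_S}."""

    def __init__(self, n_keys, hidden_size, n_labels, n_values=4):
        super().__init__()
        self.key_emb = nn.Embedding(n_keys, hidden_size)      # e^k_{i,j}
        self.value_emb = nn.Embedding(n_values, hidden_size)  # e^v_{i,j}
        self.fc = nn.Linear(hidden_size, n_labels)            # W_o

    def forward(self, h_i, key_ids, value_ids):
        """h_i: [hidden]; key_ids, value_ids: [m_i] matched n-grams for x_i."""
        e_k = self.key_emb(key_ids)                 # [m_i, hidden]
        e_v = self.value_emb(value_ids)             # [m_i, hidden]
        p = torch.softmax(e_k @ h_i, dim=0)         # Eq. 3: addressing over matched keys
        o_i = (p.unsqueeze(-1) * e_v).sum(dim=0)    # Eq. 4: weighted sum of value embeddings
        a_i = self.fc(o_i + h_i)                    # Eq. 5: fused output over segmentation labels
        return a_i
```

In practice, the matched keys and values would be padded and batched over all positions of a sentence, and the resulting a_i would be fed to the softmax or CRF decoder described in the next subsection.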

Text Encoders and Decoders
For the wordhood memory networks to function, one needs to generate h_i for each x_i by

[h_1 h_2 · · · h_l] = Encoder(X)    (6)

where the Encoder can be different models, e.g., Bi-LSTM and BERT (Devlin et al., 2019), that represent a sequence of Chinese characters as vectors. Once all a_i are generated from the memory for each x_i, a decoder takes them to predict a sequence of segmentation labels Ŷ = ŷ_1 ŷ_2 · · · ŷ_l for X by

Ŷ = Decoder(A)    (7)

where A = a_1 a_2 · · · a_i · · · a_l is the sequence of outputs from Eq. 5. The Decoder can be implemented by different algorithms, such as softmax:

ŷ_i = argmax_{t ∈ T} exp(a_i^t) / Σ_{t' ∈ T} exp(a_i^{t'})    (8)

where a_i^t is the value at dimension t in a_i. Or one can use CRF for the Decoder:

ŷ_i = argmax_{y_i ∈ T} exp(W_c · a_i + b_c) / Σ_{y_{i−1} y_i} exp(W_c · a_i + b_c)    (9)

where W_c ∈ R^{|T|×|T|} and b_c ∈ R^{|T|} are trainable parameters to model the transition from y_{i−1} to y_i.
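For the CRF decoder, inference searches for the highest-scoring label sequence. Below is a minimal Viterbi decoding sketch (ours, assuming the emission scores A from Eq. 5 and the transition parameters W_c and b_c defined above; it is a generic formulation rather than the paper's exact implementation).

```python
import numpy as np

def viterbi_decode(A, W_c, b_c):
    """Best label sequence given emission scores A [l, |T|],
    transition scores W_c [|T|, |T|] (from y_{i-1} to y_i), and bias b_c [|T|]."""
    l, n_tags = A.shape
    score = A[0] + b_c                         # scores at the first position
    backptr = np.zeros((l, n_tags), dtype=int)
    for i in range(1, l):
        # total[s, t]: best score of ending in tag t at step i, coming from tag s
        total = score[:, None] + W_c + A[i][None, :] + b_c[None, :]
        backptr[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(l - 1, 0, -1):              # follow back-pointers
        tags.append(int(backptr[i, tags[-1]]))
    return tags[::-1]
```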

Datasets
We employ five benchmark datasets in our experiments: four of them, namely, MSR, PKU, AS, and CITYU, are from the SIGHAN 2005 Bakeoff (Emerson, 2005), and the fifth one is CTB6 (Xue et al., 2005). AS and CITYU are in traditional Chinese characters whereas the other three use simplified characters. Table 2 shows the statistics of all datasets in terms of the number of characters and words and the percentage of out-of-vocabulary (OOV) words in the dev/test sets with respect to the training set.

Table 2: Statistics of the five benchmark datasets (character and word tokens, character and word types, and OOV rate of the test sets with respect to the training sets).

              MSR             PKU             AS              CITYU           CTB6
              TRAIN   TEST    TRAIN   TEST    TRAIN   TEST    TRAIN   TEST    TRAIN   DEV    TEST
CHAR #        4,050K  184K    1,826K  173K    8,368K  198K    2,403K  68K     1,056K  100K   134K
WORD #        2,368K  107K    1,110K  104K    5,500K  123K    1,456K  41K     641K    60K    82K
CHAR TYPE #   5K      3K      5K      3K      6K      4K      5K      3K      4K      3K     3K
WORD TYPE #   88K     13K     55K     13K     141K    19K     69K     9K      42K     10K    12K
OOV RATE      -       3.4     -       6.0     -       8.9     -       5.9     -       -      7.1
In addition, we also use CTB7 (LDC2010T07) to perform our cross-domain experiments. There are five genres in CTB7, including broadcast conversation (BC), broadcast news (BN), magazine (MZ), newswire (NW), and weblog (WEB). The statistics of all the genres are reported in Table 3, where the OOV rate for each genre is computed according to the union of all other genres. For example, the OOV rate for BC is computed with respect to the union of BN, MZ, NW, and WEB.
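As a reference for how such OOV rates can be obtained, here is a simple sketch (ours; the token-level definition and the function name are assumptions, as the paper may compute the rate differently):

```python
def oov_rate(test_words, train_words):
    """Percentage of test-set word tokens whose type is unseen in the training vocabulary."""
    train_vocab = set(train_words)
    oov = sum(1 for w in test_words if w not in train_vocab)
    return 100.0 * oov / len(test_words)

# Cross-domain setting: the OOV rate of BC is measured against the union of the other genres,
# e.g. oov_rate(bc_words, bn_words + mz_words + nw_words + web_words)
```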

Wordhood Measures
We experiment with three wordhood measures to construct N. The main experiment adopts the aforementioned AV as the measure to rank all n-grams, because AV was shown to be the most effective wordhood measure in previous CWS studies (Zhao and Kit, 2008). Since AV is sensitive to corpus size, in our experiments we use different AV thresholds when building the lexicon for each dataset: the threshold is 2 for PKU, CITYU, CTB6, and CTB7, and 5 for MSR and AS.
To test the robustness of WMSEG, we also try two other wordhood measures, i.e., PMI (Sun et al., 1998) and DLG (Kit and Wilks, 1999). PMI measures the pointwise mutual information between two Chinese characters, x′ and x′′, via

PMI(x′, x′′) = log ( p(x′x′′) / (p(x′) · p(x′′)) )    (10)

where p computes the probability of an n-gram (i.e., x′, x′′, and x′x′′) in a dataset. A high PMI score indicates that the two characters co-occur frequently in the dataset and are likely to form a word. Hence, we use a threshold to determine whether a word boundary delimiter should be inserted between two adjacent characters in the dataset. In our experiments, we set the threshold to 0; a PMI score lower than the threshold results in a segmentation. In other words, for each dataset, we use PMI to perform unsupervised segmentation and collect the segmented words from it to build the n-gram lexicon N. The other measure, DLG, computes the wordhood of an n-gram s according to the change in the description length of a dataset D with and without treating that n-gram as a segment:

DLG(s) = DL(D) − DL(D[r → s] ⊕ s)    (11)

where D denotes the original dataset and D[r → s] ⊕ s represents a new dataset obtained by treating s as a new segment, replacing all occurrences of s with a new symbol r (which can be seen as an index for the newly identified segment s), and then appending s at the end. DL(D) is the Shannon-Fano code length of a dataset D, calculated by

DL(D) = − Σ_{x ∈ V} c(x) · log_2 ( c(x) / |D| )    (12)

where V refers to the vocabulary of D and c(x) to the count of segment x. We set the threshold for DLG to 0 and use the n-grams whose DLG is higher than it to build the lexicon N for each dataset. All the aforementioned measures are computed on the union of the training and test sets, so that n-grams and their wordhood information are shared in both the learning and prediction phases. We remove all white spaces from the data and use the resulting raw texts to perform these measures. Table 4 shows the sizes of the lexicons created with these wordhood measures on the five datasets.
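To make the PMI-based unsupervised segmentation concrete, the following sketch (ours; the function name and the details of probability estimation are assumptions) estimates character and character-bigram probabilities from the raw text and inserts a word boundary wherever the PMI of two adjacent characters falls below the threshold:

```python
import math
from collections import Counter

def pmi_segment(lines, threshold=0.0):
    """Unsupervised segmentation: split between adjacent characters whose PMI < threshold."""
    uni, bi, n_uni, n_bi = Counter(), Counter(), 0, 0
    for line in lines:
        chars = list(line.strip())
        uni.update(chars); n_uni += len(chars)
        pairs = list(zip(chars, chars[1:]))
        bi.update(pairs); n_bi += len(pairs)

    def pmi(a, b):
        p_ab = bi[(a, b)] / n_bi
        p_a, p_b = uni[a] / n_uni, uni[b] / n_uni
        return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

    segmented = []
    for line in lines:
        chars, words, buf = list(line.strip()), [], []
        for i, ch in enumerate(chars):
            buf.append(ch)
            if i + 1 < len(chars) and pmi(ch, chars[i + 1]) < threshold:
                words.append("".join(buf)); buf = []
        if buf:
            words.append("".join(buf))
        segmented.append(words)
    return segmented  # the resulting word types can then be collected to build the lexicon N
```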

Model Implementation
Following previous studies (Sun and Xu, 2011; Chen et al., 2015; Ma et al., 2018), we use four segmentation labels in our experiments, i.e., T = {B, I, E, S}. Among them, B, I, and E indicate that a character is the beginning, inside, or ending of a word, and S denotes that the character is a single-character word. Since text representation plays an important role in facilitating many tasks (Conneau et al., 2017; Song et al., 2017; Sileo et al., 2019), we try two effective and well-known encoders, i.e., Bi-LSTM and BERT. In addition, we test WMSEG on a pre-trained encoder for the Chinese language, i.e., ZEN (Diao et al., 2019), which learns n-gram information in its pre-training from large raw corpora and outperforms BERT on many Chinese NLP tasks. Table 5 shows the hyperparameter settings for all the encoders: for the Bi-LSTM encoder, we follow the setting of Chen et al. (2015) and adopt their character embeddings for e_{x_i}, and for the BERT and ZEN encoders, we follow the default settings in their papers (Devlin et al., 2019; Diao et al., 2019).
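For completeness, converting gold-standard segmentations into the {B, I, E, S} labels described above can be sketched as follows (ours; the function name is an assumption):

```python
def words_to_tags(words):
    """Map a segmented sentence (list of words) to per-character B/I/E/S labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

# words_to_tags(["部分", "居民", "生活", "水平"]) -> ['B','E','B','E','B','E','B','E']
```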
For the decoders, we use softmax and CRF, and set their loss functions as cross-entropy and negative log-likelihood, respectively. The memory module can be initialized by random or pre-trained word embeddings for keys and values. In our experiments, we use random initialization for them.

Table 6: Experimental results of WMSEG on the SIGHAN 2005 and CTB6 datasets with different configurations. "EN-DN" stands for the text encoders ("BL" for Bi-LSTM and "BT" for BERT) and decoders ("SM" for softmax and "CRF" for CRF). The "WM" column indicates whether the wordhood memories are used (√) or not (×).

Results and Analyses
In this section, we first report the results of WMSEG with different configurations on five benchmark datasets and its comparison with existing models. Then we explore the effect of using different lexicons N and different wordhood measures in WMSEG. We also use a cross-domain experiment to illustrate the effectiveness of WMSEG when more OOVs appear in the test set. Lastly, a case study is performed to visualize how the wordhood information used in WMSEG helps CWS.

Results on Benchmark Datasets
In the main experiment, we illustrate the validity of the proposed memory module by comparing WMSEG in different configurations, i.e., with and without the memory, combined with three encoders, i.e., Bi-LSTM, BERT, and ZEN, and two decoders, i.e., softmax and CRF. The experimental results on the aforementioned five benchmark datasets are shown in Table 6, where the overall F-score and the recall of OOV words (R_OOV) are reported. With five datasets and six encoder-decoder configurations, the table includes results from 30 pairs of experiments, each pair with and without using the memories. There are several observations drawn from the results. First, the overall comparison clearly indicates that WMSEG (i.e., the model with wordhood memories) outperforms the baseline (i.e., the model without wordhood memories) for all 30 pairs in terms of F-scores and for 26 pairs in terms of R_OOV. Second, the proposed memory module works smoothly with different encoders and decoders, where some of the improvement is quite significant; for instance, when using Bi-LSTM as the encoder and CRF as the decoder, WMSEG improves the F-score on the AS dataset from 94.39 to 95.07 and R_OOV from 61.59 to 68.17. With BERT or ZEN as the encoder, even when the baseline system performs very well, the improvement of WMSEG on F-scores is still decent. Third, among the models with ZEN, the ones with the memory module further improve over their baselines, although the context information carried by n-grams is already learned in pre-training ZEN. This indicates that wordhood information provides additional cues (besides the contextual features) that can benefit CWS, and our proposed memory module is able to provide further task-specific guidance to an n-gram integrated encoder. Fourth, the wordhood memory shows its robustness with respect to different lexicon sizes when WMSEG's performance is considered together with the lexicon statistics reported in Table 4. To summarize, the results in this experiment not only confirm that wordhood information is a simple yet effective source of knowledge to help CWS without requiring external support such as a well-defined dictionary or manually crafted heuristics, but also fully illustrate that the design of our model can effectively integrate this type of knowledge.
To further illustrate the validity and effectiveness of WMSEG, we compare our best-performing model with the ones in previous studies on the same benchmark datasets. The comparison is presented in Table 7, where WMSEG (both with BERT and with ZEN) outperforms all existing models with respect to F-scores and achieves new state-of-the-art performance on all datasets.

Table 7: Performance (F-score) comparison between WMSEG (BT-CRF and ZEN-CRF with wordhood memory networks) and previous state-of-the-art models on the test sets of the five benchmark datasets.

Cross-Domain Performance
As domain variance is always an important factor affecting the performance of NLP systems, especially word segmenters (Song and Xia, 2013), in addition to the experiments on benchmark datasets, we also run WMSEG on CTB7 across domains (genres in this case) with and without the memory module. To test on each genre, we use the union of the data from the other four genres to train our segmenter and use AV to extract n-grams from the entire raw text of CTB7 in this experiment. Table 8 reports the results in F-score and OOV recall, which show a similar trend to that in Table 6, where WMSEG outperforms the baselines for all five genres. Particularly, for genres with large domain variance (e.g., the ones with high OOV rates such as MZ and WEB), CWS is difficult, and the relatively low F-scores of the baseline models in Table 8 confirm that. Yet WMSEG offers a decent way to improve cross-domain CWS performance without any help from external knowledge or complicated model design, which further illustrates the effectiveness of the memory module. The reason could be that many n-grams are shared in both the training and test data; these n-grams, with their wordhood information, present a strong indication to the model of what combinations of characters can be treated as words, even though some of them never appear in the training data.

Effect of Using Different N
To analyze the robustness of WMSEG with respect to the lexicon, we compare four ways (ID: 2-5 in Table 9) of constructing the lexicon N: the first one simply uses the vocabulary from the training data (marked as GOLD LABEL in Table 9; ID: 2); the other three use AV to extract n-grams from the unsegmented training data only (ID: 3), the test data only (ID: 4), and the training + test sets (ID: 5), respectively. Table 9 shows the results of running BERT-CRF on the WEB genre of CTB7 without the wordhood memories (ID: 1) and with the memories (ID: 2-5), following the cross-domain setting in §4.2. While the four methods with memories achieve similar F-scores, indicating the robustness of our proposed framework, the one that builds N from the raw texts of both the training and test sets through an unsupervised method (ID: 5) achieves the biggest improvement on R_OOV, demonstrating the advantage of incorporating the results of unsupervised wordhood measures over the unlabeled test set into the model.

Effect of Different Wordhood Measures
Since WMSEG provides a general way of integrating wordhood information for CWS, we expect other wordhood measures to play a similar role in it. Therefore, we test PMI and DLG in our model and compare them with the previous results from AV (see Table 6). Specifically, we use our best-performing BERT-based model, i.e., BERT-CRF, with the n-gram lexicons constructed by the aforementioned three measures and run it on all benchmark datasets. We draw histograms of the F-scores obtained from WMSEG with each measure (red, green, and blue bars for AV, PMI, and DLG, respectively) in Figure 2, where the F-scores of the baseline model are also presented in orange bars.

Table 8: Experimental results on five genres of CTB7. Abbreviations follow the same notation as in Table 6.

Table 9: Comparison of performance gains on the WEB genre of CTB7 with respect to the baseline BERT-CRF model when the n-gram lexicon N for WMSEG is built upon different sources. √ and × refer to whether a corresponding data source is used or not, respectively.

As shown in the figure, the performance with the three measures is very similar, which indicates that WMSEG is able to robustly incorporate the wordhood information from various measures, even though those measures focus on different aspects of n-grams when determining whether the n-grams should be treated as words. Particularly, considering that the lexicons produced by the three measures are rather different in size (as shown in Table 4), the results in Figure 2 strongly demonstrate the effectiveness of our proposed approach in learning with a limited number of n-grams. This observation also reveals the possibility that many n-grams may be redundant for our model, and WMSEG is thus able to identify the most useful ones among them, which is analyzed in the case study.

Case Study
To investigate how the memory learns from the wordhood information carried by n-grams, we conduct a case study with an example input sentence "他/从小/学/电脑/技术" (He learned computer techniques since childhood). In this sentence, the n-gram "从小学" is ambiguous with two possible interpretations: "从小/学" (learn since childhood) and "从/小学" (from primary school). Native Chinese speakers can easily choose the first one with the given context, but a word segmenter might incorrectly choose the second segmentation.
We feed this case into our BERT-CRF model with the memory module. In Figure 3, we visualize the resulting weights learned for keys (a) and values (b) in the memory, as well as those from the final tagger (c). The heatmaps of all keys and values in the memory with respect to each corresponding input character clearly illustrate that the appropriate n-grams, e.g., "他" (he), "学" (learn), "从小" (from childhood), etc., receive higher weights than others, and the corresponding values for them are also emphasized, which further affects the final CWS tagging so that the weight distributions in (b) and (c) resemble each other. Therefore, this visualization explains, to some extent, that the proposed memory mechanism can identify and distinguish important n-grams within a certain context and thus improves CWS performance accordingly.

Related Work

As one of the most fundamental NLP tasks for Chinese language processing, CWS has been studied for decades, with two streams of methods, i.e., word-based and character-based ones (Xue and Shen, 2003; Peng et al., 2004; Levow, 2006; Zhao et al., 2006; Zhao and Kit, 2008; Li and Sun, 2009; Song et al., 2009a; Li, 2011; Sun and Xu, 2011; Zhang et al., 2013; Pei et al., 2014; Chen et al., 2015; Ma and Hinrichs, 2015; Liu et al., 2016; Zhang et al., 2016; Wang and Xu, 2017; Zhou et al., 2017; Ma et al., 2018; Higashiyama et al., 2019; Gong et al., 2019). Among these studies, most follow the character-based paradigm to predict segmentation labels for each character in an input sentence; n-grams are used in some of these studies to enhance model performance, which is also observed in many other NLP tasks (Song et al., 2009b; Xiong et al., 2011; Shrestha, 2014; Shi et al., 2016; Diao et al., 2019). Recently, CWS has benefited from neural networks, and further progress has been made with embeddings (Pei et al., 2014; Ma and Hinrichs, 2015; Liu et al., 2016; Zhang et al., 2016; Wang and Xu, 2017; Zhou et al., 2017), recurrent neural models (Chen et al., 2015; Ma et al., 2018; Higashiyama et al., 2019; Gong et al., 2019), and even adversarial learning. To enhance CWS with neural models, there were studies leveraging external information, such as vocabularies from auto-segmented external corpora (Wang and Xu, 2017; Higashiyama et al., 2019), where Higashiyama et al. (2019) introduced a word attention mechanism to learn from larger text granularity during the CWS process. In addition, some studies try to improve CWS by learning from data annotated under different segmentation criteria. Moreover, there is a study leveraging auto-analyzed syntactic knowledge obtained from off-the-shelf toolkits to help CWS and part-of-speech tagging (Tian et al., 2020). Compared to these studies, WMSEG offers an alternative solution that robustly enhances neural CWS models without requiring external resources.

Conclusion
In this paper, we propose WMSEG, a neural framework for CWS using wordhood memory networks, which maps n-grams and their wordhood information to keys and values in it and appropriately models the values according to the importance of keys in a specific context. The framework follows the sequence labeling paradigm, and the encoders and decoders in it can be implemented by various prevailing models. To the best of our knowledge, this is the first work using key-value memory networks and utilizing wordhood information for neural models in CWS. Experimental results on various widely used benchmark datasets illustrate the effectiveness of WMSEG, where state-of-the-art performance is achieved on all datasets. Further experiments and analyses also demonstrate the robustness of WMSEG in the cross-domain scenario as well as when using different lexicons and wordhood measures.