Coupling Distant Annotation and Adversarial Training for Cross-Domain Chinese Word Segmentation

Fully supervised neural approaches have achieved significant progress in the task of Chinese word segmentation (CWS). Nevertheless, the performance of supervised models always drops sharply when the domain shifts, due to the distribution gap across domains and the out-of-vocabulary (OOV) problem. To alleviate both issues simultaneously, this paper couples distant annotation and adversarial training for cross-domain CWS. 1) We rethink the essence of "Chinese words" and design an automatic distant annotation mechanism that requires no supervision or pre-defined dictionaries on the target domain. The method can effectively discover domain-specific words and distantly annotate raw texts for the target domain. 2) We further develop a sentence-level adversarial training procedure to perform noise reduction and make maximum use of the source domain information. Experiments on multiple real-world datasets across various domains show the superiority and robustness of our model, which significantly outperforms previous state-of-the-art cross-domain CWS methods.


Introduction
Chinese is an ideographic language and lacks delimiters between words in written sentences. Therefore, Chinese word segmentation (CWS) is often regarded as a prerequisite for downstream tasks in Chinese natural language processing. The task is conventionally formalized as a character-based sequence tagging problem (Peng et al., 2004), where each character is assigned a specific label denoting its position within a word. With the development of deep learning techniques, recent years have also seen increasing interest in applying neural network models to CWS (Cai and Zhao, 2016; Liu et al., 2016; Cai et al., 2017; Ma et al., 2018). These approaches have achieved significant progress on in-domain CWS tasks, but they still suffer from the cross-domain issue when processing out-of-domain data.
Cross-domain CWS is exposed to two major challenges: 1) Gap between domain distributions. This is a common issue in all domain adaptation tasks. Source domain data and target domain data generally have different distributions, so models built on source domain data tend to degrade when applied to target domain data. Generally, some labeled target domain data is needed to adapt source domain models, but manually crafting such data is expensive and time-consuming. 2) The out-of-vocabulary (OOV) problem: some words in the testing data never occur in the training data. Source domain models have difficulty recognizing OOV words since the source domain data contains no information about them. Figure 1 presents examples illustrating the difference between the word distributions of the newswire domain and the medical domain. Segmenters built on the newswire domain have very limited information for segmenting domain-specific words like "溶菌酶 (Lysozyme)".
Previous approaches to cross-domain CWS mainly fall into two groups. The first group attacks the OOV issue by utilizing predefined dictionaries from the target domain to facilitate cross-domain CWS (Liu et al., 2014; Zhao et al., 2018), but such methods suffer from poor scalability since not all domains possess predefined dictionaries. In other words, these methods are directly restricted by the external resources available in a target domain. Studies in the second group (Ye et al., 2019) attempt to learn target domain distributions, such as word embeddings, from unlabeled target domain data. In this approach, the source domain data is not fully utilized, since its information is transferred solely through the segmenter built on it.
In this paper, we propose to attack the aforementioned challenges simultaneously by coupling the techniques of distant annotation and adversarial training. The goal of distant annotation is to automatically construct labeled target domain data without requiring human-curated domain-specific dictionaries. To this end, we rethink the definition and essence of "Chinese words" and develop a word miner to obtain domain-specific words from unlabeled target domain data. Moreover, a segmenter trained on the source domain data recognizes the common words in the unlabeled target data. In this way, sentences from the target domain receive automatic annotations that can be used as target domain training data.
Although distant annotation provides satisfactory labeled target domain data, annotation errors remain that affect the final performance. To reduce the effect of noise in the automatic annotations and to make better use of the source domain data, we apply adversarial training jointly on the source domain dataset and the distantly constructed target domain dataset. The adversarial training module can also capture deeper domain-specific and domain-agnostic features.
To show the effectiveness and robustness of our approach, we conduct extensive experiments on five real-world datasets across various domains. Experimental results show that our approach achieves state-of-the-art results on all datasets, significantly outperforming representative previous works. Further, we design a series of auxiliary experiments to verify that our approach alleviates the aforementioned problems in cross-domain CWS.

Related Work
Chinese Word Segmentation Chinese word segmentation is typically formalized as a sequence tagging problem. Thus, traditional machine learning models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) were widely employed for CWS in the early stage (Wong and Chan, 1996; Gao et al., 2005; Zhao et al., 2010). With the development of deep learning methods, research focus has been shifting towards deep neural networks that require little feature engineering. Chen et al. (2015) were the first to use LSTM (Hochreiter and Schmidhuber, 1997) to resolve long dependencies in word segmentation problems. Since then, most efforts have focused on building end-to-end sequence tagging architectures, which significantly outperform traditional approaches on the CWS task (Wang and Xu, 2017; Zhou et al., 2017; Yang et al., 2017; Cai et al., 2017; Huang et al., 2019b; Gan and Zhang, 2019; Yang et al., 2019). Cross-domain CWS As a more challenging task, cross-domain CWS has attracted increasing attention. Liu and Zhang (2012) propose an unsupervised model, in which they use a character clustering method and the self-training algorithm to jointly model CWS and POS-tagging. Liu et al. (2014) apply partial CRF for cross-domain CWS by obtaining a partial annotation dataset from freely available data. Similarly, Zhao et al. (2018) build partially labeled data by combining unlabeled data and lexicons. Other work proposes to incorporate a predefined domain dictionary into the training process via handcrafted rules. Ye et al. (2019) propose a semi-supervised approach that leverages word embeddings trained on segmented text in the target domain. Adversarial Learning Adversarial learning is derived from Generative Adversarial Nets (GAN) (Goodfellow et al., 2014), which have achieved huge success in the computer vision field. Recently, many works have tried to apply adversarial learning to NLP tasks.
Jia and Liang (2017), Li et al. (2018) and Farag et al. (2018) focus on learning or creating adversarial rules or examples to improve the robustness of NLP systems. For cross-domain or cross-lingual sequence tagging, an adversarial discriminator is widely used to extract domain- or language-invariant features (Kim et al., 2017; Huang et al., 2019a; Zhou et al., 2019).
Our Approach

Figure 2 shows the framework of our approach to cross-domain CWS, which is mainly composed of two components: 1) Distant Annotation (DA) and 2) Adversarial Training (AT). In the following, we describe the details of the framework (DAAT) from left to right in Figure 2.
In this paper, bold-face letters (e.g., W) denote vectors, matrices and tensors. Numerical subscripts indicate the indices of a sequence or vector. The subscript src indicates the source domain and tgt the target domain.

Distant Annotation
As illustrated in Figure 2, given a labeled source domain dataset and an unlabeled target domain dataset, distant annotation (DA) aims to automatically generate word segmentation results for sentences in the target domain. DA has two main modules: a base segmenter and a Domain-specific Words Miner. Specifically, the base segmenter is a GCNN-CRF (Wang and Xu, 2017) model trained solely on the labeled source domain data and is used to recognize words that are common to the source and target domains. The Domain-specific Words Miner is designed to discover target domain-specific words. Base Segmenter In the CWS task, given a sentence s = {c_1, c_2, ..., c_n}, following the BMES tagging scheme, each character c_i is assigned one of the labels in {B, M, E, S}, indicating whether the character is at the beginning, middle, or end of a word, or is a single-character word.
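The BMES scheme can be illustrated with a short sketch (a hypothetical helper, not part of the authors' system) that converts a gold segmentation into per-character tags:

```python
def to_bmes(words):
    """Map a segmented sentence (list of words) to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            # first char -> B, middle chars -> M, last char -> E
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags
```

For example, the segmentation ["溶菌酶", "的"] yields the tag sequence ["B", "M", "E", "S"].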
For a sentence s, we first use an embedding layer to obtain the embedding representation e_i for each character c_i. The sentence s can then be represented as e = {e_1, e_2, ..., e_n} ∈ R^{n×d}, where d denotes the embedding dimension. e is fed into the GCNN model (Gehring et al., 2017), which computes the output as:

H = (e ∗ W + b) ⊗ σ(e ∗ V + c), with W, V ∈ R^{k·d×l} and b, c ∈ R^l

where d and l are the input and output dimensions respectively, and k is the window size of the convolution operator. σ is the sigmoid function and ⊗ represents the element-wise product. We adopt a stacked convolution architecture to capture long-distance information, where the output of each layer is treated as the input of the next. The final representation of sentence s is H_s = {h_1, h_2, ..., h_n}.
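A single gated convolution layer of this kind can be sketched with NumPy as follows (a simplified illustration under our reading of the formula; the zero-padding scheme and names are assumptions, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcnn_layer(e, W, b, V, c, k):
    """Gated convolution: for each position i, take the window of k embeddings,
    flatten it to a (k*d,) vector x, and output (x W + b) * sigmoid(x V + c)."""
    n, d = e.shape
    pad = (k - 1) // 2
    ep = np.pad(e, ((pad, k - 1 - pad), (0, 0)))  # zero-pad to keep length n
    rows = []
    for i in range(n):
        x = ep[i:i + k].reshape(-1)               # window flattened to (k*d,)
        rows.append((x @ W + b) * sigmoid(x @ V + c))
    return np.stack(rows)                         # shape (n, l)
```

Stacking several such layers, each consuming the previous layer's output, yields the final representation H_s.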
Correlations among labels are crucial in sequence tagging. For an input sequence s_src = {c_1, c_2, ..., c_n} (taking source domain data as an example), the corresponding label sequence is L = {y_1, y_2, ..., y_n}. The goal of the CRF is to compute the conditional probability distribution:

p(L|s_src) = exp( Σ_{i=1}^{n} [ S(h_i, y_i) + T(y_{i−1}, y_i) ] ) / Σ_{L'∈C} exp( Σ_{i=1}^{n} [ S(h_i, y'_i) + T(y'_{i−1}, y'_i) ] )

where T denotes the transition function that calculates the transition score from y_{i−1} to y_i, C contains all possible label sequences for s_src, and L' = {y'_1, ..., y'_n} is a label sequence in C. S is the score function that computes the emission score from the hidden feature vector h_i to the corresponding label y_i, defined as:

S(h_i, y_i) = W_{y_i} · h_i + b_{y_i}

where W_{y_i} and b_{y_i} are learned parameters specific to the label y_i. To decode the highest-scoring label sequence, the classic Viterbi (Viterbi, 1967) algorithm is used. The loss function of the sequence tagger is the sentence-level negative log-likelihood:

L_src = − Σ log p(L|s_src)

The loss of the target tagger, L_tgt, is computed similarly.

Domain-specific Words Miner As mentioned in section 1, previous works usually rely on existing domain dictionaries to handle the segmentation of domain-specific noun entities in cross-domain CWS. This strategy overlooks the fact that a high-quality dictionary is often hard to acquire for a brand-new domain. In contrast, we develop a simple and efficient strategy to mine domain-specific words without any predefined dictionaries. Given a large raw corpus in the target domain and the base segmenter, we obtain a set of segmented texts Γ = {T_1, T_2, ..., T_N}, with stop-words removed. Let γ = {t_1, t_2, ..., t_m} denote all the n-gram sequences extracted from Γ. For each sequence t_i, we need to estimate the probability that it is a valid word. Four factors are mainly considered in this procedure. 1) Mutual Information (MI). MI (Kraskov et al., 2004) is widely used to estimate the correlation of two random variables.
Here, we use the mutual information between different sub-strings to measure the internal tightness of a text segment, as shown in Figure 3(a). Further, in order to exclude extreme cases, all binary splits of the candidate are enumerated and the minimum is taken. The final MI score of a sequence t_i consisting of n characters, t_i = {c_1 ... c_n}, is defined as:

MI(t_i) = min_{1≤j<n} log [ p(t_i) / ( p(c_1...c_j) · p(c_{j+1}...c_n) ) ]

where p(·) denotes the probability of a segment in the whole corpus Γ.
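A minimal sketch of this score, taking the minimum pointwise mutual information over all binary splits of the candidate (our reading of the definition; `prob` is an assumed lookup of corpus probabilities):

```python
import math

def mi_score(ngram, prob):
    """MI(t) = min over splits j of log( p(t) / (p(t[:j]) * p(t[j:])) ).
    `prob` maps a string to its probability in the corpus."""
    best = float("inf")
    for j in range(1, len(ngram)):
        pmi = math.log(prob[ngram] / (prob[ngram[:j]] * prob[ngram[j:]]))
        best = min(best, pmi)
    return best
```

Taking the minimum over splits means a candidate scores high only if every way of cutting it apart loses information, i.e. the whole segment is tighter than any of its parts.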
2) Entropy Score (ES). In information theory, entropy measures the uncertainty of a random variable (Jaynes, 1957).
[Figure 3: (a) Mutual Information measures the internal tightness of a segment; (b) Entropy Score measures its external flexibility.]

We can thus use ES to measure the uncertainty of a candidate text fragment, since higher uncertainty means a richer neighboring context. Let N_l(t_i) = {l_1, ..., l_k} and N_r(t_i) = {r_1, ..., r_k} be the sets of left and right adjacent characters of t_i. The left entropy score ES_l and right entropy score ES_r of t_i are formulated as:

ES_l(t_i) = − Σ_{l∈N_l(t_i)} p(l|t_i) log p(l|t_i)
ES_r(t_i) = − Σ_{r∈N_r(t_i)} p(r|t_i) log p(r|t_i)

We choose min(ES_l(t_i), ES_r(t_i)) as the final score ES(t_i). Hence, ES(t_i) explicitly represents the external flexibility of a text segment (as shown in Figure 3(b)) and serves as an important indicator of whether the segment is an independent word.
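The left/right entropy can be computed directly from the observed neighbor characters of each candidate, e.g. (an illustrative sketch with assumed helper names):

```python
import math
from collections import Counter

def entropy(neighbors):
    """Shannon entropy of the empirical distribution of adjacent characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def entropy_score(left_neighbors, right_neighbors):
    """ES(t) = min(ES_l(t), ES_r(t)): low if either side is rigid."""
    return min(entropy(left_neighbors), entropy(right_neighbors))
```

A fragment that always follows the same character on one side gets entropy 0 on that side and is therefore unlikely to be an independent word.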
3) tf-idf. tf-idf is a widely used statistic that reflects how important a word is to a document in a collection or corpus. As illustrated in Figure 1, most domain-specific words are noun entities, which generally carry large tf-idf weights.
In this work, we define a word probability score p_val(t_i) to indicate how likely t_i is to be a valid word:

p_val(t_i) = σ( N(MI(t_i)) + N(ES(t_i)) + N(tf-idf(t_i)) )

where σ denotes the sigmoid function and N denotes normalization with the max-min method. 4) Word frequency. If t_i is a valid word, it should appear repeatedly in Γ.
Finally, by setting appropriate thresholds for p_val(t_i) and word frequency, the Domain-specific Words Miner can effectively discover domain-specific words and construct the domain-specific word collection C for the target domain. In this work, we only keep words t_i with p_val(t_i) ≥ 0.95 and frequency larger than 10.
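Putting the factors together, the filtering step might look as follows (the additive combination inside the sigmoid is our reading of the score; the helper names and dict-based inputs are illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def max_min_norm(scores):
    """Max-min normalize a {word: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {w: (s - lo) / span for w, s in scores.items()}

def select_words(mi, es, tfidf, freq, p_thresh=0.95, min_freq=10):
    """Keep candidates with p_val >= p_thresh and frequency > min_freq."""
    mi_n, es_n, tf_n = max_min_norm(mi), max_min_norm(es), max_min_norm(tfidf)
    kept = set()
    for w in mi:
        p_val = sigmoid(mi_n[w] + es_n[w] + tf_n[w])
        if p_val >= p_thresh and freq[w] > min_freq:
            kept.add(w)
    return kept
```

Note that sigmoid(3) ≈ 0.953, so with a 0.95 threshold a candidate must score near the top on all three normalized factors to survive.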
The left part of Figure 2 illustrates the data construction process of DA. First, we use the Domain-specific Words Miner to build the collection C for the target domain. Taking the sentence "溶菌酶的科学研究 (Scientific research on lysozyme)" as an example, the forward maximum matching algorithm based on C identifies "溶菌酶 (lysozyme)" as a valid word. Hence, the characters "溶", "菌", "酶" are labeled "B", "M", "E". For the rest of the sentence, we adopt the base segmenter to perform the labeling: "的 科 学 研 究" is assigned {"S", "B", "E", "B", "E"}. In this way, we can automatically build an annotated dataset for the target domain.
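The forward maximum matching step over the mined collection C can be sketched as follows (a simplified standalone version: here uncovered characters are emitted one by one, whereas in the full system those spans are labeled by the base segmenter instead):

```python
def forward_max_match(sentence, lexicon, max_len=6):
    """Scan left to right; at each position take the longest lexicon match,
    otherwise emit a single character for the base segmenter to handle."""
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + j] in lexicon:
                words.append(sentence[i:i + j])
                i += j
                break
        else:
            words.append(sentence[i])  # no multi-char match at this position
            i += 1
    return words
```

With lexicon {"溶菌酶"}, the sentence "溶菌酶的科学研究" is split into ["溶菌酶", "的", "科", "学", "研", "究"]; the unmatched tail is then re-labeled by the base segmenter.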

Adversarial Training
The structure of the Adversarial Training module is illustrated in the right part of Figure 2. As described above, we construct an annotated dataset for the target domain, so the inputs of the network are two labeled datasets: the source domain S and the target domain T. There are three encoders that extract features with different emphases, all based on the GCNN introduced above. For domain-specific features, we adopt two independent encoders, E_src and E_tgt, for the source and target domains. For domain-agnostic features, we adopt a sharing encoder E_shr and a discriminator G_d, which are trained as adversarial players.
For the two domain-specific encoders, the input is a sentence s_src = {c^s_1, c^s_2, ..., c^s_n} from the source domain or a sentence s_tgt = {c^t_1, c^t_2, ..., c^t_m} from the target domain. The sequence representations of s_src and s_tgt are obtained by E_src and E_tgt respectively. Thus, the domain-specific representations of s_src and s_tgt are H_s ∈ R^{n×l} and H_t ∈ R^{m×l}, where n and m denote the sequence lengths of s_src and s_tgt respectively, and l is the output dimension of the GCNN encoder.
For the sharing encoder, we want E_shr to generate representations that prevent the sentence-level discriminator from correctly predicting the domain of each sentence, so that E_shr finally extracts domain-agnostic features. Formally, given sentences s_src and s_tgt from the source and target domains, E_shr produces sequence features H*_s and H*_t for s_src and s_tgt respectively. The discriminator aims to distinguish the domain of each sentence. Specifically, the final representation H* of every sentence s is fed into a binary classifier G_y, for which we adopt the text-CNN network (Kim, 2014). G_y produces the probability that the input sentence s comes from the source domain. Thus, the loss function of the discriminator is:

L_d = − Σ_{s∈S∪T} [ d_s log G_y(H*) + (1 − d_s) log(1 − G_y(H*)) ]

where d_s = 1 if s comes from the source domain and d_s = 0 otherwise. Features generated by the sharing encoder E_shr should fool the discriminator into mispredicting the domain of s. Thus, the loss function for the sharing encoder, L_c, is a flipped version of L_d:

L_c = − Σ_{s∈S∪T} [ (1 − d_s) log G_y(H*) + d_s log(1 − G_y(H*)) ]

The final representation is then fed into the CRF tagger.
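The two losses are standard binary cross-entropies with swapped labels, which a short sketch makes concrete (probabilities as plain floats; the function names are illustrative, not the authors' API):

```python
import math

def bce(p_src, d):
    """Mean cross-entropy of predicted source-probabilities p_src against
    domain labels d (1 = source, 0 = target)."""
    eps = 1e-12  # guard against log(0)
    return -sum(di * math.log(pi + eps) + (1 - di) * math.log(1 - pi + eps)
                for pi, di in zip(p_src, d)) / len(d)

def discriminator_loss(p_src, d):
    return bce(p_src, d)                      # L_d: predict the true domain

def encoder_loss(p_src, d):
    return bce(p_src, [1 - di for di in d])   # L_c: flipped labels fool G_y
```

Minimizing L_d sharpens the discriminator, while minimizing L_c pushes the sharing encoder toward features on which the discriminator is wrong, i.e. features that carry no domain signal.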
Our model can thus be jointly trained in an end-to-end manner with the standard back-propagation algorithm. More details of the adversarial training process are given in Algorithm 1. When no annotated dataset on the target domain is available, we can remove L_tgt during adversarial training and use the source domain segmenter for evaluation.
Algorithm 1 Adversarial training algorithm.
Input: Manually annotated dataset D_s for source domain S, and distantly annotated dataset D_t for target domain T
for i ← 1 to epochs do
    for j ← 1 to num_of_steps_per_epoch do
        Sample mini-batches X_s ∼ D_s, X_t ∼ D_t
        if j mod 2 = 1 then
            loss = L_src + L_tgt + L_d
        else
            loss = L_src + L_tgt + L_c
        end if
        Update θ w.r.t. loss
    end for
end for
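Algorithm 1 can be mirrored in code as follows (the `model` interface and batch iterators are hypothetical; the point is the odd/even alternation between L_d and L_c):

```python
def adversarial_train(model, batches_s, batches_t, epochs, steps_per_epoch):
    """Alternate updates: odd steps add the discriminator loss L_d,
    even steps add the flipped sharing-encoder loss L_c."""
    for _ in range(epochs):
        for j in range(1, steps_per_epoch + 1):
            X_s, X_t = next(batches_s), next(batches_t)
            adv = model.L_d(X_s, X_t) if j % 2 == 1 else model.L_c(X_s, X_t)
            loss = model.L_src(X_s) + model.L_tgt(X_t) + adv
            model.update(loss)
```

Because L_c is the flipped version of L_d, alternating the two on a shared parameter set θ plays the usual minimax game between the discriminator and the sharing encoder.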

Experiments
In this section, we conduct extensive cross-domain CWS experiments on multiple real-world datasets with different domains, then comprehensively evaluate our method and other approaches.

Datasets and Experimental Settings
Datasets Six datasets across various domains are used in our work; their statistics are shown in Table 1. We use the PKU dataset (Emerson, 2005), a benchmark CWS dataset in the newswire domain, as the source domain data. The other five datasets are used as target domain datasets. Three of them are Chinese fantasy novel datasets: DL (DoLuoDaLu), FR (FanRenXiuXianZhuan) and ZX (ZhuXian) (Qiu and Zhang, 2015). An obvious advantage of fantasy novel datasets is that each fiction contains a large number of proper nouns coined by the author, which directly reflect how well an approach alleviates the OOV problem. Besides the fiction datasets, we also use the DM (dermatology) and PT (patent) datasets (Ye et al., 2019). All the target domains differ substantially from the source domain (newswire). For a fair and comprehensive evaluation, the full/test splits of the datasets follow Ye et al. (2019).
Hyper-Parameters Table 2 shows the hyper-parameters used in our method. All models are implemented with TensorFlow (Abadi et al., 2016) and trained using mini-batched back-propagation with the Adam optimizer (Kingma and Ba, 2015). The models are trained on NVIDIA Tesla V100 GPUs with CUDA. Evaluation Metrics We use the standard micro-averaged precision (P), recall (R) and F-measure as evaluation metrics. We also compute OOV rates to reflect the severity of the OOV issue.
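Word-level P/R/F and the OOV rate can be computed from character-span sets, e.g. (a standard sketch of segmentation scoring, not the authors' evaluation script):

```python
def spans(words):
    """Convert a segmentation into a set of (start, end) character spans."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def prf(gold, pred):
    """Precision, recall and F1 over exactly matching word spans."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def oov_rate(test_words, train_vocab):
    """Fraction of test word tokens never seen in the training vocabulary."""
    return sum(w not in train_vocab for w in test_words) / len(test_words)
```

Scoring spans rather than word strings ensures that a predicted word only counts as correct when it covers exactly the same characters as a gold word.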

Compared Methods
We conduct comprehensive experiments against selected previously proposed methods: Partial CRF (Liu et al., 2014) builds partially annotated data from raw text and lexicons via handcrafted rules, then trains a CRF-based CWS model on both the labeled dataset (PKU) and the partially annotated data. CWS-DICT trains the CWS model with a BiLSTM-CRF architecture, incorporating lexicon information into the neural network through handcrafted feature templates. For a fair comparison, we use the same domain dictionaries produced by our Domain-specific Words Miner for both Partial CRF and CWS-DICT. WEB-CWS (Ye et al., 2019) is a semi-supervised word-based approach that uses word embeddings trained on segmented target domain text to improve cross-domain CWS. In addition, we implement several strong baselines for a comprehensive evaluation: GCNN (PKU) uses only the PKU dataset with the GCNN-CRF sequence tagging architecture (Wang and Xu, 2017). GCNN (Target) uses only the distantly annotated dataset built on the target domain. GCNN (Mix) uses the mixture of the PKU dataset and the distantly annotated target domain dataset. DA combines GCNN (PKU) with the domain-specific words, as described in the Distant Annotation section. AT denotes the setting where we adopt adversarial training without any distantly annotated target domain dataset, using only the raw target text.

Overall Results
The final results are reported in Table 3, from which we observe that: (1) Our DAAT model significantly outperforms previously proposed methods on all datasets, yielding state-of-the-art results. In particular, DAAT improves the F1-scores on the five datasets from 93.5 to 94.1, 90.2 to 93.1, 89.6 to 90.9, 82.8 to 85.0 and 85.9 to 89.6 respectively. The results demonstrate that the unified framework is empirically effective, both in alleviating the OOV problem and in fully utilizing the source domain information.
(2) As mentioned above, the AT model uses the same adversarial training network as DAAT, but without annotations on the target domain dataset. Results in the AT setting explicitly reflect the necessity of constructing the annotated target domain dataset. Without the constructed dataset, the AT method only yields F1-scores of 90.7, 86.8, 85.0, 81.0 and 85.1 on the five datasets respectively. When the distantly annotated target domain dataset is added, the full DAAT model achieves the best performance.
(3) WEB-CWS was the previous state-of-the-art approach, utilizing word embeddings trained on segmented target text. It is worth noticing that even our model that only combines the base segmenter trained on PKU with domain-specific words (DA) outperforms WEB-CWS, which indicates that the distant annotation method can exploit deeper semantic features from the raw text. For the CWS-DICT method, which requires an external dictionary, we use the word collection built by our Domain-specific Words Miner to guarantee fairness. Our framework yields significantly better results than CWS-DICT. Moreover, CWS-DICT needs existing dictionaries as external information, which makes it difficult to transfer to brand-new domains without specific dictionaries. In contrast, our framework uses the Domain-specific Words Miner to construct the word collection, giving it high flexibility across domains.

Effect of Distant Annotation
In this section, we explore the ability of the DA method, which distantly constructs an annotated dataset from raw target domain text, to tackle the OOV problem. As illustrated in Table 4, the cross-domain CWS task suffers from a surprisingly serious OOV problem: all OOV rates with respect to the source domain are above 10%, which inevitably degrades model performance. After constructing an annotated dataset on the target domain, however, the OOV rate with respect to the target training data drops significantly. Specifically, the DA method yields absolute OOV rate drops of 9.92%, 13.1%, 14.09%, 20.51% and 14.94% on the five out-of-domain datasets. This result reveals that the Domain-specific Words Miner can accurately discover domain-specific words from raw text in any domain. Therefore, the DA component of our framework effectively tackles the OOV problem. Moreover, the module does not need any domain-specific dictionaries, so it can be transferred to new domains without restriction.

Impact of the Threshold p_val
Obviously, the setting of the threshold on p_val directly affects the scale and quality of the domain-specific word collection. To analyze how this threshold affects model performance, we conduct experiments with thresholds in {0.7, 0.8, 0.9, 0.95, 0.99}; the resulting word collection sizes and model performance on the DL and DM datasets are shown in Figure 4. Consistent with intuition, the collection size decreases as the threshold increases, because the filtering criterion becomes stricter; this is also a process of noise reduction. However, the F1-score curves are neither monotonically increasing nor decreasing. When the threshold is ≤ 0.95, the F1-scores on both datasets increase, because the words eliminated at this stage are mostly wrong. When the threshold exceeds 0.95, the F1-scores plateau or decrease, because some correct words are then eliminated. We set the threshold to 0.95 to balance the quality and quantity of the word collection, and thus the model performance. With this setting, the collection sizes are 0.7k words for DL, 1.7k for FR, 3.3k for ZX, 1.5k for DM and 2.2k for PT.

Effect of Adversarial Learning
We develop an adversarial training procedure to reduce the noise in the annotated dataset produced by DA. In Table 3, we find that the GCNN (Target) method, trained on the annotated target dataset constructed by DA, achieves impressive performance on all five datasets, outperforming the WEB-CWS method. Moreover, with the adversarial training module, the model yields further remarkable F1-score improvements. The results demonstrate that the adversarial network captures deeper semantic features than the plain GCNN-CRF model by making better use of the information from both the source and target domains.

Analysis of Feature Distribution
As introduced in the Adversarial Training section, during adversarial learning the domain-specific encoders learn domain-specific features H_s and H_t, and the sharing encoder learns domain-agnostic features H*_s and H*_t. We use the t-SNE algorithm (Maaten and Hinton, 2008) to project these feature representations onto the plane for visualization and further analysis. As illustrated in Figure 5, H_s (green) and H_t (black) have little overlap, indicating the distribution gap between the domains. In contrast, the domain-agnostic feature distributions H*_s (red) and H*_t (blue) are very similar, implying that the learned representation is well shared by both domains.

Impact of the Amount of Source and Target Data
In this subsection, we analyze the impact of the amount of data used for both the source and target domains; the experiment is conducted on the PKU (source) and DL (target) datasets. In Figure 6, we select 20%, 40%, 60%, 80% and 100% of the source domain data and 1%, 5%, 20%, 50% and 100% of the target domain data for training. The results demonstrate that increasing either the source or the target data leads to an increased F1-score. Generally, the amount of target data has a greater impact on overall performance, which conforms to intuition. The "1% Target Training Data" line indicates that model performance is strictly limited when target data is severely scarce. When the amount of target data increases to 5%, performance improves significantly, which shows our method's ability to exploit domain-specific information.

Conclusion
In this paper, we propose a unified framework coupling distant annotation and adversarial training for the cross-domain CWS task. In our method, an automatic distant annotator builds a labeled target domain dataset, effectively addressing the OOV issue. Further, an adversarial training procedure is designed to capture information from both the source and target domains. Empirical results show that our framework significantly outperforms previous methods, achieving state-of-the-art results on all five datasets across different domains.