Attention Is All You Need for Chinese Word Segmentation

This paper presents a fast and accurate Chinese word segmentation (CWS) model that uses only unigram features and a greedy decoding algorithm. Our model relies solely on the attention mechanism as its network building block. In detail, we adopt a Transformer-based encoder empowered by the self-attention mechanism as the backbone to produce the input representation. We then extend the Transformer encoder with our proposed Gaussian-masked directional multi-head attention, a variant of scaled dot-product attention. Finally, a bi-affinal attention scorer makes segmentation decisions in linear time. Our model is evaluated on the SIGHAN Bakeoff benchmark datasets. The experimental results show that, with the highest segmentation speed, the proposed attention-only model achieves new state-of-the-art or comparable performance against strong baselines under the closed test setting.


Introduction
Chinese word segmentation (CWS) is the task of delimiting word boundaries for Chinese natural language processing. CWS is a basic and essential task for Chinese, which is written without explicit word delimiters, unlike alphabetical languages such as English. (Xue, 2003) treats CWS as a sequence labeling task with character position tags, an approach followed by (Lafferty et al., 2001; Peng et al., 2004; Zhao et al., 2006). Traditional CWS models depend heavily on hand-crafted features, which strongly affects model performance. To minimize the effort in feature engineering, some CWS models (Zheng et al., 2013; Pei et al., 2014; Chen et al., 2015a,b; Xu and Sun, 2016; Cai and Zhao, 2016; Liu et al., 2016; Cai et al., 2017) have been developed following the neural network architecture for sequence labeling tasks (Collobert et al., 2011). Neural CWS models show a strong ability for feature representation, employing unigram and bigram character embeddings as input and achieving good performance.
The CWS task is often formulated as a graph model built on a scoring model, which is composed of two parts: an encoder that generates the representation of characters from the input sequence, and a decoder that performs segmentation according to the encoder's scores. Table 1 summarizes typical CWS models according to their decoding ways, for both traditional and neural models. Markov models such as (Ng and Low, 2004) and (Zheng et al., 2013) depend on the maximum entropy model or maximum entropy Markov model, both with a Viterbi decoder. Besides, conditional random fields (CRF) or semi-CRF for sequence labeling have been used in both traditional and neural models, though with different representations (Peng et al., 2004; Andrew, 2006; Liu et al., 2016; Wang and Xu, 2017; Ma et al., 2018). Generally speaking, the major difference between traditional and neural network models lies in the way input sentences are represented.
Recent work on neural CWS that focuses on the benchmark datasets of SIGHAN Bakeoff (Emerson, 2005) may be roughly put into the following three categories.
Encoder. Practice in various natural language processing tasks has shown that effective representation is essential to performance improvement. Thus, for better CWS, it is crucial to encode the input character, word or sentence into an effective representation.
Graph model. As CWS is a kind of structured learning task, the graph model determines which type of decoder should be adopted for segmentation; it may also limit the capability of defining features. As shown in Table 2, not all graph models can support word features. Thus recent work has focused on finding a more general or flexible graph model that lets the model learn the representation of segmentation more effectively (Cai and Zhao, 2016; Cai et al., 2017).
External data and pre-trained embedding. Both encoder and graph model improvements seek better performance by strengthening the model itself. Using external resources such as pre-trained embeddings or language representations is an alternative way to the same end (Yang et al., 2017; Zhao et al., 2018). SIGHAN Bakeoff defines two evaluation settings: the closed test requires that all data used for learning come from the given training set, while the open test has no such limitation (Emerson, 2005). In this work, we focus on the closed test setting and seek a better model design for further CWS performance improvement.
As shown in Table 1, different decoders have particular decoding algorithms matching the respective CWS models. Markov models and CRF-based models often use Viterbi decoders with polynomial time complexity. In a general graph model, the search space may be too large to search exhaustively, which forces graph models to use an approximate beam search strategy. The beam search algorithm has low-order polynomial time complexity. In particular, when the beam width b = 1, beam search reduces to the greedy algorithm with a better time complexity O(Mn), against the general beam search time complexity O(Mnb^2), where n is the number of units in one sentence and M is a constant representing the model complexity. The greedy decoding algorithm brings the fastest decoding speed, but it is hard to guarantee decoding precision when the encoder is not strong enough.
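To illustrate why greedy decoding is O(Mn), consider a minimal sketch of greedy gap decoding: a single left-to-right pass that commits a boundary wherever the scorer's probability crosses a threshold. The function name and the threshold are illustrative assumptions, not the paper's exact interface.

```python
def greedy_segment(sentence, boundary_probs, threshold=0.5):
    """Split `sentence` at every gap whose boundary probability exceeds
    `threshold` -- one pass over the gaps, so O(n) decoding steps.
    `boundary_probs[i]` is a hypothetical scorer output for the gap
    that sits after character i."""
    words, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])  # flush the final word
    return words

print(greedy_segment("ABCD", [0.1, 0.9, 0.2]))  # ['AB', 'CD']
```

Each gap is decided once and never revisited, which is exactly what makes greedy decoding fast but dependent on a strong encoder for precision.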
In this paper, we focus on a more effective encoder design that offers fast and accurate Chinese word segmentation with only unigram features and greedy decoding. Our proposed encoder consists of attention mechanisms as building blocks and nothing else. Motivated by the Transformer (Vaswani et al., 2017) and its strength in capturing long-range dependencies of input sentences, we use a self-attention network to generate the representation of the input, which lets the model encode a sentence at once without feeding the input iteratively. Considering the weakness of the Transformer in modeling relative and absolute position information directly (Shaw et al., 2018), and the importance of localness, position and directional information for CWS, we further improve the standard multi-head self-attention of the Transformer with a directional Gaussian mask, obtaining a variant called Gaussian-masked directional multi-head attention. Based on this improved attention mechanism, we extend the Transformer encoder to capture different directional information. With our powerful encoder, the model uses only simple unigram features to generate the representation of sentences.
For the decoder, which directly performs the segmentation, we use the bi-affinal attention scorer, which has been used in dependency parsing (Dozat and Manning, 2017) and semantic role labeling (Cai et al., 2018), to implement greedy decoding over the boundaries of words. In our proposed model, greedy decoding ensures fast segmentation, while the powerful encoder design ensures good enough segmentation performance even when working together with the greedy decoder. Our model is strictly evaluated on benchmark datasets from the SIGHAN Bakeoff shared task on CWS under the closed test setting, and the experimental results show that it achieves new state-of-the-art performance.
The technical contributions of this paper can be summarized as follows.
• We propose a CWS model built only on attention structure; both the encoder and the decoder are based on attention.
• With a powerful enough encoder, we show for the first time that unigram (character) features can yield strong performance, instead of the diverse n-gram (character and word) features used in most previous work.
• To capture localness and directional information, we propose a variant of directional multi-head self-attention to further enhance the state-of-the-art Transformer encoder.

Models
The CWS task is often modeled as a graph model based on an encoder-based scoring model.
as the representation of sentences. With v^b and v^f, the bi-affinal scorer calculates the probability of each segmentation gap and predicts the word boundaries of the input. Similar to the Transformer, the encoder is an attention network with stacked self-attention and point-wise, fully connected layers, while our encoder includes three independent directional encoders.

Encoder Stacks
In the Transformer, the encoder is composed of a stack of N identical layers, and each layer has one multi-head self-attention sub-layer and one position-wise fully connected feed-forward sub-layer. A residual connection is applied around each of the two sub-layers, followed by layer normalization (Vaswani et al., 2017). This architecture gives the Transformer a good ability to generate sentence representations.
With the variant of multi-head self-attention, we design a Gaussian-masked directional encoder to capture representations of different directions, improving the ability to capture localness and position information, given the importance of adjacent characters. One unidirectional encoder captures the information of one particular direction.
For CWS, a gap between characters, which may form a word boundary, divides a sequence into two parts: one in front of the gap and one behind it. The forward and backward encoders capture the information of the two directions corresponding to the two parts divided by the gap.
A central encoder runs in parallel with the forward and backward encoders to capture the information of the entire sentence. The central encoder is a special directional encoder for both forward and backward information of sentences; it fuses the information and enables the encoder to capture global information.
The encoder outputs forward information and backward information for each position. The representation of the sentence generated by the center encoder is added to this information directly:

v^f_i = r^f_i + r^c_i,  v^b_i = r^b_i + r^c_i,

where r^c = (r^c_1, ..., r^c_n) is the output of the center encoder, r^f = (r^f_1, ..., r^f_n) is the output of the forward encoder, and r^b = (r^b_1, ..., r^b_n) is the output of the backward encoder.
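The fusion step above is a plain element-wise addition, as this small NumPy sketch shows; the shapes and random values are illustrative stand-ins for real encoder outputs.

```python
import numpy as np

# Toy stand-ins for the three directional encoder outputs,
# each of shape (sentence_length, d_model).
n, d = 4, 8
rng = np.random.default_rng(0)
r_f = rng.normal(size=(n, d))  # forward encoder output
r_b = rng.normal(size=(n, d))  # backward encoder output
r_c = rng.normal(size=(n, d))  # center encoder output

v_f = r_f + r_c  # forward information per position
v_b = r_b + r_c  # backward information per position
```

The center output is broadcast into both directional streams, so every position carries global information alongside its directional view.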

Gaussian-Masked Directional Multi-Head Attention

Similar to scaled dot-product attention (Vaswani et al., 2017), Gaussian-masked directional attention can be described as a function that maps queries and key-value pairs to the representation of the input.
Here queries, keys and values are all vectors. Standard scaled dot-product attention is calculated by dotting the query Q with all keys K, dividing the scores by √d_k, where d_k is the dimension of the keys, and applying a softmax function to generate the attention weights:

Attention(Q, K, V) = softmax(QK^T / √d_k) V.  (2)

Different from scaled dot-product attention, Gaussian-masked directional attention is expected to pay attention to the adjacent characters of each position, and casts the localness relationship between characters as a fixed Gaussian weight for attention. We assume that the Gaussian weight relies only on the distance between characters.
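Equation (2) can be sketched in a few lines of NumPy; the dimensions below are arbitrary illustrations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention:
    softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n), rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 16))
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 16)
```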
Firstly, we introduce the Gaussian weight matrix G, which represents the localness relationship between each pair of characters:

G = (g_ij)_{n×n},  (3)
g_ij = 2Φ(−|dis_ij| / σ),  (4)

where g_ij is the Gaussian weight between characters i and j, dis_ij is the distance between characters i and j, Φ(x) is the cumulative distribution function of the standard Gaussian, and σ is the standard deviation of the Gaussian function, a hyperparameter in our method. Equation (4) ensures that the Gaussian weight equals 1 when dis_ij is 0. The larger the distance between characters is, the smaller the weight is, which lets one character affect its adjacent characters more than other characters.
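The weight matrix can be computed with the standard-normal CDF from `math.erf`. This is a sketch under the reading g_ij = 2Φ(−|dis_ij|/σ), which satisfies the stated properties (weight 1 at distance 0, decaying with distance); the paper's exact formula may differ in detail.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_weight_matrix(n, sigma=2.0):
    """G[i][j] = 2 * Phi(-|i - j| / sigma): 1 on the diagonal,
    monotonically shrinking as the character distance grows."""
    return [[2.0 * normal_cdf(-abs(i - j) / sigma) for j in range(n)]
            for i in range(n)]

G = gaussian_weight_matrix(5)
# G[i][i] == 1.0, and G[i][j] shrinks as |i - j| grows.
```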
To combine the Gaussian weight with self-attention, we take the Hadamard product of the Gaussian weight matrix G and the score matrix produced by the queries and keys:

A^G(Q, K, V) = (G ∘ softmax(QK^T / √d_k)) V,  (5)

where A^G is the Gaussian-masked attention. It ensures that the relationship between two characters at a long distance is weaker than that between adjacent characters.
Scaled dot-product attention models the relationship between two characters without regard to their distance in a sequence. For the CWS task, the weight between adjacent characters should matter more, while it is hard for self-attention to achieve this effect explicitly, because self-attention cannot directly access the order of the sentence. Gaussian-masked attention adjusts the weight between a character and its adjacent characters to a larger value, which reflects the effect of adjacent characters.

For the forward and backward encoders, the self-attention sub-layer needs a triangular matrix mask to make the self-attention focus on different weights, where pos_i is the position of character c_i. The triangular matrices for the forward and backward encoders are:

M^f_ij = 1 if pos_i ≥ pos_j, else 0,  (6)
M^b_ij = 1 if pos_i ≤ pos_j, else 0.  (7)

Similar to (Vaswani et al., 2017), we use multi-head attention to capture information from different representation subspaces, as in Figure 3(a), and obtain Gaussian-masked directional multi-head attention. With the multi-head attention architecture, the representation of the input is captured by

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  head_i = A^G(QW_i^Q, KW_i^K, VW_i^V),

where W_i^Q, W_i^K, W_i^V and W^O are the parameter matrices used to generate the heads, d_k is the dimension of the model and d_h is the dimension of one head.
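Putting the pieces together, one head of the forward variant might look like the sketch below: a lower-triangular mask restricts each position to itself and earlier positions, then the Gaussian weight matrix is applied as a Hadamard product to the softmax weights. The mask implementation (additive −inf before softmax) and the CDF-based Gaussian weight are assumptions for illustration.

```python
import numpy as np
from math import erf, sqrt

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_masked_forward_attention(Q, K, V, sigma=2.0):
    """One head of (assumed) Gaussian-masked forward attention."""
    n, d_k = Q.shape
    idx = np.arange(n)
    dis = np.abs(idx[:, None] - idx[None, :])
    # Gaussian weight: 1 on the diagonal, decaying with distance.
    G = 2.0 * np.vectorize(lambda x: 0.5 * (1 + erf(x / sqrt(2))))(-dis / sigma)
    scores = Q @ K.T / np.sqrt(d_k)
    # Forward (lower-triangular) mask: position i attends to j <= i only.
    scores = np.where(idx[None, :] <= idx[:, None], scores, -np.inf)
    A = softmax(scores) * G  # Hadamard product with the Gaussian weights
    return A @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
out = gaussian_masked_forward_attention(Q, Q, Q)  # shape (6, 8)
```

The backward head would use the mirrored (upper-triangular) mask, and the multi-head version would run several such heads on projected inputs and concatenate them.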

Bi-affinal Attention Scorer
Regarding word boundaries as gaps between adjacent characters converts the character labeling task into a gap labeling task. Different from character labeling, gap labeling requires information about two adjacent characters, and the relationship between adjacent characters can be represented as the type of the gap. This characteristic of word boundaries makes bi-affine attention an appropriate scorer for the CWS task.
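The conversion from a segmented sentence to gap labels can be sketched as follows: a sentence of n characters has n − 1 internal gaps, and a gap is labeled 1 exactly where a word ends. The function name is illustrative.

```python
def words_to_gap_labels(words):
    """Map a word list to binary labels over the n - 1 internal gaps:
    1 where a word boundary falls, 0 elsewhere."""
    n = sum(len(w) for w in words)
    ends, pos = set(), 0
    for w in words:
        pos += len(w)
        ends.add(pos)  # character count at each word's right edge
    return [1 if (i + 1) in ends else 0 for i in range(n - 1)]

print(words_to_gap_labels(["AB", "C", "DE"]))  # [0, 1, 1, 0]
```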
The bi-affinal attention scorer is the component we use to label the gaps. Bi-affinal attention is developed from bilinear attention, which has been used in dependency parsing (Dozat and Manning, 2017) and SRL (Cai et al., 2018). The label distribution in a labeling task is often uneven, so the output layer often includes a fixed bias term for the prior probability of different labels (Cai et al., 2018). Bi-affine attention uses bias terms to alleviate the burden of the fixed bias term and capture the prior probability, which distinguishes it from bilinear attention. The distribution of gap labels is uneven in a way similar to other labeling tasks, which fits the bi-affine scorer.
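A bi-affine score of this kind combines a bilinear term, a linear term over the concatenated inputs, and a bias. A minimal NumPy sketch, with illustrative dimensions and random parameters rather than the paper's trained ones:

```python
import numpy as np

def biaffine_score(v_f, v_b, W, U, b):
    """Score vector over N labels for one gap:
    s = v_f^T W v_b + U [v_f ; v_b] + b, with
    W of shape (d_i, N, d_j) and U of shape (N, d_i + d_j)."""
    bilinear = np.einsum('i,inj,j->n', v_f, W, v_b)  # (N,)
    linear = U @ np.concatenate([v_f, v_b])          # (N,)
    return bilinear + linear + b

d_i = d_j = 8
N = 2  # e.g. boundary vs. non-boundary
rng = np.random.default_rng(0)
W = rng.normal(size=(d_i, N, d_j))
U = rng.normal(size=(N, d_i + d_j))
b = np.zeros(N)
s = biaffine_score(rng.normal(size=d_i), rng.normal(size=d_j), W, U, b)
# s holds one score per gap label.
```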
The bi-affinal attention scorer labels the target depending on the information of each independent unit and the joint information of two units. In bi-affinal attention, the score s_ij of characters c_i and c_j (i < j) is calculated by

s_ij = (v^f_i)^T W v^b_j + U (v^f_i ⊕ v^b_j) + b,  (8)

where v^f_i is the forward information of c_i and v^b_j is the backward information of c_j. In Equation (8), W, U and b are all parameters updated in training. W is a tensor of shape (d_i × N × d_j) and U is an (N × (d_i + d_j)) matrix, where d_i is the dimension of vector v^f_i and N is the number of labels. In our model, the bi-affine scorer uses the forward information of the character in front of the gap and the backward information of the character behind the gap to distinguish the positions of characters. Figure 4 is an example of labeling a gap. Using the bi-affine scorer ensures that the boundaries of words are determined by adjacent characters carrying different directional information. The score vector of a gap is formed by the probability of it being a word boundary. Further, the model generates all boundaries with an activation function in a greedy decoding way.

Experiments

Datasets Our model is evaluated on the SIGHAN Bakeoff 2005 benchmark (Emerson, 2005), which has four datasets: PKU, MSR, AS and CITYU. Table 3 shows the statistics of the training data. We use the F-score to evaluate CWS models.
To train the model with pre-trained embeddings on AS and CITYU, we use OpenCC to transfer the data from traditional Chinese to simplified Chinese.
Pre-trained Embedding We use only unigram features, so we trained only character embeddings. Our embeddings are pre-trained on the Chinese Wikipedia corpus with word2vec (Mikolov et al., 2013).

Hyperparameters For the different datasets, we use the two sets of hyperparameters presented in Table 4: one for the small corpora (PKU and CITYU) and one for the normal corpora (MSR and AS). We set the standard deviation of the Gaussian function in Equation (4) to 2. Each training batch contains sentences with at most 4096 tokens.
Optimizer To train our model, we use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98 and ε = 10^−9. The learning rate schedule is the same as in (Vaswani et al., 2017):

lrate = d^{−0.5} · min(step^{−0.5}, step · warmup_steps^{−1.5}),

where d is the dimension of the embeddings, step is the training step number and warmup_steps is the number of warmup steps. When the step number is smaller than warmup_steps, the learning rate increases linearly, and afterwards it decreases.
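The schedule above can be written as a small function; the default `d_model` and `warmup_steps` values here are illustrative, not the paper's settings.

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    """Transformer learning-rate schedule:
    d^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5}).
    Rises linearly for `warmup_steps` steps, then decays ~ step^{-0.5}."""
    step = max(step, 1)  # avoid step ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The peak learning rate occurs exactly at `step == warmup_steps`, where the two arguments of `min` coincide.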

Hardware and Implementation
We trained our models on a single machine with an Intel i7-5960X CPU and an NVIDIA 1080 Ti GPU. We implement our model in Python with PyTorch 1.0.

Results
Tables 5 and 6 report the performance of recent models and ours under the closed test setting. Without the assistance of the unsupervised segmentation features used in (Wang et al., 2019), our model outperforms all other models on MSR and AS except (Ma et al., 2018), and gets comparable performance on PKU and CITYU. Note that all the other models in this comparison adopt various n-gram features, while only our model takes unigram ones.
With the unsupervised segmentation features introduced by (Wang et al., 2019), our model gets higher results. Specifically, the results on MSR and AS achieve new state-of-the-art, while those on CITYU and PKU approach the previous state-of-the-art. The unsupervised segmentation features are derived from the given training dataset, so using them does not violate the closed test rule of SIGHAN Bakeoff.
Table 7 compares our model and recent neural models under the open test setting, in which any external resources, especially pre-trained embeddings or language models, can be used. On MSR and AS, our model gets comparable results, while our results on CITYU and PKU are not remarkable.
However, it is well known that models are hard to compare under the open test setting, especially with pre-trained embeddings, since not all models use the same method and data for pre-training. Though a pre-trained embedding or language model can improve performance, the improvement may come from multiple sources; the success of pre-trained embeddings in improving performance does not prove that the model itself is better.
Compared with other LSTM models, our model performs better on AS and MSR than on CITYU and PKU. Considering the scale of the different corpora, we believe that corpus size affects our model: the larger the corpus is, the better the model performs. On a small corpus, the model tends to overfit.
Tables 5 and 6 also show the decoding time on the different datasets. Our model finishes segmentation with the least decoding time on all four datasets, thanks to a model architecture that takes only the attention mechanism as its basic block.

Chinese Word Segmentation
CWS is the task of delimiting word boundaries for Chinese natural language processing. (Xue, 2003) for the first time formalizes CWS as a sequence labeling task. (Zhao et al., 2006) show that different character tag sets can make an essential impact on CWS. (Peng et al., 2004) use CRFs as a model for CWS, achieving new state-of-the-art at the time. Work on statistical CWS has built the basis for neural CWS.
Neural word segmentation has been widely adopted to minimize the effort in feature engineering, which was important in statistical CWS. (Zheng et al., 2013) introduce a neural model with sliding-window based sequence labeling. (Chen et al., 2015a) propose a gated recursive neural network (GRNN) for CWS to incorporate complicated combinations of contextual character and n-gram features. (Chen et al., 2015b) use LSTM to learn long-distance information. (Cai and Zhao, 2016) propose a neural framework that eliminates context windows and utilizes complete segmentation history. (Lyu et al., 2016) explore a joint model that performs segmentation, POS tagging and chunking simultaneously. (Chen et al., 2017a) propose a feature-enriched neural model for joint CWS and part-of-speech tagging. (Zhang et al., 2017) present a joint model that enhances the segmentation of Chinese microtext by performing CWS and informal word detection simultaneously. (Wang and Xu, 2017) propose a character-based convolutional neural model to capture n-gram features automatically and an effective approach to incorporate word embeddings. (Cai et al., 2017) improve the model of (Cai and Zhao, 2016) and propose a greedy neural word segmenter with balanced word and character embedding inputs. (Zhao et al., 2018) propose a novel neural network model that incorporates unlabeled and partially-labeled data. (Zhang et al., 2018) propose two methods that extend the Bi-LSTM to incorporate dictionaries into neural networks for CWS. (Gong et al., 2019) propose Switch-LSTMs to segment words and provide a more flexible solution for multi-criteria CWS, making it easy to transfer the learned knowledge to new criteria.

Transformer
The Transformer (Vaswani et al., 2017) is an attention-based neural machine translation model. The Transformer is a kind of self-attention network (SAN), as proposed in (Lin et al., 2017). Each encoder layer of the Transformer consists of one self-attention layer and a position-wise feed-forward layer. Each decoder layer of the Transformer contains one self-attention layer, one encoder-decoder attention layer and one position-wise feed-forward layer. The Transformer uses residual connections around the sub-layers, each followed by a layer normalization layer.
Scaled dot-product attention is the key component of the Transformer. The input of the attention contains the queries, keys and values of the input sequence. The attention weights are generated from queries and keys as in Equation (2). The structure of scaled dot-product attention allows the self-attention layer to generate the representation of a sentence at once, containing the information of the whole sentence, unlike an RNN, which processes the characters of a sentence one by one. Standard self-attention is similar to Gaussian-masked directional attention, but it has neither the directional mask nor the Gaussian mask. (Vaswani et al., 2017) also propose multi-head attention, which better generates sentence representations by dividing queries, keys and values into different heads and gathering information from different subspaces.

Conclusion
In this paper, we propose an attention-only Chinese word segmentation model. Our model uses self-attention from the Transformer encoder to take sequence input and a bi-affine attention scorer to predict the labels of gaps. To improve the ability of the self-attention based encoder to capture localness and directional information, we propose a self-attention variant called Gaussian-masked directional multi-head attention to replace the standard self-attention. We also extend the Transformer encoder to capture directional features. Our model uses only unigram features instead of the multiple n-gram features of previous work. Our model is evaluated on the standard benchmark datasets of SIGHAN Bakeoff 2005, which shows that it not only performs segmentation faster than any previous model but also gives new higher or comparable segmentation performance against previous state-of-the-art models.

Figure 1 :
Figure 1: The architecture of our model.
Figure 2: The structure of Gaussian-Masked directional encoder.

Figure 3 :
Figure 3: The illustration of Gaussian-masked directional multi-head attention and Gaussian-masked directional attention.

Figure 4 :
Figure 4: An example of the bi-affinal scorer labeling a gap. The bi-affinal attention scorer uses only the forward information of the front character and the backward information of the rear character to label the gap.
Table 2 summarizes regular feature sets for typical CWS models, including ours. The building blocks that encoders use include the recurrent neural network (RNN), the convolutional neural network (CNN), and the long short-term memory network (LSTM).

Table 5 :
Results on PKU and MSR compared with previous models in the closed test. The asterisks indicate the results of the model with unsupervised labels from (Wang et al., 2019). The corpus used for pre-trained embeddings is entirely transferred to simplified Chinese and not segmented. In the closed test, we use embeddings initialized randomly.

Table 6 :
Results on AS and CITYU compared with previous models in the closed test. The asterisks indicate the results of the model with unsupervised labels from (Wang et al., 2019).

Table 7 :
F1 scores of our results on four datasets in open test compared with previous models.