Does Chinese BERT Encode Word Structure?

Contextualized representations give significantly improved results for a wide range of NLP tasks. Much work has been dedicated to analyzing the features captured by representative models such as BERT. Existing work finds that syntactic, semantic and word sense knowledge are encoded in BERT. However, little work has investigated word features for character-based languages such as Chinese. We investigate Chinese BERT using both attention weight distribution statistics and probing tasks, finding that (1) word information is captured by BERT; (2) word-level features are mostly in the middle representation layers; (3) downstream tasks make different use of word features in BERT, with POS tagging and chunking relying the most on word features, and natural language inference relying the least on such features.


Introduction
Large-scale pre-trained models such as BERT (Devlin et al., 2019) have been widely used in NLP, giving improved results in a wide range of downstream tasks, including dependency parsing (Zhou and Zhao, 2019), summarization (Liu and Lapata, 2019) and reading comprehension. To better understand the reason behind their effectiveness, a natural question is what knowledge is learned during the self-supervised pre-training stage. Previous work shows that BERT effectively captures syntactic information (Goldberg, 2019), semantic information (Jawahar et al., 2019), commonsense knowledge (Cui et al., 2020), and factual knowledge (Petroni et al., 2019) without fine-tuning on task-specific datasets, which explains its effectiveness on related tasks. BERT has also gained success in Chinese NLP (Cui et al., 2019). However, relatively little work investigates the knowledge learned by Chinese BERT. Different from English, Chinese sentences are written as sequences of characters without explicit word boundaries, and Chinese BERT is character-based, building contextualized representations on each character. For traditional Chinese NLP, word segmentation is considered an essential pre-processing step (Cai and Zhao, 2016; Xu and Sun, 2016). In neural modeling, it has been argued that character-based models consistently outperform word-based models for Chinese, because the former alleviate over-fitting and out-of-vocabulary issues. In this paper, we investigate whether Chinese BERT encodes word structure features.
We aim to answer the following three research questions. First, how much word information is captured by character-based Chinese BERT? Second, out of the 12 representation layers of a BERT encoder, which layers encode the most word-level features? Third, what is the connection between word-level features embedded in BERT representations and the performance of downstream tasks such as named entity recognition (NER) and natural language inference (NLI)? Intuitively, some tasks rely on word features more than others. By analyzing the word knowledge inside BERT after fine-tuning on each task, we gain empirical evidence on how the task is solved by BERT.
We take two main approaches for analyzing BERT. First, BERT is based on the Transformer (Vaswani et al., 2017), a multi-layer multi-head self-attention network architecture, where the embedding of each character is calculated by neural attention (Bahdanau et al., 2015) over all the characters in a sequence. Inspired by Clark et al. (2019), we look into the attention weight distribution of each head in each layer, in order to understand whether some attention heads exhibit salient word-level patterns in their attention targets. Second, we take Chinese word segmentation as a probing task (Conneau et al., 2018), training a linear classifier based on the BERT representation at each Transformer layer, so that the word-level information contained in a layer can be quantified through segmentation performance. (This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.)
Results on two Chinese datasets with varying segmentation standards consistently show that word information is captured by BERT representation. There exist attention heads that focus on the start and end characters in each word, word unigrams and bigrams, as well as word boundary patterns. In addition, word information is captured mostly in the middle layers of Transformer, allowing light-weight probing layers to achieve segmentation F1 score around 90% on both segmentation datasets. Finally, we find that different Chinese tasks require different levels of word information, with fine-tuning on tasks such as POS tagging and chunking significantly improving the probing task, while fine-tuning on tasks such as NLI significantly decreasing probing accuracies.
To our knowledge, we are the first to investigate word structure knowledge in Chinese BERT. Our code has been released at https://github.com/ylwangy/BERT_zh_Analysis.

BERT
BERT (Devlin et al., 2019) consists of multi-layer Transformer (Vaswani et al., 2017) blocks. Formally, given a sentence s = c_1, c_2, ..., c_n, each c_i is first transformed into an input vector e_i by summing up its token embedding E_{c_i}, segment embedding SE_{c_i} (which distinguishes the sentence location) and position embedding PE_{c_i} (which indicates the character position):

e_i = E_{c_i} + SE_{c_i} + PE_{c_i}    (1)

The vectors e_1, ..., e_n, stacked as a matrix in R^{n×d}, are taken as input to the first layer of a Transformer encoder, which consists of L layers. For each layer, denote the input as E. E is then transformed into queries Q^m, keys K^m and values V^m via linear mappings, {Q^m, K^m, V^m} ∈ R^{n×d_k}:

Q^m = E W_Q^m,  K^m = E W_K^m,  V^m = E W_V^m    (2)

where {W_Q^m, W_K^m, W_V^m} ∈ R^{d×d_k} are trainable parameters and m ∈ [1, 2, ..., M] indexes the m-th attention head. M parallel attention functions are applied to produce M hidden states {H^1, ..., H^M}:

H^m = α^m V^m,  α^m = softmax(Q^m (K^m)^T / √d_k)    (3)

where α^m is the attention distribution of the m-th head and √d_k is a scaling factor. Finally, the multi-head hidden states are concatenated to obtain a hidden representation ĥ_i for each character c_i:

ĥ_i = [H^1_i ; ... ; H^M_i]    (4)

The ĥ_i are then fed to a multi-layer perceptron for computing the final outputs h_i of the layer. Feed-forward connections and layer normalization are also applied; the details can be found in Vaswani et al. (2017). We denote the output of the l-th layer as h^l_i (l ∈ [1, 2, ..., L]). Given a corpus {s_t = c_1, c_2, ..., c_{n_t}}^T_{t=1}, the masked language model objective is to minimize the loss of predicting each randomly chosen masked character c^mask_j (j ∈ [1, 2, ..., n_t]) from its representation h^mask_j in the last layer L:

loss = − Σ_t Σ_j log P(c^mask_j | h^mask_j),  where P(c | h) = softmax(E h)    (5)

where E is the token embedding table in Eq. 1. During pre-training, special tokens [CLS] and [SEP] are added to indicate the beginning of a sentence and the separation of two sentences, respectively. We conduct our experiments on BERT-base-Chinese, which has 12 layers, 12 heads and a hidden size of 768.
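As a minimal illustration of Eqs. 2–4, the per-head computation can be sketched in NumPy. This is a simplified sketch only: real BERT additionally applies an output projection, the feed-forward sublayer, residual connections and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(E, W_Q, W_K, W_V):
    """One multi-head self-attention step over character inputs E.

    E: (n, d) input vectors, one row per character.
    W_Q, W_K, W_V: (M, d, d_k) per-head projection matrices.
    Returns the concatenated hidden states (n, M*d_k) and the
    per-head attention distributions alpha with shape (M, n, n).
    """
    M, d, d_k = W_Q.shape
    heads, alphas = [], []
    for m in range(M):
        Q, K, V = E @ W_Q[m], E @ W_K[m], E @ W_V[m]   # Eq. 2
        alpha = softmax(Q @ K.T / np.sqrt(d_k))        # Eq. 3; rows sum to 1
        heads.append(alpha @ V)                        # H^m = alpha^m V^m
        alphas.append(alpha)
    # Eq. 4: concatenate the M heads for each character.
    return np.concatenate(heads, axis=-1), np.stack(alphas)
```

The returned `alpha` matrices are exactly what the attention distribution analysis in the next section inspects.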

Attention Distribution Analysis
We analyze the distribution of the attention weights α in Eq. 3 across different layers and heads. Specifically, we compute the attention weights from characters to several specific characters, as well as the attention distribution from characters to surrounding words. The structures are illustrated in Figure 1. For each character, we first investigate its attention weights to several specific characters, including the character itself, the previous character, the next character, and the two special tokens [CLS] and [SEP], as shown in Figure 1(a). Formally, the attention from c_i to c_j in a certain head is defined using α in Eq. 3:

α_{c_i → c_j} = α_{ij}    (6)

where c_j can be the previous, current or next character (c_{i−1}, c_i, c_{i+1}), or the token [CLS] or [SEP], respectively. We further investigate attention to characters at word boundary locations. Formally, for each c_i that is the first character of some word w_k, we first consider the attention weights to the last character of the current word and the first character of the next word:

α_{c_i → last(w_k)},  α_{c_i → first(w_{k+1})}    (7)

Similarly, for each c_j that is the last character of some word w_k, we consider the attention weights to the first character of the current word and the last character of the previous word:

α_{c_j → first(w_k)},  α_{c_j → last(w_{k−1})}    (8)

Table 1: Character-to-character attention distribution. The number j in the parentheses denotes the j-th head. F, L, F_1 and L_{−1} denote the first character of the current word, the last character of the current word, the first character of the next word, and the last character of the previous word, respectively.

Figure 1(b) shows one example, where w_1 and w_2 are two adjacent words, c_1, c_2 are the first and last characters of the word w_1, and c_3, c_5 are the first and last characters of the word w_2, respectively. For this example, we consider the attentions α_{c_1→c_2}, α_{c_3→c_5}, α_{c_1→c_3} (Eq. 7), and α_{c_2→c_1}, α_{c_5→c_3}, α_{c_5→c_2} (Eq. 8).
In addition to character-to-character attention weights, we consider character-to-word attention by averaging the attention weights to the characters that belong to a specific word. Formally, we define the attention from a character c_j in word w_k to its previous, current and next words as follows:

α_prev word = (1/|w_{k−1}|) Σ_{c_i ∈ w_{k−1}} α_{c_j → c_i},  and similarly for α_curr word over w_k and α_next word over w_{k+1}    (9)

where |w| denotes the length of word w. Figure 1(c) shows one example, where α_curr word for c_1 is the average of α_{c_1→c_1} and α_{c_1→c_2}, because c_1 belongs to the word w_1 composed of characters c_1 and c_2. α_next word for c_1 is the average of α_{c_1→c_3}, α_{c_1→c_4} and α_{c_1→c_5}.
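The character-to-word averaging of Eq. 9 can be sketched as follows, assuming an attention matrix extracted from one head and a gold segmentation given as character spans (the function name and span representation are our own, for illustration):

```python
import numpy as np

def char_to_word_attention(alpha, words):
    """Average character-to-word attention weights (Eq. 9 style).

    alpha: (n, n) attention matrix of one head; alpha[i, j] is the
           weight from character i to character j.
    words: list of (start, end) character spans (end exclusive), the
           gold segmentation of the n characters.
    Returns, for every character, the averaged attention to its
    previous, current and next word (NaN when that word is absent).
    """
    n = alpha.shape[0]
    # Map each character position to the index of the word containing it.
    word_of = np.empty(n, dtype=int)
    for k, (s, e) in enumerate(words):
        word_of[s:e] = k
    out = np.full((n, 3), np.nan)  # columns: prev, curr, next word
    for i in range(n):
        k = word_of[i]
        for col, target in enumerate((k - 1, k, k + 1)):
            if 0 <= target < len(words):
                s, e = words[target]
                out[i, col] = alpha[i, s:e].mean()
    return out
```

Averaging the resulting columns over all characters and sentences gives the per-head statistics reported in the analysis.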

Datasets
There exist different word segmentation criteria for the same sentence, which mainly differ in segmentation granularity. We select two corpora with gold word segmentation labels for computing attention distributions: Chinese Treebank 9.0 (CTB9) (Xue et al., 2005) and PKU (Emerson, 2005), averaging the attention weights across all characters for each attention head.

Results
Character-to-Character Attention. Table 1 shows the largest values among all 12 attention heads in each layer of BERT. We use head i-j to denote the j-th attention head in the i-th layer. The average sentence lengths of the CTB9 and PKU datasets are 26.3 and 35.8, respectively. As a result, the random baselines for the two datasets are 3.8% (1/26.3) and 2.7% (1/35.8), respectively.
First, the Specific Characters columns of Table 1 show weights calculated by Eq. 6. There are specific heads with strong attention weights to both the current character and its neighboring characters. For both datasets, attention to the previous character can reach 92%, while attention to the next character can reach 60%. This shows that the previous character plays a very important role in certain BERT representation layers. Attention weights to [SEP] and [CLS] are also significantly stronger than the random baseline, with those to [SEP] being above 90%. This shows that BERT is highly sensitive to sentence boundaries. Finally, the results are quite consistent between CTB9 and PKU.
The First&Last Characters of Words columns of Table 1 show attention weights calculated by Eqs. 7 and 8. First, attention between the first and last characters of the same word can reach 50% to 60%, which shows that word information is captured by BERT representations. In addition, attention between consecutive words can reach 20% to 30%, which shows that word n-gram information is also learned. Word-internal attention is higher than cross-word attention, which can be because word n-grams are sparser and play a lesser role in BERT.
Finally, the strongest weights mostly occur in the middle layers (3∼8), which suggests that word information from BERT concentrates in those layers. Our probing task in Section 4 confirms this.
Our findings are in line with those of Clark et al. (2019), who observe that different heads in English BERT models can put strong attention weights on different dependency relations. In addition, Jawahar et al. (2019) show that syntactic features in BERT reside in relatively lower layers while semantic features reside in higher layers. Word information can be regarded as a lexical-level syntactic feature, and thus our finding is similar.

Character-to-Word Attention. Figure 2(a) shows the results of the best-performing heads calculated using Eq. 9. In addition to the current, previous and next words, we also show the results for other words within a window size of 5. We find that attention to the previous word is stronger than attention to the current word and the next word. Attention weights decrease as the target word gets farther from the current character, which shows that neighboring context is more important in BERT representations of words. This is consistent with the character-to-character trends (Figure 2(b)). Values for CTB9 are relatively larger than those for PKU, mainly because the sentences are shorter, but the main observations are consistent.

Case Study
We visualize the best-performing heads in Table 1. These heads consistently attend to one of the specific characters. For example, head 1-5 attends to the next character, with the last character "电(news)" attending to the [SEP] token. Head 7-4 attends to the previous character, with the first character "新(Novel)" attending to the [SEP] token as well.
Figures 3(f) to 3(i) show more examples in which heads attend to word boundary characters. For head 3-11, most of the last characters pay attention to the first character of the same word. For head 8-3, the last characters attend to the last characters of the previous words. There are exceptions to the general trends above. Take head 3-10 for example: in the sentence "宁波港(Ningbo Port)多渠道(multi-channel)利用(use)外资(overseas investment)", the character "利(benefit)" puts the most attention on the character "外(outside)", which is outside the corresponding word "利用(use)".
Figure 3(j) shows the attention distribution in the sentence "钱其琛(Qichen Qian)谈(talk)香港(Hong Kong)前景(prospect)和(and)台湾(Taiwan)问题(issue)" for head 6-5. Each character puts the most attention on characters in the previous word. For example, the character "谈(talk)" focuses mainly on characters in the word "钱其琛(Qichen Qian)", and the character "题(question)" puts the most attention on characters in the word "台湾(Taiwan)". This indicates that head 6-5 takes most of its information from the previous word when generating contextualized character representations.

Probing Task
We probe the contextualized representation of each character for Chinese Word Segmentation (CWS) (Xue, 2003). In particular, CWS can be treated as a character-level sequence labeling task, where the label set includes B, M, E and S (which stand for the beginning, middle and end of a word, and a single-character word, respectively). We directly use the fixed hidden representations in each layer as input, on which a trainable linear classifier is built, as Figure 4 shows. We use a local classifier rather than conditional random fields (Lafferty et al., 2001), in order to focus on the information extracted from the hidden representations directly. The intuition is that if a simple linear classifier can predict the segmentation labels, we can reasonably conclude that the model has word structure features. Formally, given the representations of layer l, the label probability distribution of the character c_j is calculated as follows:

P(y | h^l_j) = softmax(W h^l_j)

where y ∈ {B, M, E, S}, W is the linear transformation matrix and h^l_j is the j-th hidden representation in the l-th layer.
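The probing classifier can be sketched as follows. This is a simplified sketch: `W` stands in for the trained probe parameters, and in the actual setup `W` is learned with cross-entropy while the BERT encoder stays frozen.

```python
import numpy as np

LABELS = ["B", "M", "E", "S"]

def probe_probs(H, W):
    """Linear CWS probe over fixed layer-l representations.

    H: (n, d) hidden vectors h^l_j for the n characters (frozen).
    W: (4, d) probe weights, one row per BMES label.
    Returns (n, 4) label distributions P(y | h^l_j) = softmax(W h^l_j).
    """
    logits = H @ W.T
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def predict_labels(H, W):
    """Greedy per-character prediction (no CRF, as in the probing setup)."""
    return [LABELS[i] for i in probe_probs(H, W).argmax(axis=-1)]
```

Because the classifier is linear and local, any segmentation accuracy it reaches must come from information already present in the frozen hidden vectors, which is the point of the probe.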

CWS Settings and Results
We experiment with the CTB 9.0 dataset and the widely-used SIGHAN 2005 benchmarks, which include four subsets (i.e., PKU, MSR, CITYU and AS). We split the CTB 9.0 dataset into train/dev/test sets following Shao et al. (2017). Statistics of the datasets are shown in Table 2. The standard word segmentation F1 score is used for evaluation. Our model is implemented using NCRF++ (Yang et al., 2018). In particular, we use Adam (Kingma and Ba, 2015) as our optimization method, with a learning rate of 2e-5, a dropout rate of 0.1, and 3 training epochs. The hyper-parameters are selected using the development set. Table 3 shows the main results. In general, layers 3∼8 give relatively higher performance, with F1 scores close to or over 90% for all the datasets. The best-performing layers are layers 4 and 7. Existing state-of-the-art neural segmentors trained on these datasets give F1 scores of 96.5, 96.1, 97.4, 97.2 and 96.2 on CTB9, PKU, MSR, CITYU and AS, respectively (Shao et al., 2017). From the above results we can see that word information is strongly captured by BERT in the middle layers. This is consistent with the findings of Section 3.
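The word segmentation F1 metric can be sketched as follows, converting BMES tag sequences into word spans and scoring exact span matches. This is our own minimal implementation for illustration, not the evaluation code of NCRF++ or the SIGHAN scoring script.

```python
def bmes_to_spans(tags):
    """Convert a BMES tag sequence into a set of (start, end) word spans."""
    spans, start = set(), 0
    for i, t in enumerate(tags):
        if t in ("B", "S"):   # a new word starts here
            start = i
        if t in ("E", "S"):   # a word ends here (end index exclusive)
            spans.add((start, i + 1))
    return spans

def segmentation_f1(gold_tags, pred_tags):
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold, pred = bmes_to_spans(gold_tags), bmes_to_spans(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    p, r = correct / len(pred), correct / len(gold)
    return 2 * p * r / (p + r)
```

Scoring whole words rather than individual tags is what makes the metric strict: a single wrong boundary label invalidates both words adjacent to it.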

Tuning on Downstream Tasks
To further analyze the influence of downstream tasks on word segmentation, we fine-tune the model on different downstream tasks, including NER, chunking, POS tagging and five tasks from the Chinese Language Understanding Evaluation (CLUE) benchmark (Xu et al., 2020), and then run the probing task again. Intuitively, word information is encoded by a downstream task if the fine-tuned model performs better on the word segmentation probing task, and vice versa. The tasks and datasets we use are as follows:

Table 4: Performance of the word segmentation probing task after model fine-tuning.

• NER. We use OntoNotes 4.0 (Weischedel et al., 2011) as the named entity recognition dataset.
• Chunking. CTB 4.0 is used as the chunking dataset. We split the data into train/dev/test set following Lyu et al. (2016).
• POS Tagging. CTB 5.0 is used as the part-of-speech tagging dataset. The dataset split is the same as in Shao et al. (2017).
• CMNLI (Chinese Multi-Genre Natural Language Inference) is a CLUE task to predict the relationship (neutral, entailment or contradiction) between two sentences.
• TNEWS is a short text classification dataset in CLUE, where the task is to assign a category to each news title. The categories include finance, technology, sports, etc.
• AFQMC (Ant Financial Question Matching Corpus) comes from the Ant Technology Exploration Conference (ATEC) developer competition. The CLUE task is to predict whether two sentences are semantically similar.
• WSC is the Chinese Winograd Schema Challenge task in CLUE, a coreference task that asks whether a pronoun and a noun phrase in a sentence refer to the same entity.
• CSL (Chinese Scientific Literature) is a dataset containing Chinese paper abstracts and their keywords. The CLUE task is to recognize whether the given keywords are genuine keywords of the corresponding paper; negative keywords are generated using TF-IDF values.
The results are shown in Table 4. We observe five trends across the datasets. First, for POS tagging and chunking, CWS probing is strongly improved after fine-tuning. These two tasks are the most closely connected with words, and joint segmentation models have been investigated (Zhang and Clark, 2010; Zheng et al., 2013; Lyu et al., 2016). Second, for NER, the performance is slightly improved. For this task there is debate on whether word information is useful, since segmentation errors can outweigh word features (He and Wang, 2008; Liu et al., 2010). Third, for WSC and CSL the performance does not change. Fourth, for the text classification tasks TNEWS and AFQMC the performance slightly decreases. This may be because word features do not help much for these tasks. Last, the performance decreases sharply for CMNLI, which shows that this semantics-heavy task does not make use of word information and also worsens the performance of the CWS probing model.

Figure 5: Layerwise word segmentation performance for BERT and fine-tuned models.

We show the layer-wise word segmentation results on CTB9 for different models in Figure 5. A salient finding is that for the tasks that benefit from word information, the best-performing layers move up from the middle layers. In contrast, for the CMNLI task, the best-performing layers move down from the middle layers. This shows that useful information can be brought closer to the final prediction layer during fine-tuning, or that training signals influence the top layers more strongly. For the other tasks, the best-performing layers are still 3∼8.

Related Work
Knowledge from Pre-trained Models. Peters et al. (2018) find that ELMo encodes syntactic and semantic features at different layers. To analyze the syntactic features inside BERT, Goldberg (2019) uses BERT to predict a masked verb and compares the scores assigned to the correct and incorrect verb forms. Petroni et al. (2019) demonstrate that BERT is able to recall factual knowledge using pre-defined cloze templates. While all of the above work has been conducted on English, little work analyzes the linguistic or structural knowledge in Chinese pre-trained models. We thus fill a gap in the literature.

Attention Analysis. Clark et al. (2019) visualize the attention patterns of BERT, finding different behaviors across attention heads. Htut et al. (2019) evaluate syntactic knowledge by computing maximum spanning trees over BERT's attention to recover dependency trees. Kovaleva et al. (2019) analyze the attention distributions of BERT fine-tuned on different tasks, pointing out that redundancy exists across heads. Our work is similar in finding patterns by making use of attention and fine-tuning. However, beyond token-to-token attention patterns, we also investigate attention to specific characters in order to find the potential word structure of Chinese.

Probing Method. Conneau et al. (2018) introduce 10 probing tasks, uncovering linguistic properties that a sentence encoder captures. Subsequent work designs probing tasks on the contextualized representations from pre-trained models to investigate the linguistic knowledge they encode. Tenney et al. (2019) design edge probing tasks to investigate the sub-sentential structure of contextualized word embeddings. Hewitt and Manning (2019) propose structural probing methods and find that syntax trees are embedded implicitly in ELMo and BERT. Our method is similar in analyzing the contextualized representations directly.
However, we select Chinese word segmentation as our probing task, applied to each character-level contextualized representation.

Conclusion
We investigated the capability of Chinese BERT for capturing the word structure using two different methods. First, analyzing the attention distribution for different patterns, we find that some of the attention heads can capture the word structure implicitly. Second, using a word segmentation probing task for the contextualized representation inside the model, we find that a simple linear classifier performs well in the middle layers. By using our probing method, we find evidence that different Chinese tasks rely on different degrees of word information, with NLI relying the least on word features.