Enhancing Pre-trained Chinese Character Representation with Word-aligned Attention

Most Chinese pre-trained models take the character as the basic unit and learn representations according to a character's external contexts, ignoring the semantics expressed in the word, which is the smallest meaningful utterance in Chinese. Hence, we propose a novel word-aligned attention to exploit explicit word information, which is complementary to various character-based Chinese pre-trained language models. Specifically, we devise a pooling mechanism to align the character-level attention to the word level and propose to alleviate the potential issue of segmentation error propagation by multi-source information fusion. As a result, word and character information are explicitly integrated during the fine-tuning procedure. Experimental results on five Chinese NLP benchmark tasks demonstrate that our method achieves significant improvements over BERT, ERNIE and BERT-wwm.


Introduction
Pre-trained Language Models (PLMs) such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), ERNIE (Sun et al., 2019) and XLNet (Yang et al., 2019) have been proven to capture rich language information from text and then benefit many NLP applications through simple fine-tuning, including sentiment classification, natural language inference, named entity recognition and so on.
Generally, most PLMs use the attention mechanism (Vaswani et al., 2017) to represent natural language, such as word-level attention for English and character-level attention for Chinese. Unlike English, Chinese words are not separated by explicit delimiters, which means that the character is the smallest linguistic unit. However, in most cases, the semantics of a single Chinese character are ambiguous. For example, in Table 1, attending over the word 西山 is more intuitive than over the two individual characters 西 and 山. Moreover, previous work has shown that considering word segmentation information can lead to better language understanding and accordingly benefits various Chinese NLP tasks (Wang et al., 2017; Zhang and Yang, 2018; Gui et al., 2019).
All these factors motivate us to expand the character-level attention mechanism in Chinese PLMs to represent attention over words. To this end, there are two main challenges. (1) How to seamlessly integrate word segmentation information into the character-level attention module of a PLM is an important problem. (2) Gold-standard segmentation is rarely available in downstream tasks, and how to effectively reduce the cascading noise caused by automatic segmentation tools (Li et al., 2019) is another challenge.
In this paper, we propose a new architecture, named Multi-source Word-aligned Attention (MWA), to solve the above issues. (1) Psycholinguistic experiments (Bai et al., 2008; Meng et al., 2014) have shown that readers are likely to pay approximately equal attention to each character in a Chinese word. Drawing inspiration from this finding, we introduce a novel word-aligned attention, which aggregates the attention weights of the characters in one word into a unified value with the mixed pooling strategy (Yu et al., 2014). (2) To reduce segmentation errors, we further extend our word-aligned attention with multi-source segmentations produced by various segmenters, and deploy a fusion function to pull together their disparate outputs. In this way, we can implicitly reduce the error caused by automatic annotation.

Methodology
Character-level Pre-trained Encoder

The primary goal of this work is to inject word segmentation knowledge into character-level Chinese PLMs and enhance the original models. Given the strong performance of recent deep transformers trained on language modeling, we adopt BERT and its updated variants (ERNIE, BERT-wwm) as the basic encoder, and the outputs H from the last layer of the encoder are treated as the enriched contextual representations.

Word-aligned Attention
Although a character-level PLM can well capture language knowledge from text, it neglects the semantic information expressed at the word level. We therefore apply a word-aligned layer on top of the encoder to integrate word boundary information into the character representations with an attention aggregation mechanism.
For an input sequence with n characters S = [c_1, c_2, ..., c_n], where c_j denotes the j-th character, a Chinese word segmentation tool π is used to partition S into non-overlapping word blocks:

$$\pi(S) = \{w_1, w_2, \dots, w_m\}$$

where $w_i = \{c_s, c_{s+1}, \dots, c_{s+l-1}\}$ is the i-th segmented word of length l and s is the index of w_i's first character in S. We apply self-attention over the representations of all input characters to obtain the character-level attention score matrix $A_c \in \mathbb{R}^{n \times n}$.
It can be formulated as:

$$A_c = \text{softmax}\!\left(\frac{(Q W_q)(K W_k)^T}{\sqrt{d}}\right)$$

where Q and K are both equal to the collective representation H at the last layer of the Chinese PLM, and $W_q \in \mathbb{R}^{d \times d}$ and $W_k \in \mathbb{R}^{d \times d}$ are trainable projection parameters. While A_c models the relationship between two arbitrary characters without regard to word boundaries, we argue that treating words as atoms in the attention can better represent the semantics, as the literal meaning of an individual character can be quite different from the implied meaning of the whole word, and the simple character-level weighted sum cannot capture the semantic interaction between words.
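As a concrete illustration, the score computation above can be sketched in NumPy. This is a minimal sketch: the shapes and softmax normalization follow the formula, while H, W_q and W_k would in practice be trained tensors inside the PLM.

```python
import numpy as np

def char_attention_scores(H, W_q, W_k):
    """Character-level self-attention scores A_c.

    H: (n, d) last-layer representations of the n characters (Q = K = H).
    W_q, W_k: (d, d) trainable projection matrices.
    Returns an (n, n) matrix whose rows are softmax distributions.
    """
    d = H.shape[-1]
    scores = (H @ W_q) @ (H @ W_k).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)
```

Each row of the returned matrix is the attention distribution of one character over all characters in the sequence, which is exactly what the word-aligned layer below regroups and pools.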
To this end, we propose to align A_c at the word level and integrate the inner-word attention. For simplicity, we rewrite A_c as $[a_c^1, a_c^2, \dots, a_c^n]$, where $a_c^i \in \mathbb{R}^n$ denotes the i-th row vector of A_c, i.e., the attention score vector of the i-th character. We then deploy π to segment A_c according to π(S); for example, if $\pi(S) = \{\{c_1, c_2\}, \{c_3\}, \dots\}$, then $\{a_c^1, a_c^2\}$ forms the attention subsequence of the first word. In this way, the attention vector sequence is segmented into several subsequences, each representing the attention of one word. Then, motivated by the psycholinguistic finding that readers are likely to pay approximately equal attention to each character in a Chinese word, we devise an aggregation module to fuse the inner-word character attention. Concretely, we first transform $\{a_c^s, \dots, a_c^{s+l-1}\}$ into one attention vector $a_w^i$ for w_i with the mixed pooling strategy (Yu et al., 2014), and then execute a piecewise up-sampling operation over each $a_w^i$ to keep the input and output dimensions unchanged, for the sake of plug and play. The detailed process can be summarized as follows:

$$a_w^i = \lambda \,\text{Mean}(\{a_c^s, \dots, a_c^{s+l-1}\}) + (1 - \lambda)\,\text{Max}(\{a_c^s, \dots, a_c^{s+l-1}\}) \quad (4)$$

$$\hat{A}_c = [\,e_{l_1} \otimes a_w^1;\ e_{l_2} \otimes a_w^2;\ \dots;\ e_{l_m} \otimes a_w^m\,] \quad (5)$$

where $\lambda \in \mathbb{R}^1$ is a trainable weight that balances mean and max pooling, $e_l = [1, \dots, 1]^T$ is an l-dimensional all-ones vector, l is the length of word w_i, $e_l \otimes a_w^i = [a_w^i; \dots; a_w^i]$ denotes the Kronecker product between $e_l$ and $a_w^i$, and $\hat{A}_c \in \mathbb{R}^{n \times n}$ is the aligned attention matrix. Eq. (4)-(5) incorporate word segmentation information into the character-level attention calculation, and determine the attention vector of a character from the perspective of the whole word, which helps eliminate the attention bias caused by character ambiguity. Finally, we obtain the enhanced character representation produced by the word-aligned attention:

$$\hat{H} = \hat{A}_c V W_v \quad (6)$$

where V = H and $W_v \in \mathbb{R}^{d \times d}$ is a trainable projection matrix. Besides, we also use multi-head attention (Vaswani et al., 2017) to jointly capture information from different representation subspaces, giving K different aligned attention matrices $\hat{A}_c^k$ (1 ≤ k ≤ K) and corresponding outputs $\hat{H}^k$.
With the multi-head attention architecture, the output can be expressed as follows:

$$\overline{H} = [\hat{H}^1; \hat{H}^2; \dots; \hat{H}^K]\, W_o \quad (7)$$

where $W_o$ is a trainable projection matrix over the concatenated heads.
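The core alignment step of Eq. (4)-(5) can be sketched as follows. This is a minimal NumPy version, assuming the word blocks are given as (start, length) character spans from the segmenter π, and treating λ as a fixed scalar rather than a trained variable:

```python
import numpy as np

def word_align(A_c, blocks, lam=0.5):
    """Align a character-level attention matrix A_c (n x n) to the word level.

    blocks: (start, length) spans of each word in the character sequence.
    The rows of A_c belonging to one word are fused by mixed pooling,
    lam * Mean + (1 - lam) * Max, and the fused vector is repeated
    `length` times (the e_l (x) a_w up-sampling), so the output keeps
    the n x n shape of A_c.
    """
    rows = []
    for start, length in blocks:
        seg = A_c[start:start + length]          # inner-word attention rows
        a_w = lam * seg.mean(axis=0) + (1 - lam) * seg.max(axis=0)
        rows.extend([a_w] * length)              # piecewise up-sampling
    return np.stack(rows)

# Two characters forming one word: both rows collapse to the same
# mixed-pooled attention vector.
A_c = np.array([[0.2, 0.8],
                [0.6, 0.4]])
print(word_align(A_c, [(0, 2)]))  # rows become [0.5, 0.7] twice
```

Because the output shape matches the input, this layer can be dropped into the attention computation without touching the rest of the encoder, which is the "plug and play" property the text refers to.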

Multi-source Word-aligned Attention
As mentioned in Section 1, our proposed word-aligned attention relies on the segmentation results of a CWS tool π. Unfortunately, a segmenter is usually unreliable due to the risk of ambiguous and non-formal input, especially on out-of-domain data, which may lead to error propagation and unsatisfactory model performance. In practice, the ambiguous distinction between morphemes and compound words leads to cognitive divergence over word concepts, so different tools π may provide diverse segmentations π(S) with various granularities. To reduce the impact of segmentation errors and effectively mine the common knowledge of different segmenters, it is natural to enhance the word-aligned attention layer with multi-source segmentation inputs. Formally, assume that M popular CWS tools are employed; we can obtain M different representations $\overline{H}_1, \dots, \overline{H}_M$ by Eq. (7). Then we propose to fuse these semantically different representations as follows:

$$\tilde{H} = \sum_{m=1}^{M} \tanh(\overline{H}_m W_g) \quad (8)$$

where $W_g$ is a trainable parameter matrix and $\tilde{H}$ is the final output of the MWA layer.
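The fusion step can be sketched as follows. This is a minimal NumPy reading of the fusion function: the source only names the parameter matrix W_g, so the tanh-gated sum here is our assumption about its exact form.

```python
import numpy as np

def fuse_multi_source(H_list, W_g):
    """Fuse representations from M segmenters into one output.

    H_list: M matrices of shape (n, d), one per CWS tool.
    W_g: (d, d) shared parameter matrix (assumed shared across sources).
    The tanh squashes each projected source into [-1, 1] before summing,
    so no single noisy segmentation can dominate the fused representation.
    """
    return sum(np.tanh(H_m @ W_g) for H_m in H_list)
```

Since every segmenter's output passes through the same bounded gate, a segmenter that produces an aberrant segmentation for a given sentence contributes at most a bounded perturbation, which is how the fusion implicitly reduces annotation error.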

Experiments Setup
To test the applicability of the proposed MWA attention, we choose three publicly available Chinese pre-trained models as the basic encoder: BERT, ERNIE, and BERT-wwm. To make a fair comparison, we keep the same hyper-parameters (such as maximum sequence length, warm-up steps, initial learning rate, etc.) as suggested in BERT-wwm (Cui et al., 2019) for both the baselines and our method on each dataset. We run each experiment five times and report the average score to ensure the reliability of the results. For detailed hyper-parameter settings, please see the Appendix. Besides, three popular CWS tools, thulac (Sun et al., 2016), ictclas (Zhang et al., 2003) and hanlp (He, 2014), are employed to segment the Chinese sentences into words. We carry out experiments on four Chinese NLP tasks, including Emotion Classification (EC), Named Entity Recognition (NER), Sentence Pair Matching (SPM) and Natural Language Inference (NLI). Details of these tasks and the corresponding datasets are given in the Appendix.

Experiment Results
Table 2 shows the experimental results measuring the improvements from the MWA attention on the test sets of four datasets. Generally, our method consistently outperforms all baselines on all four tasks, which clearly indicates the advantage of introducing word segmentation information into the encoding of character sequences. Moreover, the Wilcoxon test shows that a significant difference (p < 0.01) exists between our model and the baseline models.
In detail, on the EC task, we observe a 1.46% absolute improvement in F1 score over ERNIE. The SPM and NLI tasks also benefit from our enhanced representation, achieving absolute F1 increases of 0.68% and 0.55% over the original models on average. For the NER task, our method improves the performance of BERT by 1.54%, and obtains a 1.23% improvement on average over all baselines. We attribute such a significant gain in NER to the particularity of this task. Intuitively, Chinese NER is correlated with word segmentation, and named entity boundaries are also word boundaries. Thus the potential boundary information presented by the additional segmentation input can provide better guidance for labeling each character, which is consistent with the conclusions in (Zhang and Yang, 2018; Gui et al., 2019).

Ablation Study
To demonstrate the effectiveness of our multi-source fusion method in reducing the segmentation error introduced by CWS tools, we further carry out experiments on the EC task with different segmentation inputs. Table 3 presents the comprehensive results on the three segmentation inputs produced by the three CWS tools mentioned above. Experimental results show that our model gives quite stable improvements regardless of the quality of the segmentation input. This again suggests the effectiveness of incorporating word segmentation information into character-level PLMs. Moreover, employing multiple segmenters and fusing their outputs introduces richer segmentation information and reduces the impact of the generally present segmentation errors.

Conclusion
In this paper, we propose an effective architecture, Word-aligned Attention, to incorporate word segmentation information into character-based pre-trained language models; it can be attached to a variety of downstream NLP tasks as an extra layer during fine-tuning. We also employ multiple segmenters via the proposed Multi-source Word-aligned Attention to reduce segmentation errors. The experimental results show the effectiveness of our method: compared to BERT, ERNIE and BERT-wwm, our model obtains substantial improvements on various NLP benchmarks. Although we mainly focus on Chinese PLMs in this paper, word-aligned attention could likewise be applied to word pieces in English NLP tasks. We are also considering applying this model to pre-training language models on language modeling tasks at different granularities to capture multi-level language features.

Table 2 :
Results of word-aligned attention models on multiple NLP tasks. All results are F1-scores evaluated on the test sets; each experiment is run five times and the average is reported. Some results are similar to those in the BERT-wwm technical report (Cui et al., 2019).

Table 3 :
Results of word-aligned attention produced by different segmenters, and results of the aggregated model over multiple segmenters, on the weibo sentiment-100k dataset.