Porous Lattice Transformer Encoder for Chinese NER

Incorporating lexicons into character-level Chinese NER by lattices is proven effective to exploit rich word boundary information. Previous work has extended RNNs to consume lattice inputs and achieved great success. However, due to the DAG structure and the inherently unidirectional sequential nature, this method precludes batched computation and sufficient semantic interaction. In this paper, we propose PLTE, an extension of transformer encoder that is tailored for Chinese NER, which models all the characters and matched lexical words in parallel with batch processing. PLTE augments self-attention with positional relation representations to incorporate lattice structure. It also introduces a porous mechanism to augment localness modeling and maintain the strength of capturing the rich long-term dependencies. Experimental results show that PLTE performs up to 11.4 times faster than state-of-the-art methods while realizing better performance. We also demonstrate that using BERT representations further substantially boosts the performance and brings out the best in PLTE.


Introduction
Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP), which aims to automatically discover named entities and identify their corresponding categories from plain text. NLP tasks such as information retrieval (Berger and Lafferty, 2017), relation extraction (Yu et al., 2019) and entity linking require NER as one of their preprocessing components. Recent studies show that English NER models have achieved improved performance by integrating character information into word representations based on sequence labeling. Unlike English, East Asian languages including Chinese are written without explicit word boundaries. One intuitive way to solve this problem is to segment the input sentences into words first, and then to apply word sequence labeling (Yang et al., 2016; He and Sun, 2017a). However, such methods suffer from error propagation between these two subtasks.
To overcome this limitation, efforts have been devoted to incorporating word information by leveraging lexicon features and gazetteers (Peng and Dredze, 2015; Cao et al., 2018; Wu et al., 2019; Lin et al., 2019). A recent state-of-the-art (SOTA) lattice-based method integrated matched lexical word information into the character sequence with a directed acyclic graph (DAG) structure using lattice LSTM. While obtaining promising results, this model faces two challenges. First, as an extension of the non-parallelizable sequential LSTM to a DAG-structured model, lattice LSTM is restricted to processing one character at a time, which can make it infeasible to deploy. Second, due to its inherently unidirectional sequential nature, lattice LSTM fails to incorporate word-level semantics into the representation of the characters except for the last character in each word, even though such information can be crucial for character-level sequence tagging. Taking the sentence in Figure 1 as an example, lattice LSTM decodes the information of the lexical word "南京市(NanJing City)" to "市(City)" but skips the other two inside characters "南(South)" and "京(Capital)", although the semantics and boundary information of "南京市(NanJing City)" can be useful knowledge for predicting the tag of "南(South)" as "B-LOC".

Figure 1: Example of word character lattice. Restricted by the unidirectional sequential nature, lattice LSTM cannot model the semantic interaction between word "南京市(Nanjing City)" and its constituent characters "南(South)" and "京(Capital)", resulting in the loss of crucial information for tagging. Besides, lattice LSTM cannot perform batched computation due to the directed acyclic graph input structure.
In this paper, we address these issues with a novel Porous Lattice Transformer Encoder (PLTE). Inspired by previous research on machine translation (Sperber et al., 2019), which integrated lattice-structured inputs into self-attention models, we propose a lattice transformer encoder for Chinese NER by introducing lattice-aware self-attention, which borrows the idea of relative positional embedding (Shaw et al., 2018) to make self-attention aware of the relative position information in the lattice structure. Because a self-attention network calculates attention weights between each pair of tokens in a sequence regardless of their distance, we simply concatenate all the characters and lexical words as input to consume lattices without resorting to the DAG structure. In this way, characters coupled with lexical words can be processed in batches, which addresses the first issue. Lattice-aware self-attention also allows a lexical word representation to build a direct relation with the characters it contains, which addresses the second issue.
Some work (Yang et al., 2019) demonstrates that self-attention benefits from locality modeling, especially for the NER task. As we can see from the example in Figure 1, the word "位于(Locates In)" is the immediate and most obvious feature to guide the neighboring character "桥(Bridge)" to be identified as "E-LOC" instead of "E-PER", while "中国(China)" makes no contribution to this decision. Given this observation, we further introduce a novel porous mechanism to enhance the local dependencies among neighboring tokens. The key insight is to modify the self-attention architecture by replacing the fully-connected topology with a pivot-shared structure. In particular, every two non-neighboring tokens are connected only through a shared pivot node, so that direct attention is reserved for neighboring tokens and their dependencies are strengthened. Experimental results on four datasets demonstrate that our model performs up to 11.4 times faster than baselines and achieves better performance. Furthermore, we show that our model can be easily integrated with a pre-trained language model such as BERT (Devlin et al., 2019), and that combining them further improves the state of the art.
In summary, this paper makes the following contributions: (1) We investigate a lattice transformer encoder for Chinese NER, which is capable of handling lattices in batch mode and capturing dependencies between characters and matched lexical words. (2) We revise the lattice-aware attention distribution via a porous mechanism, which enhances the ability to capture useful local context. (3) Experimental results show that the proposed model is effective and efficient. The source code of this paper can be obtained from https://github.com/strawberryx/PLTE.

Related Work
Our work is in line with NER models based on neural networks and lattice transformer models. Huang et al. (2015) proposed a BiLSTM-CRF model for NER and achieved strong performance. Santos and Guimaraes (2015) used word- and character-level representations based on the CharWNN deep neural network. Lample et al. (2016) designed a character LSTM and word LSTM for NER. Compared to our work, these word-based methods suffer from segmentation errors.
To avoid segmentation errors, most recent NER models are built upon character sequence labeling. Peng and Dredze (2015) proposed a joint training objective for three types of neural embeddings to better recognize entity boundaries. Lu et al. (2016) presented a position-sensitive skip-gram model to learn multi-prototype Chinese character embeddings. He and Sun (2017a) took positional character embeddings into account. Although these methods achieve promising performance, they ignore the word information contained in the character sequence.
Some work exploits rich word boundary and semantic information in the character sequence. Cao et al. (2018) applied an adversarial transfer learning framework to integrate task-shared word boundary information into Chinese NER. Other work explored four different strategies for Word-Character LSTM. Gui et al. (2019a) proposed a CNN-based NER model that incorporates lexicons using a rethinking mechanism. Recent state-of-the-art methods exploit lattice-structured models to integrate latent word information into the character sequence, which has been proven effective on various NLP tasks (Su et al., 2017; Tan et al., 2018). Specifically, lattice LSTM leverages explicit word information over character sequence labeling. Based on this method, Gui et al. (2019b) and Sui et al. (2019) formulated the lattice structure as a graph and leveraged Graph Neural Networks (GNNs) to integrate lexical knowledge. However, for the NER task, coupling pre-trained language models such as BERT (Devlin et al., 2019) with GNNs and fine-tuning them can be non-trivial.
Lattice transformers have been exploited in NMT as well as speech translation (Sperber et al., 2019). Compared with existing work, our proposed porous lattice transformer encoder differs in both motivation and structure. We revise the fully-connected attention distribution with a pivot-shared structure via the porous mechanism to enhance the local dependencies among neighboring tokens. To our knowledge, we are the first to design a lattice transformer for Chinese NER.

Background
In this section, we first briefly review the self-attention mechanism and then turn to the lattice Transformer on which our PLTE model is built.

Self-Attention
The self-attention mechanism has attracted increasing attention due to its flexibility in parallel computation and dependency modeling. Given an input sequence representation $X = \{x_1, \cdots, x_n\} \in \mathbb{R}^{n \times d}$, we first transform it into queries $Q = XW^Q \in \mathbb{R}^{n \times d_k}$, keys $K = XW^K \in \mathbb{R}^{n \times d_k}$, and values $V = XW^V \in \mathbb{R}^{n \times d_v}$, where $\{W^Q, W^K, W^V\}$ are trainable parameters. The output sequence representation is calculated as:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

where $\sqrt{d_k}$ is the scaling factor.
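To make the computation concrete, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention as described above. The function names, the toy dimensions and the random inputs are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # (n, d_k), (n, d_k), (n, d_v)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # (n, n); each row sums to 1
    return weights @ V, weights

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 8, 4, 6
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, m)) for m in (d_k, d_k, d_v))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Every token attends to every other token in one matrix product, which is what makes the operation easy to batch.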

Lattice Transformer
The Transformer has been used for many NLP tasks, notably machine translation and language modeling (Devlin et al., 2019). By invoking multi-layer self-attention for global context modeling, the Transformer enables parallel computation and addresses the inherently sequential computation of RNNs. The lattice Transformer generalizes the standard Transformer architecture to accept lattice-structured inputs. It linearizes the lattice structure and introduces a position relation score matrix to make self-attention aware of the topological structure of the lattice:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top} + R}{\sqrt{d_k}}\right)V \tag{2}$$

where $R \in \mathbb{R}^{n \times n}$ encodes the lattice-dependent relations between each pair of elements from the lattice, and the way it is computed depends on the relation definition chosen for the task objective.

Figure 2: (a) Overall structure of the model. Characters and lexical words are shown in yellow and green, respectively. We concatenate character and word embeddings as lattice input. When decoding, we mask words and perform sequence labeling only for characters; and (b) Illustration of the relative position relation matrix. Notice that we present several relations among partial tokens as instances. Different colors indicate different relations defined in Figure 3. For instance, the relation between $t_4$ and $t_{10}$ is $r_6$, since "长(Long)" is included in "市长(Mayor)". The circle filled with lines denotes that we do not compute attention between non-neighboring tokens due to our porous mechanism.
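As a rough illustration of how a pairwise relation score matrix enters the attention computation, the sketch below adds a matrix `R` to the query-key scores before the softmax. The additive form, the function name and the toy values are assumptions made for illustration; how `R` is actually populated depends on the relation definition of the task.

```python
import numpy as np

def lattice_attention(Q, K, V, R):
    """softmax((Q K^T + R) / sqrt(d_k)) V with a pairwise relation score R of shape (n, n)."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T + R) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 3, 2
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

R = np.zeros((n, n))
R[0, 3] = -1e9          # hypothetical: a very negative score suppresses the pair (0, 3)
out, weights = lattice_attention(Q, K, V, R)
```

A strongly negative entry in `R` drives the corresponding attention weight to zero, while a zero entry leaves the standard attention score unchanged.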

Models
The overall structure of our model is shown in Figure 2(a). It consists of three main components: a lattice input layer, a porous lattice transformer encoder, and a BiGRU-CRF decoding layer.

Lattice Input Layer
The input layer aims to embed both semantic information and position information of tokens into their token embeddings.
Word-Character Embedding. Formally, let $S = \{c_1, \ldots, c_M\}$ denote a sentence, where $c_i$ is the $i$-th character. A lexical word in the lexicon that matches a character subsequence is written as $e_{i:j}$, where $i$ and $j$ are the indices of its first and last characters, respectively. Similarly, we can represent the character $c_i$ as $e_{i:i}$. As shown in the top half of Figure 2(b), $e_{3:4}$ indicates the lexical word "市长(Mayor)", which contains $c_3$, "市(City)", and $c_4$, "长(Long)". Each character $c_i$ is mapped to a vector $x^c_i$ that includes its character embedding and bigram embedding. Each matched lexical word $e_{i:j}$ is represented as a vector $x^w_{i:j}$, looked up from a pre-trained word embedding matrix.
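The matching step can be sketched as a brute-force scan over character spans. This is only an illustrative sketch: the function name `match_lexicon`, the `max_len` cap and the toy lexicon are assumptions, and a real system would more likely use a trie for efficiency.

```python
def match_lexicon(sentence, lexicon, max_len=4):
    """Return (i, j, word) for every multi-character lexicon word e_{i:j}.

    Indices are 1-based and inclusive, following the notation in the text.
    """
    matches = []
    for start in range(len(sentence)):
        # Spans of length 2 .. max_len that begin at this character.
        for end in range(start + 2, min(start + max_len, len(sentence)) + 1):
            word = sentence[start:end]
            if word in lexicon:
                matches.append((start + 1, end, word))
    return matches

sentence = "南京市长江大桥"
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}  # toy lexicon
matches = match_lexicon(sentence, lexicon)
for i, j, word in matches:
    print(f"e_{i}:{j} = {word}")
```

With this toy lexicon, the span (3, 4) corresponds to "市长", which agrees with the $e_{3:4}$ example in the text.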
Lattice-Aware Position Encoding. Since the self-attention architecture contains no recurrence, we add a position embedding to the semantic embedding of each token to make the model aware of the sequence order. Specifically, the position of a character is defined as its absolute position in the input sequence $S$, and the position of a matched word is the position of its first character. For example, in Figure 2(b), the position of the word "南京(Nanjing)" is 1 because this sentence begins with "南(South)". Finally, since position information is incorporated into the token embeddings, we can simply append the matched words to the end of the character sequence $S$ and form a new token sequence $T = \{t_i\}_{i=1}^{N}$ to consume the lattice structure, where $N$ is the total number of characters and words. See the top half of Figure 2(b) for the detailed correspondence.

Figure 3: Relation between $e_{p:q}$ and $e_{k:l}$. We use the block filled with dots and the block filled with lines to present $e_{p:q}$ and $e_{k:l}$, respectively. Notice that if $p = q = k = l$, we denote the relation between $e_{p:q}$ and $e_{k:l}$ as $r_5$. Relation $r_7$ consists of two cases.
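The construction of the token sequence $T$ can be sketched as follows. The helper name `build_lattice_tokens` and the hard-coded matches are illustrative assumptions; the point is only that characters come first, matched words are appended, and each word inherits the position of its first character.

```python
def build_lattice_tokens(sentence, matches):
    """Characters first, then matched words; each token is (text, position).

    `matches` holds (i, j, word) triples with 1-based inclusive indices.
    """
    tokens = [(ch, idx) for idx, ch in enumerate(sentence, start=1)]
    tokens += [(word, start) for start, _end, word in matches]
    return tokens

sentence = "南京市长江大桥"
matches = [(1, 2, "南京"), (3, 4, "市长")]   # hypothetical subset of matched words
tokens = build_lattice_tokens(sentence, matches)
for text, position in tokens:
    print(position, text)
```

Because the position is carried by the embedding rather than by the order of the tokens, the words can be placed after the characters without losing their location in the sentence.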

Porous Lattice Transformer Encoder
As mentioned in the Introduction, our primary goal is to adapt the standard transformer to the task of Chinese NER with lattice inputs. To this end, we first propose lattice-aware self-attention to consume input tokens and the relative position information of lattice structure. Then, we design a porous mechanism which learns sparse attention coefficients by replacing the fully-connected topology with a pivot-shared structure to enhance the association between neighboring elements. We also use multi-head attention (Vaswani et al., 2017) to capture information from different representation subspaces jointly.
Lattice-Aware Self-Attention (LASA). The position embedding method described above only indicates the sequential order and cannot capture the relative position information of the lattice-structured input. For example, in Figure 2(b), the sequential distance from "市(City)" or "市长(Mayor)" to "长(Long)" is 1 under the previous position definition. In fact, "长(Long)" is included in "市长(Mayor)" and is right-adjacent to "市(City)", but absolute positions fail to make this distinction. To address this issue, we propose a relative position relation matrix $L \in \mathbb{N}^{N \times N}$ to represent such position information. Similar to previous work, we enumerate all possible relations between each pair of elements $e_{p:q}$ and $e_{k:l}$ in Figure 3, and give a detailed example in Figure 2(b). For two tokens $t_i$ and $t_j$ referring to $e_{p:q}$ and $e_{k:l}$ respectively, the matrix entry $L_{i,j}$ is the pre-defined relation between them, such as $L_{1,2} = r_1$. More concretely, in order to make $L$ learnable, we first represent $L$ as the relation position embedding, a 3D tensor $R \in \mathbb{R}^{N \times N \times d_r}$, by looking up a trainable embedding matrix $A \in \mathbb{R}^{8 \times d_r}$, where $d_r$ is the relational embedding dimensionality. Note that here we define eight types of embedding instead of the seven relations in Figure 3. The additional embedding is introduced to represent the interaction relation with a shared pivot node (described in the next section) and to facilitate parallel computation. Then, to incorporate such position relations into the attention layer, we adapt Equation 2 as follows:

$$\alpha = \mathrm{softmax}\left(\frac{QK^{\top} + \mathrm{einsum}(\text{"ik,ijk->ij"}, Q, R^K)}{\sqrt{d_k}}\right) \tag{3}$$

$$\mathrm{Att}(Q, K, V) = \alpha V + \mathrm{einsum}(\text{"ij,ijk->ik"}, \alpha, R^V) \tag{4}$$

where $R^K \in \mathbb{R}^{N \times N \times d_k}$ and $R^V \in \mathbb{R}^{N \times N \times d_v}$ are two relation embedding tensors which are added to the keys and values respectively to indicate the relation between input tokens. In our case, $Q$ is a 2D array of shape $[N \times d_k]$ while $R^K$ is a 3D array of shape $[N \times N \times d_k]$, and we need to obtain a new array of shape $[N \times N]$ whose element in the $i$-th row and $j$-th column is $\sum_{k=1}^{d_k} Q_{ik} R^K_{ijk}$.
To implement this operation, we apply einsum, an operation that computes multilinear expressions (i.e., sums of products) using the Einstein summation convention, to sum out the hidden-size dimension.
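The two einsum contractions can be sketched in NumPy as follows. This is a single-head sketch under stated assumptions: the function name, the toy dimensions, the random relation ids in `L`, and the use of two separate lookup tables `A_k` and `A_v` for the key-side and value-side relation embeddings are all illustrative choices rather than details confirmed by the text.

```python
import numpy as np

def lattice_aware_attention(Q, K, V, R_k, R_v):
    """Self-attention with relation embeddings added on the key and value sides."""
    d_k = Q.shape[-1]
    # scores[i, j] = Q[i] . K[j] + sum_k Q[i, k] * R_k[i, j, k]
    scores = (Q @ K.T + np.einsum("ik,ijk->ij", Q, R_k)) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    # out[i] = sum_j alpha[i, j] * (V[j] + R_v[i, j])
    out = alpha @ V + np.einsum("ij,ijk->ik", alpha, R_v)
    return out, alpha

rng = np.random.default_rng(0)
N, d_k, d_v, num_relations = 6, 4, 5, 8
Q = rng.normal(size=(N, d_k))
K = rng.normal(size=(N, d_k))
V = rng.normal(size=(N, d_v))

L = rng.integers(0, num_relations, size=(N, N))   # placeholder relation ids
A_k = rng.normal(size=(num_relations, d_k))       # hypothetical key-side table
A_v = rng.normal(size=(num_relations, d_v))       # hypothetical value-side table
R_k, R_v = A_k[L], A_v[L]                         # (N, N, d_k) and (N, N, d_v)

out, alpha = lattice_aware_attention(Q, K, V, R_k, R_v)
```

Indexing an embedding table with the integer matrix `L` yields the 3D relation tensor directly, so the whole layer stays a sequence of dense array operations.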
Porous Multi-Head Attention (PMHA). Standard self-attention encodes a sequence by relating every item to every other item through pairwise similarity. This disperses the attention distribution and overlooks the local knowledge provided by neighboring elements, which is crucial for NER. To maintain the strength of capturing long-distance dependencies while enhancing the ability to capture short-range dependencies, we sparsify the transformer architecture by replacing the fully-connected topology with a pivot-shared structure, following Guo et al. (2019). Specifically, given the element set $E$ and its embedding matrix $X$, where $e_{i:j} \in E$ and $x_{i:j} \in X$ (if $e_{i:j}$ is a character then $x_{i:j} = x^c_i$, else $x_{i:j} = x^w_{i:j}$), we define $e^{r_k}_{i:j}$ as the set of elements whose relation with $e_{i:j}$ is $r_k$, and $x^{r_k}_{i:j}$ as the concatenation of the embeddings of the elements in $e^{r_k}_{i:j}$. We also define the neighboring set of $e_{i:j}$ as $\varepsilon = \{e^{r_1}_{i:j}; e^{r_2}_{i:j}; e^{r_3}_{i:j}; e^{r_4}_{i:j}; e^{r_5}_{i:j}; e^{r_6}_{i:j}\}$. We then update the hidden state $h_{i:j}$ of $e_{i:j}$ with multi-head attention as follows:

$$c_{i:j} = [x^{r_1}_{i:j}; x^{r_2}_{i:j}; x^{r_3}_{i:j}; x^{r_4}_{i:j}; x^{r_5}_{i:j}; x^{r_6}_{i:j}; s]$$

$$z_h = \mathrm{Att}(x_{i:j} W^Q_h,\; c_{i:j} W^K_h,\; c_{i:j} W^V_h)$$

$$h_{i:j} = [z_1; z_2; \cdots; z_H]$$

where $W^Q_h$, $W^K_h$, $W^V_h$ are trainable projection matrices corresponding to the $h$-th head, $z_h$ is the $h$-th output, $H$ is the number of heads, and $\mathrm{Att}(\cdot)$ is defined in Equation 4. As we can see, in our porous multi-head attention, an element $e_{i:j}$ computes attention directly only with its neighboring elements and models non-local compositions via the pivot node $s$. As illustrated in Figure 2(b), $e_{i:j}$ does not compute attention directly with the element set $e^{r_7}_{i:j}$, so we mask these elements. Under this lightweight porous structure, our transformer encoder can strengthen local dependencies among neighboring tokens while retaining the ability to capture long-distance dependencies.
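The masking idea can be sketched with a single attention head in NumPy. Several simplifications are assumed here and are not taken from the text: the projection matrices and relation embeddings are omitted, the neighbor mask is a simple band over adjacent positions rather than the relations of Figure 3, and the pivot vector `s` is initialised as the mean of the token embeddings.

```python
import numpy as np

def porous_attention(X, s, neighbor_mask):
    """Token i attends only to its neighbors and to the shared pivot node s."""
    N, d = X.shape
    context = np.vstack([X, s[None, :]])                  # (N + 1, d); last row is the pivot
    pivot_col = np.ones((N, 1), dtype=bool)               # the pivot is visible to every token
    mask = np.hstack([neighbor_mask, pivot_col])          # (N, N + 1)
    scores = X @ context.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)              # drop non-neighboring pairs
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context, weights

rng = np.random.default_rng(0)
N, d = 5, 4
X = rng.normal(size=(N, d))
s = X.mean(axis=0)                                        # assumption: mean-pooled pivot
idx = np.arange(N)
neighbor_mask = np.abs(idx[:, None] - idx[None, :]) <= 1  # toy band-shaped neighborhood
out, weights = porous_attention(X, s, neighbor_mask)
```

Masked pairs receive exactly zero weight, so information between distant tokens can only flow through the pivot column.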

BiGRU-CRF Decoding
After extracting the semantic information with the porous lattice transformer encoder layer, we feed the character sequence representations into a BiGRU-CRF decoding layer to perform sequence tagging. Specifically, taking $[x^c_1; h_{1:1}], \ldots, [x^c_n; h_{n:n}]$ as input, a bidirectional GRU produces a forward state $\overrightarrow{h}_t$ and a backward state $\overleftarrow{h}_t$ for each time step, and we concatenate these two hidden states as the encoding output of the $t$-th character, denoted as

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$

Finally, a standard CRF layer is used on top of $h_1, h_2, \ldots, h_n$ to perform sequence tagging. For a label sequence $y = \{y_1, y_2, \ldots, y_n\}$, we define its probability to be:

$$p(y \mid S) = \frac{\exp\left(\sum_{i}\left(W^{y_i}_{\mathrm{CRF}} h_i + b^{(y_{i-1}, y_i)}_{\mathrm{CRF}}\right)\right)}{\sum_{y'} \exp\left(\sum_{i}\left(W^{y'_i}_{\mathrm{CRF}} h_i + b^{(y'_{i-1}, y'_i)}_{\mathrm{CRF}}\right)\right)}$$

where $y'$ ranges over all possible tag sequences, $W^{y_i}_{\mathrm{CRF}}$ is a model parameter specific to $y_i$, and $b^{(y_{i-1}, y_i)}_{\mathrm{CRF}}$ is the transition score between $y_{i-1}$ and $y_i$. For decoding, we use the first-order Viterbi algorithm to find the label sequence with the highest score.
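First-order Viterbi decoding can be sketched in plain Python as below. The emission scores stand in for the per-tag scores computed from the hidden states, the toy tag set (0 = O, 1 = B, 2 = E) and all numeric values are made up for illustration, and the start transition is omitted for brevity.

```python
def viterbi(emissions, transitions):
    """First-order Viterbi decoding.

    emissions[t][y]   : score of tag y at step t
    transitions[p][y] : score of moving from tag p to tag y
    Returns the best tag path and its total score.
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_ptr, step_score = [], []
        for y in range(num_tags):
            best_prev = max(range(num_tags), key=lambda p: score[p] + transitions[p][y])
            step_ptr.append(best_prev)
            step_score.append(score[best_prev] + transitions[best_prev][y] + emit[y])
        backpointers.append(step_ptr)
        score = step_score
    best_last = max(range(num_tags), key=lambda y: score[y])
    path = [best_last]
    for step_ptr in reversed(backpointers):   # follow the pointers back to step 0
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path, score[best_last]

# Toy scores: tag ids 0 = O, 1 = B, 2 = E.
emissions = [[0.1, 2.0, 0.0], [0.2, 0.0, 1.5], [1.0, 0.3, 0.2]]
transitions = [[0.0, 0.0, -5.0],    # from O: entering E is penalised
               [-5.0, -5.0, 1.0],   # from B: only B -> E is encouraged
               [0.0, 0.0, -5.0]]    # from E: E -> E is penalised
path, best_score = viterbi(emissions, transitions)
print(path, best_score)
```

Dynamic programming keeps the cost linear in the sequence length and quadratic in the number of tags, instead of enumerating every tag sequence.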

Training
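Given a set of manually labeled training data $\{(S_i, y_i)\}_{i=1}^{N}$, a sentence-level log-likelihood loss with $L_2$ regularization is used to train the model:

$$\mathcal{L} = -\sum_{i=1}^{N} \log p(y_i \mid S_i) + \frac{\lambda}{2}\lVert\Theta\rVert^2$$

where $\lambda$ is the $L_2$ regularization weight and $\Theta$ represents the parameter set.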
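For a single sentence, the negative log-likelihood term can be computed with the forward algorithm, as in the plain-Python sketch below. The helper names, the toy scores, the flat `params` list standing in for the parameter set, and the omission of a start transition are all illustrative assumptions.

```python
import math

def log_sum_exp(values):
    """Stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emissions, transitions, tags):
    """Unnormalised score of one tag sequence."""
    total = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        total += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return total

def log_partition(emissions, transitions):
    """Log of the summed exponentiated scores of all tag sequences (forward algorithm)."""
    num_tags = len(emissions[0])
    alpha = list(emissions[0])
    for emit in emissions[1:]:
        alpha = [
            emit[y] + log_sum_exp([alpha[p] + transitions[p][y] for p in range(num_tags)])
            for y in range(num_tags)
        ]
    return log_sum_exp(alpha)

def nll_with_l2(emissions, transitions, tags, params, lam):
    """Negative log-likelihood of `tags` plus (lam / 2) * squared L2 norm of `params`."""
    nll = log_partition(emissions, transitions) - sequence_score(emissions, transitions, tags)
    return nll + 0.5 * lam * sum(p * p for p in params)

emissions = [[1.0, 0.0], [0.5, 0.2], [0.0, 1.5]]   # toy scores for 3 steps, 2 tags
transitions = [[0.3, -0.2], [-0.4, 0.6]]
tags = [0, 0, 1]
params = [0.5, -1.0, 2.0]                           # stand-in for the parameter set
loss = nll_with_l2(emissions, transitions, tags, params, lam=0.1)
```

Summing this quantity over all training sentences gives the sentence-level objective; the regularization term would normally be added once rather than per sentence.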

Experiments
We conduct experiments to investigate the effectiveness of our proposed PLTE method across different domains. Standard precision (P), recall (R) and F1-score (F1) are used as evaluation metrics.
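As a reminder of how these metrics are typically computed for NER, the sketch below scores predictions at the entity level, counting a prediction as correct only when both its span and its type match a gold entity exactly. The exact-match convention, the function name and the toy entities are assumptions made for illustration.

```python
def prf1(gold_entities, predicted_entities):
    """Entity-level precision, recall and F1 under exact span-and-type matching."""
    gold, pred = set(gold_entities), set(predicted_entities)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Entities are (start, end, type) triples; the values are made up.
gold = [(0, 2, "LOC"), (3, 6, "LOC"), (9, 10, "GPE")]
pred = [(0, 2, "LOC"), (3, 6, "PER")]
precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the second prediction has the right span but the wrong type, so it counts as an error for both precision and recall.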

Data
We evaluate our model on four datasets: OntoNotes (Ralph et al., 2011), MSRA (Levow, 2006), Weibo NER (Peng and Dredze, 2015; He and Sun, 2017b) and a Chinese Resume dataset. We use the same training, validation and test split as

Baseline Methods
We compare our proposed model to several recent lexicon-enhanced character-based models.
Lattice LSTM. Lattice LSTM exploits lexical information in character sequence through gated recurrent cells, which can avoid segmentation errors.
LR-CNN. LR-CNN (Gui et al., 2019a) is the latest SOTA method of Chinese NER, which incorporates lexicons using a rethinking mechanism.
Furthermore, to explore the effectiveness of pre-trained language model, we implement several baselines based on BERT representations.
BERT-Tagger. BERT-Tagger (Devlin et al., 2019) uses the outputs from the last layer of the BERT-base model as character-level enriched contextual representations to perform sequence labeling.

Hyper-parameter settings
In our experiments, we use the same character embeddings, character bigram embeddings and word embeddings as previous work, which are pre-trained on Chinese Giga-word using Word2vec (Mikolov et al., 2013) and fine-tuned during training. The model is trained using stochastic gradient descent with an initial learning rate of 0.045 and a weight decay of 0.05. Dropout is applied to the embeddings and the GRU layer with a rate of 0.5, and to the transformer encoder with a rate of 0.3. For the largest dataset, MSRA, and the smallest dataset, Weibo, we set the dimensionality of the GRU hidden states to 200 and 80 respectively. For the other datasets, this dimension is set to 100. The hidden size and the number of heads are set to 128 and 6, respectively. For models based on BERT, we fine-tune the BERT representation layer during training. We use BertAdam to optimize all trainable parameters and select the best learning rate between 1e-5 and 1e-4 on the development set.

Results
OntoNotes. Table 1 shows our experimental results on OntoNotes. The "Input" column indicates whether the input sentences are segmented or not: methods marked Gold seg process word sequences with gold segmentation, and No seg indicates that the input sentence is a character sequence. With gold-standard segmentation, all of the word-level models (Yang et al., 2016) achieve strong performance by using segmentation and external labeled data. But such information is not available in most datasets, so we only use pre-trained character and word embeddings as our resource.
Under No-segmentation settings, we first compare 3 widely-used non-BERT models. Our PLTE model achieves the best F1 score and gains a 0.72% improvement over lattice LSTM in F1 score, since our model integrates lexical word information into the self-attention computation in a more effective way. With pre-trained BERT-base, BERT-Tagger leads to a significant boost in performance to 79.16%. On this basis, our proposed PLTE[BERT] model outperforms BERT-Tagger by 1.44% in F1 score on OntoNotes.
MSRA/Weibo/Resume. Tables 1 and 2 present comparisons among various methods on the MSRA, Weibo, and Resume datasets. Existing statistical methods explore rich statistical features (Zhou et al., 2013) and character embedding features (Lu et al., 2016). Among neural models, some use multi-task learning (Peng and Dredze, 2016; Cao et al., 2018) or semi-supervised learning (He and Sun, 2017a). CAN-NER investigates a character-based convolutional attention network coupled with GRU for Chinese NER.
Consistent with observations on OntoNotes, all the lexicon-enhanced methods achieve higher F1 scores than character-based methods, which demonstrates the usefulness of lexical word information. With pre-trained contextual representations, BERT-based models outperform non-BERT models by a large margin. Even though the original BERT model already provides strong prediction power, PLTE consistently improves over BERT-Tagger, lattice LSTM[BERT] and LR-CNN[BERT], which indicates that our proposed PLTE model can make better use of these semantic representations. Another interesting observation is that PLTE gains a more significant improvement when combined with BERT than other lexicon-enhanced methods do. We suspect that this is because PLTE is more capable of fully leveraging the language information embedded in the input representations. While the embeddings pre-trained by Word2vec are not informative enough for PLTE to fulfill its potential, BERT representations capture rich semantic patterns and help PLTE improve its performance.

Efficiency Advantage
PLTE also outperforms current lexicon-enhanced methods in efficiency. Table 3 lists the test times of different models with different input representations on all four benchmarks. As we can see, PLTE runs up to 11.4 and 5.11 times faster than lattice LSTM and LR-CNN respectively with Word2vec embeddings on OntoNotes. Similar efficiency improvements can also be observed on the other datasets under both kinds of input representations. Aligning the word-character lattice structure for batch running is usually non-trivial (Sui et al., 2019), and neither lattice LSTM nor LR-CNN can run in batches, due to the DAG structure or the variable-sized lexical word set. In contrast, PLTE overcomes this limitation: thanks to the lattice-aware self-attention mechanism, we can simply concatenate all the elements as input, and the attention weights between each pair of tokens are calculated by matrix multiplication, which can be computed in parallel in batches.
To investigate the influence of sentence length, we conduct experiments on OntoNotes by splitting this dataset into five parts according to sentence length. The results in Figure 4(a) show that PLTE runs faster than lattice LSTM and LR-CNN across sentence lengths, especially for short sentences. In particular, when the sentence length is less than 20, PLTE (batch size = 4) runs 9.64 times faster than lattice LSTM and 8.81 times faster than LR-CNN. As the sentence length increases, the efficiency gains from batched computation decline gradually due to the limited computing resources of a single GPU. Besides, even if we set the batch size to 1, PLTE still has a remarkable advantage in speed, since lattice LSTM demands multiple recurrent computation steps, and the rethinking mechanism in LR-CNN is also computationally expensive.

Figure 4: (a) Test speed against the sentence length. Sen/s denotes the number of sentences processed per second; and (b) An ablation study of our proposed model. For the model without lattice-aware self-attention (-LASA), we take the character sequence as input, and each character computes multi-head self-attention weights only with its adjacent characters and the shared pivot node. For the model without the porous mechanism (-PM), we directly utilize multi-head LASA to aggregate the weighted information of each word with fully-connected attention connections. For PLTE-LASA-PMHA, we apply multi-head self-attention to each pair of elements from the input character sequence.

Model Ablation study
We conduct an ablation study on four datasets to understand the effectiveness of each component. The results are shown in Figure 4(b). We can observe that: (1) Removing the LASA module hurts the results by 3.63%, 0.99%, 2.81% and 0.58% F1 score on the four datasets respectively, which indicates that lexicons play an important role in character-level Chinese NER. (2) By introducing the porous mechanism (PM), we can enhance the ability to capture useful local context, which is beneficial to NER, while maintaining the strength of capturing long-term dependencies. (3) PLTE-PM performs worse than PLTE-LASA-PM, which confirms that the standard LASA is not suitable for NER: it takes all the signals into account and disperses the attention distribution, while NER may benefit more from local modeling. (4) PLTE-LASA outperforms PLTE-LASA-PM on most datasets, which shows that the porous mechanism can also benefit self-attention when only characters are taken as input.

Conclusion
We presented PLTE, a porous lattice transformer encoder which incorporates lexicons into character-level Chinese NER. PLTE enables interaction between the matched lexical words and their constituent characters, and proceeds in batches with the lattice-aware self-attention. It also learns a porous attention distribution to enhance localness modeling. We evaluate the proposed model on four Chinese NER datasets. Using Word2vec embeddings, PLTE outperforms various baselines and performs up to 11.4 times faster than previous lattice-based methods. Switching to BERT representations, PLTE achieves a more significant performance gain than existing methods. There are multiple avenues for future work; one promising direction is to apply our model to the pre-training procedure of Chinese Transformer language models.