FLAT: Chinese NER Using Flat-Lattice Transformer

Recently, the character-word lattice structure has proven effective for Chinese named entity recognition (NER) by incorporating word information. However, since the lattice structure is complex and dynamic, most lattice-based models are hard to parallelize on GPUs and usually have low inference speed. In this paper, we propose FLAT: Flat-LAttice Transformer for Chinese NER, which converts the lattice structure into a flat structure consisting of spans. Each span corresponds to a character or latent word and its position in the original lattice. With the power of the Transformer and a well-designed position encoding, FLAT can fully leverage the lattice information and has excellent parallelization ability. Experiments on four datasets show that FLAT outperforms other lexicon-based models in both performance and efficiency.

Recently, the lattice structure has proven greatly beneficial for utilizing word information and avoiding the error propagation of word segmentation (Zhang and Yang, 2018). We can match a sentence against a lexicon to obtain the latent words in it, and then build a lattice like the one in Figure 1(a). The lattice is a directed acyclic graph, where each node is a character or a latent word. It includes the sequence of characters and the potential words in the sentence. These are not ordered sequentially; a word's position is determined by its first and last characters. Some words in the lattice may be important for NER. For example, in Figure 1(a), "人和药店(Renhe Pharmacy)" can be used to distinguish between the geographic entity "重庆(Chongqing)" and the organization entity "重庆人(Chongqing People)".
There are two lines of methods that leverage the lattice. (1) One line is to design a model compatible with lattice input, such as lattice LSTM (Zhang and Yang, 2018) and LR-CNN (Gui et al., 2019a). In lattice LSTM, an extra word cell encodes the potential words, and an attention mechanism fuses the variable number of nodes at each position, as in Figure 1(b). LR-CNN uses CNNs to encode potential words at different window sizes. However, RNNs and CNNs have difficulty modeling long-distance dependencies (Vaswani et al., 2017), which may be useful in NER, e.g., for coreference (Stanislawek et al., 2019). Moreover, due to the dynamic lattice structure, these methods cannot fully utilize the parallel computation of GPUs.
(2) Another line is to convert the lattice into a graph and use a graph neural network (GNN) to encode it, such as the Lexicon-based Graph Network (LGN) (Gui et al., 2019b) and the Collaborative Graph Network (CGN) (Sui et al., 2019). While the sequential structure remains important for NER, a graph is its more general counterpart, and the gap between them is not negligible. These methods need to use an LSTM as the bottom encoder to carry the sequential inductive bias, which makes the model complicated.
In this paper, we propose FLAT: Flat LAttice Transformer for Chinese NER. The Transformer (Vaswani et al., 2017) adopts fully-connected self-attention to model long-distance dependencies in a sequence. To keep the position information, the Transformer introduces a position representation for each token in the sequence. Inspired by the idea of position representation, we design an ingenious position encoding for the lattice structure, as shown in Figure 1(c). In detail, we assign two positional indices to each token (character or word): a head position and a tail position, from which we can reconstruct a lattice from a set of tokens. Thus, we can directly use the Transformer to fully model the lattice input. The self-attention mechanism of the Transformer enables characters to directly interact with any potential word, including self-matched words. For a character, its self-matched words are the words that include it. For example, in Figure 1(a), the self-matched words of "药 (Drug)" are "人和药店(Renhe Pharmacy)" and "药店 (Pharmacy)" (Sui et al., 2019). Experimental results show that our model outperforms other lexicon-based methods in both performance and inference speed. Our code will be released at https://github.com/LeeSureman/Flat-Lattice-Transformer.

Background
In this section, we briefly introduce the Transformer architecture. Focusing on the NER task, we only discuss the Transformer encoder. It is composed of self-attention and feed-forward network (FFN) layers, and each sublayer is followed by a residual connection and layer normalization. The FFN is a position-wise multi-layer perceptron with a non-linear transformation. The Transformer performs self-attention over the sequence with H heads of attention individually and then concatenates the results of the H heads. For simplicity, we ignore the head index in the following formulas. The result of each head is calculated as:

    Att(A, V) = softmax(A) V,
    A_{ij} = Q_i K_j^T / \sqrt{d_head},        (1)
    [Q, K, V] = E [W_q, W_k, W_v],

where E is the token embedding lookup table or the output of the last Transformer layer, and W_q, W_k, W_v are learnable projection matrices.
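As a concrete illustration, the per-head computation can be sketched in NumPy (a minimal sketch, not the authors' implementation; the dimensions and random inputs are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(E, W_q, W_k, W_v):
    """One attention head: Q, K, V are linear projections of the input E.

    E: (seq_len, d_model) token embeddings or previous-layer output.
    Returns the (seq_len, d_head) head output and the score matrix A.
    """
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    d_head = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d_head)     # scaled dot-product scores, Eq.(1)
    return softmax(A) @ V, A

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 16))          # 5 tokens, d_model = 16 (placeholder)
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
out, A = self_attention_head(E, W_q, W_k, W_v)
```

In the full model, H such heads run in parallel and their outputs are concatenated before the FFN.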
The vanilla Transformer also uses absolute position encoding to capture sequential information. Following Yan et al. (2019), we note that the commutativity of the inner product causes self-attention to lose directionality. Therefore, we consider the relative positions of spans in the lattice to also be significant for NER.

Converting Lattice into Flat Structure
After obtaining a lattice from the characters with a lexicon, we can flatten it into its flat counterpart. The flat-lattice can be defined as a set of spans, where each span corresponds to a token, a head, and a tail, as in Figure 1(c). The token is a character or word. The head and tail denote the position indices of the token's first and last characters in the original sequence, and together they indicate the position of the token in the lattice. For a character, the head and tail are the same. There is a simple algorithm to recover the flat-lattice into its original structure: we first take the tokens whose head equals their tail to reconstruct the character sequence, and then use the remaining tokens (words), with their heads and tails, to build the skip-paths. Since this transformation is recoverable, we assume the flat-lattice can maintain the original structure of the lattice.
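The flattening and recovery procedure can be sketched as follows (an illustrative sketch, not the released implementation; the lexicon-matching loop is a naive stand-in for a real trie-based matcher):

```python
def build_flat_lattice(chars, lexicon):
    """Flatten a character sequence plus lexicon matches into spans.

    Each span is (token, head, tail): characters get head == tail,
    while matched words span their first..last character positions.
    """
    spans = [(c, i, i) for i, c in enumerate(chars)]
    for start in range(len(chars)):
        for end in range(start + 2, len(chars) + 1):   # words of length >= 2
            word = "".join(chars[start:end])
            if word in lexicon:
                spans.append((word, start, end - 1))
    return spans

def recover_lattice(spans):
    """Recover the original lattice: character sequence + word skip-paths."""
    chars = [t for t, h, tl in sorted(spans, key=lambda s: s[1]) if h == tl]
    skip_paths = [(t, h, tl) for t, h, tl in spans if h != tl]
    return chars, skip_paths

# Example sentence from Figure 1: 重庆人和药店
chars = list("重庆人和药店")
lexicon = {"重庆", "重庆人", "人和药店", "药店"}
spans = build_flat_lattice(chars, lexicon)
```

Because every span keeps its head and tail, `recover_lattice` loses no information, which is the recoverability property the text relies on.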

Relative Position Encoding of Spans
The flat-lattice structure consists of spans with different lengths. To encode the interactions among spans, we propose a relative position encoding of spans. For two spans x_i and x_j in the lattice, there are three kinds of relation between them, determined by their heads and tails: intersection, inclusion, and separation. Instead of encoding these three relations directly, we use a dense vector to model their relation, calculated by a continuous transformation of the head and tail information. Thus, we think it can represent not only the relation between two tokens, but also more detailed information, such as the distance between a character and a word. Let head[i] and tail[i] denote the head and tail position of span x_i. Four kinds of relative distances can be used to indicate the relation between x_i and x_j:

    d_{ij}^{(hh)} = head[i] - head[j],
    d_{ij}^{(ht)} = head[i] - tail[j],
    d_{ij}^{(th)} = tail[i] - head[j],
    d_{ij}^{(tt)} = tail[i] - tail[j],

where d_{ij}^{(hh)} denotes the distance between the head of x_i and the head of x_j, and the other distances have similar meanings. The final relative position encoding of spans is a simple non-linear transformation of the four distances:

    R_{ij} = ReLU(W_r (p_{d_{ij}^{(hh)}} ⊕ p_{d_{ij}^{(th)}} ⊕ p_{d_{ij}^{(ht)}} ⊕ p_{d_{ij}^{(tt)}})),

where W_r is a learnable parameter, ⊕ denotes the concatenation operator, and p_d is calculated as in Vaswani et al. (2017):

    p_d^{(2k)} = sin(d / 10000^{2k / d_model}),
    p_d^{(2k+1)} = cos(d / 10000^{2k / d_model}),

where d is any of the four relative distances and k denotes the index of a dimension of the position encoding.

[Table 2 caption fragment: 'PLT' denotes the porous lattice Transformer (Mengge et al., 2019). 'YJ' denotes the lexicon released by Zhang and Yang (2018), and 'LS' denotes the lexicon released by Li et al. (2018). Results of other models are from their original papers, except that the superscript * means the result is not provided in the original paper and was obtained by running the public source code. Subscripts 'msm' and 'mld' denote FLAT with the mask of self-matched words and of long distance (>10), respectively.]

Then we use a variant of self-attention (Dai et al., 2019) to leverage the relative span position encoding as follows:
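The span relative position encoding can be sketched in NumPy as follows (an illustrative sketch with assumed dimensions; `d_model = 32` and the random `W_r` are placeholders, and the pairwise loop would be vectorized in practice):

```python
import numpy as np

def sinusoid(d, d_model):
    """Sinusoidal encoding p_d of a (possibly negative) distance d."""
    k = np.arange(d_model // 2)
    angle = d / np.power(10000.0, 2 * k / d_model)
    p = np.empty(d_model)
    p[0::2] = np.sin(angle)   # even dimensions
    p[1::2] = np.cos(angle)   # odd dimensions
    return p

def span_rel_encoding(head, tail, W_r, d_model=32):
    """R[i, j] = ReLU(W_r (p_hh ⊕ p_th ⊕ p_ht ⊕ p_tt)) for all span pairs."""
    n = len(head)
    R = np.zeros((n, n, W_r.shape[0]))
    for i in range(n):
        for j in range(n):
            p = np.concatenate([
                sinusoid(head[i] - head[j], d_model),   # d^(hh)
                sinusoid(tail[i] - head[j], d_model),   # d^(th)
                sinusoid(head[i] - tail[j], d_model),   # d^(ht)
                sinusoid(tail[i] - tail[j], d_model),   # d^(tt)
            ])
            R[i, j] = np.maximum(0.0, W_r @ p)          # ReLU
    return R

rng = np.random.default_rng(0)
head = [0, 1, 2, 0, 4]   # e.g. three characters plus two words
tail = [0, 1, 2, 1, 5]
W_r = rng.normal(size=(16, 4 * 32))
R = span_rel_encoding(head, tail, W_r)
```

Because the encoding is a continuous function of all four head/tail distances, intersecting, included, and separated span pairs each land in distinct regions of the encoding space.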
    A*_{ij} = E_{x_i}^T W_q W_{k,E}^T E_{x_j} + E_{x_i}^T W_q W_{k,R}^T R_{ij}
              + u^T W_{k,E}^T E_{x_j} + v^T W_{k,R}^T R_{ij},

where W_q, W_{k,R}, W_{k,E} ∈ R^{d_model × d_head} and u, v ∈ R^{d_head} are learnable parameters. We then replace A with A* in Eq.(1); the subsequent calculation is the same as in the vanilla Transformer. After FLAT, we take only the character representations into the output layer, followed by a Conditional Random Field (CRF) (Lafferty et al., 2001).

Experimental Setup
We evaluate our model on four Chinese NER datasets: Ontonotes, MSRA, Resume and Weibo (He and Sun, 2016). We show statistics of these datasets in Table 1. We use the same train, dev and test splits as Gui et al. (2019b). We take BiLSTM-CRF and TENER (Yan et al., 2019) as baseline models; TENER is a Transformer using relative position encoding for NER, without external information. We also compare FLAT with other lexicon-based methods. The embeddings and lexicons are the same as in Zhang and Yang (2018). When comparing with CGN (Sui et al., 2019), we use the same lexicon as CGN. The way hyper-parameters are selected can be found in the supplementary material. In particular, we use only a one-layer Transformer encoder for our model.
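The variant attention score A*, which combines the content terms with the relative-encoding terms and the global biases u and v, can be sketched as follows (an illustrative NumPy sketch under assumed shapes; the scaling by the square root of d_head follows the vanilla attention and the random inputs are placeholders):

```python
import numpy as np

def flat_attention_scores(E, R, W_q, W_kE, W_kR, u, v):
    """Relative-position attention scores A* (Dai et al., 2019 variant).

    E: (n, d_model) span embeddings; R: (n, n, d_model) relative encodings.
    Returns the (n, n) score matrix mixing content and position terms.
    """
    q = E @ W_q                                 # (n, d_head) queries
    kE = E @ W_kE                               # (n, d_head) content keys
    kR = R @ W_kR                               # (n, n, d_head) position keys
    A = q @ kE.T                                # content-content term
    A = A + np.einsum('id,ijd->ij', q, kR)      # content-position term
    A = A + (kE @ u)[None, :]                   # global content bias u
    A = A + kR @ v                              # global position bias v
    return A / np.sqrt(W_q.shape[1])            # scale as in Eq.(1)

rng = np.random.default_rng(0)
n, d_model, d_head = 4, 16, 8
E = rng.normal(size=(n, d_model))
R = rng.normal(size=(n, n, d_model))
W_q, W_kE, W_kR = (rng.normal(size=(d_model, d_head)) for _ in range(3))
u, v = rng.normal(size=d_head), rng.normal(size=d_head)
A_star = flat_attention_scores(E, R, W_q, W_kE, W_kR, u, v)
```

Softmax over `A_star` then weights the value vectors exactly as in the vanilla head.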

Overall Performance
As shown in Table 2, with the lexicon released by Li et al. (2018), our model also outperforms CGN by 0.73 in average F1 score. Perhaps due to characteristics of the Transformer, FLAT's improvement over other lexicon-based models on small datasets is not as significant as that on large datasets.

Advantage of Fully-Connected Structure
We think the self-attention mechanism brings two advantages over the lattice LSTM: 1) all characters can directly interact with their self-matched words; 2) long-distance dependencies can be fully modeled. Since our model has only one layer, we can strip each advantage by masking the corresponding attention. In detail, we mask the attention from a character to its self-matched words, and the attention between tokens whose distance exceeds 10, respectively. As shown in Table 2, the first mask brings a significant deterioration to FLAT, while the second degrades performance only slightly. As a result, we think leveraging the information of self-matched words is important for Chinese NER.
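The two ablation masks can be sketched as follows (an illustrative sketch, not the released code; measuring "distance" head-to-head is a hypothetical choice made here for simplicity):

```python
def build_ablation_masks(spans, max_dist=10):
    """Masks that ablate FLAT's two advantages.

    spans: list of (token, head, tail).
    mask_msm[i][j] = True blocks attention from character i to its
    self-matched word j; mask_mld blocks token pairs whose head
    positions are farther apart than max_dist.
    """
    n = len(spans)
    mask_msm = [[False] * n for _ in range(n)]
    mask_mld = [[False] * n for _ in range(n)]
    for i, (_, hi, ti) in enumerate(spans):
        for j, (_, hj, tj) in enumerate(spans):
            is_char_i = hi == ti
            is_word_j = hj != tj
            # word j self-matches character i if it covers position hi
            if is_char_i and is_word_j and hj <= hi <= tj:
                mask_msm[i][j] = True
            if abs(hi - hj) > max_dist:
                mask_mld[i][j] = True
    return mask_msm, mask_mld

# Spans around "药" in 重庆人和药店 (positions follow Figure 1)
spans = [("药", 4, 4), ("店", 5, 5), ("药店", 4, 5), ("人和药店", 2, 5)]
msm, mld = build_ablation_masks(spans)
```

A masked position is set to negative infinity before the softmax, so the blocked interactions receive zero attention weight.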

Efficiency of FLAT
To verify the computational efficiency of our model, we compare the inference speed of different lexicon-based models on Ontonotes. The result is shown in Figure 3. GNN-based models outperform lattice LSTM and LR-CNN in speed, but the RNN encoder of GNN-based models still degrades them. Because our model has no recurrent module and can fully leverage the parallel computation of GPUs, it outperforms the other methods in running efficiency. In terms of batch-parallelism, the speedup ratio brought by batch-parallelism at batch size 16 is 4.97 for FLAT versus 2.1 for lattice LSTM. Due to the simplicity of our model, it benefits from batch-parallelism more significantly.

How FLAT Brings Improvement
Compared with TENER, FLAT leverages lexicon resources and uses a new position encoding. To probe how these two factors bring improvement, we define two new metrics: 1) Span F: while the common F score used in NER considers the correctness of both the span and the entity type, Span F considers only the former; 2) Type Acc: the proportion of fully correct predictions among span-correct predictions. Table 3 shows the two metrics of three models on the development sets of Ontonotes and MSRA. We find: 1) FLAT outperforms TENER on both metrics significantly.
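The two probing metrics can be computed as follows (a minimal sketch assuming entities are represented as (start, end, type) tuples; this is our reading of the metric definitions, not the authors' evaluation script):

```python
def span_f_and_type_acc(gold, pred):
    """Compute Span F1 (span correctness only) and Type Acc.

    gold, pred: sets of (start, end, type) entity tuples.
    Type Acc = fully-correct predictions / span-correct predictions.
    """
    gold_spans = {(s, e) for s, e, _ in gold}
    pred_spans = [(s, e) for s, e, _ in pred]
    span_correct = [p for p in pred_spans if p in gold_spans]
    full_correct = [p for p in pred if p in gold]
    precision = len(span_correct) / len(pred_spans) if pred_spans else 0.0
    recall = len(span_correct) / len(gold_spans) if gold_spans else 0.0
    span_f = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    type_acc = (len(full_correct) / len(span_correct)
                if span_correct else 0.0)
    return span_f, type_acc

gold = {(0, 1, "GPE"), (2, 5, "ORG")}
pred = {(0, 1, "GPE"), (2, 5, "PER")}
span_f, type_acc = span_f_and_type_acc(gold, pred)
# both spans located (Span F = 1.0), but only one type correct (Type Acc = 0.5)
```

Separating the two scores lets us attribute improvement to entity location versus entity classification.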
2) The improvement on Span F brought by FLAT is more significant than that on Type Acc. 3) Compared to FLAT, FLAT_head's deterioration on Span F is more significant than that on Type Acc. These observations show: 1) The new position encoding helps FLAT locate entities more accurately.
2) The pre-trained word-level embeddings make FLAT more powerful in entity classification (Agarwal et al., 2020).

[Table 4 caption: Comparison between BERT and FLAT+BERT. 'BERT' refers to the BERT+MLP+CRF architecture. 'FLAT+BERT' refers to FLAT using BERT embeddings. We fine-tune BERT in both models during training. The BERT in this experiment is 'BERT-wwm' released by Cui et al. (2019). We use it via the BERTEmbedding in fastNLP (https://github.com/fastnlp/fastNLP).]

Compatibility with BERT
We also compare FLAT equipped with BERT against a common BERT+CRF tagger on the four datasets; results are shown in Table 4. We find that for large datasets like Ontonotes and MSRA, FLAT+BERT yields a significant improvement over BERT, while for small datasets like Resume and Weibo, the improvement of FLAT+BERT over BERT is marginal.

Lexicon-based NER
Zhang and Yang (2018) introduced a lattice LSTM to encode all characters and the potential words recognized by a lexicon in a sentence, avoiding the error propagation of segmentation while leveraging word information. Gui et al. (2019a) exploited a combination of CNNs and a rethinking mechanism to encode the character sequence and potential words at different window sizes. Both models suffer from low inference efficiency and have difficulty modeling long-distance dependencies. Gui et al. (2019b) and Sui et al. (2019) leveraged a lexicon and the character sequence to construct a graph, converting NER into a node classification task. However, due to NER's strong alignment of labels and input, their models still need an RNN module for encoding. The main difference between our model and the models above is that they modify the model structure according to the lattice, whereas we use a well-designed position encoding to indicate the lattice structure.

Lattice-based Transformer
The lattice-based Transformer has been used in speech translation and Chinese-source translation; the main difference among these works is the way they indicate the lattice structure. In Chinese-source translation, Xiao et al. (2019) take the absolute positions of nodes' first characters and the relation between each pair of nodes as the structure information. In speech translation, Sperber et al. (2019) used the longest distance to the start node to indicate the lattice structure, and Zhang et al. (2019) used the shortest distance between two nodes. Our span position encoding is more natural and can be mapped to all three of these schemes, but not vice versa. Because NER is more sensitive to position information than translation, our model is more suitable for NER. Recently, the Porous Lattice Transformer (Mengge et al., 2019) was proposed for Chinese NER. The main difference between FLAT and the Porous Lattice Transformer is the way of representing position information: we use 'head' and 'tail' to represent a token's position in the lattice, while they use 'head', tokens' relative relations (not distances), and an extra GRU. They also use a 'porous' technique to limit the attention distribution. In their model, the position information is not recoverable, because 'head' plus relative relations causes a loss of position information. Briefly, relative distance carries more information than relative relation.

Conclusion and Future Work
In this paper, we introduce a flat-lattice Transformer to incorporate lexicon information for Chinese NER. The core of our model is converting the lattice structure into a set of spans and introducing a specific position encoding. Experimental results show that our model outperforms other lexicon-based models in both performance and efficiency. We leave adapting our model to other kinds of lattices or graphs as future work.

A.1 Hyperparameters Selection
For the two large datasets, MSRA and Ontonotes, we select hyper-parameters based on development experiments on Ontonotes. For the two small datasets, Resume and Weibo, we find the optimal hyper-parameters by random search. Table 5 lists the hyper-parameters obtained from the development experiments on Ontonotes.
Table 6 lists the ranges of the hyper-parameter random search for the Weibo and Resume datasets. Hyper-parameters that do not appear there are the same as in Table 5.
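The random-search procedure can be sketched as follows (a generic sketch, not the authors' tuning script; the hyper-parameter names and the toy objective are hypothetical stand-ins, and in real use `evaluate` would train FLAT and return the dev F1):

```python
import random

def random_search(evaluate, space, n_trials=30, seed=0):
    """Random hyper-parameter search over a dict of candidate lists.

    evaluate: callable mapping a config dict to a dev score (higher is better).
    Returns (best_score, best_config) over n_trials sampled configs.
    """
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Hypothetical search space and stand-in objective for illustration.
space = {"lr": [1e-3, 5e-4, 1e-4], "heads": [4, 8], "head_dim": [16, 20]}
def fake_dev_score(cfg):
    return -abs(cfg["lr"] - 5e-4) - 0.01 * abs(cfg["heads"] - 8)

score, cfg = random_search(fake_dev_score, space)
```

Random search samples the small spaces in Table 6 cheaply and needs no gradient or ordering assumptions about the hyper-parameters.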