Chinese NER Using Lattice LSTM

We investigate a lattice-structured LSTM model for Chinese NER, which encodes a sequence of input characters as well as all potential words that match a lexicon. Compared with character-based methods, our model explicitly leverages word and word sequence information. Compared with word-based methods, lattice LSTM does not suffer from segmentation errors. Gated recurrent cells allow our model to choose the most relevant characters and words from a sentence for better NER results. Experiments on various datasets show that lattice LSTM outperforms both word-based and character-based LSTM baselines, achieving the best results.


Introduction
As a fundamental task in information extraction, named entity recognition (NER) has received constant research attention in recent years. The task has traditionally been solved as a sequence labeling problem, where entity boundary and category labels are jointly predicted. The current state-of-the-art for English NER has been achieved by using LSTM-CRF models (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Liu et al., 2018) with character information being integrated into word representations.
Chinese NER is correlated with word segmentation. In particular, named entity boundaries are also word boundaries. One intuitive way of performing Chinese NER is to perform word segmentation first, before applying word sequence labeling. The segmentation → NER pipeline, however, can suffer from error propagation, since NEs are an important source of OOV words in segmentation, and incorrectly segmented entity boundaries lead to NER errors. This problem can be severe in the open domain since cross-domain word segmentation remains an unsolved problem (Liu and Zhang, 2012; Jiang et al., 2013; Qiu and Zhang, 2015; Chen et al., 2017; Huang et al., 2017). It has been shown that character-based methods outperform word-based methods for Chinese NER (He and Wang, 2008; Liu et al., 2010; Li et al., 2014).
One drawback of character-based NER, however, is that explicit word and word sequence information, which is potentially useful, is not fully exploited. To address this issue, we integrate latent word information into a character-based LSTM-CRF by representing lexicon words from the sentence using a lattice-structured LSTM. As shown in Figure 1, we construct a word-character lattice by matching a sentence with a large automatically-obtained lexicon. As a result, word sequences such as "长江大桥 (Yangtze River Bridge)", "长江 (Yangtze River)" and "大桥 (Bridge)" can be used to disambiguate potentially relevant named entities in a context, such as the person name "江大桥 (Daqiao Jiang)".
Since there are an exponential number of word-character paths in a lattice, we leverage a lattice LSTM structure for automatically controlling information flow from the beginning of the sentence to the end. As shown in Figure 2, gated cells are used to dynamically route information from different paths to each character. Trained over NER data, the lattice LSTM can learn to find more useful words from context automatically for better NER performance. Compared with character-based and word-based NER methods, our model has the advantage of leveraging explicit word information over character sequence labeling without suffering from segmentation errors.
Results show that our model significantly outperforms both character sequence labeling models and word sequence labeling models using LSTM-CRF, giving the best results over a variety of Chinese NER datasets across different domains. Our code and data are released at https://github.com/jiesutd/LatticeLSTM.

Related Work
Our work is in line with existing methods using neural networks for NER. Hammerton (2003) attempted to solve the problem using a unidirectional LSTM, which was among the first neural models for NER. Collobert et al. (2011) used a CNN-CRF structure, obtaining competitive results to the best statistical models. dos Santos et al. (2015) used a character CNN to augment a CNN-CRF model. Most recent work leverages an LSTM-CRF architecture. Some of this work uses hand-crafted spelling features; Ma and Hovy (2016) and Chiu and Nichols (2016) use a character CNN to represent spelling characteristics; Lample et al. (2016) use a character LSTM instead. Our baseline word-based system takes a similar structure to this line of work.
Character sequence labeling has been the dominant approach for Chinese NER (Chen et al., 2006b; Lu et al., 2016). There have been explicit discussions comparing statistical word-based and character-based methods for the task, showing that the latter is empirically a superior choice (He and Wang, 2008; Liu et al., 2010; Li et al., 2014). We find that with proper representation settings, the same conclusion holds for neural NER. On the other hand, lattice LSTM is a better choice compared with both word LSTM and character LSTM.
How to better leverage word information for Chinese NER has received continued research attention (Gao et al., 2005), where segmentation information has been used as soft features for NER (Zhao and Kit, 2008; Peng and Dredze, 2015; He and Sun, 2017a), and joint segmentation and NER has been investigated using dual decomposition (Xu et al., 2014), multi-task learning (Peng and Dredze, 2016), etc. Our work is in line with this research, focusing on neural representation learning. While the above methods can be affected by segmented training data and segmentation errors, our method does not require a word segmentor. The model is also conceptually simpler because it does not consider multi-task settings.
External sources of information have been leveraged for NER. In particular, lexicon features have been widely used (Collobert et al., 2011; Passos et al., 2014; Luo et al., 2015). Rei (2017) uses a word-level language modeling objective to augment NER training, performing multi-task learning over large raw text. Peters et al. (2017) pretrain a character language model to enhance word representations. Yang et al. (2017b) exploit cross-domain and cross-lingual knowledge via multi-task learning. We leverage external data by pretraining a word embedding lexicon over large automatically-segmented texts, while semi-supervised techniques such as language modeling are orthogonal to and can also be used for our lattice LSTM model.
Lattice-structured RNNs can be viewed as a natural extension of tree-structured RNNs (Tai et al., 2015) to DAGs. They have been used to model motion dynamics, dependency-discourse DAGs, as well as speech tokenization lattices (Sperber et al., 2017) and multi-granularity segmentation outputs (Su et al., 2017) for NMT encoders. Compared with existing work, our lattice LSTM is different in both motivation and structure. For example, being designed for character-centric lattice-LSTM-CRF sequence labeling, it has recurrent cells but not hidden vectors for words. To our knowledge, we are the first to design a lattice LSTM representation for mixed characters and lexicon words, and the first to use a word-character lattice for segmentation-free Chinese NER.

Model
We follow the best English NER models (Ma and Hovy, 2016; Lample et al., 2016), using LSTM-CRF as the main network structure. Formally, denote an input sentence as $s = c_1, c_2, \ldots, c_m$, where $c_j$ denotes the $j$th character. $s$ can further be seen as a word sequence $s = w_1, w_2, \ldots, w_n$, where $w_i$ denotes the $i$th word in the sentence, obtained using a Chinese segmentor. We use $t(i, k)$ to denote the index $j$ for the $k$th character in the $i$th word in the sentence. Take the sentence in Figure 1 for example. If the segmentation is "南京市 长江大桥", and indices are from 1, then $t(2, 1) = 4$ (长) and $t(1, 3) = 3$ (市). We use the BIOES tagging scheme (Ratinov and Roth, 2009) for both word-based and character-based NER tagging.
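The index function $t(i, k)$ and the BIOES scheme can be illustrated with a minimal Python sketch. The helper names `char_index` and `bioes_tags`, the 0-based inclusive span format, and the entity types `GPE` and `LOC` are assumptions made for illustration only; they are not taken from the released code.

```python
def char_index(words, i, k):
    """Return t(i, k): the 1-based sentence index of the k-th character
    of the i-th word (both i and k are 1-based, as in the text)."""
    if not (1 <= i <= len(words)) or not (1 <= k <= len(words[i - 1])):
        raise IndexError("i or k out of range")
    # Characters in all preceding words, plus the offset inside word i.
    return sum(len(w) for w in words[: i - 1]) + k


def bioes_tags(length, spans):
    """Convert entity spans to BIOES tags.

    spans: iterable of (start, end, label) with 0-based inclusive ends
    (a hypothetical format chosen for this sketch)."""
    tags = ["O"] * length
    for start, end, label in spans:
        if start == end:
            tags[start] = "S-" + label      # single-unit entity
        else:
            tags[start] = "B-" + label      # beginning
            for pos in range(start + 1, end):
                tags[pos] = "I-" + label    # inside
            tags[end] = "E-" + label        # end
    return tags


words = ["南京市", "长江大桥"]
sentence = "".join(words)
print(char_index(words, 2, 1), sentence[char_index(words, 2, 1) - 1])
print(char_index(words, 1, 3), sentence[char_index(words, 1, 3) - 1])
print(bioes_tags(len(sentence), [(0, 2, "GPE"), (3, 6, "LOC")]))
```

The same `bioes_tags` helper applies to word-level tagging if the spans index words rather than characters.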

Character-Based Model
The character-based model is shown in Figure 3(a). It uses an LSTM-CRF model on the character sequence $c_1, c_2, \ldots, c_m$. Each character $c_j$ is represented using

$$x^c_j = e^c(c_j) \tag{1}$$

where $e^c$ denotes a character embedding lookup table.
A bidirectional LSTM (structurally the same as Eq. 11) is applied to $x_1, x_2, \ldots, x_m$ to obtain $\overrightarrow{h}^c_1, \overrightarrow{h}^c_2, \ldots, \overrightarrow{h}^c_m$ and $\overleftarrow{h}^c_1, \overleftarrow{h}^c_2, \ldots, \overleftarrow{h}^c_m$ in the left-to-right and right-to-left directions, respectively, with two distinct sets of parameters. The hidden vector representation of each character is:

$$h^c_j = [\overrightarrow{h}^c_j; \overleftarrow{h}^c_j] \tag{2}$$

A standard CRF model (Eq. 17) is used on $h^c_1, h^c_2, \ldots, h^c_m$ for sequence labelling.
• Char + bichar. Character bigrams have been shown useful for representing characters in word segmentation (Chen et al., 2015; Yang et al., 2017a). We augment the character-based model with bigram information by concatenating bigram embeddings with character embeddings:

$$x^c_j = [e^c(c_j); e^b(c_j, c_{j+1})] \tag{3}$$

where $e^b$ denotes a character bigram lookup table.
• Char + softword. It has been shown that using segmentation as soft features for character-based NER models can lead to improved performance (Zhao and Kit, 2008; Peng and Dredze, 2016). We augment the character representation with segmentation information by concatenating segmentation label embeddings to character embeddings:

$$x^c_j = [e^c(c_j); e^s(seg(c_j))] \tag{4}$$

where $e^s$ represents a segmentation label embedding lookup table. $seg(c_j)$ denotes the segmentation label on the character $c_j$ given by a word segmentor. We use the BMES scheme for representing segmentation (Xue, 2003).
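The discrete inputs behind the two augmented character models can be sketched as follows: the bigram $(c_j, c_{j+1})$ used by char + bichar, and the BMES label $seg(c_j)$ used by char + softword. The function names and the end-of-sentence padding token `</s>` are illustrative assumptions; the source does not say how the bigram of the last character is formed.

```python
def char_bigrams(sentence, pad="</s>"):
    """Pair each character c_j with its right neighbour c_{j+1}.
    The last character is paired with a padding token (an assumption)."""
    return [
        (sentence[j], sentence[j + 1] if j + 1 < len(sentence) else pad)
        for j in range(len(sentence))
    ]


def bmes_labels(words):
    """Map a segmented sentence to one BMES label per character:
    B = word-initial, M = word-internal, E = word-final, S = single-character word."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


print(char_bigrams("南京市"))
print(bmes_labels(["南京市", "长江大桥"]))
```

Each bigram or label would then index its own embedding table ($e^b$ or $e^s$), and the resulting vector is concatenated with $e^c(c_j)$.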
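
Word-Based Model
The word-based model is shown in Figure 3(b). It takes the word embedding $e^w(w_i)$ to represent each word $w_i$:

$$x^w_i = e^w(w_i) \tag{5}$$

where $e^w$ denotes a word embedding lookup table. A bidirectional LSTM (Eq. 11) is used to obtain a left-to-right sequence of hidden states $\overrightarrow{h}^w_1, \overrightarrow{h}^w_2, \ldots, \overrightarrow{h}^w_n$ and a right-to-left sequence of hidden states $\overleftarrow{h}^w_1, \overleftarrow{h}^w_2, \ldots, \overleftarrow{h}^w_n$ for the words $w_1, w_2, \ldots, w_n$. For each word $w_i$, the two are concatenated as its representation:

$$h^w_i = [\overrightarrow{h}^w_i; \overleftarrow{h}^w_i] \tag{6}$$

Similar to the character-based case, a standard CRF model (Eq. 17) is used on $h^w_1, h^w_2, \ldots, h^w_n$ for sequence labelling.
Integrating character representations. Both character CNN (Ma and Hovy, 2016) and LSTM (Lample et al., 2016) have been used for representing the character sequence within a word. We experiment with both for Chinese NER. Denoting the representation of characters within $w_i$ as $x^c_i$, a new word representation is obtained by concatenation of $e^w(w_i)$ and $x^c_i$:

$$x^w_i = [e^w(w_i); x^c_i] \tag{7}$$

• Word + char LSTM. Denoting the embedding of each input character as $e^c(c_j)$, we use a bidirectional LSTM (Eq. 11) to learn hidden states $\overrightarrow{h}^c_{t(i,1)}, \ldots, \overrightarrow{h}^c_{t(i,len(i))}$ and $\overleftarrow{h}^c_{t(i,1)}, \ldots, \overleftarrow{h}^c_{t(i,len(i))}$ for the characters $c_{t(i,1)}, \ldots, c_{t(i,len(i))}$ of $w_i$, where $len(i)$ denotes the number of characters in $w_i$. The final character representation for $w_i$ is:

$$x^c_i = [\overrightarrow{h}^c_{t(i,len(i))}; \overleftarrow{h}^c_{t(i,1)}] \tag{8}$$

• Word + char LSTM′. We investigate a variation of the word + char LSTM model that uses a single LSTM to obtain $\overrightarrow{h}^c_j$ and $\overleftarrow{h}^c_j$ for each $c_j$. It is similar to the structure of Liu et al. (2018) but does not use the highway layer. The same LSTM structure as defined in Eq. 11 is used, and the same method as Eq. 8 is used to integrate character hidden states into word representations.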
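• Word + char CNN. A standard CNN (LeCun et al., 1989) structure is used on the character sequence of each word to obtain its character representation $x^c_i$. Denoting the embedding of character $c_j$ as $e^c(c_j)$, the vector $x^c_i$ is given by:

$$x^c_i = \max_{t(i,1) \le j \le t(i,len(i))} \left( W_{CNN}^{\top} \begin{bmatrix} e^c(c_{j-\frac{ke-1}{2}}) \\ \vdots \\ e^c(c_{j+\frac{ke-1}{2}}) \end{bmatrix} + b_{CNN} \right) \tag{9}$$

where $W_{CNN}$ and $b_{CNN}$ are parameters, $ke = 3$ is the kernel size and max denotes max pooling.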
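A minimal pure-Python sketch of the word + char CNN representation: a window of $ke$ character embeddings is projected by a linear filter bank, and the outputs are max-pooled over the characters of the word. Zero padding at the word boundaries and the row-per-output-channel layout of `W` are assumptions of this sketch, not details stated in the text.

```python
def char_cnn(char_embs, W, b, ke=3):
    """Character CNN with max pooling over one word.

    char_embs: list of len(i) embedding vectors, each of dimension d
    W:         list of out_dim filters, each a flat list of ke * d weights
    b:         list of out_dim biases
    Returns the pooled vector x^c_i of length out_dim."""
    d = len(char_embs[0])
    half = (ke - 1) // 2
    # Assumption: positions outside the word are zero vectors.
    pad = [[0.0] * d for _ in range(half)]
    padded = pad + char_embs + pad
    pooled = [float("-inf")] * len(W)
    for j in range(len(char_embs)):
        # Concatenate the ke embeddings centred on character j.
        window = [v for vec in padded[j:j + ke] for v in vec]
        for o, (filt, bias) in enumerate(zip(W, b)):
            activation = sum(wv * xv for wv, xv in zip(filt, window)) + bias
            pooled[o] = max(pooled[o], activation)
    return pooled


# Toy example with 1-dimensional embeddings and two filters.
embs = [[1.0], [2.0], [3.0]]
filters = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]
print(char_cnn(embs, filters, [0.5, 0.0]))
```

Because the pooling takes a maximum over positions, the output size depends only on the number of filters, not on the word length.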

Lattice Model
The overall structure of the word-character lattice model is shown in Figure 2, which can be viewed as an extension of the character-based model, integrating word-based cells and additional gates for controlling information flow. As shown in Figure 3(c), the input to the model is a character sequence $c_1, c_2, \ldots, c_m$, together with all character subsequences that match words in a lexicon $D$. As indicated in Section 2, we use automatically segmented large raw text for building $D$. Using $w^d_{b,e}$ to denote such a subsequence that begins with character index $b$ and ends with character index $e$, the segment $w^d_{1,2}$ in Figure 1 is "南京 (Nanjing)" and $w^d_{6,7}$ is "大桥 (Bridge)". Four types of vectors are involved in the model, namely input vectors, output hidden vectors, cell vectors and gate vectors. As basic components, a character input vector is used to represent each character $c_j$ as in the character-based model:

$$x^c_j = e^c(c_j) \tag{10}$$

The basic recurrent structure of the model is constructed using a character cell vector $c^c_j$ and a hidden vector $h^c_j$ on each $c_j$, where $c^c_j$ serves to record recurrent information flow from the beginning of the sentence to $c_j$ and $h^c_j$ is used for CRF sequence labelling using Eq. 17.
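The lexicon-matching step that produces the subsequences $w^d_{b,e}$ can be sketched with a brute-force scan. The function name `match_lexicon` and the toy lexicon are hypothetical; a real lexicon with hundreds of thousands of entries would more likely use a trie. With 1-based indices on the seven-character example sentence, "南京" spans positions 1 to 2 and "大桥" spans positions 6 to 7.

```python
def match_lexicon(sentence, lexicon, max_len=None):
    """Return all (b, e, word) with 1-based inclusive indices such that
    sentence[b..e] is a multi-character word in the lexicon."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=0)
    matches = []
    n = len(sentence)
    for b in range(n):
        # e starts at b + 1 so that single-character words are skipped.
        for e in range(b + 1, min(n, b + max_len)):
            word = sentence[b:e + 1]
            if word in lexicon:
                matches.append((b + 1, e + 1, word))
    return matches


# Toy lexicon, for illustration only.
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥", "江大桥"}
for b, e, word in match_lexicon("南京市长江大桥", lexicon):
    print(b, e, word)
```

Every returned triple corresponds to one word cell in the lattice, linked from character $b$ to character $e$.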
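The basic recurrent LSTM functions are:

$$\begin{aligned}
\begin{bmatrix} i^c_j \\ o^c_j \\ f^c_j \\ \tilde{c}^c_j \end{bmatrix} &= \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{c\top} \begin{bmatrix} x^c_j \\ h^c_{j-1} \end{bmatrix} + b^c \right) \\
c^c_j &= f^c_j \odot c^c_{j-1} + i^c_j \odot \tilde{c}^c_j \\
h^c_j &= o^c_j \odot \tanh(c^c_j)
\end{aligned} \tag{11}$$

where $i^c_j$, $f^c_j$ and $o^c_j$ denote a set of input, forget and output gates, respectively. $W^c$ and $b^c$ are model parameters. $\sigma()$ represents the sigmoid function.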
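Different from the character-based model, however, the computation of $c^c_j$ now considers lexicon subsequences $w^d_{b,e}$ in the sentence. In particular, each subsequence $w^d_{b,e}$ is represented using

$$x^w_{b,e} = e^w(w^d_{b,e}) \tag{12}$$

where $e^w$ denotes the same word embedding lookup table as in Section 3.2. In addition, a word cell $c^w_{b,e}$ is used to represent the recurrent state of $x^w_{b,e}$ from the beginning of the sentence. The value of $c^w_{b,e}$ is calculated by:

$$\begin{aligned}
\begin{bmatrix} i^w_{b,e} \\ f^w_{b,e} \\ \tilde{c}^w_{b,e} \end{bmatrix} &= \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{w\top} \begin{bmatrix} x^w_{b,e} \\ h^c_b \end{bmatrix} + b^w \right) \\
c^w_{b,e} &= f^w_{b,e} \odot c^c_b + i^w_{b,e} \odot \tilde{c}^w_{b,e}
\end{aligned} \tag{13}$$

where $i^w_{b,e}$ and $f^w_{b,e}$ are a set of input and forget gates. There is no output gate for word cells since labeling is performed only at the character level.
With $c^w_{b,e}$, there are more recurrent paths for information flow into each $c^c_j$. We link all $c^w_{b,e}$ with $b \in \{b' \mid w^d_{b',e} \in D\}$ to the cell $c^c_e$, and use an additional gate $i^c_{b,e}$ for each subsequence cell $c^w_{b,e}$ to control its contribution to $c^c_e$:

$$i^c_{b,e} = \sigma \left( W^{l\top} \begin{bmatrix} x^c_e \\ c^w_{b,e} \end{bmatrix} + b^l \right) \tag{14}$$

The calculation of cell values $c^c_j$ thus becomes

$$c^c_j = \sum_{b \in \{b' \mid w^d_{b',j} \in D\}} \alpha^c_{b,j} \odot c^w_{b,j} + \alpha^c_j \odot \tilde{c}^c_j \tag{15}$$

In Eq. 15, the gate values $i^c_{b,j}$ and $i^c_j$ are normalised to $\alpha^c_{b,j}$ and $\alpha^c_j$ by setting the sum to 1:

$$\alpha^c_{b,j} = \frac{\exp(i^c_{b,j})}{\exp(i^c_j) + \sum_{b' \in \{b'' \mid w^d_{b'',j} \in D\}} \exp(i^c_{b',j})}, \qquad \alpha^c_j = \frac{\exp(i^c_j)}{\exp(i^c_j) + \sum_{b' \in \{b'' \mid w^d_{b'',j} \in D\}} \exp(i^c_{b',j})} \tag{16}$$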
The final hidden vectors $h^c_j$ are still computed as described by Eq. 11. During NER training, loss values back-propagate to the parameters $W^c$, $b^c$, $W^w$, $b^w$, $W^l$ and $b^l$, allowing the model to dynamically focus on more relevant words during NER labelling.
We experimented with alternative configurations for indexing word and character path links, finding that this configuration gives the best results in preliminary experiments. Single-character words are excluded; the final performance drops slightly after integrating single-character words.
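The gate normalisation and cell mixing referred to as Eq. 15 can be sketched in pure Python, assuming the normalisation is an element-wise softmax over the character input gate and the gates of all lexicon words ending at position $j$. The function names are hypothetical, the gates are assumed to be already computed, and the case in which no lexicon word ends at $j$ is deliberately left out of the sketch.

```python
import math


def normalise_gates(i_char, word_gates):
    """Element-wise softmax over the character gate i^c_j and the word
    gates i^c_{b,j}. Returns (alpha_char, alpha_words)."""
    alpha_char = []
    alpha_words = [[] for _ in word_gates]
    for k in range(len(i_char)):
        e_char = math.exp(i_char[k])
        e_words = [math.exp(g[k]) for g in word_gates]
        z = e_char + sum(e_words)
        alpha_char.append(e_char / z)
        for alpha, e_word in zip(alpha_words, e_words):
            alpha.append(e_word / z)
    return alpha_char, alpha_words


def lattice_cell(i_char, c_tilde, word_gates, word_cells):
    """Mix the candidate character cell with the word cells that end at
    this character, weighted by the normalised gates."""
    if not word_cells:
        raise ValueError("this sketch only covers positions with matched words")
    alpha_char, alpha_words = normalise_gates(i_char, word_gates)
    cell = []
    for k in range(len(i_char)):
        value = alpha_char[k] * c_tilde[k]
        value += sum(a[k] * c[k] for a, c in zip(alpha_words, word_cells))
        cell.append(value)
    return cell


# One word ends here; equal gate values give equal weights of 0.5.
print(lattice_cell([0.0], [1.0], [[0.0]], [[3.0]]))
```

Note that the mixture has no forget-gate term: information from earlier characters reaches the cell only through the candidate cell and the word cells.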

Decoding and Training
A standard CRF layer is used on top of $h_1, h_2, \ldots, h_\tau$, where $\tau$ is $m$ for character-based and lattice-based models and $n$ for word-based models. The probability of a label sequence $y = l_1, l_2, \ldots, l_\tau$ is

$$P(y \mid s) = \frac{\exp\left(\sum_i \left(W^{l_i}_{CRF} h_i + b^{(l_{i-1}, l_i)}_{CRF}\right)\right)}{\sum_{y'} \exp\left(\sum_i \left(W^{l'_i}_{CRF} h_i + b^{(l'_{i-1}, l'_i)}_{CRF}\right)\right)} \tag{17}$$

Here $y'$ represents an arbitrary label sequence, $W^{l_i}_{CRF}$ is a model parameter specific to $l_i$, and $b^{(l_{i-1}, l_i)}_{CRF}$ is a bias specific to $l_{i-1}$ and $l_i$. We use the first-order Viterbi algorithm to find the highest scored label sequence over a word-based or character-based input sequence. Given a set of manually labeled training data $\{(s_i, y_i)\}|_{i=1}^{N}$, sentence-level log-likelihood loss with $L_2$ regularization is used to train the model:

$$L = -\sum_{i=1}^{N} \log P(y_i \mid s_i) + \frac{\lambda}{2} \lVert \Theta \rVert^2 \tag{18}$$

where $\lambda$ is the $L_2$ regularization parameter and $\Theta$ represents the parameter set.
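First-order Viterbi decoding over the CRF scores can be sketched as below. Here `emissions[i][l]` stands for the label-specific score of label `l` at position `i`, and `transitions[p][l]` for the bias of moving from label `p` to label `l`; both names are illustrative. The sketch omits any special start or end transition, which is a simplifying assumption.

```python
def viterbi(emissions, transitions):
    """Return (best_path, best_score) for a linear-chain model.

    emissions:   list of tau lists, each with one score per label
    transitions: square matrix, transitions[prev][cur]"""
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_labels):
            # Best previous label for ending in `cur` at this position.
            best_prev = max(
                range(n_labels),
                key=lambda prev: score[prev] + transitions[prev][cur],
            )
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda label: score[label])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(viterbi(emissions, [[0.0, 0.0], [0.0, 0.0]]))
print(viterbi(emissions, [[0.0, -5.0], [-5.0, 0.0]]))
```

With zero transition scores the decoder simply follows the per-position maxima; a strong penalty on label switches makes it prefer a constant path instead.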

Experiments
We carry out an extensive set of experiments to investigate the effectiveness of word-character lattice LSTMs across different domains. In addition, we aim to empirically compare word-based and character-based neural Chinese NER under different settings. Standard precision (P), recall (R) and F1-score (F1) are used as evaluation metrics.
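As a minimal sketch of the evaluation metrics, assuming the usual entity-level exact-match convention (a predicted entity counts as correct only if both its span and its type match a gold entity), precision, recall and F1 can be computed as follows. The span format and the function name `prf1` are illustrative assumptions.

```python
def prf1(gold, pred):
    """Entity-level precision, recall and F1 under exact matching.

    gold, pred: iterables of hashable entities, e.g. (start, end, type)."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [(0, 2, "GPE"), (3, 6, "LOC")]
pred = [(0, 2, "GPE"), (4, 6, "PER")]
print(prf1(gold, pred))
```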

Experimental Settings
Data. Four datasets are used in this paper: OntoNotes 4 (Weischedel et al., 2011), MSRA (Levow, 2006), Weibo NER (Peng and Dredze, 2015) and a Chinese resume dataset that we annotate. For more variety in test domains, we collected the resume dataset from Sina Finance, which consists of resumes of senior executives from listed companies in the Chinese stock market. We randomly selected 1027 resume summaries and manually annotated 8 types of named entities with the YEDDA system (Yang et al., 2018). Statistics of the dataset are shown in Table 2. The inter-annotator agreement is 97.1%. We release this dataset as a resource for further research.
Segmentation. For the OntoNotes and MSRA datasets, gold-standard segmentation is available in the training sections. For OntoNotes, gold segmentation is also available for the development and test sections. On the other hand, no segmentation is available for the MSRA test sections, nor for the Weibo and resume datasets. As a result, OntoNotes is leveraged for studying oracle situations where gold segmentation is given. We use the neural word segmentor of Yang et al. (2017a) to automatically segment the development and test sets for word-based NER. In particular, for the OntoNotes and MSRA datasets, we train the segmentor using gold segmentation on their respective training sets. For Weibo and resume, we take the best model of Yang et al. (2017a), which is trained using CTB 6.0 (Xue et al., 2005).
Word Embeddings. We pretrain word embeddings using word2vec (Mikolov et al., 2013) over automatically segmented Chinese Giga-Word (footnote 6), obtaining 704.4k words in a final lexicon. In particular, the numbers of single-character, two-character and three-character words are 5.7k, 291.5k and 278.1k, respectively. The embedding lexicon is released alongside our code and models as a resource for further research. Word embeddings are fine-tuned during NER training. Character and character bigram embeddings are pretrained on Chinese Giga-Word using word2vec and fine-tuned at model training.
Hyper-parameter settings. Table 3 shows the values of hyper-parameters for our models, which are fixed according to previous work in the literature without grid-search adjustments for each individual dataset. In particular, the embedding sizes are set to 50 and the hidden size of LSTM models to 200. Dropout (Srivastava et al., 2014) is applied to both word and character embeddings with a rate of 0.5. Stochastic gradient descent (SGD) is used for optimization, with an initial learning rate of 0.015 and a decay rate of 0.05.
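The settings above can be collected in a small configuration sketch. The text does not state how the decay rate is applied, so the inverse-time schedule below (learning rate divided by one plus decay times the epoch index) is an assumption chosen for illustration, as are the dictionary keys and the function name.

```python
# Values quoted in the text; the key names are hypothetical.
HYPERPARAMS = {
    "embedding_size": 50,
    "lstm_hidden_size": 200,
    "dropout": 0.5,
    "initial_lr": 0.015,
    "lr_decay": 0.05,
}


def decayed_lr(initial_lr, decay, epoch):
    """Inverse-time decay (assumed schedule): lr_0 / (1 + decay * epoch)."""
    return initial_lr / (1.0 + decay * epoch)


for epoch in range(3):
    lr = decayed_lr(HYPERPARAMS["initial_lr"], HYPERPARAMS["lr_decay"], epoch)
    print(epoch, round(lr, 6))
```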

Development Experiments
We compare various model configurations on the OntoNotes development set, in order to select the best settings for word-based and character-based NER models, and to learn the influence of lattice word information on character-based models.
Character-based NER. As shown in Table 4, without using word segmentation, a character-based LSTM-CRF model gives a development F1-score of 62.47%. Adding character-bigram and softword representations as described in Section 3.1 increases the F1-score to 67.63% and 65.71%, respectively, demonstrating the usefulness of both sources of information. In addition, a combination of both gives a 69.64% F1-score, which is the best among the character-based settings.
Word-based NER. Table 4 shows a variety of different settings for word-based Chinese NER. With automatic segmentation, a word-based LSTM-CRF baseline gives a 64.12% F1-score, which is higher than that of the character-based baseline. This demonstrates that both word information and character information are useful for Chinese NER. The two methods of using character LSTM to enrich word representations in Section 3.2, namely word+char LSTM and word+char LSTM′, lead to similar improvements.
6 https://catalog.ldc.upenn.edu/LDC2011T13
A CNN representation of character sequences gives a slightly higher F1-score compared to LSTM character representations. On the other hand, further using character bigram information leads to increased F1-score over word+char LSTM, but decreased F1-score over word+char CNN. A possible reason is that CNN inherently captures character n-gram information. As a result, we use word+char+bichar LSTM for word-based NER in the remaining experiments, which gives the best development results, and is structurally consistent with the state-of-the-art English NER models in the literature.
Lattice-based NER. Figure 4 shows the F1-score of character-based and lattice-based models against the number of training iterations. We include models that use concatenated character and character bigram embeddings, where bigrams can play a role in disambiguating characters. As can be seen from the figure, lattice word information is useful for improving character-based NER, improving the best development result from 62.5% to 71.6%. On the other hand, the bigram-enhanced lattice model does not lead to further improvements compared with the original lattice model. This is likely because words are better sources of information for character disambiguation compared with bigrams, which are also ambiguous. As shown in Table 4, the lattice LSTM-CRF model gives a development F1-score of 71.62%, which is significantly higher compared with both the word-based and character-based methods, even though it does not use character bigrams or word segmentation information. The fact that it significantly outperforms char+softword shows the advantage of lattice word information as compared with segmentor word information.

Final Results
OntoNotes. The OntoNotes test results are shown in Table 5. With gold-standard segmentation, our word-based methods give competitive results to the state-of-the-art on the dataset, which leverages bilingual data. This demonstrates that LSTM-CRF is a competitive choice for word-based Chinese NER, as it is for other languages.
MSRA. Results on the MSRA dataset are shown in Table 6. For this benchmark, no gold-standard segmentation is available on the test set. Our chosen segmentor gives 95.93% accuracy on the 5-fold cross-validated training set. The best statistical models on the dataset leverage rich handcrafted features (Chen et al., 2006a; Zhang et al., 2006; Zhou et al., 2013) and character embedding features (Lu et al., 2016). Other work exploits a neural LSTM-CRF with radical features.
Compared with the existing methods, our word-based and character-based LSTM-CRF models give competitive accuracies. The lattice model significantly outperforms both the best character-based and word-based models (p < 0.01), achieving the best result on this standard benchmark.
Weibo/resume. Results on the Weibo NER dataset are shown in Table 7, where NE, NM and Overall denote F1-scores for named entities, nominal entities (excluding named entities) and both, respectively. Gold-standard segmentation is not available for this dataset. Existing state-of-the-art systems include Peng and Dredze (2016) and He and Sun (2017b), who explore rich embedding features, cross-domain and semi-supervised data, some of which are orthogonal to our model.
Results on the resume NER test data are shown in Table 8. Consistent with observations on OntoNotes and MSRA, the lattice model significantly outperforms both the word-based model and the character-based model for Weibo and resume (p < 0.01), giving state-of-the-art results.

Discussion
F1 against sentence length. Figure 5 shows the F1-scores of the baseline models and lattice LSTM-CRF on the OntoNotes dataset. The character-based baseline gives relatively stable F1-scores over different sentence lengths, although the performances are relatively low. The word-based baseline gives substantially higher F1-scores over short sentences, but lower F1-scores over long sentences, possibly because of lower segmentation accuracies over longer sentences. Both word+char+bichar and char+bichar+softword give better performances compared to their respective baselines, showing that word and character representations are complementary for NER. The accuracy of lattice also decreases as the sentence length increases, which can result from the exponentially increasing number of word combinations in the lattice. Compared with word+char+bichar and char+bichar+softword, the lattice model shows more robustness to increased sentence lengths, demonstrating the more effective use of word information.
Table 9: Example. Red and green represent incorrect and correct entities, respectively.
Note that both word+char+bichar and lattice use the same source of word information, namely the same pretrained word embedding lexicon. However, word+char+bichar first uses the lexicon in the segmentor, which imposes hard constraints (i.e. fixed words) on its subsequent use in NER. In contrast, lattice LSTM has the freedom of considering all lexicon words.
Entities in lexicon. Table 10 shows the total number of entities and their respective match ratios in the lexicon. The error reductions (ER) of the final lattice model over the best character-based method (i.e. "+bichar+softword") are also shown. It can be seen that error reductions correlate with the ratio of matched entities in the lexicon. In this respect, our automatic lexicon also played to some extent the role of a gazetteer (Ratinov and Roth, 2009; Chiu and Nichols, 2016), but not fully since there is no explicit knowledge in the lexicon of which tokens are entities. The ultimate disambiguation power still lies in the lattice encoder and supervised learning.
The quality of the lexicon may affect the accuracy of our NER model since noise words can potentially confuse NER. On the other hand, our lattice model can potentially learn to select more correct words during NER training. We leave the investigation of such influence to future work.

Conclusion
We empirically investigated a lattice LSTM-CRF representation for Chinese NER, finding that it gives consistently superior performance compared to word-based and character-based LSTM-CRF across different domains. The lattice method is fully independent of word segmentation, yet more effective in using word information thanks to the freedom of choosing lexicon words in a context for NER disambiguation.