Dependency-Guided LSTM-CRF for Named Entity Recognition

Dependency tree structures capture long-distance and syntactic relationships between words in a sentence. The syntactic relations (e.g., nominal subject, object) can potentially infer the existence of certain named entities. In addition, the performance of a named entity recognizer could benefit from the long-distance dependencies between the words in dependency trees. In this work, we propose a simple yet effective dependency-guided LSTM-CRF model to encode the complete dependency trees and capture the above properties for the task of named entity recognition (NER). The data statistics show strong correlations between the entity types and dependency relations. We conduct extensive experiments on several standard datasets and demonstrate the effectiveness of the proposed model in improving NER and achieving state-of-the-art performance. Our analysis reveals that the significant improvements mainly result from the dependency relations and long-distance interactions provided by dependency trees.


Introduction
Named entity recognition (NER) is one of the most important and fundamental tasks in natural language processing (NLP). Named entities capture useful semantic information, which has been shown to be helpful for downstream NLP tasks such as coreference resolution (Lee et al., 2017), relation extraction (Miwa and Bansal, 2016) and semantic parsing (Dong and Lapata, 2018). On the other hand, dependency trees also capture useful semantic information within natural language sentences. Existing research efforts have derived useful discrete features from dependency structures (Sasano and Kurohashi, 2008; Cucchiarelli and Velardi, 2001; Ling and Weld, 2012) or structural constraints (Jie et al., 2017) to help the NER task. However, how to make good use of the rich relational information as well as the complex long-distance interactions among words, as conveyed by the complete dependency structures, for improved NER remains an open research question. The first example in Figure 1 illustrates the relationship between a dependency structure and a named entity. Specifically, the word "premises", which is a named entity of type LOC (location), is characterized by the incoming arc with label "pobj" (prepositional object). This arc reveals a certain level of the semantic role that the word "premises" plays in the sentence. Similarly, the two words "Hong Kong" in the second example, which form an entity of type GPE, are also characterized by a similar dependency arc towards them.
The long-distance dependencies capturing nonlocal structural information can also be very helpful for the NER task (Finkel et al., 2005). In the second example of Figure 1, the long-distance dependency from "held" to "seminar" indicates a direct relation "nsubjpass" (passive subject) between them, which can be used to characterize the existence of an entity. However, existing NER models based on linear-chain structures would have difficulties in capturing such long-distance relations (i.e., non-local structures).
One interesting property, as highlighted in the work of Jie et al. (2017), is that most of the entities form subtrees under their corresponding dependency trees. In the example of the EVENT entity in Figure 1, the entity itself forms a subtree and the words inside have rich complex dependencies among themselves. Exploiting such dependency edges within the subtrees allows a model to capture non-trivial semantic-level interactions between words within long entities. For example, "practice" is the prepositional object (pobj) of "on" which is a preposition (prep) of "seminar" in the EVENT entity. Modeling these grandchild dependencies (GD) (Koo and Collins, 2010) requires the model to capture some higher-order long-distance interactions among different words in a sentence.
Inspired by the above characteristics of dependency structures, in this work, we propose a simple yet effective dependency-guided model for NER. Our neural network based model is able to capture both contextual information and rich long-distance interactions between words for the NER task. Through extensive experiments on several datasets in different languages, we demonstrate the effectiveness of our model, which achieves state-of-the-art performance. To the best of our knowledge, this is the first work that leverages the complete dependency graphs for NER. We make our code publicly available at http://www.statnlp.org/research/information-extraction.

Related Work
NER has been a long-standing task in the field of NLP. While many recent works (Peters et al., 2018a; Akbik et al., 2018; Devlin et al., 2019) focus on finding good contextualized word representations for improving NER, our work is mostly related to the literature that focuses on employing dependency trees for improving NER. Sasano and Kurohashi (2008) exploited the syntactic dependency features for Japanese NER and achieved improved performance with a support vector machine (SVM) (Cortes and Vapnik, 1995) classifier. Similarly, Ling and Weld (2012) included the head word in a dependency edge as features for fine-grained entity recognition. Their approach is a pipeline where they extract the entity mentions with linear-chain conditional random fields (CRF) (Lafferty et al., 2001) and use a classifier to predict the entity type. Liu et al. (2010) proposed to link the words that are associated with selected typed dependencies (e.g., "nn", "prep") using a skip-chain CRF (Sutton and McCallum, 2004) model. They showed that some specific relations between the words can be exploited for improved NER. Cucchiarelli and Velardi (2001) applied a dependency parser to obtain the syntactic relations for the purpose of unsupervised NER. The resulting relation information serves as features indicating the potential existence of named entities. Jie et al. (2017) proposed an efficient dependency-guided model based on the semi-Markov CRF (Sarawagi and Cohen, 2004) for NER. The purpose is to reduce time complexity while maintaining the non-Markovian features. They observed certain relationships between the dependency edges and the named entities. Such relationships are able to define a reduced search space for their model. While these previous approaches do not make full use of the dependency tree structures, we focus on exploring neural architectures to exploit the complete structural information conveyed by the dependency trees.

Model
Our dependency-guided model is based on the state-of-the-art BiLSTM-CRF model proposed by Lample et al. (2016). We first briefly present their model as background and next present our dependency-guided model.

Background: BiLSTM-CRF
In the task of named entity recognition, we aim to predict the label sequence $y = \{y_1, y_2, \cdots, y_n\}$ given the input sentence $x = \{x_1, x_2, \cdots, x_n\}$, where $n$ is the number of words. The labels in $y$ are defined by a label set with the standard IOBES labeling scheme (Ramshaw and Marcus, 1999; Ratinov and Roth, 2009). The CRF (Lafferty et al., 2001) layer defines the probability of the label sequence $y$ given $x$:
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$
Following Lample et al. (2016), the score is defined as the sum of transitions and emissions from the bidirectional LSTM (BiLSTM):
$$\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} F_{x, y_i}$$
where $A$ is a transition matrix in which $A_{y_i, y_{i+1}}$ is the transition parameter from the label $y_i$ to the label $y_{i+1}$, with $y_0$ and $y_{n+1}$ denoting the start and end labels. $F_x$ is an emission matrix where $F_{x, y_i}$ represents the score of the label $y_i$ at the $i$-th position. Such scores are provided by the parameterized LSTM (Hochreiter and Schmidhuber, 1997) networks. During training, we minimize the negative log-likelihood to obtain the model parameters, including both the LSTM and the transition parameters.
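As an illustration of the scoring and training objective described above, the following minimal Python sketch computes the sequence score and the negative log-likelihood for a toy linear-chain CRF. It assumes explicit start and end labels (named `<s>` and `</s>` here), emissions given as one dictionary per position, and transitions given as nested dictionaries. These data structures and all function names are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, labels, start="<s>", end="</s>"):
    """Sum of transition scores (including start/end) and emission scores."""
    path = [start] + list(labels) + [end]
    transition_score = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    emission_score = sum(emissions[i][y] for i, y in enumerate(labels))
    return transition_score + emission_score


def log_partition(emissions, transitions, label_set, start="<s>", end="</s>"):
    """Log of the sum of exp(score) over all label sequences (forward algorithm)."""
    alpha = {y: transitions[start][y] + emissions[0][y] for y in label_set}
    for position in emissions[1:]:
        alpha = {
            y: position[y]
            + log_sum_exp([alpha[p] + transitions[p][y] for p in label_set])
            for y in label_set
        }
    return log_sum_exp([alpha[y] + transitions[y][end] for y in label_set])


def neg_log_likelihood(emissions, transitions, labels, label_set):
    """-log P(y | x) for one sentence under the CRF defined above."""
    return (log_partition(emissions, transitions, label_set)
            - sequence_score(emissions, transitions, labels))
```

In a real model the emission scores would come from the BiLSTM and the loss would be minimized with automatic differentiation; the sketch only makes the objective concrete.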

Dependency-Guided LSTM-CRF
Input Representations The word representation $w$ in the BiLSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016; Reimers and Gurevych, 2017) model consists of the concatenation of the word embedding and the corresponding character-based representation. Inspired by the fact that each word (except the root) in a sentence has exactly one head (i.e., parent) word in the dependency structure, we can enhance the word representations with such dependency information. Similar to the work by Miwa and Bansal (2016), we concatenate the word representation together with the corresponding head word representation and dependency relation embedding as the input representation. Specifically, given a dependency edge $(x_h, x_i, r)$ with $x_h$ as parent, $x_i$ as child and $r$ as dependency relation, the representation at position $i$ can be denoted as:
$$u_i = [w_i; w_h; v_r]$$
where $w_i$ and $w_h$ are the word representations of the word $x_i$ and its parent $x_h$, respectively. We take the final hidden state of the character-level BiLSTM as the character-based representation (Lample et al., 2016). $v_r$ is the embedding for the dependency relation $r$. These relation embeddings are randomly initialized and fine-tuned during training. The above representation allows us to capture the direct long-distance interactions at the input layer. For the word that is the root of the dependency tree, we treat its parent as itself and create a root relation embedding. Additionally, contextualized word representations (e.g., ELMo) can also be concatenated into $u$.
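The construction of the dependency-encoded input can be sketched as follows. This is a minimal illustration under stated assumptions: vectors are plain Python lists, heads are 0-based indices with -1 marking the root, and the function name `build_inputs` as well as the toy parse are hypothetical (only the edge from "had" to "Abramov" is taken from the text).

```python
def build_inputs(word_vecs, heads, relations, rel_emb):
    """Concatenate each word vector with its parent's vector and the
    embedding of the relation on the incoming dependency edge.

    word_vecs: list of word representations (lists of floats)
    heads:     0-based parent index per word, -1 for the root (assumption)
    relations: dependency relation label per word
    rel_emb:   dict mapping relation label -> embedding (list of floats)
    """
    inputs = []
    for i, (head, rel) in enumerate(zip(heads, relations)):
        parent = i if head < 0 else head  # the root is treated as its own parent
        inputs.append(word_vecs[i] + word_vecs[parent] + rel_emb[rel])
    return inputs


# Toy example for "Abramov had an accident in Moscow" (assumed parse).
words = ["Abramov", "had", "an", "accident", "in", "Moscow"]
word_vecs = [[float(i), float(i) + 0.5] for i in range(len(words))]
heads = [1, -1, 3, 1, 3, 4]
relations = ["nsubj", "root", "det", "dobj", "prep", "pobj"]
rel_emb = {r: [float(k)] for k, r in enumerate(sorted(set(relations)))}
u = build_inputs(word_vecs, heads, relations, rel_emb)
```

Each resulting vector has the size of two word representations plus one relation embedding, so the parent information is available to the first LSTM layer directly.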
Neural Architecture Given the dependency-encoded input representation $u$, we apply the LSTM to capture the contextual information and model the interactions between the words and their corresponding parents in the dependency trees. Figure 2 shows the proposed dependency-guided LSTM-CRF (DGLSTM-CRF) with 2 LSTM layers for the example sentence "Abramov had an accident in Moscow" and its dependency structure. The corresponding label sequence is {S-PER, O, O, O, O, S-GPE}. After the first BiLSTM, the hidden state at each position propagates to the next BiLSTM layer and to its children along the dependency tree. For example, the hidden state of the word "had", $h_2$, will propagate to its child "Abramov" at the first position. For the word that is the root, the hidden state at that specific position will propagate to itself. We use an interaction function $g(h_i, h_{p_i})$ to capture the interaction between the child and its parent in a dependency. Such an interaction function can be concatenation, addition or a multilayer perceptron (MLP). We further apply another BiLSTM layer on top of the interaction functions to produce the context representation for the final CRF layer.
Figure 2: Dependency-guided LSTM-CRF with 2 LSTM layers for the sentence "Abramov had an accident in Moscow". Dashed connections mimic the dependency edges. "g(·)" represents the interaction function.
The architecture shown in Figure 2 with a 2-layer BiLSTM can effectively encode the grandchild dependencies because the input representations encode the parent information and the interaction function further propagates the grandparent information. Such propagations allow the model to capture the indirect long-distance interactions from the grandchild dependencies between the words in the sentence, as mentioned in Section 1. In general, we can stack more interaction functions and BiLSTMs to enable deeper reasoning over the dependency trees. Specifically, the hidden states of the $(l+1)$-th layer $H^{(l+1)}$ can be calculated from the hidden states of the previous layer $H^{(l)}$:
$$H^{(l+1)} = \mathrm{BiLSTM}\big(f(H^{(l)})\big)$$
$$f(H^{(l)}) = \big[g(h^{(l)}_1, h^{(l)}_{p_1}), \cdots, g(h^{(l)}_n, h^{(l)}_{p_n})\big]$$
where $p_i$ indicates the parent index of the word $x_i$, and $g(h^{(l)}_i, h^{(l)}_{p_i})$ represents the interaction function between the hidden states at the $i$-th and $p_i$-th positions under the dependency edge $(x_{p_i}, x_i)$. The number of layers $L$ can be chosen according to the performance on the development set.
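The layer-wise propagation can be sketched in a few lines of Python. This is a structural illustration only: each BiLSTM layer is replaced by an arbitrary callable that maps a list of vectors to a list of vectors, heads use 0-based indices with -1 for the root, and the names `dg_encode` and `parent_index` are hypothetical.

```python
def parent_index(i, heads):
    """Parent position of word i; the root points to itself."""
    return i if heads[i] < 0 else heads[i]


def dg_encode(inputs, heads, encoders, g):
    """Apply the first encoder to the inputs, then alternate between the
    interaction function g (child state, parent state) and the next encoder.

    encoders: list of callables standing in for the BiLSTM layers (assumption)
    g:        interaction function taking (h_i, h_parent) and returning a vector
    """
    hidden = encoders[0](inputs)
    for encode in encoders[1:]:
        mixed = [g(hidden[i], hidden[parent_index(i, heads)])
                 for i in range(len(hidden))]
        hidden = encode(mixed)
    return hidden


# Toy check with identity "encoders" and concatenation as g.
# Each input already holds [word id, parent id], mimicking the input layer.
identity = lambda states: [list(s) for s in states]
concat = lambda h_i, h_p: list(h_i) + list(h_p)
heads = [1, 2, -1]                   # word 0 -> word 1 -> word 2 (root)
inputs = [[0, 1], [1, 2], [2, 2]]
out = dg_encode(inputs, heads, [identity, identity], concat)
```

With these stand-ins, the output at position 0 contains the identifier of word 2, its grandparent, after a single interaction step, which mirrors the argument that two layers suffice to reach grandchild dependencies.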

Interaction Function
The interaction function between the parent and child representations can be defined in various ways. Table 1 shows the list of interaction functions considered in our experiments. The first one returns the hidden state itself, which is equivalent to stacking the LSTM layers. The concatenation and addition functions involve no parameters and are straightforward ways to model the interactions. The last one applies an MLP to model the interaction between parent and child representations. With the rectified linear unit (ReLU) as activation function, the $g(h_i, h_{p_i})$ function is analogous to a graph convolutional network (GCN) (Kipf and Welling, 2017) formulation. In such a graph, each node has a self connection (i.e., $h_i$) and a dependency connection with its parent (i.e., $h_{p_i}$). Similar to the work by Marcheggiani and Titov (2017), we adopt different parameters $W_1$ and $W_2$ for self and dependency connections.
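The four interaction functions described above can be written down as a short Python sketch. Vectors are plain lists, the MLP variant is assumed to take the form ReLU(W1 h_i + W2 h_p) without a bias term, and all function names are illustrative rather than taken from the authors' code.

```python
def self_connection(h_i, h_p):
    """Return the child state unchanged (equivalent to stacking LSTM layers)."""
    return list(h_i)


def concatenation(h_i, h_p):
    """Concatenate child and parent states (no parameters)."""
    return list(h_i) + list(h_p)


def addition(h_i, h_p):
    """Element-wise sum of child and parent states (no parameters)."""
    return [a + b for a, b in zip(h_i, h_p)]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mlp(h_i, h_p, W1, W2):
    """ReLU(W1 h_i + W2 h_p): separate weights for the self connection
    and the dependency connection (bias omitted, an assumption)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W1, h_i), matvec(W2, h_p))]
```

Note that concatenation doubles the dimensionality fed to the next layer, whereas addition and the MLP keep it fixed.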

Experiments
Datasets The main experiments are conducted on the large-scale OntoNotes 5.0 (Weischedel et al., 2013) English and Chinese datasets. We chose these datasets because they contain both constituency tree and named entity annotations. There are 18 types of entities defined in the OntoNotes dataset. We convert the constituency trees into dependency trees. We also conduct experiments on the Catalan and Spanish datasets from SemEval-2010 Task 1 (Recasens et al., 2010). The SemEval-2010 task was originally designed for the task of coreference resolution in multiple languages. Again, we chose these corpora primarily because they contain both dependency and named entity annotations. Following Finkel and Manning (2009) and Jie et al. (2017), we select the three most dominant entity types and merge the rest into one general entity type "misc". Table 2 shows the statistics of the datasets used in the main experiments. To further evaluate the effectiveness of the dependency structures, we also conduct additional experiments under a low-resource setting for NER (Cotterell and Duh, 2017).
The last two columns of Table 2 show the relationships between the dependency trees and named entities with length larger than 2 for the complete dataset. Specifically, the penultimate column shows the percentage of entities that can form a complete subtree (ST) under their dependency tree structures. Most of the entities form subtrees, especially in the Catalan and Spanish datasets, where 100% of the entities form subtrees. This observation is consistent with the findings reported in Jie et al. (2017). The last column in Table 2 shows the percentage of the grandchild dependencies (GD) (Koo and Collins, 2010) that exist in these subtrees (i.e., entities). Such grandchild dependencies could be useful for detecting certain named entities, especially long entities.
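The two statistics discussed above can be computed from head indices with a short Python sketch. The subtree criterion used here is an assumed reading: a span is counted as a subtree when exactly one of its words has its head outside the span, which makes the span connected in the tree. The function names and the toy parse are hypothetical.

```python
def forms_subtree(heads, start, end):
    """True if the span [start, end] is connected in the dependency tree,
    i.e. exactly one word in the span has its head outside the span.

    heads: 0-based parent index per word, -1 for the root (assumption)
    """
    outside = [i for i in range(start, end + 1)
               if not (start <= heads[i] <= end)]
    return len(outside) == 1


def grandchild_dependencies(heads, start, end):
    """List (grandparent, parent, child) index triples lying fully inside the span."""
    def inside(i):
        return start <= i <= end

    triples = []
    for child in range(start, end + 1):
        parent = heads[child]
        if inside(parent) and inside(heads[parent]):
            triples.append((heads[parent], parent, child))
    return triples


# Toy fragment: "seminar on practice" attached to the verb "held".
tokens = ["seminar", "on", "practice", "held"]
heads = [3, 0, 1, -1]
```

For the span covering "seminar on practice" (positions 0 to 2), the check succeeds and one grandchild dependency (seminar, on, practice) is found, matching the prep/pobj chain described in the introduction.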
As we will see later in Section 5, the performance on long entities can be significantly improved with our dependency-guided model. The heatmap table in Figure 3 shows the correlation between the entity types and the dependency relations in the OntoNotes English dataset. Specifically, each entry denotes the percentage of the entities that have a parent dependency with a specific dependency relation. For example, in the row for the GPE entity type, 37% of the entity words have a dependency edge whose label is "pobj". Looking at the columns for "pobj" and "nn", we can see that most of the entities relate to the prepositional object (pobj) and noun compound modifier (nn) dependencies. For the NORP (i.e., nationalities or religious or political groups) and ORDINAL (e.g., "first", "second") entities in particular, more than 60% of the entity words have a dependency with the adjectival modifier (amod) relation. Furthermore, every entity type (i.e., row) has a most related dependency relation (with more than 17% of occurrences). Such observations present useful information that can be used to categorize named entities with different types.
Baselines L = 0 indicates that the model relies only on the input representation. Following prior work, the complete dependency trees are considered bidirectional and encoded with a contextualized GCN (BiLSTM-GCN). We further add the relation-specific parameters (Marcheggiani and Titov, 2017) and a CRF layer for the NER task. The resulting baseline is BiLSTM-GCN-CRF 9. We use the bootstrapping paired t-test (Berg-Kirkpatrick et al., 2012) for significance testing when comparing the results of different models.
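The entity-type versus dependency-relation percentages behind a heatmap such as Figure 3 can be tabulated with a small Python sketch. It assumes that every word is given as a pair of its entity type (or `None` for non-entity words) and the relation label on its incoming dependency edge, and that each row is normalized by the number of entity words of that type. The input format and the function name are illustrative assumptions.

```python
from collections import Counter, defaultdict


def relation_percentages(sentences):
    """Per entity type, the percentage of entity words whose incoming
    dependency edge carries each relation label.

    sentences: iterable of sentences, each a list of
               (entity_type_or_None, relation) pairs, one per word
    """
    counts = defaultdict(Counter)
    for words in sentences:
        for entity_type, relation in words:
            if entity_type is not None:
                counts[entity_type][relation] += 1
    return {
        entity_type: {rel: 100.0 * n / sum(c.values()) for rel, n in c.items()}
        for entity_type, c in counts.items()
    }


# Toy data (made up for illustration, not from OntoNotes).
toy = [
    [("GPE", "pobj"), (None, "prep"), ("GPE", "nn")],
    [("GPE", "pobj"), ("PERSON", "nsubj")],
]
table = relation_percentages(toy)
```

Each row of the resulting table sums to 100, so the largest entry in a row identifies the relation most associated with that entity type.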
Experimental Setup We choose MLP as the interaction function in our DGLSTM-CRF according to performance on the development set. The hidden size of all models (i.e., LSTM, GCN) is set to 200. We use the GloVe (Pennington et al., 2014) 100-d word embeddings, which were shown to be effective in the English NER task (Ma and Hovy, 2016; Peters et al., 2018a). We use the publicly available FastText (Grave et al., 2018) word embeddings for Chinese, Catalan and Spanish. The ELMo (Peters et al., 2018a) deep contextualized word representations 10 are used for all languages in our experiments, since Che et al. (2018) provide ELMo for many other languages, including Chinese, Catalan and Spanish. We use the average weights over all layers of the ELMo representations and concatenate them with the input representation u. Our models are optimized by mini-batch stochastic gradient descent (SGD) with learning rate 0.01 and batch size 10. The L2 regularization parameter is 1e-8. The hyperparameters are selected according to the performance on the OntoNotes English development set.
Table 3 shows the performance comparison between our work and previous work on the OntoNotes English dataset. Without the LSTM layers (i.e., L = 0), the proposed model with dependency information significantly improves the NER performance, by more than 2 points in F1 compared to the baseline BiLSTM-CRF (L = 0), which demonstrates the effectiveness of dependencies for the NER task.
9 Detailed description of this baseline can also be found in the supplementary material.
10 We also tried BERT (Devlin et al., 2019) in preliminary experiments and obtained similar performance as ELMo. The NER performance using BERT without fine-tuning reported in Peters et al. (2019) is consistent with the one reported by ELMo (Peters et al., 2018a).
Our best performing BiLSTM-CRF baseline (with GloVe) achieves an F1 score of 87.78, which is better than or on par with previous works (Chiu and Nichols, 2016; Li et al., 2017; Ghaddar and Langlais, 2018) with extra features. This baseline also outperforms the CNN-based models (Strubell et al., 2017; Li et al., 2017). The BiLSTM-GCN-CRF model outperforms the BiLSTM-CRF model but achieves inferior performance compared to the proposed DGLSTM-CRF model. We believe it is challenging to preserve the surrounding context information when stacking GCN layers, while contextual information is important for NER (Peters et al., 2018b). Overall, the 2-layer DGLSTM-CRF model significantly (with p < 0.01) outperforms the best BiLSTM-CRF baseline and the BiLSTM-GCN-CRF model. As we can see from the table, increasing the number of layers (e.g., L = 3) does not give us further improvements for either BiLSTM-CRF or DGLSTM-CRF, because such third-order information (e.g., the relationship among a word's parent, its grandparent, and great-grandparent) does not play an important role in indicating the presence of named entities.
We further compare the performance of all models with ELMo (Peters et al., 2018a) representations to investigate whether the effect of dependency would be diminished by the contextualized word representations. With L = 0, the ELMo representations largely improve the performance of BiLSTM-CRF compared to the BiLSTM-CRF model with word embeddings only, but it is still 1 point below our DGLSTM-CRF model. The 2-layer DGLSTM-CRF model outperforms the best BiLSTM-CRF baseline by 0.9 points in F1 (p < 0.001). Empirically, we found that among the entities that are correctly predicted by DGLSTM-CRF but wrongly predicted by BiLSTM-CRF, 47% have length more than 2. Our finding shows that the 2-layer DGLSTM-CRF model is able to accurately recognize long entities, which can lead to a higher precision.
In addition, 51.9% of the entities that are correctly retrieved by DGLSTM-CRF have the dependency relations "pobj", "nn" and "nsubj", which have strong correlations with certain named entity types (Figure 3). Such a result demonstrates the effectiveness of dependency relations in improving the recall of NER.
OntoNotes Chinese Table 4 shows the performance comparison on the Chinese datasets. We compare our models against the state-of-the-art NER model on this dataset, Lattice LSTM (Zhang and Yang, 2018) 12. Our implementation of the strong BiLSTM-CRF baseline achieves comparable performance against the Lattice LSTM. Similar to the English dataset, our model with L = 0 significantly improves the performance compared to the BiLSTM-CRF (L = 0) model. Our DGLSTM-CRF model achieves the best performance with L = 2 and is consistently better (p < 0.02) than the strong BiLSTM-CRF baselines. As we can see from the table, the improvements of the DGLSTM-CRF model mainly come from recall (p < 0.001) compared to the BiLSTM-CRF model, especially in the scenario with word embeddings only. Empirically, we also found that those correctly retrieved entities of the DGLSTM-CRF (compared against the baseline) mostly correlate with the following dependency relations: "nn", "nsubj", "nummod". However, DGLSTM-CRF achieves lower precision than BiLSTM-CRF, which indicates that the DGLSTM-CRF model makes more false-positive predictions. The reason could be the relatively lower ratio of ST(%) 13 as shown in Table 2, which means some of the entities do not form subtrees under the complete dependency trees. In such a scenario, the model may not correctly identify the boundary of the entities, which results in lower precision.
12 We run their code on the OntoNotes 5.0 Chinese dataset.
13 Percentage of entities that can form a subtree.
SemEval-2010 Table 5 shows the results of our models on the SemEval-2010 Task 1 datasets. Overall, we observe substantial improvements of the DGLSTM-CRF on the Catalan and Spanish datasets (with p < 0.001 marked in bold against the best performing BiLSTM-CRF baseline), especially for DGLSTM-CRF with ELMo and L larger than 1. With word embeddings, the best DGLSTM-CRF model outperforms the best performing BiLSTM-CRF baseline by more than 10 and 9 points in F1 on the Catalan and Spanish datasets, respectively. The BiLSTM-GCN-CRF model also performs much better than the BiLSTM-CRF baselines but is worse than the DGLSTM-CRF model with L ≥ 2. Both precision and recall improve significantly and by a large margin compared to the best performing BiLSTM-CRF, especially recall (with more than 10 points of improvement) on these two datasets. With ELMo, the best performing DGLSTM-CRF model outperforms the BiLSTM-CRF baseline by about 6 and 7 points in F1 on these two datasets, respectively. The substantial improvements show that the structural dependency information is extremely helpful for these two datasets.
With ELMo representations, we observe improvements of about 2 and 3 points in F1 for the 2-layer DGLSTM-CRF model compared with the 1-layer DGLSTM-CRF model on these two datasets, respectively. Empirically, more than 50% of the entities that are correctly predicted by the 2-layer model but not by the 1-layer model have length larger than 2. Also, most of these entities contain the grandchild dependencies "(sn, sn)" and "(spec, sn)", where sn represents noun phrase and spec represents specifier (e.g., determiner, quantifier), in both datasets. Such a finding shows that the 2-layer model is able to capture the interactions given by the grandchild dependencies.

Additional Experiments
CoNLL-2003 English Table 6 shows the performance on the CoNLL-2003 English dataset. The dependencies are predicted by spaCy (Honnibal and Montani, 2017). With the contextualized word representations, DGLSTM-CRF outperforms BiLSTM-CRF by 0.2 points in F1 (p < 0.09). The improvement is not significant due to the relatively lower quality of the dependency trees. To further study the effect of the dependencies, we modified the predicted dependencies to ensure that each entity forms a subtree in the complete dataset. Such a modification improves the F1 to 92.7, which is significantly better (p < 0.05) than the BiLSTM-CRF.
Low-Resource NER Following Cotterell and Duh (2017), we emulate a truly low-resource condition with 100 sentences for training. We assume that the contextualized word representations are not available and that dependencies are predicted. Table 7 shows the NER performance on the SemEval-2010 Task 1 datasets under the low-resource setting. With a limited amount of training data, BiLSTM-CRF suffers from low recall, and the DGLSTM-CRF largely improves it on these two datasets. Using gold dependencies further significantly improves the precision and recall.
Effect of Dependency Quality To evaluate how the quality of dependency trees affects the performance, we train a state-of-the-art dependency parser (Dozat and Manning, 2017) using our training set and make predictions on the development/test set. We implemented the dependency parser using the AllenNLP package (Gardner et al., 2017). Table 8 shows the performance (LAS) of the dependency parser on four languages (i.e., OntoNotes English, OntoNotes Chinese, Catalan and Spanish) and the performance of DGLSTM-CRF against the best performing BiLSTM-CRF with ELMo. DGLSTM-CRF, even with predicted dependencies, is able to consistently outperform the BiLSTM-CRF on the four languages. However, the performance is still worse than that of the DGLSTM-CRF with gold dependencies, especially on Catalan and Spanish. Such results suggest that it is essential to have high-quality dependency annotations available for the proposed model.

Ablation Study
Table 9 shows the ablation study of the 2-layer DGLSTM-CRF model on the OntoNotes English dataset. With self connection as interaction function, the F1 drops 0.3 points. The model achieves comparable performance with concatenation as interaction function, but F1 drops about 0.4 points with the addition interaction function. We believe that the addition potentially leads to certain information loss. Without the dependency relation embedding v_r in the input representation, the F1 drops about 0.4 points.

Effectiveness of Dependency Relations
To demonstrate whether the model benefits from the dependency relations, we first select the entities that are correctly predicted by the 2-layer DGLSTM-CRF model but not by the best performing baseline 2-layer BiLSTM-CRF on the OntoNotes English dataset. We draw the heatmap in Figure 4 based on these entities. Comparing Figures 3 and 4, we can see that they are similar in terms of density. Both of them show consistent relationships between the entity types and the dependency relations. The comparison shows that the improvements partially result from the effect of dependency relations. We also found from our model's predictions that some entity types have strong correlations with the relation pairs on grandchild dependencies 14.
Table 10 shows the performance comparison with different entity lengths on all datasets. As mentioned earlier, the dependencies as well as the grandchild relations allow our models to capture the long-distance interactions between the words. As shown in the table, the performance on entities with length more than 1 consistently improves with DGLSTM-CRF for all languages except Chinese. As we pointed out in the dataset statistics (Table 2), the percentage of entities that form subtrees in OntoNotes Chinese is relatively lower compared to other datasets. The performance gain is more significant for entities with longer length on the other three languages. We found that, among the improvements on entities with length larger than 2 in English, 85% of them have long-distance dependencies and 30% of them have grandchild dependencies within the entity boundary. The analysis shows that our model, which exploits the dependency tree structures, is helpful for recognizing long entities.
14 The corresponding heatmap visualization is provided in supplementary material.

Conclusions and Future Work
Motivated by the relationships between the dependency trees and named entities, we propose a dependency-guided LSTM-CRF model to encode the complete dependency tree and capture such relationships for the NER task. Through extensive experiments on several datasets, we demonstrate the effectiveness of the proposed model in improving the NER performance. Our analysis shows that NER benefits from the dependency relations and long-distance dependencies, which are able to capture the non-local interactions between the words. As the statistics show that most of the entities form subtrees under the dependency trees, future work includes building a model for joint NER and dependency parsing which regards each entity as a single unit in a dependency tree.