Named Entity Recognition as Dependency Parsing

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing, concerned with identifying spans of text expressing references to entities. NER research is often focused on flat entities only (flat NER), ignoring the fact that entity references can be nested, as in [Bank of [China]] (Finkel and Manning, 2009). In this paper, we use ideas from graph-based dependency parsing to provide our model with a global view of the input via a biaffine model (Dozat and Manning, 2017). The biaffine model scores pairs of start and end tokens in a sentence, which we use to explore all spans, so that the model is able to predict named entities accurately. We show that the model works well for both nested and flat NER through evaluation on 8 corpora, achieving SoTA performance on all of them, with accuracy gains of up to 2.2 percentage points.


Introduction
'Nested Entities' are named entities containing references to other named entities, as in [Bank of [China]], in which both [China] and [Bank of China] are named entities. Such nested entities are frequent in data sets like ACE 2004, ACE 2005 and GENIA (e.g., 17% of NEs in GENIA are nested (Finkel and Manning, 2009)), although the more widely used sets such as CONLL 2002, CONLL 2003 and ONTONOTES only contain so-called flat named entities, and nested entities are ignored.
The current SoTA models all adopt a neural network architecture without hand-crafted features, which makes them more adaptable to different tasks, languages and domains (Lample et al., 2016; Chiu and Nichols, 2016; Peters et al., 2018; Devlin et al., 2019; Ju et al., 2018; Sohrab and Miwa, 2018; Straková et al., 2019). In this paper, we introduce a method to handle both types of NEs in one system by adopting ideas from the biaffine dependency parsing model of Dozat and Manning (2017). For dependency parsing, the system predicts a head for each token and assigns a relation to the head-child pairs. In this work, we reformulate NER as the task of identifying start and end indices, as well as assigning a category to the span defined by these pairs. Our system uses a biaffine model on top of a multi-layer BiLSTM to assign scores to all possible spans in a sentence. After that, instead of building dependency trees, we rank the candidate spans by their scores and return the top-ranked spans that comply with constraints for flat or nested NER. We evaluated our system on three nested NER benchmarks (ACE 2004, ACE 2005, GENIA) and five flat NER corpora (CONLL 2002 (Dutch, Spanish), CONLL 2003 (English, German), and ONTONOTES). The results show that our system achieved SoTA results on all three nested NER corpora and on all five flat NER corpora, with substantial gains of up to 2.2 percentage points compared to the previous SoTA. We provide the code as open source.

Related Work
Flat Named Entity Recognition. The majority of flat NER models are based on a sequence labelling approach. Collobert et al. (2011) introduced a neural NER model that uses CNNs to encode tokens combined with a CRF layer for the classification. Many other neural systems followed this approach but instead used LSTMs to encode the input and a CRF for the prediction (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). These latter models were later extended to use context-dependent embeddings such as ELMo (Peters et al., 2018). Devlin et al. (2019) introduced BERT, a bidirectional transformer architecture for the training of language models. BERT and its siblings provided better language models, which in turn translated into higher scores for NER. Lample et al. (2016) cast NER as transition-based dependency parsing using a Stack-LSTM. They compare it with an LSTM-CRF model, which turns out to be a very strong baseline. Their transition-based system uses two transitions (shift and reduce) to mark the named entities and handles flat NER only, while our system has been designed to handle both nested and flat entities.
Nested Named Entity Recognition. Early work on nested NER, motivated particularly by the GENIA corpus, includes (Shen et al., 2003; Beatrice Alex and Grover, 2007; Finkel and Manning, 2009). Finkel and Manning (2009) also proposed a constituency parsing-based approach. In recent years, an increasing number of neural models have targeted nested NER as well. Ju et al. (2018) suggested an LSTM-CRF model to predict nested named entities. Their algorithm continues iteratively until no further entities are predicted. Lin et al. (2019) tackle the problem in two steps: they first detect the entity head, and then they infer the entity boundaries as well as the category of the named entity. Straková et al. (2019) tag nested named entities with a sequence-to-sequence model, exploring combinations of context-based embeddings such as ELMo, BERT, and Flair. Zheng et al. (2019) use a boundary-aware network to solve nested NER. Similar to our work, Sohrab and Miwa (2018) exhaustively enumerate all possible spans up to a defined length by concatenating the LSTM outputs for the start and end positions and then using this to calculate a score for each span. Apart from the different network and word embedding configurations, the main difference between their model and ours is therefore the use of the biaffine model. Due to the biaffine model, we get a global view of the sentence, while Sohrab and Miwa (2018) concatenate the LSTM outputs of possible start and end positions up to a fixed length. Dozat and Manning (2017) demonstrated that the biaffine mapping performs significantly better than just the concatenation of pairs of LSTM outputs.

Methods
Our model is inspired by the dependency parsing model of Dozat and Manning (2017). We use both word embeddings and character embeddings as input, and feed the output into a BiLSTM and finally to a biaffine classifier. Figure 1 shows an overview of the architecture. To encode words, we use both BERT Large and fastText embeddings (Bojanowski et al., 2016). For BERT we follow the recipe of Kantor and Globerson (2019) to obtain the context-dependent embeddings for a target token with 64 surrounding tokens on each side. For the character-based word embeddings, we use a CNN to encode the characters of the tokens. The concatenation of the word and character-based word embeddings is fed into a BiLSTM to obtain the word representations (x).
After obtaining the word representations from the BiLSTM, we apply two separate FFNNs to create different representations (h_s / h_e) for the start/end of the spans. Using different representations for the start/end of the spans allows the system to learn to identify the start/end of the spans separately. This improves accuracy compared to the model which directly uses the outputs of the LSTM, since the contexts of the start and end of the entity are different. Finally, we employ a biaffine model over the sentence to create an l × l × c scoring tensor (r_m), where l is the length of the sentence and c is the number of NER categories + 1 (for non-entity). We compute the score for a span i by:

$$h_s(i) = \mathrm{FFNN}_s(x_{s_i})$$
$$h_e(i) = \mathrm{FFNN}_e(x_{e_i})$$
$$r_m(i) = h_s(i)^{\top} U_m \, h_e(i) + W_m \, (h_s(i) \oplus h_e(i)) + b_m$$

where s_i and e_i are the start and end indices of the span i, U_m is a d × c × d tensor, W_m is a 2d × c matrix and b_m is the bias.
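To make the shapes concrete, the following is a minimal pure-Python sketch of biaffine span scoring. It assumes the score of a span is a bilinear term over the start and end representations, plus a linear term over their concatenation, plus a bias, which is how we read the definitions of U_m (d × c × d), W_m (2d × c) and b_m. The function name `biaffine_scores`, the random toy inputs and the use of nested lists are illustrative choices, not the released implementation; a real system would compute this with a tensor library in a single batched operation.

```python
import random

def biaffine_scores(h_s, h_e, U, W, b):
    """Return r with r[s][e] = list of c category scores for the span (s, e).

    h_s, h_e: l x d start/end representations (stand-ins for the FFNN outputs).
    U: d x c x d tensor, W: 2d x c matrix, b: bias vector of length c.
    Entries with s > e stay None because they are not valid spans.
    """
    l, d, c = len(h_s), len(h_s[0]), len(b)
    r = [[None] * l for _ in range(l)]
    for s in range(l):
        for e in range(s, l):  # only spans whose start is not after their end
            concat = h_s[s] + h_e[e]  # list concatenation, length 2d
            scores = []
            for k in range(c):
                # bilinear term: h_s^T U[:, k, :] h_e
                bilinear = sum(h_s[s][p] * U[p][k][q] * h_e[e][q]
                               for p in range(d) for q in range(d))
                # linear term over the concatenated start/end representations
                linear = sum(W[j][k] * concat[j] for j in range(2 * d))
                scores.append(bilinear + linear + b[k])
            r[s][e] = scores
    return r

# Toy example with hypothetical sizes: 4 tokens, hidden size 3, 5 categories.
rng = random.Random(0)
l, d, c = 4, 3, 5
h_s = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(l)]
h_e = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(l)]
U = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(c)] for _ in range(d)]
W = [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(2 * d)]
b = [0.0] * c

r = biaffine_scores(h_s, h_e, U, W, b)
print(len(r), len(r[0]), len(r[0][3]))  # 4 4 5
```

The explicit loops make it easy to see that every (start, end) pair receives its own vector of category scores, which is what gives the model a view over all spans of the sentence at once.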
The tensor r_m provides scores for all possible spans that could constitute a named entity under the constraint that s_i ≤ e_i (the start of an entity is not after its end). We assign each span a NER category y':

$$y'(i) = \arg\max_{y} \; r_m(i_y)$$

We then rank all the spans that have a category other than "non-entity" by their category scores (r_m(i_{y'})) in descending order and apply the following post-processing constraints. For nested NER, an entity is selected as long as it does not clash with the boundaries of higher-ranked entities. We say that an entity i clashes boundaries with another entity j if s_i < s_j ≤ e_i < e_j or s_j < s_i ≤ e_j < e_i. For example, in "the Bank of China", the entity "the Bank of" clashes boundaries with the entity "Bank of China", hence only the span with the higher category score will be selected. For flat NER, we apply one more constraint: any entity that contains, or is inside, an entity ranked before it will not be selected.

The learning objective of our named entity recognizer is to assign a correct category (including non-entity) to each valid span. Hence it is a multi-class classification problem, and we optimise our models with softmax cross-entropy:

$$p_m(i_k) = \frac{\exp(r_m(i_k))}{\sum_{\hat{k}=1}^{c} \exp(r_m(i_{\hat{k}}))}$$
$$\mathrm{loss} = - \sum_{i=1}^{N} \sum_{k=1}^{c} y_{i_k} \log p_m(i_k)$$

where N is the number of valid spans and y_{i_k} is 1 if span i has gold category k and 0 otherwise.

Experiments
Data Set. We evaluate our system on both nested and flat NER. For the nested NER task, we use the ACE 2004, ACE 2005, and GENIA corpora; for flat NER, we test our system on the CONLL 2002 (Tjong Kim Sang, 2002), CONLL 2003 (Tjong Kim Sang and De Meulder, 2003) and ONTONOTES corpora. For ACE 2004 and ACE 2005 we follow the same settings of Lu and Roth (2015) and Muis and Lu (2017). For a fair comparison we also used the same documents as in Lu and Roth (2015) for each split. For GENIA, we use the GENIA v3.0.2 corpus. We preprocess the dataset following the same settings of Finkel and Manning (2009).
Evaluation Metric. We report recall, precision and F1 scores for all evaluations. A named entity is considered correct when both its boundary and its category are predicted correctly.
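The ranking and post-processing constraints from the Methods section, together with the exact-match evaluation metric, can be illustrated with a short sketch. It assumes candidates arrive as (start, end, category, score) tuples with inclusive token indices and with "non-entity" spans already filtered out; the helper names `clashes`, `contains_or_inside`, `decode` and `prf1`, and the toy scores, are hypothetical and chosen only for illustration.

```python
def clashes(a, b):
    """True if spans a and b cross boundaries (partial overlap, no containment)."""
    (si, ei), (sj, ej) = a, b
    return si < sj <= ei < ej or sj < si <= ej < ei

def contains_or_inside(a, b):
    """True if one span contains the other (identical spans count as well)."""
    (si, ei), (sj, ej) = a, b
    return (si <= sj and ej <= ei) or (sj <= si and ei <= ej)

def decode(candidates, flat=False):
    """Greedily select spans in descending score order.

    Nested NER: skip a span that clashes with an already selected span.
    Flat NER: additionally skip a span that contains or lies inside one.
    """
    selected = []
    for s, e, cat, _score in sorted(candidates, key=lambda x: -x[3]):
        span = (s, e)
        if any(clashes(span, (ps, pe)) for ps, pe, _ in selected):
            continue
        if flat and any(contains_or_inside(span, (ps, pe)) for ps, pe, _ in selected):
            continue
        selected.append((s, e, cat))
    return selected

def prf1(predicted, gold):
    """Exact-match precision, recall and F1: boundary and category must both agree."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Tokens: 0 "the", 1 "Bank", 2 "of", 3 "China" (scores are made up).
candidates = [
    (1, 3, "ORG", 0.9),  # "Bank of China"
    (0, 2, "ORG", 0.6),  # "the Bank of" crosses the boundaries of (1, 3)
    (3, 3, "GPE", 0.8),  # "China" is nested inside (1, 3)
]
gold = [(1, 3, "ORG"), (3, 3, "GPE")]

nested = decode(candidates)            # [(1, 3, 'ORG'), (3, 3, 'GPE')]
flat = decode(candidates, flat=True)   # [(1, 3, 'ORG')]
print(nested, prf1(nested, gold))
print(flat, prf1(flat, gold))
```

In the nested setting the lower-scoring span "the Bank of" is dropped because it crosses the boundaries of "Bank of China", while "China" survives because it is fully contained. In the flat setting "China" is dropped as well, which lowers recall against this nested gold annotation.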
Hyperparameters. We use a unified setting for all of the experiments; Table 1 shows the hyperparameters for our system.
5. In Sohrab and Miwa (2018), the last 10% of the training set is used as a development set; we include their result mainly because their system is similar to ours.
6. The revised version is provided by the shared task organiser in 2006 with more consistent annotations. We confirmed with the author of Akbik et al. (2018) that they used the revised version.
For the GENIA corpus our system achieved an F1 score of 80.5% and improved the SoTA by 2.2 percentage points. Our hypothesis is that for GENIA the high accuracy gain is due to our structural prediction approach and that sequence-to-sequence models rely more on the language model [...] (see Table 3), e.g. the systems of Chiu and Nichols (2016) and Strubell et al. (2017). In contrast, our system is less sensitive to the domain and the granularity of the categories. As shown in Table 3, our system achieved an F1 score of 91.3% on the ONTONOTES corpus, which is very close to its performance on the CONLL 2003 corpus (93.5%).
On the multi-lingual data, our system achieved F1 scores of 86.4% for German, 90.3% for Spanish and 93.5% for Dutch. Our system outperforms the previous SoTA results by large margins of 2.1, 1.5, 1.3 and 1 percentage points on the ONTONOTES, Spanish, German and Dutch corpora respectively, and is slightly better than the SoTA on the English data set. In addition, we also tested our system on the revised version of the German data to compare with the model of Akbik et al. (2018); our system again achieved a substantial gain of 2 percentage points over their system.

Ablation Study
To evaluate the contribution of individual components of our system, we further remove selected components and use ONTONOTES for evaluation (see Table 4). We choose ONTONOTES for our ablation study as it is the largest corpus.
Biaffine Classifier. We replace the biaffine mapping with a CRF layer and convert our system into a sequence labelling model. The CRF layer is frequently used in models for flat NER, e.g. Lample et al. (2016). When we replace the biaffine model of our system with a CRF layer, the performance drops by 0.8 percentage points (Table 4). This performance difference shows the benefit of adding a biaffine model and supports our hypothesis that the dependency parsing framework is an important factor for the high accuracy of our system.
Contextual Embeddings. We ablate the BERT embeddings and, as expected, the system performance drops by a large margin of 2.4 percentage points (see Table 4). This shows that the BERT embeddings are one of the most important factors for the accuracy.
Context Independent Embeddings. We remove the context-independent fastText embeddings from our system. The context-independent embeddings contribute 0.4 percentage points towards the score of our full system (Table 4). This suggests that even with the BERT embeddings enabled, the context-independent embeddings can still make a noticeable improvement to a system.
Character Embeddings. Finally, we remove the character embeddings. As we can see from Table 4, the impact of character embeddings is quite small. One explanation would be that English is not a morphologically rich language and hence does not benefit much from character-level information, and that the BERT embeddings themselves are based on word pieces that already capture some character-level information.
Overall, the biaffine mapping and the BERT embedding together contributed most to the high accuracy of our system.

Conclusion
In this paper, we reformulated NER as a structured prediction task and adopted a SoTA dependency parsing approach for nested and flat NER. Our system uses contextual embeddings as input to a multi-layer BiLSTM. We employ a biaffine model to assign scores to all spans in a sentence. Further constraints are used to predict nested or flat named entities. We evaluated our system on eight named entity corpora. The results show that our system achieves SoTA on all eight corpora. We demonstrate that advanced structured prediction techniques lead to substantial improvements for both nested and flat NER.