A Unified MRC Framework for Named Entity Recognition

The task of named entity recognition (NER) is normally divided into nested NER and flat NER, depending on whether named entities are nested or not. Models are usually developed separately for the two tasks, since sequence labeling models, the most widely used backbone for flat NER, are only able to assign a single label to a particular token, which is unsuitable for nested NER, where a token may be assigned several labels. In this paper, we propose a unified framework that is capable of handling both flat and nested NER tasks. Instead of treating the task of NER as a sequence labeling problem, we propose to formulate it as a machine reading comprehension (MRC) task. For example, extracting entities with the PER label is formalized as extracting answer spans to the question "which person is mentioned in the text?". This formulation naturally tackles the entity overlapping issue in nested NER: the extraction of two overlapping entities with different categories requires answering two independent questions. Additionally, since the query encodes informative prior knowledge, this strategy facilitates the process of entity extraction, leading to better performance on not only nested NER but also flat NER. We conduct experiments on both nested and flat NER datasets. Experimental results demonstrate the effectiveness of the proposed formulation. We achieve substantial performance boosts over current SOTA models on nested NER datasets, i.e., +1.28, +2.55, +5.44, +6.37 respectively on ACE04, ACE05, GENIA and KBP17, along with SOTA results on flat NER datasets, i.e., +0.24, +1.95, +0.21, +1.49 respectively on English CoNLL 2003, English OntoNotes 5.0, Chinese MSRA and Chinese OntoNotes 4.0.


Introduction
Named Entity Recognition (NER) refers to the task of detecting the span and the semantic category of entities in a chunk of text. The task can be further divided into two sub-categories, nested NER and flat NER, depending on whether entities are nested or not. Nested NER refers to the phenomenon that the spans of entities (mentions) are nested, as shown in Figure 1. Entity overlapping is a fairly common phenomenon in natural languages.
The task of flat NER is commonly formalized as a sequence labeling task: a sequence labeling model (Chiu and Nichols, 2016; Ma and Hovy, 2016; Devlin et al., 2018) is trained to assign a single tagging class to each unit within a sequence of tokens. This formulation is unfortunately incapable of handling overlapping entities in nested NER (Huang et al., 2015; Chiu and Nichols, 2015), where multiple categories need to be assigned to a single token if the token participates in multiple entities. Many attempts have been made to reconcile sequence labeling models with nested NER (Alex et al., 2007; Byrne, 2007; Finkel and Manning, 2009; Lu and Roth, 2015; Katiyar and Cardie, 2018), mostly based on pipelined systems. However, pipelined systems suffer from error propagation, long running time, and intensive hand-crafted feature engineering.
Inspired by the current trend of formalizing NLP problems as question answering tasks (Levy et al., 2017; McCann et al., 2018; Li et al., 2019), we propose a new framework that is capable of handling both flat and nested NER. Instead of treating the task of NER as a sequence labeling problem, we propose to formulate it as a SQuAD-style (Rajpurkar et al., 2016, 2018) machine reading comprehension (MRC) task. Each entity type is characterized by a natural language query, and entities are extracted by answering these queries given the contexts. For example, the task of assigning the PER (PERSON) label to "[Washington] was born into slavery on the farm of James Burroughs" is formalized as answering the question "which person is mentioned in the text?". This strategy naturally tackles the entity overlapping issue in nested NER: the extraction of two overlapping entities with different categories requires answering two independent questions.
The MRC formulation also comes with another key advantage over the sequence labeling formulation. For the latter, gold NER categories are merely class indexes and lack semantic prior information about entity categories. For example, the ORG (ORGANIZATION) class is treated as a one-hot vector in sequence labeling training. This lack of clarity on what to extract leads to inferior performance. On the contrary, for the MRC formulation, the query encodes significant prior information about the entity category to extract. For example, the query "find an organization such as company, agency and institution in the context" encourages the model to link the word "organization" in the query to organization entities in the context. Additionally, by encoding comprehensive descriptions (e.g., "company, agency and institution") of tagging categories (e.g., ORG), the model has the potential to disambiguate similar tagging classes.
We conduct experiments on both nested and flat NER datasets to show the generality of our approach. Experimental results demonstrate its effectiveness. We achieve substantial performance boosts over current SOTA models on nested NER datasets, i.e., +1.28, +2.55, +5.44, +6.37 respectively on ACE04, ACE05, GENIA and KBP17, as well as on flat NER datasets, i.e., +0.24, +1.95, +0.21, +1.49 respectively on English CoNLL 2003, English OntoNotes 5.0, Chinese MSRA and Chinese OntoNotes 4.0. We hope that our work will inspire new paradigms for the entity recognition task.

Named Entity Recognition (NER)
Traditional sequence labeling models use CRFs (Lafferty et al., 2001; Sutton et al., 2007) as a backbone for NER. The first work using neural models for NER goes back to 2003, when Hammerton (2003) attempted to solve the problem using unidirectional LSTMs. Collobert et al. (2011) presented a CNN-CRF structure, augmented with character embeddings by Santos and Guimaraes (2015). Lample et al. (2016) explored neural structures for NER, in which bidirectional LSTMs are combined with CRFs, with features based on character-based word representations and unsupervised word representations. Ma and Hovy (2016) and Chiu and Nichols (2016) used a character CNN to extract features from characters. Recent large-scale language model pretraining methods such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018a) further enhanced the performance of NER, yielding state-of-the-art performances.

Nested Named Entity Recognition
The overlapping between entities (mentions) was first noticed by Kim et al. (2003), who developed handcrafted rules to identify overlapping mentions. Alex et al. (2007) proposed two multi-layer CRF models for nested NER. The first is the inside-out model, in which the first CRF identifies the innermost entities, and each successive CRF layer is built over the words and the innermost entities extracted by the previous CRF to identify second-level entities, and so on. The other is the outside-in model, in which the first CRF identifies the outermost entities, and successive CRFs then identify increasingly nested entities. Finkel and Manning (2009) built a model to extract nested entity mentions based on parse trees. They made the assumption that one mention is fully contained by the other when they overlap. Lu and Roth (2015) proposed to use mention hyper-graphs for recognizing overlapping mentions. Xu et al. (2017) utilized a local classifier that runs on every possible span to detect overlapping mentions, and Katiyar and Cardie (2018) used neural models to learn hyper-graph representations for nested entities. Ju et al. (2018) dynamically stacked flat NER layers in a hierarchical manner. Lin et al. (2019a) proposed the Anchor-Region Networks (ARNs) architecture by modeling and leveraging the head-driven phrase structures of nested entity mentions. Luan et al. (2019) built a span enumeration approach by selecting the most confident entity spans and linking these nodes with confidence-weighted relation types and coreferences. Other works (Muis and Lu, 2017; Sohrab and Miwa, 2018; Zheng et al., 2019) also proposed various methods to tackle the nested NER problem.
Recently, nested NER models have been enriched with pre-trained contextual embeddings such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018b). Fisher and Vlachos (2019) introduced a BERT-based model that first merges tokens and/or entities into entities, and then assigns labels to these entities. Shibuya and Hovy (2019) provided an inference model that extracts entities iteratively from the outermost ones to the inner ones. Straková et al. (2019) viewed nested NER as a sequence-to-sequence generation problem, in which the input sequence is a list of tokens and the target sequence is a list of labels.

Machine Reading Comprehension (MRC)
MRC models (Seo et al., 2016; Wang et al., 2016; Wang and Jiang, 2016; Xiong et al., 2016, 2017; Shen et al., 2017; Chen et al., 2017) extract answer spans from a passage given a question. The task can be formalized as two multi-class classification tasks, i.e., predicting the starting and ending positions of the answer spans. Over the past one or two years, there has been a trend of transforming NLP tasks into MRC question answering. For example, Levy et al. (2017) transformed the task of relation extraction into a QA task: each relation type R(x, y) can be parameterized as a question q(x) whose answer is y. For example, the relation EDUCATED-AT can be mapped to "Where did x study?". Given a question q(x), if a non-null answer y can be extracted from a sentence, it means the relation label for the current sentence is R. McCann et al. (2018) transformed NLP tasks such as summarization and sentiment analysis into question answering. For example, the task of summarization can be formalized as answering the question "What is the summary?". Our work is significantly inspired by Li et al. (2019), which formalized the task of entity-relation extraction as a multi-turn question answering task. Different from our work, Li et al. (2019) focused on relation extraction rather than NER. Additionally, Li et al. (2019) utilized a template-based procedure for constructing queries to extract semantic relations between entities, and their queries lack diversity. In this paper, more factual knowledge, such as synonyms and examples, is incorporated into queries, and we present an in-depth analysis of the impact of different strategies for building queries.

Task Formalization
Given an input sequence $X = \{x_1, x_2, ..., x_n\}$, where $n$ denotes the length of the sequence, we need to find every entity in $X$ and assign it a label $y \in Y$, where $Y$ is a predefined list of all possible tag types (e.g., PER, LOC, etc.).
Dataset Construction: Firstly, we need to transform the tagging-style annotated NER dataset into a set of (QUESTION, ANSWER, CONTEXT) triples. Each tag type $y \in Y$ is associated with a natural language question $q_y = \{q_1, q_2, ..., q_m\}$, where $m$ denotes the length of the generated query. An annotated entity $x_{\mathrm{start,end}} = \{x_{\mathrm{start}}, x_{\mathrm{start}+1}, \cdots, x_{\mathrm{end}-1}, x_{\mathrm{end}}\}$ is a substring of $X$ satisfying $\mathrm{start} \leq \mathrm{end}$. Each entity is associated with a gold label $y \in Y$. By generating a natural language question $q_y$ based on the label $y$, we obtain the triple $(q_y, x_{\mathrm{start,end}}, X)$, which is exactly the (QUESTION, ANSWER, CONTEXT) triple that we need. Note that we use the subscript "start,end" to denote the continuous tokens from index "start" to index "end" in a sequence.
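To make the construction concrete, below is a minimal sketch assuming span-style annotations with inclusive indexes; the query strings and function names are our own illustration, not the authors' released code.

```python
# A minimal sketch (not the authors' code) of turning a tagging-style
# annotation into (QUESTION, ANSWER, CONTEXT) triples, one per tag type.

# Hypothetical annotation-guideline-style queries, one per tag type y in Y.
QUERIES = {
    "PER": "find persons including fictional characters",
    "ORG": "find organizations including companies, agencies and institutions",
}

def build_mrc_triples(tokens, entities):
    """tokens: the sequence x_1..x_n as a list of strings.
    entities: list of (start, end, label) with inclusive indexes.
    Returns one (question, answer_spans, context) triple per tag type,
    so a single query may have multiple (or zero) answer spans."""
    context = " ".join(tokens)
    triples = []
    for label, question in QUERIES.items():
        spans = [(s, e) for (s, e, y) in entities if y == label]
        triples.append((question, spans, context))
    return triples

tokens = "Washington was born into slavery on the farm of James Burroughs".split()
entities = [(0, 0, "PER"), (9, 10, "PER")]
for question, spans, context in build_mrc_triples(tokens, entities):
    print(question, "->", [" ".join(tokens[s:e + 1]) for s, e in spans])
```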

Query Generation
The question generation procedure is important, since queries encode prior knowledge about labels and have a significant influence on the final results. Different ways have been proposed for question generation; e.g., Li et al. (2019) utilized a template-based procedure for constructing queries to extract semantic relations between entities. In this paper, we take annotation guideline notes as references to construct queries. Annotation guideline notes are the guidelines provided to the annotators of a dataset by the dataset builder. They are descriptions of tag categories, described as generically and precisely as possible, so that annotators can annotate mentions without ambiguity. Examples are shown in Table 1.

Model Backbone
Given the question $q_y$, we need to extract the text span $x_{\mathrm{start,end}}$ of type $y$ from $X$ under the MRC framework. We use BERT (Devlin et al., 2018) as the backbone. To be in line with BERT, the question $q_y$ and the passage $X$ are concatenated, forming the combined string $\{[\mathrm{CLS}], q_1, q_2, ..., q_m, [\mathrm{SEP}], x_1, x_2, ..., x_n\}$, where [CLS] and [SEP] are special tokens. BERT then receives the combined string and outputs a context representation matrix $E \in \mathbb{R}^{n \times d}$, where $d$ is the vector dimension of the last layer of BERT; we simply drop the query representations.
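As a concrete illustration of this backbone, here is a sketch using the HuggingFace transformers API; this is our own rendering of the described setup, and details (e.g., how the trailing [SEP] is handled) may differ from the authors' implementation.

```python
# Sketch of the BERT backbone: pack question and context into one input,
# then keep only the context representations E (the query part is dropped).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

question = "which person is mentioned in the text?"
context = "Washington was born into slavery on the farm of James Burroughs"

# The tokenizer produces [CLS] q_1 .. q_m [SEP] x_1 .. x_n [SEP].
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, d)

# token_type_ids == 1 marks the context segment (here including the
# final [SEP], which one may also strip); this yields E in R^{n x d}.
is_context = inputs["token_type_ids"][0] == 1
E = hidden[0][is_context]
print(E.shape)  # roughly (n, 768) for bert-base
```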

Span Selection
There are two strategies for span selection in MRC. The first strategy (Seo et al., 2016; Wang et al., 2016) is to have two $n$-class classifiers separately predict the start index and the end index, where $n$ denotes the length of the context. Since the softmax function is applied over all tokens in the context, this strategy has the disadvantage of only being able to output a single span given a query. The other strategy is to have two binary classifiers, one to predict whether each token is a start index, the other to predict whether each token is an end index. This strategy allows for outputting multiple start indexes and multiple end indexes for a given context and a specific query, and thus has the potential to extract all related entities according to $q_y$. We adopt the second strategy and describe the details below.
Start Index Prediction: Given the representation matrix $E$ output by BERT, the model first predicts the probability of each token being a start index as follows:

$$P_{\mathrm{start}} = \mathrm{softmax}_{\mathrm{each\ row}}(E \cdot T_{\mathrm{start}}) \in \mathbb{R}^{n \times 2}$$

where $T_{\mathrm{start}} \in \mathbb{R}^{d \times 2}$ is the weight matrix to learn. Each row of $P_{\mathrm{start}}$ presents the probability distribution of the corresponding index being the start position of an entity given the query.
End Index Prediction: The end index prediction procedure is exactly the same, except that we have another matrix $T_{\mathrm{end}}$ to obtain the probability matrix $P_{\mathrm{end}} \in \mathbb{R}^{n \times 2}$.

Start-End Matching: In the context $X$, there could be multiple entities of the same category. This means that multiple start indexes could be predicted by the start-index prediction model and multiple end indexes by the end-index prediction model. The heuristic of matching a start index with its nearest end index does not work here, since entities could overlap. We thus need a method to match a predicted start index with its corresponding end index. Specifically, by applying argmax to each row of $P_{\mathrm{start}}$ and $P_{\mathrm{end}}$, we get the predicted indexes that might be starting or ending positions, i.e., $\hat{I}_{\mathrm{start}}$ and $\hat{I}_{\mathrm{end}}$:

$$\hat{I}_{\mathrm{start}} = \{i \mid \arg\max(P_{\mathrm{start}}^{(i)}) = 1, i = 1, \cdots, n\}$$
$$\hat{I}_{\mathrm{end}} = \{i \mid \arg\max(P_{\mathrm{end}}^{(i)}) = 1, i = 1, \cdots, n\}$$

where the superscript $(i)$ denotes the $i$-th row of a matrix. Given any start index $i_{\mathrm{start}} \in \hat{I}_{\mathrm{start}}$ and end index $i_{\mathrm{end}} \in \hat{I}_{\mathrm{end}}$, a binary classification model is trained to predict the probability that they should be matched:

$$P_{i_{\mathrm{start}}, i_{\mathrm{end}}} = \mathrm{sigmoid}(m \cdot \mathrm{concat}(E_{i_{\mathrm{start}}}, E_{i_{\mathrm{end}}}))$$

where $m \in \mathbb{R}^{1 \times 2d}$ is the weight vector to learn.
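The two binary classifiers and the matching model can be sketched in PyTorch as follows; this is a simplified illustration of the equations above (e.g., we use nn.Linear, which adds a bias term the equations do not mention), not the reference implementation.

```python
# Sketch of start/end index prediction and start-end matching.
import torch
import torch.nn as nn

class SpanSelector(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.t_start = nn.Linear(d, 2)    # plays the role of T_start
        self.t_end = nn.Linear(d, 2)      # plays the role of T_end
        self.match = nn.Linear(2 * d, 1)  # plays the role of m

    def forward(self, E):
        # E: (n, d) context representation matrix from BERT
        p_start = self.t_start(E).softmax(-1)  # (n, 2), row-wise softmax
        p_end = self.t_end(E).softmax(-1)      # (n, 2)
        # Candidate indexes: rows whose argmax equals 1
        starts = (p_start.argmax(-1) == 1).nonzero().flatten()
        ends = (p_end.argmax(-1) == 1).nonzero().flatten()
        # Matching probability for every candidate (start, end) pair
        spans = {}
        for i in starts:
            for j in ends:
                if i <= j:
                    pair = torch.cat([E[i], E[j]])
                    spans[(int(i), int(j))] = torch.sigmoid(self.match(pair))
        return p_start, p_end, spans

selector = SpanSelector(d=768)
p_start, p_end, spans = selector(torch.randn(12, 768))
print(list(spans.keys()))
```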

Train and Test
At training time, $X$ is paired with two label sequences $Y_{\mathrm{start}}$ and $Y_{\mathrm{end}}$ of length $n$, representing the ground-truth label of each token $x_i$ being the start or end index of any entity. We therefore have the following two losses for start and end index prediction:

$$\mathcal{L}_{\mathrm{start}} = \mathrm{CE}(P_{\mathrm{start}}, Y_{\mathrm{start}})$$
$$\mathcal{L}_{\mathrm{end}} = \mathrm{CE}(P_{\mathrm{end}}, Y_{\mathrm{end}})$$

Let $Y_{\mathrm{start,end}}$ denote the gold labels for whether each start index should be matched with each end index. The start-end index matching loss is given as follows:

$$\mathcal{L}_{\mathrm{span}} = \mathrm{CE}(P_{\mathrm{start,end}}, Y_{\mathrm{start,end}})$$

The overall training objective to be minimized is:

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{start}} + \beta \mathcal{L}_{\mathrm{end}} + \gamma \mathcal{L}_{\mathrm{span}}$$

where $\alpha, \beta, \gamma \in [0, 1]$ are hyperparameters controlling the contributions of the three losses. At test time, start and end indexes are first separately selected based on $\hat{I}_{\mathrm{start}}$ and $\hat{I}_{\mathrm{end}}$; the matching model is then used to align the extracted start indexes with end indexes, yielding the final extracted spans.
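A sketch of how the three losses might be combined during training follows; tensor names, shapes, and the use of nll_loss / binary_cross_entropy are our assumptions about one reasonable realization of the CE terms above.

```python
# Sketch of the joint objective L = alpha*L_start + beta*L_end + gamma*L_span.
import torch
import torch.nn.functional as F

def mrc_ner_loss(p_start, p_end, p_match, y_start, y_end, y_match,
                 alpha=1.0, beta=1.0, gamma=1.0):
    # p_start, p_end: (n, 2) probabilities; y_start, y_end: (n,) in {0, 1}
    l_start = F.nll_loss(p_start.log(), y_start)
    l_end = F.nll_loss(p_end.log(), y_end)
    # p_match: (k,) match probabilities for candidate pairs; y_match: (k,)
    l_span = F.binary_cross_entropy(p_match, y_match.float())
    return alpha * l_start + beta * l_end + gamma * l_span
```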

GENIA

GENIA (Ohta et al., 2002): For the GENIA dataset, we use GENIAcorpus3.02p. We follow the protocols in Katiyar and Cardie (2018).

KBP2017
We follow Katiyar and Cardie (2018) and evaluate our model on the 2017 English evaluation dataset (LDC2017D55). The training set consists of the Rich ERE annotated datasets LDC2015E29, LDC2015E68, LDC2016E31 and LDC2017E02. We follow the dataset split strategy in Lin et al. (2019b).

Baselines
We use the following models as baselines:
• Hyper-Graph: a general framework that shares span representations using dynamically constructed span graphs.
CoNLL2003 (Sang and Meulder, 2003) is an English dataset with four types of named entities: Location, Organization, Person and Miscellaneous. We follow the data processing protocols in Ma and Hovy (2016).
OntoNotes 5.0 (Pradhan et al., 2013) is an English dataset consisting of text from a wide variety of sources. The dataset includes 18 named entity categories, consisting of 11 entity types (Person, Organization, etc.) and 7 value types (Date, Percent, etc.).

Baselines
For English datasets, we use the following models as baselines.
• BiLSTM-CRF from Ma and Hovy (2016).

Results

The improvement on English OntoNotes 5.0 is larger than on CoNLL 2003: OntoNotes contains far more entity types (18 vs. 4), and some entity categories face a severe data sparsity problem. Since the query encodes significant prior knowledge about the entity type to extract, the MRC formulation is more immune to the tag sparsity issue, leading to larger improvements on OntoNotes. The proposed method also achieves new state-of-the-art results on the Chinese datasets. For Chinese MSRA, the proposed method outperforms the fine-tuned BERT tagging model by +0.95% in terms of F1. We also improve the F1 score from 79.16% to 82.11% on Chinese OntoNotes 4.0.

Improvement from MRC or from BERT
For flat NER, it is not immediately clear which factor is responsible for the improvement: the MRC formulation or BERT (Devlin et al., 2018). On the one hand, the MRC formulation facilitates the entity extraction process by encoding prior knowledge in the query; on the other hand, the good performance might also come from the large-scale pretraining of BERT.
To separate the influence of large-scale BERT pretraining, we compare the LSTM-CRF tagging model (Strubell et al., 2017) with MRC-based models that do not rely on large-scale pretraining, such as QAnet (Yu et al., 2018) and BiDAF (Seo et al., 2017). Results on English OntoNotes are shown in Table 5. As can be seen, though underperforming BERT-Tagger, the MRC-based approaches QAnet and BiDAF still significantly outperform tagging models based on LSTM+CRF. This validates the importance of the MRC formulation. The benefit of the MRC formulation is also verified when comparing BERT-Tagger with BERT-MRC: the latter outperforms the former by +1.95%.
We plot the attention matrices between the query and the context sentence output by the BiDAF model in Figure 2. As can be seen, the semantic similarity between tagging classes and contexts is captured in the attention matrix. In the example, Flevoland matches geographical, cities and state.

How to Construct Queries
How to construct queries has a significant influence on the final results. In this subsection, we explore different ways to construct queries and their influence, including the following (summarized in the code sketch after this list):

• Position index of labels: a query is constructed using the index of a tag, i.e., "one", "two", "three".
• Keyword: a query is the keyword describing the tag, e.g., the query for tag ORG is "organization".
• Rule-based template filling: generates questions using templates. The query for tag ORG is "which organization is mentioned in the text".
• Wikipedia: a query is constructed using its Wikipedia definition. The query for tag ORG is "an organization is an entity comprising multiple people, such as an institution or an association."
• Synonyms: words or phrases that mean exactly or nearly the same as the original keyword, extracted using the Oxford Dictionary. The query for tag ORG is "association".
• Keyword+Synonyms: the concatenation of a keyword and its synonyms.
• Annotation guideline notes: the method we use in this paper. The query for tag ORG is "find organizations including companies, agencies and institutions".
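For reference, the seven strategies can be summarized as a mapping for the tag ORG; the position-index string assumes ORG happens to be the second label, and the Keyword+Synonyms entry is one plausible concatenation.

```python
# Query variants for the tag ORG, following the examples listed above.
ORG_QUERIES = {
    "position_index": "two",  # assumes ORG is the second label in Y
    "keyword": "organization",
    "rule_template": "which organization is mentioned in the text",
    "wikipedia": ("an organization is an entity comprising multiple people, "
                  "such as an institution or an association."),
    "synonyms": "association",
    "keyword_synonyms": "organization association",  # illustrative concatenation
    "annotation_guideline": ("find organizations including companies, "
                             "agencies and institutions"),
}
```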
Table 5 shows the experimental results on English OntoNotes 5.0. BERT-MRC outperforms BERT-Tagger in all settings except Position Index of Labels. The model trained with Annotation Guideline Notes achieves the highest F1 score. Explanations are as follows: for Position Index of Labels, queries are constructed using tag indexes and thus do not contain any meaningful information, leading to inferior performance; Wikipedia underperforms Annotation Guideline Notes because definitions from Wikipedia are relatively general and may not precisely describe the categories in a way tailored to the data annotations.

Zero-shot Evaluation on Unseen Labels
It would be interesting to test how well a model trained on one dataset transfers to another, which is referred to as zero-shot learning ability. We train models on CoNLL 2003 and test them on OntoNotes 5.0. OntoNotes 5.0 contains 18 entity types, 3 shared with CoNLL 2003 and 15 unseen in CoNLL 2003. Table 6 presents the results. As can be seen, BERT-Tagger does not have zero-shot learning ability, obtaining an accuracy of only 31.87%. This is in line with our expectation, since it cannot predict labels unseen in the training set. The question answering formalization of the MRC framework, which predicts the answer to a given query, comes with more generalization capability and achieves acceptable results.

Size of Training Data
Since the natural language query encodes significant prior knowledge, we expect the proposed framework to work better with less training data. Figure 3 verifies this point: on the Chinese OntoNotes 4.0 training set, the query-based BERT-MRC approach achieves performance comparable to BERT-Tagger with only half the training data.

Conclusion

In this paper, we reformulate the NER task as an MRC question answering task. This formulation comes with two key advantages: (1) it is capable of addressing overlapping or nested entities; (2) the query encodes significant prior knowledge about the entity category to extract. The proposed method obtains SOTA results on both nested and flat NER datasets, which indicates its effectiveness. In the future, we would like to explore variants of the model architecture.

Figure 2: An example of attention matrices between the query and the input sentence.

Figure 3: Effect of varying the percentage of training samples on Chinese OntoNotes 4.0. BERT-MRC achieves the same F1 score as BERT-Tagger with fewer training samples.

Table 2: Results for nested NER tasks.

Table 3: Results for flat NER tasks.

Table 5: Results of different types of queries.

Table 6: Zero-shot evaluation on OntoNotes 5.0. BERT-MRC achieves better zero-shot performance.