NNE: A Dataset for Nested Named Entity Recognition in English Newswire

Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity mentions. We describe NNE—a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB). Our annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting. We hope the public release of this large dataset for English newswire will encourage development of new techniques for nested NER.


Introduction
Named entity recognition (NER), the task of identifying and classifying entity mentions in text, plays a crucial role in understanding natural language. It is used for many downstream language processing tasks, e.g., coreference resolution, question answering, summarization, entity linking, relation extraction and knowledge base population. However, most NER tools are designed to capture flat mention structure over coarse entity type schemas, reflecting the available annotated datasets.
Focusing on flat mention structures ignores important information that can be useful for downstream tasks. Figure 1 includes examples of nested named entities illustrating several phenomena:
• Entity-entity relationships can be embedded in nested mentions. For instance, the location of the 'Ontario Supreme Court' is indicated by the embedded STATE mention 'Ontario';
• Entity attribute values can be embedded in nested mentions. For instance, the title is the embedded ROLE 'Former U.N. Ambassador', which also encodes the employment relation between the PERSON 'Jane Kirkpatrick' and the ORG 'U.N.';
• Part-whole relationships can be encoded in nested mention structure. For instance, the REGION 'Southern California' is part of the STATE 'California'.
To facilitate ongoing research on nested NER, we introduce NNE, a large, manually-annotated, nested named entity dataset over English newswire. This new annotation layer over the Wall Street Journal portion of the PTB includes 279,795 mentions. All mentions are annotated, including nested structures up to six layers deep. A fine-grained entity type schema is used, extending the flat BBN (Weischedel and Brunstein, 2005) annotation from 64 to 114 entity types.
We are publicly releasing the standoff annotations along with detailed annotation guidelines and scripts for knitting the annotations onto the underlying PTB corpus.1 Benchmark results using recent state-of-the-art approaches demonstrate that good accuracy is possible, but complexity and run time remain open challenges. As a new layer over the already rich collection of PTB annotations, NNE provides an opportunity to explore joint modelling of nested NER and other tasks at an unprecedented scale and level of detail.

The NNE dataset
Annotation Scheme: BBN (Weischedel and Brunstein, 2005) is a pronoun coreference and entity type corpus, annotated with 64 types of entities, numerical and time expressions. We use its flat entity schema as a starting point for designing our schema. We analyzed existing BBN annotations to develop and automatically apply structured pre-annotation for predictable entity types. Additional fine-grained categories and further structural elements of entities, inspired by Sekine et al. (2002) and Nothman et al. (2013), are used to augment the BBN schema. We adhere to the following general principles when annotating nested named entities in the corpus:
• Annotate all named entities, all time and date (TIMEX) and numerical (NUMEX) entities, including all non-sentence-initial words in title case, and instances of proper noun mentions that are not capitalized.
• Annotate all structural elements of entities. These elements could be other entities, such as 'Ontario' (STATE) in 'Ontario Supreme Court' (GOVERNMENT), or structural components such as '40' (CARDINAL) and 'miles' (UNIT) in '40 miles' (QUANTITY:1D), as well as the internal structure induced by syntactic elements, such as coordination.
• Add consistent substructure to avoid spurious ambiguity. For example, the token 'Toronto', which is a CITY, would be labeled as part of an ORG:EDU organization span 'University of Toronto'. We add layers of annotations so that each token is annotated as consistently as possible, e.g., [University of [Toronto]CITY]ORG:EDU.
• Add additional categories to avoid category confusion. Some entities are easy to identify but difficult to categorize consistently. For instance, a hotel (or any business at a fixed location) has both organizational and locative qualities, or is at least treated metonymically as a location. Rather than requiring annotators to make an ambiguous decision, we add the category HOTEL to simplify the individual annotation decision. We also apply this principle when adding the MEDIA, FUND, and BUILDING categories.
• Pragmatic annotation. Many annotation decisions are ambiguous and difficult, and may require substantial research. For instance, knowing that 'The Boeing Company' was named after its founder 'William E. Boeing' would allow us to annotate 'Boeing' with an embedded PERSON entity. However, this does not apply to other companies, such as 'Sony Corporation'. So that annotation decisions can be made without reference to external knowledge, we label all tokens that seem to be the names of people as NAME, regardless of whether they are actually a person's name.
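The principles above can be illustrated with a rough sketch of how nested standoff annotations over a token sequence might be represented; the Mention class and depth helper are our own illustration, not the released file format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)
    label: str   # entity type, e.g. "CITY" or "ORG:EDU"

# The guideline example: [University of [Toronto]CITY]ORG:EDU
tokens = ["University", "of", "Toronto"]
mentions = [
    Mention(0, 3, "ORG:EDU"),  # outer organization span
    Mention(2, 3, "CITY"),     # nested city span
]

def depth(m, mentions):
    """Nesting depth of m = number of strictly larger mentions
    whose span contains m's span."""
    return sum(
        1 for o in mentions
        if o.start <= m.start <= m.end <= o.end
        and (o.start, o.end) != (m.start, m.end)
    )
```

Because depth ignores co-extensive spans, a span carrying two types (as NNE allows) does not count as nested inside itself.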
Entity types and mention frequencies can be found in Appendix A. See Ringland (2016) for annotation guidelines and extended discussion of annotation decisions.
Annotation Process: Although some existing annotation tools allow nested structures (e.g., Brat (Stenetorp et al., 2012)), we built a custom tool that makes adding layers of entities simple and fast, and that suggests reusing existing structured annotations for the same span.
Using the annotations from BBN as underlying annotations, the annotator is shown a screen with the target sentence, as well as the previous and next sentences, if any. A view of the whole article is also available to help the annotator with contextual cues. When annotators select a span, they are prompted with suggestions based on their own previous annotations, and on common entities. Some entities are repeated frequently in an article, or over many articles in the corpus. The annotation tool allows a user to add a specified annotation to all strings matching those tokens in the same article, or in all articles. Four annotators, each with a background in linguistics and/or computational linguistics, were selected and briefed on the annotation task and its purpose. The WSJ portion of the PTB consists of 25 sections (00-24). Each annotator started with a subset of section 00 as annotation training, and was given feedback before moving on to other sections. Weekly meetings were held with all annotators to discuss ambiguities in the guidelines, gaps in the annotation categories, edge cases and ambiguous entities, and to resolve discrepancies.
Total annotation time for the corpus was 270 hours, split between the four annotators. Sections 00 and 23 were doubly annotated, and section 02 was annotated by all four annotators. An additional 17 hours was spent adjudicating the sections annotated by multiple annotators.
Dataset Analysis: The resulting NNE dataset includes a large number of entity mentions of substantial depth, with more than half of all mentions occurring inside another mention. Of the 118,525 top-level entity mentions, 47,020 (39.6%) do not have any nested structure embedded. The remaining 71,505 mentions contain 161,270 mentions, an average of 2.25 nested mentions per top-level mention with substructure. Note that one span can be assigned multiple entity types. For example, the span '1993' can be annotated as both DATE and YEAR. In NNE, 19,144 out of 260,386 total spans are assigned multiple types. Table 1 lists the number of spans occurring at each depth.

To measure how clearly the annotation guidelines delineate each category, and how reliable our annotations are, inter-annotator agreement was calculated using annotations on section 02, which was annotated by all four annotators. An adjudicated version was created by choosing a correct label from among the four candidates, or by adjusting one of them at the token level. For the purposes of inter-annotator agreement, a tag stack is calculated for each word, essentially flattening each token's nested annotation structure into one label. For example, the tag of the token 'California' in the third sentence of Figure 1 is STATE REGION, while that of 'beach' is O. Agreement using Fleiss' kappa over all tokens is 0.907. Considering only tokens that are part of at least one mention according to at least one annotator, Fleiss' kappa is 0.832. Both results are above the 0.8 threshold for good reliability (Carletta, 1996). Average precision, recall and F1 across the four annotators with respect to the adjudicated gold standard are 94.3, 91.8 and 93.0.
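The tag-stack flattening and agreement calculation described above can be sketched as follows; mentions are shown as (start, end, label) tuples and the helper names are ours, not from the released scripts:

```python
from collections import Counter

def tag_stack(i, mentions):
    """Flatten token i's nested annotations into one label:
    the labels of all covering mentions, innermost first,
    or 'O' if no mention covers the token."""
    covering = sorted(
        (m for m in mentions if m[0] <= i < m[1]),
        key=lambda m: m[1] - m[0],  # shortest (innermost) span first
    )
    return " ".join(m[2] for m in covering) or "O"

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] lists the labels every rater
    assigned to item i (all items rated by the same number of raters)."""
    n = len(ratings[0])                        # raters per item
    counts = Counter(c for row in ratings for c in row)
    total = sum(counts.values())
    p_e = sum((v / total) ** 2 for v in counts.values())
    p_bar = sum(
        (sum(row.count(c) ** 2 for c in set(row)) - n) / (n * (n - 1))
        for row in ratings
    ) / len(ratings)
    return (p_bar - p_e) / (1 - p_e)
```

For the Figure 1 example, tokens 'Southern California' with mentions (0, 2, "REGION") and (1, 2, "STATE") give tag_stack(1, ...) == "STATE REGION", matching the flattened label used for agreement.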

Benchmark results
We evaluate three existing NER models on our dataset: (1) the standard BiLSTM-CRF model, which can handle only flat entities (Lample et al., 2016); (2) a hypergraph-based model (Wang and Lu, 2018); and (3) a transition-based model (Wang et al., 2018). The latter two models were proposed to recognize nested mentions. We follow the CoNLL evaluation scheme in requiring an exact match of mention start, end and entity type (Sang and Meulder, 2003). We use section 02 as the development set, sections 23 and 24 as the test set, and the remaining sections as the training set. The model that performs best on the development set is evaluated on the test set for the final result. Since the standard BiLSTM-CRF model cannot handle nested entities, we train it on either the outermost mentions (BiLSTM-CRF-TOP in Table 2) or the innermost mentions (BiLSTM-CRF-BOTTOM). We also combine the outputs of these two flat NER models, denoted BiLSTM-CRF-BOTH.
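A minimal sketch of this exact-match scoring over (start, end, type) triples; the function name is ours, and real CoNLL-style scorers additionally report per-type breakdowns:

```python
def exact_match_prf(gold, pred):
    """Micro-averaged precision, recall and F1 under the
    exact-match criterion: a predicted mention is correct only
    if its (start, end, type) triple appears in the gold set."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Using sets of triples means a span annotated with two types (as NNE permits) contributes two distinct items, one per type, so both must be recovered for full credit.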
From Table 2, we can see that single flat NER models can achieve high precision but suffer from low recall. For example, the model trained on outermost (top) mentions has 38.0 recall, as

Related Work
Other corpora with nested entities: We briefly compare existing annotated English corpora involving nested entities. A comparison of statistics between our dataset and two widely used benchmark datasets is shown in Table 3. The ACE corpora (Mitchell et al., 2004; Walker et al., 2005) consist of data of various types annotated for entities, relations and events. In the GENIA corpus (Kim et al., 2003), biomedical entity mentions, such as protein, DNA, RNA and disease names, often nest. For example, the RNA 'CIITA mRNA' contains a DNA mention 'CIITA'.
In addition to these two commonly used nested entity corpora, Byrne (2007) and Alex et al. (2007) introduced datasets with nested entities in the historical archive and biomedical domains, respectively. However, their datasets are not publicly available. Four percent of the entity mentions annotated in the English entity discovery and linking task of the TAC-KBP track include nesting (Ji et al., 2014).
Resources built on the PTB: Much effort has gone into adding syntactic and semantic information to the PTB (Marcus et al., 1993). PropBank (Kingsbury et al., 2002) extended the PTB with predicate-argument relationships between verbs and their arguments. NomBank (Meyers et al., 2004) extended the argument structure to instances of common nouns. Vadas and Curran (2007) and Ficler and Goldberg (2016) extended the PTB with noun phrase and coordination annotations, respectively.
Our dataset is built on top of the PTB and enriches the full ecosystem of resources and systems that stem from it.

Summary
We present NNE, a large-scale, nested, fine-grained named entity dataset. We are optimistic that NNE will encourage the development of new NER models that recognize structural information within entities, and therefore understand the fine-grained semantic information it captures. Additionally, our annotations are built on top of the PTB, so the NNE dataset will allow joint learning models to take advantage of semantic and syntactic annotations, and ultimately to understand and exploit the true structure of named entities.
Figure 1: Example nested mentions in NNE, drawn from sentences such as '... the Ontario Supreme Court said it will postpone ...' and '... this wealthy Southern California beach community ...'.

Table 1: Number of spans at each layer of nesting with their most frequent categories.