Evaluating and Combining Name Entity Recognition Systems

Named entity recognition (NER) is an important subtask in natural language processing. Various NER systems have been developed in the last decade. They may target different domains, employ different methodologies, work on different languages, detect different types of entities, and support different input and output formats. These differences make it difficult for a user to select the right NER tool for a specific task. Motivated by the need for NER tools in our research work, we selected several publicly available and well-established NER tools and validated their outputs against both a Wikipedia gold-standard corpus and a small set of manually annotated documents. The evaluations show consistent results across the selected tools. Finally, we constructed a hybrid NER tool by combining the best-performing tools for the domains of our interest.


Introduction
Named entity recognition is an important subtask in natural language processing (NLP). The results of recognizing and classifying proper nouns in a text document are widely used in information retrieval, information extraction, machine translation, question answering and automatic summarization (Nadeau and Sekine, 2007; Kaur and Gupta, 2010). Depending on the requirements of a specific task, the types to be recognized can be person, location, organization and date, which are mostly used in newswire (Tjong et al., 2003), or other commonly used measures (percent, weight, money), email addresses, etc. They can also be domain-specific entity types such as drug names, disease symptoms and treatments (Ben Abacha and Zweigenbaum, 2001).
Named entity recognition is a challenging task which needs extensive prior knowledge sources for good performance (Ratinov and Roth, 2009; Nadeau and Sekine, 2007). Much research has been conducted in different domains with various approaches. Early studies focused on heuristic and handcrafted rules. By defining formation patterns and context over lexical-syntactic features and term constituents, entities are recognized by matching the patterns against the input documents (Rau, 1991; Collins and Singer, 1999). Rule-based systems may achieve a high degree of precision. However, the development process is time-consuming, and porting the developed rules from one domain to another is a major challenge. Recent research in NER tends to use machine learning approaches (Borthwick, 1999; McCallum and Li, 2003; Takeuchi and Collier, 2002). The learning methods include supervised, semi-supervised and unsupervised learning. Supervised learning tends to be the dominant technique for named entity recognition and classification (Nadeau and Sekine, 2007). However, supervised methods require a large amount of annotated documents for model training, and their performance typically depends on the availability of sufficient high-quality training data in the domain of interest. Some systems use hybrid methods that combine different rule-based and/or machine learning systems for improved performance over the individual approaches (Srihari R. et al., 2000; Tim R. et al., 2012). Hybrid systems aim to exploit the strengths of the different systems or methods to achieve the best overall performance. In this paper, we first select several publicly available and well-established NER tools in section 2. All the tools are then validated in section 3 with CoNLL 2003 metrics and a customized partial matching measurement. Finally, we construct a hybrid NER system based on the best-performing NER tools in section 4.

Tool Selection
Our goal is to evaluate freely available NER tools that perform well enough for our research projects. The criteria for our selection are as follows:
a) The NER tool is freely available and allows unlimited use.
b) The tool can be downloaded and installed locally and works well with its default configuration.
c) The tool is not trained for a specific domain.
d) The tool must be able to recognize the three basic entity types: PERSON, LOCATION and ORGANIZATION.
Based on the above criteria, the following NER tools have been selected:
a) Stanford NER (Finkel et al., 2005).
b) spaCy.
c) Alias-i LingPipe (Alias-i, 2008).
d) Natural Language Toolkit (NLTK) (Bird et al., 2009).

Normalization
The selected tools differ in their features and programming languages as well as in their tag sets and output formats. To have an automated and efficient evaluation system, we have to integrate all these tools in one system and normalize all their outputs into a standard format.
Stanford NER is a Java package (version 3.6.0). It is based on a linear-chain Conditional Random Field (Finkel et al., 2005). The models were trained on a mixture of CoNLL, MUC-6, MUC-7 and ACE named entity corpora. The basic required output tags are "PERSON", "LOCATION" and "ORGANIZATION".
spaCy (https://spacy.io/) is implemented in Python. At the time of writing, its documentation provides no detailed information about its implemented models. The related output tags include "PERSON", "LOC", "ORG", "GPE", etc. For spaCy outputs, we map "LOC" to "LOCATION", "ORG" and "GPE" to "ORGANIZATION" and ignore all other types.
Alias-i LingPipe NER is implemented in Java and supports rule-based NER, supervised training of a statistical model, and more direct methods such as dictionary matching (Alias-i, 2008). We use version 4.1.0 and adopt its "First-best Named Entity Chunking". The model used is the News English model trained on the MUC-6 corpus, which is slower than LingPipe's other models but more accurate. The output entity types match the normalized types, so no mapping is needed.
NLTK is a Python NLP toolkit and is well-established in the research community. NLTK's named entity chunker is based on a supervised machine learning algorithm, the Maximum Entropy Classifier. Its model is trained on the ACE corpus with exactly the entity types we are interested in: "PERSON", "LOCATION" and "ORGANIZATION". It also outputs a "GPE" type, which we map to "LOCATION" for evaluation.
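The per-tool mappings described in this section can be sketched as a small lookup table. This is only an illustration: the names `TAG_MAPS` and `normalize` are hypothetical, and the entity tuples are assumed to have the form (text, raw_tag, start_position).

```python
# Hypothetical per-tool tag maps, following the mappings stated in the text.
# Any raw tag that is not listed for a tool is ignored.
TAG_MAPS = {
    "stanford": {"PERSON": "PERSON", "LOCATION": "LOCATION",
                 "ORGANIZATION": "ORGANIZATION"},
    # The text maps both "ORG" and "GPE" to "ORGANIZATION" for spaCy.
    "spacy": {"PERSON": "PERSON", "LOC": "LOCATION",
              "ORG": "ORGANIZATION", "GPE": "ORGANIZATION"},
    "lingpipe": {"PERSON": "PERSON", "LOCATION": "LOCATION",
                 "ORGANIZATION": "ORGANIZATION"},
    # The text maps "GPE" to "LOCATION" for NLTK.
    "nltk": {"PERSON": "PERSON", "LOCATION": "LOCATION",
             "ORGANIZATION": "ORGANIZATION", "GPE": "LOCATION"},
}


def normalize(tool, entities):
    """Map (text, raw_tag, start) tuples to normalized tags, dropping other types."""
    tag_map = TAG_MAPS[tool]
    return [(text, tag_map[tag], start)
            for text, tag, start in entities
            if tag in tag_map]


# Illustrative input only; the offsets are made up.
spacy_raw = [("Singapore", "GPE", 0), ("2015", "DATE", 20)]
spacy_normalized = normalize("spacy", spacy_raw)
```

Keeping the maps as data rather than code makes it straightforward to add another tool or another entity type without touching the normalization function.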

Integration
In order to automate the evaluation process, we developed a Python script that integrates all the toolkits into one system.

Fig. 1. System diagram for automated evaluation
The overall system structure of the integrated evaluation system is shown in Fig. 1.
In the evaluation process, an annotated (gold-standard) input document must be provided. Currently, the supported format is IOB (short for Inside, Outside, Beginning) (Ramshaw and Marcus, 1995). In this scheme, every line in the file represents one token with two fields: the word itself and its named entity type. Empty lines denote sentence boundaries. Following is an example of the representation: The prefix "I-" in a tag means that the token is inside a chunk. The prefix "B-" indicates that the token begins a chunk; it is used only when a chunk immediately follows another chunk of the same type, with no "O" tag between them. The "O" tag means that the token is outside any chunk. This IOB chunk representation is much easier to use for manual annotation than an inline XML annotation scheme.
An Analyzer module is used to extract the source document as well as all chunks and their types from the annotated file. Every chunk is represented in the format of a three-element tuple: (chunk, type, start_position), where the start_position is the sequence position (character index) of the chunk in the source document. This tuple representation contains all the necessary information for the validation of a chunk, including its boundary.
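A minimal sketch of this extraction step is given below. It is not the authors' Analyzer: the function name `parse_iob` and the sample tokens are hypothetical, and it assumes the source text is rebuilt by joining tokens with single spaces and sentences with newlines, so that start_position refers to this reconstructed text.

```python
def parse_iob(lines):
    """Return (source_text, chunks) from IOB lines of the form 'token tag'.

    Each chunk is a (chunk, type, start_position) tuple.
    """
    text_parts = []      # pieces of the reconstructed source text
    chunks = []          # finished chunks
    current = None       # open chunk as [tokens, type, start_position]
    pos = 0              # character offset of the next token
    at_sentence_start = True

    def close():
        nonlocal current
        if current is not None:
            tokens, etype, start = current
            chunks.append((" ".join(tokens), etype, start))
            current = None

    for line in lines:
        fields = line.split()
        if not fields:                       # empty line: sentence boundary
            close()
            if text_parts:
                text_parts.append("\n")
                pos += 1
            at_sentence_start = True
            continue
        token, tag = fields[0], fields[-1]
        if not at_sentence_start:
            text_parts.append(" ")
            pos += 1
        at_sentence_start = False
        if tag == "O":
            close()
        else:
            prefix, etype = tag.split("-", 1)
            # "B-" or a change of type starts a new chunk; "I-" continues one.
            if prefix == "B" or current is None or current[1] != etype:
                close()
                current = [[token], etype, pos]
            else:
                current[0].append(token)
        text_parts.append(token)
        pos += len(token)
    close()
    return "".join(text_parts), chunks


# Hypothetical annotated sample, not taken from any corpus.
SAMPLE = """John I-PER
Smith I-PER
visited O
New I-LOC
York I-LOC
. O
"""
source_text, gold_chunks = parse_iob(SAMPLE.splitlines())
```

Because the start position is stored with every chunk, two mentions of the same name in different places remain distinct, which is what makes boundary validation possible.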
The Dispatcher module passes the source document to all NER tools. Each tool first tokenizes and analyzes the sentences and then creates its own list of tuples dynamically. Every output list from the NER tools is compared against the standard list generated from the annotated file. The comparison results are used to calculate true positives (TP), false positives (FP) and false negatives (FN), from which precision, recall and F-measure are calculated for evaluation. All the calculation results can be exported directly to an Excel file for easy comparison.
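The comparison step can be sketched as follows, assuming a predicted chunk counts as correct only when all three elements of its tuple match a gold chunk. The function name `score_exact` is hypothetical, and the F-measure is assumed to be the standard F1, the harmonic mean of precision and recall.

```python
def score_exact(gold, predicted):
    """Count TP, FP, FN over (chunk, type, start_position) tuples and derive P, R, F1."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)    # identical text, type and position
    fp = len(pred_set - gold_set)    # predicted but not in the gold standard
    fn = len(gold_set - pred_set)    # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"TP": tp, "FP": fp, "FN": fn, "P": precision, "R": recall, "F": f1}


# Illustrative lists only; the second prediction has a wrong boundary.
gold = [("John Smith", "PERSON", 0), ("New York", "LOCATION", 19)]
predicted = [("John Smith", "PERSON", 0), ("York", "LOCATION", 23)]
result = score_exact(gold, predicted)
```

Note that a boundary error is penalized twice under this scheme: the truncated prediction is a false positive and the gold chunk it failed to match is a false negative.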

Evaluation
With the methodology defined in section 2, we can evaluate all the selected tools on any data file annotated in the IOB format.

Evaluation Corpus
Since all the selected NER tools are able to classify the three entity types PERSON, LOCATION and ORGANIZATION, the evaluation corpus must contain at least these three entity types. Ideally, it should also be in the supported IOB chunk representation. We found that WikiGold meets these requirements. WikiGold (Balasuriya et al., 2009) is an annotated corpus over a small sample of Wikipedia articles in CoNLL format (IOB). It contains 145 documents (separated by "-DOCSTART-"), 1696 sentences and 39152 tokens.
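Counts of this kind can be obtained with a short script such as the sketch below. The function name `corpus_stats` is hypothetical, and the counting rules are assumptions: a line starting with "-DOCSTART-" opens a document and is not counted as a token, and sentences are runs of token lines separated by empty lines. The exact figures it yields on the real file depend on those rules.

```python
def corpus_stats(lines):
    """Return (documents, sentences, tokens) for a CoNLL-style IOB file."""
    docs = sentences = tokens = 0
    in_sentence = False
    for line in lines:
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            docs += 1              # document separator, not a token
            in_sentence = False
        elif not line:
            in_sentence = False    # empty line ends the current sentence
        else:
            tokens += 1
            if not in_sentence:    # first token after a boundary
                sentences += 1
                in_sentence = True
    return docs, sentences, tokens


# Tiny made-up sample: two documents, three sentences, four tokens.
sample = ["-DOCSTART- O", "", "A I-ORG", "b O", "", "c O", "",
          "-DOCSTART- O", "", "d O"]
stats = corpus_stats(sample)
```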

Evaluation Metrics
There are different metrics for evaluating NER systems (Nadeau and Sekine, 2007). The evaluation essentially checks a tool's ability to find the boundaries of names and their correct types. Most evaluation schemes require an exact match on both boundary and entity type. The shared task of CoNLL 2003 (Sang and Meulder, 2003) is one example of exact matching. However, in some cases exact boundary detection is not so important as long as the major part of the name has been identified. For instance, "The United Nations" and "United Nations", or "in November 2015" and "November 2015", are almost the same except for the definite article and the preposition. The metrics used in the Message Understanding Conference (MUC) (Grishman and Sundheim, 1996) adopted looser matching conditions, which allow partial credit when a partial span or a wrong type is detected. Credit was given to any correct entity type detected regardless of its boundary as long as there is an overlap, as well as to any correct boundary identified regardless of the type. Here we score NER systems based on the following two metrics:
a) Exact matching on both boundary and type (similar to CoNLL), which measures a system's capability for accurate named entity detection.
b) Partial matching on boundary, counted only when the detected type is correct. This measurement mitigates the failures of exact matching when the boundary differences are caused by unimportant words in the names, such as articles and prepositions.
Based on the above two scoring protocols, the measuring system counts TP, FP and FN for every NER toolkit. Precision, P = TP / (TP + FP), and recall, R = TP / (TP + FN), are then calculated to check the NER system's type I (false alarm) and type II (miss) errors respectively.

Table 2. Evaluation results on the WikiGold annotated data for the selected NER tools

Table 2 shows the results of the four selected NER systems on the WikiGold data set. In the table, Precision (P), Recall (R) and F1 measure (F) are calculated for every entity type, and an overall score is also given for each measurement. Similarly, the Precision (PP), Recall (PR) and F1 measure (PF) for partial boundary matching as described in section 3.2 are also calculated. From the results in Table 2 we can derive the following conclusions:

a) Loose boundary matching shows better results than exact matching for every entity type across all the NER tools. That means there are quite a number of cases where the NER systems detected the right entity types but the boundaries did not match exactly.
b) ORGANIZATION appears to be the most difficult entity type to detect for all the NER tools. This is reflected in its lower scores compared with the PERSON and LOCATION types.
c) Stanford NER and spaCy generally show better performance on this data set for both exact matching and partial matching.
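The partial matching protocol behind the PP, PR and PF scores can be sketched as follows. This is an assumed reading rather than the authors' implementation: a prediction is credited when it has the same type as a gold chunk and their character spans overlap, each gold chunk is credited at most once, and the span end is taken as the start position plus the chunk length. The names `spans_overlap` and `count_partial` are hypothetical.

```python
def spans_overlap(a, b):
    """True if two (chunk, type, start_position) tuples share at least one character."""
    a_start, a_end = a[2], a[2] + len(a[0])
    b_start, b_end = b[2], b[2] + len(b[0])
    return a_start < b_end and b_start < a_end


def count_partial(gold, predicted):
    """Return (TP, FP, FN) under type-correct, overlap-based matching."""
    unmatched = list(gold)
    tp = 0
    for pred in predicted:
        for g in unmatched:
            if pred[1] == g[1] and spans_overlap(pred, g):
                unmatched.remove(g)   # each gold chunk is matched only once
                tp += 1
                break
    fp = len(predicted) - tp
    fn = len(unmatched)
    return tp, fp, fn


# Offsets refer to the made-up text "The United Nations met in November 2015".
gold = [("United Nations", "ORGANIZATION", 4), ("November 2015", "DATE", 26)]
predicted = [("The United Nations", "ORGANIZATION", 0),
             ("in November 2015", "DATE", 23),
             ("United", "PERSON", 4)]
tp, fp, fn = count_partial(gold, predicted)
```

Under exact matching none of these three predictions would be counted as correct, whereas here the first two are credited and only the wrongly typed one remains a false positive.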

Hybrid NER System
For our research projects, we need an NER system that is able to recognize PERSON, LOCATION and ORGANIZATION as well as DATE. Among the evaluated NER tools, we selected Stanford NER and spaCy for the proposed hybrid NER system. Both tools scored well in our previous evaluation and are able to identify DATE entities without any extra configuration (the Stanford NER 7-class model includes the DATE type).
Our first target application domain is Wikipedia pages about Singapore. To construct the hybrid NER system, we simply combined the outputs of Stanford NER and spaCy by taking their union. In addition, a dictionary with a limited number of PERSON, LOCATION and ORGANIZATION entries about Singapore was created with the expectation of improving system precision (Tsuruoka and Tsujii, 2003; Cohen and Sarawagi, 2004). The dictionary has the highest priority when there is any conflict with the outputs of the other tools. Stanford NER has the second-highest priority in determining the final named entities.
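The combination rule above can be sketched as a priority-ordered union. The sketch assumes that a "conflict" means two entities whose character spans overlap, which the text does not spell out; the function names and the sample entities are hypothetical.

```python
def spans_overlap(a, b):
    """True if two (chunk, type, start_position) tuples share at least one character."""
    return a[2] < b[2] + len(b[0]) and b[2] < a[2] + len(a[0])


def combine(dictionary, stanford, spacy):
    """Union of entity tuples; on overlap the higher-priority source wins.

    Priority order: dictionary, then Stanford NER, then spaCy.
    """
    merged = []
    for source in (dictionary, stanford, spacy):
        for entity in source:
            # Keep an entity only if nothing of higher priority covers its span.
            if not any(spans_overlap(entity, kept) for kept in merged):
                merged.append(entity)
    return sorted(merged, key=lambda e: e[2])


# Made-up outputs for one document; the offsets are illustrative.
dictionary_hits = [("Raffles Place", "LOCATION", 10)]
stanford_out = [("Raffles", "PERSON", 10), ("Singapore", "LOCATION", 30)]
spacy_out = [("Singapore", "ORGANIZATION", 30), ("1 February 1858", "DATE", 50)]
hybrid = combine(dictionary_hits, stanford_out, spacy_out)
```

A union of this kind can only add entities relative to the highest-priority tool, so it tends to raise recall at some cost in precision.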

Data for Evaluation
In order to evaluate the performance of the hybrid system, we manually annotated twenty-two web pages. All the web pages are from the Singapore National Library Board eResources. Half of the web pages are about Singapore history; the other half are Infopedia pages. We first used the Stanford tool to tokenize all the documents and saved them into separate files. Each token is placed on its own line, with an empty line separating sentences. Every token was then manually annotated in IOB format. Table 3 shows the statistics of the two manually annotated datasets.

a) All the conclusions we drew from the evaluation results on the WikiGold dataset remain valid for the two manually annotated datasets, History and Infopedia.
b) Stanford NER generally shows good performance on all tested datasets. However, its scores on the DATE entity type are not as good as spaCy's. After further analysis of the false alarms and misses, we noticed that Stanford NER has difficulty identifying the full date information in the text. For instance, from the text "on 1 February 1858" it identifies only "February 1858"; the day of the month is always missing. This problem is probably caused by the fact that Stanford NER is not trained for the "day month year" date format. An alternative solution is to use its rule-based Temporal Tagger (SUTime). However, this is not included in the current evaluation.
c) The hybrid system usually has lower precision and higher recall than Stanford NER for the entity types PERSON, LOCATION and ORGANIZATION. Its F1-measure for these three entity types is slightly better than that of Stanford NER on the History data, but slightly worse on the Infopedia data.
d) In general, the hybrid system has better overall performance than both Stanford NER and spaCy. This is especially true for the History test data. However, most of the advantage comes from its better DATE entity recognition.
e) Overall, all the NER tools, including the hybrid system, performed better on the History data than on the Infopedia data. This is mostly caused by noise in the Infopedia documents, for instance HTML character entities such as "&rsquo;" and un-delimited words such as "COMPASS.FamilyWife", which result from extracting the text from the HTML pages.

Their experiments showed that Stanford NER gave the best overall performance across two datasets and was most effective on the PER and LOC types. Alchemy API achieved the best results for the ORG type. Our work differs from the above-mentioned validation tasks in the following ways:
a) We developed a validation framework which can work with various NER tools regardless of their programming languages. All the tools can be run dynamically for immediate validation against a gold-standard corpus. The comparison results can be presented in a text document or exported directly to an Excel file in a predefined table format.
b) The selected tools are evaluated with both a publicly available gold-standard corpus and our manually annotated datasets.
c) After evaluating the selected NER tools, we took a further step by combining the best-performing NER tools to construct a new hybrid NER tool for our application domain.

Conclusion
In this paper, we conducted a comparative evaluation of four publicly available and well-established NER tools: Stanford NER, spaCy, Alias-i LingPipe and NLTK. For validation purposes, a framework has been developed in Python which works with NER systems implemented in different programming languages. The output can be produced dynamically as either text documents or Excel tables. The selected NER tools were evaluated using a publicly available gold-standard corpus and our manually annotated datasets. The results showed that Stanford NER, followed by spaCy, performed best across all the test datasets. We further constructed a hybrid NER tool for our application domain by combining the two best-performing NER tools.
In the future, we plan to continue improving the overall performance of the hybrid NER system by combining different features of more advanced systems as well as rule-based components.