BENGAL: An Automatic Benchmark Generator for Entity Recognition and Linking

The manual creation of gold standards for named entity recognition and entity linking is time- and resource-intensive. Moreover, recent works show that such gold standards contain a large proportion of mistakes in addition to being difficult to maintain. We hence present BENGAL, a novel approach for the automatic generation of such gold standards as a complement to manually created benchmarks. The main advantage of our benchmarks is that they can be readily generated at any time. They are also cost-effective while being guaranteed to be free of annotation errors. We compare the performance of 11 tools on benchmarks in English generated by BENGAL and on 16 benchmarks created manually. We show that our approach can be ported easily across languages by presenting results achieved by 4 tools on both Brazilian Portuguese and Spanish. Overall, our results suggest that our automatic benchmark generation approach can create varied benchmarks that have characteristics similar to those of existing benchmarks. Our approach is open-source. Our experimental results are available at http://faturl.com/bengalexpinlg and the code at https://github.com/dice-group/BENGAL.


Introduction
The creation of gold standards is of central importance for the objective assessment and development of approaches all around computer science. For example, evaluation campaigns such as BioASQ (Tsatsaronis et al., 2012) have led to an improvement of the F-measure achieved by biomedical question answering systems by more than 5%. While the manual creation of Named Entity Recognition (NER) and Entity Linking (EL) gold standards (also called benchmarks) has the advantage of yielding resources which reflect human processing, it also exhibits significant disadvantages:
a) Annotation mistakes: Human annotators have to read through every sentence in the corpus and often (i) miss annotations or (ii) assign wrong resources to entities for reasons as various as fatigue or lack of background knowledge (and this even when supported with annotation tools). For example, Jha et al. (2017) were able to determine that up to 38,453 of the annotations in commonly used benchmarks (see GERBIL (Usbeck et al., 2015) for a list of these benchmarks) were erroneous. A manual evaluation of 25 documents from the ACE2004 benchmark revealed that 195 annotations were missing and 14 of 306 annotations were incorrect. Similar findings were reported for AIDA/CONLL (Tjong Kim Sang and De Meulder, 2003) and OKE2015 (Nuzzolese et al., 2015).
b) Volume: Manually created benchmarks are usually small (commonly < 2,500 documents, see Table 2). Hence, they are of little help when aiming to benchmark the scalability of existing solutions (especially when these solutions use caching).
c) Lack of updates: Manual benchmark generation approaches lead to static corpora which tend not to reflect the newest reference knowledge graphs (also called Knowledge Bases (KBs)). For example, several of the benchmarks presented in GERBIL (Usbeck et al., 2015) link to outdated versions of Wikipedia or DBpedia.
d) Popularity bias: van Erp et al. (2016) show that manual benchmarks are often biased towards popular resources.
e) Lack of availability: The lack of benchmarks for resource-poor languages inhibits the development of corresponding NER and EL solutions.
Automatic methods are a viable and supplementary approach for the generation of gold standards for NER and EL, especially as they address some of the weaknesses of the manual benchmark creation process. The main contribution of our paper is a novel approach for the automatic generation of benchmarks for NER and EL dubbed BENGAL. Our approach relies on the abundance of structured data in Resource Description Framework (RDF) on the Web and is based on Natural Language Generation (NLG) techniques which verbalize such data to generate automatically annotated natural language statements. Our automatic benchmark creation method addresses the aforementioned drawbacks of manual benchmark generation as follows:
a) It alleviates the human annotation error problem by relying on data in RDF which explicitly contain the entities to find.
b) BENGAL is able to generate arbitrarily large benchmarks. Hence, it can enhance the measurement of both the accuracy and the scalability of approaches.
c) BENGAL can be updated easily to reflect the newest terminology and reference KBs. Hence, it can generate corpora that reflect the newest KBs.
d) BENGAL is not biased towards popular resources as it can choose entities to include in the benchmark generated following a uniform distribution.
e) BENGAL can be ported to any token-based language. This is exemplified by porting BENGAL to Portuguese and Spanish.
Related Work

Gold Standards for NER and EL
According to GERBIL (Usbeck et al., 2015), the 2003 CoNLL shared task (Tjong Kim Sang and De Meulder, 2003) is the most used benchmark dataset for recognition and linking. The ACE2004 and MSNBC (Cucerzan, 2007) news datasets were used by Ratinov et al. (2011) to evaluate their seminal work on linking to Wikipedia. Another often-used corpus is AQUAINT, e.g., used by Milne and Witten (2008). Detailed dataset statistics on some of these benchmarks can be found in Table 2.
A recent uptake of publicly available corpora (Röder et al., 2014; Steinmetz et al., 2013) based on RDF has led to the creation of many new datasets. For example, the Spotlight corpus and the KORE 50 dataset were proposed to showcase the usability of RDF-based annotations (Mendes et al., 2011). The multilingual N3 collection (Röder et al., 2014) was introduced to widen the scope and diversity of NIF-based corpora. Another recent observation is the shift towards gold standards for micropost documents like tweets. For example, the Microposts2014 corpus (Cano Basave et al., 2014) was created to evaluate NER on smaller pieces of text.
Semi-automatic approaches to benchmark creation are commonly crowd-based. They use one or more recognizers to create a first set of annotations and then hand over the tasks of refinement and/or linking to crowd workers to improve the quality. Examples of such approaches include Voyer et al. (2010) and CALBC (Rebholz-Schuhmann et al., 2010). Oramas et al. (2016) introduced a voting-based algorithm which analyses the hyperlinks presented in the input texts retrieved from different disambiguation systems such as Babelfy (Moro et al., 2014). Each entity mention in the input text is linked based on the degree of agreement across three EL systems.
BENGAL is the first automatic approach that makes use of structured data and can be replicated on any RDF KB for EL benchmarks.

NLG for the Web of Data
A plethora of works, such as Staykova (2014) and Bouayad-Agha et al. (2014), have investigated the generation of Natural Language (NL) texts from Semantic Web Technologies (SWT). However, the generation of NL from RDF has only recently gained momentum. This attention comes from the great number of published works, such as (Cimiano et al., 2013; Duma and Klein, 2013; Ell and Harth, 2014; Biran and McKeown, 2015), which used RDF as input data and achieved promising results. Moreover, the works published in the WebNLG challenge (Colin et al., 2016), which used deep learning techniques such as (Sleimi and Gardent, 2016), also contributed to this interest. RDF has also shown promising benefits for the generation of benchmarks for evaluating NLG systems, e.g., (Gardent et al., 2017; Mohammed et al., 2016; Schwitter et al., 2004; Hewlett et al., 2005; Sun and Mellish, 2006). However, RDF has never been used for creating NER and EL benchmarks. BENGAL addresses this research gap.

The BENGAL approach
BENGAL is based on the observation that more than 150 billion facts pertaining to more than 3 billion entities are available in machine-readable form on the Web (i.e., as RDF triples). The basic intuition behind our approach is hence as follows: Given that NER and EL are often used in pipelines for the extraction of machine-readable facts from text, we can invert the pipeline and go from facts to text, thereby using the information in the facts to produce a gold standard that is guaranteed to contain no errors. In the following, we begin by giving a brief formal overview of RDF. Thereafter, we present how we use RDF to generate NER and EL benchmarks automatically and at scale.

RDF
The notation presented herein is based on the RDF 1.1 specification. An RDF graph G is a set of facts. Each fact is a triple t = (s, p, o) ∈ (R ∪ B) × P × (R ∪ B ∪ L) where R is the set of all resources (i.e., things of the real world), P is the set of all predicates (binary relations), B is the set of all blank nodes (which basically express existential quantification) and L is the set of all literals (i.e., of datatype values). We call the set R ∪ P ∪ L ∪ B our universe and call its elements entities. A fragment of DBpedia is shown below. We will use this fragment in our examples. For the sake of space, our examples are in English. However, note that we ported BENGAL to Portuguese and Spanish to exemplify that it is not biased towards a particular language. We chose these two languages because of their morphological richness.
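To make the triple definition concrete, the following minimal Python sketch models the sets R, P, B and L as small classes and checks whether a triple lies in (R ∪ B) × P × (R ∪ B ∪ L). The class names and the example IRIs are assumptions made for illustration; they are not part of BENGAL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:      # element of R
    iri: str


@dataclass(frozen=True)
class Predicate:     # element of P
    iri: str


@dataclass(frozen=True)
class BlankNode:     # element of B
    label: str


@dataclass(frozen=True)
class Literal:       # element of L
    value: str


def is_valid_triple(s, p, o):
    """Check that (s, p, o) is in (R ∪ B) × P × (R ∪ B ∪ L)."""
    return (
        isinstance(s, (Resource, BlankNode))
        and isinstance(p, Predicate)
        and isinstance(o, (Resource, BlankNode, Literal))
    )


# Hypothetical example triple in the spirit of the running example.
einstein = Resource(":AlbertEinstein")
birth_place = Predicate(":birthPlace")
ulm = Resource(":Ulm")
print(is_valid_triple(einstein, birth_place, ulm))
```

A literal in subject position, e.g. is_valid_triple(Literal("Ulm"), birth_place, ulm), is rejected by this check, which mirrors the asymmetry between subject and object positions in the definition.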

Benchmarks
We define a benchmark as a set C of annotated documents D_i. Each document D_i is a sequence of characters s_{i1} ... s_{in}. Each subsequence s_{ij} ... s_{ik} (with j < k) of the document D_i which stands for a resource r ∈ R is assumed to be marked as such. We model the marking of resources by the function m : C × N × N → R and write m(D_i, j, k) = r to signify that the substring s_{ij} ... s_{ik} stands for the resource r. In case the substring s_{ij} ... s_{ik} does not stand for a resource, we write m(D_i, j, k) = . Let D_0 be the example shown in Listing 2. We would write m(D_0, 0, 14) = :AlbertEinstein.
Albert Einstein was born in Ulm.
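The marking function m can be illustrated with a small Python sketch. It assumes 0-based, inclusive character offsets, as in m(D_0, 0, 14) above, and stores the marks of one document in a dictionary. The function name mark, the second annotation for :Ulm, and the use of None for unmarked spans are assumptions made for illustration.

```python
# Marks of a single document: (start, end) offsets, both inclusive -> resource.
doc = "Albert Einstein was born in Ulm."
annotations = {
    (0, 14): ":AlbertEinstein",  # given in the text above
    (28, 30): ":Ulm",            # assumed additional mark for illustration
}


def mark(annotations, j, k):
    """Return the resource marked on characters j..k, or None if unmarked."""
    return annotations.get((j, k))


# Inclusive offsets translate to the Python slice doc[j:k + 1].
for (j, k), resource in annotations.items():
    print(doc[j:k + 1], "->", resource)
```

Storing offsets rather than surface strings keeps the gold standard unambiguous when the same label occurs several times in a document.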

Verbalization
The notation and formal framework for verbalization in BENGAL are based on SPARQL2NL (Ngonga Ngomo et al., 2013). Let W be the set of all words in the dictionary of our target language (e.g., English). We define the realization function ρ : R ∪ P ∪ L → W* as the function which maps each entity to a word or sequence of words from the dictionary. Formally, the goal of our NLG approach is to devise an extension of ρ to conjunctions of RDF triples. This extension maps all triples t to their realization ρ(t) and defines how these atomic realizations are to be combined. We denote the extension of ρ by the same label ρ for the sake of simplicity. We adopt a rule-based approach to devise the extension of ρ, where the rules extending ρ to RDF triples are expressed in a conjunctive manner. This means that for premises P_1, ..., P_n and consequences K_1, ..., K_m we write P_1 ∧ ... ∧ P_n ⇒ K_1 ∧ ... ∧ K_m. The premises and consequences are explicated by using an extension of the Stanford dependencies. We rely especially on the constructs explained in Table 1. For example, a possessive dependency between two phrase elements e_1 and e_2 is represented as poss(e_1, e_2). For the sake of simplicity, we sometimes reduce the construct subj(y,x) ∧ dobj(y,z) to the triple (x,y,z) ∈ W^3.

Approach
Table 1 (Dependency: Explanation)
cc: Stands for the relation between a conjunct and a given conjunction (in most cases and or or). For example, in the sentence John eats an apple and a pear, cc(PEAR,AND) holds. We mainly use this construct to specify reduction and replacement rules.
conj*: Used to build the conjunction of two phrase elements, e.g., conj(subj(EAT,JOHN), subj(DRINK,MARY)) stands for John eats and Mary drinks. conj is not to be confused with the logical conjunction ∧, which we use to state that two dependencies hold in the same sentence. For example, subj(EAT,JOHN) ∧ dobj(EAT,FISH) is to be read as John eats fish.
dobj: Dependency between a verb and its direct object, for example dobj(EAT,APPLE) expresses to eat an/the apple.
nn: The noun compound modifier is used to modify a head noun by the means of another noun. For instance, nn(FARMER,JOHN) stands for farmer John.
poss: Expresses a possessive dependency between two lexical items, for example poss(JOHN,DOG) expresses John's dog.
subj: Relation between subject and verb, for example subj(BE,JOHN) expresses John is.

BENGAL assumes that it is given (1) an RDF graph G, (2) a number of documents to generate, (3) a minimal resp. maximal document size (i.e., number of triples to use during the generation process) d_min resp. d_max, (4) a set of restrictions pertaining to the resources to generate and (5) a strategy for generating single documents. Given the graph G, BENGAL begins by selecting a set of seed resources from G based on the restrictions set using parameter (4). Thereafter, it uses the strategy defined via parameter (5) to select a subgraph of G. This subgraph contains a randomly selected number d of triples with d_min ≤ d ≤ d_max. The subgraph is then verbalized. The verbalization is annotated automatically and finally returned as a single document. Each single document may then be paraphrased if this option is chosen in the initial phase. This process is repeated as many times as necessary to reach the predefined number of documents. In the following, we present the details of each step underlying our benchmark generation process displayed in Figure 1.
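The overall generation loop described in this section can be sketched as follows. This is a minimal illustration in Python, not the BENGAL implementation: the function names, the toy graph, the uniform choice of a seed per document, and the placeholder strategies first_d_triples and naive_verbalize are all assumptions.

```python
import random

# Toy graph as (subject, predicate, object) tuples; hypothetical data.
GRAPH = [
    (":AlbertEinstein", ":birthPlace", ":Ulm"),
    (":AlbertEinstein", ":deathPlace", ":Princeton"),
]


def generate_benchmark(graph, seeds, n_docs, d_min, d_max,
                       select_subgraph, verbalize, paraphrase=None, seed=0):
    """Repeat seed choice, subgraph selection and verbalization n_docs times."""
    rng = random.Random(seed)
    documents = []
    for _ in range(n_docs):
        resource = rng.choice(seeds)       # seeds come from parameter (4)
        d = rng.randint(d_min, d_max)      # d_min <= d <= d_max
        triples = select_subgraph(graph, resource, d, rng)  # parameter (5)
        text = verbalize(triples)
        if paraphrase is not None:         # optional paraphrasing step
            text = paraphrase(text)
        documents.append(text)
    return documents


def first_d_triples(graph, resource, d, rng):
    """Placeholder strategy: the first d triples with the seed as subject."""
    return [t for t in graph if t[0] == resource][:d]


def naive_verbalize(triples):
    """Placeholder verbalizer: prints triples instead of real sentences."""
    return " ".join(f"{s} {p} {o}." for s, p, o in triples)


docs = generate_benchmark(GRAPH, [":AlbertEinstein"], 3, 1, 2,
                          first_d_triples, naive_verbalize)
print(len(docs))
```

Passing the strategy and the verbalizer as functions keeps the loop independent of the concrete subgraph strategy, which matches the role of parameter (5).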

Seed Selection
Given that we rely on RDF, we model the seed selection by means of a SPARQL SELECT query with one projection variable. Note that we can use the wealth of SPARQL to devise seed selection strategies of arbitrary complexity. However, given that NER and EL frameworks commonly focus on particular classes of resources, we are confronted with the condition that the seeds must be instances of a set of classes, e.g., :Person, :Organization or :Place. The SPARQL query for our example dataset would be as follows:
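A seed selection query of this kind might look like the one built by the following Python sketch. The query shape (a VALUES clause over the allowed classes), the helper name build_seed_query, the variable names and the example namespace are assumptions for illustration and are not taken from the BENGAL code base.

```python
def build_seed_query(classes, limit=None):
    """Build a SPARQL SELECT query with the single projection variable ?seed.

    classes: class names such as ":Person" that the seeds must be instances of.
    """
    values = " ".join(classes)
    lines = [
        "PREFIX : <http://example.org/>",   # hypothetical namespace
        "SELECT DISTINCT ?seed WHERE {",
        f"  VALUES ?class {{ {values} }}",  # restrict seeds to these classes
        "  ?seed a ?class .",
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


query = build_seed_query([":Person", ":Organization", ":Place"])
print(query)
```

More complex restrictions (for example, a minimal number of outgoing triples per seed) could be added as further graph patterns inside the WHERE clause.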

Subgraph Generation
Our approach to generating subgraphs is reminiscent of SPARQL query topologies as available in SPARQL query benchmarks. As these queries (e.g., FEASIBLE queries) describe real information needs, their topology must stand for the type of information that is required by applications and humans. We thus distinguish between three main types of subgraphs to be generated from RDF data: (1) star graphs provide information about a particular entity (e.g., the short biography of a person); (2) path graphs describe the relations between two entities (e.g., the relation between a gene and a side-effect); (3) hybrid graphs are a mix of both and commonly describe a specialized subject matter involving several actors (e.g., a description of the cast of a movie). Star Graphs. For each s_i ∈ S, we gather all triples of the form t = (s_i, p, o) ∈ R × P × (R ∪ L). The triples are then added to a list L(s_i) sorted in descending order according to a hash function h. After randomly selecting a document size d between d_min and d_max, we select d random triples from L(s_i). For the dataset shown in Listing 1 and d = 2, we would for example get Listing 4.
Listing 4: Example dataset generated by the star strategy.
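The star strategy can be sketched in a few lines of Python. The toy graph, the use of SHA-1 as the hash function h, and the convention that blank nodes carry the prefix "_:" are assumptions made for illustration.

```python
import hashlib
import random

# Hypothetical toy graph; the blank-node triple is skipped by the strategy.
GRAPH = [
    (":AlbertEinstein", ":birthPlace", ":Ulm"),
    (":AlbertEinstein", ":deathPlace", ":Princeton"),
    (":AlbertEinstein", ":spouse", "_:b0"),
    (":Ulm", ":country", ":Germany"),
]


def h(triple):
    """Assumed hash function used to order the candidate list L(s_i)."""
    return hashlib.sha1(" ".join(triple).encode("utf-8")).hexdigest()


def star_subgraph(graph, seed, d_min, d_max, rng):
    """Select d random triples that have the seed as subject."""
    candidates = [
        t for t in graph
        if t[0] == seed and not t[2].startswith("_:")  # no blank nodes
    ]
    candidates.sort(key=h, reverse=True)               # descending by hash
    d = rng.randint(d_min, d_max)                      # document size
    return rng.sample(candidates, min(d, len(candidates)))


result = star_subgraph(GRAPH, ":AlbertEinstein", 2, 2, random.Random(0))
print(result)
```

Sorting by a hash before sampling makes the candidate order independent of the order in which the triple store returns results, so a fixed random seed yields reproducible benchmarks.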

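Symmetric Star Graphs. As above, except that L(s_i) also contains the triples in which s_i is the object, i.e., triples of the form (o, p, s_i).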
Path Graphs. For each s_i ∈ S, we begin by computing the list L(s_i) as in the symmetric star graph generation. Then, we pick a random triple (s_i, p, o) or (o, p, s_i) from L(s_i) in which o is a resource. We then use o as seed and repeat the operation until we have generated d triples, where d is randomly generated as above. For the example dataset shown in Listing 1 and d = 2, we would for example get Listing 5.
Listing 5: Example dataset generated by the path strategy.
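The path strategy can be sketched as a random walk over the graph. In the Python sketch below, the toy graph, the is_resource convention (resources start with ":"), the rule that a triple is used at most once, and the early stop when no candidate is left are assumptions added to keep the example small and terminating.

```python
import random

# Hypothetical toy graph; the date literal cannot be used as the next seed.
GRAPH = [
    (":AlbertEinstein", ":birthPlace", ":Ulm"),
    (":AlbertEinstein", ":birthDate", '"1879-03-14"'),
    (":Ulm", ":country", ":Germany"),
]


def is_resource(term):
    """Assumed convention: resources start with ':', literals are quoted."""
    return term.startswith(":")


def path_subgraph(graph, seed, d, rng):
    """Walk from the seed along triples (s, p, o) or (o, p, s) for d steps."""
    path, current = [], seed
    while len(path) < d:
        candidates = [
            t for t in graph
            if t not in path
            and ((t[0] == current and is_resource(t[2])) or t[2] == current)
        ]
        if not candidates:
            break  # dead end: return the shorter path
        triple = rng.choice(candidates)
        path.append(triple)
        # The other end of the chosen triple becomes the next seed.
        current = triple[2] if triple[0] == current else triple[0]
    return path


path = path_subgraph(GRAPH, ":AlbertEinstein", 2, random.Random(0))
print(path)
```

On this toy graph the walk is forced: the only resource-valued triple of the seed leads to :Ulm, and the only unused triple of :Ulm leads to :Germany.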
Hybrid Graphs. This is a 50/50-mix of the star and path graph generation approaches. In each iteration, we choose and apply one of the two strategies above randomly. For example, the hybrid graph generation can generate:
:AlbertEinstein :birthPlace :Ulm .
:AlbertEinstein :deathPlace :Princeton .
Note that we do not consider blank nodes during subgraph generation as they cannot be verbalized due to the existential quantification they stand for.
Listing 6: Example dataset generated by the hybrid strategy.
Summary Graph Generation. This last strategy is a specialization of the star graph generation where the set of triples pertaining to a resource is not chosen randomly. Instead, for each class (e.g., :Person) of the input KB, we begin by filtering the set of properties and only consider properties that (1) have the said class as domain and (2) achieve a coverage above a user-set threshold (60% in our experiments) (e.g., :birthPlace, :deathPlace, :spouse). We then build a property co-occurrence graph for the said class in which the nodes are the properties selected in the preceding step and the co-occurrence of two properties p_1 and p_2 is given by the instances r of the input class for which ∃o_1, o_2 : (r, p_1, o_1) ∈ K ∧ (r, p_2, o_2) ∈ K. The resulting graph is then clustered (e.g., by using the approach presented by Ngonga Ngomo and Schumacher (2009)). We finally select the clusters which contain the properties with the highest frequencies in K that allow the selection of at least d triples from K. For example, if :birthPlace (frequency = 10) and :deathPlace (frequency = 10) were in the same cluster while :spouse (frequency = 8) were in its own cluster, we would choose the pair (:birthPlace, :deathPlace) and return the corresponding triples for our input resource. Hence, we would return Listing 4 for our running example.
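The first two steps of the summary strategy, property filtering and co-occurrence counting, can be sketched as follows. The sketch assumes that coverage is the fraction of instances of the class that have at least one triple with the property, and that the co-occurrence weight of two properties is the number of instances having both. These definitions, the function names and the toy data are assumptions; the clustering step is not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical instances of one class and their triples.
INSTANCES = [":a", ":b", ":c"]
TRIPLES = [
    (":a", ":birthPlace", ":x"), (":a", ":deathPlace", ":y"),
    (":b", ":birthPlace", ":x"), (":b", ":deathPlace", ":z"),
    (":c", ":birthPlace", ":y"), (":c", ":spouse", ":a"),
]


def properties_by_instance(instances, triples):
    """Map each instance to the set of properties it is the subject of."""
    props = {r: set() for r in instances}
    for s, p, _ in triples:
        if s in props:
            props[s].add(p)
    return props


def frequent_properties(instances, triples, threshold):
    """Keep properties whose coverage lies above the threshold."""
    props = properties_by_instance(instances, triples)
    counts = Counter(p for ps in props.values() for p in ps)
    return {p for p, c in counts.items() if c / len(instances) > threshold}


def cooccurrence(instances, triples, selected):
    """Count, per property pair, the instances that have both properties."""
    props = properties_by_instance(instances, triples)
    weights = Counter()
    for ps in props.values():
        for pair in combinations(sorted(ps & selected), 2):
            weights[pair] += 1
    return weights


selected = frequent_properties(INSTANCES, TRIPLES, 0.6)
weights = cooccurrence(INSTANCES, TRIPLES, selected)
print(selected, dict(weights))
```

The resulting weighted pairs form the edges of the property co-occurrence graph that a clustering algorithm would then take as input.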

Verbalization module
The verbalization (micro-planning) strategy for the first four strategies consists of verbalizing each triple as a single sentence and is derived from SPARQL2NL (Ngonga Ngomo et al., 2013). To verbalize the subject of the triple t = (s, p, o), we use one of its labels according to Ell et al. (2011) (e.g., the rdfs:label). If the object o is a resource, we follow the same approach as for the subject. Importantly, the verbalization of a triple t = (s, p, o) depends mostly on the verbalization of the predicate p (see Table 1 for semantics). If p can be realized as a noun phrase, then a possessive clause can be used to express the semantics of (s, p, o). For example, if p can be verbalized as a nominal compound like birth place, then the verbalization ρ(s, p, o) of the triple is as follows: poss(ρ(p),ρ(s)) ∧ subj(BE,ρ(p)) ∧ dobj(BE,ρ(o)). In case p's realization is a verb, then the triple can be verbalized as subj(ρ(p),ρ(s)) ∧ dobj(ρ(p),ρ(o)). In our example, verbalizing (:AlbertEinstein, dbo:birthPlace, :Ulm) would thus lead to Albert Einstein's birth place is Ulm., as birth place is a noun.
In the case of summary graphs, we go beyond the verbalization of single sentences and merge sentences that were derived from the same cluster.
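The single-triple rule described above can be illustrated with a small Python sketch that produces both the sentence and its annotations. It assumes that labels and the noun/verb decision are supplied by the caller, uses plain string templates instead of a dependency-based realizer, and does not inflect verbs; the function name and signature are hypothetical.

```python
def verbalize_triple(s_iri, s_label, p_label, o_iri, o_label, p_is_noun):
    """Verbalize one triple and return (sentence, annotations).

    annotations maps inclusive (start, end) character offsets to resources.
    """
    if p_is_noun:
        # poss(p, s) ∧ subj(BE, p) ∧ dobj(BE, o): "<s>'s <p> is <o>."
        middle = f"'s {p_label} is "
    else:
        # subj(p, s) ∧ dobj(p, o): "<s> <p> <o>."
        middle = f" {p_label} "
    sentence = f"{s_label}{middle}{o_label}."
    o_start = len(s_label) + len(middle)
    annotations = {
        (0, len(s_label) - 1): s_iri,
        (o_start, o_start + len(o_label) - 1): o_iri,
    }
    return sentence, annotations


sentence, annotations = verbalize_triple(
    ":AlbertEinstein", "Albert Einstein", "birth place", ":Ulm", "Ulm", True
)
print(sentence)
```

Because the offsets are computed while the sentence is assembled, the annotations are correct by construction, which is the property the approach relies on for error-free gold standards.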

Paraphrasing
With this step, BENGAL avoids the generation of a large number of sentences that share the same terms and the same structure. Additionally, this step makes the use of reverse engineering strategies for the generation more difficult as it increases the diversity of the text in the benchmarks. Our paraphrasing is largely based on Androutsopoulos and Malakasiotis (2010) and runs as follows:
1. Change the structure of the sentence: We use the location of verbs in each sentence to randomly change passive into active structures and vice-versa. Sentences which describe type information (e.g., Einstein is a person) are not altered.
2. Replace synonyms: We use POS tags to select alternative labels from the knowledge base and a reference dictionary to replace entity labels by a synonym.
An example of a paraphrase generated by BENGAL is shown in Listing 7.

Experiments and Results
We generated 13 datasets in English (B1-B13), 4 datasets in Brazilian Portuguese and 4 datasets in Spanish to evaluate our approach. B1 to B10 were generated by running our five subgraph generation methods with and without paraphrasing. The number of documents was set to 100 while (d_min, d_max) was set to (1, 5). B11 shows how BENGAL can be used to evaluate the scalability of approaches. Here, we used the hybrid generation strategy to generate 10,000 documents. B12 and B13 comprise 10 longer documents each with d_min set to 90. For B12, we focused on generating a high number of entities in the documents while B13 contains fewer entities but the same number of documents.
We compared B1-B13 with the 16 manually created gold standards for English found in GERBIL. The comparison was carried out in two ways. First, we assessed the features of the datasets. Then, we compared the micro F-measure of 11 NER and EL frameworks on the manually and automatically generated datasets. We chose to use these 11 frameworks because they are included in GERBIL. This inclusion ensures that their interfaces are compatible and their results comparable. In addition, we assessed the performance of multilingual NER and EL systems on the datasets P1-P4 to show that BENGAL can be easily ported to languages other than English.

English Dataset features
The first aim of our evaluation was to quantify the variability of the datasets B1-B13 generated by BENGAL. To this end, we compared the distribution of the part of speech (POS) tags of the BENGAL datasets with those of the 16 benchmark datasets. An analysis of the Pearson correlation of these distributions revealed that the manually created datasets (D1-D16) have a high correlation (0.88 on average) with a minimum of 0.61 (D10-D16). The correlation of the POS tag distributions between BENGAL datasets and a manually created dataset varies between 0.34 (D7-B11) and 0.89 (D14-B9) with an average of 0.67. This shows that BENGAL datasets can be generated to be similar to manually created datasets (D14-B9) as well as to be very different from them (D7-B11). Hence, BENGAL can be used for testing sentence structures that are not common in the current manually generated benchmarks. We also studied the distribution of entities and tokens across the datasets in our evaluation. Table 2 gives an overview of these distributions, where E is the set of entities in the corpus C. The distribution of values for the different features is very diverse across the different manually created datasets. This is mainly due to (1) different ways to annotate entities and (2) the domains of the datasets (news, description of entities, microposts). As shown in Table 2, BENGAL can be easily configured to generate a wide variety of datasets with similar quality and number of documents to those of real datasets. This is mainly due to our approach being able to generate benchmarks ranging from (1) benchmarks with sentences containing a large number of entities without any filler terms (high entity density) to (2) benchmarks which contain more information pertaining to entity types and literals (low entity density).
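The POS-based comparison can be reproduced in spirit with the following Python sketch. It assumes that each dataset is reduced to a vector of relative tag frequencies over a shared, fixed tagset and that these vectors are compared with the Pearson correlation coefficient. The tagset, the tag sequences and the function names are made up for illustration, and the POS tagger itself is not shown.

```python
import math
from collections import Counter


def pos_distribution(tags, tagset):
    """Relative frequency of each tag of the tagset in a tag sequence."""
    counts = Counter(tags)
    total = sum(counts[t] for t in tagset)
    return [counts[t] / total if total else 0.0 for t in tagset]


def pearson(x, y):
    """Pearson correlation of two equally long vectors (not constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


TAGSET = ["NOUN", "VERB", "ADJ"]                   # hypothetical tagset
manual = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]   # hypothetical tag sequences
generated = ["NOUN", "NOUN", "VERB", "NOUN"]

r = pearson(pos_distribution(manual, TAGSET),
            pos_distribution(generated, TAGSET))
print(round(r, 2))
```

Using relative rather than absolute frequencies makes datasets of very different sizes, such as B11 with 10,000 documents and B12 with 10 documents, directly comparable.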

Annotator performance
We used GERBIL to evaluate the performance of 11 annotators on the manually created as well as the BENGAL datasets. We evaluated the annotators within an A2KB (annotation to knowledge base) experiment setting: Each document of the corpora was sent to each annotator. The annotator had to find and link all entities to a reference KB (here DBpedia). We measured both the performance of the NER and the EL steps. Table 3 shows the micro F1-score of the different annotators on chosen datasets. The manually created datasets showed diverse results. We analyzed the results further by using the F1-scores of the annotators as features of the datasets. Based on these feature vectors, we calculated the Pearson correlations between the datasets to identify datasets with similar characteristics. The Pearson correlations of the F-measures achieved by the different annotators on the AIDA/CoNLL datasets (D2-D5) are very high (0.95-1.00) while the correlation between the results on the Spotlight corpus (D7) and N3-Reuters-128 (D13) is around -0.62. The results on D1 and D12-D15 have a correlation to the AIDA/CoNLL results (D2-D5) that is higher than 0.5. In contrast, the correlations of D7 and D8 to the AIDA/CoNLL datasets range from -0.54 to -0.36. These correlations highlight the diversity of the manually created datasets and suggest that creating an approach which emulates all datasets is non-trivial.
Like the correlations between the manually created datasets, the correlations between the results achieved on BENGAL datasets and hand-crafted datasets vary. The results on BENGAL correlate most with the results on the OKE 2015 data. The highest correlations were achieved with the OKE 2015 Task 1 dataset and range between 0.89 and 0.92. This suggests that our benchmark can emulate entity-centric benchmarks. The correlation of BENGAL with OKE is however reduced to 0.82 in D13, suggesting that BENGAL can be parametrized so as to diverge from such benchmarks. A similar observation can be made for the correlation between D12 and ACE2004, where the correlation increased with the size of the documents in the benchmark. The correlation between the results across BENGAL datasets varies between 0.54 and 1, which further supports that BENGAL can generate a wide range of diverse datasets.
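The micro F1-score used in this evaluation can be sketched as follows. The Python sketch pools true positives, false positives and false negatives over all documents before computing precision and recall. It assumes that an annotation counts as correct only if both its offsets and its linked resource match the gold standard exactly; GERBIL supports several matching modes, so this is a simplification, and the data below are made up.

```python
def micro_f1(documents):
    """Micro-averaged F1 over (gold, predicted) annotation sets.

    Each annotation is a (start, end, resource) tuple; matching is exact.
    """
    tp = fp = fn = 0
    for gold, predicted in documents:
        tp += len(gold & predicted)   # found and linked correctly
        fp += len(predicted - gold)   # spurious or wrongly linked
        fn += len(gold - predicted)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical documents: one missed entity, one spurious entity.
documents = [
    ({(0, 14, ":AlbertEinstein"), (28, 30, ":Ulm")},
     {(0, 14, ":AlbertEinstein")}),
    ({(0, 3, ":X")},
     {(0, 3, ":X"), (5, 8, ":Y")}),
]
score = micro_f1(documents)
print(round(score, 3))
```

One such score per annotator yields the feature vector of a dataset; two datasets can then be compared by the Pearson correlation of their vectors, as described above.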

Annotator Performance on Spanish and Brazilian Portuguese
We implemented BENGAL for Brazilian Portuguese by using the RDF verbalizer presented in  and ran four multilingual NER and EL frameworks (MAG (Moussallem et al., 2017), DBpedia Spotlight, Babelfy, and PBOH (Ganea et al., 2016)) thereon. We also evaluated the performance of these annotators on subsets of the HAREM datasets (Freitas et al., 2010). We then extended this verbalizer to Spanish using the adaptation of SimpleNLG to Spanish (Soto et al., 2017). We generated Spanish BENGAL datasets and evaluated the aforementioned NER and EL systems on them. We also included VoxEL (Rosales-Méndez et al., 2018), a recent gold standard for Spanish. While the extension of BENGAL to Portuguese is an important result in itself, our results also provide additional insights into the NER and EL performance of existing solutions. Our results suggest that existing solutions are mostly biased towards a high precision but often achieve a lower recall on Portuguese. For example, both Spotlight's and Babelfy's recall remain below 0.6 in most cases while their precision goes up to 0.9. This clearly results from the lack of training data for these resource-poor languages. In contrast, the Spanish annotators presented low but consistent results, which confirms the lack of training data of these approaches on Spanish. All Spanish results are available at http://faturl.com/bengales.

Discussion and Conclusion
We presented and evaluated BENGAL, an approach for the automatic generation of NER and EL benchmarks. Our results suggest that our approach can generate diverse benchmarks with characteristics similar to those of a large proportion of existing benchmarks in several languages.
Overall, our results suggest that BENGAL benchmarks can ease the development of NER and EL tools (especially for resource-poor languages) by providing developers with insights into their performance at virtually no cost. Hence, BENGAL can improve the push towards better NER and EL frameworks. In future work, we plan to extend the ability of BENGAL to generate longer and more complex sentences as well as the capability of generating different surface forms for a given entity by relying on referring expression models such as the NeuralREG model. We also intend to provide thorough evaluations of annotators across other resource-poor languages and create corresponding datasets to push the development of tools to process these languages.