SemEval-2019 Task 12: Toponym Resolution in Scientific Papers

We present SemEval-2019 Task 12, which focuses on toponym resolution in scientific articles. Given an article from PubMed, the task consists of detecting mentions of names of places, or toponyms, and mapping the mentions to their corresponding entries in GeoNames.org, a database of geospatial locations. We proposed three subtasks. In Subtask 1, we asked participants to detect all toponyms in an article. In Subtask 2, given toponym mentions as input, we asked participants to disambiguate them by linking them to entries in GeoNames. In Subtask 3, we asked participants to perform both the detection and the disambiguation steps for all toponyms. A total of 29 teams registered, and 8 teams submitted a system run. We summarize the corpus and the tools created for the challenge; they are freely available at https://competitions.codalab.org/competitions/19948. We also analyze the methods, the results, and the errors made by the competing systems, with a focus on toponym disambiguation.


Introduction
Toponym resolution, also known as geoparsing, geo-grounding or place name resolution, aims to assign geographic coordinates to all location names mentioned in documents. Toponym resolution is usually performed in two independent steps. First, toponym detection, or geotagging, where the spans of place names mentioned in a document are identified. Second, toponym disambiguation, or geocoding, where each name found is mapped to latitude and longitude coordinates corresponding to the centroid of its physical location. Toponym detection has been extensively studied in named entity recognition: location names were one of the first classes of named entities to be detected in text (Piskorski and Yangarber, 2013).
Disambiguation of toponyms is a more recent task (Leidner, 2007).
With the growth of the internet, the public adoption of smartphones equipped with Geographic Information Systems, and the collaborative development of comprehensive maps and geographical databases, toponym resolution has seen an important gain in interest over the last two decades. Not only academic but also commercial and open-source toponym resolvers are now available. However, their performance varies greatly when applied to corpora of different genres and domains (Gritta et al., 2018). Toponym disambiguation tackles ambiguities between different toponyms, like Manchester, NH, USA vs. Manchester, UK (Geo-Geo ambiguities), and between toponyms and other entities, such as names of people or everyday objects (Geo-NonGeo ambiguities). Additional linguistic challenges during the resolution step include the metonymic usage of toponyms, "91% of the US didn't vote for either Hillary or Trump" (a country does not vote, so the toponym refers to the people living in the country), elliptical constructions, "Lakeview and Harrison streets" (the phrase refers to two street names, Lakeview Street and Harrison Street), or cases where the context simply does not provide enough evidence for the resolution.
Although significant progress has been made in the last decade on toponym resolution, it is still difficult to determine precisely the current state-of-the-art performance (Leidner and Lieberman, 2011). As emphasized by several authors (Tobin et al., 2010; Speriosu, 2013; Weissenbacher et al., 2015; Gritta et al., 2018; Karimzadeh and MacEachren, 2019), the main obstacle is that few corpora of large size exist or are freely available. Consequently, researchers create their own (limited) corpora to evaluate their systems, with the known drawbacks and biases that this implies. Moreover, one corpus is not sufficient to evaluate a toponym resolver thoroughly, as the domain of a corpus strongly impacts the performance of a resolver. A disambiguation strategy can be optimal in one domain and damaging in another. Speriosu (2013) illustrates that toponyms occurring in historical literature tend to resolve within a local vicinity, whereas toponyms occurring in international press news refer to the most prominent places by default; otherwise, additional information is provided to help the resolution (e.g., Paris, the city in Texas).
In this article we first define the concept of toponym and detail the subtasks of this challenge (Section 3). Then, we summarize how we acquired and annotated our data (Section 4). In Section 5, after describing the evaluation metrics, we briefly describe the resources and the baseline system provided to the participants. Finally, in Section 6, we discuss the results obtained and potential future directions for the task of toponym resolution.

Related Work
The entity linking task aims to map the name of an entity to the ID of the corresponding entity in a predefined knowledge base (Bada, 2014). Entity linking has been studied extensively by the community (Shen et al., 2015). Toponym resolution is a special case of entity linking where strategies dedicated to toponyms can improve overall performance. Three main strategies have been proposed in the literature. The first exploits the linguistic context in which a toponym is mentioned in a document. The vicinity of the toponym often contains clues that help readers interpret it. These clues can be other toponyms (Tobin et al., 2010), other named entities (Roberts et al., 2010), or, more generally, specific topics associated more often with a particular toponym than with others (Speriosu, 2013; Adams and McKenzie, 2013; Ju et al., 2016). The second strategy relies on the physical properties of the toponyms to disambiguate their mentions in documents. The population heuristic and the minimum distance heuristic are popular heuristics using such properties. The population heuristic disambiguates a toponym by taking, among the ambiguous candidates, the candidate with the largest population, whereas the minimum distance heuristic disambiguates all toponyms in a document by taking the set of candidates that are closest to each other (Leidner, 2007). A recent heuristic computes from Wikipedia a network expressing important toponyms and their semantic relations with other entities; the network is then used to disambiguate all toponyms in a document jointly (Hoffart and Weikum, 2013; Spitz et al., 2016). The last strategy is less frequently used, as it depends on metadata describing the documents in which toponyms are mentioned. These metadata are of various kinds, but they all indicate, directly or not, geographic areas that help interpret the toponyms mentioned in the documents. Such metadata can be the geotags of social media posts (Zhang and Gelernter, 2014) or external databases structuring the information detailed in a document (Weissenbacher et al., 2015). These three strategies are complementary and can be unified with machine learning algorithms, as shown by Santos et al. (2015) and Kamalloo and Rafiei (2018).
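As an illustration, the two physical-property heuristics can be sketched as follows. The candidate lists, IDs, coordinates, and populations below are toy data, not actual GeoNames entries; real systems would query the full gazetteer.

```python
import math
from itertools import product

# Toy GeoNames-style candidate lists: (id, lat, lon, population).
# All IDs and figures are illustrative, not real GeoNames entries.
CANDIDATES = {
    "Manchester": [(1, 53.48, -2.24, 395515),    # Manchester, UK
                   (2, 42.99, -71.45, 110506)],  # Manchester, NH, USA
    "Salford":    [(3, 53.49, -2.29, 103886),    # Salford, UK
                   (4, 40.59, -111.44, 120)],    # a distant namesake
}

def population_heuristic(candidates):
    """Pick the candidate with the largest population."""
    return max(candidates, key=lambda c: c[3])

def distance(a, b):
    """Approximate great-circle distance in km (spherical law of cosines)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[1], a[2], b[1], b[2]))
    return 6371 * math.acos(
        min(1.0, math.sin(lat1) * math.sin(lat2)
            + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)))

def minimum_distance_heuristic(mentions):
    """Jointly pick one candidate per toponym minimizing total pairwise distance."""
    best, best_cost = None, float("inf")
    for combo in product(*(CANDIDATES[m] for m in mentions)):
        cost = sum(distance(a, b)
                   for i, a in enumerate(combo) for b in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return best

# Population alone picks Manchester, UK; the minimum-distance heuristic
# resolves "Manchester" and "Salford" together to the nearby UK pair.
```

Note that the minimum distance heuristic, as written, enumerates every combination of candidates and is therefore exponential in the number of mentions; practical systems approximate this joint search.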

Task Description
The definition of toponym is still debated among researchers. In its simplest definition, a toponym is the proper name of an existing populated place on Earth. This definition can be extended to include any place or geographical entity that is named and can be designated by a geographical coordinate 1 . This encompasses cities and countries, but also lakes or monuments. In this challenge we consider the extended definition of toponyms and exclude all indirect mentions of places such as "30 km north from Boston", as well as metonymic usages and elliptical constructions of toponyms.
Subtask 1: Toponym Detection Toponym detection consists of detecting the text boundaries of all toponym mentions in full PubMed articles. For example, given the sentence An H1N1 virus was isolated in 2009 from a child hospitalized in Nanjing, China., a perfect detector, regardless of its method, would return two pairs encoding the starting and ending positions of Nanjing and China, i.e. (64, 70) and (73, 77). Despite major progress, toponym detection is still an open problem, and we evaluated it in a separate subtask since it determines the overall performance of the resolution. Toponym mentions missed during detection cannot be disambiguated (False Negatives, FN) and, conversely, phrases wrongly detected as toponyms will receive geocoordinates during disambiguation (False Positives, FP). Both FNs and FPs degrade the quality of the overall resolution.
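The offsets in this example use inclusive end positions and can be reproduced with simple string arithmetic (a minimal sketch; `span` is a hypothetical helper, not part of the task toolkit):

```python
sentence = ("An H1N1 virus was isolated in 2009 "
            "from a child hospitalized in Nanjing, China.")

def span(text, mention):
    """Return (start, end) character offsets with an inclusive end position."""
    start = text.index(mention)
    return (start, start + len(mention) - 1)

print(span(sentence, "Nanjing"))  # (64, 70)
print(span(sentence, "China"))    # (73, 77)
```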
Subtask 2: Toponym Disambiguation The second subtask focuses on the disambiguation of the toponyms only. In this subtask, all names of locations in an article are known to the disambiguator, but not their precise coordinates. The disambiguator has to select the GeoNames IDs corresponding to the expected places among all possible candidates. GeoNames 2 (https://www.geonames.org/) is a freely available, crowdsourced database of geospatial locations. Continuing our previous example, given the position of Nanjing in the sentence, a perfect disambiguator, regardless of its method, would have to choose among the 12 populated places named Nanjing located in China in GeoNames and return the GeoNames entry 7843770. The disambiguator has to infer the expected place based on all information available in the article, not only the sentence. This subtask allows one to measure the performance of disambiguation algorithms independently of the performance of the toponym detector used upstream.

Subtask 3: End-to-end, Toponym Resolution
The last subtask evaluates the toponym resolver as it would be used when deployed in real-world applications. Only the full PubMed articles are given to the resolver, and all toponyms detected and disambiguated by the resolver are evaluated.

A Case Study: Epidemiology of Viruses
The automatic resolution of the names of places mentioned in textual documents has multiple applications and, therefore, has been the focus of research for both industrial and academic organizations. For this challenge, we chose a scientific domain where the resolution of the names of places is key: epidemiology.
One aim in epidemiology might be to create maps of the locations of viruses and their migration paths, a tool used to monitor and intervene during disease epidemics. To create maps of viruses, researchers often use the geospatial metadata of individual sequence records in public databases such as NIH's GenBank (Benson et al., 2017) 3 . The metadata provides the location of the infected host. With more than 3 million virus sequences 4 , GenBank provides abundant information on viruses. However, previous work has suggested that geospatial metadata, when it is not simply missing, can be too imprecise for local-scale epidemiology (Scotch et al., 2011). In their article, Scotch et al. (2011) estimate that only 20% of GenBank records of zoonotic viruses contain detailed geospatial metadata such as a county or town name (zoonotic viruses are viruses able to naturally infect hosts of different species, like rabies). Most GenBank records provide generic information, such as Japan or Australia, without mentioning specific places within these countries. However, more specific information about the locations of the viruses may be present in the articles describing the research work. To create a complete map, researchers are then forced to read these articles to locate in the text these additional pieces of geospatial metadata for a set of viruses of interest. This manual process can be highly time-consuming and labor-intensive.
This challenge was an opportunity to assess the development and evaluation of automated approaches to retrieve geospatial metadata with a finer level of granularity from full-text journal articles, approaches that can be further transferred or adapted to resolve names of places in other scientific domains.

Corpus Collection
Our corpus is composed of 150 full-text journal articles downloaded from the open-access subset of PubMed Central (PMC) 5 . All articles in this subset of PMC are covered by a Creative Commons license and are free to access. We built our corpus using three queries on GenBank.
Subset A: For the first 60 articles, we downloaded 102,949 GenBank records that were linked to NCBI taxonomy id 197911 for influenza A. The downloaded records were associated with 1,424 distinct PubMed articles and 598 of them had links to an open access journal article in PubMed Central (PMC). We randomly sampled 60 articles from this set of 598 articles for manual annotation.
Subset B: We selected 60 additional articles by expanding our search to GenBank records linked to influenza B and C, rabies, hantavirus, western equine encephalitis, eastern equine encephalitis, St. Louis encephalitis, and West Nile virus. Our query returned a total of 544,422 GenBank records. We randomly selected a subset of records associated with 1,915 unique open access PMC articles. From these 1,915 articles, we randomly selected for toponym annotation a stratified sample of 60 articles, where strata were based on the number of GenBank records associated with the articles.
Subset C: We completed our corpus with 30 biomedical research articles to decrease bias and increase the generalizability of our corpus beyond toponym mentions in virus-related research articles. From the 1,341 research articles returned by a PMC search over journal titles with the Open access article attribute, we randomly selected 30 articles from top epidemiology journals, as determined by their impact factor in September 2018.
Since the 60 articles from Subset A had been used in our prior publications (Weissenbacher et al., 2015, 2017), we kept them all for training. We randomly selected half of the articles from Subset B and Subset C for training and left the other half for testing. The resulting corpus of 105 articles for training and 45 for testing was used for all three subtasks of the competition. The corpus is available for download on the Codalab page used for the competition: https://competitions.codalab.org/competitions/20171#learn_the_details-data_resources.

Annotation Process
To perform the annotation, we manually downloaded the PDF versions of the PMC articles and converted them to text files using the freely available tool Pdf-to-text 6 . We formatted the output to be compatible with the BRAT annotator 7 (Stenetorp et al., 2012). We manually detected and disambiguated the toponyms using GeoNames. We annotated toponyms in the titles, bodies, tables, and captions of the documents. We manually removed content unlikely to contain virology-related toponyms, such as author names, acknowledgments, and references. In cases where a toponym could not be found in GeoNames, we set its coordinates to a special value, N/A. Prior to beginning annotation, we developed a set of annotation guidelines after discussion among three annotators. The resulting guidelines are also available on the Codalab page of the competition. Two annotators were undergraduate students in biomedical informatics and biology, respectively, and our senior annotator has an M.S. in biomedical informatics.
Two annotators independently annotated 58 articles of Subset B to estimate the inter-annotator agreement. Since the detection task is a named-entity recognition task, we followed the recommendations of Rothschild and Hripcsak (2005) and used precision and recall to estimate the inter-annotator rate. The inter-annotator agreement on toponym detection was 0.94 precision (P) and 0.95 recall (R), which indicates good agreement between the annotators. The inter-annotator agreement on toponym disambiguation was 0.857 accuracy. Subset C was also annotated by two annotators, although not independently, to ensure the quality of the annotation of all documents occurring in the test set of the competition.
The corpus contains a total of 1,506 distinct toponyms for a total of 8,360 occurrences. 1,228 of these toponyms occur in only one document (a document may include multiple occurrences). The average number of occurrences per toponym is 5.5, with China being the most mentioned toponym, with a total of 417 occurrences. The average ambiguity is about 26.3 candidates per toponym, which is comparable to the average ambiguity found in existing corpora (Speriosu, 2013). The location San Antonio was the most ambiguous, with 2,633 possible candidates. 232 toponyms (531 occurrences) were not found in GeoNames using a strict match; this was caused by multiple factors, such as misspellings, non-standard abbreviations, and missing entries in GeoNames. 142 countries and continents are mentioned in our corpus, with a total of 3,105 occurrences. The resolution of country and continent names is easier than that of other places, but they represent only 37% of the total occurrences, making our corpus challenging.

Toponym Resolution Metrics
When a gold standard corpus and a toponym resolver are aligned on the same geographical database, here GeoNames, the standard metrics of precision, recall, and F-measure can be used to measure the performance of the resolver. For this challenge, we report all results using two common variations of these metrics: strict and overlapping measures. In the strict measure, a resolver annotation matches a gold standard annotation only if they cover exactly the same span of text, whereas in the overlapping measure, two annotations match when they share a common span of text.
We computed the P and R for toponym detection with the standard equations: Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where TP (True Positive) is the number of toponyms correctly identified by a toponym detector in the corpus, FP (False Positive) is the number of phrases incorrectly identified as toponyms by the detector, and FN (False Negative) is the number of toponyms not identified by the detector.
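The strict and overlapping variants of these detection metrics can be sketched as follows (toy spans with inclusive end offsets; an illustration, not the official scorer):

```python
def overlaps(a, b):
    """True when two (start, end) spans share a character (inclusive ends)."""
    return a[0] <= b[1] and b[0] <= a[1]

def detection_scores(predicted, gold, strict=True):
    """Precision/recall over span lists; strict requires identical spans."""
    match = (lambda p, g: p == g) if strict else overlaps
    tp = sum(1 for p in predicted if any(match(p, g) for g in gold))
    fp = len(predicted) - tp                       # spurious predictions
    fn = sum(1 for g in gold if not any(match(p, g) for p in predicted))
    precision = tp / (tp + fp) if predicted else 0.0
    recall = (len(gold) - fn) / len(gold) if gold else 0.0
    return precision, recall

gold = [(64, 70), (73, 77)]
predicted = [(64, 70), (73, 79)]                   # second span runs past "China"
print(detection_scores(predicted, gold, strict=True))   # (0.5, 0.5)
print(detection_scores(predicted, gold, strict=False))  # (1.0, 1.0)
```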
To evaluate toponym disambiguation, we modified the equations computing the P and R used for toponym detection in order to account for both detection and disambiguation errors. The precision of the toponym disambiguation is given by the equation Pds = TCD/(TCD + TID), where TCD is the number of toponyms correctly identified and disambiguated by the toponym disambiguator in the corpus and TID is the number of toponyms incorrectly identified or incorrectly disambiguated. The recall of the toponym disambiguation is computed by the equation Rds = TCD/TN, where TN is the total number of toponyms in the corpus. F1ds is the harmonic mean of Pds and Rds. Since the competing resolvers and the gold corpus annotations were aligned on GeoNames, correctly disambiguated toponyms were identified by a simple match between the place IDs retrieved by the resolvers and those chosen by the annotators.
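A minimal sketch of Pds, Rds, and F1ds, assuming gold and predicted annotations are represented as mappings from text spans to GeoNames IDs. The Nanjing ID 7843770 comes from the example above; the other IDs are illustrative, and this is not the official scorer.

```python
def disambiguation_scores(predictions, gold):
    """Compute Pds, Rds, and F1ds.

    predictions and gold map text spans to GeoNames IDs. TCD counts toponyms
    both correctly detected and correctly linked; TID counts predictions that
    are wrongly detected or wrongly linked; TN is the number of gold toponyms.
    """
    tcd = sum(1 for span, gid in predictions.items() if gold.get(span) == gid)
    tid = len(predictions) - tcd
    p_ds = tcd / (tcd + tid) if predictions else 0.0
    r_ds = tcd / len(gold) if gold else 0.0
    f1_ds = 2 * p_ds * r_ds / (p_ds + r_ds) if p_ds + r_ds else 0.0
    return p_ds, r_ds, f1_ds

gold = {(64, 70): 7843770, (73, 77): 1111}       # 1111 is an illustrative ID
predicted = {(64, 70): 7843770, (73, 77): 2222}  # second toponym mislinked
print(disambiguation_scores(predicted, gold))    # (0.5, 0.5, 0.5)
```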

Baseline System
We released an end-to-end system to be used as a strong baseline. This system performs the detection and the disambiguation of the toponyms in raw texts sequentially. To detect the toponyms, the system uses a feedforward neural network described in (Magge et al., 2018). The disambiguation of all detected toponyms is then performed using a common heuristic, the population heuristic: the system always disambiguates a toponym by choosing the place with the highest population in GeoNames. The baseline system can be downloaded from the Codalab website of the competition. We also made available to the participants a REST service to search a recent copy of GeoNames; the documentation and the code to deploy the service locally can be found on the Codalab website.
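The baseline's two-stage design can be sketched as follows. The detector here is stubbed with a dictionary lookup instead of the feedforward network, and the gazetteer entries, populations, and the China ID are illustrative; only the Nanjing ID 7843770 comes from the task description above.

```python
# Sketch of the baseline pipeline: detect toponym mentions, then link each one
# to the most populous candidate (population heuristic). The detector and the
# gazetteer are toy stand-ins for the neural detector and GeoNames.
GAZETTEER = {
    # name -> [(geonames_id, population), ...]; figures are illustrative
    "Nanjing": [(7843770, 7165292), (11111111, 1542)],
    "China":   [(1814991, 1330044000)],
}

def detect(text):
    """Stub detector: dictionary lookup in place of the feedforward network."""
    return [name for name in GAZETTEER if name in text]

def disambiguate(name):
    """Population heuristic: return the ID of the most populous candidate."""
    return max(GAZETTEER[name], key=lambda c: c[1])[0]

def resolve(text):
    """End-to-end resolution: detection followed by disambiguation."""
    return {name: disambiguate(name) for name in detect(text)}

print(resolve("An H1N1 virus was isolated from a child hospitalized "
              "in Nanjing, China."))
# {'Nanjing': 7843770, 'China': 1814991}
```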

Results
Twenty-nine teams registered to participate in the shared task and eight teams submitted runs. A total of 21, 8, and 13 submissions from 8, 4, and 6 teams were included in the final evaluations of Subtasks 1, 2, and 3, respectively. All systems which attempted to resolve the toponyms in Subtask 3 opted for a pipeline architecture where the detection and the disambiguation steps were performed independently and sequentially. Table 1 summarizes the characteristics of the systems along with their use of external resources. Tables 2, 3 and 4 present the performances of each team. Team DM NLP achieved the best performance on all subtasks (Wang et al., 2019).
Toponym Detection: Used by all systems but one, deep recurrent neural networks were the most common and the most effective technology for detecting toponyms in our corpus. Architectures varied in their integration of character embedding layers, attention mechanisms, external features (such as POS tags or other named entities), and the choice of a general or in-domain corpus for pre-training word and sentence embeddings. In our epidemiological corpus, toponyms were mentioned not only in the body of the articles but also in tables. Interestingly, the top-ranked systems detected toponyms with two different algorithms, one dedicated to the body and one to the tables of the articles. The top-ranking system significantly outperformed the other competitors in Subtask 1, with a margin of 4 points separating it from the second-ranked system, even though both used the same technology. Both teams used dedicated algorithms for bodies and tables, but Team DM NLP implemented several strategies to improve the pre-training of their system which, according to their ablation study (Wang et al., 2019), proved to be effective 9 . Note that the performance of the first system is close to our IAA for toponym detection.

Table 2: Results of the toponym detection task, Subtask 1.
Toponym Disambiguation: All systems relied on handcrafted features to disambiguate toponyms. Their features described the lexical context of the toponyms and their importance. The importance of the toponyms was estimated from the frequencies of the candidates in the training data or from their populations. While the two top-ranked systems combined such features with machine learning, an SVM for UniMelb and a gradient boosting algorithm for DM NLP, the others simply encoded them into hard rules, leading to suboptimal disambiguation.

Analysis
We analyzed a sample of errors to understand the remaining challenges for toponym disambiguation systems based on the results of Subtask 2. We randomly selected 10 articles and analyzed 103 mentions of toponyms disambiguated incorrectly by all systems. We manually identified five distinct categories of errors. In the largest category, with 62 cases, the systems missed context clues used by the authors of the articles to convey the correct interpretation of the toponym and chose the wrong candidates. Such clues include the mention of a country in the header of a table or the explicit mention of a district after an ambiguous toponym. 17 errors were due to systems not complying with the guidelines, selecting populated places or cities when the expected choices were toponyms with a higher administrative level. 8 candidates were not found in GeoNames by strict or fuzzy matching because of their surface forms: unconventional abbreviations, rare acronyms, or words split by a hyphen. Despite our efforts to limit annotation errors, 15 were found in our sample 10 . The last error was a toponym for which the choice made by the annotators can be disputed.

9 Team QWERTY did not describe their system at the time of writing. We were therefore unable to compare it with other systems.
10 Since we analyzed entire articles, this count includes multiple mentions of the same toponym repeatedly annotated with the same error.

Conclusion
In this paper we presented an overview of the results of SemEval-2019 Task 12, which focused on toponym resolution in scientific articles. Given an article from PubMed, the task consists of detecting all mentions of place names, or toponyms, in the article and mapping them to their corresponding entries in GeoNames, a database of geospatial locations. All systems resolved the toponyms in our corpus sequentially, detecting the toponyms before disambiguating them. Among the 21 systems submitted for toponym detection, neural network-based approaches were the most popular and the most effective, with scores approaching the inter-annotator agreement. One key to the top-ranked systems' success was designing two different algorithms to detect toponyms in the bodies and in the tables of the articles. The disambiguation of toponyms remains challenging. Despite the four competing systems' clever use of rules or machine learning to combine features describing the lexical context of the toponyms and their importance, the strict macro F1ds score of .82 achieved by the best system signals room for improvement. Our analysis of common disambiguation errors reveals that it is still difficult for systems to capture the linguistic evidence in the context of toponyms that dictates their disambiguation, a problem causing 60% of the systems' errors. The end-to-end performance of the best toponym resolver was .77 strict macro F1ds, a score high enough for scientists to benefit from automation to reduce their workload when extracting toponyms from the voluminous and quickly growing literature, while still leaving room for technical improvement.

Funding
Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under Award Number R01AI117011 to M.S. and G.G. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Table 4: Results of the toponym resolution task, Subtask 3.