University of Arizona at SemEval-2019 Task 12: Deep-Affix Named Entity Recognition of Geolocation Entities

We present the Named Entity Recognition (NER) and disambiguation model used by the University of Arizona team (UArizona) for SemEval 2019 Task 12. We achieved fourth place on tasks 1 and 3. We implemented a deep-affix based LSTM-CRF NER model for task 1, which utilizes only character, word, prefix and suffix information for the identification of geolocation entities. Despite using just the training data provided by the task organizers and not using any lexicon features, we achieved a 78.85% strict micro-F1 score on task 1. We used the unsupervised population heuristic for task 3 and achieved a 52.99% strict micro-F1 score on this task.


Introduction
Geoparsing is the task of detecting geolocation phrases in unstructured text and normalizing them to a unique identifier, e.g. GeoNames IDs. Although many automatic resolvers have been released in recent years, their performance fluctuates when applied to different domains (Gritta et al., 2018b). Most have also not been applied to and evaluated on scientific publications. The SemEval 2019 Shared Task 12: Toponym Resolution in Scientific Papers (Weissenbacher et al., 2019) aims to boost research on geoparsing for the scientific domain by focusing on epidemiology journal articles.
The task includes three sub-tasks: toponym detection, toponym disambiguation, and end-to-end toponym resolution. The first one requires participants to detect the text boundaries of all toponym mentions in articles. In toponym disambiguation, the toponym mentions are known, and the resolver has to align them to their precise coordinates through GeoNames IDs. The last sub-task combines both steps, requiring systems to detect and disambiguate toponyms end to end. In this paper, we present the description of our system for SemEval 2019 Shared Task 12, in which we focus mainly on toponym detection. For this sub-task, we propose a recurrent neural network that combines word, character and affix information. By making use of the baseline provided by the organizers for toponym disambiguation, we also obtain results for the end-to-end sub-task.
The disambiguation step has been tackled previously using both supervised models and unsupervised heuristic-based approaches. For example, Turton (2008) presented a rule-based system for disambiguating locations from PubMed abstracts. Weissenbacher et al. (2015) presented results from Population and Distance heuristics (discussed in Section 4.3) for the disambiguation task on PubMed articles. The authors also presented an SVM model with population, distance and a set of meta-data features as input, which achieved higher performance than either individual heuristic. Gritta et al. (2018a) used a feedforward neural network approach for the disambiguation of geolocations.

Data and Baseline
The corpus of the task is composed of 150 journal articles downloaded from PubMed Central. After removing the author names, acknowledgments and references, titles and body text were fully annotated. The annotators identified and labelled toponyms with their corresponding coordinates according to GeoNames. For cases not found in GeoNames, they used Google Maps and Wikipedia. If the coordinates of a toponym were not available in any of these resources, the special value N/A was used. The data is provided in Brat format (Stenetorp et al., 2012). The organizers also released a strong baseline that combines the model by Magge et al. (2018) for toponym detection and the Population heuristic described in (Weissenbacher et al., 2015) for disambiguation.

Approach

Preprocessing
We used the tokenizer included in the baseline provided by the organizers, as we observed it produced the best final results among the options we tried (see Section 5.3). Again using the baseline system's preprocessing code, we converted the data into CoNLL 2003 format (Tjong Kim Sang and De Meulder, 2003) for task 1. Following our prior work, we used a BIO encoding instead of the IO encoding provided by the baseline system.
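The IO-to-BIO re-encoding step can be sketched in a few lines; this helper is our own illustration, not the baseline's actual code:

```python
def io_to_bio(tags):
    """Convert IO-encoded tags (e.g. I-LOC / O) to BIO encoding.

    In IO, every token inside an entity is tagged I-X; BIO additionally
    marks the first token of each span as B-X. (Note that adjacent
    same-type entities are not separable in pure IO input.)
    """
    bio = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev != tag:
            bio.append("B-" + tag[2:])  # first token of a new span
        else:
            bio.append(tag)
        prev = tag
    return bio
```

For example, `["O", "I-LOC", "I-LOC", "O", "I-LOC"]` becomes `["O", "B-LOC", "I-LOC", "O", "B-LOC"]`.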

Toponym Detection
We used a previously proposed deep-affix model for Named Entity Recognition (NER), shown in Figure 1, which uses character, word and affix information. In this architecture, a word is represented by concatenating its word embedding, an LSTM representation over the characters of the word, and learned embeddings for the prefixes and suffixes of the word. Another LSTM is then run at the sentence level to give a contextual representation of each word. These representations of the words in the sentence are fed to a CRF layer, which predicts the final NER label sequence.
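To make the affix features concrete, here is a minimal sketch of extracting a fixed-length prefix and suffix per word before embedding lookup; the affix length of 3 and the padding scheme are our assumptions, not the paper's exact configuration:

```python
def affix_features(word, n=3, pad="#"):
    """Return the length-n prefix and suffix of a word.

    Short words are padded (with a symbol assumed here to be "#") so
    that every token yields fixed-length affixes, each of which is then
    mapped to a learned embedding and concatenated with the word and
    character-LSTM vectors.
    """
    right_padded = word + pad * max(0, n - len(word))
    left_padded = pad * max(0, n - len(word)) + word
    return right_padded[:n], left_padded[-n:]
```

For instance, `affix_features("Arizona")` yields `("Ari", "ona")`, while a short word like `"of"` yields the padded pair `("of#", "#of")`.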

Toponym Resolution
Weissenbacher et al. (2015) presented two heuristics for the disambiguation of geolocations: Population and Distance. These two heuristics are often used as features alongside other meta-data, such as the user location in a Twitter account (Zhang and Gelernter, 2014) or GenBank meta-data (Weissenbacher et al., 2015). With the Population heuristic, the system simply assigns the GeoNames ID of the most populous candidate for the current location. With the Distance heuristic, the system selects the candidate at the minimum distance from all candidates of all other toponyms in the same document. Many previous works (Weissenbacher et al., 2015; Zhang and Gelernter, 2014; Weissenbacher et al., 2019) have shown that the most populous location is referenced more often in text documents, and that the Population heuristic outperforms the Distance heuristic. Thus, we use the Population heuristic as our disambiguation model.
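The Population heuristic amounts to a one-line selection over the candidate list. A minimal sketch, assuming candidates are dicts carrying the `geonameid` and `population` fields that GeoNames records provide (the exact lookup format and the IDs below are hypothetical):

```python
def resolve_by_population(candidates):
    """Return the GeoNames ID of the most populous candidate.

    `candidates` is a list of dicts with at least "geonameid" and
    "population" keys, as returned by a GeoNames lookup for one toponym
    mention. Returns None when the lookup produced no candidates.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.get("population", 0))["geonameid"]
```

For a mention like "Phoenix", the heuristic picks the large city over any smaller namesakes, regardless of document context; this context-blindness is exactly what the Distance heuristic tries to address.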

Experiments
Using the original fully annotated training set, we achieved a 77.3% strict micro-F1 score (mean performance of 3 runs) on the validation set. However, the organizers provided two additional large but weakly annotated NER datasets: POS, which contains sentences having at least one location phrase, and NEG, which has sentences with no mention of location entities. We experimented with these datasets in both joint and transfer learning settings.

Joint Learning
In the joint learning experiment, we trained the model on a training set built by concatenating the POS data with the original training data. In this configuration, we achieved an 81.4% strict micro-F1 score (mean performance of 3 runs) on the validation set, a 4-point improvement over the original experiment.

Transfer Learning
In this experiment, we first trained our model on just the POS set and then fine-tuned it on the original training data provided for the task. The intuition here was to use the weakly annotated data only to get a good initialization for the "real" training on the manually annotated data, rather than training on both together and possibly being misled by the noise in the weakly annotated data. We achieved an 83.7% strict micro-F1 score (mean performance of 3 runs) on the validation set. This is an improvement of 2.3 F over the simple joint learning experiment, and 6.4 F over the model using only the original training data.

Effects of Tokenization
The effect of tokenization on NER performance has been shown in the past (Akkasi et al., 2016; Xu et al., 2018). For this reason, we evaluated our model, trained on the original training data, using various custom tokenization functions, and saw the strict micro-F1 score vary from 72% to 77% on the validation set.
The NLTK regexp tokenizer resulted in a 70% strict F1 score. We wrote several rules to improve this tokenizer, which raised performance by 4 points. However, the custom tokenization implemented by the shared task organizers in the baseline model performed the best, achieving 77% on the validation set when trained on just the original training data. In this case, we also wrote a few additional rules to improve the tokenization, but achieved only marginal improvements in overall performance.
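To illustrate the kind of rules involved, here is a hedged sketch of a regexp tokenizer that keeps hyphenated words and decimals intact while splitting off punctuation; it mimics, but is not, the organizers' actual tokenization code:

```python
import re

# One token pattern: a word optionally extended by hyphen/period-joined
# parts (so "south-east" and "2.5" stay whole), or any single
# non-space punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)
```

For example, `tokenize("Tucson, Arizona (USA).")` yields `["Tucson", ",", "Arizona", "(", "USA", ")", "."]`, keeping location names separate from the punctuation that would otherwise hide them from the NER model.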

Hyperparameters
We trained the model using the hyperparameters in Table 1. For transfer learning from the POS data, we first trained the model for 40 epochs. We then retrained this model on the original training data for 80 epochs with an early stopping patience of 20. After training on the original training data, we retrained the model on the combined train+development data for another 40 epochs. For the final evaluation, we submitted the models at epochs 25, 35 and 40. Epoch 35 achieved the best performance among the three submissions.

Results
We achieved the 4th position in both task 1 (toponym detection) and task 3 (end-to-end toponym resolution), as shown in Tables 2 and 3, respectively. Although it has been shown previously that adding lexicon features improves the overall performance of several NER models (Gritta et al., 2018b), we focused on extracting context information using LSTMs over the characters, words and affixes of the input. Our resource-independent NER model thus achieves competitive results despite not using any dictionary information. Moreover, we used only the training data provided by the task organizers and did not use any external training data or lexicon resources. For disambiguating toponyms, we used the unsupervised Population heuristic, which is fast and simple to implement. As shown by Weissenbacher et al. (2015), feeding features like population, distance and other meta-data to machine learning models often achieves higher performance. However, as shown here, the Population heuristic serves as a strong baseline for this disambiguation task.

Table 3: Results of subtask 3 - end-to-end toponym resolution. Our system is again ranked fourth.

Future Work
We plan to include the following features in our current model:

• Part of Speech (POS) features - per the annotation guidelines, locations used as adjectives were not labelled in the annotation process. We will explore the effect of adding a POS feature representation to the word, character and affix representations.
• Inclusion of a GeoNames dictionary - our current approach is resource independent. We will include dictionary features in the next version of our model, to understand how much signal can be inferred from local information, and how much must come from world knowledge.
• Using domain-specific embeddings - we relied on pretrained GloVe embeddings for our submissions. In future versions of our software, we will explore domain-specific embeddings, i.e., embeddings trained on scientific texts, as well as contextualized embeddings such as FLAIR (Akbik et al., 2018).

Acknowledgments
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under the World Modelers program, grant number W911NF1810014. Mihai Surdeanu declares a financial interest in lum.ai. This interest has been properly disclosed to the University of Arizona Institutional Review Committee and is managed in accordance with its conflict of interest policies.