UniMelb at SemEval-2019 Task 12: Multi-model combination for toponym resolution

This paper describes our submission to SemEval-2019 Task 12 on toponym resolution over scientific articles. We train separate NER models for toponym detection over text extracted from tables vs. text from the body of the paper, and train another auxiliary model to eliminate misdetected toponyms. For toponym disambiguation, we use an SVM classifier with hand-engineered features. The best setting achieved a strict micro-F1 score of 80.92% and overlap micro-F1 score of 86.88% in the toponym detection subtask, ranking 2nd out of 8 teams on F1 score. For toponym disambiguation and end-to-end resolution, we officially ranked 2nd and 3rd, respectively.


Introduction
Toponym resolution (TR) refers to the task of automatically assigning geographic references to place names in text, which has applications in question answering and information retrieval tasks (Leidner, 2008; Daoud and Huang, 2013; Vasardani et al., 2013), user geolocation prediction (Roller et al., 2012; Han et al., 2014; Rahimi et al., 2015), and historical research (Grover et al., 2010). This paper describes our system entry to the Toponym resolution in scientific papers task of SemEval 2019 (Weissenbacher et al., 2019). The task consists of three subtasks: toponym detection, toponym disambiguation, and end-to-end toponym resolution.
For the toponym detection task, we extract tables from the full text and train separate BiLSTM-ATTN models for each. For tables, the model captures the horizontal row-wise structure of the table. For non-table content, the model can capture syntactic and semantic features. In both cases, we use a deep contextualized word representation, ELMo (Peters et al., 2018), to represent each token. After detecting toponyms, we use an organization name detection model to eliminate misdetected toponyms that are actually part of an organization name. For the toponym disambiguation task, we first construct a candidate set by searching for toponyms on GeoNames. Then, we manually construct features based on the search results, and finally, train an SVM model to disambiguate the locations. For the end-to-end resolution task, we pipeline the two aforementioned steps.
Our work makes the following contributions:
• we show that training separate models for the table and non-table portions of a paper is better than simply training one model over the full text;
• we show that contextualized word representations boost performance; and
• we show that an auxiliary organization name recognition model is helpful for toponym detection, and better than training a single named entity recognizer (NER).
Toponym Detection

Figure 1 shows our workflow for the toponym detection task, which consists of 4 parts: (a) pre-processing, comprising tokenization, table extraction and sentence segmentation; (b) training and inference for the toponym detection model; (c) post-processing, to combine detected words into toponyms; and (d) refinement of the results by incorporating an auxiliary model.

Pre-processing
Tables are ubiquitous in scientific articles, and differ from text in the body of the paper in terms of syntactic structure. As such, training a single text embedding model over both the main body of text and tables will likely lead to suboptimal representations, leading us to train separate models for: (1) tables, and (2) the remainder of the text content of the paper. To extract tables from the plain text dump provided by the shared task organisers, we use a rule-based table detection method. We first tokenize the entire article, treating all punctuation as separators. In the process of table extraction, we process the raw text line-by-line rather than performing sentence tokenization. We treat numbers, OOV tokens (with respect to the GloVe vocabulary), and the characters | and - as table elements, and consider lines in which more than 70% of tokens are table elements to be table rows. Three or more consecutive table rows are considered to make up a table.
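The line-by-line heuristic above can be sketched as follows. This is a minimal illustration: the number pattern and the stand-in `GLOVE_VOCAB` set are our own assumptions, not the exact rules or vocabulary used in the system.

```python
import re

# Stand-in for the real GloVe vocabulary (assumption for illustration).
GLOVE_VOCAB = {"influenza", "was", "detected", "in", "province"}

def is_table_element(token):
    """A token counts as a table element if it is a number, a separator
    character, or out-of-vocabulary with respect to GloVe."""
    if token in ("|", "-"):
        return True
    if re.fullmatch(r"[\d.,%]+", token):
        return True
    return token.lower() not in GLOVE_VOCAB

def find_table_lines(lines, threshold=0.7, min_rows=3):
    """Return indices of lines detected as table rows: lines whose tokens
    are more than 70% table elements, in runs of 3 or more."""
    row_flags = []
    for line in lines:
        tokens = line.split()
        frac = sum(is_table_element(t) for t in tokens) / len(tokens) if tokens else 0.0
        row_flags.append(frac > threshold)
    table_idx, run = set(), []
    # Trailing False flushes a run that ends at the last line.
    for i, flag in enumerate(row_flags + [False]):
        if flag:
            run.append(i)
        else:
            if len(run) >= min_rows:
                table_idx.update(run)
            run = []
    return table_idx
```

For example, a prose line followed by three pipe-delimited numeric rows yields exactly those three rows as a table, while two isolated table-like rows are ignored.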
In this way, we extract tables from the plain text dump of the articles. Note that the original PDF versions of papers were not made available by the task organizers, meaning that it wasn't possible to use vision-based methods to identify tables.
For the remainder of the text dump not detected as tables, we perform tokenization, remove hyphens caused by line breaks, and then perform sentence segmentation using spaCy (https://spacy.io). Sentences that are shorter than 5 tokens in length are concatenated with the preceding and following sentences to make up a single sentence. By expanding short sentences, richer context can be exploited by both ELMo and the RNN-based model.
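The short-sentence expansion step could be sketched as below. This is a hypothetical simplification of the procedure: here each short sentence is simply folded into its predecessor, whereas the system combines context on both sides.

```python
def merge_short_sentences(sentences, min_len=5):
    """Concatenate sentences shorter than min_len tokens with a neighbour,
    so that ELMo and the RNN model see richer context. Illustrative
    sketch: short sentences are folded into the preceding sentence."""
    merged = []
    for sent in sentences:
        # Merge if either the current or the previous sentence is short.
        if merged and (len(sent.split()) < min_len or len(merged[-1].split()) < min_len):
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
    return merged
```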

Contextual Representation
We use ELMo (Peters et al., 2018) word representations in this paper, which are learned from the internal states of a deep bidirectional language model (biLM), pre-trained on a large text corpus. ELMo representations are purely character-based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training. They are also robust to syntactic disfluencies caused by fine-grained word segmentation. For the purposes of empirical comparison, we also report on experiments using GloVe (Pennington et al., 2014) embeddings.

Models
For toponym extraction in the table part, we experimented with two kinds of models. The first is a token-level model, as described in Magge et al. (2018). In this model, each training instance consists of an input word, the word's context, and a label indicating whether the word is part of a toponym. The context of the word is formed by the words in its neighbourhood, i.e. a window of words centred on the given word. We experimented with two- and three-layer feed-forward models.
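The construction of training instances for this token-level model can be sketched as follows; the window size and padding symbol are illustrative assumptions rather than the values used in Magge et al. (2018).

```python
def window_instances(tokens, labels, window=2, pad="<PAD>"):
    """Build one classification instance per token: the token itself,
    its symmetric context window, and a binary toponym label."""
    padded = [pad] * window + tokens + [pad] * window
    instances = []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        # Window of 2*window+1 tokens centred on position i.
        context = padded[i:i + 2 * window + 1]
        instances.append((tok, context, lab))
    return instances
```

Each `(token, context, label)` triple is then fed to the feed-forward classifier.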
The second model is built with an RNN and self-attention (Vaswani et al., 2017). Although an RNN is able to make predictions over long sequences, the documents in this task are too long for an RNN, and at the same time, the size of the training data is not sufficient to train an RNN over full documents. As such, we split each document into sentences and make predictions on separate sentences (hidden states are not passed between sentences). We use a two-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to capture the sequential information of the body and table contents, and use self-attention to enhance the connections between tokens in a line. We consider the paper body content to have semantic information which can be captured by a sequential model like a BiLSTM. However, for tables, it is not clear that sequential information across cells in a table row should be processed as a sequence. Therefore, we use self-attention to learn the table structure over an entire line. We consider each sentence as a matrix which we denote as L, where L ∈ R^(t×h); t represents the number of tokens in the line, and h is the dimensionality of the embedding representation. To improve training efficiency, we pack l lines into a single batch, thereby making L a three-dimensional tensor L ∈ R^(l×t×h). We pad short lines to the length of the longest line in a batch, and set the embedding of each padding word to a zero vector with dimensionality h.
We first encode each line with a two-layer bidirectional LSTM, denoted as:

L' = BiLSTM(L)    (1)

Then, we feed L' into the attention model to encode structural information. The attention model can be denoted as follows:

L'' = Attn(f(L'; θ_Q), f(L'; θ_K), f(L'; θ_V))    (2)

This style of attention is named scaled dot-product attention by Vaswani et al. (2017), where Q, K, V ∈ R^(t×h) represent the query, key, and value, respectively, and can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. In this model, we use tokens in the same line to represent the query, key, and value, and use the attention function Attn to find self-correlations among them. Meanwhile, in Eqn 2 we define f to be a one-layer feed-forward network with different parameter sets θ_Q, θ_K, θ_V, which we denote as f(X; θ). This allows us to learn the correlations with these three parameter sets. We use the following attention function:

Attn(Q, K, V) = softmax(QK^T / √h) V    (3)

Finally, we pass L'' into a 3-layer feed-forward network denoted as g, using layer normalization in each layer to increase the training speed. The output of the feed-forward block is passed into the output layer with a residual connection with L'', denoted as:

ŷ = softmax(g(L'') + L'')    (4)

The architecture of the model is shown in Figure 2.
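The scaled dot-product self-attention over one encoded line can be illustrated with a minimal NumPy sketch. Here the projection f is reduced to a single linear map per parameter set (no bias), which is a simplifying assumption.

```python
import numpy as np

def f(X, W):
    """One-layer linear projection f(X; theta), used for Q, K, and V."""
    return X @ W

def scaled_dot_product_attention(L_enc, W_q, W_k, W_v):
    """Self-attention over the t tokens of one BiLSTM-encoded line:
    Attn(Q, K, V) = softmax(Q K^T / sqrt(h)) V  (Vaswani et al., 2017),
    with Q, K, V all derived from the same line."""
    h = L_enc.shape[-1]
    Q, K, V = f(L_enc, W_q), f(L_enc, W_k), f(L_enc, W_v)
    scores = Q @ K.T / np.sqrt(h)                       # (t, t) correlations
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (t, h) output
```

Each output row is a weighted mixture of all token values in the line, which is how the model captures row-wise table structure without assuming a left-to-right order.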

Post-processing
Since we are training a sequence labelling model, result segmentation and combination is necessary. For instance, the sentence AIV H9N2 was spread to New York, Washington DC and Ottawa contains three toponyms, spanning 5 tokens (e.g. the words New and York are combined into one toponym). An external gazetteer downloaded from GeoNames, and an in-house place name abbreviation library, were used.
We first expand all abbreviations in order to facilitate matching against the gazetteer. We then combine all consecutive tokens that were labelled as a toponym. After this, two different segmentation methods were compared: (1) longest string match against the gazetteer; and (2) no segmentation. Our results show that the second method is better, due to the limitations of string matching. We believe that a better toponym matching method, such as querying GeoNames rather than string matching, could achieve better results.
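The "no segmentation" strategy, which simply merges runs of toponym-labelled tokens, can be sketched as:

```python
def combine_toponyms(tokens, labels):
    """Merge consecutive tokens labelled 1 (toponym) into multi-word
    toponym spans; this is the 'no segmentation' post-processing."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that ends at the last token
        spans.append(" ".join(current))
    return spans
```

On the example sentence above, the 5 labelled tokens collapse into the three toponyms New York, Washington DC, and Ottawa.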

Auxiliary Model
A single NER model picks up on features, such as the word-initial character being uppercase, that are also common in non-toponym named entities, possibly resulting in toponym false positives (FPs).

For example, in the phrase The Royal Melbourne Hospital, the word Melbourne should not be detected as a toponym according to the competition setting. This issue was also identified by Dredze et al. (2009).
In this paper, we use two methods to tackle this. The first is to train a single NER to detect toponyms and organization names together. The second is to train an organization name recognizer to correct misdetected toponyms in organization names.
We used the WikiNER (Nothman et al., 2012) dataset to train an organization detection model, and applied it to our dataset. We then built an organization type set containing Institute, School, Hospital, etc. Finally, we re-label toponyms that are part of a corresponding organization name as non-toponyms.
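The re-labelling step can be sketched as below. Spans are represented as (start, end, text) token offsets, and the keywords beyond the paper's examples (Institute, School, Hospital) are our own illustrative additions.

```python
ORG_TYPES = {"Institute", "School", "Hospital", "University", "Center"}

def filter_org_toponyms(toponym_spans, org_spans):
    """Drop detected toponyms that fall inside an organization name
    containing an organization-type keyword. Spans are
    (start, end, text) tuples over token offsets."""
    kept = []
    for t_start, t_end, text in toponym_spans:
        inside_org = any(
            o_start <= t_start and t_end <= o_end
            and ORG_TYPES & set(o_text.split())
            for o_start, o_end, o_text in org_spans
        )
        if not inside_org:
            kept.append((t_start, t_end, text))
    return kept
```

So a detected Melbourne inside The Royal Melbourne Hospital is removed, while a free-standing Ottawa survives.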

Toponym Disambiguation
We used a support vector machine (SVM) to disambiguate toponyms. For each detected toponym, we first search for it on GeoNames, and keep the top 20 records as candidate results. Features are constructed from these, as follows:
• History Result: If the toponym appears in the training set, the history result is the ranking of the number of times each GeoNames ID appears as the gold answer. For instance, the toponym Melbourne appears 13 times in training, of which 12 occurrences have GeoNames ID 2158177 and 1 has ID 7839805, so the history result feature for 2158177 is 1, for 7839805 is 2, and for all other GeoNames IDs is 3.
• Population: The ranking of the population of the candidate.
• Name Similarity: The ranking of the string similarity between the toponym and the Name field of each record.
• AncestorsNames Correlation: The ranking of the number of matching words in the AncestorsNames field of each record.
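The first of these features, History Result, can be sketched as follows; the tie-breaking for equal counts is an assumption of this sketch, as the paper does not specify it.

```python
from collections import Counter

def history_result_feature(candidate_ids, training_counts):
    """Rank candidate GeoNames IDs by how often each appeared as the
    gold answer in training; IDs never seen in training all share the
    rank after the last attested one. E.g. for Melbourne:
    2158177 -> 1, 7839805 -> 2, all others -> 3."""
    ranked = [gid for gid, _ in Counter(training_counts).most_common()]
    unseen_rank = len(ranked) + 1
    return {gid: (ranked.index(gid) + 1 if gid in ranked else unseen_rank)
            for gid in candidate_ids}
```

The remaining ranking features (Population, Name Similarity, AncestorsNames Correlation) follow the same rank-encoding pattern over different fields of the GeoNames records.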

Experiment Setting
The model architecture used for the toponym detection task is depicted in Figure 2. We use the Adam (Kingma and Ba, 2015) optimizer with β1 = 0.9, β2 = 0.999, ε = 10^−9 and an initial learning rate of 10^−3. A dropout (Srivastava et al., 2014) rate of 0.5 is used to prevent overfitting. The hidden size (d) of the model is 300, and cross-entropy loss is used for training.
To compare different word embeddings, we use pre-trained 300-dimensional GloVe embeddings and pre-trained 1024-dimensional ELMo embeddings, respectively. We do not update the word embeddings during training.

Table 2 shows the subtask 1 performance of the different word representations, from which we find that using ELMo representations is much better than using GloVe embeddings. The reason is that our tokenization method separates many words like I'm and let's, for which ELMo can generate a contextualized representation, while GloVe cannot. Furthermore, there are many numbers and OOV tokens in the tables, for which the GloVe embedding is a random 300-dimensional vector that does not provide useful context information.

Conclusions
In this work, we presented a method for toponym detection and disambiguation in scientific papers, in the context of SemEval 2019 Task 12, using an LSTM model and an SVM model, respectively. We extract tables from the plain text, and train a dedicated model for each of the table and non-table content to improve overall performance, given the differing structures of tables and body text. We also demonstrated the performance of different models for toponym detection, with our final submission ranking 2nd (among 8 teams).