Joint Learning of Named Entity Recognition and Entity Linking

Named entity recognition (NER) and entity linking (EL) are two fundamentally related tasks, since in order to perform EL, the mentions of entities first have to be detected. However, most entity linking approaches disregard the mention detection part, assuming that the correct mentions have been detected beforehand. In this paper, we perform joint learning of NER and EL to leverage their relatedness and obtain a more robust and generalisable system. To this end, we introduce a model inspired by the Stack-LSTM approach. We observe that multi-task learning of NER and EL indeed improves the performance on both tasks compared with models trained with individual objectives. Furthermore, we achieve results competitive with the state of the art in both NER and EL.


Introduction
In order to build high-quality systems for complex natural language processing (NLP) tasks, it is useful to leverage the output of lower-level tasks, such as named entity recognition (NER) and entity linking (EL). This makes NER and EL two fundamental NLP tasks.
NER corresponds to the process of detecting mentions of named entities in a text and classifying them with predefined types such as person, location and organisation. However, many of the detected mentions can refer to different entities, as in the example of Table 1, in which the mention "Leeds" can refer to "Leeds", the city, or "Leeds United A.F.C.", the football club. To resolve this ambiguity, EL is performed. It consists in determining which entity a particular mention refers to, by assigning it a knowledge base entity id.
In this example, the knowledge base id of the entity "Leeds United A.F.C." should be selected. In real-world applications, EL systems have to perform two tasks: mention detection (or NER) and entity disambiguation. However, most approaches have only focused on the latter, assuming that the mentions to be disambiguated are given.
In this work we do joint learning of NER and EL in order to leverage the information of both tasks at every decision. Furthermore, by having a flow of information between the computation of the representations used for NER and EL we are able to improve the model.
One example of the advantage of joint learning is shown in Table 1, in which the joint model is able to predict the correct entity by knowing that the type predicted by NER is Organisation.
This paper introduces two main contributions:
• A system that jointly performs NER and EL, with competitive results in both tasks.
• An empirical qualitative analysis of the advantage of joint learning versus using separate models, and of the influence of the different components on the obtained results.

Related work
Table 2: Actions and stack states when processing the sentence "Obama met Donald Trump". The predicted types and detected mentions are contained in the Output, and the entities the mentions refer to in the Entity column.

The majority of NER systems treat the task as sequence labelling and model it using conditional random fields (CRFs) on top of hand-engineered features (Finkel et al., 2005; Chiu and Nichols, 2016). Recently, NER systems have been achieving state-of-the-art results by using contextual word embeddings, obtained with language models (Peters et al., 2018; Devlin et al., 2018; Akbik et al., 2018).
Most EL systems discard mention detection, performing only entity disambiguation of previously detected mentions; in these cases, the dependency between the two tasks is ignored. EL state-of-the-art methods often correspond to local methods which use as main features a candidate entity representation, a mention representation, and a representation of the mention's context (Sun et al., 2015; Yamada et al., 2016, 2017; Ganea and Hofmann, 2017). Recently, there has also been increasing interest in improving EL performance by leveraging knowledge base information (Radhakrishnan et al., 2018) or by combining local and global features, using information about the neighbouring mentions and their respective entities (Le and Titov, 2018; Cao et al., 2018; Yang et al., 2018). However, these approaches require knowing the surrounding mentions, which can be impractical in a real setting because we might not have information about the following sentences; it also adds complexity that can increase processing time. Some works, as in this paper, perform end-to-end EL, trying to leverage the relatedness of mention detection (or NER) and EL, and have obtained promising results. Kolitsas et al. (2018) proposed a model that performs mention detection instead of NER, without identifying the type of the detected mentions as in our approach. Sil and Yates (2013), Luo et al. (2015), and Nguyen et al. (2016) introduced models that do joint learning of NER and EL using hand-engineered features. Durrett and Klein (2014) performed joint learning of entity typing, EL, and coreference using a structured CRF, also with hand-engineered features. In contrast, our model performs multi-task learning (Caruana, 1997; Evgeniou and Pontil, 2004) using learned features.

Model Description
In this section, we first briefly explain the Stack-LSTM (Lample et al., 2016), the model that inspired our system. Then we give a detailed explanation of our modifications and of how we extended it to also perform EL, as shown in the diagram of Figure 1. An example of how the model processes a sentence can be seen in Table 2.

Stack-LSTM
The Stack-LSTM is an action-based system composed of LSTMs augmented with a stack pointer. In contrast to the most common approaches, which detect the entity mentions for a whole sequence at once, with Stack-LSTMs the entity mentions are detected and classified on the fly. This is a fundamental property for our model, since we perform EL as soon as a mention is detected. The model is composed of four stacks: the Stack, which contains the words being processed; the Output, which is filled with the completed chunks; the Action stack, which contains the previous actions performed during the processing of the current document; and the Buffer, which contains the words still to be processed.
For NER, in the Stack-LSTM there are three possible types of actions:
• Shift, which pops a word off the Buffer and pushes it onto the Stack. It means that the last word of the Buffer is part of a named entity.
• Out, which pops a word off the Buffer and inserts it into the Output. It means that the last word of the Buffer is not part of a named entity.
• Reduce, which pops all the words in the Stack and pushes them into the Output. There is one Reduce action for each possible named entity type, e.g. Reduce-PER and Reduce-LOC.
Moreover, the actions that can be performed at each step are constrained: the action Out can only occur if the Stack is empty, and the Reduce actions are only available when the Stack is not empty.
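As a concrete illustration, the transition system above can be simulated with plain lists. The function below and its string-based state are an illustrative sketch, not the paper's implementation:

```python
# Minimal sketch of the Shift/Out/Reduce transition system described above.
# The function name and the simplified list-based state are illustrative
# assumptions, not the paper's actual data structures.

def run_actions(words, actions):
    """Apply a sequence of actions to a Buffer/Stack/Output state."""
    buffer = list(words)   # words still to be processed
    stack = []             # words of the mention currently being built
    output = []            # completed chunks, e.g. ("Obama", "PER")
    for action in actions:
        if action == "Shift":               # word belongs to a mention
            assert buffer, "Shift needs a non-empty Buffer"
            stack.append(buffer.pop(0))
        elif action == "Out":               # word is not part of a mention
            assert not stack, "Out is only allowed when the Stack is empty"
            output.append((buffer.pop(0), "O"))
        elif action.startswith("Reduce-"):  # close the mention with a type
            assert stack, "Reduce is only allowed when the Stack is non-empty"
            output.append((" ".join(stack), action.split("-", 1)[1]))
            stack = []
    return output

# Processing "Obama met Donald Trump", as in Table 2:
chunks = run_actions(
    ["Obama", "met", "Donald", "Trump"],
    ["Shift", "Reduce-PER", "Out", "Shift", "Shift", "Reduce-PER"],
)
# chunks == [("Obama", "PER"), ("met", "O"), ("Donald Trump", "PER")]
```

The constraints from the text appear as assertions: Out requires an empty Stack, Reduce a non-empty one.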
Our model

NER. To better capture the context, we complement the Stack-LSTM with a representation v_t of the sentence being processed, for each action step t. For that, the sentence x_1, ..., x_|w| is passed through a bi-directional LSTM, with h^1_w being the hidden state of its 1st layer (bi-LSTM^1 in Figure 1) that corresponds to the word with index w. We compute a representation q_t of the words contained in the Stack by taking the mean of the 1st-layer hidden states of the bi-LSTM that correspond to the words contained in the Stack at action step t, the set S_t:

q_t = (1 / |S_t|) Σ_{w ∈ S_t} h^1_w.

This is used to compute the attention scores α_t:

α_{t,w} = softmax_w(u^T tanh(W_1 h^1_w + W_2 q_t)),

where W_1, W_2, and u are trainable parameters. The representation v_t is then obtained as the weighted average of the bi-LSTM 1st-layer hidden states:

v_t = Σ_w α_{t,w} h^1_w.

To predict the action to be performed, we use an affine transformation (affine^NER in Figure 1) whose input is the concatenation of the last hidden states of the Buffer LSTM b_t, Stack LSTM s_t, Action LSTM a_t, and Output LSTM o_t, as well as the sentence representation v_t.
Then, for each step t, we use these representations to compute the probability distribution p_t over the set of possible actions A, and select the action with the highest probability:

y^NER_t = argmax_{a ∈ A} p_t(a).
The NER loss function is the cross entropy, with the gold action for step t represented by the one-hot vector y^NER_t:

L_NER = − Σ_{t=1}^{T} y^NER_t · log p_t,

where T is the total number of action steps for the current document.
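The sentence-representation computation above can be sketched in NumPy as follows. The additive scoring function u^T tanh(W_1 h_w + W_2 q_t) is our reading of the text, and all tensor shapes and parameter values are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_representation(H1, stack_indices, W1, W2, u):
    """Attention over bi-LSTM first-layer states H1 (num_words x d).

    q_t: mean of the states of the words currently in the Stack (set S_t);
    alpha_t: attention weights from an additive scoring function (assumed form);
    v_t: attention-weighted average of all first-layer states.
    """
    q_t = H1[stack_indices].mean(axis=0)                      # Stack summary
    scores = np.array([u @ np.tanh(W1 @ h + W2 @ q_t) for h in H1])
    alpha_t = softmax(scores)                                 # attention weights
    v_t = alpha_t @ H1                                        # weighted average
    return v_t, alpha_t

# Toy usage with random parameters:
rng = np.random.default_rng(0)
d = 4
H1 = rng.standard_normal((5, d))   # 5 words, d-dimensional hidden states
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
u = rng.standard_normal(d)
v_t, alpha_t = sentence_representation(H1, [1, 2], W1, W2, u)
```

The attention weights sum to one by construction, and v_t has the same dimensionality as a single hidden state.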
EL. When the predicted action is Reduce, a mention is detected and classified. This mention is then disambiguated by selecting its respective knowledge base entity id. The disambiguation step is performed by ranking the mention's candidate entities.
The candidate entities c ∈ C for the current mention are represented by their entity embedding c_e and their prior probability c_p. The prior probabilities were previously computed based on the co-occurrences between mentions and candidate entities in Wikipedia.
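Such priors can be estimated by counting, for each surface form, how often it links to each entity. The sketch below, with a hypothetical `build_priors` helper over Wikipedia anchor (mention, entity) pairs, illustrates the idea; it is not the procedure actually used to build the dictionary:

```python
from collections import Counter, defaultdict

def build_priors(anchor_pairs):
    """Estimate p(entity | mention) from (mention, entity) anchor pairs,
    e.g. harvested from Wikipedia hyperlinks. Illustrative sketch only."""
    counts = defaultdict(Counter)
    for mention, entity in anchor_pairs:
        counts[mention.lower()][entity] += 1
    priors = {}
    for mention, entity_counts in counts.items():
        total = sum(entity_counts.values())
        priors[mention] = {e: c / total for e, c in entity_counts.items()}
    return priors

# Toy anchor statistics (hypothetical):
pairs = [("Leeds", "Leeds"), ("Leeds", "Leeds_United_A.F.C."),
         ("Leeds", "Leeds_United_A.F.C."), ("Obama", "Barack_Obama")]
priors = build_priors(pairs)
# priors["leeds"]["Leeds_United_A.F.C."] == 2/3
```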
To represent the detected mention, the 2nd layer of the sentence bi-LSTM (bi-LSTM^2 in Figure 1) is used: the representation m is obtained by averaging the hidden states h^2_w that correspond to the words contained in the mention, the set M:

m = (1 / |M|) Σ_{w ∈ M} h^2_w.

These features are concatenated with the representation of the sentence v_t and the last hidden state of the Action stack-LSTM a_t:

c = [c_e; c_p; m; v_t; a_t].

We compute a score for each candidate with affine transformations (affine^EL in Figure 1) that have c as input, and select the candidate entity with the highest score, y^EL_t:

l_t = affine(tanh(affine([c_1, . . . , c_n]))).

The EL loss function is the cross entropy, with the gold entity for step t represented by the one-hot vector y^EL_t:

L_EL = − Σ_{t=1}^{T} y^EL_t · log softmax(l_t),

where T is the total number of mentions that correspond to entities in the knowledge base. Since not every detected mention has a corresponding entity in the knowledge base, we first classify whether the mention has an entry in the knowledge base using an affine transformation followed by a sigmoid, whose input is the Stack LSTM last hidden state s_t:

d = sigmoid(affine(s_t)).
The NIL loss function, the binary cross entropy, is given by:

L_NIL = −(y_NIL log d + (1 − y_NIL) log(1 − d)),

where y_NIL corresponds to the gold label: 1 if the mention should be linked and 0 otherwise.
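A minimal sketch of the candidate scoring and the NIL gate described above; parameter names, shapes, and the random features are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_candidates(cand_feats, W_a, b_a, w_b, b_b):
    """Two affine layers with a tanh in between, mirroring
    l_t = affine(tanh(affine(c))): one scalar score per candidate."""
    hidden = np.tanh(cand_feats @ W_a + b_a)   # (n_cands, h)
    return hidden @ w_b + b_b                  # (n_cands,)

def nil_probability(s_t, w_nil, b_nil):
    """d = sigmoid(affine(s_t)): probability that the detected
    mention has an entry in the knowledge base."""
    return sigmoid(s_t @ w_nil + b_nil)

# Toy usage: 3 candidates with 8-dimensional concatenated features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8))
scores = score_candidates(feats, rng.standard_normal((8, 6)),
                          np.zeros(6), rng.standard_normal(6), 0.0)
best = int(np.argmax(scores))   # y_EL: the highest-scoring candidate
```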
During training we perform teacher forcing, i.e. we use the gold labels for NER and for the NIL classification, only performing EL when the gold action is Reduce and the mention has a corresponding id in the knowledge base. The multi-task learning loss is then obtained by summing the individual losses:

L = L_NER + L_EL + L_NIL.

Datasets and metrics
We trained and evaluated our model on the biggest NER-EL English dataset, the AIDA/CoNLL dataset (Hoffart et al., 2011). It is a collection of newswire articles from Reuters, composed of a training set of 18,448 linked mentions in 946 documents, a validation set of 4,791 mentions in 216 documents, and a test set of 4,485 mentions in 231 documents. In this dataset, the entity mentions are classified as person, location, organisation, or miscellaneous. It also contains the Wikipedia knowledge base ids of the respective entities.
For the NER experiments we report the F1 score, while for EL we report the micro and macro F1 scores. The EL scores were obtained with the Gerbil benchmarking platform, which offers a reliable evaluation and comparison with state-of-the-art models (Röder et al.). The results were obtained using the strong matching setting, which requires exactly predicting the gold mention boundaries and their corresponding entity.
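Strong matching can be sketched as exact set intersection over (start, end, entity) triples. This is a simplified illustration of the setting, not Gerbil's implementation:

```python
def strong_match_f1(gold, pred):
    """Micro F1 under strong matching: a prediction counts only if the
    mention span boundaries AND the linked entity both match exactly.
    Items are (start, end, entity_id) triples. Illustrative sketch."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 1, "Barack_Obama"), (3, 5, "Leeds_United_A.F.C.")]
pred = [(0, 1, "Barack_Obama"), (3, 5, "Leeds")]  # wrong entity: no credit
# strong_match_f1(gold, pred) == 0.5
```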

Training details and settings
In our work, we used 100-dimensional word embeddings pre-trained with structured skip-gram on the Gigaword corpus. These were concatenated with 50-dimensional character embeddings obtained using a bi-LSTM over the sentences. In addition, we use contextual embeddings obtained with the character-level bi-LSTM language model of Akbik et al. (2018). The entity embeddings are 300-dimensional and were trained by Yamada et al. (2017) on Wikipedia. To get the set of candidate entities to be ranked for each mention, we use a pre-built dictionary (Pershina et al., 2015).
The bi-LSTM used to extract the sentence and mention representations, v_t and m, is composed of 2 hidden layers of size 100, and the LSTMs used in the Stack-LSTM have 1 hidden layer of size 100. The feedforward layer used to determine the entity id has a size of 5000. The affine layer used to predict whether the mention is NIL has a size of 100. A dropout ratio of 0.3 was used throughout the model.
The model was trained using the ADAM optimiser (Kingma and Ba, 2014) with a decreasing learning rate starting at 0.001 and decay rates of 0.8 and 0.999 for the first and second moments, respectively.

Results
Comparison with state-of-the-art models. We compared the results obtained with our joint learning approach against state-of-the-art NER models, in Table 3, and state-of-the-art end-to-end EL models, in Table 4. It can be observed that our model's scores are competitive in both tasks.

System                                   Test F1
Flair (Akbik et al., 2018)               93.09
BERT Large (Devlin et al., 2018)         92.80
CVT + Multi                              92.60
BERT Base (Devlin et al., 2018)          92.40
BiLSTM-CRF+ELMo (Peters et al., 2018)    92.22
Our model                                92.43

Comparison with individual models. To understand whether the multi-task learning approach is advantageous for NER and EL, we compare the results obtained when using a multi-task learning objective with the results obtained by the same models when trained with separate objectives. In the EL case, in order to perform a fair comparison, the mentions linked by the individual system correspond to the ones detected by the multi-task NER. The results of these comparisons can be found in Tables 5 and 6, for NER and EL, respectively. They show that, as expected, joint learning consistently improves both NER and EL results. This indicates that, by leveraging the relatedness of the tasks, we can achieve better models.

Ablation tests. In order to understand which components contributed most to the obtained scores, we performed an ablation test for each task, reported in Tables 7 and 8, for NER and EL, respectively. These experiments show that the use of contextual embeddings (Flair) is responsible for a big boost in NER performance and, consequently, in EL, due to the better detection of mentions. We can also see that the addition of the sentence representation (sent rep v_t) improves NER performance slightly. Interestingly, the use of a mention representation (ment rep m) for EL that is computed by the sentence bi-LSTM not only yields a big improvement on the EL task but also contributes to improving the NER scores. The results also indicate that having a simple affine transformation selecting whether the mention should be linked improves the EL results.

Conclusions and Future Work
We proposed joint learning of NER and EL in order to improve their performance. The results show that our model is competitive with the state of the art. Moreover, we verified that models trained with the multi-task objective perform better than individual ones. There is, however, further work that can be done to improve our system, such as training contextual entity embeddings and extending the model to be cross-lingual.