Joint Entity Recognition and Disambiguation

Extracting named entities in text and linking extracted names to a given knowledge base are fundamental tasks in applications for text understanding. Existing systems typically run a named entity recognition (NER) model to extract entity names first, then run an entity linking model to link extracted names to a knowledge base. NER and linking models are usually trained separately, and the mutual dependency between the two tasks is ignored. We propose JERL, Joint Entity Recognition and Linking, to jointly model NER and linking tasks and capture the mutual dependency between them. It allows the information from each task to improve the performance of the other. To the best of our knowledge, JERL is the first model to jointly optimize NER and linking tasks together completely. In experiments on the CoNLL'03/AIDA data set, JERL outperforms state-of-art NER and linking systems, and we find improvements of 0.4% absolute F1 for NER on CoNLL'03, and 0.36% absolute precision@1 for linking on AIDA.


Introduction
In applications of complex Natural Language Processing tasks, such as automatic knowledge base construction, entity summarization, and question answering systems, it is essential to first have high quality systems for lower level tasks, such as part-of-speech (POS) tagging, chunking, named entity recognition (NER), entity linking, and parsing, among others. These lower level tasks are usually decoupled and optimized separately to keep the system tractable. The disadvantage of the decoupled approach is that each lower level task is not aware of the other tasks and thus cannot leverage the information they provide to improve performance. What is more, there is no guarantee that their outputs will be consistent.
This paper addresses the problem by building a joint model for Entity Recognition and Disambiguation (ERD). The goal of ERD is to extract named entities in text and link extracted names to a knowledge base, usually Wikipedia or Freebase. ERD is closely related to NER and linking tasks. NER aims to identify named entities in text and classify mentions into predefined categories such as persons, organizations, locations, etc. Given a mention and context as input, entity linking connects the mention to a referent entity in a knowledge base.
Existing ERD systems typically run an NER model to extract entity mentions first, then run an entity linking model to link mentions to a knowledge base. Such a decoupled approach makes the system tractable, and both NER and linking models can be optimized separately. The disadvantages are also obvious: 1) errors caused by NER are propagated to linking and are not recoverable; 2) NER cannot benefit from the information used in entity linking; 3) NER and linking may create inconsistent outputs.
We argue that there is strong mutual dependency between NER and linking tasks. Consider the following two examples:
1. The New York Times (NYT) is an American daily newspaper.
2. Clinton plans to have more news conferences in 2nd term. WASHINGTON 1996-12-06
Example 1 is the first sentence from the Wikipedia article about "The New York Times". It is reasonable but incorrect for NER to identify "New York Times" without "The" as a named entity, while entity linking has no trouble connecting "The New York Times" to the correct entity.
Example 2 is a news title where our NER classifies "WASHINGTON" as a location, since a location followed by a date is a frequent pattern in news articles it learned, while the entity linking prefers linking this mention to the U.S. president "George Washington" since another president's name "Clinton" is mentioned in the context. Both the entity boundaries and entity types predicted by NER are correlated to the knowledge of entities linked by entity linking. Modeling such mutual dependency is helpful in resolving inconsistency and improving performance for both NER and linking.
We propose JERL, Joint Entity Recognition and Linking, to jointly model NER and linking tasks and capture the mutual dependency between them. It allows the information from each task to improve the performance of the other. If NER is highly confident in its predicted entity boundaries and types, it will encourage entity linking to link an entity which is consistent with NER's outputs, and vice versa. In other words, JERL is able to model how consistent the outputs of NER and linking are, and to predict coherent outputs. According to our experiments, this approach does improve the end-to-end performance. To the best of our knowledge, JERL is the first model to jointly optimize NER and linking tasks together completely.
Sil (2013) also proposes jointly conducting NER and linking tasks. They leverage existing NER/chunking systems and Freebase to overgenerate mention candidates and leave the final decisions to the linking algorithm, which is a reranking model. Their model captures the dependency between entity linking decisions and mention boundary decisions with impressive results. The difference between our model and theirs is that our model jointly models NER and linking tasks from the training phase, while their model is a combined one which depends on an existing state-of-art NER system. Our model is more powerful in capturing mutual dependency because it considers entity type and confidence information, while in their model the confidence of the outputs is lost in the linking phase. Furthermore, in our model NER can naturally benefit from entity linking's decision since both decisions are made together, while in their model it is not clear how the linking decision can help the NER decision in return.
Joint optimization is costly. It increases the problem complexity, is usually inefficient, and requires careful consideration of the features of multiple tasks and their mutual dependency, as well as proper assumptions and approximations to enable tractable training and inference. However, we believe that joint optimization is a promising direction for improving performance on NLP tasks, since it is closer to how human beings process text information. Experimental results indicate that our joint model does a better job at both NER and linking tasks than separate models with the same features, and outperforms state-of-art systems on a widely used data set. We found improvements of 0.4% absolute F1 for NER on CoNLL'03 and 0.36% absolute precision@1 for linking on AIDA. NER is a widely studied problem, and we believe our improvement is significant.
The contributions of this paper are as follows:
1. We identify the mutual dependency between NER and linking tasks, and argue that NER and linking should be conducted together to improve the end-to-end performance.
2. We propose the first completely joint NER and linking model, JERL, to train and infer the two tasks together. Efficient training and inference algorithms are also presented.
3. JERL outperforms the best NER record on the CoNLL'03 data set, which demonstrates how NER could be improved further by leveraging knowledge base and linking techniques.
The remainder of this paper is organized as follows: the next section discusses related works on NER, entity linking, and joint optimization; section 3 presents our Joint Entity Recognition and Linking model in detail; section 4 describes experiments, results, and analysis; and section 5 concludes.

Related Work
The NER problem has been widely addressed by symbolic, statistical, as well as hybrid approaches. It has been encouraged by several editions of evaluation campaigns such as MUC (Chinchor and Marsh, 1998), the CoNLL 2003 NER shared task (Tjong Kim Sang and De Meulder, 2003) and ACE (Doddington et al., 2004). Along with the improvement of Machine Learning techniques, statistical approaches have become a major direction for research on NER, especially after Conditional Random Fields were proposed by Lafferty et al. (2001). The well-known state-of-art NER systems are Stanford NER (Finkel et al., 2005) and UIUC NER (Ratinov and Roth, 2009). Liang (2005) compares the performance of the 2nd order linear chain CRF and Semi-CRF (Sarawagi and Cohen, 2004) in his thesis. Lin and Wu (2009) cluster tens of millions of phrases and use the resulting clusters as features in NER, reporting the best performance on the CoNLL'03 English NER data set. Recent works on NER have started to focus on multi-lingual named entity recognition or NER on short text, e.g. Twitter.
Entity linking was initiated with Wikipedia-based works on entity disambiguation (Bunescu and Pasca, 2006; Cucerzan, 2007). The task was first encouraged by the TAC 2009 KB population task (footnote 1) and has received more and more attention from the research community (Hoffart et al., 2011; Ratinov et al., 2011). Linking usually takes mentions detected by NER as its input. Stern et al. (2012) and Wang et al. (2012) present joint NER and linking systems and evaluate their systems on French and Chinese data sets. Sil and Yates (2013) take a re-ranking based approach and achieve the best result on the AIDA data set. In 2014, Microsoft and Google jointly hosted the "Entity Recognition and Disambiguation Challenge", which focused on the end-to-end performance of linking systems (footnote 2).
Joint optimization models have been studied at great length. For example, Dynamic CRF (McCallum et al., 2003) has been proposed to conduct part-of-speech tagging and chunking together. Finkel and Manning (2009) show how to model parsing and named entity recognition together. Yu et al. (2011) work on joint entity identification and relation extraction from Wikipedia. Sil's (2013) work on joint NER and linking is described in the introduction section of this paper. It is worth noting that joint optimization does not always work. The CoNLL 2008 shared task (Surdeanu et al., 2008) was intended to encourage joint optimization of parsing and semantic role labeling, but the top performing systems decoupled the two tasks.

Joint Entity Recognition and Linking
Named entity recognition is usually formalized as a sequence labeling task, in which each word is classified as not-an-entity or with an entity label. Conditional Random Fields (CRFs) are among the popular models used. Most features used in NER are word-level (e.g. whether a word sequence appears at position i, or whether a word contains exactly four digits). It is hard, if not impossible, to encode entity-level features (such as "entity length" and "correlation to known entities") in a traditional CRF. Entity linking is typically formalized as a ranking task. Features used for entity linking are inherently entity-level (such as entity prior probability, or whether there are any related entity names or discriminative keywords occurring in the context).
Figure 1: The factor graph of the JERL model
Footnote 1: http://www.nist.gov/tac/2014/KBP/
Footnote 2: http://web-ngram.research.microsoft.com/ERD2014/
The main challenges of joint optimization between NER and linking are: how to combine a sequence labeling model and a ranking model; and how to incorporate word-level and entity-level features. In a linear chain CRF model, each word's label is assumed to depend on the observations and the label of its previous word. Semi-CRF carefully relaxes the Markov assumption between words in CRF, and models the distribution of segmentation boundaries directly. We further extend Semi-CRF to model entity distribution and mutual dependency over segmentations, and name it Joint Entity Recognition and Linking (JERL). The model is described below.

JERL
Let $x = \{x_i\}$ be a word sequence containing $|x|$ words. Let $s = \{s_j\}$ be a segmentation assignment over $x$, where segment $s_j = (u_j, v_j)$ consists of a start position $u_j$ and an end position $v_j$. All segments have a positive length and are adjacent to each other, so every $(u_j, v_j)$ satisfies $1 \le u_j \le v_j \le |x|$ and $u_{j+1} = v_j + 1$. Let $y = \{y_j\}$ be labels in a fixed label alphabet $Y$ over a segmentation assignment $s$. Here $Y$ is the set of types that NER predicts.
$x_{s_j} = (x_{u_j}, \ldots, x_{v_j})$ is the word sequence corresponding to $s_j$, and $E_{s_j} = \{e_{j,k}\}$ is the set of entities in the knowledge base (KB) that may be referred to by word sequence $x_{s_j}$ in the entity linking task. Each entity $e_{j,k}$ is associated with a label $y^e_{j,k} \in \{0, 1\}$. Label $y^e_{j,k}$ takes 1 iff $x_{s_j}$ refers to entity $e_{j,k}$, and 0 otherwise. If $x_{s_j}$ does not refer to any entity in the KB, $y^e_{j,0}$ takes 1, which is analogous to the NIL identifier in entity linking.
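The notation above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `Segment`, the label "O" for not-an-entity, and the string "NIL" for the $k = 0$ choice are illustrative assumptions, and the validity check additionally assumes that a segmentation covers the whole sequence from word 1 to word $|x|$.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One element of a joint assignment (hypothetical container)."""
    u: int        # start position u_j (1-based, inclusive)
    v: int        # end position v_j (1-based, inclusive)
    label: str    # NER type y_j, or "O" for not-an-entity (assumed name)
    entity: str   # linked KB entity, or "NIL" for the k = 0 choice


def is_valid_segmentation(segments: List[Segment], n_words: int) -> bool:
    """Check 1 <= u_j <= v_j <= |x| and u_{j+1} = v_j + 1.

    Assumption: the segments must also cover x from word 1 to word |x|.
    """
    expected_start = 1
    for seg in segments:
        if seg.u != expected_start or seg.v < seg.u or seg.v > n_words:
            return False
        expected_start = seg.v + 1
    return expected_start == n_words + 1


def words_of(x: List[str], seg: Segment) -> List[str]:
    """Return x_{s_j}, the words covered by a segment."""
    return x[seg.u - 1: seg.v]
```

For the sentence in Example 1, a segment `Segment(1, 4, "ORG", "The_New_York_Times")` would cover the words "The New York Times", while every remaining word would sit in its own "O" segment linked to "NIL".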
Based on the preliminaries and notations, Figure 1 shows the factor graph (Kschischang et al., 2001) of JERL. There are similar factor nodes for every $(u_j, v_j, y^e_{j,k})$; we only show the first one, $(u_j, v_j, y^e_{j,0})$, for clarity. Given $x$, let $a = (s, y, y^e)$ be a joint assignment, and $g(x, j, a)$ be local functions for $x_{s_j}$, namely features, each of which maps an assignment $a$ to a measurement $g_k(x, j, a) \in \mathbb{R}$. Then $G(x, a) = \sum_{j=1}^{|s|} g(x, j, a)$ is the factor graph defining a probability distribution of assignment $a$ conditioned on word sequence $x$.
Then JERL, the conditional probability of $a$ given $x$, is defined as:
$$P(a \mid x, w) = \frac{e^{w \cdot G(x, a)}}{Z(x)} \qquad (1)$$
where $w$ is the weight vector corresponding to $G$, which will be learned later, and $Z(x) = \sum_{a' \in A} e^{w \cdot G(x, a')}$ is the normalization factor, in which $A$ is the set of all possible assignments over $x$. JERL is a probabilistic graphical model. More specifically, as shown in Figure 1, three groups of local functions and one constraint are introduced. Each of them plays a different role in JERL, as described below.
Features defined on $x, s_j, y_j, y_{j-1}$ are written as $g^{ner}(x, s_j, y_j, y_{j-1})$. These functions model the distribution of segmentations and entity types over $x$. In fact, every local feature used in NER can be formulated in this way, and thus can be included in JERL. We therefore refer to them as "NER features".
Features defined on $x, s_j, y^e_{j,k}$ are written as $g^{el}(x, s_j, y^e_{j,k})$ and are called "linking features". These features model the joint probabilities of word sequence $x_{s_j}$ and linking decisions $y^e_{j,k} = 1$ $(0 \le k \le |E_{s_j}|)$ given context $x$. JERL incorporates all linking features in this way.
Features defined on $y_j, y^e_{j,k}$ are written as $g^{cr}(y_j, y^e_{j,k})$. These features model the "mutual dependency" between the outputs of NER and linking. For each entity $e_{j,k}$, there is additional information available in the knowledge base, e.g. category information, popularity, and relationships to other entities. These features encourage predicting coherent outputs for NER and linking.
There is one constraint for each $y^e_j$: the corresponding $x_{s_j}$ can refer to only one entity $e_{j,k} \in E_{s_j}$ or NIL. This is equivalent to $\sum_{k=0}^{|E_{s_j}|} y^e_{j,k} = 1$. Based on the above description, $G(x, a)$ in equation 1 is the sum over $s$ of the conjunction $(g^{ner}, g^{el}, g^{cr})$, and can be rewritten as
$$G(x, a) = \sum_{j=1}^{|s|} \Big( g^{ner}(x, s_j, y_j, y_{j-1}),\ \sum_{k=0}^{|E_{s_j}|} g^{el}(x, s_j, y^e_{j,k}),\ \sum_{k=0}^{|E_{s_j}|} g^{cr}(y_j, y^e_{j,k}) \Big)$$
In summary, JERL jointly models NER and linking, and leverages the mutual dependency between them to predict coherent outputs. Previous works on linking (Cucerzan, 2007; Ratinov et al., 2011; Sil and Yates, 2013) argued that entity linking systems often suffer from errors in the mention detection phase, especially false negative errors, and tried to mitigate them by overgenerating mention candidates. From the mention generation perspective, JERL actually considers every possible assignment and is able to find the optimal $a$.
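To make the definition of the conditional probability concrete, the following brute-force sketch enumerates every joint assignment over a very short sequence and normalizes the exponentiated scores. It is an illustration under stated assumptions rather than the authors' code: the callables `ner_score` and `link_score` are hypothetical stand-ins for $w^{ner} \cdot g^{ner}$ and $w^{el} \cdot g^{el} + w^{cr} \cdot g^{cr}$, linking features are assumed to fire only for the chosen candidate $k$, and the label before the first segment is represented by `None`.

```python
import math
from typing import Callable, Iterator, List, Optional, Tuple

# One segment of a joint assignment: (u, v, y, k); k = 0 stands for NIL.
Seg = Tuple[int, int, str, int]


def enumerate_assignments(n: int, labels: List[str], max_len: int,
                          n_cands: Callable[[int, int], int]
                          ) -> Iterator[List[Seg]]:
    """Yield every joint assignment a = (s, y, y^e) over words 1..n."""
    def rec(start: int, prefix: List[Seg]) -> Iterator[List[Seg]]:
        if start > n:
            yield list(prefix)
            return
        for v in range(start, min(n, start + max_len - 1) + 1):
            for y in labels:
                # n_cands(u, v) plays the role of |E_{s_j}|; k = 0 is NIL.
                for k in range(n_cands(start, v) + 1):
                    prefix.append((start, v, y, k))
                    yield from rec(v + 1, prefix)
                    prefix.pop()
    yield from rec(1, [])


def score(a: List[Seg],
          ner_score: Callable[[int, int, str, Optional[str]], float],
          link_score: Callable[[int, int, str, int], float]) -> float:
    """w . G(x, a): sum of per-segment NER and linking/mutual scores."""
    total, prev = 0.0, None  # None marks "no previous label" (assumption)
    for (u, v, y, k) in a:
        total += ner_score(u, v, y, prev) + link_score(u, v, y, k)
        prev = y
    return total


def joint_distribution(n, labels, max_len, n_cands, ner_score, link_score):
    """Return [(assignment, P(a | x, w))] and the normalizer Z(x)."""
    assignments = list(enumerate_assignments(n, labels, max_len, n_cands))
    weights = [math.exp(score(a, ner_score, link_score)) for a in assignments]
    z = sum(weights)
    return [(a, wt / z) for a, wt in zip(assignments, weights)], z
```

The enumeration grows exponentially with the sequence length, which is why the dynamic programming described in the parameter estimation section is needed in practice.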

Parameter Estimation
We describe how to conduct parameter estimation for JERL in this section. Given independent and identically distributed (i.i.d.) training data $T = \{(x_t, a_t)\}_{t=1}^{N}$, the goal of parameter estimation is to find the optimal $w^*$ that maximizes the joint probability of the assignments $\{a_t\}$ over $\{x_t\}$.
We use the conditional log likelihood with an $\ell_2$-norm regularizer as the objective function $L(T, w)$ in training. This function is concave, and the regularization ensures that it has exactly one global optimum. We adopt a limited-memory quasi-Newton method (Liu and Nocedal, 1989) to solve the optimization problem.
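As a worked form of the objective described above, the regularized conditional log likelihood and its gradient can be written out as follows. This is a sketch under an assumption: the regularization coefficient $\lambda$ is a placeholder symbol, since the exact constant used in the paper is not recoverable from the text. The gradient follows from the standard result for log-linear models, where the derivative of $\log Z(x)$ is the expected feature vector under the model.

```latex
% Assumption: \lambda > 0 is a placeholder regularization coefficient.
\begin{align*}
L(T, w) &= \sum_{t=1}^{N} \log P(a_t \mid x_t, w) \;-\; \lambda \lVert w \rVert_2^2 \\
        &= \sum_{t=1}^{N} \Big( w \cdot G(x_t, a_t) - \log Z(x_t) \Big) \;-\; \lambda \lVert w \rVert_2^2 \\[4pt]
\nabla_w L(T, w) &= \sum_{t=1}^{N} \Big( G(x_t, a_t) - \sum_{a \in A} P(a \mid x_t, w)\, G(x_t, a) \Big) \;-\; 2 \lambda w
\end{align*}
```

Because $G$ decomposes over segments, the expectation in the gradient only requires the segment-level marginals $P(a_j \mid x_t, w)$ rather than an explicit sum over all of $A$.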
The gradient of $L(T, w)$ is the difference between the observed feature values $G(x_t, a_t)$ and their expectation under the model, summed over the training data, minus the derivative of the regularization term. As shown in Figure 1, our model's factor graph is a tree, which means the calculation of the gradient is tractable. Inspired by the forward-backward algorithm (Sha and Pereira, 2003) and Semi-CRF (Sarawagi and Cohen, 2004), we leverage dynamic programming techniques to compute the normalization factor $Z_w$ and the marginal probability $P(a_j \mid x_t, w)$ when $w$ is given (Sutton and McCallum, 2006). The parameter estimation algorithm is summarized in Algorithm 1.
Algorithm 1: JERL parameter estimation
    while not converged do
        for t ← 1 to N do
            calculate $\alpha_t$, $\beta_t$ according to eq. 3;
            calculate $Z_t$ according to eq. 4;
            calculate $w'_t$ according to eq. 2, 5;
            $Z \leftarrow Z + Z_t$; $w' \leftarrow w' + w'_t$;
        end
        update $w$ to maximize the log likelihood $L(T, w)$ under $(Z, w')$ via L-BFGS;
    end

Let $\alpha_{i,y}$ $(i \in [0, |x|], y \in Y)$ be the sum of potential functions of all possible assignments over $(x_1 \ldots x_i)$ whose last segment is labeled $y$. Then $\alpha_{i,y}$ can be calculated recursively from $i = 0$ to $i = |x|$ as below.
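We first define the base cases as $\alpha_{0,y} = 1$ for all $y \in Y$. When $i \in (0, |x|]$:
$$\alpha_{i,y} = \sum_{d=1}^{\min(L, i)} \sum_{y' \in Y} \alpha_{i-d,y'} \, \psi^{ner}_{i-d+1,i,y,y'} \sum_{y^e \in Y^{e*}_j} \psi^{el.cr}_{i-d+1,i,y,y^e} \qquad (3)$$
where $L$ is the maximum segmentation length in Semi-CRF, and $Y^{e*}_j$ is the set of all valid assignments for $y^e_j$, i.e. those that satisfy $\sum_{k=0}^{|E_{s_j}|} y^e_{j,k} = 1$. The potentials $\psi^{ner}_{u_j,v_j,y_j,y_{j-1}}$ and $\psi^{el.cr}_{u_j,v_j,y_j,y^e_j}$ are precomputed as below,
$$\psi^{ner}_{u_j,v_j,y_j,y_{j-1}} = e^{w^{ner} \cdot g^{ner}(x, s_j, y_j, y_{j-1})}$$
$$\psi^{el.cr}_{u_j,v_j,y_j,y^e_j} = \prod_{k=0}^{|E_j|} e^{w^{el} \cdot g^{el}(x, s_j, y^e_{j,k}) + w^{cr} \cdot g^{cr}(y_j, y^e_{j,k})}$$
where $w^{ner}$, $w^{el}$ and $w^{cr}$ are the weights in $w$ for $g^{ner}$, $g^{el}$ and $g^{cr}$ respectively. The value of $Z_w$ can then be written as
$$Z_w(x) = \sum_{y \in Y} \alpha_{|x|,y} \qquad (4)$$
Define $\beta_{i,y}$ $(i \in [0, |x|], y \in Y)$ as the sum of potential functions of all possible assignments over $(x_{i+1} \ldots x_{|x|})$ whose first segment is labeled $y$. $\beta_{i,y}$ is calculated in a similar way, except that the recursion runs from right to left, from $i = |x|$ down to $i = 0$.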
Once we get $\{\alpha_{i,y}\}$ and $\{\beta_{i,y}\}$, the marginal probability of an arbitrary assignment $a_j = (s_j, y_j, y^e_j)$, where $s_j = (u_j, v_j)$, can be calculated as below:
$$P(s_j, y_j \mid x, w) = \frac{\Big( \sum_{y' \in Y} \alpha_{u_j-1,y'} \, \psi^{ner}_{u_j,v_j,y_j,y'} \Big) \beta_{v_j,y_j}}{Z_w(x)}$$
and
$$P(a_j \mid x_t, w) = P(s_j, y_j \mid x, w) \, \frac{\psi^{el.cr}_{u_j,v_j,y_j,y^e_j}}{\sum_{y^{e\prime} \in Y^{e*}_j} \psi^{el.cr}_{u_j,v_j,y_j,y^{e\prime}}} \qquad (5)$$
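The forward pass described above (the recursion for $\alpha$ and the normalization factor $Z_w$) can be sketched as a semi-Markov dynamic program. This is a minimal illustration, not the authors' implementation: `psi_ner` and `psi_link_sum` are hypothetical callables that return the NER potential and the linking potential already summed over all valid $y^e$, and the base case $\alpha_{0,y} = 1$ for every label follows the text. A practical implementation would work in log space to avoid overflow.

```python
from typing import Callable, Dict, List, Tuple


def forward(n: int, labels: List[str], max_len: int,
            psi_ner: Callable[[int, int, str, str], float],
            psi_link_sum: Callable[[int, int, str], float]
            ) -> Tuple[List[Dict[str, float]], float]:
    """Semi-Markov forward pass.

    alpha[i][y] is the total potential of all assignments over words 1..i
    whose last segment carries label y. Returns (alpha, Z).
    """
    alpha: List[Dict[str, float]] = [
        {y: 0.0 for y in labels} for _ in range(n + 1)
    ]
    for y in labels:
        alpha[0][y] = 1.0  # base case, as stated in the text
    for i in range(1, n + 1):
        for y in labels:
            total = 0.0
            # The last segment spans words u..i with length d <= max_len.
            for d in range(1, min(max_len, i) + 1):
                u = i - d + 1
                inner = sum(alpha[u - 1][yp] * psi_ner(u, i, y, yp)
                            for yp in labels)
                total += inner * psi_link_sum(u, i, y)
            alpha[i][y] = total
    z = sum(alpha[n][y] for y in labels)
    return alpha, z
```

The cost is on the order of $|x| \cdot L \cdot |Y|^2$ evaluations of the potentials, compared with the exponential number of assignments a direct sum over $A$ would visit.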

Inference
Given a new word sequence $x$ and model weights $w$ trained on a training set, the goal of inference is to find the best assignment for $x$, $a^* = \arg\max_a P(a \mid x, w)$. We extend the Viterbi algorithm to infer the best assignment exactly. The inference algorithm is shown in Algorithm 2.
Let $\phi(u_j, v_j, y_j, y_{j-1})$ be the product of potentials depending on $(s_j, y_j, y_{j-1})$,
$$\phi(u_j, v_j, y_j, y_{j-1}) = \psi^{ner}_{u_j,v_j,y_j,y_{j-1}} \Big( \sum_{y^e_j \in Y^{e*}_j} \psi^{el.cr}_{u_j,v_j,y_j,y^e_j} \Big) \qquad (6)$$
and let $V(i, y)$ denote the largest value of $w \cdot G(x, a')$, where $a'$ can be any possible partial assignment from $x_1$ to $x_i$. The best $(s^*, y^*)$ are derived during the recursive calculation of $V(i, y)$, in which $L$ is the maximum segmentation length for Semi-CRF. Once $(s^*, y^*)$ are found, the corresponding $y^{e*}_j = \arg\max_{y^e \in Y^{e*}_j} \psi^{el.cr}_{u_j,v_j,y_j,y^e}$, evaluated at $(s^*_j, y^*_j)$, is also the optimal one. Then $a^* = (s^*, y^*, y^{e*})$ is the best assignment for the given $x$ and $w$.

Algorithm 2: JERL inference
    input: one word sequence $x$ and weights $w$
    output: the best assignment $a$ over $x$
    // shrink JERL graph to a Semi-CRF graph
    for u ← 1 to |x| do
        for v ← u + 1 to |x| do
            for $(y, y') \in Y \times Y$ do
                calculate $\phi_{u,v,y,y'}$  // see eq. 6
            end
        end
    end
    // infer the best assignment of $(s^*, y^*)$
    for i ← 1 to |x| do
        for $y \in Y$ do
            calculate $V_{i,y}$  // see eq. 7
        end
    end
    $(s^*, y^*) \leftarrow \arg\max(V_{i,y})$;
    // infer the best assignment of $\{y^e_j\}$
    for j ← 1 to $|s^*|$ do
        $y^e_j \leftarrow \arg\max(P(y^e_j \mid x, w, s^*_j, y^*_j))$
    end
    $a^* \leftarrow (s^*, y^*, y^{e*})$;
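The two-stage inference described in this section (collapse the linking choices into $\phi$, run a segment-level Viterbi pass, then pick the best entity for each chosen segment) can be sketched as follows. This is a standard semi-Markov Viterbi recursion in the log domain, offered as an illustration rather than the paper's exact recursion: `log_phi`, `viterbi_segments` and `best_entity` are hypothetical names, the base case $V(0, y) = 0$ is an assumption that mirrors the forward base case, and all scores are assumed to be finite.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def viterbi_segments(n: int, labels: List[str], max_len: int,
                     log_phi: Callable[[int, int, str, str], float]
                     ) -> Tuple[List[Tuple[int, int, str]], float]:
    """Return the best labeled segmentation [(u, v, y)] and its score."""
    neg_inf = float("-inf")
    V: List[Dict[str, float]] = [
        {y: neg_inf for y in labels} for _ in range(n + 1)
    ]
    back: List[Dict[str, Optional[Tuple[int, str]]]] = [
        {y: None for y in labels} for _ in range(n + 1)
    ]
    for y in labels:
        V[0][y] = 0.0  # assumed base case
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(max_len, i) + 1):
                u = i - d + 1
                for yp in labels:
                    cand = V[u - 1][yp] + log_phi(u, i, y, yp)
                    if cand > V[i][y]:
                        V[i][y] = cand
                        back[i][y] = (u - 1, yp)
    # Backtrack from the best final label.
    y = max(labels, key=lambda lab: V[n][lab])
    best = V[n][y]
    i, segments = n, []
    while i > 0:
        prev_i, yp = back[i][y]
        segments.append((prev_i + 1, i, y))
        i, y = prev_i, yp
    segments.reverse()
    return segments, best


def best_entity(link_scores: Sequence[float]) -> int:
    """Index k of the highest-scoring candidate; k = 0 stands for NIL."""
    return max(range(len(link_scores)), key=lambda k: link_scores[k])
```

After `viterbi_segments` returns, calling `best_entity` on the candidate scores of each selected segment completes the joint assignment.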

Experiments
In our experiments, we first construct two baseline models, $\mathrm{JERL}_{ner}$ and $\mathrm{JERL}_{el}$, which use exactly the NER and EL feature sets used in JERL. We then evaluate JERL and the two baseline models against several state-of-art NER and linking systems. After that, we evaluate JERL under different feature

Data set
We take the CoNLL'03/AIDA English data set to evaluate the performance of NER and linking systems. CoNLL'03 is extensively used in prior work on NER evaluation (Tjong Kim Sang and De Meulder, 2003). The English data is taken from Reuters news articles published between August 1996 and August 1997. Four types of entities are annotated: persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC). Hoffart et al. (2011) hand-annotated all proper nouns with the corresponding entities, using YAGO2, Freebase and Wikipedia IDs. This data is referred to as AIDA here. To the best of our knowledge, it is the biggest data set that has been labeled for both NER and linking tasks, which makes it a good starting point for our work. Table 1 contains an overview of the CoNLL'03/AIDA data set. For entity linking, we take Wikipedia as the referent knowledge base. We use a Wikipedia snapshot dumped in May 2013, which contains around 4.8 million articles. We also align our Wikipedia dump with additional knowledge bases, Freebase and Satori (a Microsoft internal knowledge base), to enrich the information about these entities.

Evaluation Metrics
We follow the CoNLL'03 metrics to evaluate NER performance by precision, recall, and F1 scores, and follow the experimental setting of Hoffart et al. (2011) to evaluate linking performance by micro precision@1. Since the linking labels of CoNLL'03 were annotated in 2011, they are not completely consistent with the Wikipedia dump we used. We only consider mention-entity pairs where the ground truth is known, and ignore around 20% of NIL mentions in the ground truth.
Table 2 shows the features used in our models. JERL uses all features in the three categories, while $\mathrm{JERL}_{ner}$ and $\mathrm{JERL}_{el}$ use only the one corresponding category. All three models are trained on the training and development sets, and evaluated on the test set of CoNLL'03/AIDA.
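The metrics named in this section can be sketched as below. This is an illustrative reading, not the official CoNLL'03 or AIDA evaluation scripts: NER scoring is assumed to require an exact match of span and type, and micro precision@1 is assumed to be the fraction of non-NIL gold mentions, pooled over all documents, whose top-ranked candidate equals the gold entity. The function names and the mention identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Mention = Tuple[int, int, str]  # (start, end, type), exact-match unit


def precision_recall_f1(gold: Set[Mention],
                        pred: Set[Mention]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact span/type match."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_precision_at_1(gold: Dict[str, str],
                         ranked: Dict[str, List[str]]) -> float:
    """Share of non-NIL gold mentions whose top candidate is correct.

    gold maps a mention id to its gold entity (or "NIL"); ranked maps a
    mention id to the system's candidates, best first. Mentions from all
    documents are pooled, which is what makes the average "micro".
    """
    scored = [m for m, g in gold.items() if g != "NIL"]
    if not scored:
        return 0.0
    correct = sum(1 for m in scored
                  if ranked.get(m) and ranked[m][0] == gold[m])
    return correct / len(scored)
```

Skipping NIL gold mentions in `micro_precision_at_1` corresponds to considering only mention-entity pairs whose ground truth is known.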

NER
Features in the NER category are relevant to NER. We consider the most commonly used features in the literature (Finkel et al., 2005; Liang, 2005; Ratinov and Roth, 2009). We collect several known name lists, such as popular English first and last names for people and organization lists, from Wikipedia and Freebase. UIUC NER's lists are also included. In addition, we extract entity name lists from the knowledge base we use for entity linking, and construct 655 more lists. Although those lists are noisy, we find that they improve the performance of our NER baseline by a statistically significant amount.

Linking
Features in the linking category are relevant to entity linking. An entity can be referred to by its canonical name, nicknames, aliases, and first/last names. Those names are defined as alternative names for this entity. We collect all alternative names for all known entities and build a name-to-entity index. This index is used to select entity candidates for any word sequence, also known as a surface form. Following previous work, we calculate entity priors and entity name priors from Wikipedia. Context scores are calculated based on discriminative keywords. Geo distance and related entities capture the relatedness among entities in the given context.
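The name-to-entity index and candidate selection described above can be sketched as follows. This is a minimal illustration under assumptions: lowercasing as the only name normalization, ranking candidates by a precomputed entity prior, and the function names `build_name_index` and `candidates` are all hypothetical choices, not details given in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_name_index(alt_names: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each alternative name (lowercased) to the entities it can denote."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for entity, names in alt_names.items():
        for name in names:
            index[name.lower()].add(entity)
    return index


def candidates(index: Dict[str, Set[str]], surface: str,
               prior: Dict[str, float], max_cands: int = 5) -> List[str]:
    """Return up to max_cands entities for a surface form, highest prior first.

    Ties are broken by entity name so the output is deterministic.
    """
    entities = index.get(surface.lower(), set())
    ranked = sorted(entities, key=lambda e: (-prior.get(e, 0.0), e))
    return ranked[:max_cands]
```

The default of five candidates per surface form matches the maximum candidate count per segmentation used in the experiments.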

Mutual
Features in this category capture the mutual dependency between NER and linking's outputs. For each entity in a knowledge base, there is category information available. We aggregate around 1000 distinct categories from multiple sources. One entity can have multiple categories. For example, London is connected to 29 categories. We use all combinations between NER types and categories as features in JERL, and let the model learn the correlation of each combination. This encourages coherent NER and EL decisions, which is one of the key contributions of our work.
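The mutual features described above, one indicator per combination of NER type and knowledge-base category, can be sketched as follows. This is an illustrative encoding only: the feature-name format `TYPE|category` and the function names are hypothetical, and any weights passed in are placeholders rather than values learned in the paper.

```python
from typing import Dict, Iterable


def mutual_features(ner_type: str, categories: Iterable[str]) -> Dict[str, float]:
    """One indicator feature per (NER type, KB category) combination."""
    return {f"{ner_type}|{c}": 1.0 for c in sorted(set(categories))}


def mutual_score(weights: Dict[str, float], ner_type: str,
                 categories: Iterable[str]) -> float:
    """w^cr . g^cr for one (NER type, entity) pair; unseen features score 0."""
    feats = mutual_features(ner_type, categories)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())
```

With around 1000 categories and a handful of NER types, this yields a few thousand weights, and a positive weight for a combination such as `ORG|sports.team` would reward NER and linking decisions that agree with it.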

Non-local features
Features capturing long distance dependency between hidden labels are classified as non-local features. Those features are very helpful in improving NER system performance but are costly. Since this is not the focus of this paper, we take a simple approach to incorporate non-local features. We cache the results of previous sentences in a 1000-word window, and adopt several heuristic rules for personal names. This approach contributes 0.2 points to the final NER F1 score. Non-local features are also considered in linking (Ratinov et al., 2011). We try several features which have proved helpful on the TAC data set. However, the gain on the CoNLL'03/AIDA data set is not obvious, so we do not optimize linking globally.
Lastly, based on preliminary studies and experiments, we set the maximum segmentation length to 6 and max candidate count per segmentation to 5 for efficient training and inference.

State-of-Art systems
We take three state-of-art NER systems: NereL (Sil and Yates, 2013), UIUC NER (Ratinov and Roth, 2009) and Stanford NER (Finkel et al., 2005). NereL first overgenerates mentions and decomposes them into sets of connected components [...] (Kulkarni et al., 2009) and Hof11 (Hoffart et al., 2011) are compared with our models. NereL achieves the best precision@1. Kul09 formulates the local compatibility and global coherence in entity linking, and optimizes the overall entity assignment for all entities in a document via a local hill-climbing approach. Hof11 unifies the prior probability of an entity being mentioned, the similarity between context and entity, and the coherence between entity candidates among all mentions in a dense graph.
Table 3 shows the performance of different NER systems on the CoNLL'03 testb data set. For the state-of-art systems, we use the numbers reported by Sil and Yates (2013). Stanford NER achieves the best precision, but its recall is low. UIUC reports the (almost) best recorded F1. $\mathrm{JERL}_{ner}$ considers only the features in the NER category, so it can be treated as a pure NER system implemented with Semi-CRF. A CRF-based implementation with a similar feature set has comparable performance. Our baseline $\mathrm{JERL}_{ner}$ is already strong, which we attribute mainly to the additional dictionaries derived from the knowledge base. JERL further pushes the F1 to 91.2, which outperforms UIUC by 0.4 points. To the best of our knowledge, it is the best F1 on CoNLL'03 since 2009. The reason our model can outperform state-of-art systems is that it has more knowledge about entities through the incorporation of entity linking techniques. If an entity can be linked to a well known entity via entity linking in high [...]
Table 4 shows the performance of different entity linking systems on the AIDA test set. Kul09 and Hof11 use only the correct mentions detected by Stanford NER as input, and thus their recall is bounded by the recall of NER. NereL uses its overgeneration techniques to generate mention candidates, and outperforms Hof11 in both precision and recall. Our baseline model $\mathrm{JERL}_{el}$ is also evaluated on mentions generated by Stanford NER, and has comparable performance with Kul09 and Hof11. JERL achieves a precision@1 of 84.58, which is better than NereL.

Results
We run 15 trials for both the NER and linking experiments and report the average numbers above. The standard deviations are 0.11% and 0.08% for NER and linking respectively, and the improvements pass the standard t-test at the 5% significance level, demonstrating the significance of our results.
To investigate how different features contribute to the overall gain, we compare $\mathrm{JERL}_{ner}$ with four different feature sets. Table 5 summarizes the results. In the trial "+candidate", JERL expands every possible segmentation with the corresponding entity list and builds its factor graph without any linking or mutual features. This version's F1 drops to 88.7, which indicates that the created structure is quite noisy. In the "+candidate +linking" trial, only linking features are enabled and the F1 is comparable to the baseline. On the other hand, in the "+candidate +mutual" trial, where mutual features are enabled, the F1 increases to 90.6. If we combine both linking and mutual features, [...]
Table 6 shows the learned mutual-dependency weights for three categories, "people.person", "location.city", and "sports.team". The bigger a weight is, the more consistent the combination is. The weights reveal several interesting patterns. If an entity belongs to any of the three categories, it is less likely to be predicted as not-an-entity by NER. If an entity belongs to the category "people.person", it is more likely to be predicted as PER. When an entity belongs to the category "location.city" or "sports.team", NER may predict it as ORG or LOC. This is because in the CoNLL'03/AIDA data set, many sports teams are mentioned by their city or country names. JERL successfully models such unexpected mutual dependency.
Table 7 compares the performance and training time under different settings of max segmentation length (MSL) and max referent count (MRC). We use machines with an Intel Xeon E5620 @ 2.4GHz CPU (8 cores / 16 logical processors) and 48GB of memory. We run every setting 10 times and report the averages. As MSL and MRC increase, the performance improves slightly, but the training time increases substantially. MSL has a linear impact on training time, while MRC affects training time more.

Conclusion and Future Work
In this paper, we address the problem of joint optimization of named entity recognition and linking. We propose a novel model, JERL, to jointly train and infer for NER and linking tasks. To the best of our knowledge, this is the first model which trains two tasks at the same time. The joint model is able to leverage mutual dependency of the two tasks, and predict coherent outputs. JERL outperforms the state-of-art systems on both NER and linking tasks on the CoNLL'03/AIDA data set.
For future work, we would like to study how to leverage existing partially labeled data, annotated for either NER or linking only, in joint optimization, and to incorporate more NLP tasks for multi-task joint optimization.