Entity Identification as Multitasking

Standard approaches in entity identification hard-code boundary detection and type prediction into labels and perform Viterbi. This has two disadvantages: 1. the runtime complexity grows quadratically in the number of types, and 2. there is no natural segment-level representation. In this paper, we propose a neural architecture that addresses these disadvantages. We frame the problem as multitasking, separating boundary detection and type prediction but optimizing them jointly. Despite its simplicity, this architecture performs competitively with fully structured models such as BiLSTM-CRFs while scaling linearly in the number of types. Furthermore, by construction, the model induces type-disambiguating embeddings of predicted mentions.


Introduction
A popular convention in segmentation tasks such as named-entity recognition (NER) and chunking is the so-called "BIO"-label scheme. It hard-codes boundary detection and type prediction into labels using the indicators "B" (Beginning), "I" (Inside), and "O" (Outside). For instance, the sentence Where is John Smith is tagged as Where/O is/O John/B-PER Smith/I-PER. In this way, we can treat the problem as sequence labeling and apply standard structured models such as CRFs.
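As a minimal sketch of the BIO encoding described above, the following converts typed mention spans into per-word labels. The function name `spans_to_bio` and the inclusive-index span format are illustrative assumptions, not part of the original work.

```python
def spans_to_bio(n_tokens, spans):
    """Encode typed mention spans as BIO labels.

    spans: list of (start, end, type) with inclusive token indices
    (a hypothetical input format chosen for this sketch).
    """
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"          # first word of the mention
        for i in range(start + 1, end + 1):
            labels[i] = f"I-{etype}"          # remaining words of the mention
    return labels


tokens = ["Where", "is", "John", "Smith"]
labels = spans_to_bio(len(tokens), [(2, 3, "PER")])
tagged = " ".join(f"{t}/{l}" for t, l in zip(tokens, labels))
print(tagged)
```

Note that boundary and type are fused into a single label per word, which is the property the rest of the paper sets out to avoid.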
But this approach has certain disadvantages. First, the runtime complexity grows quadratically in the number of types (assuming exact decoding with first-order label dependency). We emphasize that the asymptotic runtime remains quadratic even if we heuristically prune previous labels based on the BIO scheme. This is not an issue when the number of types is small but quickly becomes problematic as the number grows. Second, there is no segment-level prediction: every prediction happens at the word-level. As a consequence, models do not induce representations corresponding to multi-word mentions, which can be useful for downstream tasks such as named-entity disambiguation (NED).
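To make the quadratic growth concrete, here is a small worked count under the usual assumption that a typed BIO scheme has one B and one I label per type plus a single O label. The helper names are hypothetical; the type counts 4 (CoNLL 2003) and 18 (OntoNotes) are the two settings considered in this paper.

```python
def num_bio_labels(num_types):
    """One B- and one I- label per type, plus a single O label."""
    return 2 * num_types + 1


def transitions_per_token(num_types):
    """Label pairs scored per word by exact first-order Viterbi decoding."""
    k = num_bio_labels(num_types)
    return k * k


for num_types in (1, 4, 18):
    print(num_types, num_bio_labels(num_types), transitions_per_token(num_types))
```

With a single (untyped) mention class the label set reduces to {B, I, O}, so the per-word cost stays at nine label pairs regardless of how many entity types exist.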
In this paper, we propose a neural architecture that addresses these disadvantages. Given a sentence, the model uses bidirectional LSTMs (BiLSTMs) to induce features and separately predicts:
1. Boundaries of mentions in the sentence.
2. Entity types of the boundaries.
Crucially, during training, the errors of these two predictions are minimized jointly.
One might suspect that the separation could degrade performance; neither prediction accounts for the correlation between entity types. But we find that this is not the case due to joint optimization. In fact, our model performs competitively with fully structured models such as BiLSTM-CRFs (Lample et al., 2016), implying that the model is able to capture the entity correlation indirectly by multitasking. On the other hand, the model scales linearly in the number of types and induces segment-level embeddings of predicted mentions that are type-disambiguating by construction.

Related Work
Our work is directly inspired by Lample et al. (2016), who demonstrate that a simple neural architecture based on BiLSTMs achieves state-of-the-art performance on NER with no external features. They propose two models. The first makes structured prediction of NER labels with a CRF loss (LSTM-CRF) using the conventional BIO-label scheme. The second, which performs slightly worse, uses a shift-reduce framework mirroring transition-based dependency parsing (Yamada and Matsumoto, 2003). While the latter also scales linearly in the number of types and produces embeddings of predicted mentions, our approach is quite different. We frame the problem as multitasking and do not need the stack/buffer data structure. Semi-Markov models (Kong et al., 2015; Sarawagi et al., 2004) explicitly incorporate the segment structure but are computationally intensive (quadratic in the sentence length).
Multitasking has been shown to be effective in numerous previous works (Collobert et al., 2011; Yang et al., 2016; Kiperwasser and Goldberg, 2016). This is especially true with neural networks, which greatly simplify joint optimization across multiple objectives. Most of these works consider multitasking across different problems. In contrast, we decompose a single problem (NER) into two natural subtasks and perform them jointly. Particularly relevant in this regard is the parsing model of Kiperwasser and Goldberg (2016), which multitasks edge prediction and classification.
LSTMs (Hochreiter and Schmidhuber, 1997) and other variants of recurrent neural networks such as GRUs (Chung et al., 2014) have recently been wildly successful in various NLP tasks (Lample et al., 2016; Kiperwasser and Goldberg, 2016; Chung et al., 2014). Since there are many detailed descriptions of LSTMs available, we omit a precise definition. For our purposes, it is sufficient to treat an LSTM as a mapping $\phi : \mathbb{R}^{d} \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ that takes an input vector $x$ and a state vector $h$ to output a new state vector $h' = \phi(x, h)$.

Model
Let $C$ denote the set of character types, $W$ the set of word types, and $E$ the set of entity types. Let $\oplus$ denote the vector concatenation operation. Our model first constructs a network over a sentence closely following Lample et al. (2016); we describe it here for completeness. The model parameters $\Theta$ associated with this base network consist of character and word embeddings together with character-level and word-level BiLSTMs. Given a sentence of $n$ words, the network induces a character- and context-sensitive word representation $h_i \in \mathbb{R}^{200}$ (1) for each $i = 1 \ldots n$. These vectors are used to define the boundary detection loss and the type classification loss described below.
1 For simplicity, we assume some random initial state vectors such as $f^C_0$ and $b^C_{|w_i|+1}$ when we describe LSTMs.

Boundary detection loss. We frame boundary detection as predicting BIO tags without types. A natural approach is to optimize the conditional probability (2) of the correct tags $y_1 \ldots y_n \in \{\mathrm{B}, \mathrm{I}, \mathrm{O}\}$ given the word representations (1). This probability is defined through a matrix $T \in \mathbb{R}^{3 \times 3}$ and a function $g : \mathbb{R}^{200} \to \mathbb{R}^{3}$ that adjusts the length of the LSTM output to the number of targets; we use a feedforward network for $g$. We write $\Theta_1$ to refer to $T$ and the parameters in $g$. The boundary detection loss $L_1(\Theta, \Theta_1)$ is given by the negative log likelihood of the correct tags, summed over the tagged sentences $l$ in the data. The global normalizer for (2) can be computed using dynamic programming; see Collobert et al. (2011). Note that the runtime complexity of boundary detection is constant in the number of entity types despite dynamic programming, since the number of tags is fixed (three).
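The dynamic program for the global normalizer can be sketched as a forward algorithm over the three boundary tags. This sketch assumes an additive sequence score (per-word tag scores plus tag-pair scores from a 3 × 3 matrix), which may differ from the exact form of (2); the function names and the random inputs are illustrative only.

```python
import itertools

import numpy as np


def log_partition(emissions, T):
    """Log of the global normalizer via the forward algorithm.

    emissions: (n, k) per-word tag scores; T: (k, k), T[a, b] scores tag a -> b.
    Runs in O(n * k^2); with k = 3 tags this does not depend on the type count.
    """
    alpha = emissions[0].copy()
    for i in range(1, len(emissions)):
        scores = alpha[:, None] + T                 # scores[a, b]
        m = scores.max(axis=0)                      # stabilise the log-sum-exp
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + emissions[i]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())


def sequence_score(emissions, T, tags):
    """Unnormalised score of one tag sequence under the assumed additive form."""
    s = emissions[0, tags[0]]
    for i in range(1, len(tags)):
        s += T[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return s


def neg_log_likelihood(emissions, T, tags):
    return log_partition(emissions, T) - sequence_score(emissions, T, tags)


rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))   # hypothetical scores for a 4-word sentence
T = rng.normal(size=(3, 3))
# Brute-force check: enumerate all 3^4 tag sequences.
brute = np.log(sum(np.exp(sequence_score(emissions, T, y))
                   for y in itertools.product(range(3), repeat=4)))
print(log_partition(emissions, T), brute)
```

The brute-force enumeration is only there to check the recursion on a tiny input; it is exponential in the sentence length and would not be used in practice.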
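Type classification loss. Given a mention boundary $1 \le s \le t \le n$, we predict its type using (1) as follows. We introduce an additional pair of LSTMs $\phi^E_f, \phi^E_b$ over the boundary and compute from their outputs a vector $\mu \in \mathbb{R}^{|E|}$ (3) using $q$, where $q : \mathbb{R}^{400} \to \mathbb{R}^{|E|}$ is again a feedforward network that adjusts the vector length to $|E|$ (see footnote 2). We write $\Theta_2$ to refer to the parameters in $\phi^E_f$, $\phi^E_b$, and $q$. Now we can optimize the conditional probability of the correct type $\tau$, which is proportional to $\exp(\mu_\tau)$ (4). The type classification loss $L_2(\Theta, \Theta_2)$ is given by the negative log likelihood of the correct types, summed over the typed mentions $l$ in the data.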
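A minimal sketch of the type classification step, assuming the mention has already been mapped to a vector of per-type scores: the probability of each type is a softmax over the scores, and the loss is the negative log probability of the gold type. The type inventory and score values below are made up for illustration.

```python
import numpy as np


def type_log_probs(mu):
    """Log-softmax over per-type scores mu (one entry per entity type)."""
    shifted = mu - mu.max()                     # numerical stability
    return shifted - np.log(np.exp(shifted).sum())


def type_loss(mu, gold_index):
    """Negative log likelihood of the gold type."""
    return -type_log_probs(mu)[gold_index]


entity_types = ["PER", "LOC", "ORG"]            # illustrative type inventory
mu = np.array([2.0, 0.5, -1.0])                 # hypothetical scores for one mention
predicted = entity_types[int(np.argmax(mu))]
print(predicted, type_loss(mu, 0))
```

Scoring a mention costs one pass over the type inventory, which is where the linear scaling in the number of types comes from.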

Joint loss
The final training objective is to minimize the sum of the boundary detection loss and the type classification loss, $L_1(\Theta, \Theta_1) + L_2(\Theta, \Theta_2)$. In stochastic gradient descent (SGD), this amounts to computing the tagging loss $l_1$ and the classification loss $l_2$ (summed over all mentions) at each annotated sentence, and then taking a gradient step on $l_1 + l_2$. Observe that the base network $\Theta$ is optimized to handle both tasks. During training, we use gold boundaries and types to optimize $L_2(\Theta, \Theta_2)$. At test time, we predict boundaries from the tagging layer (2) and classify them using the classification layer (4).
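The test-time procedure (predict untyped boundaries, then classify each predicted span) can be sketched as below. The callables `predict_tags` and `classify_span` are hypothetical stand-ins for the tagging and classification layers, and the handling of an I tag that does not follow a B is a heuristic assumed here, not something the paper specifies.

```python
def bio_to_spans(tags):
    """Extract inclusive (start, end) spans from untyped BIO tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i - 1))    # close the previous mention
            start = i
        elif tag == "I":
            if start is None:
                start = i                       # stray I: treated as a beginning (assumption)
        else:                                   # "O"
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


def identify_entities(tokens, predict_tags, classify_span):
    """Two-step inference: boundaries first, then one type per boundary."""
    tags = predict_tags(tokens)
    return [(s, t, classify_span(tokens, s, t)) for s, t in bio_to_spans(tags)]


tokens = ["Where", "is", "John", "Smith"]
result = identify_entities(
    tokens,
    predict_tags=lambda toks: ["O", "O", "B", "I"],   # stand-in for the tagging layer
    classify_span=lambda toks, s, t: "PER",           # stand-in for the classification layer
)
print(result)
```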
2 Clearly, one can consider different networks over the boundary, for instance simple bag-of-words or convolutional neural networks. We leave the exploration as future work.

Data. We use two NER datasets: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and the newswire portion of OntoNotes Release 5.0, which has 18 entity types (Weischedel et al., 2013).
Implementation and baseline. We denote our model Mention2Vec and implement it using the DyNet library. We use the same pre-trained word embeddings as in Lample et al. (2016). We use the Adam optimizer (Kingma and Ba, 2014) and apply dropout at all LSTM layers (Hinton et al., 2012). We perform minimal tuning over development data. Specifically, we perform a 5 × 5 grid search over learning rates 0.0001 . . . 0.0005 and dropout rates 0.1 . . . 0.5 and choose the configuration that gives the best performance on the dev set.
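The tuning procedure can be sketched as a plain grid search. The sketch assumes the five values on each axis are evenly spaced, which the text suggests but does not state; `train_and_eval` is a hypothetical callable returning a dev-set score, and `toy_dev_score` is a made-up stand-in so the example runs without training a model.

```python
import itertools

# Assumed evenly spaced grids (5 values each, giving 25 configurations).
learning_rates = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]
dropout_rates = [0.1, 0.2, 0.3, 0.4, 0.5]


def grid_search(train_and_eval):
    """Return (best_score, learning_rate, dropout) over the 5 x 5 grid."""
    best = None
    for lr, dropout in itertools.product(learning_rates, dropout_rates):
        score = train_and_eval(lr, dropout)     # e.g. dev-set F1
        if best is None or score > best[0]:
            best = (score, lr, dropout)
    return best


def toy_dev_score(lr, dropout):
    """Made-up score surface peaking at lr=0.0003, dropout=0.2."""
    return -abs(lr - 0.0003) * 1000 - abs(dropout - 0.2)


best = grid_search(toy_dev_score)
print(best)
```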
We also re-implement the BiLSTM-CRF model of Lample et al. (2016); this is equivalent to optimizing just $L_1(\Theta, \Theta_1)$ but using typed BIO tags. Lample et al. (2016) use different details in optimization (SGD with gradient clipping), data preprocessing (replacing every digit with a zero), and the dropout scheme (dropout at BiLSTM input (1)). As a result, our re-implementation is not directly comparable and obtains different (slightly lower) results. But we emphasize that the main goal of this paper is to demonstrate the utility of the proposed approach rather than obtaining a new state-of-the-art result on NER.

Our results show that despite the separation between boundary detection and type classification, we can achieve good performance through joint optimization. On OntoNotes, in which the number of types is much larger, our model still performs well with an F1 score of 89.37 but is behind BiLSTM-CRF, which achieves 90.77. We suspect that this is due to strong correlation between mention types that fully structured models can exploit more effectively. However, our model is also an order of magnitude faster: 4949 compared to 495 words/second. Finally, Table 2 . . .

Mention Embeddings
Table 3 shows nearest neighbors of detected mentions using the mention representations $\mu$ in (3). Since $\mu_\tau$ represents the score of type $\tau$, the mention embeddings are clustered by entity types by construction. The model induces completely different representations even when the mention has the same lexical form. For instance, based on its context Lincoln receives a person, location, or organization representation; Treasure Island receives a book or location representation. The model also learns representations for long multi-word expressions such as the General Agreement on Tariffs and Trade.

Table 3: Nearest neighbors of detected mentions using the mention representations µ in (3), grouped by type.

PER
  In another letter dated January 1865, a well-to-do Washington matron wrote to Lincoln to plead for . . .
  Chang and Washington were the only men's seeds in action on a day that saw two seeded women's . . .
  "Just one of those things, I was just trying to make contact," said Bragg.
  Washington's win was not comfortable, either.

LOC
  Lauck, from Lincoln, Nebraska, yelled a tirade of abuse at the court after his conviction for inciting . . .
  . . . warring factions, with the PUK aming to break through to KDP's headquarters in Saladhuddin.
  . . . is not expected to travel to the West Bank before Monday," Nabil Abu Rdainah told Reuters.
  . . . off a bus near his family home in the village of Donje Ljupce in the municipality of Podujevo.

ORG
  English division three - Swansea v Lincoln.
  SOCCER - OUT-OF-SORTS NEWCASTLE CRASH 2 1 AT HOME.
  Moura, who appeared to have elbowed Cyprien in the final minutes of the 3 0 win by Neuchatel, was . . .
  In Sofia: Leviski Sofia (Bulgaria) 1 Olimpija (Slovenia) 0

WORK OF ART
  . . . Bond novels, and "Treasure Island," produced by Charlton Heston who also stars in the movie.
  . . . probably started in 1962 with the publication of Rachel Carson's book "Silent Spring."
  . . . Victoria Petrovich) spout philosophic bon mots with the self-concious rat-a-tat pacing of "Laugh In."
  Dennis Farney's Oct. 13 page-one article "River of Despair," about the poverty along the . . .
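A nearest-neighbor lookup over mention representations, of the kind used to build Table 3, might be sketched as follows. The text does not say which similarity measure was used, so cosine similarity is an assumption here, and the three-dimensional vectors are invented purely for illustration.

```python
import numpy as np


def nearest_neighbors(query, embeddings, k=2):
    """Indices of the k rows of `embeddings` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ q                                # cosine similarity to every mention
    return np.argsort(-sims)[:k].tolist()


# Hypothetical mention vectors; each axis loosely plays the role of one type score.
mentions = ["Lincoln (person context)", "Lincoln (location context)", "Washington (person context)"]
emb = np.array([
    [3.0, 0.1, 0.2],
    [0.2, 2.5, 0.1],
    [2.8, 0.3, 0.1],
])
neighbors = nearest_neighbors(emb[0], emb, k=2)
print([mentions[i] for i in neighbors])
```

In this toy setup the two person-context mentions are closest to each other even though one shares its surface form with the location-context mention, mirroring the Lincoln example above.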

Conclusion
We have presented a neural architecture for entity identification that multitasks boundary detection and type classification. Joint optimization enables the base BiLSTM network to capture the correlation between entities indirectly via multitasking. As a result, the model is competitive with fully structured models such as BiLSTM-CRFs on CoNLL 2003 while being more scalable and also inducing context-sensitive mention embeddings clustered by entity types. There are many interesting future directions, such as applying this framework to NED in which type classification is much more fine-grained and finding a better method for optimizing the multitasking objective (e.g., instead of using gold boundaries for training, dynamically use predicted boundaries in a reinforcement learning framework).