A Joint Named-Entity Recognizer for Heterogeneous Tag-sets Using a Tag Hierarchy

We study a variant of domain adaptation for named-entity recognition where multiple, heterogeneously tagged training sets are available. Furthermore, the test tag-set is not identical to any individual training tag-set. Yet, the relations between all tags are provided in a tag hierarchy, covering the test tags as a combination of training tags. This setting occurs when various datasets are created using different annotation schemes. This is also the case of extending a tag-set with a new tag by annotating only the new tag in a new dataset. We propose to use the given tag hierarchy to jointly learn a neural network that shares its tagging layer among all tag-sets. We compare this model to combining independent models and to a model based on the multitasking approach. Our experiments show the benefit of the tag-hierarchy model, especially when facing non-trivial consolidation of tag-sets.


Introduction
Named Entity Recognition (NER) has seen significant progress in the last couple of years with the application of neural networks to the task. Such models achieve state-of-the-art performance with little or no manual feature engineering (Collobert et al., 2011; Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Dernoncourt et al., 2017). Following this success, more complex NER setups are being approached with neural models, among them domain adaptation (Qu et al., 2016; He and Sun, 2017; Dong et al., 2017).
In this work we study one type of domain adaptation for NER, denoted here heterogeneous tag-sets. In this variant, samples from the test set are not available at training time. Furthermore, the test tag-set differs from each training tag-set. However, every test tag can be represented either as a single training tag or as a combination of several training tags. This information is given in the form of a hypernym hierarchy over all tags, training and test (see Fig. 1).
This setting arises when different schemes are used for annotating multiple datasets for the same task. This often occurs in the medical domain, where healthcare providers use customized tag-sets to create their own private test sets (Shickel et al., 2017; Lee et al., 2018). Another scenario is selective annotation, as in the case of extending an existing tag-set, e.g. {'Name', 'Location'}, with another tag, e.g. 'Date'. To save annotation effort, new training data is labeled only with the new tag. This case of disjoint tag-sets is also discussed in the work of Greenberg et al. (2018). A similar case is extending a training set with new examples in which only rare tags are annotated. In domains where training data is scarce, out-of-domain datasets annotated with infrequent tags may be very valuable.
A naive approach concatenates all training sets, ignoring the differences between the tagging schemes in each example. A different approach would be to learn to tag with multiple training tag-sets. Then, in a post-processing step, the predictions from the different tag-sets need to be consolidated into a single test tag sequence, resolving tagging differences along the way. We study two such models. The first learns an independent NER model for each training tag-set. The second applies the multitasking (MTL) paradigm (Collobert et al., 2011; Ruder, 2017), in which a shared latent representation of the input text is fed into separate tagging layers.
The above models require heuristic post-processing to consolidate the different predicted tag sequences. To overcome this limitation, we propose a model that incorporates the given tag hierarchy within the neural NER model. Specifically, this model learns to predict a tag sequence only over the fine-grained tags in the hierarchy. We conducted two experiments. The first evaluated the extension of a tag-set with a new tag via selective annotation of a new dataset with only the extending tag, using datasets from the medical and news domains. In the second experiment we integrated two full tag-sets from the medical domain, with their training data, while evaluating on a third test tag-set. The results show that the model which incorporates the tag hierarchy is more robust than a combination of independent models or MTL, and typically outperforms them. This is especially evident when many tagging collisions need to be settled at post-processing; in these cases, the performance gap in favor of the tag-hierarchy model is large.

Task Definition
The goal in the heterogeneous tag-sets domain adaptation task is to learn an NER model M that, given an input token sequence x = {x_i}_{i=1}^n, infers a tag sequence y = {y_i}_{i=1}^n = M(x) over a test tag-set T^s, with y_i ∈ T^s for all i. To learn the model, K training datasets {DS^r_k}_{k=1}^K are provided, each labeled with its own tag-set T^r_k. Superscripts 's' and 'r' stand for 'test' and 'training', respectively. In this task, no training tag-set is identical to the test tag-set T^s by itself. However, all tags in T^s can be covered by combining the training tag-sets {T^r_k}_{k=1}^K. This information is provided in the form of a directed acyclic graph (DAG) representing hypernymy relations between all training and test tags. Fig. 1 illustrates such a hierarchy.
As mentioned above, an example scenario is selective annotation, in which an original tag-set is extended with a new tag t, each with its own training data, and the test tag-set is their union. But some setups require combinations other than a simple union, e.g. covering the test tag 'Address' with the finer training tags 'Street' and 'City', each from a different tag-set.
This task is different from inductive domain adaptation (Pan and Yang, 2010; Ruder, 2017), in which not only the tag-sets but also the tasks differ (e.g. NER and parsing), with no need to map the outcomes to a single tag-set at test time.

Neural network for NER
As the underlying architecture shared by all models in this paper, we follow the neural network proposed by Lample et al. (2016), which achieved state-of-the-art results on NER. In this model, depicted in Fig. 2, each input token x_i is represented as a combination of: (a) a one-hot vector x^w_i, mapping the input to a fixed word vocabulary, and (b) a sequence of one-hot vectors {x^c_{i,j}}_{j=1}^{n_i}, representing the input word's character sequence.
Each input token x_i is first embedded in latent space by applying both a word-embedding matrix, we_i = E x^w_i, and a character-based embedding layer, ce_i = CharBiRNN({x^c_{i,j}}) (Ling et al., 2015). The output of this step is e_i = ce_i ⊕ we_i, where ⊕ stands for vector concatenation. Then, the embedding vector sequence {e_i}_{i=1}^n is re-encoded in context using a bidirectional RNN layer, {r_i}_{i=1}^n = BiRNN({e_i}_{i=1}^n) (Schuster and Paliwal, 1997). The sequence {r_i}_{i=1}^n constitutes the latent representation of the input text.
Finally, each re-encoded vector r_i is projected to tag space for the target tag-set T, t_i = P r_i, which is then taken as input to a CRF layer (Lafferty et al., 2001) that maintains a global tag transition matrix. At inference time, the model output is y = M(x), the most probable CRF tag sequence for input x.

Models for Multiple Tagging Layers
One way to learn a model for the heterogeneous tag-sets setting is to train a base NER model (Sec. 2.2) on the concatenation of all training sets, predicting tags from the union of all training tag-sets. In our experiments, this model underperformed because it treats each training example as fully tagged, even though it is tagged only with the tags belonging to the training set from which the example is taken (see Sec. 6).
We next present two models that instead learn to tag each training tag-set separately. In the first model, the outputs of independent base models, each trained on a different tag-set, are merged. The second model utilizes the multitasking approach to train separate tagging layers that share a single text representation layer.

Combining independent models
In this model, we train a separate NER model for each training set, resulting in K models {M_k}_{k=1}^K. At test time, each model predicts a sequence y_k = M_k(x) over the corresponding tag-set T^r_k. The sequences {y_k}_{k=1}^K are then consolidated into a single sequence y^s over the test tag-set T^s.
We perform this consolidation in a post-processing step. First, each predicted tag y_{k,i} is mapped to the test tag-set as y^s_{k,i}. We employ the provided tag hierarchy for this mapping, traversing it starting from y_{k,i} until a test tag is reached. Then, for every token x_i, we consider the test tags predicted at position i by the different models, M(x_i) = {y^s_{k,i} | y^s_{k,i} ≠ 'Other'}. Cases where M(x_i) contains more than one tag are called collisions. Models must consolidate collisions, selecting a single predicted tag for x_i.
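As an illustration, the hierarchy traversal used for this mapping can be sketched as follows; the graph fragment, tag names, and function name are our own hypothetical examples, not the paper's exact hierarchy:

```python
# Hypothetical fragment of a hypernym DAG: child tag -> parent tag (out-edge).
# For illustration we assume a unique out-edge per tag.
HYPERNYMS = {
    "FirstName": "Name",
    "LastName": "Name",
    "Street": "Address",
    "City": "Address",
    "Address": "Location",
}

def map_to_test_tag(tag, test_tags):
    """Follow out-edges from `tag` until a tag of the test tag-set is reached.

    If the walk leaves the hierarchy without meeting a test tag, the token
    is treated as untagged ('Other').
    """
    while tag is not None:
        if tag in test_tags:
            return tag
        tag = HYPERNYMS.get(tag)  # step up to the hypernym; None at a root
    return "Other"

print(map_to_test_tag("Street", {"Name", "Location"}))    # -> Location
print(map_to_test_tag("LastName", {"Name", "Location"}))  # -> Name
```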
We introduce three different consolidation methods. The first randomly selects a tag from M(x_i). The second chooses the tag originating from the tag sequence y_k with the highest CRF probability score. The third computes the marginal CRF tag probability for each tag and selects the one with the highest probability.
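A minimal sketch of these consolidation strategies, assuming each colliding prediction arrives paired with a score (either the sequence-level CRF probability or the marginal tag probability); the function name and data shapes are our own illustration:

```python
import random

def consolidate(candidates, method="random"):
    """Pick one test tag for a token from colliding model predictions.

    `candidates` is a list of (tag, score) pairs; `score` is either the
    CRF probability of the whole sequence the tag came from, or the
    marginal probability of the tag itself.
    """
    if not candidates:          # no model predicted an entity here
        return "Other"
    if method == "random":
        return random.choice(candidates)[0]
    # "seq_score" and "marginal" both reduce to an argmax over the score;
    # they differ only in which score was attached to each candidate.
    return max(candidates, key=lambda ts: ts[1])[0]

print(consolidate([("Name", 0.9), ("Date", 0.4)], method="marginal"))  # -> Name
```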

Multitasking for heterogeneous tag-sets
Lately, several works have explored using multitasking (MTL) for inductive transfer learning within a neural architecture (Collobert and Weston, 2008; Chen et al., 2016; Peng and Dredze, 2017). Such algorithms jointly train a single model to solve different NLP tasks, such as NER, sentiment analysis and text classification. The various tasks share the same text representation layer in the model but maintain a separate tagging layer per task.
We adapt multitasking to heterogeneous tag-sets by considering each training dataset, which has a different tag-set T^r_k, as a separate NER task. Thus, a single model is trained, in which the latent text representation {r_i}_{i=1}^n (see Sec. 2.2) is shared between NER tasks. As mentioned above, the tagging layers (projection and CRF) are kept separate for each tag-set. Fig. 3 illustrates this architecture.
We emphasize that the output of the MTL model still consists of K different tag sequence predictions {y_k}_{k=1}^K. They are consolidated into a final single sequence y^s using the same post-processing step described in Sec. 3.1.

Tag Hierarchy Model
The models introduced in Sec. 3.1 and 3.2 learn to predict a tag sequence for each training tag-set separately and do not share parameters between tagging layers. In addition, they require a post-processing step, outside of the model, for merging the tag sequences inferred for the different tag-sets. A simple concatenation of all training data is also not enough to accommodate the differences between the tag-sets within the model (see Sec. 3). Moreover, none of these models utilizes the relations between tags, which are provided as input in the form of a tag hierarchy.
In this section, we propose a model that addresses these limitations. This model utilizes the given tag hierarchy at training time to learn a single, shared tagging layer that predicts only fine-grained tags. The hierarchy is then used during inference to map fine-grained tags onto a target tag-set. Consequently, all tagging decisions are made in the model, without the need for a post-processing step.

Notations
In the input hierarchy DAG, each node represents some semantic role of words in sentences (e.g. 'Name'). If a node d has no hyponyms (Sem(d) = {d}), it represents some fine-grained tag semantics. We denote the set of all fine-grained tags by T_FG. We also denote the set of all fine-grained tags that are hyponyms of d by Fine(d) = T_FG ∩ Sem(d), e.g. Fine(Name) = {LastName, FirstName}. As mentioned above, our hierarchical model predicts tag sequences only from T_FG and then maps them onto a target tag-set.
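Under these definitions, Sem(d) and Fine(d) can be computed by a simple traversal over the hyponym edges. The sketch below uses an illustrative fragment of the hierarchy; the data layout (a child-list dictionary) is our own assumption:

```python
def sem(d, hyponyms):
    """Sem(d): d plus every tag with a directed path to d (its hyponyms)."""
    out = {d}
    for c in hyponyms.get(d, ()):
        out |= sem(c, hyponyms)
    return out

def fine(d, hyponyms):
    """Fine(d): the fine-grained members of Sem(d), i.e. tags with no hyponyms."""
    return {t for t in sem(d, hyponyms) if not hyponyms.get(t)}

# Illustrative fragment of the hierarchy in Fig. 1: parent -> list of hyponyms.
hyponyms = {"Name": ["FirstName", "LastName"]}
print(fine("Name", hyponyms))  # -> {'FirstName', 'LastName'} (order may vary)
```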

Hierarchy extension with 'Other' tags
For each tag d, we would like the semantics captured by the union of the semantics of all tags in Fine(d) to be exactly the semantics of d, making sure we will not miss any aspect of d when predicting only over T_FG. Yet, this semantics-equality property does not hold in general. One such example in Fig. 4 is 'Age>90' → 'Age', because there may be age mentions below 90 annotated in T_2's dataset.
To fix the semantics-equality above, we use the notion of the 'Other' tag in NER, which has the semantics of "all the rest". Specifically, for every d ∉ T_FG, a fine-grained tag 'd-Other' ∈ T_FG and an edge 'd-Other' → 'd' are automatically added to the graph, hence 'd-Other' ∈ Fine(d). For instance, 'Age-Other' → 'Age'. These new tags represent the aspects of d not captured by the other tags in Fine(d).
Next, a tag 'T_i-Other' is automatically added to each tag-set T_i, explicitly representing the "all the rest" semantics of T_i. The labels for 'T_i-Other' are induced automatically from unlabeled tokens in the original DS^r_i dataset. To make sure that the semantics-equality property above also holds for 'T_i-Other', a fine-grained tag 'FG-Other' is also added, which captures the "all the rest" semantics at the fine-grained level. Then, each 'T_i-Other' is connected to all fine-grained tags that do not capture some semantics of the tags in T_i, defining Fine('T_i-Other') = T_FG \ ⋃_{d ∈ T_i, d ≠ 'T_i-Other'} Fine(d). This mapping is important at training time, where 'T_i-Other' labels are used as distant supervision over their related fine-grained tags (Sec. 4.3). Fig. 4 depicts our hierarchy example after this step. We emphasize that all extensions in this step are done automatically as part of the model's algorithm.
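Our reading of this set definition can be sketched as follows; the tag names and the helper `fine_other` are hypothetical, and the Fine(·) sets are passed in precomputed:

```python
def fine_other(tagset_fine, all_fine):
    """Fine('T_i-Other'): all fine-grained tags NOT covered by any other
    tag of T_i. These are the tags that receive distant supervision from
    'T_i-Other' labels at training time.

    `tagset_fine` maps each (non-Other) tag d in T_i to its set Fine(d);
    `all_fine` is T_FG, the set of all fine-grained tags.
    """
    covered = set().union(*tagset_fine.values()) if tagset_fine else set()
    return all_fine - covered

# Illustrative tag-set T_1 = {'Name'} against a small fine-grained universe.
all_fine = {"FirstName", "LastName", "City", "Street", "FG-Other"}
t1_fine = {"Name": {"FirstName", "LastName"}}
print(fine_other(t1_fine, all_fine))  # tags supervised by 'T_1-Other'
```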

NER model with tag hierarchy
One outcome of the extension step is that the set of fine-grained tags T_FG covers all distinct fine-grained semantics across all tag-sets. In the following, we train a single NER model (Sec. 2.2) that predicts sequences of tags from the T_FG tag-set. As there is only one tagging layer, model parameters are shared across all training examples.
At inference time, this model predicts the most likely fine-grained tag sequence y^fg for the input x. As the model outputs only a single sequence, post-processing consolidation is not needed. The tag hierarchy is used to map each predicted fine-grained tag y^fg_i to a tag in a test tag-set T^s by traversing the out-edges of y^fg_i until a tag in T^s is reached. This procedure is also used in the baseline models (see Sec. 3.1) for mapping their predictions onto the test tag-set. However, unlike the baselines, which end up with multiple candidate predictions in the test tag-set and need to consolidate them, here only a single fine-grained tag sequence is mapped, so no further consolidation is needed.
At training time, each example x that belongs to some training dataset DS^r_i is labeled with a gold-standard tag sequence y whose tags are taken only from the corresponding tag-set T^r_i. This means that the tags {y_i} are not necessarily fine-grained tags, so there is no direct supervision for predicting fine-grained tag sequences. However, each gold label y_i provides distant supervision over its related fine-grained tags, Fine(y_i). It indicates that one of them is the correct fine-grained label without explicitly stating which one, so we consider all possibilities in a probabilistic manner.
Henceforth, we say that a fine-grained tag sequence y^fg agrees with y if y^fg_i ∈ Fine(y_i) for all i, i.e. y^fg is a plausible interpretation of y at the fine-grained tag level. For example, following Fig. 4, the sequences ['Hospital', 'City'] and ['Street', 'City'] agree with ['Location', 'Location'], unlike ['City', 'Last Name']. We denote the set of all fine-grained tag sequences that agree with y by AgreeWith(y).
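Membership in AgreeWith(y) reduces to a per-position check; a minimal sketch with an illustrative Fine(·) mapping of our own:

```python
def agrees_with(y_fg, y, fine):
    """True iff y_fg is a plausible fine-grained interpretation of y:
    at every position, the fine-grained tag must be a member of Fine(y_i)."""
    return all(t in fine[g] for t, g in zip(y_fg, y))

# Illustrative Fine(.) mapping, loosely following the Fig. 4 example.
fine = {
    "Location": {"Hospital", "City", "Street"},
    "Name": {"FirstName", "LastName"},
}
assert agrees_with(["Hospital", "City"], ["Location", "Location"], fine)
assert agrees_with(["Street", "City"], ["Location", "Location"], fine)
assert not agrees_with(["City", "LastName"], ["Location", "Location"], fine)
```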
Using this definition, the tag-hierarchy model is trained with the loss function Loss(y) = −log(Z_y / Z), where Z_y = Σ_{y^fg ∈ AgreeWith(y)} φ(y^fg), φ(y) stands for the model's score for sequence y, viewed as an unnormalized probability, and Z is the standard CRF partition function over all possible fine-grained tag sequences. Z_y, on the other hand, accumulates the scores only of fine-grained tag sequences that agree with y. Thus, this loss function aims at increasing the summed probability of all fine-grained sequences agreeing with y.
Both Z_y and Z can be computed efficiently using the Forward-Backward algorithm (Lafferty et al., 2001).
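As a sketch of this computation, the forward recursion below computes log Z, and the same recursion restricted by a per-position mask (allowing only the tags in Fine(y_i)) computes log Z_y. This is our own minimal illustration in log space, not the authors' implementation; the function name and toy dimensions are assumptions:

```python
import numpy as np

def log_partition(emit, trans, mask=None):
    """Forward algorithm in log space.

    emit: (n, T) emission scores; trans: (T, T) transition scores
    (from-tag -> to-tag). If `mask` is a boolean (n, T) array, only tag
    sequences staying inside the mask are summed, yielding log Z_y; with
    mask=None this is the full log partition function log Z.
    """
    n, T = emit.shape
    neg = -1e30  # effectively -inf, but safe for float arithmetic
    alpha = emit[0].copy()
    if mask is not None:
        alpha = np.where(mask[0], alpha, neg)
    for i in range(1, n):
        # alpha_new[j] = logsumexp_k(alpha[k] + trans[k, j]) + emit[i, j]
        scores = alpha[:, None] + trans
        alpha = np.logaddexp.reduce(scores, axis=0) + emit[i]
        if mask is not None:
            alpha = np.where(mask[i], alpha, neg)
    return np.logaddexp.reduce(alpha)

# Per-example loss = -(log Z_y - log Z); Z_y sums a subset of Z's terms.
emit = np.random.randn(4, 3)
trans = np.random.randn(3, 3)
mask = np.ones((4, 3), dtype=bool)
mask[:, 2] = False  # suppose fine-grained tag 2 never agrees with y
loss = log_partition(emit, trans) - log_partition(emit, trans, mask)
assert loss >= 0  # Z_y <= Z always
```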
We note that we also considered finding the most likely tag sequence over a test tag-set at inference time by summing the probabilities of all fine-grained tag sequences that agree with each candidate sequence y: max_y Σ_{y^fg ∈ AgreeWith(y)} φ(y^fg). However, this problem is NP-hard (Lyngsø and Pedersen, 2002). We plan to explore other alternatives in future work.

Experimental Settings
To test the tag-hierarchy model under heterogeneous tag-set scenarios, we conducted experiments using datasets from two domains. We next describe these datasets as well as implementation details for the tested models. Sec. 6 then details the experiments and their results.

Datasets
Five datasets from two domains, medical and news, were used in our experiments.Table 1 summarizes their main statistics.
For the news domain we used the English part of CoNLL-2003 (denoted Conll) (Tjong Kim Sang and De Meulder, 2003) and OntoNotes-v5 (denoted Onto) (Weischedel et al., 2013), both with train and test sets. We note that I2B2'14, Conll and Onto also contain a dev-set, which is used for hyper-parameter tuning (see below).
In all experiments, each example is a full document. Each document is split into tokens on whitespace and punctuation. A tag hierarchy covering the 57 tags from all five datasets was given as input to all models in all experiments. We constructed this hierarchy manually. The only non-trivial tag was 'Location', which in I2B2'14 is split into finer tags ('City', 'Street', etc.) and also includes hospital mentions in Conll and Onto. We resolved these relations similarly to the graph in Figure 1.

Compared Models
Four models were compared in our experiments:
M_Concat: a single NER model trained on the concatenation of all datasets and tag-sets (Sec. 3).
M_Indep: a combination of independent models, one per training tag-set (Sec. 3.1).
M_MTL: a multitasking model with a shared text representation and separate tagging layers (Sec. 3.2).
M_Hier: a tag hierarchy employed within a single base model (Sec. 4).
All models are based on the neural network described in Sec. 2.2. We tuned the hyper-parameters of the base model to achieve state-of-the-art results for a single NER model on Conll and I2B2'14 when trained and tested on the same dataset (Strubell et al., 2017; Dernoncourt et al., 2017) (see Table 2). This is done to maintain a constant baseline, and also because I2B2'06 does not have a standard dev-set.
We tuned hyper-parameters over the dev-sets of Conll and I2B2'14. For character-based embedding we used a single bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with a hidden state size of 25. For word embeddings we used pre-trained GloVe embeddings (Pennington et al., 2014), without further training. For token re-encoding we used a two-level stacked bidirectional LSTM (Graves et al., 2013) with both output and hidden state of size 100.
Once these hyper-parameters were set, no further tuning was performed in our experiments, meaning all models for heterogeneous tag-sets were tested under the same fixed hyper-parameter set. In each experiment, each model was trained until convergence on the respective training set. In all experiments we assess model performance via micro-averaged tag F1, in accordance with the CoNLL evaluation (Tjong Kim Sang and De Meulder, 2003). Statistical significance was computed using the Wilcoxon two-sided signed-ranks test at p = 0.01 (Wilcoxon, 1945). We next detail each experiment and its results.
In all our experiments, we found the performance of the different consolidation methods (Sec. 3.1) to be on par. One reason that using model scores does not beat random selection may be the overconfidence of the tagging models: their prediction probabilities are close to 0 or 1. We report figures for random selection as representative of all consolidation methods.

Tag-set extension experiment
In this experiment, we considered the 4 most frequent tags that occur in at least two of our datasets: 'Name', 'Date', 'Location' and 'Hospital' (Table 3 summarizes their statistics). For each frequent tag t and each ordered pair of datasets in which t occurs, we constructed new training sets by removing t from the first training set (termed the base dataset) and removing all tags but t from the second training set (termed the extending dataset). For example, for the triplet {'Name', I2B2'14, I2B2'06}, we constructed a version of I2B2'14 without 'Name' annotations and a version of I2B2'06 containing only annotations for 'Name'. This process yielded 32 such triplets. For every triplet, we trained all tested models on the two modified training sets and tested them on the test set of the base dataset (I2B2'14 in the example above). Each test set was not altered and contains all tags of the base tag-set, including t.
M_Concat performed poorly in this experiment. For example, on the dataset extending I2B2'14 with 'Name' from I2B2'06, M_Concat tagged only one 'Name' out of over 4000 'Name' mentions in the test set. Given this, we do not provide further details of the results of M_Concat in this experiment.
For the three models tested, this experiment yields 96 results. The main results of this experiment are shown in Table 4. Surprisingly, M_Indep outperformed M_MTL in more tests than vice versa, adding to prior observations that multitasking can hurt performance instead of improving it (Bingel and Søgaard, 2017; Alonso and Plank, 2017; Bjerva, 2017). However, applying a shared tagging layer on top of a shared text representation boosts the model's capability and stability. Indeed, overall, M_Hier outperforms the other models in most tests, and in the rest it is on par with the best performing model.
Analyzing the results, we noticed that the gap between model performances increases when more collisions are encountered by M_MTL and M_Indep at post-processing time (see Sec. 3.1). The number of collisions may thus be viewed as a predictor of the baselines' difficulty in handling a specific heterogeneous tag-sets setting. In the tests with many collisions (Table 5), M_Hier is a clear winner, outperforming the compared models in all but two comparisons, often by a significant margin. Finally, we compared the models trained with selective annotation to an "upper bound" of training and testing a single NER model on the same dataset with all tags annotated (Table 2). As expected, performance is usually lower with selective annotation. However, the drop intensifies when the base and extending datasets come from different domains, medical and news. In these cases, we observed that M_Hier is more robust: its drop compared to combining datasets from the same domain is the smallest in almost all such combinations. Table 6 provides some illustrative examples.

Full tag-set integration experiment
A scenario distinct from selective annotation is the integration of full tag-sets. On one hand, more training data is available for similar tags. On the other hand, more tags need to be consolidated among the tag-sets. To test this scenario, we trained the tested model types on the training sets of I2B2'06 and I2B2'14, which have different tag-sets. The models were evaluated both on the test sets of these datasets and on Physio, an unseen test set that requires the combination of the two training tag-sets for full coverage of its tag-set. We also compared the models to single models trained on each of the training sets alone. Table 7 displays the results.
As expected, single models do well on the test-set companion of their training set but underperform on the other test sets, since the tag-set on which they were trained does not cover well the tag-sets of the other test sets.
When compared with the best-performing single model, M_Concat shows reduced results on all 3 test sets. This can be attributed to reduced performance on types that are semantically different between datasets (e.g. 'Date'), while performance on similar tags (e.g. 'Name') does not drop.
Combining the two training sets using either M_Indep or M_MTL leads to a substantial performance drop in 5 out of 6 test sets compared to the best-performing single model. This is strongly correlated with the number of collisions encountered (see Table 7). Indeed, the only competitive result, M_MTL tested on Physio, had fewer than 100 collisions. This demonstrates the non-triviality of real-world tag-set integration and the difficulty of resolving tagging decisions across tag-sets.
By contrast, M_Hier shows no performance drop compared to the single models trained and tested on the same dataset. Moreover, it is the best performing model on the unseen Physio test set, with a 6% relative improvement in F1 over the best single model. This experiment highlights the robustness of the tag-hierarchy approach when applied to this heterogeneous tag-set scenario.
Related Work

Collobert et al. (2011) introduced the first competitive NN-based NER model that required little or no feature engineering. Huang et al. (2015) combined an LSTM with a CRF, showing performance similar to non-NN models. Lample et al. (2016) extended this model with character-based embeddings in addition to word embeddings, achieving state-of-the-art results. Similar architectures, such as convolutional networks as replacements for RNNs, were shown to outperform previous NER models (Ma and Hovy, 2016; Chiu and Nichols, 2016; Strubell et al., 2017). Dernoncourt et al. (2017) and Liu et al. (2017) showed that the LSTM-CRF model achieves state-of-the-art results also for de-identification in the medical domain. Lee et al. (2018) demonstrated how performance drops significantly when the LSTM-CRF model is tested under transfer learning within the same domain on this task. Collobert and Weston (2008) introduced MTL for neural networks, and other works followed, showing it helps in various NLP tasks (Chen et al., 2016; Peng and Dredze, 2017). Søgaard and Goldberg (2016) and Hashimoto et al. (2017) argue that cascading architectures can improve MTL performance. Several works have explored conditions for successful application of MTL (Bingel and Søgaard, 2017; Bjerva, 2017; Alonso and Plank, 2017).
Few works attempt to share information across datasets at the tagging level. Greenberg et al. (2018) proposed a single CRF model for tagging with heterogeneous tag-sets but without a hierarchy. They show the utility of this method for in-domain datasets with a balanced tag distribution. Our model can be viewed as an extension of theirs to tag hierarchies. Augenstein et al. (2018) use tag embeddings in MTL to further propagate information between tasks. Li et al. (2017) propose to use a tag-set made of the cross-product of two different POS tag-sets and train a model for it. Given the explosion in tag-set size, they introduce automatic pruning of cross-product tags. Kim et al. (2015) and Qu et al. (2016) automatically learn correlations between tag-sets, given training data for both tag-sets. They rely on similar contexts for related source and target tags, such as 'professor' and 'student'.
Our tag-hierarchy model was inspired by recent work on hierarchical multi-label classification (Silla and Freitas, 2011; Zhang and Zhou, 2014), and can be viewed as an extension of this direction to sequence tagging.

Conclusions
We proposed a tag-hierarchy model for the heterogeneous tag-sets NER setting, which does not require a consolidation post-processing stage.In the conducted experiments, the proposed model consistently outperformed the baselines in difficult tagging cases and showed robustness when applying a single trained model to varied test sets.
In the case of integrating datasets from the news and medical domains, we found the blending task to be difficult. In future work, we would like to improve this integration in order to gain from training on examples from different domains for tags like 'Name' and 'Location'.

Figure 1 :
Figure 1: A tag hierarchy for three tag-sets.

Figure 4 :
Figure 4: The tag hierarchy of Fig. 1 for three tag-sets after closure extension. Green nodes and edges were automatically added in this process. Fine-grained tags are surrounded by a dotted box.
A directed edge c → d implies that c is a hyponym of d, meaning c captures a subset of the semantics of d. Examples include 'LastName' → 'Name' and 'Street' → 'Location' in Fig. 1. We denote the set of all tags that capture some subset of the semantics of d by Sem(d) = {d} ∪ {c | there is a directed path from c to d in the graph}. For example, Sem(Name) = {Name, LastName, FirstName}.

Table 1 :
Dataset statistics. 'Tokens tagged' refers to the percentage of tokens not tagged as 'Other'.

Table 2 :
F1 for training and testing a single base NER model on the same dataset.

Table 3 :
Occurrence statistics for tags used in the tag-set extension experiment, reported as % of all tokens in the training and test sets of each dataset.

Table 4 :
F1 in the tag-set extension experiment, averaged over all extending datasets for every base dataset.
Table 5 presents the tests in which more than 100 collisions were detected for either M_Indep or M_MTL, constituting 66% of all test triplets. Detailed results for all 96 tests are given in the Appendix.