Hierarchical Entity Typing via Multi-level Learning to Rank

We propose a novel method for hierarchical entity classification that embraces the ontological structure of the types both during training and during prediction. At training time, our novel multi-level learning-to-rank loss compares positive types against negative siblings according to the type tree. At prediction time, we define a coarse-to-fine decoder that restricts viable candidates at each level of the ontology based on the already-predicted parent type(s). Our approach achieves state-of-the-art results across multiple datasets, particularly with respect to strict accuracy.


Introduction
Entity typing is the assignment of a semantic label to a span of text, where that span is usually a mention of some entity in the real world. Named entity recognition (NER) is a canonical information extraction task, commonly considered a form of entity typing that assigns spans to one of a handful of types, such as PER, ORG, GPE, and so on. Fine-grained entity typing (FET) seeks to classify spans into types according to more diverse, semantically richer ontologies (Ling and Weld, 2012; Yosef et al., 2012; Gillick et al., 2014; Del Corro et al., 2015; Choi et al., 2018), and has begun to be used in downstream models for entity linking (Gupta et al., 2017; Raiman and Raiman, 2018).
Consider the example in Figure 1 from the FET dataset FIGER (Ling and Weld, 2012). The mention of interest, Hollywood Hills, will be typed with the single label LOC in traditional NER, but may be typed with a set of types {/location, /geography, /geography/mountain} under a fine-grained typing scheme. In these finer-grained typing schemes, types usually form a hierarchy: there is a set of coarse types on the top level, similar to traditional NER types, e.g. /person; additionally, there are finer types that are subtypes of these top-level types, e.g. /person/artist or /person/doctor.
Most prior work concerning fine-grained entity typing has approached the problem as a multi-label classification problem: given an entity mention together with its context, the classifier seeks to output a set of types, where each type is a node in the hierarchy. Approaches to FET range from handcrafted sparse features to various neural architectures (Ren et al., 2016a; Shimaoka et al., 2017; Lin and Ji, 2019, inter alia; see section 2).
Perhaps owing to the historical transition from "flat" NER types, there has been relatively little work in FET that exploits the ontological tree structure, where type labels satisfy the hierarchical property: a subtype is valid only if its parent supertype is also valid. We propose a novel method that takes the explicit ontology structure into account via a multi-level learning-to-rank approach that ranks candidate types conditioned on the given entity mention. Intuitively, coarser types are easier to classify whereas finer types are harder: we capture this intuition by allowing distinct margins at each level of the ranking model. Coupled with a novel coarse-to-fine decoder that searches over the type hierarchy, our approach guarantees that predictions do not violate the hierarchical property, and achieves state-of-the-art results according to multiple measures across various commonly used datasets.

Related Work
FET is usually studied as allowing for sentence-level context in making predictions, notably starting with Ling and Weld (2012) and Gillick et al. (2014), who created the commonly used FIGER and OntoNotes datasets for FET. While researchers have considered the benefits of document-level (Zhang et al., 2018) and corpus-level (Yaghoobzadeh and Schütze, 2015) context, here we focus on the sentence-level variant for best contrast to prior work.
• Incorporating the hierarchy: Most prior work approaches the hierarchical typing problem as multi-label classification, without using information in the hierarchical structure, but there are a few exceptions. Ren et al. (2016a) proposed an adaptive margin for learning-to-rank so that similar types have a smaller margin; Xu and Barbosa (2018) proposed hierarchical loss normalization that penalizes output violating the hierarchical property; and Murty et al. (2018) proposed to learn a subtyping relation to constrain the type embeddings in the type space.
In contrast to these approaches, our coarse-to-fine decoding approach strictly guarantees that the output does not violate the hierarchical property, leading to better performance. HYENA (Yosef et al., 2012) applied ranking to sibling types in a type hierarchy, but the number of predicted positive types is determined separately by a meta-model, hence it does not support neural end-to-end training.
Researchers have proposed alternative FET formulations whose types do not form a type hierarchy, in particular ultra-fine entity typing (Choi et al., 2018; Xiong et al., 2019; Onoe and Durrett, 2019), with a very large set of types derived from phrases mined from a corpus. FET in KB (Jin et al., 2019) labels mentions with types from a knowledge base with multiple relations, forming a type graph. Dai et al. (2019) augment the task with entity linking to KBs.

Problem Formulation
We denote a mention as a tuple x = (w, l, r), where w = (w_1, ..., w_n) is the sentential context and the span [l : r] marks a mention of interest in sentence w. That is, the mention of interest is (w_l, ..., w_r). Given x, a hierarchical entity typing model outputs a set of types Y in the type ontology Y, i.e. Y ⊆ Y.
Type hierarchies take the form of a forest, where each tree is rooted by a top-level supertype (e.g. /person, /location, etc.). We add a dummy parent node entity = "/", the supertype of all entity types, to all the top-level types, effectively transforming a type forest into a type tree. In Figure 2, we show the 3 type ontologies associated with 3 different datasets (see subsection 5.1), with the dummy entity node added. We now introduce some notation for referring to aspects of a type tree. The binary relation "type z is a subtype of y" is denoted z <: y. The unique parent of a type y in the type tree is denoted ȳ ∈ Y, where ȳ is undefined for y = entity.
The immediate subtypes of y (children nodes) are denoted Ch(y) ⊆ Y. The siblings of y, i.e. those sharing the same immediate parent, are denoted Sb(y) ⊆ Y, where y ∉ Sb(y).
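The tree notation above can be made concrete with a few lines of code. Below is a minimal sketch, assuming a toy ontology stored as a child-to-parent map; the names and the dictionary representation are illustrative, not from the paper's implementation.

```python
# Toy type tree as a child-to-parent map; "/" is the dummy entity root.
parent = {
    "/person": "/",
    "/location": "/",
    "/person/artist": "/person",
    "/person/doctor": "/person",
    "/person/artist/singer": "/person/artist",
}

def children(y):
    """Ch(y): the immediate subtypes of y."""
    return {z for z, p in parent.items() if p == y}

def siblings(y):
    """Sb(y): types sharing y's parent, excluding y itself."""
    return {z for z in children(parent[y]) if z != y}

def level(y):
    """lev(y): depth of y below the dummy root '/'."""
    return 0 if y == "/" else y.count("/")
```

With paths-as-strings, lev(y) is simply the number of path separators, e.g. lev(/person/artist/singer) = 3.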
In the AIDA FET ontology (see Figure 2), the maximum depth of the tree is L = 3, and each mention can be typed with at most 1 type from each level. We term this scenario single-path typing, since there can be only 1 path starting from the root (entity) of the type tree. This is in contrast to multi-path typing, such as in the BBN dataset, where mentions may be labeled with multiple types on the same level of the tree.
Additionally, in AIDA, there are mentions labeled with types such as /per/police/<unspecified>. In FIGER, we find instances labeled with the type /person but no further subtype. What does it mean when a mention x is labeled with a partial type path, i.e., a type y but none of the subtypes z <: y? We consider two interpretations:

• Exclusive: x is of type y, but x is not of any type z <: y.
• Undefined: x is of type y, but whether it is an instance of some z <: y is unknown.
We devise different strategies to deal with these two conditions. Under the exclusive case, we add a dummy other node to every intermediate branch node in the type tree. For any mention x labeled with type y but none of the subtypes z <: y, we add the additional label "y/other" to the labels of x (see Figure 2: AIDA). For example, if we interpret a partial type path /person in FIGER as exclusive, we add the type /person/other to that instance. Under the undefined case, we do not modify the labels in the dataset. We will see that this choice can make a significant difference depending on the way a specific dataset is annotated.
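The exclusive interpretation amounts to a simple label-augmentation pass over the dataset. A sketch, assuming labels are path strings and using a hypothetical has_children predicate over the ontology (both are illustrative assumptions):

```python
def augment_exclusive(labels, has_children):
    """Add 'y/other' for any labeled type y that has children in the
    ontology but no labeled subtype (i.e., a partial type path)."""
    out = set(labels)
    for y in labels:
        has_labeled_child = any(z.startswith(y + "/") for z in labels)
        if has_children(y) and not has_labeled_child:
            out.add(y + "/other")
    return out
```

Under the undefined interpretation, the label set is left untouched.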

Mention Representation
Hidden representations for entity mentions in sentence w are generated by leveraging recent advances in language model pre-training, e.g. ELMo (Peters et al., 2018). The ELMo representation of each token w_i is a fixed pre-trained vector; dropout with probability p_D is applied to the ELMo vectors.
Our mention encoder largely follows Lin and Ji (2019). First, a mention representation m is derived from the representations of the words in the mention: we apply a max-pooling layer atop the mention tokens after a linear transformation. Then we employ the mention-to-context attention first described in Zhang et al. (2018) and later employed by Lin and Ji (2019): a context vector c is generated by attending over the sentence with a query vector derived from the mention vector m, using the multiplicative attention of Luong et al. (2015). The final representation of an entity mention is the concatenation [m ; c] of the mention and context vectors.

Type Scorer
We learn a type embedding y ∈ R^{d_t} for each type y ∈ Y. To score an instance with representation [m ; c], we pass it through a 2-layer feed-forward network that maps it into the type space R^{d_t}, with tanh as the nonlinearity. The final score F(x, y) is the inner product between the transformed feature vector and the type embedding (Equation 4).

Hierarchical Learning-to-Rank
We introduce our novel hierarchical learning-torank loss that (1) allows for natural multi-label classification and (2) takes the hierarchical ontology into account.
We start with a multi-class hinge loss that ranks positive types above negative types (Weston and Watkins, 1999):

J(x, Y) = Σ_{y ∈ Y} Σ_{y' ∉ Y} [ ξ − F(x, y) + F(x, y') ]_+ ,    (5)

where [x]_+ = max{0, x}. This is learning-to-rank with a ranking SVM (Joachims, 2002): the model learns to rank the positive types y ∈ Y higher than the negative types y' ∉ Y by imposing a margin ξ between y and y': type y should rank higher than y' by ξ. Note that in Equation 5, since this is a linear SVM, the margin hyperparameter ξ can simply be set to 1 (the type embeddings are linearly scalable), and we rely on L2 regularization to constrain the type embeddings.
Multi-level Margins However, this method considers all candidate types to be flat instead of hierarchical: all types are given the same treatment without any prior on their relative position in the type hierarchy. Intuitively, coarser types (higher in the hierarchy) should be easier to determine (e.g. /person vs /location should be fairly easy for the model), but fine-grained types (e.g. /person/artist/singer) are harder.
We encode this intuition by (i) ranking types only against types on the same level of the type tree, and (ii) setting a different margin parameter for each level:

J(x, Y) = Σ_{y ∈ Y} Σ_{y' ∈ Sb(y)\Y} [ ξ_lev(y) − F(x, y) + F(x, y') ]_+ .    (6)

Here lev(y) is the level of the type y: for example, lev(/location) = 1, and lev(/person/artist/singer) = 3. In Equation 6, each positive type y is only compared against its negative siblings Sb(y)\Y, and the margin hyperparameter is ξ_lev(y), i.e., a margin dependent on the level of y in the tree. Intuitively, we should set ξ_1 > ξ_2 > ξ_3, since the model should be able to learn a larger margin between easier pairs: we show that this is superior to using a single margin in our experiments.
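In spirit, the multi-level ranking loss of Equation 6 can be sketched in a few lines over plain score dictionaries. This is a hedged scalar illustration with assumed accessor signatures, not the paper's implementation (which operates on batched tensors):

```python
def multilevel_rank_loss(scores, positives, siblings, level, xi):
    """Hinge loss ranking each positive type above its negative siblings
    by a level-dependent margin xi[level(y)].

    scores:    dict type -> float, the model scores F(x, y)
    positives: set of gold types Y
    siblings:  callable y -> set of Sb(y)
    level:     callable y -> int, lev(y)
    xi:        dict level -> margin
    """
    loss = 0.0
    for y in positives:
        for y_neg in siblings(y) - positives:
            loss += max(0.0, xi[level(y)] - (scores[y] - scores[y_neg]))
    return loss
```

A zero loss means every positive type already outscores each of its negative siblings by the full margin for its level.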
Analogous to the reasoning that the margin ξ in Equation 5 can simply be 1, only the relative ratios between the ξ's matter. For simplicity, if the ontology has L levels, we assign ξ_l = L − l + 1 (Equation 7). For example, given an ontology with 3 levels, the margins per level are (ξ_1, ξ_2, ξ_3) = (3, 2, 1).
Flexible Threshold Equation 6 only ranks positive types higher than negative types, so that all children of a given parent type are ranked by their relevance to the entity mention. What should be the threshold between positive and negative types? We could set the threshold to 0 (approaching the multi-label classification problem as a set of binary classification problems, see Lin and Ji (2019)), or tune an adaptive, type-specific threshold for each parent type (Zhang et al., 2018).
Here, we propose a simpler method.
We propose to directly use the parent node as the threshold. If a positive type is y, we learn the ranking relation

y ≻ ȳ ≻ y',  for all y' ∈ Sb(y)\Y,    (8)

where ≻ means "ranks higher than". For example, suppose a mention has gold type /person/artist/singer. Since the parent type /person/artist can be considered a kind of prior over all types of artists, the model should learn that the positive type "singer" has higher confidence than "artist", which in turn is higher than other types of artists like "author" or "actor". Hence the ranker should learn that a positive subtype ranks higher than its parent, and its parent ranks higher than its negative children. Under this formulation, at decoding time, given parent type y, a child subtype z <: y that scores higher than y is output as a positive label.
We translate the ranking relation in Equation 8 into a ranking loss that extends Equation 6. In Equation 6, there is an expected margin ξ between positive and negative types. Since we have inserted the parent in the middle, we divide the margin ξ into αξ and (1 − α)ξ: αξ is the margin between the positive types and the parent, and (1 − α)ξ is the margin between the parent and the negative types. For a visualization see Figure 3.
The hyperparameter α ∈ [0, 1] can be used to tune the precision-recall tradeoff when outputting types: the smaller α is, the smaller the expected margin between positive types and the parent. This intuitively increases precision but decreases recall (only very confident types are output). Vice versa, increasing α decreases precision but increases recall.
We therefore learn 3 sets of ranking relations from Equation 8: (i) positive types should score above the parent by αξ; (ii) the parent should score above any negative sibling type by (1 − α)ξ; (iii) positive types should score above negative sibling types by ξ. Our final hierarchical ranking loss combines these three sets of relations.
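The three ranking relations can be sketched as a single hinge-style loss. The following is a simplified scalar version under assumed score/parent/sibling accessors (illustrative names; the actual model operates on batched embeddings):

```python
def hier_rank_loss(scores, positives, siblings, parent, level, xi, alpha):
    """Hierarchical ranking loss combining the three relations of Eq. 8:
    (i)   positive y above its parent by alpha * xi,
    (ii)  parent above negative siblings by (1 - alpha) * xi,
    (iii) positive y above negative siblings by xi.
    """
    loss = 0.0
    for y in positives:
        p = parent[y]
        m = xi[level(y)]
        # (i) positive type above its parent (the threshold)
        loss += max(0.0, alpha * m - (scores[y] - scores[p]))
        for y_neg in siblings(y) - positives:
            # (ii) parent above each negative sibling
            loss += max(0.0, (1 - alpha) * m - (scores[p] - scores[y_neg]))
            # (iii) positive type above each negative sibling
            loss += max(0.0, m - (scores[y] - scores[y_neg]))
    return loss
```

When all three relations hold with their full margins, the loss is zero, which is the condition the coarse-to-fine decoder later exploits (a child is output only when it outscores its parent).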

Decoding
Predicting the types for each entity mention is performed via iterative search on the type tree, from the root entity node to coarse types, then to finer-grained types. This ensures that our output does not violate the hierarchical property: if a subtype is output, its parent must be output as well.
Given instance x, we compute the score F(x, y) for each type y ∈ Y. The search starts with the root node entity of the type tree in the queue. For each type y popped from the queue, a child z <: y is added to the predicted type set if F(x, z) > F(x, y), corresponding to the ranking relation in Equation 8 that the model has learned; this repeats until the queue is empty, and all decoded types are returned.
Here we only add the top-k children to the queue, to prevent over-generating types. This can also be used to enforce the single-path property (by setting k = 1) if the dataset is single-path. For each level i of the type hierarchy, we limit the branching factor (the number of allowed children) to k_i. The algorithm is listed in Algorithm 1, where the function TOPK(S, k, f) selects the top-k elements of S with respect to the function f.
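The coarse-to-fine search can be sketched as a breadth-first traversal of the type tree. Below is a minimal illustration of Algorithm 1's control flow, with hypothetical score and children accessors and paths-as-strings types:

```python
from collections import deque

def decode(score, children, branching, root="/"):
    """Coarse-to-fine decoding: starting from the root, admit a child z
    only if score(z) > score(parent), keeping at most branching[l] children
    per node at depth l (branching = [k_1, ..., k_L], zero-indexed)."""
    predicted = set()
    queue = deque([root])
    while queue:
        y = queue.popleft()
        lev = 0 if y == root else y.count("/")
        k = branching[lev] if lev < len(branching) else 0
        # children that outscore their parent, per the learned ranking
        admitted = [z for z in children(y) if score(z) > score(y)]
        for z in sorted(admitted, key=score, reverse=True)[:k]:
            predicted.add(z)
            queue.append(z)
    return predicted
```

By construction, a type enters the predicted set only after its parent did, so the output never violates the hierarchical property; setting every k_i = 1 yields single-path decoding.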

Subtyping Relation Constraint
Each type y ∈ Y in the ontology is assigned a type embedding y ∈ R^{d_t}. We exploit the binary subtyping relation " <: " ⊆ Y × Y on the types. Trouillon et al. (2016) proposed the relation embedding method ComplEx, which works well with anti-symmetric and transitive relations such as subtyping. It has been employed in FET before: in Murty et al. (2018), ComplEx is added to the loss to regularize the type embeddings. ComplEx operates in complex space, so we use the natural isomorphism between real and complex spaces to map each type embedding into complex space (the first half of the embedding vector as the real part, and the second half as the imaginary part). We learn a single relation embedding r ∈ C^{d_t/2} for the subtyping relation. Given types y and z, the subtyping statement y <: z is modeled using the scoring function

r(y, z) = Re(Σ (y ⊙ r ⊙ z̄)),

where ⊙ is the element-wise product and z̄ is the complex conjugate of z. If y <: z then r(y, z) > 0; and vice versa, r(y, z) < 0 if y is not a subtype of z.
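The real-to-complex mapping and the ComplEx-style score can be sketched directly. A small numpy illustration with illustrative names (a sketch of the scoring function, not the paper's training code):

```python
import numpy as np

def complex_view(v):
    """Map a real vector of even length d to C^{d/2}: first half is the
    real part, second half the imaginary part."""
    half = len(v) // 2
    return v[:half] + 1j * v[half:]

def subtype_score(y_emb, z_emb, r_emb):
    """ComplEx-style score for the statement y <: z:
    Re(sum(y * r * conj(z))), computed on the complex views."""
    y, z, r = complex_view(y_emb), complex_view(z_emb), complex_view(r_emb)
    return float(np.real(np.sum(y * r * np.conj(z))))
```

Because of the complex conjugate, the score is not symmetric in y and z, which is what lets a single relation embedding model the anti-symmetric subtyping relation.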
Loss Given instance (x, Y), for each positive type y ∈ Y, we learn the corresponding subtyping relations. Translating these relation constraints into a binary classification problem ("is or is not a subtype") under a primal SVM, we get a hinge loss (Equation 14). This is different from Murty et al. (2018), where a binary cross-entropy loss on randomly sampled (y, y') pairs is used. Our experiments showed that the loss in Equation 14 performs better than the cross-entropy version, due to the structure of the training pairs: we use siblings and siblings of parents as negative samples (these are the types closest to the positive parent type), hence we train with more competitive negative samples.

Training and Validation
Our final loss is a combination of the hierarchical ranking loss and the subtyping relation constraint loss, with L2 regularization. The AdamW optimizer (Loshchilov and Hutter, 2019) is used to train the model, as it has been shown to be superior to the original Adam under L2 regularization. The hyperparameters α (ratio of the margin above/below the threshold), β (weight of the subtyping relation constraint), and λ (L2 regularization coefficient) are tuned.
At validation time, we tune the maximum branching factors k_1, ..., k_L for each level. These parameters trade off precision against recall at each level and prevent over-generation (which we observed in some cases). All hyperparameters are tuned so that models achieve maximum micro F1 scores (see subsection 5.4).

Experiments
5.1 Datasets AIDA The AIDA Phase 1 practice dataset for hierarchical entity typing comprises of 297 documents from LDC2019E04 / LDC2019E07, and the evaluation dataset is from LDC2019E42 / LDC2019E77.We take only the English part of the data, and use the practice dataset as train/dev, and the evaluation dataset as test.The practice dataset comprises of 3 domains, labeled as R103, R105, and R107.Since the evaluation dataset is out-ofdomain, we use the smallest domain R105 as dev, and the remaining R103 and R107 as train.
The AIDA entity dataset has a 3-level ontology, termed type, subtype, and subsubtype. A mention can only have one label per level, hence the dataset is single-path, and the branching factors (k_1, k_2, k_3) for the three levels are set to (1, 1, 1).

BBN Weischedel and Brunstein (2005) labeled a portion of the one-million-word Penn Treebank corpus of Wall Street Journal texts (LDC95T7) using a two-level hierarchy, resulting in the BBN Pronoun Coreference and Entity Type Corpus. We follow the train/test split of Ren et al. (2016b) and the train/dev split of Zhang et al. (2018).
OntoNotes Gillick et al. (2014) sampled sentences from the OntoNotes corpus and annotated the entities using 89 types. We follow the train/dev/test split of Shimaoka et al. (2017).
FIGER Ling and Weld (2012) sampled a dataset from Wikipedia articles and news reports. Entity mentions in these texts are mapped to a 113-type ontology derived from Freebase (Bollacker et al., 2008). Again, we follow the data split of Shimaoka et al. (2017).
The statistics of these datasets and their accompanying ontologies are listed in Table 1.

Setup
To best compare to recent prior work, we follow Lin and Ji (2019): the ELMo encodings of words are fixed and not updated. We use all 3 layers of ELMo output, so the initial embedding has dimension d_w = 3072. We set the type embedding dimensionality to d_t = 1024. The initial learning rate is 10^-5 and the batch size is 256.
Hyperparameter choices are tuned on the dev sets, and are listed in Table 1. We employ early stopping, choosing the model that yields the best micro F1 score on the dev set.

Baselines
We compare our approach to major prior work in FET capable of multi-path entity typing.4 For AIDA, since to our knowledge there is no prior work on this dataset, we also implemented a multi-label classification baseline that uses a set of binary classifiers (similar to Lin and Ji (2019)) with our mention feature extractor. Its results are shown in Table 2 as "Multi-label".

Metrics
We follow prior work and use strict accuracy (Acc), macro F1 (MaF), and micro F1 (MiF) scores. Given instance x_i, we denote the gold type set as Y_i and the predicted type set as Ŷ_i. Strict accuracy is the ratio of instances where Y_i = Ŷ_i. Macro F1 is the average over instances of the F1 score between Y_i and Ŷ_i, whereas micro F1 counts total true positives, false negatives, and false positives globally.
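These three metrics can be sketched for set-valued predictions in a few lines. A minimal illustration with hypothetical helper names, operating on parallel lists of gold and predicted type sets:

```python
def strict_acc(golds, preds):
    """Fraction of instances with exactly correct predicted type set."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def macro_f1(golds, preds):
    """Mean of per-instance F1 between gold and predicted sets."""
    def f1(g, p):
        tp = len(g & p)
        if tp == 0:
            return 0.0
        prec, rec = tp / len(p), tp / len(g)
        return 2 * prec * rec / (prec + rec)
    return sum(f1(g, p) for g, p in zip(golds, preds)) / len(golds)

def micro_f1(golds, preds):
    """F1 from globally pooled true/false positives and false negatives."""
    tp = sum(len(g & p) for g, p in zip(golds, preds))
    fp = sum(len(p - g) for g, p in zip(golds, preds))
    fn = sum(len(g - p) for g, p in zip(golds, preds))
    return 2 * tp / (2 * tp + fp + fn)
```

Note that strict accuracy is the harshest of the three: a single wrong or missing type zeroes out an instance's contribution.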
We also investigate per-level accuracies on AIDA. The accuracy at level l is the ratio of instances whose predicted type set and gold type set are identical at level l. If there is no type output at level l, we append other to create a dummy type at level l, e.g. /person/other/other.

4 Zhang et al. (2018) included document-level information in their best results; for fair comparison, we use their results without document context, as reported in their ablation tests.
Hence accuracy of the last level (in AIDA, level 3) is equal to the strict accuracy.

Results and Discussions
All our results are reported under the two conditions regarding partial type paths: exclusive and undefined. Results on the AIDA dataset are shown in Table 2. Our model under the exclusive case outperforms the multi-label classification baseline on all metrics.
Of the 187 types specified in the AIDA ontology, the train/dev set covers only 93 types. The test set covers 85 types, of which 63 are seen types. We could perform zero-shot entity typing by initializing a type's embedding using the type name (e.g. /fac/structure/plaza) together with its description (e.g. "An open urban public space, such as a city square") as designated in the data annotation manual. We leave this as future work. Results for BBN, OntoNotes, and FIGER can be found in Table 3. Across the 3 datasets, our method produces state-of-the-art performance on strict accuracy and micro F1 scores, and state-of-the-art or comparable (±0.5%) performance on macro F1 score, as compared to prior models, e.g. Lin and Ji (2019). In particular, our method improves strict accuracy substantially (4%-8%) across these datasets, showing that our decoder is better at outputting exactly correct type sets.
Partial type paths: exclusive or undefined? Interestingly, we found that for AIDA and FIGER, partial type paths are better considered exclusive, whereas for BBN and OntoNotes, considering them undefined leads to better performance. We hypothesize that this stems from how the data is annotated: the annotation manual may contain directives as to whether partial type paths should be interpreted as exclusive or undefined, or the data may be non-exhaustively annotated, leading to undefined partial types. We advocate for careful investigation into partial type paths in future experiments and data curation.

Ablation Studies
We compare our best model against variants with various components removed, to study the gain from each component. Starting from the better of the two settings (exclusive or undefined), we report the performance of (i) removing the subtyping relation constraint described in subsection 4.5; and (ii) substituting the multi-level margins of Equation 7 with a "flat" margin, i.e., setting the margins at all levels to 1. These results are shown in Table 2 and Table 3 below our best results, and they show that the multi-level margins and the subtyping relation constraint offer orthogonal improvements to our models.
Error Analysis We identify common patterns of errors, coupled with typical examples:

• Confusing types: In BBN, our model outputs /gpe/city when the gold type is /location/region for "... in shipments from the Valley of either hardware or software goods." These types are semantically similar, and our model fails to discriminate between them.
• Incomplete types: In FIGER, given the instance "... multi-agency investigation headed by the U.S. Immigration and Customs Enforcement 's homeland security investigations unit", the gold types are /government agency and /organization, but our model fails to output /organization.
• Focusing on only part of the mention: In AIDA, given the instance "... suggested they were the work of Russian special forces assassins out to blacken the image of Kiev's pro-Western authorities", our model outputs /org/government whereas the gold type is /per/militarypersonnel. Our model focuses on the "Russian special forces" part but ignores the "assassins" part. A better mention representation is required to correct this, possibly by introducing type-aware mention representations; we leave this as future work.

Conclusions
We proposed (i) a novel multi-level learning-to-rank loss function that operates on a type tree, and (ii) an accompanying coarse-to-fine decoder, to fully embrace the ontological structure of the types for hierarchical entity typing. Our approach achieved state-of-the-art performance across various datasets, with substantial improvements (4-8%) in strict accuracy. Additionally, we advocate for careful investigation into partial type paths: their interpretation depends on how the data is annotated, and in turn influences typing performance.

Figure 1 :
Figure 1: An example mention classified using the FIGER ontology. Positive types are highlighted.

Figure 2 :
Figure 2: Various type ontologies. Different levels of the types are shown in different shades, from L0 to L3. The entity and other special nodes are discussed in section 3.

Figure 3 :
Figure 3: Hierarchical learning-to-rank. Positive type paths are colored black, negative type paths gray. Each blue line corresponds to a threshold derived from a parent node. Positive types (on the left) are ranked above negative types (on the right).

Table 1 :
Statistics of various datasets and their corresponding hyperparameter settings.

Table 2 :
Results on the AIDA dataset.
† : Not run on the specific dataset; * : Not strictly comparable due to non-standard, much larger training set; ‡ : Result includes document-level context information, hence not comparable.

Table 3 :
Results on common FET datasets: BBN, OntoNotes, and FIGER. Numbers in italics are results obtained with various augmentation techniques (either larger data or larger context), hence not directly comparable.