MZET: Memory Augmented Zero-Shot Fine-grained Named Entity Typing

Named entity typing (NET) is a classification task that assigns semantic types to an entity mention in context. However, as entity types grow in number and granularity, few previous studies address newly emerged entity types. In this paper, we propose MZET, a novel memory augmented FNET (Fine-grained NET) model, to tackle unseen types in a zero-shot manner. MZET incorporates character-level, word-level, and contextual-level information to learn the entity mention representation. In addition, MZET incorporates the semantic meaning and the hierarchical structure into the entity type representation. Finally, through the memory component, which models the relationship between the entity mention and the entity type, MZET transfers knowledge from seen entity types to the zero-shot ones. Extensive experiments on three public datasets show the superior performance of MZET, which surpasses the state-of-the-art FNET neural network models with up to 8% gain in Micro-F1 and Macro-F1 score.


Introduction
Named entity typing (NET) is the task of inferring semantic types for given named entity mentions in utterances. For instance, given the entity mention "John" in the utterance "John plays piano on the stage", the goal of NET is to infer that "John" is a pianist or a musician, and a person. Standard NET approaches (Chinchor and Robinson, 1997; Tjong Kim Sang and De Meulder, 2003; Doddington et al., 2004) only consider a tiny set of coarse-grained types and discard fine-grained types with a different level of granularity. In recent years, fine-grained named entity typing (FNET) (Ling and Weld, 2012; Nakashole et al., 2013; Del Corro et al., 2015; Ren et al., 2016; Abhishek et al., 2017; Zhou et al., 2018) has continued to draw researchers' attention, because it provides additional information that benefits many downstream tasks such as relation extraction (Liu et al., 2014), entity linking (Stern et al., 2012), and question answering (Han et al., 2017).
However, with the ever-growing number of entity types especially for fine-grained ones, it is difficult and expensive to collect sufficient annotations per category and retrain the whole model. Therefore, a zero-shot paradigm is welcomed in FNET to handle the increasing number of unseen types. The task we deal with in this paper is named zero-shot fine-grained named entity typing (ZFNET), which is to detect the unseen fine-grained entity types that have no labeled data available.
Learning generalizable representations for entity mentions and types is essential for the ZFNET task. Previous works learn these representations either from hand-crafted features (Ma et al., 2016; Yuan and Downey, 2018) or from pre-trained word embeddings (Ren et al., 2016). These methods are insufficient and inefficient when challenged by polysemy, ambiguity, or newly emerged mentions. The most recent works (Obeidat et al., 2019; Zhou et al., 2018) learn more informative but resource-intensive representations by drawing on the external Wikipedia knowledge base.
With the learned representations for entity mentions and entity types, most of the existing zero-shot FNET methods (Ma et al., 2016; Obeidat et al., 2019) project them into a shared semantic space. The shared space is learned by minimizing the distance between entity mentions and their corresponding seen entity types. In the prediction phase, test entity mentions are classified to the nearest unseen entity types, based on the assumption that the learned distance measurement also works for unseen types. The ability of these methods to transfer knowledge from seen types to unseen types is limited, since they do not explicitly build connections between seen types and unseen types.
In this work, we propose the memory augmented zero-shot FNET model (MZET) to tackle the aforementioned problems. MZET is designed to automatically extract multi-information integrated mention representations and structure-aware semantic type representations with a large-scale pre-trained language model (Devlin et al., 2019). To effectively transfer knowledge from seen types to unseen types, MZET regards seen types as memory components and explicitly models the relationships between seen types and unseen types. Intuitively, we want to mimic the way humans learn new concepts: by comparing the similarities and differences between new concepts and old concepts stored in memory.
In summary, the main contributions of MZET are as follows. 1) We propose the memory augmented zero-shot FNET model (MZET), which can be trained in an end-to-end fashion. MZET extracts multi-information integrated mention representations and structure-aware semantic type representations without additional augmented data sources. 2) MZET regards seen types as memory components and explicitly models the relationships between seen types and unseen types to effectively transfer knowledge to new concepts. 3) MZET significantly outperforms existing zero-shot FNET models on zero-shot fine-grained, coarse-grained, and hybrid-grained named entity typing over three benchmark datasets.

Problem Definition
We begin by formalizing the problem of zero-shot fine-grained named entity typing (ZFNET). For a given entity mention $x$, the task of named entity typing (NET) is to identify the type $y$ for $x$. Suppose we have a training type set $Y_{seen} = \{y^s_1, y^s_2, \ldots, y^s_{D_s}\}$ with $D_s$ seen types. A large number of labeled examples are available for these seen types, $D_{tr} = \{(x_i, y_i), i = 1, 2, \ldots, |D_{tr}|\}$ with $y_i \in Y_{seen}$.
The task of ZFNET is to classify a new mention that belongs to one of the unseen fine-grained entity types $Y_{unseen} = \{y^u_1, y^u_2, \ldots, y^u_{D_u}\}$, where $D_u$ is the number of unseen fine-grained entity types and $Y_{seen} \cap Y_{unseen} = \emptyset$.

The proposed Model
The overview of the proposed MZET framework is illustrated in Figure 1. Specifically, MZET consists of three components: 1) the Zero-shot Memory Network, which identifies entity types for entity mentions, introduced in Sec. 3.1; 2) the Mention Processor, which extracts representations for entity mentions, detailed in Sec. 3.2; 3) the Label Processor, which obtains type representations, described in Sec. 3.3.

Zero-Shot Memory Network
In the zero-shot entity typing task, there are no mentions available for these unseen entity types. Without the labeled data, we are not able to model the direct mapping from the new mentions to the new types. Here, we propose a novel zero-shot memory network that utilizes seen entity types to bridge the gap between the new mentions and the zero-shot entity types.

Memory augmented Typing Function
To enable the zero-shot paradigm, previous studies (Ma et al., 2016; Obeidat et al., 2019) introduce a score function $f(\cdot)$ to rate the match between a given entity mention $x$ and an entity type $y$, where $y$ is the raw type picked from $Y_{seen}$ or $Y_{unseen}$. $f(\cdot)$ is defined on the projections of $x$ and $y$, where $\theta(x, A): x \to Ax$ and $\phi(y, B): y \to By$ serve as the mapping functions that project $x$ and $y$ into a shared semantic space by neural networks (ours are described in Sec. 3.2 and Sec. 3.3, respectively). $f(\cdot)$ is the distance estimation of $Ax$ and $By$ in the shared space.
Figure 1: The framework of MZET for zero-shot fine-grained named entity typing. It consists of three main components: mention processor, label processor, and zero-shot memory network.
Considering the lack of interpretability of the representations in the shared semantic space, we propose the memory network augmented zero-shot FNET to construct another high-level shared space, called the Association Space. Each dimension in the association space links to an entity type. The representation in this space indicates the association information with each entity type. The score function for memory augmented zero-shot FNET becomes $f(x, y) = MEM_{Y_{seen}}(\theta(x, A), \phi(y, B)) = MEM_{Y_{seen}}(Ax, By)$, where $MEM_{Y_{seen}}(\cdot)$ denotes the rating score estimated by a memory network for the zero-shot paradigm with $Y_{seen}$ as the memory component. The memory component is the aforementioned association space: it guides unseen entity typing by linking unseen types to each seen type stored in the memory, much as humans recognize new things by intuitively associating them with the knowledge they have memorized.
In effect, $MEM_{Y_{seen}}(\cdot)$ loads $Ax$ and $By$ from the shared semantic space into the high-level association space, and then estimates the matching score with the help of their connections to the seen types in the memory.

Zero-Shot Memory Network Model
All the seen entity type representations are used as the memories in the zero-shot memory network. We propose to use the memory network as a special attention mechanism to model the relationships between the mentions and the seen entity types. Furthermore, we build a zero-shot version of the memory network that uses the type representation similarities to transfer knowledge from the seen types to the unseen types, which implements $MEM_{Y_{seen}}(\cdot)$. The key points of the implementation are as follows:
(1) The Association Space is constructed with all the seen type representations to bridge the gap between mentions and unseen types.
(2) The mention representation is augmented by its association with seen types, that is, by the attention between mentions and seen types. After absorbing the association, the mentions obtain more informative representations, which benefits knowledge transfer in the Association Space.
(3) The augmented mention and type representations are projected into the Association Space. The association-augmented mention can be projected into it directly. For the unseen types, their associations with the seen types are formed by type semantic similarity, which places each unseen type in the Association Space.
We first construct the Association Space with all the seen type representations from the Label Processor, $F = (f_1, \ldots, f_{D_s}) \in \mathbb{R}^{D_s \times D_b}$, where $D_s$ is the number of seen types and $D_b$ is the dimension of the type representation.
To augment the mention representation by its association with seen types, we construct two dependent memory components $G$ and $C$. As shown in Figure 1, the input memory representation $G = (g_1, \ldots, g_{D_s}) \in \mathbb{R}^{D_s \times D_m}$ is converted from $F$ using an embedding matrix $W_{f1} \in \mathbb{R}^{D_b \times D_m}$, where $D_m$ is the dimension of the memory components. To capture the association between mentions and seen labels through the memory component, we model the attention $p_i$ between the mention input $u$ and each memory component $g_i$, where $u = Wm$ and $W \in \mathbb{R}^{D_e \times D_m}$. $m$ is the mention representation obtained from the Mention Processor in Sec. 3.2, and its dimension is $D_e$. The input memory representation $G$ works to update the association $P$ for each mention $u$.
We construct the output memory representations $C \in \mathbb{R}^{D_s \times D_m}$ from $F$ using another embedding matrix $W_{f2} \in \mathbb{R}^{D_b \times D_m}$. The attentions $p_i$ are used as weights over the output memory representations to obtain the associated mention embedding $o = \sum_i p_i c_i$, where $c_i$ is the $i$-th row of $C$. Finally, $(o + u)$ is the adjusted mention representation augmented by the information associated with seen types. It is then projected into the association space by $W_p \in \mathbb{R}^{D_s \times D_m}$.
To load unseen types into the association space and give our memory network its zero-shot capability, we use the similarities between the type representations to transfer knowledge from seen types to unseen types. The similarity is computed between type representations $f_i$ and $f_j$, where $f_i$ is taken from the unseen type representations $(f^u_1, \ldots, f^u_{D_u}) \in \mathbb{R}^{D_u \times D_b}$ during zero-shot testing, while $f_j$ is from $F$, which is used for training the model. We then obtain the similarity matrix $R \in \mathbb{R}^{D_s \times D_u}$ for all the unseen types during prediction. We use the associated mention embedding $o$, the mention input $u$, and the similarity matrix $R$ together to classify the zero-shot entity types in the association space.
In this way, we construct a 2-level shared space for zero-shot FNET with the memory network, as shown in Figure 1. The lower level is the semantic representation space, which is formed by $m$ in Sec. 3.2 and $f$ in Sec. 3.3. The higher level is the association space, which models the connections not only between mentions and seen types but also between seen types and unseen types. Therefore, we can trace the reasoning process of a prediction in the association space. For instance, suppose the association space contains the seen types "/SUBSTANCE" and "/DRUG". Given the mention "pills", it matches the unseen type "/SUBSTANCE/DRUG", as both "pills" and "/SUBSTANCE/DRUG" associate with the seen types "/SUBSTANCE" and "/DRUG".
We can also extend the memory components to handle multiple hop operations (Sukhbaatar et al., 2015) by stacking the memories sequentially, which we leave for future work.
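The forward pass of the zero-shot memory network can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention is assumed to be a softmax over dot products (as in end-to-end memory networks, Sukhbaatar et al., 2015), the type similarity is assumed to be cosine similarity, and a sigmoid is assumed to be applied before combining with $R$. All function names and toy values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def memory_scores(u, G, C, W_p, R):
    """u: mention input (D_m); G, C: input/output memories (D_s x D_m);
    W_p: projection (D_s x D_m); R: seen-to-unseen similarities (D_s x D_u)."""
    # Attention of the mention over seen types (ASSUMED: dot-product softmax).
    p = softmax([dot(u, g) for g in G])
    # Associated mention embedding: attention-weighted sum of output memories.
    o = [sum(p_i * c[d] for p_i, c in zip(p, C)) for d in range(len(u))]
    adjusted = [o_d + u_d for o_d, u_d in zip(o, u)]
    # Project into the association space, one dimension per seen type
    # (ASSUMED: sigmoid activation).
    assoc = [sigmoid(dot(row, adjusted)) for row in W_p]
    # Score each unseen type through its similarity to the seen types.
    n_unseen = len(R[0])
    scores = [sum(assoc[s] * R[s][j] for s in range(len(R)))
              for j in range(n_unseen)]
    return p, scores

# Toy setup: two seen types, two unseen types, D_m = 2.
seen = [[1.0, 0.0], [0.0, 1.0]]
unseen = [[1.0, 0.1], [0.1, 1.0]]
R = [[cosine(f_s, f_u) for f_u in unseen] for f_s in seen]  # ASSUMED: cosine
G = seen  # toy memories; in the paper these come from learned embeddings of F
C = seen
W_p = [[1.0, 0.0], [0.0, 1.0]]
u = [2.0, 0.0]  # mention input aligned with the first seen type

p, scores = memory_scores(u, G, C, W_p, R)
print(p, scores)
```

In this toy run the mention attends mostly to the first seen type, so the unseen type most similar to that seen type receives the higher score.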

Mention Processor
To better understand the entity mention, we not only consider the words contained in the mention, but also the context around it. The Mention Processor has two sub-components. A Word Processor is proposed to get the semantic meaning for each word in the entity mention. Another Context Processor is utilized to understand the sequential information together with the context. The final mention representation m is a concatenation of the word-level representation from the Word Processor and the sequential representation from the Context Processor as shown in Figure 1.

Word Processor
Following most existing works (Lample et al., 2016; Lin and Lu, 2018; Bari et al., 2019), the Word Processor is proposed to achieve a basic understanding of the words in the entity mentions. Given an input entity mention $X_w = (t_1, \ldots, t_K)$ with $K$ tokens, each token $t_k$ is represented as $[w_k; c_k]$, a concatenation of a pre-trained word embedding $w_k \in \mathbb{R}^{D_w}$ ($D_w$ is the dimension of the pre-trained word embedding) and a character-level embedding $c_k$, which provides morphological information and serves as a complement when faced with out-of-vocabulary (OOV) words. The character-level embedding $c_k \in \mathbb{R}^{D_{hc}}$ is obtained through a bi-directional LSTM ($D_{hc}$ is the dimension after concatenating the bi-directional hidden states), named the Character Bi-LSTM.
Additionally, another bi-directional LSTM, named the Word-Character Bi-LSTM, gathers the information from all the token embeddings in $X_w$ by concatenating the forward and backward hidden states into $m_w \in \mathbb{R}^{D_h}$, where $D_h$ is the dimension after concatenating the Word-Character Bi-LSTM hidden states. As illustrated in Figure 1, the Word Processor outputs a word-character embedding $m_w$ for each entity mention.

Context Processor
In the Context Processor, we leverage the powerful pre-trained language model BERT (Devlin et al., 2019) to incorporate two more context-aware parts into the mention representation: (1) $m_b$, the mention embedding given the context; (2) $m_c$, the surrounding context embedding.
Considering that a context-aware word embedding can carry syntactic features, we first apply BERT to embed the whole sentence and obtain the BERT contextual embedding for each token. For the tokens contained in the entity mention, named mention tokens, their BERT embeddings are denoted as $X_b$. For the tokens in the surrounding context, named context tokens, we only consider a fixed window for each mention to limit the computational cost. The BERT embeddings for the left context tokens are $e^l_1, \ldots, e^l_n$, and those for the right context tokens are $e^r_1, \ldots, e^r_n$, where $e^j_i \in \mathbb{R}^{D_b}$ and $j \in \{l, r\}$. $n$ is the window size, which we set to 10.
We use Bi-LSTMs (with concatenated bi-directional hidden state size $D_h$) to aggregate the separate token embeddings into the mention embedding $m_b$ and the context embedding $m_c$. $m_b$ is obtained from the BERT embeddings of the mention tokens $X_b$ with a Bi-LSTM, called the Mention Bi-LSTM, using its forward and backward hidden states. $m_c$ is obtained from the context tokens with a bi-directional LSTM with an attention mechanism, called the Attention Bi-LSTM. The attentions over the hidden states of all the context tokens are computed using a 2-layer feed-forward neural network.
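The fixed context window and the attention over context hidden states can be illustrated with a small dependency-free sketch. It is only a sketch under assumptions: the helper names are hypothetical, the hidden vectors are toy stand-ins for Bi-LSTM states, the hidden activation of the 2-layer scorer is assumed to be tanh, and the context embedding is assumed to be the attention-weighted sum of the hidden states.

```python
import math

def context_windows(tokens, start, end, n=10):
    """Split a sentence around the mention tokens[start:end], keeping at most
    n tokens on each side (the window size is set to 10 in the text)."""
    left = tokens[max(0, start - n):start]
    mention = tokens[start:end]
    right = tokens[end:end + n]
    return left, mention, right

def attention_pool(hidden, W1, w2):
    """Score each hidden state with a 2-layer feed-forward scorer
    (ASSUMED: tanh hidden layer, linear output), normalize with softmax,
    and return the weights and the weighted sum (ASSUMED aggregation)."""
    scores = []
    for h in hidden:
        z = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W1]
        scores.append(sum(w * v for w, v in zip(w2, z)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]
    return alpha, pooled

tokens = "John plays piano on the stage".split()
left, mention, right = context_windows(tokens, 0, 1, n=10)

# Toy hidden states standing in for the Attention Bi-LSTM outputs.
hidden = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 0.0]]  # hypothetical first-layer weights (1 hidden unit)
w2 = [1.0]         # hypothetical second-layer weights
alpha, pooled = attention_pool(hidden, W1, w2)
print(left, mention, right, alpha, pooled)
```

When the mention sits near a sentence boundary, as in this example, the window simply contains fewer than `n` tokens on that side; padding would be an alternative design choice.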

Mention Representation
The final entity mention representation $m \in \mathbb{R}^{D_e}$, with $D_e = D_h + D_h + D_h$, concatenates the word-character embedding, the mention embedding, and the context embedding: $m = [m_w; m_b; m_c]$.

Label Processor
Understanding the label is important in our task, since there is no information other than the label name for the zero-shot entity types. In the Label Processor, we get the semantic embeddings $B_S \in \mathbb{R}^{(D_s+D_u) \times D_b}$ for all the label names, including the seen labels $Y_{seen}$ and the unseen labels $Y_{unseen}$, using a pre-trained BERT model. The fine-grained labels and coarse-grained labels in $Y_{seen}$ and $Y_{unseen}$ naturally form a hierarchical structure. Each fine-grained type has a coarse-grained type as its root in the hierarchy. Following Ma et al. (2016), we use a sparse matrix $B_H \in \mathbb{R}^{(D_s+D_u) \times (D_s+D_u)}$ to represent the hierarchical structure of the labels. Each row $B_{H_i}$ corresponds to a binary hierarchical embedding for label $y_i$. For each entry in $B_{H_i}$, we use 1 to denote the label itself and its parent node, and 0 for the rest.
In the Label Processor, we integrate the semantic embeddings of the child label and its parent label into a single embedding vector as the fine-grained label representation. For a label $y_i$, the final label representation $f \in \mathbb{R}^{D_b}$ is formed jointly from the semantic embedding $B_S$ and its hierarchical embedding $B_{H_i}$, as shown in Figure 1.
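A small sketch may make the hierarchical embedding concrete. It builds the binary matrix described above (1 for the label itself and its parent, 0 elsewhere) and then combines it with toy semantic embeddings. The combination rule is an assumption made for illustration: each label representation is taken to be the sum of the semantic embeddings selected by its row of $B_H$. The label names, vectors, and function names are hypothetical.

```python
def hierarchy_matrix(labels):
    """Binary matrix B_H: row i has 1 at label i and at its parent (if any)."""
    index = {label: i for i, label in enumerate(labels)}
    B_H = [[0] * len(labels) for _ in labels]
    for label, i in index.items():
        B_H[i][i] = 1
        parent = label.rsplit("/", 1)[0]  # "/PERSON/ARTIST" -> "/PERSON"
        if parent in index:
            B_H[i][index[parent]] = 1
    return B_H

def label_representations(B_H, B_S):
    """ASSUMED combination: f_i = sum_j B_H[i][j] * B_S[j]."""
    dim = len(B_S[0])
    return [[sum(row[j] * B_S[j][d] for j in range(len(B_S)))
             for d in range(dim)]
            for row in B_H]

labels = ["/PERSON", "/PERSON/ARTIST", "/GPE"]
# Toy 2-dimensional stand-ins for the BERT label-name embeddings.
B_S = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

B_H = hierarchy_matrix(labels)
F = label_representations(B_H, B_S)
print(B_H)
print(F)
```

With this rule the representation of "/PERSON/ARTIST" mixes its own embedding with that of "/PERSON", while coarse-grained labels keep their own embedding unchanged.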

Loss function
We train our model with a multi-label max-margin ranking objective. Given an example mention $x$, $Y$ is the set of correct types assigned to $x$, and $p_{pos}$ is the probability of such a positive assignment. In contrast, $\bar{Y}$ is the set of incorrectly assigned types, and $p_{neg}$ is the probability of assigning a false label $neg \in \bar{Y}$ to $x$.
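As an illustration of a multi-label max-margin ranking objective, the sketch below uses the common pairwise hinge form, in which every correct type should outscore every incorrect type by a margin. This form and the default margin of 1.0 are assumptions for illustration and may differ from the exact objective used in the paper; the scores and type names are toy values.

```python
def max_margin_ranking_loss(scores, positive, margin=1.0):
    """Sum of hinge penalties over all (correct, incorrect) type pairs.
    scores: dict mapping type -> score; positive: set of correct types.
    ASSUMED form: max(0, margin - p_pos + p_neg) for each pair."""
    negatives = [t for t in scores if t not in positive]
    loss = 0.0
    for pos in positive:
        for neg in negatives:
            loss += max(0.0, margin - scores[pos] + scores[neg])
    return loss

scores = {"/PERSON": 0.9, "/GPE": 0.2, "/ORGANIZATION": 0.4}
loss = max_margin_ranking_loss(scores, {"/PERSON"})
loss_small_margin = max_margin_ranking_loss(scores, {"/PERSON"}, margin=0.1)
print(loss, loss_small_margin)
```

The loss falls to zero once every correct type is ranked above every incorrect type by at least the margin, which is the case for the smaller margin in this example.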

Datasets
We evaluate the performance of our model on three public datasets that are widely used in the FNET task. BBN (Weischedel and Brunstein, 2005) consists of 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy.
OntoNotes (Weischedel et al., 2011) has 13,109 news documents, of which 77 test documents are manually annotated using 89 types in a 3-level hierarchy.
Wiki (Ling and Weld, 2012) consists of 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 112 entity types are organized into a 2-level hierarchy.

Zero-shot Setting
We follow Ma et al. (2016) and Obeidat et al. (2019) in applying the zero-shot setting in which the training set only contains coarse-grained types (level-1), while all fine-grained types (level-2) only appear in the testing data. For the OntoNotes dataset, which has 3 levels, we combine level-1 and level-2 as the coarse-grained types for training, and use level-3 as the fine-grained types for testing.
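The split into seen and unseen types by hierarchy level can be sketched as follows. This is a minimal illustration that assumes types are written as slash-separated paths, so that the level of a type equals its number of slashes; the function name and the example type names are hypothetical.

```python
def split_types(types, coarse_depth=1):
    """Types at or above coarse_depth are seen (training); deeper types are
    unseen (testing). ASSUMES path-style names such as "/GPE/CITY"."""
    seen = {t for t in types if t.count("/") <= coarse_depth}
    unseen = {t for t in types if t.count("/") > coarse_depth}
    return seen, unseen

# 2-level hierarchy: level-1 is seen, level-2 is unseen.
two_level = ["/PERSON", "/PERSON/ARTIST", "/GPE", "/GPE/CITY"]
seen, unseen = split_types(two_level, coarse_depth=1)

# 3-level hierarchy: level-1 and level-2 are seen, level-3 is unseen.
three_level = ["/a", "/a/b", "/a/b/c"]
seen3, unseen3 = split_types(three_level, coarse_depth=2)
print(seen, unseen, seen3, unseen3)
```

By construction the two sets are disjoint, which matches the requirement $Y_{seen} \cap Y_{unseen} = \emptyset$ in the problem definition.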

Baselines
We compare the proposed method (MZET) and its variants with state-of-the-art FNET neural models. However, little research approaches zero-shot FNET without auxiliary resources or hand-crafted features.
Given this situation, we select the benchmarks and baselines as follows:
DZET. Obeidat et al. (2019) propose a neural structure to extract the mention representations but leverage Wikipedia to augment the label representations. We therefore only compare with them on the capability of the learned mention representation, and incorporate our label embedding methods to construct this baseline.
ProtoZET. Ma et al. (2016) first adapt zero-shot learning to FNET with hand-crafted features and propose prototype embeddings to form the label representation. Unfortunately, to the best of our knowledge, their system is not available online. We adopt its prototype label embedding technique and combine it with our Mention Processor for empirical comparison, as Obeidat et al. (2019) and Zhou et al. (2018) did before.
MZET + avg emb. For a better contrast of label embedding techniques, we replace the BERT label embedding in MZET with the averaged GloVe label embedding that is widely adopted for entity typing in previous works (Shimaoka et al., 2016; Abhishek et al., 2017; Yuan and Downey, 2018).

Training and Implementation Details
To train the neural network models, we optimize the multi-label max-margin loss function over the training data with respect to all model parameters. We adopt the Adam optimization algorithm with a decreasing learning rate of 0.0001 and a decay rate of 0.9. We use the pre-trained BERT (BERT-base, cased), which has 12 transformer blocks, a hidden layer size of 768, and 12 self-attention heads. We also choose pre-trained GloVe embeddings of size 300 for the word-character representation. The hidden state size of the LSTMs is 200. We use the hyperparameter τ as the maximum gap for selected labels. τ is optimized on validation sets (10% of the testing examples). A second strategy applies to prediction on the overall dataset: we perform type inference over the predicted fine-grained type to include its parent coarse type in the final decision, as we expect such type inference to improve recall.
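The two prediction strategies can be sketched as follows. The reading of τ used here is an assumption: a type is selected when its score lies within τ of the best score. The type inference step adds the ancestors of each predicted fine-grained type, which in a 2-level hierarchy is just its parent coarse type. Function names, scores, and the value of τ are hypothetical.

```python
def select_types(scores, tau):
    """ASSUMED reading of tau: keep every type whose score is within tau
    of the highest score."""
    best = max(scores.values())
    return {t for t, s in scores.items() if best - s <= tau}

def add_parents(predicted):
    """Type inference: include the ancestor types of each predicted type."""
    expanded = set(predicted)
    for t in predicted:
        parts = t.strip("/").split("/")
        for k in range(1, len(parts)):
            expanded.add("/" + "/".join(parts[:k]))
    return expanded

scores = {"/GPE/CITY": 0.8, "/GPE/COUNTRY": 0.75, "/PERSON/ARTIST": 0.1}
selected = select_types(scores, tau=0.1)
final = add_parents(selected)
print(selected, final)
```

Adding the parent type can only add labels to a prediction, which is why it is expected to help recall rather than precision.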

Results and Discussion
Zero-Shot FNET Evaluation. We first evaluate our methods for FNET on the BBN dataset. Following Ma et al. (2016), we train the models on coarse-grained types and test in three ways: (1) Overall, predicting on both coarse-grained and fine-grained testing types; (2) Level 1, predicting only on coarse-grained types; (3) Level 2, predicting only on fine-grained types, which are unseen during training. Level 1 shows the performance for seen types, Level 2 evaluates the ability for zero-shot FNET, and Overall balances the performance between seen types and unseen types. Table 1 reports the performance of the baselines and MZET on these three aspects. For coarse-grained typing (Level 1), MZET improves strict accuracy significantly, by up to 19%. For the zero-shot setting that tests on fine-grained types (Level 2), MZET achieves the highest scores and gains up to 10% in strict accuracy. Over the overall types, MZET attains the best results with a 9% gain in accuracy. Compared to MZET+avg emb and ProtoZET, MZET gains significant performance from the Label Processor. Apart from the benefit of the Label Processor, MZET also takes advantage of the Mention Processor and the Memory Network to become the best performer among the baselines. Finally, performance on all-grained types indicates the superiority of MZET over the rest, especially for Micro-F1, which indicates the achievements over infrequent types.
To show the effectiveness of our proposed model, not only on the unseen fine-grained types but also on seen coarse-grained types, we evaluate the overall performance on three benchmark FNET datasets: BBN, OntoNotes, and Wiki. As Table 2 shows, MZET brings significant improvements on the small datasets, BBN and OntoNotes. For the large Wiki dataset, MZET also attains the highest scores on all metrics. Compared with ProtoZET and MZET+avg emb, MZET shows only a small improvement on Wiki, but it surpasses OTyper and DZET+bert by almost 5% in strict accuracy. This indicates that as the size of the data increases, the Mention Processor and the Memory Network play a great role in our model's growing strength. As mentioned before, OntoNotes contains fine-grained entity types in a 3-level hierarchy. From the results on OntoNotes, MZET shows the superiority of its Memory Network and Mention Processor by a significant margin, especially for the most fine-grained dataset.
Ablation Study. We carry out ablation studies that quantify the contribution of each component in our framework shown in Figure 1. As Table 3 shows, the vital parts are the memory network and the word and character representation. Removing either of them decreases strict accuracy significantly, by over 2.5%. The memory network contributes decent improvements on fine-grained typing, which indicates the noteworthy associations between the seen labels and mentions, as well as between seen labels and unseen labels. The word and character representation shows its importance in capturing the morphological and semantic information of a single entity mention. The second most important part is the informative context with attention, which is aggregated into the final representation of the mention to guide the classification. Last, $m_b$ in Figure 1 plays a considerable complementary role, as having BERT embed a mention enables the model to gather more contextual information and avoid ambiguity for polysemous words, like the word "valley" in the mention "Silicon Valley".
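For readers unfamiliar with the reported metrics, the sketch below computes strict accuracy, Macro-F1, and Micro-F1 for multi-label typing. It assumes the definitions commonly used in the FNET literature (Ling and Weld, 2012): strict accuracy requires an exact match of the predicted and gold type sets, Macro-F1 averages per-mention precision and recall, and Micro-F1 pools type counts over all mentions. The function names and toy predictions are hypothetical.

```python
def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def evaluate(gold, pred):
    """gold, pred: lists of type sets, one pair per mention."""
    pairs = list(zip(gold, pred))
    strict_acc = sum(g == p for g, p in pairs) / len(pairs)

    # Macro: average per-mention precision and recall.
    with_pred = [(g, p) for g, p in pairs if p]
    with_gold = [(g, p) for g, p in pairs if g]
    macro_p = (sum(len(g & p) / len(p) for g, p in with_pred) / len(with_pred)
               if with_pred else 0.0)
    macro_r = (sum(len(g & p) / len(g) for g, p in with_gold) / len(with_gold)
               if with_gold else 0.0)

    # Micro: pool counts over all mentions.
    overlap = sum(len(g & p) for g, p in pairs)
    n_pred = sum(len(p) for _, p in pairs)
    n_gold = sum(len(g) for g, _ in pairs)
    micro_p = overlap / n_pred if n_pred else 0.0
    micro_r = overlap / n_gold if n_gold else 0.0

    return {"strict_acc": strict_acc,
            "macro_f1": f1(macro_p, macro_r),
            "micro_f1": f1(micro_p, micro_r)}

gold = [{"/GPE", "/GPE/CITY"}, {"/PERSON"}]
pred = [{"/GPE"}, {"/PERSON"}]
metrics = evaluate(gold, pred)
print(metrics)
```

In this toy case the first mention misses its fine-grained type, so strict accuracy penalizes it fully while the two F1 scores give partial credit.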

Case Study
We visualize how to match the entity mention and type in the association space in Figure 2. Example 1 is a simple case as the unseen label words appear in seen types. This case shows agreement matching on most of the dimensions between the unseen type "/person/artist" and the mention "Carel Balth". Example 2 is more complex in the type similarity map as the new word "hospital" has scattered associations with multiple seen types, like "medicine" and "disease". But they provide informative association about the unseen type for linking it to related mentions, like "Baxter Creek Veterinary Clinic" in the case.
Error Analysis. We also provide insights into specific reasons for the mistakes made by our model. First, all the datasets follow long-tail frequency distributions, and the examples for each label are significantly imbalanced. Accordingly, the model is prone to assigning frequent types in place of infrequent ones. For example, the training set contains 719 examples of "/LOCATION" and 6,672 examples of "/GPE" (Geopolitical Entity). The model prefers predicting the fine-grained type "/GPE/CITY" rather than "/LOCATION/REGION".
Second, types are incorrectly tagged in the raw data. To estimate the ratio of incorrect tagging, we randomly pick 100 examples from the raw data, including types from both the training and testing sets. We find that about 11% of the examples for BBN, 10% for OntoNotes, and 13% for Wiki are noisy, such as mentions with incoherent labels or mentions missing the correct words for the corresponding tagged labels. For example, "The government estimates corn output at 7.45 billion bushels , up 51% from last fall." labels "The" with two types: ["/ORGANIZATION/CORPORATION", "/ORGANIZATION"]. The correct mention for the assigned labels should be "The government" rather than "The".
Figure 2: Two examples to show the association between the zero-shot fine-grained type and a mention in an utterance. Their similarities with seen types are shown with the heat maps. Each dimension in the map links to one seen type. The type heat map at the top denotes the similarity between this unseen fine-grained type and all seen types. The mention heat map at the bottom is the output from the Zero-shot Memory Network before the component Sigmoid + R in Figure 1.

Related Work
FNET is a long-standing task in Natural Language Processing (Xia et al., 2019). Most of the proposed FNET methods are based on distant supervision but differ in classification architectures. Ling and Weld (2012) propose a multi-label, multi-class multilayer perceptron model that assigns each mention its corresponding label tags. Nakashole et al. (2013) type newly emerging out-of-knowledge-base entities with a fine-grained typing system and harness relational paraphrases with type signatures for probabilistic weight computation. Del Corro et al. (2015) design a system, FINET, with the help of WordNet (Miller, 1995). Ren et al. (2016) propose AFET for automatic fine-grained entity typing with hand-crafted features and label embeddings from the hierarchical type path. Shimaoka et al. (2016) and Abhishek et al. (2017) adopt attentive neural network models for FNET. Attention information and contextual embeddings are proposed to enhance FNET performance. These methods have developed from hand-crafted features to neural network learned features, making fine-grained typing systems more automatic and effective. However, their architectures cannot be applied to new and unseen entity types.
To handle unseen types, zero-shot learning (Xia et al., 2018) is introduced for named entity typing. Several works (Zhou et al., 2018) propose to solve unseen entity typing with clustering. These works cluster mentions and propagate type information from representative mentions to unseen types. Another direction is to construct a shared space for linking the seen and unseen data. These models (Ma et al., 2016; Yuan and Downey, 2018; Obeidat et al., 2019) map the mention and label embeddings into a shared latent space, then estimate the closeness score for each mention-label pair. Most of the existing zero-shot FNET methods limit the model's flexibility by relying on considerable auxiliary resources or pre-prepared hand-crafted features. From the perspective of entity type representations, the most recent works (Zhou et al., 2018; Obeidat et al., 2019) obtain informative entity type representations by assembling related Wikipedia pages. Their performance is decent yet resource-intensive. Others (Ren et al., 2016; Ma et al., 2016; Shimaoka et al., 2016; Abhishek et al., 2017; Yuan and Downey, 2018) exploit typical pre-trained semantic label embeddings, which are easy to apply but weaker in performance. Apart from the various methods for entity type representation, mention representation approaches have also evolved recently. Yuan and Downey (2018) and Ma et al. (2016) utilize pre-prepared hand-crafted features, while others (Obeidat et al., 2019; Zhou et al., 2018) embed the mention with pre-trained word embedding methods (Pennington et al., 2014; Peters et al., 2018). However, these methods are insufficient and inefficient when challenged by polysemy, ambiguity, or newly emerged mentions.

Conclusions
In this paper, we propose an end-to-end neural network, MZET, that enables zero-shot fine-grained named entity typing. It extracts comprehensive representations covering words and characters, the mention, the mention's context, and the raw label text, without auxiliary information. It adopts the memory network to gather these representations for the zero-shot paradigm. Extensive experiments on three public datasets show the strong performance of MZET, which surpasses the state-of-the-art neural network models for zero-shot FNET.