Global Bootstrapping Neural Network for Entity Set Expansion

Bootstrapping for entity set expansion (ESE), which expands new entities using only a few seed entities as supervision, has long been studied. Recent end-to-end bootstrapping approaches have shown their advantages in capturing information and modeling the bootstrapping process. However, due to the sparse supervision problem, previous end-to-end methods often leverage only information from near neighborhoods (local semantics) rather than information propagated through the co-occurrence structure of the whole corpus (global semantics). To address this issue, this paper proposes the Global Bootstrapping Network (GBN) together with “pre-training and fine-tuning” strategies for effective learning. Specifically, it contains a global-sighted encoder that captures and encodes both local and global semantics into entity embeddings, and an attention-guided decoder that sequentially expands new entities based on these embeddings. The experimental results show that the GBN learned by the “pre-training and fine-tuning” strategies achieves state-of-the-art performance on two bootstrapping datasets.


Introduction
Bootstrapping is a classical technique for entity set expansion (ESE), which starts from several seed entities of a specific category (e.g., {London, Beijing, U.S.} for the GPE category) and then iteratively expands the entity set to cover more entities of the category (e.g., Egypt and Harare). Most previous ESE studies (Riloff and Jones, 1999; Curran et al., 2007; Yan et al., 2019) adopt the pipelined paradigm (see Figure 1a), which iteratively evaluates patterns using seeds, matches and evaluates entities using patterns, and adds the top entities to the seed set. Such a pipelined paradigm makes it hard to represent the whole bootstrapping process as a single learnable model, and the implementations of ESE systems are often very ad hoc.
Having witnessed the drawbacks of the pipelined bootstrapping paradigm, recent studies have started turning to the end-to-end paradigm. For instance, Yan et al. (2020) propose the first end-to-end bootstrapping neural network for ESE, which uses an encoder-decoder architecture (see Figure 1b): the encoder encodes the co-occurrence relations between entities and patterns into their embeddings; the decoder models bootstrapping as a sequential entity generation process, and the generated entities are used as expansion results. Compared with the pipelined paradigm, the end-to-end paradigm represents the whole bootstrapping process as a single learnable model and is therefore more flexible and capable of leveraging more information.
One of the biggest challenges of end-to-end bootstrapping is how to learn the model effectively, since only very sparse supervision signals (i.e., several seed entities) are provided. In general, bootstrapping systems expand entities based on the entity-pattern duality assumption that "similar entities will share similar patterns, and similar patterns will match similar entities". Based on this assumption, a bootstrapping network should represent entities/patterns by leveraging both their near neighborhoods (local semantics) and the information propagated via the entity/pattern co-occurrence structure of the whole corpus (global semantics). Currently, with only several seeds as supervision signals, previous end-to-end bootstrapping models often aggregate only neighborhood information to represent entities/patterns, so the final representations are mostly learned from their neighborhoods (short-sighted) rather than from global information (global-sighted). This is a serious issue because most entities are long-tail (Zipf, 1935) and match only a limited number of patterns; as a result, the local semantics cannot provide reliable and informative representations for effective bootstrapping (see Figure 2 for an example).
To address the sparse supervision problem, this paper proposes a new end-to-end bootstrapping neural network for ESE, called Global Bootstrapping Network (GBN), which can effectively capture the global information of a corpus via an augmented entity-pattern bipartite graph, and learn to leverage both the local and the global semantics for bootstrapping via effective pre-training and fine-tuning strategies. Our method is motivated by the recent success of the pre-training and fine-tuning strategy in addressing sparse supervision challenges (Devlin et al., 2019; Hu et al., 2020).
Concretely, the Global Bootstrapping Network adopts the encoder-decoder architecture. The encoder is a global-sighted graph neural network, in which each layer aggregates rich information not only from directly linked entities and patterns but also from entities and patterns multiple hops away via augmented links. The decoder is an attention-guided RNN model, which efficiently generates expansion results based on the global-sighted entity representations. Compared with previous methods, GBN aggregates global information rather than only local neighborhood information, and is therefore more reliable even for long-tail entities/patterns with sparse links.
To learn the GBN, we propose several pre-training and fine-tuning algorithms: 1) In the pre-training stage, we design both self-supervised and supervised pre-training strategies to learn the encoder in the GBN, which ensures that the learned representations of entities/patterns capture both the local and global semantics. 2) In the fine-tuning stage, based on the learned representations, we use a multi-view learning algorithm to fine-tune GBN to fit a specific bootstrapping task using only a few seeds.

Figure 2: An example of the local/global-sighted neighborhood with/without augmented links (we use a long-tail GPE, "Harare", as the center entity). By adding an augmented link, "Harare" can easily observe its global-sighted neighborhood, such as the strong GPE patterns "visit to *", "located in *", etc., and therefore it can be accurately expanded.
To summarize, the main contributions are:
1. We propose a new end-to-end bootstrapping neural network, GBN, which leverages global-sighted information and encodes it into entity embeddings.
2. We propose a novel pre-training and fine-tuning strategy for learning a bootstrapping network with only sparse supervision signals. In pre-training, our method learns entity/pattern representations by effectively exploiting co-occurrence information in the corpus; in fine-tuning, our method can be easily fitted to a specific bootstrapping task.

Entity-Pattern Bipartite Graph Construction
This section describes how to construct the entity-pattern bipartite graph, which captures the global structure of entity-pattern co-occurrences in the original ESE corpus. Furthermore, augmented links are added for long-tail entities/patterns. Traditionally, entity-pattern duality is modeled as a set of individual ⟨entity, pattern⟩ entries (Riloff and Jones, 1999; Curran et al., 2007; Shi et al., 2014), e.g., {⟨Harare, * court⟩, ⟨London, visit to *⟩, ...}. However, such a data model considers different ⟨entity, pattern⟩ entries independently, which makes it hard to leverage the global co-occurrence structure.
To capture both the local and global semantics, we follow Yan et al. (2020) and use the entity-pattern bipartite graph. Concretely, the entities and patterns are represented as graph nodes, and an entity and a pattern are linked if the pattern matches the entity in the corpus. Finally, the entity-pattern bipartite graph is formulated as a tuple G = ⟨V, E, S⟩, where V is the node set of entities and patterns, E is the set of edges connecting entities and patterns, and S is the set of seed entities with their corresponding labels.
Graph augmentation. In real-world corpora, entities/patterns usually follow a long-tail distribution, so most entities/patterns have only a few links to others. Such a sparse neighborhood makes it challenging to effectively leverage both the local and global semantics (see Figure 2).
To address this issue, we add augmented links to the constructed graph. Specifically, if there exist at least M paths of at most K hops between an unlinked entity-pattern pair, we add an augmented link between them (see Figure 2). In this paper, we set both M and K to 2 for efficiency and effectiveness.
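The augmentation rule can be illustrated with a minimal Python sketch. It rests on an assumption that is not stated in the text: because every entity-to-pattern path in a bipartite graph has an odd number of edges, "K = 2 hops" is read here as a path with two intermediate nodes (entity, pattern, entity, pattern). The function name `augment_links` and the toy edges are hypothetical.

```python
from collections import defaultdict

def augment_links(edges, min_paths=2):
    """Return augmented (entity, pattern) links.

    ASSUMPTION: a qualifying path is entity - p1 - e2 - pattern, i.e. it has
    two intermediate nodes; the paper does not define how hops are counted.
    """
    ent_to_pats = defaultdict(set)
    pat_to_ents = defaultdict(set)
    for entity, pattern in edges:
        ent_to_pats[entity].add(pattern)
        pat_to_ents[pattern].add(entity)

    augmented = set()
    for entity, own_patterns in ent_to_pats.items():
        path_counts = defaultdict(int)
        for p1 in own_patterns:
            for e2 in pat_to_ents[p1]:
                if e2 == entity:
                    continue  # keep the path simple
                for p2 in ent_to_pats[e2]:
                    if p2 in own_patterns:
                        continue  # already linked (also excludes p1)
                    path_counts[p2] += 1
        for pattern, count in path_counts.items():
            if count >= min_paths:
                augmented.add((entity, pattern))
    return augmented

# Toy graph: "Harare" never co-occurs with "visit to *" directly.
toy_edges = {
    ("Harare", "* court"), ("Harare", "in *"),
    ("London", "* court"), ("London", "visit to *"),
    ("Beijing", "in *"), ("Beijing", "visit to *"),
}
new_links = augment_links(toy_edges, min_paths=2)
```

In the toy graph, "Harare" reaches "visit to *" through two distinct paths (via "London" and via "Beijing"), so the pair receives an augmented link when M = 2.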

Global Bootstrapping Network
This section describes the Global Bootstrapping Network (GBN), which adopts the encoder-decoder architecture (see Figure 3) and contains:
• GBEncoder: a global-sighted GNN encoder, which takes the augmented bipartite graph as input and encodes both local and global semantics into entity/pattern embeddings.
• GBDecoder: an attention-guided RNN decoder, which iteratively generates new entities as expansion results based on their global-sighted embeddings.

GBEncoder
Given the entity-pattern bipartite graph, the GBEncoder embeds entities and patterns by leveraging both the local and the global semantics. The global-sighted embeddings can then be used to perform global-sighted entity set expansion.
Architecture. To capture both the local and global semantics, we use a multi-layer graph neural network, where each layer aggregates information from the node neighborhood through both original links and augmented links (Eq. 1). Here, i denotes the i-th node to be updated, N(i) is the set of nodes linked to the i-th node by either original or augmented links, h_i^l is the node representation after the l-th layer, a_j^{l−1} is the updating weight for neighboring node j, and σ is a non-linear mapping function (this paper uses ReLU).
Attention mechanism. The updating weights in Eq. 1 are critical for finding related patterns/entities and filtering out noise. To estimate them accurately, we use an attention mechanism in which g(·) is the scaled dot-product attention function, and W_a and W_b are learnable parameter matrices. To calculate the attention score, we use the following three features:
• Node feature h_k: the representation of node k from the previous layer.
• Distance feature d_k: a learnable distance embedding. The distance between two nodes equals 1 if they are directly linked, and 2 if they are linked by an augmented link.
• Link type feature t_k: a learnable link type embedding. This paper uses three link types: before, middle and after (see footnote 2).
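To make the attention-weighted aggregation concrete, here is a simplified sketch of one node update in plain Python. It is not the paper's exact layer: the learnable projections W_a and W_b are omitted, and each neighbor's attention key is assumed to be a vector that already combines its node, distance and link type features. The names `aggregate`, `neighbor_keys` and `neighbor_states` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot(query, key):
    """Scaled dot-product score between two equal-length vectors."""
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))

def aggregate(h_i, neighbor_keys, neighbor_states):
    """Update node i from its original and augmented neighbors.

    ASSUMPTION: neighbor_keys[j] already merges the node, distance and
    link type features of neighbor j; projections are left out.
    """
    weights = softmax([scaled_dot(h_i, key) for key in neighbor_keys])
    dim = len(neighbor_states[0])
    summed = [
        sum(w * state[d] for w, state in zip(weights, neighbor_states))
        for d in range(dim)
    ]
    return [max(0.0, x) for x in summed], weights  # ReLU non-linearity

new_state, attn = aggregate(
    h_i=[1.0, 0.0],
    neighbor_keys=[[1.0, 0.0], [1.0, 0.0]],
    neighbor_states=[[2.0, -1.0], [2.0, -1.0]],
)
```

With two identical neighbors the attention weights are uniform, and the ReLU clips the negative coordinate of the weighted sum.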
Node initialization. This paper initializes entity/pattern representations as the average of their token embeddings, using pre-trained GloVe vectors (Pennington et al., 2014). There are many other choices for initialization, such as CNN and BERT (Devlin et al., 2019). Based on pilot experiments, this paper adopts the average token embeddings for their simplicity and effectiveness.
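A minimal sketch of this initialization is shown below. The toy lookup table stands in for real GloVe vectors, and the lowercasing, whitespace tokenization and zero-vector fallback for unknown tokens are assumptions made for illustration, not details given in the paper.

```python
def average_token_embedding(text, table, dim):
    """Average the embeddings of the tokens of `text` found in `table`.

    ASSUMPTIONS: whitespace tokenization, lowercasing, and a zero vector
    when no token is in the table.
    """
    tokens = text.lower().split()
    vectors = [table[token] for token in tokens if token in table]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Toy 2-dimensional table; real GloVe vectors would be loaded from a file.
toy_table = {"visit": [1.0, 0.0], "to": [0.0, 1.0]}
pattern_vec = average_token_embedding("visit to *", toy_table, dim=2)
unknown_vec = average_token_embedding("Harare", toy_table, dim=2)
```

Here the placeholder token "*" is simply skipped because it is absent from the toy table.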
Compared with the previous end-to-end model (Yan et al., 2020), the GBEncoder differs in two aspects: 1) it can leverage more information between entities by introducing distance and link type features; 2) it has a more global-sighted receptive field, because it explicitly models augmented links and passes messages through them.

GBDecoder
Using the global-sighted entity embeddings from GBEncoder, the GBDecoder sequentially generates expanded entities using a recurrent neural network.
Specifically, the GBDecoder is a GRU-based model (Cho et al., 2014) that operates on the global-sighted embeddings, where the GRU hidden state is used as the category embedding. The GBDecoder expands entities with the following bootstrapping schema:
1. At the very beginning, the seed entities are used to update the category embedding via the category updating function.
2. The unexpanded entities are scored by their similarity to the category embedding, computed with the similarity function, and the top ones are expanded.
3. The expanded entities are added to the seed set, and the category updating function is used to update the category embedding.
4. Go to step 2 unless the final iteration has been reached.

Attention-guided category updating function.
To adaptively capture the target category semantics throughout the bootstrapping process, the GBDecoder updates the GRU hidden state (the category embedding) at each step using the previously expanded entities. The GRU step takes h_c^{t−1} and s_c^t as its inputs, where h_c^{t−1} is the hidden state vector of category c after step t − 1 (h_c^0 is set to all zeros), and s_c^t is the embedding of the entities expanded in the last step.
Footnote 2: before, middle and after correspond respectively to the entity appearing before, within, or after the pattern.
To avoid introducing noise when updating the category embedding, it is crucial to filter out the noisy expansions from the last step. Therefore, we use an attention mechanism to compute s_c^t, where s_{c,i}^{t−1} is the embedding of the i-th entity expanded for category c at step t − 1, and g(·) is a score function (this paper uses the scaled dot product). We set s_c^t to all zeros if there is no expanded entity.
Similarity function. This paper computes the cosine similarity between the category embedding and v_i, the global-sighted embedding of entity i. The top-N unexpanded entities with the highest similarity scores are expanded at step t.
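The decoding loop described in this section can be sketched in a few lines of plain Python. Two simplifications are assumptions of this sketch rather than the paper's design: the GRU step is replaced by a simple average of the old category embedding and the pooled input (`update_category` is a stand-in, not a GRU), and the toy embeddings are invented. All function names are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0

def attention_pool(h_c, expanded):
    """Attention-weighted sum of the last expansions (all zeros if none)."""
    if not expanded:
        return [0.0] * len(h_c)
    scale = math.sqrt(len(h_c))
    weights = softmax([dot(h_c, s) / scale for s in expanded])
    return [sum(w * s[d] for w, s in zip(weights, expanded))
            for d in range(len(h_c))]

def update_category(h_c, s_c):
    # STAND-IN for the GRU step: a plain average, used only for illustration.
    return [(a + b) / 2.0 for a, b in zip(h_c, s_c)]

def bootstrap(embeddings, seeds, iterations, top_n):
    dim = len(next(iter(embeddings.values())))
    h_c = [0.0] * dim                      # category embedding starts at zero
    known, last = list(seeds), list(seeds)
    for _ in range(iterations):
        s_c = attention_pool(h_c, [embeddings[e] for e in last])
        h_c = update_category(h_c, s_c)
        candidates = [e for e in embeddings if e not in known]
        if not candidates:
            break
        candidates.sort(key=lambda e: cosine(embeddings[e], h_c), reverse=True)
        last = candidates[:top_n]          # expand the top-N entities
        known.extend(last)
    return known[len(seeds):]

toy_embeddings = {"London": [1.0, 0.0], "Beijing": [0.9, 0.1],
                  "Harare": [0.8, 0.2], "Monday": [0.0, 1.0]}
expanded = bootstrap(toy_embeddings, ["London", "Beijing"], iterations=1, top_n=1)
```

Starting from the two city seeds, the first iteration ranks "Harare" above "Monday" because its embedding is far closer in direction to the category embedding.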

Learning GBN with Pre-training and Fine-tuning
In this section, we describe how to learn GBN effectively using the "pre-training and fine-tuning" strategy (Devlin et al., 2019). In the pre-training stage, we adopt both self-supervised and supervised pre-training algorithms to pre-train the GBEncoder; in the fine-tuning stage, we adopt the multi-view learning algorithm to fine-tune both the GBEncoder and the GBDecoder. In this way, the sparse supervision problem can be effectively alleviated.

Pre-training Strategies
The pre-training stage mainly aims to pre-train the GBEncoder to effectively capture both the local and global semantics from entity-pattern graphs. Specifically, we want the GBEncoder to aggregate related information for all entities and patterns from their global-sighted neighborhoods while ignoring the noise. To this end, the GBEncoder should be able to effectively leverage as much information as possible from the dataset and task definition, including the inherent structural information within each dataset (self-supervised) and the labeled entity/pattern information within human-annotated datasets (supervised).
Algorithm 1 Multi-View Fine-Tuning Algorithm
Require: A bipartite graph G, seed entities (SEs)
1: Construct GBTeacher with the GBEncoder followed by an MLP classifier
2: Learn GBTeacher using SEs, and predict entity labels using the GBTeacher
3: while the final iteration is not reached do
4:   Learn GBN using predicted entity labels and expand seeds using GBN
5:   Learn GBTeacher using SEs and expanded entities, and predict entity labels using the GBTeacher
6: end while

Self-supervised pre-training. The self-supervised pre-training strategies are designed to leverage the structural information contained in the dataset and the task definition without the help of manually labeled supervision signals. We design the following learning algorithms for self-supervised pre-training:
• Neighborhood learning. This strategy learns to discriminate the neighboring nodes of a given node from the nodes many hops away from it. The motivation is that a link between an entity and a pattern usually indicates relevance, whereas a long path between them usually indicates irrelevance. Therefore, we want the learned entity and pattern embeddings to be more similar if the nodes are neighbors than if they are many hops away. To this end, we maximize an objective defined over v_i, the output embedding of node i; N(i), the set of nodes directly linked to node i; and N̄(i), the set of nodes at least n hops away from i. In this paper, we set n = 20 following Yan et al. (2020).
• Masked link prediction. This strategy learns to predict the masked links between entities and patterns. Specifically, we randomly mask a fixed ratio r of the existing links between entities and patterns in the bipartite graph; then we use the GBEncoder to encode the masked graph; finally, we predict whether the link between entity i and pattern j is masked using a function g_LP(·), an MLP applied to e_i and p_j, the embeddings of entity i and pattern j output by the GBEncoder. In this paper, we experimentally set r to 0.1. For training, we sample one negative link per masked link.
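The data preparation for the masked link prediction strategy can be sketched as follows. The sketch only covers link masking and negative sampling, not the GBEncoder or the MLP predictor; the function name `mask_links`, the uniform sampling of negatives, and masking at least one link are assumptions made for illustration.

```python
import random

def mask_links(edges, ratio=0.1, seed=0):
    """Split links into visible and masked sets and draw negatives.

    ASSUMPTIONS: negatives are unlinked (entity, pattern) pairs drawn
    uniformly at random, and at least one link is always masked.
    """
    rng = random.Random(seed)
    edges = sorted(set(edges))
    edge_set = set(edges)
    entities = sorted({entity for entity, _ in edges})
    patterns = sorted({pattern for _, pattern in edges})
    if len(edge_set) == len(entities) * len(patterns):
        raise ValueError("no unlinked pair available for negative sampling")

    n_mask = max(1, int(len(edges) * ratio))
    masked = rng.sample(edges, n_mask)
    masked_set = set(masked)
    visible = [edge for edge in edges if edge not in masked_set]

    negatives = []  # one negative link per masked link
    while len(negatives) < len(masked):
        candidate = (rng.choice(entities), rng.choice(patterns))
        if candidate not in edge_set:
            negatives.append(candidate)
    return visible, masked, negatives

toy_edges = [(f"e{i}", f"p{j}") for i in range(5) for j in range(5)
             if (i + j) % 2 == 0]
visible, masked, negatives = mask_links(toy_edges, ratio=0.3)
```

The encoder would then be run on the `visible` links only, with the `masked` links as positive targets and the `negatives` as negative targets.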
Supervised pre-training. In addition, some datasets provide manually labeled node types, which can serve as good supplementary supervision. We exploit them using the following algorithm:
• Node label prediction. This strategy learns to predict the given entity labels in the supervised datasets. Specifically, we predict the entity labels by applying g_TP(·), another MLP function, followed by σ(·), the sigmoid activation function.
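As a hedged sketch, a form consistent with these definitions would be the following, where v_i (the GBEncoder embedding of node i) and \hat{y}_i (the vector of predicted label probabilities) are notational assumptions introduced here:

```latex
\hat{y}_i = \sigma\big(g_{TP}(v_i)\big)
```

Under this reading, each coordinate of \hat{y}_i is an independent probability for one entity label, which is what a sigmoid output (rather than a softmax) provides.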
Note that the GBDecoder needs seed entities (supervision signals) to start the bootstrapping process, so it cannot be pre-trained without supervision. Therefore, without loss of generality, this paper does not pre-train the GBDecoder and only fine-tunes it.

Fine-tuning via Multi-View Learning
After pre-training the GBEncoder, this paper fine-tunes both the GBEncoder and the GBDecoder on the bootstrapping dataset using the multi-view learning algorithm proposed by Yan et al. (2020).
Specifically, this paper first constructs an auxiliary neural network, called GBTeacher, to directly predict the entity labels; it consists of a GBEncoder followed by an MLP classifier. Then we iteratively optimize the GBTeacher and the GBN following the steps in Alg. 1.
CoNLL is a dataset constructed from the CoNLL-2003 shared task data (Tjong Kim Sang and De Meulder, 2003), which contains 4 entity types.
OntoNotes is a sparse dataset constructed from the OntoNotes datasets (Pradhan et al., 2013) but without numerical categories, which contains 11 entity types. The patterns are n-grams (n ≤ 4).
As for the pre-training datasets, we use Wikigold (Balasuriya et al., 2009), GUM (Zeldes, 2017) and half of DocRED (Yao et al., 2019) for supervised pre-training, and the remaining half of DocRED, without labels, for self-supervised pre-training.
Baselines. We use the following baselines:
1) LP: the classical label propagation method, which propagates the seed labels to other entities based on co-occurrence features.
2) Gupta (Gupta and Manning, 2014): a classical bootstrapping system that evaluates patterns and new entities by learning an entity classifier.
3) Emboot (Zupon et al., 2019): this method follows Gupta and Manning (2014), but learns custom word embeddings for entities and patterns, which are used to guide the entity classifier.
4) LTB (Yan et al., 2019): this method performs a lookahead search with the MCTS algorithm to capture more information for each entity.
5) BootstrapNet (Yan et al., 2020): this method uses an end-to-end model to capture information from entity/pattern neighborhoods and expands seeds without an attention mechanism. In other words, it is the short-sighted counterpart of our method in both the model and the learning algorithm.
Metrics. To evaluate these methods, we follow Zupon et al. (2019) and report the cumulative precision-throughput curve. We also report P@Iter.K (the precision after the K-th expansion iteration, K = 1, 10, 20) and the corresponding MAP (mean average precision).
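As an illustration of the precision metric, the sketch below computes the precision after K iterations. It assumes the metric is cumulative, i.e. it counts all entities expanded in iterations 1 through K; the function name and the toy data are hypothetical.

```python
def precision_at_iteration(expansions, gold, k):
    """Precision over all entities expanded in the first k iterations.

    ASSUMPTION: the metric is cumulative over iterations 1..k.
    `expansions` is a list with one list of expanded entities per iteration.
    """
    expanded = [entity for step in expansions[:k] for entity in step]
    if not expanded:
        return 0.0
    correct = sum(1 for entity in expanded if entity in gold)
    return correct / len(expanded)

toy_expansions = [["Egypt", "Monday"], ["Harare", "Paris"]]
toy_gold = {"Egypt", "Harare", "Paris"}
p_at_1 = precision_at_iteration(toy_expansions, toy_gold, k=1)
p_at_2 = precision_at_iteration(toy_expansions, toy_gold, k=2)
```

In the toy run, one of the two entities expanded in the first iteration is correct, and three of the four entities expanded over both iterations are correct.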
Other Settings. Our pre-training strategy is to first perform the self-supervised pre-training and then the supervised pre-training on the pre-training datasets. After that, we fine-tune the GBN on the bootstrapping datasets.
For all methods, we run 20 bootstrapping iterations and expand 10 entities per iteration. We set the number of GBEncoder layers to 3 and the learning rate to 1e-3. We implemented our model using PyTorch (Paszke et al., 2019) with the PyTorch Geometric extension (Fey and Lenssen, 2019). All models are run on a single Nvidia Titan RTX.

Table 1: The ablation study results of GBN. GBN −gs is the model without the global-sighted encoder; GBN −pt is the model not learned by the "pre-training and fine-tuning" strategies.

Overall Results
Figure 4 shows the overall results on CoNLL and OntoNotes. From this figure, we can see that:
• GBN significantly outperforms all baselines. On both CoNLL and OntoNotes, the proposed GBN expands entities with higher precision than the baselines. Specifically, on CoNLL, GBN expands 800 entities with a precision above 90%, while the baselines achieve at most 80%, and the LP method cannot even expand more than 300 entities. On OntoNotes, GBN expands 2200 entities with a precision above 47%, while the precision of most other baselines is below 40%; BootstrapNet reaches around 45% precision in the end, but it expands fewer than 2000 entities in total.
• The end-to-end paradigm is promising for bootstrapping. From Figure 4, we can see that both end-to-end models, GBN and BootstrapNet, outperform the pipelined methods in two aspects: they achieve significantly higher precision, and their precision-throughput curves decline more slowly as the throughput increases on both the CoNLL and OntoNotes datasets.

Detailed Analysis
Ablation study of GBN. To analyze in detail the contributions of the global-sighted encoding and the "pre-training and fine-tuning" strategy, we conduct an ablation study on the two datasets (see Table 1), where GBN −gs replaces the global-sighted encoder in GBN with a simple graph attention network (Veličković et al., 2018), and GBN −pt denotes a variant of GBN that is not learned by the "pre-training and fine-tuning" strategy but only by the multi-view learning algorithm, as in Yan et al. (2020). We can see that, without the global-sighted encoding, the final performance may decrease even with the "pre-training and fine-tuning" strategy. This indicates that our proposed global-sighted encoder captures global-sighted information more effectively than other encoder models. From Table 1, we can also see that, without the "pre-training and fine-tuning" strategy, the performance of GBN decreases sharply. This verifies the importance of the "pre-training and fine-tuning" strategy for bootstrapping tasks. Furthermore, we found that "pre-training and fine-tuning" is critical for models with a large capacity: there is a large performance gap between GBN −pt and GBN. We therefore believe that the capacity of a model should be consistent with its learning algorithm and supervision signals: an expressive model with a weak learning algorithm may not result in strong performance.
Effect of different pre-training strategies. To further analyze the effect of different pre-training strategies, we conduct another ablation study by removing the self-supervised pre-training strategies (GBN -self ), the supervised pre-training strategy (GBN -sup ) and both of them (GBN -both ). The results are shown in Table 2. We can see that: 1) Both the self-supervised pre-training strategies and the supervised pre-training strategy contribute to GBN's final performance. 2) Compared with the supervised pre-training strategy, the self-supervised pre-training strategies yield a smaller performance improvement. This could be explained by the fact that the pre-training and bootstrapping datasets often have different structures, which makes it more difficult to capture structural information via self-supervised pre-training strategies.
Effect of Encoder layers. To analyze the effect of the number of GBEncoder layers, we conduct experiments with different layer numbers (see Figure 5). From Figure 5, we can see that the performance of GBN increases with more layers. This also indicates that bootstrapping methods for ESE benefit from effectively capturing more global-sighted information: the more layers are used, the more global-sighted information can be captured.
Pre-training and fine-tuning. Early models pre-trained on ImageNet (Russakovsky et al., 2015) showed their advantages in many CV tasks (Simonyan and Zisserman, 2014; Johnson et al., 2016; Huang et al., 2017; He et al., 2017). In NLP, pre-training has also proven effective on many tasks, from early word vectors such as word2vec and GloVe (Mikolov et al., 2013; Pennington et al., 2014) to recent language model pre-training such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2020). Recently, Hu et al. (2020) also showed the advantages of graph pre-training, which directly inspires our work.

Conclusions
In this paper, we propose the Global Bootstrapping Network (GBN) and effective "pre-training and fine-tuning" strategies to learn it. Specifically, we design a global-sighted GBEncoder to capture both local and global semantics from the corpus, and an effective attention-guided GBDecoder to adaptively expand new entities. To learn GBN, we design several pre-training and fine-tuning strategies. Experiments show that the proposed GBN, together with the "pre-training and fine-tuning" algorithm, significantly outperforms state-of-the-art methods. For future work, we want to design more effective "pre-training and fine-tuning" strategies and apply our model to other bootstrapping tasks.