A Two-phase Prototypical Network Model for Incremental Few-shot Relation Classification

Relation Classification (RC) plays an important role in natural language processing (NLP). Conventional supervised and distantly supervised RC models typically make a closed-world assumption, which ignores the emergence of novel relations in open environments. To incrementally recognize novel relations, two existing solutions (i.e., re-training and lifelong learning) have been designed, but both suffer from the lack of large-scale labeled data for novel relations. Meanwhile, prototypical networks perform well in both deep supervised learning and few-shot learning. However, they still suffer from an incompatible feature embedding problem when novel relations arrive. Motivated by these observations, we propose a two-phase prototypical network with prototype attention alignment and triplet loss that dynamically recognizes novel relations from a few support instances without catastrophic forgetting. Extensive experiments are conducted to evaluate the effectiveness of our proposed model.


Introduction
Relation Classification (RC) is a fundamental task in natural language processing (NLP), aiming to assign semantic relations to the entity pairs mentioned in sentences. Currently, conventional supervised (Zeng et al., 2014; Chen et al., 2020) or distantly supervised (Mintz et al., 2009) RC models are widely used and achieve remarkable performance. They are typically based on a closed-world assumption: the relations expressed in query instances must have appeared in the pre-defined relation set. This is clearly limited in many realistic scenarios, especially in a dynamic or open environment. As shown in Table 1, the query instance Q expresses the relation father, which is outside the pre-defined relation set. However, current supervised RC models ignore such novel relations (i.e., relations outside the pre-defined set) and incorrectly classify this query instance into one of the pre-defined relations. To incrementally recognize novel relations, two kinds of solutions, i.e., re-training (Gidaris and Komodakis, 2018) and lifelong learning (Han et al., 2020), have been proposed. However, both still suffer from the lack of large-scale labeled data for novel relations. They are prone to overfitting on novel relations and may even suffer catastrophic forgetting on base relations (i.e., the previous pre-defined relations) when given insufficient training data for novel relations (Xiang et al., 2019).
In contrast, it is intuitive that humans can learn new knowledge after being taught just a few instances (Snell et al., 2017). Based on this intuition, a series of few-shot RC models (Mishra et al., 2017; Gao et al., 2019a; Soares et al., 2019; Gao et al., 2020) have been proposed and can effectively recognize novel relations with only a few (e.g., 1 or 5) support instances. Nevertheless, current few-shot RC models only focus on learning novel relations and ignore the fact that many common relations (i.e., base relations) are readily available in large datasets. They neglect the existence of large-scale training data for base relations and still learn the base relations in the low-resource setting (i.e., each base relation is given only a few support instances), which cannot fully capture the features of base relations. To tackle this limitation of current few-shot RC models, our work considers a more realistic setting where the relation learning system enjoys both the ability to learn from large-scale data for base relations and the flexibility of few-shot learning for novel ones. Specifically, the RC model can not only learn the base relations from large-scale training data, but also dynamically recognize the novel relations with only a few support instances. Research on this subject can be named incremental few-shot relation classification.

Table 1 caption: For the closed-world assumption based RC models, the set of pre-defined relations (i.e., capital of, date of birth and member of) is fixed after model training. e1/e2 denotes the entity name mentioned in the corresponding sentence.

Figure 1: Incompatible Feature Embedding Space. Five relations (3 base and 2 novel relations) with 30 instances are randomly selected from the dataset FewRel (Gao et al., 2019a) and encoded by a prototypical network (Yang et al., 2018). P991, P6, P176, P921 and P2094 respectively represent the relations successful candidate, head of government, manufacturer, main subject and competition class.
In both deep supervised learning and few-shot learning, prototypical networks (Snell et al., 2017) achieve strong performance on several benchmarks. They conduct classification by learning the distance distribution among relations. However, limited by the closed-world assumption, they only focus on feature embedding learning for the base relations. When novel relations come in, the feature spatial distributions of the novel relations may become distorted and incompatible with those of the base relations. As shown in Figure 1, the base relations are well separated in the feature embedding space. Nevertheless, as novel relations come in, their feature spatial distributions are much wider than those of the base relations and even overlap the base relations' distributions, making it infeasible to conduct classification simultaneously for base and novel relations. To solve this incompatible feature embedding problem, a prototype attention alignment (ProtoAtt-Alignment) and a triplet loss function are designed in our proposed model. They force the prototypical network to narrow down the feature spatial distributions of the novel relations and, meanwhile, to enlarge the distances among different relations in the same embedding space.
In this paper, we propose a two-phase prototypical network model with ProtoAtt-Alignment and triplet loss for incremental few-shot relation classification. The whole framework is shown in Figure 2. In the first phase, a deep prototypical network learns the feature embedding space of the base relations in a supervised learning manner, following Yang et al. (2018). Each base relation is represented as the center (base prototype) of its training instances. To dynamically recognize the novel relations, a novel prototype generator is designed to learn representations for the novel relations (novel prototypes) from only a few support instances. An incremental prototypical network with the novel prototype generator is then proposed in the second phase, and classification is conducted by comparing the distances between the query instance and each prototype (i.e., both the base and novel prototypes).
The main contributions of this paper can be summarized as follows: (1) We explore the problem of incremental few-shot relation classification and propose a two-phase prototypical network model to dynamically recognize novel relations with a few support instances while avoiding catastrophic forgetting. To the best of our knowledge, our work is the first study focusing on incremental few-shot relation classification.
(2) We design a prototype attention alignment and a triplet loss to solve the incompatible feature embedding problem that exists in current prototypical networks. (3) Extensive experiments and visualization analysis are conducted on a real-world dataset to evaluate the effectiveness of our model.

Related Work
Relation classification (RC) is one of the most important techniques in natural language processing (NLP) and has various applications such as information retrieval (Ercan et al., 2019), question answering (Tong et al., 2019) and dialogue systems (Ma et al., 2019). Currently, conventional deep supervised (Zeng et al., 2014; Gormley et al., 2015) and distantly supervised (Mintz et al., 2009; Jiang et al., 2016; Ye and Ling, 2019a) RC models are widely used and achieve remarkable performance. They are typically based on the closed-world assumption (Fei and Liu, 2016) that the relations expressed in query instances must have appeared in the pre-defined relation set, so the set of relations an RC model can recognize is fixed after training. However, this assumption is often violated in realistic scenarios, especially in a dynamic or open environment, where novel relations can emerge dynamically.
To dynamically expand the fixed relation set, two solutions have been proposed. The first, basic and straightforward, method is re-training (Gidaris and Komodakis, 2018): every time novel relations come in, training data for the novel relations must be collected and the model trained from scratch on the enlarged training data, aiming to avoid catastrophic forgetting (McCloskey and Cohen, 1989; McClelland et al., 1995). However, this repeated training process is computationally expensive and time-consuming. Recently, two lifelong learning based RC models (Han et al., 2020) have been proposed to alleviate the expensive re-training process. Nevertheless, both solutions still suffer from the lack of large-scale training data for novel relations. Without enough training data for novel relations, they risk overfitting on the recognition of novel relations and may even suffer catastrophic forgetting on base relations.
In contrast, humans have the ability to perform even one-shot classification, where only one example of each new category is given. Based on this intuition, a series of few-shot RC models have been proposed. They can be classified into two categories: meta-learning based models (Santoro et al., 2016; Ravi and Larochelle, 2016; Mishra et al., 2017) and metric-learning based models (Koch et al., 2015; Snell et al., 2017; Han et al., 2018; Gao et al., 2019a; Fan et al., 2019; Gao et al., 2019b; Soares et al., 2019). However, few-shot RC models only focus on learning novel relations and ignore the fact that many common relations are readily available in large datasets. To tackle this problem, we consider a more realistic setting where the relation learning system can not only learn the base relations from large-scale training data, but also dynamically recognize the novel relations with only a few support examples (termed incremental few-shot relation classification). Several related works (Qi et al., 2018; Gidaris and Komodakis, 2018; Xiang et al., 2019; Ren et al., 2019) have been proposed in the computer vision field, concentrating on the image classification task. Different from images, text is more diverse and noisy, making these models hard to directly generalize to NLP applications (Gao et al., 2019a). In this paper, we propose a two-phase prototypical network model for incremental few-shot relation classification. Extensive experiments are conducted to evaluate the effectiveness of our proposed model.

Model
To address the problem of incremental few-shot relation classification, we propose a two-phase prototypical network model with a prototype attention alignment and an auxiliary triplet loss (IncreProtoNet). In a dynamic and open environment, novel relations can emerge at test stage. However, current conventional supervised RC models are based on the closed-world assumption and neglect the emergence of novel relations. Although current few-shot RC models achieve remarkable performance on recognizing novel relations with a few support instances, they ignore the fact that common relations (i.e., base relations) are readily available in large datasets. To simultaneously learn the base relations from large training data and dynamically recognize the novel relations with only a few support instances, the two-phase IncreProtoNet is proposed, as shown in Figure 2. The first phase, named the deep prototypical network, pre-trains a base model for the base relations in a deep supervised manner. The second phase, named the incremental prototypical network, dynamically recognizes the novel relations with only a few support instances while not forgetting the base relations.

Problem Definitions and Notations
The incremental few-shot relation classification task can be defined as follows. We assume there exists a large training dataset $D_{train}$ for the base relations $R_{base}$. Using this large training data of base relations, our work aims to effectively learn the base relations and meanwhile to dynamically recognize the novel relations $R_{novel}$ with only a few (e.g., 1 or 5) support instances. Therefore, given a support set $S$ for $N_{novel}$ novel relations, the model should classify the entity pair $(h, t)$ mentioned in a query instance $q$ into the most probable relation $r \in R_{base} \cup R_{novel}$. The support set $S$ is defined as

$$S = \bigcup_{n=1}^{N_{novel}} \{I_{n,i}\}_{i=1}^{K_n}, \quad (1)$$

where $K_n$ is the number of support instances of novel relation $r_n$ and $I_{n,i}$ is its $i$-th support instance.

Token Embedding Layer
Given an instance $x = \{w_1, w_2, \ldots, w_L\}$, the token embedding layer transforms each input word token $w_i$ into a real-valued vector $v_i \in \mathbb{R}^d$ ($1 \le i \le L$). Following Gao et al. (2019a), the vector $v_i$ consists of two parts: a word embedding $v^w_i \in \mathbb{R}^{d_w}$ and a position embedding $v^{pos}_i \in \mathbb{R}^{2 \times d_p}$. We obtain the token representation $v_i$ by concatenating the word embedding and the position embedding:

$$v_i = [v^w_i; v^{pos}_i] \in \mathbb{R}^{d}, \quad d = d_w + 2 d_p.$$

Finally, each instance is transformed into an instance matrix $S = [v_1, v_2, \ldots, v_L] \in \mathbb{R}^{L \times d}$.
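As a concrete illustration, the concatenation of word and relative-position embeddings can be sketched as follows. The vocabulary size, dimensions, clipping range and random tables are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_w, d_p, max_dist = 100, 50, 5, 40       # assumed vocab size, dims, position range
word_table = rng.normal(size=(V, d_w))       # word embedding lookup table
pos_table = rng.normal(size=(2 * max_dist + 1, d_p))  # relative offsets in [-40, 40]

def embed(token_ids, head_pos, tail_pos):
    """Map a token-id sequence to an instance matrix S of shape (L, d_w + 2*d_p):
    each row concatenates the word embedding with two position embeddings,
    one for the offset to the head entity and one for the tail entity."""
    rows = []
    for i, w in enumerate(token_ids):
        p1 = int(np.clip(i - head_pos, -max_dist, max_dist)) + max_dist
        p2 = int(np.clip(i - tail_pos, -max_dist, max_dist)) + max_dist
        rows.append(np.concatenate([word_table[w], pos_table[p1], pos_table[p2]]))
    return np.stack(rows)

S = embed([3, 17, 42, 8], head_pos=0, tail_pos=2)
print(S.shape)  # (4, 60): L = 4 tokens, d = 50 + 2*5
```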

Instance Encoder Layer
The instance encoder layer maps an instance $x$ into a low-dimensional vector $\mathbf{x}$ using a compositional function $f(\cdot)$ over the token embedding matrix $S$:

$$\mathbf{x} = f(S; \phi),$$

where $\phi$ denotes the learnable parameters of the compositional function $f(\cdot)$. In our proposed model, we first employ convolutional neural networks (CNNs) (Kim, 2014) to capture the local features of an instance. The instance matrix $S$ is fed into a CNN with $d_h$ filters of window size $win$, which outputs a hidden embedding matrix $H \in \mathbb{R}^{L \times d_h}$. Then, a max-pooling operation is applied over the matrix $H$ to obtain the final instance embedding $\mathbf{x} \in \mathbb{R}^{d_h}$.
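A minimal numpy sketch of this CNN-plus-max-pooling encoder. The weights, the ReLU non-linearity, and the choice $d_h = 230$, $win = 3$ are illustrative assumptions:

```python
import numpy as np

def cnn_encode(S, filters, win=3):
    """Sketch of the instance encoder: a 1-D convolution with window size `win`
    over the token matrix S (L x d), ReLU, then max-pooling over positions to
    produce the instance embedding x with d_h dimensions."""
    L, d = S.shape
    pad = np.zeros((win // 2, d))
    Sp = np.vstack([pad, S, pad])                 # zero-pad so output length is L
    # Each output row i applies all d_h filters to the window around token i.
    H = np.stack([filters @ Sp[i:i + win].ravel() for i in range(L)])  # (L, d_h)
    return np.maximum(H, 0).max(axis=0)           # ReLU then max-pool -> (d_h,)

rng = np.random.default_rng(1)
S = rng.normal(size=(7, 60))                      # 7 tokens, d = 60
W = rng.normal(size=(230, 3 * 60))                # d_h = 230 filters, win = 3
x = cnn_encode(S, W)
print(x.shape)  # (230,)
```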

Deep Prototypical Network
Prototypical networks (Snell et al., 2017; Yang et al., 2018) obtain remarkable performance and strong robustness on several benchmarks. They conduct classification by measuring the distance distribution among relations. In the first phase, following Yang et al. (2018), a deep prototypical network with prototype loss is used to train a base model in a deep supervised manner. The goal of this phase is to learn both a good feature extractor and a good base classifier. Given a query instance $q$ from $D_{train}$, the query instance representation $\mathbf{x}_q$ is obtained by the feature extractor. Then, the probability of the query instance $q$ belonging to relation $r_i \in R_{base}$ is calculated as

$$p(r_i \mid q) = \frac{\exp(-d(\mathbf{x}_q, \mu_i))}{\sum_{j=1}^{N_{base}} \exp(-d(\mathbf{x}_q, \mu_j))},$$

where $\mu_i \in \mathbb{R}^{d_h}$ denotes a learnable weight vector of relation $r_i \in R_{base}$ and $d(\cdot, \cdot)$ is the Euclidean distance between two given vectors. The parameters of the deep prototypical network are learned in this phase and frozen after pre-training. Then, we obtain the prototypes of the base relations (base prototypes) by averaging all available training instance embeddings; they are denoted as $P_{base} = \{p_1, p_2, \ldots, p_{N_{base}}\}$.
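The prototype averaging and the distance-based softmax classification described above can be sketched as a generic prototypical-network classifier in numpy (the data here is synthetic and the code is a sketch, not the authors' implementation):

```python
import numpy as np

def base_prototypes(embeddings, labels, n_rel):
    """Base prototype of each relation = mean of its training instance embeddings."""
    return np.stack([embeddings[labels == r].mean(axis=0) for r in range(n_rel)])

def classify(x_q, prototypes):
    """Softmax over negative Euclidean distances to each prototype."""
    logits = -np.linalg.norm(prototypes - x_q, axis=1)
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    return p / p.sum()

# Two synthetic relations: well-separated Gaussian clusters of instance embeddings.
rng = np.random.default_rng(2)
emb = np.vstack([rng.normal(0, 0.1, (5, 4)), rng.normal(3, 0.1, (5, 4))])
lab = np.array([0] * 5 + [1] * 5)
protos = base_prototypes(emb, lab, 2)
probs = classify(emb[0], protos)
print(probs.argmax())  # 0: the query is closest to the prototype of relation 0
```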

Incremental Few-shot Prototypical Network
In order to dynamically recognize the novel relations with only a few support instances, the incremental prototypical network is proposed to learn the features of the novel relations and compute their prototypes (novel prototypes). Classification is then conducted by measuring the distances between the query instance and all the relation prototypes (i.e., base prototypes and novel prototypes). The second phase mainly consists of two components: the Novel Prototype Generator, which computes the novel prototypes with a MetaCNN encoder, and the Merged Prototypical Network, which merges the base and novel features with a prototype attention alignment.

Novel Prototype Generator
Given a support set $S = \bigcup_{n=1}^{N_{novel}} \{I_{n,i}\}_{i=1}^{K_n}$, each support instance $I_{n,i}$ is encoded by the Token Embedding Layer of the frozen Feature Extractor into a word embedding matrix $S_{n,i}$.
MetaCNN Encoder: As shown in Figure 1, the feature embedding space is heavily distorted when novel relations come in, which would cause serious classification errors on both base and novel relations. Instead of using the frozen Feature Extractor, we build another MetaCNN encoder to capture the features of the novel relations. Its network structure is the same as the Instance Encoder Layer used in the base model. Given a word embedding matrix $S_{n,i}$, the MetaCNN encoder produces the support instance embedding $\mathbf{x}_{n,i}$.
Feature Averaging Prototype: For each novel relation $r_n \in R_{novel}$ with $K_n$ support instances, we obtain the prototype of $r_n$ by $p_n = \frac{1}{K_n} \sum_{i=1}^{K_n} \mathbf{x}_{n,i}$. The novel prototypes are then denoted as $P_{novel} = \{p_1, p_2, \ldots, p_{N_{novel}}\}$.

Merged Prototypical Network with Prototype Attention Alignment
The base and novel prototypes are merged and denoted as $P_{all} = P_{base} \cup P_{novel}$. Given a query instance $q$, two instance embeddings $\mathbf{x}^{base}_q$ and $\mathbf{x}^{novel}_q$ are obtained by the Feature Extractor and the MetaCNN Encoder, respectively. To merge the base and novel features, the prototype attention alignment measures the relative importance of the base and novel features. The merged query instance embedding is calculated as

$$\mathbf{x}_q = \omega_b \mathbf{x}^{base}_q + \omega_n \mathbf{x}^{novel}_q,$$

where $\omega_b$ and $\omega_n$ are scalar weights with $\omega_b + \omega_n = 1.0$. The weights are measured by the prototype attention alignment:

$$\omega_b = \frac{\exp(\mathbf{x}^{base}_q \cdot \mathbf{v}^{base})}{\exp(\mathbf{x}^{base}_q \cdot \mathbf{v}^{base}) + \exp(\mathbf{x}^{novel}_q \cdot \mathbf{v}^{novel})}, \quad \omega_n = 1.0 - \omega_b,$$

where $\mathbf{v}^{base}$ and $\mathbf{v}^{novel}$ respectively denote the base and novel feature representations, calculated as

$$\mathbf{v}^{base} = \sum_{i=1}^{N_{base}} \alpha_i p^{base}_i, \quad \mathbf{v}^{novel} = \sum_{i=1}^{N_{novel}} \beta_i p^{novel}_i,$$

where $\alpha_i$ and $\beta_i$ denote the attention weights of the $i$-th base prototype and the $i$-th novel prototype, respectively:

$$\alpha_i = \frac{\exp(\mathbf{x}^{base}_q \cdot p^{base}_i)}{\sum_{j=1}^{N_{base}} \exp(\mathbf{x}^{base}_q \cdot p^{base}_j)}, \quad \beta_i = \frac{\exp(\mathbf{x}^{novel}_q \cdot p^{novel}_i)}{\sum_{j=1}^{N_{novel}} \exp(\mathbf{x}^{novel}_q \cdot p^{novel}_j)}.$$

Finally, the probability of the query instance $q$ belonging to relation $r_i \in R_{base} \cup R_{novel}$ is measured as

$$p(r_i \mid q) = \frac{\exp(-d(\mathbf{x}_q, p^{all}_i))}{\sum_{j} \exp(-d(\mathbf{x}_q, p^{all}_j))},$$

where $p^{all}_i$ denotes the $i$-th prototype in $P_{all}$.
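The merging step can be sketched as follows. Dot-product attention is assumed for the scoring of prototypes and for the gate between the two embeddings; the paper's exact scoring function may differ, so this is a sketch of the mechanism, not the authors' code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def merge_query(x_base, x_novel, P_base, P_novel):
    """Prototype attention alignment sketch:
    - alpha/beta: attention of the query over base/novel prototypes;
    - v_base, v_novel: query-attended summaries of each prototype set;
    - omega_b in (0, 1) gates the two query embeddings, omega_n = 1 - omega_b."""
    alpha = softmax(P_base @ x_base)
    beta = softmax(P_novel @ x_novel)
    v_base, v_novel = alpha @ P_base, beta @ P_novel
    s = softmax(np.array([x_base @ v_base, x_novel @ v_novel]))
    w_b, w_n = s[0], s[1]            # w_b + w_n == 1 by construction
    return w_b * x_base + w_n * x_novel

rng = np.random.default_rng(3)
x_b, x_n = rng.normal(size=4), rng.normal(size=4)     # two query embeddings
Pb, Pn = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))  # 3 base, 2 novel prototypes
x_q = merge_query(x_b, x_n, Pb, Pn)
print(x_q.shape)  # (4,): the merged query embedding
```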

Triplet Loss for IncreProtoNet
The performance of a prototypical network highly depends on the spatial distributions of relations in the embedding space. To improve the robustness of the prototypical network and further solve the incompatible feature embedding problem, a triplet loss function is adopted in our model. Specifically, the target of the triplet loss is to force the prototypical network to narrow down the feature spatial distributions of the novel relations and meanwhile to enlarge the distances among different relations. Following Fan et al. (2019), the triplet loss function is designed as

$$\mathcal{L}_{triplet} = \frac{1}{M} \sum_{k=1}^{M} \sum_{i} \max\big(0,\; d(a^k_i, p^k_i) - d(a^k_i, n^k_i) + \delta\big),$$

where $M$ is the total number of training episodes, $(a^k_i, p^k_i, n^k_i)$ is a triplet consisting of the anchor, the positive and the negative instance, and $\delta$ is a margin hyper-parameter. Note that the anchor is a virtual instance, namely the novel prototype.
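The margin-based triplet term can be sketched in numpy; the standard max(0, d(a,p) − d(a,n) + δ) form is assumed, with episode batching left out for brevity:

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, delta=5.0):
    """Sketch of the triplet loss over one episode: the anchor is the novel
    prototype, the positive an instance of that relation, the negative an
    instance of a different relation. Averages max(0, d(a,p) - d(a,n) + delta)."""
    d_ap = np.linalg.norm(anchors - positives, axis=1)
    d_an = np.linalg.norm(anchors - negatives, axis=1)
    return np.maximum(0.0, d_ap - d_an + delta).mean()

a = np.array([[0.0, 0.0]])   # anchor (novel prototype)
p = np.array([[0.0, 1.0]])   # positive, distance 1 from the anchor
n = np.array([[10.0, 0.0]])  # negative, distance 10 from the anchor
print(triplet_loss(a, p, n, delta=5.0))  # 0.0: margin already satisfied (1 - 10 + 5 < 0)
```

With a larger margin the loss becomes positive, pushing the negative further away than the positive by at least δ.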
Finally, the loss $\mathcal{L}$ in the second phase is a trade-off between the softmax cross-entropy loss $\mathcal{L}_{softmax}$ of the incremental prototypical network and the triplet loss $\mathcal{L}_{triplet}$, balanced by a hyper-parameter $\lambda$:

$$\mathcal{L} = \mathcal{L}_{softmax} + \lambda \mathcal{L}_{triplet}.$$


Experiment

Experiment Settings
We conduct experiments on a large-scale public dataset (i.e., FewRel) to evaluate the effectiveness of our proposed model. Two kinds of pre-trained word embedding methods, namely GloVe (Pennington et al., 2014) and the language model BERT (Devlin et al., 2018), can be used to initialize the word embeddings in our model and are fine-tuned during the training stage. The out-of-vocabulary (OOV) words are initialized from a uniform distribution with range [−0.01, 0.01]. For the triplet loss, the hyper-parameter δ is set to 5.0 and λ is set to 1.0. The stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01 is used to optimize the model parameters.
In our experiments, we evaluate our proposed model in two incremental few-shot settings (i.e., $N_{base}$ base relations and 5 novel relations with 1-shot or 5-shot learning, where $N_{base}$ is 54 for FewRel). In the first phase, the deep prototypical network (i.e., the base model) is trained in a supervised learning manner and frozen after training. Specifically, the target of the first phase is to learn the parameters $\phi$ of the feature extractor and the prototypes of the base relations $P_{base}$. In the second phase, the incremental prototypical network is trained by iteratively sampling few-shot episodes and tries to learn the meta-parameters, following Gidaris and Komodakis (2018).

Table 2: Average classification accuracy (%) on the dataset FewRel. The Novel columns report the average 5-way 1-shot or 5-shot classification accuracy on novel relations; the Base and Both columns respectively report the average classification accuracy on base relations and on both types of relations. The results are computed by sampling 2000 tasks, each with 54 base relations and 5 novel relations. For each relation, 5 query instances are randomly sampled.

Datasets and Data Settings
In our experiments, we use accuracy as the metric. To evaluate the effectiveness of our proposed model, extensive experiments are conducted on the large-scale few-shot RC dataset FewRel (Gao et al., 2019a). The dataset contains 80 relations in total, each with 700 instances. To satisfy our experimental settings, we split the dataset into three parts: a training set consisting of 54 relations (i.e., the base relations R_base), each with 550 instances; a validation set consisting of the 54 base relations, each with 50 instances, and 10 relations (i.e., novel relations in the validation stage), each with 700 instances; and a testing set consisting of the 54 base relations, each with 100 instances, and 16 relations (i.e., the novel relations R_novel in the testing stage), each with 700 instances. There are no overlapping instances among the training, validation and testing sets.

Result Analysis
In our experiments, we compare our proposed model with two groups of models: several few-shot RC models and four incremental few-shot learning models designed for CV applications, as follows:

1. Few-shot Learning: We select several few-shot RC models (which can be adapted to the incremental few-shot scenario) as baselines. To adapt them to the incremental few-shot setting, we train the few-shot RC models on the training set of base relations. Then, both the base and novel relations are recognized with a few support instances at test stage. They are listed as follows: Siamese

2. Incremental Few-shot Learning:

• ProtoNet(incremental) (Snell et al., 2017): the prototypical network is adapted to the incremental few-shot setting. Each base relation is represented as the average embedding (base prototype) over all its training instances. At test stage, the novel relations are also represented as the average embeddings (novel prototypes) over a few support instances. Finally, classification is conducted by comparing the distances between the query instance and each relation prototype.

Table 3: Ablation experiments (%) on the dataset FewRel. DeepProtoNet denotes the base model (Yang et al., 2018), which is directly used to recognize both the base and novel relations; † indicates our model IncreProtoNet without the prototype attention alignment; ‡ indicates our model IncreProtoNet without the triplet loss function.
• Imprint (Qi et al., 2018): the base class representations are learned through supervised pre-training and the novel classes are represented simply by prototype averaging. Classification is then conducted over the fully connected layer by concatenating the base and novel class representations.

• LwoF (Gidaris and Komodakis, 2018): similar to Imprint, a two-stage incremental few-shot learning algorithm with a class-wise attention mechanism is designed to learn better classification-weight values for both base and novel classes.

• AttractorNet (Ren et al., 2019): inspired by attractor networks (Zemel and Mozer, 2001), an attention attractor network that regularizes the learning of novel classes is designed for incremental few-shot learning on the image classification task.

Comparison with Related Models
To demonstrate the effectiveness of our model in the incremental few-shot scenario, we compare our proposed model with two groups of related works (i.e., few-shot RC models and incremental few-shot learning models designed in CV). To adapt the few-shot models to the incremental few-shot setting, both the base and novel relations are recognized in the few-shot learning manner. Recently, a series of works have demonstrated the effectiveness of few-shot learning techniques on the relation classification task. Nevertheless, they only focus on novel relation learning and ignore the fact that the common relations (i.e., base relations) are readily available in large datasets. Specifically, the large-scale training data of base relations is neglected and each base relation is still recognized with only a few support instances. As shown in Table 2, our proposed model achieves higher accuracy by a significant margin on the recognition of base relations. Meanwhile, the novel relations are also effectively recognized, and our model even obtains better performance than current few-shot RC models. This comparison demonstrates that our proposed model can not only effectively recognize the base relations, but also dynamically learn the novel relations with only a few support instances.
For the second group of related works, four incremental few-shot learning models proposed in the computer vision field are implemented and adapted to the relation classification task as baselines. Different from images, text is more diverse and noisy (Gao et al., 2019a), so current incremental few-shot learning models focusing on the image classification task are hard to generalize to NLP tasks. From the experimental results shown in Table 2, our proposed model achieves better recognition performance by a significant margin in all experimental settings. Specifically, the model ProtoNet(incremental) encodes the novel relations simply with the pre-trained prototypical network (i.e., the base model) and suffers from the incompatible feature embedding problem. Compared with ProtoNet(incremental), our proposed model obtains higher accuracy by a large margin on the recognition of novel relations. To some extent, we can conclude that our proposed model effectively learns a compatible feature embedding space when novel relations incrementally come in. An intuitive and specific visualization analysis is given in the final section.

Ablation Studies
As shown in Table 3, ablation experiments on the prototype attention alignment and the triplet loss are conducted. The target of these two components (i.e., ProtoAtt-Alignment and the triplet loss) is to learn a compatible and adaptive embedding space when novel relations come in. From the experimental results, both the ProtoAtt-Alignment and the triplet loss significantly improve the recognition performance on novel relations while maintaining comparable recognition performance on base relations. Notably, the base model (i.e., the deep prototypical network) achieves better recognition performance on base relations than our model in some experimental settings. However, it seriously suffers from the incompatible feature embedding problem when novel relations are directly added into the base (initial) embedding space, as shown in Figure 1. Thus, the base model DeepProtoNet gets the lowest accuracy on the recognition of novel relations in all experimental settings. Through the ablation studies, we conclude that the prototype attention alignment and triplet loss effectively force the prototypical network to learn a compatible feature embedding space in the incremental few-shot scenario.

Visualization Analysis
To specifically and intuitively illustrate the effectiveness of our proposed model, we randomly select 30 instances from the corresponding relations (i.e., 3 base relations and 2 novel relations) in the dataset FewRel, encode them into hidden embeddings after model training, and map them into 2-dimensional points using Principal Component Analysis (PCA) in the same feature embedding space. As shown in Figure 1, current prototypical networks suffer from the incompatible feature embedding problem when the novel relations are directly added into the base embedding space. To solve this problem, the prototype attention alignment learns a compatible and adaptive feature embedding space by aligning and combining the novel and base relation features. Specifically, the feature spatial distributions of both the novel and base relations become intra-relation compact and inter-relation separable, so the relations can be effectively distinguished. Comparing Figure 3(a) with 3(c) demonstrates the effectiveness of the prototype attention alignment.
Moreover, the triplet loss further forces the prototypical network to enlarge the distances among relations and shorten the distances within the same relation, as shown in Figures 3(b) and 3(c). It is also beneficial for our proposed model to avoid catastrophic forgetting on the base relations.
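The 2-D PCA projection used for these visualizations can be sketched in numpy via SVD of the centered embeddings (the dimensions and random data are illustrative; the authors' plotting code is not shown in the paper):

```python
import numpy as np

def pca_2d(X):
    """Project embeddings onto their top-2 principal components.
    Rows of Vt (from the SVD of the centered data) are the principal
    directions, ordered by decreasing singular value."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T   # (n, 2) coordinates for plotting

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 230))   # 30 instance embeddings, d_h = 230
Y = pca_2d(X)
print(Y.shape)  # (30, 2)
```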

Conclusion
In this paper, we propose a two-phase prototypical network model for incremental few-shot relation classification. Current conventional supervised RC models are typically based on the closed-world assumption that the relations expressed in query instances must have appeared in the pre-defined relation set. However, novel relations often emerge in a dynamic or open-world environment. Although current few-shot RC models effectively recognize novel relations with only a few support instances, they ignore the fact that the common relations (i.e., base relations) are readily available in large datasets. To simultaneously learn the base relations from large-scale training data and the novel relations from a few support instances, an incremental few-shot relation learning model is proposed in this paper. The extensive experimental results and visualization analysis show that our proposed model can effectively recognize the novel relations with a few support instances and maintain high recognition accuracy on the base relations.