Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction

Current supervised relational triple extraction approaches require huge amounts of labeled data and thus suffer from poor performance in few-shot settings. Humans, however, can grasp new knowledge by learning from only a few instances. To this end, we take the first step toward studying few-shot relational triple extraction, which has not been well understood. Unlike previous single-task few-shot problems, relational triple extraction is more challenging because entities and relations have implicit correlations. In this paper, we propose a novel multi-prototype embedding network model to jointly extract the compositions of relational triples, namely entity pairs and their corresponding relations. Specifically, we design a hybrid prototypical learning mechanism that bridges text and knowledge concerning both entities and relations, thus injecting the implicit correlations between entities and relations. Additionally, we propose a prototype-aware regularization to learn more representative prototypes. Experimental results demonstrate that the proposed method improves the performance of few-shot relational triple extraction.


Introduction
Relational Triple Extraction is an essential task in Information Extraction for Natural Language Processing (NLP) and Knowledge Graphs (KG), aiming to detect a pair of entities along with their relation in unstructured text. For instance, given the sentence "Paris is known as the romantic capital of France.", an ideal relational triple extraction system should extract the relational triple ⟨Paris, Capital of, France⟩, in which Capital of is the relation between Paris and France.
Current works in relational triple extraction typically employ traditional supervised learning based on feature engineering (Kambhatla, 2004; Reichartz et al., 2010) or neural networks (Zeng et al., 2014; Bekoulis et al., 2018a). The main problem with supervised learning models is that they cannot perform well on unseen entity types or relation categories (e.g., a model trained to extract knowledge triples from economic text will struggle on scientific articles). As a result, supervised relational triple extraction cannot extend to unseen entity or relation types. A trivial solution is to annotate more data for the unseen triple types and then retrain the model with the newly annotated data (Zhou et al., 2019). However, this is usually impractical because of the extremely high cost of annotation.
Intuitively, humans can learn a new concept with limited supervision, e.g., one can detect and classify new entities from 3-5 examples (Grishman et al., 2005). This motivates the setting we target for relational triple extraction: Few-Shot Learning (FSL). In few-shot learning, a trained model rapidly learns a new concept from a few examples while retaining good generalization from observed examples (Vinyals et al., 2016). Hence, if we need to extend relational triple extraction to a new domain, only a few examples are needed to activate the system in that domain, without retraining the model. By formulating this few-shot relational triple extraction task, we can significantly reduce both annotation cost and training cost while maintaining highly accurate results.

Figure 1: Illustration of our proposed model for relational triple extraction in the few-shot setting. The texts marked in red are head entities, while those in blue are tail entities. Head and tail entity prototypes are connected with the relation prototype.

Though few-shot learning methods have developed rapidly in recent years, most existing works concentrate on single tasks such as relation extraction and text classification (Geng et al., 2019; Ye and Ling, 2019). However, the effect of jointly extracting entities and relations, the two subtasks of relational triple extraction, in low-resource scenarios is still not well understood. Unlike extraction for each single task, joint entity and relation extraction is more challenging, as entities and relations have implicit correlations which cannot be ignored.
To address this issue, we propose a Multi-Prototype Embedding network (MPE) model for few-shot relational triple extraction, inspired by the prototypical network (Snell et al., 2017). Specifically, we utilize two kinds of prototypes, for entities and for relations. Note that entity pairs and relations have explicit knowledge constraints (Bordes et al., 2013); for example, the Born in relation requires the head entity type to be PERSON, and conversely, the entity types constrain the possible relations. Based on these observations and motivated by knowledge graph embedding (Xie et al., 2016), we introduce hybrid prototypical learning to explicitly inject knowledge constraints: we first learn entity and relation prototypes and then leverage a translation constraint in the embedding space to regularize the prototype embeddings. Such knowledge-aware regularization not only injects prior knowledge from an external knowledge graph but also leads to smoother and more representative prototypes for few-shot extraction. Moreover, we introduce a prototype-aware regularization that considers both intra-class similarities (between instances and their prototype) and inter-class similarities (between different prototypes). Experimental results on the FewRel dataset (Han et al., 2018) demonstrate that our approach outperforms baseline models in the few-shot setting.
To summarize, our main contributions include: • We study the few-shot relational triple extraction problem and provide a baseline for this new research direction, which, to the best of our knowledge, has not been explored before.
• We propose a novel Multi-Prototype Embedding approach with hybrid prototypical learning and prototype-aware regularization, which bridges text and knowledge for few-shot relational extraction.
• Extensive experimental results on the FewRel dataset demonstrate the effectiveness of our method.

Related Work
Two main directions have been proposed for relational triple extraction, which comprises two subtasks, entity extraction and relation extraction: pipeline methods (Lin et al., 2016; Trisedya et al., 2019; Wang et al., 2020; Nan et al., 2020) and joint learning methods (Bekoulis et al., 2018b; Nayak and Ng, 2020; Ye et al., 2020). Pipeline models are more flexible because they extract entity pairs and relations sequentially, but this design leads to error propagation. Joint relational triple extraction models avoid this problem by extracting triples end-to-end, and the interaction between entities and relations can be realized within the model, making the two tasks mutually enhancing. However, due to the "data-hungry" nature of conventional neural networks, these relational triple extraction models need large amounts of training data. Thus, many efforts (Yu et al., 2020) have been devoted to few-shot learning. (Han et al., 2018) presents a few-shot relation extraction dataset to promote research on information extraction in few-shot scenarios and adapts several few-shot learning methods (Munkhdalai and Yu, 2017; Satorras and Estrach, 2018; Mishra et al., 2017) to this task. Among these models, the prototypical network (Snell et al., 2017) achieves comparable results on several few-shot learning benchmarks while remaining simple and effective. This model assumes that each class has a prototype; it computes the prototype of each class from the supporting instances and compares the distance to the query instance under a particular distance metric. In natural language processing, (Gao et al., 2019) first proposes a hybrid attention-based prototypical network for few-shot relation extraction, and (Fritzler et al., 2019) proposes to utilize the prototypical network for few-shot named entity recognition.
(Hou et al., 2020) proposes a collapsed dependency transfer mechanism and a Label-enhanced Task-Adaptive Projection Network (L-TapNet) for few-shot slot filling. However, all previous few-shot works mainly consider single tasks, whereas relational triple extraction must take both entities and relations into consideration. To the best of our knowledge, ours is the first approach to few-shot relational triple extraction that addresses both entities and relations.
Our work is motivated by knowledge graph embedding (Xie et al., 2016) methods such as TransE (Bordes et al., 2013). A knowledge graph (KG) is composed of many relational triples of the form ⟨head, relation, tail⟩. TransE, first proposed by (Bordes et al., 2013), encodes triples into a continuous low-dimensional space based on the translation law h + r ≈ t. Many follow-up works, such as TransH (Wang et al., 2014), DistMult (Yang et al., 2014), and TransR (Lin et al., 2015), propose more advanced scoring methods by introducing different embedding spaces. In few-shot settings, it is extremely challenging to inject implicit knowledge constraints into the vector space; such simple yet effective knowledge constraints provide an intuitive solution.

Problem Definition
In the few-shot relational triple extraction task, we are given two datasets, D_meta−train and D_meta−test. Each dataset consists of a set of samples (x, t), where x is a sentence composed of N words and t is the relational triple extracted from x. The form of t is ⟨head, relation, tail⟩, where head and tail are the entity pair associated with the relation. The two datasets have disjoint relation domain spaces. In few-shot settings, D_meta−test is split into two parts: D_test−support and D_test−query. Since entity pair types can be determined by the relation category (e.g., the Born in relation suggests that the head type might be PERSON and the tail type might be LOCATION), we can determine the classification of triples solely by specifying the relation categories. Therefore, if D_test−support contains K labeled samples for each of N relation classes, the target few-shot problem is called N-way-K-shot. D_test−query contains test samples, each of which should be labeled with one of the N relation classes, and the associated entity pairs also need to be extracted correctly.
It is non-trivial to train a good model from scratch using D_test−support and evaluate its performance on D_test−query, limited by the number of test-support samples (i.e., N × K). Inspired by an important machine learning principle that test and train conditions must match, we also split D_meta−train into two parts, D_train−support and D_train−query, and mimic the few-shot settings at the training stage. In each training iteration, N triple categories are randomly selected from D_train−support, and K support instances are randomly selected from each of the N triple categories. In this way, we construct the train-support set S = {s_k^i; i = 1, . . . , N, k = 1, . . . , K}, where s_k^i is the k-th instance of triple category i. Meanwhile, we randomly select R samples from the remaining samples of those N triple categories and construct the train-query set Q = {(q_j, t_j); j = 1, . . . , R}, where t_j is the triple extracted from instance q_j. Our goal is to optimize the following objective:

θ* = argmax_θ Σ_{j=1}^{R} log P(t_j | S, q_j)

where P(t | S, q) is the probability of the gold-standard relational triple.
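As a concrete illustration of the episode construction described above, the following sketch samples one N-way-K-shot episode with R query instances. The flat {relation: [instance, ...]} data layout and the helper name are illustrative assumptions, not the paper's actual implementation.

```python
import random

def sample_episode(dataset, n_way, k_shot, r_query, seed=None):
    """Sample one N-way K-shot episode plus R query instances.

    dataset: {relation: [instance, ...]} -- an assumed layout for illustration.
    Returns (support, query), each a list of (relation, instance) pairs.
    """
    rng = random.Random(seed)
    relations = rng.sample(sorted(dataset), n_way)        # N triple categories
    support, pool = [], []
    for rel in relations:
        shuffled = rng.sample(dataset[rel], len(dataset[rel]))
        support += [(rel, x) for x in shuffled[:k_shot]]  # K shots per class
        pool += [(rel, x) for x in shuffled[k_shot:]]     # query candidates
    query = rng.sample(pool, r_query)                     # R query samples
    return support, query
```

Training then iterates this sampling so that train and test conditions match.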

Framework Overview
In this section, we introduce our proposed Multi-Prototype Embedding (MPE) model for few-shot relational triple extraction. For brevity, we consider sentences with one relation and its associated entity pair. The framework of our proposed model is shown in Fig. 2 and has three main modules.
• Instance Encoder. We utilize the pre-trained language model BERT (Devlin et al., 2018) to encode sentences; it adopts multi-head attention to learn contextual representations. Note that any other encoder, such as RoBERTa or XLNet (Yang et al., 2019), can also be applied.
• Hybrid Prototype Learning. After obtaining entity pair representations of each sentence via sequence labeling, we compute entity prototypes over the support set and then construct the relation prototype based on knowledge graph constraints, which takes the interaction between entity pairs and relations into account.
• Prototype-Aware Regularization. To further enhance prototype learning, we optimize the positions of prototypes in the representation space: we pull each prototype closer to its related instances and push prototypes of different types apart.

Instance Encoder
For each sentence x = {w_1, w_2, . . . , w_n} in the support or query set, where w_i ∈ x is a word token of sentence x, we first construct the input in the form {[CLS], w_1, w_2, . . . , w_n, [SEP]} to match the input format of BERT (Devlin et al., 2018). Pre-trained language models have been shown to be effective in many NLP tasks.
The [CLS] token represents the entire sentence, and [SEP] is the end-of-sentence token. After multi-head attention (Vaswani et al., 2017) computation, we obtain the sentence contextual embeddings B = {h_0, h_1, h_2, . . . , h_n, h_{n+1}}, where B ∈ R^((n+2)×d_b), d_b is the BERT-predefined hidden size, h_0 is the [CLS] token embedding, h_{n+1} is the [SEP] token embedding, and h_i, i ∈ [1, n] is the embedding of each token in the sentence. Note that n can differ from the input sentence length because the tokenizer (e.g., byte-pair encoding) might split words into sub-tokens.
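The [CLS]/[SEP] wrapping and the sub-token bookkeeping (later used to take each entity's first sub-token representation) can be sketched as follows. The toy tokenizer interface is an assumption standing in for a real WordPiece/BPE tokenizer, not BERT's actual API.

```python
def build_bert_input(words, tokenize):
    """Wrap a sentence with [CLS]/[SEP] and track each word's first sub-token.

    tokenize: a callable mapping a word to its sub-token pieces (an assumed
    stand-in for a real sub-word tokenizer such as WordPiece).
    """
    tokens, first_subtok = ["[CLS]"], []
    for word in words:
        pieces = tokenize(word)            # a word may split into sub-tokens
        first_subtok.append(len(tokens))   # index of the word's first piece
        tokens.extend(pieces)
    tokens.append("[SEP]")
    return tokens, first_subtok
```

The recorded indices let downstream modules recover a per-word representation even when n (the sub-token count) exceeds the original sentence length.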

Hybrid Prototypical Learning
Entity Prototype Learning. During the training stage, sentence representations in the support set are first used to construct the entity pair prototypes. We build an entity labeling set S = {B-Head, I-Head, B-Tail, I-Tail, O, X} to label each token in the sentence, where B-Head and I-Head indicate head entity positions, B-Tail and I-Tail indicate tail entity positions, O marks other tokens, and X marks any remaining fragments of tokens split by the tokenizer. We utilize a Conditional Random Field (CRF) (Lafferty et al., 2001) for sequence labeling, as it models the constraints between labels, which is particularly helpful in few-shot learning scenarios. Let y = {y_0, y_1, y_2, . . . , y_n, y_{n+1}}, where y_0 is the [CLS] token label marking the start of the sentence, y_{n+1} is the [SEP] token label marking the end of the sentence, and y_i, i ∈ [1, n] is the label of each sentence token in the entity labeling set. The CRF combines local and global information through emission and transition scores; in our model, the score of a sequence is evaluated as:

Score(x, y) = Σ_{i=0}^{n+1} Emit(h_i, y_i) + Σ_{i=0}^{n} Trans(y_i, y_{i+1})

Let Y_X denote the exponential space of all possible labelings of sequence x. The probability of a specific labeling y ∈ Y_X is evaluated as:

P(y | x) = e^{Score(x,y)} / Σ_{y′∈Y_X} e^{Score(x,y′)}

We denote the CRF-based sequence labeling loss as loss_crf and minimize it during the training stage. After the above instance encoding and sequence labeling, we obtain head and tail representations to match the entities between the query and support sets. Due to the variable length of entity words, we only use the first token representation of each entity as the head/tail embedding, as also done in (Soares et al., 2019). For measuring the distance between samples in the query set and the support set, we need to compute a representative vector, called a prototype, for each class t ∈ T in the support set S from its instances' vectors.
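The CRF scoring just described can be sketched as follows. This is a minimal illustration only: the normalizer is computed by brute-force enumeration over all labelings (exponential in sequence length), whereas practical CRFs use the forward algorithm for normalization and Viterbi for decoding.

```python
import math
from itertools import product

def crf_log_prob(emissions, transitions, labels):
    """Log-probability of one labeling under a linear-chain CRF.

    emissions[t][y]: emission score for label y at position t.
    transitions[a][b]: score for moving from label a to label b.
    """
    n, k = len(emissions), len(emissions[0])

    def score(path):
        s = sum(emissions[t][y] for t, y in enumerate(path))
        s += sum(transitions[a][b] for a, b in zip(path, path[1:]))
        return s

    # Brute-force partition function over all k^n labelings (sketch only).
    log_z = math.log(sum(math.exp(score(p)) for p in product(range(k), repeat=n)))
    return score(labels) - log_z
```

Minimizing the negative of this log-probability for the gold labeling gives loss_crf.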
The original Prototypical Network (Snell et al., 2017) hypothesizes that all instance vectors are equally important, so it aggregates the representation vectors of the instances of class t_i by averaging:

head_proto = (1/K) Σ_{k=1}^{K} head_k,  tail_proto = (1/K) Σ_{k=1}^{K} tail_k

where head_k and tail_k are the entity pair representations of the k-th support sentence. Intuitively, the instances of a given relation may be quite different. Thus, we adopt a weighted-sum prototype, named Proto+Att, inspired by (Gao et al., 2019). The weights are obtained by an attention mechanism conditioned on the representation vector q of the query Q:

head_proto = Σ_{k=1}^{K} α_k · head_k, where α_k = exp(head_k · q) / Σ_{k′=1}^{K} exp(head_{k′} · q)

and tail_proto is computed analogously. Specifically, we use the Euclidean distance d(z, z′) = ‖z − z′‖² to calculate the distance between the entity prototypes and the instances in the query set, and minimize this distance as loss_entity.
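The attention-weighted prototype construction and nearest-prototype matching can be sketched as below. The dot-product-plus-softmax scoring is an illustrative assumption for the Proto+Att weighting; the paper's exact attention function may differ.

```python
import numpy as np

def attentive_prototype(support, query):
    """Class prototype as an attention-weighted sum of support vectors.

    support: (K, d) instance vectors of one class; query: (d,) query vector.
    """
    scores = support @ query                 # similarity of each shot to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    return weights @ support                 # weighted-sum prototype

def nearest_class(prototypes, query):
    # Classify the query by the smallest squared Euclidean distance.
    dists = [np.sum((p - query) ** 2) for p in prototypes]
    return int(np.argmin(dists))
```

Replacing the softmax weights with uniform 1/K recovers the original mean prototype of Snell et al. (2017).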
Relation Prototype Learning. This module computes the relation prototype associated with each entity pair. On the one hand, the first token [CLS] of the sentence representation encodes the whole sentence. Thus, analogous to the entity prototype calculation above, we obtain sentence prototypes sent_proto from the sentence representations in the support set.
On the other hand, knowledge graph representation learning inspires us to exploit the translation law h + r ≈ t (Bordes et al., 2013) in a continuous low-dimensional space, where h, r, and t denote the head entity, the relation, and the tail entity, respectively. We therefore use head_proto and tail_proto to construct the knowledge graph prototype kg_proto, which takes the interaction between entities and relations into consideration:

kg_proto = tail_proto − head_proto

Finally, we combine the sentence prototype sent_proto and the knowledge-constraint prototype kg_proto to form the relation prototype:

relation_proto = [sent_proto; kg_proto]

where [;] refers to feature vector concatenation. Similar to the entity prototype, we use the Euclidean distance between relation_proto and the query sentence representation, and minimize this distance as loss_relation.
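The relation prototype construction can be sketched as follows. Taking kg_proto as tail_proto − head_proto is an assumption consistent with the translation law h + r ≈ t (so r ≈ t − h); the paper's exact composition may differ.

```python
import numpy as np

def relation_prototype(sent_proto, head_proto, tail_proto):
    """Relation prototype bridging text and knowledge.

    sent_proto: sentence-side prototype from [CLS] representations.
    head_proto / tail_proto: entity-side prototypes of the same class.
    """
    kg_proto = tail_proto - head_proto             # r ≈ t − h (assumed reading)
    return np.concatenate([sent_proto, kg_proto])  # [sent_proto; kg_proto]
```

The concatenation keeps the textual and knowledge signals as separate sub-vectors, letting the distance metric weigh both.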

Prototype-Aware Regularization
Previous few-shot learning approaches (Ye and Ling, 2019) have shown that if the representations of all support instances in a class are far away from each other, it becomes difficult for the derived class prototype to capture the common characteristics of all support instances. Therefore, we propose prototype-aware regularization to optimize prototype learning. Intuitively, we argue that representation vectors (e.g., sentence representations and prototypes) of the same class should be close to each other, while prototypes of different types should be located far from each other in the prototype space. Specifically, we use the Euclidean and cosine distances to measure these similarities and optimize the prototype representations as follows:

loss_intra = Σ_i ‖x_i − p_i‖²,  loss_inter = Σ_{i≠j} cos(p_i, p_j)

where x_i is each sentence representation and p_i is its associated prototype; loss_intra and loss_inter are the two prototype-aware regularization terms. The overall regularization loss is loss_regular = loss_intra + α·loss_inter, where α is a hyperparameter.
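The prototype-aware regularization can be sketched as below; the exact aggregation over instances and prototype pairs (means rather than sums) is an illustrative assumption.

```python
import numpy as np

def prototype_regularization(instances, labels, prototypes, alpha=1.0):
    """Pull instances toward their own prototype (Euclidean) and
    push different prototypes apart (cosine similarity penalty)."""
    # intra: squared Euclidean distance between instances and their prototype
    loss_intra = np.mean([np.sum((x - prototypes[y]) ** 2)
                          for x, y in zip(instances, labels)])
    # inter: cosine similarity between every pair of distinct prototypes
    sims = []
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            a, b = prototypes[i], prototypes[j]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    loss_inter = np.mean(sims)
    return loss_intra + alpha * loss_inter
```

Minimizing this term tightens each class cluster while spreading the prototypes of different classes in the prototype space.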
The overall training objective is:

L = loss_crf + β·loss_entity + γ·loss_relation + δ·loss_regular

where β, γ, and δ are trade-off parameters.

Datasets
We conduct experiments on the public dataset FewRel (Han et al., 2018), which is derived from Wikipedia and annotated by crowd workers. FewRel releases 80 relation categories, and each relation has 700 samples. We reconstruct the FewRel dataset to fit the few-shot relational triple extraction task: the input is a single sentence, and the required output is the relation and the related entity pair, i.e., a complete knowledge triple in the scheme ⟨head, relation, tail⟩. In our experiments, we randomly select 50 relations for training, 15 for validation, and the remaining 15 relation types for testing. Note that there are no overlapping types between these three splits.

We implement our approach with PyTorch (Paszke et al., 2019). We employ mini-batch stochastic gradient descent (SGD) (Bottou, 2010) with an initial learning rate of 1e-1, decayed to one third every 2,000 steps, and train for 30,000 iterations. A dropout rate of 0.2 is used to avoid overfitting. A previous study (Snell et al., 2017) found that models trained on harder tasks may achieve better performance than those using the same configuration at both training and test stages. Therefore, we set N = 20 when constructing the train-support sets for the 5-way and 10-way tasks. Furthermore, in each step, we sample 5 instances for the query set. We tune hyperparameters by grid search on the validation set; all hyperparameters used in our experiments are listed in Table 1.

We consider two types of few-shot relational triple extraction tasks in our experiments: 5-way-5-shot and 10-way-10-shot. We evaluate entity, relation, and triple performance with the micro F1 score. Specifically, entity performance means that the entity's span and span type are correctly predicted, and relation performance means that the relation of the entity pair is correctly classified.
Moreover, the triple performance means that the entity pair and associated relation are all matched correctly.

Baselines
We compare our model with baselines from supervised learning and few-shot learning methods. Supervised Learning. We utilize BERT (Devlin et al., 2018) with fine-tuning (Finetune) as the supervised learning baseline; we fine-tune BERT with a batch size of 16 for 100 iterations.

Overall Evaluation Results
The first line of Table 2 shows the performance of our model on the FewRel test set. From the results, we observe that: 1) Our approach MPE achieves the best performance in the few-shot setting compared with all baselines (an absolute improvement of about 5% over Proto+Att in the 5-way-5-shot setting), which demonstrates that multi-prototype learning leveraging both text and knowledge is effective.
2) Entity recognition performs much worse than relation extraction in few-shot settings, as sequence labeling is more challenging than classification; similar empirical results are also observed by (Hou et al., 2020). More studies are needed to handle the challenging few-shot entity recognition task.
3) Proto+Att achieves better performance than Proto, which reveals that different instances make different contributions to prototype learning.
4) The overall performance is still far from satisfactory, which calls for more future work.

Ablation Study
We further analyze the different modules of our approach through ablation studies, as shown in Table 3. w/o CRF denotes removing the CRF decoder; w/o Att denotes removing the attention in prototypical learning; w/o intra denotes removing the intra-class constraints between instances and prototypes; w/o inter denotes removing the inter-class constraints between prototypes. From Table 3, we observe that: 1) All ablated variants suffer performance decay, and w/o CRF decays significantly more than w/o Att, w/o intra, and w/o inter, which demonstrates that the CRF is critical in few-shot relational triple extraction.
2) w/o intra and w/o inter drop more than w/o Att, which illustrates that prototype-aware regularization benefits prototype learning.
From Figure 3, we observe that multi_proto achieves better performance than sent_proto and kg_proto, and that kg_proto is more advantageous than sent_proto for entity extraction, which further indicates that such knowledge constraints are beneficial.
In summary, we observe that entity recognition is more difficult than relation extraction in few-shot settings, and that the implicit correlations between them contribute to the overall performance.

Error Analysis
To further analyze the drawbacks of our approach and promote future work on few-shot relational triple extraction, we randomly select instances and conduct error analysis, as shown in Table 4.
Distract Context. As instance #1 shows, our approach may fail on ambiguous contexts, i.e., sentences expressed in a similar context that differ only in the fine-grained type of their entities. We argue that this may be caused by unbalanced learning: models tend to classify sentences with similar contexts into high-frequency relations.
Wrong Boundaries. As instance #2 shows, many extracted triples have incorrect boundaries, which further demonstrates the difficulty of entity recognition in the few-shot setting. More future work should focus on few-shot sequence labeling.
Wrong Triples. As instance #3 shows, some extracted triples contain entities that do not exist in the gold-standard set. Generally, this mostly happens in sentences with multiple triples. Note that the FewRel dataset does not label those extra triples, and some of these cases are actually correct.

Conclusion and Future Work
In this paper, we study the few-shot relational triple extraction problem and propose a novel multi-prototype embedding network that bridges text representation learning and knowledge constraints. Extensive experimental results show that our model is effective, although challenges remain. These empirical findings shed light on promising future directions, including: 1) enhancing entity recognition with effective sequence decoders; 2) studying few-shot relational triple extraction with multiple triples in a single sentence; 3) injecting logic rules to enable robust extraction; and 4) developing dedicated few-shot relational triple extraction benchmarks.