A Novel Cascade Binary Tagging Framework for Relational Triple Extraction

Extracting relational triples from unstructured text is crucial for large-scale knowledge graph construction. However, few existing works excel at solving the overlapping triple problem, where multiple relational triples in the same sentence share the same entities. In this work, we introduce a fresh perspective on the relational triple extraction task and propose a novel cascade binary tagging framework (CasRel) derived from a principled problem formulation. Instead of treating relations as discrete labels as in previous works, our framework models relations as functions that map subjects to objects in a sentence, which naturally handles the overlapping problem. Experiments show that the CasRel framework outperforms state-of-the-art methods even when its encoder module uses a randomly initialized BERT encoder, demonstrating the power of the new tagging framework. It enjoys a further performance boost when employing a pre-trained BERT encoder, outperforming the strongest baseline by 17.5 and 30.2 absolute points in F1-score on the two public datasets NYT and WebNLG, respectively. An in-depth analysis of different overlapping-triple scenarios shows that the method delivers consistent performance gains across all of them. The source code and data are released online.


Introduction
The key ingredient of a knowledge graph is relational facts, most of which consist of two entities connected by a semantic relation. These facts take the form (subject, relation, object), or (s, r, o), and are referred to as relational triples. Extracting relational triples from natural language text is a crucial step towards constructing large-scale knowledge graphs. Early works in relational triple extraction took a pipeline approach (Zelenko et al., 2003; Zhou et al., 2005; Chan and Roth, 2011): first recognize all entities in a sentence, then perform relation classification for each entity pair. Such an approach tends to suffer from the error propagation problem, since errors in early stages cannot be corrected in later stages. To tackle this problem, subsequent works proposed joint learning of entities and relations, including feature-based models (Yu and Lam, 2010; Li and Ji, 2014; Miwa and Sasaki, 2014; Ren et al., 2017) and, more recently, neural network-based models (Gupta et al., 2016; Katiyar and Cardie, 2017; Zheng et al., 2017; Zeng et al., 2018; Fu et al., 2019). By replacing manually constructed features with learned representations, neural network-based models have achieved considerable success in the triple extraction task.
However, most existing approaches cannot efficiently handle scenarios in which a sentence contains multiple relational triples that overlap with each other. Figure 1 illustrates these scenarios, where triples share one or two entities in a sentence. This overlapping triple problem directly challenges conventional sequence tagging schemes that assume each token bears only one tag (Zheng et al., 2017). It also poses a significant difficulty for relation classification approaches in which an entity pair is assumed to hold at most one relation (Miwa and Bansal, 2016). Zeng et al. (2018) were among the first to consider the overlapping triple problem in relational triple extraction. They introduced the categories of overlapping patterns shown in Figure 1 and proposed a sequence-to-sequence (Seq2Seq) model with a copy mechanism to extract triples. Building on the Seq2Seq model, they further investigated the impact of extraction order (Zeng et al., 2019) and gained considerable improvement with reinforcement learning. Fu et al. (2019) also studied the overlapping triple problem, modeling text as relational graphs with a model based on graph convolutional networks (GCNs).
Despite their success, previous works on extracting overlapping triples still leave much to be desired. Specifically, they all treat relations as discrete labels to be assigned to entity pairs. This formulation makes relation classification a hard machine learning problem. First, the class distribution is highly imbalanced. Among all pairs of extracted entities, most do not form valid relations, generating too many negative examples. Second, the classifier can be confused when the same entity participates in multiple valid relations (overlapping triples). Without enough training examples, the classifier can hardly tell which relation the entity participates in. As a result, the extracted triples are usually incomplete and inaccurate.
In this work, we start with a principled formulation of relational triple extraction right at the triple level. This gives rise to a general algorithmic framework that handles the overlapping triple problem by design. At the core of the framework is the fresh perspective that instead of treating relations as discrete labels on entity pairs, we can model relations as functions that map subjects to objects. More precisely, instead of learning relation classifiers f(s, o) → r, we learn relation-specific taggers f_r(s) → o, each of which recognizes the possible object(s) of a given subject under a specific relation, or returns no object, indicating that there is no triple with the given subject and relation. Under this framework, triple extraction is a two-step process: first we identify all possible subjects in a sentence; then for each subject, we apply relation-specific taggers to simultaneously identify all possible relations and the corresponding objects.
We implement the above idea in CASREL, an end-to-end cascade binary tagging framework. It consists of a BERT-based encoder module, a subject tagging module, and a relation-specific object tagging module. Empirical experiments show that the proposed framework outperforms state-of-the-art methods by a large margin even when the BERT encoder is not pre-trained, showing the superiority of the framework itself. The framework enjoys a further large performance gain after adopting a pre-trained BERT encoder, showing the importance of rich prior knowledge in the triple extraction task.
This work has the following main contributions:

1. We introduce a fresh perspective to revisit the relational triple extraction task with a principled problem formulation, which implies a general algorithmic framework that addresses the overlapping triple problem by design.
2. We instantiate the above framework as a novel cascade binary tagging model on top of a Transformer encoder. This allows the model to combine the power of the novel tagging framework with the prior knowledge in pre-trained large-scale language models.
3. Extensive experiments on two public datasets show that the proposed framework overwhelmingly outperforms state-of-the-art methods, achieving 17.5 and 30.2 absolute gain in F1-score on the two datasets respectively. Detailed analyses show that our model gains consistent improvement in all scenarios.

Related Work
Extracting relational triples from unstructured natural language texts is a well-studied task in information extraction (IE). It is also an important step in the construction of large-scale knowledge graphs (KGs) such as DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008) and Knowledge Vault (Dong et al., 2014). Early works (Mintz et al., 2009; Gormley et al., 2015) address the task in a pipelined manner. They extract relational triples in two separate steps: 1) first run named entity recognition (NER) on the input sentence to identify all entities, and 2) then run relation classification (RC) on pairs of extracted entities. Pipelined methods usually suffer from the error propagation problem and neglect the relevance between the two steps. To ease these issues, many joint models that aim to learn entities and relations jointly have been proposed. Traditional joint models (Yu and Lam, 2010; Li and Ji, 2014; Miwa and Sasaki, 2014; Ren et al., 2017) are feature-based, relying heavily on feature engineering and requiring intensive manual effort. To reduce manual work, recent studies have investigated neural network-based methods, which deliver state-of-the-art performance. However, most existing neural models such as Miwa and Bansal (2016) achieve joint learning of entities and relations only through parameter sharing, not joint decoding. To obtain relational triples, they still have to feed the detected entity pairs to a relation classifier for identifying the relation between the entities. This separate decoding setting leads to separate training objectives for entities and relations, with the drawback that the triple-level dependencies between predicted entities and relations cannot be fully exploited. Different from those works, Zheng et al. (2017) achieve joint decoding by introducing a unified tagging scheme, converting relational triple extraction into an end-to-end sequence tagging problem without the need for separate NER or RC.
By integrating the information of entities and relations into the unified tagging scheme, their method can directly model relational triples as a whole at the triple level.
Though joint models (with or without joint decoding) have been well studied, most previous works ignore the problem of overlapping relational triples. Zeng et al. (2018) introduced three patterns of overlapping triples and tried to address the problem via a sequence-to-sequence model with a copy mechanism. Recently, Fu et al. (2019) also studied the problem and proposed a method based on graph convolutional networks (GCNs). Despite their initial success, both methods still treat relations as discrete labels of entity pairs, making it quite hard for the model to learn overlapping triples.
Our framework is based on a training objective that is carefully designed to directly model the relational triples as a whole, as in Zheng et al. (2017), i.e., to learn both entities and relations through joint decoding. Moreover, we model the relations as functions that map subjects to objects, which makes our approach crucially different from previous works.

The CASREL Framework
The goal of relational triple extraction is to identify all possible (subject, relation, object) triples in a sentence, where some triples may share the same entities as subjects or objects. Towards this goal, we directly model the triples and design a training objective right at the triple level. This is in contrast to previous approaches such as Fu et al. (2019), where the training objective is defined separately for entities and relations without explicitly modeling their integration at the triple level.
Formally, given an annotated sentence x_j from the training set D and a set of potentially overlapping triples T_j = {(s, r, o)} in x_j, we aim to maximize the data likelihood of the training set D:

  ∏_{j=1}^{|D|} [ ∏_{(s,r,o) ∈ T_j} p((s, r, o) | x_j) ]                                                      (1)
  = ∏_{j=1}^{|D|} [ ∏_{s ∈ T_j} p(s | x_j) ∏_{(r,o) ∈ T_j|s} p((r, o) | s, x_j) ]                             (2)
  = ∏_{j=1}^{|D|} [ ∏_{s ∈ T_j} p(s | x_j) ∏_{r ∈ T_j|s} p_r(o | s, x_j) ∏_{r ∈ R\T_j|s} p_r(o_∅ | s, x_j) ]  (3)

Here we slightly abuse the notation T_j: s ∈ T_j denotes a subject appearing in the triples of T_j; T_j|s is the set of triples in T_j led by subject s; (r, o) ∈ T_j|s is a (relation, object) pair in the triples led by subject s; R is the set of all possible relations; R\T_j|s denotes all relations except those led by s in T_j; and o_∅ denotes a "null" object (explained below). Eq. (2) applies the chain rule of probability. Eq. (3) exploits the crucial fact that for a given subject s, any relation relevant to s (those in T_j|s) leads to corresponding objects in the sentence, while all other relations necessarily have no object in the sentence, i.e., a "null" object.
This formulation provides several benefits. First, since the data likelihood starts at the triple level, optimizing this likelihood corresponds to directly optimizing the final evaluation criteria at the triple level. Second, by making no assumption on how multiple triples may share entities in a sentence, it handles the overlapping triple problem by design. Third, the decomposition in Eq. (3) inspires a novel tagging scheme for triple extraction: we learn a subject tagger p(s|x j ) that recognizes subject entities in a sentence; and for each relation r, we learn an object tagger p r (o|s, x j ) that recognizes relationspecific objects for a given subject. In this way we can model each relation as a function that maps subjects to objects, as opposed to classifying relations for (subject, object) pairs. Indeed, this novel tagging scheme allows us to extract multiple triples at once: we first run the subject tagger to find all possible subjects in the sentence, and then for each subject found, apply relation-specific object taggers to find all relevant relations and the corresponding objects.
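The two-step extraction procedure described above can be sketched in a few lines of Python. This is only an illustration: `subject_tagger` and `object_taggers` are hypothetical stand-ins for the learned tagging modules, and an empty object list plays the role of the "null" object o_∅.

```python
def extract_triples(sentence, subject_tagger, object_taggers):
    """Two-step cascade extraction over one sentence.

    subject_tagger:  sentence -> list of subject spans (step 1)
    object_taggers:  dict mapping each relation r to a tagger
                     f_r(subject, sentence) -> list of object spans (step 2);
                     an empty list means relation r yields no triple for s.
    """
    triples = []
    for s in subject_tagger(sentence):            # step 1: all subjects
        for r, tagger in object_taggers.items():  # step 2: every relation
            for o in tagger(s, sentence):         # possibly several objects
                triples.append((s, r, o))
    return triples
```

Because every relation is queried for every detected subject, multiple (possibly overlapping) triples sharing a subject fall out of the loop naturally.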
The key components of the above general framework, i.e., the subject tagger and the relation-specific object taggers, can be instantiated in many ways. In this paper, we instantiate them as binary taggers on top of the deep bidirectional Transformer BERT (Devlin et al., 2019). We describe the details below.

BERT Encoder
The encoder module extracts a feature representation x_j from the sentence x_j, which is then fed into the subsequent tagging modules. We employ a pre-trained BERT model (Devlin et al., 2019) to encode the context information.
Here we briefly review BERT, a multi-layer bidirectional Transformer-based language representation model. It is designed to learn deep representations by jointly conditioning on both the left and right context of each word, and it has recently proven surprisingly effective in many downstream tasks (Zhong et al., 2019). Specifically, it is composed of a stack of N identical Transformer blocks. Denoting a Transformer block as Trans(x), where x represents the input vector, the detailed operations are as follows:

  h_0 = S W_s + W_p                      (4)
  h_α = Trans(h_{α-1}),  α ∈ [1, N]      (5)

where S is the matrix of one-hot vectors of sub-word indices in the input sentence, W_s is the sub-word embedding matrix, W_p is the positional embedding matrix (with p the position index in the input sequence), h_α is the hidden state vector, i.e., the context representation of the input sentence at the α-th layer, and N is the number of Transformer blocks. Note that in our work the input is a single text sentence rather than a sentence pair, so the segmentation embedding described in the original BERT paper is not included in Eq. (4). For a more comprehensive description of the Transformer structure, we refer readers to Vaswani et al. (2017).
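As an unofficial illustration of the embedding step in Eq. (4): since S is a one-hot matrix, the product S W_s reduces to a row lookup of the sub-word embedding table, to which the positional embeddings are added. The function and matrix names below are illustrative, not the authors' code.

```python
import numpy as np

def bert_input_embeddings(token_ids, W_s, W_p):
    """h_0 = S W_s + W_p (Eq. 4, without the segment embedding).

    token_ids: (L,) integer array of sub-word indices
    W_s:       (V, d) sub-word embedding matrix
    W_p:       (P, d) positional embedding matrix, P >= L
    returns:   (L, d) initial hidden states h_0
    """
    L = len(token_ids)
    # one-hot matrix product S @ W_s is equivalent to fancy-indexed row lookup
    return W_s[token_ids] + W_p[:L]
```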

Cascade Decoder
Now we describe our instantiation of the novel cascade binary tagging scheme inspired by the above formulation. The basic idea is to extract triples in two cascade steps. First, we detect subjects in the input sentence. Then, for each candidate subject, we check all possible relations to see whether a relation can associate objects in the sentence with that subject. Corresponding to these two steps, the cascade decoder consists of two modules, as illustrated in Figure 2: a subject tagger and a set of relation-specific object taggers.

Subject Tagger
The low-level tagging module is designed to recognize all possible subjects in the input sentence by directly decoding the encoded vectors h_N produced by the N-layer BERT encoder. More precisely, it adopts two identical binary classifiers to detect the start and end positions of subjects, respectively, by assigning each token a binary tag (0/1) that indicates whether the current token is the start or end position of a subject. The detailed operations of the subject tagger on each token are as follows:

  p_i^{start_s} = σ(W_start x_i + b_start)   (6)
  p_i^{end_s}   = σ(W_end x_i + b_end)       (7)

where p_i^{start_s} and p_i^{end_s} represent the probability of identifying the i-th token in the input sequence as the start and end position of a subject, respectively. The corresponding token is assigned the tag 1 if the probability exceeds a certain threshold, and the tag 0 otherwise. x_i is the encoded representation of the i-th token in the input sequence, i.e., x_i = h_N[i]; W_(·) represents a trainable weight, b_(·) is the bias, and σ is the sigmoid activation function.
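A minimal numpy sketch of the subject tagger's per-token start/end classifiers: a sigmoid over a linear projection of each encoder output, thresholded into 0/1 tags. The weights, biases, and threshold here are illustrative placeholders, not the trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def subject_tagger(h_N, W_start, b_start, W_end, b_end, threshold=0.5):
    """Per-token binary start/end tagging over encoder outputs.

    h_N:              (L, d) token representations from the encoder
    W_start, W_end:   (d,) weight vectors of the two binary classifiers
    b_start, b_end:   scalar biases
    returns:          two (L,) arrays of 0/1 tags
    """
    p_start = sigmoid(h_N @ W_start + b_start)  # probability of being a start
    p_end = sigmoid(h_N @ W_end + b_end)        # probability of being an end
    return (p_start > threshold).astype(int), (p_end > threshold).astype(int)
```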
The subject tagger optimizes the following likelihood function to identify the span of a subject s given a sentence representation x:

  p_θ(s | x) = ∏_{t ∈ {start_s, end_s}} ∏_{i=1}^{L} (p_i^t)^{I{y_i^t = 1}} (1 - p_i^t)^{I{y_i^t = 0}}   (8)

where L is the length of the sentence, and I{z} = 1 if z is true and 0 otherwise. y_i^{start_s} is the binary tag of the subject start position for the i-th token in x, and y_i^{end_s} indicates the subject end position. The parameters are θ = {W_start, b_start, W_end, b_end}.

Figure 2: An overview of the proposed CASREL framework. In this example, there are three candidate subjects detected at the low level, while the presented 0/1 tags at the high level are specific to the first subject Jackie R. Brown, i.e., a snapshot of the iteration state at k = 1 is shown. For the subsequent iterations (k = 2, 3), the results at the high level will change, reflecting different triples detected. For instance, at k = 2, the high-level orange (green) blocks change to 0 (1), respectively, reflecting the relational triple (Washington, Capital of, United States Of America) led by the second candidate subject Washington.
For multiple subject detection, we adopt the nearest start-end pair matching principle to decide the span of any subject based on the results of the start and end position taggers. For example, as shown in Figure 2, the nearest end token to the first start token "Jackie" is "Brown"; hence the detected span of the first subject is "Jackie R. Brown". Notably, when matching an end token for a given start token, we do not consider tokens that precede the given start token. This matching strategy maintains the integrity of any entity span whenever the start and end positions are both correctly detected, owing to the natural continuity of entity spans in a sentence.
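The nearest start-end pair matching principle can be sketched as follows (an illustrative reimplementation of the decoding heuristic, not the authors' code):

```python
def match_spans(start_tags, end_tags):
    """For each start position tagged 1, pick the nearest end position
    tagged 1 that is at or after it; earlier positions are never matched.

    start_tags, end_tags: sequences of 0/1 tags of equal length
    returns: list of (start_index, end_index) spans
    """
    end_positions = [i for i, t in enumerate(end_tags) if t == 1]
    spans = []
    for start in (i for i, t in enumerate(start_tags) if t == 1):
        for end in end_positions:
            if end >= start:          # never look backwards
                spans.append((start, end))
                break                 # nearest end wins
    return spans
```

For instance, tags with starts at "Jackie" and ends at "Brown" would yield the contiguous span covering "Jackie R. Brown".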

Relation-specific Object Taggers
The high-level tagging module simultaneously identifies the objects as well as the involved relations with respect to the subjects obtained at the lower level. As Figure 2 shows, it consists of a set of relation-specific object taggers, one per relation, each with the same structure as the subject tagger in the low-level module. All object taggers identify the corresponding object(s) for each detected subject at the same time.
Unlike the subject tagger, which directly decodes the encoded vectors h_N, the relation-specific object tagger also takes the subject features into account. The detailed operations of the relation-specific object tagger on each token are as follows:

  p_i^{start_o} = σ(W_r^{start} (x_i + v_sub^k) + b_r^{start})   (9)
  p_i^{end_o}   = σ(W_r^{end} (x_i + v_sub^k) + b_r^{end})       (10)

where p_i^{start_o} and p_i^{end_o} represent the probability of identifying the i-th token in the input sequence as the start and end position of an object, respectively, and v_sub^k is the encoded representation vector of the k-th subject detected by the low-level module.
For each detected subject, we iteratively apply the same decoding process. Note that a subject is usually composed of multiple tokens; to make the additions of x_i and v_sub^k in Eq. (9) and Eq. (10) possible, we need to keep the dimensions of the two vectors consistent. To do so, we take the averaged vector representation of the tokens between the start and end positions of the k-th subject as v_sub^k. The object tagger for relation r optimizes the following likelihood function to identify the span of an object o given a sentence representation x and a subject s:

  p_{φ_r}(o | s, x) = ∏_{t ∈ {start_o, end_o}} ∏_{i=1}^{L} (p_i^t)^{I{y_i^t = 1}} (1 - p_i^t)^{I{y_i^t = 0}}   (11)

where y_i^{start_o} is the binary tag of the object start position for the i-th token in x, and y_i^{end_o} is the tag of the object end position for the i-th token. For a "null" object o_∅, the tags y_i^{start_o} and y_i^{end_o} are 0 for all tokens. The parameters are φ_r = {W_r^{start}, b_r^{start}, W_r^{end}, b_r^{end}}.

Note that in the high-level tagging module, the relation is also decided by the output of the object taggers. For example, the relation "Work in" does not hold between the detected subject "Jackie R. Brown" and the candidate object "Washington"; therefore, the object tagger for relation "Work in" will not identify the span of "Washington", i.e., the outputs of both start and end positions are all zeros, as shown in Figure 2. In contrast, the relation "Birth place" holds between "Jackie R. Brown" and "Washington", so the corresponding object tagger outputs the span of the candidate object "Washington". In this setting, the high-level module is capable of simultaneously identifying the relations and objects with regard to the subjects detected by the low-level module.
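A minimal numpy sketch of one relation-specific object tagger, showing how the averaged subject representation v_sub^k is added to every token vector before the start/end classification. All weights, biases, and the threshold are illustrative placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def object_tagger(h_N, subject_span, W_start_r, b_start_r, W_end_r, b_end_r,
                  threshold=0.5):
    """Relation-specific object tagging conditioned on a detected subject.

    h_N:          (L, d) token representations from the encoder
    subject_span: (start, end) token indices of the k-th detected subject
    W_*_r, b_*_r: relation-specific weights (d,) and scalar biases
    returns:      two (L,) arrays of 0/1 tags for object start/end
    """
    start, end = subject_span
    # v_sub: average of the subject's token vectors, keeping dimension d
    v_sub = h_N[start:end + 1].mean(axis=0)
    x = h_N + v_sub  # broadcast the subject feature onto every token
    p_start = sigmoid(x @ W_start_r + b_start_r)
    p_end = sigmoid(x @ W_end_r + b_end_r)
    return (p_start > threshold).astype(int), (p_end > threshold).astype(int)
```

If no token probability clears the threshold, both tag sequences are all zeros, which is exactly the "null" object case for that relation.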

Data Log-likelihood Objective
Taking the log of Eq. (3), the objective J(Θ) is:

  J(Θ) = Σ_{j=1}^{|D|} [ Σ_{s ∈ T_j} log p_θ(s | x_j) + Σ_{(r,o) ∈ T_j|s} log p_{φ_r}(o | s, x_j) + Σ_{r ∈ R\T_j|s} log p_{φ_r}(o_∅ | s, x_j) ]   (12)

where Θ = {θ, {φ_r}_{r ∈ R}}.

Note that we instantiate the CASREL framework on top of a pre-trained BERT model to combine the power of the proposed tagging scheme with pre-learned prior knowledge for better performance. To evaluate the impact of introducing the Transformer-based BERT model, we conduct a set of ablation tests: CASREL_random is the framework with all BERT parameters randomly initialized; CASREL_LSTM is the framework instantiated on an LSTM-based structure as in Zheng et al. (2017) with pre-trained GloVe embeddings (Pennington et al., 2014); CASREL is the full-fledged framework using pre-trained BERT weights.

Table 2 shows the results of different baselines for relational triple extraction on the two datasets. The CASREL model overwhelmingly outperforms all baselines in terms of all three evaluation metrics and achieves encouraging 17.5 and 30.2 absolute improvements in F1-score over the best state-of-the-art method (Zeng et al., 2019) on the NYT and WebNLG datasets, respectively. Even without taking advantage of pre-trained BERT, CASREL_random and CASREL_LSTM remain competitive with existing state-of-the-art models. This validates the utility of the proposed cascade decoder with its novel binary tagging scheme. The performance improvement from CASREL_random to CASREL highlights the importance of the prior knowledge in a pre-trained language model. We can also observe from the table a significant gap between the performance of existing models on NYT and WebNLG, and we believe this gap is due to their drawbacks in dealing with overlapping triples. More precisely, as presented in Table 1, the NYT dataset is mainly comprised of Normal-class sentences, while the majority of sentences in WebNLG belong to the EPO and SEO classes.
Such an inconsistent data distribution between the two datasets leads to comparatively better performance on NYT and worse performance on WebNLG for all the baselines, exposing their drawbacks in extracting overlapping relational triples. In contrast, the CASREL model and its variants (i.e., CASREL_random and CASREL_LSTM) all achieve stable and competitive performance on both NYT and WebNLG, demonstrating the effectiveness of the proposed framework in solving the overlapping problem.

Main Results
Detailed Results on Different Types of Sentences To further study the capability of the proposed CASREL framework in extracting overlapping relational triples, we conduct two extended experiments on different types of sentences and compare the performance with previous works.
The detailed results on the three different overlapping patterns are presented in Figure 3. It can be seen that the performance of most baselines on Normal, EPO and SEO presents a decreasing trend, reflecting the increasing difficulty of extracting relational triples from sentences with these overlapping patterns. That is, among the three patterns, the Normal class is the easiest, while EPO and SEO are relatively harder for baseline models. In contrast, the proposed CASREL model attains consistently strong performance over all three overlapping patterns, especially the harder ones. We also validate CASREL's capability of extracting relational triples from sentences containing different numbers of triples. We split the sentences into five classes, and Table 3 shows the results. Again, the CASREL model achieves excellent performance over all five classes. Though it is not surprising that the performance of most baselines decreases as the number of relational triples in a sentence increases, some patterns can still be observed in the performance changes of different models. Compared to previous works devoted to solving the overlapping problem in relational triple extraction, our model suffers the least from the increasing complexity of the input sentence. Though the CASREL model gains considerable improvements on all five classes compared to the best state-of-the-art method CopyR_RL (Zeng et al., 2019), the greatest improvement in F1-score on both datasets comes from the most difficult class (N ≥ 5), indicating that our model is better suited to complicated scenarios than the baselines.
Both of these experiments validate the superiority of the proposed cascade binary tagging framework in extracting multiple (possibly overlapping) relational triples from complicated sentences compared to existing methods. Previous works have to explicitly predict all possible relation types contained in a given sentence, which is quite challenging, and thus many relations are missing from their extracted results. In contrast, our CASREL model side-steps the prediction of relation types and tends to extract as many relational triples as possible from a given sentence. We attribute this to the relation-specific object tagger setting in the high-level tagging module of the cascade decoder, which considers all relation types simultaneously.

Conclusion
In this paper, we introduce a novel cascade binary tagging framework (CASREL) derived from a principled problem formulation for relational triple extraction. Instead of modeling relations as discrete labels of entity pairs, we model relations as functions that map subjects to objects, which provides a fresh perspective on the relational triple extraction task. As a consequence, our model can simultaneously extract multiple relational triples from sentences without suffering from the overlapping problem. We conduct extensive experiments on two widely used datasets to validate the effectiveness of the proposed CASREL framework. Experimental results show that our model overwhelmingly outperforms state-of-the-art baselines over different scenarios, especially on the extraction of overlapping relational triples.

A Implementation Details
We adopt a mini-batch mechanism to train our model with a batch size of 6; the learning rate is set to 1e-5; the hyper-parameters are determined on the validation set. We also adopt an early stopping mechanism to prevent over-fitting: specifically, we stop the training process when the performance on the validation set does not improve for at least 7 consecutive epochs. The number of stacked bidirectional Transformer blocks is N = 12 and the size of the hidden state h_N is 768. The pre-trained BERT model we use is [BERT-Base, Cased], which contains 110M parameters. For fair comparison, the maximum length of an input sentence to our model is set to 100 words, as previous works (Zeng et al., 2018; Fu et al., 2019) suggest. We did not tune the threshold for the start and end position taggers to predict tag 1, but heuristically set it to the default of 0.5. The performance might be better after carefully tuning the threshold; however, this is beyond the scope of this paper.
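The early stopping criterion described above can be sketched as a small helper (a hypothetical illustration, not the authors' code): training stops once the best validation score seen so far has not been beaten for `patience` consecutive epochs.

```python
def should_stop(val_scores, patience=7):
    """Return True once the last `patience` epochs fail to improve
    on the best validation score achieved before them.

    val_scores: list of per-epoch validation scores, oldest first
    """
    if len(val_scores) <= patience:
        return False  # not enough history to judge stagnation
    best_before = max(val_scores[:-patience])
    # stop only if none of the recent epochs beat the earlier best
    return all(score <= best_before for score in val_scores[-patience:])
```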

B Error Analysis
To explore the factors that affect the relational triples extracted by the CASREL model, we analyze the performance on predicting different elements of the triple (E1, R, E2), where E1 represents the subject entity, E2 the object entity, and R the relation between them. An element such as (E1, R) is regarded as correct only if the subject and the relation in the predicted triple (E1, R, E2) are both correct, regardless of the correctness of the predicted object. Similarly, an instance of E1 is correct as long as the subject in the extracted triple is correct, and likewise for E2 and R. Table 4 shows the results on different relational triple elements. For NYT, the performance on E1 and E2 is consistent with that on (E1, R) and (R, E2), demonstrating the effectiveness of the proposed framework in identifying both subject and object entity mentions. We also find only a trivial gap between the F1-scores on (E1, E2) and (E1, R, E2), but an obvious gap between (E1, R, E2) and (E1, R)/(R, E2). This reveals that most relations for the entity pairs in extracted triples are correctly identified, while some extracted entities fail to form a valid relational triple; in other words, identifying relations is somewhat easier than identifying entities for our model. In contrast to NYT, for WebNLG the performance gap between (E1, E2) and (E1, R, E2) is comparatively larger than that between (E1, R, E2) and (E1, R)/(R, E2). This shows that misidentifying relations brings more performance degradation than misidentifying entities, and indicates that it is more challenging for the proposed CASREL model to identify relations than entities in WebNLG, as opposed to what we observe on NYT. We attribute this difference to the different numbers of relations in the two datasets (i.e., 24 in NYT and 246 in WebNLG), which makes relation identification much harder in WebNLG.

C Supplemental Experiments
In addition to validating the effectiveness of the proposed CASREL framework in handling the overlapping triple problem, we also conduct a set of supplemental experiments to show its generalization capability in more general cases on four widely used datasets, namely ACE04, NYT10-HRL, NYT11-HRL and Wiki-KBP. Unlike the datasets adopted in the main experiments, most test sentences in these datasets belong to the Normal class, where no triples overlap with each other. Table 5 shows the results of a comprehensive comparison with recent state-of-the-art methods.
Notably, two different evaluation metrics have been selectively adopted among previous works: (1) the widely used Partial Match, as described in Section 4.1, where an extracted relational triple (subject, relation, object) is regarded as correct only if the relation and the heads of both subject and object are correct (Li and Ji, 2014; Miwa and Bansal, 2016; Katiyar and Cardie, 2017; Zheng et al., 2017; Zeng et al., 2018; Takanobu et al., 2019); and (2) Exact Match, where an extracted triple is regarded as correct only if the relation and the whole spans of both subject and object are correct.

ACE04 We follow the same 5-fold cross-validation setting as adopted in previous works (Li and Ji, 2014; Miwa and Bansal, 2016; Li et al., 2019) and use the code released by Miwa and Bansal (2016) to preprocess the raw XML-style data for fair comparison. This yields 2,171 valid sentences in total, each containing at least one relational triple.
NYT10-HRL & NYT11-HRL The NYT corpus has two versions: (1) the original version, in which both the training set and the test set are produced via distant supervision by Riedel et al. (2010), and (2) a smaller version with fewer relation types, whose training set is produced by distant supervision while the test set is manually annotated by Hoffmann et al. (2011). We denote the original one and the smaller one as NYT10 and NYT11, respectively. (The preprocessing code released by Miwa and Bansal (2016) is available at https://github.com/tticoin/LSTM-ER.) These two versions have been selectively adopted and preprocessed in many different ways across previous works, which can be confusing and leads to incomparable results if the version is not specified. To compare these models fairly, HRL (Takanobu et al., 2019) adopted a unified preprocessing for both NYT10 and NYT11 and provided a comprehensive comparison with previous works on the same datasets. We denote the two preprocessed versions as NYT10-HRL and NYT11-HRL.
For fair comparison, we use the preprocessed datasets released by Takanobu et al. (2019), where NYT10-HRL contains 70,339 sentences for training and 4,006 sentences for testing, and NYT11-HRL contains 62,648 sentences for training and 369 sentences for testing. We also create a validation set for each dataset by randomly sampling 0.5% of the data from its training set, as in Takanobu et al. (2019).
Wiki-KBP We use the same version as adopted by Dai et al. (2019), where the training set is from Liu et al. (2017) and the test set is from Ren et al. (2017). It has 79,934 sentences for training and 289 sentences for testing. We also create a validation set by randomly sampling 10% of the data from the test set, as Dai et al. (2019) suggested.
Dataset Study As stated above, these datasets are not well suited to testing the overlapping problem. To further support this claim, we analyze the datasets in detail; the statistics are shown in Table 6. We find that the test data in these datasets suffer little from the so-called overlapping triple problem, since the sentences contain few overlapping triples. Even worse, we also find that the