SelfORE: Self-supervised Relational Feature Learning for Open Relation Extraction

Open relation extraction is the task of extracting open-domain relation facts from natural language sentences. Existing works either utilize heuristics or distant-supervised annotations to train a supervised classifier over pre-defined relations, or adopt unsupervised methods with additional assumptions that have less discriminative power. In this work, we propose a self-supervised framework named SelfORE, which exploits weak, self-supervised signals by leveraging a large pretrained language model for adaptive clustering on contextualized relational features, and bootstraps the self-supervised signals by improving contextualized features in relation classification. Experimental results on three datasets show the effectiveness and robustness of SelfORE on open-domain relation extraction compared with competitive baselines. Source code is available at https://github.com/THU-BPM/SelfORE.


Introduction
Given the huge amount of text that people generate, Relation Extraction (RE) aims to extract triplets of the form (subject, relation, object) from sentences, discovering the semantic relation that holds between two entities mentioned in the text. For example, given the sentence "Derek Bell was born in Belfast", we can extract the relation "born in" between the two entities "Derek Bell" and "Belfast". The extracted triplets are used in various downstream applications such as web search, question answering, and natural language understanding.
Existing RE methods work well on pre-defined relations that have already appeared in either human-annotated datasets or knowledge bases. In practice, however, human annotation is labor-intensive to obtain and hard to scale up to a large number of relations. Many efforts have been made to alleviate the human annotation effort in Relation Extraction. Distant Supervision (Mintz et al., 2009) is a widely used technique for training a supervised relation extraction model with less annotation, as it only requires a small number of annotated triplets as supervision. However, distantly supervised methods usually make strong assumptions about entity co-occurrence without sufficient context, which leads to noisy and sparse matching results. More importantly, distant supervision works on a set of pre-defined relations, which prevents its application to open-domain text corpora.
Open Relation Extraction (OpenRE) aims at inferring and extracting triplets where the target relations cannot be specified in advance. Besides approaches that first identify relational phrases from open-domain corpora using heuristics or external labels via distant supervision and then recognize entity pairs (Yates et al., 2007; Fader et al., 2011), clustering-based unsupervised representation learning models have recently attracted considerable attention due to their ability to recognize triplets from meaningful semantic features with minimal or even no human annotation. Yao et al. (2011) regard OpenRE as a totally unsupervised task and use a clustering method to extract triplets with new relation types. However, this approach cannot effectively discard irrelevant information and select meaningful relations. Simon et al. (2019) train expressive relation extraction models in an unsupervised setting, but their method still requires that the exact number of relations in the open-domain corpora be known in advance.
Figure 1: Open Relation Extraction via Self-supervised Learning. (Recoverable panel labels: Input, Adaptive Clustering, Pseudo Label Generation, Relation Classification, Backprop, L_AC, L_RC.)
To further alleviate the human annotation effort while obtaining high-quality supervision for open relation extraction, we propose a self-supervised learning framework that obtains supervision from the data itself and learns to improve the supervision quality by learning better feature representations in an iterative fashion. The proposed framework has three modules: Contextualized Relation Encoder, Adaptive Clustering, and Relation Classification. As shown in Figure 1, the Contextualized Relation Encoder leverages a pretrained BERT model to encode entity pair representations based on the context in which they are mentioned.
To recognize and facilitate proximity of relevant entity pairs in the relational semantic space, the Adaptive Clustering module clusters the contextualized entity pair representations generated by the Contextualized Relation Encoder and generates pseudo labels as the self-supervision. The Relation Classification module takes the cluster labels as pseudo labels to train a relation classifier. The Relation Classification loss on the self-supervised pseudo labels helps improve the contextualized entity pair features in the Contextualized Relation Encoder, which in turn improves the pseudo label quality in Adaptive Clustering in an iterative fashion.
To summarize, the main contributions of this work are as follows:
• We developed a novel self-supervised learning framework, SelfORE, for relation extraction from open-domain corpora where no relational human annotation is available.
• We demonstrated how to leverage pretrained language models to learn and refine contextualized entity pair representations via a self-supervised training schema.
• We showed that the self-supervised model outperforms strong baselines, and is robust when no prior information is available on target relations.

Proposed Model
The proposed model SelfORE consists of three modules: Contextualized Relation Encoder, Adaptive Clustering, and Relation Classification. As illustrated in Figure 1, the Contextualized Relation Encoder takes sentences as the input, where named entities are recognized and marked in the sentence. The Contextualized Relation Encoder leverages the pretrained BERT (Devlin et al., 2018) model to output contextualized entity pair representations. Adaptive Clustering takes the contextualized entity pair representations as the input and performs clustering to determine the relational cluster each entity pair belongs to. Unlike traditional clustering methods, which assign hard cluster labels to each entity pair and are sensitive to the number of clusters, Adaptive Clustering performs soft assignment, which encourages high-confidence assignments and is insensitive to the number of clusters. The pseudo labels based on the clustering results are considered as self-supervised prior knowledge, which further guides Relation Classification and feature learning in the Contextualized Relation Encoder. Before introducing the details of each module, we briefly summarize the overall learning schema:
① Obtain contextualized entity pair representations based on entities mentioned in sentences using the Contextualized Relation Encoder.
② Apply Adaptive Clustering on the updated entity pair representations from ① to generate pseudo labels for all relational entity pairs.
③ Use the pseudo labels as supervision to train and update both the Contextualized Relation Encoder and Relation Classification. Repeat ②.
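The learning schema above can be sketched as a simple alternating loop. This is only a structural sketch, not the released implementation: the three callables (`encode`, `cluster`, `train_classifier`) and the fixed `num_rounds` stopping rule are hypothetical placeholders introduced for illustration.

```python
from typing import Callable, List, Sequence


def selfore_loop(
    sentences: Sequence[str],
    encode: Callable[[Sequence[str]], List[List[float]]],
    cluster: Callable[[List[List[float]]], List[int]],
    train_classifier: Callable[[Sequence[str], List[int]], None],
    num_rounds: int = 3,  # assumption: a fixed number of rounds as the stopping rule
) -> List[int]:
    """Alternate between clustering and classification on pseudo labels."""
    pseudo_labels: List[int] = []
    for _ in range(num_rounds):
        # Step 1: entity pair representations from the (possibly updated) encoder.
        features = encode(sentences)
        # Step 2: clustering turns the representations into pseudo labels.
        pseudo_labels = cluster(features)
        # Step 3: the classifier (and, through it, the encoder) is trained
        # to predict the pseudo labels.
        train_classifier(sentences, pseudo_labels)
    return pseudo_labels
```

Because the encoder is updated in the third step, the representations are recomputed at the start of every round, so each clustering pass sees the refined features.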

Contextualized Relation Encoder
The contextualized relation encoder aims to extract contextualized relational representations between two given entities in a sentence. In this work, we assume that named entities in the text have been recognized ahead of time, and we focus only on binary relations, which involve two entities. The type of relationship between a pair of entities can be reflected by their contexts, and the nuances of expression in those contexts also contribute to the relational representation of entity pairs. Therefore, we leverage a pretrained deep bidirectional Transformer network (Devlin et al., 2018) to effectively encode entity pairs along with their context information.
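For a sentence $X = [x_1, \dots, x_T]$ in which two entities E1 and E2 are mentioned, we follow the labeling schema adopted in Soares et al. (2019) and augment X with four reserved tokens that mark the beginning and the end of each entity mention. We introduce $[E1_{start}]$, $[E1_{end}]$, $[E2_{start}]$, and $[E2_{end}]$ and inject them into X:
$$X = [x_1, \dots, [E1_{start}], x_i, \dots, x_{j-1}, [E1_{end}], \dots, [E2_{start}], x_k, \dots, x_{l-1}, [E2_{end}], \dots, x_T] \tag{1}$$
where E1 spans the tokens $x_i, \dots, x_{j-1}$ and E2 spans the tokens $x_k, \dots, x_{l-1}$. The augmented X is the input token sequence for the Contextualized Relation Encoder.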
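A minimal sketch of the marker injection described above, operating on an already tokenized sentence. The marker strings and the half-open span convention are assumptions made for illustration; the actual reserved-token spellings depend on the tokenizer vocabulary.

```python
from typing import List, Sequence, Tuple

# Assumed spellings of the four reserved tokens (hypothetical).
E1_START, E1_END = "[E1start]", "[E1end]"
E2_START, E2_END = "[E2start]", "[E2end]"


def add_entity_markers(
    tokens: Sequence[str],
    e1_span: Tuple[int, int],
    e2_span: Tuple[int, int],
) -> List[str]:
    """Wrap two non-overlapping entity spans with start/end markers.

    Spans are half-open token index ranges [start, end).
    """
    starts = {e1_span[0]: E1_START, e2_span[0]: E2_START}
    ends = {e1_span[1]: E1_END, e2_span[1]: E2_END}
    out: List[str] = []
    for i, token in enumerate(tokens):
        # Close an entity that ends right before token i, then open a new one.
        if i in ends:
            out.append(ends[i])
        if i in starts:
            out.append(starts[i])
        out.append(token)
    # An entity may end at the very end of the sentence.
    if len(tokens) in ends:
        out.append(ends[len(tokens)])
    return out
```

For the example sentence from the introduction, `add_entity_markers(["Derek", "Bell", "was", "born", "in", "Belfast"], (0, 2), (5, 6))` places the E1 markers around "Derek Bell" and the E2 markers around "Belfast", so the output has four more tokens than the input.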
The contextualized relation encoder is denoted as $f_\theta(X, E1, E2)$. To get the relation representation of the two entities E1 and E2, instead of using the output of the [CLS] token from BERT, which summarizes sentence-level semantics, we use the outputs at the $[E1_{start}]$ and $[E2_{start}]$ positions as the contextualized entity representations and concatenate them to derive a fixed-length relation representation $h \in \mathbb{R}^{2 \cdot h_R}$:
$$h = [h_{[E1_{start}]}, h_{[E2_{start}]}] \tag{2}$$
where $h_{[E1_{start}]}$ and $h_{[E2_{start}]}$ are the encoder outputs at the two start markers, each of dimension $h_R$.
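The concatenation step can be illustrated without any deep learning library. The sketch below assumes one hidden vector per input token (ignoring subword splitting) and uses hypothetical marker strings; it only shows how the two start-marker outputs are selected and joined.

```python
from typing import List, Sequence


def relation_representation(
    tokens: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    e1_start: str = "[E1start]",  # assumed marker spelling
    e2_start: str = "[E2start]",  # assumed marker spelling
) -> List[float]:
    """Concatenate the encoder outputs at the two entity start markers."""
    i1 = tokens.index(e1_start)
    i2 = tokens.index(e2_start)
    # Each selected vector has h_R entries, so the result has 2 * h_R entries.
    return list(hidden_states[i1]) + list(hidden_states[i2])
```

The output length is always twice the hidden size, regardless of sentence length, which is what makes the representation usable as a fixed-length input to clustering.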

Adaptive Clustering
After obtaining $H = \{h_1, h_2, \dots, h_N\}$ from N contextualized entity pair representations using the Contextualized Relation Encoder, Adaptive Clustering aims to cluster the entity pair representations into K semantically meaningful clusters. Adaptive Clustering gives each entity pair a cluster label, which serves as the pseudo label for later stages.
Compared with traditional clustering methods that give a hard label assignment to each entity pair (e.g., k-means), Adaptive Clustering adopts a soft-assignment, adaptive clustering schema. It encourages high-confidence assignments and is insensitive to the number of clusters. More specifically, Adaptive Clustering consists of two parts: (1) a non-linear mapping $g_\phi$ that converts the entity pair representation h to a latent representation z, and (2) a set of K cluster centroids $\{\mu_k \in \mathbb{R}^{h_{AC}}\}_{k=1}^{K}$ together with a soft assignment of all N entity pairs to the K cluster centroids.
For the first part, we simply adopt a set of fully connected layers as the non-linear mapping. Instead of initializing parameters randomly and training the mapping from scratch, we take the initial parameters from the encoder of an autoencoder model (Vincent et al., 2010). We pretrain the autoencoder separately; it takes h as the input and minimizes the reconstruction loss over all N samples:
$$\mathcal{L}_{AE} = \frac{1}{N} \sum_{n=1}^{N} \left\| h_n - d(g_\phi(h_n)) \right\|^2$$
where d denotes the decoder of the autoencoder.
For the second part, the module learns to optimize the parameters of $g_\phi$ and assign each sample to a cluster with high confidence. We first perform standard k-means clustering in the feature space $\mathbb{R}^{h_{AC}}$ to obtain K initial centroids $\{\mu_k \in \mathbb{R}^{h_{AC}}\}_{k=1}^{K}$. Inspired by Xie et al. (2016), we use the Student's t-distribution as a kernel to measure the similarity between the embedded point $z_n = g_\phi(h_n)$ and each centroid $\mu_k$:
$$q_{nk} = \frac{\left(1 + \|z_n - \mu_k\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{k'} \left(1 + \|z_n - \mu_{k'}\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}$$
where α represents the degrees of freedom of the Student's t-distribution and $q_{nk}$ can be regarded as the probability of assigning sample n to cluster k, i.e., the soft assignment. We set α = 1 for all experiments. We normalize each cluster by frequency to obtain an auxiliary target distribution (Equation 8) and iteratively refine the clusters by learning from their high-confidence assignments with the help of this auxiliary distribution:
$$p_{nk} = \frac{q_{nk}^2 / f_k}{\sum_{k'} q_{nk'}^2 / f_{k'}} \tag{8}$$
where $f_k = \sum_n q_{nk}$ is the soft cluster frequency.
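To make the soft assignment and the auxiliary target distribution concrete, the following pure-Python sketch computes both for a handful of points. It follows the Student's t kernel and frequency normalization of Xie et al. (2016) as described above; the function names and the list-of-lists data layout are illustrative choices, not part of the original work.

```python
from typing import List, Sequence


def soft_assign(
    z: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    alpha: float = 1.0,
) -> List[List[float]]:
    """q[n][k]: Student's t similarity of point n to centroid k, row-normalized."""
    q = []
    for z_n in z:
        kernel = []
        for mu_k in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z_n, mu_k))
            kernel.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(kernel)
        q.append([v / total for v in kernel])
    return q


def target_distribution(q: Sequence[Sequence[float]]) -> List[List[float]]:
    """p[n][k]: squared soft assignments divided by the soft cluster frequency."""
    num_clusters = len(q[0])
    # Soft cluster frequency f_k = sum over n of q[n][k].
    f = [sum(row[k] for row in q) for k in range(num_clusters)]
    p = []
    for row in q:
        weights = [row[k] ** 2 / f[k] for k in range(num_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p
```

Squaring the soft assignments pushes already confident rows closer to a one-hot vector, while dividing by the cluster frequency keeps large clusters from dominating the targets.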
With the auxiliary distribution, we define the KL divergence loss between the soft assignments $q_n$ and the auxiliary distribution $p_n$ to train the Adaptive Clustering module:
$$\mathcal{L}_{AC} = KL(P \| Q) = \sum_{n} \sum_{k} p_{nk} \log \frac{p_{nk}}{q_{nk}}$$
We use a gradient-descent-based optimizer to minimize $\mathcal{L}_{AC}$. Note that only the parameters of $g_\phi$ are updated; the parameters of the Contextualized Relation Encoder ($f_\theta$) are not affected when minimizing $\mathcal{L}_{AC}$. We assign the pseudo label $s_n$ for the n-th entity pair by taking the label associated with the largest probability:
$$s_n = \arg\max_{k \in \{1, \dots, K\}} q_{nk}$$
To alleviate the negative impact of poorly chosen initial centroids, Adaptive Clustering re-selects a set of K initial centroids randomly if $\mathcal{L}_{AC}$ does not decrease after the first epoch.
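The clustering loss and the pseudo-label step can be sketched as follows. The sketch assumes that the "largest probability" refers to the soft assignment q, which the text describes as the probability of assigning a sample to a cluster; the small `eps` term is an added numerical safeguard, not part of the method.

```python
import math
from typing import List, Sequence


def kl_clustering_loss(
    p: Sequence[Sequence[float]],
    q: Sequence[Sequence[float]],
    eps: float = 1e-12,  # assumption: guards against log(0)
) -> float:
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        p_nk * math.log((p_nk + eps) / (q_nk + eps))
        for p_n, q_n in zip(p, q)
        for p_nk, q_nk in zip(p_n, q_n)
    )


def pseudo_labels(q: Sequence[Sequence[float]]) -> List[int]:
    """Index of the most probable cluster for each sample."""
    # Assumption: labels are read off the soft assignment q.
    return [max(range(len(q_n)), key=q_n.__getitem__) for q_n in q]
```

The loss is zero when the soft assignments already match the sharpened targets and grows as they diverge, so minimizing it moves each point toward the centroid it is most confidently assigned to.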
In summary, compared with traditional clustering methods such as k-means, Adaptive Clustering adopts an iterative, soft-assignment learning process that encourages high-confidence assignments and uses them to improve low-confidence ones. It has the following advantages: (1) Adaptive Clustering improves clustering purity and benefits low-confidence assignments, yielding better overall relational clustering performance. (2) It prevents large relational clusters from distorting the hidden feature space. (3) It requires neither the actual number of target relations in advance (although having it as prior knowledge helps) nor the distribution of relations.

Relation Classification
Adaptive Clustering generates cluster labels $S = \{s_1, s_2, \dots, s_N\}$ for all entity pairs as pseudo labels. With these pseudo labels as self-supervised signals derived from the corpora themselves, the Relation Classification module uses the pseudo labels to guide relational feature learning in the Contextualized Relation Encoder as well as the training of the relation classifier.
Similar to a supervised classifier that learns to predict golden labels, the Relation Classification module learns to predict the pseudo labels generated by Adaptive Clustering. More specifically, we have:
$$l_n = c_\tau(f_\theta(X_n, E1, E2))$$
where $c_\tau$ denotes the relation classification module parameterized by τ and $l_n$ is a probability distribution over the K pseudo labels for the n-th sample.
To find the best-performing parameters θ for the Contextualized Relation Encoder and τ for the classifier, we optimize the following classification loss:
$$\mathcal{L}_{RC} = \min_{\theta, \tau} \frac{1}{N} \sum_{n=1}^{N} loss(l_n, one\_hot(s_n))$$
where loss is the cross-entropy loss function and $one\_hot(s_n)$ returns a one-hot vector indicating the pseudo label assignment.
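As a small worked illustration of the classification loss, the sketch below computes the cross entropy between classifier outputs and one-hot pseudo labels. It starts from raw classifier scores (logits) and averages over the batch; both choices, as well as the function names, are assumptions made for the example.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log of the softmax distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def one_hot(label: int, num_classes: int) -> List[float]:
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def classification_loss(
    batch_logits: Sequence[Sequence[float]],
    pseudo_labels: Sequence[int],
) -> float:
    """Mean cross entropy between predictions and one-hot pseudo labels."""
    num_classes = len(batch_logits[0])
    total = 0.0
    for logits, s_n in zip(batch_logits, pseudo_labels):
        log_l_n = log_softmax(logits)
        target = one_hot(s_n, num_classes)
        total += -sum(t * lp for t, lp in zip(target, log_l_n))
    return total / len(pseudo_labels)
```

With two classes and equal scores, the loss equals log 2; it approaches zero as the score of the pseudo-labeled class dominates the others.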

The Bootstrapping Self-Supervision Loop
After optimizing $\mathcal{L}_{RC}$, we repeat Adaptive Clustering and Relation Classification in an iterative fashion, shown as ② and ③ in Figure 1. Overall, Adaptive Clustering exploits weak, self-supervised signals from the data and Relation Classification bootstraps the discriminative power of the [...] and calculate the most frequent n-gram as the surface form. For quantitative evaluation, we assign the majority ground-truth label within each cluster as the predicted relation label for that cluster.

Experiments
We conduct extensive experiments on real-world datasets to show the effectiveness of our self-supervised learning rationale on relation extraction, and give a detailed analysis to show its advantages.

Datasets
We use three labeled datasets to evaluate our model: NYT+FB, T-REx SPO, and T-REx DS. The NYT+FB dataset is generated via distant supervision, aligning sentences from the New York Times corpus (Sandhaus, 2008) with Freebase (Bollacker et al., 2008) triplets. It has been widely used in previous RE works (Marcheggiani and Titov, 2016; Yao et al., 2011; Simon et al., 2019). We follow the setting in Simon et al. (2019) and filter out sentences with non-binary relations. We obtain 41,000 labeled sentences containing 262 target relations from 2 million sentences. 20% of these sentences are used as the validation set for hyperparameter tuning and 80% are used for model training.
Both the T-REx SPO and T-REx DS datasets come from T-REx (Elsahar et al., 2018), which is generated by aligning Wikipedia corpora with Wikidata (Vrandečić, 2012). We filter triplets and keep sentences where both entities appear in the same sentence; a sentence appears multiple times if it contains multiple binary relations associated with different entity pairs. We build two datasets, T-REx SPO and T-REx DS, depending on whether the dataset has surface-form relations or not. For example, the relation give birth to could be conveyed by surface forms such as born in, date of birth, etc. T-REx SPO contains 615 relations and 763,000 sentences, where all sentences contain triplets having the surface-form relation in the sentence. T-REx DS is generated via distant supervision, where the surface form of the relation is not necessarily contained in the sentence. T-REx DS contains 1189 relations and nearly 12 million sentences. The dataset still contains some misalignment, but should nevertheless be easier for models to extract the correct semantic relation. 20% of these sentences are used as the validation set and 80% are used for model training.

Baseline and Evaluation metrics
We use standard unsupervised evaluation metrics to compare with three baseline algorithms (Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019) in the setting where no human annotation is available for Relation Extraction from open-domain data. For all models, we assume the number of target relations is known to the model in advance. We set the number of clusters to the number of ground-truth categories and evaluate performance with B³, V-measure, and ARI.
Additionally, we evaluate the performance of our proposed model in a practical, yet more challenging setting: we assume the number of target relations is not known. A much larger number of clusters $\tilde{K}$, such as 1000, is adopted. To make a fair comparison when $\tilde{K} > K$, we use an unsupervised approach such as k-means to further merge the $\tilde{K}$ clusters into K clusters (the number of ground-truth categories) for evaluation.
For baselines, rel-LDA is a generative model proposed by Yao et al. (2011). We consider two variations of rel-LDA that differ only in the number of features they consider: rel-LDA uses the 3 simplest features, and rel-LDA-full is trained with a total of 8 features listed in Marcheggiani and Titov (2016). UIE (Simon et al., 2019) is the state-of-the-art method that trains a discriminative relation extraction model on unlabeled datasets by forcing the model to predict each relation with confidence and encouraging all relations to be predicted on average. Two base model architectures (UIE-March and UIE-PCNN) are considered. To make a fair comparison, we further introduce UIE-BERT, which is trained with the losses introduced in Simon et al. (2019) but replaces the PCNN classifier and GloVe embeddings with our BERT-based Relation Encoder and Classification module.
To convert pseudo labels indicating the clustering assignment into relation labels for evaluation purposes, we follow the setting in previous work (Simon et al., 2019) and assign the majority ground-truth relation label in each cluster to all samples in that cluster as the predicted label. For evaluation metrics, we use B³ precision and recall to measure the correct rate of putting each sentence in its cluster or clustering all samples into a single class. More specifically, B³ F1 is the harmonic mean of precision and recall:
$$B^3\ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
We use V-measure (Rosenberg and Hirschberg, 2007) to calculate homogeneity and completeness, which are analogous to B³ precision and recall but are based on conditional entropy:
$$\mathrm{Homogeneity} = 1 - H(c(X) \mid g(X)) / H(c(X))$$
$$\mathrm{Completeness} = 1 - H(g(X) \mid c(X)) / H(g(X))$$
These two metrics penalize small impurities in a relatively "pure" cluster more harshly than in less pure ones. We also report the F1 value, which is the harmonic mean of homogeneity and completeness.
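The majority-label mapping and the B³ scores can be sketched in a few lines. The B³ computation below uses the common item-averaged formulation (per-item precision and recall, averaged over all items), which is an assumption about the exact variant used; the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def majority_labels(
    clusters: Sequence[int], gold: Sequence[Hashable]
) -> List[Hashable]:
    """Give every sample the most frequent gold label of its cluster."""
    votes = {}
    for c, g in zip(clusters, gold):
        votes.setdefault(c, Counter())[g] += 1
    best = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    return [best[c] for c in clusters]


def b_cubed(
    clusters: Sequence[int], gold: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Item-averaged B-cubed precision, recall, and F1."""
    n = len(gold)
    cluster_size = Counter(clusters)
    gold_size = Counter(gold)
    joint = Counter(zip(clusters, gold))
    pairs = list(zip(clusters, gold))
    # Precision: share of an item's cluster that has the same gold label.
    precision = sum(joint[(c, g)] / cluster_size[c] for c, g in pairs) / n
    # Recall: share of an item's gold class that landed in the same cluster.
    recall = sum(joint[(c, g)] / gold_size[g] for c, g in pairs) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Putting every sample into a single cluster gives perfect recall but poor precision, while one cluster per sample does the opposite, which is why the harmonic mean is reported alongside both.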
Adjusted Rand Index (Hubert and Arabie, 1985) measures the degree of agreement between two data distributions. The range of ARI is [-1, 1]; the larger the value, the more consistent the clustering result is with the ground truth.
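For reference, the Adjusted Rand Index can be computed directly from pair counts. The sketch below follows the standard contingency-table formula of Hubert and Arabie (1985); returning 1.0 for degenerate inputs (fewer than two samples, or two trivial partitions) is a convention chosen here for the example.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    clusters: Sequence[Hashable], gold: Sequence[Hashable]
) -> float:
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(gold)
    if n < 2:
        return 1.0  # convention for degenerate input
    # Number of item pairs that share both a cluster and a gold class.
    sum_joint = sum(comb(v, 2) for v in Counter(zip(clusters, gold)).values())
    sum_clusters = sum(comb(v, 2) for v in Counter(clusters).values())
    sum_gold = sum(comb(v, 2) for v in Counter(gold).values())
    expected = sum_clusters * sum_gold / comb(n, 2)
    max_index = (sum_clusters + sum_gold) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial
    return (sum_joint - expected) / (max_index - expected)
```

Identical partitions score 1, and partitions that agree less often than expected by chance score below 0.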

Implementation Details
Following the settings used in Simon et al. (2019), all models are trained with 10 relation classes. Although it is lower than the number of true relations in the dataset, it still reveals important insights as the distribution of target relations is very unbalanced. Also, this allows us to do a fair comparison with baseline results.
For the Contextualized Relation Encoder, we use the default BERT tokenizer to preprocess the dataset and set the maximum length to 128. We use the pretrained BERT-Base Cased model to initialize the parameters of the Contextualized Relation Encoder and use BertAdam to optimize the loss. For Adaptive Clustering, we use an autoencoder with fully connected layers of dimensions 2h_R-500-500-200 for the encoder and 200-500-500-2h_R for the decoder. We randomly initialize the weights from a Gaussian distribution with zero mean and a standard deviation of 0.01. The autoencoder is pretrained for 20 epochs with a learning rate of 1e−3 and a weight decay of 1e−5 using the Adam optimizer. To get the initial centroids, we apply k-means and set K to 10.
Table 1 (only a fragment survives extraction): T-REx SPO, rel-LDA (Yao et al., 2011): 11.9 10.2 14.1 5.9 4.9 7.4 3.9; the rel-LDA-full (Yao et al., 2011) row is truncated after "18".
For Relation Classification, we use a fully connected layer as $c_\tau$ and set the dropout rate to 10%, the learning rate to 1e−5, and the warm-up scheduling rate to 0.1. We fix the parameters of $f_\theta$ for the first three epochs to allow the classification layer to warm up.
Table 1 shows the experimental results. UIE-PCNN is considered the previous state-of-the-art result. We enhance this baseline by replacing the PCNN and GloVe embeddings with the proposed BERT-based encoder and classifier. The enhanced state-of-the-art model, namely UIE-BERT, achieves the best performance among the baselines. The proposed SelfORE model outperforms all baseline models consistently on B³ F1/Precision, V-measure F1/Homogeneity, and ARI. On average across the three datasets, SelfORE achieves 7.0% higher B³ F1, 3.4% higher V-measure F1, and 7.7% higher ARI compared with the previous state of the art. Unlike baseline methods, which achieve high B³ Recall but low Precision, or high V-measure Completeness but low Homogeneity, our model obtains a more balanced performance while achieving the highest Precision and Homogeneity, although the B³ Recall and V-measure Completeness are less satisfactory. High precision and homogeneity scores can be an appealing property for precision-oriented applications in the real world.

Ablation Study
We conduct an ablation study to show the contribution of the different modules of SelfORE to the overall improved performance. SelfORE w/o Classification is the proposed model without Relation Classification; it only uses the Contextualized Relation Encoder for Adaptive Clustering. SelfORE w/o Adaptive Clustering replaces the proposed soft-assignment clustering method with k-means clustering as a hard-assignment alternative.
A general conclusion from the ablation rows in Table 1 is that all modules contribute positively to the improved performance. More specifically, without self-supervised signals for relational feature learning, SelfORE w/o Classification performs 14.4% worse averaged over all metrics on all datasets. Similarly, Adaptive Clustering gives a 6.2% performance boost on average over all metrics compared with the hard-assignment alternative (SelfORE w/o Adaptive Clustering).

Visualize Contextualized Features
To intuitively show how self-supervised learning helps learn better contextualized relational features on entity pairs for Relation Extraction, we visualize the contextual representation space R 2·h R after dimension reduction using t-SNE (Maaten and Hinton, 2008). We randomly choose 4 relations from NYT+FB dataset and sample 50 entity pairs. The visualization results are shown in Figure 2. Features are colored according to their ground-truth relation labels.
From Figure 2 we can see that the features obtained through the raw BERT model (left) can already give meaningful semantics to entity pairs having different relations. But these features are not tailored for the relation extraction task. When Adaptive Clustering is replaced by k-means, which performs hard assignment on samples (middle), the model gives decent results but does not provide confident cluster assignments. The proposed model (right) uses soft assignment and a self-supervised learning schema to improve relational feature learning: we learn denser clusters and more discriminative features.

Sensitivity analysis: when K is unknown
The Adaptive Clustering gives the SelfORE model enough flexibility to model relational features without knowing any prior information on the number of target relations or the relation distribution. This property is appealing when the number of target relations is not available for Relation Extraction on an open-domain corpus.
The proposed model does require an initial number of clusters $\tilde{K}$ as the scope for pseudo labels. A general guideline for choosing $\tilde{K}$ is to pick a value larger than the actual number of relations in the corpora, as over-specifying the number of clusters should not hurt the model performance. We set an initial $\tilde{K}$ (for example, $\tilde{K} = 1000$) and use an unsupervised method, here k-means, to merge the $\tilde{K}$ cluster centroids into K clusters for evaluation. We vary $\tilde{K}$ from 10 to 1250 and report the B³ F1 score when comparing the predicted relation type (based on the K clusters after merging) with the golden relation type. As shown in Figure 3, the best performance is obtained when $\tilde{K} = 10$, indicating that SelfORE can leverage the number of target relations as useful prior knowledge. Thanks to the self-learning schema and Adaptive Clustering, when we vary $\tilde{K}$ from 10 to 1250, the model achieves a stable F1 score and is not sensitive to the initial choice of $\tilde{K}$ on all three datasets. The results further indicate the applicability of the proposed model to an open-domain corpus when the number of target relations is not available in advance. We can safely assign a larger $\tilde{K}$ than needed and the model is still robust. Note that merging the $\tilde{K}$ clusters into K clusters is mainly for evaluation purposes: when K is not known ahead of time and we simply use a large $\tilde{K}$ directly, the result is $\tilde{K}$ clusters, most of which tend to be smaller, and multiple clusters may correspond to entity pairs having the same relation.
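The merging step used for evaluation can be sketched as k-means over the learned cluster centroids, followed by relabeling every sample through its original cluster. The tiny k-means below, including its deterministic farthest-point initialization and fixed iteration count, is a simplified stand-in chosen for reproducibility, not the implementation used in the experiments.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_assign(
    points: Sequence[Sequence[float]], k: int, num_iters: int = 20
) -> List[int]:
    """Plain Lloyd's k-means; returns the cluster index of each point."""
    # Assumption: farthest-point initialization for a deterministic result.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(num_iters):
        assign = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def merge_clusters(
    fine_labels: Sequence[int],
    fine_centroids: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Map each sample's fine-grained cluster to one of k merged clusters."""
    centroid_to_merged = kmeans_assign(fine_centroids, k)
    return [centroid_to_merged[s] for s in fine_labels]
```

Only the centroids are clustered, so the cost of merging depends on the initial number of clusters rather than on the number of sentences.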

Surface-form Relation Names
We provide a brief case study to show the surface-form relation names we extracted for each cluster (introduced in Section 2.4). We randomly select 5 relations in T-REx SPO and report the surface-form relation names extracted using frequent n-grams in Table 2.
Table 2: Extracted and golden surface-form relation names.
Extracted surface-form | Golden surface-form
are close to | shares border with
the state of | country
capital city | capital
son of | child
member of | member of
The surface-form relation name extraction gives SelfORE an extended ability to not only discriminate between entity pairs having different relations, but also derive surface forms for relation clusters as the final Relation Extraction results. However, evaluating the quality of relation surface forms is out of scope for this work.
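Picking the most frequent n-gram for a cluster can be sketched with a counter. Which token spans are counted is not recoverable from the text here, so the sketch assumes, purely for illustration, that each sample contributes the tokens between its two entities; the function name and the default `n = 2` are likewise hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def most_frequent_ngram(
    token_lists: Sequence[Sequence[str]], n: int = 2
) -> Optional[str]:
    """Return the most frequent n-gram over all token lists of one cluster."""
    counts: Counter = Counter()
    for tokens in token_lists:
        # Assumption: `tokens` holds the words between the two entities.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    if not counts:
        return None  # no span in the cluster is long enough
    return " ".join(counts.most_common(1)[0][0])
```

For a cluster whose samples contain "was born in", "born in", and "is born in the", the bigram "born in" occurs three times and is returned as the surface form.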

Related Works
Relation extraction focuses on identifying the relation between two entities in a given sentence. Traditional closed-domain relation extraction methods are supervised models. They need a set of pre-defined relation labels and require large amounts of annotated triplets, making them less suitable for open-domain corpora. Distant supervision (Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012) is a widely adopted method to alleviate human annotation: if multiple sentences contain two entities that have a certain relation in a knowledge graph, at least one sentence is believed to convey the corresponding relation. However, entities also convey semantic meaning through their contexts; distantly supervised models do not explicitly consider contexts, and the resulting model cannot discover new relations because the supervision is taken purely from knowledge bases.
Unsupervised relation extraction has received considerable attention due to its ability to discover relational knowledge without access to annotations and external resources. Unsupervised models either 1) cluster the relation representations extracted from sentences, or 2) make additional assumptions that provide learning signals for classification models.
Among clustering models, an important milestone is the OpenIE approach (Angeli et al., 2015), which assumes that the surface form of a relation appears between the two entities in the dependency tree. However, these works rely heavily on surface-form relations and generalize less well. To solve this problem, Roy et al. (2019) propose a system that learns to supervise unsupervised OpenIE models, combining the strengths and avoiding the weaknesses of each individual OpenIE system. The relation knowledge transfer system (Wu et al., 2019) learns similarity metrics of relations from labeled data of pre-defined relations, and then transfers the relational knowledge to identify novel relations in unlabeled data. Marcheggiani and Titov (2016) propose a variational autoencoder (VAE) approach: the encoder part extracts relations from labeled features, and the decoder part predicts one entity when given the other entity and the relation, using a triplet scoring function (Nickel et al., 2011). This scoring function can provide a learning signal since it is known to predict relation triplets when given their embeddings. However, training with a KL divergence between the posterior distribution and a uniform prior is unstable. Simon et al. (2019) propose a model that addresses this instability and trains the features on classifiers such as the PCNN model (Zeng et al., 2015).
Inspired by the success of self-supervised learning in computer vision tasks (Wiles et al., 2018; Caron et al., 2018) and by large pretrained language models that show great potential to encode meaningful semantics for various downstream tasks (Devlin et al., 2018; Soares et al., 2019), we propose a self-supervised learning schema for open-domain relation extraction. It has the advantage of unsupervised learning in handling cases where the number of relations is not known in advance, and it also keeps the advantage of supervised learning, which has strong discriminative power for relational feature learning.

Conclusions
In this paper, we propose a self-supervised learning model, SelfORE, for open-domain relation extraction. Unlike conventional distantly supervised models, which require pre-defined knowledge bases or labeled instances for Relation Extraction in a closed-world setting, our model does not require annotation and can work in open-domain scenarios where the number of target relations and the relation distribution are not known in advance. Compared with unsupervised models, our model exploits the advantages of supervised models to bootstrap discriminative power from self-supervised signals and improve contextualized relational feature learning. Experiments on three real-world datasets show the effectiveness and robustness of the proposed model over competitive baselines.