Open Relation Extraction: Relational Knowledge Transfer from Supervised Data to Unsupervised Data

Open relation extraction (OpenRE) aims to extract relational facts from open-domain corpora. To this end, it discovers relation patterns between named entities and then clusters semantically equivalent patterns into a unified relation cluster. Most OpenRE methods confine themselves to unsupervised paradigms, without taking advantage of existing relational facts in knowledge bases (KBs) and their high-quality labeled instances. To address this issue, we propose Relational Siamese Networks (RSNs) to learn similarity metrics of relations from labeled data of pre-defined relations, and then transfer the relational knowledge to identify novel relations in unlabeled data. Experimental results on two real-world datasets show that our framework achieves significant improvements compared with other state-of-the-art methods. Our code is available at https://github.com/thunlp/RSN.


Introduction
Relation extraction (RE) aims to extract relational facts between two entities from plain texts. For example, with the sentence "Hayao Miyazaki is the director of the film 'The Wind Rises'", we can extract a relation "director_of" between two entities "Hayao Miyazaki" and "The Wind Rises".
Recent progress in supervised methods for RE has achieved great success. Supervised methods can effectively learn significant relation semantic patterns from existing labeled data, but constructing such data is time-consuming and labor-intensive. To lower the level of supervision, several semi-supervised approaches have been developed, including bootstrapping, active learning, and label propagation (Pawar et al., 2017). Mintz (2009) also proposes distant supervision to generate training data automatically. It assumes that if two entities have a relation in a KB, all sentences that contain these two entities will express this relation. Still, all these approaches can only extract pre-defined relations that have already appeared either in human-annotated datasets or in KBs. It is hard for them to cover the great variety of novel relational facts in open-domain corpora.
Open relation extraction (OpenRE) aims to extract relational facts from open-domain corpora, where the relation types may not be pre-defined. Some efforts concentrate on extracting triples with new relation types. Banko (2008) directly extracts words or phrases in sentences to represent new relation types. However, some relations cannot be explicitly represented with tokens in sentences, and it is hard to align different relational tokens that have exactly the same meanings. Yao (2011) considers OpenRE as a clustering task for extracting triples with new relation types. However, previous clustering-based OpenRE methods (Yao et al., 2011, 2012; Marcheggiani and Titov, 2016; Elsahar et al., 2017) are mostly unsupervised, and cannot effectively select meaningful relation patterns and discard irrelevant information.
In this paper, we propose to take advantage of high-quality supervised data of pre-defined relations for OpenRE. The approach is non-trivial, however, due to the considerable gap between the pre-defined relations and novel relations of interest in open domain. To bridge the gap, we propose Relational Siamese Networks (RSNs) to learn transferable relational knowledge from supervised data for OpenRE. Specifically, RSNs learn relational similarity metrics from labeled data of pre-defined relations, and then transfer the metrics to measure the similarity of unlabeled sentences for open relation clustering. We describe the flowchart of our framework in Figure 1.
Moreover, we show that RSNs can also be generalized to various weakly-supervised scenarios. We propose Semi-supervised RSN to learn from both supervised data of pre-defined relations and unsupervised data with novel relations, and Distantly-supervised RSN to learn from distantly-supervised data and unsupervised data.
We conduct experiments on the real-world RE datasets FewRel and FewRel-distant, splitting relations into seen and unseen sets, and evaluate our models in supervised, semi-supervised, and distantly-supervised scenarios. The results demonstrate that our models significantly outperform state-of-the-art baseline methods in all scenarios without using external linguistic tools. To summarize, the main contributions of this work are as follows: (1) We develop a novel relational knowledge transfer framework, RSN, for OpenRE, which can effectively transfer existing relational knowledge to novel-relation data and accurately identify novel relations. To the best of our knowledge, RSN is the first model to consider knowledge transfer in the clustering-based OpenRE task.
(2) We further propose Semi-supervised RSNs and Distantly-supervised RSNs that can learn from various weakly supervised scenarios. The experimental results show that all these RSN models achieve significant improvements in F-measure compared with state-of-the-art baselines.

Related Work
Open Relation Extraction. Relation extraction (RE) is an important task in NLP. Traditional RE methods mainly concentrate on classifying relational facts into pre-defined relation types (Mintz et al., 2009; Yu et al., 2017). Zeng (2014) utilizes CNN encoders to build sentence representations with the help of position embeddings. Lin (2016) further improves RE performance on distantly-supervised data via instance-level attention. These methods take advantage of supervised or distantly-supervised data to learn neural sentence encoders for distributed representations, and have achieved promising results. However, these methods cannot handle the open-ended growth of new relation types in open-domain corpora.
To solve this problem, recently many efforts have been invested in exploring methods for open relation extraction (OpenRE), which aims to discover new relation types from unsupervised open-domain corpora. OpenRE methods can be roughly divided into two categories: tagging-based and clustering-based. Tagging-based methods cast OpenRE as a sequence labeling problem, and extract relational phrases consisting of words from sentences in unsupervised (Banko et al., 2007; Banko and Etzioni, 2008) or supervised paradigms (Jia et al., 2018; Cui et al., 2018; Stanovsky et al., 2018). However, tagging-based methods often extract multiple overly-specific relational phrases for the same relation type, and cannot be readily utilized for downstream tasks.
In comparison, conventional clustering-based OpenRE methods extract rich features for relation instances via external linguistic tools, and cluster semantic patterns into several relation types (Lin and Pantel, 2001; Yao et al., 2011, 2012). Marcheggiani (2016) proposes a reconstruction-based model, the discrete-state variational autoencoder, for OpenRE via unlabeled instances. Elsahar (2017) utilizes a clustering algorithm over linguistic features. In this paper, we focus on clustering-based OpenRE methods, which have the advantage of discovering highly distinguishable relation types.
Few-shot Learning. Few-shot learning aims to classify instances with only a handful of labeled samples. Many efforts are devoted to few-shot image classification (Koch et al., 2015) and relation classification (Yuan et al., 2017; Han et al., 2018). Notably, Koch et al. (2015) introduce the Convolutional Siamese Neural Network for image metric learning, which inspires us to learn relational similarity metrics for OpenRE.
Semi-supervised Clustering. Semi-supervised clustering aims to cluster semantic patterns given instance seeds of target categories (Bair, 2013; Hongtao Lin, 2019). Differently, our proposed Semi-supervised RSN only leverages labeled instances of pre-defined relations, and does not need any seed of new relations.

Methodology
Our OpenRE framework mainly consists of two modules, the relation similarity calculation module and the relation clustering module. For relation similarity calculation, we propose Relational Siamese Networks (RSNs), which learn to predict whether two sentences mention the same relation. To utilize large-scale unsupervised data and distantly-supervised data, we further propose Semi-supervised RSN and Distantly-supervised RSN. Finally, in the relation clustering module, with the learned relation metric, we utilize hierarchical agglomerative clustering (HAC) and Louvain clustering algorithms to cluster target relation instances of new relation types.

Relational Siamese Network (RSN)
The architecture of our Relational Siamese Networks is shown in Figure 2. CNN modules encode a pair of relational instances into vectors, and several shared layers compute their similarity.
Sentence Encoder. We use a CNN module as the sentence encoder. The CNN module includes an embedding layer, a convolutional layer, a max-pooling layer, and a fully-connected (FC) layer. The embedding layer transforms the words in a sentence x and the positions of the two entities e_head and e_tail into pre-trained word embeddings and randomly-initialized position embeddings. Following (Zeng et al., 2014), we concatenate these embeddings to form a vector sequence. Next, a one-dimensional convolutional layer and a max-pooling layer transform the vector sequence into features. Finally, an FC layer with sigmoid activation maps the features into a relational vector v. To summarize, denoting the joint information of a sentence x and its two entities e_head and e_tail as a data sample s, we obtain a vector representation for a relational sentence with our CNN module:

v = CNN(s).

With a pair of input relational instances, we have:

v_1 = CNN(s_1),  v_2 = CNN(s_2),

in which the two CNN modules are identical and share all parameters.

Similarity Computation. Next, to measure the similarity of two relational vectors, we calculate their absolute distance and transform it into a real-number similarity p ∈ [0, 1]. First, a distance layer computes the element-wise absolute distance of the two vectors:

d = |v_1 − v_2|.

Then, a classifier layer calculates the similarity metric p. The layer is a one-dimensional-output FC layer with sigmoid activation:

p = σ(k · d + b),

in which σ denotes the sigmoid function, and k and b denote the weights and bias. In this way, we obtain a learned similarity metric p over relational instances.
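To make the similarity head concrete, here is a minimal numpy sketch of the distance and classifier layers. The CNN encoder is replaced by fixed random vectors, and the names `rsn_similarity`, the 64-dimensional size, and the random `k` and `b` are illustrative assumptions for the demo, not the paper's actual parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rsn_similarity(v1, v2, k, b):
    """Similarity head of the RSN: element-wise absolute distance of the
    two relational vectors, followed by a one-dimensional-output FC layer
    with sigmoid activation, p = sigmoid(k . |v1 - v2| + b)."""
    d = np.abs(v1 - v2)          # distance layer
    return sigmoid(k @ d + b)    # classifier layer

# Stand-ins for v = CNN(s): in the real model these come from the shared
# CNN encoder applied to a sentence and its entity pair.
rng = np.random.default_rng(0)
v1, v2 = rng.standard_normal(64), rng.standard_normal(64)
k, b = rng.standard_normal(64), 0.0

p = rsn_similarity(v1, v2, k, b)
```

Note that identical inputs give d = 0 and hence p = sigmoid(b), so with b = 0 every identical pair scores exactly 0.5; the classifier layer must be trained before p becomes a meaningful metric.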
Cross Entropy Loss. The output p of RSN can also be interpreted as the probability that the two sentences mention two different relations. Thus, we can use binary labels q and a binary cross entropy loss to train our RSN:

L_l(θ) = − E[ q log p + (1 − q) log(1 − p) ],

in which θ indicates all the parameters in the RSN, and the expectation is taken over labeled training pairs.
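The loss can be sketched in a few lines; following the interpretation in the text, labels here use q = 1 when a pair mentions two different relations.

```python
import numpy as np

def binary_cross_entropy(p, q):
    """Binary cross entropy between RSN outputs p (predicted probability
    of the pair mentioning different relations) and binary labels q."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)  # numerical safety
    q = np.asarray(q, dtype=float)
    return float(-np.mean(q * np.log(p) + (1 - q) * np.log(1 - p)))
```

An uninformative predictor (p = 0.5 everywhere) incurs a loss of log 2 per pair, while confident correct predictions drive the loss toward zero.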

Semi-supervised RSN
To discover relation clusters in open-domain corpora, it is beneficial not only to learn from labeled data, but also to capture the manifold of unlabeled data in the semantic space. To this end, we need to push the decision boundaries away from high-density areas, which is known as the cluster assumption (Chapelle and Zien, 2005). We try to achieve this goal with several additional loss functions. In the following paragraphs, we denote the labeled training dataset as D_l and a pair of labeled relational instances as d_l. Similarly, we denote the unlabeled training dataset as D_u and a pair of unlabeled instances as d_u.
Conditional Entropy Loss. In classification problems, a well-classified embedding space usually preserves large margins between different classes, and optimizing the margin can facilitate training. In clustering problems, however, type labels are not available during training. To optimize the margin without explicit supervision, we can instead push data points away from the decision boundaries. Intuitively, when the similarity p between two relational instances equals 0.5, there is a high probability that at least one of the two instances lies near a decision boundary between relation clusters. Thus, we use the conditional entropy loss (Grandvalet and Bengio, 2005), which reaches its maximum when p = 0.5, to penalize close-to-boundary distributions of data points:

L_c = − E_{d_u ∈ D_u} [ p log p + (1 − p) log(1 − p) ].

Virtual Adversarial Loss. Despite its theoretical promise, conditional entropy minimization suffers from shortcomings in practice. Due to neural networks' strong fitting ability, a very complex decision hyperplane might be learned so as to keep away from all the training samples, which lacks generalizability. As a solution, we can smooth the relational representation space with a locally-Lipschitz constraint.
To satisfy this constraint, we introduce virtual adversarial training (Miyato et al., 2016) on both branches of RSN. Virtual adversarial training searches through the neighborhood of each data point and penalizes sharp changes in the distance prediction. For labeled data, we have:

L_v^l = E_{d_l ∈ D_l} [ D_KL( p_θ(d_l) ∥ p_θ(d_l, t_1, t_2) ) ],

in which D_KL indicates the Kullback-Leibler divergence, and p_θ(d_l, t_1, t_2) indicates a new distance estimation with perturbations t_1 and t_2 on the two input instances respectively. Specifically, t_1 and t_2 are worst-case perturbations of limited length that maximize the KL divergence between p_θ(d_l) and p_θ(d_l, t_1, t_2). Empirically, we approximate the perturbations in the same way as the original paper (Miyato et al., 2016): we first add a random noise to the input, calculate the gradient of the KL divergence between the outputs of the original and noisy inputs, and then add the normalized gradient to the original input to obtain the perturbed input. For unlabeled data, we similarly have:

L_v^u = E_{d_u ∈ D_u} [ D_KL( p_θ(d_u) ∥ p_θ(d_u, t_1, t_2) ) ].

Note that the perturbations t_1 and t_2 are added to word embeddings rather than to the words themselves.
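The perturbation approximation described above can be sketched as follows. This is an illustrative one-step version on a toy logistic model, with central finite differences standing in for backpropagation; the model weights, `epsilon`, and `xi` are assumptions for the demo, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_bernoulli(p, q):
    """KL divergence between two Bernoulli distributions."""
    eps = 1e-12
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

w = np.array([1.5, -2.0, 0.5])  # toy model weights (illustrative)

def model(x):
    return sigmoid(w @ x)  # toy stand-in for the network's output p

def virtual_adversarial_perturbation(x, epsilon=0.1, xi=1e-2, seed=0):
    """One-step approximation of the worst-case perturbation t: add a
    small random noise d, estimate the gradient of KL(p(x) || p(x + d))
    with respect to d (here by central finite differences), then rescale
    the normalized gradient to length epsilon."""
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(x.shape)
    d = xi * d / np.linalg.norm(d)
    p0 = model(x)
    grad = np.zeros_like(x)
    h = 1e-5
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (kl_bernoulli(p0, model(x + d + e))
                   - kl_bernoulli(p0, model(x + d - e))) / (2 * h)
    return epsilon * grad / (np.linalg.norm(grad) + 1e-12)
```

The virtual adversarial loss then compares the model's outputs at x and at x + t; in RSN, the perturbation is applied to the word embeddings of both input sentences.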
To summarize, we use the following loss function to train Semi-supervised RSN, which learns from both labeled and unlabeled data:

L_semi = L_l + λ_v (L_v^l + L_v^u) + λ_u L_c,

in which λ_v and λ_u are two hyperparameters, L_l is the supervised cross entropy loss, L_v^l and L_v^u are the virtual adversarial losses on labeled and unlabeled data, and L_c is the conditional entropy loss.
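The conditional entropy term can be checked numerically: it peaks at p = 0.5 and vanishes for confident predictions, which is exactly why minimizing it pushes pairs away from the p = 0.5 decision boundary. A small sketch:

```python
import numpy as np

def conditional_entropy(p):
    """Entropy of RSN's Bernoulli output, H(p) = -(p log p + (1-p) log(1-p)).
    Averaged over unlabeled pairs, this is the term penalized during
    semi-supervised training."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

ps = np.linspace(0.01, 0.99, 99)
h = conditional_entropy(ps)
```

H reaches its maximum value log 2 at p = 0.5 and approaches zero as p nears 0 or 1.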

Distantly-supervised RSN
To alleviate the intensive human labor of annotation, distantly-supervised learning has attracted much attention in RE. Here, we propose Distantly-supervised RSN, which can learn from both distantly-supervised data and unsupervised data for relational knowledge transfer. Specifically, we use the following loss function:

L_dist = L_l + λ_v L_v^u + λ_u L_c,

where L_l is the cross entropy loss computed on the auto-labeled data, L_v^u is the virtual adversarial loss on the unlabeled data, and L_c is the conditional entropy loss. This treats the auto-labeled data as labeled data but removes the virtual adversarial loss on the auto-labeled data. The reason is simple: virtual adversarial training on auto-labeled data can amplify the noise from false labels. Indeed, we find in experiments that the virtual adversarial loss on auto-labeled data harms our model's performance.
We do not use more denoising methods, since we think RSN has some inherent advantages of tolerating such noise. Firstly, the noise will be overwhelmed by the large proportion of negative sampling during training. Secondly, during clustering, the prediction of a new relation cluster is based on areas where the density of relational instances is high. Outliers from noise, as a result, will not influence the prediction process so much.

Open Relation Clustering
After RSN is learned, we can use RSN to calculate the similarity matrix of testing instances. With this matrix, several clustering methods can be applied to extract new relation clusters.
Hierarchical Agglomerative Clustering. The first clustering method we adopt is hierarchical agglomerative clustering (HAC). HAC is a bottom-up clustering algorithm: at the start, every testing instance is regarded as its own cluster, and at every step, the two closest clusters are merged. There are several criteria for evaluating the distance between two clusters. Here, we adopt the complete-linkage criterion, which is more robust to extreme instances.
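For illustration, here is a compact complete-linkage HAC over a precomputed pairwise dissimilarity matrix (for RSN, the predicted pair dissimilarities). This is an O(n^3) teaching sketch, not the optimized implementation used in the experiments.

```python
import numpy as np

def complete_linkage_hac(dist, num_clusters):
    """Bottom-up agglomerative clustering with the complete-linkage
    criterion on a precomputed pairwise distance matrix: the distance
    between two clusters is the maximum pairwise distance across them,
    and the two closest clusters are merged at each step."""
    n = dist.shape[0]
    clusters = [[i] for i in range(n)]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: worst-case (maximum) pairwise distance.
                d = max(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Note the `num_clusters` argument: this is precisely the shortcoming discussed next, since the true number of relation clusters is unknown in the open setting.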
However, there is a significant shortcoming of HAC: it needs the exact number of clusters in advance. A potential solution is to stop agglomerating according to an empirical distance threshold, but it is hard to determine such a threshold. This problem leads us to consider another clustering algorithm Louvain (Blondel et al., 2008).
Louvain. Louvain is a graph-based clustering algorithm traditionally used for detecting communities. To construct the graph, we binarize RSN's output, drawing an edge between two nodes when the binarized output is 0, i.e., when the two instances are predicted to mention the same relation. The advantage of Louvain is that it does not need the number of potential clusters beforehand: it automatically finds proper cluster sizes by optimizing community modularity. In our experiments, Louvain performs better than HAC.
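A sketch of the graph construction and of the modularity score that Louvain greedily optimizes; the 0.5 threshold reflects the binary approximation described above (an edge when the pair is predicted to mention the same relation), and the brute-force modularity loop is for illustration only.

```python
import numpy as np

def build_graph(p_matrix, threshold=0.5):
    """Binarize RSN's pairwise outputs into an adjacency matrix: since p
    estimates the probability that two instances mention *different*
    relations, an edge is drawn when p falls below the threshold."""
    adj = (p_matrix < threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def modularity(adj, labels):
    """Newman modularity of a partition -- the quantity Louvain optimizes
    when growing communities. Brute-force O(n^2) version."""
    m = adj.sum() / 2.0            # total number of edges
    deg = adj.sum(axis=1)          # node degrees
    q = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i, j] - deg[i] * deg[j] / (2.0 * m)
    return q / (2.0 * m)
```

A partition that matches the graph's dense blocks scores a higher modularity than one that splits them, which is what drives Louvain toward coherent relation clusters.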
After running, Louvain might produce a number of small clusters with very few instances. It is not proper to regard these clusters as new relation types, so we label their instances the same as their closest labeled neighbors.
Finally, we explain why we do not use some other common clustering methods such as K-Means, Mean-Shift, and Ward's method of HAC (Ward Jr, 1963): these methods compute the centroid of several points during clustering by simply averaging them. However, the relation vectors in our model are high-dimensional, and the distance metric learned by RSN is non-linear. Consequently, it is not appropriate to compute a centroid by simply averaging the vectors.

Experiments
In this section, we conduct several experiments on real-world RE datasets to show the effectiveness of our models, and give detailed analyses of their advantages.

Dataset
In experiments, we use FewRel (Han et al., 2018) as our first dataset. FewRel is a human-annotated dataset containing 80 relation types, each with 700 instances. An advantage of FewRel is that every instance contains a unique entity pair, so RE models cannot take the shortcut of simply memorizing entity pairs.
We use the original train set of FewRel, which contains 64 relations, as the labeled set with pre-defined relations, and the original validation set of FewRel, which contains 16 new relations, as the unlabeled set with novel relations to extract. We then randomly choose 1,600 instances from the unlabeled set as the test set, with the remaining labeled and unlabeled instances forming the train set.
The second dataset we use is FewRel-distant, which contains the distantly-supervised data obtained by the authors of FewRel before human annotation. We follow the split of FewRel to obtain the auto-labeled train set and the unlabeled train set. For evaluation, we use the human-annotated test set of FewRel with 1,600 instances. Unlabeled instances already existing in this test set are removed from the unlabeled train set of FewRel-distant. Finally, the auto-labeled train set contains 323,549 relational instances, and the unlabeled train set contains 60,581 instances.
A previous OpenRE work reports performance on an unpublished dataset called NYT-FB (Marcheggiani and Titov, 2016). However, it has several shortcomings compared with FewRel-distant. First, NYT-FB's test set is distantly-supervised and therefore noisy for instance-level RE. Moreover, instances in NYT-FB often share entity pairs or relational phrases, which makes relation clustering much easier. Therefore, we believe the results on FewRel-distant are convincing enough for distantly-supervised OpenRE.

Implementation Details
Data Sampling. The input of RSN should be a pair of sampled instances. For the unlabeled set, the only possible sampling method is to select two instances randomly. For the labeled set, however, random selection would result in too many different-relation pairs, and cause severe biases for RSN. To solve this problem, we use downsampling. In our experiments, we fix the percentage of same-relation pairs in every labeled data batch as 6%.
Let us denote this percentage number as the sample ratio for convenience. Experimental results show that the sample ratio decides RSN's tendency to predict larger or smaller clusters. In other words, it controls the granularity of the predicted relation types. This phenomenon suggests a potential application of our model in hierarchical relation extraction. However, we leave any serious discussion to future work.
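The down-sampling scheme above can be sketched as follows. The function name, batch structure, and same/different flag convention are assumptions for this illustration; only the fixed 6% same-relation ratio comes from the text.

```python
import random
from collections import defaultdict

def sample_labeled_batch(instances, labels, batch_size, same_ratio=0.06, seed=0):
    """Sample a batch of instance pairs from the labeled set with a fixed
    fraction of same-relation pairs (the "sample ratio"), instead of
    pairing instances uniformly at random. The third element of each
    tuple flags same-relation pairs (1) vs. different-relation pairs (0)."""
    rng = random.Random(seed)
    by_rel = defaultdict(list)
    for inst, rel in zip(instances, labels):
        by_rel[rel].append(inst)
    n_same = max(1, int(round(batch_size * same_ratio)))
    batch = []
    # Same-relation pairs: two distinct instances of one relation.
    rels_with_pairs = [r for r, xs in by_rel.items() if len(xs) >= 2]
    for _ in range(n_same):
        r = rng.choice(rels_with_pairs)
        a, b = rng.sample(by_rel[r], 2)
        batch.append((a, b, 1))
    # Different-relation pairs: one instance each from two relations.
    rels = list(by_rel)
    for _ in range(batch_size - n_same):
        r1, r2 = rng.sample(rels, 2)
        batch.append((rng.choice(by_rel[r1]), rng.choice(by_rel[r2]), 0))
    return batch
```

Raising `same_ratio` biases RSN toward predicting larger clusters, which is the granularity effect discussed above.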
Hyperparameter Settings. Following (Lin et al., 2016) and (Zeng et al., 2014), we fix the less influential hyperparameters for sentence encoding at their reported optimal values. For word embeddings, we use pre-trained 50-dimensional GloVe (Pennington et al., 2014) word embeddings. For position embeddings, we use randomly-initialized 5-dimensional position embeddings. During training, all the embeddings are trainable. For the neural network, the number of feature maps in the convolutional layer is 230 and the filter length is 3. The activation function after the max-pooling layer is ReLU, and the activation functions after FC layers are sigmoid. Besides, we adopt two regularization methods in the CNN module. We put a dropout layer right after the embedding layer, as in (Miyato et al., 2016), with a dropout rate of 0.2. We also impose L2 regularization on the convolutional layer and the FC layer, with parameters of 0.0002 and 0.001 respectively. Hyperparameters for virtual adversarial training are the same as proposed in (Miyato et al., 2016).
Meanwhile, the major hyperparameters are selected with grid search according to the model's performance on a validation set. Specifically, the validation set contains 10,000 randomly chosen sentence pairs from the unlabeled set (i.e., the 16 novel relations) and does not overlap with the test set. The model is evaluated by the precision of binary classification of sentence pairs on the validation set, which serves as an estimate of the model's clustering ability. We do not use F1 during model validation because the clustering steps are time-consuming.
For optimization, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001, selected by grid search. For the baseline models, the original papers perform grid search over all possible hyperparameters and report the best result during testing; we follow their settings and perform grid search directly on the test set.

Experiment Results on OpenRE
In this section, we demonstrate the effectiveness of our RSN models by comparing them with state-of-the-art clustering-based OpenRE methods. We also conduct ablation experiments to investigate in detail the contributions of the different mechanisms of Semi-supervised RSN and Distantly-supervised RSN.
Baselines. Conventional clustering-based OpenRE models usually cluster instances by either clustering their linguistic features (Lin and Pantel, 2001; Yao et al., 2012; Elsahar et al., 2017) or imposing reconstruction constraints (Yao et al., 2011; Marcheggiani and Titov, 2016). To demonstrate the effectiveness of our RSN models, we compare them with two state-of-the-art models: (1) HAC with re-weighted word embeddings (RW-HAC) (Elsahar et al., 2017): RW-HAC is the state-of-the-art feature-clustering model for OpenRE. The model first extracts KB types and NER tags of entities as well as re-weighted word embeddings from sentences, then adopts principal component analysis (PCA) to reduce feature dimensionality, and finally uses HAC to cluster the concatenation of the reduced feature representations.
(2) Discrete-state variational autoencoder (VAE) (Marcheggiani and Titov, 2016): VAE is the state-of-the-art reconstruction-based model for OpenRE on unlabeled instances. It optimizes a relation classifier by reconstructing one entity from the paired entity and the predicted relation type. Rich features, including entity words, context words, trigger words, dependency paths, and context POS tags, are used to predict the relation type.
RW-HAC and VAE both rely on external linguistic tools to extract rich features from plain texts. Specifically, we first align entities to Wikidata to get their KB types. Next, we preprocess the instances with part-of-speech (POS) tagging, named-entity recognition (NER), and dependency parsing using Stanford CoreNLP. It is worth noting that these features are only used by the baseline models; our models, in contrast, only use sentences and entity pairs as inputs.
Evaluation Protocol. In evaluation, we use the B^3 metric (Bagga and Baldwin, 1998) as the scoring function. The B^3 metric is a standard measure for balancing the precision and recall of clustering tasks, and is commonly used in previous OpenRE works (Marcheggiani and Titov, 2016; Elsahar et al., 2017). To be specific, we report the F1 measure, the harmonic mean of precision and recall.
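For reference, B^3 precision, recall, and F1 can be computed directly from predicted and gold cluster labels; a straightforward sketch:

```python
def b_cubed(pred, gold):
    """B^3 precision, recall and F1 for a clustering, given predicted and
    gold cluster labels per instance (Bagga and Baldwin, 1998). For each
    instance, precision is the fraction of its predicted cluster sharing
    its gold label; recall is the fraction of its gold cluster sharing
    its predicted-cluster correctness; both are averaged over instances."""
    n = len(pred)
    precision = recall = 0.0
    for i in range(n):
        pred_cluster = [j for j in range(n) if pred[j] == pred[i]]
        gold_cluster = [j for j in range(n) if gold[j] == gold[i]]
        correct = len([j for j in pred_cluster if gold[j] == gold[i]])
        precision += correct / len(pred_cluster)
        recall += correct / len(gold_cluster)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Collapsing everything into one cluster yields perfect recall but poor precision, which is why the harmonic mean is reported.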
First, we report the results of supervised RSN with different clustering methods. Specifically, SN denotes the original RSN structure, while HAC and L denote the HAC and Louvain clustering introduced in Sec. 3.3. The results show that Louvain performs better than HAC, so the following experiments focus on Louvain clustering.
Next, for Semi-supervised and Distantly-supervised RSN, we conduct various combinations of the different mechanisms to verify the contribution of each part. (+C) indicates that the model is powered up with conditional entropy minimization, while (+V) indicates that the model is powered up with virtual adversarial training.
Table 1: Precision, recall and F1 results (%) for different models. The first two models are baselines. The next five models are different variants of our model.
Experimental Result Analysis. Table 1 shows the experimental results, from which we can observe that: (1) RSN models outperform all baseline models on precision, recall, and F1, among which Weakly-supervised RSN (SN-L+CV) achieves state-of-the-art performance. This indicates that RSN is capable of understanding the semantic meanings of new relations within sentences.
(2) Supervised and distantly-supervised relational representations improve clustering performance. Compared with RW-HAC, SN-HAC achieves better clustering results because of its supervised relational representation and similarity metric. Specifically, unsupervised baselines mainly use sparse one-hot features. RW-HAC uses word embeddings, but integrates them in a rule-based way. In contrast, RSN uses distributed feature representations, and can optimize the information integration process according to supervision.
(3) Comparing SN-HAC with SN-L, Louvain outperforms HAC for clustering with RSN. One explanation is that our model does not put additional constraints on the prior distribution of relational vectors, so the relation clusters might have odd shapes that violate HAC's assumptions. Moreover, when representations are not distinguishable enough, forcing HAC to find fine-grained clusters may harm recall while contributing minimally to precision. In practice, we do observe that the number of relations SN-L extracts is consistently smaller than the true number, 16.
(4) Both SN-L+V and SN-L+C improve the performance of supervised or distantly-supervised RSN by further utilizing unsupervised corpora. Both semi-supervised approaches bring significant improvements in F1 by increasing both precision and recall, and combining them increases F1 further.
(5) One interesting observation is that SN-L+V does not outperform SN-L by much on FewRel-distant. This is probably because VAT on the noisy data might amplify the noise. In further experiments, we perform VAT only on the unlabeled set and observe improvements in F1, with SN-L+V rising from 45.8% to 49.2% and SN-L+CV from 52.0% to 52.6%, which supports this conjecture.

The Influence of Pre-defined Relation Diversity on Generalizability
In this subsection, we analyze the influence of pre-defined relation diversity, i.e., the number of relations in the labeled train set. To study this influence, we use FewRel for evaluation and vary the number of relations in the labeled train set from 40 to 64 while fixing the total number of labeled instances at 25,000, and report the clustering results in Figure 5. Several conclusions can be drawn from Figure 5. First, a rich variety of labeled relations does improve the performance of our models, especially RSN: the models trained on 64 relations consistently perform better than those trained on 40 relations. Second, while the performance of supervised RSN is very sensitive to pre-defined relation diversity, its semi-supervised counterparts suffer much less from the limited number of relations. This suggests that Semi-supervised RSNs succeed in learning from unlabeled novel-relation data and generalize better to novel relations.

Relational Knowledge Representation Visualization
To intuitively evaluate the knowledge transfer effects of RSN and Semi-supervised RSN, we visualize their relational knowledge representation spaces in the last layer of the CNN encoders with t-SNE (Maaten and Hinton, 2008) in Figure 4. We also compare with a supervised CNN trained on 9,600 labeled instances of the novel relations, which indicates a near-optimal relational knowledge representation. In each figure, we plot 402 relation instances of 4 randomly-chosen relation types in the test set, and points are colored according to their ground-truth labels.
As we can see from Figure 4, RSN is able to roughly distinguish different relations, and Semi-supervised RSN further facilitates knowledge transfer by optimizing the margin between potential relation clusters during training. As a result, Semi-supervised RSN extracts more distinguishable novel relations, and achieves relational knowledge representation ability comparable to that of the supervised CNN.

Conclusions and Future Work
In this paper, we propose a new model, the Relational Siamese Network (RSN), for OpenRE. Different from conventional unsupervised models, our model learns to measure relational similarity from supervised or distantly-supervised data of pre-defined relations, as well as unsupervised data of novel relations. Our model contains two main innovations. First, we propose to transfer relational similarity knowledge with the RSN structure; to the best of our knowledge, we are the first to propose knowledge transfer for OpenRE. Second, we propose Semi-/Distantly-supervised RSN to further perform semi-supervised and distantly-supervised transfer learning. Experiments show that our models significantly surpass conventional OpenRE models and achieve new state-of-the-art performance.
For future research, we plan to explore the following directions: (1) Besides CNN, there are some other popular sentence encoder structures like piecewise convolutional neural network (PCNN) and Long Short-Term Memory (LSTM) for RE. In the future, we can try different sentence encoders in our model. (2) As mentioned above, our model has the potential ability to discover the hierarchical structure of relations. In the future, we will try to explore this application with additional experiments.