Revisiting Unsupervised Relation Extraction

Unsupervised relation extraction (URE) extracts relations between named entities from raw text without manually labelled data or existing knowledge bases (KBs). URE methods can be categorised into generative and discriminative approaches, which rely on either hand-crafted features or surface form. However, we demonstrate that by using only named entities to induce relation types, we can outperform existing methods on two popular datasets. We compare our approach with other URE techniques to ascertain which features are important in URE. We conclude that entity types provide a strong inductive bias for URE.


Introduction
Relation extraction (RE) extracts semantic relations between entities from plain text. For instance, "[Jon Robin Baitz]_head, born in [Los Angeles]_tail ..." expresses the relation /people/person/place_of_birth between the head and tail entities. Extracted relations are then used in downstream tasks such as information retrieval (Corcoglioniti et al., 2016) and knowledge base construction (Al-Zaidy and Giles, 2018). RE has been widely studied using fully supervised learning (Nguyen and Grishman, 2015; Miwa and Bansal, 2016; Zhang et al., 2017, 2018) and distantly supervised approaches (Mintz et al., 2009; Riedel et al., 2010; Lin et al., 2016).
Unsupervised relation extraction (URE) methods have not been explored as much as fully or distantly supervised learning techniques. URE is promising, since it requires neither manually annotated data nor human-curated knowledge bases (KBs), which are expensive to produce. Therefore, it can be applied to domains and languages where annotated data and KBs are not available. Moreover, URE can discover new relation types, since it is not restricted to specific relation types in the same way as fully and distantly supervised methods. One might argue that Open Information Extraction (OpenIE) can also discover new relations. However, OpenIE identifies relations based on textual surface information. Thus, similar relations with different textual forms may not be recognised. Unlike OpenIE, URE groups similar relations into clusters. Despite these advantages, only a few attempts have been made to tackle URE using machine learning (ML) (Hasegawa et al., 2004; Banko et al., 2007; Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019).
Similarly to other unsupervised learning tasks, a challenge in URE is how to evaluate results. Recent approaches (Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019) employ a data generation setting widely used in distantly supervised RE, i.e., aligning a large amount of raw text against triplets in a curated KB. A standard metric score is computed by comparing the output relation clusters against the automatically annotated relations. In particular, the NYT-FB dataset (Marcheggiani and Titov, 2016), which is used for evaluation, was created by mapping relation triplets in Freebase (Bollacker et al., 2008) against plain-text articles in the New York Times (NYT) corpus (Sandhaus, 2008). Standard clustering evaluation metrics for URE include B³ (Bagga and Baldwin, 1998), V-measure (Rosenberg and Hirschberg, 2007), and ARI (Hubert and Arabie, 1985).
Although the above-mentioned experimental setting can be created automatically, there are three challenges to overcome. Firstly, the development and test sets are silver, i.e., they include noisy labelled instances, since they are not human-curated. Secondly, the development and test sentences are part of the training set, i.e., a transductive setting. It is thus unclear how well the existing models perform on unseen sentences. Finally, NYT-FB can be considered highly imbalanced, since only 2.1% of the training sentences can be aligned with Freebase's triplets. Due to the noisy nature of silver data (NYT-FB), evaluation on silver data will not accurately reflect the system performance. We also need unseen data during testing to examine how well the system generalises. To overcome these challenges, we employ the test set of TACRED (Zhang et al., 2017), a widely used manually annotated corpus. Regarding the imbalanced data, we demonstrate that in fact around 60% (instead of 2.1%) of instances in the training set express relation types defined in Freebase.
In this work, we present a simple URE approach relying only on entity types that can obtain improved performance compared to current methods. Specifically, given a sentence consisting of two entities and their corresponding entity types, e.g., PERSON and LOCATION, we induce relations as the combination of entity types, e.g., PERSON-LOCATION. It should be noted that we employ only entity types because their combinations form reasonably coarse relation types (e.g., PERSON-LOCATION covers /people/person/place_of_birth defined in Freebase). We further discuss our improved performance in §3.
Our contributions are as follows: (i) We perform experiments on both automatically and manually labelled datasets, namely NYT-FB and TACRED, respectively. We show that two methods using only entity types can outperform the state-of-the-art models, including both feature-engineering and deep learning approaches. These surprising results raise questions about the current state of unsupervised relation extraction. (ii) For model design, we show that a link predictor provides a good signal to train a URE model (Fig 1). We also illustrate that entity types are a strong inductive bias for URE (Table 1).

Methods for URE
The goal of URE is to predict the relation $r$ between two entities $e_{head}$ and $e_{tail}$ in a sentence $s$. We describe three recent ML-based methods tackling URE, followed by our own methods. We divide the ML-based methods into two main approaches: generative and discriminative.

Generative Approach
Yao et al. (2011) extended a topic model, Latent Dirichlet Allocation (LDA) (Blei et al., 2003), to RE, developing two models, herewith RelLDA and RelLDA1. In both models, a sentence together with an entity pair plays the role of a document in topic modelling, while a relation type corresponds to a topic. RelLDA uses three features, i.e., the shortest dependency path between the two entities and the two entity mentions. RelLDA1 extends RelLDA with five more features, i.e., the entity types, words and part-of-speech tags between the two entities.

Discriminative Approaches
Marcheggiani and Titov (2016) proposed a discrete-state variational autoencoder (VAE) to tackle URE (herewith March). Their model consists of two components: a relation classifier and a link predictor. The relation classifier, which is discriminative, takes entity types and several linguistic features (e.g., dependencies) as input to predict the relation $r$. The link predictor then uses the (soft) predicted relation $r$ to predict the missing entity $e_i$ in a specific position $i \in \{head, tail\}$, given the other entity $e_{-i}$, where $-i = tail$ if $i = head$ and vice versa. In other words, entity prediction, performed in a self-supervised manner, provides the training signal for the relation classifier. However, when only entity prediction is used, only a few relation types are chosen. They thus used the entropy over all relations as a regulariser. Maximising this entropy regulariser encourages a more uniform relation distribution and allows more relations to be predicted.
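The interplay between entity prediction and the entropy regulariser described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `march_style_objective`, the weight `alpha`, and the toy log-likelihood values are all assumptions made for illustration, and the link predictor itself is replaced by fixed numbers.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def march_style_objective(q_rel, loglik_per_rel, alpha=0.1):
    """Expected entity-prediction log-likelihood plus an entropy bonus.

    q_rel:          soft relation distribution from the relation classifier
    loglik_per_rel: hypothetical values of log p(e_i | e_-i, r), one per relation r
    alpha:          hypothetical weight of the entropy regulariser
    """
    expected_ll = sum(q * ll for q, ll in zip(q_rel, loglik_per_rel))
    return expected_ll + alpha * entropy(q_rel)

# Toy values: the link predictor slightly prefers relation 0.
logliks = [-2.0, -2.1, -2.2]
peaked = [1.0, 0.0, 0.0]         # classifier always picks one relation
spread = [1 / 3, 1 / 3, 1 / 3]   # classifier spreads mass over all relations

without_reg = (march_style_objective(peaked, logliks, alpha=0.0),
               march_style_objective(spread, logliks, alpha=0.0))
with_reg = (march_style_objective(peaked, logliks, alpha=0.5),
            march_style_objective(spread, logliks, alpha=0.5))
```

In this toy setting the peaked distribution scores higher when `alpha` is zero, whereas the spread distribution scores higher once the entropy term is weighted in, which mirrors the collapse onto a few relation types that the regulariser is meant to counteract.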
Another discriminative method is that of Simon et al. (2019) (herewith Simon), which differs from March in two ways: a) its relation classifier employs a piece-wise convolutional neural network (PCNN) that uses only surface form, without requiring hand-crafted features; b) it replaces the entropy regulariser with two regularisers: $L_s$ (skewness), which encourages the relation classifier to be confident in its predictions, and $L_d$ (dispersion), which ensures that several relation types are predicted over a minibatch. Note that $L_s$ is equivalent to the negation of the entropy used in March.
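To make the roles of the two regularisers concrete, the sketch below computes one plausible form of each over a minibatch of predicted relation distributions. The exact definitions are an assumption here: skewness is taken as the mean per-instance entropy and dispersion as the KL divergence between the batch-averaged prediction and the uniform distribution. The function names are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def skewness_loss(batch):
    """Mean per-instance entropy: low when every prediction is confident."""
    return sum(entropy(dist) for dist in batch) / len(batch)

def dispersion_loss(batch):
    """KL(batch mean || uniform): low when the batch uses many relation types."""
    k = len(batch[0])
    mean = [sum(dist[j] for dist in batch) / len(batch) for j in range(k)]
    return sum(p * math.log(p * k) for p in mean if p > 0.0)

confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # every instance gets the same relation
undecided = [[0.5, 0.5], [0.5, 0.5]]   # no instance is predicted confidently
```

Under these assumed definitions, only the first batch minimises both terms: the collapsed batch is penalised by the dispersion term and the undecided batch by the skewness term.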

Our Methods
We introduce two entity-based methods, herewith EType and EType+. Our motivation is that entity types are helpful for RE, as noted by Zhang et al. (2017); Marcheggiani and Titov (2016) also used entity types. We therefore propose EType, which induces coarse relation clusters from the entity types. In particular, given two entity types $t_{e_{head}}$ and $t_{e_{tail}}$ as input, EType outputs their concatenation $t_{e_{head}}$-$t_{e_{tail}}$ as the relation.
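Because EType is a deterministic mapping from an entity type pair to a cluster label, it can be sketched in a few lines. The function names and the toy instances below are illustrative assumptions, not taken from the authors' code.

```python
def etype_relation(head_type, tail_type):
    """Induce a coarse relation label by concatenating the two entity types."""
    return f"{head_type}-{tail_type}"

def cluster_by_etype(instances):
    """Group instance indices by their induced relation label.

    instances: list of (head_type, tail_type) pairs, one per sentence.
    """
    clusters = {}
    for idx, (head_type, tail_type) in enumerate(instances):
        clusters.setdefault(etype_relation(head_type, tail_type), []).append(idx)
    return clusters

# Hypothetical toy input: three sentences with typed entity pairs.
toy_instances = [("PERSON", "LOCATION"),
                 ("PERSON", "ORGANIZATION"),
                 ("PERSON", "LOCATION")]
toy_clusters = cluster_by_etype(toy_instances)
```

Note that the label is order-sensitive, so PERSON-LOCATION and LOCATION-PERSON form different clusters.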
One problem with EType is that the number of relation types is determined by the number of entity types. For instance, 4 entity types lead to $4^2 = 16$ relation types. To extract an arbitrary number of relation types, we build a relation classifier consisting of a one-layer feed-forward network that takes the entity type combination as input, where $t_{e_{head}}$-$t_{e_{tail}}$ is encoded as the one-hot vector of the entity type pair. We then combine this classifier with the link predictor used in March and the two regularisers used in Simon to produce a new method, herewith EType+.
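A minimal sketch of such a classifier is given below, written in plain Python so that it is self-contained. Several details are assumptions rather than the authors' specification: the entity type inventory, the softmax output layer, the random initialisation, and the names `one_hot_pair` and `make_classifier`. Training with the link predictor and the regularisers is omitted.

```python
import math
import random

# Hypothetical inventory of 4 coarse entity types (16 ordered pairs).
ENTITY_TYPES = ["PERSON", "LOCATION", "ORGANIZATION", "MISC"]

def one_hot_pair(head_type, tail_type, types=ENTITY_TYPES):
    """One-hot vector indexing the ordered (head, tail) entity type pair."""
    vec = [0.0] * (len(types) ** 2)
    vec[types.index(head_type) * len(types) + types.index(tail_type)] = 1.0
    return vec

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_classifier(num_relations, num_inputs, seed=0):
    """One linear layer followed by a softmax over relation types (untrained)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(num_inputs)]
               for _ in range(num_relations)]
    biases = [0.0] * num_relations

    def classify(x):
        scores = [sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(weights, biases)]
        return softmax(scores)

    return classify

classify = make_classifier(num_relations=10, num_inputs=len(ENTITY_TYPES) ** 2)
probs = classify(one_hot_pair("PERSON", "LOCATION"))
```

Since the input is one-hot, the layer amounts to a lookup table holding one relation distribution per entity type pair, which is why the number of output relations can be chosen independently of the number of entity types.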

Experiments and Results
Evaluation metrics We use the following evaluation metrics for our analysis: a) B³ (Bagga and Baldwin, 1998), used in previous work, which is the harmonic mean of precision and recall for the clustering task; b) V-measure (Rosenberg and Hirschberg, 2007); and c) ARI (Hubert and Arabie, 1985), used in Simon et al. (2019). V-measure is analysed in terms of homogeneity and completeness, while ARI measures the similarity between two clusterings. We note that V-measure is sensitive to the dependency between the number of clusters and the number of instances. A relatively small number of clusters compared to the number of instances should be used to keep V-measure scores comparable. More precisely, we evaluated the V-measure of the trivially homogeneous clustering, in which there are only singleton clusters (i.e., each instance is its own cluster). The V-measure of this trivial clustering on NYT-FB reached 43.77%, which is higher than that of all the implemented methods in Table 1. Meanwhile, neither B³ nor ARI encounters this problem.
Datasets We employed NYT-FB for training and evaluation following previous work (Yao et al., 2011; Marcheggiani and Titov, 2016; Simon et al., 2019). Because only 2.1% of the sentences in NYT-FB were aligned against Freebase's triplets, we were concerned whether this dataset contains enough sentences for a model to learn relation types from Freebase. We thus examined 100 randomly chosen instances from the 1.86 million non-aligned sentences. We found that 61% of them (or 60% of the whole dataset) express relation types in Freebase. This suggests that the NYT-FB dataset can be employed to train a relation extractor. However, there are two further issues when evaluating URE methods on NYT-FB. Firstly, the development and test sets consist entirely of aligned sentences without human curation, which means that they include wrongly or noisily labelled instances. In particular, we found that 35 out of 100 randomly chosen sentences were given incorrect relations. Secondly, the development and test sets are part of the training set. This setting is clearly not inductive, as it does not evaluate how a model performs on unseen sentences. Therefore, we additionally evaluate all methods (except topic modelling) on the test set of TACRED (Zhang et al., 2017), a widely used manually annotated corpus for supervised RE.
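The sensitivity of V-measure to singleton clusters, noted under the evaluation metrics above, can be reproduced on toy data. The sketch below implements B³ and V-measure from their standard definitions using only the standard library. The function names and the toy labels are assumptions, and the numbers it produces are unrelated to the NYT-FB figures.

```python
import math
from collections import Counter

def b_cubed(pred, gold):
    """B-cubed precision, recall and F1 for a clustering `pred` against `gold`."""
    n = len(pred)
    pred_sizes, gold_sizes = Counter(pred), Counter(gold)
    joint = Counter(zip(pred, gold))
    precision = sum(joint[(p, g)] / pred_sizes[p] for p, g in zip(pred, gold)) / n
    recall = sum(joint[(p, g)] / gold_sizes[g] for p, g in zip(pred, gold)) / n
    return precision, recall, 2 * precision * recall / (precision + recall)

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def _cond_entropy(target, given):
    """H(target | given) estimated from label co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_sizes = Counter(given)
    return -sum(c / n * math.log(c / given_sizes[g]) for (_, g), c in joint.items())

def v_measure(pred, gold):
    """Homogeneity, completeness and their harmonic mean (V-measure)."""
    h_gold, h_pred = _entropy(gold), _entropy(pred)
    hom = 1.0 if h_gold == 0 else 1.0 - _cond_entropy(gold, pred) / h_gold
    com = 1.0 if h_pred == 0 else 1.0 - _cond_entropy(pred, gold) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

# Toy data: two gold relations with 50 instances each, and the trivial
# clustering in which every instance is its own cluster.
gold = ["r1"] * 50 + ["r2"] * 50
singletons = list(range(100))
b3_p, b3_r, b3_f1 = b_cubed(singletons, gold)
hom, com, v = v_measure(singletons, gold)
```

On this toy input the singleton clustering is perfectly homogeneous and keeps a non-trivial completeness, so its V-measure stays well above its B³ F1, whose recall falls to 1/50.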

Discussion
The results of our evaluation demonstrate that our models outperform previous methods, despite being simpler. These results lead us to the following findings.
Footnote 3: github.com/diegma/relation-autoencoder

Do ML models employ proper inductive biases?
In common with other unsupervised learning approaches, there is no guarantee that a URE model will learn the relation types in the KBs and/or annotated data used. A common solution is to employ inductive biases (Wagstaff, 2000) to guide the learning process towards the desired relation types.
Inductive biases can emanate from pre-processed data. Since our models outperform other methods, we conclude that entity type information alone constitutes a better bias than the biases employed by existing ML models. Indeed, entity types constitute a useful bias for this task. Among the topic modelling based methods, RelLDA1 outperforms RelLDA, which does not employ entity types. In a separate experiment, we found that adding entity types to the Simon model helped to achieve higher performance than the original version, i.e., 42.74% vs. 39.4% B³ F1 on the NYT-FB test set. However, although both RelLDA1 and March also employ entity types, their performance is still lower than ours. This is because other syntactic and word features used in these two models might cancel out the useful bias of entity types. (More details are in the last paragraph of this section.)
Inductive biases can emanate from training signals. March and Simon are trained with a link predictor, which provides indirect signals to train a relation classifier. Hence, the question here is "can the link predictor induce good training signals?"
To answer this, we examine the link predictor with alternative settings:
• Rand10 randomly assigns one of 10 relation types to each entity pair;
• Rand10 with silver frequencies, similar to Rand10, randomly generates relation types but follows the silver relation distribution;
• One relation assumes that all entity pairs share the same relation type;
• EType uses 16 relation types induced from 4 coarse entity types;
• Silver relations (10) takes the 9 most frequent relation types and groups the rest together to form the tenth relation type;
• Silver relations (full) considers the full set of (silver) annotated relations, i.e., 262 types.
Figure 1 illustrates the average loss values obtained with these settings. If high-quality relations were critical for training the link predictor, we would expect lower losses when using annotated relations. Indeed, the loss curve obtained with 10 correct relation types is consistently below all the others. This implies that the link predictor is able to provide reasonable signals for training a relation classifier. So why are the Simon and March models outperformed by our models? As pointed out by Simon et al. (2019), the link predictor itself cannot be trained without a good relation classifier. This suggests that the relation classifiers in both methods need to be improved. Empirical evidence shows that both the Simon and March models are outperformed (in B³ and V) by our EType+, which uses the same link predictor. We also notice that One relation and EType end up with similar losses. This might imply that we only need one relation (matrix) to predict head/tail entities, as the link predictor is very expressive. However, the silver relations are clearly helpful, as during the first 15 epochs their losses are much lower than the others.
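Several of the settings listed above amount to simple relabelling schemes, sketched below. The helper names, the label used for the grouped remainder, and the seeds are assumptions made for illustration. The random settings use the standard library's `random.Random.choices`.

```python
import random
from collections import Counter

def rand_relations(num_pairs, num_relations=10, weights=None, seed=0):
    """Rand10-style labels; pass `weights` to follow a given relation distribution."""
    rng = random.Random(seed)
    return rng.choices(list(range(num_relations)), weights=weights, k=num_pairs)

def one_relation(num_pairs):
    """Every entity pair shares the same relation type."""
    return [0] * num_pairs

def collapse_to_top_k(relations, k=9, other_label="OTHER"):
    """Keep the k most frequent relation types and group the rest into one type."""
    top = {rel for rel, _ in Counter(relations).most_common(k)}
    return [rel if rel in top else other_label for rel in relations]

# Toy silver labels; with k=2 only the two most frequent types survive.
silver = ["a", "a", "a", "b", "b", "c"]
collapsed = collapse_to_top_k(silver, k=2)
random_labels = rand_relations(num_pairs=len(silver))
```

With the default `k=9`, `collapse_to_top_k` yields at most ten distinct labels, matching the Silver relations (10) setting.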
Why was the performance on TACRED lower? Although TACRED shares similar relation types with Freebase, we observed that both the March and Simon models consistently perform worse on the TACRED dataset. More precisely, the Simon model performs significantly worse on TACRED, with 15.7% B³, which is less than half of its score on NYT-FB (39.4%). This performance drop might be attributed to the distributional shift between the two datasets: variation and semantic shift in vocabulary and language structure over time, since NYT was collected long before TACRED.
How is the performance when combining entity types with other features? Surprisingly, our methods using only entity types perform better than the previous state-of-the-art methods, including feature engineering and deep learning models. However, we know that context information is crucial for distinguishing the relation between two entities, as many RE studies have integrated context information to improve RE performance. We therefore conduct experiments that combine entity types with common RE features, reported in Table 2. The features include: (i) Entity: the textual surface form of the two entities, (ii) BOW: the bag of words between the two entities, (iii) DepPath: the words on the dependency path between the two entities, (iv) POS: the part-of-speech tag sequence between the two entities, and (v) Trigger: DepPath without stop words. In general, naively combining entity types with other features did not improve the model performance. Additionally, the BOW feature had negative effects on RE performance. This indicates that the bag of words between two entities often includes uninformative and redundant words, i.e., noise, that are difficult to eliminate using simple neural architectures. While (i)-(v) are widely used hand-crafted features for RE, we also incorporated a neural context encoder, PCNN, which combines Simon's PCNN encoder with the entity masking and position-aware attention proposed by Zhang et al. (2017). However, the performance obtained by adding PCNN is also lower than that obtained with entity types alone.
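As an illustration of how the BOW feature from the list above might be combined with the entity type pair, the sketch below extracts the words between two entity spans and builds a flat feature list. The span convention (token offsets with exclusive end), the feature string format, and the function names are assumptions. The dependency-based features are omitted because they require a parser.

```python
def bow_between(tokens, head_span, tail_span):
    """Sorted, lower-cased set of words strictly between two entity spans.

    Spans are (start, end) token offsets with an exclusive end.
    """
    first, second = sorted([head_span, tail_span])
    return sorted({tok.lower() for tok in tokens[first[1]:second[0]]})

def build_features(tokens, head_span, tail_span, head_type, tail_type, use_bow=True):
    """Entity type pair feature, optionally extended with BOW features."""
    features = [f"ETYPE={head_type}-{tail_type}"]
    if use_bow:
        features += [f"BOW={w}" for w in bow_between(tokens, head_span, tail_span)]
    return features

tokens = "Jon Robin Baitz , born in Los Angeles".split()
features = build_features(tokens, (0, 3), (6, 8), "PERSON", "LOCATION")
```

Even in this short example, two of the three BOW features are a comma and a preposition, which gives a sense of the uninformative words mentioned above.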

Conclusion
We have shown the importance of entity types in URE. Our methods use only entity types, yet they yield higher performance than previous work on both NYT-FB and TACRED. We have investigated the current experimental setting, concluding that a strong inductive bias is required to train a relation extraction model without labelled data. URE remains challenging and requires improved methods to deal with silver data. We also plan to use different types of labelled data, e.g., domain-specific datasets, to ascertain whether entity type information is more discriminative in sub-languages.
Table 3 shows the statistics of the NYT-FB (Marcheggiani and Titov, 2016) and TACRED (Zhang et al., 2017) datasets. We followed the same data split and pre-processing described in Marcheggiani and Titov (2016). For all methods, we trained on NYT-FB and evaluated on both NYT-FB and TACRED. Figure 2 illustrates the relation distributions of the two datasets, NYT-FB and TACRED. We can see that the 15 most frequent of the 253 relations account for 82.97% of the total number of instances in NYT-FB. Meanwhile, the 15 most frequent of the 41 relations account for 74.94% of the total number of instances in TACRED.
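The coverage statistic quoted for the relation distributions is the share of instances that belong to the k most frequent relation types. A minimal sketch of that computation follows. The function name and the toy labels are assumptions, and the toy result is unrelated to the reported percentages.

```python
from collections import Counter

def top_k_coverage(relations, k):
    """Fraction of instances labelled with one of the k most frequent relations."""
    counts = Counter(relations)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(relations)

# Toy label list with a skewed distribution over three relation types.
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"]
toy_coverage = top_k_coverage(toy_labels, k=2)
```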

B Hyper-parameter Settings
We used the development set to stop the training process. For every model, we conducted three runs with different initialised parameters and computed the average performance. We list the hyperparameters of different models in Table 4. Table 5 presents the average test scores of three runs on the NYT-FB and TACRED datasets. We note that the two models proposed by Marcheggiani and Titov (2016) and Simon et al. (2019) are sensitive to the hyper-parameters and thus difficult to train. We could not replicate the performance of Simon on the NYT-FB dataset.