Deep Bidirectional Transformers for Relation Extraction without Supervision

We present a novel framework to deal with relation extraction tasks in cases where there is a complete lack of supervision, either in the form of gold annotations or of relations from a knowledge base. Our approach leverages syntactic parsing and pre-trained word embeddings to extract a small number of precise relations, which are then used to annotate a larger corpus, in a manner identical to distant supervision. The resulting data set is employed to fine tune a pre-trained BERT model in order to perform relation extraction. Empirical evaluation on four data sets from the biomedical domain shows that our method significantly outperforms two simple baselines for unsupervised relation extraction and, despite using no supervision at all, achieves only slightly worse results than the state-of-the-art in three out of four data sets. Importantly, we show that it is possible to successfully fine tune a large pre-trained language model with noisy data, as opposed to previous works that rely on gold data for fine tuning.


Introduction
The last years have seen a number of important advances in the field of Relation Extraction (RE), mainly based on deep learning models (Zeng et al., 2014, 2015; Zeng et al., 2016; Wu et al., 2017; Verga et al., 2018). These advances have led to significant improvements in benchmark tasks for RE. The above cases assume the existence of some form of supervision, either manually annotated data or distantly supervised data (Mintz et al., 2009), where relations from a knowledge base are used to automatically annotate text, which can then serve as a noisy training set. For most real-world cases manually labeled data is either limited or completely missing, so one typically resorts to distant supervision to tackle an RE task. There exist cases, though, where even the distant supervision approach cannot be followed due to the lack of a knowledge base. This is often the case in domains like the Web or the biomedical literature, where entities of interest may be related to other entities yet no supervision signal is available.
In this work, we propose an approach to deal with such a scenario in a purely unsupervised manner, that is, without providing any manual annotation or any supervision whatsoever. Our goal is to provide a framework that enables a pre-trained language model to be self-fine tuned 1 on a set of predefined relation types, in situations without any existing training data and without the possibility or budget for human supervision.
Our method proceeds as follows:
• The data are first parsed syntactically, extracting relations of the form subject-verb-object. The resulting verbs are embedded in a vector space along with the relation types that we are interested in learning, and each verb is mapped to its most similar relation type. Table 1 shows an example of this mapping process. The process is entirely automatic: we only provide the set of relation types that we are interested in and a threshold below which a verb is mapped to a Null class.
• Subsequently, we use these extracted relations, identically to a distant supervision signal, to automatically annotate all co-occurrences of entities in a large corpus.
• The resulting data set is used to fine tune a Deep Bidirectional Transformer (BERT) model (Devlin et al., 2018).
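The annotation step above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the seed relations, entity strings and function names are all made up for the example.

```python
# Seed relations extracted by the parsing + verb mapping step
# (illustrative entries only).
seed = {
    ("aspirin", "headache"): "treat",
    ("brca1", "breast cancer"): "cause",
}

def annotate(sentences):
    """Label each (sentence, entity pair) with a seed relation, or Null
    when the pair is not in the seed set (distant supervision style)."""
    labelled = []
    for text, e1, e2 in sentences:
        label = seed.get((e1, e2), "Null")
        labelled.append((text, e1, e2, label))
    return labelled

corpus = [
    ("Aspirin relieved the patient's headache.", "aspirin", "headache"),
    ("BRCA1 mutations co-occur with breast cancer.", "brca1", "breast cancer"),
    ("Aspirin was administered with warfarin.", "aspirin", "warfarin"),
]
print(annotate(corpus))
```

Note that every co-occurrence of a seed pair is labelled positive regardless of what the sentence actually says, which is exactly the source of the false positive noise discussed later.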
Importantly, the first step ensures that the resulting relations will have high precision (although at the expense of low recall), since the subject-verb-object association largely excludes the possibility of the two entities co-occurring randomly in the sentence. In other words, we end up with a small but high-quality set of relations, which can then be used in a way identical to distant supervision.
The main contribution of this work is the introduction of a novel framework to deal with RE tasks without any supervision, either manually annotated data or known relations. Our approach is empirically evaluated on four data sets. A secondary implication of our work involves how we employ a pre-trained language model such as BERT: unlike previous approaches that employ a small gold data set, we show that it is possible to instead use a large noisy data set to successfully fine tune such a model.
The rest of the paper is organized as follows: we describe the related work in Section 2, subsequently describing our method in Section 3 and presenting the empirical evaluation results in Section 4.

Related work
Dealing with relation extraction in the absence of training data is not a novel task: for more than a decade, researchers have successfully employed techniques to tackle the lack of supervision, mainly by resorting to distant supervision (Mintz et al., 2009; Riedel et al., 2010). This approach assumes the existence of a knowledge base, which already contains known relations between specific entities. These relations are then used to automatically annotate texts containing these entity pairs. Although this approach leads to noisy labelling, it is cheap and has the ability to leverage a vast amount of training data. A great body of work has built upon this approach aiming to alleviate the noise in annotations, using formulations such as multi-label multi-instance learning (Surdeanu et al., 2012; Zeng et al., 2015), employing generative models to reduce wrong labelling (Takamatsu et al., 2012), developing different loss functions for relation extraction (dos Santos et al., 2015; Wang et al., 2016) or using side information to constrain predicted relations (Vashishth et al., 2018).
More recently, a number of other interesting approaches have been presented aiming to deal with the lack of training data, with direct application to RE: data programming (Ratner et al., 2016) provides a framework that allows domain experts to write labelling functions which are then denoised through a generative model. Levy et al. (2017) have formulated the relation extraction task as a reading comprehension problem by associating one or more natural language questions with each relation. This approach enables generalization to unseen relations in a zero-shot setting.
Our work is different from the aforementioned approaches in that it does not rely on the existence of any form of supervision. We build a model that is driven by the data: it discovers a small set of precise relations, uses them to annotate a larger corpus, and is self-fine tuned to extract new relations.
To train the RE classifier, we employ BERT, a recently proposed deep language model that achieved state-of-the-art results across a variety of tasks. BERT, similarly to the works of Radford et al. (2018a) and Radford et al. (2018b), builds upon the idea of pre-training a deep language model on massive amounts of data and then applying it (by fine tuning) to solve a diverse set of tasks. The building block of BERT is the Transformer model (Vaswani et al., 2017), a neural network cell that uses a multi-head, self-attention mechanism.
The first step of our approach is highly reminiscent of approaches from the open Information Extraction (openIE) literature (Banko et al., 2007). Indeed, similar to openIE approaches, we also use syntactic parsing to extract relations. Nevertheless, unlike openIE we are interested in a) specific types of entities, which we assume have been previously extracted with Named Entity Recognition (NER), and b) specific, predefined types of relations between entities. We use syntactic parsing only as a means to extract a few precise relations and then follow an approach similar to distant supervision to train a neural relation extraction classifier. It should be noted, though, that as a potential extension of this work we could employ more sophisticated techniques instead of syntactic parsing, similar to the latest openIE works (Yahya et al., 2014).

Method and Implementation Details
We present here the details of our method. First, we describe how we create our training set which results from a purely unsupervised procedure during which the only human intervention is to define the relation types of interest, e.g., 'treat' or 'associate'. Subsequently, we describe BERT, the model that we use in our approach.

Training Set Creation
Our method assumes that the corpus is split into sentences 2 , which are then passed through a NER model and a syntactic parser. We use the spaCy library 3 for both steps.
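The parse produced for each sentence can be treated as a graph over tokens. A minimal sketch of finding the shortest dependency path between two entities follows; a toy hand-written parse and plain-Python BFS stand in here for spaCy's actual data structures, and all names are illustrative.

```python
from collections import deque

# Toy dependency edges for "Aspirin effectively treats migraine",
# written as (token, head) pairs. The path search treats the tree
# as undirected.
edges = [("Aspirin", "treats"), ("effectively", "treats"), ("migraine", "treats")]

def shortest_dep_path(edges, source, target):
    """Breadth-first search for the shortest path between two tokens."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], ()):  # expand unvisited neighbours
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path between the tokens

print(shortest_dep_path(edges, "Aspirin", "migraine"))
# -> ['Aspirin', 'treats', 'migraine']; the verb on the path
#    ("treats") is the candidate relation trigger.
```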
Given a pair of entities A and B, we find their shortest dependency path and, if one or more verbs V are on that path, we assume that A is related to B through V. The next step involves mapping the verbs to a set of predefined relation types, as shown in Table 1. To do so, we embed both relation types and verbs in a continuous, lower-dimensional space with a pre-trained skip-gram model (Mikolov et al., 2013), and map each verb to its closest relation type, if the cosine similarity of the two vectors is greater than a threshold (in initial small-scale experiments using a validation set, we found that a threshold of 0.4 works well). Otherwise, the verb is not considered to represent a relation. In our experiments we used the pre-trained BioASQ word vectors 4 , since our relation extraction tasks come from the biomedical domain.
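The verb-to-relation-type mapping can be sketched as below. The 3-d vectors are fabricated purely to show the mechanics; the actual system uses the pre-trained BioASQ word vectors, and only the 0.4 threshold is taken from the text.

```python
import numpy as np

# Made-up embeddings for the relation types of interest
# (the real system looks these up in pre-trained word vectors).
relation_types = {
    "treat": np.array([1.0, 0.1, 0.0]),
    "cause": np.array([0.0, 1.0, 0.2]),
}
THRESHOLD = 0.4  # below this similarity a verb maps to the Null class

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def map_verb(verb_vec):
    """Map a verb vector to its most similar relation type, or Null."""
    best_type, best_sim = max(
        ((name, cosine(verb_vec, vec)) for name, vec in relation_types.items()),
        key=lambda pair: pair[1])
    return best_type if best_sim >= THRESHOLD else "Null"

print(map_verb(np.array([0.9, 0.2, 0.1])))   # close to 'treat'
print(map_verb(np.array([0.1, -0.2, 1.0])))  # dissimilar to both -> 'Null'
```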
It is important to note that in the above procedure the only human involvement is defining the set of relation types that we are interested in. In that sense, the approach is neither domain- nor scale-dependent: any set of relations can be used (coming from any domain) and likewise we can consider any number of relation types.
The above procedure results in a small but relatively precise set of relations, which can then be used in a way similar to distant supervision to annotate all of our corpus. Nevertheless, there are a number of caveats to be taken into consideration:
• As expected, there will be errors in the relations that come from the syntactic parsing and verb mapping procedure.
• Our distant supervision-like approach also comes with inherent noise: we end up with a training set that has many false negative errors and also a few false positive errors.
• The resulting training set will be largely imbalanced, since the way that we extract relations sacrifices recall for precision.
To deal with the above noise, we employ BERT as a relation extraction classifier. Furthermore, we use a balanced bagging approach to deal with the class imbalance. Both approaches are described in detail in the following sections.

Deep Bidirectional Transformers
BERT is a deep learning network that focuses on learning general language representations which can then be used in downstream tasks. Much like the work of Radford et al. (2018a) and Radford et al. (2018b), the general idea is to leverage the expressive power of a deep Transformer architecture that is pre-trained on a massive corpus on a language modelling task. Indeed, BERT comes in two flavors, with 12 and 24 layers and 110M and 340M parameters respectively, and is pre-trained on a concatenation of the English Wikipedia and the Book Corpus (Zhu et al., 2015). The resulting language model can then be fine tuned across a variety of different NLP tasks.
The main novelty of BERT is its ability to pre-train bidirectional representations by using a masked language model as a training objective. The idea behind the masked language model is to randomly mask some of the word tokens in the input, the objective being to predict what each masked word actually is, based on its context. The model is simultaneously trained on a second objective in order to model sentence relationships: given two sentences sent_a and sent_b, predict whether sent_b is the next sentence after sent_a.
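The input corruption behind the masked-language-model objective can be sketched as follows. This is a simplification: real BERT masks roughly 15% of WordPiece tokens and has additional replace/keep rules that are omitted here, and all names below are illustrative.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random fraction of tokens with [MASK] and record the
    originals; the model's task is to recover them from context."""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # position -> original token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

sentence = "the drug was reported to cause severe liver toxicity".split()
masked, targets = mask_tokens(sentence, rng=random.Random(1))
print(masked)
print(targets)
```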
BERT has achieved state-of-the-art results across eleven NLP tasks using the same pre-trained model with only simple fine tuning. This makes it particularly attractive for our use case, where we need a strong language model that will be able to learn from noisy patterns.
In order to further deal with the challenges mentioned in the previous section, in our experiments we fine tuned BERT for up to 5 epochs, since in early experiments we noticed that beyond that point the model started overfitting to the noise and the validation loss started increasing.

Balanced Bagging
In order to deal with class imbalance we employed balanced bagging (Tao et al., 2006), an ensembling technique where each component model is trained on a sub-sample of the data, such that the negative examples are roughly equal to the positive ones. To train each model of the ensemble, we sub-sample only the negative class so as to end up with a balanced set of positives and negatives.
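The sub-sampling scheme can be sketched as follows; function names and the toy data are illustrative, with each ensemble member receiving all positives plus an equally sized random draw from the negatives, as described above.

```python
import random

def balanced_bags(positives, negatives, n_models=10, seed=0):
    """Build one balanced training bag per ensemble member: all
    positives plus a same-sized subsample of the negative class."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_models):
        neg_sample = rng.sample(negatives, k=len(positives))
        bags.append(positives + neg_sample)
    return bags

# Toy imbalanced data: 5 positives vs 100 negatives.
positives = [("pos_sent_%d" % i, 1) for i in range(5)]
negatives = [("neg_sent_%d" % i, 0) for i in range(100)]
bags = balanced_bags(positives, negatives)
print(len(bags), len(bags[0]))  # 10 bags, each with 5 pos + 5 neg
```

At prediction time the ensemble's outputs would be averaged, so every model sees the full positive signal while no single model is dominated by the noisy negative class.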
This sub-sampling of the negative class is important not only in order to alleviate the data imbalance, but also because the negative class will contain more noise than the positive by definition of our approach. In other words, since we consider as positives only a small set of relations coming from syntax parsing and verb mapping, it is more likely that a negative is in reality a positive sample rather than the opposite.

Experiments
In this section we first describe the data sets used in the experiments and the experimental setup, and then present the results of our experiments.

Data Sets and Setup
We evaluate our method on four data sets coming from the biomedical domain, expressing disease-drug and disease-gene relations. Three of them are well-known benchmark data sets for relation extraction: the Biocreative chemical-disease relations (CDR) data set (Li et al., 2016), the Genetic Association Database (GAD) data set (Bravo et al., 2015) and the EU-ADR data set (Van Mulligen et al., 2012). Additionally, we present a proprietary manually curated data set, Healx CD, expressing therapeutic drug-disease relations. We consider only sentence-level relations, so we split CDR instances into sentences (the rest of the data sets are already at sentence level). Statistics for the data sets are provided in Table 2. We should note that for our approach we map each verb to the respective relation class that is depicted in Table 2 in parentheses.
As stated, we are mainly interested in understanding how our proposed method performs under a complete lack of training signal, so we compare it with two simple baselines for unsupervised relation extraction. The first assumes that a sentence co-occurrence of two entities signals a positive relation, while the second is equivalent to the first two steps of our method: syntactic parsing followed by verb mapping to the relation types of interest. In other words, if two entities are connected on the shortest dependency path through a verb that is mapped to a class, they are considered to be related with that class.
Additionally, we would like to understand how our method performs against supervised methods, so for the first three data sets we compare it with a BERT model trained on the respective gold data, also reporting the current state-of-the-art, while for the Healx CD data set, since there are no manual annotations, we compare our method against a distant supervision approach, retrieving ground truth relations from our internal knowledge base.
Across all experiments and for all methods we use the same BERT model, BioBERT (Lee et al., 2019), which is a BERT model initialized with the model from Devlin et al. (2018) and then pre-trained on PubMed, and thus more relevant to our tasks. That model is fine tuned for relation extraction classification using the code provided by the BioBERT authors, either on the gold, the distantly supervised, or our approach's training set. We fine tune for up to 5 epochs with a learning rate of 0.00005 and a batch size of 128, keeping the model that achieves the best loss on the respective validation set.
Finally, for the distant supervision approach as well as for our method, we use the previously mentioned balanced bagging approach, fine tuning an ensemble of ten models for each relation.

Table 2: Data sets used in our experiments. 'Our approach' stands for the procedure described in Section 3.1. The Drug-Disease relation for our approach yields two positive classes, treat and cause, so we report positives from each class in parentheses.

Results
Table 3 shows the results for the four data sets, reporting the average over five runs. For the GAD and EU-ADR data sets, we use the train and test splits provided by Lee et al. (2019). Also, for CDR, since the state-of-the-art results (Verga et al., 2018) are given at the abstract level, we rerun their proposed algorithm on our transformed sentence-level CDR data set, reporting results for a single model without additional data (Verga et al. (2018) also report results when adding weakly labeled data).

Let us first focus on the two unsupervised baselines. The first, dubbed 'co-occurrences', achieves perfect recall, since it considers all entity pair co-occurrences as expressing a relation, but is clearly sub-optimal with regard to precision. The opposite behaviour is observed for the second baseline (syntactic parsing with verb mapping), since it focuses on extracting high-precision relations, sacrificing recall: only entity pairs connected through a verb that is mapped to a relation are considered positives. Notably, this baseline achieves the highest precision in two out of four data sets, even compared to the supervised methods.

Our method proves significantly better than the other two unsupervised baselines, outperforming them by a large margin in all cases except EU-ADR. In that case our method is slightly worse than the co-occurrences baseline, since EU-ADR contains a large percentage of positives. It is particularly interesting to observe the improvement over the second baseline, which acts as the training signal for our method. Thanks to the predictive power and robustness of BERT, our method manages to learn useful patterns from a noisy data set and substantially improve upon its own training signal.
An additional advantage of our method over the two other unsupervised baselines, and similar approaches in general, is that it outputs a probability. Unlike the other methods, this probability allows us to tune our method for better precision or recall, depending on the application.
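This thresholding trade-off can be illustrated with a short sketch; the scores and labels below are fabricated for the example and are not model output.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when predictions with probability
    >= threshold are taken as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3]   # fabricated probabilities
labels = [1, 1, 0, 1, 0, 0]                 # fabricated gold labels
for t in (0.5, 0.8):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold from 0.5 to 0.8 trades recall for precision, which a hard-decision baseline cannot do.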
We then focus on comparing our proposed approach against the same BERT model fine tuned on supervised data, either manually annotated for the first three data sets, or distantly annotated for the fourth. For the first three data sets, we also report the current state-of-the-art results. Interestingly, even though our method is completely unsupervised, it is competitive with fully supervised methods in three out of four cases, falling short of them by 3.7 to 14.1 F1 points. On average, our method is worse by 7.5 F1 points than the best supervised model (either BERT or the current state-of-the-art).
These results are particularly important if we take into account that they come from a procedure that is fully unsupervised and which entails substantial noise from its sub-steps: the syntactic parsing may come with errors, and mapping the verbs to relevant relation types is a process largely subject to the quality of the embeddings. Even worse, the relations obtained from the previous steps are used to automatically annotate all co-occurrences in a distant supervision-like fashion, which leads to even more noise.

Table 3: Results on relation classification. State-of-the-art results were obtained from the corresponding papers. We averaged over five runs and report the evaluation metrics for a 0.5 probability threshold.
What we show empirically here is that despite all the noise coming from the above unsupervised procedure, we manage to successfully fine tune a deep learning model so as to achieve performance comparable to a fully supervised model. BERT is the main factor driving this robustness to noise, which can be mainly attributed to the fact that it is a very deep language model (110M parameters) pre-trained generatively on a massive corpus (3.3B words). The significance of these results is further amplified if we consider how scarce labeled data are for tasks such as relation extraction.

Qualitative Analysis
Although we showed empirically that our proposed approach is consistently capable of achieving results comparable to the state-of-the-art, we would like to further examine the weak points of the syntactic parsing method, and of our approach compared to a fully supervised one.
To this end, we manually inspected examples of predictions of the three aforementioned methods on the CDR data set, focusing on failures of our method and of the syntactic parsing method that acts as the training signal of our approach. Table 4 shows some characteristic cases:
• In the first sentence, the syntactic parsing + verb mapping baseline (SP+VM) fails since the verb (developed) is not associated with cause. Conversely, our method (BERT with SP+VM) manages to model the sentence correctly and extract the relation.
• SP+VM fails in the second example for the same reason, although the sentence is relatively simple.
• The third sentence also represents an interesting case, with SP+VM being "tricked" by the verb induced. Our method also fails here, failing to attend correctly to the masked DISEASE entity.
• The fourth example represents a similar case: both BERT-based models are tricked by the language. The SP+VM baseline erroneously associates the verb block with the relation treat instead of cause.
• The fifth sentence resembles the first two: SP+VM fails to extract the relation for the same reason (the verb in between is not mapped to the relation). Our method fails too in this case, perhaps due to the relatively uncommon way the causal relation is expressed (COMPOUND model of DISEASE).
While further inspecting the results, we also noticed a steady tendency of SP+VM to capture relations in syntactically simpler and shorter sentences, while failing in the opposite case.
Overall, we observe, as expected, that the SP+VM method is largely dependent on the simplicity of the expressed relation. Our method is clearly dependent on the quality of the syntactic parsing, but manages, up to a point, to overcome low-quality training data. To conclude, we can safely assume that our method would further benefit from replacing the SP+VM method with a more sophisticated unsupervised approach as the training signal, a future direction that we intend to pursue.

Conclusions
This work has introduced a novel framework to deal with relation extraction tasks in settings where there is a complete lack of supervision. Our method employs syntactic parsing and word embeddings to extract a small set of precise relations, which are then used to annotate a larger corpus, in the same way as distant supervision. With that data, we fine tune a pre-trained BERT model to perform relation extraction.
We have empirically evaluated our method against two unsupervised baselines, a BERT model trained with gold or distantly supervised data and the current state-of-the-art. The results showed that our approach is significantly better than the unsupervised baselines, ranking slightly worse than the state-of-the-art in three out of four cases.
Apart from presenting a novel perspective on how to train a relation extraction model in the absence of supervision, our work also shows empirically that it is possible to successfully fine tune a deep pre-trained language model with substantially noisy data.
We are interested in extending this paradigm to other areas of natural language processing, as well as adjusting our framework for more complex relation extraction tasks and using more sophisticated unsupervised methods as the training signal.