A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction

Fact triples are a common form of structured knowledge used within the biomedical domain. As the amount of unstructured scientific texts continues to grow, manual annotation of these texts for the task of relation extraction becomes increasingly expensive. Distant supervision offers a viable approach to combat this by quickly producing large amounts of labeled, but considerably noisy, data. We aim to reduce such noise by extending an entity-enriched relation classification BERT model to the problem of multiple instance learning, and defining a simple data encoding scheme that significantly reduces noise, reaching state-of-the-art performance for distantly-supervised biomedical relation extraction. Our approach further encodes knowledge about the direction of relation triples, allowing for increased focus on relation learning by reducing noise and alleviating the need for joint learning with knowledge graph completion.


Introduction
Relation extraction (RE) remains an important natural language processing task for understanding the interactions between entities that appear in texts. In supervised settings (GuoDong et al., 2005; Zeng et al., 2014; Wang et al., 2016), obtaining fine-grained relations for the biomedical domain is challenging due not only to the annotation costs, but also to the added requirement of domain expertise. Distant supervision (DS), however, provides a meaningful way to obtain large-scale data for RE (Mintz et al., 2009; Hoffmann et al., 2011), but this form of data collection also tends to result in an increased amount of noise, as the target relation may not always be expressed (Takamatsu et al., 2012; Ritter et al., 2013).

[Figure 1: Example of a distantly supervised bag of sentences for a knowledge base tuple (neurofibromatosis 1, breast cancer), with special order-sensitive entity markers to capture the position and the latent relation direction with BERT for predicting the missing relation.]

As exemplified in Figure 1, the last two sentences can be seen as potentially noisy evidence, as they do not explicitly express the given relation.
Since individual instance labels may be unknown (Wang et al., 2018), we instead build on the recent findings of Wu and He (2019) and Soares et al. (2019) in using positional markings and the latent relation direction (Figure 1) as a signal to mitigate noise in bag-level multiple instance learning (MIL) for distantly supervised biomedical RE. Our approach greatly simplifies previous work by Dai et al. (2019), with the following contributions:

• We extend the sentence-level relation-enriched BERT model (Wu and He, 2019) to bag-level MIL.
• We demonstrate that simple applications of this model under-perform, and require knowledge base order-sensitive markings, k-tag, to achieve state-of-the-art performance. This data encoding scheme captures the latent relation direction and provides a simple way to reduce noise in distant supervision.
• We make our code and data creation pipeline publicly available: https://github.com/suamin/umls-medline-distant-re

Related Work

In MIL-based distant supervision for corpus-level RE, earlier works rely on the assumption that at least one of the evidence samples represents the target relation in a triple (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). More recently, piecewise convolutional neural networks (PCNN) (Zeng et al., 2014) have been applied to DS (Zeng et al., 2015), with notable extensions in selective attention (Lin et al., 2016). Wu and He (2019) proposed an entity marking strategy for BERT (referred to here as R-BERT) to perform relation classification. Specifically, they mark the entity boundaries with special tokens following the order in which they appear in the sentence. Likewise, Soares et al. (2019) studied several data encoding schemes and found marking entity boundaries important for sentence-level RE. With such encoding, they further proposed a novel pre-training scheme for distributed relational learning, suited to few-shot relation classification (Han et al., 2018b).
Our work builds on these findings; in particular, we extend the R-BERT model of Wu and He (2019) to bag-level MIL.

Task Formulation

Let E and R represent the set of entities and relations from a knowledge base KB, respectively. For h, t ∈ E and r ∈ R, let (h, r, t) ∈ KB be a fact triple for an ordered tuple (h, t). We denote by G+ the set of all such (h, t) tuples for which there exists some r ∈ R with (h, r, t) ∈ KB, called positive groups. Similarly, we denote by G− the set of negative groups, i.e., tuples (h, t) such that for all r ∈ R the triple (h, r, t) does not belong to the KB. The union of these groups is G = G+ ∪ G−. For g ∈ G, we call B_g = [s_1, ..., s_m] a bag: an unordered sequence of sentences, each containing the group g = (h, t), where the bag size m can vary. Let f be a function that maps each element of the bag to a low-dimensional relation representation, giving [r_1, ..., r_m]. With o, we represent the bag aggregation function that maps the instance-level relation representations to a final bag representation b_g = o(f(B_g)). The goal of distantly supervised bag-level MIL for corpus-level RE is then to predict the missing relation r given the bag.
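To make this formulation concrete, the following minimal Python sketch mirrors the definitions above; the names (Bag, predict, and the f, o, classify callables) are illustrative and not taken from the released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Group = Tuple[str, str]  # an ordered KB tuple g = (h, t)

@dataclass
class Bag:
    group: Group              # g = (h, t)
    sentences: List[str]      # unordered evidence sentences, each containing h and t
    relation: Optional[str]   # some r with (h, r, t) in KB if g in G+, else "NA" (g in G-)

def predict(bag: Bag,
            f: Callable[[str], Sequence[float]],
            o: Callable[[List[Sequence[float]]], Sequence[float]],
            classify: Callable[[Sequence[float]], str]) -> str:
    """Corpus-level RE over one bag: b_g = o(f(B_g)), then score against R."""
    reps = [f(s) for s in bag.sentences]  # instance representations [r_1, ..., r_m]
    b_g = o(reps)                         # bag representation
    return classify(b_g)                  # predicted relation r in R (or NA)
```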

Entity Markers
Wu and He (2019) and Soares et al. (2019) showed that using special markers for entities with BERT, in the order they appear in a sentence, encodes positional information that improves the performance of sentence-level RE. It allows the model to focus on the target entities when other entities may also be present in the sentence, implicitly performing entity disambiguation and reducing noise. In contrast, for bag-level distant supervision, the noise can be attributed to several factors for a given triple (h, r, t) and bag B_g:

1. Evidence sentences may not express the relation.

2. Multiple entities may appear in the sentence, requiring the model to disambiguate the target entities among others.

3. The direction of the missing relation is latent.

4. There may be a discrepancy between the order of the target entities in the sentence and in the knowledge base.
To address (1), common approaches are to learn a negative relation class NA and to use better bag aggregation strategies (Lin et al., 2016; Luo et al., 2017; Alt et al., 2019). For (2), encoding positional information is important, as in PCNN (Zeng et al., 2014), which takes into account the relative positions of the head and tail entities (Zeng et al., 2015), and in Wu and He (2019) and Soares et al. (2019) for sentence-level RE. To account for (3) and (4), multi-task learning with knowledge graph completion (KGC) and mutual attention has proved effective (Han et al., 2018a; Dai et al., 2019). Simply extending sentence-order-sensitive marking to the bag level can be adverse, as it amplifies (4), and even if the composition is uniform, it distributes the evidence sentences across several bags. On the other hand, expanding relations into multiple sub-classes based on direction (Wu and He, 2019) increases class imbalance and also distributes the supporting sentences. To jointly address (2), (3) and (4), we introduce a KB-sensitive encoding suitable for bag-level distant RE. Formally, for a group g = (h, t) and a matching sentence s ∈ B_g with tokens (x_0, ..., x_L), we add the special tokens $ and ^ to mark the entity spans in two ways (a code sketch of both schemes follows below):

Sentence ordered: Called s-tag, entities are marked in the order they appear in the sentence. Following Soares et al. (2019), let s_1 = (i, j) and s_2 = (k, l) be the index pairs with 0 < i < j − 1, j < k, k ≤ l − 1 and l ≤ L, delimiting the entity mentions e_1 = (x_i, ..., x_j) and e_2 = (x_k, ..., x_l), respectively. We mark the boundary of s_1 with $ and of s_2 with ^. Note that e_1 and e_2 can be either h or t.

KB ordered: Called k-tag, entities are marked in the order they appear in the KB. Let s_h = (i, j) and s_t = (k, l) be the index pairs delimiting the head (h) and tail (t) entities, irrespective of the order in which they appear in the sentence. We mark the boundary of s_h with $ and of s_t with ^.
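As referenced above, the following sketch illustrates the two schemes on the Figure 1 pair; the mark helper, the token spans, and the example sentence are illustrative, not the authors' implementation.

```python
from typing import List, Tuple

def mark(tokens: List[str], head: Tuple[int, int], tail: Tuple[int, int],
         scheme: str = "k-tag") -> List[str]:
    """Insert boundary markers around two non-overlapping entity spans.
    head/tail are inclusive token index pairs (i, j) for the KB head and tail."""
    if scheme == "k-tag":
        tagged = [(head, "$"), (tail, "^")]   # KB order: head -> $, tail -> ^
    else:                                     # s-tag
        e1, e2 = sorted([head, tail])         # sentence order: first -> $, second -> ^
        tagged = [(e1, "$"), (e2, "^")]
    out = list(tokens)
    # insert from right to left so earlier indices stay valid
    for (i, j), sym in sorted(tagged, reverse=True):
        out.insert(j + 1, sym)
        out.insert(i, sym)
    return out

toks = "breast cancer is observed in patients with neurofibromatosis 1".split()
# KB head = neurofibromatosis 1 (tokens 7-8), KB tail = breast cancer (tokens 0-1)
print(" ".join(mark(toks, head=(7, 8), tail=(0, 1), scheme="k-tag")))
# ^ breast cancer ^ is observed in patients with $ neurofibromatosis 1 $
print(" ".join(mark(toks, head=(7, 8), tail=(0, 1), scheme="s-tag")))
# $ breast cancer $ is observed in patients with ^ neurofibromatosis 1 ^
```

Note how k-tag keeps $ on the KB head even when the tail appears first in the sentence, which is exactly the side information the s-tag scheme discards.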
The s-tag annotation scheme is the one followed by Soares et al. (2019) and Wu and He (2019) for span identification. In Wu and He (2019), each relation type r ∈ R is further expanded into two sub-classes, r(e_1, e_2) and r(e_2, e_1), to capture direction, while holding the s-tag annotation fixed. For DS-based RE, since the ordered tuple (h, t) is given, the task is reduced to relation classification without direction. This side information is encoded in the data with k-tag, covering (2) but also (3) and (4). To account for (1), we also experiment with selective attention (Lin et al., 2016).

Model Architecture

Figure 2 shows the model's architecture with k-tag. Consider a bag B_g of size m for a group g ∈ G representing the ordered tuple (h, t), with the corresponding entity spans marked in each sentence. Each sentence is processed in three steps (each linear layer is implicitly assumed to have a bias vector):

1. SENTENCE ENCODING: The marked sentence is encoded with BERT, yielding contextualized representations for all tokens.

2. RELATION REPRESENTATION: Following Wu and He (2019), the hidden states of the head span and of the tail span are each averaged and passed through a linear layer with a tanh activation; concatenating these with the [CLS] representation gives an instance-level relation representation of dimension 3d.

3. BAG AGGREGATION: After applying the first two steps to each sentence in the bag, we obtain [r_1, ..., r_m]. With a final linear layer consisting of a relation matrix M_r ∈ R^{|R|×3d} and a bias vector b_r ∈ R^{|R|}, we aggregate the bag information with o in two ways:

Average: The bag elements are averaged as b_g = (1/m) Σ_i r_i.

Selective attention (Lin et al., 2016): For the row r in M_r representing the relation r ∈ R, we obtain the attention weights α_i = softmax(r · r_i) over the bag and aggregate as b_g = Σ_i α_i r_i.

In both cases, the final scores over the relations are computed as M_r b_g + b_r.
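The aggregation step can be sketched in PyTorch as follows, assuming each instance representation r_i is already the 3d-dimensional vector from step 2; this is an illustrative reading of the two aggregators, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bag_logits(reps: torch.Tensor, M_r: torch.Tensor, b_r: torch.Tensor,
               target: int = None, mode: str = "avg") -> torch.Tensor:
    """reps: (m, 3d) instance representations [r_1, ..., r_m] of one bag.
    M_r: (|R|, 3d) relation matrix, b_r: (|R|,) bias of the final linear layer.
    target: index of the relation row used as the attention query (required
    for mode="attn" at training time)."""
    if mode == "avg":
        b_g = reps.mean(dim=0)              # b_g = (1/m) * sum_i r_i
    else:
        # selective attention (Lin et al., 2016): score instances against the
        # target relation's row of M_r (at test time each relation's own row
        # is typically used when computing its score)
        q = M_r[target]                     # (3d,)
        alpha = F.softmax(reps @ q, dim=0)  # attention weights over the bag
        b_g = alpha @ reps                  # b_g = sum_i alpha_i * r_i
    return M_r @ b_g + b_r                  # logits over all relations R
```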

Models and Evaluation
We compare each tagging scheme, s-tag and k-tag, with the average (avg) and selective attention (attn) bag aggregation functions. To test the setup of Wu and He (2019), which follows s-tag, we also expand each relation type r ∈ R into two sub-classes, r(e_1, e_2) and r(e_2, e_1), indicating the relation direction from the first entity to the second and vice versa (exprels).

Results
Performance metrics are shown in Table 2, with plots of the resulting PR curves in Figure 3. Since our data differs from that of Dai et al. (2019), the AUC cannot be directly compared. However, Precision@k indicates the general performance of extracting true triples and can therefore be compared. Overall, models using k-tag perform significantly better than the other models, with k-tag+avg achieving state-of-the-art Precision@{2k, 4k, 6k} compared to the previous best (Dai et al., 2019), which relies on joint learning with KGC and mutual attention.

In contrast, our data-driven method, k-tag, greatly simplifies this by directly encoding the KB information, i.e., the order of the head and tail entities and therefore the latent relation direction. Consider again the example in Figure 1, where our source triple (h, r, t) is (neurofibromatosis 1, associated genetic condition, breast cancer), and only the last sentence has the same order of entities as the KB. This discrepancy is conveniently resolved with k-tag (note in Figure 2 that, for the last sentence, the extracted entities' sentence order is flipped to the KB order when concatenating, unlike with s-tag). We remark that such knowledge can be seen as learned when jointly modeling with KGC; however, considering the task of bag-level distant RE only, the KB triples are known information, and we utilize this information explicitly with the k-tag encoding.
As PCNN (Zeng et al., 2015) can account for the relative positions of the head and tail entities, it also performs better than the models tagged with s-tag using sentence order. Similar to Alt et al. (2019), we also note that the pre-trained contextualized models result in sustained long-tail performance. s-tag+exprels reflects the direct application of Wu and He (2019) to bag-level MIL for distant RE. In this case, the relations are explicitly extended to model the entity direction, appearing first-to-second in the sentence and vice versa. This implicitly introduces independence between the two sub-classes of the same relation, limiting the gain from shared knowledge. Likewise, with such expanded relations, the class imbalance is further amplified across more fine-grained classes.
Though selective attention (Lin et al., 2016) has been shown to improve the performance of distant RE (Luo et al., 2017; Han et al., 2018a; Alt et al., 2019), the models in our experiments with such an attention mechanism significantly underperformed, in each case reducing the area under the PR curve and making it flatter. We note that more than 50% of the bags are undersized, in many cases containing only 1-2 sentences, and require repeated over-sampling to match the fixed bag size; this makes it difficult for attention to learn a distribution over a bag with repetitions, and further adds noise. For such cases, the distribution should ideally be close to uniform, as is the case with averaging, resulting in better performance (see the toy check below).
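A quick toy check of this intuition, with illustrative dimensions: when a bag consists of repeated copies of the same instance representation, selective attention necessarily degenerates to uniform weights, i.e., plain averaging.

```python
import torch
import torch.nn.functional as F

d = 256
r = torch.randn(3 * d)              # one instance representation (dimension 3d)
reps = r.repeat(16, 1)              # a 1-sentence bag over-sampled to bag size 16
q = torch.randn(3 * d)              # a relation query row from M_r
alpha = F.softmax(reps @ q, dim=0)  # identical scores -> uniform attention
assert torch.allclose(alpha, torch.full((16,), 1 / 16))
```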

Conclusion
This work extends BERT to bag-level MIL and introduces a simple data-driven strategy to reduce the noise in distantly supervised biomedical RE. We note that the position of the entities in the sentence and their order in the KB encode the latent direction of the relation, which plays an important role for learning under such noise. With a relatively simple methodology, we show that this can be achieved while reducing the need for additional learning tasks, highlighting the importance of data quality.

A Data Pipeline
In this section, we explain the steps taken to create the data for distantly-supervised (DS) biomedical relation extraction (RE). We highlight the importance of a data creation pipeline as the quality of data plays a key role in the downstream performance of our model. We note that a pipeline is likewise important for generating reproducible results, and contributes toward the possibility of having either a benchmark dataset or a repeatable set of rules.

A.1 UMLS processing
The fact triples were obtained for English concepts, filtering for RO relation types only (Dai et al., 2019). We collected 9.9M (CUI head, relation text, CUI tail) triples, where CUI represents the concept unique identifier in UMLS.
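For illustration, this extraction step can be sketched as follows, assuming the standard pipe-delimited MRREL.RRF column layout (CUI1 at column 0, REL at column 3, CUI2 at column 4, RELA at column 7); the exact filtering used in the pipeline may differ.

```python
def iter_ro_triples(mrrel_path: str):
    """Yield (CUI head, relation text, CUI tail) triples for RO rows."""
    with open(mrrel_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            cui1, rel, cui2, rela = cols[0], cols[3], cols[4], cols[7]
            if rel == "RO" and rela:       # 'other related' rows with a named relation
                # head/tail orientation here is illustrative; UMLS defines
                # REL/RELA as the relation of the second concept to the first
                yield (cui2, rela, cui1)
```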

A.2 MEDLINE processing
From 34.4M abstracts, we extracted 160.4M unique sentences. To perform fast and scalable search, we use a trie data structure to index all the textual descriptions of UMLS entities. To obtain a clean set of sentences, we set the minimum and maximum sentence lengths to 32 and 256 characters, respectively, and further considered only those sentences in which each matching entity is mentioned exactly once. The latter decision lowers the noise that may arise when only one instance of multiple occurrences of a matched entity is marked. With these constraints, the data was reduced to 118.7M matching sentences. A sketch of this matching step follows below.
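A minimal, self-contained sketch of the trie-based matching (the released pipeline uses a dedicated trie library; this hand-rolled version simplifies casing and tokenization):

```python
class Trie:
    def __init__(self):
        self.root = {}

    def add(self, phrase: str, cui: str):
        """Index one textual description of a UMLS entity."""
        node = self.root
        for ch in phrase.lower():
            node = node.setdefault(ch, {})
        node["$cui"] = cui                      # terminal marker

    def match_at(self, text: str, start: int):
        """Longest indexed entity name starting at `start`, or None."""
        node, best = self.root, None
        for pos in range(start, len(text)):
            ch = text[pos].lower()
            if ch not in node:
                break
            node = node[ch]
            if "$cui" in node:
                best = (node["$cui"], text[start:pos + 1])
        return best

def find_mentions(trie: Trie, sentence: str):
    """Scan a sentence left to right, keeping the length constraints above."""
    if not (32 <= len(sentence) <= 256):
        return []
    hits, i = [], 0
    while i < len(sentence):
        m = trie.match_at(sentence, i)
        if m:
            hits.append(m)
            i += len(m[1])                      # skip past the matched span
        else:
            i += 1
    return hits
```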

A.3 Groups linking and negative sampling
Recall the entity groups G = G+ ∪ G− (Section 3.1). For training with the NA relation class, we generate hard negative samples under an open-world assumption (Soares et al., 2019; Lerer et al., 2019) suited to bag-level multiple instance learning (MIL). From the 9.9M triples, we removed the relation type and collected 9M CUI groups of the form (h, t). Since each CUI is linked to more than one textual form, all text combinations of the two entities must be considered for a given pair, resulting in 531M textual groups T for the 586 relation types. Next, for each matched sentence, let P²_s denote the size-2 permutations of the entities present in the sentence; then T ∩ P²_s returns the groups which are present in the KB and have matching evidence (positive groups, G+). Simultaneously, with probability 1/2, we remove the h or the t entity from such a group and replace it with a novel entity e from the sentence, such that the resulting group (e, t) or (h, e) belongs to G−. This method results in sentences that are seen both for the true triples as well as for invalid ones. Further applying the constraint that the relation group sizes must be between 10 and 1500, we obtain 354 relation types (approximately the same as Dai et al. (2019)) with 92K positive groups and 2.1M negative groups; the positive groups were reduced to 64K by taking a random 70% subset. Table 1 provides these summary statistics. A sketch of this sampling step follows below.
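A sketch of the group linking and corruption steps, with T as the set of positive textual groups and entities as the mentions found in a sentence (names are illustrative):

```python
import itertools
import random

def sentence_groups(entities, T):
    """T ∩ P²_s: ordered entity pairs in the sentence that are KB groups (G+)."""
    return set(itertools.permutations(entities, 2)) & T

def corrupt(group, entities, T, rng=random):
    """Replace h or t (each with probability 1/2) by another entity mentioned
    in the same sentence, such that the corrupted group is not in T (-> G-)."""
    h, t = group
    candidates = [e for e in entities if e not in group]
    rng.shuffle(candidates)
    for e in candidates:
        neg = (e, t) if rng.random() < 0.5 else (h, e)
        if neg not in T:
            return neg
    return None  # no valid corruption available from this sentence
```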