Noise Reduction Methods for Distantly Supervised Biomedical Relation Extraction

Distant supervision has been applied to automatically generate labeled data for biomedical relation extraction. Noise exists in both positively and negatively-labeled data and affects the performance of supervised machine learning methods. In this paper, we propose three novel heuristics based on the notion of proximity, trigger word and confidence of patterns to leverage lexical and syntactic information to reduce the level of noise in the distantly labeled data. Experiments on three different tasks, extraction of protein-protein-interaction, miRNA-gene regulation relation and protein-localization event, show that the proposed methods can improve the F-score over the baseline by 6, 10 and 14 points for the three tasks, respectively. We also show that when the models are configured to output high-confidence results, high precisions can be obtained using the proposed methods, making them promising for facilitating manual curation for databases.


Introduction
Biomedical relation extraction is a widely studied field that is concerned with the detection of different kinds of relations between bio-entities mentioned in text. With the rapid growth of biomedical literature, it has attracted much research interest as it makes it possible to automatically extract structured information from large amounts of text. Biomedical relation extraction has helped facilitate manual curation of many biomedical databases as well as biological hypothesis generation.
Various tasks have been studied for biomedical relation extraction, e.g., extraction of protein-protein interaction (Airola et al., 2008), drug-drug interaction (Segura-Bedmar et al., 2013) and mutation-disease association (Singhal et al., 2016). In recent years, community-organized events, such as BioNLP (Kim et al., 2012, 2013) and BioCreative (Arighi et al., 2014; Wei et al., 2015b), have provided comprehensive evaluation for extraction systems of a wide range of biomedical relations and events. In these tasks, supervised learning methods are commonly used and achieve state-of-the-art results.
When applying supervised learning methods, a training corpus is required to train the extraction model. The creation of a training corpus usually requires curators with domain knowledge, and is a time-consuming and labor-intensive process. Thus, it is one of the main obstacles in the use of supervised learning methods for relation extraction. To address this issue, recently researchers have been using distant supervision to construct training data automatically.
In distant supervision, a heuristic labeling process is used to label a text corpus using known related entity pairs from a database. Text containing these entity mentions or their different name variations are labeled as positive instances. To illustrate the labeling process, we show two example sentences labeled using interacting protein pairs from the database IntAct (Orchard et al., 2014).
The above sentences are labeled as positive instances and express a protein-protein interaction relation between the protein mention pair. When a protein pair mentioned in a sentence is not recorded by IntAct, the sentence is then labeled as a negative instance. The positively and negatively-labeled data generated by this process can potentially be used by supervised learning algorithms to train a model. Various existing biological databases and the large number of Medline abstracts and PMC full-length articles can support applying distant supervision for many biomedical relation extraction tasks. However, the main drawback of distant supervision is that the created data can be very noisy, because the heuristic labeling process is unguided. Wrongly labeled instances exist in both positively and negatively-labeled data. For example, consider the two labeled sentences below for protein-protein interaction.
• Mdm2, p53 : Ribosomal protein S3: A multi-functional protein that interacts with both p53 and MDM2 through its KH domain.
In the first sentence, although the protein pair Mdm2, p53 are interacting with each other according to IntAct, no explicit description in the sentence expresses such an interaction relation. It is labeled as a positive instance by the heuristic labeling process, which is a wrong annotation. On the other hand, if a related entity pair has not been recorded in the database, all the sentences containing their mentions will be labeled as negative instances, which may also contain wrong annotations. As an example, the protein pair LRAP35a, MYO18A in the second sentence is not recorded by IntAct. The sentence is labeled as negative, while it expresses an interaction relation between the two proteins. Thus, it is a wrong annotation in the negatively-labeled data.
In this paper, we propose three novel heuristics that attempt to reduce the noise in the positively-labeled data set P as well as the negatively-labeled data set N. First, noise can be removed from P using lexical and syntactic information of the entity mention pairs. Next, high-confidence patterns can be discovered using the purified P, which can then be used to remove noise from N. Experiments on three tasks, extraction of protein-protein interaction, miRNA-gene regulation relation and protein-localization event, show that our methods can improve the F-score by 6, 10 and 14 points over the baseline for the three tasks, respectively. Furthermore, we show that our methods obtain 0.71, 0.95 and 0.77 precision at recall level 0.30 for the three tasks, respectively, making them promising for facilitating database curation.
In the rest of the paper, we first discuss related work in Section 2. Section 3 describes the three tasks, as well as the databases and text corpora used in our experiments for applying distant supervision. In Section 4, we describe the details of the proposed methods. Experimental results are reported in Section 5. We conclude with future work in Section 6.

Related Work
Distant supervision for relation extraction was first proposed by Craven and Kumlien (1999) to extract protein-localization relation. Mintz et al. (2009) used Freebase relations to annotate articles in Wikipedia and trained a logistic regression model to extract 102 different types of relations. Riedel et al. (2010) proposed to use multi-instance learning to tolerate noise in the positively-labeled data. They relaxed the original assumption in distant supervision that all the positively-labeled sentences of an entity pair express the relation of interest and instead, they assume that at least one of the sentences does. Hoffmann et al. (2011) and Surdeanu et al. (2012) continued to augment the multi-instance model with a multi-label classifier for each entity pair, to exploit correlations and conflicts among different relations to improve performance. In these approaches, researchers focus on developing models that can tolerate noise and improve extraction performance on entity pair level. However, it is important to note that the noise is not explicitly removed from the labeled data, and extraction on sentence level is not optimized directly.
Focusing on explicitly reducing noise from the distantly-labeled training data, Intxaurrondo et al. (2013) proposed three simple heuristics to remove noise from the positively-labeled data. They tried to filter out positively-labeled instances that appear too frequently or have a large distance from their cluster centroid, or positive entity pairs that have a low partial mutual information. Takamatsu et al. (2012) proposed a statistical model to estimate P(relation|pattern), and removed positively-labeled instances that match a low-probability pattern. Xu et al. (2013) used pseudo-relevance feedback to discover high-confidence related entity pairs which do not exist in the database, and removed negatively-labeled instances of these entity pairs.  tried to reduce noise in the negatively-labeled data by inferring new relations of a knowledge graph using a random-walk algorithm. Roth et al. (2013) gave a review of some of the above methods.
Distant supervision has also been applied to extract biomedical relations. Zheng and Blake (2015) used a heuristic based on dependency path frequency to reduce noise in the positively-labeled data for extraction of protein-localization relations. Thomas et al. (2011) used a list of words which are frequently employed to indicate protein interaction to filter out noise for protein-protein interaction extraction.  tried to combine existing hand-labeled data with distantly labeled data to improve the performance for drug-condition relations. Multi-instance learning was used by  to extract two subsets of relations in UMLS database with reduced noise by a path ranking algorithm, and by Lamurias et al. (2017) to extract miRNA-gene relations.

Task Definition
In this paper, we use three tasks, extraction of protein-protein interaction (PPI), miRNA-gene regulation relation (MIRGENE) and protein-localization event (PLOC), to evaluate our methods. Extraction of PPIs is a well-studied task (Miwa et al., 2009; Peng et al., 2016). We aim to extract interacting protein pairs from text using distant supervision, and evaluate the extraction on one of the public corpora used by previous work. Extraction of miRNA-gene regulation relations has attracted much interest recently because of the rapid growth of miRNA-related literature (Bagewadi et al., 2014; Li et al., 2015). In a MIRGENE relation, a miRNA regulates gene expression via direct binding to the gene's 3' UTR or via an indirect pathway effect. Extraction of protein-localization events was a subtask of the Genia track in the BioNLP shared tasks from 2009 to 2013 (Kim et al., 2013). Such an event describes a protein being localized to a subcellular location. We only consider extraction of such events when the sentence mentions the protein and the location, as in Zheng and Blake (2015). We list an example sentence for each task below.
• PPI: Interaction of Shc with Grb2 regulates association of Grb2 with mSOS.
• PLOC: The cyclin G1 protein was localized in nucleus.

Training Data Construction
To construct the training set, we need a database containing related entity pairs and a large amount of text for the heuristic labeling. Table 1 lists the databases, text corpora and numbers of positively/negatively-labeled instances produced by the heuristic labeling process for the three tasks. From all the Medline abstracts, we randomly sampled 30,000 abstracts with sentences mentioning a pair of miRNA and gene for miRNA-gene regulation relation, and 30,000 abstracts with sentences mentioning a pair of protein and subcellular location for protein-localization event. We tried sampling more abstracts but the experimental results were not significantly different. For protein-protein interaction, using Medline abstracts leads to a skewed labeled data set (1:7.4 positive/negative ratio), so we instead use all the abstracts that are curated by the IntAct database as the text corpus. Although this may result in less noise, we will show in the experiments that our proposed methods are still able to improve performance over the baseline.
In the heuristic labeling process, we need to recognize entity mentions in text and map them to their database entry. For gene/protein, we use the output from GenNorm++ (Wei et al., 2015a). We use simple regular expressions to recognize miRNA mentions, and map them to a miRNA entry in TarBase (Vlachos et al., 2014) or miRTarBase (Hsu et al., 2014) using the number in the miRNA name. For subcellular location, similar to Zheng and Blake (2015), we use a dictionary from UniProt (UniProt Consortium, 2014) and perform string matching to find subcellular location mentions. The entry "secreted" is removed as it is not a specific subcellular location. The dictionary contains name variants for each location, and we normalize a matched variant in text to its standard name.
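To make the miRNA step concrete, the following is a minimal sketch of regular-expression recognition with a lookup key taken from the number in the name. The pattern, the function name `find_mirna_mentions`, and the choice of the bare number as the key are assumptions made for illustration; they are not the expressions used in our system, and the let-7 family is deliberately not covered.

```python
import re

# Hypothetical pattern: optional three-letter species prefix (e.g. "hsa-"),
# a "miR"/"miRNA"/"microRNA" stem, the number, an optional letter suffix,
# and an optional arm suffix ("-3p" or "-5p").
MIRNA_RE = re.compile(
    r"\b(?:[a-z]{3}-)?(?:miR|miRNA|microRNA)-?(\d+)([a-z]?)(?:-(?:3p|5p))?\b",
    re.IGNORECASE,
)


def find_mirna_mentions(sentence):
    """Return (surface form, start offset, number key) for each miRNA mention."""
    mentions = []
    for match in MIRNA_RE.finditer(sentence):
        # The number in the name is the key used to look up a database entry.
        mentions.append((match.group(0), match.start(), match.group(1)))
    return mentions


if __name__ == "__main__":
    text = "Overexpression of miR-193b inhibited the expression of CCND1."
    print(find_mirna_mentions(text))
```

Keying on the number alone merges variants such as miR-193a and miR-193b into one entry, which is a simplification that a real mapping may or may not want.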

Test Data
We evaluate the baselines and proposed methods on a test set directly for the three tasks. Note that in the context of distant supervision, we should expect little or no hand-labeled data. Hence, we cannot assume the availability of a development set for the purpose of parameter tuning. Thus, when a method has multiple possible choices for a parameter, we report the results using different parameter values.
For the test set, we use the AIMed corpus (Bunescu et al., 2005) for PPI extraction, as in Bobic et al. (2012). We extend the corpus in our work (Li et al., 2015) to include relation mention annotations, and use the development set to evaluate MIRGENE extraction. For PLOC extraction we use the BioNLP 2011 Genia training and development sets, as in Zheng and Blake (2015). Gold entity annotations in these corpora are used except for subcellular locations, which we recognize using the dictionary from UniProt, as the BioNLP Genia corpus only annotates subcellular locations that participate in an event. The characteristics of the three test corpora are listed in Table 2.

Model and Feature Set
A logistic regression (LR) model is used for all our proposed methods in the experiments. An example sentence with relevant dependency relations and its extracted features are shown in Fig. 1 and Table 3. E-walk and v-walk features are (edge, stem, edge) and (stem, edge, stem) triples, including the direction, extracted from the shortest dependency path. They preserve partial structure information and are more generalizable than the full dependency path. For all the lexical terms, we use their stems produced by Porter's stemmer (Porter, 1980). The Charniak parser (Charniak, 2000; Charniak and Johnson, 2005) with the biomedical model (McClosky, 2010) is used to produce a constituency parse for each sentence, which is converted to a collapsed dependency parse using the Stanford CoreNLP converter (Manning et al., 2014) with the CCprocessed setting. We remove features that only appear once in the whole training set.
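The e-walk and v-walk triples described above can be sketched as follows. The path representation (a list of stems plus a list of (relation, direction) edges), the feature string format, and the function name `walk_features` are assumptions chosen for this example, not the exact encoding behind Table 3.

```python
def walk_features(stems, edges):
    """Build e-walk and v-walk features from a shortest dependency path.

    stems: token stems along the path, entity mentions replaced by placeholders.
    edges: (relation, direction) pairs; edges[i] connects stems[i] and stems[i + 1].
    """
    if len(edges) != len(stems) - 1:
        raise ValueError("edges[i] must connect stems[i] and stems[i + 1]")
    # Attach the direction marker to the relation label, e.g. "nsubj<".
    labels = [f"{rel}{direction}" for rel, direction in edges]
    # v-walk: (stem, edge, stem)
    v_walks = [
        f"v:{stems[i]}|{labels[i]}|{stems[i + 1]}" for i in range(len(labels))
    ]
    # e-walk: (edge, stem, edge)
    e_walks = [
        f"e:{labels[i]}|{stems[i + 1]}|{labels[i + 1]}"
        for i in range(len(labels) - 1)
    ]
    return e_walks + v_walks


if __name__ == "__main__":
    feats = walk_features(
        ["ENTITY1", "interact", "ENTITY2"],
        [("nsubj", "<"), ("prep_with", ">")],
    )
    for f in feats:
        print(f)
```

Because each triple covers only a local window of the path, two paths that differ in one distant token still share most of their walk features.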

Baselines
The baseline is an LR model trained on the distantly labeled set without any noise filtering. We also implement two previous methods for comparison. First, we train an LR model on the distantly labeled set filtered by a heuristic (DPFreq) proposed by Zheng and Blake (2015), which removes positively-labeled instances whose shortest dependency path appears fewer than k times in the positive set. They hypothesize that a rare dependency path is unlikely to express a relation. As we tried different values of k and obtained similar F-scores for the three tasks, we only report the results for k = 5 to save space. Note that since different features, text corpora and named entity recognition tools are used, we are not trying to reproduce the exact results reported in Zheng and Blake (2015). In addition, we implement a widely-used multi-instance model described in Surdeanu et al. (2012) and train it on unfiltered distantly labeled data.
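The DPFreq filter amounts to a frequency cut-off on shortest dependency paths. The sketch below assumes each instance is a dictionary with a hypothetical `path` key holding a string form of its shortest dependency path; the function name `dpfreq_filter` is also ours.

```python
from collections import Counter


def dpfreq_filter(positives, k=5):
    """Keep positive instances whose shortest dependency path occurs >= k times."""
    # Count how often each path occurs in the positively-labeled set.
    counts = Counter(inst["path"] for inst in positives)
    return [inst for inst in positives if counts[inst["path"]] >= k]


if __name__ == "__main__":
    toy = [{"path": "a"}, {"path": "a"}, {"path": "b"}]
    print(dpfreq_filter(toy, k=2))
```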

Proposed Heuristics
We propose three novel filtering methods to remove noise from both positively and negatively-labeled data. These methods are applied in a sequential manner so that each step removes more noise based on the filtered data from the previous step.
The first heuristic is concerned with multiple mentions of an entity in a sentence. If the entity is related to another entity mentioned in the sentence, all the binary combinations of their mentions will be labeled as positive by the default labeling process. This usually introduces noise, since not all combinations are likely to be in the relation. For example, consider the sentence below.
Overexpression of miR-193b inhibited the expression of CCND1, and knock-down of CCND1 inhibited the proliferation of GC cells, suggesting that miR-193b exerted its anti-tumorigenic role in GC cells through targeting CCND1 gene.
miR-193b regulates CCND1 according to the database TarBase. The six binary combinations between miR-193b and CCND1 in the sentence will be labeled as positive instances. However, the sentence only expresses a miRNA-gene regulation relation for the first and the last combination. The other four are wrongly labeled and hence constitute noise in the positively-labeled data.
To remove such noise, we hypothesize that only the closest pair of the entity mentions expresses the relation. The closest pair is defined as follows: for a positively-labeled entity mention pair (e1, e2), if their shortest dependency path has the smallest length among all the positively-labeled instances that involve either e1 or e2, the pair (e1, e2) is considered a closest pair. When computing the dependency path length, we skip the appos relation. The heuristic is described below.
Heuristic of closest pairs (CP): remove positively-labeled instances that are not a closest pair, when multiple mentions of one or both entities are present in the sentence.
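One possible reading of heuristic CP is sketched below. It assumes that comparisons are made within a single sentence, that path lengths are precomputed with appos edges already skipped, and that ties are kept; the dictionary keys (`sent`, `m1`, `m2`, `path_len`) and the function name are hypothetical.

```python
from collections import defaultdict


def closest_pair_filter(positives):
    """Keep positive instances that are a closest pair.

    Each instance is a dict with 'sent' (sentence id), 'm1' and 'm2'
    (mention ids) and 'path_len' (shortest dependency path length).
    """
    # Shortest path length seen for every mention, per sentence.
    best = defaultdict(lambda: float("inf"))
    for inst in positives:
        for mention in (inst["m1"], inst["m2"]):
            key = (inst["sent"], mention)
            best[key] = min(best[key], inst["path_len"])

    kept = []
    for inst in positives:
        limit = min(
            best[(inst["sent"], inst["m1"])],
            best[(inst["sent"], inst["m2"])],
        )
        # Keep the pair only if no instance sharing a mention is shorter.
        if inst["path_len"] <= limit:
            kept.append(inst)
    return kept


if __name__ == "__main__":
    # Toy path lengths for two miRNA mentions and three gene mentions.
    lengths = {("m1", "c1"): 2, ("m1", "c2"): 5, ("m1", "c3"): 9,
               ("m2", "c1"): 8, ("m2", "c2"): 6, ("m2", "c3"): 2}
    toy = [{"sent": 0, "m1": a, "m2": b, "path_len": n}
           for (a, b), n in lengths.items()]
    print([(i["m1"], i["m2"]) for i in closest_pair_filter(toy)])
```

With these toy lengths only the first and the last combination survive, mirroring the miR-193b/CCND1 example; the lengths themselves are invented for the illustration.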
For the three tasks and many other biomedical text-mining tasks, the relation or event is often indicated by a small set of words (e.g., interact/bind for PPI, regulate/target for MIRGENE, and localize/translocate for PLOC). Following the usage in the BioNLP Genia corpus, we term these words trigger words. With knowledge of a comprehensive set of trigger words, we can hypothesize that sentences without a trigger word are less likely to express the target relation or event. We propose to automatically mine such trigger words from the large distantly-labeled corpus, and use them to remove noise from the positively-labeled data.
Trigger words are usually verbs, or in their nominal or adjectival form. Our target is then to identify stems of verb triggers, which can also be used to match nominal or adjectival form of the verb. A simple procedure is used: first, count all the verb stems on the shortest dependency paths of the positively-labeled instances generated by the heuristic labeling process. As we want to choose triggers that are strongly associated with the relation, we only use dependency paths that contain one token, excluding the two entity mentions. These verb stems are then sorted by frequency and the high-frequency stems are chosen for the trigger list. We list the top 10 verb stems for the three tasks in Table 4.
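The counting procedure above can be sketched in a few lines. The instance layout (a hypothetical `path_tokens` list of (stem, POS tag) pairs with the entity mentions already excluded) and the use of Penn Treebank tags starting with "VB" to identify verbs are assumptions made for the example.

```python
from collections import Counter


def mine_trigger_stems(positives, top_n=50):
    """Return the top_n most frequent verb stems on one-token dependency paths."""
    counts = Counter()
    for inst in positives:
        tokens = inst["path_tokens"]
        # Only paths with exactly one token between the two entity mentions.
        if len(tokens) != 1:
            continue
        stem, pos_tag = tokens[0]
        if pos_tag.startswith("VB"):  # Penn Treebank verb tags
            counts[stem] += 1
    return [stem for stem, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    toy = [
        {"path_tokens": [("interact", "VBZ")]},
        {"path_tokens": [("interact", "VBD")]},
        {"path_tokens": [("bind", "VBZ")]},
        {"path_tokens": [("complex", "NN")]},
        {"path_tokens": [("inhibit", "VBZ"), ("express", "NN")]},
    ]
    print(mine_trigger_stems(toy, top_n=2))
```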
For each positively-labeled instance, we search for trigger stems in the tokens on its shortest dependency path or in the maximum dominating noun phrase. A maximum dominating noun phrase is defined as the maximally-spanning noun phrase that encloses the two entity mentions, with only noun or prepositional phrases as descendants. For example, in the text fragment "interaction between FAK and PP1 regulates a process", the maximum dominating noun phrase is "interaction between FAK and PP1" for this protein mention pair. As sentences without a trigger word are less likely to express the target relation or event, we use the heuristic described below to remove noise.
Heuristic of trigger word (TW): remove positively-labeled instances if a trigger stem is not found on the shortest dependency path or in the maximum dominating noun phrase of the entity mention pair.
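Given a trigger list, heuristic TW reduces to a set-intersection test. In the sketch below, `path_stems` and `np_stems` are hypothetical keys holding the precomputed stems on the shortest dependency path and in the maximum dominating noun phrase; extracting that noun phrase from the parse is outside the scope of the example.

```python
def trigger_word_filter(positives, trigger_stems):
    """Keep positive instances with a trigger stem on the path or in the NP."""
    triggers = set(trigger_stems)
    kept = []
    for inst in positives:
        # np_stems may be absent when no dominating noun phrase exists.
        candidates = set(inst["path_stems"]) | set(inst.get("np_stems", []))
        if candidates & triggers:
            kept.append(inst)
    return kept


if __name__ == "__main__":
    toy = [
        {"id": 1, "path_stems": ["interact"], "np_stems": []},
        {"id": 2, "path_stems": ["regul"], "np_stems": ["interact", "between"]},
        {"id": 3, "path_stems": ["and"]},
    ]
    print([i["id"] for i in trigger_word_filter(toy, ["interact", "bind"])])
```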
By using heuristic CP and TW, we can already filter out a substantial part of the positively-labeled data. Using heuristic CP+TW with 50 trigger stems, 65% of the positively-labeled data can be removed for PPI. For MIRGENE and PLOC, the removal ratios are 38% and 59%, respectively. We hypothesize that the remaining set will still contain a large amount of data for training and, more importantly, that it will be of high quality, so that high-confidence patterns can be discovered from it using pattern occurrence frequency.
Finally, we turn to the last heuristic that we introduce. Recall that noisy instances in negatively-labeled data should be labeled as positive but are negatively labeled because of the incompleteness of the database used for distant supervision. We try to mine some high-confidence patterns from the purified positively-labeled set after the application of heuristic CP and TW. We define a pattern as a shortest dependency path lexicalized by a trigger stem between the entity mention pair. The pattern frequencies in the positively-labeled data filtered by heuristic CP and TW are counted. The most frequent pattern and an example sentence for each task are shown in Table 4.
Our hypothesis is that any entity mention pair connected by a high-confidence pattern is likely to be related, and hence a negatively-labeled instance of such a pair probably constitutes noise in the negatively-labeled data. Therefore, we consider the next heuristic described below.
Heuristic of high-confidence patterns (HP): remove negatively-labeled instances which match a high-confidence pattern mined from positively-labeled data.
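A minimal sketch of heuristic HP follows, assuming each instance carries a hypothetical `pattern` key: a string form of its trigger-lexicalized shortest dependency path, or None when no trigger stem lies on the path. The pattern string format and both function names are illustrative only.

```python
from collections import Counter


def high_confidence_patterns(filtered_positives, top_n=100):
    """Return the top_n most frequent patterns in the CP+TW-filtered positives."""
    counts = Counter(
        inst["pattern"]
        for inst in filtered_positives
        if inst.get("pattern") is not None
    )
    return {pattern for pattern, _ in counts.most_common(top_n)}


def hp_filter(negatives, patterns):
    """Remove negative instances that match a high-confidence pattern."""
    return [inst for inst in negatives if inst.get("pattern") not in patterns]


if __name__ == "__main__":
    frequent = "E1<nsubj<interact>prep_with>E2"
    rare = "E1<nsubj<bind>prep_to>E2"
    positives = [{"pattern": frequent}] * 3 + [{"pattern": rare}, {"pattern": None}]
    negatives = [
        {"id": 1, "pattern": frequent},
        {"id": 2, "pattern": rare},
        {"id": 3, "pattern": None},
    ]
    patterns = high_confidence_patterns(positives, top_n=1)
    print([i["id"] for i in hp_filter(negatives, patterns)])
```

Applied in sequence, the positives pass through CP and then TW before patterns are counted, so the quality of the mined patterns depends on both earlier filters.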
Note that heuristic DPFreq, CP and TW remove instances from the positively-labeled data, whereas HP is the only heuristic that removes instances from the negatively-labeled data. Heuristic TW depends on the number of trigger stems, while heuristic HP depends on both the number of trigger stems and high-confidence patterns, as it needs the trigger stems to lexicalize the shortest dependency path to form a pattern.

Results and Discussions
We use precision, recall and F-score to evaluate the baselines and proposed methods. The top 50 trigger stems were used in heuristic TW, while the top 50 trigger stems and the top 100 patterns were used in heuristic HP. The results are presented in Table 5. Specificity is also presented. We will discuss how different numbers of trigger stems and patterns may affect the results later.
Table 5 shows that the multi-instance model and the use of heuristic DPFreq or CP increased precision compared to the baseline for all three tasks, indicating that they can effectively remove noise from the positively-labeled data. Using heuristic CP+TW further improved precision over heuristic CP for the three tasks. However, using heuristic DPFreq, CP or CP+TW did not improve the F-score over the baseline for PPI and MIRGENE, due to the decreased recall. By removing noise from the negatively-labeled data using heuristic HP in addition to CP and TW, recall can be improved with minor or no decrease in precision, resulting in higher F-scores than the baseline, the MI model and other heuristics for all three tasks. This suggests that the proposed heuristics can effectively remove noise from both positively and negatively-labeled data, and that to obtain better F-scores, it is important to filter both the positive and the negative set to improve precision and recall simultaneously. Although PLOC extraction did not obtain a good precision in any of the experiments, we will show that high precision can be achieved for high-confidence PLOC extraction later in this section.
By applying heuristic CP+TW+HP, the F-score can be improved by 10 points for PPI extraction compared to Bobic et al. (2012), and 11 points for PLOC extraction compared to Zheng and Blake (2015).
Table 5: Precision, recall, F-score and specificity of all the methods for three extraction tasks.
Different numbers of trigger stems: as different numbers of trigger stems can be used in heuristic TW and HP, we investigated how they affect the performance for the three tasks. In Fig. 2 (a)-(c), precisions, recalls and F-scores are shown for applying heuristic CP+TW and CP+TW+HP (using top 100 patterns) with different numbers of trigger stems. PPI and MIRGENE extraction maintained a stable precision with increasing recall when the number of trigger stems increased. For PLOC extraction, precision decreased with increased recall when more trigger stems were used, indicating that the quality of the trigger stems can be improved. Using 100 patterns to remove noise resulted in much better recalls and F-scores for all three tasks across different numbers of trigger stems, further confirming that heuristic HP is an effective method to remove noise from the negatively-labeled data.
Different numbers of patterns: we investigated how different numbers of patterns used by heuristic HP affect the results. In Fig. 2 (d)-(f), precisions, recalls and F-scores are shown for applying CP+TW+HP (using top 50 trigger stems) with different numbers of patterns. The performances using heuristic CP+TW with 50 trigger stems are included for comparison. We can see that recall was consistently improved when more patterns were used, with minor or no decrease in precision. Compared to the results using only heuristic CP+TW, even a small number of patterns achieves better performance.
A major use case of biomedical relation extraction is to help identify high-confidence entity pairs to facilitate manual curation for databases. Thus, a desired property of a relation extractor is to achieve high precision for such high-confidence extractions. A logistic regression model outputs a probability for each test instance, and a high probability indicates high confidence that the instance is positive.
To investigate the performance of the proposed methods for the high-confidence extractions, we draw precision-recall curves using the probability produced by the logistic regression model. By default, a logistic regression model predicts an instance as positive if the probability is greater than 0.5. By varying the threshold, we can calculate precision at different recall levels. For example, when the threshold is set to 0.9, the model only predicts an instance with probability greater than 0.9 as positive. Ideally the model should achieve better precision when the threshold is high.
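The thresholding described above can be illustrated with a small self-contained sketch. The function name `precision_recall_at` and the toy probabilities are ours, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed for the example.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall when instances with prob > threshold are positive."""
    predicted = [p > threshold for p in probs]
    tp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat and y)
    fp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat and not y)
    fn = sum(1 for y_hat, y in zip(predicted, labels) if not y_hat and y)
    # Assumed convention: precision is 1.0 when no instance is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    probs = [0.95, 0.8, 0.6, 0.4]   # toy model outputs
    labels = [1, 0, 1, 1]           # toy gold labels
    for threshold in (0.5, 0.9):
        print(threshold, precision_recall_at(probs, labels, threshold))
```

Sweeping the threshold over the sorted probabilities and plotting the resulting pairs gives a precision-recall curve, from which precision at a given recall level can be read off.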
For each task, six curves are drawn in Fig. 3. We can see that using heuristic CP+TW+HP obtained higher precisions than the baselines and other heuristics on the left side of the figures, which corresponds to the performance for high-confidence extractions. The multi-instance model also obtained better precisions compared to the baseline at lower recall levels. Specifically, by using heuristic CP+TW+HP, PPI, MIRGENE and PLOC extraction can achieve the highest precisions among the six curves, which are 0.71, 0.95 and 0.77, respectively, at recall level 0.30.

Conclusion
In this paper, we proposed three novel heuristics that use lexical and syntactic information to remove noise from labeled data generated by distant supervision. Experiments showed that the proposed methods achieved significantly higher F-scores than the baseline and previous work for the three tasks, and that high precision can be obtained for high-confidence results. For future work, we plan to improve the trigger stem list by asking curators to remove non-informative stems. Aggregating evidence from all the sentences for entity pair level extraction and incorporating direct supervision (Wallace et al., 2016) are two interesting directions.