Distant Supervision for Relation Extraction beyond the Sentence Boundary

The growing demand for structured knowledge has led to great interest in relation extraction, especially in cases with limited supervision. However, existing distant-supervision approaches only extract relations expressed in single sentences. In general, cross-sentence relation extraction is under-explored, even in the supervised-learning setting. In this paper, we propose the first approach for applying distant supervision to cross-sentence relation extraction. At the core of our approach is a graph representation that can incorporate both standard dependencies and discourse relations, thus providing a unifying way to model relations within and across sentences. We extract features from multiple paths in this graph, increasing accuracy and robustness when confronted with linguistic variation and analysis errors. Experiments on an important extraction task for precision medicine show that our approach can learn an accurate cross-sentence extractor, using only a small existing knowledge base and unlabeled text from biomedical research articles. Compared to the existing distant-supervision paradigm, our approach extracted twice as many relations at similar precision, demonstrating both the prevalence of cross-sentence relations and the promise of our approach.


Introduction
The accelerating pace of technological advance and scientific discovery has led to an explosive growth in knowledge. The ensuing information overload creates new urgency in assimilating fragmented knowledge for integration and reasoning. A salient case in point is precision medicine (Bahcall, 2015). The cost of sequencing a person's genome has fallen below $1,000, enabling individualized diagnosis and treatment of complex genetic diseases such as cancer. The availability of measurements for 20,000 human genes makes it imperative to integrate all knowledge about them, which grows rapidly and is scattered across millions of articles in PubMed. Traditional extraction approaches require annotated examples, making it difficult to scale to the explosion of extraction demands. Consequently, there has been increasing interest in indirect supervision (Banko et al., 2007; Poon and Domingos, 2009; Toutanova et al., 2015), with distant supervision (Craven et al., 1998; Mintz et al., 2009) emerging as a particularly promising paradigm for augmenting existing knowledge bases from unlabeled text (Poon et al., 2015; Parikh et al., 2015).
This progress is exciting, but distant-supervision approaches have so far been limited to single sentences, thus missing out on relations crossing the sentence boundary. Consider the following example: "The p56Lck inhibitor Dasatinib was shown to enhance apoptosis induction by dexamethasone in otherwise GC-resistant CLL cells. This finding concurs with the observation by Sade showing that Notch-mediated resistance of a mouse lymphoma cell line could be overcome by inhibiting p56Lck." Together, the two sentences convey the fact that the drug Dasatinib could overcome resistance conferred by mutations to the Notch gene, which cannot be inferred from either sentence alone. The impact of such missed opportunities is especially pronounced in the long tail of knowledge. Such information is crucial for integrative reasoning, as it includes the newest findings in specialized domains.
In this paper, we present DISCREX, the first approach for distant supervision to relation extraction beyond the sentence boundary. The key idea is to adopt a document-level graph representation that augments conventional intra-sentential dependencies with new dependencies introduced for adjacent sentences and discourse relations. It provides a unifying way to derive features for classifying relations between entity pairs. As we augment this graph with new arcs, the number of possible paths between entities grows. We demonstrate that feature extraction along multiple paths leads to more robust extraction, allowing the learner to find structural patterns even when the language varies or the parser makes an error.
The cross-sentence scenario presents a new challenge in candidate selection. This motivates our concept of minimal-span candidates in Section 3.2. Excluding non-minimal candidates substantially improves classification accuracy.
There is a long line of research on discourse phenomena, including coreference (Haghighi and Klein, 2007; Poon and Domingos, 2008; Rahman and Ng, 2009; Raghunathan et al., 2010), narrative structures (Chambers and Jurafsky, 2009; Cheung et al., 2013), and rhetorical relations (Marcu, 2000). For the most part, this work has not been connected to relation extraction. Our proposed extraction framework makes it easy to integrate such discourse relations. Our experiments evaluated the impact of coreference and discourse parsing, a preliminary step toward in-depth integration with discourse research.
We conducted experiments on extracting drug-gene interactions from biomedical literature, an important task for precision medicine. By bootstrapping from a recently curated knowledge base (KB) with about 162 known interactions, our DISCREX system learned to extract inter-sentence drug-gene interactions at high precision. Cross-sentence extraction doubled the yield compared to single-sentence extraction. Overall, by applying distant supervision, we extracted about 64,000 distinct interactions from about one million PubMed Central full-text articles, attaining a two-orders-of-magnitude increase compared to the original KB.

Related Work
To the best of our knowledge, distant supervision has not previously been applied to cross-sentence relation extraction. For example, Mintz et al. (2009), who coined the term "distant supervision", aggregated features from multiple instances of the same relation triple (relation, entity1, entity2), but each instance is a sentence where the two entities co-occur. Thus their approach cannot extract relations where the two entities reside in different sentences. Similarly, Zheng et al. (2016) aggregated information from multiple sentential instances, but could not extract cross-sentence relations.
Distant supervision has also been applied to completing Wikipedia Infoboxes (Wu and Weld, 2007) and to TAC KBP Slot Filling, where the goal is to extract attributes for a given entity; these can be considered a special kind of relation triple (attribute, entity, value). Such scenarios are very different from general cross-sentence relation extraction. For example, the entity in consideration is often the protagonist of the document (the title entity of the article). Moreover, state-of-the-art methods typically consider extracting from single sentences only (Surdeanu et al., 2012; Surdeanu and Ji, 2014; Koch et al., 2014).
In general, cross-sentence relation extraction has received little attention, even in the supervised-learning setting. Among the limited amount of prior work, Swampillai & Stevenson (2011) is the most relevant to our approach, as it also considered syntactic features and introduced a dependency link between the root nodes of parse trees containing the given pair of entities. However, the differences are substantial. First and foremost, their approach used standard supervised learning rather than distant supervision. Moreover, we introduced the document-level graph representation, which is much more general, capable of incorporating a diverse set of discourse relations and enabling the use of rich syntactic and surface features (Section 3). Finally, Swampillai & Stevenson (2011) evaluated on MUC-6, which contains only 318 Wall Street Journal articles. In contrast, we evaluated on large-scale extraction from about one million full-text articles and demonstrated the large impact of cross-sentence extraction for an important real-world application.
The lack of prior work on cross-sentence relation extraction may be partially explained by the domains of focus. Prior extraction work focuses on newswire text and the Web (Craven et al., 2000). In these domains, the extracted relations often involve popular entities, for which there often exist single sentences expressing the relation (Banko et al., 2007). However, there is much less redundancy in specialized domains such as the frontiers of science and technology, where cross-sentence extraction is more likely to have a significant impact. The long-tailed characteristics of such domains also make distant supervision a natural choice for scaling up learning. This paper represents a first step toward exploring the confluence of these two directions.
Distant supervision has been extended to capture implicit reasoning, via matrix factorization or knowledge base embedding (Riedel et al., 2013; Toutanova et al., 2015; Toutanova et al., 2016). Additionally, various models have been proposed to address the noise in distant-supervision labels (Hoffmann et al., 2011; Surdeanu et al., 2012). These directions are orthogonal to cross-sentence extraction, and incorporating them will be interesting future work.
The idea of leveraging graph representations has been explored in many other settings, such as knowledge base completion (Lao et al., 2011; Gardner and Mitchell, 2015), frame-semantic parsing (Das and Smith, 2011), and other NLP tasks (Radev and Mihalcea, 2008; Subramanya et al., 2010). Linear and dependency paths are popular features for relation extraction (Snow et al., 2006; Mintz et al., 2009). However, past extraction work focuses on single sentences, and typically considers the shortest path only. In contrast, we allow interleaving edges from dependency and word adjacency, and consider the top K paths rather than just the shortest one. This resulted in substantial accuracy gains (Section 4.5).
There has been prior work on leveraging coreference in relation extraction, often in the standard supervised setting (Hajishirzi et al., 2013; Durrett and Klein, 2014), but also in distant supervision (Koch et al., 2014; Augenstein et al., 2016). Notably, while Koch et al. (2014) and Augenstein et al. (2016) still learned to extract from single sentences, they augmented mentions with coreferent expressions to include linked entities that might be in a different sentence. We explored the potential of this approach in our experiments, but found that it had little impact in our domain, as it produced few additional candidates beyond single sentences. Recently, discourse parsing has received renewed interest (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Surdeanu et al., 2015), and discourse information has been shown to improve performance in applications such as question answering (Sharp et al., 2015). In this paper, we generated coreference relations using the state-of-the-art Stanford coreference systems (Lee et al., 2011; Recasens et al., 2013; Clark and Manning, 2015), and generated rhetorical relations using the winning approach (Wang and Lan, 2015) in the CoNLL-2015 Shared Task on Discourse Parsing.

Distant Supervision for Cross-Sentence Relation Extraction
In this section, we present DISCREX, short for DIstant Supervision for Cross-sentence Relation EXtraction. Similar to conventional approaches, DISCREX learns a classifier to predict the relation between two entities, given text spans where the entities co-occur. Unlike most existing methods, however, DISCREX allows text spans comprising multiple sentences and explores potentially many paths between these entities.

Distant Supervision
Like prior approaches, DISCREX learns from an existing knowledge base (KB) and unlabeled text.
The KB contains known instances for the given relation. In a preprocessing step, relevant entities are annotated within this text using available entity extraction tools.

Figure 1: An example document graph. Edges represent conventional intra-sentential dependencies, as well as a connection between the roots of adjacent sentences (NEXTSENT). For simplicity, edges between adjacent words and edges representing discourse relations are omitted.

Minimal-Span Candidates
In standard distant supervision, co-occurring entity pairs with known relations are enlisted as candidate positive training examples. This is reasonable when the entity pairs are within single sentences. In the cross-sentence scenario, however, it risks introducing too many wrong examples. Consider the following two sentences: "Since amuvatinib inhibits KIT, we validated MET kinase inhibition as the primary cause of cell death. Additionally, imatinib is known to inhibit KIT." The mentions of the drug-gene pair imatinib and KIT (in bold) span the two sentences, but the same pair also co-occurs in the second sentence alone. In general, one might find a co-occurring entity pair in a large text span where the same pair also co-occurs in a smaller text span that overlaps with the larger one. In such cases, if there is a relation between the pair, it is most likely expressed in the smaller text span, where the entities are closer to each other.
This motivates us to define that a co-occurring entity pair has minimal span if there does not exist another overlapping co-occurrence of the same pair where the distance between the entity mentions is smaller. Here, the distance is measured in the number of consecutive sentences between the two entities. Experimentally, we compared extraction with and without the restriction to minimal-span candidates, and found that the former led to much higher extraction accuracy.
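To make the definition concrete, the minimal-span filter can be sketched as below. The candidate representation (entity names plus sentence indices) and the function name are illustrative simplifications, not part of our actual pipeline.

```python
def minimal_span_pairs(candidates):
    """Keep only minimal-span co-occurrences of each entity pair.

    `candidates` is a list of (drug, gene, sent_drug, sent_gene) tuples,
    where sent_* are sentence indices.  A candidate is dropped when another
    co-occurrence of the same pair overlaps it with a smaller distance
    (measured in sentences) between the two entity mentions.
    """
    def span(c):
        return (min(c[2], c[3]), max(c[2], c[3]))

    def overlaps(a, b):
        (s1, e1), (s2, e2) = span(a), span(b)
        return s1 <= e2 and s2 <= e1

    def dist(c):
        return abs(c[2] - c[3])

    keep = []
    for c in candidates:
        rivals = [r for r in candidates
                  if r is not c
                  and (r[0], r[1]) == (c[0], c[1])
                  and overlaps(r, c)]
        if all(dist(c) <= dist(r) for r in rivals):
            keep.append(c)
    return keep
```

On the imatinib/KIT example above, the cross-sentence candidate is dropped because the same pair co-occurs within the overlapping second sentence at a smaller distance.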

Document Graph
To derive features for entity pairs both within and across sentences, DISCREX introduces a document graph with nodes representing words and edges representing intra- and inter-sentential relations such as dependency, adjacency, and discourse relations. Figure 1 shows an example document graph spanning two sentences. Each node is labeled with its lexical item, lemma, and part-of-speech tag. We used a conventional set of intra-sentential edges: typed, collapsed Stanford dependencies derived from syntactic parses (de Marneffe et al., 2006). To mitigate parser errors, we also add edges between adjacent words.
As for inter-sentential edges, a simple but intuitive approach is to add an edge between the dependency roots of adjacent sentences: if we imagine that each sentence participates as a node in a kind of discourse dependency tree, this represents a simple right-branching baseline. To gather a finer-grained representation of rhetorical structure, we ran a state-of-the-art discourse parser (Wang and Lan, 2015) to identify discourse relations, which returned a set of labeled binary relations between spans of words. We found the shortest path between any word in the first span and any word in the second span using only dependency and adjacent-sentence edges, and added an edge labeled with the discourse relation between these two words. Another source of potentially cross-sentence links comes from coreference. We generated coreference relations using the Stanford coreference systems (both statistical and deterministic) (Lee et al., 2011; Recasens et al., 2013; Clark and Manning, 2015), and added edges from anaphora to their antecedents.
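The basic construction can be sketched as follows. The input format and edge bookkeeping are hypothetical simplifications: a real implementation would use a multigraph that keeps parallel edges between the same node pair, and would also add discourse and coreference edges as described above.

```python
def build_document_graph(sentences, weight_adj=16):
    """Sketch of the document graph: nodes are (sent_idx, tok_idx) pairs.

    `sentences` is a list of dicts with keys:
      "tokens": list of words
      "deps":   list of (head_idx, dep_idx, label) dependencies
      "root":   index of the dependency root
    Dependency and NEXTSENT edges get weight 1; adjacent-word edges are
    penalized with a higher weight (16 in our experiments).
    """
    edges = {}  # (node, node) -> (label, weight); stored undirected

    def add(u, v, label, w):
        # keep the lowest-weight edge between a node pair
        # (a multigraph would instead keep all parallel edges)
        if (u, v) not in edges or w < edges[(u, v)][1]:
            edges[(u, v)] = (label, w)
            edges[(v, u)] = (label, w)

    for s, sent in enumerate(sentences):
        # intra-sentential syntactic dependency edges
        for head, dep, label in sent["deps"]:
            add((s, head), (s, dep), label, 1)
        # adjacent-word edges, penalized relative to dependencies
        for i in range(len(sent["tokens"]) - 1):
            add((s, i), (s, i + 1), "ADJWORD", weight_adj)
        # link the dependency roots of adjacent sentences
        if s + 1 < len(sentences):
            add((s, sent["root"]),
                (s + 1, sentences[s + 1]["root"]), "NEXTSENT", 1)
    return edges
```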
We also considered a special case of cross-sentence relation extraction by augmenting single-sentence candidates with coreference (Koch et al., 2014; Augenstein et al., 2016). Namely, extraction is still conducted within single sentences, yet entity linking is extended to consider all coreferent mentions of a relation argument. However, this did not produce significantly more candidates (2% more for positive examples), and most of the additional candidates were not cross-sentence ones (only 1% were).

Features
Dependency paths have been established as a particularly effective source of features for relation extraction (Mintz et al., 2009). DISCREX generalizes this idea by defining feature templates over paths in the document graph, which may contain interleaving edges of various types (dependency, word and sentence adjacency, and discourse relations). Dependency paths provide interpretable and generalizable features but are subject to parser error. One mitigation strategy is to add edges between adjacent words, allowing multiple paths between entities.
Feature extraction begins with a pair of entities in the document graph that are potentially connected by a relation. We find a path between the entities of interest and extract features from that path.
Over each such path, we explore a number of different features. Below, we assume that each path is a sequence of nodes and edges (n_1, e_1, n_2, ..., e_{L-1}, n_L), with n_1 and n_L replaced by special entity marker nodes.

Whole-path features. We extract four binary indicator features for each whole path, with nodes n_i represented by their lexical item, lemma, part-of-speech tag, or nothing. These act as high-precision but low-recall indicators of useful paths.
Path n-gram features. A more robust and generalizable approach is to consider a sliding window along each path. For each position i, we extract n-gram (n = 1-5) features starting at each node (n_i, then n_i · e_i, and so on, up to n-grams of five path elements). Again, each node can be represented by its lexical item, lemma, or part-of-speech tag, leading to 27 feature templates. We add three more feature templates using only edge labels (e_i; e_i · e_{i+1}; and e_i · e_{i+1} · e_{i+2}), for a total of 30 feature templates.
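As an illustration, the path n-gram templates might be realized as below. The feature-name format and the node encoding are invented for this sketch and do not exactly match our feature templates.

```python
def path_ngram_features(path, max_items=5):
    """Sketch of path n-gram features (feature names are illustrative).

    `path` alternates nodes and edge labels: [n1, e1, n2, ..., eL-1, nL],
    where each node is a dict with keys "lex", "lemma", "pos".  For each
    starting node we emit n-grams of 1..max_items path elements, with
    nodes rendered as lexical item, lemma, or POS tag.
    """
    feats = set()
    for rep in ("lex", "lemma", "pos"):
        # render nodes under this representation; edge labels pass through
        items = [p[rep] if isinstance(p, dict) else p for p in path]
        for start in range(0, len(items), 2):   # even positions hold nodes
            for n in range(1, max_items + 1):
                if start + n <= len(items):
                    feats.add("%s:%s" % (rep, "_".join(items[start:start + n])))
    # edge-label-only n-grams (lengths 1-3)
    edge_labels = [p for p in path if not isinstance(p, dict)]
    for n in (1, 2, 3):
        for i in range(len(edge_labels) - n + 1):
            feats.add("edges:%s" % "_".join(edge_labels[i:i + n]))
    return feats
```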

Multiple paths
Most prior work has only looked at the single shortest path between two entities. When authors use consistent lexical and syntactic constructions, and when the parser finds the correct parse, this approach works well. Real data, however, is quite noisy.
One way to mitigate errors and be robust against noise is to consider multiple possible paths. Given a document graph with arcs of multiple types, there are often multiple paths between nodes. For instance, we might navigate from the gene to the drug using only syntactic arcs, or only adjacency arcs, or some combination of the two. Considering such variations gives more opportunities to find commonalities between seemingly disparate language.
We explore varying the number of shortest paths, N, considered between the nodes in the document graph corresponding to the relevant entities. By default, all edge types have an equal weight of 1, except edges between adjacent words. Empirically, penalizing adjacency edges led to substantial benefits, though including adjacency arcs was important for obtaining benefits from multiple paths. This suggests that the parser produces valuable information, but that we should have a back-off strategy to accommodate parser errors.
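A minimal sketch of enumerating the top-N weighted paths follows. Production systems often use Yen's algorithm; this best-first variant, which enumerates loop-free paths in cost order, is only an illustration, and the graph format is an assumption of the sketch.

```python
import heapq
import itertools

def k_shortest_paths(edges, src, dst, k=3):
    """Best-first enumeration of the k lowest-cost loop-free paths.

    `edges` maps node -> list of (neighbor, edge_label, weight).
    Returns up to k (cost, [node, label, node, ...]) results in cost order.
    """
    counter = itertools.count()           # tie-breaker for the heap
    heap = [(0, next(counter), [src])]
    results = []
    while heap and len(results) < k:
        cost, _, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            results.append((cost, path))
            continue
        for nbr, label, w in edges.get(node, []):
            if nbr not in path[::2]:      # even positions hold nodes
                heapq.heappush(heap, (cost + w, next(counter),
                                      path + [label, nbr]))
    return results
```

With a penalized adjacency edge (weight 16), the cheaper dependency route is returned first, and the adjacency route serves as a back-off.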

Evaluation
There is no gold annotated dataset in distant supervision, so evaluation typically resorts to two strategies. One strategy uses held-out samples from the training dataset, essentially treating the noisy annotation as gold standard. This has the advantage of being automatic, but could produce biased results due to false negatives (i.e., entity pairs not known to have the relation might actually have the relation). Another strategy reports absolute recall (number of extractions from all unlabeled text), as well as estimated precision obtained by manually annotating extraction samples from general text. We conducted both types of evaluation in the experiments.

Experiments
We consider the task of extracting drug-gene interactions from biomedical literature. A drug-gene interaction is broadly construed as an association between drug efficacy and gene status. The status includes mutations and activity measurements (e.g., overexpression). For simplicity, we only consider the relation at the drug-gene level, without distinguishing among details such as drug dosage or distinct gene status.

Knowledge Base
We used the Gene Drug Knowledge Database (GDKD) (Dienstmann et al., 2015) for distant supervision. Figure 2 shows a snapshot of the dataset. Each row specifies a gene, some drugs, the fine-grained relations (e.g., sensitive), the gene status (e.g., mutation), and some supporting article IDs. In this paper, we only consider the coarse drug-gene association and ignore the other fields.

Unlabeled Text
We obtained biomedical literature from PubMed Central, which as of early 2015 contained about 960,000 full-text articles. We preprocessed the text using SPLAT (Quirk et al., 2012) to conduct tokenization, part-of-speech tagging, and syntactic parsing, and obtained Stanford dependencies (de Marneffe et al., 2006) using Stanford CoreNLP (Manning et al., 2014). We used the entity taggers from Literome (Poon et al., 2014) to identify drug and gene mentions.

Candidate Selection
To avoid unlikely candidates, such as entity pairs far apart in the document, we consider entity pairs within K consecutive sentences. K = 1 corresponds to extraction within single sentences. For cross-sentence extraction, we chose K = 3, as it doubled the number of overall candidates while being small enough not to introduce too many unlikely ones. Table 1 shows the statistics of drug-gene interaction candidates identified in PubMed Central articles. For K = 3, there are 87,773 instances for which the drug-gene pair has known associations in the Gene Drug Knowledge Database (GDKD); these are used as positive training examples. Note that they include only minimal-span candidates (Section 3.2). Without the restriction, there are 225,520 instances matching GDKD, though many are likely false positives.

Classifier
Our classifiers were binary logistic regression models, trained to optimize log-likelihood with an L2 regularizer. We used a weight of 1 for the regularizer; the results were not very sensitive to the specific value. Parameters were optimized using L-BFGS (Nocedal and Wright, 2006). Rather than explicitly mapping each feature to its own dimension, we hashed the feature names and retained 22 bits (Weinberger et al., 2009). The resulting space of approximately 4 million possible features seemed to suffice for our problem: fewer bits produced degradations, but more bits did not lead to improvements.
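The hashing step can be sketched as follows. The choice of CRC32 is illustrative (any well-mixed hash function works); the resulting sparse count vectors would then be fed to a standard L2-regularized logistic regression trained with L-BFGS.

```python
import zlib

NUM_BITS = 22
DIM = 1 << NUM_BITS  # about 4.2 million dimensions

def hash_features(feature_names):
    """Map feature-name strings into a 2^22-dimensional sparse count
    vector via the hashing trick (CRC32 used here for illustration)."""
    vec = {}
    for name in feature_names:
        # mask to the low 22 bits of the hash to pick a bucket
        idx = zlib.crc32(name.encode("utf8")) & (DIM - 1)
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec
```

Collisions are tolerated: distinct features occasionally share a bucket, which trades a small amount of accuracy for bounded memory.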

Automatic Evaluation
To evaluate the impact of features, we conducted five-fold cross-validation, by treating the positive
and negative examples from distant supervision as gold annotation. To avoid train-test contamination, all instances from a document are assigned to the same fold. We then evaluated the average test performance across folds. Since our datasets were balanced by design (Section 3.1), we simply report accuracy. As discussed before, the results could be biased by noise in the annotation, but this automatic evaluation enables an efficient comparison of various design choices.

First, we investigated the impact of edge types and path number. We set the weight for adjacent-word edges to 16, to give higher priority to other edge types (weight 1) that are arguably more semantics-related. Table 2 shows the average test accuracy for single-sentence and cross-sentence extraction with various edge types and path numbers. Compared to extraction within single sentences, cross-sentence extraction attains a similar accuracy, even though the recall for the latter is much higher (Table 1).
Adding more paths beyond the shortest one led to a substantial improvement in accuracy. The gain is consistent for both single-sentence and cross-sentence extraction. This is surprising, as prior methods often derive features from the shortest dependency path alone.
Adding discourse relations, on the other hand, consistently led to a small drop in performance, especially when the path number is small. Upon manual inspection, we found that Stanford Coreference made many errors in biomedical text, such as resolving a dummy pronoun to a nearby entity. In hindsight, this is probably not surprising: state-of-the-art coreference systems are optimized for the newswire domain and could be ill-suited for scientific literature (Bell et al., 2016). We are less certain about why discourse parsing didn't seem to help. There are clearly examples where extraction errors could have been avoided given rhetorical relations (e.g., when the sentence containing the second entity starts a new topic). We leave more in-depth investigation to future work.
Next, we further evaluated the impact of path number and adjacency edge weight. Only dependency and adjacency edges were included in these experiments. Table 3 shows the results. Penalizing adjacency produces large gains; a harsh penalty is particularly helpful with fewer paths. These results support the hypothesis that dependency edges are usually more meaningful for relation extraction than word adjacency: if adjacency edges get the same weight, they may cause some dependency sub-paths to drop out of the top K paths, thus degrading performance. As the path number increases, there is a consistent and substantial increase in accuracy, which demonstrates the advantage of allowing adjacency edges to interleave with dependency ones. This presumably helps address syntactic parsing errors, among other things. The importance of adjacency weights decreases with more paths, but it remains significantly better to penalize adjacency edges.
In the experiments mentioned above, cross-sentence extraction was conducted using minimal-span candidates only. We expected that this would provide a reasonable safeguard to filter out many unlikely candidates. As empirical validation, we also conducted experiments on cross-sentence extraction without the minimal-span restriction, using the base model. Test accuracy dropped sharply from 81.7% to 79.1% (not shown in the table).

PubMed-Scale Extraction
Our ultimate goal is to extract knowledge from all available text. First, we retrained DISCREX on all available distant-supervision data, without restricting to a subset of the folds as in the automatic evaluation. We used the systems performing best in automatic evaluation, with features derived from the 30 shortest paths between each entity pair, and minimal-span candidates within three sentences for cross-sentence extraction. We then applied the learned extractors to all PubMed Central articles and grouped the extracted instances into unique drug-gene pairs. The classifier outputs a probability for each instance; the maximum probability of the instances in a group is assigned to the relation as a whole. Table 4 shows the statistics of extracted relations at varying probability thresholds. Cross-sentence extraction obtained far more unique relations than single-sentence extraction, improving absolute recall by 89-102%. Table 5 compares the numbers of unique genes and drugs. DISCREX extractions cover far more genes and drugs than GDKD, which bodes well for applications in precision medicine.
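The max-probability grouping step is straightforward; a sketch (with an invented instance format) follows.

```python
from collections import defaultdict

def aggregate_relations(instances, threshold=0.5):
    """Collapse instance-level predictions into unique drug-gene pairs,
    assigning each pair the maximum probability over its instances and
    keeping pairs above the threshold."""
    best = defaultdict(float)
    for drug, gene, prob in instances:
        best[(drug, gene)] = max(best[(drug, gene)], prob)
    return {pair: p for pair, p in best.items() if p >= threshold}
```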

Manual Evaluation
Automatic evaluation accuracies can be overly optimistic. To assess the true precision of DISCREX, we also conducted manual evaluation on extracted relations. Based on the automatic evaluation, accuracy is similar for single-sentence and cross-sentence extraction, so we focused on the latter. We randomly sampled extracted relation instances and asked two researchers knowledgeable in precision medicine to evaluate their correctness. For each instance, the annotators were provided with the provenance sentences in which the drug-gene pair was highlighted. The annotators assessed in each case whether some relation was mentioned for the given pair.
A total of 450 instances were judged: 150 sampled randomly from all candidates (random baseline), 150 from the set of instances with probability no less than 0.5, and 150 with probability no less than 0.9. From each set, we randomly selected 50 relations for review by both annotators. The two annotators agreed on 133 of 150. After review, all disagreements were resolved, and each annotator judged an additional set of 50 relation instances, this time without overlap.
Table 6 shows the sample precision and the percentage of errors due to entity linking vs. relation extraction. With either classification threshold, cross-sentence extraction clearly outperformed the random baseline by a wide margin. Not surprisingly, the higher threshold of 0.9 led to higher precision. Interestingly, a significant portion of errors stems from mistakes in entity linking, as has been observed in prior work (Poon et al., 2015). Improved entity linking, either alone or jointly with relation extraction, is an important future direction. Based on these estimates, DISCREX extracted about 37,000 correct unique interactions at the threshold of 0.5, and about 20,000 at the threshold of 0.9. In both cases, it expanded the Gene Drug Knowledge Database by two orders of magnitude.

Single-sentence extractions
We also performed manual evaluation in the single-sentence setting. As in the automatic evaluation, single-sentence precisions are similar, though slightly higher at all thresholds. This suggests that the candidate set is cleaner and the resulting predictions are more accurate. However, the resulting recall is substantially lower, dropping by 46% at a threshold of 0.5 and by 40% at a threshold of 0.9.

Conclusion
We present the first approach for applying distant supervision to cross-sentence relation extraction, by adopting a document-level graph representation that incorporates both intra-sentential dependencies and inter-sentential relations such as adjacency and discourse relations. We conducted both automatic and manual evaluation on extracting drug-gene interactions from biomedical literature. With cross-sentence extraction, our DISCREX system doubled the yield of unique interactions while maintaining the same accuracy. Using distant supervision, DISCREX improved the coverage of the Gene Drug Knowledge Database (GDKD) by two orders of magnitude, without requiring annotated examples.
Future work includes: further exploration of features; improved integration with coreference and discourse parsing; combining distant supervision with active learning and crowdsourcing; evaluating the impact of extractions on precision medicine; and applications to other domains.

Figure 2 :
Figure 2: Sample rows from the Gene Drug Knowledge Database. Our current work focuses on two important columns: gene, and therapeutic context (drug).

Table 2 :
Average test accuracy in five-fold cross-validation. Cross-sentence extraction was conducted within a sliding window of 3 sentences using minimal-span candidates. Base used only the shortest path to construct features. 3 paths and 10 paths gathered features from the top three or ten shortest paths, assigning uniform weights to all edges except adjacency, which had a weight of 16. +coref adds edges for the relations predicted by Stanford Coreference. +disc adds edges for the rhetorical relations predicted by a state-of-the-art discourse parser (Wang and Lan, 2015).

Table 5 :
Numbers of unique genes and drugs in the Gene Drug Knowledge Database (GDKD) vs. DISCREX extractions.

Table 6 :
Sample precision and error percentage: comparison between the single-sentence and cross-sentence extraction models at various thresholds. Single-sentence extraction is slightly better at all thresholds, at the expense of substantially lower recall: a reduction of 40% or more in terms of unique interactions.