Scientific Discovery as Link Prediction in Influence and Citation Graphs

We introduce a machine learning approach for the identification of “white spaces” in scientific knowledge. Our approach addresses this task as link prediction over a graph that contains over 2M influence statements such as “CTCF activates FOXA1”, which were automatically extracted using open-domain machine reading. We model this prediction task using graph-based features extracted from the above influence graph, as well as from a citation graph that captures scientific communities. We evaluated the proposed approach through backtesting. Although the data is heavily unbalanced (50 times more negative examples than positives), our approach predicts which influence links will be discovered in the “near future” with an F1 score of 27 points, and a mean average precision of 68%.


Introduction
The amount of scientific knowledge that is publicly available has increased dramatically in the past few years. For example, PubMed (http://www.ncbi.nlm.nih.gov/pubmed), a search engine of biomedical publications, now indexes over 25 million papers, 17 million of which were published between 1990 and the present. This information overload yields two critical problems. First, it exceeds the human capacity to aggregate and interpret the fragments of knowledge published in these papers, which may result in existing solutions to critical problems being overlooked. Swanson (1986) described this problem as "undiscovered public knowledge". Second, this vast amount of available information complicates the identification of "white spaces" in science, i.e., topics that are insufficiently studied and may lead to important scientific discoveries.
While the first problem has been addressed recently with efforts that combine machine reading and assembly with existing data analysis algorithms (Poon et al., 2015, inter alia), the second problem is largely unstudied.
In this work we propose a first enabling step towards addressing the problem of white space discovery from literature (Sebastian et al., 2017; Cameron, 2014) with an approach inspired by the field of link prediction (Liben-Nowell and Kleinberg, 2007; Leskovec et al., 2010). In particular, our method operates over two graphs: (a) a graph of positive/negative influence relations such as the relation "CTCF activates FOXA1" between the two proteins, which were extracted using an existing, open-domain machine reading tool from over 100K biomedical publications; and (b) the citation graph between the corresponding papers where these findings were published. The proposed approach approximates the task of white space discovery by predicting new influence relations that do not exist in the influence graph at a given time (hence the white space) but will emerge in the future (thus somebody identified the missing knowledge as important).
The contributions of this work are: (1) We propose a novel machine learning (ML) framework for this prediction task that uses features extracted from both the influence graph (e.g., the connectivity of relevant concepts in the graph) and the citation graph (e.g., the affinity between related influence relations measured by membership to communities in citation space).
(2) We evaluate the proposed method on an influence graph extracted from over 100K papers, which contains 1,564,748 concepts (e.g., "astrocytes", "proinflammatory cytokines") and 2,395,944 influence relations (e.g., "VEGF increases Akt"). Our method obtains an F1 score of 27 points and a mean average precision of 68%. This considerably outperforms methods that extract features from only one of the two graphs.
(3) To promote future work on this topic, we release a dataset containing both the influence and the citation graphs used in this paper, available at: https://github.com/clulab/releases/tree/master/textgraphs2018-discovery.

Data
As mentioned, the primary graph this method operates on is a graph of influence relations extracted from a corpus of 119,667 PubMed Open Access publications. These papers were previously selected to be relevant to the topic of children's health, which spans multiple domains, and includes issues such as stunting, wasting, and malnutrition. All these papers were processed using the machine reading and assembly software of Hahn-Powell et al. In order to address the multidomain nature of children's health, Hahn-Powell et al. followed the OpenIE-style approach of Banko et al. (2007) for entity extraction by considering expanded noun phrases as a coarse approximation of the concepts relevant to the topic. For event extraction, the authors adapted a subset of REACH grammars from the biomolecular domain that capture influence statements (e.g., positive and negative regulations). The adaptation removed selectional restrictions on the arguments of each event predicate. That is, they extract any lexicalized variation of "A causes B" where A and B are concepts identified in the entity extraction step. For example, when processing the sentence "Chronic infection may lead to malnutrition and malabsorption", the system extracts the following entities: "Chronic infection", "malnutrition", and "malabsorption". In this particular case, the extracted entities participate in two promotes relations: the first between "Chronic infection" and "malnutrition", and the second between "Chronic infection" and "malabsorption".
This machine reading approach was used to read the entire content of these publications (including abstract and body of paper). To reduce noise we kept only relations extracted at least twice, and which occur between concepts with an inverse document frequency (IDF) larger than 1. The resulting influence graph (IG) contains 1,564,748 distinct nodes, connected by 2,395,944 influence relations.
Each match to these rules produces a directed influence relation that encodes polarity (i.e., increase or decrease). Finally, the relation instances are consolidated through a conservative deduplication procedure.
Because the hypotheses studied in these publications are generally expressed using causal language, we believe this influence graph (IG) captures the essence of the scientific knowledge in this domain.
The above IG is accompanied by a citation graph (CG), which contains the outgoing citations from the above papers, for a total of 5,523,759 citation links.
We model the discovery of important white spaces in this knowledge base as link prediction: we predict which influence links will be added to the IG after time t, using only information available before time t. We believe this is a reasonable approximation for the discovery of important white spaces in scientific knowledge: influence links that will be added in the future indicate that somebody identified the missing information as important enough to be studied and published. To limit the search space, in this work we focus on the prediction of A → C influence links when A → B and B → C exist in the graph before time t, for at least one node B. Figure 1 visualizes this procedure. Note that transitivity cannot be assumed to hold in this graph, because influence relations extracted from text usually oversimplify complex causal processes. We show in Section 4 that relying on this transitivity assumption leads to poor predictions.
We implemented the above task through backtesting. That is, we look at an arbitrary point in time in the past (t), and create positive training examples from A → C links that were added to the IG after t. Similarly, we create negative examples from A → C links that do not appear between time t and the present. Note that these negative examples are an approximation: some of these may correspond to discoveries that will be published at future dates that are beyond the coverage of our dataset. In this paper we used t = 2012. This choice of t is justified in Table 1. The dataset was split into training/development/testing partitions as indicated in Table 2, using a 60-20-20% split. During the partitioning, we made sure that identical influence links coming from different papers are all allocated to the same partition.
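The backtesting setup described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_candidates`, the `(source, target, year)` edge format, and the convention that "before t" means `year < t` are all hypothetical choices made for the example.

```python
from collections import defaultdict

def build_candidates(edges, t):
    """Label candidate A -> C links for backtesting at time t.

    edges: iterable of (source, target, year) influence links (assumed format).
    Returns {(A, C): label}, where label is True if A -> C first appears
    at or after t (an assumed reading of "after time t"), else False.
    """
    # Earliest year in which each directed link was reported.
    first_seen = {}
    for src, dst, year in edges:
        key = (src, dst)
        if key not in first_seen or year < first_seen[key]:
            first_seen[key] = year

    # The influence graph as it was known strictly before t.
    successors = defaultdict(set)
    for (src, dst), year in first_seen.items():
        if year < t:
            successors[src].add(dst)

    candidates = {}
    for a, mids in list(successors.items()):
        for b in mids:
            for c in successors.get(b, ()):
                # Skip self-loops and links that were already known before t.
                if c == a or c in successors[a]:
                    continue
                # Any remaining link present in the data was added at or after t.
                candidates[(a, c)] = (a, c) in first_seen
    return candidates
```

For instance, with links A → B (2010), B → C (2011), B → D (2009), and A → C (2014), backtesting at t = 2012 yields A → C as a positive example and A → D as a negative one.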

Approach
We model link prediction as i.i.d. classification on the above dataset, exploring multiple classifiers in Section 4. One key contribution of this work is the feature set used by these classifiers, which is summarized in Table 3. At a high level, these features capture the connectivity of both the IG and CG around a candidate link, under the assumption that the more connected the corresponding graph is around A and C, the more likely it is that the link A → C will be discovered in the near future. In particular, from the IG we extract the in- and out-degrees of the source/destination nodes, and statistics from the path(s) connecting the two nodes, such as the length of the shortest path connecting the source and destination nodes, or the inverse document frequency (IDF) scores of the nodes on these paths.
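As a rough illustration of the IG features just described, the sketch below computes node degrees and the length of the shortest A-to-C path that avoids the intermediate node B, returning 0 when no such path exists (the convention stated in Table 3). The function names and the `(source, target)` edge format are assumptions made for this example, and the code assumes the link A → C itself is not yet in the graph.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Build successor and predecessor maps from (source, target) pairs."""
    successors, predecessors = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)
    return successors, predecessors

def shortest_path_length(successors, a, c, b):
    """Breadth-first search from a to c that never visits b; 0 if unreachable."""
    queue = deque([(a, 0)])
    visited = {a, b}  # marking b as visited excludes it from every path
    while queue:
        node, dist = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt == c:
                return dist + 1
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return 0

def ig_features(edges, a, b, c):
    """Degree and shortest-path features for the candidate link a -> c."""
    successors, predecessors = build_adjacency(edges)
    return {
        "CA.outdegree": len(successors.get(a, ())),
        "CA.indegree": len(predecessors.get(a, ())),
        "CC.outdegree": len(successors.get(c, ())),
        "CC.indegree": len(predecessors.get(c, ())),
        "shortest path length": shortest_path_length(successors, a, c, b),
    }
```

Breadth-first search is used here because the edges are unweighted, so the first time C is reached gives the shortest path length.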
From the CG, we derive features based on the probabilities that papers containing A → B (p_{A→B}) and B → C (p_{B→C}) belong to the same community or communities, motivated by the idea that discoveries are easier to make if the individual fragments that form the puzzle (A → B and B → C here) come from the same or related discipline(s). We model the probability that two papers, p1 and p2, belong to the same community, P(p1, p2), using two configurations of the Coda community detection algorithm (Yang et al., 2014): one that detects 100 communities, and another that detects 300. Because influence links may be reported in more than one paper, we derived the max/min/avg P(p_{A→B}, p_{B→C}) features, which are computed across all possible combinations of papers p_{A→B} and p_{B→C}.
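The aggregation over paper pairs can be sketched as follows. How P(p1, p2) is derived from the Coda output is not specified here, so the example takes it as a caller-supplied function; `same_community_prob`, the function names, and the 0.0 fallback for empty inputs are hypothetical choices for illustration. The Jaccard feature listed in Table 3 is included because it operates on the same two paper sets.

```python
from itertools import product
from statistics import mean

def community_features(papers_ab, papers_bc, same_community_prob):
    """max/min/avg of P(p1, p2) over all pairs of papers.

    papers_ab: papers that report A -> B; papers_bc: papers that report B -> C.
    same_community_prob: hypothetical callable returning P(p1, p2).
    """
    probs = [same_community_prob(p1, p2)
             for p1, p2 in product(papers_ab, papers_bc)]
    if not probs:  # assumed fallback when either paper set is empty
        return {"max P": 0.0, "min P": 0.0, "avg P": 0.0}
    return {"max P": max(probs), "min P": min(probs), "avg P": mean(probs)}

def jaccard(papers_ab, papers_bc):
    """Jaccard similarity between the two sets of papers."""
    a, b = set(papers_ab), set(papers_bc)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Taking the probability function as a parameter keeps the aggregation independent of the community detection configuration, so the same code could serve both the 100-community and 300-community settings.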
Lastly, we add a series of features (bottom part of Table 3) extracted from the collection of biomedical publications used in these experiments, such as IDF scores of the relevant concept nodes and the counts for the number of papers that mention a given influence link.

Table 3:

| Feature Name | Description |
| --- | --- |
| CA.outdegree | Out-degree of source concept node A, i.e., number of influence relations starting at A |
| CA.indegree | In-degree of source concept node A, i.e., number of influence relations ending at A |
| CC.outdegree | Out-degree of destination concept node C |
| CC.indegree | In-degree of destination concept node C |
| Cinbetween.outdegree | Out-degree of nodes in all the shortest paths that connect A to C but do not pass through B |
| Cinbetween.indegree | In-degree of nodes in all the shortest paths that connect A to C but do not pass through B |
| shortest path length | The length of the shortest path that connects A to C but does not pass through B; 0 if no such path exists |
| shortest path count | The number of shortest paths that connect A to C but do not pass through B |
| Cinbetween.avg-idf | Average inverse document frequency (IDF) of nodes in-between A and C in all the above shortest paths |
| rinbetween.avg-seen | Average number of papers containing an edge in the above shortest paths |
| max P(p_{A→B}, p_{B→C}) | Maximum probability of papers p_{A→B} and p_{B→C} being related based on their membership to multi-communities detected by the Coda algorithm; p_r refers to any paper that contains influence relation r |
| min P(p_{A→B}, p_{B→C}) | Minimum probability of papers p_{A→B} and p_{B→C} being related based on their membership to multi-communities detected by the Coda algorithm |
| avg P(p_{A→B}, p_{B→C}) | Average probability of papers p_{A→B} and p_{B→C} being related based on their membership to multi-communities detected by the Coda algorithm |
| Jaccard(p_{A→B}, p_{B→C}) | Jaccard similarity between the set of papers that contain the link A → B (p_{A→B}) and the set of papers that contain B → C (p_{B→C}) |
| Inter-citation ratio | The number of citations between the two sets p_{A→B} and p_{B→C}, normalized by the size of the union of the two sets |
| CA.idf | IDF of source concept node A |

Table 4 lists the results of several classifiers on the test partition, compared against two baselines. The first baseline randomly creates positive links following the distribution of positive examples from the training partition. The second baseline assumes that all candidate links are positive, i.e., candidate A → C is always correct if A → B and B → C exist for some B.
This table yields several observations. First, the performance of the first (random) baseline is very low, indicating that this is indeed a hard task that is exacerbated by the biased label distribution. Second, the precision of the second baseline is also very low, confirming that the transitive closure assumption is not supported on this realistic influence graph. Third, all classifiers considerably outperform the baseline, indicating that capturing the structure of the IG and CG is indeed indicative of the likelihood that an influence link will be discovered in the near future. Fourth, a linear support vector machine (SVM) classifier did not converge on this data, indicating that, while it is possible to learn a model for this link prediction task, the resulting model is more complex than a linear function. All in all, the best non-linear model (AdaBoost) obtained an F1 score of 0.27, an order of magnitude higher than the baseline, and a mean average precision (MAP) of 0.68, indicating that most correct predictions are ranked closer to the top.
Table 4: Unranked scores (precision, recall, and F1) and ranked scores (precision at 10 (P@10) and mean average precision (MAP)) of several classifiers for the prediction of influence links using backtesting at time t = 2012. The baseline predicts that every A → C link will be discovered after time t, if A → B and B → C exist before time t.
Table 5 shows the results of an ablation experiment in which we measured the drop in performance when each feature was individually removed from the full AdaBoost model. This experiment indicates that, importantly, both the influence and citation graphs contribute to the overall performance. Removing individual features from either group impacts performance. Several features have a high impact, including Cinbetween.avg-idf and shortest path length, which are extracted from the influence graph, and max P(p_{A→B}, p_{B→C}) and Jaccard(p_{A→B}, p_{B→C}), which are extracted from the citation graph. These results demonstrate that the task of scientific discovery requires a multifaceted approach that analyzes several graphs, including graphs that model the content of publications (our IG), as well as citation graphs.
Lastly, we rank the discoveries made by the proposed approach using the NN model, using a scoring function that combines the classifier confidence and redundancy (i.e., how many times we saw A → C with different intermediate nodes B) using a Noisy-Or formula:

$$\mathrm{score}(A \rightarrow C) = 1 - \prod_{B} \bigl(1 - \mathrm{prob}(A \rightarrow C \mid B)\bigr)$$

where B loops over all intermediate nodes that support the predicted relation, and prob(A → C|B) is the classifier's output probability given one intermediate node B. Table 6 lists the top 10 predictions of our approach under this scoring function. 80% of these predictions are correct (i.e., they are discovered after time t = 2012). Additionally, the table shows that the predictions are indeed informative: they capture fragments of protein signaling pathways, and links to biological processes (e.g., apoptosis).
A few predicted links such as "TLR → cascade" are not informative, but this could be attributed to limitations in the machine reader, which failed to capture meaningful content from the destination concept ("cascade").
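The Noisy-Or ranking described above can be sketched as follows. This is a minimal illustration, assuming the standard Noisy-Or form in which the score of A → C is one minus the product of (1 − prob(A → C|B)) over all supporting intermediate nodes B; the function names and the `(A, C, B, prob)` tuple format are hypothetical.

```python
from collections import defaultdict

def noisy_or_scores(predictions):
    """Combine per-intermediate-node probabilities with Noisy-Or.

    predictions: iterable of (A, C, B, prob), where prob is the classifier's
    output probability for A -> C given one intermediate node B.
    """
    # Running product of (1 - prob) for each candidate link.
    miss = defaultdict(lambda: 1.0)
    for a, c, _b, prob in predictions:
        miss[(a, c)] *= 1.0 - prob
    return {link: 1.0 - m for link, m in miss.items()}

def rank_links(predictions):
    """Return candidate links sorted by decreasing Noisy-Or score."""
    scores = noisy_or_scores(predictions)
    return sorted(scores, key=scores.get, reverse=True)

def precision_at_k(ranked, positives, k):
    """Fraction of the top-k ranked links that are true discoveries."""
    top = ranked[:k]
    return sum(link in positives for link in top) / len(top) if top else 0.0
```

Under this scoring, a link supported by two intermediate nodes with probability 0.5 each scores 0.75, so redundant evidence raises the rank above a single prediction at 0.6.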

Conclusion
We proposed a novel strategy for the identification of white spaces in scientific knowledge, which are topics that are insufficiently studied and may hide important scientific discoveries. We addressed this task with a link prediction method that operates over two graphs: a graph of influence relations that were automatically extracted from over 100K papers on children's health using a machine reading tool, and which summarize the scientific knowledge in this domain, and a graph of citations originating from these papers. Using a backtesting methodology, we showed that our method is capable of predicting which influence links will be discovered in the future with an F1 score of 27 points, and a mean average precision of 68%. An ablation experiment demonstrated that features extracted from both graphs contribute to overall performance. We believe this work is relevant to many actors involved in scientific discovery, including researchers and program managers.