Rationalizing Medical Relation Prediction from Corpus-level Statistics

Nowadays, the interpretability of machine learning models is becoming increasingly important, especially in the medical domain. Aiming to shed some light on how to rationalize medical relation prediction, we present a new interpretable framework inspired by existing theories on how human memory works, e.g., theories of recall and recognition. Given the corpus-level statistics, i.e., a global co-occurrence graph of a clinical text corpus, to predict the relations between two entities, we first recall rich contexts associated with the target entities, and then recognize relational interactions between these contexts to form model rationales, which will contribute to the final prediction. We conduct experiments on a real-world public clinical dataset and show that our framework can not only achieve competitive predictive performance against a comprehensive list of neural baseline models, but also present rationales to justify its prediction. We further collaborate closely with medical experts to verify the usefulness of our model rationales for clinical decision making.


Introduction
Predicting relations between entities from a text corpus is a crucial task in order to extract structured knowledge, which can empower a broad range of downstream tasks, e.g., question answering, dialogue systems (Lowe et al., 2015), reasoning (Das et al., 2017), etc. There has been a large amount of existing work focusing on predicting relations based on raw texts (e.g., sentences, paragraphs) mentioning two entities (Hendrickx et al., 2010; Zeng et al., 2014; Zhou et al., 2016; Mintz et al., 2009; Riedel et al., 2010; Lin et al., 2016; Verga et al., 2018; Yao et al., 2019).

Figure 1: To infer the relation between the target entities (red nodes), we recall (blue dashed line) their associated entities (blue nodes) and infer their relational interactions (red dashed line), which will serve as assumptions or model rationales to support the target relation prediction.
In this paper, we study a relatively new setting in which we predict relations between entities based on the global co-occurrence statistics aggregated from a text corpus, and focus on medical relations and clinical texts in Electronic Medical Records (EMRs). The corpus-level statistics present a holistic graph view of all entities in the corpus, which greatly facilitates relation inference; moreover, they better preserve patient privacy than raw or even de-identified textual content and are becoming a popular substitute for the latter in the research community for studying EMR data (Finlayson et al., 2014). To predict relations between entities based on a global co-occurrence graph, intuitively, one can first optimize the graph embedding or global word embedding (Pennington et al., 2014; Perozzi et al., 2014; Tang et al., 2015), and then develop a relation classifier (Nickel et al., 2011; Socher et al., 2013; Yang et al., 2015; Wang et al., 2018) based on the embedding vectors of the two entities. However, such neural frameworks often lack the desired interpretability, which is especially important for the medical domain. In general, despite their superior predictive performance in many NLP tasks, the opaque decision-making process of neural models has raised concerns about their adoption in high-stakes domains like medicine, finance, and judiciary (Rudin, 2019; Murdoch et al., 2019). Building models that provide reasonable explanations and have increased transparency can remarkably enhance user trust (Ribeiro et al., 2016; Miller, 2019). In this paper, we aim to develop such a model for our medical relation prediction task.
To start with, we draw inspiration from existing theories on cognitive processes about how human memory works, e.g., the two types of memory retrieval, recall and recognition (Gillund and Shiffrin, 1984). Basically, in the recall process, humans tend to retrieve contextual associations from long-term memory. For example, given the word "Paris", one may think of "Eiffel Tower" or "France", which are strongly associated with "Paris" (Nobel and Shiffrin, 2001; Kahana et al., 2008; Budiu, 2014). Besides, there is a strong correlation between association strength and the co-occurrence graph (Spence and Owens, 1990; Lundberg and Lee, 2017). In the recognition process, humans typically recognize whether they have seen a certain piece of information before. Figure 1 shows an example in the context of relation prediction. Assume a model is to predict whether Aspirin may treat Headache or not ("Aspirin may treat Headache" is a known fact; we choose this relation triple for illustration purposes). It would be desirable for the model to perform the aforementioned two types of memory processes and produce rationales to base its prediction upon: (1) Recall. What entities are associated with Aspirin? What entities are associated with Headache? (2) Recognition. Do those associated entities hold certain relations, which can be leveraged as clues to predict the target relation? For instance, a model could first retrieve a relevant entity Pain Relief for the tail entity Headache as they co-occur frequently, and then recognize there is a chance that Aspirin can lead to Pain Relief (i.e., formulate model rationales or assumptions), based on which it could finally make a correct prediction (Aspirin, may treat, Headache).

Figure 2: A high-level illustration of our framework. CogStage-1 models the process of recalling diverse contextual entities associated with the target head and tail entities respectively, CogStage-2 models the process of recognizing possible interactions between those recalled entities, which serve as model rationales (or assumptions) and are represented as semantic vectors, and finally CogStage-3 aggregates all assumptions to infer the target relation.

We jointly optimize all three stages using a training set of relation triples as well as the co-occurrence graph. Model rationales can be captured through this process without any gold rationales available as direct supervision. Overall, our framework rationalizes its relation prediction and is interpretable to users by providing justifications for (i) why a particular prediction is made, (ii) how the assumptions of the prediction are developed, and (iii) how particular assumptions are relied on. On a real-life clinical text corpus, we compare our framework with various competitive methods to evaluate predictive performance and interpretability. We show that our method obtains very competitive performance compared with a comprehensive list of neural baseline models. Moreover, we follow recent work (Jin et al., 2020) to quantitatively evaluate model interpretability and demonstrate that rationales produced by our framework can greatly help earn expert trust. To summarize, we study the important problem of rationalizing medical relation prediction based on corpus-level statistics and propose a new framework inspired by cognitive theories, which outperforms competitive baselines in terms of both interpretability and predictive performance.

Background
Different from existing work using raw texts for relation extraction, we assume a global co-occurrence graph (i.e., corpus-level statistics) is given, which was pre-constructed based on a text corpus D, and denote it as an undirected graph G = (V, E), where each vertex v ∈ V represents an entity extracted from the corpus and each edge e ∈ E is associated with the global co-occurrence count of the connected nodes. Counts reflect how frequently two entities appear in the same context (e.g., co-occur in the same sentence, document, or a certain time frame).
In this paper, we focus on a clinical co-occurrence graph in which vertices are medical terms extracted from clinical notes. Nevertheless, as we will see later, our framework is very general and can be applied to other relations with corpus-level statistics.
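As a concrete illustration, such a graph can be built by counting entity pairs that share a context window. The sketch below is ours, under simplifying assumptions (entity extraction is taken as already done), not the actual pipeline used to build the clinical graph:

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(contexts):
    """Build an undirected co-occurrence graph from contexts.

    `contexts` is a list of entity lists, one per context window
    (e.g., a sentence, a document, or a 1-day slice of a patient record).
    Returns edge counts keyed by an order-independent entity pair.
    """
    counts = Counter()
    for entities in contexts:
        # each unordered pair of distinct entities in the same context
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts
```

For instance, two contexts that both contain "aspirin" and "headache" yield the edge (aspirin, headache) with count 2.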
Our motivation for working under this setting is threefold: (1) Such graph data is stripped of raw textual contexts and thus better preserves patient privacy, which makes it easier to construct and share under HIPAA-protected environments (Act, 1996) for medical institutes (Finlayson et al., 2014); (2) Compared with open-domain relation extraction, entities holding a medical relation oftentimes do not co-occur in a local context (e.g., a sentence or paragraph). For instance, we observe that in a widely used clinical co-occurrence graph (Finlayson et al., 2014), which is also employed for our experiments later, of all entity pairs holding the treatment relation according to UMLS (Unified Medical Language System), only about 11.4% have a co-occurrence link (i.e., co-occur in clinical notes within a time frame like 1 day or 7 days); (3) As suggested by cognitive theories (Spence and Owens, 1990), lexical co-occurrence is significantly correlated with association strength in the recall memory process, which further inspires us to utilize such statistics to find associations and form model rationales for relation prediction.
Finally, our relation prediction task is formulated as: Given the global statistics G and an entity pair, we predict whether they hold a relation r (e.g., MAY TREAT), and moreover provide a set of model rationales T composed of relation triples for the prediction. For the example in Figure 1, we aim to build a model that will not only accurately predict the MAY TREAT relation, but also provide meaningful rationales on how the prediction is made, which are crucial for gaining trust from clinicians.

Methodology
Following a high-level framework illustration in Figure 2, we show a more detailed overview in Figure 3 and introduce each component as follows.

CogStage-1: Global Association Recall
Existing cognitive theories (Kahana et al., 2008) suggest that recall is an essential function of human memory, retrieving associations for later decision making. Meanwhile, association strength has been shown to correlate significantly with lexical co-occurrence in text corpora (Spence and Owens, 1990; Lund and Burgess, 1996). Inspired by such theories and correlations, we explicitly build our model on recalled associations stemming from corpus-level statistics and provide globally highly-associated contexts as the source of interpretations.
Given an entity, we build an estimation module to globally infer associations based on the corpus-level statistics. Our module leverages distributional learning to fully explore the graph structure. One could also directly use the raw neighborhoods in the co-occurrence graph, but due to the noise introduced when preprocessing and building the graph, that is a less optimal choice in practice.
Specifically, for a selected node/entity e_i ∈ V, our global association recall module estimates a conditional probability p(e_j | e_i), representing how likely the entity e_j ∈ V is associated with e_i. We formally define this conditional probability as:

p(e_j | e_i) = exp(υ′_{e_j} · υ_{e_i}) / Σ_{e_k ∈ V} exp(υ′_{e_k} · υ_{e_i}),   (1)

where υ_{e_i} ∈ R^d is the embedding vector of node e_i and υ′_{e_j} ∈ R^d is the context embedding of e_j. There are many ways to approximate p(e_j | e_i) from the global statistics, e.g., using global log-bilinear regression (Pennington et al., 2014). To estimate such probabilities and update entity embeddings efficiently, we optimize the conditional distribution p(e_j | e_i) to be close to the empirical distribution p̂(e_j | e_i) defined as:

p̂(e_j | e_i) = p_{ij} / Σ_{(i,k) ∈ E} p_{ik},   (2)

where E is the set of edges in the co-occurrence graph and p_{ij} is the PPMI value calculated from the co-occurrence counts between nodes e_i and e_j. We adopt the cross-entropy loss for the optimization:

L_n = − Σ_{(i,j) ∈ E} p̂(e_j | e_i) log p(e_j | e_i).   (3)

This association recall module will be jointly trained with the other objective functions introduced in the following sections. After that, given an entity e_i, we can select the top-N_c entities from p(· | e_i) as e_i's associative entities for subsequent assumption formation.
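To make the empirical distribution p̂(e_j | e_i) concrete, here is a minimal sketch (ours, not the authors' code) that derives PPMI values from raw co-occurrence counts and normalizes them over a node's neighbours; the learned softmax distribution p(e_j | e_i) and its cross-entropy training are omitted, and the marginal counts are a simplifying approximation:

```python
import math
from collections import defaultdict

def ppmi_edges(cooc):
    """Positive PMI value per edge of a co-occurrence graph.

    `cooc` maps an unordered entity pair (a, b) to its global count.
    Marginals are approximated by summing each node's incident counts.
    """
    total = sum(cooc.values())
    marg = defaultdict(float)
    for (a, b), c in cooc.items():
        marg[a] += c
        marg[b] += c
    return {pair: max(math.log(c * total / (marg[pair[0]] * marg[pair[1]])), 0.0)
            for pair, c in cooc.items()}

def empirical_recall(ppmi_vals, e_i):
    """Empirical distribution over e_i's neighbours, proportional to PPMI."""
    nbrs = {}
    for (a, b), v in ppmi_vals.items():
        if a == e_i:
            nbrs[b] = v
        elif b == e_i:
            nbrs[a] = v
    z = sum(nbrs.values()) or 1.0  # guard against isolated nodes
    return {j: v / z for j, v in nbrs.items()}
```

Sorting the returned distribution and keeping the top-N_c entries gives the associative entities used downstream.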

CogStage-2: Assumption Formation and Representation
As shown in Figure 3, with the associative entities from CogStage-1, we are ready to formulate and represent assumptions. In this paper, we define model assumptions as relational interactions between associations; that is, as shown in Figure 1, the model may identify (Caffeine, MAY TREAT, Migraine) as an assumption, which could help predict that Aspirin may treat Headache (Caffeine and Migraine are associations for Aspirin and Headache respectively). Such relational rationales are more concrete and much easier for humans to understand than the widely-adopted explanation strategy (Yang et al., 2016; Mullenbach et al., 2018; Vashishth et al., 2019) in NLP that is based on pure attention weights over local contexts. One straightforward way to obtain such rationales is to query existing medical knowledge bases (KBs), e.g., (Caffeine, MAY TREAT, Migraine) may exist in SNOMED CT and can serve as a model rationale. We refer to rationales acquired in this way as the Closed-World Assumption (CWA) (Reiter, 1981) setting since only KB-stored facts are considered and trusted in a closed world. In contrast to the CWA rationales, considering the sparsity and incompleteness issues of KBs, which are even more severe in the medical domain, we also propose the Open-World Assumption (OWA) (Ceylan et al., 2016) setting to discover richer rationales by estimating all potential relations between associative entities based on a seed set of relation triples (which can be regarded as prior knowledge).
In general, the CWA rationales are relatively more accurate as each fact triple has been verified by the KB, but would have a low coverage of other possibly relevant rationales for the target prediction. On the other hand, the OWA rationales are more comprehensive but could be noisy and less accurate, due to the probabilistic estimation procedure and the limited amount of prior knowledge. However, as we will see, by aggregating all OWA rationales into the whole framework with an attention-based mechanism, we can select high-quality and most relevant rationales for prediction. For the rest of the paper, by default we adopt the OWA setting in our framework and describe its details as follows.
Specifically, given a pair of head and tail entities e_h, e_t ∈ V, let us denote their association sets as A(e_h) and A(e_t), each containing the top-N_c associative entities recalled for e_h and e_t respectively. Each entity has been assigned an embedding vector by the previous association recall module. We first measure the probability of relations holding for the pair. Given a_h^i ∈ A(e_h), a_t^j ∈ A(e_t) and a relation r_k ∈ R, we define a translation-based scoring function following Bordes et al. (2013) to estimate triple quality:

f(a_h^i, r_k, a_t^j) = − ‖υ_{a_h^i} + ξ_k − υ_{a_t^j}‖,   (4)

where υ_{a_h^i} and υ_{a_t^j} are embedding vectors, relations are parameterized by a relation matrix R ∈ R^{N_r × d}, and ξ_k is its k-th row vector. Such a scoring function encourages larger values for correct triples. Additionally, in order to filter unreliable estimations, we define an NA relation to represent other trivial relations or no relation, whose score f(a_h^i, NA, a_t^j) can be seen as a dynamic threshold to produce reasonable rationales. Now we formulate OWA rationales by calculating the conditional probability of a relation given a pair of associations (we omit the superscript ij for space):

p(r_k | a_h^i, a_t^j) = exp(f(a_h^i, r_k, a_t^j)) / Σ_{r_l ∈ R ∪ {NA}} exp(f(a_h^i, r_l, a_t^j)).   (5)

For each association pair (a_h^i, a_t^j), we only form an assumption with a relation r_k* if r_k* is top ranked according to p(r_k | a_h^i, a_t^j). To represent assumptions, we integrate all relation information per pair into a single vector representation. Concretely, we calculate the relation vector by treating p(r_k | a_h^i, a_t^j) as weights for all relations:

ρ_{ij} = Σ_k p(r_k | a_h^i, a_t^j) ξ_k.   (6)

Finally, we combine the entity vectors as well as the relation vector to get the final representation of the assumption for association pair (a_h^i, a_t^j):

e_{ij} = φ(W_p [υ_{a_h^i} ; ρ_{ij} ; υ_{a_t^j}] + b_p),   (7)

where [· ; ·] represents vector concatenation, φ is a nonlinear activation, and W_p ∈ R^{3d × d_p}, b_p ∈ R^{d_p} are the weight matrix and bias of a fully-connected network.
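The triple-scoring and assumption-representation steps can be sketched as follows, assuming a TransE-style distance score (Bordes et al., 2013) and treating NA as just another relation row; the toy 2-d vectors and plain-Python rendering are ours, for clarity only:

```python
import math

def transe_score(head, rel, tail):
    """TransE-style score: less negative = more plausible triple."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, rel, tail)))

def relation_distribution(head, tail, relations):
    """Softmax over relation scores, with NA included as an ordinary row."""
    scores = {k: transe_score(head, v, tail) for k, v in relations.items()}
    m = max(scores.values())
    exps = {k: math.exp(s - m) for k, s in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def assumption_vector(head, tail, relations):
    """Probability-weighted mixture of relation vectors for one association pair."""
    p = relation_distribution(head, tail, relations)
    dim = len(next(iter(relations.values())))
    return [sum(p[k] * relations[k][i] for k in relations) for i in range(dim)]
```

In the full model this mixture would then be concatenated with the two entity embeddings and passed through the fully-connected layer to form the assumption representation.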

CogStage-3: Prediction Decision Making
Analogous to human thinking, our decision-making module aggregates all assumption representations and measures their accountability for the final prediction. It learns a distribution over all assumptions, and we select the ones with the highest probabilities as model rationales. More specifically, we define a scoring function to estimate accountability based on the assumption representation e_{ij}:

g(e_{ij}) = W_a e_{ij} + b_a,   (8)

where W_a, b_a are the weight matrix and bias of the scoring function, and normalize g(e_{ij}) as:

p_{ij} = exp(g(e_{ij})) / Σ_{i′,j′} exp(g(e_{i′j′})).   (9)

Then we get the weighted rationale representation as:

u = Σ_{i,j} p_{ij} e_{ij}.   (10)

With this representation of the weighted assumption information for the target pair (e_h, e_t), we calculate the binary prediction probability for relation r as:

p(r | e_h, e_t) = σ(W_r u + b_r),   (11)

where σ(x) = 1/(1 + exp(−x)) and W_r, b_r are model parameters.
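The aggregation and prediction steps admit a compact sketch. This is a toy plain-Python rendering with hypothetical weight vectors standing in for the trained parameters W_a and W_r:

```python
import math

def softmax(xs):
    """Normalize raw scores into a distribution (numerically stable)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [v / z for v in exps]

def predict(assumptions, w_a, w_r, b_a=0.0, b_r=0.0):
    """Attention over assumption vectors e_ij, then a sigmoid relation score.

    `assumptions` is a list of assumption representation vectors; `w_a`/`w_r`
    are hypothetical weight vectors standing in for the trained parameters.
    Returns (prediction probability, attention weights over assumptions).
    """
    # linear accountability score per assumption
    scores = [sum(w * x for w, x in zip(w_a, vec)) + b_a for vec in assumptions]
    weights = softmax(scores)  # normalized attention weights p_ij
    dim = len(assumptions[0])
    # weighted rationale representation: sum_ij p_ij * e_ij
    u = [sum(p * vec[i] for p, vec in zip(weights, assumptions)) for i in range(dim)]
    logit = sum(w * x for w, x in zip(w_r, u)) + b_r
    return 1.0 / (1.0 + math.exp(-logit)), weights
```

The attention weights produced here are the same quantities later used to rank assumptions as model rationales.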
Rationalizing relation prediction. After fully training the entire model, to recover the most contributing assumptions for predicting the relation between the given target entities (e h , e t ), we compute the importance scores for all assumptions and select those most important ones as model rationales.
In particular, we multiply p_{ij} (the weight of association pair (a_h^i, a_t^j) in Eqn. 9) with p(r_k | a_h^i, a_t^j) (the probability of a relation given the pair, from Eqn. 5) to score the triple (a_h^i, r_k, a_t^j). We rank all such triples for a_h^i ∈ A(e_h), a_t^j ∈ A(e_t), r_k ∈ R and select the top-K triples as model rationales for the final relation prediction.
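This rationale-scoring rule is simple enough to state directly in code; a sketch with hypothetical inputs (the attention weights and relation probabilities would come from the trained model):

```python
def rank_rationales(pair_weights, rel_probs, top_k=5):
    """Score each candidate triple (a_h, r_k, a_t) by the attention weight
    of its association pair times the relation probability, keep the top-K."""
    scored = []
    for pair, w in pair_weights.items():
        ah, at = pair
        for rk, pr in rel_probs[pair].items():
            if rk != "NA":  # NA acts as a threshold, not a rationale
                scored.append(((ah, rk, at), w * pr))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```

Note that a pair with high attention but a diffuse relation distribution can be outranked by a lower-attention pair whose relation is predicted confidently.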

Training
We now describe how we train our model efficiently across multiple modules. For relational learning to estimate the conditional probability p(r_k | a_h^i, a_t^j), we utilize the training data as a seed set of correct triples for all relations, denoted as (h, r, t) ∈ P. The scoring function in Eqn. 4 is expected to score correct triples higher than corrupted ones, where N(?, r, t) (resp. N(h, r, ?)) denotes the set of corrupted triples obtained by replacing the head (tail) entity randomly. Instead of using a margin-based loss function, we adopt a more efficient training strategy (Kadlec et al., 2017; Toutanova and Chen, 2015) with a negative log-likelihood loss:

L_r = − Σ_{(h,r,t) ∈ P} [log p(h | t, r) + log p(t | h, r)],   (12)

where the conditional probability p(h | t, r) is defined as follows (p(t | h, r) is defined similarly):

p(h | t, r) = exp(f(h, r, t)) / Σ_{h′ ∈ N(?,r,t) ∪ {h}} exp(f(h′, r, t)).   (13)

For our binary relation prediction task, we define a binary cross-entropy loss with Eqn. 11:

L_p = − (1/M) Σ_{i=1}^{M} [y_i log p(r | e_h, e_t) + (1 − y_i) log(1 − p(r | e_h, e_t))],   (14)

where M is the number of samples and y_i is the label indicating whether (e_h, e_t) holds the relation. The above three loss functions, i.e., L_n for global association recall, L_r for relational learning, and L_p for relation prediction, are all jointly optimized. All three share the entity embeddings, and L_p reuses the relation matrix from L_r to conduct the rationale generation.
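Under our reading, the two supervised objectives can be sketched as follows; these are plain-Python toy versions (the joint objective simply sums L_n, L_r, and L_p over shared embeddings):

```python
import math

def bce_loss(probs, labels):
    """Binary cross-entropy L_p over M (probability, label) samples."""
    m = len(probs)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / m

def nll_relational_loss(pos_score, neg_scores):
    """Per-triple term of L_r: -log p(h | t, r), where the probability is a
    softmax of the true triple's score against its corrupted triples' scores."""
    m = max([pos_score] + neg_scores)
    z = sum(math.exp(s - m) for s in [pos_score] + neg_scores)
    return -(pos_score - m - math.log(z))
```

As expected, the relational loss shrinks as the true triple's score rises above those of its corruptions, which is what drives the relation matrix toward producing useful rationales.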

Experiments
In this section, we first introduce our experimental setup, e.g., the corpus-level co-occurrence statistics and datasets used for our experiments, and then compare our model with a comprehensive list of competitive baselines in terms of predictive performance. Moreover, we conduct expert evaluations as well as case studies to demonstrate the usefulness of our model rationales.

Dataset
We directly adopt a publicly available medical co-occurrence graph for our experiments (Finlayson et al., 2014). The graph was constructed in the following way: Finlayson et al. (2014) first used an efficient annotation tool (LePendu et al., 2012) to extract medical terms from 20 million clinical notes collected by Stanford Hospitals and Clinics, and then computed the co-occurrence counts of two terms based on their appearances in one patient's records within a certain time frame (e.g., 1 day, 7 days). We experiment with their biggest dataset with the largest number of nodes (i.e., the per-bin 1-day graph) so as to have sufficient training data. The co-occurrence graph contains 52,804 nodes and 16,197,319 edges.
To obtain training labels for relation prediction, we utilize the mapping between medical terms and concepts provided by Finlayson et al. (2014). To be specific, they mapped extracted terms to UMLS concepts with high mapping accuracy by suppressing the least possible meanings of each term (see Finlayson et al. (2014) for more details). We utilize such mappings to automatically collect relation labels from UMLS. For terms e_a and e_b that are respectively mapped to medical concepts c_A and c_B, we find the relation between c_A and c_B in UMLS, which is then used as the label for e_a and e_b.
Following Wang and Fan (2014), who studied distant supervision in medical text and identified several crucial relations for clinical decision making, we select 5 important medical relations with no less than 1,000 relation triples in our dataset. Each relation is mapped to UMLS semantic relations, e.g., relation CAUSES corresponds to cause of, induces, causative agent of in UMLS. A full list of mappings is in the appendix. We sample an equal number of negative pairs by randomly pairing head and tail entities with the correct argument types (Wang and Fan, 2014).

Predictive Performance Evaluation
Compared Methods. There are a number of advanced neural methods (Tang et al., 2015; Qu et al., 2018; Wang et al., 2018) that have been developed for the link prediction task, i.e., predicting the relation between two nodes in a co-occurrence graph. At a high level, their frameworks comprise an entity encoder and a relation scoring function. We adapt various existing methods for both the encoder and the scoring function for comprehensive comparison. Specifically, given the co-occurrence graph, we employ existing distributional representation learning methods to learn entity embeddings.
With the entity embeddings as input features, we adapt various models from the knowledge base completion literature as binary relation classifiers. More specifically, for the encoder, we select one word embedding method, Word2vec (Mikolov et al., 2013; Levy and Goldberg, 2014); two graph embedding methods, random-walk based DeepWalk (Perozzi et al., 2014) and edge-sampling based LINE (Tang et al., 2015); and one distributional approach, REPEL-D (Qu et al., 2018), for weakly-supervised relation extraction that leverages both the co-occurrence graph and training relation triples to learn entity representations. For the scoring functions, we choose DistMult (Yang et al., 2015), RESCAL (Nickel et al., 2011) and NTN (Socher et al., 2013). Note that one can apply more complex encoders or scoring functions to obtain higher predictive performance; however, in this work, we place more emphasis on model interpretability than predictive performance, and unfortunately, all such frameworks are hard to interpret as they provide little or no explanation of how decisions are made.

We also show the predictive performance of our framework under the CWA setting, in which the CWA rationales are existing triples in a "closed" knowledge base (i.e., UMLS). We first adopt the pre-trained association recall module to retrieve associative contexts for head and tail entities, then formulate the assumptions using top-ranked triples (that exist in our relation training data), where the rank is based on the product of their retrieval probabilities (p_{ij} = p(e_i | e_h) × p(e_j | e_t)). We keep the rest of our model the same as in the OWA setting.
Results. We compare the predictive performance of different models in terms of F1 score on each relation prediction task. As shown in Table 2, our model obtains very competitive performance compared with a comprehensive list of baseline methods. Specifically, on the prediction tasks of MAY TREAT and CONTRAINDICATES, our model achieves a substantial improvement (1∼2 F1 points), and it achieves very competitive performance on the SYMPTOM OF and MAY PREVENT tasks. The small amount of training data might partly explain why our model does not perform as well on the CAUSES task. Such comparison shows the effectiveness of predicting relations based on associations and their relational interactions. Moreover, compared with those baseline models which encode graph structure into latent vector representations, our model utilizes the co-occurrence graph more explicitly by leveraging the associative contexts symbolically to generate human-understandable rationales, which can assist medical experts as we will see shortly.
In addition, we observe that our model consistently outperforms the CWA setting: Although the CWA rationales are true statements on their own, they tend to have a low coverage of possible rationales and thus may not be as relevant for the target relation prediction, which leads to poorer predictive performance.

Model Rationale Evaluation
To measure the quality of our model rationales (i.e., OWA rationales), as well as to conduct an ablation study of our model, we conduct an expert evaluation of the OWA rationales and also compare them with the CWA rationales. We first collaborate with a physician to explore how much a model's rationales help them better trust the model's prediction, following recent work for evaluating model interpretability (Mullenbach et al., 2018; Atutxa et al., 2019; Jin et al., 2020). Then, we present some case studies to show what kind of rationales our model has learnt. Note that compared with evaluation by human annotators for open-domain tasks (without expertise requirements), evaluation by medical experts is more challenging in general. The physician in our study (an M.D. with 9 years of clinical experience and currently a fellow trained in clinical informatics), who is able to understand the context of terms and the basics of the compared algorithms and can dedicate time, is qualified for our evaluation.
Expert Evaluation. We first explained to the physician the recall and recognition processes in our framework and how model rationales are developed. They endorsed this reasoning process as one possible way to gain their trust in the model. Next, for each target pair for which our model correctly makes the prediction, they were shown the top-5 rationales produced by our framework and were asked whether each rationale helps them better trust the model prediction. For each rationale, they were asked to score it from 0 to 3, where 0 is not helpful, 1 is a little helpful, 2 is helpful, and 3 is very helpful. In addition to the individual rationale evaluation, we further compare the overall quality of CWA and OWA rationales by letting the expert rank them based on the helpfulness of each set of rationales (the rationale set ranked higher gets 1 ranking score, and both get 0 if they have the same rank). We refer readers to the appendix for more details of the evaluation protocol. We randomly select 30 cases in the MAY TREAT relation, and the overall evaluation results are summarized in Table 3. Out of 30, OWA wins in 17 cases and gets higher scores on individual rationales per case on average. There are 8 cases where the two sets of rationales are ranked the same (in 7 of which, both sets are indicated as equally unhelpful) and 5 cases where CWA is better. To get a better idea of how the OWA model obtains more trust, we calculate the average sum score per case, which shows the OWA model gets a higher overall score per case. Considering that in some cases only a few rationales are able to get non-zero scores, we also calculate the average max score per case, which shows that our OWA model generally provides at least one helpful rationale (score > 2) per case. Overall, as we can see, the OWA rationales are more helpful for gaining expert trust.

Case Study. Table 4 shows two concrete examples demonstrating what kind of model rationales our framework bases its predictions on. We highlight the rationales that receive high scores from the physician for being especially useful for trusting the prediction. As we can see, our framework is able to make correct predictions based on reasonable rationales. For instance, to predict that "cephalosporine" may treat "bacterial infection", our model relies on the rationale that "cefuroxime" may treat "infectious diseases". We also note that not all rationales are clinically established facts or even make sense, due to the unsupervised rationale learning and the probabilistic assumption formation process, which leaves space for future work to further improve the quality of rationales. Nevertheless, such model rationales can provide valuable information or new insights for clinicians. For example, as pointed out by the physician, different medications possibly having the same treatment response, as shown in Case 2, could be clinically useful: if three medications are predicted to possibly treat the same condition and a physician is only aware of two doing so, they might get insights into trying the third one. To summarize, our model is able to provide reasonable rationales and helps users understand how model predictions are made in general.

Related Work
Relation Extraction (RE) typically focuses on predicting relations between two entities based on their text mentions, and has been well studied in both open domain (Mintz et al., 2009;Zeng et al., 2015;Riedel et al., 2013;Lin et al., 2016;Song et al., 2019;Deng and Sun, 2019) and biomedical domain (Uzuner et al., 2011;Wang and Fan, 2014;Sahu et al., 2016;Lv et al., 2016;He et al., 2019). Among them, most state-of-the-art work develops various powerful neural models by leveraging human annotations, linguistic patterns, distance supervision, etc. More recently, an increasing amount of work has been proposed to improve model's transparency and interpretability. For example, Lee et al.
(2019) visualizes self-attention weights learned from BERT (Devlin et al., 2019) to explain relation prediction. However, such text-based interpretable models tend to provide explanations within a local context (e.g., words in a single sentence mentioning the target entities), which may not capture a holistic view of all entities and their relations stored in a text corpus. We believe that such a holistic view is important for interpreting relations and can be provided to some degree by the global statistics from a text corpus. Moreover, global statistics have been widely used in the clinical domain as they can better preserve patient privacy (Finlayson et al., 2014). On the other hand, in recent years, graph embedding techniques (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016) have been widely applied to learn node representations based on graph structure. Representation learning based on global statistics from a text corpus (i.e., a co-occurrence graph) has also been studied (Levy and Goldberg, 2014; Pennington et al., 2014). After employing such methods to learn entity embeddings, a number of relation classifiers (Nickel et al., 2011; Bordes et al., 2013; Socher et al., 2013; Yang et al., 2015; Wang et al., 2018) can be adopted for relation prediction. We compare our method with such frameworks to show its competitive predictive accuracy. However, such frameworks tend to be difficult to interpret as they provide little or no explanation of how decisions are made. In this paper, we focus more on model interpretability than predictive accuracy, and draw inspiration from existing cognitive theories of recall and recognition to develop a new framework, which is our core contribution.
Another line of research related to interpreting relation prediction is path-based knowledge graph (KG) reasoning (Gardner et al., 2014; Neelakantan et al., 2015; Guu et al., 2015; Xiong et al., 2017; Stadelmaier and Padó, 2019). In particular, existing paths mined from millions of relational links in knowledge graphs can be used to provide justifications for relation predictions. For example, to explain why Microsoft and USA may hold the relation CountryOfHeadquarters, by traversing a KG, one can extract a multi-hop path connecting Microsoft to USA as one explanation. However, such path-finding methods typically require large-scale relational links to infer path patterns, and cannot be applied to our co-occurrence graph, as the co-occurrence links are unlabeled.
In addition, our work is closely related to the area of rationalizing machine decisions by generating justifications/rationales accounting for a model's prediction. In some scenarios, human rationales are provided as extra supervision for more explainable models (Zaidan et al., 2007; Bao et al., 2018). However, due to the high cost of manual annotation, model rationales are desired to be learned in an unsupervised manner (Lei et al., 2016; Bouchacourt and Denoyer, 2019; Zhao et al., 2019). For example, Lei et al. (2016) select a subset of words as rationales and Bouchacourt and Denoyer (2019) provide an explanation based on the absence or presence of "concepts", where the selected words and "concepts" are learned without supervision. Different from text-based tasks, in this paper, we propose to rationalize relation prediction based on global co-occurrence statistics; similarly, model rationales in our work are captured without explicit manual annotation either, via a joint training framework.

Conclusion
In this paper, we propose an interpretable framework to rationalize medical relation prediction based on corpus-level statistics. Our framework is inspired by existing cognitive theories on human memory recall and recognition; it can be easily understood by users and provides reasonable explanations to justify its predictions. Essentially, it leverages corpus-level statistics to recall associative contexts and recognizes their relational connections to form model rationales. Compared with a comprehensive list of baseline models, our model obtains competitive predictive performance. Moreover, we demonstrate its interpretability via expert evaluation and case studies.

Symptom of: disease has finding; disease may have finding; has associated finding; has manifestation; associated condition of; defining characteristic of
Table 5: Relations in our dataset and their mapped UMLS semantic relations. (UMLS relation "Treats" does not exist in our dataset and hence is not mapped with the "May treat" relation.)