Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies

Inferring the nature of the relationships between biomedical entities from text is an important problem due to the difficulty of maintaining human-curated knowledge bases in rapidly evolving fields. Neural word embeddings have earned attention for an apparent ability to encode relational information. However, word embedding models that disregard syntax during training are limited in their ability to encode the structural relationships fundamental to cognitive theories of analogy. In this paper, we demonstrate the utility of encoding dependency structure in word embeddings in a model we call Embedding of Structural Dependencies (ESD) as a way to represent biomedical relationships in two analogical retrieval tasks: a relationship retrieval (RR) task, and a literature-based discovery (LBD) task meant to hypothesize plausible relationships between pairs of entities unseen in training. We compare our model to skip-gram with negative sampling (SGNS), using 19 databases of biomedical relationships as our evaluation data, with improvements in performance on 17 (LBD) and 18 (RR) of these sets. These results suggest embeddings encoding dependency path information are of value for biomedical analogy retrieval.


Introduction
Distributed vector space models of language have been shown to be useful as representations of relatedness and can be applied to information retrieval and knowledge base augmentation, including within the biomedical domain (Cohen and Widdows, 2009). A vast amount of knowledge on biomedical relationships of interest, such as therapeutic relationships, drug-drug interactions, and adverse drug events, exists in largely human-curated knowledge bases (Zhu et al., 2019). However, the rate at which new papers are published means new relationships are being discovered faster than human curators can manually update the knowledge bases. Furthermore, it is appealing to automatically generate hypotheses about novel relationships given the information in scientific literature (Swanson, 1986), a process also known as 'literature-based discovery.' A trustworthy model should also be able to reliably represent known relationships that are validated by existing literature.
Neural word embedding techniques such as word2vec and fastText are a widely used and effective approach to the generation of vector representations of words (Mikolov et al., 2013a) and biomedical concepts (De Vine et al., 2014). An appealing feature of these models is their capacity to solve proportional analogy problems using simple geometric operators over vectors (Mikolov et al., 2013b). In this way, it is possible to find analogical relationships between words and concepts without the need to specify the relationship type explicitly, a capacity that has recently been used to identify therapeutically important drug/gene relationships for precision oncology (Fathiamini et al., 2019). However, neural embeddings are trained to predict co-occurrence events without consideration of syntax, limiting their ability to encode information about relational structure, which is an essential component of cognitive theories of analogical reasoning (Gentner and Markman, 1997). Additionally, recent work (Peters et al., 2018) has found that contextualized word embeddings from language models such as ELMo, when evaluated on analogy tasks, perform worse on semantic relation tasks than static embedding models.
The present work explores the utility of encoding syntactic structure in the form of dependency paths into neural word embeddings for analogical retrieval of biomedical relations. To this end, we build and evaluate vector space models for representing biomedical relationships, using a corpus of dependency-parsed sentences from biomedical literature as a source of grammatical representations of relationships between concepts.
We compare two methods for learning biomedical concept embeddings, the skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013a) and Embedding of Semantic Predications (ESP) (Cohen and Widdows, 2017), which adapts SGNS to encode concept-predicate-concept triples. In the current work, we adapt ESP to encode dependency paths, an approach we call Embedding of Structural Dependencies (ESD). We train ESD and SGNS on a corpus of approximately 70 million sentences from biomedical research paper abstracts from Medline, and evaluate each model's ability to solve analogical retrieval problems derived from various biomedical knowledge bases. We train ESD on concept-path-concept triples extracted from these sentences, and SGNS on full sentences that have been minimally preprocessed with named entities (see §3). Figure 1 shows the pipeline from training to evaluation.
From an applications perspective, we aim to evaluate the utility of these representations of relationships for two tasks. The first involves correctly identifying a concept that is related in a particular way to another concept, when this relationship has already been described explicitly in the biomedical literature. This task is related to the NLP task of relationship extraction, but rather than considering one sentence at a time, distributional models represent information from across all of the instances in which this pair have co-occurred, as well as information about relationships between similar concepts. We refer to this task as relationship retrieval (RR). The second task involves identifying concepts that are related in a particular way to one another, where this relationship has not been described in the literature previously. We refer to this task as literature-based discovery (LBD), as identifying such implicit knowledge is the main goal of this field (Swanson, 1986).
We evaluate on four kinds of biomedical relationships, characterized by the semantic types of the entity pairs involved, namely chemical-gene, chemical-disease, gene-gene, and gene-disease relationships.
The remainder of this paper is structured as follows. §2 describes vector space models of language as they are evaluated for their ability to solve proportional analogy problems, as well as prior work in encoding dependency paths for downstream applications in relation extraction. §3 presents the dependency path corpus from Percha and Altman (2018). §4 summarizes the knowledge bases from which we develop our evaluation data sets. §5 describes the training details for each vector space model. §6 and §7 describe the methods and results for the RR and LBD evaluation paradigms. §8 and §9 offer discussion and conclude the paper. Code and evaluation data will be made available at https://github.com/amandalynne/ESD.

Background
We draw on prior work that uses proportional analogies to test relationship representation in the general domain, including studies of vector space models trained on generic English. While our biomedical data is largely in English, we constrain our evaluation to specific biomedical concepts and relationships as we apply and extend established methods.

Vector space models of language and analogical reasoning
Vector space models of semantics have been applied in information retrieval, cognitive science and computational linguistics for decades (Turney and Pantel, 2010), with a resurgence of interest in recent years. Mikolov et al. (2013a) and Mikolov et al. (2013b) introduce the skip-gram architecture. This work demonstrated the use of a continuous vector space model of language that could be used for analogical reasoning when vector offset methods are applied, providing the following canonical example: if $x_i$ is the vector corresponding to word $i$, then $x_{\text{king}} - x_{\text{man}} + x_{\text{woman}}$ yields a vector that is close in proximity to $x_{\text{queen}}$. This result suggests that the model has learned something about semantic gender. They identified some other linguistic patterns recoverable from the vector space model, such as pluralization: $x_{\text{apple}} - x_{\text{apples}} \approx x_{\text{car}} - x_{\text{cars}}$, and developed evaluation sets of proportional analogy problems that have since been widely used as benchmarks for distributional models (see, for example, Levy et al., 2015).
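The vector offset method can be illustrated with a minimal sketch. The two-dimensional embeddings below are hypothetical toy values chosen for illustration, not vectors from any trained model, and the function name `offset_analogy` is our own. Excluding the three input words from the candidate set is a common convention that this sketch assumes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def offset_analogy(vectors, a, b, c):
    """Return the word nearest to x_b - x_a + x_c, excluding the inputs."""
    query = [xb - xa + xc
             for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# Hypothetical toy embeddings: axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

# man : king :: woman : ?
result = offset_analogy(toy, "man", "king", "woman")
```

With these toy values the query vector is (1, -1), which coincides with the vector assigned to "queen".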
However, work soon followed that pointed out some of the shortcomings of attributing these results to the models' analogical reasoning capacity. For example, Linzen (2016) showed that the vector for 'queen' is itself one of the nearest neighbors to the vector for 'woman,' and so it can be argued that the model does not actually learn relational information that can be applied to analogical reasoning, but rather, can rely on the direct similarity between the target terms in the analogy to produce desirable results.
Furthermore, Gladkova et al. (2016) introduce the Better Analogy Test Set (BATS) to provide an evaluation set for analogical reasoning that includes a broader set of semantic and syntactic relationships between words. This set proved far more challenging for embedding-based approaches. Newman-Griffis et al. (2017) provide results of vector offset methods applied to a dataset of biomedical analogies derived from UMLS triples, showing that certain biomedical relationships are more difficult to learn with analogical reasoning than others.
Because the aim of this project is to robustly learn a handful of biomedical relationships, we are less concerned about the linguistic generalizability of these particular representations, but future work will examine the application of these vector space models to analogies in the general domain.

Dependency embeddings
Levy and Goldberg (2014a) adapt the SGNS model to encode direct dependency relationships, rather than dependency paths. In this approach, a dependency-type/relative pair is treated as a target for prediction when the head of a phrase is observed (e.g. P(scientist/nsubj | discovers)). The dependency-based skip-gram embeddings were shown to better reflect the functional roles of words than those trained on narrative text, which tended to emphasize topical associations. Recent work (Zhang et al., 2018; Zhou et al., 2018; Li et al., 2019) has also integrated dependency path representations in neural architectures for biomedical relation extraction, framing it as a classification task rather than an analogical reasoning task. The work of Washio and Kato (2018) is perhaps the most closely related to our approach, in that neural embeddings are trained on word-path-word triples. Aside from our application of domain-specific Named Entity Recognition (NER), a key methodological difference between their work and ours is that their approach represents word pairs as a linear transformation of the concatenation of their embeddings, while we use XOR as a binding operator (following the approach of Kanerva (1996)), which was first used to model biomedical analogical retrieval with semantic predications extracted from the literature by Cohen et al. (2011). On account of the use of a binding operator, individual entities, pairs of entities and dependency paths are all represented in a common vector space.

Text Data
We train both the ESD and SGNS models on data released by Percha and Altman (2018). This corpus consists of about 70 million sentences from a subset of MEDLINE (approximately 16.5 million abstracts) which have PubTator (Wei et al., 2013) annotations applied to identify phrases that denote names of chemicals (including drugs and other chemicals of interest), genes (and the proteins they code for), and diseases (including side effects and other phenotypes). Throughout this paper, we use these shorthand names for each of these categories, following the convention established in Wei et al. (2013) and followed by Percha and Altman (2018).

Figure 2: Example of a path of dependencies between two entities of interest. The full parse is not shown, but rather, the minimum path of dependency relations between the two entities given the sentence.
The following example sentence from an article processed by PubTator shows how multi-word phrases that denote biomedical entities of interest, in this case atypical depression and seasonal affective disorder, are concatenated by underscores to constitute single tokens:

Chromium has a beneficial effect on eating-related atypical symptoms of depression, and may be a valuable agent in treating atypical_depression and seasonal_affective_disorder.

Percha and Altman (2018) also provide pruned Stanford dependency (De Marneffe and Manning, 2008) parses for the sentences in the corpus, consisting, for each sentence, of the minimal path of dependency relations connecting pairs of biomedical named entities identified by PubTator. Specifically, they extract dependency paths that connect chemicals to genes, chemicals to diseases, genes to diseases, and genes to genes. Figure 2 shows an example of a dependency path of relations between two terms, risperidone and rage. We use these dependency paths as representations for predicates that denote biomedical relationships of interest by concatenating the string representations of each path element, which are shown below the sentence in Figure 2. Following Percha and Altman (2018), we exclude paths that denote a coordinating conjunction between elements and paths that denote an appositive construction, both of which are highly common in the set. In this corpus of 70 million sentences, there are about 44 million unique dependency paths that connect concepts of interest, the vast majority (around 40 million) of which appear just once in the corpus. 540,011 of these paths appear at least 5 times in the corpus.

Knowledge Bases
We construct our evaluation data sets with exemplars from knowledge bases for four primary kinds of biomedical relationships, characterized by the interactions between pairs of entities of the following types: chemical-gene, chemical-disease, gene-disease, and gene-gene.
Each knowledge base consists of pairs of entities that relate in a specific way. For example, SIDER Side Effects consists of chemical-disease-typed pairs such that the chemical is known to have the disease as a side effect, e.g. (sertraline, insomnia). Meanwhile, another chemical-disease pair from a different database, Therapeutic Target Database (TTD) indications, is such that the chemical is indicated as a treatment for the disease, e.g. (carphenazine, schizophrenia). In constructing our evaluation sets, we process all terms such that they are lower-cased, and multi-word terms are concatenated by underscores. Furthermore, we eliminate from our evaluation sets any knowledge base terms that do not appear in the training corpus described in §3 at least 5 times. It should be noted that across these sets, a single biomedical entity may appear with numerous spellings and naming conventions. Table 2 shows the corresponding relationship type for each of the knowledge bases we use, as well as the number of pairs from each that are used in our evaluation data. The relationship retrieval data consists of knowledge base pairs that appear in our training corpus connected by a dependency path at least once, while the literature-based discovery targets are those knowledge base pairs that do not appear connected by a dependency path in the corpus.
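The construction of the two evaluation sets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the ordered-pair representation of "connected by a dependency path," and all counts and the `Drug X` pair below are hypothetical.

```python
def normalize_term(term):
    """Lower-case a term and join multi-word terms with underscores."""
    return "_".join(term.lower().split())

def split_pairs(kb_pairs, term_counts, connected_pairs, min_count=5):
    """Split knowledge base pairs into RR and LBD evaluation sets.

    kb_pairs:        iterable of (X, Y) term pairs from a knowledge base
    term_counts:     corpus frequency of each normalized term (assumed input)
    connected_pairs: set of normalized (X, Y) pairs linked by at least one
                     dependency path in the corpus (assumed input)
    """
    rr, lbd = [], []
    for x, y in kb_pairs:
        x, y = normalize_term(x), normalize_term(y)
        # Drop pairs whose terms are too rare in the training corpus.
        if term_counts.get(x, 0) < min_count or term_counts.get(y, 0) < min_count:
            continue
        # Pairs seen with a path go to RR; unseen pairs become LBD targets.
        (rr if (x, y) in connected_pairs else lbd).append((x, y))
    return rr, lbd

# Hypothetical counts and co-occurrence information for illustration only.
counts = {"sertraline": 120, "insomnia": 900, "carphenazine": 7,
          "schizophrenia": 5000, "drug_x": 3, "rare_disease_y": 40}
connected = {("sertraline", "insomnia")}
kb = [("Sertraline", "Insomnia"),
      ("Carphenazine", "Schizophrenia"),
      ("Drug X", "Rare Disease Y")]

rr_pairs, lbd_pairs = split_pairs(kb, counts, connected)
```

In this toy run the third pair is discarded because one of its terms falls below the frequency threshold of 5.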

Training Details
SGNS. With SGNS, a shallow neural network is trained to estimate the probability of encountering a context term, $t_c$, within a sliding window centered on an observed term, $t_o$. The training objective involves maximizing this probability for true context terms, $P(t_c \mid t_o)$, and minimizing it for randomly drawn counterexamples $t_{\neg c}$, $P(t_{\neg c} \mid t_o)$, with probability estimated as the sigmoid function of the scalar product between the input weight vector for the observed term and the output weight vector of the context term, $\sigma(I(t_o) \cdot O(t_c))$. We used the Semantic Vectors implementation of SGNS (which performs similarly to the fastText implementation across a range of analogical retrieval benchmarks (Cohen and Widdows, 2018)) to train 250-dimensional embeddings, with a sliding window radius of two, on the complete set of full sentences from the corpus described in §3 as the training corpus. As previously mentioned, multi-word phrases corresponding to named entities recognized by the PubTator system in these sentences are concatenated by underscores, and consequently receive a single vector representation.

ESD. With ESD, a shallow neural network is trained to estimate the probability of encountering the object, $o$, of a subject-predicate-object triple $sPo$. The training objective involves maximizing this probability for true objects, $P(o \mid s, P)$, and minimizing it for randomly drawn counterexamples $\neg o$, $P(\neg o \mid s, P)$. We adapted the Semantic Vectors implementation of ESP to encode dependency paths, with binary vectors as representational basis (Widdows and Cohen, 2012) and the non-negative normalized Hamming distance (NNHD) to estimate the similarity between them:

$$\mathrm{NNHD} = \max\left(0,\ 1 - \frac{2 \times \text{Hamming distance}}{\text{dimensionality}}\right)$$

With this representational paradigm, probability can be estimated as $\mathrm{NNHD}(o, s \otimes P)$, where $\otimes$ represents the use of pairwise exclusive OR as a binding operator, in accordance with the Binary Spatter Code (Kanerva, 1996).
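The NNHD similarity and the XOR binding operator described above can be sketched in a few lines. This is an illustrative sketch over Python lists of bits, not the Semantic Vectors implementation; the function names and the randomly generated vectors are our own assumptions.

```python
import random

def hamming(u, v):
    """Number of positions at which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def nnhd(u, v):
    """Non-negative normalized Hamming distance, used as a similarity."""
    return max(0.0, 1.0 - 2.0 * hamming(u, v) / len(u))

def bind(u, v):
    """Bind two bit vectors with pairwise exclusive OR."""
    return [a ^ b for a, b in zip(u, v)]

rng = random.Random(0)
DIM = 8000  # dimensionality in bits, as used for ESD in this paper

def random_bits():
    return [rng.randint(0, 1) for _ in range(DIM)]

s, p = random_bits(), random_bits()  # hypothetical subject and path vectors
o = bind(s, p)                       # an object vector that fits s and P exactly
unrelated = random_bits()

match_score = nnhd(o, bind(s, p))            # identical vectors
noise_score = nnhd(unrelated, bind(s, p))    # unrelated random vector
```

Identical vectors score 1, while two unrelated random vectors differ in about half their bits and so score close to 0. Because XOR is its own inverse, binding `o` with `p` again recovers `s`, which is what makes analogical retrieval with bound products possible.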
While ESP was originally developed to encode knowledge extracted from the literature using a small set of predefined predicates (e.g. TREATS), we adapt it here to encode a large variety (n=546,085) of dependency paths. For training, we concatenate the dependency relations (the underscored parts in Figure 2) into a single predicate token for which a vector is learned. Some examples of path tokens (concatenated dependency relations) can be seen in Table 1. Unlike the original ESP implementation where predicate vectors were held constant, we permit dependency path vectors to evolve during training. Further details on ESP can be found in Cohen and Widdows (2017). For the current work, we set the dimensionality at 8000 bits (as this is equivalent in representational capacity to 250-dimensional single precision real vectors). For ESD, Table 1 shows the nearest neighboring dependency path vectors to the bound product $I(\text{metformin}) \otimes O(\text{diabetes})$, illustrating paths that indicate the relationship between these terms, and ESD's capability to learn similar representations for paths with similar meaning.
Both SGNS and ESD were trained over five epochs, with a subsampling threshold of $10^{-5}$, a minimum term frequency threshold of 5 (which includes concatenated dependency paths for ESD), and a maximum frequency threshold of $10^{6}$.

Evaluation Methods
Table 1: Nearest neighboring dependency path vectors to the bound product I(metformin) ⊗ O(diabetes).

SCORE   PATH
0.974   controlled nmod start entity end entity amod controlled
0.935   add-on nmod start entity end entity amod add-on
0.565   reduces nsubj start entity reduces dobj requirement requirement nmod end entity
0.537   associated compound start entity end entity nsubj associated
0.516   start entity conj efficacy efficacy acl treating treating dobj end entity
0.438   treatment amod start entity treatment nmod end entity

We use a proportional analogy ranked retrieval task for both the RR and LBD tasks, following prior work as described in §2. Figure 3 visualizes this process. From a set of (X, Y) entity pairs from a knowledge base, given a term C and all terms D such that (C, D) is a pair in the set, we select n random (A, B) cue pairs from a disjoint set of pairs. We refer to (C, D) pairs as 'target pairs,' correct D completions as 'targets,' and (A, B) pairs as 'cues.' The vectors for the cue terms (A, B) and the term C are combined to produce the resulting vector v. Given an analogical pair A:B::C:D, where A and C, and B and D, are of the same semantic type, respectively, we develop cue vectors for the target D in each model as follows:

SGNS: $v = B - A + C$

ESD: $v = I(A) \otimes O(B) \otimes I(C)$

where I and O represent the input and output weight vectors of the ESD model, respectively. The SGNS method is the same as the 3COSADD method as described in Levy and Goldberg (2014b).
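The two cue constructions can be sketched as follows. The SGNS cue assumes the 3COSADD form B − A + C. The ESD cue assumes an XOR binding of I(A), O(B), and I(C), a reading inferred from the bound product I(metformin) ⊗ O(diabetes) and from the use of output weight vectors as the ESD search space. Function names and the random vectors are hypothetical.

```python
import random

def bind(u, v):
    """Pairwise exclusive OR of two bit vectors."""
    return [a ^ b for a, b in zip(u, v)]

def sgns_cue(a, b, c):
    """3COSADD-style cue for real vectors: B - A + C."""
    return [bi - ai + ci for ai, bi, ci in zip(a, b, c)]

def esd_cue(in_a, out_b, in_c):
    """Assumed ESD cue: I(A) xor O(B) xor I(C)."""
    return bind(bind(in_a, out_b), in_c)

rng = random.Random(1)

def random_bits(n=64):
    return [rng.randint(0, 1) for _ in range(n)]

# Idealized case: both pairs are linked by exactly the same path vector,
# so that O(B) = I(A) xor P and O(D) = I(C) xor P hold without noise.
path = random_bits()
in_a, in_c = random_bits(), random_bits()
out_b = bind(in_a, path)
out_d = bind(in_c, path)

cue = esd_cue(in_a, out_b, in_c)
real_cue = sgns_cue([1.0, 0.0], [1.0, 1.0], [0.0, 0.0])
```

In this idealized case I(A) ⊗ O(B) recovers the shared path, and binding it with I(C) reproduces O(D) exactly. Trained vectors only approximate these identities, which is why retrieval is performed as a nearest neighbor search rather than an exact match.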
A K-nearest neighbor (KNN) search is performed for v (using cosine distance for SGNS, NNHD for ESD) over the search space, and we record the ranks for each correct D target. The search space is constrained to those terms from our training corpus that have a vector in both ESD and SGNS, a total of about 300,000 terms. For ESD, this space consists of the output weight vectors for each concept. The desired outcome is for the correct targets to be highly similar to the analogy cue vector v, such that the highest ranks are assigned to the correct target terms D in a search over the entire vector space. We perform this KNN search for every (X, Y) pair in the knowledge base and record the ranks for correct targets. We then compare the ranks of terms D across both vector spaces; the higher the ranks, the better the model is at capturing relational similarity.

Table 2 shows, for each knowledge base, how many total unique X terms and total (X, Y) pairs are used for each task. Additionally, we show the average number of correct Y terms per X and the maximum number of correct Y terms per X. For the relationship retrieval task, we consider those (X, Y) pairs which are connected by at least one dependency path in our corpus. Meanwhile, (X, Y) pairs for the LBD task must not be connected by a dependency path in the corpus (we treat these held-out pairs as a proxy for estimating the quality of novel hypotheses). We know from the (X, Y) pair's presence in the knowledge base that it is a gold standard pair for the given relationship type, but from the models' perspective this information is not available from the text alone. Thus, we believe it is a good test of the models' ability to generate plausible hypotheses.
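Recording the ranks of correct targets amounts to sorting the search space by similarity to the cue vector. The sketch below is a minimal illustration with a hypothetical four-term search space and cosine similarity. For ESD the same procedure would apply with NNHD over binary vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_targets(cue, search_space, targets, similarity=cosine):
    """Return the 1-based rank of each correct target term.

    search_space maps each candidate term to its vector; rank 1 is the
    term most similar to the cue.
    """
    ranked = sorted(search_space,
                    key=lambda term: similarity(search_space[term], cue),
                    reverse=True)
    position = {term: i + 1 for i, term in enumerate(ranked)}
    return {t: position[t] for t in targets if t in position}

# Hypothetical toy search space; "d1" and "d2" play the role of correct D terms.
space = {
    "d1": [1.0, 0.0],
    "d2": [0.0, 1.0],
    "other_1": [-1.0, 0.0],
    "other_2": [0.5, 0.5],
}
ranks = rank_targets([1.0, 0.2], space, targets={"d1", "d2"})
```

Here "d1" is the nearest neighbor of the cue, while "d2" is outranked by one distractor and receives rank 3.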
To reiterate, the methodology for both the relationship retrieval and literature-based discovery evaluations is the same; the only difference is in which pairs of terms from each knowledge base are used for evaluation data.
Table 2: Number of unique X terms, total (X, Y) pairs, and mean and maximum number of correct Y terms per X for each knowledge base and task.

                             Relationship Retrieval                     Literature-based Discovery
Knowledge base               Total X  Total Pairs  Mean Y/X  Max Y/X    Total X  Total Pairs  Mean Y/X  Max Y/X
Chem-Gene
  Gene Targets (DrugBank)    1626     6290         4         107        3569     37162        10        420
  PGKB                       535      2089         4         48         1563     28053        18        144
  Agonists (TTD)             148      172          1         3          307      462          2         7
  Antagonists (TTD)          188      200          1         2          508      620          1         5
  Gene Targets (TTD)         1179     1436         1         7          4088     6430         2         15
  Inhibitors (TTD)           522      669          1         7          1273     2082         2         15
Chem-Disease
  ...

We examine the role of increasing the number of cues in improving retrieval. For example, for a given (C, D) target pair, we can combine vectors for multiple (A, B) pairs with the C term vector to produce a final cue vector that is closer to the target D. When multiple cues are used, we superpose the cue vectors for each of the cues and normalize the resulting vector, with normalization of real vectors to unit length in SGNS, and normalization of binary vectors using the majority rule, with ties split at random, in ESD. Cues are always selected from the subset of knowledge base pairs that co-occur in our training corpus. We ensure that none of the (A, B) cue terms overlap with each other, nor with the (C, D) target terms, so that self-similarity does not inflate performance. We produced results for a range of 1, 5, 10, 25, and 50 cues, finding that the best results come from using 25 cues; we only report these resulting scores in §7.
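The two superposition schemes described above can be sketched as follows: summation followed by unit-length normalization for real vectors, and a bitwise majority vote with random tie-breaking for binary vectors. Function names are our own, and the inputs are toy values.

```python
import math
import random

def superpose_real(vectors):
    """Sum real-valued cue vectors and normalize the result to unit length."""
    total = [sum(column) for column in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total

def superpose_binary(vectors, rng):
    """Combine bit vectors by majority rule, splitting ties at random."""
    n = len(vectors)
    result = []
    for column in zip(*vectors):
        ones = sum(column)
        if 2 * ones > n:
            result.append(1)
        elif 2 * ones < n:
            result.append(0)
        else:
            result.append(rng.randint(0, 1))  # tie: choose 0 or 1 at random
    return result

real_cue = superpose_real([[3.0, 0.0], [0.0, 4.0]])
binary_cue = superpose_binary([[1, 1, 0], [1, 0, 0], [1, 0, 1]],
                              random.Random(0))
```

With an odd number of cues, as in the 25-cue setting, no ties arise in the majority vote, so the random tie-breaking only matters for even cue counts such as 10 or 50.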
As a baseline inspired partly by Linzen (2016), we compute the similarity of vectors for B and D terms and C and D terms compared directly to each other, omitting the analogical task. The intuition here is that C and D terms are potentially close together in the vector space merely due to frequent co-occurrence in the corpus, and any analogical reasoning performance is merely relying on that fact. Meanwhile, terms B and D can be close together in the vector space simply because they are the same semantic type, and thus occur in similar contexts. In this case, relational analogy might not explain the performance, but mere distributional similarity. In the B:D comparison setting, cues B are added together to create a single cue vector with which to perform the KNN ranking over terms in which to find the target term D. These cue terms B are extracted from the same A, B cue pairs as those used for the full analogy setting to ensure a reasonable comparison across methods. In the C:D comparison setting, no cues are aggregated.

Results
We present qualitative and quantitative results for each vector space model's ability to represent and retrieve relational information.
Qualitative Results. Table 3 shows a side-by-side comparison of the top 10 retrieved terms given the vector for the term risperidone composed with 25 randomly selected (drug, indication) cues from SIDER. The goal is to complete the proportional analogy corresponding to the treatment relationship. Of the top 10 terms retrieved in the ESD vector space, 4 are correct completions to the analogy, while 3 more are plausible completions based on literature. 'Tardive oromandibular dystonia,' while of the correct semantic type targeted by this analogy, is actually a side effect of risperidone. A majority of the retrieved results, however, are known or plausible treatment targets. Meanwhile, most of the top 10 terms retrieved by SGNS are names of other drugs that are similar to risperidone. Additionally, 'psychiatric and visual disturbances' and 'tardive dyskinesia' are side effects of risperidone, not treatment targets. Notably, all of the results retrieved with ESD are of the correct semantic type, i.e., they are disorders, while SGNS retrieves a mix of drugs and side effects.
Table 3: Top 10 results for a K-nearest neighbor search over terms for treatment targets for the drug risperidone (an antipsychotic drug), using 25 (drug, indication) pairs from SIDER as cues. Bolded terms are correct targets, i.e., they are listed as treatment targets for risperidone in SIDER. *: a disorder that risperidone treats or might treat, based on external literature or a synonym for a target from SIDER; ×: a chemical, i.e., something that could not be a treatment target for a drug.

rank  ESD (ours)                                  SGNS
1     separation anxiety                          risperidone ×
2     schizophrenia                               olanzapine ×
3     depressed state                             quetiapine ×
4     bipolar mania                               aripiprazole ×
5     tardive oromandibular dystonia              clozapine ×
6     treatment of trichotillomania *             psychiatric and visual disturbances
7     pervasive developmental disorder (NOS) *    ziprasidone ×
8     borderline personality disorder             amisulpride ×
9     psychotic disorders                         paliperidone ×
10    mania                                       tardive dyskinesia

Quantitative Results. For each C term in each evaluation set, we record the ranks of all D target terms resulting from the K-nearest neighbor search. For ease of comparison, we normalize all raw ranks by the length of the full search space (324,363 terms in total), and then subtract this value from 1 so that lower ranks (i.e., better results) are displayed as higher numbers, for ease of interpretation. For a baseline score, we ran a simulation in which the entire search space was shuffled randomly 100 times, and recorded the median ranks of multiple target D terms, given some C. We find that the median rank for D terms in a randomly shuffled space tended toward the middle of the ranked list. Thus, the baseline score is established as 0.5; any score lower than this means the model performed worse than a random shuffle at retrieving target terms. In Table 4, 1 is the highest possible score, and 0 is the lowest. We report results at 25 (A, B) cues, the setting for which performance was best for both ESD and SGNS. 'Full' in Table 4 refers to evaluation with a full A:B::C:D analogy, while 'B:D' refers to the baseline that compares vectors for terms directly, rather than using relational information. We do not report C:D comparison results, as they were categorically worse than both Full and B:D results.
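The rank normalization and the random-shuffle baseline can be sketched as follows. This is an illustrative simulation rather than the authors' script: the function names and the choice of five targets per query are assumptions, and drawing target ranks without replacement stands in for shuffling the whole search space and reading off the target positions.

```python
import random
import statistics

SPACE_SIZE = 324363  # size of the full search space reported in the text

def normalized_score(rank, space_size=SPACE_SIZE):
    """Map a raw rank to a score where higher values mean better retrieval."""
    return 1.0 - rank / space_size

def random_baseline(space_size=SPACE_SIZE, n_targets=5, n_shuffles=100, seed=0):
    """Estimate the score obtained when targets are ranked at random."""
    rng = random.Random(seed)
    medians = []
    for _ in range(n_shuffles):
        # Ranks of the targets in one random ordering of the search space.
        ranks = rng.sample(range(1, space_size + 1), n_targets)
        medians.append(statistics.median(ranks))
    return normalized_score(statistics.median(medians), space_size)

top_score = normalized_score(1)
baseline = random_baseline()
```

A target retrieved at rank 1 scores just under 1, and the simulated baseline lands near 0.5, which matches the reasoning behind using 0.5 as the chance-level reference.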

Discussion
The results in Table 4 show that ESD outperforms SGNS on the RR task for 18 of 19 databases, and for 17 of 19 databases on the LBD task. It is clear that literature-based discovery is harder than relationship retrieval, as the scores are generally lower across the board for this task. We discuss the results for each task separately.

Relationship retrieval
For a total of 12 out of 19 sets, ESD on full analogies outperforms ESD on direct B:D comparisons, suggesting that the model has learned generalizable relationship information for these types of relations rather than relying on distributional term similarity. Because gene-gene pairs consist of entities of the same semantic type, it can be argued that B:D similarity should be very high, and yet scores are higher for the full analogy over the B:D baseline for most of these sets, for both ESD and SGNS. For SIDER side effects, the B:D baseline for ESD shows higher scores than the full analogy for both LBD and RR; one reason for this could be that there is a high degree of side effect overlap between drugs, and so the side effect terms themselves are highly similar to each other.

Literature-based discovery
The best performance on a majority of the sets comes from the ESD B:D model, suggesting that the model relies on term similarity over relational information for performance. Although SGNS does not perform the best overall, the full analogy model tends to outperform its B:D counterpart, suggesting that SGNS has managed to extrapolate relational information to the retrieval of held-out targets. As previously mentioned, performance on this task is made difficult by the lack of normalization of concepts across our datasets. Additionally, as Table 3 shows, several top-ranked terms are plausible analogy completions, but do not appear as gold-standard targets in the databases. Considering the case of SIDER, which is built from automatically extracted information (not human-curated), the plausible results here are missing from the database but are supported by evidence from published papers (e.g. Oravecz and Štuhec (2014)).

Conclusion
We have compared two vector space models of language, Embedding of Structural Dependencies and Skip-gram with Negative Sampling, for their ability to represent biomedical relationships from literature in an analogical retrieval task. Our results suggest that encoding structural information in the form of dependency paths connecting biomedical entities of interest can improve performance on two analogical retrieval tasks, relationship retrieval and literature-based discovery. In future work, we would like to compare our methods with knowledge base completion techniques using contextualized vectors from language models, as in Bosselut et al. (2019), as another method applicable to literature-based discovery.