SUPP.AI: finding evidence for supplement-drug interactions

Dietary supplements are used by a large portion of the population, but information on their pharmacologic interactions is incomplete. To address this challenge, we present SUPP.AI, an application for browsing evidence of supplement-drug interactions (SDIs) extracted from the biomedical literature. We train a model to automatically extract supplement information and identify such interactions from the scientific literature. To address the lack of labeled data for SDI identification, we use labels of the closely related task of identifying drug-drug interactions (DDIs) for supervision. We fine-tune the contextualized word representations of the RoBERTa language model using labeled DDI data, and apply the fine-tuned model to identify supplement interactions. We extract 195k evidence sentences from 22M articles (P=0.82, R=0.58, F1=0.68) for 60k interactions. We create the SUPP.AI application for users to search evidence sentences extracted by our model. SUPP.AI is an attempt to close the information gap on dietary supplements by making up-to-date evidence on SDIs more discoverable for researchers, clinicians, and consumers. An informational video on how to use SUPP.AI is available at: https://youtu.be/dR0ucKdORwc


Introduction
More than half of US adults use dietary supplements (Kantor et al., 2016). Supplements include vitamins, minerals, enzymes, and other herbal and animal products. Supplements and pharmaceutical drugs, when taken together, can cause adverse interactions (Sprouse and van Breemen, 2016; Asher et al., 2017; Ronis et al., 2018). Some studies describe the prevalence of supplement-drug interactions (SDIs) in the hospital setting (Levy et al., 2016, 2017a) or among groups such as patients with cancer (Alsanad et al., 2014), cardiac disease (Karny-Rahkovich et al., 2015), HIV/AIDS (Jalloh et al., 2017), or Alzheimer's disease (Spence et al., 2017). However, these studies largely rely on manual curation of the literature, and are slow and expensive to produce and update. It is also difficult to aggregate their results, and researchers, clinicians, and consumers can lack appropriate up-to-date information to make informed decisions about supplement use.
A resource that provides experimental evidence for SDIs could serve as a good intermediary tool, allowing experts to quickly access information and translate it for healthcare providers and consumers. Such a tool could ease the bottleneck of manual curation by directing researcher attention to the most pertinent and novel interactions appearing in recent trials and case reports. Our goal is to create such a resource using state-of-the-art methods in NLP and IE, and allow users to better identify appropriate uses of supplements as well as risks for SDIs.
Automated approaches have been used to extract drug-drug interactions (DDIs) from literature and other documents (Tari et al., 2010; Percha et al., 2011; Segura-Bedmar et al., 2011; Kim et al., 2014; Noor et al., 2017; Lim et al., 2018), complementing broadly-used but primarily manual methods (Grizzle et al., 2019). We expand upon this work to automatically extract evidence for SDIs, as well as supplement-supplement interactions (SSIs), from a large corpus of 22M biomedical and clinical texts derived from Semantic Scholar. We leverage labeled datasets for DDI identification for supervision, and train a model that transfers to the related task of identifying supplement interactions. We surface the resulting evidence on SUPP.AI for browsing and search.
To summarize, our contributions are:
1. A model for identifying SDI/SSI evidence,
2. A dataset of 195k evidence sentences supporting supplement interactions, publicly accessible for download or via a web API, and
3. SUPP.AI, an application for browsing and searching the extracted evidence.

Supplement interaction browser
Information on supplement interactions has immediate implications for public health, which can only be realized by making the data easily accessible to any interested researcher, clinician, or consumer. We note that many medical providers in developing countries do not have subscriptions to clinical databases such as TRC and UpToDate, and may lack an easy way to identify possible supplement interactions before prescribing drugs to their patients.
To fill this gap, we develop SUPP.AI (available at https://supp.ai/), an application for browsing evidence of supplement interactions extracted from clinical and biomedical literature. SUPP.AI allows users to:
• Search for supplements or drugs,
• Search through potential interactions,
• Browse evidence sentences with supplement and drug entities highlighted,
• Navigate links to source papers.
We design SUPP.AI to be a rapid way for users to access and search extracted SDI and SSI evidence. Our goal for this application is to provide a high quality, broadly-sourced, up-to-date, and easily accessible platform for searching through SDI and SSI evidence, while providing sufficient information for users to judge the quality of each piece of evidence. In Section 3, we describe the NLP pipeline used to extract evidence from scientific papers. Below, we describe the user interface and data features of SUPP.AI.

User interface
Besides the main search page seen by users when they first navigate to the site, SUPP.AI consists of two other types of pages: entity and interaction pages. Entity pages provide information about one supplement or drug, and a list of potential interacting entities, sorted by quantity of evidence. We provide information such as synonyms, drug trade names, and definitions about each entity upon hover over or expansion. Interaction pages display all discovered pieces of evidence supporting an interaction between a pair of entities. The evidence is sorted by additional features extracted from source papers, such as the level of evidence and recency, discussed in Section 2.2. Figure 1 shows the interface, with results for the ginkgo supplement. Results on the entity page (left) list 140 possible interactions to entities such as Warfarin and Nitric Oxide. When a result is selected, the interaction page is displayed (right), showing evidence sentences supporting the interaction along with metadata and links to each source paper. Spans linked to supplement and drug entities in evidence sentences are highlighted. To see more context or detail about the interaction, the user can navigate to the source paper to continue reading.

Supporting data for search
We extract additional paper metadata as a way to judge evidence quality. From Semantic Scholar, we retrieve the paper title, authors, publication venue, and year of publication. Medical Subject Headings (MeSH) tags associated with each paper are used to determine whether its results are derived from clinical trials, case reports, or animal studies. We also attempt to identify the retraction status of each paper, again using MeSH tags. Evidence sentences are ordered and presented based on associated paper metadata, prioritizing non-retracted studies, clinical trials, human studies, and recency (year of publication).
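The ordering described above can be sketched as a multi-key sort. This is a minimal illustration, not the site's actual implementation: the `Evidence` class, its field names, and the strict precedence among criteria are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # Hypothetical record; field names are assumptions for illustration.
    sentence: str
    year: int
    retracted: bool = False
    clinical_trial: bool = False
    human_study: bool = False


def sort_key(ev):
    # Tuples compare element by element and False sorts before True,
    # so flags that should rank first when True are negated.
    return (ev.retracted, not ev.clinical_trial, not ev.human_study, -ev.year)


def rank_evidence(items):
    """Return evidence ordered by: non-retracted, clinical trial, human study, recency."""
    return sorted(items, key=sort_key)
```

Under these assumptions, a retracted paper is always listed last, and among otherwise equal items the most recent appears first.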
Using the RxNorm relationship has_tradename via the Unified Medical Language System (UMLS) Metathesaurus (Bodenreider, 2004), we derive trade names associated with drug ingredients, e.g. Prozac and Sarafem are trade names of the ingredient fluoxetine. Trade drugs are associated with active drug ingredients and indexed for search. Users can query a trade name rather than an active ingredient and be directed to the relevant interactions.
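As a rough sketch of how trade-name search might behave, the lookup below maps a queried trade name to its active ingredient. The dictionary and the function `resolve_query` are hypothetical; only the Prozac/Sarafem/fluoxetine example comes from the text.

```python
# Hypothetical index derived from has_tradename relations; the two
# entries below are the example given in the text.
TRADENAME_TO_INGREDIENT = {
    "prozac": "fluoxetine",
    "sarafem": "fluoxetine",
}


def resolve_query(query):
    """Map a trade name to its ingredient; pass other queries through unchanged."""
    normalized = query.strip().lower()
    return TRADENAME_TO_INGREDIENT.get(normalized, normalized)
```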

Data & API
Data on the site are periodically updated as new papers are incorporated into the Semantic Scholar corpus. Snapshots of the data are available for download at https://api.semanticscholar.org/supp/. Live data on the site, which are updated more frequently, can be accessed through our search API, documented at https://supp.ai/docs/api. Additionally, we provide training data, evaluation data, and the curated drug/supplement identifier lists (discussed in Section 3) used to produce the dataset of interactions at https://github.com/allenai/sdi-detection. We encourage others to reuse our data and model to improve information availability around supplement interactions and safety.

Methods
An overview of our NLP pipeline is given in Figure 2. We first retrieve Medline-indexed articles using the Semantic Scholar API, and pre-process the text to generate candidate evidence sentences (Section 3.1). We then use our DDI-detection model, a neural network classifier based on BERT (Devlin et al., 2018) and fine-tuned on labeled DDI data from Ayvaz et al. (2015) (Section 3.2), to classify sentences for the existence of an interaction. Sentences classified as positive by our model are collated and surfaced on SUPP.AI (Section 2).

Generating candidate evidence
Approximately 22M Medline-indexed articles are downloaded using the Semantic Scholar API. The scispaCy library (Neumann et al., 2019) is used to perform sentence tokenization, NER, and entity linking over all paper abstracts. Entity mentions are linked to Concept Unique Identifiers (CUIs) from the UMLS Metathesaurus. An example sentence from Vaes and Hendeles (2000)  Of these linked entities, we preserve entities on a list of curated supplements and drugs (entities in blue). We generate these curated lists in a semi-automatic fashion, by querying the children of UMLS supplement and drug classes and performing fuzzy name matching to known supplements or drugs crawled from the web. We also perform clustering of similar entities to reduce redundancy in the final dataset, e.g., combining several variants of Vitamin D together into a single entity. Details on identifier curation and clustering are given in Appendix A.
We retain all sentences containing at least two entity mentions. For each sentence, we generate candidate evidence as each combination of two entity spans from that sentence.
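Candidate generation as described above amounts to enumerating unordered pairs of entity spans within a sentence. The sketch below assumes each span is a `(start, end, cui)` tuple; that representation and the function name are choices made for illustration.

```python
from itertools import combinations


def candidate_pairs(spans):
    """Return every unordered pair of entity spans in one sentence.

    `spans` is a list of (start, end, cui) tuples (an assumed format).
    Sentences with fewer than two mentions yield no candidates.
    """
    return list(combinations(spans, 2))
```

A sentence with n linked mentions therefore produces n(n-1)/2 candidates, which is why the same sentence can appear many times with different highlighted pairs.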

DDI-detection model
We train a DDI-detection model to predict whether a given candidate sentence provides evidence of an interaction between two drug entities. Our DDI-detection model uses pre-trained BERT models (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) to encode input sequences. These models have been shown to be effective at domain transfer, and are able to achieve high performance using small amounts of task-specific annotated data. In particular, we use the large version of the pre-trained RoBERTa model, a further-optimized BERT model, that has approximately 340M parameters (Liu et al., 2019). We fine-tune the pre-trained embeddings of the RoBERTa language model using labeled data for DDI classification, and we call the resulting model RoBERTa-DDI.
Input layer: The input layer consists of the sequence of byte-pair encoding word pieces (Radford et al., 2019) in a sentence. We replace entity mention spans with the special tokens [Arg1] and [Arg2]. This helps generalization by preventing the model from memorizing entity pairs with positive interactions in the training set. For example: where [Arg1] and [Arg2] replace the spans "hormonal contraceptives" and "acetaminophen" respectively. We add special tokens [CLS] and [SEP] at the beginning and end of each sentence to leverage their representations learned in pretraining. At prediction time, candidate sentences are masked similarly and fed to the trained model.
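The masking step can be illustrated with a small helper that replaces two character spans with the special tokens. This is a sketch under assumptions: spans are taken to be non-overlapping `(start, end)` character offsets, and the sentence in the usage note is invented for illustration rather than taken from the training data.

```python
def mask_entities(sentence, span1, span2):
    """Replace two non-overlapping character spans with [Arg1] and [Arg2].

    span1 is masked as [Arg1] and span2 as [Arg2], whichever comes
    first in the text.
    """
    replacements = [(span1[0], span1[1], "[Arg1]"), (span2[0], span2[1], "[Arg2]")]
    # Substitute the later span first so the earlier offsets stay valid.
    for start, end, token in sorted(replacements, reverse=True):
        sentence = sentence[:start] + token + sentence[end:]
    return sentence
```

For the invented sentence "Patients taking hormonal contraceptives were also given acetaminophen.", masking the two mention spans gives "Patients taking [Arg1] were also given [Arg2]."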
Model architecture: As the name implies, RoBERTa-DDI uses the pre-trained RoBERTa representations (Liu et al., 2019) to encode input sequences. We refer readers to Liu et al. (2019), Devlin et al. (2018), and Vaswani et al. (2017) for more details on BERT and transformer architecture. For the RoBERTa-DDI model, we add a dropout layer followed by one feedforward (output) layer with a softmax non-linearity, which takes the representation of the [CLS] token at the top transformer layer as input and outputs probabilities for labels {0, 1}, where 1 indicates an interaction.
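The output layer described above can be written compactly as follows. The symbols are notation introduced here, not taken from the paper: h is the top-layer representation of the [CLS] token with hidden size d, and W and b are the weights and bias of the single feedforward layer.

```latex
\[
\mathbf{p} = \operatorname{softmax}\!\bigl(W \,\operatorname{dropout}(\mathbf{h}_{\mathrm{[CLS]}}) + \mathbf{b}\bigr),
\qquad W \in \mathbb{R}^{2 \times d},\quad \mathbf{b} \in \mathbb{R}^{2},
\]
\[
\mathbf{p} = (p_0, p_1), \qquad p_0 + p_1 = 1,
\]
```

Here p_1 is the predicted probability that the sentence expresses an interaction between the two masked entities.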
Model training: Due to similarities between DDIs and SDIs/SSIs, we hypothesize that a classifier trained to identify DDI evidence should perform well in identifying SDI and SSI evidence. We therefore take advantage of existing labeled data for categorizing DDIs to fine-tune the model. We use pre-trained weights distributed by the authors of Liu et al. (2019), and further fine-tune the model parameters (as well as parameters of the output layer) using labeled DDI data from the Merged-PDDI dataset (Ayvaz et al., 2015).
In particular, we use training data from the DDI-2013 (Segura-Bedmar et al., 2013) and NLM-DailyMed (Stan et al., 2014) datasets, as they are relatively large and contain evidence sentences with annotated drug mention spans. The DDI-2013 dataset consists of sentences extracted from DrugBank and Medline; the NLM-DailyMed dataset draws sentences from cardiovascular drug product labels retrieved from DailyMed. Both datasets contain multi-class labels for different types of interactions. We distinguish between detection, a binary classification problem where the goal is to determine whether an interaction exists or not, and multi-class classification, where the goal is to determine the type of interaction. In this work, we focus on detection, but provide results for a variant of our model trained on classification that obtains SOTA performance compared to prior work.
For detection, we collapse labels corresponding to all interaction types (e.g., mechanism, advise, effect, etc.) into binary labels of 0 and 1, where 0 means no interaction, and 1 means an interaction of some type exists. Collapsing the positive labels is necessary for training one DDI-detection model on both the DDI-2013 and NLM-DailyMed datasets, since the two datasets are annotated with inconsistent interaction types. We preserve the train/test splits used in Ayvaz et al. (2015), and create a development set from the training set for iteration on model design and tuning.
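Collapsing typed interaction labels into binary detection labels might look like the sketch below. The negative marker `"none"` is an assumption about how non-interacting pairs are encoded; the typed labels in the usage are the ones named in the text.

```python
def collapse_label(label, negative="none"):
    """Map a typed interaction label to 0 (no interaction) or 1 (any interaction).

    `negative` is an assumed marker for non-interacting pairs; a missing
    label (None) is also treated as no interaction.
    """
    if label is None or label.strip().lower() == negative:
        return 0
    return 1
```

Because every positive type maps to 1, the two datasets can be pooled even though their type inventories differ.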
A sentence from the training data can contain multiple drug entities. For training, we generate pairwise combinations of drug mention spans in each sentence. We note that many sentences are seen multiple times by our model with different labeled spans. Due to combinatorial explosion, and to prevent our model from learning excessively from a few instances containing many entity mentions, we restrict the training data to sentences containing at most 100 pairwise entity combinations. Table 1 shows the resulting data splits for the two datasets.
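To make the cap concrete, the sketch below checks whether a sentence stays within the limit, assuming pairs are counted as unordered combinations of mentions (the text does not state this explicitly). The function name and default are illustrative.

```python
from math import comb

MAX_PAIRS = 100  # cap stated in the text


def keep_for_training(num_mentions, max_pairs=MAX_PAIRS):
    """Keep a sentence only if its unordered mention pairs do not exceed the cap."""
    return comb(num_mentions, 2) <= max_pairs
```

Under this counting, a sentence with 14 mentions yields 91 pairs and is kept, while one with 15 mentions yields 105 pairs and is dropped.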

Results & evaluation
Of the 22M articles we retrieve, around 4.6M abstracts contain candidate sentences. After initial filtering, 33.0M candidate sentences containing supplement entity mentions are classified by RoBERTa-DDI. Around 625k (1.9%) of these sentences are classified as positive for an interaction. We perform entity normalization across positive sentences based on CUI clusters, and perform additional ad hoc filtering of evidence to eliminate incorrectly detected spans resulting from poor NER and linking, such as the span "retina" linking to Vitamin A (C0040845). The resulting 195k sentences contain mentions of 2044 unique supplements and 2772 unique drugs, and provide evidence sentences for 60k interactions sourced from 133k papers.
Comparisons of model variants on DDI classification and detection (including SOTA results on both tasks) are given in Appendix B. To evaluate the transferability of DDI detection to the related task of SDI/SSI detection, we use a test set consisting of 500 sentences annotated for the presence or absence of a supplement interaction. To obtain a balanced test set despite the rare presence of a positive interaction, we sample half the instances from the set of sentences labeled as positive by a previous variant of our model based on fine-tuning BERT-large, and the other half from those labeled as negative. After manual annotation, 40% of the sampled instances were positive for an interaction. Annotation was performed by two authors without seeing model predictions, with an inter-annotator agreement of 94%. This test set was used for final evaluation, and never for model development or tuning. Table 2 shows the performance of RoBERTa-DDI on the DDI and supplement test sets. Performance on the SDI test set has precision 0.82, recall 0.58, and F1-score 0.68. Although there is performance degradation during transfer, the precision of detection remains high at 0.82.
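The reported F1-score follows from the reported precision and recall as their harmonic mean. The helper below is only a check of that arithmetic; the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Reported SDI test-set values: P = 0.82, R = 0.58.
sdi_f1 = f1_score(0.82, 0.58)  # 0.9512 / 1.40, about 0.679
```

Rounded to two decimal places this gives 0.68, matching the value reported above.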
Decrease in recall can be attributed to a larger percentage of positive instances in the SDI test set (roughly 40%, compared to 20% in the DDI training data). Another factor is the presence of incorrectly labeled entity spans in the supplements test set due to NER/linking errors. To better understand this second source of errors, we attempt to evaluate the performance of the scispaCy entity linker. Processing each sentence from the two DDI training sets using scispaCy, we determine that only 80% of drug entities from DDI-2013 and 76% from NLM-DailyMed are recognized and linked. The likelihood of supplement entities being successfully linked is likely lower, due to sparse training data for supplement NER and linking. These numbers provide an estimate of the global ceiling on recall for our model. In future work, we aim to explore ways to improve NER and linking and assess their impact on the results of SDI detection. SDI/SSI sentences in our output set can also be labeled by biomedical expert annotators and used to further tune the model for SDI/SSI detection.

Discussion
Information describing the safety and efficacy of dietary supplements can be difficult to find. The inability to locate evidence of SDIs can challenge clinician ability to advise patients and cause risks for consumers of dietary supplements. It is our hope that extracting evidence for SDIs/SSIs from a large corpus of scientific literature and making the evidence available through an easily accessible search interface can offset some of these risks.
This work demonstrates how NLP techniques can be extraordinarily useful for extracting information and relationships specific to an application domain in healthcare. Re-purposing existing labeled data from related domains (that would be expensive to generate in a new domain) can be a way to derive maximum utility from curation efforts. Continuing, we look to investigate fine-grained interaction types, and provide better classification of the level of evidence provided by each sentence or document towards a particular SDI or SSI. We also aim to leverage similar techniques for identifying evidence of indications, contraindications, and side effects of dietary supplements from the biomedical and clinical literature, and make these discoverable on SUPP.AI.

Related Work
Consumer-facing websites such as the NIH Office of Dietary Supplements or WebMD provide facts about common supplements, but this information can be incomplete and may not support researcher or clinician needs. TRC Natural Medicines and UpToDate, two dedicated clinical resources, contain high-quality, curated evidence, but may not be broadly accessible due to their subscription format. Drug databases like DrugBank (Wishart et al., 2018), RxNorm (Nelson et al., 2011), and the National Drug File Reference Terminology (NDFRT) (Simonaitis and Schadow, 2010) contain only partial coverage of supplement terminology (Manohar et al., 2015b), and primarily focus on aggregating drug information.
Several prior studies have experimented with extracting safety information of supplements and supplement interactions from various forms of text. Zhang et al. (2015) employ machine learning techniques to filter supplement interaction relationships in SemMedDB, a database of relationships extracted from Medline articles. Jiang et al. (2017) develop a model for identifying adverse effects related to dietary supplements as reported by consumers on Twitter, and discover 191 adverse effects pertaining to 4 dietary supplements. Fan et al. (2016) and Fan and Zhang (2018) analyze unstructured clinical notes to predict whether a patient started, continued or discontinued a dietary supplement, which can be useful as a building block for identifying adverse effects in clinical notes (as attempted by the same authors in Fan et al. (2017) for the drug warfarin).  proposes using topic models to analyze the adverse effects of dietary supplements as mentioned in the Dietary Supplement Label Database, and finds that Latent Dirichlet Allocation models (Blei et al., 2003) can be used to group dietary supplements with similar adverse effects based on their labels. As far as we know, there are no other studies investigating the task of sentence-level identification of SDI/SSI evidence from the scientific literature. No previous work has investigated the utility of using labeled DDI data for transfer learning to SDI/SSI identification.

Limitations
There are several limitations of this work. First, we distinguish between supplements and drugs. Both supplements and drugs are pharmacologic entities, with their separate classification more attributable to marketing and social pressures rather than functional differences. However, due to this somewhat arbitrary distinction, supplement entities are not well represented in databases of pharmaceutical entities, and less information is publicly available on their interactions. We also use UMLS CUIs as a way of identifying supplement and drug entities. The lack of a standardized terminology to describe dietary supplements is discussed in Manohar et al. (2015a) and , which estimate UMLS coverage of these terms to be between 14-54%. This limitation prevents us from identifying many supplement entities. Lastly, our dependence on NLP-pipeline tools sets a performance ceiling due to unsolved problems in NER and linking. Although scispaCy is performant and detects a large number of relevant entities, our evaluations show that many supplement and drug entities are missed. A system such as MetaMapLite (Demner-Fushman et al., 2017) has higher recall, but performance is slow and there are practical challenges to using it to process large numbers of documents.

Conclusion
Insufficient regulation in the supplement space introduces dangers for the many users of these supplements. Claims of interactions are difficult to validate without links to source evidence. We create an NLP pipeline to detect SDI/SSI evidence from scientific literature, leveraging UMLS identifiers, scispaCy for NER and entity linking, BERT-based language models for classification, and labeled data from a related domain for training. We use this pipeline to extract evidence from 22M biomedical and clinical articles with high precision. The extracted SDI/SSI evidence is made searchable through a public web interface, SUPP.AI, where we integrate additional metadata about source papers to help users make decisions about the reliability of evidence. Our dataset and web interface can be leveraged by researchers, clinicians, and curious individuals to increase understanding about supplement interactions. We hope to encourage additional research to improve the safety and benefits of dietary supplements for their consumers.

such as "Dietary Supplement" (NCIT: C1505, CUI: C0242295), "Vascular Plant" (NCIT: C14336, CUI: C0682475), and "Antioxidant" (NCIT: C275, CUI: C0003402) as likely parents of supplement terms. We recursively extract child entities of these parent classes from UMLS, deriving an initial list of supplements. To improve recall, we extract supplement names from the TRC Natural Medicines database, perform fuzzy string matching to entities in UMLS, and add any identified CUIs to our list of supplements. The list is manually reviewed to remove non-supplement entities, those for which we could not identify any marketed supplement or medicinal uses. Following curation, we retain 2139 unique supplement entities. Similarly, we generate a corresponding list of drug CUIs from parent entity "Pharmacologic Substance" (NCIT: C1909, CUI: C1254351) and any UMLS entity with a DrugBank identifier.
Fuzzy name matching between drugs on drugs.com and UMLS entities is used to identify drugs and experimental chemicals missed through UMLS search alone. Due to the significantly larger number of drugs compared to supplements, manual curation of this list is impractical at this time. This process generates a list of 15252 unique drug CUIs. Any entity that is identified as both a supplement and a drug is categorized exclusively as a supplement for the purposes of this work.
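Two steps of the curation above lend themselves to a short sketch: collecting all descendants of the chosen parent classes, and resolving entities that appear on both lists in favor of the supplement label. The code uses an iterative breadth-first traversal in place of explicit recursion, and the `children_of` mapping is a hypothetical stand-in for the UMLS hierarchy; the identifiers in the usage note are placeholders, apart from the parent CUI C0242295 named in the text.

```python
from collections import deque


def collect_descendants(roots, children_of):
    """Collect all descendants of `roots` in a parent-to-children mapping.

    `children_of` is a hypothetical dict standing in for UMLS hierarchy
    lookups. Already-seen nodes are skipped, so cycles terminate.
    """
    seen = set()
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children_of.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def categorize(cui, supplement_cuis, drug_cuis):
    """Label an entity, with the supplement list taking precedence over the drug list."""
    if cui in supplement_cuis:
        return "supplement"
    if cui in drug_cuis:
        return "drug"
    return None
```

For example, with a toy hierarchy `{"C0242295": ["X1", "X2"], "X1": ["X3"]}`, `collect_descendants(["C0242295"], ...)` returns the three placeholder children, and an identifier present on both lists is labeled a supplement.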
Similar supplement and drug entities are merged, such as those with overlapping names, e.g., entities corresponding to UMLS C0006675, C0006726, C0596235, and C3540037 all describe variants of Calcium and are merged under the supplement entity C3540037 ("Calcium Supplement"). The

B DDI model performance
We train RoBERTa-DDI on a combination of DDI-2013 and NLM-DailyMed training data. In Table 3, we report the F1-scores of model variants on the test data. We show the performance of the final variant of RoBERTa-DDI (trained on both DDI-2013 and NLM-DailyMed) as well as a variant trained only on DDI-2013 training data (last column), which performs best on the DDI-2013 test set, but suffers when tested on NLM-DailyMed. We also further break down performance on the DrugBank and Medline sub-corpora within DDI-2013.
The DDI-2013 dataset is used as a benchmark dataset for DDI detection and classification, and is part of the BLUE benchmark suite (Peng et al., 2019). RoBERTa-DDI outperforms recently reported SOTA performance on DDI detection in the DDI-2013 dataset using BioBERT (Lee et al., 2019) (F1 = 0.87) (Chauhan et al., 2019). Peng et al. (2019) also report SOTA performance on the DDI-2013 classification task, achieving 0.79 micro-F1 using a tuned BERT-large model. For comparison, we show the results of RoBERTa-DDI trained on DDI-2013 multi-class classification, which achieves 0.82 micro-F1 on DDI-2013 classification. We provide previously reported SOTA performance metrics on DDI-2013 in Table 4. We note that because the interaction classes are unbalanced in the DDI-2013 dataset, reported classification micro- and macro-F1-scores in previous work are not directly comparable.
The inclusion of the NLM-DailyMed corpus increases training data diversity and should improve generalization for the task of detecting SDI/SSI evidence. Thus, although RoBERTa-DDI trained on DDI-2013 has the highest performance on the DDI-2013 test set, RoBERTa-DDI trained over all training data performs the best overall, and we use this model variant to classify evidence for SUPP.AI.