Querying Across Genres for Medical Claims in News

We present a query-based biomedical information retrieval task across two vastly different genres – newswire and research literature – where the goal is to find the research publication that supports the primary claim made in a health-related news article. For this task, we present a new dataset of 5,034 claims from news paired with research abstracts. Our approach consists of two steps: (i) selecting the most relevant candidates from a collection of 222k research abstracts, and (ii) re-ranking this list. We compare the classical IR approach using BM25 with more recent transformer-based models. Our results show that cross-genre medical IR is a viable task, but incorporating domain-specific knowledge is crucial.


Introduction
In recent years, the general population has increasingly sought out online sources for medical information (Fox, 2011; Fox and Duggan, 2013). Among the various types of sources, they mostly rely on online news articles, which often serve to disseminate medical findings from research studies (Medlock et al., 2015). It is, however, important to identify the source of a medical claim, especially in times of pervasive misinformation and during a pandemic, when people may not be able to visit a healthcare professional. When reporting on a medical study, many news articles cite the original study either by embedding hyperlinks or by explicitly showing a citation, thus providing the reader with critical markers of credibility (Fogg et al., 2009). Not all articles do this, however. Here, we present our work on finding scientific research publications that support the primary claims made in a health-related news article. We design it as cross-genre query-based (or ad hoc) information retrieval (IR): given a medical claim made in a news article, retrieve the research publication supporting it.
(1a) Tea drinkers live longer. †
(1b) Tea drinkers live longer, with the biggest boost linked to green variants. ‡
(2) Tea consumption was associated with reduced risks of atherosclerotic cardiovascular disease and all-cause mortality, especially among habitual tea drinkers.
Table 1: Cross-genre medical IR, where the claims (1a and 1b) are presented in lay terms in the news and serve as queries. The support (2) is provided in a research publication, expressed in specialist language.
When scientific research makes its way out of conferences and journals into news meant for general consumption, the information is presented in a drastically different language. The general audience is often poorly equipped for specialist language comprehension, to the extent that adapting domain-specific language for a general audience has been treated as a discipline in itself (Swales, 2000). This change is thus necessary on the one hand, but on the other, it increases the difficulty of IR, especially for token-based methods such as BM25 (Robertson et al., 2009).
In this work, we present a dataset (Sec. 2) of claims made in medical news articles, where each claim is associated with at least one peer-reviewed research publication supporting it. For each claim, we present an IR task in Sec. 3: search for the corresponding publication in a large corpus of medical research literature. The task itself is divided into two stages: (i) retrieve a candidate list of 500 abstracts from a large corpus, and (ii) re-rank them to obtain the correct publication. After discussing our findings, we present an overview of related research in Sec. 4 before concluding.

Dataset
Over a period of 18 months (Oct 2018 - March 2020), we collect 72,028 news articles from the RSS feeds of several medical news websites and from the health category of popular general news websites. To ensure that only articles citing peer-reviewed scientific publications are retained, we check every document for hyperlinks to domains listed by Wikipedia as medical journals 1 or appearing in Alexa's list of top scientific publications, 2 leaving 17,712 articles (24.6%) in our collection. Further, many articles were aggregations of disparate medical studies. We discard these using a combination of heuristics and manual verification, retaining only those articles that report on a single study or on a series of closely related research studies. For articles retained after this step, the headline reflects the focal claim or finding of the cited research. This was verified by three independent readers who were given a random sample of 371 articles (7.4% of the dataset); all three agreed that for each one of these 371 articles, the headline did, indeed, present the main research finding. Since some articles cite using embedded hyperlinks, while others offer a reference section at the end of the article, we are able to collect the abstracts of the cited research.
Our final dataset 3 consists of tuples of the form (h, {a i }), where h is the headline from a news article, and a i are the abstracts of the research publications cited by that article. The publication titles are retained as well. There are 5,034 headlines and 4,566 abstracts (since some research publications are cited by multiple news articles). Fig. 1 shows the distribution of the news headlines over the top ten news domains in our collection.
Since not all research is open-access, we restrict ourselves to collecting the abstracts instead of the entire publication. We believe this does not hinder the task, since it is reasonable to assume that the primary findings of a research study are mentioned in the abstract. We collect these abstracts through PubMed. 4 Further, to mimic the realistic scenario where a human reader or fact-checker needs to retrieve the correct publication (i.e., the research actually upholding the claim made in a news article) from a vast collection, we also add 217,665 spurious abstracts from the biomedical research literature. We collect these abstracts from the non-commercial-use open-access subset of PubMed Central, 5 to serve as the negative samples in our IR task.
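The hyperlink-based filtering step described in Sec. 2 can be sketched as follows. This is a minimal illustration only: the domain set and the function name are hypothetical stand-ins for the Wikipedia and Alexa lists used in the actual pipeline.

```python
import re

# Hypothetical allow-list standing in for the Wikipedia medical-journal
# domains and the Alexa top-publication domains described above.
JOURNAL_DOMAINS = {"nejm.org", "thelancet.com", "bmj.com", "nature.com"}

# Capture the host part of each hyperlink, dropping an optional "www." prefix.
LINK_RE = re.compile(r'https?://(?:www\.)?([^/\s"\']+)')

def cites_peer_reviewed(article_html: str) -> bool:
    """Keep an article only if it hyperlinks to a known journal domain."""
    for domain in LINK_RE.findall(article_html):
        if domain.lower() in JOURNAL_DOMAINS:
            return True
    return False
```

In practice this check is followed by the heuristic and manual filtering of aggregation articles described above.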

Experiments
Our task is formulated in two stages, similar to other recent ad hoc IR work (MacAvaney et al., 2019a; Yilmaz et al., 2019; Dai and Callan, 2019): a token-based first step to obtain a candidate list, followed by a final ranking with a transformer-based model. In spite of recent advances, transformer-based models are large, and using them to compare each query with each document is computationally expensive even for a small corpus. Thus, the two-stage approach remains a prudent choice.

Candidate Selection
Given the size of the corpus of biomedical abstracts (> 222k), our goal in this first stage is to reduce the search space for the final ranking task. For this, we consider the classical IR approach of token-based bag-of-words models (e.g., BM25) as well as embedding-based models that encode the claim (i.e., the news headline) and the research abstract in the same space. For the latter, we use the inner product of the embedded representations to measure the similarity between a headline and an abstract (Chang et al., 2020). Since most news articles cite only one research publication, and no article in our dataset cites more than three, precision is not an important measure for this task. Instead, we measure recall@k (k = 1, 5, 20, 100, 500). As argued in other recent two-stage approaches (Nie et al., 2019; Soleimani et al., 2020), a high recall is crucial here, as the correct abstract would otherwise be left out of the final ranking.
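The recall@k measure used throughout can be computed as follows (a minimal sketch; the function names are ours):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant abstracts that appear in the top-k candidates."""
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def mean_recall_at_k(queries, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per claim."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in queries) / len(queries)
```

Since nearly every claim has a single relevant abstract, recall@k here is effectively the fraction of claims whose cited abstract survives into the top k.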
As part of the token-based approaches, we use Okapi BM25 (Robertson et al., 2009) and a variant, BM25+ (Lv and Zhai, 2011a). We employ the Rank-BM25 tool, 6 based on Trotman et al. (2014). We evaluate these with and without preprocessing, where the preprocessing comprises lowercasing, removing function words, and stemming. 7 We also notice that several abbreviations used in medical news are not commonly found in the research literature (BP for "blood pressure", Tx for "treatment", etc.). If such an abbreviation appears more than twice in our dataset, we map it to its expansion, based on a dictionary of medical abbreviations. 8

For the embedding-based approaches, we use two pre-trained models to encode the claim h and the abstracts a i: BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), from Sentence-BERT (Reimers and Gurevych, 2019). We obtain the ranked list of abstracts pertinent to the claim based on the inner products ⟨h, a i⟩. The pre-trained models are fine-tuned on the Natural Language Inference (NLI) and the Semantic Textual Similarity (STS) benchmark datasets (Cer et al., 2017). Since our dataset comprises medical news and biomedical literature while BERT and RoBERTa are trained on general texts, we also use the Bio+Clinical BERT (Alsentzer et al., 2019) model and tune it on the NLI and STS benchmark datasets. Additionally, we tune the Bio+Clinical model on the medical STS dataset (Wang et al., 2018).

It is worth noting that many medical research abstracts are further divided into labeled sections (e.g., 'Background', 'Results', 'Conclusion'); in our dataset, 36% of the abstracts feature such labels. We conduct three experiments that vary which parts of the abstract are encoded, the first of which encodes the whole abstract regardless of labels. Table 2 shows that token-based models significantly outperform all embedding models in the candidate selection stage, with BM25+ achieving the best recall for all k when the preprocessing steps are included.
Among the embeddings, fine-tuning on the medical STS data provides a significant improvement, which indicates the importance of domain-specific training. The BC-BERT MED B experiment was conducted based on our observation that even in abstracts without labeled sections, the primary claims are seldom made in the middle region. The results appear to support this as well. Its improvement over the other variants of BC-BERT, however, is not significant.
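For illustration, the BM25+ scoring used in candidate selection can be sketched from scratch as follows (in practice we use the Rank-BM25 tool; the tiny stopword list and whitespace tokenizer here are simplified stand-ins for the full preprocessing, and the abbreviation expansion step is omitted):

```python
import math
from collections import Counter

# Toy function-word list standing in for the full stopword removal step.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "are", "with"}

def tokenize(text):
    """Lowercase, whitespace-split, and drop function words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def bm25_plus_scores(query, docs, k1=1.5, b=0.75, delta=1.0):
    """Score each document against the query with BM25+ (Lv and Zhai, 2011a)."""
    doc_tokens = [tokenize(d) for d in docs]
    n = len(doc_tokens)
    avgdl = sum(len(d) for d in doc_tokens) / n
    df = Counter(t for d in doc_tokens for t in set(d))  # document frequencies
    scores = []
    for d in doc_tokens:
        tf = Counter(d)
        score = 0.0
        for t in tokenize(query):
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Length-normalized term frequency, lower-bounded by delta.
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * (norm + delta)
        scores.append(score)
    return scores
```

The delta term is what distinguishes BM25+ from plain BM25: it lower-bounds the contribution of a matched term, so long documents are not penalized into irrelevance.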

Transformer-based Ranking
We keep 3,000 headlines for training, 1,000 for development, and 1,034 for testing. We first use the best candidate selection model (BM25+†) to generate a list of 500 abstracts for each headline, and then concatenate a headline with an abstract. These concatenated strings serve as training data for our task. The ground-truth label is 1 for an input h + a where a is, indeed, the abstract cited by the article with headline h; for all other inputs, the label is 0. We use this labeled data to tune pre-trained transformer models. During prediction, we use the softmax probabilities of the classification scores to re-rank the abstracts for each headline, and calculate recall@k for k = 1, 3, 5, 20, as well as the mean reciprocal rank (MRR).
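The construction of labeled training pairs and the MRR computation can be sketched as follows (the [SEP]-style concatenation marker and the function names are illustrative assumptions, not the exact implementation):

```python
def make_training_pairs(headline, candidates, cited_ids):
    """Label a headline--abstract pair 1 if the abstract is cited, else 0.

    candidates: list of (doc_id, abstract_text) from the first-stage retriever.
    """
    return [(headline + " [SEP] " + text, int(doc_id in cited_ids))
            for doc_id, text in candidates]

def mean_reciprocal_rank(runs):
    """runs: one (ranked_doc_ids, relevant_doc_ids) pair per headline."""
    total = 0.0
    for ranked, relevant in runs:
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank  # reciprocal rank of the first relevant hit
                break
        total += rr
    return total / len(runs)
```

At prediction time, the ranked list fed to the MRR computation is obtained by sorting the 500 candidates by the model's softmax probability for the positive class.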
It is possible that the correct abstract was not retrieved during candidate selection; in that case, we add it back during training (but not testing). Since this data is highly imbalanced (roughly a 1:500 ratio for the classes labeled 1 and 0, respectively), we use natural language data augmentation (Ma, 2019) to oversample the positive class. These augmentations work by either inserting or substituting words that are highly likely based on distributional similarity. For training, we choose the augmentation parameters such that at most 10 tokens, and no more than 30% of the tokens in a sentence, are augmented. We generate 4 augmented samples (2 insertions, 2 substitutions) and 20 augmented samples (10 insertions, 10 substitutions) when we use the top 10 and 50 negative samples, respectively, from the list of 500 abstracts for each headline.

Table 3 (caption fragment): "... subscript (m, n). The best performance is achieved by Bio+Clinical BERT with 1 epoch, batch size of 24 and maximum sequence length of 512 tokens."
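The oversampling of the positive class can be sketched as follows. This is a toy stand-in: the hand-written synonym table replaces the distributional-similarity model of the actual augmentation library, and all names are illustrative.

```python
import random

# Hypothetical synonym table standing in for the distributional-similarity
# substitutions produced by the augmentation library (Ma, 2019).
SYNONYMS = {"tea": ["brew"], "longer": ["extended"], "drinkers": ["consumers"]}

def augment(sentence, n_subs=1, max_frac=0.3, rng=None):
    """Substitute a few known words, touching at most max_frac of the tokens."""
    rng = rng or random.Random(0)
    tokens = sentence.split()
    budget = min(n_subs, max(1, int(len(tokens) * max_frac)))
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(budget, len(candidates))):
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

def oversample(headline, abstract, n_copies):
    """Generate augmented positive (headline, abstract) pairs for the minority class."""
    rng = random.Random(42)
    return [(augment(headline, rng=rng), abstract) for _ in range(n_copies)]
```

Each augmented copy keeps the cited abstract fixed and perturbs only the headline, so the positive class grows without introducing new relevance judgments.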
As part of our experiments, we train different models (BERT, Bio+Clinical BERT, XLNet, and DistilBERT) on different versions of the dataset, controlling for the number of negative samples per claim and the number of augmented positive samples. All models are trained for 1 and 2 epochs, with batch sizes of 16 and 24, maximum sequence lengths of 256 and 512 tokens, and a learning rate of 5 × 10 −5 . The final hyperparameters are chosen manually based on the MRR achieved on the development set. All experiments are conducted on NVIDIA Tesla V100 GPUs.

Discussion
First, there is the existential question about candidate selection: why not simply train the final ranking algorithm with random negative samples instead of the token-based first step? With random negative sampling, we found that it was rather obvious for both human readers and learning algorithms that the negative samples did not support the claim, simply because random sampling often draws publications not related to the claim at all. This would defeat the objective of our work, which is to aid readers in attempting to fact-check a health-related claim based on the citation provided in a news article. It is unlikely that readers will compare a publication on a topic vastly different from the one being reported (e.g., the news article makes a claim about COVID-19 while the research is about 'haemophilia'). Thus, even though random negative sampling is commonly used to train fact-checking systems (e.g., Hanselowski et al. (2018); Nie et al. (2019)), it is ill suited for the task presented here.
It is also worth pointing out that our evaluation relies on relevance labels obtained from citations in news articles. It is possible that some higher-ranked documents are relevant and do support the medical claim, but were judged irrelevant simply because the news article did not cite them. Despite this, recall@k and MRR remain meaningful. For instance, if the cited publication is ranked third while two other relevant publications are ranked above it, recall@k still registers a success at k = 3; with exhaustively verified non-relevance labels, the same scenario would register success at k = 1. Obtaining such labels is a daunting task, however, and many IR benchmark datasets, e.g., MS-MARCO (Nguyen et al., 2016), do not provide strong non-relevance labels. In this general evaluation setup, our results may instead be viewed as a lower bound (i.e., with exhaustive ground-truth labels of non-relevance, they would be better, not worse).
BM25 is hard to beat as a baseline for candidate selection, but token-based methods err when the words in the news headline do not appear in the abstract, which is common when synonymous or similar meanings are expressed using different terms across the two genres. The best embedding-based model, BC-BERT MED B , was able to include 33% of the abstracts that BM25+† failed to retrieve in the top 500 candidates. This also indicates why contextual embeddings improve the ranking results (Table 3). 9 From candidate selection on the test set, the best recall@500 is 0.834, which serves as the upper bound for the ranking task. After training, the transformer-based models can nearly attain this bound for k = 20. This is true even for the general BERT embeddings tuned on just 20 positive and 50 negative samples. The Bio+Clinical variant outperforms the other models; the relative improvement over BERT, however, is not significant.
Overall, our results show that these embeddings do not need much task-specific tuning for the final ranking. However, both token- and embedding-based approaches fail when the claim is fairly generic (e.g., "Research could help design better flu vaccines"), and these errors occur during candidate selection as well as the final ranking.

Related Work
Modern ad hoc IR systems are largely built upon bag-of-words representations, using term-weighting techniques like BM25 (Robertson et al., 2009) or its variants (Lv and Zhai, 2011a,b). Catena et al. (2019) used such a variant for query-based news retrieval that focuses on specific regions in an article. They use the headlines as queries and formulate the task as retrieving the corresponding article. Such headline-content pairs from newswire have similarly been used in neural IR models (MacAvaney et al., 2019b).
Neural models have also recently been used in biomedical IR tasks, owing to the availability of large datasets. Mohan et al. (2018) introduce a deep learning model to retrieve biomedical research literature. Further, deep neural architectures have been coupled with external knowledge bases (Zhao et al., 2019), where research documents are retrieved as part of a precision medicine task. In this body of work, the query is either an in-domain keyword or structured information. As such, these systems cannot be readily used where the query may be expressed with the complex linguistic structures found in newswire. Example 1b in Table 1, for instance, stresses a specific aspect of the claim using an adjectival clause as a modifier.
Given the success of BERT and its successors in natural language inference tasks, ad hoc IR systems have used them for claim verification (Hanselowski et al., 2018; Nie et al., 2019; Liu et al., 2019). Applications of such models to binary classification for query-based passage re-ranking suggest that contextual information can be valuable when re-ranking an initial list of possibly relevant documents retrieved by a BM25 model (Nogueira and Cho, 2019). These approaches are not readily suitable for cross-genre IR, but they motivated some of our technical choices. For instance, our use of a pointwise (instead of pairwise) loss was based on the discussion in Soleimani et al. (2020) regarding IR tasks with BERT-style models.
Fact-checking is a critical component in fighting misinformation, but medical misinformation is known to be nuanced: instead of making outright false claims, statements are known to undergo exaggeration. In this general context of thwarting medical misinformation, there is notable work that, while distinct from the IR task discussed here, complements our research. For instance, Sumner et al. (2014) studied the exaggeration of medical claims in the news vis-à-vis the original findings in research publications.

Conclusion
In contrast to recent research in ad hoc neural IR, which requires large amounts of training data (Mitra and Craswell, 2018), we present a system that combines term-weighting techniques and neural models across two distinct linguistic genres. We also provide a novel dataset of medical newswire queries linked to the research literature. Our results show that while neural models excel at re-ranking a small number of documents once pre-trained contextual embeddings are tuned on domain-specific data, classical token-based approaches remain difficult to beat in a cross-genre retrieval scenario when the search space is larger. Our data collection process also reveals that even in a domain as critically important as medical news, only a small fraction of news articles (24.6%) include a complete citation and a link to the original research. Thus, the presented task has utility in medical fact-checking, identifying health-related misinformation, and assessing some empirically verifiable aspects of health news reporting.