An Empirical Assessment of the Qualitative Aspects of Misinformation in Health News

The explosion of online health news articles runs the risk of proliferating low-quality information. Within the existing work on fact-checking, however, relatively little attention has been paid to medical news. We present a health news classification task to determine whether medical news articles satisfy a set of review criteria deemed important by medical experts and health care journalists. We present a dataset of 1,119 health news articles paired with their systematic reviews. The review criteria consist of six elements that are essential to the accuracy of medical news. We then present experiments comparing the classical token-based approach with more recent transformer-based models. Our results show that detecting qualitative lapses is a challenging task with direct ramifications for misinformation, and an important direction to pursue beyond assigning True or False labels to short claims.


Introduction
In recent years, health information-seeking behavior (HISB), which refers to the ways in which individuals seek information about their health, risks, illnesses, and health-protective behaviors (Lambert and Loiselle, 2007; Mills and Todorova, 2016), has become increasingly reliant on online news articles (Fox and Duggan, 2013; Medlock et al., 2015; Basch et al., 2018). Some studies also posit that with the increasing involvement of the news media in health-related discussions, and direct-to-consumer campaigns by pharmaceutical companies, people are turning to the Internet as their first source of health information, instead of healthcare practitioners (Jacobs et al., 2017). This behavior is primarily driven by the users' need to gain knowledge (Griffin et al., 1999) about some form of intervention (e.g., drugs, nutrition, diagnostic and screening tests, dietary recommendations, psychotherapy). Furthermore, and perhaps counter-intuitively, information seekers seldom spend a lot of time on health websites. Instead, they repeatedly jump between search engine results and health-related articles (Pang et al., 2014, 2015). In stark contrast to HISB, there is also a growing lack of trust in the accuracy of health information provided on the Internet (Massey, 2016). This is perhaps to be expected, given how widespread health-related misinformation has become. For instance, in surveys where expert panels have judged the accuracy of health news articles, nearly half were found to be inaccurate (Moynihan et al., 2000; Yavchitz et al., 2012).

Table 1: An example of medical misinformation.
News headline: Experts warn coronavirus is 'as dangerous as Ebola' in shocking new study.
Source: www.express.co.uk/life-style/health/1275700/ebolaelderly-patients-coronavirus-experts-study-research-deathfigures
Published: Apr 30, 2020. Accessed: March 21, 2021.
Cause of misinformation: Comparing numbers from two different contexts: (1) the hospital fatality rate of COVID-19, and (2) the overall case fatality rate of Ebola.
Health-related misinformation, however, is rarely a binary distinction between true and fake news. In medical news, multiple aspects of an intervention are typically presented, and a loss of nuance or incomplete understanding of the process of medical research can lead to various types of qualitative failures, exacerbating misinformation in this domain.
Recently, news articles citing leading medical journals have suffered because of this. Table 1 shows an example that was disseminated widely in the United Kingdom, where technically correct facts were juxtaposed with misleading contexts: the case fatality rate of Ebola was incorrectly compared with the hospital fatality rate of COVID-19 (Winters et al., 2020). Indeed, medical misinformation is often a correct fact presented in an incorrect context (Southwell et al., 2019). Moreover, health-related articles are also known to present "disease-mongering", where a normal state is exaggerated and presented as a condition or a disease (Wolinsky, 2005). Given how these issues are specific to medical misinformation, and how intricately the accuracy of medical facts is intertwined with the quality of health care journalism, the imperative to move beyond a binary classification of true and fake becomes clear. To this end, a set of specific principles and criteria has been proposed by scientists and journalists, based largely on the acclaimed work by Moynihan et al. (2000) and the Statement of Principles of the Association of Health Care Journalists (Association of Health Care Journalists, 2007).

Table 2: Review criteria. The ten criteria for public relations news releases are almost identical to the ones for news stories (except for 6 and 10).
(1) Does the story/news release adequately discuss the costs of the intervention?
(2) Does the story/news release adequately quantify the benefits of the intervention?
(3) Does the story/news release adequately explain/quantify the harms of the intervention?
(4) Does the story/news release seem to grasp the quality of the evidence?
(5) Does the story/news release commit disease-mongering?
(6a) Does the story use independent sources and identify conflicts of interest?
(6b) Does the news release identify funding sources & disclose conflicts of interest?
(7) Does the story/news release compare the new approach with existing alternatives?
(8) Does the story/news release establish the availability of the treatment/test/product/procedure?
(9) Does the story/news release establish the true novelty of the approach?
(10a) Does the story appear to rely solely or largely on a news release?
(10b) Does the news release include unjustifiable, sensational language, including in the quotes of researchers?
We present a dataset (Sec. 2) specifically tailored for health news, and labeled according to a set of domain-specific criteria by a multi-disciplinary team of journalists and health care professionals. The detailed data annotation was carried out from 2006 to 2018 (Schwitzer, 2006). For each criterion, we present a classification task to determine whether or not a given news article satisfies it (Sec. 3), and discuss the results. Finally, we present relevant prior work (Sec. 4) before concluding.

Dataset
Our data is collected from Health News Review (Schwitzer, 2006)¹, which contains systematic reviews of 2,616 news stories and 606 public relations (PR) news releases from a period of 13 years, from 2006 to 2018. Ten specific and standardized criteria were used for the reviews. These were chosen to align with the needs of readers seeking health information, and are shown in Table 2. The dataset consists only of articles that discuss a specific medical intervention, since the review criteria were deemed by journalists as being generally not applicable to discussions of multiple interventions or conditions. Each article is reviewed by two or three experts from journalism or medicine, and the result for each criterion is one of Satisfactory, Not Satisfactory, and Not Applicable. The last label is reserved for cases where it is impossible or unreasonable for an article to address that criterion. Table 3 illustrates the utility of this label with one example from the dataset.

¹ www.healthnewsreview.org/

Table 3: Review criteria not applicable. In this example, the study being reported has not yet taken place, so criterion (2) in Table 2 is not germane.
News headline: Virtual reality to help detect early risk of Alzheimer's
Source: www.theguardian.com/society/2018/dec/16/alzheimers-dementia-cure-virtual-reality-navigation-skills
Published: Dec 16, 2018. Accessed: April 26, 2021.
Criterion labeled "not applicable": (2) Does the story adequately quantify the benefits of the treatment/test/product/procedure?
Going beyond the reviews themselves, we then collect the news articles being reviewed from the original news sites. However, nearly 30% of those pages have ceased to exist. Further, some articles could not be retrieved due to paywalls. Multiple prominent news organizations are featured in this data, with Fig. 1 showing the distribution over these organizations (for brevity, we show the top ten entities, with the tenth being "others").
Our final dataset comprises 1,119 articles (740 news stories and 379 PR news releases) along with their criteria-driven reviews. These are maintained as (n, {c_i}) tuples, where n is the news article, and c_i is the review result for criterion i. Since criteria 6 and 10 are slightly different for news stories and PR releases, we remove them from our empirical experiments. We also remove criteria 5 and 9, since these require highly topic-specific medical knowledge. We do this so that our approach reflects the extent of medical knowledge available to the lay reader, who is unlikely to fully comprehend the specialized language of medical research publications (McCray, 2005).

Experiments
We approach the problem as a series of supervised classification tasks, where the performance is evaluated separately for each review criterion. Moreover, since the reviewers assign the Not Applicable label based on additional topic-specific medical knowledge, we discard the (n, {c_i}) tuples where c_i carries this label. This eliminates approximately 2.35% of the total number of tuples in our dataset, and paves the way for a binary classification task where each article is deemed satisfactory or not for the criterion c_i. The number of remaining articles for each criterion is shown in Table 4.
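The filtering step above can be sketched as follows. This is a minimal illustration with hypothetical field names (`to_binary_task` and the tuple layout are our own, not from any released codebase): for a given criterion, tuples labeled Not Applicable are dropped, and the remaining labels are binarized.

```python
def to_binary_task(tuples, i):
    """Keep only (article, label) pairs where criterion i is applicable.

    `tuples` is a list of (article_text, labels) pairs, where `labels`
    maps a criterion index to one of "Satisfactory", "Not Satisfactory",
    or "Not Applicable".  Returns binary labels: 1 = Satisfactory.
    """
    out = []
    for article, labels in tuples:
        label = labels[i]
        if label == "Not Applicable":
            continue  # reviewers used extra medical knowledge here; discard
        out.append((article, 1 if label == "Satisfactory" else 0))
    return out
```

Applying this per criterion yields six independent binary datasets of slightly different sizes.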
In all experiments, we use 70% of the data for training. The rest is used as the test set. As a simple baseline, we use the Zero Rule (also called ZeroR or 0-R), which uses the base rate and classifies according to the prior, always predicting the majority class. We then experiment with the classical representation using TF-IDF feature encoding, as well as the state-of-the-art transformer-based models. In both approaches, we use 5-fold cross-validation during training to select the best hyperparameters for each model. These are described next.

Figure 2: The distribution of the size of news articles (x-axis: number of words in an article; y-axis: number of articles).

Models
For the feature-based models, we perform some preprocessing, which consists of removing punctuation, lowercasing tokens, removing function words, and lemmatization. We use two supervised learning algorithms: support vector machines (SVM) and gradient boosting (GB). As noted in Table 4, our dataset suffers from class imbalance for every criterion except one. Thus, for the five imbalanced criteria, we use adaptive synthetic sampling, viz., ADASYN (He et al., 2008). Further, to reduce the high dimensionality of the feature space, we apply the recursive feature elimination algorithm from Scikit-learn (Buitinck et al., 2013) with SVM. In this process, the estimator is trained on the initial set of features, and the importance of each feature is determined by its weight coefficient. The least important features are then pruned. We apply this process recursively, selecting progressively smaller feature sets, until the 300 best features remain. Next, we use several transformer-based models: BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), DistilBERT (Sanh et al., 2019), and Longformer (Beltagy et al., 2020). The maximum sequence length is set to 512 for every model, except for Longformer, for which the value is 4,096. We use random undersampling to mitigate the class imbalance, since the models' performance would otherwise be similar to the Zero Rule baseline.
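The random undersampling used for the transformer models can be sketched as follows. This is a simple illustration under our own naming, not the exact implementation: majority-class examples are randomly dropped until every class has as many examples as the smallest one.

```python
import random

def undersample(examples, labels, seed=0):
    """Randomly drop majority-class examples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        # Keep a uniform random subset of size n_min from each class.
        for x in rng.sample(xs, n_min):
            out.append((x, y))
    rng.shuffle(out)
    return out
```

The trade-off, visible in the results, is that undersampling discards training data, which hurts data-hungry transformer models on an already small dataset.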

Results and discussion
The results of our experiments are shown in Table 5. As the dataset is imbalanced for all but one criterion, our simple baseline is the Zero Rule rather than a random baseline. We measure classifier performance using the macro-average of precision, recall, and F1-score. Gradient boosting achieves better performance on criteria 1, 3, 4, and 8. The introduction of oversampling and feature selection also increases model performance for some criteria, but not uniformly across the board.
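Macro-averaging weights both classes equally regardless of frequency, which is what makes it informative on imbalanced criteria where plain accuracy rewards the Zero Rule. A minimal sketch of macro-averaged F1 (the helper name is ours; Scikit-learn's `f1_score` with `average="macro"` computes the same quantity):

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A majority-class predictor scores 0 on the minority class, so its macro-F1 is capped well below its accuracy, exposing the gap the Zero Rule baseline hides.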
The feature-based models outperform the transformer-based models on the first four criteria. We suspect this is mainly due to the size of the dataset after undersampling. We also examine the number of words in the collected news articles (Fig. 2); more than half have more than 512 words. However, the Longformer model, with its maximum sequence length of 4,096, does not achieve significantly better performance than the other transformer-based models. The reason might be the "inverted pyramid" structure of news articles, which places essential information in the lead paragraph (Pöttker, 2003). We also notice that the first four criteria are more specific than the rest. For example, the first criterion is about the cost of the intervention, which could be answered by token-level searching. It is still a challenging task, however, given that even human readers find it difficult to apply the review criteria without expert training.

Related Work
For many years now, concerns have been raised about medical misinformation in the coverage by news media (Moynihan et al., 2000;Ioannidis, 2005). Moynihan et al. studied 207 news stories about the benefits and risks of three medications to prevent major diseases, and found that 40% of the news did not report benefits quantitatively while only 47% mentioned potential harms.
Various tasks and approaches have been formulated (Thorne and Vlachos, 2018) for fact-checking information. Multiple datasets have also been put forth. Ferreira and Vlachos (2016) released a collection of 300 claims with corresponding news. This dataset was later significantly enlarged in the fake news challenge (Pomerleau and Rao, 2017). At a similarly large scale, Wang (2017) introduced a dataset comprising 12.8K manually labeled statements from POLITIFACT.COM and treated it as a text classification task. A large body of work, however, has dealt with short claims, both for fact-checking them (Hassan et al., 2017) and for identifying what to check (Nakov et al., 2018).
Furthermore, a vast majority of prior work was on political news, while medical misinformation remained relatively neglected until its impact was underscored by the COVID-19 pandemic (e.g., Hossain et al. (2020); Serrano et al. (2020), among others). This body of work, however, continues to assign true/false labels or binary stance labels to short claims. In contrast, our work analyzes long articles and identifies whether or not they satisfy various qualitative criteria specifically important to medical news, as determined by journalists and health care professionals.

Conclusion
We present a first empirical analysis of qualitative reviews of medical news, since the traditional true/fake dichotomy does not adequately capture the nuanced world of medical misinformation. To this end, we collect a dataset of medical news along with their detailed reviews based on multiple criteria. The novelty of this work lies in highlighting the importance of a deeper review and analysis of medical news to understand misinformation in this domain. For example, misinformation may easily be caused by the use of sensational language, or disease-mongering, or not disclosing a conflict of interest (all of which are criteria used in this work).
Our results show that this is a challenging task. The data reveals that for most of the criteria, less than half of the news articles are satisfactory. The commonly perceived notion of reputation notwithstanding, several articles from well-known sources (such as the ones shown in Fig. 1) also fall short of these qualitative benchmarks set by domain experts. This presents a clear data-driven picture of how the qualitative aspects of misinformation defy our expectations. We have presented a first step in this direction, and our hope is that this work leads to collaborative creation of similar datasets at larger scale by computer scientists and journalists, and in multiple domains even outside of health care.