Claim Detection in Biomedical Twitter Posts

Social media contains unfiltered and unique information, which is potentially of great value, but, in the case of misinformation, can also do great harm. With regard to biomedical topics, false information can be particularly dangerous. Methods of automatic fact-checking and fake news detection address this problem, but have not been applied to the biomedical domain in social media yet. We aim to fill this research gap and annotate a corpus of 1200 tweets for implicit and explicit biomedical claims (the latter also with span annotations for the claim phrase). With this corpus, which we sample to be related to COVID-19, measles, cystic fibrosis, and depression, we develop baseline models which automatically detect tweets that contain a claim. Our analyses reveal that biomedical tweets are densely populated with claims (45% of the 1200 tweets in our corpus, which focuses on the domains mentioned above). Baseline classification experiments with embedding-based classifiers and BERT-based transfer learning demonstrate that the detection is challenging, but show acceptable performance for the identification of explicit expressions of claims. Implicit claim tweets are more challenging to detect.


Introduction
Social media platforms like Twitter contain vast amounts of valuable and novel information, and biomedical aspects are no exception (Correia et al., 2020). Doctors share insights from their everyday life, patients report on their experiences with particular medical conditions and drugs, or they discuss and hypothesize about the potential value of a treatment for a particular disease. This information can be of great value: governmental administrations or pharmaceutical companies can, for instance, learn about unknown side effects or potentially beneficial off-label use of medications. At the same time, unproven claims or even intentionally spread misinformation might also do great harm. Therefore, contextualizing a social media message and investigating if a statement is debated or can actually be proven with a reference to a reliable resource is important. The task of detecting such claims is essential in argument mining and a prerequisite for further analysis in tasks like fact-checking or hypothesis generation. We show an example of a tweet with a claim in Figure 1.
Claims are widely considered the conclusive and therefore central part of an argument (Lippi and Torroni, 2015; Stab and Gurevych, 2017), consequently making them the most valuable information to extract. Argument mining and claim detection have been explored for texts like legal documents, Wikipedia articles, and essays (Moens et al., 2007; Levy et al., 2014; Stab and Gurevych, 2017, i.a.), as well as social media and web content (Goudas et al., 2014; Habernal and Gurevych, 2017; Bosc et al., 2016a; Dusmanu et al., 2017, i.a.). They have also been applied to scientific biomedical publications (Achakulvisut et al., 2019; Mayer et al., 2020, i.a.), but biomedical arguments as they occur on social media, and particularly Twitter, have not been analyzed yet.
With this paper, we fill this gap and explore claim detection for tweets discussing biomedical topics, particularly tweets about COVID-19, the measles, cystic fibrosis, and depression, to allow for drawing conclusions across different fields.
Our contributions to a better understanding of biomedical claims made on Twitter are (1) to publish the first biomedical Twitter corpus manually labeled with claims (distinguishing explicit and implicit claims, with span annotations for explicit claim phrases), and (2) baseline experiments to detect (implicit and explicit) claim tweets in a classification setting. Further, (3), we find in a cross-corpus study that generalization across domains is challenging and that biomedical tweets pose a particularly difficult environment for claim detection.

Related Work
Detecting biomedical claims on Twitter is a task rooted in both the argument mining field as well as the area of biomedical text mining.

Argumentation Mining
Argumentation mining covers a variety of different domains, text, and discourse types. This includes online content, for instance Wikipedia (Levy et al., 2014; Roitman et al., 2016; Lippi and Torroni, 2015), but also more interaction-driven platforms, like fora. As an example, Habernal and Gurevych (2017) extract argument structures from blogs and forum posts, including comments. Apart from that, Twitter is generally a popular text source (Bosc et al., 2016a; Dusmanu et al., 2017). Argument mining is also applied to professionally generated content, for instance news (Goudas et al., 2014; Sardianos et al., 2015) and legal or political documents (Moens et al., 2007; Palau and Moens, 2009; Mochales and Moens, 2011; Florou et al., 2013). Another domain of interest is persuasive essays, which we also use in a cross-domain study in this paper (Lippi and Torroni, 2015; Stab and Gurevych, 2017; Eger et al., 2017).
While most approaches cater to a specific domain or text genre, Stab et al. (2018) argue that domain-focused, specialized systems do not generalize to broader applications such as argument search in texts. In line with that, Daxenberger et al. (2017) present a comparative study on cross-domain claim detection. They observe that diverse training data leads to a more robust model performance in unknown domains.

Claim Detection
Claim detection is a central task in argumentation mining. It can be framed as a classification (Does a document/sentence contain a claim?) or as sequence labeling (Which tokens make up the claim?). The setting as classification has been explored, inter alia, as a retrieval task of online comments made by public stakeholders on pending governmental regulations (Kwon et al., 2007), for sentence detection in essays (Lippi and Torroni, 2015), and for Wikipedia (Roitman et al., 2016; Levy et al., 2017). The setting as a sequence labeling task has been tackled on Wikipedia (Levy et al., 2014), on Twitter, and on news articles (Goudas et al., 2014; Sardianos et al., 2015).
One common characteristic of most work on automatic claim detection is the focus on relatively formal text. Social media, like tweets, can be considered a more challenging text type which, despite this, has received considerable attention, also beyond classification or token sequence labeling. Bosc et al. (2016a) detect relations between arguments, Dusmanu et al. (2017) identify factual or opinionated tweets, and Addawood and Bashir (2016) further classify the type of premise which accompanies the claim. Ouertatani et al. (2020) combine aspects of sentiment detection, opinion, and argument mining in a pipeline to analyze argumentative tweets more comprehensively. Ma et al. (2018) specifically focus on the claim detection task in tweets, and present an approach to retrieve Twitter posts that contain argumentative claims about debatable political topics.
To the best of our knowledge, detecting biomedical claims in tweets has not been approached yet. Biomedical argument mining, also for other text types, is generally still limited. The work by Shi and Bei (2019) is one of the few exceptions that target this challenge; they propose a pipeline to extract health-related claims from headlines of health-themed news articles. The majority of other argument mining approaches for the biomedical domain focus on research literature (Blake, 2010; Alamri and Stevenson, 2015; Alamri and Stevensony, 2015; Achakulvisut et al., 2019; Mayer et al., 2020).

Biomedical Text Mining
Biomedical natural language processing (BioNLP) is a field in computational linguistics which also receives substantial attention from the bioinformatics community. One focus is on the automatic extraction of information from life science articles, including entity recognition, e.g., of diseases, drug names, protein and gene names (Habibi et al., 2017; Giorgi and Bader, 2018; Lee et al., 2019, i.a.) or relations between those (Lamurias et al., 2019; Sousa et al., 2021; Lin et al., 2019, i.a.). Biomedical text mining methods have also been applied to social media texts and web content (Wegrzyn-Wolska et al., 2011; Yang et al., 2016; Sullivan et al., 2016, i.a.). One focus is on the analysis of Twitter with regards to pharmacovigilance. Other topics include the extraction of adverse drug reactions (Nikfarjam et al., 2015; Cocos et al., 2017), monitoring public health (Paul and Dredze, 2012; Choudhury et al., 2013), and detecting personal health mentions (Yin et al., 2015; Karisani and Agichtein, 2018).
A small number of studies looked into the comparison of biomedical information in social media and scientific text: Thorne and Klinger (2018) analyze quantitatively how disease names are referred to across these domains. Seiffe et al. (2020) analyze laypersons' medical vocabulary.

Corpus Creation and Analysis
As the basis for our study, we collect a novel Twitter corpus in which we annotate which tweets contain biomedical claims, and (for all explicit claims) which tokens correspond to that claim.

Data Selection & Acquisition
The data for the corpus was collected in June/July 2020 using Twitter's API 1 which offers a keyword-based retrieval for tweets. Table 1 provides a sample of the search terms we used. 2 For each of the medical topics, we sample English tweets from keywords and phrases from four different query categories. This includes (1) the name of the disease as well as the respective hashtag for each topic, e.g., depression and #depression, (2) topical hashtags like #vaccineswork, (3) combinations of the disease name with words like cure, treatment or therapy as well as their respective verb forms, and (4) a list of medications, products, and product brand names from the pharmaceutical database DrugBank 3 .
When querying the tweets, we exclude retweets by using the API's '-filter:retweets' option. From the 902,524 collected tweets, we filter out those with URLs, since they are likely to be advertisements (Cocos et al., 2017; Ma et al., 2018), and further remove duplicates based on the tweet IDs. From the resulting collection of 127,540 messages, we draw a sample of 75 randomly selected tweets per topic (four biomedical topics) and search term category (four categories per topic). The final corpus to be annotated consists of 1200 tweets about four medical issues and their treatments: measles, depression, cystic fibrosis, and COVID-19.
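The filtering and sampling steps above can be sketched as follows. This is an illustrative sketch, not the authors' code; the dictionary keys (`id`, `text`, `topic`, `category`) are hypothetical stand-ins for the fields of the collected tweets:

```python
import random

def filter_tweets(tweets):
    """Drop tweets that contain URLs and de-duplicate by tweet ID."""
    seen_ids, kept = set(), []
    for tweet in tweets:
        if "http://" in tweet["text"] or "https://" in tweet["text"]:
            continue  # tweets with URLs are likely advertisements
        if tweet["id"] in seen_ids:
            continue  # duplicate tweet ID
        seen_ids.add(tweet["id"])
        kept.append(tweet)
    return kept

def sample_corpus(tweets, topics, categories, n_per_cell=75, seed=0):
    """Draw a fixed number of random tweets per (topic, query category)
    cell, as in the 4 topics x 4 categories x 75 tweets design."""
    rng = random.Random(seed)
    corpus = []
    for topic in topics:
        for category in categories:
            cell = [t for t in tweets
                    if t["topic"] == topic and t["category"] == category]
            corpus.extend(rng.sample(cell, min(n_per_cell, len(cell))))
    return corpus
```

With four topics and four categories, `sample_corpus` yields the 4 × 4 × 75 = 1200 tweets of the final corpus.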

Conceptual Definition
While there are different schemes and models of argumentative structure varying in complexity as well as in their conceptualization of claims, the claim element is widely considered the core component of an argument (Daxenberger et al., 2017).
Aharoni et al. (2014) suggest a framework in which an argument consists of two main components: a claim and premises. We follow Stab and Gurevych (2017) and define the claim as the argumentative component in which the speaker or writer expresses the central, controversial conclusion of their argument. This claim is presented as if it were true, even though objectively it can be true or false (Mochales and Ieven, 2009). The premise, which is considered the second part of an argument, includes all elements that are used either to substantiate or disprove the claim. Arguments can contain multiple premises to justify the claim. (Refer to Section 3.4 for examples and a detailed analysis of argumentative tweets in the dataset.) For our corpus, we focus on the claim element and assign all tweets a binary label that indicates whether the document contains a claim. Claims can be either explicitly voiced, or the claim property can be inferred from the text in cases in which they are expressed implicitly (Habernal and Gurevych, 2017). We therefore annotate explicitness or implicitness if a tweet is labeled as containing a claim. For explicit cases, the claim sequence is additionally marked on the token level. For implicit cases, the claim which can be inferred from the implicit utterance is stated alongside the implicitness annotation.

Guideline Development
We define a preliminary set of annotation guidelines based on previous work (Mochales and Ieven, 2009;Aharoni et al., 2014;Bosc et al., 2016a;Daxenberger et al., 2017;Stab and Gurevych, 2017). To adapt those to our domain and topic, we go through four iterations of refinements. In each iteration, 20 tweets receive annotations by two annotators. Both annotators are female and aged 25-30. Annotator A1 has a background in linguistics and computational linguistics. A2 has a background in mathematics, computer science, and computational linguistics. The results are discussed based on the calculation of Cohen's κ (Cohen, 1960).
After Iteration 1, we did not make any substantial changes, but reinforced a common understanding of the existing guidelines in a joint discussion. After Iteration 2, we clarified the guidelines by adding the notion of an argumentative intention as a prerequisite for a claim: a claim is only to be annotated if the author actually appears to be intentionally argumentative, as opposed to just sharing an opinion (Šnajder, 2016; Habernal and Gurevych, 2017). This is illustrated in the following example, which is not to be annotated as a claim, given this additional constraint: This popped up on my memories from two years ago, on Instagram, and honestly I'm so much healthier now it's quite unbelievable. A stone heavier, on week 11 of no IVs (back then it was every 9 weeks), and it's all thanks to #Trikafta and determination. I am stronger than I think.

Table 2: Inter-annotator agreement during development of the annotation guidelines and for the final corpus. C/N: Claim/non-claim, E/I/N: Explicit/Implicit/Non-claim, Span: Token-level annotation of the explicit claim expression.
We further clarified the guidelines with regards to the claim being the conclusive element in a Twitter document. This change encouraged the annotators to reflect specifically if the conclusive, main claim is conveyed explicitly or implicitly.
After Iteration 3, we did not introduce any changes, but went through an additional iteration to further establish the understanding of the annotation tasks. Table 2 shows the results of the agreement of the annotators in each iteration as well as the final κ-score for the corpus. We observe that the agreement substantially increased from Iteration 1 to 4. However, we also observe that obtaining a substantial agreement for the span annotation remains the most challenging task.
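The agreement scores above are Cohen's κ, which compares observed agreement with the agreement expected by chance given each annotator's label distribution. A minimal reference implementation for two annotators (a sketch, not the authors' evaluation code):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann1) == len(ann2) and len(ann1) > 0
    n = len(ann1)
    # observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # chance agreement from the marginal label distributions
    freq1, freq2 = Counter(ann1), Counter(ann2)
    expected = sum((freq1[lab] / n) * (freq2[lab] / n)
                   for lab in set(ann1) | set(ann2))
    return (observed - expected) / (1 - expected)
```

For example, annotations `["c", "c", "n", "n"]` and `["c", "n", "n", "n"]` give an observed agreement of .75, a chance agreement of .5, and therefore κ = .5.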

Annotation Procedure
The corpus annotation was carried out by the same annotators that conducted the preliminary annotations. A1 labeled 1000 tweets, while A2 annotated 300 instances. Of these two sets, 100 tweets were provided to both annotators to track agreement (which remained stable, see Table 2). Annotating 100 tweets took approx. 3.3 hours. Overall, we observe that the agreement is generally moderate. Separating claim tweets from non-claim tweets shows an acceptable κ=.56. Including the decision of explicitness/implicitness leads to κ=.48. The span-based annotation has limited agreement, with κ=.38 (which is why we do not consider this task further in this paper). These numbers are roughly in line with previous work: Achakulvisut et al. (2019) report an average κ=0.63 for labeling claims in biomedical research papers, and according to Habernal and Gurevych (2017), explicit, intentional argumentation is easier to annotate than text which is less explicit. Our corpus is available with detailed annotation guidelines at http://www.ims.uni-stuttgart.de/data/bioclaim. The longest tweet in the corpus consists of 110 tokens (it includes 50 @-mentions followed by a measles-related claim: "Oh yay! I can do this too, since you're going to ignore the thousands of children who died in outbreaks last year from measles... Show me a proven death of a child from vaccines in the last decade. That's the time reference, now? So let's see a death certificate that says it, thx"), while the two shortest consist of only two tokens.

Corpus Statistics
Table 5: Examples of claim tweets from the corpus.
Ex. 1: The French have had great success #hydroxycloroquine.
Ex. 2: Death is around 1/1000 in measles normally, same for encephalopathy, hospitalisation around 1/5. With all the attendant costs, the vaccine saves money, not makes it.
Ex. 3: Latest: Kimberly isn't worried at all. She takes #Hydroxychloroquine and feels awesome the next day. Just think, it's more dangerous to drive a car than to catch corona
Ex. 4: Lol exactly. It's not toxic to your body idk where he pulled this information out of. Acid literally cured my depression/anxiety I had for 5 years in just 5 months (3 trips). It literally reconnects parts of your brain that haven't had that connection in a long time.
Ex. 5: Hopefully! The MMR toxin loaded vaccine I received many years ago seemed to work very well. More please!
Ex. 6: Wow! Someone tell people with Cystic fibrosis and Huntington's that they can cure their genetics through Mormonism!

We generally see a connection between the length of a tweet and its class membership. Out of all tweets with up to 40 tokens, 453 instances are non-claims, while 243 contain a claim. For the instances that consist of 41 or more tokens, only 210 are non-claim tweets, whereas 294 are labeled as claims. Shorter tweets (≤ 40 tokens) thus tend to be non-claim instances, while longer tweets (> 40 tokens) tend to be members of the claim class.
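The length statistics above follow from a simple split at 40 tokens. A small helper illustrating the computation (the `tokens` and `label` fields are hypothetical names for the corpus fields):

```python
def length_class_counts(tweets, threshold=40):
    """Count claim/non-claim tweets in the short (<= threshold tokens)
    and long (> threshold tokens) bins."""
    counts = {"short": {"claim": 0, "non-claim": 0},
              "long": {"claim": 0, "non-claim": 0}}
    for tweet in tweets:
        length_bin = "short" if len(tweet["tokens"]) <= threshold else "long"
        counts[length_bin][tweet["label"]] += 1
    return counts
```

On the corpus described above, this yields 453/243 non-claim/claim tweets in the short bin and 210/294 in the long bin.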

Qualitative Analysis
To obtain a better understanding of the corpus, we perform a qualitative analysis on a subsample of 50 claim instances per topic. We manually analyze four claim properties: the tweet exhibits an incomplete argument structure, different argument components blend into each other, the text shows anecdotal evidence, and it describes the claim implicitly. Refer to Table 4 for an overview of the results.
In line with Šnajder (2016), we find that argument structures are often incomplete, e.g., instances only contain a stand-alone claim without any premise. This characteristic is most prevalent in the COVID-19-related tweets. In Table 5, Ex. 1 is missing a premising element, while Ex. 2 presents both premise and claim.
Argument components (claim, premise) are not very clear-cut and often blend together. Consequently, they can be difficult to differentiate, for instance when authors use claim-like elements as a premise. This characteristic is, again, most prevalent for COVID-19. In Ex. 3 in Table 5, the last sentence reads like a claim, especially when looked at in isolation, yet it is in fact used by the author to explain their claim.
Premise elements which substantiate and give reason for the claim (Bosc et al., 2016b) traditionally include references to studies or mentions of expert testimony, but occasionally also anecdotal evidence or concrete examples (Aharoni et al., 2014). We find the latter to be very common for our data set. This property is most frequent for cystic fibrosis and depression. Ex. 4 showcases how a personal experience is used to build an argument.
Implicitness in the form of irony, sarcasm, or rhetorical questions is a common feature of these types of claims on Twitter. We observe that claims related to cystic fibrosis are, in our sample, most often implicit. Ex. 5 and 6 show instances that use sarcasm or irony. The fact that implicitness is such a common feature in our dataset is in line with the observation that implicitness is a characteristic device not only in spoken language and everyday, informal argumentation (Lumer, 1990), but also in user-generated web content in general (Habernal and Gurevych, 2017).

Methods
In the following sections we describe the conceptual design of our experiments and introduce the models that we use to accomplish the claim detection task.

Classification Tasks
We model the task in a set of different configurations.

Binary. A trained classifier distinguishes between claim and non-claim tweets.

Multiclass. A trained classifier distinguishes between explicit claim, implicit claim, and non-claim.
Multiclass Pipeline. A first classifier learns to discriminate between claims and non-claims (as in Binary). Each tweet that is classified as claim is further separated into implicit or explicit with another binary classifier. The secondary classifier is trained on gold data (not on predictions of the first model in the pipeline).
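The pipeline configuration can be sketched with two binary classifiers, where the second stage is trained on gold claim instances only. This is an illustrative sketch using scikit-learn's logistic regression on generic feature vectors, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ClaimPipeline:
    """Two-stage pipeline: claim vs. non-claim, then explicit vs. implicit.
    The second classifier is trained on gold claim instances only."""

    def __init__(self):
        self.claim_clf = LogisticRegression()
        self.type_clf = LogisticRegression()

    def fit(self, X, y):
        # y contains "non-claim", "explicit", and "implicit" labels
        is_claim = np.array([label != "non-claim" for label in y])
        self.claim_clf.fit(X, is_claim)
        # second stage sees gold claims only, not first-stage predictions
        self.type_clf.fit(X[is_claim], np.array(y)[is_claim])
        return self

    def predict(self, X):
        out = np.array(["non-claim"] * len(X), dtype=object)
        mask = self.claim_clf.predict(X).astype(bool)
        if mask.any():
            out[mask] = self.type_clf.predict(X[mask])
        return out
```

The same structure applies regardless of the underlying classifier; logistic regression merely stands in for the models of Section 4.2.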

Model Architecture
For each of the classification tasks (binary/multiclass, steps in the pipeline), we use a set of standard text classification methods which we compare. The first three models (NB, LG, BiLSTM) use 50-dimensional FastText (Bojanowski et al., 2017) embeddings trained on the Common Crawl corpus (600 billion tokens) as input 6 .
NB. We use a (Gaussian) naive Bayes with an average vector of the token embeddings as input.
LG. We use a logistic regression classifier with the same features as in NB.
BiLSTM. As a classifier which can consider contextual information and makes use of pretrained embeddings, we use a bidirectional long short-term memory network (Hochreiter and Schmidhuber, 1997) with 75 LSTM units followed by the output layer (sigmoid for binary classification, softmax for multiclass).
BERT. We use the pretrained BERT (Devlin et al., 2019) base model 7 and fine-tune it using the claim tweet corpus.
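For the NB and LG baselines, each tweet is represented as the average of its token embeddings. A sketch of this feature construction, where a toy embedding table stands in for the pretrained FastText vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def tweet_vector(tokens, embeddings, dim=50):
    """Represent a tweet as the average of its token embeddings;
    out-of-vocabulary tokens are skipped."""
    vecs = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Both linear baselines consume the same averaged-embedding features, e.g.:
# X = np.array([tweet_vector(toks, fasttext_vectors) for toks in tweets])
# GaussianNB().fit(X, labels) or LogisticRegression().fit(X, labels)
```

Here `fasttext_vectors` is a hypothetical placeholder for the 50-dimensional Common Crawl FastText embeddings mentioned above.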

Claim Detection
With the first experiment we explore how reliably we can detect claim tweets in our corpus and how well the two different claim types (explicit vs. implicit claim tweets) can be distinguished. We use each model mentioned in Section 4.2 in each setting described in Section 4.1. We evaluate each classifier in a binary or (where applicable) in a multi-class setting, to understand if splitting the claim category into its subcomponents improves the claim prediction overall.

Table 6: Results for the claim detection experiments, separated into binary and multi-class evaluation. The best F1 scores for each evaluation setting and class are printed in bold face.

Experimental Setting
From our corpus of 1200 tweets, we use 800 instances for training, 200 as validation data to optimize hyperparameters, and 200 as test data. We tokenize the documents and substitute all @-mentions by "@username". For the LG models, we use l2 regularization. For the LSTM models, the hyperparameters learning rate, dropout, number of epochs, and batch size were determined by a randomized search over a parameter grid; we also use l2 regularization. For training, we use Adam (Kingma and Ba, 2015). For the BERT models, we experiment with combinations of the recommended fine-tuning hyperparameters from Devlin et al. (2019) (batch size, learning rate, epochs), and use those with the best performance on the validation data. An overview of all hyperparameters is provided in Table 9 in the Appendix. For the BiLSTM, we use the Keras API (Chollet et al., 2015) for TensorFlow (Abadi et al., 2015). For the BERT model, we use Simple Transformers (Rajapakse, 2019) and its wrapper for the Hugging Face transformers library (Wolf et al., 2020). Further, we oversample the minority class of implicit claims to achieve a balanced training set (the test set remains with the original distribution). To ensure comparability, we oversample in both the binary and the multi-class setting. For parameters that we do not explicitly mention, we use default values. Table 6 reports the results for the conducted experiments. The top half lists the results for the binary claim detection setting. The bottom half of the table presents the results for the multi-class claim classification.
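The oversampling step mentioned above can be sketched as random duplication of minority-class instances until all classes reach the majority-class size. This is an illustrative sketch (applied to the training split only, never to the test set):

```python
import random

def oversample(instances, labels, seed=0):
    """Randomly duplicate minority-class instances until every class
    matches the majority-class size."""
    rng = random.Random(seed)
    by_label = {}
    for inst, label in zip(instances, labels):
        by_label.setdefault(label, []).append(inst)
    target = max(len(insts) for insts in by_label.values())
    X, y = [], []
    for label, insts in by_label.items():
        extra = [rng.choice(insts) for _ in range(target - len(insts))]
        for inst in insts + extra:
            X.append(inst)
            y.append(label)
    return X, y
```

Because the duplicated instances are drawn at random with a fixed seed, the balanced training set is reproducible across the binary and multi-class settings.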

Results
For the binary evaluation setting, we observe that casting the problem as a ternary prediction task is not beneficial: the best F1 score is obtained with the binary LG classifier (.70 F1 for the class claim, in contrast to .61 F1 for the ternary LG). The BERT and NB approaches are slightly worse (1pp and 4pp lower for binary, respectively), while the LSTM shows substantially lower performance (13pp lower).
In the ternary/multi-class evaluation, the scores are overall lower. The LSTM shows the lowest performance. The best result is obtained in the pipeline setting, in which separate classifiers can focus on distinguishing claim/non-claim and explicit/implicit: we see .59 F1 for the explicit claim class. Implicit claim detection is substantially more challenging across all classification approaches.
We attribute the fact that the more complex models (LSTM, BERT) do not outperform the linear models across the board to the comparably small size of the dataset. This appears especially true for implicit claims in the multi-class setting. Here, those models struggle the most to predict implicit claims, indicating that they were not able to learn sufficiently from the training instances.

Error Analysis
From a manual introspection of the best performing model in the binary setting, we conclude that it is difficult to detect general patterns. We show two cases of false positives and two cases of false negatives in Table 7. The false positive instances show that the model struggles with cases that rely on judging the argumentative intention. Both Ex. 1 and 2 contain potential claims about depression and therapy, but they have not been annotated as such, because the authors' intention is motivational rather than argumentative. In addition, it appears that the model struggles to detect implicit claims that are expressed using irony (Ex. 3) or a rhetorical question (Ex. 4).

Table 7 (excerpt; G: gold, P: predicted, n: non-claim, c: claim):
Ex. 1 (G: n, P: c): #DepressionIsReal #MentalHealthAwareness #mentalhealth ruins lives. #depression destroys people. Be there when someone needs you. It could change a life. It may even save one.
Ex. 2 (G: n, P: c): The reason I stepped away from twitch and gaming with friends is because iv been slowly healing from a super abusive relationship. Going to therapy and hearing you have ptsd isnt easy. But look how far iv come, lost some depression weight and found some confidence:) plz stay safe

Cross-domain Experiment
We see that the models show acceptable performance in a binary classification setting. In the following, we analyze if this observation holds across domains, or if information from another, out-of-domain corpus can help.
As the binary LG model achieved the best results during the previous experiment, we use this classifier for the cross-domain experiments. We work with paragraphs of persuasive essays (Stab and Gurevych, 2017) as a comparative corpus. The motivation to use this resource is that while they are a distinctly different text type and usually linguistically much more formal than tweets, they are also opinionated documents. 8 We use the resulting essay model for making an in-domain as well as a cross-domain prediction and vice versa for the Twitter model. We further experiment with combining the training portions of both datasets and evaluate its performance for both target domains.

Experimental Setting
The comparative corpus contains persuasive essays (an essay being "a short piece of writing on a particular subject, often expressing personal views", https://dictionary.cambridge.org/dictionary/english/essay) with annotated argument structure (Stab and Gurevych, 2017). Eger et al. (2017) subsequently used this corpus and provide the data in CoNLL format, split into paragraphs, and pre-divided into train, development, and test set. We use their version of the corpus. The annotations for the essay corpus distinguish between major claims and claims. However, since there is no such hierarchical differentiation in the Twitter annotations, we consider both types as equivalent. We choose to use paragraphs instead of whole essays as the individual input documents for the classification and assign a claim label to every paragraph that contains a claim. This leaves us with 1587 essay paragraphs as training data, and 199 and 449 paragraphs for validation and testing, respectively. We follow the same setup as for the binary setting in the first experiment.
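Deriving paragraph-level claim labels from the token-level CoNLL annotations can be sketched as follows. The tag strings are assumptions about the annotation scheme; as described above, both major claims and claims are mapped to the claim class:

```python
def paragraph_labels(conll_paragraphs, claim_tags=("Claim", "MajorClaim")):
    """Assign a binary label to each paragraph: a paragraph is a claim
    instance if any of its tokens carries a (major) claim tag.
    Each paragraph is a list of (token, tag) pairs, e.g. BIO tags."""
    labels = []
    for paragraph in conll_paragraphs:
        has_claim = any(
            any(tag in token_tag for tag in claim_tags)
            for _token, token_tag in paragraph
        )
        labels.append("claim" if has_claim else "non-claim")
    return labels
```

This mirrors the labeling decision above: a paragraph counts as a claim document as soon as it contains any claim span, without distinguishing claim hierarchy levels.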

Results
In Table 8, we summarize the results of the cross-domain experiments with the persuasive essay corpus. We see that the essay model is successful for classifying claim documents (.98 F1) in the in-domain experiment. Compared to the in-domain setting for tweets, all evaluation scores are substantially higher.
When we compare the two cross-domain experiments, we observe that the performance measures decrease in both settings when we use the out-of-domain model to make predictions (11pp in F1 for tweets, 15pp for essays). Combining the training portions of both data sets does not lead to an improvement over in-domain experiments. This shows the challenge of building domain-generic models that perform well across different data sets.

Discussion and Future Work
In this paper, we presented the first data set for biomedical claim detection in social media. In our first experiment, we showed that we can achieve acceptable performance in detecting claims when the distinction between explicit and implicit claims is not considered. In the cross-domain experiment, we see that text formality, one of the main distinguishing features between the two corpora, might be an important factor influencing the difficulty of the claim detection task.
Our hypothesis in this work was that biomedical information on Twitter presents a challenging setting for claim detection. Both our experiments indicate that this is true. Future work needs to investigate the reasons for this. We hypothesize that our Twitter dataset contains particular aspects that are specific to the medical domain, but it might also be that other latent variables act as confounders (e.g., the time span that was used for crawling). It is important to better understand these properties.
We suggest that future work optimize claim detection models to work well across domains. To enable such research, this paper contributes a novel resource. This resource could be further improved. One way of addressing the moderate agreement between the annotators could be to include annotators with medical expertise, to see if this ultimately facilitates claim annotation. Additionally, a detailed introspection of the topics covered in the tweets for each disease would be interesting for future work, since this might shed some light on which topical categories of claims are particularly difficult to label.
The COVID-19 pandemic has sparked recent research with regards to detecting misinformation and fact-checking claims (e.g., Hossain et al. (2020) or Wadden et al. (2020)). Exploring how a claim-detection-based fact-checking approach rooted in argument mining compares to other approaches is left to future research.