Stance-Taking in Topics Extracted from Vaccine-Related Tweets and Discussion Forum Posts

The occurrence of stance-taking towards vaccination was measured in documents extracted by topic modelling from two different corpora, one discussion forum corpus and one tweet corpus. For some of the topics extracted, their most closely associated documents contained a proportion of vaccine stance-taking texts that exceeded the corpus average by a large margin. These extracted document sets would, therefore, form a useful resource in a process for computer-assisted analysis of argumentation on the subject of vaccination.


Introduction
Opinions towards vaccination that are expressed in discussion forums and in social media, as well as frequently occurring arguments given in support of these opinions, might help us to better understand reasons behind vaccine hesitancy.
There are previous studies in which such texts have been manually analysed (Grant et al., 2015; Faasse et al., 2016), as well as studies in which topic modelling has been applied for analysing texts about vaccination (Tangherlini et al., 2016; Surian et al., 2016; Skeppstedt et al., 2018).
Through topic modelling, it is possible to automatically extract topics that occur frequently in a text collection. For topic modelling to be a useful strategy for mining text collections for frequently occurring arguments, however, at least some of the topics extracted must correspond to stance positions or arguments given for these positions.
The aim of this study is to investigate whether topic modelling is suitable for extracting arguments from two types of document collections that consist of texts about vaccination written by laypeople. We, therefore, measured the occurrence of stance-taking towards vaccination in the documents that were most closely associated with automatically extracted topics from two different corpora.

Background
There are previous studies that use topic modelling in computer-assisted processes to find frequently occurring arguments in a document collection (Sobhani et al., 2015; Skeppstedt et al., 2018). Documents that had been manually annotated as not containing argumentation/stance-taking were, however, removed in those two previous studies, i.e., the effects of including neutral documents when performing topic modelling were not evaluated. For most types of document collections, it is not known beforehand in which documents a stance towards the target of interest is taken. Therefore, the setting used here is more widely applicable, i.e., to use topic modelling on an entire text collection, without removing documents in which no stance is taken. In both of these previous studies, the topic modelling algorithm NMF (Lee and Seung, 2001), i.e., Non-negative Matrix Factorisation, was shown to be appropriate for extracting arguments from short argumentative texts. We, therefore, used this algorithm in our experiments.

Method
We used topic modelling to automatically extract important topics from two different vaccination corpora, both consisting of English text that had predominantly been written by people without a medical background. We then measured the proportion of stance-taking texts among the texts that were most closely related to these topics, and compared it to the proportion of stance-taking texts in the entire corpus.

Document collections
As a proxy for texts containing arguments, we used texts in which stance is expressed, since such texts are likely to also contain a motivation for the position taken. The documents, from each of the two corpora, were divided into two groups based on whether they had been annotated as taking a stance towards vaccination or not, i.e., into the two groups stance-taking and non-stance-taking.
The first collection consists of posts from discussion threads on the topic of vaccination (Skeppstedt et al., 2017) that contain at least one of the following character combinations: "vacc", "vax", "jab", "immunis", and "immuniz". Posts annotated as taking a stance for or against vaccination were combined into the group stance-taking texts, and posts annotated as undecided were assigned the category non-stance-taking.
The second collection consists of tweets containing the HPV vaccine-related keywords "HPV", "human papillomavirus", "Gardasil", and "Cervarix" (Du et al., 2017). We combined tweets annotated according to the categories Positive and Negative to form the category stance-taking tweets, and tweets annotated as Neutral and Unrelated to form the category non-stance-taking tweets.
Before applying topic modelling, the following were removed from the texts: standard English stop words, the terms that had been used for gathering the documents, hashtags, user names, URLs and links. Duplicated and near-duplicated documents were also removed from the collections. Documents sharing an identical span of text that consisted of more than eight consecutive tokens were counted as near-duplicates. For documents consisting of ten or fewer tokens, a shorter cut-off (proportional to the document length) was instead applied for classifying two documents as near-duplicates.
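The near-duplicate criterion described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the whitespace tokenisation, the function names, and in particular the value of `short_ratio` (the proportional cut-off for short documents, which is not specified here) are assumptions.

```python
import math


def shared_span(tokens_a, tokens_b, span_len):
    """Return True if both token lists contain an identical run of span_len tokens."""
    if span_len <= 0 or len(tokens_a) < span_len or len(tokens_b) < span_len:
        return False
    spans_a = {tuple(tokens_a[i:i + span_len])
               for i in range(len(tokens_a) - span_len + 1)}
    return any(tuple(tokens_b[i:i + span_len]) in spans_a
               for i in range(len(tokens_b) - span_len + 1))


def is_near_duplicate(text_a, text_b, min_span=9, short_doc_len=10, short_ratio=0.8):
    """min_span=9 encodes "more than eight consecutive tokens".

    short_ratio is a hypothetical value for the proportional cut-off
    applied when the shorter document has ten or fewer tokens.
    """
    a, b = text_a.lower().split(), text_b.lower().split()
    shortest = min(len(a), len(b))
    if shortest == 0:
        return False
    if shortest <= short_doc_len:
        span_len = max(1, math.ceil(short_ratio * shortest))
    else:
        span_len = min_span
    return shared_span(a, b, span_len)


def remove_near_duplicates(texts):
    """Keep the first occurrence of each group of near-duplicated texts."""
    kept = []
    for text in texts:
        if not any(is_near_duplicate(text, other) for other in kept):
            kept.append(text)
    return kept
```

The pairwise comparison is quadratic in the number of documents, which is acceptable for collections of a few thousand texts but would need indexing of the spans for larger corpora.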

Applying topic modelling
Separate topic models were constructed for the two document collections, using the NMF class from scikit-learn (Pedregosa et al., 2011). For each topic extracted by the NMF model, the corresponding terms and documents associated with the topic are given as output, as well as their level of association with the topics.
The output of the NMF algorithm is nondeterministic, typically generating slightly different topics when run several times. Therefore, to achieve more reliable results, we followed an approach, for instance used by Baumer et al. (2017), in which the algorithm is re-run several times and only topics that occur in the output from all reruns are retained. Before checking which topics occurred in all re-runs, potential outliers were removed from the set of outputs from the re-runs.
We ran the algorithm 100 times, configured to return, for each re-run, a term set consisting of the 50 terms most closely associated with each of the topics extracted by the algorithm. A topic was counted as stable when there was at least a 70% overlap between each pair of term sets returned for the topic, across all 90 retained re-runs of the algorithm.
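The stability criterion can be illustrated with the following sketch. How topics were matched between re-runs and how the overlap was normalised are not specified here, so both are assumptions: topics are matched to the first re-run by highest overlap, and overlap is measured as the number of shared terms relative to the smaller term set.

```python
from itertools import combinations


def overlap(terms_a, terms_b):
    """Shared terms as a fraction of the smaller term set (assumed definition)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def stable_topics(runs, threshold=0.7):
    """Return indices of topics in runs[0] that are stable across all re-runs.

    runs: list of re-runs, each a non-empty list of term lists (one per topic).
    """
    reference, others = runs[0], runs[1:]
    stable = []
    for idx, ref_terms in enumerate(reference):
        # Match the reference topic to its most similar topic in every other re-run.
        matched = [ref_terms] + [
            max(run, key=lambda terms: overlap(ref_terms, terms)) for run in others
        ]
        # Require the overlap threshold for every pair of matched term sets.
        if all(overlap(a, b) >= threshold for a, b in combinations(matched, 2)):
            stable.append(idx)
    return stable
```

With 90 retained re-runs and 50 terms per topic, the pairwise check involves about 4,000 set comparisons per topic, which is inexpensive.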
Potential outliers among the outputs were determined by measuring the average term overlap between the re-run outputs. That is, for each re-run, one combined set consisting of all terms associated with all topics from this re-run was constructed. Thereafter, the average overlap between this combined term set and the corresponding sets from the other re-runs was measured, i.e., the combined term sets constructed in the same fashion for each one of the other re-runs. The outputs from the 10% of the re-runs that had the lowest overlap were discarded as potential outliers, and were thus not included when calculating the stability of the extracted topics.
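A possible implementation of this outlier filtering is sketched below. The Jaccard coefficient is used as the overlap measure, which is an assumption, as are the function names.

```python
def combined_terms(run):
    """Union of all terms over all topics of one re-run."""
    return {term for topic in run for term in topic}


def jaccard(a, b):
    """Jaccard overlap between two sets (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_outlier_runs(runs, discard_fraction=0.1):
    """Discard the re-runs whose combined term set overlaps least with the others."""
    combined = [combined_terms(run) for run in runs]
    average_overlap = []
    for i, terms_i in enumerate(combined):
        scores = [jaccard(terms_i, terms_j)
                  for j, terms_j in enumerate(combined) if j != i]
        average_overlap.append(sum(scores) / len(scores) if scores else 0.0)

    n_discard = round(len(runs) * discard_fraction)
    # Indices sorted so that the re-runs with the lowest average overlap come first.
    ranked = sorted(range(len(runs)), key=lambda i: average_overlap[i])
    discarded = set(ranked[:n_discard])
    return [run for i, run in enumerate(runs) if i not in discarded]
```

Applied to 100 re-runs with `discard_fraction=0.1`, this retains 90 re-runs.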
To avoid having to decide on a fixed number of topics in advance, which is normally required from an NMF user, we started by requesting the algorithm to extract 20 topics, and thereafter gradually decreased the number of topics requested until a maximum of 25% of the extracted topics were discarded as non-stable.
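The search for the number of topics can be sketched as a simple loop. The callback `count_stable` is hypothetical: it stands for the whole procedure of re-running NMF with k topics, removing outlier re-runs and counting the stable topics. A step size of one is also an assumption.

```python
def choose_number_of_topics(count_stable, start=20, max_discarded=0.25):
    """Decrease the requested number of topics until few enough are unstable.

    count_stable(k) is a hypothetical callback returning how many of the
    k requested topics were found to be stable.
    """
    for k in range(start, 0, -1):
        stable = count_stable(k)
        discarded_share = (k - stable) / k
        if discarded_share <= max_discarded:
            return k, stable
    return None  # no setting satisfied the criterion
```

For example, if at most 12 topics were ever stable, the loop would stop at 16 requested topics, where 4 of 16 (25%) are discarded.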

Results
After the near-duplicate filtering, 1,108 and 2,250 documents were retained for the discussion threads and the tweets, respectively. The proportions of stance-taking documents among the documents that were ranked by the algorithm as the top-n documents most typical of the extracted topics are shown in Table 1. These were compared to the 95% confidence interval for the proportion of stance-taking documents among n documents randomly sampled from the corpus. Measurements were carried out for n=35 and n=100. The method used had yielded 90 re-run outputs, each of which contained a slightly different document ranking for the topics extracted. For each of the topics, we therefore extracted the 100 top-ranked documents for every re-run, and ranked these documents according to the sum of the documents' topic-association values over the 90 re-runs.
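The aggregation of document rankings over the re-runs might be implemented along the following lines. The sketch assumes that the topics have already been matched across re-runs and that each re-run provides, for one topic, a mapping from document identifier to association value; the data layout and names are illustrative.

```python
from collections import defaultdict


def aggregate_ranking(run_scores, per_run_top=100, n=35):
    """Rank documents for one topic by summed association over all re-runs.

    run_scores: one dict per re-run, mapping document id -> association value.
    """
    totals = defaultdict(float)
    for scores in run_scores:
        # Only the top-ranked documents of each re-run contribute to the sum.
        top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:per_run_top]
        for doc_id, value in top:
            totals[doc_id] += value
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:n]
```

A document that is ranked highly in only a few re-runs therefore ends up below a document that is moderately associated with the topic in all of them.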
For the figures in Table 1, the stance proportion that lies below the 95% confidence interval for the stance proportions of n randomly selected documents is marked with italics and those that lie above are marked in boldface. That is, the document rankings (top 35 or top 100) that contain a smaller or larger density of stance-taking texts, than had the same number of documents been randomly selected, are marked in italics or boldface.
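One way to obtain such an interval is by simulation, as sketched below. How the interval was actually computed is not described here, so the Monte Carlo approach, the number of simulated samples and the function names are all assumptions.

```python
import random


def sampling_interval(labels, n, n_samples=10000, seed=0):
    """Approximate 95% interval for the stance proportion among n random documents.

    labels: list of booleans, True for documents annotated as stance-taking.
    """
    rng = random.Random(seed)
    # Draw n documents without replacement, many times, and record the proportion.
    proportions = sorted(sum(rng.sample(labels, n)) / n for _ in range(n_samples))
    lower = proportions[n_samples * 25 // 1000]
    upper = proportions[n_samples * 975 // 1000 - 1]
    return lower, upper


def mark(proportion, interval):
    """Return the typographic marking used for a topic's stance proportion."""
    lower, upper = interval
    if proportion < lower:
        return "italics"
    if proportion > upper:
        return "boldface"
    return "plain"
```

A topic whose top-n documents have a stance proportion outside the simulated interval would then be marked in the same way as in Table 1.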
For the discussion forum texts, for which the collection-level proportion of stance-taking was already high, the proportions among the documents extracted for the topics were similar to the collection-level proportion. The general trend was a slight increase in stance-taking documents, with one topic that had a stance-taking proportion above the 95% confidence interval for the top-35 documents and two topics that fulfilled this criterion for the top-100 documents.
Also for the tweets, a majority of the topics had associated documents with a proportion of stance-taking that did not differ significantly from that of a random sample from the document collection. However, some of the topics contained a very high proportion of stance-taking, in comparison to the proportion in the entire document collection. As a result, for the tweets, there was a statistically significant difference for three topics even when extracting only the top 35 most typical documents. These top-35 document sets consisted of semantically coherent tweets. The topic girls/boys/10... mainly consisted of posts advocating HPV vaccine for both boys and girls, often also providing the argument that it prevents cancer. The documents belonging to vax/anti/age/... typically took the opposite stance, and often questioned whether there is proof that HPV vaccination prevents cancer, or warned against perceived adverse effects of HPV vaccination. The topic vaccination/rates/low..., which consisted of expressions of worry about HPV vaccination rates being low, illustrates that stance-taking does not always imply that arguments are given. That is, although most of the tweets associated with this topic clearly take a stance in favour of vaccination, no direct arguments are given here.
There was also one tweet topic with a very low proportion of stance-taking among its associated documents, that is, the topic love/epidemic/documentary... which consisted of many tweets that, in different ways, but in a neutral manner, announced a documentary about HPV.

Discussion and conclusion
A typical practical application of the method studied here would be the case in which an analyst aims to find frequently occurring vaccine-related arguments in a document collection that is too large for a fully manual analysis. The analyst would then instead perform an analysis in which only a subset of the texts would be read, i.e., those automatically extracted through topic modelling.
We have here shown, for the two document collections investigated, that some of the extracted topics have associated documents that contain a larger proportion of vaccine stance-taking texts than the collection average. That is, these document sets would form a useful resource for an analyst who searches for vaccine-related argumentation. The fact that there might also be topics extracted which do not contain argumentation, i.e., topics similar to the love/epidemic... topic, should not pose a large obstacle to the analysis, as long as there are other topics that have associated documents in which stance is taken. That is, at least for the documents extracted for the topic love/epidemic..., it is evident after reading only a few documents that this topic is uninteresting for the task of finding argumentation. Documents closely associated with such topics can, therefore, be excluded from the analysis after a quick inspection. This would enable the analyst to focus on the other topics, which have associated documents that do contain argumentation.