Construction of a Personal Experience Tweet Corpus for Health Surveillance

Studies have shown that Twitter can be used for health surveillance, and personal experience tweets (PETs) are an important source of information for this purpose. Mining Twitter data requires a relatively balanced corpus, and constructing such a corpus is challenging because annotating large data sets is labor-intensive. We developed a bootstrap method of finding PETs with the use of a machine learning-based filter. Through a few iterations, our approach can efficiently improve the balance of a two-class dataset with a reduced amount of annotation work. To demonstrate the usefulness of our method, a PET corpus related to effects caused by 4 dietary supplements was constructed. In 3 iterations, a corpus of 8,770 tweets was obtained from 108,528 tweets collected, and the imbalance between the two classes was substantially reduced from 1:31 to 1:3. In addition, two out of three classifiers used showed improved performance over iterations. It is conceivable that our approach can be applied to various other health surveillance studies that use machine learning-based classification of imbalanced Twitter data.


Introduction
As defined by the Merriam-Webster Dictionary, surveillance is the act of carefully watching someone or something. In the health field, the WHO defines public health surveillance as the continuous, systematic collection, analysis and interpretation of health-related data needed for the planning, implementation, and evaluation of public health practice. Information reported directly by patients is of significant importance, and so is an efficient way of obtaining and analyzing it. With mobile phones and other technologies, patients are inclined to post such information on the web. This represents a great opportunity for those concerned with health surveillance, provided that they can mine the data. As such, the critical issue is where and how to obtain and analyze this health surveillance data.
A common challenge identified in these types of studies is the difficulty in separating the useful or ''on-topic'' tweets from the majority of irrelevant tweets. This poses the challenge of finding the tweets that can help to perform the health surveillance tasks while ignoring the rest.
Twitter is a microblogging platform on which messages of up to 140 characters can be posted. Despite the shortness of the messages, the size of the Twitter user pool means that a large amount of information can be posted. As such, for any given topic, there may be a good number of on-topic tweets and a much larger set of off-topic tweets. As a result, one of the key questions to address is how to obtain the relevant data.
In this study, the term personal experience tweet (PET) is used to describe the tweets that are relevant to the analysis. PETs, therefore, are tweets that describe a person's encounters, observations, and important events related to his or her life. In the case of health surveillance, such experience can be related to changes of a person's health, an illness, a disease, or a treatment. In other words, if any of the above affects an individual it signifies a personal experience. For example, if a medicine causes a person to vomit or improves the person's sleeping behavior, then the person is said to have some experience with the medicine. Personal experience tweets (PETs) are an important source of information for health surveillance using Twitter data.
Given the sheer volume of daily posts, Twitter data are known to contain a significant amount of irrelevant off-topic posts (e.g. news, sales promotions, spam, etc.). This can easily result in collections of Twitter data with a significant bias toward the irrelevant posts. For example, in a study of 2 billion tweets collected from May 2009 to October 2010, Bian and colleagues (Bian et al., 2012) found only 489 on-topic tweets for the 5 medicines being studied in clinical trials. As this study shows, discovering on-topic tweets can be a challenging research problem. Given all the previously stated issues, obtaining relevant data and constructing a relatively balanced corpus can be challenging, and a good collection process must be implemented. This paper discusses the data collection process, the automatic filtering approach, the annotation, and the results of the analysis of the corpus. Issues related to class imbalance are also discussed.
Specifically, this study addresses the following research questions: (1) can an automated filtering algorithm help to speed up manual annotation of a PET corpus, and (2) can the automated filtering approach help to address the class imbalance issues inherent in Twitter data?

Related Work
There have been many studies that validate the use of general purpose social media such as Twitter for surveillance of health-related issues. Many of these surveillance activities involve using the information reported by patients who share their personal health experience on social media. Efforts have been made to construct health-related Twitter corpora (Paul and Dredze, 2012; Collier et al., 2011; Ginn et al., 2014).
Using Mechanical Turk, Dredze's group (Paul and Dredze, 2012) created a corpus of 5,128 tweets classified as related to health or unrelated to health. The results showed only 36.1% of the labeled tweets were health related. It is unclear how the tweets were selected into the corpus.
Collier and colleagues (Collier et al., 2011) created a corpus of 5,283 tweets related to influenza from 225,000 tweets collected from March 2010 to April 30, 2010. These tweets, in 5 classes, were selected using hand-built patterns that the authors did not explain, and were annotated by a single annotator. For the 5 classes, the ratio of negative tweets to positive tweets was 2.52, 1.16, 1.95, 7.19 and 2.53 respectively, indicating that there were more negative tweets than positive ones in each class.
In studying adverse drug reactions from Twitter data, Ginn et al. (Ginn et al., 2014) collected 187,450 tweets over 6 months with 74 carefully selected drug names. A total of 71,571 tweets were retained after removing those containing URLs, which were considered advertisements. Out of the 71,571 tweets, 10,822 were randomly chosen with a cap of 300-500 per drug. The 10,822 tweets were manually annotated by three annotators. Among them, only 1,200 (11%) contained adverse drug reactions (ADRs), an imbalance ratio of 1:8. The authors also reported a Kappa inter-annotator agreement value of 0.69.

Methodology
The purpose of health surveillance is to monitor the status of health conditions. To track health information using Twitter data, a data set of Twitter texts is needed. With this dataset, a methodology can be devised to identify the effects in the text. The challenge is in discovering the relevant tweets. Our initial inspection of tweets collected using 4 dietary supplement names showed that many of the tweets were not personal experience tweets relevant to the work. Manual annotation is an expensive process, especially when using large datasets which contain very few on-topic samples. Therefore, an automated filtering tool was needed to address these issues. One of the purposes of this study is to speed up the process of annotation. Many studies have used manual or rule-based approaches for annotation. However, these approaches are time consuming. In this paper, a machine learning-based approach is proposed to filter out off-topic tweets.
Inspired by the bootstrap method, we developed an iterative approach to creating a Twitter corpus. It starts with a small set of annotated tweets (the seed). In each iteration, the annotated tweets (the training set, which is the corpus) are used to retrain the classifiers, and the tweets that the trained classifiers predict to be in the PET class are annotated and added to the training set, in an attempt to obtain a less imbalanced corpus.
In this section, we present our method of finding personal experience tweets and its application in constructing a PET corpus related to the effects caused by 4 dietary supplements. An automated filter was used to try to remove irrelevant samples before the data set was given to annotators. A description of the creation of the PET corpus using this filter is also presented and discussed. The next few sections of the paper describe in more detail the various considerations and methodology used to create the corpus.

Corpus Construction Procedure
This study was done with the help of two annotators, who were graduate students majoring in biology and computer information technology. They independently labeled the same tweets with personal experience tags if the tweets contained the name of any of the supplements and described the experience of using them. Below are examples of PET tweets.
Example 1: melatonin gives me some messed up dreams.. or i just have awful dreams and melatonin makes me remember them. either way i dont like it.
Example 2: look into St. John's Wort. Actually helps calm me down at night to sleep. always had the same issues.
First, a small number of tweets were randomly selected as a training set and annotated manually by the annotators. This was a single, non-repeated step to create a seed set. Next, three classifiers were trained using this training set and then used to classify a larger test set, yielding a PET set and a non-PET set. The PET set was then labeled by the annotators, and the annotated tweets were added to the training set (corpus). The classifiers were retrained with the updated corpus and then used on a new batch of test data. These steps were repeated until a relatively balanced corpus was achieved.
Although investigating and annotating only the predicted PET class significantly reduces the effort needed for annotation, it could introduce bias by under-representing non-PET (majority) tweets. To compensate for this potential bias, we intentionally added a small number of non-PET tweets to the training set in each iteration (Step 08 below). The above steps are summarized in the following algorithm.
Algorithm ConstructTweetCorpus()
Input: A set of tweets T, balance ratio β, accuracy δ
Output: A tweet corpus T
01: Randomly choose a small collection of n tweets from T as a training set denoted by T
02: Annotate T
03: Train classifiers with T
04: Do while balance ratio of T < β and/or accuracy of classifiers < δ
05:   Select a collection of l new tweets from T as test set denoted by T_l
06:   Classify T_l using trained classifiers, yielding a predicted PET set T_y and non-PET set T_n
07:   Annotate T_y, yielding the annotated set T_y
08:   Select m tweets randomly from predicted non-PET set T_n and annotate them, yielding the annotated sample T_n
09:   Add T_y and T_n to the training set T, yielding a new training set: T ← T + T_y + T_n
10:   Train classifier(s) with T
11: Loop
12: Return T
where l is greater than m, β is the balance ratio (the ratio between the number of PET and non-PET tweets), and δ is the expected accuracy. The value of m is only a fraction of the number of tweets in the newly predicted PET class (Step 06). Both l and m can be constants. The accuracy of a classifier is measured by the ROC Area and/or F-measure.
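The loop in the algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the `annotate` callback stands in for the human annotators, `train` is a hypothetical function that returns a single predictor (for example, a majority vote over several classifiers), the stopping test checks only the balance ratio and omits the accuracy condition, and the names `n_seed`, `batch_size`, `m_non_pet` and `max_iters` are illustrative.

```python
import random
from typing import Callable, List, Tuple

Tweet = str
Labeled = Tuple[Tweet, bool]  # (tweet text, True if annotated as PET)
Predictor = Callable[[Tweet], bool]


def balance_ratio(corpus: List[Labeled]) -> float:
    """Ratio of the number of PET tweets to the number of non-PET tweets."""
    pets = sum(1 for _, is_pet in corpus if is_pet)
    non_pets = len(corpus) - pets
    return pets / non_pets if non_pets else float("inf")


def construct_tweet_corpus(
    pool: List[Tweet],
    annotate: Callable[[Tweet], bool],            # assumption: human annotation
    train: Callable[[List[Labeled]], Predictor],  # assumption: returns one predictor
    n_seed: int,       # n in the algorithm
    batch_size: int,   # l in the algorithm
    m_non_pet: int,    # m in the algorithm
    beta: float,       # target balance ratio
    max_iters: int = 10,
    seed: int = 0,
) -> List[Labeled]:
    rng = random.Random(seed)
    pool = pool[:]
    rng.shuffle(pool)

    # Steps 01-03: pick and annotate the seed set, then train.
    corpus = [(t, annotate(t)) for t in pool[:n_seed]]
    pool = pool[n_seed:]
    predict = train(corpus)

    # Step 04: loop until the corpus is balanced enough (accuracy check omitted).
    iters = 0
    while balance_ratio(corpus) < beta and pool and iters < max_iters:
        # Step 05: take a new batch of unlabeled tweets.
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Step 06: split the batch by predicted class.
        flagged = [(t, predict(t)) for t in batch]
        predicted_pet = [t for t, is_pet in flagged if is_pet]
        predicted_non = [t for t, is_pet in flagged if not is_pet]
        # Step 08: sample a few predicted non-PET tweets to limit bias.
        sampled_non = rng.sample(predicted_non, min(m_non_pet, len(predicted_non)))
        # Steps 07-09: annotate both sets and add them to the corpus.
        corpus += [(t, annotate(t)) for t in predicted_pet + sampled_non]
        # Step 10: retrain on the enlarged corpus.
        predict = train(corpus)
        iters += 1
    return corpus
```

Only the predicted PET tweets and a small non-PET sample are passed to `annotate` in each iteration, which is where the reduction in annotation work comes from.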

Dataset
Using the above algorithm, we constructed a PET corpus related to 4 dietary supplements: Echinacea, Melatonin, St. John's Wort, and Valerian. A total of 108,528 tweets were collected from May 30, 2014 to December 8, 2014, through the Twitter REST API. The supplement names were used as keywords to perform Twitter searches. The breakdown of the collected Twitter data is: 9,210 tweets for Echinacea, 81,915 for Melatonin, 3,176 for St. John's Wort, and 14,227 for Valerian. The collected Twitter data were preprocessed to remove retweets and non-English tweets.

Features
Two types of features were used by the machine learning-based filter: metadata and textual. Metadata features describe the tweet itself rather than its text. They include the user id and the Twitter client application. Textual features are the ones extracted directly from the 140-character Twitter text. Most of the tweets collected were unrelated to personal experience. They were usually marketing or promotion tweets or just facts about what a supplement does. In a study of 106 million tweets with 4,262 trending topics, Kwak et al. (2010) found that the majority of the messages were news specific. In another study, Krieck and colleagues found that news information normally repeats official information and makes no contribution to the early detection of disease outbreaks (Krieck et al., 2011).
It has been observed that personal pronouns appear frequently in social media posts related to personal experiences (Elgersma and de Rijke, 2008; Jiang and Zheng, 2013). Personal pronouns were considered as a feature to classify personal and impersonal sentences (Li et al., 2010).
Our observation revealed that words or phrases commonly used in one class but not in the opposite class may contribute to the accurate prediction of PET and non-PET tweets. These words or phrases were found in both tweet texts and Twitter user names (unlike the Twitter screen name, a Twitter user name can be a phrase). For example, online stores may use terms such as shop, store, and market in their names. The presence of any such word can give classifiers a hint for identifying promotional tweets.
A client application is the software application a Twitter author uses to post Twitter messages. Westman and colleagues observed that personal tweets were more often posted from the Twitter website (Westman and Freund, 2010). The following are the features used in this study.
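The idea of terms that are common in one class but not in the other can be illustrated with a short sketch. It is a simplified stand-in, not the feature extraction used in this study: it assumes whitespace tokenization, treats "common" as "among the `top_k` most frequent tokens", and the function names `differential_terms` and `term_feature` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def differential_terms(class_a: Iterable[str], class_b: Iterable[str],
                       top_k: int = 50) -> Set[str]:
    """Terms among the top_k most frequent in class_a but not in class_b."""
    def top_terms(texts: Iterable[str]) -> Set[str]:
        counts = Counter(w for t in texts for w in t.lower().split())
        return {w for w, _ in counts.most_common(top_k)}

    return top_terms(class_a) - top_terms(class_b)


def term_feature(text: str, terms: Set[str]) -> int:
    """Count the tokens of a tweet text or user name that occur in the term set."""
    return sum(1 for w in text.lower().split() if w in terms)


# Toy data, invented for illustration only.
pet = ["melatonin gives me weird dreams", "valerian helps me sleep"]
non_pet = ["melatonin shop sale", "valerian store sale today"]
promo_terms = differential_terms(non_pet, pet)
```

Applied to a user name such as "Herbal Store sale", `term_feature` with `promo_terms` counts the tokens that hint at a promotional account.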

Classifiers
For filtering the off-topic tweets, three classifiers were used: decision tree (J48), KNN (IB1), and neural network (Multilayer Perceptron, MLP). Neural networks are known for deriving meaning from complex and imprecise data. Decision trees are simple to understand and interpret, and easily handle feature interaction. KNN is simple and robust to noisy data. For evaluation purposes, both ROC metrics and F-measure were used, because F-measure alone is not an appropriate measure of performance when the data are imbalanced (Chawla, 2009). Weka (Hall et al., 2009), which contains implementations of all three algorithms, was used in our study. It is well understood that not all classifiers perform the same way. The majority rule was therefore used to determine the outcome of classification: if the outputs of two or more classifiers were PET, the tweet was considered a PET tweet.
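The majority rule described above amounts to a simple vote count. The sketch below assumes that each classifier output has already been reduced to a boolean (True for PET); the function name `majority_vote` is illustrative, and the Weka classifiers themselves are not reproduced here.

```python
from typing import Sequence


def majority_vote(predictions: Sequence[bool]) -> bool:
    """Return True (PET) if two or more classifier outputs are PET."""
    return sum(predictions) >= 2


# Hypothetical outputs of J48, IB1 and MLP for one tweet.
is_pet = majority_vote([True, False, True])
```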

Results
Using a seed of 3,176 tweets (Run 0), our algorithm ran for three iterations with the test sets shown below. In each iteration (Run 1 through Run 3), the size of the training set (corpus) increased as more annotated tweets were added.

[Table 1. Columns: Iteration, Training Set]

The final annotated data set consisted of 8,770 tweets, which are available at https://github.com/medeffects/supplement-corpus/. Of these, 2,067 were PET tweets and 6,703 were non-PET tweets.

Inter-Annotator Agreement
Inter-annotator agreement metrics help to establish the subjectivity of an annotation scheme. The annotation task was performed by 2 annotators. Two labels were used for the annotation: PET and non-PET. As shown in the table below, the average agreement was 85.4%. After correcting for expected chance agreement, kappa and the other metrics still provide a reasonable score for assessing the annotation consistency. The result indicates that the task of finding personal experience tweets does have a level of subjectivity. These values can later be useful for defining an expected upper bound on the PET classification task.
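For readers who want to reproduce this kind of measurement, the sketch below computes observed agreement and a chance-corrected kappa for two annotators. It assumes Cohen's kappa, which is one common choice; the text does not state which kappa variant was used, and the label sequences are invented for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(a: Sequence[str], b: Sequence[str]) -> Tuple[float, float]:
    """Observed agreement and Cohen's kappa for two annotators' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa


# Invented labels for four tweets.
annotator_1 = ["PET", "PET", "non-PET", "non-PET"]
annotator_2 = ["PET", "non-PET", "non-PET", "non-PET"]
observed, kappa = agreement_and_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four tweets, yet kappa is lower than the raw agreement because part of that agreement is expected by chance.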

Corpus Class Balance
Table 2: Inter-annotator agreement metrics

As stated earlier, the corpus was built in iterations (or runs). Each iteration used a larger training set that contained more examples of PET tweets. With each iteration, more PET tweets were added to the corpus, as shown in Table 3, leading to a more balanced distribution of PET and non-PET tweets. This result is beneficial for this study, since its goal is to find as many personal experience tweets as possible, which can later be used to associate effects with dietary supplements for health surveillance.

Classifier Performance
In addition to studying the overall performance of classifiers collectively, we also collected performance data of each individual classifier on predicting PET tweets, and they are shown in the figure below.

Feature Ranking
One important aspect of this study is to determine which features helped to automatically detect personal experience tweets. As indicated previously, most of the 19 features used by the classifiers were extracted from the tweet text using natural language processing techniques. To perform the feature analysis, the Chi-Square ranking method was used. The top ranked features are occurrences of

Prediction Precision
The overall performance of the PET classifiers was measured with the training sets. The PET classifiers are the filter used to identify relevant tweets for human annotation. The actual performance of the filter should be measured on its predictions for the test data. Because only the predicted PET tweets were annotated, only true positive and false positive figures were available, so prediction precision was measured. Precision is the ratio of actual PET tweets to predicted PET tweets in the same predicted PET set, and it measures how well the classifiers perform when making predictions. Table 4 shows that the precision falls within the range of 0.28 to 0.49. This indicates that for every 100 predicted samples (tweets), between 28 and 49 may be actual PETs.
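The precision calculation described above can be written as a one-line ratio. The sketch below uses hypothetical counts of 100 predicted PET tweets, chosen only to mirror the reported range; they are not the counts from Table 4.

```python
def prediction_precision(actual_pet: int, predicted_pet: int) -> float:
    """Fraction of predicted PET tweets that annotators confirmed as PET."""
    if predicted_pet <= 0:
        raise ValueError("predicted_pet must be positive")
    if not 0 <= actual_pet <= predicted_pet:
        raise ValueError("actual_pet must be between 0 and predicted_pet")
    return actual_pet / predicted_pet


# Hypothetical counts matching the lower and upper ends of the reported range.
low = prediction_precision(actual_pet=28, predicted_pet=100)
high = prediction_precision(actual_pet=49, predicted_pet=100)
```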

[Table 4. Columns: Iteration, # PET Tweets Predicted, # PET Tweets Actual, Precision]

Discussions
The amount of annotation work can be significant when constructing a corpus requires examination of large sets of data. In this study, annotating all 108,528 tweets would have taken the annotators a significant amount of time. Using our proposed method, however, the two annotators only needed to annotate 8,770 tweets (the initial seed tweets plus the predicted PET tweets and added non-PET tweets in each iteration; see Table 1). If it takes an average of one minute to annotate a single tweet and each annotator spends 8 hours a day on annotation, it will take an annotator about 226 days to annotate 108,528 tweets, but only about 18 days to annotate 8,770 tweets. This represents a significant reduction of annotation time. By some estimates, the kappa score shown in Table 2 may be considered low, which suggests that finding personal experience tweets is highly subjective and that the text is difficult to annotate. Personal experience, in the context of this paper, is text expressed by a person that is of a very personal nature. The difficulty may lie in the fact that there is no set lexicon to define personal experience. In contrast, emotion text detection, which is also considered subjective, does have its own lexicon (e.g., happy words vs. sad words).
As can be seen in Table 3, our approach is also efficient in improving the class balance of the corpus. With only 3 iterations, the ratio of the number of PET tweets to that of non-PET tweets improved from 1:31 to 1:3, roughly a 10-fold improvement.
The performance of the individual classifiers in predicting PET tweets with the training data either remained at the same level or improved over iterations. For ROC Area (Figure 1), both IB1 and J48 improved, and MLP remained the same. For F-Measure (Figure 2), which is not an appropriate indicator of performance when data are imbalanced, all three classifiers improved. In addition, the multilayer perceptron (MLP) classifier had the best accuracy in predicting PETs.
Although the values of ROC Area and F-Measure are quite promising, when it came to predicting the unlabeled data (test set), the three classifiers could only predict PET tweets with 28% to 49% precision. This implies that if the classifiers are used to predict PETs on new sets of unlabeled tweets, only 28% to 49% of the tweets in the predicted PET set may be actual PET tweets.
Our feature ranking result suggests that, between metadata and textual features, textual features contribute the most to overall classification accuracy. The best performing features are the ones related to the frequency of terms used in either the tweet text or the user name, that is, the most frequent terms in a class that are infrequent in the opposite class. This approach is sometimes referred to as the Gramulator type approach.
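The annotation time estimate in the discussion above can be checked with a few lines of arithmetic. The sketch uses the same assumptions stated in the text (one minute per tweet, 8 hours of annotation per day); the function name `annotation_days` is illustrative.

```python
def annotation_days(n_tweets: int, minutes_per_tweet: float = 1.0,
                    hours_per_day: float = 8.0) -> float:
    """Working days one annotator needs to label n_tweets."""
    return n_tweets * minutes_per_tweet / 60 / hours_per_day


full_collection = annotation_days(108_528)  # about 226 days
filtered_corpus = annotation_days(8_770)    # about 18 days
```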

Conclusion
We proposed a bootstrap method to construct a tweet corpus from noisy Twitter data. Through a few iterations, our approach can help to quickly construct a tweet corpus with more closely balanced classes, without a significant amount of annotation effort. It is conceivable that our approach can be applied to other health surveillance studies that use machine learning-based classification of imbalanced social media data.