Detecting Diabetes Risk from Social Media Activity

This work explores the detection of individuals’ risk of type 2 diabetes mellitus (T2DM) directly from their social media (Twitter) activity. Our approach extends a deep learning architecture with several contributions: following previous observations that language use differs by gender, it captures and uses gender information through domain adaptation; it captures recency of posts under the hypothesis that more recent posts are more representative of an individual’s current risk status; and, lastly, it demonstrates that in this scenario where activity factors are sparsely represented in the data, a bag-of-word neural network model using custom dictionaries of food and activity words performs better than other neural sequence models. Our best model, which incorporates all these contributions, achieves a risk-detection F1 of 41.9, considerably higher than the baseline rate (36.9).


Introduction
The prevalence of diabetes is increasing in the US, reaching 30.3 million cases in 2015, of which 7.2 million were undiagnosed (Centers for Disease Control and Prevention, 2017). Diabetes caused over 79 thousand US deaths in 2015, in addition to $245 billion in economic costs in 2012 (American Diabetes Association, 2013). Along with genetic factors, lifestyle factors such as diet and physical activity are among the important drivers of risk for Type 2 Diabetes Mellitus (T2DM), the most common type of diabetes. At the same time, the widespread use of social media has produced a digital record of these factors, offering potential insight into how they interact to contribute to health risk over time. These publicly available data present an opportunity to detect diabetes risk and similar health risks at scale.
This work shows that the detection of individuals' diabetes risk solely from their public Twitter activity is possible, demonstrating that at-risk individuals use language differently from less at-risk individuals. Importantly, this detection is a first, crucial component in a larger battery of social media-based, public-health intervention tools that will work toward disease prevention on a large scale. Specifically, our contributions are:
(1) We introduce a process that creates a novel dataset, which pairs individuals' T2DM risk with their social media activity. We measured individuals' T2DM risk using a well-established, validated questionnaire (Bang et al., 2009), and aligned the result with the corresponding Twitter accounts. To our knowledge, this is the first dataset that directly links T2DM risk with social media activity.
(2) We introduce the first machine learning (ML) approach for classifying individuals' T2DM risk based solely on their Twitter activity. Our deep learning approach has several novel contributions: (a) following previous observations that language use differs by gender, it captures and uses gender information through domain adaptation (Daumé, 2007); (b) it captures recency of posts under the hypothesis that more recent posts are more representative of an individual's current risk status; and, lastly, (c) it demonstrates that in this scenario, where words representing real-life risk factors are sparsely represented in the data, a bag-of-word (BOW) model that uses custom dictionaries of food and physical activity words is a better solution than recurrent neural networks (RNN). Our best model, which incorporates all these contributions, achieves a risk-detection F1 of 41.9, considerably higher than the baseline rate (36.9). In comparison, a realistic ceiling model based on the true age, gender, and Body Mass Index (BMI, kg/m^2) of each respondent achieves only 62.7 on this task.
(3) We provide a feature analysis based on Layerwise Relevance Propagation (Bach et al., 2015; Binder et al., 2016; Arras et al., 2016, 2017), revealing that relevance aligns, albeit inconsistently, to expected food and activity values on average.

Data
We collected the dataset used in this work on a voluntary basis through a Qualtrics survey. Participants self-selected by following a URL in an invitation tweet, and after consenting to participate, provided their Twitter handles, demographic information, and answers to an established questionnaire that estimates T2DM risk (Bang et al., 2009). The questionnaire yields an easy-to-understand risk score ranging from 0 to 10, computed from data such as age and physical activity level, with a score of 5 or higher representing elevated risk. Each participant received a risk assessment, including a summary of the sources of their risk, an explanation of how to get diagnosed (i.e., through a blood test), and a link to further information.
Of the 3,612 respondents who completed surveys, 736 (20.4%) supplied a Twitter handle. After removing respondents who provided no handle, an obviously false handle, or a handle with no public tweets, 604 (16.7%) respondents with handles remained. The relatively modest dataset size is a natural consequence of the complexity of the data and the sensitivity of its collection. The distribution of risk scores among respondents is summarized in Fig. 1.
The complex relationship between height, weight, and risk score is illustrated in Fig. 2. Though BMI is a major risk factor for diabetes, the existence of other factors means that there is considerable risk variation within BMI categories, and the discretization of BMI into categories necessarily obscures variation within categories. Many respondents would change BMI categories if an inch were added to or subtracted from their height, for example.
We used the Twitter API to collect the tweet and profile text for each handle. The tweets and profile descriptions were tokenized and part-of-speech tagged using ARK Tweet NLP (Owoputi et al., 2013). Each account was labeled at-risk if the owner's questionnaire risk score was 5 or greater, or less-risk otherwise. A summary of account statistics is shown in Table 1.

Approach
We predict individual-level T2DM risk from individual-level data (i.e., individual Twitter accounts), as opposed to transferring from community-level statistics (e.g., county diabetes rate as dependent variable; all tweets in that region as input). Intuitively, using a community-level model should be a viable strategy: much more data is available for training, and previous work has shown that exploring this data leads to good community-level estimations (Fried et al., 2014). However, our initial experiments showed that individual variation within communities was considerable, overshadowing the variation across communities and limiting the effectiveness of such methods. In our preliminary experiments, the community-level model did not perform better than chance for estimating individual risk.
As a result of this initial analysis, in this work we focus on predicting T2DM risk from individual Twitter accounts. To this end, we propose a neural network (NN) architecture tailored to T2DM risk estimation, which relies on the following resources.

Resources
Custom dictionaries: In early experiments, we observed that no model trained on the posts' entire content outperformed a simple baseline. We explain this result by the fact that indicators of risk factors (e.g., diet or activity words) are sparsely represented in this data, and the models cannot reliably identify them. To mitigate this problem, we created domain-specific dictionaries of words and hashtags indicating foods (pizza), exercise (#5k), chain restaurant names (#mcdonalds), and hashtags related to being overweight (#fatguyproblems). The food words were derived from a domain-specific Spanish-English glossary (www.lingolex.com/spanishfood/a-b.htm) and a food vocabulary set (www.enchantedlearning.com/wordlist/food.shtml), following Fried et al. (2014). Exercise words and restaurant names were adapted from Wikipedia lists of sports (en.wikipedia.org/wiki/List_of_sports) and restaurants (en.wikipedia.org/wiki/List_of_the_largest_fast_food_restaurant_chains). The smaller list of 13 overweight-related terms was hand-chosen based on Twitter searches.
To adapt the food dictionary to Twitter, we automatically expanded it using semantic vectors. We trained the word2vec algorithm (Mikolov et al., 2013) over an independent dataset of 12.3M food-related tweets (collected automatically using a set of seven diet-related hashtags such as #breakfast and #lunch), creating a 200-dimensional vector for each word. For each existing dictionary term, we found the 5 closest candidate words, as measured by cosine distance. Each candidate could appear in multiple lists (e.g., #breakfastburrito is similar to both burrito and taco), so we calculated the softmax of the distances for each candidate. We then expanded our dictionary with the top 500 candidates, which included words such as halloumi, muesli, and sriracha. After these additions, there was a total of 2,871 features.
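The dictionary expansion described above can be illustrated with a minimal Python sketch. It is not the released implementation: the function name `expand_dictionary`, the in-memory `embeddings` mapping, and in particular the way scores are combined (one softmax over all seed-candidate similarities, summed per candidate) are assumptions made for illustration, since the text does not specify these details.

```python
import numpy as np

def expand_dictionary(embeddings, seeds, k=5, top_n=500):
    """Return up to top_n new words ranked by an aggregated softmax score.

    embeddings: dict mapping word -> 1-D numpy vector (hypothetical input;
                the paper uses 200-dimensional word2vec vectors).
    seeds:      iterable of existing dictionary terms.
    """
    seeds = [s for s in seeds if s in embeddings]
    seed_set = set(seeds)
    vocab = [w for w in embeddings if w not in seed_set]
    if not seeds or not vocab:
        return []

    def unit(words):
        # Unit-normalize rows so that dot products are cosine similarities.
        m = np.stack([embeddings[w] for w in words]).astype(float)
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    sims = unit(seeds) @ unit(vocab).T      # shape: (n_seeds, n_vocab)

    # Collect the k nearest candidates for every seed term.
    hits = []                               # (vocab index, similarity) pairs
    for row in sims:
        for j in np.argsort(-row)[:k]:
            hits.append((j, row[j]))

    # ASSUMPTION: a single softmax over all (seed, candidate) similarities;
    # a candidate found in several neighbor lists accumulates probability.
    values = np.array([s for _, s in hits])
    probs = np.exp(values - values.max())
    probs /= probs.sum()
    scores = {}
    for (j, _), p in zip(hits, probs):
        scores[vocab[j]] = scores.get(vocab[j], 0.0) + p

    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Under this assumed aggregation, a candidate such as #breakfastburrito that is close to both burrito and taco is ranked above a candidate that is close to only one seed term.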
Gender: It is well established that language use differs by gender (Rao et al., 2010; Burger et al., 2011; Volkova et al., 2013; Johannsen et al., 2015). On the hypothesis that conditioning classification on such secondary variables would maximize the informativeness of other features, we automatically annotated each account for gender. We predicted gender using an SVM model trained on a separate corpus of 1,000 Twitter accounts hand-annotated with gender information (man or woman). This gender classifier used solely unigram features extracted from the account description and its tweets. The macro-averaged F1 of this model is 75.79 on the T2DM dataset (cf. human annotators, who averaged 71% accuracy on a similar task (Nguyen et al., 2014)).

Neural network architecture
We propose a feedforward neural network with one hidden layer, which captures both post recency (by weighting each input word by the recency of the corresponding post) and gender information (captured through domain adaptation). The proposed architecture is depicted and summarized in Fig. 3. This network uses pre-trained word embeddings of 200 dimensions generated using word2vec (Mikolov et al., 2013) on the above corpus of food-related tweets. The tanh layer has 128 neurons, and was trained under a 40% dropout. Importantly, this network uses only account words that matched entries in the above custom dictionaries.
Recency weighting: Our preliminary analysis indicated that more recent tweets are more relevant for classification. We attribute the effect of recency to transitions from high to low risk or vice versa due to lifestyle changes, in which case more recent tweets are more representative. To capture recency, we introduce a simple attention mechanism where each word is weighted by its recency, defined as normalized tweet position in the corresponding account. More formally, the recency weight ($r_i$) of a word $w_i$ is defined as:
$$r_i = \frac{\text{position of tweet containing } w_i}{\#\text{tweets in account}}$$
where the newest tweet in an account has the highest position. The average embedding ($x$) is calculated as:
$$x = \frac{1}{n} \sum_{i=1}^{n} r_i w_i$$
Domain adaptation: We capture gender information using the domain adaptation method of Daumé (2007), adapted to neural networks. As shown in Figure 3, we replicate the output of the tanh layer $t$ to have a domain-independent version, and one version specific to each domain modeled. For example, the concatenated vector for a female account is $\langle t, t, 0 \rangle$, where $0$ is the zero vector corresponding to the male-account domain. This routing process is automatically implemented using the gender classifier described in the previous subsection. All in all, this allows the top sigmoid layer to detect information that generalizes across all domains, in which case the domain-independent vector ($t_g$) receives a larger update during backpropagation, or is specific to a domain, in which case the corresponding domain-specific vector ($t_{d1}$ or $t_{d2}$) is updated more.
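As an illustration of how recency weighting and gender-based domain adaptation fit together, the following NumPy sketch runs one forward pass for a single account. It is a simplified reading of the architecture rather than the released code: the parameter names (`W_h`, `b_h`, `w_out`, `b_out`) and the function names are hypothetical, dropout and training are omitted, and the average is taken as a plain mean of the recency-weighted embeddings.

```python
import numpy as np

def recency_weights(tweet_positions, n_tweets):
    """r_i = position of the tweet containing w_i / number of tweets.

    Positions are 1-based and the newest tweet has the highest position.
    """
    return np.asarray(tweet_positions, dtype=float) / n_tweets

def forward(word_vecs, tweet_positions, n_tweets, is_female, params):
    """Return the at-risk probability for one account (no dropout)."""
    r = recency_weights(tweet_positions, n_tweets)
    # Recency-weighted average of the dictionary-word embeddings.
    x = (r[:, None] * word_vecs).mean(axis=0)
    # Hidden tanh layer.
    t = np.tanh(params["W_h"] @ x + params["b_h"])
    zeros = np.zeros_like(t)
    # Daume-style augmentation, ordered <general, female, male>:
    # the copy for the other gender is replaced by the zero vector.
    parts = [t, t, zeros] if is_female else [t, zeros, t]
    augmented = np.concatenate(parts)
    # Top sigmoid layer.
    logit = params["w_out"] @ augmented + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))
```

Because the male copy is zeroed for a female account, output weights attached to the male block cannot influence that account's prediction, which is what lets the top layer learn gender-specific and gender-general evidence separately.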

Baselines
We implemented three baselines: (1) All at risk: This baseline assumes all individuals are at risk, i.e., they have a score of 5 or higher.
(2) Support vector machines (SVM): This baseline model uses a linear SVM with unigram features from words and hashtags that match our custom dictionaries. Similarly, following the domain adaptation method of Daumé (2007), we incorporate gender information by prepending each feature name with the account's gender annotation (in addition to keeping the original feature). For example, an account annotated as a woman who used the word coffee 16 times would yield a unigram feature coffee in all models, and additionally a feature gender:woman coffee, both with a feature value of 16. (The account's feature gender:man coffee would have a value of 0.) This allows these models to discover the best generalization for this task, e.g., if coffee is an important classification word for women only, the models will put the greatest weight on the gender:woman coffee feature; conversely, if coffee is always important, the generic unigram feature coffee will be assigned greater weight.

Figure 3: Network architecture. The sequence of account words matching the custom dictionaries is translated into a set of word embeddings w1, w2, ..., wn. The embeddings are multiplied by recency weights r1, r2, ..., rn. The resulting vectors are averaged (x) and passed to the tanh hidden layer. The output of this layer is replicated, producing copies for the general domain, tg, and for each of the domains, td1 and td2, e.g., d1 = female and d2 = male. If an account belongs to domain d1, the copy td2 is set to the zero vector, and vice versa. The copies are then concatenated and fed to the top sigmoid (σ) layer.

(3) Convolutional neural networks (CNN): For this baseline, we apply a CNN layer to the sequence of embeddings of dictionary words that occur in the corresponding account, followed by a rectified linear unit (ReLU). We implement domain adaptation for gender by augmenting the output of the ReLU layer, similarly to the tanh layer in Figure 3. The resulting vector feeds a top sigmoid layer that makes the prediction. 13
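The gender feature augmentation used by the SVM baseline can be sketched in a few lines of Python. The function name `augment_with_gender` is hypothetical; the feature-name format follows the coffee example in the text.

```python
def augment_with_gender(unigram_counts, gender):
    """Add a gender-prefixed copy of every unigram feature.

    unigram_counts: dict mapping dictionary word -> count in the account.
    gender:         the account's gender annotation, e.g. "woman" or "man".
    """
    features = dict(unigram_counts)          # keep the generic features
    for word, count in unigram_counts.items():
        # Gender-specific copy, named as in the text's coffee example.
        features[f"gender:{gender} {word}"] = count
    return features

print(augment_with_gender({"coffee": 16}, "woman"))
```

Features for the other gender are simply absent, which a sparse representation treats as 0. Dictionaries of this shape could then be converted to a sparse matrix (for instance with scikit-learn's DictVectorizer) before training a linear SVM; that tooling choice is an assumption, not something the text states.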

Ceiling models
We also developed two ceiling models against which to compare our text-based approaches. The first model (Ceiling) is an SVM trained with all the risk assessment variables collected in the survey mentioned in Section 2. This dataset is maximally informative, because these are precisely the variables that determine the risk score (Bang et al., 2009). However, it is not realistic, because most of these features are not available in social media, either directly or through machine learning techniques. For this reason, we also implemented an alternative and more realistic version of the ceiling system (Realistic Ceiling) that incorporates only those features that have previously been predicted by automatic systems through social media text or images (see Section 5). The features are summarized in Table 2.

13 We also experimented with gated recurrent units, and with using all words instead of just dictionary words/hashtags. None outperformed this CNN configuration.

Results
We used 10-fold cross-validation to train and evaluate each model on the binary classes at-risk and less-risk (see Section 2), using the same folds across all models. For each of the 10 runs, we reserve one fold for development (to tune hyperparameters such as the classifier confidence cutoff), one fold for testing, and the rest for training. Table 3 summarizes the results of the proposed models, compared against the baselines described in Section 3. In the table, -R marks models that have recency information (models without recency used uniform $r_i$ weights), -GG marks models that used the gold gender information collected during the questionnaire, and -PG marks models that used predicted gender information. SVM-U is an SVM model using all available words except a stoplist of closed-class words. The table highlights several observations:
(1) The proposed NN models outperform all baselines, demonstrating that our NNs generalize better on this task dominated by sparse signals. Importantly, most of the strong baselines we include are below the performance of the simple "all at risk" baseline, highlighting again the difficulty of the task. The only baseline that outperformed "all at risk" is CNN-GG, which uses gold gender information that would not be available in real-world deployments. Interestingly, our approach, which essentially relies on a (recency-weighted) bag-of-word model, outperforms all the baselines that rely on sequence models. Similar observations about bag-of-word models outperforming sequence models on complex NLP tasks have been made in the past (Iyyer et al., 2015; Wang and Manning, 2012, inter alia).
(2) Both recency and gender information help. Our best model includes both, validating our original hypotheses. Surprisingly, models using predicted gender performed slightly better than models using gold gender information, but this difference was not statistically significant.
(3) This bag-of-word NN that uses only words/hashtags from relevant dictionaries considerably outperforms other, more complex NN sequence models that had access to the entire account texts (CNN-all). This highlights the importance of task-specific information (food and activity dictionaries in our case), which, in turn, emphasizes the need for collaboration between NLP researchers and domain (i.e., nutritional science and health care) experts.
(4) Even the Ceiling and Realistic Ceiling classifiers have considerably less than perfect performance at 68.1 and 62.7, respectively. Better performance would be likely with a larger dataset, which would likely also improve the performance of the proposed classifiers.
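The fold rotation described at the start of this section (one development fold, one test fold, and the remaining folds for training in each run) can be sketched as follows. The function name `rotate_folds` and the choice of the next fold as the development fold are assumptions; the text only states that one fold is reserved for each purpose.

```python
def rotate_folds(n_folds=10):
    """Yield (train_folds, dev_fold, test_fold) for each cross-validation run."""
    for test in range(n_folds):
        # ASSUMPTION: the development fold is the fold after the test fold.
        dev = (test + 1) % n_folds
        train = [f for f in range(n_folds) if f not in (dev, test)]
        yield train, dev, test

for train, dev, test in rotate_folds():
    print(f"test={test} dev={dev} train={train}")
```

Each account is tested exactly once across the 10 runs, and the confidence cutoff for a run is chosen on a fold that is disjoint from both its training and test data.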

Feature analysis
To understand the influence of individual features (tokens) on the classification of an account by the best-performing neural net (using predicted gender and recency-weighted averaging), we adapted the Layerwise Relevance Propagation (LRP) technique (Bach et al., 2015; Binder et al., 2016; Arras et al., 2016, 2017). LRP has the advantage of maintaining both positive and negative relevances, representing in this case contribution to the at-risk and less-risk class scores, respectively. In contrast, the commonly used Sensitivity Analysis (Dimopoulos et al., 1995; Gevrey et al., 2003; Simonyan et al., 2013; Li et al., 2015) measures relevance to the decision, rather than to a given class's score, and is therefore always non-negative. LRP assigns relevance to each neuron (including input values) as a function of how much it contributes to the final layer's values, as a share of its layer's contribution. To accomplish this, the neuron's activation must be divided by the sum of the whole layer's activation, which can lead to unbounded values when a layer's activations sum to near zero. For this reason, we employ equation 58 of Bach et al. (2015), which applies a small smoothing constant to the layer's summed activation to avoid this value explosion.
Examples of accounts' most recent words marked with their relevances according to the NN-PG-R model are shown in Table 4. As the table shows, the health value of words broadly aligns to relevance scores. However, because of the recency weighting of this model, which makes older tweets' words progressively less relevant, and because of variance in the training of different cross-validation folds, these relevance scores are highly variable. The result is that sometimes a given token is counted as relevant to one classification (e.g., at-risk), and other times to another (e.g., less-risk). This is likely due to both the modest dataset size and the indirectness of the connection between language use and health.
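The stabilized relevance redistribution described above can be sketched for a single dense layer. This is an illustrative epsilon-stabilized rule of the kind the text describes, not the authors' adapted implementation: the function name `lrp_epsilon`, the default `eps`, and the treatment of an exactly zero denominator as positive are assumptions.

```python
import numpy as np

def lrp_epsilon(x, W, b, relevance_out, eps=0.01):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    x:             layer inputs, shape (n_in,)
    W:             weights, shape (n_in, n_out)
    b:             biases, shape (n_out,)
    relevance_out: relevance of each output neuron, shape (n_out,)
    """
    z = x[:, None] * W                  # z_ij: contribution of input i to output j
    z_sum = z.sum(axis=0) + b           # z_j: summed input to output neuron j
    # Stabilizer: move the denominator away from zero while keeping its sign.
    denom = z_sum + eps * np.where(z_sum >= 0, 1.0, -1.0)
    # R_i = sum_j (z_ij / denom_j) * R_j; the sign of each share is preserved.
    return (z / denom) @ relevance_out
```

Applying the function layer by layer, from the sigmoid input back to the word embeddings, yields signed per-token relevances. With zero bias and a vanishing `eps`, the relevance entering a layer equals the relevance leaving it.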

Real-world deployment using a high-precision model
The practical application of this risk detection system would involve pointing high-risk individuals toward the Bang et al. (2009) survey, and, if at-risk, to further medical diabetes screening (Rains et al., 2018). To mitigate the drawbacks of false positives (i.e., unnecessary and stressful medical testing), it is likely that in real-world deployments of this technology a high-precision variant of the learned model would be used.
In Figure 4, we show the classification performance at different thresholds for the classifier confidence. In this experiment, in order to increase stability, we built an ensemble of models through bagging (Breiman, 1996): we generated 50 different versions of the training set by resampling it with replacement, and we trained a different model of the NN-PG-R architecture on each sampled training set. The final predictions are obtained by averaging the outputs of the resulting models. As shown in Fig. 4, a threshold of 0.55, for example, yields a precision of 100% and a recall of 1.47%.
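The bagging and thresholding procedure can be sketched with NumPy as below. The helper names are hypothetical and model training is left out: the sketch only shows how the bootstrap samples are drawn, how per-model outputs are averaged, and how precision and recall are read off at a given confidence threshold.

```python
import numpy as np

def bootstrap_indices(n_examples, n_models=50, seed=0):
    """Draw one resampled-with-replacement index set per ensemble member."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_examples, size=n_examples) for _ in range(n_models)]

def ensemble_scores(per_model_scores):
    """Average the outputs of all models, shape (n_models, n_accounts)."""
    return np.mean(np.asarray(per_model_scores, dtype=float), axis=0)

def precision_recall_at(scores, labels, threshold):
    """Precision and recall of the at-risk class at a confidence threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    predicted = scores >= threshold
    true_pos = np.sum(predicted & labels)
    precision = true_pos / predicted.sum() if predicted.any() else float("nan")
    recall = true_pos / labels.sum() if labels.any() else float("nan")
    return precision, recall
```

Raising the threshold shrinks the set of accounts flagged as at-risk, which is how a high-precision, low-recall operating point such as the one reported above arises.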
Despite the modest recall of such a high-

Related work
Analysis of social media content for health has been a topic of wide interest (Aramaki et al., 2011;Bian et al., 2012;Prier et al., 2011;Culotta, 2014;Nguyen et al., 2017). Similarly, the literature on detecting user attributes and the effects of those attributes on language use is extensive. Rao et al. (2010) predict individuals' demographic characteristics of gender, age, and political affiliation based on their tweets. Burger et al. (2011) construct a multilingual dataset of over 100K Twitter accounts, and classify gender better than human annotators, based on account text. Johannsen et al. (2015) study cross-linguistic variation in syntax (part-of-speech and dependency patterns) according to age and gender in online reviews (chosen over tweets for ease of parsing and richer metadata).
Age and gender, while much studied, are not the only available latent characteristics. Mowery et al. (2016) and Vedula and Parthasarathy (2017)  set to detect social network mental disorders with symptoms such as excessive use of social network sites, measured against gold-data questionnaires. Likewise, Schwartz et al. (2013) predict not only age and gender from the text of Facebook messages, but also the Big Five personality traits (extraversion, emotional stability, agreeableness, conscientiousness, and openness to experience) (Digman, 1990). Moreover, these sometimes-latent user characteristics can inform other classification tasks. For example, Volkova et al. (2013) demonstrate an improvement in the sentiment classification of tweets in a language-independent rule-based model when sentiment vocabulary is adapted for gender-dependent language. Our work continues this direction: here we show that gender information, even when predicted automatically, considerably improves the accuracy of T2DM risk detection.
Much of the previous work on diabetes and weight detection on social media has been at the level of communities. Fried et al. (2014) predict population characteristics such as diabetes and overweight prevalence using location-tagged, food-related tweets. Abbar et al. (2015) analyze correlations between county-level obesity prevalence and food mentions. Again the focus is on predicting dietary choices on a large scale. Relatedly, Eichstaedt et al. (2015) detect heart disease mortality at the county level from tweet text.
There is no known work on detecting individual diabetes risk from social media text. However, Farseev and Chua (2017) capitalize on multiple social media inputs (e.g., a workout tracker) to predict individuals' Body Mass Index category. Wen and Guo (2013) and Kocabey et al. (2017) predict body mass index from images similar to profile pictures, the former from booking photographs and the latter from an internet forum for sharing fitness progress. Of these, only Farseev and Chua (2017) classify solely from text, which is often the only data available from a social media account. Their classification's F1 is low (17.8), which is understandable given the difficulty of this task but limits its use for realistic T2DM risk prediction. In contrast, our approach obtains an F1 score that is over 2 times higher, on a task that is arguably more complex.

Conclusions
We introduced an approach to the detection of individuals' diabetes risk from their Twitter posts. To this end, we collected a novel dataset linking Twitter activity to a validated, survey-based measure of T2DM risk (Bang et al., 2009). Using this dataset, we proposed the first machine learning approach to predict the T2DM risk of a Twitter account holder using only their tweets. This task is challenging because the data tends to be very sparse, and there are many latent contributing variables (such as genetic predisposition). Our analysis indicates that reducing noise with relevant dictionaries, modeling gender, and modeling posts' temporal recency are valuable in predicting T2DM risk. All in all, our best model achieves an F1 of 41.9 (vs. the 36.9 "all at risk" baseline and 39.5 of a strong sequence model).
We estimate that if a high-precision variant of this approach were to be deployed at large, e.g., on the public posts of all American Twitter users, it would identify 16,000 diabetic and 140,000 prediabetic Americans that are currently not diagnosed.
Continuing this work, we envision a larger battery of social media-based tools for public-health intervention that focus on the early identification of multiple health risks such as heart disease and various cancers at scale.

Release
The system is available as open-source software at github.com/clulab/releases/tree/master/louhi2018-t2dmrisk.