Correlating Twitter Language with Community-Level Health Outcomes

We study how language on social media is linked to potentially fatal diseases such as atherosclerotic heart disease (AHD), diabetes and various types of cancer. Our proposed model leverages state-of-the-art sentence embeddings, followed by a regression model and clustering, without the need for additional labelled data. It allows us to predict community-level medical outcomes from language, and thereby potentially to translate these predictions to the individual level. The method is applicable to a wide range of target variables and allows us to discover known and potentially novel correlations of medical outcomes with lifestyle aspects and other socioeconomic risk factors.


Introduction
Surveys and empirical studies have long been a cornerstone of psychological, sociological and medical research, but each of these traditional methods poses challenges for researchers. They are time-consuming and costly, and may introduce bias or suffer from poor experimental design.
With the advent of big data and the increasing popularity of the internet and social media, larger amounts of data are now available to researchers than ever before. This offers strong promise for new avenues of research using analytic procedures, yielding a picture of communities and populations as a whole that is both more fine-grained and broader (Salathé, 2018). Such methods allow for faster and more automated investigation of demographic variables. It has been shown that Twitter data can predict atherosclerotic heart-disease risk at the community level more accurately than traditional demographic data (Eichstaedt et al., 2015). The same method has also been used to capture and accurately predict patterns of excessive alcohol consumption (Curtis et al., 2018).
In this study, we utilize Twitter data to predict various health target variables (AHD, diabetes, various types of cancer) to see how well language patterns on social media reflect the geographic variation of those targets. Furthermore, we propose a new method to study social media content by characterizing disease-related correlations of language, leveraging available demographic and disease information at the community level. In contrast to Eichstaedt et al. (2015), our method does not rely on word-based topic models, but instead leverages modern state-of-the-art text representation methods, in particular sentence embeddings, which have seen increasing use in the Natural Language Processing, Information Retrieval and Text Analytics fields in recent years. We demonstrate that our approach helps capture the semantic meaning of tweets, as opposed to features based merely on word frequencies, which come with robustness problems (Brown and Coyne, 2018; Schwartz et al., 2018). We examine the effectiveness of sentence embeddings in modeling language correlates of the medical target variables (disease outcomes).
Section 2 gives a generalized description of our method. We apply this method to the tweets and health data in Section 3. The system's performance is evaluated in Section 4, followed by the discussion in Section 5. Our code is available at github.com/epfml/correlatingtweets.

Method
We are given a large quantity of text (sentences or tweets) in the form of social media messages by individuals. Each individual, and therefore each sentence, is assigned to a predefined category, for example a geographic region or a population subset. We assume the number of sentences to be significantly larger than the number of communities. Furthermore, we assume that the target variable of interest, for example disease mortality or prevalence rate, is available for each community (but not for each individual). Our system consists of two subsystems:

1. (Prediction) The predictive subsystem makes predictions of target variables (e.g. AHD mortality rate) based on aggregated language features. The resulting linear model, trained using K-fold cross-validated Ridge regression, can be applied at the community level (e.g. counties) or at the individual level.

2. (Interpretability) The averaged regression weights from the prediction subsystem allow for interpretation of the system: we use a fixed clustering (obtained from all sentences without any target information), and then rank each topic cluster with respect to a prediction weight vector from subsystem 1. The top and bottom ranked topic clusters for each target variable give insights into known and potentially novel correlations of topics with the target medical outcome.
In summary, the community association is used as a proxy or weak labelling to correlate individual language with community-level target variables. The following subsections give a more detailed description of the two subsystems.

System Description
Let $\mathcal{S}$ be the set of sentences (e.g. tweets), with their total number denoted as $|\mathcal{S}| = S$. Each sentence is associated with exactly one of the $A$ communities $\mathcal{A} = \{a_1, \dots, a_A\}$ (e.g. geographic regions). The function $\delta : \mathcal{S} \to \mathcal{A}$ defines this mapping. Let $y \in \mathbb{R}^A$ be the target vector for an arbitrary target variable, so that each community $a_j$ has a corresponding target value $y_{a_j} \in \mathbb{R}$.
Preprocessing and Embeddings. The complete linguistic preprocessing pipeline of a sentence is captured by the function $\rho(s_i)$, $\forall i \in \{1, \dots, S\}$, which represents an arbitrary sentence $s_i$ as a sequence of tokens. Each token sequence $\rho(s_i)$ is then mapped to a $D$-dimensional embedding vector $x_i \in \mathbb{R}^{1 \times D}$ (1), which provides a numerical representation of the semantics of the given short text. While our method is generic with respect to the text representation method, here Sent2Vec (Pagliardini et al., 2018) was chosen for its computational efficiency and scalability to large datasets.

Feature Aggregation
We obtain the language features for each community by averaging the sentence embedding vectors over that community. Formally, the complete feature matrix of all sentences is denoted as $X \in \mathbb{R}^{S \times D}$, with rows $x_i$. An individual feature $x_{a_j,d}$ of the averaged embedding $x_{a_j} \in \mathbb{R}^{1 \times D}$ for a given community $a_j$ is defined as

$$x_{a_j,d} = \frac{1}{N_{a_j}} \sum_{i \,:\, \delta(s_i) = a_j} x_{i,d}, \quad (2)$$

where $N_{a_j} = |\{s_i : \delta(s_i) = a_j\}|$ is the number of sentences belonging to community $a_j$. Consequently, the aggregated community-level embedding matrix is given by

$$X_{\mathcal{A}} = [x_{a_1}^\top, \dots, x_{a_A}^\top]^\top \in \mathbb{R}^{A \times D}. \quad (3)$$
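The averaging step can be illustrated with a minimal NumPy sketch. The function name `aggregate_by_community`, the integer encoding of communities as indices 0 to A-1, and the toy values are assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def aggregate_by_community(X, community_ids, num_communities):
    """Average sentence embeddings per community.

    X: (S, D) array of sentence embeddings.
    community_ids: (S,) integer array; entry i is the community index of sentence i.
    Returns the (A, D) matrix of community means and the (A,) sentence counts.
    """
    X = np.asarray(X, dtype=float)
    community_ids = np.asarray(community_ids)
    counts = np.bincount(community_ids, minlength=num_communities)
    if np.any(counts == 0):
        raise ValueError("every community needs at least one sentence")
    sums = np.zeros((num_communities, X.shape[1]))
    np.add.at(sums, community_ids, X)  # unbuffered scatter-add of embedding rows
    return sums / counts[:, None], counts

# Toy example (hypothetical values): three sentences, two communities.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
delta = np.array([0, 0, 1])
X_agg, counts = aggregate_by_community(X, delta, num_communities=2)
```

Here the first community averages its two sentences and the second keeps its single embedding unchanged.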

Train-Test Split
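Leveraging the targets available for each community, our regression method is applied to the aggregated community-level features $X_{\mathcal{A}} \in \mathbb{R}^{A \times D}$ and the target $y$. We employ $K$-fold cross-validation: the previously defined set $\mathcal{A}$ is split into $K$ pairwise disjoint subsets $\mathcal{A}_k$ of as equal size as possible, such that

$$\mathcal{A} = \bigcup_{k=1}^{K} \mathcal{A}_k, \qquad \mathcal{A}_k \cap \mathcal{A}_l = \emptyset \quad \text{for } k \neq l.$$

For the $k$-th split, the test set is $TE_k = \mathcal{A}_k$ and the training set is $TR_k = \mathcal{A} \setminus \mathcal{A}_k$. The index mappings $\theta_k : \{1, \dots, |TR_k|\} \to TR_k$ and $\Lambda_k : \{1, \dots, |TE_k|\} \to TE_k$ uniquely map the indexes to the corresponding communities $a_j$ for the $k$-th train-test split. For each split $k$ the train and test embedding matrices respectively are defined as

$$X_{\theta_k} = [x_{\theta_k(1)}^\top, \dots, x_{\theta_k(|TR_k|)}^\top]^\top \in \mathbb{R}^{|TR_k| \times D}, \quad (4)$$
$$X_{\Lambda_k} = [x_{\Lambda_k(1)}^\top, \dots, x_{\Lambda_k(|TE_k|)}^\top]^\top \in \mathbb{R}^{|TE_k| \times D}. \quad (5)$$

Accordingly, we define the target vectors

$$y_{\theta_k} = [y_{\theta_k(1)}, \dots, y_{\theta_k(|TR_k|)}]^\top \in \mathbb{R}^{|TR_k|}, \quad (6)$$
$$y_{\Lambda_k} = [y_{\Lambda_k(1)}, \dots, y_{\Lambda_k(|TE_k|)}]^\top \in \mathbb{R}^{|TE_k|}. \quad (7)$$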
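As a hedged illustration of splitting at the community level (not the sentence level), the following sketch partitions community indices into K near-equal disjoint folds. The helper names `community_folds` and `split_k`, the fixed seed, and the toy data are assumptions for illustration only.

```python
import numpy as np

def community_folds(num_communities, num_folds, seed=0):
    """Shuffle community indices and cut them into K near-equal disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_communities), num_folds)

def split_k(X_agg, y, folds, k):
    """Return (X_train, y_train, X_test, y_test) with fold k held out as the test set."""
    test_idx = folds[k]
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    return X_agg[train_idx], y[train_idx], X_agg[test_idx], y[test_idx]

# Toy data (placeholder values): A = 10 communities, D = 3 features.
X_agg = np.arange(30, dtype=float).reshape(10, 3)
y = np.arange(10, dtype=float)
folds = community_folds(num_communities=10, num_folds=3)
X_tr, y_tr, X_te, y_te = split_k(X_agg, y, folds, k=0)
```

Because the folds partition the communities, every community appears in exactly one test set across the K splits.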

Ridge Regression
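For each train-test split $k$ we perform linear regression from the community-level textual features $X_{\theta_k}$ to the health target variable $y_{\theta_k}$. We employ Ridge regression (Hoerl and Kennard, 1970).
In our context, Ridge regression is defined as the following optimization problem:

$$\min_{\omega \in \mathbb{R}^D} \; \| y_{\theta_k} - X_{\theta_k} \omega \|_2^2 + \lambda \| \omega \|_2^2,$$

where the optimal solution is

$$\omega_k = (X_{\theta_k}^\top X_{\theta_k} + \lambda I)^{-1} X_{\theta_k}^\top y_{\theta_k}.$$

Within each fold we tune the regularization parameter $\lambda$.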
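The closed-form Ridge solution can be sketched in a few lines of NumPy. This is a minimal illustration of the standard formula without an intercept term; the function name `ridge_fit` and the toy data are assumptions. Note that scikit-learn's `Ridge` estimator fits an intercept by default, so its output would differ from this sketch unless `fit_intercept=False` is set.

```python
import numpy as np

def ridge_fit(X_train, y_train, lam):
    """Solve (X^T X + lam * I) w = X^T y for the Ridge weight vector w."""
    num_features = X_train.shape[1]
    gram = X_train.T @ X_train + lam * np.eye(num_features)
    return np.linalg.solve(gram, X_train.T @ y_train)

# Toy example (hypothetical values): targets generated exactly by w = [2, -1].
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = X_train @ np.array([2.0, -1.0])

w_unregularized = ridge_fit(X_train, y_train, lam=0.0)  # recovers the generating weights
w_regularized = ridge_fit(X_train, y_train, lam=10.0)   # shrunk towards zero
```

Using `np.linalg.solve` rather than forming the matrix inverse explicitly is the usual choice for numerical stability.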

Let $\hat{y}_{\Lambda_k} = [\hat{y}_{\Lambda_k(1)}, \dots, \hat{y}_{\Lambda_k(|TE_k|)}]$ be the predicted values for the test set of split $k$. The concatenated prediction vector for all splits is

$$\hat{y}_{\Lambda} = [\hat{y}_{\Lambda_1}, \dots, \hat{y}_{\Lambda_K}] \in \mathbb{R}^A.$$

Accordingly, we define the concatenated true target vector as

$$y_{\Lambda} = [y_{\Lambda_1}, \dots, y_{\Lambda_K}] \in \mathbb{R}^A,$$

i.e., the set of individual scalars is identical to the entries in the original target vector $y$. The predictive performance of the system can be assessed through the following metrics:
• Pearson Correlation Coefficient
• Mean Absolute Error of prediction (MAE)
• Classification Accuracy for Quantile Prediction
The first two metrics are evaluated with the vectors $\hat{y}_{\Lambda}$ and $y_{\Lambda}$ from all folds. In the quantile-based assessment we independently bin the true values $y_{\Lambda}$ and the predicted values $\hat{y}_{\Lambda}$ into $C$ different quantiles. Each individual true and predicted value is assigned to a quantile $c_j \in \{c_1, \dots, c_C\}$. These assignments can be used to visually compare results on a heat-map or as regular evaluation scores in terms of accuracy.
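The three metrics can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names are hypothetical, and the rank-based equal-frequency binning is one plausible reading of "independently bin into quantiles", not necessarily the procedure used in the released code.

```python
import numpy as np

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def quantile_bins(values, num_bins):
    """Assign each value to one of num_bins equal-frequency bins (0 = lowest)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return ranks * num_bins // len(values)

def quantile_accuracy(y_true, y_pred, num_bins):
    """Fraction of items whose true and predicted values fall in the same quantile."""
    same = quantile_bins(y_true, num_bins) == quantile_bins(y_pred, num_bins)
    return float(np.mean(same))

# Toy example (hypothetical values).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.0, 3.5, 5.0])
rho = pearson(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
acc_halves = quantile_accuracy(y_true, y_pred, num_bins=2)
acc_quarters = quantile_accuracy(y_true, y_pred, num_bins=4)
```

In this toy case the two lowest predictions are swapped relative to the truth, which leaves the two-bin assignment intact but halves the accuracy once four bins are used.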

Ridge-Weight Aggregation
For the final prediction model, the regression weights $\omega_k$ from Ridge regression are averaged over the $K$ folds, i.e. $\omega = \frac{1}{K} \sum_{k=1}^{K} \omega_k$. For every sentence embedding $x_q$, the prediction is computed as $\hat{y}_q = x_q \omega \in \mathbb{R}$.
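A small sketch of the weight averaging, assuming the per-fold weight vectors are already available (the values below are placeholders). Because the model is linear and has no intercept in this sketch, predicting with the averaged weights equals averaging the per-fold predictions.

```python
import numpy as np

def average_weights(fold_weights):
    """Average K per-fold weight vectors into a single weight vector."""
    return np.mean(np.stack(fold_weights), axis=0)

# Hypothetical per-fold Ridge weights for K = 2 folds and D = 2 features.
fold_weights = [np.array([1.0, 0.0]), np.array([3.0, 2.0])]
omega = average_weights(fold_weights)

# Prediction for a single sentence embedding x_q.
x_q = np.array([0.5, 4.0])
y_q = float(x_q @ omega)
mean_of_fold_predictions = float(np.mean([x_q @ w for w in fold_weights]))
```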

Interpretation Subsystem: Cluster Ranking
We employ predefined textual topic clusters, which are independent of any target values, in order to enable interpretation of the textual correlates. Each cluster is a collection of sentences and should, intuitively, be interpretable as a topic, e.g. separate topics about indoor and outdoor activities as shown in Fig. 4. For each cluster $m$ a ranking score can be computed with respect to a linear prediction model $\omega$ as defined above.
Let $Q_m = \{q : \zeta(q) = m \wedge q \in \mathcal{Q}\}$ be the set of sentences assigned to cluster $m$. The score $\iota_m$ for the cluster $m$ is the average of all predictions $\hat{y}_q = x_q \omega$ within the cluster $m$:

$$\iota_m = \frac{1}{|Q_m|} \sum_{q \in Q_m} x_q \omega.$$

By ordering the scores $\iota_m$ of all clusters, we obtain the final ranking sequence of all clusters, with respect to the target-specific model $\omega$.
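The cluster ranking can be sketched as follows, assuming cluster assignments are given as integer indices 0 to M-1. The function name `rank_clusters`, the handling of empty clusters via NaN, and the toy values are illustrative assumptions.

```python
import numpy as np

def rank_clusters(X_q, assignments, omega, num_clusters):
    """Score each cluster by its mean prediction and rank clusters, highest first.

    X_q: (Q, D) subsampled embeddings; assignments: (Q,) cluster index per row.
    Returns (scores, order); empty clusters get a NaN score and are ranked last.
    """
    preds = X_q @ omega
    sums = np.bincount(assignments, weights=preds, minlength=num_clusters)
    counts = np.bincount(assignments, minlength=num_clusters)
    scores = np.full(num_clusters, np.nan)
    nonempty = counts > 0
    scores[nonempty] = sums[nonempty] / counts[nonempty]
    order = np.argsort(-scores)  # NaN entries are placed at the end
    return scores, order

# Toy example (hypothetical values): four sentences, two clusters.
X_q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
assignments = np.array([0, 1, 1, 0])
omega = np.array([1.0, -1.0])
scores, order = rank_clusters(X_q, assignments, omega, num_clusters=2)
```

The head of `order` corresponds to the most positively associated topics and the tail to the most negatively associated ones.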
Clustering Preprocessing. To obtain the fixed clustering, subsampling might be required to reduce computational complexity, as $X$ is a very large matrix. Hence, $Q$ out of the $S$ embeddings in $\mathcal{S}$ are randomly subsampled into the set $\mathcal{Q}$. The mapping from $\{1, \dots, Q\}$ to the row indexes of $X$ is a uniformly random selection. We define the subsampled data matrix as $X_Q \in \mathbb{R}^{Q \times D}$. The subset $X_Q$ is clustered with the Yinyang K-Means algorithm (Ding et al., 2015). We use $M$ centroids and cosine similarity to compare embeddings. The cluster assignment vector $\mathcal{M} \in \{1, \dots, M\}^Q$ assigns one cluster to each embedding in $X_Q$. Accordingly, the operator $\zeta : \{1, \dots, Q\} \to \{1, \dots, M\}$ indicates the assigned cluster $m$ for a given sentence $s$ in $\mathcal{Q}$ (see cluster ranking above). The cluster centers are collected in $M_Q \in \mathbb{R}^{M \times D}$.
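For illustration only, the sketch below subsamples rows and clusters them with a plain Lloyd-style spherical K-means (unit-normalized vectors, cosine similarity). This is a deliberately simple stand-in and not the Yinyang K-Means algorithm or the GPU library used in the paper; the function names, the iteration count, and the toy data are assumptions.

```python
import numpy as np

def subsample_rows(X, num_samples, seed=0):
    """Draw num_samples distinct row indexes uniformly at random."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=num_samples, replace=False)
    return idx, X[idx]

def spherical_kmeans(X, num_clusters, num_iters=20, seed=0):
    """Cluster rows of X by cosine similarity; assumes no all-zero rows."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = Xn[rng.choice(len(Xn), size=num_clusters, replace=False)]
    for _ in range(num_iters):
        assign = np.argmax(Xn @ centers.T, axis=1)  # most similar center
        for m in range(num_clusters):
            members = Xn[assign == m]
            if len(members):  # keep the old center if a cluster is empty
                total = members.sum(axis=0)
                centers[m] = total / np.linalg.norm(total)
    assign = np.argmax(Xn @ centers.T, axis=1)
    return assign, centers

# Toy example (hypothetical values): two groups pointing in different directions.
X = np.array([[1.0, 0.1], [2.0, 0.1], [3.0, 0.2],
              [0.1, 1.0], [0.1, 2.0], [0.2, 3.0]])
idx, X_Q = subsample_rows(X, num_samples=4)
assign, centers = spherical_kmeans(X, num_clusters=2)
```

Since cosine similarity ignores vector length, the three points along each axis end up in the same cluster regardless of their magnitude.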

Data sources
We apply the method described in Section 2 to the following setting: The pool of sentences $\mathcal{S}$ consists of geotagged Tweets. The assigned locations are in the United States. The geotags are categorized into US counties, which represent the set of communities $\mathcal{A}$. The target variables $y$ are health-related variables, for example normalized mortality or prevalence rates. We focus on cancer and AHD mortality as well as on diabetes prevalence. Hence, the quantile-based predictions give a categorization of the Ridge regression predictions on a US-county level. The ranked topics assess what language might relate to higher or lower rates of the corresponding disease. Table 1 provides an overview of the size of the data sources, the year the data was collected in and the mean µ and standard deviation σ of the target variables. Not all counties are covered in the publicly available datasets, which are usually limited to more populous counties. The collected Tweets are from 2014 and 2015. The target variables are the union-averaged values from 2014 and 2015: if the target variable is available for both years, the two values are averaged. Conversely, if a county data point is only available for one, but not both years, we use this standalone value.
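The union-averaging rule for the two target years can be sketched as below. The dictionary layout, the function name `union_average`, and the county keys and values are hypothetical placeholders, not actual data.

```python
def union_average(values_a, values_b):
    """Combine two {county: value} dicts: average where both exist, else keep the one."""
    merged = {}
    for county in set(values_a) | set(values_b):
        available = [d[county] for d in (values_a, values_b) if county in d]
        merged[county] = sum(available) / len(available)
    return merged

# Hypothetical per-county targets for the two years.
targets_2014 = {"county_A": 100.0, "county_B": 200.0}
targets_2015 = {"county_B": 220.0, "county_C": 50.0}
targets = union_average(targets_2014, targets_2015)
```

Counties present in only one year keep their single value, so the combined target covers the union of both years.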

Datorium Tweets
Tweets are short messages of no more than 140 characters published by users of the Twitter platform. They reflect discussions, thoughts and activities of its users. We use a dataset of approximately 144 million tweets collected from 1 June 2014 to 1 June 2015 (Datorium, 2017). Each tweet was geotagged by the submitting user with exact GPS coordinates and all tweets are from within the US, allowing accurate county-level mapping of individual tweets.

AHD & Cancer Mortality
Our source of the statistical county-level target variables is the CDC WONDER database (CDC, 2018) for AHD and cancer. Values are given as deaths per 100,000 population.

Diabetes Prevalence
We use county-wise age-adjusted diabetes prevalence data from the year 2013 (CDC, 2016), provided as percent of the population afflicted with type II diabetes.The data is available for almost all the 3144 US counties, making it a valuable target to use.

Results
The results of our method for the various target variables are listed in Table 2 along with the performance of the baseline model outlined in Section 4.1.We provide the Pearson correlation (ρ) and the mean absolute error (MAE) of our system along with the baseline model's Pearson correlation.

LDA Baseline Model
We reimplemented the approach proposed by Eichstaedt et al. (2015) as a baseline for comparison, and were able to reproduce their findings about AHD with recent data: similar results were found with the Datorium Twitter dataset (Datorium, 2017) and CDC AHD data from 2014 and 2015. Their approach averages topics generated with Latent Dirichlet Allocation (LDA) of tweets per county as features for Ridge regression. We do not use any hand-curated emotion-specific dictionaries, as these did not impact performance in our experiments. We used the predefined Facebook LDA coefficients of Eichstaedt et al. (2015) and updated them with the word frequencies of our collected Twitter data (Datorium, 2017). Our results are computed with 10-fold cross-validation and without any feature selection.

Table 2: Results of predictions on different health targets. $\rho$: our system (Section 2.5), $\rho_{LDA}$: topic model baseline (Eichstaedt et al. (2015), Section 4.1), MAE: mean absolute error of our system (Section 2.5).

Detailed Results
In this section we discuss a selection of our results in detail, with additional information available in Appendix A.1. Diabetes has a strong demographic bias, with a higher prevalence in the south-east of the US, the so-called diabetes belt. Compared to the national average, the African-American population in the diabetes belt has a higher risk of diabetes by a factor of more than 2 (Barker et al., 2011), and the south-east of the US has a large African-American population. Therefore, linguistic features common in African-American English (Green, 2002) are a strong predictor of diabetes rates. The model learns these linguistic features, as seen in Figure 3, and its predictions closely match the actual geographic distribution, as seen in Figure 2. Moderate alcohol consumption is linked to a lower risk of type II diabetes compared to no or excessive consumption (Koppes et al., 2005). The strongest negatively correlated word clouds in Figure 3 support this finding.
The most positively related word clouds for melanoma in Figure 4 are related to outdoor activities (Elwood et al., 1985). Conversely, the strongest negatively correlated word clouds suggest language related to indoor activities.

Discussion
In this paper, we introduced a novel approach for language-based prediction and correlation of community-level health variables. For various health-related demographic variables, our approach, using only geolocated tweets, outperforms in most cases (Table 2) similar models based on traditional demographic data. Our approach provides a method for discovering novel correlations between open-vocabulary topics and health variables, allowing researchers to discover as yet unknown contributing factors based on large collections of data with minimal effort.
Our findings from applying our method to AHD risk, diabetes prevalence and the risk of various types of cancer, using geolocated tweets from the US only, show that a large variety of health-related variables can be predicted with surprisingly high precision based solely on social media data. Furthermore, we show that our model identifies known and novel risk or protective factors in the form of topics. Both aspects are of interest to researchers and policy makers. Our model proved to be robust for the majority of targets it was applied to.
For AHD risk, we show that our approach significantly outperforms previous approaches based on topic models such as LDA or on traditional statistical models (Eichstaedt et al., 2015), achieving a Pearson correlation of ρ = 0.46, an increase of 0.09 over previous approaches. For diabetes prevalence, our model correctly predicts the geographic distribution by identifying, among other features, linguistic features common in high-prevalence areas, with ρ = 0.73. For melanoma risk, it finds a high correlation with the popularity of outdoor activities, consistent with exposure to sunlight being one of the main risk factors for skin cancer, with an overall ρ = 0.72.
One of the main limitations of our approach is the need for a large collection of sentences for each community as well as a large number of communities with target variables. Results are potentially unreliable when this is not the case, such as for social media posts by individuals or when modeling target values which are only available in, e.g., a few counties. Further research is needed to ascertain whether significant results can also be achieved in such scenarios, and whether the robustness of our approach is improved compared to bag-of-words-based baselines (Eichstaedt et al., 2015; Brown and Coyne, 2018; Schwartz et al., 2018). Furthermore, all mentioned approaches rely on correlation, and thus do not provide a way to determine causation or to rule out potential underlying factors not captured by the model. Even though using social media data introduces a non-negligible bias towards users of social media, our approach was able to predict target variables tied to very different age groups, which is encouraging and supports the robustness of our approach.
Our method captures language features on a community scale. This raises the question of how these findings can be translated to the individual person. Theoretically, a community-based model as described above could be used to rank social media posts or messages of an individual user with respect to specific health risks. However, as we currently do not have ground truth values at the individual level, and since a user's social media history has very high variance, this is left for future investigation.
Future research should also address the applicability of our model to textual data other than Twitter and potentially from non-social media sources, to communities that are not geography based, to the time evolution of topics and health/lifestyle statistics, as well as to targets that are not health related. The general methodology offers promise for new avenues for data-driven discovery in fields such as medicine, sociology and psychology.

A.2 Implementation Details
Tweets were collected according to the provided Datorium IDs using the Tweepy library. The tweets were then imported into Google BigQuery and processed using Apache Beam. The sentence embeddings were computed using the official Sent2Vec source code and the provided 700-dimensional pre-trained model for tweets (using bigrams). Clustering was performed with libKMCUDA. Scikit-learn was used for 10-fold cross-validation, Ridge regression, calculating the correlation and hyperparameter search.

Figure 5: Word clouds of topics correlating with colorectal cancer: (a), (b) strongest positively correlated topics; (c), (d) strongest negatively correlated topics among M = 2000 clusters.

Figure 6: Confusion matrix for decile-based prediction of diabetes prevalence.

Table 1: Overview of data sources.