Changes in Psycholinguistic Attributes of Social Media Users Before, During, and After Self-Reported Influenza Symptoms

Previous research has linked psychological and social variables to physical health. At the same time, psychological and social variables have been successfully predicted from the language used by individuals in social media. In this paper, we conduct an initial exploratory study linking these two areas. Using the social media platform of Twitter, we identify users self-reporting symptoms that are descriptive of influenza-like illness (ILI). We analyze the tweets of those users in the periods before, during, and after the reported symptoms, exploring emotional, cognitive, and structural components of language. We observe a post-ILI increase in social activity and cognitive processes, possibly supporting previous offline findings linking more active social activities and stronger cognitive coping skills to a better immune status.


Introduction
Stylistic variation in the spoken and written communication of different users can provide rich information about them, such as their sociodemographic background (Rao et al., 2010; Argamon et al., 2009; Lampos et al., 2014; Flekova et al., 2016), personality (Schwartz et al., 2013), mental health (De Choudhury et al., 2013), mood, beliefs, fears or cognitive patterns (Snowdon et al., 1996). At the same time, researchers have observed relations between factors such as mental health, mental states, personality, happiness, and physical health, including a direct relation between individual stress level and resistance to infectious diseases (Cohen and Williamson, 1991; Martin et al., 1995; Friedman, 2000; Smith and Gallo, 2001; Kiecolt-Glaser et al., 1998; Uchino, 2006). In this paper, we conduct an initial exploratory study linking these two research areas. Using the social media platform of Twitter, we identify users self-reporting symptoms that are descriptive of influenza-like illness (ILI). We analyze the tweets of those users in the periods before, during, and after the reported ILI symptoms, and extract linguistic variables linked to affective, cognitive, perceptual and social processes, as well as personal concerns. We observe a post-ILI increase in social activity and cognitive processes, possibly supporting previous findings that individuals who spend less time in social activities or are less capable of coping with stress tend to have a poorer immune status (Friedman, 2000; Pressman et al., 2005; Jaremka et al., 2013; Pennebaker et al., 1997).

* Project carried out during the research fellowship at the University College London, prior to joining Amazon.
† Also with Department of Computer Science, University of Copenhagen, Denmark.

Related work
Socially stable individuals are at significantly lower risk for disease (Cohen and Williamson, 1991; Martin et al., 1995; Kiecolt-Glaser et al., 1998; Friedman, 2000). Associations were found between personality and likelihood of physical limitations. Chronic negative emotions are associated with suppressed immune functioning, and optimism with lower ambulatory blood pressure and better immune functioning (Smith and Gallo, 2001). Smolderen et al. (2007) examined whether stress, negative mood, negative affectivity and social inhibition were related to increased vulnerability to influenza among participants. They concluded that negative affectivity and perceived stress were associated with higher self-reporting of ILI.
There is considerable evidence that social isolation is associated with poorer health. Those with more types of relationships and those who spend more time in social activities are at lower risk for disease and mortality than their more isolated counterparts (Friedman, 2000). Subjectively perceived loneliness and small social networks have also been associated with poorer immune status, greater psychological stress and poorer sleep quality (Pressman et al., 2005; Jaremka et al., 2013). Loneliness was also associated with greater psychological stress and negative affect, less positive affect, poorer sleep efficiency and quality, and elevations in circulating levels of cortisol (Pressman et al., 2005).
Some of these psychological and social variables have previously been identified successfully through automated stylistic analysis of written text. For example, a series of natural language processing (NLP) workshops has focused on predicting depression on Twitter (Coppersmith et al., 2015a,b), finding that the frequencies of function words, auxiliary verbs, conjunctions, words indicating cognitive mechanisms, hedging expressions and exclusion words form a strongly predictive feature combination for separating depressed and healthy users. Earlier work on this topic found that authors with depressive tendencies are more self-focused, use the pronoun "I" more frequently (Rude et al., 2004), and discuss topics around feelings and sadness in social media (Schwartz et al., 2014).

Dataset collection
We randomly sampled 14 million UK tweets, collected in the years 2014-2016, and searched for a small set of word patterns potentially indicative of having the flu, based on previous work (Lampos and Cristianini, 2010, 2012). The patterns were any combination of {I have, I feel, I've got} and {flu, sore throat, high fever, stupid fever, hate fever, ill}, excluding {http, rt, jab, shot, you, he, she}. We obtained 2,600 tweets matching the pattern, which we then manually examined, finally obtaining 1,235 tweets referring to the users themselves being sick with a flu, cold, sore throat, or fever. The false positive tweets often discussed news about flu, flu vaccination, or social media trends such as (Justin) Bieber fever or cabin fever.
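The keyword filter described above could be sketched roughly as follows. The three word lists are taken from the text; everything else (lowercasing, whole-word matching, treating any token starting with "http" as a link, and the function name `is_candidate`) is an assumption made for illustration, not the matching procedure actually used in the study.

```python
import re

# Word lists as given in the text.
SELF_PHRASES = ["i have", "i feel", "i've got"]
SYMPTOMS = ["flu", "sore throat", "high fever", "stupid fever", "hate fever", "ill"]
EXCLUDED_WORDS = {"rt", "jab", "shot", "you", "he", "she"}


def _phrase_regex(phrases):
    """Match any phrase as whole words (assumption: no substring matches)."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


SELF_RE = _phrase_regex(SELF_PHRASES)
SYMPTOM_RE = _phrase_regex(SYMPTOMS)


def is_candidate(tweet: str) -> bool:
    """Return True if the tweet matches the illustrative ILI keyword pattern."""
    text = tweet.lower().replace("\u2019", "'")  # normalise curly apostrophes
    tokens = re.findall(r"[a-z']+", text)
    # Drop tweets with links, retweets, vaccination terms or other-directed pronouns.
    if any(tok in EXCLUDED_WORDS or tok.startswith("http") for tok in tokens):
        return False
    return bool(SELF_RE.search(text)) and bool(SYMPTOM_RE.search(text))
```

A filter of this kind only produces candidates; as the text notes, the matches were still examined manually to remove false positives.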
The 1,235 tweets come from 285 users. These users have been rather verbose on Twitter, producing 7.2 million tweets, responses and retweets over the three years. We decided to monitor the period from 7 days before the user first mentions being sick to 14 days after this mention, as our first assumption was that the ILI symptoms last about a week from the first tweet. However, after manual empirical exploration of user tweets over time, we reassessed this hypothesis. For the rest of this study we assume that the peak ILI period (i.e., the time period when the flu has the most extreme symptoms) occurs slightly sooner, between one day before and two days after the time at which a user self-reports having the disease (TSR, time of self-report). We obtained 144,837 tweets, and after filtering out retweets this averaged to 231 tweets per user over these three weeks.
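As an illustration of the monitoring window, the sketch below assigns a tweet to a period relative to the TSR. The boundaries follow the text (7 days before to 14 days after the TSR, with the peak from one day before to two days after); computing offsets in whole calendar days and the function name `period_label` are assumptions.

```python
from datetime import datetime
from typing import Optional


def period_label(tweet_time: datetime, tsr: datetime) -> Optional[str]:
    """Label a tweet as 'before', 'peak' or 'after' relative to the TSR.

    Returns None for tweets outside the monitored window of
    7 days before to 14 days after the time of self-report.
    """
    # Assumption: offsets are counted in calendar days, not 24-hour spans.
    day_offset = (tweet_time.date() - tsr.date()).days
    if day_offset < -7 or day_offset > 14:
        return None
    if day_offset <= -2:
        return "before"  # two or more days before the TSR
    if day_offset <= 2:
        return "peak"    # one day before to two days after the TSR
    return "after"       # three or more days after the TSR
```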

Statistical analysis method
We extract textual features using the Linguistic Inquiry and Word Count (LIWC) tool (Pennebaker et al., 2001), which consists of dozens of lexicons related to psychological processes (e.g., Affective, Cognitive, Biological), personal concerns (e.g., Work, Leisure, Money) and other categories such as Fillers, Disfluencies or Swear words. For each word category, we compute the relative occurrence of the words of that category as a proportion of all words for a given user in a given time period.
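The per-category proportion can be illustrated with a minimal sketch. LIWC itself is a licensed resource, so the two tiny word lists in `TOY_LEXICON` are hypothetical stand-ins and not actual LIWC entries; the simple tokeniser and the function name `category_proportions` are also assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in lexicon; real LIWC categories are much larger.
TOY_LEXICON = {
    "family": {"mum", "dad", "sister", "brother", "family"},
    "friends": {"friend", "friends", "mate", "buddy"},
}


def category_proportions(tweets, lexicon=TOY_LEXICON):
    """Share of all words (one user, one period) that fall in each category."""
    tokens = [tok for tweet in tweets for tok in re.findall(r"[a-z']+", tweet.lower())]
    total = len(tokens)
    if total == 0:
        return {category: 0.0 for category in lexicon}
    counts = Counter(tokens)
    return {
        category: sum(counts[word] for word in words) / total
        for category, words in lexicon.items()
    }
```

For example, for the two tweets "My mum made soup" and "Out with a friend and my dad" (11 words in total), the sketch gives 2/11 for the toy family category and 1/11 for the toy friends category.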
For each set of days $d$, we calculated the mean $\langle o \rangle_d$ of the occurrences $o$ of a single feature as $\langle o \rangle_d = \sum_{i=1}^{N} o_i / N$, with $N$ being the number of users tweeting in that relative period $d$ (e.g. "7 days before TSR" to "5 days before TSR") and $o_i$ being the feature value for one user in that period (e.g. the relative proportion of words from the category Family to all words tweeted by that user in that period). An example is shown in Figure 2. For each data point $\langle o \rangle_d$, the period $d$ is illustrated with a horizontal bar and the standard error of the mean, $SE_{\langle o \rangle_d} = s / \sqrt{N}$, where $s$ is the standard deviation of the $o_i$, with a vertical bar. We then calculate the significance with which the mean of the feature two or more days before the ILI symptoms TSR differs from the mean of the feature in the assumed ILI symptom peak interval (one day before to two days after TSR), as well as the significance with which the mean of the feature three or more days after the ILI symptoms TSR differs from it. The significance is calculated as $\sigma_{\mathrm{before}} = \left( \langle o \rangle_{\mathrm{before}} - \langle o \rangle_{\mathrm{during}} \right) / \sqrt{SE_{\mathrm{before}}^2 + SE_{\mathrm{during}}^2}$, with $\sigma_{\mathrm{after}}$ defined analogously.

Figure 1: Summary of significance of differences in feature value distributions before and after the self-reported sickness.
Generally, we are looking for σ values larger than 1.96, corresponding approximately to a two-tailed significance test at p = 0.05 for the feature values during and before/after the ILI peak. If the feature distribution in both periods were in reality the same, we would observe differences this large or larger in fewer than 5% of cases. Both the σ values and the corresponding p-values are listed in the individual feature plots.
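The following sketch shows one way to compute such a significance value for two sets of per-user feature values. It assumes the test is the difference of the two means divided by the combined standard error, that the sample standard deviation (n - 1 denominator) is used, and that the p-value comes from a normal approximation; the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_and_se(values):
    """Mean and standard error of the mean (needs at least two values)."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def two_tailed_p(sigma):
    """Two-tailed p-value for a z-score under the normal approximation."""
    return math.erfc(abs(sigma) / math.sqrt(2))


def significance(period_values, peak_values):
    """Absolute difference of means in units of the combined standard error."""
    m_period, se_period = mean_and_se(period_values)
    m_peak, se_peak = mean_and_se(peak_values)
    # Assumption: both periods have non-zero variance, so the denominator is > 0.
    sigma = abs(m_period - m_peak) / math.sqrt(se_period ** 2 + se_peak ** 2)
    return sigma, two_tailed_p(sigma)
```

Under these assumptions, a value of 1.96 maps to a p-value of approximately 0.05, matching the threshold quoted above.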

Results
Figure 1 shows all differences in feature value distributions before and after the ILI TSR. While the values before typically tend to resemble the values during sickness more closely, the values from day +3 onwards show several significant (> 2σ, p < 0.05) differences. The relative frequency of words from the Friends and Family LIWC categories increases after the ILI peak (Fig. 2b and 2a). This indicates that users probably become more socially involved after recovery. However, the relatively low values in the period before the ILI may support the hypothesis that individuals spending less time in social activities have a poorer immune status (Friedman, 2000; Pressman et al., 2005; Jaremka et al., 2013). We also observe a post-ILI increase in the usage of causal words, i.e. terms used to explain cause and effect (e.g. reason, why, because). Increased use of words in this category has previously been found to be related to improved physical health due to stronger coping skills (Pennebaker et al., 1997). There is a 2σ decrease in impersonal pronouns after the ILI peak (Fig. 2e). Higher levels of impersonal pronoun usage have previously been associated with increased anxiety levels (Coppersmith et al., 2015a), hence this change could indicate a post-illness drop in anxiety.
Additional effects observed are: (a) a 2σ post-ILI decrease in assent words such as agree, OK, yes (Fig. 2c), which surprisingly signals decreased levels of agreement with the social group and is linked to lower-quality social relationships (Tausczik and Pennebaker, 2010), and (b) a 3σ increase in second person pronouns (Fig. 2f) during the illness, suggesting focus on others.
We found no significant difference in emotion levels or the levels of usage of the pronoun "I".

Conclusions and future work
We conduct an initial exploratory study of psycholinguistic attributes of Twitter users before, during, and after self-reported influenza symptoms. We observe a post-ILI change in expressions that correlate with elevated levels of social activity and cognitive processes. Interestingly, instead of an expected increase in the use of the first-person pronoun "I" during the ILI peak (focus on self), we observe a significant increase in second person pronouns ("you"). We plan to extend this study by including an additional 11,000 ILI events.

Figure 2: … day intervals relative to their ILI symptoms. For each local mean value (blue point), the period of the mean is illustrated with the horizontal bar and the standard error of the mean as a vertical bar. The horizontal blue stripe visually aids comparison to the ILI peak standard error interval, and the vertical grey stripe to the ILI peak period. In addition, the average feature value during the ILI peak is illustrated by a dashed line, compared to the overall average of the feature (yellow line).