Temporal Mental Health Dynamics on Social Media

We describe a set of experiments for building a temporal mental health dynamics system. We utilise a pre-existing methodology for distant supervision of mental health data mining from social media platforms and deploy the system during the global COVID-19 pandemic as a case study. Despite the challenging nature of the task, we produce encouraging results, both explicitly related to the global pandemic and implicitly related to a separate global phenomenon, Christmas depression, that is supported by the literature. We propose a methodology for providing insight into temporal mental health dynamics to be utilised for strategic decision-making.


I. INTRODUCTION
Mental health issues pose a significant threat to the general population. Quantifiable data sources pertaining to mental health are scarce in comparison to physical health data (Coppersmith et al. 2014). This scarcity complicates the development of reliable diagnoses and effective treatments for mental health issues of the kind that are the norm in physical health (Righetti-Veltema et al. 1998). The scarcity is partially due to the complexity of, and variation in, the underlying causes of mental illness. Furthermore, the traditional method for gathering population-level mental health data, behavioral surveys, is costly and often delayed (De Choudhury, Counts & Horvitz 2013b).
Whilst widespread adoption of and engagement with social media platforms has provided researchers with a plentiful data source for a variety of tasks, including mental health diagnosis, it has not yet yielded a concrete solution to mental health diagnosis (Ayers et al. 2014). Conducting mental health diagnosis tasks on social media data presents its own set of challenges: users may choose to convey a particular public persona and post content that is not genuine; and the data samples a sub-population that is either technologically savvy, which may introduce a generational bias, or able to afford the financial cost of the technology, which may introduce a demographic bias. However, the richness and diversity of the available data's content make it an attractive data source. Quantifiable data from social media platforms is by nature social and, crucially in the context of our case study, virtual.
Quantifiable social media data enables researchers to develop methodologies for distant mental health diagnosis and analyse different mental illnesses (De Choudhury, Counts & Horvitz 2013a). Distant detection and analysis enables researchers to monitor relationships of temporal mental health dynamics to adverse conditions such as war, economic crisis or a pandemic such as the Coronavirus (COVID-19) pandemic.
COVID-19, a novel virus, proved to be fatal in many cases during the global pandemic that started in 2019. Governments reacted to the pandemic by placing measures restricting the movement of people on and within their borders in an attempt to slow the spread of the virus. The restrictions came in the form of many consecutive temporary policies that varied across countries in their execution. We focus on arguably the most disruptive measure: the National Lockdown. This required individuals, other than essential workers (e.g. healthcare professionals), to remain in their own homes. The lockdown enforcement varied across countries but the premise was that individuals were only permitted to leave their homes briefly for essential shopping (food and medicine). This policy had far-reaching social and economic impacts: growing concern among individuals for their own and their families' health, economic well-being and financial uncertainty as certain industries (such as hospitality, retail and travel) suspended operations. As a result, many individuals were made redundant or unemployed, which constrained their financial resources, while confinement to their homes resulted in excess leisure time. These experiences, along with the uncertainty of the measures' duration, reflected a unique period in which the general public would be experiencing similar stress and anxiety, both feelings associated with clinical depression (Hecht et al. 1989, Rickels & Schweizer 1993). In this paper, we investigate the task of detecting whether a user is diagnosis-worthy over a given period of time and explore what an appropriate time period might be. We investigate the role of class balance in datasets by experimenting with a variety of training regimes. Finally, we examine the temporal mental health dynamics in relation to the respective national lockdowns and investigate how these temporal mental health dynamics varied across countries highly disrupted by the pandemic.
Our main contributions in this paper are: 1) We demonstrate an improvement in mental health detection performance with increasingly enriched sample representations. 2) We highlight the importance of the balance of classes in the training dataset whilst remaining aware of an approximated expected balance of classes in the unsupervised (test) dataset. 3) We analyse empirically supported relationships between populations' temporal mental health dynamics and their respective national lockdowns that can be used for strategic decision-making purposes.

II. RELATED RESEARCH

A. Natural Language Processing for Mental Health Detection
Unlike physical health conditions, which often show physical symptoms, mental health conditions are often reflected by more subtle symptoms (Chung & Pennebaker 2007, De Choudhury, Counts & Horvitz 2013a). This yielded a body of work that focused on linguistic analysis of lexical and semantic uses in speech, such as diagnosing a patient with depression and paranoia (Oxman et al. 1982). Furthermore, an examination of college students' essays found an increased use of negative emotional lexical content in the group of students that had high scores on depression scales (Rude et al. 2004). Such findings confirmed that language can be an indicator of an individual's psychological state (Bucci & Freedman 1981), which led to the development of the Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al. 2003, Tausczik & Pennebaker 2010), which allows users to evaluate texts based on word counts in a variety of categories. More recently, larger-scale computational linguistic methods have been applied to conversational counselling by utilising data from an SMS service where vulnerable users can engage in therapeutic discussion with counsellors (Althoff et al. 2016). For a more in-depth review of natural language processing (NLP) techniques applied to mental health, the reader is referred to Trotzek et al. (2018).

B. Social Media as a Platform for Mental Health Monitoring
The widespread engagement with social media platforms by users, coupled with the availability of platform data, enables researchers to extract population-level health information that makes it possible to track diseases, medications and symptoms (Paul & Dredze 2011). The use of social media data is attractive to researchers not only due to its vast domain coverage but also due to the cheap methodologies by which data can be collected in comparison to previously available methodologies (Coppersmith et al. 2014). A large body of mental health monitoring literature has utilised these cheap and efficient data mining methodologies on a variety of social media platforms such as Reddit (Losada & Crestani 2016), Facebook (Guntuku et al. 2017) and Twitter (De Choudhury, Gamon, Counts, & Horvitz 2013).
Twitter users' engagement with the popular social media platform gives rise to social patterns that can be analysed by researchers, making this platform a widely used data source for data mining. Additionally, the customisable query parameters available in the Application Programming Interface (API) allow researchers to monitor specific populations and/or domains (De Choudhury, Counts & Horvitz 2013b).

C. Mental Health Monitoring During COVID-19 Pandemic
In the context of the COVID-19 pandemic, we found a handful of projects with intentions similar to our own, namely to monitor depression during the pandemic. Li et al. (2020) gather large-scale, pandemic-related Twitter data and infer depression based on emotional characteristics and sentiment analysis of tweets. Zhou et al. (2020) focus on detecting community-level depression in Australia during the pandemic. They use the distant-supervision methodology of Shen et al. (2017) to gather a balanced dataset, then utilise the methodology of Coppersmith et al. (2014) to model rates of depression, observing the relationship with the number of COVID-19 infections in the community. Our work differs from this in three main areas: 1) We investigate the implication of different sample representations to provide more context to our classifier. 2) We retain an imbalance in our development dataset. 3) We investigate European countries (France, Germany, Italy, Spain and the United Kingdom) that experienced a relatively high number of COVID-19 infections.

III. DIAGNOSIS CLASSIFIER EXPERIMENTS
In this section we describe the data mining methodology used to build a distantly supervised dataset and the classifier experiments conducted on this dataset.

A. Data
To conduct the proposed experiments, we firstly construct a distantly supervised development dataset for each country, to be used in training and validation of the classifier. The data mining methods follow the novel distant-supervision methodology proposed in Coppersmith et al. (2014) as it is relatively cheap but also well-structured for clinical experiments.
We follow the widely-accepted methodology proposed by Watson (1768) where diagnosed (Diagnosed) and non-diagnosed (Control) groups are created. In this paper we will only be exploring depression as a mental health condition; accordingly, we will have a single Diagnosed group for each country's development dataset. However, if multiple mental health issues were to be explored, then the same number of different Diagnosed groups would be required for each country's dataset.
1) Diagnosed Group: We gather 200 public tweets with a geolocation inside the country of interest, posted during a two-week period during 2019. As we are searching for depression Diagnosed tweets, this two-week period needs to be chosen strategically, as we want to capture users that have been diagnosed with depression rather than seasonal affective disorder (SAD), a separate condition albeit one with similar symptoms. Tweets, collected via Twitter's API 1 , were retrieved based on lexical content indicating that the user has a history of or is currently dealing with a clinical case, e.g. I was diagnosed with depression, rather than expressing depression in a colloquial context. Human annotators were then instructed to remove tweets perceived to make a non-genuine statement regarding the user's own diagnosis; most of these were referring to a third party. Examples of genuine and non-genuine tweets encountered can be seen in Table I. We then collect all (up to the 5,000 most recent) tweets made public by the remaining users between the start of 2015 and October 2019. Further filtering includes removal of all users with fewer than 20 tweets during this period or those whose tweets do not meet our major language of instruction benchmark. This benchmark requires 70% of the tweets collected to be written in the major language of instruction of the country of interest (i.e. English for the United Kingdom, Italian for Italy, etc.). Following this filtering process, we apply some preprocessing at the tweet level: medial capitals are split; mentions of other users are replaced with a unique mention token; URLs are treated in the same way; and all-uppercase formatting and non-emoticon punctuation are removed.
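The tweet-level preprocessing and user-level filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact token strings, regular expressions and emoticon-preserving character set are our assumptions.

```python
import re

# Assumed placeholder strings; the text only specifies "a unique mention token".
MENTION_TOKEN = "<mention>"
URL_TOKEN = "<url>"

def preprocess_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", URL_TOKEN, text)    # URLs -> unique token
    text = re.sub(r"@\w+", MENTION_TOKEN, text)        # mentions -> unique token
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)   # medial-capital splitting
    text = text.lower()                                # remove all-uppercase formatting
    text = re.sub(r"[^\w\s<>:;()'-]", "", text)        # drop non-emoticon punctuation
    return re.sub(r"\s+", " ", text).strip()

def keep_user(tweets: list, in_major_language: list,
              min_tweets: int = 20, lang_threshold: float = 0.70) -> bool:
    """User-level filter: at least 20 tweets, of which at least 70% are
    written in the country's major language of instruction."""
    if len(tweets) < min_tweets:
        return False
    return sum(in_major_language) / len(tweets) >= lang_threshold
```

The emoticon-preserving character class in particular is a guess at what "non-emoticon related punctuation" means in practice.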
2) Control Group: We gather 10,000 public tweets with a geolocation in the country of interest, posted during the same two-week period as the Diagnosed in 2019, and remove any tweets made by Diagnosed group users. We then follow a similar process to that of the Diagnosed collection methodology by collecting up to the 5,000 most recent tweets for each user from the period mentioned above. As can be seen in Table II, we construct imbalanced datasets. The World Health Organisation (WHO) claims 264 million people suffer from depression worldwide 2 , whilst, at the time of writing, the global population is approximately 7.8 billion 3 . This would suggest that roughly 1 in 30 individuals suffers from depression. However, these figures are approximations. Therefore, the extent to which our datasets are imbalanced is not an attempt to create datasets representative of the expected balance of classes, as this is unverifiable. Nevertheless, our datasets present ratios of Control:Diagnosed samples between 23.78:1 and 11.46:1, which came about from the data mining methods previously described. We accept these ratios to retain imbalanced datasets in a similar order of magnitude as the expected balance whilst achieving reasonable classifier performance. We inherit the caveats of the distant-supervision approach of Coppersmith et al. (2014): (a) When sampling a population we always run the risk of only capturing a sub-population of the Control or Diagnosed that is not fully representative of the population; in particular, since Diagnosed samples are identified by the fact that they publicly speak out about what is a deeply personal subject, this attribute may not generalise well to the entire population. (b) We do not verify the method used to identify users in Diagnosed but rather rely on the social stigma around mental illness: it could be regarded as unusual for a user to tweet about a fictitious diagnosis of a mental health illness.
(c) Control is likely contaminated with users that are diagnosed with a variety of conditions, perhaps mental health related, whether they explicitly mention this or not.
We have made no attempt to remove such users. (d) Depression is often comorbid with other mental health issues (Aina & Susman 2006). As such, it is plausible that the users forming Diagnosed are suffering from other mental health conditions. This could mean that the classifier is trained to pick up hidden representations of these other mental health issues and classify them as depression. We have made no attempt to further investigate nor to remove such users from Diagnosed, as having a complex Diagnosed group is a realistic representation of the task.

B. Methodology
In this section we describe the experiments conducted in training our classifier to diagnose depression. The trained classifier is deployed in Section IV for classifying samples from an unsupervised experiment dataset which is then used in analysing temporal mental health dynamics.

1) Sample Representation:
We investigate the most appropriate sample representation of our distantly supervised dataset. We are posed with these considerations: (a) Symptoms' temporal dependencies: as the gathered tweets come from a variety of days, weeks, months and even years, symptoms may only be present in specific time-dependent samples. When samples are enriched with overwhelming numbers of tweets, classifier performance is traded off against retaining the symptoms' temporal dependencies. (b) As our final task will be to monitor and analyse the temporal mental health dynamics, we are interested in modelling the rate of depression at as fine a granularity as possible. Therefore, the ability to accurately identify Diagnosed samples and correctly discriminate between Control and Diagnosed with the least tweet-enriched samples will be vital in modelling a fine-grained rate of depression in the deployment stage of the final task, where conclusions could be drawn in the context of the national lockdowns. The sample representations we examined:
• Individual: each sample consists of a single tweet.
• User_day: each sample consists of all tweets by a unique user made public during a given day.
• User_week: each sample consists of all tweets by a unique user made public during a given week.
• All_user: each sample consists of all tweets made public by a unique user.
We examine the performance of a benchmark, a Support Vector Machine (SVM) with a linear kernel function (Peng et al. 2019), on the different sample representation datasets, where the benchmark classifier inputs are sparse many-hot encodings of the samples' lexical content.
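The four sample representations can be sketched as a grouping step over (user, date, text) tuples. The names and tuple format are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict
from datetime import date  # date objects serve as day keys

def group_samples(tweets, representation):
    """Group (user, date, text) tuples into samples under one of the
    four representations: individual, user_day, user_week, all_user."""
    if representation == "individual":
        return [[text] for _, _, text in tweets]
    buckets = defaultdict(list)
    for user, day, text in tweets:
        if representation == "user_day":
            key = (user, day)
        elif representation == "user_week":
            # ISO (year, week) so week groupings do not break across years
            key = (user, day.isocalendar()[0], day.isocalendar()[1])
        elif representation == "all_user":
            key = user
        else:
            raise ValueError(representation)
        buckets[key].append(text)
    return list(buckets.values())
```

Each returned inner list is one sample; a downstream encoder would concatenate or pool its tweets.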
As we are working with imbalanced datasets, we need to think carefully about the metrics we use to assess the classifiers' performance. The accuracy metric is insufficient for imbalanced datasets, as is best illustrated with an example: given a dataset with a 24:1 ratio between the samples of each class, a classifier could achieve 96% accuracy by classifying every sample as the majority class. Such a classifier is clearly not discriminating between the distributions of the two classes, yet achieves high performance. As such, we will be assessing the performance of the classifier on the individual classes' Precision (P), Recall (R) and F1 score measures, as well as the Macro F1 score, for this and the remainder of the experiments in this paper. The Precision measure tells us: of all the samples the classifier labelled as a particular class, what fraction are correct. The Recall measure tells us: of all the samples that actually belong to that particular class, how many the classifier correctly identified. The F1 score is the harmonic mean between the two, and the Macro F1 score takes the F1 scores of all classes and calculates an unweighted mean between them. By having a more class-specific breakdown of the classifiers' performance we can better understand the strengths and limitations of our classifiers and hence make a more informed decision when choosing the highest performing classifier.
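These metric definitions can be written out directly. A minimal sketch of the standard formulas (function name and interface are ours):

```python
def per_class_prf(y_true, y_pred, labels):
    """Precision, recall and F1 per class, plus the unweighted (macro) F1."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)
    return scores, macro_f1
```

On the majority-class degenerate classifier from the example above, Recall and F1 for the minority class collapse to zero, dragging the Macro F1 down even though accuracy stays high.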
The results in Table III suggest that our benchmark classifier improved in identifying Diagnosed with increasingly tweet-enriched samples. Interestingly, however, when presented with the User_day sample representation, a sharp decrease in performance on Control samples causes a decrease in Macro F1 score when compared with the scores of both the Individual and User_week sample representations. Barring this decrease in Macro F1 score, we can say that we are able to achieve improved performance when using increasingly tweet-enriched samples. However, the end task would benefit from fine-grained modelling of the rate of depression, providing us with more detailed relationships between the temporal mental health dynamics and noteworthy dates. As such, our task is biased towards the two fine-grained sample representations, Individual and User_day. As our benchmark classifier achieves superior performance on the Individual sample representation, we adopt this representation, as denoted by the asterisk in Table III.
2) Classifier Experiments on U.K. Development Dataset: Having chosen the sample representation that balances our fine-grained sample requirements with the benchmark classifier performance, we must now build a classifier that best discriminates between our two classes. The higher the performance of the classifier, the more accurate the temporal mental health dynamics will be in Section IV. We outline the classifier architectures included in our experimentation:
• SVM: linear kernel SVM as used in Section III-B1. This classifier serves as our benchmark.
• AVEPL: average pooling layer classifier.
• BILSTM: bidirectional LSTM classifier.
• BILSTM-SELFA: bidirectional LSTM classifier with self-attention.
We set hyper-parameters where an Adam optimiser (Kingma & Ba 2015) is used with a learning rate of 0.01 and a batch size of 1,000. All classifiers were trained for a single epoch with a dataset training:validation split of 4:1, weighting the samples of Diagnosed as 5 times more valuable than those of Control. Training was done on a single Tesla P100-PCIE with 16GB of RAM, available through Google's Colaboratory 6 . Table IV shows that all classifiers achieve significantly higher performance on Control than Diagnosed. As we are trying to correctly detect Diagnosed samples and discriminate between the two classes, we prioritise the Diagnosed Precision and Macro F1 score metrics. Based on these two chosen metrics to guide our classifier selection process, three candidates emerge: AVEPL, BILSTM and BILSTM-SELFA, achieving {Diagnosed Precision, Macro F1} scores of {0.33, 0.62}, {0.3, 0.63} and {0.32, 0.63} respectively. Whilst the performance of these classifiers is similar, BILSTM-SELFA achieves the highest performing combination of the desired metrics (indicated by the asterisk) and as such we adopt this classifier in further experiments.
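The 5x Diagnosed sample weighting can be realised as a weighted loss. A minimal sketch, assuming a binary cross-entropy objective and weight-sum normalisation (both our assumptions; the text does not specify the loss function):

```python
import math

DIAGNOSED_WEIGHT = 5.0  # Diagnosed samples counted 5x as heavily as Control

def weighted_bce(y_true, y_prob):
    """Binary cross-entropy in which each Diagnosed (label 1) sample
    contributes DIAGNOSED_WEIGHT times the loss of a Control sample."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, y_prob):
        w = DIAGNOSED_WEIGHT if y == 1 else 1.0
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * loss
        weight_sum += w
    return total / weight_sum
```

Under this weighting, a batch in which a Diagnosed sample is misclassified incurs a larger loss than one in which a Control sample is equally misclassified, pushing the model towards higher Diagnosed recall.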

3) Dataset Balance Experiment:
In this section we investigate the distribution of our datasets in training and validation of our classifier. By conducting this experiment we intend to gain an in-depth understanding of our task from a linguistic standpoint. We train and validate the classifier on datasets with varying balances to investigate the role of our imbalanced dataset in the depression diagnosis task. This experiment analyses the performance of the BILSTM-SELFA classifier on a number of different training regimes:
• Balanced: a dataset containing all Diagnosed samples and downsampling from Control.
• Imbalanced: a dataset following the development dataset's distribution (see Table II).
Furthermore, we explore the effects of sample weighting of the classes by weighting Diagnosed samples as 5 times more valuable than Control samples, as in the previous experiment. The performance of the BILSTM-SELFA classifier on the different training regimes can be seen in Table V. The Balanced-Balanced (Training-Validation) regime achieves encouraging results in terms of its Precision-Recall trade-off, for both classes, as well as the Macro F1 score. This shows that the problem is reasonably achievable linguistically, when the imbalance challenge is removed from the equation. The Imbalanced-Imbalanced regime shows that adjusting the sample weighting is a successful measure we can implement to adjust the Precision-Recall trade-off in our class of interest (Diagnosed). Our classifier performs significantly worse in the Balanced-Imbalanced regime than in the Imbalanced-Imbalanced regime, and this performance is reduced further by the introduction of sample weighting. This means that when training on a Balanced dataset our classifier is less robust to an Imbalanced dataset at validation. Finally, whilst our classifier experiences a significant improvement in performance in the Imbalanced-Balanced regime when sample weighting is introduced, our final depression diagnosis task expects an Imbalanced unsupervised dataset (discussed in Section III-A2); training regimes implementing Balanced validation datasets are therefore not suitable approximations of our classifier's depression diagnosis performance. We conclude that Imbalanced training, with suitable sample weighting, yields more desirable and robust depression diagnosis performance, as the classifier is able to see a broader range of data examples in training (i.e. no sub-sampling).
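The Balanced regime's construction can be sketched as follows; the seeded sampling routine is an illustrative assumption:

```python
import random

def make_balanced(control: list, diagnosed: list, seed: int = 0):
    """Balanced regime: keep every Diagnosed sample and randomly
    downsample Control to the same size."""
    rng = random.Random(seed)
    return rng.sample(control, len(diagnosed)), list(diagnosed)
```

The trade-off discussed above falls out directly: downsampling discards most Control samples, so the Balanced-trained classifier sees a narrower slice of the Control distribution than its Imbalanced-trained counterpart.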

C. Results
We train separate BILSTM-SELFA classifiers for each of the countries' imbalanced development datasets following the Individual sample representation. The test performance of these classifiers can be seen in Table VI. We observe that the BILSTM-SELFA classifier architecture achieved similar performance on the remaining countries' datasets as was achieved on the U.K. dataset. This shows that the BILSTM-SELFA architecture is able to generalise well to different languages and cultures after training, producing an encouraging set of results and increasing our confidence in its classification ability. Whilst the BILSTM-SELFA classifier architecture achieved the highest performance of all our classifier architectures, a combination of 0.32 Diagnosed Precision and 0.63 Macro F1 score leaves much to be desired. As such, we perform an error analysis and examine the significance of its results. Table VII shows the input samples (Text), the Prediction type, and the Sigmoid Output, which is the output layer of the classifier and is responsible for the final classification of the samples. The Sigmoid Output is normalised in the range [0, 1], where an output of 0.5 represents the decision boundary and as such can be interpreted as complete uncertainty by the classifier as to how the sample should be classified. A Sigmoid Output of 1 is complete certainty that the sample should be classified as positive (Diagnosed) and an output of 0 is complete certainty that the sample should be classified as negative (Control). We observe that the True positive example mentions "overcoming depression", which implies that the user has recovered from depression, as one overcomes other health issues. The Sigmoid Output for this sample is 0.999, which is extremely high certainty by the classifier that this sample follows the distribution of Diagnosed.
At the other end of the scale, the True negative sample discusses a topic that is completely unrelated to depression and does not imply that the individual suffers from it; as such, it is classified as part of Control with a Sigmoid Output of 0.001.

1) Error Analysis:
We find the Texts of the two samples misclassified by the classifier to be rather similar. They both use words rooted in the word 'depress' in rather colloquial contexts, with no indication of a past diagnosis or clinical use of depression, which is precisely the distinction we desire our classifier to be able to draw. It is also noteworthy that the Sigmoid Outputs of these two samples are much less polarised than those of the correctly classified samples, with the Sigmoid Output of the False positive sample only just falling within the Diagnosed classification region. These two incorrectly classified samples reflect the complexity of depression diagnosis from distantly supervised tweets.
2) Significance of Results: We investigate the significance of our classifiers' results by performing a χ² significance test. Our null hypothesis, H0, states that both sets of data, our classifiers' predictions (D_P) and the distribution they are tested against (D_T), have been drawn from the same distribution (D).
We compare the distribution of the classifiers' predictions against a random uniformly distributed set (denoted Uniform) and against a randomly distributed set following the distribution of the development datasets (denoted Weighted). All classifier results in Table VI are statistically significantly different from the random baselines, according to the χ² significance test (see Table VIII in Appendix A). Therefore, we can reject H0 and conclude that the predictions of the classifier and those of the respective randomly distributed benchmarks have not been drawn from the same distribution.
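The comparison against the Uniform and Weighted baselines can be sketched as a Pearson χ² goodness-of-fit statistic over predicted class counts. The baseline construction and example figures below are our assumptions, not the paper's numbers:

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic over per-class counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def baselines(n: int, control_fraction: float):
    """Expected counts under the Uniform baseline and under a Weighted
    baseline following the development dataset's class distribution."""
    uniform = [n / 2, n / 2]
    weighted = [n * control_fraction, n * (1 - control_fraction)]
    return uniform, weighted

# With two classes there is one degree of freedom; the critical value at
# p = 0.05 is 3.841, so a statistic above that rejects H0.
```

For instance, predictions of 90 Control and 10 Diagnosed out of 100 samples score far above the critical value against the Uniform baseline, but zero against a Weighted baseline with a 0.9 Control fraction.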

IV. MONITORING AND ANALYSIS
We prepare the unsupervised dataset and deploy the previously trained BILSTM-SELFA classifier to annotate this dataset. We then analyse the relationships between the temporal mental health dynamics and the respective national lockdown dates.

A. Data
In this section we discuss the procedure for constructing the unsupervised experiment dataset, to be used for monitoring the temporal mental health dynamics during the respective pandemic-inflicted national lockdowns.

1) Experiment Dataset:
The experiment dataset is used for the deployment of the classifier, which is trained and validated on the development set, to analyse the temporal mental health dynamics of a country. We start by gathering tweets made public by users during the first two weeks of 2020 with a geolocation within the country of interest. We then follow the same data mining methodology as for Control, outlined in Section III-A, for the period starting from 1 December 2019 until 15 May 2020, and apply similar user-level and tweet-level preprocessing and filtering to these datasets. The composition of these experiment datasets can be seen in Table IX in Appendix B, along with key dates.
The key dates specified observe the official date announcing the commencement of, and the announcement of the first step towards easing of, the national lockdowns, rather than the first official date implementing these measures, as we anticipate that the announcement would provoke users to express their opinion more than the implementation of the measures. We acknowledge some caveats to the methodology in relation to the temporal mental health dynamics during the respective national lockdowns: (a) The activity level of users whose lifestyles have been highly disrupted by the national lockdowns may be overstated during this period, due to increased leisure time. (b) The language-based filtering component may exclude certain users of the population, such as stranded expatriates who use a non-majority language to communicate their thoughts. Such samples may contain a bias towards a higher rate of depression.

B. Methodology
To monitor and analyse temporal mental health dynamics we must firstly deploy our trained BILSTM-SELFA on the respective countries' unsupervised experiment datasets. Once we have the classifier's predictions, we model the rate of depression by calculating the rate of depression on any given day t, R_t, using the following equation:

R_t = (1/N_t) * Σ_{i=1}^{N_t} Φ(x_i)

where Φ represents our trained classifier, x_i is the i-th input sample on day t, and N_t is the total number of samples on day t. The output of the classifier, Φ(x_i), takes a value in {0, 1}. R_t is a normalised continuous value between 0 and 1, interpreted as the proportion of tweets at t that are classified as Diagnosed: 0 meaning all samples belong to Control and 1 meaning all samples belong to Diagnosed.
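The computation of R_t can be sketched as follows; the day keys and classifier interface are illustrative assumptions:

```python
from collections import defaultdict

def daily_rate(samples, classifier):
    """Compute R_t = (1/N_t) * sum_i classifier(x_i) per day, where the
    classifier returns 1 for Diagnosed and 0 for Control."""
    counts = defaultdict(lambda: [0, 0])  # day -> [diagnosed count, total count]
    for day, x in samples:
        counts[day][0] += classifier(x)
        counts[day][1] += 1
    return {day: diag / total for day, (diag, total) in counts.items()}
```

Each value is the proportion of that day's samples classified as Diagnosed, matching the interpretation of R_t given above.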

C. Results
Figures 1 and 2 (see Appendix C) display the temporal mental health dynamics for the countries under investigation. It is noteworthy that R_t across the different countries is a function of the ratio of Control:Diagnosed samples in the country-specific dataset on which the classifier was trained. As such, the rates across countries are not directly comparable; rather, we analyse the momentum with which R_t in a country changed over time and how it differed from the R_t of other countries.
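The figures smooth daily noise in the rate of depression with a 7-day moving average (per the caption of Fig. 1). A minimal sketch, assuming a trailing window that shrinks at the start of the series (the exact windowing is not specified in the text):

```python
def moving_average(series, window: int = 7):
    """Smooth a daily rate series with a trailing moving average; the
    window shrinks at the start where fewer days are available."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out
```

A centred window would be an equally reasonable choice; a trailing one avoids using future days when smoothing.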

D. Discussion
Foremost, we note that we categorically cannot, nor do we, state that the rates of depression discussed in this section are caused by the imposition of respective national lockdowns or any other measures of any type, taken by governments to combat the spread of the virus. In this section, we merely offer interpretations of the rates of depression in line with explicit relationships that we discover between the rates of depression and key events that occurred during the time-period included in this case study.
Upon examination of the U.K. rate of depression (R_UK), the first distinct observation we make is not related to any pandemic-related measures but rather the sharp, non-sustained increase of R_UK of over 50% on Christmas day, before decreasing back to the status quo the next day. Upon further investigation we find that this phenomenon is well-documented (Hillard & Buckman 1982), and seeing that our classifier was able to identify this phenomenon without explicitly being aware of its existence is encouraging. We continue observing R_UK chronologically: on March 9th, the Italian National Lockdown begins. From this point on we observe a sharp, sustained increase in R_UK until approximately March 23rd, when the U.K. National Lockdown begins (the U.K. being the last country to impose such a restriction in this study) and R_UK somewhat plateaus. We interpret this as an increase in anxiety, a symptom of depression, amongst the U.K. population as neighbouring countries take decisive measures to slow the spread. A key theme in the build-up to the U.K. national lockdown implementation was the intentional delay, so as to ensure maximum utility from the policy 7 . However, a report published on the 16th of March by the Imperial College COVID-19 response team 8 estimated that the current combative approach taken up by the U.K. government would result in 250,000 deaths. The report was well-publicised by the British media and was arguably a factor in the change of combative approach by the U.K. government. This is somewhat supported by the change in R_UK during the U.K. National Lockdown, where we see a sustained decrease for the majority of the period. Finally, we observe a slight increase in R_UK towards the end of and in the aftermath of the National Lockdown that could perhaps be interpreted as anxiety and concern from the population at the uncertainty with which they are faced, both from a social and an economic perspective.
The rates of depression of France (R_FR), Germany (R_DE), Italy (R_IT) and Spain (R_ES) behave rather differently from R_UK. Firstly, R_IT increases sharply, by over 100%, over the initial days of the Italian National Lockdown. This can be interpreted as anxiety and concern in response to the stringent measures quickly imposed by the Italian government following the outbreak of the virus in Italy, which at this point was largely believed to be the epicentre of the pandemic. This was coupled with economic turmoil and great concern over the capacity of hospitals and their ability to handle the high demand for intensive care units that would ensue 9.
A similar pattern emerges in R_FR: a sharp increase over the initial days of the French National Lockdown period, after which R_FR continues to rise throughout the lockdown at a lower and inconsistent rate. A similar story could be told for R_ES. The major increase in R_DE, by contrast, occurs in the build-up month, whilst during the German National Lockdown R_DE increases in the initial days, albeit at a lower rate. R_DE then plateaus and decreases, creating a turning point in R_DE during the German National Lockdown.

[Fig. 1. U.K. rate of depression before and during the National Lockdown. Noise in the rate of depression has been smoothed with a 7-day moving average.]
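The caption of Fig. 1 notes that noise in the daily rates is smoothed with a 7-day moving average. A minimal sketch of that smoothing step follows; the function and variable names are our own, and we assume a trailing window that shortens at the start of the series, which is one common convention rather than necessarily the paper's.

```python
def moving_average(series, window=7):
    """Smooth a daily series with a trailing moving average.
    The first `window - 1` days average over whatever history exists."""
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Smoothing a short toy series with a 2-day window.
print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

Smoothing of this kind suppresses day-level noise (including weekday effects) while preserving the sustained trends discussed above.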
Furthermore, we can observe the R of the respective countries following the easing of the respective lockdowns and interpret it as each country's outlook on the easing of restrictions. From the time period available to us, it seems that the French and Spanish general populations experienced a reduction in symptoms of depression, such as anxiety, as evidenced by the clear reduction in R_FR and R_ES respectively. We can therefore conclude by tentatively stating that the easing of restrictions was met by an improvement in the mental state of the general populations of France and Spain, whereas the mental state of the Italian and German general populations deteriorated, whilst the general mental state of the U.K. was rather agnostic to the easing of restrictions.
We are hesitant to state that the changes in the rates of depression were caused by the imposition or easing of national lockdowns. To make such a claim we would be required to undertake a more fine-grained causality study, which is beyond the scope of this paper; we note this for future work. We can, however, claim to have discovered clear relationships between the drastic changes in the behaviour of the rates of depression and the periods of the build-up to, during, and in the aftermath of national lockdowns.

V. CONCLUSION
Our set of experiments has been conducted with the aim of providing organisations with a methodology for monitoring and analysing temporal mental health dynamics using social media data. We examine sample representations and their impact on classifier performance. We investigate the role of including an imbalanced dataset in the classifier training regime. Our classifier provides encouraging performance on two fronts: first, it is able to discriminate with high certainty between clear Diagnosed and Control samples; second, it was able to identify the Christmas Depression phenomenon supported by the literature. Finally, we present an analysis and discussion of the rates of depression and their relationships with key events during the COVID-19 pandemic. We reiterate that, through the analysis conducted in this paper, we cannot state that the measures imposed caused the changes in rates of depression during the pandemic, and we leave this causality analysis for future work.
Mental health monitoring methodologies such as the one proposed in this paper can be adopted by governments, to identify relationships between the general population's mental health state and imposed measures; by mental health authorities, to assist in planning and in targeting the individual locations in which to dynamically concentrate their resources; and, for a more commercial use case, by corporations, such as pharmaceutical companies, involved in producing or disseminating drugs to combat mental health issues.