ESTeR: Combining Word Co-occurrences and Word Associations for Unsupervised Emotion Detection

Accurate detection of emotions in user-generated text has been shown to have several applications in e-commerce, public well-being, and disaster management. Currently, state-of-the-art performance for emotion detection in text is obtained using complex, deep learning models trained on domain-specific, labeled data. In this paper, we propose ESTeR, an unsupervised model for identifying emotions using a novel similarity function based on random walks on graphs. Our model combines large-scale word co-occurrence information with word associations from lexicons, avoiding not only the dependence on labeled datasets, but also the explicit mapping of words to latent spaces used in emotion-enriched word embeddings. Our similarity function can also be computed efficiently. We study a range of datasets, including recent tweets related to COVID-19, to illustrate the superior performance of our model, and report insights on public emotions during the ongoing pandemic.


Introduction
Human beings are known to perceive and feel various, highly-nuanced emotions, expressed in both spoken and written texts. Modeling emotions in user-generated content has been shown to benefit domains such as commerce, public health, and disaster management (Bollen et al., 2011b; Neppalli et al., 2017; Hu et al., 2018; Pamungkas, 2019). For example, emotion cues from social media posts were used to identify depression and PTSD (Deshpande and Rao, 2017; Aragón et al., 2019) and for personalizing chatbots to improve user satisfaction (Wei et al., 2019).
Recent studies list as many as 154 human emotions (Smith, 2015). However, most researchers in psychology have largely agreed on a set of basic emotions such as anger, fear, disgust, sadness, surprise, and happiness (Ekman, 2016) and showed that complex emotions can be expressed using this basic set (Ekman, 1992; Plutchik, 2001). For example, Plutchik uses combinations, intensity, and opposites of basic emotions for capturing higher-order emotions. That is, annoyance and rage can be viewed as the less and more intense forms of anger, and anticipation is the opposite of surprise. Thus, most recent studies on automatic emotion detection use Ekman's or Plutchik's sets of 6 or 8 emotions, respectively (Mohammad et al., 2018; Liu et al., 2019).

* Equal contribution from both authors.
Existing models for automatic emotion identification in user-generated texts typically use supervised learning techniques. The state-of-the-art emotion detection performance on tweets, news articles, blogs, reviews, and TV-show transcripts is obtained using complex, deep learning architectures that combine a range of features including terms, embeddings, and domain-specific aspects such as emojis, as well as human-generated lexicons of emotion-word associations (Chatterjee et al., 2019; Zahiri and Choi, 2018; Mundra et al., 2017; Abdul-Mageed and Ungar, 2017; Köper et al., 2017). Much manual effort is involved in collecting annotated data for a given domain and fine-tuning domain-specific models.
Other auxiliary works enabling emotion detection can be placed under two complementary directions. The first one is lexicon development for emotions via manual vocabulary labeling or automatic generation, for example, based on similarity to a set of seed words (Mohammad and Turney, 2013; Araque et al., 2019). The second direction uses a latent space of embeddings to compare sentences with emotion lexicons (Xu et al., 2015; Savigny and Purwarianti, 2017). Compiling a lexicon of high quality and coverage is a labor-intensive task, and even when automation and crowdsourcing are involved, close manual control is required. As for latent space representations, the embedding model must include sufficient information about the underlying emotions, obtained, e.g., from lexicons or labeled datasets (Agrawal et al., 2018; Xu et al., 2018; Tang et al., 2014).
Both embeddings and lexicons enable basic techniques for unsupervised emotion prediction, for example, by using word embedding similarities (Kim et al., 2010) or the overlap between lexicon words and the input text (Araque et al., 2019). Considering the abundance of user-generated texts on the current-day Web with its ever-changing topics (for example, "COVID-19 lockdown"), we argue that it is desirable to develop advanced unsupervised models that detect emotions accurately across domains, offer a probabilistic explanation for the predicted emotions, and do not depend on large quantities of labeled data. These desiderata comprise our precise objectives in this paper.
We present Emotion-Sensitive TextRank (ESTeR) and its variants as our similarity functions that use word graphs for scoring input texts with reference to a given set of emotions. ESTeR is designed based on the following two observations: (1) For a given language, words expressing emotions are fairly stable across domains (Agrawal et al., 2018). For example, the same words ("This is absurd...") may be used to express anger (emotion) regarding a product on an e-commerce website as well as in a tweet related to a government policy. (2) Word co-occurrence graphs are known to capture contextual and latent language information and have been successfully used in various NLP tasks (Mihalcea and Tarau, 2004; Yan et al., 2013; Chen and Kao, 2015; Kong et al., 2016).
We make the following contributions:
• For identifying emotions in textual content, we propose similarity functions that incorporate word co-occurrence information from large-scale, publicly-available text corpora and word associations from lexicons. Our novel similarity functions are based on random walks on word graphs and score an input text with respect to a given emotion.
• Next, we formally show the relation between the proposed similarity functions and Personalized PageRank. In addition, we provide a computational method based on solving a linear system of equations to compute our similarity functions efficiently at the dataset level, rather than per instance.
• We present experiments illustrating the superior performance of our models on five recent, publicly-available datasets for emotion detection (Klinger et al., 2018; Liu et al., 2019).
• Finally, we showcase our proposed model on a newly-collected dataset of COVID-19 tweets by highlighting various interesting aspects of public emotions during the current pandemic.
In the next section (Section 2), we present our scoring framework for emotion detection along with derivations on how to compute our solution efficiently. In Section 3, we summarize the datasets and experiments illustrating the performance of our proposed model. In Section 4, we demonstrate, anecdotally, the effectiveness of our model on COVID-19 tweets. Finally, we present closely-related work in Section 5 and conclusions in Section 6.

Preliminaries
Given an input text (alternately referred to as a "sentence" in this paper for ease), d, and a set of emotions, E, the objective of the emotion detection task is to identify a subset of emotions from E to be assigned as labels for d. This objective translates into constructing a score function s : D × E → R ≥0 , where D is a dataset of n sentences.
Similar to previous unsupervised models (Kim et al., 2010), we would like to leverage information from emotion lexicons: a set of words L(e) that have a known binary or continuous association with an emotion e ∈ E. The vocabulary V of size m is the union of all words (in the lexicons, the dataset, and the corpus used to generate our graph-based model, to be explained shortly).
Let x d and x e be, respectively, the vector representations of a sentence d and a lexicon of emotion e. We use binary bag-of-words column vectors of length |V|. The matrix D ∈ R m×n represents the dataset with each column corresponding to a sentence vector x d for some d ∈ D, whereas each column of the emotions matrix E ∈ R m×|E| is a vector representation x e of some emotion e ∈ E.
The score function s(d, e) is typically defined as a similarity function between the vector representations of a sentence d and an emotion e (Seyeditabari et al., 2018). For example, the commonly-used cosine similarity function is given by:

    s_{cos}(d, e) = \frac{x_d^\top x_e}{\|x_d\|_2 \, \|x_e\|_2}.    (1)

To mitigate problems due to the sparsity of lexicons, which may result in insufficient overlap between the sentence and emotion vectors, previous works have employed latent spaces for representing sentences and lexicons. These spaces can be obtained through matrix factorization approaches (Kim et al., 2010) or, more recently, through neural embeddings (Polignano et al., 2019). For these models, the corresponding scoring function s_{lat}(d, e) can be written as:

    s_{lat}(d, e) = \frac{(M x_d)^\top (M x_e)}{\|M x_d\|_2 \, \|M x_e\|_2},    (2)

where M ∈ R^{h×m} is the embeddings matrix, h ≪ m.
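As a concrete illustration of the cosine scoring in Equation (1), the following sketch scores a sentence against binary lexicon vectors; the vocabulary, sentence, and lexicon words are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical vocabulary and emotion lexicons (for illustration only).
vocab = ["absurd", "angry", "happy", "great", "terrible", "product"]
word_idx = {w: i for i, w in enumerate(vocab)}
lexicons = {"anger": ["absurd", "angry", "terrible"], "joy": ["happy", "great"]}

def bow_vector(words):
    """Binary bag-of-words vector of length |V| over the vocabulary."""
    x = np.zeros(len(vocab))
    for w in words:
        if w in word_idx:
            x[word_idx[w]] = 1.0
    return x

def cosine_score(x_d, x_e):
    """Equation (1): cosine similarity between sentence and lexicon vectors."""
    denom = np.linalg.norm(x_d) * np.linalg.norm(x_e)
    return float(x_d @ x_e) / denom if denom > 0 else 0.0

x_d = bow_vector("this product is absurd".split())
scores = {e: cosine_score(x_d, bow_vector(L)) for e, L in lexicons.items()}
best = max(scores, key=scores.get)  # the lexicon with the largest overlap wins
```

Note that the sentence shares no word with the joy lexicon, so its joy score is exactly zero; this is the sparsity problem that motivates the latent-space and random-walk alternatives.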

ESTeR : Our Proposed Scoring Function
Latent-space based similarity functions show relatively improved performance (Polignano et al., 2019). However, previous works have highlighted the shortcomings of using general latent space representations for specific tasks, and labeled data is often used to fine-tune latent representations within supervised models (Seyeditabari and Zadrozny, 2017; Yeh et al., 2017). Therefore, we would like to avoid an explicit mapping into a latent space by turning to the classical notion of random walk-based graph similarity and look for functions of the shape:

    s(d, e) = \frac{x_d^\top P \, x_e}{norm(x_d, x_e)},    (3)

where P is a matrix derived from random walks on a word graph (defined below) and norm is some appropriate normalization for x_d and/or x_e. In deriving random walk-based similarity functions on word graphs, we need a transition matrix whose entries represent probabilities of moving from one word to another. Similar to previous works (Mihalcea and Tarau, 2004), we use a co-occurrence matrix A ∈ R_{≥0}^{m×m} derived from a general corpus, e.g., Wikipedia, where A(i, j) is the number of times words i and j appear in the same text window. The entries of A are row-normalized to convert A into a stochastic matrix.
In the random walk with restarts model, we assume that at each step of the random walk, the walker moves to another word according to the transition matrix with probability α ∈ [0, 1] and stops with probability 1 − α (Yazdani and Popescu-Belis, 2010; Duan et al., 2018). The resulting matrix P can be expressed as:

    P = (1 - \alpha) \sum_{k=0}^{\infty} \alpha^k A^k = (1 - \alpha)(I - \alpha A)^{-1},

where k is the walk length.
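The equality between the infinite series and the closed form (1 − α)(I − αA)^{−1} can be checked numerically; the sketch below uses a hypothetical 3-word row-stochastic transition matrix and also verifies that each row of P remains a probability distribution:

```python
import numpy as np

alpha = 0.85  # continue probability; the walker stops with probability 1 - alpha
# Hypothetical row-stochastic transition matrix over 3 words.
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

# Closed form: P = (1 - alpha) * (I - alpha * A)^{-1}
P_closed = (1 - alpha) * np.linalg.inv(np.eye(3) - alpha * A)

# Truncated Neumann series: P ~ (1 - alpha) * sum_k alpha^k A^k
P_series = np.zeros((3, 3))
Ak = np.eye(3)  # A^0
for k in range(200):
    P_series += (1 - alpha) * (alpha ** k) * Ak
    Ak = Ak @ A

# Each row of P sums to 1: entry (i, j) is the probability that a walk
# started at word i stops at word j.
```

The truncation at k = 200 leaves an error on the order of α^200, far below floating-point tolerance for α = 0.85.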
Since our goal is to measure the similarity of a sentence to a lexicon, we need to allow the walk to restart only inside the sentence vocabulary. That is, a random walker restarts at a word w ∈ V in d, chosen at random. With a uniform distribution, each word is chosen with probability 1/\|x_d\|_1. This translates into the \|x_d\|_1 normalization in Equation 3. On the other hand, we would like to reach any word in the lexicon, so the probabilities of reaching each particular word in the lexicon are aggregated as a sum without any normalization. Therefore, norm(x_d, x_e) = \|x_d\|_1, and the final formula for s(d, e) (we denote it as ESTeR(d, e), for Emotion-Sensitive TextRank) is:

    ESTeR(d, e) = \frac{x_d^\top P \, x_e}{\|x_d\|_1}.    (4)

ESTeR(d, e) has a clear probabilistic interpretation as the probability that a random walk with restarts in a sentence d ends in the lexicon of e.
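The following toy sketch (hypothetical 4-word vocabulary and lexicon) computes the score of Equation (4) and checks its probabilistic interpretation: scores lie in [0, 1], and a lexicon covering the whole vocabulary receives a score of exactly 1, since the walk must stop somewhere:

```python
import numpy as np

alpha = 0.85
# Hypothetical row-stochastic transition matrix over a 4-word vocabulary.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
P = (1 - alpha) * np.linalg.inv(np.eye(4) - alpha * A)

def ester(x_d, x_e):
    """Equation (4): probability that a walk restarting in d stops inside lexicon e."""
    return float(x_d @ P @ x_e) / np.abs(x_d).sum()

x_d = np.array([1.0, 1.0, 0.0, 0.0])  # sentence uses words 0 and 1
x_e = np.array([0.0, 0.0, 1.0, 1.0])  # lexicon covers words 2 and 3

score = ester(x_d, x_e)
# The sentence reaches the lexicon only via multi-step walks, so 0 < score < 1.
```

Even with zero direct word overlap between sentence and lexicon (the failure case of Equation (1)), the score here is strictly positive because the graph connects the two word sets.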
Note that the probability that a random walker restarting from the words of a sentence d stops at a word w ∈ V is given by:

    PPR(x_d, w) = \frac{x_d^\top P \, e_w}{\|x_d\|_1},

where e_w is a one-hot vector with 1 at the position corresponding to the word w.
Such PPR(x_d, w) is a classic Personalized or Topic-Sensitive PageRank score of a word w for the personalization (topic) vector x_d/\|x_d\|_1, where PPR(x_d) = P^\top x_d / \|x_d\|_1 ∈ R^{m×1} is the corresponding Personalized PageRank vector.
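This relation can be verified numerically on a toy graph: the ESTeR score equals the Personalized PageRank mass accumulated on the lexicon words. All matrices and vectors below are hypothetical illustrations:

```python
import numpy as np

alpha = 0.85
# Hypothetical row-stochastic transition matrix over 3 words.
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
P = (1 - alpha) * np.linalg.inv(np.eye(3) - alpha * A)

x_d = np.array([1.0, 1.0, 0.0])   # sentence words: 0 and 1
x_e = np.array([0.0, 1.0, 1.0])   # lexicon words: 1 and 2

# Personalized PageRank vector for the topic vector x_d / ||x_d||_1.
ppr = (x_d / x_d.sum()) @ P

# ESTeR(d, e) is exactly the PPR mass on the lexicon words {1, 2}.
ester_score = (x_d @ P @ x_e) / x_d.sum()
```

Since the PPR vector is a probability distribution over words, summing its entries over any lexicon yields the corresponding ESTeR score.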

Computation
Our objective is to compute ESTeR for a dataset efficiently. Using the matrix representations for a dataset (D) and emotions (E), Formula 4 can be written as:

    ESTeR(D, E) = D_n^\top P E,    (5)

where D_n is the l1 column-normalized matrix D. ESTeR(D, E) is a matrix of size n × |E|, where each element ESTeR(d, e) is the score of a document d for an emotion e. Using the Neumann series (see, e.g., (Naylor and Sell, 2000)), it can be further written as:

    ESTeR(D, E) = (1 - \alpha) D_n^\top (I - \alpha A)^{-1} E.

If we calculated ESTeR(d, e) naïvely using available methods, we would have to run the PageRank algorithm for each sentence. To avoid this, we first solve the linear system:

    (I - \alpha A) Z = (1 - \alpha) E

for Z ∈ R^{m×|E|}. We can then calculate the final dot-product with D as ESTeR(D, E) = D_n^\top Z. The total time complexity of this method is O(|E| · LA(m) + mult(D_n, Z)). Here, LA(m) is the cost of solving a linear system of size m, which is O(m^3) in general but, for sparse matrices such as ours, can run in quadratic to almost linear time in practice (Zlatev, 1991). mult(D_n, Z) is the time complexity of the matrix multiplication, which is O(m|E|n) in general, but for the multiplication of a sparse matrix (D_n) with a dense one (Z), the complexity can be reduced to O(nnz(D_n)|E|), where nnz(D_n) is the number of non-zero entries in the matrix D_n (Zlatev, 1991). nnz(D_n) can be estimated as a · n, where a is the average sentence length. Discarding a and |E| as constants, the overall computational complexity is O(LA(m) + n).
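A sketch of this dataset-level computation, with randomly generated sparse matrices standing in for the real co-occurrence, dataset, and lexicon matrices, might look as follows; a single sparse LU factorization replaces per-sentence PageRank runs, and the result is verified against the explicit dense formulation:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

rng = np.random.default_rng(0)
m, n, k, alpha = 60, 10, 3, 0.85   # vocab size, #sentences, #emotions, continue prob.

# Hypothetical sparse co-occurrence counts, made row-stochastic (identity added
# to guarantee non-empty rows).
C = sp.random(m, m, density=0.2, random_state=0, format="csr") + sp.identity(m, format="csr")
A = sp.diags(1.0 / np.asarray(C.sum(axis=1)).ravel()) @ C

D = (rng.random((m, n)) < 0.15).astype(float)   # binary bag-of-words dataset matrix
E = (rng.random((m, k)) < 0.10).astype(float)   # binary emotion lexicon matrix
D_n = D / np.maximum(D.sum(axis=0), 1.0)        # l1 column normalization

# Solve (I - alpha A) Z = (1 - alpha) E once for all |E| emotions at once,
# instead of running PageRank per sentence.
lu = splu(sp.csc_matrix(sp.identity(m) - alpha * A))
Z = lu.solve((1 - alpha) * E)

scores = D_n.T @ Z   # n x |E| matrix of ESTeR scores

# Sanity check against the explicit dense form P = (1 - alpha)(I - alpha A)^{-1}.
P = (1 - alpha) * np.linalg.inv(np.eye(m) - alpha * A.toarray())
reference = D_n.T @ P @ E
```

The dense inverse is only for verification; at realistic vocabulary sizes (tens of thousands of words), only the sparse solve is practical.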
Note that, once the system is solved and the matrix Z is obtained, we can estimate the scores of any new sentence on the fly in time linear in the sentence length, similar to supervised predictive models.
If computed naïvely, even the popular power method (Arasu et al., 2002) for PageRank computation requires O(m^2) per iteration; thus, computing Equation 5 for the whole dataset D takes O(nm^2 I + n · nnz(E) m), where I is the number of iterations. Estimating the lexicon size as m/b for b > 1 and discarding constants results in a complexity of O(nm^2).

Variants
We consider a couple of variations of our ESTeR scoring function to enable other probabilistic interpretations of scoring texts with respect to emotions.
ESTeR:LexNorm. The first variant incorporates normalization of the lexicon vectors as:

    ESTeR{:}LexNorm(D, E) = D_n^\top P E_n.

Here, E_n is the l1 column-normalized matrix E. Since lexicons (particularly auto-generated lexicons) can be large and some emotions have richer word associations, the normalization has the effect of balancing the sizes of the lexicons and the contributions of each word. This variant, therefore, captures the probability that a random walk with restarts starts in the sentence and ends in the lexicon, if the starting and ending words u ∈ V(d) and v ∈ L(e) are chosen uniformly at random.
ESTeR:Lex2Sent. The second variant reverses the intuition of ESTeR by capturing the probability that the random walk starts in the lexicon and ends in the sentence, and is given by:

    ESTeR{:}Lex2Sent(D, E) = D^\top P^\top E_n.

This variant, therefore, scores sentences based on how well they reflect the lexicon.
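The three variants differ only in their normalizations and walk direction, as the following toy comparison (hypothetical 3-word graph, sentence, and lexicon) illustrates:

```python
import numpy as np

alpha = 0.85
# Hypothetical row-stochastic transition matrix over 3 words.
A = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
P = (1 - alpha) * np.linalg.inv(np.eye(3) - alpha * A)

x_d = np.array([1.0, 1.0, 0.0])   # sentence vector
x_e = np.array([0.0, 1.0, 1.0])   # lexicon vector

# ESTeR: restart uniformly in the sentence, end anywhere in the lexicon.
ester_base = (x_d @ P @ x_e) / x_d.sum()
# ESTeR:LexNorm: additionally pick the ending lexicon word uniformly.
ester_lexnorm = (x_d @ P @ (x_e / x_e.sum())) / x_d.sum()
# ESTeR:Lex2Sent: the walk starts uniformly in the lexicon and ends in the sentence.
ester_lex2sent = (x_e / x_e.sum()) @ P @ x_d
```

Since LexNorm divides the base score by the lexicon size, it can never exceed ESTeR itself, and all three quantities are probabilities.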

Baselines for Comparisons
Since techniques for unsupervised emotion detection are lacking, we formulate our baselines based on the two resource types created for this task.
For the first set of baselines, we directly use the recent emotion-enriched word embeddings ewe-uni300 (Agrawal et al., 2018), emo2vec100 (Xu et al., 2018), and sswe-u50 (Tang et al., 2014) to represent the sentence and emotion vectors. The similarity is computed using Equation 2. Unlike general word embeddings (Pennington et al., 2014), emotion-enriched embeddings use supervision of some form to capture the "emotion similarity/dissimilarity" between words in a latent space.
The second set of baselines incorporates coverage of emotion lexicons by using Equation 1 to compute the similarity between the sentence and emotion vectors. We use EmoLex (Mohammad and Turney, 2013) and DepecheMood (Staiano and Guerini, 2014), two recent lexicons that are also used in many supervised emotion detection models (Mohammad et al., 2018; Liu et al., 2019). In the next section, we refer to the baseline techniques using the resource names.
Datasets

Table 1 summarizes the main characteristics of the datasets. Note that all datasets are gathered from the Twitter platform, except for DENS. Overall, our choice comprises the most recent datasets available for the emotion detection task (Klinger et al., 2018). The tweet datasets are representative of the abundant, diverse, and ever-changing content on Twitter, whereas the recently-collected DENS has narrative texts from literature and fan-fiction websites. Together, they form a diverse collection of datasets for evaluating our proposed unsupervised methods. All datasets except TEC permit multi-labeling. The median number of labels is 1 for all datasets except SemEval2018, where it is 2.

Resources and Measures
ESTeR computation depends on two resources: the lexicons providing emotion-word association information and the graph containing word co-occurrence information. The lexicons EmoLex and DepecheMood described in Section 2.5 are used for the ESTeR variants as well.
For the co-occurrence matrices, we experimented with the following corpora: Wiki is based on the Wikipedia dump of text articles collected in February 2020, comprising 39K words and 1.7M non-zero entries; Twitter is based on a dataset of tweets (Go et al., 2009) and contains 17.7K tokens and 1.3M non-zero entries; and Combined is the co-occurrence matrix for the combined corpora, with 47.7K tokens and 1.9M non-zero entries. Further details on these resources and datasets are included in Appendix A due to space limitations. The Python 3 implementations of the methods and the experimentation scripts are available at https://github.com/nusids/ester.

For computing ESTeR, standard BLAS implementations for linear algebra subproblems were used. For an indicative runtime, we can calculate the ESTeR scores for the SemEval2018 dataset with the Combined matrix and EmoLex lexicon in 11 minutes in total.

Following previous works on multi-label emotion detection, we present our results using Jaccard accuracy and the F1 measure evaluated for top-k predictions, referred to as Jaccard@k and F1@k, with k set to 1, 2. That is, if L(d) indicates the set of correct labels for a document d, and L'(d) the predicted set, the measures are given as:

    Jaccard(d) = \frac{|L(d) \cap L'(d)|}{|L(d) \cup L'(d)|},    F1(d) = \frac{2 P(d) R(d)}{P(d) + R(d)},

where P(d) = |L(d) ∩ L'(d)| / |L'(d)| and R(d) = |L(d) ∩ L'(d)| / |L(d)| refer to the precision and recall for d, respectively (Manning et al., 2008), and the per-document scores are averaged over the dataset.
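A minimal sketch of the Jaccard@k and F1@k evaluation, assuming per-document emotion scores are available as dictionaries; the gold labels and score values below are hypothetical:

```python
def jaccard(gold, pred):
    """Jaccard accuracy for one document: |gold & pred| / |gold | pred|."""
    gold, pred = set(gold), set(pred)
    return len(gold & pred) / len(gold | pred)

def f1(gold, pred):
    """F1 for one document from set-based precision and recall."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def evaluate_at_k(gold_labels, scores, k):
    """Average Jaccard@k and F1@k, taking the top-k scored emotions per document."""
    js, fs = [], []
    for gold, sc in zip(gold_labels, scores):
        pred = sorted(sc, key=sc.get, reverse=True)[:k]
        js.append(jaccard(gold, pred))
        fs.append(f1(gold, pred))
    return sum(js) / len(js), sum(fs) / len(fs)

# Hypothetical predictions for two documents:
gold = [{"anger"}, {"joy", "trust"}]
scores = [{"anger": 0.9, "joy": 0.2, "fear": 0.1},
          {"joy": 0.8, "trust": 0.7, "anger": 0.1}]
j1, f1_at_1 = evaluate_at_k(gold, scores, k=1)
j2, f1_at_2 = evaluate_at_k(gold, scores, k=2)
```

With k fixed, a single-label document is penalized at k = 2 and a two-label document at k = 1, which is why both cut-offs are reported.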

Experimental results
Effect of co-occurrence matrices on ESTeR: In the leftmost plot of Figure 1, we show the effect of using the different co-occurrence matrices, Wiki, Twitter, and Combined, on our similarity function. Since our contention is that word co-occurrence graphs incorporate the latent information required for emotion detection, the richer and more representative the corpus is, the better ESTeR performs on the emotion detection task. The plot shows the F1@2 values for a run of ESTeR with the EmoLex lexicon on the different datasets. The coverage of words in the Twitter vocabulary can be expected to differ from that of Wikipedia. We notice that combining information from both these resources yields better performance on 3 out of 5 datasets. (All experiments were conducted on a Xeon E5 2680 v2 2.80GHz with 64GB memory.) A previous study by Klinger et al. (2018) pointed out the domain similarity between TEC and CrowdFlower, as well as the noise in CrowdFlower, after a manual examination. Within ESTeR, both TEC and CrowdFlower benefit from using a focused corpus (Twitter) that is more reflective of their dataset domain.
Lexicon effect on ESTeR: In the middle plot of Figure 1, we show the effect of using the different lexicons with ESTeR. The F1@2 values achieved by ESTeR with the Wiki matrix and the two lexicons, EmoLex and DepecheMood, are shown in this plot. While EmoLex is based on a general dictionary, the DepecheMood lexicon uses vocabulary from news articles. Interestingly, the substantially smaller lexicon fares significantly better on all but one dataset (TEC). We attribute this effect to the quality of the lexicons. The EmoLex dictionary was created by asking annotators questions related to specific terms and then compiling the answers to reflect a binary association with an emotion (Mohammad and Turney, 2013). In contrast, the manual annotations obtained for news headlines were later converted to (word, emotion) association scores in DepecheMood (Araque et al., 2019). While this automatic process yields a large-scale dictionary, we note that within the ESTeR framework, a smaller set of high-quality word associations seems to be more beneficial on average.
Performance of ESTeR variants: The rightmost plot in Figure 1 shows the performance of the three proposed variants, ESTeR, ESTeR:LexNorm, and ESTeR:Lex2Sent, with the Combined matrix and EmoLex lexicon on the five datasets. As described previously, the three variants have different interpretations: ESTeR is the probability that a random walker, starting at a randomly chosen word in a sentence, stops at any word in a lexicon, whereas ESTeR:LexNorm penalizes large lexicons, so that every lexicon word contributes equally. ESTeR:Lex2Sent is similar to ESTeR:LexNorm, but the walker moves from the lexicon to the sentence. According to Figure 1, ESTeR outperforms the variants on 4 out of 5 datasets and is very close to the best-performing variant for CrowdFlower. ESTeR and ESTeR:LexNorm result in a similar classification quality and outperform ESTeR:Lex2Sent. This is explainable: the walk from a relatively larger set of lexicon words quantifies the emotion association less precisely than the walk starting from a small set of sentence words. The lexicon normalization does not offer much benefit: a lexicon covers a range of words for a given emotion, and it is restrictive to require the sentence to reflect all of them.

Comparison with baselines
Based on the experiments above, we choose ESTeR in combination with the EmoLex lexicon and the Combined matrix to compare against the baselines in Table 3. Additionally, we include results with the best-performing combination among ESTeR variants, matrices, and lexicons (based on F1@2 scores) as the ESTeR* entries. The best and second-best performance for the two measures are highlighted in this table.
The first set of entries in this table uses state-of-the-art emotion-aware embeddings, whereas the second set is based on word overlaps with the lexicons. Lexicon-based baselines are highly dependent on the coverage of the dataset words by a given lexicon. Not surprisingly, this is reflected in the variation in performance of these baselines across the datasets. In comparison, the emotion-enriched embeddings are generated to capture similarities and dissimilarities between words in a latent space. Hence, although emotions and sentences can be represented in embedding spaces, ESTeR is able to effectively harness the word co-occurrence space to obtain better performance on the classification task.
Table 3: Comparison of classification quality with the baselines of Section 2.5. ESTeR is run with the Combined co-occurrence matrix and EmoLex lexicon. ESTeR* denotes the best-performing combination of (lexicon, matrix, ESTeR variant) choices. The best and second-best performances are highlighted.

From Table 3, we observe that despite using the generic EmoLex lexicon and the Combined graph, ESTeR still features among the top-2 performing models for most datasets and outperforms the baselines in most cases. Furthermore, by incorporating representative lexicons and matrices (the ESTeR* entries), we obtain the best performance on all measures for three out of five datasets and the best F1@2 for all datasets. To summarize, ESTeR is able to effectively combine information from a general corpus and a focused word-association lexicon to provide a robust and competitive method for unsupervised emotion identification.
Case Study: COVID-19 Tweets

We showcase ESTeR on a newly-collected dataset of COVID-19 tweets to highlight its usefulness in uncovering macro-level emotion trends despite zero labeled data. We study a random sample of 17K English tweets from the last week of March, collected by Panacea Lab. This subset of tweets was analyzed using the ESTeR method with the Combined matrix and EmoLex lexicon. To obtain an emotion labeling, we assign each tweet to the 2 categories with the highest score. In the interest of space, we provide a summary of our findings in this section and further details in Appendix B for the interested reader.
In Figure 2, we show the most frequent hashtags (further analysis is included in Appendix B). Each number in the matrix is the emotion frequency for the hashtag, i.e., the number of times the emotion is assigned to a tweet with the hashtag, divided by the total number of tweets with this hashtag in the dataset. We observe that, uniformly, sadness and trust are the dominant emotions. General COVID-19 hashtags are mostly related to sadness, then to trust (due to the large volume of mentions of authorities and facts) and fear. Social tweets, which call to stay at home and embrace social distancing, also often have reassuring, comforting, and uplifting content, and thus are relatively often marked as joy. As expected, the majority of tweets with the #pandemic tag are assigned to sadness and fear. Notably, the tag #trump is most often assigned to surprise and infrequently to trust (probably reflective of public opinion due to his changing stance on COVID-19). The tags #quarantine and #yemen, once again expectedly, show a high assignment rate to fear. #NHS stands for the United Kingdom National Health Service and is dominantly assigned to trust, sadness, and joy, highlighting public emotion towards the struggles of healthcare workers, as well as gratitude from society. (The Panacea Lab tweet collection is available at https://github.com/thepanacealab/covid19_twitter.)
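The hashtag-emotion frequencies of Figure 2 can be computed with a simple aggregation over the labeled tweets; the tweets, hashtags, and labels below are hypothetical stand-ins for the labeled sample:

```python
from collections import Counter, defaultdict

# Hypothetical labeled tweets: (hashtags, top-2 assigned emotions) pairs.
tweets = [
    ({"covid19"}, {"sadness", "trust"}),
    ({"covid19", "stayhome"}, {"sadness", "fear"}),
    ({"stayhome"}, {"joy", "trust"}),
    ({"covid19"}, {"trust", "fear"}),
]

tag_total = Counter()              # tweets per hashtag
tag_emotion = defaultdict(Counter)  # (hashtag, emotion) assignment counts
for tags, emotions in tweets:
    for tag in tags:
        tag_total[tag] += 1
        for e in emotions:
            tag_emotion[tag][e] += 1

def emotion_frequency(tag, emotion):
    """Share of tweets carrying `tag` that received `emotion` as a label."""
    return tag_emotion[tag][emotion] / tag_total[tag]
```

Each matrix cell of Figure 2 corresponds to one `emotion_frequency(tag, emotion)` value.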

Related Work
As part of affective computing, various research communities are studying emotion identification models via gestures and facial expressions (Barros et al., 2015), voice (Mitsuyoshi et al., 2017), as well as user-generated text (Canales and Martínez-Barco, 2014; Aguilar et al., 2019). In particular, text-based emotion detection and mood analysis have attracted significant research focus through task challenges (Mohammad et al., 2018; Hsu and Ku, 2018), due to the abundance of user-generated content on social media and microblogging platforms that captures public mood on various events in social, political, and economic spheres (Bollen et al., 2011a; Nguyen et al., 2014; Khanpour and Caragea, 2018).

Concluding Remarks
We proposed a random walk-based model for unsupervised emotion detection in text using word associations from emotion lexicons and word co-occurrences from a general corpus. Our solution efficiently computes emotion scores at the dataset level and provides a probabilistic interpretation of the scores. We showed superior performance of our model over existing unsupervised baselines on several recent, real-world datasets. In the future, we would like to study other graph-based scoring functions to further improve performance (Boudin, 2013). In particular, we are interested in minimally-supervised representations that can apply to a range of related tasks involving emotions, such as sarcasm, stress, and insult detection, abusive language classification, and personality recognition (Xu et al., 2018).

A.1 Datasets
We provide more details on our datasets: The SemEval2018 (Mohammad et al., 2018) dataset is manually annotated with Plutchik's 8 prime emotions and three more: love, optimism, and pessimism. We keep only the 8 prime emotions in the annotation. The total size is 10.5K; multi-labeling is allowed; the maximum number of labels per tweet is 6 and the median is 2.
For the SSEC (Schuff et al., 2017) dataset, the annotation is done using Plutchik's 8 prime emotions, but the dataset is available in several versions; we used the "0.5" version, where each label is voted for by more than half of the annotators. The total size is 3.3K. The maximum number of labels per tweet is 5 and the median is 1.
In DENS (Liu et al., 2019), the passages are manually annotated using Plutchik's 8 emotions plus neutral, with trust substituted by love, since the labelers could recognize love better in a romantic context (according to Plutchik, love is a combination of trust and joy). We substitute love back with trust. Only the majority-voted annotations are preserved. The number of passages annotated with Plutchik's 8 emotions is 8K. The maximum number of labels per passage is 2; the median is 1.
TEC (Mohammad, 2012) is manually annotated using Ekman's six emotions. The total size is 20.5K. No multi-labeling is allowed.
For CrowdFlower (https://data.world/crowdflower/sentiment-analysis-in-text), we used the mapping proposed by Klinger et al. (2018) to obtain labels over Plutchik's 8 emotions from its 13 non-standard emotional categories, followed by a majority-based annotation aggregation. The total size is 31.2K. The maximum number of labels per tweet is 2 and the median is 1.
We processed all datasets using the NLTK TweetTokenizer (https://www.nltk.org/api/nltk.tokenize.html). Emoticons are preserved, and only non-stopwords with lengths greater than one character are retained.

A.2 Co-occurrence matrices
The Wiki co-occurrence matrix was obtained from a Wikipedia dump collected in February 2020 with approximately 5.2M documents. We apply term frequency and document frequency thresholds of 100 and 5, respectively, for collecting the term dictionary, and only keep edges between words that occur within a window of 5, with an edge frequency threshold of 200. The final co-occurrence matrix contains 39K words and 1.7M non-zero entries. The Twitter co-occurrence matrix was obtained from the Twitter dataset sentiment140 (Go et al., 2009) with 1.6M general tweets. Two words are considered co-occurring if they appear in the same tweet. The co-occurrence threshold is set to 10. The resulting co-occurrence matrix contains 17.7K tokens and 1.3M non-zero entries. Combined is the (unweighted) combination of the Wiki and Twitter co-occurrence matrices, with 47.7K tokens and 1.9M non-zero entries.
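A simplified sketch of windowed co-occurrence counting follows; it illustrates the general idea (counting word pairs that appear within a token window), not the exact thresholds and pipeline used to build the matrices above, and the example documents are hypothetical:

```python
from collections import Counter

def cooccurrence_counts(docs, window=5, min_edge_freq=1):
    """Count how often two distinct words appear within `window` tokens
    of each other, keeping only pairs at or above `min_edge_freq`."""
    counts = Counter()
    for tokens in docs:
        for i, w in enumerate(tokens):
            # Look at the following window - 1 tokens only, so each
            # unordered pair is counted once per occurrence.
            for v in tokens[i + 1 : i + window]:
                if w != v:
                    counts[tuple(sorted((w, v)))] += 1
    return {pair: c for pair, c in counts.items() if c >= min_edge_freq}

# Hypothetical toy corpus of two tokenized documents.
docs = [["stay", "home", "stay", "safe"],
        ["stay", "safe", "and", "healthy"]]
edges = cooccurrence_counts(docs, window=3)
```

The resulting edge counts would then be assembled into the sparse matrix A and row-normalized, as described in the main text.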

B Detailed Case Study Findings
We continue the case study of the COVID19 dataset with ESTeR using the Combined matrix and EmoLex lexicon. Recall that to obtain the emotion labeling of the tweets, we assign each tweet to the (at least, in case of ties) 2 categories with the highest score.
In Table 4, we group hashtags by the top-2 emotion labels most frequently assigned to the tweets with the corresponding hashtags. Note that the order of the emotion labels matters. For example, group 1 has sadness as the most frequent label and trust as the second most frequent label; (trust, sadness) produces a different cluster. To generate Table 4, we go through the most frequent (most popular) hashtags in descending order. Each hashtag is added to a cluster based on the top-2 emotions most frequently assigned to tweets with this hashtag. The clusters are ordered based on the maximum popularity of the hashtags they contain. We report at most the 5 most popular hashtags of each cluster. Inside each cluster, the hashtags are sorted by popularity. Table 4 shows the top-6 clusters. Since none of them have anger, disgust, or anticipation as the most frequent emotion, we also report clusters 9, 12, and 22, the highest-ranked clusters having one of these missing emotions as the most frequently associated one.
Interestingly, the (anticipation, sadness) cluster covers the #stocks hashtag. (Disgust, surprise) covers American political hashtags as well as mentions of pop artists. (Anger, surprise) covers business-related news. Unlike health-related topics, people tend to express less empathy and more discontent with COVID-19's impact on the economy. The cluster of (joy, sadness) includes the tag #love as well as country names, as these tags are often used in tweets of sympathy. The (trust, sadness) cluster consists of tweets supporting social measures, expressing sympathy for health workers, and generally uniting tweets. (Surprise, sadness) is related to US politics and market news. (Sadness, fear) covers not only COVID-19-related tags, but also the Yemen armed conflict and fake-news warnings. The top cluster, (sadness, trust), consists of general coronavirus tags; trust has a high presence due to the many comments on official information.

Table 4: Popular hashtags, grouped by the emotions most frequently assigned to the corresponding tweets.

Table 5: Tweets with the highest association ESTeR score to an emotion from COVID19.
In Table 5, for each emotion category e ∈ E, we report the top 3 tweets with the highest association ESTeR score to e. To present constructive examples, we consider tweets with at least 15 tokens. Due to the nature of the dataset, most of the tweets express emotions such as fear and sadness. However, there are still tweets labeled as joy, which contain jokes or express hope and optimism. Interestingly, the disgust label brings up comments on political news in the time of the pandemic. Surprise is represented by tweets containing a question, while trust labels tweets with mentions of officials and economics-related news.