Systematic Evaluation of a Framework for Unsupervised Emotion Recognition for Narrative Text

Identifying emotions as expressed in text (a.k.a. text emotion recognition) has received a lot of attention over the past decade. Narratives often involve a great deal of emotional expression, and so emotion recognition on narrative text is of great interest to computational approaches to narrative understanding. Prior work by Kim et al. (2010) reported the highest emotion detection performance to date on a corpus of fairy tale texts. Close inspection of that work, however, revealed significant reproducibility problems, and we were unable to reimplement Kim's approach as described. As a consequence, we implemented a framework inspired by Kim's approach, in which we carefully evaluated the major design choices. We identify the highest-performing combination, which outperforms Kim's reported performance by 7.6 F_1 points on average. Close inspection of the annotated data revealed numerous missing and incorrect emotion terms in the relevant lexicon, WordNetAffect (WNA; Strapparava and Valitutti, 2004), which allowed us to augment it in a useful way. More generally, this showed that numerous clearly emotive words and phrases are missing from WNA, which suggests that effort invested in augmenting or refining emotion ontologies could improve the performance of emotion recognition systems. We release our code and data to enable future reproducibility of this work.


Introduction
Emotion is a primary aspect of communication, and can be transmitted across many modalities, including gesture, facial expression, speech, and text. Because of this importance, automatic emotion recognition is useful for many applications, including automated narrative understanding. A narrative is "a representation of connected events and characters that has an identifiable structure, is bounded in space and time, and contains implicit or explicit messages about the topic being addressed" (Kreuter et al., 2007, p. 222), and narratives are often used to express the emotions of authors and characters, as well as to induce emotions in audiences. For many narratives (one need only consider romances such as Romeo and Juliet or the movie Titanic), it is no exaggeration to say that lacking an understanding of emotion leads to a seriously impoverished view of the meaning of the narrative.
Emotion recognition is a challenging problem on account of the complex relationship between felt emotion and linguistic expression. This includes not only standard natural language processing challenges, such as polysemous words and the difficulty of coreference resolution (Uzuner et al., 2012;Peng et al., 2019), but also emotion-specific challenges such as how context can subtly change emotional interpretations (Cowie et al., 2005). These technical challenges are exacerbated by a shortage of quality labeled data addressing this task.
There has been much prior work on emotion recognition. With regard to narrative specifically, Kim et al. (2010) reported a high-performing approach to emotion recognition on a corpus of fairy tale texts (Alm, 2008). This approach involved an unsupervised learning framework for emotion recognition in textual data, using a modified form of Ekman's psychological theory of emotion (joy, anger, fear, sadness; Ekman, 1992b). In that work, they used the WordNetAffect (WNA) and ANEW (Affective Norms for English Words) emotion lexicons to construct a semantic space. Each sentence is placed in the space using tf-idf weights for emotion words found in the lexicons. They then tested three methods for compressing the space (Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), and probabilistic Latent Semantic Analysis (pLSA)) to extract features of the constructed vector space model, reduce noise, and eliminate outliers. Finally, the framework used cosine similarity to label sentences, evaluating how similar they are to standard vectors generated from lexicon entries strongly associated with each emotion (more specifically, an extension of WNA). The best performing method was NMF, which they reported achieved an average emotion recognition F_1 of 0.733.
Close inspection of the work, however, revealed significant reproducibility problems. Despite our best efforts we were unable to reproduce results anywhere near Kim's reported performance; indeed, our best attempt yielded only roughly 0.25 F_1. This was due to several reasons. First, the paper lacked information on model hyper-parameters. Second, the paper omitted descriptions of key NMF steps, including how to identify representative features and which features should be removed before semantic space compression. Third, the paper did not explain how to adapt NMF to deal with the sparse matrices that occur in textual NMF models. Fourth, certain resources associated with WNA either were not correctly identified, or are no longer available. These omissions prevented us from reproducing their models to any degree of accuracy.
Therefore, we undertook a systematic exploration of the design space described in Kim et al. (2010). We examined the highest performing vector space compression technique reported by Kim et al. (NMF), as well as Principal Component Analysis (PCA) and Latent Dirichlet Allocation (LDA), which were reported as high-performing techniques in other work. We show that NMF indeed performs best, and we clearly explain our experimental setup, including methods for identifying relevant features and handling sparse text matrices. The PCA and NMF methods implemented in this paper are based on the work of Mairal et al. (2009) and Boutsidis and Gallopoulos (2008), respectively, which implement mechanisms that work for a large sparse matrix (in our case, 1,090 × 2,405). This work resulted in an improvement of roughly 7.6 points of F_1 over Kim's reported results. We release our code and data to facilitate future work.1
The rest of this paper is structured as follows. We briefly review psychological models of emotions, describe several key emotion language resources, and outline a number of well-known emotion recognition models (§2). We then describe our adapted unsupervised emotion recognition method, giving detailed descriptions of all steps, parameters, and resources needed (§3). We next describe the performance of our method on Alm's corpus of fairy tales (Alm, 2008), which was annotated for emotion on a per-sentence level (§4). Finally, we identify some unsolved challenges that point toward future work (§5), and summarize our contributions (§6).

1 Code and data may be downloaded from https://doi.org/10.34703/gzx1-9v95/03RERQ
Related Work

Psychological Emotion Theories

Theories of emotion go back to the ancient Greeks and Romans, and have been a recurring theme of inquiries into the nature of the human experience throughout history, including famous proposals by Charles Darwin and William James in the 19th century (Darwin and Prodger, 1998; James, 1890). Modern psychological theories of emotion may be grouped into two types: categorical and dimensional (Calvo and Mac Kim, 2013). Categorical psychological models propose discrete basic emotions, e.g., Oatley and Johnson-Laird's (1987) model with five basic emotions, several models with six basic emotions (Ekman, 1992b; Shaver et al., 1987), Parrott's (2001) model of six basic emotions arranged in a three-level tree, Panksepp's (1998) model with seven emotions, and Izard's (2007) with ten.
Finally, there are also models which combine both categorical and dimensional aspects, called hybrid models, the most prominent of which is Plutchik's wheel and cone model with eight basic emotions (Plutchik, 1980, 1984, 2001). Of all the many emotion models that have been proposed, Ekman's six-category model (anger, disgust, fear, happiness, sadness, surprise) is by far the most popular in computational approaches, partly because of its simplicity, and partly because it has been successfully applied to automatic facial emotion recognition (Zhang et al., 2018; Suttles and Ide, 2013; Ekman, 1992a,b, 1993). This is despite the fact that some researchers doubt that Ekman's model is complete, as it seems to embed a Western cultural bias (Langroudi et al., 2018). In our own review of emotion recognition systems, as discussed below, the highest performing system reported for narrative text was described by Kim et al. (2010). In that work, they used a four-label subset of Ekman's model (happiness, anger, fear, and sadness), and this is the model we adopt in this paper.

Emotion Lexicons
One of the key language resources for emotion recognition in text is an emotion lexicon, which is simply a list of words associated with emotion categories. Emotion lexicons can be used in both rule-based and machine-learning-based recognition methods. There are two types of emotion lexicons. The first is general purpose emotion lexicons (GPELs), which specify the generic sense of emotional words; GPELs sometimes express emotions as a score, and can be applied to any domain. The second is domain-specific emotion lexicons (DSELs), which capture word-emotion associations particular to a domain. A prominent GPEL is WordNetAffect (WNA; Strapparava and Valitutti, 2004), which builds upon the general WordNet database (Fellbaum, 1998). WNA classifies 280 WordNet Noun synsets into an emotion hierarchy rooted in an augmented version of Ekman's basic emotions, partially depicted in Figure 1. WordNet links an additional 1,191 Verb, Adverb, and Adjective synsets to this core Noun-focused hierarchy. These synsets represent approximately 3,500 English lemma-POS pairs.

Emotion Recognition Approaches
There have been at least one hundred papers describing approaches to emotion recognition in text (Calefato et al., 2017; Teng et al., 2007; Shaheen et al., 2014). Here we review a selection of approaches that have been applied to narrative-like or narrative-related discourse types. It is important to remember that all of these approaches use different data and different theories, often involving different numbers of labels. All things being equal, classification results usually degrade as the number of labels increases; therefore the performance of each system can only be loosely compared.

Strapparava and Mihalcea (2008) described a system for recognizing emotions in news headlines. They extracted 1,250 news headlines from a variety of news websites (such as Google News, CNN, and online newspapers) and annotated them using Ekman's model (anger, disgust, fear, joy, sadness, and surprise), splitting the data into a training set of 250 and a test set of 1,000 (this is called the SemEval-2007 dataset). They tested five approaches: WNA-PRESENCE, LSA-SINGLE-WORD, LSA-EMOTION-SYNSET, LSA-ALL-EMOTION-WORDS, and NAIVEBAYES-TRAINED-ON-BLOGS. WNA-PRESENCE, which looked for headline words listed in WNA, provided the best precision at 0.38. LSA-ALL-EMOTION-WORDS, which calculated the vector similarity between the six affect words and the LSA representation of the headline, led to the highest recall and F_1, at 0.90 and 0.176, respectively.
Aman and Szpakowicz (2008) used a Support Vector Machine (SVM) trained and tested on blog data to recognize Ekman's emotion classes, plus two additional classes: mixed emotion and no emotion. Four human judges manually annotated 1,890 sentences from automatically retrieved blogs to create the corpus.

Tokuhisa et al. (2008) described a lexicon-based emotion recognition system for Japanese. They handcrafted an emotion lexicon by identifying 349 emotion words from the Japanese Expression Evaluation (JEE) Dictionary, classified into 10 different emotions: 3 positive (happiness, pleasantness, relief) and 7 negative (fear, sadness, disappointment, unpleasantness, loneliness, anxiety, and anger). They then used this lexicon to automatically assemble a labeled corpus of 1.3M emotion-provoking (EP) "events" (defined as subordinate clauses which modify an emotional statement). They then demonstrated a two-step method for emotion recognition, starting with SVM-based coarse sentiment polarity classification (positive, negative, or neutral), followed by kNN-based classification of non-neutral instances into the appropriate fine-grained emotion classes (3 for positive, 7 for negative). They reported accuracies of between 0.5 and 0.8 for their best performing model.

Cherry et al. (2012) presented two supervised machine learning models for emotion recognition in suicide note sentences. They used the 2011 i2b2 NLP Challenge Task 2 data, which comprised 4,241 sentences in the training set and 1,883 sentences in the test set, manually annotated with 13 emotion labels. A one-classifier-per-emotion approach yielded an F_1 of 0.55, while a latent sequence model that applied multiple emotion labels per sentence achieved an F_1 of 0.53. They noted that more than 73% of their training data lacked labels, which limited the effectiveness of the training.

Bandhakavi et al. (2017) experimented with unigram mixture models (UMMs) for recognizing emotions in tweets, incident reports, news headlines, and blogs. Each corpus was annotated with a different emotion theory: 280,000 tweets with Parrott's six primary emotions (Parrott, 2001), 1,250 news headlines and 5,500 blogs with Ekman's six-emotion set, and 7,000 incident reports from the ISEAR dataset 2 labeled with a seven-emotion set. One goal of the study was to compare the utility of domain-specific emotion lexicons with general purpose emotion lexicons (DSELs vs GPELs). They found that combining DSEL lexicon words with n-grams, part-of-speech tags, and additional words from sentiment lexicons yielded the highest performance of 0.60 F_1 on the blog data.

Kim et al. (2010) reported the highest performing emotion recognition system on narrative text. Among their data was a set of 176 fairy tales whose 15,087 sentences were labeled by Alm (2008) with a four-emotion subset of Ekman's theory (anger, fear, joy, and sadness). They demonstrated an unsupervised approach, where each sentence is transformed into a vector in a space of emotion words (drawn from WNA and ANEW), and then compressed using a dimension reduction technique (NMF, LSA, or pLSA). These vectors were then compared to reference vectors computed in the same space for each of the four emotions. They reported an F_1 of 0.733 for NMF, their highest performing model. One advantage of this approach is that it is unsupervised, which means both that significant amounts of training data are not required and that all the annotated data can be used for testing. This is important because of the small size of the corpus on which the technique was tested.

Emotion Recognition Framework
We now describe an unsupervised system for emotion recognition modeled on that reported by Kim et al. (2010). While we follow the general pattern of that work, we experiment with a different set of dimension reduction methods (NMF from Lee and Seung, as well as PCA and LDA). The system takes as input the following items:

• A corpus containing n sentences S : s_1, s_2, . . . , s_n;
• A set of emotions E = {e_1, e_2, . . . , e_{l-1}, neutral} for classifying emotions into l different classes, including neutral; and,
• An emotion lexicon L : Ω → E which maps each word in the corpus ω ∈ Ω (where Ω has m terms) to an emotion e ∈ E. Each word ω is in its lemmatized form and has a specific POS.

A flowchart of the system is shown in Figure 2. The system comprises four consecutive steps. In the first step, pre-processing, the system processes the input corpus using the CoreNLP library (Manning et al., 2014) to separate the text into sentences and lemmatized tokens. The second step, vector space modeling, uses the lemmatized tokens to generate a vector for each sentence in a vector space whose dimensions correspond to the items in Ω. In the third step, noise cancellation or dimension reduction, we explored three different models (Non-negative Matrix Factorization, Latent Dirichlet Allocation, and Principal Component Analysis) to either reduce dimensions or extract features of the vector space. One of our main contributions here is to analyze and explain the effect of this step on the performance of the final emotion recognition system. Finally, the fourth step, labeling, compares the vector for each sentence with vectors for each emotion, choosing the closest emotion as the label for the sentence.
[Figure 2: System flowchart (Separate Sentences -> Tokenize/Lemmatize -> Make BoW for each sentence -> Emotion Label Extraction -> Cosine Similarity Labeling -> Calculate F_1 Measurement -> Is F_1 acceptable?)]

Augmenting WNA As mentioned before, WNA 1.1 assigns an emotion label to 1,471 synonym sets (synsets) of WordNet. This corresponds to a lexicon of nearly 3,495 affective lemma-POS pairs. Careful inspection of WNA revealed both incorrectly included and missing pairs. A substantial number of pairs were incorrectly included because they were labeled with emotions relating to a secondary affective sense, rather than their main, non-affective sense. We manually reviewed and removed these incorrect labels. Additionally, we identified missing lemma-POS pairs with the help of closely related pairs already labeled by WNA. For example, the pair glorious-JJ was missing from WNA, but is related (via the derived-from relation) to the already labeled pair glorify-VB. We manually searched for these missing relationships, adding the missing terms, as well as recursively adding their synonyms (e.g., glorious-JJ resulted in splendid, magnificent, brilliant, and superb being added as well). In total, we removed 613 and added 814 labels of different lemma-POS pairs, resulting in a final count of 4,048 lemma-POS pairs.

In general, the technique of using a fixed lexicon of emotion terms to capture highly context-dependent emotional expressions is problematic at best. Although we show here that work on improving the lexicon does improve emotion recognition results, ultimately, any technique will have to move away from a rigid lexicon-based approach to something more flexible. We plan to explore such directions in future work.
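The recursive expansion described above can be sketched as follows. The seed lexicon, derived-from map, and synonym map here are tiny hypothetical stand-ins for the corresponding WordNet/WNA relations, and `expand` is an illustrative helper, not part of our released code:

```python
# Toy sketch of the recursive lexicon expansion described above.
# All data structures below are hypothetical stand-ins for WordNet's
# derived-from/synonym links and WNA's emotion labels.
seed_lexicon = {("glorify", "VB"): "joy"}

derived_from = {("glorious", "JJ"): ("glorify", "VB")}
synonyms = {("glorious", "JJ"): [("splendid", "JJ"), ("magnificent", "JJ")]}

def expand(lexicon, derived_from, synonyms):
    expanded = dict(lexicon)
    # Propagate labels along derived-from links to unlabeled pairs.
    for pair, source in derived_from.items():
        if source in expanded and pair not in expanded:
            expanded[pair] = expanded[source]
    # Recursively add synonyms of the newly labeled pairs.
    frontier = [p for p in expanded if p not in lexicon]
    while frontier:
        pair = frontier.pop()
        for syn in synonyms.get(pair, []):
            if syn not in expanded:
                expanded[syn] = expanded[pair]
                frontier.append(syn)
    return expanded

lex = expand(seed_lexicon, derived_from, synonyms)
```

Running this on the toy data labels glorious-JJ via glorify-VB, then pulls in its synonyms, mirroring the manual process described in the text.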

Step 1: Pre-Processing For each sentence s ∈ S in the given corpus, we construct a bag of words by tokenizing the sentence and lemmatizing each word. We generate a count vector BoW_s by mapping each lemma to its count in the sentence (Ω → Z_{≥0}). We do not remove stop words, as their effects are minimized by the tf-idf computation in the next step.
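A minimal sketch of this step, with a whitespace tokenizer and crude punctuation stripping standing in for CoreNLP's tokenization and lemmatization (an assumption made for brevity; `preprocess` is an illustrative name):

```python
from collections import Counter

# Sketch of Step 1: build the bag-of-words count vector for one sentence.
# A real implementation would use CoreNLP lemmas and POS tags; here a
# lowercased, punctuation-stripped token stands in for a lemma.
def preprocess(sentence):
    tokens = sentence.lower().split()
    lemmas = [t.strip(".,!?") for t in tokens]  # crude normalization
    return Counter(lemmas)  # BoW_s: lemma -> count

bow = preprocess("The wolf howled, and the wolf ran.")
```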
Step 2: Vector Space Modeling Using the count vectors constructed in the first step, we compute a tf-idf vector for each sentence as well as a standard vector for each emotion class e ∈ E. For each sentence s_j ∈ S, we construct an m-dimensional vector V_{s_j} = (v_{1,j}, v_{2,j}, . . . , v_{m,j}), where each entry v_{i,j} is the tf-idf of the term ω_i in sentence s_j; i.e.,

v_{i,j} = TF_{i,j} × IDF_i, where TF_{i,j} = BoW_{s_j}(ω_i) and IDF_i = log( n / |{ s ∈ S : ω_i ∈ s }| ),

n is the number of sentences, and Ω = {ω_i}_{i=1}^{m}. The constructed vector space model is represented by the m × n matrix V = [v_{i,j}]. We also compute a standard vector for each emotion class, Y_e = (y_{e,ω_1}, y_{e,ω_2}, . . . , y_{e,ω_m}), where y_{e,ω_i} is 1 if the term ω_i is mapped to e by the lexicon, and 0 otherwise.
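The tf-idf and standard-vector computations above can be sketched as follows, on a toy two-sentence corpus; `tfidf_matrix` and `standard_vector` are illustrative helper names, not from our released code:

```python
import math
import numpy as np

# Sketch of Step 2: the (terms x sentences) tf-idf matrix V and the
# one-hot standard emotion vectors Y_e, following the definitions above.
def tfidf_matrix(bows, vocab):
    n = len(bows)
    df = [sum(1 for b in bows if w in b) for w in vocab]
    V = np.zeros((len(vocab), n))
    for j, b in enumerate(bows):
        for i, w in enumerate(vocab):
            if w in b:
                V[i, j] = b[w] * math.log(n / df[i])  # TF * IDF
    return V

def standard_vector(vocab, lexicon, emotion):
    return np.array([1.0 if lexicon.get(w) == emotion else 0.0 for w in vocab])

# Toy corpus: "joy" occurs only in sentence 1; "day" in both.
bows = [{"joy": 1, "day": 1}, {"day": 2}]
vocab = ["joy", "day"]
V = tfidf_matrix(bows, vocab)
Y_joy = standard_vector(vocab, {"joy": "joy"}, "joy")
```

Note that a term occurring in every sentence (here "day") gets an IDF of log(1) = 0, which is how the tf-idf weighting suppresses stop words without explicit removal.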
Step 3: Noise Cancellation or Dimension Reduction The vectors V_s and Y_e from the previous step are all m-dimensional, where m is the total number of terms in the corpus. Many terms have little or no effect on the emotion labeling of their sentences. Therefore, dimension reduction or noise cancellation techniques may improve the performance of the later emotion labeling step. Principal Component Analysis (PCA) has long been used for noise cancellation (Abdi and Williams, 2010), while Latent Dirichlet Allocation (LDA) was specifically developed for dimension reduction in natural language processing (Blei et al., 2003). Non-negative Matrix Factorization (NMF) was first introduced for noise cancellation by Lee and Seung (1999).
Step 3.1: Vector Space Decomposition We can decompose the obtained matrix V in one of the following three ways:

Non-negative Matrix Factorization (NMF):
We extract d features from the m-dimensional sentence vectors using NMF.

Principal Component Analysis (PCA):
We reduce the number of dimensions of V s vectors from m to ∆ < m.

Latent Dirichlet Allocation (LDA):
We reduce the number of dimensions of V s vectors from m to δ < m.
When using PCA or LDA we can move directly to the fourth step of the system; however, in the case of NMF, we must select important terms (Step 3.2), remove irrelevant features (Step 3.3), and reconstruct the vector space (Step 3.4).
When using NMF to decompose the vector space model, V is factorized into two matrices W_{m×d} = [w_{ij}] and H_{d×n} = [h_{ij}], both with all non-negative entries:

V ≈ W_{m×d} H_{d×n}.

Note that d is considered a hyper-parameter in this step and its numerical value can be fine-tuned by maximizing the output of the system on a development set.
The NMF factorization process produces a matrix W whose d columns each represent an m-dimensional feature extracted from the original n sentences in the corpus, and a matrix H whose d rows represent the weights of the d features in F. This decomposition is shown in Figure 3.

Figure 3: Non-negative matrix factorization (Step 3.1) to extract features of the sentence vector model V. The result of this process is given by matrices W and H: the columns of W correspond to the extracted features F_1, F_2, . . . , F_d of the model, and the rows of H are the weights of these features.
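Lee and Seung's multiplicative-update algorithm, which underlies this factorization, can be sketched as follows. The random positive initialization and the tiny test matrix are simplifications for illustration (our actual implementation follows the initialization scheme of Boutsidis and Gallopoulos, 2008), and `nmf` is an illustrative name:

```python
import numpy as np

# Sketch of Lee and Seung's multiplicative-update NMF (Step 3.1):
# factorize V (m x n) into non-negative W (m x d) and H (d x n).
def nmf(V, d, iters=500, eps=1e-9):
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, d)) + 0.1   # columns are features F_1..F_d
    H = rng.random((d, n)) + 0.1   # rows are the feature weights
    for _ in range(iters):
        # Multiplicative updates keep all entries non-negative and
        # monotonically decrease the Frobenius reconstruction error.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# A small, exactly rank-2 non-negative matrix as a stand-in for the
# sparse 1,090 x 2,405 tf-idf matrix used in the paper.
V = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0]) \
    + np.outer([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
W, H = nmf(V, d=2)
```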
Step 3.2: Term Selection For every feature F j , we identify a fraction r of terms with the highest weights as its representatives, where r is a hyper-parameter that can be fine-tuned during system optimization (r is usually less than 1%).
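The term-selection step can be sketched as follows: for each feature F_j (a column of W), mark the top ⌈r·m⌉ terms by weight as its representatives, yielding the binary indicator matrix R used later in Step 3.3. `representatives` is an illustrative name, and the r used below is far larger than the usual < 1% so the toy example is visible:

```python
import numpy as np

# Sketch of Step 3.2: select a fraction r of the highest-weighted
# terms of each feature as its representatives.
def representatives(W, r):
    m, d = W.shape
    k = max(1, int(np.ceil(r * m)))  # number of representatives per feature
    R = np.zeros_like(W)
    for j in range(d):
        top = np.argsort(W[:, j])[-k:]  # indices of the k largest weights
        R[top, j] = 1.0                 # binary indicator R_j
    return R

W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.1, 0.1]])
R = representatives(W, r=0.4)  # ceil(0.4 * 3) = 2 representatives each
```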
Step 3.3: Feature Removal In this phase we remove the ρ features that have little or no emotional relevance, where ρ is a non-negative integer hyper-parameter that can be tuned. We call a feature "emotionally irrelevant" if all of its representative terms (as selected in the previous step) are labeled as neutral by the lexicon. These features are always removed first. If ρ is less than the number of emotionally irrelevant features, we choose among them at random. If, on the other hand, the number of emotionally irrelevant features is less than ρ, we additionally eliminate features F_j in order of their overall emotional relevance, which is computed as the standard deviation of the cosine similarities between the emotion vectors Y_e obtained in Step 2 and F_j ∘ R_j (the element-wise product of F_j and R_j), where R_j is the binary indicator of whether a term is a representative of F_j, constructed from the outcome of Step 3.2. Symbolically, to quantify how emotionally relevant feature F_j is, we calculate the following standard deviation:

σ_j = std_{e ∈ E} [ cos(Y_e, F_j ∘ R_j) ].

Step 3.4: Vector Space Reconstruction In this step, the vector space model is reconstructed (V̂) after eliminating the irrelevant features. Let I denote the set of indices whose corresponding features were identified as least relevant in the previous step. Then the reconstructed vector space is:

V̂ = Σ_{j ∉ I} F_j H_j,

where F_j is the j-th column of W and H_j is the j-th row of H. Figure 4 illustrates the vector space reconstruction.
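Steps 3.3 and 3.4 can be sketched together: score each feature by the standard deviation of its cosine similarities to the emotion vectors, drop the ρ lowest-scoring features, and rebuild the vector space from the survivors. The helper names are illustrative, and the neutral-first removal rule and random tie-breaking are omitted for brevity:

```python
import numpy as np

def cos(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0 or nv == 0 else float(u @ v) / (nu * nv)

# Sketch of Steps 3.3-3.4: a low spread of similarities across the
# emotion vectors Y_e means the feature is emotionally irrelevant.
def reconstruct(W, H, R, Y, rho):
    d = W.shape[1]
    scores = [np.std([cos(Y_e, W[:, j] * R[:, j]) for Y_e in Y])
              for j in range(d)]
    keep = np.argsort(scores)[rho:]   # drop the rho least relevant features
    return W[:, keep] @ H[keep, :]    # V_hat rebuilt from surviving features

# Toy example: feature 0 aligns with one emotion vector, feature 1
# with none, so feature 1 is dropped when rho = 1.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
H = np.array([[1.0, 2.0], [3.0, 4.0]])
R = np.ones((3, 2))
Y = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
V_hat = reconstruct(W, H, R, Y, rho=1)
```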
Step 4: Labeling Finally, emotion recognition takes place by measuring the similarity between the sentence vectors V_s and the standard emotion vectors Y_e, both transformed by the previous step with the help of NMF, PCA, or LDA. The label of each sentence s is calculated by the following formula:

label(s) = argmax_{e ∈ E} similarity(V_s, Y_e),

where the similarity function is the cosine of the angle made by the two given vectors:

similarity(u, v) = (u · v) / (‖u‖ ‖v‖).
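A minimal sketch of the labeling step, assuming the sentence and emotion vectors live in the same (possibly reduced) space; `label` and the two-emotion dictionary are illustrative:

```python
import numpy as np

# Sketch of Step 4: assign each sentence the emotion whose standard
# vector is most cosine-similar to the sentence vector.
def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def label(V_s, standards):
    # standards: dict mapping emotion name -> standard vector Y_e
    return max(standards, key=lambda e: cosine(V_s, standards[e]))

standards = {"joy": np.array([1.0, 0.0]),
             "fear": np.array([0.0, 1.0])}
```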

Performance on Fairy Tale Data
We tuned and tested our system using the manually annotated dataset of fairy tales constructed by Alm (2008), which comprises 176 children's fairy tales (80 from Brothers Grimm, 77 from Hans Andersen, and 19 from Beatrix Potter) with 15,087 unique sentences (15,302 total sentences), 7,522 unique words, and 320,521 total words. These fairy tales were annotated by two annotators, each labeling the emotion and mood of each sentence as one of joy, anger, fear, sadness, or neutral, resulting in four labels per sentence. Across the sentences, only 1,090 agreed on all four non-neutral labels. Kim et al. (2010) used only these sentences to train and test their system 3, and we followed the same procedure. These sentences contain 2,405 unique term-POS pairs. The distribution of labels in the dataset is shown in the pie chart in Figure 6.

We measured the performance of our system on Alm's data. Without augmenting WNA, using the original 1,471 synsets of WNA, the F_1 score is 0.625. The performance metrics presented in Table 4 were obtained by the model using the augmented WNA. The plots in Figure 5 show the F_1 scores of various setups of the proposed model using the NMF technique for noise cancellation. Table 4 also summarizes the precision, recall, and F_1 score of our system for each of the four emotion classes, as well as its overall F_1 score, when using NMF, PCA, or LDA with different setups (values of hyper-parameters). As the table shows, the highest overall F_1 score is obtained when using NMF with (d, r, ρ) = (975, 10, 18). In this model, 209 sentences were labeled incorrectly; some challenging examples are shown in Table 3.

Unsolved Challenges and Future Work
As already discussed, one challenge for automatic emotion recognition is the context dependency of emotional semantics. For instance, I'm over the moon! is an expression of extreme happiness but does not use any explicitly happy or joyful words (or, indeed, any emotion word at all). Another obstacle is polysemous words, where a word has both emotional and non-emotional senses; recognizing which sense is being used is challenging and remains an open problem. Aside from these fundamental issues, there is a serious lack of high-quality annotated data, not just for narrative text but for all discourse types. Annotated corpora use a wide variety of sometimes incompatible emotion theories and are often poorly annotated, with low inter-annotator agreement and many errors.
Given these considerations, there are many possible directions for future work, for example:

• Reconciling emotion lexicons with the context dependency of emotion detection models using learning techniques;
• Evaluating the performance of a bag-of-words multi-layer perceptron applied to the dataset to extract emotions;
• Applying multi-label prediction to the dataset and comparing the results with this work;
• Evaluating the effect of text unit size (sentence, paragraph, story) on the accuracy of sentiment labels; i.e., would there be an advantage in grouping sentences into longer units (e.g., paragraphs) and assigning a single label to this longer unit? It seems that a sentence by itself might not always carry sufficient cues to disambiguate its emotion, but its surrounding sentences might give this context.

Table 3: Challenging examples of incorrectly labeled sentences.

Sentence | Predicted | Gold Label
They told him that their father was very ill, and that they were afraid nothing could save him. | Fear | Sadness
And in sight of the bridge! Said poor pigling, nearly crying. | Sadness | Fear
She smiled once more, and then people said she was dead. | Sadness | Joy
Then he aimed a great blow, and struck the wolf on the head, and killed him on the spot! . . . and when he was dead they cut open his body, and set Tommy free. | Anger | Joy

Contributions
We identified a high-performing approach to emotion recognition in narrative text (Kim et al., 2010) and carefully reimplemented and characterized the technique, exploring a design space of three different noise cancellation or dimension reduction techniques (NMF, PCA, and LDA) and various hyper-parameter settings. Our experiments indicated that NMF performed best, with an overall F_1 of 0.809. In the course of our investigation we clarified numerous implementation issues in the work reported by Kim et al. (2010), and made some improvements to WordNetAffect (WNA), one of the language resources used in the system, by adding new terms manually and using WordNet similarity relations. This work suggests several promising future directions for improving the work, including careful annotation of a larger corpus, and augmenting WNA or similar lexicons to provide improved coverage of emotion terms. We release our code and data to enable future work 4.