Human Needs Categorization of Affective Events Using Labeled and Unlabeled Data

We often talk about events that impact us positively or negatively. For example “I got a job” is good news, but “I lost my job” is bad news. When we discuss an event, we not only understand its affective polarity but also the reason why the event is beneficial or detrimental. For example, getting or losing a job has affective polarity primarily because it impacts us financially. Our work aims to categorize affective events based upon human need categories that often explain people’s motivations and desires: PHYSIOLOGICAL, HEALTH, LEISURE, SOCIAL, FINANCIAL, COGNITION, and FREEDOM. We create classification models based on event expressions as well as models that use contexts surrounding event mentions. We also design a co-training model that learns from unlabeled data by simultaneously training event expression and event context classifiers in an iterative learning process. Our results show that co-training performs well, producing substantially better results than the individual classifiers.


Introduction
Recent research has focused on identifying affective events in text, which are activities or states that positively or negatively affect the people who experience them. Recognizing affective events is challenging because they often appear as factual expressions whose affective polarity is implicit. For example, "I broke my arm" and "I got fired" are usually negative experiences, while "I broke a record" and "I went to a concert" are typically positive experiences. Several NLP techniques have been developed to recognize affective events, including patient polarity verb bootstrapping (Goyal et al., 2010, 2013), implicature rules (Deng and Wiebe, 2014), label propagation (Ding and Riloff, 2016), pattern-based learning (Vu et al., 2014; Reed et al., 2017), and semantic consistency optimization.
Our research aims to probe deeper and understand not just the polarity of affective events, but the reason for the polarity. Events can impact people in many ways, and understanding why an event is beneficial or detrimental is a fundamental aspect of language understanding and narrative text comprehension. Additionally, many applications could benefit from understanding the nature of affective events, including text summarization, conversational dialogue processing, and mental health therapy or counseling systems. As an illustration, a mental health therapy system can benefit from understanding why someone is in a negative state. If the triggering event for depression is "I broke my leg" then the reason is about the person's Health, but if the triggering event is "I broke up with my girlfriend" then the reason is based on Social relationships.
We hypothesize that the polarity of affective events can often be attributed to a relatively small set of human need categories. Our work is motivated by theories in psychology that explain people's motivations, desires, and overall well-being in terms of categories associated with basic human needs, such as Maslow's Hierarchy of Needs (Maslow et al., 1970) and Fundamental Human Needs (Max-Neef et al., 1991). Drawing upon these works, we propose that the polarity of affective events often arises from 7 types of human needs: PHYSIOLOGICAL, HEALTH, LEISURE, SOCIAL, FINANCIAL, COGNITION, and FREEDOM. For example, "I broke my arm" has negative polarity because it negatively impacts one's Health, "I got fired" is negative because it negatively impacts one's Finances, and "I am confused" is negative because it reflects a problem related to Cognition.
We explore this hypothesis and tackle the challenge of categorizing affective events in text with respect to these 7 human need categories. As our evaluation data, we use events extracted from personal blog posts and manually labeled with affective polarity in previous work. These affective events were subsequently annotated for the human need categories.
In this paper, we design several types of classification models that learn from both labeled and unlabeled data. First, we present supervised learning models that use lexical and embedding features for the words in event expressions, as well as models that learn from the sentence contexts surrounding mentions of event expressions. Next, we explore self-training and co-training models that exploit both labeled and unlabeled data for training. The most effective system is a co-training model that uses two classifiers with two different views in an iterative learning process: one classifier only uses the words in an event expression, and the other classifier only uses the contexts surrounding instances of an event expression. Our results show that this co-training model effectively uses unlabeled data to substantially improve results compared to classifiers trained only with labeled data, yielding gains in both precision and recall.

Related Work
Recently, there has been growing interest in recognizing the affective polarity of events. For example, Goyal et al. (2013) developed a bootstrapped learning method to learn patient polarity verbs, which impart affective polarities to their patients. Other work designed methods to extract verb expressions that imply negative opinions from reviews. Rashkin et al. (2016) proposed connotation frames to capture the connotative polarities of a verb's arguments from the perspectives of the writer and other event entities. Li et al. (2014) proposed a bootstrapping approach to extract major life events from tweets using congratulation and condolence speech acts; most of these major life events are affective, although their work did not identify polarity. Another group of researchers has studied +/-effect events (Deng et al., 2013; Choi and Wiebe, 2014), which they previously called benefactive/malefactive events. Their work mainly focused on inferring implicit opinions through implicature rules (Deng and Wiebe, 2014, 2015). Ding and Riloff (2016) designed an event context graph model to identify affective events using label propagation. Reed et al. (2017) demonstrated that automatically acquired patterns could benefit the recognition of first-person affective sentences. Most recently, a semantic consistency model was developed to induce a large set of affective events using three types of semantic relations in an optimization framework. (We use the annotated affective event data set from that work.) All of this previous work only identifies affective events and their polarities. In contrast, our work aims to identify the reason for the affective polarity of an event.
The human need categories are inspired by two prior theories. The first is Maslow's Hierarchy of Needs (Maslow et al., 1970), which was developed to study people's motivations and personalities. The second is Fundamental Human Needs (Max-Neef et al., 1991), which was developed to help communities identify their strengths and weaknesses. The human need categories are also related to the concept of "goals", proposed by Schank and Abelson (1977) for understanding narrative stories. Goals can be very specific to a character in a particular narrative, but many types of goals originate from universal needs and desires shared by most people (Max-Neef et al., 1991). Our work is also related to research on wish detection (Goldberg et al., 2009), desire fulfillment (Chaturvedi et al., 2016), and modeling protagonist goals and desires (Rahimtoroghi et al., 2017).
Self-training is a semi-supervised learning method that improves performance by exploiting unlabeled data. It has been successfully used in many NLP applications, such as information extraction (Ding and Riloff, 2015) and syntactic parsing (McClosky et al., 2006). Co-training (Blum and Mitchell, 1998) uses both labeled and unlabeled data to train models that have two different views of the data. Co-training has previously been used for many NLP tasks, including spectral clustering (Kumar and Daumé, 2011), word sense disambiguation (Mihalcea, 2004), coreference resolution (Phillips and Riloff, 2002), and sentiment analysis (Wan, 2009; Xia et al., 2015).


The AffectEvent Dataset

Our work uses an event dataset created in prior research on identifying affective events, which we will refer to as the AffectEvent dataset. We briefly describe this data and the human need category annotations that we added on top of it. The AffectEvent dataset contains events extracted from a personal story corpus that was created by applying a personal story classifier (Gordon and Swanson, 2009) to 177 million blog posts; the resulting corpus contains 1,383,425 personal story blog posts. Stanford CoreNLP was used for POS and NER tagging, and SyntaxNet (Andor et al., 2016) for parsing. Each event is represented using a frame-like structure to capture the meanings of different types of events, with four components: <Agent, Predicate, Theme, PP>. The Predicate is a simple verb phrase corresponding to an action or state. The Agent is a named entity, nominal, or pronoun, and is extracted using syntactic heuristics rather than semantic role labeling. We use "Theme" loosely to allow an NP or adjective to fill this role. The PP component is composed of a preposition and an NP. All words in the event are lemmatized, and active and passive voices are normalized to the same representation. The original work provides more details of the event representation.

Human Need Category Annotations
Affective events impact people in a positive or negative way for a variety of reasons. We hypothesized that the polarity of most affective events arises from the satisfaction or violation of basic human needs. Psychologists have developed theories that explain people's motivations, desires, and overall well-being in terms of categories associated with basic human needs, such as Maslow's Hierarchy of Needs (Maslow et al., 1970) and Fundamental Human Needs (Max-Neef et al., 1991). Based upon this work, we defined 7 human need categories, briefly described below.

Physiological Needs maintain our body's basic functions (e.g., air, food, water, sleep).
Health Needs are to be physically healthy and safe.
Leisure Needs are to have fun, to be relaxed, to have leisure time, and to appreciate and enjoy beauty.
Social Needs are to have good social relations (e.g., family, friendship), to have good self-worth and self-esteem, and to be respected by others.
Financial Needs are to obtain and protect financial income, to acquire and maintain valuable possessions, and to have a job and satisfying work.
Cognition Needs are to obtain skills, information, and knowledge, to receive education, to improve one's intelligence, and to mentally process information correctly.
Freedom Needs are the ability to move or change position freely and to access things or services in a timely manner.

We also defined two additional categories: one for event expressions that represent explicit emotions and opinions (Emotions/Sentiments/Opinions) and one for events that do not fall into any other category (None of the Above).
We added manual annotations for human need categories on top of the manually annotated positive and negative affective events in the AffectEvent dataset. Three people were asked to assign a human need category label to each of the 559 affective events in the AffectEvent test set. The annotators achieved good pairwise inter-annotator agreement (κ ≥ .65) on this task; the individual Cohen's kappa scores were κ=.69, κ=.66, and κ=.65. We assigned a single category to each event because most of the affective events fell into just one category in our preliminary study, even though some cases could legitimately be argued to belong to multiple categories. We discuss this issue further in Section 5.4. The distribution of human need categories is shown in Table 1. Since very few affective events were found to belong to the Freedom category, it was merged into None. Additionally, 17 events received three different labels from the annotators, so they were discarded. The majority label was then assigned to the remaining events, yielding a gold standard data set of 542 affective events with human need category labels. Some annotated examples are shown in Table 2. A more detailed description of the human need category definitions, data set, and manual annotation effort is provided in prior work. This data set is freely available for other researchers to use.
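The pairwise agreement scores above can be reproduced from raw label sequences. The sketch below computes Cohen's kappa in plain Python; the toy label lists in the usage comment are illustrative, not the actual annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators agree on 3 of 4 hypothetical events.
print(cohens_kappa(["Social", "Health", "Social", "Leisure"],
                   ["Social", "Health", "Leisure", "Leisure"]))
```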
In the next section, we present classification models designed to tackle this human needs categorization task.

Categorizing Human Needs with Labeled and Unlabeled Data
Automatically categorizing affective events in text based on human needs is a new task, so we investigated several types of approaches. First, we designed supervised classifiers to categorize affective events based upon the words in the event expressions, which we refer to as Event Expression Classifiers. We explored lexical features, word embedding features, and semantic category features, along with several types of machine learning algorithms. Our task is to determine the human need category of an affective event based on the meaning of the event itself, independent of any specific context. However, we hypothesized that the contexts around instances of the events could also provide valuable information for inferring human need categories, so we also designed Event Context Classifiers that use the sentence contexts around event mentions as features.
Our gold standard data set is relatively small, so supervised learning that relies entirely on manually labeled data may not have sufficient coverage to perform well across the human need categories. However, the AffectEvent dataset contains a very large set of events that were extracted from the same blog corpus but not manually labeled with affective polarity. Consequently, we explored two weakly supervised learning methods to exploit this large set of unlabeled events. First, we tried self-training to iteratively improve the event expression classifier. Second, we designed a co-training model that takes advantage of both an event expression classifier and an event context classifier to learn from the unlabeled events. These two types of classifiers provide complementary views of an event, so new instances labeled by one classifier can serve as valuable new training data for the other, in an iterative learning cycle.

Event Expression Classifiers
The most obvious approach is to use the words in event expressions as features for recognizing human need categories (e.g., {ear, be, better} for the event <ear, be, better>). We experimented with both lexical (string) features and pre-trained word embedding features. For the latter, we used GloVe (Pennington et al., 2014) vectors (200d) pretrained on 27B tweets. For each event expression, we compute its embedding as the average of its words' embeddings.
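A minimal sketch of this averaging step, using tiny hypothetical 3-dimensional vectors in place of the 200-dimensional GloVe embeddings:

```python
# Sketch: represent an event expression as the average of its words'
# embeddings. The 3-d vectors below are made up for illustration.
toy_embeddings = {
    "ear":    [0.1, 0.3, -0.2],
    "be":     [0.0, 0.1,  0.4],
    "better": [0.5, -0.1, 0.2],
}

def event_embedding(words, embeddings):
    """Average the embeddings of the event's words (skipping OOV words)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Each output dimension is the mean of that dimension over the word vectors.
print(event_embedding(["ear", "be", "better"], toy_embeddings))
```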
We also designed semantic features using the lexical categories in the LIWC lexicon (Pennebaker et al., 2007) to capture a more general meaning for each word. LIWC is a dictionary of words associated with "psychologically meaningful" lexical categories, some of which are directly relevant to our task, such as AFFECTIVE, SO-CIAL, COGNITIVE, and BIOLOGICAL PROCESS. We identify the LIWC category of the head word of each phrase in the event representation and use them as Semantic Category features.
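A sketch of how such features might be extracted; the mini-lexicon below is a hypothetical stand-in for the LIWC dictionary, not its actual contents:

```python
# Sketch of Semantic Category features: look up the head word of each
# event component in a LIWC-style lexicon of word-to-category mappings.
MINI_LEXICON = {          # hypothetical, heavily abridged stand-in for LIWC
    "friend": "SOCIAL",
    "think":  "COGNITIVE",
    "sick":   "BIOLOGICAL",
    "happy":  "AFFECTIVE",
}

def semcat_features(head_words, lexicon=MINI_LEXICON):
    """Return the set of lexicon categories found among the head words."""
    return {lexicon[w] for w in head_words if w in lexicon}

print(semcat_features(["friend", "be", "sick"]))
```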
We experimented with three types of supervised classification models: logistic regression (LR), support vector machines (SVM), and recurrent neural network (RNN) classifiers. One advantage of the RNN is that it considers word order in the event expression, which can be important. In our experiments, we used the scikit-learn implementation (Pedregosa et al., 2011) for the LR classifier and LIBSVM (Chang and Lin, 2011) with a linear kernel for the SVM classifier. For the RNN, we used the example LSTM implementation from the Keras (Chollet et al., 2015) GitHub repository, which was developed to build a sentiment classifier. We used the default parameters in our experiments.

Event Context Classifiers
The event dataset was originally extracted from a large collection of blog posts, which contain many instances of the events in different sentences. We hypothesized that the contexts surrounding instances of an event can also provide strong clues about the human need category associated with the event. Therefore, we also created Event Context Classifiers to exploit the sentence contexts around event mentions. We explored several designs for event context classifiers, which are explained below.
Context SentBOW : For each event in the training set, we first collect all sentences mentioning this event and assign the event's human need category as the label for each sentence. Each sentence is then used as a training instance for the event context classifier. We use a bag-of-words representation for each sentence.
Context SentEmbed : This variation labels sentences exactly the same way as the previous model. But each sentence is represented as a dense embedding vector, which is computed as the average of the embeddings for each word in the sentence. We used GloVe (Pennington et al., 2014) vectors (200d) pretrained on 27B tweets.
Context AllBOW : Instead of treating each sentence as a training instance, this model aggregates all of the sentences that mention the same event to create one giant context for the event. Each event then corresponds to one training instance, represented using bag-of-words features.
Context AllEmbed : This variation aggregates the sentences that mention an event exactly like the previous model. But each sentence is represented as a dense embedding vector. First, we compute an embedding vector for each sentence as the average of the embeddings of its words. Then we compute a single context embedding by averaging all of the sentence embeddings.
In the data, some events appear in many sentences, while others appear in just a few sentences. To maintain balance, we randomly sample 10 sentences for each event to use as its contexts.
To predict the human need category of an event, we first apply the event context classifier to contexts that mention the event, which produces a probability distribution over the human need categories. For each category, we compute its mean probability. Finally, we assign the event with the human need category that has the highest mean probability (i.e. argmax).
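The prediction step can be sketched as follows, with stubbed per-sentence distributions in place of real classifier output and an abridged category set:

```python
# Sketch: predict an event's category from its context sentences. Each
# context sentence gets a probability distribution from the context
# classifier (hard-coded here); we average them and take the argmax.
CATEGORIES = ["Health", "Social", "Finance"]   # abridged category set

def predict_from_contexts(sentence_distributions):
    """Average the per-sentence distributions; return highest-mean category."""
    n = len(sentence_distributions)
    mean = [sum(d[i] for d in sentence_distributions) / n
            for i in range(len(CATEGORIES))]
    return CATEGORIES[mean.index(max(mean))]

dists = [[0.6, 0.3, 0.1],    # stubbed classifier output for sentence 1
         [0.2, 0.5, 0.3],    # sentence 2
         [0.5, 0.4, 0.1]]    # sentence 3
print(predict_from_contexts(dists))   # Health has the highest mean (~0.43)
```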

Self-Training the Event Expression Classifier
Our labeled data set is relatively small, but as mentioned previously, the AffectEvent dataset contains a large set of unlabeled events as well. So we designed a self-training model to try to iteratively improve the event expression classifier by exploiting the unlabeled event data. The self-training process works as follows. Initially, the event expression classifier is trained using the manually labeled events. Then the classifier is applied to the unlabeled events and assigns a human need category to each event with a confidence value. For each human need category, we select the unlabeled event that has been assigned to that category with the highest confidence. Therefore, each category will have one additional labeled event at each iteration. The newly labeled events are added to the labeled data set, and the classifier is re-trained for the next iteration.
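One self-training iteration can be sketched as below; `classify` stands in for the trained event expression classifier, and the event and category names in the test usage are hypothetical:

```python
# Sketch of one self-training iteration: for each category, pull in the
# unlabeled event the classifier assigns to that category most confidently.
# `classify` is a stand-in returning (predicted_category, confidence).
def self_training_step(labeled, unlabeled, classify):
    best = {}   # category -> (confidence, event)
    for event in unlabeled:
        category, confidence = classify(event)
        if category not in best or confidence > best[category][0]:
            best[category] = (confidence, event)
    # Move each category's most confident event into the labeled set.
    for category, (_, event) in best.items():
        labeled.append((event, category))
        unlabeled.remove(event)
    return labeled, unlabeled
```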

Co-Training with Event Expression and Event Context Classifiers
The sentence contexts in which an event appears contain information complementary to the event expression itself, so we designed co-training models that exploit these complementary classifiers to iteratively learn from unlabeled data. Figure 1 shows the architecture of our co-training model. Initially, an event expression classifier and an event context classifier are independently trained on the manually labeled training data. Each classifier is then applied to the large collection of unlabeled events E U . For each human need category, we then select the event that has been assigned to the category with the highest confidence value as a new instance to label. Consequently, each category receives two additional labeled events at each iteration, one from the event expression classifier and one from the event context classifier. Both sets of newly labeled events are then added to the labeled set E L , and each classifier is re-trained on the expanded set of labeled data. Because the classifiers have different views of the events, the instances labeled by one classifier serve as fresh training instances for the other, unlike self-training with a single classifier, which learns entirely from its own predictions. The following section describes the co-training algorithm in more detail.

The Co-Training Algorithm
Our co-training algorithm is shown in Algorithm 1. Its input consists of the set of labeled events E L and the set of unlabeled events E U . Each event is associated with both an event expression and the set of sentences in which it occurs in the blog corpus.
For each iteration, the event expression classifier is first trained using the labeled events E L with the event expression view. Then, we construct an event context view X con for each event in the labeled set E L . The context sentences are used differently depending on the type of context model (described in Section 4.2). An event context classifier is then trained using the context view X con . Both classifiers are then independently applied to the unlabeled events E U . For each human need category, each classifier selects one event to label based on its most confident prediction. All of the newly labeled events are then added to the labeled training set E L , and the process repeats.

Algorithm 1 The Co-Training Algorithm
Input: labeled events E L , unlabeled events E U
1: while the stopping criterion is not met do
2:    Construct expression view (X exp ) of E L
3:    Train the event expression classifier on X exp
4:    Construct context view (X con ) of E L
5:    Train the event context classifier on X con
6:    Apply the event expression classifier to E U and select new labeled events (E exp )
7:    Apply the event context classifier to E U and select new labeled events (E con )
8:    Update labeled events: E L = E L ∪ E exp ∪ E con
9: end while


Prediction with Co-Trained Classifiers

The co-training process simultaneously trains two classifiers, so here we explain how we use the resulting classifiers after co-training has finished. For each event e in the test set, we apply both the event expression classifier and the event context classifier, each of which produces a probability distribution over the human need categories. We explore two methods to combine the two distributions for each test event: (1) sum, where the final probability vector p(e) is the element-wise sum of the two predicted probability vectors; and (2) product, where p(e) is the element-wise product of the two vectors. The final probability vector is then normalized so that the probabilities over all classes sum to 1. Finally, we predict the event's human need category as the one with the highest probability.
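A sketch of the two combination strategies, assuming an abridged three-category set and made-up probability vectors:

```python
# Sketch: combine the two classifiers' probability distributions for a test
# event by element-wise sum or product, then renormalize and take the argmax.
CATEGORIES = ["Health", "Social", "Finance"]   # abridged category set

def combine(p_expression, p_context, mode="prod"):
    if mode == "sum":
        combined = [a + b for a, b in zip(p_expression, p_context)]
    else:   # "prod"
        combined = [a * b for a, b in zip(p_expression, p_context)]
    total = sum(combined)
    return [c / total for c in combined]   # renormalize to sum to 1

p_exp = [0.5, 0.3, 0.2]   # event expression classifier (made-up)
p_con = [0.2, 0.6, 0.2]   # event context classifier (made-up)
p = combine(p_exp, p_con, mode="prod")
print(CATEGORIES[p.index(max(p))])   # prints Social
```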

Evaluation
We conducted experiments to evaluate the methods described in Section 4. For all experiments, results are reported based on 3-fold cross-validation over the 542 affective events manually labeled with human need categories, and we report the average over the three folds. For development, we used a distinct set of events labeled during our preliminary studies. We did not tune any of the models, using only their default parameter settings. We present results in terms of precision, recall, and F1 score, macro-averaged over the human need categories.

Performance of Event Expression Classifiers

Table 4 shows the results for the event expression classifiers. We first evaluated the ability of the LIWC lexicon (Pennebaker et al., 2007) to label the event expressions. We manually aligned the relevant LIWC categories with our human need categories, as shown in Table 3. We then labeled each event by identifying the human need category of each word in the event phrase and assigning the most frequent category to the event. If no words were assigned to our categories, we labeled the event as None. The top row of Table 4 shows that LIWC achieved 39% recall but only 47.7% precision. The reason is that some LIWC categories are more general than our corresponding categories. For example, the words "abandon" and "damage" belong to LIWC's Affect category (corresponding to our Emotion category), but under our definitions the event "my house was damaged" actually belongs to the Finance category. The overly general Emotion category thus leads to low precision for that class.

The LR and SVM rows in Table 4 show the performance of the logistic regression (LR) and support vector machine (SVM) classifiers. We evaluated classifiers with bag-of-words features (BOW) and with event embedding features (Embed), computed as the average of the embeddings of all words in the event expression. We also tried adding semantic category features from LIWC to each feature set, denoted as +SemCat. The results show that the Embed features performed best for both the LR and SVM classifiers. Adding the SemCat features improved upon the bag-of-words representations, but not the embeddings.

The last two rows of Table 4 show the performance of two RNN classifiers, one using lexical words as input (RNN Words ) and one using pre-trained word embeddings as input (RNN EmbedSeq ). The RNN EmbedSeq system takes the sequence of word embeddings as input rather than their average. As with the other classifiers, the word embedding representation performed best, achieving an F1 score of 54.4%, comparable to that of the LR Embed system. However, the RNN's precision was only 58%, compared to 64.2% for the logistic regression classifier. Overall, we concluded that the logistic regression classifier with event embedding features (LR Embed ) achieved the best performance, given its F1 score (54.8%) and higher precision (64.2%).

Performance of Event Context Classifiers

Table 5 shows the performance of the event context classifiers described in Section 4.2. Since logistic regression worked best in the previous experiments, we only evaluated logistic regression classifiers in the remaining experiments. The results show that using each context sentence as an individual training instance (Context SentBOW and Context SentEmbed ) substantially outperformed the classifiers that merged all context sentences into a single training instance (Context AllBOW and Context AllEmbed ). Overall, the best performing system, Context SentEmbed , achieved an F1 score of 44.3% with 59.1% precision.

It is worth noting that the precision of the best context classifier was only 5% below that of the best event expression classifier, although there was a 10% difference in their recall. Since the two classifiers achieved roughly similar levels of precision and represent complementary views of events, a co-training framework seemed like a logical way to use them together to gain additional benefit from unlabeled event data. We also created a classifier that combined event expression features and event context features together, but combining them did not improve performance.
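The lexicon baseline described above can be sketched as follows; the word-to-category mapping shown is a tiny hypothetical stand-in for the actual LIWC alignment of Table 3:

```python
# Sketch of the lexicon baseline: map each event word to a human need
# category via an aligned lexicon, then label the event with its most
# frequent category, falling back to None when no word matches.
from collections import Counter

ALIGNED_LEXICON = {   # hypothetical stand-in for the LIWC alignment
    "sick": "Health", "doctor": "Health",
    "money": "Finance", "job": "Finance",
    "friend": "Social",
}

def lexicon_baseline(event_words):
    cats = [ALIGNED_LEXICON[w] for w in event_words if w in ALIGNED_LEXICON]
    if not cats:
        return "None"
    return Counter(cats).most_common(1)[0][0]

print(lexicon_baseline(["i", "lose", "my", "job"]))     # prints Finance
print(lexicon_baseline(["i", "go", "to", "concert"]))   # prints None
```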

Performance of Self-Training and Co-Training Models

In this section, we evaluate the weakly supervised self-training and co-training methods that additionally use unlabeled data. To keep the number of unlabeled events manageable, we only used events in the AffectEvent dataset with frequency ≥ 100, yielding an unlabeled set of 23,866 events. We used the best performing event expression classifier (LR Embed ) in these models, and the co-training framework also includes the best performing event context classifier (Context SentEmbed ). We experimented with the sum and product variants for co-training (described in Section 4.4.2), denoted CoTrain sum and CoTrain prod . We ran both the self-training and co-training methods for 20 iterations.

Figure 2 tracks the performance of the self-training and co-training models after each iteration, in terms of F1 score. The flat line shows the F1 score for the best classifier that uses only labeled data (LR Embed ). Both types of models yield performance gains from iteratively learning with the unlabeled data, but the co-training models perform substantially better than the self-training model. Even after just 5 iterations, co-training achieves an F1 score over 58%, and by 20 iterations performance improves to over 60%. Table 6 shows the results for these models after 20 iterations, which was an arbitrary stopping criterion, and after 17 iterations, which happened to produce the best results for all three systems. The first two rows show the results of the best performing event context classifier (Context SentEmbed ) and the best performing event expression classifier (LR Embed ) from the previous experiments, for comparison. Table 6 shows that after 20 iterations, the CoTrain prod model performed best, yielding an F1 score of 61% compared to 54.8% for the LR Embed model. Furthermore, we see gains in both recall and precision.
All three systems performed best after 17 iterations, so we also show those results to give an idea of the additional gains that would be possible with an optimal stopping criterion. Our data set was small, so we did not feel we had enough data to fine-tune parameters, but we see the potential to further improve performance given additional tuning data. Table 7 shows a breakdown of performance across the individual human need categories for two models: the best event expression classifier and the best co-training model (CoTrain prod after 17 iterations). The co-training model outperformed the LR Embed model on every category. Co-training improved performance the most for the Finance and Cognition categories, yielding F1 score gains of +12% and +16%, respectively, and notably improving both recall and precision.

Analysis
We manually examined our system's predictions to better understand its behavior. We found that most of the correctly classified Physiological events were related to food, while the correctly classified Cognition events were primarily about learning and understanding. Our method missed many events for the Health, Finance, and Cognition classes. For Health, many medical symptoms were not recognized, such as "my face looks pale" and "I puked". For Finance, the system missed events related to possessions (e.g., "engine stopped running" and "my clock is wrong") and jobs (e.g., "I went to resign").

We also took a closer look at which categories were confused with other categories. Figure 3 shows the confusion matrix between CoTrain prod and the gold annotations. Each cell shows the total number of confusions across the 3 folds of cross-validation. The category names are abbreviated as Physiological (Phy), Health (Hlth), Leisure (Leis), Social (Socl), Finance (Fnc), Cognition (Cog), and Emotion (Emo). #Tot denotes the total number of events in each row or column.

Pred. \ Gold   Phy  Hlth  Leis  Socl  Fnc  Cog  Emo  None  #Tot
Phy             13    1     0     0    1    0    0     2    17
Hlth             1   26     1     0    1    1    4     8    42
Leis             1    1    48     4    0    1    4    10    69
Socl             0    6     4    84    2    3   10    11   120
Fnc              1    0     2     0   13    0    1     5    22
Cog              0    0     0     0    0   12    1     2    15
Emo              1    5    12    12    3    1   91    16   141
None             2   13     8     8    9    8   17    51   116
#Tot            19   52    75   108   29   26  128   105   542

The co-training model had difficulty distinguishing the None category from the other classes, presumably because None has no semantics of its own but is used for affective events that do not belong to any other category. We also see that the system often confuses Emotion with Leisure and Social. This happens because many event expressions contain words that refer to emotions. Our guidelines instructed annotators to focus on the event and assign the Emotion label only when no event is described beyond an emotion (e.g., "I was thrilled"). Consequently, the gold label of "I love journey" is Leisure and "I'm worried about my mom" is Social, but both were classified by the system as Emotion. In future work, it may be advantageous to allow event expressions to be labeled with both an explicit Emotion and a human need category based on the target of the emotion.

Conclusions
In this work, we introduced a new challenge: recognizing the reason for the affective polarity of events in terms of basic human needs. We designed four types of classification methods to categorize affective events according to human need categories, exploiting both labeled and unlabeled data. We first evaluated event expression and event context classifiers trained using only labeled data. We then designed self-training and co-training methods to additionally exploit unlabeled data. A co-training model that simultaneously trains event expression and event context classifiers produced substantial performance gains over the individual models. However, performance on the human need categories still has substantial room for improvement. In future work, obtaining more human annotations will help build a better human needs categorization system. In addition, applying and analyzing the human needs of affective events in narrative stories and conversations is a fruitful and interesting direction for future research.