Minimally Supervised Learning of Affective Events Using Discourse Relations

Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations. Our method is simple and only requires a very small seed lexicon and a large raw corpus. Our experiments using Japanese data show that our method learns affective events effectively without manually labeled data. It also improves supervised learning results when labeled data are small.


Introduction
Affective events (Ding and Riloff, 2018) are events that typically affect people in positive or negative ways. For example, getting money and playing sports are usually positive to the experiencers; catching cold and losing one's wallet are negative. Understanding affective events is important to various natural language processing (NLP) applications such as dialogue systems (Shi and Yu, 2018), question-answering systems (Oh et al., 2012), and humor recognition (Liu et al., 2018). In this paper, we work on recognizing the polarity of an affective event that is represented by a score ranging from −1 (negative) to 1 (positive).
Learning affective events is challenging because, as the examples above suggest, the polarity of an event is not necessarily predictable from its constituent words. Combined with the unbounded combinatorial nature of language, the non-compositionality of affective polarity entails the need for large amounts of world knowledge, which can hardly be learned from small annotated data.
In this paper, we propose a simple and effective method for learning affective events that only requires a very small seed lexicon and a large raw corpus. As illustrated in Figure 1, our key idea is that we can exploit discourse relations (Prasad et al., 2008) to efficiently propagate polarity from seed predicates that directly report one's emotions (e.g., "to be glad" is positive). Suppose that events $x_1$ and $x_2$ are in the discourse relation of CAUSE (i.e., $x_1$ causes $x_2$). If the seed lexicon suggests $x_2$ is positive, $x_1$ is also likely to be positive because it triggers the positive emotion. Likewise, if $x_2$ is known to be negative, $x_1$ is likely to be negative. Similarly, if $x_1$ and $x_2$ are in the discourse relation of CONCESSION (i.e., $x_2$ in spite of $x_1$), the reverse of $x_2$'s polarity can be propagated to $x_1$. Even if $x_2$'s polarity is not known in advance, we can exploit the tendency of $x_1$ and $x_2$ to be of the same polarity (for CAUSE) or of the reverse polarity (for CONCESSION), although the heuristic is not exempt from counterexamples. We transform this idea into objective functions and train neural network models that predict the polarity of a given event.
We trained the models using a Japanese web corpus. Given the minimum amount of supervision, they performed well. In addition, the combination of annotated and unannotated data yielded a gain over a purely supervised baseline when labeled data were small.

Related Work
Learning affective events is closely related to sentiment analysis. Whereas sentiment analysis usually focuses on the polarity of what is described (e.g., movies), we work on how people are typically affected by events. In sentiment analysis, much attention has been paid to compositionality. Word-level polarity (Takamura et al., 2005; Wilson et al., 2005; Baccianella et al., 2010) and the roles of negation and intensification (Reitan et al., 2015; Wilson et al., 2005; Zhu et al., 2014) are among the most important topics. In contrast, we are more interested in recognizing the sentiment polarity of an event that pertains to commonsense knowledge (e.g., getting money and catching cold). Label propagation from seed instances is a common approach to inducing sentiment polarities. While Takamura et al. (2005) and Turney (2002) worked on word- and phrase-level polarities, Ding and Riloff (2018) dealt with event-level polarities. Takamura et al. (2005) and Turney (2002) linked instances using co-occurrence information and/or phrase-level coordinations (e.g., "A and B" and "A but B"). We shift our scope to event pairs that are more complex than phrase pairs, and consequently exploit discourse connectives as event-level counterparts of phrase-level conjunctions.

Figure 1: An overview of our method. We focus on pairs of events, the former events and the latter events, which are connected with a discourse relation, CAUSE or CONCESSION. Dropped pronouns are indicated by brackets in English translations. We divide the event pairs into three types: AL, CA, and CO. In AL, the polarity of a latter event is automatically identified as either positive or negative, according to the seed lexicon (the positive word is colored red and the negative word blue). We propagate the latter event's polarity to the former event. The same polarity as the latter event is used for the discourse relation CAUSE, and the reversed polarity for CONCESSION. In CA and CO, the latter event's polarity is not known. Depending on the discourse relation, we encourage the two events' polarities to be the same (CA) or reversed (CO). Details are given in Section 3.2.

Ding and Riloff (2018) constructed a network of events using word embedding-derived similarities. Compared with this method, our discourse relation-based linking of events is much simpler and more intuitive.
Some previous studies made use of document structure to understand sentiment. Shimizu et al. (2018) proposed a sentiment-specific pretraining strategy using unlabeled dialog data (tweet-reply pairs). Kaji and Kitsuregawa (2006) proposed a method of building a polarity-tagged corpus (ACP Corpus). They automatically gathered sentences that had positive or negative opinions, utilizing HTML layout structures in addition to linguistic patterns. Our method depends only on raw texts and thus has wider applicability.

Polarity Function
Our goal is to learn the polarity function $p(x)$, which predicts the sentiment polarity score of an event $x$. We approximate $p(x)$ by a neural network with the following form:
$$p(x) = \tanh(\mathrm{Linear}(\mathrm{Encoder}(x)))$$
Encoder outputs a vector representation of the event $x$. Linear is a fully-connected layer and transforms the representation into a scalar. tanh is the hyperbolic tangent and transforms the scalar into a score ranging from −1 to 1. In Section 4.2, we consider two specific implementations of Encoder.
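The composition of Encoder, Linear, and tanh described above can be illustrated with a minimal sketch. The `encode` function below is a hypothetical toy stand-in (a normalized character-bucket vector), not the BiGRU or BERT encoder used in the paper, and the weights are arbitrary illustrative values rather than learned parameters.

```python
import math
from typing import List, Sequence


def encode(tokens: Sequence[str], dim: int = 4) -> List[float]:
    """Hypothetical toy Encoder: maps an event to a fixed-size unit vector.

    ASSUMPTION: this is only a placeholder for a real sequence encoder.
    """
    vec = [0.0] * dim
    for tok in tokens:
        for ch in tok:
            vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def linear(vec: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Fully-connected layer that maps the representation to a scalar."""
    return sum(w * v for w, v in zip(weights, vec)) + bias


def polarity(tokens: Sequence[str], weights: Sequence[float], bias: float) -> float:
    """p(x) = tanh(Linear(Encoder(x))): a score between -1 and 1."""
    return math.tanh(linear(encode(tokens, len(weights)), weights, bias))


# Arbitrary illustrative parameters (not trained values).
weights = [0.5, -1.0, 2.0, 0.25]
score = polarity(["get", "money"], weights, bias=0.1)
print(score)
```

Because tanh squashes any real-valued scalar into the interval between −1 and 1, the output can be read directly as a polarity score regardless of which encoder produces the representation.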

Discourse Relation-Based Event Pairs
Our method requires a very small seed lexicon and a large raw corpus. We assume that we can automatically extract discourse-tagged event pairs, $(x_{i1}, x_{i2})$ ($i = 1, \cdots$), from the raw corpus. We refer to $x_{i1}$ and $x_{i2}$ as former and latter events, respectively. As shown in Figure 1, we limit our scope to two discourse relations: CAUSE and CONCESSION.
The seed lexicon consists of positive and negative predicates. If the predicate of an extracted event is in the seed lexicon and does not involve complex phenomena like negation, we assign the corresponding polarity score (+1 for positive events and −1 for negative events) to the event. We expect the model to automatically learn complex phenomena through label propagation. Based on the availability of scores and the types of discourse relations, we classify the extracted event pairs into the following three types.

AL (Automatically Labeled Pairs)
The seed lexicon matches (1) the latter event but (2) not the former event, and (3) their discourse relation type is CAUSE or CONCESSION. If the discourse relation type is CAUSE, the former event is given the same score as the latter. Likewise, if the discourse relation type is CONCESSION, the former event is given the opposite of the latter's score. They are used as reference scores during training.

CA (CAUSE Pairs)
The seed lexicon matches neither the former nor the latter event, and their discourse relation type is CAUSE. We assume the two events have the same polarities.

CO (CONCESSION Pairs)
The seed lexicon matches neither the former nor the latter event, and their discourse relation type is CONCESSION. We assume the two events have the reversed polarities.
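The three-way classification of event pairs described above can be sketched as follows. The `Event` type, the two-word `SEED` lexicon, and the function names are hypothetical and chosen only for illustration. The sketch also assumes that an event with a negated seed predicate is treated as unmatched, which is one possible reading of the text.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    text: str
    predicate: str
    negated: bool = False  # stands in for "complex phenomena like negation"


# Hypothetical miniature seed lexicon: predicate -> polarity score.
SEED: Dict[str, float] = {"glad": 1.0, "sad": -1.0}


def seed_score(event: Event) -> Optional[float]:
    """Return +1/-1 if the seed lexicon matches the event, else None.

    ASSUMPTION: negated seed predicates are treated as unmatched.
    """
    if event.negated:
        return None
    return SEED.get(event.predicate)


def classify_pair(
    former: Event, latter: Event, relation: str
) -> Tuple[Optional[str], Optional[Tuple[float, float]]]:
    """Return (pair type, reference scores) for one discourse-tagged pair.

    Reference scores are (former, latter) and are only set for AL pairs.
    Pairs outside the three types yield (None, None).
    """
    if relation not in ("CAUSE", "CONCESSION"):
        return None, None
    former_score = seed_score(former)
    latter_score = seed_score(latter)
    if latter_score is not None and former_score is None:
        # AL: propagate the latter's score (reversed for CONCESSION).
        sign = 1.0 if relation == "CAUSE" else -1.0
        return "AL", (sign * latter_score, latter_score)
    if latter_score is None and former_score is None:
        return ("CA" if relation == "CAUSE" else "CO"), None
    return None, None


mistake = Event("make a serious mistake", "make")
fired = Event("get fired", "fire")
sad = Event("be sad", "sad")
print(classify_pair(mistake, sad, "CAUSE"))
print(classify_pair(mistake, fired, "CAUSE"))
```

Only AL pairs carry reference scores; CA and CO pairs contribute to training solely through the relation between the two predicted scores.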

Loss Functions
Using AL, CA, and CO data, we optimize the parameters of the polarity function p(x). We define a loss function for each of the three types of event pairs and sum up the multiple loss functions.
We use mean squared error to construct loss functions. For the AL data, the loss function $\mathcal{L}_{\mathrm{AL}}$ is the mean squared error between the predicted scores $p(x_{i1})$, $p(x_{i2})$ and the automatically assigned reference scores $r_{i1}$, $r_{i2}$, where $x_{i1}$ and $x_{i2}$ are the former and latter events of the $i$-th AL pair. $N_{\mathrm{AL}}$ is the total number of AL pairs, and $\lambda_{\mathrm{AL}}$ is a hyperparameter.
For the CA data, the loss function $\mathcal{L}_{\mathrm{CA}}$ is computed over the pairs $(y_{i1}, y_{i2})$, where $y_{i1}$ and $y_{i2}$ are the former and latter events of the $i$-th CA pair. $N_{\mathrm{CA}}$ is the total number of CA pairs. $\lambda_{\mathrm{CA}}$ and $\mu$ are hyperparameters. The loss has two terms: the first makes the scores of the two events closer, while the second prevents the scores from shrinking to zero. The loss function for the CO data, $\mathcal{L}_{\mathrm{CO}}$, is defined analogously. The difference is that the first term makes the scores of the two events distant from each other.
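To make the roles of the loss terms concrete, here is a hedged sketch that operates on already-computed scores. The exact equations are not reproduced in this text, so the specific forms below are assumptions: the AL loss is taken to be a weighted mean squared error, the first CA term a squared difference, the first CO term a squared sum (zero when the two scores are exact opposites), and the anti-shrinkage term `1 - p^2`, which is only one possible penalty with the described effect. The default hyperparameter values are arbitrary.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (score of former event, score of latter event)


def loss_al(preds: List[Pair], refs: List[Pair], lam_al: float = 1.0) -> float:
    """Weighted mean squared error against automatically assigned scores."""
    total = sum(
        (r1 - p1) ** 2 + (r2 - p2) ** 2
        for (p1, p2), (r1, r2) in zip(preds, refs)
    )
    return lam_al * total / len(preds)


def _anti_shrink(p1: float, p2: float) -> float:
    """ASSUMED penalty that is largest when scores are near zero."""
    return (1.0 - p1 ** 2) + (1.0 - p2 ** 2)


def loss_ca(preds: List[Pair], lam_ca: float = 1.0, mu: float = 0.1) -> float:
    """First term pulls the two scores together (assumed squared difference)."""
    total = sum(
        lam_ca * (p1 - p2) ** 2 + mu * _anti_shrink(p1, p2) for p1, p2 in preds
    )
    return total / len(preds)


def loss_co(preds: List[Pair], lam_co: float = 1.0, mu: float = 0.1) -> float:
    """First term pushes the two scores toward opposite signs (assumed squared sum)."""
    total = sum(
        lam_co * (p1 + p2) ** 2 + mu * _anti_shrink(p1, p2) for p1, p2 in preds
    )
    return total / len(preds)


same = [(0.5, 0.5)]
opposite = [(0.5, -0.5)]
print(loss_ca(same), loss_ca(opposite))
print(loss_co(same), loss_co(opposite))
```

Without the second term, a model could minimize the CA loss trivially by predicting zero for every event; the penalty removes that degenerate solution, which matches the motivation given in the text.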

Experiments
4.1 Dataset

4.1.1 AL, CA, and CO
As a raw corpus, we used a Japanese web corpus that was compiled through the procedures proposed by Kawahara and Kurohashi (2006). To extract event pairs tagged with discourse relations, we used the Japanese dependency parser KNP and in-house postprocessing scripts (Saito et al., 2018). KNP used hand-written rules to segment each sentence into what we conventionally called clauses (mostly consecutive text chunks), each of which contained one main predicate. KNP also identified the discourse relations of event pairs if explicit discourse connectives (Prasad et al., 2008) such as "ので" (because) and "のに" (in spite of) were present. We treated Cause/Reason (原因・理由) and Condition (条件) in the original tagset (Kawahara et al., 2014) as CAUSE and Concession (逆接) as CONCESSION, respectively.
Here is an example of event pair extraction.
From this sentence, we extracted the event pair of "重大な失敗を犯す" ([I] make a serious mistake) and "仕事をクビになる" ([I] get fired), and tagged it with CAUSE.
We constructed our seed lexicon consisting of 15 positive words and 15 negative words, as shown in Section A.1. From the corpus of about 100 million sentences, we obtained 1.4 million event pairs for AL, 41 million for CA, and 6 million for CO. We randomly selected subsets of AL event pairs such that positive and negative latter events were equal in size. We also sampled event pairs for each of CA and CO such that each set was five times as large as the AL set. The results are shown in Table 1.

Table 1
Type of pairs                       # of pairs
AL (Automatically Labeled Pairs)    1,000,000
CA (CAUSE Pairs)                    5,000,000
CO (CONCESSION Pairs)               5,000,000
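The sampling procedure described in this section (balancing AL by the polarity of the latter event, then drawing CA and CO sets five times the size of AL) could be sketched as below. The function names and the toy integer "pairs" are hypothetical, and the sketch assumes simple uniform sampling without replacement, which the text does not specify.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def balance_al(
    positive: Sequence[T], negative: Sequence[T], rng: random.Random
) -> List[T]:
    """Sample AL pairs so positive and negative latter events are equal in number."""
    n = min(len(positive), len(negative))
    return rng.sample(list(positive), n) + rng.sample(list(negative), n)


def sample_unlabeled(
    pairs: Sequence[T], al_size: int, rng: random.Random, ratio: int = 5
) -> List[T]:
    """Sample CA or CO pairs so the set is `ratio` times the AL size (if available)."""
    k = min(len(pairs), ratio * al_size)
    return rng.sample(list(pairs), k)


rng = random.Random(0)  # fixed seed for reproducibility of the illustration
al_positive = list(range(10))        # toy stand-ins for AL pairs with positive latter events
al_negative = list(range(100, 104))  # toy stand-ins for AL pairs with negative latter events
al = balance_al(al_positive, al_negative, rng)
ca = sample_unlabeled(list(range(1000)), len(al), rng)
print(len(al), len(ca))
```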

ACP (ACP Corpus)
We used the latest version of the ACP Corpus (Kaji and Kitsuregawa, 2006) for evaluation. It was used for (semi-)supervised training as well. Extracted from Japanese websites using HTML layouts and linguistic patterns, the dataset covered various genres. For example, the following two sentences were labeled positive and negative, respectively:
(2) 作業が楽だ。 The work is easy.
Although the ACP Corpus was originally constructed in the context of sentiment analysis, we found that it could roughly be regarded as a collection of affective events. We parsed each sentence and extracted the last clause in it. The train/dev/test split of the data is shown in Table 2. The objective function for supervised training, $\mathcal{L}_{\mathrm{ACP}}$, is defined over the model scores $p(v_i)$ and the reference scores $R_i$, where $v_i$ is the $i$-th event, $R_i$ is the reference score of $v_i$, and $N_{\mathrm{ACP}}$ is the number of events in the ACP Corpus.
To optimize the hyperparameters, we used the dev set of the ACP Corpus. For the evaluation, we used the test set of the ACP Corpus. The model output was classified as positive if p(x) > 0 and negative if p(x) ≤ 0.
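The decision rule and the accuracy computation used for evaluation can be written compactly. The supervised loss below is assumed to be a plain mean squared error against the reference scores, following the statement that the loss functions are built from mean squared error; the function names and example values are hypothetical.

```python
from typing import List


def loss_acp(preds: List[float], refs: List[float]) -> float:
    """ASSUMED supervised objective: mean squared error over N_ACP events."""
    return sum((r - p) ** 2 for p, r in zip(preds, refs)) / len(preds)


def classify(score: float) -> str:
    """Positive if p(x) > 0, negative if p(x) <= 0."""
    return "positive" if score > 0 else "negative"


def accuracy(scores: List[float], gold: List[str]) -> float:
    """Fraction of events whose thresholded score matches the gold label."""
    correct = sum(1 for s, g in zip(scores, gold) if classify(s) == g)
    return correct / len(gold)


scores = [0.3, -0.2, 0.0]
gold = ["positive", "negative", "positive"]
print(accuracy(scores, gold))
```

Note that a score of exactly zero falls on the negative side of the threshold, so the third event in the example is counted as an error.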

Model Configurations
As for Encoder, we compared two types of neural networks: BiGRU and BERT. GRU (Cho et al., 2014) is a recurrent neural network sequence encoder. BiGRU reads an input sequence forward and backward and the output is the concatenation of the final forward and backward hidden states.
BERT (Devlin et al., 2019) is a pre-trained multi-layer bidirectional Transformer (Vaswani et al., 2017) encoder. Its output is the final hidden state corresponding to the special classification token ([CLS]). For the details of Encoder, see Section A.2.
We trained the model with the following four combinations of the datasets: AL, AL+CA+CO (two proposed models), ACP (supervised), and ACP+AL+CA+CO (semi-supervised). The corresponding objective functions were: $\mathcal{L}_{\mathrm{AL}}$, $\mathcal{L}_{\mathrm{AL}} + \mathcal{L}_{\mathrm{CA}} + \mathcal{L}_{\mathrm{CO}}$, $\mathcal{L}_{\mathrm{ACP}}$, and $\mathcal{L}_{\mathrm{ACP}} + \mathcal{L}_{\mathrm{AL}} + \mathcal{L}_{\mathrm{CA}} + \mathcal{L}_{\mathrm{CO}}$.
Table 3 shows accuracy. As the Random baseline suggests, positive and negative labels were distributed evenly. The Random+Seed baseline made use of the seed lexicon and output the corresponding label (or the reverse of it for negation) if the event's predicate was in the seed lexicon. We can see that the seed lexicon itself had practically no impact on prediction.
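The Random+Seed baseline described above can be sketched in a few lines. The function name, the tiny lexicon, and the fallback to a uniform random guess for unmatched predicates are assumptions made for illustration; the text only states that the baseline outputs the seed label, or its reverse under negation, when the predicate is in the lexicon.

```python
import random
from typing import Dict

LABELS = ("positive", "negative")


def random_seed_baseline(
    predicate: str,
    negated: bool,
    seed_lexicon: Dict[str, str],
    rng: random.Random,
) -> str:
    """Return the seed label (reversed under negation) or a random label."""
    label = seed_lexicon.get(predicate)
    if label is None:
        # ASSUMPTION: unmatched events receive a uniformly random label.
        return rng.choice(LABELS)
    if negated:
        return "negative" if label == "positive" else "positive"
    return label


# Hypothetical two-word lexicon, not the 30-word lexicon of Section A.1.
lexicon = {"glad": "positive", "sad": "negative"}
rng = random.Random(0)
print(random_seed_baseline("glad", False, lexicon, rng))
print(random_seed_baseline("glad", True, lexicon, rng))
```

Since a 30-word lexicon matches only a small share of test predicates, such a baseline behaves almost like random guessing, which is consistent with the observation that the seed lexicon alone had practically no impact.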

Results and Discussion
The models in the top block performed considerably better than the random baselines. The performance gaps with their (semi-)supervised counterparts, shown in the middle block, were less than 7%. This demonstrates the effectiveness of discourse relation-based label propagation.
Comparing the model variants, we obtained the highest score with the BiGRU encoder trained with the AL+CA+CO dataset. BERT was competitive, but its performance went down if CA and CO were used in addition to AL. We conjecture that BERT was more sensitive to noise, which was found more frequently in CA and CO.
Contrary to our expectations, supervised models (ACP) outperformed semi-supervised models (ACP+AL+CA+CO). This suggests that the training set of 0.6 million events is sufficiently large for training the models. For comparison, we trained the models with a subset (6,000 events) of the ACP dataset. As the results shown in Table 4 demonstrate, our method is effective when labeled data are small.
As the CA and CO pairs were equal in size (Table 1), $\lambda_{\mathrm{CA}}$ and $\lambda_{\mathrm{CO}}$ were comparable values. $\lambda_{\mathrm{CA}}$ was about one-third of $\lambda_{\mathrm{CO}}$, and this indicated that the CA pairs were noisier than the CO pairs. A major type of CA pairs that violates our assumption was in the form of "problem negative causes

Conclusion
In this paper, we proposed to use discourse relations to effectively propagate polarities of affective events from seeds. Experiments showed that, even with a minimal amount of supervision, the proposed method performed well.
Although event pairs linked by discourse analysis were shown to be useful, they nevertheless contain noise. Adding linguistically-motivated filtering rules would help improve the performance.

A.2 Settings of Encoder
BiGRU The dimension of the embedding layer was 256. The embedding layer was initialized with the word embeddings pretrained using the Web corpus. The input sentences were segmented into words by the morphological analyzer Juman++. The vocabulary size was 100,000. The number of hidden layers was 2. The dimension of hidden units was 256. The optimizer was Momentum SGD (Sutskever et al., 2013). The mini-batch size was 1024. We ran 100 epochs and selected the snapshot that achieved the highest score for the dev set.
BERT We used a Japanese BERT model pretrained with Japanese Wikipedia. The input sentences were segmented into words by Juman++, and words were broken into subwords by applying BPE (Sennrich et al., 2016). The vocabulary size was 32,000. The maximum length of an input sequence was 128. The number of hidden layers was 12. The dimension of hidden units was 768. The number of self-attention heads was 12. The optimizer was Adam (Kingma and Ba, 2014). The mini-batch size was 32. We ran 1 epoch.