Word Emotion Induction for Multiple Languages as a Deep Multi-Task Learning Problem

Predicting the emotional value of lexical items is a well-known problem in sentiment analysis. While research focused on polarity for quite a long time, this early focus has since shifted to more expressive emotion representation models (such as Basic Emotions or Valence-Arousal-Dominance). This change resulted in a proliferation of heterogeneous formats and, in parallel, often small-sized, non-interoperable resources (lexicons and corpus annotations). In particular, the limitations in size hampered the application of deep learning methods in this area because they typically require large amounts of input data. We here present a solution to get around this language data bottleneck by rephrasing word emotion induction as a multi-task learning problem. In this approach, the prediction of each independent emotion dimension is considered as an individual task and hidden layers are shared between these dimensions. We investigate whether multi-task learning is more advantageous than single-task learning for emotion prediction by comparing our model against a wide range of alternative emotion and polarity induction methods featuring 9 typologically diverse languages and a total of 15 conditions. Our model turns out to outperform each one of them. Against all odds, the proposed deep learning approach yields the largest gain on the smallest data sets, merely composed of one thousand samples.


Introduction
Deep Learning (DL) has radically changed the rules of the game in NLP by dramatically boosting performance figures in almost all application areas. Yet, one of the major premises of high-performance DL engines is their dependence on huge amounts of training data. As such, DL seems ill-suited for areas where training data are scarce, such as in the field of word emotion induction.
We will use the terms polarity and emotion here to distinguish between research focusing on "semantic orientation" (Hatzivassiloglou and McKeown, 1997) (the positiveness or negativeness) of affective states, on the one hand, and approaches which provide predictions based on some of the many more elaborated representational systems for affective states, on the other hand.
Originally, research activities focused on polarity alone. In the meantime, a shift towards more expressive representation models for emotion can be observed that heavily draws inspirations from psychological theory, e.g., Basic Emotions (Ekman, 1992) or the Valence-Arousal-Dominance model (Bradley and Lang, 1994).
Though this change turned out to be really beneficial for sentiment analysis in NLP, a large variety of mutually incompatible encoding schemes for emotion and, consequently, annotation formats for emotion metadata in corpora have emerged that hinder the interoperability of these resources and their subsequent reuse, e.g., on the basis of alignments or mergers (Buechel and Hahn, 2017).
As an alternative way of dealing with this unwarranted heterogeneity, we here examine the potential of multi-task learning (MTL; Caruana (1997)) for word-level emotion prediction. In MTL for neural networks, a single model is fitted to solve multiple, independent tasks (in our case, to predict different emotional dimensions), which typically results in learning more robust and meaningful intermediate representations. MTL has been shown to greatly decrease the risk of overfitting (Baxter, 1997), to work well for various NLP tasks (Setiawan et al., 2015; Liu et al., 2015; Søgaard and Goldberg, 2016; Cummins et al., 2016; Liu et al., 2017; Peng et al., 2017), and to practically increase sample size, thus making it a natural choice for the small-sized data sets typically found in the area of word emotion induction.
After a discussion of related work in Section 2, we will introduce several reference methods and describe our proposed deep MTL model in Section 3. In our experiments (Section 4), we will first validate our claim that MTL is superior to single-task learning for word emotion induction. After that, we will provide a large-scale evaluation of our model featuring 9 typologically diverse languages and multiple publicly available embedding models for a total of 15 conditions. Our MTL model surpasses the current state of the art for each of them, and even performs competitively relative to human reliability. Most notably, however, our approach yields the largest benefit on the smallest data sets, comprising merely one thousand samples. This finding, counterintuitive as it may be, strongly suggests that MTL is particularly beneficial for solving the word emotion induction problem. Our code base as well as the resulting experimental data is freely available.

Related Work
This section introduces the emotion representation format underlying our study and describes external resources we will use for evaluation before we discuss previous methodological work.
Emotion Representation and Data Sets. Psychological models of emotion can typically be subdivided into discrete (or categorical) and dimensional ones (Stevenson et al., 2007; Calvo and Mac Kim, 2013). Discrete models are centered around particular sets of emotional categories considered to be fundamental. Ekman (1992), for instance, identifies six Basic Emotions (Joy, Anger, Sadness, Fear, Disgust and Surprise).
In contrast, dimensional models consider emotions to be composed of several influencing factors (mainly two or three). These are often referred to as Valence (a positive-negative scale), Arousal (a calm-excited scale), and Dominance (perceived degree of control over a (social) situation), which together form the VAD model (Bradley and Lang (1994); see Figure 1 for an illustration). Many contributions, though, omit Dominance (the VA model) (Russell, 1980). For convenience, we will still use the term "VAD" to jointly refer to both variants (with and without Dominance).
VAD is the most common framework to acquire empirical emotion values for words in psychology. Over the years, a considerable number of such resources (also called "emotion lexicons") have emerged from psychological research labs (as well as some NLP labs) for diverse languages. The emotion lexicons we use in our experiments are listed in Table 1. An even more extensive list of such data sets is presented by Buechel and Hahn (2018). For illustration, we also provide three sample entries from one of those lexicons in Table 2. As can be seen, the three affective dimensions complement each other, e.g., "terrorism" and "orgasm" display similar Arousal but opposing Valence.
The task we address in this paper is to predict the values for Valence, Arousal and Dominance, given a lexical item. As is obvious from these examples, we consider emotion prediction as a regression, not as a classification problem (see arguments discussed in Buechel and Hahn (2016)).
In this paper, we focus on the VAD format for the following reasons: First, note that the Valence dimension exactly corresponds to polarity (Turney and Littman, 2003). Hence, with the VAD model, emotion prediction can be seen as a generalization over classical polarity prediction. Second, to the best of our knowledge, the amount and diversity of available emotion lexicons with VAD encodings is larger than for any other format (see Table 1).
Word Embeddings. Word embeddings are dense, low-dimensional vector representations of words trained on large volumes of raw text in an unsupervised manner. The following are among today's most popular embedding algorithms: WORD2VEC, with its variants SGNS and CBOW (Mikolov et al., 2013). FASTTEXT is a derivative of WORD2VEC, also incorporating sub-word character n-grams (Bojanowski et al., 2017). Unlike the former two algorithms, which fit word embeddings in a streaming fashion, GLOVE trains word vectors directly on a word co-occurrence matrix on the assumption that this makes more efficient use of word statistics (Pennington et al., 2014). Somewhat similarly, SVD PPMI performs singular value decomposition on top of a point-wise mutual information co-occurrence matrix (Levy et al., 2015). In order to increase the reproducibility of our experiments, we rely on the following widely used, publicly available embedding models trained on very large corpora (summarized in Table 3): the SGNS model trained on the Google News corpus (GOOGLE), the FASTTEXT model trained on Common Crawl (COMMON), as well as the FASTTEXT models for a wide range of languages trained on the respective Wikipedias (WIKI).
Note that WIKI denotes multiple embedding models with different training and vocabulary sizes (see Grave et al. (2018) for further details). Additionally, we were given the opportunity to reuse the English embedding model from Sedoc et al. (2017) (GIGA), a strongly related contribution (see below). Their embeddings were trained on the English Gigaword corpus (Parker et al., 2011).
Word-Level Prediction. One of the early approaches to word polarity induction which is still popular today (Köper and Schulte im Walde, 2016) was introduced by Turney and Littman (2003). They compute the polarity of an unseen word based on its point-wise mutual information (PMI) to a set of positive and negative seed words, respectively.
SemEval-2015 Task 10E featured polarity induction on Twitter (Rosenthal et al., 2015). The best system relied on support vector regression (SVR) using a radial basis function kernel (Amir et al., 2015). They employ the embedding vector of the target word as features. The results of their SVR-based system were beaten by the DENSIFIER algorithm (Rothe et al., 2016). DENSIFIER learns an orthogonal transformation of an embedding space into a subspace of strongly reduced dimensionality. Hamilton et al. (2016) developed SENTPROP, a graph-based, semi-supervised learning algorithm which builds up a word graph, where vertices correspond to words (of known as well as unknown polarity) and edge weights correspond to the similarity between them. The polarity information is then propagated through the graph, thus computing scores for unlabeled nodes. According to their evaluation, DENSIFIER seems to be superior overall, yet SENTPROP produces competitive results only when the seed lexicon or the corpus the word embeddings are trained on is very small.
For word emotion induction, a very similar approach to SENTPROP has been proposed by . They also propagate affective information (Valence and Arousal, in this case) through a word graph with similarity-weighted edges. Sedoc et al. (2017) recently proposed an approach based on signed spectral clustering where a word graph is constructed not only based on word similarity but also on the considered affective information (again, Valence and Arousal). The emotion value of a target word is then computed based on the seed words in its cluster. They report outperforming the results from .
Contrary to the trend towards graph-based methods, the best system of the IALP 2016 Shared Task on Chinese word emotion induction employed a simple feed-forward neural network (FFNN) with one hidden layer in combination with boosting (Du and Zhang, 2016).
Another very recent contribution which advocates a supervised set-up was published by Li et al. (2017). They propose ridge regression, again using word embeddings as features. Even with this simple approach, they report outperforming many of the above methods in the VAD prediction task.
Sentence-Level and Text-Level Prediction. Different from the word-level prediction task (the one we focus on in this contribution), the determination of emotion values for higher-level linguistic units (especially sentences and texts) is also heavily investigated. For this problem, DL approaches are by now fully established as the method of choice (Wang et al., 2016b; Abdul-Mageed and Ungar, 2017; Felbo et al., 2017; Mohammad and Bravo-Marquez, 2017).
It is important to note, however, that the methods discussed for these higher-level units cannot easily be transferred to solve the word emotion induction problem. Sentence-level and text-level architectures are either adapted to sequential input data (typical for RNN, LSTM, GRNN and related architectures) or spatially arranged input data (as with CNN architectures). However, for word embeddings (the default input for word emotion induction) there does not seem to be any meaningful order of their components. Therefore, these more sophisticated DL methods are, for the time being, not applicable for the study at hand.

Methods
In this section, we will first introduce various reference methods (two originally polarity-based for which we offer adaptations for VAD prediction) before defining our own neural MTL model and discussing its difference from previous work.
Let $V := \{w_1, w_2, ..., w_m\}$ be our word vocabulary and let $E := \{e_1, e_2, ..., e_m\}$ be a set of embedding vectors such that $e_i \in \mathbb{R}^n$ denotes the n-dimensional vector representation of word $w_i$. Let $D := \{d_1, d_2, ..., d_l\}$ be a set of emotional dimensions. Our task is to predict the empirically determined emotion vector $\mathrm{emo}(w) \in \mathbb{R}^l$ given a word w and the embedding space E.

Reference Methods
Linear Regression Baseline (LinReg). We propose (multi-variate) linear regression as an obvious baseline for the problem:
$$\mathrm{emo}_{LR}(w_k) := W e_k + b \qquad (1)$$
where W is a matrix, $W_{i*}$ contains the regression coefficients for the i-th affective dimension and b is the vector of bias terms. The model parameters are fitted using ordinary least squares. Technically, we use the scikit-learn.org implementation with default parameters.
Ridge Regression (RidgReg). Li et al. (2017) propose ridge regression for word emotion induction. Ridge regression works identically to linear regression during prediction, but introduces $L_2$ regularization during training. Following the authors, we again use the scikit-learn implementation with default parameters.
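To make the relation between the two regression baselines concrete, the following is a minimal NumPy sketch, not the scikit-learn implementation used in the experiments. The function names `fit_linear` and `predict` and the closed-form solver are illustrative assumptions; setting `alpha=0` corresponds to ordinary least squares, and `alpha>0` adds the $L_2$ penalty.

```python
import numpy as np

def fit_linear(X, Y, alpha=0.0):
    """Fit emo(w) = W e + b by (optionally L2-regularized) least squares.

    X: (m, n) matrix of word embeddings, Y: (m, l) matrix of emotion ratings.
    alpha=0 gives ordinary least squares; alpha>0 gives ridge regression.
    The bias is left unpenalized by centering X and Y before solving.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    n = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) A = Xc^T Yc, where A = W^T has shape (n, l).
    A = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n), Xc.T @ Yc)
    W = A.T                  # (l, n): row i holds the coefficients of dimension i
    b = y_mean - W @ x_mean  # (l,): one bias term per emotion dimension
    return W, b

def predict(W, b, E):
    """Predict emotion vectors for a batch of embeddings E of shape (k, n)."""
    return np.asarray(E, dtype=float) @ W.T + b
```

Prediction is the same affine map in both cases; only the fitted coefficients differ, with the ridge solution shrunk towards zero. With `alpha=0` the sketch requires more training words than embedding dimensions, otherwise the linear system is singular.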
Turney-Littman Algorithm (TL). As one of the earliest contributions in the field, Turney and Littman (2003) defined a simple PMI-based approach to determine the semantic polarity $SP_{TL}$ of a word w:
$$SP_{TL}(w) := \sum_{s \in \mathrm{seeds}^+} \mathrm{PMI}(w, s) - \sum_{s \in \mathrm{seeds}^-} \mathrm{PMI}(w, s) \qquad (2)$$
where $\mathrm{seeds}^+$ and $\mathrm{seeds}^-$ are sets of positive and negative seed words, respectively. Since this algorithm is still popular today (Köper and Schulte im Walde, 2016), we here provide a novel modification for adapting this originally polarity-based approach to word emotion induction with vectorial seed and output values.
First, we replace the PMI-based association of the target word w and a seed word s by their similarity sim(w, s), computed from their word embeddings $e_w$ and $e_s$:
$$\mathrm{emo}_{TL}(w) := \sum_{s \in \mathrm{seeds}^+} \mathrm{sim}(w, s) - \sum_{s \in \mathrm{seeds}^-} \mathrm{sim}(w, s) \qquad (4)$$
Although this step is technically not required for the adaptation, it renders the TL algorithm more comparable to the other approaches evaluated in Section 4, besides most likely increasing performance. Equation (4) can be rewritten as
$$\mathrm{emo}_{TL}(w) := \sum_{s \in \mathrm{seeds}} \mathrm{sim}(w, s) \times \mathrm{emo}(s) \qquad (5)$$
where $\mathrm{seeds} := \mathrm{seeds}^+ \cup \mathrm{seeds}^-$ and emo(s) maps to 1, if $s \in \mathrm{seeds}^+$, and −1, if $s \in \mathrm{seeds}^-$. Equation (5) can be trivially adapted to an n-dimensional emotion format by redefining emo(s) such that it maps to a vector from $\mathbb{R}^n$ instead of {−1, 1}. Our last step is to introduce a normalization term such that $\mathrm{emo}_{TL}(w)$ lies within the range of the seed lexicon:
$$\mathrm{emo}_{TL}(w) := \frac{\sum_{s \in \mathrm{seeds}} \mathrm{sim}(w, s) \times \mathrm{emo}(s)}{\sum_{s \in \mathrm{seeds}} \mathrm{sim}(w, s)} \qquad (6)$$
As can be seen from Equation (6), for the more general case of n-dimensional emotion prediction, the Turney-Littman algorithm naturally translates into a weighted average where the seed emotion values are weighted according to their similarity to the target item.
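The weighted average in Equation (6) can be sketched in a few lines of plain Python. The exact definition of sim is not given in the text above, so the cosine similarity clipped at zero used here is an assumption, as are the function names and the list-of-pairs seed format.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim(u, v):
    # Assumption: negative cosines are clipped to zero so that all weights
    # are non-negative and the result stays within the range of the seeds.
    return max(0.0, cosine(u, v))

def emo_tl(e_w, seeds):
    """Similarity-weighted average of seed emotion vectors (Equation 6).

    e_w:   embedding of the target word
    seeds: list of (seed_embedding, seed_emotion_vector) pairs
    """
    weights = [sim(e_w, e_s) for e_s, _ in seeds]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("target word has zero similarity to every seed")
    dims = len(seeds[0][1])
    return [
        sum(wt * emo[d] for wt, (_, emo) in zip(weights, seeds)) / total
        for d in range(dims)
    ]
```

For a target word that is equally similar to two seeds, the prediction is simply the midpoint of their emotion vectors; a word identical to one seed and orthogonal to the other inherits that seed's values.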
Densifier. Rothe et al. (2016) train an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ (n being the dimensionality of the word embeddings) such that applying Q to an embedding vector $e_i$ concentrates all the polarity information in its first dimension, so that the polarity of a word $w_i$ can be computed as
$$\mathrm{pol}(w_i) := p\,Q\,e_i \qquad (7)$$
where $p = (1, 0, 0, ..., 0)^T \in \mathbb{R}^{1 \times n}$. For fitting Q, the seeds are arranged into pairs of equal polarity (the set $\mathrm{pairs}^{=}$) and those of opposing polarity ($\mathrm{pairs}^{\neq}$). A good fit for Q will minimize the distance within the former and maximize the distance within the latter, which can be expressed by the following two training objectives:
$$\operatorname*{argmin}_{Q} \sum_{(w_i, w_j) \in \mathrm{pairs}^{=}} |p\,Q\,(e_i - e_j)| \qquad (8)$$
$$\operatorname*{argmax}_{Q} \sum_{(w_i, w_j) \in \mathrm{pairs}^{\neq}} |p\,Q\,(e_i - e_j)| \qquad (9)$$
The objectives described in the expressions (8) and (9) are combined into a single loss function (using a weighting factor $\alpha \in [0, 1]$) which is then minimized using stochastic gradient descent (SGD).
To adapt this algorithm to dimensional emotion formats, we construct a positive seed set, $\mathrm{seeds}^+_v$, and a negative seed set, $\mathrm{seeds}^-_v$, for each emotion dimension $v \in D$. Let $M_v$ be the mean value of all the entries of the training lexicon for the affective dimension v. Let $SD_v$ be the respective standard deviation and $\beta \in \mathbb{R}, \beta \geq 0$. Then all entries greater than $M_v + \beta SD_v$ are assigned to $\mathrm{seeds}^+_v$ and those less than $M_v - \beta SD_v$ are assigned to $\mathrm{seeds}^-_v$. Q is fitted individually for each emotion dimension v.
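The seed construction just described can be sketched as follows. The function name `split_seeds`, the dictionary format of the lexicon and the use of the population standard deviation are assumptions made for illustration; the text does not specify which standard deviation estimator was used.

```python
from statistics import mean, pstdev

def split_seeds(lexicon, dim, beta=0.5):
    """Derive positive and negative seed sets for one emotion dimension.

    lexicon: dict mapping each word to a tuple of ratings
    dim:     index of the emotion dimension v within that tuple
    beta:    width of the neutral band in standard deviations (beta >= 0)
    """
    values = [ratings[dim] for ratings in lexicon.values()]
    m = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    positive = {w for w, r in lexicon.items() if r[dim] > m + beta * sd}
    negative = {w for w, r in lexicon.items() if r[dim] < m - beta * sd}
    return positive, negative
```

Words whose rating falls within the band of width 2·β·SD around the mean end up in neither set, so a larger β yields fewer but more extreme seeds.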
Training was performed according to the original paper with the exception that (following Hamilton et al. (2016)) we did not apply the proposed re-orthogonalization after each training step, since we did not find any evidence that this procedure actually results in improved performance. The hyperparameters α and β were set to .7 and .5 (respectively) for all experiments based on a pilot study. Since the original implementation is not accessible, we devised our own using tensorflow.org.
Boosted Neural Networks (ensembleNN). Du and Zhang (2016) propose simple FFNNs in combination with a boosting algorithm. An FFNN consists of an input or embedding layer with activation $a^{(0)} \in \mathbb{R}^n$ which is equal to the embedding vector $e_k$ when predicting the emotion of a word $w_k$. The input layer is followed by multiple hidden layers with activation
$$a^{(l+1)} := \sigma(W^{(l+1)} a^{(l)} + b^{(l+1)}) \qquad (10)$$
where $W^{(l+1)}$ and $b^{(l+1)}$ are the weights and biases for layer l + 1 and σ is a nonlinear activation function. Since we treat emotion prediction as a regression problem, the activation on the output layer $a^{(out)}$ (where out is the number of non-input layers in the network) is computed as the affine transformation
$$a^{(out)} := W^{(out)} a^{(out-1)} + b^{(out)} \qquad (11)$$
Boosting is a general machine learning technique where several weak estimators are combined to form a strong estimator. The authors used FFNNs with a single hidden layer of 100 units and rectified linear unit (ReLU) activation. The boosting algorithm AdaBoost.R2 (Drucker, 1997) was used to train the ensemble (one per affective dimension). Our re-implementation copies their technical set-up (footnote 7) exactly using scikit-learn.

Multi-Task Learning Neural Network
The approaches introduced in Section 3.1 and Section 2 vary largely in their methodological foundations, i.e., they comprise semi-supervised and supervised machine learning techniques, both statistical and neural ones. Yet, they all have in common that they treat the prediction of the different emotional dimensions as separate tasks. That is, they fit one individual model per VAD dimension without sharing parameters between them.
In contradistinction, the key feature of our approach is that we fit a single FFNN model to predict all VAD dimensions jointly, thus applying multi-task learning to word emotion induction. Hence, we treat the prediction of Valence, Arousal and Dominance as three independent tasks. Our multi-task learning neural network (MTLNN) (depicted in Figure 2) has an output layer of three units such that each output unit represents one of the VAD dimensions. However, the activation in our two hidden layers (of 256 and 128 units, respectively) is shared across all VAD dimensions, and so are the associated weights and biases. Thus, while we train our MTLNN model it is forced to learn intermediate representations of the input which are generally informative for all VAD dimensions. This serves as a form of regularization, since it becomes less likely for our model to fit the noise in the training set as noise patterns may vary across emotional dimensions. Simultaneously, this has an effect similar to an increase of the training size, since each sample now leads to additional error signals during backpropagation. Intuitively, both properties seem extremely useful for relatively small-sized emotion lexicons (see Section 4 for empirical evidence).
Footnote 7: Original settings available at https://github.com/StevenLOL/ialp2016_Shared_Task
The remaining specifications of our model are as follows. We use leaky ReLU activation (LReLU) as nonlinearity (Maas et al., 2013):
$$\mathrm{LReLU}(x) := \max(x, \gamma x) \qquad (12)$$
with γ := .01 for our experiments. For regularization, dropout (Srivastava et al., 2014) is applied during training with a probability of .2 on the embedding layer and .5 on the hidden layers. We train for 15,000 iterations (well beyond convergence on each data set we use) with the ADAM optimizer (Kingma and Ba, 2015), using a base learning rate of .001, a batch size of 128 and Mean-Squared-Error loss. The weights are randomly initialized (drawn from a normal distribution with a standard deviation of .001) and biases are uniformly initialized as .01. Tensorflow is used for implementation.
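The architecture described above can be illustrated with a short NumPy sketch of the forward pass and the loss. This is not the authors' Tensorflow code: dropout and the ADAM update are deliberately left out, the zero mean of the weight initialization is an assumption, and all function names are hypothetical. Layer sizes, the leak parameter and the initialization constants follow the text.

```python
import numpy as np

GAMMA = 0.01  # leak parameter of the leaky ReLU, as stated in the text

def lrelu(x):
    """Leaky ReLU: identity for positive inputs, slope GAMMA for negative ones."""
    return np.maximum(x, GAMMA * x)

def init_params(n_in, hidden=(256, 128), n_out=3, seed=0):
    """Create (W, b) pairs: weights ~ N(0, .001^2), biases constant .01."""
    rng = np.random.default_rng(seed)
    sizes = (n_in, *hidden, n_out)
    return [
        (rng.normal(0.0, 0.001, size=(n_next, n_prev)), np.full(n_next, 0.01))
        for n_prev, n_next in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, E):
    """Map a batch of embeddings (k, n_in) to VAD predictions (k, 3).

    Both hidden layers are shared by all three outputs; only the rows of
    the final weight matrix are specific to one emotion dimension.
    """
    a = np.asarray(E, dtype=float)
    for W, b in params[:-1]:
        a = lrelu(a @ W.T + b)
    W_out, b_out = params[-1]
    return a @ W_out.T + b_out  # affine output layer for regression

def mse(pred, gold):
    """Mean squared error averaged over all words and emotion dimensions."""
    return float(np.mean((pred - np.asarray(gold, dtype=float)) ** 2))
```

Because the loss averages over all three outputs, every training word contributes error signals for Valence, Arousal and Dominance to the same shared hidden weights.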

Results
In this section, we first validate our assumption that MTL is superior to single-task learning for word emotion induction. Next, we compare our proposed MTLNN model in a large-scale evaluation experiment. Performance figures will be measured as Pearson correlation (r) between our automatically predicted values and human gold ratings. The Pearson correlation between two data series $X = x_1, x_2, ..., x_n$ and $Y = y_1, y_2, ..., y_n$ takes values between +1 (perfect positive correlation) and −1 (perfect negative correlation) and is computed as
$$r_{xy} := \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (13)$$
where $\bar{x}$ and $\bar{y}$ denote the mean values for X and Y, respectively.
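For reference, the evaluation metric can be written out directly in plain Python. The function name `pearson` and the error handling for constant series are illustrative choices, not part of the original set-up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)
```

Since r is invariant to shifting and positive rescaling of either series, a model is rewarded for ranking and spacing words correctly rather than for matching the absolute rating scale.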

Single-Task vs. Multi-Task Learning
The main hypothesis of this contribution is that an MTL set-up is superior to single-task learning for word emotion induction. Before proceeding to the large-scale evaluation of our proposed model, we will first examine this aspect of our work.
For this, we use the following experimental setup: We will compare the MTLNN model against its single-task learning counterpart (SepNN). SepNN simultaneously trains three separate neural networks where only the input layer, yet no parameters of the intermediate layers are shared across the models. Each of the separate networks is identical to MTLNN (same layers, dropout, initialization, etc.), yet has only one output neuron, thus modeling only one of the three affective VAD dimensions. SepNN is equivalent to fitting our proposed model (but with only one output unit) to the different VAD dimensions individually, one after the other. Yet, training these separate networks simultaneously (not jointly!) makes both approaches, MTLNN and SepNN, easier to compare.
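A quick parameter count illustrates how the two set-ups differ in size. The sketch below assumes 300-dimensional input embeddings, which is a hypothetical value chosen for illustration; the helper `ffnn_params` is likewise not part of the original code base.

```python
def ffnn_params(n_in, hidden, n_out):
    """Number of weights and biases in a fully connected feed-forward net."""
    sizes = (n_in, *hidden, n_out)
    return sum(n_prev * n_next + n_next
               for n_prev, n_next in zip(sizes[:-1], sizes[1:]))

N_IN = 300  # assumed embedding dimensionality (hypothetical)

# MTLNN: one network with shared hidden layers and three output units.
mtlnn = ffnn_params(N_IN, (256, 128), 3)

# SepNN: three separate networks, each with a single output unit.
sepnn = 3 * ffnn_params(N_IN, (256, 128), 1)

ratio = mtlnn / sepnn
```

Under this assumption the shared model has roughly one third of the parameters of the three separate networks, because almost all parameters sit in the hidden layers that SepNN replicates three times.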
We will run MTLNN against SepNN on the EN and the EN+ data set (the former is very small, the latter relatively large; see Table 1) using the following set-up: for each gold lexicon and model, we randomly split the data 9/1 and train for 15,000 iterations on the larger split (the same number of steps is used for the main experiment). After every one thousand iterations, model performance is tested on the held-out data. This process will be repeated 20 times and the performance figures at each of these checkpoints will be averaged. In a final step, we will average the results for each of the three emotional dimensions and only plot this average value. The results of this experiment are depicted in Figure 3.
First of all, each combination of model and data set displays a satisfactory performance of at least r ≈ .75 after 15,000 steps compared to previous work (see below). Overall, performance is higher for the smaller EN lexicon. Although counterintuitive (since smaller lexicons lead to fewer training samples), this finding is consistent with prior work (Sedoc et al., 2017; Li et al., 2017) and is probably related to the fact that smaller lexicons usually comprise a larger portion of strongly emotion-bearing words. In contrast, larger lexicons add more neutral words which tend to be harder to predict in terms of correlation.
As hypothesized, the MTLNN model does indeed outperform the single-task model on both data sets. Our data also suggest that the gain from the MTL approach is larger on smaller data sets (again in concordance with our expectations). Even when the separate model does not overfit (as on the EN+ lexicon), MTLNN yields better results. Although SepNN needs fewer training steps before convergence, the MTLNN model trains much faster, thus still converging faster in terms of runtime (about a minute on a middle-class GPU). This is because MTLNN has only about a third as many parameters as the separate model SepNN.
Table 4: Results of our main experiment in averaged Pearson correlation; best result per condition (in rows) in bold, second best result underlined; significant difference (paired two-tailed t-test) over the second best system marked with "*", "**", or "***" for p < .05, .01, or .001, respectively.

Comparison against Reference Methods
We combined each of the selected lexicon data sets (Table 1) with each of the applicable publicly available embedding models (Section 2; the embedding model provided by Sedoc et al. (2017) will be used separately) for a total of 15 conditions, i.e., the rows in Table 4.
For each of these conditions, we performed a 10-fold cross-validation (CV) for each of the 6 methods presented in Section 3 such that each method is presented with the identical data splits. For each condition, algorithm, and VA(D) dimension, we compute the Pearson correlation r between gold ratings and predictions. For conciseness, we present only the average correlation over the respective affective dimensions in Table 4 (Valence and Arousal for ES+ and ZH, VAD for the others). Note that the methods we compare ourselves against comprise the current state of the art in both polarity and emotion induction (as described in Section 2).
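One simple way to guarantee that every method sees identical data splits is to derive the folds from a fixed random seed. The following plain-Python sketch is an assumption about how this could be done, not the procedure actually used; the function name, the seed value and the lexicon size of 1,000 in the usage line are hypothetical.

```python
import random

def kfold_indices(n_items, k=10, seed=42):
    """Yield (train, test) index lists for k-fold cross-validation.

    The shuffle depends only on `seed`, so calling this function again
    with the same arguments reproduces exactly the same folds.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized parts
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Usage: materialize the splits once and hand the same list to every method.
splits = list(kfold_indices(1000))
```

Sharing the folds also makes paired significance tests meaningful, since the per-fold scores of two systems then refer to the same held-out words.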
As can be seen, our proposed MTLNN model outperforms all other approaches in each of the 15 conditions. Regarding the average over all affective dimensions and conditions, it outperforms the second best system, ensembleNN, by more than 5%-points. In line with our results from Section 4.1, those improvements are especially pronounced on smaller data sets containing one to two thousand entries (EN, ES, IT, PT, ID) with close to 10%-points improvement over the respective second-best system.
Concerning the relative ordering of the affective dimensions, in line with former studies (Sedoc et al., 2017;Li et al., 2017), the performance figures for the Valence dimension are usually much higher than for Arousal and Dominance. Using MTLNN, for many conditions, we see the pattern that Valence is about 10%-points above the VAD average, Arousal being 10%-points below and Dominance being roughly equal to the average over VAD (this applies, e.g., to EN, EN+ and IT). On other data sets (e.g., PL, NL and ID), the ordering between Arousal and Dominance is less clear though Valence still stands out with the best results. We observe the same general pattern for the reference methods, as well.
Concerning the comparison to Sedoc et al. (2017), arguably one of the most related contributions, they report a performance of r = .768 for Valence and .582 for Arousal on the EN+ data set in a 10-fold CV using their own embeddings. In contrast, MTLNN using the COMMON model achieves r = .870 and .674 in the same set-up, about 10%-points better on both dimensions. However, the COMMON model was trained on much more data than the embeddings Sedoc et al. (2017) use. For the most direct comparison, we also repeated this experiment using their embedding model (GIGA). We find that MTLNN still clearly outperforms their results with r = .814 for Valence and .607 for Arousal.
MTLNN also achieves very strong results in direct comparison to human performance (see Table 5). Warriner et al. (2013) (who created EN+) report an inter-study reliability (ISR; i.e., the correlation of the aggregated ratings from two different studies) between the EN and the EN+ lexicon of r = .953, .759 and .795 for VAD, respectively. Since EN is a subset of EN+, we can compare these performance figures against our own results on the EN data set where we achieved r = .918, .730 and .825, respectively. Thus, our proposed method did actually outperform human reliability for Dominance and is competitive for Valence and Arousal, as well.
This general observation is also backed up by split-half reliability data (SHR; i.e., when randomly splitting all individual ratings in two groups and averaging the ratings within each group, how strong is the correlation between these averaged ratings?). For the EN+ data set, Warriner et al. (2013) report an SHR of r = .914, .689 and .770 for VAD, respectively. Again, our MTLNN model performs very competitively with r = .870, .674 and .758, respectively, using the COMMON embeddings.
in word emotion induction, the task of predicting a complex emotion score for an individual word. We validated our claim that MTL is superior to single-task learning by achieving better results with our proposed method in performance as well as training time compared to its single-task counterpart. We performed an extensive evaluation of our model on 9 typologically diverse languages, using different kinds of word embedding models for a total of 15 conditions. Comparing our approach to state-of-the-art methods from word polarity and word emotion induction, our model turns out to be superior in each condition, thus setting a new state of the art for both polarity and emotion induction. Moreover, our results are even competitive with human annotation reliability in terms of inter-study as well as split-half reliability. Since this contribution was restricted to the VAD format of emotion representation, in future work we will examine whether MTL yields similar gains for other representational schemes, as well.