Learning and Evaluating Emotion Lexicons for 91 Languages

Emotion lexicons describe the affective meaning of words and thus constitute a centerpiece for advanced sentiment and emotion analysis. Yet, manually curated lexicons are only available for a handful of languages, leaving most languages of the world without such a precious resource for downstream applications. Even worse, their coverage is often limited both in terms of the lexical units they contain and the emotional variables they feature. In order to break this bottleneck, we here introduce a methodology for creating almost arbitrarily large emotion lexicons for any target language. Our approach requires nothing but a source language emotion lexicon, a bilingual word translation model, and a target language embedding model. Fulfilling these requirements for 91 languages, we are able to generate representationally rich high-coverage lexicons comprising eight emotional variables with more than 100k lexical entries each. We evaluated the automatically generated lexicons against human judgment from 26 datasets, spanning 12 typologically diverse languages, and found that our approach produces results in line with state-of-the-art monolingual approaches to lexicon creation and even surpasses human reliability for some languages and variables. Code and data are available at https://github.com/JULIELab/MEmoLon archived under DOI 10.5281/zenodo.3779901.


Introduction
An emotion lexicon is a lexical repository which encodes the affective meaning of individual words (lexical entries). Most simply, affective meaning can be encoded in terms of polarity, i.e., the distinction whether an item is considered as positive, negative, or neutral. This is the case for many well-known resources such as WORDNET-AFFECT (Strapparava and Valitutti, 2004), SENTIWORDNET (Baccianella et al., 2010), or VADER (Hutto and Gilbert, 2014). Yet, an increasing number of researchers focus on more expressive encodings for affective states inspired by distinct lines of work in psychology (Buechel and Hahn, 2017;Sedoc et al., 2017;Abdul-Mageed and Ungar, 2017;Bostan and Klinger, 2018;Mohammad, 2018;Troiano et al., 2019).
Psychologists, on the one hand, value such lexicons as a controlled set of stimuli for designing experiments, e.g., to investigate patterns of lexical access or the structure of memory (Hofmann et al., 2009;Monnier and Syssau, 2008). NLP researchers, on the other hand, use them to augment the emotional loading of word embeddings (Yu et al., 2017;Khosla et al., 2018), as additional input to sentence-level emotion models so that the performance of even the most sophisticated neural network gets boosted (Mohammad and Bravo-Marquez, 2017;Mohammad et al., 2018;De Bruyne et al., 2019), or rely on them in a keyword-spotting approach when no training data is available, e.g., for studies dealing with historical language stages (Buechel et al., 2016).
As with any kind of manually curated resource, the availability of emotion lexicons is heavily restricted to only a few languages whose exact number varies depending on the variables under scrutiny. For example, we are aware of lexicons for 15 languages that encode the emotional variables of Valence, Arousal, and Dominance (see Section 2). This number leaves the majority of the world's (less-resourced) languages without such a dataset. In case such a lexicon exists for a particular language, it is often severely limited in size, sometimes only comprising some hundreds of entries (Davidson and Innes-Ker, 2014). Yet, even the largest lexicons typically cover only some tens of thousands of words, still leaving out major portions of the emotion-carrying vocabulary. This is especially true for languages with complex morphology or productive compounding, such as Finnish, Turkish, Czech, or German. Finally, the diversity of emotion representation schemes adds another layer of complexity. While psychologists and NLP researchers alike find that different sets of emotional variables are complementary to each other (Stevenson et al., 2007;Pinheiro et al., 2017;Barnes et al., 2019;De Bruyne et al., 2019), manually creating emotion lexicons for every language and every emotion representation scheme is virtually impossible.
We here propose an approach based on crosslingual distant supervision to generate almost arbitrarily large emotion lexicons for any target language and emotional variable, provided the following requirements are met: a source language emotion lexicon covering the desired variables, a bilingual word translation model, and a target language embedding model. By fulfilling these preconditions, we can automatically generate emotion lexicons for 91 languages covering ratings for eight emotional variables and hundreds of thousands of lexical entries each. Our experiments reveal that our method is on a par with state-of-the-art monolingual approaches and compares favorably with (sometimes even outperforms) human reliability.

Related Work
Representing Emotion. Whereas research in NLP has focused for a very long time almost exclusively on polarity, more recently, there has been a growing interest in more informative representation structures for affective states by including different groups of emotional variables (Bostan and Klinger, 2018). Borrowing from distinct schools of thought in psychology, these variables can typically be subdivided into dimensional vs. discrete approaches to emotion representation (Calvo and Mac Kim, 2013). The dimensional approach assumes that emotional states can be composed out of several foundational factors, most noticeably Valence (corresponding to polarity), Arousal (measuring calmness vs. excitement), and Dominance (the perceived degree of control in a social situation); VAD, for short (Bradley and Lang, 1994). Conversely, the discrete approach assumes that emotional states can be reduced to a small, evolutionarily motivated set of basic emotions (Ekman, 1992). Although the exact division of the set has been the subject of heated debate, recently constructed datasets (see Section 4) most often cover the categories of Joy, Anger, Sadness, Fear, and Disgust; BE5, for short. Plutchik's Wheel of Emotion takes a middle ground between those two positions by postulating emotional categories which are yet grouped into opposite pairs along different levels of intensity (Plutchik, 1980). Another dividing line between representational approaches is whether target variables are encoded in terms of (strict) class-membership or scores for numerical strength. In the first case, emotion analysis translates into a (multi-class) classification problem, whereas the latter turns it into a regression problem (Buechel and Hahn, 2016). While our proposed methodology is agnostic towards the chosen emotion format, we will focus on the VAD and BE5 formats here, using numerical ratings (see the examples in Table 1) due to the widespread availability of such data.
Accordingly, this paper treats word emotion prediction as a regression problem. VAD uses 1-to-9 scales ("5" encodes the neutral value) and BE5 uses 1-to-5 scales ("1" encodes the neutral value).
Building Emotion Lexicons. Usually, the ground truth for affective word ratings (i.e., the assignment of emotional values to a lexical item) is acquired in a questionnaire study design where subjects (annotators) receive lists of words which they rate according to different emotion variables or categories. Aggregating individual ratings of multiple annotators then results in the final emotion lexicon (Bradley and Lang, 1999). Recently, this workflow has often been enhanced by crowdsourcing (Mohammad and Turney, 2013) and best-worst scaling (Kiritchenko and Mohammad, 2016).
As a viable alternative to manual acquisition, such lexicons can also be created by automatic means (Bestgen, 2008;Köper and Schulte im Walde, 2016;Shaikh et al., 2016), i.e., by learning to predict emotion labels for unseen words. Researchers have worked on this prediction problem for quite a long time. Early work tended to focus on word statistics, often in combination with linguistic rules (Hatzivassiloglou and McKeown, 1997;Turney and Littman, 2003). More recent approaches focus heavily on word embeddings, either using semi-supervised graph-based approaches (Hamilton et al., 2016;Sedoc et al., 2017) or fully supervised methods (Rosenthal et al., 2015;Li et al., 2017;Rothe et al., 2016;Du and Zhang, 2016). Most important for this work, Buechel and Hahn (2018b) report on near-human performance using a combination of FASTTEXT vectors and a multi-task feed-forward network (see Section 4). While this line of work can add new words, it does not extend lexicons to other emotional variables or languages.
A relatively new way of generating novel labels is emotion representation mapping (ERM), an annotation projection that translates ratings from one emotion format into another, e.g., mapping VAD labels into BE5, or vice versa (Hoffmann et al., 2012;Buechel and Hahn, 2016, 2018a;Alarcão and Fonseca, 2017;Landowska, 2018;Zhou et al., 2020;Park et al., 2019). While our work uses ERM to add additional emotion variables to the source lexicon, ERM alone can neither increase the coverage of a lexicon, nor adapt it to another language.
Translating Emotions. The approach we propose is strongly tied to the observation by Leveau et al. (2012) and Warriner et al. (2013) who found, comparing a large number of existing emotion lexicons of different languages, that translational equivalents of words show strong stability and adherence to their emotional value. Yet, their work is purely descriptive. They do not exploit their observation to create new ratings, and only consider manual rather than automatic translation.
Making indirect use of this observation, Mohammad and Turney (2013) offer machine-translated versions of their NRC Emotion Lexicon. Also, many approaches in cross-lingual sentiment analysis (on the sentence-level) rely on translating polarity lexicons (Abdalla and Hirst, 2017;Barnes et al., 2018). Perhaps most similar to our work, Chen and Skiena (2014) create (polarity-only) lexicons for 136 languages by building a multilingual word graph and propagating sentiment labels through that graph. Yet, their method is restricted to high-frequency words: their lexicons cover between 12 and 4,653 entries, whereas our approach exceeds this limit by more than two orders of magnitude.
Our methodology also resembles previous work which models word emotion for historical language stages (Cook and Stevenson, 2010;Hamilton et al., 2016;Hellrich et al., 2018;Li et al., 2019). Work in this direction typically comes up with a set of seed words with presumably temporally stable affective meaning (our work assumes stability against translation) and then uses distributional methods to derive emotion ratings in the target language stage. However, gold data for the target language (stage) is usually inaccessible, often preventing evaluation against human judgment. In contrast, we here propose several alternative evaluation set-ups as an integral part of our methodology.

A Novel Approach to Lexicon Creation
Our methodology integrates (1) cross-lingual generation and expansion of emotion lexicons and (2) their evaluation against gold and silver standard data. Consequently, a key aspect of our workflow design is how data is split into train, dev, and test sets at different points of the generation process. Figure 1 gives an overview of our framework including a toy example for illustration.
Lexicon Generation. We start with a lexicon (Source) of arbitrary size, emotion format, and source language, which is partitioned into train, dev, and test splits denoted by Source-train, Source-dev, and Source-test, respectively. Next, we leverage a bilingual word translation model between source and desired target language to build the first target-side emotion lexicon denoted as TargetMT. Source words are translated according to the model, whereas target-side emotion labels are simply copied from the source to the target (see Section 2). Entries are assigned to train, dev, or test set according to their source-side assignment (cf. Figure 1). The choice of our translation service (see below) ensures that each source word receives exactly one translation.
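The translation step can be illustrated with a minimal sketch. It is not the released implementation: the function name `build_target_mt`, the dictionary-based translation table, and all rating values are hypothetical and serve only to show that words are translated, labels are copied unchanged, and each entry keeps its source-side split.

```python
from typing import Dict, List, Tuple

Ratings = Dict[str, float]
Entry = Tuple[str, Ratings]


def build_target_mt(
    source_splits: Dict[str, List[Entry]],
    translation_table: Dict[str, str],
) -> Dict[str, List[Entry]]:
    """Translate each source word and copy its ratings unchanged.

    Entries stay in the split (train/dev/test) of their source word.
    A list is used per split because distinct source words may map to
    the same target word ("partial duplicates").
    """
    target_mt: Dict[str, List[Entry]] = {}
    for split_name, entries in source_splits.items():
        target_mt[split_name] = [
            (translation_table[word], dict(ratings))
            for word, ratings in entries
        ]
    return target_mt


# Toy English-to-German example; the numbers are made up for illustration.
source_splits = {
    "train": [
        ("sunshine", {"valence": 8.1, "arousal": 5.3}),
        ("nuclear", {"valence": 3.4, "arousal": 6.2}),
    ],
    "dev": [],
    "test": [("terrorism", {"valence": 1.6, "arousal": 7.4})],
}
translation_table = {
    "sunshine": "Sonnenschein",
    "nuclear": "nuklear",
    "terrorism": "Terrorismus",
}

target_mt = build_target_mt(source_splits, translation_table)
```

The sketch assumes exactly one translation per source word, in line with the property of the translation service stated above.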
TargetMT is then used as the distant supervisor to train a model that predicts word emotions based on target-side word embeddings. TargetMT-train and TargetMT-dev are used to fit model parameters and optimize hyperparameters, respectively, whereas TargetMT-test is held out for later evaluation. Once finalized, the model is used to predict new labels for the words in TargetMT, resulting in a second target-side emotion lexicon denoted TargetPred. Our rationale for doing so is that a reasonably trained model should generalize well over the entire TargetMT lexicon because it has access to the target-side embedding vectors. Hence, it may mitigate some of the errors which were introduced in previous steps, either by machine translation or by assuming that source- and target-side emotion are always identical. We validate this assumption in Section 6. We also predict ratings for all the words in the embedding model, leading to a large number of new entries.
Figure 1: Schematic view on the methodology for generating and evaluating an emotion lexicon for a given target language based on source language supervision. Included is a toy example starting with an English VA lexicon (sunshine, nuclear, terrorism and the associated numerical scores for Valence and Arousal) and resulting in an extended German lexicon which incorporates translated entries with altered VA scores and additional entries originating from the embedding model with newly learned scores.
The splits are defined as follows: let $MT_{\mathrm{train}}$, $MT_{\mathrm{dev}}$, and $MT_{\mathrm{test}}$ denote the set of words in the train, dev, and test split of TargetMT, respectively. Likewise, let $P_{\mathrm{train}}$, $P_{\mathrm{dev}}$, and $P_{\mathrm{test}}$ denote the splits of TargetPred and let $E$ denote the set of words in the embedding model. Then

$$P_{\mathrm{train}} := MT_{\mathrm{train}}$$
$$P_{\mathrm{dev}} := MT_{\mathrm{dev}} \setminus MT_{\mathrm{train}}$$
$$P_{\mathrm{test}} := (MT_{\mathrm{test}} \cup E) \setminus (MT_{\mathrm{train}} \cup MT_{\mathrm{dev}})$$

The above definitions help clarify the way we address polysemy.² Ambiguity on the target-side may result in multiple source entries translating to the same target-side word.³ This circumstance leads to "partial duplicates" in TargetMT, i.e., groups of entries with the same word type but different emotion values (because they were derived from distinct Source entries). Such overlap could do harm to the integrity of our evaluation since knowledge may "leak" from training to validation phase, i.e., by testing the model on words it has already seen during training, although with distinct emotion labels. The proposed data partitioning eliminates such distortion effects. Since partial duplicates receive the same embedding vector, the prediction model assigns the same emotion value to both, thus merging them in TargetPred.
² In short, our work evades this problem by dealing with lexical entries exclusively on the type- rather than the sense-level. From a lexicological perspective, this may seem like a strong assumption. From a modeling perspective, however, it appears almost obvious as it aligns well with the major components of our methodology, i.e., lexicons, embeddings, and translation. The lexicons we work with follow the design of behavioral experiments: a stimulus (word type) is given to a subject and the response (rating) is recorded. The absence of sense-level annotation simplifies the mapping between lexicon and embedding entries. While sense embeddings form an active area of research (Camacho-Collados and Pilehvar, 2018;Chi and Chen, 2018), to the best of our knowledge, type-level embeddings yield state-of-the-art performance in downstream applications.
³ Source-side polysemy, in contrast to its target-side counterpart, is less of a problem, because we receive only a single candidate during translation. This may result in cases where the translation misaligns with the copied emotion value in TargetMT. Yet, the prediction step partly mitigates such inconsistencies (see Section 6).
Evaluation Methodology. The main advantage of the above generation method is that it allows us to create large-scale emotion lexicons for languages for which gold data is lacking. But if that is the case, how can we assess the quality of the generated lexicons? Our solution is to propose two different evaluation scenarios: a gold evaluation, which is a strict comparison against human judgment and is therefore limited to languages where such data (denoted TargetGold) is available, and a silver evaluation, which substitutes human judgments by automatically derived ones (silver standard) and is feasible for any language in our study. The rationale is that if gold and silver evaluation strongly agree with each other, we can use one as proxy for the other when no target-side gold data exists (examined in Section 6).
Note that our lexicon generation approach consists of two major steps, translation and prediction. However, these two steps are not equally important for each generated entry in TargetPred. Words, such as German Sonnenschein for which a translational equivalent already exists in the Source ("sunshine"; see Figure 1), mainly rely on translation, while the prediction step acts as an optional refinement procedure. In contrast, the prediction step is crucial for words, such as Erdbeben, whose translational equivalents ("earthquake") are missing in the Source. Yet, these words also depend on the translation step for producing training data.
These considerations are important for deciding which words to evaluate on. We may choose to base our evaluation on the full TargetPred lexicon, including words from the training set-after all, the word emotion model does not have access to any target-side gold data. The problem with this approach is that it merges words that mainly rely on translation, because their equivalents are in the Source, and those which largely depend on prediction, because they are taken from the embedding model. In this case, generalizability of evaluation results becomes questionable.
Thus, our evaluation methodology needs to fulfill the following two requirements: (1) evaluation must not be performed on translational equivalents of the Source entries to which the model already had access during training (e.g., Sonnenschein and nuklear in our example from Figure 1); but, on the other hand, (2) a reasonable number of instances must be available for evaluation (ideally, as many as possible to increase reliability). The intricate cross-lingual train-dev-test set assignment of our generation methodology is in place so that we meet these two requirements. In particular, for our silver evaluation, we intersect TargetMT-test with TargetPred-test and compute the correlation of these two sets individually for each emotion variable. Pearson's r will be used as correlation measure throughout this paper. Establishing a test set at the very start of our workflow, Source-test, assures that there is a relatively large overlap between the two sets and, by extension, that our requirements for the evaluation are met.
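The split assignment for TargetPred can be sketched with plain set operations. The sketch below shows one partitioning consistent with the description in this section: train takes precedence over dev, dev over test, and the embedding vocabulary joins the test split. The function name and the toy German words are hypothetical.

```python
from typing import Iterable, Set, Tuple


def targetpred_splits(
    mt_train: Iterable[str],
    mt_dev: Iterable[str],
    mt_test: Iterable[str],
    embedding_vocab: Iterable[str],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Derive disjoint TargetPred splits from the TargetMT splits.

    Words seen during training never reach dev or test, and words used
    for hyperparameter tuning never reach test, so no word type leaks
    across splits even if TargetMT contains partial duplicates.
    """
    p_train = set(mt_train)
    p_dev = set(mt_dev) - p_train
    p_test = (set(mt_test) | set(embedding_vocab)) - (p_train | set(mt_dev))
    return p_train, p_dev, p_test


# Toy data: "Bank" and "nuklear" are partial duplicates across splits.
p_train, p_dev, p_test = targetpred_splits(
    mt_train={"Sonnenschein", "nuklear", "Bank"},
    mt_dev={"Bank", "Terrorismus"},
    mt_test={"nuklear", "Haus"},
    embedding_vocab={"Erdbeben", "Sonnenschein", "Haus"},
)
```

In this toy case, only "Haus" (translated test word) and "Erdbeben" (embedding-only word) remain in the test split, which mirrors requirement (1) above.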
The gold evaluation is a somewhat more challenging case, because we cannot, in general, guarantee that the overlap of a TargetGold lexicon with TargetPred-test will be of any particular size. For this reason, the words of the embedding model are added to TargetPred-test (see above), maximizing the expected overlap with TargetGold. In practical terms, we intersect TargetGold with TargetPred-test and compute the variable-wise correlation between these sets, in parallel to the silver evaluation. A complementary strategy for maximizing overlap, by exploiting dependencies between published lexicons, is described below.
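Both evaluation scenarios share the same computation: intersect a reference lexicon (TargetMT-test for silver, TargetGold for gold) with TargetPred-test and compute Pearson's r per variable. A minimal sketch follows, assuming lexicons are stored as dictionaries from word to ratings; the function names and toy ratings are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Lexicon = Dict[str, Dict[str, float]]


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(reference: Lexicon, predicted: Lexicon,
             variables: List[str]) -> Dict[str, float]:
    """Variable-wise correlation over the words shared by both lexicons."""
    shared = sorted(set(reference) & set(predicted))
    return {
        var: pearson_r([reference[w][var] for w in shared],
                       [predicted[w][var] for w in shared])
        for var in variables
    }


# Toy lexicons; "Tisch" and "Erdbeben" are ignored (not shared).
reference = {
    "Haus": {"valence": 6.0, "arousal": 3.0},
    "Krieg": {"valence": 2.0, "arousal": 7.0},
    "Liebe": {"valence": 8.0, "arousal": 6.0},
    "Tisch": {"valence": 5.0, "arousal": 3.0},
}
predicted = {
    "Haus": {"valence": 5.5, "arousal": 3.5},
    "Krieg": {"valence": 3.5, "arousal": 5.5},
    "Liebe": {"valence": 6.5, "arousal": 5.0},
    "Erdbeben": {"valence": 2.5, "arousal": 6.5},
}

scores = evaluate(reference, predicted, ["valence", "arousal"])
```

Because Pearson's r is invariant to linear rescaling, the predicted ratings need not be on exactly the same scale as the reference ratings for the comparison to be meaningful.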

Experimental Setup
Gold Lexicons and Data Splits. We use the English emotion lexicon from Warriner et al. (2013) as first part of our Source dataset. This popular resource comprises about 14k entries in VAD format collected via crowdsourcing. Since manually gathered BE5 ratings are available only for a subset of this lexicon (Stevenson et al., 2007), we add BE5 ratings from Buechel and Hahn (2018a) who used emotion representation mapping (see Section 2) to convert the existing VAD ratings, showing that this is about as reliable as human annotation.
As apparent from the previous section, a crucial aspect for applying our methodology is the design of the train-dev-test split of the Source because it directly impacts the number of words we can test our lexicons on during gold evaluation. In line with these considerations, we choose the lexical items which are already present in ANEW (Bradley and Lang, 1999) as Source-test set. ANEW is the precursor to the version later distributed by Warriner et al. (2013); it is widely used and has been adapted to a wide range of languages. With this choice, it is likely that a resulting TargetPred-test set has a large overlap with the respective TargetGold lexicon. As for the TargetGold lexicons, we included every VA(D) and BE5 lexicon we could get hold of with more than 500 entries. This resulted in 26 datasets covering 12 quite diverse languages (see Table 2). Note that we also include English lexicons in the gold evaluation. In these cases, no translation will be carried out (Source is identical to TargetMT) so that only the expansion step is validated. Appendix A.1 gives further details on data preparation.
Translation. We used the GOOGLE CLOUD TRANSLATION API to produce word-to-word translation tables. This is a commercial service; total translation costs amounted to 160 EUR. API calls were performed in November 2019.
Embeddings. We use the fastText embedding models from Grave et al. (2018) trained for 157 languages on the respective WIKIPEDIA and the respective part of COMMONCRAWL. These resources not only greatly facilitate our work but also increase comparability across languages. The restriction to "only" 91 languages comes from intersecting the ones covered by the vectors with the languages covered by the translation service.
Models. Since our proposed methodology is agnostic towards the chosen word emotion model, we will re-use models from the literature. In particular, we will rely on the multi-task learning feed-forward network (MTLFFN) worked out by Buechel and Hahn (2018b). This network constitutes the current state of the art for monolingual emotion lexicon creation (expanding an existing lexicon for a given language) for many of the datasets in Table 2.
The MTLFFN has two hidden layers of 256 and 128 units, respectively, and takes pre-trained embedding vectors as input. Its distinguishing feature is that hidden layer parameters are shared between the different emotion target variables, thus constituting a mild form of multi-task learning (MTL). We apply MTL to VAD and BE5 variables individually (but not between both groups), thus training two distinct emotion models per language, following the outcome of a development experiment. Details are given in Appendix A.2 together with the remainder of the model specifications.
Being aware of the infamous instability of neural approaches (Reimers and Gurevych, 2017), we also employ a ridge regression model, an L2-regularized version of linear regression, as a more robust, yet also powerful baseline (Li et al., 2017).
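For readers unfamiliar with the baseline, ridge regression has the closed-form solution $w = (X^\top X + \alpha I)^{-1} X^\top y$. The sketch below implements it in plain Python for a single target variable. It is illustrative only: the helper names, the regularization strength `alpha`, the absence of an intercept term, and the two-dimensional toy "embeddings" are all assumptions, not the configuration used in the experiments.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ridge(X: List[List[float]], y: List[float], alpha: float) -> List[float]:
    """Return weights minimizing ||Xw - y||^2 + alpha * ||w||^2."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve_linear_system(gram, xty)


def predict(X: List[List[float]], w: List[float]) -> List[float]:
    """Linear prediction without intercept."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


# Toy data generated by y = 2*x1 - x2 (stand-in for embeddings and ratings).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]

w_unregularized = fit_ridge(X, y, alpha=0.0)
w_ridge = fit_ridge(X, y, alpha=1.0)
```

With `alpha=0.0` the fit reduces to ordinary least squares; increasing `alpha` shrinks the weight vector, which is the source of the robustness mentioned above.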

Results
The size of the resulting lexicons (a complete list is provided in Table 8 in the Appendix) ranges from roughly 100k to more than 2M entries mainly depending on the vocabulary of the respective embeddings. We want to point out that not every single entry should be considered meaningful because of noise in the embedding vocabulary caused by typos and tokenization errors. However, choosing the "best" size for an emotion lexicon necessarily translates into a quality-coverage trade-off for which there is no general solution. Instead, we release the full-size lexicons and leave it to prospective users to apply any sort of filtering they deem appropriate.
Silver Evaluation. Figure 2 displays the results of our silver evaluation. Languages (x-axis) are sorted by their average performance over all variables (not shown in the plot; tabular data given in the Appendix). As can be seen, the evaluation results for English are markedly better than for any other language. This is not surprising since no (potentially error-prone) machine translation was performed. Apart from that, performance remains relatively stable across most of the languages and starts degrading more quickly only for the last third of them. In particular, for Valence-typically the easiest variable to predict-we achieve a strong performance of r > .7 for 56 languages. On the other hand, for Arousal-typically, the most difficult one to predict-we achieve a solid performance of r > .5 for 55 languages. Dominance and the discrete emotion variables show performance trajectories swinging between these two extremes. We assume that the main factors for explaining performance differences between languages are the quality of the translation and embedding models which, in turn, both depend on the amount of available text data (parallel or monolingual, respectively).
Comparing MTLFFN and ridge baseline, we find that the neural network reliably outperforms the linear model. On average over all languages and variables, the MTL models achieve 6.7%-points higher Pearson correlation. Conversely, ridge regression outperforms MTLFFN in only 15 of the total 728 cases (91 languages × 8 variables).
Gold Evaluation. Results for VAD variables on gold data are given in Table 3. As can be seen, our lexicons show a good correlation with human judgment and do so robustly, even for less-resourced languages, such as Indonesian (id), Turkish (tr), or Croatian (hr), and across affective variables. Perhaps the strongest negative outliers are the Arousal results for the two Chinese datasets (zh), which are likely to result from the low reliability of the gold ratings (see below).

We compare these results against those from Buechel and Hahn (2018b), which were acquired on the respective TargetGold dataset in a monolingual fashion using 10-fold cross-validation (10-CV). We admit that those results are not fully comparable to those presented here because we use fixed splits rather than 10-CV. Nevertheless, we find that the results of our cross-lingual set-up are more than competitive, outperforming the monolingual results from Buechel and Hahn (2018b) in 17 out of 30 cases (mainly for Valence and Dominance, less often for Arousal). This is surprising since we use an otherwise identical model and training procedure. We conjecture that the large size of the English Source lexicon, compared to most TargetGold lexicons, more than compensates for error-prone machine translation.
Table 4 shows the results for BE5 datasets which are in line with the VAD results. Regarding the ordering of the emotional variables, again, we find Valence to be the easiest one to predict, Arousal the hardest, whereas basic emotions and Dominance take a middle ground.
Comparison against Human Reliability. We base this analysis on inter-study reliability (ISR), a rather strong criterion for human performance. ISR is computed, per variable, as the correlation between the ratings from two distinct annotation studies (Warriner et al., 2013). Hence, this analysis is restricted to languages where more than one gold lexicon exists per emotion format. We intersect the entries from both gold standards as well as the respective TargetPred-test set and compute the correlation between all three pairs of lexicons. If our lexicon agrees more with one of the gold standards than the two gold standards agree with each other, we consider this as an indicator for superhuman reliability (Buechel and Hahn, 2018b).
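The comparison can be sketched for a single emotion variable as follows. The function names and the toy ratings are hypothetical; the criterion implemented is the one stated above, namely that the predicted lexicon agrees more with at least one gold standard than the two gold standards agree with each other.

```python
import math
from typing import Dict, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare_with_isr(gold_a: Dict[str, float], gold_b: Dict[str, float],
                     predicted: Dict[str, float]) -> Dict[str, object]:
    """Correlate all three lexicon pairs on their common vocabulary."""
    shared = sorted(set(gold_a) & set(gold_b) & set(predicted))
    a = [gold_a[w] for w in shared]
    b = [gold_b[w] for w in shared]
    p = [predicted[w] for w in shared]
    isr = pearson_r(a, b)
    pred_vs_a = pearson_r(p, a)
    pred_vs_b = pearson_r(p, b)
    return {
        "isr": isr,
        "pred_vs_a": pred_vs_a,
        "pred_vs_b": pred_vs_b,
        "exceeds_isr": max(pred_vs_a, pred_vs_b) > isr,
    }


# Toy single-variable lexicons; "cloud" is dropped (missing from gold_b).
gold_a = {"calm": 1.0, "walk": 2.0, "run": 3.0, "panic": 4.0, "cloud": 2.0}
gold_b = {"calm": 1.0, "walk": 3.0, "run": 2.0, "panic": 4.0}
predicted = {"calm": 2.0, "walk": 4.0, "run": 6.0, "panic": 8.0, "cloud": 3.0}

result = compare_with_isr(gold_a, gold_b, predicted)
```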
As shown in  human reliability in 4 out of 6 cases for Arousal, and in the single test case for Dominance. There are no cases of overlapping gold standards for BE5.

Methodological Assumptions Revisited
This section investigates patterns in prediction quality across languages, validating design decisions of our methodology.
Translation vs. Prediction. Is it beneficial to predict new ratings for the words in TargetMT rather than using them as final lexicon entries straight away? For each TargetGold lexicon (cf. Table 2), we intersect its word material with that in TargetMT and TargetPred. Then, we compute the correlation of both TargetPred and TargetMT with the gold standard. This analysis was done on the respective train sets because using TargetMT rather than TargetPred is only an option for entries known at training time. Table 6 depicts the results of this comparison averaged over all gold lexicons. As hypothesized, the TargetPred lexicons agree, on average, more with human judgment than the TargetMT lexicons, suggesting that the word emotion model acts as a value-adding post-processor, partly mitigating rating inconsistencies introduced by mere translation of the lexicons. The observation holds for each individual emotion variable with particularly large benefits for Arousal, where the post-processed TargetPred lexicons are on average 14%-points better compared to the translation-only TargetMT lexicons. This seems to indicate that lexical Arousal is less consistent between translational equivalents compared to other emotional meaning components like Valence and Sadness, which appear to be more robust against translation.
Gold vs. Silver Evaluation. How meaningful is silver evaluation without gold data? We compute the Pearson correlation between gold and silver evaluation results across languages per emotion variable. For languages where we consider multiple datasets during gold evaluation, we first average the gold evaluation results for each emotion variable. As can be seen from Table 7, the correlation values range between r = .91 for Joy and r = .27 for Disgust. This relatively large dispersion is not surprising when we take into account that we correlate very small data series (for Valence and Arousal there are just 12 languages for which both gold and silver evaluation results are available; for BE5 there are only 5 such languages). However, the mean over all correlation values in Table 7 is .64, indicating that there is a relatively strong correlation between both types of evaluation. This suggests that the silver evaluation may be used as a rather reliable proxy of lexicon quality even in the absence of language-specific gold data.
Table 7: Agreement between gold and silver evaluation across languages in Pearson's r relative to the number of applicable languages ("#Lg").

Conclusion
Emotion lexicons are at the core of sentiment analysis, a rapidly flourishing field of NLP. Yet, despite large community efforts, the coverage of existing lexicons is still limited in terms of languages, size, and types of emotion variables. While there are techniques to tackle these three forms of sparsity in isolation, we introduced a methodology which allows us to cope with them simultaneously by jointly combining emotion representation mapping, machine translation, and embedding-based lexicon expansion.
Our study is "large-scale" in many respects. We created representationally complex lexicons, comprising 8 distinct emotion variables, for 91 languages with up to 2 million entries each. The evaluation of the generated lexicons featured 26 manually annotated datasets spanning 12 diverse languages. The predicted ratings showed consistently high correlation with human judgment, compared favorably with state-of-the-art monolingual approaches to lexicon expansion and even surpassed human inter-study reliability in some cases.
The sheer number of test sets we used allowed us to validate fundamental methodological assumptions underlying our approach. Firstly, the evaluation procedure, which is integrated into the generation methodology, allows us to reliably estimate the quality of resulting lexicons, even without target language gold standard. Secondly, our data suggests that embedding-based word emotion models can be used as a repair mechanism, mitigating poor target-language emotion estimates acquired by simple word-to-word translation.
Future work will have to address word sense ambiguity more thoroughly by replacing the simplifying type-level approach underlying our current work with a semantically more informed sense-level approach. A promising direction would be to combine a multilingual sense inventory such as BABELNET (Navigli and Ponzetto, 2012) with sense embeddings (Camacho-Collados and Pilehvar, 2018

A.1 Data Preparation
The exact design of the Source train-dev-test split is as follows: All entries (words plus ratings) from all splits are taken from Warriner et al. (2013). The data was then partitioned based on the overlap with the two precursory versions by Bradley and Lang (1999) (the original ANEW) and Bradley and Lang (2010) (an early extended version of ANEW roughly twice as large). Source-test was built by intersecting the lexicon from Warriner et al. (2013) with the original ANEW. A similar process was applied for Source-dev: we intersected the words from Warriner et al. (2013) and Bradley and Lang (2010) and removed the ones present in Source-test. Lastly, Source-train is made up of all words from Warriner et al. (2013) which are neither in Source-test nor in Source-dev. The reason why the ratings in Source are taken exclusively from Warriner et al. (2013) is that these are distributed under a more permissive license compared to their precursors. We removed multi-token entries (e.g., boa constrictor) and entries with upper case characters (e.g., Budweiser) from all data splits of Source, thus restricting the lexicon to single-token, non-proper-noun entries to make it more suitable for word embedding-based research. All splits combined have 13,791 entries (train: 11,463, dev: 1,296, test: 1,032), thus removing less than 1% from the original lexicon (the data split is available at https://github.com/JULIELab/XANEW).
Regarding the remaining gold standards, the only cases which needed additional preparation or cleansing steps were zh1 and zh2 (Yao et al., 2017). zh1 was created and is distributed using traditional Chinese characters, whereas the embedding model by Grave et al. (2018) employs simplified ones. Therefore, we converted zh1 into simplified characters using GOOGLE TRANSLATE (in this case the regular Web application, not the API: https://translate.google.com/) prior to evaluation.
While manually examining the zh2 lexicon, we noticed several cases where the ratings seemed rather counter-intuitive (e.g., seemingly positive words which received very negative ratings). We contacted the authors who confirmed the problem and sent us a corrected version. We did not find any such problems in the second version. We consulted with a Chinese native speaker for both of these procedures regarding the zh1 and zh2 lexicons.
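The construction of the Source splits described above amounts to a few set operations. The following is a minimal sketch under stated assumptions: the function names, the dictionary-based input layout, and the toy ratings are hypothetical, and tokens are assumed to be separated by whitespace. It is not the script behind the released split.

```python
def is_single_lowercase_token(word):
    """Keep single-token entries without upper-case characters.

    Assumption: tokens are separated by whitespace.
    """
    return len(word.split()) == 1 and not any(ch.isupper() for ch in word)


def split_source(warriner, anew_1999, anew_2010):
    """Partition the Warriner et al. (2013) entries into train/dev/test.

    warriner:  dict mapping word -> ratings (the only source of ratings).
    anew_1999: set of words in the original ANEW.
    anew_2010: set of words in the extended ANEW.
    """
    words = {w for w in warriner if is_single_lowercase_token(w)}
    test = words & anew_1999              # overlap with the original ANEW
    dev = (words & anew_2010) - test      # overlap with extended ANEW, minus test
    train = words - test - dev            # everything else
    return {
        name: {w: warriner[w] for w in sorted(part)}
        for name, part in (("train", train), ("dev", dev), ("test", test))
    }


# Toy data with made-up ratings, NOT values from any of the lexicons.
toy_warriner = {
    "sunshine": (8.0, 5.0, 6.0),
    "terror": (2.0, 7.0, 3.0),
    "table": (5.0, 3.0, 5.0),
    "lamp": (5.5, 3.0, 5.0),
    "boa constrictor": (3.0, 6.0, 4.0),   # multi-token entry, removed
    "Budweiser": (6.0, 5.0, 5.0),         # upper-case character, removed
}
toy_anew_1999 = {"sunshine", "Budweiser"}
toy_anew_2010 = {"sunshine", "terror", "Budweiser"}

splits = split_source(toy_warriner, toy_anew_1999, toy_anew_2010)
for name, entries in splits.items():
    print(name, sorted(entries))
```

Because the ratings are always read from the `warriner` mapping, the two ANEW versions only decide which split a word belongs to, mirroring the licensing rationale given above.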

A.2 Model Training and Implementation
Training of the MTLFFN model closely followed the procedure specified by Buechel and Hahn (2018b): For each language, the model was trained for roughly 15k iterations (exactly 168 epochs) with a batch size of 128 using the Adam optimizer (Kingma and Ba, 2015) with learning rate 10^-3, and .5 dropout on the hidden layers and .2 on the input layer. As nonlinear activation function we used leaky ReLU with "leakage" of 0.01.
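As a quick plausibility check, the relation between "168 epochs" and "roughly 15k iterations" can be reproduced from the figures given in Appendix A.1. The sketch below assumes that each language is trained on as many entries as Source-train contains (11,463) and that the last, smaller batch of an epoch counts as one iteration; both are assumptions, not statements from the paper. It also spells out the leaky ReLU activation with the stated leakage.

```python
import math

TRAIN_ENTRIES = 11_463  # size of Source-train (A.1); assumed equal for each language
BATCH_SIZE = 128
EPOCHS = 168

# Assumption: the final, incomplete batch of an epoch is still one iteration.
batches_per_epoch = math.ceil(TRAIN_ENTRIES / BATCH_SIZE)
total_iterations = batches_per_epoch * EPOCHS
print(batches_per_epoch, total_iterations)


def leaky_relu(x, leakage=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return x if x > 0 else leakage * x
```

Under these assumptions the total lands slightly above 15,000 iterations, consistent with the "roughly 15k" stated in the text.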
Embedding vectors are the only model input. They have 300 dimensions for every language, independent of their respective training data size (Grave et al., 2018). Since the automatic translation of Source is not guaranteed to result in single-word translations, we use the following workaround to derive embedding vectors for multi-token translations: If the translation as a whole cannot be found in the embedding model, the multi-token term is split into its constituent parts, using spaces, apostrophes, or hyphens as separators. Each substring is looked up in the embedding model, and the average of the resulting vectors is taken as input. If no substring is recognized, we use the zero vector instead. We also use the zero vector for single-token entries in TargetMT that are missing in the embeddings.
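The lookup workaround described above can be sketched as follows. This is an illustration, not the released implementation: the function name `input_vector`, the plain dictionary standing in for an embedding model, and the toy vectors are hypothetical. Two further assumptions are labeled in the code: only the ASCII apostrophe is treated as a separator, and the average is taken over the recognized substrings only.

```python
import re

DIM = 300  # dimensionality of the embedding vectors

# Separators named in the text: spaces, apostrophes, hyphens.
# Assumption: only the ASCII apostrophe and hyphen are covered here.
SEPARATORS = re.compile(r"[ '\-]+")


def input_vector(term, embeddings, dim=DIM):
    """Return the model input for a (possibly multi-token) translation.

    embeddings: dict mapping a string to a list of `dim` floats
    (a hypothetical stand-in for a real embedding model).
    """
    if term in embeddings:                  # the term as a whole is known
        return list(embeddings[term])
    parts = [p for p in SEPARATORS.split(term) if p]
    known = [embeddings[p] for p in parts if p in embeddings]
    if not known:                           # nothing recognized: zero vector
        return [0.0] * dim
    # Assumption: average over the recognized substrings only.
    return [sum(values) / len(known) for values in zip(*known)]


# Toy 3-dimensional "embeddings" with made-up values.
toy = {"ice": [1.0, 0.0, 2.0], "cream": [3.0, 2.0, 0.0]}
print(input_vector("ice-cream", toy, dim=3))
print(input_vector("unknownword", toy, dim=3))
```

A single-token entry that is missing from the embeddings passes through the same code path as an unrecognized multi-token term, so both cases end up with the zero vector, as in the description above.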
Since Buechel and Hahn (2018b) considered only VAD but not BE5 datasets, we conducted a development experiment on the TargetMT-dev sets for all 91 languages where we assessed whether MTL is advantageous for BE5 variables as well, or for a combination of VAD and BE5 variables. We found that MTL improved performance when applied separately to the VAD variables and to the BE5 variables. Yet, when jointly learning all eight emotion variables, the results were somewhat inconclusive: performance increased for BE5, but decreased for VAD. Hence, for lexicon creation, we took a cautious approach and trained two separate models per language, one for VAD, the other for BE5. An analysis of MTL across VAD and BE5 is left for future work.
The MTLFFN model is implemented in PYTORCH, adapting part of the TENSORFLOW code from Buechel and Hahn (2018b). The ridge regression baseline model is implemented with SCIKIT-LEARN (Pedregosa et al., 2011) using default parameters.