CARER: Contextualized Affect Representations for Emotion Recognition

Emotions are expressed in nuanced ways, which vary by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. We propose a semi-supervised, graph-based algorithm to produce rich structural descriptors which serve as the building blocks for constructing contextualized affect representations from text. The pattern-based representations are further enriched with word embeddings and evaluated through several emotion recognition tasks. Our experimental results demonstrate that the proposed method outperforms state-of-the-art techniques on emotion recognition tasks.


Introduction
Emotions reflect different users' perspectives towards actions and events; therefore, they are innately expressed in dynamic linguistic forms. Capturing these linguistic variations is challenging because it involves knowledge of linguistic phenomena such as slang and coded words. Previous methods model these linguistic behaviours through rule-based (Volkova and Bachrach, 2016) and statistics-based approaches (Becker et al., 2017). These methods construct features that rely on hand-crafted resources; thus, they cannot properly capture the evolving linguistic variability found in large-scale opinionated content.
Consider the social posts "Thanks God for everything" and "Tnx mom for waaaaking me two hours early. Cant get asleep now". A lexicon-based model may not properly represent the emotion-relevant phrases "waaaaking me", "Thanks God", and "Tnx mom". First, the word "waaaaking" doesn't exist in the English vocabulary, hence its referent may vary from its standard form, "waking". Secondly, knowledge of the semantic similarity between the words "Thanks" and "Tnx" is needed to establish any relationship between the last two phrases. Even if such a relationship can be established through knowledge-based techniques, it is difficult to reliably determine the association of these phrases with a group of emotions. This is because traditional methods analyze data at the sentence level, which may be less effective than methods that model the corpus as a complex network.
We represent an emotion corpus as a graph, which may suffer less from the problems mentioned above. This method efficiently captures the global mutual use of linguistic variations found in textual information. This is particularly important for linguistic behaviour that is socially and culturally influenced, as is common in opinionated content. Other advantages of the graph approach are that minimum domain knowledge and manual effort are required to capture important contextual and latent information, which are useful to disambiguate meaning in emotional expressions.
As an overview, we first collect an emotion dataset through noisy labels, annotated via distant supervision as in (Go et al., 2009). The graph-based mechanism helps to construct contextualized, pattern-based emotion features, which are further enriched with word embeddings in order to preserve the semantic relationships between patterns.
To evaluate the quality of the patterns, emotion detection models are trained using various online classifiers and deep learning models. Our main contributions are as follows: 1) a graph-based algorithm for automatic emotion-relevant feature extraction, 2) a set of emotion-rich feature representations enhanced through word embeddings, and 3) a comprehensive performance analysis of various conventional learning models and deep learning models used for text-based emotion recognition.
The rest of the paper is organized as follows: Section 2 discusses the relevant literature and different aspects of emotion recognition research addressed in this work; then, Section 3 provides details of the proposed methodology for extracting contextualized emotion-relevant representations; next, Section 4 lists the constructed emotion recognition models and comparison models; later, Section 5 discusses the data collection and experimental results; and finally, Section 6 further explains key insights observed from the results.

Related Work
Emotion Lexica: Several works use hand-crafted features and statistics-based approaches to train emotion recognition models (Blitzer et al., 2007; Wang et al., 2012; Roberts et al., 2012; Qadir and Riloff, 2013; Volkova et al., 2013; Becker et al., 2017; Saravia et al., 2016a). Some of these studies rely on affect lexicons, such as LIWC and WordNet Affect (Strapparava et al., 2004), to extract emotion features from a text-based corpus. A recent study trained emotion detection systems built on emoticon and hashtag features (Volkova and Bachrach, 2016). Hand-crafted features are useful for emotion recognition but are usually constrained by manually created resources. Our graph-based features are obtained in a semi-supervised manner, requiring minimum domain expertise and no dependency on linguistic resources that quickly become outdated.
Emotion Corpora: There are several affective datasets, such as SemEval-2017 Task 4 (Rosenthal et al., 2017) and the Olympic games dataset (Sintsova et al., 2013). However, these datasets are limited in size. We bootstrap a set of noisy labels to obtain large-scale emotion tweets, and then perform annotation via distant supervision as in (Go et al., 2009; González-Ibánez et al., 2011; Wang et al., 2012; Mohammad and Kiritchenko, 2015; Abdul-Mageed and Ungar, 2017). In emotion recognition studies, Plutchik's wheel of emotions (Plutchik, 2001) or Ekman's six basic emotions (Ekman, 1992) are commonly adopted to define emotion categories (Mohammad, 2012; Suttles and Ide, 2013). Similar to previous works, we rely on hashtags to define our emotion categories.
Feature Representations: Recent emotion recognition systems employ representation learning for automatic feature extraction (Poria et al., 2016; Savigny and Purwarianti, 2017; Abdul-Mageed and Ungar, 2017). In general, a combination of word embeddings (Mikolov et al., 2013) and a convolutional neural network (CNN) performs well for sentence classification tasks (Kim, 2014; Zhang et al., 2015). These models learn features which tend to have high coverage and high adaptability, require minimum supervision, and capture contextual information to some extent. We aim to leverage them and combine them with the proposed affect representations. Our graph-based feature extraction algorithm focuses on the underlying interactions between important linguistic components. Graph analysis measurements then help to output the building blocks for constructing pattern-based features. Hence, the patterns can be constructed to capture important contextual and latent emotion-relevant information.

Contextualized Affect Representations
In this section, we introduce a graph-based algorithm which helps to output the building blocks used to bootstrap a set of emotion-rich representations. The structural descriptions offered by the graph are particularly efficient at automatically surfacing important information (i.e., contextual and latent information) from a large-scale emotion corpus. Two different measurements are used to surface two families of words, which are in turn used to construct contextualized, pattern-based affect representations. The patterns are further enriched using word embeddings so as to preserve the semantic relationships between patterns. After the patterns are constructed, the goal is to assign a weight to each pattern, referred to as a pattern score, which denotes how important a pattern p is to an emotion e. In the context of emotion classification, patterns and their weights play the role of features. The graph-based feature extraction algorithm is summarized in the following steps:
[Figure: diagram labels "Subject words", "Connector words", and "Pattern Extraction"]
Step 1 (Normalization): First, we collected two separate datasets using the Twitter API: subjective tweets S (obtained through hashtags as weak labels) and objective tweets O (obtained from Twitter feeds of news accounts). Each dataset contains 2+ million tweets; S was collected using 339 hashtags, similar to the process in Section 5.1. Both datasets are tokenized by white-spaces and then preprocessed by applying lower case and replacing user mentions and URLs with a <usermention> and <url> placeholder, respectively. Hashtags are used as ground-truth in this work, so to avoid any bias we replace them with a <hashtag> placeholder.
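The normalization in Step 1 can be illustrated with a minimal sketch. The regular expressions and the function name `normalize` are assumptions made for illustration; the paper does not specify how mentions, URLs, and hashtags are detected.

```python
import re

# ASSUMPTION: simple regular expressions for Twitter entities; the paper
# does not state the exact matching rules it uses.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def normalize(tweet):
    """Lower-case a tweet, replace entities with placeholders, split on whitespace."""
    text = tweet.lower()
    text = URL.sub("<url>", text)          # URLs first, since they may contain '@' or '#'
    text = MENTION.sub("<usermention>", text)
    text = HASHTAG.sub("<hashtag>", text)  # hashtags are labels, so they are masked
    return text.split()


tokens = normalize("@Bob last night's concert was just AWESOME !!!!! #joy")
```

Masking the hashtag keeps the weak label out of the features, which is the bias the step is meant to avoid.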
Step 2 (Graph Construction): Given the normalized objective tweets O and subjective tweets S, two graphs are constructed: the objective graph $G_o(V_o; A_o)$ and the subjective graph $G_s(V_s; A_s)$, respectively. The vertex set V contains the nodes that represent the tokens extracted from the corpus. The edges, denoted as A, represent the relationships between words extracted using a window approach. These steps help to preserve the syntactic structure of the data. Given a post "<usermention> last night's concert was just awesome !!!!! <hashtag>", the resulting arcs are: "<usermention> → last", "last → night", ... , "!!!!! → <hashtag>".
Step 3 (Graph Aggregation): In this step we obtain a set of arcs that represent syntactic structures more common in subjective content. By adjusting graph $G_s$ with $G_o$, we obtain a graph $G_e$, referred to as the emotion graph, which preserves emotion-relevant tokens and is obtained in two steps:
(1) For an arc $a_i \in A$, its normalized weight is computed as shown in Equation 1, where $freq(a_i)$ is the frequency of arc $a_i$.
(2) Subsequently, new weights for arcs $a_i \in G_e$ are assigned based on a pairwise adjustment (Equation 2). The resulting weights belonging to graph $G_e$ were adjusted so that the most frequently occurring arcs in the objective graph $G_o$ are weakened in $G_e$. As a result, arcs in $G_e$ that have higher weights represent tokens that are more common in subjective content. Furthermore, arcs $a_i \in A_e$ are pruned based on a threshold $\phi_w$.
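Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the window is taken to be two adjacent tokens, the normalization of Equation 1 is assumed to divide by the largest arc frequency, and the pairwise adjustment of Equation 2 is assumed to subtract the objective weight from the subjective weight. The function names and the default threshold are hypothetical.

```python
from collections import Counter


def build_arcs(posts):
    """Count directed arcs between adjacent tokens (ASSUMPTION: window of size 2)."""
    arcs = Counter()
    for tokens in posts:
        arcs.update(zip(tokens, tokens[1:]))
    return arcs


def normalize_weights(arcs):
    """Scale arc frequencies to [0, 1].

    ASSUMPTION: division by the largest frequency stands in for Equation 1.
    """
    top = max(arcs.values(), default=1)
    return {arc: freq / top for arc, freq in arcs.items()}


def emotion_graph(subjective, objective, phi_w=0.0):
    """Weaken arcs that are also frequent in objective text, then prune.

    ASSUMPTION: subtraction stands in for the pairwise adjustment of Equation 2;
    phi_w is a hypothetical threshold value.
    """
    w_s = normalize_weights(build_arcs(subjective))
    w_o = normalize_weights(build_arcs(objective))
    adjusted = {arc: w - w_o.get(arc, 0.0) for arc, w in w_s.items()}
    return {arc: w for arc, w in adjusted.items() if w > phi_w}


g_e = emotion_graph(
    subjective=[["i", "love", "you"], ["i", "love", "life"]],
    objective=[["i", "love", "you"], ["the", "news"]],
)
```

In this toy input, only the arc ("love", "life") survives, because every other subjective arc is at least as common in the objective posts.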
Step 4 (Token Categorization): Two different graph measurements are used to extract two families of words from $G_e$. These will function as the building blocks to build contextualized patterns. We formalize this step as follows: given an adjacency matrix $M$, an entry $M_{i,j}$ is computed as shown in Equation 3. Then, the eigenvector centrality and clustering coefficient of all vertices in $V_e$ are computed and used to categorize tokens into two types:
(1) Connector Words: To measure the influence of all nodes in graph $G_e$, we utilize eigenvector centrality, which is calculated as
$$c_i = \frac{1}{\lambda} \sum_{j \in V_e} M_{i,j} \, c_j \quad (4)$$
where $\lambda$ denotes a proportionality factor and $c_i$ is the centrality score of node $i$.
Given $\lambda$ as the corresponding eigenvalue, Equation 4 can be reformulated in vector notation as $Mc = \lambda c$, where $c$ is an eigenvector of $M$. Given a selected eigenvector $c$ and the eigenvector centrality score of node $i$, denoted as $c_i$, the final list of connector words, hereinafter referred to as $CW$, is obtained by retaining all tokens with $c_i > \phi_{eig}$. $CW$ corresponds to the set of nodes that are very frequent and have high connectivity to high-rank nodes (e.g., "or", "and", and "my").
(2) Subject Words: In contrast, subject words or topical words are usually clustered together, i.e., many subject words are interconnected by the same connector words. Therefore, a coefficient is assigned to all nodes in $G_e$ and is computed as shown in Equation 5, where $cl_i$ denotes the average clustering coefficient of node $i$, which captures the amount of interconnectivity among the neighbours of node $i$. Similar to the connector words, the subject words, hereinafter referred to as $SW$, are obtained by retaining all the tokens with $cl_i > \phi_{cl}$. Examples of subject words obtained are "never" and "life".
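The two measurements in Step 4 can be sketched in plain Python. The sketch treats the emotion graph as undirected and unweighted, computes eigenvector centrality by power iteration (iterating with $M + I$, which has the same eigenvectors as $M$ and avoids oscillation), and uses the standard local clustering coefficient as a stand-in for Equation 5. These choices and the function names are assumptions made for illustration.

```python
from itertools import combinations


def adjacency(arcs):
    """Undirected neighbour sets built from directed arcs (u, v)."""
    nbrs = {}
    for u, v in arcs:
        if u == v:
            continue  # ignore self-loops
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def eigenvector_centrality(nbrs, iters=100):
    """Power iteration on (M + I); scores are scaled so the largest is 1."""
    if not nbrs:
        return {}
    c = {v: 1.0 for v in nbrs}
    for _ in range(iters):
        nxt = {v: c[v] + sum(c[u] for u in nbrs[v]) for v in nbrs}
        top = max(nxt.values())
        c = {v: x / top for v, x in nxt.items()}
    return c


def clustering_coefficient(nbrs, v):
    """Fraction of neighbour pairs of v that are themselves connected.

    ASSUMPTION: the standard local clustering coefficient stands in for Equation 5.
    """
    k = len(nbrs[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs[v], 2) if b in nbrs[a])
    return 2.0 * links / (k * (k - 1))


nbrs = adjacency([("a", "and"), ("and", "b"), ("c", "and"), ("a", "b")])
centrality = eigenvector_centrality(nbrs)
```

Connector and subject words would then be the tokens whose scores exceed $\phi_{eig}$ and $\phi_{cl}$, respectively.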
The subject words represent psychologically oriented words similar to the LIWC affect lexicon, while connector words reflect the set of most common words in the subjective tweets (e.g., pronouns, auxiliary verbs, and conjunctions). As presented by Chung and Pennebaker (2007), both connector words and subject words are important for conveying emotion. Influenced by their work, we aim to capture, through the graph, intricate relationships between these two families of words. The graph structure helps to preserve syntax and can automatically be used to surface emotion-relevant information.
One of the advantages of using graphs to represent syntactic relationships is that rare and important words are also surfaced. As shown in Table 1, informal words and misspellings, such as "definetley" and "happnd", were surfaced. Words containing character repetitions help to express emotion intensity (e.g., "plzzzzzzz", "aaaaaaah", and "yayyyyy"). Interestingly, emotion-related coded words are also captured (e.g., "juju", "sh*t", "4ever", and "baobei"). All these examples show the benefit of using graph methods to capture emotion-relevant linguistic information. (The thresholds $\phi_{eig}$ and $\phi_{cl}$ were determined experimentally.)
Step 5 (Pattern Candidates): Given SW and CW, we bootstrap candidate patterns, which are more prevalent in opinionated content, while preserving syntactic structure. We provide the templates used to define the candidate patterns in Table 2 (sw and cw represent arbitrary tokens obtained from the sets SW and CW, respectively). Note that sequences of size two and three were used in this work, since this setting experimentally produced the best results.
Step 6 (Basic Pattern Extraction): A naive pattern extraction process consists of applying the syntactic templates to a dataset $S_p$ in an exhaustive manner. In addition, the sw component in each pattern is replaced with a "*" placeholder. This operation allows unknown subject words, not present in our training corpus, to be considered when constructing features. This can enable many useful applications, such as applying the patterns to other domains. We are interested in patterns that are highly associated with subjectivity, so patterns whose frequency exceeds a threshold are kept and the rest are filtered out. In Table 2, we provide examples of the type of basic patterns extracted along with their corresponding templates.
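Steps 5 and 6 can be sketched as a sliding-window match of the templates over tokenized posts. The template list follows Table 2; the function names, the rule that a token found in both sets counts as a connector word, and the `min_count` value are assumptions for illustration.

```python
from collections import Counter

# Templates of size two and three, following Table 2.
TEMPLATES = [
    ("cw", "sw"),
    ("cw", "cw", "sw"),
    ("sw", "cw", "sw"),
    ("sw", "cw", "cw"),
    ("sw", "cw"),
]


def token_type(token, cw_set, sw_set):
    """Return 'cw', 'sw', or None. ASSUMPTION: cw wins if a token is in both sets."""
    if token in cw_set:
        return "cw"
    if token in sw_set:
        return "sw"
    return None


def extract_patterns(posts, cw_set, sw_set, min_count=1):
    """Apply every template exhaustively and replace sw slots with '*'."""
    counts = Counter()
    for tokens in posts:
        types = [token_type(t, cw_set, sw_set) for t in tokens]
        for template in TEMPLATES:
            n = len(template)
            for i in range(len(tokens) - n + 1):
                if tuple(types[i:i + n]) == template:
                    window = zip(tokens[i:i + n], template)
                    pattern = " ".join("*" if ty == "sw" else tok for tok, ty in window)
                    counts[pattern] += 1
    # min_count is a hypothetical frequency threshold.
    return {p: c for p, c in counts.items() if c >= min_count}


patterns = extract_patterns(
    [["love", "you", "mom"], ["thanks", "for", "life"]],
    cw_set={"love", "you", "for"},
    sw_set={"mom", "life", "thanks"},
)
```

Because the sw slot is a wildcard, a pattern such as "love you *" also matches subject words that never appeared in the corpus used to build the graph.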

Enriched Patterns
As they stand, the patterns constructed in the previous step contain limited information relevant to emotion classification. Therefore, the patterns are enriched using continuous word representations so as to preserve the semantic relationships between patterns. The motivation behind this step is to focus on patterns that may be more useful for an emotion classification task. Alternatively, the whole universe of patterns can also be used, but we show in the experiments that the former method significantly improves emotion recognition results.

Table 2
Templates        Pattern Examples
<cw, sw>         stupid *, like *, am *
<cw, cw, sw>     love you *, shut up *
<sw, cw, sw>     * for *
<sw, cw, cw>     * on the
<sw, cw>         * <hashtag>
Pre-trained Word Embeddings: First, we obtain Twitter-based, pre-trained word embeddings from (Deriu et al., 2017) and reweight them via a sentiment corpus through distant supervision (Read, 2005; Go et al., 2009). We trained a fully connected neural network (1 hidden layer) for 10 epochs via backpropagation as in (Deriu et al., 2017). The embedding size is d = 52. Note that term frequency-inverse document frequency (tf-idf) was used to reduce the vocabulary of words (from 140K to 20K words).
Word Clusters: We then apply agglomerative clustering to generate clusters of semantically related words through their word embedding information. To determine the quality of the clusters, they are compared with WordNet-Affect synsets (Strapparava et al., 2004) and tested for both homogeneity and completeness. We use Ward's method (Ward Jr, 1963) as the linkage criterion and cosine distance as the distance metric. The scikit-learn package (Pedregosa et al., 2011) was used to compute a total of k = 1500 clusters.
Enriched-Pattern Construction: The purpose of the word clusters is to enrich the patterns by preserving the semantic relationships between them, which is useful for classification purposes. We achieve this by revising the universe of patterns obtained from the basic pattern extraction step and checking whether the words represented by the sw component exist in any of the word embedding clusters. This is done in an exhaustive manner, ensuring that all possible patterns in the dataset $S_p$ are processed to meet the criteria. Furthermore, patterns that appear fewer than 10 times in dataset $S_p$ are filtered out, producing a total of 476,174 patterns.
The resulting enriched patterns now contain both the semantic information provided by the word embeddings and the contextual information provided through the graph components, hence the term contextualized affect representations.
(The sentiment corpus used to reweight the word embeddings consists of approximately 10 million tweets collected via sentiment emoticons: 5+ million negative and 5+ million positive.)
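One plausible reading of the enrichment step is sketched below. It assumes that each pattern occurrence carries the words that filled its sw slots, that a hypothetical `word_to_cluster` mapping holds the cluster identifier of every clustered word, and that an enriched pattern is the basic pattern paired with the clusters of its fillers. The paper does not spell out this representation, so the data layout and names are illustrative only.

```python
from collections import Counter


def enrich(occurrences, word_to_cluster, min_count=10):
    """Keep pattern occurrences whose sw fillers all belong to a word cluster.

    occurrences: iterable of (pattern, fillers), where fillers is a tuple of the
    words that matched the '*' slots (ASSUMED layout).
    Returns counts keyed by (pattern, cluster ids), dropping rare entries.
    """
    counts = Counter()
    for pattern, fillers in occurrences:
        if all(word in word_to_cluster for word in fillers):
            clusters = tuple(word_to_cluster[word] for word in fillers)
            counts[(pattern, clusters)] += 1
    # The paper filters out patterns that appear fewer than 10 times.
    return {key: n for key, n in counts.items() if n >= min_count}


occurrences = (
    [("love you *", ("mom",))] * 10
    + [("love you *", ("dad",))] * 3
    + [("love you *", ("xyzzy",))] * 20
)
enriched = enrich(occurrences, word_to_cluster={"mom": 7, "dad": 7})
```

In this toy input, "mom" and "dad" share a cluster, so their occurrences are pooled, while the unclustered filler "xyzzy" is discarded.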

Emotion Pattern Weighting
Before using the patterns for classification, they need to be weighted using a weighting mechanism such as tf-idf (Leskovec et al., 2014). The weights determine the importance of patterns to each emotion category. The pattern weighting scheme proposed in this work is a customized version of tf-idf, termed pattern frequency-inverse emotion frequency (pf-ief), and is defined in two steps. Firstly, we compute the pattern frequency $pf_{p,e}$, the logarithmically scaled frequency of a pattern $p$ in a collection of texts related to emotion $e$, from $freq(p, e)$, the frequency of $p$ in $e$.
Then we compute the inverse emotion frequency as
$$ief_p = \log \frac{freq(p, e) + 1}{\sum_{e_j \in E} freq(p, e_j) + 1}$$
where $ief_p$ is a measure of the relevance of pattern $p$ across all emotion categories.
Finally, we obtain a pattern score $ps_{p,e}$, the final score that reflects how important a pattern $p$ is to an emotion class $e$.
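The inverse emotion frequency can be computed directly from per-emotion pattern counts. The sketch below covers only the ief term as written above; the nested-dictionary layout of `freq` and the function name are assumptions, and the pf term and the final score are left out because their exact forms are not reproduced here.

```python
import math


def inverse_emotion_frequency(freq, pattern, emotion):
    """ief = log((freq(p, e) + 1) / (sum over emotions of freq(p, e_j) + 1)).

    freq: dict mapping emotion -> dict mapping pattern -> count (ASSUMED layout).
    """
    in_emotion = freq.get(emotion, {}).get(pattern, 0)
    total = sum(by_pattern.get(pattern, 0) for by_pattern in freq.values())
    return math.log((in_emotion + 1) / (total + 1))


freq = {
    "joy": {"what a *": 9},
    "anger": {"shut up *": 4},
}
ief_joy = inverse_emotion_frequency(freq, "what a *", "joy")
ief_anger = inverse_emotion_frequency(freq, "what a *", "anger")
```

With this formula the value is never positive: it reaches 0 when a pattern occurs only in the given emotion and becomes more negative as the pattern's occurrences concentrate in other emotions.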

Models
In this section, we present the emotion recognition models and comparison models used to evaluate the contextualized affect representations. More details are provided in Appendix A.
CARER: The proposed framework combines a multi-layer CNN architecture with a matrix form of the enriched patterns. The input $X \in \mathbb{R}^{n \times m}$ denotes an embedding matrix where entry $X_{i,j}$ represents the pattern score of enriched pattern $i$ in emotion $j$. $X$ is fed into two 1-D convolutional layers with filters of sizes 3 and 16. The output of this process is passed through a ReLU activation function (Nair and Hinton, 2010) that produces a feature map matrix. A 1-max pooling layer (Boureau et al., 2010) of size 3 is then applied to each feature map. The results of the pooling are fed into two hidden layers of dimensions 512 and 128, in that order, each with a dropout (Hinton et al., 2012) of 0.5 for regularization. We chose a batch size of 128 and trained for 4 epochs using the Adam optimizer (Kingma and Ba, 2014). A softmax function is used to generate the final classifications. We use Keras (Chollet et al., 2015) to implement the CNN architecture.
Baseline Model: As a baseline, we present a first-generation model (CARER_β) that employs primitive enriched patterns (hereinafter denoted by ‡). We adopt the CNN architecture used for CARER; however, this model differs in that the set of patterns used is significantly smaller than the full set of enriched patterns. This is because a different set of primitive pattern templates was used, which captured fewer patterns (187,648). This shows that the proposed method offers flexibility in terms of what templates to use and what size of patterns to generate. This could be useful in cases where there are limited computing and data resources, and for incorporating domain expertise.
Traditional Models: We also compare with various traditional methods (bag of words (BoW), character-level (char), n-grams, and TF-IDF) which are commonly used in sentence classification. To train the models we use the default stochastic gradient descent (SGD) classifier provided by scikit-learn (Pedregosa et al., 2011).
Deep Learning Models: Works that employ deep learning models for emotion recognition vary in their choice of input: pre-trained word/character embeddings and end-to-end learned word/character representations. Our work differs in that we utilize enriched graph-based representations as input. We compare with convolutional neural networks (CNNs), recurrent neural networks (RNNs), bidirectional gated recurrent neural networks (BiGRNNs), and word embeddings (word2vec) (Mikolov et al., 2013).
(For the CARER input matrix, we use a zero-padding strategy as in (Kim, 2014). The symbol ‡ hereinafter refers to the primitive enriched patterns.)

Data
We construct a set of hashtags to collect a separate dataset of English tweets from the Twitter API. Specifically, we use the eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The hashtags (339 total) serve as noisy labels, which allow annotation via distant supervision as in (Go et al., 2009). To ensure data quality, we follow the pre-processing steps proposed by Abdul-Mageed and Ungar (2017) and consider the hashtag appearing in the last position of a tweet as the ground truth. We split the data into training (90%) and testing (10%) datasets. The final distribution of the data and a list of hashtag examples for each emotion are provided in Table 3. In the following section we evaluate the effectiveness of the enriched patterns on several emotion recognition tasks. We use the F1-score as the evaluation metric, which is commonly used in emotion recognition studies due to the imbalanced nature of emotion datasets.
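The labeling rule and the 90/10 split can be sketched as follows. The hashtag-to-emotion mapping shown is a hypothetical three-entry stand-in for the 339 hashtags, and the function names, the shuffling, and the seed are assumptions for illustration.

```python
import random

# ASSUMPTION: a tiny illustrative mapping; the paper uses 339 hashtags.
HASHTAG_TO_EMOTION = {"#joy": "joy", "#happy": "joy", "#angry": "anger"}


def label_tweet(tweet, hashtag_to_emotion=HASHTAG_TO_EMOTION):
    """Return (text, emotion) if the last token is a known emotion hashtag, else None."""
    tokens = tweet.lower().split()
    if not tokens:
        return None
    emotion = hashtag_to_emotion.get(tokens[-1])
    if emotion is None:
        return None  # no emotion hashtag in the last position: tweet is discarded
    return " ".join(tokens[:-1]), emotion


def train_test_split(examples, test_ratio=0.1, seed=0):
    """Shuffle and split examples into training and testing portions."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * (1 - test_ratio)))
    return items[:cut], items[cut:]


labeled = [label_tweet(t) for t in ["What a night #joy", "#joy to the world"]]
train, test = train_test_split(range(10))
```

Only the first tweet receives a label, because the second carries its hashtag in the first position rather than the last.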

Experimental Results
Traditional Features: As shown in Table 4, TF-IDF models produce better results than basic count-based features for both character-level and word-level feature extractors. These findings are consistent with the work of Zhang et al. (2015), where traditional methods, such as n-grams, were found to perform comparably to deep neural networks on various sentence classification tasks.
Pattern-based Approaches: CNN_BASIC, which employs the basic graph-based patterns proposed in Step 6, performs worse than most of the conventional approaches. Both CARER_β and CARER, which use the enriched patterns, achieve better results than CNN_BASIC and all the other conventional approaches. In fact, our method obtains the best F1-score on all eight emotions. We observed significant gains in performance (↑27% and ↑12%) when using the enriched patterns as compared to the basic patterns and primitive patterns ‡, respectively. This highlights the importance of the pattern enrichment procedure and the benefit of refining the pattern templates. Note that the baseline model, CARER_β, also performs better than all the other comparison models, including the state-of-the-art methods (DeepMoji and EmoNet).
Comparison to state-of-the-art: Felbo et al. (2017) proposed a state-of-the-art emotion prediction model, DeepMoji, trained on billions of emoji-labeled tweets. We obtained their pre-trained model (from github.com/bfelbo/deepmoji) and applied it to our dataset. As shown in Table 4, their model performs as well as other traditional methods. However, our model (CARER) significantly outperforms theirs (↑20%). Moreover, we re-implemented the GRNN model proposed in (Abdul-Mageed and Ungar, 2017). We also outperform their model (EmoNet), which manually trains word embeddings, similar to DeepMoji. The CNN_w2v model uses word embeddings trained on billions of tweets (Deriu et al., 2017); thus, it performs better than all the other approaches and comes closer to ours.
Results with Deep Learning: We offer further comparisons with various other deep learning models as evaluated on Ekman's six basic emotions (i.e., sadness, disgust, anger, joy, surprise, and fear). For the RNN_w2v and CNN_char models, different inputs are used, as shown in Table 5. We feed the enriched patterns as embeddings to a bidirectional GRNN, which, along with CARER_EK and CARER_β, outperforms all the other methods.
Contextualized Approaches: DeepMoji is built on a stack of Bi-LSTM layers and performs much better with six emotion classes. However, using the enriched patterns as input, CARER_EK (the proposed model trained on the six-emotion dataset) performs the best (81%). Note that the number of epochs used to train our models is much lower than for the other methods, which makes a strong case for the benefit of contextualizing features prior to training the models. Moreover, the important distinction between connector words and subject words helps to refine and surface relevant contextual information. We also show that the enriched patterns can be applied to other deep learning models besides CNN, such as BiGRNN, which leaves an opportunity to explore more complex architectures and fusion models in the future. More importantly, for problems that require a deeper understanding of contextualized information, there is a need to go beyond traditional features and distributed representations.
Multilingual Capabilities: We also tested the effectiveness of the proposed feature extraction algorithm for the Chinese language. We collected Traditional Chinese datasets from several of Facebook's fan pages and applied the same procedures as for the English datasets. User comments are considered as documents, and the associated user reactions to the root post serve as the emotion labels. For comparison, we obtained Chinese pre-trained word vectors computed through (Bojanowski et al., 2017) and trained a model (fastText_ch) using the proposed CNN architecture. For our approach (CARER_ch), we applied the same CNN architecture to the Chinese-based enriched patterns. As shown in Figure 2, our model performs significantly better on all four emotions (average F1-score of 70%). Overall, we show that the approach is not restricted to any specific language and that the enriched features are applicable to other languages and data sources. In the future, we seek to expand our methods to support other complex languages, such as Japanese, French, and Spanish, for which fewer linguistic resources tend to exist.
Analysis of Enriched Patterns

Pattern Coverage and Consistency
One of the advantages of the contextualized enriched patterns is that they possess high coverage due to the way they were constructed. High coverage also means that the enriched patterns should demonstrate stability, in terms of how useful they are in an emotion classification task, even when reduced to smaller sizes. There are two cases where this could be useful: limited data and limited computing resources. Therefore, to test for pattern consistency, we randomly selected several pattern sizes (using the primitive patterns ‡ employed in CARER_β) and trained a random forest classifier using the eight emotions dataset. This model performs comparably to CARER_β (average F1-score of 65%), and it has the benefit of faster training time, making it suitable for the aforementioned experiment. We compared with the results obtained from the LIWC lexicon (affect dimension), word2vec (Mikolov et al., 2013), and tweet2vec (Deriu et al., 2017). As shown in Figure 3, due to the limited coverage of the LIWC lexicon, such resources may not be feasible on evolving, large-scale datasets. In contrast, word2vec contains over 3 million unique word embeddings and has been proven effective for text classification. However, if we keep reducing the available word vectors of word2vec, which is common when there are limited computing resources, the accuracy keeps dropping at significant rates. tweet2vec shows a similar effect. In the case of our patterns, the classification results remain relatively stable, even when reducing the patterns to 30% and 10% of the original size.
These results show that the proposed features are a feasible way to address the text-based emotion recognition problem. Moreover, the patterns are highly beneficial where there is a shortage of computing and linguistic resources.
[Table caption, partially recovered: "... is the proposed CNN model and word vectors obtained from (Deriu et al., 2017); char refers to character-level features; ngram employs unigrams, bigrams, and trigrams as features; CNN_BASIC uses the proposed CNN architecture with basic patterns; EmoNet (Abdul-Mageed and Ungar, 2017) and DeepMoji (Felbo et al., 2017) are state-of-the-art emotion recognition models."]
[Table caption, partially recovered: "Words in bold pink denote the connector word/s cw in the pattern; e.g., "it" is the cw of patterns "need it" and "hoping it". GT stands for ground truth. DeepMoji, EmoNet, and CARER_EK correspond to the models reported in Table 5."]
In Table 6, we provide samples extracted from the testing data. The examples show different cases where the comparison models struggled to capture important contextual information that helps to determine the emotion conveyed in the text. For instance, in the short text "damn what a night", only our model was able to interpret the statement as joy, because it uses the "what a" pattern and its corresponding subject words to determine that this statement has a stronger association with joy. Our model also works well for capturing rare words and for disambiguating emotional meaning using the enriched and refined contextual information of the patterns. Rare words like "whaaaaaat" and "thee" help to implicitly convey intense emotional expressions, which are also captured and considered important by our enriched patterns. Emotion-relevant verbs, such as "want" and "going", are also considered important context that helps to convey and interpret emotion. Overall, the enriched patterns efficiently capture important emotional information that other models seem to ignore.

Conclusion
We proposed a graph-based feature extraction mechanism to extract emotion-relevant representations in a semi-supervised manner. The contextualized affect representations are further enriched with word embeddings and are used to train several deep learning-based emotion recognition models. The patterns capture implicit and explicit linguistic emotional information, which significantly improves emotion recognition results.
We offered a detailed analysis demonstrating special cases where the patterns are helpful to further extract and understand emotional information from textual information. For instance, short text is a challenging problem in emotion recognition and various natural language tasks; the proposed contextualized patterns show promising results in addressing this issue by helping the models to capture nuanced information which is useful to determine the overall emotion expressed in a piece of text. The proposed method paves the way for building more interpretable emotion recognition systems, which have various implications when investigating human behavioural data (Saravia et al., 2015, 2016b) and building empathy-aware conversational agents.
In future work, we aim to investigate the graph-based patterns in more depth and provide a more comprehensive and advanced theoretical discussion of how they are constructed. We also hope to keep improving the pattern weighting mechanism so as to improve the overall performance on emotion recognition tasks and minimize the trade-off between pattern coverage and performance. We plan to employ transfer learning methods with the proposed enriched patterns and test on other emotion-related problems such as sentiment classification and sarcasm detection. The proposed methodology is also being expanded to support Spanish and Japanese emotion recognition tasks.