NELEC at SemEval-2019 Task 3: Think Twice Before Going Deep

Existing machine-learning techniques yield close to human performance on text-based classification tasks. However, the presence of multi-modal noise in chat data, such as emoticons, slang, spelling mistakes, and code-mixed text, makes existing deep-learning solutions perform poorly. The inability of deep-learning systems to robustly capture these covariates caps their performance. We propose NELEC: Neural and Lexical Combiner, a system that combines textual and deep-learning based methods for sentiment classification. We evaluate our system on Task 3 of SemEval-2019, 'Contextual Emotion Detection in Text'. Our system performs significantly better than the baseline, as well as our deep-learning model benchmarks. It achieved a micro-averaged F1 score of 0.7765, ranking 3rd on the test-set leaderboard. Our code is available at https://github.com/iamgroot42/nelec


Introduction
Sentiment analysis of textual data, such as Twitter data (Kouloumpis et al., 2011; Pak and Paroubek, 2010), movie reviews (Thet et al., 2010), and product reviews (Pang et al., 2008), is perhaps the most extensively explored text-classification problem, with a large body of research addressing it. Recent systems utilise deep-learning architectures to achieve near-human performance on clean, well-formatted data. However, sentiment classification of chat data remains challenging. The presence of spelling errors, slang, emoticons, code-mixing, idiosyncratic writing styles and abbreviations makes it significantly harder for existing deep-learning models to work on such data.
Literature dealing with this problem comprises a wide range of approaches, from hand-crafted features to end-to-end deep-learning methods. Some rule-learning based methods use keyword-based analysis (Ko and Seo, 2000) and part-of-speech tagging (Agarwal et al., 2011). These procedures require extensive human involvement for identifying keywords and designing rules, and are thus not scalable.
Non-neural machine-learning methods utilize feature extraction algorithms like n-grams and TF-IDF vectors, coupled with classification algorithms like Naive Bayes (Pang et al., 2002), Decision Trees (Bilal et al., 2016), and SVM (Moraes et al., 2013). These approaches perform significantly better than rule-based approaches but fail to capture context well, since they ignore the order of words in text sequences.
* Equal contribution, order determined by coin toss.

[Table 1: Dataset statistics for the Train, Dev and Test splits.]

Neural, deep-learning based approaches use architectures such as variations of recurrent models, namely GRU (Chung et al., 2014), LSTM (Hochreiter and Schmidhuber, 1997) and BiLSTM (Schuster and Paliwal, 1997), as well as convolutional models (Mundra et al., 2017), and perform significantly better than other machine-learning techniques. Their ability to generalise and capture context over long sequences makes them a popular choice for text classification tasks.
We propose NELEC, a novel system specifically designed for sentiment classification. It combines lexical and neural features, followed by class-specific thresholds for better labelling. Our system yields an F1 score of 0.7765 on the test set of Task 3 of SemEval-2019.

Deep Learning Model
We experiment with a two-layer, recurrent, deep-learning model with skip connections, bidirectional cells and attention (Figure 1). We trained our model for 100 epochs with Cyclic Learning Rate (Smith, 2017) scheduling. This model outperforms the baseline by a significant margin. An in-depth analysis of the cases where it fails reveals its shortcomings (along with those of deep-learning models in general): it is not robust to misspellings and cannot reliably capture the meaning of out-of-vocabulary words. Even though pre-trained embeddings are available for most words, the context in which these words are used in chat may differ from the corpora the embeddings were trained on, which lowers their usefulness.
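For reference, the triangular policy from Smith (2017) can be sketched as below. This is a minimal illustration only: the text does not state which cyclic policy or which hyper-parameters were used, so `base_lr`, `max_lr` and `step_size` are assumed values.

```python
import math


def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Triangular cyclic learning rate (Smith, 2017).

    The learning rate rises linearly from base_lr to max_lr over step_size
    iterations, then falls back over the next step_size iterations.
    All default values here are illustrative assumptions.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

The schedule starts at `base_lr`, peaks at `max_lr` after `step_size` iterations, and returns to `base_lr` after `2 * step_size` iterations.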

NELEC : Neural and Lexical Combiner
Since neural features alone have the shortcomings described above, we shift our focus to lexical features. Using a combination of lexical features (n-gram features, etc.) and neural features (scores from neural classifiers), we trained a standard LightGBM (Ke et al., 2017) model for 100 iterations, with feature sub-sampling of 0.7 and data sub-sampling of 0.7, using bagging with a frequency of 1.0. As regularization we use $10^{-2} \cdot \lVert \text{weights} \rVert^{2}$. We also experimented with a logistic regression model, but it showed a significant drop in performance for the 'happy' and 'angry' classes (Table 2). The total number of features used is 9270, of which 9189 are sparse. The features we use in our model are described in the sections below:
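One plausible way to express the settings above with LightGBM's native parameter names is sketched here. The mapping itself is an assumption: the text does not list the exact parameter dictionary, and `objective` and `num_class` are inferred from the four-label task rather than stated.

```python
# Hypothetical mapping of the described settings onto LightGBM parameter names.
# "objective" and "num_class" are assumptions based on the four class labels.
nelec_gbm_params = {
    "objective": "multiclass",
    "num_class": 4,
    "num_iterations": 100,    # "trained ... for 100 iterations"
    "feature_fraction": 0.7,  # feature sub-sampling
    "bagging_fraction": 0.7,  # data sub-sampling
    "bagging_freq": 1,        # re-sample the data at every iteration (integer-valued)
    "lambda_l2": 1e-2,        # coefficient of the squared-weights penalty
}
```

A dictionary of this form would typically be passed as the `params` argument when training a LightGBM booster.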

Turn Wise Word n-Grams
Word-level bi-grams and tri-grams (skip 1). These help capture patterns like "am happy" and, because of the skipped word, automatically handle unseen variants such as "am very happy" or "am so happy". We take the term frequencies of these n-grams as features. The word grams not|good, hate and no|one had the highest feature gains.
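A minimal sketch of skip-1 word n-gram counting follows. It assumes that "skip 1" means adjacent members of an n-gram may be separated by at most one skipped token, and that tokens are obtained by whitespace splitting; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def skip_ngrams(tokens, n, max_skip=1):
    """Return n-grams whose consecutive members are at most max_skip tokens apart."""
    grams = []

    def extend(prefix, last_idx):
        if len(prefix) == n:
            grams.append("|".join(prefix))
            return
        for gap in range(max_skip + 1):
            nxt = last_idx + 1 + gap
            if nxt < len(tokens):
                extend(prefix + [tokens[nxt]], nxt)

    for start in range(len(tokens)):
        extend([tokens[start]], start)
    return grams


def word_ngram_features(turn):
    """Term frequencies of skip-1 bi-grams and tri-grams for a single turn."""
    tokens = turn.lower().split()
    return Counter(skip_ngrams(tokens, 2) + skip_ngrams(tokens, 3))
```

With this definition, both "am happy" and "am very happy" produce the feature `am|happy`, which is the generalisation effect described above.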

Turn Wise Char n-Grams
Character-level bi-grams and tri-grams. This feature helps capture character-level trends such as "haha" (and its variants), as well as emoticons. It helps with misspellings and makes the system robust to variants of words like "haha". The character grams h|a|h and w|o|w were among those with the highest feature gains.
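The character-level counterpart can be sketched in a few lines. The function name is hypothetical, and the choice to count over the raw lower-cased turn (spaces included) is an assumption.

```python
from collections import Counter


def char_ngram_features(turn, sizes=(2, 3)):
    """Term frequencies of character bi-grams and tri-grams for a single turn."""
    text = turn.lower()
    feats = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            # Join characters with "|" to mirror the h|a|h notation used in the text.
            feats["|".join(text[i:i + n])] += 1
    return feats
```

Variants such as "haha", "hahaha" and "hahaa" all share the `h|a|h` feature, which is why these features tolerate elongations and misspellings.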

Valence Arousal Dominance
We used Valence-Arousal-Dominance data (Mohammad, 2018) in the following manner:

Emotion Intensity
We use EmoLex (Mohammad and Turney, 2010), which associates words to eight emotions and two sentiments. For each turn, we obtain the number of words having specific emotions and sentiment and use it as a feature.
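The per-turn counting can be sketched as follows. The lexicon below is a toy stand-in with made-up entries for illustration; it is not the actual EmoLex data, and the function name is hypothetical.

```python
from collections import Counter

# Toy stand-in for EmoLex: made-up entries, for illustration only.
TOY_LEXICON = {
    "lonely": {"sadness", "negative"},
    "hate": {"anger", "disgust", "negative"},
    "happy": {"joy", "positive"},
}


def emotion_counts(turn, lexicon=TOY_LEXICON):
    """Count, per emotion or sentiment label, the words in a turn carrying it."""
    counts = Counter()
    for word in turn.lower().split():
        for label in lexicon.get(word, ()):
            counts[label] += 1
    return counts
```

Each label count then becomes one feature per turn, giving ten such features (eight emotions and two sentiments) when the full lexicon is used.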

Neural Features
We used scores obtained from available pre-trained classifiers as features:

Lexical Count Features
Lastly, we used count features for each turn: the number of question marks, exclamation marks and uppercase letters, and the total number of words and letters. These features were observed to be very helpful for detecting anger and happiness.
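These counts are straightforward to compute; a minimal sketch is given below. The function and key names are hypothetical, and whitespace splitting for the word count is an assumption.

```python
def lexical_counts(turn):
    """Simple surface-level counts for a single (un-lowercased) turn."""
    return {
        "question_marks": turn.count("?"),
        "exclamation_marks": turn.count("!"),
        "uppercase_letters": sum(ch.isupper() for ch in turn),
        "words": len(turn.split()),
        "letters": sum(ch.isalpha() for ch in turn),
    }
```

Note that the uppercase count has to be taken before any lower-casing is applied to the turn.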

Data Preparation
The training, development and test sets consist of 30160, 2755 and 5509 examples respectively. The final model is trained on the combined training and development sets. For each instance, one of four class labels, {happy, angry, sad, other}, is provided. Table 1 provides some statistics for the given dataset. We concatenate all three turns per conversation. For the deep-learning approach, a special eos token is inserted between the turns.

Pre-processing for NELEC

Lemmatization:
Contrary to intuition, lemmatization decreased the final performance of our model. Further analysis suggests that emotion is highly sensitive to exact word forms: the information carried by "hate" and "hated" is very different, even though a lemmatizer would reduce them to the same word; the same holds for "happy" versus "happiest". Using lemmatization drops the system's F1 score by 0.0092.
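The turn concatenation can be sketched as below. The literal spelling `<eos>` and the function name are assumptions; the text only states that a special eos token separates the turns for the deep-learning model.

```python
EOS = "<eos>"  # assumed spelling of the separator token


def join_turns(turns, for_deep_model):
    """Concatenate the turns of a conversation into a single string.

    For the deep-learning model an eos token separates the turns;
    otherwise the turns are joined with a plain space.
    """
    sep = f" {EOS} " if for_deep_model else " "
    return sep.join(t.strip() for t in turns)
```

For example, `join_turns(["hi", "hello", "i am sad"], True)` gives `"hi <eos> hello <eos> i am sad"`.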

WordNet for Synonyms:
We also tried using synonyms for nouns from the WordNet graph (Miller, 1998). However, a similar issue plagues this approach. For instance, "dog", "doggie" and "puppy" are all synonyms, but they do not express the same kind of emotion: words like "puppy" convey much more positive emotion. Using WordNet drops the system's F1 score by 0.0023.

Normalization:
We tried word tokenization and normalization by removing diacritics, numbers, stop-words, question marks, etc. However, this also dropped the F1 score by 0.0046.
Character n-gram features can handle inflected forms (the purpose of lemmatization) as well as misspellings in most cases, without discarding any information. Consequently, the only pre-processing we apply is lower-casing.

Table 3: Micro-averaged F1 scores when all features apart from these (per row) are dropped. F1 gain here refers to the gain when using the feature mentioned, as opposed to dropping it.

Pre-processing for Deep-Learning based Approach
We use pre-trained GloVe (Pennington et al., 2014) embeddings. Some observations are:
• Emoticons: Around 15% of all conversations include at least one emoticon. We use embeddings from a pre-trained emoji2vec (Eisner et al., 2016) model to handle emoticons.
• Words with repeated characters: This trend is common for chat-data. For example, "heelloo", "ookayy". We design specific regular expressions to handle such variations.
• Abbreviations and slang: tokens such as "idk", "irl" are converted to their full forms.
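The last two steps can be sketched as follows. This is not the authors' implementation: the text does not give their regular expressions or their abbreviation list, so the vocabulary, the abbreviation table and the vocabulary-lookup strategy below are all illustrative assumptions.

```python
import re
from itertools import product

VOCAB = {"hello", "okay", "good"}  # toy vocabulary (assumption)
ABBREVIATIONS = {"idk": "i do not know", "irl": "in real life"}  # toy table


def squeeze_repeats(word, vocab=VOCAB):
    """Shorten runs of repeated characters until a known word is found."""
    if word in vocab:
        return word
    # Split the word into runs of identical characters, e.g. "h", "ee", "ll", "oo".
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", word)]
    # Each repeated run may shrink to one or two copies of its character.
    options = [(r[0],) if len(r) == 1 else (r[0], r[0] * 2) for r in runs]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate in vocab:
            return candidate
    return word  # leave unknown words untouched


def normalize_chat(text):
    """Expand abbreviations and repair elongated words, token by token."""
    out = []
    for tok in text.lower().split():
        out.append(ABBREVIATIONS.get(tok, squeeze_repeats(tok)))
    return " ".join(out)
```

Under these assumptions, `normalize_chat("heelloo idk ookayy")` yields `"hello i do not know okay"`. Checking candidates against a vocabulary avoids damaging legitimate double letters, as in "hello".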

Experiments
To assess the effectiveness of our system, we report both class-wise and micro-averaged F1 scores on the test set. We also compare our performance with the benchmarks provided by the contest organizers (Chatterjee et al., 2019a). As mentioned in Section 3.2, data pre-processing leads to significant performance gains for deep-learning models, while it leads to a drop in performance for NELEC. NELEC outperforms both the baseline and our deep model by a considerable margin (Table 2).
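For concreteness, a micro-averaged F1 score can be computed as sketched below. The sketch assumes that the average is taken over the three emotion classes only, with the remaining label treated as background; this follows our understanding of the task's evaluation and is not spelled out in the text above. The label strings are likewise assumed.

```python
def micro_f1(gold, pred, classes=("happy", "sad", "angry")):
    """Micro-averaged F1 over the given classes; other labels act as background."""
    pairs = list(zip(gold, pred))
    tp = sum(g == p and g in classes for g, p in pairs)
    fp = sum(p in classes and g != p for g, p in pairs)
    fn = sum(g in classes and g != p for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because true positives, false positives and false negatives are pooled across classes before precision and recall are computed, frequent classes weigh more heavily than in a macro average.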

Ablation Study
To analyze the usefulness of the features used by NELEC, we perform hold-one-out experiments on its features (Section 2.2). Results are reported in Table 3. There is a noticeable gain for most of the features, with character n-grams showing the largest gain of all.
One of the most intriguing patterns observed is the ease with which the models detect sadness and the corresponding difficulty they have in detecting happiness.
• Words like "haha" and "okay" have several forms which all convey different magnitudes of emotion. While lemmatising such words, there is a significant loss of information.
• Most of the conversations labelled sad have easy-to-recognize signals such as negative emoticons, keywords like "lonely", which make detection easy. On the other hand, differentiating happy and others is non-trivial.
• Not using the second turn, along with its associated features, leads to a negligible drop in F1 performance. This observation highlights the importance of the first user (in data) in analyzing sentiment. Moreover, we can utilize this information to make the feature set even smaller, making the model smaller and faster.

Conclusion
We propose a deep neural architecture to solve the problem of emotion detection in conversations from chat data. Although it outperforms the existing baseline, its performance is not satisfactory. To better capture lexical features and make the model robust to misspellings, abbreviations, emoticons, etc., we propose NELEC, a Neural and Lexical Combiner. Our model utilises lexical features, along with signals from pre-trained neural models for sentiment and adult-offensive classification, to boost performance. Our system performs on par with the existing state of the art, yielding a micro-averaged F1 score of 0.7765 on the test set and ranking 3rd on the test-set leaderboard.