ECNU at SemEval-2018 Task 2: Leverage Traditional NLP Features and Neural Networks Methods to Address Twitter Emoji Prediction Task

This paper describes our submissions to SemEval-2018 Task 2, Multilingual Emoji Prediction. We first investigate several traditional Natural Language Processing (NLP) features and then design several deep learning models. For subtask 1 (Emoji Prediction in English), we combine two different methods of representing a tweet: a supervised model using traditional features and a deep learning model. For subtask 2 (Emoji Prediction in Spanish), we use only the deep learning model.


Introduction
Visual icons such as emojis play a crucial role in social media, conveying an extra layer of information beyond the text. SemEval-2018 Task 2, Multilingual Emoji Prediction (Barbieri et al., 2017, 2018), asks participants to predict, given a tweet in English or Spanish, its most likely associated emoji. The task is organized into two optional subtasks: subtask 1 for English and subtask 2 for Spanish.
For subtask 1, we adopt a combination model to predict emojis, which consists of traditional Natural Language Processing (NLP) methods and deep learning methods. The final result is obtained by voting over the results returned by the classifier with traditional NLP features, by the neural network model and by the combination model. For subtask 2, we use only the deep learning model.

System Description
For subtask 1, we explore three different methods: (1) training a supervised machine learning classifier on traditional NLP features, (2) training a deep learning model to make predictions, and (3) combining the features captured by the neural network with traditional NLP features to train a supervised machine learning classifier. For subtask 2, we simply use the deep learning method to make predictions.

Traditional NLP Features
In this task, we extract the following three types of features to capture effective information from the given tweets, i.e., linguistic features, sentiment lexicon features and tweet specific features.
• POS: Generally, sentences carrying subjective emotions tend to contain more adjectives and adverbs, while sentences without sentiment orientation contain more nouns. Thus, we extract the POS tags of the sentence as features in Bag-of-Words form.
• Correlation Degree: For each word that appears in the training data, the ratio of its number of occurrences under each class to its total number of occurrences is taken as the correlation degree of the word to that class. When the feature is created, the sum of the correlation degrees of the words in a tweet is taken as the correlation degree of the tweet to that class:
$$\mathrm{CorrDeg}(s, c_i) = \sum_{t=1}^{|s|} \frac{O(w_t, c_i)}{\sum_{j=1}^{N} O(w_t, c_j)}$$
where |s| is the length of the tweet, N is the number of classes, w_t is the t-th word in the tweet, c_i is the i-th class, and O(w_t, c_i) denotes the number of tweets of class c_i that contain w_t. The dimension of this feature equals the number of classes, and each value is the correlation degree of the tweet to the corresponding class, i.e., CorrDeg(s, c_i).
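The Correlation Degree feature described above can be illustrated with a minimal sketch. The function names (`fit_counts`, `corr_deg`), the toy data, and the choice to let unseen words contribute nothing are assumptions made for illustration, not details taken from our system.

```python
from collections import Counter, defaultdict

def fit_counts(tweets, labels):
    """Count O(w, c): the number of training tweets of class c that contain word w."""
    counts = defaultdict(Counter)
    for tokens, label in zip(tweets, labels):
        for word in set(tokens):  # a tweet counts once per distinct word
            counts[word][label] += 1
    return counts

def corr_deg(tokens, counts, num_classes):
    """Return one value per class: the sum over words of O(w, c) / sum_j O(w, c_j)."""
    scores = [0.0] * num_classes
    for word in tokens:
        per_class = counts.get(word)
        if not per_class:
            continue  # assumption: words unseen in training contribute nothing
        total = sum(per_class.values())
        for c, n in per_class.items():
            scores[c] += n / total
    return scores

# Toy training data with two classes (hypothetical).
train = [["happy", "day"], ["happy", "sun"], ["sad", "day"]]
labels = [0, 0, 1]
counts = fit_counts(train, labels)
features = corr_deg(["happy", "day"], counts, num_classes=2)
print(features)
```

In this toy case "happy" occurs only in class 0 and "day" is split evenly between the two classes, so the tweet is pulled mostly toward class 0.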

Sentiment Lexicon Features (SentiLexi)
We also extract sentiment lexicon features (SentiLexi) to capture the sentiment information of the given sentence. Given a tweet, we first convert all words into lowercase. Then, for each sentiment lexicon, we calculate the following six scores for the tweet: (1) the ratio of positive words to all words, (2) the ratio of negative words to all words, (3) the maximum sentiment score, (4) the minimum sentiment score, (5) the sum of sentiment scores, and (6) the sentiment score of the last word in the tweet. If a word does not exist in a sentiment lexicon, its corresponding score is set to 0. The following 8 sentiment lexicons are adopted in our systems: Bing Liu lexicon, General Inquirer lexicon, IMDB, MPQA, NRC Emotion Sentiment Lexicon, AFINN, NRC Hashtag Sentiment Lexicon, and NRC Sentiment140 Lexicon.
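The six per-lexicon scores can be sketched as follows. This is an illustration under stated assumptions: the lexicon is modeled as a plain dictionary from word to real-valued score, tokens are split on whitespace, and a word counts as positive or negative according to the sign of its score; the name `lexicon_scores` and the toy lexicon are hypothetical.

```python
def lexicon_scores(tweet, lexicon):
    """Compute the six SentiLexi scores of one tweet for one lexicon."""
    tokens = tweet.lower().split()  # assumption: whitespace tokenization
    if not tokens:
        return [0.0] * 6
    # Words missing from the lexicon receive a score of 0.
    scores = [lexicon.get(tok, 0.0) for tok in tokens]
    n = len(tokens)
    return [
        sum(s > 0 for s in scores) / n,  # (1) ratio of positive words
        sum(s < 0 for s in scores) / n,  # (2) ratio of negative words
        max(scores),                     # (3) maximum sentiment score
        min(scores),                     # (4) minimum sentiment score
        sum(scores),                     # (5) sum of sentiment scores
        scores[-1],                      # (6) score of the last word
    ]

toy_lexicon = {"love": 3.0, "great": 2.0, "hate": -3.0}  # hypothetical entries
features = lexicon_scores("I LOVE this great day", toy_lexicon)
print(features)
```

If every lexicon contributes all six scores, concatenating the outputs over the eight lexicons would give a 48-dimensional feature vector.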

Tweet Specific Features
• Punctuation: Since users often use exclamation marks and question marks to express strong surprise or doubt, we extract a 7-dimensional punctuation feature that records rule-based patterns of punctuation marks in the tweet.
• All-caps: One binary feature indicates whether the tweet contains words in uppercase.
• Bag-of-Hashtags: We construct a vocabulary of hashtags appearing in the training data and then adopt the bag-of-hashtags method for each tweet.
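The All-caps and Bag-of-Hashtags features above can be sketched as below. The hashtag pattern, the lowercasing of hashtags, and the requirement that an all-caps word has at least two letters are assumptions for illustration; the seven punctuation rules are not reproduced because they are not spelled out here.

```python
import re

HASHTAG = re.compile(r"#\w+")  # assumption: a hashtag is '#' followed by word characters

def build_hashtag_vocab(tweets):
    """Map every hashtag seen in the training tweets to a column index."""
    vocab = {}
    for tweet in tweets:
        for tag in HASHTAG.findall(tweet.lower()):
            vocab.setdefault(tag, len(vocab))
    return vocab

def bag_of_hashtags(tweet, vocab):
    """Count occurrences of each known hashtag in the tweet."""
    vec = [0] * len(vocab)
    for tag in HASHTAG.findall(tweet.lower()):
        if tag in vocab:
            vec[vocab[tag]] += 1
    return vec

def has_all_caps_word(tweet):
    """Binary feature: 1 if some alphabetic word of length > 1 is fully uppercase."""
    return int(any(tok.isalpha() and tok.isupper() and len(tok) > 1
                   for tok in tweet.split()))

train = ["Best day EVER #sunny #beach", "so tired #monday"]
vocab = build_hashtag_vocab(train)
tweet = "back to the #beach"
x = bag_of_hashtags(tweet, vocab) + [has_all_caps_word(tweet)]
print(vocab, x)
```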

Deep Learning Modules
In addition to manually constructing features, we build deep neural models to capture the semantics of the text. Figure 1 shows the network structure of our model. The input of the network is a tweet, i.e., a sequence of words. The output of the network is a vector with one element per emoji class.

Word-Level Representations
We concatenate a pre-trained word embedding with a char embedding, a POS embedding and an NER embedding to obtain the final representation of each word. These embeddings are learned together with the updates to the model.
• Word Embedding: A word embedding is a continuous-valued vector representation of a word, which can capture meaningful syntactic and semantic regularities. In this task, we use the 300-dimensional word vectors pre-trained on Twitter and provided by the SemEval task organizers, available in SWM.
• Char Embedding: We randomly initialize the character representations and compute character-based continuous-space vector embeddings of the words in tweets with a bidirectional LSTM. The dimension of the char embedding is 50.
• POS Embedding: We randomly initialize the representation of each POS tag in the tweet with a vector size of 50.
• NER Embedding: We also randomly initialize the representation of each NER tag in the tweet with a vector size of 50.
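The concatenation of the four embeddings can be illustrated with a small sketch of the dimension bookkeeping. All vectors here are random stand-ins (an assumption): in the actual model the word vector is pre-trained and the char vector is produced by a bidirectional LSTM over characters, neither of which is reproduced in this example.

```python
import random

# Dimensions stated in the text above.
WORD_DIM, CHAR_DIM, POS_DIM, NER_DIM = 300, 50, 50, 50

def random_vector(dim, rng):
    """Stand-in for a looked-up or computed embedding (illustration only)."""
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]

def word_representation(word_vec, char_vec, pos_vec, ner_vec):
    """Concatenate the four embeddings into one word-level representation."""
    return word_vec + char_vec + pos_vec + ner_vec

rng = random.Random(0)
rep = word_representation(
    random_vector(WORD_DIM, rng),
    random_vector(CHAR_DIM, rng),
    random_vector(POS_DIM, rng),
    random_vector(NER_DIM, rng),
)
print(len(rep))  # 300 + 50 + 50 + 50
```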

Sentence-Level Representations
• Bi-Directional LSTM: To capture as much contextual information as possible when learning word representations, and to model the tweet with both its preceding and following contexts, we apply a Bi-directional Long Short-Term Memory network (BiLSTM; Graves et al., 2005), as shown in Figure 1.
• Attention Mechanism: Since not all words contribute equally to the meaning of a sentence, we introduce an attention mechanism (Bahdanau et al., 2014) to identify the words that are important to the meaning of the sentence and aggregate the representations of those informative words into a sentence vector.
We first use the BiLSTM and the attention mechanism to obtain a sentence-level representation and then concatenate it with several effective NLP features. Finally, we use a Multi-layer Perceptron (MLP) and output the probability of each emoji label with a softmax function. The BiLSTM has a hidden size of 512. The MLP has one hidden layer of size 200 with ReLU non-linearity.
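The aggregation step of the attention mechanism can be sketched as a weighted sum of hidden states. The sketch below uses a simplified dot-product scoring against a single context vector, which is an assumption made for brevity and not necessarily the scoring function of our model; `attention_pool` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(hidden_states, context):
    """Weight each hidden state by softmax(h_t . context) and sum them."""
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights

# Two toy "BiLSTM outputs" of dimension 2 (stand-ins for the real hidden states).
states = [[1.0, 0.0], [0.0, 1.0]]
uniform_vec, uniform_w = attention_pool(states, context=[0.0, 0.0])
focused_vec, focused_w = attention_pool(states, context=[2.0, 0.0])
print(uniform_w, focused_w)
```

With a zero context vector every word receives the same weight and the sentence vector is the plain average; a context vector aligned with the first state shifts the weight toward the first word.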
To learn the model parameters, we minimize the KL-divergence between the outputs and the gold labels. We adopt Adam (Kingma and Ba, 2014) as the optimization method and set the learning rate to 0.01.
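As a worked illustration of the training objective: if the gold label is encoded as a one-hot distribution (an assumption; the encoding is not specified above), the KL-divergence reduces to the negative log-probability that the model assigns to the gold class. The function name and the toy distributions below are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

gold = [0.0, 1.0, 0.0]   # one-hot gold label (assumed encoding)
pred = [0.2, 0.7, 0.1]   # softmax output of the model (toy values)
loss = kl_divergence(gold, pred)
print(loss, -math.log(pred[1]))  # the two values coincide up to rounding
```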

Datasets
For the training sets, the organizers provide only a list of tweet IDs and a script with which participants collect the tweets. Since not all tweets are still available at download time, participants may collect slightly different numbers of training tweets. In addition, we find that the crawled training sets and the trial sets provided by the organizers overlap by 37.26% in English and 71.16% in Spanish. We therefore remove the duplicate data and combine the training and trial sets to perform 3-fold cross-validation. Table 1 shows the statistics of the tweets we collected for our experiments. In subtask 1, class 0 is the largest, accounting for 22.28% of the tweets, followed by class 1 and class 2 with 10.37% and 10.20%, respectively; the other 17 classes each account for between 2.46% and 5.51%. Subtask 2 has a similar data distribution.
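The data preparation described above (removing duplicates, merging the training and trial sets, and splitting into three folds) can be sketched as follows. Detecting duplicates by exact text match and assigning folds in a round-robin fashion are assumptions for illustration; the helper names and toy data are hypothetical.

```python
def dedup_and_merge(train, trial):
    """Merge two lists of (text, label) pairs, keeping the first copy of each text."""
    seen = set()
    merged = []
    for text, label in train + trial:
        if text not in seen:  # assumption: duplicates are exact text matches
            seen.add(text)
            merged.append((text, label))
    return merged

def k_fold_indices(n, k=3):
    """Yield (train_indices, test_indices) for each of k round-robin folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train_idx, folds[i]

train_set = [("a", 0), ("b", 1), ("c", 2)]
trial_set = [("b", 1), ("d", 0)]
merged = dedup_and_merge(train_set, trial_set)
splits = list(k_fold_indices(len(merged), k=3))
print(len(merged), splits)
```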

Learning Algorithm
Considering the large dimension of the features designed by traditional NLP methods, we use the Logistic Regression (LR) algorithm, as implemented in Liblinear, to build the classification models.

Evaluation Metrics
The official evaluation measure is Macro F-score, which rewards sensitivity to the use of emojis in general rather than, for instance, overfitting a model to do well on the three or four most common emojis of the test data. Macro F-score is simply the average of the individual label-wise F-scores. Table 2 compares the contributions of different features under cross-validation with the Logistic Regression algorithm. From the results in Table 2, we make the following observations:
(2) The Correlation Degree feature contributes more than the other features, as it reflects the degree of relevance between tweets and emoji labels.
(3) The bigram feature contributes and is more effective than the unigram feature. The reason may be that the bigram feature captures more contextual information and word order.
(4) The SentiLexi feature also contributes, which indicates that SentiLexi features are beneficial not only in traditional sentiment analysis tasks but also in predicting the emoji of a tweet.

Comparison of Deep Learning Modules
Table 3 shows the results of the different deep learning models described above. From Table 3, we make the following observations:
(1) We explore the performance of three different deep learning models: Neural Bag-of-Words (NBOW; Iyyer et al., 2015), Convolutional Neural Network (CNN; Collobert et al., 2011) and Bi-directional Long Short-Term Memory network (BiLSTM; Hochreiter and Schmidhuber, 1997). For this comparison, all models use only the pre-trained word embedding. BiLSTM clearly outperforms the other models on this task, so our deep learning model is based on BiLSTM.
(2) The POS embedding contributes more than the other word-level representations. Since the POS embedding can learn emotional tendencies, it is beneficial for tweet emoji prediction.
(3) The results in the last two rows show that combining both SentiLexi and Punctuation features with the sentence representation when training the deep learning model brings a further improvement.

Combination and Ensemble
For subtask 1, we also use the trained neural network described in Section 4.2 to capture the features of tweets and combine them with traditional NLP features to train a Logistic Regression classifier, which we name the Combination Model. Table 4 shows the results of the different methods. We find that the combination model improves the performance and that the ensemble of the three methods achieves the best result. This suggests that the traditional NLP methods and the deep learning models are complementary to each other.
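The ensemble of the three methods is obtained by voting. A minimal sketch for three models is given below; how ties are resolved when all three models disagree is not specified above, so falling back to one designated model is an assumption, as are the function name and the example labels.

```python
from collections import Counter

def vote(predictions, fallback=0):
    """Majority vote over the labels predicted by three models for one tweet.

    predictions: list of three labels, one per model.
    fallback: index of the model to trust when all three disagree (assumption).
    """
    label, n = Counter(predictions).most_common(1)[0]
    if n > 1:
        return label  # at least two of the three models agree
    return predictions[fallback]

# Labels from the traditional-feature model, the deep model and the combination model.
print(vote([3, 3, 7]))              # two models agree on label 3
print(vote([1, 2, 3], fallback=2))  # full disagreement: trust the third model
```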

System Configuration
Based on the above experimental analysis, the two system configurations used on the test data sets are as follows:
(1) subtask 1: Logistic Regression with the best NLP feature set is used as model 1. The deep learning model is used as model 2. Logistic Regression with NLP features and the features captured by the deep learning model is used as model 3. The ensemble of the three models is used as the final submission.
(2) subtask 2: The deep learning model with word embedding and char embedding is used as the submission.

Results on Test Data
Table 5: Results on the test datasets.

            Subtask 1    Subtask 2
Our system  33.35 (5)    16.41 (7)

Table 5 shows the results on the test datasets. From Table 5, we find that our system achieves almost the same performance as in cross-validation. The low scores illustrate the difficulty of the task itself, especially for Spanish.
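The scores in Table 5 are Macro F-scores, i.e., the average of the label-wise F-scores. A minimal sketch of this computation is given below; assigning an F-score of 0 to a label with no correct predictions is an assumption about the convention, and this is not the official scorer.

```python
def macro_f1(gold, pred, num_classes):
    """Average of the per-label F-scores over all labels."""
    f_scores = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_scores.append(2 * precision * recall / (precision + recall))
        else:
            f_scores.append(0.0)  # assumption: undefined F-score counts as 0
    return sum(f_scores) / num_classes

gold = [0, 0, 0, 1, 2]
pred = [0, 0, 1, 1, 0]
score = macro_f1(gold, pred, num_classes=3)
print(round(score, 4))
```

Because every label carries the same weight, the rare label 2, which is never predicted correctly here, lowers the score as much as a frequent label would.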

Conclusion
In this paper, we extract several effective traditional NLP features, design different deep learning models, and build a model that combines traditional NLP features with deep learning methods. The extensive experimental results show that this combination improves the performance.
In future work, we plan to focus on developing a neural network model that handles unbalanced data and improves the performance on easily confused labels.