EmoNLP at SemEval-2018 Task 2: English Emoji Prediction with Gradient Boosting Regression Tree Method and Bidirectional LSTM

This paper describes our system for Task 2 (English Emoji Prediction) at SemEval-2018. The system is based on two supervised machine learning algorithms: the Gradient Boosting Regression Tree Method (GBM) and the Bidirectional Long Short-term Memory Network (BLSTM). Besides common features, we extract various lexicon and syntactic features from external resources. After comparing the results of the two algorithms, we chose GBM for the final evaluation.


Introduction
Short text messages from social media websites such as Twitter and Facebook have become an important communication channel in daily life. Although the writing styles of such messages are extremely diverse, the usage of emojis is generally shared. Emojis are ideograms and smileys that serve as electronic expressions of natural emotions. They range from facial expressions and places to types of weather and animals. Emojis are used every day and are rapidly changing the way people communicate on social networks. Because of their importance, emojis have been the subject of several investigations in recent years. For example, previous work (Barbieri et al., 2017) has shown that there are relations between words and emojis; in other words, an emoji is predictable given its surrounding words. Task 2 of SemEval 2018 provides a platform for further work on predicting emojis in tweets.
Our system addresses the first subtask: English emoji prediction. Note that all the emojis in the training data are removed. We model this problem as a multiclass classification problem. Specifically, we leverage semantic and syntactic resources to extract a variety of features. After feature engineering, two models are adopted: the gradient boosting regression tree method (GBM) and the bidirectional long short-term memory network (BLSTM). After comparing the results of these two models, we select GBM for the final evaluation.
The remainder of this paper is structured as follows. In Section 2, we describe our system in detail, including the features and the approaches. In Section 3, we present the results of 5-fold cross-validation experiments and feature ablation. Finally, Section 4 summarizes our work.

System Description
In this section, we present the details of our English emoji prediction system. In the dataset, each tweet corresponds to one label indicating one type of emoji. There are 20 types of emojis, most of which express emotions.
We treat the problem as a multiclass prediction task and extract a variety of features. For the GBM model, besides common features such as word n-gram features, we use extensive external resources to build diverse word cluster, lexicon and syntactic features. For the BLSTM model, we adopt the pre-trained GloVe word embeddings (Pennington et al., 2014).

Preprocessing
As the first step, we preprocess the tweets by tokenizing and normalizing them.
All tweets are tokenized using tweetokenize. In addition, we normalize tweets by replacing all URLs (e.g. https://t.co/bihPimeeV9) with " url " and all mentions (e.g. @Preston Hall) with "@mention".
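The normalization step can be sketched with two regular expressions. This is a minimal illustration, not the authors' code: the patterns `URL_RE` and `MENTION_RE` and the function name `normalize_tweet` are assumptions, and the replacement tokens are written without the surrounding spaces shown in the text.

```python
import re

# Assumed patterns: the paper does not give the exact rules it used.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace every URL with 'url' and every user mention with '@mention'."""
    text = URL_RE.sub("url", text)
    text = MENTION_RE.sub("@mention", text)
    return text


print(normalize_tweet("lunch with @bob https://t.co/bihPimeeV9"))
```

Note that a pattern this simple also rewrites the part after "@" in an e-mail address; a tweet-aware tokenizer would normally handle such cases.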

Features
This section briefly describes the features employed in our two models. GBM uses all the features described in this section, while BLSTM only uses pre-trained GloVe vectors as its word embedding. For GBM, each tweet is represented as a feature vector consisting of all the following features. Since most of the target emojis are related to emotions, we employ diverse lexicon features as well as emoticon (e.g. ":)") features to exploit the sentiment information in the sentence.
Character n-gram: This feature represents the presence or absence of contiguous sequences of 3, 4 and 5 characters, which capture the morphological information hidden in the words.
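As a rough sketch of this feature, the snippet below extracts all character 3-, 4- and 5-grams from a string and turns them into a binary presence vector. The function names and the tiny vocabulary are hypothetical; in practice the vocabulary would be collected from the training tweets.

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Return the set of contiguous character n-grams of the given sizes."""
    grams = set()
    for n in sizes:
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams


def presence_vector(grams, vocabulary):
    """Binary vector: 1 if the vocabulary n-gram occurs in the tweet, else 0."""
    return [1 if gram in grams else 0 for gram in vocabulary]


vocabulary = ["coo", "xyz", "cool"]  # hypothetical n-gram vocabulary
print(presence_vector(char_ngrams("cool"), vocabulary))
```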
POS: The POS tag conveys the lexical category of a word. We part-of-speech tag the tweets with the Carnegie Mellon University (CMU) tool (Gimpel et al., 2011). This tool is designed specifically for POS tagging of tweets and can deal with non-standard words. For example, it can tag "ikr" ("I know, right?") as an interjection.
Cluster: This feature is derived from the CMU POS tagging tool, which provides word clusters built with the Brown clustering algorithm. These 1,000 pre-trained clusters serve as an alternative representation of each tweet. The feature indicates the presence or absence of tokens from each of the 1,000 clusters.
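The cluster feature can be pictured as a binary vector with one slot per cluster. In the sketch below, `word_to_cluster` is a hypothetical dictionary from token to an integer cluster index; the actual CMU resource identifies clusters differently, so this mapping is an assumption made only for illustration.

```python
def cluster_presence(tokens, word_to_cluster, n_clusters=1000):
    """Mark each cluster that contains at least one token of the tweet."""
    vector = [0] * n_clusters
    for token in tokens:
        cluster = word_to_cluster.get(token)
        if cluster is not None:  # tokens outside the clusters are ignored
            vector[cluster] = 1
    return vector


# Toy mapping with four clusters instead of 1,000.
toy_clusters = {"ikr": 2, "lol": 0}
print(cluster_presence(["ikr", "lol", "zzz"], toy_clusters, n_clusters=4))
```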
Counting feature: This feature is inspired by the work of Mohammad et al. (2013) and combines the counts of special symbols (e.g. mentions) in each tweet:
• the number of hashtags;
• the number of words with all characters in upper case;
• the number of contiguous sequences of question marks, exclamation marks, and both of them;
• whether the last token contains a question mark or an exclamation mark;
• the number of mentions;
• the number of URLs;
• the number of words which have repeated characters (e.g. "coooool");
• the presence or absence of positive or negative emoticons in the tweet.
The positive and negative emoticons are defined in (Mohammad et al., 2013).
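The counting features listed above can be sketched as a single function over the tokens of a normalized tweet. Several details are assumptions rather than the authors' definitions: a "contiguous sequence" is taken to be two or more marks, a word with repeated characters is one with a letter repeated three or more times, mentions and URLs are assumed to be already normalized to "@mention" and "url", and the two emoticon sets are small placeholders for the lists in Mohammad et al. (2013).

```python
import re

# Placeholder emoticon sets; the real lists come from Mohammad et al. (2013).
POSITIVE_EMOTICONS = {":)", ":-)", ":D", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}

PUNCT_RUN_RE = re.compile(r"[?!]{2,}")          # assumed: a run is 2+ marks
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")  # assumed: a letter repeated 3+ times


def counting_features(tokens):
    """Return the counting features for one tokenized, normalized tweet."""
    runs = [run for tok in tokens for run in PUNCT_RUN_RE.findall(tok)]
    last = tokens[-1] if tokens else ""
    return {
        "hashtags": sum(tok.startswith("#") and len(tok) > 1 for tok in tokens),
        "all_caps": sum(tok.isalpha() and tok.isupper() for tok in tokens),
        "question_runs": sum(set(run) == {"?"} for run in runs),
        "exclamation_runs": sum(set(run) == {"!"} for run in runs),
        "mixed_runs": sum(set(run) == {"?", "!"} for run in runs),
        "last_has_mark": int("?" in last or "!" in last),
        "mentions": sum(tok == "@mention" for tok in tokens),
        "urls": sum(tok == "url" for tok in tokens),
        "elongated": sum(bool(ELONGATED_RE.search(tok)) for tok in tokens),
        "positive_emoticon": int(any(tok in POSITIVE_EMOTICONS for tok in tokens)),
        "negative_emoticon": int(any(tok in NEGATIVE_EMOTICONS for tok in tokens)),
    }


tokens = ["WOW", "this", "is", "coooool", "!!!", "#fun", "@mention", "url", ":)", "really", "?!"]
print(counting_features(tokens))
```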
SSWE feature: SSWE (sentiment-specific word embeddings) are learned on 10 million tweets using a customized neural network (Tang et al., 2014). SSWE features capture the sentiment information of sentences as well as the syntactic context of words.
Lexicon feature: This feature is created following the method of Mohammad et al. (2013). For each lexicon, we compute the number of sentiment words, the total sentiment score, the score of the last sentiment word and the maximal sentiment score. By drawing on extensive external lexicons, this feature captures the sentiment information in tweets comprehensively.
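The four per-lexicon statistics can be sketched as follows. The lexicon here is a toy dictionary from word to sentiment score, and treating any word with a non-zero score as a "sentiment word" is an assumption; the real lexicons and their scoring conventions are those referenced in the text.

```python
def lexicon_features(tokens, lexicon):
    """Count, total, last and maximal sentiment score for one lexicon."""
    scores = [lexicon[tok] for tok in tokens if lexicon.get(tok, 0.0) != 0.0]
    return {
        "count": len(scores),
        "total": sum(scores),
        "last": scores[-1] if scores else 0.0,
        "max": max(scores) if scores else 0.0,
    }


toy_lexicon = {"love": 0.75, "good": 0.5, "hate": -0.5}  # hypothetical scores
print(lexicon_features(["i", "love", "this", "good", "day"], toy_lexicon))
```

With several lexicons, the function would be applied once per lexicon and the resulting values concatenated into the tweet's feature vector.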

Model
Two models are used in our system: GBM and BLSTM. We compare the performance of these two models and use the GBM result for the final evaluation.
GBM: We construct the GBM model using all the features described in Section 2.2. For each tweet, the concatenated feature vector is the model input, and the model outputs the multiclass classification result. The gradient boosting regression tree method generates base models sequentially, and at each step it updates the model by minimizing the value of the loss function. Each base model is a single regression tree, which fits a set of features by partitioning the feature space into regions. As more regression trees are added, the fitted model can achieve a small training error. In other words, gradient boosting fits the training data sequentially, correcting the errors of the earlier base models at each step to yield a strong combination of trees. The method therefore has the potential to produce more accurate predictions (Zhang and Haghani, 2015). We build the GBM model with LightGBM and tune the hyperparameters on the training set by grid search. Because of time constraints, it was not feasible to tune all the hyperparameters of the GBM model. We choose two hyperparameters to tune, and we set the learning rate to 0.1 and the minimal number of data points in one leaf to 20.
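To make the sequential fitting concrete, the toy example below implements gradient boosting for one-dimensional regression with squared loss, using single-split trees (stumps) as base models. It is only an illustration of the boosting idea: it is not LightGBM's algorithm, it does not handle the multiclass loss used for emoji prediction, and the names `fit_stump` and `fit_gbm` are hypothetical. The learning rate of 0.1 matches the value reported in the text.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


model = fit_gbm([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(round(model(1), 2), round(model(6), 2))
```

In LightGBM, the two settings mentioned in the text correspond to the parameters `learning_rate` and `min_data_in_leaf`.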

BLSTM: We experiment with BLSTM using published word embeddings, namely Stanford's GloVe embeddings trained on 6 billion words from Google and Web text. Instead of a traditional feedforward network, we use a bidirectional long short-term memory network. LSTM (Hochreiter and Schmidhuber, 1997) is a powerful connectionist model that can capture time dynamics, and it copes with the vanishing gradient problem better than the traditional recurrent neural network (RNN). However, a standard LSTM processes the sequence in only one direction, whereas in most practical situations information from both directions is beneficial. BLSTM addresses this problem: the basic idea is to process the sequence both forward and backward, using two separate hidden states to capture the past and the future information (Ma and Hovy, 2016).
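The bidirectional idea can be sketched independently of the LSTM cell itself. In the snippet below, the recurrent step is a plain running sum, which is only a stand-in for a real LSTM cell; the function names are hypothetical. What the sketch shows is the wiring: one pass reads the sequence left to right, another reads it right to left, and the two states are paired at each position.

```python
def run_direction(inputs, step, reverse=False):
    """Run a recurrent step over the sequence and return one state per position."""
    order = list(reversed(inputs)) if reverse else list(inputs)
    state, outputs = 0.0, []
    for x in order:
        state = step(x, state)
        outputs.append(state)
    # Re-align the backward states with the original positions.
    return outputs[::-1] if reverse else outputs


def bidirectional(inputs, step_forward, step_backward):
    """Pair the forward (past) and backward (future) state at each position."""
    forward = run_direction(inputs, step_forward)
    backward = run_direction(inputs, step_backward, reverse=True)
    return list(zip(forward, backward))


def running_sum(x, state):
    """Toy stand-in for an LSTM cell, chosen so the states are easy to check."""
    return x + state


print(bidirectional([1, 2, 3], running_sum, running_sum))
```

At position 0 the forward state has seen only the first input while the backward state has seen the whole sequence, which is the past and future context described above.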
We use this BLSTM to transform word embeddings into classification results. Figure 1 shows the network in detail. We also tuned the hyperpa-

Results
We train two separate models, using the tuned hyperparameters shown in Tables 1 and 2, to determine the final classifier for evaluation. To find the optimal settings, we use all the training data and conduct 5-fold cross-validation experiments.

Table 5: Feature ablation study using the GBM model. The reported quantities are the F1 score and the F1 loss resulting from the removal of each feature group.
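The evaluation protocol used in this section, 5-fold cross-validation scored with F1, can be sketched in plain Python. Two points are assumptions: the folds are taken as contiguous blocks without shuffling, and F1 is macro-averaged over the labels, since the text does not state how the folds were built or how F1 was averaged.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


for train, validation in k_fold_indices(10, k=5):
    print(len(train), validation)
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

For a feature ablation, the same loop would be repeated with one feature group removed, and the drop in the averaged score would be reported as the loss for that group.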

Conclusion
In this paper, we present the systems used in SemEval 2018 Task 2 for English emoji prediction. Our effort focuses on two models for multiclass emoji classification. By leveraging general features (i.e. word n-gram, character n-gram and counting features), external resources (i.e. a variety of manually constructed lexicons and the CMU Brown clusters), feature selection and hyperparameter tuning, GBM achieves better performance than BLSTM. We attribute this to the extensive use of sentiment and syntactic features. Because most of the target emojis are related to emotions, these sentiment features can reveal the relation between words and emotional emojis. In the future, we hope to improve our BLSTM model by taking advantage of more features and incorporating a more effective architecture.