UWB at SemEval-2018 Task 1: Emotion Intensity Detection in Tweets

This paper describes our system created for the SemEval-2018 Task 1: Affect in Tweets (AIT-2018). We participated in both the regression and the ordinal classification subtasks for emotion intensity detection in English, Arabic, and Spanish. For the regression subtask we use the AffectiveTweets system extended with features based on various word embeddings, lexicons, and LDA. For the ordinal classification we additionally use our Brainy system with parse tree, POS tag, and morphological features. The most beneficial features, apart from word and character n-grams, include word embeddings, POS counts, and morphological features.


Introduction
The task of Detecting Emotion Intensity assigns an intensity to a tweet for a given emotion. The emotions include anger, fear, joy, and sadness. The intensity is either on a scale from zero to one for the regression subtask, or one of four classes (0: no, 1: low, 2: moderate, 3: high) for the classification subtask. The task was prepared in three languages: English, Arabic, and Spanish. For each language there are four training and test data sets, one for each emotion. The data creation is described in  and a detailed description of the task is given in .
We participated in the emotion intensity regression task (EI-reg) and in the emotion intensity ordinal classification task (EI-oc) in English, Arabic and Spanish.

System Description
We used two separate systems for the ordinal classification: AffectiveTweets (Section 3) and Brainy (Section 4). For the regression task we use only the AffectiveTweets system. We train a separate model for each emotion. The Brainy system performed better in our pre-evaluation experiments on the development data for all emotions in Spanish and for the fear and joy emotions in Arabic.

Tweets Preprocessing
Tweets often contain slang expressions, misspelled words, emoticons, or abbreviations, so several preprocessing steps are needed before extracting features. First, every tweet was tokenized using TweetNLP 1 (Gimpel et al., 2011). Then the AffectiveTweets 2 package for the Weka machine learning workbench (Hall et al., 2009) was used for feature extraction. The following steps were applied to the tokens for every language in both tasks:
1. Tokens were converted to lowercase.
2. URL links were replaced with the http://www.url.com token.
3. Twitter usernames (tokens starting with @) were replaced with the @user token.
4. Sequences of a letter occurring more than two times in a row were reduced to two occurrences (e.g. huuuungry is reduced to huungry, looooove to loove).
5. Common sequences of words and emojis were separated by a space (e.g. the token "nice:D:D" was divided into the two tokens "nice" and ":D:D").
These steps lead to a reduction of the feature space, as shown in (Go et al., 2009). We also used additional preprocessing for Arabic. After the steps described above, every token was further processed with the Stanford Word Segmenter 3 (Monroe et al., 2014). When using word embeddings, we transformed Arabic words from regular UTF-8 Arabic to a more ambiguous form 4 . This was done only for the word embedding features.

Features
Our AffectiveTweets system used combinations of the features described in this section. The submitted combination of features is shown in Table 1.
• Word n-grams (WN^n_i): word n-grams 5 from i to n (for i = 1, n = 2, unigrams and bigrams were used).
• Character n-grams (ChN^n_i): character n-grams 5 from i to n (for i = 2, n = 3, character bigrams and trigrams were used).
• Word Embeddings (WE): an average of the word embeddings of all the words in a tweet.
• LDA, Latent Dirichlet Allocation (D^n): the topic distribution of a tweet obtained from our pre-trained model; n indicates the number of topics in the model (for n = 5, a feature vector with dimension 5 is produced and each component of the vector refers to one topic). We used LDA features only in the AffectiveTweets system.
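The WE feature above can be sketched as follows; the tiny embedding table and the function name are illustrative stand-ins, not the actual pre-trained vectors:

```python
import numpy as np

# Toy embedding table standing in for pre-trained vectors (values are made up).
EMB = {"good": np.array([0.2, 0.8]), "day": np.array([0.5, 0.1])}

def avg_embedding(tokens, emb, dim=2):
    """WE feature: average the embeddings of all known words in a tweet."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(avg_embedding(["good", "day", "oov"], EMB))  # [0.35 0.45]
```

Out-of-vocabulary tokens are simply skipped; a tweet with no known words maps to the zero vector.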

Arabic Word Embeddings:
• The Arabic word embeddings mentioned above were created with Global Vectors (GloVe) (Pennington et al., 2014) and the Word2Vec toolkit (Mikolov et al., 2013) using the skip-gram (SG) model and the continuous bag-of-words (CBOW) model. These Arabic word embeddings were trained on different data domains, namely Twitter (tw), web pages (web), Wikipedia (wiki), and their combination (var); for more details see the cited papers.

English lexicons (L-en):
• We used all affective lexicons from the AffectiveTweets package.

Model Training
In our AffectiveTweets system we used an L2-regularized L2-loss SVM regression and classification model with the regularization parameter C set to 1, implemented in the LIBLINEAR library (Fan et al., 2008) 6 .
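The paper trains the models via Weka's LIBLINEAR binding; an equivalent setup can be sketched with scikit-learn, whose LinearSVR/LinearSVC wrap the same LIBLINEAR solvers. The feature matrix and targets here are made-up stand-ins for the extracted tweet features:

```python
import numpy as np
from sklearn.svm import LinearSVC, LinearSVR

# Made-up stand-in features: rows are tweets, columns are extracted features.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5], [0.9, 0.1]])
y_reg = np.array([0.85, 0.20, 0.55, 0.10])   # EI-reg: intensity in [0, 1]
y_oc = np.array([3, 0, 2, 0])                # EI-oc: ordinal class 0-3

# L2-regularized L2-loss SVM with C = 1, mirroring the LIBLINEAR setup.
reg = LinearSVR(C=1.0, loss="squared_epsilon_insensitive").fit(X, y_reg)
clf = LinearSVC(C=1.0, loss="squared_hinge").fit(X, y_oc)

print(reg.predict(X[:1]), clf.predict(X[:1]))
```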

LDA Training
To use topics created with LDA (Latent Dirichlet Allocation) (Blei et al., 2003) as features, we trained our own models for every language. The tweets used to train the Arabic and Spanish models were taken from the SemEval-2018 AIT DISC corpus  and the tweets for the English model were taken from the Sentiment140 7 training data (Go et al., 2009). We trained our LDA models with the LDA implementation from MALLET 8 (McCallum, 2002). We used the same preprocessing for LDA as for regular feature extraction. Additionally, we removed stopwords and the following special characters: [ , . ! -]. Tokens from Spanish tweets were stemmed with the Snowball 9 stemming algorithm.
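The paper trains LDA with MALLET; as a rough stand-in, the D^n feature (one topic-distribution vector per tweet) can be sketched with scikit-learn. The toy corpus and n = 5 are illustrative, not the actual training data or submitted setting:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed tweet collections.
docs = ["so happy today", "angry about traffic", "happy happy joy",
        "traffic makes me angry"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)
topics = lda.transform(X)   # D^5 feature: one 5-dim topic distribution per tweet
print(topics.shape)         # (4, 5); each row sums to 1
```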

Brainy System
We use the Maximum Entropy classifier from the Brainy machine learning library (Konkol, 2014) and UDPipe (Straka et al., 2016) for preprocessing. The system does not use any lexicons, only word embeddings. The system is based on (Hercig et al., 2016).

Preprocessing
The same preprocessing has been done for all datasets. We use UDPipe (Straka et al., 2016) with Spanish Universal Dependencies 1.2 models and Arabic Universal Dependencies 2.0 models for POS tagging and lemmatization. Tokenization has been done by TweetNLP tokenizer (Owoputi et al., 2013). We further replace all user mentions with the token "@USER" and all links with the token "$LINK".

Features
The Brainy system used the following features. The exact combination of features for each emotion and the change in performance caused by its removal is shown in Table 9.
• Character n-grams (ChN^n): a separate binary feature for each character n-gram in the utterance text. We do this separately for each order n ∈ {1, 2, 3, 4, 5} and remove n-grams with frequency below a threshold t.
• Bag of Words (BoW): We used a bag-of-words representation of a tweet, i.e. a separate binary feature representing the occurrence of a word in the tweet.
• Bag of POS (BoPOS): We used a bag-of-words representation of POS tags, i.e. a separate binary feature representing the occurrence of a POS tag in the tweet.
• Bag of Parse Tree Tags (BoT): We used a bag-of-words representation of parse tree tags, i.e. a separate binary feature representing the occurrence of a parse tree tag in the tweet. We remove tags with a frequency ≤ 2.
• Emoticons (E): We used a list of positive and negative emoticons (Montejo-Ráez et al., 2012). The feature captures the presence of an emoticon within the text.
• First Words (FW): Bag of first five words with at least 2 occurrences.
• Last Words (LW): Bag of last five words with at least 2 occurrences.
• Last BoM (LBoM): Bag of last five morphological features (see BoM) with at least 2 occurrences.
• FastText (FT): An average of the FastText (Bojanowski et al., 2016) word embeddings of all the words in a tweet.
• N-gram Shape (NSh): The occurrence of a word shape n-gram in the tweet. The word shape assigns words into one of 24 classes 11 , similar to the function specified in (Bikel et al., 1997). We consider unigrams, bigrams, and trigrams with frequency ≤ 2.
• POS Count Bins (POS-B): We map the frequency of POS tags in a tweet into a one-hot vector of length three and use this vector as binary features for the classifier. The frequency belongs to one of three equal-frequency bins 12 . Each bin corresponds to a position in the vector. We remove POS tags with frequency t ≤ 5.
• TF-IDF: Term frequency - inverse document frequency of a word, computed from the training data for words with at least 5 and at most 50 occurrences.

11 We use edu.stanford.nlp.process.WordShapeClassifier with the WORDSHAPECHRIS1 setting available in the Stanford CoreNLP library.
12 The frequencies from the training data are split into three equal-size bins according to 33% quantiles.
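The POS-B binning described above can be sketched as follows; the bin edges and counts are illustrative, not the quantiles estimated from the actual training data:

```python
def pos_count_bins(count, edges):
    """Map a POS-tag frequency into a one-hot vector of length three.

    `edges` are the 33% / 66% quantile boundaries estimated from the
    training data (values used below are made up)."""
    vec = [0, 0, 0]
    if count <= edges[0]:
        vec[0] = 1          # low-frequency bin
    elif count <= edges[1]:
        vec[1] = 1          # middle bin
    else:
        vec[2] = 1          # high-frequency bin
    return vec

print(pos_count_bins(1, (2, 5)))   # [1, 0, 0]
print(pos_count_bins(7, (2, 5)))   # [0, 0, 1]
```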

Experiments
All presented experiments are evaluated on the test data for the given task.
We performed ablation experiments to see which features are the most beneficial (see Tables 8, 9, and 10). The numbers represent the performance change when the given feature is removed 13 .
Word embedding features have a great impact on system performance, so we compared several word embeddings for every language (Tables 2, 3, and 4). For English, the WE-ue word embeddings performed best, but for the submission we used the WE-b word embeddings because they worked better on the development data. For Spanish tweets, the WE-us word embeddings outperformed the WE-ft word embeddings in regression, and WE-us was also better for classification on anger and on the average of all emotions. For classification in Arabic, var-CBOW was best on every emotion except anger, and for regression var-SG worked best on average and on fear.
We also experimented with only the LDA features to find out how the number of topics in the LDA model affects the performance (see Figure 1). We started with models containing 5 topics and continued up to 1000 (the step was increased non-equidistantly). Our experiments suggest that the best setting is around 200-300 topics. We selected the number of topics based on the performance on the development data.

13 The lowest number denotes the most beneficial feature.

Results
Our results in the emotion intensity regression subtask are in Table 5 and our results in the emotion intensity ordinal classification subtask are in Table 6 and Table 7. The system settings and features for each language and emotion were selected based on our pre-evaluation experiments with evaluation on the development data.

Conclusion
We competed in the emotion intensity regression and ordinal classification tasks in English, Arabic and Spanish.
Our ranks are 27th out of 48 for English, 5th out of 14 for Arabic, and 5th out of 16 for Spanish in the regression task, and 21st out of 39 for English, 5th out of 14 for Arabic, and 5th out of 16 for Spanish in the ordinal classification task.