Tohoku at SemEval-2016 Task 6: Feature-based Model versus Convolutional Neural Network for Stance Detection

In this paper, we compare feature-based and Neural Network based approaches on the supervised stance classification task for tweets in SemEval-2016 Task 6 Subtask A (Mohammad et al., 2016). In the feature-based approach, we use external resources such as lexicons and crawled texts. The Neural Network based approach employs a Convolutional Neural Network (CNN). Our results show that the feature-based model outperformed the CNN model on the test data, although the CNN model was better than the feature-based model in the cross validation on the training data.


Introduction
To solve supervised short text classification tasks, there are two major approaches: feature-based and Neural Network based. In traditional feature-based approaches, we extract various features from a text. The features are usually constructed from n-grams (e.g., bigrams) of the texts and from external resources such as lexicons and unlabeled corpora.
In Neural Network based approaches, a number of models for text classification exist: for example, a Feed-Forward Neural Network model using an average of embeddings of the target word sequence as the input layer (Iyyer et al., 2015), Recursive Neural Networks (Socher et al., 2011; Socher et al., 2013), and Convolutional Neural Networks (CNN) (Johnson and Zhang, 2015; dos Santos and Gatti, 2014; Kim, 2014).
In this paper, we compare feature-based and Neural Network based approaches on the supervised stance classification task for tweets, SemEval-2016 Task 6 Subtask A (Mohammad et al., 2016). The feature-based approach classifies tweets using a logistic regression model. The features are extracted using external knowledge such as SentiWordNet (Esuli and Sebastiani, 2006) and a collection of crawled tweets, in addition to unigrams and bigrams in the target tweet. For the Neural Network approach, we implement a CNN based on Kim (2014). As the input embeddings, we use word embeddings trained with the Continuous Bag-Of-Words (CBOW) model (Mikolov et al., 2013) on Wikipedia articles.
The experimental results show that the CNN based approach performed the best in the cross validation on the training data. However, the tendency was the opposite on the test data, probably because the CNN model overfitted the training data. In contrast, the feature-based approach was more robust, leveraging the external knowledge.

Datasets
We use the dataset of SemEval-2016 Task 6 Subtask A, a supervised tweet classification task over five topics. There are three stances to classify: NONE, FAVOR, and AGAINST. Table 1 shows the topics and the label distributions of the training data. To classify tweets into stances, we consider two configurations: a Three-way Polarity Classifier, which detects the three stance labels at once, and a combination of a Topic Classifier and a Two-way Polarity Classifier. The Topic Classifier judges the relevance of a tweet to the topic, in other words, whether the stance label is NONE or not (FAVOR/AGAINST). The Two-way Polarity Classifier then labels FAVOR or AGAINST for tweets not judged as NONE by the Topic Classifier.
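The two configurations can be sketched as follows. This is a toy illustration with stand-in classifiers (simple lambdas), not the trained models described later:

```python
# Sketch of the two classifier configurations, assuming any pair of
# classifiers that map a tweet string to a label string.

def three_way(tweet, polarity3):
    """Three-way Polarity Classifier: one model, three labels."""
    return polarity3(tweet)  # one of "NONE", "FAVOR", "AGAINST"

def cascade(tweet, topic_clf, polarity2):
    """Topic Classifier followed by Two-way Polarity Classifier."""
    if topic_clf(tweet) == "NONE":       # step 1: relevance to the topic
        return "NONE"
    return polarity2(tweet)              # step 2: FAVOR vs. AGAINST

# toy stand-in classifiers, purely for illustration
topic = lambda t: "NONE" if "weather" in t else "RELEVANT"
polarity = lambda t: "AGAINST" if "hate" in t else "FAVOR"

print(cascade("nice weather today", topic, polarity))    # NONE
print(cascade("i hate this movement", topic, polarity))  # AGAINST
```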
Feature-Based Approach

Preprocessing Tweets
We remove reply and mention expressions (@UserName) in tweets to prevent overfitting, and keep flags indicating whether tweets contain them or not. We also remove hashtags based on the following rules to prevent overfitting.
Rule 1. Hashtags embedded in the sentence with capitals or digits at non-initial letters, e.g., #WeLoveJapan, #Pray4all
Rule 2. Hashtags at the end of a tweet, e.g., #SemST, #2014, #LylicTweet
Rule 1 removes hashtags that are too long or unpopular. Rule 2 removes hashtags that do not express a stance. Remaining hashtags such as #hillary and #god may provide important features for detecting stances. We also expand shortened forms such as "I'm" and "can't" based on simple rules. Finally, we obtain part-of-speech (POS) tags and dependency trees of tweets using Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/).
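The two removal rules can be approximated with regular expressions. The exact patterns below are our reading of the rules, not the original preprocessing code:

```python
import re

def remove_hashtags(tweet):
    # Rule 2: drop the run of hashtags at the end of the tweet.
    tweet = re.sub(r'(?:\s*#\w+)+\s*$', '', tweet)
    # Rule 1: drop embedded hashtags with a capital or digit at a
    # non-initial letter (e.g. #WeLoveJapan, #Pray4all).
    tweet = re.sub(r'#\w\w*?[A-Z0-9]\w*', '', tweet)
    # tidy up any double spaces left behind
    return re.sub(r'\s{2,}', ' ', tweet).strip()

print(remove_hashtags("#god is great, says #WeLoveJapan crowd #SemST"))
# #god is great, says crowd
```

Note that #god survives both rules, matching the paper's observation that such hashtags can carry stance information.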

Features
Reply (R): If a tweet has a flag that indicates a reply or mention expression, we generate R=is reply or R=is mention as a feature. This feature may be effective because a reply or mention may provide a clue for detecting a stance.

Table 2: Query words and the number of crawled tweets for each topic.
Topic | Query | # tweets
Atheism | "atheism" | 24124
Climate Change is a Real Concern | "climate", "climate change" | 22703
Feminist Movement | "feminist", "feaminism", "feminist movement", "gender equality" | 131677
Hillary Clinton | "hillary", "clinton", "hillary clinton" | 980080
Legalization of Abortion | "abortion" | 54846

Table 3: Seed keywords for each topic.
Feminist Movement | "feminist", "feminism"
Hillary Clinton | "hillary", "clinton"
Legalization of Abortion | "abortion"

BagOfWords (BoW): For detecting stances, words in a tweet are very informative. We include all unigrams of lemmas in a tweet as features. (e.g. BoW=think, BoW=not)
BagOfDependencies (BoD): Dependency relations such as adjectival modifier and negation are important for detecting stances. We include all dependency relations in a tweet as features. (e.g. BoD=hate=>i, BoD=like=>not)
BagOfPOSTag (BoP): We also extract features from POS tags. For example, if a tweet contains several interjections, the user probably has a negative opinion on the topic. We include all unigrams of POS tags in a tweet as features. (e.g. BoP=NOUN, BoP=UH)
SentiWordNet (SWN): Content words in a tweet may express sentiment, which indicates the stances and emotions of the user. We use SentiWordNet (Esuli and Sebastiani, 2006) to obtain the sentiment of a word; it assigns positive/negative/objective scores to each word. Following work on sentiment classification such as Pang et al. (2002), we generate polarity features from adjectives and adverbs in the tweet based on the following rules.
1. For a given word, look up the top item in SentiWordNet and obtain the negative and positive scores of the word.
2. If the negative score is equal to the positive score, no features are generated.
3. If the negative score is larger than the positive score, generate a negative polarity feature; otherwise generate a positive polarity feature. (e.g. SWN=love=>p, SWN=hate=>n)
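The three rules above can be sketched as follows, with a toy stand-in lexicon in place of the real SentiWordNet lookup:

```python
# Toy stand-in lexicon: word -> (positive score, negative score).
# The real system looks up the top SentiWordNet entry per word.
TOY_LEXICON = {
    "love": (0.625, 0.0),
    "hate": (0.0, 0.75),
    "table": (0.125, 0.125),   # tie -> no feature (rule 2)
}

def swn_feature(word):
    if word not in TOY_LEXICON:
        return None
    pos, neg = TOY_LEXICON[word]
    if pos == neg:                          # rule 2: equal scores
        return None
    polarity = "n" if neg > pos else "p"    # rule 3
    return "SWN=%s=>%s" % (word, polarity)

print(swn_feature("love"))   # SWN=love=>p
print(swn_feature("table"))  # None
```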
SentiWordSubject (SWS): This feature focuses on sentiment expressed around subjective pronouns such as "I" or "we", which may indicate the emotions or stances of the user of a tweet. We obtain a sentiment polarity from the word modifying a subjective pronoun in a tweet, and include it as a feature. The sentiment polarity is obtained from SentiWordNet using the same rules as for SWN features. (e.g. SWS=I=love=>p, SWS=We=hate=>n)
TargetSentiment (TS): We also consider sentiment or emotion toward the topics. Jiang et al. (2011) add words modifying target words as features. Similarly, we extract words modifying target words in a tweet, and include sentiment polarity features using the same rules as for SWN features. To identify target words, we calculate similarities between words and seed keywords using word embeddings; if the similarity is higher than 0.7, we treat the word as a target word. Table 3 shows the seed keywords for each topic. For example, given a tweet "We hate feminist", we extract "hate", which modifies the target word "feminist", and generate the feature TS=n using the same rules as for SWN features. (e.g. TS=p, TS=n)
HighPMI (P): We crawled tweets containing target words, and collected words co-occurring with the seed keywords (Table 3) in all crawled tweets for each topic. Table 2 shows the query words and the number of crawled tweets for each topic. We then calculate Pointwise Mutual Information (PMI) for all words. If a word in a tweet is in the top 300 by PMI, we generate a feature. This feature detects tweets containing words related to the topic, and may be effective for classifying whether the stance is NONE or not. (e.g. P=humanist, P=meninist)
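A minimal sketch of the PMI ranking over a toy corpus (the real system runs over the crawled tweets and keeps the top 300 words):

```python
import math
from collections import Counter

# Rank words by PMI with a topic's seed keywords. The corpus below is
# a toy stand-in for the crawled tweets.

def pmi_ranking(tweets, seeds):
    word_count, coocc_count = Counter(), Counter()
    seed_tweets = 0
    for tokens in tweets:
        has_seed = any(s in tokens for s in seeds)
        seed_tweets += has_seed
        for w in set(tokens):
            word_count[w] += 1
            if has_seed and w not in seeds:
                coocc_count[w] += 1
    n = len(tweets)
    pmi = {
        w: math.log((c / n) / ((word_count[w] / n) * (seed_tweets / n)))
        for w, c in coocc_count.items()
    }
    return sorted(pmi, key=pmi.get, reverse=True)

tweets = [["meninist", "hillary"], ["hillary", "vote"],
          ["vote", "weather"], ["weather", "rain"]]
top = pmi_ranking(tweets, {"hillary"})
print(top)   # words that cooccur only with the seed rank highest
```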

Experimental Setups
We used L2-regularized logistic regression as the classification algorithm, and measured the classification performance on 10-fold cross validation using the Classias package (Okazaki, 2009). We evaluated each model by a macro average of micro-F1 scores of FAVOR and AGAINST for each topic and all topics.
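A self-contained sketch of the per-topic score, as we read it: the F1 scores of FAVOR and AGAINST, averaged (NONE contributes only through errors). The reported numbers additionally macro-average this score over topics:

```python
def f1(gold, pred, label):
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def stance_score(gold, pred):
    # average of the F1 of the two stance-bearing classes
    return (f1(gold, pred, "FAVOR") + f1(gold, pred, "AGAINST")) / 2

gold = ["FAVOR", "AGAINST", "NONE", "AGAINST"]
pred = ["FAVOR", "AGAINST", "AGAINST", "NONE"]
print(stance_score(gold, pred))   # 0.75
```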

Comparison of Classifier Combinations
We compared the Three-way Polarity Classifier with the combination of Topic Classifier and Two-way Polarity Classifier. Table 4 shows the performances of these two classifier configurations. We confirmed that the combination of Topic Classifier and Two-way Polarity Classifier outperformed the Three-way Polarity Classifier, so we use this combination hereafter.

Ablation Test
In this experiment, we explore the contribution of the individual features described in Section 3.2. Table 5 shows the results of the ablation tests. The results show that the SWN features were the most effective for classifying stances; the sentiment of the tweet is one of the keys to stance classification. In contrast, the BoD features degraded the classifier. We ran further ablation tests on the feature set excluding the features that degraded performance, and these experiments revealed the best feature set {BoW, BoP, R, SWN, P} (denoted 'Best' in Table 5).
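The ablation procedure can be sketched as a leave-one-group-out loop; evaluate() here is a toy stand-in for the 10-fold cross-validation run:

```python
# Drop one feature group at a time and compare against the full set.
FEATURES = ["R", "BoW", "BoD", "BoP", "SWN", "SWS", "TS", "P"]

def ablation(evaluate):
    full = evaluate(FEATURES)
    deltas = {}
    for f in FEATURES:
        reduced = [g for g in FEATURES if g != f]
        deltas[f] = evaluate(reduced) - full   # > 0: removing f helps
    return deltas

# toy evaluate: pretend BoD hurts the score and SWN helps it
toy = lambda feats: 0.6 - 0.05 * ("BoD" in feats) + 0.1 * ("SWN" in feats)
d = ablation(toy)
print(d["BoD"] > 0, d["SWN"] < 0)   # True True
```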

CNN-Based Approach

Method Overview
In recent years, Convolutional Neural Network (CNN) models have achieved remarkable results in various fields of research, such as computer vision and speech recognition. In the field of natural language processing, CNN models are also used for text classification tasks (Johnson and Zhang, 2015; dos Santos and Gatti, 2014), sentiment analysis (Kim, 2014), etc.
Following Kim (2014), we constructed CNN models to detect stances, as shown in Figure 1. They consist of one convolution layer with one max-pooling layer, and a three-layered feed-forward network with a softmax at the end to predict a distribution over classes. The convolution layer has 200 kernel windows whose sizes are k × d, where k is the number of words in a window and d is the dimension of the word embeddings. We denote an input tweet s as a sequence of words w_1, w_2, ..., w_n, with embeddings v_{w_1}, v_{w_2}, ..., v_{w_n}. We use Chainer for creating the neural networks. To create a fixed-size input matrix for the implementation in Chainer, we add zero-padding vectors at the end of a sentence so that each input is an N × d matrix, where N is an upper bound on sentence length.
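The zero-padding step can be sketched as follows, with random stand-in embeddings in place of the CBOW vectors:

```python
import numpy as np

def tweet_matrix(words, emb, d, N=100):
    # embed each word (zero vector for unknowns), truncate at N
    vecs = [emb.get(w, np.zeros(d)) for w in words[:N]]
    # zero-padding at the end of the sentence up to length N
    vecs += [np.zeros(d)] * (N - len(vecs))
    return np.stack(vecs)                  # shape (N, d)

emb = {w: np.random.randn(300) for w in ["we", "hate", "feminist"]}
m = tweet_matrix(["we", "hate", "feminist"], emb, d=300)
print(m.shape)   # (100, 300)
```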
As mentioned in Section 2, we consider both the Three-way Polarity Classifier and the combination of Topic Classifier and Two-way Polarity Classifier. We also search for the best hyperparameter k and activation functions.

Table 7: Tuning of the window size k on 10-fold cross validation using the CNN based approach. The scores are macro averages of micro-F1 scores of FAVOR and AGAINST for each topic and all topics.

Experimental Setups
We trained 300-dimensional word embeddings using Word2Vec (https://code.google.com/archive/p/word2vec/) on Wikipedia articles (https://dumps.wikimedia.org/enwiki/20151201/enwiki-20151201-pages-articles.xml.bz2; 3950598 articles in total), with the options -size 300 -window 5 -sample 1e-4 -negative 5 -hs 0 -cbow 1 -iter 3. We set N to 100, which exceeds the maximum length of all tweets. We use a (300 × k) × 200 matrix as W, and three fully connected layers that consist of 200-50-3 units (Three-way Polarity Classifier) or 200-50-2 units (Topic Classifier or Two-way Polarity Classifier). We measured the performance on 10-fold cross validation, evaluating each model by a macro average of micro-F1 scores of FAVOR and AGAINST for each topic and all topics.
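A shape-level sketch with random stand-in weights (not a Chainer implementation) illustrates the dimensions above for the two-class case:

```python
import numpy as np

def cnn_forward(x, k=3, d=300, n_kernels=200, n_classes=2):
    # x: padded tweet matrix of shape (N, d)
    N = x.shape[0]
    W = np.random.randn(n_kernels, k, d)
    # convolution over time: one score per window position per kernel
    conv = np.array([[np.sum(W[j] * x[i:i + k]) for i in range(N - k + 1)]
                     for j in range(n_kernels)])            # (200, N-k+1)
    pooled = conv.max(axis=1)                               # (200,)
    # feed-forward 200 -> 50 -> 2 with relu, then softmax
    h = np.maximum(0, np.random.randn(50, n_kernels) @ pooled)
    logits = np.random.randn(n_classes, 50) @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = cnn_forward(np.random.randn(100, 300))
print(p.shape)   # (2,)
```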

Comparison of Classifier Combinations
We compared Three-way Polarity Classifier with the combination of Topic Classifier and Two-way Polarity Classifier with k = 3. We tried using all possible combinations of sigmoid and relu functions in the CNN models. Table 6 shows the performances of the classifiers.
The table indicates that the combination of Topic Classifier and Two-way Polarity Classifier outperformed the Three-way Polarity Classifier. We confirmed that this combination of classifiers is effective not only for the feature-based approach but also for the CNN based approach.
We achieved the best score when using the sigmoid function for the Topic Classifier and relu for the Two-way Polarity Classifier.

Tuning of Window Size k
We searched for the best value of the hyperparameter k using the best model configuration from Section 4.3.1. Table 7 shows that the model obtained the highest score with window size k = 2, suggesting that bigram windows are appropriate for stance detection on the training data.

Visualization of the CNN Model
In this section, we visualize the CNN model that achieved the highest score in Section 4.3.2. To do so, we define a region score as the number of dimensions that are selected by the max-pooling layer per region. Figure 2 shows a heat map of the region scores.
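The region score can be sketched as follows, with a toy convolution output in place of the trained model's feature map:

```python
import numpy as np

def region_scores(conv):
    # conv: (n_kernels, n_regions) convolution output.
    # For each region, count how many kernels picked it at max-pooling.
    winners = conv.argmax(axis=1)          # winning region per kernel
    scores = np.zeros(conv.shape[1], dtype=int)
    for r in winners:
        scores[r] += 1
    return scores                          # sums to n_kernels

conv = np.array([[0.1, 0.9, 0.2],
                 [0.5, 0.3, 0.8],
                 [0.0, 0.7, 0.6]])
print(region_scores(conv))   # [0 2 1]
```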
The figure provides several observations.
• Topic-related words such as movement received high scores in both the Topic Classifier and the Two-way Polarity Classifier. This shows that each CNN model automatically detects the topic words.
• Nouns, verbs and adjectives that appear in SentiWordNet received higher scores in both classifiers. In addition, their scores are associated with co-occurrence with the topic word.
• Negation words such as not and can't received high scores in the Two-way Polarity Classifier, but received lower scores in the Topic Classifier.

Overall Results
We compared the feature-based models with the CNN models and a majority baseline on the test data. The feature-based models used the Topic + Two-way Polarity Classifiers and the best feature set from Section 3.3.3. The CNN models used the Topic + Two-way Polarity Classifiers and the best hyperparameters from Section 4.3. The majority baseline labels the test data with the stance that was most frequent in the training data. Table 8 shows the macro average of micro-F1 scores of FAVOR and AGAINST for the two models on the test data, together with the cross validation results and the majority baseline.
We found that the feature-based model outperformed the CNN model on the test data, although the CNN model was better in the cross validation on the training data. We think the feature-based model was more robust because it incorporated broad external knowledge such as SentiWordNet and the crawled tweets. In contrast, the CNN model obtained a lower score on the test data than in the cross validation.

Conclusion
We compared the feature-based and CNN based approaches on SemEval-2016 Task 6 Subtask A. The CNN based approach performed the best in the cross validation on the training data, although the feature-based approach outperformed the CNN model on the test data. We also visualized the CNN model to reveal what it focused on, and found that it automatically detected the topic words and words effective for detecting stances.