Multi-Task Stance Detection with Sentiment and Stance Lexicons

Stance detection aims to detect whether the opinion holder is in support of or against a given target. Recent works show improvements in stance detection by using either the attention mechanism or sentiment information. In this paper, we propose a multi-task framework that incorporates target-specific attention mechanism and at the same time takes sentiment classification as an auxiliary task. Moreover, we used a sentiment lexicon and constructed a stance lexicon to provide guidance for the attention layer. Experimental results show that the proposed model significantly outperforms state-of-the-art deep learning methods on the SemEval-2016 dataset.


Introduction
With the rapid growth of social media, user opinions towards various targets, e.g., politicians and religion, are abundant. These opinions can help optimize management systems and can gain insight into important events, e.g., presidential elections. The stance detection task aims to determine whether people are in favor of, against, or neutral towards a specific target. This task is similar to the three-way aspect-level sentiment analysis that determines sentiment polarity towards aspect terms. However, different from aspect-level sentiment analysis, the target in stance detection might not be explicitly mentioned in a given sentence. Consider the following example tweet: @realDonaldTrump is the only honest voice of the @GOP and that should scare the shit out of everyone! #SemST. Target: Hillary Clinton; Stance: Against; Sentiment: Positive. Observe that even though target Hillary Clinton does not appear in this tweet, we can still infer that opinion holder is less likely to be in favor of Hillary Clinton. Therefore, identifying target related information is of vital importance to stance detection.
Previous studies to stance detection used feature engineering (Mohammad et al., 2016b), Convolutional Neural Networks (CNNs) (Vijayaraghavan et al., 2016;Wei et al., 2016) and Recurrent Neural Networks (RNNs) (Zarrella and Marsh, 2016). However, they failed to take target information into considerations. In order to address this issue, several target-specific attention mechanisms (Du et al., 2017;Zhou et al., 2017;Sun et al., 2018) have been proposed to embed target information into sentence representations. Sentiment information has also been proven to be beneficial to stance detection. For example, Sobhani et al. (2016) found that sentiment lexicon features are useful for stance detection when combined with other features. Instead of directly integrating sentiment features into vector representations, Sun et al. (2018) proposed a hierarchical attention network that learns the importance of sentiment information automatically. Later, Sun et al. (2019) proposed a joint model that determines stance and sentiment simultaneously. However, the attention mechanism is not considered in this model.
Motivated by recent advances in multi-task learning (Liu et al., , 2017Balikas et al., 2017;Yu et al., 2018;Cohan et al., 2019) and the effectiveness of sentiment information in stance detection (Sobhani et al., 2016;Sun et al., 2018Sun et al., , 2019, we propose an attention-based multi-task framework that takes sentiment classification as an auxiliary task for stance detection. Specifically, we first encode words of input sentences in fixed-length dense vectors (Mikolov et al., 2013;Bojanowski et al., 2017) and then feed them as input to Bidirectional Long Short-Term Memory networks (BiLSTM) (Hochreiter and Schmidhuber, 1997;Schuster and Paliwal, 1997). After that, the attention mechanism (Bahdanau et al., 2015) is used to extract important words for sentiment representation. Besides, a target-specific attention layer is designed to identify important words related to a given target. The main task incorporates sentiment information by concatenating the outputs of the two attention layers. Moreover, in order to provide guidance to the attention mechanism, sentiment and stance lexicon features are integrated into the final loss. We show that the proposed model, called AT-JSS-Lex, achieves competitive results on the SemEval-2016 dataset (Mohammad et al., 2016a(Mohammad et al., ,b, 2017. Our contributions are summarized as follows: First, we propose a joint sentiment and stance model (AT-JSS-Lex) based on multi-task learning that improves stance detection with the help of sentiment information and integrates both sentiment attention and target-specific attention. Second, we propose a novel formulation of the loss function that uses an existing sentiment lexicon and a stance lexicon, that we specifically constructed for this task, to guide the attention mechanism. As part of our contributions, we will make the stance lexicon publicly available. Finally, we show that the proposed AT-JSS-Lex model achieves remarkable improvements in performance over strong baselines and prior works on the SemEval-2016 stance detection dataset.

Multi-Task Learning Architecture
The overall architecture of the proposed model is shown in Figure 1. For sentiment classification (the auxiliary task), the input sentence (S = s 1 , s 2 , ..., s n ) is first sent to an embedding layer and each word is represented by a dense vector (X = x d 1 1 , where n is the sentence length and d 1 is the dimension of the word embedding. Then, a BiLSTM is used for feature extraction. At time step i, the hidden vector of forward LSTM is calculated based on previous hidden vector h i−1 and current input vector

The hidden vector
← − h i of backward LSTM is defined in a similar way. After concatenating the two hidden vectors, we obtain We adopt the attention mechanism (Bahdanau et al., 2015) to improve the three-class sentiment classification. Attention weight α 1 i is computed as: where e i is calculated based on sentence summary vector s, the final hidden vector of BiLSTM, and hidden vector h i : is the dimension of hidden units of LSTM. The final vector representation is the weighted sum of hidden vectors: At last, a fully-connected layer and a softmax layer are applied to get label distribution.
Different from the attention layer of auxiliary task, the attention layer of main task integrates target embedding t, similar to (Zhou et al., 2017). The target embedding t is the word embedding of the target word. For example, for target "Abortion," t is the word vector of the word "Abortion." For multi-word targets (e.g., Hillary Clinton), we use the average of the constituent word vectors (e.g., the average of the vectors corresponding to "Hillary" and "Clinton"). Then e i can be written as: Then, α 2 and r 2 are computed in the same way as Eq.
(1) and Eq. (3), respectively. The final vector representation of main task is the concatenation of r 1 and r 2 .
In the training stage, cross-entropy loss is used to train the model and the loss function L is defined as follows: where λ is a hyper-parameter to be tuned, determining the weight of stance detection task. L main and L aux are loss functions of main task and auxiliary task, respectively.

Lexicon Loss
Previous works showed that the attention mechanism is beneficial for stance detection. However, the attention mechanism does not always work well due to the size of training data and inability to identify target information. In order to address these issues, we propose a reformulation of the loss in Eq. (5) that employs both sentiment and stance lexicons to improve stance detection. Our full model is called AT-JSS-Lex. We use the sentiment lexicon 1 by Hu and Liu (2004) and construct a stance lexicon 2 (of almost 2,000 words) with words related to the five targets in the SemEval-2016 dataset. Specifically, we construct a stance lexicon for each target from the training data available from the Se-mEval 2016 dataset and from an extra 1,000 tweets for each target that we collected from Twitter using specific hashtags. For example, for the target "Hillary Clinton," hashtags such as "#Crooked-Hillary," "#hillaryforprison," and "#Hillary2016" are used to collect more tweets. After data collection, we manually extracted the directly related words (e.g., Killary, Shillary, and Hilly) and indirectly related words (e.g., Trump, Benghazi, and abortion) from each tweet. We ended up with around 400 lexicon words for each target.
The intuition of using sentiment and stance lexicons is to provide guidance to the attention layer. Specifically, given an input sentence, we mark sentiment and stance lexicon words as 1 and mark as 0 the remaining words that are not present in any of the two lexicons in order to obtain a lexicon vector lex. For example, the lexicon vector for "Celebrity atheism is beginning to irk me #init-1 https://www.cs.uic.edu/ ∼ liub/FBS/sentiment-analysis. html#lexicon 2 https://github.com/chuchun8/EMNLP19-Stance forthemoney #SemST" is [0 1 0 0 0 1 0 0 0]. Note that the words "atheism" and "irk" are from the stance and sentiment lexicons, respectively. The final loss from Eq. (5) is then defined as: where β is a hyperparameter that determines the importance of lexicon loss and α norm is the normalization of α sum . α sum is the summation of attention weights α 1 and α 2 . We normalize the α sum by dividing with the maximum value. Through the minimization of the loss function, ideally, α norm gets closer to lex, i.e., the vector components of α norm get closer to 1 when the corresponding components in lex are 1 (and closer to 0 otherwise), which enforces the model to learn higher attention weights for important words.

Datasets and Experimental Settings
We use SemEval-2016 Task 6.A to test the performance of our proposed model. This dataset contains five different targets: "Atheism," "Climate Change is a Real Concern" ("Climate"), "Feminist Movement" ("Feminism"), "Hillary Clinton" ("Hillary") and "Legalization of Abortion" ("Abortion"). Table 1 shows the distribution of these targets in the dataset. Each tweet has a stance label ("Favor," "Against" or "None") and a sentiment label ("Positive," "Negative" or "Other"). We sampled about 15% of the training data as validation data to tune the parameters. Word vectors are initialized using fastText word embeddings 3 (Bojanowski et al., 2017) with dimension 300. The size of hidden units in LSTM is set to 200. The dropout rate is 0.5 for dense layer and 0.2 for LSTM layer. The learning rate of Adam optimizer (Kingma and Ba, 2015) is 0.001. The size of the dense layer for the auxiliary task and  main task is 50 and 100, respectively. λ is 0.7 and β is 0.025 for all targets. L2 regularization is applied to the loss function and the regularization parameter is set to 0.01.

Evaluation Metrics
F avg , macro-average of F1 score (M acF avg ) and micro-average of F1 score (M icF avg ) are adopted to evaluate the performance of proposed model. Firstly, the F1 score of label "Favor" and "Against" is calculated as follows: where P and R are precision and recall respectively. Then the F1 average is calculated as: Note that the label "None" is not discarded during training. However, the label "None" is not considered in the evaluation because we are only interested in labels "Favor" and "Against" in this task. We average the F avg on each target to get M acF avg . Moreover, we get M icF avg by calculating F f avor and F against across all targets.

Results
First, an ablation experiment is used to determine the importance of each component of our proposed model for stance detection: • AT-JSS-Lex is a lexicon integrated multi-task model with attention mechanisms.
• AT-JSS is a model that shares the same architecture with AT-JSS-Lex, but has no lexicon loss.
• JSS is a joint sentiment and stance model similar to AT-JSS, but without the attention mechanisms.
• AT-BiLSTM is a BiLSTM model with stance attention.
• BiLSTM is a single task model that only exploits BiLSTM to detect stance.
Table 2 (top) shows the results of this ablation study. As we can see from the table, AT-JSS-Lex performs best on "Climate," "Feminism," "Hillary," and "Abortion." Moreover, AT-JSS-Lex has the best M acF avg and M icF avg scores when compared with the other models. Experimental results show that M acF avg and M icF avg drop by 0.75% and 0.27% when we remove the lexicon component, indicating that lexicon information contributes to stance detection (except on "Atheism"). Note that AT-JSS-Lex also outperforms AT-BiLSTM model, which shows that the proposed multi-task framework has better performance than single task with attention mechanism. Second, we compare the proposed model with the following baseline methods (all experimental results of baseline methods are retrieved from original papers): • SVM-ngram (Mohammad et al., 2016b) is trained by using word n-grams and character n-grams features, surpassing the best model in SemEval-2016 competition.
• JOINT (Sun et al., 2019) is a joint model that exploits sentiment information to improve stance detection task without attention mechanism.
• TAN (Du et al., 2017) is an attention-based LSTM model that extracts important part of given text. • AS-biGRU-CNN (Zhou et al., 2017) is another attention-based model that adds CNN layer after attention-based LSTM model to extract target specific features.
• HAN (Sun et al., 2018) is a hierarchical attention model leveraging various linguistic features.
• TGMN-CR (Wei et al., 2018) uses attention and memory modules to extract important information for detecting stance.
Table 2 (bottom) shows the results of this comparison as well. We can observe that AT-JSS-Lex model outperforms all baseline models except on "Atheism." Specifically, the proposed model outperforms JOINT model by 5.17% and 3.11% in M acF avg and M icF avg , demonstrating the effectiveness of attention mechanism and lexicon information. In addition, our AT-JSS-Lex model also performs better than attention-based models (TAN, HAN, AS-biGRU-CNN and TGMN-CR), showing that multi-task learning can benefit the stance detection task.

Attention Visualizations
In Figure 2, we list two input sentences and visualize the attention weights learned during the training process. The color indicates the importance of a word in given sentences, the darker the more important. We can observe that the words "global" and "warming" that are closely related to the target "Climate," have high attention weights in the first sentence. Likewise, in the second sentence, the words "Mary," "Mother," "God," and "sinners," are highlighted and show that the proposed model can pay attention to target-related words.

Error Analysis
Mislabeled data and compound hashtags are two challenging factors that increase the difficulty of further improving the stance classification. For example, consider the following tweet: Watch @Dame Lillard bring the Blazers to the playoff #Beast #SemST. Target: Atheism; Stance: Against; Sentiment: Positive. Even though the stance label of this tweet is "Against," however, we can observe that the content of this sentence is nothing related to the target. For SemEval 2016 dataset, sometimes it is hard to predict the stance without considering the compound hashtags. Here is an example: @Reince This is very credible! Good work! America is desperately in need of good leadership. #Vote-GOP #NoHillary #SemST. Target: Hillary Clinton; Stance: Against; Sentiment: Positive. It would be very difficult to infer the correct stance label if we do not consider the hashtags "#NoHillary" and "#VoteGOP." Therefore, inability to separate compound hashtags results in the loss of important target information.

Conclusion and Future Work
In this paper, we propose an attention-based multitask learning framework and integrate lexicon information to achieve better performance. Experimental results show that our model outperforms state-of-the-art deep learning methods for this task. Moreover, visualization results indicate the capability of our model to capture essential information. Future work includes exploiting unsupervised learning to generate target-related lexicon and incorporating more labels (e.g., emotion classification) for multi-task learning.