Incorporating Label Dependencies in Multilabel Stance Detection

Stance detection in social media is a well-studied task in a variety of domains. Nevertheless, previous work has mostly focused on multiclass versions of the problem, where the labels are mutually exclusive, and typically positive, negative or neutral. In this paper, we address versions of the task in which an utterance can have multiple labels, thus corresponding to multilabel classification. We propose a method that explicitly incorporates label dependencies in the training objective and compare it against a variety of baselines, as well as a reduction of multilabel to multiclass learning. In experiments with three datasets, we find that our proposed method improves upon all baselines on two out of three datasets. We also show that the reduction of multilabel to multiclass classification can be very competitive, especially in cases where the output consists of a small number of labels and one can enumerate over all label combinations.


Introduction
Stance detection is an established task in the computational linguistics community, and is typically concerned with whether an utterance (e.g. a tweet) expresses an attitude (often positive, negative or neutral) towards a target such as an entity, e.g. a politician (Hasan and Ng, 2013; Mohammad et al., 2016), or another utterance, e.g. a previous tweet in a thread (Zubiaga et al., 2016). Stance detection is thus an important task for analyzing discourse in online forums and social media platforms, and is a component in assessing the veracity of claims (Kochkina et al., 2018).
When the stances are mutually exclusive, as in the aforementioned cases, multiclass classification is an appropriate formulation for the task. Often, however, a text may express multiple stances simultaneously. Such cases need to be formulated as multilabel classification (Sorower, 2010), where an instance can receive multiple, non-mutually exclusive labels. The most commonly used approaches to multilabel classification treat the task by learning models for each label. However, such approaches do not model dependencies between the labels explicitly, i.e. that the presence of one label makes another more or less likely.

Figure 1: Examples of utterances annotated with multiple stances in each dataset.

Brexit Blog Corpus (Simaki et al., 2018)
Utterance: rivalry between the us and china is inevitable but it needs to be kept within bounds that would preclude the use of military force.
Stances: certainty, contrariety, necessity, prediction

US Election Twitter Corpus (Sobhani et al., 2019)
Utterance: voters mean more than super delegates @sensanders corrupt -> #hillaryclinton spends millions on msm to discourage #americans voting #sanders
Stances: Clinton: AGAINST, Sanders: FAVOR

Moral Foundations Twitter (Dehghani et al., 2019)
Utterance: blatant racism in #colorado, #blacklivesmatter http://fb.me/1ibyxmswm
Stances: cheating, harm
In this paper, we investigate multilabel stance detection in the context of three datasets: the Brexit Blog Corpus (BBC) (Simaki et al., 2018), the US Election Tweets Corpus (ETC) (Sobhani et al., 2019), and the Moral Foundations Twitter Corpus (MFTC) (Dehghani et al., 2019). Figure 1 shows examples from each dataset where the utterances have been annotated with multiple stances. In BBC and MFTC, each utterance is annotated with a variable number of stances, encoded as binary presence/absence of each possible stance. In ETC, each utterance is annotated with one of three stances, FAVOR (positive), AGAINST (negative), or NONE (neutral), towards each of the two candidates. We show that it is possible to improve over baseline results that employ binary relevance and multitask learning by incorporating label dependencies explicitly with the cross-label dependency loss (Yeh et al., 2017), originally introduced by Zhang and Zhou (2006). We also show that a reduction of multilabel to multiclass classification by considering all label combinations, also known as label powerset, can be very competitive, especially when the output consists of a small number of labels and one can enumerate all combinations, and we verify our results with statistical significance testing. Finally, we improve on the best reported results on the ETC dataset.

Multilabel classification
The most commonly used approach to multilabel classification encodes the labels so that a single multilabel classification task is reduced to many sub-tasks learned independently. For example, for BBC and MFTC, a binary model is learnt for each label, predicting its presence or absence, hence the name Binary Relevance (BR), which has also been used in image classification (Boutell et al., 2004). It is straightforward to extend BR to tasks such as ETC, where each position in the output can take more than two values, by using multiclass classifiers instead of binary ones. While it is possible to learn the models for the sub-tasks jointly using multitask learning (Ruder, 2017), this does not capture label dependencies in the output directly; instead, it encourages the layers of the model before the output to be learned so as to benefit all tasks simultaneously.
An alternative encoding method, Label Powerset (LP), captures dependencies explicitly: each label combination appearing in the training data is encoded as a new, unique label, reducing the task once again to a multiclass classification. However, LP can introduce an exponentially large number of new labels, potentially with few training instances, thus exacerbating class imbalance. Moreover, only those label combinations seen in the data will be available during training; this can be a particular issue when there is a shortage of representative training data.
BR encoding methods ignore label dependencies, while the LP method relies on encoding each label combination appearing in the training set as an explicit new label; both reduce the task to binary/multiclass classification. In what follows, we adopt a middle ground between BR and LP by incorporating a notion of dependence between the labels in the targets.

Cross-label dependency loss
To capture the dependencies among labels in the output, we follow Yeh et al. (2017) and employ the cross-label dependency (XLD) loss between vectors y and ŷ:

    XLD(y, ŷ) = (1 / (|y¹| |y⁰|)) Σ_{p ∈ y¹} Σ_{q ∈ y⁰} exp(ŷ_q − ŷ_p)

where y denotes a vector of true (binary) labels of dimension n, ŷ a vector of predicted label probabilities, y¹ are the indices of the 1-labelled elements of y, y⁰ are the indices of the 0-labelled elements, and ŷ_p denotes the p-th element of the vector ŷ. Minimising the cross-label dependency loss is equivalent to maximising the distance between 0- and 1-labelled targets, penalising the model when it predicts label pairs that should not co-exist for the instance. The intuition is similar to that of Bayesian Personalized Ranking in collaborative filtering (Rendle et al., 2009). We add XLD to the cross-entropy loss to define an overall loss function L(Y, Ŷ):

    L(Y, Ŷ) = Σ_i [ XEnt(Y_i, Ŷ_i) + α · XLD(Y_i, Ŷ_i) ]

where XEnt is the cross-entropy loss, Y_i denotes a row vector of dimension n, and α ≥ 0 is a hyper-parameter controlling the contribution of the cross-label dependency loss to the overall loss.
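A pure-Python sketch of the XLD term and the combined objective, following the description above (not the paper's implementation; the α value is illustrative):

```python
from math import exp, log

# Sketch of the cross-label dependency (XLD) loss: for every pair of a
# 1-labelled and a 0-labelled position, penalise exp(score of the
# 0-label minus score of the 1-label).
def xld_loss(y_true, y_prob):
    pos = [i for i, y in enumerate(y_true) if y == 1]  # indices in y^1
    neg = [i for i, y in enumerate(y_true) if y == 0]  # indices in y^0
    if not pos or not neg:  # undefined without both index sets
        return 0.0
    total = sum(exp(y_prob[q] - y_prob[p]) for p in pos for q in neg)
    return total / (len(pos) * len(neg))

# Combined per-instance objective: binary cross-entropy plus the
# alpha-weighted XLD term.
def total_loss(y_true, y_prob, alpha=0.1, eps=1e-12):
    xent = -sum(y * log(p + eps) + (1 - y) * log(1 - p + eps)
                for y, p in zip(y_true, y_prob)) / len(y_true)
    return xent + alpha * xld_loss(y_true, y_prob)
```

As intended, the XLD term shrinks when probabilities for 1-labelled positions exceed those for 0-labelled positions, and grows when the ordering is violated.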
Extending XLD to the ETC dataset is slightly complicated by the fact that the labels have three possible values, so we cannot represent a set of target labels as a single binary vector. First, we encode the labels using a one-hot binary representation, so for example, AGAINST = 100, NONE = 010 and FAVOR = 001. We then apply XLD between the two encoded target labels, y and z, of each tweet, and their predicted label probabilities ŷ and ẑ respectively, as follows:

    XLD(y, z, ŷ, ẑ) = (1 / (|y⁰| |z¹|)) Σ_{p ∈ z¹} Σ_{q ∈ y⁰} exp(ŷ_q − ẑ_p)

The above definition is not symmetric, since it compares the 0-labelled positions of the first target label with the 1-labelled positions of the second target label. We re-introduce symmetry by defining the overall loss function as:

    L(Y, Z, Ŷ, Ẑ) = Σ_i [ XEnt(Y_i, Ŷ_i) + XEnt(Z_i, Ẑ_i) + α_L · XLD(Y_i, Z_i, Ŷ_i, Ẑ_i) + α_R · XLD(Z_i, Y_i, Ẑ_i, Ŷ_i) ]

where α_L ≥ 0 and α_R ≥ 0 are hyper-parameters controlling the contribution of the left and right XLD loss across the targets, respectively.
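The asymmetric cross-target term and its symmetrised combination can be sketched as follows; this is a reading of the description above, with function names and α values that are illustrative rather than the paper's:

```python
from math import exp

# Sketch of the cross-target XLD term for the two-target ETC setting:
# compare the 0-labelled positions of the first target (y) with the
# 1-labelled positions of the second target (z).
def xld_pair(y_true, z_true, y_prob, z_prob):
    neg = [q for q, y in enumerate(y_true) if y == 0]  # y^0
    pos = [p for p, z in enumerate(z_true) if z == 1]  # z^1
    total = sum(exp(y_prob[q] - z_prob[p]) for p in pos for q in neg)
    return total / (len(pos) * len(neg))

# Symmetry is restored by weighting both directions of the term.
def symmetric_xld(y_true, z_true, y_prob, z_prob,
                  alpha_l=0.1, alpha_r=0.1):
    return (alpha_l * xld_pair(y_true, z_true, y_prob, z_prob)
            + alpha_r * xld_pair(z_true, y_true, z_prob, y_prob))
```

With one-hot targets such as AGAINST = [1, 0, 0] and FAVOR = [0, 0, 1], a confident, correct pair of predictions incurs a smaller penalty than one where the second target's probability mass sits on the wrong position.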

Experimental setup
In our experiments, we use the following multilabel stance detection datasets. The BBC dataset (Simaki et al., 2018) contains 1,239 utterances from social media blogs. Each utterance is assigned multiple stances by expert annotators from a set of ten stances. The ETC dataset (Sobhani et al., 2019) consists of 3 sets of tweets, each associated with a pair of election candidates in the 2016 US Election: Donald Trump-Hillary Clinton (DT-HC), Donald Trump-Ted Cruz (DT-TC), and Hillary Clinton-Bernie Sanders (HC-BS), containing 1,722, 1,317 and 1,366 tweets respectively. The MFTC dataset consists of 35,108 tweets curated from six distinct discourse domains, e.g. natural disasters, politics, etc. Each tweet is annotated with up to 10 labels of moral sentiment.
Hyper-parameter selection is done using 5-fold cross-validation (CV) on the training set of each dataset. For the BBC and MFTC datasets, we split the data into an 80% training set and a 20% holdout test set. For the ETC dataset, we combine the training and validation sets already provided to perform CV, and report on the original test set.
In BBC and MFTC we use the Jaccard Similarity Score (JSS) (Jaccard, 1902) as our scoring metric:

    JSS(Y, Ŷ) = (1/N) Σ_i |Y_i ∩ Ŷ_i| / |Y_i ∪ Ŷ_i|

JSS gives credit for partial matches, but does not reward predicting the absence of labels, which is desirable as most labels for each instance are absent (e.g. 90% of the instances in BBC and MFTC have at most two labels). It is less harsh than accuracy (Exact Match Ratio) (Sorower, 2010), which requires the entire label combination to be predicted correctly. For the ETC dataset, where each tweet is tagged with exactly two stances (i.e. no absent labels), following Sobhani et al. (2019), we use the macro-averaged F1-score for FAVOR and AGAINST as the scoring metric.
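The JSS metric over predicted and gold label sets can be sketched as follows (a minimal version; scikit-learn's `jaccard_score` offers an equivalent computation over binary matrices):

```python
# Sketch of the Jaccard Similarity Score averaged over instances: each
# instance's score is the overlap of gold and predicted label sets
# divided by their union, so partial matches get partial credit and
# correctly absent labels earn nothing.
def jaccard_similarity_score(true_sets, pred_sets):
    scores = []
    for gold, pred in zip(true_sets, pred_sets):
        if not gold and not pred:       # both empty: exact match
            scores.append(1.0)
        else:
            scores.append(len(gold & pred) / len(gold | pred))
    return sum(scores) / len(scores)
```

For example, predicting one of two gold labels scores 0.5 for that instance, whereas Exact Match Ratio would score it 0.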

Results
In our experiments, we consider models that capture label dependencies explicitly as well as baselines that do not. As our baselines, we consider binary relevance using FastText (FT) classifiers (Joulin et al., 2017) for each stance label in BBC/MFTC and each politician in ETC, as well as a multi-task learning (MTL) approach (Ruder, 2017) where each of the classifiers becomes a task and they all operate on a shared hidden layer (hard parameter sharing). As models capturing dependencies, we consider three options: the combination of the cross-label dependency loss with MTL (MTL-XLD), and the combinations of label powerset with FT and MTL (FT-LP and MTL-LP respectively). For the latter, each label combination becomes a task learned jointly with the rest. Further details on all models and parameter tuning are in the supplementary material.
In Table 1 we report the test set results for all models. The results for the MFTC dataset are averaged across the six discourse domains. Overall, MTL-LP is the best performing multilabel classification method across all the datasets. MTL-LP also outperforms Seq2Seq, the best performing model reported in Sobhani et al. (2019) for the ETC dataset. MTL-XLD improves on the baseline models for the BBC and MFTC datasets, but performs slightly worse than MTL on the ETC dataset. We note that our results for the BBC and MFTC datasets are not directly comparable with previous work on BBC (Simaki et al., 2017) and MFTC (Dehghani et al., 2019), since we consider the full set of labels, whereas previous work removed the sparser ones. Our reimplementation of the logistic regression model of Simaki et al. (2017), as an additional baseline, performed poorly on the BBC dataset (20 in JSS), and we did not consider it further.

Bootstrap training experiments
Reporting results on a held out test set is standard practice, but we also examine how the two best performing models, MTL-LP and MTL-XLD, behave with varying amounts of training data. We sample increasingly larger fractions of the training sets, train the models on these fractions, and record the scoring metric on the original holdout test set. The learning curve for BBC is shown in Figure 2, from which we can see that the MTL-XLD model is superior to MTL-LP until the training dataset size is approximately 70% of the original, after which MTL-LP scores higher than MTL-XLD. The remaining learning curves for MFTC are given in the supplementary material, and show that for the discourse domains ALM, BLM, Davidson, Election and MeToo, MTL-LP is superior to MTL-XLD at all training set sizes; however, MTL-XLD is superior to MTL-LP for the Baltimore domain.

Conclusions
In this paper, we focused on multilabel stance detection and presented experiments on three datasets. We demonstrated that taking label dependencies into account improves the performance of classification-based and multi-task baselines. In future work, we will explore how to integrate the textual descriptions of the labels into our approach, which has been shown to be beneficial in the case of large label sets (Mullenbach et al., 2018).