BUT-FIT at SemEval-2019 Task 7: Determining the Rumour Stance with Pre-Trained Deep Bidirectional Transformers

This paper describes our system submitted to SemEval-2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours, Subtask A (Gorrell et al., 2019). The challenge focused on classifying whether posts from Twitter and Reddit support, deny, query, or comment on a hidden rumour, the truthfulness of which is the topic of an underlying discussion thread. We formulate the problem as stance classification, determining the rumour stance of a post with respect to the previous thread post and the source thread post. The recent BERT architecture was employed to build an end-to-end system which reached an F1 score of 61.67% on the provided test data. It finished in 2nd place in the competition, without any hand-crafted features, only 0.2% behind the winner.


Introduction
Fighting false rumours on the internet is a tedious task. Sometimes, even understanding what a rumour is actually about may prove challenging, and only then can one judge its veracity with appropriate evidence. The works of Ferreira and Vlachos (2016) and Enayet and El-Beltagy (2017) focused on predicting rumour veracity in thread discussions. These works indicated that veracity is correlated with the stances of discussion participants towards the rumour. Following this, Subtask A of SemEval-2019 Task 7 consisted of classifying whether the stance of each post in a given Twitter or Reddit thread supports, denies, queries or comments on a hidden rumour.
Potential applications of such a function are wide, ranging from the analysis of popular events (political discussions, the Academy Awards, etc.) to quickly disproving fake news during disasters.
Stance classification (SC), in its traditional form, is concerned with determining the attitude of a source text towards a target text (Mohammad et al., 2016). It has been studied thoroughly for discussion threads (Walker et al., 2012; Hasan and Ng, 2013; Chuang and Hsieh, 2015). However, the objective of Subtask A of SemEval-2019 Task 7 is to determine the stance towards a hidden rumour which is not given explicitly (it can often be inferred from the source post of the discussion, i.e., the root of the tree-shaped discussion thread, as demonstrated in Figure 1). The competitors were asked to classify the stance of the source post itself too.
Figure 1: An example of a source post: ".@AP I demand you retract the lie that people in #Ferguson were shouting "kill the police", local reporting has refuted your ugly racism"

The provided dataset was collected from Twitter and Reddit tree-shaped discussions. Stance labels were obtained via crowdsourcing. The discussions deal with 9 recently popular topics (e.g., the Sydney siege or the Germanwings crash).
The approach followed in our work builds on recent advances in language representation models.
We fine-tune a pre-trained end-to-end BERT (Bidirectional Encoder Representations from Transformers) model (Devlin et al., 2018), using the discussion's source post, the target's previous post and the target post itself as inputs to determine the rumour stance of the target post. Our implementation is available online. 1

Related Work

Previous SemEval competitions: In recent years, there were two SemEval competitions targeting stance classification. The first one focused on a setting in which the actual rumour was provided (Mohammad et al., 2016). The organizers of SemEval-2016 Task 6 prepared a benchmarking system based on an SVM using hand-made features and word embeddings from their previous system for sentiment analysis (Mohammad et al., 2013), outperforming all the challenge participants.
The second competition was the previous RumourEval, won by a system based on word vectors, hand-crafted features 2 and an LSTM (Hochreiter and Schmidhuber, 1997) summarizing the information along the discussion's branches (Kochkina et al., 2017). Other submissions were either based on similar hand-crafted features (Singh et al., 2017; Wang et al., 2017; Enayet and El-Beltagy, 2017), features based on sets of words for determining language cues such as Belief or Denial (Bahuleyan and Vechtomova, 2017), post-processing via rule-based heuristics after feature-based classification (Srivastava et al., 2017), Convolutional Neural Networks (CNNs) with rules (Lozano et al., 2017), or CNNs that jointly learnt word embeddings (Chen et al., 2017).
End-to-end approaches: Augenstein et al. (2016) encode the target text by means of a bidirectional LSTM (BiLSTM) conditioned on the source text. The paper empirically shows that the conditioning on the source text really matters. Du et al. (2017) propose target-augmented embeddings (embeddings concatenated with an average of the source text embeddings) and apply them to compute an attention based on the weighted sum of target embeddings, previously transformed via a BiLSTM. Mohtarami et al. (2018) propose an architecture that encodes the source and the target text via an LSTM and a CNN separately, and then uses a memory network together with a similarity matrix to capture the similarity between the source and the target text, inferring a fixed-size vector suitable for the stance prediction.

Pre-processing
We replace URLs and mentions with the special tokens $URL$ and $mention$ using tweet-preprocessor 3 . We use spaCy 4 to split each post into sentences and add an [EOS] token to mark the end of each sentence. We employ the tokenizer that comes with the Hugging Face PyTorch re-implementation of BERT 5 . The tokenizer lowercases the input and applies the WordPiece encoding (Wu et al., 2016) to split input words into the most frequent n-grams present in the pre-training corpus, effectively representing text at the subword level while keeping a vocabulary of only 30,000 tokens.
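To make the pipeline concrete, the following is a minimal sketch of this pre-processing chain. It is our illustrative reconstruction, not the authors' exact code: it assumes the tweet-preprocessor, spaCy, and (modern) Hugging Face transformers packages, and registers [EOS] as an extra special token so that WordPiece does not split it.

import preprocessor as p  # the tweet-preprocessor package
import spacy
from transformers import BertTokenizer

# Replace only URLs and user mentions with $URL$ / $MENTION$ placeholders.
p.set_options(p.OPT.URL, p.OPT.MENTION)

nlp = spacy.load("en_core_web_sm")
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[EOS]"]})

def preprocess(post: str) -> list[str]:
    """Substitute URLs/mentions, mark sentence boundaries, apply WordPiece."""
    text = p.tokenize(post)  # e.g. "@AP see https://t.co/x" -> "$MENTION$ see $URL$"
    sentences = [s.text for s in nlp(text).sents]
    marked = " [EOS] ".join(sentences) + " [EOS]"
    return tokenizer.tokenize(marked)  # lowercased subword tokens

print(preprocess("@AP I demand you retract the lie. See https://t.co/3bcKOKrCJB"))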

Model
Following the recent trend in transfer learning from language models (LM), we employ the pre-trained BERT model. The model is first trained on the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) using a multi-task objective consisting of LM and machine comprehension (MC) sub-objectives. The LM objective aims at predicting the identity of 15% randomly masked tokens present in the input 6 . Given two sentences from the corpus, the MC objective is to classify whether the second sentence follows the first sentence in the corpus; the sentence is replaced randomly in half of the cases. During pre-training, the input consists of two documents, each represented by a sequence of tokens and divided by the special [SEP] token.

Figure 2: An architecture of BUT-FIT's system. The text segment containing document 1 is green, the segment containing document 2 (the target post) is blue. The input representation is obtained by summing the input embedding matrices $E = E_t + E_s + E_p \in \mathbb{R}^{L \times d}$, $L$ being the input length and $d$ the input dimensionality. The input is passed $N$ times through the transformer encoder. Finally, the [CLS] token-level output is fed through two dense layers yielding the class prediction.
Our system follows the assumption that the stance of a discussion post depends only on the post itself, on the source thread post and on the previous thread post. Since the original input is composed of two documents, we experimented with various ways of encoding the input (see Section 5), ending up with just a concatenation of the source and the previous post as document 1 (left empty in case the source post is the target post) and the target post as document 2. The discriminative fine-tuning of BERT is done by taking the [CLS] token-level output and passing it through two dense layers yielding the posterior probabilities, as depicted in Figure 2. A weighted cross-entropy loss is used to ensure a flat prior over the classes.
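A minimal sketch of this classification head follows, assuming the Hugging Face transformers BertModel. The ReLU between the two dense layers and the intermediate layer width are our assumptions, and the per-class counts used to weight the loss are illustrative placeholders, not the dataset's true statistics.

import torch
import torch.nn as nn
from transformers import BertModel

class StanceClassifier(nn.Module):
    def __init__(self, n_classes: int = 4, hidden: int = 1024):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-large-uncased")
        # Two dense layers on top of the [CLS] token-level output.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, input_ids, token_type_ids, attention_mask):
        out = self.bert(input_ids=input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # hidden state at the [CLS] position
        return self.head(cls)              # pre-softmax class scores

# Weighted cross-entropy: weights inversely proportional to per-class counts
# flatten the effective class prior (counts below are illustrative only).
counts = torch.tensor([4.0, 1.0, 1.0, 6.0])  # support, deny, query, comment
loss_fn = nn.CrossEntropyLoss(weight=counts.sum() / (len(counts) * counts))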

Ensembling
Before submission, we trained 100 models differing just by their learning rates. We experimented with 4 different fusion mechanisms in order to increase the F1 measure and compensate for overfitting.

The TOP-N fusion chooses 1 model randomly and adds it to the ensemble. Then it randomly shuffles the rest of the models and tries to add them to the ensemble one at a time, calculating the ensemble's F1 by averaging the output probabilities, effectively approximating Bayesian model averaging. If a model increases the ensemble's F1 score, it is permanently added to the ensemble. The process is repeated until no further model improving the ensemble's F1 score can be found. This procedure resulted in a set of 17 best models.

The EXC-N fusion starts with all models in the ensemble and then iteratively drops one model at a time, starting with the one whose removal results in the largest increase of the ensemble's F1. The process stops when dropping any further model cannot increase the F1 score. Using this approach, we ended up with 94 models.

The TOP-Ns fusion is analogous to the TOP-N fusion, but we average pre-softmax scores instead of output class probabilities.

The OPT-F1 fusion learns weights summing to 1 for a weighted average of the output probabilities of the models selected via the procedure used in the TOP-N strategy. The weights are estimated using the modified Powell's method from the SciPy package to maximize the F1 score on the development dataset.
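The following is a sketch of the greedy TOP-N selection under our reading of the description above; top_n_fusion, probs (a list of per-model class-probability matrices on the development set) and labels are hypothetical names, and macro-F1 comes from scikit-learn.

import random
import numpy as np
from sklearn.metrics import f1_score

def top_n_fusion(probs, labels, seed=0):
    """probs: list of (n_examples, n_classes) arrays, one per trained model."""
    rng = random.Random(seed)
    pool = list(range(len(probs)))
    rng.shuffle(pool)
    ensemble = [pool.pop()]  # start from one randomly chosen model

    def ensemble_f1(members):
        avg = np.mean([probs[i] for i in members], axis=0)  # average probabilities
        return f1_score(labels, avg.argmax(axis=1), average="macro")

    improved = True
    while improved:
        improved = False
        rng.shuffle(pool)
        for i in list(pool):
            if ensemble_f1(ensemble + [i]) > ensemble_f1(ensemble):
                ensemble.append(i)   # permanently add the helpful model
                pool.remove(i)
                improved = True
    return ensemble

The OPT-F1 weights could analogously be fitted with scipy.optimize.minimize(..., method="Powell") over weights normalized to sum to 1 inside the objective, since Powell's method itself does not handle constraints.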

Experimental Setup
We implemented our models in PyTorch, taking advantage of the Hugging Face re-implementation (see Footnote 5), with the "BERT-large-uncased" setting: 24 pre-trained transformer layers, a hidden unit size of d = 1024, 16 attention heads, and 335M parameters. When building the ensemble, we picked learning rates from the interval [1e−6, 2e−6]. Each epoch iterates over the dataset in an ordered manner, starting with the shortest sequence. We truncate sequences at a maximum length of l = 200 using a heuristic: first we truncate document 1 to length l/2; if that is not enough, we then truncate document 2 to the same size.
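A compact sketch of this truncation heuristic, as we read it (operating on already-tokenized sequences):

def truncate_pair(doc1, doc2, l=200):
    """Cap doc1 at l/2 first; if the pair still exceeds l, cap doc2 at l/2 too."""
    if len(doc1) + len(doc2) > l:
        doc1 = doc1[: l // 2]
    if len(doc1) + len(doc2) > l:
        doc2 = doc2[: l // 2]
    return doc1, doc2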
We keep a batch size of 32 examples and keep the other hyperparameters the same as in the BERT paper. We use the same Adam optimizer with an L2 weight decay of 0.01 and no warmup. We trained the model on a GeForce RTX 2080 Ti GPU.

Table 2: Results. $BERT_{big-nosrc}$ and $BERT_{big-noprev}$ denote system instantiations with an empty source and an empty previous post, respectively. Note that the accuracy is biased towards the different training data priors, as shown in Table 1. SemEval submissions are denoted by *.

Results and Discussion
We compare the developed system to three baselines. The first one is the branch-LSTM baseline provided by the task organizers (http://tinyurl.com/y4p5ygn7), inspired by the winning system of RumourEval 2017. The second baseline (FeaturesNN) is our re-implementation of the first baseline in PyTorch without the LSTM: posts are classified by means of a 2-layer network (ReLU/Softmax), using only the features defined in Footnote 2. In the third case (BiLSTM+SelfAtt), we use the same input representation as in our submitted model but replace BERT with a 1-layer BiLSTM network followed by a self-attention and a softmax layer, inspired by Lin et al. (2017). The results are shown in Table 2. BERT models had to cope with a high variance during training. This might be caused by the difficulty of the problem, the relatively small number of training examples, or the complexity of the models. To deal with the problem, we decided to discard all models with F1 scores below 55 on the development dataset, and we averaged the output class probability distributions when ensembling. Our initial experiments used sequences up to the length of 512, but we found no difference when truncating them down to 200.
What features were not helpful: We tried adding a number of other features, including those indicating positive, neutral, or negative sentiment, and all the features used by the FeaturesNN baseline. We also tried adding jointly learned POS, NER, and dependency tag embeddings, as well as third segment embeddings 8 . We also experimented with an explicit [SEP] token to separate the source and the previous post in the BERT input. However, none of the mentioned changes led to a statistically significant improvement.

Conclusions and Future Directions
The system presented in this paper achieved a macro F1 score of 61.67, improving upon the baseline by 12.37%, while using only the discussion's source post, the previous post and the target post to classify the target post's stance towards a rumour.
A detailed analysis of the provided data shows that the employed information sources are not sufficient to correctly classify some examples. Our future work will focus on extending the system with a relevance scoring component. To preserve the context, it will evaluate all posts in a given discussion thread and pick the most relevant ones according to defined criteria.

A.1 Dataset Insights
The dataset contains the whole tree structure and metadata for each discussion from Twitter and Reddit. The nature of the data differs across the sources (for example, the Reddit subset includes upvotes). When analysing the data, we spotted several anomalies:

• 12 data points do not contain any text. According to the task organizers, these had been deleted by users by the time of the data download and were left in the data so as not to break the conversational structure.
• The query stance of some examples taken from the subreddit DebunkThis 9 depends on domain knowledge.

• The class of some examples is ambiguous; they should probably be labelled with multiple classes.

A.1.1 Domain knowledge dependency
Examples from the subreddit DebunkThis all have the same format "Debunk this: [statement]", e.g. "Debunk this: Nicotine isn't really bad for you, and it's the other substances that makes tobacco so harmful.". All these examples are labelled as queries.

A.1.2 Class ambiguity
The source/previous post "This is crazy! #CapeTown #capestorm #weatherforecast https://t.co/3bcKOKrCJB" and the target post "@RyGuySA Oh my gosh! Is that not a tornado?! Cause wow, It almost looks like one!", labelled as a comment in the dataset, might be seen as a query as well.

A.2 Additional Introspection
Figures 3, 4, 5, and 6 demonstrate attention matrices $A$, derived from the multi-head attention defined as $A = \operatorname{softmax}\left(QK^{\top}/\sqrt{d_k}\right)$, where $Q, K \in \mathbb{R}^{L \times d_k}$ are the matrices containing the query/key vectors and $d_k$ is the key dimension. The insights are selected from the heads at the first layer of the transformer encoder.
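As an illustration of how such matrices can be obtained, the following sketch pulls the first-layer attention tensors out of a Hugging Face BERT model; the tooling choice is our assumption, since the paper does not state how the matrices were extracted.

import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_attentions=True)
model.eval()

inputs = tok("Oh my gosh! Is that not a tornado?!", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple of N tensors, one per layer, each of shape
# (batch, heads, L, L), holding softmax(QK^T / sqrt(d_k)) per head.
A_first_layer = out.attentions[0][0]  # (heads, L, L) for the single input
print(A_first_layer.shape)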