CLEARumor at SemEval-2019 Task 7: ConvoLving ELMo Against Rumors

This paper describes our submission to SemEval-2019 Task 7: RumourEval: Determining Rumor Veracity and Support for Rumors. We participated in both subtasks. The goal of subtask A is to classify the type of interaction between a rumorous social media post and a reply post as support, query, deny, or comment. The goal of subtask B is to predict the veracity of a given rumor. For subtask A, we implement a CNN-based neural architecture using ELMo embeddings of post text combined with auxiliary features and achieve an F1-score of 44.6%. For subtask B, we employ an MLP neural network leveraging our estimates for subtask A and achieve an F1-score of 30.1% (second place in the competition). We provide results and analysis of our system performance and present ablation experiments.


Introduction
Online social media has changed the way of communicating and disseminating media content and opinions, but also paved the way for spreading false or unverified rumors.
RumourEval 2019 (Gorrell et al., 2019) provides a dataset of labelled threads from Twitter and Reddit where each source post mentions a rumor. Subtask A (SDQC) consists of deciding for each post in a thread whether it is in a support, deny, query, or comment relation to the rumor. The goal of subtask B (Verification) is to classify the veracity of the rumor as true, false, or unverified. Automated rumor classification is a challenging task as there is no definite evidence (e.g., authorized confirmation). In its absence, stance analysis is a useful approach. Systems that employ neural network architectures showed promising results in RumourEval 2017 (Derczynski et al., 2017), with the LSTM-based sequential model of Kochkina et al. (2017) performing best.

Figure 1: An example Twitter thread from the training dataset with SDQC labels for each post and a veracity label for the thread's source post. Any post that does not reply to another is a source post. Reply posts can be direct replies (replies to a source post) or nested replies (replies that reply to another reply post). A thread is the set containing a source post and all its reply posts.

* The first two authors contributed equally.
In this paper, we describe our approach CLEARumor (ConvoLving ELMo Against Rumors) for solving both subtasks and provide empirical results and ablation experiments of our architecture. We make our PyTorch-based implementation and trained models publicly available 1 .

System Description
After preprocessing the post text (Section 2.1) and embedding it with ELMo (Section 2.2), our architecture for subtask A (Section 2.3) passes the embedded text through a convolutional neural network (CNN) block, adds auxiliary features, and uses a multilayer perceptron (MLP) block for estimating class membership. These estimates are combined with further auxiliary features and fed into an MLP block for the classification for subtask B (Section 2.4).

Preprocessing
For preprocessing, we rely mostly on Erika Varis Doggett's tokenizer for Twitter and Reddit 2, with which we strip away all user handles (e.g., "@FutbolLife"), remove the number sign in front of hashtags (e.g., "#Ebola" becomes "Ebola"), remove URLs, and limit repetitions of the same character to at most three (e.g., "heeeeey" becomes "heeey"). We further lowercase all text, which improved performance over mixed case in initial experiments. Last, all posts are truncated after 32 tokens 3.
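The preprocessing steps above can be approximated with a few regular expressions. The following is a minimal sketch only: it does not use the tokenizer referenced in the text, and the function name `preprocess`, the regex patterns, and the naive word/punctuation split are assumptions made for illustration.

```python
import re
from typing import List

MAX_TOKENS = 32  # truncation length stated in the text


def preprocess(text: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Regex approximation of the described steps (not the original tokenizer)."""
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = re.sub(r"@\w+", " ", text)             # strip user handles
    text = re.sub(r"#(\w+)", r"\1", text)         # drop '#', keep the tag text
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)  # cap character repetitions at three
    text = text.lower()                           # lowercase everything
    # Assumption: a naive split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tokens[:max_tokens]                    # truncate after max_tokens tokens


if __name__ == "__main__":
    print(preprocess("@FutbolLife heeeeey #Ebola http://t.co/abc !!"))
```

The order of the substitutions matters in this sketch: URLs are removed first so that fragments such as "#anchor" inside a link are not mistaken for hashtags.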

ELMo Embeddings
The task of word embedding is to represent each word in a given sentence by a vector, which among other things allows for encoding words at the input layer in a neural network architecture. Traditional embedding methods such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) work independently of context and always map the same word to the same vector.
In contrast, ELMo (Peters et al., 2018) is a recent embedding approach based on bidirectional LSTM networks that considers the context a word occurs in and is thereby able to address certain linguistic peculiarities, e.g., that the same word can have different meanings depending on its context. Further, ELMo incorporates subword units and is thereby able to represent words not seen during training successfully, an important benefit for the social media domain, where users frequently misspell existing words or introduce new ones. Formally, given a sequence of words $w_1 w_2 \ldots w_n$, ELMo represents the $k$-th word $w_k$ as

$$\mathrm{ELMo}^{task}_k = \gamma^{task} \sum_{j=0}^{L} s^{task}_j \, h_{k,j},$$

where $L$ gives the number of internal layers that were used to train ELMo, $h_{k,j}$ is the contextual vector representation of layer $j$ for word $k$, and $\gamma^{task}$ and the $s^{task}_j$ are scalars that can be tuned specifically for the task at hand.

2 https://github.com/erikavaris/tokenizer
3 Only 10 out of the total 6634 Twitter posts are longer than this, while a few Reddit posts are up to 1,000 tokens long, which would result in very impractical batch sizes.
We report results for the pretrained model elmo_2x4096_512_2048cnn_2xhighway_5.5B 4, for which $L = 2$ and which outputs 1024-dimensional embedding vectors (we did not notice drastic improvements over the much smaller models). ELMo allows us to fine-tune $\gamma^{task}$, $s^{task}_j$, and even the $h_{k,j}$ by backpropagating gradients to them, but we decided against this for two reasons. First, because the RumourEval dataset is very small (cf. Table 1), adjusting these weights can quickly lead to overfitting. Second, keeping the weights constant allows us to precompute and store all ELMo embeddings once before the training process, which speeds up training considerably.
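To make the layer combination concrete, the sketch below computes the weighted sum of per-layer vectors for a single word in plain Python. It follows Peters et al. (2018) in normalizing the layer weights with a softmax; the text here only calls them scalars, so that normalization, like the names `softmax` and `elmo_mix`, is an assumption of this example.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_mix(layers: List[Vector], s_raw: List[float], gamma: float) -> Vector:
    """Combine the layer vectors h_{k,0..L} of one word k into one embedding.

    layers[j] is h_{k,j}; s_raw holds one unnormalized weight per layer
    (assumption: softmax-normalized, as in Peters et al., 2018).
    """
    if len(layers) != len(s_raw):
        raise ValueError("need exactly one weight per layer")
    s = softmax(s_raw)
    dim = len(layers[0])
    return [
        gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # L = 2 gives three layers (j = 0, 1, 2); toy 2-dimensional vectors.
    h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(elmo_mix(h, s_raw=[0.0, 0.0, 0.0], gamma=3.0))
```

With the weights held constant, as described above, this combination yields the same vector for a given post on every epoch, which is what makes precomputing the embeddings possible.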

Subtask A
Our architecture for subtask A is visualized in Fig. 2. First, the tokenized text of the post that is to be classified as either support, deny, query, or comment is represented with an ELMo embedding. Next, the embedded text is fed into $L_{conv}$-many convolutional layers. Here, a single convolutional layer consists of multiple 1D-convolution operations with a set of different kernel sizes $S$, each mapping onto $C$ convolutional channels, which are then concatenated along the channel axis. Each convolution operation is batch normalized (Ioffe and Szegedy, 2015) after a ReLU activation. To maintain an equal sequence length, sequences are padded with zero vectors. The resulting sequence representation is transformed into a single $|S| \cdot C$-dimensional vector via global average pooling. This sequence vector is concatenated with a vector of auxiliary features that encodes meta information about the post under classification (detailed in the next paragraph). Following is a stack of $L_{dense}$-many dense layers, for which dropout regularization (Srivastava et al., 2014) is performed after ReLU activation. Finally, a single linear layer that is softmax-activated yields the four estimates of class membership. Parameters are optimized using Adam (Kingma and Ba, 2015) and a cross-entropy loss.
We use the following auxiliary features: (1) a two-dimensional Boolean vector encoding whether the post is from Twitter or Reddit; (2) a five-dimensional real-valued vector encoding meta-information for the post author: whether the user is verified or not, the number of followers they have, the number of accounts they follow themselves, and a ratio of the latter two numbers 5; (3) the cosine similarity of the averaged ELMo embeddings of the post under classification to those of the thread's source post (defined to be 1 for source posts); and (4) a three-dimensional Boolean vector encoding whether the post is a source post, a direct reply, or a nested reply.
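Features (3) and (4) can be illustrated with a short, dependency-free sketch. The function names, the identifier-based way of telling direct from nested replies, and the fallback value of 0 for zero-length vectors are assumptions of this example rather than details given in the text.

```python
import math
from typing import List, Optional, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a sequence of token embeddings into one vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: similarity with a zero vector is set to 0
    return dot / (norm_a * norm_b)


def similarity_to_source(post_tokens, source_tokens, is_source: bool) -> float:
    """Feature (3): defined to be 1 for source posts."""
    if is_source:
        return 1.0
    return cosine_similarity(mean_vector(post_tokens), mean_vector(source_tokens))


def reply_type(parent_id: Optional[str], source_id: str) -> List[float]:
    """Feature (4): one-hot over (source post, direct reply, nested reply)."""
    if parent_id is None:       # replies to nothing: source post
        return [1.0, 0.0, 0.0]
    if parent_id == source_id:  # replies to the source post: direct reply
        return [0.0, 1.0, 0.0]
    return [0.0, 0.0, 1.0]      # replies to another reply: nested reply


if __name__ == "__main__":
    print(reply_type("reply-1", "source-0"))
    print(similarity_to_source([[1.0, 2.0]], [[2.0, 4.0]], is_source=False))
```

Concatenating the four feature groups listed above gives a vector of 2 + 5 + 1 + 3 = 11 dimensions.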
As hyperparameters, we employ a learning rate of $10^{-3}$, a batch size of 512, and train for 100 epochs. In our loss function, we weigh the estimates of support, deny, and query equally but that of comment at only a fifth of the strength because of the imbalance of the dataset. We add L2-regularization with a weight of $10^{-2}$. In our reported results we use $L_{conv} = 1$ convolutional layer with kernel sizes $S = \{2, 3\}$, each mapping into $C = 64$ channels, followed by $L_{dense} = 3$ dense layers with 128 hidden units each and a dropout of 0.5.
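The architecture and hyperparameters described in this section could be written in PyTorch roughly as follows. This is a sketch under stated assumptions, not the released implementation: the class names, the asymmetric zero padding for even kernel sizes, the auxiliary dimension of 11 (the sum of the four feature groups), the class order in the loss weights, and the use of Adam's `weight_decay` for the L2 term are all choices made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """One convolutional layer: parallel 1D convolutions, one per kernel size."""

    def __init__(self, in_channels, channels=64, kernel_sizes=(2, 3)):
        super().__init__()
        self.kernel_sizes = tuple(kernel_sizes)
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_channels, channels, k) for k in self.kernel_sizes])
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(channels) for _ in self.kernel_sizes])

    def forward(self, x):  # x: (batch, in_channels, seq_len)
        outputs = []
        for k, conv, norm in zip(self.kernel_sizes, self.convs, self.norms):
            # Zero-pad k - 1 positions in total so seq_len is preserved.
            padded = F.pad(x, ((k - 1) // 2, k // 2))
            outputs.append(norm(torch.relu(conv(padded))))  # batch norm after ReLU
        return torch.cat(outputs, dim=1)  # concatenate along the channel axis


class SdqcClassifier(nn.Module):
    def __init__(self, emb_dim=1024, aux_dim=11, channels=64, kernel_sizes=(2, 3),
                 num_conv=1, num_dense=3, hidden=128, dropout=0.5, num_classes=4):
        super().__init__()
        conv_layers, in_channels = [], emb_dim
        for _ in range(num_conv):
            conv_layers.append(ConvBlock(in_channels, channels, kernel_sizes))
            in_channels = channels * len(kernel_sizes)
        self.convs = nn.Sequential(*conv_layers)
        dense_layers, in_dim = [], in_channels + aux_dim
        for _ in range(num_dense):
            dense_layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden
        self.dense = nn.Sequential(*dense_layers)
        self.out = nn.Linear(in_dim, num_classes)

    def forward(self, emb, aux):  # emb: (batch, seq_len, emb_dim), aux: (batch, aux_dim)
        x = self.convs(emb.transpose(1, 2))  # Conv1d expects (batch, channels, seq_len)
        x = x.mean(dim=2)                    # global average pooling over the sequence
        x = torch.cat([x, aux], dim=1)       # append the auxiliary features
        return self.out(self.dense(x))       # logits; the loss applies the softmax


def make_training_objects(model):
    # Assumed class order: support, deny, query, comment (comment at 1/5 weight).
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0, 1.0, 0.2]))
    # Assumption: the L2 term is realized through Adam's weight_decay.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
    return criterion, optimizer
```

Returning logits and letting `nn.CrossEntropyLoss` apply the softmax is numerically more stable than an explicit softmax layer; at prediction time, `torch.softmax(logits, dim=1)` recovers the four class estimates.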

Subtask B
For subtask B we build a single feature vector that we feed into an MLP classifier. We reuse all the auxiliary features from subtask A except the last two, because all posts under classification in subtask B are source posts. We further add the following features: (1) a two-dimensional Boolean vector encoding whether media (an image or a URL) is attached to the post; (2) the upvote-to-downvote ratio of the post for Reddit (manually set to 0.5 for Twitter); (3) a two-dimensional real-valued vector encoding which fraction of the thread's posts are direct replies and which fraction are nested replies; and (4) the support, deny, and query probability estimates from subtask A, averaged over all posts in the thread. Similarly to subtask A, this feature vector is fed into a stack of $L_{dense}$-many dense layers with dropout regularization (Srivastava et al., 2014) after ReLU activation, after which a single softmax-activated linear layer yields estimates for the three classes true, false, or unverified.
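As an illustration of how such a feature vector might be assembled, the sketch below builds it from already-computed parts. The function names, argument layout, and string labels are hypothetical; the author and media vectors are taken as given because their exact encoding is not spelled out here, and using all thread posts as the denominator for the reply fractions is an assumption.

```python
from typing import List, Optional, Sequence


def mean_sdq(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Average the support, deny, and query estimates over all posts in a thread.

    probabilities[i] is assumed to be ordered (support, deny, query, comment).
    """
    n = len(probabilities)
    return [sum(p[c] for p in probabilities) / n for c in range(3)]


def reply_fractions(reply_types: Sequence[str]) -> List[float]:
    """Fractions of the thread's posts that are direct and nested replies."""
    n = len(reply_types)
    direct = sum(1 for t in reply_types if t == "direct")
    nested = sum(1 for t in reply_types if t == "nested")
    return [direct / n, nested / n]


def veracity_features(platform: str,
                      author: Sequence[float],
                      media: Sequence[float],
                      upvote_ratio: Optional[float],
                      reply_types: Sequence[str],
                      sdqc_probs: Sequence[Sequence[float]]) -> List[float]:
    is_twitter = platform == "twitter"
    features = [float(is_twitter), float(not is_twitter)]  # platform (2 dims)
    features += list(author)                               # author meta-information
    features += list(media)                                # media vector (2 dims)
    # Reddit upvote-to-downvote ratio; fixed to 0.5 for Twitter as in the text.
    features.append(0.5 if is_twitter else float(upvote_ratio))
    features += reply_fractions(reply_types)               # direct / nested fractions
    features += mean_sdq(sdqc_probs)                       # averaged subtask A estimates
    return features


if __name__ == "__main__":
    vec = veracity_features(
        "twitter", [1.0, 0.2, 0.1, 0.5, 0.0], [1.0, 0.0], None,
        ["source", "direct", "nested", "nested"],
        [[0.2, 0.1, 0.1, 0.6]] * 4)
    print(len(vec), vec)
```

With a five-dimensional author vector, the result has 2 + 5 + 2 + 1 + 2 + 3 = 15 dimensions.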
Our model was trained with a learning rate of $10^{-3}$ and a batch size of 128 for 5000 epochs. In our loss calculation, we weigh the unverified class at 0.3 of the strength of the other two, and add an L2-regularization weight of $10^{-2}$. We used $L_{dense} = 2$ dense layers with 512 hidden units each and a dropout of 0.25.
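The effect of the class weighting can be seen in a small worked example of a class-weighted cross-entropy. Normalizing by the sum of the weights is one common convention (it is what PyTorch's `CrossEntropyLoss` does with its default mean reduction); whether the original training code normalizes this way is an assumption, as are the names used below.

```python
import math
from typing import Dict, Sequence

# Weights from the text: unverified counts 0.3 times as much as true or false.
CLASS_WEIGHTS = {"true": 1.0, "false": 1.0, "unverified": 0.3}


def weighted_cross_entropy(probs: Sequence[Dict[str, float]],
                           labels: Sequence[str],
                           weights: Dict[str, float] = CLASS_WEIGHTS) -> float:
    """Weighted mean of -log p(gold class); probs[i] maps class name to probability."""
    total_loss = 0.0
    total_weight = 0.0
    for p, gold in zip(probs, labels):
        w = weights[gold]
        total_loss += -w * math.log(p[gold])
        total_weight += w
    return total_loss / total_weight


if __name__ == "__main__":
    predictions = [
        {"true": 0.9, "false": 0.05, "unverified": 0.05},  # confident and correct
        {"true": 0.4, "false": 0.5, "unverified": 0.1},    # gold class gets only 0.1
    ]
    print(weighted_cross_entropy(predictions, ["true", "unverified"]))
```

In this example the poorly predicted unverified thread contributes to the loss with weight 0.3 instead of 1, so errors on that class pull on the parameters less strongly than errors on true or false rumors.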

Evaluation
The dataset of RumourEval 2019 is summarized in Table 1, and the evaluation results for subtasks A and B are detailed in Table 2 and Table 3, respectively.

Table 2: Evaluation results for subtask A. All reported scores are multiplied by 100. We provide the macro-averaged F1-score for the development (Dev) and test (Test) datasets and for 10-fold cross validation (CV). For the test dataset, we further provide the individual F1-scores per class. "Always Comment" is a baseline that always predicts the most common class. "Submitted" are the results we officially submitted to RumourEval 2019. For our CLEARumor architecture we provide multiple ablation experiments. CLEAR aux CNN+MLP is our full system, CLEAR CNN+MLP is the same but without the auxiliary features, CLEAR aux MLP instead uses no convolutional layers, and CLEAR aux just concatenates averaged ELMo embeddings with auxiliary features and uses a single linear layer.

Table 3: Evaluation results for subtask B. We report F1 (multiplied by 100) and RMSE (root mean squared error) scores for the development (Dev) and test (Test) datasets and for 10-fold cross validation (CV). CLEAR Subtask-B is our subtask B architecture using the subtask A estimates from CLEAR aux CNN+MLP. CLEAR NileTMRG uses the same estimates but computes task B results using the NileTMRG system (Enayet and El-Beltagy, 2017).
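For reference, the macro-averaged F1-score used in these tables is the unweighted mean of the per-class F1-scores. The sketch below computes it on toy labels; the function names are hypothetical, and scoring a class with no gold and no predicted instances as 0 is an assumption of this example. No values from the tables are reproduced.

```python
from typing import Dict, Sequence

SDQC_LABELS = ("support", "deny", "query", "comment")


def f1_per_class(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str] = SDQC_LABELS) -> Dict[str, float]:
    scores = {}
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never occurs and is never predicted scores 0.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = SDQC_LABELS) -> float:
    """Unweighted mean of the per-class F1-scores."""
    scores = f1_per_class(gold, pred, labels)
    return sum(scores.values()) / len(labels)


if __name__ == "__main__":
    gold = ["comment", "comment", "comment", "support"]
    always_comment = ["comment"] * len(gold)  # toy "Always Comment" style baseline
    print(f1_per_class(gold, always_comment))
    print(macro_f1(gold, always_comment))
```

Because every class counts equally, a classifier that always predicts the majority class scores poorly on this metric even when its accuracy is high, which is why it is a sensible choice for a dataset dominated by comment posts.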
The reported results differ from our official submission, because we continued to tune hyperparameters afterwards. We report results as trained on the training dataset and then evaluated on the development and test datasets, as provided by the RumourEval organizers. Because neural network experiments are naturally nondeterministic (Reimers and Gurevych, 2018) and we did indeed notice huge variances when retraining models, we report the mean and standard deviation over 10 runs for each experiment. Additionally, we report scores from a 10-fold cross validation over the whole dataset. Simple cross-validation would be inappropriate in our setting because, for example, a split could place the same rumor in both the training and the test dataset, which would allow a model to just memorize which posts are rumorous. We ensure that this does not happen by keeping all posts belonging to the same rumor 6 in the same cross-validation fold. Note that scores on the organizer split and the cross validation are not directly comparable as different fractions of the whole dataset are used for training (~60-70% for the organizer split and ~90% for 10-fold cross validation).

Conclusion
We have presented CLEARumor, our architecture for the RumourEval 2019 shared tasks. In the future we aim to generalize our approach; e.g., we currently use domain-specific features for characterizing the post author popularity, such as number of followers for Twitter, which are not available for all social media platforms. Besides investigating how well our approach translates to other languages, we are interested in studying the results for other pretrained word representation approaches, e.g., BERT (Devlin et al., 2018).

6 For Twitter posts, the dataset contains rumor-topic labels for each thread, so we ensure that each topic only occurs in one fold. For Reddit posts, no labelling is available, so we can only ensure that all posts of a thread occur in the same fold.
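The fold assignment described in the evaluation and in footnote 6 can be sketched as a grouped split, where the group key would be the rumor topic for Twitter posts and the thread for Reddit posts. The function name and the greedy size-balancing heuristic are assumptions of this example; the text does not say how folds were balanced.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def grouped_folds(groups: Sequence[Hashable], n_folds: int = 10) -> List[List[int]]:
    """Assign example indices to folds so that no group is split across folds.

    groups[i] is the group key of example i (e.g., a rumor topic or thread id).
    """
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    # Greedy balancing (assumption): place the largest groups first,
    # each into the fold that currently holds the fewest examples.
    for group in sorted(by_group, key=lambda g: len(by_group[g]), reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(by_group[group])
    return folds


if __name__ == "__main__":
    topics = ["a", "a", "b", "c", "c", "c", "d"]
    for fold_id, test_indices in enumerate(grouped_folds(topics, n_folds=2)):
        held_out = set(test_indices)
        train_indices = [i for i in range(len(topics)) if i not in held_out]
        print(fold_id, "test:", sorted(test_indices), "train:", train_indices)
```

Each fold serves once as the held-out set while the remaining folds are used for training, so a rumor seen during training never reappears at evaluation time.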