Neural Architectures for Fine-Grained Propaganda Detection in News

This paper describes our system (MIC-CIS) and the results of our participation in the fine-grained propaganda detection shared task 2019. To address the tasks of sentence-level (SLC) and fragment-level (FLC) propaganda detection, we explore different neural architectures (e.g., CNN, LSTM-CRF and BERT) and extract linguistic (e.g., part-of-speech, named-entity, readability, sentiment and emotion), layout and topical features. Specifically, we design multi-granularity and multi-tasking neural architectures to jointly perform both sentence- and fragment-level propaganda detection. Additionally, we investigate different ensemble schemes such as majority-voting and relax-voting to boost overall system performance. Compared to the other participating systems, our submissions are ranked 3rd and 4th in the FLC and SLC tasks, respectively.


Introduction
In the age of information dissemination without quality control, malicious users can spread misinformation via social media and target individual users with propaganda campaigns to achieve political and financial gains, as well as to advance a specific agenda. Disinformation commonly takes two major forms: fake news and propaganda. They differ in that propaganda may be built upon true information (e.g., biased or loaded language, repetition, etc.).
Prior works in detecting propaganda (Rashkin et al., 2017; Habernal et al., 2017) have focused primarily at the document level, typically labeling all articles from a propagandistic news outlet as propaganda; thus, non-propagandistic articles from such an outlet are often mislabeled. To this end, the shared task (Da San Martino et al., 2019) focuses on analyzing the use of propaganda and detecting specific propagandistic techniques in news articles at the sentence and fragment level, respectively, and thus promotes explainable AI. The shared task comprises two subtasks: (1) Sentence-level Classification (SLC), a binary classification that predicts whether a sentence contains at least one propaganda technique, and (2) Fragment-level Classification (FLC), a token-level (multi-label) classification that identifies both the spans and the type of propaganda technique(s).
Contributions: (1) To address SLC, we design an ensemble of different classifiers based on Logistic Regression, CNN and BERT, and leverage transfer learning via pre-trained embeddings/models from FastText and BERT. We also employ different features such as linguistic (sentiment, readability, emotion, part-of-speech and named-entity tags, etc.), layout and topics. (2) To address FLC, we design a multi-task neural sequence tagger based on LSTM-CRF and linguistic features to jointly detect propagandistic fragments and their types. Moreover, we investigate performing FLC and SLC jointly in a multi-granularity network based on LSTM-CRF and BERT. (3) Our system (MIC-CIS) is ranked 3rd (out of 12 participants) and 4th (out of 25 participants) in the FLC and SLC tasks, respectively.
Additionally, we extract topical features (Blei et al., 2003; Gupta et al., 2019a) at the sentence and document levels in order to determine irrelevant themes, if introduced to the issue being discussed (e.g., Red Herring). For word and sentence representations, we use pre-trained vectors from FastText (Bojanowski et al., 2017) and BERT (Devlin et al., 2019).

Sentence-level Propaganda Detection
Figure 1 (left) describes the three components of our system for the SLC task: features, classifiers and ensemble. The arrows from features to classifiers indicate that we investigate linguistic, layout and topical features in the two binary classifiers: Logistic Regression and CNN. For CNN, we follow the architecture of Kim (2014) for sentence-level classification, initializing the word vectors with FastText or BERT. We concatenate the features in the last hidden layer before classification.
One of our strong classifiers is BERT, which has achieved state-of-the-art performance on multiple NLP benchmarks. Following Devlin et al. (2019), we fine-tune BERT for binary classification, initializing with a pre-trained model (i.e., BERT-base, Cased). Additionally, we apply a decision function (Table 1) such that a sentence is tagged as propaganda if the prediction probability of the classifier is greater than a threshold τ. We relax the binary decision boundary to boost recall, similar to Gupta et al. (2019b).
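The relaxed decision boundary can be sketched as follows (a minimal illustration; the function name is ours, and the default threshold reflects the τ value reported in our SLC experiments):

```python
def tag_sentence(prob_propaganda: float, tau: float = 0.35) -> str:
    """Tag a sentence as propaganda if the classifier's predicted
    probability exceeds the (relaxed) threshold tau."""
    return "propaganda" if prob_propaganda > tau else "non-propaganda"

# Lowering tau below 0.5 relaxes the binary decision boundary,
# trading precision for recall.
```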
Ensemble of Logistic Regression, CNN and BERT: In the final component, we collect predictions (i.e., propaganda labels) for each sentence from the three (M = 3) classifiers, obtaining M predictions per sentence. We explore two ensemble strategies (Table 1): majority-voting and relax-voting, to boost precision and recall, respectively.
Fragment-level Propaganda Detection

We address FLC with neural sequence taggers based on LSTM-CRF: (1) LSTM-CRF+Multi-task, which performs multi-tasking with the auxiliary task of propaganda fragment detection (PFD), and (2) LSTM-CRF+Multi-grain, which jointly performs FLC and SLC with FastTextWordEmb and BERTSentEmb, respectively. Here, we add a binary sentence-classification loss to the sequence-tagging loss, weighted by a factor α.
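The multi-grain objective can be written as follows (notation ours; L_FLC denotes the CRF sequence-tagging loss and L_SLC the binary sentence-classification loss):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{FLC}} \;+\; \alpha \, \mathcal{L}_{\text{SLC}}
```

With the small weight α = 0.1 used in our experiments, sequence tagging remains the dominant objective.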
Ensemble of Multi-grain, Multi-task LSTM-CRF with BERT: Here, we build an ensemble by considering the propagandistic fragments (and their types) from each of the sequence taggers. In doing so, we first perform majority voting at the fragment level for fragments whose spans exactly overlap. In the case of non-overlapping fragments, we consider all of them. However, when spans overlap only partially (though with the same label), we keep the fragment with the largest span.
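The span-merging part of this logic can be sketched as follows (a simplified illustration over hypothetical `(start, end, label)` tuples; the majority-voting step over exactly-matching spans is omitted for brevity, and the actual system operates on token-level tags):

```python
def merge_predictions(fragments):
    """Combine labeled spans (start, end, label) from several taggers:
    overlapping spans with the same label are reduced to the one with
    the largest extent; non-overlapping spans are all kept."""
    merged = []
    # sort by start, then by descending length, so the widest span wins ties
    for frag in sorted(fragments, key=lambda f: (f[0], f[0] - f[1])):
        s, e, lab = frag
        for i, (ms, me, mlab) in enumerate(merged):
            if lab == mlab and s < me and ms < e:  # same-label overlap
                if e - s > me - ms:                # keep the largest span
                    merged[i] = frag
                break
        else:
            merged.append(frag)
    return merged
```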

Experiments and Evaluation
Data: While the SLC task is binary, the FLC task consists of 18 propaganda techniques (Da San Martino et al., 2019). We split (80-20%) the annotated corpus into 5 folds and 3 folds for the SLC and FLC tasks, respectively. The development set of each fold is denoted by dev (internal), while the un-annotated corpus used in the leaderboard comparisons is denoted by dev (external). We remove empty and single-token sentences after tokenization.
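The fold construction can be sketched as follows (a simplified, non-stratified sketch; for the 5-fold SLC split, each fold's held-out slice yields the 80-20% train/dev ratio):

```python
def make_folds(items, k):
    """Split items into k folds; each fold holds out a 1/k slice as its
    dev (internal) set and keeps the remaining (k-1)/k for training."""
    folds = []
    for i in range(k):
        dev = items[i::k]  # every k-th item, offset by fold index
        train = [x for j, x in enumerate(items) if j % k != i]
        folds.append((train, dev))
    return folds
```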
Experimental Setup: We use the PyTorch framework for the pre-trained BERT model (BERT-base, Cased; github.com/ThilinaRajapakse/pytorch-transformers-classification), fine-tuned for the SLC task. In the multi-granularity loss, we set α = 0.1 for sentence classification based on dev (internal, fold1) scores. For CNN, we follow Kim (2014) with filter sizes of [2, 3, 4, 5, 6], 128 filters and a batch size of 16. We compute binary-F1 and macro-F1 (Tsai et al., 2006) in SLC and FLC, respectively, on dev (internal). See Table 5 for the hyper-parameter settings of the LSTM-CRF used in the FLC task.


Results: Sentence-Level Propaganda

Table 3 shows the scores on dev (internal and external) for the SLC task. Observe that the pre-trained embeddings (FastText or BERT) outperform the TF-IDF vector representation. In row r2, we apply a logistic regression classifier with BERTSentEmb, which leads to improved scores over FastTextSentEmb. Subsequently, we augment the sentence vector with additional features, which improves F1 on dev (external), however not on dev (internal). Next, we initialize the CNN with FastTextWordEmb or BERTWordEmb and augment the last hidden layer (before classification) with BERTSentEmb and feature vectors, leading to gains in F1 for both dev sets. Further, we fine-tune BERT and apply different thresholds in relaxing the decision boundary, where τ ≥ 0.35 is found optimal.
We choose the three different models in the ensemble: Logistic Regression, CNN and BERT on fold1, and subsequently build an ensemble+ of r3, r6 and r12 from each of folds 1-5 (i.e., 15 models) to obtain predictions for dev (external). We investigate different ensemble schemes (r17-r19), where we observe that relax-voting improves recall and therefore the F1 (i.e., 0.673). As a post-processing step, we check for the repetition propaganda technique by computing the cosine similarity between the current sentence and its preceding w = 10 sentence vectors (i.e., BERTSentEmb) in the document. If the cosine similarity is greater than λ ∈ {.99, .95}, then the current sentence is labeled as propaganda due to repetition. Comparing r19 and r21, we observe a gain in recall, however an overall decrease in F1 when applying the post-processing. Finally, we use the configuration of r19 on the test set. The ensemble+ of (r4, r7, r12) was analyzed after the test submission. Table 2 (SLC) shows that our submission is ranked at the 4th position.

Results: Fragment-Level Propaganda

Table 4 shows the scores on dev (internal and external) for the FLC task. Observe that the features (i.e., polarity, POS and NER in row II), when introduced in LSTM-CRF, improve F1. We run the multi-grained LSTM-CRF without BERTSentEmb (i.e., row III) and with it (i.e., row IV), where the latter improves scores on dev (internal), however not on dev (external). Finally, we perform multi-tasking with another auxiliary task of PFD. Given the scores on dev (internal and external) using different configurations (rows I-V), it is difficult to infer the optimal configuration. Thus, we choose the two best configurations (II and IV) on the dev (internal) set and build an ensemble+ of their predictions (discussed in section 2.3), leading to a boost in recall and thus an improved F1 on dev (external).
Finally, we use the ensemble+ of (II and IV) from each of the folds 1-3, i.e., |M| = 6 models, to obtain predictions on the test set. Table 2 (FLC) shows that our submission is ranked at the 3rd position.
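The repetition check used in the SLC post-processing can be sketched as follows (a minimal sketch over pre-computed sentence vectors; function and variable names are ours):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_repetition(sent_vec, prev_vecs, lam=0.95, w=10):
    """Label the current sentence as propaganda (repetition) if it is
    near-identical to any of the preceding w sentence vectors."""
    return any(cosine(sent_vec, v) > lam for v in prev_vecs[-w:])
```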

Conclusion and Future Work
Our system (Team: MIC-CIS) explores different neural architectures (CNN, BERT and LSTM-CRF) with linguistic, layout and topical features to address the tasks of fine-grained propaganda detection. We have demonstrated gains in performance due to the features, ensemble schemes, multi-tasking and multi-granularity architectures. Compared to the other participating systems, our submissions are ranked 3rd and 4th in the FLC and SLC tasks, respectively. In future work, we would like to enrich BERT models with linguistic, layout and topical features during fine-tuning. Further, we would be interested in understanding and analyzing what the neural networks learn, i.e., extracting the salient fragments (or key-phrases) in a sentence that make it propagandistic, in order to promote explainable AI.