Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection

We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and also for the prediction of the specific propaganda technique used in each such fragment (18-way classification task). SLC is a sentence-level binary classification task asking to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at http://propaganda.qcri.org/nlp4if-shared-task/.


Introduction
Propaganda aims at influencing people's mindset with the purpose of advancing a specific agenda. In the Internet era, thanks to the mechanism of sharing in social networks, propaganda campaigns have the potential of reaching very large audiences (Glowacki et al., 2018; Muller, 2018; Tardáguila et al., 2018).
Propagandist news articles use specific techniques to convey their message, such as whataboutism, red herring, and name calling, among many others (cf. Section 3). Whereas proving intent is not easy, we can analyse the language of a claim/article and look for the use of specific propaganda techniques. Working at this fine-grained level can yield more reliable systems, and it also makes it possible to explain to the user why an article was judged as propagandist by an automatic system.
With this in mind, we organised the shared task on fine-grained propaganda detection at the NLP4IF@EMNLP-IJCNLP 2019 workshop. The task is based on a corpus of news articles annotated with an inventory of 18 propagandist techniques at the fragment level. We hope that the corpus would raise interest outside of the community of researchers studying propaganda. For example, the techniques related to fallacies and the ones relying on emotions might provide a novel setting for researchers interested in Argumentation and Sentiment Analysis.

Related Work
Propaganda has been tackled mostly at the article level. Rashkin et al. (2017) created a corpus of news articles labelled as propaganda, trusted, hoax, or satire.  experimented with a binarized version of that corpus: propaganda vs. the other three categories. Barrón-Cedeno et al. (2019) annotated a large binary corpus of propagandist vs. non-propagandist articles and proposed a feature-based system for discriminating between them. In all these cases, the labels were obtained using distant supervision, assuming that all articles from a given news outlet share the label of that outlet, which inevitably introduces noise (Horne et al., 2018).
A related field is that of computational argumentation which, among others, deals with some logical fallacies related to propaganda. Habernal et al. (2018b) presented a corpus of Web forum discussions with instances of ad hominem fallacy. Habernal et al. (2017, 2018a) introduced Argotario, a game to educate people to recognize and create fallacies, a by-product of which is a corpus with 1.3k arguments annotated with five fallacies such as ad hominem, red herring and irrelevant authority, which directly relate to propaganda.
Unlike Habernal et al. (2017, 2018a), our corpus uses 18 techniques annotated on the same set of news articles. Moreover, our annotations aim at identifying the minimal fragments related to a technique instead of flagging entire arguments.
The most relevant related work is our own, which is published in parallel to this paper at EMNLP-IJCNLP 2019  and describes a corpus that is a subset of the one used for this shared task.

Propaganda Techniques
Propaganda uses psychological and rhetorical techniques to achieve its objective. Such techniques include the use of logical fallacies and appeal to emotions. For the shared task, we use 18 techniques that can be found in news articles and can be judged intrinsically, without the need to retrieve supporting information from external resources. We refer the reader to  for more details on the propaganda techniques; below we report the list of techniques:
1. Loaded language. Using words/phrases with strong emotional implications (positive or negative) to influence an audience (Weston, 2018, p. 6).
2. Name calling or labeling. Labeling the object of the propaganda as something the target audience fears, hates, finds undesirable or otherwise loves or praises (Miller, 1939).
3. Repetition. Repeating the same message over and over again, so that the audience will eventually accept it (Torok, 2015; Miller, 1939).
4. Exaggeration or minimization. Either representing something in an excessive manner: making things larger, better, worse, or making something seem less important or smaller than it actually is (Jowett and O'Donnell, 2012, p. 303), e.g., saying that an insult was just a joke.
5. Doubt. Questioning the credibility of someone or something.
6. Appeal to fear/prejudice. Seeking to build support for an idea by instilling anxiety and/or panic in the population towards an alternative, possibly based on preconceived judgments.
7. Flag-waving. Playing on strong national feeling (or with respect to a group, e.g., race, gender, political preference) to justify or promote an action or idea (Hobbs and Mcgee, 2008).
8. Causal oversimplification. Assuming one cause when there are multiple causes behind an issue. We include scapegoating as well: the transfer of the blame to one person or group of people without investigating the complexities of an issue.
9. Slogans. A brief and striking phrase that may include labeling and stereotyping. Slogans tend to act as emotional appeals (Dan, 2015).
10. Appeal to authority. Stating that a claim is true simply because a valid authority/expert on the issue supports it, without any other supporting evidence (Goodwin, 2011). We include the special case where the reference is not an authority/expert, although it is referred to as testimonial in the literature (Jowett and O'Donnell, 2012, p. 237).
11. Black-and-white fallacy, dictatorship. Presenting two alternative options as the only possibilities, when in fact more possibilities exist (Torok, 2015). As an extreme case, telling the audience exactly what actions to take, eliminating any other possible choice (dictatorship).
12. Thought-terminating cliché. Words or phrases that discourage critical thought and meaningful discussion about a given topic. They are typically short and generic sentences that offer seemingly simple answers to complex questions or that distract attention away from other lines of thought (Hunter, 2015, p. 78).
14. Reductio ad Hitlerum. Persuading an audience to disapprove of an action or idea by suggesting that the idea is popular with groups held in contempt by the target audience. It can refer to any person or concept with a negative connotation (Teninbaum, 2009).
15. Red herring. Introducing irrelevant material to the issue being discussed, so that everyone's attention is diverted away from the points made (Weston, 2018, p. 78). Those subjected to a red herring argument are led away from the issue that had been the focus of the discussion and urged to follow an observation or claim that may be associated with the original claim, but is not highly relevant to the issue in dispute (Teninbaum, 2009).
16. Bandwagon. Attempting to persuade the target audience to join in and take the course of action because "everyone else is taking the same action" (Hobbs and Mcgee, 2008).
17. Obfuscation, intentional vagueness, confusion. Using deliberately unclear words, to let the audience have its own interpretation (Suprabandari, 2007; Weston, 2018, p. 8). For instance, when an unclear phrase with multiple possible meanings is used within the argument and, therefore, it does not really support the conclusion.
18. Straw man. When an opponent's proposition is substituted with a similar one which is then refuted in place of the original (Walton, 1996).

Tasks
The shared task features two subtasks:

Fragment-Level Classification task (FLC).
Given a news article, detect all spans of the text in which a propaganda technique is used. In addition, for each span the propaganda technique applied must be identified.

Sentence-Level Classification task (SLC).
A sentence is considered propagandist if it contains at least one propagandist fragment. We then define a binary classification task in which, given a sentence, the correct label, either propaganda or non-propaganda, is to be predicted.
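The relation between the two subtasks can be made concrete with a small sketch: a sentence receives the label propaganda as soon as at least one annotated fragment overlaps it. The character-offset representation and the function name `sentence_labels` are assumptions made for illustration only; they are not taken from the task's data format.

```python
from typing import List, Tuple

# (start, end) character offsets within the article, end exclusive (assumed format)
Span = Tuple[int, int]


def sentence_labels(sentences: List[Span], fragments: List[Span]) -> List[str]:
    """Label a sentence 'propaganda' if at least one fragment overlaps it."""
    labels = []
    for s_start, s_end in sentences:
        # Two half-open intervals overlap when each starts before the other ends.
        hit = any(f_start < s_end and s_start < f_end
                  for f_start, f_end in fragments)
        labels.append("propaganda" if hit else "non-propaganda")
    return labels


# Hypothetical article with three sentences and a single annotated fragment.
toy_sentences = [(0, 20), (21, 50), (51, 80)]
toy_fragments = [(25, 32)]
toy_labels = sentence_labels(toy_sentences, toy_fragments)
print(toy_labels)
```

In this toy example only the second sentence contains a fragment, so it is the only one labeled as propaganda.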

Data
The input for both tasks consists of news articles in free-text format, collected from 36 propagandist and 12 non-propagandist news outlets and then annotated by professional annotators. More details about the data collection and the annotation, as well as statistics about the corpus can be found in , where an earlier version of the corpus is described, which includes 450 news articles. We further annotated 47 additional articles for the purpose of the shared task using the same protocol and the same annotators.
The training, the development, and the test partitions of the corpus used for the shared task consist of 350, 61, and 86 articles and of 16,965, 2,235, and 3,526 sentences, respectively. Figure 1 shows an annotated example, which contains several propaganda techniques. For example, the fragment babies on line 1 is an instance of both Name Calling and Labeling. Note that the fragment not looking as though Trump killed his grandma on line 4 is an instance of Exaggeration or Minimisation and it overlaps with the fragment killed his grandma, which is an instance of Loaded Language. Table 1 reports the total number of instances per technique and the percentage with respect to the total number of annotations, for the training and for the development sets.

Setup
The shared task had two phases: in the development phase, the participants were provided with labeled training and development datasets; in the testing phase, the test input was additionally provided.
Phase 1. The participants tried to achieve the best performance on the development set. A live leaderboard kept track of the submissions.
Phase 2. The test set was released and the participants had a few days to make final predictions.
In phase 2, no immediate feedback on the submissions was provided. The winner was determined based on the performance on the test set.

Evaluation
FLC task. FLC is a composition of two subtasks: the identification of the propagandist text fragments and the identification of the techniques used (18-way classification task). While the F1 measure is appropriate for a multi-class classification task, we modified it to account for partial matching between the spans; see Da San Martino et al. (2019) for more details. We further computed an F1 value for each propaganda technique (not shown below for the sake of saving space, but available on the leaderboard).
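To illustrate what partial matching between spans can look like, the following sketch gives a predicted fragment credit proportional to its character overlap with a gold fragment of the same technique. The overlap is normalized by the predicted length for precision and by the gold length for recall. This is only one plausible formulation, written under our own assumptions; the official definition is the one in Da San Martino et al. (2019), and the names `partial_prf` and `partial_credit` are hypothetical.

```python
from typing import List, Tuple

# (start, end, technique) with character offsets, end exclusive (assumed format)
Fragment = Tuple[int, int, str]


def overlap(a: Fragment, b: Fragment) -> int:
    """Number of characters shared by two fragments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def partial_credit(pred: Fragment, gold: Fragment, norm_len: int) -> float:
    """Overlap normalized by norm_len; zero if the techniques differ."""
    if pred[2] != gold[2] or norm_len <= 0:
        return 0.0
    return overlap(pred, gold) / norm_len


def partial_prf(preds: List[Fragment], golds: List[Fragment]):
    """Precision, recall and F1 with partial span matching.

    Assumes that fragments of the same technique do not overlap each other
    within `preds` or within `golds`; otherwise credit could be double counted.
    """
    p_sum = sum(partial_credit(p, g, p[1] - p[0]) for p in preds for g in golds)
    r_sum = sum(partial_credit(p, g, g[1] - g[0]) for p in preds for g in golds)
    precision = p_sum / len(preds) if preds else 0.0
    recall = r_sum / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical example: one gold fragment, two predictions.
toy_gold = [(10, 20, "Loaded language")]
toy_pred = [(15, 25, "Loaded language"), (40, 50, "Doubt")]
p, r, f = partial_prf(toy_pred, toy_gold)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here the first prediction covers half of the gold fragment and the second one matches nothing, so the partial credit is split accordingly between precision and recall instead of being counted as a complete miss.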
SLC task. SLC is a binary classification task with imbalanced data. Therefore, the official evaluation measure for the task is the standard F1 measure. We further report Precision and Recall.
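As a reminder of how the standard measures are computed for the positive (propaganda) class, here is a minimal sketch. The label strings and the function name `precision_recall_f1` are illustrative assumptions, not part of the official scorer.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[str], pred: List[str],
                        positive: str = "propaganda") -> Tuple[float, float, float]:
    """Precision, recall and F1 with respect to the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


P, N = "propaganda", "non-propaganda"
toy_gold_labels = [P, N, N, P, N]

# A system with one true positive, one false positive and one false negative.
scores = precision_recall_f1(toy_gold_labels, [P, P, N, N, N])
print(scores)

# Always predicting the majority class is right on 3 of 5 sentences,
# yet it scores zero on F1 for the propaganda class.
majority_scores = precision_recall_f1(toy_gold_labels, [N, N, N, N, N])
print(majority_scores)
```

The second call shows why F1 on the positive class is a more informative measure than accuracy when the data is imbalanced.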

Baselines
The baseline system for the SLC task is a very simple logistic regression classifier with default parameters, where we represent the input instances with a single feature: the length of the sentence. The performance of this baseline on the SLC task is shown in Tables 4 and 5.
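A baseline of this kind can be sketched in a few lines. The version below is a hand-rolled logistic regression trained with plain gradient descent, so it is not the implementation used for the shared task, and it does not reproduce any library's default parameters. Measuring length in characters, the scaling constant, the learning rate, and the toy data are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_length_baseline(sentences: List[str], labels: List[int],
                          lr: float = 0.5, epochs: int = 2000,
                          scale: float = 100.0) -> Tuple[float, float]:
    """Fit p(propaganda | sentence) = sigmoid(w * len(sentence) / scale + b)."""
    w, b = 0.0, 0.0
    n = len(sentences)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for sentence, y in zip(sentences, labels):
            x = len(sentence) / scale          # the single feature
            err = sigmoid(w * x + b) - y       # gradient of the log loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(sentence: str, w: float, b: float, scale: float = 100.0) -> int:
    """Return 1 (propaganda) or 0 (non-propaganda)."""
    return int(sigmoid(w * len(sentence) / scale + b) >= 0.5)


# Hypothetical toy data in which longer sentences happen to be propagandist.
toy_train = ["x" * n for n in (20, 30, 40, 150, 180, 200)]
toy_y = [0, 0, 0, 1, 1, 1]
w, b = train_length_baseline(toy_train, toy_y)
print(predict("x" * 200, w, b), predict("x" * 20, w, b))
```

Such a classifier knows nothing about the content of a sentence, which is what makes it a useful lower bound for the systems described below.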
The baseline for the FLC task generates spans and selects one of the 18 techniques randomly. The inefficacy of such a simple random baseline is illustrated in Tables 6 and 7.
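The random baseline can be pictured as follows. How the spans were actually generated is not specified above, so the span-sampling procedure, the parameters `n_spans` and `max_len`, and the label strings in `TECHNIQUES` are assumptions for illustration (the strings follow the technique names used in this paper, not the official label identifiers).

```python
import random
from typing import List, Tuple

# The 18 techniques; label strings are illustrative, not the official identifiers.
TECHNIQUES = [
    "Loaded language", "Name calling or labeling", "Repetition",
    "Exaggeration or minimization", "Doubt", "Appeal to fear/prejudice",
    "Flag-waving", "Causal oversimplification", "Slogans",
    "Appeal to authority", "Black-and-white fallacy",
    "Thought-terminating cliché", "Whataboutism", "Reductio ad Hitlerum",
    "Red herring", "Bandwagon",
    "Obfuscation, intentional vagueness, confusion", "Straw man",
]


def random_baseline(text: str, n_spans: int = 3, max_len: int = 30,
                    seed: int = 0) -> List[Tuple[int, int, str]]:
    """Generate random (start, end, technique) spans over `text`."""
    rng = random.Random(seed)
    spans = []
    if not text:
        return spans
    for _ in range(n_spans):
        start = rng.randrange(len(text))
        end = min(len(text), start + rng.randint(1, max_len))
        spans.append((start, end, rng.choice(TECHNIQUES)))
    return spans


toy_article = "A hypothetical article used only to exercise the random baseline."
toy_spans = random_baseline(toy_article)
for start, end, technique in toy_spans:
    print(start, end, technique, repr(toy_article[start:end]))
```

With 18 labels and arbitrary boundaries, the chance of matching both the span and the technique of a gold fragment is very low, which is consistent with the poor performance of this baseline.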

Participants and Approaches
A total of 90 teams registered for the shared task, and 39 of them submitted predictions for a total of 3,065 submissions. For the FLC task, 21 teams made a total of 527 submissions, and for the SLC task, 35 teams made a total of 2,538 submissions.
Below, we give an overview of the approaches as described in the participants' papers. Tables 2 and 3 offer a high-level summary.

Teams Participating in the Fragment-Level Classification Only
Team newspeak (Yoosuf and Yang, 2019) achieved the best results on the test set for the FLC task using 20-way word-level classification based on BERT (Devlin et al., 2019): a word could belong to one of the 18 propaganda techniques, to none of them, or to an auxiliary (token-derived) class. The team fed one sentence at a time in order to reduce the workload. In addition to experimenting with an out-of-the-box BERT, they also tried unsupervised fine-tuning both on the 1M news dataset and on Wikipedia. Their best model was based on the uncased base model of BERT, with 12 Transformer layers (Vaswani et al., 2017), and 110 million parameters. Moreover, oversampling of the least represented classes proved to be crucial for the final performance. Finally, careful analysis has shown that the model pays special attention to adjectives and adverbs.
Team Stalin (Ek and Ghanimifard, 2019) focused on data augmentation to address the relatively small size of the data for fine-tuning contextual embedding representations based on ELMo (Peters et al., 2018), BERT, and Grover (Zellers et al., 2019). The balancing of the embedding space was carried out by means of synthetic minority class over-sampling. Then, the learned representations were fed into an LSTM.
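Casting fragment detection as word-level classification, as team newspeak did, requires projecting the annotated fragments onto words. The sketch below shows one simple way to do so. It uses whitespace tokenization rather than BERT's own tokenizer, it omits the auxiliary class, and it resolves overlapping fragments by taking the first match; all three are simplifying assumptions of ours, and the function name `word_labels` is hypothetical.

```python
import re
from typing import List, Tuple

# (start, end, technique) with character offsets, end exclusive (assumed format)
Fragment = Tuple[int, int, str]


def word_labels(text: str, fragments: List[Fragment]) -> List[Tuple[str, str]]:
    """Assign each whitespace-delimited word a technique label, or 'O' for none."""
    labeled = []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.start(), match.end()
        label = "O"
        for f_start, f_end, technique in fragments:
            # A word takes the label of the first fragment that overlaps it.
            if f_start < w_end and w_start < f_end:
                label = technique
                break
        labeled.append((match.group(), label))
    return labeled


# Hypothetical sentence with one annotated fragment.
toy_text = "Stop the radical invaders now"
frag_start = toy_text.index("radical")
frag_end = toy_text.index("invaders") + len("invaders")
toy_fragments = [(frag_start, frag_end, "Name calling or labeling")]
toy_word_labels = word_labels(toy_text, toy_fragments)
for word, label in toy_word_labels:
    print(word, label)
```

At prediction time the inverse step is needed: consecutive words with the same predicted technique are merged back into fragments.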

Teams Participating in the Sentence-Level Classification Only
Team CAUnLP (Hou and Chen, 2019) used two context-aware representations based on BERT. In the first representation, the target sentence is followed by the title of the article. In the second representation, the previous sentence is also added. They performed subsampling in order to deal with class imbalance, and experimented with BERT-Base and BERT-Large.
Team LIACC (Ferreira Cruz et al., 2019) used hand-crafted features and pre-trained ELMo embeddings. They also observed a boost in performance when balancing the dataset by dropping some negative examples.
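Balancing by dropping negative examples, as both teams above did in some form, can be sketched as random undersampling of the non-propaganda class. The sampling ratio, the seed, and the function name `undersample_negatives` are assumptions for illustration; the participants' papers describe the settings they actually used.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (sentence, label), with 1 = propaganda (assumed encoding)


def undersample_negatives(examples: List[Example], keep_ratio: float = 1.0,
                          seed: int = 0) -> List[Example]:
    """Keep all positives and at most keep_ratio * (#positives) negatives."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    n_keep = min(len(negatives), int(round(keep_ratio * len(positives))))
    balanced = positives + rng.sample(negatives, n_keep)
    rng.shuffle(balanced)  # avoid having all positives first
    return balanced


# Hypothetical imbalanced training set: 3 positive and 9 negative sentences.
toy_examples = [(f"propagandist sentence {i}", 1) for i in range(3)] + \
               [(f"neutral sentence {i}", 0) for i in range(9)]
toy_balanced = undersample_negatives(toy_examples)
print(len(toy_balanced), sum(label for _, label in toy_balanced))
```

The price of this kind of balancing is that part of the training data is discarded, which is why oversampling and cost-sensitive training are common alternatives.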
Team JUSTDeep (Al-Omari et al., 2019) used a combination of models and features, including word embeddings based on GloVe (Pennington et al., 2014) concatenated with vectors representing affection and lexical features. These were combined in an ensemble of supervised models: bi-LSTM, XGBoost, and variations of BERT.
Team YMJA (Hua, 2019) also based their approach on fine-tuned BERT. Inspired by Kaggle competitions on sentiment analysis, they created an ensemble of models via cross-validation.
Team jinfen (Li et al., 2019) used a logistic regression model fed with a variety of representations, including TF-IDF and BERT vectors, as well as vocabularies and readability measures.
Team Tha3aroon (Fadel and Al-Ayyoub, 2019) implemented an ensemble of three classifiers: two based on BERT and one based on a universal sentence encoder (Cer et al., 2018).
Team NSIT (Aggarwal and Sadana, 2019) explored three of the most popular transfer learning models: various versions of ELMo, BERT, and RoBERTa (Liu et al., 2019).
Team Mindcoders (Vlad et al., 2019) combined BERT, Bi-LSTM and Capsule networks (Sabour et al., 2017) into a single deep neural network and pre-trained the resulting network on corpora used for related tasks, e.g., emotion classification.
Finally, team ltuorp (Mapes et al., 2019) used an attention transformer using BERT trained on Wikipedia and BookCorpus.

Teams Participating in Both Tasks
Team MIC-CIS (Gupta et al., 2019) participated in both tasks. For the sentence-level classification, they used a voting ensemble including logistic regression, convolutional neural networks, and BERT, in all cases using FastText embeddings (Bojanowski et al., 2017) and pre-trained BERT models. Besides these representations, multiple features of readability, sentiment and emotions were considered. For the fragment-level task, they used a multi-task neural sequence tagger, based on LSTM-CRF (Huang et al., 2015), in conjunction with linguistic features. Finally, they applied sentence- and fragment-level models jointly.
Team CUNLP (Alhindi et al., 2019) considered two approaches for the sentence-level task. The first approach was based on fine-tuning BERT. The second approach complemented the fine-tuned BERT approach by feeding its decision into a logistic regressor, together with features from the Linguistic Inquiry and Word Count (LIWC) lexicon and punctuation-derived features. Similarly to Gupta et al. (2019), for the fragment-level problem they used a Bi-LSTM-CRF architecture, combining both character- and word-level embeddings.
Team ProperGander (Madabushi et al., 2019) also used BERT, but they paid special attention to the imbalance of the data, as well as to the differences between training and testing. They showed that augmenting the training data by oversampling yielded improvements when testing on data that is temporally far from the training (by increasing recall). In order to deal with the imbalance, they performed cost-sensitive classification, i.e., the errors on the smaller positive class were more costly. For the fragment-level classification, inspired by named entity recognition, they used a model based on BERT using a Conditional Random Field stacked on top of an LSTM.
Table 5: Results for the SLC task on the development set at the end of phase 1 (see Section 6).

Evaluation Results
The results on the test set for the SLC task are shown in Table 4, while Table 5 presents the results on the development set at the end of phase 1 (cf. Section 6). The general decrease of the F1 values between the development and the test set could indicate that systems tend to overfit on the development set. Indeed, the winning team ltuorp chose the parameters of their system both on the development set and on a subset of the training set in order to improve the robustness of their system.
Tables 6 and 7 report the results on the test and on the development sets for the FLC task. For this task, the results tend to be more stable across the two sets. Indeed, team newspeak managed to almost keep the same difference in performance with respect to team Antiganda. Note that team MIC-CIS managed to reach the third position despite never having submitted a run on the development set.

Conclusion and Further Work
We have described the NLP4IF@EMNLP-IJCNLP 2019 shared task on fine-grained propaganda identification. We received 25 and 12 submissions on the test set for the sentence-level classification and the fragment-level classification tasks, respectively. Overall, the sentence-level task was easier and most submitted systems managed to outperform the baseline. The fragment-level task proved to be much more challenging, with lower absolute scores, but most teams still managed to outperform the baseline.
We plan to make the schema and the dataset publicly available to be used beyond NLP4IF. We hope that the corpus would raise interest outside of the community of researchers studying propaganda: the techniques related to fallacies and the ones relying on emotions might provide a novel setting for researchers interested in Argumentation and Sentiment Analysis.
As a kind of advertisement, Task 11 at SemEval 2020 is a follow-up of this shared task. It features two complementary tasks:
Task 1. Given a free-text article, identify the propagandist text spans.
Task 2. Given a text span already flagged as propagandist and its context, identify the specific propaganda technique it contains.
This setting would allow participants to focus their efforts on binary sequence labeling for Task 1 and on multi-class classification for Task 2.