Leveraging Declarative Knowledge in Text and First-Order Logic for Fine-Grained Propaganda Detection

We study the detection of propagandistic text fragments in news articles. Instead of merely learning from input-output datapoints in training data, we introduce an approach to inject declarative knowledge of fine-grained propaganda techniques. We leverage declarative knowledge expressed in both natural language and first-order logic. The former refers to the literal definition of each propaganda technique, which is used to obtain class representations for regularizing the model parameters. The latter refers to logical consistency between coarse- and fine-grained predictions, which is used to regularize the training process with propositional Boolean expressions. We conduct experiments on the Propaganda Techniques Corpus, a large manually annotated dataset for fine-grained propaganda detection. Experiments show that our method achieves superior performance, demonstrating that injecting declarative knowledge expressed in both natural language and first-order logic can help the model make more accurate predictions.


Introduction
Propaganda is communication deliberately designed to influence the opinions of readers for a specific purpose. Unlike fake news, which is entirely made up and refers to fabricated news with no verifiable facts, propaganda conveys information with strong emotion or bias, although it may be built upon an element of truth. This characteristic makes propaganda more effective and harder to notice, especially with the rise of social media platforms. There are many propaganda techniques; examples of propagandistic texts and definitions of the corresponding techniques are shown in Figure 1.
We study the problem of fine-grained propaganda detection in this work, which is possible thanks to the recent release of the Propaganda Techniques Corpus (Da San Martino et al., 2019). Unlike earlier works (Rashkin et al., 2017; Wang, 2017) that mainly study propaganda detection at a coarse-grained level, namely predicting whether a document is propagandistic or not, this problem requires identifying the tokens that express particular propaganda techniques in news articles. Da San Martino et al. (2019) propose strong multi-task learning baselines, which are trained on binary propaganda detection at the sentence level and fine-grained propaganda detection over 18 techniques at the token level. Such data-driven methods have the merits of convenient end-to-end learning and strong generalization; however, they cannot guarantee consistency between sentence-level and token-level predictions. In addition, it is appealing to integrate human knowledge into data-driven approaches.
In this paper, we introduce an approach named LatexPRO that leverages logical and textual knowledge for propaganda detection. Following Da San Martino et al. (2019), we develop a BERT-based multi-task learning approach as the base model, which makes predictions for 18 propaganda techniques at both the sentence level and the token level. Based on that, we inject two types of knowledge as additional objectives to regularize the learning process. Specifically, we use logical knowledge by expressing the consistency between sentence-level and token-level predictions as propositional Boolean expressions. Moreover, we use the textual definitions of propaganda techniques by first representing each of them as a contextual vector and then minimizing its distance to the corresponding model parameters in semantic space.
arXiv:2004.14201v1 [cs.CL] 29 Apr 2020
We conduct extensive experiments on the Propaganda Techniques Corpus (PTC) (Da San Martino et al., 2019), a large manually annotated dataset for fine-grained propaganda detection. Experiments show that our knowledge-augmented method significantly improves over a strong multi-task learning approach. In particular, our model greatly improves precision, demonstrating that injecting declarative knowledge expressed in both natural language and first-order logic can help the model make more accurate predictions. More importantly, further analysis indicates that augmenting the learning process with declarative knowledge reduces the percentage of inconsistent model predictions.
The contributions of this paper are summarized as follows:
• We introduce an approach that leverages declarative knowledge expressed in both natural language and first-order logic for detecting fine-grained propaganda techniques.
• We utilize both types of knowledge as regularizers in the learning process, which enables the model to make more consistent sentence-level and token-level predictions.
• Extensive experiments on the PTC dataset (Da San Martino et al., 2019) demonstrate that our method achieves superior performance with high F1 and precision.
[…] by Da San Martino et al. (2019) to calculate precision, recall and F1, which gives partial credit to imperfect matches at the character level. The FLC task is evaluated with two measures: (1) Full task is the overall task of both detecting propagandistic fragments and identifying their techniques, while (2) Spans is a special case of the Full task that considers only the spans of the fragments and ignores their propaganda techniques.
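The idea of partial credit at the character level can be illustrated with a small sketch. The overlap-based formulas below are an assumption made for illustration and are not the official PTC scorer; the names `Span`, `overlap` and `partial_prf` are hypothetical, and gold spans are assumed to be non-empty and non-overlapping. Setting `match_technique=False` mimics the Spans setting, in which technique labels are ignored.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int       # inclusive character offset
    end: int         # exclusive character offset
    technique: str   # propaganda technique label


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def partial_prf(pred: List[Span], gold: List[Span],
                match_technique: bool = True) -> Tuple[float, float, float]:
    """Character-overlap precision, recall and F1 with partial credit."""
    def credit(s: Span, t: Span, norm: int) -> float:
        # No credit when the techniques differ (Full task setting).
        if match_technique and s.technique != t.technique:
            return 0.0
        return overlap(s, t) / norm

    # Precision normalizes by predicted span length, recall by gold span length.
    p = sum(credit(s, t, s.end - s.start) for s in pred for t in gold)
    r = sum(credit(s, t, t.end - t.start) for s in pred for t in gold)
    precision = p / len(pred) if pred else 0.0
    recall = r / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [Span(0, 10, "Loaded Language")]
pred = [Span(5, 15, "Loaded Language")]
print(partial_prf(pred, gold))  # (0.5, 0.5, 0.5): half of each span overlaps
```

A prediction that overlaps a gold fragment only partially therefore still contributes to both precision and recall, in proportion to the shared characters.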

Method
In this section, we present our approach LatexPRO, which injects declarative knowledge of fine-grained propaganda techniques into neural networks. A high-level illustration is shown in Figure 2. We first present our base model (§3.1), a multi-task learning neural architecture that slightly extends the model of Da San Martino et al. (2019). Afterwards, we introduce ways to regularize the learning process with textual knowledge from the literal definitions of propaganda techniques (§3.3) and logical knowledge about the consistency between sentence-level and token-level predictions (§3.2). Finally, we describe the training and inference procedures (§3.4).

Base Model
To better exploit sentence-level information and thereby help token-level prediction, we develop a fine-grained multi-task method as our base model, which makes predictions for 18 propaganda techniques at both the sentence level and the token level. Inspired by the success of pretrained language models on various downstream natural language processing tasks, we adopt BERT (Devlin et al., 2019) as the backbone model. To fine-tune the model, the input sequence for each sentence is formatted as "[CLS] sentence tokens [SEP]". Specifically, we add 19 binary classifiers and one 19-way classifier on top of BERT, where all classifiers are implemented as linear layers. At the sentence level, we perform multiple binary classifications, which further supports leveraging declarative knowledge.

The last-layer representation of the special token [CLS], which is regarded as a summary of the semantic content of the input, is used to perform multiple binary classifications: one binary classification of propaganda vs. non-propaganda and 18 binary classifications, one for each propaganda technique. We adopt a sigmoid activation for each binary classifier. At the token level, the last-layer representation of each token is fed into a linear layer to predict the propaganda technique over 19 categories (i.e., 18 categories of propaganda techniques plus one category for "none of them"). We adopt a softmax activation for the 19-way classifier. We apply two different losses in this multi-task learning process: the sentence-level loss $L_{sen}$ and the token-level loss $L_{tok}$. $L_{sen}$ is the binary cross-entropy loss of the multiple binary classifications. $L_{tok}$ is the focal loss (Lin et al., 2017) of the 19-way classification for each token, which addresses the class imbalance problem.
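The two losses can be sketched in plain Python for a single sentence and a single token. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, averaging over the binary classifiers is an assumption, and the focusing parameter of 2.0 is the common default from Lin et al. (2017), not a value stated in this paper.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_loss(logits: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy over the sentence-level binary classifiers."""
    eps = 1e-12
    losses = []
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        losses.append(-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


def token_focal_loss(logits: List[float], target: int,
                     focusing: float = 2.0) -> float:
    """Focal loss for one token of the multi-way classifier.

    `focusing` is the focal-loss exponent (assumed default 2.0); it is
    unrelated to the loss weight called gamma elsewhere in the paper.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** focusing) * math.log(p_t + 1e-12)
```

With `focusing=0.0` the focal loss reduces to ordinary cross-entropy; larger values down-weight tokens that are already classified confidently, which is why the loss helps with the dominant "none of them" class.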

Inject Logical Knowledge
There are implicit logical constraints between predictions. However, neural networks are less interpretable and need to be trained with a large amount of data to learn such implicit logic. We therefore improve performance further by using logical knowledge. To this end, we employ propositional Boolean expressions to explicitly regularize the model with a logic-driven objective, which improves the logical consistency between sentence-level and token-level predictions and makes our method more interpretable. For instance, in this work, if a propaganda class c is predicted by the multiple binary classifiers (indicating that the sentence contains this propaganda technique), then token-level predictions belonging to class c should also exist. We thus consider the propositional rule F = A ⇒ B, formulated as:
$$P(F) = P(A \Rightarrow B) = 1 - P(A) + P(A) P(B) = P(A) (P(B) - 1) + 1$$
where A and B are two variables. Specifically, substituting $f_c(x)$ and $g_c(x)$ into the above formula as $F = \forall c: f_c(x) \Rightarrow g_c(x)$, the logic rule can be written as:
$$P(F) = P(f_c(x)) (P(g_c(x)) - 1) + 1$$
where x denotes the input, $f_c(x)$ is the binary classifier for propaganda class c, and $g_c(x)$ is the probability that the fine-grained predictions for x contain class c. $g_c(x)$ is obtained by max-pooling over the predicted probabilities for class c across all tokens. Note that the probabilities of unpredicted classes are set to 0 to prevent any violation, i.e., to ensure that each class has a corresponding probability. Our objective is to maximize P(F), i.e., to minimize $L_{logic} = -\log P(F)$, which improves the logical consistency between coarse- and fine-grained predictions.
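The logic-driven objective can be sketched as follows. The sketch assumes the product relaxation of implication, P(A ⇒ B) = 1 − P(A) + P(A)P(B), and sums the loss over classes; both choices, as well as the function name `logic_loss`, are assumptions for illustration. The zeroing of unpredicted classes mentioned above is not modeled.

```python
import math
from typing import List


def logic_loss(sent_probs: List[float], token_probs: List[List[float]]) -> float:
    """Sum over classes of -log P(F), with P(F) = f_c * (g_c - 1) + 1.

    sent_probs[c]     : f_c(x), sentence-level probability of class c
    token_probs[t][c] : token-level probability of class c at token t
    """
    loss = 0.0
    for c, f_c in enumerate(sent_probs):
        g_c = max(tok[c] for tok in token_probs)   # max-pooling over tokens
        p_rule = f_c * (g_c - 1.0) + 1.0           # soft truth value of f_c => g_c
        loss += -math.log(max(p_rule, 1e-12))      # clamp to avoid log(0)
    return loss
```

The rule is violated only when the sentence-level classifier is confident about class c while no token receives a high probability for c; in every other case P(F) stays close to 1 and the penalty vanishes.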

Inject Textual Knowledge
Declarative knowledge in natural language, i.e., the literal definitions of propaganda techniques in this work, can be regarded as a form of textual knowledge that contains useful semantic information.
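As a rough sketch of how such definitions could act as a regularizer, the snippet below penalizes the distance between a vector encoding each definition and the weight vector of the matching sentence-level classifier. The squared Euclidean distance, the summation over classes and the name `definition_loss` are assumptions for illustration; in practice the definition vectors would come from a contextual encoder such as BERT rather than being written by hand.

```python
from typing import List


def definition_loss(def_vectors: List[List[float]],
                    class_weights: List[List[float]]) -> float:
    """Sum of squared Euclidean distances between each definition vector
    and the weight vector of the corresponding class."""
    total = 0.0
    for d, w in zip(def_vectors, class_weights):
        total += sum((di - wi) ** 2 for di, wi in zip(d, w))
    return total
```

Minimizing this term pulls each class's parameters toward the semantic representation of its definition.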

Training and Inference
Training. To train the whole model jointly, we introduce a weighted sum of losses $L_j$, which consists of the token-level loss $L_{tok}$, the fine-grained sentence-level loss $L_{sen}$, the textual definition loss $L_{def}$ and the logical loss $L_{logic}$:
$$L_j = \alpha L_{tok} + \beta L_{sen} + \lambda L_{def} + \gamma L_{logic}$$
where the hyper-parameters α, β, λ and γ control the trade-off among the losses. During training, our goal is to minimize $L_j$ using stochastic gradient descent.
Inference. For the SLC task, our method predicts "propaganda" only if the probability of the positive class in the propaganda binary classification is above 0.7. This threshold is chosen according to the numbers of propaganda and non-propaganda samples in the training dataset. For the FLC task, to better use the coarse-grained (sentence-level) information to guide fine-grained (token-level) prediction, we explicitly constrain the 19-way predictions at inference time. The prediction probabilities of the 18 fine-grained binary classifications that are above 0.9 are set to 1, and the others are set to 0. Then the softmax probabilities of the 19-way predictions of each token (except for the "none of them" class) are multiplied by the corresponding 18 binarized probabilities of the propaganda techniques. This means that our model only considers predicting the propaganda techniques that it is strongly confident the sentence contains.
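The inference-time constraint for the FLC task can be sketched as a simple gating step. The function names and the convention that index 0 holds the "none of them" class are assumptions for illustration; the 0.9 threshold is the one stated in the text.

```python
from typing import List


def gate_token_probs(token_probs: List[float], sent_probs: List[float],
                     threshold: float = 0.9) -> List[float]:
    """Mask multi-way token probabilities with binarized sentence-level decisions.

    token_probs : softmax output for one token; index 0 is assumed to be
                  the "none of them" class, followed by the techniques
    sent_probs  : sentence-level probability of each technique
    """
    gates = [1.0 if p > threshold else 0.0 for p in sent_probs]
    # The "none of them" class is left untouched; each technique is gated.
    return [token_probs[0]] + [q * g for q, g in zip(token_probs[1:], gates)]


def predict_token(token_probs: List[float], sent_probs: List[float]) -> int:
    """Index of the highest-scoring class after gating."""
    gated = gate_token_probs(token_probs, sent_probs)
    return max(range(len(gated)), key=gated.__getitem__)
```

For example, with token probabilities `[0.2, 0.5, 0.3]` and sentence-level probabilities `[0.4, 0.95]`, the first technique is switched off and the token is assigned to the second technique even though the first had the higher softmax score.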

Experimental Settings
In this paper, we conduct experiments on the Propaganda Techniques Corpus (PTC) (Da San Martino et al., 2019), a large manually annotated dataset for fine-grained propaganda detection, as detailed in Section 2. We adopt the F1 score as the final metric of model performance.
We select the best model on the dev dataset. We adopt BERT base cased (Devlin et al., 2019) as the pre-trained model. We implement our model using Huggingface (Wolf et al., 2019). We use AdamW as the optimizer. In our best model on the dev dataset, the hyper-parameters of the loss are set to α = 0.8, β = 0.2, λ = 0.001 and γ = 0.001. We set the max sequence length to 256, the batch size to 16, the learning rate to 3e-5 and the number of warmup steps to 500. We train our model for 20 epochs and adopt an early stopping strategy on the average validation F1 score of Spans and Full Task with a patience of 5. For all experiments, we set the random seed to 42 for reproducibility.
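For convenience, the reported settings can be collected in a small configuration object, together with the weighted loss and the early-stopping rule. This is a sketch: the class and function names are hypothetical, and the pairing of α, β, λ and γ with the token-level, sentence-level, definition and logic losses follows the order in which they are listed in the text, which is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the experimental settings.
    alpha: float = 0.8         # weight of the token-level loss (assumed pairing)
    beta: float = 0.2          # weight of the sentence-level loss (assumed pairing)
    lam: float = 0.001         # weight of the definition loss (assumed pairing)
    gamma: float = 0.001       # weight of the logic loss (assumed pairing)
    max_seq_length: int = 256
    batch_size: int = 16
    learning_rate: float = 3e-5
    warmup_steps: int = 500
    max_epochs: int = 20
    patience: int = 5
    seed: int = 42


def joint_loss(cfg: TrainConfig, l_tok: float, l_sen: float,
               l_def: float, l_logic: float) -> float:
    """Weighted sum of the four training losses."""
    return (cfg.alpha * l_tok + cfg.beta * l_sen
            + cfg.lam * l_def + cfg.gamma * l_logic)


def best_epoch(dev_scores: List[float], patience: int) -> int:
    """Epoch index selected by early stopping on a dev score (higher is better)."""
    best, best_idx, bad = float("-inf"), -1, 0
    for i, score in enumerate(dev_scores):
        if score > best:
            best, best_idx, bad = score, i, 0
        else:
            bad += 1
            if bad >= patience:   # stop after `patience` epochs without improvement
                break
    return best_idx
```

With a patience of 5, training stops once five consecutive epochs fail to improve the dev score, and the checkpoint from the best epoch seen so far is kept.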

Models for Comparison
We compare our proposed methods with several baselines for fine-grained propaganda detection. Moreover, three variants of our method are provided to reveal the impact of each component. The notations LatexPRO (T+L), LatexPRO (T), and LatexPRO (L) denote our model with both textual and logical knowledge, only textual knowledge, and only logical knowledge injected, respectively. Each of these models is described as follows.
BERT (Da San Martino et al., 2019) adds a linear layer on top of BERT, and is fine-tuned on the SLC and FLC tasks separately.
MGN (Da San Martino et al., 2019) is a multi-task learning model, which regards the SLC task as the main task and drives the FLC task on the basis of the SLC task.
LatexPRO is our base model, which does not leverage declarative knowledge.
LatexPRO (T) augments LatexPRO with declarative textual knowledge in natural language, i.e., the literal definitions of propaganda techniques.
LatexPRO (L) injects logical knowledge into LatexPRO by employing propositional Boolean expressions to explicitly regularize the model.
LatexPRO (T+L) is our full model in this paper.

Experiment Results and Analysis
Fragment-Level Propaganda Detection. The results for the FLC task are shown in Table 2. Our basic model LatexPRO achieves better results than the other baseline models, which demonstrates the effectiveness of our fine-grained multi-task learning structure. It is worth noting that our full model LatexPRO (T+L) significantly outperforms MGN, by 10.06% precision and 2.85% F1 on the Spans task and by 12.54% precision and 4.92% F1 on the Full task, which is considerable progress on this dataset. This demonstrates that leveraging declarative knowledge in text and first-order logic helps to predict the propaganda types more accurately. Moreover, our ablated models LatexPRO (T) and LatexPRO (L) both improve over LatexPRO, with LatexPRO (L) improving more than LatexPRO (T). This indicates that injecting each kind of knowledge is useful, and that the effects of the different kinds of knowledge are decoupled and can be superimposed. It should be noted that, compared with the baseline models, our models achieve superior performance thanks to high precision, at the cost of a slight loss in recall. This is mainly because our models tend to make predictions only for the propaganda types about which they are highly confident.
To further understand the performance of the models on the FLC task, we analyze each propaganda technique in more detail. Table 3 shows detailed performance on the Full task. Our models improve precision and F1 over the baseline model for almost all classes, and can also predict some low-frequency propaganda techniques, e.g., Whataboutism and Obfus.,Int. This further demonstrates that our method can alleviate the class imbalance problem and make more accurate predictions.
Sentence-Level Propaganda Detection. Table 4 shows the performance of different models on the SLC task. The results indicate that our model outperforms the other baseline models. Compared to MGN, LatexPRO (T+L) increases precision by 1.63%, recall by 9.16% and F1 score by 4.89%. This demonstrates the effectiveness of our model, and shows that our model can find more positive samples, which further benefits the token-level predictions for the FLC task.

Effectiveness of Improving Consistency
We further define the following metric $M_C$ to measure the consistency between the sentence-level predictions $Y_c$, a set of predicted propaganda technique classes, and the token-level predictions $Y_t$, a set of predicted propaganda techniques for the input tokens:
$$M_C(Y_c, Y_t) = \frac{1}{|Y_t|} \sum_{y_t \in Y_t} 1_{Y_c}(y_t)$$
where $|Y_t|$ denotes a normalizing factor and $1_A(x)$ represents the indicator function:
$$1_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$
[…] textual knowledge from propaganda definitions, and logical knowledge from implicit logical rules between predictions, which enables the model to make more consistent predictions.
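A direct reading of this metric is the fraction of token-level predicted techniques that also appear among the sentence-level predictions. The sketch below implements that reading; the function name and the value returned when there are no token-level predictions are assumptions, since the text does not cover that case.

```python
from typing import Iterable, Set


def consistency(sentence_classes: Set[str], token_classes: Iterable[str]) -> float:
    """M_C: share of token-level predicted techniques that are also
    predicted at the sentence level."""
    token_classes = list(token_classes)
    if not token_classes:
        # Assumption: with no token-level predictions there is nothing to contradict.
        return 1.0
    hits = sum(1 for y in token_classes if y in sentence_classes)
    return hits / len(token_classes)
```

For instance, if the sentence-level classifiers predict Loaded Language and Repetition while the token-level classifier predicts Loaded Language and Flag waving, one of the two token-level techniques is supported and the score is 0.5.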

Error Analysis
Although our model achieves the best performance, some types of propaganda techniques are still not identified, e.g., Appeal to Authority and Red Herring, as shown in Table 3. To explore why our model LatexPRO (T+L) cannot predict these propaganda techniques, we compute a confusion matrix for the Full task of the FLC task and visualize it as a heatmap, shown in Figure 4. We find that most of the off-diagonal elements are in class O, which represents "none of them". This demonstrates that most of the errors are cases wrongly classified as O. We think this is due to the imbalance between the propaganda and non-propaganda categories in the dataset. Similarly, Straw Men, Red Herring and Whataboutism are relatively low-frequency classes. How to deal with class imbalance still needs further exploration.
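The kind of analysis described here can be reproduced with a few lines of code. The sketch below builds a confusion matrix from gold and predicted token labels and reports what share of the errors fall into a given class; the function names and the toy labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter
from typing import List


def confusion(gold: List[str], pred: List[str]) -> Counter:
    """Counts of (gold label, predicted label) pairs."""
    return Counter(zip(gold, pred))


def share_of_errors_into(cm: Counter, label: str = "O") -> float:
    """Fraction of misclassified tokens that were predicted as `label`."""
    errors = sum(n for (g, p), n in cm.items() if g != p)
    into = sum(n for (g, p), n in cm.items() if g != p and p == label)
    return into / errors if errors else 0.0


# Toy example with hand-written labels.
gold = ["Red Herring", "Loaded Language", "O", "Repetition"]
pred = ["O", "Loaded Language", "O", "Loaded Language"]
cm = confusion(gold, pred)
print(share_of_errors_into(cm))  # 0.5: one of the two errors goes to class O
```

A high value for class O indicates that the model tends to miss propaganda tokens rather than confuse one technique with another.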

Related Work
Our work relates to fake news detection and the injection of first-order logic into neural networks. We describe related studies in these two directions. Fake news detection draws growing attention as the spread of misinformation on social media becomes easier and leads to stronger influence. Various types of fake news detection problems have been introduced. For example, there are 4-way classification of news documents (Rashkin et al., 2017) and 6-way classification of short statements (Wang, 2017). There are also sentence-level fact checking problems with various genres of evidence, including natural language sentences from Wikipedia (Thorne et al., 2018), semi-structured tables (Chen et al., 2019), and images (Zlatkova et al., 2019; Nakamura et al., 2019). Our work studies propaganda detection, a fine-grained problem that requires token-level prediction over 18 fine-grained propaganda techniques. The release of a large manually annotated dataset (Da San Martino et al., 2019) makes the development of large neural models possible, and also motivates our work, which improves a standard multi-task learning approach by adding declarative knowledge expressed in both natural language and first-order logic.
Neural networks have the merits of convenient end-to-end training and good generalization; however, they typically need a lot of training data and are not interpretable. On the other hand, logic-based expert systems are interpretable and require less or no training data. It is appealing to leverage the advantages of both worlds. In the NLP community, methods for injecting logic into neural networks can be generally divided into two groups. Methods in the first group regularize neural networks with logic-driven loss functions (Xu et al., 2017; Fischer et al., 2018; Li et al., 2019). For example, Rocktäschel et al. (2015) target the problem of knowledge base completion. After extracting and annotating propositional logical rules about relations in a knowledge graph, they ground these rules to facts from the knowledge graph and add a differentiable training loss function. Kruszewski et al. (2015) map text to Boolean representations, and derive loss functions based on implication at the Boolean level for entailment detection. Demeester et al. (2016) propose lifted regularization for knowledge base completion, which makes the logical loss functions independent of the number of grounded instances and extends them to unseen constants. The basic idea is that hypernyms have ordering relations and such relations correspond to component-wise comparison in semantic vector space. Hu et al. (2016) introduce a teacher-student model, where the teacher model is a rule-regularized neural network whose predictions are used to teach the student model. Wang and Poon (2018) generalize virtual evidence (Pearl, 2014) to arbitrary potential functions over inputs and outputs, and use deep probabilistic logic to integrate indirect supervision into neural networks. More recently, Asai and Hajishirzi (2020) regularize question answering systems with symmetric consistency and transitive consistency. The former creates a symmetric question by replacing words with their antonyms in a comparison question, while the latter applies to causal reasoning questions by creating new examples when a positive causal relationship between two cause-effect questions holds.
Methods in the second group incorporate logic-specific modules into the inference process (Yang et al., 2017; Dong et al., 2019). For example, Rocktäschel and Riedel (2017) target the problem of knowledge base completion, and use neural unification modules to recursively construct a model similar to the backward chaining algorithm of Prolog. Evans and Grefenstette (2018) develop a differentiable model of forward chaining inference, where weights represent a probability distribution over clauses. Li and Srikumar (2019) inject logic-driven neurons into existing neural networks, measuring the degree to which the head is true with probabilistic soft logic (Kimmig et al., 2012). Our approach belongs to the first group, and to the best of our knowledge, our work is the first to augment a neural network with logical knowledge for propaganda detection.

Conclusion
In this paper, we propose a fine-grained multi-task learning approach that leverages declarative knowledge to detect propaganda techniques in news articles. Specifically, the declarative knowledge is expressed in both natural language and first-order logic, which are used as regularizers to obtain better propaganda representations and to improve the logical consistency between coarse- and fine-grained predictions, respectively. Extensive experiments on the PTC dataset demonstrate that our knowledge-augmented method achieves superior performance with more consistent sentence-level and token-level predictions.

Figure 1: An example of propagandistic texts, and definitions of corresponding propaganda techniques (Bold denotes propagandistic texts).

Figure 2: Overview of our proposed model. A BERT-based multi-task learning approach is adopted to make predictions for 18 propaganda techniques at both the sentence level and the token level. We introduce two types of knowledge as additional objectives: (1) textual knowledge from literal definitions of propaganda techniques, and (2) logical knowledge about the consistency between sentence-level and token-level predictions.
Doc ID: 999000155 • Title: Arizona Democrat Senate Candidate Kyrsten Sinema Refuses To Retract Saying It's OK For Americans To Join Taliban
Ground-truth: […] "I go over there, and I'm fighting for the Taliban […]." Sinema responded: "Fine. I don't care if you go and do that, go ahead.[1]" […] The Left maintains that Islamic jihad terror is not a problem - it's just a reaction to the evil deeds of the U.S. and Israel[3]. […] Sinema is proof: the Left hates America,[2] and considers "right-wing extremists," […] It used to be that this fact was dismissed as hysterical hyperbole.[2]
MGN: […] "I go over there, and I'm fighting for the Taliban […]." Sinema responded: "Fine. I don't care if you go and do that, go ahead.[5]" […] The Left maintains that Islamic jihad terror is not a problem - it's just a reaction to the evil deeds of the U.S. and Israel. […] Sinema is proof: the Left hates America,[4] and considers "right-wing extremists," […] It used to be that this fact was dismissed as hysterical hyperbole.[2]
Our method: […] "I go over there, and I'm fighting for the Taliban […]." Sinema responded: "Fine. I don't care if you go and do that, go ahead.[1]" […] The Left maintains that Islamic jihad terror is not a problem - it's just a reaction to the evil deeds of the U.S. and Israel[3]. […] Sinema is proof: the Left hates America,[2] and considers "right-wing extremists," […] It used to be that this fact was dismissed as hysterical hyperbole.

Figure 3: Qualitative comparison of two different models on a news article. The baseline MGN predicts spans of fragments with wrong propaganda techniques, while our method can make more accurate predictions. The five propaganda techniques shown are: 1. Thought-terminating Cliches, 2. Loaded Language, 3. Causal Oversimplification, 4. Flag waving and 5. Repetition. (Best viewed in color.)

Figure 4: Visualization of the confusion matrix of our LatexPRO (T+L), where O represents the "none of them" class.

Figure 3 gives a qualitative comparison between MGN and our LatexPRO (T+L). Different colors represent different propaganda techniques. The results show that although MGN predicts the spans of fragments correctly, it fails to identify their techniques to some extent. In contrast, our method shows promising results on both spans and specific propaganda techniques, which further confirms that our method can make more accurate predictions.

Table 1: The statistics of all 18 propaganda techniques.
We conduct experiments on tasks at two different granularities: sentence-level classification (SLC) and fragment-level classification (FLC). Formally, in both tasks, the input is a plain-text document d containing a sequence of characters, together with a set of propagandistic fragments T, in which each propagandistic text fragment is represented as a sequence of contiguous characters $t = [t_i, \ldots, t_j] \subseteq d$. For SLC, the target is to predict whether a sentence is propagandistic, which can be regarded as binary classification. For FLC, the target is to predict a set S of propagandistic fragments $s = [s_m, \ldots, s_n] \subseteq d$ and to assign each s ∈ S to one of the propaganda techniques.

Table 2: Overall performance on fragment-level experiments (FLC task) in terms of precision (P), recall (R) and F1 scores on our test set. $M_C$ denotes the metric of consistency between sentence-level predictions and token-level predictions. Full task is the overall task of both detecting propagandistic fragments and identifying their techniques, while Spans is a special case of the Full task that considers only the spans of the fragments and ignores their propaganda techniques. Note that (T+L), (T), and (L) denote injecting both textual and logical knowledge, only textual knowledge, and only logical knowledge, respectively.
Model                                Spans P   Spans R   Spans F1   Full P   Full R   Full F1   M_C
BERT (Da San Martino et al., 2019)   50.39     46.09     48.15      27.92    27.27    27.60     -
MGN (Da San Martino et al., 2019)    51.16     47.27     49.14      30.10    29.37    29.73     -

Table 3: Detailed performance on the Full task of fragment-level experiments (FLC task) on our test set. Precision (P), recall (R) and F1 scores per technique are provided.

Table 4: Results on sentence-level experiments (SLC task) in terms of precision (P), recall (R) and F1 scores on our test set. Random is a baseline that predicts randomly, and All-Propaganda is a baseline that always predicts the propaganda class.