CAUnLP at NLP4IF 2019 Shared Task: Context-Dependent BERT for Sentence-Level Propaganda Detection

The goal of fine-grained propaganda detection is to determine whether a given sentence uses propaganda techniques (sentence-level) or to recognize which techniques are used (fragment-level). This paper presents the system we submitted to the sentence-level subtask of the propaganda detection shared task. To better exploit document information, we construct context-dependent input pairs (sentence-title and sentence-context pairs) to fine-tune the pretrained BERT, and we use undersampling to tackle the problem of imbalanced data.


Introduction
Propaganda detection is the process of determining whether a news article or a sentence is misleading. Several works have addressed propaganda detection at the document level (Rashkin et al., 2017; Barrón-Cedeño et al., 2019b) and at the sentence and fragment levels (Da San Martino et al., 2019). Sentence-level classification (SLC) determines whether a given sentence is propagandistic, which is a binary classification problem, while fragment-level classification (FLC) extracts fragments and assigns them labels such as loaded language, flag-waving, and causal oversimplification, which can be treated as a sequence labeling problem.
Compared with document-level detection, sentence-level and fragment-level detection are more useful, since predictions on sentences and fragments are more actionable in real-life applications. However, these fine-grained tasks are also more challenging. Although Da San Martino et al. (2019) show that multi-task learning over both the SLC and the FLC can benefit the SLC, in this paper we focus only on the SLC task so as to better investigate whether context information can improve the performance of our system. Since pretrained language models (Devlin et al., 2019; Liu et al., 2019) have proven effective for text classification and other natural language understanding tasks, we use the pretrained BERT (Devlin et al., 2019) for the SLC task. This paper describes our BERT-based system, for which we construct sentence-title pairs and sentence-context pairs as input. In addition, to tackle the problem of imbalanced data, we apply undersampling (Zhou and Liu, 2006) to the training data and find that it greatly boosts the performance of our system.

Related Work
Various methods have been proposed for propaganda detection. Rashkin et al. (2017) used LSTMs and other machine learning methods for deception detection across different types of news, including trusted, satire, hoax, and propaganda. Barrón-Cedeño et al. (2019b) used a Maximum Entropy classifier (Berger et al., 1996) with different features, replicating the experimental setup of Rashkin et al. (2017), for two-way and four-way classification. A fine-grained propaganda corpus including both sentence-level and fragment-level annotations was introduced by Da San Martino et al. (2019). Based on this corpus and the pretrained BERT, one of the most powerful pretrained language models, a multi-granularity BERT was proposed and outperformed several strong BERT-based baselines.

Methodology
In our system, we use BERT as our base model and construct different kinds of input pairs to fine-tune it. When constructing the input representation, a special token [CLS] is prepended to every sentence and another token [SEP] is appended at its end. In addition, for each input pair, a [SEP] is added between a sentence and its context or title. Finally, a linear layer and a sigmoid function are applied to the final representation of [CLS] to obtain the probability for classification. For comparison, we also use the official baseline (Random), which labels sentences randomly.
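As a minimal sketch of the classification step (plain Python, with a toy low-dimensional vector standing in for BERT's 768-dimensional [CLS] representation; the weights and bias here are illustrative placeholders, not trained parameters):

```python
import math

def propaganda_probability(cls_repr, weight, bias):
    # Linear layer over the final [CLS] representation, then a sigmoid,
    # yielding the probability that the sentence is propagandistic.
    logit = sum(h * w for h, w in zip(cls_repr, weight)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 4-dimensional [CLS] vector (BERT-base actually outputs 768 dimensions).
p = propaganda_probability([0.2, -0.1, 0.5, 0.3], [1.0, 0.5, -0.2, 0.8], 0.1)
label = "propaganda" if p >= 0.5 else "non-propaganda"
```

In the full model, `cls_repr` would be the final hidden state of the [CLS] token and the linear layer's parameters would be learned jointly during fine-tuning.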

Data
The dataset is provided by the NLP4IF 2019 Shared Task (Barrón-Cedeño et al., 2019a); the training, development, and test sets contain approximately 16,000, 2,000, and 3,400 sentences respectively. Only 29% of the training sentences are labeled as propaganda, and thus in this paper we treat propaganda sentences as positive samples and non-propaganda sentences as negative samples. More details of the dataset can be found in Da San Martino et al. (2019).

Input pairs
Sentence Only: We use only the current sentence to fine-tune the model; models trained with this kind of input serve as baselines for those trained with the following two kinds of input pairs. Sentence-Title Pair: As described in Da San Martino et al. (2019), the dataset is drawn from news articles, and since the title usually summarizes a news article, we use the title as supplementary information.
Sentence-Context Pair: In addition to using the title as supplementary information, we construct the sentence-context pair, which also includes preceding sentences as additional context, since preceding sentences usually convey the same or related events and this earlier content is closely related to the current sentence. Figure 1 shows the details of this kind of input pair, in which the preceding sentence and the title are directly concatenated.
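The three input formats can be sketched as follows (a simplified illustration: whitespace splitting stands in for BERT's WordPiece tokenizer, and the example strings are invented):

```python
def build_input_tokens(sentence, supplement=None):
    # [CLS] is prepended, [SEP] closes the sentence, and a second [SEP]
    # closes the optional title/context segment.
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    if supplement is not None:
        tokens += supplement.split() + ["[SEP]"]
    return tokens

# Sentence only.
single = build_input_tokens("We must defend our values")
# Sentence-title pair; for a sentence-context pair, the preceding
# sentences and the title would be concatenated into the supplement.
pair = build_input_tokens("We must defend our values", "Nation in crisis")
```

The supplement occupies BERT's second segment, so segment (token-type) embeddings let the model distinguish the sentence from its title or context.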

Undersampling
As mentioned above, only 29% of training sentences are labeled as propaganda (positive). To tackle the problem of imbalanced data, we first collect the positive samples, whose number is S_pos, and the negative samples; then, at the beginning of each training epoch, we resample S_neg negatives (X percent of S_pos) from the negative samples. Finally, we combine and shuffle the positive samples and the sampled negatives into a new training set S_sampled.
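The per-epoch resampling can be sketched as follows (a hedged illustration; the toy data and the fixed random seed are placeholders, and in practice the sampling is repeated at the start of every epoch):

```python
import random

def undersample(positives, negatives, x, rng):
    # Draw S_neg = x * S_pos negatives anew at the start of each epoch,
    # then combine and shuffle them with all positive samples.
    s_neg = int(x * len(positives))
    sampled_negatives = rng.sample(negatives, min(s_neg, len(negatives)))
    epoch_set = positives + sampled_negatives
    rng.shuffle(epoch_set)
    return epoch_set

rng = random.Random(0)
positives = [("prop sentence %d" % i, 1) for i in range(29)]
negatives = [("non-prop sentence %d" % i, 0) for i in range(71)]
epoch_set = undersample(positives, negatives, 0.8, rng)
```

With X = 0.8 and 29 positives, each epoch sees 23 sampled negatives, so the class ratio is close to balanced instead of the original 29/71 split.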

Experiment Details
In this paper, we use the pretrained uncased versions of BERT-base and BERT-large for the SLC; more details of these two models can be found in Devlin et al. (2019). Before fine-tuning, sentences are converted to lower case and the maximum sequence length is set to 128. For a sentence-context pair, the maximum length of the context is set to 100. If the sequence length of an input pair exceeds 128, the context or title is truncated to meet the limit.
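A hedged sketch of this length handling (the paper does not specify whether the head or tail of an overlong segment is kept; tail truncation is assumed here, and the 3-token overhead assumes one [CLS] plus two [SEP] tokens):

```python
MAX_SEQ_LEN = 128      # maximum total length of an input pair
MAX_CONTEXT_LEN = 100  # maximum length of the context segment

def fit_pair(sentence_tokens, supplement_tokens):
    # Cap the context at 100 tokens, then trim the supplement further so
    # that [CLS] + sentence + [SEP] + supplement + [SEP] fits in 128 tokens.
    supplement_tokens = supplement_tokens[:MAX_CONTEXT_LEN]
    budget = MAX_SEQ_LEN - len(sentence_tokens) - 3  # 3 special tokens
    return sentence_tokens, supplement_tokens[:max(budget, 0)]

sent, ctx = fit_pair(["w"] * 20, ["c"] * 200)
```

With a 20-token sentence, the 100-token context cap binds first; with a 50-token sentence, the 128-token budget binds instead and the context shrinks to 75 tokens.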
When fine-tuning, we use Adam (Kingma and Ba, 2014) with a learning rate of 2e-5 for 2 epochs; the batch size is 32 and the dropout probability is kept at 0.1. Since the title or context information can help improve performance, we apply the undersampling method only to input pairs (sentence-title and sentence-context), with the sample rate X set empirically to 0.8, 0.9, or 1.0. For the models without undersampling, all training samples are used.
We evaluate all models directly on the development set, and the best model is chosen to generate predictions on the test data.

Results
Our approach is evaluated on the Propaganda Detection@NLP4IF SLC dataset. In the development stage, we use three kinds of input and three different sample rates for BERT. Table 1 shows the results on the development set. Without considering undersampling, we can see from Table 1 that using the sentence-title pair boosts the performance of BERT-base compared with the model using only the current sentence and with the random baseline. While the sentence-context pair improves the F1 score of BERT-base by 0.8%, with precision rising to 71.10 and recall decreasing to 54.94, the performance of BERT-large drops by around 1%, with recall falling significantly to 49.12.
We also observe that both BERT-base and BERT-large trained on the original training sentences are competitive with the random baseline. However, the precision of BERT-base, at 70.54, and that of BERT-large, at 71.23, are significantly higher than the recall of the two models, at 56.70 and 54.26 respectively, which may result from the imbalanced classes. Thus, we introduce the undersampling technique with a 0.8, 0.9, or 1.0 sample rate to tackle this issue. We observe from Table 1 that the F1 score of BERT-base with the sentence-title pair and a 0.8 sample rate rises by around 5%, and that the same model with the sentence-context pair and a 0.9 sample rate performs similarly. As for BERT-large, while the sentence-title pair yields performance similar to its use in the base model, the sentence-context pair strongly boosts the F1 score, to 67.94 with a 0.8 sample rate and 67.25 with a 1.0 sample rate. In addition, it is worth noting that the 1.0 sample rate gives a better trade-off between precision and recall than the 0.8 rate.
In the test stage, since we are allowed only a single run on the test set, we choose the model with the highest F1 score (67.94) to generate predictions; the evaluated results are listed in Table 2. Compared with the results on the development set, recall rises by nearly 5% while precision drops significantly, by around 7%; the recall of the random baseline drops by approximately 5.5% while its precision remains nearly the same.

Conclusion and Future Work
In this paper, we examine the capability of a context-dependent BERT model. For the sentence-level propaganda detection task, we construct sentence-title pairs and sentence-context pairs to better utilize context information and improve the performance of our system. Furthermore, undersampling is used to tackle the data imbalance problem. Experiments show that both the sentence-title/context pairs and the undersampling method can boost the performance of BERT on the SLC task.
In the future, we plan to apply multi-task learning to this context-dependent BERT, similar to the method in Da San Martino et al. (2019), or to introduce other kinds of tasks, such as sentiment analysis or domain classification.