Automated Fact-Checking of Claims in Argumentative Parliamentary Debates

We present an automated approach to distinguishing true, false, stretch, and dodge statements in questions and answers in the Canadian Parliament. We leverage the truthfulness annotations of a U.S. fact-checking corpus by training a neural network model on it and incorporating its prediction probabilities into our models. We find that, in concert with other linguistic features, these probabilities can improve the multi-class classification results. We further show that dodge statements can be detected with an F1 measure as high as 82.57% in binary classification settings.


Introduction
Governments and parliaments that are chosen by citizens' votes have ipso facto attracted a certain level of trust. However, governments and parliamentarians use combinations of true statements, false statements, and exaggerations in strategic ways to question other parties' trustworthiness, thereby creating distrust towards them while gaining credibility for themselves. Distrust and alienation may be created by using ad hominem arguments or by raising questions about someone's character and honesty (Walton, 2005). For example, consider the claims made within the following question asked in the Canadian Parliament:

Example 1.1 [Dominic LeBlanc, 2013-10-21] The RCMP and Mike Duffy's lawyer have shown us that the Prime Minister has not been honest about this scandal. When will he come clean and stop hiding his own role in this scandal?
These claims, including the presupposition of the second sentence that the Prime Minister has a role in the scandal that he is hiding, may be true, false, or simply exaggerations. In order to analyze how these claims serve their presenter's purpose or intention, we need to determine their truthfulness.
Here, we examine the linguistic characteristics of true statements, false statements, dodges, and stretches in argumentative parliamentary statements. We examine whether falsehoods told by members of parliament can be identified with previously proposed approaches, and we find that while some of these approaches improve the classification, identifying falsehoods by members of parliament remains challenging.

Vlachos and Riedel (2014) proposed using data from fact-checking websites, such as PolitiFact, for the fact-checking task, and suggested that one way to approach it would be to use the semantic similarity between statements. Hassan et al. (2015) used presidential debates and proposed three labels for the fact-checking task: Non-Factual, Unimportant Factual, and Check-worthy Factual. They used a traditional feature-based method and trained models to classify the debates into these three labels using sentiment scores from AlchemyAPI, the word count of a sentence, bag of words, part-of-speech tags, and entity types. They found that the part-of-speech tag of cardinal numbers was the most informative feature, with word count the second most informative. They also found that check-worthy factual claims were more likely to contain numeric values, and non-factual sentences were less likely to contain them. Patwari et al. (2017) used primary and presidential debates to analyze check-worthy statements, using topics extracted with LDA, entity history and type counts, part-of-speech tuples, counts of part-of-speech tags, unigrams, sentiment, and token counts in their classifier.

Data
For our analysis, we extracted our data from a project by the Toronto Star newspaper (http://projects.thestar.com/question-period/index.html; all the data is publicly available). The Star reporters (Bruce Campion-Smith, Brendan Kennedy, Marco Chown Oved, Alex Ballingall, Alex Boutilier, and Tonda MacCharles) fact-checked and annotated questions and answers from the Oral Question Period of the Canadian Parliament over five days in April and May 2018. Oral Question Period is a 45-minute session in which the Opposition and Government backbenchers ask questions of ministers of the government, and the ministers must respond. The reporters annotated all assertions within both the questions and the answers as either true, false, stretch (half true), or dodge (not actually answering the question). Further, they provided a narrative justification for the assignment of each label (we do not use that data here). Here is an example of the annotated data (not including the justification): . . . on climate change. Canadians know that we have a plan, but they are not so sure if the Conservatives do.

Table 3: Five-fold cross-validation results (F1 and % accuracy) of four-way classification of fact-checking for the overall dataset and F1 for each class.

We extracted each annotated span of text with its associated label. The distribution of the labels in this dataset is shown in Table 1. This is a skewed dataset, with more than half of the statements annotated as true.
We also use a publicly available dataset from PolitiFact, a website at which statements by American politicians and officials are annotated on a 6-point scale of truthfulness. The distribution of labels in this data is shown in Table 2. We examine the PolitiFact data to determine whether these annotations can help the classification of the Toronto Star annotations.

Method
We formulate the analysis as a multi-class classification task; given a statement, we identify whether the statement is true, false, stretch, or a dodge.
We first examine the effective features used for identifying deceptive texts in the prior literature.
• Tuples of words and their part-of-speech tags (unigrams and bigrams weighted by tf-idf, represented by POS in the result tables).
• Named entity type counts, including organizations and locations (Patwari et al., 2017) (represented by NE in the result tables).
• Total number of numbers in the text, e.g., six organizations heard the assistant deputy . . . (represented by NUM in the result tables).

In addition, we leverage the American PolitiFact data to fact-check the Canadian Parliamentary questions and answers by training a Gated Recurrent Unit (GRU) classifier (Cho et al., 2014) on this data. We use the truthfulness predictions of this classifier, i.e., the probabilities of the 6-point-scale labels, as additional features for our SVM classifier (using the scikit-learn package (Pedregosa et al., 2011)). For training the GRU classifier, we initialized the word representations with the publicly available GloVe pretrained 100-dimension word embeddings (Pennington et al., 2014), restricted the vocabulary to the 5,000 most frequent words, and used a sequence length of 300. We added a dropout of 0.6 after the embedding layer and a dropout layer of 0.8 before the final sigmoid unit layer. The model was trained with categorical cross-entropy and the Adam optimizer (Kingma and Ba, 2014) for 10 epochs with a batch size of 64. We used 10% of the data for validation, with the model achieving an average F1 measure of 31.44% on this data.
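As a concrete illustration, the surface features above can be assembled roughly as in the following sketch. This is a minimal, hypothetical implementation: the actual pipeline uses tf-idf weighting via scikit-learn and an automatic POS tagger, which we stand in for here with pre-tagged input and plain counts.

```python
import re
from collections import Counter

def word_pos_tuples(tagged_tokens):
    """Build word_POS unigram and bigram features from a
    POS-tagged sentence (tags assumed supplied by a tagger)."""
    unigrams = [f"{w}_{t}" for w, t in tagged_tokens]
    bigrams = [f"{a} {b}" for a, b in zip(unigrams, unigrams[1:])]
    return Counter(unigrams + bigrams)

def count_numbers(text):
    """Total number of numeric tokens, counting digit strings and
    spelled-out small numbers (a simplified stand-in for the NUM feature)."""
    small = ("one two three four five six seven eight nine ten").split()
    digits = re.findall(r"\b\d[\d,.]*\b", text)
    spelled = [w for w in text.lower().split() if w.strip(".,") in small]
    return len(digits) + len(spelled)

# The fragment from the feature list above contains one number ("six").
print(count_numbers("six organizations heard the assistant deputy"))  # 1
```

In the full system these counts would be concatenated with the tf-idf-weighted unigrams and bigrams before training the SVM.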

Results and discussion
We approach the fact-checking of the statements as a multi-class classification task. Our baselines are the majority class (truths) and an SVM classifier trained with unigrams extracted from the annotated spans of text (weighted by tf-idf). We performed five-fold cross-validation. Table 3 reports the results on the multi-class classification task with these baselines and with the additional features described in the Method section, including the truthfulness predictions of the GRU classifier trained on PolitiFact data. The best result is achieved using unigrams, POS tags, total number of numbers (NUM), superlatives, and the GRU's truthfulness predictions (PolitiFact predictions). We examined all five lexicons from Wiktionary provided by Rashkin et al. (2017); however, only superlatives affected the performance of the classifier, so we report only the results using superlatives. We also report in Table 3 the average F1 measure for each of the four labels in the multi-class setting, using five-fold cross-validation. The truthfulness predictions did not improve the classification of the dodge and true labels in the multi-class setting. Superlatives slightly improved the classification of all labels except dodge.
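A sketch of how the GRU's 6-way prediction probabilities can be appended to the sparse tf-idf features before training the SVM. All variable names and the toy corpus are illustrative, and the probability matrix is a random stand-in for the real GRU outputs.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy corpus standing in for the annotated Question Period spans.
texts = ["we have a plan", "they did nothing for ten years"] * 10
labels = ["true", "false"] * 10

X_text = TfidfVectorizer().fit_transform(texts)          # tf-idf unigram features
rng = np.random.default_rng(0)
gru_probs = rng.dirichlet(np.ones(6), size=len(texts))   # 6-point-scale probabilities

# Concatenate the sparse text features with the dense probability columns.
X = hstack([X_text, csr_matrix(gru_probs)])

# Five-fold cross-validation with a linear SVM, as in the paper.
scores = cross_val_score(SVC(kernel="linear"), X, labels,
                         cv=5, scoring="f1_macro")
```

With the real data, `gru_probs` would come from the GRU trained on PolitiFact, and `texts`/`labels` from the Toronto Star annotations.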
We further perform pairwise (one-versus-one) classification for all possible pairs of labels to get better insight into the impact of the features and the characteristics of the labels. To do so, we created three roughly balanced datasets of truths and falsehoods by randomly resampling the true statements without replacement (85 true statements in each dataset). The same method was used for comparing the true label with the dodge and stretch labels; i.e., we created three relatively balanced datasets for analyzing the true and dodge labels and three for the true and stretch labels. This also allows us to compare the prior work on the 6-point-scale truthfulness labels of the U.S. data with the Canadian 4-point scale. Table 4 presents the classification results using five-fold cross-validation with an SVM classifier. The reported F1 measure is the average of the results on all three datasets for each pairwise setting. Dodge statements were classified more accurately than the other statements, with an F1 measure as high as 82.57%. This shows that answers that do not provide a response to the question can be detected with relatively high confidence. The most effective features for classifying false against true and dodge statements were named entities.
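The balanced pairwise datasets can be built by undersampling the majority (true) class, roughly as follows. This is a sketch: the function and variable names are hypothetical, and the figure of 85 per dataset matches the count given in the text.

```python
import random

def balanced_pairwise_datasets(majority, minority, n_datasets=3, seed=0):
    """Create roughly balanced two-class datasets by sampling the
    majority class without replacement, once per dataset."""
    rng = random.Random(seed)
    k = len(minority)  # e.g., 85 true statements per dataset in the paper
    return [
        (rng.sample(majority, k), list(minority))  # sample() is without replacement
        for _ in range(n_datasets)
    ]

true_stmts = [f"true_{i}" for i in range(300)]    # hypothetical majority class
false_stmts = [f"false_{i}" for i in range(85)]   # hypothetical minority class
datasets = balanced_pairwise_datasets(true_stmts, false_stmts)
```

Each of the three resulting datasets pairs a fresh sample of true statements with the full minority class, so the averaged F1 over the three datasets is less sensitive to any one sample.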
The predictions obtained from training the GRU model on the PolitiFact annotations, on their own, were not able to distinguish false from true and stretch statements. However, the predictions did help in distinguishing true against stretch and dodge statements. None of the models were able to improve the classification of false against stretch statements over the majority baseline.
Overall, stretch statements were the most difficult to identify in the binary classification setting. This could also be due to some inconsistency in the annotation process, with stretch and false not always clearly separated; essentially the same claim is labelled stretch in one place in the data and false elsewhere:

Example 5.2 [Justin Trudeau] The Conservatives promised that they would also tackle environmental challenges and that they would do so by means other than carbon pricing. . . . They have no proposals, [they did nothing for 10 years.] False

We further performed the analysis using the two aggregate predictions of more true and more false from the PolitiFact dataset; however, we did not observe any improvements. Using the total number of words in the statements also did not improve the results.
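One plausible way to obtain the two aggregate predictions is to sum the GRU's 6-point-scale probabilities over the truthful and untruthful halves of the scale. The exact aggregation is not specified in the text, so the sketch below is an assumption:

```python
# 6-point PolitiFact scale, ordered from most to least truthful.
SCALE = ["true", "mostly true", "half true",
         "mostly false", "false", "pants on fire"]

def aggregate(probs):
    """Collapse a 6-way probability vector (ordered as SCALE) into
    (more_true, more_false) by summing each half of the scale."""
    more_true = sum(probs[:3])
    more_false = sum(probs[3:])
    return more_true, more_false

print(aggregate([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]))  # (0.875, 0.125)
```

The two resulting values would then replace the six raw probabilities as features for the SVM.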
While Rashkin et al. (2017) found that LIWC features were effective for predicting the truthfulness of the statements in PolitiFact, we did not observe any improvement in the performance of the classifier on the Canadian Parliamentary data. Furthermore, we did not observe any improvement in the classification tasks using sentiment and subjectivity features extracted with OpinionFinder (Wilson et al., 2005).

Comparison with PolitiFact dataset
In this section, we perform a direct analysis with the PolitiFact dataset. We first train a GRU model (with a sequence length of 200; other hyperparameters the same as those of the experiment described above) using 3-point-scale annotations derived from PolitiFact (with 10% of the data held out for validation). We treat the top two truthful ratings (true and mostly true) as true; half true and mostly false as stretch; and the last two ratings (false and pants-on-fire false) as false. We then test the model on the true, stretch, and false annotations from the Toronto Star project. The results are presented in Table 5. As they show, none of the false statements are detected as false, and the overall F1 score is lower than the majority baseline.
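The 3-point collapse of the PolitiFact scale, and the 2-point variant used next, amount to simple label mappings. The label strings here are illustrative; the dataset's exact spellings may differ.

```python
# 6-point PolitiFact scale, ordered from most to least truthful.
SIX_POINT = ["true", "mostly true", "half true",
             "mostly false", "false", "pants on fire"]

# 3-point: top two -> true, middle two -> stretch, bottom two -> false.
TO_THREE = {lab: cls
            for lab, cls in zip(SIX_POINT,
                                ["true"] * 2 + ["stretch"] * 2 + ["false"] * 2)}

# 2-point: top three -> true, bottom three -> false.
TO_TWO = {lab: cls
          for lab, cls in zip(SIX_POINT, ["true"] * 3 + ["false"] * 3)}

print(TO_THREE["half true"])  # stretch
print(TO_TWO["half true"])    # true
```

Note that the two collapses disagree on half true, which lands in stretch on the 3-point scale but in true on the 2-point scale.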
We further train a GRU model (trained with binary cross-entropy and a sequence length of 200; other hyperparameters the same as above) using a 2-point scale in which we treat the top three truthful ratings as true and the last three as false. We then test the model on the true and false annotations from the Toronto Star project. The results are presented in Table 6; the F1 score remains below the baseline.
The PolitiFact dataset provided by Rashkin et al. includes a subset of direct quotes by the original speakers. We performed the 3-point-scale and 2-point-scale analyses again using only these direct quotes. As Tables 5 and 6 also show, this did not improve the classification performance.

Conclusion
We have analyzed the classification of truths, falsehoods, dodges, and stretches in the Canadian Parliament and compared it with the truthfulness classification of statements in the PolitiFact dataset. We studied whether features shown to be effective in prior research can help characterize truthfulness in Canadian Parliamentary debates, and found that while some of these features help identify dodge statements with an F1 measure as high as 82.57%, they were not very effective in identifying false and stretch statements. The truthfulness predictions obtained by training a model on annotations of American politicians' statements, when used with other features, helped slightly in distinguishing truths from other statements. In future work, we will take advantage of the journalists' justifications in determining the truthfulness of statements, since relying on linguistic features alone is not sufficient for detecting falsehoods in parliament.