SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums

We present SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums, which features two subtasks. Subtask A is about deciding whether a question asks for factual information vs. an opinion/advice vs. just socializing. Subtask B asks to predict whether an answer to a factual question is true, false or not a proper answer. We received 17 official submissions for subtask A and 11 official submissions for Subtask B. For subtask A, all systems improved over the majority class baseline. For Subtask B, all systems were below a majority class baseline, but several systems were very close to it. The leaderboard and the data from the competition can be found at http://competitions.codalab.org/competitions/20022.


Overview
The current coverage of the political landscape in both the press and in social media has led to an unprecedented situation. Like never before, a statement in an interview, a press release, a blog note, or a tweet can spread almost instantaneously. The speed of proliferation leaves little time for double-checking claims against the facts, which has proven critical in politics, e.g., during the 2016 presidential campaign in the USA, which was dominated by fake news in social media and by false claims.
Investigative journalists and volunteers have been working hard to get to the root of a claim and to present solid evidence in favor of or against it. Manual fact-checking is very time-consuming, and thus automatic methods have been proposed to speed up the process, e.g., there has been work on checking the factuality/credibility of a claim, of a news article, or of an information source (Ba et al., 2016; Zubiaga et al., 2016; Ma et al., 2016; Castillo et al., 2011; Baly et al., 2018). The process starts when a document is made public. First, an intrinsic analysis is carried out in which check-worthy text fragments are identified. Then, other documents that might support or rebut a claim in the document are retrieved from various sources. Finally, by comparing a claim against the retrieved evidence, a system can determine whether the claim is likely true or likely false (or unsure, if no strong enough evidence either way could be found). For instance, Ciampaglia et al. (2015) do this using a knowledge graph derived from Wikipedia. The outcome could then be presented to a human expert for final judgement.
For our two subtasks, we explore factuality in the context of Community Question Answering (cQA) forums. Forums such as StackOverflow, Yahoo! Answers, and Quora are very popular these days, as they represent effective means for communities around particular topics to share information. However, the information shared by the users is not always correct or accurate. There are multiple factors explaining the presence of incorrect answers in cQA forums, e.g., misunderstanding of the question, or ignorance or maliciousness of the responder. Also, as a result of our dynamic world, the truth is time-sensitive: something that was true yesterday may be false today. Moreover, forums are often barely moderated and thus lack systematic quality control.
Here we focus on checking the factuality of questions and answers in cQA forums. This aspect was ignored in recent cQA tasks (Ishikawa et al., 2010; Nakov et al., 2015, 2016a, 2017a), where an answer is considered GOOD if it addresses the question, irrespective of its veracity, accuracy, etc. Figure 1 presents an excerpt of an example from the Qatar Living Forum, with one question and three answers selected from a longer thread. According to SemEval-2016 Task 3 (Nakov et al., 2016a), all three answers would be considered GOOD since they are formally answering the question. Nevertheless, a1 contains false information, while a2 and a3 are correct, as can be established from an official government website.
Checking the veracity of answers in a cQA forum is a hard problem, which requires putting together aspects of language understanding, modelling the context, integrating several information sources, using world knowledge and complex inference, among others. Moreover, high-quality automatic fact-checking would offer a better experience to users of cQA systems, e.g., the user could be presented with veracity scores, where low scores would warn the user not to completely trust the answer or to double-check it.

Related Work
Fact-checking of answers was not studied before in the context of community Question Answering, apart from our own recent work (Mihaylova et al., 2018). Yet, in the context of cQA and general QA, there has been work on credibility assessment, which has been modelled primarily at the feature level, with the goal of improving GOOD answer identification. Notable exceptions are Nakov et al. (2017b) and Mihaylov et al. (2018), where credibility was a task in its own right. However, credibility is different from veracity (our focus here), as it is a subjective perception about whether a statement is credible, rather than actually truthful. Jurczyk and Agichtein (2007) modelled author authority using link analysis. Agichtein et al. (2008) looked for high-quality answers using PageRank and HITS, in addition to intrinsic content quality, e.g., punctuation and typos, syntactic and semantic complexity, and grammaticality. Lita et al. (2005) studied three qualitative dimensions for answers: source credibility (e.g., does the document come from a government website), sentiment analysis, and contradiction compared to other answers. Su et al. (2010) looked for verbs and adjectives that cast doubt. Banerjee and Han (2009) used language modelling to validate the reliability of an answer's source. Jeon et al. (2006) focused on non-textual features such as click counts, answer activity level, and copy counts. Pelleg et al. (2016) curated social media content using syntactic, semantic, and social signals. Unlike this research, we (i) target factuality rather than credibility, and (ii) address it as a task in its own right, on a specialised dataset.
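Information credibility was also studied in social computing. Castillo et al. (2011) modeled user reputation. Canini et al. (2011) analyzed the interaction of content and social network structure. Morris et al. (2012) studied how Twitter users judge truthfulness. Lukasik et al. (2015) used temporal patterns to detect rumors, and Zubiaga et al. (2016) focused on conversations. Other authors have been querying the Web to gather support for accepting or refuting a claim (Popat et al., 2016; Karadzhov et al., 2017b).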
Finally, there has been work on credibility, trust, and expertise in news communities (Mukherjee and Weikum, 2015). Dong et al. (2015) proposed that a trustworthy source is one that contains very few false claims. Recent work has also focused on evaluating the factuality of reporting of entire news outlets (Baly et al., 2018, 2019). However, none of this work was about QA or cQA.

Subtasks and Data Description
SemEval-2019 Task 8 has two subtasks:
• Subtask A: Given a question from a cQA forum, predict whether this question asks for factual information vs. opinion/advice vs. just socializing.
• Subtask B: Given a factual question from a cQA forum, together with its answer thread, predict whether each answer provides true vs. false vs. non-factual information as a response to the question.

Data and Resources
We retrieved the data from the Qatar Living web forum. We then cleaned it and we annotated it with the labels described in Sections 3.2 and 3.3.
For subtask A, we annotated the questions using Amazon Mechanical Turk. To ensure high quality of the annotation, we went through all annotations and manually double-checked them.
For subtask B, we did not use an external annotation service, but instead we annotated all the data ourselves. Each answer was processed by three independent annotators, and we made sure we had proof for the label from reliable sources on the Web. Then, the annotations were consolidated after a discussion until agreement was achieved for each example.
All data is freely available under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) license, and is accessible on the competition's website.
In addition to the provided annotated data, we also allowed the participants to use unlabelled data from the Qatar Living forum (http://alt.qcri.org/semeval2016/task3/data/uploads/QL-unannotated-data-subtas), as well as additional external resources, which they had to mention explicitly in their submissions.
Note that the class distribution in the training, development and test sets differs, especially for Subtask B. The reason for this is the way the data was prepared. The different datasets (training, development and test) were prepared in stages, because of the very time-consuming data annotation process. At each annotation stage, we had to choose between releasing all the available annotated data and aiming at releasing sets with a similar label distribution. In the end, we decided to release the available data, although we were aware that this would result in sets with different distributions and, in some cases, unbalanced categories.

Training Data for Subtask A
To create the dataset for the task, we chose to augment a pre-existing dataset for cQA with factuality annotations; this allowed us to stress the difference between (a) distinguishing a good vs. a bad answer, and (b) distinguishing a factually-true vs. a factually-false one. In particular, we added annotations for factuality to the CQA-QL-2016 dataset from SemEval-2016 Task 3 on Community Question Answering (Nakov et al., 2016a).
In CQA-QL-2016, the data is organized in question-answer threads (from the Qatar Living forum). Each question has a subject, a body, and meta information: question ID, date and time of posting, user name and ID, and category (e.g., Computers and Internet and Moving to Qatar).
We analyzed the forum questions and we defined three categories, related to their factuality. We then annotated the questions using Amazon Mechanical Turk. The three factuality categories are FACTUAL (the question asks for factual information), OPINION (the question asks for an opinion or advice), and SOCIALIZING (the question is just socializing). Table 1 shows the distribution of the question labels in the training, development and test datasets. Overall, there are 1118, 239 and 953 questions, respectively, annotated with the above-described labels.

Training Data for Subtask B
For subtask B, we annotated for veracity the answers to the questions with a FACTUAL label for subtask A. Note that in CQA-QL-2016, each answer has a subject, a body, meta information (answer ID, user name, and ID), the question that it answers, and a judgement about how well it answers the question of its thread (GOOD, BAD or POTENTIALLY USEFUL).
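As a rough illustration of the thread structure described above, the following Python sketch models a question with its answers. All class and field names are hypothetical; the actual CQA-QL-2016 release has its own file format and attribute names, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    # Hypothetical field names; they mirror the attributes listed in the text.
    answer_id: str
    user_name: str
    user_id: str
    subject: str
    body: str
    relevance: str  # "GOOD", "BAD" or "POTENTIALLY USEFUL"


@dataclass
class Question:
    question_id: str
    posted_at: str  # date and time of posting, kept as a plain string here
    user_name: str
    user_id: str
    category: str
    subject: str
    body: str
    answers: List[Answer] = field(default_factory=list)


def good_answers(question: Question) -> List[Answer]:
    """Return the answers judged GOOD, i.e. the candidates for veracity annotation."""
    return [a for a in question.answers if a.relevance == "GOOD"]
```

Only the GOOD answers returned by `good_answers` would then be passed on to the factuality annotation step.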
We annotated the GOOD answers for factuality based on the assumption that a GOOD answer means it provides factual information, whether it is true or false. All BAD and POTENTIALLY USEFUL answers are automatically considered as NON-FACTUAL. The factuality labels are described as follows:
* FACTUAL - TRUE: The answer is True and can be proven with an external resource.
We further discarded answers whose factuality was very time-sensitive and for which it makes no sense to check whether the statements are true or false (e.g., "It is Friday tomorrow.", "It was raining last week.").
Moreover, many answers are arguably somewhat time-sensitive, e.g., "There is an IKEA in Doha." is true only after IKEA opened, but not before that. In such cases, we just used the present situation as a point of reference. We further discarded the answers for which the annotators could not find any information. Ultimately, we consolidated the above fine-grained labels into the following coarse-grained labels, which we used for subtask B:
* FACTUAL - TRUE: Contains answers with proven true, non-contradictory statements. This includes the answers with the label FACTUAL - TRUE from above. This label is used for answers one can trust as a true statement.
* FACTUAL - FALSE: Contains answers with statements that are proven to be false or not completely true. This includes answers with the following fine-grained factuality labels: FACTUAL - FALSE, FACTUAL - PARTIALLY FALSE, FACTUAL - CONDITIONALLY TRUE, FACTUAL - RESPONDER UNSURE. We also use this label for answers that contain a statement for which the person giving the answer expresses uncertainty in the claim.
* NON-FACTUAL: These are either non-factual statements or statements that could be factual, but no information about them could be found, i.e., we could find no way to check whether the statement was true or false. This category also includes some statements that have been incorrectly annotated as a GOOD answer. It also includes the very time-sensitive statements described before, such as "It is Friday tomorrow.". The BAD and the POTENTIALLY USEFUL answers from CQA-QL-2016 also fall in this category.
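The consolidation of fine-grained into coarse-grained labels can be sketched as a simple lookup. This is a minimal illustration, assuming the label strings are spelled exactly as in the list above; the function name `coarse_label` and the dictionary are hypothetical and not part of the released data or tools.

```python
# Hypothetical mapping, following the coarse-grained label descriptions in the text.
FINE_TO_COARSE = {
    "FACTUAL - TRUE": "FACTUAL - TRUE",
    "FACTUAL - FALSE": "FACTUAL - FALSE",
    "FACTUAL - PARTIALLY FALSE": "FACTUAL - FALSE",
    "FACTUAL - CONDITIONALLY TRUE": "FACTUAL - FALSE",
    "FACTUAL - RESPONDER UNSURE": "FACTUAL - FALSE",
    "NON-FACTUAL": "NON-FACTUAL",
}


def coarse_label(relevance, fine_label=None):
    """Map a CQA-QL-2016 relevance judgement and a fine-grained label to a coarse label."""
    # BAD and POTENTIALLY USEFUL answers are NON-FACTUAL without further annotation.
    if relevance in ("BAD", "POTENTIALLY USEFUL"):
        return "NON-FACTUAL"
    if fine_label is None:
        raise ValueError("GOOD answers need a fine-grained factuality label")
    return FINE_TO_COARSE[fine_label]
```

For example, a GOOD answer annotated as FACTUAL - RESPONDER UNSURE ends up in the FACTUAL - FALSE class, while any BAD answer is NON-FACTUAL regardless of its content.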
As mentioned above, we annotated the answers to the FACTUAL questions selected from the Qatar Living forum. We targeted very high quality annotation, and thus we did not use crowd-sourcing, as a pilot experiment had shown that the task was very difficult and that it was not possible to guarantee that Turkers would do all the necessary verification and gather evidence from trusted sources. Instead, all examples were first annotated independently by three of us, and then we carefully discussed each example to come up with a final label. The distribution of the labels in the training, development, and test datasets is shown in Table 2. Although not very big, our dataset is larger than datasets used for similar problems, e.g., Ma et al. (2015) experimented with 226 rumors for rumor detection, and Popat et al. (2016) used 100 Wiki hoaxes for credibility assessment of textual claims.

Evaluation
Both subtasks are three-way classification problems. In subtask A, the questions were to be classified as FACTUAL, OPINION, or SOCIALIZING. Similarly, in subtask B there were also three target categories for the answers: FACTUAL - TRUE, FACTUAL - FALSE, and NON-FACTUAL.
We further scored the submissions based on Accuracy, macro-F1, and average recall (AvgRec). For subtask B, we also report mean average precision (MAP), where the FACTUAL - TRUE instances were considered to be positive, and the remaining ones were negative. The official evaluation measure for both subtasks was Accuracy.
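The measures above can be sketched in plain Python as follows. This is not the official scorer: the function names are hypothetical, zero-denominator cases are assumed to score 0, and the grouping of answers into ranked lists for MAP is left to the caller, since the text does not specify it.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def _precision_recall_f1(gold, pred, label):
    """Per-class precision, recall and F1 (0.0 when a denominator is zero)."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(_precision_recall_f1(gold, pred, l)[2] for l in labels) / len(labels)


def avg_recall(gold, pred, labels):
    """Unweighted mean of the per-class recall scores (AvgRec)."""
    return sum(_precision_recall_f1(gold, pred, l)[1] for l in labels) / len(labels)


def average_precision(ranked_labels, positive="FACTUAL - TRUE"):
    """Average precision of one ranked list, with `positive` as the relevant class."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_lists, positive="FACTUAL - TRUE"):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision(r, positive) for r in ranked_lists) / len(ranked_lists)
```

With four questions labelled FACTUAL, OPINION, SOCIALIZING, FACTUAL and predictions FACTUAL, OPINION, FACTUAL, OPINION, `accuracy` and `avg_recall` are both 0.5, while `macro_f1` is lower because the SOCIALIZING class is never predicted.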

Participants and Results
We received 17 official submissions for Subtask A and 11 official submissions for Subtask B. Below we report the evaluation results.
Table 3 presents the results for subtask A on question classification. The results are based on the official submissions in the evaluation phase. In this subtask, all of the submitted systems managed to improve over the majority class baseline, and several teams achieved similarly good results. Whenever a number of teams achieve the same result with respect to the main evaluation measure, i.e., Accuracy, we rank them according to the F1 score, and then by AvgRec if a tie still appears.
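The tie-breaking rule can be expressed as a sort on a tuple of scores. The sketch below is only an illustration: the team names and scores in the usage example are made up, and the dictionary keys are hypothetical.

```python
def rank_systems(results):
    """Sort by Accuracy, then F1, then AvgRec, all in descending order."""
    return sorted(
        results,
        key=lambda r: (r["accuracy"], r["f1"], r["avg_rec"]),
        reverse=True,
    )


# Made-up scores for hypothetical teams, used only to show the tie-breaking.
example = [
    {"team": "team_a", "accuracy": 0.80, "f1": 0.70, "avg_rec": 0.60},
    {"team": "team_b", "accuracy": 0.80, "f1": 0.70, "avg_rec": 0.65},
    {"team": "team_c", "accuracy": 0.80, "f1": 0.72, "avg_rec": 0.50},
    {"team": "team_d", "accuracy": 0.85, "f1": 0.60, "avg_rec": 0.55},
]
ranking = [r["team"] for r in rank_systems(example)]
```

Here `team_d` comes first on Accuracy alone, `team_c` wins the three-way Accuracy tie on F1, and `team_b` precedes `team_a` on AvgRec.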
Table 4 presents the results based on the evaluation phase on the test set for predicting answer factuality labels. This subtask was more difficult, as the majority class baseline was very high due to label imbalance. No team managed to improve over that baseline, but several teams had results that were very close to it.
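To see why label imbalance makes the majority class baseline hard to beat, consider the following minimal sketch. It assumes that the baseline always predicts the most frequent label of some reference set; whether that set is the training or the test data is not stated here, so it is left as a parameter. The label counts in the usage example are made up.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def majority_baseline_accuracy(reference_labels, test_labels):
    """Accuracy obtained by always predicting the majority class of the reference set."""
    majority = majority_class(reference_labels)
    return sum(label == majority for label in test_labels) / len(test_labels)


# Made-up, heavily unbalanced label distribution.
toy_test = ["NON-FACTUAL"] * 8 + ["FACTUAL - TRUE", "FACTUAL - FALSE"]
baseline = majority_baseline_accuracy(toy_test, toy_test)
```

On this toy distribution, the baseline already reaches 0.8 accuracy without looking at any answer text, so a system must classify the minority classes well just to match it.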

Discussion
In the evaluation phase of the competition, the participants had to specify one official submission and were allowed up to two contrastive submissions. In the post-evaluation phase, they could upload an unlimited number of contrastive submissions. Below, we will only discuss the official submissions. The contrastive submissions, the ablation studies, and the experiments with different techniques are described by the participants in their respective system description papers.

Team | Affiliation | Accuracy | F1 | AvgRec
Fermi (Syed et al., 2019) | IIIT Hyderabad, Microsoft, Teradata | 0.840 | 0.7182 | 0.7353
TMLab (Niewiński et al., 2019) | Samsung | | |
(Some teams did not submit system description papers, and thus we have no citations for their systems.)
The best system for Subtask A was by team Fermi (IIIT Hyderabad). They used Google's Universal Sentence Encoder (Cer et al., 2018) and XGBoost (Chen and Guestrin, 2016).
The best system for Subtask B was by team AUTOHOME-ORCA (Autohome Inc. and Beijing University of Posts and Telecommunications), who used BERT (Devlin et al., 2019).
They achieved their best results by using an ensemble, and by also using question meta-information (category and subject) in addition to the question and the answer text. They concatenated the category, the subject and the body of the question into the first part. The replier's username and statement were concatenated as the second part. The two parts, separated by [SEP], were fed into the BERT model for answer classification. Then, based on the sequential outputs of the BERT model, variants such as average pooling and a bi-LSTM were adopted to produce the final results.
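The input layout described above can be illustrated with a small string-building sketch. It is an assumption-laden illustration rather than the team's code: the function name is hypothetical, the fields within each part are simply joined with spaces, and the special tokens are written out literally, whereas a real BERT tokenizer would normally add them itself.

```python
def build_bert_input(category, subject, body, username, answer):
    """Build a two-part input: question fields first, replier and answer second."""
    # Assumption: fields inside one part are joined by a single space.
    first_part = " ".join([category, subject, body])
    second_part = " ".join([username, answer])
    # [CLS] and [SEP] are written explicitly only to make the layout visible.
    return f"[CLS] {first_part} [SEP] {second_part} [SEP]"


example_input = build_bert_input(
    "Moving to Qatar",
    "Family visa",
    "Can I bring my wife on a family visa?",
    "some_user",
    "Yes you can.",
)
```

Keeping the question meta-information in the first segment lets the model attend from the answer tokens to the category and subject as well as to the question body.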
To tackle the problem with insufficient training data, they further used data augmentation based on translation with Google Translate: in particular, they performed consecutive English-Chinese and Chinese-English translation to generate more synthetic training data.
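Round-trip translation of this kind can be sketched independently of any particular translation service. In the sketch below, `translate` is a hypothetical callable supplied by the caller (text, source language, target language); no real translation API is assumed or called, and the language codes are placeholders.

```python
from typing import Callable, List

# Hypothetical signature: (text, source_language, target_language) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   source: str = "en", pivot: str = "zh") -> str:
    """Translate to the pivot language and back to obtain a paraphrase."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)


def augment(texts: List[str], translate: Translator) -> List[str]:
    """Return the original texts plus any back-translations that differ from them."""
    augmented = list(texts)
    for text in texts:
        paraphrase = back_translate(text, translate)
        if paraphrase != text:
            augmented.append(paraphrase)
    return augmented
```

Each synthetic example would keep the label of the text it was generated from, which is what makes the extra data usable for supervised training.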
Overall, the submitted systems for the two subtasks used a number of pre-processing steps to clean the text of the question and of the answer. As shown by the DOMLIN team, the pre-processing of the data turns out to be crucial. They reported up to 5% improvement in terms of accuracy when cleaning the unannotated forum data before fine-tuning a BERT model. Common pre-processing steps included removing or replacing URLs, numbers, punctuation, symbols, and HTML tags, as well as spell-checking and expansion of contractions. DUTH also used lemmatization and stopword removal.
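A few of these cleaning steps can be sketched with the Python standard library. This is a generic illustration, not any team's pipeline: the placeholder tokens `<url>` and `<num>`, the tiny contraction table, and the order of the steps are all assumptions, and spell-checking is omitted.

```python
import html
import re

# Tiny illustrative table; a real system would use a much larger one.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}


def clean_text(text: str) -> str:
    """Strip HTML, expand contractions, and replace URLs and numbers with placeholders."""
    text = html.unescape(text)                        # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML tags
    text = text.lower()
    for short, full in CONTRACTIONS.items():          # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", " <url> ", text)   # replace URLs
    text = re.sub(r"\d+(?:\.\d+)?", " <num> ", text)  # replace numbers
    text = re.sub(r"[^\w\s<>]", " ", text)            # remove punctuation and symbols
    return " ".join(text.split())                     # normalise whitespace


cleaned = clean_text("<p>Don't pay 500 QR! See http://example.com/a?b=1</p>")
```

HTML tags are removed before the placeholders are inserted, so that the angle brackets of `<url>` and `<num>` are not themselves stripped.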
The submitted systems used a wide range of strategies for training their models. A sizable part of the systems used manually crafted features such as linguistic, syntactic, stylistic, and semantic features. Moreover, the systems used task-specific information such as answer ranking and rating. ColumbiaNLP also computed an average cosine similarity of one answer with respect to the other answers in the thread for subtask B, assuming that bad answers would differ substantially from the remaining answers.
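The thread-level similarity feature can be illustrated as follows. The text does not say which vector representation ColumbiaNLP used, so this sketch assumes simple bag-of-words counts over whitespace tokens; the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def avg_similarity_to_thread(answers):
    """For each answer, the mean cosine similarity to all other answers in the thread."""
    vectors = [Counter(answer.lower().split()) for answer in answers]
    scores = []
    for i, vector in enumerate(vectors):
        others = [cosine(vector, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


thread = ["visa takes two weeks", "visa takes two weeks", "buy a new car"]
similarities = avg_similarity_to_thread(thread)
```

In this toy thread, the off-topic third answer gets a score of 0, while the two matching answers each score 0.5, which is the kind of contrast the feature is meant to capture.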
While some of the approaches used character and word n-gram information, the teams also used word- and sentence-level embeddings. CodeForTheChange evaluated different classification algorithms fed with Skip-Thought vectors, and ultimately found that neural networks performed best for both subtasks with either concatenation or averaging over the vectors of the available texts. Fermi evaluated different embedding models (InferSent, Concatenated Power Mean Word Embedding, Lexical Vectors, ELMo, and the Universal Sentence Encoder), which were used in subtask A to feed an XGBoost classifier. ColumbiaNLP used ULMFiT, but performed additional unsupervised tuning of the language model on questions, answers and question-answer pairs from the Qatar Living Forum. TMLab's system used the Universal Sentence Encoder.
A common neural network architecture was LSTM, where YNU-HPCC combined LSTM with an attention mechanism. TueFact used comment chain embeddings. Other machine learning algorithms that participants tried include Random Forest, Adaboost, Perceptron, and SVM, inter alia.
While for question classification (subtask A), all the necessary information was contained in the question text and in the metadata, subtask B required additional resources. Most teams used the provided additional unannotated forum data in order to pre-train their language models or to extract more data with weak supervision (DOMLIN). Furthermore, several teams used other means for data augmentation such as SQuAD (BLCU NLP) or external Web information (SolomonLab).

Conclusion
We have described SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums. We received 17 and 11 submissions for Subtask A and B, respectively. Overall, Subtask A (question classification) was easier and all submitted systems managed to improve over the majority class baseline. However, Subtask B (answer classification) proved to be much more challenging, and no team managed to improve over the majority class baseline, even though several teams came very close. For this latter subtask, using external resources and pre-processing proved to be crucial.
Research Institute (QCRI), HBKU and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

Figure 1: Example from the Qatar Living forum.
modeled user reputation. Canini et al. (2011) analyzed the interaction of content and social network structure. Morris et al. (2012) studied how Twitter users judge truthfulness. Lukasik et al. (2015) used temporal patterns to detect rumors, and Zubiaga et al. (2016) focused on conversations.

Table 1: Subtask A: Distribution of the factuality labels for the questions.

Table 2: Subtask B: Distribution of the factuality labels for the answers.
* FACTUAL - PARTIALLY TRUE: The answer contains more than one claim, and only some of these claims could be manually verified. (Q: "I will be relocating from the UK to Qatar [...] is there a league or TT clubs / nights in Doha?"; A: "Visit Qatar Bowling Center during thursday and friday and you'll find people playing TT there."). The place mentioned in the answer has table tennis, but we do not know on which days.
(Q: "I wanted to know if there were any specific shots and vaccinations I should get before coming over [to Doha]."; A: "Yes
* FACTUAL - CONDITIONALLY TRUE: The answer is True in some cases, and False in others, depending on some conditions that the answer does not mention. (Q: "My wife does not have NOC from Qatar Airways; but we are married now so can i bring her legally on my family visa as her husband?"; A: "Yes you can.").
* FACTUAL - RESPONDER UNSURE: The person giving the answer is not sure about the veracity of his/her statement. (e.g., "Possible only if government employed. That's what I heard.")
* NON-FACTUAL: When the answer does not provide factual information to the question; it can be an opinion or an advice that cannot be verified. (e.g., "Its better to buy a new one.").

Table 3: Subtask A: Results for question classification based on the official submissions, evaluated on the test set.

Table 4: Subtask B: Results for answer classification based on the official submissions, evaluated on the test set.