SolomonLab at SemEval-2019 Task 8: Question Factuality and Answer Veracity Prediction in Community Forums

We describe our system for SemEval-2019, Task 8 on “Fact-Checking in Community Question Answering Forums (cQA)”. cQA forums are very prevalent nowadays, as they provide an effective means for communities to share knowledge. Unfortunately, this shared information is not always factual and fact-verified. In this task, we aim to identify factual questions posted on cQA and verify the veracity of answers to these questions. Our approach relies on data augmentation and aggregates cues from several dimensions such as semantics, linguistics, syntax, writing style and evidence obtained from trusted external sources. In subtask A, our submission is ranked 3rd, with an accuracy of 83.14%. Our current best solution stands 1st on the leaderboard with 88% accuracy. In subtask B, our present solution is ranked 2nd, with 58.33% MAP score.


Introduction
With the rising popularity of online community question answering (cQA) systems such as Quora, StackOverflow, and Qatar Living forum (QLF), the amount of information shared over these platforms is increasing rapidly. These forums provide an effective information-sharing mechanism to their users, who can seek answers to their queries as well as post answers to questions. However, the information shared on such platforms may not always be factual and correct. The responders may misunderstand the question being asked or merely ignore certain specific details. At times, the information shared may even be false or ambiguous in the desired context. This is aggravated by the lack of moderation and systematic control on cQA forums. The SemEval-2019 Task 8 on "Fact Checking in Community Question Answering Forums" aims to solve this real-life problem.
The task explores the veracity of answers to questions posted on QLF. While preceding SemEval tasks (Nakov et al., 2015, 2016, 2017) address the issue of ranking answers according to their relevance to a question, the task at hand is the first one to consider the correctness of answers. This task is formulated as a two-stage problem. The first stage aims to identify the user posts asking for factual information. The answers to the identified factual questions are then fact-verified in the second stage. Both subtasks are designed as 3-class supervised classification problems.
More specifically, the first stage, subtask A, addresses the problem of determining whether the posted question asks for factual information, asks for an opinion/advice, or is just meant for socializing. For example, "what is Ooredoo customer service number?" asks for factual information, whereas "What was your first car?" is socializing and "which is the best bank around?" is seeking guidance/opinion. Each data sample in subtask A is a question posted by a user, consisting of a subject, a body and meta information (user ID, username, and the category of the question, e.g., "Education," "Visa and Permits", "Welcome to Qatar", etc.).
The second stage, subtask B, focuses on determining whether an answer to a factual question is true, false or does not constitute a proper answer, in which case it is labeled as non-factual. For example, for the question "Can I bring my pitbulls to Qatar?", Answer A1: "Yes, you can bring it but be careful this kind of dog is very dangerous" is factual-false, Answer A2: "No, you cannot as they are banned" is factual-true, and Answer A3: "There goes another job opportunity for the sake of two lovely animals." is non-factual. The data is organized as question-answer tuples: a question posted by a user and an answer (body, username and answer ID) posted by the same or another user. All the questions in this subtask are ensured to be factual questions.
Our approach to solving this task is based on extracting a rich feature representation from the input and training a classifier to make predictions. The feature representation integrates knowledge from various complementary sources, such as the question/answer content, the content of other answers in the thread, evidence from trustworthy external sources of information, and the relevance of an answer to the question. For subtask A, we rely on question content (semantic, linguistic and syntactic cues), whereas the evidence from external sources and answer relevancy to the question are essential aspects for subtask B. For both subtasks, we also leverage a data augmentation approach, which improves the generalization of the learned classifier to unseen test data and mitigates the class imbalance present in the training data.
The rest of the paper is organized as follows: Section 2 gives an overview of our system. Sections 3-5 describe the details of our approach. Section 6 presents the experimental results. We conclude in Section 7.

System Overview
Our proposed system relies primarily on the following key components: (i) data augmentation (DA), (ii) pre-processing of question/answer content, and (iii) feature extraction from multi-faceted sources. As depicted in Figure 1, following DA and preprocessing of the question, our system for subtask A extracts semantic (what is said), linguistic (how it is said), syntactic (how it is structured) and writing style based features (how it is depicted) from the processed question. These extracted features are then combined to train a classifier for label prediction.
Subtask B also leverages DA and preprocessing as its first key steps. However, apart from the features extracted for subtask A (as mentioned above), it also utilizes external evidence and forum-level features (Figure 2). The external evidence is collected from trusted sources using a search engine. The forum-level features capture the relevancy of an answer to the question and its similarity to other answers in the same thread.

Data Augmentation (DA)
Data augmentation (DA) is one of the main components of our proposed system and resulted in significant performance gains. For both subtasks, the training data is imbalanced. This motivated us to look for ways to balance the distribution of data samples across classes and, at the same time, incorporate adversarial examples which are plausible in real scenarios but are not present in the training data. We discuss the DA details for both subtasks in the following subsections.

Subtask A
In the training data for subtask A, the "opinion" class (563 samples) has roughly twice as many samples as the "factual" (311) or "socializing" (244) class. To balance the class distribution, we oversampled both non-majority classes using domain knowledge.
| Class | Question Body |
| --- | --- |
| Factual | Can someone please tell me where can i find Garlic Oil in Qatar? i heard it is good for hairfall. dont know if its true or not but really want to try it. So help me guys! |
| Opinion | Is it right to resign from your job at this time of global crisis? the reason is i'm not doing anything in the office. I feel useless; but I'm hesistant to resign because of the condition today even that I'm on husband sponsor. |
| Socializing | Is this a beginning of a mutual friendship between Christianity and Islam in Qatar? I hope they're going to sell some Bibles in Villagio coz I can't find somebody sellin' it around here. |

Table 1: Example for query-sentence selection. The highlighted text is considered as the query-sentence.

For the "factual" class, we leveraged the questions asked in subtask B. In subtask B, by its formulation, one is supposed to verify the veracity of answers to "factual questions." Thus, we used the training, development and test sets of subtask B to augment the training data for subtask A ("factual" class instances). This way, we extracted a total of 91 distinct factual questions. For the "socializing" class, we utilized the QL-unannotated-data to select samples from categories ("Funnies," "Good News Everyone," "Party on my mind," "Recipes," "Press Releases") that are assured to contain only socializing content. In these categories, the users are just trying to make conversation or share anecdotes. As the number of such samples is considerably large, we draw 320 of them (using reservoir sampling (Vitter, 1985)) to balance the distribution across classes in the original training data.
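
The reservoir sampling step can be illustrated with a minimal sketch of the classic single-pass scheme (often called Algorithm R). The function name `reservoir_sample` and the `seed` parameter are illustrative assumptions, not part of the described system; the paper does not state which variant of the algorithm was used.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: pick 320 posts from a large pool of candidate post IDs.
sample = reservoir_sample(range(10_000), 320, seed=0)
```

The appeal of this scheme is that the pool of socializing posts never has to be held in memory or counted in advance.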

Subtask B
For subtask B, we consider an adversarial setting closely related to the problem at hand. As mentioned before, each data sample in this subtask is a question-answer tuple, and the answer can be either "true", "false" or "non-factual." A related task was presented in SemEval-2016 Task 3, "Answer Selection in cQA" (Nakov et al., 2016), where the objective was to re-rank the answers based on their relevancy to the question. In that task, replies such as follow-up questions from other users, clarifications, and acknowledgments from the asker were categorized as "Bad" answers. Although the organizers have omitted such answers in the task at hand, they will be present in real-life scenarios and should be categorized as "non-factual" in our current problem setting. Thus, to include such samples, we extract factual questions from the training data of subtask A. For each of these questions, we select "bad answers" from the data provided in the SemEval-2016 task. The chosen question-answer pair is then annotated as "non-factual" and added to the training data of subtask B.

Preprocessing
Before feature extraction, we pre-process the input question/answer in several steps. We expand contractions and terms commonly used on social media platforms, such as "i'm" to "i am", "i'd" to "i would", "pls" to "please", "nt" to "not", and "thru" to "through". Furthermore, we use several markers such as URLs, images, emoticons, and punctuation marks in the question/answer to extract writing style and syntactic features (described in Section 5.3). For semantic and linguistic features, we strip these markers.
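
A minimal sketch of the expansion and stripping steps is given below. The mapping contains only the five pairs listed above; the full lexicon, the tokenization rule, and the helper names `expand_terms` and `strip_urls` are assumptions made for illustration.

```python
import re

# Only these five pairs are given in the text; the real lexicon is larger.
EXPANSIONS = {
    "i'm": "i am",
    "i'd": "i would",
    "pls": "please",
    "nt": "not",
    "thru": "through",
}

# A word, optionally followed by an apostrophe suffix (e.g. "i'm").
_TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
_URL = re.compile(r"https?://\S+")


def expand_terms(text: str) -> str:
    """Replace whole tokens found in EXPANSIONS; leave everything else as is."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        return EXPANSIONS.get(token.lower(), token)

    return _TOKEN.sub(repl, text)


def strip_urls(text: str) -> str:
    """Drop URLs and collapse whitespace (one of the markers stripped for semantic features)."""
    return re.sub(r"\s+", " ", _URL.sub(" ", text)).strip()


cleaned = expand_terms(strip_urls("pls help, i'm new here http://example.com"))
```

Matching whole tokens rather than substrings keeps a short key such as "nt" from firing inside ordinary words like "went".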

Query Sentence Extraction
We observed empirically that the body of a question posted by a user usually contains several sentences. However, among all these sentences, only one or two convey the query the user really wants to ask. The user may also post the question in the question subject itself. Thus, we extract these "query-sentences" from the question body and subject and use them to extract linguistic and semantic features. An example of the query-sentence and original question posted by the user for each of the three classes corresponding to subtask A is depicted in Table 1.
In order to extract the query-sentence, we parse each sentence in the question using the Stanford CoreNLP constituency parser (Manning et al., 2014). A sentence is considered a query-sentence if its parse tree has SBARQ/SQ constituent phrases. We also use some common heuristics, such as whether the sentence ends with a question mark or starts with common "wh" words (what, why, how, etc.).
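
The selection rule can be sketched as follows. The sketch assumes the constituency parse is already available as a bracketed string (as a parser such as CoreNLP can output) and does not call the parser itself. The function name `is_query_sentence` is hypothetical, and the "wh" list beyond what, why and how is an assumption.

```python
import re
from typing import Optional

# "what", "why", "how" come from the text; the remaining words are assumed.
WH_WORDS = ("what", "why", "how", "when", "where", "which", "who")

# Matches an SBARQ or SQ node label in a bracketed parse, but not SBAR or S.
_QUESTION_LABEL = re.compile(r"\((SBARQ|SQ)\b")


def is_query_sentence(sentence: str, parse: Optional[str] = None) -> bool:
    """Return True if the sentence looks like the user's actual query."""
    if parse is not None and _QUESTION_LABEL.search(parse):
        return True
    stripped = sentence.strip()
    if stripped.endswith("?"):
        return True
    words = stripped.lower().split()
    return bool(words) and words[0] in WH_WORDS


sentences = [
    "i heard it is good for hairfall.",
    "Can someone please tell me where can i find Garlic Oil in Qatar?",
]
query_sentences = [s for s in sentences if is_query_sentence(s)]
```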

Modeling Content : Feature Extraction
We use a rich feature representation to model the information conveyed in a question/answer. In the subsequent subsections, we describe each of these features in detail.

Semantic Sentence Embedding
Following the pre-processing step, we compute a semantic sentence embedding for the query-sentence using two approaches. The first approach utilizes the universal sentence encoder (USE) (Cer et al., 2018). It is known to perform well with minimal amounts of supervised training data for a downstream task, which is precisely our setting for both subtasks. The second approach uses pre-trained word embeddings (GloVe) (Pennington et al., 2014), averaged over the words in a sentence to compute a sentence-level embedding.
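
The averaging approach can be sketched as below. The toy two-dimensional vectors stand in for real pre-trained GloVe vectors, and skipping out-of-vocabulary words is an assumption; the paper does not say how such words are handled.

```python
from typing import Dict, List, Sequence


def average_embedding(
    tokens: Sequence[str],
    vectors: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Average the word vectors of the in-vocabulary tokens of a sentence."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = vectors.get(token.lower())
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [value / count for value in total]


# Toy vectors for illustration only.
toy_vectors = {"visa": [1.0, 0.0], "qatar": [0.0, 1.0]}
sentence_vec = average_embedding(["Visa", "for", "Qatar"], toy_vectors, dim=2)
```

Because the mean ignores token order, this representation cannot distinguish sentences that use the same words in a different arrangement.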

Linguistic Features
Often, forum users exhibit linguistic cues in writing questions and answers. For example, they may use subjectively loaded words such as 'awesome' or 'worst' while asking for an opinion rather than factual information. While answering on the forum, they may indicate their degree of confidence in the truthfulness of what they say by using words like "most likely", "probably", or "think". We therefore use linguistic markers such as hedges (Hyland, 2018), weasels (Vincze, 2013), factives (Hooper, 1974), assertives (Hooper, 1974), implicatives (Karttunen, 1971), mood, modality, subjectivity and sentiment (the latter four via https://www.clips.uantwerpen.be/pages/pattern-en), and polarity of subjective words (Riloff and Wiebe, 2003), based on the respective lexicons, to compute a feature vector. For details, see Mihaylova et al. (2018).
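
A lexicon-based feature vector of this kind can be sketched as simple per-lexicon match counts. The two tiny lexicons below contain only the example words quoted above and are placeholders for the published lexicons; whether the system uses counts, binary indicators or normalized scores is not stated, so counting is an assumption.

```python
import re
from typing import Dict, List

# Placeholder lexicons built from the example words in the text.
TOY_LEXICONS: Dict[str, List[str]] = {
    "hedges": ["most likely", "probably", "think"],
    "subjective": ["awesome", "worst"],
}


def lexicon_features(text: str, lexicons: Dict[str, List[str]] = TOY_LEXICONS) -> Dict[str, int]:
    """Count whole-word (or whole-phrase) matches for every lexicon."""
    lowered = text.lower()
    features = {}
    for name, entries in lexicons.items():
        features[name] = sum(
            len(re.findall(r"\b" + re.escape(entry) + r"\b", lowered))
            for entry in entries
        )
    return features


feats = lexicon_features("I think it is probably the worst bank")
```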

Writing Style Features
We extract writing style features from the question/answer, which capture the format of a user post. A socializing question is more likely to be written informally than a factual/opinion query. A non-factual answer that is not very informative may also carry distinctive cues. To capture these aspects, we count the number of punctuation marks, emoticons, and non-ASCII characters, and check for the presence of URLs, images, ALL CAPS words, and consecutive character repetition (≥ 3 times). Table 2 depicts how the number of samples exhibiting a particular writing style feature varies across the three classes in subtask A. A similar trend is present for factual (true/false) versus non-factual answers in subtask B.
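
These counts and presence checks can be sketched with a few regular expressions. The emoticon pattern covers only a handful of common ASCII emoticons, the image check is omitted because the markup used for images in the data is not described, and all names are illustrative assumptions.

```python
import re
import string
from typing import Dict, Union

_URL = re.compile(r"https?://\S+|www\.\S+")
_REPEAT = re.compile(r"(.)\1{2,}")           # same character 3 or more times in a row
_ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")     # a fully upper-case word of length >= 2
_EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")  # assumed small emoticon set


def writing_style_features(text: str) -> Dict[str, Union[int, bool]]:
    """Compute simple surface-form features of a post."""
    no_urls = _URL.sub(" ", text)  # avoid counting punctuation inside URLs
    return {
        "n_punct": sum(ch in string.punctuation for ch in no_urls),
        "n_emoticons": len(_EMOTICON.findall(no_urls)),
        "n_non_ascii": sum(ord(ch) > 127 for ch in text),
        "has_url": bool(_URL.search(text)),
        "has_all_caps": bool(_ALL_CAPS.search(no_urls)),
        "has_char_repeat": bool(_REPEAT.search(no_urls)),
    }


style = writing_style_features("HELP meee!!! :) see http://example.com")
```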

Syntactic Features
We also examine syntactic features such as part-of-speech (POS) tags and the category of the question, encoded as bag-of-words features. Further, we consider the expected answer type for a question (QType) and the named-entity type (NET) in an answer.
QType suggests the kind of information the question is seeking, such as "description", "entity", "human", "location", "number", "yes/no" and "others" (extracted using the work of Madabushi and Lee (2016)). Such features help segregate the socializing class in subtask A. For subtask B, we exploit the relation between the type of information the user asks for (QType) and the type of information provided in the answer (NET).
To capture this, we extract the types of all named-entity mentions in the answer. We consider "person", "organization", "location" and "quantity" as possible NE tags, extracted using spaCy.

External Evidence
In subtask B, verifying an answer requires external evidence to draw a conclusion about its veracity. We extract external evidence by formulating a search-query from the question and answer, followed by a web search for this search-query. For each of the obtained search results, we compute its similarity with the question and the answer, respectively. These similarity scores are then used as features for a classifier.
Search-Query Formulation. In order to search the web for relevant evidence, we formulate a search-query based on the question and answer. We extract the query-sentence from the question posted by a user and append "Qatar" if neither 'Doha' nor 'Qatar' is present in it.
Further, to incorporate relevant information from the answer into this search-query, we find the answer sentence that has the highest similarity with the query-sentence. From this top-ranked sentence, we extract up to 7 keywords based on named entities, noun chunks and unigrams sorted by tf-idf scores, where named entities and noun chunks are given high priority. The query-sentence combined with the keywords from the answer is used as the search-query.
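
The formulation step can be sketched as below, assuming the named entities, noun chunks and tf-idf-sorted unigrams of the top-ranked answer sentence have already been extracted. The function name `build_search_query`, the de-duplication of keywords, and the exact priority order (entities, then noun chunks, then unigrams) are assumptions consistent with, but not confirmed by, the description.

```python
from typing import List, Sequence


def build_search_query(
    query_sentence: str,
    named_entities: Sequence[str],
    noun_chunks: Sequence[str],
    tfidf_unigrams: Sequence[str],
    max_keywords: int = 7,
) -> str:
    """Combine the query-sentence with up to max_keywords answer keywords."""
    query = query_sentence.strip()
    lowered = query.lower()
    # Anchor the search geographically if the question does not do so itself.
    if "qatar" not in lowered and "doha" not in lowered:
        query += " Qatar"

    keywords: List[str] = []
    seen = set()
    # Candidates are visited in priority order.
    for candidate in [*named_entities, *noun_chunks, *tfidf_unigrams]:
        if len(keywords) >= max_keywords:
            break
        key = candidate.lower()
        if key in seen:
            continue  # assumption: repeated keywords are used once
        seen.add(key)
        keywords.append(candidate)
    return " ".join([query, *keywords])


search_query = build_search_query(
    "what is Ooredoo customer service number?",
    named_entities=["Ooredoo"],
    noun_chunks=["customer service"],
    tfidf_unigrams=["number", "call"],
)
```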

Search Results
We collect search results (snippets) from reputed sources (e.g., news, government websites, official sites of companies) (Mihaylova et al., 2018) for the search-query formulated as above. Since the search-query may not always be perfect, we also obtain search results by dropping a few keywords from the search-query. From all the obtained search results, we select the snippets that are most relevant to the question and the answer. Table 3 illustrates the external evidence retrieved for two question-answer pairs.
Similarity-based Features. For each question-answer pair, we compute its similarity with the external search results obtained above. We use three similarity metrics: containment of unigrams, bigrams and trigrams (Lyon et al., 2001), cosine similarity of USE embeddings, and cosine similarity of tf-idf representations. For each metric, we compute the similarity of the snippet with the question, the answer, the query-sentence + top answer sentence, and all of them together. We then take the average and maximum of the similarity scores over all the search results.
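
The containment metric and the average/maximum aggregation can be sketched as follows. Containment is taken here as the fraction of the text's n-grams that also occur in the snippet; the direction of normalization and the whitespace tokenization are assumptions, and the `cosine` helper operates on any pair of vectors (for example USE or tf-idf vectors computed elsewhere).

```python
import math
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(text: str, snippet: str, n: int) -> float:
    """Share of the text's n-grams that are also found in the snippet."""
    a = ngrams(text.lower().split(), n)
    b = ngrams(snippet.lower().split(), n)
    return len(a & b) / len(a) if a else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def aggregate(scores: List[float]) -> Tuple[float, float]:
    """Average and maximum of the scores over all search results."""
    if not scores:
        return 0.0, 0.0
    return sum(scores) / len(scores), max(scores)


snippets = ["pitbulls are banned in qatar", "dogs need a permit"]
scores = [containment("pitbulls are banned", s, 2) for s in snippets]
avg_score, max_score = aggregate(scores)
```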

Forum-Level Features
These features capture the relevance of an answer with respect to the question as well as to other answers. An answer which contains information similar to that given in other answers is more likely to be relevant and trustworthy. Thus, we consider the similarity of the answer with the question as well as its similarity with other answers in that thread. Here, too, we use all three similarity metrics mentioned before.

Setting and Evaluation Metrics
We now utilize all features as portrayed in Figure 1 for subtask A and Figure 2 for subtask B. We train two separate SVM classifiers (Burges, 1998), one per subtask, on the respective features for 3-class classification. We use 10-fold cross-validation for hyper-parameter tuning of the SVM, based on which we choose a "linear" kernel with C=0.5 (regularization parameter) for all the reported experiments. All the results are reported on the test data with accuracy, recall, and F1 measure as evaluation metrics.
Additionally, we calculate Mean Average Precision (MAP) for subtask B, where the 'True' instances are considered relevant examples (in the context of Information Retrieval). MAP measures the capability of the system to predict 'True' instances with higher confidence.
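
For reference, MAP with 'True' answers as the relevant items can be computed as in the sketch below. It assumes that answers are ranked per question thread by the classifier's confidence for the 'True' class, and that a thread without any 'True' answer contributes an average precision of 0; the official scorer may treat such threads differently.

```python
from typing import List, Sequence, Tuple


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision of one ranked list (True marks a relevant item)."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(threads: List[List[Tuple[float, bool]]]) -> float:
    """MAP over threads; each thread is a list of (confidence, is_true) pairs."""
    aps = []
    for answers in threads:
        ranked = sorted(answers, key=lambda pair: pair[0], reverse=True)
        aps.append(average_precision([is_true for _, is_true in ranked]))
    return sum(aps) / len(aps) if aps else 0.0


map_score = mean_average_precision([
    [(0.9, True), (0.1, False)],   # 'True' answer ranked first: AP = 1.0
    [(0.8, False), (0.3, True)],   # 'True' answer ranked second: AP = 0.5
])
```

Since only the ranking of 'True' answers matters, a system can score well on accuracy and still poorly on MAP, which is why both are reported.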

Results for Subtask A
Table 4 shows the performance of the proposed system (PS). From the results, we observe that our PS (excluding syntactic features) achieves an accuracy of 84.12% and an F1 of 72.17%. Our submission (with all the features in PS plus POS and QType) achieved similar performance and ranked 3rd on the leaderboard (83% acc., 71% F1), with only a marginal difference with respect to the first-ranked (84% acc., 72% F1) and second-ranked (83% acc., 72% F1) systems. To push our system's performance further, we experimented in the post-evaluation phase and achieved 88.10% accuracy and 77.37% F1, the highest on the post-evaluation leaderboard. This current best solution leverages QType features and extensive data augmentation using a bagging technique, and excludes writing style features.
In order to appraise the importance of each feature, we conducted an ablation study by analyzing individual features and their combinations. The results show that the semantic embedding contributes most to the performance of the system. Moreover, embeddings derived using USE perform better than GloVe embeddings. This difference is possibly due to the failure of GloVe-based averaged embeddings to capture word order and long-range dependencies.
The second most important contributor is the data augmentation approach, which resulted in notable accuracy gains (3.71% improvement). As expected, it allows the system to generalize better and mitigates the issue of class imbalance. Following it is the query-sentence extraction approach, with ∼1% accuracy enhancement. This is in line with expectations, as query-sentences are sufficient to capture the essence of the user's question. QType and linguistic cues help improve the performance further. However, we notice that the performance improves when the writing style features are excluded. A possible reason for this observation is the absence of such features in the test data. In the training and development datasets, the presence/absence of these features was a distinguishing factor among classes (see Table 2), which made them worth considering.

Results for Subtask B
Table 4 shows the performance of PS with the ablation study. Our PS (also our best) achieves an accuracy of 77.85% with 58.33 MAP (highest among all the participating systems). From the ablation study, we observe that although removing external evidence results in a slight accuracy gain (0.86%), it causes a drastic reduction in MAP score (33.33 points). This signifies the importance of external evidence, as these features enable the system to make better predictions for the true/false classes.
We also conducted a majority baseline experiment where all the samples are labeled as "non-factual." This experiment resulted in the best performance on the leaderboard with 83% accuracy and very poor MAP. This illustrates that the test data has a majority of non-factual instances. Thus, measuring the performance of any system solely on accuracy for this problem is not fair.
As can be inferred from the ablation study, among all the features, reputed-source-based search-result selection (contributing a 4.95% acc. gain) and forum-level features (contributing a 2.37% acc. gain) are the most important. Reputed source selection restricts external evidence to trusted sources and hence helps make better predictions for the true/false classes. Forum-level features help in distinguishing non-factual from true/false samples.
Further, data augmentation in subtask B results in a significant performance gain of 10.11% accuracy. It helps the system learn the characteristics of "bad answers", which are not present in the training data, and hence enables the system to generalize better on the test data. For this subtask as well, semantic embedding plays a vital role in capturing the essence of the question-answer pair, contributing a 3.87% gain in accuracy.

Conclusion
In this work, we have described our system for SemEval-2019 Task 8 on Fact-Checking in cQA Forums. Our system leverages data augmentation and integrates knowledge from various aspects, such as semantics, linguistics, syntax and writing style, along with complementary information from trustworthy external sources and QLF.
Our submission was ranked third in subtask A, with marginal performance differences compared to the best-ranked systems. Our current best solution is ranked first on the leaderboard with 88% accuracy. In subtask B, our current best solution is ranked 2nd, with 58.33% MAP score, highest among all participating systems.
However, none of the participating systems could beat the majority baseline for subtask B in terms of accuracy, which signifies that we are still far from solving this task in its entirety with decent performance. Thus, there remains a lot of potential in this research direction.

Figure 1: System Overview for Subtask A

Figure 2: System Overview for Subtask B

Table 3: Example of external evidence collected in Subtask B