CodeForTheChange at SemEval-2019 Task 8: Skip-Thoughts for Fact Checking in Community Question Answering

The strengths of XGBoost, a scalable gradient tree boosting algorithm, and Skip-Thought Vectors, a distributed sentence encoder, have not yet been explored by the cQA research community. We apply and combine these two effective methods to determine the factual nature of questions and answers. The work also includes experiments with popular classifier models such as AdaBoost Classifier, DecisionTree Classifier, RandomForest Classifier, ExtraTrees Classifier, XGBoost Classifier and Multi-layer Neural Network. In this paper, we present the features used, the approaches followed for feature engineering, the models experimented with and, finally, the results.


Introduction
Community Question Answering (cQA) forums such as Quora, StackOverflow, Yahoo! Answers and Qatar Living are nowadays a fast and effective means of getting answers to any question. However, the answers are not always correct or factual. Over the last few years, cQA research has revolved around models that predict the best answer to a question, given the question and a number of candidate answers (possibly hundreds or even thousands).
cQA has been a constant focus of the SemEval organizers since 2015. The subtasks targeted earlier include (i) classifying the answer to a particular question as good, potentially good or bad in 2015 (http://alt.qcri.org/semeval2015/task3/), (ii) three reranking subtasks, i.e., Question-Comment Similarity, Question-Question Similarity and Question-External Comment Similarity, in 2016 and (iii) Question Similarity (QS) to detect duplicate questions and Relevance Classification (RC) in 2017. In contrast to the earlier SemEval tasks, which focused mainly on classification and similarity of questions, answers and comments, SemEval-2019 targets the factuality of the questions (whether the question is factual or not) and the factuality of the answers (whether the answers provided to the factual questions are factual or not). The tasks are made more challenging by noisy (like !!!) and unstructured (like Oh..) words in the data.
SemEval-2019 Task 8 features the following two subtasks:
Subtask A (Question Classification): determine whether a question asks for factual information, asks for an opinion/advice or is just socializing. An example from the "Qatar Living" forum, given on the competition page for this subtask, is as follows:
Q: I have heard its not possible to extend visit visa more than 6 months? Can U please answer me.. Thankzzz...
answer 1: Maximum period is 9 Months....
answer 2: 6 months maximum
answer 3: This has been answered in QL so many times. Please do search for information regarding this. BTW answer is 6 months.
This subtask aims at building models to detect true factual information in cQA forums.
Subtask B (Answer Classification): determine whether an answer to a factual question is true, false, or does not constitute a proper answer. Given a factual question, this subtask aims at building models that classify its answers into these three categories.
We participated in both subtasks of SemEval-2019 Task 8. For a detailed description of the task, the approaches used by other participants and the results obtained by all participants, please refer to the task description paper (Mihaylova et al., 2019).
The rest of the paper is organized as follows: Section 2 describes the related work. Section 3 describes the data used for this SemEval task. Sections 4 and 5 elucidate the system architecture (feature extraction and model building) and experimentation details (along with the results) respectively. Section 6 concludes the paper with focus on future research on this task.

Related Work
Another task related to cQA is Fact Checking in Community Forums (Mihaylova et al., 2018). That work does not classify questions or answers by factuality; instead, it determines the veracity of an answer given a particular question. It is related to our task in that the data used in our task was annotated and released to the research community by Tsvetomila Mihaylova and her team.
Because this research problem is relatively new, the strengths of XGBoost (Chen and Guestrin, 2016), a scalable gradient tree boosting algorithm, and Skip-Thought vectors (Kiros et al., 2015), a distributed sentence encoder, have not yet been explored for it. We apply and combine these two effective methods to determine the factual nature of questions and answers.

Data Description
The data for both Question Classification (Subtask A) and Answer Classification (Subtask B) is organized into train, dev and test sets. The number of samples in each of these sets is shown in Table 1. For Question Classification, the data contains both a subject and a body for each question. Similarly, for Answer Classification, the data contains the question subject, the question body and an answer (as comment text). The data for both subtasks also includes meta-data such as user information and the date and time of the question and answer posts. A detailed description of the data can be found in the task description paper (Mihaylova et al., 2019).

Data pre-processing
We applied basic preprocessing steps: removing URLs, converting the text to lowercase and removing stopwords.
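The three steps above can be sketched as follows. This is a minimal illustration, not our exact pipeline: the stopword list `STOPWORDS`, the URL pattern `URL_PATTERN` and the function name `preprocess` are assumptions made for the example, and whitespace tokenization is assumed.

```python
import re

# Hypothetical, deliberately small stopword list (an assumption for this
# sketch; the stopword list actually used is not specified in the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "for"}

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def preprocess(text, stopwords=STOPWORDS):
    """Remove URLs, lowercase the text and drop stopwords."""
    text = URL_PATTERN.sub(" ", text)      # step 1: strip URLs
    tokens = text.lower().split()          # step 2: lowercase, split on whitespace
    return " ".join(tok for tok in tokens if tok not in stopwords)  # step 3


cleaned = preprocess("Can U please answer me.. see http://example.com/visa for the rules")
print(cleaned)
```

Punctuation is left untouched here, so noisy tokens such as "me.." survive; whether to normalize them is a separate design decision.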

Extract Skip-Thought vectors
We chose Skip-Thought vectors as the text representation for this task mainly because they are highly generic sentence representations, unlike GloVe or Word2Vec, where the embedding of a complete sentence is typically obtained by averaging the word embeddings of its individual words.
In Subtask A, we extracted Skip-Thought vectors for the question body and the question subject. In Subtask B, we extracted Skip-Thought vectors for the question body, the question subject and the answer comment. For both subtasks, we used the code written by the authors of Skip-Thought vectors (https://github.com/ryankiros/Skip-Thoughts).

Model Building
Once the Skip-Thought vectors were extracted, we used them to train different models: AdaBoost Classifier (only for Subtask B), DecisionTree Classifier, RandomForest Classifier, ExtraTrees Classifier, XGBoost Classifier and a Multi-layer Neural Network with dropout layers in between, the Adam optimizer and softmax activation in the final layer. The hyper-parameters of all the models were determined by applying Grid-Search with 10-fold cross-validation. The hyper-parameters are shown in Table 2.
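To make the selection procedure concrete, the following is a minimal, dependency-free sketch of grid search with k-fold cross-validation. It is an illustration under stated assumptions, not our implementation: the helper names (`k_fold_indices`, `cross_val_accuracy`, `grid_search`), the toy nearest-neighbour classifier `knn_fit_predict` and the synthetic two-cluster data are all hypothetical stand-ins for the real models and Skip-Thought features. In practice, a library routine such as scikit-learn's `GridSearchCV` performs the same search.

```python
import itertools
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    if n_samples < n_folds:
        raise ValueError("need at least one sample per fold")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_val_accuracy(fit_predict, X, y, params, n_folds=10):
    """Mean held-out accuracy of one hyper-parameter setting."""
    scores = []
    for held_out in k_fold_indices(len(X), n_folds):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out], **params)
        correct = sum(p == y[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(fit_predict, X, y, param_grid, n_folds=10):
    """Evaluate every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit_predict, X, y, params, n_folds)
        if score > best_score:  # ties keep the earlier setting
            best_params, best_score = params, score
    return best_params, best_score


def knn_fit_predict(X_train, y_train, X_test, k=1):
    """Toy stand-in classifier: majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        nearest = sorted(
            range(len(X_train)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, X_train[i])),
        )[:k]
        votes = [y_train[i] for i in nearest]
        preds.append(max(set(votes), key=votes.count))
    return preds


# Synthetic, well-separated two-class data (illustrative only).
X = [[i * 0.1, 0.0] for i in range(20)] + [[5 + i * 0.1, 1.0] for i in range(20)]
y = [0] * 20 + [1] * 20

best_params, best_score = grid_search(knn_fit_predict, X, y, {"k": [1, 3, 5]})
print(best_params, best_score)
```

Each hyper-parameter setting is scored only on folds it was not trained on, so the selected setting is not simply the one that fits the training data best.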

Subtask A (Question Classification)
For this subtask, we extracted Skip-Thought vectors as described in Section 4.1.2. From the resulting two vectors, we generated four different feature vectors: (i) question body only, (ii) question subject only, (iii) the concatenation of the question body and question subject vectors and (iv) the average of the question body and question subject vectors. We trained all the models mentioned in Section 4.2 with each of these vectors. The evaluation scores for these models on the test data are shown in Table 3.
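The four feature variants can be sketched as below. The function names (`concatenate`, `average`, `subtask_a_features`) are hypothetical, and the 3-dimensional vectors are tiny stand-ins for real Skip-Thought vectors, which have thousands of dimensions. Because the helpers accept any number of vectors, the same two functions also cover the three-vector case (body, subject, answer).

```python
def concatenate(*vectors):
    """Join vectors end to end; the result has the summed dimensionality."""
    return [value for vec in vectors for value in vec]


def average(*vectors):
    """Element-wise mean of equally sized vectors."""
    if len({len(vec) for vec in vectors}) != 1:
        raise ValueError("all vectors must have the same length")
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def subtask_a_features(body_vec, subject_vec):
    """Build the four feature variants from the body and subject vectors."""
    return {
        "body": list(body_vec),
        "subject": list(subject_vec),
        "concat": concatenate(body_vec, subject_vec),
        "average": average(body_vec, subject_vec),
    }


# Tiny stand-in vectors (illustrative only).
body = [1.0, 2.0, 3.0]
subject = [3.0, 0.0, 1.0]
features = subtask_a_features(body, subject)
print(features["concat"])
print(features["average"])
```

Concatenation keeps the body and subject information in separate dimensions at the cost of a longer input vector, whereas averaging keeps the dimensionality fixed but blends the two sources.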

Subtask B (Answer Classification)
For this subtask, we extracted Skip-Thought vectors as described in Section 4.1.2. From the resulting three vectors, we generated two different feature vectors: (i) the concatenation of the question body, question subject and answer vectors and (ii) the average of the question body, question subject and answer vectors. We trained all the models mentioned in Section 4.2 using each of these embedding vectors. The evaluation scores for these models (except MAP scores) on the test data are shown in Table 4. In both Tables 3 and 4, the column Vector gives the Skip-Thought vector combination type: body only (Subtask A), subject only (Subtask A), concatenation of the body, subject and answer/comment vectors, or average of the body, subject and answer/comment vectors. On the dev set, the XGBoost Classifier with concatenated Skip-Thought vectors produced the best scores for both subtasks. Hence, these form our final submissions.
Table 3: Evaluation scores for Subtask A.
* marks the scores of our primary submission; ** marks the scores of our contrastive submission; the row in bold is the post-evaluation accuracy score (improved over the actual submission).
However, the rows marked in bold (in both subtasks) show the best accuracy scores, obtained with the Multi-layer Neural Network classifier, which beat the best scores of our final CodaLab submission. The Multi-layer Neural Network has an input layer, 2 hidden layers and an output layer, with "relu" activations at the input and hidden layers and "sigmoid" activation at the output layer. All layers have 50 neurons except the output layer, which has a single node. The model counters overfitting through Dropout layers placed between the layers.
Another interesting observation is that the models, surprisingly, performed better when URLs were kept in the text than when they were removed.
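For reference, the three measures reported in the result tables (Accuracy, F-score and Avgrec) can be computed as in the sketch below. It assumes that F-score is the macro-averaged F1 and that Avgrec is the recall averaged over classes; the function names and the six example labels are hypothetical and are not taken from the task data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_scores(y_true, y_pred):
    """Return (macro-averaged F1, average recall) over the labels in y_true."""
    labels = sorted(set(y_true))
    f1_values, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        f1_values.append(f1)
        recalls.append(recall)
    return sum(f1_values) / len(labels), sum(recalls) / len(labels)


# Made-up gold and predicted labels for six questions (illustrative only).
gold = ["factual", "factual", "opinion", "opinion", "socializing", "socializing"]
pred = ["factual", "opinion", "opinion", "opinion", "socializing", "factual"]

acc = accuracy(gold, pred)
macro_f1, avg_rec = macro_scores(gold, pred)
print(acc, macro_f1, avg_rec)
```

Unlike accuracy, the two macro-averaged measures weight every class equally, so a model that ignores a rare class is penalized even if its accuracy is high.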

Table columns: Model, Vector, Accuracy, F-score, Avgrec (table body not recovered).
* marks the scores of our primary submission; ** marks the scores of our contrastive submission; the row in bold is the post-evaluation accuracy score (improved over the actual submission).

Conclusion
To the best of our knowledge, earlier works on cQA did not use Skip-Thought vectors. Hence, we used these vectors for both subtasks. We also tried combinations of the Skip-Thought vectors of the question body, question subject and comments/answers (the latter only for Subtask B), either concatenated or averaged, with different models. Among all the combinations, concatenated Skip-Thought vectors with the XGBoost Classifier produced the best result, as a result of which we stood 6th in Subtask B and 16th in Subtask A. However, a post-evaluation submission that used concatenated Skip-Thought vectors with the Neural Network classifier produced better accuracy scores: 0.6752 compared to 0.6537 (the official best result for Subtask B) and 0.6884 compared to 0.6299 (the official best result for Subtask A). In future, we would like to extend our work with other embeddings such as Word2vec, GloVe and BERT (Devlin et al., 2018) features and compare the results with the current work using Skip-Thought vectors.