AUTOHOME-ORCA at SemEval-2019 Task 8: Application of BERT for Fact-Checking in Community Forums

Fact checking is an important task for maintaining high-quality posts and improving the user experience in Community Question Answering (CQA) forums. SemEval-2019 Task 8 therefore aims to identify factual questions (subtask A) and to detect true factual information in the corresponding answers (subtask B). To address this task, we propose a system based on the BERT model enriched with meta information of questions. For subtask A, the outputs of a fine-tuned BERT classification model are combined with the length of the question as a feature to boost performance. For subtask B, the predictions of several variants of the BERT model that encode the meta information are combined into an ensemble model. Our system achieved competitive results, with an accuracy of 0.82 in subtask A and 0.83 in subtask B. The experimental results validate the effectiveness of our system.


Introduction
Community Question Answering (CQA) forums are gaining more and more popularity because they offer users a great opportunity to obtain appropriate answers to their questions from other users. Meanwhile, the massive accumulation of questions and answers in CQA forums presents a new challenge: providing valuable information to users more effectively. Researchers have therefore shown increased interest in CQA systems (Srba and Bielikova, 2016; Wang et al., 2018), aiming to facilitate efficient knowledge acquisition and circulation. Specifically, a large portion of this research focuses on two tasks: finding questions relevant to a new question so that their answers can be reused (Question Retrieval), and searching for relevant answers among the existing answers to other questions (Answer Selection).
Despite a great deal of research on CQA, relatively few studies focus on the quality of questions and answers. Yet the credibility of answers is an important aspect that directly affects the user experience in CQA forums. To check the veracity of answers automatically, some recent works (Karadzhov et al., 2017; Mihaylova et al., 2018) attempt to utilize external sources and extract appropriate features for classification. Given the importance of information veracity in CQA forums, fact checking of answers remains an issue worth investigating further.
Therefore, SemEval-2019 Task 8 aims to conduct fact checking in CQA forums. To detect the veracity of answers, it is first necessary to identify whether the questions themselves are factual. The task comprises two subtasks: subtask A targets identifying whether a question asks for factual information, asks for an opinion/advice, or is merely socializing. Given factual questions, subtask B aims to determine whether the corresponding answers are true, false, or non-factual.
To address SemEval-2019 Task 8, we propose a system based on the BERT model (Devlin et al., 2018). Our system extends BERT by integrating meta information of questions into the BERT encoder, and builds an ensemble from several candidate classification models to achieve highly competitive results. Specifically, in subtask A, two outputs of fine-tuned BERT classifiers are obtained from the subjects and bodies of questions, respectively. Both outputs are then combined with the length of the question as features, and the AdaBoost method (Schapire, 1999) is used to boost the performance of question classification. For subtask B, in addition to encoding extra meta information (the category and subject of questions) into the BERT model, we apply bagging over several variants of the BERT model produced by adding extra layers. The experimental results on both subtasks demonstrate the effectiveness of our system. The rest of this paper is organized as follows. Related work on CQA is summarized in Section 2. Section 3 gives a detailed description of our system. The results and analysis of our experiments are presented in Section 4. Finally, Section 5 presents the main conclusions.

Related Work
So far, most studies on CQA focus on two tasks: Question Retrieval and Answer Selection. In earlier work, traditional methods treated questions or answers as bags of words and measured their similarity based on weighted matching between the words (Robertson et al., 1994) or translation probabilities learned from a language model (Xue et al., 2008). In practice, however, similar questions are often phrased not with exactly the same words but with related ones, and there is very little token overlap between questions and answers. Because these methods essentially treat a question or answer as a bag of words, they neglect semantic information, so it is not surprising that their performance on the aforementioned tasks is not very good. More recently, neural models (He et al., 2015; Feng et al., 2015; Tan et al., 2016; Bachrach et al., 2017; Tay et al., 2018), which can capture some semantic relations, have been proposed and have gradually become mainstream in CQA research. The basic idea behind them is to learn representations of questions and answers with CNN or LSTM models and then perform text matching by casting both tasks as classification or learning to rank.
Furthermore, public CQA datasets and competitions have substantially promoted relevant research. The public datasets are collected from various CQA websites, including Quora, Yahoo! Answers, Qatar Living, etc. As for competitions, a Kaggle competition was held to identify duplicated question pairs collected from the Quora website. SemEval-2015 Task 3, "Answer Selection in Community Question Answering" (Nakov et al., 2015), mainly targeted the answer selection task. A more comprehensive competition, SemEval-2016 Task 3 (Nakov et al., 2016), was designed for both Question Retrieval and Answer Selection and consisted of four subtasks: Question-Comment Similarity, Question-Question Similarity, Question-External Comment Similarity, and reranking the correct answers for a new question. In turn, SemEval-2017 Task 3 incorporated a new duplicate question detection subtask on top of SemEval-2016 Task 3.
Although much work has been done on CQA, little attention has been paid to improving the quality of questions and answers. To detect true factual answers automatically, Karadzhov et al. (2017) propose a general framework using external sources, which adopts the LSTM model (Hochreiter and Schmidhuber, 1997) to learn text representations of answers and external sources. Mihaylova et al. (2018) extract features from multiple aspects (the answer content, the author profile, the rest of the community forum, and external authoritative sources) and demonstrate the effectiveness of fact checking of answers. At the same time, the lack of a large-scale dataset further restricts progress on fact checking in CQA forums.
Recently, several key milestones have appeared in the NLP field, such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), OpenAI GPT (Radford, 2018), and BERT (Devlin et al., 2018). These large-scale models, which can be pre-trained on a massive corpus of unlabeled data and then fine-tuned on downstream tasks, have delivered great performance on various NLP tasks. In particular, the BERT model has achieved state-of-the-art results on a variety of language tasks, allowing significantly higher performance than models that can only leverage a small task-specific dataset. Therefore, we build a system based on the BERT model for SemEval-2019 Task 8 and achieve satisfactory results.

System Overview
The pipeline of our system is shown in Figure 1. Firstly, the original input files with questions and answers are preprocessed, which includes removing redundant information (e.g., HTML tags, URLs, and strings exceeding the maximum length limit) and extracting the structured contents. Secondly, features such as the length of the question body are extracted from the structured contents. Thirdly, based on the pre-trained BERT model released by Google, we conduct further unsupervised training on the specific CQA corpus to make the model more suitable for the subsequent classification tasks. Finally, the pre-trained BERT model and the extracted features are fed into two subsystems to obtain predictions for subtask A and subtask B, respectively. In the subsystem for subtask A (detailed in Subsection 3.2), the AdaBoost model predicts the class of a question by combining the outputs of the fine-tuned BERT classifiers with the question-length feature. In the subsystem for subtask B (described in Subsection 3.3), several variant BERT models that encode meta information of questions are combined into an ensemble model for predicting the labels of answers.

Subsystem for Subtask A
In this subsection, the subsystem for subtask A is described in detail. Firstly, the subject and body of each question are encoded by two separate BERT models for fine-tuning on the question classification task. The inputs for the two BERT models are represented as [CLS] text1 [SEP] and [CLS] text2 [SEP], where text1 and text2 are the subject and body of the question, respectively. Table 1 shows example questions for each class (e.g., "What do you like about the person above you?" for Socializing).
Secondly, the outputs of the two fine-tuned BERT models are concatenated with the length of the question body as features for classification. As illustrated in Table 1, the body of a socializing question intuitively tends to be longer than that of a factual or opinion question, so it is reasonable to consider the body length a suitable feature for classification. In addition, the output of each BERT model is the probability of the question belonging to each class (Factual, Opinion, and Socializing). The feature vector x for question classification is then represented as

x = [P_s1, P_s2, P_s3, P_b1, P_b2, P_b3, L_b]

where P_s1, P_s2, P_s3 are the outputs of the BERT model encoding the question subject, P_b1, P_b2, P_b3 are the outputs of the BERT model encoding the question body, and L_b is the body length of the question.
Finally, based on the generated feature vector x, the AdaBoost algorithm is applied to obtain the final classification results. AdaBoost is a typical boosting algorithm that converts a set of relatively weak classifiers into a strong classifier. By additionally considering the length feature, the classification performance is strengthened compared with what the BERT models achieve alone.
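The subtask A pipeline above can be sketched as follows. The probability values and question lengths are hypothetical, and scikit-learn's AdaBoostClassifier stands in for the paper's AdaBoost implementation; only the 10-estimator setting is taken from the paper.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def build_features(p_subject, p_body, body_length):
    """Concatenate the two 3-class probability outputs with the body length,
    giving the 7-dimensional feature vector [P_s1..P_s3, P_b1..P_b3, L_b]."""
    return np.concatenate([p_subject, p_body, [body_length]])

# Hypothetical probability outputs from the two fine-tuned BERT classifiers,
# one row per question (labels: 0 = Factual, 1 = Opinion, 2 = Socializing).
X = np.array([
    build_features([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], 12),
    build_features([0.1, 0.8, 0.1], [0.2, 0.7, 0.1], 25),
    build_features([0.1, 0.2, 0.7], [0.1, 0.2, 0.7], 80),
] * 10)
y = np.array([0, 1, 2] * 10)

# The paper sets the number of AdaBoost estimators to 10.
clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
pred = clf.predict(X[:3])
```

In practice the probability triples would come from the two fine-tuned BERT classifiers rather than being written out by hand.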

Subsystem for Subtask B
This subsection describes the details of the subsystem for subtask B.
Firstly, the subject and body of the question, the corresponding reply (i.e., the answer), and the meta information of the question are combined to generate input sequences for the BERT encoders. To identify a true factual reply, the content of the corresponding question and auxiliary information (e.g., the category of the question, or the usernames of the questioner and replier) should be useful for classifiers. We therefore investigate the influence of different information on classification performance (see Table 4 for details), including the subject of the question (F-subject), the usernames of the questioner and replier (F-username), and the category of the question (F-category). Ultimately, the text of the answer together with F-subject and F-category is employed in our BERT-based models. The generated input sequence is represented as [CLS] text1 [SEP] text2 [SEP], where [SEP] separates the question information from the answer: text1 is composed of F-subject, F-category, and the body of the question, separated by the special symbol (~), while text2 is the text of the corresponding reply.
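The sequence construction above can be sketched as a small helper; the example thread values are hypothetical.

```python
def build_bert_inputs(f_subject, f_category, body, reply, sep="~"):
    """Assemble the two text segments for the subtask B BERT encoder:
    text1 packs F-subject, F-category and the question body, separated
    by '~'; text2 is the reply text. The BERT tokenizer later wraps the
    pair with [CLS] and [SEP] tokens."""
    text1 = f" {sep} ".join([f_subject, f_category, body])
    text2 = reply
    return text1, text2

# Hypothetical example values for a Qatar Living thread.
text1, text2 = build_bert_inputs(
    "visa question",
    "Visas and Permits",
    "How long does a tourist visa last?",
    "It is valid for 30 days.",
)
```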
Secondly, based on the generated input sequences, we design three different categories of BERT-based models for the ensemble:
• BERT-CLS. The final hidden state of the first token [CLS] in the input is used for fine-tuning the pre-trained BERT model by adding a classification layer and a standard softmax.
• BERT-AVG. Different from BERT-CLS, the final hidden states of all tokens are used for classification by applying average pooling and then adding a fully connected layer and a standard softmax.
• BERT-LSTM. Compared with BERT-AVG, a Bi-LSTM network is added between the pooling layer and the pre-trained BERT encoder. Note that we only take the outputs of the BERT encoder; its parameters are not updated during training.
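The two pooling schemes that distinguish BERT-CLS from BERT-AVG can be illustrated with plain NumPy over toy hidden states (the tensor values and mask below are arbitrary placeholders for real BERT outputs):

```python
import numpy as np

def cls_pooling(hidden_states):
    """BERT-CLS: use the final hidden state of the first ([CLS]) token."""
    return hidden_states[0]

def avg_pooling(hidden_states, mask):
    """BERT-AVG: average the final hidden states over all non-padding tokens."""
    m = mask[:, None]
    return (hidden_states * m).sum(axis=0) / m.sum()

# Toy final hidden states: sequence length 6, hidden size 4.
rng = np.random.default_rng(0)
h = rng.random((6, 4))
mask = np.array([1, 1, 1, 1, 0, 0], dtype=float)  # last two positions are padding

cls_vec = cls_pooling(h)
avg_vec = avg_pooling(h, mask)
```

Both poolings yield one vector per sequence, which the classification (or, for BERT-LSTM, the Bi-LSTM plus pooling) layers then consume.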
Thirdly, we select a set of competitive classifiers during training for each of the three kinds of BERT-based models, using five-fold cross-validation. Specifically, the original samples are randomly divided into five equally sized sub-samples. One of the five sub-samples is retained for validating the performance of the classifier, and the remaining four are used as training data. For each kind of BERT-based model, the cross-validation process is repeated five times, and each time at most five optimal classifiers are obtained. In total, after filtering by a threshold on the accuracy metric, we obtain sixty-five competitive classifiers from the three kinds of BERT-based models for the ensemble.
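The fold construction and threshold filtering described above can be sketched as follows; the sample count, validation scores, and threshold value are hypothetical.

```python
import random

def five_fold_splits(n_samples, seed=0):
    """Randomly partition sample indices into five equal-size folds; each fold
    serves once as validation data while the other four form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // 5
    folds = [idx[k * fold_size:(k + 1) * fold_size] for k in range(5)]
    for k in range(5):
        train = [i for j in range(5) if j != k for i in folds[j]]
        yield train, folds[k]

splits = list(five_fold_splits(100))  # 100 is a hypothetical sample count

# Only classifiers whose validation accuracy passes a threshold are kept
# for the ensemble (the scores below are hypothetical).
val_accuracies = [0.78, 0.84, 0.81, 0.90, 0.77]
threshold = 0.80
kept = [i for i, acc in enumerate(val_accuracies) if acc >= threshold]
```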
Finally, an effective integration strategy is applied to produce a strong classifier for the subtask. There are two candidate integration strategies:
• Strategy 1 (Vote-ensemble): Each classifier casts a vote, and the label of a sample is decided by the majority of votes.
• Strategy 2 (Distribution-ensemble): If the number of votes for any label exceeds one-half of the total number of classifiers, the sample is assigned that label. Otherwise, the label of the sample is determined by jointly considering the actual label distribution of the training data and the distribution of the votes. For example, if a sample's votes for different labels are very close, the sample is assigned the label with the largest proportion in the training-data distribution.
Strategy 2 is employed in our subsystem because the Distribution-ensemble strategy appears more robust to variance error, especially on small datasets, as discussed further in Subsection 4.2.

Dataset
The dataset is organized as question-answer threads from the Qatar Living forum. Each question, annotated with one of the labels Opinion, Factual, or Socializing, has a subject, a body, and meta information including the question ID, category, posting time, and the user's ID and name. Each answer, classified as Factual-True, Factual-False, or Non-Factual, has a body and meta information (answer ID, posting time, user's ID and name). Detailed statistics of the dataset are given in the task description paper (Mihaylova et al., 2019).

Experimental Results and Analysis
For pre-training, the BERT model is initialized from the BERT-Base-Cased model and further trained on the forum corpus provided by the organizers. The training batch size is 32, the number of training steps is 1e+5, and the learning rate is 2e-5. The detailed experimental results for both subtasks are described below.

Results for Subtask A
In the subsystem for subtask A, the AdaBoost algorithm is employed to boost the performance of question classification, with the number of estimators set to 10. To evaluate the performance of question classification, we compare our proposed method against the following models:
• Text-CNN (Kim, 2014): a simple CNN with one layer of convolution on top of word vectors. The subject and body of each question are concatenated as the input to the Text-CNN model. During training, the number of epochs is 80, the initial learning rate is 0.001, and the dropout rate is set to 0.4.
• BERT without pre-training: the BERT-Base-Cased model released by Google. The input to the model is the concatenation of the subject and the body of each question, represented as [CLS] text1 [SEP] text2 [SEP], where text1 and text2 are the subject and body of the question, respectively. During training, the batch size is 32, the initial learning rate is 2e-5, and the number of epochs is 9.
• BERT with pre-training: the BERT model further pre-trained on the CQA corpus. The hyper-parameter settings are the same as for the BERT model without pre-training.
The comparison results are shown in Table 2. It can be observed that the accuracy of the Text-CNN model is much lower than that of the three BERT-based models. Even when only the BERT model without pre-training is used for prediction, it is 2.93% and 8.68% higher than the Text-CNN model on the development and test datasets, respectively. Considering that the dataset is relatively small, this demonstrates the potential advantage of BERT-based models. Compared with the BERT model without pre-training, the BERT model with pre-training gains 3.35% and 5.52%, respectively, which illustrates that the pre-training step is very important. Furthermore, the accuracy achieved by our method is 0.86% and 2.59% higher than that of the BERT model with pre-training on the two datasets. This shows that the AdaBoost algorithm can make better use of the probability outputs of the fine-tuned BERT models for prediction, and that the body length of questions is an effective feature for training the model and predicting results.

Results for Subtask B
In the experiments for subtask B, the three kinds of BERT-based models are implemented with TensorFlow and trained with the Adam optimizer. The maximum sequence length is set to 150 and the batch size to 4. The initial learning rates are 3e-5 for the parameters of the BERT encoder and 1e-3 for the others.

The experimental results of the different models are shown in Table 3. It can be observed that the BERT-AVG model achieves the best performance among the three single models. By applying average pooling over the final hidden states of all tokens, the BERT-AVG model can capture more semantic information than the BERT-CLS model, which only attends to the hidden state of the [CLS] token. The BERT-LSTM model performs the worst, which may be caused by its higher model complexity combined with the lack of adequate training data, resulting in some overfitting. In addition, the results indicate that ensemble models obtain higher accuracy than single models and that the Distribution-ensemble strategy is more robust than the Vote-ensemble strategy. This is because when the numbers of votes for different labels are close to each other, it is difficult to identify the correct class by majority alone; by additionally considering the actual class distribution in the training data, the Distribution-ensemble strategy shows its advantage. To explore the effectiveness of different information for classification, a series of experiments based on the BERT-CLS model are conducted. The baseline (BERT-CLS) encodes only the body of the question and the corresponding answer, so the influence of each additional piece of information can be examined individually. The performance of the model on the development dataset with different information is shown in Table 4. It is observed that F-username does not contribute to an increase in accuracy, which may be due to the many anonymous users in the forum. By encoding F-subject and F-category into the model, it achieves the best performance.

Conclusion
Detecting the veracity of answers is vital for maintaining high-quality information in CQA forums. To address this problem, a system based on the BERT model was developed for participation in SemEval-2019 Task 8. In the system, the meta information of questions is encoded into the BERT model, and an ensemble of multiple variants of the BERT model is produced to achieve better performance. In subtask A, we apply the AdaBoost algorithm to features consisting of the fine-tuned outputs of BERT models and the length of questions. In subtask B, after encoding the auxiliary information of questions and answers into the BERT model, the fine-tuned BERT model and two variant models with added average-pooling or LSTM layers are combined to reduce the variance error. Our system achieved strong performance, with accuracies of 0.82 and 0.83 in the two subtasks, respectively. Notably, the system obtains impressive results in subtask B without using external sources, which may be explained by the advantage of the BERT model over models trained only on a small task-specific dataset. In the future, we will explore retrieving relevant information from the Web efficiently and integrating this external information into our BERT-based model.