Fermi at SemEval-2019 Task 8: An elementary but effective approach to Question Discernment in Community QA Forums

Online Community Question Answering (cQA) forums have gained massive popularity in recent years. The growth in their user base has increased the need for automated question comprehension and for fact evaluation of the answers provided by forum participants. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019, which tackles the first problem in the pipeline of factual evaluation in cQA forums: deciding whether a posed question asks for factual information, asks for an opinion or advice, or is merely socializing. Separating factual questions from non-factual ones helps organize questions into useful categories and trims down the problem space for the next task in the pipeline, fact evaluation among the available answers. Our system combines embeddings obtained from the Universal Sentence Encoder with an XGBoost classifier for sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our system achieved 84% accuracy on the evaluation test set and received first position in the final standings judged by the organizers.


Introduction
The massive rise in popularity of Community Question Answering (cQA) forums such as Stack Overflow, Quora, Yahoo! Answers, and Google Groups has made them an effective means of information dissemination for topic-centered communities to share knowledge and engage in knowledge consumption. Over time, however, information becomes obsolete, and many facts that were previously true change. Another problem is that most forums lack exhaustive moderation and control, which results in high-latency quality checks and eventually in the sharing of non-factual information. Various factors are responsible for this, primarily ignorance or misunderstanding and, sometimes, maliciousness on the part of the responder (Mihaylova et al., 2018).
In the pipeline for detecting whether the given responses to a question are indeed factual, the necessary first step is to discern which category the question asked in the cQA forum falls into. For example, "What is Domino's customer service number?" is a factual question, as it asks for a fact rather than an opinion or discourse. In contrast, the question "Can someone recommend a good pediatrician in Mumbai?" asks for an opinion rather than a particular piece of factual information: what makes a good pediatrician is subjective and depends on various other factors, so no conclusion is universally true.
We tackle the problem proposed by the organizers (Mihaylova et al., 2019) in sub-task A as a multi-class classification problem, i.e., categorizing questions in cQA forums into one of the following three categories:

1. Factual: The question asks for factual information, which can be answered by checking various information sources, and it is not ambiguous (e.g., "What is the currency used in Taiwan?").

2. Opinion: The question asks for an opinion or advice, not for a fact (e.g., "Can somebody recommend good restaurants around the SF Bay Area?").

3. Socializing: Not a real question, but intended for socializing or chatting. This can also mean expressing an opinion or sharing some information without really asking anything of general interest (e.g., "What was your first bike?").

Our submission uses pre-trained models to generate sentence embeddings and then applies off-the-shelf machine learning algorithms to the multi-class prediction problem. The approach is described in detail in Section 3.
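The task formulation above can be sketched as a small data-preparation step. Joining the subject and body into a single input string is our assumption for illustration; the task description provides the two fields separately.

```python
# Minimal sketch of sub-task A instances as (text, label-id) pairs.
# Concatenating subject and body into one input string is an assumption
# made for this sketch, not a detail mandated by the task.

LABELS = ["Factual", "Opinion", "Socializing"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}

def make_instance(subject: str, body: str, label: str) -> tuple[str, int]:
    """Join subject and body into one text field and map the label to an id."""
    text = subject.strip() + " " + body.strip()
    return text, LABEL_TO_ID[label]

example = make_instance(
    "Currency question",
    "What is the currency used in Taiwan?",
    "Factual",
)
print(example)
```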

Related Work
For classification tasks such as question similarity across community QA forums, machine learning algorithms like Support Vector Machines (SVMs) have been widely used (Šaina et al., 2017; Nandi et al., 2017; Xie et al., 2017; Mihaylova et al., 2016; Wang and Poupart, 2016; Balchev et al., 2016). More recently, advances in deep neural network architectures have led to the use of Convolutional Neural Networks (CNNs) (Šaina et al., 2017; Mohtarami et al., 2016), which perform reasonably well for selecting the correct answer in cQA forums. Methods for answer selection also include the work of Zhang et al. (2017), which uses a Long Short-Term Memory (LSTM) model; LSTMs are likewise used by Feng et al. (2017) and Mohtarami et al. (2016). Other works in this space include Random Forests (Wang and Poupart, 2016) and topic models that match questions at both the term level and the topic level (Zhang et al., 2014). There has also been work on translation-based retrieval models (Jeon et al., 2005; Zhou et al., 2011), XGBoost (Feng et al., 2017), and feedforward neural networks (Wang and Poupart, 2016).
In this work, we seek to evaluate pre-trained sentence embeddings and how they perform on question comprehension in community QA tasks. We describe the methodology and data in the following section.

Methodology and Data
The data supplied by the organizers is used for the task at hand. Specifically, for sub-task A, the subject and body of each question are provided. The data consists of 1118 training instances, along with 239 and 935 question instances in the development and test sets, respectively.

Word Embeddings
Word embeddings have been widely used in modern Natural Language Processing applications as they provide vector representations of words. They capture the semantic properties of words and the linguistic relationships between them, and have improved the performance of many downstream tasks across domains such as text classification and machine comprehension (Camacho-Collados and Pilehvar, 2018). Multiple ways of generating word embeddings exist, such as the Neural Probabilistic Language Model (Bengio et al., 2003), Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and, more recently, ELMo (Peters et al., 2018).
These word embeddings rely on the distributional hypothesis; they differ in how they capture the meaning of words and in how they are trained. Each embedding captures a set of semantic attributes that may or may not be captured by others, and in general it is difficult to predict their relative performance on downstream tasks. The choice of word embeddings for a given downstream task therefore depends on experimentation and evaluation.

Sentence Embeddings
While word embeddings can produce representations for words that capture their linguistic properties and semantics, representing sentences as vectors remains an important and open research problem (Conneau et al., 2017).
Finding a universal sentence representation that works across a variety of downstream tasks is the major goal of many sentence embedding techniques. A common, naïve approach is to take the arithmetic mean of the embeddings of all words present in the sentence. Smooth inverse frequency, which uses weighted averages modified via Singular Value Decomposition (SVD), has been a strong baseline over the traditional averaging technique (Arora et al., 2016). Other sentence embedding techniques include p-means (Rücklé et al., 2018), InferSent (Conneau et al., 2017), SkipThought (Kiros et al., 2015), and the Universal Sentence Encoder (Cer et al., 2018).
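The naïve averaging baseline described above can be sketched in a few lines. The tiny 3-dimensional vocabulary here is invented purely for illustration.

```python
# Sketch of the averaging baseline: a sentence embedding as the
# arithmetic mean of the vectors of its in-vocabulary words.

def mean_pool(sentence: str, word_vectors: dict[str, list[float]]) -> list[float]:
    """Average the vectors of all in-vocabulary words in the sentence."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vecs:  # no known words: fall back to a zero vector
        return [0.0] * len(next(iter(word_vectors.values())))
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 3-d word vectors, invented for this example.
toy_vectors = {
    "good": [1.0, 0.0, 0.5],
    "restaurant": [0.0, 1.0, 0.5],
}
print(mean_pool("good restaurant", toy_vectors))  # [0.5, 0.5, 0.5]
```

The weighted-average variants (e.g., smooth inverse frequency) replace the uniform `1/len(vecs)` weight with word-frequency-dependent weights before the SVD correction.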
We formulate sub-task A of Task 8 at SemEval 2019 as a multi-class text classification task. In this paper, we evaluate various pre-trained sentence embeddings for identifying each of the categories (factual, opinion, and socializing) among the questions in community QA forums. We train multiple models using different machine learning algorithms to evaluate the efficacy of each pre-trained sentence embedding for the sub-task. In the following, we briefly discuss several popular sentence embedding methods.
• InferSent (Conneau et al., 2017) is a set of embeddings proposed by Facebook. InferSent embeddings are trained on a popular natural language inference corpus: given two sentences, the model learns to infer whether they form a contradiction, a neutral pairing, or an entailment. The output is a 4096-dimensional embedding.
• Concatenated Power Mean Word Embeddings (Rücklé et al., 2018) generalize the concept of average word embeddings to power mean word embeddings. The concatenation of different types of power mean word embeddings considerably closes the gap to state-of-the-art methods monolingually and substantially outperforms many complex techniques cross-lingually.
• Lexical Vectors (Salle and Villavicencio, 2018) is a word embedding model similar to fastText but with a slightly modified objective. FastText (Bojanowski et al., 2016) incorporates character n-grams into the skip-gram model of Word2Vec and thus considers sub-word information.
• The Universal Sentence Encoder (Cer et al., 2018) encodes text into high-dimensional vectors. The model is trained and optimized for greater-than-word-length text such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks, with the aim of dynamically accommodating a wide range of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.
• Deep Contextualized Word Representations (ELMo) (Peters et al., 2018) use language models to obtain embeddings for individual words, taking the entire sentence or paragraph into account when computing these representations. ELMo uses a pre-trained bi-directional LSTM language model. For the input supplied, the ELMo architecture extracts the hidden state of each layer, and a weighted sum of these hidden states is computed to obtain an embedding for each sentence.
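The ELMo combination step above can be sketched as a plain weighted sum over per-layer hidden states. The layer values and weights below are toy numbers; in the real model the layer weights are learned per downstream task.

```python
# Sketch of the ELMo-style combination step: a weighted sum over the
# hidden states of each biLM layer. Toy values, invented for illustration.

def weighted_layer_sum(layer_states: list[list[float]],
                       weights: list[float]) -> list[float]:
    """Combine per-layer hidden states into one vector via a weighted sum."""
    assert len(layer_states) == len(weights)
    dim = len(layer_states[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_states))
        for i in range(dim)
    ]

layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 layers, 2-d hidden states
softmaxed = [0.2, 0.3, 0.5]                    # normalized layer weights
print(weighted_layer_sum(layers, softmaxed))   # [3.6, 4.6]
```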
Using each of the sentence embeddings mentioned above, we evaluate how each performs when the vector representations of the question bodies in the cQA forums are supplied to various off-the-shelf machine learning algorithms for classification. For each evaluation, we perform experiments with each sentence embedding and report classification performance on the development set provided by the task organizers.
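The evaluation loop can be sketched end to end. Since the actual system pairs pre-trained encoders with XGBoost, the hash-based "embedder" and nearest-centroid classifier below are simplified stand-ins chosen only to keep the sketch self-contained and runnable.

```python
import hashlib
import math

# End-to-end sketch: embed each question body, fit a classifier, predict
# on new questions. The hashing "embedder" and nearest-centroid classifier
# are stand-ins for the real pre-trained embeddings and XGBoost.

def embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedding: L2-normalized bag of hashed words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def fit_centroids(texts, labels):
    """Average the embeddings of each class into a per-class centroid."""
    sums, counts = {}, {}
    for text, label in zip(texts, labels):
        v = embed(text)
        acc = sums.setdefault(label, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(text, centroids):
    """Assign the class whose centroid has the highest dot product."""
    v = embed(text)
    return max(centroids,
               key=lambda lab: sum(a * b for a, b in zip(v, centroids[lab])))

train_texts = ["what is the currency used in taiwan",
               "can somebody recommend good restaurants"]
train_labels = ["Factual", "Opinion"]
centroids = fit_centroids(train_texts, train_labels)
print(predict("what is the currency of japan", centroids))
```

Swapping in real sentence embeddings and a gradient-boosted classifier changes only the `embed` and `fit`/`predict` stand-ins; the surrounding loop stays the same.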

Results
The official ranking metric is accuracy; we include the macro-F1 score as well for comparison. Table 1 provides the results of the system runs for the evaluation phase, as judged by the organizers on the CodaLab platform. Our system ranked first among the participants in the evaluation phase. We observe that Universal Sentence Encoder representations with the XGBoost classifier give the best results on the test set.
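For reference, the two reported metrics can be computed from gold and predicted labels as follows; the label sequences below are toy data for illustration.

```python
# Minimal implementations of the two reported metrics:
# accuracy and macro-F1 (the unweighted mean of per-class F1 scores).

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["Factual", "Opinion", "Socializing", "Factual"]
pred = ["Factual", "Opinion", "Factual", "Factual"]
print(accuracy(gold, pred))  # 0.75
print(macro_f1(gold, pred))  # 0.6
```

Note that macro-F1 penalizes the missed "Socializing" class as heavily as the frequent ones, which is why it sits below plain accuracy here.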
To illustrate the differences in performance across our experiments, we also report results from system runs on the development set provided by the organizers. These results are shown in Table 2.

Conclusions and Future Work
We see from the results that our system is able to discern the type of questions asked in community QA forums with high performance. This suggests that using pre-trained embeddings with a simple machine learning classifier often yields a strong understanding of the text at hand, in this case the questions in community question-answering forums.
In future work, we seek to evaluate transfer learning approaches that utilize pre-trained language models (LMs) built on different base corpora, and to study how varying these base corpora affects performance when fine-tuning for question comprehension in cQA forums.

Table 1 :
Results showing Macro-F1 score and accuracy for sub-task A, using Universal Sentence Encoder embeddings and training the model with XGBoost.

Table 2 :
Dev set accuracy and Macro-F1 scores (in percentages) for sub-task A of Task 8.