Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach

The task of answer selection in community question answering consists of identifying pertinent answers from a pool of user-generated comments related to a question. SemEval-2015 recently introduced a shared task on community question answering, providing a corpus and an evaluation scheme. In this paper we address the problem of answer selection in Arabic. Our model uses a wide range of features, including lexical and semantic similarities, vector representations, and rankings. We investigate the contribution of each set of features in a supervised setting. We show that combining features by means of a linear support vector machine achieves better performance than that of the competition winner (F1 of 79.25 compared to 78.55).


Introduction
Community Question Answering (cQA) platforms have become an important source of specific information for users on the Web. A person posts a question on a specific topic and other users post their answers with few, if any, restrictions. The liberty to post questions and answers at will is one of the ingredients that make this kind of forum attractive and allows questions to be answered in a very short time. Nevertheless, this same anarchy can cause a question to receive so many answers that manual inspection becomes difficult, while a given comment might not even address the question (e.g., because the topic gets diverted, or the user aims to make fun of the topic).
Our task is defined as follows. Given a question q and its set of derived comments C, identify whether each c ∈ C represents a DIRECT, RELATED, or IRRELEVANT answer to q. In order to do that, we take advantage of the framework provided by the SemEval-2015 Task 3 on "Answer Selection in Community Question Answering" and focus on the Arabic language. Our approach treats each question-comment pair as an instance in a supervised learning scenario. We build a support vector machine (SVM) classifier that uses different kinds of features, including vector representations, similarity measures, and rankings. Our extensive feature set allows us to achieve better results than those of the winner of the competition: 79.25 F1 compared to 78.55, obtained by Nicosia et al. (2015).
The rest of the paper is organized as follows. Section 2 describes the experimental framework (composed of the Fatwa corpus and the evaluation metrics) and overviews the different models proposed at competition time. Section 3 describes our model. Experiments and results are discussed in Section 4. Related work is discussed in Section 5. We summarize our contributions in Section 6, and include an error analysis in Appendix A.

Overview of SemEval-2015 Task 3
Task overview
The SemEval-2015 Task 3 on "Answer Selection in Community Question Answering" proposed two tasks in which, given a user-generated question-answer pair, a system would identify the level of pertinence of the answer. The task was proposed in English and Arabic. In the case of English, the topic of the corpus was daily life in Qatar. In the case of Arabic, the topic was Islam. Whereas the English task attracted twelve participants, only four teams accepted the challenge of the Arabic one.
The evaluation framework is composed of a corpus and a set of evaluation measures. The corpus for the Arabic task is called Fatwa, as this is the name of the community question answering platform from which the questions were extracted. Questions (Fatwas) about Islam are posted by reg- [...] linked more comments to each question. There are two other kinds of answers: RELATED are those associated to other questions in the forum which have been identified as related to the current question; IRRELEVANT comments were randomly picked from the rest of the collection. Each question in the final corpus has five answers. Figure 1 shows an example question and its answers, illustrating some of the challenges of this task. Table 1 includes some statistics on the Fatwa corpus.

Figure 1: Example of a question (QID 132600) and its answers from the Fatwa corpus. Key terms appear in bold italics. Note that the direct answer has a high overlap with the question's key terms, the related answers have a lower overlap, and the irrelevant answers have no such overlap.
Question: A person working for a company that has bonds and trades stocks is asking for an opinion.
[...] DIRECT answer addressing both bonds and stocks issues.
[...] RELATED answer addressing only the trading of stocks.
[...] RELATED answer addressing only the buying and selling of bonds.
[...] IRRELEVANT answer discussing whether a husband is allowed to take money from his wife.
[...] IRRELEVANT answer discussing masturbation habits.
The second part of the framework consists of the evaluation metrics. The official scores are macro-averaged F1 and accuracy. Macro-averaging gives the same importance to the three classes even if there are twice as many IRRELEVANT instances as instances in any other class. The intuition behind this metric is that showing IRRELEVANT instances to a user in a real scenario is not as important as showing her DIRECT ones.
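To make the two official scores concrete, the following is a minimal sketch of how macro-averaged F1 and accuracy can be computed over the three classes. It is not the official scorer of the task; the function names and the convention of scoring an undefined precision or recall as 0 are assumptions made for illustration.

```python
from collections import Counter

LABELS = ("DIRECT", "RELATED", "IRRELEVANT")


def macro_f1(gold, predicted, labels=LABELS):
    """Unweighted mean of the per-class F1 scores.

    Assumption: a class with undefined precision or recall scores 0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    per_class = []
    for label in labels:
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        per_class.append(2 * precision * recall / denom if denom else 0.0)
    # Every class contributes equally, regardless of its frequency.
    return sum(per_class) / len(per_class)


def accuracy(gold, predicted):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


gold = ["DIRECT", "RELATED", "IRRELEVANT", "IRRELEVANT"]
pred = ["DIRECT", "IRRELEVANT", "IRRELEVANT", "IRRELEVANT"]
print(macro_f1(gold, pred), accuracy(gold, pred))
```

In this toy example a system that never predicts RELATED still reaches 0.75 accuracy, but its macro-averaged F1 is pulled down by the zero score on that class.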
Participating systems
As mentioned above, four research teams approached this task at the competition. As the rules allowed each team to submit one primary and two contrastive submissions to encourage experimentation, a total of eleven approaches were submitted. In what follows, we describe all the approaches without distinguishing between primary and contrastive. Interestingly, all the approaches from each group appear grouped in the task ranking, so we review them in decreasing order of performance.
The best of the three systems designed by Nicosia et al. (2015) used a variety of similarity features (including cosine, Jaccard coefficient, and containment) on word [1,2]-grams. Additionally, the word [1,2]-grams themselves were considered as features. They applied a logistic regression model to rank the comments and label the top answer as DIRECT, the next one as RELATED, and the remaining ones as IRRELEVANT. Their second system used the same lexical similarity, n-gram features, and learning model, but this time in a binary setting: DIRECT vs. NO-DIRECT. The prediction confidence produced by the classifier was used as a score to rank the comments and assign labels accordingly: DIRECT for the top ranked, RELATED for the second ranked, and IRRELEVANT for the rest. Their third approach is rule-based: a tailored similarity measure in which more weight is given to matching 2-grams than to 1-grams, and a label assignment which depends on the relative similarity to the most similar comment in the thread. The output of this rule-based system was also used as a set of extra features in their top-performing approach.
The best submission of Belinkov et al. (2015) was very similar to that of Nicosia et al. (2015): a ranking approach based on confidence values obtained by an SVM ranker (Joachims, 2006). Their second approach consisted of a multi-class linear SVM classifier relying on three feature families: (i) lexical similarities between q and c (similar to those applied by the previous team); (ii) word vector representations of q and c; and (iii) a ranking score for c produced by the SVM ranker.
The two best approaches of Hou et al. (2015) used features representing different similarities between q and c, lengths of words and sentences, and the number of named entities in c, among others. In this case [1,2,3]-grams were also considered as features, but with two differences with respect to the other participants: only the most frequent n-grams were used, and a version translated into English was also included. They explored two strategies using SVMs in their top-performing submissions: (i) a hierarchical setting, first discriminating between IRRELEVANT and NON-IRRELEVANT and then between DIRECT and RELATED; and (ii) a multi-class classification setting. Their third approach was based on an ensemble of classifiers.
Finally, Mohamed et al. (2015) applied a decision tree whose input is composed of lexical and enriched representations of q and c: the terms in the texts are expanded on the basis of a set of Quranic ontologies. The authors do not report the [...]
We participated in the submissions of the top-performing models (Belinkov et al., 2015; Nicosia et al., 2015). As described below, here we explore effective combinations of the features applied in both approaches, as well as an improved feature design.

Model
We train a simple linear support vector machine (SVM) classifier (Joachims, 1999) on pairs of questions and comments. We opt for this alternative because it allowed us to get the best performance during the SemEval task (cf. Section 2); our previous experiments with more sophisticated kernels did not show any improvement. Each question q and comment c is assigned a feature vector. Some features are unique to either q or c, while others capture the relationship between the two. Our features can be broadly divided into four groups: vector representations, similarity measures, statistical ranking, and rule-based ranking. We describe each kind in turn.

Vectors
Our motivation for using word vectors for this task is that they convey a soft representation of word meanings. In contrast to similarity measures that are based on words, word vectors have the potential to bridge the lack of lexical overlap between questions and answers.
We start by creating word vectors from a large corpus of raw Arabic text. We use Word2Vec (Mikolov et al., 2013b; Mikolov et al., 2013a) with default settings for creating 100-dimensional vectors. We experimented with the Arabic Gigaword (Linguistic Data Consortium, 2011), containing newswire text, and with the King Saud University Corpus of Classical Arabic (KSUCCA), containing classical Arabic text (Alrabiah et al., 2013). Table 2 provides some statistics for these corpora. We initially expected KSUCCA to produce better results, because its language should be more similar to the religious texts in the Fatwa corpus. However, in practice we found vectors trained on the Arabic Gigaword to perform better, possibly thanks to its larger coverage, so we report only results with the Gigaword corpus below.
We noticed in preliminary experiments that many errors are due to a lack of overlap in vocabulary between answers and questions (cf. Section 4.1). In some cases, this lack of overlap stems from the rich morphology of Arabic forms, and can be avoided by lemmatizing. Therefore, we also lemmatize the Arabic corpus using MADAMIRA (Pasha et al., 2014) before creating word vectors. We observe that lemma vectors tend to give small improvements experimentally.
For each question and answer, we average all lemma vectors excluding stopwords. This simple bag-of-words approach ignores word order, but is quite effective at capturing question and answer content. We then concatenate the average question vector and the average answer vector. The resulting concatenated vectors form the features for our classifier. Note that we do not calculate vector similarities (e.g., cosine similarity), letting the classifier have access to all vector dimensions instead.
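The averaging and concatenation step can be sketched as follows. This is an illustrative sketch rather than our actual implementation: the toy 2-dimensional vectors stand in for the 100-dimensional lemma vectors, and both the skipping of out-of-vocabulary lemmas and the all-zero fallback for texts with no usable lemma are assumptions.

```python
def average_vector(lemmas, vectors, stopwords, dim):
    """Average the vectors of all non-stopword lemmas.

    Assumptions: lemmas without a vector are skipped, and a text with
    no usable lemma is represented by an all-zero vector.
    """
    total = [0.0] * dim
    count = 0
    for lemma in lemmas:
        if lemma in stopwords or lemma not in vectors:
            continue
        for i, value in enumerate(vectors[lemma]):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


def pair_features(question, comment, vectors, stopwords, dim):
    """Concatenate the average question vector and the average comment vector."""
    q_vec = average_vector(question, vectors, stopwords, dim)
    c_vec = average_vector(comment, vectors, stopwords, dim)
    return q_vec + c_vec  # list concatenation: 2 * dim features


# Toy vocabulary (hypothetical values, 2 dimensions instead of 100).
toy_vectors = {"stock": [1.0, 0.0], "bond": [0.0, 1.0], "trade": [2.0, 2.0]}
toy_stopwords = {"the"}
print(pair_features(["the", "stock", "bond"], ["trade", "unknown"],
                    toy_vectors, toy_stopwords, dim=2))
```

Because the two averages are concatenated rather than compared, the classifier sees every dimension of both texts and can learn its own notion of compatibility between them.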

Similarity
This set of features measures the similarity sim(q, c) between a question and a comment, assuming that high similarity signals a DIRECT answer.
We compute the similarity between word n-gram representations (n = 1, ..., 4) of q and c, using different lexical similarity measures: greedy string tiling (Wise, 1996), longest common subsequences (Allison and Dix, 1986), Jaccard coefficient (Jaccard, 1901), word containment (Lyon et al., 2001), and cosine similarity. The preprocessing in this case consists only of stopword removal. Additionally, we compute cosine similarity on lemmas and part-of-speech tags, both including and excluding stopwords.
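As an illustration of three of these measures, the sketch below computes the Jaccard coefficient, containment, and cosine similarity over word n-grams for n = 1, ..., 4. Greedy string tiling and longest common subsequences are omitted for brevity. The function names are hypothetical, and the direction of the containment measure (the share of question n-grams found in the comment) is an assumption.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """All contiguous word n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Size of the intersection over size of the union of distinct n-grams."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the distinct n-grams of `a` that also occur in `b`."""
    set_a = set(a)
    return len(set_a & set(b)) / len(set_a) if set_a else 0.0


def cosine(a, b):
    """Cosine similarity between n-gram frequency vectors."""
    counts_a, counts_b = Counter(a), Counter(b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_features(q_tokens, c_tokens, max_n=4):
    """Three similarity values per n-gram order, for n = 1..max_n."""
    features = []
    for n in range(1, max_n + 1):
        q_grams = word_ngrams(q_tokens, n)
        c_grams = word_ngrams(c_tokens, n)
        features.extend([
            jaccard(q_grams, c_grams),
            containment(q_grams, c_grams),
            cosine(q_grams, c_grams),
        ])
    return features


print(similarity_features(["trade", "stocks", "and", "bonds"],
                          ["trade", "stocks"]))
```

Tokens are assumed to have had stopwords removed beforehand, in line with the preprocessing described above.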

Statistical Ranking
The features described so far apply to each comment independently without considering other comments in the same thread. To include such global information, we take advantage of our previous work (Belinkov et al., 2015) and formulate the problem as a ranking scenario: comments are ordered such that better comments have a higher ranking. Concretely, DIRECT answers are ranked first, RELATED answers second, and IRRELEVANT answers third. We then train an SVM ranker (Joachims, 2002), and add its scores as additional features. We also scale ranking features to [0, 1] and map scores into 10 bins in the [0, 1] range, with each bin assigned a binary feature. If a score falls into a certain bin, its matching binary feature fires.
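The scaling and binning of ranker scores can be sketched as follows. The sketch assumes min-max scaling over a given collection of scores and ten equal-width bins; the text above does not fix either choice, so both are illustrative assumptions, as are the function names.

```python
def scale_scores(scores):
    """Min-max scale raw ranker scores to the [0, 1] range.

    Assumption: min-max scaling; constant input maps to all zeros.
    """
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(score - low) / (high - low) for score in scores]


def bin_features(scaled_score, num_bins=10):
    """One binary feature per bin; only the bin containing the score fires.

    Assumption: equal-width bins, with a score of exactly 1.0 placed
    in the last bin.
    """
    index = min(int(scaled_score * num_bins), num_bins - 1)
    features = [0] * num_bins
    features[index] = 1
    return features


raw_scores = [10.0, 12.5, 15.0]  # hypothetical ranker outputs
for scaled in scale_scores(raw_scores):
    print(scaled, bin_features(scaled))
```

Binning lets a linear classifier assign a separate weight to each score range instead of a single weight to the raw score.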
We found such ranking scores to be a valuable addition in our experiments. To understand why, we note that they are able to neatly separate the different labels, with the following average scores: DIRECT 14.5, RELATED 12.3, and IRRELEVANT 10.5.

Rule-based Ranking
In addition to the machine learning approaches, we adapted our rule-based model, which ranked second in the competition (Nicosia et al., 2015). The basic idea is to rank the comments according to their similarity and label the top ones as DIRECT.
In this case our preprocessing consists of stemming, performed with QATARA (Darwish et al., 2014), and again stopword removal. In our implementation, the score of a comment is computed as

[...]

where ω(t) = 1 if t is a 1-gram and 4 if it is a 2-gram, and pos(t) represents the relative position of t in the question and is estimated as the length of q minus the position of t in q. That is, we give significantly more relevance to 2-grams and to those matching n-grams at the beginning of the question. We compute this score twice, once considering the subject and once considering the body of the question, and sum the two to get the final score. In the first case, α = 1.1; in the second case, α = 1.
We map the scores of comments c_1, ..., c_5 ∈ C into the range [0, 1] such that the best ranked comment gets a score of 1.0, and assign a label to comment c as follows:

[...]

All the parameters and thresholds in this rule-based approach were manually tuned on the training data.

Experiments and Results
The aim of our experiments is to explore each set of features both in isolation and in combination. Thus we isolate rule-based features from similarity features and from vector-based features. In our experiments we combined vector-based and statistical ranking features, following our previous work (Belinkov et al., 2015). Note that the rule-based ranking system (Section 3.4) does not produce any features. Instead, we binarize its output to produce the features to be combined with the rest. We train and tune all the models on the training and development sets and perform a final evaluation on the test set. This experimental design mimics the competition setting, making the figures directly comparable.
Table 3 shows the results. It is worth noting that the performance of the different feature sets is already competitive with respect to the top models at competition time. On the development set, we found it useful to run an SVM ranker on the entire set of features and convert its ranking to predictions as follows: the top scoring comment is DIRECT, the next best is RELATED, and all others are IRRELEVANT. This heuristic (marked with "#" in the table) produced the best results on the development set, but was not as successful on the test set. Instead, we observe that the best performing system is obtained by combining vectors and rule-based ranking, achieving 79.25 F1 and outperforming the best result from the SemEval 2015 task.
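The conversion from ranker scores to labels used by this heuristic can be sketched as below. The function name is hypothetical, and the tie-breaking behaviour (the earlier comment in the thread wins) is an assumption that the description above does not specify.

```python
def labels_from_ranking(scores):
    """Label the comments of one thread from their ranker scores.

    The top-scoring comment is DIRECT, the next best is RELATED, and
    all others are IRRELEVANT. Assumption: ties are broken in favour
    of the comment that appears earlier in the thread.
    """
    # sorted() is stable, so equal scores keep their original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = ["IRRELEVANT"] * len(scores)
    if order:
        labels[order[0]] = "DIRECT"
    if len(order) > 1:
        labels[order[1]] = "RELATED"
    return labels


thread_scores = [0.2, 1.5, -0.3, 0.9, 0.1]  # hypothetical scores for five comments
print(labels_from_ranking(thread_scores))
```

Note that this scheme always outputs exactly one DIRECT and one RELATED label per thread, regardless of how many comments of each kind the thread actually contains.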

Error Analysis
We analyzed a sample of errors made by a preliminary version of our system. We focused on the case of RELATED answers predicted as IRRELEVANT, as this was the largest source of errors. See Appendix A for examples of common errors. The analysis indicates the following trends:
• Under-specification: RELATED answers tend to have a smaller vocabulary overlap with the question, compared to DIRECT answers (cf. Figure 1).
• Over-specification: RELATED answers sometimes contain multiple other terms that are not directly related to the question.
• Non-trivial overlap: occasionally, questions and answers may be related through synonyms or through lemmas rather than surface forms.
These observations shed some light on the contribution of our different features. In cases of under- or over-specification, text similarity features help the classifier determine the correct answer. Cases of non-trivial overlap require other solutions. We use lemmatization and stemming to collapse different surface forms. Finally, our vector-based features can capture synonyms between question and answer, thanks to their property of similar words having similar vectors.

Related Work
As far as we know, the SemEval 2015 Task 3 was the first shared task on answer selection in community question answering. Previously, the importance of cQA to the Arab world was recognized by Darwish and Magdy (2013), who mention two such forums: Google Ejabat, akin to Yahoo! Answers; and the Fatwa corpus. The authors identify several research problems for cQA, two of which resemble the answer selection task: their (3) ranking questions and answers; and (4) classifying answers.
Other efforts have been conducted on the analysis and exploitation of non-Arabic cQA data. Nam et al. (2009) analyzed a Korean cQA forum and identified interesting patterns of participation. For instance, users who ask questions tend not to answer those of others, and vice versa, and they tend to "specialize" in a number of categories rather than participate all across the forum. The recognition of their peers (by means of a scoring schema) motivates the top users to provide more and better responses to questions. Whether these patterns hold in other fora represents an interesting problem for future research. Bian et al. (2008) aimed at ranking factoid answers to questions in Yahoo! Answers to identify the most appealing ones in terms of relevance to the topic and quality. In addition to text-based features (e.g., similarity between question and answer), they took advantage of user-interaction information, including the number of answers previously posted by the user and the number of questions that they "resolved", as determined by the question poster.
Non-community Arabic question answering has received a little more attention. The Question Answering for Machine Reading (QA4MRE) task included Arabic data sets in both its 2012 and 2013 editions (Peñas et al., 2012; Sutcliffe et al., 2013), although only the 2012 edition attracted participating teams for the Arabic task. This task focused on answering multiple choice questions by retrieving relevant passages. Participating systems used mostly information retrieval methods and question classification. For more details on this and other Arabic question answering efforts we refer the reader to Darwish and Magdy (2013) and Ezzeldin and Shaheen (2012).

Summary
In this work we tackled the problem of answer selection in an Arabic community question answering forum, consisting of religious questions and answers. We explored a wide range of features in a supervised setting and achieved state-of-the-art performance on the SemEval 2015 Task 3. We demonstrated that using features of different kinds, along with raw Arabic corpora and existing preprocessing tools, is important for addressing the challenges of this task.
To conclude, we note some drawbacks of the Fatwa corpus: it was created by artificially retrieving answers that are not originally linked to the question. This makes the detection of IRRELEVANT answers quite trivial, as observed by [...]. In addition, there is little sense in using contextual information from different answers to the same question when some of them are retrieved randomly. We believe that future endeavors should focus on more natural community question answering forums in Arabic, for example Google Ejabat.

Appendix A
Discussion: The question asks if it is allowed to undergo laser treatments. The related answer says that treatments are allowed based on the authority of the Prophet, but does not mention laser, whereas the direct answer refers to laser explicitly.
Discussion: The question asks if it is allowed to trade farm products from a non-Muslim country out of that country, given that the law in that country forbids it. The related answer says that one has to follow a non-Muslim country's laws, as long as they do not contradict the Islamic law. This answer does not specifically address the matter of selling farm products, whereas the direct answer uses specific words that appear in the question.
Discussion: The question asks whether it is allowed to borrow with interest from the state, for example when the state builds a factory for someone. Both the direct and related answers are very similar, pointing to a difference between interest loans and ownership of something by the bank. The related answer refers to equipment, which is different from the factory asked about in the question, while the direct answer does not refer to anything specifically.