GW_QA at SemEval-2017 Task 3: Question Answer Re-ranking on Arabic Fora

This paper describes our submission to SemEval-2017 Task 3 Subtask D, “Question Answer Ranking in Arabic Community Question Answering”. In this work, we apply a supervised machine learning approach to automatically re-rank a set of QA pairs according to their relevance to a given question. We employ features based on latent semantic models, namely WTMF, as well as a set of lexical features based on string lengths and surface-level matching. The proposed system ranked first out of 3 submissions, with a MAP score of 61.16%.


Introduction
Nowadays, Community Question Answering (CQA) websites provide a virtual place for users to share and exchange knowledge about different topics. In most cases, users freely express their concerns and hope for reliable answers from specialists or other users. In addition, they can search previously posted question-answer (QA) pairs that are similar to their question. Although posting a question and looking for a direct or related answer in CQA sounds appealing, the number of unanswered questions is relatively high.
According to Baltadzhieva and Chrupała (2015), approximately 10.9% of the questions in Stack Overflow (a programming CQA forum) and 15% of those in Yahoo! Answers (a community-driven question-and-answer site) are unanswered. Interestingly, as noted by Asaduzzaman et al. (2013), the high percentage of unanswered questions is due to the duplicate question problem, i.e. the existence of a similar question that has been addressed before, which makes users not address the question again. Hence, it is the asker's role to review the site looking for an answer before posting a new question. This task requires searching for related questions among hundreds of others posted on a daily basis. Thus, a good forum should offer an automatic search functionality that retrieves the set of QA pairs most likely to be related to the new question being asked. As a result, the number of duplicates and unanswered questions will be limited.
In order to find a solution to this and other problems in CQA, SemEval 2015, 2016, and 2017 Task 3 have been dedicated to "Answer Selection in Community Question Answering" (Nakov et al., 2017, 2016; Alessandro Moschitti et al., 2015). There are 5 different subtasks, one of which has been proposed for Arabic. The specific task for Arabic in SemEval 2016-2017 Task 3, subtask D, was to re-rank the possibly related question-answer pairs with respect to a given question.
The Arabic task is especially difficult due to the challenging characteristics of the language. Arabic is one of the most complex languages to process due to its morphological richness, its relatively free word order, and its diglossic nature (where the standard and the dialects mix in most genres of data).
The rest of this paper is organized as follows: Section 2 gives an overview of the task and data, Section 3 describes the proposed system, Section 4 presents a discussion of the experiments and results, Section 5 outlines the error analysis, and Section 6 concludes.

Task and Data Description
Arabic by nature has characteristics that make it one of the most challenging languages to process from an NLP perspective. It is a morphologically rich language with flexible word order, and in most typical genres and domains available online we note a significant mix of the standard form of Arabic (MSA) and dialectal variants (DA). In fact, the use of dialectal Arabic in fora such as CQA sites presents a special challenge for processing Arabic. The SemEval 2017 subtask D targets the Arabic language. In particular, the task is to re-rank a given set of QA pairs with respect to their relatedness to a given query. Each pair carries one of three labels: "Direct", a directly related pair; "Relevant", a pair that is not directly related but includes relevant information; or "Irrelevant". Direct and Relevant pairs should therefore appear at the top of the ranked list and Irrelevant pairs at the end. The organizers cast the task both as a ranking problem with the three possible ranks and as a binary classification problem in which the labels Direct and Relevant are grouped as True, while Irrelevant is deemed False.
The Arabic dataset was extracted from medical fora, where users ask questions about medical concerns and the answers are generally from doctors. The dataset contains a training set of 1,031 questions and 30,411 potentially related QA pairs, a development set of 250 questions and 7,385 potentially related QA pairs, and a test set of 1,400 questions, each associated with 8 to 9 potentially related QA pairs (for more details, refer to the task description paper, Nakov et al., 2017).

Approach
In this work, we are interested in studying the effect of using semantic textual similarity (STS) based on latent semantic representations and surface-level similarity features derived from the given triple: the user's new question Q_u and the retrieved question-answer (QA) pair, whose question and answer we refer to as R_Q and R_A, respectively. We therefore cast the problem as a ranking problem that orders the QA pairs according to their relatedness to a given query Q_u. We used the supervised framework SVMrank (Manning et al., 2008).
To build the feature set for a Q_u and QA pair, we extracted one set of features from the pair (Q_u, R_Q) and another from the pair (Q_u, R_A), and we then used the concatenation of both as the feature vector for each triple.
In the following subsections, we describe in detail the preprocessing steps we applied to the raw data and the set of features we used in the submitted model.

Text Preprocessing
Text preprocessing is especially important for this CQA dataset. In this section, we briefly outline the preprocessing we applied before feature extraction. First, we used SPLIT (Al-Badrashiny et al., 2016) to check whether a token is a number, date, URL, or punctuation mark. All URLs and punctuation are removed, and numbers and dates are normalized to Num and Date, respectively. Alef and Yaa characters are each normalized to a single form, which is typical in large-scale Arabic NLP applications to avoid writing variations. For tokenization, lemmatization, and stemming we used MADAMIRA (Pasha et al., 2014) with the D3 tokenization scheme, which segments determiners as well as proclitics and enclitics. Finally, we removed stop words based on a list.
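The normalization steps above can be illustrated with a minimal sketch. It is not the SPLIT or MADAMIRA pipeline: the regular expressions, the function name `normalize`, and the choice of bare Alef and dotted Yaa as the target forms are assumptions made for illustration only, and tokenization is reduced to whitespace splitting.

```python
import re
import unicodedata

# Hypothetical regex stand-ins for the token-type checks that SPLIT performs.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

# Assumed normalization targets: Alef variants -> bare Alef, Alef Maksura -> Yaa.
ALEF_VARIANTS = "\u0622\u0623\u0625"
BARE_ALEF = "\u0627"
ALEF_MAKSURA = "\u0649"
YAA = "\u064A"


def strip_punctuation(text):
    """Replace every character in a Unicode punctuation category with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


def normalize(text, stop_words=frozenset()):
    """Return whitespace tokens after URL removal, Date/Num mapping,
    punctuation removal, Alef/Yaa normalization, and stop-word filtering."""
    text = URL_RE.sub(" ", text)        # URLs are removed
    text = DATE_RE.sub(" Date ", text)  # dates first, so their digits are not read as numbers
    text = NUM_RE.sub(" Num ", text)    # remaining digit sequences
    text = strip_punctuation(text)
    for variant in ALEF_VARIANTS:
        text = text.replace(variant, BARE_ALEF)
    text = text.replace(ALEF_MAKSURA, YAA)
    return [tok for tok in text.split() if tok not in stop_words]
```

In the described system, the output of this stage would then be passed to a morphological analyzer to obtain lemmas and stems; that step is not reproduced here.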

Features
1. Latent Semantics Features: a latent semantic representation transforms the high-dimensional representation of text into a low-dimensional latent space and thus overcomes the limitations of the standard bag-of-words representation by assigning a semantic profile to the text, which captures implicit syntactic and semantic information. There are various models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which rely on observed words to find the distribution of a text over "K" topics. These models are in general applied to relatively lengthy pieces of text or documents. However, texts such as the question and answer pairs found in CQA are relatively short, with two to three sentences on average. Therefore, we used Weighted Textual Matrix Factorization (WTMF) (Guo and Diab, 2012). We used the implementation of WTMF with a modification in the preprocessing pipeline to accommodate Arabic, i.e. we used the same preprocessing steps described in Section 3.1.1. We used word stems as the level of representation. To train the model, we used a sample of data from Arabic Gigaword (Parker et al., 2011) together with the UNANNOTATED Arabic data provided on the task website. We used the default parameters except for the number of dimensions, which we set to 500. Table 1 shows the training data statistics.
For feature generation, we first generated vector representations for Q_u, R_Q, and R_A using the above model. Then, we used Euclidean distance, Manhattan distance, and cosine distance to calculate the overall semantic relatedness scores between (Q_u, R_Q) and between (Q_u, R_A).
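As an illustration of this step, the sketch below computes the three distances for both pairs and concatenates them into six semantic features. It assumes the latent vectors are already available as plain lists of floats; the function names and the convention of returning a cosine distance of 1.0 for a zero vector are choices made here, not details taken from the system.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # assumption: a zero vector is treated as unrelated
    return 1.0 - dot / norm


def semantic_features(q_u, r_q, r_a):
    """Six features: three distances for (Q_u, R_Q), then three for (Q_u, R_A)."""
    feats = []
    for other in (r_q, r_a):
        feats.extend(
            [euclidean(q_u, other), manhattan(q_u, other), cosine_distance(q_u, other)]
        )
    return feats
```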

2. Lexical Features: similar pairs are more likely to share words and hence are more likely to be related. Following this assumption, the following set of features is used to record the length and overlap information of a given pair (A, B): |B − A|, |A ∩ B|, (|B| − |A|)/|A|, (|A| − |B|)/|B|, and |A ∩ B|/|B|, where |A| represents the number of unique instances in A, |B − A| refers to the number of unique instances that are in B but not in A, and |A ∩ B| represents the number of instances that are in both A and B. To account for word form variations, we applied them at the token, lemma, and stem levels.
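The five measures can be sketched directly with set operations. In this sketch, A and B are taken to be the unit sequences (tokens, lemmas, or stems) of the two texts being compared; which text plays the role of A is an assumption, as are the function name and the choice to return 0.0 when a denominator is zero.

```python
def lexical_features(a_units, b_units):
    """Return [|B - A|, |A ∩ B|, (|B|-|A|)/|A|, (|A|-|B|)/|B|, |A ∩ B|/|B|]
    computed over the unique units of each text."""
    a, b = set(a_units), set(b_units)
    overlap = len(a & b)
    size_a, size_b = len(a), len(b)

    def ratio(num, den):
        # Assumption: an empty text yields 0.0 instead of a division error.
        return num / den if den else 0.0

    return [
        len(b - a),                      # unique units in B but not in A
        overlap,                         # unique units shared by A and B
        ratio(size_b - size_a, size_a),  # relative length difference w.r.t. A
        ratio(size_a - size_b, size_b),  # relative length difference w.r.t. B
        ratio(overlap, size_b),          # share of B covered by A
    ]
```

Calling the function three times, once each on the token, lemma, and stem sequences of a pair, gives fifteen lexical features for that pair.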

Experiments and Results
Our ranking system is a supervised model using SVMrank, a variation of SVM (Hearst et al., 1998) for ranking. We tested different types of kernels, and the best result was obtained with a linear kernel, which we used to train our model. Furthermore, we tuned the cost parameter C on the development set and obtained the best result with C=3, which we used when testing our model. The outputs of SVMrank are used only for ordering and do not carry any meaning of relatedness by themselves. For binary classification, "Direct" and "Relevant" are mapped to "True" and "Irrelevant" is mapped to "False". We employed a logistic regression (LR) classifier, the LIBLINEAR classifier with the default parameters, implemented using the WEKA package (Witten and Frank, 2005).
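To make the two label treatments concrete, the sketch below writes one training line in the SVM-light style input format that SVMrank reads (a target value, a `qid:` field grouping candidates of the same query, then 1-indexed `feature:value` pairs) and maps the three labels to the binary setting. The numeric targets 2, 1, 0 and the function names are hypothetical; only their relative order matters to a ranker.

```python
# Hypothetical rank targets: a higher value should be ranked earlier.
RANK_TARGET = {"Direct": 2, "Relevant": 1, "Irrelevant": 0}


def to_svmrank_line(label, qid, features):
    """Format one (Q_u, R_Q, R_A) triple as '<target> qid:<id> 1:<v1> 2:<v2> ...'."""
    pairs = " ".join(f"{i}:{v:.6g}" for i, v in enumerate(features, start=1))
    return f"{RANK_TARGET[label]} qid:{qid} {pairs}"


def to_binary(label):
    """Binary setting: Direct and Relevant count as True, Irrelevant as False."""
    return label in ("Direct", "Relevant")
```

All candidate pairs retrieved for the same user question share one `qid`, so the ranker only compares candidates within a query.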
We report results on the development tuning set, DEV, and TEST set. Furthermore, we report the results of different experimental setups to show the performance over different feature sets. We report results using lexical features (LEX), using WTMF features (WTMF), and with combined features (WTMF+LEX). The latter is our primary submission to the SemEval-2017 subtask D. It is worth noting that we only officially participated in the ranking task. In addition, we report the binary classification results, which we did not officially submit. Furthermore, we compare our results to subtask D baselines and we report the results using the official metrics.
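Since MAP is the metric behind the ranking scores reported in this paper, a minimal sketch of mean average precision may help in reading them. It assumes each query comes with (score, is_relevant) pairs and that Direct and Relevant both count as relevant; the official scorer may differ in details such as rank cut-offs or the handling of queries without any relevant pair, so this is not a reimplementation of it.

```python
def average_precision(ranked_relevance):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    # Assumption: a query with no relevant item contributes 0.0.
    return total / hits if hits else 0.0


def mean_average_precision(scored_by_query):
    """scored_by_query maps a query id to a list of (score, is_relevant) pairs."""
    aps = []
    for candidates in scored_by_query.values():
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        aps.append(average_precision([rel for _, rel in ranked]))
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a query whose relevant pairs end up at ranks 1 and 3 has an average precision of (1/1 + 2/3)/2.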
As can be seen in Table 2, the combined WTMF+LEX setting outperformed the WTMF and LEX settings individually. This indicates that combining the LEX features with WTMF provides the model with complementary information about relatedness at the explicit matching level. Specifically, the WTMF+LEX based system improved MAP by about 1% over the WTMF based and the LEX based systems. Furthermore, we obtain a significant improvement over the baselines on the DEV set and relatively modest improvements on the TEST set, with MAP of 45.73 and 61.16, respectively. Table 3, on the other hand, presents the results of the binary classification on the TEST set using the WTMF+LEX setting, along with the baseline and the results submitted by the two other participants. As can be seen in the table, we achieved the best result on all metrics except precision.
Table 4: Example 1 is an instance of mixed languages, and example 2 is an instance of mixing Dialectal Arabic (words between parentheses) with Modern Standard Arabic. Both types of mixing resulted in a wrong prediction of the relatedness relation. Example text (English translation): "For a while I have been suffering from itching in my hands and legs resulting in redness [-] Knowing that when I put my hand on the itch place I find it burning and swelling".
[...] FN categories. For example, words describing personal information such as weight, age, or gender are not directly related to the medical concern being asked and are considered noise. Therefore, this data needed a hand-crafted list to be used for cleaning.

Conclusion
We have presented in this paper the submission of the GW_QA team to SemEval-2017 Task 3 subtask D on Arabic CQA ranking. We used a supervised machine learning ranker based on a combination of latent-semantics-based similarity and lexical features. We submitted a primary result using SVMrank, and we used logistic regression for the binary classification setting, which was not an official submission. Our primary submission's official MAP score ranked first for the Arabic subtask D. Furthermore, we analyzed the performance of our model and outlined the limitations that caused false positive and false negative predictions.