QU-BIGIR at SemEval 2017 Task 3: Using Similarity Features for Arabic Community Question Answering Forums

In this paper we describe our QU-BIGIR system for the Arabic subtask D of SemEval 2017 Task 3. Our approach builds on our participation in the previous edition of the same subtask. This year, our system uses different similarity measures that encode the lexical and semantic similarity of text pairs. In addition to well-known similarity measures such as cosine similarity, we use other measures based on the summary statistics of the word embedding representation of a given text. To rank a list of candidate question-answer pairs for a given question, we learn a linear SVM classifier over our similarity features. Our best run came second in subtask D, with performance very competitive with that of the first-ranking system.


Introduction
The ubiquitous presence of community question answering (CQA) websites has motivated research on building automatic question answering (QA) systems that can benefit from previously answered questions to answer newly posed ones (Shtok et al., 2012). A core functionality of such systems is their ability to effectively rank previously suggested answers by their degree/probability of relevance to a posted question. Ranking is vital for demoting irrelevant and low-quality answers, which are commonplace in CQA because these websites are generally open, with no restrictions on who can post or answer questions.
To this effect, SemEval 2017 Task 3 "Community Question Answering" has emphasized the ranking component in the main task of the challenge. We participated in Task 3 Subtask D (the Arabic subtask), which is confined to the main task of ranking answers: given a new question and a set of 30 question-answer pairs (QApairs) retrieved by a search engine, re-rank those QApairs by their degree/probability of relevance to the new question. Figure 1 shows an example of a question and four of its 30 given candidate question-answer pairs. Further details about SemEval 2017 Task 3 can be found in Nakov et al. (2017).
Figure 1: A question and 4 of its given 30 candidate QApairs
In this paper, we describe the system we developed to participate in that task. The system leverages a supervised learning approach over similarity features. We utilize two types of similarity features. First, we employ similarity features based on the term representation of a given pair of texts. Second, we utilize word2vec to build a text representation, following the same approach as in our last year's submission for the same subtask (Malhas et al., 2016), and use similarity features over that representation to encode the semantic similarity of pairs of texts.
The rest of the paper is organized as follows: the approach and the description of features are introduced in section 2; the experimental setup followed in our submitted runs and the results are presented in section 3. Finally, we conclude our study with final remarks in section 4.

Approach
We tackled the answer ranking task with a supervised learning approach that uses linear SVM models. The features used in classification are designed to capture both lexical and semantic information of pairs of texts.

Data Setup
We are given a set of questions $Q$; each is associated with $P$ question-answer pairs. To compute our features, we define a text pair $\langle T_1, T_2 \rangle$ according to three setups:
• QQA: We consider $T_1$ to be the original question $q$ and the concatenation of one pair $p$ of its associated question-answer pairs as $T_2$.
• QA: We consider $T_1$ to be the original question $q$ and one answer of its associated question-answer pairs as $T_2$.
• QQ: We consider $T_1$ to be the original question $q$ and one question of its associated question-answer pairs as $T_2$.
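The three setups above can be sketched as a small helper that turns one question and its candidate pairs into $(T_1, T_2)$ text pairs. This is only an illustrative sketch: the function name `build_text_pairs`, the tuple layout, and joining the related question and its answer with a single space (question first) for QQA are assumptions, not details given in the paper.

```python
from typing import Dict, List, Tuple


def build_text_pairs(
    question: str, qa_pairs: List[Tuple[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Build (T1, T2) pairs for the QQA, QA and QQ setups.

    `qa_pairs` holds (related_question, answer) tuples for one original
    question. T1 is always the original question.
    """
    setups: Dict[str, List[Tuple[str, str]]] = {"QQA": [], "QA": [], "QQ": []}
    for related_question, answer in qa_pairs:
        # QQA: T2 is the whole pair; the concatenation order is an assumption.
        setups["QQA"].append((question, related_question + " " + answer))
        # QA: T2 is the answer only.
        setups["QA"].append((question, answer))
        # QQ: T2 is the related question only.
        setups["QQ"].append((question, related_question))
    return setups
```

Each similarity feature can then be computed once per setup, giving three values per feature for every candidate pair.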

Term-based Similarity Features
A recent study has shown that simple features like the MK features (Metzler and Kanungo, 2008) can be very effective for re-ranking candidate question-answer pairs (Yang et al., 2016). We specifically use the following features described by Yang et al. (2016). For all features, we assume the input to be two pieces of text, $T_1$ and $T_2$, as defined by any of the setups described in section 2.1.
• SynonymsOverlap: The feature is computed as the portion of $T_1$ terms that appear in $T_2$ either as the original term or as a synonym. Synonyms are extracted from the Arabic WordNet.
To compute the remaining features, we normalize $T_1$ and $T_2$ following the same approach used for the SynonymsOverlap feature. We also apply preprocessing steps including stemming and stopword removal.
• LMScore: The language model score is computed as the Dirichlet-smoothed log-likelihood score of generating $T_1$ given $T_2$. The score is computed using the following equation:
\[ \mathrm{LMScore}(T_1, T_2) = \sum_{w \in T_1} tf_{w,T_1} \log \frac{tf_{w,T_2} + \mu P(w|C)}{|T_2| + \mu} \tag{1} \]
where $tf_{w,T_1}$ and $tf_{w,T_2}$ are the frequencies of term $w$ in $T_1$ and $T_2$ respectively, and $|T_2|$ is the number of terms in $T_2$. $P(w|C)$ is the background language model computed using the maximum likelihood estimate with term statistics extracted from a recent large-scale crawl of the Arabic Web called ArabicWeb16 (Suwaileh et al., 2016). We set $\mu$ to 2000, as this is the default value used in Lucene's language modeling retrieval model.
• CosineSimilarity. This feature computes the cosine similarity between $T_1$ and $T_2$ as follows.
\[ \cos(T_1, T_2) = \frac{\vec{T_1} \cdot \vec{T_2}}{\|\vec{T_1}\| \, \|\vec{T_2}\|} \tag{2} \]
where $\vec{T_1}$ and $\vec{T_2}$ are the vector representations of $T_1$ and $T_2$ respectively, and $\|\vec{T_1}\|$ and $\|\vec{T_2}\|$ are the Euclidean lengths of these vectors. We represent texts as vectors using the TF-IDF representation, where term statistics are extracted from ArabicWeb16 (Suwaileh et al., 2016).
• JaccardSimilarity. This feature computes the Jaccard similarity between $T_1$ and $T_2$, treated as sets of terms, as follows.
\[ J(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \tag{3} \]
• JaccardSimilarityV1. This is a variant of Jaccard similarity computed as follows.
• JaccardSimilarityV2. This is a second variant of Jaccard similarity computed as follows.
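As a minimal sketch of the term-based features, the functions below operate on already normalized, stemmed token lists. Several parts are assumptions rather than details from the paper: the `synonyms`, `background` and `idf` dictionaries are hypothetical stand-ins for Arabic WordNet and ArabicWeb16 statistics, the overlap is counted over tokens rather than unique terms, unseen terms get a small probability floor, and the Dirichlet-smoothed score uses the common form $\sum_w tf_{w,T_1} \log\frac{tf_{w,T_2} + \mu P(w|C)}{|T_2| + \mu}$. The two Jaccard variants are not sketched.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def synonyms_overlap(t1: List[str], t2: List[str],
                     synonyms: Dict[str, Set[str]]) -> float:
    """Portion of T1 tokens found in T2 directly or through a synonym."""
    if not t1:
        return 0.0
    t2_terms = set(t2)
    hits = sum(1 for w in t1
               if w in t2_terms or synonyms.get(w, set()) & t2_terms)
    return hits / len(t1)


def lm_score(t1: List[str], t2: List[str],
             background: Dict[str, float], mu: float = 2000.0) -> float:
    """Dirichlet-smoothed log-likelihood of generating T1 from T2."""
    tf2 = Counter(t2)
    score = 0.0
    for w, tf1 in Counter(t1).items():
        p_bg = background.get(w, 1e-9)  # assumed floor for unseen terms
        score += tf1 * math.log((tf2[w] + mu * p_bg) / (len(t2) + mu))
    return score


def tfidf_cosine(t1: List[str], t2: List[str], idf: Dict[str, float]) -> float:
    """Cosine similarity between sparse TF-IDF vectors of T1 and T2."""
    v1 = {w: tf * idf.get(w, 0.0) for w, tf in Counter(t1).items()}
    v2 = {w: tf * idf.get(w, 0.0) for w, tf in Counter(t2).items()}
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def jaccard(t1: List[str], t2: List[str]) -> float:
    """Standard Jaccard similarity over the term sets of T1 and T2."""
    a, b = set(t1), set(t2)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Returning 0.0 for empty inputs is a design choice made here so that every candidate pair still yields a complete feature vector.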

Semantic word2vec Similarity Features
Every text snippet $T$ has a set of words. Each word has a fixed-length word embedding representation, $w \in \mathbb{R}^d$, where $d$ is the dimensionality of the word embedding. Thus, for a text snippet $T$ we define $T = \{w_1, \cdots, w_k\}$, where $k$ is the number of words in $T$. The word embedding representation is computed offline following the approach of Mikolov et al. (2013).
To compute similarity scores, we represent each text snippet by a feature vector; different alternatives for feature representations are adopted as described next.

Average Word Embedding Similarity
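For a text snippet $T$ that has $k$ words, we compute the average vector as follows:
\[ T_\mu = \frac{1}{k} \sum_{i=1}^{k} w_i \tag{6} \]
Notice that $T_\mu \in \mathbb{R}^d$. This leads to the following cosine similarity feature, computed between the average vectors of $T_1$ and $T_2$:
\[ \cos(T_{1,\mu}, T_{2,\mu}) = \frac{T_{1,\mu} \cdot T_{2,\mu}}{\|T_{1,\mu}\| \, \|T_{2,\mu}\|} \tag{7} \]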
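The average-embedding feature can be sketched in a few lines of plain Python. The `embeddings` lookup table is a hypothetical stand-in for a trained word2vec model, and silently skipping out-of-vocabulary words is an assumption; the paper does not say how such words are handled.

```python
import math
from typing import Dict, List, Sequence


def average_vector(words: List[str],
                   embeddings: Dict[str, Sequence[float]]) -> List[float]:
    """Mean of the embeddings of the words in a text snippet."""
    # Assumption: words missing from the embedding table are skipped.
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no word of the snippet has an embedding")
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_embedding_similarity(t1: List[str], t2: List[str],
                                 embeddings: Dict[str, Sequence[float]]) -> float:
    """Cosine similarity between the average vectors of T1 and T2."""
    return cosine(average_vector(t1, embeddings), average_vector(t2, embeddings))
```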

Covariance Word Embedding Similarity
Instead of computing the average vector, we can compute a covariance matrix $C \in \mathbb{R}^{d \times d}$. The covariance matrix $C$ is computed by treating each dimension as a random variable, so that every entry $C_{u,v}$ is the covariance between the pair of variables $(u, v)$. The covariance between two random variables $u$ and $v$ is computed as in eq. 8, where $k$ is the number of observations (words).
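The matrix $C \in \mathbb{R}^{d \times d}$ is symmetric. We compute a vectorized representation of $C$ by stacking its lower triangular part, as in eq. 9. This process produces a vector $T_{Cov} \in \mathbb{R}^{d(d+1)/2}$:
\[ T_{Cov} = \mathrm{vect}(C) = \{ C_{u,v} : u \in \{1, \cdots, d\},\ v \in \{u, \cdots, d\} \} \tag{9} \]
This leads to the following cosine similarity feature, computed between the covariance vectors of $T_1$ and $T_2$:
\[ \cos(T_{1,Cov}, T_{2,Cov}) = \frac{T_{1,Cov} \cdot T_{2,Cov}}{\|T_{1,Cov}\| \, \|T_{2,Cov}\|} \tag{10} \]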
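A rough sketch of the covariance representation follows. The input is the list of word vectors of one snippet. Dividing by $k$ (rather than $k - 1$) is an assumption, since the normalizer is not recoverable from the text here; the choice does not affect the cosine similarity when both snippets use the same convention and have more than one word, because cosine is invariant to positive rescaling of either vector.

```python
import math
from typing import List, Sequence


def covariance_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Vectorized triangular half (diagonal included) of the covariance matrix.

    Each embedding dimension is treated as a random variable and each word
    vector as one observation. The result has d * (d + 1) / 2 entries.
    """
    k = len(vectors)
    d = len(vectors[0])
    means = [sum(x[j] for x in vectors) / k for j in range(d)]
    cov_vec = []
    for u in range(d):
        for v in range(u, d):
            # Assumption: population covariance (divide by k).
            c_uv = sum((x[u] - means[u]) * (x[v] - means[v]) for x in vectors) / k
            cov_vec.append(c_uv)
    return cov_vec


def covariance_similarity(vectors_1: List[Sequence[float]],
                          vectors_2: List[Sequence[float]]) -> float:
    """Cosine similarity between the covariance vectors of two snippets."""
    a = covariance_vector(vectors_1)
    b = covariance_vector(vectors_2)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With $d = 100$, as used in the experiments, each snippet is represented by a vector of 5050 entries regardless of its length.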

Ranking Using SVM
Although Subtask D is a re-ranking task, it also includes a classification component: answers need to be both ranked and labeled as either true or false, where the former designates a Direct or Relevant answer to the new question and the latter designates an Irrelevant answer. In our last year's submission (Malhas et al., 2016) we used a learning-to-rank module for re-ranking pairs, but only a simple heuristic to label the candidate question-answer pairs. This year we use an SVM model to label every candidate pair. In addition to labeling pairs, we use the decision scores from the SVM model to re-rank the candidate question-answer pairs.
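To illustrate how one linear model can serve both purposes, the sketch below assumes a linear SVM has already been trained and that its weight vector and bias are available. The values passed as `weights` and `bias` are hypothetical, and thresholding the decision score at zero is the usual SVM convention rather than something stated in the paper.

```python
from typing import Dict, List, Sequence, Tuple


def decision_score(features: Sequence[float],
                   weights: Sequence[float], bias: float) -> float:
    """Signed distance-like score of a linear SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def label_and_rank(candidates: Dict[str, Sequence[float]],
                   weights: Sequence[float],
                   bias: float) -> List[Tuple[str, float, str]]:
    """Rank candidate pairs by decision score and label each one.

    Returns (candidate_id, score, label) tuples, best candidate first.
    """
    scored = [(cid, decision_score(feats, weights, bias))
              for cid, feats in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    # Assumption: a positive score means "true" (Direct or Relevant).
    return [(cid, score, "true" if score > 0 else "false")
            for cid, score in scored]
```

A usage example with two similarity features per candidate:

```python
ranked = label_and_rank(
    {"p1": [0.9, 0.8], "p2": [0.1, 0.2], "p3": [0.5, 0.5]},
    weights=[1.0, 1.0],
    bias=-1.0,
)
```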

Experimental Evaluation
In this section we present the experimental setup and results of our primary, contrastive-1 and contrastive-2 submissions.

Experimental Setup
We used the Arabic collection of questions and their potentially related question-answer pairs provided by the Task 3 organizers to train our word embedding model. The Gensim tool (footnote 3) was used to generate the word2vec model from the training data (footnote 4), setting $d = 100$. We used the learned model to compute our features as described in section 2. Features were generated for the three data setups described in section 2.1.

Submissions and Results
The differences among our submitted runs lie in the selection of features. In all cases we use a linear SVM for classifying and ranking question-answer pairs. Details on our official submissions are as follows.
Primary. We use the full set of similarity features defined in section 2.2 and section 2.3. In addition, we performed a weighted score fusion with an SVM model based on a fixed-length representation using the covariance word embedding. The feature vectors we used are computed using equation 9. We tuned the model weights using the development set.
Footnote 3: http://radimrehurek.com/gensim/
Footnote 4: Testing data are held out during the computation of the word2vec model.
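The weighted score fusion of the primary run could look roughly like the sketch below. The paper does not give the fusion formula or the tuning procedure, so the convex combination `alpha * a + (1 - alpha) * b`, the grid search, and the use of mean average precision on the development set as the selection criterion are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def fuse(scores_a: Sequence[float], scores_b: Sequence[float],
         alpha: float) -> List[float]:
    """Convex combination of two models' scores for the same candidates."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(scores_a, scores_b)]


def average_precision(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Average precision of the ranking induced by `scores`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


DevQuery = Tuple[Sequence[float], Sequence[float], Sequence[bool]]


def tune_alpha(dev_queries: List[DevQuery], grid: Sequence[float]) -> float:
    """Pick the fusion weight with the best mean average precision.

    Each dev query is (scores_a, scores_b, relevance_labels).
    """
    def mean_ap(alpha: float) -> float:
        aps = [average_precision(fuse(a, b, alpha), rel)
               for a, b, rel in dev_queries]
        return sum(aps) / len(aps)

    return max(grid, key=mean_ap)
```

A weight tuned this way reflects the development data only, so it can transfer poorly when the test questions come from a different source.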

Discussion
• Our best official submission is Contrastive-2, which uses both term-based similarity features and semantic word2vec similarity features. This indicates that the two types of similarity features complement each other.
• Our results justify the use of an SVM model for labeling and re-ranking question-answer pairs. This is clear from the P, R, F1, and Acc scores when compared with those of all other baselines. Our MAP scores are also very competitive with those of the best-performing ranking systems that do not use any form of labeling, such as the IR baseline.
• Score fusion in our primary run did not achieve the best results on the official test set, although it was the best run in our experiments on the development set. We believe this is due to the difference in the source of question-answer pairs between the development set and the official test set, which contains only medical questions.

Conclusion
This paper describes the system we developed to participate in SemEval-2017 Task 3 on Community Question Answering. Our system focused on the Arabic Subtask D, which is confined to answer selection in community question answering, i.e., finding good answers for a given new question.
We adopted a supervised learning approach in which linear SVM models were trained over similarity features. In our best submission, term-based similarity features and word2vec similarity features were both used; our system ranked second among the participating teams.