ConvKN at SemEval-2016 Task 3: Answer and Question Selection for Question Answering on Arabic and English Fora

We describe our system, ConvKN, which participated in SemEval-2016 Task 3, “Community Question Answering”. The task targeted the reranking of questions and comments in real-life web fora in both English and Arabic. ConvKN combines convolutional tree kernels with convolutional neural networks and additional manually designed features, including text similarity and thread-specific features. For the first time, we applied tree kernels to syntactic trees of Arabic sentences for a reranking task. Our approaches obtained the second best results in three out of four tasks. The only task in which our performance was average is the one where we did not use tree kernels in our classifier.


Introduction
SemEval-2016 Task 3 (http://alt.qcri.org/semeval2016/task3) challenged the participants on the different steps of the full task of Community Question Answering (cQA). Given a set of existing forum questions Q, where each existing question q ∈ Q is associated with a set of answers C_q, and a new user question q′, the ultimate task is to determine whether a comment c ∈ C_q represents a pertinent answer to q′ or not. This task can be subdivided into three tasks, namely: (A) to assign a relevance (goodness) score to each answer c ∈ C_q with respect to the existing question q; (B) to re-rank the set of questions Q according to their relevance against the new question q′; and finally (C) to predict the appropriateness of the answers c ∈ C_q against q′.
Task 3 included these three tasks for English, whereas an adaptation of Task C was proposed for Arabic (Task D). The reader can refer to  for a more detailed description of the tasks. Task A was also proposed in the SemEval-2015 edition.
We designed systems for all tasks. We used the feature vectors designed by  and Nicosia et al. (2015) for tasks A, B and C, whereas we just used a basic feature vector derived from the system of Belinkov et al. (2015) for Task D.
Additionally, we used Convolutional Neural Networks (CNNs) (Severyn and Moschitti, 2015) and combined them with vectors and tree kernels for Task A as we did in (Tymoshenko et al., 2016).
We acknowledge that the automatic feature engineering of tree kernels was very useful to tackle the new challenges of SemEval-2016 Task 3. Indeed, all three of our systems using relational models based on tree kernels achieved the second official position. In contrast, for Task C, we did not have time to use the relational model in our submitted system, which probably caused our average performance in that task: our system was ranked eighth. For similar reasons, we could apply CNNs only to Task A.
The rest of the paper is organized as follows. Section 2 describes the four cQA tasks and gives a brief overview of the corpora. Section 3 describes the features used. Section 4 discusses our models and our official results. Section 6 presents final remarks.

Tasks Description
In this section we sketch the four proposed tasks.
Task A: Question-Comment Similarity. Given a user question and a thread of ten comments associated with it, re-rank the comments in the thread according to their pertinence. Three classes exist in this case: (i) good: the comment is definitely relevant; (ii) potentially useful: the comment is not good, but it still contains related information worth checking; and (iii) bad: the comment is irrelevant (e.g., it is part of a dialogue or unrelated to the topic). For evaluation purposes, both potentially useful and bad comments were considered as bad.
Task B: Question-Question Similarity. Given a new question and a set of ten forum questions, re-rank the forum questions by assessing if they are (i) perfect match: the new and forum questions request roughly the same information; (ii) relevant: the new and forum questions ask for similar information; or (iii) irrelevant: the new and forum questions are completely unrelated. For evaluation purposes, both perfect match and relevant forum questions are considered as relevant.
Task C: New Question-Comment Similarity. Similar to Task A, but in this case the relevance of one hundred comments is assessed against a new out-of-the-forum question. The same evaluation considerations as in Task A apply.

Task D: Question-{Forum Question+Comment}. A new question and a set of thirty forum question-answer pairs are provided (it is known in advance that the answer is correct with respect to the forum question). Re-rank the question+comment pairs according to three classes: (i) direct: a direct answer to the new question; (ii) relevant: not a direct answer to the question but with information related to the topic; and (iii) irrelevant: an answer to another question, not related to the topic. For evaluation purposes, both direct and relevant forum questions are considered as good.
Tasks A, B, and C use English instances extracted from Qatar Living, a forum for people to pose questions on multiple aspects of daily life in Qatar. Task D uses Arabic instances extracted from three medical fora: webteb, altibbi, and consult islamweb. As this is a reranking task, mean average precision (MAP) is the reference evaluation metric. We also evaluate our models in terms of average Recall (AvgRec), Precision (P), Recall (R), F-measure (F1), and Accuracy.
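As an illustration of the reference metric, MAP over a set of ranked threads can be sketched as follows. This is a minimal sketch, not the official scorer: the label strings follow the Task A class names, the function names are our own, and we assume that a thread without any good comment contributes an average precision of zero (an official scorer may handle that case differently).

```python
from typing import Dict, List

# Assumption: Task A label names; "potentially useful" and "bad"
# both count as negative for evaluation purposes.
POSITIVE_LABELS = {"good"}


def average_precision(ranked_labels: List[str]) -> float:
    """Average precision of one ranked list of gold labels."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label in POSITIVE_LABELS:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: a list with no positive item scores 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(threads: Dict[str, List[str]]) -> float:
    """Mean of the per-question average precisions."""
    if not threads:
        return 0.0
    total = sum(average_precision(labels) for labels in threads.values())
    return total / len(threads)
```

Each list holds the gold labels of one thread in the order produced by the system, so a system that moves good comments towards the top of each thread obtains a higher MAP.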
Further details about the corpora, evaluation and other settings can be found in .

Approach
In order to re-rank the comments according to their relevance, either against the forum questions or against the new questions, we train a binary SVM classifier and use its score as a measure of relevance. The classifier uses partial tree kernels (Moschitti, 2006) defined over shallow syntactic trees, along with other numeric features.
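The use of the classifier score as a relevance measure can be sketched as below. The sketch assumes that a decision score has already been computed for each candidate (comment or forum question); the function name and the identifiers are hypothetical.

```python
from typing import List, Tuple


def rerank(candidates: List[Tuple[str, float]]) -> List[str]:
    """Order candidate ids by decreasing classifier score.

    `candidates` pairs each id with the score the binary classifier
    assigned to it; ties keep their original (e.g. thread) order
    because Python's sort is stable.
    """
    ordered = sorted(candidates, key=lambda pair: -pair[1])
    return [candidate_id for candidate_id, _ in ordered]
```

Only the relative order of the scores matters for ranking measures such as MAP, so the scores do not need to be calibrated probabilities.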
We used the MADAMIRA toolkit (Pasha et al., 2014) for segmenting Arabic texts. In order to split the texts into sentences, we used the Stanford splitter. For parsing Arabic texts into syntactic trees, we used the Berkeley parser (Petrov and Klein, 2007). In the following, we briefly describe the numeric features used in the different tasks.

SemEval-2015 Features
For English texts, we consider three kinds of similarity measures: lexical, syntactic, and semantic (Nicosia et al., 2015). In the case of Task A, the context of a comment is a relevant factor. Comments are organized sequentially according to the time line of the comment thread. Important factors to assess the value of a comment are whether the thread includes further comments by the person who originally asked the question, whether the same user is behind various comments in the thread, and what forum category the thread belongs to. Therefore, we consider a set of features that try to describe a comment in the context of the entire thread. Other Boolean context features characterize different situations, including the identification of potential dialogues, which usually correspond to sequences of bad comments, or the position of the comment in the thread. We also considered the categories of the questions in the forum (as some of them tend to include more open-ended questions and even invite discussion on ambiguous topics), as well as the occurrence of specific strings or the length of a comment. In-depth descriptions of these features are available in (Nicosia et al., 2015).
For Arabic texts, we use the embedding vectors obtained as in Belinkov et al. (2015), i.e., by applying word2vec (Mikolov et al., 2013) to the Arabic Gigaword corpus (Parker et al., 2011). More specifically, we concatenate the vectors representing a new question and an existing question in the question-answer pair, and feed the resulting vector to the SVM classifier.
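The concatenation step can be sketched as follows. The text does not state how a question is turned into a single vector, so this sketch assumes, purely for illustration, that each question is represented by the average of its word vectors and that out-of-vocabulary words are skipped; the function names and the toy dictionary are hypothetical.

```python
from typing import Dict, List, Sequence


def average_vector(tokens: Sequence[str],
                   embeddings: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the vectors of the known tokens (zeros if none is known)."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def pair_features(new_question: Sequence[str],
                  forum_question: Sequence[str],
                  embeddings: Dict[str, List[float]],
                  dim: int) -> List[float]:
    """Concatenate the two question vectors into one feature vector."""
    return (average_vector(new_question, embeddings, dim)
            + average_vector(forum_question, embeddings, dim))
```

With d-dimensional word vectors, every (new question, forum question) pair is thus described by a 2d-dimensional vector.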

Rank Feature
The meta-information in the English corpus includes the position of the forum threads in the rank generated by the Google search engine for a given new question. We exploit this information in tasks B and C. We employ the inverse of such position as a feature and refer to it as the rank feature.
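A minimal sketch of the rank feature, assuming that positions in the search engine ranking are 1-based (the function name is ours):

```python
def rank_feature(position: int) -> float:
    """Inverse of the 1-based position of a forum thread in the ranking."""
    if position < 1:
        raise ValueError("positions are assumed to start at 1")
    return 1.0 / position
```

The top-ranked thread receives a value of 1.0, and the value decays towards 0 for threads further down the list.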

Tree Kernels
Tree kernels are functions that measure the similarity between tree structures. We constructed a syntactic tree for each comment or question. Each task involves a pair of trees: question and comment (tasks A and C), or new and forum questions (tasks B and D). Replicating Severyn and Moschitti (2012), we link the two trees by connecting (i) part-of-speech nodes with a lexical match between the corresponding non-stop words; and (ii) chunk nodes such as NP, PP, VP, when there is a link between the POS-tags below them. Such links are marked with the presence of a specific tag. We then apply the partial tree kernel (PTK) or the syntactic tree kernel (STK), both defined in (Moschitti, 2006), on the pairs as:
$$K((t_1, u_1), (t_2, u_2)) = TK(t_1, t_2) + TK(u_1, u_2) \qquad (1)$$
where t and u are parse trees extracted from the text pair, i.e., either question and comment for task A or question and question for tasks B and D.
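To make the idea of a kernel over tree pairs concrete, the sketch below implements a plain subset-tree kernel in the style of Collins and Duffy and, as an assumption, combines two text pairs by summing the kernel over their first trees and over their second trees. It is not the PTK, which also matches partial productions, and it ignores the relational link tags; the nested-tuple tree format, the function names and the decay value 0.4 are illustrative choices of ours.

```python
# A tree is either a leaf word (str) or a tuple (label, child, child, ...).


def production(node):
    """Label of a node plus the labels (and leaf status) of its children."""
    kids = tuple((c if isinstance(c, str) else c[0], isinstance(c, str))
                 for c in node[1:])
    return (node[0], kids)


def internal_nodes(tree):
    """All non-leaf nodes of a tree, in pre-order."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes


def delta(n1, n2, lam):
    """Weighted number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):  # leaf children contribute a factor 1
            score *= 1.0 + delta(c1, c2, lam)
    return score


def subset_tree_kernel(t1, t2, lam=0.4):
    """Sum of delta over all pairs of internal nodes (no memoization)."""
    return sum(delta(a, b, lam)
               for a in internal_nodes(t1) for b in internal_nodes(t2))


def pair_kernel(pair1, pair2, lam=0.4):
    """Assumed combination: kernel on first trees plus kernel on second trees."""
    (t1, u1), (t2, u2) = pair1, pair2
    return subset_tree_kernel(t1, t2, lam) + subset_tree_kernel(u1, u2, lam)
```

With a decay of 1 the kernel simply counts shared fragments: the tree `("NP", ("DT", "the"), ("NN", "dog"))` shares six fragments with itself and three with `("NP", ("DT", "a"), ("NN", "dog"))`. The recursion is kept naive for readability; practical implementations cache the delta values.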

Submissions and Results
We describe our primary submissions for the four tasks in Section 4.1. The contrastive submissions are discussed in Section 4.2. Table 1 shows our official competition results for both primary and contrastive submissions.

Primary Submissions
Task A. The submission consists of an SVM operating on two kernels: (i) the tree kernel described in Section 3.3, applied to the structures described by Tymoshenko and Moschitti (2015) without question and focus classification; (ii) a polynomial kernel of degree 3 applied to a feature vector that concatenates the feature vector described in Section 3.1 with the question and answer embeddings learned on the training set by the Convolutional Neural Network (CNN) described in (Severyn and Moschitti, 2015). More details about the embeddings used and the resulting kernels can be found in (Tymoshenko et al., 2016). The SVM was trained on the union of the training and development sets.
Task B. The submission consists of an SVM operating on three kernels: (i) an RBF kernel on the features described in Section 3.1; (ii) an RBF kernel on the features described in Section 3.2; and (iii) the tree kernel described in Section 3.3. The C parameter of the SVM was set to 1. Both the tree and the RBF kernels use default values for the parameters. The SVM was trained on the union of the training and development sets.
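One common way to let a single SVM operate on several kernels is to add them up, since a sum of valid kernels is again a valid kernel. The sketch below illustrates this under the assumption of an unweighted sum; the `Example` container, the `gamma` value and the pluggable `tree_kernel` argument are hypothetical and do not reproduce the actual toolkit configuration.

```python
import math
from typing import Callable, NamedTuple, Sequence


class Example(NamedTuple):
    similarity: Sequence[float]  # stand-in for the Section 3.1 features
    rank: Sequence[float]        # stand-in for the Section 3.2 rank feature
    trees: object                # whatever the tree kernel consumes


def rbf_kernel(x: Sequence[float], y: Sequence[float],
               gamma: float = 1.0) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def combined_kernel(a: Example, b: Example,
                    tree_kernel: Callable[[object, object], float],
                    gamma: float = 1.0) -> float:
    """Assumed combination: unweighted sum of the three kernels."""
    return (rbf_kernel(a.similarity, b.similarity, gamma)
            + rbf_kernel(a.rank, b.rank, gamma)
            + tree_kernel(a.trees, b.trees))
```

A Gram matrix built from `combined_kernel` could then be passed to any SVM implementation that accepts precomputed kernels.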
Task C. The submission consists of an SVM operating on two RBF kernels (with default parameter values). The first one is computed on the features described in Section 3.1. The second one is computed on the features described in Section 3.2 plus the score predicted for a comment by a classifier built for Task A, computed by cross-validation. The SVM was trained on the union of the training part 2 and development sets.
Task D. The submission consists of an SVM operating on two kernels: (i) the syntactic tree kernel (SST) (Moschitti, 2006), applied as described in Section 3.3 to the constituency trees of the question texts; (ii) a linear kernel applied to the features in (Belinkov et al., 2015). In tasks A and B we used PTK, which is slower but more accurate. However, the trees of the Arabic data were rather large and very noisy. Thus we used SST, which is faster and uses fewer features. The value 0.1 for parameter L served the purpose of removing noise. The SVM was trained on the union of the training and development sets.

Contrastive Submissions
Task A. We submitted a contrastive run (cont1), where we use a joint learning and inference approach based on a Fully-connected Conditional Random Field (FCCRF) (Joty et al., 2016) to classify all the comments in a thread collectively. We used the numeric (non-tree) features used previously in , and also the predictions of the SVM used in our primary run. The FCCRF model uses an Ising-like edge potential, which distinguishes between only same and different (as opposed to all four possibilities) state transitions to model all pair dependencies.
The second contrastive run (cont2) is identical to the primary submission, but without tree kernels.
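The effect of an Ising-like edge potential in the joint model of cont1 can be illustrated with a toy example. The sketch is a strong simplification and not the model of Joty et al. (2016): the two edge weights are constants rather than functions of pairwise features, the node scores stand in for per-comment classifier outputs, and exhaustive search replaces the actual inference procedure. All names are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def joint_score(labels: Sequence[int], node_scores: Sequence[float],
                same_weight: float, diff_weight: float) -> float:
    """Unnormalized log-score of one joint labeling of a thread."""
    score = 0.0
    # Node terms: a positive score favours label 1 (good), a negative one label 0.
    for label, node_score in zip(labels, node_scores):
        score += node_score if label == 1 else -node_score
    # Ising-like edge terms: only "same" versus "different" matters,
    # and every pair of comments in the thread is connected.
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            score += same_weight if labels[i] == labels[j] else diff_weight
    return score


def best_labeling(node_scores: Sequence[float],
                  same_weight: float, diff_weight: float) -> Tuple[int, ...]:
    """Brute-force search over all 2^n labelings (fine for short threads)."""
    candidates = product((0, 1), repeat=len(node_scores))
    return max(candidates,
               key=lambda labels: joint_score(labels, node_scores,
                                              same_weight, diff_weight))
```

With node scores `[2.0, 1.5, -0.1]` and no edge weights, the third comment is labeled bad; rewarding agreement with `same_weight=0.5` flips it to good, which shows how confident neighbours can influence a borderline comment.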
Task B. We submitted one contrastive submission, which is identical to the primary one, except that the SVM is trained on the training part 2 and development sets only.
Task C. Our first contrastive submission is identical to the primary one, except that the SVM is trained on all training and development sets.
The second contrastive submission consists of a rule-based system which relies on the outputs from tasks A and B. A comment is labeled as good if it is considered good with respect to the related question (Task A) and the related question is considered relevant with respect to the new question (Task B). The comment is considered bad otherwise.
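The rule above amounts to a logical AND of the two binary decisions, as in the following sketch (function and argument names are ours):

```python
def rule_based_label(comment_good_for_related: bool,
                     related_relevant_to_new: bool) -> str:
    """Label a comment for Task C from the Task A and Task B decisions.

    comment_good_for_related: Task A decision for (related question, comment).
    related_relevant_to_new:  Task B decision for (new question, related question).
    """
    if comment_good_for_related and related_relevant_to_new:
        return "good"
    return "bad"
```

A comment is therefore rejected as soon as either of the two upstream decisions is negative.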
Task D. The contrastive systems did not use tree kernels. Our first contrastive run used only feature vectors. Our second contrastive run also used additional features borrowed from machine translation evaluation: BLEU (Papineni et al., 2002), TER (Snover et al., 2006), Meteor (Lavie and Denkowski, 2009), NIST (Doddington, 2002), Precision and Recall, and the length ratio between the question and the comment.

Results and Discussion
Table 1 shows the results obtained in the four tasks. We achieved the second position for tasks A, B, and D. In Task A, tree kernels give no major boost, but without them our model would be cont2, which achieved the third position on the test set. The joint model cont1, run on top of our primary system, was able to improve it by more than one point. We were not sure about the outcome of this model, thus we preferred not to use it as our primary submission, even though it also yielded an improvement on the development set. Our cont1 system for Task B, trained only on the training part 2 and development sets, scored lower than our primary one. Even though our preliminary observations had suggested that the distributions of the different training and development sets were too different and potentially damaging to model learning, having more diverse data turned out to be a better solution to the task.
Our submission for Task C is very limited, as it does not include tree kernel models. Using only our feature vectors (the same as those used for tasks A and B) results in an average performance in the challenge.
Regarding Task D, cont1, which uses the embedding features from (Belinkov et al., 2015), is an average system. When we add the machine translation evaluation (MTE) features, the MAP increases from 38.33 to 39.98. We did not trust the MTE features, as on the development set they obtained a lower result than the simple embedding features. This turned out to be a mistake from the competition viewpoint, as they could have been combined with tree kernels. Indeed, our primary system just combines tree kernels with the embedding features, improving on them by more than 7 absolute points and achieving the second position with a MAP of 45.50, very close to that of the best system, i.e., 45.83.

Conclusions
In this paper, we have presented the systems developed by the teams of the Qatar Computing Research Institute (QCRI) and the University of Trento for participating in SemEval-2016 Task 3 on Community Question Answering.
We used supervised machine learning approaches based on various combinations of convolution tree kernels, text embedding features, including those learned by convolutional neural networks, and a number of task-specific features from our previous work for SemEval-2015 Task 3.
For each task we submitted one primary and two contrastive runs incorporating various combinations of the above components. Our primary runs scored second for tasks A, B and D and eighth for task C. Finally, we analyzed the performance of our runs and discussed which components are more beneficial for a specific task/language.
In future work, we plan to devise better ways of combining convolution tree kernels with CNNs, e.g. by embedding the CNN similarities into the structural kernels, and encoding more complex relations into the structural representations of the text snippets.