NLM_NIH at SemEval-2017 Task 3: from Question Entailment to Question Similarity for Community Question Answering

This paper describes our participation in SemEval-2017 Task 3 on Community Question Answering (cQA). The Question Similarity subtask (B) aims to rank a set of related questions retrieved by a search engine according to their similarity to the original question. We adapted our feature-based system for Recognizing Question Entailment (RQE) to the question similarity task. Tested on the cQA-B-2016 test data, our RQE system outperformed the best system of the 2016 challenge in all measures, with 77.47 MAP and 80.57 Accuracy. On the cQA-B-2017 test data, the performance of all systems dropped by around 30 points. Our primary system obtained 44.62 MAP, 67.27 Accuracy and 47.25 F1 score. The cQA-B-2017 best system achieved 47.22 MAP and 42.37 F1 score. Our system is ranked sixth in terms of MAP and third in terms of F1 out of 13 participating teams.


Introduction
SemEval-2017 Task 3 (http://alt.qcri.org/semeval2017/task3) on Community Question Answering (cQA) focuses on answering new questions by retrieving related answered questions in community forums (Nakov et al., 2017). This task extends the previous SemEval-2015 and SemEval-2016 cQA tasks.
This year, five subtasks were proposed: English Question-Comment Similarity (subtask A), English Question-Question Similarity (subtask B), English Question-External Comment Similarity (subtask C), Arabic Answer Re-ranking (subtask D) and English Multi-Domain Duplicate Question Detection (subtask E).
Subtask B (Question Similarity) aims to re-rank a set of similar questions retrieved by a search engine with respect to the original question, with the idea that the answers to the similar questions should also be answers to the new question. For a given question, a set of ten similar questions is provided for re-ranking.

Data
The cQA task covers two languages: English and Arabic. The English dataset (CQA-QL corpus) is based on data from the Qatar Living forum. The CQA-QL corpus consists of a list of original questions, each with ten related questions from Qatar Living and the first ten comments from their threads. For subtask B, questions are annotated as PerfectMatch, Relevant and Irrelevant with respect to the original question. Both PerfectMatch and Relevant questions are considered good, without distinction.
For the cQA-B-2017 task, the training and development datasets are the cQA-B-2016 datasets. A total of 3,869 question pairs is available for training, including the cQA-2016 test questions. The cQA-B-2017 test data is composed of 880 question pairs.

Question Similarity vs. Question Entailment
In addition to the efforts within the SemEval cQA tasks since 2015, earlier definitions and methods were proposed for Question Similarity based on different elements such as the question topic and question type (Burke et al., 1997; Jeon et al., 2005; Duan et al., 2008). However, other definitions relying on specific kinds of question similarity, such as entailment and paraphrasing, are not yet well developed for Question Answering (QA). In a previous effort (Ben Abacha and Demner-Fushman, 2016), we introduced a new task called Recognizing Question Entailment (RQE), which tackles a specific kind of question similarity. As question entailment had not previously been proposed for automatic QA, we proposed a new RQE definition: a question PQ entails a question HQ if every answer to HQ is also an exact or partial answer to PQ.
The RQE task aims to automatically provide an existing answer if an entailment relation exists between a new question and an existing answered question. Considering the example of a question PQ asking about medications for a pregnant woman, the entailed question should include this specificity too; otherwise, the question is not entailed from a semantic standpoint, since its answer is not relevant to the original question PQ. This answer-related definition makes question entailment a relevant extension of textual entailment for QA.
Also, our definition includes partial answers (e.g., an answer to only one sub-question of a question PQ asking about the causes, diagnoses and treatments of a specific disease). Partial answers are crucial in dealing with complex questions that include more than one sub-question.
Our RQE system obtained a 75% F1 score on medical questions when trained on data constructed automatically (Ben Abacha and Demner-Fushman, 2016). Our RQE method is applied to answer consumer health questions received by the U.S. National Library of Medicine (NLM, http://www.nlm.nih.gov).

System
Our RQE system uses a supervised machine learning approach to determine whether or not a question HQ can be inferred from a question PQ. We use Logistic Regression with a set of lexical and morpho-syntactic features. The features were selected empirically after numerous tests on Recognizing Textual Entailment (RTE) datasets.

Preprocessing
For each question, we remove stop words and perform word stemming using the Porter algorithm (Porter, 1980).

Similarity Features
We compute different similarity measures between the pre-processed questions and use their values as features:
• The selected similarity measures are Word Overlap, the Dice coefficient based on the number of common bigrams, cosine distance, Levenshtein distance, and Jaccard distance.
• The feature list also includes the maximum and average values among the five similarity measures and the question length ratio.
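The feature set above can be sketched as follows. The exact formulations (e.g., how the Levenshtein distance is normalized into a similarity) are assumptions based on the paper's description, not the actual implementation.

```python
import math

# Sketch of the five lexical similarity features plus the aggregate features.
def word_overlap(a, b):
    # Fraction of shared tokens relative to the smaller question.
    return len(set(a) & set(b)) / min(len(set(a)), len(set(b)))

def bigram_dice(a, b):
    bi_a = {tuple(a[i:i + 2]) for i in range(len(a) - 1)}
    bi_b = {tuple(b[i:i + 2]) for i in range(len(b) - 1)}
    return 2 * len(bi_a & bi_b) / (len(bi_a) + len(bi_b))

def cosine(a, b):
    vocab = set(a) | set(b)
    va = [a.count(w) for w in vocab]
    vb = [b.count(w) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb)))

def levenshtein(a, b):
    # Token-level edit distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[-1]

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def feature_vector(a, b):
    sims = [word_overlap(a, b), bigram_dice(a, b), cosine(a, b),
            1 / (1 + levenshtein(a, b)),  # distance mapped to a similarity
            jaccard(a, b)]
    # Aggregates: max, average, and the question length ratio.
    return sims + [max(sims), sum(sims) / len(sims), len(a) / len(b)]

q1 = ["treatment", "migraine", "headache"]
q2 = ["migraine", "headache", "cure"]
print(feature_vector(q1, q2))  # 8-dimensional feature vector
```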

Morpho-syntactic Feature
We use TreeTagger (Schmid, 1994) for POS tagging. We generate an additional feature for the number of common nouns and verbs between the two questions.
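The feature can be sketched as below. The actual system runs TreeTagger; here we assume the questions are already tagged and represented as (token, POS) pairs, with tag prefixes (NN, VB, VV) as an assumption about the tagset.

```python
# Sketch of the morpho-syntactic feature: number of nouns and verbs
# shared by the two questions. Input is pre-tagged (token, POS) pairs.
def common_noun_verb_count(tagged_q1, tagged_q2):
    def nouns_and_verbs(tagged):
        # Keep tokens tagged as nouns (NN*) or verbs (VB*/VV*).
        return {tok for tok, pos in tagged if pos.startswith(("NN", "VB", "VV"))}
    return len(nouns_and_verbs(tagged_q1) & nouns_and_verbs(tagged_q2))

q1 = [("treat", "VV"), ("migraine", "NN"), ("quickly", "RB")]
q2 = [("migraine", "NN"), ("treat", "VV"), ("safely", "RB")]
print(common_noun_verb_count(q1, q2))  # → 2
```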

Question Similarity System
For cQA-B-2017, we used our RQE classifier trained on the SemEval-2016 datasets (3,869 question pairs). In cQA-B-2016, the IR baseline system provided interesting results. We therefore used a weight-based method to combine the scores provided by the Logistic Regression model with the IR baseline ranks. We used the reciprocal rank to convert the IR baseline rank, with a weight w fixed after several empirical tests on cQA-2016 data. The combination formula is:

score = LogisticRegression_score + w × (1 / IR_rank)
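The combination step can be sketched as follows, using the primary-run weight w = 7.9 reported in the Results section (the candidate triples are hypothetical).

```python
# Sketch of the score-combination step: the Logistic Regression score for
# each candidate question is combined with the reciprocal of its IR rank.
def combined_score(lr_score, ir_rank, w=7.9):
    return lr_score + w * (1.0 / ir_rank)

def rerank(candidates, w=7.9):
    # candidates: list of (question_id, lr_score, ir_rank) triples.
    return sorted(candidates,
                  key=lambda c: combined_score(c[1], c[2], w),
                  reverse=True)

candidates = [("Q1", 0.40, 3), ("Q2", 0.90, 2), ("Q3", 0.10, 1)]
print([qid for qid, _, _ in rerank(candidates)])  # → ['Q3', 'Q2', 'Q1']
```

Note that with a large weight w, the IR baseline rank dominates the final ordering: here Q3 wins despite its low classifier score because it was ranked first by the search engine.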

Results
Systems are scored according to Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Average Recall (AvgRec), Precision (P), Recall (R), F1 and Accuracy (Acc). The official evaluation measure used to rank the participating systems is MAP.
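As a reference for the official measure, a minimal sketch of MAP over ranked candidate lists is shown below (the relevance labels are hypothetical; the official scorer also handles details such as queries with no relevant candidates).

```python
# Minimal sketch of Mean Average Precision (MAP) for ranked candidates.
# Each ranking is a list of relevance labels, True meaning a good match.
def average_precision(labels):
    hits, total = 0, 0.0
    for i, rel in enumerate(labels, 1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two queries: a perfect ranking, and one relevant item at rank 2.
print(mean_average_precision([[True, False, False],
                              [False, True, False]]))  # → 0.75
```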
For the final evaluation, we submitted 3 runs, changing only the weighting coefficient. For NLM NIH-primary, the combination weight was the one that gave the best results on the 2016 test data (w = 7.9). For NLM NIH-contrastive1, we used the combination weight that performed best on the 2016 development data (w = 8.9), which had a slightly better impact on MAP on the new 2017 test data (44.66 vs. 44.62 MAP). For NLM NIH-contrastive2, we used a third combination weight (w = 6.8). Table 2 presents the results on the test data from cQA-B-2016. We used the same system with the best combination weight according to the development data (w = 8.9). Our results outperformed the best system on the 2016 test data. A general drop of around 30 points in performance for all systems can be observed on the cQA-B-2017 test data.

Conclusion
In this paper, we described our participation in Task 3-B of SemEval-2017. We explored the adequacy of our question entailment system for the question similarity task. Despite the general drop in performance relative to the 2016 test data for all participating systems, we obtained good results on the 2017 test data, with 44.62 MAP, 67.27 Accuracy and 47.25 F1 score. Our system is ranked sixth in terms of MAP and third in terms of F1 out of 13 participating teams.