KeLP at SemEval-2017 Task 3: Learning Pairwise Patterns in Community Question Answering

This paper describes the KeLP system participating in the SemEval-2017 community Question Answering (cQA) task. The system is a refinement of the kernel-based sentence pair modeling we proposed for the previous year's challenge. It is implemented within the Kernel-based Learning Platform called KeLP, from which we inherit the team's name. Our primary submission ranked first in subtask A, and third in subtasks B and C, making it the only system appearing in the top-3 ranking for all the English subtasks. This shows that the proposed framework, which has minor variations among the three subtasks, is extremely flexible and effective in tackling learning tasks defined on sentence pairs.


Introduction
This paper describes the KeLP system participating in the SemEval-2017 cQA challenge (Nakov et al., 2017). The task setting for the English part is the same as in the previous edition (Nakov et al., 2016): the corpus is extracted from Qatar Living (http://www.qatarliving.com/forum), a web forum where people pose questions about multiple aspects of their daily life in Qatar, and three subtasks are defined:
Subtask A: Given a question $q$ and the first 10 comments $c_1, \dots, c_{10}$ in its question thread, re-rank these 10 comments according to their relevance with respect to the question, i.e., the good comments have to be ranked above potential or bad comments.
Subtask B: Given a new question $o$ and the set of the first 10 related questions $q_1, \dots, q_{10}$ (retrieved by a search engine), re-rank the related questions according to their similarity with respect to $o$, i.e., the perfect match and relevant questions should be ranked above the irrelevant ones.
Subtask C: Given a new question $o$ and the set of the first 10 related questions $q_1, \dots, q_{10}$ (retrieved by a search engine), each one associated with the first 10 comments $c^q_1, \dots, c^q_{10}$ appearing in its thread, re-rank the 100 comments according to their relevance with respect to $o$, i.e., the good comments are to be ranked above potential or bad comments.
We participated in the previous year's edition, where our system (Filice et al., 2016) achieved very good results, i.e., first in subtask A, third in B and second in C. For this year's challenge, we therefore decided to reuse the same system, extending it with a new method for selecting tree structures, summarized in Section 3.
We modeled the three subtasks as binary classification problems: kernel-based classifiers are trained and the classification score is used to sort the instances and produce the final ranking. We implemented our models within the Kernel-based Learning Platform (KeLP) (Filice et al., 2015a), which determined the team's name. Our tests provide two main contributions: (i) we reassess the results obtained in (Filice et al., 2016), demonstrating that our kernel-based models for relational learning tasks between two texts (Filice et al., 2015b) are effective for community Question Answering; (ii) we study the impact of the text selection method summarized in Section 3.
Our primary submission ranked first in subtask A, and third in subtasks B and C, demonstrating that the proposed method is very accurate and adaptable to different learning problems. At the moment, we cannot establish whether text selection is always useful, as our contrastive submission, which does not use it, turned out to be much more accurate on subtask B.
In the remainder, Section 2 introduces the proposed kernel-based system, Section 3 describes the pruning technique to select the relevant parts from the input sentences, while Section 4 reports the official results.

The KeLP system: kernel-based learning from text pairs
In the three subtasks, the underlying problem is to understand whether two texts are related. Thus, in subtasks A and C, each (question, comment) pair generates a training instance for a binary Support Vector Machine (SVM) (Chang and Lin, 2011), where the positive label is associated with a good comment and the negative label includes the potential and bad comments. In subtask B, we evaluate the similarity between two questions. Each pair generates a training instance for the SVM, where the positive label is associated with the perfect match or relevant classes and the negative label is associated with the irrelevant class; the resulting classification score is used to rank the question pairs. In KeLP, the SVM learning algorithm operates on a kernel combination of tree kernels and a linear kernel. In particular, the linear kernel is applied to feature vectors containing (i) linguistic similarities between the texts in a pair (Section 2.1); (ii) task-specific features (Section 2.3).
Tree kernels are applied to evaluate inter-pair similarities between sentence pairs, in order to automatically discover pairwise relational patterns.

Intra-pair similarities
In subtasks A and C, a good comment is likely to share similar terms with the question. In subtask B, a question that is relevant to another probably shows common words. Following this intuition, given a text pair (either question/comment or question/question), we define a feature vector whose dimensions reflect the following similarity metrics:
• Lexical Similarities: Cosine similarity, Jaccard coefficient (Jaccard, 1901) and containment measure (Broder, 1997) of n-grams of word lemmas (n = 1, 2, 3, 4 was used in all experiments); Longest common substring measure (Gusfield, 1997), Longest common subsequence measure (Allison and Dix, 1986), and Greedy String Tiling (Wise, 1996).
• Syntactic Similarities: Cosine similarity of n-grams of part-of-speech tags. It considers a shallow syntactic similarity (n = 1, 2, 3, 4 was used in all experiments); Partial tree kernel (Moschitti, 2006) between the parse tree of the sentences.
• Semantic Similarities: Cosine similarity between additive representations of word embeddings generated by applying word2vec (Mikolov et al., 2013) to the entire Qatar Living corpus from SemEval 2015. Five features are derived considering (i) only nouns, (ii) only adjectives, (iii) only verbs, (iv) only adverbs and (v) all the above words.
These metrics are computed in all the subtasks between the two elements within a pair, i.e., $q$ and $c_i$ for subtask A, $q$ and $o$ for subtask B, $o$ and $c_i$ for subtask C. In addition, in subtasks B and C, the similarity metrics (except the Partial Tree Kernel similarity) are computed between $o$ and the entire thread of $q$, concatenating $q$ with its answers. Similarities between $q$ and $o$ are also employed in subtask C.
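The n-gram based lexical similarities can be illustrated with a minimal Python sketch. It is not the KeLP implementation: the function names (`ngrams`, `cosine`, `jaccard`, `containment`, `lexical_features`) are hypothetical, the input is assumed to be already lemmatized, and the containment measure is assumed to be normalized by the n-grams of the first text.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(a, b):
    """Cosine similarity between the count vectors of two n-gram lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard coefficient between the sets of n-grams."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the n-grams of the first text found in the second
    (assumption: normalization by the first text)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def lexical_features(lemmas_a, lemmas_b, max_n=4):
    """Cosine, Jaccard and containment for n = 1..max_n (12 values for max_n=4)."""
    feats = []
    for n in range(1, max_n + 1):
        ga, gb = ngrams(lemmas_a, n), ngrams(lemmas_b, n)
        feats += [cosine(ga, gb), jaccard(ga, gb), containment(ga, gb)]
    return feats


question = ["where", "be", "the", "best", "beach"]
comment = ["the", "best", "beach", "be", "in", "doha"]
features = lexical_features(question, comment)
```

The resulting vector is the kind of input a linear kernel could operate on; the string-based measures (longest common substring, Greedy String Tiling, and so on) would add further dimensions in the same way.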

Inter-pair kernel methods
In subtasks A and C, some question types may have an expected answering form. Similarly, in subtask B, related questions may be characterized by the application of some latent paraphrasing rules. Such pairwise patterns cannot be captured by any intra-pair similarity feature, and require an alternative approach. Specific features may be manually defined, but this would require complex feature engineering.
One of the earliest approaches to automating relational learning between pairs of texts is (Moschitti et al., 2007; Moschitti, 2008). This approach was improved in several subsequent works (Severyn and Moschitti, 2012; Severyn et al., 2013a,b; Severyn and Moschitti, 2013; Tymoshenko et al., 2014; Tymoshenko and Moschitti, 2015), exploiting relational tags and linked open data. In particular, in (Filice et al., 2015b), we defined new inter-pair methods to directly employ text pairs in a kernel-based learning framework.
The kernels we proposed can be directly applied to subtask B and to subtasks A and C for learning question-question and question-answer pairwise patterns (see also Tymoshenko et al., 2016). As shown in Figure 1, a pair of sentences is represented as a pair of their corresponding shallow parse trees, where common or semantically similar lexical nodes are linked using a tagging strategy (which is propagated to their upper constituents). This method discriminates aligned sub-fragments from non-aligned ones, allowing the learning algorithm to capture relational patterns, e.g., the REL-best beach and the REL-best option. Thus, given two pairs of sentences $p_a = \langle a_1, a_2 \rangle$ and $p_b = \langle b_1, b_2 \rangle$, some tree kernel combinations can be defined:
$$\mathrm{PTK}^{+}(p_a, p_b) = \mathrm{PTK}(a_1, b_1) + \mathrm{PTK}(a_2, b_2)$$
$$\mathrm{PTK}^{\times}(p_a, p_b) = \mathrm{PTK}(a_1, b_1) \times \mathrm{PTK}(a_2, b_2)$$
where PTK is the Partial Tree Kernel (Moschitti, 2006). Tree kernels, computing the shared substructures between parse trees, are effective in evaluating the syntactic similarity between two texts. The proposed tree kernel combinations extend such reasoning to text pairs.
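To make the behaviour of the sum and product combinations concrete, the following Python sketch applies both to two text pairs. It assumes that each combination compares the first elements of the two pairs with each other and the second elements with each other, as in the kernels used for the individual subtasks. The base kernel `toy_kernel` is only a stand-in (a dot product of token counts), not the Partial Tree Kernel, and all names are hypothetical.

```python
from collections import Counter


def toy_kernel(x, y):
    """Stand-in for PTK: dot product of token counts (NOT a tree kernel)."""
    cx, cy = Counter(x), Counter(y)
    return float(sum(cx[t] * cy[t] for t in cx))


def sum_kernel(pa, pb, k=toy_kernel):
    """Sum combination: k(a1, b1) + k(a2, b2)."""
    (a1, a2), (b1, b2) = pa, pb
    return k(a1, b1) + k(a2, b2)


def product_kernel(pa, pb, k=toy_kernel):
    """Product combination: k(a1, b1) * k(a2, b2)."""
    (a1, a2), (b1, b2) = pa, pb
    return k(a1, b1) * k(a2, b2)


# Two pairs whose first elements match but whose second elements do not.
pair_a = (["best", "beach"], ["try", "the", "corniche"])
pair_b = (["best", "beach"], ["no", "idea"])

sum_value = sum_kernel(pair_a, pair_b)          # 2.0 + 0.0
product_value = product_kernel(pair_a, pair_b)  # 2.0 * 0.0
```

With this toy data the sum stays positive when only one side of the pairs matches, while the product drops to zero unless both sides match.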

Task Specific Features
While the features described so far can be effectively applied to any sentence pair modeling task, in this section, we describe features specifically developed for the cQA domain.
• Ranking Features: The ten questions related to an original question are retrieved using a search engine. We use their absolute and relative ranks as features for subtasks B and C (for the latter the question rank is given to all the comments within the related question thread).
• Heuristics: We adopt the heuristic features described in (Barrón-Cedeño et al., 2015), which can be applied to subtasks A and C.
In particular, forty-five features capture some comment characteristics such as its length, its category (Socializing, Life in Qatar, etc.), whether it includes URLs, emails, or other particular words, etc.
• Thread-based features: As discussed in (Barrón-Cedeño et al., 2015), comments in a common thread are strongly interconnected: users reply to each other and start a concrete discussion. We used some specific features for subtasks A and C that aim at capturing thread-level dependencies, such as whether a comment is part of a dialogue or whether a comment is followed by an acknowledgment from the user who asked the question.
• Stacking features: A good comment for a question $q$ should also be good for an original question $o$ if $q$ and $o$ are strongly related, i.e., $q$ is relevant or a perfect match to $o$. We thus developed a stacking strategy for Subtask C that uses the following scores in the classification step, w.r.t. an original question $o$ and the comment $c_i$ from the thread of $q$:
- $p_{q,c_i}$, which is the score of the pair $\langle q, c_i \rangle$ provided by the model trained on Subtask A;
- $p_{o,c_i}$, which is the score of the pair $\langle o, c_i \rangle$ provided by the model trained on Subtask A;
- $p_{o,q}$, which is the score of the pair $\langle o, q \rangle$ provided by the model trained on Subtask B.
Starting from these scores, we built the following features: (i) values and signs of $p_{q,c_i}$, $p_{o,c_i}$ and $p_{o,q}$ (6 feats); (ii) a boolean feature indicating whether both $p_{q,c_i}$ and $p_{o,q}$ are positive; (iii) min value $= \min(p_{q,c_i}, p_{o,q})$; (iv) max value $= \max(p_{q,c_i}, p_{o,q})$; (v) average value $= \frac{1}{2}(p_{q,c_i} + p_{o,q})$.
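The stacking features can be sketched in a few lines of Python. This is only an illustration of the feature list above: the function name `stacking_features`, the encoding of signs as +1/-1, the treatment of a zero score as negative, and the encoding of the boolean as 1.0/0.0 are all assumptions.

```python
def stacking_features(p_qc, p_oc, p_oq):
    """Build the 10 stacking features from three classifier scores.

    p_qc: score of (q, c_i) from the Subtask A model
    p_oc: score of (o, c_i) from the Subtask A model
    p_oq: score of (o, q)   from the Subtask B model
    """
    scores = [p_qc, p_oc, p_oq]
    feats = list(scores)                                  # (i) values
    feats += [1.0 if s > 0 else -1.0 for s in scores]     # (i) signs (zero treated as negative: assumption)
    feats.append(1.0 if p_qc > 0 and p_oq > 0 else 0.0)   # (ii) both positive
    feats.append(min(p_qc, p_oq))                         # (iii) min value
    feats.append(max(p_qc, p_oq))                         # (iv) max value
    feats.append(0.5 * (p_qc + p_oq))                     # (v) average value
    return feats


example = stacking_features(0.8, -0.3, 1.2)
```

In this example the comment is judged good for its own question and the two questions are judged related, so the boolean feature fires even though the direct score of the original question and the comment is negative.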

Tree Pruning Techniques
We propose to reduce the size of the input trees by removing all nodes and branches that are less discriminative for the task. To determine such fragments, we rely on a previously described supervised approach. After training a tree kernel, $K(\cdot, \cdot)$, on pairs of trees, the solution of the dual optimization problem is expressed as a linear combination of a subset of the training examples, i.e., the support vectors: $M = \{(\alpha_j, (a_j, b_j))\}$, where $(a_j, b_j) \in A \times B$ is a pair of parse trees ($a_j$ could be the tree of an original or related question and $b_j$ that of a related question or a comment, depending on the subtask) and the $\alpha_j$ are the coefficients of the combination. The classification of a new example $(a, b)$ is obtained as the sign of the score function $f()$:
$$f(a, b) = \sum_{j=1}^{|M|} \alpha_j \, K\big((a, b), (a_j, b_j)\big)$$
where $|M|$ is the number of support vectors, i.e., the number of elements of the set $M$. The higher the absolute value of the score of an example, the more confident the learning algorithm is in classifying it. We exploit this property to devise a strategy for determining the importance $w(n)$ of a node. Let $n$ be a node of a tree $t$, and let $\bar{n}$ be the proper sub-tree rooted at $n$, i.e., the tree composed of $n$ and all its descendants in $t$. We use the score of $\bar{n}$ with respect to $M$ to assess the importance of $n$:
$$w(n) = \sum_{j=1}^{|M|} \alpha_j \, K(\bar{n}, t_j) \qquad (2)$$
In order to be consistent, $t_j$ is the parse tree $a_j \in A$ if $n$ belongs to the first tree of a pair, so that only the first trees of the pairs $(a_j, b_j) \in M$ are used to compute $w(n)$. Conversely, if $n$ belongs to the second tree of a pair, $t_j$ is the parse tree $b_j \in B$. Now we can prune a tree on the basis of the importance $w(n)$ estimated by the model $M$ for each of its nodes and a user-defined threshold $h$. We prune a leaf node $n$ if $-h < w(n) < h$. If $n$ is not a leaf, then it is removed if all its children are going to be removed. Note that the threshold $h$ determines the number of pruned nodes. Our algorithm has a constraint: REL-tagged nodes are never pruned, regardless of their estimated importance. This is because a REL tag indicates that $a$ and $b$ share a common leaf in $\bar{n}$, which conveys useful information, e.g., for paraphrasing (Filice et al., 2015b).
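The pruning rule can be sketched as a recursive procedure. The Python code below is a minimal illustration under several assumptions: the importance $w(n)$ of every node is taken as precomputed and stored in a hypothetical `weight` field, REL-tagged nodes are recognized by a `REL-` label prefix, and a REL-tagged node is kept together with its whole sub-tree. The `Node` class and the function names are not part of KeLP.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    weight: float = 0.0                      # w(n), assumed to be precomputed
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, h: float) -> Optional[Node]:
    """Return a pruned copy of the sub-tree rooted at node, or None if it is removed."""
    if node.label.startswith("REL-"):
        return node                          # never pruned (kept with its sub-tree: assumption)
    if not node.children:
        # A leaf is pruned when its importance lies strictly inside (-h, h).
        return None if -h < node.weight < h else node
    kept = [c for c in (prune(child, h) for child in node.children) if c is not None]
    if not kept:
        return None                          # all children removed, so the node is removed
    return Node(node.label, node.weight, kept)


def to_string(node: Node) -> str:
    """Bracketed rendering of a tree, for inspection."""
    if not node.children:
        return node.label
    return "(" + node.label + " " + " ".join(to_string(c) for c in node.children) + ")"


tree = Node("S", children=[
    Node("NP", children=[
        Node("REL-NN", children=[Node("beach")]),
        Node("DT", children=[Node("the", 0.1)]),
    ]),
    Node("VP", children=[Node("VB", children=[Node("go", 1.5)])]),
])

pruned = to_string(prune(tree, 0.91))
```

With the toy weights above, the determiner branch is removed because its only leaf has low importance, while the REL-tagged branch and the high-importance verb survive. Raising `h` removes more leaves and, in turn, more of their ancestors.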

Submission and Results
We chose the parameters using the 2016 official test set as validation set, and we trained on the official train and development sets. In Subtask C, the stacking features (Section 2.3) need the scores provided by the models on subtasks A and B. Such scores are generated with a 10-fold cross validation. For the final submissions we used all the 2016 data (including the test set) as training data. We used the OpenNLP pipeline for lemmatization, POS tagging and chunking to generate the tree representations described in Section 2.2. All the kernel-based learning models are implemented in KeLP (Filice et al., 2015a). For all the tasks, we used the C-SVM learning algorithm (Chang and Lin, 2011). MAP@10 was the official metric. In addition, results are also reported in Average Recall (AvgR), Mean Reciprocal Rank (MRR), Precision (P), Recall (R), F1, and Accuracy (Acc).
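As an illustration of how a classification score turns into a ranking and how that ranking is evaluated, the following Python sketch computes a MAP@10-style measure. It is not the official scorer: the function names are hypothetical, and normalizing the average precision by the number of relevant items found in the top k is an assumption that may differ from the official implementation.

```python
def rank_by_score(scores, labels):
    """Sort the relevance labels by decreasing classifier score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_labels, k=10):
    """Average precision over the top-k positions of one ranked list."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_labels[:k], start=1):
        if relevant:
            hits += 1
            total += hits / rank     # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(rankings, k=10):
    """Mean of the average precision over all the queries."""
    return sum(average_precision(r, k) for r in rankings) / len(rankings)


# One thread with three comments: SVM scores and gold labels (True = good).
ranked = rank_by_score([0.2, 1.5, -0.3], [False, True, False])
map_value = mean_average_precision([ranked, [True, False, True]])
```

Only the order induced by the scores matters for this measure, which is why the classifiers can be trained for binary classification and still be used for re-ranking.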

Subtask A
Model: The learning model operates on question-comment pairs $p = \langle q, c \rangle$. The kernel is $\mathrm{PTK}^{+}(p_a, p_b) + \mathrm{LK}_A(p_a, p_b)$. This kernel linearly combines $\mathrm{PTK}^{+}(p_a, p_b) = \mathrm{PTK}(q_1, q_2) + \mathrm{PTK}(c_1, c_2)$ (see Section 2.2) with a linear kernel $\mathrm{LK}_A$ that operates on feature vectors including: (i) the similarity metrics between $q$ and $c$ described in Section 2.1; (ii) the heuristic features and (iii) the thread-based features discussed in Section 2.3. PTK uses the default parameters (Moschitti, 2006), while the best SVM regularization parameter we estimated was C = 1. This system is identical to the one we proposed in the previous year.
Results: Table 1 reports the results on subtask A. We confirmed the excellent results of 2016: the model is very accurate and achieved the first position among 13 systems in terms of MAP.
Table 2: Results on subtask B on the 2016 and 2017 official test sets. KeLP is our primary submission, while KC1 is the contrastive one. IR is the baseline system based on the search engine results.

Subtask B
Model: The proposed system operates on question-question pairs $p = \langle o, q \rangle$. The kernel is $\mathrm{PTK}^{\times}(p_a, p_b) + \mathrm{LK}_B(p_a, p_b)$, adopting the kernels defined in Section 2.2. The product in the $\mathrm{PTK}^{\times}$ combination acts like a logical and: when comparing two pairs, we want a strict match in which both elements of the first pair must be similar to their counterparts in the second pair. Conversely, in subtasks A and C, the adopted $\mathrm{PTK}^{+}(p_a, p_b)$ applies a sort of logical or, as we noticed that some forms of comments may be considered good (or bad) regardless of the question they are answering. We pruned the question trees according to the criterion described in Section 3. The best pruning threshold we estimated on the 2016 test set was h = 0.91. The previous year's model adopted the Smoothed Partial Tree Kernel (SPTK) (Croce et al., 2011) in place of the PTK. This year we decided to use the PTK, as our preliminary experiments did not justify the usage of the slower SPTK.
$\mathrm{LK}_B$ is a linear kernel that operates on feature vectors including: (i) the similarity metrics between $o$ and $q$, and between $o$ and the entire answer thread of $q$, as described in Section 2.1; (ii) the ranking features described in Section 2.3. With respect to the previous year's challenge, we did not include some features derived from subtask A, because in subsequent experiments they did not show a significant impact.
The best SVM regularization parameter estimated during the tuning stage is C = 1.
We made an additional (contrastive) submission in which the pruning is not applied.
Results: Table 2 shows the results on subtask B. On the official test set, our primary submission achieved the third position w.r.t. MAP among 13 systems. Unlike what we observed in the tuning stage, on the official test set the contrastive system achieves the highest MAP and would have ranked first in the challenge.
In general, the difference between the system accuracy obtained in 2016 and 2017 suggests that the two test sets are rather different. To verify this hypothesis, we performed a 10-fold cross validation using only the data from the 2017 test set. We kept the same pruning strategy and the weights computed on the 2016 training set, which we had applied to the entire 2017 test set for our official submission. We evaluated different pruning thresholds. Figure 2 reports the MAP averaged over the results of a 10-fold cross validation on the official 2017 test set (the 2016 dataset is not used at all for training the classifiers).
The results show that (i) our best system, with or without pruning, is less accurate than the submitted one (without pruning it produces a MAP of 46.29): this is reasonable since the model uses less training data; (ii) our pruning can improve our best system from 46.29 to 47.10 MAP.
Thus, it would seem that the difference between the 2016 and 2017 datasets plays an important role for the pruning approach, as removing some subtrees makes the TK approach more effective but probably also more specific to the data used for training the model. Another possible explanation is that it is easier to improve a weaker model, which uses less data. Finding out the properties of tree pruning is surely an interesting research line we would like to pursue in the future.
Table 3: Results on subtask C on the 2016 and 2017 official test sets. KeLP is our primary submission, while IR is the baseline system based on the search engine results.

Subtask C
Model: The learning model operates on the triplet $\langle o, q, c \rangle$, using the kernel $\mathrm{PTK}^{+}(p_a, p_b) + \mathrm{LK}_C(t_a, t_b)$, where $\mathrm{PTK}^{+}(p_a, p_b) = \mathrm{PTK}(o_1, o_2) + \mathrm{PTK}(c_1, c_2)$ (see Section 2.2) and $\mathrm{LK}_C$ is a linear kernel operating on feature vectors, which include: (i) the similarity metrics between $o$ and $c$, between $o$ and $q$, and between $o$ and the entire thread of $q$, as described in Section 2.1; (ii) the heuristic features, (iii) the thread-based features, (iv) the ranking features, and (v) the features derived from the scores of subtasks A and B, described in Section 2.3. PTK uses the default parameters. The subtask training data is rather imbalanced, as the number of negative examples is about 10 times the number of positive ones. We took this into account by setting the regularization parameter for the positive class to $C_p = \frac{\#\mathrm{negatives}}{\#\mathrm{positives}} C$, as in (Morik et al., 1999). The best SVM regularization parameter estimated during the tuning stage is C = 5. The system is identical to the one proposed the previous year.
Results: Table 3 shows the results for subtask C. Our primary submission achieved the third highest MAP among 5 systems. The large difference between the 2016 and 2017 MAP is mainly due to the much lower presence of relevant examples in the 2017 test set: more than 97% of the instances are irrelevant.
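The class-dependent regularization used for the imbalanced subtask C data can be shown as a small worked example in Python. The function name `positive_class_cost` and the label encoding (+1 for good comments, -1 otherwise) are assumptions made for illustration; the toy label list only mimics the roughly 10:1 ratio mentioned above and is not taken from the actual dataset.

```python
def positive_class_cost(labels, C):
    """Return C_p = (#negatives / #positives) * C for a list of +1/-1 labels."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    return (n_neg / n_pos) * C


# Toy data with ten negatives per positive, and C = 5 as in the tuning stage.
toy_labels = [1] + [-1] * 10
C = 5.0
C_p = positive_class_cost(toy_labels, C)   # errors on positives cost 10 times more
```

Scaling the cost of the positive class in this way makes a misclassified good comment weigh as much as the correspondingly larger number of misclassified negatives, which counteracts the bias towards the majority class.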