Cross-Pair Text Representations for Answer Sentence Selection

High-level semantic tasks, e.g., paraphrasing, textual entailment or question answering, involve modeling text pairs. Before the emergence of neural networks, this was mostly done using intra-pair features, which incorporate similarity scores or rewrite rules computed between the members of the same pair. In this paper, we compute scalar products between vectors representing similarities between members of different pairs, in place of simply using a single vector for each pair. This allows us to obtain a representation specific to any pair of pairs, which delivers the state of the art in answer sentence selection. Most importantly, our approach can outperform much more complex algorithms based on neural networks.


Introduction
Answer sentence selection (AS) is an important subtask of open-domain Question Answering (QA). Its input is a question Q and a set of candidate answer passages A = {A1, A2, ..., AN}, which may, for example, be the output of a search engine. The objective is to select the Ai, i ∈ {1, ..., N}, that contain correct answers.
Approaches to AS predating the deep learning renaissance typically addressed the task by modeling Q-to-A (intra-pair) similarities (Yih et al., 2013; Wang et al., 2007; Heilman and Smith, 2010; Wang and Manning, 2010). Q-to-A similarity and alignment are indeed crucial but, in practice, it is very difficult to automatically extract meaningful relations between Q and A. For example, consider the two positive Q/A pairs in Table 1. If we want to learn a model based only on the intra-pair Q-to-A matches, simple lexical matching (marked with italics) will not be enough. One would need to conduct more complex processing and identify that movie and film are synonyms, and that the n-grams play in the movie or be in the movie can be paraphrased as star. While the former can be easily detected using an external lexical resource, e.g., WordNet (Fellbaum, 1998), the latter would require more complex inference.
On the other hand, Q1 and Q2 contain the same pattern who ... in the movie ..., and their respective answers contain film ... starring .... If we know that P1 = (Q1, A1) is a positive AS example and want to classify P2 = (Q2, A2), then high Q2-to-Q1 and A2-to-A1 cross-pair similarities can suggest that P1 and P2 are likely to have the same label. This idea, for example, was exploited by Severyn and Moschitti (2012), whose system measures syntactic-semantic similarities directly between the structural syntactic tree representations of Q1/Q2 and A1/A2. This model still exhibited state-of-the-art performance in 2016 (Tymoshenko et al., 2016a).
Deep neural networks (DNNs) also naturally use such cross-pair similarity when modeling two input texts, and then further combine it with intra-pair similarity, for example, by means of attention mechanisms (Shen et al., 2017), compare-aggregate architectures (Bian et al., 2017; Wang and Jiang, 2017), or fully-connected layers (Severyn and Moschitti, 2015; Rao et al., 2016).
In this work, we observe that: (i) the high accuracy of the kernel model by Severyn and Moschitti (2012) was due not only to the use of syntactic structures, but also to the use of cross-pair similarities; and (ii) the success of DNNs for QA can be partially attributed to an implicit combination of cross-and intra-pair similarity.
More specifically, we investigate whether simple similarity metrics, e.g., cosine similarity between standard vector representations, can perform competitively with state-of-the-art neural models when employed as cross-pair kernels.

Table 1: Question/answer sentence pairs from the WikiQA corpus. We use italics to mark intra-pair lexical matches between Q1 and A1, and between Q2 and A2, and bold to mark cross-pair matches between Q1 and Q2, and between A1 and A2.
To this end, we apply linear and cosine kernels to the Qi/Qj and Ai/Aj pairs (i, j = 1, ..., N), represented as bags-of-words (BoW) or as averages of their pretrained word embeddings. Then, we combine them with the cross-pair Tree Kernels (TKs) and with kernels applied to the traditional Q/A intra-pair similarity feature vector representations in a composite kernel, and use it in an SVM model.
We experiment with three reference datasets, WikiQA (Yang et al., 2015), TREC13 (Yao et al., 2013; Wang et al., 2007) and SemEval-2016, Task 3.A (Nakov et al., 2016), using a number of lexical-overlap/syntactic kernels. The latter challenge refers to a community question answering (cQA) task, which consists in reranking the responses to user questions from online forums. The setting is the same as in AS, but the text of questions and answer sentences can be ungrammatical due to the nature of online forum language.
We obtain competitive results on the WikiQA and SemEval tasks, showing that: (i) simple BoW representations, when used in cross-pair kernels, perform comparably to and even outperform hand-crafted intra-pair features; (ii) in cQA, simple cross-pair embedding- and BoW-based similarity features outperform domain-specific similarity features hand-crafted from intra-pair members, and also perform comparably to syntactic TKs; and (iii) a combination of simple cosine intra- and cross-pair kernels with TKs can outperform the most recent state-of-the-art DNN architectures.
Assuming the conjecture of our paper is correct, i.e., that cross-pair modeling is the major contribution of neural networks, the last point above is not surprising: on relatively small datasets, kernel-based models can exploit syntactic information very effectively, while neural models cannot.
The paper is structured as follows. We describe the kernels incorporating intra- and cross-pair matches in Sec. 3.2, list the simple cross- and intra-pair features in Sec. 3.3, describe strong hand-crafted baseline features in Sec. 4, and report the experimental results in Sec. 5.

Related work
Early approaches to AS typically focused on modeling intra-pair Q-to-A alignment similarities. For example, Yih et al. (2013) proposed a latent alignment model that employed lexical-semantic Q-to-A alignments, Wang et al. (2007) modeled syntactic alignments with a probabilistic quasi-synchronous grammar, and Heilman and Smith (2010), Yao et al. (2013), and Wang and Manning (2010) employed Tree Edit Distance-based Q-to-A alignments.
The idea of cross-pair similarity was originally proposed by Zanzotto and Moschitti (2006) and applied to the recognizing textual entailment task, which consists in detecting whether a text T entails a hypothesis H. They assumed that if two H/T pairs (H1, T1) and (H2, T2) share the same T-to-H "rewrite rules", they are likely to share the same label. Based on this idea, they proposed an algorithm applying TKs to the (H1, H2) and (T1, T2) syntactic tree representations, enriched with H-to-T intra-pair rewrite rule information. More concretely, the algorithm aligns the constituents of H with those of T and then marks them with symbols directly in the trees. This way, the alignment information can be matched by tree kernels applied to cross-pair members.
Then, a line of work on AS, started by Severyn and Moschitti (2012), was inspired by a similar idea of incorporating "rewrite rules" directly into the tree representations of Q1/A1 and Q2/A2. They represent Q and A as syntactic trees enhanced with Q-to-A relational information, and apply TKs (Moschitti, 2006) to (Q1, Q2) and (A1, A2). Thus they model cross-pair similarity and learn important patterns occurring in Q and A separately. As shown in (Tymoshenko et al., 2016a), this approach is competitive with convolutional neural networks (CNNs).
In our approach, instead of using only one TK, we employ a number of different word-based kernels, most of which can be computed more efficiently than TKs.
Most recent AS models are based on Deep Neural Networks (DNNs), which learn distributed representations of the input data. DNNs are trained to apply a series of non-linear transformations to the input Q and A, represented as compositions of word or character embeddings. DNN architectures learn AS-relevant patterns using intra-pair similarities as well as cross-pair, Q-to-Q and A-to-A, similarities when modeling the input texts. For example, the CNN by Severyn and Moschitti (2015) has two separate embedding layers for Q and A, which are followed by the respective convolution layers, whose output is concatenated and then passed through a final fully-connected joint layer. The weights in the Q and A convolution layers are learned by backpropagation on the training Q/A pairs. Thus, classifying a new Q/A pair is partially equivalent to performing an implicit cross-pair Q-to-Q and A-to-A comparison.
We believe that the ability of DNNs to implicitly capture cross-pair relational matching, i.e., the capacity of learning from (Q1, Q2) and (A1, A2), is a very important factor in their high performance. This is of course paired with their ability to learn non-linear patterns and to capture Q-to-A relatedness by means of attention mechanisms. It should be noted that the latter are typically hard-coded in kernel models as lexical matching/similarity (Severyn and Moschitti, 2012), which, at least on standard-size datasets, is as effective as the attention approach, also in neural models (Severyn and Moschitti, 2016).
In our work, we model Q-to-A, Q-to-Q and A-to-A similarities with intra- and cross-pair kernels and show that such a combination also exhibits state-of-the-art performance on the reference corpora. In addition, our approach can be applied to smaller datasets, as it uses fewer parameters, and can provide insights for future DNN design.
3 Cross-pair similarity kernels for text

Background on Kernel Machines
Kernel Machines (KMs) allow for replacing the dot product with kernel functions directly applied to examples, i.e., they avoid explicitly mapping examples into vectors. The main advantage of KMs is a much lower computational complexity than the explicit dot product, as the kernel computation does not depend on the size of the feature space.
KMs are linear classifiers: given a labeled training dataset S = {(x_i, y_i) : i = 1, . . . , n}, their classification function can be defined as

f(x) = sign(w · x + b) = sign( Σ_{i=1..n} α_i y_i (x_i · x) + b ),

where x is a classification example, w is the gradient of the separating hyperplane, and b its bias. The equation shows that the gradient is a linear combination of the training points x_i ∈ R^n multiplied by their labels y_i ∈ {−1, 1} and their weights α_i ∈ R^+. Note that the latter are different from zero only for the support vectors: this reduces the classification complexity, which will be lower than O(n) for each example.
We can replace the scalar product with a kernel function directly defined over a pair of objects,

K(o_1, o_2) = φ(o_1) · φ(o_2),

where φ : O → R^n maps objects to vectors of the final feature space. The new classification function becomes

f(o) = sign( Σ_{i=1..n} α_i y_i K(o_i, o) + b ),

which only needs the initial input objects.
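To make the dual formulation concrete, the sketch below implements the kernelized decision function with an illustrative Gaussian kernel; the support vectors, weights and bias are toy values for illustration, not learned ones.

```python
import math

def rbf(x, z, gamma=1.0):
    # Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2), used here
    # only as a stand-in for any valid kernel function.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def km_classify(x, support, alphas, labels, b, kernel=rbf):
    # Dual-form decision function: sign(sum_i alpha_i y_i K(x_i, x) + b).
    # Only the support vectors (alpha_i > 0) contribute to the sum.
    s = sum(a * y * kernel(xi, x) for xi, a, y in zip(support, alphas, labels))
    return 1 if s + b >= 0 else -1
```

A point closer to the positive support vector receives a larger kernel value against it and is thus classified as positive.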

Inter- and intra-pair match kernels
We cast AS as a text pair classification task: given a pair P = (Q, A), constituted by a question Q and a candidate answer sentence A, we classify it as either correct or incorrect. We use KMs, where K(·, ·) operates on two pairs, P1 = (Q1, A1) and P2 = (Q2, A2).

Intra-pair similarity
A traditional baseline approach would (i) represent Q/A pairs as feature vectors, whose components are similarity metrics applied to Q and A, e.g., a word overlap-based similarity; and (ii) train a classification model, e.g., an SVM, using the following kernel:

K_IP(P1, P2) = K_v(V_{Q1,A1}, V_{Q2,A2}),   (1)

where K_v can be any kernel operating on feature vectors, e.g., the polynomial or linear (as in our work) kernel. V_{T1,T2} is a vector built on N similarity features, f_1(·, ·), f_2(·, ·), ..., f_N(·, ·), extracted by applying similarity metrics to two texts, T1 and T2 (see Sec. 3.3 for the list of the similarity metrics we used). K_IP merely uses intra-pair similarities.

Cross-pair similarity
We incorporate the intuition that similar questions are likely to demand similar answer patterns by means of a cross-pair kernel, which measures similarity between the questions and the answers of P1 and P2 as follows:

K_CP(P1, P2) = V_{Q1,Q2} · V_{A1,A2} = Σ_{i=1..N} f_i(Q1, Q2) f_i(A1, A2),   (2)

where each f_i(·, ·) becomes a kernel that takes the (Q1, Q2) or (A1, A2) pairs as input. In other words, V_{Q1,Q2} · V_{A1,A2} is a sum of products of the f_i(·, ·) kernels applied to the (Q1, Q2) and (A1, A2) pairs. K_CP is a valid kernel if the similarity metrics used to compute the f_i(·, ·) are valid kernel functions. Finally, combining K_IP and K_CP enables learning two different kinds of valuable cross- and intra-pair AS patterns. We combine various K_IP and K_CP by summing them or by training a meta-classifier on their outputs (see Section 5.4 for more details). Figure 1 summarizes the differences between the K_IP and K_CP computation processes described above.
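As an illustration, the following minimal sketch computes K_IP (Eq. 1, with a linear K_v) and K_CP (Eq. 2) for two Q/A pairs, with a toy Jaccard word-overlap similarity standing in for the actual f_i features of Sec. 3.3.

```python
def overlap(t1, t2):
    # Toy similarity feature f_i: Jaccard word overlap between two texts.
    # The real features (Sec. 3.3) are cosine similarities and tree kernels.
    w1, w2 = set(t1.split()), set(t2.split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def k_ip(p1, p2, feats):
    # Intra-pair kernel (Eq. 1): linear kernel K_v between the similarity
    # vectors V_{Q1,A1} and V_{Q2,A2}.
    v1 = [f(p1[0], p1[1]) for f in feats]
    v2 = [f(p2[0], p2[1]) for f in feats]
    return sum(a * b for a, b in zip(v1, v2))

def k_cp(p1, p2, feats):
    # Cross-pair kernel (Eq. 2): sum_i f_i(Q1, Q2) * f_i(A1, A2).
    return sum(f(p1[0], p2[0]) * f(p1[1], p2[1]) for f in feats)
```

Note how k_cp compares question with question and answer with answer, whereas k_ip only compares each question with its own answer.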

Similarity features
We employ three similarity feature types as f_i(·, ·). Two of them are computed using the cosine similarity metric and differ only in the representations of the input texts, T1 and T2. The third type is constituted by TKs applied to the structural representations of T1 and T2. Note that, since cosine similarity and TKs are valid kernels, K_CP is also guaranteed to be a valid kernel when computed using these similarity features.

Figure 1: Feature extraction schema for two Q/A pairs, P1 and P2. V_{T1,T2} is a vector of similarity features extracted for a pair of texts T1, T2; the dashed boxes show from which pair of input texts they are extracted. K_CP(P1, P2) = V_{Q1,Q2} · V_{A1,A2} (cross-pair kernel); K_IP(P1, P2) = K_v(V_{Q1,A1}, V_{Q2,A2}) (intra-pair kernel).

Bag-of-n-grams overlap (B)
f_B^{t,l,s}(T1, T2) is the cosine similarity metric applied to the bag-of-n-grams vector representations of T1 and T2. The {t, l, s} index describes an n-gram representation configuration: t denotes whether the n-grams are assembled from word lemmas (L), their part-of-speech tags (POS), or lemmas concatenated with their respective POS tags (L_POS); l is a (n1, n2) tuple, with n1 and n2 being the minimal and maximal length of the n-grams considered, respectively; and s is YES if the representation discards stopwords and NO otherwise.
We used {t, l, s} configurations from a fixed set C with |C| = 23, which means we have 23 similarity features f_B^{t,l,s}(T1, T2) in total in the intra-pair setting. The respective cross-pair kernel is a composite kernel summing 23 products of cosine kernels applied to the 23 different (Q1, Q2) and (A1, A2) bag-of-n-gram representations.
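A single bag-of-n-grams feature can be sketched as follows; here t is plain tokens, l = (1, 2) and s = NO, a hypothetical configuration for illustration rather than any of the 23 actually used.

```python
import math
from collections import Counter

def ngram_bag(tokens, n_min=1, n_max=2):
    # Bag-of-n-grams representation: counts of all n-grams of length
    # n_min..n_max over the token sequence.
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[tuple(tokens[i:i + n])] += 1
    return bag

def cosine(b1, b2):
    # Cosine similarity between two sparse bag-of-n-grams vectors.
    dot = sum(v * b2.get(k, 0) for k, v in b1.items())
    n1 = math.sqrt(sum(v * v for v in b1.values()))
    n2 = math.sqrt(sum(v * v for v in b2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Each of the 23 configurations yields one such cosine feature; in the cross-pair kernel, the feature is evaluated on (Q1, Q2) and on (A1, A2) and the two values are multiplied.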

Embedding-based similarities (E).
We represent an input text as the average of the embeddings of its lemmas in a pre-trained word embedding model. Then, the embedding feature f_E^{model}(T1, T2) is the cosine kernel applied to the embedding-based representations of T1 and T2. We use two kinds of pretrained embeddings, Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014): two Word2Vec models trained on different corpora and one GloVe model, resulting in three embedding-based features (see Sec. 5 for more technical details).
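A sketch of the embedding feature, with a toy two-dimensional embedding dictionary standing in for the pretrained Word2Vec/GloVe models:

```python
import math

def avg_embedding(tokens, emb):
    # Average the vectors of the tokens covered by the embedding model;
    # returns None if no token is covered.
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def f_e(t1, t2, emb):
    # Embedding feature: cosine between the averaged representations.
    v1, v2 = avg_embedding(t1, emb), avg_embedding(t2, emb)
    if v1 is None or v2 is None:
        return 0.0
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

With real models, emb would be loaded from the pretrained vectors and tokens would be lemmas, as in the paper.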

Tree-kernel based similarities
Following the framework defined in (Tymoshenko et al., 2016a), we represent T1 and T2 as syntactico-semantic structures and use TKs as semantic similarity metrics. When computing K_CP with TKs as similarities in Eq. 2, we employ summation instead of multiplication, following the earlier work.
More specifically, we represent T1 and T2 as (i) constituency trees, to which we apply the subset tree kernel (SST), or (ii) shallow chunk-based trees, similar to the one presented in Figure 2, to which we apply the partial tree kernel (PTK). In the shallow trees, lemmas are leaves and POS tags are pre-terminals; POS nodes are grouped under chunk nodes, and these under the sentence nodes. These representations also encode some intra-pair similarity information, e.g., the prefix REL denotes a lexical Q-to-A match: in a structural representation, we prepend it to the parent and grandparent nodes of lemmas that occur in both Q and A, e.g., "Mary" in the first example of Table 1.
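The REL marking can be sketched as below on trees encoded as (label, children) tuples with lemma strings as leaves. Note one simplification: this sketch marks every ancestor of a matched lemma, whereas the paper marks only the parent (POS) and grandparent (chunk) nodes of the shallow tree.

```python
def mark_rel(node, shared):
    # node: (label, children); children are (label, children) tuples or
    # lemma strings (leaves). shared: lemmas occurring in both Q and A.
    # Returns the marked tree and whether a shared lemma occurs below.
    label, children = node
    hit, marked = False, []
    for child in children:
        if isinstance(child, str):
            hit = hit or child in shared
            marked.append(child)
        else:
            m, h = mark_rel(child, shared)
            hit = hit or h
            marked.append(m)
    return ("REL-" + label if hit else label, marked), hit
```

The marked labels then let the tree kernels match relational patterns, not just bare syntactic structure, when applied to (Q1, Q2) and (A1, A2).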
Then, for factoid QA (WikiQA and TREC13 are the factoid AS datasets, as their questions ask for a specific fact, e.g., a date or a name), we mark focus words in Q and entities in A, if the answer contains any named entities of types matching the question expected answer type (EAT). For example, the PERson named entity type matches the HUMan EAT; more specifically, we employ the following NER-to-EAT matching rules: PERson, ORGanization → HUMan; LOCation → LOCation; DATE, TIME, MONEY, PERCENTAGE, DURATION, NUMBER, SET → NUM; ORGanization, PERson, MISCellaneous → ENTiTY, using the (Li and Roth, 2002) question classes. We mark the semantic Q-to-A match by prepending the REL-FOCUS-<EAT> label to the answer chunk nodes that contain such named entities and also to the question focus word, where <EAT> stands for the EAT label. For example, in the Q1/A1 pair in Table 1, the Q1 EAT is HUMan, and the matching named entities include "Julie Andrews", "David Tomlinson" and others. Figure 2 depicts Q1 annotated with both REL- and REL-FOCUS links. We detect both the question focus and the EAT automatically. Due to space limitations, we do not describe the structural representations and matching algorithms in more detail, but refer the reader to the works above.

Table 2: Strong baseline similarity features: (i) cosine similarity applied to the BoW representations of T1 and T2 in terms of word lemmas and bi-, tri- and four-grams (computed twice, with and without stopwords), POS tags, and dependency triplets; (ii) the longest common string subsequence measure, with and without stopwords; (iii) the Jaccard similarity metric applied to uni-, bi-, tri- and four-grams, with and without stopwords; (iv) the word n-gram containment measure on uni- and bi-grams, with and without stopwords (Broder, 1997); (v) greedy string tiling (Wise, 1996) with a minimum matching length of 3; (vi) string kernel similarity (Lodhi et al., 2002); (vii) expected answer type match: the percentage of named entities (NEs) in the answer passage compatible with the question class; (viii) WordNet-based similarity: the T1/T2 common lemma/synonym/hypernym overlap ratio; (ix) PTK (Moschitti, 2006) similarity between constituency or dependency tree representations of the input texts.

Strong baseline feature vector
As a strong baseline, we use similarity feature vectors with the intra-pair K_IP kernel.
For the factoid answer sentence selection task, we use the 47 strong features listed in Table 2. This is a compilation of the features used in the top-performing system at the SemEval-2012 Semantic Text Similarity workshop (Bär et al., 2012) and in earlier factoid QA work (Severyn and Moschitti, 2012), extended with a few additional features.
For the community question answering (cQA) task, we instead employ a combination of similarity-based and thread-level features shown to be very effective for cQA (Nicosia et al., 2015; Barrón-Cedeño et al., 2016). We use the exact feature combination from (Barrón-Cedeño et al., 2016), which includes both lexical and syntactic similarity measures (cosine similarity of bags of words, PTK similarity over syntactic tree representations of the input texts) and thread-level domain-specific features (are the question and comment authored by the same person?, does the comment contain any questions?, and so on).
We cannot directly use these feature vectors in the K_CP kernel, as not all the functions used to compute the features are valid kernels, e.g., the longest common string subsequence is not a kernel function. Moreover, some of them can be computed only on (Q, A) pairs, e.g., the expected answer type match feature (vii) in Tab. 2, or many of the cQA domain-specific features.

Experiments
We conduct experiments on three corpora, namely TREC13, WikiQA and SemEval, and evaluate the results in terms of Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). Our code is available at https://github.com/iKernels/RelTextRank.
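For reference, MAP and MRR over per-question lists of ranked relevance labels (1 for a correct answer, 0 otherwise) can be computed as:

```python
def average_precision(ranked_labels):
    # AP for one question: mean of precision@k over the relevant positions.
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_labels, 1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def mean_ap(all_ranked):
    # MAP: mean of the per-question average precisions.
    return sum(average_precision(r) for r in all_ranked) / len(all_ranked)

def mrr(all_ranked):
    # MRR: mean of 1/rank of the first correct answer per question.
    rr = []
    for r in all_ranked:
        rank = next((k for k, rel in enumerate(r, 1) if rel), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```

Both metrics are computed per question and averaged; questions with no correct answer contribute zero, which is why the all− questions are removed before evaluation (see Sec. 5.3).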

Datasets
WikiQA dataset. WikiQA (Yang et al., 2015) is a factoid answer sentence selection dataset whose questions come from Bing query logs. Candidate answer sentences are extracted from Wikipedia and labeled manually. Some of the questions have no correct answer sentences (all−) or only correct answer sentences (all+). Table 3 reports the statistics of the WikiQA corpus as distributed (raw), without the all− questions, and without both the all− and all+ questions (clean). We train in the "no all−" mode using 10 answer sentences per question (this limit speeds up training without loss in performance) and test in the "clean" mode.

TREC13 dataset. A factoid answer sentence selection dataset originally presented in (Wang et al., 2007), also frequently called QASent (Yang et al., 2015); we use the version distributed by (Yao et al., 2013) at https://code.google.com/p/jacana/. We train on 1,229 automatically labeled TREC8-12 questions, using only 10 candidate answer sentences per question. We test in the "clean" setting defined in (Rao et al., 2016), i.e., we discard the all+ and all− questions, resulting in 65 DEV and 68 TEST questions, which contain 1,117 and 1,442 associated candidate answer sentences, respectively.

SemEval-2016, Task 3.A dataset. A benchmark cQA dataset from the SemEval-2016 Task 3.A question-to-comment similarity competition. It is a collection of user questions and the respective answer comment threads from the Qatar Living forum, where the user comments were manually labeled as correct or incorrect. Each question has 10 candidate answers. The training, dev and test sets contain 1,790, 244 and 327 questions, respectively. The AS task consists in reranking the comments with respect to the question: most questions are non-factoid and the text is often noisy.

Models
We use the following notation:

B, E. Intra-pair K_IP kernels (see Sec. 3.2) using the eponymous similarity features from Sec. 3.3.

V. Linear kernel applied to the strong intra-pair feature vector representation defined in Section 4. Note that, as mentioned in Sec. 4, due to the slightly different nature of the factoid and community question answering tasks, we use different strong feature groups for WikiQA and TREC13 (Table 2) and for SemEval-2016.

B_cr, E_cr. Cross-pair K_CP kernels applied to the B and E similarity feature vectors, respectively. More specifically, B_cr and E_cr are sums of 23 and 3 cross-pair kernel products, respectively (see Eq. 2 and Sec. 3.2).

PTK, SST. The cross-pair PTK and SST tree kernels applied to the shallow chunk- and constituency-based representations (see Sec. 3.3).

"+". Kernel summation: we sum the Gram matrices of the distinct standalone kernels and use the resulting kernel matrix as input to the SVMs.

META_BASE;PTK, META_BASE;SST. Logistic regression meta-classifiers trained on the outputs of two standalone systems, namely (i) V+B_cr+E_cr+E (denoted BASE to simplify the notation), and (ii) PTK or SST, respectively. We ran 10-fold cross-validation on the training set and used the resulting predictions as training data for the ensemble classifier. We did not use the development or training sets for any parameter tuning, thus we report the results on both the DEV and TEST sets.

SUM_BASE;PTK, SUM_BASE;SST. Simple meta-classifiers summing the outputs of the BASE and PTK or SST systems, respectively.
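The "+" combination can be sketched as an element-wise sum of precomputed Gram matrices; since a sum of valid kernels is itself a valid kernel, the result can be passed to an SVM trained with a precomputed kernel (e.g., scikit-learn's SVC(kernel='precomputed'), as in our experiments).

```python
def sum_grams(grams):
    # "+" combination: element-wise sum of the standalone kernels' Gram
    # matrices (each an n x n list of lists over the same examples).
    n = len(grams[0])
    return [[sum(g[i][j] for g in grams) for j in range(n)]
            for i in range(n)]
```

Each standalone Gram matrix (e.g., for B_cr, E_cr, or a TK) is computed once over the training pairs and then combined without retraining the individual kernels.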

Toolkits
We trained the models using scikit-learn (Pedregosa et al., 2012), with the SVC version of SVM, precomputed K_IP and K_CP kernel matrices, and default parameters. We trained the ensemble model using the scikit-learn LogisticRegression implementation with default parameters. We used the spaCy library and scikit-learn to obtain the bag-of-n-gram representations for the B similarity features and to compute the B- and E-based Gram matrices.
We used the RelTextRank framework (Tymoshenko et al., 2017b) to generate the structural representations for the TK similarity features and to extract the strong baseline feature vectors from Sec. 4. We used KeLP (Filice et al.) to compute the TK Gram matrices.
Regarding the embedding-based similarities (E), we obtain three similarity features by using three word embedding models to generate the representations of the input texts, T1 and T2: GloVe vectors trained on Common Crawl data, Word2Vec vectors pre-trained on Google News, and another Word2Vec model pre-trained on Aquaint plus Wikipedia. Table 4 reports the results obtained with the intra- and cross-pair kernels K_IP and K_CP and their combinations. In the following, we describe the results according to the model categories above.

Results and discussion
Intra-pair kernels. Taking into account intra-pair similarity is the standard approach in the majority of previous non-DNN work. In our experiments, we implement this approach as K_IP using the B, E and V groups of similarity features. K_IP performs worse than the state-of-the-art (SoA) DNN systems on all the datasets (see Tables 5, 6 and 7 for the SoA systems).
The results on WikiQA are particularly low: even the best K_IP system, B+V+E, scores up to 15 points below the state of the art. This confirms the observation of Yang et al. (2015) that, given how WikiQA was built, simple word matching methods are likely to underperform on its data. Nevertheless, despite its simplicity, B+V+E performs comparably to the Yang et al. (2015) reimplementation of LCLR, a complex latent structured approach employing rich lexical and semantic intra-pair similarity features (Yih et al., 2013): Yang et al. (2015) report that on WikiQA LCLR obtains an MRR of 60.86 and a MAP of 59.93.
Then, on TREC13 and SemEval-2016, the intra-pair V, V+E and B+V+E kernels exhibit rather high performance; however, they are still significantly below the state of the art, thus confirming our hypothesis that intra-pair similarity alone does not guarantee top results.
Cross-pair kernels. B_cr and E_cr obtain rather high results on WikiQA and SemEval. On WikiQA, both B_cr and B_cr+E_cr outperform all the intra-pair kernels by a large margin, while, on SemEval, they perform comparably to the manually engineered domain-specific V features of Nicosia et al. (2015). In contrast, on TREC13, V outperforms both B_cr and E_cr, showing that TREC13 is indeed biased towards intra-pair relatedness features by construction.
The more complex PTK and SST cross-pair kernels, both alone and combined with B_cr and E_cr, typically outperform the standalone B_cr and E_cr on all the corpora (PTK on TREC13 and WikiQA, and SST and B_cr+E_cr+PTK on SemEval). This can be explained by the fact that PTK and SST are able to learn complex syntactic patterns and also encode some information about intra-pair relations, namely the REL labels described in Sec. 3.2; thus, it is natural that they outperform the simpler cross-pair kernels. Nevertheless, on WikiQA-DEV, B_cr+E_cr performs very close to PTK. Moreover, on SemEval, B_cr+E_cr outperforms PTK and is behind SST by less than 1 MAP point. This can be explained by the fact that Q and A in SemEval are frequently ungrammatical, as the cQA corpus is collected from online forums.
Finally, note that the B cr +E cr +PTK system, which does not use any cQA domain-specific features, is only 0.56 MAP points behind KeLP, the best-performing system in the SemEval competition (see Line 1 of Table 7).
Kernels combining the intra- and cross-pair similarities. The V+B_cr+E_cr+E combination (we will refer to it as BASE) outperforms the standalone domain-specific handcrafted cQA features, V, and both PTK and SST on SemEval-2016 TEST and DEV by at least 2.3 points in all metrics. Moreover, V+B_cr+E_cr+E is less than 0.5 points behind the #1 system of the SemEval-2016 competition (see Tab. 7). We recall that V+B_cr+E_cr+E only uses basic n-gram overlap-based cross- and intra-pair similarity features and embedding-based cosine similarities.
Finally, when we add tree kernel models to the combination, i.e., V+B cr +E cr +E+PTK or V+B cr +E cr +E+SST, we note improvement for Se-mEval and TREC13 tasks.
Ensemble models. We ensemble the cross- and intra-pair kernel-based models by summing the predictions of the standalone SVM classifiers (SUM models) or by training a logistic regression meta-classifier on them (META models). We build the meta-classifiers on the outputs of the standalone BASE system and the TKs, namely PTK and SST. The "Ensemble" section of Table 4 shows that the meta-system combinations mostly outperform the standalone kernels.
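The SUM ensemble can be sketched as follows; the candidate lists and scores below are illustrative, standing in for the decision values of the trained BASE and TK SVMs.

```python
def sum_ensemble(base_scores, tk_scores):
    # SUM ensemble: add the decision scores of the BASE and TK systems
    # for the same list of candidate answers.
    return [a + b for a, b in zip(base_scores, tk_scores)]

def rerank(candidates, scores):
    # Rank the answer candidates of one question by (combined) score.
    return [c for c, _ in sorted(zip(candidates, scores),
                                 key=lambda p: p[1], reverse=True)]
```

The META variant replaces the fixed sum with a logistic regression over the two scores, with its weights learned from cross-validated predictions on the training set.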
In general, combining cross-pair and intra-pair similarities (with kernel sums or meta-classifiers) provides state-of-the-art results without using deep learning. Additionally, the outcome is deterministic, while DNN accuracy may vary depending on the type of hardware used or the random initialization parameters (Crane, 2018).

Comparison with the state of the art
Tables 5, 6 and 7 report the performance of the most recent state-of-the-art systems on WikiQA, TREC13 and SemEval in comparison with our best results. We discuss them with respect to the different datasets.
WikiQA. As already mentioned, WikiQA contains many questions without a correct answer (see Tab. 3). When evaluated on the full data, even an oracle system would achieve at most 38.38 points of MAP. Moreover, as originally observed in (Wang et al., 2007), questions that have no correct answers or no incorrect answers are not useful for comparing the performance of different answer sentence selection systems. Therefore, they are typically removed from WikiQA and TREC13 before the evaluation.
There has been some discrepancy in the community when evaluating on WikiQA. The original baselines proposed for the corpus in (Yang et al., 2015) were evaluated in the "clean" setting, and we also evaluate in the "clean" setting.

Tables 5 and 6: comparison of our models with recent state-of-the-art systems on WikiQA and TREC13 in terms of MRR and MAP, including LCLR (the Yang et al. (2015) implementation of Yih et al. (2013)), IWAN-att (Shen et al., 2017), BIMPM, the C/A k-threshold and C/A-listwise systems (Bian et al., 2017), HyperQA (Tay et al., 2018), AI-CNN (Zhang et al., 2017), and (Tymoshenko et al., 2016b).

However, the performance of the most recent state-of-the-art systems listed in Tab. 5 is reported in the "no all−" setting in the respective papers, i.e., they keep the all+ questions (we deduced this from the corpus statistics reported by the authors: they all report having 243 test questions, which corresponds to the "no all−" setting). Thus, they have 6 extra questions always answered correctly by default. To account for this discrepancy, in Tab. 5 we report the results in both settings; it is trivial to convert the performance figures from one setting to the other, and we mark the converted results in italics. Our SUM_BASE;PTK system (i) outperforms all the state-of-the-art systems in terms of MRR, including sophisticated architectures with attention, such as IWAN-att (Shen et al., 2017), and compare-aggregate (C/A) frameworks (Wang and Jiang, 2017; Bian et al., 2017); and (ii) has the same MAP as (Bian et al., 2017). Obviously, this improvement is not statistically significant with respect to the C/A systems of Bian et al. (2017). Nevertheless, ours is a very promising result, considering that we only use linear models with simple kernels and do not tune any learning parameters of such models.
TREC13. As shown in Tab. 6, our models do not outperform the state of the art on TREC13, but they still perform comparably to the recent DNN HyperQA model (Tay et al., 2018). In general, our model is behind the state-of-the-art IWAN-att system by 4.55 MAP points. Note, however, that the TREC13 test set contains only 68 questions, so this difference in performance is not likely to be statistically significant.
SemEval. Table 7 compares the performance of the B_cr+E_cr+V+E+SST system on SemEval to that of KeLP and ConvKN, the two top systems in the SemEval-2016 competition, and also to the performance of the recent DNN-based HyperQA and AI-CNN systems. In the SemEval-2016 competition, our model would have ranked first, with the #1 KeLP system 0.6 MAP points behind; it also outperforms the state-of-the-art AI-CNN system by 0.35 MAP points.

Conclusions
This work proposes a simple, yet effective approach to the task of answer sentence selection, based on the intuition that similar patterns in questions are likely to demand similar patterns in answers. We showed that this hypothesis provides an improvement on three benchmark datasets, WikiQA, TREC13 and SemEval-2016, and, moreover, it enables simple features to achieve the state of the art on WikiQA and SemEval-2016, outperforming many state-of-the-art DNN-based systems. There is significant room for further elaboration of this approach, for example, by expanding the feature spaces with more syntactic and semantic features, employing new types of kernels for measuring the inter-question/answer pair similarity, or implementing the same idea in DNN architectures.