Talla at SemEval-2017 Task 3: Identifying Similar Questions Through Paraphrase Detection

This paper describes our approach to the SemEval-2017 shared task of determining question-question similarity in a community question-answering setting (Task 3B). We extracted both syntactic and semantic similarity features between candidate questions, performed pairwise-preference learning to optimize for ranking order, and then trained a random forest classifier to predict whether the candidate questions are paraphrases of each other. This approach achieved a MAP of 45.7%, out of a maximum achievable 67.0%, on the test set.


Introduction
A large amount of information of interest to users of community forums is stored in semi-structured text, but surfacing that information can be challenging given the variety of ways users can phrase their search queries. Question answering is a significant task for both natural language processing (NLP) and information retrieval (IR), as both the actual terms used in the query and the semantic intent of the query itself need to be accounted for in surfacing relevant potential answers. The Community Question Answering (cQA) task of SemEval-2017 (Nakov et al., 2017) seeks to address this problem through several related subtasks around effectively determining and ranking the relevance of related stored questions and associated answers.
We chose to focus on subtask B: question-question similarity. This problem can be seen as one of paraphrase detection: determining whether two questions have the same meaning. We reviewed existing performant paraphrase detection methods and selected several to implement and ensemble (Ji and Eisenstein, 2013; Wan et al., 2006; Wang and Ittycheriah, 2015; Filice et al., 2015) along with the related question IR system rank provided in the dataset. As paraphrase detection is a classification problem while subtask B is a ranking problem, we also incorporated pairwise-preference learning (Joachims, 2002; Fürnkranz and Hüllermeier, 2003) to improve the key metric of mean average precision (MAP).
The rest of the paper is organized as follows. Section 2 provides a detailed description of our system, including the key identified features that were extracted, while Section 3 provides the results from experiments used to evaluate the system. Section 4 concludes the paper with a summary of the work and directions for future exploration.

System Description
Our approach consisted of four parts: data preparation, feature extraction, pairwise-preference learning, and paraphrase classification. All code was implemented in Python 3.5. For data preparation, we converted the XML documents provided by Nakov et al. (2017) into pandas DataFrames, retaining the subject text, body text, and metadata related to the original and related questions. The feature extraction and pairwise-preference learning phases are described below. Classification was handled with a random forest classifier containing 2000 weak estimators.

Feature Extraction
We computed features as described in several leading paraphrase detection papers. One set, the fine-grained textual features of Wan et al. (2006), failed to produce any significant value during further evaluation for this task and so was discarded. In addition to the paraphrase detection features, we also incorporated the reciprocal of the reported IR system rank of the related question as an additional feature.
Unless otherwise noted, question texts for feature extraction were created by concatenating the subject and body fields of the question, all terms were made lowercase, and stop words were removed.
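As a minimal sketch of this preprocessing step, the function below concatenates the two fields, lowercases the result, and filters stop words. The stop-word list, the tokenization pattern, and the name `question_text` are illustrative assumptions; the source does not state which list or tokenizer was used.

```python
import re

# Hypothetical stop-word list; the actual list used is not given in the text.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "for", "i", "can"}


def question_text(subject, body, stop_words=STOP_WORDS):
    """Concatenate subject and body, lowercase, and drop stop words."""
    combined = "{} {}".format(subject, body).lower()
    # Assumed tokenization: runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", combined)
    return [tok for tok in tokens if tok not in stop_words]


tokens = question_text("Visa renewal", "Where can I renew a visa in Doha?")
```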

Tree Kernels
Tree kernel (TK) features (Filice et al., 2015) were derived by generating parse trees of the two sentences, then defining a kernel that allows for a numerical distance to be computed. The kernel takes all possible valid (not necessarily terminal) partial tree structures within the sentence parse trees and counts the amount of overlap between the two. The result is a score for every pair of sentences.
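The kernel function $K(S_1, S_2)$ for two trees $S_1$ and $S_2$ is defined as follows:
$$K(S_1, S_2) = \sum_{n_1 \in N_{S_1}} \sum_{n_2 \in N_{S_2}} \Delta(n_1, n_2)$$
where $N_{S_1}$ and $N_{S_2}$ are the sets of nodes in $S_1$ and $S_2$, and $\Delta(n_1, n_2)$ is the Partial Tree Kernel (PTK) function as defined in Filice et al. (2015). A standard kernel norm is then applied, given by:
$$K'(S_1, S_2) = \frac{K(S_1, S_2)}{\sqrt{K(S_1, S_1) \, K(S_2, S_2)}}$$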
We computed distances for both constituency trees and dependency trees. For constituency parse trees, words that occur in both sentences were marked along with their part of speech in order to increase the effect of shared terms belonging to similar subtrees. Dependency parse trees, on the other hand, were constructed so that non-leaf nodes are made up entirely of dependency types (rather than parts of speech). For example, a single ROOT node may have nodes nsubj and dobj as children. Leaves were all tokens representing the words themselves, and every interior node had a child that was a leaf. Two final features were produced: the result of the kernel applied to the constituency parse trees, and that result multiplied by the result of the kernel applied to the dependency parse trees.
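The normalization and the two final features can be sketched as follows. This is not the PTK itself: `toy_kernel` is a hypothetical stand-in that keeps the double sum over node pairs but scores a pair as 1 only when the node labels are equal, and trees are flattened to lists of node labels for brevity.

```python
import math


def toy_kernel(nodes_a, nodes_b):
    """Stand-in for the PTK: sum over all node pairs of a simple label match."""
    return sum(1 for a in nodes_a for b in nodes_b if a == b)


def normalized_kernel(kernel, s1, s2):
    """Apply the standard kernel norm K(s1, s2) / sqrt(K(s1, s1) * K(s2, s2))."""
    denom = math.sqrt(kernel(s1, s1) * kernel(s2, s2))
    return kernel(s1, s2) / denom if denom else 0.0


def tree_kernel_features(const_1, const_2, dep_1, dep_2, kernel=toy_kernel):
    """Return (constituency score, constituency score * dependency score)."""
    const_score = normalized_kernel(kernel, const_1, const_2)
    dep_score = normalized_kernel(kernel, dep_1, dep_2)
    return const_score, const_score * dep_score


dep_a = ["ROOT", "nsubj", "dobj"]
dep_b = ["ROOT", "nsubj", "amod"]
const_a = ["S", "NP", "VP"]
const_b = ["S", "NP", "VP"]
features = tree_kernel_features(const_a, const_b, dep_a, dep_b)
```

Normalizing keeps every score in the range 0 to 1 regardless of tree size, so a long question does not dominate simply because it has more nodes.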

TF-KLD
TF-KLD (Term Frequency Kullback-Leibler Divergence) (Ji and Eisenstein, 2013) is a supervised TF weighting scheme based on modeling probability distributions of phrases being aligned with or without the presence of a particular term. More formally, we assume labeled sentence pairs of the form $\langle \vec{w}_i^{(1)}, \vec{w}_i^{(2)}, r_i \rangle$, where $\vec{w}_i^{(1)}$ is the binarized vector of bigram and unigram occurrence for the first sentence, $\vec{w}_i^{(2)}$ is the bigram and unigram occurrence vector for the second, and $r_i \in \{0, 1\}$ is an indicator of whether the two sentences match. We assume the order of the sentences is irrelevant, and for each feature with index $k$ we define two Bernoulli distributions:
$$p_k = P(w_{ik}^{(1)} \mid w_{ik}^{(2)} = 1, r_i = 1)$$
which is the probability that feature $k$ appears in the first sentence given that $k$ appears in the second and both are matched, and
$$q_k = P(w_{ik}^{(1)} \mid w_{ik}^{(2)} = 1, r_i = 0)$$
which is the probability that feature $k$ appears in the first sentence given that $k$ appears in the second and both are not matched.
The Kullback-Leibler divergence is a premetric over probability distributions, defined as
$$KL(p_k \,\|\, q_k) = \sum_x p_k(x) \log \frac{p_k(x)}{q_k(x)}$$
We calculate a KLD score for each feature $k$, then use this to weight the vector of non-binarized occurrences. The sparse TF-KLD vector then undergoes dimensionality reduction by means of rank-100 nonnegative matrix factorization. Finally, the cosine similarity of individual vectors is taken to give a single feature for each pair of sentences.
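The weighting step can be illustrated with a small pure-Python sketch. It estimates the two Bernoulli parameters from counts and returns one KLD weight per feature; the additive smoothing constant, the set-based feature representation, and the function names are assumptions made for illustration, and the matrix factorization and cosine similarity steps are omitted.

```python
import math
from collections import defaultdict


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def tf_kld_weights(pairs, smoothing=0.05):
    """Return a KLD weight for every feature seen in the labeled pairs.

    pairs: iterable of (features_1, features_2, label), where each features_*
    is a set of unigram/bigram strings and label is 1 (match) or 0 (no match).
    """
    seen = {0: defaultdict(int), 1: defaultdict(int)}    # k is in the "second" sentence
    shared = {0: defaultdict(int), 1: defaultdict(int)}  # ... and also in the "first"
    vocab = set()
    for feats_a, feats_b, label in pairs:
        vocab |= feats_a | feats_b
        # Sentence order is treated as irrelevant, so count both orderings.
        for first, second in ((feats_a, feats_b), (feats_b, feats_a)):
            for k in second:
                seen[label][k] += 1
                if k in first:
                    shared[label][k] += 1
    weights = {}
    for k in vocab:
        # Assumed additive smoothing keeps both estimates strictly inside (0, 1).
        p = (shared[1][k] + smoothing) / (seen[1][k] + 2 * smoothing)
        q = (shared[0][k] + smoothing) / (seen[0][k] + 2 * smoothing)
        weights[k] = bernoulli_kl(p, q)
    return weights


pairs = [
    ({"visa", "the"}, {"visa", "the"}, 1),  # matched pair
    ({"visa", "the"}, {"car", "the"}, 0),   # unmatched pair
]
weights = tf_kld_weights(pairs)
```

In this toy data "the" is shared equally often in matched and unmatched pairs, so its weight is zero, while "visa" is shared only in the matched pair and receives a large weight.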

Semantic Word Alignment
Semantic word alignment (WA) (Wang and Ittycheriah, 2015) used word embeddings to infer semantic similarity between documents at the individual word level. For embeddings we used the pre-trained 300-dimensional GloVe vectors (Pennington et al., 2014).
Given a source question $Q$ and reference question $R$, let $Q = \{q_0, q_1, \ldots, q_m\}$ and $R = \{r_0, r_1, \ldots, r_n\}$ denote the words in each question text. First, the cosine similarity between all pairs of words $(q_i, r_j)$ was computed to form a similarity matrix (Figure 1). Next, we denote the word alignment position for each query word $q_i$ as $align_i$, its similarity score as $sim_i$, and its inverse document frequency as $idf_i$. The word alignment position $align_i$ for a query word $q_i$ in $Q$ with respect to the words in $R$ is equal to the position of the word $r_j$ in $R$ at which $q_i$ has its maximum similarity score $sim_i$. Finally, we compute a set of distinct word alignment features as:
• similarity: $f_0 = \sum_i sim_i \cdot idf_i / \sum_i idf_i$. This feature represents question similarity based on the aligned words.
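• dispersion: $f_1$, computed from the alignment positions $align_i$ as defined in Wang and Ittycheriah (2015). This feature is a measure of how contiguously the words are aligned.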
• penalty: If we denote the position of unaligned words (where $sim_i = 0$) as $unalign_i$, then this feature penalizes pairs with unaligned question words and was calculated as $f_2 = \sum_{unalign_i} idf_i / \sum_i idf_i$.
• five important words: $f_{i_{th}} = sim_{i_{th}} \cdot idf_{i_{th}}$. This feature set included the similarity score of the top five important words in the question text, where the importance of a word was based on its IDF score.
The first three features were computed in both directions, i.e., for $(Q_i, R_j)$ and $(R_j, Q_i)$. The cosine similarity of the aggregate of all embeddings in the questions was also computed. This process was repeated separately for both question subjects and bodies (instead of on the combined concatenated text) for a total of 24 distinct features.
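A minimal sketch of the alignment step and of the similarity and penalty features in one direction follows. The three-dimensional vectors and IDF values are made-up toy inputs rather than GloVe data, and treating non-positive cosine scores as unaligned is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alignment_features(query, reference, embeddings, idf):
    """Return (f0 similarity, f2 penalty, alignment positions) for query -> reference."""
    sims, aligns = [], []
    for q in query:
        best_sim, best_pos = 0.0, None  # None marks an unaligned word
        for j, r in enumerate(reference):
            if q in embeddings and r in embeddings:
                score = cosine(embeddings[q], embeddings[r])
                if score > best_sim:
                    best_sim, best_pos = score, j
        sims.append(best_sim)
        aligns.append(best_pos)
    total_idf = sum(idf[q] for q in query)
    f0 = sum(s * idf[q] for s, q in zip(sims, query)) / total_idf
    f2 = sum(idf[q] for s, q in zip(sims, query) if s == 0.0) / total_idf
    return f0, f2, aligns


# Toy inputs (hypothetical values, not real embeddings or IDF statistics).
embeddings = {
    "visa": (1.0, 0.0, 0.0),
    "permit": (1.0, 0.0, 0.0),
    "doha": (0.0, 1.0, 0.0),
    "fee": (0.0, 0.0, 1.0),
}
idf = {"visa": 2.0, "doha": 1.0}
f0, f2, aligns = alignment_features(["visa", "doha"], ["fee", "permit"], embeddings, idf)
```

Here "visa" aligns with "permit" at position 1, while "doha" has no similar word in the reference and contributes only to the penalty.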

Pairwise-Preference Learning
Since the official evaluation metric for Subtask B was MAP, we adopted a ranking approach to indirectly optimize for MAP. Given an original question $Q_i$ and its list of corresponding related questions $\{R_1, R_2, \ldots, R_{10}\}$, we are interested in learning a ranking of this list in which relevant questions are ranked higher than irrelevant ones. One way to learn this ranking is to form a group for each original question $Q_i$, build pairs of related questions within that group, and classify whether each pair is correctly ordered or not. This principle is called "pairwise-preference learning" (Joachims, 2002; Fürnkranz and Hüllermeier, 2003).
To make use of this approach, we transformed the datasets from question-question (or question-comment) pairs into a set of instance pairs. That is, each instance paired one correct and one incorrect candidate for the same original question. The number of features was kept constant, while each feature value was equal to the difference between the values of the two candidates in the instance pair.
In the training phase, for each question group $(Q_i, \{R_1, R_2, \ldots, R_{10}\})$ we generated labeled pairs as "correct-pair$(Q_i, R_j)$ minus incorrect-pair$(Q_i, R_k)$" with label true and "incorrect-pair$(Q_i, R_k)$ minus correct-pair$(Q_i, R_j)$" with label false. In this way, we generated $2 \cdot (n_c + n_i)$ instance pairs for each question group, where $n_c$ and $n_i$ are the numbers of correct and incorrect pairs within a group, respectively.
In the testing phase, the number of instance pairs generated for a question group $(Q_i, \{R_1, R_2, \ldots, R_{10}\})$ was equal to the number of all possible pairs within that question group. Our model then assigned to each of these instance pairs a probability that it is correctly ordered. To create a final score for each related question $R_j$, we took the sum of probabilities over all pairs in which $R_j$ was ranked first. This final score was then used to create a ranked list of related questions $R_j$ for each original question $Q_i$.
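The construction of training instances from feature differences can be sketched as below. This sketch pairs every correct related question with every incorrect one, which is an assumption; the authors' exact pairing scheme, and therefore the number of instances produced per group, may differ. The function name and toy feature values are hypothetical.

```python
def make_training_pairs(features, labels):
    """Build difference-vector instances for one question group.

    features: list of feature vectors, one per (Q, R_j) pair.
    labels:   list of booleans, True where R_j is a correct (relevant) question.
    """
    correct = [f for f, y in zip(features, labels) if y]
    incorrect = [f for f, y in zip(features, labels) if not y]
    instances = []
    for good in correct:
        for bad in incorrect:
            diff = [a - b for a, b in zip(good, bad)]
            instances.append((diff, True))                 # correct minus incorrect
            instances.append(([-d for d in diff], False))  # incorrect minus correct
    return instances


group_features = [[1.0, 0.5], [0.25, 0.25], [0.5, 1.0]]
group_labels = [True, False, False]
instances = make_training_pairs(group_features, group_labels)
```

Emitting each difference in both directions keeps the two classes balanced, so the classifier cannot succeed by always predicting one label.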
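The test-time scoring can be sketched as follows, assuming that "all possible pairs" means all ordered pairs of distinct related questions. The callable `prob_correctly_ordered` stands in for the trained classifier, and `toy_model` is a hypothetical logistic function used only to make the example runnable.

```python
import math
from itertools import permutations


def rank_related(features, prob_correctly_ordered):
    """Score each related question by summing P(correct order) over pairs it leads."""
    scores = [0.0] * len(features)
    for j, k in permutations(range(len(features)), 2):
        diff = [a - b for a, b in zip(features[j], features[k])]
        scores[j] += prob_correctly_ordered(diff)
    order = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    return order, scores


def toy_model(diff):
    """Hypothetical stand-in classifier: logistic function of the summed differences."""
    return 1.0 / (1.0 + math.exp(-sum(diff)))


order, scores = rank_related([[0.25], [1.0], [0.5]], toy_model)
```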

Experiments and Evaluation
We combined the provided training and dev datasets as our system training set and used the provided SemEval-2016 test data with gold labels as our test set. No additional external data, other than pre-trained word embeddings, were used. We evaluated different classifier hyperparameters using 10-fold cross-validation and ultimately chose a random forest classifier with 2000 trees as our final model.
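For reference, MAP over ranked lists can be computed as in the sketch below. It assumes binary relevance labels and assigns an average precision of zero to a question with no relevant candidates; the official task scorer may treat such cases differently.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of booleans (True = relevant)."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(groups):
    """Mean of the per-question average precision values."""
    return sum(average_precision(g) for g in groups) / len(groups)


map_score = mean_average_precision([[True, False, True], [False, True]])
```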
This system achieved fourth place overall (Table 1) on the SemEval-2017 test dataset, and while both contrastive submissions placed higher than the primary, neither was able to achieve a greater MAP than the third-place entry. Contrastive1 was identical in feature set to the primary submission, but included the SemEval-2016 test dataset as part of the training data, suggesting that MAP can be improved by increasing the number of examples used to train the system. Contrastive2 did not include the extra data and also omitted the TF-KLD features.
Comparing the effects of ablating the other individual features (Table 2) across both the SemEval-2016 and SemEval-2017 test datasets demonstrated that both the TF-KLD and TK features were minimally effective. The effect of the IR system feature differed dramatically between the two years: in 2016 it accounted for a 0.022 gain in MAP, while in 2017 it produced a 0.010 reduction. In both cases the WA features contributed the most, with gains of 0.041 and 0.034, respectively.
Subtask B of Task 3 combines the PerfectMatch and Relevant classes into a single positive class for purposes of evaluation. Given that this approach treated question-question similarity as a paraphrase detection problem, the expectation was that this model would do better on the PerfectMatch and Irrelevant samples, but have a harder time with Relevant questions. This is seen in the SemEval-2016 data (Figure 2, Table 3). The train + dev dataset we used for general training was more closely aligned with the distribution of class labels in 2016 than in 2017, suggesting that this approach may depend on the training and test data being identically distributed in order to produce good results on test data.
Table 3: Distribution of the PerfectMatch (PM), Relevant (R), and Irrelevant (I) classes within the datasets.

Summary
We described a system that relies on an ensemble of syntactic, semantic, and IR features to detect question-question similarity and demonstrated it on the SemEval-2017 community question answering shared task. Of the four feature sources we evaluated, the semantic word alignment features provided the largest and most consistent boost in MAP. Features derived from TF-KLD and tree kernel methods had modest effects. The efficacy of the IR-derived feature varied, from a noticeable gain on historical data to a significant drop on the current test set, likely attributable to the significant increase in the number of Irrelevant class samples. Future work will explore how to compensate for highly unbalanced class scenarios.