A strong baseline for question relevancy ranking

The best systems at the SemEval-16 and SemEval-17 community question answering shared tasks (a task that amounts to question relevancy ranking) involve complex pipelines and manual feature engineering. Despite this, many of them still fail to beat the IR baseline, i.e., the rankings provided by Google's search engine. We present a strong baseline for question relevancy ranking by training a simple multi-task feed-forward network on a bag of 14 distance measures for the input question pair. This baseline model, which is fast to train and uses only language-independent features, outperforms the best shared task systems on the task of retrieving relevant previously asked questions.


Introduction
Community question-answer fora are great resources, collecting answers to frequently and less frequently asked questions on specific topics, but these are often not moderated and contain many irrelevant answers. Community Question Answering (CQA), cast as a question relevancy ranking problem, was the topic of two shared tasks at SemEval 2016-17. This is a non-trivial retrieval task, typically evaluated using mean average precision (MAP). We present a strong baseline for this task, on par with or surpassing state-of-the-art systems.
The English subtasks of the SemEval CQA (Nakov et al., 2015, 2017) consist of Question-Question Similarity, Question-Comment Similarity, and Question-External Comment Similarity. In this study, we focus on the core subtask of Question-Question Similarity, defined as follows: Given a question, rank other relevant questions by their relevancy to that question. This proved to be a difficult task in both SemEval-16 and SemEval-17, as it is the one with the least amount of data available. The baseline was the ranking retrieved by performing a Google search, which proved to be a strong baseline, beating a large portion of the submitted systems.
Contribution Our baseline is a simple multi-task feed-forward neural network taking distance measures between pairs of questions as input. We use a question-answer dataset as an auxiliary task, but we also experiment with datasets for pairwise classification tasks such as natural language inference and fake news detection. This simple, easy-to-train model is on par with or better than state-of-the-art systems for question relevancy ranking. We also show that this simple model outperforms a more complex model based on recurrent neural networks.

Our Model
We present a simple baseline model for question relevancy ranking. It is a deep feed-forward network with a hidden layer that is shared with an auxiliary task model. The input to the network is extremely simple and consists of five distance measures computed over the input question-question pair (14 features in total). §2.1 discusses these distance measures and how they relate. §2.2 introduces the multi-task learning architecture that we propose.

Features
We use four similarity metrics and three sentence representations (averaged word embeddings, binary unigram vectors, and trigram vectors). The cosine distance between the sentence representations of query $x$ and query $y$ is computed from $\cos(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$, the Manhattan distance is $\sum_i |x_i - y_i|$, the Bhattacharya distance is $-\ln \sum_i \sqrt{x_i y_i}$ and is a measure of divergence, and the Euclidean distance is $\sqrt{\sum_i (x_i - y_i)^2}$. Note that the squared Euclidean distance is closely related to the cosine and Manhattan distances: for unit-length vectors it equals $2(1 - \cos(x, y))$, and for binary vectors it equals the Manhattan distance. The Bhattacharya and Jaccard metrics, on the other hand, are sensitive to the number of types in the input (the $\ell_1$ norm of the vector encodings). So, for example, only the cosine, Euclidean, and Manhattan distances will be the same for $x = \langle 1, 1, 0, 0, 1, 0, 1, 1, 0, 1 \rangle$, $y = \langle 0, 0, 1, 0, 1, 0, 0, 0, 1, 1 \rangle$ and $x = \langle 0, 0, 0, 0, 0, 1, 0, 0, 1, 1 \rangle$, $y = \langle 1, 1, 1, 1, 0, 0, 0, 0, 0, 1 \rangle$. The Jaccard index is the only metric that can be applied to just two of our representations, the unigram and trigram vectors: it is defined over $m$-dimensional binary (indicator) vectors and is therefore not applicable to averaged embeddings. It is defined as $\frac{x \cdot y}{m}$. We represent each query pair by these 14 numerical features (four metrics over three representations, plus the Jaccard index over two).
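As an illustration of how such features could be computed, the following is a minimal sketch in plain Python, not the authors' implementation. It covers only the two binary representations and therefore yields 10 of the 14 features; the averaged-embedding features are omitted because they require pre-trained vectors. Several choices are assumptions made for the example: the function names are hypothetical, n-grams are taken over lowercased whitespace tokens, the vocabulary is built from the question pair itself rather than from a corpus, cosine distance is taken as one minus cosine similarity, and the Bhattacharya distance is applied to the raw (unnormalised) vectors.

```python
import math

def word_ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples (assumed: word-level n-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def indicator_vector(items, vocabulary):
    """Binary vector with a 1 for every vocabulary entry that occurs in items."""
    present = set(items)
    return [1 if v in present else 0 for v in vocabulary]

def cosine_distance(x, y):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm > 0 else 1.0

def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def bhattacharya_distance(x, y):
    """-ln(sum_i sqrt(x_i * y_i)) on non-negative vectors; infinite when nothing overlaps."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return -math.log(coefficient) if coefficient > 0 else math.inf

def jaccard_index(x, y):
    """The definition given in the text: x . y / m for m-dimensional binary vectors."""
    return sum(a * b for a, b in zip(x, y)) / len(x) if x else 0.0

def pair_features(question_1, question_2):
    """Ten features: five measures over unigram and trigram indicator vectors."""
    features = []
    for n in (1, 3):
        grams_1 = word_ngrams(question_1.lower().split(), n)
        grams_2 = word_ngrams(question_2.lower().split(), n)
        # Assumption: the vocabulary is the union of the pair's n-grams.
        vocabulary = sorted(set(grams_1) | set(grams_2))
        x = indicator_vector(grams_1, vocabulary)
        y = indicator_vector(grams_2, vocabulary)
        features += [
            cosine_distance(x, y),
            manhattan_distance(x, y),
            euclidean_distance(x, y),
            bhattacharya_distance(x, y),
            jaccard_index(x, y),
        ]
    return features
```

Calling `pair_features("Where can I buy a bike", "Where can I rent a bike")` returns a list of ten numbers. With a pair-local vocabulary the Bhattacharya feature is infinite whenever the two questions share no n-gram, so a real pipeline would need to clip or replace that value before feeding it to a network.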

MTL Architecture
Our architecture, presented in Figure 1, is a simple feed-forward multi-task learning (MTL) architecture: a Multi-Layer Perceptron (MLP) that takes a pair of sequences as input. The sequences can be sampled from the main task or the auxiliary task. The MLP has one shared hidden layer, a task-specific hidden layer and, finally, a task-specific classification layer for each output. The hyper-parameters, selected by grid search to optimize performance on the validation data, are given in Figure 2.
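The layer layout described above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the class name, task names, layer sizes, ReLU activations, and the random initialisation are all hypothetical, training is not shown, and the hyper-parameters reported in Figure 2 are not reproduced here.

```python
import math
import random

def make_dense(n_in, n_out, rng):
    """A dense layer as a dict of weights and biases (small uniform initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return {"W": weights, "b": [0.0] * n_out}

def apply_dense(layer, x):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(layer["W"], layer["b"])]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

class MultiTaskMLP:
    """One shared hidden layer, then a hidden layer and a classifier per task."""

    def __init__(self, n_features, n_hidden, task_classes, seed=0):
        rng = random.Random(seed)
        # Shared across all tasks: this is where the auxiliary signal can help.
        self.shared = make_dense(n_features, n_hidden, rng)
        # Task-specific layers, keyed by (hypothetical) task name.
        self.task_hidden = {t: make_dense(n_hidden, n_hidden, rng) for t in task_classes}
        self.task_output = {t: make_dense(n_hidden, k, rng) for t, k in task_classes.items()}

    def forward(self, features, task):
        hidden = relu(apply_dense(self.shared, features))
        hidden = relu(apply_dense(self.task_hidden[task], hidden))
        return softmax(apply_dense(self.task_output[task], hidden))

# Hypothetical setup: binary relevance for the main task, four stance labels for FNC.
model = MultiTaskMLP(n_features=14, n_hidden=8,
                     task_classes={"question_ranking": 2, "fnc": 4})
relevance_probs = model.forward([0.1] * 14, task="question_ranking")
stance_probs = model.forward([0.1] * 14, task="fnc")
```

In a training loop, batches would be drawn from either the main or the auxiliary task, and each batch would update the shared layer together with only its own task-specific layers.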

LSTM baseline
We compare our MLP ranker to a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) model. It takes two sequences as input, sequence 1 and sequence 2, each of which is encoded by its own stack of three bidirectional LSTM layers. The outputs are then concatenated, to enable representing the differences between the two sequences. Instead of relying only on this representation (Bowman et al., 2015; Augenstein et al., 2016), we also concatenate our distance features and feed everything into our MLP ranker described above.

Datasets
For our experiments, we use data from SemEval shared tasks, but we also take advantage of potential synergies with other existing datasets for classification of sentence pairs. Below we present the datasets used for our main and auxiliary tasks. We provide some summary statistics for each dataset in Table 3.
SemEval 2016 and 2017 As our main dataset we use the queries from SemEval's subtask B, which consists of an original query and 10 possibly related queries. As an auxiliary task, we use the data from subtask A, which is a question-related comment ranking task.
Natural Language Inference We also use natural language inference (NLI) data as an auxiliary task (Nangia et al., 2017), chosen since it contains different genres. Our model is not built to be a strong NLI system; we use the similarity between premise and hypothesis as a weak signal to improve the generalization on our main task.

Fake News Challenge
The Fake News Challenge (FNC; http://www.fakenewschallenge.org/) was introduced to combat misleading and false information online. This task has been used before in a multi-task setting as a way to utilize general information about pairwise relations (Augenstein et al., 2018). Formally, given a headline and a body of text, which may or may not come from the same news article, the FNC task is to classify the stance of the body of text relative to what is claimed in the headline. There are four labels:
• AGREES: The body of the article is in agreement with the headline
• DISAGREES: The body of the article is in disagreement with the headline
• DISCUSSES: The body of the article does not take a position
• UNRELATED: The body of the article discusses a different topic
We include fake news detection as a weak auxiliary signal that can lead to better generalization of our question-question ranking model.

Evaluation
We evaluate our performance on the main task of question relevancy ranking using the official SemEval-2017 Task 3 evaluation scripts (Nakov et al., 2017). The scripts provide a variety of metrics; in accordance with the shared task, we report Mean Average Precision (MAP), the official metric for the SemEval 2016 and 2017 shared tasks; Mean Reciprocal Rank (MRR), which has been widely used for IR and QA; Average Recall; and, finally, the accuracy of predicting relevant documents.
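For readers unfamiliar with the two ranking metrics, the sketch below shows the textbook definitions of MAP and MRR in plain Python. It is not the official SemEval scorer, which may differ in details such as how queries without any relevant candidate are treated (here they contribute 0, which is an assumption). All function names are hypothetical.

```python
def rank_by_score(scores, labels):
    """Relevance labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, rankings):
    """Average a per-query metric over a list of ranked label lists."""
    return sum(metric(r) for r in rankings) / len(rankings) if rankings else 0.0

# Two toy queries: each list holds relevance labels in ranked order.
rankings = [
    rank_by_score(scores=[0.2, 0.9, 0.5], labels=[1, 0, 1]),  # ranked labels: [0, 1, 1]
    [1, 0, 1],
]
map_score = mean_over_queries(average_precision, rankings)
mrr_score = mean_over_queries(reciprocal_rank, rankings)
```

For the first toy query the relevant items land at ranks 2 and 3, giving an average precision of (1/2 + 2/3) / 2; for the second they are at ranks 1 and 3, giving (1 + 2/3) / 2.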

Results
The results from our experiments are shown in Table 1. We present the official metric from the SemEval task, as well as other common metrics. For the SemEval-16 data, our multi-task MLP architecture with a question-answer auxiliary task performed best on all metrics, except accuracy, where the multi-task MLP using all auxiliary tasks performed best. We outperform the winning systems of both the SemEval 2016 and 2017 campaigns. In addition, our improvements from single-task to multi-task are significant (p < 0.01). We also outperform the official IR baseline used in the SemEval 2016 and 2017 shared tasks. We discuss the STL-LSTM-SIM results in §5. Furthermore, in Table 2, we show the performance of our models when training on feature combinations, while in Table 3, we present an ablation test where we remove one feature at a time.
Learning curve In Figure 4, we also present our learning curves for the development set when incrementally increasing the training set size. We observe that when using an auxiliary task, learning is more stable across training set sizes.

Discussion
For the SemEval shared tasks on CQA, several authors used complex recurrent and convolutional neural network architectures (Severyn and Moschitti, 2015; Barrón-Cedeno et al., 2016). For example, Barrón-Cedeno et al. used a convolutional neural network in combination with feature vectors representing lexical, syntactic, and semantic similarity as well as tree kernels. Their performance was slightly lower than the best system (SemEval-Best for 2016 in Table 1). The best system used lexical and semantic similarity measures in combination with a ranking model based on support vector machines (SVMs) (Filice et al., 2016; Franco-Salvador et al., 2016). Both systems are harder to implement and train than the model we propose here. For SemEval-17, the winning team, Franco-Salvador et al. (2016), used distributed representations of words, knowledge graphs and frames from FrameNet (Baker et al., 1998) as some of their features, and used SVMs for ranking.
For a more direct comparison, we also train a more expressive model than the simple MTL-based model we propose. This architecture is based on bi-directional LSTMs (Hochreiter and Schmidhuber, 1997). For this model, we input sequences of embedded words (using pre-trained word embeddings) from each query into independent BiLSTM blocks and output a vector representation for each query. We then concatenate the vector representations with the similarity features from our MTL model and feed them into a dense layer and a classification layer. This way we can evaluate the usefulness of the flexible, expressive LSTM network directly (as our MTL model becomes an ablation instance of the full, more complex architecture). We use the same dropout regularization and SGD values as for the MLP. Even when tuning all parameters on the development data, however, we do not manage to outperform our proposed model. See the MTL-LSTM-SIM rows in Table 1 for results.
Table 3: We perform an ablation test, where we remove one feature at a time and report performance on development data of our single-task baseline. We observe that our baseline suffers most from removing the Euclidean distance over trigrams and the cosine similarity over unigrams. Note also that the Jaccard index over unigrams seems to carry a strong signal, albeit a very simple feature.

Conclusion
We show that simple feature engineering, combined with an auxiliary task and a simple feed-forward neural architecture, is appropriate for a small dataset and manages to beat the baseline and the best performing systems for the SemEval task of question relevancy ranking. We observe that introducing pairwise classification tasks leads to significant improvements in performance and a more stable model. Overall, our simple model introduces a new strong baseline which is particularly useful when there is a lack of labeled data.