Preferred Answer Selection in Stack Overflow: Better Text Representations ... and Metadata, Metadata, Metadata

Community question answering (cQA) forums provide a rich source of data for facilitating non-factoid question answering over many technical domains. Given this, there is considerable interest in answer retrieval from these kinds of forums. However, this is a difficult task as the structure of these forums is very rich, and both metadata and text features are important for successful retrieval. While there has recently been a lot of work on solving this problem using deep learning models applied to question/answer text, this work has not looked at how to make use of the rich metadata available in cQA forums. We propose an attention-based model which achieves state-of-the-art results for text-based answer selection alone, and by making use of complementary metadata, achieves a substantially higher result over two reference datasets novel to this work.


Introduction
Community question answering ("cQA") forums such as Stack Overflow have become a staple source of information for technical searches on the web. However, a given query will often match against multiple questions, each with multiple answers. This complicates technical information retrieval, as any kind of search or question answering engine must decide how to rank these answers. Therefore, it would be beneficial to be able to automatically determine which questions in a cQA forum are most relevant to a given query question, and which answers to these questions best answer the query question.
One of the challenges in addressing this problem is that cQA threads tend to have a very rich and specific kind of structure and associated metadata. The basic structure of cQA threads is as follows: each thread has a unique question (usually editable by the posting user) and any number of answers to that question (each of which is usually editable by the posting user); comments can be posted by any user on any question or answer, e.g. to clarify details, challenge statements made in the post, or reflect the edit history of the post (on the part of the post author); and there is some mechanism for selecting the "preferred" answer, on the part of the user posting the original question, the forum community, or both. There is also often rich metadata associated with each question (e.g. number of views or community-assigned tags), each answer (e.g. creation and edit timestamps), and every user who has participated in the thread. This user metadata is both explicit (e.g. badges or their reputation level) and implicit (e.g. activity data from other threads they have participated in, types of questions they have posted, or the types of answers they posted which were accepted).
Our research is aimed at improving the ability to automatically identify the best answer within a thread for a given question, as an initial step towards cross-thread answer ranking/selection. In this work we use Stack Overflow as our source of cQA threads. More concretely, given a Stack Overflow cQA thread with at least four answers, we attempt to automatically determine which of the answers was chosen by the user posting the original question as the "preferred answer".
A secondary goal of our research is learning how to leverage both the question/answer text in cQA threads and the associated metadata. We show how to create effective representations of both the thread text and the metadata, and we investigate the relative strength of each as well as their complementarity for preferred answer selection. By leveraging this metadata and using an attentional model for constructing question/answer pair representations, we are able to obtain greatly improved results over an existing state-of-the-art method for answer retrieval.
The contributions of our research are as follows:
• we develop two novel benchmark datasets for cQA answer ranking/selection;
• we adapt a deep learning method proposed for near-duplicate/paraphrase detection, and achieve state-of-the-art results for text-based answer selection; and
• we demonstrate that metadata is critical in identifying preferred answers, but at the same time text-based representations complement metadata to achieve the best overall results for the task.
The data and code used in this research will be made available on acceptance.

Related work
The work that is most closely related to ours is Bogdanova and Foster (2016) and Koreeda et al. (2017). In the first case, Le and Mikolov's paragraph2vec was used to convert question-answer pairs into fixed-size vectors in a word-embedding vector space, which were then fed into a simple feed-forward neural network. In the second case, a decompositional attentional model was applied to the SemEval question-comment re-ranking task, and achieved respectable results for text alone. We improve on the standalone results for these two methods through better training and hyperparameter optimisation. We additionally extend both methods by incorporating metadata features in the training of the neural model, instead of extracting neural features for use in a non-deep learning model, as is commonly done in re-ranking tasks (Koreeda et al., 2017). In addition to this, there is a variety of other recent work on deep learning methods for answer ranking or best answer selection. For instance, Wang et al. (2010) used a network based on restricted Boltzmann machines (Hinton, 2002), using binary vectors of the most frequent words in the training data as input. This model was trained by trying to reconstruct question vectors from answer vectors, then at test time question vectors were compared against answer vectors to determine their relevance.
Elsewhere, Zhou et al. (2016) used Denoising Auto-Encoders (Vincent et al., 2008) to learn how to map both questions and answers to low-dimensional representations, which were then compared using cosine similarity. The resulting score was used as a feature in a learn-to-rank setup, together with a set of hand-crafted features including metadata, which did not have a positive effect on the results.
In another approach, Bao and Wu (2016) mapped questions and answers to multiple lower dimensional layers of variable size. They then used a 3-way tensor transformation to combine the layers and produce one output layer. Nassif et al. (2016) used stacked bidirectional LSTMs with a multilayer perceptron on top, with the addition of a number of extra features including a small number of metadata features, to classify and re-rank answers. Although the model performed well, it was no better than a far simpler classification model using only features based on text (Belinkov et al., 2015).
Compared to these past deep learning approaches for answer retrieval, our work differs in that we include metadata features directly within our deep learning model. We include a large number of such features and show, contrary to the results of previous research, that they can greatly improve classification performance.
In addition to deep learning methods for answer retrieval, there is plenty of research on answer selection using more traditional methods. Much of this work involves using topic models to infer question and answer representations in topic space, and retrieving based on these representations (Vasiljevic et al., 2016; Zolaktaf et al., 2011; Chahuara et al., 2016). However, the general finding is that this kind of method is insufficient to capture the level of detail needed to determine if an answer is truly relevant (Vasiljevic et al., 2016). They therefore tend to rely on complementary approaches such as using translation-based language models (Xue et al., 2008), or using category information. Given this, we do not experiment with these kinds of approaches.
There is also some work on improving answer retrieval by directly modelling answer quality (Jeon et al., 2006; Omari et al., 2016; Zhang et al., 2014). User-level information has proven to be very useful for this (Agichtein et al., 2008; Burel et al., 2012; Shah, 2015), which helps motivate our use of metadata.
Finally, an alternative strategy for answer selection is analogical reasoning or collective classification, which has been investigated by Joty et al. (2016), among others. In this kind of approach, questions and their answers are viewed as nodes in a graph connected by semantic links, which can be either positive or negative depending on the quality of the answer and its relevance to the question. However, we leave incorporating such graph-based approaches to future work.

Dataset
We developed two datasets based on Stack Overflow question-answer threads, along with a background corpus for pre-training models. The evaluation datasets were created by sampling from threads with at least four answers, where one of those answers had been selected as "best" by the question asker. The process for constructing our dataset was modelled on the 10,000 "how" question corpus (Jansen et al., 2014), similar to Bogdanova and Foster (2016).
The two evaluation datasets, which we denote as "SMALL" and "LARGE", contain 10K and 70K questions, respectively, each with a predefined 50/25/25 split between train, val, and test questions. On average, there are approximately six answers per question.
In addition to the sampled sub-sets, we also used the full Stack Overflow dump (containing a full month of questions and answers) for pretraining; we will refer to this dataset as "FULL". This full dataset consists of approximately 300K questions and 1M answers. In all cases, we tokenised the text using Stanford CoreNLP.
Stack Overflow contains rich metadata, including user-level information and question- and answer-specific data. We leverage this metadata in our model, as detailed in Section 4.2. Summary statistics of SMALL, LARGE and FULL are presented in Table 1.
In addition to the Stack Overflow dataset, we also experiment with an additional complementary dataset: the SEMEVAL 2017 Task 3A Question-Comment reranking dataset (Nakov et al., 2017). We include this dataset to establish the competitiveness of our proposed text processing networks (noting that the data contains very little metadata to be able to evaluate our metadata-based model).
We used the 2016 test set as validation, and the 2017 test set as test. Note that there are 3 classes in SEMEVAL: Good, PotentiallyUseful, and Bad, but we collapse PotentiallyUseful and Bad into a single class, following most competition entries.

Methodology
We treat the answer ranking problem as a classification problem, where given a question/answer pair, the model tries to predict how likely the answer is to be the preferred answer to the question. So for a given question, the answers are ranked by descending probability. We explore three methods, which vary based on how they construct a question/answer pair embedding. Respectively these variations leverage: (1) only the question and answer text; (2) only the metadata about the question, answer and users; or (3) both text and metadata.
In all cases, given a vector embedding of a question/answer pair (based on a text embedding and/or metadata embedding), we feed the vector into a feed-forward network, H, which outputs the probability that the answer is the preferred answer to the given question. The network H consists of a series of dense layers with relu activations, and a final softmax layer. The model is trained using SGD with standard categorical cross-entropy loss, and implemented using TensorFlow.
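The classify-then-rank setup described above can be illustrated with a minimal pure-Python sketch. It is not the paper's TensorFlow implementation: the layer sizes, the random weights, and the helper names (`init_layer`, `classify`, `rank_answers`, `cross_entropy`) are all assumptions made for illustration, and no training is performed.

```python
import math
import random

def dense(vec, weights, bias):
    # One dense layer: each row of `weights` produces one output unit.
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    # Hypothetical initialisation, chosen only so the sketch runs.
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def classify(pair_vec, hidden_layers, output_layer):
    """Return [P(not preferred), P(preferred)] for one question/answer pair."""
    h = pair_vec
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    return softmax(dense(h, *output_layer))

def cross_entropy(probs, gold_class):
    # Categorical cross-entropy for a single instance with a one-hot label.
    return -math.log(probs[gold_class])

def rank_answers(pair_vecs, hidden_layers, output_layer):
    """Order the answers of one question by descending P(preferred)."""
    scores = [classify(v, hidden_layers, output_layer)[1] for v in pair_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

rng = random.Random(0)
hidden = [init_layer(6, 8, rng), init_layer(8, 8, rng)]
output = init_layer(8, 2, rng)
pairs = [[rng.random() for _ in range(6)] for _ in range(4)]  # four candidate answers
order, scores = rank_answers(pairs, hidden, output)
```

Here `order[0]` is the index of the answer the (untrained) network would select as preferred; in the real system the pair vectors would come from the text and/or metadata embeddings.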

Text Only
We experiment with two methods for constructing our text embeddings: an attentional approach, and a benchmark approach using a simple paragraph vector representation.

Decompositional Attentional Model
Parikh et al. (2016) proposed a decompositional attentional model for identifying near-duplicate questions. It is based on a bag-of-words model, and has been shown to perform well over the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015;Tomar et al., 2017).
We adapt their architecture for our task, running it on question/answer pairs instead of entailment pairs. Note that, in our case, the best answer is in no way expected to be a near-duplicate of the question, and rather, the attention mechanism over word embeddings is used to bridge the "lexical gap" between questions and answers (Shtok et al., 2012), as well as to automatically determine the sorts of answer words that are likely to align with particular question words. Henceforth we refer to our adapted model as "decatt".
The model works as follows: first it attends words mutually between the question and answer pair. Then, for each word in the question (respectively answer), it computes a weighted sum of the word embeddings in the answer (respectively question) to generate a soft-alignment vector. The embedding and alignment vector of each word are then combined together (by concatenation and feed-forward neural network) to form a token-specific representation for each word. Finally, separate question/answer vectors are constructed by summing over their respective token representations, and these are concatenated to form the final question/answer pair vector.
Formally, let the input question and answer be $a = (a_1, \ldots, a_{l_a})$ and $b = (b_1, \ldots, b_{l_b})$ with lengths $l_a$ and $l_b$, respectively. $a_i, b_j \in \mathbb{R}^d$ are word embeddings of dimensionality $d$. These embeddings are not updated during training, following Parikh et al. (2016).
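We first align each question (answer) word with other answer (question) words. Let $F$ be a feed-forward network with relu activations. We define the unnormalised attention weights as follows:

$$e_{ij} = F(a_i)^\top F(b_j)$$

We then perform softmax over the attention weights and compute the weighted sum:

$$\beta_i = \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b} \exp(e_{ik})}\, b_j, \qquad \alpha_j = \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a} \exp(e_{kj})}\, a_i$$

Let $G$ be a feed-forward network with relu activations. We define the representation for each word as follows:

$$v_{a,i} = G([a_i; \beta_i]), \qquad v_{b,j} = G([b_j; \alpha_j])$$

for $i = 1, \ldots, l_a$, $j = 1, \ldots, l_b$, and where $[\cdot; \cdot]$ denotes vector concatenation. Lastly, we aggregate the vectors in the question and answer by summing them:

$$v_a = \sum_{i=1}^{l_a} v_{a,i}, \qquad v_b = \sum_{j=1}^{l_b} v_{b,j}$$

and concatenate the results to form the text vector $v_{\text{text}} = [v_a; v_b]$. This text vector is used as the input in the classification network $H$.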
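The attend, compare and aggregate steps just described can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions: `F_w` and `G_w` stand in for the networks F and G as single relu layers with random weights, and the dimensions are arbitrary; the actual model uses deeper networks, dropout and learned parameters.

```python
import math
import random

def relu_layer(weights, vec):
    # Single dense layer followed by relu (a stand-in for F or G).
    return [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def decatt_pair_vector(a, b, F_w, G_w):
    """a, b: lists of word embeddings for the question and the answer."""
    # Attend: unnormalised weights e[i][j] = F(a_i) . F(b_j)
    Fa = [relu_layer(F_w, x) for x in a]
    Fb = [relu_layer(F_w, y) for y in b]
    e = [[sum(p * q for p, q in zip(fa, fb)) for fb in Fb] for fa in Fa]
    # Soft alignment: beta_i summarises the answer for question word i,
    # alpha_j summarises the question for answer word j.
    beta = [weighted_sum(softmax(row), b) for row in e]
    alpha = [weighted_sum(softmax(col), a) for col in zip(*e)]
    # Compare: token-specific representation from [embedding; alignment].
    v_a_tokens = [relu_layer(G_w, x + bt) for x, bt in zip(a, beta)]
    v_b_tokens = [relu_layer(G_w, y + al) for y, al in zip(b, alpha)]
    # Aggregate: sum over tokens, then concatenate question and answer vectors.
    v_a = [sum(col) for col in zip(*v_a_tokens)]
    v_b = [sum(col) for col in zip(*v_b_tokens)]
    return v_a + v_b

rng = random.Random(1)
d, h = 4, 5  # embedding size and hidden size, chosen arbitrarily
F_w = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(h)]
G_w = [[rng.gauss(0, 0.5) for _ in range(2 * d)] for _ in range(h)]
question = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
answer = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]
v_text = decatt_pair_vector(question, answer, F_w, G_w)
```

Because the token representations are summed, the resulting vector does not depend on word order, which is the sense in which the model is bag-of-words based.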

Paragraph Vectors
Our second approach uses the method of Bogdanova and Foster (2016), who achieved state-of-the-art performance on the Yahoo! Answers corpus of Jansen et al. (2014). The method, which we will refer to as "doc2vec", works by independently constructing vector representations of both the question and answer texts, using paragraph vectors (Le and Mikolov, 2014; Lau and Baldwin, 2016) in the same vector space. The training is unsupervised, only requiring an unlabelled pretraining corpus to learn the vectors.
The doc2vec method is an extension of word2vec (Mikolov et al., 2013) for learning document embeddings. The document embeddings are generated along with word embeddings in the same vector space. word2vec learns word embeddings that can separate the words appearing in contexts of the target word from randomly sampled words, while doc2vec learns document embeddings that can separate the words appearing in the document from randomly sampled words.
Given the doc2vec question and answer vectors, we concatenate them to construct the text vector, $v_{\text{text}}$, which is used as the input to $H$. Note that in this model $v_{\text{text}}$ is kept fixed after pretraining (unlike in decatt where errors are propagated all the way back to the $v_{\text{text}}$ vectors).

Metadata Only
In order to leverage the metadata in the Stack Overflow dataset, we extract a set of features to form a fixed-length vector as input to our model. Given the wide difference in scale of these features, all feature values are linearly scaled to the range [0, 1]. We denote this vector as $v_{\text{meta}}$, and in the metadata-only case this is used as the input to the classification network $H$.
The raw metadata is as follows: firstly, for each question and answer we used the number of times the post had been viewed, the creation date of the post, the last activity date on the post, and a list of comments on the post, including the user ID for each comment. Secondly, for each question we used the top n tags for the question (based on the number of community votes), where n is a tunable hyperparameter. Finally, for each user we used the account creation date, number of up/down votes, reputation score, and list of badges obtained by the user.
From these raw metadata fields we constructed sets of question-specific, answer-specific, and user-specific features, which are summarised in Table 2. All date features were converted to integers using seconds since Unix epoch, and all binary features were converted to zero or one. In addition, the tag-based features were converted to a probability distribution based on simple MLE.
One concern with this model is that concatenating all features together could lead to feature groups with lots of features dominating groups with fewer features (for example the BasicQ and BasicA features could be overshadowed by the QTags and UTags features). In order to control for this, we only used the top n tags for the QTags and UTags feature groups.
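The three feature conversions mentioned above (dates to seconds since the Unix epoch, tags to a probability distribution, and linear scaling to [0, 1]) can be sketched as follows. The function names are hypothetical, timestamps are assumed to be ISO-formatted UTC strings, and "simple MLE" is interpreted here as relative frequency over the top-n tags, which is an assumption since the source does not spell out the estimator.

```python
from collections import Counter
from datetime import datetime, timezone

def to_epoch_seconds(iso_timestamp):
    # Assumes a naive ISO-8601 string that should be read as UTC.
    dt = datetime.fromisoformat(iso_timestamp).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def tag_distribution(tag_lists, top_tags):
    # Relative frequency of each top-n tag (assumed reading of "simple MLE").
    counts = Counter(tag for tags in tag_lists for tag in tags if tag in top_tags)
    total = sum(counts.values())
    return [counts[tag] / total if total else 0.0 for tag in top_tags]

def min_max_scale(column):
    # Linearly rescale one feature column to the range [0, 1].
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]

view_counts = [10, 20, 30]
scaled_views = min_max_scale(view_counts)
user_tags = tag_distribution([["python", "numpy"], ["python"]], ["python", "numpy", "java"])
created = to_epoch_seconds("1970-01-01T00:00:10")
```

In practice the scaling bounds would presumably be estimated on the training split and reused for validation and test data, although the source does not say so explicitly.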
A further possible concern is that, in a real-world scenario, not all of this metadata would be available at classification time (e.g. some of it is generated quite a bit after the questions and answers are posted). In practice, all of the Question and User features are available at the time of question creation, and it is only really the Answer features where ambiguity comes in. With the comments, for example, the norm is that comments lead to the refinement (via post-editing) of the answer, and the vast majority of comments in our dataset were posted soon after the original answer. Thus, while it is certainly possible for comments to appear after the answer has been finalised, any biasing effect here is minor. The only feature which has potentially changed significantly from the time of answer posting is the number of answer views, although as we will observe empirically, the utility of this feature is slight.

Combining Text and Metadata
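To combine textual and metadata features, we concatenate $[v_{\text{text}}; v_{\text{meta}}]$ as the input question/answer pair embedding for the classification network $H$.
We define the prediction $\hat{y} := H([v_{\text{text}}; v_{\text{meta}}])$, where $\hat{y} \in \mathbb{R}^C$ in the case of $C = 2$ classes (i.e. "best" or not). Given training instance $n$, the prediction $\hat{y}^{(n)}$ is compared against the gold label using the categorical cross-entropy loss introduced above.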

Experiments
To train our models, we used the Adam Optimiser (Kingma and Ba, 2014). For decatt, we used dropout over F, G, H after every dense layer. For the doc2vec MLP, we included batch normalisation before, and dropout after, each dense layer. For testing, we picked the best model according to the validation results after the end of each epoch.
The parameters for decatt were initialised with a Gaussian distribution of mean 0 and variance 0.01, and for the doc2vec MLP we used Glorot normal initialisation. For Stack Overflow, the word embeddings were pretrained using GloVe (Pennington et al., 2014) with the FULL data (by combining all questions and answers in the sequence they appeared) for 50 epochs. Word embeddings were set to 150 dimensions. The co-occurrence weighting function's maximum value $x_{\max}$ was kept at the default of 10. For SEMEVAL, we used pretrained Common Crawl cased embeddings with 840G tokens and 300 dimensions (Pennington et al., 2014).
To train the decatt model for Stack Overflow we split the data into 3 partitions based on the size of the question/answer text, with separate partitions where the total length of the question/answer text was: (1) ≤ 500 words; (2) > 500 and ≤ 2000 words; and (3) > 2000 words. We used a different batch size for each partition (32, 8, and 4 respectively). Examples were shuffled within each partition after every epoch. For SEMEVAL we did not

Type | Name | Size | Description
Question | BasicQ | 3 | Number of times question has been viewed (×1), creation date of question (×1), and date of most recent activity on question (×1).
Question | QTags | n | Probability distribution over top-n tags for question (×n).
Answer | BasicA | 3 | Number of times answer has been viewed (×1), creation date of answer (×1), and date of most recent activity on answer (×1).
Answer | Comments | 10 | Number of comments on question and answer (×2), whether asker/answerer commented on question/answer (×4), number of sequential comments between asker and answerer across both question and answer (×1), average sentiment of comments on answer, both including and ignoring neutral sentences (×2), and whether there was at least one comment on answer (×1).
User | Badges | 86 | Whether user has each badge or not (×86).
User | UTags | 2n | Probability distribution over top-n tags across all questions answered by user (×n), and the same distribution restricted to questions answered by the user where their answer was chosen as best (×n).

Table 2: Summary of the metadata features used to improve question answering performance. These features are separated into feature groups, which in turn are separated into group types based on whether the values are specific to a given question, to a question's answer, or to a user.
use partitions, and instead used a batch size of 32, since training was fast enough. For doc2vec pre-training, we used the FULL corpus, with train, val and test documents excluded. [7] We used the dbow version of doc2vec, and included an additional word2vec step to learn the word embeddings simultaneously. [8] Note that for SEMEVAL, we experiment with using only the text features to better understand the competitiveness of the text-processing networks decatt and doc2vec.

[7] The text was additionally preprocessed by lowercasing. doc2vec training and inference were done using the gensim (Řehůřek and Sojka, 2010) implementation.
[8] Based on Lau and Baldwin (2016), our hyperparameter configuration of doc2vec for training was as follows: vector size = 200; negative samples = 5; window size = 3; minimum word frequency = 5; frequent word sampling threshold = $1 \times 10^{-5}$; starting learning rate ($\alpha_{\text{start}}$) = 0.05; minimum learning rate ($\alpha_{\min}$) = 0.0001; and number of epochs = 20. For inferring vectors in our train, val and test sets we used: $\alpha_{\text{start}}$ = 0.01; $\alpha_{\min}$ = 0.0001; and number of epochs = 500.
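The length-based partitioning used when training decatt on Stack Overflow (thresholds of 500 and 2000 words, batch sizes 32, 8 and 4, reshuffling within each partition every epoch) can be sketched as below. The function names and the toy data are hypothetical, and the order in which batches from different partitions are visited is an assumption, as the source does not specify it.

```python
import random

# (maximum total question+answer length in words, batch size) per partition
PARTITIONS = [(500, 32), (2000, 8), (float("inf"), 4)]

def partition_index(n_words):
    # Return the index of the first partition whose length limit is not exceeded.
    for idx, (limit, _) in enumerate(PARTITIONS):
        if n_words <= limit:
            return idx

def epoch_batches(examples, rng):
    """Yield batches for one epoch; examples are (question_tokens, answer_tokens, label)."""
    buckets = [[] for _ in PARTITIONS]
    for example in examples:
        total_length = len(example[0]) + len(example[1])
        buckets[partition_index(total_length)].append(example)
    for bucket, (_, batch_size) in zip(buckets, PARTITIONS):
        rng.shuffle(bucket)  # reshuffle within the partition on every call
        for start in range(0, len(bucket), batch_size):
            yield bucket[start:start + batch_size]

def toy_example(n_question_words, n_answer_words):
    return (["w"] * n_question_words, ["w"] * n_answer_words, 0)

demo_examples = (
    [toy_example(40, 60) for _ in range(70)]        # short pairs
    + [toy_example(300, 700) for _ in range(10)]    # medium pairs
    + [toy_example(1000, 2000) for _ in range(5)]   # long pairs
)
batches = list(epoch_batches(demo_examples, random.Random(0)))
```

Smaller batches for longer texts keep the memory footprint of each batch roughly bounded, which is a plausible motivation for the scheme.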
We tuned hyperparameters for all methods based on validation performance using the SigOpt Bayesian optimisation service. Optimal hyperparameter configurations are detailed in Table 3.
For additional comparison, we implemented the following baselines (some taken from Jansen et al. (2014), plus some additional baselines of our own): (1) random, which ranks the answers randomly; (2) first-answer, which ranks the answers in chronological order; (3) highest-rep, which ranks the answers by decreasing reputation; (4) longest-doc, which ranks the answers by decreasing length; and (5) tf-idf, which ranks the answers by the cosine of the tf-idf vector representations between the question and answer.
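Two of these baselines, tf-idf and longest-doc, are simple enough to sketch directly. The version below is an illustration only: it computes idf over the question and its candidate answers with add-one smoothing on the log term, which is an assumption, since the exact tf-idf configuration used in the experiments is not given in the text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; idf is estimated over these documents only (assumption).
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / freq) + 1.0 for term, freq in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_tfidf(question, answers):
    """Rank answer indices by decreasing tf-idf cosine similarity to the question."""
    vectors = tfidf_vectors([question] + answers)
    sims = [cosine(vectors[0], vec) for vec in vectors[1:]]
    return sorted(range(len(answers)), key=lambda i: sims[i], reverse=True)

def rank_longest_doc(answers):
    """Rank answer indices by decreasing length in tokens."""
    return sorted(range(len(answers)), key=lambda i: len(answers[i]), reverse=True)

question = "how to open a socket in python".split()
answers = [
    "use the socket module in python to open a socket".split(),
    "try turning it off and on again".split(),
    "java has sockets too".split(),
]
tfidf_ranking = rank_by_tfidf(question, answers)
length_ranking = rank_longest_doc(answers)
```

The first-answer and highest-rep baselines follow the same pattern, sorting by answer creation time and by answerer reputation respectively.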

Results
For a given question, we are interested both in how accurately our model ranks the answers, and whether it classifies the best answer correctly. However, for simplicity we look only at the performance of the model in correctly predicting the best answer. Following Bogdanova and Foster (2016), we measure this using P@1. In all cases this is calculated on the test set of questions, using the gold-standard "best answer" labels from the Stack Overflow corpus, as decided by the question asker. For SEMEVAL we use MAP to compare with other published results. The results are presented in tabular form; to examine the relative importance of the different metadata feature groups, we additionally provide feature ablation results in Table 5 for the Stack Overflow dataset.
We can make several observations from these results. Firstly, we can see that performance increases when we increase the dataset size (from SMALL to LARGE), showing that our models scale well with more data. For the text-only models, decatt outperforms doc2vec consistently over both datasets. In addition, metadata achieves much higher results than the text-only models, which shows the importance of utilising the rich metadata available for cQA retrieval. The best model, decatt + metadata, is the hybrid model that combines both sources of information and substantially improves performance compared to metadata alone. From the SEMEVAL results, we can see that our best text model (decatt) is competitive with the state-of-the-art model (semeval-best), which also incorporates a number of handcrafted metadata features to achieve a score of 88.4%.
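For reference, the two evaluation measures named above can be computed as in the following sketch. Each question is represented by the gold labels of its answers in the order the model ranked them (1 for a preferred or Good answer, 0 otherwise); the function names are hypothetical, and the handling of questions with no relevant answer (scored as 0 here) is an assumption that may differ from the official SemEval scorer.

```python
import math

def precision_at_1(ranked_labels):
    """Fraction of questions whose top-ranked answer is a gold answer."""
    hits = sum(1 for labels in ranked_labels if labels and labels[0] == 1)
    return hits / len(ranked_labels)

def average_precision(labels):
    """Average of precision values at the rank of each relevant answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(ranked_labels):
    return sum(average_precision(labels) for labels in ranked_labels) / len(ranked_labels)

# Three toy questions; labels are listed in model-ranked order.
demo_rankings = [
    [1, 0, 0, 0],  # best answer ranked first
    [0, 1, 0, 0],  # best answer ranked second
    [0, 0, 1, 1],  # two relevant answers ranked third and fourth
]
p_at_1 = precision_at_1(demo_rankings)
map_score = mean_average_precision(demo_rankings)
```

With a single preferred answer per question, as in the Stack Overflow datasets, P@1 is simply the proportion of questions for which the model's top-ranked answer is the accepted one.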
To better understand the attention learnt by decatt, we plotted the attention weights for a number of question-answer pairs in Figure 1. In general, technical words that appear relevant to the question and answer have a high weight. Overall, we find that decatt does not appear to capture word pairs which correspond to each other, as important question words are given strong attention consistently for most answer words. We do find a few exceptions with strong mutual attention, e.g. "roughly 10-20+ connections" and "multiple concurrent sockets" have strong mutual attention. This may explain the small difference in performance between the doc2vec and decatt models.
In terms of our feature ablation results, all feature types contribute to an increase in performance. The increases are greater in LARGE, suggesting that the model is better able to utilise the information given more data. The BasicQ and BasicA features, which include dates and view counts, do not appear to be of much use. Neither does Badges, which appears to hurt the model slightly. The other features give substantial gains, especially in LARGE. The Comments feature is strongest, but since it includes information based on the comments of the question asker, it may not be as relevant for the ultimate goal of cross-question answer retrieval.
Comparing the decatt and metadata models, we found that overall, both models perform well, and even when a model does not predict the accepted answer it often gives a highly-voted answer. We found that the metadata model tends to favour answers which have multiple comments involving the asker, and especially answers from high-reputation users. For example, in answer $A_{1,2}$ to question $Q_1$ in Table 6, there were a total of 8 comments to the answer (and no comments to any of the other answers), biasing metadata to prefer it. In practice, however, those comments were uniformly negative on the part of a number of prominent community members, which the model has failed to capture. This makes sense given the results in Table 5. However, it does not appear to understand comments where the asker is discussing why the answer fails to address their question, for example "I can't choose one Polygon class because each library operates only in its own implementation". While we include sentiment features in our metadata features, this alone might not be sufficient, since the discussion may revolve around facts and require more detailed modelling of the discourse structure of comments. Note that here, decatt correctly selected $A_{1,1}$, on the basis of its content.
As an example of a misclassification by decatt, answer $A_{2,2}$ is preferred over (best answer) $A_{2,1}$ in response to question $Q_2$ in Table 6. $A_{2,2}$ is actually a more comprehensive answer which deals with more issues in the original code, and it receives the same number of community votes as $A_{2,1}$. However, $A_{2,1}$ was posted first and receives a comment of gratitude from the question asker, meaning that metadata is able to correctly classify it as best answer.

Future Work
There are multiple avenues for future research based on our work. Our model's use of attention in the Stack Overflow dataset appears to be very limited, so a model which can make full use of attention could be a good direction of investigation. Another approach would be to extend our model to incorporate the entire list of answers and comments, possibly using graph-based approaches, instead of relying on individual question/answer pairs and manually engineered comment features. Ultimately, we would like to extend our methodology to cross-question answer retrieval, rather than just answer retrieval from a single question, given the goal of utilising the data in cQA forums to facilitate general-purpose non-factoid question answering.

Conclusions
In this paper we built a state-of-the-art model for cQA answer retrieval based on a deep learning framework. Unlike recent work on this problem, we successfully utilised metadata to substantially boost performance. In addition, we adapted an attentional component in our model, which improves results over the simple paragraph vector-based approach used in our benchmark, which was previously the state-of-the-art model. It is our hope that this work facilitates future research on utilising cQA data for non-factoid question answering.