Semi-supervised Question Retrieval with Gated Convolutions

Question answering forums are rapidly growing in size with no effective automated ability to refer to and reuse answers already available for previously posted questions. In this paper, we develop a methodology for finding semantically related questions. The task is difficult since 1) key pieces of information are often buried in extraneous details in the question body and 2) available annotations on similar questions are scarce and fragmented. We design a recurrent and convolutional model (gated convolution) to effectively map questions to their semantic representations. The models are pre-trained within an encoder-decoder framework (from body to title) on the basis of the entire raw corpus, and fine-tuned discriminatively from limited annotations. Our evaluation demonstrates that our model yields substantial gains over a standard IR baseline and various neural network architectures (including CNNs, LSTMs and GRUs).


Introduction
Question answering (QA) forums such as Stack Exchange are rapidly expanding and already contain millions of questions. The expanding scope and coverage of these forums often leads to many duplicate and interrelated questions, resulting in the same questions being answered multiple times. By identifying similar questions, we can potentially reuse existing answers, reducing response times and unnecessary repeated work. Unfortunately in most forums, the process of identifying and referring to existing similar questions is done manually by forum participants with limited, scattered success.

Figure 1:
Title: How can I boot Ubuntu from a USB?
Body: I bought a Compaq pc with Windows 8 a few months ago and now I want to install Ubuntu but still keep Windows 8. I tried Webi but when my pc restarts it read ERROR 0x000007b. I know that Windows 8 has a thing about not letting you have Ubuntu but I still want to have both OS without actually losing all my data ...
Title: When I want to install Ubuntu on my laptop I'll have to erase all my data. "Alonge side windows" doesnt appear
Body: I want to install Ubuntu from a Usb drive. It says I have to erase all my data but I want to install it along side Windows 8. The "Install alongside windows" option doesn't appear. What appear is, ...

The task of automatically retrieving questions similar to a given user's question has recently attracted significant attention and has become a testbed for various representation learning approaches (dos Santos et al., 2015). However, the task has proven to be quite challenging: for instance, dos Santos et al. (2015) report a 22.3% classification accuracy, yielding a 4 percent gain over a simple word matching baseline.
Several factors make the problem difficult. First, submitted questions are often long and contain extraneous information irrelevant to the main question being asked. For instance, the first question in Figure 1 pertains to booting Ubuntu using a USB stick. A large portion of the body contains tangential details that are idiosyncratic to this user, such as references to Compaq pc, Webi and the error message. Not surprisingly, these features are not repeated in the second question in Figure 1 about a closely related topic. The extraneous detail can easily confuse simple word-matching algorithms. Indeed, for this reason, some existing methods for question retrieval restrict attention to the question title only. While titles (when available) can succinctly summarize the intent, they also sometimes lack crucial detail available in the question body. For example, the title of the second question does not refer to installation from a USB drive. The second challenge arises from the noisy annotations. Indeed, the pairs of questions marked as similar by forum participants are largely incomplete. Our manual inspection of a sample set of questions from AskUbuntu (http://askubuntu.com/) shows that only 5% of similar pairs have been annotated by the users, with a precision of around 79%.
In this paper, we design a neural network model and an associated training paradigm to address these challenges. On a high level, our model is used as an encoder to map the title, body, or the combination to a vector representation. The resulting "question vector" representation is then compared to other questions via cosine similarity. We introduce several departures from typical architectures on a finer level. In particular, we incorporate adaptive gating in non-consecutive CNNs in order to focus temporal averaging in these models on key pieces of the questions. Gating plays a similar role in LSTMs (Hochreiter and Schmidhuber, 1997), though LSTMs do not reach the same level of performance in our setting. Moreover, we counter the scattered annotations available from user-driven associations by training the model largely based on the entire unannotated corpus. The encoder is coupled with a decoder and trained to reproduce the title from the noisy question body. The methodology is reminiscent of recent encoder-decoder networks in machine translation and document summarization (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b). The resulting encoder is subsequently fine-tuned discriminatively on the basis of limited annotations, yielding an additional performance boost.
We evaluate our model on the AskUbuntu corpus from Stack Exchange used in prior work (dos Santos et al., 2015). During training, we directly utilize noisy pairs readily available in the forum, but to have a realistic evaluation of the system performance, we manually annotate 8K pairs of questions. This clean data is used in two splits, one for development and hyper-parameter tuning and another for testing. We evaluate our model and the baselines using standard information retrieval (IR) measures such as Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Precision at n (P@n). Our full model achieves an MRR of 75.6% and P@1 of 62.0%, yielding 8% absolute improvement over a standard IR baseline, and 4% over standard neural network architectures (including CNNs, LSTMs and GRUs).

Related Work
Given the growing popularity of community QA forums, question retrieval has emerged as an important area of research (Nakov et al., 2015; Nakov et al., 2016). Previous work on question retrieval has modeled this task using machine translation, topic modeling and knowledge graph-based approaches (Jeon et al., 2005; Li and Manandhar, 2011; Duan et al., 2008; Zhou et al., 2013). More recent work relies on representation learning to go beyond word-based methods. For instance, one approach learns word embeddings using category-based metadata information for questions. It defines each question as a distribution which generates each word (embedding) independently, and subsequently uses a Fisher kernel to assess question similarities. Dos Santos et al. (2015) propose an approach which combines a convolutional neural network (CNN) and a bag-of-words representation for comparing questions. In contrast to these approaches, our model treats each question as a word sequence as opposed to a bag of words, and we apply a recurrent convolutional model as opposed to the traditional CNN model used by dos Santos et al. (2015) to map questions into meaning representations. Further, we propose a training paradigm that utilizes the entire corpus of unannotated questions in a semi-supervised manner.
Recent work on answer selection on community QA forums, similar to our task of question retrieval, has also involved the use of neural network architectures (Severyn and Moschitti, 2015; Wang and Nyberg, 2015; Shen et al., 2015; Feng et al., 2015; Tan et al., 2015). Compared to our work, these approaches focus on improving various other aspects of the model. For instance, Feng et al. (2015) explore different similarity measures beyond cosine similarity, and Tan et al. (2015) adopt the neural attention mechanism over RNNs to generate better answer representations given the questions as context.

Question Retrieval Setup
We begin by introducing the basic discriminative setting for retrieving similar questions. Let q be a query question which generally consists of both a title sentence and a body section. For efficiency reasons, we do not compare q against all the other questions in the database. Instead, we first retrieve a smaller candidate set of related questions Q(q) using a standard IR engine, and then we apply the more sophisticated models only to this reduced set. Our goal is to rank the candidate questions in Q(q) so that all the similar questions to q are ranked above the dissimilar ones. To do so, we define a similarity score s(q, p; θ) with parameters θ, where the similarity measures how closely candidate p ∈ Q(q) is related to question q. The method of comparison can make use of the title and body of each question.
The scoring function s(·, ·; θ) can be optimized on the basis of annotated data $D = \{(q_i, p_i^+, Q_i^-)\}$, where $p_i^+$ is a question similar to question $q_i$ and $Q_i^-$ is a negative set of questions deemed not similar to $q_i$. During training, the correct pairs of similar questions are obtained from available user-marked pairs, while the negative set $Q_i^-$ is drawn randomly from the entire corpus with the idea that the likelihood of a positive match is small given the size of the corpus. The candidate set during training is just $Q(q_i) = \{p_i^+\} \cup Q_i^-$. During testing, the candidate sets are retrieved by an IR engine and we evaluate against explicit manual annotations.
In the purely discriminative setting, we use a max-margin framework for learning (or fine-tuning) parameters θ. Specifically, in the context of a particular training example where $q_i$ is paired with $p_i^+$, we minimize the max-margin loss

$$L(\theta) = \max_{p \in Q(q_i)} \left\{ s(q_i, p; \theta) - s(q_i, p_i^+; \theta) + \delta(p, p_i^+) \right\}$$

where δ(·, ·) denotes a non-negative margin. We set $\delta(p, p_i^+)$ to be a small constant when $p \neq p_i^+$ and 0 otherwise. The parameters θ can be optimized through sub-gradients ∂L/∂θ aggregated over small batches of the training instances.
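As an illustration of the max-margin criterion, the following minimal sketch computes the loss for a single training example from pre-computed similarity scores. The function name `max_margin_loss` and the margin value 0.2 are assumptions made for this example only; they are not taken from our implementation.

```python
def max_margin_loss(score_pos, scores_neg, margin=0.2):
    """Max-margin loss for one training example (illustrative sketch).

    score_pos  -- s(q_i, p_i^+; theta), the score of the annotated similar question
    scores_neg -- scores s(q_i, p; theta) for the sampled negatives in Q_i^-
    margin     -- the small constant delta(p, p_i^+) applied when p != p_i^+
                  (0.2 is an assumed value for illustration)
    """
    # The positive candidate contributes s - s + 0 = 0 to the maximum,
    # so the loss is never negative.
    violations = [0.0]
    for s in scores_neg:
        violations.append(s - score_pos + margin)
    return max(violations)
```

The loss is zero once every negative candidate scores at least one margin below the positive one, so only examples that still violate the margin contribute sub-gradients.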
There are two key problems that remain. First, we have to define and parameterize the scoring function s(q, p; θ). We design a recurrent neural network model for this purpose and use it as an encoder to map each question into its meaning representation. The resulting similarity function s(q, p; θ) is just the cosine similarity between the corresponding representations, as shown in Figure 2 (a). The parameters θ pertain to the neural network only. Second, in order to offset the scarcity and limited coverage of the training annotations, we pre-train the parameters θ on the basis of the much larger unannotated corpus. The resulting parameters are subsequently fine-tuned using the discriminative setup described above.
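The scoring step can be sketched as follows, assuming the question vectors have already been produced by some encoder. The helper names `cosine` and `rank_candidates` and the dictionary-based candidate set are hypothetical choices for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate ids sorted by decreasing cosine similarity to the query.

    candidate_vecs maps a candidate id to its encoded vector.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]
```

For example, `rank_candidates([1, 0], {"a": [0, 1], "b": [2, 0.1], "c": [1, 1]})` orders the candidates as `b`, `c`, `a`.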

Non-consecutive Convolution
We describe here our encoder model, i.e., the method for mapping the question title and body to a vector representation. Our approach is inspired by temporal convolutional neural networks (LeCun et al., 1998) and, in particular, by a recent refinement of them tailored to capture longer-range, non-consecutive patterns in a weighted manner. Such models can be used to effectively summarize occurrences of patterns in text and aggregate them into a vector representation. However, the summary produced is not selective since all pattern occurrences are counted, weighted by how cohesive (non-consecutive) they are. In our problem, the question body tends to be very long and full of irrelevant words and fragments. Thus, we believe that interpreting the question body requires a more selective approach to pattern extraction.
Our model successively reads tokens in the question title or body, denoted as $\{x_i\}_{i=1}^{l}$, and transforms this sequence into a sequence of states $\{h_i\}_{i=1}^{l}$. The resulting state sequence is subsequently aggregated into a single final vector representation for each text as discussed below. Our approach builds on the non-consecutive convolution model, thus we begin by briefly outlining it. Let $W_1$ and $W_2$ denote filter matrices (as parameters) for pattern size $n = 2$. That model generates a sequence of states in response to tokens according to

$$c_{t',t} = W_1 x_{t'} + W_2 x_t$$
$$c_t = \sum_{t' < t} \lambda^{t-t'-1} c_{t',t}$$
$$h_t = \tanh(c_t + b)$$

where $c_{t',t}$ represents a bigram pattern, $c_t$ accumulates a range of patterns and $\lambda \in [0, 1)$ is a constant decay factor used to down-weight patterns with longer spans. The operations can be cast in a "recurrent" manner and evaluated with dynamic programming. The problem with the approach for our purposes is, however, that the weighting factor λ is the same (constant) for all patterns, rather than being triggered by the state $h_{t-1}$ or the observed token $x_t$.
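To make the "recurrent" evaluation concrete, the sketch below assumes the bigram form $c_t = \sum_{t'<t} \lambda^{t-t'-1} (W_1 x_{t'} + W_2 x_t)$ with $h_t = \tanh(c_t + b)$, and compares a direct double loop against a single pass that keeps two running accumulators. The function names and the plain-list vector representation are illustrative assumptions, not the original implementation.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def states_bruteforce(xs, W1, W2, b, lam):
    """Evaluate the decayed bigram sum directly, in O(l^2) time."""
    hs = []
    for t in range(len(xs)):
        c = [0.0] * len(b)
        w2x = matvec(W2, xs[t])
        for tp in range(t):
            weight = lam ** (t - tp - 1)  # longer spans get smaller weight
            w1x = matvec(W1, xs[tp])
            c = [ci + weight * (u + v) for ci, u, v in zip(c, w1x, w2x)]
        hs.append([math.tanh(ci + bi) for ci, bi in zip(c, b)])
    return hs


def states_recurrent(xs, W1, W2, b, lam):
    """Same states in O(l) time using running accumulators."""
    acc = [0.0] * len(b)  # sum over t' < t of lam^(t-t'-1) * W1 x_{t'}
    norm = 0.0            # sum over t' < t of lam^(t-t'-1)
    hs = []
    for x in xs:
        w2x = matvec(W2, x)
        c = [a + norm * v for a, v in zip(acc, w2x)]
        hs.append([math.tanh(ci + bi) for ci, bi in zip(c, b)])
        # Decay the accumulators and add the current token for the next step.
        w1x = matvec(W1, x)
        acc = [lam * a + u for a, u in zip(acc, w1x)]
        norm = lam * norm + 1.0
    return hs
```

Both functions return the same state sequence; the second makes explicit that each step only needs the decayed accumulators from the previous step.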
Adaptive Gated Decay We refine this model by learning context-dependent weights. For example, if the current input token provides no relevant information (e.g., symbols, functional words), the model should ignore it by incorporating the token with a vanishing weight. In contrast, strong semantic content words such as "ubuntu" or "windows" should be included with much larger weights. To achieve this effect we introduce neural gates similar to LSTMs to specify when and how to average the observed signals. The resulting architecture integrates recurrent networks with non-consecutive convolutional models:

$$\lambda_t = \sigma(W^{\lambda} x_t + U^{\lambda} h_{t-1} + b^{\lambda})$$
$$c_t^{(1)} = \lambda_t \odot c_{t-1}^{(1)} + (1 - \lambda_t) \odot (W_1 x_t)$$
$$c_t^{(2)} = \lambda_t \odot c_{t-1}^{(2)} + (1 - \lambda_t) \odot (c_{t-1}^{(1)} + W_2 x_t)$$
$$\cdots$$
$$c_t^{(n)} = \lambda_t \odot c_{t-1}^{(n)} + (1 - \lambda_t) \odot (c_{t-1}^{(n-1)} + W_n x_t)$$
$$h_t = \tanh(c_t^{(n)} + b)$$

where σ(·) is the sigmoid function and $\odot$ represents the element-wise product. Here $c_t^{(1)}, \cdots, c_t^{(n)}$ are accumulator vectors that store weighted averages of 1-gram to n-gram features. When the gate $\lambda_t = 0$ (vector) for all t, the model represents a traditional CNN with filter width n. When $\lambda_t > 0$, however, $c_t^{(n)}$ becomes the sum of an exponential number of terms, enumerating all possible n-grams within $x_1, \cdots, x_t$ (seen by expanding the formulas). Note that the gate $\lambda_t(\cdot)$ is parametrized and responds directly to the previous state and the token in question. We refer to this model as RCNN from here on.
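A minimal sketch of the gated recurrence is given below. It assumes the update $c_t^{(j)} = \lambda_t \odot c_{t-1}^{(j)} + (1 - \lambda_t) \odot (c_{t-1}^{(j-1)} + W_j x_t)$ with $c^{(0)} \equiv 0$, a gate $\lambda_t = \sigma(W^{\lambda} x_t + U^{\lambda} h_{t-1} + b^{\lambda})$, and $h_t = \tanh(c_t^{(n)} + b)$. The function names, argument order and list-based vectors are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rcnn_states(xs, Ws, W_lam, U_lam, b_lam, b):
    """Run the gated convolution over tokens xs; Ws = [W_1, ..., W_n]."""
    n, d = len(Ws), len(b)
    cs = [[0.0] * d for _ in range(n)]  # accumulators c^(1..n) from step t-1
    h = [0.0] * d
    hs = []
    for x in xs:
        # The gate depends on the current token and the previous state.
        wx, uh = matvec(W_lam, x), matvec(U_lam, h)
        lam = [sigmoid(a + u + bl) for a, u, bl in zip(wx, uh, b_lam)]
        new_cs = []
        for j in range(n):
            wjx = matvec(Ws[j], x)
            lower = cs[j - 1] if j > 0 else [0.0] * d  # c^(j-1) at step t-1
            new_cs.append([l * old + (1.0 - l) * (lo + w)
                           for l, old, lo, w in zip(lam, cs[j], lower, wjx)])
        cs = new_cs
        h = [math.tanh(c + bi) for c, bi in zip(cs[-1], b)]
        hs.append(h)
    return hs
```

Driving the gate to zero (for instance with a very negative gate bias) makes each state depend only on the last n tokens, which matches the CNN special case noted above.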
Pooling In order to use the model as part of the discriminative question retrieval framework outlined earlier, we must condense the state sequence to a single vector. There are two simple alternative pooling strategies that we have explored: either averaging over the states or simply taking the last one as the meaning representation. In addition, we apply the encoder to both the question title and body, and the final representation is computed as the average of the two resulting vectors.
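The two pooling strategies and the title/body averaging can be sketched as follows. The function names are hypothetical, and the state sequences are assumed to be non-empty lists of equal-length vectors.

```python
def mean_pool(states):
    """Average the state vectors over all positions."""
    count = len(states)
    return [sum(column) / count for column in zip(*states)]


def last_pool(states):
    """Use the final state as the representation."""
    return list(states[-1])


def question_vector(title_states, body_states, pool=last_pool):
    """Average the pooled title vector and the pooled body vector."""
    title_vec = pool(title_states)
    body_vec = pool(body_states)
    return [(t + b) / 2.0 for t, b in zip(title_vec, body_vec)]
```

Passing `pool=mean_pool` switches to average pooling without changing the rest of the pipeline.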
Once the aggregation is specified, the parameters of the gate and the filter matrices can be learned in a purely discriminative fashion. Given that the available annotations are limited and user-guided, we instead use the discriminative training only for fine-tuning an already trained model. The method of pre-training the model on the basis of the entire corpus of questions is discussed next.

Pre-training Using the Entire Corpus
The number of questions in the AskUbuntu corpus far exceeds user annotations of pairs of similar questions. We can make use of this larger raw corpus in two different ways. First, since models take word embeddings as input, we can tailor the embeddings to the specific vocabulary and expressions in this corpus. To this end, we run word2vec (Mikolov et al., 2013) on the raw corpus in addition to the Wikipedia dump. Second, and more importantly, we use individual questions as training examples for an auto-encoder constructed by pairing the encoder model (RCNN) with a corresponding decoder (of the same type). As illustrated in Figure 2 (b), the resulting encoder-decoder architecture is akin to those used in machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b) and summarization.
Our encoder-decoder pair represents a conditional language model P (title|context), where the context can be any of (a) the original title itself, (b) the question body and (c) the title/body of a similar question. All possible (title, context) pairs are used during training to optimize the likelihood of the words (and their order) in the titles. We use the question title as the target for two reasons. The question body contains more information than the title but also has many irrelevant details. As a result, we can view the title as a distilled summary of the noisy body, and the encoder-decoder model is trained to act as a denoising auto-encoder. Moreover, training a decoder for the title (rather than the body) is also much faster since titles tend to be short (around 10 words).
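The construction of (title, context) training pairs can be sketched as below. The data layout (a dictionary of questions with `title` and `body` fields, plus a list of similar-pair ids) and the symmetric use of each similar pair are assumptions made for this illustration.

```python
def pretraining_pairs(questions, similar_pairs):
    """Build (target_title, context) pairs for the conditional language model.

    questions     -- dict: question id -> {"title": str, "body": str}
    similar_pairs -- list of (id_a, id_b) marked as similar
    """
    pairs = []
    for q in questions.values():
        pairs.append((q["title"], q["title"]))  # (a) the title itself
        pairs.append((q["title"], q["body"]))   # (b) the question body
    for a, b in similar_pairs:
        # (c) title/body of a similar question, used in both directions
        for target, source in ((a, b), (b, a)):
            pairs.append((questions[target]["title"], questions[source]["title"]))
            pairs.append((questions[target]["title"], questions[source]["body"]))
    return pairs
```

Every pair uses a title as the decoding target, so the decoder only ever has to generate short sequences.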
The encoders pre-trained in this manner are subsequently fine-tuned according to the discriminative criterion described already in Section 3.

Alternative models
For comparison, we also train three alternative benchmark encoders (LSTMs, GRUs and CNNs) for mapping questions to vector representations. LSTM and GRU-based encoders can be pre-trained analogously to RCNNs, and fine-tuned discriminatively. CNN encoders, on the other hand, are only trained discriminatively. While plausible, none of these alternatives reaches quite the same level of performance as our pre-trained RCNN.
LSTMs LSTM cells (Hochreiter and Schmidhuber, 1997) have been used to capture semantic information across a wide range of applications, including machine translation and entailment recognition (Bahdanau et al., 2015; Bowman et al., 2015; Rocktäschel et al., 2016). Their success can be attributed to neural gates that adaptively read or discard information to/from internal memory states.
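Specifically, an LSTM network successively reads the input token $x_t$, internal state $c_{t-1}$, as well as the visible state $h_{t-1}$, and generates the new states $c_t$, $h_t$:

$$i_t = \sigma(W^i x_t + U^i h_{t-1} + b^i)$$
$$f_t = \sigma(W^f x_t + U^f h_{t-1} + b^f)$$
$$o_t = \sigma(W^o x_t + U^o h_{t-1} + b^o)$$
$$z_t = \tanh(W^z x_t + U^z h_{t-1} + b^z)$$
$$c_t = i_t \odot z_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

where i, f and o are input, forget and output gates, respectively. Given the visible state sequence $\{h_i\}_{i=1}^{l}$, we can aggregate it to a single vector exactly as with RCNNs. The LSTM encoder can be pre-trained (and fine-tuned) in a similar way to our RCNN model. For instance, Dai and Le (2015) recently adopted pre-training for a text classification task.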
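GRUs A GRU is another comparable unit for sequence modeling (Cho et al., 2014a; Chung et al., 2014). Similar to the LSTM unit, the GRU has two neural gates that control the flow of information:

$$i_t = \sigma(W^i x_t + U^i h_{t-1} + b^i)$$
$$r_t = \sigma(W^r x_t + U^r h_{t-1} + b^r)$$
$$c_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b)$$
$$h_t = i_t \odot c_t + (1 - i_t) \odot h_{t-1}$$

where i and r are the input and reset gates, respectively. Again, the GRUs can be trained in the same way.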
CNNs Convolutional neural networks (LeCun et al., 1998) have also been successfully applied to various NLP tasks (Kalchbrenner et al., 2014; Kim, 2014; Kim et al., 2015; Gao et al., 2014). As models, they are different from LSTMs since the temporal convolution operation and associated filters map local chunks (windows) of the input into a feature representation. Concretely, if we let n denote the filter width, and $W_1, \cdots, W_n$ the corresponding filter matrices, then the convolution operation is applied to each window of n consecutive words as follows:

$$h_t = f(W_1 x_{t-n+1} + \cdots + W_n x_t + b)$$

The set of output state vectors $\{h_t\}$ produced in this case is typically referred to as a feature map. Since each vector in the feature map only pertains to local information, the last vector is not sufficient to capture the meaning of the entire sequence. Instead, we consider max-pooling or average-pooling to obtain the aggregate representation for the entire sequence.
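For comparison with the recurrent encoders, a temporal convolution followed by pooling can be sketched as below. The choice of tanh as the non-linearity f, the restriction to full windows (no padding) and the function names are assumptions of this example.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cnn_feature_map(xs, Ws, b):
    """Apply filters Ws = [W_1, ..., W_n] to every window of n consecutive tokens."""
    n = len(Ws)
    feature_map = []
    for t in range(n - 1, len(xs)):
        z = list(b)
        for j, W in enumerate(Ws):
            # W_{j+1} is applied to the token at position t - n + 1 + j.
            wx = matvec(W, xs[t - n + 1 + j])
            z = [zi + wi for zi, wi in zip(z, wx)]
        feature_map.append([math.tanh(zi) for zi in z])
    return feature_map


def max_pool(feature_map):
    """Element-wise maximum over positions."""
    return [max(column) for column in zip(*feature_map)]


def average_pool(feature_map):
    """Element-wise mean over positions."""
    return [sum(column) / len(feature_map) for column in zip(*feature_map)]
```

Because each feature vector sees only n tokens, the pooled vector (rather than the last state) serves as the sequence representation.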

Experimental Setup
Dataset We use the Stack Exchange AskUbuntu dataset used in prior work (dos Santos et al., 2015). This dataset contains 167,765 unique questions, each consisting of a title and a body (we truncate the body section at a maximum of 100 words), and a set of user-marked similar question pairs. We provide various statistics from this dataset in Table 1.
Gold Standard for Evaluation User-marked similar question pairs on QA sites are often known to be incomplete. In order to evaluate this in our dataset, we took a sample set of questions paired with 20 candidate questions retrieved by a search engine trained on the AskUbuntu data. The search engine used is the well-known BM25 model (Robertson and Zaragoza, 2009). Our manual evaluation of the candidates showed that only 5% of the similar questions were marked by users, with a precision of 79%. Clearly, this low recall would not lead to a realistic evaluation if we used user marks as our gold standard. Instead, we make use of expert annotations carried out on a subset of questions.
Training Set We use user-marked similar pairs as positive pairs in training since the annotations have high precision and do not require additional manual annotations. This allows us to use a much larger training set. We use random questions from the corpus paired with each query question $q_i$ as negative pairs in training. We randomly sample 20 questions as negative examples for each $q_i$ during each epoch.

Development and Test Sets

Baselines and Evaluation Metrics We evaluated neural network models, including CNNs, LSTMs, GRUs and RCNNs, by comparing them with the following baselines:
• BM25: we used the BM25 similarity measure provided by Apache Lucene.
• TF-IDF: we ranked questions using cosine similarity based on a vector-based word representation for each question.
We evaluated the models based on the following IR metrics: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Precision at 1 (P@1), and Precision at 5 (P@5).

Table 3: Configuration of neural models. d is the hidden dimension, |θ| is the number of parameters and n is the filter width.
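The ranking metrics named above (MAP, MRR, P@1 and P@5) can be sketched as follows. Each query is represented by its ranked candidate list as 0/1 relevance labels. The function names, the normalization of average precision by the number of relevant candidates in the list, and the score of 0 for queries with no relevant candidate are assumptions of this example.

```python
def average_precision(ranked_labels):
    """Mean of precision values at the rank of each relevant candidate."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """Inverse rank of the first relevant candidate (0 if there is none)."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def precision_at(ranked_labels, n):
    """Fraction of the top-n candidates that are relevant."""
    return sum(1 for relevant in ranked_labels[:n] if relevant) / n


def evaluate(rankings):
    """Average each metric over all queries."""
    count = len(rankings)
    return {
        "MAP": sum(average_precision(r) for r in rankings) / count,
        "MRR": sum(reciprocal_rank(r) for r in rankings) / count,
        "P@1": sum(precision_at(r, 1) for r in rankings) / count,
        "P@5": sum(precision_at(r, 5) for r in rankings) / count,
    }
```

For instance, a query whose relevant candidates appear at ranks 2 and 4 has an average precision of (1/2 + 2/4) / 2 = 0.5 and a reciprocal rank of 0.5.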

Hyper-parameters
We performed an extensive hyper-parameter search to identify the best model for the baselines and neural network models. For the TF-IDF baseline, we tried n-gram feature order n ∈ {1, 2, 3} with and without stop-word pruning. For the SVM baseline, we used the default SVM-Light parameters; the dev data is used only to enlarge the training set when evaluating on the test set. We also tried giving higher weight to dev instances, but this did not result in any improvement.
For all the neural network models, we used Adam (Kingma and Ba, 2015) as the optimization method with the default setting suggested by the authors. We optimized other hyper-parameters with the following range of values: learning rate ∈ {1e-3, 3e-4}, dropout (Hinton et al., 2012) probability ∈ {0.1, 0.2, 0.3}, CNN feature width ∈ {2, 3, 4}. We also tuned the pooling strategies and ensured each model has a comparable number of parameters. The default configurations of LSTMs, GRUs, CNNs and RCNNs are shown in Table 3. We used MRR to identify the best training epoch and the model configuration. For the same model configuration, we report average performance across 5 independent runs.
Word Vectors We ran word2vec (Mikolov et al., 2013) to obtain 200-dimensional word embeddings using all Stack Exchange data (excluding StackOverflow) and a large Wikipedia corpus. The word vectors are kept fixed across all experiments to avoid over-fitting.

Results
Overall Performance Table 2 shows the performance of the baselines and the neural encoder models on the question retrieval task. The results show that our full model, RCNNs with pre-training, achieves the best performance across all metrics on both the dev and test sets. For instance, the full model gets a P@1 of 62.0% on the test set, outperforming the word matching-based method BM25 by over 8 percentage points. Further, our RCNN model also outperforms the other neural encoder models and the baselines across all metrics. This superior performance indicates that the use of non-consecutive filters and a varying decay is effective in improving traditional neural network models.

Pooling Strategy
We analyze the effect of various pooling strategies for the neural network encoders. As shown in Table 4, our RCNN model outperforms other neural models regardless of the two pooling strategies explored. We also observe that simply using the last hidden state as the final representation achieves better results for the RCNN model.
Pre-training Note that, during pre-training, the last hidden states generated by the neural encoder are used by the decoder to reproduce the question titles. It would be interesting to see how such states capture the meaning of questions. To this end, we evaluate MRR on the dev set using the last hidden states of the question titles. We also test how the encoder captures information from the question bodies to produce the distilled summary, i.e., the titles. To do so, we evaluate the perplexity of the trained encoder-decoder model on a held-out set of the corpus, which contains about 2000 questions. As shown in Figure 3, the representations generated by the RCNN encoder perform quite well, resulting in a perplexity of 25 and over 68% MRR without the subsequent fine-tuning. Interestingly, the LSTM and GRU networks obtain similar perplexity on the held-out set, but achieve much worse MRR for similar question retrieval. For instance, the GRU encoder obtains only 63% MRR, 5 points worse than the RCNN model's MRR performance. As a result, the LSTM and GRU encoders do not benefit clearly from pre-training, as suggested in Table 2.
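Perplexity on a held-out set can be computed from the per-token log-probabilities that the decoder assigns to each title, as in the sketch below. The function name and the input layout (one list of natural-log probabilities per title) are assumptions of this example.

```python
import math


def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood per token.

    token_log_probs -- list of lists; each inner list holds the natural-log
                       probability of every token in one held-out title
    """
    total = sum(sum(log_probs) for log_probs in token_log_probs)
    count = sum(len(log_probs) for log_probs in token_log_probs)
    return math.exp(-total / count)
```

A model that assigns probability 1/25 to every title token has a perplexity of 25 under this definition.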

Using Question Body
The inconsistent performance difference may be explained by two hypotheses. One is that perplexity is not suitable for measuring the similarity of the encoded text, thus the power of the encoder is not illustrated in terms of perplexity. Another hypothesis is that the LSTM and GRU encoders may learn non-linear representations, so their semantic relatedness cannot be directly assessed by cosine similarity.
[Figure 4 panels: (a) how can i add guake terminal to the start-up applications; (b) banshee crashes with `` an unhandled exception was thrown : '']
Adaptive Decay Finally, we analyze the gated convolution of our model. Figure 5 demonstrates at each word position t how much input information is taken into the model by the adaptive weights $1 - \lambda_t$. The average of the weights in the vector decreases as t increases, suggesting that the information encoded into the state vector saturates as more input is processed. On the other hand, the largest value in the weight vector remains high throughout the input, indicating that at least some information has been stored in $h_t$ and $c_t$.
We also conduct a case study analyzing the neural gate. Since directly inspecting the 400-dimensional decay vector is difficult, we train a model that uses a scalar decay instead. As shown in Figure 4, the model learns to assign higher weights to application names and quoted error messages, which intuitively are important pieces of a question in the AskUbuntu domain.

Conclusion
In this paper, we employ gated (non-consecutive) convolutions to map questions to their semantic representations, and demonstrate their effectiveness on the task of question retrieval in community QA forums. This architecture enables the model to glean key pieces of information from lengthy, detail-riddled user questions. Pre-training within an encoder-decoder framework (from body to title) on the basis of the entire raw corpus is integral to the model's success.
[Figure caption fragment: Values are averaged across all questions in the dev and test set.]