Learning Hybrid Representations to Retrieve Semantically Equivalent Questions

Retrieving similar questions in online Q&A community sites is a difficult task because different users may formulate the same question in a variety of ways, using different vocabulary and structure. In this work, we propose a new neural network architecture to perform the task of semantically equivalent question retrieval. The proposed architecture, which we call BOW-CNN, combines a bag-of-words (BOW) representation with a distributed vector representation created by a convolutional neural network (CNN). We perform experiments using data collected from two Stack Exchange communities. Our experimental results show that: (1) BOW-CNN is more effective than BOW-based information retrieval methods such as TFIDF; (2) BOW-CNN is more robust than the pure CNN for long texts.


Introduction
Most question-answering (Q&A) community sites advise users to search for similar questions before posting a new one. This is not always an easy task because different users may formulate the same question in a variety of ways.
We define two questions as semantically equivalent if they can be adequately answered by the exact same answer. Here is an example of a pair of such questions from the Ask Ubuntu community, which is part of the Stack Exchange Q&A community site: ($q_1$) "I have downloaded ISO files recently. How do I burn it to a CD or DVD or mount it?" and ($q_2$) "I need to copy the iso file for Ubuntu 12.04 to a CD-R in Win8. How do I do so?". Retrieving semantically equivalent questions is a challenging task due to two main factors: (1) the same question can be rephrased in many different ways; and (2) two questions may be different but may refer implicitly to a common problem with the same answer. Therefore, traditional similarity measures based on word overlap, such as shingling and the Jaccard coefficient (Broder, 1997) and its variations (Wu et al., 2011), are not able to capture many cases of semantic equivalence. To capture the semantic relationship between pairs of questions, different strategies have been used, such as machine translation (Jeon et al., 2005; Xue et al., 2008), knowledge graphs (Zhou et al., 2013) and topic modelling (Cai et al., 2011; Ji et al., 2012).
Recent papers (Kim, 2014; Hu et al., 2014; Yih et al., 2014; dos Santos and Gatti, 2014; Shen et al., 2014) have shown the effectiveness of convolutional neural networks (CNN) for sentence-level analysis of short texts in a variety of different natural language processing and information retrieval tasks. This motivated us to investigate CNNs for the task of semantically equivalent question retrieval. However, given that the size of a question in an online community may vary from a single sentence to a detailed problem description with several sentences, it was not clear that the CNN representation would be the most adequate.
In this paper, we propose a hybrid neural network architecture, which we call BOW-CNN. It combines a traditional bag-of-words (BOW) representation with a distributed vector representation created by a CNN, to retrieve semantically equivalent questions. Using a ranking loss function during training, BOW-CNN learns to represent questions while learning to rank them according to their semantic similarity. We evaluate BOW-CNN over two different Q&A communities in the Stack Exchange site, comparing it against CNN and six well-established information retrieval algorithms based on BOW. The results show that our proposed solution outperforms BOW-based information retrieval methods such as term frequency-inverse document frequency (TFIDF) in all evaluated scenarios. Moreover, we were able to show that for short texts (title of the questions), an approach using only CNN obtains the best results, whereas for long texts (title and body of the questions), our hybrid approach (BOW-CNN) is more effective.

Feed Forward Processing
The goal of the feed forward processing is to calculate the similarity between a pair of questions $(q_1, q_2)$. To perform this task, each question $q$ follows two parallel paths (BOW and CNN), each one producing a distinct vector representation of $q$. The BOW path produces a weighted bag-of-words representation of the question, $r^{bow}_q$, where the weight of each word in the vocabulary $V$ is learned by the neural network. The CNN path uses a convolutional approach to construct a distributed vector representation, $r^{conv}_q$, of the question. After producing the BOW and CNN representations for the two input questions, BOW-CNN computes two partial similarity scores: $s_{bow}(q_1, q_2)$, for the BOW representations, and $s_{conv}(q_1, q_2)$, for the CNN representations. Finally, it combines the two partial scores to create the final score $s(q_1, q_2)$.

BOW Path
The generation of the bag-of-words representation for a given question $q$ is quite straightforward. As detailed in Figure 1, we first create a sparse vector $q^{bow} \in \mathbb{R}^{|V|}$ that contains the frequency in $q$ of each word of the vocabulary. Next, we compute the weighted bag-of-words representation by performing the element-wise vector multiplication:
$$r^{bow}_q = q^{bow} \odot t$$
where the vector $t \in \mathbb{R}^{|V|}$ contains a weight for each word in the vocabulary $V$. The vector $t$ is a parameter to be learned by the network. This is closely related to the TFIDF text representation.
In fact, if we fix t to the vector of IDFs, this corresponds to the exact TFIDF representation.
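The element-wise weighting described above can be illustrated with a minimal sketch. The toy vocabulary, the weight values, and the function name `weighted_bow` are hypothetical and chosen only for illustration; out-of-vocabulary words are simply ignored, which is an assumption rather than something the text specifies.

```python
from collections import Counter

def weighted_bow(tokens, vocab, t):
    """Return r_bow = q_bow * t (element-wise) as a dense list of length |V|.

    tokens: list of words in the question
    vocab:  dict mapping word -> index in the vocabulary
    t:      list of per-word weights (learned in BOW-CNN; fixed toy values here)
    """
    # q_bow: frequency of each vocabulary word in the question
    counts = Counter(tok for tok in tokens if tok in vocab)
    r_bow = [0.0] * len(vocab)
    for word, freq in counts.items():
        i = vocab[word]
        r_bow[i] = freq * t[i]  # element-wise product of frequency and weight
    return r_bow

# Hypothetical toy data
vocab = {"burn": 0, "iso": 1, "cd": 2, "the": 3}
t = [2.0, 1.5, 1.0, 0.1]
question = "burn the iso to the cd".split()
r_bow = weighted_bow(question, vocab, t)
```

If `t` were fixed to IDF values instead of being learned, `r_bow` would be the TFIDF vector of the question.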

CNN Path
As detailed in Figure 2, the first layer of the CNN path transforms words into representations that capture syntactic and semantic information about the words. Given a question consisting of $N$ words $q = \{w_1, \ldots, w_N\}$, every word $w_n$ is converted into a real-valued vector $r^{w_n}$. Therefore, for each question, the input to the next NN layer is a sequence of real-valued vectors $q^{emb} = \{r^{w_1}, \ldots, r^{w_N}\}$. Word representations are encoded by column vectors in an embedding matrix $W^0 \in \mathbb{R}^{d \times |V|}$, where $V$ is a fixed-sized vocabulary. The next step in the CNN path consists of creating distributed vector representations $r^{conv}_{q_1}$ and $r^{conv}_{q_2}$ from the word embedding sequences $q^{emb}_1$ and $q^{emb}_2$. We perform this by using a convolutional layer in the same way as used in (dos Santos and Gatti, 2014) to create sentence-level representations.
More specifically, given a question $q_1$, the convolutional layer applies a matrix-vector operation to each window of $k$ successive word embeddings in $q^{emb}_1 = \{r^{w_1}, \ldots, r^{w_N}\}$. Let us define the vector $z_n \in \mathbb{R}^{dk}$ as the concatenation of a sequence of $k$ word embeddings, centered on the $n$-th word:
$$z_n = \left(r^{w_{n-(k-1)/2}}, \ldots, r^{w_{n+(k-1)/2}}\right)^T$$
The convolutional layer computes the $j$-th element of the vector $r^{conv}_{q_1} \in \mathbb{R}^{cl_u}$ as follows:
$$[r^{conv}_{q_1}]_j = f\left(\max_{1 \le n \le N} \left[W^1 z_n + b^1\right]_j\right)$$
where $W^1 \in \mathbb{R}^{cl_u \times dk}$ is the weight matrix of the convolutional layer and $f$ is the hyperbolic tangent function. Matrices $W^0$ and $W^1$, and the vector $b^1$, are parameters to be learned. The word embedding size $d$, the number of convolutional units $cl_u$, and the size of the word context window $k$ are hyperparameters to be chosen by the user.
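As an illustration of the convolutional layer, the following is a minimal pure-Python sketch. It assumes that the per-window outputs are aggregated with a max over all window positions before applying tanh (the pooling used for sentence-level representations in dos Santos and Gatti, 2014), that $k$ is odd, and that question borders are zero-padded; the padding scheme and the function name `conv_representation` are assumptions made for this example.

```python
import math
import random

def conv_representation(q_emb, W1, b1, k):
    """Compute a fixed-size vector from a variable-length question.

    q_emb: list of N word embeddings, each a list of d floats
    W1:    cl_u x (d*k) weight matrix (list of rows)
    b1:    list of cl_u biases
    k:     odd context window size
    """
    d = len(q_emb[0])
    half = (k - 1) // 2
    pad = [[0.0] * d for _ in range(half)]  # zero padding at borders (assumption)
    padded = pad + q_emb + pad
    best = [-math.inf] * len(W1)
    for n in range(len(q_emb)):
        # z_n: concatenation of the k embeddings centered on word n
        z = [x for vec in padded[n:n + k] for x in vec]
        for j, (row, bias) in enumerate(zip(W1, b1)):
            act = sum(w * x for w, x in zip(row, z)) + bias
            best[j] = max(best[j], act)  # max over window positions
    return [math.tanh(v) for v in best]

# Hypothetical toy dimensions and random parameters
random.seed(0)
d, k, cl_u, N = 4, 3, 5, 6
q_emb = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
W1 = [[random.uniform(-0.5, 0.5) for _ in range(d * k)] for _ in range(cl_u)]
b1 = [0.0] * cl_u
r_conv = conv_representation(q_emb, W1, b1, k)
```

Because tanh is monotonic, applying it after the max gives the same result as applying it to every window and then taking the max, while requiring fewer evaluations.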

Question Pair Scoring
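After the bag-of-words and convolutional-based representations are generated for the input pair $(q_1, q_2)$, the partial scores are computed as the cosine similarity between the respective vectors:
$$s_{bow}(q_1, q_2) = \frac{r^{bow}_{q_1} \cdot r^{bow}_{q_2}}{\|r^{bow}_{q_1}\| \, \|r^{bow}_{q_2}\|}, \qquad s_{conv}(q_1, q_2) = \frac{r^{conv}_{q_1} \cdot r^{conv}_{q_2}}{\|r^{conv}_{q_1}\| \, \|r^{conv}_{q_2}\|}$$
The final score for the input questions $(q_1, q_2)$ is given by the following linear combination:
$$s(q_1, q_2) = \beta_1 \, s_{bow}(q_1, q_2) + \beta_2 \, s_{conv}(q_1, q_2)$$
where $\beta_1$ and $\beta_2$ are parameters to be learned.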
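The scoring step can be sketched as follows. The function names are hypothetical, the input vectors are toy values, and returning 0 for a zero-norm vector is an assumption made to keep the sketch well defined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0  # zero-vector handling is an assumption

def bow_cnn_score(r_bow_1, r_bow_2, r_conv_1, r_conv_2, beta1, beta2):
    """Linear combination of the two partial cosine scores."""
    s_bow = cosine(r_bow_1, r_bow_2)
    s_conv = cosine(r_conv_1, r_conv_2)
    return beta1 * s_bow + beta2 * s_conv

# Toy vectors: identical BOW vectors, opposite CNN vectors
score = bow_cnn_score([1.0, 0.0, 2.0], [1.0, 0.0, 2.0],
                      [0.5, -0.5], [-0.5, 0.5],
                      beta1=1.0, beta2=1.0)
```

With equal weights, the two partial scores in this toy case cancel out; in BOW-CNN the weights are learned, so the network can decide how much to trust each path.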

Training Procedure
Our network is trained by minimizing a ranking loss function over the training set $D$. The input in each round is two pairs of questions $(q_1, q_2)^+$ and $(q_1, q_x)^-$, where the questions in the first pair are semantically equivalent (positive example), and the ones in the second pair are not (negative example). Let $\Delta$ be the difference of their similarity scores, $\Delta = s_\theta(q_1, q_2) - s_\theta(q_1, q_x)$, generated by the network with parameter set $\theta$. The loss is the logistic function of $\Delta$:
$$L(\Delta, \theta) = \log(1 + \exp(-\gamma \Delta))$$
where $\gamma$ is a scaling factor that magnifies $\Delta$ from $[-2, 2]$ (in the case of using cosine similarity) to a larger range. This helps to penalize prediction errors more heavily. Following (Yih et al., 2011), in our experiments we set $\gamma$ to 10.
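To make the role of $\gamma$ concrete, here is a minimal sketch of a pairwise ranking loss, assuming the logistic form $\log(1 + \exp(-\gamma \Delta))$ used by Yih et al. (2011). The function name `ranking_loss` and the numerically stable formulation are choices made for this example.

```python
import math

def ranking_loss(s_pos, s_neg, gamma=10.0):
    """Logistic ranking loss log(1 + exp(-gamma * delta)).

    s_pos: score of the positive pair (q1, q2)
    s_neg: score of the negative pair (q1, qx)
    gamma: scaling factor (10 in the experiments described above)
    """
    delta = s_pos - s_neg
    x = -gamma * delta
    # Numerically stable evaluation of log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

good = ranking_loss(0.9, 0.1)   # positive pair scored well above the negative
bad = ranking_loss(0.1, 0.9)    # negative pair scored above the positive
tie = ranking_loss(0.5, 0.5)    # equal scores
```

The loss is small when the positive pair is ranked well above the negative one and grows roughly linearly in $\gamma |\Delta|$ when the order is wrong.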
Sampling informative negative examples can have a significant impact on the effectiveness of the learned model. In our experiments, before training, we create 20 pairs of negative examples for each positive pair $(q_1, q_2)^+$. To create a negative example we (1) randomly sample a question $q_x$ that is not semantically equivalent to $q_1$ or $q_2$; (2) then create negative pairs $(q_1, q_x)^-$ and $(q_2, q_x)^-$. During training, at each iteration we only use the negative example $x$ that produces the smallest difference $s_\theta(q_1, q_2)^+ - s_\theta(q_1, q_x)^-$. Using this strategy, we select more representative negative examples.
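The selection of the negative example with the smallest score difference can be sketched as below. The scoring function `toy_score` (a word-overlap ratio) is a hypothetical stand-in for the network score $s_\theta$, and the candidate questions are invented for illustration; the sketch covers only the pairs built with $q_1$.

```python
def hardest_negative(score, q1, q2, candidates):
    """Return the candidate qx minimizing score(q1, q2) - score(q1, qx)."""
    s_pos = score(q1, q2)
    return min(candidates, key=lambda qx: s_pos - score(q1, qx))

def toy_score(a, b):
    """Hypothetical stand-in for the network score: word-set overlap ratio."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

q1 = "how to burn iso to cd"
q2 = "copy iso file to cd-r"
# Pre-sampled negatives (the experiments above use 20 per positive pair)
candidates = [
    "install nvidia driver",
    "burn iso image to usb",
    "change desktop wallpaper",
]
chosen = hardest_negative(toy_score, q1, q2, candidates)
```

Minimizing the difference is equivalent to picking the negative that the current model scores highest against $q_1$, that is, the one it is most likely to confuse with a true duplicate.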
We use stochastic gradient descent (SGD) to minimize the loss function with respect to $\theta$. The backpropagation algorithm is used to compute the gradients of the network. In our experiments, the BOW-CNN architecture is implemented using Theano (Bergstra et al., 2010).

Data
A well-structured source of semantically equivalent questions is the Stack Exchange site. It is composed of multiple Q&A communities, in which users can ask and answer questions, and vote both questions and answers up or down. Questions are composed of a title and a body. Moderators can mark questions as duplicates, and a question can eventually have multiple duplicates.
For this evaluation, we chose two highly accessed Q&A communities: Ask Ubuntu and English. They differ in terms of content and size. Whereas Ask Ubuntu has 29510 duplicated questions, English has 6621. We performed experiments using only the title of the questions as well as title + body, which we call all for the rest of this section. The average size of a title is very small (about 10 words), which is at least 10 times smaller than the average size of all for both datasets. The data was tokenized using the tokenizer available with the Stanford POS Tagger (Toutanova et al., 2003), and all links were replaced by a unique string. For Ask Ubuntu, we did not consider the content inside the code tag, which contains some specific Linux commands or programming code.
For each community, we created training, vali-

Baselines and Neural Network Setup
In order to verify the impact of jointly using BOW and CNN representations, we perform experiments with two NN architectures: the BOW-CNN and the CNN alone, which consists of using only the CNN path of BOW-CNN and, consequently, computing the score for a pair of questions using $s(q_1, q_2) = s_{conv}(q_1, q_2)$. Additionally, we compare BOW-CNN with six well-established IR algorithms available in the Lucene package (Hatcher et al., 2004). Here we provide a brief overview of them. For further details, we refer the reader to the citation associated with each algorithm.
• TFIDF (Manning et al., 2008) uses the traditional Vector Space Model to represent documents as vectors in a high-dimensional space. Each position in the vector represents a word, and the weights of the words are calculated using TFIDF.
• BM25 (Robertson and Walker, 1994) is a probabilistic weighting method that takes into consideration term frequency, inverse document frequency and document length. It has two free parameters: k1 to tune term-frequency saturation; and b to calibrate the document-length normalization.
• IB (Clinchant and Gaussier, 2010) uses information-based models to capture the importance of a term by measuring how much its behavior in a document deviates from its behavior in the whole collection.
• DFR (Amati and Van Rijsbergen, 2002) is based on the divergence from randomness framework. The relevance of a term is measured by the divergence between its actual distribution and the distribution from a random process.
• LMDirichlet and LMJelinekMercer apply probabilistic language model approaches for retrieval (Zhai and Lafferty, 2004). They differ in the smoothing method: LMDirichlet uses Dirichlet priors and LMJelinekMercer uses the Jelinek-Mercer method.
The word embeddings used in our experiments are initialized by means of unsupervised pre-training. We perform pre-training using the skip-gram NN architecture (Mikolov et al., 2013) available in the word2vec tool. We use the English Wikipedia to train word embeddings for experiments with the English dataset. For the Ask Ubuntu dataset, we use all available Ask Ubuntu community data to train word embeddings.
The hyper-parameters of the neural networks and the baselines are tuned using the development sets. In Table 2, we show the selected hyper-parameter values. In our experiments, we initialize each element $[t]_i$ of the bag-of-words weight vector $t$ with the IDF of the $i$-th word $w_i$ computed over the respective set of questions $Q$ as follows:
$$[t]_i = IDF(w_i, Q) = \log \frac{|Q|}{|\{q \in Q : w_i \in q\}|}$$
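A minimal sketch of this IDF initialization is given below, assuming the common definition in which the IDF of a word is the logarithm of the number of questions divided by the number of questions containing the word. The function name `idf_init`, the toy question set, and the weight of 0 for words that never occur are assumptions made for this example.

```python
import math

def idf_init(questions, vocab):
    """Initialize the BOW weight vector t with IDF values.

    questions: list of tokenized questions (lists of words), the set Q
    vocab:     dict mapping word -> index
    """
    n_questions = len(questions)
    doc_freq = [0] * len(vocab)
    for tokens in questions:
        for word in set(tokens):  # count each word once per question
            if word in vocab:
                doc_freq[vocab[word]] += 1
    # Words that never occur get weight 0.0 (an assumption for this sketch)
    return [math.log(n_questions / c) if c else 0.0 for c in doc_freq]

# Hypothetical toy question set
questions = [["burn", "iso"], ["mount", "iso"], ["burn", "cd"], ["iso", "file"]]
vocab = {"burn": 0, "iso": 1, "cd": 2, "usb": 3}
t = idf_init(questions, vocab)
```

Frequent words such as "iso" receive a small initial weight and rare words such as "cd" a large one; training can then move the weights away from these starting values.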

Experimental Results
Comparison with Baselines. In Tables 3 and 4, we present the question retrieval performance (Accuracy@k) of different algorithms over the Ask Ubuntu and English datasets for the title and all settings, respectively. For both datasets, BOW-CNN outperforms the six IR algorithms for both title and all settings. For Ask Ubuntu all, the Accuracy@1 of BOW-CNN is four absolute points higher than that of the best IR baseline (LMJ), which represents an improvement of 21.9%. Since the BOW representation we use is closely related to TFIDF, an important comparison is the performance of BOW-CNN vs. TFIDF. In Tables 3 and 4, we can see that BOW-CNN consistently outperforms the TFIDF model in the two datasets for both title and all. These findings suggest that BOW-CNN is indeed combining the strong semantic representation power of the convolutional-based representation with the BOW representation to construct a more effective model.
Another interesting finding is that CNN outperforms BOW-CNN for short texts (Table 3) and, conversely, BOW-CNN outperforms CNN for long texts (Table 4). This demonstrates that, when dealing with large input texts, BOW-CNN is an effective approach to combine the strengths of convolutional-based representation and BOW.
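For readers unfamiliar with the Accuracy@k metric reported in Tables 3 and 4, the following sketch shows one common way to compute it. It assumes that a query counts as correct when at least one of its known duplicates appears among the top k retrieved questions; this definition, the function name `accuracy_at_k`, and the toy identifiers are assumptions, since the text above does not spell out the metric.

```python
def accuracy_at_k(rankings, duplicates, k):
    """Fraction of queries with at least one duplicate in the top-k results.

    rankings:   dict query_id -> list of retrieved question ids, best first
    duplicates: dict query_id -> set of ids marked as duplicates of the query
    """
    hits = 0
    for query_id, ranked_ids in rankings.items():
        if any(doc_id in duplicates[query_id] for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)

# Hypothetical toy rankings for two queries
rankings = {"qa": ["d3", "d1", "d7"], "qb": ["d2", "d9", "d4"]}
duplicates = {"qa": {"d1"}, "qb": {"d5"}}
acc_at_1 = accuracy_at_k(rankings, duplicates, 1)
acc_at_3 = accuracy_at_k(rankings, duplicates, 3)
```

In this toy case neither query has a duplicate at rank 1, while one of the two has a duplicate within the top 3.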
Impact of Initialization of BOW Weights. In the BOW-CNN experiments whose results are presented in Tables 3 and 4, we initialize the elements of the BOW weight vector $t$ with the IDF of each word in $V$ computed over the question set $Q$. In this section, we show some experimental results that indicate the contribution of this initialization.
In Table 5, we present the performance of BOW-CNN for the English dataset when different configurations of the BOW weight vector $t$ are used. The first column of Table 5 indicates the type of initialization, where ones means that $t$ is initialized with the value 1 (one) in all positions. The second column indicates whether $t$ is allowed to be updated by the network (Yes) or not (No). The numbers suggest that leaving the BOW weights free to be updated by the network produces better results than fixing them to IDF values. In addition, using IDF to initialize the BOW weight vector is better than using the same weight (ones) to initialize it. This is expected, since we are injecting prior knowledge known to be helpful in IR tasks.

Conclusions
In this paper, we propose a hybrid neural network architecture, BOW-CNN, that combines bag-of-words with distributed vector representations created by a CNN, to retrieve semantically equivalent questions. Our experimental evaluation showed that: our approach outperforms traditional BOW-based approaches; for short texts, a pure CNN obtains the best results, whereas for long texts, BOW-CNN is more effective; and initializing the BOW weight vector with IDF values is beneficial.