Learning Semantic Relatedness in Community Question Answering Using Neural Models

Community Question Answering forums, such as Quora and Stackoverflow, contain millions of questions and answers. Automatically finding the relevant questions among the existing questions, and finding the relevant answers to a new question, are both Natural Language Processing tasks. In this paper, we aim to address these tasks, which we refer to as Question Retrieval and Answer Selection. We present a neural-based model with stacked bidirectional LSTMs and an MLP to address these tasks. The model generates vector representations of the question-question or question-answer pairs and computes their semantic similarity scores, which are then employed to rank and predict relevancies. Extensive experiments demonstrate that our results outperform the baselines.


Introduction
Community Question Answering (cQA) websites such as Quora and Stackoverflow are rapidly expanding. Managing such platforms has become increasingly difficult because of the exponential growth in content, triggered by wider access to the internet. Traditionally, websites kept a list of frequently asked questions (FAQ) that they expected visitors to consult before asking a question. Now, with a wider range of questions being asked, a need has emerged for a better and more scalable system to automatically identify similarities between any two questions on the platform. In addition, with many users contributing to a single question, it has become harder to identify which answers are more relevant than others. We summarize these two problems as follows:
• Question Retrieval: given a new question and a list of questions, we automatically rank the questions in the list according to their relevance to the new question.
• Answer Selection: given a cQA thread containing a question and a list of answers, we automatically rank the answers according to their relevance to the question.
The increase in the number of community-based Q&A platforms has led to a rapid build-up of large archives of user-generated questions and answers. When a new question is asked on the platform, the system searches for semantically similar questions in the archives. If a similar question is found, the corresponding correct answer is retrieved and returned immediately to the user as the final answer. The quality of the answer therefore depends on the effectiveness of the question-similarity calculation. However, measuring semantic relatedness between questions and answers is not trivial: similar questions or relevant answers sometimes use very different wording. For instance, the two questions "Is downloading movies illegal?" and "Can I share a copy of a DVD online" have an almost identical meaning but are lexically very different. Traditional text-based similarity metrics for measuring sentence distance, such as the Jaccard coefficient and the overlap coefficient (Manning and Schütze, 1999), perform poorly in such cases.

In this paper, we present a neural-based model combining stacked Bidirectional Long Short-Term Memory (BLSTM) networks and a Multi-Layer Perceptron (MLP) to address the question retrieval and answer selection problems. The model computes representations of the Q&As and then their semantic similarity scores. These scores are subsequently employed to rank the list of existing questions and answers with respect to the given question. We evaluate our model on a public benchmark cQA dataset (Nakov et al., 2016), and show that our model outperforms the baselines.
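As a toy illustration of this lexical gap (not part of the original experiments), the Jaccard coefficient of the two example questions above is exactly zero:

```python
def jaccard(a, b):
    """Jaccard coefficient over lowercased word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

q1 = "Is downloading movies illegal?"
q2 = "Can I share a copy of a DVD online"
print(jaccard(q1, q2))  # 0.0: no shared tokens, despite near-identical intent
```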
Related Work

Question Retrieval

As explained in the Introduction, two questions that are worded very differently can be similar in meaning. Three types of approaches have been developed in the literature to solve this word-mismatch problem among similar questions. The first type of approach uses knowledge databases such as dictionaries. For example, the Frequently Asked Question (FAQ) Finder (Burke et al., 1997) heuristically combined statistical similarities computed using conventional vector space models with semantic similarities between questions estimated using WordNet (Fellbaum, 1998) to rank FAQs. Song et al. (2007) presented an approach based on a linear combination of statistical similarity, calculated from word co-occurrence, and semantic similarity, calculated using WordNet and a bipartite mapping. Auto-FAQ (Whitehead, 1995) applied shallow language understanding to automatic FAQ answering, where the matching of a question to FAQs is based on keyword comparison enhanced by limited language processing techniques. However, based on the results of previous experiments, the quality and structure of current knowledge databases are not good enough for reliable performance.
The second type of approach employs manual rules or templates. These methods are expensive and hard to scale to large collections. Sneiders (2002) proposed template-based FAQ retrieval systems, while Kim and Seo (2006) proposed using user click logs to find similar queries. Lai et al. (2002) proposed an approach to automatically mine FAQs from the web, but they did not study the use of these FAQs after they were collected. Berger et al. (2000) proposed a statistical lexicon correlation method. These previous approaches were tested on relatively small collections and are hard to scale because they are based on specific knowledge databases or handcrafted rules.
The third type of approach uses statistical techniques developed in information retrieval and natural language processing (Berger et al., 2000). Jeon et al. (2005) presented question retrieval methods that use the similarity between answers in the archive to estimate probabilities for a translation-based retrieval model. They ran IBM Model 1 (Brown et al., 1993) to learn word translation probabilities on a collection of question pairs. Given a new question, a translation-based information retrieval model exploits the word relationships to retrieve similar questions from Q&A archives. They showed that with this model it is possible to find semantically similar questions with relatively little word overlap.
Recently, many advanced models have been developed for automating answer selection based on syntactic structures (Severyn and Moschitti, 2012; Severyn and Moschitti, 2013; Grundström and Nugues, 2014) and textual entailment. These models include a quasi-synchronous grammar to learn syntactic transformations from the question to the candidate answers (Wang et al., 2007); continuous word and phrase vectors to encode semantic similarity (Belinkov et al., 2015); Tree Edit Distance (TED) to learn tree transformations in pairs (Heilman and Smith, 2010); a probabilistic model to learn tree-edit operations on dependency parse trees (Wang and Manning, 2010); and linear-chain CRFs with features derived from TED to automatically learn associations between questions and candidate answers (Yao et al., 2013).
In addition to the usual local features that only look at the question-answer pair, automatic answer selection algorithms can rely on global thread-level features, such as the position of the answer in the thread (Hou et al., 2015), the context of an answer in a thread (Nicosia et al., 2015), or dependencies between thread answers captured with structured prediction models. Joty et al. (2015) modeled the relations between pairs of answers at any distance in the thread, which they combined in graph-cut and Integer Linear Programming (ILP) frameworks. They then proposed a fully connected pairwise CRF (FCCRF) with global normalization and an Ising-like edge potential.

Neural Networks
Neural approaches have wide applications, including speech recognition (Graves and Jaitly, 2014), language modeling (Mikolov et al., 2010; Mikolov et al., 2011; Sutskever et al., 2011), translation (Liu et al., 2014; Sutskever et al., 2014; Auli et al., 2013), and image captioning (Karpathy and Fei-Fei, 2015). In addition, recent work shows the effectiveness of neural models for answer selection (Severyn and Moschitti, 2015b; Feng et al., 2015) and question similarity (dos Santos et al., 2015) in community question answering. Dos Santos et al. (2015) developed CNN and bag-of-words (BOW) representation models for the question similarity task. The cosine similarity between the representations of the input questions was used to compute the CNN and BOW similarity scores for the question-question pairs. The convolutional representations, in conjunction with other vectors, are then passed to an MLP to compute the similarity score of the question pair. Furthermore, recent research has shown the effectiveness of CNNs for answer ranking of short textual contents (Severyn and Moschitti, 2015b).
In this paper, we present a neural model based on stacked bidirectional LSTMs and an MLP to capture the long-range dependencies in longer questions and answers.

Method
In this paper, we present a neural-based model using stacked bidirectional LSTMs and an MLP to address the question retrieval and answer selection problems. We first briefly explain recurrent neural networks (RNNs), Long Short-Term Memory (LSTM) networks, and their bidirectional variants. Then, we present the stacked bidirectional LSTMs for capturing the semantic similarity of questions and answers in cQA.

Recurrent Neural Networks:
A recurrent neural network (RNN) has the form of a chain of repeating modules of neural network. This architecture is well suited to learning sequences of information because it allows information to persist across states: the output of each step is fed as input to the following step through hidden states that capture information about the preceding sequence.
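As a minimal sketch (not from the paper), a single step of such a vanilla RNN can be written as follows, where W, U, and b are assumed learned parameters:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One step of a vanilla RNN: the new hidden state mixes the current
    input with the previous hidden state, so information persists across steps."""
    return np.tanh(W @ x_t + U @ h_prev + b)
```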
RNNs are trained using backpropagation through time (BPTT), where the gradient at each output depends on the current and previous time steps. The BPTT approach is not effective at learning long-term dependencies because of the vanishing and exploding gradient problems (Pascanu et al., 2012; Bengio et al., 1994). A particular type of RNN, the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), has been designed to improve the learning of long-term dependencies.
Long Short-Term Memory Recurrent Neural Networks: Similar to standard recurrent neural networks, Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have a chain-like architecture, but with a different module structure. Instead of having a single neural network layer, each module has four layers serving different purposes. Each LSTM unit contains a memory cell with self-connections, as well as three multiplicative gates (forget, input, and output) to control information flow. Each gate is composed of a sigmoid neural net layer and a pointwise multiplication operation.
Given an input vector $x_t$, the previous hidden output $h_{t-1}$, and the previous cell state $c_{t-1}$, the LSTM unit performs the following standard operations:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

where $f_t$, $i_t$, $o_t$, and $h_t$ respectively represent the forget gate, input gate, output gate, and hidden output; $\sigma$ is the sigmoid function; and $\odot$ denotes element-wise multiplication.
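A direct NumPy transcription of these equations, as a sketch (the dictionary-of-gates parameterization is our own convention, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following the equations above.
    W, U, b are dicts of parameters keyed by gate name: 'f', 'i', 'o', 'c'."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    h_t = o_t * np.tanh(c_t)                                    # new hidden output
    return h_t, c_t
```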

Bidirectional Recurrent Neural Networks:
Bidirectional RNNs (BRNNs) (Schuster and Paliwal, 1997) use both past and future context to predict or label each element of a sequence. This is done by combining the outputs of two RNNs, one processing the sequence forward (left to right), the other processing it backwards (right to left), as shown in Figure 1. This technique has proved especially useful when combined with LSTMs (Graves and Schmidhuber, 2005).
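A short PyTorch illustration of this forward/backward combination (the sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

# A bidirectional LSTM over a toy batch. The output at each position
# concatenates the forward (left-to-right) and backward (right-to-left)
# hidden states, so every position sees both past and future context.
blstm = nn.LSTM(input_size=100, hidden_size=128, bidirectional=True, batch_first=True)
x = torch.randn(16, 200, 100)   # (batch, sequence length, embedding dimension)
out, _ = blstm(x)               # out: (16, 200, 256) = forward and backward halves
```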

Stacked Bidirectional LSTMs for cQA
Given a question, we aim to rank a list of questions for question retrieval and a list of answers for answer selection. To address these ranking problems, we propose a neural model to compute the semantic similarities of question-question $(q, q')$ or question-answer $(q, a)$ pairs. These scores are then employed to rank the list of questions and answers with respect to the given question $q$. Figure 2 shows the general architecture of our model. We explain our model by referring to the pair $(q, a)$, but the same applies to the pair $(q, q')$. The question $q$ and the answer $a$ contain the lists of words $q = (w^q_1, \dots, w^q_{n/2})$ and $a = (w^a_1, \dots, w^a_{n/2})$, where $w^q_i$ and $w^a_i$ are the $i$-th words of $q$ and $a$, respectively.
First, $q$ and $a$ are truncated to a similar length (a maximum of 100 words each; questions and answers with fewer than 100 words are padded with zeros), and two lists of randomly initialized vectors corresponding to their words are generated:

$V_q = (X_1, \dots, X_{n/2})$, $V_a = (X_{n/2+1}, \dots, X_n)$,

where $X_i$ with $i \in [1, n/2]$ is the vector of $w^q_i$, $X_i$ with $i \in [n/2+1, n]$ is the vector of $w^a_{i-n/2}$, and $n = 200$. The word vectors for $q$ (i.e., $V_q$) are passed to the model as shown in Figure 2. The model computes the representation of the question $q$ after processing its last word vector. Then the representation of $q$, along with the word vectors of the answer $a$ (i.e., $V_a$), is passed to the model. The model generates the representation of the given pair $(q, a)$ after processing the last word vector of the answer $a$, influenced by the representation of $q$. This information processing is performed at the forward layer (left to right) of the first bidirectional LSTM shown in the figure. Similar processing in the reverse direction (right to left) is then applied to the given pair at the first bidirectional LSTM. The output vectors of the hidden layers for the two directions of the first bidirectional LSTM are concatenated and input into the second bidirectional LSTM, as shown in Figure 2.
While the second bidirectional LSTM processes the input vectors similarly to the first one, its output vectors from the two directions are summed instead of concatenated. Finally, the resulting vectors can be augmented with additional features and passed to the MLP, which has two hidden layers, in order to compute the semantic similarity score of $q$ and $a$.
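The following PyTorch sketch is one plausible reading of this architecture; the class name `StackedBLSTMSimilarity`, the layer sizes, and the final sigmoid scoring layer are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class StackedBLSTMSimilarity(nn.Module):
    """Sketch of the model in Figure 2: two stacked bidirectional LSTMs
    followed by an MLP with two hidden layers. The first BLSTM's two
    directions are concatenated; the second BLSTM's directions are summed."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, aug_dim=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.blstm1 = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(2 * hidden_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(                     # two hidden layers, then a score
            nn.Linear(hidden_dim + aug_dim, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, pair_ids, aug_feats):
        # pair_ids: (batch, n) word ids of the truncated q followed by a (n = 200)
        x = self.embed(pair_ids)
        h1, _ = self.blstm1(x)                        # directions concatenated by PyTorch
        h2, _ = self.blstm2(h1)                       # (batch, n, 2 * hidden_dim)
        half = h2.size(-1) // 2
        summed = (h2[..., :half] + h2[..., half:])[:, -1, :]  # sum directions, last step
        return self.mlp(torch.cat([summed, aug_feats], dim=-1)).squeeze(-1)
```

A score near 1 would indicate a semantically related pair under this reading.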

Results and Discussion
Hyper-parameters: Table 1 shows the hyper-parameters used in our model. Their values are optimized with respect to the results on the development set. The word vectors are randomly initialized and updated during training, as explained in the Method section, and the weights of the two bidirectional LSTMs are not shared. We employ Adam (Kingma and Ba, 2014) as the optimization method and mean squared error as the loss function. We further use the values 0.001, 0.5, and 16 for the learning rate, dropout rate, and batch size, respectively.
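A hypothetical training loop wiring these settings together (the `train_loader` and the relevance-label encoding are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

# Uses the StackedBLSTMSimilarity sketch above; vocab size is illustrative.
model = StackedBLSTMSimilarity(vocab_size=50_000)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001
criterion = nn.MSELoss()                                    # mean squared error loss

for pair_ids, aug_feats, labels in train_loader:            # batches of size 16
    optimizer.zero_grad()
    scores = model(pair_ids, aug_feats)
    loss = criterion(scores, labels.float())                # labels in [0, 1]
    loss.backward()
    optimizer.step()
```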
Dataset: We evaluate our model on the cQA data (Nakov et al., 2016), in which the questions and answers have been manually labeled by a community of annotators on a crowdsourcing platform. Table 2 shows the statistics for the train, development, and test data. In the question retrieval task, the related questions are labeled as Perfect-Match, Relevant, and Irrelevant with respect to an original question; the Irrelevant questions should be ranked lower than the other questions by the model. In addition, in the answer selection task, the answers are labeled as Good, Bad, and Potentially-Useful with respect to a question. The expected result is that both Good and Potentially-Useful answers have useful information, while the Good answers should be ranked higher than both Potentially-Useful and Bad answers.

Table 3: Features used in the BOV baseline.
Text-based features: Longest Common Substring; Longest Common Subsequence; Greedy String Tiling; Monge Elkan Second String; Jaro Second String; Jaccard coefficient; Containment similarity.
Vector-based features: normalized averaged word vectors using word2vec (Mikolov et al., 2013); most similar sentence pair for a given (q, a) using sentence vector representations; most similar chunk pair for a given (q, a) using chunk vector representations.
Metadata-based features: user information, such as user id.

Baselines:
We compare our neural model with the BOV, BM25, IR, and TF-IDF baselines, briefly explained below (a toy TF-IDF ranking sketch follows the list):
• Bag-of-Vectors (BOV): This baseline employs various text- and vector-based features for the cQA problems (Belinkov et al., 2015). We highlight some of those features in Table 3.
• BM25: We use the BM25 similarity measure trained on the cQA raw data provided by Màrquez et al. (2015).
• IR: For the question retrieval task, this is the order of the related questions provided by the search engine; for the answer selection task, it is the chronological ranking, in which answers are ordered by their time of posting.
• TF-IDF: This is computed using the cQA raw data provided by Màrquez et al. (2015); the ranking is defined by the cosine similarity of the TF-IDF vectors of the questions and answers.
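For concreteness, a minimal sketch of such a TF-IDF ranking baseline (the `tfidf_rank` helper and its `corpus` argument are our own illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_rank(question, candidates, corpus):
    """Rank `candidates` by TF-IDF cosine similarity to `question`.
    `corpus` stands in for the cQA raw data the vectorizer is fit on."""
    vectorizer = TfidfVectorizer().fit(corpus)
    q_vec = vectorizer.transform([question])
    c_vecs = vectorizer.transform(candidates)
    sims = cosine_similarity(q_vec, c_vecs).ravel()
    return sorted(zip(candidates, sims), key=lambda pair: -pair[1])
```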
We evaluate our models using the F1-score for a global assessment, in addition to the following ranking metrics: Mean Average Precision (MAP), Average Recall (AveRec), and Mean Reciprocal Rank (MRR). For the MAP, we use the average of MAP@1 to MAP@10.

Performance for Answer Selection: The results of the answer selection task on the development and test data are shown in Tables 4a and 4b, respectively. In the tables, the first four rows show the baseline results, and the following rows show the neural model results. The "Single LSTM −F_aug" row shows the results of the model presented by Mohtarami et al. (2016) when only one LSTM is used instead of two bidirectional LSTMs, and no augmented features F_aug are used. The "Single BLSTM −F_aug" row indicates the results when one bidirectional LSTM is used in our model, and no augmented features F_aug are used. As can be seen in Tables 4a and 4b, using a BLSTM improves the performance compared to the single LSTM. The "Single BLSTM" row shows the results for one bidirectional LSTM using F_aug. F_aug is a binary vector of length 10 that encodes the order of the answers in their threads, corresponding to their time of posting. F_aug helps improve the performance, as can be seen by comparing these results with the ones obtained using a single BLSTM without F_aug. The "Double BLSTM" row shows the results generated by the complete model illustrated in Figure 2. For the development set, represented in Table 4a, the highest results over all the evaluation metrics are obtained using the neural models; the "Double BLSTM" achieves the highest performance on the ranking metrics. In addition, the results on the test set, shown in Table 4b, indicate that while the MAPs of the "Double BLSTM" and the BOV baseline are comparable, the "Double BLSTM" achieves the highest performance on the other metrics, especially F1.
Performance for Question Retrieval: The results of the question retrieval task on the development and test data are shown in Tables 5a and 5b, respectively. In the tables, the first four rows show the baseline results, and the following rows show the neural model results. The neural models are the ones described in the previous section. In this task, we employ the order of the related questions provided by the search engine as the augmented features F_aug, as explained under the IR baseline above. As shown in the tables, the neural models using F_aug outperform the models without F_aug on both the development and test data. For the development set, shown in Table 5a, the "Double BLSTM" model achieves the highest performance over the evaluation metrics. For the test set, shown in Table 5b, the result of the "Single BLSTM" model is comparable with IR and TF-IDF on the ranking metrics, while the highest F1 is obtained by the BOV baseline. There are several points to highlight regarding the performance of the neural models compared to the baselines. First, the size of the data for this task is small, which makes it harder to train our neural models. Second, the baselines have access to external resources; for example, IR had access to the click logs of the users, and TF-IDF is trained on a large cQA raw dataset (Màrquez et al., 2015). Finally, the number of out-of-vocabulary (OOV) words in the test data is higher than in the development data, and the OOV word vectors are randomly initialized and do not get updated during the training phase. This results in a smaller improvement on the test data.

Visualization
To gain better intuition about our neural model, we consider the complete model with two bidirectional LSTMs, as illustrated in Figure 2, and visualize the outputs of the hidden layers of each bidirectional LSTM. The visualized outputs correspond to the cosine similarities between the hidden representations of words in question-question or question-answer pairs. Figure 3 shows the heatmaps of the first bidirectional LSTM (top) and the second bidirectional LSTM (bottom) for the question retrieval task with the following two questions:
• q1: Which is the best Pakistani school for children in Qatar ? Which is the best Pakistani school for children in Qatar ?
• q2: Which Indian school is better for the kids ? I wish to admit my kid to one of the Indian schools in Qatar Which is better DPS or Santhinekethan ? please post your comments

The areas of high similarity are highlighted by the red squares in Figure 3. While both bidirectional LSTMs correctly predict that the questions are similar, the heatmaps show that the second bidirectional LSTM performs better than the first one, and that the areas of similarity (delimited by the red rectangles) are much better defined by the second bidirectional LSTM. For example, the first bidirectional LSTM identifies similarities between the part "for children in qatar ? Which is the" of question q1 and the parts "is better for the kids ?" and "is better DPS or Santhinekethan ? please post" of question q2. The second bidirectional LSTM accurately updates those parts of question q2 to "for the kids ? I wish to admit my" and "Qatar which is better DPS or Santhinekethan", respectively. This shows that the second bidirectional LSTM assigns smaller values to unimportant words (e.g., "please post") while highlighting important words (e.g., "admit").

Figure 4 shows the heatmaps of the first bidirectional LSTM (top) and the second bidirectional LSTM (bottom) for another example of the question retrieval task with the following two questions:
• q3: New car price guide. Can anyone tell me prices of new German cars in Qatar and deals available
• q4: Reliable and honest garages in Doha. Can anyone recommend a reliable garage that is also low priced ? I have been around the industrial area but it is hard to know who is reliable and who is not. The best way is if I hear from the experience of the qatarliving members. I am looking to do some work on my land cruiser

As shown in the figure, the areas highlighted in dark blue for the first bidirectional LSTM are much larger than for the second bidirectional LSTM. These results show that the first bidirectional LSTM incorrectly predicts that the questions q3 and q4 are similar, while the second bidirectional LSTM correctly predicts that the questions are dissimilar.
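One plausible way to produce such heatmaps from the per-word hidden states (a reconstruction sketch, not the authors' released code):

```python
import numpy as np
import matplotlib.pyplot as plt

def similarity_heatmap(h_q, h_q2, words_q, words_q2, title):
    """Plot pairwise cosine similarities between the per-word hidden
    outputs of one BLSTM layer for two questions.

    h_q:  (len_q, d) hidden states for the first question
    h_q2: (len_q2, d) hidden states for the second question
    """
    norm_q = h_q / np.linalg.norm(h_q, axis=1, keepdims=True)
    norm_q2 = h_q2 / np.linalg.norm(h_q2, axis=1, keepdims=True)
    sims = norm_q @ norm_q2.T            # (len_q, len_q2) cosine matrix
    plt.imshow(sims, cmap="Blues")
    plt.xticks(range(len(words_q2)), words_q2, rotation=90)
    plt.yticks(range(len(words_q)), words_q)
    plt.title(title)
    plt.colorbar()
    plt.show()
```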

Conclusion
In this paper, we presented a neural-based model with stacked bidirectional LSTMs to generate vector representations of questions and answers and predict their semantic similarities. These similarity scores are then employed to rank the elements in a list of questions in the question retrieval task, and in a list of answers in the answer selection task, for a given question. The experimental results show that our model performs better than the baselines, even though the baselines use various text- and vector-based features and have access to external resources. We also demonstrated the impact of OOV words and of the size of the training data on the performance of the neural model.