Latent Space Embedding for Retrieval in Question-Answer Archives

Community-driven Question Answering (CQA) systems such as Yahoo! Answers have become valuable sources of reusable information. CQA retrieval enables usage of historical CQA archives to solve new questions posed by users. This task has received much recent attention, with methods building upon literature from translation models, topic models, and deep learning. In this paper, we devise a CQA retrieval technique, LASER-QA, that embeds question-answer pairs within a unified latent space preserving the local neighborhood structure of question and answer spaces. The idea is that such a space mirrors semantic similarity among questions as well as answers, thereby enabling high quality retrieval. Through an empirical analysis on various real-world QA datasets, we illustrate the improved effectiveness of LASER-QA over state-of-the-art methods.


Introduction
Community-based Question Answering (CQA) services such as Yahoo! Answers, Quora, Stack Overflow, and Baidu Zhidao have become a dependable source of knowledge for solving common user problems. These allow a user to post queries, such as how and why questions, that seek descriptive solutions and opinions as answers. Over time, these services build up a large archive of question-answer knowledge that may be leveraged to solve new user questions. The CQA retrieval problem, which has received much recent attention, addresses this opportunity. CQA retrieval methods focus on finding historical archived knowledge (questions, answers, or QA pairs) that is relevant to a newly posed user question. The central technical challenge that differentiates CQA retrieval from general-purpose IR tasks is the need to address the lexical gap (also known as the lexical chasm) in QA archives. The lexical chasm means that text fragments in questions (e.g., disk full) may lead to semantically correlated content in answers (e.g., format). This QA-correlation is different from semantic relatedness such as synonymy and antonymy; in the above example, the correlation arises because disk full issues often lead to solutions involving disk formatting. Explicit correlation modelling, using statistical translation models, has met with much success in CQA retrieval.
In this paper, we take a neighborhood-preserving learning approach and learn a unified representation for QA pairs in an abstract latent space. Consider two example CQA pairs from a technical support forum presented in Table 1; the intuitive causes listed alongside are external to the dataset. Though the questions are reasonably similar lexically, they pertain to very different issues, as illustrated by the wide disparity in the answers posted to them. We model QA pairs in a unified space that preserves the similarity neighborhood in both the question and answer spaces. In this example, the wide divergence in answer-space similarity neighborhoods between the two QAs would pull them apart, so they live in different parts of the latent space, reflecting the dissimilarity between their causes. Thus, our contribution in this paper is a neighborhood-preserving method for CQA retrieval, LASER-QA, expanding to LAtent-Space Embedding for Retrieval in QA archives.

Related Work
The three main CQA retrieval tasks target retrieving (a) related past questions (Zhou et al., 2015), (b) potentially usable past answers (Shtok et al., 2012), and (c) past question-answer pairs (Xue et al., 2008). Techniques for CQA retrieval typically use one of: (i) statistical translation models, (ii) topic models, and (iii) neural networks. A fourth class of techniques exploits metadata such as question categories and author data, or domain-specific syntactic information, and is not as applicable in the absence of such information.
In the interest of keeping this section focused on retrieval, we do not cover other tasks that have been addressed for CQA, such as QA-pair discovery (Deepak and Visweswariah, 2014), clustering (Deepak, 2016) and auxiliary IR tasks such as query suggestions (Deepak et al., 2013).

Translation Model based Techniques
Translation models (Brown et al., 1990) take parallel corpora, i.e., collections of document pairs expressing the same thing in different natural languages, and learn correlations between words/phrases; for example, p(f|e) quantifies the probability of an English word e being translated to a French word f in an English-French translation system. Though question-answer pairs do not semantically qualify as parallel corpora, the usage of translation models that treat them so (Xue et al., 2008) has led to retrieval accuracy improvements. Simplistically, a high value of p(format|disk) leads retrieval models to boost the score of an answer containing the word format in response to a user query involving a disk problem. Later methods have improved upon them through phrase-level and entity-level (Singh, 2012) modelling, as well as through unimportant word removal (Lee et al., 2008) and differential treatment of concepts (Park and Croft, 2015). Recent work has even explored using a different language (e.g., Chinese) to enrich questions.

Topic Model based Techniques
Topic models (Blei et al., 2003) have been used to retrieve topically similar questions, with usage of the solution side leading to further improvements (Ji et al., 2012). They have also been combined with language modeling, whereby the question and answer parts are modeled as having been generated from paired latent topics, but in separate "question and answer languages" (Zhang et al., 2014). We will use such paired topic modelling, called TBLM, as a baseline in our experimental study.

Topic+Translation Models
Hybrid methods build upon topic and translation models by interpolating the separate scorings. Due to the usage of a combination of multiple types of parameterized models, the results of such "pipeline methods" have been observed to be hard to reproduce (Qiu et al., 2013). We use a recent hybrid scoring method, called TopicTRLM (Zhou et al., 2015), as a baseline in our experimental study.

Deep Learning Methods
Neural networks such as DBNs (Wang et al., 2011; Hu et al., 2013) and more sophisticated neural pipelines (Shen et al., 2015) have been explored for CQA retrieval. A recent work (Nakov et al., 2016a) trains a neural network to discriminate between good and bad comments for a question. Using neural networks for retrieval within question datasets (not involving answers) has also been a subject of recent interest (e.g., (Bogdanova et al., 2015; Das et al., 2016)). The most recent method for generic QA-pair processing, which we will call AENN (Zhou et al., 2016), trains separate autoencoders for the question and answer corpora, and induces correlatedness of the intermediate representations in a fine-tuning step. In our empirical analysis, we will use the AENN approach from (Zhou et al., 2016) as a baseline.

Auxiliary-information based Methods
This category of methods exploits specific kinds of auxiliary information that are potentially available with CQA data. Techniques have considered the usage of question categories (Cao et al., 2009), the split between question title and description (Qiu et al., 2013), and assumptions about the question syntax (Duan et al., 2008). While such information is available in many systems, QA data from systems such as forums and chat-based customer support sometimes has very little information other than the QA pairs themselves. We target a general scenario where such metadata is not expected as a prerequisite, as in the case of most techniques from the other categories.

Problem Statement
Let $D = \{(q_1, a_1), \ldots, (q_n, a_n)\}$ be QA pairs from a CQA archive where answer $a_i$ is associated with question $q_i$; for cases involving multiple answers for a question, the question would be replicated for each answer. For a new question $q$, the CQA retrieval problem is about devising a scoring function $f(q, (q_i, a_i))$ that quantifies the relevance of each $(q_i, a_i)$ pair from $D$ to the new question $q$. Having devised a scoring function, retrieval is trivially accomplished by choosing an ordered set of top-t QA pairs from $D$ in accordance with their $f(\cdot, \cdot)$ scores.

Evaluation
In the datasets that we use, we have labels indicating which QAs are related/relevant to a particular question. Thus, the quality of the scoring function can be evaluated using traditional information retrieval metrics (Robertson and Zaragoza, 2007) such as Precision, MAP, MRR, and NDCG when measured against such labellings. In addition, we will use one more metric, namely Success Rate, the fraction of questions for which at least one related question is ranked among the top-t, in evaluation.
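The metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the experiments: it assumes binary relevance labels, the function names are hypothetical, and the average precision variant follows the description used later in this paper (the mean of the precisions at ranks where a relevant result appears), which differs from variants that normalize by the total number of relevant items.

```python
import math

def precision_at_t(ranked, relevant, t):
    """Fraction of the top-t results that are relevant (rank-agnostic)."""
    return sum(1 for item in ranked[:t] if item in relevant) / t

def success_at_t(ranked, relevant, t):
    """1.0 if at least one relevant item appears in the top-t, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:t]) else 0.0

def average_precision_at_t(ranked, relevant, t):
    """Mean of the precisions at the ranks where a relevant item appears."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked[:t], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank_at_t(ranked, relevant, t):
    """Inverse rank of the first relevant item in the top-t, or 0.0."""
    for rank, item in enumerate(ranked[:t], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_t(ranked, relevant, t):
    """Binary-gain NDCG with a log2(rank + 1) discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:t], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), t)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def mean_over_queries(metric, runs, t):
    """Average a per-query metric over runs, a list of (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant, t) for ranked, relevant in runs) / len(runs)
```

Averaging `success_at_t` with `mean_over_queries` gives the Success Rate; averaging `average_precision_at_t` and `reciprocal_rank_at_t` gives MAP and MRR respectively.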

LASER-QA: The Proposed Technique
Our method, LASER-QA, embeds the QA pairs in $D$ within a unified latent space of desired dimensionality $d$:
\[ (q_i, a_i) \mapsto u_i, \quad i = 1, \ldots, n, \]
where $u_i \in \mathbb{R}^d$ is the vector embedding of the $i$-th QA pair in the latent space $\mathbb{R}^d$. As we will illustrate, LASER-QA aims to preserve the local similarity structures of the question and answer spaces within the unified embedding. Having built the embedding of QA pairs, cosine similarity between vectors in $\mathbb{R}^d$ is used for scoring:
\[ f(q, (q_i, a_i)) = \frac{u^\top u_i}{\|u\| \, \|u_i\|}, \]
where $u \in \mathbb{R}^d$ is the embedding of the new question $q$; we will outline the embedding of single questions into $\mathbb{R}^d$ in a later section. Our motivation behind LASER-QA stems from the idea of Locally Linear Embedding (LLE) (Saul and Roweis, 2000); further, the choice of local neighborhood preservation is motivated by the pervasive usage of local neighbors (i.e., k-NN retrieval) in case-based reasoning frameworks (De Mantaras et al., 2005) that seek to reuse structured problem-solution data.
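As a rough illustration of the scoring step, the sketch below ranks archived QA pairs by cosine similarity between a query embedding and the columns of an embedding matrix. The function names and the d × n column layout are assumptions made for this example, and zero vectors are given a score of 0 as a defensive choice that the paper does not discuss.

```python
import numpy as np

def cosine_scores(u, U):
    """Cosine similarity between query embedding u (d,) and each column of U (d, n)."""
    denom = np.linalg.norm(u) * np.linalg.norm(U, axis=0)
    safe = np.where(denom > 0, denom, 1.0)  # avoid division by zero
    return np.where(denom > 0, (U.T @ u) / safe, 0.0)

def top_t(u, U, t):
    """Indices and scores of the t highest-scoring QA pairs, best first."""
    scores = cosine_scores(u, U)
    order = np.argsort(-scores, kind="stable")[:t]
    return [(int(i), float(scores[i])) for i in order]

# Toy example: three archived QA embeddings in a 2-dimensional latent space.
U = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
u = np.array([2.0, 0.0])
ranking = top_t(u, U, 2)
```

For collections of realistic size, the exhaustive scan over all columns would normally be replaced by an index lookup, as discussed in the scalability section.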

Data Representation
We use the tf-idf vector representation for each question (denoted as $x_i$) and each answer ($y_i$) in $D$. The tf-idf vectors are in $\mathbb{R}^D$, where $D$ here denotes the size of the vocabulary. The question and answer tf-idf vectors are arranged as columns to form matrices $X$ and $Y$, both of size $D \times n$. Recall that the latent space would be a Euclidean space of dimension $d$, and typically we have $d < D$. Our method is intentionally designed to not rely on the specifics of the representation used, and thus can make use of any vector representation of text data. Note that our latent space embeddings in $\mathbb{R}^d$ are evidently unrelated to distributional text embeddings (e.g., (Mikolov et al., 2013)) and are complementary in that such embeddings could be used as an alternative input representation for $x_i$ and $y_i$.
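A minimal sketch of building the $X$ and $Y$ matrices follows. The paper does not specify the tokenizer or the tf-idf weighting variant, so the whitespace tokenization, the shared question-answer vocabulary, the smoothed idf formula, and all function names here are assumptions chosen only to make the example concrete.

```python
import math
from collections import Counter

import numpy as np

def build_vocab(texts):
    """Map each distinct lowercased whitespace token to a row index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def fit_idf(texts, vocab):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n = len(texts)
    df = Counter(tok for text in texts for tok in set(text.lower().split()))
    return {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}

def tfidf_matrix(texts, vocab, idf):
    """Return a (vocabulary size) x (number of texts) matrix, one column per text."""
    M = np.zeros((len(vocab), len(texts)))
    for col, text in enumerate(texts):
        counts = Counter(tok for tok in text.lower().split() if tok in vocab)
        for tok, tf in counts.items():
            M[vocab[tok], col] = tf * idf[tok]
    return M

# Toy archive with n = 2 QA pairs.
questions = ["disk full error", "printer offline"]
answers = ["format the disk", "restart the printer"]
vocab = build_vocab(questions + answers)
idf = fit_idf(questions + answers, vocab)
X = tfidf_matrix(questions, vocab, idf)  # question matrix, D x n
Y = tfidf_matrix(answers, vocab, idf)    # answer matrix, D x n
```

Using one vocabulary for both sides keeps $X$ and $Y$ in the same $\mathbb{R}^D$, matching the description above; any other vector representation could be substituted at this point.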

Regularized Reconstruction Coefficients
For any question $x_i$, let $N_k(x_i)$ denote the set of top-k nearest questions to $x_i$, with proximity assessed using cosine similarity of vectors in $\mathbb{R}^D$; analogously, $N_k(y_i)$ denotes the top-k nearest answers to $y_i$. Much like the representation, the similarity measure may also be replaced as appropriate. Inspired by LLE (Saul and Roweis, 2000), we model the local neighborhood geometry around $x_i$ using a reconstruction coefficient $w^q_{ij}$ for each question $x_j \in N_k(x_i)$. We intend to learn the coefficients such that $x_i$ may be reconstructed well as a linear combination of its neighbors. Thus, these coefficients are computed by minimizing, for every question $x_i$, the regularized reconstruction penalty (RRP) given below:
\[ \min_{w^q_{ij}} \; \Big\| x_i - \sum_{j : x_j \in N_k(x_i)} w^q_{ij} \, x_j \Big\|^2 + \lambda \sum_{j : x_j \in N_k(x_i)} \big(w^q_{ij}\big)^2 \tag{2} \]
The first term denotes the approximation error in reconstructing $x_i$ as a linear combination of its k nearest neighbors using the weights $w^q_{ij}$. The second term is an L2 regularization term weighted by a non-negative hyperparameter $\lambda$, which we set to 0.01 in our experiments. We replaced the sum-to-one constraint in (Saul and Roweis, 2000) by L2 regularization since the former produces large swings in magnitude on either side of 0.0 (note that the coefficients are not constrained to be non-negative) in high-dimensional spaces such as our tf-idf space, leading to stability concerns.
By explicitly assigning $w^q_{ij} = 0$ for all $x_j \notin N_k(x_i)$, we rewrite the above problem as:
\[ \min_{w^q_i} \; \| x_i - X w^q_i \|^2 + \lambda \, (w^q_i)^\top I \, w^q_i \quad \text{s.t.} \quad w^q_{ij} = 0 \;\; \forall x_j \notin N_k(x_i) \tag{3} \]
where $I$ is an $n \times n$ identity matrix and $w^q_i$ is a column vector of size $n$ comprising the reconstruction coefficients for $x_i$. It can be shown that the nonzero entries of the optimal coefficient vector are:
\[ \big( X_i^\top X_i + \lambda I_k \big)^{-1} X_i^\top x_i \tag{4} \]
where $I_k$ is an identity matrix of size $k$ and $X_i$ is a $D \times k$ matrix obtained from the matrix $X$ by retaining only those columns that are neighbors of $x_i$. Note that the above matrix inverse is well-defined since $X_i^\top X_i + \lambda I_k$ is positive definite by construction for $\lambda > 0$. Once we find these optimal coefficient vectors for all questions (answers), we stack them together column-wise and obtain a matrix $W_q$ ($W_a$) of size $n \times n$, called the reconstruction coefficient matrix for questions (answers). These two matrices $W_q$ and $W_a$ capture the local geometry of the questions and answers in the QA archive $D$.
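The coefficient computation can be sketched as follows, assuming the closed form is the ridge-regression solution $(X_i^\top X_i + \lambda I_k)^{-1} X_i^\top x_i$ restricted to the k nearest neighbors. The function names are hypothetical, and the sketch additionally assumes that a question is excluded from its own neighborhood and that coefficient vectors are stored as columns of the resulting matrix.

```python
import numpy as np

def knn_cosine(X, i, k):
    """Indices of the k columns of X most cosine-similar to column i, excluding i."""
    norms = np.linalg.norm(X, axis=0)
    norms = np.where(norms > 0, norms, 1.0)  # guard against empty documents
    sims = (X.T @ X[:, i]) / (norms * norms[i])
    sims[i] = -np.inf  # a point is not its own neighbor (assumption)
    return np.argsort(-sims, kind="stable")[:k]

def reconstruction_weights(x, X_nbrs, lam=0.01):
    """Solve (X_nbrs^T X_nbrs + lam * I_k) w = X_nbrs^T x for the k coefficients."""
    k = X_nbrs.shape[1]
    gram = X_nbrs.T @ X_nbrs + lam * np.eye(k)
    return np.linalg.solve(gram, X_nbrs.T @ x)

def coefficient_matrix(X, k, lam=0.01):
    """n x n matrix whose i-th column holds the coefficients reconstructing x_i."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = knn_cosine(X, i, k)
        W[nbrs, i] = reconstruction_weights(X[:, i], X[:, nbrs], lam)
    return W
```

Calling `coefficient_matrix` once on the question matrix and once on the answer matrix would give the two coefficient matrices. Solving the linear system directly avoids forming the explicit inverse, which is a numerical convenience rather than something the paper prescribes.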

Embedding into Latent Space R d
In this step, we use the $W_q$ and $W_a$ matrices to transform the QA pairs $(x_i, y_i)$ into the embeddings $u_i$. Building upon LLE, we develop a scheme that preserves the local neighborhood structure around $x_i$ and $y_i$ in learning $u_i$:
\[ \min_{U} \; \alpha \sum_{i=1}^{n} \Big\| u_i - \sum_{j=1}^{n} w^q_{ij} u_j \Big\|^2 + (1 - \alpha) \sum_{i=1}^{n} \Big\| u_i - \sum_{j=1}^{n} w^a_{ij} u_j \Big\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} u_i = 0, \;\; U U^\top = I \tag{5} \]
where $U$ is a $d \times n$ matrix whose $i$-th column is equal to $u_i$, and $\alpha \in [0, 1]$ is a weighting parameter that trades off between the question and answer spaces. At $\alpha = 1$, the embedding $u_i$ will try to maximally align with the question $x_i$; at $\alpha = 0$, it will align with the answer $y_i$. Our constraints, like the analogous ones in LLE, ensure origin-centered solutions and avoid degenerate solutions, respectively (Pang et al., 2005). The first constraint is soft in that any optimal solution obtained while disregarding it can be shifted to ensure origin-centering.
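Towards capturing the optimal solution for Eq. (5), we define three $n \times n$ symmetric matrices:
\[ M_q = (I - W_q)(I - W_q)^\top \tag{6} \]
\[ M_a = (I - W_a)(I - W_a)^\top \tag{7} \]
\[ Z = \alpha M_q + (1 - \alpha) M_a \tag{8} \]
Theorem 1. If the eigenvalues of the matrix $Z$ are arranged in descending order and the eigenvectors corresponding to the last $d$ eigenvalues are denoted by $\{v_1, v_2, \ldots, v_d\}$, then the optimal solution for Eq. (5), denoted by $U^*$, is:
\[ U^* = [v_1, v_2, \ldots, v_d]^\top \tag{9} \]
Further, origin centering is achieved by the following transformation:
\[ U^*_{centered} = U^* - \frac{1}{n} U^* e e^\top \tag{10} \]
where $e$ is a vector of all 1s.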
Proof: First, observe that the objective function of Eq. (5) can be rewritten in a compact form:
\[ \alpha \, \| U - U W_q \|_F^2 + (1 - \alpha) \, \| U - U W_a \|_F^2 \tag{11} \]
where $\| \cdot \|_F$ denotes the Frobenius norm. Now, keeping the first constraint aside, we fold in the second constraint using Lagrange multipliers, yielding the following Lagrangian $L(U, \Lambda)$:
\[ L(U, \Lambda) = \alpha \, \| U - U W_q \|_F^2 + (1 - \alpha) \, \| U - U W_a \|_F^2 + e^\top \big( \Lambda \circ (U U^\top - I) \big) e \tag{12} \]
where $e$ is an all-1s vector. In this Lagrangian,
• Matrix $\Lambda$ is a $d \times d$ symmetric matrix denoting the Lagrange multipliers for the second constraint. Note that the last term is a compact representation of $d^2/2$ equality constraints.
• The symbol $\circ$ denotes the Hadamard product (element-wise product) of two matrices.
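For any matrix $M$, we have $\| M \|_F^2 = \mathrm{Tr}(M M^\top)$, where $\mathrm{Tr}(\cdot)$ is the trace. Thus, we can rewrite the first two terms of Eq. (12) as:
\[ \alpha \, \mathrm{Tr}\big( (U - U W_q)(U - U W_q)^\top \big) + (1 - \alpha) \, \mathrm{Tr}\big( (U - U W_a)(U - U W_a)^\top \big) \]
A slight re-arrangement yields:
\[ \alpha \, \mathrm{Tr}\big( U (I - W_q)(I - W_q)^\top U^\top \big) + (1 - \alpha) \, \mathrm{Tr}\big( U (I - W_a)(I - W_a)^\top U^\top \big) \]
The Q and A space components are now separated out into the first and second terms. We now simplify the notation using Eq. (6) and (7), i.e., $M_q = (I - W_q)(I - W_q)^\top$ and $M_a = (I - W_a)(I - W_a)^\top$, to:
\[ L(U, \Lambda) = \alpha \, \mathrm{Tr}(U M_q U^\top) + (1 - \alpha) \, \mathrm{Tr}(U M_a U^\top) + e^\top \big( \Lambda \circ (U U^\top - I) \big) e \tag{13} \]
Recall the following for any matrices $A$, $B$, and $C$ of compatible sizes: $\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$, $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB)$, and $e^\top (A \circ B) e = \mathrm{Tr}(A B^\top)$. With $Z = \alpha M_q + (1 - \alpha) M_a$ from Eq. (8), this allows us to rewrite Eq. (13) as:
\[ L(U, \Lambda) = \mathrm{Tr}(U Z U^\top) + \mathrm{Tr}\big( \Lambda (U U^\top - I) \big) \]
To find an optimal $U$, we differentiate $L(U, \Lambda)$ w.r.t. $U$ and equate to zero. This leads to:
\[ 2 U Z + 2 \Lambda U = 0 \]
The above follows from standard matrix properties (Petersen and Pedersen, 2012) together with the symmetry of $Z$ and $\Lambda$. Re-arranging:
\[ Z U^\top = - U^\top \Lambda \tag{14} \]
One possible solution of the above equation could be constructed in the following manner: choose any $d$ orthonormal eigenvectors of $Z$ as the columns of $U^\top$, and set $-\Lambda$ to the diagonal matrix of the corresponding eigenvalues. While any subset of $d$ eigenvectors (and their eigenvalues) would be a solution for Eq. (14), we take the bottom $d$ eigenvectors to minimize the objective; this is so since the objective becomes $\mathrm{Tr}(-\Lambda)$, the sum of the chosen eigenvalues, when Eq. (14) holds. The matrix constructed above is the optimal $U^*$ in Eq. (9). This completes the proof.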
The first constraint in Eq. (5) is then applied to centre the vectors around the origin using Eq.(10).
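The embedding step can be sketched numerically as below. The sketch assumes that $Z = \alpha (I - W_q)(I - W_q)^\top + (1 - \alpha)(I - W_a)(I - W_a)^\top$, which is the form obtained by expanding the Frobenius-norm objective when coefficient vectors are stored as columns of $W_q$ and $W_a$; the function names are hypothetical, and a dense eigensolver is used purely for brevity.

```python
import numpy as np

def build_z(W_q, W_a, alpha=0.8):
    """Symmetric n x n matrix combining question- and answer-space geometry."""
    n = W_q.shape[0]
    I = np.eye(n)
    M_q = (I - W_q) @ (I - W_q).T
    M_a = (I - W_a) @ (I - W_a).T
    return alpha * M_q + (1.0 - alpha) * M_a

def embed(Z, d):
    """Return (U, U_centered): bottom-d eigenvectors of Z as rows, then centered."""
    eigvals, eigvecs = np.linalg.eigh(Z)  # eigenvalues in ascending order
    U = eigvecs[:, :d].T                  # d x n; column i is the embedding u_i
    U_centered = U - U.mean(axis=1, keepdims=True)  # equals U - (1/n) U e e^T
    return U, U_centered

# Toy example with random coefficient matrices standing in for W_q and W_a.
rng = np.random.default_rng(0)
n, d = 8, 3
W_q = 0.1 * rng.normal(size=(n, n))
W_a = 0.1 * rng.normal(size=(n, n))
np.fill_diagonal(W_q, 0.0)
np.fill_diagonal(W_a, 0.0)
Z = build_z(W_q, W_a, alpha=0.8)
U, U_centered = embed(Z, d)
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order, the first $d$ eigenvectors are the "last $d$" of the descending ordering used in Theorem 1. The rows of `U` are orthonormal before centering, and the objective value $\mathrm{Tr}(U Z U^\top)$ equals the sum of the $d$ smallest eigenvalues.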

Embedding a new Question in R d
To use the historical $u_i$ vectors to retrieve historical QAs against a new question $q$ (with vector representation $x$), we need to embed the latter in the same space $\mathbb{R}^d$. This is achieved using the same structure as applied in forming the embedding; we start by identifying, from $D$, the k nearest questions to $q$. The reconstruction coefficient vector $w_x$ is then learnt using Eq. (2). Finally, we obtain the embedding $u$ for $q$ as a $w_x$-weighted linear combination of the $\mathbb{R}^d$ embeddings corresponding to the k nearest neighbors. This is captured in steps 9-11 of Algorithm 1, given in the next section.
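Algorithm 1: LASER-QA Algorithm
input : $D = \{(q_1, a_1), \ldots, (q_n, a_n)\}$ (CQA corpus) & query $q$
output : Top-t relevant QA pairs from $D$
Offline Phase
1 Use appropriate data representation to form vector-pairs $(x_i, y_i)$ for every QA $(q_i, a_i)$;
2 Compute the reconstruction coefficient matrices $W_q$ and $W_a$ (Ref. Section 4.2);
3 Compute $M_q$ from $W_q$ (Eq. 6);
4 Compute $M_a$ from $W_a$ (Eq. 7);
5 $Z \leftarrow \alpha M_q + (1 - \alpha) M_a$;
6 Compute the eigenvalues and eigenvectors of $Z$;
7 $U^* \leftarrow$ eigenvectors of $Z$ corresponding to the $d$ smallest eigenvalues (Eq. 9);
8 $U^*_{centered} \leftarrow$ origin-centered $U^*$ (Eq. 10);
Online Phase
9 $x \leftarrow$ Vector representation of the query $q$;
10 $w_x \leftarrow$ Vector of size $n$ capturing the reconstruction coefficients for $x$;
11 $u \leftarrow U^*_{centered} \, w_x$;
12 Output top-t QA pairs by computing the following scores: $f(q, (q_i, a_i)) = \frac{u^\top u_i}{\|u\| \, \|u_i\|}$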
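The query-time steps (9 to 11) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name is hypothetical, neighbors are found by an exhaustive cosine scan instead of an IR index, and the coefficients are assumed to come from the same regularized least-squares solution used for archived questions.

```python
import numpy as np

def embed_query(x, X, U_centered, k=15, lam=0.01):
    """Embed a query vector x (D,) using its k nearest archived questions.

    X is the D x n question matrix and U_centered the d x n embedding matrix.
    """
    n = X.shape[1]
    k = min(k, n)
    denom = np.linalg.norm(X, axis=0) * np.linalg.norm(x)
    denom = np.where(denom > 0, denom, 1.0)  # guard against zero vectors
    sims = (X.T @ x) / denom
    nbrs = np.argsort(-sims, kind="stable")[:k]  # k most similar questions

    # Regularized reconstruction coefficients over the neighbors only.
    X_nbrs = X[:, nbrs]
    w = np.linalg.solve(X_nbrs.T @ X_nbrs + lam * np.eye(k), X_nbrs.T @ x)

    # Size-n coefficient vector with zeros outside the neighborhood.
    w_x = np.zeros(n)
    w_x[nbrs] = w
    return U_centered @ w_x

# Toy check: a query identical to an archived question, with k = 1 and no
# regularization, is mapped onto that question's embedding.
X = np.eye(3)
U_centered = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
u = embed_query(X[:, 0], X, U_centered, k=1, lam=0.0)
```

The resulting vector `u` would then be scored against the archived embeddings with cosine similarity, as in step 12.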

LASER-QA Algorithm
The details of the LASER-QA technique from the previous sections are summarized in Algorithm 1, with the offline phase (Steps 1-8) and the query-time phase (Steps 9-12) clearly demarcated. It may be noted that LASER-QA, being an optimization-based method, preserves Q/A-space local neighborhoods on a best-effort basis and does not offer guarantees on the fraction of local neighbors preserved from either space in the $\mathbb{R}^d$ embedding.

Generalizability of LASER-QA
LASER-QA can be easily extended to incorporate other kinds of information that might be available along with QA pairs, such as images, votes (e.g., Blurtit (http://www.blurtit.com/), Quora and Yahoo! Answers), tags (Quora), categories (answers.com (http://www.answers.com/) and Yahoo! Answers) or comments (Quora and Blurtit). Consider data in the form of triplets $(q_i, a_i, m_i)$ where $m_i$ represents the extra information. The $m_i$ vectors are subjected to the same form of processing as the $q_i$ and $a_i$ vectors, leading to the $W_m$ and $M_m$ matrices. Line 5 in Algorithm 1 would then change to:
\[ Z \leftarrow \alpha_q M_q + \alpha_a M_a + \alpha_m M_m \]
where the different $\alpha$s denote interpolation weights that need to be set appropriately. The remainder of the LASER-QA steps remain identical to Algorithm 1. It may be noted that $\alpha_m$ could be set to a low value if the utility of the extra information is deemed to be low.
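A sketch of the generalized combination step is given below. It assumes each information source contributes a matrix of the form $(I - W)(I - W)^\top$ built from its own reconstruction coefficient matrix, weighted by its interpolation weight; the function name is hypothetical, and the sketch does not enforce that the weights sum to one since the text leaves their setting open.

```python
import numpy as np

def combined_z(coefficient_matrices, alphas):
    """Weighted sum of (I - W)(I - W)^T over all information sources."""
    if len(coefficient_matrices) != len(alphas):
        raise ValueError("need exactly one interpolation weight per source")
    n = coefficient_matrices[0].shape[0]
    I = np.eye(n)
    Z = np.zeros((n, n))
    for W, alpha in zip(coefficient_matrices, alphas):
        R = I - W
        Z += alpha * (R @ R.T)
    return Z

# Toy example: questions, answers, and one auxiliary source (e.g., tags).
rng = np.random.default_rng(1)
W_q, W_a, W_m = (0.1 * rng.normal(size=(5, 5)) for _ in range(3))
Z = combined_z([W_q, W_a, W_m], [0.6, 0.3, 0.1])
```

With two sources and weights $(\alpha, 1 - \alpha)$ this reduces to the original two-way combination, so the eigen-decomposition and centering steps can be applied to the result unchanged.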

Scalability of LASER-QA
We now analyze the scalability of LASER-QA, separately for (a) the one-time offline phase and (b) the query-time phase.
Offline Phase: This is a one-time operation at system design time, involving matrix multiplications followed by eigendecomposition. Our matrices being sparse, multiplications are fast and worst-case quadratic in $n$. The eigendecomposition is $O(n^3)$ but, being a fundamental matrix operation, has very efficient implementations (especially for symmetric matrices such as ours) with sub-second response times for $n$ of the order of thousands in packages such as Eigen and LAPACK, with trendlines illustrating that eigendecompositions with $n$ even of the order of millions are easy. The embeddings of all vectors are then indexed using conventional multi-dimensional indexes and/or locality sensitive hashing to aid querying.
Online/Query-time Phase: This encompasses (a) an IR query to find the k most similar questions, (b) solving for the k reconstruction coefficients in Eq. (4) and forming the embedding, and (c) querying for the top-t nearest neighbors over the indexes built at design time. The main query-time overhead (vis-a-vis conventional information retrieval) is the additional query over the multi-dimensional index; this construction ensures fast sub-linear response times for the online phase.
Scalability against other methods: In contrast to LASER-QA, it is notable that the baselines employ expensive query-time operations; specifically, it is unclear how query-time retrieval using Eq. 4 in the TBLM paper (Zhang et al., 2014) and Eq. 1 in the TopicTRLM paper (Zhou et al., 2015) could be completed in sub-linear time.

Datasets and Experimental Setup
Datasets: We use two recent datasets in our evaluation, CQADupStack (Hoogeveen et al., 2015) and SemEval2016-Task3 (Nakov et al., 2016b). The former has a manually labelled set of related questions for every question, whereas the latter has relevance labels associated with answers (appearing as comments); these labellings make automated evaluation possible. Among the 12 subsets in CQADupStack, owing to scalability issues of the AENN baseline, we choose the three smaller subsets, namely webmasters (1299 QAs), android (2193), and gis (3726), for a full comparative evaluation. Each of these is split into two halves, with one portion used for training (that is, learning the statistical model such as LASER-QA, the translation model, etc.) and the other used for testing (the 50:50 split ensures a sizeable test set). The related labellings are used only for evaluation purposes; however, since only training pairs are retrieved within this setup, related labellings across QAs in the testing set would be missed, artificially lowering the recall of all the methods in our evaluation. In a recent analysis (Hoogeveen et al., 2016), the CQADupStack authors quantify the incompleteness of labelling in the dataset. Such issues further artificially reduce retrieval accuracies as estimated from our automated evaluation. The SemEval2016 dataset, on the other hand, has an implicit test-train split. We use the subset of the data categorized under Qatar Living Lounge, the largest category (which is 27% of the full dataset), for our experiments. All 'comments' that are labelled relevant to the associated question are paired with it as QA pairs to form a training set of 1366 pairs, with the test questions from the dataset used as is.
Baselines: As detailed in Section 2, we compare against three baselines: (a) TBLM (Zhang et al., 2014) (topic model approach), (b) TopicTRLM (Zhou et al., 2015) (topic+translation models), and (c) AENN (Zhou et al., 2016) (deep learning). TBLM requires an answer quality signal, which we set to unity. We use author-recommended parameter settings for TBLM and TopicTRLM. Since AENN learns a latent space representation (though a separate one for questions and answers, unlike LASER-QA), the evaluation w.r.t. LASER-QA is a direct comparison of the quality of the respective latent spaces. The AENN method requires training triplets, i.e., [question, answer, other answer]; we populate the other answer part using the answer of a related question. This gives AENN an advantage as it uses relations among training pairs that are unavailable to other methods. For AENN, quality measures peaked at a latent space dimensionality of around 2000 (for webmasters and gis) and 3000 (for android and SemEval2016); our results are from such settings.
LASER-QA Parameters: We set k = 15 and α = 0.8, the latter ensuring that the question space is given more importance. We always set d to the number of eigenvectors of Z, equalling |D|. We will separately study LASER-QA trends against parameter variations as well.
Evaluation Metrics: We use Precision, Success Rate (SR) (Ref. Sec 3), MAP and NDCG (Robertson and Zaragoza, 2007) for our evaluation. Precision simply measures the fraction of related documents among the top-t that were retrieved. Due to this rank-agnostic construction, precision is unable to reward placing the relevant results at the top of the result list instead of deeper down. In contrast, MAP and NDCG are rank-aware metrics. MAP computes the average of the precisions computed at rank positions where a relevant result is returned. NDCG is another rank-aware metric that discounts the appearance of a relevant result based on its rank in the result set. We assess statistical significance using randomization tests (Smucker et al., 2007).
Table 2 summarizes the comparative evaluation across varying t (best results boldfaced). The following observations are notable:

Evaluation Results and Insights
• LASER-QA outperforms the other methods across datasets. This is followed by TopicTRLM, TBLM and then AENN.
• LASER-QA's margin is highest at (small) values of t that are typical of scenarios involving human perusal of results. As t in- [...] Table 3 to illustrate the consistency in trends. Boldfacing and statistical significance have the same semantics as earlier, with the comparison performed against only TopicTRLM and TBLM.

LASER-QA Parameter Analysis
We now analyze the NDCG trends (NDCG being the most popular IR metric) across LASER-QA parameters, i.e., k, α and d, varying each one separately keeping the default choice for others.
• Varying k: Figure 1 plots NDCG against values of k from {5, 10, 15, 20}. The accuracy improves with increasing k in the lower ranges, while saturating beyond 15. The trends are similar across datasets.
• Varying α: The retrieval accuracies were seen to be stable across a wide range of values of α, as illustrated in Figure 2. This shows that LASER-QA is not very sensitive to α.
• Varying d: The size of the latent space, d, is a critical parameter for LASER-QA. Given the LASER-QA construction, this size is limited by the number of eigenvectors of the matrix Z, which is n × n. This means that d is bounded above by n, the size of the training dataset. Figure 3 plots the accuracies with varying values of d, with the upper end differing across datasets due to the dependence on the training dataset size. The plots indicate that the performance improves steadily with increasing values of d. The performance saturates beyond 400 for the topically coherent CQADupStack datasets. The Qatar Living Lounge category in SemEval2016, unlike the CQADupStack categories, is more diverse, discussing issues ranging from massage centres to immigration. Thus, LASER-QA is able to make use of many more dimensions to model the complexity involved.
To summarize, LASER-QA is not very sensitive to α and is best run with k ≥ 15 and values of d ≈ n. Finally, the precision-recall curve with varying values of t is presented in Figure 4. As may be observed, LASER-QA exhibits a gradual degradation of precision with increasing t correlated with a corresponding improvement in recall. The diversity in the SemEval2016 dataset manifests as a sharper precision drop at high t, as the result set starts to transcend sub-topic boundaries.

Conclusions
We considered the problem of CQA retrieval, the task of retrieving relevant historical QA pairs in response to a new question. We formulated a method that builds upon ideas from locally linear embedding to use collective corpus-level information across historical QA pairs to embed them in a latent space. In contrast to the mainstream paradigm in the literature, we do not explicitly model lexical correlations; instead, we learn an embedding of QA pairs in such a way that the local neighborhoods in the question and answer spaces are preserved. LASER-QA provides a single-model solution in lieu of learning separate models (e.g., topic and translation models) which are then interpolated into a final scoring; the latter approach has been observed to have reproducibility issues (Qiu et al., 2013). We analyzed our method empirically against state-of-the-art methods from across families of CQA retrieval methods that use topic models, translation models and deep learning. Our empirical results confirm that the LASER-QA method significantly outperforms the baselines on all IR metrics of interest, underlining the effectiveness of our modelling.
Future Work: A study on the correlation between the kNNs in the LASER-QA embedded space and the original question and answer spaces would provide insights into the extent of correlation between manifolds in the original spaces. Further, we would like to see how LASER-QA generalizes beyond text; one immediate scenario of interest is to explore how pictures and multimedia within QAs may be leveraged within LASER-QA.