Crossing Variational Autoencoders for Answer Retrieval

Answer retrieval aims to find the most aligned answer from a large set of candidates given a question. Learning vector representations of questions and answers is the key factor. Question-answer alignment and question/answer semantics are two important signals for learning the representations. Existing methods learned semantic representations with dual encoders or dual variational auto-encoders, where the semantic information came from language models or question-to-question (answer-to-answer) generative processes. However, the alignment and the semantics were learned too separately to capture the aligned semantics between question and answer. In this work, we propose to cross variational auto-encoders by generating questions with aligned answers and generating answers with aligned questions. Experiments show that our method outperforms the state-of-the-art answer retrieval method on SQuAD.


Introduction
Answer retrieval aims to find the most aligned answer from a large set of candidates given a question (Ahmad et al., 2019; Abbasiyantaeb and Momtazi, 2020). It has received increasing attention from the NLP and information retrieval communities (Yoon et al., 2019; Chang et al., 2020).
Sentence-level answer retrieval approaches rely on learning vector representations (i.e., embeddings) of questions and answers from pairs of question-answer texts. Both the question-answer alignment and the question/answer semantics are expected to be preserved in the representations. In other words, the question/answer embeddings must reflect both their textual semantics and their alignment as pairs.
One popular scheme, "Dual-Encoders" (also known as "Siamese network" (Triantafillou et al., 2017; Das et al., 2016)), has two separate encoders to generate question and answer embeddings and a predictor to match the two embedding vectors (Cer et al., 2018; Yang et al., 2019). Unfortunately, it has been shown difficult to train deep encoders with the weak signal of matching prediction (Bowman et al., 2015). There has thus been growing interest in developing deep generative models such as variational auto-encoders (VAEs) and generative adversarial networks (GANs) for learning text embeddings (Xu et al., 2017; Xie and Ma, 2019). As shown in Figure 1(b), the "Dual-VAEs" scheme has two VAEs, one for the question and the other for the answer (Shen et al., 2018). It uses the tasks of generating reasonable question and answer texts from latent spaces to preserve semantics in the latent representations.
Although Dual-VAEs is trained jointly on question-to-question and answer-to-answer reconstruction, the question and answer embeddings can only preserve the isolated semantics of each side. In the model, the Q-A alignment and Q/A semantics were too separate to capture the aligned semantics (as we mentioned at the end of the first paragraph) between question and answer. Learning the alignment from the weak Q-A matching signal, though now based on generatable embeddings, can lead to confusing results when (1) different questions have similar answers and (2) similar questions have different answers. Table 1 shows an example from SQuAD: 17 different questions share the same sentence-level answer.
Our idea is that if the aligned semantics were preserved, the embedding of a question would be able to generate its answer, and the embedding of an answer would be able to generate the corresponding question. In this work, we propose to cross variational auto-encoders, as shown in Figure 1(c), by reconstructing answers from question embeddings and questions from answer embeddings. Experiments show that our method improves MRR and R@1 over the state-of-the-art method by 1.06% and 2.44% on SQuAD, respectively. On a subset of the data where every answer has at least 10 different aligned questions, our method improves MRR and R@1 by 1.46% and 3.65%, respectively.

Related Work
Answer retrieval (AR) is the task of finding, for a given question, the most relevant answer among multiple candidate answers (Abbasiyantaeb and Momtazi, 2020). Another popular task on the SQuAD dataset is machine reading comprehension (MRC), which asks the machine to answer questions based on one given context (Liu et al., 2019). In this section, we review existing work related to answer retrieval and variational autoencoders.
Answer Retrieval. It has been widely studied with information retrieval techniques and has received increasing attention in recent years with deep neural network approaches. Recent works have proposed deep neural models for text-based QA that compare two segments of text and produce a similarity score.
Document-level retrieval (Chen et al., 2017; Wu et al., 2018; Seo et al., 2018, 2019) has been studied on many public datasets, including SQuAD (Rajpurkar et al., 2016), MS MARCO (Nguyen et al., 2016) and NQ (Kwiatkowski et al., 2019). ReQA proposed to investigate sentence-level retrieval and provided strong baselines over a reproducible construction of a retrieval evaluation set from the SQuAD data (Ahmad et al., 2019). We also focus on sentence-level answer retrieval.
Variational Autoencoders. A VAE consists of an encoder and a generator network, which encode a data example into a latent representation and generate samples from the latent space, respectively (Kingma and Welling, 2013). Recent advances in neural variational inference have produced deep latent-variable models for natural language processing tasks (Bowman et al., 2016; Kingma et al., 2016; Hu et al., 2017a,b; Miao et al., 2016). The general idea is to map the sentence into a continuous latent variable, or code, via an inference network (encoder), and then use the generative network (decoder) to reconstruct the input sentence conditioned on samples from the latent code (via its posterior distribution). Recent work on cross-modal generation adopted cross-alignment VAEs to jointly learn representative features from multiple modalities (Liu et al., 2017; Shen et al., 2017; Schonfeld et al., 2019). DeConv-LVM (Shen et al., 2018) and VAR-Siamese (Deudon, 2018) are most relevant to us; both adopt Dual-VAEs models (see Figure 1(b)) for text sequence matching tasks. In our work, we propose Cross-VAEs for question-answer alignment to enhance QA matching performance.

Proposed Method
Problem Definition. Suppose we have a question set Q and an answer set A, where each question and each answer consist of a single sentence. A question q ∈ Q and an answer a ∈ A can be represented as a triple (q, a, y), where y is a binary variable indicating whether q and a are aligned. The sentence-level retrieval task can therefore be cast as a matching problem: given a question q and a list of answer candidates C(q) ⊂ A, our goal is to predict p(y|q, a) for the input question q and each answer candidate a ∈ C(q).
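The matching formulation above reduces retrieval to scoring and ranking. The following sketch (names and the toy word-overlap matcher are illustrative, not the paper's model) shows the protocol: score every candidate a ∈ C(q) with a matcher standing in for p(y|q, a) and return the candidates in descending score order.

```python
# Hedged sketch of retrieval-as-matching; `match_prob` stands in for the
# learned p(y | q, a).

def retrieve(question, candidates, match_prob):
    """Rank answer candidates for one question by the matching score."""
    scored = [(match_prob(question, a), a) for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a for _, a in scored]

def toy_match_prob(q, a):
    """Toy stand-in: Jaccard word overlap between question and answer."""
    qw, aw = set(q.lower().split()), set(a.lower().split())
    return len(qw & aw) / max(len(qw | aw), 1)

ranked = retrieve(
    "where is the stadium",
    ["the stadium is in miami", "coldplay played the show"],
    toy_match_prob,
)
print(ranked[0])  # the candidate sharing the most words with the question
```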

Crossing Variational Autoencder
Learning cross-domain reconstructions under the generative assumption is essentially learning the conditional distributions p(q|z_a) and p(a|z_q), where the two continuous latent variables z_q, z_a ∈ R^{d_z} are independently sampled from the priors p(z_q) and p(z_a):

    p(q, a) = ∫∫ p(q|z_a) p(a|z_q) p(z_q) p(z_a) dz_q dz_a.    (1)

The question-answer pair matching can then be represented through the conditional distribution p(y|z_q, z_a) over the latent variables behind p(q|z_a) and p(a|z_q):

    p(y|q, a) = E_{z_q ∼ p(z_q|q), z_a ∼ p(z_a|a)} [ p(y|z_q, z_a) ].    (2)

Objectives.
We denote by E_q and E_a the question and answer encoders that infer the latent variables z_q and z_a from a given question-answer pair (q, a, y), and by D_q and D_a the two decoders that generate the question q from z_a and the answer a from z_q. Then we have the cross reconstruction loss:

    L_cross(θ_E, θ_D) = − E_{z_a ∼ p_E(z_a|a)} [log p_D(q|z_a)] − E_{z_q ∼ p_E(z_q|q)} [log p_D(a|z_q)],    (3)

where θ_E, θ_D are the encoder and decoder parameters to be optimized. The variational autoencoder (Kingma and Welling, 2013) imposes a KL-divergence regularizer that aligns both posteriors p_E(z_q|q) and p_E(z_a|a) with their priors:

    L_KL = KL(p_E(z_q|q) ∥ p(z_q)) + KL(p_E(z_a|a) ∥ p(z_a)).    (4)

Besides, we have the question-answer matching loss from f_φ(y|q, a):

    L_match = − E_{(q,a,y)} [ y log f_φ(y|q, a) + (1 − y) log(1 − f_φ(y|q, a)) ],    (5)

where f is a matching function and φ_f are its parameters to be optimized. Finally, we obtain the overall objective function to be minimized:

    L = α L_cross + β L_KL + γ L_match,    (6)

where α, β and γ are introduced as hyperparameters to control the importance of each task.
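As a concrete aid, here is a minimal numpy sketch of how such an objective is assembled. The paper's exact loss expressions are not fully shown in this extraction, so the reconstruction and matching terms below are placeholder scalars; only the diagonal-Gaussian KL against a standard normal prior is computed in closed form.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def total_loss(l_cross, kl_q, kl_a, l_match, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of cross-reconstruction, KL, and matching terms
    (alpha, beta, gamma mirror the hyperparameters in the objective)."""
    return alpha * l_cross + beta * (kl_q + kl_a) + gamma * l_match

# With mu = 0 and sigma = 1 the posterior equals the prior, so KL = 0.
print(gaussian_kl(np.zeros(4), np.zeros(4)))  # 0.0
print(total_loss(1.0, 0.0, 0.0, 1.0))         # 2.0
```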

Model Implementation
Dual Encoders. We use Gated Recurrent Units (GRUs) as encoders to learn contextual word embeddings (Cho et al., 2014). Question and answer embeddings are obtained by a weighted sum through multi-hop self-attention (Lin et al., 2017) over the GRU states and then fed into two linear transformations to obtain the mean and standard deviation of N(z_q; μ_q, diag(σ_q²)) and N(z_a; μ_a, diag(σ_a²)). Dual Decoders. We adopt another GRU to generate the token sequence conditioned on the latent variables z_q and z_a.
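The step from the pooled encoder state to a latent sample can be sketched as follows. This is not the authors' code: the pooled state `h`, the two linear heads, and the softplus parameterization of the standard deviation are illustrative assumptions; only the dimensions (768 hidden, 512 latent) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    """Map the std head to positive values (an assumed parameterization)."""
    return np.log1p(np.exp(x))

def reparameterize(mu, sigma):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

h = rng.standard_normal(768)                 # pooled self-attention output
W_mu = rng.standard_normal((512, 768))       # linear head for the mean
W_sig = rng.standard_normal((512, 768))      # linear head for the std
mu = W_mu @ h
sigma = softplus(W_sig @ h)
z = reparameterize(mu, sigma)                # latent sample for the decoder
print(z.shape)  # (512,)
```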
Question Answer Matching. We adopt cosine similarity with l2 normalization to measure the matching probability of a question-answer pair.
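The matching function named above is straightforward; a sketch of it (with illustrative names) is:

```python
import numpy as np

def cosine_match(z_q, z_a):
    """Cosine similarity of l2-normalized question/answer embeddings."""
    z_q = z_q / np.linalg.norm(z_q)
    z_a = z_a / np.linalg.norm(z_a)
    return float(z_q @ z_a)

print(cosine_match(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_match(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```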

Dataset
Our experiments were conducted on SQuAD 1.1 (Rajpurkar et al., 2016). It has over 100,000 questions composed to be answerable by text from Wikipedia documents. Each question has one corresponding answer sentence extracted from the Wikipedia document. Since the test set is not publicly available, we partition the dataset into 79,554 (training) / 7,801 (dev) / 10,539 (test) examples.

Baselines
InferSent (Conneau et al., 2017). It is not explicitly designed for answer retrieval, but it produces results on semantic tasks without requiring additional fine-tuning. USE-QA (Yang et al., 2019). A transformer-based universal sentence encoder trained over online forum question-answer data. QA-Lite. Like USE-QA, this model is also trained over online forum data and based on a transformer; the main differences are a reduction in the width and depth of the model layers and in the sub-word vocabulary size.
BERT QA (Devlin et al., 2019). BERT QA first concatenates the question and answer into one text sequence, passes it through a 12-layer BERT, and feeds the [CLS] vector into a binary classifier.
SenBERT (Reimers and Gurevych, 2019). It consists of twin-structured BERT-like encoders to represent the question and answer sentences, and applies a similarity measure at the top layer.

Experimental Settings
Implementation details. We initialize each word with a 768-dimensional BERT token embedding vector. If a word is not in the vocabulary, we use the average of its sub-word embedding vectors in the vocabulary. The number of hidden units in the GRU encoders is set to 768. All decoders are multi-layer perceptrons (MLPs) with one hidden layer of 768 units. The latent embedding size is 512. The model is trained for 100 epochs with stochastic gradient descent using the Adam optimizer (Kingma and Ba, 2014). For the KL-divergence, we use a KL cost annealing scheme (Bowman et al., 2016), which lets the VAE learn useful representations before they are smoothed out: we increase the weight β of the KL-divergence by a rate of 2/epochs per epoch until it reaches 1. We set the learning rate to 1e-5 and implement the model in PyTorch.
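The annealing schedule described above (a linear ramp of 2/epochs per epoch, capped at 1, so β reaches 1 halfway through training) can be written as:

```python
def kl_beta(epoch, num_epochs=100):
    """KL weight for a given epoch: linear warm-up at 2/num_epochs per
    epoch, capped at 1.0 (reached at epoch num_epochs/2)."""
    return min(1.0, epoch * (2.0 / num_epochs))

print(kl_beta(0))   # 0.0
print(kl_beta(10))  # ~0.2
print(kl_beta(90))  # 1.0 (capped)
```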
Competitive Methods. We compare our proposed crossing variational autoencoder (Cross-VAEs) with the dual-encoder model and the dual variational autoencoder (Dual-VAEs). For fair comparison, all models use GRUs as encoders and decoders, and all other hyperparameters are kept the same.
Evaluation Metrics. The models are evaluated on retrieving and ranking answers to questions using mean reciprocal rank (MRR) and recall at K (R@K). R@K is the percentage of correct answers retrieved in the top K out of all relevant answers. MRR is the average of the reciprocal ranks of the correct answers over a set of queries.
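For the single-correct-answer setting used here, both metrics reduce to simple functions of the 1-based rank at which each query's correct answer appears (a minimal sketch; variable names are illustrative):

```python
def mrr(ranks):
    """Mean reciprocal rank over the 1-based ranks of correct answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct answer appears in the top K."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 10]
print(mrr(ranks))             # (1 + 1/3 + 1/2 + 1/10) / 4
print(recall_at_k(ranks, 2))  # 0.5
```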
Comparing performance with baselines. As shown in Table 2, the two BERT-based models do not perform well, which indicates that fine-tuning BERT may not be a good choice for the answer retrieval task due to unrelated pre-training tasks (e.g., masked language modeling). In contrast, using BERT token embeddings performs better in our retrieval task. Our proposed method outperforms all baseline methods. Compared with USE-QA, our method improves MRR and R@1 by +1.06% and +2.44% on SQuAD, respectively. In addition, the dual variational autoencoder (Dual-VAEs) does not bring much improvement on the answer retrieval task because it only preserves the isolated semantics of questions and answers. Our proposed crossing variational autoencoder (Cross-VAEs) outperforms both the dual-encoder model and the dual variational autoencoder model, improving MRR and R@1 by +1.23%/+0.81% and +0.90%/+0.59%, respectively.
Analyzing performance on a sub-dataset. We extract a subset of SQuAD in which any answer has at least eight different aligned questions. As shown in Table 3, our proposed crossing variational autoencoder (Cross-VAEs) outperforms the baseline methods on the subset. Our method improves MRR and R@1 by +1.46% and +3.65% over USE-QA. Cross-VAEs significantly improves performance when an answer has multiple aligned questions. Additionally, the SSE of our method is smaller than that of USE-QA; therefore, questions with the same answer are closer in the latent space.

Figure 2: A case of 14 different questions aligned to the same answer. We use SVD to reduce the embedding dimensions to 2 and project them onto the X-Y plane; the scale of the axes is relative, with no practical significance. (c) Two questions were incorrectly matched by USE-QA but correctly matched by Cross-VAEs. We observe that our method makes questions that share the same answer closer to each other.

Case Study
Figures 2(a) and 2(b) visualize the embeddings of 14 questions with the same answer. We observe that crossing variational autoencoders (Cross-VAEs) better capture the aligned semantics between questions and answers, making the latent representations of questions and answers more distinguishable. Figure 2(c) shows two example questions and the corresponding answers produced by USE-QA and Cross-VAEs. We observe that Cross-VAEs can better distinguish similar answers even when they all share several words with the question.

Conclusion
Given a question, answer retrieval aims to find the most similar answer among candidate answer texts. In this paper, we proposed to cross variational autoencoders by generating questions with aligned answers and generating answers with aligned questions. Experiments show that our method improves MRR and R@1 over the best baseline by 1.06% and 2.44% on SQuAD.

Figure 1: (a)-(b) The Q-A alignment and Q/A semantics were learned too separately to capture the aligned semantics between question and answer. (c) We propose to cross VAEs by generating questions with aligned answers and generating answers with aligned questions.
We cross the VAEs by reconstructing answers from question embeddings and reconstructing questions from answer embeddings. Note that compared with Dual-VAEs, the encoders do not change, but the decoders work across the question and answer semantics.

Table 1: The answer at the bottom of this table was aligned to 17 different questions at the sentence level.

Question (1): What three stadiums did the NFL decide between for the game?
Question (2): What three cities did the NFL consider for the game of Super Bowl 50?
...
Question (17): How many sites did the NFL narrow down Super Bowl 50's location to?
Answer: The league eventually narrowed the bids to three sites: New Orleans Mercedes-Benz Superdome, Miami Sun Life Stadium, and the San Francisco Bay Area's Levi's Stadium.

Table 2: Performance of answer retrieval on SQuAD.

Table 3: Performance of answer retrieval on a subset of SQuAD in which every answer has more than 8 questions. Our method outperforms the baselines by a larger margin. SSE indicates the sum of squared distances (errors) between different questions aligned to the same answer.
The shared answer in the Figure 2 case: "The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows."