Learning Analogy-Preserving Sentence Embeddings for Answer Selection

Answer selection aims at identifying the correct answer for a given question from a set of potentially correct answers. Contrary to previous works, which typically focus on the semantic similarity between a question and its answer, our hypothesis is that question-answer pairs are often in analogical relation to each other. Using analogical inference as our use case, we propose a framework and a neural network architecture for learning dedicated sentence embeddings that preserve analogical properties in the semantic space. We evaluate the proposed method on benchmark datasets for answer selection and demonstrate that our sentence embeddings indeed capture analogical properties better than conventional embeddings, and that analogy-based question answering outperforms a comparable similarity-based technique.


Introduction
Answer selection is the task of identifying the correct answer to a question from a pool of candidate answers. The standard methodology is to prefer answers that are semantically similar to the question. Often, this similarity is strengthened by bridging the lexical gap between the text pairs via learned semantic embeddings for words and sentences. The main drawback of this method is that question-answer (QA) pairs are modeled independently, and that the correspondence between different pairs is not considered in these embeddings. In fact, these methods focus only on the relationship that may exist between the entities that constitute the QA pair at hand, and are thus limited to pairwise semantic structures.
Instead, we argue in this paper that questions and their correct answers often form analogical relations. For example, the question "Who is the president of the United States?" and its answer are in the same relation to each other as the question "Who is the current chancellor of Germany?" and "Angela Merkel". Thus, for modelling these relations, we need to look at quadruples of textual items in the form of two question-answer pairs, and want to reinforce that they are in the same relation to each other.
Figure 1: Illustration of analogy-based answer selection. Given a question and its candidate answers, each pair is compared to a QA prototype pair. The candidate answer with the highest score is assumed to be the correct answer.
We expect that using analogies to identify and transfer positive relationships between QA pairs will be a better approach for tackling the task of answer selection than simply looking at the similarity between individual questions and their answers.
We use sentence embeddings as the mechanism to assess the relationship between two sentences, and aim to learn a latent representation in which their analogical relation is explicitly enforced. Analogies are defined as relational similarities between two pairs of entities, such that the relation that holds between the entities of the first pair also holds for the second pair. Loosely speaking, the quadruple of sentences is in analogical proportion if the difference between the first question and its answer is approximately the same as the difference between the second question and its answer.
This formulation is especially valuable because analogies allow us to relate pairs that are not directly or explicitly linked. Consequently, in the vector space, analogous QA pairs will be oriented in the same direction, whereas dissimilar pairs will not.
The remainder of the paper is organized as follows: the next section presents related work on answer selection and metric learning, and lays down the foundations of analogical reasoning. In Section 3, we formally define analogies, and introduce our approach for learning such analogical embeddings. Finally, in Section 4, we evaluate the learnt representations to demonstrate that the found embeddings indeed respect the sought analogies, and to illustrate the benefits of analogies for the task of answer selection.

Related Work
Answer Selection. Answer selection is an important problem in natural language processing that has drawn a lot of attention in the research community (Lai et al., 2018). Given a question and a set of candidate answers, the task is to identify the correct answer(s) in this set. This task can be formulated as a classification or a ranking problem. Early works relied on computing a matching score between a question and its correct answer, and were characterized by a heavy reliance on feature engineering for representing the QA pairs. Representative works include (Filice et al., 2016), which studies the effects of various similarity, heuristic, and thread-based features, and (Tymoshenko and Moschitti, 2015), which analyzes the effect of syntactic and semantic features extracted by a syntactic parser for answer re-ranking. Recently, deep learning methods have achieved excellent results while mitigating the difficulty of feature engineering. These methods learn latent representations for questions and answers independently, and apply a matching function to score the two texts. The most representative works in this line include (Wang and Nyberg, 2015; Yin et al., 2016; Severyn and Moschitti, 2015; Tay et al., 2017).
Embeddings and Metric Learning. Our work is also related to representation learning using deep neural networks. In fact, learning the embeddings of entities can be seen as a knowledge induction process, as those induced latent representations can be used to infer properties of unseen samples.
Although many studies confirmed that embeddings obtained from distributional similarity can be useful in a variety of different tasks, Levy et al. (2015) showed that the semantic knowledge encoded by general-purpose similarity embeddings is limited, and that forcing the learnt representations to distinguish functional similarity from relatedness is beneficial. For this purpose, many task-specific embeddings have been proposed for a variety of tasks, including (Riedel et al., 2013) for binary relation extraction and (FitzGerald et al., 2015) for semantic role labeling. This work aims to preserve more far-reaching structures, namely analogies between pairs of entities.
Analogical Reasoning. Analogical reasoning has been an active research topic in classic artificial intelligence. It has been successfully used in different domains such as classification (Bounhas et al., 2014), clustering (Marx et al., 2002), dimensionality reduction (Memisevic and Hinton, 2004), or learning to rank (Fahandar and Hüllermeier, 2018). Gentner (1983) studies analogies with respect to human cognition, defines an analogy as a relational similarity over two pairs of entities, and differentiates it from the more superficial similarity defined by attributes. Since this general definition of analogy requires high-level reasoning which is not scalable to large-scale automated prediction systems, Miclet et al. (2008) define the concept of analogical dissimilarity between entities in the same semantic universe. The analogical dissimilarity allows direct inference for unseen entities. Contrary to their direct inference setting, we enforce the analogical constraints in the learned embedding in the form of geometrical constraints, by imposing the co-linearity of the vectors that map the entities of each pair in the analogical proportion. It is worth mentioning that analogies have been found to emerge from several word embedding models, inter alia (Mikolov et al., 2013; Pennington et al., 2014), but those are allegedly only empirical observations, which we found not to carry over to our task.

Analogical Embeddings
In this section, we explain our approach towards generating semantic embeddings that preserve analogical proportions.

Analogical Reasoning
In this section, we briefly introduce key concepts in analogical reasoning, starting with analogical proportions.
Definition 1 (Analogical Proportion) Let a, b, c, d be four values from a domain X. The quadruple (a, b, c, d) is said to be in analogical proportion $a : b :: c : d$ if a is related to b as c is related to d, i.e., $R(a, b) \sim R(c, d)$.
This comparative relation between two pairs of entities can be expressed in many ways (Dubois et al., 2016), but the most noteworthy are:
- Arithmetic proportion: $(a - b) = (c - d)$
- Geometric proportion: $\frac{\min(ad, bc)}{\max(ad, bc)}$
In this work, we focus solely on the arithmetic interpretation of analogy.
An intuitive way of viewing analogies is through geometrical constraints in a Euclidean space. Enforcing the relational similarity between pairs of elements is equivalent to constraining the four elements to form a parallelogram.
The left graph of Figure 2 illustrates such an analogical parallelogram. As we can see, in an analogical parallelogram, there is not only a relation R holding between (a, b) and (c, d) respectively, but there must also hold a similar relation R between (a, c) and (b, d).
We can now take a first step towards our problem, which is learning to identify correct answers using analogical inference. Given the aforesaid quadruple, when one of the four elements is unknown, an analogical proportion becomes an analogical equation $a : b :: c : x$, where x represents an unknown element that is in analogical proportion to a, b, c.
In our setting, an exact solution to an analogical equation can often not be expected. Instead, we aim at finding the element $d_i$, among n candidates, for which the analogical proportion is as closely satisfied as possible. For example, in the right graph of Figure 2, neither $d_1$ nor $d_2$ is a perfect solution to the analogical equation $a : b :: c : x$, but $d_1$ seems to be a better solution than $d_2$.
In order to relax the equality constraint between the pairs of entities, and to generalize the formulation of analogical proportions beyond the Boolean case, Miclet et al. (2008) proposed to measure the degree of an analogical proportion using analogical dissimilarity:
$$v(a, b, c, d) = \lVert (a - b) - (c - d) \rVert$$
This equation represents the relation R as the difference between the entities of a pair, and ∼ as the difference between the relations so expressed. Obviously, v(a, b, c, d) = 0 if (a, b, c, d) are in analogical proportion, and the value increases the less similar (a − b) and (c − d) are to each other. This allows us to re-frame the original problem of answer selection as a ranking problem, in which the goal is to select the candidate answer d which minimizes the degree of analogical dissimilarity:
$$d^{*} = \arg\min_{d_i} v(a, b, c, d_i)$$
In the following sections, we will describe the details of the model by motivating the architectural choices.
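As a rough illustration of this ranking criterion, the following sketch computes an analogical dissimilarity on toy vectors and picks the candidate that minimizes it. The choice of the Euclidean norm, the function names and the 2-dimensional vectors are assumptions made for illustration; they are not taken from the experiments.

```python
import math

def analogical_dissimilarity(a, b, c, d):
    """Norm of (a - b) - (c - d); the Euclidean norm is an assumption here."""
    return math.sqrt(sum(((ai - bi) - (ci - di)) ** 2
                         for ai, bi, ci, di in zip(a, b, c, d)))

def best_candidate(a, b, c, candidates):
    """Index of the candidate d_i that comes closest to solving a : b :: c : x."""
    scores = [analogical_dissimilarity(a, b, c, d) for d in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)

# Toy 2-d vectors (hypothetical values, not learned embeddings).
a, b, c = (0.0, 0.0), (1.0, 2.0), (3.0, 0.0)
d1, d2 = (4.0, 1.5), (2.0, -1.0)
chosen = best_candidate(a, b, c, [d1, d2])  # d1 is the closer completion
```

In this toy setting the exact solution would be the point (4, 2), for which the dissimilarity is zero; d1 is merely the nearer of the two available candidates.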

Generating Quadruples
In this work, we consider QA pairs as relational data. We aim to transfer knowledge from pairs whose relation is well known, which we call prototypes, to unseen pairs. For this, we train a model to encode analogies in the latent representations of the sentences. To create quadruple instances for training the model, we adapt state-of-the-art datasets.
An analogy quadruple has the following form: [$q_p : a_p :: q_i : a_{ij}$], where $q_p$ and $a_p$ respectively stand for the question and the answer of the prototype pair, whereas $q_i$ is the i-th question and $a_{ij}$ is the j-th candidate answer to $q_i$. The cells in red represent positive analogical quadruples, composed of a prototype QA pair, a question and its correct answer. Conversely, a negative quadruple contains a QA prototype, a question and one of its incorrect answers.
Table 1: Examples of quadruples for the three question categories.
"Where" questions
Sentence A: "Where was Abraham Lincoln born?"
Sentence B: "On February 12, 1809, Abraham Lincoln was born Hardin County, Kentucky"
Sentence C: "Where was Franz Kafka born?"
Sentence D: "Franz Kafka was born on July 3, 1883 in Prague, Bohemia, now the Czech Republic."
"Who" questions
Sentence A: "Who made the rotary engine automobile?"
Sentence B: "Mazda continued work on developing the Wankel rotary engine."
Sentence C: "Who discovered prions?"
Sentence D: "Prusiner won Nobel prize last year for discovering prions"
"When" questions
Sentence A: "When was Leonardo da Vinci born?"
Sentence B: "Leonardo da Vinci was actually born on 15 April 1452 [...] "
Sentence C: "When did Mt St Helen last have significant eruption?"
Sentence D: "Pinatubo's last eruption [...] as Mt St Helen's did when it erupted in 1980."
Given a set of questions and their respective candidate answers, we construct the analogical quadruples in two steps. First, we divide all the questions into three different subsets of wh-word questions: "Who", "When" and "Where". We focus on these three types because their answer types fall into distinct and easily identifiable categories:
• "Where" corresponds to an answer of type "Location"
• "Who" corresponds to an answer of type "Person"
• "When" corresponds to an answer of type "Date" or "Time"
Table 1 illustrates examples of quadruples for the three described categories. From these categories, we extract a variable number of QA pairs in order to form the prototype set. To generate positive quadruples, we select a prototype from one of the above-mentioned subsets and associate it with a question from the same subset and the correct answer among its candidates. This procedure provides a large number of analogical quadruples. To generate negative training samples, on the other hand, we use the following approach: within the same subset, we associate a prototype with a question and a randomly selected wrong answer among its candidates. This is done in order to purposely break the analogical relation between a prototype QA pair and the QA pair at hand. This approach generates a set of hard examples that helps improve the training. Figure 3 illustrates the procedure.
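The construction of positive and negative quadruples described above can be sketched as follows. This is a minimal sketch under stated assumptions: the data layout, the function name `build_quadruples`, the choice of one negative per prototype-question pair, and the shortened toy sentences are all hypothetical and only serve to make the procedure concrete.

```python
import random

def build_quadruples(prototypes, questions, seed=0):
    """Pair each prototype with every question of the same wh-type.

    prototypes: list of (wh_type, question, answer)
    questions:  list of (wh_type, question, correct_answers, wrong_answers)
    Returns (q_p, a_p, q_i, a_ij, label) tuples; label 1 = positive, 0 = negative.
    """
    rng = random.Random(seed)
    quadruples = []
    for p_type, q_p, a_p in prototypes:
        for q_type, q_i, correct, wrong in questions:
            if q_type != p_type or q_i == q_p:
                continue  # only combine pairs from the same subset
            for answer in correct:
                quadruples.append((q_p, a_p, q_i, answer, 1))
            if wrong:
                # Assumption: one randomly drawn wrong answer per prototype-question pair.
                quadruples.append((q_p, a_p, q_i, rng.choice(wrong), 0))
    return quadruples

# Shortened, hypothetical sentences for illustration.
prototypes = [("where", "Where was Abraham Lincoln born?",
               "Abraham Lincoln was born in Hardin County, Kentucky.")]
questions = [
    ("where", "Where was Franz Kafka born?",
     ["Franz Kafka was born in Prague."],
     ["Kafka wrote The Trial.", "Kafka died in 1924."]),
    ("who", "Who discovered prions?",
     ["Prusiner discovered prions."],
     ["Prions are proteins."]),
]
quadruples = build_quadruples(prototypes, questions)
```

With this toy input, only the "where" question is combined with the "where" prototype, which yields one positive and one negative quadruple.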
To summarise, ranking by analogical dissimilarity is performed in three steps:
1. Given a prototype QA pair, a question and N candidate answers, N quadruples are generated.
2. The analogical dissimilarity score is computed for each quadruple.
3. The N candidates are consequently ranked by the analogical dissimilarity score.
The next section describes the architectural choices of the model in detail. We recall that our focus is on learning an embedding function that pushes analogous QA pairs with similar mappings to be mutually close by enforcing a geometrical constraint in the vector space. This constraint states that the vector shift that maps the entities of the first pair should be similar to the vector shift of the second pair, according to the degree of analogical dissimilarity that holds between the two pairs (a, b), (c, d).

Quadruple Siamese Network
To tackle this problem, we propose a Siamese network architecture as shown in Figure 5. In the next paragraphs, we describe the notation used and the details of each component of the model.

Notation.
Let Q and A be the spaces of all questions and candidate answers, respectively. We denote a quadruple of sentences as (a, b, c, d), where a, c ∈ Q and b, d ∈ A. Quadruples are assigned a label y = 1 if the analogical proportion holds, and y = 0 otherwise. θ denotes the parameters to be learnt that map the relation from a to b, and c to d respectively. Let $x_{\cdot}$ refer to the latent representation of one sentence in the quadruple.
Architecture. The Siamese network takes as input four sentences. The sub-networks in the Siamese model share their parameters and learn a vector representation for every sentence received as input. A sentence is a sequence of words $S_i = w_{i1}, \ldots, w_{ik}$, where $w_{ij}$ represents the j-th word in the sentence $S_i$, with $1 \le i \le n$ and $1 \le j \le k$. Words are mapped into word embeddings $x_{ij} = E w_{ij}$, where $E \in \mathbb{R}^{d \times |V|}$ is a matrix of vectors of size d, and V is the vocabulary. Out-of-vocabulary words are initialized by a random vector. We use bidirectional gated recurrent units (GRUs) (Cho et al., 2014) over the input sentence. For a sentence of T words, the network encodes T hidden states $h_1, \ldots, h_T$ such that:
$$\overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}(x_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}(x_t, \overleftarrow{h}_{t+1}), \qquad h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$
In order to obtain a fixed-size vector, we select the maximum value over each dimension of $h_t$ using max pooling. After this step, we obtain four vectors of dimension d, one for each input sentence of the quadruple.
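The pooling step can be illustrated in isolation. The sketch below assumes the hidden states are already available as plain lists of floats (the recurrent encoder itself is not reproduced here); the function name and the toy values are hypothetical.

```python
def max_pool(hidden_states):
    """Element-wise maximum over time steps: T vectors of size d -> one vector of size d."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    return [max(dimension) for dimension in zip(*hidden_states)]

# Hypothetical hidden states for a sentence of T = 3 words with d = 4.
h = [
    [0.1, -0.3, 0.5, 0.0],
    [0.4, 0.2, -0.1, 0.0],
    [-0.2, 0.9, 0.3, -0.7],
]
sentence_vector = max_pool(h)  # [0.4, 0.9, 0.5, 0.0]
```

Because the maximum is taken per dimension, the resulting vector has the same size regardless of the sentence length T.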
Training Strategy. The next step is to get the semantic relation between the pairs of input sentences. Given a pair of vectors $(x_i, x_j)$, the arithmetic proportion expects the difference of the vectors to encode the relational similarity between the entities that constitute the pair. We let the network predict four d-dimensional embedding vectors, which we merge through a pairwise subtraction. Let $f_W(\cdot)$ be the projection of an input sentence in the embedding space computed by the network function $f_W$. Furthermore, let $f_W(a) - f_W(b)$ and $f_W(c) - f_W(d)$ be the pairwise differences between the embedding vectors. In order to separate instances of analogical proportion, similar pairs need to be mapped mutually close to each other, whereas dissimilar instances should be pushed apart.
For the energy of the model, we use the cosine similarity between the vector shifts of each pair of the quadruple:
$$E_W(a, b, c, d) = \cos\big(f_W(a) - f_W(b),\; f_W(c) - f_W(d)\big)$$ (6)
We argue that this is an appropriate energy function since the goal is for the two vector shifts to be parallel, which maximises the analogical parallelogram likelihood. We propose to use the contrastive loss (Hadsell et al., 2006) to perform the metric learning. This loss function has two terms, one for the similar and another for the dissimilar samples. The similar instances are denoted by a label y = 1 whereas the dissimilar pairs are represented by y = 0. Thus, the loss function has the following form: [...] Each term is expressed by: [...] This loss function measures how well the model learns to encode similar transformations such that analogous pairs are mutually close and form an analogical parallelogram in the embedding space, while pushing dissimilar transformations apart. Given a question and a pool of candidate answers, the goal is to rank the correct answer in the first position, based on how well each sentence completes the analogical equation according to (6).
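To make the training signal more concrete, the following sketch computes the cosine energy between the two vector shifts and one possible contrastive objective on top of it. The energy follows the description above; the two loss terms and the margin value are assumptions chosen for illustration (a common cosine-based variant of the contrastive loss) and should not be read as the exact terms used in this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norm if norm > 0 else 0.0

def energy(xa, xb, xc, xd):
    """Cosine similarity between the vector shifts (xa - xb) and (xc - xd)."""
    shift_ab = [p - q for p, q in zip(xa, xb)]
    shift_cd = [p - q for p, q in zip(xc, xd)]
    return cosine(shift_ab, shift_cd)

def contrastive_loss(e, y, margin=0.5):
    """Assumed loss terms (not taken from the paper): positives (y = 1) are
    pulled towards e = 1, negatives (y = 0) are penalised only while their
    energy exceeds the margin."""
    positive_term = (1.0 - e) ** 2
    negative_term = max(0.0, e - margin) ** 2
    return y * positive_term + (1 - y) * negative_term

# Toy embeddings (hypothetical values).
xa, xb, xc = [0.0, 0.0], [1.0, 2.0], [3.0, 0.0]
e_good = energy(xa, xb, xc, [4.0, 2.0])   # parallel shifts
e_bad = energy(xa, xb, xc, [2.0, -2.0])   # opposite shifts
```

Under these assumptions, a positive quadruple with parallel shifts and a negative quadruple with opposite shifts both incur (almost) no loss, whereas a negative quadruple whose shifts are parallel is penalised.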
This architecture is summarized in Figure 5. We learn all the parameters of the model through a gradient-based method that minimizes the L2-regularized loss. Further details about the implementation are given in Section 4.1.

Experiment
In this section, we present an evaluation of our approach in two experiments: first, in Section 4.2, we confirm that the learnt analogical embeddings do indeed improve the analogical parallelogram structure illustrated in Figure 2 over commonly used word- and sentence-based embeddings. In Section 4.3, we then show that this also results in improved performance for question answering. Before that, we start with a brief description of our experimental setup.

Experimental Setup
We begin the assessment of our model with a direct evaluation, which is ranking candidate answers in the same setting as during the training of the embedding. We generate quadruples with the same prototypes used for training, and we look for the correct answers by iteratively solving the analogical equations. We compare our model to commonly used sentence representation methods in order to evaluate the proposed approach against general-purpose sentence embedding and word embedding methods. In the next paragraphs, we present the experimental setup and the results obtained.
Datasets. We validate the proposed method on two datasets: WikiQA (Yang et al., 2015), an open-domain QA dataset with answers collected from Wikipedia, and TrecQA, which was created from the TREC Question Answering Track. Both resources are well established for benchmarking answer selection. We split each dataset into three subsets, which contain only "who", "where" and "when" questions.
Evaluation metrics. We assess the performance of our method by measuring the Mean Average Precision (MAP) and the Mean Reciprocal Rank (MRR) for the generated quadruples in the test set. Given a set of questions Q, MRR is computed as follows:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$ (10)
where $\mathrm{rank}_i$ represents the rank position of the first correct candidate answer for the i-th question. In other words, MRR is the average of the reciprocal ranks of results for the questions in set Q. MAP is calculated as follows:
$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(\pi_{jk})$$ (11)
where $q_j \in Q$ is a question whose candidate answers are $a_1, \ldots, a_{m_j}$ and $\pi_{jk}$ is the rank associated with those candidate answers. While MRR only considers the rank of the first correct answer, MAP takes the ranks of all correct answers into account. Generally, MRR is higher than MAP on the same set of ranked objects.
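Both metrics can be computed from the relevance labels of the ranked candidates. The sketch below is a plain-Python illustration; the function names and the toy rankings are hypothetical, and returning 0 for a question without any correct candidate is an assumption (such questions are often filtered out instead).

```python
def reciprocal_rank(labels):
    """labels: relevance (1 or 0) of the candidates in ranked order, best first."""
    for position, label in enumerate(labels, start=1):
        if label:
            return 1.0 / position
    return 0.0  # assumption: no correct candidate contributes 0

def average_precision(labels):
    """Mean of the precision values taken at the rank of every correct answer."""
    hits, precisions = 0, []
    for position, label in enumerate(labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_reciprocal_rank(rankings):
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two toy questions with their ranked candidate labels.
rankings = [[0, 1, 0, 1], [1, 0, 1]]
mrr_value = mean_reciprocal_rank(rankings)    # (1/2 + 1) / 2
map_value = mean_average_precision(rankings)  # (1/2 + 5/6) / 2
```

On this toy input, MRR is 0.75 and MAP is about 0.67, which matches the remark that MRR tends to be the higher of the two.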
Implementation details. We initialize the embedding layer with FastText vectors. These weights are not updated during training. The dimension of the output of the sentence encoder is 300. To alleviate overfitting, we apply a dropout rate of 0.5. The model is trained with the Adam optimizer with a learning rate of 0.001 and a weight decay rate of 0.01.

Quality of Analogical Embedding
Baselines. To support our claim that the representations learnt by our model encode the semantics of question-answer pairs better than pre-trained sentence representation models, we choose four baselines commonly used to encode sentences:
1. Word2Vec and GloVe (Mikolov et al., 2013; Pennington et al., 2014): We use the simple approach of averaging the word vectors for all words in a sentence. This method has the drawback of ignoring the order of the words in the sentence, but has been shown to perform reasonably well.
3. Sent2Vec (Pagliardini et al., 2017): A method to learn sentence embeddings such that the average of all words and n-grams can serve as sentence vector.
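The word-vector averaging baseline from the list above can be sketched in a few lines. The toy vocabulary, the 2-dimensional vectors and the decision to skip out-of-vocabulary tokens are assumptions for illustration only.

```python
def average_embedding(tokens, word_vectors, dim):
    """Mean of the vectors of all in-vocabulary tokens; zeros if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical 2-d word vectors; "was" is out of vocabulary and is skipped.
toy_vectors = {"where": [1.0, 0.0], "born": [0.0, 1.0], "kafka": [1.0, 1.0]}
sentence_vector = average_embedding("where was kafka born".split(), toy_vectors, dim=2)
```

Since the average is invariant to permutations of the tokens, any reordering of the sentence yields the same vector, which is the drawback mentioned above.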
For each document in the test set, we generate analogical quadruples as explained in Section 3.2. Given a question $q_i$ in the test set with k candidate answers, we obtain p × k possible quadruples, where p is the cardinality of the prototype set. The network encodes each sentence in the quadruple and computes the cosine similarity (6) between the obtained vector shifts.
Not every prototype QA pair fits the QA pair at hand, so we compute p × k scores, choose only the prototype that leads to the highest analogical score for each document, and discard the other comparisons. One might think of averaging the scores and sorting the candidate answers accordingly, but this strategy introduces noise in the analogical inference procedure.
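The selection of the best-matching prototype can be sketched as follows. The sketch assumes that sentence embeddings are already available as lists of floats, and it reads "document" as a single question-candidate pair, so that each candidate keeps the score of its best prototype; both points, as well as the function names and toy vectors, are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norm if norm > 0 else 0.0

def shift(x, y):
    """Vector shift x - y between a question and an answer embedding."""
    return [p - q for p, q in zip(x, y)]

def rank_candidates(prototypes, question, candidates):
    """prototypes: list of (q_p, a_p) embedding pairs.

    Returns the candidate indices sorted best first, together with the scores.
    Each candidate keeps only the score of its best-matching prototype.
    """
    scores = []
    for candidate in candidates:
        target = shift(question, candidate)
        scores.append(max(cosine(shift(q_p, a_p), target) for q_p, a_p in prototypes))
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy 2-d embeddings (hypothetical values): p = 2 prototypes, k = 3 candidates.
prototypes = [([0.0, 0.0], [1.0, 0.0]), ([0.0, 0.0], [0.0, 1.0])]
question = [2.0, 2.0]
candidates = [[3.0, 2.1], [1.0, 3.0], [2.0, 3.0]]
order, scores = rank_candidates(prototypes, question, candidates)
```

Replacing `max` by a mean over the prototypes would give the averaging strategy that the text advises against.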
Results. We applied the described procedure to vectors obtained from our network as well as from the baseline representation methods. The results are shown in Table 3.
In order to better perceive the analogical properties of the baselines and the proposed approach, we also include a random baseline in the comparison. We observe that averaging word embeddings such as GloVe or Word2Vec performs better than the dedicated sentence representations on the WikiQA dataset. This might be due to the fact that word embeddings have been shown to encode some analogical properties. On the other hand, sentence embeddings have been trained with a particular learning objective; for example, InferSent has been trained for the task of claim entailment with a classification objective, and might not be suitable for representing relations between pairs of sentences. Nevertheless, ranking by the cosine similarity of the difference vectors does not lead to acceptable performance. This confirms our hypothesis that pre-trained sentence representations do not preserve analogical properties.
Similarly, we measure the influence of the number of prototypes on the performance. We vary the number of prototype pairs p ∈ {10, 20, 30, 40, 50} and measure the MAP and the MRR for both datasets. The results are shown in Figures 6 and 7. We observe that the best performance is obtained for p = 30, and that both MAP and MRR decrease beyond that. The reason might be that a high number of prototypes brings more comparisons and increases the probability of spurious interactions between QA prototypes and QA pairs.

Question-Answering Performance
A natural benchmark model for our work is the approach of Tam et al. (2017), which is similar to ours in that it proposes to replace the wh-word in questions with appropriate named entities. This approach leverages typological information from a named entity recognizer and the word vector space.
It replaces the wh-word with the named entity that has the highest cosine similarity with all the candidate answers for a given question. This substitution is performed for "where", "when" and "who" types of questions. Finally, the transformed QA pair is fed to a network suited for the task of answer selection. This study demonstrated that this simple pre-processing step improves the state-of-the-art results for the task of answer selection.
As in our experimental setup, they divide the dataset into three categories, namely "when", "who" and "where", and evaluate their method on the split dataset and the full dataset. We consider their work as our baseline in order to evaluate the capabilities of the analogy-based embeddings. Moreover, we compare our approach to a setup which does not exploit analogical properties, that is, a Siamese network that takes as input a question and a candidate answer, generates the respective representations, and computes the cosine similarity of the obtained sentence embeddings. The described baseline corresponds to the model proposed by (Tan et al., 2015).
The results are shown in Tables 4 and 5. We observe that simply computing the cosine similarity between the difference vector of the prototype QA pair and that of the QA pair at hand, using the embeddings learnt by the proposed approach, leads to significant improvements for some particular types of questions. The bold numbers in Tables 4 and 5 indicate the best results for each dataset. We can see that our method improves the MRR of at least two of the question types by a relevant margin. The last row of the same tables confirms that enforcing analogical properties in the embedding space generally improves the overall MRR for these three subsets.

Conclusion
This work introduced a new approach to learn sentence representations for answer selection, which preserve structural similarities in the form of analogies. Analogies can be seen as a way of injecting reasoning ability, and we express this by requiring common dissimilarities implied by analogies to be reflected in the learned feature space. We showed that explicitly constraining structural analogies in the learnt embeddings leads to better results than distance-only embeddings. We believe that it is worthwhile to further explore the potential of analogical reasoning beyond its common use in word embeddings, as it is a natural means of learning and generalizing about relations between entities. The focus of this work has been on answer selection, but analogical reasoning can be useful in many other machine learning tasks such as machine translation or visual question answering. As a next step, we plan to explore other forms of analogies that involve modelling across domains.