A Gated Self-attention Memory Network for Answer Selection

Answer selection is an important research problem, with applications in many areas. Previous deep learning based approaches for the task mainly adopt the Compare-Aggregate architecture that performs word-level comparison followed by aggregation. In this work, we take a departure from the popular Compare-Aggregate architecture, and instead, propose a new gated self-attention memory network for the task. Combined with a simple transfer learning technique from a large-scale online corpus, our model outperforms previous methods by a large margin, achieving new state-of-the-art results on two standard answer selection datasets: TrecQA and WikiQA.


Introduction and Related Work
Answer selection is an important task, with applications in many areas. Given a question and a set of candidate answers, the task is to identify the most relevant candidate. Previous work on answer selection typically relies on feature engineering, linguistic tools, or external resources (Wang et al., 2007; Wang and Manning, 2010; Heilman and Smith, 2010; Yih et al., 2013; Yao et al., 2013). Recently, with the renaissance of neural network models, many deep learning based methods have been proposed to address the task (Tay et al., 2017b; Shen et al., 2017; Bian et al., 2017; Tymoshenko and Moschitti, 2018; Tay et al., 2018; Tayyar Madabushi et al., 2018; Yoon et al., 2019), and they outperform traditional techniques. A common trait of many of these deep learning methods is the use of the Compare-Aggregate architecture (Wang and Jiang, 2017). In this architecture, contextualized vector representations of small units such as the words of the question and the candidate are first compared and aligned. These comparison results are then aggregated to calculate a score indicating the relevance between the question and the candidate. On standard answer selection datasets such as TrecQA (Wang et al., 2007) or WikiQA (Yang et al., 2015), Compare-Aggregate approaches achieve very competitive performance. However, they still have some limitations. For example, the first few layers of most previous Compare-Aggregate models encode the question and the candidate into sequences of contextualized vector representations separately (e.g., Shen et al., 2017; Bian et al., 2017). These sequences are independent and completely ignore the information from the other sequence.

* Equal contributions. The work was conducted while the first author interned at Adobe Research.
In this work, we take a departure from the popular Compare-Aggregate architecture and instead propose a mix of two very successful architectures from machine comprehension and sequence modeling: the memory network (Sukhbaatar et al., 2015) and the self-attention architecture (Vaswani et al., 2017). In the context of answer selection, the self-attention architecture allows us to learn the contextual representation of elements in the sequence with respect to both the question and the answer, while the multi-hop reasoning of the memory network allows us to refine the decision over multiple steps. To this end, we propose a new memory-based, gated self-attention architecture for the task of answer selection. Combined with a simple transfer learning technique from a large-scale online corpus, our model achieves new state-of-the-art results on the TrecQA and WikiQA datasets.
In the following sections, we first describe our gated self-attention memory network for answer selection in Section 2. We then describe our transfer learning approach in detail in Section 3. After that, we present the conducted experiments and their results in Section 4. Finally, we conclude the work in Section 5.

The gated self-attention mechanism
The gated attention mechanism (Dhingra et al., 2017; Tran et al., 2017) extends the popular scalar-based attention mechanism by calculating a real-valued gate vector to control the flow of information, instead of a scalar value. Let us denote the sequence of input vectors as X = [x_1..x_n]. Given context information c, the traditional attention mechanism computes the association score α_i as a normalized dot product between the two vectors c and x_i:

α_i = exp(cᵀx_i) / Σ_{j=1..n} exp(cᵀx_j)    (1)

where i ∈ [1..n].
For the gated attention mechanism, the association between the two vectors c and x_i is instead represented by a gate vector g_i:

g_i = σ(f(c, x_i))    (2)

where σ denotes the element-wise sigmoid function. The function f is parameterized and is thus more flexible in modelling the interaction between c and x_i.

In this work, we propose a new type of self-attention based on the gated attention mechanism described above, which we refer to as the gated self-attention mechanism (GSAM). We want to condition the gate vector not only on a context vector and a single input vector, but also on the entire sequence of inputs. Therefore, we design the function f to depend on all the inputs in the sequence and on the context vector. To calculate the gate for input x_i, each of the inputs in the sequence and the context vector first casts an individual gate "vote"; the votes are then aggregated to calculate the gate g_i for x_i:

v_j = Wx_j + b,  v_c = Wc + b
s_i^j = v_jᵀx_i
α_i^j = exp(s_i^j) / Σ_{k ∈ {1..n, c}} exp(s_i^k)
g_i = σ( Σ_{j ∈ {1..n, c}} α_i^j v_j )    (3)

where W and b are learnable parameters shared among the functions f_1..f_n. The vectors v are linear-transformed inputs used to calculate the self-attention scores; s_i^j is the unnormalized attention score of input x_j put on x_i, and α_i^j is the normalized score. We use the affine-transformed inputs v together with the raw inputs x to calculate the self-attention, instead of just x, in order to break the attention symmetry phenomenon.
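The gate computation above can be sketched in NumPy. This is a minimal illustration under our own assumptions about shapes and parameterization (a single affine transform shared across positions, and the context treated as an extra position); it is not the authors' released implementation.

```python
import numpy as np

def gsam_gates(X, c, W, b):
    """Gated self-attention gates for inputs X (n x d) and context c (d,).

    Every position j (and the context) casts a gate "vote" for position i,
    weighted by an attention score s_i^j = v_j . x_i computed from the
    affine-transformed input v_j = W x_j + b (breaking attention symmetry).
    """
    Z = np.vstack([X, c[None, :]])          # (n+1, d): inputs plus context
    V = Z @ W.T + b                         # affine voters v_j
    S = V @ Z.T                             # S[j, i] = v_j . z_i
    S = S - S.max(axis=0, keepdims=True)    # stabilize softmax over voters j
    A = np.exp(S) / np.exp(S).sum(axis=0, keepdims=True)
    G = 1.0 / (1.0 + np.exp(-(A.T @ V)))    # g_i = sigmoid(sum_j alpha_i^j v_j)
    return G[:-1], G[-1]                    # gates for the n inputs, and for c
```

Each returned gate lies in (0, 1) element-wise and has the same dimensionality as its input, so it can directly modulate the corresponding vector.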

Combining with the memory network
In most previous memory network architectures, interactions between memory cells are relatively limited: at each hop, a single control vector is used to interpret each memory cell independently. To overcome this limitation, we combine the GSAM described in Section 2.1 with the memory network architecture to create a new network called the Gated Self-Attention Memory Network (GSAMN). Figure 1 shows the simplified computation flow of GSAMN. In each reasoning hop, instead of using only the context vector c to interpret the inputs, we use GSAM. Let c^k be the controlling context and x_1^k..x_n^k be the memory values at the k-th reasoning hop. Each memory cell update from the k-th hop to the next is calculated as the gated self-attention update:

x_i^{k+1} = g_i^k ⊙ x_i^k    (4)

where g_i^k is the gate computed by GSAM with context c^k over the memories, and ⊙ denotes element-wise multiplication.
The controller's update is a combination of the gated self-attention above and the memory network's traditional aggregate update. Since the memory values have already been attended to by the gating mechanism, we only need to average them (rather than take a weighted average):

c^{k+1} = g_c^k ⊙ c^k + (1/n) Σ_{i=1..n} x_i^{k+1}

where g_c^k is the gate computed for the controller.
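Putting the two updates together, one reasoning hop of GSAMN can be sketched as follows. This is our hedged reading of the memory and controller updates described above, with a single shared affine transform as an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gsamn_hop(X, c, W, b):
    """One GSAMN reasoning hop (our reading of the update rules).

    Memory:     x_i <- g_i * x_i          (gated self-attention update)
    Controller: c   <- g_c * c + mean(x)  (gate plus plain average of memories)
    """
    Z = np.vstack([X, c[None, :]])            # treat the controller as a position
    V = Z @ W.T + b                           # shared affine voters
    S = V @ Z.T                               # attention of voter j on position i
    S = S - S.max(axis=0, keepdims=True)      # stable softmax over voters
    A = np.exp(S) / np.exp(S).sum(axis=0, keepdims=True)
    G = sigmoid(A.T @ V)                      # gates for all n+1 positions
    X_next = G[:-1] * X                       # element-wise gated memories
    c_next = G[-1] * c + X_next.mean(axis=0)  # aggregate update, unweighted mean
    return X_next, c_next
```

Because every gate entry is strictly between 0 and 1, each hop shrinks the magnitude of the memory values while mixing their average into the controller.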

GSAMN for answer selection
In the context of answer selection, we concatenate question Q and candidate answer A into a single sequence and treat the task as a binary classification problem. Given the GSAMN architecture above, we use the final controller state c^T (after T reasoning hops) as the representation of the sequence. The matching probability P(A | Q) is then calculated as:

P(A | Q) = σ(W_c c^T + b_c)

where W_c and b_c are learnable parameters. We can initialize the memory values x_1^0..x_n^0 using any representation model such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), or BERT (Devlin et al., 2018). The control vector c^0 is a randomly initialized learnable vector.
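As a sketch, the full scoring procedure strings together T hops and a sigmoid head on the final controller state. The hop function and head parameters here are placeholders, not the paper's exact implementation:

```python
import numpy as np

def score_pair(X0, c0, hop_fn, T, w_c, b_c):
    """Score a concatenated [question; answer] sequence.

    X0: initial memory (one vector per token, e.g. from BERT embeddings),
    c0: learnable initial controller, hop_fn: one GSAMN reasoning hop,
    T: number of hops. Returns P(A | Q) from a sigmoid head.
    """
    X, c = X0, c0
    for _ in range(T):
        X, c = hop_fn(X, c)                   # refine memory and controller
    return float(1.0 / (1.0 + np.exp(-(w_c @ c + b_c))))
```

With an identity hop and a zero head, the score degenerates to 0.5, which is a quick sanity check of the plumbing.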

Transfer Learning
Previous studies on answer selection have focused mostly on small-scale datasets. On the other hand, many community question answering (CQA) platforms such as Yahoo Answers and Stack Exchange have become an essential source of information for many people. The amount of data (i.e., questions and answers) in these CQA platforms is huge and encompasses many domains and topics. This provides a great opportunity to apply transfer learning techniques to improve answer selection systems trained on limited datasets.
We crawled question-answer pairs related to various topics from Stack Exchange (https://stackexchange.com/). After that, we removed every pair that contains text written in a language other than English. Furthermore, to ensure that the answer in each collected pair is highly relevant to the question, we removed pairs whose answers have fewer than two up-votes from community users. Finally, because training answer selection models also requires negative examples, for each question we sampled several real answers not related to the question to build negative pairs. In the end, our dataset has 628,706 positive pairs and 9,874,758 negative pairs in total. We refer to this newly collected dataset as StackExchangeQA. Table 1 shows some examples of positive question-answer pairs from the dataset; the dataset has question-answer pairs from many different domains and topics. The code for constructing the StackExchangeQA dataset is available online.
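The filtering and negative sampling steps can be sketched as follows. The up-vote threshold comes from the text; the number of sampled negatives per question is our placeholder, since the paper only says "several":

```python
import random

def build_pairs(records, n_neg=3, min_upvotes=2, seed=13):
    """Build labeled pairs from crawled (question, answer, upvotes) records.

    Keeps pairs whose answer has at least `min_upvotes`, labels them 1,
    and for each question samples up to `n_neg` answers belonging to
    other questions as negatives (label 0).
    """
    rng = random.Random(seed)
    kept = [(q, a) for q, a, up in records if up >= min_upvotes]
    pairs = [(q, a, 1) for q, a in kept]
    for q, _ in kept:
        others = [x for qq, x in kept if qq != q]   # answers to other questions
        for neg in rng.sample(others, min(n_neg, len(others))):
            pairs.append((q, neg, 0))
    return pairs
```

Sampling negatives from answers to other questions is cheap but noisy; the large resulting negative-to-positive ratio matches the dataset statistics reported above.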
In this work, we employ a basic transfer learning technique: we first pre-train our answer selection model on the StackExchangeQA dataset, and then fine-tune the same model on a target dataset of interest such as TrecQA or WikiQA. Despite the simplicity of the technique, the performance of our model improves substantially compared to not using transfer learning. Unlike previous works which use manually annotated source datasets (Min et al., 2017; Chung et al., 2018), our source dataset required minimal effort to obtain and preprocess. The choice of crawling question-answer pairs from the Stack Exchange website was arbitrary; we could also have crawled data from websites such as Yahoo Answers instead.
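The procedure is simply two sequential training phases on the same parameters. A schematic sketch, where `train_fn` is a stand-in for a full training loop rather than any API from the paper:

```python
def transfer_train(model, source_data, target_data, train_fn):
    """Basic transfer learning: pre-train on the large source corpus
    (StackExchangeQA), then fine-tune the same model on the small
    target dataset (e.g. TrecQA or WikiQA)."""
    train_fn(model, source_data)   # step 1: pre-train on source
    train_fn(model, target_data)   # step 2: fine-tune on target
    return model
```

The key point is that no new parameters are introduced between the two phases; the fine-tuning step simply continues optimizing the pre-trained weights.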

Experiments and Results
To evaluate the effectiveness of our proposed answer selection model, we use two datasets: TrecQA and WikiQA. The TrecQA dataset (Wang et al., 2007) was created from the TREC Question Answering tracks. There are two versions of TrecQA: raw and clean. Both versions have the same training set, but their development and test sets differ. In this study, we use the clean version of the dataset, which removed from the development and test sets all questions with no answers or with only positive/negative answers. The clean version has 1,229/65/68 questions and 53,417/1,117/1,442 question-answer pairs in the train/dev/test split. The WikiQA dataset (Yang et al., 2015) was constructed from real queries of Bing and Wikipedia. Following the literature (Yang et al., 2015; Bian et al., 2017; Shen et al., 2017), we removed all questions with no correct answers before training and evaluating answer selection models. We report performance as the mean average precision (MAP) and mean reciprocal rank (MRR).

Table 1: Example positive question-answer pairs from the StackExchangeQA dataset.
Topic: Cooking. Question: "How do I prevent tomatoes from falling in a green salad?" Answer: "I work around this by serving tomatoes on the top of the individual salads after they've been portioned out. I'm not sure of a way to keep them incorporated."
Topic: Philosophy. Question: "What did Socrates teach which lead to his conviction that he spoiled youth and taught other Gods?" Answer: "I think in general one of the problems Socrates' contemporaries may have had with him was not so much what he taught but how he taught. Perhaps Socrates' method of philosophy was characterised more by testing propositions through questioning, than any strict concern with formulating a set of propositions on any one subject."

In all experiments, we use the base version of BERT (Devlin et al., 2018) to initialize the memory of our proposed architecture, and we fine-tune the BERT embeddings during training. We set the number of reasoning hops to 2. We use the Adam optimizer with a learning rate of 5e-5, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10 percent of the total number of training steps, and linear decay of the learning rate. We tuned hyper-parameters on the development sets. It is worth noting that we experimented with various values for the number of reasoning hops; we found that using 2 hops gives the best performance on the tested datasets, while using a larger number of hops decreases performance slightly. We attribute the diminishing returns from increasing the number of hops to the limited size of the TrecQA and WikiQA datasets. Many previous works on memory networks also use a small number of memory hops (Sukhbaatar et al., 2015; Miller et al., 2016; Zhang et al., 2018).

Table 2 summarizes the performance of our proposed models and compares them to the baselines on the TrecQA and WikiQA datasets. The full model [BERT + GSAMN + Transfer Learning] outperforms the previous state-of-the-art methods by a large margin. Note that by simply fine-tuning the pre-trained BERT embeddings, one can easily achieve very competitive performance on both datasets.
This is expected, as BERT has been pre-trained on a massive amount of unlabeled data. However, our proposed techniques still add a significant amount of performance: on the TrecQA dataset, the gain from using all of our proposed techniques over fine-tuned BERT is larger than the gain of fine-tuned BERT over previous systems.
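The optimizer schedule used in the experimental setup above (Adam with warmup over the first 10% of steps, then linear decay) can be sketched as a step-dependent learning rate. This is our interpretation of the standard BERT fine-tuning schedule; the 10% warmup fraction and 5e-5 base rate come from the text:

```python
def lr_at(step, total_steps, base_lr=5e-5, warmup_frac=0.1):
    """Linear warmup over the first `warmup_frac` of steps,
    then linear decay of the learning rate to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps        # ramp up from 0
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```

The rate peaks at `base_lr` exactly at the end of warmup and reaches zero at the final step.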

Ablation Analysis
We aim to analyze the relative effectiveness of the different components of our full model. Starting from the original BERT baseline, we add one component at a time and evaluate the performance of the partial models on the datasets. From Table 2, we can see that both variants [BERT + GSAMN] and [BERT + Transfer Learning] perform better than the original BERT baseline. However, both partial variants still perform worse than the full model. This shows that although each of our proposed components is effective by itself, we need to combine them to achieve the best performance.

GSAMN versus Transformer
Arriving at the current design of GSAMN was an iterative process. We aim to analyze whether the improvement in performance comes from the inductive bias that we introduced into the architecture. We experimented with adding more Transformer layers on top of BERT, but the performance did not improve; for example, using 6 extra Transformer layers only achieves a MAP score of 0.885 on the TrecQA dataset. This is reasonable because BERT by itself already contains 12 Transformer layers. Without a new kind of layer such as our proposed GSAMN architecture, stacking more Transformer layers is not helpful, especially when the tested datasets are not large.

GSAMN versus Compare-Aggregate
Finally, we compare our full model with the Compare-Aggregate framework. Most previous Compare-Aggregate architectures use traditional word embeddings such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), whereas our full model uses BERT, an arguably more powerful language representation model. For a fairer comparison, we implemented a Compare-Aggregate variant that uses dynamic-clip attention (Bian et al., 2017) and represents the input words with ELMo (Peters et al., 2018). We use ELMo instead of BERT because BERT operates at the subword level, while one of the intuitions behind the Compare-Aggregate variant is to compare word-level representations. In addition, we tested the variant [BERT + Compare-Aggregate] but found it to be worse than [ELMo + Compare-Aggregate]. The results in Table 2 show that our model significantly outperforms [ELMo + Compare-Aggregate] as well.

Conclusions
In this paper, we propose a new gated self-attention memory network architecture for answer selection. Combined with a simple transfer learning technique from a large-scale CQA corpus, the model achieves state-of-the-art performance on two well-studied answer selection datasets: TrecQA and WikiQA. In the future, we plan to investigate more transfer learning techniques for utilizing the large volume of existing CQA data. In addition, we plan to apply our self-attention memory network to other sentence matching tasks such as natural language inference, paraphrase identification, and measuring semantic relatedness.