Inner Attention based Recurrent Neural Networks for Answer Selection

Attention based recurrent neural networks have shown advantages in representing natural language sentences (Hermann et al., 2015; Rocktäschel et al., 2015; Tan et al., 2015). In these models, external attention information is added to the hidden representations of a recurrent neural network (RNN) to obtain an attentive sentence representation. Despite the improvement over non-attentive models, the attention mechanism under RNN is not well studied. In this work, we analyze the deficiency of traditional attention based RNN models quantitatively and qualitatively. We then present three new RNN models that add attention information before the RNN hidden representation is computed, which shows advantages in representing sentences and achieves new state-of-the-art results in the answer selection task.


Introduction
Answer selection (AS) is a crucial subtask of the open domain question answering (QA) problem. Given a question, the goal is to choose the answer from a set of pre-selected sentences (Heilman and Smith, 2010; Yao et al., 2013). Traditional AS models are based on lexical features such as parse tree edit distance. Neural network based models have been proposed to represent the meaning of a sentence in a vector space and then compare the question and answer candidates in this hidden space (Wang and Nyberg, 2015; Feng et al., 2015), and they have shown great success in AS. However, these models represent the question and the sentence separately, which may ignore question-specific information when representing the answer. For example, consider the candidate answer: Michael Jordan abruptly retired from Chicago Bulls before the beginning of the 1993-94 NBA season to pursue a career in baseball. For the question: When did Michael Jordan retire from NBA? we should focus on the beginning of the 1993-94 in the sentence; however, when asked: Which sport does Michael Jordan participate in after his retirement from NBA? we should pay more attention to pursue a career in baseball.
In recent years, attention based models have been proposed for this purpose and have shown great success in many NLP tasks such as machine translation (Bahdanau et al., 2014), question answering (Sukhbaatar et al., 2015) and recognizing textual entailment (Rocktäschel et al., 2015). When building the representation of a sentence, attention information is added to the hidden state. For example, in attention based recurrent neural network models (Bahdanau et al., 2014), the hidden representation at each time step is weighted by attention. Inspired by this mechanism, attention based RNN answer selection models have been proposed (Tan et al., 2015) in which the attention used to compute the answer representation comes from the question representation.
However, in the RNN architecture a word is added at each time step and the hidden state is updated recurrently, so the hidden states near the end of the sentence are expected to capture more information. Consequently, after attention information is added to the sequence of hidden representations, the near-the-end hidden states receive more attention because of their comparatively abundant accumulated semantics, which may bias the attention weights towards the later words. In this work, we analyze this attention bias problem qualitatively and quantitatively, and then propose three new models to solve it. Unlike previous attention based RNN models, in which attention information is added after the RNN computation, we add the attention before computing the sentence representation. Concretely, the first model uses the question attention to adjust the word representations (i.e. word embeddings) of the answer directly, and then uses an RNN to model the attentive word sequence. However, this model attends to a sentence word by word and may ignore the relation between words. For example, if we are asked: what is his favorite food? and one answer candidate is: He likes hot dog best. then hot or dog may not relate to the question by itself, but the two words are informative as a whole in context. We therefore propose a second model in which every word representation in the answer is influenced not only by the question attention but also by the context representation of the word (i.e. the last hidden state). In our last model, inspired by previous work on adding gates to the inner activation of an RNN to control long and short term information flow, we embed the attention into the inner activation gates of the RNN to influence the computation of the hidden representation.
In addition, inspired by recent work called Occam's Gate, in which the activations of input units are penalized to be as small as possible, we add regularization to the summation of the attention weights to impose sparsity. Overall, we make three contributions: (1) We analyze the attention bias problem in traditional attention based RNN models. (2) We propose three inner attention based RNN models and achieve new state-of-the-art results in answer selection. (3) We use Occam's Razor to regularize the attention weights, which shows advantages in long sentence representation.

Related Work
In recent years, many deep learning frameworks have been developed to model text in a vector space and then use the embedded representations in this space for machine learning tasks. There are many neural network architectures for this representation, such as convolutional neural networks, recursive neural networks (Socher et al., 2013) and recurrent neural networks (Mikolov et al., 2011). In this work we propose the Inner Attention based RNN (IARNN) for answer selection, and our work relates to two main lines of research.

Attention based Models
Many recent works show that attention techniques can improve the performance of machine learning models (Mnih et al., 2014; Zheng et al., 2015). In attention based models, one representation is built with attention (or supervision) from another representation. Weston et al. (2014) propose a neural network based model called Memory Networks, which uses an external memory to store knowledge; the memory is read and written on the fly with respect to the attention, and the attended memories are combined for inference. Since then, many variants have been proposed to solve question answering problems (Sukhbaatar et al., 2015; Kumar et al., 2015). Hermann et al. (2015) and many other researchers (Tan et al., 2015; Rocktäschel et al., 2015) introduce the attention mechanism into the LSTM-RNN architecture. An RNN models the input sequence word by word and updates its hidden variable recurrently. Compared with a CNN, an RNN is more capable of exploiting long-distance sequential information. In attention based RNN models, after the hidden representation at each time step is computed, attention information is used to weight each hidden representation, and the hidden states are then combined with respect to those weights to obtain the sentence (or document) representation. There are commonly two ways to get attention from the source sentence: from the whole sentence representation (which they call attentive) or word by word (called impatient).

Answer Selection
Answer selection is a sub-task of QA and of many other tasks such as machine comprehension. Given a question and a set of candidate sentences, one should choose the sentence from the candidate set that best answers the question. Previous works usually relied on feature engineering, linguistic tools, or external resources. For example, Yih et al. (2013) use semantic features from WordNet to enhance lexical features. Wang and Manning (2007) compare the question and answer sentence by their syntactic matching in parse trees. Heilman and Smith (2010) perform the matching using minimal edit sequences between dependency parse trees. Severyn and Moschitti (2013) automate the extraction of discriminative tree-edit features over parse trees.
While these methods are effective, they depend on the availability of additional resources and suffer from the errors of NLP tools such as dependency parsers. Recently, many works use deep learning architectures to represent the question and answer in the same hidden space, so that the task can be converted into a classification or learning-to-rank problem (Feng et al., 2015; Wang and Nyberg, 2015). With the development of the attention mechanism, Tan et al. (2015) propose an attention based RNN model which introduces question attention into the answer representation.

Traditional Attention based RNN Models and Their Deficiency
The attention-based models introduce the attention information into the representation process.
In answer selection, given a question $Q = \{q_1, q_2, q_3, ..., q_n\}$, where $q_i$ is the $i$-th word and $n$ is the question length, we can compute its representation in the RNN architecture as follows:
$$X = D[q_1, q_2, ..., q_n]$$
$$h_t = \sigma(W_{ih} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = \sigma(W_{ho} h_t + b_o) \tag{1}$$
where $D$ is an embedding matrix that projects each word into its embedding space in $\mathbb{R}^d$ and $x_t$ is the embedding of the $t$-th word; $W_{ih}$, $W_{hh}$, $W_{ho}$ are weight matrices and $b_h$, $b_o$ are bias vectors; $\sigma$ is an activation function such as tanh. Usually we can ignore the output variables and use the hidden variables. After the recurrent process, the last hidden variable $h_n$ or the average of all hidden states $\frac{1}{n}\sum_{t=1}^{n} h_t$ is adopted as the question representation $r_q$.
When modeling the candidate answer sentence of length $m$, $S = \{s_1, s_2, s_3, ..., s_m\}$, in an attention based RNN model, instead of using the last hidden state or the average of the hidden states, we use attentive hidden states that are weighted by $r_q$:
$$H_a = [h_a(1), h_a(2), ..., h_a(m)]$$
$$s_t \propto f_{attention}(r_q, h_a(t))$$
$$\tilde{h}_a(t) = h_a(t) s_t$$
$$r_a = \sum_{t=1}^{m} \tilde{h}_a(t) \tag{2}$$
where $h_a(t)$ is the hidden state of the answer at time $t$. In many previous works (Rocktäschel et al., 2015; Tan et al., 2015), the attention function $f_{attention}$ is computed as:
$$m(t) = \tanh(W_{hm} h_a(t) + W_{qm} r_q)$$
$$f_{attention}(r_q, h_a(t)) = \exp(w_{ms}^T m(t)) \tag{3}$$
where $W_{hm}$ and $W_{qm}$ are attentive weight matrices and $w_{ms}$ is an attentive weight vector. We can therefore expect the candidate answer representation $r_a$ to be built in a question-guided way: when a hidden state $h_a(t)$ is irrelevant to the question (as determined by the attention weight $s_t$), it takes a smaller part in the final representation; when the hidden state is relevant to the question, it contributes more to $r_a$. We call this type of model OARNN, which stands for Outer Attention based RNN, because it adds attention information outside the RNN hidden representation computing process. An illustration of the traditional attention based RNN model is given in Figure 1.
However, in the RNN architecture the input words are processed in time sequence and the hidden states are updated recurrently, so the current hidden state $h_t$ is supposed to contain all the information up to time $t$. When we add question attention information, aiming at finding the useful part of the sentence, the near-the-end hidden states are prone to be selected because they contain much more information about the whole sentence. In other words, if the question pays attention to the hidden state at time $t$, then it should also pay attention to the hidden states after $t$ (i.e. $\{h_{t'} | t' > t\}$), as they contain at least as much information as $h_t$. But in answer selection, for a specific candidate answer, the parts useful for answering the question may be located anywhere in the sentence, so the attention should also be distributed uniformly over the sentence.
Traditional attention based RNN models, which apply attention after representation, may therefore bias the attention towards the later hidden states. We analyze this attention bias problem quantitatively in the experiments.
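As an illustration of the recurrence described above, the following minimal Python sketch encodes a sequence of word embeddings with a plain tanh RNN and pools the hidden states into a sentence vector. The names matvec and rnn_encode, the zero initial state and the list-of-lists matrix layout are assumptions made for this sketch, not details given in the text.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_encode(X, W_ih, W_hh, b_h, pool="mean"):
    """Run h_t = tanh(W_ih x_t + W_hh h_{t-1} + b_h) over the embeddings X.

    Returns the last hidden state (pool="last") or the average of all
    hidden states (pool="mean") as the sentence representation.
    """
    h = [0.0] * len(b_h)  # assumption: the initial state h_0 is zero
    states = []
    for x in X:
        a = matvec(W_ih, x)
        r = matvec(W_hh, h)
        h = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b_h)]
        states.append(h)
    if pool == "last":
        return states[-1]
    n = len(states)
    return [sum(s[k] for s in states) / n for k in range(len(b_h))]
```

Calling rnn_encode with pool="last" gives the last hidden state, and pool="mean" gives the averaged representation.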
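The outer attention pooling discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the attention score is taken to be a tanh projection of the hidden state and the question vector followed by a dot product with w_ms, and the weights are normalized with a softmax. The function names and the list-based layout are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def outer_attention(H_a, r_q, W_hm, W_qm, w_ms):
    """Pool answer hidden states H_a into r_a, weighted by question vector r_q."""
    q_part = matvec(W_qm, r_q)  # identical for every time step
    scores = []
    for h in H_a:
        m = [math.tanh(a + b) for a, b in zip(matvec(W_hm, h), q_part)]
        scores.append(sum(w * mi for w, mi in zip(w_ms, m)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    s = [e / total for e in exps]  # attention weights s_t, summing to 1
    dim = len(H_a[0])
    r_a = [sum(s[t] * H_a[t][k] for t in range(len(H_a))) for k in range(dim)]
    return s, r_a
```

Because the weights are computed from hidden states that already summarize the sentence prefix, inspecting the returned s for many sentences is one way to look for the position bias described above.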

Inner Attention based Recurrent Neural Networks
In order to solve the attention bias problem, we propose an intuition:

Attention before representation
Instead of adding attention information after encoding the answer by RNN, we add attention before computing the RNN hidden representations. Based on this intuition, we propose three inner attention based RNN models detailed below.

IARNN-WORD
As the attention mechanism aims at finding the useful part of a sentence, the first model applies the above intuition directly. Instead of feeding the original answer words to the RNN model, we weight the word representations according to question attention as follows:
$$\alpha_t = \sigma(r_q^T M_{qi} x_t)$$
$$\tilde{x}_t = \alpha_t x_t \tag{4}$$
where $M_{qi}$ is an attention matrix that transforms the question representation into the word embedding space. We then use the dot product to determine the question attention strength; $\sigma$ is the sigmoid function, which normalizes the weight $\alpha_t$ to between 0 and 1. The above attention process can be understood as sentence distillation, where the input words are distilled (or filtered) by question attention. We can then represent the whole sentence based on this distilled input using a traditional RNN model. In this work, we use GRU instead of LSTM as the building block for the RNN because it has shown advantages in many tasks and has comparatively fewer parameters (Jozefowicz et al., 2015). It is formulated as follows:
$$z_t = \sigma(W_{xz} \tilde{x}_t + W_{hz} h_{t-1})$$
$$f_t = \sigma(W_{xf} \tilde{x}_t + W_{hf} h_{t-1})$$
$$\tilde{h}_t = \tanh(W_{xh} \tilde{x}_t + W_{hh}(f_t \odot h_{t-1}))$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{5}$$
where $W_{xz}$, $W_{hz}$, $W_{xf}$, $W_{hf}$, $W_{hh}$, $W_{xh}$ are weight matrices and $\odot$ stands for element-wise multiplication. Finally, we get the candidate answer representation by average pooling all the hidden states $h_t$. We call this model IARNN-WORD as the attention is paid to the original input words. This model is shown in Figure 2.
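The word-level distillation step described above can be sketched in a few lines of Python. The sketch assumes that M_qi has one row per question dimension and one column per embedding dimension, so that the score is the dot product of the projected question vector with each word embedding; the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_attention(X, r_q, M_qi):
    """Scale each answer embedding x_t by alpha_t = sigmoid(r_q^T M_qi x_t)."""
    d = len(X[0])
    # Project the question once: p = M_qi^T r_q, so r_q^T M_qi x_t = p . x_t
    p = [sum(r_q[i] * M_qi[i][j] for i in range(len(r_q))) for j in range(d)]
    alphas, X_tilde = [], []
    for x in X:
        alpha = sigmoid(sum(pj * xj for pj, xj in zip(p, x)))
        alphas.append(alpha)
        X_tilde.append([alpha * xj for xj in x])
    return alphas, X_tilde
```

The returned X_tilde is what would then be fed to the GRU in place of the raw embeddings.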

IARNN-CONTEXT
IARNN-WORD attends to the input word embeddings directly. However, the answer sentence may contain consecutive words that are related to the question, and a word may be irrelevant to the question by itself but relevant in the context of the answer sentence.
The above word by word attention mechanism may therefore fail to capture the relationship between multiple words. In order to bring contextual information into the attention process, we modify the attention weights in Equation 4 with additional context information:
$$w_C(t) = M_{hc} h_{t-1} + M_{qc} r_q$$
$$\alpha_C^t = \sigma(w_C^T(t) x_t)$$
$$\tilde{x}_t = \alpha_C^t x_t \tag{6}$$
where we use $h_{t-1}$ as context, $M_{hc}$ and $M_{qc}$ are attention weight matrices, and $w_C(t)$ is the attention representation, which consists of both question and word context information. This additional context attention enables our model to capture relevant parts over longer text spans. We show this model in Figure 3.
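A single step of the context-aware attention described above might look like the following sketch. It assumes that M_hc and M_qc both map into the word embedding space (one row per embedding dimension) and that the attention vector is their sum; the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def context_attention_step(x_t, h_prev, r_q, M_hc, M_qc):
    """Weight x_t using both the question r_q and the previous hidden state."""
    # Attention vector combining word context and question information
    w_c = [a + b for a, b in zip(matvec(M_hc, h_prev), matvec(M_qc, r_q))]
    alpha = sigmoid(sum(w * x for w, x in zip(w_c, x_t)))
    return alpha, [alpha * x for x in x_t]
```

Since the weight depends on h_prev, this step has to be interleaved with the recurrent update rather than applied to the whole sentence in advance.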

IARNN-GATE
Inspired by the previous work on LSTM (Hochreiter and Schmidhuber, 1997), which addresses the exploding gradient problem in RNN, and by recent work on building distributed word representations with topic information (Ghosh et al., 2016), instead of adding attention information to the original input we can apply attention deeper, to the GRU inner activations (i.e. $z_t$ and $f_t$). Because these inner activation units control the flow of information within the hidden state and enable information to pass over long distances in a sentence, we add attention information from the question representation to these gates to influence the hidden representation, where $M_{qz}$ and $M_{hz}$ are the attention weight matrices.
In this way, the update and forget units in GRU can focus on not only long and short term memory but also the attention information from the question. The architecture is shown in Figure 4.
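One way to realize attention inside the gates is to add a projection of the question representation to the pre-activation of the update and forget gates of a GRU step. The sketch below is an assumption-laden illustration, not the authors' exact formulation: the parameter dictionary P, the key names, and in particular the name M_qf for the forget-gate attention matrix are hypothetical (the text names the two attention matrices M_qz and M_hz).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vecs):
    """Element-wise sum of several vectors of equal length."""
    return [sum(vals) for vals in zip(*vecs)]

def gate_attention_gru_step(x_t, h_prev, r_q, P):
    """One GRU step whose update and forget gates also see the question r_q."""
    z = [sigmoid(v) for v in add(matvec(P["W_xz"], x_t),
                                 matvec(P["W_hz"], h_prev),
                                 matvec(P["M_qz"], r_q))]
    f = [sigmoid(v) for v in add(matvec(P["W_xf"], x_t),
                                 matvec(P["W_hf"], h_prev),
                                 matvec(P["M_qf"], r_q))]
    gated = [fi * hi for fi, hi in zip(f, h_prev)]
    h_cand = [math.tanh(v) for v in add(matvec(P["W_xh"], x_t),
                                        matvec(P["W_hh"], gated))]
    # Interpolate between the previous state and the candidate state
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]
```

With all attention matrices set to zero this reduces to an ordinary GRU step, which makes the effect of the question term easy to isolate.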

IARNN-OCCAM
In answer selection, the answer sentence may contain only a small number of words that are related to the question. In IARNN-WORD and IARNN-CONTEXT, we calculate the attention weight of each word without considering the total weight. Similar to Raiman (2015), who adds regularization to the input gate, we penalize the summation of the attention weights to enforce sparsity. This is an application of Occam's Razor: among the whole set of words, we choose the smallest subset that can represent the sentence. However, assigning a pre-defined hyper-parameter to this regularization is not ideal because it penalizes all question attention weights with the same strength. Different questions may require different numbers of snippets from the candidate answer. For example, when the question type is When or Who, the answer sentence may contain only a few relevant words, so we should impose more sparsity on the summation of the attention. But when the question type is Why or How, many more words in the sentence may be relevant to the question, so the regularization value should be set smaller accordingly. In this work, the attention regularization is added as follows: for the specific question $Q_i$ and its representation $r_q^i$, we use a vector $w_{qp}$ to project it into a scalar value $n_p^i$, and then use it to penalize the summation of the attention weights $\alpha_t^i$ (Equation 4 and Equation 6) in the original objective $J_i$; $\lambda_q$ is a small positive hyper-parameter. Note that we do not regularize IARNN-GATE because its attention has been embedded into the gate activation.
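The question-specific penalty described above can be illustrated with a small sketch. It assumes that the scalar n_p is the dot product of w_qp with the question representation, floored at lambda_q so that the penalty never drops below the base strength, and that the penalty term is n_p times the sum of attention weights. The floor and the function name are assumptions of this sketch.

```python
def occam_objective(J_i, alphas, r_q, w_qp, lambda_q=0.05):
    """Add a question-dependent sparsity penalty on attention weights to J_i."""
    # Question-specific penalty strength, never below the base value lambda_q
    n_p = max(sum(w * r for w, r in zip(w_qp, r_q)), lambda_q)
    return J_i + n_p * sum(alphas)
```

A question whose projection is large is penalized more strongly for spreading attention over many words, which matches the intent of imposing more sparsity for When or Who questions.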

Quantify Traditional Attention based Model Bias Problem
In order to quantify the attention bias problem of the outer attention based RNN model described in Section 3, we build an outer attention based model similar to Tan et al. (2015). First, we build the question representation by averaging its LSTM hidden states; then we build the candidate answer sentence representation in the attentive way introduced in Section 3. Next, we use cosine similarity to compare the question and answer representations. Finally, we adopt the max-margin hinge loss as the objective:
$$L = \max\{0, M - cosine(r_q, r_{a^+}) + cosine(r_q, r_{a^-})\}$$
where $a^+$ is the ground truth answer candidate, $a^-$ stands for a negative one, and the scalar $M$ is a predefined margin. When the training result saturates after 50 epochs, we collect the attention weight distribution (i.e. $s_t$ in Equation 2). The experiment is conducted on two answer selection datasets: WikiQA (Yang et al., 2015) and TrecQA (Wang et al., 2007). The normalized attention weights are reported in Figure 5. However, the above model uses only a forward LSTM to build the hidden state representations, so the attention bias might instead be attributed to a biased answer distribution: the part of the answer that is useful for the question may sometimes be located at the end of the sentence. We therefore also try OARNN in a bidirectional architecture, where the forward and backward LSTM states are concatenated to form the hidden representation. The attention distribution of the bidirectional attention based LSTM is shown in Figure 6.
Analysis: As shown in Figures 5 and 6, for the one-directional OARNN the question attention increases continuously as we move from the beginning to the end of a sentence; for the bidirectional OARNN, the hidden representations near the two ends of a sentence get more attention. This is consistent with our assumption that the closer a hidden representation in an RNN is to the end of a sentence, the more attention it draws from the question. But the relevant part may be located anywhere in an answer. As a result, when the sample size is large enough, the attention weights should be uniformly distributed. The traditional attention-after-representation style RNN may therefore suffer from the biased attention problem. Our IARNN models are free from this problem, and their attention is distributed nearly uniformly (orange line) over a sentence.
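The max-margin objective over cosine similarities described above can be written as a short Python sketch. The function names are hypothetical, and the sketch assumes that all representation vectors are non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hinge_loss(r_q, r_pos, r_neg, margin):
    """Zero once the positive answer beats the negative one by the margin."""
    return max(0.0, margin - cosine(r_q, r_pos) + cosine(r_q, r_neg))
```

The loss is zero whenever the ground truth answer is already more similar to the question than the negative answer by at least the margin, so only violating triplets contribute gradients.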

IARNN evaluation
Common Setup: We use off-the-shelf 100-dimension word embeddings from word2vec, and initialize all weights and attention matrices by fixing their largest singular values to 1 (Pascanu et al., 2013). The IARNN-OCCAM base regularization hyperparameter $\lambda_q$ is set to 0.05, and we add an $L_2$ penalty with a coefficient of $10^{-5}$. Dropout (Srivastava et al., 2014) is further applied to all parameters with probability 30%. We use Adadelta (Zeiler, 2012) with $\rho = 0.90$ to update parameters.
We choose three datasets for evaluation: InsuranceQA, WikiQA and TREC-QA. These datasets contain questions from different domains. Table 1 presents some statistics about these datasets. We adopt a max-margin hinge loss as the training objective. The results are reported in terms of MAP and MRR on WikiQA and TREC-QA and in terms of accuracy on InsuranceQA.
We use bidirectional GRU for all models. We share the GRU parameters between question and answer, which has been shown to significantly improve performance and convergence rate (Tan et al., 2015; Feng et al., 2015).
There are two common baseline systems for the above three datasets:
• GRU: A non-attentive GRU-RNN that models the question and answer separately.
• OARNN: The outer attention based RNN model described above, used here with GRU.
WikiQA (Yang et al., 2015) is an answer selection dataset whose candidate sentences come from Wikipedia. In addition to the original (question, positive, negative) triplets, we randomly select additional negative answer candidates from the answer sentence pool, which yields a relatively abundant 50,298 triplets. We use cosine similarity to compare the question and the candidate answer sentence. The hidden variable length is set to 165 and the batch size is set to 1. We use sigmoid as the GRU inner activation function and keep the word embeddings fixed during training. The margin M was set to 0.15, tuned on the development set. We adopt three additional baseline systems applied to WikiQA: (1) A bigram CNN model with average pooling (Yang et al., 2015). (2) An attention based CNN model which uses an interactive attention matrix for both question and answer (Yin et al., 2015). (3) An attention based CNN model which builds the attention matrix after sentence representation (Santos et al., 2016). The result is shown in Table 2.
Table 2, row (Yang et al., 2015): 0.652, 0.6652.
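For reference, the ranking metrics mentioned above can be computed as in the following sketch. It assumes that, for each question, the candidates have already been sorted by descending model score and are given as a list of 0/1 relevance labels; the function names are hypothetical.

```python
def reciprocal_rank(labels):
    """1 / rank of the first correct candidate, or 0.0 if there is none."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(labels):
    """Mean of precision@k taken at the rank k of each correct candidate."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_metric(metric, ranked_label_lists):
    """Average a per-question metric over all questions (MRR or MAP)."""
    return sum(metric(l) for l in ranked_label_lists) / len(ranked_label_lists)
```

MRR is mean_metric(reciprocal_rank, ...) and MAP is mean_metric(average_precision, ...). When each question has exactly one correct answer, accuracy is simply the fraction of questions whose top-ranked candidate is correct.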
InsuranceQA (Feng et al., 2015) is a domain specific answer selection dataset in which all questions are related to insurance. Its vocabulary size is comparatively small (22,353). We set the batch size to 16 and the hidden variable size to 145; the hinge loss margin M is adjusted to 0.12 according to evaluation behavior. Word embeddings are also learned during training. We adopt the Geometric mean of Euclidean and Sigmoid Dot (GESD) proposed in Feng et al. (2015) to measure the similarity between two representations, which shows an advantage over cosine similarity in experiments.
We report accuracy instead of MAP/MRR because each question has only one right answer in InsuranceQA. The result is shown in Table 3.
Table 3 (columns): System, Dev, Test1, Test2. Row fragment: (Feng et al., 2015) 65
Table 4: Results of different systems on TREC-QA. Wang and Ittycheriah (2015) propose a question similarity model to extract features from word alignment between two questions, which is suitable for FAQ based QA. Systems marked with † are trained on the original full TREC-QA training data.
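The GESD similarity used for InsuranceQA combines a Euclidean distance term with a sigmoid of the dot product. The sketch below follows the usual form of this measure; the parameters gamma and c and their default values are placeholders chosen for illustration, since the text does not state the values used.

```python
import math

def gesd(x, y, gamma=1.0, c=1.0):
    """Product of an inverse Euclidean term and a sigmoid dot-product term."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    dot = sum(a * b for a, b in zip(x, y))
    euclid_term = 1.0 / (1.0 + dist)
    sigmoid_term = 1.0 / (1.0 + math.exp(-gamma * (dot + c)))
    return euclid_term * sigmoid_term
```

Unlike cosine similarity, this score drops when two vectors point in the same direction but differ strongly in magnitude.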
TREC-QA was created by Wang et al. (2007) based on Text REtrieval Conference (TREC) QA track (8-13) data. The size of the hidden variable was set to 80, and M was set to 0.1. This dataset is comparatively small, so we set the word embedding vector size to 50 and update it during training. Note that we do not use the original TREC-QA training data but the smaller set that has been edited by humans. The result is shown in Table 4.
Figure 7 (text of the heatmap example): Q: how old was monica lewinsky during the affair ? Answer sentence: Monica Samille Lewinsky ( born July 23 , 1973 ) is an American woman with whom United States President Bill Clinton admitted to having had an `` improper relationship '' while she worked at the White House in 1995 and 1996 .

Result and Analysis
We can see from the result tables that the attention based RNN models achieve better results than the non-attentive RNN model (GRU). OARNN and IARNN beat the non-attentive GRU on every dataset by a large margin, which shows the importance of the attention mechanism in representing the answer sentence in AS. For the non-attentive models, the fixed width of the hidden vectors is a bottleneck for interactive information flow, so the informative part of the question can only propagate through the similarity score, which is too blurred a signal for the answer representation to be properly learned. In attention based models, by contrast, the question attention information is introduced to influence the answer sentence representation explicitly, which improves the sentence representation for the specific target (or topic (Ghosh et al., 2016)).
The inner attention RNN models outperform the outer attention model on the three datasets, which corresponds to our intuition that the attention bias problem in OARNN may cause a biased sentence representation. An example of the attention heatmap is shown in Figure 7. To answer the question, we should focus on "born July 23 , 1973", which is located at the beginning of the sentence. But in OARNN, the attention is biased towards the last few words in the answer. In IARNN-CONTEXT, the attention is paid to the relevant part and thus results in a more relevant representation.
Attention with context information also improves the result: IARNN-CONTEXT and IARNN-GATE outperform IARNN-WORD in the three experiments. IARNN-WORD may ignore the importance of some words because it attends to the answer word by word. For example, in Figure 8, the specific word self or focusing may not be related to the question by itself, but its combination with the previous word relativistic is very informative for answering the question. In IARNN-CONTEXT we add attention information dynamically in the RNN process, so it can capture the relationship between a word and its context.
In general, we can see from Tables 2-4 that IARNN-GATE outperforms IARNN-CONTEXT and IARNN-WORD. In IARNN-WORD and IARNN-CONTEXT, the attention is added to influence each word representation, but the recurrent process of updating the RNN hidden state representations is not influenced. IARNN-GATE embeds the attention into the RNN inner activation, and the attentive activation gates are more capable of controlling the attention information in the RNN. This suggests an important direction for future work: we could add attention information as an individual activation gate, and use this additional gate to control the attention information flow in the RNN.
The regularization of the attention weights (Occam's attention) also improves the representation. We conduct an experiment on WikiQA (training process) to measure the Occam's attention regularization on different types of questions. We use rules to classify questions into 6 types (i.e. who, why, how, when, where, what), and each type has the same number of samples to avoid data imbalance. We report the Occam's regularization ($n_p^i$ in Equation 8) in Figure 9. As we can see from the radar graph, who and where are regularized severely compared with other types of question, which corresponds to the comparatively small amount of information in the answer candidate needed to answer such questions. This emphasizes that different types of question should impose different amounts of regularization on their candidate answers. The experimental results on the three AS datasets show that the improvement from Occam's attention is significant on WikiQA and InsuranceQA. Most of the sentences in these two datasets are relatively long, and the longer the sentence, the more noise it may contain, so we should penalize the summation of the attention weights to remove some irrelevant parts. Our question-specific Occam's attention penalizes the summation of attention and thus achieves a better result for both IARNN-WORD and IARNN-CONTEXT.

Conclusion and Future Work
In this work we present several variants of traditional attention based RNN models with GRU.
The key idea is attention before representation. We analyze the deficiency of traditional outer attention based RNN models qualitatively and quantitatively. We propose three models in which attention is embedded into the representation process. Occam's Razor is further applied to this attention for better representation. Our results on answer selection demonstrate that inner attention outperforms outer attention in RNN. Our models can be extended to other NLP tasks, such as recognizing textual entailment, where the attention mechanism is important for sentence representation. In the future we plan to apply our inner attention intuition to other neural networks such as CNN or multi-layer perceptron.