Diversity driven attention model for query-based abstractive summarization

Abstractive summarization aims to generate a shorter version of the document covering all the salient points in a compact and coherent fashion. On the other hand, query-based summarization highlights those points that are relevant in the context of a given query. The encode-attend-decode paradigm has achieved notable success in machine translation, extractive summarization, dialog systems, etc., but it tends to generate repeated phrases. In this work we propose a model for the query-based summarization task based on the encode-attend-decode paradigm with two key additions: (i) a query attention model (in addition to the document attention model) which learns to focus on different portions of the query at different time steps (instead of using a static representation for the query) and (ii) a new diversity based attention model which aims to alleviate the problem of repeating phrases in the summary. To enable the testing of this model we introduce a new query-based summarization dataset built on Debatepedia. Our experiments show that with these two additions the proposed model clearly outperforms vanilla encode-attend-decode models with a gain of 28% (absolute) in ROUGE-L scores.


Introduction
Over the past few years neural models based on the encode-attend-decode paradigm (Bahdanau et al., 2014) have shown great success in various natural language generation (NLG) tasks such as machine translation (Bahdanau et al., 2014), abstractive summarization (Rush et al., 2015; Nallapati et al., 2016), dialog (Li et al., 2016), etc. One such NLG problem which has not received enough attention in the past is query based abstractive text summarization, where the aim is to generate the summary of a document in the context of a query. In general, abstractive summarization aims to cover all the salient points of a document in a compact and coherent fashion. On the other hand, query focused summarization highlights those points that are relevant in the context of the query. Thus, given a document on "the super bowl", the query "How was the half-time show?" would result in a summary that would not cover the actual game itself.
Note that there has been some work on query based extractive summarization in the past, where the aim is to simply extract the most salient sentence(s) from a document and treat these as a summary; no natural language generation is involved. Since we were interested in abstractive (as opposed to extractive) summarization, we created a new dataset based on Debatepedia. This dataset contains triplets of the form (query, document, summary). Further, each summary is abstractive and not extractive, in the sense that the summary does not necessarily consist of a sentence which is simply copied from the original document.
Using this dataset as a testbed, we focus on a recurring problem in models based on the encode-attend-decode paradigm. Specifically, it is observed that the summaries produced by such models contain repeated phrases. Table 1 shows a few such examples of summaries generated by such a model when trained on this new dataset. This problem has also been reported by Chen et al. (2016) in the context of summarization and by Sankaran et al. (2016) in the context of machine translation.

Document Snippet: The "natural death" alternative to euthanasia is not keeping someone alive via life support until they die on life support. That would, indeed, be unnatural. The natural alternative is, instead, to allow them to die off of life support.
Query: Is euthanasia better than withdrawing life support (non-treatment)?
Ground Truth Summary: The alternative to euthanasia is a natural death without life support.
Predicted Summary: the large to euthanasia is a natural death life life use

Document Snippet: Legalizing same-sex marriage would also be a recognition of basic American principles, and would represent the culmination of our nation's commitment to equal rights. It is, some have said, the last major civil-rights milestone yet to be surpassed in our two-century struggle to attain the goals we set for this nation at its formation.
Query: Is gay marriage a civil right?
Ground Truth Summary: Gay marriage is a fundamental equal right.
Predicted Summary: gay marriage is a appropriate right right

Table 1: Examples showing repeated words in the output of encoder-decoder models
We first provide an intuitive explanation for this problem and then propose a solution for alleviating it. A typical encode-attend-decode model first computes a vectorial representation for the document and the query and then produces a contextual summary one word at a time. Each word is produced by feeding a new context vector to the decoder at each time step by attending to different parts of the document and query. If the decoder produces the same word or phrase repeatedly then it could mean that the context vectors fed to the decoder at these time steps are very similar.
We propose a model which explicitly prevents this by ensuring that successive context vectors are orthogonal to each other. Specifically, we subtract out any component that the current context vector has in the direction of the previous context vector. Notice that we do not require the current context vector to be orthogonal to all previous context vectors but just to its immediate predecessor. This enables the model to attend to words repeatedly if required later in the process. To account for the complete history (all previous context vectors) we also propose an extension of this idea where we pass the sequence of context vectors through an LSTM (Hochreiter and Schmidhuber, 1997) and ensure that the current state produced by the LSTM is orthogonal to the history. At each time step, the state of the LSTM is then fed to the decoder to produce one word in the summary.
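The subtraction of the component along the previous context vector can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `orthogonalize` and the `eps` guard for a zero previous vector (for instance at the first time step) are assumptions made here.

```python
import numpy as np

def orthogonalize(current, previous, eps=1e-12):
    """Remove from `current` its component in the direction of `previous`."""
    current = np.asarray(current, dtype=float)
    previous = np.asarray(previous, dtype=float)
    norm_sq = float(previous @ previous)
    if norm_sq < eps:
        # Assumption: with no usable previous vector, leave `current` unchanged.
        return current.copy()
    # Standard vector projection: (current . previous / |previous|^2) * previous
    return current - (current @ previous) / norm_sq * previous

# Hypothetical 3-dimensional context vectors, for illustration only.
d_prev = np.array([1.0, 0.0, 0.0])
d_curr = np.array([0.9, 0.1, 0.2])
d_new = orthogonalize(d_curr, d_prev)
```

In this toy case the two vectors are nearly parallel, and the step keeps only the part of `d_curr` that carries new information relative to `d_prev`.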
Our contributions can be summarized as follows: (i) We propose a new dataset for query based abstractive summarization and evaluate encode-attend-decode models on this dataset. (ii) We study the problem of repeating phrases in NLG in the context of this dataset and propose two solutions for countering this problem. We show that our method outperforms a vanilla encoder-decoder model with a gain of 28% (absolute) in ROUGE-L score. (iii) We also demonstrate that our method clearly outperforms a recent state of the art method proposed for handling the problem of repeating phrases, with a gain of 7% (absolute) in ROUGE-L scores. (iv) We do a qualitative analysis of the results and show that our model indeed produces outputs with fewer repetitions.

Related Work
Recent research in abstractive summarization has focused on data driven neural models based on the encode-attend-decode paradigm (Bahdanau et al., 2014). For example, Rush et al. (2015) report state of the art results on the Gigaword and DUC corpora using such a model. Similarly, the work of Lopyrev (2015) uses neural networks to generate news headlines from short news stories. Chopra et al. (2016) extend the work of Rush et al. (2015) and report further improvements on the two datasets. Hu et al. (2015) introduced a dataset for Chinese short text summarization and evaluated a similar RNN encoder-decoder model on it.
One recurring problem in encoder-decoder models for NLG is that they often repeat the same phrase/word multiple times in the summary (at the cost of both coherency and fluency). Sankaran et al. (2016) study this problem in the context of MT and propose a temporal attention model which enforces the attention weights for successive time steps to be different from each other. Similarly, and more relevant to this work, Chen et al. (2016) propose a distraction based attention model which maintains a history of attention vectors and context vectors. It then subtracts this history from the current attention and context vector. When evaluated on our dataset their method performs poorly. This could be because their method is very aggressive in dealing with the history (as explained later in the Experiments section). On the other hand, our method has a better way of handling history (by passing context vectors through an LSTM recurrent network) which gives us the flexibility to forget/retain some portions of the history and at the same time produce diverse context vectors at successive time steps.
We evaluate our method in the context of query based abstractive summarization, a problem which has received almost no attention in the past due to the unavailability of datasets. We create a new dataset for this task and show that our method indeed produces better output by reducing the number of repeated phrases produced by encoder-decoder models.
Table 2: Average length of documents/queries/summaries in the dataset

Dataset
As mentioned earlier, there are no existing datasets for query based abstractive summarization. We create such a dataset from Debatepedia, an encyclopedia of pro and con arguments and quotes on critical debate topics. There are 663 debates in the corpus (we have considered only those debates which have at least one query with one document). These 663 debates belong to 53 overlapping categories such as Politics, Law, Crime, Environment, Health, Morality, Religion, etc. A given topic can belong to more than one category. For example, the topic "Eye for an Eye philosophy" belongs to both "Law" and "Morality". The average number of queries per debate is 5 and the average number of documents per query is 4. Please refer to the dataset URL (footnote 1) for more details about the number of debates per category.
For example, Figure 1 shows the queries associated with the topic "Algae Biofuel". It also lists the set of documents and an abstractive summary associated with each query. As the example shows, the summary is abstractive and not extracted directly from the document. We crawled 12695 such {query, document, summary} triples from Debatepedia (these were all the triples that were available). Table 2 reports the average length of the query, summary and documents in this dataset.
We used 10-fold cross validation for all our experiments. Each fold uses 80% of the documents for training, 10% for validation and 10% for testing.
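One way to build ten 80/10/10 folds is sketched below. The rotation scheme (fold i tests on chunk i and validates on chunk i+1), the function name `make_folds`, the fixed seed and the choice to split individual triples are all assumptions made for illustration; the text only states the proportions.

```python
import random

def make_folds(n_items, n_folds=10, seed=0):
    """Split item indices into rotating train/validation/test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Ten disjoint chunks of (almost) equal size.
    chunks = [indices[i::n_folds] for i in range(n_folds)]
    folds = []
    for i in range(n_folds):
        val_id = (i + 1) % n_folds
        train = [idx for j, chunk in enumerate(chunks)
                 if j not in (i, val_id) for idx in chunk]
        folds.append({"train": train, "val": chunks[val_id], "test": chunks[i]})
    return folds

# 12695 is the number of triples reported for the dataset.
folds = make_folds(12695)
```

With this scheme every item appears in exactly one test set across the ten folds.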

Proposed model
Given a query $q = q_1, q_2, \ldots, q_k$ containing $k$ words and a document $d = d_1, d_2, \ldots, d_n$ containing $n$ words, the task is to generate a contextual summary $y = y_1, y_2, \ldots, y_m$ containing $m$ words. This can be modeled as the problem of finding a $y^*$ that maximizes the probability $p(y \mid q, d)$, which can be further decomposed as:
$$y^* = \arg\max_{y} \prod_{t=1}^{m} p(y_t \mid y_1, \ldots, y_{t-1}, q, d) \quad (1)$$
We now describe a way of modeling $p(y_t \mid y_1, \ldots, y_{t-1}, q, d)$ using the neural encoder-attention-decoder paradigm.
The proposed model contains the following components: (i) an encoder RNN for the query, (ii) an encoder RNN for the document, (iii) an attention mechanism for the query, (iv) an attention mechanism for the document and (v) a decoder RNN. All the RNNs use a GRU cell.
Encoder for the query: We use a recurrent neural network with Gated Recurrent Units (GRU) for encoding the query. It reads the query $q = q_1, q_2, \ldots, q_k$ from left to right and computes a hidden representation for each time step as:
$$h_i^q = \mathrm{GRU}_q(h_{i-1}^q, e(q_i)) \quad (2)$$
where $e(q_i) \in \mathbb{R}^d$ is the $d$-dimensional embedding of the query word $q_i$.
Encoder for the document: This is similar to the query encoder and reads the document $d = d_1, d_2, \ldots, d_n$ from left to right and computes a hidden representation for each time step as:
$$h_i^d = \mathrm{GRU}_d(h_{i-1}^d, e(d_i)) \quad (3)$$
Attention mechanism for the query: At each time step, the decoder produces an output word by focusing on different portions of the query (document) with the help of a query (document) attention model. We first describe the query attention model, which assigns weights $\alpha^q_{t,i}$ to each word in the query at each decoder time step using the following equations:
$$a^q_{t,i} = v_q^\top \tanh(W_q s_t + U_q h_i^q) \quad (4)$$
$$\alpha^q_{t,i} = \frac{\exp(a^q_{t,i})}{\sum_{j=1}^{k} \exp(a^q_{t,j})} \quad (5)$$
where $s_t$ is the current state of the decoder at time step $t$ (we will see an exact formula for this soon), $W_q \in \mathbb{R}^{l_2 \times l_1}$, $U_q \in \mathbb{R}^{l_2 \times l_2}$, $v_q \in \mathbb{R}^{l_2}$, $l_1$ is the size of the decoder's hidden state, and $l_2$ is both the size of $h_i^q$ and the size of the final query representation at time step $t$, which is computed as:
$$q_t = \sum_{i=1}^{k} \alpha^q_{t,i} h_i^q \quad (6)$$
Attention mechanism for the document: We now describe the document attention model, which assigns weights to each word in the document using the following equations:
$$a^d_{t,i} = v_d^\top \tanh(W_d s_t + U_d h_i^d + Z q_t) \quad (7)$$
$$\alpha^d_{t,i} = \frac{\exp(a^d_{t,i})}{\sum_{j=1}^{n} \exp(a^d_{t,j})}$$
where $s_t$ is again the current state of the decoder at time step $t$, $W_d \in \mathbb{R}^{l_4 \times l_1}$, $U_d \in \mathbb{R}^{l_4 \times l_4}$, $Z \in \mathbb{R}^{l_4 \times l_2}$, $v_d \in \mathbb{R}^{l_4}$, and $l_4$ is the size of $h_i^d$ and also the size of the final document representation $d_t$, which is passed to the decoder at time step $t$ as:
$$d_t = \sum_{i=1}^{n} \alpha^d_{t,i} h_i^d \quad (8)$$
Note that $d_t$ now encodes the relevant information from the document as well as the query (see Equation (7)) at time step $t$. We refer to this as the context vector for the decoder.
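The two attention steps can be sketched as follows, assuming standard additive (tanh) attention with a softmax over positions, which is what the parameter shapes listed above suggest. This is an illustrative sketch rather than the authors' code: the function names and the random toy inputs are assumptions, and $v_d$ is taken to have size $l_4$ so that the product with the tanh output is well defined.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def query_attention(s_t, H_q, W_q, U_q, v_q):
    """H_q: (k, l2) query states; s_t: (l1,) decoder state."""
    scores = np.tanh(H_q @ U_q.T + W_q @ s_t) @ v_q   # (k,)
    alpha = softmax(scores)
    return alpha, alpha @ H_q                          # weights, q_t of size l2

def document_attention(s_t, q_t, H_d, W_d, U_d, Z, v_d):
    """H_d: (n, l4) document states; the score also depends on q_t."""
    scores = np.tanh(H_d @ U_d.T + W_d @ s_t + Z @ q_t) @ v_d   # (n,)
    alpha = softmax(scores)
    return alpha, alpha @ H_d                          # weights, d_t of size l4

# Toy dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
l1, l2, l4, k, n = 5, 4, 6, 3, 7
s_t = rng.normal(size=l1)
H_q, H_d = rng.normal(size=(k, l2)), rng.normal(size=(n, l4))
alpha_q, q_t = query_attention(
    s_t, H_q, rng.normal(size=(l2, l1)), rng.normal(size=(l2, l2)),
    rng.normal(size=l2))
alpha_d, d_t = document_attention(
    s_t, q_t, H_d, rng.normal(size=(l4, l1)), rng.normal(size=(l4, l4)),
    rng.normal(size=(l4, l2)), rng.normal(size=l4))
```

Because `q_t` is recomputed from the current decoder state before the document scores are formed, the document attention can shift as the focus on the query shifts.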
Decoder: The hidden state of the decoder $s_t$ at each time step $t$ is again computed using a GRU as follows:
$$s_t = \mathrm{GRU}_{dec}(s_{t-1}, [e(y_{t-1}), d_{t-1}]) \quad (9)$$
where $y_{t-1}$ gives a distribution over the vocabulary words at time step $t-1$ (Equation (10)), $N$ is the vocabulary size, and $y_t$ is the final output of the model, which defines a probability distribution over the output vocabulary. This is exactly the quantity defined in Equation (1) that we wanted to model, $p(y_t \mid y_1, \ldots, y_{t-1}, q, d)$.
Further, note that $e(y_{t-1})$ is the $d$-dimensional embedding of the word which has the highest probability under the distribution $y_{t-1}$. Also, $[e(y_{t-1}), d_{t-1}]$ means a concatenation of the vectors $e(y_{t-1})$ and $d_{t-1}$. We chose $f$ to be the identity function. The model as described above is an instantiation of the encoder-attention-decoder idea applied to query based abstractive summarization. As mentioned earlier (and demonstrated later through experiments), this model suffers from the problem of repeating the same phrase/word in the output. We now propose a new attention model, which we refer to as the diversity based attention model, to address this problem.

Diversity based attention model
As hypothesized earlier, if the decoder produces the same phrase/word multiple times then it is possible that the context vectors being fed to the decoder at consecutive time steps are very similar. We propose four models ($D_1$, $D_2$, $SD_1$, $SD_2$) to directly address this problem.
$D_1$: In this model, after computing $d_t$ as described in Equation (8), we make it orthogonal to the context vector at time $t-1$:
$$d'_t = d_t - \frac{d_t^\top d'_{t-1}}{d'^{\top}_{t-1} d'_{t-1}} \, d'_{t-1}$$
where $d'_{t-1}$ is the context vector passed to the decoder at time $t-1$.
$SD_1$: The above model imposes a hard orthogonality constraint on the context vector ($d_t$). We also propose a relaxed version of the above model which uses a gating parameter. This gating parameter decides what fraction of the previous context vector should be subtracted from the current context vector; its parameters are $W_g \in \mathbb{R}^{l_4 \times l_4}$ and $b_g \in \mathbb{R}^{l_4}$, where $l_4$ is the dimension of $d_t$ as defined in Equation (8).
$D_2$: The above model only ensures that the current context vector is diverse w.r.t. the previous context vector. It ignores all history before time step $t-1$. To account for the history, we treat successive context vectors as a sequence and use a modified LSTM cell to compute the new state at each time step. The cell takes $d_t$ from Equation (8) as its input at time $t$ and computes a diverse context; $l_5$ is the number of hidden units in the LSTM cell. The final $d_t$ from Equation (13) is then used in Equation (9). Note that Equation (12) ensures that the state of the LSTM at time step $t$ is orthogonal to the previous history. Figure 3 shows a pictorial representation of the model with a diversity LSTM cell.
$SD_2$: This model again uses a relaxed version of the orthogonality constraint used in $D_2$. Specifically, we define a gating parameter $g_t$ and replace Equation (12) by Equation (14), where $W_g \in \mathbb{R}^{l_5 \times l_4}$, $U_g \in \mathbb{R}^{l_5 \times l_4}$
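The LSTM based variants can be sketched as a single step function. This is a hedged illustration that assumes a standard LSTM parameterization with stacked gate weights; the function name `diversity_lstm_step`, the `params` layout, and the choice to carry the orthogonalized cell state forward are assumptions, not details confirmed by the text. The optional `gate` argument stands in for the learned gating parameter of the soft variant and is simply passed in here rather than computed from $W_g$ and $U_g$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def diversity_lstm_step(d_t, h_prev, c_prev, params, gate=None, eps=1e-12):
    """One LSTM step whose new cell state is pushed away from the previous one.

    d_t: (l4,) context vector; h_prev, c_prev: (l5,) previous LSTM state.
    params: dict with stacked weights W (4*l5, l4), U (4*l5, l5), b (4*l5,).
    gate: None for the hard constraint, or a value in [0, 1] (scalar or
          (l5,) vector) for the relaxed constraint.
    """
    l5 = h_prev.shape[0]
    z = params["W"] @ d_t + params["U"] @ h_prev + params["b"]
    i = sigmoid(z[:l5])               # input gate
    f = sigmoid(z[l5:2 * l5])         # forget gate
    o = sigmoid(z[2 * l5:3 * l5])     # output gate
    cand = np.tanh(z[3 * l5:])        # candidate cell state
    c_t = i * cand + f * c_prev
    norm_sq = float(c_prev @ c_prev)
    if norm_sq > eps:
        # Component of the new cell state along the previous cell state.
        proj = (c_t @ c_prev) / norm_sq * c_prev
        c_div = c_t - (proj if gate is None else gate * proj)
    else:
        c_div = c_t
    h_t = o * np.tanh(c_div)          # state handed to the decoder
    return h_t, c_div

# Toy sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
l4, l5 = 6, 4
params = {"W": rng.normal(size=(4 * l5, l4)),
          "U": rng.normal(size=(4 * l5, l5)),
          "b": np.zeros(4 * l5)}
x, h0, c0 = rng.normal(size=l4), np.zeros(l5), rng.normal(size=l5)
h_hard, c_hard = diversity_lstm_step(x, h0, c0, params)
h_soft, c_soft = diversity_lstm_step(x, h0, c0, params, gate=0.5)
```

Setting `gate=0` recovers a plain LSTM step, which corresponds to using an LSTM over context vectors without any diversity constraint.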

Baseline Methods
We compare with two recently proposed baseline diversity methods (Chen et al., 2016) as described below. Note that these methods were proposed in the context of abstractive summarization (not query based abstractive summarization) and we adapt them for the task of query based abstractive summarization. Below we just highlight the key differences from our model in computing the context vector $d_t$ passed to the decoder.
M1: This model accumulates all the previous context vectors as $\sum_{j=1}^{t-1} d_j$ and incorporates this history while computing a diverse context vector (Equation (15)), where $W_c, U_c \in \mathbb{R}^{l_4 \times l_4}$ are diagonal matrices. We then use this diversity driven context $d_t$ in Equations (9) and (10).
M2: In this model, in addition to computing a diverse context as described in Equation (15), the attention weights at each time step are also forced to be diverse from the attention weights at the previous time step. The parameters of this attention model are $W_a \in \mathbb{R}^{l_1 \times l_1}$, $U_a \in \mathbb{R}^{l_1 \times l_4}$ and $b_a, v_a \in \mathbb{R}^{l_1}$, where $l_1$ is the number of hidden units in the decoder GRU. Once again, they maintain a history of attention weights and compute a diverse attention vector by subtracting the history from the current attention vector.

Experimental Setup
We evaluate our models on the dataset described in Section 3. Note that there are no prior baselines on query based abstractive summarization, so we could only compare with different variations of the encoder-decoder models as described above. Further, we compare our diversity based attention models with existing models for diversity by suitably adapting them to this problem as described earlier. Specifically, we compare the performance of the following models:
• Vanilla e-a-d: This is the vanilla encoder-attention-decoder model adapted to the problem of abstractive summarization. It contains the following components: (i) document encoder, (ii) document attention model and (iii) decoder. It does not contain an encoder or attention model for the query. This helps us understand the importance of the query.
• Query enc: This model contains the query encoder in addition to the three components used in the vanilla model above. It does not contain any attention model for the query.
• Query att: This model contains the query attention model in addition to all the components in Query enc.
• $D_1$: The diversity attention model as described in Section 4.1.
• $D_2$: The LSTM based diversity attention model as described in Section 4.1.
• $SD_1$: The soft diversity attention model as described in Section 4.1.
• $SD_2$: The soft LSTM based diversity attention model as described in Section 4.1.
• $B_1$: The diversity cell in Figure 3 is replaced by the basic LSTM cell (i.e. $c_t^{diverse} = c_t$ instead of using Equation (12)). This helps us understand whether simply using an LSTM to track the history of context vectors (without imposing a diversity constraint) is sufficient.
• $M_1$: The baseline model which operates on the context vector as described in Section 5.
• $M_2$: The baseline model which operates on the attention weights in addition to the context vector as described in Section 5.
We used 80% of the data for training, 10% for validation and 10% for testing. We create 10 such folds and report the average ROUGE-1, ROUGE-2 and ROUGE-L scores across the 10 folds. The hyperparameters (batch size and GRU cell sizes) of all the models are tuned on the validation set. We tried batch sizes of 32 and 64 and GRU cell sizes of 200, 300 and 400. We used Adam (Kingma and Ba, 2014) as the optimization algorithm with the initial learning rate set to 0.0004, $\beta_1 = 0.9$, $\beta_2 = 0.999$. We used pre-trained publicly available GloVe word embeddings (footnote 2) and fine-tuned them during training. The same word embeddings are used for the query words and the document words. Table 3 summarizes the results of our experiments.
In this section, we discuss the results of the experiments reported in Table 3.
1. Effect of Query: Comparing rows 1 and 2 we observe that adding an encoder for the query and allowing it to influence the outputs of the decoder indeed improves the performance. This is expected as the query contains some keywords which could help in sharpening the focus of the summary.
2. Effect of Query attention model: Comparing rows 2 and 3 we observe that using an attention model to dynamically compute the query representation at each time step improves the results. This suggests that the attention model indeed learns to focus on relevant portions of the query at different time steps.
3. Effect of Diversity models: All the diversity models introduced in the paper (rows 7, 8, 9, 10) give significant improvement over the non-diversity models. In particular, the modified LSTM based diversity model gives the best results. This is indeed very encouraging, and Table 4 shows some sample summaries comparing the performance of different models.
4. Comparison with baseline diversity models: The baseline diversity model $M_1$ performs on par with our models $D_1$ and $SD_1$ but not as well as $D_2$ and $SD_2$. However, the model $M_2$ performs very poorly. We believe that simultaneously adding a constraint on the context vectors as well as the attention weights (as is indeed the case with $M_2$) is a bit too aggressive and leads to poor performance (although this needs further investigation).
5. Quantitative Analysis: In addition to the qualitative analysis reported in Table 4, we also did a quantitative analysis by counting the number of sentences containing repeated words generated by different models. Specifically, for the 1268 test instances we counted the number of sentences containing repeated words as generated by different models. Table 5 summarizes this analysis.

Source: Although cannabis does indeed have some harmful effects, it is no more harmful than legal substances like alcohol and tobacco. As a matter of fact, research by the British Medical Association shows that nicotine is far more addictive than cannabis. Furthermore, the consumption of alcohol and the smoking of cigarettes cause more deaths per year than does the use of cannabis (e.g. through lung cancer, stomach ulcers, accidents caused by drunk driving etc.). The legalization of cannabis will remove an anomaly in the law whereby substances that are more dangerous than cannabis are legal whilst the possession and use of cannabis remains unlawful.
Query: is marijuana harmless enough to be considered a medicine
G: marijuana is no more harmful than tobacco and alcohol
Query att: marijuana is no the drug drug for tobacco and tobacco
D1: marijuana is no more harmful than tobacco and tobacco
SD1: marijuana is more for evidence than tobacco and health
D2: marijuana is no more harmful than tobacco and use
SD2: marijuana is no more harmful than tobacco and alcohol

Source: Fuel cell critics point out that hydrogen is flammable, but so is gasoline. Unlike gasoline, which can pool up and burn for a long time, hydrogen dissipates rapidly. Gas tanks tend to be easily punctured, thin-walled containers, while the latest hydrogen tanks are made from Kevlar. Also, gaseous hydrogen isn't the only method of storage under consideration; BMW is looking at liquid storage while other researchers are looking at chemical compound storage, such as boron pellets.
Query: safety are hydrogen fuel cell vehicles safe
G: hydrogen in cars is less dangerous than gasoline
Query att: hydrogen is hydrogen hydrogen hydrogen fuel energy
D1: hydrogen in cars is less natural than gasoline
SD1: hydrogen in cars is reduce risk than fuel
D2: hydrogen in waste is less effective than gasoline
SD2: hydrogen in cars is less dangerous than gasoline

Source: The basis of all animal rights should be the Golden Rule: we should treat them as we would wish them to treat us, were any other species in our dominant position.
Query: do animals have rights that makes eating them inappropriate
G: animals should be treated as we would want to be treated
Query att: animals should be treated as we would protect to be treated
D1: animals should be treated as we most individual to be treated
SD1: animals should be treated as we would physically to be treated
D2: animals should be treated as we would illegal to be treated
SD2: animals should be treated as those would want to be treated

Table 4: Summaries generated by different models. In general, we observed that the baseline models which do not use a diversity based attention model tend to produce more repetitions. Notice that the last example shows that our model is not very aggressive in dealing with the history and is able to produce valid repetitions (treated ... treated) when needed.
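The counting procedure can be sketched as below. The exact criterion used for "repeated words" is not spelled out in the text, so this sketch assumes the simplest reading: a summary counts if any whitespace-separated token occurs more than once. The function names are hypothetical, and the example summaries are taken from Table 4.

```python
from collections import Counter

def has_repeated_word(sentence):
    """True if any token appears more than once (assumed criterion)."""
    counts = Counter(sentence.lower().split())
    return any(c > 1 for c in counts.values())

def count_repetitive(summaries):
    """Number of summaries that contain at least one repeated token."""
    return sum(has_repeated_word(s) for s in summaries)

examples = [
    "marijuana is no more harmful than tobacco and tobacco",
    "marijuana is no more harmful than tobacco and alcohol",
    "hydrogen is hydrogen hydrogen hydrogen fuel energy",
    "animals should be treated as we would want to be treated",
]
n_repetitive = count_repetitive(examples)
```

Under this criterion the last example is counted even though its repetition is valid, so a stricter rule (for instance ignoring function words or comparing against the reference summary) may be closer to what one wants in practice.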

Conclusion
Table 5: Average number of sentences with repeating words across 10 folds
In this work we proposed a query-based summarization method. The unique feature of the model is a novel diversification mechanism based on successive orthogonalization. This gives us the flexibility to: (i) provide diverse context vectors at successive time steps and (ii) pay attention to words repeatedly if need be later in the summary (as opposed to existing models which aggressively delete the history). We also introduced a new dataset and empirically verified that our model performs significantly better (gain of 28% (absolute) in ROUGE-L score) than a plain encode-attend-decode mechanism applied to this problem. We observe that adding an attention mechanism on the query string gives significant improvements. We also compare with a state of the art diversity model and outperform it by a good margin (gain of 7% (absolute) in ROUGE-L score). The diversification model proposed is general enough to apply to other NLG tasks with suitable modifications, and we are currently working on extending this to dialog systems and general summarization.