Guiding Generation for Abstractive Text Summarization Based on Key Information Guide Network

Neural network models based on the attentional encoder-decoder architecture perform well in abstractive text summarization. However, these models are hard to control during generation, which leads to summaries that lack key information. We propose a guiding generation model that combines the extractive method and the abstractive method. First, we obtain keywords from the text with an extractive model. Then, we introduce a Key Information Guide Network (KIGN), which encodes the keywords into a key information representation, to guide the process of generation. In addition, we use a prediction-guide mechanism, which estimates the long-term value of future decoding, to further guide summary generation. We evaluate our model on the CNN/Daily Mail dataset. The experimental results show that our best model improves over a pointer-equipped baseline by 2.5 ROUGE-1, 1.5 ROUGE-2, and 2.2 ROUGE-L points.


Introduction
Text summarization aims to generate a brief summary from an input document while retaining the key information. There are two broad approaches to summarization: extractive and abstractive. Extractive models (Mihalcea and Tarau, 2004; Yasunaga et al., 2017) usually extract a few sentences or keywords from the source text, while abstractive models (Rush et al., 2015; Nallapati et al., 2016) generate new words and phrases that are not in the source text to construct the summary.
Recently, inspired by the success of the encoder-decoder model (Sutskever et al., 2014), abstractive summarization models (Nallapati et al., 2016; See et al., 2017) have been able to generate summaries with high ROUGE scores. While these models have proved capable of capturing the regularities of text summarization, they are hard to control during generation. Without external guidance, these models simply take the source text as input and output the summary, which leads to a lack of key information. Zhou et al. (2017) propose a selective gate network to retain more key information in the summary. However, the selective gate network, which is controlled by the representation of the input text, controls the information flow from encoder to decoder only once. If some key information does not pass through the gate, it is unlikely to appear in the summary. See et al. (2017) propose a pointer-generator model, which uses the pointer mechanism (Vinyals et al., 2015) to copy words from the input text, to deal with out-of-vocabulary (OOV) words. Without external guidance, however, it is hard for the pointer to identify keywords. To address these problems, we combine the extractive model and the abstractive model and use the former to obtain keywords as guidance for the latter.
* Corresponding author: Weiran Xu
In this paper, we propose a guiding generation model for abstractive text summarization. First, we use an extractive method to obtain the keywords from the text. Then, we introduce a Key Information Guide Network (KIGN), which encodes the keywords into a key information representation and integrates it into the abstractive model, to guide the process of generation. The guidance acts mainly on two components: the attention mechanism (Bahdanau et al., 2014) and the pointer mechanism. In addition, we propose a novel prediction-guide mechanism based on He et al. (2017), which predicts the extent of key information covered in the final summary, to further guide summary generation. Experiments show that our model achieves significant improvements.

Figure 1: Our key information guide model. It consists of the key information guide network, the encoder, and the decoder. In the key information guide network, we encode the keywords into the key information representation k.
[...] et al., 2015) to deal with the unknown word problem.
Prediction-guide mechanism. Inspired by the success of AlphaGo, He et al. (2017) propose a prediction network to predict the long-term value of the final summary. Our prediction-guide mechanism is used to ensure that more key information is covered in the final summary.

Our Model
In this section, we describe (1) our baseline encoder-decoder model, (2) our key information guide network, and (3) our prediction-guide mechanism.

Attention-based encoder-decoder model
Our baseline model is similar to that of Nallapati et al. (2016). The tokens of the input article $x = \{x_1, x_2, \ldots, x_n\}$ are fed into the encoder, which maps the text into a sequence of encoder hidden states $\{h_1, h_2, \ldots, h_n\}$. At each decoding time step $t$, the decoder reads the previous word embedding $w_{t-1}$ and the previous context vector $c_{t-1}$ as inputs to obtain the decoder hidden state $s_t$. The context vector $c_t$ is calculated by using the attention mechanism:

$$e_{ti} = v^{\top} \tanh(W_h h_i + W_s s_t) \tag{1}$$

$$\alpha^e_t = \mathrm{softmax}(e_t) \tag{2}$$

$$c_t = \sum_{i=1}^{n} \alpha^e_{ti} h_i \tag{3}$$

where $v$, $W_h$, and $W_s$ are learnable parameters and $h_i$ is the hidden state of the input token $x_i$. The context vector $c_t$, which represents what has been read from the source text, is concatenated with the decoder hidden state $s_t$ to predict the next word with a softmax layer over the whole vocabulary:

$$P(y_t \mid y_1, \ldots, y_{t-1}) = \mathrm{softmax}(f(s_t, c_t)) \tag{4}$$

where $f$ represents a linear function.
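As a minimal sketch of the additive attention described in this subsection (a tanh layer over W_h h_i + W_s s_t, projected by v, followed by a softmax and a weighted sum of encoder states), the following pure-Python code may help. The helper names and the toy values are illustrative assumptions, not the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(H, s_t, W_h, W_s, v):
    """Return (alpha, c_t) for encoder states H and decoder state s_t."""
    Ws_s = matvec(W_s, s_t)
    scores = []
    for h_i in H:
        Wh_h = matvec(W_h, h_i)
        hidden = [math.tanh(a + b) for a, b in zip(Wh_h, Ws_s)]
        scores.append(sum(vj * hj for vj, hj in zip(v, hidden)))  # e_ti
    alpha = softmax(scores)  # attention distribution over source positions
    dim = len(H[0])
    # Context vector: attention-weighted sum of the encoder states.
    c_t = [sum(a * h_i[d] for a, h_i in zip(alpha, H)) for d in range(dim)]
    return alpha, c_t

# Toy example (all values are made up for illustration).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [0.5, -0.5]
W_h = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
alpha, c_t = additive_attention(H, s_t, W_h, W_s, v)
```

In a trained model the matrices would be learned and the states would come from LSTMs; here they are fixed so that the arithmetic can be followed by hand.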

Key information guide network
Most encoder-decoder models (Zhou et al., 2017; See et al., 2017) simply take the source text as input and output the summary. This makes generation hard to control and leads to a lack of key information in the summary. We propose a key information guide network to guide the process of generation through two components: the attention mechanism and the pointer mechanism.
In detail, we extract keywords from the text using the TextRank algorithm. As shown in Figure 1, the keywords are fed one by one into the key information guide network, and we then concatenate the last forward hidden state $\overrightarrow{h}_n$ and the backward hidden state $\overleftarrow{h}_1$ as the key information representation $k$:

$$k = [\overrightarrow{h}_n ; \overleftarrow{h}_1] \tag{5}$$

Attention mechanism: The traditional attention mechanism uses only the decoder state as a query to obtain the attention distribution over the encoder hidden states, so it has difficulty identifying keywords. We use the key information representation $k$ as an extra input to the attention mechanism, changing Equation (1) to:

$$e_{ti} = v^{\top} \tanh(W_h h_i + W_s s_t + W_k k) \tag{6}$$

where $W_k$ is a learnable parameter. We use the new $e_{ti}$ to obtain the new attention distribution $\alpha^e_t$ (Equation 2) and the new context vector $c_t$ (Equation 3).
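The keyword extraction step mentioned at the start of this passage can be illustrated with a simplified TextRank-style sketch: build a word co-occurrence graph and run a PageRank-like iteration over it. The window size, damping factor, iteration count, and `top_k` are assumptions chosen for illustration (the paper does not report them), and the part-of-speech filtering used in the original TextRank is omitted.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Undirected graph: two words are linked if they co-occur within `window`.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            # Each neighbor shares its score equally among its own neighbors.
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

# Toy token sequence in which "hub" co-occurs with every other word.
tokens = ["a", "hub", "b", "hub", "c", "hub", "d"]
top = textrank_keywords(tokens, window=1, top_k=1)
```

On real articles one would first tokenize, lowercase, and filter stop words; the returned keywords would then be fed to the key information guide network.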
Our key information representation k makes the attention mechanism focus more on the keywords. This is similar to introducing prior knowledge into the model.
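To make the keyword-guided attention more concrete, the sketch below builds k by concatenating the last forward state and the first-position backward state of a keyword encoder, then adds the term W_k k inside the tanh of the attention score. All function names, shapes, and numbers are illustrative assumptions rather than the authors' code.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def key_representation(forward_states, backward_states):
    """Concatenate the last forward state and the first-position backward state."""
    return forward_states[-1] + backward_states[0]

def guided_scores(H, s_t, k, W_h, W_s, W_k, v):
    """Unnormalized attention scores with the keyword term W_k k added."""
    # W_s s_t + W_k k does not depend on the source position, so compute it once.
    shared = [a + b for a, b in zip(matvec(W_s, s_t), matvec(W_k, k))]
    scores = []
    for h_i in H:
        pre = [a + b for a, b in zip(matvec(W_h, h_i), shared)]
        scores.append(sum(vj * math.tanh(p) for vj, p in zip(v, pre)))
    return scores

# Toy keyword-encoder states (two keywords, hidden size 2 per direction).
forward = [[0.1, 0.2], [0.3, 0.4]]
backward = [[0.5, 0.6], [0.7, 0.8]]
k = key_representation(forward, backward)

H = [[1.0, 0.0], [0.0, 1.0]]
s_t = [0.0, 0.0]
W_h = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
v = [1.0, 1.0]
scores = guided_scores(H, s_t, k, W_h, W_s, W_k, v)
```

Because the keyword term enters before the tanh nonlinearity, it can change the relative ordering of the scores even though it is the same for every source position.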
Then, we apply the key information representation $k$ and use the new context vector $c_t$ to calculate a probability distribution over all words in the vocabulary, changing Equation (4) to:

$$P_v(y_t \mid y_1, \ldots, y_{t-1}) = \mathrm{softmax}(f(s_t, c_t, k)) \tag{7}$$

where the subscript $v$ indicates that $y_t$ is from the target vocabulary.
Pointer mechanism: Because the vocabulary size is limited, some keywords may not be in the target vocabulary, so they would be missing from the final summary. Therefore, we take the key information representation $k$, the context vector $c_t$, and the decoder hidden state $s_t$ as inputs to calculate a soft switch $p_{sw}$, which is used to choose between generating a word from the target vocabulary and copying a word from the input text:

$$p_{sw} = \sigma(w_k^{\top} k + w_c^{\top} c_t + w_s^{\top} s_t + b_{sw}) \tag{8}$$

where $w_k$, $w_c$, $w_s$, and $b_{sw}$ are parameters and $\sigma$ is the sigmoid function.
Our pointer mechanism, which is equipped with the key information representation, is able to identify the keywords. We use the new attention distribution $\alpha^e_{ti}$ as the probability of the input token $w_i$ and obtain the following probability distribution to predict the next word:

$$P(y_t = w) = p_{sw} P_v(y_t = w) + (1 - p_{sw}) \sum_{i: w_i = w} \alpha^e_{ti} \tag{9}$$

Note that if $w$ is an out-of-vocabulary word, $P_v(y_t = w)$ is zero.
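The soft switch and the mixed generate-or-copy distribution can be sketched as follows. This is a hedged illustration that assumes p_sw weights the vocabulary distribution and 1 - p_sw weights the copy distribution; the function names, the dictionary representation, and the toy numbers are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_switch(k, c_t, s_t, w_k, w_c, w_s, b_sw):
    """Sigmoid gate computed from the key representation, context, and decoder state."""
    z = dot(w_k, k) + dot(w_c, c_t) + dot(w_s, s_t) + b_sw
    return 1.0 / (1.0 + math.exp(-z))

def final_distribution(p_sw, p_vocab, attention, source_tokens):
    """Mix the vocabulary distribution with the copy (attention) distribution.

    p_vocab maps in-vocabulary words to probabilities; attention is aligned
    with source_tokens. Out-of-vocabulary source words get copy mass only.
    """
    final = {w: p_sw * p for w, p in p_vocab.items()}
    for token, a in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_sw) * a
    return final

# Toy example: "handwriting" is out of vocabulary but present in the source.
p_vocab = {"google": 0.5, "claims": 0.3, "the": 0.2}
source_tokens = ["google", "handwriting", "google"]
attention = [0.2, 0.5, 0.3]
final = final_distribution(0.6, p_vocab, attention, source_tokens)
gate = soft_switch([0.0], [0.0], [0.0], [0.0], [0.0], [0.0], 0.0)
```

In the toy example the out-of-vocabulary word receives probability only through the copy term, which is how a keyword missing from the target vocabulary can still reach the summary.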
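During training, we minimize the negative log-likelihood (maximum-likelihood) loss at each decoding time step, which is the most widely used objective in sequence generation. We define $y^*_t$ as the target word for the decoding time step $t$, and the overall loss is:

$$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(y^*_t \mid y^*_1, \ldots, y^*_{t-1}, x) \tag{10}$$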
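A minimal sketch of this training objective is given below: the loss is the negative log-probability of each target word, averaged over the decoding steps. Averaging over the sequence length (rather than summing) is an assumption, and the dictionary-based distributions are purely illustrative.

```python
import math

def sequence_nll(step_distributions, targets):
    """Average negative log-likelihood of the target words.

    step_distributions[t] maps words to probabilities at decoding step t;
    targets[t] is the reference word at that step.
    """
    losses = [-math.log(dist[y]) for dist, y in zip(step_distributions, targets)]
    return sum(losses) / len(losses)

# Toy example with two decoding steps.
dists = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
loss = sequence_nll(dists, ["a", "a"])
```

The loss approaches zero as the model assigns probability close to one to every target word.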

Prediction-guide mechanism at test time
At test time, when predicting the next word, we consider not only the above probability (Equation 9), but also a long-term value predicted by the prediction-guide mechanism. The prediction-guide mechanism is based on He et al. (2017). Our prediction-guide mechanism, which is a single-layer feed-forward network with a sigmoid activation function, predicts the extent of the key information covered in the final summary. At each decoding time step $t$, we take mean pooling over the decoder hidden states, $\bar{s}_t = \frac{1}{t} \sum_{l=1}^{t} s_l$, and over the encoder hidden states, $\bar{h}_n = \frac{1}{n} \sum_{i=1}^{n} h_i$, and use them together with the key information representation $k$ as inputs to calculate the long-term value.
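The value prediction can be sketched as mean pooling followed by a sigmoid unit. The exact layer layout is not fully specified in the text, so the code below assumes the simplest reading: the two pooled vectors and k are concatenated and passed through one linear map with a sigmoid output. The names `mean_pool` and `predict_value` are hypothetical.

```python
import math

def mean_pool(states):
    """Element-wise mean of a list of equally sized vectors."""
    dim = len(states[0])
    return [sum(s[d] for s in states) / len(states) for d in range(dim)]

def predict_value(decoder_states, encoder_states, k, weights, bias):
    """Sigmoid value in (0, 1) from pooled decoder states, pooled encoder states, and k."""
    features = mean_pool(decoder_states) + mean_pool(encoder_states) + k
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: hidden size 2 everywhere, so the feature vector has 6 entries.
decoder_states = [[1.0, 3.0], [3.0, 5.0]]
encoder_states = [[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]]
k = [0.5, -0.5]
value = predict_value(decoder_states, encoder_states, k, [0.1] * 6, 0.0)
```

Only the decoder states generated so far are pooled, so the same function can score a partial summary at any decoding step.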
We sample two partial summaries $y_{p1}$ and $y_{p2}$ for each $x$ with a random stopping point to get $\bar{s}_t$. Then, we finish the generation from each $y_p$ (using beam search) to obtain $M$ average decoder hidden states $\bar{s}$ of the completed summaries $S(y_p)$, and compute the average score:

$$\mathrm{AvgCos}(x, y_p) = \frac{1}{M} \sum_{\bar{s} \in S(y_p)} \cos(\bar{s}, \bar{h}_n) \tag{11}$$

where $\cos$ is the cosine similarity. We want the predicted value $v(x, y_{p1})$ to be larger than $v(x, y_{p2})$ if $\mathrm{AvgCos}(x, y_{p1}) > \mathrm{AvgCos}(x, y_{p2})$. Therefore, the loss function of the prediction-guide network is as follows:

$$L_{pg} = \sum_{(x, y_{p1}, y_{p2})} e^{v(x, y_{p2}) - v(x, y_{p1})} \tag{12}$$

where $\mathrm{AvgCos}(x, y_{p1}) > \mathrm{AvgCos}(x, y_{p2})$. At test time, we first compute the normalized log probability of each candidate, and then linearly combine it with the value predicted by the prediction-guide network. In detail, given an abstractive model $P(y \mid x)$ (Equation 9), a prediction-guide network $v(x, y)$, and a hyperparameter $\alpha \in (0, 1)$, the score of a partial sequence $y$ for $x$ is computed by:

$$\alpha \times \log P(y \mid x) + (1 - \alpha) \times \log v(x, y) \tag{13}$$
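The three quantities in this passage (the average cosine target, the pairwise loss, and the decoding score) can be sketched as below. The exponential form of the pairwise loss follows the value-network formulation of He et al. (2017) and is an assumption here, as is normalizing the log probability by the sequence length; the default `alpha=0.9` is the best setting reported in the experiments.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def avg_cos(completed_states, h_bar):
    """Mean cosine similarity between M pooled decoder states and the pooled encoder state."""
    return sum(cosine(s, h_bar) for s in completed_states) / len(completed_states)

def pairwise_loss(v_better, v_worse):
    """Ranking loss: small when the better partial summary gets the larger value."""
    return math.exp(v_worse - v_better)

def guided_score(log_prob, length, value, alpha=0.9):
    """Linear combination of the length-normalized log probability and the log value."""
    return alpha * (log_prob / length) + (1.0 - alpha) * math.log(value)

# Toy example: two completed summaries (M = 2) and two beam candidates.
target = avg_cos([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
score_high = guided_score(-2.0, 4, 0.9)
score_low = guided_score(-2.0, 4, 0.5)
```

With equal model probability, the candidate whose predicted value is higher receives the higher score, so beam search prefers partial summaries expected to cover more key information.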

Experiment setting
We use the CNN/Daily Mail dataset (Nallapati et al., 2016; Hermann et al., 2015) and the scripts supplied by Nallapati et al. (2016) to obtain the same version of the data, which has 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs. We use two 256-dimensional LSTMs for the bidirectional encoder and one 256-dimensional LSTM for the decoder. In our key information guide network, the keywords are encoded in the same way as in the encoder. In addition, we use a vocabulary of 50k words for both source and target and do not pre-train the word embeddings; they are learned from scratch during training. During training and testing, we truncate the text to 400 tokens; during training, we limit the length of the summary to 100 tokens. We train using Adagrad (Duchi et al., 2011) with a learning rate of 0.15 and an initial accumulator value of 0.1. The batch size is set to 16. Following previous work, our evaluation metric is the ROUGE F-score (Lin and Hovy, 2003). For the prediction-guide mechanism, we use a single-layer feed-forward network with 800 nodes. For the hyperparameter α, we test the performance of the KIGN+Prediction-guide model with different values of α during decoding. As can be seen from Figure 2, the performance is stable for α ranging from 0.8 to 0.95, and α = 0.9 yields the highest ROUGE F-score. In addition, we set M to 8 and adopt mini-batch training with a batch size of 16. The prediction-guide network is trained with AdaDelta (Zeiler, 2012).
During training and at test time, we truncate the input to 400 tokens; we limit the length of the output summary to 100 tokens for training and 120 tokens at test time, which is similar to See et al. (2017). We trained our key information guide network (KIGN) model for fewer than 200,000 training iterations. Then we trained the single-layer feed-forward network based on the KIGN model. Finally, at test time, we combine the KIGN model and the prediction-guide mechanism to generate the summary.
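For readers unfamiliar with the optimizer settings reported in this section, the sketch below shows one Adagrad update with a learning rate of 0.15 and an initial accumulator value of 0.1. It is a plain-Python illustration of the standard update rule, not the training code used in the experiments; many implementations also add a small epsilon to the denominator, which is omitted here because the accumulator starts above zero.

```python
import math

def adagrad_step(params, grads, accumulators, learning_rate=0.15):
    """Apply one Adagrad update and return the new parameters and accumulators."""
    new_params, new_accumulators = [], []
    for p, g, a in zip(params, grads, accumulators):
        a = a + g * g  # accumulate the squared gradient per parameter
        new_accumulators.append(a)
        # Scale the step by the inverse square root of the accumulated squares.
        new_params.append(p - learning_rate * g / math.sqrt(a))
    return new_params, new_accumulators

# Toy example: two parameters, accumulators initialized to 0.1.
params = [1.0, -2.0]
accumulators = [0.1, 0.1]
grads = [0.3, 0.0]
params, accumulators = adagrad_step(params, grads, accumulators)
```

Parameters that receive large or frequent gradients accumulate larger denominators and therefore take smaller steps over time.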

Results and discussions
We compare our model with the baseline model (enc-dec+attn), hierarchical networks (Nallapati et al., 2016), and the baseline model equipped with the pointer mechanism, since our model also uses the pointer mechanism. Table 1 shows that our key information guide network exceeds the baseline model equipped with the pointer mechanism by +1.3 ROUGE-1, +0.9 ROUGE-2, and +1.0 ROUGE-L. To understand the contribution of each part, we also add only the prediction-guide mechanism to the baseline model equipped with the pointer mechanism. This variant exceeds the pointer-equipped baseline by +0.8 ROUGE-1, +0.6 ROUGE-2, and +0.7 ROUGE-L. Finally, combining the key information guide network and the prediction-guide mechanism yields the best performance: our best model exceeds the pointer-equipped baseline by +2.5 ROUGE-1, +1.5 ROUGE-2, and +2.2 ROUGE-L. We do not implement the coverage mechanism in our model, although it can greatly improve ROUGE scores (See et al., 2017).
Figure 3 shows an example of how well the summary covers the key information of the text; the bold words are the key information of the text. We compare the outputs of two models and give the gold summary. The main idea of the text is that Google handwriting input works on Android handsets, followed by an introduction to some of its functions. The baseline model equipped with the pointer mechanism produces a summary that only states that Google has cracked the problem of reading handwriting, while the summary generated by our model covers almost all the key information of the text.
Figure 3: Example text, gold summary, and model outputs.
Text (truncated): google claims to have cracked a problem that has flummoxed anyone who has tried to read a doctor's note -how to read anyone's handwriting. the firm claims the latest update to its android handsets can under 82 languages in 20 distinct scripts, and works with both printed and cursive writing input with or without a stylus. it even allows users to simply draw emoji they want to send. scroll down for video. the california search giant claims the latest update to its android handsets can understand handwriting in 82 languages in 20 distinct scripts. google says its handwriting recognition works by building on large-scale language modeling, robust multi-language ocr.
Gold: google handwriting input works on android phones and tablets. handsets can under 82 languages in 20 distinct scripts. works with both printed and cursive writing input with or without a stylus.
Baseline+pointer-mechanism: google claims to have cracked a problem that has flummoxed anyone who has tried to read a doctor 's note how to read anyone 's handwriting.
Our model: google claims the latest update to its android handsets can under 82 languages in 20 distinct scripts, and works with both printed and cursive writing input with or without a stylus.
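The ROUGE F-scores reported in this section can be approximated, for intuition only, by the unigram-overlap computation below. This is a simplified sketch that assumes whitespace tokenization and a single reference; the official ROUGE toolkit applies additional preprocessing and also reports ROUGE-2 and ROUGE-L, so the numbers it produces will differ.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F-score from clipped unigram overlap between two strings."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 of 3 candidate words and 2 of 4 reference words overlap.
score = rouge1_f("a b e", "a b c d")
```

A summary that copies more of the reference's key words raises recall, which is the effect the keyword guidance is meant to produce.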

Conclusion
In this work, we propose a guiding generation model for abstractive text summarization that combines the extractive model and the abstractive model. First, we use the extractive method to obtain keywords from the input text. Then, we introduce a key information guide network, which encodes the keywords into the key information representation, to guide the process of generation. In addition, we propose a prediction-guide mechanism to further guide the generation at test time. Experiments show that our model leads to significant improvements.