Generating Topic-Oriented Summaries Using Neural Attention

Summarizing a document requires identifying its important parts with the objective of providing a quick overview to a reader. However, a long article can span several topics, and a single summary cannot do justice to all of them. Further, the interests of readers vary, and the notion of importance can change among them. Existing summarization algorithms generate a single summary and are not capable of generating multiple summaries tuned to the interests of different readers. In this paper, we propose an attention-based RNN framework to generate multiple summaries of a single document, each tuned to a different topic of interest. Our method outperforms existing baselines, and our results suggest that the attention of generative networks can be successfully biased to look at sentences relevant to a topic and effectively used to generate topic-tuned summaries.


Introduction
Automatic text summarization is the task of generating or extracting a short text snippet that concisely embodies the content of a larger document or a collection of documents. Traditionally, researchers have used extractive methods for summarization, where a set of sentences is selected from an article and concatenated as-is to form the summary. Extractive methods are limited by their inability to paraphrase the content of the article using new sentences, and hence differ greatly from summaries created by humans, who paraphrase a given article. Recent works on summarization have proposed sequence-to-sequence models that produce summaries in a word-by-word fashion, thus 'generating' new sentences.
As articles get longer, they may span several topics, and a single summary is often insufficient to satisfy the interests of the reader. In these cases, it is desirable to produce summaries aligned with the topical interests of the reader to enable better consumption. Extractive summarization uses sentence-level features (Yang et al., 2017) that have been leveraged for producing query-focused or topic-based summaries. For RNN-based frameworks, however, such tuned summary generation is non-obvious due to the absence of explicit content-based features.
To extend the advantages of abstractive summarization to topic-tuned summaries, we propose a neural encoder-decoder framework which takes an article along with a topic of interest as input and generates a summary tuned to that topic. Our method works by training the network to pay higher attention to the parts of the input article relevant to the given topic. To overcome the lack of a dataset containing articles with multiple topic-oriented summaries, we use a novel approach to artificially create a topic-centric training corpus from the CNN/Dailymail dataset (Hermann et al., 2015) for training our model. Table 1 shows an article with different summaries generated by the proposed approach tuned towards the topics of politics, finance and social aspects. The business-oriented summary talks about the IMF's estimates of increases in taxes, while the politics summary talks about how this relates to the government's upcoming budget. The summary for the social topic elaborates on universal basic income and social security. Thus, for the same article, the proposed approach generates different summaries tuned to the input topic of interest.

Title: IMF backs Universal Basic Income in India, serves Modi govt a political opportunity
Article: Ahead of Union Budget 2018, the Narendra Modi-led government's last full-year budget to be presented in February, the International Monetary Fund (IMF) has made a strong case for India adopting a fiscally neutral Universal Basic Income by eliminating both food and fuel subsidies ...
Business: imf claim eliminating energy "tax subsidies" would require an increase in fuel taxes and retail fuel prices such as petrol prices and a tax of rs400 ($6) per tonne on coal consumption ...
Politics: narendra modi-led government's last full-year budget to be presented in february. the international monetary fund has made a strong case for india adopting a fiscally neutral universal basic income by eliminating both food and fuel subsidies ...
Social: universal basic income is a form of social security guaranteed to citizens and transferred directly to their bank accounts and is being debated globally ...
Table 1: Topic-oriented summaries generated by our method for an article (from LiveMint) touching multiple topics

Related Work
Traditional methods for summarization (Nenkova and McKeown, 2011) extract key sentences from the source text to construct the summary. Early works on abstractive summarization focused on sentence-compression-based approaches (Filippova, 2010; Berg-Kirkpatrick et al., 2011; Banerjee et al., 2015), which connected fragments from multiple sentences to generate novel sentences for the summary, or template-based approaches that generated summaries by fitting content into a template (Wang and Cardie, 2013; Genest and Lapalme, 2011).
With the advent of deep sequence-to-sequence models that generate text word-by-word (Sutskever et al., 2014), attention-based neural network models have been proposed for summarizing long sentences. Rush et al. (2015) first demonstrated the use of neural networks to generate shorter forms of long sentences. Subsequent work applied the sequence-to-sequence model to abstractive summarization of full articles. See et al. (2017) further improved performance on abstractive summarization of articles by introducing the ability to copy words from the source article (Gulcehre et al., 2016) using a pointer network (Vinyals et al., 2015), in addition to generating new words. However, all these frameworks focus on generating a single summary, and do not support topic-tuned summary generation. We use the architecture of See et al. as the starting point for our work and develop a method to generate topic-tuned summaries.
There have been some works on extending extractive summarization towards topical tuning. Lin and Hovy (2000) proposed the idea of extracting topic-based signature terms for summarization. Given a topic and a corpus of documents relevant and not relevant to the topic, a set of words characterizing the topic is extracted using a log-likelihood-based measure. Sentences which contain these chosen words are assigned more importance while summarizing. Conroy et al. (2006) further extended the method to query-based multi-document summarization by considering sentence overlap with both query terms and topic signature words.
However, all these works rely on identifying sentence-level features to compute topic affinities that are leveraged for choosing topic-specific sentences for the summary. Since sequence-to-sequence frameworks generate text in a word-by-word fashion, incorporating sentence-level statistics is not feasible in this setting. Therefore, we modify the attention of the network to focus on topic-specific parts of the input text to generate the tuned summaries.

Topic aware pointer-generator network
Our method builds on top of the pointer-generator network of See et al. which consists of an encoder and a decoder both based on LSTM architecture. Our contribution is a modified encoder to encode the article in a topic-sensitive manner. However, for the sake of completeness, we shall provide an overview of the entire network, and will reuse notations from their work to a large extent.
Given an input article as a sequence of words $a = w_1, w_2, \ldots, w_n$, and a scaled one-hot topic vector $u_t$, where $t$ is a topic in a predefined set of topics $T$, the network produces a summary $s = s_1, s_2, \ldots, s_k$ pertaining to the topic $t$. The topic vector $u_t$ has a non-zero value $c > 0$ only in the entry corresponding to the desired topic $t$; $c$ designates the amount of bias towards the topic while generating the summary, with higher values indicating higher bias. The input article is passed through an embedding layer that transforms the words into lower-dimensional embeddings $m_1, m_2, \ldots, m_n$. The topic vector $u_t$ is concatenated to each embedding to create the sequence $(m_1, u_t), (m_2, u_t), \ldots, (m_n, u_t)$. This sequence is passed to a bidirectional LSTM encoder, which computes a sequence of hidden states $h_1, h_2, \ldots, h_n$. The final hidden state serves as the initial hidden state of the decoder, which is also an LSTM. At each decoding step, an attention distribution $a^t$ is calculated over all words of the input article:
$$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{att}), \qquad a^t = \mathrm{softmax}(e^t).$$
Here, $s_t$ is the hidden state of the decoder at timestep $t$, and $v$, $W_h$, $W_s$ and $b_{att}$ are model parameters.
The attention distribution can be thought of as a probability distribution over words in the source text, which aids the decoder in generating the next word of the summary from the source words that receive high attention. Since the encoder has information about the topic of interest, the $h_i$ are computed such that the parts of the article receiving high attention are relevant to the topic; we demonstrate this later in Section 4.2. The decoder uses the attention to compute a context vector $h_t^* = \sum_i a_i^t h_i$, a weighted sum of the encoder hidden states, to determine the next word to be generated.
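The attention computation above can be sketched as follows. This is an illustrative NumPy version with toy dimensions and random weights, not the trained model; in the actual network, the encoder states already carry the topic signal, since the topic vector is concatenated to every word embedding before encoding.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(h, s_t, W_h, W_s, v, b_att):
    """Attention distribution a^t over encoder states.

    h:   (n, d_h) encoder hidden states (topic-conditioned in our model)
    s_t: (d_s,)   decoder hidden state at step t
    """
    # e_i = v^T tanh(W_h h_i + W_s s_t + b_att)
    scores = np.tanh(h @ W_h.T + s_t @ W_s.T + b_att) @ v
    return softmax(scores)

# toy dimensions and random weights, for illustration only
rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 4, 4, 3
h = rng.normal(size=(n, d_h))
s_t = rng.normal(size=(d_s,))
a_t = attention(h, s_t,
                rng.normal(size=(d_a, d_h)),   # W_h
                rng.normal(size=(d_a, d_s)),   # W_s
                rng.normal(size=(d_a,)),       # v
                np.zeros(d_a))                 # b_att
```

The result is a valid probability distribution over the n source positions, which the decoder uses as weights when forming the context vector.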
At each decoding time step, the decoder also receives as input $y_t$, the last word of the summary generated so far, and computes a scalar $p_{gen}$ denoting the probability of generating a new word from the vocabulary:
$$p_{gen} = \sigma(w_h^\top h_t^* + w_s^\top s_t + w_y^\top y_t + b_{gen}),$$
where $w_h$, $w_s$, $w_y$ and $b_{gen}$ are trained parameters. Based on $p_{gen}$, the network probabilistically decides whether to generate a new word from the vocabulary or copy a word from the source text using the attention distribution. For each word $w$ in the vocabulary, the model calculates $P_{vocab}(w)$, the probability of $w$ being newly generated next; $P_{vocab}$ is obtained by passing the concatenation of $s_t$ and $h_t^*$ through a linear transformation followed by a softmax. For each word $w$ in the input article, the total attention it receives yields its probability of being copied. Since some words occur both in the vocabulary and in the input article, they have non-zero probabilities of being newly generated as well as copied. The total probability of $w$ being the next word in the summary is therefore
$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t.$$
The second term allows the framework to copy a word from the input text using the attention distribution. The pointer-generator network further employs a coverage mechanism, which modifies the loss function to encourage diversity in the attention distributions across time steps; this prevents the network from repeating the same phrases while generating a summary. To handle out-of-vocabulary words, the network appends them to the vocabulary before the article is encoded.
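The combination of generation and copy probabilities can be sketched as below; the value of $p_{gen}$, the attention weights and the toy vocabulary are made up for the example.

```python
import numpy as np

def final_distribution(p_gen, p_vocab, att, src_ids):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t.

    p_vocab: (V,) generation distribution over the extended vocabulary
             (in-article OOV words appended at the end)
    att:     (n,) attention over the n source positions
    src_ids: (n,) extended-vocabulary id of each source word
    """
    p = p_gen * p_vocab
    # scatter-add copy probabilities: a word appearing several times
    # in the source accumulates attention mass from all its positions
    np.add.at(p, src_ids, (1.0 - p_gen) * att)
    return p

p_vocab = np.array([0.5, 0.3, 0.2, 0.0])   # id 3: an article OOV word
att = np.array([0.1, 0.6, 0.3])            # attention over 3 source words
p = final_distribution(0.8, p_vocab, att, np.array([3, 1, 3]))
```

Because both terms are scaled by complementary probabilities, the result remains a valid distribution over the extended vocabulary.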
The training loss is set to be the average negative log-likelihood of the ground truth summaries. The model is trained using back-propagation and the Adagrad gradient descent algorithm (Duchi et al., 2011).
As an alternative to the proposed architecture, one could append the topic vector to the initial hidden state of the encoder, or to the initial state of the decoder. In our experiments, however, these approaches did not produce the desired tuning. Another alternative is to use a different attention matrix $W_h$ for each topic and learn topic-specific attention biases. This also did not perform well, perhaps due to insufficient data for training each topic's $W_h$.

Generating the training data
The problem of generating multiple different summaries of the same document based on topics of interest requires a set of articles, each annotated with multiple $(u_t, s)$ pairs, where $t$ is a topic and $s$ is a summary of the article oriented to topic $t$. However, existing datasets for single-document summarization (Over et al., 2007; Hermann et al., 2015) do not provide multiple ground-truth summaries oriented to different topics for each document. Moreover, existing datasets for topic-based summarization are focused on multi-document summarization and contain multiple articles tagged with a relevant topic. For example, the DUC 2005 dataset (Dang, 2005) for topic-based multi-document summarization contains 50 topics, each with 25 to 50 documents. To address the lack of datasets with multiple summaries of a single document, we propose an approach to create such a dataset artificially, in two steps.
We begin by fixing the set of topics under consideration and learn a word-frequency-based representation for each topic using a corpus of articles in which each article is labelled with a topic. For our experiments, we use the dataset of news articles tagged with topics like politics, sports and education released at the 2017 KDD Data Science + Journalism Workshop (VoxMedia). We group the articles by topic, so that all articles having topic $t$ form the set $S_t$. We represent each topic $t$ as a vector $e_t = (n_1, n_2, \ldots, n_v)$, where $v = |V|$ is the size of the vocabulary $V = \{w_1, w_2, \ldots, w_v\}$ and $n_i$ is the number of times word $w_i$ occurs in $S_t$. We normalize $e_t$ to sum to 1.
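A minimal sketch of the topic-vector construction, assuming whitespace tokenization and a small illustrative vocabulary:

```python
from collections import Counter

def topic_vector(articles, vocab):
    """Normalized word-frequency vector e_t for one topic's article set S_t.

    articles: list of article strings belonging to the topic
    vocab:    set of vocabulary words V
    """
    # count in-vocabulary word occurrences across all articles of the topic
    counts = Counter(w for a in articles for w in a.split() if w in vocab)
    total = sum(counts.values()) or 1   # guard against an empty topic
    return {w: counts[w] / total for w in vocab}

# illustrative toy corpus and vocabulary
vocab = {"budget", "tax", "match", "goal"}
e_politics = topic_vector(["budget tax budget", "tax reform bill"], vocab)
```

In the real pipeline the vocabulary and article sets come from the VoxMedia corpus; here they are stand-ins.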
We then take a corpus of articles with human-generated summaries as a collection of $(a, s)$ pairs; for our purposes, we use the CNN/Dailymail dataset. We create an intermediate dataset consisting of $(a, u_t, s)$ triples, where $u_t$ is a one-hot representation of the topic $t$ of summary $s$. We label each summary with a topic by computing the dot product between the summary (in its bag-of-words representation) and the topic vectors extracted in the previous step. Let $\langle v_s, e_t \rangle$ denote the dot product between the bag-of-words representation of summary $s$ and the topic vector for topic $t$; for each topic $t_i$, $sim(s, t_i) = \langle v_s, e_{t_i} \rangle$ is the similarity of summary $s$ to topic $t_i$. Suppose $t_i$ has the highest similarity to summary $s$ and $t_j$ the second highest. Then we say that summary $s$ has topic $t_i$ with confidence $c = sim(s, t_i) / sim(s, t_j)$. If the confidence is below a threshold (set to 1.2 in our experiments), we drop the article and summary from our dataset; this ensures that the intermediate dataset does not include summaries with more than one dominant topic. Otherwise, we add the triple $(a, u_{t_i}, s)$ to the intermediate dataset, where the vector $u_{t_i}$ holds the confidence value $c$ in the entry corresponding to $t_i$, instead of 1 as in a standard one-hot vector. This allows us to retain the confidence of $s$ being of topic $t_i$ while training our model.
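The confidence-based labeling step might look like the following sketch; the topic vectors, summary and threshold in the example are illustrative.

```python
def label_summary(summary_bow, topic_vectors, threshold=1.2):
    """Assign a topic to a summary, or (None, None) if no topic dominates.

    summary_bow:   dict word -> count (bag-of-words v_s)
    topic_vectors: dict topic -> dict word -> weight (the e_t vectors)
    Returns (topic, confidence), where confidence is the ratio of the
    best to second-best dot product sim(s, t) = <v_s, e_t>.
    """
    sims = sorted(
        ((sum(c * e.get(w, 0.0) for w, c in summary_bow.items()), t)
         for t, e in topic_vectors.items()),
        reverse=True)
    (best, t_best), (second, _) = sims[0], sims[1]
    if second > 0 and best / second >= threshold:
        return t_best, best / second
    return None, None   # ambiguous topic: the pair is dropped

# illustrative topic vectors and summary
topics = {"politics": {"budget": 0.5, "tax": 0.5},
          "sports":   {"goal": 0.9, "tax": 0.1}}
topic, conf = label_summary({"budget": 2, "tax": 1}, topics)
```

Summaries for which `label_summary` returns `(None, None)` are discarded, mirroring the thresholding described above.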
To generate the final dataset, we follow the steps below: 1. Randomly pick $(a_1, u_{t_1}, s_1)$ and $(a_2, u_{t_2}, s_2)$ from the intermediate dataset such that $t_1 \neq t_2$.
2. Make a new article $a'$ by sequentially picking lines from $a_1$ and $a_2$: each addition randomly selects one of $a_1$ or $a_2$ and removes the next line from its beginning. This ensures that the lines from $a_1$ occur in $a'$ in the same order as in $a_1$, and likewise for $a_2$, so the sequential flow of content is retained in the merger.
3. Add $(a', u_{t_1}, s_1)$ to the final dataset.
4. Repeat step 2 to get a new article $a''$ and add $(a'', u_{t_2}, s_2)$ to the final dataset.
5. Repeat steps 1-4 until the intermediate dataset is exhausted or all remaining instances in it have the same topic.
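The random interleaving of step 2 can be sketched as follows; sentence splitting and the fixed seed are assumptions of the sketch.

```python
import random

def merge_articles(a1, a2, seed=0):
    """Interleave the sentences of two articles at random while
    preserving each article's internal sentence order (step 2 above)."""
    rng = random.Random(seed)
    s1, s2 = list(a1), list(a2)
    merged = []
    while s1 and s2:
        # randomly choose which article contributes the next line,
        # always taking from the front so internal order is kept
        merged.append((s1 if rng.random() < 0.5 else s2).pop(0))
    merged.extend(s1 or s2)   # append whatever remains of the longer one
    return merged

a1 = ["p1", "p2", "p3"]   # e.g. sentences of a politics article
a2 = ["s1", "s2"]         # e.g. sentences of a sports article
m = merge_articles(a1, a2)
```

Whatever the random choices, filtering the merged article back to either source recovers that source's sentences in their original order.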
The final dataset is used to train the proposed summarization network. Since every article in the final dataset is a combination of two original articles, and the target summary corresponds to only one of them, the model must learn to distinguish content coming from the two originals. The randomized positions of sentences from each original article ensure there is no position-specific bias the model can exploit. Since the two original articles have different topics, and the only hint the model receives about whose summary to generate is the topic of one of them, the model is forced to learn what content is relevant to the given topic and generate the summary accordingly.
We ended up with 112,360 articles in our final dataset, since many article-summary pairs from the CNN/Dailymail dataset were dropped due to insufficient confidence about their topics. Of these, 103,666 were used for training, 4,720 for validation, and 3,974 for testing.

Experimental Evaluation
To position our method against existing works, we use the following summarizers as our baselines.
Our first baseline is the vanilla pointer generator (PG) described in the original work of See et al. (2017). This method does not consider the desired topic of summary when generating a summary. For an unbiased evaluation, we use exactly those unmerged article-summary pairs of the CNN/Dailymail dataset for training and validation which were eventually incorporated in the final dataset. Then the trained model is applied to generate summaries of the test set of the final dataset.
Our next baseline is a frequency-based extraction method that selects lines from the input article which are strongly aligned with the desired topic $u_t$. For each sentence, the relevance to each of the predefined topics is calculated as the dot product between their vector representations. The sentence is assigned to the topic with the maximum relevance score, and the strength of alignment is the ratio of the highest to the second-highest relevance score.
We extract all the sentences in the article which are aligned with the target topic and run them through the pointer-generator network to create the summary. We refer to this baseline as the abstractive summarizer with frequency-based extraction (Freq-Abs). Alternatively, we take the k sentences with the highest strength of alignment to the target topic to create a purely extractive summary. k was set to 3 in accordance with the average number of sentences in the summaries of the training set (2.83). We call this method the extractive summarizer with frequency-based extraction (Freq-Ext).
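The Freq-Ext selection can be sketched as below, assuming per-sentence topic alignments have already been computed as described above; the sentences and scores are illustrative.

```python
def freq_ext_summary(sentences, alignments, target_topic, k=3):
    """Freq-Ext baseline sketch: keep the k sentences most strongly
    aligned with the target topic, in their original article order.

    alignments: list of (topic, strength) per sentence, where strength
    is the ratio of the best to second-best relevance score.
    """
    # indices of sentences assigned to the target topic, with strengths
    scored = [(strength, i) for i, (t, strength) in enumerate(alignments)
              if t == target_topic]
    # take the k strongest, then restore article order
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:k])
    return [sentences[i] for i in keep]

# toy sentences with hypothetical (topic, strength) alignments
sents = ["tax plan", "goal scored", "budget vote", "match report"]
align = [("pol", 2.0), ("sport", 1.5), ("pol", 3.0), ("sport", 2.5)]
```

Freq-Abs differs only in that all topic-aligned sentences (not just the top k) are kept and then passed through the pointer-generator network.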
Our last baseline is a topic-signature based approach which also works by extracting sentences from the article which are aligned to the target topic. However, the selection of sentences is based on topic signatures as described by Lin and Hovy (2000) and Conroy et al. (2006) instead of word frequencies. A topic signature is a set of words relevant to the topic. For any given sentence, the number of signature terms of each topic is computed. The sentence is designated to belong to the topic which has the highest number of its signature terms occurring in it.
The topic signature is determined from a set of documents $T$ relevant to the topic and a set of background documents $\bar{T}$ representative of general content. It is assumed that in $T$ and $\bar{T}$, each word $w$ occurs according to a binomial distribution with probability of occurrence $p$. The likelihood of observing $T$ and $\bar{T}$ is calculated under two hypotheses: one where the probability of occurrence of $w$ is $p_1$ in $T$ and $p_2$ in $\bar{T}$ with $p_1 > p_2$, and another where it is the same $p$ in both $T$ and $\bar{T}$. The ratio of the two likelihoods is computed, and the words for which this ratio is highest are included in the topic's signature.
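The likelihood-ratio computation can be sketched as follows, using the standard binomial log-likelihood form; the counts in the example are made up. Note that Lin and Hovy (2000) additionally keep only words with $p_1 > p_2$.

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood of k occurrences in n trials at rate p."""
    if p <= 0 or p >= 1:
        return 0.0
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for a word: H1 (separate rates p1 in the topic
    corpus, p2 in the background) against H0 (one shared rate p).

    k1/n1: occurrences / total words in the topic corpus T
    k2/n2: occurrences / total words in the background corpus
    """
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
                - log_l(p, k1, n1) - log_l(p, k2, n2))
```

A word frequent in the topic corpus but rare in the background scores high, while a word with the same rate in both corpora scores near zero, which is why ranking by this ratio surfaces topic-characteristic terms.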
We extracted topic signatures using the summaries of the training dataset as our corpus. For each topic $t$, the corresponding summaries form the topic-specific corpus $T$, and the remaining summaries form the background corpus. Table 2 shows a subset of the topic signatures. Analogous to Freq-Abs and Freq-Ext, we try two alternatives here as well: the abstractive summarizer with signature-based extraction (Sign-Abs) and the extractive summarizer with signature-based extraction (Sign-Ext).

Performance on the created dataset
We used the 3,974 article-topic-summary tuples from our final dataset to evaluate the performance of the summarizers. The models were given the input article and topic, and the generated summary was compared with the ground-truth summary. We use ROUGE scores to measure summary quality: ROUGE measures the precision, recall and F-measure of n-gram occurrences in the generated summary with respect to the reference human-generated summary. We use the ROUGE-1, ROUGE-2 and ROUGE-L variants (Lin and Och, 2004), which consider unigram, bigram and longest-common-subsequence overlaps between generated and reference summaries. Table 3 shows the ROUGE F1 scores for the topic-based summaries generated by the different methods; the proposed method outperforms all the baselines.
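For illustration, ROUGE-N F1 on small examples can be computed as below; reported numbers use the standard ROUGE toolkit, and this sketch assumes whitespace tokenization.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: n-gram overlap between candidate and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((c & r).values())   # clipped n-gram match counts
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

ROUGE-L, used alongside ROUGE-1 and ROUGE-2 in the evaluation, replaces the n-gram overlap with the longest common subsequence.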
Further, we observed that the summaries generated by our system show abstractive behaviour, as noted in See et al. (2017). Table 4 shows instances where our model used new words unseen in the article:

Article: spain 's 2-0 defeat by holland on tuesday brought back bitter memories of their disastrous 2014 world cup , but coach vicente del bosque will not be too worried ...
Summary: holland beat spain 2-0 at the amsterdam arena on tuesday night

Article: it 's 11 years since arsenal won the title . they went from invincibles to incapables ...
Summary: arsene wenger 's side have 15 wins in 17 appearances

Performance on multi-topic articles
We ran the proposed model on original articles from the CNN/Dailymail dataset which were not part of the final dataset used for training. Since these articles were not annotated with topics, we generated summaries for all topics. We then detected articles where the method generated different summaries for different topics, suggesting the presence of more than one topic in the article. Table 5 shows a few summaries generated by the proposed approach where different summaries were generated aligned to the input topics. The first article talks about the dropping of a player from a football squad; the summary oriented to the military topic talks about the assault of a police officer by the player and his criminal history.
The sports summary talks about the player's fate in the remaining games of the season. Similarly, we have different summaries for the second article, where the education oriented summary talks about the educational affiliations of the suspects and disciplinary procedures, whereas the military summary talks about the arrests. Note that these are among the original articles in the dataset which were not used for creating any of the articles in our final data used for training the model.
We also observe that the attention distribution varies with the input topic. We use the coverage of each term (as defined by See et al.), i.e., the attention it receives summed over all decoding steps during generation. An example of the variation in attention over an input article is shown in Figure 1, where the degree of yellow hue represents the coverage of each term. For the topic military, words like jail and assaulting receive higher attention, whereas for the topic sports, words like games and player are highlighted. Figure 1(b) also shows an example where our model summarizes by skipping a clause ("which kick off at 3pm").

Human evaluation of performance
Finally, we performed human evaluations of summaries to compare the quality of our method against the baselines. We restrict ourselves to the best-performing baseline, the topic-signature-based abstractive summarizer (Sign-Abs), and the vanilla pointer-generator. We fetched articles touching multiple topics using the NYTimes Search API, which allows searching for articles that appear under the news desk for one topic (the major topic) and are tagged with another topic (the minor topic). For different pairs of topics, we retrieved relevant articles using the API and randomly selected a few of them to generate summaries tuned to the two topics using our method and the baselines. We retrieved 18 articles in total; each had two topics, leading to 36 (article, topic) pairs for summary generation. For each (article, topic) pair, annotators were shown two summaries: one generated by our method and the other by one of the baselines. The task was to choose the summary more relevant to the topic. Each task was annotated by 10 different annotators. Every annotator was assigned 9 tasks plus 1 dummy task to check whether they were paying attention; annotations from evaluators who answered the dummy task incorrectly were discarded. We collected a total of 720 annotations from the human evaluation, summarized in Table 6.
The value under "Overall annotations" refers to the fraction of all human responses, across all documents, that rated the summary produced by our method better than the alternative produced by the compared baseline. Understandably, the proposed approach is only comparably preferred to the baselines for the major topic (0.5667), since most standard summaries cover the primary topic of the input. For the minor topic, however, the proposed approach is clearly preferred over the baselines.
The value under "Document annotations" indicates the fraction of times the proposed method was preferred by half or more of the annotators for a document-topic pair. It is easy to see that the proposed approach clearly outperforms the baselines under this scenario. The difference in performance is even more significant for summaries generated for the non-major topic since our approach is capable of efficiently generating tuned summaries for minor topics as well.

Conclusion
We proposed a method for generating multiple abstractive summaries of a given document oriented towards different topics of interest. Our method works by modifying the attention mechanism of a pointer-generator neural network to make it focus on text relevant to a topic. To train the network, we devised a novel way to create a dataset in which articles are tagged with topic-oriented summaries. Our method outperformed previous feature-based methods for topic-oriented summarization using word frequencies or log-likelihood ratios.

Table 6: Evaluation of summaries of the proposed approach against the Pointer Generator framework and the topic-signature-based summarizer by human annotators