Concept Pointer Network for Abstractive Summarization

A quality abstractive summary should not only copy salient source text but should also generate new conceptual words to express concrete details. Inspired by the popular pointer generator sequence-to-sequence model, this paper presents a concept pointer network for improving these aspects of abstractive summarization. The network leverages knowledge-based, context-aware conceptualizations to derive an extended set of candidate concepts. The model then points to the most appropriate choice using both the concept set and the original source text. This joint approach generates abstractive summaries with higher-level semantic concepts. The training model is also optimized in a way that adapts to different data, based on a novel method of distantly-supervised learning guided by the reference summaries and the testing set. Overall, the proposed approach provides statistically significant improvements over several state-of-the-art models on both the DUC-2004 and Gigaword datasets. A human evaluation of the model's abstractive abilities also supports the quality of the summaries produced within this framework.


Introduction
Abstractive summarization (ABS) has gained overwhelming success owing to the tremendous development of the sequence-to-sequence (seq2seq) model and its variants (Rush et al., 2015; Chopra et al., 2016; Paulus et al., 2017; Guo et al., 2018; Gao et al., 2019). In tandem with seq2seq models, the pointer generator was developed by See et al. (2017) as a solution to the rare-word and out-of-vocabulary (OOV) problems associated with generative models. The idea is to use attention as a pointer to determine the probability of generating a word from both a vocabulary distribution and the source text. Pointer generator networks have also been widely accepted by the ABS community due to their efficacy with long document summaries (Chen and Bansal, 2018; Hsu et al., 2018), title summarization, etc.
However, the current power of abstractive summarization falls short of its potential. As the example in Figure 1 shows, a seq2seq model with a pointer mechanism (marked as the direct pointer) is likely to merely copy parts of the original text to form a summary using keywords and phrases, such as "317 athletes". Conversely, a more humanlike summary would be based on one's own understanding of the detail in the words, expressed as higher-level concepts drawn from world knowledge, like using the word "group" to replace "athletes and officials". This indicates that a good summary should not simply copy original material; it should also generate new and even abstract concepts that reflect high-level semantics.
Therefore, a pointer generator network that solely considers the source material to generate a summary does not adequately satisfy the needs of high-quality abstractive summarization. We argue that concepts have a greater ability to express deeper meanings than verbatim words. As such, it is essential to explore the potential of using concepts from world knowledge to assist with abstractive summarization. Our developed model not only points to informative source texts but also leverages conceptual words from human knowledge in the summaries it generates.
Hence, in this paper, we propose a novel model based on a concept pointer generator that encourages the generation of conceptual and abstract words. As a side benefit, the model also alleviates OOV problems. Our model uses a pointer network to capture the salient information from a source text, and then employs another pointer to generalize the detailed words according to their upper-level expressions. Finally, the seq2seq generator keeps the output consistent with the language model. Unique to our concept pointer is a set of concept candidates, particular to each word, that is drawn from a huge knowledge base. The set of candidates adheres to a concept distribution, where the probability of each concept being generated is linked to how strongly the candidate represents each word. Moreover, the concept distribution is iteratively updated to better explain the target word given the context of the source material and the inherent semantics in the texts. Hence, the learned concept pointer points to the most suitable and expressive concepts or words. The optimization function is adaptive so as to cater for different datasets with distantly-supervised training. The network is then optimized end-to-end using reinforcement learning, with the distant-supervision strategy as a complement to further improve the summary.
Overall, the contributions of this paper are: 1) a novel concept pointer generator network that leverages context-aware conceptualization and a concept pointer, both of which are jointly integrated into the generator to deliver informative and abstract-oriented summaries; 2) a novel distant supervision training strategy that favors model adaptation and generalization, and that outperforms the well-accepted evaluation-based reinforcement learning optimization on a test-only dataset in terms of ROUGE metrics; 3) a statistical analysis of quantitative results and human evaluations from comparative experiments with several state-of-the-art models, which shows that the proposed method provides promising performance.

Related Work
Abstractive summarization supposedly digests and understands the source content and, consequently, the generated summaries are typically a reorganization of the wording that sometimes forms new sentences. Historically, abstractive summarization has been performed through rule-based sentence selection (Dorr et al., 2003), key information extraction (Genest and Lapalme, 2011), syntactic parsing (Bing et al., 2015) and so on. However, more recently, seq2seq models with attention have played a more dominant role in generating abstractive summaries (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; Zhou et al., 2017). Extensions to the seq2seq approach include intra-decoder attention (Paulus et al., 2017) and coverage vectors (See et al., 2017) to decrease repetition in phrasing. The copy mechanism (Gu et al., 2016) has been integrated into these models to tackle the OOV problem. Zhou et al. (2018) went on to propose SeqCopyNet, which copies complete sequences from an input sentence to further maintain the readability of the generated summary.
The pointer mechanism (Vinyals et al., 2015) has drawn much attention in text summarization (See et al., 2017), because this technique not only provides a potential solution for rare words and OOV but also extends abstractive summarization in a flexible way (Çelikyilmaz et al., 2018). Further, pointer generator models can effectively adapt to both extractor and abstractor networks (Chen and Bansal, 2018), and summaries can be generated by incorporating a pointer-generator and multiple relevant tasks (Guo et al., 2018), such as question or entailment generation, or multiple source texts.
However, work that specifically targets the problem of abstraction is rare. Abstract Meaning Representation (AMR) has been used to transform a sentence into a concept graph and then merge similar concept nodes to form a new summary graph (Liu et al., 2018). Concepts have also been incorporated as auxiliary features (Guo et al., 2017). Kryscinski et al. (2018) and Weber et al. (2018) define the number of new n-grams as the primary criterion of abstractiveness. This makes sense in most cases. However, we believe that abstraction means summarizing detailed content with higher-level, semantically related concepts, which has motivated the development of the model proposed in this paper.

The Proposed Model
Neural abstractive summarization can be described as a generation process where a sequential input is summarized into a shorter sequential output through a neural network. Suppose that the sequential input $x = \{x_1, \dots, x_i, \dots, x_n\}$ is a sequence of $n$ words, where $i$ is the index of the input. The shorter (i.e., summarized) output sequence is denoted as $y = \{y_1, \dots, y_t, \dots, y_m\}$ with $m$ words, where $t$ indicates a time step. As Figure 2 shows, our model consists of two sub-modules: an encoder-decoder module and the proposed concept pointer generator module.

Encoder-Decoder Framework
This process is formulated as an encoder-decoder framework that consists of an encoder and an attention-equipped decoder. We use a two-layer bi-directional LSTM-RNN encoder and one-layer uni-directional LSTM-RNN decoder along with attention mechanism.
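Formally, the encoder produces sequential hidden states, a forward sequence $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_n)$ and a backward sequence $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_n)$, in the corresponding positions. Each word $x_i$ in the sequence can be represented as a concatenation of the bi-directional hidden states, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. The decoder generates a target summary from a vocabulary distribution $P_{vocab}(w)$, which is based on a context vector $h^*_t$ through the following process:
$$P_{vocab}(w) = \mathrm{sfm}(W_2 (W_1 [s_t, h^*_t] + b_1) + b_2) \quad (1)$$
where $s_t$ is the hidden state of the decoder at time step $t$, and $h^*_t$ is the context vector at time step $t$. $W_1$, $W_2$, $b_1$, $b_2$ are trainable parameters, and $\mathrm{sfm}(\cdot)$ is short for the softmax function.
The context vector $h^*_t$ is computed as a weighted sum of the hidden representations of the source text, and the weight is denoted as the attention $\alpha_{t,i}$:
$$\alpha_{t,i} = \mathrm{sfm}(v^{\top} \tanh(W_h h_i + W_s s_t + b)), \qquad h^*_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \quad (2)$$
The softmax function normalizes the scores into a distribution over the input positions, and $v$, $W_h$, $W_s$, $b$ are trainable parameters.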
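As a minimal sketch of the attention step described above, the following pure-Python code computes the attention weights over the encoder states and the resulting context vector. It assumes the additive attention score of See et al. (2017), which matches the listed parameters $v$, $W_h$, $W_s$, $b$; the function names and the toy values are illustrative assumptions and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_context(H, s_t, W_h, W_s, b, v):
    """Return (alpha_t, h_star_t) for encoder states H and decoder state s_t.

    Assumed score: e_i = v^T tanh(W_h h_i + W_s s_t + b).
    """
    Ws_s = matvec(W_s, s_t)
    scores = []
    for h_i in H:
        Wh_h = matvec(W_h, h_i)
        hidden = [math.tanh(p + q + r) for p, q, r in zip(Wh_h, Ws_s, b)]
        scores.append(sum(vi * u for vi, u in zip(v, hidden)))
    alpha = softmax(scores)  # attention over input positions
    dim = len(H[0])
    # Context vector: attention-weighted sum of the encoder states.
    h_star = [sum(alpha[i] * H[i][d] for i in range(len(H))) for d in range(dim)]
    return alpha, h_star


# Toy example with three source positions and made-up parameters.
H = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.5]]
alpha, h_star = attention_context(
    H, s_t=[0.5], W_h=[[0.1, 0.2], [0.3, 0.4]], W_s=[[0.5], [0.6]],
    b=[0.0, 0.0], v=[1.0, -1.0],
)
```

The weights in `alpha` sum to one, so `h_star` always lies in the convex hull of the encoder states; this is the vector that the later conceptualization step conditions on.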

Concept Pointer Generator
Pointer networks use attention as a pointer to select segments of the input as outputs (Vinyals et al., 2015). As such, a pointer network is a suitable mechanism for extracting salient information, while remaining flexible enough to interface with a seq2seq model for generating an abstractive summary (See et al., 2017). Our proposed model is essentially an upgrade to this configuration that integrates a new concept pointer network within a unified framework.

Context-aware Conceptualization
"Understanding" the instances of a word requires a taxonomic knowledge base that relates those words to a concept space. In our model, we use an isA taxonomy, called the Microsoft Concept Graph (Wang et al., 2015), to serve this purpose for two reasons (Wang and Wang, 2016). First, this graph provides a huge concept space with multi-word terms that cover worldly facts as concepts, instances, relationships, and values. Second, the relationships between concepts and entities are probabilistic, serving as a measure of how strongly they are related. Moreover, the probabilities are trustworthy given they have been derived from evidence found in billions of webpages, search log data, and other existing taxonomies. Our model is data-driven and, therefore, is more easily adaptable with probabilities. All these characteristics make the Microsoft Concept Graph a suitable choice for our model. More detailed examples are available in Appendix A.
The concept graph specifies the probability that each instance $x$ belongs to a concept $c$, $p(c|x)$. Given a word $x$, we have a distribution over a set of related concepts. Yet, this raises the question of how to identify a context-appropriate concept for a word from the distributional set of candidate concepts. For instance, apple in the context of "an apple is good for your health" tends to be associated with the concept of fruit instead of company. Formally, given a word $x_i$ in a training sentence, a set of $k$ concept candidates, $C_i = \{c_i^1, c_i^2, \dots, c_i^k\}$, is linked to the word from the knowledge base, with distributional probabilities over the concepts, i.e., $\{p(c_i^1), p(c_i^2), \dots, p(c_i^k)\}$.
Figure 2: The architecture of our model. The blue bars represent the attention distribution over the inputs. The purple bars represent the concept distribution over the inputs. Note that this distribution can be sparse, since not every word has an upper concept. The green bars represent the vocabulary distribution generated from the seq2seq component.
The task is to find the most suitable concept $c_i^j$ to fit the updated context, represented by the vector $h^*_t$ in Equation (2), at each time step $t$.
In the case of generating summaries given updated contexts, a weighted update of the distributional concept candidates needs to be performed. In the model, the updated weight, denoted as $\beta_i^j$, is estimated by a softmax classifier that is jointly conditioned on the hidden representation of the word $h_i$, the context vector $h^*_t$, and each of the concept vectors $c_i^j$, where $j \in [1, k]$. The classifier has a trainable parameter $W_h$, and $c_i^j$ is the vector of the $j$-th concept candidate, which is a representation from the input embeddings.
Together with the concept probability from the existing knowledge base $p(c_i^j)$ and the updated weights based on the contexts $\beta_i^j$, a context-aware conceptualized probability of the $j$-th concept for the word $x_i$, $P^c_{i,j}$, is finally estimated by combining the two quantities (Equation (4)), where $\gamma$ is a tunable parameter. Theoretically, we end up with $k$ relevant concepts for each word, $C_i = \{c_i^1, \dots, c_i^k\}$, with a learned probability distribution over the set, $P^c_i = \{P^c_{i,1}, \dots, P^c_{i,k}\}$.
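The conceptualization step above can be illustrated with a small sketch. The text does not preserve the exact form of the classifier or of the combination with the knowledge-base prior, so both are assumptions here: the classifier is taken to be a single linear score over the concatenation of $h_i$, $h^*_t$, and $c_i^j$, and the combination is taken to be a convex mix $\gamma \beta_i^j + (1 - \gamma) p(c_i^j)$. All names, vectors, and prior values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concept_weights(h_i, h_star_t, concept_vectors, w):
    """Context-dependent weights beta_i^j over the k candidates of one word.

    ASSUMPTION: one linear score per candidate over concat(h_i, h*_t, c_i^j);
    w is a hypothetical weight vector standing in for the trained classifier.
    """
    scores = []
    for c in concept_vectors:
        features = h_i + h_star_t + c  # list concatenation
        scores.append(sum(wi * f for wi, f in zip(w, features)))
    return softmax(scores)


def conceptualized_probs(prior, beta, gamma):
    """ASSUMPTION: convex combination of context weight and KB prior."""
    return [gamma * b + (1.0 - gamma) * p for b, p in zip(beta, prior)]


# Toy example for the word "apple" with two candidates: fruit, company.
prior = [0.4, 0.6]                       # made-up p(c|x) values
concepts = [[1.0, 0.0], [0.0, 1.0]]      # made-up concept embeddings
beta = concept_weights([0.2, 0.1], [0.5, -0.3], concepts,
                       w=[0.0, 0.0, 0.0, 0.0, 2.0, 0.0])
P_c = conceptualized_probs(prior, beta, gamma=0.1)
```

Under this assumed mix, a small $\gamma$ keeps the result close to the knowledge-base prior, while a large $\gamma$ lets the context decide; whether the original model normalizes the result differently is not recoverable from the text.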

Concept Pointer Generator
The basic pointer generator network contains two sub-modules: the pointer network and the generation network. These two sub-modules jointly determine the probabilities of the words in the final generated summary. The generation probability $p_{gen}$ for the generation network is learned as in See et al. (2017).
For the pointer network, our model consists of a pointer to the source text and a further concept pointer to the relevant concepts that have arisen from the source content. These two separate pointers are calculated as follows. The first pointer is based on the attention distribution $\alpha_{t,i}$ over the source text. The second, the concept pointer, operates over a concept distribution of the source text that is scaled element-wise by the attention distribution.
To train the model, given the likelihood of each concept in the current context, the updates could be performed in two ways. In a hard assignment, the concept that receives the highest score is selected for the update:
$$P^c_{i,\mathrm{argmax}} = P^c_{i,a}, \qquad a = \arg\max_{j} \beta_i^j \quad (6)$$
where $a$ is the index of the maximum generated weight based on the contexts, and $P^c_{i,a}$ is obtained by Equation (4).
In random selection, each of the concept candidates could be trained randomly to update the parameters:
$$P^c_{i,\mathrm{random}} = P^c_{i,j}, \qquad j = \mathrm{random}([1, k]) \quad (7)$$
where $j$ represents the selected concept index. Considering the above baseline generation network and both pointer networks, our final output distribution combines the vocabulary distribution $P_{vocab}$, weighted by $p_{gen}$, with the two pointer distributions, where $P^c_i$ can be updated by $P^c_{i,\mathrm{argmax}}$ or $P^c_{i,\mathrm{random}}$. The difference between these two choices is demonstrated in the Experiments section.
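The two selection strategies and the way the pointers feed the output distribution can be sketched as follows. The selection functions follow the description of hard assignment and random selection; the mixture in `final_distribution` is an assumption modeled on the standard pointer-generator mix, with an extra concept term scaled by attention and a final renormalization that the source does not confirm. All names and numbers are hypothetical.

```python
import random


def select_concept(p_c_i, beta_i, strategy, rng=None):
    """Return (index, probability) of the concept used for one source word."""
    if strategy == "hard":
        # Hard assignment: candidate with the largest context weight.
        a = max(range(len(beta_i)), key=lambda j: beta_i[j])
    elif strategy == "random":
        # Random selection: any candidate index, chosen uniformly here.
        a = (rng or random).randrange(len(p_c_i))
    else:
        raise ValueError("strategy must be 'hard' or 'random'")
    return a, p_c_i[a]


def final_distribution(p_gen, p_vocab, alpha, source_words, concepts):
    """Mix generator, source pointer, and concept pointer into one distribution.

    concepts[i] is None or a (concept_word, probability) pair for position i.
    ASSUMPTION: pointer mass (1 - p_gen) is shared by copied source words
    (weight alpha_i) and their concepts (weight alpha_i * P^c_i), and the
    result is renormalized to sum to one.
    """
    scores = {w: p_gen * p for w, p in p_vocab.items()}
    for i, word in enumerate(source_words):
        scores[word] = scores.get(word, 0.0) + (1.0 - p_gen) * alpha[i]
        if concepts[i] is not None:
            c_word, c_prob = concepts[i]
            scores[c_word] = scores.get(c_word, 0.0) + (1.0 - p_gen) * alpha[i] * c_prob
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


# Toy example: "athletes" has the candidate concept "group"; "left" has none.
idx, prob = select_concept([0.8, 0.2], [0.9, 0.1], "hard")
dist = final_distribution(
    p_gen=0.5,
    p_vocab={"the": 0.6, "group": 0.4},
    alpha=[0.7, 0.3],
    source_words=["athletes", "left"],
    concepts=[("group", prob), None],
)
```

Because the source word "athletes" is absent from the toy vocabulary, it can only receive probability through the pointer, which is the mechanism that alleviates the OOV problem; the concept "group" receives mass from both the generator and the concept pointer.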

Basic MLE
The baseline objective is derived by maximizing the likelihood for the seq2seq generation, given a reference summary $y^* = \{y^*_1, y^*_2, \dots, y^*_m\}$ for document $x$. The training objective is to minimize the negative log-likelihood of the target word sequence:
$$L_{MLE} = -\sum_{t=1}^{m} \log P(y^*_t \mid y^*_1, \dots, y^*_{t-1}, x) \quad (9)$$

Evaluation based Reinforcement Learning (RL)
Similar to Paulus et al. (2017), policy gradient methods can directly optimize discrete target evaluation metrics, such as ROUGE. The basic idea is to explore new sequences and compare them to the best greedy baseline sequence. Once the baseline sequence $\hat{y}$ and the sampled sequence $y^s$ are generated, they are compared against the reference sequence $y^*$ to compute the rewards $r(\hat{y})$ and $r(y^s)$, respectively. In the RL training stage, two separate output candidates are produced at each time step: $y^s$ is sampled from the probability distribution $P(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x)$, and $\hat{y}$ is the baseline output. The training objective is then to minimize
$$L_{RL} = (r(\hat{y}) - r(y^s)) \sum_{t=1}^{m} \log P(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x) \quad (10)$$
It is noteworthy that the samples $y^s$ are selected from a wide vocabulary that is extended by all the concept candidates. This strategy ensures that the model learns to generate sequences with higher rewards by better exploring a set of close concepts.
Thus, the combination of these two objectives yields improved task-specific scores while maintaining a better language model: $L_{final} = \lambda L_{RL} + (1 - \lambda) L_{MLE}$, where $\lambda$ is a soft switch between the two objectives. The model is pre-trained with the MLE loss and then switched to the final loss.
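The three losses in this section can be written out in a few lines. This is a sketch on scalar probabilities, not a trainable implementation: the per-step probabilities and rewards are assumed to be given, and the function names are illustrative. In an actual training loop the rewards would be treated as constants so that gradients flow only through the log-probabilities.

```python
import math


def sequence_log_likelihood(step_probs):
    """Sum of log P(y_t | y_<t, x) over the time steps of one sequence."""
    return sum(math.log(p) for p in step_probs)


def mle_loss(reference_step_probs):
    """Negative log-likelihood of the reference summary."""
    return -sequence_log_likelihood(reference_step_probs)


def rl_loss(reward_baseline, reward_sample, sample_step_probs):
    """Self-critical loss: (r(y_hat) - r(y^s)) * log-likelihood of y^s."""
    return (reward_baseline - reward_sample) * sequence_log_likelihood(sample_step_probs)


def final_loss(l_rl, l_mle, lam=0.99):
    """Mixed objective: lam * L_RL + (1 - lam) * L_MLE."""
    return lam * l_rl + (1.0 - lam) * l_mle


# Toy example with made-up probabilities and ROUGE-like rewards.
l_mle = mle_loss([0.5, 0.5])
l_rl = rl_loss(reward_baseline=0.30, reward_sample=0.50, sample_step_probs=[0.5, 0.25])
total = final_loss(l_rl, l_mle)
```

When the sampled sequence earns a higher reward than the greedy baseline, the coefficient is negative, so minimizing the loss raises the likelihood of that sample; when it earns less, the likelihood is pushed down.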

Distant Supervision (DS) for Model Adaption
Our intuition is that, if the summary-document pairs are dissimilar to the testing set, the model could be retrained to weaken the influence of that dissimilarity on the final loss. The result would be a training model that better fits the specific testing data. The challenge is that there are no explicit supervision labels to indicate whether the training set is close to the testing set, so a new training paradigm is needed. In answer to this need, and also to provide end-to-end functionality in the model, we developed a simple approach for labeling summary-document pairs by calculating the Kullback-Leibler (KL) divergence between each training reference summary and a set of testing documents. In this way, the training pairs are distantly labeled for training the model. Specifically, the representations of the reference summaries and the testing set are computed by summing all the involved word embeddings. Given a testing document $x^d_l$, where $l \in [1, N_d]$ and $N_d$ is the size of the testing corpus, the vector-based representation of one document is $x^d_l = \exp(\sum_{i=1}^{n} x_i)$, where $n$ is the number of document words involved. The reference summary is represented by $y^* = \exp(\sum_{t=1}^{m} y^*_t)$. We normalize these vectors through a softmax function to cater for the KL calculation. The model adaption with the distant labels is defined in terms of $KL(y^*, x^d_l)$, which indicates the KL divergence between $y^*$ and $x^d_l$, and $\pi$, a constant parameter that is tuned via adaption to the testing set. The divergences are averaged within the testing set, which indicates the overall distance between the testing set and each of the reference summary-document pairs. In this way, the samples in the training corpus are distantly annotated as either relevant or irrelevant for model adaption, noting that the model is pre-trained with the MLE loss before switching to distantly-supervised training.
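The distant labeling procedure can be sketched as follows. The representation step (exponential of summed embeddings, then softmax) and the averaged KL divergence follow the description above. The final decision rule, which marks a training pair as relevant when its average divergence is below $\pi$, is an assumption, since the exact loss is not preserved in the text. Function names and the toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bag_representation(word_vectors):
    """exp of the element-wise sum of word embeddings, softmax-normalized.

    Suitable for small toy values only; large sums would overflow math.exp.
    """
    dim = len(word_vectors[0])
    summed = [sum(vec[d] for vec in word_vectors) for d in range(dim)]
    return softmax([math.exp(s) for s in summed])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def average_kl(summary_repr, test_reprs):
    """Mean divergence between one reference summary and all test documents."""
    return sum(kl_divergence(summary_repr, x) for x in test_reprs) / len(test_reprs)


def is_relevant(summary_repr, test_reprs, pi):
    """ASSUMED rule: a training pair is relevant if its average KL is below pi."""
    return average_kl(summary_repr, test_reprs) < pi


# Toy example with 2-d embeddings (made-up values).
summary = bag_representation([[0.1, 0.2], [0.0, 0.1]])
test_docs = [bag_representation([[0.3, -0.1]]), bag_representation([[0.1, 0.3]])]
label = is_relevant(summary, test_docs, pi=1.68)
```

The threshold value 1.68 is the $\pi$ reported for Gigaword in the training setup; how it enters the loss in the original model may differ from the simple comparison used here.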

Experiments
Datasets: To evaluate the effectiveness of our proposed model, we conducted training and testing on two popular datasets. The first was the English Gigaword Fifth Edition corpus (Parker et al., 2011). We replicated the pre-processing steps in Rush et al. (2015) to obtain the same training/testing data. After pre-processing, the corpus contained about 3.8M sentence-summary pairs as the training set and 189K pairs as the development set. Once pairs with empty titles were removed, the testing set numbered 1951 pairs. The second dataset, DUC-2004, was only used for testing. This dataset consists of 500 document-headline summary pairs, where each document is paired with four reference summaries written by humans.
Table 1: Bold scores are the best between the two optimization strategies. A mark indicates that the improvements from the baselines to the concept pointer are statistically significant using a two-tailed t-test (p < 0.01).
Models                              Gigaword                  DUC-2004
                                    RG-1    RG-2    RG-L      RG-1    RG-2    RG-L
ABS+ † (Rush et al., 2015)          29.76   11.88   26.96     28.18   8.46    23.81
Luong-NMT † (Luong et al., 2015)    33
Evaluation Metrics: We used ROUGE (Lin, 2004) as the evaluation metric, which measures the quality of a summary by computing the overlapping lexical elements between the candidate summary and a reference summary. Following previous practice, we assessed RG-1 (unigram), RG-2 (bigram) and RG-L (longest common subsequence, LCS). Note that the English Gigaword testing set contains references of different lengths, while the DUC-2004 testing set fixes the summary length to 75 bytes.
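To make the three metrics concrete, the sketch below computes recall-only versions of RG-1, RG-2, and RG-L on whitespace tokens. It is a simplification for illustration: the official ROUGE toolkit adds options such as stemming and length limits and also reports precision and F-scores, none of which are modeled here. The example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    if not ref:
        return 0.0
    # Clipped overlap: a reference n-gram is matched at most as often
    # as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


candidate = "the group left for the games".split()
reference = "the group departed for the games".split()
rg1 = rouge_n_recall(candidate, reference, 1)
rg2 = rouge_n_recall(candidate, reference, 2)
rgl = rouge_l_recall(candidate, reference)
```

Since all three scores reward lexical overlap with the reference, a summary that replaces source words with correct higher-level concepts is only credited when the reference uses the same concept words.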
Training Setups: We initialize word embeddings with 128-d vectors and fine-tune them during training. Concepts share the same embeddings with the words. The vocabulary size was set to 150k for both the source and target text. The hidden state size was set to 256. The vocabulary is extended by around 602 to 2216 concepts, depending on the number of concept candidates (k = 1, ..., 5) for each word. Note that the generated concepts with UNKs were subsequently deleted. Our code is available at https://github.com/wprojectsn/codes, and the vocabularies and candidate concepts are also included. We trained our models on a single GTX TITAN GPU machine. We used the Adagrad optimizer with a batch size of 64 to minimize the loss. The initial learning rate and the accumulator value were set to 0.15 and 0.1, respectively. We used gradient clipping with a maximum gradient norm of 2. At the time of decoding, the summaries were produced through a beam search of size 8. The hyper-parameter settings were λ = 0.99, γ = 0.1, π = 2.92 on DUC-2004 and π = 1.68 on Gigaword. Training our concept pointer generator for 450k iterations yielded the best performance; we then took the optimization using RL rewards for RG-L at 95K iterations on DUC-2004 and at 50K iterations on Gigaword. We took the distant-supervised training at 5K iterations on DUC-2004 and at 6.5K iterations on Gigaword. The ROUGE evaluation options were -m -n 2 -s for the English Gigaword testing set and -n 2 -m -b 75 -s for the DUC-2004 testing set.
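For reference, the reported settings can be collected into a single configuration, together with a small helper that illustrates gradient clipping by norm. The values are those stated in the training setup; the dictionary keys and the helper are illustrative assumptions and do not reflect the structure of the released code.

```python
import math

# Hyper-parameters as reported in the training setup; key names are hypothetical.
TRAINING_CONFIG = {
    "embedding_dim": 128,
    "hidden_size": 256,
    "vocab_size": 150_000,
    "optimizer": "adagrad",
    "batch_size": 64,
    "learning_rate": 0.15,
    "adagrad_initial_accumulator": 0.1,
    "max_grad_norm": 2.0,
    "beam_size": 8,
    "lambda_rl": 0.99,
    "gamma_concept": 0.1,
    "pi_ds": {"DUC-2004": 2.92, "Gigaword": 1.68},
    "pretrain_iterations": 450_000,
}


def clip_by_norm(grads, max_norm):
    """Rescale a flat gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], TRAINING_CONFIG["max_grad_norm"])
```

Clipping leaves the gradient direction unchanged and only shortens the vector, which keeps the large Adagrad learning rate from producing unstable updates early in training.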
Baselines: The following state-of-the-art baselines were used as comparators. ABS+ (Rush et al., 2015) is a tuned ABS model with additional features. Luong-NMT (Luong et al., 2015) is a two-layer LSTM encoder-decoder. RAS-Elman (Chopra et al., 2016) is a convolution encoder and an Elman RNN decoder with attention. Seq2seq+att is a two-layer BiLSTM encoder and a one-layer LSTM decoder equipped with attention. lvt5k-lsent (Nallapati et al., 2016) uses temporal attention to keep track of the past attentive weights of the decoder and restrains the repetition in later sequences. SEASS (Zhou et al., 2017) includes an additional selective gate to control information flow from the encoder to the decoder. Pointer-generator (See et al., 2017) is an integrated pointer network and seq2seq model. We implemented this baseline without its coverage mechanism since this is not our focus. Baseline models also include two pointer-generator based extensions (Guo et al., 2018; Li et al., 2018). CGU uses a convolutional gated unit and self-attention for global encoding.

Results and Analysis
The following analysis focuses on investigating, first, whether our model is able to generate abstract and new concepts and, second, how its overall quality compares against the baselines.

Quantitative Analysis
The results are presented in Table 1. We observe that our model outperformed all the strong state-of-the-art models on both datasets in all metrics except for RG-2 on Gigaword. In terms of the pointer generator performance, the improvements made by our concept pointer are statistically significant (p < 0.01) across all metrics.
OOV and Summary Length: OOV is another major challenge for current abstractive summarization models. Although generating longer summaries or fewer UNKs is not our focus, our model still showed improvements in this regard (Table 2). We counted the number of UNKs in the generated summaries.
Abstractiveness: According to Chen and Bansal (2018), abstractiveness scores are computed as the percentage of novel n-grams in the generated summaries that are not included in the source documents. As shown in Table 3, human-written summaries receive the highest novelty in terms of abstractiveness, and our concept pointer generator comes closer to the human-written summaries than the baseline does. This result demonstrates a further advantage of our model in producing new and abstract concepts. Our model is designed to improve semantic relevance and promote higher abstraction. More generated summary examples can be found in Appendix B.
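The abstractiveness score described above can be computed with a short function. This sketch counts n-gram occurrences on whitespace tokens; whether the original evaluation counts occurrences or distinct n-grams, and how it tokenizes, is not stated, so those choices are assumptions. The example sentences are made up.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_percentage(summary_tokens, source_tokens, n):
    """Percentage of summary n-grams that never occur in the source."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    source_set = set(ngrams(source_tokens, n))
    novel = sum(1 for gram in summary_ngrams if gram not in source_set)
    return 100.0 * novel / len(summary_ngrams)


source = "317 athletes and officials left for the games".split()
summary = "the group left for the games".split()
novel_1 = novel_ngram_percentage(summary, source, 1)
novel_2 = novel_ngram_percentage(summary, source, 2)
```

In this toy pair, the only novel unigram is the concept word "group", which shows how a single conceptual substitution raises the score for every n-gram that contains it.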

Analysis on Training Strategies
To evaluate the relative impact of each training strategy with the model, we tested different combinations for comparison with each other and against the baselines.
Context-aware Conceptualization: To investigate the impact of training with both the number of concepts k and the concept update strategy mentioned in Eqs. (6) and (7), we chose a different number of concept candidates, i.e., k = 1, 2, 3, 4, 5, for the context-aware conceptualization update strategy. Performance was fully evaluated with the three ROUGE metrics, as shown in Figure 3. The results only vary slightly according to the number of concepts with the random selection strategy (Eq. (7)), as shown in Figures 3(a) and 3(b). This indicates that a random strategy is not very sensitive to the number of extracted concepts. This is, in part, because the concept pointer may or may not be able to point to the correct concepts from multiple candidates. In contrast, in Figures 3(c) and 3(d), the optimum settings are clearly apparent, i.e., k = 1 on Gigaword and k = 2 on DUC-2004. Overall, the hard assignment strategy (Eq. (6)) provided the best performance in practical terms, while random selection (Eq. (7)) performs stably with different settings.
Training with DS vs. RL: As shown in Table 1, our model with either a distant supervision strategy (concept pointer+DS) or reinforcement learning (concept pointer+RL) was superior to the basic concept pointer generator on both datasets. Further, the relative improvement of the concept pointer+DS over the concept pointer+RL ranged from 3.5% to 9.6% on DUC-2004, but the concept pointer+DS was inferior to the concept pointer+RL on Gigaword. In comparing the results, it is clear that DS training has a noticeable effect when the testing set is substantially semantically different from the training set but provides less improvement than RL when the two are close. From this analysis, we conclude that the DS strategy is better for model adaption with abstractive summarization.

Human Evaluations
To explore the correctness of our model using human judgment, we conducted a manual evaluation with 20 post-graduate volunteers. We primarily used the following criteria to assess the generated summaries: abstraction, i.e., Are the abstract concepts contained in the summary appropriate?; and overall quality, i.e., Is the summary readable, informative, relevant, etc.? To conduct the evaluation, we randomly selected 20 examples from the DUC-2004 testing set and asked the volunteers to subjectively assess the summaries. Each example consisted of an article and three summaries, i.e., a summary by the seq2seq model, the pointer generator model, and our proposed concept pointer model. The volunteers chose the best summaries for each of the articles according to the above criteria (multiple choices were allowed). To prevent bias, the summaries were randomly shuffled, and the model used to produce each was not revealed. The scores for each model were ranked by how many times the volunteers chose a summary w.r.t. each criterion, averaged by the number of participants. The results are presented in Table 4, which shows that our model outperformed both the seq2seq model and the pointer generator (See et al., 2017) on both criteria. As a last step, we manually inspected the summaries generated by our model, and some examples are presented in Appendix B. We found that the summaries were not as abstract as a human-written summary would likely be. The overarching tendency of the model is still to copy segments of the source text and rearrange the phrases into a summary. However, the overall approach does produce more high-level concepts with correct relations compared to the baselines, which demonstrates that our solution is a promising research direction to pursue further. Additionally, the generated summaries are long, fluent, and informative.

Conclusion
This paper presents a novel concept pointer generator model to improve the abstractive summarization model and generate concept-oriented summaries. We also propose a novel distant supervision strategy for model adaption to different datasets. Both empirical and subjective experiments show that our model makes a statistically significant quality improvement over the state-of-the-art baselines on two popular datasets.