Improving Latent Alignment in Text Summarization by Generalizing the Pointer Generator

Pointer generators have been the de facto standard for modern summarization systems. However, this architecture faces two major drawbacks: Firstly, the pointer is limited to copying exact words and ignores possible inflections or abstractions, which restricts its capacity to capture richer latent alignments. Secondly, the copy mechanism results in a strong bias towards extractive generation, where most sentences are produced by simply copying from the source text. In this paper, we address these problems by allowing the model to "edit" pointed tokens instead of always hard-copying them. The editing is performed by transforming the pointed word vector into a target space with a learned relation embedding. On three large-scale summarization datasets, we show the model is able to (1) capture more latent alignment relations than exact word matches, (2) improve word alignment accuracy, allowing for better model interpretation and control, (3) generate higher-quality summaries, as validated by both qualitative and quantitative evaluations, and (4) bring more abstraction to the generated summaries.


Introduction
Modern state-of-the-art (SOTA) summarization models are built upon the pointer generator architecture (See et al., 2017). At each decoding step, the model generates a sentinel to decide whether to sample words based on the neural attention (generation mode) or to directly copy from an aligned source context (point mode) (Gu et al., 2016; Merity et al., 2017). Though outperforming vanilla attention models, the pointer generator only captures exact word matches. As shown in Fig. 1, in abstractive summarization there exist a large number of syntactic inflections (escalated → escalates) or semantic transformations (military campaign → war), where the target word has an explicit grounding in the source context but changes its surface form. In standard pointer generators, these words are not covered by the point mode. This largely restricts the application of the pointer generator, especially on highly abstractive datasets where only a few words are exactly copied. Moreover, the hard copy operation biases the model towards extractive summarization, which is undesirable for generating more human-like summaries (Kryściński et al., 2018).

Figure 1: Alignment visualization of our model when decoding "closes". The posterior alignment is more accurate for model interpretation. In contrast, the prior alignment probability is spread over "announced" and "closure", which can be manually controlled to generate desired summaries. Decoded samples are shown when aligned to "announced" and "closure" respectively. Highlighted source words are those that can be directly aligned to a target token in the gold summary.
To solve this problem, we propose Generalized Pointer Generator (GPG) which replaces the hard copy component with a more general soft "editing" function. We do this by learning a relation embedding to transform the pointed word into a target embedding. For example, when decoding "closes" in Figure 1, the model should first point to "closure" in the source, predict a relation to be applied (noun → third person singular verb), then transform "closure" into "closes" by applying the relation transformation. The generalized point mode is encouraged to capture such latent alignment which cannot be identified by the standard pointer generator.
This improved alignment modelling is intriguing in that (a) people can better control the generation by manipulating the alignment trajectory, (b) the posterior alignment can be inferred by Bayes' theorem (Shankar and Sarawagi, 2019) to provide a better tool for interpretation, and finally (c) explicitly capturing the alignment relation should improve generation performance. (Figure 1 shows an example of how latent alignment can improve controllability and interpretation. Pointer generators fail to model such alignment relations that are not exact copies.) To eliminate the OOV problem, we utilize byte-pair encoding (BPE) segmentation (Sennrich et al., 2016) to split rare words into sub-units, which has seen very few applications in summarization so far (Fan et al., 2018; Kiyono et al., 2018), though it is a common technique in machine translation (Wu et al., 2016; Vaswani et al., 2017; Gehring et al., 2017).
Our experiments are conducted on three summarization datasets: CNN/Dailymail (Hermann et al., 2015), English Gigaword (Rush et al., 2015) and XSum (Narayan et al., 2018) (a newly collected corpus for extreme summarization). We further perform human evaluation and examine the word alignment accuracy on the manually annotated DUC 2004 dataset. Overall we find our model provides the following benefits:
1. It can capture richer latent alignment and improve the word alignment accuracy, enabling better controllability and interpretation.
2. The generated summaries are more faithful to the source context because of the explicit alignment grounding.
3. It improves the abstraction of generations because our model allows editing the pointed token instead of always copying it exactly.

(Footnote: The induced alignment offers useful annotations for people to identify the source correspondence for each target word. News editors can post-edit machine-generated summaries more efficiently with such annotation. For summary readers, it also helps them track back the source context when they are interested in some specific details. The derivation for inducing the posterior alignment is in Appendix A.1.)
In the next section, we will first go over the background, then introduce our model and finally present the experiment results and conclusion.

Background
Let X, Y denote a source-target pair, where X corresponds to a sequence of words x_1, x_2, ..., x_n and Y is its corresponding summary y_1, y_2, ..., y_m. In this section, we introduce two baseline models for automatically generating Y from X:

Seq2seq with Attention
In a seq2seq model, each source token x_i is encoded into a vector h_i. At each decoding step t, the decoder computes an attention distribution a_t over the encoded vectors based on the current hidden state d_t (Bahdanau et al., 2015):

    a_{t,i} = softmax_i( f(h_i, d_t) )    (1)

f is a score function to measure the similarity between h_i and d_t. The context vector c_t and the probability of the next token are computed as below:

    c_t = Σ_i a_{t,i} h_i,    y*_t = L [d_t • c_t],    p(y_t | y_{<t}, X) = softmax( W y*_t )    (2)
• means concatenation and L, W are trainable parameters. We tie the parameters of W and the word embedding matrix as in Press and Wolf (2017); Inan et al. (2017). Namely, a target vector y*_t is predicted, and words having a higher inner product with y*_t will have a higher probability.
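As a minimal numpy sketch of the two steps above (the bilinear form of the score function f, the shapes, and the variable names here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(H, d_t, W_f):
    """One decoding step of attention (Eq. 1).
    H: (n, h) encoded source states h_i; d_t: (h,) decoder state.
    W_f parameterizes a bilinear score function f (an assumption here)."""
    a_t = softmax(H @ W_f @ d_t)   # attention distribution over source tokens
    c_t = a_t @ H                  # context vector: weighted sum of the h_i
    return a_t, c_t

def generation_prob(d_t, c_t, L, E):
    """Generation mode (Eq. 2): predict a target vector y*_t = L[d_t ; c_t],
    then score it against the tied embedding matrix E by inner product."""
    y_star = L @ np.concatenate([d_t, c_t])
    return softmax(E @ y_star)     # p(y_t) over the whole vocabulary
```

Because W is tied to the embedding matrix, the generation mode reduces to an inner-product lookup of the predicted target vector, exactly as described above.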

Pointer Generator
The pointer generator extends the seq2seq model to support copying source words (Vinyals et al., 2015). At each time step t, the model first computes a generation probability p_gen ∈ [0, 1] by:

    p_gen = σ( MLP_g([d_t • c_t]) )    (3)

σ is the sigmoid function and MLP_g is a learnable multi-layer perceptron. p_gen is the probability of enabling the generation mode instead of the point mode. In the generation mode, the model computes the probability over the whole vocabulary as in Eq. 2. In the point mode, the model computes which source word to copy based on the attention distribution a_t from Eq. 1. The final probability is marginalized over a_{t,i}:

    p(y_t | y_{<t}, X) = p_gen p(y_t | y_{<t}, X, gen) + (1 − p_gen) Σ_i a_{t,i} δ(y_t | x_i)    (4)

where δ(y_t | x_i) equals 1 if y_t is the same word as x_i and 0 otherwise. If we knew exactly which mode each word comes from, e.g., by assuming all co-occurring words are copied, then the marginalization could be omitted (Gulcehre et al., 2016; Wiseman et al., 2017), but normally p_gen is treated as a latent variable (Gu et al., 2016; See et al., 2017).
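The mixture in Eq. 4 can be sketched as follows (a hedged illustration with toy inputs; `src_ids` maps each source position i to the vocabulary id of x_i):

```python
import numpy as np

def pointer_generator_prob(p_vocab, a_t, src_ids, p_gen):
    """Final distribution of a standard pointer generator (Eq. 4):
    p(y) = p_gen * p_vocab(y) + (1 - p_gen) * sum_i a_{t,i} * [y == x_i]."""
    p = p_gen * np.asarray(p_vocab, dtype=float)
    for a_ti, x_i in zip(a_t, src_ids):
        p[x_i] += (1.0 - p_gen) * a_ti   # hard copy: mass only on the exact word
    return p
```

Note that the point-mode term can only add probability to words that literally appear in the source, which is the restriction the next section removes.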

Generalized Pointer Generator (GPG)
As seen in Eq. 4, δ(y_t|x_i) is a 0-1 event that is only turned on when y_t is exactly the same word as x_i. This restricts the expressiveness of the point mode, preventing it from paying attention to inflections, POS transitions or paraphrases. This section explains how we generalize pointer networks to cover these conditions.

Redefine δ(y_t|x_i): We extend δ(y_t|x_i) by defining it as a smooth probability distribution over the whole vocabulary. It allows the pointer to edit x_i to a different word y_t instead of simply copying it. Following Eq. 2, we derive δ(y_t|x_i) by first predicting a target embedding y*_{t,i}, then applying the softmax. The difference is that we derive y*_{t,i} as the summation of the pointed word embedding x⃗_i and a relation embedding r(d_t, h_i):

    y*_{t,i} = x⃗_i + r(d_t, h_i),    δ(y_t | x_i) = softmax( W y*_{t,i} )    (5)

x⃗_i denotes the embedding of the word x_i. r(d_t, h_i) can be any function conditioning on d_t and h_i, which we parameterize with a multi-layer perceptron in our experiments. The computation of y*_{t,i} is similar to the classical TransE model (Bordes et al., 2013), where an entity vector is added to a relation embedding to translate into the target entity. The intuition is straightforward: after pointing to x_i, humans usually first decide which relation should be applied (inflection, hypernym, synonym, etc.) based on the context [d_t, h_i], then transform x_i into the proper target word y_t. Using an additive transformation is backed by the observation that vector differences often reflect meaningful word analogies (Mikolov et al., 2013; Pennington et al., 2014) ("man" − "king" ≈ "woman" − "queen") and that they are effective at encoding a great number of word relations like hypernymy, meronymy and morphological changes (Vylomova et al., 2016; Hakami et al., 2018; Allen and Hospedales, 2019). These word relations reflect most alignment conditions in text summarization.
For example, humans often change the source word to its hypernym (boy → child), make it more specific (person → man) or apply morphological transformations (liked → like). Therefore, we assume δ(y_t|x_i) can be well modelled by first predicting a relation embedding to be applied, then adding it to x⃗_i. If x_i should be exactly copied, as in standard pointer generators, the relation embedding is a zero vector, meaning an identity transition. We also tried applying more complex transitions to x⃗_i, like diagonal mapping (Trouillon et al., 2016), but did not observe improvements. Another option is to estimate y*_{t,i} directly from the context [d_t, h_i] without the additive grounding on x⃗_i. However, this leads to poor alignment and a performance drop because y_t is not explicitly grounded on x_i. A comparison of different choices can be found in Appendix A.4. In this paper, we stick to Eq. 5 to compute δ(y_t|x_i).
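A minimal sketch of the generalized point mode in Eq. 5 (the embedding matrix and the relation vector are placeholders here; in the model, r is produced by a multi-layer perceptron over [d_t, h_i], which is not shown):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gpg_delta(E, x_i, r):
    """Generalized point mode (Eq. 5): shift the pointed word's embedding
    by a relation embedding r = r(d_t, h_i), then softmax against the tied
    embedding matrix E. r = 0 reduces to a soft version of exact copy."""
    y_star = E[x_i] + r
    return softmax(E @ y_star)
```

With a non-zero r, the same pointer can put its mass on an inflected or paraphrased form of x_i rather than on the literal source word.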
Estimate Marginal Likelihood: Putting Eq. 5 back into Eq. 4, the exact marginal likelihood is too expensive for training. The complexity grows linearly with the source text length n, and each computation of δ(y_t|x_i) requires a separate softmax operation. One option is to approximate it by sampling, as in hard attention models (Xu et al., 2015; Deng et al., 2017), but the training becomes challenging due to the non-differentiable sampling process. In our work, we take an alternative strategy of marginalizing only over the k most likely aligned source words. This top-k approximation is widely adopted when the target distribution is expected to be sparse and only a few modes dominate (Britz et al., 2017; Ke et al., 2018; Shankar et al., 2018). We believe this is a valid assumption in text summarization since most source tokens have a vanishingly small probability of being transferred into a target word.

Figure 2: Architecture of the generalized pointer. The same encoder is applied to encode the source and target. When decoding "closes", we first find the top-k source positions with the most similar encoded states. For each position, the decoding probability is computed by adding its word embedding and a predicted relation embedding.
For each target word, how to determine the k most likely aligned source words is crucial. An ideal system should always include the gold aligned source word in the top-k selection. We tried several methods and found the best performance is achieved by encoding each source/target token into a vector, then choosing the k source words that are closest to the target word in the encoded vector space. The closeness is measured by the vector inner product. The encoded vector space serves as a contextualized word embedding (McCann et al., 2017; Peters et al., 2018). Intuitively, if a target word can be aligned to a source word, they should have similar semantic meaning and surrounding context, and thus similar contextualized word embeddings. The new objective is then defined as in Eq. 6:

    p(y_t | y_{<t}, X) ≈ p_gen p(y_t | y_{<t}, X, gen) + (1 − p_gen) Σ_{i ∈ top-k} a_{t,i} δ(y_t | x_i)    (6)

e(y_t) is the encoded vector for y_t, and top-k denotes the k source positions whose encoded vectors have the largest inner product with e(y_t). The marginalization is performed only over the k chosen source words. Eq. 6 is a lower bound of the data likelihood because it only marginalizes over a subset of X. In general, a larger k tightens the bound and gives a more accurate estimation; we analyze the effect of k in Section 5.2. Note that the only extra parameters introduced by our model are those of the multi-layer perceptron computing the relation embedding r(d_t, h_i). The marginalization in Eq. 6 can also be efficiently parallelized. An illustration of the generalized pointer is in Figure 2.
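The top-k approximation can be sketched as below (illustrative numpy; `delta_y[i]` stands for a precomputed scalar δ(y_t|x_i)). Because every term in the point-mode sum is non-negative, enlarging k can only tighten the lower bound:

```python
import numpy as np

def topk_positions(H, e_y, k):
    """Choose the k source positions whose encoder states have the largest
    inner product with the encoded target vector e(y_t)."""
    return np.argsort(-(H @ e_y))[:k]

def approx_marginal(p_gen, p_vocab_y, a_t, delta_y, topk_idx):
    """Top-k lower bound of Eq. 6: marginalize the point mode only over
    the selected positions instead of all n source tokens."""
    point = sum(a_t[i] * delta_y[i] for i in topk_idx)
    return p_gen * p_vocab_y + (1.0 - p_gen) * point
```

With k = n this recovers the exact marginal of Eq. 4; smaller k trades a looser bound for fewer softmax evaluations per step.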

Related Work
Neural attention models (Bahdanau et al., 2015) with the seq2seq architecture (Sutskever et al., 2014) have achieved impressive results in text summarization tasks. However, the attention vector comes from a weighted sum of source information and does not model the source-target alignment in a probabilistic sense. This makes it difficult to interpret or control model generations through the attention mechanism. In practice, people do find the attention vector is often blurred and suffers from poor alignment (Koehn and Knowles, 2017; Kiyono et al., 2018; Jain and Wallace, 2019). Hard alignment models, on the other hand, explicitly model the alignment relation between each source-target pair. Though theoretically sound, hard alignment models are difficult to train. Exact marginalization is only feasible for data with limited length (Yu et al., 2016; Aharoni and Goldberg, 2017; Backes et al., 2018), or by assuming a simple copy generation process (Vinyals et al., 2015; Gu et al., 2016; See et al., 2017). Our model can be viewed as a combination of soft attention and hard alignment, where a simple top-k approximation is used to train the alignment part (Shankar et al., 2018; Shankar and Sarawagi, 2019). The hard alignment generation probability is designed as a relation summation operation to better fit the summarization task. In this way, the generalized copy mode acts as a hard alignment component to capture direct word-to-word transitions. In contrast, the generation mode is a standard soft-attention structure that models only words that are purely functional, or that need fusion or high-level inference and can hardly be aligned to any specific source context (Daumé III and Marcu, 2005).

Experiments and Results
In the experiment, we compare seq2seq with attention, standard pointer generators and the proposed generalized pointer generator (GPG). To further analyze the effect of the generalized pointer, we implement a GPG model with only the point mode (GPG-ptr) for comparison. We first introduce the general setup, then report the evaluation results and analysis.

General Setup
Dataset: We perform experiments on the CNN/Dailymail (Hermann et al., 2015), English Gigaword (Rush et al., 2015) and XSum (Narayan et al., 2018) datasets. We put statistics of the datasets in Appendix A.2. CNN/DM contains online news with multi-sentence summaries (we use the non-anonymized version from See et al. (2017)). English Gigaword pairs the first sentence of news articles with their headlines. The XSum corpus provides a single-sentence summary for each BBC long story. We pick these three datasets as they have different properties that allow us to compare models. CNN/DM strongly favors extractive summarization (Kryściński et al., 2018). Gigaword has more one-to-one direct word mappings (with simple paraphrasing) (Napoles et al., 2012), while XSum needs to perform more information fusion and inference since the source is much longer than the target (Narayan et al., 2018).
Model: We use single-layer bi-LSTM encoders for all models. For comparison, hidden layer dimensions are set the same as in Zhou et al. (2017) for Gigaword and See et al. (2017) for CNN/DM and XSum. We train with batch size 256 for Gigaword and 32 for the other two. The vocabulary size is set to 30k for all datasets. Word representations are shared between the encoder and decoder. We tokenize words with WordPiece segmentation (Wu et al., 2016) to eliminate the OOV problem. More details are in Appendix A.3.

Inference: We decode text using beam search, and prevent repeated generations during decoding. GPG models use exact marginalization for testing and decoding, while for training and validation we use the top-k approximation mentioned above. The decoder first decodes sub-word ids, then maps them back to normal sentences. All scores are reported on the word level and are thus comparable with previous results. When computing scores for multi-sentence summaries, the generations are split into sentences with the NLTK sentence tokenizer.

Results and Analysis
The results are presented in the following order: we first study the effect of the hyperparameter k, then evaluate model generations by automatic metrics and look into the generations' level of abstraction. Finally, we report the human evaluation and word alignment accuracy.

Effect of k: k is the only hyperparameter introduced by our model. Figure 3 visualizes the effect of k on the test perplexity. As mentioned in Sec 3, a larger k is expected to tighten the estimation bound and improve the performance. The figure shows the perplexity generally decreases as k increases. The effect on Gigaword and XSum saturates at k = 6 and 10 respectively, so we fix these values of k for later experiments. For the longer dataset CNN/Dailymail, the perplexity might still decrease a bit afterwards, but the improvement is marginal, so we set k = 14 due to the memory limit.
Automatic Evaluation: Accuracy is evaluated by the standard ROUGE metric (Lin, 2004) and the word perplexity on the test data. We report the ROUGE-1, ROUGE-2 and ROUGE-L F-scores measured by the official script. Tables 1, 2 and 3 list the results for CNN/Dailymail, Gigaword and XSum respectively. Statistically significant results are underlined. On the top two rows of each table, we include two results taken from current state-of-the-art word-based models. They are incomparable with our model because of the different vocabulary, training and decoding processes, but we report them for completeness. Lower rows are results from our implemented models. Pointer generators bring only slight improvements over the seq2seq baseline. This suggests that after eliminating the OOV problem, the naive seq2seq with attention model can already implicitly learn most copy operations by itself. GPG models outperform seq2seq and pointer generators on all datasets. The improvement is more significant on the more abstractive corpora Gigaword and XSum, indicating our model is effective at identifying more latent alignment relations.
Notably, even the pure pointer model (GPG-ptr) without the generation mode outperforms standard pointer generators on CNN/DM and Gigaword, implying most target tokens can be generated by aligning to a specific source word. (Results on the Gigaword test set are not significant due to the small test size of 1951 article-summary pairs.) The finding is consistent with previous research claiming CNN/DM summaries are largely extractive (Zhang et al., 2018; Kryściński et al., 2018). Though the Gigaword headline dataset is more abstractive, most words are simple paraphrases of some specific source word, so the pure pointer GPG-ptr works well. This is different from the XSum story summarization dataset, where many target words require high-level abstraction or inference and cannot be aligned to a single source word, so combining the point and generation modes is necessary for good performance.

Table 3: ROUGE scores on XSum. * marks results from Narayan et al. (2018). Underlined values are significantly better than Point.Gen. with p = 0.05.
The word perplexity results are consistent over all datasets (GPG < GPG-ptr < Point.Gen. < seq2seq). A reduction in perplexity does not necessarily indicate an increase in ROUGE score, especially for pointer generators. This might be attributed to the different probability computation of pointer generators, where the probability of copied words is normalized only over the source words. This gives them an inherent advantage over other models, where the normalization is over the whole 30k vocabulary.
Level of Abstraction: In Tab. 4, we look into how abstractive the generated summaries are by calculating the proportion of novel unigrams, bigrams and trigrams that do not exist in the corresponding source text. On CNN/DM, as the generations contain multiple sentences, we further report the proportion of novel sentences (obtained with NLTK sent_tokenize).
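The novel n-gram proportion can be computed as follows (a sketch of the standard computation; whitespace tokenization is assumed for the toy example):

```python
def novel_ngram_ratio(source_tokens, summary_tokens, n):
    """Proportion of summary n-grams that never appear in the source,
    the abstraction measure reported in Tab. 4."""
    src = {tuple(source_tokens[i:i + n])
           for i in range(len(source_tokens) - n + 1)}
    summ = [tuple(summary_tokens[i:i + n])
            for i in range(len(summary_tokens) - n + 1)]
    if not summ:
        return 0.0
    return sum(g not in src for g in summ) / len(summ)
```

A purely extractive summary scores 0 at every n, while heavy paraphrasing pushes the ratio up, especially for higher-order n-grams.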
Tab. 4 reflects a clear difference in the level of abstraction (seq2seq > GPG > GPG-ptr > Point.Gen.). Though the seq2seq baseline generates the most novel words, many of them are hallucinated facts (see Fig 6), as has also been noted in See et al. (2017). The abstraction of the GPG model is close to seq2seq and much higher than copy-based pointer generators. We believe this comes from its ability to "edit" pointed tokens rather than simply copying them.

Table 4: Proportion of novel n-grams (NN-1,2,3) and sentences (NN-S) in generated summaries. GPG generates more novel words than standard pointer generators, though still slightly fewer than seq2seq.

Figure 4: Pointing ratio of the standard pointer generator and GPG (evaluated on the test data). GPG enables the point mode more often, but quite a few pointed tokens are edited rather than simply copied.
To examine the pointing behavior of the GPG model, we visualize the average pointing ratio on the three datasets in Fig. 4. The pointing ratio can be considered the chance that a word is generated from the point mode instead of the generation mode. We compute it as (1 − p_gen) Σ_i a_{t,i} δ(y_t|x_i) / p(y_t), averaged over all target tokens in the test data. For the GPG model, we further split it into the copy ratio (words that are exactly copied) and the editing ratio (words that are edited). We find the GPG model enables the point mode more frequently than standard pointer generators, especially on the Gigaword dataset (40% more). This also explains why a pure pointer model is more effective for Gigaword and CNN/DM: more than 60% of target tokens can be generated from the point mode, while for XSum the ratio is less than 40%. Quite a few pointing operations involve text rewriting (green bar in Fig. 4). This could explain why our model is able to generate more novel words.
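For a single target token, the pointing ratio above reduces to a one-liner (toy numbers for illustration; in practice these quantities come from the trained model and are averaged over the test set):

```python
import numpy as np

def pointing_ratio(p_gen, a_t, delta_y, p_vocab_y):
    """Posterior probability that target token y_t was produced by the
    point mode: (1 - p_gen) * sum_i a_{t,i} delta(y_t|x_i) / p(y_t)."""
    point = (1.0 - p_gen) * float(np.dot(a_t, delta_y))
    p_y = p_gen * p_vocab_y + point
    return point / p_y
```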
A few examples are displayed in Fig 5. We find our model frequently changes the tense (reported → report), singular/plural (death → deaths) or POS tag (jordan → jordanian) of words. Sometimes it also paraphrases (relatives → family) or abstracts a noun to its hypernym (girl → child). The word editing can go wrong, though. For example, "death row prisoner" is wrongly changed to "deaths row prisoner" in the second example, possibly because this phrase is rarely seen, so the model made an error by mistaking "death" as the main subject after "thousands more".
Human evaluation: We further perform a human evaluation to assess the model generations. We focus on evaluating fluency and faithfulness, since the ROUGE score often fails to quantify them (Schluter, 2017; Cao et al., 2018). 100 random source-target pairs are sampled from the human-written DUC 2004 data for tasks 1&2 (Over et al., 2007). Models trained on Gigaword are applied to generate corresponding summaries. The gold targets, together with the model generations, are randomly shuffled and then assigned to 10 human annotators. Each pair is evaluated by three different people and the most agreed score is adopted. Each pair is assigned a 0-1 score to indicate (1) whether the target is fluent in grammar, (2) whether the target faithfully conveys the source information without hallucination, and (3) whether the target is considered human-generated or machine-generated (like a 0/1 Turing test). The averaged scores are reported in Table 5. All models generally achieve high scores in fluency, but generations from GPG models are more faithful to the source information and thereby have a larger chance of fooling people into believing they are human-generated (an over 0.1 higher score on the 0/1 Turing test). This can be explained by GPG's capability of capturing more latent alignments. As shown in Figure 4, GPG generates over half of the target words through its point mode. Words are generated by explicitly grounding on some source context instead of being fabricated freely. Fig. 6 compares some generation snippets. As can be observed, seq2seq models tend to freely synthesize wrong facts not grounded in the source text, especially on the more difficult XSum dataset. In the last example, seq2seq only captures the subject "odom" and some keywords ("police", "basketball"), then starts to freely fabricate random facts. Pointer generators are slightly better as they are trained to directly copy keywords from the source.
However, once it enters the generation mode ("of british" in example 2 and "has been arrested" in example 3), the generation also loses control. GPG largely alleviates these problems because it can point to an aligned source word, then transform it with a learned relation embedding. The explicit alignment modelling encourages the model to stay close to the source information.

Alignment Accuracy: We also manually annotate the word alignment on the same 100 DUC 2004 pairs. Following Daumé III and Marcu (2005), words are allowed to be aligned with a specific source word, a phrase, or a "null" anchor meaning that they cannot be aligned with any source word. The accuracy is only evaluated on target words with a non-null alignment. For each target token, the most attended source word is considered the alignment (Ghader and Monz, 2017). For the pointer generator and GPG, we also induce the posterior alignment by applying Bayes' theorem (derivation in Appendix A.1). We report the alignment precision (Och and Ney, 2000) in Table 6, i.e., an alignment is considered valid if it matches one of the human-annotated ground truths.
The results show that GPG improves the alignment precision by 0.1 compared with the standard pointer generator. The posterior alignment is more accurate than the prior one (also reflected in Figure 1), enabling better human interpretation.
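The posterior alignment used here follows Bayes' rule within the point mode: q(i | y_t) ∝ a_{t,i} δ(y_t | x_i). A minimal sketch (toy numbers for illustration):

```python
import numpy as np

def posterior_alignment(a_t, delta_y):
    """Posterior over source positions given the emitted token y_t:
    q(i | y_t) proportional to a_{t,i} * delta(y_t | x_i)."""
    joint = np.asarray(a_t) * np.asarray(delta_y)
    return joint / joint.sum()
```

Conditioning on the emitted token reweights the prior attention a_t by how well each source word explains y_t, which is why the posterior in Figure 1 is sharper than the prior.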

Conclusion
In this work, we propose generalizing the pointer generator to go beyond the exact copy operation. At each decoding step, the decoder can either generate from the vocabulary, copy, or edit source words by estimating a relation embedding. Experiments on abstractive summarization show the generalized model generates more abstractive summaries that remain faithful to the source information. The generalized pointer is able to capture richer latent alignment relationships beyond exact copies. This helps improve the alignment accuracy, allowing better model controllability and interpretation.
We believe the generalized pointer mechanism could have potential applications in many fields where tokens are not exactly copied. By integrating off-the-shelf knowledge bases to explicitly model the transition relation embedding, it should further improve interpretability and might be especially helpful under low-resource settings, which we leave for future work.

Figure 6: Generation examples.

Article: (...) marseille prosecutor brice robin told cnn that " so far no videos were used in the crash investigation . " he added , " a person who has such a video needs to immediately give it to the investigators . " robin 's comments follow claims by two magazines , german daily bild and french paris match (...)
Seq2seq: marseille prosecutor brice robin tells cnn that " so far no videos were used in the crash investigation "
Point.Gen: robin 's comments follow claims by two magazines , german daily bild and french (..)
GPG: " so far no videos were used in the crash investigation , " prosecutor brice robin says (..)

Article: surviving relatives of a woman who claimed she was raped ## years ago by the british queen 's representative in australia are seeking to withdraw a lawsuit against him , after the case drew widespread publicity in australia .
Seq2seq: family of british queen 's representative in australia seeking to withdraw lawsuit against him .
Point.Gen: surviving relatives of british queen 's representative seeking to withdraw lawsuit against him .
GPG: family of woman who claimed she was victim of british queen 's representative seeks to withdraw lawsuit .

Article: police were called to love ranch brothel in crystal , nevada , after he was found unresponsive on tuesday . the american had to be driven to hospital (...) mr odom , 35 , has played basketball for (...) lakers and clippers . he (...) was suspended from the nba for violating its anti-drug policy (...) was named nba sixth man of the year (...)
Seq2seq: basketball legend odom odom has died at the age of 83 , police have confirmed .
Point.Gen: a former nba sixth man has been arrested on suspicion of anti-drug offences in the us state of california .
GPG: the american basketball association ( lakers ) star lamar odom has been found unconscious in the us state of nevada .