Guided Neural Language Generation for Abstractive Summarization using Abstract Meaning Representation

Recent work on abstractive summarization has made progress with neural encoder-decoder architectures. However, such models are often challenged by their lack of explicit semantic modeling of the source document and its summary. In this paper, we extend previous work on abstractive summarization using Abstract Meaning Representation (AMR) with a neural language generation stage which we guide using the source document. We demonstrate that this guidance improves summarization results by 7.4 and 10.5 points in ROUGE-2, using gold standard AMR parses and parses obtained from an off-the-shelf parser respectively. We also find that the summarization performance with the latter parses is 2 ROUGE-2 points higher than that of a well-established neural encoder-decoder approach trained on a larger dataset.


Introduction
Abstractive summarization is the task of automatically producing a summary of a source document through paraphrasing, aggregating and/or compressing information. Recent work in abstractive summarization has made progress with neural encoder-decoder architectures (See et al., 2017; Chopra et al., 2016; Rush et al., 2015). However, these models are often challenged when they are required to combine semantic information in order to generate a longer summary (Wiseman et al., 2017). To address this shortcoming, several works have explored the use of Abstract Meaning Representation (AMR; Banarescu et al., 2013), motivated by AMR's capability to capture predicate-argument structure, which can be utilized for information aggregation during summarization.
However, the use of AMR also has its own shortcomings. While AMR is suitable for information aggregation, it ignores aspects of language such as tense and grammatical number, which are important for the natural language generation (NLG) stage that normally occurs at the end of the summarization process. Due to the lack of such information, approaches for NLG from AMR typically infer it from regularities in the training data (Pourdamghani et al., 2016; Konstas et al., 2017; Song et al., 2016; Flanigan et al., 2016), which is not suitable in the context of summarization. Consequently, the main previous work on AMR-based abstractive summarization (Liu et al., 2015) only generated a bag of words from the summary AMR graph.
In this paper, we propose an approach to guide the NLG stage in AMR-based abstractive summarization using information from the source document. Our objective is twofold: (1) to retrieve the information missing from AMR but needed for NLG, and (2) to improve the quality of the summary. We achieve this in a two-stage process: (1) estimating the probability distribution of the side information, and (2) using it to guide Luong et al. (2015)'s seq2seq model for NLG.
Our approach is evaluated using the Proxy Report section of the AMR dataset (Knight et al., 2017), which contains manually annotated document and summary AMR graphs. Using our proposed guided AMR-to-text NLG, we improve summarization results using both gold standard AMR parses and parses obtained with the RIGA parser (Barzdins and Gosko, 2016), by 7.4 and 10.5 ROUGE-2 points respectively. Our model also outperforms a strong baseline seq2seq model for summarization (See et al., 2017) by 2 ROUGE-2 points.

Related Work
Abstractive Summarization using AMR: In Liu et al. (2015)'s work, the source document's sentences were parsed into AMR graphs, which were then combined through merging, collapsing and graph expansion into a single AMR graph representing the source document. Following this, a summary AMR graph was extracted, from which a bag of concept words was obtained without attempting to form fluent text. Vilca and Cabezudo (2017) performed summary AMR graph extraction augmented with discourse-level information and the PageRank algorithm (Page et al., 1998). For text generation, they used a rule-based syntactic realizer (Gatt and Reiter, 2009), which requires substantial human input to perform adequately.
Seq2seq using Side Information: In the Neural Machine Translation (NMT) field, recent work (Zhang et al., 2018) explored modifications to the decoder of seq2seq models to improve translation results. They used a search engine to retrieve sentences and their translations (referred to as translation pieces) that have high similarity with the source sentence. When n-grams from the source sentence were found in the translation pieces, their presence was rewarded during the decoding process through a scoring mechanism calculating the similarity between the source sentence and the source side of the translation pieces. Zhang et al. (2018) reported improvements in translation results of up to 6 BLEU points over their seq2seq NMT baseline. In this paper we use the same principle and reward n-grams that are found in the source document during the AMR-to-text generation process. However, we use a simpler approach, relying on a probabilistic language model in the scoring mechanism.

Guiding NLG for AMR-based summarization
We first briefly describe the AMR-based summarization method of Liu et al. (2015) and then our guided NLG approach.

AMR-based summarization
In Liu et al. (2015)'s work, each sentence of the source document was parsed into an AMR graph, and these were combined into a source graph G = (V, E), where v ∈ V are the unique concepts and e ∈ E the relations between pairs of concepts. They then extracted a summary graph G' via the following sub-graph prediction:

G' = argmax_{G' ⊆ G} [ Σ_{v ∈ V'} w_v · f(v) + Σ_{e ∈ E'} w_e · f(e) ],

where f(v) and f(e) are the feature representations of node v and edge e respectively, and w_v and w_e are learned weight vectors. The final summary produced was a bag of concept words extracted from G'. It is this output that we replace with our proposed guided NLG.
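As an illustration, the sub-graph prediction above amounts to scoring candidate sub-graphs with linear models over node and edge features. The sketch below is a hypothetical stand-in (the feature functions and weight vectors are our own, not Liu et al.'s actual model):

```python
# Score a candidate summary sub-graph G' = (nodes, edges) as a sum of
# weighted node features and weighted edge features. In Liu et al. (2015)
# the summary graph is the highest-scoring sub-graph under structural
# constraints; here we only sketch the scoring function itself.

def dot(w, f):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f))

def score_subgraph(nodes, edges, f_node, f_edge, w_node, w_edge):
    """Linear score of a candidate sub-graph (illustrative only)."""
    return (sum(dot(w_node, f_node(v)) for v in nodes)
            + sum(dot(w_edge, f_edge(e)) for e in edges))
```

In the full model this score would be maximized over all valid sub-graphs (an ILP in the original work); the sketch only shows the objective being maximized.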

Unguided NLG from AMR
Our baseline is a standard (unguided) seq2seq model with attention (Luong et al., 2015), which consists of an encoder and a decoder. The encoder computes the hidden representation of the input, {z_1, z_2, ..., z_k}, which is the linearized summary AMR graph G' from Liu et al. (2015), following Van Noord and Bos (2017)'s preprocessing steps. Following this, the decoder generates the target words, {y_1, y_2, ..., y_m}, using the conditional probability

P_s2s(y_j | y_<j, z) = softmax(W_s h̃_j),

where the attentional hidden state h̃_j is calculated as

h̃_j = tanh(W_c [c_j ; h_j]),

where c_j is the source context vector and h_j is the target RNN hidden state. The source context vector is defined as the weighted average over all the source RNN hidden states h̄_s, given the alignment vector a_j:

c_j = Σ_s a_j(s) h̄_s,  a_j(s) = exp(score(h_j, h̄_s)) / Σ_{s'} exp(score(h_j, h̄_{s'})).

Guided NLG from AMR
To guide the decoder with the source document, we first prune the document by computing the Longest Common Subsequence (LCS) between the linearized AMR parses of its sentences and the summary AMR graph. We keep the top-k sentences sorted by LCS length. To distinguish this pruned document from the source document, we refer to the former as side information.
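A minimal sketch of the LCS-based pruning that produces the side information, assuming token-level LCS over the linearized AMR strings (the granularity of the matching is our assumption):

```python
# Rank source sentences by the length of the Longest Common Subsequence
# (LCS) between their linearized AMR parse and the linearized summary AMR
# graph, then keep the top-k as "side information".

def lcs_length(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming LCS length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def select_side_info(sentence_amrs, summary_amr, k):
    """Keep the k linearized sentence AMRs with the longest LCS
    against the linearized summary AMR."""
    ranked = sorted(sentence_amrs,
                    key=lambda s: lcs_length(s.split(), summary_amr.split()),
                    reverse=True)
    return ranked[:k]
```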
Our aim is to combine P_s2s with the probability distribution estimated using words in the side information, P_side, in order to score each word given its context during decoding. We estimate P_side as the linear interpolation of 2-gram to 4-gram probabilities:

P_side(x_j | x_{j-3}^{j-1}) = Σ_{i=2}^{4} λ_i P_LM(x_j | x_{j-i+1}^{j-1}),

where x_j is a word occurring in the side information document and P_LM is an N-gram LM estimated using Maximum Likelihood:

P_LM(x_j | x_{j-i+1}^{j-1}) = count(x_{j-i+1}^{j}) / count(x_{j-i+1}^{j-1}),

and λ_i is defined as

λ_i = θ^i / Σ_{n=2}^{4} θ^n,

where θ is a hyper-parameter that we tune using the dev dataset during the experiments.
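The interpolated language model can be sketched as follows. The weighting λ_i ∝ θ^i is our reading of the interpolation scheme, and the helper names are our own:

```python
# Estimate P_side as an interpolation of 2- to 4-gram maximum-likelihood
# probabilities over the side-information text. The weights lambda_i are
# proportional to theta**i, so larger theta favours higher-order n-grams
# (an assumption about the exact scheme).
from collections import Counter

def build_counts(tokens, max_n=4):
    """Count all n-grams up to order max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def p_side(word, history, counts, theta=2.5, max_n=4):
    """Interpolated n-gram probability of `word` given up to 3 history tokens."""
    weights = {n: theta ** n for n in range(2, max_n + 1)}
    z = sum(weights.values())
    prob = 0.0
    for n in range(2, max_n + 1):
        ctx = tuple(history[-(n - 1):]) if len(history) >= n - 1 else None
        if ctx and counts[ctx] > 0:
            # Maximum-likelihood n-gram estimate: count(ctx + word) / count(ctx)
            prob += (weights[n] / z) * counts[ctx + (word,)] / counts[ctx]
    return prob
```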
Lastly, we combine the probability distribution of the decoder, P_s2s, with that provided by the side information, P_side, as follows:

s(y_j | y_<j, z) = ψ a + (1 - ψ) b,  (8)

where ψ is a hyper-parameter determining the influence of the side information on the decoding process, a is P_s2s(y_j | y_<j, z) and b is P_side(y_j | y_{j-3}^{j-1}). s(y_j | y_<j, z) replaces P_s2s(y_j | y_<j, z) during beam search for all words that occur in the side information. The intuition behind Eq. 8 is that we reward a word y_j when it appears in a similar context in the side information, i.e. the source document being summarized.
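A sketch of how the combined score could replace the decoder probability during beam search. The linear interpolation rule and the `rescore` helper are illustrative assumptions:

```python
# During beam search, words that occur in the side information get the
# interpolated score psi * P_s2s + (1 - psi) * P_side; all other words
# keep their plain decoder probability.

def rescore(p_s2s, p_side, side_vocab, psi=0.95):
    """p_s2s / p_side: dicts mapping words to probabilities at this step;
    side_vocab: set of words appearing in the side information."""
    scores = {}
    for w, p in p_s2s.items():
        if w in side_vocab:
            scores[w] = psi * p + (1.0 - psi) * p_side.get(w, 0.0)
        else:
            scores[w] = p
    return scores
```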

Experiments
We conduct experiments in order to answer the following questions about our proposed approach: (1) Is our baseline model comparable with state-of-the-art AMR-to-text approaches? (2) Does the guidance from the source document improve the result of AMR-to-text in the context of summarization? (3) Does the improvement in AMR-to-text hold when we use the generator for abstractive summarization using AMR? We answer each of these in the following paragraphs.

Model                               BLEU
Our model (unguided NLG)            21.1
NeuralAMR (Konstas et al., 2017)    22.0
TSP (Song et al., 2016)             22.4
TreeToStr (Flanigan et al., 2016)   23.0

Table 1: Results for AMR-to-text generation.
AMR-to-Text baseline comparison We compare our baseline model (described in §3.2) against previous work on AMR-to-text generation using the data from the recent SemEval-2016 Task 8 (May, 2016; LDC2015E86). Table 1 reports BLEU scores comparing our model against previous work.
Here, we see that our model achieves a BLEU score comparable with the state of the art, and thus we argue that it is sufficient to be used in our subsequent experiments with guidance.
Guided NLG for AMR-to-Text In this experiment we apply the guided NLG mechanism described in §3.3 to our baseline seq2seq model. To isolate the effects of guidance, we skip the actual summarization process and directly generate the summary text from the gold standard summary AMR graphs of the Proxy Report section. To determine the hyper-parameters, we perform a grid search on the dev dataset, where we found the best combination of ψ, θ and k to be 0.95, 2.5 and 15 respectively. We have two different settings for this experiment: the oracle and the non-oracle setting. In the oracle setting, we directly use the gold standard summary text as the guidance for our model. The intuition is that in this setting our model knows precisely which words should appear in the summary text, thus providing an upper bound for the performance of our guided NLG approach. In the non-oracle setting, we use the mechanism described in §3.3. We also compare both against the baseline (unguided) model from §3.2. Table 2 reports performance for all models. The difference between the guided and the unguided model is 16.2 BLEU points and 9.9 ROUGE-2 points, while there is still room for improvement, as evidenced by the difference between the oracle and non-oracle results.
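The hyper-parameter tuning above can be sketched as a plain grid search; the candidate grids and the `evaluate_rouge2` scorer below are hypothetical stand-ins for the actual tuning setup:

```python
# Grid search over psi (side-information weight), theta (n-gram
# interpolation) and k (number of side-information sentences), keeping
# the combination with the best dev-set score.
import itertools

def grid_search(evaluate_rouge2):
    """evaluate_rouge2(psi=..., theta=..., k=...) -> dev-set score."""
    best_params, best_score = None, float("-inf")
    for psi, theta, k in itertools.product(
            [0.85, 0.90, 0.95], [1.5, 2.0, 2.5], [5, 10, 15]):
        score = evaluate_rouge2(psi=psi, theta=theta, k=k)
        if score > best_score:
            best_params, best_score = (psi, theta, k), score
    return best_params, best_score
```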
Guided NLG for full summarization In this experiment we combine our guided NLG model with Liu et al. (2015)'s work in order to generate fluent texts from their summary AMR graphs, using the hyper-parameters tuned in the previous paragraph. Liu et al. (2015) used parses from both the manual annotation of the Proxy dataset as well as those obtained using the JAMR parser (Flanigan et al., 2014).[1] Instead of JAMR, we use the RIGA parser (Barzdins and Gosko, 2016). In Table 3, we can see that our approach results in improvements over both the unguided AMR-to-text and the standard seq2seq summarization.[2] One interesting note is that using the RIGA parses results in higher ROUGE scores than the gold parses for the guided model in our experiment. This phenomenon was also observed in Liu et al. (2015)'s experiment, where the summary graphs extracted from automatic parses had higher accuracy than those extracted from manual parses. We hypothesize that this can be attributed to how the AMR dataset is annotated, as there might be discrepancies in different annotators' choices of AMR concepts and relations for sentences with similar wording. In contrast, the AMR parsers introduce errors, but they are consistent in their choices of AMR concepts and relations. The discrepancies in the manual annotation could have impacted the performance of the AMR summarizer we use more negatively than the noise introduced by the AMR parsing errors.

[1] We were able to obtain comparable AMR summarization subgraph prediction to their reported results using their published software, but could not match their bag-of-words generation results.
[2] We use the OpenNMT-pytorch implementation https://github.com/OpenNMT/OpenNMT-py and a pre-trained model downloaded from http://opennmt.net/OpenNMT-py/Summarization.html, which achieves a higher result than See et al. (2017)'s summarizer.
In Table 4, we show sample summaries from the different models, where we can see that our guided model improves over the unguided model by correcting a wrong word ("a softening") into a correct one ("airstrikes") and introducing a better-suited word from the source document ("georgian" instead of "georgia 's"). We also evaluated manually by asking human evaluators to judge the sentences' fluency (grammaticality and naturalness) on a scale of 1 (worst) to 6 (best) for the guided and unguided models (see Table 5). While the manual evaluation shows improvement over the unguided model, grammatical mistakes and redundant repetition in the generated text remain major problems in our AMR generation (see Table 6).

Guided NLG model output | Problem
the soldiers were injured when a attempt to defuse the bombs . | grammatical mistake
on 20 october 2002 the state -run radio nepal reported on 20 october 2002 that at the evening -run radio nepal reported on 20 october 2002 that the guerrillas were killed and killed . | redundant repetition

Table 6: Problems in the guided model's summaries.

Conclusion and Future Work
In this paper we proposed a guided NLG approach that substantially improves the output of AMR-based summarization. Our approach uses a simple guiding process based on a probabilistic language model. In future work we aim to improve summarization performance by jointly training the guiding process with the AMR-based summarization process.